Monday, April 20, 2015

Creating and using Basic Auth Proxies with Scrapy

The information is on Google, but it is cumbersome to find. So here is my way to install and use private proxies with Scrapy:


On the server you want to use as a proxy:
- Install squid3 and apache2-utils
- Put this in your /etc/squid3/squid.conf:

auth_param basic program /usr/lib/squid3/ncsa_auth /etc/squid3/passwd
auth_param basic children 5
auth_param basic realm Squid proxy-caching web server
auth_param basic credentialsttl 2 hours
auth_param basic casesensitive off
acl ncsa_users proxy_auth REQUIRED
http_access allow ncsa_users
http_port 3128

- Create the password file (-c creates the file; omit it when adding more users later)
htpasswd -c /etc/squid3/passwd user
(enter the password twice when prompted)
- Restart Squid: service squid3 restart

Now, in Scrapy, add a downloader middleware (don't forget to register it under DOWNLOADER_MIDDLEWARES in settings.py):
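import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # No proxy address defined in spider? Then skip
        proxy = getattr(spider, 'proxy', None)
        if not proxy:
            return
        request.meta['proxy'] = proxy
        # Must match the user created with htpasswd on the proxy server
        proxy_user_pass = "user:<passwordhere>"
        # b64encode adds no trailing newline (encodestring did, which breaks the header)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8')).decode('ascii')
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass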
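
As a rough sketch of the registration step: assuming the project package is called myproject and the middleware lives in myproject/middlewares.py (both names are hypothetical), the settings.py entry could look like the one below. The priority 350 is also an assumption; it only needs to be lower than 750, the slot of Scrapy's built-in HttpProxyMiddleware, so that the proxy is set before that middleware runs. The small helper (basic_proxy_auth is a made-up name) shows what the Proxy-Authorization value looks like for a given user and password.

```python
import base64

# settings.py (module path and priority are assumptions, adjust to your project)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 350,
}


def basic_proxy_auth(user, password):
    """Build the value of a Proxy-Authorization header for HTTP Basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


print(basic_proxy_auth("user", "password"))  # Basic dXNlcjpwYXNzd29yZA==
```

The spider itself then only needs a proxy attribute, for example proxy = "http://ip:3128", since the middleware reads spider.proxy.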

Test your proxy with:
curl --proxy ip:3128 --proxy-user user:password ipecho.net/plain
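
If you prefer to run the same check from Python, here is a minimal sketch using only the standard library. The function names build_proxy_opener and fetch_public_ip are made up for this example, and it assumes the proxy speaks plain HTTP on the port configured above.

```python
import urllib.request
from urllib.parse import quote


def build_proxy_opener(host, port, user, password):
    """Return a urllib opener that sends requests through an authenticated HTTP proxy."""
    # Percent-encode the credentials so characters like '@' or ':' survive in the URL.
    proxy_url = "http://%s:%s@%s:%d" % (
        quote(user, safe=""), quote(password, safe=""), host, port)
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_public_ip(opener, url="http://ipecho.net/plain", timeout=10):
    """Fetch the echo URL through the proxy; the body should be the proxy's IP."""
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

Calling fetch_public_ip(build_proxy_opener("ip", 3128, "user", "password")) with your real server address and credentials should return the proxy server's IP rather than your own.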


Thanks to:
http://facts-world.blogspot.ru/2010/05/configuring-squid3-with-basic.html
https://sandalov.org/blog/1711/
http://stackoverflow.com/questions/20792152/setting-scrapy-proxy-middleware-to-rotate-on-each-request