aivarsk / scrapy-proxies
Random proxy middleware for Scrapy
License: MIT License
Hi, Aivars!
I use your random proxy middleware for Scrapy (scrapy_proxies). It works fine, thanks a lot!
First, I build list.txt (the list of proxies) by scraping a free-proxy site (without proxy rotation).
Then I scrape another site with scrapy_proxies enabled.
When I run these as two different Scrapy projects, everything works well.
When I try to run both in one Scrapy project, unfortunately, it doesn't work. Probably because in that case it tries to use list.txt for proxy rotation while the file is still empty during the request to the free-proxy site.
Is there another way to handle this?
Thank you
Hello, on Reposhub I found this setting mentioned:
'use_real_when_empty': False,
Does it work? I couldn't find the corresponding function inside the code.
When I run it locally it works fine, but when I run it in the cloud, every request returns 403 Forbidden.
I tried several different approaches.
I added the file proxylist.txt to the same folder as the project's settings, and I also uploaded it to "https://dl.dropboxusercontent.com/s/esdm19mnvz2yguf/proxylist.txt"
I substituted the name in the setting:
PROXY_LIST = 'https://dl.dropboxusercontent.com/s/esdm19mnvz2yguf/proxylist.txt'
or
PROXY_LIST = 'proxylist.txt'
or
PROXY_LIST = '/proxylist.txt'
PROXY_LIST = '../proxylist.txt'
With PROXY_LIST = 'proxylist.txt' on my PC it works like a charm, but not once I deploy it to Scrapy Cloud.
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 72, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 97, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "/usr/local/lib/python2.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python2.7/site-packages/scrapy/middleware.py", line 36, in from_settings
    mw = mwcls.from_crawler(crawler)
  File "/app/python/lib/python2.7/site-packages/scrapy_proxies/randomproxy.py", line 55, in from_crawler
    return cls(crawler.settings)
  File "/app/python/lib/python2.7/site-packages/scrapy_proxies/randomproxy.py", line 35, in __init__
    fin = open(self.proxy_list)
IOError: [Errno 2] No such file or directory: '../proxylist.txt'
Please, I need some help.
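A relative PROXY_LIST is resolved against the process working directory, which differs on Scrapy Cloud. A sketch that builds an absolute path instead (assuming proxylist.txt ships inside the project package next to settings.py):

```python
# settings.py -- build an absolute path so the file is found
# regardless of the working directory (hypothetical layout:
# proxylist.txt sits next to this settings module).
import os

PROXY_LIST = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'proxylist.txt')
```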
My proxies.txt file looks like:
http://username:password@IP:Port
http://username:password@IP1:Port
.........
There are over 100 dedicated proxies (all active), but using this library with PROXY_LIST = 'proxies.txt', all I get is the "All proxies are unusable" error.
Hi,
buddy,
does this support shadowsocks?
I looked through your code and found that the proxy meta value is set only when the proxy record contains a proxy_user_pass:
def process_request(self, request, spider):
    # Don't overwrite with a random one (server-side state for IP)
    if 'proxy' in request.meta:
        if request.meta["exception"] is False:
            return
    request.meta["exception"] = False
    if len(self.proxies) == 0:
        raise ValueError('All proxies are unusable, cannot proceed')
    if self.mode == ProxyMode.RANDOMIZE_PROXY_EVERY_REQUESTS:
        proxy_address = random.choice(list(self.proxies.keys()))
    else:
        proxy_address = self.chosen_proxy
    proxy_user_pass = self.proxies[proxy_address]
    if proxy_user_pass:
        request.meta['proxy'] = proxy_address
        basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = basic_auth
    else:
        log.debug('Proxy user pass not found')
    log.debug('Using proxy <%s>, %d proxies left' % (
        proxy_address, len(self.proxies)))
Have I missed something?
This is not an issue.
Sometimes, when we open a URL we get an HTTP 200 response but zero results, i.e. I got banned by that website.
Is there any way to force-remove such a proxy from the list?
Thank you! Any help appreciated :)
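There is no built-in hook for this, but a custom downloader middleware could treat a ban page served with HTTP 200 as a failure and drop the proxy. A sketch (the class name, marker string, and shared-dict wiring are all hypothetical; adjust the ban check to the target site):

```python
# Sketch of a custom downloader middleware that drops the current
# proxy when a 200 response looks like a ban page. Not part of
# scrapy-proxies; 'Access Denied' is a hypothetical marker.
class BanAwareProxyMiddleware:
    def __init__(self, proxies):
        # proxies: dict of proxy_address -> user_pass, shared with
        # the rotation logic
        self.proxies = proxies

    def process_response(self, request, response, spider):
        proxy = request.meta.get('proxy')
        if proxy and response.status == 200 and b'Access Denied' in response.body:
            # Remove the banned proxy and re-schedule the request so
            # it goes out through a different one
            self.proxies.pop(proxy, None)
            request.meta.pop('proxy', None)
            return request.replace(dont_filter=True)
        return response
```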
When I attempt to read the proxy from the request I get "KeyError: 'proxy'". Previously, I was able to get the IP address before using the proxies. Is there any way to get the proxy address that was used?
def parse_item(self, response):
    item = {}
    item['url'] = response.url
    item['download_latency'] = download_latency = response.request.meta['download_latency']
    item['proxy'] = response.request.meta['proxy']
A separate question from the previous one: is there any way to get the start and stop time for a request? I'm trying to get a better understanding of CONCURRENT_REQUESTS and how best to maximize requests per second.
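Since the middleware only sets meta['proxy'] when it actually attaches a proxy, one defensive option (a sketch, not the library's API) is reading it with dict.get() so the item carries None instead of raising:

```python
# Sketch of the callback above using meta.get(), so a request that
# went out without a proxy yields None rather than a KeyError.
def parse_item(self, response):
    item = {}
    item['url'] = response.url
    item['download_latency'] = response.request.meta.get('download_latency')
    item['proxy'] = response.request.meta.get('proxy')  # None if no proxy was attached
    return item
```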
The proxy is not set in meta if the proxy URL has no username and password.
In the process_request function, the proxy is passed to the request only if it has a proxy_user_pass; otherwise it only logs that the proxy is being used and how many are left. Does that mean a proxy like https://176.37.14.252:8080 does not work?
This is the function:
def process_request(self, request, spider):
    # Don't overwrite with a random one (server-side state for IP)
    if 'proxy' in request.meta:
        if request.meta["exception"] is False:
            return
    request.meta["exception"] = False
    if len(self.proxies) == 0:
        raise ValueError('All proxies are unusable, cannot proceed')
    if self.mode == Mode.RANDOMIZE_PROXY_EVERY_REQUESTS:
        proxy_address = random.choice(list(self.proxies.keys()))
    else:
        proxy_address = self.chosen_proxy
    proxy_user_pass = self.proxies[proxy_address]
    if proxy_user_pass:
        request.meta['proxy'] = proxy_address
        basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = basic_auth
    else:
        log.debug('Proxy user pass not found')
    log.debug('Using proxy <%s>, %d proxies left' % (
        proxy_address, len(self.proxies)))
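Both reports above describe the same behavior: request.meta['proxy'] is assigned only inside the if proxy_user_pass: branch, so credential-less entries are never attached. A minimal sketch of the reordered assignment (hypothetical helper name, not the library's API): set the proxy unconditionally, add the auth header only when credentials exist.

```python
import base64

def attach_proxy(request_meta, request_headers, proxy_address, proxy_user_pass):
    # Always attach the proxy; add Proxy-Authorization only when
    # credentials were parsed from the list entry.
    request_meta['proxy'] = proxy_address
    if proxy_user_pass:
        basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()
        request_headers['Proxy-Authorization'] = basic_auth
```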
Hi, I'm getting this error many times. I used my own custom middleware and passed the proxy like this: http://username:password@104.120.33.32:12345
Error message:
scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy 104.120.33.32:12345 [{'status': 407, 'reason': b'Proxy Authentication Required'}]
I'm using scrapy-splash to crawl an AJAX site. When using scrapy-proxies with it, it seems the request is not sent through the proxy; the proxy is not used at all.
Hey there,
Just looking for some basic info. I'm trying to figure out how to properly build my ProxyList.txt file. I've got the IP addresses from HMA Pro, but I'm not sure how to locate the port that goes at the end. I've tried searching Google for how to find the ports but I'm still not sure. Is there another free service I could use to find the information I need (IP address and port)?
Thanks a ton
And what's the license of the code?
Thank you =)
Hi,
I use a proxy list to run my spider. However, it fails to pick a new proxy when a connection failure happens.
2016-09-20 17:48:25 [scrapy] DEBUG: Using proxy http://xxx.160.162.95:8080, 3 proxies left
2016-09-20 17:48:27 [scrapy] INFO: Removing failed proxy http://xxx.160.162.95:8080, 2 proxies left
2016-09-20 17:48:27 [scrapy] DEBUG: Retrying <GET http://jsonip.com/> (failed 1 times): User timeout caused connection failure: Getting http://jsonip.com/ took longer than 2.0 seconds..
2016-09-20 17:48:29 [scrapy] INFO: Removing failed proxy http://xxx.160.162.95:8080, 2 proxies left
2016-09-20 17:48:29 [scrapy] DEBUG: Retrying <GET http://jsonip.com/> (failed 2 times): User timeout caused connection failure: Getting http://jsonip.com/ took longer than 2.0 seconds..
2016-09-20 17:48:31 [scrapy] INFO: Removing failed proxy http://xxx.160.162.95:8080, 2 proxies left
2016-09-20 17:48:31 [scrapy] DEBUG: Gave up retrying <GET http://jsonip.com/> (failed 3 times): User timeout caused connection failure: Getting http://jsonip.com/ took longer than 2.0 seconds..
Please help me fix this problem.
Thanks a lot
Hi, is it possible to change the proxy on HTTP code 429?
If I get a 429 error, I want to switch to another proxy from the list.
So I want to run PROXY_MODE = 1,
but if I get a 429, I want to check for and change to a new proxy.
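One approach worth trying (a sketch; the exact behavior with PROXY_MODE = 1 depends on how the middleware removes failed proxies) is adding 429 to Scrapy's retryable status codes, so the retry cycle re-enters the proxy middleware:

```python
# settings.py -- treat 429 as a retryable failure so the retry
# cycle runs again and the proxy middleware gets another chance
# to pick or switch proxies.
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408, 429]
```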
if proxy_user_pass:
    request.meta['proxy'] = proxy_address
    basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass.encode()).decode()
    request.headers['Proxy-Authorization'] = basic_auth
else:
    log.debug('Proxy user pass not found')
log.debug('Using proxy <%s>, %d proxies left' % (
    proxy_address, len(self.proxies)))
I am very confused here as a noob Python developer. From this part of the logic in the randomproxy file, it seems that if the proxy provided in list.txt is in the http://username:password@host:port format, then it works by assigning proxy_address to the request; otherwise it does nothing but log a debug message...
What am I missing here?
If possible, please add the ability to change the proxy every N-th request.
Add a variable for setting N and a new "Proxy mode" value for this.
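Until such a mode exists, the requested rotation could be sketched as a small picker that keeps the current proxy for N requests before drawing a new one (hypothetical class, not part of scrapy-proxies):

```python
import random

class EveryNRequestsProxyPicker:
    """Sketch of the requested mode: reuse one proxy for N requests,
    then pick a new random one from the pool."""
    def __init__(self, proxies, n):
        self.proxies = proxies          # list of proxy addresses
        self.n = n
        self.count = 0
        self.current = random.choice(self.proxies)

    def next_proxy(self):
        self.count += 1
        if self.count > self.n:
            # N requests served: reset the counter and rotate
            self.count = 1
            self.current = random.choice(self.proxies)
        return self.current
```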
When the file has an empty line, this code raises an error in many cases:
if parts.group(2): AttributeError: 'NoneType' object has no attribute 'group'
So, before calling group() we could check the match first, like this:
if parts:
    if parts.group(2):
        ...
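Expanding on that guard, a sketch of a parsing loop that skips blank and malformed lines entirely (hypothetical helper; the pattern has the same shape as the one the middleware uses):

```python
import re

def parse_proxy_lines(lines):
    # Sketch of the list-parsing loop guarding against blank lines
    # and entries the regex does not match, instead of calling
    # .group() on None.
    proxies = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        parts = re.match(r'(\w+://)([^:]+?:[^@]+?@)?(.+)', line)
        if not parts:
            continue  # skip malformed entries
        if parts.group(2):
            user_pass = parts.group(2)[:-1]  # drop trailing '@'
        else:
            user_pass = ''
        proxies[parts.group(1) + parts.group(3)] = user_pass
    return proxies
```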
Is there any way to detect failure through something other than the HTTP status code?
Maybe based on the response body, headers, or something else?
Can I pass the name of the proxy file as a variable to Scrapy?
Then, if I'm running multiple crawlers at the same time, I would be able to use a different list of proxies for each.
Thank you
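Since PROXY_LIST is read from Scrapy's settings, it can be overridden per run with the -s command-line option, so each concurrent crawl can point at its own file (spider names and paths here are hypothetical):

```shell
# Each crawl gets its own proxy file via a per-run setting override.
scrapy crawl spider_one -s PROXY_LIST=/path/to/proxies_one.txt
scrapy crawl spider_two -s PROXY_LIST=/path/to/proxies_two.txt
```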
Is it possible to use a proxy-list API like https://getproxylist.com/#the-api ?
I supplied a list of about 300 proxies and set CONCURRENT_REQUESTS = 64. Still, crawling seems very slow (about 1 page every few seconds on average), much slower than not using any proxy at all. DOWNLOAD_DELAY is low, of course.
From what I've seen, people should usually also increase CONCURRENT_REQUESTS_PER_DOMAIN in these cases (i.e. with a list of many possibly bad proxies), but even then it's still pretty slow.
I'm getting this error:
ValueError: All proxies are unusable, cannot proceed
2017-05-13 14:09:02 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapy_bets)
2017-05-13 14:09:02 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_bets.spiders', 'FEED_URI': 'matches.json', 'SPIDER_MODULES': ['scrapy_bets.spiders'], 'RETRY_TIMES': 10, 'BOT_NAME': 'scrapy_bets', 'RETRY_HTTP_CODES': [500, 503, 504, 400, 403, 404, 408], 'FEED_FORMAT': 'json'}
2017-05-13 14:09:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-05-13 14:09:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy_proxies.RandomProxy',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-13 14:09:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-13 14:09:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-05-13 14:09:02 [scrapy.core.engine] INFO: Spider opened
2017-05-13 14:09:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-13 14:09:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-13 14:09:02 [scrapy.core.scraper] ERROR: Error downloading <GET http://url_to_parse>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
    response = yield method(request=request, spider=spider)
  File "/usr/local/lib/python2.7/site-packages/scrapy_proxies/randomproxy.py", line 63, in process_request
    raise ValueError('All proxies are unusable, cannot proceed')
ValueError: All proxies are unusable, cannot proceed
2017-05-13 14:09:02 [scrapy.core.scraper] ERROR: Error downloading <GET http://url_to_parse>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
    response = yield method(request=request, spider=spider)
  File "/usr/local/lib/python2.7/site-packages/scrapy_proxies/randomproxy.py", line 63, in process_request
    raise ValueError('All proxies are unusable, cannot proceed')
ValueError: All proxies are unusable, cannot proceed
2017-05-13 14:09:02 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-13 14:09:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
'downloader/exception_type_count/exceptions.ValueError': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 13, 13, 9, 2, 915138),
'log_count/DEBUG': 1,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2017, 5, 13, 13, 9, 2, 694730)}
2017-05-13 14:09:02 [scrapy.core.engine] INFO: Spider closed (finished)
Does anyone else experience timeout errors, specifically immediately after redirects?
I've only set this up today, but specifically https://www.game.co.uk/en/hardware/xbox-series-x/?contentOnly=&inStockOnly=true&listerOnly=&pageSize=100
I can fetch it OK with scrapy fetch, but if I use a spider that crawls the URL, I hit a 302 redirect and from that point my crawl errors out completely with immediate "response never received" failures. They're not long timeouts; it literally errors immediately.
Could somebody please help me? I'm fairly new to this and have no idea what the cause may be.
I'm using a pool of 10 HTTP proxies on port 80.
2018-07-26 10:26:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-26 10:26:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-26 10:26:02 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-07-26 10:26:02 [scrapy.proxies] DEBUG: Using proxy https://185.93.3.70:8080, 1 proxies left
2018-07-26 10:26:03 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://piknu.com/u/isabel_sanzz/similar> (failed 1 times): 403 Forbidden
2018-07-26 10:26:03 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-07-26 10:26:03 [scrapy.proxies] DEBUG: Using proxy https://185.93.3.70:8080, 1 proxies left
I'm getting this error when I run:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
    result = g.send(result)
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 72, in crawl
    self.engine = self._create_engine()
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 97, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "/Library/Python/2.7/site-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "/Library/Python/2.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/Library/Python/2.7/site-packages/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/Library/Python/2.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
    mod = import_module(module)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named scrapy_proxies
I have an issue where proxies are not being used when accessing https:// websites (my actual IP is used instead).
I've verified that my proxies do support https:// (setting the environment variable HTTPS_PROXY=<proxy address> works).
Setting the proxies in my proxy list to http:// or https:// does not make a difference.
process_exception(request, exception, spider)
Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception)
There is a problem when the character "@" appears within the password; maybe we should make the regex pattern more permissive? :) Here is my solution:
parts = re.match('(\w+://)([^:]+?:.+@)?(.+)', line.strip())
instead of
parts = re.match('(\w+://)([^:]+?:[^@]+?@)?(.+)', line.strip())
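A quick comparison of the two patterns on a password containing "@" (illustrative credentials only) shows why the relaxed one is needed:

```python
import re

line = 'http://user:p@ssword@1.2.3.4:8080'  # password contains '@'

old = re.match(r'(\w+://)([^:]+?:[^@]+?@)?(.+)', line)
new = re.match(r'(\w+://)([^:]+?:.+@)?(.+)', line)

# The old pattern stops at the first '@' and mis-splits the password;
# the relaxed pattern consumes up to the last '@'.
print(old.group(2))  # 'user:p@'
print(new.group(2))  # 'user:p@ssword@'
print(new.group(3))  # '1.2.3.4:8080'
```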
Hi, I am getting the error below when using the DOWNLOADER_MIDDLEWARES indicated in the ReadMe (I added a proxy list, etc..). Read a bunch of threads on SO but couldn't fix my issue.
Appreciate any help
thanks
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/base64.py", line 517, in _input_type_check
    m = memoryview(s)
TypeError: memoryview: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
    response = yield method(request=request, spider=spider)
  File "/usr/local/lib/python3.6/site-packages/scrapy_proxies/randomproxy.py", line 70, in process_request
    basic_auth = 'Basic ' + base64.encodestring(proxy_user_pass)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/base64.py", line 547, in encodestring
    return encodebytes(s)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/base64.py", line 534, in encodebytes
    _input_type_check(s)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/base64.py", line 520, in _input_type_check
    raise TypeError(msg) from err
TypeError: expected bytes-like object, not str
2017-10-16 23:19:20 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-16 23:19:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/builtins.TypeError': 1,
'finish_reason': 'finished
2017-07-12 14:35:33 [scrapy.proxies] DEBUG: Using proxy http://208.92.94.191:1080, 91 proxies left
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)
/search/category/2/10/g251p6?aid=79417082%2C20944119%2C67545588%2C512124%2C4665606%2C2517868%2C68124250%2C77336676%2C19331058%2C91955011%2C52802565%2C92076417&cpt=79417082%2C20944119%2C67545588%2C512124%2C4665606%2C2517868%2C68124250%2C77336676%2C19331058%2C91955011%2C52802565%2C92076417&tc=1 ==================
2017-07-12 14:35:34 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.dianping.com/shop/70170698> (failed 1 times): 403 Forbidden
2017-07-12 14:35:34 [scrapy.proxies] DEBUG: Using proxy http://110.244.119.139:80, 91 proxies left
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)
2017-07-12 14:35:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.dianping.com/shop/507618> (failed 1 times): 403 Forbidden
2017-07-12 14:35:35 [scrapy.proxies] DEBUG: Using proxy http://125.89.121.179:808, 91 proxies left
Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0
Is it possible to restart the proxy list when it gets to 0? I have dynamic proxies that refresh every 15 minutes, so I want Scrapy to reload the list when len(self.proxies) == 0.
Thanks!
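One way to sketch this (a hypothetical helper, not part of scrapy-proxies) is a reload-on-empty guard that re-reads the list file before giving up:

```python
def ensure_proxies(proxies, list_path, parse_line):
    # Sketch: when every proxy has been removed, re-read the list
    # file (refreshed externally, e.g. every 15 minutes) instead of
    # raising immediately. parse_line is a hypothetical callable that
    # returns (address, user_pass) or None for unusable lines.
    if proxies:
        return proxies
    with open(list_path) as fin:
        for line in fin:
            parsed = parse_line(line)
            if parsed:
                address, user_pass = parsed
                proxies[address] = user_pass
    if not proxies:
        raise ValueError('All proxies are unusable, cannot proceed')
    return proxies
```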
I see this is caused by line 83:
if 'proxy' in request.meta:
    if request.meta["exception"] is False:
        return
If we have set a proxy in the start_requests function, this issue arises, which makes sense because "exception" is not defined in meta at that point for our first request.
I guess most of us use either a random proxy or a custom proxy, so nobody ever bothered about it.
I think line 83 is important because it enables changing proxies on each retry or after an exception.
def start_requests(self):
    yield scrapy.Request('http://quotes.toscrape.com/', callback=self.parse, meta={'proxy': 'http://xxxx:xxxx@xxxx:xxxx'})
Also, to change the proxy on retry, comment this out in process_exception (#15):
if 'proxy' not in request.meta:
    return
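The defensive check can be sketched as a standalone predicate (hypothetical helper name) that treats a missing "exception" flag as False via meta.get(), which avoids the KeyError for proxies attached in start_requests:

```python
def should_keep_existing_proxy(meta):
    # Sketch of the guard at the top of process_request: keep a
    # caller-supplied proxy unless the previous attempt failed.
    # meta.get() avoids the KeyError when 'exception' was never set,
    # as with proxies attached in start_requests.
    if 'proxy' in meta and not meta.get('exception', False):
        return True
    meta['exception'] = False
    return False
```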