A generic crawler
When the spider downloads the first page it updates the regexp for allowed domains. Currently the code has some issues:
- If the start URL uses the www subdomain, then the non-www version gets banned. I think we should allow the non-www version in this case (to get pages which are linked without www), but handle it at the duplicate-filter level (to avoid getting the same content twice if a website has www and non-www mirrors).
- The "." in the domain name matches any character, not only a dot.

Here is an example Dockerfile:
https://github.com/TeamHG-Memex/undercrawler/blob/dockerize-with-arachnado/Dockerfile
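The unescaped-dot problem mentioned above could be fixed by escaping the domain before building the regexp. A minimal sketch (the function name is hypothetical, not the project's code):

```python
import re

def allowed_domain_pattern(domain):
    """Match the domain and its subdomains; re.escape makes the
    dots in the domain match literal dots, not any character."""
    return re.compile(r'^(?:.+\.)?' + re.escape(domain) + r'$')
```

With this, `allowed_domain_pattern('example.com')` matches `www.example.com` but no longer matches `exampleXcom`.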
I can run undercrawler from the command line, but it does not seem to be integrated into Arachnado's UI and API.
Every crawl job created via the Arachnado UI or API uses the generic Arachnado crawler, not undercrawler with Splash.
I think something is missing to get both working together.
We should prevent logging out caused by following logout links.
With splash this happens "inside" splash and we don't see the redirect, but without splash, if we set "domain" as start url and it's redirected to "www.domain", then the redirected request is filtered.
This happens because of our custom dupefilter that ignores www subdomain.
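The www-ignoring behaviour amounts to normalizing URLs before fingerprinting, roughly like this (an illustrative approximation, not the actual dupefilter code):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_www(url):
    """Normalize away a leading 'www.' so www and non-www
    mirrors dedupe to the same fingerprint."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith('www.'):
        host = host[4:]
    return urlunsplit((parts.scheme, host, parts.path, parts.query,
                       parts.fragment))
```

Because `domain` and `www.domain` normalize to the same string, the redirected request is treated as already seen and filtered out.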
Currently, when DupePredictor thinks some pattern leads to a duplicate page, that page is not followed. It may be better to still follow it with some probability (e.g. 5%) and keep updating the stats. That should make the crawler more robust when patterns differ in different parts of a website.
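The 5% idea could look roughly like this (a sketch with hypothetical names, not the project's code):

```python
import random

DUP_FOLLOW_PROB = 0.05  # still follow ~5% of suspected duplicates

def should_follow(looks_like_duplicate, rng=random):
    """Always follow pages not flagged as duplicates; follow flagged
    ones with a small probability so DupePredictor keeps updating its
    stats and can recover if a pattern differs across site sections."""
    return (not looks_like_duplicate) or rng.random() < DUP_FOLLOW_PROB
```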
One way to do it is to download files via Splash using splash:http_get
I would like to make sure I understand the interaction between the Undercrawler CONCURRENT_REQUESTS setting and the number of Splash slots. If I have 3 Splash instances and each instance has a slot setting of 5, does that mean that the effective concurrency would be 15 even if CONCURRENT_REQUESTS in Undercrawler is set to 32? Thanks.
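For what it's worth, my own (unconfirmed) reading matches that arithmetic: the effective parallelism is capped by whichever limit is smaller:

```python
def effective_concurrency(concurrent_requests, instances, slots_per_instance):
    # Splash can only render instances * slots pages at once, and Scrapy
    # never has more than CONCURRENT_REQUESTS in flight, so the effective
    # parallelism is the smaller of the two limits.
    return min(concurrent_requests, instances * slots_per_instance)
```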
It is hard to stop the spider if it uses autologin and autologin is responding with a 'pending' status; I had to kill -9 it.
I get the following error when crawling a site.
INFO [2018-12-01 00:09:47,567] (Crawler->329): 2018-12-01 00:09:47 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'error': 400, 'type': 'ScriptError', 'info': {'error': 'JavaScript error: EvalError: Refused to evaluate a string as JavaScript because \'unsafe-eval\' is not an allowed source of script in the following Content Security Policy directive: "script-src \'self\' \'unsafe-inline\' https://checkout.stripe.com/checkout.js https://static.accountdock.com https://www.googletagmanager.com https://www.googletagmanager.com/gtm.js https://www.googleadservices.com https://googleads.g.doubleclick.net https://www.google-analytics.com https://www.google-analytics.com/analytics.js https://platform.twitter.com/widgets.js https://static.npmjs.com/".', 'message': 'Lua error: [string "print(\'here in the beginning\')..."]:79: JavaScript error: EvalError: Refused to evaluate a string as JavaScript because \'unsafe-eval\' is not an allowed source of script in the following Content Security Policy directive: "script-src \'self\' \'unsafe-inline\' https://checkout.stripe.com/checkout.js https://static.accountdock.com https://www.googletagmanager.com https://www.googletagmanager.com/gtm.js https://www.googleadservices.com https://googleads.g.doubleclick.net https://www.google-analytics.com https://www.google-analytics.com/analytics.js https://platform.twitter.com/widgets.js https://static.npmjs.com/".\n', 'type': 'LUA_ERROR', 'source': '[string "print(\'here in the beginning\')..."]', 'line_number': 79}, 'description': 'Error happened while executing Lua script'} INFO [2018-12-01 00:09:47,568] (Crawler->329): 2018-12-01 00:09:47 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://www.npmjs.com/advisories via http://127.0.0.1:8050/execute> (referer: None)
I read about the Content Security Policy and understand that the site disallows running any external javascript. Questions I have are:
It's not clear yet where should it be fixed, so posting here:
It seems that currently the requests queue takes a lot of memory when running without JOBDIR, and a lot of disk space when running with JOBDIR.
Result of memory profiling after some time:
Site 1 (after maybe 10k pages fetched?):
In [7]: summary.print_(sum1)
types | # objects | total size
============================================= | =========== | ============
<class 'dict | 325334 | 122.44 MB
<class 'str | 891057 | 66.60 MB
<class 'list | 140968 | 43.73 MB
<class 'scrapy.http.headers.Headers | 101893 | 27.99 MB
<class 'weakref | 310643 | 23.70 MB
<class 'bytes | 209461 | 15.84 MB
<class 'frozenset | 260 | 11.85 MB
<class 'int | 371624 | 9.94 MB
<class 'set | 10030 | 9.68 MB
<class 'urllib.parse.ParseResult | 101387 | 9.28 MB
<class 'method | 104241 | 6.36 MB
<class 'scrapy_splash.request.SplashRequest | 101885 | 5.44 MB
<class 'numpy.ndarray | 6641 | 3.91 MB
<class 'type | 3036 | 3.08 MB
<class 'code | 20056 | 2.76 MB
In [8]: len(spider.crawler.engine.slot.scheduler)
Out[8]: 100736
Here most memory is taken by dicts (request.meta and inner dicts), and also some by str.
Site 2 is different: here request queue is 10x smaller, but it takes the same amount of memory (after maybe 2k-3k pages fetched):
types | # objects | total size
===================================== | =========== | ============
<class 'bytes | 19726 | 127.90 MB
<class 'str | 213742 | 21.45 MB
<class 'dict | 44710 | 20.86 MB
<class 'frozenset | 257 | 11.84 MB
<class 'int | 369817 | 9.90 MB
<class 'list | 22572 | 5.54 MB
<class 'set | 4424 | 4.68 MB
<class 'type | 3012 | 3.05 MB
<class 'code | 20042 | 2.75 MB
<class 'scrapy.http.headers.Headers | 9248 | 2.54 MB
<class 'weakref | 32766 | 2.50 MB
<class 'numpy.ndarray | 2077 | 1.23 MB
<class 'tuple | 12974 | 858.53 KB
<class 'urllib.parse.ParseResult | 8628 | 808.88 KB
<class 'method | 10662 | 666.38 KB
len(spider.crawler.engine.slot.scheduler)
9212
In this case, bytes take more space (these are encoded requests to splash), because there are a lot of cookies on the site.
When using JOBDIR, the main offenders are our Lua and JS scripts, which are repeated for each request.
This is not very severe, I think. And it is maybe not directly related to an issue I noticed this morning, when the scrapy process for Site 2 (having crawled 35k pages) was using about 8 GB of RAM (with JOBDIR, although the server could have been low on disk space at the time).
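One possible mitigation for the script duplication (a sketch with hypothetical names, not the project's code): store the shared Lua/JS scripts once per process and put only a short key into each queued, JOBDIR-serialized request:

```python
SCRIPT_REGISTRY = {}  # script key -> script text, stored once per process

def register_script(key, text):
    """Register a shared Lua/JS script once instead of serializing
    it into every queued request."""
    SCRIPT_REGISTRY[key] = text
    return key

def build_request_meta(script_key, splash_args):
    # The queued (and disk-serialized) request carries only a short
    # key; the script text is looked up when the request is sent.
    return {'splash_script_key': script_key, 'splash_args': splash_args}
```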
If I run multiple crawls at the same time using Undercrawler + Aquarium (could be multiple crawls to the same URL or to different URLs), what issues might I theoretically run into, if any? Also, are there any specific config settings that would be helpful for this scenario?
There is a list of downloader middlewares in the settings file. Does that list mean that those are the only middlewares activated for any crawl? I am asking because I noticed that RetryMiddleware is working but MetaRefreshMiddleware doesn't seem to be, and neither of them is specified in the middlewares list. Could someone explain that please? Thanks.
I think it makes sense to make file downloading more aggressive. Browsers allow ~6 parallel connections to the same server, so it should be fine to download multiple static files in parallel from the same server. Currently we're throttling it to 1 parallel download on average using the autothrottle algorithm.
Autothrottle doesn't make much sense for static files anyway: file size can vary a lot, and it makes no sense to adjust the download delay for a 10 KB JPEG file based on having spent a minute downloading a 50 MB PDF.
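The idea could be sketched as: bypass the autothrottle delay for static-file requests and rely on a fixed per-domain connection limit instead (names and extension list are illustrative assumptions):

```python
import os
from urllib.parse import urlsplit

STATIC_EXTENSIONS = {'.pdf', '.jpg', '.jpeg', '.png', '.gif', '.zip', '.doc'}

def download_delay(url, autothrottle_delay):
    """Skip the autothrottle delay for static files; parallelism for
    them would be bounded by a per-domain connection cap (~6, like
    browsers) rather than by a latency-tuned delay."""
    ext = os.path.splitext(urlsplit(url).path)[1].lower()
    return 0.0 if ext in STATIC_EXTENSIONS else autothrottle_delay
```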
Because of the way we handle file uploads, there are many "WARNING: Dropped" lines in the log when file uploads are enabled. I think we shouldn't log these duplicate uploads by default.
Requests may fail completely due to a failure to execute the javascript headless horseman scripts:
[scrapy_splash.middleware] WARNING: Bad request to Splash: {'description': 'Error happened while executing Lua script', 'error': 400, 'info': {'line_number': 70, 'error': 'JavaScript error: EvalError: Refused to evaluate a string as JavaScript because \'unsafe-eval\' is not an allowed source of script in the following Content Security Policy directive: "script-src \'self\' https://*.twimg.com https://*.twitter.com https://static.ads-twitter.com".', 'message': 'Lua error: [string "function get_arg(arg, default)..."]:70: JavaScript error: EvalError: Refused to evaluate a string as JavaScript because \'unsafe-eval\' is not an allowed source of script in the following Content Security Policy directive: "script-src \'self\' https://*.twimg.com https://*.twitter.com https://static.ads-twitter.com".\n', 'source': '[string "function get_arg(arg, default)..."]', 'type': 'LUA_ERROR'}, 'type': 'ScriptError'}
Line 70 is this one:
It's possible to skip the error with pcall: https://www.lua.org/pil/8.4.html, but maybe there is a way to still execute js on the page?
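Besides (or before) pcall, one option is to detect this specific failure on the Scrapy side and retry the request differently (e.g. without JS evaluation). A sketch of the detection, keyed on the error payload shape shown above:

```python
def is_csp_eval_error(error_payload):
    """Detect the CSP 'unsafe-eval' failure in a Splash 400 error
    payload, so the request can be handled specially (e.g. retried
    with JS evaluation disabled)."""
    info = error_payload.get('info', {})
    return (error_payload.get('error') == 400
            and info.get('type') == 'LUA_ERROR'
            and 'unsafe-eval' in info.get('error', ''))
```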
I think these errors are not retried:
2016-04-12 21:30:49 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'error': 400, 'description': 'Error happened while executing Lua script', 'info': {'message': 'Lua error: [string "function get_arg(arg, default)..."]:52: network5', 'error': 'network5', 'type': 'LUA_ERROR', 'source': '[string "function get_arg(arg, default)..."]', 'line_number': 52}, 'type': 'ScriptError'}
2016-04-12 21:29:31 [scrapy] DEBUG: Crawled (400) <POST http://XXX:8050/execute> (referer: None)
2016-04-12 21:29:31 [scrapy] DEBUG: Ignoring response <400 XXX/>: HTTP status code is not handled or not allowed
Currently if
then depth is not increased. Depth should be increased at (4).
This happens on forums - sometimes topic list pages have pagination links to topic comments.
The first request is made only 6-7 seconds after starting the spider (that is why tests are so slow).
See also #26
Right now it is applied to documents with text content type, and to all requests.
Form action is only computed for deduplication purposes:
undercrawler/undercrawler/spiders.py, line 165 at ba2c49a
Hey, you have an interesting project:
https://github.com/TeamHG-Memex/soft404
It would be a good feature for undercrawler.
Here:
canonicalize_url is called; by default it removes the fragment. I think we should check what the fragment usually means in pagination URLs and decide whether to keep it or drop it. If that's too much to check, then I think we should keep the fragment by default.

I think it could be useful if we just want to crawl some sites without Tor and heavy JS usage, and to compare results with/without Splash. It should not be too hard to support, since we can just run all tests in two different configurations. What do you think @kmike ?
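On the fragment question: w3lib's canonicalize_url accepts keep_fragments=True. A simplified stand-in just to illustrate what is lost for "#page=2"-style pagination links:

```python
from urllib.parse import urldefrag

def canonicalize(url, keep_fragments=False):
    """Simplified stand-in for canonicalize_url's fragment handling:
    dropping the fragment by default collapses '#page=2'-style
    pagination URLs into one URL."""
    return url if keep_fragments else urldefrag(url)[0]
```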
Right now cookies are extracted from headers, and we get headers from the last request in splash. So if there is a series of redirects and some cookies are changed, but are absent from the last request, we will miss them. The test in 724cc1b468d8e8d557a1745a7a1826ed7bc209a0 fails for this reason.
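A fix could collect cookies from every response in the redirect chain rather than only the last one. A rough sketch over plain header dicts (the data shape here is assumed for illustration, not Splash's actual history format):

```python
def merge_cookies(chain_headers):
    """Collect Set-Cookie values from every response in a redirect
    chain, later responses overriding earlier ones, instead of
    reading only the final request's headers."""
    cookies = {}
    for headers in chain_headers:
        for value in headers.get('Set-Cookie', []):
            name, _, rest = value.partition('=')
            cookies[name.strip()] = rest.split(';', 1)[0]
    return cookies
```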
I ran a crawl on a site and a lot of pages were blank, in the sense that the screenshot was blank and the raw_content in the output didn't have the HTML body extracted; it just had header and script stuff in it. On comparing those to the non-blank pages, I couldn't see a difference that would explain why it happened, and there are no errors in the logs (both the Undercrawler logs and the Splash logs). Any idea why that would happen? Any pointers to what I can look into? Thanks.
I haven't started the autologin servers and got these exceptions:
py35 runtests: commands[4] | py.test --doctest-modules --cov=undercrawler undercrawler tests
============================================================= test session starts =============================================================
platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
rootdir: /Users/kmike/svn/undercrawler, inifile:
plugins: cov-2.2.1, twisted-1.5
collected 24 items
undercrawler/spiders/base_spider.py ..
tests/test_dupe_predict.py .................
tests/test_spider.py ...FF
----------------------------------------------- coverage: platform darwin, python 3.5.1-final-0 -----------------------------------------------
Name Stmts Miss Branch BrPart Cover
--------------------------------------------------------------------------------
undercrawler/__init__.py 0 0 0 0 100%
undercrawler/crazy_form_submitter.py 41 31 21 0 16%
undercrawler/directives/test_directive.py 71 60 14 1 14%
undercrawler/documents_pipeline.py 20 2 10 4 80%
undercrawler/dupe_predict.py 145 1 78 2 99%
undercrawler/items.py 18 0 2 0 100%
undercrawler/middleware/__init__.py 3 0 0 0 100%
undercrawler/middleware/autologin.py 88 46 42 7 39%
undercrawler/middleware/avoid_dup_content.py 42 20 18 4 43%
undercrawler/middleware/throttle.py 24 1 10 3 88%
undercrawler/settings.py 33 0 2 1 97%
undercrawler/spiders/__init__.py 1 0 0 0 100%
undercrawler/spiders/base_spider.py 150 25 63 11 78%
undercrawler/utils.py 39 10 22 2 64%
--------------------------------------------------------------------------------
TOTAL 675 196 282 35 67%
================================================================== FAILURES ===================================================================
__________________________________________________________ TestAutologin.test_login ___________________________________________________________
self = <tests.test_spider.TestAutologin testMethod=test_login>
@defer.inlineCallbacks
def test_login(self):
''' No logout links, just one page after login.
'''
with MockServer(Login) as s:
root_url = s.root_url
yield self.crawler.crawl(url=root_url)
spider = self.crawler.spider
> assert hasattr(spider, 'collected_items')
E AssertionError: assert hasattr(<BaseSpider 'base' at 0x10ce7ee80>, 'collected_items')
tests/test_spider.py:224: AssertionError
------------------------------------------------------------ Captured stderr call -------------------------------------------------------------
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']
INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']
INFO:scrapy.middleware:Enabled downloader middlewares:
['undercrawler.middleware.AvoidDupContentMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'undercrawler.middleware.AutologinMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'undercrawler.middleware.SplashAwareAutoThrottle',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
INFO: Enabled downloader middlewares:
['undercrawler.middleware.AvoidDupContentMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'undercrawler.middleware.AutologinMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'undercrawler.middleware.SplashAwareAutoThrottle',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
['tests.utils.CollectorPipeline']
INFO: Enabled item pipelines:
['tests.utils.CollectorPipeline']
INFO:scrapy.core.engine:Spider opened
INFO: Spider opened
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG:undercrawler.middleware.autologin:Attempting login at http://192.168.99.1:8781
DEBUG: Attempting login at http://192.168.99.1:8781
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 127.0.0.1
INFO: Starting new HTTP connection (1): 127.0.0.1
DEBUG:scrapy.downloadermiddlewares.retry:Retrying <GET http://192.168.99.1:8781> (failed 1 times): HTTPConnectionPool(host='127.0.0.1', port=8089): Max retries exceeded with url: /login-cookies (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x10ce84cc0>: Failed to establish a new connection: [Errno 61] Connection refused',))
DEBUG: Retrying <GET http://192.168.99.1:8781> (failed 1 times): HTTPConnectionPool(host='127.0.0.1', port=8089): Max retries exceeded with url: /login-cookies (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x10ce84cc0>: Failed to establish a new connection: [Errno 61] Connection refused',))
DEBUG:undercrawler.middleware.autologin:response <200 http://192.168.99.1:8781/> cookies <CookieJar[]>
DEBUG: response <200 http://192.168.99.1:8781/> cookies <CookieJar[]>
ERROR:scrapy.core.scraper:Error downloading <GET http://192.168.99.1:8781 via http://192.168.99.100:8050/execute>
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1105, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://192.168.99.100:8050/execute>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
spider=spider)
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 129, in process_response
if self.is_logout(response):
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 149, in is_logout
auth_cookies = {c['name'] for c in self.auth_cookies if c['value']}
TypeError: 'NoneType' object is not iterable
ERROR: Error downloading <GET http://192.168.99.1:8781 via http://192.168.99.100:8050/execute>
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1105, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://192.168.99.100:8050/execute>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
spider=spider)
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 129, in process_response
if self.is_logout(response):
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 149, in is_logout
auth_cookies = {c['name'] for c in self.auth_cookies if c['value']}
TypeError: 'NoneType' object is not iterable
INFO:scrapy.core.engine:Closing spider (finished)
INFO: Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/requests.exceptions.ConnectionError': 1,
'downloader/request_bytes': 27995,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 1960,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 11, 19, 32, 19, 94613),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'splash/execute/request_count': 1,
'splash/execute/response_count/200': 1,
'start_time': datetime.datetime(2016, 4, 11, 19, 32, 12, 678245)}
INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/requests.exceptions.ConnectionError': 1,
'downloader/request_bytes': 27995,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 1960,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 11, 19, 32, 19, 94613),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'splash/execute/request_count': 1,
'splash/execute/response_count/200': 1,
'start_time': datetime.datetime(2016, 4, 11, 19, 32, 12, 678245)}
INFO:scrapy.core.engine:Spider closed (finished)
INFO: Spider closed (finished)
____________________________________________________ TestAutologin.test_login_with_logout _____________________________________________________
self = <tests.test_spider.TestAutologin testMethod=test_login_with_logout>
@defer.inlineCallbacks
def test_login_with_logout(self):
''' Login with logout.
'''
with MockServer(LoginWithLogout) as s:
root_url = s.root_url
yield self.crawler.crawl(url=root_url)
spider = self.crawler.spider
> assert hasattr(spider, 'collected_items')
E AssertionError: assert hasattr(<BaseSpider 'base' at 0x10e18d668>, 'collected_items')
tests/test_spider.py:236: AssertionError
------------------------------------------------------------ Captured stderr call -------------------------------------------------------------
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']
INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']
INFO:scrapy.middleware:Enabled downloader middlewares:
['undercrawler.middleware.AvoidDupContentMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'undercrawler.middleware.AutologinMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'undercrawler.middleware.SplashAwareAutoThrottle',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
INFO: Enabled downloader middlewares:
['undercrawler.middleware.AvoidDupContentMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'undercrawler.middleware.AutologinMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'undercrawler.middleware.SplashAwareAutoThrottle',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
['tests.utils.CollectorPipeline']
INFO: Enabled item pipelines:
['tests.utils.CollectorPipeline']
INFO:scrapy.core.engine:Spider opened
INFO: Spider opened
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG:undercrawler.middleware.autologin:Attempting login at http://192.168.99.1:8781
DEBUG: Attempting login at http://192.168.99.1:8781
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 127.0.0.1
INFO: Starting new HTTP connection (1): 127.0.0.1
DEBUG:scrapy.downloadermiddlewares.retry:Retrying <GET http://192.168.99.1:8781> (failed 1 times): HTTPConnectionPool(host='127.0.0.1', port=8089): Max retries exceeded with url: /login-cookies (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x10cd67080>: Failed to establish a new connection: [Errno 61] Connection refused',))
DEBUG: Retrying <GET http://192.168.99.1:8781> (failed 1 times): HTTPConnectionPool(host='127.0.0.1', port=8089): Max retries exceeded with url: /login-cookies (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x10cd67080>: Failed to establish a new connection: [Errno 61] Connection refused',))
DEBUG:undercrawler.middleware.autologin:response <200 http://192.168.99.1:8781/> cookies <CookieJar[]>
DEBUG: response <200 http://192.168.99.1:8781/> cookies <CookieJar[]>
ERROR:scrapy.core.scraper:Error downloading <GET http://192.168.99.1:8781 via http://192.168.99.100:8050/execute>
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1105, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://192.168.99.100:8050/execute>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
spider=spider)
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 129, in process_response
if self.is_logout(response):
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 149, in is_logout
auth_cookies = {c['name'] for c in self.auth_cookies if c['value']}
TypeError: 'NoneType' object is not iterable
ERROR: Error downloading <GET http://192.168.99.1:8781 via http://192.168.99.100:8050/execute>
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1105, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://192.168.99.100:8050/execute>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
spider=spider)
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 129, in process_response
if self.is_logout(response):
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 149, in is_logout
auth_cookies = {c['name'] for c in self.auth_cookies if c['value']}
TypeError: 'NoneType' object is not iterable
INFO:scrapy.core.engine:Closing spider (finished)
INFO: Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/requests.exceptions.ConnectionError': 1,
'downloader/request_bytes': 27995,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 1960,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 11, 19, 32, 28, 109595),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'splash/execute/request_count': 1,
'splash/execute/response_count/200': 1,
'start_time': datetime.datetime(2016, 4, 11, 19, 32, 21, 698408)}
INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/requests.exceptions.ConnectionError': 1,
'downloader/request_bytes': 27995,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 1960,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 11, 19, 32, 28, 109595),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'splash/execute/request_count': 1,
'splash/execute/response_count/200': 1,
'start_time': datetime.datetime(2016, 4, 11, 19, 32, 21, 698408)}
INFO:scrapy.core.engine:Spider closed (finished)
INFO: Spider closed (finished)
Also, all logging messages are duplicated for some reason.
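The `TypeError: 'NoneType' object is not iterable` comes from `self.auth_cookies` being `None` when `is_logout` runs (autologin never populated it). A minimal defensive sketch of the guard, written as a standalone function; the `{'name': ..., 'value': ...}` cookie-dict shape is taken from the traceback, the rest is an assumption:

```python
def is_logout(auth_cookies, response_cookies):
    """Return True if the response cleared one of the auth cookies.

    Standalone sketch: `auth_cookies` is the list of {'name': ..., 'value': ...}
    dicts the autologin middleware stores, and may be None if autologin
    never ran -- the case that raised the TypeError above.
    """
    if not auth_cookies:  # guard against None: nothing to compare yet
        return False
    auth_names = {c['name'] for c in auth_cookies if c['value']}
    # A logout happened if a previously non-empty auth cookie is now empty.
    return any(name in auth_names and not value
               for name, value in response_cookies.items())
```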
I am not sure if I'm using the project correctly, but where exactly is the exported content stored?
I can see the images being saved in the Files folder and the screenshots in the Screenshots folder.
But what about the actual HTML and other web files?
The scripts/crawl_stats.py script asks for a crawler_out file, which I can't really identify.
The log currently shows entries like:
2017-10-22 00:10:38 [undercrawler] DEBUG: Saved <200 ...> screenshot to /Screenshots/xxx.png
2017-10-22 00:10:39 [scrapy.core.scraper] DEBUG: Scraped from <200 http://xxx/>
<CDRItem: _id: ......... timestamp_crawl: '2017-10-21T18:40:39.560401Z'>
I use the default settings.py with the following extra settings:
IMAGES_ENABLED = 1
FILES_STORE = '../Files'
FOLLOW_LINKS = 0
SCREENSHOT = True
SCREENSHOT_HEIGHT = 0
SCREENSHOT_DEST = '../Screenshots'
PREFER_PAGINATION = 1
MAX_DOMAIN_SEARCH_FORMS = 10
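For what it's worth, since the spider yields CDRItem objects through the normal Scrapy item pipeline, one way to get readable output (this is standard Scrapy feed export, nothing Undercrawler-specific) is to add `-o items.jl` to the crawl command and read the JSON-lines file back. A small sketch of reading such an export:

```python
import json

def load_cdr_items(path):
    """Read items from a Scrapy JSON-lines feed export (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Usage would be something like `scrapy crawl undercrawler -a url=... -o items.jl`, then `load_cdr_items('items.jl')`.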
When iframe is in a page it makes sense to get its content even if it is not in allowed domain. Maybe we shouldn't follow links in this case though.
I am trying to save files to S3, but could not get it up and running.
"s3://bucket/prefix/" needs to be defined, but where?
The UI has Scrapy settings and Spider Args fields, but nothing seems to work.
I need to know where the AWS key and AWS secret need to be defined, and how to set the file storage to "s3://bucket/prefix/".
A little help would be appreciated.
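The settings involved are standard Scrapy ones rather than anything Undercrawler-specific, so a settings.py sketch (all values are placeholders) would look like:

```python
# settings.py sketch -- standard Scrapy settings; values are placeholders
AWS_ACCESS_KEY_ID = 'your-aws-key'
AWS_SECRET_ACCESS_KEY = 'your-aws-secret'
FILES_STORE = 's3://bucket/prefix/'  # FilesPipeline writes under this prefix
```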
I'm not sure why. It fails with and without Splash in the same way:
$ py.test tests/test_spider.py::test_documents -s
...
> assert len(spider.collected_items) == 4
E assert 5 == 4
tests/test_spider.py:119: AssertionError
Hi,
I have to say, amazing tool.
I am struggling to understand how to store the results in a JSON file for each start URL.
Currently I am getting binary files for each URL within the domain, and I have difficulty retrieving the information I am seeking (domain URL, sub-URL, status code, HTML content or plain text).
I am running the following command:
scrapy crawl undercrawler -a url=https://www.bvrgroep.nl -s CLOSESPIDER_PAGECOUNT=5
with
FILES_STORE = "\output_data"
This creates several files without extensions at that path.
So it is hard for me to get my head around UndercrawlerMediaPipeline and how I can adjust it to store files in a readable format.
Also, I cannot find IMAGES_ENABLED in the settings file, which I would use to stop downloading images.
PS: I have not activated Splash, as I do not have access to Docker on my local laptop.
Could you please shed some light on this?
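The extension-less files are expected behavior: Scrapy's FilesPipeline (which UndercrawlerMediaPipeline builds on) names stored files after the SHA1 hash of the request URL, adding a file extension only when the URL path has one. A rough sketch of that default naming scheme:

```python
import hashlib
import os
from urllib.parse import urlparse

def default_file_path(url):
    """Approximate Scrapy FilesPipeline's default naming:
    full/<sha1 of the request URL><extension from the URL path, if any>.
    Sketch for illustration; the real pipeline has extra media-type logic."""
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    ext = os.path.splitext(urlparse(url).path)[1]
    return 'full/{}{}'.format(digest, ext)
```

So a URL without an extension in its path yields a bare 40-character hash name, which matches what you are seeing.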
I noticed that the memory usage by the scrapy process gradually keeps going up when I use Undercrawler. And then I used the prefs() function (mentioned in the scrapy docs) to monitor the live-references. I noticed that the SplashRequest references never go down and the oldest in there is from the time when the crawl started. Depending on the site, the references can climb up very fast. Here are some numbers:
(1) Crawling cnn, after 5 minutes
Crawled 65 pages
prefs() results -
BaseSpider 1 oldest: 313s ago
SplashRequest 2120 oldest: 300s ago
(2) Crawling reddit, after 5 minutes
Crawled 156 pages
prefs() results -
BaseSpider 1 oldest: 308s ago
SplashRequest 16100 oldest: 302s ago
So, while the SplashRequest objects seem to be a problem for both crawls, for Reddit there are far too many of them after just 5 minutes, for some reason.
I would expect each SplashRequest to be released after the request is made. Can someone explain this behavior? And what can be done about it?
Thanks.
When the page is not an HTML page but binary content (we cannot know for sure when extracting links), the Lua script times out (even without HH enabled).
Not only do we fail to download such pages, this also slows down the whole crawl a lot.
Hi, I am having issues trying to crawl couple of sites. I read about the splash.plugins_enabled setting, and I would like to turn it on to see if that fixes the crawling issue I am running into. How would I turn it on in Undercrawler? Thanks.
I think it would be better to use a different download_slot for files, so that their download latency is calculated separately.
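For reference, Scrapy lets a request override its slot via request.meta['download_slot']. A sketch of routing file downloads into a dedicated slot (the slot name 'files' is made up for illustration):

```python
def with_files_slot(meta=None):
    """Return request meta that routes the download into a dedicated 'files'
    slot, so file latency is tracked apart from page latency. Sketch using
    Scrapy's meta['download_slot'] override; the slot name is arbitrary."""
    meta = dict(meta or {})
    meta['download_slot'] = 'files'
    return meta
```

In a media pipeline this would be used roughly as `Request(url, meta=with_files_slot())`.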
It makes sense to assign more weight to recent samples in DupePredictor; together with #41 it should allow handling the case where the crawler first visits a large part A of a website and learns a pattern, then goes to another part B of the website where this pattern is no longer valid.
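One simple way to weight recent samples, sketched as an exponentially weighted moving average over the duplicate-probability estimate (the alpha parameter is a made-up illustration, not something DupePredictor has today):

```python
def ewma(samples, alpha=0.1):
    """Exponentially weighted moving average: each new sample gets weight
    alpha, so older observations decay geometrically (sketch)."""
    avg = None
    for x in samples:
        avg = x if avg is None else (1 - alpha) * avg + alpha * x
    return avg
```

With this weighting, a pattern that stops holding in part B of a site would see its estimated duplicate probability decay instead of being pinned by the bulk of old part-A samples.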
Hi,
how can I get both the cookies and the html through def parse(self, response)?
I can only get one field of the render, e.g. return render['html'], return render['har'], or return render['cookiejar'].
In the official H-H.lua, Splash can return render, and render has all the fields I want, not just one of them.
In def parse(self, response), I don't know how to use this object:
```
def parse(self, response):
    print(type(response))
    print(dir(response))
```
I tried dir(response), which gives ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', ....]; it seems there is no method I need.
thx.
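With scrapy-splash, if the Lua script returns a table instead of a single value (e.g. return {html=splash:html(), har=splash:har(), cookies=splash:get_cookies()}), the whole table is available in Python as response.data. A sketch of pulling out every wanted field from such a result (the field names are assumptions; they match whatever keys your script puts in the returned table):

```python
def extract_render_fields(render, fields=('html', 'har', 'cookies')):
    """Collect several fields from a Splash render result returned as a Lua
    table; with scrapy-splash this dict arrives as response.data (sketch).
    Missing fields come back as None."""
    return {name: render.get(name) for name in fields}
```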
I set splash.args.debug to true (and I checked that it is true by printing its value from the .lua script), but I am unable to see the debugLogs from headless_horseman.js in the Splash logs. Where would they appear? Or am I missing something? Thanks.