A generic crawler
When the spider downloads the first page it updates the regexp for allowed domains. Currently the code has some issues:
- If the start URL uses the www subdomain, then the non-www version gets banned. I think we should allow the non-www version in this case (to get pages which are linked without www), but handle it at the duplicate-filter level (to avoid getting the same content twice if a website has www and non-www mirrors).
- The "." in the domain name matches any character, not only a dot.

Here is an example Dockerfile:
https://github.com/TeamHG-Memex/undercrawler/blob/dockerize-with-arachnado/Dockerfile
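The unescaped-dot problem mentioned above could be fixed by escaping the domain before building the regexp. A minimal sketch (the function name is hypothetical, not the project's code):

```python
import re

def allowed_domain_pattern(domain):
    """Match the domain and its subdomains; re.escape makes the
    dots in the domain match literal dots, not any character."""
    return re.compile(r'^(?:.+\.)?' + re.escape(domain) + r'$')
```

With this, `allowed_domain_pattern('example.com')` matches `www.example.com` but no longer matches `exampleXcom`.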
I can run undercrawler from the command line, but it does not seem to be integrated into Arachnado's UI and API.
Every crawl job created via the Arachnado UI or API uses the generic Arachnado crawler, not undercrawler with Splash.
I think something is missing to get both working together.
We should prevent logging out caused by following logout links.
With splash this happens "inside" splash and we don't see the redirect, but without splash, if we set "domain" as start url and it's redirected to "www.domain", then the redirected request is filtered.
This happens because of our custom dupefilter that ignores www subdomain.
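The www-ignoring behaviour amounts to normalizing URLs before fingerprinting, roughly like this (an illustrative approximation, not the actual dupefilter code):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_www(url):
    """Normalize away a leading 'www.' so www and non-www
    mirrors dedupe to the same fingerprint."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith('www.'):
        host = host[4:]
    return urlunsplit((parts.scheme, host, parts.path, parts.query,
                       parts.fragment))
```

Because `domain` and `www.domain` normalize to the same string, the redirected request is treated as already seen and filtered out.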
Currently, when DupePredictor thinks some pattern leads to a duplicate page, that page is not followed. It may be better to still follow it with some probability (e.g. 5%) and keep updating the stats. That should make the crawler more robust when patterns differ in different parts of a website.
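The 5% idea could look roughly like this (a sketch with hypothetical names, not the project's code):

```python
import random

DUP_FOLLOW_PROB = 0.05  # still follow ~5% of suspected duplicates

def should_follow(looks_like_duplicate, rng=random):
    """Always follow pages not flagged as duplicates; follow flagged
    ones with a small probability so DupePredictor keeps updating its
    stats and can recover if a pattern differs across site sections."""
    return (not looks_like_duplicate) or rng.random() < DUP_FOLLOW_PROB
```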
One way to do it is to download files via Splash using splash:http_get
I would like to make sure I understand the interaction between the Undercrawler CONCURRENT_REQUESTS setting and the number of Splash slots. If I have 3 Splash instances and each instance has a slot setting of 5, does that mean that the effective concurrency would be 15 even if CONCURRENT_REQUESTS in Undercrawler is set to 32? Thanks.
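For what it's worth, my own (unconfirmed) reading matches that arithmetic: the effective parallelism is capped by whichever limit is smaller:

```python
def effective_concurrency(concurrent_requests, instances, slots_per_instance):
    # Splash can only render instances * slots pages at once, and Scrapy
    # never has more than CONCURRENT_REQUESTS in flight, so the effective
    # parallelism is the smaller of the two limits.
    return min(concurrent_requests, instances * slots_per_instance)
```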
It is hard to stop the spider if it uses autologin and autologin is responding with a 'pending' status; I had to kill -9 it.
I get the following error when crawling a site.
INFO [2018-12-01 00:09:47,567] (Crawler->329): 2018-12-01 00:09:47 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'error': 400, 'type': 'ScriptError', 'info': {'error': 'JavaScript error: EvalError: Refused to evaluate a string as JavaScript because \'unsafe-eval\' is not an allowed source of script in the following Content Security Policy directive: "script-src \'self\' \'unsafe-inline\' https://checkout.stripe.com/checkout.js https://static.accountdock.com https://www.googletagmanager.com https://www.googletagmanager.com/gtm.js https://www.googleadservices.com https://googleads.g.doubleclick.net https://www.google-analytics.com https://www.google-analytics.com/analytics.js https://platform.twitter.com/widgets.js https://static.npmjs.com/".', 'message': 'Lua error: [string "print(\'here in the beginning\')..."]:79: JavaScript error: EvalError: Refused to evaluate a string as JavaScript because \'unsafe-eval\' is not an allowed source of script in the following Content Security Policy directive: "script-src \'self\' \'unsafe-inline\' https://checkout.stripe.com/checkout.js https://static.accountdock.com https://www.googletagmanager.com https://www.googletagmanager.com/gtm.js https://www.googleadservices.com https://googleads.g.doubleclick.net https://www.google-analytics.com https://www.google-analytics.com/analytics.js https://platform.twitter.com/widgets.js https://static.npmjs.com/".\n', 'type': 'LUA_ERROR', 'source': '[string "print(\'here in the beginning\')..."]', 'line_number': 79}, 'description': 'Error happened while executing Lua script'} INFO [2018-12-01 00:09:47,568] (Crawler->329): 2018-12-01 00:09:47 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://www.npmjs.com/advisories via http://127.0.0.1:8050/execute> (referer: None)
I read about the Content Security Policy and understand that the site disallows running any external javascript. Questions I have are:
It's not clear yet where should it be fixed, so posting here:
It seems that currently the requests queue takes a lot of memory when running without JOBDIR, and a lot of disk space when running with JOBDIR.
Result of memory profiling after some time:
Site 1 (after maybe 10k pages fetched?):
In [7]: summary.print_(sum1)
types | # objects | total size
============================================= | =========== | ============
<class 'dict | 325334 | 122.44 MB
<class 'str | 891057 | 66.60 MB
<class 'list | 140968 | 43.73 MB
<class 'scrapy.http.headers.Headers | 101893 | 27.99 MB
<class 'weakref | 310643 | 23.70 MB
<class 'bytes | 209461 | 15.84 MB
<class 'frozenset | 260 | 11.85 MB
<class 'int | 371624 | 9.94 MB
<class 'set | 10030 | 9.68 MB
<class 'urllib.parse.ParseResult | 101387 | 9.28 MB
<class 'method | 104241 | 6.36 MB
<class 'scrapy_splash.request.SplashRequest | 101885 | 5.44 MB
<class 'numpy.ndarray | 6641 | 3.91 MB
<class 'type | 3036 | 3.08 MB
<class 'code | 20056 | 2.76 MB
In [8]: len(spider.crawler.engine.slot.scheduler)
Out[8]: 100736
Here most memory is taken by dicts (request.meta and inner dicts), and also some by str.
Site 2 is different: here request queue is 10x smaller, but it takes the same amount of memory (after maybe 2k-3k pages fetched):
types | # objects | total size
===================================== | =========== | ============
<class 'bytes | 19726 | 127.90 MB
<class 'str | 213742 | 21.45 MB
<class 'dict | 44710 | 20.86 MB
<class 'frozenset | 257 | 11.84 MB
<class 'int | 369817 | 9.90 MB
<class 'list | 22572 | 5.54 MB
<class 'set | 4424 | 4.68 MB
<class 'type | 3012 | 3.05 MB
<class 'code | 20042 | 2.75 MB
<class 'scrapy.http.headers.Headers | 9248 | 2.54 MB
<class 'weakref | 32766 | 2.50 MB
<class 'numpy.ndarray | 2077 | 1.23 MB
<class 'tuple | 12974 | 858.53 KB
<class 'urllib.parse.ParseResult | 8628 | 808.88 KB
<class 'method | 10662 | 666.38 KB
len(spider.crawler.engine.slot.scheduler)
9212
In this case, bytes take more space (these are encoded requests to splash), because there are a lot of cookies on the site.
When using JOBDIR, the main offenders are our Lua and JS scripts, which are repeated for each request.
This is not very severe, I think. And it is maybe not directly related to an issue I noticed this morning, when the scrapy process for Site 2 (having crawled 35k pages) was using about 8 GB of RAM (with JOBDIR, although the server could have been low on disk space at the time).
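One possible mitigation for the script duplication (a sketch with hypothetical names, not the project's code): store the shared Lua/JS scripts once per process and put only a short key into each queued, JOBDIR-serialized request:

```python
SCRIPT_REGISTRY = {}  # script key -> script text, stored once per process

def register_script(key, text):
    """Register a shared Lua/JS script once instead of serializing
    it into every queued request."""
    SCRIPT_REGISTRY[key] = text
    return key

def build_request_meta(script_key, splash_args):
    # The queued (and disk-serialized) request carries only a short
    # key; the script text is looked up when the request is sent.
    return {'splash_script_key': script_key, 'splash_args': splash_args}
```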
If I run multiple crawls at the same time using Undercrawler + Aquarium (could be multiple crawls to the same URL or to different URLs), what issues might I theoretically run into, if any? Also, are there any specific config settings that would be helpful for this scenario?
There is a list of downloader middlewares in the settings file. Does that list mean that those are the only middlewares activated for any crawl? I am asking because I noticed that RetryMiddleware is working but MetaRefreshMiddleware doesn't seem to be, and neither of them is specified in the middlewares list. Could someone explain that please? Thanks.
I think it makes sense to make file downloading more aggressive. Browsers allow ~6 parallel connections to the same server, so it should be fine to download multiple static files in parallel from the same server. Currently we're throttling it to 1 parallel download on average using the autothrottle algorithm.
Autothrottle doesn't make much sense for static files anyway: file size can vary a lot, and it makes no sense to adjust the download delay for a 10 KB JPEG file based on having spent a minute downloading a 50 MB PDF.
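The idea could be sketched as: bypass the autothrottle delay for static-file requests and rely on a fixed per-domain connection limit instead (names and extension list are illustrative assumptions):

```python
import os
from urllib.parse import urlsplit

STATIC_EXTENSIONS = {'.pdf', '.jpg', '.jpeg', '.png', '.gif', '.zip', '.doc'}

def download_delay(url, autothrottle_delay):
    """Skip the autothrottle delay for static files; parallelism for
    them would be bounded by a per-domain connection cap (~6, like
    browsers) rather than by a latency-tuned delay."""
    ext = os.path.splitext(urlsplit(url).path)[1].lower()
    return 0.0 if ext in STATIC_EXTENSIONS else autothrottle_delay
```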
Because of the way we handle file uploads, there are many "WARNING: Dropped" lines in the log when file uploads are enabled. I think we shouldn't log these duplicate uploads by default.
Requests may fail completely due to a failure to execute the javascript headless horseman scripts:
[scrapy_splash.middleware] WARNING: Bad request to Splash: {'description': 'Error happened while executing Lua script', 'error': 400, 'info': {'line_number': 70, 'error': 'JavaScript error: EvalError: Refused to evaluate a string as JavaScript because \'unsafe-eval\' is not an allowed source of script in the following Content Security Policy directive: "script-src \'self\' https://*.twimg.com https://*.twitter.com https://static.ads-twitter.com".', 'message': 'Lua error: [string "function get_arg(arg, default)..."]:70: JavaScript error: EvalError: Refused to evaluate a string as JavaScript because \'unsafe-eval\' is not an allowed source of script in the following Content Security Policy directive: "script-src \'self\' https://*.twimg.com https://*.twitter.com https://static.ads-twitter.com".\n', 'source': '[string "function get_arg(arg, default)..."]', 'type': 'LUA_ERROR'}, 'type': 'ScriptError'}
Line 70 is this one:
It's possible to skip the error with pcall: https://www.lua.org/pil/8.4.html, but maybe there is a way to still execute js on the page?
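Besides (or before) pcall, one option is to detect this specific failure on the Scrapy side and retry the request differently (e.g. without JS evaluation). A sketch of the detection, keyed on the error payload shape shown above:

```python
def is_csp_eval_error(error_payload):
    """Detect the CSP 'unsafe-eval' failure in a Splash 400 error
    payload, so the request can be handled specially (e.g. retried
    with JS evaluation disabled)."""
    info = error_payload.get('info', {})
    return (error_payload.get('error') == 400
            and info.get('type') == 'LUA_ERROR'
            and 'unsafe-eval' in info.get('error', ''))
```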
I think these errors are not retried:
2016-04-12 21:30:49 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'error': 400, 'description': 'Error happened while executing Lua script', 'info': {'message': 'Lua error: [string "function get_arg(arg, default)..."]:52: network5', 'error': 'network5', 'type': 'LUA_ERROR', 'source': '[string "function get_arg(arg, default)..."]', 'line_number': 52}, 'type': 'ScriptError'}
2016-04-12 21:29:31 [scrapy] DEBUG: Crawled (400) <POST http://XXX:8050/execute> (referer: None)
2016-04-12 21:29:31 [scrapy] DEBUG: Ignoring response <400 XXX/>: HTTP status code is not handled or not allowed
Currently if
then depth is not increased. Depth should be increased at (4).
This happens on forums - sometimes topic list pages have pagination links to topic comments.
The first request is made only 6-7 seconds after starting the spider (that is why tests are so slow).
See also #26
Right now it is applied to documents with text content type, and to all requests.
Form action is only computed for deduplication purposes:
undercrawler/undercrawler/spiders.py, line 165 at ba2c49a
Hey, you have an interesting project:
https://github.com/TeamHG-Memex/soft404
It would be a good feature for undercrawler.
Here:
canonicalize_url is called; by default it removes the fragment. I think we should check what the fragment usually means in pagination URLs and decide whether to keep it or drop it. If that's too much to check, then I think we should keep the fragment by default.

I think it could be useful if we just want to crawl some sites without Tor and heavy JS usage, and to compare results with/without Splash. It should not be too hard to support, since we can just run all tests in two different configurations. What do you think @kmike ?
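On the fragment question: w3lib's canonicalize_url accepts keep_fragments=True. A simplified stand-in just to illustrate what is lost for "#page=2"-style pagination links:

```python
from urllib.parse import urldefrag

def canonicalize(url, keep_fragments=False):
    """Simplified stand-in for canonicalize_url's fragment handling:
    dropping the fragment by default collapses '#page=2'-style
    pagination URLs into one URL."""
    return url if keep_fragments else urldefrag(url)[0]
```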
Right now cookies are extracted from headers, and we get headers from the last request in splash. So if there is a series of redirects and some cookies are changed, but are absent from the last request, we will miss them. The test in 724cc1b468d8e8d557a1745a7a1826ed7bc209a0 fails for this reason.
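A fix could collect cookies from every response in the redirect chain rather than only the last one. A rough sketch over plain header dicts (the data shape here is assumed for illustration, not Splash's actual history format):

```python
def merge_cookies(chain_headers):
    """Collect Set-Cookie values from every response in a redirect
    chain, later responses overriding earlier ones, instead of
    reading only the final request's headers."""
    cookies = {}
    for headers in chain_headers:
        for value in headers.get('Set-Cookie', []):
            name, _, rest = value.partition('=')
            cookies[name.strip()] = rest.split(';', 1)[0]
    return cookies
```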
I ran a crawl on a site and a lot of pages were blank, in the sense that the screenshot was blank and the raw_content in the output didn't have the HTML body extracted; it just had header and script stuff in it. On comparing those to the non-blank pages, I couldn't see a difference that would explain why it happened, and there are no errors in the logs (both the Undercrawler logs and the Splash logs). Any idea why that would happen? Any pointers to what I can look into? Thanks.
I haven't started the autologin servers and got these exceptions:
py35 runtests: commands[4] | py.test --doctest-modules --cov=undercrawler undercrawler tests
============================================================= test session starts =============================================================
platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
rootdir: /Users/kmike/svn/undercrawler, inifile:
plugins: cov-2.2.1, twisted-1.5
collected 24 items
undercrawler/spiders/base_spider.py ..
tests/test_dupe_predict.py .................
tests/test_spider.py ...FF
----------------------------------------------- coverage: platform darwin, python 3.5.1-final-0 -----------------------------------------------
Name Stmts Miss Branch BrPart Cover
--------------------------------------------------------------------------------
undercrawler/__init__.py 0 0 0 0 100%
undercrawler/crazy_form_submitter.py 41 31 21 0 16%
undercrawler/directives/test_directive.py 71 60 14 1 14%
undercrawler/documents_pipeline.py 20 2 10 4 80%
undercrawler/dupe_predict.py 145 1 78 2 99%
undercrawler/items.py 18 0 2 0 100%
undercrawler/middleware/__init__.py 3 0 0 0 100%
undercrawler/middleware/autologin.py 88 46 42 7 39%
undercrawler/middleware/avoid_dup_content.py 42 20 18 4 43%
undercrawler/middleware/throttle.py 24 1 10 3 88%
undercrawler/settings.py 33 0 2 1 97%
undercrawler/spiders/__init__.py 1 0 0 0 100%
undercrawler/spiders/base_spider.py 150 25 63 11 78%
undercrawler/utils.py 39 10 22 2 64%
--------------------------------------------------------------------------------
TOTAL 675 196 282 35 67%
================================================================== FAILURES ===================================================================
__________________________________________________________ TestAutologin.test_login ___________________________________________________________
self = <tests.test_spider.TestAutologin testMethod=test_login>
@defer.inlineCallbacks
def test_login(self):
''' No logout links, just one page after login.
'''
with MockServer(Login) as s:
root_url = s.root_url
yield self.crawler.crawl(url=root_url)
spider = self.crawler.spider
> assert hasattr(spider, 'collected_items')
E AssertionError: assert hasattr(<BaseSpider 'base' at 0x10ce7ee80>, 'collected_items')
tests/test_spider.py:224: AssertionError
------------------------------------------------------------ Captured stderr call -------------------------------------------------------------
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']
INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']
INFO:scrapy.middleware:Enabled downloader middlewares:
['undercrawler.middleware.AvoidDupContentMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'undercrawler.middleware.AutologinMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'undercrawler.middleware.SplashAwareAutoThrottle',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
INFO: Enabled downloader middlewares:
['undercrawler.middleware.AvoidDupContentMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'undercrawler.middleware.AutologinMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'undercrawler.middleware.SplashAwareAutoThrottle',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
['tests.utils.CollectorPipeline']
INFO: Enabled item pipelines:
['tests.utils.CollectorPipeline']
INFO:scrapy.core.engine:Spider opened
INFO: Spider opened
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG:undercrawler.middleware.autologin:Attempting login at http://192.168.99.1:8781
DEBUG: Attempting login at http://192.168.99.1:8781
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 127.0.0.1
INFO: Starting new HTTP connection (1): 127.0.0.1
DEBUG:scrapy.downloadermiddlewares.retry:Retrying <GET http://192.168.99.1:8781> (failed 1 times): HTTPConnectionPool(host='127.0.0.1', port=8089): Max retries exceeded with url: /login-cookies (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x10ce84cc0>: Failed to establish a new connection: [Errno 61] Connection refused',))
DEBUG: Retrying <GET http://192.168.99.1:8781> (failed 1 times): HTTPConnectionPool(host='127.0.0.1', port=8089): Max retries exceeded with url: /login-cookies (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x10ce84cc0>: Failed to establish a new connection: [Errno 61] Connection refused',))
DEBUG:undercrawler.middleware.autologin:response <200 http://192.168.99.1:8781/> cookies <CookieJar[]>
DEBUG: response <200 http://192.168.99.1:8781/> cookies <CookieJar[]>
ERROR:scrapy.core.scraper:Error downloading <GET http://192.168.99.1:8781 via http://192.168.99.100:8050/execute>
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1105, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://192.168.99.100:8050/execute>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
spider=spider)
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 129, in process_response
if self.is_logout(response):
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 149, in is_logout
auth_cookies = {c['name'] for c in self.auth_cookies if c['value']}
TypeError: 'NoneType' object is not iterable
ERROR: Error downloading <GET http://192.168.99.1:8781 via http://192.168.99.100:8050/execute>
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1105, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://192.168.99.100:8050/execute>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
spider=spider)
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 129, in process_response
if self.is_logout(response):
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 149, in is_logout
auth_cookies = {c['name'] for c in self.auth_cookies if c['value']}
TypeError: 'NoneType' object is not iterable
INFO:scrapy.core.engine:Closing spider (finished)
INFO: Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/requests.exceptions.ConnectionError': 1,
'downloader/request_bytes': 27995,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 1960,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 11, 19, 32, 19, 94613),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'splash/execute/request_count': 1,
'splash/execute/response_count/200': 1,
'start_time': datetime.datetime(2016, 4, 11, 19, 32, 12, 678245)}
INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/requests.exceptions.ConnectionError': 1,
'downloader/request_bytes': 27995,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 1960,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 11, 19, 32, 19, 94613),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'splash/execute/request_count': 1,
'splash/execute/response_count/200': 1,
'start_time': datetime.datetime(2016, 4, 11, 19, 32, 12, 678245)}
INFO:scrapy.core.engine:Spider closed (finished)
INFO: Spider closed (finished)
____________________________________________________ TestAutologin.test_login_with_logout _____________________________________________________
self = <tests.test_spider.TestAutologin testMethod=test_login_with_logout>
@defer.inlineCallbacks
def test_login_with_logout(self):
''' Login with logout.
'''
with MockServer(LoginWithLogout) as s:
root_url = s.root_url
yield self.crawler.crawl(url=root_url)
spider = self.crawler.spider
> assert hasattr(spider, 'collected_items')
E AssertionError: assert hasattr(<BaseSpider 'base' at 0x10e18d668>, 'collected_items')
tests/test_spider.py:236: AssertionError
------------------------------------------------------------ Captured stderr call -------------------------------------------------------------
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']
INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']
INFO:scrapy.middleware:Enabled downloader middlewares:
['undercrawler.middleware.AvoidDupContentMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'undercrawler.middleware.AutologinMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'undercrawler.middleware.SplashAwareAutoThrottle',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
INFO: Enabled downloader middlewares:
['undercrawler.middleware.AvoidDupContentMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'undercrawler.middleware.AutologinMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'undercrawler.middleware.SplashAwareAutoThrottle',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
['tests.utils.CollectorPipeline']
INFO: Enabled item pipelines:
['tests.utils.CollectorPipeline']
INFO:scrapy.core.engine:Spider opened
INFO: Spider opened
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG:undercrawler.middleware.autologin:Attempting login at http://192.168.99.1:8781
DEBUG: Attempting login at http://192.168.99.1:8781
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 127.0.0.1
INFO: Starting new HTTP connection (1): 127.0.0.1
DEBUG:scrapy.downloadermiddlewares.retry:Retrying <GET http://192.168.99.1:8781> (failed 1 times): HTTPConnectionPool(host='127.0.0.1', port=8089): Max retries exceeded with url: /login-cookies (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x10cd67080>: Failed to establish a new connection: [Errno 61] Connection refused',))
DEBUG: Retrying <GET http://192.168.99.1:8781> (failed 1 times): HTTPConnectionPool(host='127.0.0.1', port=8089): Max retries exceeded with url: /login-cookies (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x10cd67080>: Failed to establish a new connection: [Errno 61] Connection refused',))
DEBUG:undercrawler.middleware.autologin:response <200 http://192.168.99.1:8781/> cookies <CookieJar[]>
DEBUG: response <200 http://192.168.99.1:8781/> cookies <CookieJar[]>
ERROR:scrapy.core.scraper:Error downloading <GET http://192.168.99.1:8781 via http://192.168.99.100:8050/execute>
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1105, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://192.168.99.100:8050/execute>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
spider=spider)
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 129, in process_response
if self.is_logout(response):
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 149, in is_logout
auth_cookies = {c['name'] for c in self.auth_cookies if c['value']}
TypeError: 'NoneType' object is not iterable
ERROR: Error downloading <GET http://192.168.99.1:8781 via http://192.168.99.100:8050/execute>
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1105, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://192.168.99.100:8050/execute>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/kmike/svn/undercrawler/.tox/py35/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
spider=spider)
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 129, in process_response
if self.is_logout(response):
File "/Users/kmike/svn/undercrawler/undercrawler/middleware/autologin.py", line 149, in is_logout
auth_cookies = {c['name'] for c in self.auth_cookies if c['value']}
TypeError: 'NoneType' object is not iterable
INFO:scrapy.core.engine:Closing spider (finished)
INFO: Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/requests.exceptions.ConnectionError': 1,
'downloader/request_bytes': 27995,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 1960,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 11, 19, 32, 28, 109595),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'splash/execute/request_count': 1,
'splash/execute/response_count/200': 1,
'start_time': datetime.datetime(2016, 4, 11, 19, 32, 21, 698408)}
INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/requests.exceptions.ConnectionError': 1,
'downloader/request_bytes': 27995,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 1960,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 11, 19, 32, 28, 109595),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'splash/execute/request_count': 1,
'splash/execute/response_count/200': 1,
'start_time': datetime.datetime(2016, 4, 11, 19, 32, 21, 698408)}
INFO:scrapy.core.engine:Spider closed (finished)
INFO: Spider closed (finished)
Also, all logging messages are duplicated for some reason.
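The `TypeError: 'NoneType' object is not iterable` comes from `self.auth_cookies` being `None` when `is_logout` runs (autologin never populated it). A minimal defensive sketch of the guard, written as a standalone function; the `{'name': ..., 'value': ...}` cookie-dict shape is taken from the traceback, the rest is an assumption:

```python
def is_logout(auth_cookies, response_cookies):
    """Return True if the response cleared one of the auth cookies.

    Standalone sketch: `auth_cookies` is the list of {'name': ..., 'value': ...}
    dicts the autologin middleware stores, and may be None if autologin
    never ran -- the case that raised the TypeError above.
    """
    if not auth_cookies:  # guard against None: nothing to compare yet
        return False
    auth_names = {c['name'] for c in auth_cookies if c['value']}
    # A logout happened if a previously non-empty auth cookie is now empty.
    return any(name in auth_names and not value
               for name, value in response_cookies.items())
```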
I am not sure if I'm using the project correctly, but where exactly is the exported content stored?
I can see the images being saved in the Files folder and the screenshots in the Screenshots folder.
But what about the actual HTML and other web files?
The scripts/crawl_stats.py script asks for a crawler_out file, which I can't really identify.
The log currently shows entries like:
2017-10-22 00:10:38 [undercrawler] DEBUG: Saved <200 ...> screenshot to /Screenshots/xxx.png
2017-10-22 00:10:39 [scrapy.core.scraper] DEBUG: Scraped from <200 http://xxx/>
<CDRItem: _id: ......... timestamp_crawl: '2017-10-21T18:40:39.560401Z'>
I use the default settings.py with the following extra settings:
IMAGES_ENABLED = 1
FILES_STORE = '../Files'
FOLLOW_LINKS = 0
SCREENSHOT = True
SCREENSHOT_HEIGHT = 0
SCREENSHOT_DEST = '../Screenshots'
PREFER_PAGINATION = 1
MAX_DOMAIN_SEARCH_FORMS = 10
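For what it's worth, since the spider yields CDRItem objects through the normal Scrapy item pipeline, one way to get readable output (this is standard Scrapy feed export, nothing Undercrawler-specific) is to add `-o items.jl` to the crawl command and read the JSON-lines file back. A small sketch of reading such an export:

```python
import json

def load_cdr_items(path):
    """Read items from a Scrapy JSON-lines feed export (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Usage would be something like `scrapy crawl undercrawler -a url=... -o items.jl`, then `load_cdr_items('items.jl')`.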
When iframe is in a page it makes sense to get its content even if it is not in allowed domain. Maybe we shouldn't follow links in this case though.
I am trying to save files to S3, but could not get it up and running.
"s3://bucket/prefix/" needs to be defined, but where?
The UI has Scrapy settings and Spider Args fields, but nothing seems to work.
I need to know where the AWS key and AWS secret need to be defined, and how to set the file storage to "s3://bucket/prefix/".
A little help would be appreciated.
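The settings involved are standard Scrapy ones rather than anything Undercrawler-specific, so a settings.py sketch (all values are placeholders) would look like:

```python
# settings.py sketch -- standard Scrapy settings; values are placeholders
AWS_ACCESS_KEY_ID = 'your-aws-key'
AWS_SECRET_ACCESS_KEY = 'your-aws-secret'
FILES_STORE = 's3://bucket/prefix/'  # FilesPipeline writes under this prefix
```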
I'm not sure why. It fails with and without Splash in the same way:
$ py.test tests/test_spider.py::test_documents -s
...
> assert len(spider.collected_items) == 4
E assert 5 == 4
tests/test_spider.py:119: AssertionError
Hi,
I have to say, amazing tool.
I am struggling to understand how to store the results in a JSON file for each start URL.
Currently I am getting binary files for each URL within the domain, and I have difficulty retrieving the information I am seeking (domain URL, sub-URL, status code, HTML content or plain text).
I am running the following command:
scrapy crawl undercrawler -a url=https://www.bvrgroep.nl -s CLOSESPIDER_PAGECOUNT=5
with
FILES_STORE = "\output_data"
This creates several files without extensions at that path.
So it is hard for me to get my head around UndercrawlerMediaPipeline and how I can adjust it to store files in a readable format.
Also, I cannot find IMAGES_ENABLED in the settings file, which I would use to stop downloading images.
PS: I have not activated Splash, as I do not have access to Docker on my local laptop.
Could you please shed some light on this?
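The extension-less files are expected behavior: Scrapy's FilesPipeline (which UndercrawlerMediaPipeline builds on) names stored files after the SHA1 hash of the request URL, adding a file extension only when the URL path has one. A rough sketch of that default naming scheme:

```python
import hashlib
import os
from urllib.parse import urlparse

def default_file_path(url):
    """Approximate Scrapy FilesPipeline's default naming:
    full/<sha1 of the request URL><extension from the URL path, if any>.
    Sketch for illustration; the real pipeline has extra media-type logic."""
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    ext = os.path.splitext(urlparse(url).path)[1]
    return 'full/{}{}'.format(digest, ext)
```

So a URL without an extension in its path yields a bare 40-character hash name, which matches what you are seeing.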
I noticed that the memory usage by the scrapy process gradually keeps going up when I use Undercrawler. And then I used the prefs() function (mentioned in the scrapy docs) to monitor the live-references. I noticed that the SplashRequest references never go down and the oldest in there is from the time when the crawl started. Depending on the site, the references can climb up very fast. Here are some numbers:
(1) Crawling cnn, after 5 minutes
Crawled 65 pages
prefs() results -
BaseSpider 1 oldest: 313s ago
SplashRequest 2120 oldest: 300s ago
(2) Crawling reddit, after 5 minutes
Crawled 156 pages
prefs() results -
BaseSpider 1 oldest: 308s ago
SplashRequest 16100 oldest: 302s ago
So, while the SplashRequest objects seem to be a problem for both crawls, for Reddit there are far too many of them after just 5 minutes, for some reason.
I would expect each SplashRequest to be released after the request is made. Can someone explain this behavior? And what can be done about it?
Thanks.
When the page is not an HTML page but binary content (we cannot know for sure when extracting links), the Lua script times out (even without HH enabled).
Not only do we fail to download such pages, this also slows down the whole crawl a lot.
Hi, I am having issues trying to crawl couple of sites. I read about the splash.plugins_enabled setting, and I would like to turn it on to see if that fixes the crawling issue I am running into. How would I turn it on in Undercrawler? Thanks.
I think it would be better to use a different download_slot for files, so that their download latency is calculated separately.
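For reference, Scrapy lets a request override its slot via request.meta['download_slot']. A sketch of routing file downloads into a dedicated slot (the slot name 'files' is made up for illustration):

```python
def with_files_slot(meta=None):
    """Return request meta that routes the download into a dedicated 'files'
    slot, so file latency is tracked apart from page latency. Sketch using
    Scrapy's meta['download_slot'] override; the slot name is arbitrary."""
    meta = dict(meta or {})
    meta['download_slot'] = 'files'
    return meta
```

In a media pipeline this would be used roughly as `Request(url, meta=with_files_slot())`.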
It makes sense to assign more weight to recent samples in DupePredictor; together with #41 it should allow handling the case where the crawler first visits a large part A of a website and learns a pattern, then goes to another part B of the website where this pattern is no longer valid.
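One simple way to weight recent samples, sketched as an exponentially weighted moving average over the duplicate-probability estimate (the alpha parameter is a made-up illustration, not something DupePredictor has today):

```python
def ewma(samples, alpha=0.1):
    """Exponentially weighted moving average: each new sample gets weight
    alpha, so older observations decay geometrically (sketch)."""
    avg = None
    for x in samples:
        avg = x if avg is None else (1 - alpha) * avg + alpha * x
    return avg
```

With this weighting, a pattern that stops holding in part B of a site would see its estimated duplicate probability decay instead of being pinned by the bulk of old part-A samples.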
Hi,
how can I get both the cookies and the html through def parse(self, response)?
I can only get one field of the render, e.g. return render['html'], return render['har'], or return render['cookiejar'].
In the official H-H.lua, Splash can return render, and render has all the fields I want, not just one of them.
In def parse(self, response), I don't know how to use this object:
```
def parse(self, response):
    print(type(response))
    print(dir(response))
```
I tried dir(response), which gives ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', ....]; it seems there is no method I need.
thx.
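With scrapy-splash, if the Lua script returns a table instead of a single value (e.g. return {html=splash:html(), har=splash:har(), cookies=splash:get_cookies()}), the whole table is available in Python as response.data. A sketch of pulling out every wanted field from such a result (the field names are assumptions; they match whatever keys your script puts in the returned table):

```python
def extract_render_fields(render, fields=('html', 'har', 'cookies')):
    """Collect several fields from a Splash render result returned as a Lua
    table; with scrapy-splash this dict arrives as response.data (sketch).
    Missing fields come back as None."""
    return {name: render.get(name) for name in fields}
```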
I set splash.args.debug to true (and I checked that it is true by printing its value from the .lua script), but I am unable to see the debugLogs from headless_horseman.js in the Splash logs. Where would they appear? Or am I missing something? Thanks.