Comments (9)
Thanks. I checked in one of my tests that the oldest URL among the SplashRequest references had already been crawled, but I will check for that again in a few more tests. I will also try muppy soon, and will post my updates here once I do.
from undercrawler.
@nehakansal are these references from requests that have already been made? I'm asking because there are two possible explanations for what you are seeing:
- this is a bug: a reference to a request is kept when it shouldn't be. In this case, the best way to debug is IMO to look at what is referencing these objects (I usually use https://pythonhosted.org/Pympler/muppy.html for that)
- there is no bug: these are requests that were extracted from the first pages crawled, and they are sitting in the queue waiting to be crawled. In this case, one solution is to use a disk-based queue: https://doc.scrapy.org/en/latest/topics/jobs.html?highlight=JOBDIR#how-to-use-it - IIRC it works fine with undercrawler, and we used it. When doing a breadth-first crawl, the number of links in the queue can be really large, so a disk-based queue is a must to keep memory usage reasonable.
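For readers unfamiliar with the second option: Scrapy enables its disk-based queues when you pass a `JOBDIR` setting (e.g. `scrapy crawl <spider> -s JOBDIR=crawls/run-1`). The reason this bounds memory can be sketched with the stdlib alone; this is an assumed simplification of the idea, not Scrapy's actual queue code:

```python
import os
import pickle
import tempfile

class DiskQueue:
    """Minimal stdlib sketch of the idea behind Scrapy's JOBDIR disk queues
    (an illustration, not Scrapy's implementation): pending requests are
    pickled to a file, so the in-memory footprint stays roughly constant no
    matter how many requests are waiting."""

    def __init__(self, path):
        self.file = open(path, "a+b")   # append-only writes, seekable reads
        self.read_pos = 0               # byte offset of the next item to pop

    def push(self, obj):
        self.file.seek(0, os.SEEK_END)
        pickle.dump(obj, self.file)
        self.file.flush()

    def pop(self):
        self.file.seek(self.read_pos)
        try:
            obj = pickle.load(self.file)
        except EOFError:
            return None                 # queue is empty
        self.read_pos = self.file.tell()
        return obj

# Usage: FIFO order is preserved, but the queued items live on disk.
path = os.path.join(tempfile.mkdtemp(), "requests.queue")
q = DiskQueue(path)
q.push({"url": "https://www.jair.org/index.php/jair"})
q.push({"url": "https://www.jair.org/index.php/jair/article/view/11210/26421"})
```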
Thanks. I will try using muppy to get more information. A couple of follow-up questions though:
- Given that it's breadth-first, the time on the oldest request in prefs() should at some point change to a time more recent than the time the crawl started, correct?
- I am using Aquarium with Undercrawler; in that case, isn't the request queue maintained by HAProxy rather than by Scrapy? Or do both HAProxy and Scrapy maintain a queue?
- Yes, that's correct. The queue size is much larger for breadth-first crawls than for depth-first ones, because the number of links from, say, level 2 to level 3 can be huge.
- They both maintain a queue: HAProxy maintains just a small queue of recent requests, while the Scrapy spider creates requests for all links as they are extracted, and keeps them until their turn to be crawled comes.
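To put a rough number on why the breadth-first queue gets so large (the branching factor here is an assumed illustrative figure, not a measurement): every extracted link of the current level is queued before the crawl descends, so the frontier grows geometrically with depth.

```python
# Back-of-the-envelope: with an assumed average of 50 links per page, a
# breadth-first frontier at depth d can hold up to 50**d pending requests.
branching = 50  # hypothetical average links per page
frontier = [branching ** depth for depth in range(4)]
# depth 0..3 -> 1, 50, 2500, 125000 queued requests in the worst case
```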
- Okay. And if a SplashRequest has completed, the reference to that should ideally go away even if the links on that page are still being processed, is that right? If yes, and if the oldest time stayed the same even after, let's say, an hour into the crawl, would that indicate that it's more likely a bug than just the nature of breadth-first?
- Got it, thanks.
And if a SplashRequest has completed, the reference to that should ideally go away even if the links on that page are still being processed, is that right?
Yes, that's correct.
If yes, and if the oldest time stayed the same even after, let's say, an hour into the crawl, would that indicate that it's more likely a bug than just the nature of breadth-first?
I think it's hard to tell whether an hour is enough, as it depends on the site. The only reliable way is to check whether there are any alive requests that have already been crawled.
Hi, here is some data from one of my runs where it seemed like something was wrong. I didn't see this behavior in a couple of other runs, so it's probably not consistent, but I don't have enough sample runs to be completely sure. I thought I would at least post what I have and see if you have any thoughts on it. I crawled the site 'jair.org', and one of the URLs that was crawled successfully very early on stayed the oldest 'SplashRequest' object until I aborted the crawl, 3 minutes after the URL was successfully crawled. I couldn't get muppy to run for some reason; I got the data below using Scrapy's trackref tool and objgraph. Below are the meta and headers dicts for the URL that stayed the oldest, 'https://www.jair.org/index.php/jair/article/view/11210/26421'. Any idea why this SplashRequest object still had references even after being crawled and scraped successfully?
Here's how I got the oldest SplashRequest object:
from scrapy.utils.trackref import get_oldest
get_oldest('SplashRequest')
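For context on what `get_oldest` is measuring, here is a stdlib-only sketch of the idea behind scrapy.utils.trackref (an assumed simplification, not Scrapy's actual code): each tracked instance is recorded in a weak-keyed dict with its creation order, so `get_oldest` returns the longest-lived instance that is still strongly referenced somewhere, and instances with no remaining references drop out automatically.

```python
import itertools
from weakref import WeakKeyDictionary

_counter = itertools.count()  # stand-in for a creation timestamp
live_refs = {}                # class name -> WeakKeyDictionary(instance -> creation order)

class TrackedRequest:
    """Hypothetical stand-in for a trackref-tracked SplashRequest."""
    def __init__(self, url):
        self.url = url
        refs = live_refs.setdefault("TrackedRequest", WeakKeyDictionary())
        refs[self] = next(_counter)

def get_oldest(classname):
    # Only objects that are still alive (strongly referenced) appear here.
    refs = live_refs.get(classname, {})
    return min(refs, key=refs.get, default=None)
```

This is why a crawled-and-released request should stop being "the oldest": once nothing references it, the weak dict forgets it.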
Meta
extracted_at: https://www.jair.org/index.php/jair
download_latency: 1.491
download_slot: www.jair.org
download_timeout: 267.0
avoid_dup_content: True
_splash_processed: True
depth: 1
splash: {'http_status_from_error_code': True, 'session_id': 'default', 'cache_args': ['lua_source', 'js_source'], '_local_arg_fingerprints': {'lua_source': 'LOCAL+592eab4546bc5ab9e1c078a4a25648b2312d4211', 'js_source': 'LOCAL+e67f9577762ee32610fa2ad1bb02ec896b8791e7'}, 'endpoint': 'execute', 'args': {'images_enabled': False, 'headers': {'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/43.0.2357.130 Chrome/43.0.2357.130 Safari/537.36', 'Referer': 'https://www.jair.org/index.php/jair', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en'}, 'load_args': {'lua_source': 'e6f1a2e0a9bcc4103c6a23139165b6b156d1716e', 'js_source': 'e01e258671da83b6eb7de5306c2b0c96767e36ef'}, 'url': 'https://www.jair.org/index.php/jair/article/view/11210/26421', 'run_hh': True, 'screenshot_width': 400, 'return_png': True, 'cookies': [{'secure': False, 'domain': '.www.jair.org', 'name': 'OJSSID', 'value': 'kgeg04h2mmkia2liinmab12al2', 'httpOnly': True, 'path': '/'}], 'screenshot_height': 400, 'timeout': 262}, 'slot_policy': 'per_domain', 'magic_response': True}
from_search: None
ajax_crawlable: True
Headers
b'User-Agent': [b'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/43.0.2357.130 Chrome/43.0.2357.130 Safari/537.36']
b'Accept-Language': [b'en']
b'Accept-Encoding': [b'gzip,deflate']
b'Content-Type': [b'application/json']
b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8']
The meta and headers were the same every time. I have attached the objgraphs of the oldest request: one from the first time it was the oldest URL, and one from right before I aborted the crawl. You will notice that the URL in the graph is, for some reason, the referrer URL instead of the actual URL; I don't know why, but I did double-check that the object I used to print the metadata is the same object I passed to objgraph.
@nehakansal I'm not 100% sure, but I would check how long the parse method for that (or some other) request takes to execute, in wall time. Even after a request has been crawled, we'll still be extracting links from it and scheduling them to be crawled, which might take a while depending on concurrency settings, Splash capacity, etc. - and during that time the request will still be around, attached to the response.
I can also suggest a different debugging strategy: instead of tracking individual requests, I'd enable disk queues, start a longer crawl, and see whether memory is leaking or bounded. If it's leaking, enter the console after it has leaked enough and see why objects are being kept alive. That way you'll be sure it is a real leak.
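One stdlib way to "see why objects are kept alive" is `gc.get_referrers`, which returns every container still holding a strong reference to an object (muppy and objgraph give richer views of the same information). The `Request` class and `pending` list below are hypothetical stand-ins for a SplashRequest and a scheduler queue:

```python
import gc

class Request:
    """Hypothetical stand-in for a SplashRequest."""
    def __init__(self, url):
        self.url = url

pending = []  # e.g. a queue that is (perhaps unexpectedly) keeping requests alive
req = Request("https://www.jair.org/index.php/jair")
pending.append(req)

# Collect the containers still referencing `req`; for a leaked request,
# the unexpected owner shows up in a list like this.
holders = [r for r in gc.get_referrers(req) if isinstance(r, list)]
```

In a real session you would run this from the Scrapy telnet console against the object returned by `get_oldest`, then inspect each holder to see which component owns it.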
Thanks, that info helps.
As for the other suggestion - that's actually exactly what I started doing yesterday: I enabled the disk queue. I need to run more tests to be sure the disk queue is working well for me, or that, even if there are still some memory inconsistencies, the queue helps enough to get me going for now.