Comments (8)
Yes. I've been thinking about that. Of course the problem is that we cannot use the IP to distribute jobs, as it changes with time and server.
One solution I'm considering is to add a dynamic IP-based politeness boost for sites with more than one URL. Beyond a certain number (say, 3) of URLs per IP, it is safe to assume that the same is happening in all agents, that is, other agents also have several URLs with the same IP. Thus, one might want to increase the IP delay by multiplying it by the number of agents, so that the delay enforced globally is similar to the one specified locally. Once again, this should happen only if we can reasonably assume that the same IP appears elsewhere.
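A minimal sketch of that boost, just to make the idea concrete. The function name, the parameter names and the `>=` comparison against the threshold are assumptions made for illustration; this is not BUbiNG code.

```python
def effective_ip_delay(base_ip_delay, urls_on_ip, num_agents, threshold=3):
    """Return the delay (in seconds) a single agent waits between two
    requests to the same IP.

    Assumption: once an agent sees at least `threshold` URLs on one IP,
    it presumes the other agents see that IP too, and multiplies the
    local delay by the number of agents.
    """
    if urls_on_ip >= threshold:
        return base_ip_delay * num_agents
    return base_ip_delay


# With 30 agents and a 0.5 s base IP delay (illustrative values):
print(effective_ip_delay(0.5, urls_on_ip=1, num_agents=30))  # below threshold
print(effective_ip_delay(0.5, urls_on_ip=5, num_agents=30))  # boosted
```

With the boost applied, each agent hits the IP once every `base * num_agents` seconds, so the aggregate rate over all agents stays close to one request per `base` seconds.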
from bubing.
Good idea, but it would depend on the number of crawling nodes. With 30 nodes, you would have to have more than 60 sites on one IP to detect that.
from bubing.
"Of course the problem is that we cannot use the IP to distribute jobs, as it changes with time and server.": maybe it's still possible to do it.
- The sieve (and the queues before the workbench) could still be distributed per host; this way the memory of visited hosts would be consistently managed by one node
- When a bubingJob with a host without IP arrives, we do as usual: create a visitState in newVisitStates, and fill it with URLs as long as the host is not resolved.
- As soon as the node responsible for the host has resolved the host's IP, a new bubingJob is created, containing the IP address
- A bubingJob containing the IP would be hashed based on the IP, then distributed to another node
- This node would receive it and directly pass it to the workbench
What did I miss?
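The steps above could be sketched roughly as follows. Everything here is hypothetical: `node_for`, `route`, the use of SHA-256 and the example hosts and IP are stand-ins chosen for illustration, not the way BUbiNG actually assigns jobs.

```python
import hashlib


def node_for(key, num_nodes):
    """Map a string key to a node index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def route(host, resolved_ip, num_nodes):
    """Two-stage routing: by host while the IP is unknown, by IP afterwards."""
    if resolved_ip is None:
        # Stage 1: the node responsible for the host keeps the job
        # (sieve, visit state) until DNS resolution completes.
        return node_for(host, num_nodes)
    # Stage 2: a new job carrying the IP is hashed on the IP, so every
    # host sharing that IP ends up on the same node's workbench.
    return node_for(resolved_ip, num_nodes)


# Two hypothetical virtual hosts on the same (documentation-range) IP:
print(route("a.example.com", None, 30), route("b.example.com", None, 30))
print(route("a.example.com", "192.0.2.1", 30), route("b.example.com", "192.0.2.1", 30))
```

The second line always prints the same node twice, since both hosts share the IP; the first line may or may not.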
from bubing.
The problem is that the IP associated with a host can change over time. You would have inconsistent queues all over the place.
In any case, after some probabilistic analysis (involving very complicated formulae and help from a friend), we worked out that, given a square power-law distribution for the number of hosts associated with an IP, for k=2 you can predict that about half of the agents will have the same IP, for k=3 3/4, and for k=4 4/5. The model is very rough, but probably reasonable enough to be used. In any case, we will do more testing with other distributions. If you have any source for the distribution of the number of hosts associated with an IP address, that'd be great...
from bubing.
I've hit another problem (quite the opposite of enforcing a per-IP politeness setting). Some Workbench entries have thousands of visitStates (the biggest seen so far had 12678 hosts).
I think we can assume these IPs are load balancers. What would be the right thing to do in this case, in your opinion? Maybe, instead of having a per-IP delay, the per-host delay could be increased by a factor based on the logarithm of the number of visitStates?
from bubing.
So you mean that you would like to go to those sites faster?
from bubing.
Sort of, yes. The problem if we enforce a per-IP politeness is that a workbench entry with 10000 sites (I just found one with more than 200000 sites) will take forever to process. Just processing one URL for each of the 10000 visit states with an ipDelay of 1s would take almost 3 hours. On the other hand, per-host politeness alone is obviously not enough for regular web servers with many small virtual hosts.
A simple variant, almost equivalent to the current per-IP delay, would be to multiply the per-host delay by the number of hosts on this IP. Doing so would equally be a problem, though, for IPs with a very large number of hosts. So as an alternative, we could compute the per-host delay as (perHostDelayBase * (sqrt(nbHostInEntry))) or (perHostDelayBase * (1+log(nbHostInEntry)))
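To compare the three variants, here is a small sketch. The function and the mode names are made up for illustration, and the natural logarithm is an assumption, since no base is specified above.

```python
import math


def per_host_delay(per_host_delay_base, nb_host_in_entry, mode="sqrt"):
    """Scale the per-host delay by the number of hosts sharing the IP.

    mode="linear": base * n            (close to a plain per-IP delay)
    mode="sqrt":   base * sqrt(n)
    mode="log":    base * (1 + ln(n))  (natural log is an assumption)
    """
    if nb_host_in_entry < 1:
        raise ValueError("an entry holds at least one host")
    if mode == "linear":
        return per_host_delay_base * nb_host_in_entry
    if mode == "sqrt":
        return per_host_delay_base * math.sqrt(nb_host_in_entry)
    if mode == "log":
        return per_host_delay_base * (1 + math.log(nb_host_in_entry))
    raise ValueError("unknown mode: " + mode)


for n in (1, 100, 10000):
    print(n, [round(per_host_delay(10, n, m), 1) for m in ("linear", "sqrt", "log")])
```

All three variants give back the base delay for a single host; they differ only in how fast the delay grows with the number of hosts in the entry.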
Of course we could also cap the number of hosts per IP. FYI, here is the biggest one I found so far:
Entry with 235321 visitStates; first hosts:
- 8t9mf.id3x.com
- r4ctd.id3x.com
- hqr7t.id3x.com
- ffdb7.id3x.com
- n2zqk.id3x.com
- hrdrn.id3x.com
- gx84t.id3x.com
- fq4qq.id3x.com
- bc6nm.id3x.com
- y8hmj.id3x.com
This kind of thing is not uncommon; I have found the same pattern for tumblr.com, codeplex.com, radio.fr and aircostcontrol.com.
With the first formula I proposed above (the square root), a site with 250000 hosts per IP would have a per-host delay 500 times longer than the base. Suppose your base delay is 10s: the per-host delay becomes 5000s, so the actual delay between two queries on this IP would be 5000/250000 = 0.02s
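The arithmetic can be checked with a few lines. This is only a back-of-the-envelope sketch; the function name is hypothetical, and it assumes the hosts of the entry are visited round-robin, so the mean gap between two requests to the IP is the per-host delay divided by the number of hosts.

```python
import math


def mean_ip_interval(per_host_delay_base, nb_hosts):
    """Mean time between two requests to the same IP under the sqrt rule."""
    per_host_delay = per_host_delay_base * math.sqrt(nb_hosts)
    return per_host_delay / nb_hosts


# 250000 hosts on one IP, 10 s base per-host delay:
print(math.sqrt(250000))              # scaling factor: 500.0
print(mean_ip_interval(10, 250000))   # about 0.02 s between requests to the IP

# For comparison, a strict 1 s per-IP delay on an entry with 10000 hosts:
print(10000 * 1 / 3600)               # about 2.78 hours for one URL per host
```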
An interesting reference can also be found here: https://stackoverflow.com/questions/8236046/typical-politeness-factor-for-a-web-crawler
They have an absolute minimum delay (but it seems to be per host), and they compute the actual delay based on the previous response time and an estimate of the site "size".
from bubing.
I have just committed a tentative implementation of cross-agent IP politeness I devised with Paolo. The idea is explained in the documentation of the parameter ipDelayFactor of StartupConfiguration.
from bubing.