I work in a distributed setup (usually 8-16 machines). It seems that the per-IP polite

Hosts with same IP address are not processed by the same node, so IP delay cannot be enforced

law-unimi commented on September 20, 2024

Hosts with same IP address are not processed by the same node, so IP delay cannot be enforced

from bubing.

Comments (8)

vigna commented on September 20, 2024

Yes. I've been thinking about that. Of course the problem is that we cannot use the IP to distribute jobs, as it changes with time and server.

One solution I'm considering is to add a dynamic IP-based politeness boost for sites with more than one URL. It is safe to assume that, beyond a certain number (say, 3) of URLs per IP, the same is happening in all agents; that is, other agents also have several URLs with the same IP. Thus, one might want to increase the IP delay, multiplying it by the number of agents, so that the globally enforced delay is similar to the locally specified one. Once again, this should happen only if we can reasonably assume that the same IP appears elsewhere.
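
A minimal sketch of this boost, for illustration only. The function name, its parameters, and the choice to treat "beyond 3" as "3 or more" are assumptions, not BUbiNG's actual API:

```python
def effective_ip_delay(base_ip_delay, local_count_on_ip, num_agents, threshold=3):
    """Return the IP delay an agent should apply locally.

    base_ip_delay      -- the IP delay specified in the configuration (seconds)
    local_count_on_ip  -- how many entries this agent sees for the IP
    num_agents         -- number of agents in the crawl
    threshold          -- hypothetical cut-off above which we assume the IP
                          is also present on the other agents
    """
    if local_count_on_ip >= threshold:
        # Assume every agent is hitting this IP: slow down by the number
        # of agents so the global rate matches the configured one.
        return base_ip_delay * num_agents
    # Otherwise assume the IP is only seen locally and keep the base delay.
    return base_ip_delay
```

With 8 agents and a base delay of 2 s, an IP seen once locally keeps the 2 s delay, while an IP seen three times locally is slowed down to 16 s.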

guillaumepitel commented on September 20, 2024

Good idea, but it would depend on the number of crawling nodes. With 30 nodes, you would have to have more than 60 sites on one IP to detect that.

guillaumepitel commented on September 20, 2024

"Of course the problem is that we cannot use the IP to distribute jobs, as it changes with time and server." : maybe it's still possible to do it.

1. The sieve (and the queues before the workbench) could still be distributed per host; this way the memory of visited hosts would be consistently managed by one node.
2. When a bubingJob with a host without IP arrives, we do as usual: create a visitstate in newVisitStates, and fill it with URLs as long as the host is not resolved.
3. As soon as the node responsible for the host has resolved the host's IP, a new bubingJob is created, containing the IP address.
4. A bubingJob containing the IP would be hashed based on the IP, then distributed to another node.
5. This node would receive it and directly pass it to the workbench.
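
The two-stage routing described in these steps can be sketched as follows. This is an illustration under assumptions: jobs are modelled as plain dictionaries, and `node_for` and `route` are hypothetical helpers using a simple CRC32 hash, not BUbiNG's actual job-distribution code:

```python
import zlib


def node_for(key: str, num_nodes: int) -> int:
    """Map a string key to a node index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def route(job: dict, num_nodes: int) -> int:
    """Pick the node that should handle a job.

    A job without a resolved IP is routed by host (sieve and pre-workbench
    queues); once the IP is known, the job is re-routed by IP so that all
    hosts sharing an IP end up in the same workbench.
    """
    ip = job.get("ip")
    if ip is None:
        return node_for(job["host"], num_nodes)
    return node_for(ip, num_nodes)
```

Two jobs for different hosts that resolve to the same IP land on the same node, which is what would make a per-IP delay enforceable there. Note that if the resolved IP of a host changes, the same host would start hashing to a different node.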

What did I miss?

vigna commented on September 20, 2024

The problem is that the IP associated with a host can change in time. You would have inconsistent queues all over the place.

In any case, after some probabilistic analysis (involving very complicated formulae and help from a friend), we worked out that, given a square power-law distribution for the number of hosts associated with an IP, for k=2 you can predict that about half of the agents will have the same IP, for k=3 3/4, and for k=4 4/5. The model is very rough, but probably reasonable enough to be used. In any case, we will do more testing with other distributions. If you have any source for the distribution of the number of hosts associated with an IP address, that'd be great...
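
As a rough sanity check on these figures (this is not the derivation mentioned above, whose formulae are not given in this thread), one can work through a simplified model. All assumptions here are illustrative: hosts of an IP are assigned to agents uniformly and independently, so the local count on one agent is approximately Poisson with mean $x$ (the number of hosts on the IP divided by the number of agents); the prior on $x$ is an inverse-square power law; and the agent observes $k \ge 2$ local hosts for the IP.

```latex
% Posterior on the per-agent load x after observing k local hosts:
p(x \mid k) \;\propto\; x^{-2} \cdot \frac{x^{k} e^{-x}}{k!}
            \;\propto\; x^{k-2} e^{-x}
\quad\Longrightarrow\quad x \mid k \sim \mathrm{Gamma}(k-1,\, 1)

% Probability that another agent holds none of the hosts of this IP:
\mathbb{E}\!\left[e^{-x} \mid k\right]
  = \frac{1}{\Gamma(k-1)} \int_0^\infty x^{k-2} e^{-2x}\, dx
  = 2^{-(k-1)}

% Expected fraction of the other agents that also see the IP:
f(k) = 1 - 2^{-(k-1)}, \qquad f(2) = \tfrac12,\quad f(3) = \tfrac34,\quad f(4) = \tfrac78
```

Under these assumptions the values for k=2 and k=3 agree with the figures quoted above, while k=4 gives 7/8 rather than 4/5, so the model actually used presumably differs in its details.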

guillaumepitel commented on September 20, 2024

I've hit another problem (quite the opposite of enforcing a per-IP politeness setting). Some Workbench entries have thousands of visitStates (the biggest seen so far had 12678 hosts).

I think we can assume these IPs are load balancers. What would be the right thing to do in this case, in your opinion? Maybe the per-host delay could be scaled by the logarithm of the number of visitStates, instead of having a per-IP delay?

vigna commented on September 20, 2024

So you mean that you would like to crawl those sites faster?

guillaumepitel commented on September 20, 2024

Sort of, yes. The problem if we enforce a per-IP politeness is that a workbench entry with 10000 sites (I just found one with more than 200000 sites) will take forever to process. Just processing one URL for each of the visit states with an ipDelay of 1s would take almost 3 hours. On the other hand, per-host politeness alone is obviously not enough for regular web servers with many small virtual hosts.

A simple variant, almost equivalent to the current per-IP delay, would be to multiply the per-host delay by the number of hosts on this IP. Doing so would equally be a problem, though, for IPs with a very large number of hosts. So as an alternative, we could compute the per-host delay as (perHostDelayBase * sqrt(nbHostInEntry)) or (perHostDelayBase * (1 + log(nbHostInEntry))).
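
The three scaling rules can be compared with a short sketch. The function and variant names are hypothetical, and the natural logarithm is assumed since the base of the log is not specified:

```python
import math


def per_host_delay(base, hosts_in_entry, variant="sqrt"):
    """Scale the per-host delay by the number of hosts sharing the IP.

    base            -- perHostDelayBase, in seconds
    hosts_in_entry  -- nbHostInEntry, hosts in the workbench entry (>= 1)
    variant         -- "linear", "sqrt" or "log" (names chosen for this sketch)
    """
    if hosts_in_entry < 1:
        raise ValueError("hosts_in_entry must be at least 1")
    if variant == "linear":
        factor = hosts_in_entry                 # close to a strict per-IP delay
    elif variant == "sqrt":
        factor = math.sqrt(hosts_in_entry)
    elif variant == "log":
        factor = 1 + math.log(hosts_in_entry)   # natural log assumed
    else:
        raise ValueError(f"unknown variant: {variant}")
    return base * factor
```

All three variants reduce to the base delay for an entry with a single host; they differ only in how quickly the delay grows with the number of hosts.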

Of course we could also cap the number of hosts per IP. FYI, here is the biggest one I found so far:

Entry with 235321 visitStates; first hosts:

8t9mf.id3x.com
r4ctd.id3x.com
hqr7t.id3x.com
ffdb7.id3x.com
n2zqk.id3x.com
hrdrn.id3x.com
gx84t.id3x.com
fq4qq.id3x.com
bc6nm.id3x.com
y8hmj.id3x.com

This kind of thing is not uncommon: I have found the same pattern for tumblr.com, codeplex.com, radio.fr and aircostcontrol.com.

With the first formula I proposed above (the square root), a site with 250000 hosts per IP would have a per-host delay 500 times longer than the base. Suppose your base delay is 10s: the per-host delay becomes 5000s, so the actual delay between two queries on this IP would be 5000/250000 = 0.02s.
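
The arithmetic can be checked with a few lines. This sketch assumes a steady state in which all hosts of the entry are visited in turn, so the interval between two requests on the IP is the per-host delay divided by the number of hosts; the variable names are illustrative:

```python
import math

base_delay = 10.0        # per-host base delay, seconds
hosts_on_ip = 250_000    # hosts sharing the IP

# Square-root scaling: each host is revisited every base * sqrt(n) seconds.
sqrt_delay = base_delay * math.sqrt(hosts_on_ip)

# With every host taking its turn, the IP sees one request per
# (per-host delay / number of hosts) seconds.
ip_interval = sqrt_delay / hosts_on_ip

# For comparison: one URL per host with a strict ipDelay of 1 s
# on an entry with 10000 hosts.
strict_sweep_hours = 10_000 * 1.0 / 3600

print(sqrt_delay, ip_interval, round(strict_sweep_hours, 2))
```

The strict per-IP sweep comes out just under 3 hours for 10000 hosts, while the square-root rule lets the IP be queried about 50 times per second for 250000 hosts.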

An interesting reference can also be found here: https://stackoverflow.com/questions/8236046/typical-politeness-factor-for-a-web-crawler

They have an absolute min delay (but it seems to be per host), and they compute the actual delay based on previous response time, and an estimation of the site "size".

vigna commented on September 20, 2024

I have just committed a tentative implementation of cross-agent IP politeness I devised with Paolo. The idea is explained in the documentation of the parameter ipDelayFactor of StartupConfiguration.
