
Comments (8)

vigna commented on September 20, 2024

Yes. I've been thinking about that. Of course the problem is that we cannot use the IP to distribute jobs, as it changes with time and server.

One solution I'm considering is to add a dynamic IP-based politeness boost for sites with more than one URL. Beyond a certain number (say, 3) of URLs per IP, it is safe to assume that the same is happening in all agents, that is, other agents also have several URLs with the same IP. Thus, one might want to increase the IP delay, multiplying it by the number of agents, so that the globally enforced delay is similar to the locally specified one. Once again, this should happen only if we can reasonably assume that the same IP appears elsewhere.
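
As a rough illustration of the boost described above, here is a minimal Python sketch. It is not BUbiNG code: the function name, its parameters, and the choice to trigger the boost at "at least `threshold`" local entries are all assumptions made for the example.

```python
def boosted_ip_delay(base_ip_delay_s: float,
                     local_entries_on_ip: int,
                     num_agents: int,
                     threshold: int = 3) -> float:
    """Return the IP delay an agent would apply locally.

    Hypothetical rule: if this agent sees at least `threshold` entries
    for the same IP, assume the other agents see the IP too and multiply
    the delay by the number of agents, so that the global request rate
    roughly matches the locally configured one.
    """
    if local_entries_on_ip >= threshold:
        return base_ip_delay_s * num_agents
    return base_ip_delay_s


if __name__ == "__main__":
    # 30 agents, 1 s configured IP delay.
    print(boosted_ip_delay(1.0, 5, 30))  # boosted
    print(boosted_ip_delay(1.0, 1, 30))  # left unchanged
```

The threshold is the sensitive part: below it, the agent has no local evidence that the IP is shared across agents, so the delay is left alone.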

from bubing.

guillaumepitel commented on September 20, 2024

Good idea, but it would depend on the number of crawling nodes. With 30 nodes, you would have to have more than 60 sites on one IP to detect that.


guillaumepitel commented on September 20, 2024

"Of course the problem is that we cannot use the IP to distribute jobs, as it changes with time and server." : maybe it's still possible to do it.

  • The sieve (and the queues before the workbench) could still be distributed per host; this way the memory of visited hosts would be consistently managed by one node.
  • When a bubingJob for a host without an IP arrives, we do as usual: create a visit state in newVisitStates, and fill it with URLs as long as the host is not resolved.
  • As soon as the node responsible for the host has resolved the host's IP, a new bubingJob is created, containing the IP address.
  • A bubingJob containing the IP would be hashed based on the IP, then distributed to another node.
  • This node would receive it and pass it directly to the workbench.

What did I miss ?
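
The two-stage routing proposed in the list above could be sketched as follows in Python. This is only an illustration under assumptions: `node_for` and `route_job` are hypothetical names, SHA-256 modulo the node count stands in for whatever assignment function the crawler really uses, and the sketch deliberately ignores the case where a host's IP changes later.

```python
import hashlib
from typing import Optional, Tuple


def node_for(key: str, num_nodes: int) -> int:
    """Map a string key to a node index with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def route_job(host: str, ip: Optional[str], num_nodes: int) -> Tuple[str, int]:
    """Return (stage, node index) for a job.

    Jobs without a resolved IP are routed by host, so one node owns the
    sieve and the pending queue for that host. Once the IP is known, the
    job is re-routed by IP, so one node owns the workbench entry.
    """
    if ip is None:
        return ("sieve", node_for(host, num_nodes))
    return ("workbench", node_for(ip, num_nodes))


if __name__ == "__main__":
    nodes = 30
    print(route_job("a.example.com", None, nodes))
    # Two hosts on the same IP end up on the same workbench node.
    print(route_job("a.example.com", "192.0.2.1", nodes))
    print(route_job("b.example.com", "192.0.2.1", nodes))
```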


vigna commented on September 20, 2024

The problem is that the IP associated with a host can change over time. You would have inconsistent queues all over the place.

In any case, after some probabilistic analysis (involving very complicated formulae and help from a friend), we worked out that, given a square power-law distribution for the number of hosts associated with an IP, for k=2 you can predict that about half of the agents will have the same IP, for k=3 3/4, and for k=4 4/5. The model is very rough, but probably reasonable enough to be used. In any case, we will do more testing with other distributions. If you have any source for the distribution of the number of hosts associated with an IP address, that'd be great...


guillaumepitel commented on September 20, 2024

I've hit another problem (quite the opposite of enforcing a per-IP politeness setting). Some workbench entries have thousands of visit states (the biggest seen so far had 12678 hosts).

I think we can assume these IPs are load balancers. What would be the right thing to do in this case, in your opinion? Maybe a logarithm of the number of visit states could be used to increase the per-host delay, instead of having a per-IP delay?


vigna commented on September 20, 2024

So you mean that you would like to visit those sites faster?


guillaumepitel commented on September 20, 2024

Sort of, yes. The problem with enforcing per-IP politeness is that a workbench entry with 10000 sites (I just found one with more than 200000 sites) will take forever to process. Just processing one URL for each of the visit states with an ipDelay of 1s would take almost 3 hours. On the other hand, per-host politeness alone is obviously not enough for regular web servers with many small virtual hosts.
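
To make the "almost 3 hours" figure concrete, here is a small Python sketch of the arithmetic. It assumes a simplified model in which the IP is hit exactly once per ipDelay and fetch time is ignored; the function name is made up for the example.

```python
def one_pass_seconds(num_hosts: int, ip_delay_s: float) -> float:
    """Time to fetch one URL from every host behind a single IP,
    assuming one request per ip_delay_s and negligible fetch time."""
    return num_hosts * ip_delay_s


if __name__ == "__main__":
    for n in (10_000, 200_000):
        hours = one_pass_seconds(n, 1.0) / 3600
        print(f"{n} hosts at 1 s per request: {hours:.1f} h per pass")
```

Under this model, 10000 hosts at 1s take about 2.8 hours per pass, and 200000 hosts take more than two days.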

A simple variant, almost equivalent to the current per-IP delay, would be to multiply the per-host delay by the number of hosts on this IP. Doing so would be just as much of a problem, though, for IPs with a very large number of hosts. So as an alternative, we could compute the per-host delay as perHostDelayBase * sqrt(nbHostInEntry) or as perHostDelayBase * (1 + log(nbHostInEntry)).
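
The two candidate formulas can be compared with a short Python sketch. The function names are invented for the example, and the natural logarithm is an assumption, since the base of the log is not specified above.

```python
import math


def per_host_delay_sqrt(base_delay_s: float, hosts_in_entry: int) -> float:
    """perHostDelayBase * sqrt(nbHostInEntry)."""
    return base_delay_s * math.sqrt(hosts_in_entry)


def per_host_delay_log(base_delay_s: float, hosts_in_entry: int) -> float:
    """perHostDelayBase * (1 + log(nbHostInEntry)), natural log assumed."""
    return base_delay_s * (1 + math.log(hosts_in_entry))


if __name__ == "__main__":
    base = 10.0  # seconds, arbitrary value for illustration
    for n in (1, 100, 10_000, 250_000):
        print(f"{n:>7} hosts: sqrt -> {per_host_delay_sqrt(base, n):8.1f} s, "
              f"log -> {per_host_delay_log(base, n):6.1f} s")
```

Both variants reduce to the base delay for a single host; the logarithmic one grows much more slowly, so it is far more aggressive towards IPs with very many hosts.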

Of course we could also cap the number of hosts per IP. FYI, here is the biggest one I have found so far:

Entry with 235321 visitStates; first hosts:

  • 8t9mf.id3x.com
  • r4ctd.id3x.com
  • hqr7t.id3x.com
  • ffdb7.id3x.com
  • n2zqk.id3x.com
  • hrdrn.id3x.com
  • gx84t.id3x.com
  • fq4qq.id3x.com
  • bc6nm.id3x.com
  • y8hmj.id3x.com

This kind of thing is not uncommon; I have found the same pattern for tumblr.com, codeplex.com, radio.fr and aircostcontrol.com.

With the first formula I proposed above (the square root), a site with 250000 hosts per IP would have a per-host delay 500 times longer than the base. If your base delay is 10s, the per-host delay becomes 5000s, so the actual delay between two queries on this IP would be 5000/250000 = 0.02s.
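
The same calculation as a Python sketch, for checking other host counts. It assumes the hosts of an entry are visited evenly in turn, so the interval seen by the IP is the per-host delay divided by the number of hosts; the function name is hypothetical.

```python
import math


def effective_ip_interval(base_delay_s: float, hosts_in_entry: int) -> float:
    """Average time between two requests to the same IP when each host
    is fetched every base_delay_s * sqrt(hosts_in_entry) seconds and
    the hosts are visited evenly in turn (assumed scheduling)."""
    per_host_delay = base_delay_s * math.sqrt(hosts_in_entry)
    return per_host_delay / hosts_in_entry


if __name__ == "__main__":
    for n in (1, 100, 10_000, 250_000):
        print(f"{n:>7} hosts: {effective_ip_interval(10.0, n):.4f} s between requests")
```

With the square-root rule the interval simplifies to base / sqrt(n), so the load on the IP still grows with the number of hosts, only more slowly than with a pure per-host delay.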

An interesting reference can also be found here: https://stackoverflow.com/questions/8236046/typical-politeness-factor-for-a-web-crawler

They have an absolute min delay (but it seems to be per host), and they compute the actual delay based on previous response time, and an estimation of the site "size".


vigna commented on September 20, 2024

I have just committed a tentative implementation of cross-agent IP politeness I devised with Paolo. The idea is explained in the documentation of the parameter ipDelayFactor of StartupConfiguration.

