law-unimi / bubing

The LAW next generation crawler.

Home Page: http://law.di.unimi.it/software.php#bubing

License: Apache License 2.0

Languages: Java 97.76%, HTML 1.99%, Shell 0.25%
Topics: crawler, social-network, web

bubing's People

Contributors: boldip, guillaumepitel, mapio, pierlauro, vigna

bubing's Issues

SSL certificates are wrongly rejected

I keep getting this error regarding SSL certificates; it occurs very frequently:

javax.net.ssl.SSLPeerUnverifiedException: Certificate for <www.genopole.fr> doesn't match any of the subject alternative names: [join-the-biocluster.genopole.fr, jointhebiocluster.genopole.fr, join.genopole.fr]
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:467)
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:397)
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
	at org.apache.http.impl.conn.BasicHttpClientConnectionManager.connect(BasicHttpClientConnectionManager.java:323)
	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:221)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:191)
	at it.unimi.di.law.bubing.util.FetchData.fetch(FetchData.java:323)
	at it.unimi.di.law.bubing.frontier.FetchingThread.run(FetchingThread.java:253)
The VisitState is subsequently Killed.

When I look at the site using my browser, it doesn't seem to complain, though. A lot of sites are affected by this problem.
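
If relaxed certificate handling is acceptable for a research crawl, one possible workaround is to build the HTTP client with a no-op hostname verifier, so that SAN mismatches like the one above are tolerated (this also accepts genuinely misissued certificates). The following is only a sketch using stock Apache HttpClient 4.x APIs, not BUbiNG's actual configuration. One thing worth checking, purely a guess: since the crawler connects to the cached IP address rather than the host name, the server may never see the right SNI name and thus serve its default certificate.

import javax.net.ssl.SSLContext;
import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.ssl.SSLContexts;

public class LaxTlsClient {
	public static CloseableHttpClient build() {
		// Keep the default trust store, but skip hostname/SAN verification,
		// so certificates served for a different subject are not rejected.
		SSLContext sslContext = SSLContexts.createDefault();
		SSLConnectionSocketFactory socketFactory =
				new SSLConnectionSocketFactory(sslContext, NoopHostnameVerifier.INSTANCE);
		return HttpClients.custom().setSSLSocketFactory(socketFactory).build();
	}
}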

Distribution not working as expected

I've been trying to use BUbiNG in a cluster (first on a local network, then on EC2). I'm using jgroups' S3_PING protocol for cluster connection, and the views from the JGroups messages (actually from the JGroupsJobManager) correctly show all cluster members. However, there is only one JobManager. For a long time I thought this was normal and everything was working correctly, but today I realized that the receivedURLs counter stays at 0 and that no Jobs from other agents ever arrive on the nodes.

Here is a log sample from JGroups/JGroupsJobManager with a 2-node cluster (10.42.1.57 and 10.42.1.254).

Any help would be highly appreciated.

2017-10-05 21:02:26,084 7933 WARN [main] o.j.p.p.FLUSH - agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi: waiting for UNBLOCK timed out after 2000 ms
2017-10-05 21:02:26,084 7933 INFO [main] i.u.d.j.j.JGroupsJobManager - Currently knowing 1 job managers (1 alive)
2017-10-05 21:02:26,084 7933 DEBUG [main] i.u.d.j.j.JGroupsJobManager - Currently known remote job managers: {[agent (address=agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi, weight=1, suspected=false, disabled=false, pendingMessages=0)]}
2017-10-05 21:02:26,084 7933 DEBUG [main] i.u.d.j.j.JGroupsJobManager - Assignment strategy: [[agent (address=agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi, weight=1, suspected=false, disabled=false, pendingMessages=0)]]
2017-10-05 21:02:27,723 9572 INFO [ViewHandler,debug,agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi] i.u.d.j.j.JGroupsJobManager - New JGroups view [agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi|1] (2) [agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi, agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.254:9999/jmxrmi]
2017-10-05 21:02:27,723 9572 INFO [ViewHandler,debug,agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi] i.u.d.j.j.JGroupsJobManager - New members: [agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.254:9999/jmxrmi]
2017-10-05 21:02:27,723 9572 INFO [ViewHandler,debug,agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi] i.u.d.j.j.JGroupsJobManager - Currently knowing 1 job managers (1 alive)
2017-10-05 21:02:27,723 9572 DEBUG [ViewHandler,debug,agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi] i.u.d.j.j.JGroupsJobManager - Currently known remote job managers: {[agent (address=agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.254:9999/jmxrmi, weight=1, suspected=false, disabled=false, pendingMessages=0)]}
2017-10-05 21:02:27,725 9574 DEBUG [ViewHandler,debug,agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi] i.u.d.j.j.JGroupsJobManager - Assignment strategy: [[agent (address=agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.254:9999/jmxrmi, weight=1, suspected=false, disabled=false, pendingMessages=0)]]
2017-10-05 21:02:27,725 9574 DEBUG [ViewHandler,debug,agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi] i.u.d.j.j.JGroupsJobManager - Current JGroups view: [agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi|1] (2) [agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi, agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.254:9999/jmxrmi]

Gracefully recover crawl when unexpectedly stopped

Hello, I am playing with the crawler and starting the crawl by issuing this command:

nohup java -cp bubing-0.9.15.jar:lib/* -server -Xss256K -Xms20G -XX:+UseNUMA -Djavax.net.ssl.sessionCacheSize=8192 \
        -XX:+UseTLAB -XX:+ResizeTLAB -XX:NewRatio=4 -XX:MaxTenuringThreshold=15 -XX:+CMSParallelRemarkEnabled \
        -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails \
        -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
        -Djava.rmi.server.hostname=<hostname> \
        -Djava.net.preferIPv4Stack=true \
        -Djgroups.bind_addr=<hostname> \
        -Dlogback.configurationFile=bubing-logback.xml \
        -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.rmi.port=9998 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false \
        it.unimi.di.law.bubing.Agent -h <hostname> -P eu.properties -g eu agent -n 2>err >out

Everything seems to run fine, and the crawl starts as a background process. Since I am using AWS Spot instances to run crawls, they might be taken away; the volume data is preserved, but the machine is swapped for another one, so the crawl is interrupted. I tried to imitate this by simply killing the Java process and re-issuing the command above without the -n option to continue the crawl. In the crawl logs I then get these errors:

2021-04-26 07:25:21,631 1588 ERROR [main] i.u.d.l.b.f.Frontier - Trying to restore state from snap directory crawl-digital/frontier/snap, but it does not exist or is not a directory
2021-04-26 07:25:21,632 1589 ERROR [Distributor] i.u.d.l.b.f.Distributor - Unexpected exception
java.lang.NullPointerException: null
	at it.unimi.di.law.bubing.frontier.Distributor.run(Distributor.java:134)
2021-04-26 07:25:21,775 1732 ERROR [MessageThread] i.u.d.l.b.f.MessageThread - Unexpected exception
java.lang.NullPointerException: null
	at it.unimi.di.law.bubing.frontier.MessageThread.run(MessageThread.java:54)

So I guess the snap directory is somehow not being created. Does anyone know why this might happen? It does not let me continue the crawl.
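
For what it's worth, the error suggests the snap directory is only written during an orderly shutdown, so a hard-killed JVM leaves nothing to restore. A hypothetical guard along these lines (not BUbiNG's actual code) would at least fail fast with an actionable message instead of the NullPointerExceptions above:

import java.io.File;
import java.io.IOException;

public class SnapDirGuard {
	// Hypothetical check to run before attempting a restore.
	public static void requireSnapDir(File snapDir) throws IOException {
		if (!snapDir.isDirectory())
			throw new IOException("Cannot restore crawl state: " + snapDir
					+ " does not exist; snapshots appear to be written only on a clean shutdown,"
					+ " so after a hard kill the crawl must be restarted with -n.");
	}
}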

Thank you.

Add Accept header to requests, in order to avoid being blocked by mod_security defaults

As suggested (via e-mail):

Your bot is being blocked by default by mod_security users running the core rule set, due to a missing Accept header...

Sample log entry...

[Wed Aug 09 17:31:43.092235 2017] [:error] [pid 24450] [client 64.62.252.164] ModSecurity: Warning. Operator EQ matched 0 at REQUEST_HEADERS. [file "/etc/httpd/modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf"] [line "47"] [id "960015"] [rev "1"] [msg "Request Missing an Accept Header"] [severity "NOTICE"] [ver "OWASP_CRS/2.2.9"] [maturity "9"] [accuracy "9"] [tag "OWASP_CRS/PROTOCOL_VIOLATION/MISSING_HEADER_ACCEPT"] [tag "WASCTC/WASC-21"] [tag "OWASP_TOP_10/A7"] [tag "PCI/6.5.10"] [hostname "www.atitd.com"] [uri "/"] [unique_id "WYtG-07PGRyBBSPejSylKgAAAAg"]
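
A sketch of the requested change, using stock Apache HttpClient 4.x (the exact integration point in BUbiNG may differ): register a browser-like Accept header as a default header, so every request carries it and CRS rule 960015 no longer fires.

import java.util.Collections;
import org.apache.http.HttpHeaders;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicHeader;

public class AcceptHeaderClient {
	public static CloseableHttpClient build() {
		// A browser-like Accept header keeps mod_security CRS rule 960015
		// ("Request Missing an Accept Header") from firing.
		return HttpClients.custom()
				.setDefaultHeaders(Collections.singletonList(new BasicHeader(
						HttpHeaders.ACCEPT,
						"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")))
				.build();
	}
}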

301 redirects on too many otherwise accessible pages (via wget from same server or browser)

I explored a few possible issues I could think of:

  • local name server via bind9 vs without
  • https vs http (SSL issues)
  • CDN caching vs no caching

But the problem seems to appear in all variations: I have found URLs of all types in both DNS scenarios, and with and without JCE (which seems to be enabled by default in newer JDKs).

For example:

WARC/1.0
WARC-Record-ID: urn:uid:a368339d-4ec5-b2a6-fff9-e71b965def7e
WARC-Date: 2019-04-18T03:36:37Z
WARC-Target-URI: https://www.analyticalcannabis.com/news
WARC-Type: response
Content-Type: application/http;msgtype=response
WARC-Payload-Digest: bubing:9486e8cdc971e26d6e3c042d0b35f3bb
BUbiNG-Guessed-Charset: utf-8
Content-Length: 546

HTTP/1.1 301 Moved Permanently
Cache-Control: private
Content-Type: text/html; charset=utf-8
Location: https://www.analyticalcannabis.com/news
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
X-XSS-Protection: 1
AMP-Access-Control-Allow-Source-Origin: http://localhost:62625
X-Powered-By: ASP.NET
Date: Thu, 18 Apr 2019 03:36:37 GMT
Connection: close
Content-Length: 156

<title>Object moved</title>

Object moved to here.

HTML5 charset declaration not detected

Hi, we've had some issues with charset detection for some websites. The current implementation in BUbiNG lacks the appropriate regex to detect HTML5 <meta charset> declarations.
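
Roughly, something like the following would do it (a sketch of the missing detection, not necessarily the exact code we wrote); a single case-insensitive pattern can cover both the HTML5 and the legacy http-equiv forms, since "charset=" appears in both:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetDetector {
	// Catches both <meta charset="utf-8"> (HTML5) and
	// <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> (HTML4).
	private static final Pattern META_CHARSET = Pattern.compile(
			"<meta[^>]+charset\\s*=\\s*['\"]?\\s*([\\w-]+)", Pattern.CASE_INSENSITIVE);

	public static String detect(CharSequence htmlPrefix) {
		Matcher m = META_CHARSET.matcher(htmlPrefix);
		return m.find() ? m.group(1) : null; // null: fall back to, e.g., ICU detection
	}
}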

We've implemented it, as well as a fallback using ICU probabilistic charset detection (with a dependency on ICU).

I think the HTML5 charset detection could easily be submitted to the main repo. Where do you stand regarding a potential dependency on ICU?

Duplicates and 403s are not taken into account by the maxUrlPerSchemeAuthority limit

On several occasions, I've seen a lot of URLs requested on a host, even though maxUrlPerSchemeAuthority was low (maybe 50-100).

It seems that duplicates and other non-content responses (401, 403) are not counted. This behaviour makes sense for a lot of sites, but I think there should be a limit, say maxRequestsPerSchemeAuthority (sketched below), to avoid wasting time on sites with a lot of inlinks that lead to nothing (for instance, there are a lot of links pointing toward stumbleupon.com/submit?...... which produce an error).
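
A minimal sketch of the proposed counter (maxRequestsPerSchemeAuthority is a suggested name, not an existing BUbiNG option): cap total requests per scheme+authority independently of how many responses are actually stored.

public class PerHostBudget {
	private final int maxUrls;      // existing semantics: stored pages only
	private final int maxRequests;  // proposed: every request, incl. 401/403/duplicates
	private int storedUrls, requests;

	public PerHostBudget(int maxUrls, int maxRequests) {
		this.maxUrls = maxUrls;
		this.maxRequests = maxRequests;
	}

	/** Whether another fetch to this scheme+authority is still allowed. */
	public boolean mayFetch() {
		return requests < maxRequests && storedUrls < maxUrls;
	}

	/** Record a completed fetch; wasStored is true only for stored, non-duplicate pages. */
	public void onFetch(boolean wasStored) {
		requests++;
		if (wasStored) storedUrls++;
	}
}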

Ignores nofollow on buttons - adds items to cart

The bot is adding items to the cart even though the button is marked with rel="nofollow".

Code sample (before jQuery UI transforms it):

<button id="tdb2" type="submit"  rel="nofollow" >Προσθήκη Στο Καλάθι</button>

This type of button can be found on cart systems that use jQuery UI for their button skinning (e.g. osCommerce); a sketch of nofollow-aware extraction follows.
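
The sketch below uses jsoup for brevity (BUbiNG has its own parser, so this is an illustration of the behaviour, not a patch): check the rel attribute on any link-bearing element, including the submit button of a form, before treating the target as followable.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NofollowAwareLinks {
	public static void printFollowable(String html, String baseUri) {
		Document doc = Jsoup.parse(html, baseUri);
		// Consider anchors and forms (a submit button triggers its enclosing form).
		for (Element e : doc.select("a[href], form[action]")) {
			// Skip the element if it, or the form's submit button, says nofollow.
			if (hasNofollow(e)
					|| (e.tagName().equals("form") && hasNofollow(e.selectFirst("button[type=submit]"))))
				continue;
			System.out.println(e.hasAttr("href") ? e.absUrl("href") : e.absUrl("action"));
		}
	}

	private static boolean hasNofollow(Element e) {
		return e != null && e.attr("rel").toLowerCase().contains("nofollow");
	}
}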

HTTPS URLs are actually fetched using HTTP

I know this sounds ridiculous, but it seems to be the case. Most sites are actually crawled correctly despite this problem, because most sites have an HTTP version of each HTTPS URL.

But I stumbled on the problem on a site which does not behave this way: all http://xxx/yyy URLs are redirected to https://xxx/yyy (an example problematic site: www.kernix.com).

Since BUbiNG thinks it is actually fetching https://xxx/yyy, only the redirect is stored, and nothing more.

The behaviour is visible in the logs when enabling DEBUG logging for org.apache.http:

2017-09-27 19:33:24,180 18536 DEBUG [FetchingThread-0] i.u.d.l.b.f.FetchingThread - Next URL: https://www.exensa.com/robots.txt
2017-09-27 19:33:24,217 18573 DEBUG [FetchingThread-0] o.a.h.c.p.RequestAddCookies - CookieSpec selected: compatibility
2017-09-27 19:33:24,275 18631 DEBUG [FetchingThread-0] o.a.h.c.p.RequestAuthCache - Auth cache not set in the context
2017-09-27 19:33:24,315 18671 DEBUG [FetchingThread-0] i.u.d.l.b.f.FetchingThread$BasicHttpClientConnectionManagerWithAlternateDNS - Get connection for route {}->http://37.59.88.172:80
2017-09-27 19:33:24,379 18735 DEBUG [FetchingThread-0] o.a.h.i.e.MainClientExec - Opening connection {}->http://37.59.88.172:80
2017-09-27 19:33:24,381 18737 DEBUG [FetchingThread-0] o.a.h.i.c.DefaultHttpClientConnectionOperator - Connecting to /37.59.88.172:80
2017-09-27 19:33:24,387 18743 DEBUG [FetchingThread-0] o.a.h.i.c.DefaultHttpClientConnectionOperator - Connection established 172.23.0.89:44024<->37.59.88.172:80
2017-09-27 19:33:24,387 18743 DEBUG [FetchingThread-0] o.a.h.i.c.DefaultManagedHttpClientConnection - http-outgoing-0: set socket timeout to 60000
2017-09-27 19:33:24,387 18743 DEBUG [FetchingThread-0] o.a.h.i.e.MainClientExec - Executing request GET /robots.txt HTTP/1.1
2017-09-27 19:33:24,387 18743 DEBUG [FetchingThread-0] o.a.h.i.e.MainClientExec - Target auth state: UNCHALLENGED
2017-09-27 19:33:24,388 18744 DEBUG [FetchingThread-0] o.a.h.i.e.MainClientExec - Proxy auth state: UNCHALLENGED

The problem is that, when the host address is already known (because it is internally cached in BUbiNG), the HttpHost passed to the HttpClient is created without the port or the scheme.

What must be done is to change a few lines in FetchData:

final URI uri = httpGet.getURI();
final String scheme = uri.getScheme();
// Fall back to the scheme's default port when the URI does not specify one.
final int port = uri.getPort() == -1 ? (scheme.equals("https") ? 443 : 80) : uri.getPort();
// Pass scheme and port explicitly, so that a cached IP address is still contacted over HTTPS.
final HttpHost httpHost = visitState != null
		? new HttpHost(InetAddress.getByAddress(visitState.workbenchEntry.ipAddress).getHostAddress(), port, scheme)
		: new HttpHost(uri.getHost(), port, scheme);

Unable to run compiled jar

Hello, sorry for the newbie question, as I am not from the Java ecosystem. Lately I built the jar and its dependencies from the master branch, added bubing-0.9.15.jar to the CLASSPATH environment variable, and tried to run the crawler by issuing the command from the overview in my main folder, where I have all the jars (the compiled jar and the dependencies):

java -server -Xss256K -Xms20G -XX:+UseNUMA -Djavax.net.ssl.sessionCacheSize=8192 \
        -XX:+UseTLAB -XX:+ResizeTLAB -XX:NewRatio=4 -XX:MaxTenuringThreshold=15 -XX:+CMSParallelRemarkEnabled \
        -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails \
        -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
        -Djava.rmi.server.hostname=192.168.0.20 \
        -Djava.net.preferIPv4Stack=true \
        -Djgroups.bind_addr=192.168.0.20 \
        -Dlogback.configurationFile=bubing-logback.xml \
        -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false \
        it.unimi.di.law.bubing.Agent -h 192.168.0.20 -P eu.properties -g eu agent -n 2>err >out

After issuing the command, I see this error in the err log file:

Error: Could not find or load main class it.unimi.di.law.bubing.Agent
Caused by: java.lang.NoClassDefFoundError: it/unimi/dsi/jai4j/jgroups/JGroupsJobManager

This is what my main folder, from which I am issuing the java command, looks like:

(screenshot of the folder contents)

What could cause the class not to be found? Any help would be highly appreciated.

NoSuchMethodException with default configuration (IsProbablyBinary.valueOf())

Hi,
I got an Exception with the default configuration (http://law.di.unimi.it/software/bubing-docs/overview-summary.html#overview.description) on storeFilter and parseFilter.

Exception in thread "main" org.apache.commons.configuration.ConfigurationException: it.unimi.di.law.warc.filters.parser.ParseException: java.lang.RuntimeException: java.lang.NoSuchMethodException: it.unimi.di.law.warc.filters.IsProbablyBinary.valueOf()
	at it.unimi.di.law.bubing.StartupConfiguration.<init>(StartupConfiguration.java:554)
	at it.unimi.di.law.bubing.StartupConfiguration.<init>(StartupConfiguration.java:605)
	at it.unimi.di.law.bubing.Agent.main(Agent.java:706)
Caused by: it.unimi.di.law.warc.filters.parser.ParseException: java.lang.RuntimeException: java.lang.NoSuchMethodException: it.unimi.di.law.warc.filters.IsProbablyBinary.valueOf()
	at it.unimi.di.law.warc.filters.parser.FilterParser.ground(FilterParser.java:141)
	at it.unimi.di.law.warc.filters.parser.FilterParser.atom(FilterParser.java:118)
	at it.unimi.di.law.warc.filters.parser.FilterParser.and(FilterParser.java:97)
	at it.unimi.di.law.warc.filters.parser.FilterParser.or(FilterParser.java:60)
	at it.unimi.di.law.warc.filters.parser.FilterParser.start(FilterParser.java:52)
	at it.unimi.di.law.warc.filters.parser.FilterParser.parse(FilterParser.java:46)
	at it.unimi.di.law.bubing.StartupConfiguration.<init>(StartupConfiguration.java:546)
	... 2 more

When I remove the "and not IsProbablyBinary()" clause from storeFilter and parseFilter, the crawler starts and all is OK.
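
For reference, the workaround amounts to dropping the failing clause from the two filter expressions in the properties file; e.g., assuming a filter along these lines (the exact default expression is in the linked overview, so this is only an illustration):

# fails at startup with the NoSuchMethodException above:
parseFilter=ContentTypeStartsWith(text/) and not IsProbablyBinary()
# starts fine:
parseFilter=ContentTypeStartsWith(text/)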

ivy.xml outdated

Hi all,

I've been running into some trouble trying to build this program.
I have downloaded the appropriate dependencies using ant ivy-setupjars and tried running ant compile.

It fails with:
[javac] location: package org.apache.commons.io.input
[javac] /root/BUbiNG/src/it/unimi/di/law/bubing/util/URLRespectsRobots.java:13: error: cannot find symbol
[javac] import org.apache.commons.io.input.BOMInputStream;
[javac] ^
[javac]   symbol:   class BOMInputStream
[javac]   location: package org.apache.commons.io.input
[javac] /root/BUbiNG/src/it/unimi/di/law/bubing/util/FetchData.java:309: error: incompatible types: Charset cannot be converted to String
[javac] fakeEntity.setContent(IOUtils.toInputStream(content, Charsets.ISO_8859_1));
[javac] ^
[javac] /root/BUbiNG/src/it/unimi/di/law/bubing/parser/SpamTextProcessor.java:69: error: cannot find symbol
[javac] fbr.setReader(new CharSequenceReader(csq));
[javac] ^
[javac]   symbol:   class CharSequenceReader
[javac]   location: class SpamTextProcessor
[javac] /root/BUbiNG/src/it/unimi/di/law/bubing/parser/SpamTextProcessor.java:76: error: cannot find symbol
[javac] fbr.setReader(new CharSequenceReader(csq.subSequence(start, end)));
[javac] ^
[javac]   symbol:   class CharSequenceReader
[javac]   location: class SpamTextProcessor
[javac] /root/BUbiNG/src/it/unimi/di/law/bubing/util/URLRespectsRobots.java:183: error: cannot find symbol
[javac] BOMInputStream bomInputStream = new BOMInputStream(robotsResponse.response().getEntity().getContent(), true);
[javac] ^
[javac]   symbol:   class BOMInputStream
[javac]   location: class URLRespectsRobots
[javac] /root/BUbiNG/src/it/unimi/di/law/bubing/util/URLRespectsRobots.java:183: error: cannot find symbol
[javac] BOMInputStream bomInputStream = new BOMInputStream(robotsResponse.response().getEntity().getContent(), true);
[javac] ^
[javac]   symbol:   class BOMInputStream
[javac]   location: class URLRespectsRobots
[javac] /root/BUbiNG/src/it/unimi/di/law/warc/filters/ResponseMatches.java:49: error: no suitable method found for toString(InputStream,Charset)
[javac] return pattern.matcher(IOUtils.toString(content, StandardCharsets.ISO_8859_1)).matches();
[javac] ^
[javac]   method IOUtils.toString(InputStream,String) is not applicable
[javac]     (argument mismatch; Charset cannot be converted to String)
[javac]   method IOUtils.toString(byte[],String) is not applicable
[javac]     (argument mismatch; InputStream cannot be converted to byte[])
[javac] Note: /root/BUbiNG/src/it/unimi/di/law/bubing/frontier/Frontier.java uses or overrides a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] Note: Some messages have been simplified; recompile with -Xdiags:verbose to get full output
[javac] 8 errors
[javac] 1 warning

After checking out the package, commons-io.jar, I discovered that the package was indeed missing those classes. By adding this to ivy.xml:

<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency org="commons-io" name="commons-io" rev="2.6"/>

I was able to resolve quite a few errors, but this popped up:

[javac] Compiling 145 source files to /root/BUbiNG/build
[javac] warning: [options] bootstrap class path not set in conjunction with -source 8
[javac] /root/BUbiNG/src/it/unimi/di/law/bubing/frontier/Frontier.java:916: error: unreported exception ConfigurationException; must be caught or declared to be thrown
[javac] scalarData.save(new File(snapDir, "frontier.data"));
[javac] ^
[javac] /root/BUbiNG/src/it/unimi/di/law/bubing/frontier/Frontier.java:963: error: unreported exception ConfigurationException; must be caught or declared to be thrown
[javac] final Properties scalarData = new Properties(new File(snapDir, "frontier.data"));
[javac] ^
[javac] Note: /root/BUbiNG/src/it/unimi/di/law/bubing/frontier/Frontier.java uses or overrides a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 2 errors
[javac] 1 warning

I've tried the following commons-io versions so far with no success:

2.3 1.3.2 2.5 2.0 1.4 2.2 2.1

I believe this dependency issue popped up because commons-io is not explicitly specified in ivy.xml, but I could be wrong; I'm not a Java dev by any means.

Am I doing something wrong? If not, could someone list the dependencies used and their version numbers?

Hosts with same IP address are not processed by the same node, so IP delay cannot be enforced

I work in a distributed setup (usually 8-16 machines). It seems that the per-IP politeness delay cannot be enforced because the distribution relies on the hash of the hostname, not its IP address.

I would suggest splitting the responsibility of Agent by creating a DNSSolverAgent that takes the current Jobs and submits FetchingJobs with the IP address as the bucketing variable; a sketch follows.
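
A hypothetical sketch of the bucketing such a DNSSolverAgent could apply once a host is resolved (none of these names exist in BUbiNG today):

import java.util.Arrays;

public class IpBucketing {
	// All hosts that resolve to the same address map to the same agent,
	// so that agent alone can enforce the per-IP politeness delay.
	public static int agentFor(byte[] resolvedIp, int numAgents) {
		return Math.floorMod(Arrays.hashCode(resolvedIp), numAgents);
	}
}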

maxUrls config not honored

I have tried the crawler and everything runs fine, except that the maxUrls parameter does not seem to be honored. Admittedly, I set it to the rather low value of 10K. Is there something I am missing?

FetchingThreads seem to hang and do nothing

I have observed several times that BUbiNG starts slowing down at some point. I am under the impression that the FetchingThreads start behaving oddly and no longer do their job. When I manually change the number of fetchingThreads, todoSize suddenly decreases and readyToParse increases.

So here are some graphs from when I played with fetchingThreads. In the first half, before I change anything, todoSize keeps increasing. Then I increase the number of threads, decrease it, increase and decrease it again and, victory, the readyToParse curve has increased, first with a few spikes, then finally with a "normal" behaviour.

(graphs: activeFetchingThreads, todoSize, readyToParse)

So is it possible that FetchingThreads somehow get stuck and don't fetch anything anymore?

URLMatchesRegex seems not to be working

Due to provider warnings about network scans, I want to block direct requests to IP addresses (some end up being unrouted and trigger warnings), so as to allow only domain names.

I added the following filter, which works in Java, to the config file:

scheduleFilter=not URLMatchesRegex(.*//[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?.*)
followFilter=not URLMatchesRegex(.*//[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?.*)

yet it keeps crawling IPs, and I keep getting abuse warnings from the hosting provider.

One possible reason why they get through is that this type of URL may be followed (it redirects):

http://example.com/?ip=123.123.123.213

But blocking those would not always solve the problem, as the redirect target may contain a hash instead of an IP, so it could not be detected in advance.

Any idea what may be wrong?
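
One way to narrow this down is to test the regex itself outside the crawler. The pattern does match literal-IP URLs, but it does not match the redirecting kind above; so if redirect targets are not passed through followFilter again (a guess, not a confirmed BUbiNG behaviour), raw-IP fetches would still slip through.

import java.util.regex.Pattern;

public class FilterRegexCheck {
	public static void main(String[] args) {
		// The regex from the properties file, with the backslashes Java needs.
		Pattern p = Pattern.compile(
				".*//[0-9][0-9]?[0-9]?\\.[0-9][0-9]?[0-9]?\\.[0-9][0-9]?[0-9]?.*");
		// Direct IP URL: the regex matches, so the filter should block it.
		System.out.println(p.matcher("http://123.45.67.89/index.html").matches());        // true
		// Redirecting URL from the example: no "//" before the dotted quad, so it passes.
		System.out.println(p.matcher("http://example.com/?ip=123.123.123.213").matches()); // false
	}
}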

WorkbenchEntry-based scheduling

What follows is a bit the opposite of the other issue I raised about enforcing per-IP delay with the per host partitioning... so I guess I changed my mind about the priorities :)

This is more an open discussion than an issue per se, but here is what I have observed. BUbiNG's architecture is based on WorkbenchEntry-based scheduling: a WorkbenchEntry contains the list of VisitStates related to a given IP address, in order to be able to enforce IP-based politeness.

A problem occurs because it turns out that quite a large number of websites resolve to the same address; for instance, all hosts in *.blogspot.com resolve to the same address. That is a problem because the Workbench is filled by these whales, which take up all the available space (and consequently schemeAuthority2VisitState is also filled by them), so in the end only a few WorkbenchEntries remain, and the crawl is limited by the per-IP politeness (maybe also by other limiting factors, even though I don't think the acquire/release mechanism is a problem here).

Basically, it boils down to this: as time passes, the crawl will get slower and slower because of the workbench size limit and those huge workbench entries.

I see three possible solutions :

  • Instead of a per-IP politeness setting, use something like (per-host delay) / (number of hosts in entry)^alpha, where alpha could be 0.5. For instance, with a per-host delay of 10s, an entry with 100 hosts would have a per-IP delay of 1s, while an entry with 10000 hosts would have a 0.1s delay (I mean, if you have 10000 hosts on the same IP, either you have a load balancer and a serious infrastructure, or it's a spam link farm)
  • We could add a limit to the number of simultaneous hosts in an entry: if the entry is already "full", then the VisitState is temporarily purged and its URLs are re-scheduled for later. This seems not such a good solution, because in this case it would take days to process an entry like *.blogspot.com
  • We could have a Workbench virtualizer that would put overflow VisitStates on disk

I believe only the first option actually works. I've found conflicting opinions on the matter, but a strong per-IP politeness does not seem to be the most commonly accepted rule. So I guess we could replace the IP politeness with an IP-politeness power factor alpha: alpha=1 is equivalent to a zero IP delay, alpha=0 is the opposite, with IP delay = host delay.
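
To make the first option concrete, here is the proposed delay as code (a sketch of the formula only; the parameter names are made up):

public class IpPoliteness {
	// ipDelay(n) = hostDelay / n^alpha.
	// alpha = 0 keeps ipDelay == hostDelay (current strict behaviour);
	// alpha = 1 makes the per-IP delay vanish as the number of hosts grows.
	public static long ipDelayMs(long hostDelayMs, int hostsInEntry, double alpha) {
		return (long) (hostDelayMs / Math.pow(hostsInEntry, alpha));
	}

	public static void main(String[] args) {
		System.out.println(ipDelayMs(10_000, 100, 0.5));    // 1000 ms, as in the example above
		System.out.println(ipDelayMs(10_000, 10_000, 0.5)); // 100 ms
	}
}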

robots.txt parsed as ISO-8859-1 - breaks when there's a UTF-8 BOM

I've received a complaint from the site admin of ywam.org that my crawler does not follow its robots.txt, which excludes all crawling.

https://update.ywam.org/robots.txt

It contains this:

<U+FEFF>User-agent: *
Disallow: /

The first character is actually a three-byte sequence representing the dreaded UTF-8 BOM: https://en.wikipedia.org/wiki/Byte_order_mark

Since the robots.txt parser uses ISO-8859-1 encoding, it breaks and does not recognise the format.

This test can be added to check the behaviour:

@Test
public void testDisallowEverythingWithUTF8BOM() throws Exception {
	proxy = new SimpleFixedHttpProxy();
	URI robotsURL = URI.create("http://foo.bar/robots.txt");
	proxy.add200(robotsURL, "",
			"\ufeffUser-agent: *\n" +
			"Disallow: /\n"
	);
	final URI disallowedUri1 = URI.create("http://foo.bar/goo/zoo.html"); // Disallowed
	final URI disallowedUri2 = URI.create("http://foo.bar/gaa.html"); // Disallowed
	final URI disallowedUri3 = URI.create("http://foo.bar/"); // Disallowed
	proxy.start();

	HttpClient httpClient = FetchDataTest.getHttpClient(new HttpHost("localhost", proxy.port()), false);

	FetchData fetchData = new FetchData(Helpers.getTestConfiguration(this));
	fetchData.fetch(robotsURL, httpClient, null, null, true);
	char[][] filter = URLRespectsRobots.parseRobotsResponse(fetchData, "any");
	assertFalse(URLRespectsRobots.apply(filter, disallowedUri1));
	assertFalse(URLRespectsRobots.apply(filter, disallowedUri2));
	assertFalse(URLRespectsRobots.apply(filter, disallowedUri3));
}

Google says robots.txt must be in UTF-8 and that they ignore BOMs: https://developers.google.com/search/reference/robots_txt

Fixing this may not be as easy as changing the reader's encoding; the tokenizer must be modified too.
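
As a starting point, the stream could be wrapped with commons-io's BOMInputStream, so the tokenizer never sees the BOM bytes; per the note above, this alone may not be the whole fix, so treat it as a sketch:

import java.io.InputStream;
import org.apache.commons.io.input.BOMInputStream;

public class RobotsBomStripper {
	// Consume a leading UTF-8 BOM, if present, before the robots.txt
	// tokenizer reads the stream as ISO-8859-1.
	public static InputStream withoutBom(InputStream raw) {
		return new BOMInputStream(raw, false); // false: exclude the BOM from the stream
	}
}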

ParsingThread blocked by jgroups

Hi, so, as often, I've had this problem where the crawl slows down until nothing happens anymore. A thread dump shows this:
"ParsingThread-63" #41669 daemon prio=3 os_prio=0 tid=0x00007f71c03bb800 nid=0x7a97 runnable [0x00007f7514b09000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park([email protected]/Native Method) - parking to wait for <0x00007f7ea346b1b0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/AbstractQueuedSynchronizer.java:2192) at org.jgroups.protocols.FC.handleDownMessage(FC.java:567) at org.jgroups.protocols.FC.down(FC.java:420) at org.jgroups.protocols.FRAG2.down(FRAG2.java:136) at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:202) at org.jgroups.protocols.pbcast.FLUSH.down(FLUSH.java:277) at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1038) at org.jgroups.JChannel.down(JChannel.java:791) at org.jgroups.JChannel.send(JChannel.java:426) at it.unimi.dsi.jai4j.jgroups.JGroupsRemoteJobManager.process(JGroupsRemoteJobManager.java:144) at it.unimi.dsi.jai4j.jgroups.JGroupsJobManager.submit(JGroupsJobManager.java:469) at it.unimi.di.law.bubing.frontier.Frontier.enqueue(Frontier.java:627) at it.unimi.di.law.bubing.frontier.ParsingThread$FrontierEnqueuer.enqueue(ParsingThread.java:209) at it.unimi.di.law.bubing.frontier.ParsingThread.run(ParsingThread.java:425)

I will try to identify whether this happens as soon as the crawl slows down, since the slowdown started before I noticed the jgroups-related issues, but it may at least be interesting to have a non-blocking, file-backed queue for the URL submission process; a sketch follows.
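
A sketch of that last idea (all names hypothetical): submit through a bounded in-memory queue with a non-blocking offer, spilling to a file-backed queue when JGroups cannot keep up, so ParsingThreads never park inside the flow-control protocol.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class NonBlockingSubmitter {
	private final BlockingQueue<byte[]> outbox = new ArrayBlockingQueue<>(64 * 1024);

	/** Called by ParsingThreads; never blocks. */
	public void submit(byte[] serializedUrl) {
		if (!outbox.offer(serializedUrl)) // returns false immediately when full
			spillToDisk(serializedUrl);
	}

	private void spillToDisk(byte[] serializedUrl) {
		// Hypothetical: append to an on-disk queue; a separate drainer thread
		// later feeds both the disk queue and 'outbox' into JGroups.
	}
}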
