brandwatchltd / robots

Support for the Robots Exclusion Protocol, including parsing and matching against robots.txt directives.

License: BSD 3-Clause "New" or "Revised" License

Shell 0.01% Java 99.99%

robots's People

Contributors

bw-tom, hamishmorgan, jstanier, katiemannering, stephenflavin


robots's Issues

Identical matching user-agent groups.

Consider the example below (from http://blakesmalltalkblog.dailymail.co.uk/robots.txt). There are two distinct groups for User-agent: *. Our strategy is to choose the most specific matching group; since the groups are equally specific, we fall back to choosing the first. Path directives in the second group are therefore ignored.

It's not entirely clear what the authors of this robots.txt intended, but I think they are under the impression that all matching groups are processed. The robots library applies at most one group, so it doesn't work as they expect.

User-agent: *
Disallow: /t/trackback
Disallow: /t/comments
Disallow: /t/stats
Disallow: /t/app
Disallow: /.m/

# block against duplicate content
User-agent: *
Disallow: /*.html?cid=*
Disallow: /*/comments/page/*
Disallow: /*/comments/atom.xml
Disallow: /*/comments/rss.xml
Disallow: /*/comments/index.rdf

User-agent: Googlebot-Mobile
Allow: /.m/
Disallow: /

User-agent: Y!J-SRD
Allow: /.m/
Disallow: /

User-agent: Y!J-MBS
Allow: /.m/
Disallow: /

# block MSIE from abusing cache request
User-agent: Active Cache Request
Disallow: *

Here's another example from http://sunshine-girls.net/robots.txt:

# This file was generated on Sun, 21 Sep 2014 21:20:00 +0000
# If you are regularly crawling WordPress.com sites, please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.

Sitemap: http://sunshine-girls.net/sitemap.xml
Sitemap: http://sunshine-girls.net/news-sitemap.xml

User-agent: IRLbot
Crawl-delay: 3600

User-agent: *
Disallow: /next/

User-agent: *
Disallow: /mshots/v1/

# har har
User-agent: *
Disallow: /activate/

User-agent: *
Disallow: /wp-login.php

User-agent: *
Disallow: /signup/

User-agent: *
Disallow: /related-tags.php

User-agent: *
Disallow: /public.api/

# MT refugees
User-agent: *
Disallow: /cgi-bin/

User-agent: *
Disallow: /wp-admin/

I'm not sure there is a coherent way to resolve this problem. One could combine the groups, but it feels like an awkward exception: if we combine two equally specific matching groups, then why don't we combine all matching groups? I'm inclined to ignore this until there is a clear need.
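For reference, here is a minimal sketch of the selection strategy described above, assuming "most specific" means the longest matching user-agent pattern; the types and method names are hypothetical, not the library's actual API:

import java.util.List;
import java.util.Locale;

final class GroupSelection {

    static final class Group {
        final String userAgentPattern;
        final List<String> pathDirectives;

        Group(String userAgentPattern, List<String> pathDirectives) {
            this.userAgentPattern = userAgentPattern;
            this.pathDirectives = pathDirectives;
        }
    }

    // Returns the first group with the longest matching user-agent pattern.
    // Later, equally specific groups are never selected, which is why the
    // second "User-agent: *" group in the example above has no effect.
    static Group select(List<Group> groups, String crawlerName) {
        Group best = null;
        for (Group group : groups) {
            if (matches(group.userAgentPattern, crawlerName)
                    && (best == null
                        || group.userAgentPattern.length() > best.userAgentPattern.length())) {
                best = group;
            }
        }
        return best;
    }

    private static boolean matches(String pattern, String crawlerName) {
        return pattern.equals("*")
                || crawlerName.toLowerCase(Locale.ROOT).contains(pattern.toLowerCase(Locale.ROOT));
    }
}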

Javadoc on core API façade

To help developers integrate the code, we should add Javadoc to the top-level API facade. This will cover just the following classes and interfaces:

  • com.brandwatch.robots.RobotsConfig
  • com.brandwatch.robots.RobotsService
  • com.brandwatch.robots.RobotsFactory

The RobotsFactory is part of the instantiation process, but its methods are not generally designed to be public facing. Therefore, only the class definition, and not its methods, need be documented.
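For example, a class-level comment along the following lines would be enough; the wording and the described relationships are illustrative rather than final:

/**
 * Assembles the library's object graph. Instances are configured with a
 * {@link RobotsConfig} and are used to obtain a {@link RobotsService}.
 * <p>
 * The factory's methods are not intended as public-facing API, so only this
 * class-level comment is provided.
 */
public class RobotsFactory {
    // existing implementation unchanged
}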

Once this is done, we should consider whether it's worth extending the documentation to deeper levels.

Respect HTTP response header content-type

When downloading robots.txt files, we ignore the HTTP response Content-Type header. This is generally consistent with other people's published interpretations of the protocol (e.g. https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt), but maybe we can do better.

The content type contains both the MIME type of the data and (optionally) the character encoding. Whether and how we handle these two pieces of information can be considered separately:

Mime-type

The most obvious thing to do here is to drop any data that isn't sent as text/plain. We could handle this like other server-side errors: allow the whole domain, but try again in a day or two. The advantage of doing this is that we save bandwidth and processing time handling invalid data. The disadvantage is that we would likely disregard data that would otherwise have been parsable, because MIME type misconfiguration is common. Overall, I'm not convinced the advantage is worth the effort.

Character encoding

In cases where a character encoding is given, we could use that rather than the default (UTF-8). While this isn't required by the protocol, it would be a good-faith gesture to make a best effort at interpreting the site's wishes. There's no real disadvantage to doing this, other than developer time.
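A rough sketch of both options, assuming the raw Content-Type header value is available; the class and method names below are illustrative, not part of the existing code:

import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentTypeHandling {

    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    // Option 1: reject anything that isn't declared as text/plain.
    static boolean isPlainText(String contentTypeHeader) {
        return contentTypeHeader != null
                && contentTypeHeader.trim().toLowerCase(Locale.ROOT).startsWith("text/plain");
    }

    // Option 2: honour a declared charset, falling back to the default (UTF-8)
    // when the parameter is absent, malformed, or unsupported.
    static Charset charsetOf(String contentTypeHeader) {
        if (contentTypeHeader != null) {
            Matcher m = CHARSET_PARAM.matcher(contentTypeHeader);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                    // Ignore and fall through to the default.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}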

Add "revert to cached" load action

Currently, when a robots.txt is retrieved there are three possible outcomes:

  1. Fully disallow the site (e.g. after a 403 "Forbidden" HTTP response).
  2. Fully allow the site (e.g. after a 404 "Not Found" response).
  3. Conditionally allow, based on the result of parsing the response.

Sometimes, however, it would be more sensible to back off to a previously cached response, when that option is available. Examples where this behaviour would be desirable include:

  • Rate-limiting responses, such as 420 "Enhance your calm", 429 "Too Many Requests", 509 "Bandwidth Limit Exceeded", and 598/599 "Network timeout error"
  • Cache-control responses: 304 "Not Modified"
  • Temporary errors: 408 "Request Timeout"

This would be nice to have, but it's probably not very important. I imagine these conditions don't occur very often.
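A rough sketch of how the per-status decision might look with the proposed action added; the enum, the method, and the exact status-to-action mapping are hypothetical, not the current behaviour:

// Possible load actions, including the proposed "revert to cached" case.
enum LoadAction { FULL_ALLOW, FULL_DISALLOW, PARSE_RESPONSE, REVERT_TO_CACHED }

final class LoadActionPolicy {
    static LoadAction actionFor(int statusCode) {
        switch (statusCode) {
            case 200:                  // parse the body and conditionally allow
                return LoadAction.PARSE_RESPONSE;
            case 401:
            case 403:                  // access denied: fully disallow
                return LoadAction.FULL_DISALLOW;
            case 304:                  // cache-control: not modified
            case 408:                  // temporary error
            case 420:
            case 429:
            case 509:
            case 598:
            case 599:                  // rate limiting / network timeouts
                return LoadAction.REVERT_TO_CACHED;
            case 404:                  // no robots.txt present: fully allow
            default:                   // assumption for illustration only
                return LoadAction.FULL_ALLOW;
        }
    }
}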

Sort out the command line interface distribution

There are a couple of issues I've noticed with the CLI distribution:

  • the tar contains a directory with no file system permissions granted
  • it would be nice to offer an easy-to-run download, rather than requiring people to compile from source

White-space included in field names.

When field names start with, or are followed by, whitespace characters, the characters are included in the field token. This causes them to be produced as unknown directives. We need to adjust the parser to handle this better. A couple of options (a sketch of the second follows the list):

  • White space skipping doesn't appear to be working as expected, so we could fix that.
  • Alternatively, trim the token before it's produced.
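An illustrative helper for the second option; the class and method names are made up and this is not the actual parser code:

import java.util.Locale;

final class FieldTokens {
    // Normalise a field token before it is matched against known directives.
    // Drop surrounding whitespace so " Disallow " matches "disallow", and
    // lower-case because directive names are case-insensitive.
    static String normalise(String rawToken) {
        return rawToken.trim().toLowerCase(Locale.ROOT);
    }
}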

Byte Order Marks cause problems

Byte Order Marks (BOMs) are not handled explicitly anywhere. If a robots.txt file starts with a BOM, the parser fails, throwing an unhelpful exception.

For example, http://www.marketingmagazine.co.uk/robots.txt starts with a UTF-8 BOM (at the time of writing). Processing this file results in the following output:

12:47:17.199 [com.brandwatch.robots.cli.Main.main()] DEBUG com.brandwatch.robots.RobotsFactory - Initializing factory with config: RobotsConfig{cacheExpiresHours=24, cacheMaxSizeRecords=10000, maxFileSizeBytes=196608, maxRedirectHops=5, defaultCharset=UTF-8, userAgent=robots, requestTimeoutMillis=10000}
12:47:17.256 [com.brandwatch.robots.cli.Main.main()] DEBUG com.brandwatch.robots.RobotsFactory - Initializing cache (maxSize: 10000, expires after: 24 hours)
12:47:17.286 [com.brandwatch.robots.cli.Main.main()] DEBUG c.b.robots.RobotsServiceImpl - Resolving robots URL for: http://www.marketingmagazine.co.uk/search/articles?KeyWords=Corporate&Disciplines=1003&HeadlinesOnly=false&SortOrder=2
12:47:17.293 [com.brandwatch.robots.cli.Main.main()] DEBUG c.b.robots.RobotsServiceImpl - Resolved robots URI to: http://www.marketingmagazine.co.uk:80/robots.txt
12:47:17.298 [com.brandwatch.robots.cli.Main.main()] DEBUG c.brandwatch.robots.RobotsLoaderImpl - Loading: http://www.marketingmagazine.co.uk:80/robots.txt
12:47:17.672 [com.brandwatch.robots.cli.Main.main()] DEBUG c.brandwatch.robots.RobotsLoaderImpl - Conditional allow; parsing contents of http://www.marketingmagazine.co.uk:80/robots.txt
12:47:17.683 [com.brandwatch.robots.cli.Main.main()] INFO  c.brandwatch.robots.RobotsLoaderImpl - Allowing entire site: http://www.marketingmagazine.co.uk:80; Caught parsing exception: "com.brandwatch.robots.parser.ParseException: Encountered " "disallow" "Disallow "" at line 2, column 1.
Was expecting one of:
    <EOF> 
    <EOL> ...
    "user-agent" ...
    "sitemap" ...
    <OTHER_FIELD> ...
    "
12:47:17.695 [com.brandwatch.robots.cli.Main.main()] DEBUG c.b.robots.RobotsServiceImpl - Matched user-agent group: *
12:47:17.695 [com.brandwatch.robots.cli.Main.main()] DEBUG c.b.robots.RobotsServiceImpl - Matched path directive allow:/*
12:47:17.695 [com.brandwatch.robots.cli.Main.main()] DEBUG c.b.robots.RobotsServiceImpl - Allowing: http://www.marketingmagazine.co.uk/search/articles?KeyWords=Corporate&Disciplines=1003&HeadlinesOnly=false&SortOrder=2

BOMs are created by various MS Windows software. They aren't very common on the web, but it's an edge case we should probably handle.
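One way to handle it would be to skip a leading UTF-8 BOM before the data reaches the parser; the helper below is an illustrative sketch, not the library's current behaviour:

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

final class BomSkipping {

    // Wraps the stream and consumes a leading UTF-8 BOM (0xEF 0xBB 0xBF) if
    // present; otherwise any bytes read are pushed back unchanged.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = pushback.read(head, 0, 3);
        boolean bom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && read > 0) {
            pushback.unread(head, 0, read);
        }
        return pushback;
    }
}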
