brandwatchltd / robots
Support for Robots Exclusion Protocol, including parsing and matching against robots.txt directives.
License: BSD 3-Clause "New" or "Revised" License
Consider the example below (from http://blakesmalltalkblog.dailymail.co.uk/robots.txt). There are two distinct groups for User-agent: *. Our strategy is to choose the most specific matching group; since the groups are equally specific, we fall back to choosing the first, so the path directives in the second group are ignored.
It's not entirely clear what the authors of this robots.txt intended, but they appear to be under the impression that all matching groups are processed. The robots library applies at most one group, so the file doesn't work as they expect.
User-agent: *
Disallow: /t/trackback
Disallow: /t/comments
Disallow: /t/stats
Disallow: /t/app
Disallow: /.m/
# block against duplicate content
User-agent: *
Disallow: /*.html?cid=*
Disallow: /*/comments/page/*
Disallow: /*/comments/atom.xml
Disallow: /*/comments/rss.xml
Disallow: /*/comments/index.rdf
User-agent: Googlebot-Mobile
Allow: /.m/
Disallow: /
User-agent: Y!J-SRD
Allow: /.m/
Disallow: /
User-agent: Y!J-MBS
Allow: /.m/
Disallow: /
# block MSIE from abusing cache request
User-agent: Active Cache Request
Disallow: *
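The group-selection strategy described above can be sketched as follows. This is an illustrative model, not the library's actual API; `Group` and `mostSpecificGroup` are hypothetical names, and specificity is approximated by agent-pattern length with `*` lowest.

```java
import java.util.Arrays;
import java.util.List;

public class GroupSelection {

    // A user-agent group: the agent pattern plus its path directives.
    static final class Group {
        final String agentPattern;
        final List<String> directives;
        Group(String agentPattern, List<String> directives) {
            this.agentPattern = agentPattern;
            this.directives = directives;
        }
    }

    // Pick the most specific group matching the crawler's user-agent.
    // "*" matches everything with the lowest specificity; ties go to
    // the first group, because the comparison is strictly greater-than.
    static Group mostSpecificGroup(List<Group> groups, String userAgent) {
        Group best = null;
        int bestScore = -1;
        for (Group g : groups) {
            boolean matches = g.agentPattern.equals("*")
                    || userAgent.toLowerCase().contains(g.agentPattern.toLowerCase());
            int score = g.agentPattern.equals("*") ? 0 : g.agentPattern.length();
            if (matches && score > bestScore) { // strict '>' keeps the first on ties
                best = g;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Group> groups = Arrays.asList(
                new Group("*", Arrays.asList("Disallow: /t/trackback")),
                new Group("*", Arrays.asList("Disallow: /*.html?cid=*")),
                new Group("Googlebot-Mobile", Arrays.asList("Allow: /.m/", "Disallow: /")));
        // Both "*" groups match a generic crawler; the first wins, so the
        // second group's directives are ignored, as described above.
        System.out.println(mostSpecificGroup(groups, "MyCrawler/1.0").directives);
        System.out.println(mostSpecificGroup(groups, "Googlebot-Mobile/2.1").directives);
    }
}
```

Under this model a generic crawler gets only the first `*` group, which reproduces the behaviour the file's authors probably did not intend.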
Here's another example, from http://sunshine-girls.net/robots.txt:
# This file was generated on Sun, 21 Sep 2014 21:20:00 +0000
# If you are regularly crawling WordPress.com sites, please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.
Sitemap: http://sunshine-girls.net/sitemap.xml
Sitemap: http://sunshine-girls.net/news-sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
User-agent: *
Disallow: /mshots/v1/
# har har
User-agent: *
Disallow: /activate/
User-agent: *
Disallow: /wp-login.php
User-agent: *
Disallow: /signup/
User-agent: *
Disallow: /related-tags.php
User-agent: *
Disallow: /public.api/
# MT refugees
User-agent: *
Disallow: /cgi-bin/
User-agent: *
Disallow: /wp-admin/
I'm not sure there is a coherent way to resolve this problem. One could combine the groups, but that feels like an awkward exception: if we combine two equally specific matching groups, then why don't we combine all matching groups? I'm inclined to ignore this until there is a clear need.
To help developers integrate the code, we should add javadoc to the top-level API facade. This will cover just the following classes and interfaces:
com.brandwatch.robots.RobotsConfig
com.brandwatch.robots.RobotsService
com.brandwatch.robots.RobotsFactory
The RobotsFactory is part of the instantiation process, but its methods are not generally designed to be public facing. Therefore, only the class definition, and not its methods, need be documented.
Once this is done, we should consider whether it's worth extending the documentation to deeper levels.
When downloading robots.txt files, we currently ignore the HTTP Content-Type response header. This is generally consistent with other people's published interpretations of the protocol (e.g. https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt), but maybe we can do better.
The Content-Type header contains both the MIME type of the data and (optionally) the character encoding. If and how we handle these two pieces of information can be considered separately:
The most obvious thing to do here is to discard any response that isn't served as text/plain. We could handle this like other server-side errors: allow the whole domain, but try again in a day or two. The advantage is that we save bandwidth and processing time otherwise spent on invalid data. The disadvantage is that we would likely disregard data that was otherwise parsable, because MIME type misconfiguration is common. Overall, I'm not convinced the advantage is worth the effort.
In cases where a character encoding is given, we could use it rather than the default (UTF-8). While this isn't required by the protocol, it would be a good-faith gesture to make a best effort at interpreting the site's wishes. There's no real disadvantage to doing this, other than developer time.
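Both checks can be sketched in a few lines. This is a minimal illustration, not the library's code; the method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeCheck {

    // Returns true if the Content-Type's MIME type is text/plain.
    static boolean isPlainText(String contentType) {
        if (contentType == null) return false;
        String mime = contentType.split(";", 2)[0].trim().toLowerCase();
        return mime.equals("text/plain");
    }

    // Extracts the charset parameter, falling back to UTF-8 when the
    // header omits it or names an unsupported encoding.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim().toLowerCase();
                if (p.startsWith("charset=")) {
                    try {
                        return Charset.forName(p.substring("charset=".length()).replace("\"", ""));
                    } catch (Exception e) {
                        break; // unknown charset: fall through to the default
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(isPlainText("text/plain; charset=ISO-8859-1")); // true
        System.out.println(isPlainText("text/html"));                      // false
        System.out.println(charsetOf("text/plain; charset=ISO-8859-1"));   // ISO-8859-1
        System.out.println(charsetOf("text/plain"));                       // UTF-8
    }
}
```

Note that a real implementation would want a proper header parser (quoted parameters, multiple parameters), but this shows how cheaply the charset hint can be honoured.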
Currently, when a robots.txt file is retrieved, there are three possible outcomes:
Sometimes, however, it would be more sensible to fall back to a previously cached response, when one is available. Examples where this behaviour would be desirable include:
This would be nice to have, but it's probably not very important; I imagine these conditions don't occur very often.
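The fallback idea can be sketched as below. This is an assumption about how it might work, not the library's caching implementation; `StaleFallbackCache` and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class StaleFallbackCache {

    // Last successfully fetched copy per robots.txt URL.
    private final Map<String, String> lastGood = new HashMap<>();

    // The fetcher simulates the download; a null result stands in for a
    // server-side failure such as a 5xx response or a timeout.
    public String fetch(String url, java.util.function.Function<String, String> fetcher) {
        String fresh = fetcher.apply(url);
        if (fresh != null) {
            lastGood.put(url, fresh); // refresh the cache on success
            return fresh;
        }
        // Download failed: prefer the stale cached copy over the
        // "allow everything" default, when one is available.
        return Optional.ofNullable(lastGood.get(url)).orElse(""); // "" == allow all
    }

    public static void main(String[] args) {
        StaleFallbackCache cache = new StaleFallbackCache();
        String url = "http://example.com/robots.txt";
        System.out.println(cache.fetch(url, u -> "User-agent: *\nDisallow: /private/"));
        // Simulated outage: the stale copy is returned instead of allow-all.
        System.out.println(cache.fetch(url, u -> null));
    }
}
```

A real version would also track the age of the stale copy and expire it eventually, rather than serving it indefinitely.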
There are a couple of issues I've noticed with the CLI distribution:
When field names are preceded or followed by whitespace characters, the whitespace is included in the field token. This causes them to be reported as unknown directives. We need to adjust the parser to handle this better. A couple of options:
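One option is to trim whitespace around the field name before matching, as sketched below. `splitDirective` is a hypothetical helper, not the library's actual parser, which is grammar-based.

```java
public class FieldNameTrim {

    // Splits a robots.txt line into (field, value), trimming whitespace
    // around the field name so "  User-agent : foo" is still recognised.
    // Returns null for lines with no colon.
    static String[] splitDirective(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) return null;
        String field = line.substring(0, colon).trim().toLowerCase();
        String value = line.substring(colon + 1).trim();
        return new String[] { field, value };
    }

    public static void main(String[] args) {
        // Leading/trailing whitespace no longer yields an unknown directive.
        String[] d = splitDirective("  User-agent \t: Googlebot");
        System.out.println(d[0] + " -> " + d[1]); // user-agent -> Googlebot
    }
}
```

In the actual JavaCC-style grammar the equivalent change would be to allow optional whitespace tokens around the field name, but the effect is the same.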
Byte Order Marks (BOMs) are not handled explicitly anywhere. If a robots.txt file starts with a BOM, the parser fails, throwing an unhelpful exception.
For example, http://www.marketingmagazine.co.uk/robots.txt starts with a UTF-8 BOM (at time of writing). Processing this file will result in the following output:
12:47:17.199 [com.brandwatch.robots.cli.Main.main()] DEBUG com.brandwatch.robots.RobotsFactory - Initializing factory with config: RobotsConfig{cacheExpiresHours=24, cacheMaxSizeRecords=10000, maxFileSizeBytes=196608, maxRedirectHops=5, defaultCharset=UTF-8, userAgent=robots, requestTimeoutMillis=10000}
12:47:17.256 [com.brandwatch.robots.cli.Main.main()] DEBUG com.brandwatch.robots.RobotsFactory - Initializing cache (maxSize: 10000, expires after: 24 hours)
12:47:17.286 [com.brandwatch.robots.cli.Main.main()] DEBUG c.b.robots.RobotsServiceImpl - Resolving robots URL for: http://www.marketingmagazine.co.uk/search/articles?KeyWords=Corporate&Disciplines=1003&HeadlinesOnly=false&SortOrder=2
12:47:17.293 [com.brandwatch.robots.cli.Main.main()] DEBUG c.b.robots.RobotsServiceImpl - Resolved robots URI to: http://www.marketingmagazine.co.uk:80/robots.txt
12:47:17.298 [com.brandwatch.robots.cli.Main.main()] DEBUG c.brandwatch.robots.RobotsLoaderImpl - Loading: http://www.marketingmagazine.co.uk:80/robots.txt
12:47:17.672 [com.brandwatch.robots.cli.Main.main()] DEBUG c.brandwatch.robots.RobotsLoaderImpl - Conditional allow; parsing contents of http://www.marketingmagazine.co.uk:80/robots.txt
12:47:17.683 [com.brandwatch.robots.cli.Main.main()] INFO c.brandwatch.robots.RobotsLoaderImpl - Allowing entire site: http://www.marketingmagazine.co.uk:80; Caught parsing exception: "com.brandwatch.robots.parser.ParseException: Encountered " "disallow" "Disallow "" at line 2, column 1.
Was expecting one of:
<EOF>
<EOL> ...
"user-agent" ...
"sitemap" ...
<OTHER_FIELD> ...
"
12:47:17.695 [com.brandwatch.robots.cli.Main.main()] DEBUG c.b.robots.RobotsServiceImpl - Matched user-agent group: *
12:47:17.695 [com.brandwatch.robots.cli.Main.main()] DEBUG c.b.robots.RobotsServiceImpl - Matched path directive allow:/*
12:47:17.695 [com.brandwatch.robots.cli.Main.main()] DEBUG c.b.robots.RobotsServiceImpl - Allowing: http://www.marketingmagazine.co.uk/search/articles?KeyWords=Corporate&Disciplines=1003&HeadlinesOnly=false&SortOrder=2
BOMs are created by various MS Windows software. They aren't very common on the web, but it's an edge case we should probably handle.
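One way to handle this is to strip a leading UTF-8 BOM from the input stream before it reaches the parser. The sketch below uses only the standard library; `stripUtf8Bom` is an assumed helper name, and a real implementation would also want to consider UTF-16 BOMs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class BomStripper {

    // Wraps the stream, consuming a leading UTF-8 BOM (EF BB BF) if
    // present so the parser never sees it.
    static InputStream stripUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pushback.read(head, 0, 3);
        boolean isBom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: put the bytes back
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] bom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };
        byte[] body = "User-agent: *\n".getBytes(StandardCharsets.UTF_8);
        byte[] file = new byte[bom.length + body.length];
        System.arraycopy(bom, 0, file, 0, bom.length);
        System.arraycopy(body, 0, file, bom.length, body.length);

        // The BOM is consumed; only the directives reach the reader.
        InputStream in = stripUtf8Bom(new ByteArrayInputStream(file));
        byte[] out = new byte[body.length];
        int read = in.read(out);
        System.out.println(new String(out, 0, read, StandardCharsets.UTF_8));
    }
}
```

Commons IO's BOMInputStream does the same job, if adding a dependency is acceptable.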