If you set Medusa to crawl http://www.foo.com, which redirects to https://www.foo.com, Medusa will crawl the site successfully, but it will not respect robots.txt. This appears to happen because Robotex attempts to pull the robots.txt file from http://www.foo.com/robots.txt without following the redirect, which results in no robot rules for the domain www.foo.com.
Example:
In https://www.yelp.com/robots.txt:
Disallow: /biz_link
> robotex = Robotex.new "My User Agent"
> robotex.allowed?("https://www.yelp.com/biz_link")
=> false
> robotex = Robotex.new "My User Agent"
> robotex.allowed?("http://www.yelp.com/biz_link")
=> true
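For reference, the missing piece seems to be redirect handling when fetching robots.txt. Below is a minimal sketch of that logic, assuming the fix lands on the Robotex side; the method name `fetch_following_redirects` and the injected `fetcher` lambda are illustrative (the lambda stands in for the real HTTP call), not Robotex's actual API:

```ruby
require 'uri'

# Follow HTTP redirects (up to a limit) when fetching robots.txt, instead of
# giving up on the first 3xx response. `fetcher` is a stand-in for the real
# HTTP request: it takes a URL and returns [status, body_or_location].
def fetch_following_redirects(url, fetcher, limit = 5)
  raise 'Too many redirects' if limit.zero?

  status, payload = fetcher.call(url)
  case status
  when 301, 302, 307, 308
    # For redirects, `payload` is the Location header; resolve it against the
    # current URL in case it is relative, then retry with a decremented limit.
    fetch_following_redirects(URI.join(url, payload).to_s, fetcher, limit - 1)
  else
    # For any other status, `payload` is the response body (the robots.txt).
    payload
  end
end
```

With this in place, requesting http://www.foo.com/robots.txt would end up reading the rules served at https://www.foo.com/robots.txt, so the Disallow entries would apply to both schemes.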
I'd be happy to put in a PR to resolve this, but I've been going back and forth about whether the fix belongs in Robotex or Medusa.