brutuscat / medusa

This project is a fork of chriskite/anemone.

16 stars, 16 watchers, 9 forks, 327 KB

THIS IS AN OLD FORK. Check out the Medusa Crawler gem, "medusa-crawler", instead.

Home Page: https://github.com/brutuscat/medusa-crawler

License: MIT License

Ruby 100.00%
Topics: crawler, forked, multithreading, openuri, robots-txt, ruby, scrapper, spider

medusa's People

Contributors

bernd, brownbeagle, brutuscat, chriskite, donv, lpradovera, mislav, mothonmars, nehhen, paresharma, rb2k, skojin, spk, tilsammans, zirconcode

medusa's Issues

Some URL fragments are being retained & encoded

Medusa's regex for identifying anchor fragments is fairly strict: /#[a-zA-Z0-9_-]*$/

As a result, fragments with non-alphanumeric characters are being retained and encoded. Example:

require 'medusa'

Medusa.crawl('https://www.usbr.gov/library/glossary/', depth_limit: 0, discard_page_bodies: true) do |medusa|
  medusa.on_every_page do |page|
    # Print every discovered link whose fragment survived and was
    # percent-encoded ("%23" is the encoding of "#").
    puts page.links.map(&:to_s).select { |link| /%23/ === link }
  end
end

Result:
https://www.usbr.gov/library/glossary/%23crest%20elevation
https://www.usbr.gov/library/glossary/%23prestressed%20dam
https://www.usbr.gov/library/glossary/%23modifiedhomogeneousearthfilldam%3Emodified%20homogeneous%0D%0Aearthfill%20dam%3C/a%3E,%20%3Ca%20href=
https://www.usbr.gov/library/glossary/%23o&m
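
To illustrate the mismatch, here is a minimal sketch. The broader pattern below is only an assumption for demonstration; it is not the fix in the forthcoming PR:

STRICT  = /#[a-zA-Z0-9_-]*$/  # Medusa's current pattern
BROADER = /#.*$/              # possible relaxation (assumption, not the merged fix)

link = 'https://www.usbr.gov/library/glossary/#crest elevation'

puts link.sub(STRICT, '')   # unchanged: the space after "crest" blocks the match
puts link.sub(BROADER, '')  # => https://www.usbr.gov/library/glossary/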

PR to come.

Robots.txt not respected if first page is redirected

If you set Medusa to crawl http://www.foo.com, which redirects to https://www.foo.com, Medusa will successfully crawl the site, but it will not respect robots.txt. This appears to happen because Robotex attempts to pull the robots.txt file from http://www.foo.com/robots.txt without following the redirect, which results in no robot rules for the domain www.foo.com.

Example: https://www.yelp.com/robots.txt contains:

Disallow: /biz_link

> robotex = Robotex.new "My User Agent"
> robotex.allowed?("https://www.yelp.com/biz_link")
false

> robotex = Robotex.new "My User Agent"
> robotex.allowed?("http://www.yelp.com/biz_link")
true

I'd be happy to put in a PR to resolve this, but I've been going back and forth about whether the fix should be done in Robotex or Medusa.
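
In the meantime, one possible workaround at the call site is to resolve the start URL's redirects before crawling, so Robotex fetches robots.txt from the host that actually serves the site. This is only a sketch: the resolve_final_uri helper is hypothetical (not part of Medusa or Robotex), and the obey_robots_txt option is assumed to behave as it does in anemone:

require 'medusa'
require 'net/http'
require 'uri'

# Hypothetical helper: follow HTTP redirects (up to a limit) and
# return the final URI. Not part of Medusa or Robotex.
def resolve_final_uri(url, limit = 5)
  uri = URI.parse(url)
  limit.times do
    response = Net::HTTP.get_response(uri)
    break unless response.is_a?(Net::HTTPRedirection)
    uri = URI.join(uri.to_s, response['location'])
  end
  uri
end

# Crawling the post-redirect URL means robots.txt is fetched from
# https://www.foo.com, where the rules actually live.
start_url = resolve_final_uri('http://www.foo.com').to_s
Medusa.crawl(start_url, obey_robots_txt: true) do |medusa|
  medusa.on_every_page { |page| puts page.url }
end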
