felipecsl / wombat
Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
Home Page: https://felipecsl.github.io/wombat/
License: MIT License
Open-ended versions are dangerous
I'd like to go over a lot of pages (thousands) and supply their path parameter to the base_url, because I already have them. But I don't see any example of how to do something like that. Is this a good use case for Wombat? Next, I'd like to store the scraped text from the pages in one giant JSON document.
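A possible shape for this, sketched in plain Ruby: since the list of paths is known up front, loop over it and aggregate the results into one JSON document. The crawl is injected as a callable so the loop itself stays gem-independent; `scrape_to_json` and `crawl_one` are my own names, and the example.com URL and `css=body` selector in the comment are illustrative, not from Wombat's docs.

```ruby
require 'json'

# Aggregate the results for a pre-known list of paths into one JSON document.
# The per-path crawl is passed in as a callable; with Wombat it would wrap
# Wombat.crawl with base_url fixed and path set per entry.
def scrape_to_json(paths, crawl_one)
  results = paths.each_with_object({}) do |p, acc|
    acc[p] = crawl_one.call(p)
  end
  JSON.pretty_generate(results)
end

# Hypothetical Wombat-backed crawler (selector and URL are illustrative):
# crawl_one = ->(p) {
#   Wombat.crawl do
#     base_url "https://example.com"
#     path p
#     text "css=body"
#   end
# }
# puts scrape_to_json(["/page1", "/page2"], crawl_one)
```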
When I use a single proxy server, it may sometimes stop working, so I want to be able to configure several proxy servers.
For example, with a list of three proxies [proxy1.example.com, proxy2.example.com, proxy3.example.com], I would use proxy1.example.com first; if proxy1.example.com dies or refuses the connection, I would try proxy2.example.com, and so on.
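One way to get that behaviour today, sketched in plain Ruby (`crawl_with_failover` is my own helper, not a Wombat API): try each proxy in order and fall through to the next on any error.

```ruby
# Try each proxy in order; return the first successful result, or raise if
# every proxy fails. The crawl itself is passed in as a block.
def crawl_with_failover(proxies)
  last_error = nil
  proxies.each do |proxy|
    begin
      return yield(proxy)
    rescue StandardError => e
      last_error = e
      warn "proxy #{proxy} failed: #{e.message}"
    end
  end
  raise "all proxies failed: #{last_error&.message}"
end

# Hypothetical usage with Wombat (port and URL are illustrative):
# result = crawl_with_failover(%w[proxy1.example.com proxy2.example.com proxy3.example.com]) do |proxy|
#   Wombat.configure { |c| c.set_proxy(proxy, 8080) }
#   Wombat.crawl { base_url "https://example.com"; path "/" }
# end
```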
Hi,
Nice little gem so far. I'm trying to scrape info across a number of pages. What I'd like to do is something like this pseudo code:
load page
scrape page content
if next page link, follow link and loop to top
else done
Is there any way I can achieve this currently?
Thanks
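The pseudo code above can be driven from plain Ruby around the crawl: keep crawling while a next-page link is found. The sketch below injects the per-page fetch as a callable; `crawl_all_pages`, `next_path`, `content`, and the selectors in the comment are my own illustrative names, not Wombat API.

```ruby
# Follow "next page" links until none is found. fetch_page.(path) is expected
# to return a hash with the scraped content plus the next page's path (or nil).
# max_pages guards against an accidental infinite loop.
def crawl_all_pages(start_path, fetch_page, max_pages: 1000)
  results = []
  path = start_path
  while path && results.size < max_pages
    page = fetch_page.call(path)
    results << page
    path = page["next_path"]
  end
  results
end

# Hypothetical Wombat-backed fetch (selectors and URL are illustrative):
# fetch_page = ->(p) {
#   Wombat.crawl do
#     base_url "https://example.com"
#     path p
#     content "css=.content"
#     next_path "xpath=//a[@rel='next']/@href"
#   end
# }
```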
Should be fixed to work with Ruby 1.9.x or greater
I am trying to scrape a website created in Angular. With Angular, the scripts on the page need to run to produce the dynamic data, so all I get when scraping is the static markup. Is there a way to get dynamically generated data with this gem?
This is more of a question: would it be possible to set base_url and path dynamically at run time?
I upgraded to wombat 2.5.0 and my script stopped working. The following script works on wombat 2.4.0 but fails in 2.5.0. I took some code out in order to isolate the problem.
require 'wombat'

result = Wombat.crawl do
  base_url "http://www.icy-veins.com/"
  path "heroes/hero-guides"
  heroes "css=.page_content .nav_content_block_entry_heroes_hero", :iterator do
    name "xpath=."
    builds "xpath=./a", :follow do
      title "css=h1"
    end
  end
end
I get the following error:
/usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/follow.rb:10:in `block (2 levels) in locate': undefined method `mechanize_page' for #<Nokogiri::XML::Element:0x007ffbe44e8760> (NoMethodError)
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `upto'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `each'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/follow.rb:9:in `flat_map'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/follow.rb:9:in `block in locate'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:18:in `locate'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/follow.rb:8:in `locate'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:35:in `block (2 levels) in filter_properties'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:35:in `map'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:35:in `block in filter_properties'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:32:in `tap'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:32:in `filter_properties'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/iterator.rb:11:in `block (2 levels) in locate'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `upto'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `each'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/iterator.rb:10:in `flat_map'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/iterator.rb:10:in `block in locate'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:18:in `locate'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/iterator.rb:9:in `locate'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:35:in `block (2 levels) in filter_properties'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:35:in `map'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:35:in `block in filter_properties'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:32:in `tap'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:32:in `filter_properties'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/property_group.rb:9:in `block in locate'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:18:in `locate'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/property_group.rb:8:in `locate'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/processing/parser.rb:43:in `parse'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/crawler.rb:30:in `crawl'
from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat.rb:13:in `crawl'
from crawler.rb:4:in `<main>'
I want this script to return an array of hashes having [name, url]. But since the iterator returns what is INSIDE the a tag, I can't figure out how to get the info.
require 'wombat'

video_url = 'https://vimeo.com/26594942'
result = Wombat.crawl do
  base_url video_url + "/likes"
  path "/"
  likers "css=.browse_people li a", :iterator do
    name "css=p.title"
    url "[href]", :html do |link|
      link
    end
  end
end
puts result
How can I paginate and scrape data on a page with an infinite scroll mechanism?
Wombat can't parse local files:
/.gem/ruby/2.3.1/gems/wombat-2.5.1/lib/wombat/processing/parser.rb:33:in `block (2 levels) in initialize': undefined method `content_type' for #<Mechanize::FileResponse:0x007fe856a62d90> (NoMethodError)
Is it possible to cache the results of pages to make successive runs faster?
Hi,
I really wanted to try your work, so I copied and pasted this code:
require 'wombat'

puts Wombat.crawl do
  base_url "http://www.github.com"
  path "/"
  headline "xpath=//h1"
  what_is "css=.column.secondary p", :html
  repositories "css=a.repo", :list
  explore "xpath=//ul/li[2]/a" do |e|
    e.gsub(/Explore/, "LOVE")
  end
  benefits do
    first_benefit "css=.column.leftmost h3"
    second_benefit "css=.column.leftmid h3"
    third_benefit "css=.column.rightmid h3"
    fourth_benefit "css=.column.rightmost h3"
  end
end
But it fails:
/home/black/.rvm/gems/ruby-1.9.3-p362/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:595:in `resolve': absolute URL needed (not "") (ArgumentError)
from /home/black/.rvm/gems/ruby-1.9.3-p362/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:214:in `fetch'
from /home/black/.rvm/gems/ruby-1.9.3-p362/gems/mechanize-2.5.1/lib/mechanize.rb:407:in `get'
from /home/black/.rvm/gems/ruby-1.9.3-p362/gems/wombat-2.1.0/lib/wombat/processing/parser.rb:41:in `parser_for'
from /home/black/.rvm/gems/ruby-1.9.3-p362/gems/wombat-2.1.0/lib/wombat/processing/parser.rb:29:in `parse'
from /home/black/.rvm/gems/ruby-1.9.3-p362/gems/wombat-2.1.0/lib/wombat/crawler.rb:30:in `crawl'
from /home/black/.rvm/gems/ruby-1.9.3-p362/gems/wombat-2.1.0/lib/wombat.rb:10:in `crawl'
from amazon.rb:3:in `<main>'
I have the following gems:
activesupport (3.2.11)
blankslate (2.1.2.4)
bluecloth (2.2.0)
bundler (1.2.3)
capistrano (2.14.1)
curb (0.8.3)
domain_name (0.5.7)
gli (2.5.3)
highline (1.6.15)
i18n (0.6.1)
json (1.7.6)
mechanize (2.5.1)
mg (0.0.8)
mime-types (1.19)
multi_json (1.5.0)
net-http-digest_auth (1.2.1)
net-http-persistent (2.8)
net-scp (1.0.4)
net-sftp (2.0.5)
net-ssh (2.6.3)
net-ssh-gateway (1.1.0)
nokogiri (1.5.6)
ntlm-http (0.1.1)
parslet (1.5.0)
pdfkit (0.5.2)
rack (1.4.2)
rack-protection (1.3.2)
rake (10.0.3)
rest-client (1.6.7)
rmagick (2.13.1)
rubygems-bundler (1.1.0)
rvm (1.11.3.5)
showoff (0.7.0)
sinatra (1.3.3)
tilt (1.3.3)
unf (0.0.5)
unf_ext (0.0.5)
webrobots (0.0.13)
wombat (2.1.0)
Do you have any idea what the problem is?
Thanks in advance for your help.
Sometimes I need to emulate XHR requests to sites and pages (such as AJAX) or add different headers to the request. It would be great to have the ability to specify the type of request (GET, POST, PUT, etc.).
Is there a way to put the url inside the follow?
products 'css=.products a', :follow do
  name css: 'h1'
  price css: '.price'
  url "??????????" # How to get the url of the followed page
end
I can't parse this link: http://www.sabah.com.tr/yazarlar/barlas/2014/11/30/zenginin-mali-gercekten-zugurdun-cenesini-yorar-mi
Why does it give an error? I think it's because this link is missing the Content-Type header. How can I work around that?
I have tried
"css=p.someclass + p"
"css=p.someclass ~ p"
Neither of these worked.
Since GitHub's home page is different now, I think it is better to update the example in README.md as well.
From what I can tell, wombat isn't a crawler. Is this correct?
Web scraping, to use a minimal definition, is the process of processing a web document and extracting information out of it. You can do web scraping without doing web crawling. (Wikipedia on web scraping)
Web crawling, to use a minimal definition, is the process of iteratively finding and fetching web links starting from a list of seed URLs. Strictly speaking, to do web crawling, you have to do some degree of web scraping (to extract the URLs). (Wikipedia on web crawlers)
The "API Documentation" link on: http://felipecsl.com/wombat/ points to http://rubydoc.info/gems/wombat/2.1.1/frames
On that page, the "API Documentation" link points to https://www.rubydoc.info/gems/wombat/2.0.0/frames and so on.
Unrelated: the Gemnasium badge is reporting errors.
Hope this helps a little. I'd send a PR, but I'm not using the gem right now.
I found that a page function exists, and I guess it is used for a URL parameter named page, so I wrote page: 2. But the page function does not work; it gives me ArgumentError: wrong number of arguments (given 1, expected 0). Did I use the page function incorrectly?
Thanks for the work you do with this gem
Hello, I need to remove multiple nodes with the CSS classes .media, .ads, and .cite-content. How do I remove nodes by CSS selector?
class ListCrawler
  include Wombat::Crawler

  base_url "https://rpp.pe"
  path "/politica/actualidad/ministerio-publico-las-discrepancias-entre-pablo-sanchez-y-pedro-chavarry-noticia-1142342"

  explore css: '#article-body' do |e|
    e remove: '.media'
    e remove: '.ads'
    e remove: '.cite-content'
  end
end

pp ListCrawler.new.crawl
# ERROR!
With the standalone Mechanize gem it works:
mechanize = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}
page = mechanize.get('https://rpp.pe/politica/actualidad/ministerio-publico-las-discrepancias-entre-pablo-sanchez-y-pedro-chavarry-noticia-1142342')
text = page.at('#article-body')
text.at_css(".media").remove
text.at_css(".ads").remove
text.at_css(".cite-content").remove
puts text
Can someone who is an expert help me?
Thanks!
Could I set a delay between following links (with :follow)?
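I don't know of a built-in delay option; one workaround is a small throttle object called from inside each :follow block (`Throttle` is my own helper, not part of Wombat, and the selectors in the comment are illustrative):

```ruby
# Minimal rate limiter: guarantees at least `interval` seconds between calls.
class Throttle
  def initialize(interval)
    @interval = interval
    @last = nil
  end

  def wait
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    sleep(@interval - (now - @last)) if @last && (now - @last) < @interval
    @last = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

# Hypothetical usage: call throttle.wait at the top of each :follow block
# (the block should still see the enclosing local via its closure).
# throttle = Throttle.new(2.0)
# builds "xpath=./a", :follow do
#   throttle.wait
#   title "css=h1"
# end
```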
Wombat has a very good DSL, but lacks some crawling features, like going through pages. Do you have a roadmap or plans for which parts of Wombat you will work on next?
Is there a way to retrieve src or data attributes rather than the inner text?
Thanks.
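XPath can address an attribute node directly, which sidesteps the inner-text default. The snippet below demonstrates the selector shape using only Ruby's stdlib REXML; in Wombat the equivalent selector would presumably be something like "xpath=//img/@src" with :list, but that mapping is an untested assumption on my part.

```ruby
require 'rexml/document'

# Demonstrate selecting attribute values (not inner text) with XPath.
html = '<div><img src="/a.png"/><img src="/b.png" data-id="42"/></div>'
doc = REXML::Document.new(html)

# //img/@src matches the attribute nodes themselves.
srcs = REXML::XPath.match(doc, "//img/@src").map(&:value)
data_ids = REXML::XPath.match(doc, "//img/@data-id").map(&:value)

puts srcs.inspect      # the src attribute values, in document order
puts data_ids.inspect  # only images that carry data-id
```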
I tried out a slightly modified version of the example provided:
require 'wombat'

class HeadersScraper
  include Wombat::Crawler

  base_url "http://www.rubygems.org"
  path "/"
  set_proxy("localhost", 8888)
end
I get this error:
/Users/michael/.rvm/gems/ruby-2.3.1/gems/wombat-2.5.1/lib/wombat/property/locators/factory.rb:34:in `locator_for': Unknown property format 8888. (Wombat::Property::Locators::UnknownTypeException)
Here's my setup:
Moving this to the configure block is a workaround:
Wombat.configure do |config|
  config.set_proxy("localhost", 8888)
end
Sorry, I made a silly mistake. I just tried wombat in my pry session and was unable to use this great gem. I need to solve this ASAP to complete my task.
require 'wombat'
Gem::LoadError: Unable to activate mechanize-2.7.3, because mime-types-1.25.1 conflicts with mime-types (~> 2.0)
When trying to use a pre-fetched Mechanize page as described in the wiki:
Wombat.crawl do
  m = Mechanize.new
  mp = m.get 'http://www.google.com'
  page mp
end
I get an error with this stack trace:
crawler.rb:8:in `block in <main>': wrong number of arguments (1 for 0) (ArgumentError)
from /Users/derantell/.rbenv/versions/2.1.2/lib/ruby/gems/2.1.0/gems/wombat-2.3.0/lib/wombat/crawler.rb:22:in `instance_eval'
from /Users/derantell/.rbenv/versions/2.1.2/lib/ruby/gems/2.1.0/gems/wombat-2.3.0/lib/wombat/crawler.rb:22:in `crawl'
from /Users/derantell/.rbenv/versions/2.1.2/lib/ruby/gems/2.1.0/gems/wombat-2.3.0/lib/wombat.rb:13:in `crawl'
from crawler.rb:4:in `<main>'
Using @metadata_dup.page mp or renaming Metadata::page to something else works, so my guess is that the attr_accessor :page which Crawler includes from Parser is found and method_missing is never invoked.
Versions used: ruby 2.1.2, mechanize 2.7.3 and wombat 2.3.0.
Hi,
are you planning to release a new version soon? Would be nice to see a 2.0.1 with the Proxy-Settings :)
Hi guys
I want to mock/stub responses (web pages) in order to be able to test my crawler without internet access.
What can you suggest?
Thanks
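Besides stubbing libraries like WebMock or recorded fixtures with VCR (both work with Mechanize, which Wombat uses underneath), a dependency-free option is to serve fixture HTML from a local socket and point base_url at it. A minimal sketch using only the standard library (`with_fixture_server` is my own name; the Wombat usage in the comment is illustrative):

```ruby
require 'socket'
require 'net/http'
require 'uri'

# Serve a fixed HTML body on an ephemeral localhost port for the duration of
# the block, so a crawler can be exercised with no internet access.
def with_fixture_server(body)
  server = TCPServer.new("127.0.0.1", 0)
  port = server.addr[1]
  thread = Thread.new do
    loop do
      client = server.accept
      # Drain the request (request line + headers) before responding.
      while (line = client.gets) && line != "\r\n"; end
      client.write "HTTP/1.1 200 OK\r\n" \
                   "Content-Type: text/html\r\n" \
                   "Content-Length: #{body.bytesize}\r\n" \
                   "Connection: close\r\n\r\n#{body}"
      client.close
    end
  end
  yield port
ensure
  thread&.kill
  server&.close
end

# Hypothetical usage with Wombat:
# with_fixture_server("<h1>fixture</h1>") do |port|
#   Wombat.crawl do
#     base_url "http://127.0.0.1:#{port}"
#     path "/"
#     heading "css=h1"
#   end
# end
```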
Is it possible to follow more than one link using Wombat? I saw there was a closed issue about this but couldn't find the solution.
My Test
some_text xpath: '//*[@id="Content_Regs"]/table[1]/tbody/tr/td[2]/table/tbody/tr[4]/td[2]/text()[2]'
Return
{"some_text"=>nil}
Console Google Chrome
$x('//*[@id="Content_Regs"]/table[1]/tbody/tr/td[2]/table/tbody/tr[4]/td[2]/text()[2]')
["
São Paulo - SP"
]
What might be happening? Did I forget something?
Is it possible to crawl through the images and get the image source?
I think we should have a function to customize HTTP headers in this library.
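Until such a function exists, one workaround is pre-fetching with Mechanize, whose get accepts a headers hash as its fourth argument, and handing the page to Wombat via the pre-fetched page technique described in the wiki. The URL, selector, and header values below are illustrative:

```ruby
# Headers we want on the request (values are illustrative):
headers = {
  "X-Requested-With" => "XMLHttpRequest", # common marker for XHR requests
  "Accept"           => "application/json",
}

# Fetch the page ourselves with Mechanize (which Wombat uses underneath),
# then hand it to Wombat instead of letting Wombat perform the request:
# m = Mechanize.new
# mp = m.get("https://example.com/endpoint", [], nil, headers)
# result = Wombat.crawl do
#   page mp
#   payload "css=body"
# end
```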
Hey, is there a way to get the URL of the followed link? Something like:
products 'css=.products-grid .item .product-name a', :follow do |url|
  url url
  title 'css=.product-name h1'
end
When the request URL is http://info.ntust.edu.tw/faith/edua/app/qry_linkoutline.aspx?semester=1031&courseno=ET5117701 I get the error
encoding error : input conversion failed due to input error, bytes 0xA8 0x8B 0xE8 0xB3
but other course pages are fine. I guess the page uses special characters.
I have looked into it, and I think this is a Mechanize problem. If I override the get method to
  doc = original_get *args
  doc.encoding = 'utf-8'
  doc
it works fine, but in some cases other normal pages then run into trouble...
I have a set of sites I index regularly where I store various field definitions alongside xpath parameters.
e.g.
{"field_name" => "title", "xpath" => "//div[@itemprop=\"title\"]"}
{"field_name" => "description", "xpath" => "//div[contains(@class,'description')]"}
Is there a way I can use that with the Wombat DSL? I can't work out how to make the field_name show up as the key in the right way.
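Since property declarations in the crawl block go through method_missing, `send` should let the field name come from data. To show the mechanism without the gem, here is a stub DSL object standing in for the crawl context (`StubDsl` is mine; the Wombat usage in the trailing comment is an untested assumption):

```ruby
FIELDS = [
  { "field_name" => "title",       "xpath" => "//div[@itemprop=\"title\"]" },
  { "field_name" => "description", "xpath" => "//div[contains(@class,'description')]" },
]

# Stand-in for a method_missing-based DSL context (like Wombat's crawl block):
# any undeclared method call records a property name and its selector.
class StubDsl
  attr_reader :declared

  def initialize
    @declared = {}
  end

  def method_missing(name, selector = nil, *_args)
    @declared[name.to_s] = selector
  end

  def respond_to_missing?(*_args)
    true
  end
end

dsl = StubDsl.new
FIELDS.each { |f| dsl.send(f["field_name"], xpath: f["xpath"]) }
puts dsl.declared.inspect

# With Wombat the same loop would run inside the block (untested):
# Wombat.crawl do
#   base_url "https://example.com"
#   path "/"
#   FIELDS.each { |f| send(f["field_name"], xpath: f["xpath"]) }
# end
```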
Hello,
Is it at all possible to modify the HTML before it is parsed by Mechanize or Nokogiri? As an example, there's a page I want to parse which incorrectly closes an HTML comment with --!> (rather than -->).
Is there any way of doing this already? If not, I'll add it and submit a pull request.
Thanks,
Sam
Hey, is it possible to remove all previously stored cookies between requests? In Mechanize there is agent.cookie_jar.clear!; is this possible in Wombat?
I'm pretty new to all of this :)
Thanks
Hi, I got the error wombat.rb:14:in `block (3 levels) in <main>': undefined method `gsub' for nil:NilClass (NoMethodError) when running the example code on the introduction page of Wombat.
I guess the reason for this error is that GitHub has changed its home page's markup, which makes the variable e nil.
Does anyone know how to fix this problem? I'm not familiar with XPath; I just started learning it.
wombat.rb is exactly as in the README; I added a require 'net/http/digest_auth', but it didn't matter because of load order.
% ruby --version
ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-darwin19]
% cat Gemfile
source "https://rubygems.org"
gem "wombat"
% bundle check
The Gemfile's dependencies are satisfied
% bundle list
Gems included by the bundle:
* activesupport (6.0.2.2)
* concurrent-ruby (1.1.6)
* connection_pool (2.2.2)
* domain_name (0.5.20190701)
* http-accept (1.7.0)
* http-cookie (1.0.3)
* i18n (1.8.2)
* mechanize (2.7.6)
* mime-types (3.3.1)
* mime-types-data (3.2019.1009)
* mini_portile2 (2.4.0)
* minitest (5.14.0)
* net-http-digest_auth (1.4.1)
* net-http-persistent (3.1.0)
* netrc (0.11.0)
* nokogiri (1.10.9)
* ntlm-http (0.1.1)
* rest-client (2.1.0)
* thread_safe (0.3.6)
* tzinfo (1.2.6)
* unf (0.1.4)
* unf_ext (0.0.7.7)
* webrobots (0.1.2)
* wombat (2.10.0)
* zeitwerk (2.3.0)
% bundle exec ruby wombat.rb
15: from wombat.rb:1:in `<main>'
14: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler.rb:174:in `require'
13: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler/runtime.rb:58:in `require'
12: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler/runtime.rb:58:in `each'
11: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler/runtime.rb:69:in `block in require'
10: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler/runtime.rb:69:in `each'
9: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler/runtime.rb:74:in `block (2 levels) in require'
8: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler/runtime.rb:74:in `require'
7: from ruby/2.6.5/gems/wombat-2.10.0/lib/wombat.rb:3:in `<top (required)>'
6: from ruby/2.6.5/gems/wombat-2.10.0/lib/wombat.rb:3:in `require'
5: from ruby/2.6.5/gems/wombat-2.10.0/lib/wombat/crawler.rb:4:in `<top (required)>'
4: from ruby/2.6.5/gems/wombat-2.10.0/lib/wombat/crawler.rb:4:in `require'
3: from ruby/2.6.5/gems/wombat-2.10.0/lib/wombat/processing/parser.rb:4:in `<top (required)>'
2: from ruby/2.6.5/gems/wombat-2.10.0/lib/wombat/processing/parser.rb:4:in `require'
1: from ruby/2.6.5/gems/mechanize-2.7.6/lib/mechanize.rb:5:in `<top (required)>'
ruby/2.6.5/gems/mechanize-2.7.6/lib/mechanize.rb:5:in `require': cannot load such file -- net/http/digest_auth (LoadError)
Small API improvement idea: instead of
stuff({ css: 'div.some-class' }, :list)
I want to be able to write:
stuff :list, css: 'div.some-class'
To me, this reads more like idiomatic ruby.
For the edge case where a user specifies both positional and hash arguments, we could just accept the first one? Or merge the two?
If you think this is worth it, I'd be happy to make a PR 😄
I've used Wombat before for projects like Noodles that require a simple web scraper, but now I'm trying to pull back the entirety of a web page. I want everything (the head, the body, the contents), not just something specific. Is Wombat able to do this?
I have a list of links to extract from a web page, but how do I select the href attribute when I use an iterator? The following code always selects the first link.
result = Wombat.crawl do
  base_url provider_urls.url
  path '/'
  articles 'css=.article table.olt td.title', :iterator do
    title({ css: "a" })
    article_path({ xpath: Nokogiri::CSS.xpath_for(".article table.olt td.title a")[0] + '/@href' })
  end
end
http://stackoverflow.com/questions/30498120/ruby-wombot-select-link-url-attribute-within-iterator
Hi there,
I am trying to screen-scrape information from a website for Dashing. When I try to send_event, nothing seems to show up. Can this gem be used with Dashing? I'm just writing it in a Ruby file, and nothing appears. Any help would be much appreciated.
Cheers
Hello,
I noticed some strange behaviour in Wombat. Say I want to crawl two websites. At first I was using Typhoeus and regexes, but one website constantly gave me a 302, and then I found Wombat. The interesting thing is that Wombat works perfectly for that site, but when I try Wombat on the other website I get this error:
/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:308:in `fetch': 400 => Net::HTTPBadRequest for "THE_WEBSITE_URL" -- unhandled response (Mechanize::ResponseCodeError)
The URL is correct; I tried it in the browser and it worked. Can somebody help me with this one? I also don't have puts in front of Wombat.crawl do ... because I saw that mentioned as a problem too.
Thank you in advance, and sorry for my English!