
felipecsl / wombat

1.3K stars · 51 watchers · 132 forks · 2.59 MB

Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.

Home Page: https://felipecsl.github.io/wombat/

License: MIT License

Ruby 100.00%
ruby scraper crawler dsl

wombat's Introduction

Wombat


Web scraper with an elegant DSL that parses structured data from web pages.

Usage:

gem install wombat

Scraping a page:

The simplest way to use Wombat is by calling Wombat.crawl and passing it a block:

require 'wombat'

Wombat.crawl do
  base_url "https://www.github.com"
  path "/"

  headline xpath: "//h1"
  subheading css: "p.alt-lead"

  what_is({ css: ".one-fourth h4" }, :list)

  links do
    explore xpath: '/html/body/header/div/div/nav[1]/a[4]' do |e|
      e.gsub(/Explore/, "Love")
    end

    features css: '.nav-item-opensource'
    business css: '.nav-item-business'
  end
end
The code above returns the following hash:
{
  "headline"=>"How people build software",
  "subheading"=>"Millions of developers use GitHub to build personal projects, support their businesses, and work together on open source technologies.",
  "what_is"=>[
    "For everything you build",
    "A better way to work",
    "Millions of projects",
    "One platform, from start to finish"
  ],
  "links"=>{
    "explore"=>"Love",
    "features"=>"Open source",
    "business"=>"Business"
  }
}

This is just a sneak peek of what Wombat can do. For the complete documentation, please check the project wiki.

Contributing to Wombat

  • Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet
  • Check out the issue tracker to make sure someone hasn't already requested and/or contributed it
  • Fork the project
  • Start a feature/bugfix branch
  • Commit and push until you are happy with your contribution
  • Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
  • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or it is otherwise necessary, that is fine, but please isolate your changes to their own commit so I can cherry-pick around them.

Contributors

Copyright

Copyright (c) 2019 Felipe Lima. See LICENSE.txt for further details.

License

MIT. See LICENSE.txt.

wombat's People

Contributors

akirill0v, algo31031, bluekeys, cyu, danielnc, dependabot[bot], eguneys, felipecsl, fossabot, orendon, petergoldstein, phortx, plerohellec, santib, sigi, tricknotes, viniciusdaniel, zloy


wombat's Issues

Encoding Problem

When the requested URL is http://info.ntust.edu.tw/faith/edua/app/qry_linkoutline.aspx?semester=1031&courseno=ET5117701, I get the error

encoding error : input conversion failed due to input error, bytes 0xA8 0x8B 0xE8 0xB3

but other course pages are fine. I guess the page uses special characters. I have tried this, and I think it is a Mechanize problem; if I override the get method to

doc = original_get *args
doc.encoding = 'utf-8'
doc

it works fine, but in some cases other, normal pages run into trouble.
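
For completeness, that override written out as a full monkey-patch — just a sketch; the alias name original_get mirrors the snippet above, and forcing UTF-8 on every page carries exactly the risk described:

require 'mechanize'

class Mechanize
  alias_method :original_get, :get

  # Force every fetched document to UTF-8 (may corrupt pages that are
  # genuinely in another encoding, as noted above).
  def get(*args)
    doc = original_get(*args)
    doc.encoding = 'utf-8' if doc.respond_to?(:encoding=)
    doc
  end
end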

Support multiple proxy servers

When I use a single proxy server, it may stop working at some point, so I want to be able to use multiple proxy servers.

E.g. I have a list of 3 proxy servers: [proxy1.example.com, proxy2.example.com, proxy3.example.com]
First I use proxy1.example.com; if proxy1.example.com dies or refuses the connection, I try proxy2.example.com, and so on.
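
Until something like this is built in, one workaround is a retry loop around the crawl, switching proxies on failure — a sketch that leans on the Wombat.configure/set_proxy API mentioned in the set_proxy issue further down; the proxy hosts, port 8080, and the rescue list are all assumptions to adapt:

require 'wombat'

PROXIES = %w[proxy1.example.com proxy2.example.com proxy3.example.com]

def crawl_with_failover
  PROXIES.each do |host|
    Wombat.configure { |c| c.set_proxy(host, 8080) }
    begin
      return Wombat.crawl do
        base_url "https://example.com"
        path "/"

        headline xpath: "//h1"
      end
    rescue Mechanize::ResponseCodeError, SystemCallError, Timeout::Error
      next # this proxy failed or refused the connection; try the next one
    end
  end
  raise "all proxies failed"
end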

Follow more than 1 link?

Is it possible to follow more than one link using Wombat? I saw there was an issue that's closed, but I couldn't find the solution.

Get url of followed link

Hey, is there a way to get the URL of the followed link? Something like:

products 'css=.products-grid .item .product-name a', :follow do |url|
  url url
  title 'css=.product-name h1'
end

Unable to activate mechanize-2.7.3. Gem::LoadError

Sorry, I made a silly mistake. I just tried Wombat in my pry session and was unable to use this great gem.
I need to solve this ASAP to complete my task.

require 'wombat'
Gem::LoadError: Unable to activate mechanize-2.7.3, because mime-types-1.25.1 conflicts with mime-types (~> 2.0)
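
The error means some other gem activated mime-types 1.25.1 before mechanize, which requires mime-types ~> 2.0. A plausible fix (versions here are assumptions read off the error message) is to run everything under Bundler with a compatible pin:

# Gemfile
source "https://rubygems.org"

gem "wombat"
gem "mime-types", "~> 2.0" # satisfies mechanize 2.7.3's constraint

Then bundle install and start the session with bundle exec pry so the resolved versions are the ones that get activated.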

Modifying response before parsing

Hello,

Is it at all possible to modify the HTML before Mechanize or Nokogiri attempts to parse it? As an example, there's a page I want to parse which incorrectly closes an HTML comment with --!> (rather than -->).

Is there any way of doing this already? If not, I'll add it and submit a pull request.

Thanks,
Sam
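
Until such a hook exists in Wombat itself, one workaround is to fetch the raw body, repair it, and parse it directly — a sketch that bypasses Wombat and uses plain Mechanize plus Nokogiri (the URL is a placeholder; the --!> fix is the one described above):

require 'mechanize'
require 'nokogiri'

agent = Mechanize.new
raw   = agent.get_file('https://example.com/broken-page') # raw response body as a String
fixed = raw.gsub('--!>', '-->')                           # repair the malformed comment close
doc   = Nokogiri::HTML(fixed)

puts doc.at_css('h1')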

Return entire webpage

I've used Wombat before for projects like Noodles that require a simple web scraper, but now I'm trying to pull back the entirety of the web page. I want everything — the head, the body, the contents — not just something specific. Is Wombat able to do this?
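
One way this might work — a minimal sketch assuming the :html property format (used elsewhere in these issues) returns the matched node's full markup; the URL is a placeholder:

require 'wombat'

result = Wombat.crawl do
  base_url "https://example.com"
  path "/"

  full_page({ xpath: "/html" }, :html)
end

puts result["full_page"] # the whole document: head, body, everything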

400 Bad Request on some websites.

Hello,
I noticed some strange behaviour in Wombat. Let's say I want to crawl 2 websites. At first I was using Typhoeus and regexes, and one website constantly gave me a 302; then I found Wombat. The interesting thing is that Wombat works perfectly on that site, but when I try it on the other website I get this error:

/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:308:in `fetch': 400 => Net::HTTPBadRequest for "THE_WEBSITE_URL" -- unhandled response (Mechanize::ResponseCodeError)

And the URL is correct: I tried it in the browser and it worked. Can somebody help me with this one? Also, I don't have puts in front of Wombat.crawl do ..., because I saw that reported as a problem too.
Thank you in advance, and sorry for my English!

Is it possible to programmatically create the properties to search for?

I have a set of sites I index regularly where I store various field definitions alongside xpath parameters.
e.g.

   {"field_name" => "title", "xpath" => "//div[@itemprop=\"title\"]"}
    {"field_name" => "description", "xpath" => "//div[contains(@class,'description')]"}

Is there a way I can use that with the Wombat DSL? I can't work out how to make each field_name become the property name in the result.
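
Since the DSL properties appear to be created dynamically (via method_missing), one plausible approach is to call send inside the crawl block — an untested sketch with a placeholder URL:

require 'wombat'

fields = [
  { "field_name" => "title",       "xpath" => "//div[@itemprop=\"title\"]" },
  { "field_name" => "description", "xpath" => "//div[contains(@class,'description')]" }
]

result = Wombat.crawl do
  base_url "https://example.com"
  path "/"

  # The block is instance_eval'd, but locals such as `fields` remain visible,
  # and send should route each name through the DSL's method_missing.
  fields.each do |f|
    send(f["field_name"], xpath: f["xpath"])
  end
end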

How do I remove a node?

Thanks for the work you do with this gem

Hello, I need to remove multiple nodes by their CSS classes:

.media
.ads
.cite-content

How do I remove these nodes by CSS? This is what I tried:

class ListCrawler
  include Wombat::Crawler

  base_url "https://rpp.pe"
  path "/politica/actualidad/ministerio-publico-las-discrepancias-entre-pablo-sanchez-y-pedro-chavarry-noticia-1142342"

  explore css: '#article-body' do |e|
    e remove: '.media'
    e remove: '.ads'
    e remove: '.cite-content'
  end
end

pp ListCrawler.new.crawl
# ERROR!

With the standalone Mechanize gem it works:

mechanize = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

page = mechanize.get('https://rpp.pe/politica/actualidad/ministerio-publico-las-discrepancias-entre-pablo-sanchez-y-pedro-chavarry-noticia-1142342')

text = page.at('#article-body')

text.at_css(".media").remove
text.at_css(".ads").remove
text.at_css(".cite-content").remove

puts text

Can someone who is an expert help me?

Thanks!
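
One possible Wombat adaptation of that Mechanize approach — a hedged sketch assuming the :html property format hands the block the matched node's markup, which can then be cleaned up with Nokogiri:

require 'wombat'
require 'nokogiri'

result = Wombat.crawl do
  base_url "https://rpp.pe"
  path "/politica/actualidad/ministerio-publico-las-discrepancias-entre-pablo-sanchez-y-pedro-chavarry-noticia-1142342"

  article({ css: '#article-body' }, :html) do |html|
    doc = Nokogiri::HTML.fragment(html)
    # Drop every unwanted node before returning the cleaned markup.
    %w[.media .ads .cite-content].each { |sel| doc.css(sel).each(&:remove) }
    doc.to_html
  end
end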

Backwards compatibility issue

I upgraded to wombat 2.5.0 and my script stopped working. The following script works on wombat 2.4.0 but fails in 2.5.0. I took some code out in order to isolate the problem.

require 'wombat'

result = Wombat.crawl do
  base_url "http://www.icy-veins.com/"
  path "heroes/hero-guides"

  heroes "css=.page_content .nav_content_block_entry_heroes_hero", :iterator do
    name "xpath=."
    builds "xpath=./a", :follow do
      title "css=h1"
    end
  end
end

I get the following error:

/usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/follow.rb:10:in `block (2 levels) in locate': undefined method `mechanize_page' for #<Nokogiri::XML::Element:0x007ffbe44e8760> (NoMethodError)
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:187:in `block in each'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `upto'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `each'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/follow.rb:9:in `flat_map'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/follow.rb:9:in `block in locate'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:18:in `locate'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/follow.rb:8:in `locate'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:35:in `block (2 levels) in filter_properties'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:35:in `map'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:35:in `block in filter_properties'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:32:in `tap'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:32:in `filter_properties'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/iterator.rb:11:in `block (2 levels) in locate'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:187:in `block in each'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `upto'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `each'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/iterator.rb:10:in `flat_map'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/iterator.rb:10:in `block in locate'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:18:in `locate'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/iterator.rb:9:in `locate'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:35:in `block (2 levels) in filter_properties'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:35:in `map'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:35:in `block in filter_properties'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:32:in `tap'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:32:in `filter_properties'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/property_group.rb:9:in `block in locate'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/base.rb:18:in `locate'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/property/locators/property_group.rb:8:in `locate'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/processing/parser.rb:43:in `parse'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat/crawler.rb:30:in `crawl'
    from /usr/local/opt/rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/wombat-2.5.0/lib/wombat.rb:13:in `crawl'
    from crawler.rb:4:in `<main>'

Roadmap

Wombat has a very good DSL, but it lacks some crawling features, like going through multiple pages. Do you have a roadmap or plans for which parts of Wombat you will work on next?

Following pagination

Hi,

Nice little gem so far. I'm trying to scrape info across a number of pages. What I'd like to do is something like this pseudocode:

load page
scrape page content
if next page link, follow link and loop to top
else done

Is there any way I can achieve this currently?
Thanks
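
There is no pagination primitive in the DSL as far as I can tell, but a driver loop around Wombat.crawl approximates the pseudocode — a sketch with placeholder URL and selectors, plus the assumption that an attribute XPath (.../@href) returns the attribute's value:

require 'wombat'

items = []
next_path = "/articles"

while next_path
  page = Wombat.crawl do
    base_url "https://example.com"
    path next_path

    titles({ css: ".article h2" }, :list)
    next_link xpath: "//a[@rel='next']/@href"
  end

  items.concat(page["titles"])
  next_path = page["next_link"] # nil when there is no next-page link, ending the loop
end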

LoadError in sample code → mechanize.rb: cannot load net/http/digest_auth

wombat.rb is exactly as in the README; I added a require 'net/http/digest_auth', but it didn't matter because of load order.

% ruby --version
ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-darwin19]
% cat Gemfile
source "https://rubygems.org"
gem "wombat"
% bundle check
The Gemfile's dependencies are satisfied
% bundle list 
Gems included by the bundle:
  * activesupport (6.0.2.2)
  * concurrent-ruby (1.1.6)
  * connection_pool (2.2.2)
  * domain_name (0.5.20190701)
  * http-accept (1.7.0)
  * http-cookie (1.0.3)
  * i18n (1.8.2)
  * mechanize (2.7.6)
  * mime-types (3.3.1)
  * mime-types-data (3.2019.1009)
  * mini_portile2 (2.4.0)
  * minitest (5.14.0)
  * net-http-digest_auth (1.4.1)
  * net-http-persistent (3.1.0)
  * netrc (0.11.0)
  * nokogiri (1.10.9)
  * ntlm-http (0.1.1)
  * rest-client (2.1.0)
  * thread_safe (0.3.6)
  * tzinfo (1.2.6)
  * unf (0.1.4)
  * unf_ext (0.0.7.7)
  * webrobots (0.1.2)
  * wombat (2.10.0)
  * zeitwerk (2.3.0)
% bundle exec ruby wombat.rb
15: from wombat.rb:1:in `<main>'
14: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler.rb:174:in `require'
13: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler/runtime.rb:58:in `require'
12: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler/runtime.rb:58:in `each'
11: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler/runtime.rb:69:in `block in require'
10: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler/runtime.rb:69:in `each'
  9: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler/runtime.rb:74:in `block (2 levels) in require'
  8: from ruby/2.6.5/gems/bundler-2.1.4/lib/bundler/runtime.rb:74:in `require'
  7: from ruby/2.6.5/gems/wombat-2.10.0/lib/wombat.rb:3:in `<top (required)>'
  6: from ruby/2.6.5/gems/wombat-2.10.0/lib/wombat.rb:3:in `require'
  5: from ruby/2.6.5/gems/wombat-2.10.0/lib/wombat/crawler.rb:4:in `<top (required)>'
  4: from ruby/2.6.5/gems/wombat-2.10.0/lib/wombat/crawler.rb:4:in `require'
  3: from ruby/2.6.5/gems/wombat-2.10.0/lib/wombat/processing/parser.rb:4:in `<top (required)>'
  2: from ruby/2.6.5/gems/wombat-2.10.0/lib/wombat/processing/parser.rb:4:in `require'
  1: from ruby/2.6.5/gems/mechanize-2.7.6/lib/mechanize.rb:5:in `<top (required)>'
    ruby/2.6.5/gems/mechanize-2.7.6/lib/mechanize.rb:5:
    in `require': cannot load such file -- net/http/digest_auth (LoadError)

Mock responses

Hi guys,
I want to mock/stub responses (web pages) in order to be able to test my crawler without internet access.
What can you suggest?
Thanks
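
One common approach (not Wombat-specific) is WebMock, which intercepts HTTP at the Net::HTTP layer Mechanize sits on — a sketch with a made-up URL and fixture path; the Content-Type header matters so Mechanize parses the body as an HTML page:

require 'webmock'
require 'wombat'

include WebMock::API
WebMock.enable!

stub_request(:get, "https://example.com/")
  .to_return(body: File.read("spec/fixtures/home.html"),
             headers: { "Content-Type" => "text/html" })

result = Wombat.crawl do
  base_url "https://example.com"
  path "/"

  headline xpath: "//h1" # resolved from the fixture, no network needed
end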

page function does not work

I found that a page function exists, and I guess it is used for a URL parameter named page, so I wrote page: 2. But the page function does not work; it gives me the error ArgumentError: wrong number of arguments (given 1, expected 0). Did I use the page function incorrectly?

Allow hash selector as last argument to property

Small API improvement idea: instead of

stuff({ css: 'div.some-class' }, :list)

I want to be able to write:

stuff :list, css: 'div.some-class'

To me, this reads more like idiomatic Ruby.

For the edge case where a user specifies both positional and hash arguments, we could just accept the first one? Or merge the two?

If you think this is worth it, I'd be happy to make a PR 😄

Update

Should be fixed to work with Ruby 1.9.x or greater.

local file support

Wombat can't parse local files:
/.gem/ruby/2.3.1/gems/wombat-2.5.1/lib/wombat/processing/parser.rb:33:in `block (2 levels) in initialize': undefined method `content_type' for #<Mechanize::FileResponse:0x007fe856a62d90> (NoMethodError)

Wombat is a scraper. I don't think it is a crawler.

From what I can tell, wombat isn't a crawler. Is this correct?

Web scraping, to use a minimal definition, is the process of processing a web document and extracting information out of it. You can do web scraping without doing web crawling. Wikipedia on web scraping

Web crawling, to use a minimal definition, is the process of iteratively finding and fetching web links starting from a list of seed URLs. Strictly speaking, to do web crawling, you have to do some degree of web scraping (to extract the URLs). Wikipedia on web crawlers

How to select a link's href using an iterator?

I have a list of links to extract from a web page, but how do I select the href attribute when I use an iterator? The following code always selects the first link.

result = Wombat.crawl do
  base_url provider_urls.url
  path '/'

  articles 'css=.article table.olt td.title', :iterator do
    title({ css: "a" })
    article_path({ xpath: Nokogiri::CSS.xpath_for(".article table.olt td.title a")[0] + '/@href' })
  end
end

http://stackoverflow.com/questions/30498120/ruby-wombot-select-link-url-attribute-within-iterator

Image provision

Is it possible to crawl through the images and get the image source?
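
A minimal sketch of what that might look like — assuming an attribute XPath returns its value, which is untested; the URL is a placeholder:

require 'wombat'

result = Wombat.crawl do
  base_url "https://example.com"
  path "/"

  image_sources({ xpath: "//img/@src" }, :list)
end

puts result["image_sources"]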

Cannot run the example code.

Hi, I got a wombat.rb:14:in `block (3 levels) in <main>': undefined method `gsub' for nil:NilClass (NoMethodError) error when running the example code on the introduction page of Wombat.

I guess the reason for this error is that GitHub has changed its home page's markup, which makes the variable e nil.

Does anyone know how to fix this problem? I'm not familiar with XPath; I just started learning it.

Is xpath working properly?

My Test

some_text xpath: '//*[@id="Content_Regs"]/table[1]/tbody/tr/td[2]/table/tbody/tr[4]/td[2]/text()[2]'

Return

{"some_text"=>nil}

Google Chrome console

$x('//*[@id="Content_Regs"]/table[1]/tbody/tr/td[2]/table/tbody/tr[4]/td[2]/text()[2]')
["
            São Paulo - SP"
]

What might be happening? Did I forget something?

Advanced request options

Sometimes I need to emulate XHR (AJAX) requests to sites and pages, or add different headers to the request. It would be great to have the ability to specify the type of request (GET, POST, PUT, etc.).

Caching?

Is it possible to cache the results of pages to make successive runs faster?

New Version

Hi,

Are you planning to release a new version soon? It would be nice to see a 2.0.1 with the proxy settings :)

How to go over multiple subpaths

I'd like to go over a lot of pages (thousands) and supply each of their path parameters to the base_url, because I already have them. But I don't see any example of how to do something like that. Is this a good use case for Wombat? Next, I'd like to store the scraped text from the pages in one giant JSON file.
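
A straightforward sketch: loop over the known paths, crawl each one, and dump everything to JSON at the end (the URL, selector, paths, and file name are placeholders):

require 'wombat'
require 'json'

paths = ["/page-1", "/page-2", "/page-3"] # thousands, in practice

results = paths.map do |p|
  Wombat.crawl do
    base_url "https://example.com"
    path p

    text css: "body"
  end
end

File.write("scraped.json", JSON.pretty_generate(results))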

Way to access url inside follow

Is there a way to get the URL inside a :follow block?

products 'css=.products a', :follow do
  name css: 'h1'
  price css: '.price'
  url "??????????" # How to get the url of the followed page
end

Cannot set Mechanize page via Metadata's page method

When trying to use a pre-fetched Mechanize page as described in the wiki:

Wombat.crawl do
  m = Mechanize.new 
  mp = m.get 'http://www.google.com'
  page mp
end

I get an error with this stack trace:

crawler.rb:8:in `block in <main>': wrong number of arguments (1 for 0) (ArgumentError)
    from /Users/derantell/.rbenv/versions/2.1.2/lib/ruby/gems/2.1.0/gems/wombat-2.3.0/lib/wombat/crawler.rb:22:in `instance_eval'
    from /Users/derantell/.rbenv/versions/2.1.2/lib/ruby/gems/2.1.0/gems/wombat-2.3.0/lib/wombat/crawler.rb:22:in `crawl'
    from /Users/derantell/.rbenv/versions/2.1.2/lib/ruby/gems/2.1.0/gems/wombat-2.3.0/lib/wombat.rb:13:in `crawl'
    from crawler.rb:4:in `<main>'

Using @metadata_dup.page mp or renaming Metadata::page to something else works, therefore my guess is that the attr_accessor :page which Crawler includes from Parser is found and method_missing is never invoked.

Versions used: ruby 2.1.2, mechanize 2.7.3 and wombat 2.3.0

Having trouble using wombat for dashing

Hi there,
I am trying to screen-scrape information from a website for Dashing. When I try to send_event, it doesn't seem to show up. Can this be used for Dashing? I'm just writing it in a Ruby file, and nothing appears. Any help would be much appreciated.

Cheers

Getting dynamic data from website

I am trying to scrape a website created in Angular. With Angular, it needs to run the scripts on the page to get the dynamic data. All I am getting is the static data when scraping the page. Is there a way that I can get dynamically generated data with this gem?

Clearing cookies

Hey, is it possible to remove all previously stored cookies between requests? In Mechanize there is agent.cookie_jar.clear!; is this possible in Wombat?

I'm pretty new to all of this :)
Thanks
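
A hedged sketch, assuming the class-style crawler exposes its Mechanize agent through an accessor the same way it exposes page (see the Metadata issue above) — unverified:

require 'wombat'

class MyCrawler
  include Wombat::Crawler

  base_url "https://example.com"
  path "/"

  headline xpath: "//h1"
end

crawler = MyCrawler.new
first = crawler.crawl

# Assumption: the Parser exposes the agent as `mechanize`; guard in case it doesn't.
crawler.mechanize.cookie_jar.clear! if crawler.respond_to?(:mechanize)

second = crawler.crawl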

Unable to run example

Hi,

I really want to try your work, so I copied and pasted this code:

require 'wombat'

puts Wombat.crawl do
  base_url "http://www.github.com"
  path "/"

  headline "xpath=//h1"
  what_is "css=.column.secondary p", :html
  repositories "css=a.repo", :list

  explore "xpath=//ul/li[2]/a" do |e|
    e.gsub(/Explore/, "LOVE")
  end

  benefits do
    first_benefit "css=.column.leftmost h3"
    second_benefit "css=.column.leftmid h3"
    third_benefit "css=.column.rightmid h3"
    fourth_benefit "css=.column.rightmost h3"
  end
end

But it fails:

/home/black/.rvm/gems/ruby-1.9.3-p362/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:595:in `resolve': absolute URL needed (not "") (ArgumentError)
        from /home/black/.rvm/gems/ruby-1.9.3-p362/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:214:in `fetch'
        from /home/black/.rvm/gems/ruby-1.9.3-p362/gems/mechanize-2.5.1/lib/mechanize.rb:407:in `get'
        from /home/black/.rvm/gems/ruby-1.9.3-p362/gems/wombat-2.1.0/lib/wombat/processing/parser.rb:41:in `parser_for'
        from /home/black/.rvm/gems/ruby-1.9.3-p362/gems/wombat-2.1.0/lib/wombat/processing/parser.rb:29:in `parse'
        from /home/black/.rvm/gems/ruby-1.9.3-p362/gems/wombat-2.1.0/lib/wombat/crawler.rb:30:in `crawl'
        from /home/black/.rvm/gems/ruby-1.9.3-p362/gems/wombat-2.1.0/lib/wombat.rb:10:in `crawl'
        from amazon.rb:3:in `<main>'

I have the following gems:

activesupport (3.2.11)
blankslate (2.1.2.4)
bluecloth (2.2.0)
bundler (1.2.3)
capistrano (2.14.1)
curb (0.8.3)
domain_name (0.5.7)
gli (2.5.3)
highline (1.6.15)
i18n (0.6.1)
json (1.7.6)
mechanize (2.5.1)
mg (0.0.8)
mime-types (1.19)
multi_json (1.5.0)
net-http-digest_auth (1.2.1)
net-http-persistent (2.8)
net-scp (1.0.4)
net-sftp (2.0.5)
net-ssh (2.6.3)
net-ssh-gateway (1.1.0)
nokogiri (1.5.6)
ntlm-http (0.1.1)
parslet (1.5.0)
pdfkit (0.5.2)
rack (1.4.2)
rack-protection (1.3.2)
rake (10.0.3)
rest-client (1.6.7)
rmagick (2.13.1)
rubygems-bundler (1.1.0)
rvm (1.11.3.5)
showoff (0.7.0)
sinatra (1.3.3)
tilt (1.3.3)
unf (0.0.5)
unf_ext (0.0.5)
webrobots (0.0.13)
wombat (2.1.0)

Have you got an idea of what the problem is?

Thanks in advance for your help.

`set_proxy` not working

I tried out a slightly modified version of the example provided:

require 'wombat'

class HeadersScraper
  include Wombat::Crawler

  base_url "http://www.rubygems.org"
  path "/"

  set_proxy("localhost", 8888)
end

I get this error:

/Users/michael/.rvm/gems/ruby-2.3.1/gems/wombat-2.5.1/lib/wombat/property/locators/factory.rb:34:in `locator_for': Unknown property format 8888. (Wombat::Property::Locators::UnknownTypeException)

Here's my setup:

  • Ruby 2.3.1
  • Wombat 2.5.1
  • Nokogiri 1.8.0
  • Mechanize 2.7.5

Moving this to the configure block is a workaround:

Wombat.configure do |config|
  config.set_proxy("localhost", 8888)
end

How to parse element attributes

I want this script to return an array of hashes containing [name, url]. But since the iterator returns what is INSIDE the a tag, I can't figure out how to get the info.

require 'wombat'

video_url = 'https://vimeo.com/26594942'
result = Wombat.crawl do
  base_url video_url + "/likes"
  path "/"

  likers "css=.browse_people li a", :iterator do
    name "css=p.title"
    url "[href]", :html do |link|
      link
    end
  end
end

puts result
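
A hedged guess at a fix, assuming relative XPaths work inside iterators (the backwards-compatibility issue above uses xpath=. that way) and that an attribute XPath returns its value — untested:

likers "css=.browse_people li a", :iterator do
  name css: "p.title"
  url xpath: "./@href" # the iterator's context node is each <a>, so ./@href is its link
end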
