vifreefly / kimuraframework


Kimurai is a modern web scraping framework written in Ruby which works out of the box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and allows you to scrape and interact with JavaScript-rendered websites

License: MIT License

Ruby 99.85% Shell 0.15%
kimurai scraper crawler scrapy headless-chrome

kimuraframework's People

Contributors

matias-eduardo, vifreefly, zhustec

kimuraframework's Issues

Better support for testing

As I was writing tests for a scraper I made, I realised it's not super straightforward at the moment. It would be great to improve on that front:

  1. Add a testing section to the documentation, showcasing how to set it up and test within Rails, for example.
  2. Expand the global configuration options. I would like to be able to disable the delay globally in the test environment, instead of doing this in every scraper I write: @config = { before_request: { delay: 1..2 } } unless Rails.env.test? (see the sketch after this list).
  3. Add automatic detection of the test environment. Currently I have to set it manually in rails_helper: ENV['KIMURAI_ENV'] ||= 'test'
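
Until such options exist, one workaround is to centralise the conditional in a shared base class instead of repeating it per spider. A minimal sketch, assuming the spiders inherit from an ApplicationSpider and that Rails is loaded:

class ApplicationSpider < Kimurai::Base
  # Drop the request delay entirely in the test environment, so individual
  # spiders don't each need their own `unless Rails.env.test?` guard.
  @config = if defined?(Rails) && Rails.env.test?
    {}
  else
    { before_request: { delay: 1..2 } }
  end
end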

How to set language?

I am unable to set the language for Selenium. According to my understanding, these two options are not supported:
@config = {
headers: { "Accept-Language" => "de-DE" }
}

or

options.add_argument("--lang=de-DE")

absolute_url corrupts url escaping it

I have a URL like this: https://www.example.com/path?query_param=N%2CU. The absolute_url method:

def absolute_url(url, base:)
return unless url
URI.join(base, URI.escape(url)).to_s
end

escapes it again, so it becomes https://www.example.com/path?query_param=N%252CU, corrupting the URL and breaking the spider's link following. What about adding an argument to absolute_url to skip escaping?
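
A minimal sketch of what that opt-out could look like (the escape: keyword argument is hypothetical, not part of the current API):

def absolute_url(url, base:, escape: true)
  return unless url
  url = URI.escape(url) if escape  # callers with already-escaped URLs pass escape: false
  URI.join(base, url).to_s
end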

Setting desired_capabilities

I'm running a crawler with the selenium_firefox engine but I get the following exception: Selenium::WebDriver::Error::InsecureCertificateError. According to https://github.com/SeleniumHQ/selenium/wiki/Ruby-Bindings#ssl-certificates, this is something I need to configure in Firefox.

Is it possible to set the option desired_capabilities through Kimurai?
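
Kimurai doesn't seem to expose a hook for this at the moment, but for reference, the plain selenium-webdriver bindings from that wiki page accept it roughly like this (a sketch for selenium-webdriver 3.x, outside of Kimurai):

require 'selenium-webdriver'

caps = Selenium::WebDriver::Remote::Capabilities.firefox(accept_insecure_certs: true)
driver = Selenium::WebDriver.for(:firefox, desired_capabilities: caps)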

Support custom `max_retries`

I crawl a site whose network is very unreliable; I have to refresh 5 times or even more to get a normal response. I tried to configure retry_request_errors, but found that it only retries 3 times:

def visit(visit_uri, delay: config.before_request[:delay], skip_request_options: false, max_retries: 3)

It would be great to support a custom max_retries.
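
Until the limit is configurable, one spider-side workaround is to wrap the request in your own retry loop. A rough sketch (the rescued error classes are only examples and should match what the flaky site actually raises):

def parse(response, url:, data: {})
  attempts = 0
  begin
    request_to :parse_page, url: url
  rescue Net::ReadTimeout, RuntimeError => e
    attempts += 1
    retry if attempts < 5  # retry up to 5 times before giving up
    raise e
  end
end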

How do I click on something that isn't a link?

I'm trying to click on an SVG element – it is not a button or a link, so Capybara doesn't like it.

I can query for the element using element = response.css('.available'), but I can't seem to click on it using

browser.click(element). I think it would work if it were a Capybara element rather than a Nokogiri element.

How do I get from the Capybara::Session to a Capybara::Page or Capybara::Node so I can execute .click on it?
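
Capybara's own finders can target arbitrary elements by CSS and return clickable Capybara nodes, so something along these lines may work (a sketch; '.available' is the selector from above):

element = browser.find(:css, '.available')  # a Capybara::Node::Element, unlike response.css
element.click

# If the click is still refused, driving it through JavaScript is another option:
browser.execute_script("document.querySelector('.available').click()")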

Kimurai in RoR with Devise - Get current_user in parse - Pass user info to parse - Completed status in `parse!`

I am trying the Kimurai framework in Rails with Devise.

Inside parse I want to:

  • get the current_user (Devise session), or
  • somehow pass the user info to parse, or
  • use parse!, which can return the items array, but in that case how can I know whether the process completed OK or with errors? When using crawl! it returns a response which contains this info...

Any ideas on how I can do this?

Thank you
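
One approach that may cover the second bullet: hand the user through the data: hash when kicking the spider off per request. A sketch (it assumes parse! forwards data: to the callback the same way request_to does; JobAdsSpider, start_url and the User lookup are placeholders):

# e.g. from a controller or background job, with a signed-in Devise user
items = JobAdsSpider.parse!(:parse, url: start_url, data: { user_id: current_user.id })

# inside the spider
def parse(response, url:, data: {})
  user = User.find(data[:user_id])  # re-load the record instead of passing the Devise session around
  save_to "results_#{user.id}.json", { url: url }, format: :json
end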

kimurai setup not passing all necessary arguments to ansible-playbook on macOS Catalina

Hi,

I'm getting this error while trying to run the kimurai setup command against an Ubuntu 18.04 LTS EC2 instance, from a fresh brew install ansible (Ansible 2.9.7) on a MacBook Pro running Catalina.

kimurai setup [email protected] --ask-sudo --ssh-key-path /Users/xxx/Development/ssh-keys/xxx.pem
usage: ansible-playbook [-h] [--version] [-v] [-k] [--private-key PRIVATE_KEY_FILE]
                        [-u REMOTE_USER] [-c CONNECTION] [-T TIMEOUT]
                        [--ssh-common-args SSH_COMMON_ARGS] [--sftp-extra-args SFTP_EXTRA_ARGS]
                        [--scp-extra-args SCP_EXTRA_ARGS] [--ssh-extra-args SSH_EXTRA_ARGS]
                        [--force-handlers] [--flush-cache] [-b] [--become-method BECOME_METHOD]
                        [--become-user BECOME_USER] [-K] [-t TAGS] [--skip-tags SKIP_TAGS] [-C]
                        [--syntax-check] [-D] [-i INVENTORY] [--list-hosts] [-l SUBSET]
                        [-e EXTRA_VARS] [--vault-id VAULT_IDS]
                        [--ask-vault-pass | --vault-password-file VAULT_PASSWORD_FILES] [-f FORKS]
                        [-M MODULE_PATH] [--list-tasks] [--list-tags] [--step]
                        [--start-at-task START_AT_TASK]
                        playbook [playbook ...]
ansible-playbook: error: argument --ssh-extra-args: expected one argument

I tried migrating back to Ansible 2.8 but get this:

kimurai setup [email protected] --ask-sudo --ssh-key-path /Users/xxx/Development/ssh-keys/xxx.pem
BECOME password:

PLAY [all] ******************************************************************************************

TASK [Gathering Facts] ******************************************************************************
ERROR! Unexpected Exception, this is probably a bug: cannot pickle '_io.TextIOWrapper' object

Looks like it's a matter of the Ansible version (I've never used Ansible before, only Puppet and Chef). I'll take a look around, but this might need some updating, or the README could state which Ansible version to test against.

Note: no problem installing through a localhost Ansible install directly on the Ubuntu machine, so nothing urgent whatsoever :)

Ruby 2.7.x obsolete warnings

Ruby 2.7.x warns that URI.escape is obsolete. Could we consider using the well-maintained Addressable gem (https://github.com/sporkmonger/addressable), or a different approach like CGI.escape or ERB::Util.url_encode?

I can create a pull request if that helps.

/usr/local/Cellar/rbenv/1.1.2/versions/2.7.1/lib/ruby/gems/2.7.0/gems/kimurai-1.4.0/lib/kimurai/base_helper.rb:7: warning: URI.escape is obsolete

Thanks.
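
For reference, the swap I had in mind looks roughly like this (a sketch against lib/kimurai/base_helper.rb, not a tested patch):

require 'addressable/uri'

# lib/kimurai/base_helper.rb (sketch)
def absolute_url(url, base:)
  return unless url
  Addressable::URI.join(base, Addressable::URI.encode(url)).to_s
end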

Allow declarative page & item definitions

Though just using the Scrapy-like callbacks is easy and straightforward to code, it would be extra nice to have a higher level of abstraction so we could write scrapers declaratively. This would remove boilerplate, separate selector logic from page-navigation logic, and additionally allow graceful handling of unexpected and unsupported page types that would otherwise crash the scraper (without error handling).

For example, it would be nice to be able to do this:

class YourSpider < ApplicationSpider
  ...
  item :product do
    text    :name, '#ProdPageTitle'
    int     :ean, '#ProdPageProdCode' do |r|
      r[/([0-9]+)/]
    end
    async do
      array   :images, combi(css('#ShopProdImagesNew img.ShopProdThumbImg'), xpath('@src'))
      text    :description, '#ProdPageTabsWrap #tab1'
      custom  :specs, '#ProdPageProperties > span' do |r|
        r.to_a.in_groups_of(2).map{|s| {
          name: s[0].text,
          value: s[1].text
        }}
      end
    end
  end
end

The block contains invocations of (predefined) field types which are given names, one or more selectors, and optionally a block for post-processing the Nokogiri result. Every field accepts an async argument that specifies whether the element is rendered by JavaScript. The async block marks every field inside it as async, meaning that browser.current_page is queried a few times with a timeout until extraction of the specified element succeeds (i.e. once the page has actually rendered):

def extract_on_ready(expression, multi: true, retries: 3, wait: 1, default: nil)
  retries.times do
    # the extract function determines the type of the selector expression so it knows
    # whether to call xpath() or css() on the Nokogiri object.
    result = extract(expression, multi: multi, async: false)
    case result
    when Nokogiri::XML::Element
      return result
    when Nokogiri::XML::NodeSet, Array
      return result unless result.empty?
    end
    sleep wait
    refresh # self.response = browser.current_response
  end
  default
end

Because the selectors are defined declaratively, the expression type has to be given (it defaults to 'css'), since the Nokogiri css and xpath methods are called indirectly.

css('#ShopProdImagesNew img.ShopProdThumbImg')

This design allows for the following, given a parse_item() function that would extract all fields from the defined item of the same name as the current inline handler:

class YourSpider < ApplicationSpider
  # start from start_urls
  request_start do |category_list, **opts1|
      request_all :product_list, urls: category_list.css(".css-selector1") do |product_list, **opts2|
        for link in product_list.css(".css-selector2")
          request_in :product, url: link do |product, **opts3|
            save_to "results.json", parse_item(), format: :pretty_json
          end
        end
      end
    end
end

You could even go further. If you move the result file definition to the item definition, then the inline callback handler could automatically and implicitly extract the item:

class YourSpider < ApplicationSpider
  ...
  # defined item with result file definition passed by hash
  item :product, file: "results/{:category}.json" do
    text :name, '#ProdPageTitle'
    text :category, css: '.category-name'
  end

  # item definition with result file config in body
  item :otherproduct do
    text :name, '#ProdPageTitle'
    text :category, css: '.category-name'
    save_to "results/{:category}.json", {
        format: :pretty_json,
        append: true,
        position: false
      }
  end

  # start from start_urls
  request_start do |category_list, **opts1|
      request_all :product_list, urls: category_list.css(".css-selector1") do |product_list, **opts2|
        for link in product_list.css(".css-selector2")
          # this call requests the page on url, knows that it contains a :product Entity, auto-extracts from the predefined Entity selectors, and auto-saves it to a result file as defined in the entity.
          request_item :product, url: link
        end
      end
    end
end

The logical conclusion leaves us with only a DSL for defining the relationships between pages and how to get from one to the next. If we had a class-level Page description, we could have a single parse() entrypoint that figures out the page type on its own.

class YourSpider < ApplicationSpider
  # class-level declaration of a page type
  page :product_list do
    identifier css: 'body.product-list-page'
    has_many :product, css: '#productlist a.product-link'
  end

  page :product do
    identifier do |response|
      !response.xpath('//div[@id="product-image"]').empty? and response.css('body.is-product').length > 0
    end
  end
end

In ApplicationSpider:

def parse(response, url, **opts)
  # find the first page definition that recognises this response
  page_definition = @page_types.find { |pd| pd.page_of_type(response) }
  if page_definition
    @entities[page_definition.name].parse(response)
  else
    puts "unrecognised page type at #{url}!"
  end
end

The parse() entrypoint would automatically find the right Page definition to know how to parse it and how to branch to deeper pages. All deeper pages are also parsed using the singular parse() callback. The advantage of this approach is that the navigational flow gets very robust, since page types are explicitly identified by selectors in the Page definition. You would get a nice log of all unexpected page types (customized landing pages, error pages etc), and encountering them does not break the code or require error catching by the user.

The downside to the explicit approach is reduced customisability when you need something specific to be done in order to parse a specific page (type). To account for this, the parse() entrypoint would have to check whether a user-defined callback exists that fits the page definition, similar to how it works now. So for a :product Page definition, it would look for a parse_product_page(response, url, **opts) callback that allows the user to hook into the flow.

helper is not loaded

I created google_spider.rb in ./helpers/

module GoogleHelper
  def time2int(time)
    time.to_i
  end
end

run console:

kimurai console --url https://www.google.com
[1] pry(#<ApplicationSpider>)> time2int(Time.now)

return:

NoMethodError: undefined method `time2int' for #<ApplicationSpider:0x00007fc655206ed0>
Did you mean?  timeout
from (pry):1:in `console'
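
As a workaround, requiring and including the module explicitly in the spider makes it available in the console as well. A sketch (the relative path assumes the generated project layout):

# spiders/application_spider.rb
require_relative '../helpers/google_spider'

class ApplicationSpider < Kimurai::Base
  include GoogleHelper  # makes time2int available to spiders and to `kimurai console`
end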

Selenium Chrome Heroku

Hi Everyone,

Is there any way to solve Heroku's problems with the Selenium Chrome engine in Kimurai?

I need to use

@engine = :selenium_chrome

I use buildpacks for Chrome on Heroku, but I still get an error related to the file path of Chrome and couldn't find any way to define it in the spider settings.

Error message from Heroku:

Selenium::WebDriver::Error::WebDriverError: not a file: "/usr/local/bin/chromedriver"
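
For what it's worth, pointing Kimurai at the buildpack-provided binaries via the initializer may help. A sketch (the exact paths and the selenium_chrome_path option name should be double-checked against your buildpacks and Kimurai version):

# config/initializers/kimurai.rb
Kimurai.configure do |config|
  # heroku-buildpack-chromedriver and heroku-buildpack-google-chrome install to these locations by default
  config.chromedriver_path    = ENV.fetch("CHROMEDRIVER_PATH", "/app/.chromedriver/bin/chromedriver")
  config.selenium_chrome_path = ENV.fetch("GOOGLE_CHROME_SHIM", "/app/.apt/usr/bin/google-chrome")
end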

Skip request error after retry

I have a site that times out sometimes. I configured @config to retry the errors, and to skip them if they fail, since I would like the spider to keep going. However, it seems the skip_request_errors option drops errors immediately. Is there a way to make retry_request_errors and skip_request_errors work together, so that errors are only dropped once the retries have been exhausted?
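
Based on the skip_on_failure flag that shows up elsewhere in this tracker, attaching the skip behaviour to the retry entry instead of listing the error under skip_request_errors might do it. A sketch (Net::ReadTimeout stands in for whatever your timeouts actually raise):

@config = {
  retry_request_errors: [
    # retried first; the request is only dropped (skipped) once the retries are exhausted
    { error: Net::ReadTimeout, skip_on_failure: true }
  ]
}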

Running on Ubuntu 20.04 gives chromedriver error

I wrote a scraper that works just fine on macOS. When I run the same scraper on Ubuntu 20.04, I get an error:

/home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:104:in `start': already started: #<URI::HTTP http://127.0.0.1:9515/> "/usr/local/bin/chromedriver" (RuntimeError)

To make sure it wasn't just my own script, I tried running:

$ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework

This also failed with the same error:

D, [2020-11-13 15:34:54 -0500#7086] [M: 47354020103620] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance
D, [2020-11-13 15:34:54 -0500#7086] [M: 47354020103620] DEBUG -- : BrowserBuilder (selenium_chrome): enabled native headless_mode
I, [2020-11-13 15:34:54 -0500#7086] [M: 47354020103620]  INFO -- : Browser: started get request to: https://github.com/vifreefly/kimuraframework
Traceback (most recent call last):
	22: from /home/username/.rvm/gems/ruby-2.5.3/bin/ruby_executable_hooks:24:in `<main>'
	21: from /home/username/.rvm/gems/ruby-2.5.3/bin/ruby_executable_hooks:24:in `eval'
	20: from /home/username/.rvm/gems/ruby-2.5.3/bin/kimurai:23:in `<main>'
	19: from /home/username/.rvm/gems/ruby-2.5.3/bin/kimurai:23:in `load'
	18: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/exe/kimurai:6:in `<top (required)>'
	17: from /home/username/.rvm/gems/ruby-2.5.3/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
	16: from /home/username/.rvm/gems/ruby-2.5.3/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
	15: from /home/username/.rvm/gems/ruby-2.5.3/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
	14: from /home/username/.rvm/gems/ruby-2.5.3/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
	13: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/cli.rb:123:in `console'
	12: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/base.rb:201:in `request_to'
	11: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:52:in `visit'
	10: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:51:in `ensure in visit'
	 9: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/driver/base.rb:16:in `current_memory'
	 8: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/selenium/driver.rb:28:in `pid'
	 7: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/selenium/driver.rb:32:in `port'
	 6: from /home/username/.rvm/gems/ruby-2.5.3/gems/capybara-3.13.2/lib/capybara/selenium/driver.rb:32:in `browser'
	 5: from /home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver.rb:88:in `for'
	 4: from /home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:46:in `for'
	 3: from /home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:46:in `new'
	 2: from /home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/chrome/driver.rb:40:in `initialize'
	 1: from /home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:303:in `service_url'
/home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:104:in `start': already started: #<URI::HTTP http://127.0.0.1:9515/> "/usr/local/bin/chromedriver" (RuntimeError)

Again, this error only occurs on Ubuntu 20.04. When I try the same command on macOS, it works perfectly. I followed the Ubuntu 18.04 installation instructions in the README file, with the exception that I installed the most recent versions of chromedriver and geckodriver like so:

cd /tmp && wget https://chromedriver.storage.googleapis.com/87.0.4280.20/chromedriver_linux64.zip
sudo unzip chromedriver_linux64.zip -d /usr/local/bin
cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.28.0/geckodriver-v0.28.0-linux64.tar.gz
sudo tar -xvzf geckodriver-v0.28.0-linux64.tar.gz -C /usr/local/bin
rm -f geckodriver-v0.28.0-linux64.tar.gz

The reason I did this was that other people reported similar errors and suggested upgrading chromedriver to eliminate them. This did not fix the issue for me, but I figured it's better to have more recent versions anyway.

When I looked in htop to see why this error was occurring, I noticed that on Ubuntu multiple instances of chromedriver were opened during this command, whereas on macOS only one instance opened. I am not sure why this happens, but I suspect it is related to the error, since it is apparently complaining about more than one chromedriver instance being open.

Crawl in Sidekiq - Selenium::WebDriver::Error::WebDriverError: not a file: "./bin/chromedriver"

I try to run the crawler via a Sidekiq job on my DigitalOcean droplet, but it always fails with the error Selenium::WebDriver::Error::WebDriverError: not a file: "./bin/chromedriver". At the same time, I can run crawl! via the Rails console and it works well; it also works via Sidekiq on my local machine. I defined chromedriver_path in the Kimurai initializer: config.chromedriver_path = Rails.root.join('lib', 'webdrivers', 'chromedriver_83').to_s
Logs of the Sidekiq job, which I also started via the Rails console with FekoCrawlWorker.perform_async:

Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.602Z 7201 TID-ou13yz8xx FekoCrawlWorker JID-7d134b4ee9407973d7803f0b INFO: start
Jun 29 19:43:26 aquacraft sidekiq[7201]: I, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140]  INFO -- feko_spider: Spider: started: feko_spider
Jun 29 19:43:26 aquacraft sidekiq[7201]: D, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140] DEBUG -- feko_spider: BrowserBuilder (selenium_chrome): created browser instance
Jun 29 19:43:26 aquacraft sidekiq[7201]: D, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140] DEBUG -- feko_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
Jun 29 19:43:26 aquacraft sidekiq[7201]: I, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140]  INFO -- feko_spider: Browser: started get request to: https://feko.com.ua/shop/category/kotly/gazovye-kotly331/page/1
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29 19:43:26 WARN Selenium [DEPRECATION] :driver_path is deprecated. Use :service with an instance of Selenium::WebDriver::Service instead.
Jun 29 19:43:26 aquacraft sidekiq[7201]: I, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140]  INFO -- feko_spider: Info: visits: requests: 1, responses: 0
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29 19:43:26 WARN Selenium [DEPRECATION] :driver_path is deprecated. Use :service with an instance of Selenium::WebDriver::Service instead.
Jun 29 19:43:26 aquacraft sidekiq[7201]: I, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140]  INFO -- feko_spider: Browser: driver selenium_chrome has been destroyed
Jun 29 19:43:26 aquacraft sidekiq[7201]: F, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140] FATAL -- feko_spider: Spider: stopped: {:spider_name=>"feko_spider", :status=>:failed, :error=>"#<Selenium::WebDriver::Error::WebDriverError: not a file: \"./bin/chromedriver\">", :environment=>"development", :start_time=>2020-06-29 19:43:26 +0000, :stop_time=>2020-06-29 19:43:26 +0000, :running_time=>"0s", :visits=>{:requests=>1, :responses=>0}, :items=>{:sent=>0, :processed=>0}, :events=>{:requests_errors=>{}, :drop_items_errors=>{}, :custom=>{}}}
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.607Z 7201 TID-ou13yz8xx FekoCrawlWorker JID-7d134b4ee9407973d7803f0b INFO: fail: 0.006 sec
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.608Z 7201 TID-ou13yz8xx WARN: {"context":"Job raised exception","job":{"class":"FekoCrawlWorker","args":[],"retry":false,"queue":"default","backtrace":true,"jid":"7d134b4ee9407973d7803f0b","created_at":1593459806.6006012,"enqueued_at":1593459806.6006787},"jobstr":"{\"class\":\"FekoCrawlWorker\",\"args\":[],\"retry\":false,\"queue\":\"default\",\"backtrace\":true,\"jid\":\"7d134b4ee9407973d7803f0b\",\"created_at\":1593459806.6006012,\"enqueued_at\":1593459806.6006787}"}
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.608Z 7201 TID-ou13yz8xx WARN: Selenium::WebDriver::Error::WebDriverError: not a file: "./bin/chromedriver"
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.608Z 7201 TID-ou13yz8xx WARN: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/platform.rb:136:in `assert_file'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/platform.rb:140:in `assert_executable'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:138:in `binary_path'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:94:in `initialize'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:41:in `new'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:41:in `chrome'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:299:in `service_url'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/chrome/driver.rb:40:in `initialize'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:46:in `new'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:46:in `for'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver.rb:88:in `for'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/capybara-2.18.0/lib/capybara/selenium/driver.rb:23:in `browser'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/selenium/driver.rb:32:in `port'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/selenium/driver.rb:28:in `pid'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/driver/base.rb:16:in `current_memory'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:51:in `ensure in visit'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:52:in `visit'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/base.rb:201:in `request_to'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/base.rb:128:in `block in crawl!'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/base.rb:124:in `each'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/base.rb:124:in `crawl!'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/releases/20200627190630/app/workers/feko_crawl_worker.rb:9:in `perform'

Sidekiq worker code:

require 'sidekiq-scheduler'

class FekoCrawlWorker
  include Sidekiq::Worker

  sidekiq_options retry: false, backtrace: true, queue: 'default'

  def perform
    Crawlers::Feko.crawl!
  end
end

in_parallel: undefined method `call' for "app":String (NoMethodError)

An error occurs when I use the in_parallel method.

Here is the example:

# amazon_spider.rb
require 'kimurai'

class AmazonSpider < Kimurai::Base
  @name = "amazon_spider"
  @engine = :mechanize
  @start_urls = ["https://www.amazon.com/"]

  def parse(response, url:, data: {})
    browser.fill_in "field-keywords", with: "Web Scraping Books"
    browser.click_on "Go"

    # Walk through pagination and collect products urls:
    urls = []
    loop do
      response = browser.current_response
      response.xpath("//li//a[contains(@class, 's-access-detail-page')]").each do |a|
        urls << a[:href].sub(/ref=.+/, "")
      end

      browser.find(:xpath, "//a[@id='pagnNextLink']", wait: 1).click rescue break
    end

    # Process all collected urls concurrently within 3 threads:
    in_parallel(:parse_book_page, urls, threads: 3)
  end

  def parse_book_page(response, url:, data: {})
    item = {}

    item[:title] = response.xpath("//h1/span[@id]").text.squish
    item[:url] = url
    item[:price] = response.xpath("(//span[contains(@class, 'a-color-price')])[1]").text.squish.presence
    item[:publisher] = response.xpath("//h2[text()='Product details']/following::b[text()='Publisher:']/following-sibling::text()[1]").text.squish.presence

    save_to "books.json", item, format: :pretty_json
  end
end

AmazonSpider.crawl!

Here is the error output:

I, [2019-01-17 10:25:33 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Spider: started: amazon_spider
D, [2019-01-17 10:25:34 +0800#12757] [M: 47339757413960] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2019-01-17 10:25:34 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/
I, [2019-01-17 10:25:38 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/
I, [2019-01-17 10:25:38 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Info: visits: requests: 1, responses: 1
I, [2019-01-17 10:25:48 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Spider: in_parallel: starting processing 63 urls within 3 threads
D, [2019-01-17 10:25:48 +0800#12757] [C: 47339781353100] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2019-01-17 10:25:48 +0800#12757] [C: 47339781353100]  INFO -- amazon_spider: Browser: started get request to: /gp/slredirect/picassoRedirect.html/
I, [2019-01-17 10:25:48 +0800#12757] [C: 47339781353100]  INFO -- amazon_spider: Info: visits: requests: 2, responses: 1
I, [2019-01-17 10:25:48 +0800#12757] [C: 47339781353100]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed
#<Thread:0x0000561c4db3dd18@/home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:295 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
        14: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:301:in `block (2 levels) in in_parallel'
        13: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:301:in `each'
        12: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:309:in `block (3 levels) in in_parallel'
        11: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:197:in `request_to'
        10: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/capybara_ext/session.rb:21:in `visit'
         9: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/session.rb:265:in `visit'
         8: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/driver.rb:45:in `visit'
         7: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:23:in `visit'
         6: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:43:in `process_and_follow_redirects'
         5: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:65:in `process'
         4: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-mechanize-1.11.0/lib/capybara/mechanize/browser.rb:50:in `block (2 levels) in <class:Browser>'
         3: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:58:in `get'
         2: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:129:in `custom_request'
         1: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:266:in `process_request'
/home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/mock_session.rb:29:in `request': undefined method `call' for "app":String (NoMethodError)
I, [2019-01-17 10:25:48 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed
F, [2019-01-17 10:25:48 +0800#12757] [M: 47339757413960] FATAL -- amazon_spider: Spider: stopped: {:spider_name=>"amazon_spider", :status=>:failed, :error=>"#<NoMethodError: undefined method `call' for \"app\":String>", :environment=>"development", :start_time=>2019-01-17 10:25:33 +0800, :stop_time=>2019-01-17 10:25:48 +0800, :running_time=>"15s", :visits=>{:requests=>2, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :events=>{:requests_errors=>{}, :drop_items_errors=>{}, :custom=>{}}}
Traceback (most recent call last):
        14: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:301:in `block (2 levels) in in_parallel'
        13: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:301:in `each'
        12: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:309:in `block (3 levels) in in_parallel'
        11: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:197:in `request_to'
        10: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/capybara_ext/session.rb:21:in `visit'
         9: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/session.rb:265:in `visit'
         8: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/driver.rb:45:in `visit'
         7: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:23:in `visit'
         6: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:43:in `process_and_follow_redirects'
         5: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:65:in `process'
         4: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-mechanize-1.11.0/lib/capybara/mechanize/browser.rb:50:in `block (2 levels) in <class:Browser>'
         3: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:58:in `get'
         2: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:129:in `custom_request'
         1: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:266:in `process_request'
/home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/mock_session.rb:29:in `request': undefined method `call' for "app":String (NoMethodError)

How to retry or skip Net::HTTP::Persistent::Error?

Hello
I sometimes get an error and want to skip it

Net::HTTP::Persistent::Error: too many connection resets (due to Net::ReadTimeout with #<TCPSocket:(closed)> - Net::ReadTimeout) after 0 requests on 47268817494080, last used 1585742984.5058694 seconds ago
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/rack-mini-profiler-2.0.1/lib/patches/net_patches.rb:9:in `block in request_with_mini_profiler'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/rack-mini-profiler-2.0.1/lib/mini_profiler/profiling_methods.rb:39:in `step'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/rack-mini-profiler-2.0.1/lib/patches/net_patches.rb:8:in `request_with_mini_profiler'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/skylight-core-4.2.3/lib/skylight/core/probes/net_http.rb:55:in `block in request'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/skylight-core-4.2.3/lib/skylight/core/fanout.rb:25:in `instrument'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/skylight-core-4.2.3/lib/skylight/core/probes/net_http.rb:48:in `request'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/net-http-persistent-3.1.0/lib/net/http/persistent.rb:964:in `block in request'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/net-http-persistent-3.1.0/lib/net/http/persistent.rb:662:in `connection_for'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/net-http-persistent-3.1.0/lib/net/http/persistent.rb:958:in `request'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:280:in `fetch'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/mechanize-2.7.6/lib/mechanize.rb:464:in `get'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-mechanize-1.11.0/lib/capybara/mechanize/browser.rb:131:in `process_remote_request'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-mechanize-1.11.0/lib/capybara/mechanize/browser.rb:47:in `block (2 levels) in <class:Browser>'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-3.31.0/lib/capybara/rack_test/browser.rb:68:in `process'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-3.31.0/lib/capybara/rack_test/browser.rb:43:in `process_and_follow_redirects'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-3.31.0/lib/capybara/rack_test/browser.rb:23:in `visit'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-3.31.0/lib/capybara/rack_test/driver.rb:45:in `visit'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-3.31.0/lib/capybara/session.rb:278:in `visit'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:21:in `visit'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/kimurai-1.4.0/lib/kimurai/base.rb:201:in `request_to'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/kimurai-1.4.0/lib/kimurai/base.rb:162:in `parse!'
/home/aleksandra/Projects/hub/app/models/job_ad.rb:80:in `block in scrape_job_ads'
/home/aleksandra/Projects/hub/app/models/job_ad.rb:78:in `each'
/home/aleksandra/Projects/hub/app/models/job_ad.rb:78:in `scrape_job_ads'
/home/aleksandra/Projects/hub/lib/tasks/scheduler.rake:78:in `block (2 levels) in <main>'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/rake-13.0.1/exe/rake:27:in `<top (required)>'
/home/aleksandra/.rvm/gems/ruby-2.6.5/bin/ruby_executable_hooks:24:in `eval'
/home/aleksandra/.rvm/gems/ruby-2.6.5/bin/ruby_executable_hooks:24:in `<main>'

When I added this error to @config:

@config = {
    before_request: { delay: 120..180 },
    skip_request_errors: [{ error: Net::HTTP::Persistent::Error }],
    retry_request_errors: [
      { error: RuntimeError, message: '520', skip_on_failure: true }
    ]
  }

I got NameError: uninitialized constant Net::HTTP::Persistent.
Is it possible to skip this error and continue?
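
The NameError suggests the constant simply isn't loaded at the point where @config is evaluated. A guess rather than a confirmed fix: requiring the gem at the top of the spider file should at least make the constant resolvable:

require 'net/http/persistent'

@config = {
  before_request: { delay: 120..180 },
  skip_request_errors: [{ error: Net::HTTP::Persistent::Error }],
  retry_request_errors: [
    { error: RuntimeError, message: '520', skip_on_failure: true }
  ]
}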

Some minor warnings when using kimurai

Hey there,

Sorry to bother you.

A few harmless warnings are issued when running under -w.

The reason I report this is that I am quite pedantic, so I have -w on all the time.

This then pulls these warnings into my projects.

I know I can silence them, e.g. via $VERBOSE and probably the Warning module,
but I am taking the lazy approach and reporting them here. Feel free to disregard
this. :-)

/.gem/gems/kimurai-1.4.0/lib/kimurai/browser_builder.rb:12: warning: assigned but unused variable - e
/.gem/gems/kimurai-1.4.0/lib/kimurai/base_helper.rb:11: warning: assigned but unused variable - uri
/.gem/gems/kimurai-1.4.0/lib/kimurai/base_helper.rb:12: warning: assigned but unused variable - e
/.gem/gems/kimurai-1.4.0/lib/kimurai/base.rb:33: warning: instance variable @run_info not initialized

(For local but unused variables, they can either be removed or, if you want to keep them, given a leading
underscore such as _uri. For uninitialized instance variables, I typically bundle them all in a method
called reset() where they are initialized to nil. That silences the warning.)

I use Kimurai to query JavaScript-heavy websites that do not otherwise easily allow parsing of the result. For
that purpose it works very well. For example, I use Kimurai to query the remote world time from a
website that uses JavaScript. (God, I hate JavaScript sooo much though ...)

Include Docker support

It's easy to get up and running using Docker (no need to install a bunch of dependencies on a system that you don't know about).

I got Docker working using the following files:

#Dockerfile
FROM ruby:2.5.3-stretch
RUN gem install kimurai
RUN apt-get update && apt install -q -y git unzip wget tar openssl xvfb chromium \
                                        firefox-esr libsqlite3-dev sqlite3 mysql-client default-libmysqlclient-dev

RUN cd /tmp && \
    wget https://chromedriver.storage.googleapis.com/2.39/chromedriver_linux64.zip && \
    unzip chromedriver_linux64.zip -d /usr/local/bin && \
    rm -f chromedriver_linux64.zip

RUN cd /tmp && \
    wget https://github.com/mozilla/geckodriver/releases/download/v0.21.0/geckodriver-v0.21.0-linux64.tar.gz && \
    tar -xvzf geckodriver-v0.21.0-linux64.tar.gz -C /usr/local/bin && \
    rm -f geckodriver-v0.21.0-linux64.tar.gz

RUN apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev && \
    cd /tmp && \
    wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib && \
    ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin && \
    rm -f phantomjs-2.1.1-linux-x86_64.tar.bz2

RUN mkdir -p /app

ADD Gemfile /app

RUN cd /app && bundle install

ENTRYPOINT ["kimurai"]

And its docker-compose.yml:

# 'extends' is not supported in version 3
version: '2'

services:

  base:
    build: ./
    entrypoint: /bin/bash
    working_dir: /app
    volumes:
      - ./:/app

  irb:
    extends: base
    entrypoint: irb
    volumes:
      - ./:/app

  kimurai:
    extends: base
    entrypoint: bundle exec kimurai
    volumes:
      - ./:/app

  crawl:
    extends: kimurai
    command: crawl
    volumes:
      - ./:/app

How to pass argument to Spider

Hi,

First of all, thanks for your hard work; Kimurai really helps me a lot. I'm using it in a Sinatra app and controlling it through web requests. But I just can't find a proper way to pass data (or args) to a spider.

I'm using proxies, and I have a proxy class with a .fetch method that returns a new proxy. What I want to do is pass this proxy to my spider. If I include my proxy class inside the spider class, proxy.fetch only works once. That's not what I want; I want every run to use a different proxy.

Is there a way to pass some arguments when calling the spider, like ExampleSpider.crawl!(foo, bar)?
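
crawl! doesn't take arguments today, but one pragmatic workaround is a class-level attribute you assign just before crawling. A sketch (ProxyPool stands in for your own proxy class, and the thread-safety caveat raised in the "Why class instead of instances?" issue below applies):

require 'kimurai'

class ExampleSpider < Kimurai::Base
  @name   = "example_spider"
  @engine = :mechanize
  @start_urls = ["https://example.com/"]

  class << self
    attr_accessor :proxy  # assigned from the outside before each run
  end

  def parse(response, url:, data: {})
    current_proxy = self.class.proxy
    save_to "results.json", { url: url, proxy: current_proxy }, format: :json
  end
end

ExampleSpider.proxy = ProxyPool.fetch
ExampleSpider.crawl!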

How to download a file? Alternatively, how to pass custom opts to the driver?

Hi there. I'm testing Kimurai to try to automate a daily download of a bank statement. I've already managed to get the login working from the console, and to click the button which fires the download (using browser.click_on). The file gets downloaded, but I haven't found any way to control where it gets downloaded to.

I found this example on downloading a file with Selenium, but Kimurai doesn't seem to have any "official" way for me to apply custom configuration to the driver.

Do you have any recommendations on how to proceed?
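
Kimurai doesn't seem to expose a hook for extra driver options, but for reference, plain selenium-webdriver lets you steer Chrome's download directory through preferences. A sketch of the underlying API, not wired into Kimurai:

require 'selenium-webdriver'

options = Selenium::WebDriver::Chrome::Options.new
# tell Chrome where to put downloads and not to prompt for each file
options.add_preference(:download, default_directory: "/tmp/statements",
                                  prompt_for_download: false)
driver = Selenium::WebDriver.for(:chrome, options: options)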

Create directories before saving item

It seems currently no mkdir -p is run when items are saved for the first time.

Errno::ENOENT: No such file or directory @ rb_sysopen - ./results/spidername/.categories.json
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base/saver.rb:64:in `initialize'
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base/saver.rb:64:in `open'
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base/saver.rb:64:in `save_to_pretty_json'
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base/saver.rb:31:in `block in save'
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base/saver.rb:23:in `synchronize'
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base/saver.rb:23:in `save'
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base.rb:236:in `save_to'
  /app/spiders/application_spider.rb:66:in `create_category'
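
As a workaround until the saver does this itself, creating the directory in the spider before calling save_to avoids the ENOENT. A sketch (create_category is the method name taken from the trace above):

require 'fileutils'

def create_category(item)
  path = "./results/spidername/.categories.json"
  FileUtils.mkdir_p(File.dirname(path))  # ensure ./results/spidername exists before writing
  save_to path, item, format: :pretty_json
end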

How to skip SocketError?

I, [2018-10-23T12:10:35.535202 #301]  INFO -- : Info: items: sent: 28, processed: 26
D, [2018-10-23T12:10:35.535471 #301] DEBUG -- : Browser: sleep 0.4 seconds before request...
I, [2018-10-23T12:10:35.935690 #301]  INFO -- : Browser: started get request to: https://www.video.com/?q=xxx&p=9
I, [2018-10-23T12:11:05.967685 #301]  INFO -- : Info: visits: requests: 38, responses: 37
I, [2018-10-23T12:11:05.967887 #301]  INFO -- : Browser: driver mechanize has been destroyed

Spider: stopped: {:spider_name=>"videos_spider", :status=>:failed, :error=>"#<SocketError: Failed to open TCP connection to www.video.com:443 (getaddrinfo: Name or service not known)>", :environment=>"production", :start_time=>2018-10-23 04:31:46 +0000, :stop_time=>2018-10-23 12:11:05 +0000, :running_time=>"7h, 39m", :visits=>{:requests=>38, :responses=>37}, :items=>{:sent=>28, :processed=>26}, :events=>{:requests_errors=>{}, :drop_items_errors=>{"#<Kimurai::Pipeline::DropItemError: Item download error.>"=>2}, :custom=>{}}}

I want to skip this error and keep the spider going to the next page.

My spider's config:

@config = {
    skip_request_errors: [
      { error: RuntimeError, message: "404 => Net::HTTPNotFound" },
      { error: Net::HTTPNotFound, message: "404 => Net::HTTPNotFound" },
      { error: Down::ConnectionError, message: "Down::ConnectionError, Item Dropped." },
      { error: Net::OpenTimeout, message: "Net::OpenTimeout, Item Dropped." },
      { error: SocketError, message: "SocketError, Item Dropped." },
    ],
    before_request: {
      delay: 0.4
    }
  }

request_to method throws argument error for Ruby 3.0

Hello,

First, think you for maintaining this fantastic framework.

I set up a spider pretty much identical to the one in the README. I wrote a parse function with the same arguments as those specified in the README as well: (response, url:, data: {}).

In my first parse function I used the request_to method to route URLs to a second parse function, which has the same arguments as the first.

I got the following error: wrong number of arguments (given 2, expected 1; required keyword: url) (ArgumentError)

I'm running Ruby 3.0.1.

I believe there may be an issue with the use of keyword arguments in the request_to method related to Ruby 3.0. The spider works fine when I visit the url using the browser object and call the second parse function directly.

This appears to be similar to the related issue with rbcat

I'm relatively new to Ruby, so I apologize in advance for any inaccuracies!
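
For reference, a minimal plain-Ruby reproduction of the keyword-argument change that request_to likely trips over under Ruby 3 (nothing Kimurai-specific here):

def parse(response, url:, data: {})
  [response, url, data]
end

opts = { url: "https://example.com", data: {} }
parse("<html>", opts)    # ArgumentError on Ruby 3.x: the hash is no longer converted to keywords
parse("<html>", **opts)  # fine: keywords have to be splatted explicitly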

Pass response callback as block

The one thing that has bothered me about Scrapy is that callbacks can't be given inline to show the visual hierarchy of pages scraped. Ruby however has blocks. Could we do something like this?

request url: 'http://example.com' do |response|
    response.xpath("//a[@class='next_page']").each do |next_link|
        request url: next_link do |response2|
            #etc
        end
    end
end

How to set encoding?

When a website is encoded with GB2312, its content cannot be obtained normally.

I think it would be better to change

    def current_response(response_type = :html)
      case response_type
      when :html
        Nokogiri::HTML(body)
      when :json
        JSON.parse(body)
      end
    end

TO

    def current_response(response_type = :html)
      case response_type
      when :html
        Nokogiri::HTML(body, nil, @config[:encoding])
      when :json
        JSON.parse(body)
      end
    end

OR

    def current_response(response_type = :html)
      case response_type
      when :html
        Nokogiri::HTML(body.force_encoding("encoding"))
      when :json
        JSON.parse(body)
      end
    end

Broken load balancers support

Hi,

I've been trying to set up a scraper for a website that is behind a load balancer. The thing is, out of every 10 requests there is 1 that goes to a bad backend and stalls at SSL negotiation.

I can't find a way to reduce the Mechanize read timeout (same with selenium_chrome). From Stack Overflow, this can be done as in the following example:

agent = Mechanize.new { |a| a.log = Logger.new("mech.log") }
agent.keep_alive=false
agent.open_timeout=15
agent.read_timeout=15

Is there any way to pass those parameters to Mechanize?

uninitialized constant Downloader::MovieEnum

I created movie_enum.rb in ./lib/

It is not loaded when I run bundle exec kimurai crawl movie_spider.

So I added this to config/initializers/boot.rb, below the pipeline requires:

# require lib
Dir.glob(File.join("./lib", "*.rb"), &method(:require))

it works.

Update Readme to include 'lsof' aptfile

It took me hours to figure this out so I want to help anyone else having trouble getting this running on Heroku.

Kimurai uses the lsof command, so you need to install the apt Heroku buildpack to make lsof available. Follow the directions described on the buildpack page: you basically need to create an Aptfile containing the single line lsof, put it in the root folder, and add the Heroku buildpack. Can you add this to the docs? Thanks!

Wrap the Nokogiri response to reduce boilerplate

Currently, a Nokogiri object is passed as an argument to a callback. This results in some boilerplate, since some operations have to be defined over and over: extracting the text, formatting the results, etc.

If, instead of the raw response, a wrapper object were supplied, we could decorate it with some nice utility functions and let it carry the URL. Wrapping the object would allow the definition of custom selectors besides css and xpath, such as regex, or a composite of any of these. It would also remove the need to specify whether single or multiple elements have to be extracted, similar to Scrapy's extract() and extract_first() on scrapy.Selector.

# in your scraper class
def parse_product_list_page(product_list, url:, data: {})
    product_ids = product_list.regex /"id":([0-9]+)\,/
end
#Page.rb
require 'forwardable'

class Page
  extend Forwardable

  def initialize(response, browser)
    @response = response
    @browser = browser
  end

  # get the current HTML page (fresh)
  def refresh
    @response = @browser.current_response
    self
  end

  #
  # extract methods
  #

  # general purpose entrypoint
  def extract(expression, multi: true, async: false)
    if async
      extract_on_ready(expression, multi: multi)
    elsif multi
      extract_all(expression)
    else
      extract_single(expression)
    end
  end

  # extract first element
  def extract_single(expression, **opts)
    extract_all(expression, **opts).first
  end

  # TODO: wrap results so we can apply a new expression on the subset
  def extract_all(expression, wrap=false)
    query = SelectorExpression.instance(expression)
    # self.send calls the delegated xpath() and css() functions, based on the type of the selector wrapper object ("expression"), which defaults to css
    Array(self.send(query.type, query.to_s))
  end

  def extract_on_ready(expression, multi: true, retries: 3, wait: 1, default: nil)
    retries.times do
      result = extract(expression, multi: multi, async: false)
      case result
      when Nokogiri::XML::Element
        return result
      when Nokogiri::XML::NodeSet, Array
        return result if !result.empty?
      end
      sleep wait
      refresh
    end
    default
  end

  #
  # Nokogiri wrapping
  #

  # delegate functions to the response object so this Page object responds to all classic parsing and selection functions
  def_delegators :@response, :xpath, :css, :text, :children

  def regex(selector)
    @response.text.scan(selector.to_s)
  end

end

Beyond this, it could be a consideration to also wrap the results of xpath() and css() calls, so we would have the same utility functions when doing a subquery:

page.xpath('//').css('.items').regex(/my-regex/)

Why class instead of instances?

Genuinely curious: it seems a bit unusual, as it's not as straightforward to change the start_urls at runtime (if I understood correctly, class instance variables are not thread-safe, so if I change them at runtime they might wreak havoc in something like Sidekiq?).

Link here from archived project?

Hi, thanks for your work on this project, it's really nice to work with. One thing that keeps bothering me is that when you google 'kimurai' the top result is this archived project. Not sure why that happened, but it's probably a good idea to link here from there so people don't think the project is dead.

Can't run within a test suite that is using Capybara

This is an interesting challenge:

When attempting to exercise a crawler within the context of a test suite which also runs Capybara for system specs, we run into the problem that both the test suite and Kimurai are trying to configure Capybara.

If Kimurai runs first, then my system specs fail because, for example, Kimurai specifies xpath as the default selector.

If my system specs run first, then the specs using Kimurai fail because:

Threadsafe setting cannot be changed once a session is created".

I wonder if these are just incompatible or if there's a way around this?

Issues with using skip_request_errors

I am trying to use the configuration provided to skip 404 errors, but instead I am getting a RuntimeError raised. Perhaps this is the intended behaviour, but I was expecting to get false, an empty object, or something like that. Let me know if I misunderstood the functionality. Here is the configuration:

# frozen_string_literal: true

require 'kimurai'

module Spiders
  class Test < Kimurai::Base
    @name                = 'test_spider'
    @disable_images      = true
    @engine              = :mechanize
    @skip_request_errors = [
      { error: RuntimeError }
    ]

    def parse(response, url:, data: {})
    end
  end
end

If I then run it with Spiders::Test.parse!(:parse, url: 'https://google.com/asdfsdf'), I get back this error:

BrowserBuilder (mechanize): created browser instance
Browser: started get request to: https://google.com/asdfsdf
Browser: driver mechanize has been destroyed
Traceback (most recent call last):
        2: from (irb):2
        1: from (irb):2:in `rescue in irb_binding'
RuntimeError (Received the following error for a GET request to https://google.com/asdfsdf: '404 => Net::HTTPNotFound for https://google.com/asdfsdf -- unhandled response')

Am I doing something wrong, or is that the expected behaviour? I also tried this for the configuration:
{ error: RuntimeError, message: '404 => Net::HTTPNotFound' }
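
One thing worth checking (an assumption on my part, not a confirmed diagnosis): elsewhere in this tracker and in the README the skip_request_errors and disable_images settings live inside @config rather than as standalone class-level variables, so the version above may simply be ignored. A sketch of the nested form:

module Spiders
  class Test < Kimurai::Base
    @name   = 'test_spider'
    @engine = :mechanize
    @config = {
      disable_images: true,
      skip_request_errors: [
        { error: RuntimeError, message: '404 => Net::HTTPNotFound' }
      ]
    }

    def parse(response, url:, data: {})
    end
  end
end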

Error when installing on Linux

Error when installing on Ubuntu.
Logs:

root@f7f25d74ee8e:/Users/toan/Desktop/ruby/main-backend# kimurai setup localhost --local 

PLAY [all] *****************************************************************************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************************************************************
ok: [localhost]

TASK [Update apt cache] ****************************************************************************************************************************************************************
changed: [localhost]

TASK [Install base packages] ***********************************************************************************************************************************************************
[DEPRECATION WARNING]: Invoking "apt" only once while using a loop via squash_actions is deprecated. Instead of using a loop to supply multiple items and specifying `pkg: "{{ item 
}}"`, please use `pkg: ['xvfb', 'libsqlite3-dev', 'sqlite3', 'mongodb-clients', 'mysql-client', 'libmysqlclient-dev', 'postgresql-client', 'libpq-dev']` and remove the loop. This 
feature will be removed in version 2.11. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
failed: [localhost] (item=['xvfb', 'libsqlite3-dev', 'sqlite3', 'mongodb-clients', 'mysql-client', 'libmysqlclient-dev', 'postgresql-client', 'libpq-dev']) => {"changed": false, "item": ["xvfb", "libsqlite3-dev", "sqlite3", "mongodb-clients", "mysql-client", "libmysqlclient-dev", "postgresql-client", "libpq-dev"], "msg": "No package matching 'mongodb-clients' is available"}
        to retry, use: --limit @/usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/automation/setup.retry

PLAY RECAP *****************************************************************************************************************************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=1   

How to limit the search depth level?

Like other scraping frameworks, e.g. Colly in Go:

c := colly.NewCollector(
		// MaxDepth is 1, so only the links on the scraped page
		// is visited, and no further links are followed
		colly.MaxDepth(1),
	)
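
There's no built-in equivalent as far as I know, but the depth can be threaded through the data: hash that request_to already passes along. A sketch (the "a.next" selector is just a placeholder):

MAX_DEPTH = 1

def parse(response, url:, data: {})
  depth = data.fetch(:depth, 0)
  return if depth >= MAX_DEPTH  # stop following links past the configured depth

  response.css("a.next").each do |a|
    request_to :parse, url: absolute_url(a[:href], base: url), data: { depth: depth + 1 }
  end
end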
