Comments (8)

caecity43 commented on May 17, 2024

@vifreefly

Sorry, my bad. I found that I was still using 1.1.0.

It works after updating to 1.2.0.

from kimuraframework.

vifreefly commented on May 17, 2024

@caecity43

Here is the error you want to skip:

#<SocketError: Failed to open TCP connection to www.video.com:443 (getaddrinfo: Name or service not known)>

Here the error class is SocketError and the message is Failed to open TCP connection to www.video.com:443 (getaddrinfo: Name or service not known).

Your config:

error: SocketError, message: "SocketError, Item Dropped." 

As you can see, there is no match: the message in your config does not appear in the actual error message. The correct entry would be: error: SocketError, message: "Failed to open TCP connection to www.video.com". Or you can provide only the error class, for example:

    skip_request_errors: [
      { error: RuntimeError, message: "404 => Net::HTTPNotFound" },
      { error: Net::HTTPNotFound, message: "404 => Net::HTTPNotFound" },
      { error: Down::ConnectionError, message: "Down::ConnectionError, Item Dropped." },
      { error: Net::OpenTimeout, message: "Net::OpenTimeout, Item Dropped." },
      SocketError
    ],

It will work as well.
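
To make the matching rule above concrete, here is a minimal Ruby sketch of how an entry could be compared against a raised error. It assumes a hash entry matches when the error is an instance of the given class and the configured message is contained in the actual error message, and that a bare class matches on class alone. The helper name `skip_error?` is hypothetical and is not part of Kimurai's API; the framework's real implementation may differ.

```ruby
require "socket" # defines SocketError

# Hypothetical helper (not Kimurai's API): returns true if any rule matches.
# A rule is either a bare error class or a hash { error: Class, message: String }.
def skip_error?(error, rules)
  rules.any? do |rule|
    if rule.is_a?(Hash)
      class_ok = error.is_a?(rule[:error])
      message = rule[:message]
      # Assumption: the configured message must be a substring of the real one.
      message_ok = message.nil? || error.message.include?(message)
      class_ok && message_ok
    else
      error.is_a?(rule)
    end
  end
end

error = SocketError.new(
  "Failed to open TCP connection to www.video.com:443 " \
  "(getaddrinfo: Name or service not known)"
)

# The message from the original config is not part of the real error message.
mismatch = skip_error?(error, [{ error: SocketError, message: "SocketError, Item Dropped." }])
# => false

# A substring of the real message matches.
substring_match = skip_error?(error, [{ error: SocketError, message: "Failed to open TCP connection to www.video.com" }])
# => true

# A bare class matches regardless of the message.
class_only = skip_error?(error, [SocketError])
# => true
```

Under these assumptions, the message field acts as a filter on the real error text rather than as a custom label to log when the error is skipped.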

caecity43 commented on May 17, 2024

@vifreefly Thanks, I will try this. 👍

caecity43 commented on May 17, 2024

Spider: stopped: {:spider_name=>"videos_spider", :status=>:failed, :error=>"#<RuntimeError: Received the following error for a GET request to https://www.videos.com/test'404 => Net::HTTPNotFound for https://www.videos.com/test -- unhandled response'>", :environment=>"production", :start_time=>2018-10-30 07:33:15 +0000, :stop_time=>2018-10-30 09:38:56 +0000, :running_time=>"2h, 5m", :visits=>{:requests=>27, :responses=>26}, :items=>{:sent=>13, :processed=>13}, :events=>{:requests_errors=>{}, :drop_items_errors=>{}, :custom=>{}}}

My config:

  @config = {
    #retry_request_errors: [Net::ReadTimeout],
    skip_request_errors: [
      { error: RuntimeError, message: "404 => Net::HTTPNotFound" },
      { error: Net::HTTPNotFound, message: "404 => Net::HTTPNotFound" },
      { error: Down::ConnectionError, message: "Down::ConnectionError, Item Dropped." },
      { error: Net::OpenTimeout, message: "Net::OpenTimeout, Item Dropped." },
      Net::HTTPNotFound,
      SocketError,
    ],
    before_request: {
      delay: 0.4
    }
  }

It is not working.

vifreefly commented on May 17, 2024

@caecity43 Which version of Kimurai are you using?

vifreefly commented on May 17, 2024

@caecity43, LGTM:

$ kimurai -v
1.2.0

# example_spider.rb
require 'kimurai'
require 'net/http'

class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :mechanize
  @start_urls = ["https://www.videos.com/test"]
  @config = {
    skip_request_errors: [
      { error: RuntimeError, message: "404 => Net::HTTPNotFound" }
    ]
  }

  def parse(response, url:, data: {})
    logger.info response.title
  end
end

ExampleSpider.crawl!

$ ruby example_spider.rb

I, [2018-10-31 09:02:53 +0400#7185] [M: 46973272606200]  INFO -- example_spider: Spider: started: example_spider
D, [2018-10-31 09:02:54 +0400#7185] [M: 46973272606200] DEBUG -- example_spider: BrowserBuilder (mechanize): created browser instance
D, [2018-10-31 09:02:54 +0400#7185] [M: 46973272606200] DEBUG -- example_spider: BrowserBuilder (mechanize): enabled skip_request_errors
I, [2018-10-31 09:02:54 +0400#7185] [M: 46973272606200]  INFO -- example_spider: Browser: started get request to: https://www.videos.com/test
E, [2018-10-31 09:02:55 +0400#7185] [M: 46973272606200] ERROR -- example_spider: Browser: skip request error: #<RuntimeError: Received the following error for a GET request to https://www.videos.com/test: '404 => Net::HTTPNotFound for https://www.videos.com/test -- unhandled response'>, url: https://www.videos.com/test
I, [2018-10-31 09:02:55 +0400#7185] [M: 46973272606200]  INFO -- example_spider: Info: visits: requests: 1, responses: 0
I, [2018-10-31 09:02:55 +0400#7185] [M: 46973272606200]  INFO -- example_spider: Browser: driver mechanize has been destroyed
I, [2018-10-31 09:02:55 +0400#7185] [M: 46973272606200]  INFO -- example_spider: Spider: stopped: {:spider_name=>"example_spider", :status=>:completed, :error=>nil, :environment=>"development", :start_time=>2018-10-31 09:02:53 +0400, :stop_time=>2018-10-31 09:02:55 +0400, :running_time=>"1s", :visits=>{:requests=>1, :responses=>0}, :items=>{:sent=>0, :processed=>0}, :events=>{:requests_errors=>{"#<RuntimeError: Received the following error for a GET request to https://www.videos.com/test: '404 => Net::HTTPNotFound for https://www.videos.com/test -- unhandled response'>"=>1}, :drop_items_errors=>{}, :custom=>{}}}

caecity43 commented on May 17, 2024

@vifreefly I use 1.2.0 too

vifreefly commented on May 17, 2024

@caecity43

OK, so does my example spider above work for you? If not, please provide a minimal spider example that fails, so I will be able to investigate the problem. Otherwise I'll close this issue.
