Comments (8)
Sorry, my bad: I found that I was still using 1.1.0.
It works after updating to 1.2.0.
from kimuraframework.
Here is the error you want to skip:
#<SocketError: Failed to open TCP connection to www.video.com:443 (getaddrinfo: Name or service not known)>
The error class is SocketError and the message is "Failed to open TCP connection to www.video.com:443 (getaddrinfo: Name or service not known)".
Your config:
error: SocketError, message: "SocketError, Item Dropped."
As you can see, there is no match, because the configured message does not appear in the actual error message. The correct entry would be:
error: SocketError, message: "Failed to open TCP connection to www.video.com"
Or you can provide only the error class, for example:
skip_request_errors: [
  { error: RuntimeError, message: "404 => Net::HTTPNotFound" },
  { error: Net::HTTPNotFound, message: "404 => Net::HTTPNotFound" },
  { error: Down::ConnectionError, message: "Down::ConnectionError, Item Dropped." },
  { error: Net::OpenTimeout, message: "Net::OpenTimeout, Item Dropped." },
  SocketError
],
It will work as well.
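To illustrate why the first config does not match and the other two do, here is a minimal sketch. It assumes that a hash entry matches when the error class is equal and the error message contains the configured string, and that a bare class matches on class alone. The helper name skip_error? and the variable names are hypothetical and only for illustration; this is not Kimurai's actual code.

```ruby
require 'socket' # defines SocketError

# Hypothetical helper: returns true if any rule matches the error.
# A Hash rule needs the same class AND the message as a substring;
# a bare class rule needs only the same class.
def skip_error?(error, rules)
  rules.any? do |rule|
    if rule.is_a?(Hash)
      rule[:error] == error.class && error.message.include?(rule[:message])
    else
      rule == error.class
    end
  end
end

error = SocketError.new(
  "Failed to open TCP connection to www.video.com:443 " \
  "(getaddrinfo: Name or service not known)"
)

non_matching = [{ error: SocketError, message: "SocketError, Item Dropped." }]
matching     = [{ error: SocketError, message: "Failed to open TCP connection to www.video.com" }]
class_only   = [SocketError]

puts skip_error?(error, non_matching) # false: message is not part of the real error message
puts skip_error?(error, matching)     # true: class and message substring both match
puts skip_error?(error, class_only)   # true: class alone is enough
```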
from kimuraframework.
@vifreefly Thanks, I will try this. 👍
from kimuraframework.
Spider: stopped: {:spider_name=>"videos_spider", :status=>:failed, :error=>"#<RuntimeError: Received the following error for a GET request to https://www.videos.com/test'404 => Net::HTTPNotFound for https://www.videos.com/test -- unhandled response'>", :environment=>"production", :start_time=>2018-10-30 07:33:15 +0000, :stop_time=>2018-10-30 09:38:56 +0000, :running_time=>"2h, 5m", :visits=>{:requests=>27, :responses=>26}, :items=>{:sent=>13, :processed=>13}, :events=>{:requests_errors=>{}, :drop_items_errors=>{}, :custom=>{}}}
My config:
@config = {
  #retry_request_errors: [Net::ReadTimeout],
  skip_request_errors: [
    { error: RuntimeError, message: "404 => Net::HTTPNotFound" },
    { error: Net::HTTPNotFound, message: "404 => Net::HTTPNotFound" },
    { error: Down::ConnectionError, message: "Down::ConnectionError, Item Dropped." },
    { error: Net::OpenTimeout, message: "Net::OpenTimeout, Item Dropped." },
    Net::HTTPNotFound,
    SocketError,
  ],
  before_request: {
    delay: 0.4
  }
}
It is not working.
from kimuraframework.
@caecity43 which version of Kimurai do you use?
from kimuraframework.
@caecity43, LGTM:
$ kimurai -v
1.2.0
# example_spider.rb
require 'kimurai'
require 'net/http'

class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :mechanize
  @start_urls = ["https://www.videos.com/test"]
  @config = {
    skip_request_errors: [
      { error: RuntimeError, message: "404 => Net::HTTPNotFound" }
    ]
  }

  def parse(response, url:, data: {})
    logger.info response.title
  end
end

ExampleSpider.crawl!
$ ruby example_spider.rb
I, [2018-10-31 09:02:53 +0400#7185] [M: 46973272606200] INFO -- example_spider: Spider: started: example_spider
D, [2018-10-31 09:02:54 +0400#7185] [M: 46973272606200] DEBUG -- example_spider: BrowserBuilder (mechanize): created browser instance
D, [2018-10-31 09:02:54 +0400#7185] [M: 46973272606200] DEBUG -- example_spider: BrowserBuilder (mechanize): enabled skip_request_errors
I, [2018-10-31 09:02:54 +0400#7185] [M: 46973272606200] INFO -- example_spider: Browser: started get request to: https://www.videos.com/test
E, [2018-10-31 09:02:55 +0400#7185] [M: 46973272606200] ERROR -- example_spider: Browser: skip request error: #<RuntimeError: Received the following error for a GET request to https://www.videos.com/test: '404 => Net::HTTPNotFound for https://www.videos.com/test -- unhandled response'>, url: https://www.videos.com/test
I, [2018-10-31 09:02:55 +0400#7185] [M: 46973272606200] INFO -- example_spider: Info: visits: requests: 1, responses: 0
I, [2018-10-31 09:02:55 +0400#7185] [M: 46973272606200] INFO -- example_spider: Browser: driver mechanize has been destroyed
I, [2018-10-31 09:02:55 +0400#7185] [M: 46973272606200] INFO -- example_spider: Spider: stopped: {:spider_name=>"example_spider", :status=>:completed, :error=>nil, :environment=>"development", :start_time=>2018-10-31 09:02:53 +0400, :stop_time=>2018-10-31 09:02:55 +0400, :running_time=>"1s", :visits=>{:requests=>1, :responses=>0}, :items=>{:sent=>0, :processed=>0}, :events=>{:requests_errors=>{"#<RuntimeError: Received the following error for a GET request to https://www.videos.com/test: '404 => Net::HTTPNotFound for https://www.videos.com/test -- unhandled response'>"=>1}, :drop_items_errors=>{}, :custom=>{}}}
from kimuraframework.
@vifreefly I use 1.2.0 too.
from kimuraframework.
OK, so does my example spider above work for you? If not, please provide a minimal spider example that fails, so I will be able to investigate the problem. Otherwise I'll close this issue.
from kimuraframework.