Comments (8)
Hey Jetienne,
I'm not sure why that's happening. It looks like it's happening inside net/http. What I can do is wrap the requests in an external library that handles all these exceptions (I'm planning to do this anyway).
One question: does this happen every time, or was it a one-off? I was able to test your URL with Ruby 1.8.7 and Ruby 1.9.2p136.
from rawler.
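As an illustration of the "wrap the requests" idea above, here is a minimal sketch, assuming the failures surface as ordinary net/http and socket exceptions. The helper name safe_request and the exact list of exception classes are hypothetical and not taken from rawler's code.

```ruby
require 'net/http'
require 'uri'
require 'timeout'
require 'openssl'

# Hypothetical list of errors a crawler might want to survive.
# Extend it with whatever your own crawl actually raises.
HTTP_ERRORS = [
  Timeout::Error,          # also covers Net::OpenTimeout / Net::ReadTimeout
  SocketError,             # e.g. DNS lookup failures
  EOFError,
  Errno::ECONNREFUSED,
  Errno::ECONNRESET,
  Errno::ETIMEDOUT,
  Net::ProtocolError,
  OpenSSL::SSL::SSLError
].freeze

# Runs the given block and returns its value. If one of the listed
# network errors is raised, logs it to stderr and returns nil instead
# of aborting the whole crawl. Other exceptions still propagate.
def safe_request
  yield
rescue *HTTP_ERRORS => e
  warn "request failed: #{e.class}: #{e.message}"
  nil
end

# Usage (not executed here because it needs network access):
# response = safe_request { Net::HTTP.get_response(URI('http://example.com/')) }
```

Returning nil keeps the sketch short; a real crawler would probably want to report the failing URL alongside the error.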
Hi,
Yes, this happens all the time. Strange that we don't see the same behavior.
Regards,
from rawler.
Hey
Here is another error:
ruby rawler http://www.google.fr
200 - http://www.google.fr/imghp?hl=fr&tab=wi
200 - http://www.google.fr/webhp?hl=fr&tab=iw
200 - http://video.google.fr/?hl=fr&tab=wv
200 - http://maps.google.fr/maps?hl=fr&tab=wl
200 - http://news.google.fr/nwshp?hl=fr&tab=wn
200 - http://www.google.fr/prdhp?hl=fr&tab=wf
/Users/Arkan/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/uri/generic.rb:746:in `rescue in merge': bad URI(is not URI?): /products?hl=fr&gl=fr&holiday=l10&q=sèche-linge (URI::InvalidURIError)
from /Users/Arkan/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/uri/generic.rb:743:in `merge'
from /Users/Arkan/Code/rawler/lib/rawler/crawler.rb:28:in `absolute_url'
from /Users/Arkan/Code/rawler/lib/rawler/crawler.rb:19:in `block in links'
from /Users/Arkan/.rvm/gems/ruby-1.9.2-p0@messy/gems/nokogiri-1.4.4/lib/nokogiri/xml/node_set.rb:239:in `block in each'
from /Users/Arkan/.rvm/gems/ruby-1.9.2-p0@messy/gems/nokogiri-1.4.4/lib/nokogiri/xml/node_set.rb:238:in `upto'
from /Users/Arkan/.rvm/gems/ruby-1.9.2-p0@messy/gems/nokogiri-1.4.4/lib/nokogiri/xml/node_set.rb:238:in `each'
from /Users/Arkan/Code/rawler/lib/rawler/crawler.rb:19:in `map'
from /Users/Arkan/Code/rawler/lib/rawler/crawler.rb:19:in `links'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:23:in `validate_links_in_page'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:31:in `validate_page'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:24:in `block in validate_links_in_page'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:23:in `each'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:23:in `validate_links_in_page'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:31:in `validate_page'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:24:in `block in validate_links_in_page'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:23:in `each'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:23:in `validate_links_in_page'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:31:in `validate_page'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:24:in `block in validate_links_in_page'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:23:in `each'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:23:in `validate_links_in_page'
from /Users/Arkan/Code/rawler/lib/rawler/base.rb:17:in `validate'
from rawler:32:in
zsh: exit 1 ruby rawler http://www.google.fr
from rawler.
Thanks guys,
I think I have fixed the problem. Please run gem update rawler (make sure you end up on 0.0.4). The second problem should be gone; unfortunately, I wasn't able to reproduce the first one.
from rawler.
Hi Oscar,
I reproduced the bug even in 0.0.4:
>>> rawler http://fr.yahoo.com/
/Users/jney/.rvm/rubies/jruby-1.5.6/lib/ruby/1.8/uri/generic.rb:732:in `merge': bad URI(is not URI?): %20http://fr.docs.yahoo.com/mail/fast/index.html%20 (URI::InvalidURIError)
from /Users/jney/.rvm/gems/jruby-1.5.6/gems/rawler-0.0.4/lib/rawler/crawler.rb:28:in `absolute_url'
from /Users/jney/.rvm/gems/jruby-1.5.6/gems/rawler-0.0.4/lib/rawler/crawler.rb:19:in `links'
from /Users/jney/.rvm/gems/jruby-1.5.6/gems/nokogiri-1.4.4.2-java/lib/nokogiri/xml/node_set.rb:239:in `each'
from /Users/jney/.rvm/gems/jruby-1.5.6/gems/nokogiri-1.4.4.2-java/lib/nokogiri/xml/node_set.rb:238:in `upto'
from /Users/jney/.rvm/gems/jruby-1.5.6/gems/nokogiri-1.4.4.2-java/lib/nokogiri/xml/node_set.rb:238:in `each'
from /Users/jney/.rvm/gems/jruby-1.5.6/gems/rawler-0.0.4/lib/rawler/crawler.rb:19:in `map'
from /Users/jney/.rvm/gems/jruby-1.5.6/gems/rawler-0.0.4/lib/rawler/crawler.rb:19:in `links'
from /Users/jney/.rvm/gems/jruby-1.5.6/gems/rawler-0.0.4/lib/rawler/base.rb:23:in `validate_links_in_page'
from /Users/jney/.rvm/gems/jruby-1.5.6/gems/rawler-0.0.4/lib/rawler/base.rb:17:in `validate'
from /Users/jney/.rvm/gems/jruby-1.5.6/gems/rawler-0.0.4/bin/rawler:32
from /Users/jney/.rvm/gems/jruby-1.5.6/gems/rawler-0.0.4/bin/rawler:19:in `load'
from /usr/local/bin/rawler:19
zsh: exit 1 rawler http://fr.yahoo.com/
from rawler.
Hi Oscar! I'm having the same problem on 0.0.5:
":in `rescue in merge': bad URI(is not URI?): http://www.cardif.com.ar/Website Cardif/Argentina/site/movistar/paginaseguros.html (URI::InvalidURIError)"
from rawler.
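The three bad URI reports in this thread involve a non-ASCII character (sèche-linge), a link wrapped in %20, and a literal space in the path. The following is a minimal sketch of one way to clean such links before resolving them against the page URL. It is not rawler's actual fix, and safe_absolute_url is a hypothetical helper name; it only assumes Ruby's standard uri library.

```ruby
require 'uri'

# Hypothetical helper: resolves href against base and returns the
# absolute URL as a String, or nil if the link still cannot be parsed.
def safe_absolute_url(base, href)
  # Drop whitespace (literal or encoded as %20) around the link.
  cleaned = href.gsub(/\A(?:\s|%20)+|(?:\s|%20)+\z/, '')

  # Percent-encode spaces and non-ASCII characters byte by byte.
  # Existing escapes such as %C3 are left alone because '%' is kept.
  cleaned = cleaned.gsub(/[^\x21-\x7E]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end

  URI.join(base, cleaned).to_s
rescue URI::InvalidURIError
  # Assumption: a crawler would rather skip a broken link than crash.
  nil
end

puts safe_absolute_url('http://www.google.fr',
                       '/products?hl=fr&gl=fr&holiday=l10&q=sèche-linge')
puts safe_absolute_url('http://fr.yahoo.com/',
                       '%20http://fr.docs.yahoo.com/mail/fast/index.html%20')
```

Returning nil on a parse failure is a design choice for the sketch: the caller can log the offending link and carry on with the rest of the page.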
PabloC, thanks. These issues should be solved as soon as I switch to Mechanize. Sorry for the inconvenience.
from rawler.
Hi guys, I'm closing this. I've tested most of the URLs here and they should work in 0.6. If you find other problems, please open a new ticket so I can start over. Thanks again for ALL the help.
from rawler.