sparklemotion / mechanize

Mechanize is a ruby library that makes automated web interaction easy.

Home Page: https://www.rubydoc.info/gems/mechanize/

License: MIT License

Ruby 94.46% HTML 5.54%

scraping web ruby

mechanize's Introduction

Mechanize

Description

The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, and can follow links and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.

Dependencies

Ruby >= 2.6
Gems:
- addressable
- domain_name
- http-cookie
- mime-types
- net-http-digest_auth
- net-http-persistent
- nokogiri
- rubyntlm
- webrick
- webrobots

Support:

The bug tracker is available here:

https://github.com/sparklemotion/mechanize/issues

Examples

If you are just starting, check out GUIDE.rdoc or EXAMPLES.rdoc.

Developers

Use bundler to install dependencies:

bundle install

Run all tests with:

bundle exec rake test

See also Mechanize::TestCase to read about the built-in testing infrastructure.

Authors

Eric Hodel
Akinori MUSHA
Aaron Patterson
Lee Jarvis
Mike Dalessio

Acknowledgments

This library was heavily influenced by its namesake in the Perl world. A big thanks goes to Andy Lester, the author of the original Perl module WWW::Mechanize which is available here. Ruby Mechanize would not be around without you!

Thank you to Michael Neumann for starting the Ruby version. Thanks to everyone who's helped out in various ways. Finally, thank you to the people using this library!

License

This library is distributed under the MIT license. Please see LICENSE.txt.

mechanize's People


flavorjones iconnor bandito devyn spejman rexikan romanbsd caring ijcd regularfry eric dacort byplayer jkestr maiha chrisconley kemiller jwilkins hjast voker57 pftg eshao xiphias uctc73 russennis valo yhara scrubber gma kaievns tw1nk runa dekart bnoguchi mattfoster ujihisa timcharper cayblood dholdren fsvehla gwynforthewyn vodafon miker81 jasonabi reddavis inkybro woto chrisle futurespective lorensr derdewey kitamomonga bwlang jamesdaniels fansnap srbartlett jedisct1 chancancode robinsp ustun neocoin jaredfolkins lunks runpaint rubysolo jinschoi peburrows aai10 gerryster rkabir joshaidan caribio vgololobov dwebster icecreamboyy larstobi yury ponny santosh-1987 ranjithtenz zzak reinteractive steveklabnik pmq20 dsisnero ogrishman codenauts prakashmurthy rafamvc geekontheway jshakespear erkan-yilmaz flyabroadkit adamramadhan rubyuser123 jhartftw garthsnyder esimionato phyten tiankui

mechanize's Issues

TypeError: can't convert nil into String util.rb:40 in iconv

[gluenow.com ~/gluenow/current]$ gem list | grep mechanize
 mechanize (0.9.3)
[gluenow.com ~/gluenow/current]$ gem list | grep nokogiri
 nokogiri (1.3.3)
[gluenow.com ~/gluenow/current]$ ruby script/console production
 Loading production environment (Rails 2.2.2)
 WARNING: Nokogiri was built against LibXML version 2.6.30, but has dynamically loaded 2.6.31

...
>> page = agent.submit(login_form)
TypeError: can't convert nil into String
from /opt/local/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/util.rb:40:in `iconv'
from /opt/local/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/util.rb:40:in `from_native_charset'
from /opt/local/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/form.rb:152:in `from_native_charset'
from /opt/local/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/form.rb:144:in `proc_query'
from /opt/local/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/form.rb:143:in `map'
from /opt/local/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/form.rb:143:in `proc_query'
from /opt/local/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/form.rb:166:in `build_query'
from /opt/local/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/form.rb:165:in `each'
from /opt/local/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/form.rb:165:in `build_query'
from /opt/local/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/form.rb:214:in `request_data'
from /opt/local/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize.rb:401:in `post_form'
from /opt/local/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize.rb:344:in `submit'
from (irb):9

patched, working after
in mechanize-0.9.3/lib/www/mechanize/util.rb
BEFORE
40 Iconv.iconv(code, "UTF-8", s).join("")
AFTER
40 Iconv.iconv(code.to_s, "UTF-8", s.to_s).join("")
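
Iconv was later removed from Ruby's standard library, so the same nil guard can be sketched with String#encode instead. This is only an illustration and not the library's code: the helper name to_charset, and the decision to return the text unchanged when no target charset is given, are assumptions.

```ruby
# Hypothetical helper (not part of Mechanize): convert a UTF-8 string to the
# target charset, tolerating nil for either argument.
def to_charset(str, charset)
  text = str.to_s                                   # nil becomes ""
  return text if charset.nil? || charset.to_s.empty?
  text.encode(charset.to_s, "UTF-8")                # treat input as UTF-8
end

latin1 = to_charset("m\u00FCller", "ISO-8859-1")    # "ü" becomes one byte
empty  = to_charset(nil, "ISO-8859-1")              # no TypeError for nil
same   = to_charset("abc", nil)                     # unchanged without a charset
```

Unlike the to_s patch above, this sketch skips conversion entirely when the charset is missing rather than passing an empty charset name to the converter.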

SSL pages are not handled correctly

Requests to SSL pages fail in various ways, for example with 'Rescuing EOF error' or 'wrong status line'. I traced the problem to line 23 of ssl_resolver.rb, which checks ! ssl.frozen? I remember this being an issue in an earlier version of mechanize. I removed the check and things are dandy again, but that is hardly a solution.

I'm using Ruby 1.9.1p129 and the very latest Mechanize 0.9.3 gem, built directly from source.

undefined method `instance_variable_defined?' in ssl_resolver.rb

I get the error
undefined method `instance_variable_defined?' for #<Net::HTTP controlside.ds.corp.kelkoo.net:80 open=false> (NoMethodError)
with stacktrace
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/ssl_resolver.rb:20:in `handle'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain.rb:30:in `pass'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/handler.rb:6:in `handle'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/connection_resolver.rb:73:in `handle'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain.rb:30:in `pass'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/handler.rb:6:in `handle'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/request_resolver.rb:27:in `handle'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain.rb:30:in `pass'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/handler.rb:6:in `handle'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/parameter_resolver.rb:18:in `handle'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain.rb:30:in `pass'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/handler.rb:6:in `handle'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/uri_resolver.rb:72:in `handle'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain.rb:25:in `handle'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize.rb:457:in `fetch_page'
/usr/lib/ruby/gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize.rb:238:in `get'
/usr/lib/ruby/gems/1.8/gems/webrat-0.6.0/lib/webrat/adapters/mechanize.rb:18:in `get'
/usr/lib/ruby/gems/1.8/gems/webrat-0.6.0/lib/webrat/core/session.rb:281:in `send'
/usr/lib/ruby/gems/1.8/gems/webrat-0.6.0/lib/webrat/core/session.rb:281:in `process_request'
/usr/lib/ruby/gems/1.8/gems/webrat-0.6.0/lib/webrat/core/session.rb:122:in `request_page'
/usr/lib/ruby/gems/1.8/gems/webrat-0.6.0/lib/webrat/core/session.rb:220:in `visit'
(eval):2:in `visit'

My guess is that there is some inconsistency between the versions of mechanize, webrat and ruby.
I have the standard ruby (1.8.5) for CentOS.

I also tried compiling the latest version of ruby (1.9.1),
but both seem to produce a Net::HTTP object that doesn't have the instance_variable_defined? method:

irb(main):002:0> require 'net/http'
=> true
irb(main):003:0> n = Net::HTTP.new("http://google.fr")
=> #<Net::HTTP http://google.fr:80 open=false>
irb(main):004:0> n.instance_variable_defined?("@something")
NoMethodError: undefined method `instance_variable_defined?' for #<Net::HTTP http://google.fr:80 open=false>
from (irb):4
from /usr/lib/ruby/1.8/net/http.rb:2268

I'm not sure whether this is a bug in mechanize or not. However, I did solve the problem by changing
line 20 in ssl_resolver.rb from
if http_obj.instance_variable_defined?(:@ssl_context)
to
if http_obj.instance_variable_get("@ssl_context")

Could I submit the patch? Or should I continue to try some other version of mechanize, webrat?
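
The two checks in the proposed change are not quite equivalent: instance_variable_get returns nil for a variable that was never set, so it cannot tell "unset" apart from "set to nil". A minimal sketch of a check that works either way is below. The Connection class and the ssl_context_present? helper are hypothetical stand-ins and not Mechanize or Net::HTTP code.

```ruby
# Stand-in object; in the report above the receiver is a Net::HTTP instance.
class Connection
  def initialize(ssl_context = nil)
    # Leave the variable undefined when no context is supplied.
    @ssl_context = ssl_context unless ssl_context.nil?
  end
end

# Hypothetical helper: use instance_variable_defined? where the interpreter
# provides it, and fall back to instance_variable_get otherwise.
def ssl_context_present?(obj)
  if obj.respond_to?(:instance_variable_defined?)
    obj.instance_variable_defined?(:@ssl_context)
  else
    !obj.instance_variable_get(:@ssl_context).nil?
  end
end

without_ssl = ssl_context_present?(Connection.new)             # false
with_ssl    = ssl_context_present?(Connection.new(Object.new)) # true
```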

Regards,
Johan Martinsson

Support for NTLM

I added support for NTLM. See my blog post for the gem I made; from it you can work out what I did and add it in a nicer way: http://www.mindflowsolutions.net/2009/5/21/ruby-ntlm-mechanize

EOFError for no obvious reason?

Mechanize 1.0.0

Feature request to selectively turn off Content-Length checking? (???)

log of debug output below

require 'rubygems'
require 'mechanize'
require 'logger'

a = Mechanize.new
a.user_agent = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB7.0"
a.log = Logger.new(STDOUT)
p = a.get(:url => "http://www.retailmenot.com/view/abesofmaine.com")

If I comment-out the Content-Length check in Mechanize::Chain::ResponseReader, I do get the full page (and the length is the reported Content-Length).

Maybe there should be an option to suppress the Content-Length check? Not sure if it should be at the Mechanize-level, or at the level of a particular request.

In any case, I'm sure not getting why it would be EOF'ing anyways.

Log ------

I, [2010-02-16T14:34:21.230419 #4270] INFO -- : Net::HTTP::Get: /view/abesofmaine.com
D, [2010-02-16T14:34:21.230557 #4270] DEBUG -- : request-header: accept-language => en-us,en;q=0.5
D, [2010-02-16T14:34:21.230602 #4270] DEBUG -- : request-header: connection => keep-alive
D, [2010-02-16T14:34:21.230643 #4270] DEBUG -- : request-header: accept => */*
D, [2010-02-16T14:34:21.230684 #4270] DEBUG -- : request-header: accept-encoding => gzip,identity
D, [2010-02-16T14:34:21.230724 #4270] DEBUG -- : request-header: user-agent => Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB7.0
D, [2010-02-16T14:34:21.230766 #4270] DEBUG -- : request-header: accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7
D, [2010-02-16T14:34:21.231365 #4270] DEBUG -- : request-header: host => www.retailmenot.com
D, [2010-02-16T14:34:21.231416 #4270] DEBUG -- : request-header: keep-alive => 300
D, [2010-02-16T14:34:21.353167 #4270] DEBUG -- : Read 550 bytes
D, [2010-02-16T14:34:21.353423 #4270] DEBUG -- : Read 914 bytes
D, [2010-02-16T14:34:21.353894 #4270] DEBUG -- : Read 1938 bytes
D, [2010-02-16T14:34:21.354088 #4270] DEBUG -- : Read 2302 bytes
D, [2010-02-16T14:34:21.354302 #4270] DEBUG -- : Read 3326 bytes
D, [2010-02-16T14:34:21.354510 #4270] DEBUG -- : Read 3690 bytes
D, [2010-02-16T14:34:21.460328 #4270] DEBUG -- : Read 4714 bytes
D, [2010-02-16T14:34:21.460593 #4270] DEBUG -- : Read 5078 bytes
D, [2010-02-16T14:34:21.461439 #4270] DEBUG -- : Read 6102 bytes
D, [2010-02-16T14:34:21.461647 #4270] DEBUG -- : Read 6466 bytes
D, [2010-02-16T14:34:21.462294 #4270] DEBUG -- : Read 7490 bytes
D, [2010-02-16T14:34:21.462503 #4270] DEBUG -- : Read 7854 bytes
D, [2010-02-16T14:34:21.500857 #4270] DEBUG -- : Read 8878 bytes
D, [2010-02-16T14:34:21.501224 #4270] DEBUG -- : Read 9437 bytes
E, [2010-02-16T14:34:21.501605 #4270] ERROR -- : Rescuing EOF error
D, [2010-02-16T14:34:21.729018 #4270] DEBUG -- : Read 550 bytes
D, [2010-02-16T14:34:21.729343 #4270] DEBUG -- : Read 1574 bytes
D, [2010-02-16T14:34:21.729561 #4270] DEBUG -- : Read 2598 bytes
D, [2010-02-16T14:34:21.729748 #4270] DEBUG -- : Read 3622 bytes
D, [2010-02-16T14:34:21.729930 #4270] DEBUG -- : Read 3690 bytes
D, [2010-02-16T14:34:21.831662 #4270] DEBUG -- : Read 4714 bytes
D, [2010-02-16T14:34:21.831962 #4270] DEBUG -- : Read 5738 bytes
D, [2010-02-16T14:34:21.832202 #4270] DEBUG -- : Read 6466 bytes
D, [2010-02-16T14:34:21.832404 #4270] DEBUG -- : Read 7490 bytes
D, [2010-02-16T14:34:21.832586 #4270] DEBUG -- : Read 7854 bytes
D, [2010-02-16T14:34:21.899798 #4270] DEBUG -- : Read 8878 bytes
D, [2010-02-16T14:34:21.900050 #4270] DEBUG -- : Read 9437 bytes
E, [2010-02-16T14:34:21.900409 #4270] ERROR -- : Rescuing EOF error
D, [2010-02-16T14:34:22.139545 #4270] DEBUG -- : Read 550 bytes
D, [2010-02-16T14:34:22.139962 #4270] DEBUG -- : Read 1574 bytes
D, [2010-02-16T14:34:22.140176 #4270] DEBUG -- : Read 2598 bytes
D, [2010-02-16T14:34:22.140357 #4270] DEBUG -- : Read 3622 bytes
D, [2010-02-16T14:34:22.140538 #4270] DEBUG -- : Read 3690 bytes
D, [2010-02-16T14:34:22.245101 #4270] DEBUG -- : Read 4714 bytes
D, [2010-02-16T14:34:22.245364 #4270] DEBUG -- : Read 5078 bytes
D, [2010-02-16T14:34:22.245758 #4270] DEBUG -- : Read 6102 bytes
D, [2010-02-16T14:34:22.245971 #4270] DEBUG -- : Read 6466 bytes
D, [2010-02-16T14:34:22.246691 #4270] DEBUG -- : Read 7490 bytes
D, [2010-02-16T14:34:22.246890 #4270] DEBUG -- : Read 7854 bytes
D, [2010-02-16T14:34:22.300396 #4270] DEBUG -- : Read 8878 bytes
D, [2010-02-16T14:34:22.300651 #4270] DEBUG -- : Read 9437 bytes
E, [2010-02-16T14:34:22.301006 #4270] ERROR -- : Rescuing EOF error

isn't .links.text() deprecated?

rubyforge example is using it ..

assigning variables is hard. aka: stop messing with tmp_dh_callback= on frozen objects

scanning rubyforge
/Library/Ruby/Gems/1.8/gems/rubyforge-1.0.4/lib/rubyforge/client.rb:24:in `tmp_dh_callback=': can't modify frozen object (TypeError)
from /Library/Ruby/Gems/1.8/gems/rubyforge-1.0.4/lib/rubyforge/client.rb:24:in `use_ssl='
from /Library/Ruby/Gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/ssl_resolver.rb:25:in `handle'
from /Library/Ruby/Gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain.rb:30:in `pass'
from /Library/Ruby/Gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/handler.rb:6:in `handle'
from /Library/Ruby/Gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/connection_resolver.rb:73:in `handle'
from /Library/Ruby/Gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain.rb:30:in `pass'
from /Library/Ruby/Gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/handler.rb:6:in `handle'
from /Library/Ruby/Gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/request_resolver.rb:27:in `handle'
 ... 18 levels...
from /Library/Ruby/Gems/1.8/gems/omnifocus-1.2.0/lib/omnifocus.rb:53:in `run'
from /Library/Ruby/Gems/1.8/gems/omnifocus-1.2.0/bin/omnifocus:6
from /usr/bin/omnifocus:19:in `load'
from /usr/bin/omnifocus:19

redirect url bug

from aaron starr via mailing list:

Hi, all,

I found that I had to make the following patch to Mechanize for one of the sites I'm scraping:

alias fetch_page_original_version fetch_page
def fetch_page(params)
  params[:uri] = params[:uri].gsub(/^https?:/i) {|m| m.downcase } if String == params[:uri].class
  fetch_page_original_version(params)
end

(Also, here: http://pastie.org/1077542)

The problem was that the site was returning a 302 redirect with a Location header that looked like: httpS://www.blah-blah-blah... The weirdly capitalized protocol was causing EOF errors, so it needed to be adjusted.
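
The normalisation in the patch above can be tried in isolation. The sketch below is a standalone version of the same idea; the helper name normalize_scheme is an assumption, and it anchors the pattern with \A so that only the start of the string is matched.

```ruby
# Hypothetical helper: lowercase only the http/https scheme of a URL string
# and leave the rest of the URL untouched.
def normalize_scheme(url)
  return url unless url.is_a?(String)       # URI objects etc. pass through
  url.sub(/\Ahttps?:/i) { |scheme| scheme.downcase }
end

fixed = normalize_scheme("httpS://www.example.com/Some/Path")
# fixed is "https://www.example.com/Some/Path"; the path keeps its case
```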

My versions:

mechanize (1.0.0)
ruby 1.8.7 (2008-08-11 patchlevel 72) [x86_64-linux]

Gracefully support HTTP errors

Currently mechanize raises an error (mechanize.rb line 610) if it sees an error code like 400 in the response. I'd like to use webrat + mechanize to automate live testing of my staging environment including error cases. So I would like to be able to configure Mechanize to just return the page as is rather than raising an error so that I can assert that a 400 status was returned as expected.

Something like:

raise ResponseCodeError.new(page), "Unhandled response", caller if raise_errors?
page

where Mechanize.raise_errors? would default to true.
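
A minimal sketch of how such a switch might behave is shown below. Everything in it is hypothetical: FakePage, FakeAgent, the local ResponseCodeError class and the raise_errors accessor are stand-ins for illustration and are not Mechanize's API.

```ruby
# Hypothetical stand-ins; none of these names come from Mechanize itself.
FakePage = Struct.new(:code, :body)

class ResponseCodeError < StandardError
  attr_reader :page

  def initialize(page)
    @page = page
    super("Unhandled response: #{page.code}")
  end
end

class FakeAgent
  attr_accessor :raise_errors

  def initialize
    @raise_errors = true   # default preserves the current behaviour
  end

  # Return the page, raising for 4xx/5xx codes only when raise_errors is set.
  def handle_response(page)
    raise ResponseCodeError.new(page) if raise_errors && page.code.to_i >= 400
    page
  end
end

agent = FakeAgent.new
agent.raise_errors = false
page = agent.handle_response(FakePage.new("400", "Bad Request"))
# page.code can now be asserted on directly in a test
```

Keeping the page on the exception object means callers who leave the default in place can still inspect the response after rescuing the error.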

Segmentation fault when downloading (0.9.3)

/Library/Ruby/Gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize/chain/response_reader.rb:18: [BUG] Segmentation fault
ruby 1.8.7 (2008-08-11 patchlevel 72) [universal-darwin10.0]

will gather further details

Digest: uri mismatch with apache (when query string present)

Apache complains about uri mismatch when fetching page with some query params:

Digest: uri mismatch - </some/page> does not match request-uri </some/page?some_param=some_value>

AuthHeaders#gen_auth_header should use full uri.request_uri not uri.path
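
The difference between the two accessors can be seen with Ruby's standard URI class. The example.com URL below is a placeholder built from the path and query in the Apache message.

```ruby
require "uri"

uri = URI.parse("http://example.com/some/page?some_param=some_value")

path_only = uri.path          # "/some/page"
full      = uri.request_uri   # "/some/page?some_param=some_value"
```

The Apache message above compares exactly these two strings, which is why a digest header built from the path alone is rejected once a query string is present.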

File Upload to Java JSP page fails with "Malformed line after disposition: …"

It's choking on the "Content-Transfer-Encoding" after the "Content-Disposition" line. It's expecting the "Content-Type" line to be directly after the "Content-Disposition" line.

Source for the java parser clearly shows the root issue:
http://www-inf.int-evry.fr/~meunier/ProjetsIG3/TtImagesRep/node78.html

Note how extractContentType is directly after all the extractDispositionInfo handling.

Access to request and response objects

It would be very useful to be able to get full access to the request and response objects after a navigation. For example, to get the response code (2xx, 3xx) after the navigation when there is no error would be very handy. Additionally, all header information, encoding, response length, etc.

problem with using OpenSSL once mechanize has been started up

Hey there,

Whenever i use OpenSSL after opening up a mechanize agent I get the following error:

OpenSSL::SSL::SSLError: SSL_read:: no start line

method sysread in protocol.rb at line 133
method rbuf_fill in protocol.rb at line 133
method timeout in timeout.rb at line 62
method timeout in timeout.rb at line 93
method rbuf_fill in protocol.rb at line 132
method read in protocol.rb at line 86
method read_chunked in http.rb at line 2232
method read_body_0 in http.rb at line 2207
method read_body in http.rb at line 2173
method handle in response_reader.rb at line 15
method handle in chain.rb at line 25
method fetch_page in mechanize.rb at line 490
method request in http.rb at line 1053
method reading_body in http.rb at line 2136
method request in http.rb at line 1052
method fetch_page in mechanize.rb at line 485
method post_form in mechanize.rb at line 413
method submit in mechanize.rb at line 344
at top level in mechanize_problem.rb at line 17

nolan-evanss-macbook-pro:ssl_problem nolanevans$ gem which mechanize
(checking gem mechanize-0.9.3 for mechanize)
/Library/Ruby/Gems/1.8/gems/mechanize-0.9.3/lib/mechanize.rb
nolan-evanss-macbook-pro:ssl_problem nolanevans$ ruby --version
ruby 1.8.6 (2008-08-11 patchlevel 287) [universal-darwin9.0]

Has anyone run into a problem like this?

You can see sample code here:
http://gist.github.com/295276

Cheers,
-Nolan

Nokogiri does not receive the proper encoding if html_body =~ /<meta[^>]*charset[^>]*>/i

In this case the variable @encoding is cleared and Mechanize relies on the parsed page to detect the encoding. So Page#parser sends Nokogiri nil and no proper conversion takes place.
For example, I get a page encoded as 'windows-1251' and it is not converted to UTF-8 as it should be.

Create a google group

It would be nice to have a google group where people who are trying to do things with mechanize could talk to each other.

CookieJar.load_cookiestxt is toooooo slow.

When I use CookieJar.load_cookiestxt to parse a dump of 3k+ records from the Chrome cookies database, it takes almost 36 minutes.

Is it necessary to call cleanup() every time add() is called?

Page#encoding= does not work when Nokogiri fails to parse multibytes

Mechanize does not re-parse the page when the argument to Page#encoding= equals the current Page#encoding.
With multibyte HTML in an encoding other than UTF-8/16/32, it sometimes happens that Nokogiri fails to parse the document and yet still reports the correct encoding.

It is OK

multibyte characters
...

It sometimes fails parsing (agent.page.parser.errors returns "Input is not proper UTF-8, indicate encoding" errors)

multibyte characters

...

Nokogiri::HTML::Document#encoding just returns the charset token from the meta tag's content attribute (if nothing is passed as the third argument to Nokogiri.parse).
So the fact that Nokogiri::HTML::Document#encoding matches the meta charset does not always mean that the HTML was parsed successfully.

#!ruby -Ku
# This script should be saved in UTF-8 and run under Ruby1.8
require 'rubygems'
require 'mechanize'
require 'kconv'

# "テスト" is "test" in Japanese, "tosjis" changes encoding into Japanese "Shift_JIS"
bad_html = <<HTML.tosjis
<html>
  <title>テスト</title>
  <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
</html>
HTML

class Mechanize
  def local(p)
    page = Page.new(URI.parse(p[:url].to_s), {'content-type' => 'text/html'}, p[:html]||p[:body], '200', self)
    add_to_history(page)
    page
  end
end

agent = Mechanize.new
agent.local(:url => 'http://example.com/', :body => bad_html)
puts "Nokogiri errors: #{agent.page.parser.errors}"
puts "title is same?: #{!!(agent.page.at('title').inner_text == 'テスト'.toutf8)}"

# lazy-re-parsing of Page#encoding= does not work because Page#encoding is already "Shift_JIS"
agent.page.encoding = 'Shift_JIS'
puts "title is same? (after Page#encoding=): #{!!(agent.page.at('title').inner_text == 'テスト'.toutf8)}"

# getting rid of the identity check
agent.page.instance_variable_set(:@parser, nil)
agent.page.encoding = 'Shift_JIS'
puts "title is same? (after @parser=nil): #{!!(agent.page.at('title').inner_text == 'テスト'.toutf8)}"

No links with Nokogiri

ver 0.9.2

>> agent = WWW::Mechanize.new
>> agent.get("http://lenta.ru/").links.size
=> 4
>> require 'hpricot'
=> []
>> WWW::Mechanize.html_parser = Hpricot
=> Hpricot
>> agent.get("http://lenta.ru/").links.size
=> 550

I have issues

boy do I have issues (2)

(ignore me... testing omnifocus-github)

follow_meta_refresh works if meta tag present on body

If follow_meta_refresh is turned on and the body of an HTML page contains a meta refresh tag,
mechanize tries to follow the link in it.
Mechanize 0.9.3 has this issue.

search() returning fewer results with Mechanize than with Nokogiri

I have found a case where searching for "a" elements yields fewer results with Mechanize than with Nokogiri for the same page.

Code to reproduce:
http://gist.github.com/473014

On my machine, this code prints:
Number of links using Nokogiri: 275
Number of links using mechanize: 128

I'm using mechanize 1.0.0 and nokogiri 1.4.1. Also, after seeing a similar old issue (http://bit.ly/9F1xp7), I installed libxml2 2.2.7, but that didn't change this result.

Hope I'm not missing something obvious.

error: can't modify frozen object

I use ruby 1.8.7 (2008-08-11 patchlevel 72) [i386-mswin32] and mechanize (0.9.3), nokogiri (1.3.3)

Connecting to a site with an invalid SSL certificate raises the following error.

2009.08.20 15:50:47:error: can't modify frozen object
xxxxxxxxxxx/lib/ruby/1.8/net/https.rb:138:in `verify_mode='
xxxxxxxxxxx/lib/ruby/1.8/net/https.rb:138:in `verify_mode='
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain/ssl_resolver.rb:26:in `handle'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain.rb:30:in `pass'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain/handler.rb:6:in `handle'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain/connection_resolver.rb:73:in `handle'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain.rb:30:in `pass'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain/handler.rb:6:in `handle'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain/request_resolver.rb:27:in `handle'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain.rb:30:in `pass'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain/handler.rb:6:in `handle'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain/parameter_resolver.rb:18:in `handle'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain.rb:30:in `pass'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain/handler.rb:6:in `handle'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain/uri_resolver.rb:72:in `handle'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize/chain.rb:25:in `handle'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize.rb:457:in `fetch_page'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize.rb:557:in `fetch_page'
xxxxxxxxxxx/gems/mechanize-0.9.3/lib/www/mechanize.rb:238:in `get'

can't login to amazon's affiliate page

Amazon's affiliate page: https://affiliate-program.amazon.com/

Trying to sign in to this page results in Amazon stating that you need to enable cookies and redirecting the client to an alternative form, but submitting that form also results in Amazon saying you need to enable cookies.

I'm not sure if Amazon has some mechanism to prevent Mechanize from working, but if it doesn't, this is an interesting case of a cookie bug.

options[:res_klass]

Where is that settable ?

Via

agent.get({:url => foo, :res_klass => X })

Doesn't seem to get set, or make any difference

Adding support for unusual 'Content-Encoding: none' header

Although I'm pretty sure it's not a valid header, I have seen this a couple of times on servers that I'm using Mechanize with. I think that it's caused by someone getting gzip encoding really badly wrong somehow.

Patch is very simple and would be nice to have in (if only for me!):

http://github.com/edmund/mechanize/commit/ece2a3cb80285523bde24ef5c33dd11bf6f004c0

wrong encoding of non-ASCII characters

Some web sites contain links with non-ASCII characters (for example German umlauts), especially in GET parameters.
Mechanize seems to break such links, making it impossible to follow them.
Below I put a small sinatra app demonstrating the problem. Accessing this with a normal browser and clicking on the link will result in a YES. Doing the same with mechanize results in a NO.
Is there a way to work around that?

require 'sinatra'

get '/' do
  '<a href="?query=müller">test</a>'+
  params.inspect+
  ((params["query"]=="müller") ? "YES":"NO")
end
#test{"query"=>"m\303\274ller"}YES (Firefox)
#test{"query"=>"m%..FE7%..FAB%..FAFller"}NO (Mechanize)

issue with meta-refresh URL causes invalid redirect

A request to http://www.rmfpc.com/ returns an HTML document with a meta refresh tag that looks like this:

<meta http-equiv="Refresh" content="0;URL=home-ruskin.htm" />

Mechanize is parsing the redirect as: 'http://www.rmfpc.comhome-ruskin.htm'

It appears the code that is responsible for this is on line 38 of mechanize/page/meta.rb
url = case url
      when nil, "" then uri.to_s
      when /^http/i then url
      else "http://#{uri.host}#{url}" # <<<<
      end

Not sure what the best way to handle this is. One option is to simply prepend a '/' to the parsed url if it doesn't already start with one, but that may break other requests.

Here is a test script reproducing the issue (this will emit timeout error on ruby 1.8.6):

agent = Mechanize.new
agent.follow_meta_refresh = true
res = agent.get('http://www.rmfpc.com')
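
An alternative to prepending a '/' by hand is to let Ruby's standard URI library resolve the refresh target against the page's own URI, which handles relative paths, absolute paths and full URLs alike. The sketch below is not the code in meta.rb; the helper name resolve_refresh is an assumption.

```ruby
require "uri"

# Hypothetical helper: resolve the URL from a meta refresh tag against the
# URI of the page that contained it.
def resolve_refresh(page_uri, refresh_url)
  return page_uri.to_s if refresh_url.nil? || refresh_url.empty?
  URI.join(page_uri.to_s, refresh_url).to_s
end

base     = URI.parse("http://www.rmfpc.com/")
relative = resolve_refresh(base, "home-ruskin.htm")
# relative is "http://www.rmfpc.com/home-ruskin.htm"
absolute = resolve_refresh(base, "http://example.com/other.htm")
# absolute is returned as given, "http://example.com/other.htm"
```

Resolving against the full page URI also keeps the scheme and port of the original request, which the string interpolation on line 38 drops.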

Finding elements by id

It would be nice if code like
form.field_with(:id => 'foo')
worked as expected. It's possible to use page.at(..) instead, but that is (a) counter-intuitive and (b) requires manual instantiation of the Field class.

If I'm not missing something obvious here, then I guess that something like this would have to be added to the relevant class:

class Mechanize::Form::Field
  def id
    ::Mechanize::Util.html_unescape(@node['id'])
  end
end

Mechanize sends "www.host"'s cookie to "wwwxhost"

Mechanize sends "www.host"'s cookie to "wwwxhost", because /www.host/ matches to "wwwxhost".

lib/mechanize/cookie_jar.rb, L34
domains = @jar.find_all { |domain, _|
url.host =~ /#{CookieJar.strip_port(domain)}$/i
}
to
domains = @jar.find_all { |domain, _|
url.host =~ /#{Regexp.escape(CookieJar.strip_port(domain))}$/i
}
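
The effect of the missing escape can be checked with plain regular expressions, independent of Mechanize and WebMock. The variable names below are illustrative only.

```ruby
cookie_domain = "www.host"

unescaped = /#{cookie_domain}$/i                  # "." matches any character
escaped   = /#{Regexp.escape(cookie_domain)}$/i   # "." matches only a dot

leaks       = !!("wwwxhost" =~ unescaped)   # true: cookie would be sent
blocked     = !("wwwxhost" =~ escaped)      # true: no match after escaping
still_works = !!("www.host" =~ escaped)     # true: the real host still matches
```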

This sample code requires WebMock gem.

require 'rubygems'
require 'webmock'
require 'mechanize'
require 'logger'

WebMock.stub_request(:get, 'http://www.host/').to_return(:headers => {'Set-Cookie'=>'NAME=WWW.HOST'})
WebMock.stub_request(:get, 'http://other/')
WebMock.stub_request(:get, 'http://wwwxhost/')

# www.host -> other. Mechanize does not send cookie to other.
Mechanize.new{|a|
  a.get('http://www.host/')
  sio = StringIO.new
  a.log = Logger.new(sio)
  a.get('http://other/')
  sio.rewind
  p sio.read.scan(/request-header: cookie.+?$/)
} 

# www.host -> wwwxhost. Mechanize sends cookie to wwwxhost.
Mechanize.new{|a|
  a.get('http://www.host/')
  sio = StringIO.new
  a.log = Logger.new(sio)
  a.get('http://wwwxhost/')
  sio.rewind
  p sio.read.scan(/request-header: cookie.+?$/)
}

result:

[]
["request-header: cookie => NAME=WWW.HOST"]

No gemspec in project means can't use repository from Bundler

If I want to use the latest changes made to master using Bundler, I can't because there is no .gemspec in the repository.

So, I'm requesting that the generated gemspec from the Rakefile be committed to the repository and maintained. A fork and commit will be produced momentarily.

There is no tag for 0.9.3 release

While there are tags for releases up to 0.9.2, there is no tag for 0.9.3 and that makes it difficult to download a tarball of the already-released code.

Form submit doesn't behave correctly, when form action returns results on same page.

Hi,

I am stumped on how to proceed with this problem.

So I am building an application to scrape data off of the website, biblegateway.com, to get Bible passages, that I can then export the retrieved data to a file.

So I am just trying to get the behavior correct before I write the Ruby script.

So here is what I do:

I fire up an irb console.

irb

I'll declare the required Ruby libraries

require 'rubygems'
require 'mechanize'

I'll then create a new object of the Mechanize class.

agent = Mechanize.new # Calling WWW::Mechanize.new throws a warning message

I'll then tell it what page to scrape.

agent.get("http://www.biblegateway.com/passage")

I'll then tell it to use the last form on the page, which is the one I am working with.

form = agent.page.forms.last

I'll then find the name of the fields, and set their values

form.search1 = "John 3:16"
form.version1 = "NKJV"

That is all the options needed to get the results, so then submit the form.

form.submit

Now technically speaking, the form does in fact submit. That's not the problem. The problem is, Mechanize is designed to render the results from a new page to a new Mechanize::Page object.

But how they have their website setup, the same page is rendered with the results then loaded on the page, and it uses a get method instead of a post method, and the URL ends up looking like:
http://www.biblegateway.com/passage/?search=John%203:16&version=NKJV

So, what I need to know is, what do I need to do to render the same page in a "get fashion" so to speak? The documentation is very difficult to pick apart, and I haven't had much luck with Google...

Thank you in advance for the help.
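
For reference, the GET request that this form produces can be reproduced with the standard library alone. The sketch below only builds the URL and is not a statement about how Mechanize encodes forms internally. As far as the Mechanize API goes, form.submit returns the page it fetched, so the result can usually be captured with page = form.submit and read from there.

```ruby
require "uri"

# Build the query string for the two fields used in the walkthrough above.
query = URI.encode_www_form("search" => "John 3:16", "version" => "NKJV")
url   = "http://www.biblegateway.com/passage/?#{query}"
# url is "http://www.biblegateway.com/passage/?search=John+3%3A16&version=NKJV"
```

The space is encoded as "+" and the colon as "%3A" here; both are valid form encodings of the same values shown in the URL above.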

can't login to linkedin

Hi, I'm having issues logging in to LinkedIn using mechanize, and it seems to me it has to do with cookie management. The following code returns the home page, but with my user not signed in:

mech = WWW::Mechanize.new
mech.follow_meta_refresh = true
home_page = mech.get('http://www.linkedin.com')
sign_in_link = home_page.links.find{|link| link.text == "Sign In"}
login_form = sign_in_link.click.form('login')
# with email and password variables properly set
login_form.set_fields(:session_key => email, :session_password => password)
return_page = mech.submit(login_form, login_form.buttons.first)

When I use curl to login to LinkedIn and load the cookies in the cookie jar (mech.cookie_jar), it works fine.

"[]= called on nil object" on @jar[normal_domain][cookie.path][cookie.name] = cookie

On the June 24th commit, cookie_jar.rb file. http://github.com/tenderlove/mechanize/blob/0cb78c906ea339e16829636be47fe35b5dc8be6f/lib/www/mechanize/cookie_jar.rb

On the #add method
  def add(uri, cookie)
    return unless uri.host =~ /#{CookieJar.strip_port(cookie.domain)}$/i

    normal_domain = cookie.domain.downcase

    unless @jar.has_key?(normal_domain)
      @jar[normal_domain] = Hash.new { |h,k| h[k] = {} }
    end

    @jar[normal_domain][cookie.path][cookie.name] = cookie
    cleanup
    cookie
  end

I sometimes get an error on
@jar[normal_domain][cookie.path][cookie.name] = cookie
complaining about "[]= called on nil object".

I suspect that when a cookie_jar is loaded from YAML, the default value for @jar[normal_domain] is not being correctly set. I'm currently adding
@jar[normal_domain][cookie.path] ||= {}
@jar[normal_domain][cookie.path][cookie.name] = cookie
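
The guard can be illustrated with plain nested hashes. The sketch below does not verify the suspicion about YAML loading; it only shows that the ||= assignments make the write safe whether or not the inner hash has a default block. The add_cookie helper and the sample values are hypothetical.

```ruby
# Hypothetical, simplified jar: plain nested hashes, as they might look if the
# default block of the inner hash were lost.
def add_cookie(jar, domain, path, name, value)
  jar[domain] ||= {}          # create the domain level if missing
  jar[domain][path] ||= {}    # create the path level if missing
  jar[domain][path][name] = value
end

jar = { "example.com" => {} }   # inner hash has no default block
add_cookie(jar, "example.com", "/", "session", "abc123")
stored = jar["example.com"]["/"]["session"]   # "abc123"
```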

Gem doesn't build from source

This problem is easily solved by editing the gemspec file and removing all references to the subdirectory 'www'

No default html_parser when subclassing

I tried to subclass WWW::Mechanize, but got errors that the parser was not defined. It turned out that @html_parser has no default in a subclass, probably because it is not set in the initialize method. So I had to define the parser myself, but would have expected it to use the default one.
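
One plausible cause, assuming the default parser is held in a class-level instance variable (the report does not confirm this), is that such variables are not shared with subclasses. The sketch below uses made-up classes (`PlainAgent`, `HookedAgent`) and a symbol in place of a real parser to show the behavior and one possible remedy through the `inherited` hook.

```ruby
# Stand-in for a class that keeps its default parser in a class-level
# instance variable. All names here are hypothetical.
class PlainAgent
  @html_parser = :default_parser
  class << self
    attr_accessor :html_parser
  end
end

class PlainChild < PlainAgent; end

# Same idea, but each new subclass receives a copy of the parent's setting.
class HookedAgent
  @html_parser = :default_parser
  class << self
    attr_accessor :html_parser

    def inherited(child)
      super
      child.html_parser = html_parser
    end
  end
end

class HookedChild < HookedAgent; end

plain_result  = PlainChild.html_parser   # the subclass has its own, unset variable
hooked_result = HookedChild.html_parser  # copied from the parent by the hook
```

Setting the default inside `initialize`, as the report suggests, would be another way to get the same effect per instance.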

Feature request: Allow :verb to be set as an option for Mechanize#get

So this is a bit perverse, I know, but . . .

The Varnish caching proxy allows for a pseudo HTTP method, "PURGE" -- Here's the description:

http://varnish-cache.org/wiki/FAQ#HowcanIforcearefreshonaobjectcachedbyvarnish

AND

http://varnish-cache.org/wiki/VCLExamplePurging

The big trick here is that the PURGE request has to be identical to the matching GET: The same headers, etc. So piggybacking on the existing functionality for GET would be convenient.

I've written some code using Mechanize to verify our expectations for cache hits/misses based on our rules, and the first thing I want to do is PURGE this way.

At present, I monkeypatch both Mechanize and Net::HTTP -- In Mechanize, I change the fetch_page in #get to . . .

  page = fetch_page(  :uri      => url,
                      :referer  => referer,
                      :headers  => headers || {},
                      :verb     => options[:verb] || :get,
                      :params   => parameters
                   )

And then for Net::HTTP I add a new class for Purge.
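
For readers unfamiliar with that second step, here is a minimal sketch of what such a request class can look like. It is an illustration rather than the code from this report: the class is defined at the top level instead of inside Net::HTTP, and the path and Host header are placeholders.

```ruby
require 'net/http'

# A request class for the non-standard PURGE method. Net::HTTPRequest
# reads these three constants when the request object is built.
class Purge < Net::HTTPRequest
  METHOD = 'PURGE'
  REQUEST_HAS_BODY = false
  RESPONSE_HAS_BODY = true
end

# Placeholder path and header; a real purge would reuse the GET's headers.
request = Purge.new('/some/cached/path', 'Host' => 'www.example.com')
verb = request.method
```

Such a request could then be sent with `Net::HTTP#request` against the Varnish host.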

The upshot of all this is that it would be awfully nice if Mechanize would allow the option to be set for the :verb on get. Then the only thing I'd have to monkeypatch is Net::HTTP.

Or . . . the get method could be streamlined to make it easier to monkeypatch for this little thing. E.g., the fetch_page could be isolated in its own method.

Gem lacks tc_bad_charset.html and tc_charset.html files, needed for testsuite

Error:
test_encoding_override_after_parser_was_initialized(TestPage):
Errno::ENOENT: No such file or directory - /var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/test/htdocs/tc_bad_charset.html
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/test/helper.rb:82:in `initialize'
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/test/helper.rb:82:in `open'
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/test/helper.rb:82:in `request'
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/lib/www/mechanize.rb:485:in `fetch_page'
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/lib/www/mechanize.rb:238:in `get'
./test/test_page.rb:52:in `test_encoding_override_after_parser_was_initialized'
Error:
test_encoding_override_before_parser_initialized(TestPage):
Errno::ENOENT: No such file or directory - /var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/test/htdocs/tc_bad_charset.html
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/test/helper.rb:82:in `initialize'
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/test/helper.rb:82:in `open'
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/test/helper.rb:82:in `request'
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/lib/www/mechanize.rb:485:in `fetch_page'
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/lib/www/mechanize.rb:238:in `get'
./test/test_page.rb:44:in `test_encoding_override_before_parser_initialized'
Error:
test_page_gets_charset_from_page(TestPage):
Errno::ENOENT: No such file or directory - /var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/test/htdocs/tc_charset.html
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/test/helper.rb:82:in `initialize'
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/test/helper.rb:82:in `open'
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/test/helper.rb:82:in `request'
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/lib/www/mechanize.rb:485:in `fetch_page'
/var/tmp/portage/dev-ruby/mechanize-0.9.3-r1/work/all/mechanize-0.9.3/lib/www/mechanize.rb:238:in `get'
./test/test_page.rb:11:in `test_page_gets_charset_from_page'

:first-child not working when selecting on .class

require 'rubygems'
require 'mechanize'

agent=Mechanize.new
agent.user_agent='wget' # default user-agent string is banned apparently
page=agent.get('http://en.wikipedia.org/wiki/Comparison_of_file_archivers')

page.search('table.wikitable:first-child') #does not work
page.search('.wikitable:first-child')      #does not work
page.search('table:first-child')           #works - but it's not a .wikitable table
page.search('table.wikitable')             #works - returns 5 tables

The error is:

> page.search('table.wikitable:first-child')
RuntimeError: xmlXPathCompOpEval: function first-child not found
from /usr/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/node.rb:142:in `evaluate'
from /usr/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/node.rb:142:in `xpath'
from /usr/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/node.rb:139:in `map'
from /usr/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/node.rb:139:in `xpath'
from /usr/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/node.rb:106:in `search'
from (irb):119
from /usr/local/lib/site_ruby/1.8/rubygems.rb:168

Using Mechanize 1.0.0.

first form field for a given name should win

It seems that if you have multiple form fields with the same name, Mechanize does not Do The Right Thing when submitting the form.

According to the Rails API docs, the HTML spec says that the first field encountered with a given name should "win" and subsequent fields with that name should be ignored.

The result is that given HTML like this, the form submitted should only contain the checkbox, not the hidden field.

I've written a failing test case using that HTML, you can see it here. Both are contained in the commit at bleything/mechanize@60bbf4c6ebde41a3326ca4ed7a54b60229c8fcb4, for which I've sent a pull request :)

meta.rb:38 ignore uri.port

"http://#{uri.host}#{url}"

I think "#{uri.to_s}#{url}" is correct.
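
To make the port problem concrete, here is a small sketch using only Ruby's standard `uri` library. The base URL and target path are invented, and `URI.join` is shown as one possible alternative, not as the fix that was applied to meta.rb. Note that `uri.to_s` also carries the current path, so plain concatenation with it only works when the base has no path.

```ruby
require 'uri'

# Invented example values: a page served on a non-default port that
# redirects to an absolute path.
base   = URI.parse('http://example.com:8080/articles/index.html')
target = '/login'

# Pattern from the report: the port is silently dropped.
without_port = "http://#{base.host}#{target}"

# Keeping the port explicitly.
with_port = "#{base.scheme}://#{base.host}:#{base.port}#{target}"

# Letting the standard library resolve the reference against the base.
joined = URI.join(base.to_s, target).to_s
```

`URI.join` also handles relative targets such as `next.html`, which string interpolation does not.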

requesting a options_with (working as fields_with)

Why is there a form.fields_with(:name => "blah etc")

and not a

form.fields.first.options_with(:value => "the right value") ?

Bit confusing that there's only an ".options" that is an ordinary Array.
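
Since ".options" is an ordinary Array, the usual Enumerable methods can stand in until such a method exists. The sketch below is an illustration only: `Option` is a made-up Struct in place of the real option objects, and `options_matching` is a hypothetical helper, not part of Mechanize.

```ruby
# Made-up stand-in for select-list options, with just two attributes.
Option = Struct.new(:value, :text)

options = [
  Option.new('1', 'One'),
  Option.new('2', 'Two'),
  Option.new('2', 'Deux')
]

# Hypothetical helper: keep options whose attributes match every criterion.
# Using === lets a criterion be either an exact string or a Regexp.
def options_matching(options, criteria)
  options.select do |option|
    criteria.all? { |attr, expected| expected === option.public_send(attr) }
  end
end

by_value = options_matching(options, value: '2')
by_text  = options_matching(options, text: /^O/)
```

With real option objects, the same `select` call could be applied directly to the array returned by ".options".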

post_form not sending parameters / nagios / webrat

This is not a bug or anything - this is me doing something stupid I'm pretty sure ...

I just upgraded to

list of dependencies for a cucumber-nagios project
gem "bundler", " 0.9.1.pre1"
gem "cucumber", "0.5.2"
gem "rspec", "1.2.9"
gem "webrat", "0.7.1"
gem "mechanize", "0.9.3"
gem "templater", "1.0"
gem "net-ssh", "2.0.17"

Now,

/Library/Ruby/Gems/1.8/gems/mechanize-0.9.3/lib/www/mechanize.rb:413:in `post_form'

is not posting form parameters.

I'm running this before my features (was working before upgrade)

Before do
  visit TESTING_SERVER
  fill_in('myuser_login', :with => "my.name")
  fill_in('myuser_password', :with => "xxxxxxx")
  puts response.body
  click_button('Submit')
end

When I look at the Rails logs of the app I'm testing, no params are sent (only the action and controller are shown). I can see from the response body that I'm on the right page and that the fields are present, but they are not set and they are not sent. Even when I add another field that is already populated, the form fields are not posted.

Anyway, if anyone can unstupid me on this, I'd appreciate it :P

Adding javascript parsing to mechanize

It would be so cool if mechanize could click on Javascript links. Do you have any thoughts on this and how hard it would be to do? Between Rhino and a bunch of other tools, it seems like all the necessary building blocks are there but they just need to be put together.

Form.build_query fails when page has no encoding

The fix appears to be at form.rb, line 150: change "if page" to "if page && page.encoding" in

  def from_native_charset(str, enc=nil)
    if page
      enc ||= page.encoding                               
      Util.from_native_charset(str,enc)
    else
      str
    end
  end
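
With the proposed change applied, the method would look roughly like the sketch below. This is a self-contained approximation: `Page` is a made-up Struct, the page is passed as an argument instead of being read from the form, and `String#encode` stands in for `Util.from_native_charset`, whose behavior is not shown in this report.

```ruby
# Made-up stand-in for a page object that may or may not know its encoding.
Page = Struct.new(:encoding)

# Convert only when there is a page AND it reports an encoding;
# otherwise hand the string back unchanged.
def from_native_charset(str, page, enc = nil)
  if page && page.encoding
    enc ||= page.encoding
    str.encode(enc) # stand-in for Util.from_native_charset(str, enc)
  else
    str
  end
end

unchanged = from_native_charset("caf\u00e9", Page.new(nil))
converted = from_native_charset("caf\u00e9", Page.new('ISO-8859-1'))
```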

Mechanize currently errors out if given Content-Encoding: 7bit

Patch: http://gist.github.com/135872

ImageButton sends its name and value

When HTML is:

Mechanize sends "imgbtn=&imgbtn.x=0&imgbtn.y=0" to the server.
Internet Explorer, Firefox, and w3m send "imgbtn.x=0&imgbtn.y=0".
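
The two query strings can be reproduced with Ruby's standard library. This is only a sketch of the difference, not Mechanize's form code: `image_button_pairs` and its `include_name` flag are hypothetical.

```ruby
require 'uri'

# Hypothetical helper: build the name/value pairs for a clicked image
# button. include_name: true mimics the behavior reported above.
def image_button_pairs(name, x, y, include_name: false)
  pairs = []
  pairs << [name, ''] if include_name
  pairs << ["#{name}.x", x] << ["#{name}.y", y]
end

reported = URI.encode_www_form(image_button_pairs('imgbtn', 0, 0, include_name: true))
browsers = URI.encode_www_form(image_button_pairs('imgbtn', 0, 0))
```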

pretty_print exception on classes inherited from WWW::Mechanize

There seems to be a bug when using pretty print on responses from request methods, but only for classes inherited from WWW::Mechanize.

The following example causes an exception to be raised, and demonstrates that it only occurs
on inherited classes.

require 'mechanize'
class TestMech < WWW::Mechanize; end
p WWW::Mechanize.new.get('http://www.example.com') # This works just fine...
p TestMech.new.get('http://www.example.com') # but not this.

Here it is in IRB, along with the first few lines of the backtrace.

irb(main):001:0> require 'mechanize'
=> true
irb(main):002:0> class TestMech < WWW::Mechanize;end
=> nil
irb(main):003:0> x = TestMech.new.get 'http://www.google.com'
NoMethodError: undefined method `parse' for nil:NilClass
from /Library/Ruby/Gems/1.8/gems/tenderlove-mechanize-0.9.3.20090623142847/lib/www/mechanize/page.rb:82:in `parser'
from /Library/Ruby/Gems/1.8/gems/tenderlove-mechanize-0.9.3.20090623142847/lib/www/mechanize/page.rb:142:in `meta'
from /Library/Ruby/Gems/1.8/gems/tenderlove-mechanize-0.9.3.20090623142847/lib/www/mechanize/inspect.rb:22:in `pretty_print'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/prettyprint.rb:201:in `group'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/prettyprint.rb:227:in `nest'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/prettyprint.rb:200:in `group'

Form::MultiSelectList does not have "options_with"

There are "option_with" and "options_with" in Mechanize::Form::SelectList.
Mechanize::Form::MultiSelectList has no option-selection methods now.