Comments (8)
Thanks, I'll add this feature to the to-do list for a new version.
from kimuraframework.
Hello @dccmmtop! You are right, a config option like @encoding should be added.
Your examples are good, but setting a custom encoding should be optional, because in most cases pages are parsed correctly without an explicit encoding.
I would also like to add an "auto" mode, where Kimurai tries to recognize the correct encoding automatically. The encoding is usually declared in meta tags such as <meta http-equiv="Content-Type"> or <meta charset> (https://www.w3schools.com/html/html_charset.asp).
I have a working regex (from one of my recent projects) that correctly parses the encoding in both cases above:
resp_string = response.body
charset = resp_string.force_encoding("ISO-8859-1").encode("UTF-8")[/<meta.*?charset=["]?([\w+\d+\-]*)/i, 1]
Nokogiri::HTML(resp_string, nil, charset)
So the method current_response can be modified to:
# Works with one of:
@config = {
  encoding: nil        # do not handle encoding at all (current behavior)
  # encoding: :auto    # try to detect the encoding automatically
  # encoding: "GB2312" # set the required encoding manually
}
###
def current_response(response_type = :html)
  case response_type
  when :html
    if encoding = @config[:encoding]
      if encoding == :auto
        charset = body.force_encoding("ISO-8859-1").encode("UTF-8")[/<meta.*?charset=["]?([\w+\d+\-]*)/i, 1]
        Nokogiri::HTML(body, nil, charset)
      else
        Nokogiri::HTML(body, nil, encoding)
      end
    else
      Nokogiri::HTML(body)
    end
  when :json
    JSON.parse(body)
  end
end
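The three settings above (nil, :auto, or an explicit name) can also be resolved in one small standalone step. This is only a rough sketch, not Kimurai's actual code: resolve_encoding and META_CHARSET are hypothetical names, and the regex is a slightly stricter variant of the one above that also accepts single quotes. It assumes the charset appears on the same line as its meta tag.

```ruby
# Hypothetical helper, not part of Kimurai: decide which encoding name to
# pass to the HTML parser for a given setting (nil, :auto, or a name).
META_CHARSET = /<meta[^>]*?charset\s*=\s*["']?([\w-]+)/i

def resolve_encoding(body, setting)
  return nil if setting.nil?                    # do not handle encoding at all
  return setting.to_s unless setting == :auto   # encoding set manually

  # String#b returns a binary copy, so the caller's string is not mutated
  # and bytes that are invalid in its current encoding cannot break matching.
  body.b[META_CHARSET, 1]                       # nil when no charset is declared
end
```

The returned value (possibly nil) could then be passed as the third argument of Nokogiri::HTML, in the same way as charset in the method above.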
I'll try to add this feature today and release a new version. Thanks!
@dccmmtop, I added the config option encoding. It's in master now: 96fe695.
Can you please check both cases, :auto and a custom encoding?
@config = {
  encoding: nil        # do not handle encoding at all (current behavior)
  # encoding: :auto    # try to detect the encoding automatically
  # encoding: "GB2312" # set the required encoding manually
}
To use the Kimurai version from master, add it to your Gemfile this way:
gem 'kimurai', git: 'https://github.com/vifreefly/kimuraframework'
I've tested both :auto and a custom encoding and found no errors.
:auto is a good approach, and it works in most cases. But some pages are actually encoded differently from what they declare in the head.
For example, in the following case I can only get the right content by using GBK. I think this is the fault of the website's developers.
@dccmmtop
Can you please clarify where the problem with the :auto method is? As I said, it can handle both cases. Here is an example:
def fetch_encoding(html_doc_string)
  html_doc_string.force_encoding("ISO-8859-1").encode("UTF-8")[/<meta.*?charset=["]?([\w+\d+\-]*)/i, 1]
end
###
example_1 = '
<!DOCTYPE html>
<html>
<head>
<title>Hello World!</title>
<meta http-equiv="content-type" content="text/html; charset=GB2312">
</head>
<body>
<h1>Hello World!</h1>
</body>
</html>
'
puts fetch_encoding(example_1)
# => GB2312
###
example_2 = '
<!DOCTYPE html>
<html>
<head>
<title>Hello World!</title>
<meta charset="GB2312">
</head>
<body>
<h1>Hello World!</h1>
</body>
</html>
'
puts fetch_encoding(example_2)
# => GB2312
Or do you mean something different?
@vifreefly Sorry for the misunderstanding.
:auto has no errors and works normally.
What I mean is that the actual encoding of a web page can differ from what it declares in its <head>.
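One possible way to notice such a mismatch is to check whether the raw bytes are even valid in the declared encoding and, if not, try a short list of fallbacks. This is just a sketch using Ruby's standard String methods: first_valid_encoding is a hypothetical helper, not something Kimurai provides, and the check is only a heuristic, since many byte sequences are valid in more than one encoding.

```ruby
# Hypothetical helper, not part of Kimurai. Returns the first candidate
# encoding name under which the raw bytes form a valid string, or nil.
def first_valid_encoding(bytes, candidates)
  candidates.compact.find do |name|
    begin
      # Work on a copy so the caller's string keeps its own encoding.
      bytes.dup.force_encoding(name).valid_encoding?
    rescue ArgumentError
      false # Ruby does not know this encoding name
    end
  end
end

# The bytes D6 D0 CE C4 are valid GBK but not valid UTF-8.
gbk_bytes = "\xD6\xD0\xCE\xC4".b
puts first_valid_encoding(gbk_bytes, ["UTF-8", "GBK"])
# => GBK
```

Here the declared encoding would go first in the candidates list, followed by whatever fallbacks make sense for the site being scraped.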
@dccmmtop
Thanks, now I see what you meant :)
Pages on the same website can use different encodings, but @config is global.
Could a separate encoding be specified per URL?
Example:
request_to(:parse_content, url: link, encoding: 'GBK')
For now, this is how I work around it:
@config = {
  before_request: { delay: 1..3 },
  encoding: 'utf-8'
}
def parse(response, url:, data: {})
  topics = JSON.parse(response.xpath("//p").text[/(\[.+\])/, 1])
  topics.each do |topic|
    link = topic["url"].strip
    self.class.config[:encoding] = "GBK"
    request_to(:parse_content, url: link)
    self.class.config[:encoding] = "utf-8"
  end
end
This approach is not good, because it has to change the global config around every request.
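Until a per-request option exists, the set-and-reset pattern above could at least be wrapped so that the old value is always restored, even if the request raises. This is a minimal sketch only: with_config is a hypothetical helper, not part of Kimurai, and it assumes request_to finishes before the block returns, as the workaround above already does. It is also not thread-safe, since the config is still shared.

```ruby
# Hypothetical helper, not part of Kimurai: temporarily override one key
# of a config hash and restore the previous state afterwards.
def with_config(config, key, value)
  had_key = config.key?(key)
  previous = config[key]
  config[key] = value
  yield
ensure
  # Runs on normal return and on exceptions alike.
  if had_key
    config[key] = previous
  else
    config.delete(key)
  end
end

config = { encoding: "utf-8" }
with_config(config, :encoding, "GBK") do
  puts config[:encoding]
  # => GBK
end
puts config[:encoding]
# => utf-8
```

Inside a spider, the block body would be the request_to(:parse_content, url: link) call, with self.class.config passed as the hash.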