How to set encoding? (kimuraframework issue, closed)

vifreefly commented on May 17, 2024
How to set encoding?

from kimuraframework.

Comments (8)

vifreefly commented on May 17, 2024

Thanks, I'll add this feature to the ToDo list for a new version

vifreefly commented on May 17, 2024

Hello @dccmmtop! You are right, a config option such as @encoding should be added.

Your examples are good, but setting a custom encoding should be optional, because in most cases pages are parsed correctly without an explicit encoding.

I would also like to add an "auto" mode, where Kimurai tries to detect the correct encoding automatically. The encoding is usually declared in a meta tag such as <meta http-equiv="Content-Type"> or <meta charset> (https://www.w3schools.com/html/html_charset.asp).

I have a working regex (from one of my recent projects) which correctly parses the encoding in both cases above:

    resp_string = response.body
    # dup so that force_encoding does not retag the original response string
    charset = resp_string.dup.force_encoding("ISO-8859-1").encode("UTF-8")[/<meta.*?charset=["]?([\w\-]*)/i, 1]
    Nokogiri::HTML(resp_string, nil, charset)

So the method current_response can be modified to:

# Works with:

# Pick exactly one of the three values:
@config = {
  encoding: nil         # do not handle encoding at all (current behavior)
  # encoding: :auto     # or: try to detect the encoding automatically
  # encoding: "GB2312"  # or: set the required encoding manually
}

###

def current_response(response_type = :html)
  case response_type
  when :html
    if encoding = @config[:encoding]
      if encoding == :auto
        # dup so that force_encoding does not retag the body string itself
        charset = body.dup.force_encoding("ISO-8859-1").encode("UTF-8")[/<meta.*?charset=["]?([\w\-]*)/i, 1]
        Nokogiri::HTML(body, nil, charset)
      else
        Nokogiri::HTML(body, nil, encoding)
      end
    else
      Nokogiri::HTML(body)
    end
  when :json
    JSON.parse(body)
  end
end
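
For readers who want to try the detection step on its own, here is a minimal standalone sketch of the "auto" idea described above. The helper name declared_charset, the 4096-byte scan limit, and the check against Ruby's Encoding.find are assumptions made for illustration; they are not part of Kimurai, and the set of names Ruby knows may differ from what Nokogiri accepts.

```ruby
# Hypothetical helper (not part of Kimurai): returns the charset name declared
# in a <meta> tag as an upper-case string, or nil if none is found or if Ruby
# does not recognize the name.
def declared_charset(html)
  # byteslice returns a new string, so retagging it as binary leaves the
  # caller's string untouched. Only the start of the document is scanned.
  head = html.byteslice(0, 4096).force_encoding(Encoding::BINARY)
  name = head[/<meta[^>]*?charset\s*=\s*["']?([\w\-]+)/i, 1]
  return nil unless name

  Encoding.find(name) # raises ArgumentError for unknown encoding names
  name.upcase
rescue ArgumentError
  nil
end

puts declared_charset('<meta http-equiv="content-type" content="text/html; charset=GB2312">')
# => GB2312
puts declared_charset('<meta charset="utf-8">')
# => UTF-8
p declared_charset('<meta charset="no-such-charset">')
# => nil
```

Returning nil for a missing or unknown name lets the caller fall back to plain Nokogiri::HTML(body), which matches the encoding: nil branch above.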

I'll try to add this feature today and release a new version. Thanks!

vifreefly commented on May 17, 2024

@dccmmtop, I added the encoding config option. It is in master now: 96fe695.

Can you please check both cases, :auto and a custom encoding?

# Pick exactly one of the three values:
@config = {
  encoding: nil         # do not handle encoding at all (current behavior)
  # encoding: :auto     # or: try to detect the encoding automatically
  # encoding: "GB2312"  # or: set the required encoding manually
}

To use the Kimurai version from master, add it to your Gemfile like this:

gem 'kimurai', git: 'https://github.com/vifreefly/kimuraframework'

dccmmtop commented on May 17, 2024

I've tested both :auto and a custom encoding and found no errors.

:auto is a good approach, and it works in most cases. But some pages are actually encoded differently from what they declare in the head.

For example, in the following case I only get the correct content when I use GBK. I think this is the fault of the website's developers.

image

vifreefly commented on May 17, 2024

@dccmmtop
Can you please clarify where the problem with the :auto method is? As I said, it can handle two cases. Here is an example:

def fetch_encoding(html_doc_string)
  # dup so that force_encoding does not retag the caller's string
  html_doc_string.dup.force_encoding("ISO-8859-1").encode("UTF-8")[/<meta.*?charset=["]?([\w\-]*)/i, 1]
end

###

example_1 = '
  <!DOCTYPE html>
  <html>
    <head>
      <title>Hello World!</title>
      <meta http-equiv="content-type" content="text/html; charset=GB2312">
    </head>
    <body>
      <h1>Hello World!</h1>
    </body>
  </html>
'

puts fetch_encoding(example_1)
# => GB2312

###

example_2 = '
  <!DOCTYPE html>
  <html>
    <head>
      <title>Hello World!</title>
      <meta charset="GB2312">
    </head>
    <body>
      <h1>Hello World!</h1>
    </body>
  </html>
'

puts fetch_encoding(example_2)
# => GB2312

Or do you mean something different?

dccmmtop commented on May 17, 2024

@vifreefly Sorry for the misunderstanding.

:auto has no errors and works normally.
What I mean is that the actual encoding of some web pages differs from the one they declare in <head>.
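
To illustrate the mismatch described here, the sketch below checks in which candidate encodings the raw bytes are actually a valid byte sequence. The helper name first_valid_encoding and the candidate list are assumptions for illustration only; this is a heuristic, not something Kimurai does.

```ruby
# Hypothetical helper: returns the first candidate encoding name in which the
# raw bytes form a valid byte sequence, or nil if none of them fits.
def first_valid_encoding(bytes, candidates)
  candidates.find do |name|
    # dup so that the caller's string keeps its original encoding tag
    bytes.dup.force_encoding(name).valid_encoding?
  end
end

# The two characters "中文" encoded as GBK, as they might arrive from a page
# whose <head> wrongly declares UTF-8.
gbk_bytes = [0xD6, 0xD0, 0xCE, 0xC4].pack("C*")

puts first_valid_encoding(gbk_bytes, ["UTF-8", "GBK"])
# => GBK
```

Validity is only a necessary condition: single-byte encodings such as ISO-8859-1 accept any byte sequence, so the order of the candidates matters and the declared encoding should usually be tried first.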

vifreefly commented on May 17, 2024

@dccmmtop
Thanks, now I see what you meant :)

dccmmtop commented on May 17, 2024

Different pages of the same website can use different encodings, but @config is global.

Could it be possible to specify a separate encoding for a single URL?
Example:

request_to(:parse_content, url: link, encoding: 'GBK')

This is how I work around it for now:

    @config = {
      before_request: { delay: 1..3 },
      encoding: 'utf-8'
    }

    def parse(response,url:,data:{})
      topics = JSON.parse(response.xpath("//p").text[/(\[.+\])/,1])
      topics.each do |topic|
        link = topic["url"].strip
        self.class.config[:encoding] = "GBK"
        request_to(:parse_content, url: link)
        self.class.config[:encoding] = "utf-8"
      end
    end

This workaround is not good.
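
Until a per-request option exists, the set-and-reset pattern above could at least be wrapped so that the global value is always restored, even when the request raises. The helper with_config_override below is hypothetical and not part of Kimurai; it is a plain-Ruby sketch that works on any hash.

```ruby
# Hypothetical helper (not part of Kimurai): temporarily override one key of a
# config hash for the duration of a block, restoring the old state afterwards.
def with_config_override(config, key, value)
  had_key = config.key?(key)
  previous = config[key]
  config[key] = value
  yield
ensure
  # Runs even if the block raises, so the shared config is never left modified.
  if had_key
    config[key] = previous
  else
    config.delete(key)
  end
end

config = { encoding: "utf-8" }
inside = with_config_override(config, :encoding, "GBK") { config[:encoding] }
puts inside             # => GBK
puts config[:encoding]  # => utf-8
```

In the spider this might be called as with_config_override(self.class.config, :encoding, "GBK") { request_to(:parse_content, url: link) }, assuming, as in the code above, that request_to runs synchronously and reads the encoding from the class config at request time.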
