Comments (9)

bitinn commented on July 19, 2024

I tried setting both parse and decode to false and listening on the headers event to inspect the headers, but charset=xxx is somehow missing from content-type. Is this a node http module feature, or am I just that unlucky...

$ curl -I http://bitinn.net/11084/
HTTP/1.1 200 OK
Date: Thu, 18 Dec 2014 10:12:17 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Set-Cookie: __cfduid=d3c54904565e8e93a88573cde1a7d6a211418897537; expires=Fri, 18-Dec-15 10:12:17 GMT; path=/; domain=.bitinn.net; HttpOnly
X-Powered-By: PHP/5.4.17RC1
X-Pingback: http://bitinn.net/xmlrpc.php
Link: <http://bitinn.net/?p=11084>; rel=shortlink
Server: cloudflare-nginx
CF-RAY: 19aa94ca892d11e3-SJC

needle.on('headers', function(headers) {
  console.log(headers);
});

{ date: 'Thu, 18 Dec 2014 10:14:12 GMT',
  'content-type': 'text/html',
  'transfer-encoding': 'chunked',
  connection: 'close',
  'set-cookie': [ '__cfduid=de102602fcd12a37cfcb6d934514556831418897652; expires=Fri, 18-Dec-15 10:14:12 GMT; path=/; domain=.bitinn.net; HttpOnly' ],
  'x-powered-by': 'PHP/5.4.17RC1',
  server: 'cloudflare-nginx',
  'cf-ray': '19aa979644b611e3-SJC',
  'content-encoding': 'gzip' }

Chrome:

HTTP/1.1 200 OK
Date: Thu, 18 Dec 2014 10:17:07 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/5.4.17RC1
X-Pingback: http://bitinn.net/xmlrpc.php
Server: cloudflare-nginx
CF-RAY: 19aa9bddb34311e3-SJC
Content-Encoding: gzip

from needle.

bitinn commented on July 19, 2024

So I ended up writing my own charset detection and converting the charset at stream end instead of using the built-in stream transform decoder. The problem is gone, so I suspect needle's decoder is somehow corrupting the buffer (chunk).

Looking at decoder.js, I don't see how it prevents a chunk boundary from cutting a multi-byte character in half. So while the charset is detected correctly, there is no guarantee that the start or end of a chunk does not fall in the middle of a multi-byte character, hence the issue I am observing...

So it looks like my previous concern was legitimate after all, and the best practice would be to keep the chunks in an array, concat them at the end of the stream, and then convert the encoding in one pass.
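
A minimal sketch of the difference, not needle's actual code. It assumes Node's built-in TextDecoder as a stand-in for iconv-lite so that it runs without dependencies, and it uses UTF-8 rather than Big5 for the same reason. The function names decodePerChunk and decodeWhole are hypothetical.

```javascript
// Sketch only: TextDecoder (a Node.js global) stands in for iconv-lite here.

// Decode every chunk on its own, with no state shared between chunks.
// This is the approach that breaks when a character straddles two chunks.
function decodePerChunk(chunks, charset) {
  return chunks.map(function (chunk) {
    return new TextDecoder(charset).decode(chunk);
  }).join('');
}

// Keep the raw chunks, concatenate them, and decode once at the end.
function decodeWhole(chunks, charset) {
  return new TextDecoder(charset).decode(Buffer.concat(chunks));
}

// "國際" is 6 bytes in UTF-8; cut it in the middle of the second character.
var bytes = Buffer.from('國際', 'utf8');
var chunks = [bytes.subarray(0, 4), bytes.subarray(4)];

console.log(decodePerChunk(chunks, 'utf-8')); // contains U+FFFD replacement characters
console.log(decodeWhole(chunks, 'utf-8'));    // 國際
```

The trade-off is memory: the whole body has to be held until the stream ends, which is usually acceptable for HTML pages.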

ref:

tomas commented on July 19, 2024

Good find!

I guess the problem is that we're trying to decode individual chunks instead of the whole thing once the transfer ends, which leads to multibyte chars being cut, as you mention.

I'm not sure if there's a way around this, though. Did you try the .collect trick explained in the first link (iconv-lite wiki)?

tomas commented on July 19, 2024

On second thought, rather than using .collect we might try using iconv.decodeStream(charset) instead of new StreamDecoder(charset) in line 49 in decoder.js. That way we can rely on iconv-lite's internal stream decoding logic instead of calling iconv.decode manually for each chunk.

Please give it a try and let me know if it works!
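
To illustrate what a stateful stream decoder does differently, here is a rough sketch. It is not iconv-lite's implementation: it assumes Node's built-in TextDecoder with the stream option as a stand-in, and createChunkDecoder is a hypothetical helper name. The point is only that bytes of an incomplete character are held back until the next chunk arrives.

```javascript
// Sketch only: one decoder instance is reused for all chunks,
// so it can remember an incomplete trailing byte sequence.
function createChunkDecoder(charset) {
  var decoder = new TextDecoder(charset);
  return {
    // Decode a chunk; an unfinished character at the end is kept for later.
    write: function (chunk) {
      return decoder.decode(chunk, { stream: true });
    },
    // Flush whatever is still pending once the transfer has ended.
    end: function () {
      return decoder.decode();
    }
  };
}

// "國際" in UTF-8, cut in the middle of the second character.
var bytes = Buffer.from('國際', 'utf8');
var decoder = createChunkDecoder('utf-8');

var out = '';
out += decoder.write(bytes.subarray(0, 4)); // first character, dangling byte held back
out += decoder.write(bytes.subarray(4));    // second character completed
out += decoder.end();

console.log(out); // 國際
```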

bitinn commented on July 19, 2024

If we know the original charset beforehand (say charset is present in header content-type), then it works; but if we need to extract the charset from body (say charset is from meta tag), then we must work with the first chunk.
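
A rough sketch of that two-step lookup, under some loud assumptions: detectCharset is a hypothetical helper (not part of needle), the meta tag is assumed to sit entirely inside the first chunk, and the page is assumed to use an ASCII-compatible encoding so the markup can be scanned as latin1.

```javascript
// Hypothetical helper: prefer the charset from the Content-Type header,
// otherwise look for a <meta> charset declaration in the first body chunk.
function detectCharset(contentType, firstChunk) {
  var fromHeader = /charset\s*=\s*["']?([\w-]+)/i.exec(contentType || '');
  if (fromHeader) return fromHeader[1].toLowerCase();

  // Tag names and charset labels are plain ASCII, so a latin1 view of the
  // raw bytes is enough to find them without knowing the real encoding.
  var head = firstChunk.toString('latin1');
  var fromMeta = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(head);
  return fromMeta ? fromMeta[1].toLowerCase() : null;
}

var html = Buffer.from('<html><head><meta charset="big5"></head>', 'latin1');

console.log(detectCharset('text/html; charset=UTF-8', html)); // utf-8
console.log(detectCharset('text/html', html));                // big5
```

Once the charset is known, the first chunk still has to be fed to the decoder along with the rest of the body, so it should be kept rather than discarded after sniffing.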

PS: I still haven't figured out why charset is often missing from needle's headers event (and why some headers are dropped). I can't reproduce this with curl or Chrome...

{ date: 'Thu, 18 Dec 2014 19:57:49 GMT',
  'content-type': 'text/html; charset=UTF-8',
  'transfer-encoding': 'chunked',
  connection: 'close',
  'set-cookie': [ '__cfduid=d0054f11661f55f78714aed3a5f67e3091418932669; expires=Fri, 18-Dec-15 19:57:49 GMT; path=/; domain=.bitinn.net; HttpOnly' ],
  'x-powered-by': 'PHP/5.4.17RC1',
  link: '<http://bitinn.net/?p=11084>; rel=shortlink',
  'x-pingback': 'http://bitinn.net/xmlrpc.php',
  server: 'cloudflare-nginx',
  'cf-ray': '19adee7f8bb611e9-SJC',
  'content-encoding': 'gzip' }
{ date: 'Thu, 18 Dec 2014 19:58:03 GMT',
  'content-type': 'text/html',
  'transfer-encoding': 'chunked',
  connection: 'close',
  'set-cookie': [ '__cfduid=dea4aa77a2174c06c88f3b403e900cc2e1418932683; expires=Fri, 18-Dec-15 19:58:03 GMT; path=/; domain=.bitinn.net; HttpOnly' ],
  'x-powered-by': 'PHP/5.4.17RC1',
  server: 'cloudflare-nginx',
  'cf-ray': '19adeed7571311e3-SJC',
  'content-encoding': 'gzip' }

Not a huge problem, but quite annoying when trying to debug what's going on...

leesei commented on July 19, 2024

I encountered the same issue with the latest release (0.9.2).
@bitinn, indeed many servers won't report the encoding in the HTTP header.
The HTML meta detection with the first chunk works correctly and detects the page's encoding as big5.
Some of the above snippets are no longer relevant, so I'm restating the issue here.

Say the BIG5 string "國際" (0xB0EA, 0xBBDA) in the HTML is split across chunks.

// chunk = <Buffer B0 EA BB>, this.charset = 'big5'
// this would produce garbage
res = iconv.decode(chunk, this.charset);

So no one has tried iconv.decodeStream(charset) yet?

bitinn commented on July 19, 2024

@leesei shameless self-plug, I ended up writing this module: https://github.com/bitinn/node-fetch

You should be able to decode res.body as a stream using iconv.decodeStream. But in most cases you probably don't need an HTML stream, right? So we make sure res.text() decodes correctly by buffering.

leesei commented on July 19, 2024

@bitinn Thanks for the info. I'll wait and see if there's any progress on this issue.

I tried to create a StreamDecoder after getting charset at decoder.js:28, but my stream-fu is not powerful enough to get the job done.
And I don't know how errors should be handled when using iconv's stream API.

tomas commented on July 19, 2024

I'm closing this issue for the time being. If anyone wants more context, here's the related discussion on the iconv-lite repo.
