Comments (9)
I tried to set both parse and decode to false and listen on the headers event to inspect the headers, but charset=xxx is somehow missing from content-type. Is this a node http module feature or am I just that unlucky...
$ curl -I http://bitinn.net/11084/
HTTP/1.1 200 OK
Date: Thu, 18 Dec 2014 10:12:17 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Set-Cookie: __cfduid=d3c54904565e8e93a88573cde1a7d6a211418897537; expires=Fri, 18-Dec-15 10:12:17 GMT; path=/; domain=.bitinn.net; HttpOnly
X-Powered-By: PHP/5.4.17RC1
X-Pingback: http://bitinn.net/xmlrpc.php
Link: <http://bitinn.net/?p=11084>; rel=shortlink
Server: cloudflare-nginx
CF-RAY: 19aa94ca892d11e3-SJC
needle.on('headers', function(headers) {
  console.log(headers);
});
{ date: 'Thu, 18 Dec 2014 10:14:12 GMT',
'content-type': 'text/html',
'transfer-encoding': 'chunked',
connection: 'close',
'set-cookie': [ '__cfduid=de102602fcd12a37cfcb6d934514556831418897652; expires=Fri, 18-Dec-15 10:14:12 GMT; path=/; domain=.bitinn.net; HttpOnly' ],
'x-powered-by': 'PHP/5.4.17RC1',
server: 'cloudflare-nginx',
'cf-ray': '19aa979644b611e3-SJC',
'content-encoding': 'gzip' }
chrome
HTTP/1.1 200 OK
Date: Thu, 18 Dec 2014 10:17:07 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/5.4.17RC1
X-Pingback: http://bitinn.net/xmlrpc.php
Server: cloudflare-nginx
CF-RAY: 19aa9bddb34311e3-SJC
Content-Encoding: gzip
from needle.
So I ended up writing my own charset detection checks and converting the charset on stream end instead of using the built-in stream transform decoder. The problem is gone, so I suspect the needle decoder is somehow messing up the buffer (chunk).
Looking at decoder.js, I don't see how it prevents a chunk boundary from cutting a multi-byte character in half. So while the charset is detected correctly, there is no guarantee that the start or end of a chunk does not fall in the middle of a multi-byte character, hence the issue I am observing...
So it looks like my previous concern is legit after all, and best practice should be to keep the chunks in an array, concat them on end of stream, then convert the encoding in one go.
ref:
- https://github.com/ashtuchkin/iconv-lite/wiki/Use-Buffers-when-decoding (the problem there is different, but the solution applies)
- http://nodejs.org/api/stream.html#stream_readable_setencoding_encoding (explains the problem, though in needle's case it's not toString causing the problem, but chunks containing partial multi-byte characters)
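A minimal sketch of the "collect, concat, then decode" approach described above. It is not needle's code: `decodeWhole` is a hypothetical helper, and UTF-8 with Node's built-in `Buffer` stands in for big5 so the example has no dependencies. With iconv-lite, the `decode` callback would presumably be `iconv.decode(buf, charset)` instead.

```javascript
// Hypothetical helper: decode only once, after all chunks have arrived.
function decodeWhole(chunks, decode) {
  return decode(Buffer.concat(chunks));
}

// "國際" is 6 bytes in UTF-8; split it so one character straddles two chunks.
const whole = Buffer.from('國際', 'utf8');
const chunks = [whole.subarray(0, 4), whole.subarray(4)];

// Decoding each chunk on its own corrupts the character that was cut in half.
const perChunk = chunks.map((c) => c.toString('utf8')).join('');

// Decoding the concatenated buffer sees every character whole.
const atEnd = decodeWhole(chunks, (buf) => buf.toString('utf8'));

console.log(perChunk === '國際'); // false
console.log(atEnd === '國際'); // true
```

The trade-off is memory: the whole body is held until the stream ends, which is usually acceptable for HTML pages.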
from needle.
Good find!
I guess the problem is that we're trying to decode individual chunks instead of the whole thing once the transfer ends, which leads to multibyte chars being cut, as you mention.
I'm not sure if there's a way around this, though. Did you try the .collect trick explained in the first link (iconv-lite wiki)?
from needle.
On second thought, rather than using .collect we might try using iconv.decodeStream(charset) instead of new StreamDecoder(charset) in line 49 of decoder.js. That way we can rely on iconv-lite's internal stream decoding logic instead of calling iconv.decode manually for each chunk.
Please give it a try and let me know if it works!
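To illustrate what a stateful stream decoder buys over per-chunk decoding, here is a small sketch using Node's built-in `StringDecoder` as a stand-in. This is an assumption made to keep the example dependency-free: `StringDecoder` only handles a few encodings such as UTF-8, so it is not a replacement for iconv.decodeStream, but it shows the same idea of holding back incomplete bytes until the next chunk arrives.

```javascript
const { StringDecoder } = require('string_decoder');

// "國際" in UTF-8, split in the middle of the second character.
const whole = Buffer.from('國際', 'utf8');
const chunks = [whole.subarray(0, 4), whole.subarray(4)];

const decoder = new StringDecoder('utf8');
let text = '';
for (const chunk of chunks) {
  // write() returns only complete characters and keeps any
  // trailing partial sequence for the next call.
  text += decoder.write(chunk);
}
text += decoder.end(); // flush whatever is still buffered

console.log(text === '國際'); // true
```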
from needle.
If we know the original charset beforehand (say charset is present in header content-type), then it works; but if we need to extract the charset from body (say charset is from meta tag), then we must work with the first chunk.
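A rough sketch of that two-step detection, under stated assumptions: `sniffCharset` is a hypothetical helper (not part of needle), the regular expressions are deliberately simple, and the first chunk is assumed to be large enough to contain the meta tag.

```javascript
// Hypothetical helper: prefer the charset from the Content-Type header,
// otherwise look for a <meta> charset declaration in the first body chunk.
function sniffCharset(contentType, firstChunk) {
  const fromHeader = /charset=([^\s;"']+)/i.exec(contentType || '');
  if (fromHeader) return fromHeader[1].toLowerCase();

  // latin1 maps each byte to exactly one character, so ASCII markup
  // stays readable whatever the real encoding turns out to be.
  const head = firstChunk.toString('latin1');
  const fromMeta = /<meta[^>]*charset=["']?([^\s"'>\/;]+)/i.exec(head);
  return fromMeta ? fromMeta[1].toLowerCase() : null;
}

console.log(sniffCharset('text/html; charset=UTF-8', Buffer.alloc(0))); // utf-8
console.log(sniffCharset('text/html', Buffer.from('<meta charset="big5">'))); // big5
console.log(sniffCharset('text/html', Buffer.from('<p>no hint</p>'))); // null
```

Once the charset is known, the first chunk still has to be fed to the decoder together with the chunks that follow, so the decoder should be created before any bytes are converted.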
PS: I still haven't figured out why the charset is often missing in needle's headers event (and some headers are dropped). I can't reproduce this with curl or chrome...
{ date: 'Thu, 18 Dec 2014 19:57:49 GMT',
'content-type': 'text/html; charset=UTF-8',
'transfer-encoding': 'chunked',
connection: 'close',
'set-cookie': [ '__cfduid=d0054f11661f55f78714aed3a5f67e3091418932669; expires=Fri, 18-Dec-15 19:57:49 GMT; path=/; domain=.bitinn.net; HttpOnly' ],
'x-powered-by': 'PHP/5.4.17RC1',
link: '<http://bitinn.net/?p=11084>; rel=shortlink',
'x-pingback': 'http://bitinn.net/xmlrpc.php',
server: 'cloudflare-nginx',
'cf-ray': '19adee7f8bb611e9-SJC',
'content-encoding': 'gzip' }
{ date: 'Thu, 18 Dec 2014 19:58:03 GMT',
'content-type': 'text/html',
'transfer-encoding': 'chunked',
connection: 'close',
'set-cookie': [ '__cfduid=dea4aa77a2174c06c88f3b403e900cc2e1418932683; expires=Fri, 18-Dec-15 19:58:03 GMT; path=/; domain=.bitinn.net; HttpOnly' ],
'x-powered-by': 'PHP/5.4.17RC1',
server: 'cloudflare-nginx',
'cf-ray': '19adeed7571311e3-SJC',
'content-encoding': 'gzip' }
Not a huge problem, but quite annoying when trying to debug what's going on...
from needle.
I encountered the same issue with the latest release (0.9.2).
@bitinn, indeed many servers won't report the encoding in the HTTP header.
The HTML meta detection on the first chunk works correctly and detects the page's encoding as big5.
Some of the above snippets are not relevant anymore, so I'm restating the issue here.
Say the BIG5 string "國際" (0xB0EA, 0xBBDA) in the HTML is split across chunks.
// chunk = <Buffer B0 EA BB>, this.charset = 'big5'
// this would produce garbage
res = iconv.decode(chunk, this.charset);
So no one has tried iconv.decodeStream(charset) yet?
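For the byte layout above, the part of a chunk that is safe to decode can be found by walking the bytes and holding back a dangling lead byte. This is only a sketch: `splitCompleteBig5` is a hypothetical helper, it assumes the usual Big5 layout where lead bytes fall in 0x81-0xFE and each lead byte is followed by exactly one trail byte, and it assumes the buffer starts on a character boundary.

```javascript
// Hypothetical helper: split a Big5 buffer into [complete, remainder],
// where remainder is a trailing lead byte whose trail byte has not arrived.
function splitCompleteBig5(buf) {
  let i = 0;
  while (i < buf.length) {
    const isLead = buf[i] >= 0x81 && buf[i] <= 0xfe;
    if (isLead && i + 1 >= buf.length) break; // dangling lead byte, stop here
    i += isLead ? 2 : 1;
  }
  return [buf.subarray(0, i), buf.subarray(i)];
}

// The chunk from the example above: "國" is complete, "際" is cut in half.
const [complete, rest] = splitCompleteBig5(Buffer.from([0xb0, 0xea, 0xbb]));
console.log(complete); // <Buffer b0 ea>
console.log(rest); // <Buffer bb>
```

The remainder would be prepended to the next chunk before decoding. A stream decoder such as the one returned by iconv.decodeStream is expected to do this bookkeeping internally, which is why it is the simpler fix.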
from needle.
@leesei shameless self-plug: I ended up writing this module: https://github.com/bitinn/node-fetch
You should be able to decode res.body as a stream, using iconv.decodeStream. But in most cases you probably don't need an HTML stream, right? So we make sure res.text() decodes correctly by buffering.
from needle.
@bitinn Thanks for the info. I'll wait and see if there's any progress on this issue.
I tried to create a StreamDecoder after getting the charset at decoder.js:28, but my stream-fu is not powerful enough to get the job done.
And I don't know how errors should be handled when using iconv's stream API.
from needle.
I'm closing this issue for the time being. If anyone wants more context, here's the related discussion on the iconv-lite repo.
from needle.