Comments (17)

jeroen avatar jeroen commented on August 14, 2024

Strange, it works fine here. Let me test on a windows machine.

from jsonlite.

jeroen avatar jeroen commented on August 14, 2024

This is tricky. If we set encoding="UTF-8" as you suggest, then the downloaded file is read correctly. However, on Windows, text files are stored as ISO 8859-1 by default. For example, if you open Notepad and paste something like this:

{"source":"Statistisk sentralbyrå"}

and save it to a file, then it is the other way around:

> readLines("~/../Desktop/test.txt", encoding="UTF-8")
[1] "{\"source\":\"Statistisk sentralbyr\xe5\"}"

So there is no good default. Indeed, JSON is UTF-8 by default, but under your locale, ISO 8859-1 is the default for text files. I'm not sure what to do.

Also note that you can specify the default encoding for files on your system as follows:

options(encoding="UTF-8")
d = fromJSON("c:/tmp/49623.json")
d$dataset$source
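The "no good default" point is language-independent and can be sketched in Python (used here purely for illustration, not jsonlite code): the same text yields different bytes depending on which encoding the editor saved it with, so no single default can read both kinds of file correctly.

```python
# Illustration (Python, not jsonlite): the same JSON text saved under
# two encodings produces two different byte sequences on disk.
text = '{"source":"Statistisk sentralbyrå"}'

utf8_bytes = text.encode("utf-8")      # å becomes the two bytes 0xC3 0xA5
latin1_bytes = text.encode("latin-1")  # å becomes the single byte 0xE5

print(utf8_bytes != latin1_bytes)              # True: the files differ on disk
print(latin1_bytes.decode("latin-1") == text)  # True: the right guess round-trips
# A wrong guess either errors out (Latin-1 bytes read as UTF-8) or
# produces mojibake (UTF-8 bytes read as Latin-1):
print(utf8_bytes.decode("latin-1"))            # {"source":"Statistisk sentralbyrÃ¥"}
```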


jeroen avatar jeroen commented on August 14, 2024

Okay I think I have found a solution: f058237. Can you give this a try?


huftis avatar huftis commented on August 14, 2024

Unfortunately, the proposed solution doesn’t work. validate() seems to accept byte sequences that are invalid UTF-8 (while parseJSON doesn’t).

I propose the following solution instead. Add an ‘encoding’ argument, having "UTF-8" as the default value:

fromJSON <- function(txt, simplifyVector = TRUE, simplifyDataFrame = simplifyVector,
  simplifyMatrix = simplifyVector, flatten = FALSE, encoding="UTF-8", ...) {

Then change the part that reads local files to:

} else if (file.exists(txt)) {
  filename <- txt;
  con <- file(filename, open="r", encoding=encoding)
  txt <- readLines(con, warn = FALSE)
  close(con)
}

Three things to note:

We need to set the ‘encoding’ argument of the file() function, not of the readLines() function (the two arguments mean different things!).

It’s not necessary to run paste() on the result, since this will automatically be done outside the loop.

I have used the name ‘encoding’ for the argument. It may be better to use the name ‘fileEncoding’ instead, to be consistent with read.csv(), read.table() and friends, and to make it clearer that the argument only applies to files, not URLs.

I believe this solution will work well. Almost all JSON files in the wild are written in UTF-8, so this should work for most users who download JSON files from the Web or who get JSON files from other tools (people don’t author JSON files using Notepad … ☺). For those files that use other encodings, it’s the user’s responsibility to specify the encoding used.

I have tested this (on Linux) on both UTF-8 files and ISO-8859-1 files, and it seems to work fine. (Reading ISO-8859-1 files without specifying the encoding gives a warning and an error message, but that’s the way it should be.)
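The shape of the proposed change can be mimicked in Python for illustration; from_json_file is a hypothetical stand-in for fromJSON, and open(..., encoding=...) plays the role of R's file(..., encoding=...):

```python
import json

def from_json_file(path, encoding="utf-8"):
    """Hypothetical sketch of the proposed fromJSON(encoding=...) API:
    UTF-8 by default, overridable for legacy encodings."""
    # open(..., encoding=...) decodes bytes while reading, like R's
    # file(..., encoding=...); the with-block also closes the connection.
    with open(path, encoding=encoding) as con:
        return json.load(con)
```

With this shape, UTF-8 files work out of the box, and a Latin-1 file is read with from_json_file(path, encoding="latin-1").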


jeroen avatar jeroen commented on August 14, 2024

Could you give an example of a string or a file for which the current solution does not work?


huftis avatar huftis commented on August 14, 2024

Even for a simple file like:

{"example":"Test æøå"}

I get these error messages and warnings:
Error in parseJSON(txt) : lexical error: invalid bytes in UTF8 string.
{"example":"Test ���"}
(right here) ------^
In addition: Warning message:
In grepl("invalid bytes in UTF8", attr(isvalid, "err"), fixed = TRUE) :
input string 1 is invalid in this locale
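The error is expected at the byte level: the Latin-1 encodings of æ, ø and å are single bytes ≥ 0x80 that do not form valid UTF-8 sequences. A small Python illustration (not jsonlite code):

```python
# The file bytes, as a Latin-1 editor would store them.
data = '{"example":"Test æøå"}'.encode("latin-1")

try:
    data.decode("utf-8")  # a strict UTF-8 decoder must reject 0xE6 0xF8 0xE5
except UnicodeDecodeError as err:
    # The same failure the parser reports as "invalid bytes in UTF8 string".
    print("invalid UTF-8:", err.reason)
```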

Actually, I’m wondering if perhaps an ‘encoding’ argument is needed for URLs too. I was under the impression that the character encoding specified in the HTTP headers was being used, but it seems that this is not the case. I have uploaded four example files at
http://huftis.org/kritikk/json/

The ones named real_utf8* are encoded in UTF-8 and the ones named real_latin1* are encoded in ISO-8859-1. The contents of both real_utf8* files are identical (byte for byte), and the contents of both real_latin1* files are identical (byte for byte).

The _charset suffix indicates which character encoding is specified in the HTTP headers. In other words, when the real_* prefix and the _charset suffix agree, the files are what they say they are, and should work. If the prefix and suffix disagree, the files are invalid. (Try opening the files in a Web browser. It should work fine for the valid files, but give garbled output for the invalid files.) Now try the following in R:

# Valid files
fromJSON("http://huftis.org/kritikk/json/real_utf8_charset_utf8.json")
fromJSON("http://huftis.org/kritikk/json/real_latin1_charset_latin1.json")

# Invalid files
fromJSON("http://huftis.org/kritikk/json/real_latin1_charset_utf8.json")
fromJSON("http://huftis.org/kritikk/json/real_utf8_charset_latin1.json")

It looks like all files are interpreted as if they were in the native encoding (with the latin1 bytes handled as if they were invalid UTF-8 byte sequences, and shown as � replacement characters). This is on a Linux machine; I suspect the opposite result would happen on a Windows machine (I don’t have access to one right now).
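What honouring the HTTP charset would look like can be sketched in Python (illustrative only; declared_charset stands in for whatever the Content-Type header says):

```python
# Bytes as a Latin-1-encoded response body would carry them (constructed
# here to mirror the real_latin1_* example files).
body = '{"example":"Test æøå"}'.encode("latin-1")
declared_charset = "ISO-8859-1"  # assumed value from the Content-Type header

correct = body.decode(declared_charset)           # honours the header
garbled = body.decode("utf-8", errors="replace")  # ignores it

print(correct)              # {"example":"Test æøå"}
print("\ufffd" in garbled)  # True: replacement characters where æøå should be
```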

So I’m wondering what the ‘httr’ package handles that ordinary ‘readLines’ doesn’t? I thought it was proper handling of HTTP character encoding specifications, but obviously this is not the case.


jeroen avatar jeroen commented on August 14, 2024

I pushed a fix for the first problem, could you try again?


huftis avatar huftis commented on August 14, 2024

With your latest update, I don’t get an error message any longer, but now UTF-8 files don’t work unless this is set as the default in R (in options(encoding=…)).


jeroen avatar jeroen commented on August 14, 2024

Your example file seems to work here:

> fromJSON("~/../Desktop/49623.json")$dataset$source
[1] "Statistisk sentralbyrå"


huftis avatar huftis commented on August 14, 2024

Is this on a Windows or a Linux PC? Remember that the encoding argument of readLines() (confusingly) does not specify that the input should be recoded; for that you need the encoding argument of the file() function, as in my earlier code example. So specifying readLines(…, encoding="UTF-8") will only work properly if both the file is in UTF-8 and options("encoding") is set to UTF-8 (which is typically true on Linux, but not on Windows).


jeroen avatar jeroen commented on August 14, 2024

I tested on Ubuntu, Mac and Windows, and they all work with your example file:

> fromJSON("~/../Desktop/49623.json")$dataset$source
[1] "Statistisk sentralbyrå"


jeroen avatar jeroen commented on August 14, 2024

It's probably the different locale. Argh. It's getting too late, I'll continue tomorrow.


huftis avatar huftis commented on August 14, 2024

Thanks for looking into this. Just a final comment (for today), that might clear things up a bit. There are two encoding arguments, one to readLines() and one to file() (and other connection functions). Their meaning is completely different. What one typically wants is the encoding argument to file(), e.g.:

readLines(file(filename, encoding="UTF-8"), warn=FALSE)

not

readLines(filename, encoding="UTF-8", warn=FALSE)

(but one should also close the file() connection after using it, so don’t use this exact code).

Then the file will be read as UTF-8, stored in whatever internal format R deems best, and can be used as ordinary text.
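In Python terms (for illustration only), the pattern corresponds to declaring the encoding where the connection is opened, with a with-block handling the close that the inline R one-liner omits:

```python
import tempfile, os

# Setup: write a sample UTF-8 file to read back.
fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write('{"source":"Statistisk sentralbyrå"}\n')

# Analogue of readLines(file(filename, encoding="UTF-8"), warn=FALSE),
# with the connection closed automatically when the block ends.
with open(path, encoding="utf-8") as con:
    lines = con.read().splitlines()

print(lines[0])  # {"source":"Statistisk sentralbyrå"}
os.remove(path)
```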


jeroen avatar jeroen commented on August 14, 2024

The problem with specifying encoding="UTF-8" in the connection is that it goes really wrong if the file is in fact Latin-1. The result gets truncated around the first non-ASCII character, with a warning.

> readLines(file("~/../Desktop/test.txt", encoding="UTF-8"), warn=F)
[1] "{\"source\":\"Statistisk sentralbyr"
Warning message:
invalid input found on input connection '~/../Desktop/test.txt' 

This will then of course lead to errors in the JSON parser, because the text has been truncated.

Therefore I'm not sure if that would be the best default. The other way around seems to fail more gracefully so that we can more easily recover.
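The trade-off can be made concrete in Python (illustrative only): strict UTF-8 decoding of a Latin-1 file stops at the first offending byte, while keeping the raw bytes leaves a fallback decode able to recover everything.

```python
# A Latin-1 file decoded strictly as UTF-8 fails partway through,
# which is what surfaces in R as the truncated readLines() result.
body = '{"source":"Statistisk sentralbyrå"}'.encode("latin-1")

try:
    body.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict decode failed at byte offset", err.start)

# Holding on to the raw bytes keeps recovery possible:
print(body.decode("latin-1"))  # {"source":"Statistisk sentralbyrå"}
```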


jeroen avatar jeroen commented on August 14, 2024

Okay, I have implemented yet another solution. I tried it on all my machines and it seems to work fine here. Could you give this a try?


jeroen avatar jeroen commented on August 14, 2024

@huftis did you have a chance to test the new version?


jeroen avatar jeroen commented on August 14, 2024

I think this is fixed. Please reopen if you find issues.

