UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' (wummel/patool)

Comments (14)

wummel commented on August 21, 2024

We believe that the issue you reported is fixed in the source repository of patool which can be found under:
https://github.com/wummel/patool

Changelog entry:

Escape output messages before logging them.
Closes: GH bug #29

Thank you for reporting the issue. It is now marked as fixed. If you believe that the issue is not fixed appropriately just add a comment to this issue.

from patool.

wummel commented on August 21, 2024

reverted the commit as it broke Python3 tests

yarikoptic commented on August 21, 2024

@htgoebel could you please check if this issue persists with current master of patool? if you could provide a sample archive, it would be great

cgmb commented on August 21, 2024

If the only issue is that print to stdout is failing, you may be able to set PYTHONIOENCODING=latin-1 to avoid changing anything but the logging. Though, you might also need to redirect stdout to a file if your terminal can't handle the character either.

cgmb commented on August 21, 2024

Unfortunately, stdout/stderr are a real pain for encodings. @wummel's fix in d293404 looked reasonable, but the test suite caught some failures under Python 3. Here's an example:

tests/test_formats.py patool: running /home/travis/virtualenv/python3.3.5/bin/python /home/travis/build/wummel/patool/patool -vv formats


********** Oops, I did it again. *************

You have found an internal error in patool. Please write a bug report
at https://github.com/wummel/patool/issues/ and include at least the information below:

Not disclosing some of the information below due to privacy reasons is ok.
I will try to help you nonetheless, but you have to give me something
I can work with ;) .

<class 'TypeError'> must be str, not bytes
Traceback (most recent call last):
  File "/home/travis/build/wummel/patool/patool", line 210, in main
    res = globals()["run_%s" % args.command](args)
  File "/home/travis/build/wummel/patool/patool", line 120, in run_formats
    patoolib.list_formats()
  File "/home/travis/build/wummel/patool/patoolib/__init__.py", line 345, in list_formats
    print(format, "files:")
  File "/home/travis/virtualenv/python3.3.5/lib/python3.3/codecs.py", line 369, in write
    self.stream.write(data)
TypeError: must be str, not bytes
System info:
patool 1.8
Python 3.3.5 (default, Feb  4 2015, 19:07:04) 
[GCC 4.6.3] on linux
Local time: 2015-12-05 12:02:34+000
sys.argv ['/home/travis/build/wummel/patool/patool', '-vv', 'formats']
LC_ALL = 'en_US.UTF-8'
LC_CTYPE = 'en_US.UTF-8'
LANG = 'en_US.UTF-8'

 ******** patool internal error, over and out ********
F

It's an odd failure. Here's a minimal example of that problem:

from __future__ import print_function
import codecs
import locale
import sys
ArchiveFormats = (
    '7z', 'ace', 'adf', 'alzip', 'ape', 'ar', 'arc', 'arj',
)
encoding = locale.getpreferredencoding()
errors = 'backslashreplace'
logstream = codecs.getwriter(encoding)(sys.stdout, errors)
for format in ArchiveFormats:
    print(format, "files:", file=logstream)

In Python 3, it seems print(str,str, file=<>) will pass bytes to the file's encoder, but print(str, file=<>) passes a str? Weird. I guess you should avoid passing more than one positional argument to print. Use %, format or whatever to create a single string to pass to print.
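One possible reading of the traceback above: the error is raised inside codecs.py at self.stream.write(data), which suggests the StreamWriter has already encoded the text to bytes and is handing those bytes to sys.stdout, which is a text stream under Python 3. If that is right, the number of positional arguments to print may not be the deciding factor. A minimal sketch, using in-memory streams as stand-ins for sys.stdout (text) and sys.stdout.buffer (binary); the helper names are made up for illustration:

```python
import codecs
import io


def make_writer(stream, encoding="utf-8", errors="backslashreplace"):
    """Wrap a stream in a codecs StreamWriter (hypothetical helper)."""
    return codecs.getwriter(encoding)(stream, errors)


def try_print(stream, *args):
    """Print args through a StreamWriter and report whether the write worked."""
    try:
        print(*args, file=make_writer(stream))
    except TypeError:
        return "TypeError"
    return "ok"


# io.StringIO stands in for sys.stdout: it only accepts str.
text_two_args = try_print(io.StringIO(), "7z", "files:")
text_one_arg = try_print(io.StringIO(), "7z files:")

# io.BytesIO stands in for sys.stdout.buffer: it accepts the encoded bytes.
binary = io.BytesIO()
binary_two_args = try_print(binary, "7z", "files:")

print(text_two_args, text_one_arg, binary_two_args, binary.getvalue())
```

Under Python 3, passing sys.stdout.buffer instead of sys.stdout to codecs.getwriter would presumably avoid the TypeError. Python 2's sys.stdout has no buffer attribute, so code supporting both would need a version check.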

My suggestions:

1. Rather than wrapping and replacing sys.stdout and sys.stderr, wrap them and name the new streams something like patool.logout and patool.errout. Use those everywhere. This seems a little less invasive than outright replacing them.
2. Replace direct uses of print with patool.print, patool.printlog and/or patool.printerr. Have those functions accept multiple strings and call print(str, file=<>). That makes fixing all the calls to print a find+replace job, rather than a rewriting job.
3. This is really annoying to keep consistent between Python 2 and Python 3 because of the different defaults. This issue more than anything is what has soured me on dynamic typing. It's possible to support both if you're careful/use u" everywhere, but you may want to consider just dropping Python 2.7 support to save yourself a lot of headache.
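The first two suggestions could look roughly like the sketch below. It is Python 3 only, and make_log_stream and print_log are hypothetical names chosen for illustration, not existing patool functions. The idea is to leave sys.stdout alone, build a separate writer over a binary stream, and have one helper join its arguments into a single string before calling print.

```python
import codecs
import io
import locale


def make_log_stream(binary_stream, encoding=None, errors="backslashreplace"):
    """Return a writer that encodes text onto a binary stream (hypothetical helper)."""
    if encoding is None:
        encoding = locale.getpreferredencoding()
    return codecs.getwriter(encoding)(binary_stream, errors)


def print_log(*parts, file):
    """Join all parts into one string and emit it with a single print call."""
    print(" ".join(str(part) for part in parts), file=file)


# In real use the writers might be created once, for example
#   logout = make_log_stream(sys.stdout.buffer)
#   errout = make_log_stream(sys.stderr.buffer)
# Here an in-memory buffer is used so the result can be inspected.
demo_buffer = io.BytesIO()
print_log("patool:", "file\udcfcname", file=make_log_stream(demo_buffer, "utf-8"))
print(demo_buffer.getvalue())
```

With the backslashreplace handler the lone surrogate is written as an escape sequence instead of raising UnicodeEncodeError, so the log line survives even though the name is not displayed as the original character.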

htgoebel commented on August 21, 2024

> could you please check if this issue persists with current master of patool? if you could provide a sample archive, it would be great

@yarikoptic After almost three years I can not remember the archive causing the problem. But since this is the error:

  File "/tmp/pa/lib64/python3.4/site-packages/patoolib/util.py", line 502, in log_info
    print("patool:", msg, file=out)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 164: surrogates not allowed

it should be easy to reproduce by passing a string containing the character \udcfc to the print-function.

I also found this explanation and possible solution, which seems to work:

$ python3
Python 3.5.3
…
>>> print('\udcfc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 0: surrogates not allowed
>>> print('\udcfc'.encode('utf8','surrogateescape'))
b'\xfc'
>>>

In Latin-1 encoding (which was the encoding used on the Amiga and thus used in the archive), b'\xfc' is ü (u-umlaut), which looks reasonable to me.

yarikoptic commented on August 21, 2024

all the unicode handling is what keeps giving me grief any time I look at it with the prior idea that "I know what is going on". Differences in behavior between Python 2 and 3 just amplify the negative effects

$> python -c 'print(u"\udcfc")' 
í³¼

$> python3 -c 'print(u"\udcfc")' 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 0: surrogates not allowed

$> LANG=Latin-1 python3 -c 'print(u"\udcfc")' 
ü

$> LANG=Latin-1 python -c 'print(u"\udcfc")' 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\udcfc' in position 0: ordinal not in range(128)

especially annoying is that built-in "batteries" then do not cope with unicode nicely themselves. E.g. pdb - running this tiny script http://www.onerussian.com/tmp/test-uni.py on a system without locales (e.g. minimal docker image) and then pressing l to list the code works just fine in python2 but crashes in python3 .
So I guess we are doomed to provide shims for any output, and try various hacks to make it printed at least somehow

htgoebel commented on August 21, 2024

I strongly suggest dropping support for Python 2. In Python 3 it is much clearer to see whether data is (encoded) bytes or a string.

Python 2 is end-of-life in January 2020 anyway, and all GNU/Linux distributions, *BSD, etc. should provide Python 3. So there is no use in spending time on Python 2.

yarikoptic commented on August 21, 2024

So let's resurrect "drop python 2" idea somewhere in 2020. In our case datalad which uses patoolib is supporting both.

cgmb commented on August 21, 2024

@yarikoptic It took me a bit to figure this out, but the problem with your example is that your string is just the invalid code point U+DCFC, which has no meaning in Unicode. PEP 383 exploits the fact that it isn't valid on its own, using the values U+DC80..U+DCFF to represent unknown characters:

> non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF. Bytes below 128 will produce exceptions

So, Python allows that character alone in a string because it was given a special Python-specific meaning. In your first example, I guess that's how the non-decodable bytes it represents are represented in your locale? I just got the replacement character 3 times.
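The PEP 383 mapping quoted above can be checked directly in Python 3. This is only an illustrative sketch; escape_byte is a made-up helper name. Each undecodable byte b (128 or above) becomes the lone surrogate U+DC00 + b, and encoding with the same error handler restores the original byte.

```python
def escape_byte(value):
    """Decode one non-ASCII byte the way the 'surrogateescape' handler does."""
    return bytes([value]).decode("ascii", "surrogateescape")


for value in (0x80, 0xFC, 0xFF):
    escaped = escape_byte(value)
    # Byte b is represented as the lone surrogate U+DC00 + b.
    assert ord(escaped) == 0xDC00 + value
    # Encoding with the same handler gives the original byte back.
    assert escaped.encode("ascii", "surrogateescape") == bytes([value])

# A Latin-1 encoded file name decoded as UTF-8 with surrogateescape:
# the stray byte 0xFC turns into U+DCFC, the character from the error message.
decoded_name = b"file\xfcname".decode("utf-8", "surrogateescape")
print(ascii(decoded_name))
```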

In your last example with LANG=Latin-1 under Python 2, the problem is that Latin-1 isn't recognized as a valid locale for LANG to be set to, so it's defaulting to the ascii encoding (which can't represent the character). This StackOverflow thread talks about the problem. You can check the encoding with python -c 'import sys; print(sys.stdout.encoding)'.

I'm not entirely sure what's happening in your other examples, but I guess the important take-away is that's not a good string to test with, because it's not really valid Unicode. The results are much less weird testing with u"Joyeux Noël!":

python -c 'print(u"Joyeux No\u00ebl")'
Joyeux Noël

python3 -c 'print(u"Joyeux No\u00ebl")'
Joyeux Noël

LANG=Latin-1 python3 -c 'print(u"Joyeux No\u00ebl")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xeb' in position 9: ordinal not in range(128)

LANG=Latin-1 python -c 'print(u"Joyeux No\u00ebl")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 9: ordinal not in range(128)

That makes much more sense. Setting LANG=Latin-1 breaks the output, because my default is UTF-8, and Latin-1 is not recognized, so it falls back to ascii.

Now, let's set PYTHONIOENCODING=latin_1:backslashreplace (docs). We can try your original tests:

export PYTHONIOENCODING=latin_1:backslashreplace
python -c 'print(u"\udcfc")'
\udcfc
python3 -c 'print(u"\udcfc")'
\udcfc

So, the code point U+DCFC can't be printed, but with the backslashreplace fallback, we at least get the value printed.

"Üü" in Latin-1 is 0xDCFC, but in Unicode that would be u"\u00DC\u00FC" or u"\xDC\xFC" for short. The important thing you were missing when specifying your Unicode string is that it's two characters: U+00DC and U+00FC.

The other important thing is that if you set PYTHONIOENCODING=latin_1, it will try to output in Latin-1. That failed for U+DCFC because there's no Latin-1 character matching that invalid character. But, ë, Ü and ü all exist in Latin-1, so you can do something like this no problem:

export PYTHONIOENCODING=latin_1:backslashreplace
python -c 'print(u"\xeb\xdc\xfc")' > pythonencodingtestfile1.txt
python3 -c 'print(u"\xeb\xdc\xfc")' > pythonencodingtestfile2.txt

If you open those files with an editor in Latin-1 mode, you will see "ëÜü". (If you don't redirect it to a file, you would need to have a Latin-1 terminal for it to display correctly, and mine is UTF-8.)
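The PYTHONIOENCODING experiments above can also be reproduced without a terminal by capturing the raw bytes a child interpreter writes. A minimal sketch for Python 3; run_with_io_encoding is a hypothetical helper, and the child is simply whatever interpreter runs the script:

```python
import os
import subprocess
import sys


def run_with_io_encoding(code, io_encoding):
    """Run a snippet in a child interpreter with PYTHONIOENCODING set."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode, result.stdout.rstrip(b"\r\n")


setting = "latin_1:backslashreplace"

# The lone surrogate has no Latin-1 byte, so the handler writes an escape.
surrogate_rc, surrogate_out = run_with_io_encoding(r'print("\udcfc")', setting)

# Real Latin-1 characters are written as single bytes.
latin_rc, latin_out = run_with_io_encoding(r'print("\xeb\xdc\xfc")', setting)

print(surrogate_rc, surrogate_out, latin_rc, latin_out)
```

Looking at bytes rather than terminal output sidesteps the question of what the terminal itself can display.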

I think I got a little less coherent as time went on, and I really have to get to bed, but I hope that you feel a bit less lost after reading this.

cgmb commented on August 21, 2024

I think I misunderstood what you were getting at with U+DCFC. :(

Maybe ignore the bits about Üü. Sorry, I have a similar problem in my program and I'm trying to understand it better too.

So here's the rundown as I understand it:

1. We have a string with U+DCFC that we're trying to log. The U+DCFC was created by the decoding error handling policy 'surrogateescape', and it represents a ü character from a file name that was encoded in Latin-1.
2. The actual command patool runs needs U+DCFC, because it will be converted back to the original bytes that match the file name when patool passes the argument to Popen.
3. In this particular case, we know the encoding should be Latin-1, so for printing to the log (only), we could encode the file name to undo the escaping, then decode it as Latin-1 because we know that's the right encoding. Then we print the file name, and it will display just fine when printing to a UTF-8 terminal.
4. In the general case, we might not know the encoding of the file name. The surrogate escape is not what we want to output to the log, though, so my suggestion of PYTHONIOENCODING isn't ideal. Just like in 3) we can encode the string to undo the surrogate-escaping, but then try to decode as UTF-8 with an error handler like 'replace' or 'backslashreplace'. That gets us a string that represents the original bytes, but can be printed to the terminal.
5. Doing 3) is slightly better than doing 4), because it has better log messages, but we need to know the file name encoding (like from the suggested command-line option). Both 3) and 4) would successfully run the commands on the archives, as this problem was only ever about logging.
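Options 3) and 4) might be sketched as follows for Python 3.5 or later. The function names are invented for illustration, and the sketch assumes the name was originally decoded as UTF-8 with surrogateescape (the fs_encoding parameter); on a real system os.fsencode would be the usual way to undo the escaping.

```python
def undo_surrogateescape(name, fs_encoding="utf-8"):
    """Recover the original bytes from a surrogate-escaped string."""
    return name.encode(fs_encoding, "surrogateescape")


def loggable_known(name, source_encoding, fs_encoding="utf-8"):
    """Option 3: the archive's file name encoding is known."""
    return undo_surrogateescape(name, fs_encoding).decode(source_encoding)


def loggable_unknown(name, fs_encoding="utf-8"):
    """Option 4: encoding unknown, so escape whatever is not valid UTF-8."""
    raw = undo_surrogateescape(name, fs_encoding)
    return raw.decode("utf-8", "backslashreplace")


escaped = "file\udcfcname"           # what patool would pass on to Popen
known = loggable_known(escaped, "latin-1")
unknown = loggable_unknown(escaped)

# Both results contain no lone surrogates, so they can be printed safely.
print(known.encode("utf-8"), unknown)
```

Only the string used for logging is converted. The surrogate-escaped original is kept for the actual command, as described in 2).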

yarikoptic commented on August 21, 2024

Hey @cgmb - thanks bunch for all the details (much appreciated!), I will try to absorb them later on with a good bottle of something helping digestion ;)

htgoebel commented on August 21, 2024

I want to emphasize that this "garbage" character comes from the archive: The filenames in the archive are encoded using latin-1. (This is why I suggested adding some --input-encoding option).

Also:

I was able to find again the archive causing the error. Thus I'm able to test any suggested fix - just drop me a note. Unfortunately this archive contains private data thus I can not share it.
The issue only occurs when using Python 3.

When using Python 2, the log line reads (shortened by me) - mind the surrogates:

patool: running /bin/tar […] --file some-old-archive.lha -- file��name

htgoebel commented on August 21, 2024

I had an idea how to create an archive showing the issue: Just patch the filename to contain an umlaut:

#!/usr/bin/env python3

from subprocess import run
import string

TEXTFILENAME = "sometextfile.txt"
ARCHIVENAME = "test.lha"

# create some text file
dummy_text = string.printable * 10
with open(TEXTFILENAME, "w", encoding="ascii") as fh:
    fh.write(dummy_text)

# create an archive containing this file
run(["lha", "add", ARCHIVENAME, TEXTFILENAME])

# read archive file into memory
with open(ARCHIVENAME, "rb") as fh:
    data = fh.read()

# inject a latin-1 umlaut into the filename
data_len = len(data)
data = data[:37] + b"\xfc" + data[38:]
assert data_len == len(data)

# write injected archive file back
with open(ARCHIVENAME, "wb") as fh:
    fh.write(data)

But maybe it is not worth spending time on this issue, and it would be better to switch to Python-3-only now that Python 2 is end-of-life.
