Describe the bug Python files that declare an alter


Alternate file encodings throw UnicodeDecodeError about deptry HOT 13 CLOSED

fpgmaas commented on September 20, 2024

Alternate file encodings throw UnicodeDecodeError

from deptry.

Comments (13)

sbywater commented on September 20, 2024

I agree that this is now fixed.


fpgmaas commented on September 20, 2024

Thanks for raising the issue. I have not worked with files with alternate encodings before, so I will have a look and see if I can reproduce and fix this tomorrow!


fpgmaas commented on September 20, 2024

Strange, I just tried to reproduce it but was not able to.

I added the following file and ran deptry:

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
my_string = """
Ax	NBSP	¡	¢	£	€	¥	Š	§	š	©	ª	«	¬	SHY	®	¯
Bx	°	±	²	³	Ž	µ	¶	·	ž	¹	º	»	Œ	œ	Ÿ	¿
Cx	À	Á	Â	Ã	Ä	Å	Æ	Ç	È	É	Ê	Ë	Ì	Í	Î	Ï
Dx	Ð	Ñ	Ò	Ó	Ô	Õ	Ö	×	Ø	Ù	Ú	Û	Ü	Ý	Þ	ß
Ex	à	á	â	ã	ä	å	æ	ç	è	é	ê	ë	ì	í	î	ï
Fx	ð	ñ	ò	ó	ô	õ	ö	÷	ø	ù	ú	û	ü	ý	þ	ÿ
"""
import foo

deptry successfully parsed this file and concluded that foo was a missing dependency.

So I think there is also a system-specific issue here? Maybe to avoid this error on all systems, we need to detect and explicitly specify the encoding while reading, as shown here:

open('filename', encoding="ISO-8859-1")

But then we would need to detect the file encoding first.

Anyway, I do not have a lot of knowledge about encodings, so this might take me some time. Would also be good if I can find a way to reproduce this on my laptop. I will dive deeper into this issue tomorrow.
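For files that declare their encoding, as the test file above does with its coding comment, the standard library can read the declaration without any guessing. The following is a minimal sketch using only the standard library; the helper name declared_encoding is hypothetical and not part of deptry.

```python
import io
import tokenize


def declared_encoding(source_bytes: bytes) -> str:
    """Return the encoding named by a BOM or a PEP 263 coding comment.

    Falls back to "utf-8" when the first two lines declare nothing.
    (Hypothetical helper for illustration; not deptry's code.)
    """
    readline = io.BytesIO(source_bytes).readline
    encoding, _consumed_lines = tokenize.detect_encoding(readline)
    return encoding


sample = b"#!/usr/bin/python\n# -*- coding: iso-8859-15 -*-\nimport foo\n"
print(declared_encoding(sample))           # iso-8859-15
print(declared_encoding(b"import foo\n"))  # utf-8
```

This only covers declared encodings. A file that contains non-UTF-8 bytes and declares nothing would still need some form of guessing.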


sbywater commented on September 20, 2024

Maybe you are using Windows, where ISO-8859-1 can be an assumed encoding?

System:

OS: Ubuntu 22.06
Language Version: Python 3.10


fpgmaas commented on September 20, 2024

I'm using macOS 12.3.1 and Python 3.9.

I think the issue should now be solved in release 0.4.6. From this version, deptry tries to identify the file-encoding before reading it using chardet, see here in the code and the corresponding unit tests. Please let me know if this resolves your issue.


sbywater commented on September 20, 2024

I've updated to 0.4.6. The new error is UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1498: character maps to <undefined>

Here is the stack trace:

  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 36, in get_imported_modules_from_file
    modules = self._get_imported_modules_from_py(path_to_file)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 51, in _get_imported_modules_from_py
    root = ast.parse(f.read(), path_to_py_file) # type: ignore
  File "/usr/lib/python3.10/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1498: character maps to <undefined>


fpgmaas commented on September 20, 2024

Sorry that the implemented solution did not solve your problem.

It seems that chardet identifies an incorrect encoding for the file. I guess the only possible solution left is to catch this error and log a warning to the user that the specific file will be omitted while scanning for imports, since AFAIK there is no other way to identify the encoding.


fpgmaas commented on September 20, 2024

@sbywater Would it be possible for you to create a reproducible example? I currently fail to reproduce the error. I am currently thinking of implementing the following:

1. Simply parse the file.
2. If that raises UnicodeDecodeError, guess the encoding, then parse the file again.
3. If that still raises UnicodeDecodeError, skip the file.

Which would look as follows.

    def _get_imported_modules_from_py(self, path_to_py_file: Path) -> List[str]:
        try:
            with open(path_to_py_file) as f:
                root = ast.parse(f.read(), path_to_py_file)  # type: ignore
            import_nodes = self._get_import_nodes_from(root)
            return self._get_import_modules_from(import_nodes)
        except UnicodeDecodeError:
            return self._get_imported_modules_from_py_and_guess_encoding(path_to_py_file)

    def _get_imported_modules_from_py_and_guess_encoding(self, path_to_py_file: Path) -> List[str]:
        try:
            with open(path_to_py_file, encoding=self._get_file_encoding(path_to_py_file)) as f:
                root = ast.parse(f.read(), path_to_py_file)  # type: ignore
            import_nodes = self._get_import_nodes_from(root)
            return self._get_import_modules_from(import_nodes)
        except UnicodeDecodeError:
            logging.warning(f"Warning: File {path_to_py_file} could not be decoded. Skipping...")
            return []

But I fail to write a unit test without being able to reproduce the error first.
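The three-step fallback described above can also be sketched as standalone functions that are easy to exercise in a test. This is an illustration under stated assumptions, not deptry's implementation: the names modules_in and imports_with_fallback are hypothetical, guess_encoding is a caller-supplied stand-in for a chardet-based detector, and step 1 requests UTF-8 explicitly instead of relying on the default of open().

```python
import ast
import logging
from pathlib import Path
from typing import Callable, List, Optional


def modules_in(source: str, filename: str) -> List[str]:
    """Collect top-level module names from absolute import statements."""
    names: List[str] = []
    for node in ast.walk(ast.parse(source, filename)):
        if isinstance(node, ast.Import):
            names.extend(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.append(node.module.split(".")[0])
    return names


def imports_with_fallback(
    path: Path, guess_encoding: Callable[[Path], Optional[str]]
) -> List[str]:
    # Step 1: try UTF-8 explicitly (assumption: differs from the snippet
    # above, which relies on the platform default of open()).
    try:
        return modules_in(path.read_text(encoding="utf-8"), str(path))
    except UnicodeDecodeError:
        pass
    # Step 2: retry with whatever the caller-supplied detector suggests.
    guessed = guess_encoding(path)
    if guessed is not None:
        try:
            return modules_in(path.read_text(encoding=guessed), str(path))
        except (UnicodeDecodeError, LookupError):
            pass
    # Step 3: give up on this file and tell the user.
    logging.warning("File %s could not be decoded. Skipping...", path)
    return []
```

Passing the detector in as an argument means a test can force the last branch with a deliberately wrong guess such as "ascii", without needing a file that actually fools a real detector.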


sbywater commented on September 20, 2024

I can clarify now: the original problem file no longer throws an error. However, under 0.4.6 a file that worked before now throws the UnicodeDecodeError. The problem file does not declare a file encoding, and includes this code:

my_string = '🐺'

Let me know if you'd like me to create a new issue for this. Your proposed patch looks like a good solution.

Here is a verbose stack trace...

EUC-JP Japanese prober hit error at byte 374
EUC-KR Korean prober hit error at byte 374
CP949 Korean prober hit error at byte 374
Big5 Chinese prober hit error at byte 375
EUC-TW Taiwan prober hit error at byte 374
utf-8 not active
SHIFT_JIS Japanese confidence = 0.01
EUC-JP not active
GB2312 Chinese confidence = 0.01
EUC-KR not active
CP949 not active
Big5 not active
EUC-TW not active
Johab Korean confidence = 0.01
windows-1251 Russian confidence = 0.01
KOI8-R Russian confidence = 0.01
ISO-8859-5 Russian confidence = 0.01
MacCyrillic Russian confidence = 0.0
IBM866 Russian confidence = 0.0
IBM855 Russian confidence = 0.01
ISO-8859-7 Greek confidence = 0.01
windows-1253 Greek confidence = 0.01
ISO-8859-5 Bulgarian confidence = 0.01
windows-1251 Bulgarian confidence = 0.01
TIS-620 Thai confidence = 0.01
ISO-8859-9 Turkish confidence = 0.6157780896218796
windows-1255 Hebrew confidence = 0.0
windows-1255 Hebrew confidence = 0.01
windows-1255 Hebrew confidence = 0.01
windows-1251 Russian confidence = 0.01
KOI8-R Russian confidence = 0.01
ISO-8859-5 Russian confidence = 0.01
MacCyrillic Russian confidence = 0.0
IBM866 Russian confidence = 0.0
IBM855 Russian confidence = 0.01
ISO-8859-7 Greek confidence = 0.01
windows-1253 Greek confidence = 0.01
ISO-8859-5 Bulgarian confidence = 0.01
windows-1251 Bulgarian confidence = 0.01
TIS-620 Thai confidence = 0.01
ISO-8859-9 Turkish confidence = 0.6157780896218796
windows-1255 Hebrew confidence = 0.0
windows-1255 Hebrew confidence = 0.01
windows-1255 Hebrew confidence = 0.01
Traceback (most recent call last):
  File "/home/vagrant/.virtualenvs/foo/bin/deptry", line 8, in <module>
    sys.exit(deptry())
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/cli.py", line 198, in deptry
    ).run()
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/core.py", line 61, in run
    imported_modules = ImportParser().get_imported_modules_for_list_of_files(all_python_files)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 24, in get_imported_modules_for_list_of_files
    modules_per_file = [self.get_imported_modules_from_file(file) for file in list_of_files]
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 24, in <listcomp>
    modules_per_file = [self.get_imported_modules_from_file(file) for file in list_of_files]
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 36, in get_imported_modules_from_file
    modules = self._get_imported_modules_from_py(path_to_file)
  File "/home/vagrant/.virtualenvs/foo/lib/python3.10/site-packages/deptry/import_parser.py", line 51, in _get_imported_modules_from_py
    root = ast.parse(f.read(), path_to_py_file) # type: ignore
  File "/usr/lib/python3.10/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 375: character maps to <undefined>
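The failing byte can be traced by hand. The short sketch below uses only Python's built-in codecs to show why the UTF-8 bytes of the wolf emoji cannot be read with cp1254, the codec named in the traceback. It reproduces the decode step in isolation and does not involve deptry or chardet.

```python
wolf_bytes = "🐺".encode("utf-8")
print(wolf_bytes.hex(" "))  # f0 9f 90 ba

failing_byte = None
try:
    wolf_bytes.decode("cp1254")
except UnicodeDecodeError as exc:
    # cp1254 assigns no character to byte 0x90, so decoding stops there.
    failing_byte = exc.object[exc.start]
    print(hex(failing_byte), exc.reason)  # 0x90 character maps to <undefined>

# The same bytes round-trip cleanly when read as UTF-8.
print(wolf_bytes.decode("utf-8") == "🐺")  # True
```

So a detector that picks a Turkish single-byte encoding for a UTF-8 file will work until the file contains a character whose UTF-8 form includes a byte that the chosen code page leaves undefined.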


fpgmaas commented on September 20, 2024

Weirdly enough, a file with the line my_string = '🐺' is also parsed correctly on my machine. So it is still not possible for me to reproduce the error.

I have decided to release 0.4.7 with the snippet from my comment above anyway. With what we now know, always using chardet seems like a bad idea, because most files will be UTF-8. The only problem is that I cannot test whether it resolves your issue: I have not been able to write a unit test that raises a UnicodeDecodeError after trying both the default UTF-8 and chardet.

Could you try with 0.4.7?


fpgmaas commented on September 20, 2024

Added a PR with a unit test for the warning that is logged when a file has encoding issues: #106


fpgmaas commented on September 20, 2024

I believe this is fixed with the aforementioned PR


wyattscarpenter commented on September 20, 2024

I'm getting this same Unicode emoji issue on deptry 0.12.0, Windows 10, Python 3.11.0. A file containing just my_string = '🐺' does not parse correctly; instead I get this: Warning: File the_wolf.py could not be decoded. Skipping.... Adding a # coding = utf-8 line at the top does not help. (my_string = 'é' is fine, no problem.) The file is encoded as UTF-8. Changing the encoding to UTF-8 with BOM, which adds the UTF-8-encoded BOM, allows deptry to read the file just fine. Also, I happen to have WSL installed, and deptry 0.12.0 reads the file just fine when I run it through WSL, so I assume Python's default encoding assumption on Windows is what causes this problem to emerge.
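These observations are consistent with open() falling back to the platform's locale encoding when no encoding argument is given. One way to sidestep that, offered as a sketch and not as deptry's actual fix, is to hand ast.parse the raw bytes, so that Python's compiler applies its own source-encoding rules (BOM or coding comment, otherwise UTF-8). The helper name parse_source_file is hypothetical, and cp1252 is used below only as an example of a common Windows locale encoding.

```python
import ast
from pathlib import Path


def parse_source_file(path: Path) -> ast.Module:
    """Parse a Python file from bytes so the locale encoding is never consulted.

    (Hypothetical helper for illustration; not deptry's code.)
    """
    return ast.parse(path.read_bytes(), filename=str(path))


# Why 'é' survives a cp1252 read but '🐺' does not: only the emoji's UTF-8
# bytes include 0x90, which Python's cp1252 codec leaves undefined.
mojibake = "é".encode("utf-8").decode("cp1252")
print(len(mojibake))  # 2 (decodes without error, but to two wrong characters)
try:
    "🐺".encode("utf-8").decode("cp1252")
except UnicodeDecodeError as exc:
    print(exc.reason)  # character maps to <undefined>
```

Reading bytes keeps the behaviour identical across Windows, macOS, Linux, and WSL, because the result no longer depends on the locale of the machine running the tool.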
