Bug: `UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 110372: surrogates not allowed` when trying to render unprintable filesystem path in view

Finkregh commented on May 23, 2024
Bug: `UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 110372: surrogates not allowed` when trying to render unprintable filesystem path in view

from archivebox.

Comments (15)

pirate commented on May 23, 2024

Looks like you archived a URL that contains unprintable UTF-8 bytes (possibly from a broken emoji/accented character/Cyrillic/Chinese/Arabic/etc.) and it ended up in a filesystem path, so it's failing when trying to render the path in the public view.

In the short term you can find/strip all special UTF-8 characters in filenames using this script I wrote: strip_bad_filename_characters.sh or a program like detox (apt install detox; man detox).

In the long term, ArchiveBox should fix this by force-normalizing all filenames to UTF-8 in Unicode normalization form D (NFD) on creation, so this doesn't happen in the future.
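
As a rough illustration of what normalizing a filename to form D means, here is a minimal sketch using Python's standard unicodedata module. The helper name normalize_filename is hypothetical and is not ArchiveBox code.

```python
import unicodedata


def normalize_filename(name: str, form: str = "NFD") -> str:
    """Return the name in the requested Unicode normalization form (hypothetical helper)."""
    return unicodedata.normalize(form, name)


composed = "Br\u00fchpulver.jpg"            # 'ü' stored as one precomposed code point
decomposed = normalize_filename(composed)    # 'u' followed by U+0308 (combining diaeresis)

# Both spellings render the same but differ in length and in their UTF-8 bytes.
print(len(composed), len(decomposed))
print(composed.encode("utf-8"), decomposed.encode("utf-8"))
```

Normalization only reconciles composed and decomposed spellings of the same character. A name whose bytes were never valid UTF-8 in the first place would still have to be re-encoded or renamed separately.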


Finkregh commented on May 23, 2024
[ol@archivebox data]$ echo $LANG $LC_ALL $LC_CTYPE
en_US.UTF-8
[ol@archivebox data]$ DEBUG=True DEBUG_TOOLBAR=True archivebox manage shell
[i] [2024-03-26 20:59:05] ArchiveBox v0.7.3: archivebox manage shell
    > /home/ol/data

Python 3.11.8 (main, Feb 12 2024, 14:50:05) [GCC 13.2.1 20230801]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.22.2 -- An enhanced Interactive Python. Type '?' for help.
# ArchiveBox Imports
from archivebox.core.models import Snapshot, ArchiveResult, Tag, User
from archivebox.cli import *
    help
    version
    init
    config
    setup
    add
    remove
    update
    list
    status
    shell
    manage
    server
    oneshot
    schedule

[i] Welcome to the ArchiveBox Shell!
    https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Shell-Usage

    Hint: Example use:
        print(Snapshot.objects.filter(is_archived=True).count())
        Snapshot.objects.get(url="https://example.com").as_json()
        add("https://example.com/some/new/url")

In [1]: import os

In [2]: os.environ
Out[2]:
environ{'DEBUG_TOOLBAR': 'True',
        'DEBUG': 'True',
        'SHELL': '/bin/bash',
        'PWD': '/home/ol/data',
        'LOGNAME': 'ol',
        'XDG_SESSION_TYPE': 'tty',
        'MOTD_SHOWN': 'pam',
        'HOME': '/home/ol',
        'LANG': 'en_US.UTF-8',
        'SSH_CONNECTION': 'xxx 22',
        'XDG_SESSION_CLASS': 'user',
        'TERM': 'xterm-256color',
        'USER': 'ol',
        'SHLVL': '1',
        'XDG_SESSION_ID': '7',
        'XDG_RUNTIME_DIR': '/run/user/1000',
        'SSH_CLIENT': 'x 56510 22',
        'DEBUGINFOD_URLS': 'https://debuginfod.archlinux.org ',
        'PATH': '/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/home/ol/.local/bin:/home/ol/.local/bin',
        'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/1000/bus',
        'MAIL': '/var/spool/mail/ol',
        'SSH_TTY': '/dev/pts/3',
        'OLDPWD': '/home/ol',
        '_': '/usr/local/bin/archivebox',
        'TZ': 'UTC',
        'PYTHONSTARTUP': '/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/archivebox/core/welcome_message.py',
        'OUTPUT_DIR': '/home/ol/data',
        'DJANGO_SETTINGS_MODULE': 'core.settings'}


Finkregh commented on May 23, 2024

I moved the whole debug block in settings.py to the bottom of the file and added ERROR_LOG="/tmp/err.log". Now the server starts and throws the same 500 error as before, without any debug toolbar :D


Finkregh commented on May 23, 2024

Additional thoughts on this:

The char in question is https://www.unicodepedia.com/unicode/low-surrogates/dcf6/trail-surrogate-dcf6/

If I look at the files/folders and grep through the list for non-ASCII:

[ol@archivebox ~]$ find data > files.txt
[ol@archivebox ~]$ grep --color='auto' -P -n '[^\x00-\x7F]' files.txt 
16979:data/archive/1505464569.0/media/GopherCon 2017: Fatih Arslan - Writing a Go Tool to Parse and Modify Struct Tags [T4AIQ4RHp-c].webp
16980:data/archive/1505464569.0/media/GopherCon 2017: Fatih Arslan - Writing a Go Tool to Parse and Modify Struct Tags [T4AIQ4RHp-c].webm
16981:data/archive/1505464569.0/media/GopherCon 2017: Fatih Arslan - Writing a Go Tool to Parse and Modify Struct Tags [T4AIQ4RHp-c].description
16982:data/archive/1505464569.0/media/GopherCon 2017: Fatih Arslan - Writing a Go Tool to Parse and Modify Struct Tags [T4AIQ4RHp-c].info.json
39923:data/archive/1400350483.0/twibbon.com/Support/fem-weltverschwörung-ev.html
40841:data/archive/1518425808.0/media/This is how the world’s most covetable cameras get made [hasselblad-camera-factory-tour].description
40842:data/archive/1518425808.0/media/This is how the world’s most covetable cameras get made [hasselblad-camera-factory-tour].info.json
46349:data/archive/1500494199.0/media/"Don't run this on any system you expect to be up" they said, but we did it anyway - Hypernode [banner-{banner_id}-{type}].description
46350:data/archive/1500494199.0/media/"Don't run this on any system you expect to be up" they said, but we did it anyway - Hypernode [banner-{banner_id}-{type}].jpg
46351:data/archive/1500494199.0/media/"Don't run this on any system you expect to be up" they said, but we did it anyway - Hypernode [banner-{banner_id}-{type}].info.json
53211:data/archive/1556604197.0/www.slidescarnival.com/wp-content/uploads/2022/07/Blue-and-Pink-Geometric-Biography-About-Me-Creative-Presentation-·-SlidesCarnival-400x225.png
62502:data/archive/1527889634.0/3.bp.blogspot.com/_KihkJmE-KGc/TLA3WRTDG5I/AAAAAAAAAzU/JxyxTomjkd8/s320/Brühpulver.jpg
62521:data/archive/1527889634.0/2.bp.blogspot.com/-alQ8oPZliZU/WrYhwd21x3I/AAAAAAAARYw/ui6C8QPnzX81o2ZKrrCTJsL7NoXt0_c2ACLcBGAs/w72-h72-p-k-no-nu/gulasch+mälzer.jpg
62591:data/archive/1527889634.0/blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7JgYIInAKOeRrRm0QjTUZyNrV4GodcX9tNfkMk5mvvtNCRcto0jESJjqVJP2dcZ5zo2C9ydTQeDDKhWX5AW7v35Iw19Yh-547FLU45ZasSsLubAWUf6jTa6lm_lMPMCAPdgUVH0bPCtGGt8buVSRMyFhA7LQ8q6_muOM_v5MmSihCgRq9YFMPz65cOg/w72-h72-p-k-no-nu/möhendurcheinander.jpg
86166:data/archive/1534075988.0/media/You Have Control: Learning From Aviation -  Andrew Godwin - PyCon Israel 2018 [d0eo3FxKQNc].webp
86167:data/archive/1534075988.0/media/You Have Control: Learning From Aviation -  Andrew Godwin - PyCon Israel 2018 [d0eo3FxKQNc].webm
86168:data/archive/1534075988.0/media/You Have Control: Learning From Aviation -  Andrew Godwin - PyCon Israel 2018 [d0eo3FxKQNc].description
86169:data/archive/1534075988.0/media/You Have Control: Learning From Aviation -  Andrew Godwin - PyCon Israel 2018 [d0eo3FxKQNc].info.json
91908:data/archive/1508142870.0/assets-global.website-files.com/61027bb0bc31fc6cafefbc0c/627d31f972023bb238b8124d_Ресурс 4.png
91926:data/archive/1508142870.0/assets-global.website-files.com/61027bb0bc31fc6cafefbc0c/618d3dfd1c3709ebafd0eb93_сhecker.svg
100214:data/archive/1501095170.0/media/世界地図図法 [オーサグラフ世界地図] (16G141127) [IVuxMGxTyEg].webp
100215:data/archive/1501095170.0/media/世界地図図法 [オーサグラフ世界地図] (16G141127) [IVuxMGxTyEg].info.json
100216:data/archive/1501095170.0/media/世界地図図法 [オーサグラフ世界地図] (16G141127) [IVuxMGxTyEg].webm
100217:data/archive/1501095170.0/media/世界地図図法 [オーサグラフ世界地図] (16G141127) [IVuxMGxTyEg].description
105632:data/archive/1506327195.0/media/documenting architecture: wireshark, plantuml and a repl [J2RGAPGFfP8].webm
105633:data/archive/1506327195.0/media/documenting architecture: wireshark, plantuml and a repl [J2RGAPGFfP8].description
105634:data/archive/1506327195.0/media/documenting architecture: wireshark, plantuml and a repl [J2RGAPGFfP8].webp
105635:data/archive/1506327195.0/media/documenting architecture: wireshark, plantuml and a repl [J2RGAPGFfP8].info.json

I can't see that char.

If I move the folders in question away, I still get the same issue:

mkdir broken-data
grep --color='auto' -P -n '[^\x00-\x7F]' files.txt | cut -d ":" -f2 | cut -d "/" -f 1-3 | sort -u | xargs mv -t broken-data

I'd rather not run detox, as it would rename all sorts of files and then the archive would be broken. The same goes for the script you linked.

Is there any way to narrow this down to where the actual file is? Perhaps even more debug output than DEBUG=True?
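
One possible way to narrow it down without renaming anything is to walk the data directory from Python and report only the names that cannot be encoded as strict UTF-8. This is a minimal sketch, assuming the surrogate originates from a file or directory name; has_surrogate_escape and find_undecodable_paths are hypothetical helper names.

```python
import os


def has_surrogate_escape(name: str) -> bool:
    """Return True if the name cannot be encoded as strict UTF-8 (e.g. it holds a lone surrogate)."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def find_undecodable_paths(root: str):
    """Yield (path, raw_bytes) for each entry under root whose own name is not valid UTF-8."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if has_surrogate_escape(name):
                full = os.path.join(dirpath, name)
                # os.fsencode() recovers the exact bytes stored on disk.
                yield full, os.fsencode(full)


# Example use (read-only, nothing is renamed):
#     for path, raw in find_undecodable_paths("data"):
#         print(ascii(path), raw)
```

Printing with ascii() avoids a second UnicodeEncodeError when the terminal itself cannot display the name.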


Finkregh commented on May 23, 2024

Interestingly, I ran sqlite3 database.db '.dump' > foo.sql (alongside some strace), which led to the issue going away. I wonder what that did, and whether something had gone wrong inside the sqlite file before.

I'd still be interested in learning how to debug this :)

edit: I also moved back all the archive data that I suspected of causing issues, and it still works.


Finkregh commented on May 23, 2024

Aaand it's back... o_O?

I read your pretty nice upgrading documentation that explains what init does, so I ran it and everything worked. I'm still guessing in the direction of some sqlite issue... I also tried to get django-debug-toolbar==3.2.4 to run, but ran into an exception:

TypeError: CacheHandler.all() got an unexpected keyword argument 'initialized_only'

edit: restarted the server and the issue is back. Now running init again w/o restart, let's see
edit2: still broken :|


pirate commented on May 23, 2024

Good approach trying to narrow down the failing request with django-debug-toolbar. I'm not sure why it failed; I'll take a look. You can also try disabling most of the panels it uses, as they're often individually buggy and not all of them are needed to track down a broken request: archivebox/core/settings.py:165 DEBUG_TOOLBAR_PANELS (you can comment out almost everything in there; I'd start by disabling 'debug_toolbar.panels.cache.CachePanel'). There are also middlewares that can be added to log requests specifically: https://github.com/Rhumbix/django-request-logging

We can also keep trying the more direct approach to find where the offending bytes are recorded on the filesystem or in sqlite, before spelunking through the ArchiveBox code, maybe something like:

# find non-ascii within db fields
sqlite3 index.sqlite3
> SELECT * FROM core_snapshot WHERE <column> GLOB ('*[^'||char(1,45,127)||']*');
> SELECT * FROM core_archiveresult WHERE <column> GLOB ('*[^'||char(1,45,127)||']*');

# or keep trying other ways to find \udcf6 within file contents / paths
grep -obarUP "\xdc\xf6" .
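
One thing worth keeping in mind for these searches: in Python, a lone code point in the range U+DC80..U+DCFF usually comes from the surrogateescape error handler (PEP 383), which stands in for a single undecodable byte. If that is what happened here, '\udcf6' would represent the raw byte 0xF6 rather than the two bytes 0xDC 0xF6. A minimal sketch with a made-up file name:

```python
# Hypothetical file name in which 'ö' was stored as the single Latin-1 byte 0xF6.
raw_name = b"weltverschw\xf6rung.html"

# On a UTF-8 system, os.fsdecode() decodes file names with this error handler.
text_name = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(text_name))

# Strict UTF-8 encoding, as used when building an HTTP response body, rejects the surrogate.
try:
    text_name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError as err:
    strict_ok = False
    print(err.reason)

# The original bytes are still recoverable with the same error handler.
round_trip = text_name.encode("utf-8", errors="surrogateescape")
```

Under that assumption, the thing to look for on disk would be names containing a byte 0xF6 that is not part of a valid UTF-8 sequence.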


Finkregh commented on May 23, 2024

Tried with request-logging:

GET /
{'HTTP_HOST': 'archivebox.local:8080', 'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0', 'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8', 'HTTP_ACCEPT_LANGUAGE': 'en-US,en;q=0.5', 'HTTP_ACCEPT_ENCODING': 'gzip, deflate', 'HTTP_CONNECTION': 'keep-alive', 'HTTP_COOKIE': 'csrftoken=x; sessionid=y; GMT_OFFSET=60', 'HTTP_UPGRADE_INSECURE_REQUESTS': '1'}
b''
GET / - 302
"GET / HTTP/1.1" 302 0
Internal Server Error: /admin/core/snapshot/
Traceback (most recent call last):
  File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/core/handlers/base.py", line 204, in _get_response
    response = response.render()
               ^^^^^^^^^^^^^^^^^
  File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/template/response.py", line 105, in render
    self.content = self.rendered_content
    ^^^^^^^^^^^^
  File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/template/response.py", line 134, in content
    HttpResponse.content.fset(self, value)
  File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/http/response.py", line 328, in content
    content = self.make_bytes(value)
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/http/response.py", line 241, in make_bytes
    return bytes(value.encode(self.charset))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 92793: surrogates not allowed
GET /admin/core/snapshot/
{'HTTP_HOST': 'archivebox.local:8080', 'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0', 'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8', 'HTTP_ACCEPT_LANGUAGE': 'en-US,en;q=0.5', 'HTTP_ACCEPT_ENCODING': 'gzip, deflate', 'HTTP_CONNECTION': 'keep-alive', 'HTTP_COOKIE': 'csrftoken=x; sessionid=y; GMT_OFFSET=60', 'HTTP_UPGRADE_INSECURE_REQUESTS': '1'}
b''
GET /admin/core/snapshot/ - 500
"GET /admin/core/snapshot/ HTTP/1.1" 500 145

The sqlite GLOB for non-ASCII returns all sorts of stuff, but not that char.

I tried with this and it returned nothing:

#!/bin/bash

# SQLite database file
DATABASE="index.sqlite3"

# Dump all table names
TABLES=$(sqlite3 "$DATABASE" ".tables")

# Loop through each table
for table in $TABLES; do
    #echo "Table: $table"

    # Dump all column names for the current table
    COLUMNS=$(sqlite3 "$DATABASE" "PRAGMA table_info($table);" | cut -d '|' -f 2)

    # Loop through each column
    for column in $COLUMNS; do
        #echo "Column: $column"

        # Run the query for the current table/column combination
        #echo "Results for $table.$column:"
        sqlite3 "$DATABASE" "SELECT * FROM $table WHERE $column LIKE '%' || X'DCF6' || '%';"
    done
done

Edit: the grep did find some files; I moved them away and nothing changed :(
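
An alternative way to scan the database is to read every TEXT value as raw bytes from Python and report the ones that are not valid UTF-8, instead of searching for one specific byte pair. This is only a sketch: find_invalid_utf8 is a hypothetical helper, and it assumes ordinary tables that have a rowid.

```python
import sqlite3


def find_invalid_utf8(db_path: str):
    """Yield (table, column, rowid, raw_bytes) for TEXT values that are not valid UTF-8."""
    conn = sqlite3.connect(db_path)
    conn.text_factory = bytes  # hand back TEXT as raw bytes instead of decoding it
    try:
        tables = [row[0].decode() for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        for table in tables:
            # Column 1 of PRAGMA table_info is the column name.
            columns = [row[1].decode() for row in conn.execute(
                f'PRAGMA table_info("{table}")')]
            for column in columns:
                query = (f"""SELECT rowid, "{column}" FROM "{table}" """
                         f"""WHERE typeof("{column}") = 'text'""")
                for rowid, value in conn.execute(query):
                    try:
                        value.decode("utf-8")
                    except UnicodeDecodeError:
                        yield table, column, rowid, value
    finally:
        conn.close()


# Example use:
#     for hit in find_invalid_utf8("index.sqlite3"):
#         print(hit)
```

If this reports nothing, the offending value is more likely to come from a filesystem path than from the database.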


pirate commented on May 23, 2024

Damn... OK. I guess I might have to fix it the harder way: changing the renderer to handle this.

Before we go debugging too much further can you help double check these super quick:

echo $LANG $LC_ALL $LC_CTYPE
# should be: LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_CTYPE=en_US.UTF-8
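
For reference, handling this on the rendering side could look roughly like the sketch below, which swaps surrogate-escaped bytes for U+FFFD before the string reaches the response encoder. This is not the actual ArchiveBox fix; to_renderable is a hypothetical helper and the path is made up.

```python
def to_renderable(text: str) -> str:
    """Replace surrogate-escaped bytes with U+FFFD so the result encodes as strict UTF-8.

    Only handles surrogates in U+DC80..U+DCFF (the ones surrogateescape produces);
    any other lone surrogate would still raise UnicodeEncodeError here.
    """
    raw = text.encode("utf-8", errors="surrogateescape")
    return raw.decode("utf-8", errors="replace")


path = "archive/123/weltverschw\udcf6rung.html"  # made-up path for illustration
safe = to_renderable(path)
encoded = safe.encode("utf-8")  # strict encoding no longer raises
```

The trade-off is that the displayed name no longer matches the name on disk byte for byte, so links built from it would need the original bytes kept elsewhere.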


Finkregh commented on May 23, 2024

FYI the debug toolbar:

[ol@archivebox data]$ DJANGO_SETTINGS_MODULE=archivebox.core.settings DEBUG=True DEBUG_TOOLBAR=True archivebox server --nothreading '[::]:8080'
[i] [2024-03-26 21:00:54] ArchiveBox v0.7.3: archivebox server --nothreading [::]:8080
    > /home/ol/data

Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/archivebox/cli/__init__.py", line 74, in run_subcommand
    setup_django(in_memory_db=subcommand in fake_db, check_db=cmd_requires_db and not init_pending)
  File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/archivebox/config.py", line 1420, in setup_django
    with open(settings.ERROR_LOG, "a", encoding='utf-8') as f:
              ^^^^^^^^^^^^^^^^^^
  File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/conf/__init__.py", line 83, in __getattr__
    val = getattr(self._wrapped, name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Settings' object has no attribute 'ERROR_LOG'

This is with only debug_toolbar.panels.request.RequestPanel in DEBUG_TOOLBAR_PANELS.


Finkregh commented on May 23, 2024

$ LC_ALL=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 archivebox server --nothreading '[::]:8080'

This leads to the same issue as before.

