Comments (15)
Looks like you archived a URL that contains unprintable UTF-8 bytes (possibly from a broken emoji/accented character/Cyrillic/Chinese/Arabic/etc.) and it ended up in a filesystem path, so it's failing when trying to render the path in the public view.
In the short term you can find/strip all special UTF-8 characters in filenames using this script I wrote: strip_bad_filename_characters.sh, or a program like detox (apt install detox; man detox).
In the long term ArchiveBox should fix this by force-normalizing all filenames to Unicode normalization form D (NFD) on creation so this doesn't happen in the future.
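A minimal sketch of what normalizing on creation could look like in Python, using the standard unicodedata module. normalize_filename is a hypothetical helper, not an existing ArchiveBox function, and it only covers names that already decode cleanly; it does nothing for names that contain invalid bytes.

```python
import unicodedata


def normalize_filename(name: str, form: str = "NFD") -> str:
    """Return `name` in the given Unicode normalization form (NFD by default).

    Hypothetical helper for illustration only.
    """
    return unicodedata.normalize(form, name)
```

For example, normalize_filename("Brühpulver.jpg") returns the same visible name with "ü" stored as "u" followed by the combining diaeresis U+0308.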
from archivebox.
[ol@archivebox data]$ echo $LANG $LC_ALL $LC_CTYPE
en_US.UTF-8
[ol@archivebox data]$ DEBUG=True DEBUG_TOOLBAR=True archivebox manage shell
[i] [2024-03-26 20:59:05] ArchiveBox v0.7.3: archivebox manage shell
> /home/ol/data
Python 3.11.8 (main, Feb 12 2024, 14:50:05) [GCC 13.2.1 20230801]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.22.2 -- An enhanced Interactive Python. Type '?' for help.
# ArchiveBox Imports
from archivebox.core.models import Snapshot, ArchiveResult, Tag, User
from archivebox.cli import *
help
version
init
config
setup
add
remove
update
list
status
shell
manage
server
oneshot
schedule
[i] Welcome to the ArchiveBox Shell!
https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Shell-Usage
Hint: Example use:
print(Snapshot.objects.filter(is_archived=True).count())
Snapshot.objects.get(url="https://example.com").as_json()
add("https://example.com/some/new/url")
In [1]: import os
In [2]: os.environ
Out[2]:
environ{'DEBUG_TOOLBAR': 'True',
'DEBUG': 'True',
'SHELL': '/bin/bash',
'PWD': '/home/ol/data',
'LOGNAME': 'ol',
'XDG_SESSION_TYPE': 'tty',
'MOTD_SHOWN': 'pam',
'HOME': '/home/ol',
'LANG': 'en_US.UTF-8',
'SSH_CONNECTION': 'xxx 22',
'XDG_SESSION_CLASS': 'user',
'TERM': 'xterm-256color',
'USER': 'ol',
'SHLVL': '1',
'XDG_SESSION_ID': '7',
'XDG_RUNTIME_DIR': '/run/user/1000',
'SSH_CLIENT': 'x 56510 22',
'DEBUGINFOD_URLS': 'https://debuginfod.archlinux.org ',
'PATH': '/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/home/ol/.local/bin:/home/ol/.local/bin',
'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/1000/bus',
'MAIL': '/var/spool/mail/ol',
'SSH_TTY': '/dev/pts/3',
'OLDPWD': '/home/ol',
'_': '/usr/local/bin/archivebox',
'TZ': 'UTC',
'PYTHONSTARTUP': '/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/archivebox/core/welcome_message.py',
'OUTPUT_DIR': '/home/ol/data',
'DJANGO_SETTINGS_MODULE': 'core.settings'}
I moved the whole debug block in settings.py to the bottom of the file and added ERROR_LOG="/tmp/err.log".
Now the server starts and throws the same 500 error as before, without any debug toolbar :D
Additional thoughts on this:
The char in question is https://www.unicodepedia.com/unicode/low-surrogates/dcf6/trail-surrogate-dcf6/
If I look at the files/folders and grep through the list for non-ASCII:
[ol@archivebox ~]$ find data > files.txt
[ol@archivebox ~]$ grep --color='auto' -P -n '[^\x00-\x7F]' files.txt
16979:data/archive/1505464569.0/media/GopherCon 2017: Fatih Arslan - Writing a Go Tool to Parse and Modify Struct Tags [T4AIQ4RHp-c].webp
16980:data/archive/1505464569.0/media/GopherCon 2017: Fatih Arslan - Writing a Go Tool to Parse and Modify Struct Tags [T4AIQ4RHp-c].webm
16981:data/archive/1505464569.0/media/GopherCon 2017: Fatih Arslan - Writing a Go Tool to Parse and Modify Struct Tags [T4AIQ4RHp-c].description
16982:data/archive/1505464569.0/media/GopherCon 2017: Fatih Arslan - Writing a Go Tool to Parse and Modify Struct Tags [T4AIQ4RHp-c].info.json
39923:data/archive/1400350483.0/twibbon.com/Support/fem-weltverschwörung-ev.html
40841:data/archive/1518425808.0/media/This is how the world’s most covetable cameras get made [hasselblad-camera-factory-tour].description
40842:data/archive/1518425808.0/media/This is how the world’s most covetable cameras get made [hasselblad-camera-factory-tour].info.json
46349:data/archive/1500494199.0/media/"Don't run this on any system you expect to be up" they said, but we did it anyway - Hypernode [banner-{banner_id}-{type}].description
46350:data/archive/1500494199.0/media/"Don't run this on any system you expect to be up" they said, but we did it anyway - Hypernode [banner-{banner_id}-{type}].jpg
46351:data/archive/1500494199.0/media/"Don't run this on any system you expect to be up" they said, but we did it anyway - Hypernode [banner-{banner_id}-{type}].info.json
53211:data/archive/1556604197.0/www.slidescarnival.com/wp-content/uploads/2022/07/Blue-and-Pink-Geometric-Biography-About-Me-Creative-Presentation-·-SlidesCarnival-400x225.png
62502:data/archive/1527889634.0/3.bp.blogspot.com/_KihkJmE-KGc/TLA3WRTDG5I/AAAAAAAAAzU/JxyxTomjkd8/s320/Brühpulver.jpg
62521:data/archive/1527889634.0/2.bp.blogspot.com/-alQ8oPZliZU/WrYhwd21x3I/AAAAAAAARYw/ui6C8QPnzX81o2ZKrrCTJsL7NoXt0_c2ACLcBGAs/w72-h72-p-k-no-nu/gulasch+mälzer.jpg
62591:data/archive/1527889634.0/blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7JgYIInAKOeRrRm0QjTUZyNrV4GodcX9tNfkMk5mvvtNCRcto0jESJjqVJP2dcZ5zo2C9ydTQeDDKhWX5AW7v35Iw19Yh-547FLU45ZasSsLubAWUf6jTa6lm_lMPMCAPdgUVH0bPCtGGt8buVSRMyFhA7LQ8q6_muOM_v5MmSihCgRq9YFMPz65cOg/w72-h72-p-k-no-nu/möhendurcheinander.jpg
86166:data/archive/1534075988.0/media/You Have Control: Learning From Aviation - Andrew Godwin - PyCon Israel 2018 [d0eo3FxKQNc].webp
86167:data/archive/1534075988.0/media/You Have Control: Learning From Aviation - Andrew Godwin - PyCon Israel 2018 [d0eo3FxKQNc].webm
86168:data/archive/1534075988.0/media/You Have Control: Learning From Aviation - Andrew Godwin - PyCon Israel 2018 [d0eo3FxKQNc].description
86169:data/archive/1534075988.0/media/You Have Control: Learning From Aviation - Andrew Godwin - PyCon Israel 2018 [d0eo3FxKQNc].info.json
91908:data/archive/1508142870.0/assets-global.website-files.com/61027bb0bc31fc6cafefbc0c/627d31f972023bb238b8124d_Ресурс 4.png
91926:data/archive/1508142870.0/assets-global.website-files.com/61027bb0bc31fc6cafefbc0c/618d3dfd1c3709ebafd0eb93_сhecker.svg
100214:data/archive/1501095170.0/media/世界地図図法 [オーサグラフ世界地図] (16G141127) [IVuxMGxTyEg].webp
100215:data/archive/1501095170.0/media/世界地図図法 [オーサグラフ世界地図] (16G141127) [IVuxMGxTyEg].info.json
100216:data/archive/1501095170.0/media/世界地図図法 [オーサグラフ世界地図] (16G141127) [IVuxMGxTyEg].webm
100217:data/archive/1501095170.0/media/世界地図図法 [オーサグラフ世界地図] (16G141127) [IVuxMGxTyEg].description
105632:data/archive/1506327195.0/media/documenting architecture: wireshark, plantuml and a repl [J2RGAPGFfP8].webm
105633:data/archive/1506327195.0/media/documenting architecture: wireshark, plantuml and a repl [J2RGAPGFfP8].description
105634:data/archive/1506327195.0/media/documenting architecture: wireshark, plantuml and a repl [J2RGAPGFfP8].webp
105635:data/archive/1506327195.0/media/documenting architecture: wireshark, plantuml and a repl [J2RGAPGFfP8].info.json
I can't see that char.
If I move the folders in question away I still get the same issue:
mkdir broken-data
grep --color='auto' -P -n '[^\x00-\x7F]' files.txt | cut -d ":" -f2 | cut -d "/" -f 1-3 | sort -u | xargs mv -t broken-data
I'd rather not run detox, as it would rename all sorts of files and then the archive would be broken. Same with the script you linked.
Is there any way to narrow this down to where the actual file is? Perhaps even more debug output than DEBUG=True?
Interestingly, I ran sqlite3 database.db '.dump' > foo.sql (besides some strace), which led to the issue going away. I wonder what that did, and whether something had gone wrong inside the sqlite file before.
I'd still be interested in learning how to debug this :)
edit: I also moved back all the archive data that I suspected of causing issues, and it still works.
Aaand it's back... o_O?
I read your pretty nice upgrading documentation that explains what init does. So I ran it and everything works. Still guessing in the direction of some sqlite issue... And I tried to get django-debug-toolbar==3.2.4 to run but ran into an exception:
TypeError: CacheHandler.all() got an unexpected keyword argument 'initialized_only'
edit: restarted the server and the issue is back. Now running init again without a restart, let's see
edit2: still broken :|
Good approach trying to narrow down the failing request with django-debug-toolbar, not sure why it failed, I'll take a look. You can also try disabling most of the panels that it uses, as they're often individually buggy and not all panels are needed to track down a broken request: see DEBUG_TOOLBAR_PANELS at archivebox/core/settings.py:165 (you can comment out almost everything in there, I'd start by disabling 'debug_toolbar.panels.cache.CachePanel'). There are also middlewares that can be added to log requests specifically: https://github.com/Rhumbix/django-request-logging
We can also keep trying the more direct approach to find where the offending bytes are recorded on the filesystem or in sqlite, before spelunking through the ArchiveBox code, maybe something like:
# find non-ascii within db fields
sqlite3 index.sqlite3
> SELECT * FROM core_snapshot WHERE <column> GLOB ('*[^'||char(1,45,127)||']*');
> SELECT * FROM core_archiveresult WHERE <column> GLOB ('*[^'||char(1,45,127)||']*');
# or keep trying other ways to find \udcf6 within file contents / paths
grep -obarUP "\xdc\xf6" .
- https://www.unicodepedia.com/unicode/low-surrogates/dcf6/trail-surrogate-dcf6/
- https://charbase.com/dcf6-unicode-invalid-character
- https://unix.stackexchange.com/questions/474709/how-to-grep-for-unicode-in-a-bash-script
- https://sqlite-users.sqlite.narkive.com/rMCLvZ99/sqlite-finding-records-containing-non-ascii-characters
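One thing that may help here: when Python decodes a filename with the surrogateescape error handler (its default for filesystem paths on POSIX), an undecodable byte 0xNN comes back as the lone surrogate U+DCNN. So '\udcf6' in an error message would correspond to a single raw byte 0xF6 on disk (which is "ö" in Latin-1), not to the two bytes DC F6. A minimal sketch, assuming the offending name lives on the filesystem under the data directory; the helper names are hypothetical and not part of ArchiveBox:

```python
import os


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw filename bytes decode as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_undecodable_paths(root):
    """Walk `root` using bytes paths and collect entries whose names are not valid UTF-8."""
    bad = []
    # Passing bytes to os.walk makes it yield bytes, so no decoding happens.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                bad.append(os.path.join(dirpath, name))
    return bad
```

Calling find_undecodable_paths("archive") from the data directory would return the raw byte paths of any such entries, which could then be inspected or renamed individually instead of running detox over everything.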
Tried with request-logging:
GET /
{'HTTP_HOST': 'archivebox.local:8080', 'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0', 'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8', 'HTTP_ACCEPT_LANGUAGE': 'en-US,en;q=0.5', 'HTTP_ACCEPT_ENCODING': 'gzip, deflate', 'HTTP_CONNECTION': 'keep-alive', 'HTTP_COOKIE': 'csrftoken=x; sessionid=y; GMT_OFFSET=60', 'HTTP_UPGRADE_INSECURE_REQUESTS': '1'}
b''
GET / - 302
"GET / HTTP/1.1" 302 0
Internal Server Error: /admin/core/snapshot/
Traceback (most recent call last):
File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/core/handlers/exception.py", line 47, in inner
response = get_response(request)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/core/handlers/base.py", line 204, in _get_response
response = response.render()
^^^^^^^^^^^^^^^^^
File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/template/response.py", line 105, in render
self.content = self.rendered_content
^^^^^^^^^^^^
File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/template/response.py", line 134, in content
HttpResponse.content.fset(self, value)
File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/http/response.py", line 328, in content
content = self.make_bytes(value)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/http/response.py", line 241, in make_bytes
return bytes(value.encode(self.charset))
^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 92793: surrogates not allowed
GET /admin/core/snapshot/
{'HTTP_HOST': 'archivebox.local:8080', 'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0', 'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8', 'HTTP_ACCEPT_LANGUAGE': 'en-US,en;q=0.5', 'HTTP_ACCEPT_ENCODING': 'gzip, deflate', 'HTTP_CONNECTION': 'keep-alive', 'HTTP_COOKIE': 'csrftoken=x; sessionid=y; GMT_OFFSET=60', 'HTTP_UPGRADE_INSECURE_REQUESTS': '1'}
b''
GET /admin/core/snapshot/ - 500
"GET /admin/core/snapshot/ HTTP/1.1" 500 145
The sqlite glob for non-ASCII returns all sorts of stuff, but not that char.
I tried with this and it returned nothing:
#!/bin/bash
# SQLite database file
DATABASE="index.sqlite3"
# Dump all table names
TABLES=$(sqlite3 "$DATABASE" ".tables")
# Loop through each table
for table in $TABLES; do
    #echo "Table: $table"
    # Dump all column names for the current table
    COLUMNS=$(sqlite3 "$DATABASE" "PRAGMA table_info($table);" | cut -d '|' -f 2)
    # Loop through each column
    for column in $COLUMNS; do
        #echo "Column: $column"
        # Run the query for the current table/column combination
        #echo "Results for $table.$column:"
        sqlite3 "$DATABASE" "SELECT * FROM $table WHERE $column LIKE '%' || X'DCF6' || '%';"
    done
done
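A possible alternative to the loop above, as a sketch only: instead of searching for one specific byte pair, check every text value for bytes that are not valid UTF-8. It uses Python's sqlite3 module with text_factory set to bytes so values come back undecoded. find_invalid_utf8 is a hypothetical name, and the sketch assumes table and column names contain no double quotes.

```python
import sqlite3


def find_invalid_utf8(conn):
    """Return (table, column, raw_bytes) for text values that are not valid UTF-8."""
    # Read the schema first, while names are still returned as str.
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    columns = {
        table: [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for table in tables
    }
    previous_factory = conn.text_factory
    # From here on, TEXT values are returned as raw bytes instead of being decoded.
    conn.text_factory = bytes
    hits = []
    try:
        for table, cols in columns.items():
            for col in cols:
                query = (f'SELECT "{col}" FROM "{table}" '
                         f"WHERE typeof(\"{col}\") = 'text'")
                for (value,) in conn.execute(query):
                    try:
                        value.decode("utf-8")
                    except UnicodeDecodeError:
                        hits.append((table, col, value))
    finally:
        conn.text_factory = previous_factory
    return hits
```

Against the real index this would be called as find_invalid_utf8(sqlite3.connect("index.sqlite3")); an empty list would suggest the bad bytes are not stored in the database itself.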
Edit: the grep did find some files, I moved them away and nothing changed :(
damn... ok. I guess I might have to fix it the harder way: changing the renderer to handle this.
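A rough sketch of what handling this at render time could look like, assuming the bad string reaches the response as a Python str containing a lone surrogate. make_encodable is a hypothetical name, not an ArchiveBox or Django API:

```python
def make_encodable(text: str) -> str:
    """Return a copy of `text` that can always be encoded as UTF-8.

    Characters that UTF-8 cannot encode (such as lone surrogates) are
    replaced with a visible backslash escape like \\udcf6.
    """
    return text.encode("utf-8", errors="backslashreplace").decode("utf-8")
```

Using backslashreplace rather than silently dropping the character keeps the offending code point visible in the rendered page, which should make the affected snapshot easier to spot.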
Before we go debugging too much further can you help double check these super quick:
echo $LANG $LC_ALL $LC_CTYPE
# should be: LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_CTYPE=en_US.UTF-8
Related issues:
FYI the debug toolbar:
[ol@archivebox data]$ DJANGO_SETTINGS_MODULE=archivebox.core.settings DEBUG=True DEBUG_TOOLBAR=True archivebox server --nothreading '[::]:8080'
[i] [2024-03-26 21:00:54] ArchiveBox v0.7.3: archivebox server --nothreading [::]:8080
> /home/ol/data
Traceback (most recent call last):
File "/usr/local/bin/archivebox", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/archivebox/cli/__init__.py", line 140, in main
run_subcommand(
File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/archivebox/cli/__init__.py", line 74, in run_subcommand
setup_django(in_memory_db=subcommand in fake_db, check_db=cmd_requires_db and not init_pending)
File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/archivebox/config.py", line 1420, in setup_django
with open(settings.ERROR_LOG, "a", encoding='utf-8') as f:
^^^^^^^^^^^^^^^^^^
File "/home/ol/.local/pipx/venvs/archivebox/lib/python3.11/site-packages/django/conf/__init__.py", line 83, in __getattr__
val = getattr(self._wrapped, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Settings' object has no attribute 'ERROR_LOG'
with only debug_toolbar.panels.request.RequestPanel in the DEBUG_TOOLBAR_PANELS
$ LC_ALL=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 archivebox server --nothreading '[::]:8080'
leads to the same issue as before