Comments (3)

takao8 commented on May 30, 2024

Hi Nicholas, we actually built a framework a little while ago to capture metadata for PDFs that failed to download, but this error tells us it might be something else. I know you said you can't attach an example, but is there any other way we could get some form of a corrupted PDF like the ones you're working with to test against? It's hard for us to recreate this error otherwise.

from gamechanger-data.

nawagner commented on May 30, 2024

Unfortunately, I tried recreating a similarly corrupted file from an arXiv PDF by scrambling its contents randomly and removing the header, but it doesn't seem to cause gamechanger (and by gamechanger I assume MuPDF) any major issues. Let me share my conda list in case there is anything outdated that you know about.

# packages in environment at /home/nwagner/miniconda3/envs/gc:
#
# Name Version Build Channel

_libgcc_mutex 0.1 main
_openmp_mutex 4.5 1_gnu
absl-py 0.11.0 pypi_0 pypi
alembic 1.4.1 pypi_0 pypi
aniso8601 8.0.0 pypi_0 pypi
annoy 1.16.3 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
apscheduler 3.6.3 pypi_0 pypi
arabic-reshaper 2.1.0 pypi_0 pypi
asgiref 3.2.10 pypi_0 pypi
astroid 2.6.2 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
attrs 19.3.0 pypi_0 pypi
automat 20.2.0 pypi_0 pypi
azure-core 1.8.2 pypi_0 pypi
azure-storage-blob 12.5.0 pypi_0 pypi
backcall 0.2.0 pypi_0 pypi
beautifulsoup4 4.9.1 pypi_0 pypi
bert-extractive-summarizer 0.5.1 pypi_0 pypi
blis 0.4.1 pypi_0 pypi
boto 2.49.0 pypi_0 pypi
boto3 1.13.2 pypi_0 pypi
botocore 1.16.2 pypi_0 pypi
ca-certificates 2021.5.25 h06a4308_1
cachetools 4.1.0 pypi_0 pypi
catalogue 2.0.4 pypi_0 pypi
certifi 2020.12.5 pypi_0 pypi
cffi 1.14.0 pypi_0 pypi
chardet 3.0.4 pypi_0 pypi
ci-info 0.2.0 pypi_0 pypi
click 7.1.2 pypi_0 pypi
cloudpickle 1.6.0 pypi_0 pypi
colorama 0.4.3 pypi_0 pypi
coloredlogs 14.0 pypi_0 pypi
configobj 5.0.6 pypi_0 pypi
configparser 5.0.0 pypi_0 pypi
constantly 15.1.0 pypi_0 pypi
contextvars 2.4 pypi_0 pypi
corenlp 0.0.14 pypi_0 pypi
corenlp-protobuf 3.8.0 pypi_0 pypi
coverage 5.3 pypi_0 pypi
cryptography 2.9.2 pypi_0 pypi
cssselect 1.1.0 pypi_0 pypi
cycler 0.10.0 pypi_0 pypi
cymem 2.0.3 pypi_0 pypi
cython 0.29.23 pypi_0 pypi
cytoolz 0.10.1 pypi_0 pypi
databricks-cli 0.13.0 pypi_0 pypi
dataclasses 0.7 pypi_0 pypi
decorator 4.4.2 pypi_0 pypi
devtools 0.6.1 pypi_0 pypi
dill 0.3.3 pypi_0 pypi
distlib 0.3.1 pypi_0 pypi
django 3.0.7 pypi_0 pypi
docker 4.3.1 pypi_0 pypi
docutils 0.15.2 pypi_0 pypi
dotmap 1.3.0 pypi_0 pypi
elastic-apm 5.9.0 pypi_0 pypi
elasticsearch 7.9.1 pypi_0 pypi
eli5 0.10.1 pypi_0 pypi
en-core-web-lg 3.0.0 pypi_0 pypi
en-core-web-md 3.0.0 pypi_0 pypi
en-core-web-sm 3.0.0 pypi_0 pypi
english 2020.7.0 pypi_0 pypi
entrypoints 0.3 pypi_0 pypi
etelemetry 0.2.1 pypi_0 pypi
faiss-cpu 1.6.3 pypi_0 pypi
faiss-gpu 1.6.3 pypi_0 pypi
farm 0.6.2 pypi_0 pypi
farm-haystack 0.7.0 pypi_0 pypi
fastapi 0.61.1 pypi_0 pypi
fastapi-utils 0.2.1 pypi_0 pypi
fasteners 0.16 pypi_0 pypi
fasttext 0.9.2 pypi_0 pypi
fasttext-wheel 0.9.2 pypi_0 pypi
filelock 3.0.12 pypi_0 pypi
filetype 1.0.7 pypi_0 pypi
flake8 3.9.2 pypi_0 pypi
flask 1.1.2 pypi_0 pypi
flask-cors 3.0.9 pypi_0 pypi
flask-restplus 0.13.0 pypi_0 pypi
flatbuffers 1.12 pypi_0 pypi
future 0.18.2 pypi_0 pypi
gamechanger evergreen pypi_0 pypi
gamechangerml 0.1.0 dev_0
gast 0.3.3 pypi_0 pypi
gensim 3.8.3 pypi_0 pypi
gitdb 4.0.5 pypi_0 pypi
gitpython 3.1.11 pypi_0 pypi
google-auth 1.16.1 pypi_0 pypi
google-auth-oauthlib 0.4.1 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
gorilla 0.3.0 pypi_0 pypi
grpcio 1.32.0 pypi_0 pypi
gunicorn 20.0.4 pypi_0 pypi
h11 0.9.0 pypi_0 pypi
h5py 2.10.0 pypi_0 pypi
hnswlib 0.5.1 pypi_0 pypi
html5lib 1.1 pypi_0 pypi
httplib2 0.18.1 pypi_0 pypi
httptools 0.1.1 pypi_0 pypi
humanfriendly 8.2 pypi_0 pypi
hyperlink 19.0.0 pypi_0 pypi
hypothesis 6.14.0 pypi_0 pypi
idna 2.9 pypi_0 pypi
image 1.5.32 pypi_0 pypi
img2pdf 0.4.0 pypi_0 pypi
immutables 0.15 pypi_0 pypi
importlib-metadata 1.6.0 pypi_0 pypi
importlib-resources 3.3.0 pypi_0 pypi
incremental 17.5.0 pypi_0 pypi
iniconfig 1.1.1 pypi_0 pypi
ipython 7.16.1 pypi_0 pypi
ipython-genutils 0.2.0 pypi_0 pypi
isodate 0.6.0 pypi_0 pypi
isort 5.9.1 pypi_0 pypi
itsdangerous 1.1.0 pypi_0 pypi
jedi 0.17.2 pypi_0 pypi
jellyfish 0.8.2 pypi_0 pypi
jinja2 2.11.2 pypi_0 pypi
jmespath 0.9.5 pypi_0 pypi
joblib 0.15.1 pypi_0 pypi
jsonschema 3.2.0 pypi_0 pypi
keras 2.3.1 pypi_0 pypi
keras-applications 1.0.8 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
kiwisolver 1.3.1 pypi_0 pypi
langdetect 1.0.8 pypi_0 pypi
lazy-object-proxy 1.6.0 pypi_0 pypi
ld_impl_linux-64 2.35.1 h7274673_9
libffi 3.3 he6710b0_2
libgcc-ng 9.3.0 h5101ec6_17
libgomp 9.3.0 h5101ec6_17
libstdcxx-ng 9.3.0 hd4cf53a_17
lxml 4.5.1 pypi_0 pypi
lz4 3.1.3 pypi_0 pypi
mako 1.1.3 pypi_0 pypi
markdown 3.2.2 pypi_0 pypi
markupsafe 1.1.1 pypi_0 pypi
matplotlib 3.3.4 pypi_0 pypi
mccabe 0.6.1 pypi_0 pypi
mlflow 1.0.0 pypi_0 pypi
monotonic 1.5 pypi_0 pypi
more-itertools 8.6.0 pypi_0 pypi
msrest 0.6.19 pypi_0 pypi
murmurhash 1.0.2 pypi_0 pypi
mypy 0.910 pypi_0 pypi
mypy-extensions 0.4.3 pypi_0 pypi
ncurses 6.2 he6710b0_1
neo4j 4.1.1 pypi_0 pypi
neobolt 1.7.17 pypi_0 pypi
neotime 1.7.4 pypi_0 pypi
networkx 2.4 pypi_0 pypi
neuralcoref 4.0 pypi_0 pypi
neurdflib 5.0.1 pypi_0 pypi
nibabel 3.1.0 pypi_0 pypi
nipype 1.5.0 pypi_0 pypi
nltk 3.5 pypi_0 pypi
nose 1.3.7 pypi_0 pypi
numpy 1.19.5 pypi_0 pypi
oauthlib 3.1.0 pypi_0 pypi
ocrmypdf 11.3.2 pypi_0 pypi
openssl 1.1.1k h27cfd23_0
opt-einsum 3.3.0 pypi_0 pypi
packaging 20.4 pypi_0 pypi
pandas 1.0.4 pypi_0 pypi
pansi 2020.7.3 pypi_0 pypi
parsel 1.6.0 pypi_0 pypi
parso 0.7.1 pypi_0 pypi
pathy 0.5.2 pypi_0 pypi
pdfminer-six 20201018 pypi_0 pypi
pexpect 4.8.0 pypi_0 pypi
pickleshare 0.7.5 pypi_0 pypi
pikepdf 2.0.0 pypi_0 pypi
pillow 8.0.1 pypi_0 pypi
pip 21.1.3 py36h06a4308_0
plac 1.1.3 pypi_0 pypi
pluggy 0.13.1 pypi_0 pypi
preshed 3.0.2 pypi_0 pypi
prometheus-client 0.8.0 pypi_0 pypi
prometheus-flask-exporter 0.18.1 pypi_0 pypi
prompt-toolkit 2.0.10 pypi_0 pypi
protego 0.1.16 pypi_0 pypi
protobuf 3.12.2 pypi_0 pypi
prov 1.5.3 pypi_0 pypi
psutil 5.7.3 pypi_0 pypi
psycopg2-binary 2.8.6 pypi_0 pypi
ptyprocess 0.7.0 pypi_0 pypi
py 1.9.0 pypi_0 pypi
py2neo 2021.0.0 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pybind11 2.6.2 pypi_0 pypi
pycodestyle 2.7.0 pypi_0 pypi
pycparser 2.20 pypi_0 pypi
pydantic 1.7.4 pypi_0 pypi
pydispatcher 2.0.5 pypi_0 pypi
pydot 1.4.1 pypi_0 pypi
pydotplus 2.0.2 pypi_0 pypi
pyflakes 2.3.1 pypi_0 pypi
pygments 2.3.1 pypi_0 pypi
pyhamcrest 2.0.2 pypi_0 pypi
pylint 2.9.3 pypi_0 pypi
pymagnitude-lite 0.1.143 pypi_0 pypi
pymupdf 1.17.2 pypi_0 pypi
pyopenssl 19.1.0 pypi_0 pypi
pyparsing 2.4.7 pypi_0 pypi
pypdf2 1.26.0 pypi_0 pypi
pyrsistent 0.16.0 pypi_0 pypi
pytesseract 0.3.4 pypi_0 pypi
pytest 6.2.4 pypi_0 pypi
python 3.6.13 h12debd9_1
python-bidi 0.4.2 pypi_0 pypi
python-dateutil 2.8.1 pypi_0 pypi
python-docx 0.8.10 pypi_0 pypi
python-editor 1.0.4 pypi_0 pypi
python-graphviz 0.14 pypi_0 pypi
python-multipart 0.0.5 pypi_0 pypi
pytz 2020.1 pypi_0 pypi
pyxnat 1.3 pypi_0 pypi
pyyaml 5.4.1 pypi_0 pypi
querystring-parser 1.2.4 pypi_0 pypi
queuelib 1.5.0 pypi_0 pypi
rdflib 5.0.0 pypi_0 pypi
readline 8.1 h27cfd23_0
redis 3.5.3 pypi_0 pypi
regex 2020.5.14 pypi_0 pypi
reportlab 3.5.55 pypi_0 pypi
requests 2.23.0 pypi_0 pypi
requests-oauthlib 1.3.0 pypi_0 pypi
rsa 4.0 pypi_0 pypi
s3transfer 0.3.3 pypi_0 pypi
sacremoses 0.0.43 pypi_0 pypi
scikit-learn 0.23.1 pypi_0 pypi
scipy 1.4.1 pypi_0 pypi
scrapy 2.1.0 pypi_0 pypi
seaborn 0.11.1 pypi_0 pypi
selenium 3.141.0 pypi_0 pypi
sentence-transformers 0.4.1.2 pypi_0 pypi
sentencepiece 0.1.94 pypi_0 pypi
seqeval 1.2.2 pypi_0 pypi
service-identity 18.1.0 pypi_0 pypi
setuptools 57.0.0 pypi_0 pypi
setuptools-scm 6.0.1 pypi_0 pypi
simplejson 3.17.0 pypi_0 pypi
six 1.15.0 pypi_0 pypi
sklearn 0.0 pypi_0 pypi
smart-open 3.0.0 pypi_0 pypi
smmap 3.0.4 pypi_0 pypi
sortedcontainers 2.2.2 pypi_0 pypi
soupsieve 2.0.1 pypi_0 pypi
spacy 3.0.6 pypi_0 pypi
spacy-legacy 3.0.5 pypi_0 pypi
spicy 0.16.0 pypi_0 pypi
sqlalchemy 1.3.13 pypi_0 pypi
sqlalchemy-utils 0.36.8 pypi_0 pypi
sqlite 3.36.0 hc218d9a_0
sqlparse 0.4.1 pypi_0 pypi
srsly 2.4.1 pypi_0 pypi
starlette 0.13.6 pypi_0 pypi
syntok 1.3.1 pypi_0 pypi
tabulate 0.8.7 pypi_0 pypi
tensorboard 2.4.1 pypi_0 pypi
tensorboard-plugin-wit 1.7.0 pypi_0 pypi
tensorflow 2.4.1 pypi_0 pypi
tensorflow-estimator 2.4.0 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
textblob 0.15.3 pypi_0 pypi
thinc 8.0.4 pypi_0 pypi
threadpoolctl 2.1.0 pypi_0 pypi
tika 1.24 pypi_0 pypi
tk 8.6.10 hbc83047_0
tokenizers 0.9.4 pypi_0 pypi
toml 0.10.2 pypi_0 pypi
toolz 0.10.0 pypi_0 pypi
torch 1.7.1 pypi_0 pypi
torchvision 0.8.2 pypi_0 pypi
tox 3.20.1 pypi_0 pypi
tqdm 4.46.1 pypi_0 pypi
traitlets 4.3.3 pypi_0 pypi
traits 6.1.0 pypi_0 pypi
transformers 4.1.1 pypi_0 pypi
twisted 20.3.0 pypi_0 pypi
txtai 2.0.0 pypi_0 pypi
typed-ast 1.4.1 pypi_0 pypi
typer 0.3.2 pypi_0 pypi
typing-extensions 3.7.4.3 pypi_0 pypi
tzlocal 2.1 pypi_0 pypi
urllib3 1.24.3 pypi_0 pypi
uvicorn 0.13.3 pypi_0 pypi
uvloop 0.14.0 pypi_0 pypi
virtualenv 20.1.0 pypi_0 pypi
w3lib 1.22.0 pypi_0 pypi
wasabi 0.8.2 pypi_0 pypi
wcwidth 0.2.3 pypi_0 pypi
webencodings 0.5.1 pypi_0 pypi
websocket-client 0.57.0 pypi_0 pypi
websockets 8.1 pypi_0 pypi
werkzeug 0.16.1 pypi_0 pypi
wget 3.2 pypi_0 pypi
wheel 0.36.2 pyhd3eb1b0_0
wrapt 1.12.1 pypi_0 pypi
xgboost 1.1.0 pypi_0 pypi
xhtml2pdf 0.2.5 pypi_0 pypi
xxhash 2.0.0 pypi_0 pypi
xz 5.2.5 h7b6447c_0
zipp 3.1.0 pypi_0 pypi
zlib 1.2.11 h7b6447c_3
zope-interface 5.1.0 pypi_0 pypi
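
For anyone trying to reproduce this, the corruption attempt described above (dropping the header and scrambling contents at random) could be sketched roughly as follows. This is an illustration under assumptions, not the exact script that was used: the function name `corrupt_pdf_bytes` and the `scramble_fraction` and `seed` parameters are hypothetical, and it only uses the Python standard library.

```python
import random


def corrupt_pdf_bytes(data: bytes, scramble_fraction: float = 0.05, seed: int = 0) -> bytes:
    """Return a damaged copy of PDF bytes (hypothetical helper for testing).

    The "%PDF-x.y" header line is dropped if present, and a random sample of
    the remaining byte positions is overwritten with random values.
    """
    if not 0.0 <= scramble_fraction <= 1.0:
        raise ValueError("scramble_fraction must be between 0 and 1")

    # Remove the first line only when it looks like a PDF header.
    if data.startswith(b"%PDF-"):
        end = data.find(b"\n")
        data = data[end + 1:] if end != -1 else b""

    body = bytearray(data)
    rng = random.Random(seed)  # seeded so a failing sample can be regenerated

    # Overwrite a fraction of the byte positions with random byte values.
    count = int(len(body) * scramble_fraction)
    for pos in rng.sample(range(len(body)), count):
        body[pos] = rng.randrange(256)

    return bytes(body)


# Example with in-memory bytes standing in for a real file's contents.
sample = b"%PDF-1.4\n" + b"stream data " * 50
damaged = corrupt_pdf_bytes(sample, scramble_fraction=0.1, seed=42)
```

To apply it to a real file, read the file in binary mode, pass the bytes through the function, and write the result to a new path. As noted above, a file damaged this way did not trigger the error, so the real failing PDFs are likely broken in some other way.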


takao8 commented on May 30, 2024

Hey Nicholas, we've been making a lot of changes and bugfixes throughout the repo over the last couple of weeks (some of them in our parser, too). I'm not convinced that these updates will resolve this issue as it stands, but could you git pull and verify that you're still getting the same error? If you are, I'll get back to you shortly about potentially testing fixes for this on a separate branch.

