neuml / paperetl
📄 ⚙️ ETL processes for medical and scientific papers
License: Apache License 2.0
Make the following improvements:
Hi David,
In addition to following your paperetl installation instructions I had to take these steps to get rid of the following error and warning:
~$ python3
>>> import nltk
>>> nltk.download("punkt")
>>> exit()
this created a directory ~/nltk_data/tokenizers/punkt
and fixed the above error
Also, the UserWarning below was eliminated as follows:
$ pip3 uninstall scikit-learn==0.23.2
$ pip3 install scikit-learn==0.23.1
Then, unfortunately, after a full run the resulting articles.sqlite database came out with the Study Design, Tags and Labels fields all NULL (see screenshots below).
Any ideas on how to solve this NULL issue would be appreciated.
Some of the information in README.md is inaccurate, such as the default location for the study design models. Review and update.
Currently, the file ETL process is single-threaded. With #34, it is now much easier to parallelize this process. This change will use a multiprocessing pool to process files.
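The pooled approach can be sketched as follows, assuming a hypothetical `process_file` worker; the names are illustrative, not paperetl's actual API:

```python
import multiprocessing


def process_file(path):
    # Placeholder for the per-file parse step (hypothetical name)
    return f"processed {path}"


def run(paths):
    # Fan the file list out across a pool of worker processes
    with multiprocessing.Pool() as pool:
        return pool.map(process_file, paths)
```

Each worker handles one file independently, so results come back in input order via `pool.map`.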
Add .pre-commit-config.yaml file to enable checks for code quality.
For file-based jobs with an input directory, support traversing multi-level directory structures to process all PDF articles.
Currently, ETL processes assume each run is a full database reload. This works well for smaller datasets, but it's inefficient for larger ones.
Add the ability to set the path to an existing database and copy unmodified records from the existing source. This way only new/updated records are processed each run.
SQLite needs a system for reading and inserting articles/sections from another database.
Elasticsearch already handles most of this, just needs a small change to only create the articles index if it doesn't already exist. Merges will be handled by Elasticsearch based on the article id.
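For the SQLite side, one way to copy unmodified rows from an existing database is SQLite's ATTACH. A minimal sketch, assuming a simplified `articles` table keyed by an `id` column (the function name and schema are illustrative):

```python
import sqlite3


def copy_articles(current_path, existing_path, ids):
    # Attach the prior run's database and copy over rows whose ids
    # are unchanged, so only new/updated records need reprocessing
    db = sqlite3.connect(current_path)
    db.execute("ATTACH DATABASE ? AS prior", (existing_path,))
    db.executemany(
        "INSERT INTO articles SELECT * FROM prior.articles WHERE id = ?",
        [(i,) for i in ids],
    )
    db.commit()
    db.close()
```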
Set a default export function to allow the json.dump() method to write objects it doesn't explicitly have a converter for.
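A common pattern for this (not necessarily the exact converter paperetl adopted) is a fallback that stringifies anything `json` can't serialize natively:

```python
import datetime
import json


def default(obj):
    # Fallback converter: stringify objects json has no built-in encoder for
    return str(obj)


# dates are not JSON-serializable by default; the fallback handles them
data = {"entry": datetime.date(2020, 8, 12)}
print(json.dumps(data, default=default))  # {"entry": "2020-08-12"}
```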
Hi,
I have both paperetl and elasticsearch set up in docker containers running on my machine. When I try and process a .pdf file and add it to elasticsearch I get the error:
python -m paperetl.file paperetl/data http://localhost:9200 paperetl/models
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/dist-packages/paperetl/file/__main__.py", line 15, in <module>
sys.argv[4] == "True" if len(sys.argv) > 4 else False,
File "/usr/local/lib/python3.7/dist-packages/paperetl/file/execute.py", line 176, in run
db = Factory.create(url, replace)
File "/usr/local/lib/python3.7/dist-packages/paperetl/factory.py", line 29, in create
return Elastic(url, replace)
File "/usr/local/lib/python3.7/dist-packages/paperetl/elastic.py", line 44, in __init__
exists = self.connection.indices.exists("articles")
File "/usr/local/lib/python3.7/dist-packages/elasticsearch/_sync/client/utils.py", line 308, in wrapped
"Positional arguments can't be used with Elasticsearch API methods. "
TypeError: Positional arguments can't be used with Elasticsearch API methods. Instead only use keyword arguments.
I assume it has something to do with Elasticsearch changing in v8, but I'm not sure.
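In elasticsearch-py 8.x, API methods no longer accept positional arguments, so the call in paperetl/elastic.py needs the index name passed as a keyword. A sketch of the likely one-line fix, against the 8.x client API:

```python
# paperetl/elastic.py — elasticsearch-py 8.x rejects positional arguments

# 7.x style (raises TypeError on the 8.x client):
#   exists = self.connection.indices.exists("articles")

# 8.x style:
exists = self.connection.indices.exists(index="articles")
```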
The current functionality of paperetl is to create a new database each run. With the merge/duplicate changes in #36, it now makes more sense for the default action to be "create or update" a database each run. A flag will be available to force the old behavior and replace the database each run.
paperoni allows searching for scientific papers and downloading their corresponding PDFs (when available). Evaluate possible ways to integrate with paperetl.
ghSrc/paperetl % python -m paperetl.cord19 2020-08-12
.........
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TfidfVectorizer from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
Traceback (most recent call last):
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/__main__.py", line 11, in <module>
Execute.run(sys.argv[1],
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/execute.py", line 281, in run
article.metadata = article.metadata + (dates[sha],)
KeyError: '47235b96c07e8066195b6521882340408b9bdd34'
ghSrc/paperetl %
my directory:
paperetl/2020-08-12 % ll
total 14642544
drwxr-xr-x 14 yuanke staff 448 8 14 10:06 .
drwxr-xr-x 13 yuanke staff 416 8 14 09:39 ..
drwxr-xr-x 3 yuanke staff 96 8 14 09:44 __results___files
-rw-r--r--@ 1 yuanke staff 455816 8 5 15:01 attribute
-rw-r--r--@ 1 yuanke staff 206732 8 5 15:01 attribute.csv
-rw-r--r-- 1 yuanke staff 24504 8 13 05:52 changelog
-rw-r--r-- 1 yuanke staff 1375487377 8 13 05:53 cord_19_embeddings.tar.gz
-rw-r--r-- 1 yuanke staff 3143476778 8 13 05:23 cord_19_embeddings_2020-08-12.csv
-rw-r--r--@ 1 yuanke staff 4185255 8 5 15:01 design
-rw-r--r--@ 1 yuanke staff 61843 8 5 15:01 design.csv
drwxr-xr-x 4 yuanke staff 128 8 14 10:04 document_parses
-rw-r--r-- 1 yuanke staff 2638941522 8 13 05:53 document_parses.tar.gz
-rw-r--r--@ 1 yuanke staff 15487674 8 14 01:44 entry-dates.csv
-rw-r--r-- 1 yuanke staff 297398784 8 13 05:53 metadata.csv
paperetl/2020-08-12 %
CORD-19 is no longer updated. Update scripts and documentation to note that.
!python -m paperetl.file paperetl/file/data paperetl/models
I get the following error:
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers
before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Traceback (most recent call last):
File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/file/__main__.py", line 11, in <module>
Execute.run(
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/file/execute.py", line 176, in run
db = Factory.create(url, replace)
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/factory.py", line 36, in create
return SQLite(url.replace("sqlite://", ""), replace)
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/sqlite.py", line 104, in __init__
self.create(SQLite.ARTICLES, "articles")
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/sqlite.py", line 198, in create
self.cur.execute(create)
sqlite3.OperationalError: database is locked
Currently, each source has its own local method for creating a global Grammar object. Standardize this logic.
Currently, the pre-trained study design models are stored on Kaggle. Put a copy of these files on the next GitHub release. This will allow automation/docker builds.
# Download pre-trained study design/attribute models
# https://www.kaggle.com/davidmezzetti/cord19-study-design/#attribute
# https://www.kaggle.com/davidmezzetti/cord19-study-design/#design
#34 removed all study design and attribute detection in paperetl in favor of paperai. paperetl is now significantly faster without spaCy pipelines slowing things down. With that, the legacy merge process designed to overcome performance concerns can be removed and replaced with simple duplicate detection that replaces records based on entry date.
Python 3.6 reaches EOL in a matter of days. Update scripts and requirements to 3.7.
Support loading full-text open access documents via PubMed API queries.
This will add support for both PubMed MEDLINE archives and articles pulled via the API.
Currently, the entry date download process assumes there is a metadata.csv file for each day. Since the datasource changed to biweekly updates, there may not be a metadata.csv file for the 1st of the month. Add logic to detect month changes and use the earliest metadata.csv file per month instead.
Fix issue caused by trailing slash in setup.py
ValueError: path 'src/python/' cannot end with '/'
Support loading full-text arXiv documents via arXiv API queries
Greetings,
Thanks for working on this! Is there a neat way to borrow the metadata of my pdf files from the Zotero database instead of relying on parsing from the PDF?
Need to update the hyperparameter names as they are currently not correct.
For file-based jobs interfacing with GROBID, add better error handling around publication date parsing.
Improve the accuracy of sample size extraction. Add unit tests.
Add the ability to import article metadata from CSV files similar to the CORD-19 metadata.csv file.
Coverage is currently 67% - need to improve it.
I tried running paperetl in AWS (ubuntu 20.04 LTS t2.small instance with 50 GiB of storage) with the following procedure:
The cord-19_2020-09-01.tar.gz (release) dataset was downloaded and extracted in the following download path: ~/cordata
This extraction created a directory ~/cordata/2020-09-01 containing the following files:
~/cordata/2020-09-01/document_parses.tar.gz
~/cordata/2020-09-01/metadata.csv
document_parses.tar.gz was further extracted as a directory named document_parses, which contained the following 2 subdirectories:
~/cordata/2020-09-01/document_parses/pdf_json
~/cordata/2020-09-01/document_parses/pmc_json
entry-dates.csv generated in Kaggle https://www.kaggle.com/davidmezzetti/cord-19-article-entry-dates?scriptVersionId=41813239 was also placed in this directory; therefore, the command
~/cordata/2020-09-01$ python3 -m paperetl.cord19 .
was executed from the ~/cordata/2020-09-01 directory containing the following:
~/cordata/2020-09-01/document_parses
~/cordata/2020-09-01/entry-dates.csv
~/cordata/2020-09-01/metadata.csv
The above procedure gave the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/.cord19/models/attribute'
This error occurred despite the fact that I had pre-trained attribute and design files:
~/.cord19/models/attribute (from https://www.kaggle.com/davidmezzetti/cord19-study-design/#attribute)
~/.cord19/models/design (from https://www.kaggle.com/davidmezzetti/cord19-study-design/#design )
In another attempt, without using these 2 pre-trained files (i.e. starting with an empty ~/.cord19/models directory), I still got the exact same error message.
See error details in the following screenshot:
Any help would be appreciated.
Fix an error when a full reprocess is set when there are no updates. This should instead return with no updates.
Apologies if my question is too silly.
In the description you wrote:
"PDF parsing relies on an existing GROBID instance to be up and running.
It is assumed that this is running locally on the ETL server"
Can you provide some sample lines about how to do that?
Best regards
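For reference, one common way to stand up a local GROBID instance is via Docker; the image tag below is illustrative, so check the GROBID documentation for the current release:

```shell
# Start GROBID on its default port (8070)
docker run --rm --init -p 8070:8070 lfoppiano/grobid:0.7.2

# Confirm the service is up before running paperetl
curl http://localhost:8070/api/isalive
```

paperetl's PDF parsing would then talk to the service at http://localhost:8070.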
Currently, paperetl has a couple of statistical study design models to detect common study design fields. This requires a large NLP pipeline backed by spaCy to run a series of NLP/grammar steps. While this was a good solution in mid-2020, there are now better ways to do this.
Furthermore, the NLP pipelines are slow and add significant processing overhead. Last but not least, paperetl can process both medical and technical/scientific papers, yet these fields are medical-specific. This functionality is more appropriate for the paperai project, and the NLP logic should reside within that project.
Currently, scispacy doesn't have models for spacy 3.0 - see allenai/scispacy#303
A temporary workaround is to install spacy 2.x via
pip install spacy==2.3.5
If scispacy isn't updated in the near term, a dot release will be put out to limit setup.py to spacy 2.x
Currently, the only examples are old Kaggle notebooks from the CORD-19 challenge. Add a more recent example.
Modify the PDF file extraction process as follows:
Check the file extension of each input file and read using a gzip stream if it's a compressed file
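The extension check described above can be sketched as follows; the helper name is illustrative:

```python
import gzip


def open_stream(path):
    # Read through a gzip stream when the extension indicates compression,
    # otherwise open the file directly
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")
```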
Some text sections in CORD-19 are extremely long, often with large RNA sequences as text which won't split into sentences. Add logic to handle these types of sections.
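One simple guard for such sections is to cap the size of any single chunk; a sketch with an illustrative threshold:

```python
def split_section(text, maxlen=100000):
    # Break extremely long unsplittable sections (e.g. raw RNA sequences)
    # into fixed-size chunks instead of passing them to a sentence splitter
    return [text[i:i + maxlen] for i in range(0, len(text), maxlen)]
```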
Update setup.py to only show standard image on PyPI
paperetl is great and has been useful for my work! It has been working well for most of the PDF papers I feed it, but I am having some issues with certain PDFs. I am new to Python, so it's very likely I am doing something wrong, but I thought I'd reach out.
When I run this for a specific PDF:
python3.10 -m paperetl.file /home/bill/brokenone /home/bill/brokenone /home/bill/brokenone
I get this error:
Processing: /home/bill/brokenone/20 Immune Cells Enhance Selectivity of Nanosecond-Pulsed DBD Plasma Against Tumor Cells.pdf
/usr/local/lib/python3.10/dist-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml"
into the BeautifulSoup constructor.
warnings.warn(
Process Process-1:
Total articles inserted: 0
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/execute.py", line 94, in process
for result in Execute.parse(*params):
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/execute.py", line 67, in parse
yield PDF.parse(stream, source)
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/pdf.py", line 34, in parse
return TEI.parse(xml, source) if xml else None
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py", line 55, in parse
sections = TEI.text(soup, title)
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py", line 247, in text
name = figure.get("xml:id").upper()
AttributeError: 'NoneType' object has no attribute 'upper'
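The crash occurs when a `<figure>` element lacks an `xml:id` attribute: `figure.get("xml:id")` returns None, and calling `.upper()` on it fails. A defensive sketch of the kind of guard tei.py needs; the fallback name is illustrative:

```python
def figure_name(figure, default="FIGURE"):
    # figure.get returns None when the attribute is missing; fall back
    # to a default instead of calling .upper() on None
    xmlid = figure.get("xml:id")
    return xmlid.upper() if xmlid else default
```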
Python 3.7 is now EOL. The minimum supported version of Python for releases moving forward should be 3.8+
Currently there is no duplicate detection within a single run during file processing. Add this capability, similar to what is in the CORD-19 process.
The CORD-19 releases page is no longer being updated consistently. Switch the entry date generation process to use the latest changelog file.
There is no major known use case for storing citations. Remove as this wouldn't be easy to support for incremental loads.
Build a dockerfile for instantiating a paperetl environment
Currently for file processes, PDF is hard-coded into the source column. The file name should be used instead.
Add unit tests to paperetl to help with quality assurance
When I try and index PDFs on my development machine, I get frequent failures because the grobid service is overwhelmed by the number of processes allocated for ingesting the PDFs:
ERROR [2023-12-03 11:57:31,564] org.grobid.service.process.GrobidRestProcessFiles: Could not get an engine from the pool within configured time. Sending service unavailable.
The default concurrency setting in grobid is 10. On my machine os.cpu_count() returns 16, so we are creating more processes than there are available engines in the grobid pool.
Whilst this is not an issue in paperetl itself, I think anyone for whom os.cpu_count() returns > 10 will hit this issue. The impact could be mitigated by adding a note to the documentation suggesting users increase the default concurrency limit in grobid if they see this error. https://grobid.readthedocs.io/en/latest/Configuration/#service-configuration
I am happy to create a PR for this if you agree.
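Alternatively, paperetl could cap its worker count at GROBID's default engine pool size; a sketch, assuming the pool is sized from os.cpu_count():

```python
import os

# GROBID's default service concurrency limit
GROBID_POOL_SIZE = 10


def worker_count():
    # Never spawn more workers than GROBID engines are available
    return min(os.cpu_count() or 1, GROBID_POOL_SIZE)
```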
Currently the file process attempts to find a caption/label/name to use as the section name for TEI files. This is error prone. xml:id is unique and more reliable.
Currently, for the CORD-19 dataset, entry-dates.csv is required to be manually downloaded using the following instructions:
# Download entry-dates.csv and place in <download path>
# https://www.kaggle.com/davidmezzetti/cord-19-article-entry-dates/output
entry-dates.csv should be able to be built outside of Kaggle, to allow automation/docker builds. The Kaggle entry-dates component should be updated to call this new component.
ghSrc/paperetl % python -m paperetl.cord19 2020-03-27
Building articles database from 2020-03-27
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/execute.py", line 184, in process
sections, citations = Section.parse(row, indir)
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/section.py", line 49, in parse
for path in Section.files(row):
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/section.py", line 100, in files
if row[column]:
KeyError: 'pdf_json_files'
"""