
neuml / paperetl

๐Ÿ“„ โš™๏ธ ETL processes for medical and scientific papers

License: Apache License 2.0

Python 96.56% Makefile 0.78% Dockerfile 2.18% Shell 0.48%
python etl scientific-papers parse medical

paperetl's Introduction

ETL processes for medical and scientific papers



paperetl is an ETL library for processing medical and scientific papers.

(architecture diagram)

paperetl supports the following sources:

  • File formats:
    • PDF
    • XML (arXiv, PubMed, TEI)
    • CSV
  • COVID-19 Research Dataset (CORD-19)

paperetl supports the following output options for storing articles:

  • SQLite
  • Elasticsearch
  • JSON files
  • YAML files

Installation

The easiest way to install is via pip and PyPI

pip install paperetl

Python 3.8+ is supported. Using a Python virtual environment is recommended.
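
For example, a clean environment can be created with Python's built-in venv module before installing (the .venv directory name is arbitrary):

python -m venv .venv
source .venv/bin/activate
pip install paperetl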

paperetl can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+https://github.com/neuml/paperetl

Additional dependencies

PDF parsing relies on an existing GROBID instance being up and running. It is assumed that GROBID is running locally on the ETL server. This is only necessary when processing PDF files.
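
One common way to start a local GROBID instance is via Docker. A minimal sketch using the community lfoppiano/grobid image on GROBID's default port 8070 (the image tag below is an example, check the GROBID documentation for the current release):

# start a local GROBID instance; adjust the tag to the current GROBID release
docker run --rm --init -p 8070:8070 lfoppiano/grobid:0.8.0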

Note: In some cases, the GROBID engine pool can be exhausted, resulting in a 503 error. This can be fixed by increasing concurrency and/or poolMaxWait in the GROBID configuration file.

Docker

A Dockerfile with commands to install paperetl, all of its dependencies and supporting scripts is available in this repository.

wget https://raw.githubusercontent.com/neuml/paperetl/master/docker/Dockerfile
docker build -t paperetl -f Dockerfile .
docker run --name paperetl --rm -it paperetl

This brings up a paperetl command shell. Standard Docker commands can be used to copy files into the container, or commands can be run directly in the shell to retrieve input content.
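
For example, local input files can be staged into the running container with docker cp (the /data destination below is only an illustrative path, adjust it as needed):

# copy local input files into the container; /data is an illustrative destination path
docker cp paperetl/data paperetl:/data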

Examples

Notebooks

Notebook: Introducing paperetl
Description: Overview of the functionality provided by paperetl (Open In Colab)

Load Articles into SQLite

The following example shows how to use paperetl to load a set of medical/scientific articles into a SQLite database.

  1. Download the desired medical/scientific articles into a local directory. For this example, it is assumed the articles are in a directory named paperetl/data.

  2. Build the database

    python -m paperetl.file paperetl/data paperetl/models
    

Once complete, there will be an articles.sqlite file in paperetl/models
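
To spot-check the load, the database can be queried with the sqlite3 command line client (assuming it is installed locally):

sqlite3 paperetl/models/articles.sqlite "SELECT count(*) FROM articles"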

Load into Elasticsearch

Elasticsearch is also a supported datastore, as shown below. This example assumes Elasticsearch is running locally; change the URL to point to a remote server as appropriate.

python -m paperetl.file paperetl/data http://localhost:9200

Once complete, there will be an articles index in Elasticsearch with the metadata and full text stored.
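
A quick sanity check is to ask Elasticsearch for the document count in the articles index using its standard REST API:

curl http://localhost:9200/articles/_count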

Convert articles to JSON/YAML

paperetl can also be used to convert articles into JSON or YAML files. This is useful if the data is to be fed into another system or for manual inspection/debugging of a single file.

JSON:

python -m paperetl.file paperetl/data json://paperetl/json

YAML:

python -m paperetl.file paperetl/data yaml://paperetl/yaml

Converted files will be stored in paperetl/(json|yaml)
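
The generated files can be inspected directly from the command line, for example (the filename below is a placeholder, actual names are derived from the input articles):

ls paperetl/json
python -m json.tool paperetl/json/<article>.json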

Load CORD-19

Note: The final version of CORD-19 was released on 2022-06-22, but it remains a large, valuable set of medical documents.

The following example shows how to use paperetl to load the CORD-19 dataset into a SQLite database.

  1. Download and extract the dataset from the Allen Institute for AI CORD-19 Release Page.

    scripts/getcord19.sh cord19/data
    

    The script above retrieves and unpacks the latest copy of CORD-19 into a directory named cord19/data. An optional second argument sets a specific date of the dataset in the format YYYY-MM-DD (ex. 2021-01-01), which defaults to the latest date.

  2. Generate entry-dates.csv for the current version of the dataset

    python -m paperetl.cord19.entry cord19/data
    

    An optional second argument sets a specific date of the dataset in the format YYYY-MM-DD (ex. 2021-01-01), which defaults to the latest date. This should match the date used in Step 1 (a combined example pinned to a specific date follows Step 3).

  3. Build database

    python -m paperetl.cord19 cord19/data cord19/models
    

    Once complete, there will be an articles.sqlite file in cord19/models. As with earlier examples, the data can also be loaded into Elasticsearch.

    python -m paperetl.cord19 cord19/data http://localhost:9200
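
For reference, here is a combined sketch of all three steps pinned to a specific dataset date (2021-01-01), using the optional date arguments described in Steps 1 and 2:

scripts/getcord19.sh cord19/data 2021-01-01
python -m paperetl.cord19.entry cord19/data 2021-01-01
python -m paperetl.cord19 cord19/data cord19/models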
    

paperetl's People

Contributors

davidmezzetti, elshimone, nialov


paperetl's Issues

Add example notebook

Currently, the only examples are old Kaggle notebooks from the CORD-19 challenge. Add a more recent example.

Additional installation steps and bug for CORD-19

Hi David,
In addition to following your paperetl installation instructions, I had to take these steps to get rid of the following error and warning:

(screenshot of the error)

~$ python3
>>> import nltk
>>> nltk.download('punkt')
>>> exit()        

This created a directory ~/nltk_data/tokenizers/punkt
and fixed the above error.

Also, the UserWarning below was eliminated as follows:

$ pip3 uninstall scikit-learn==0.23.2
$ pip3 install scikit-learn==0.23.1

(screenshot of the warning)

Then, unfortunately, after a full run the resulting articles.sqlite database came out with the Study Design fields, Tags and Labels all NULL (see screenshots below).

(screenshots of the articles.sqlite query results showing NULL values)

Any ideas on how to solve this NULL issue would be appreciated.

Feature: Incremental database update

Currently, ETL processes assume operations are a full database reload each run. This works well for smaller datasets but is inefficient for larger ones.

Add the ability to set the path to an existing database and copy unmodified records from the existing source. This way only new/updated records are processed each run.

SQLite needs a system for reading and inserting articles/sections from another database.

Elasticsearch already handles most of this, just needs a small change to only create the articles index if it doesn't already exist. Merges will be handled by Elasticsearch based on the article id.

Add pre-trained study design models to GitHub

Currently, the pre-trained study design models are stored on Kaggle. Put a copy of these files on the next GitHub release. This will allow automation/docker builds.

# Download pre-trained study design/attribute models
# https://www.kaggle.com/davidmezzetti/cord19-study-design/#attribute
# https://www.kaggle.com/davidmezzetti/cord19-study-design/#design

Add generic CSV source

Add the ability to import article metadata from CSV files similar to the CORD-19 metadata.csv file.

Review and update README.md

Some of the information in README.md is inaccurate, such as the default location for the study design models. Review and update.

Add database flag to determine if database should be replaced

The current functionality of paperetl is to create a new database each run. With the merge/duplicate changes in #36, it now makes more sense for the default action to be to "create or update" a database each run. A flag will be available to force the old behavior and replace the database each run.

AttributeError: 'NoneType' object has no attribute 'upper'

paperetl is great and has been useful for my work! It has been working well for most of the PDF papers I feed it. I am having some issues with certain PDFs. I am new to Python, so it's very likely I am doing something wrong, but I thought I'd reach out.

When I run this for a specific PDF:

python3.10 -m paperetl.file /home/bill/brokenone /home/bill/brokenone /home/bill/brokenone

I get this error:

Processing: /home/bill/brokenone/20 Immune Cells Enhance Selectivity of Nanosecond-Pulsed DBD Plasma Against Tumor Cells.pdf
/usr/local/lib/python3.10/dist-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.
warnings.warn(
Process Process-1:
Total articles inserted: 0
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/execute.py", line 94, in process
for result in Execute.parse(*params):
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/execute.py", line 67, in parse
yield PDF.parse(stream, source)
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/pdf.py", line 34, in parse
return TEI.parse(xml, source) if xml else None
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py", line 55, in parse
sections = TEI.text(soup, title)
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py", line 247, in text
name = figure.get("xml:id").upper()
AttributeError: 'NoneType' object has no attribute 'upper'

PDF extraction improvements

Modify the PDF file extraction process as follows:

  • Use dateutil to parse and format the published/date field
  • Extract content from tables
  • Build uid off the title but fallback to doi if title is not found
  • Add tag of PDF to articles

Error either with or without pre-trained attribute file

I tried running paperetl in AWS (Ubuntu 20.04 LTS t2.small instance with 50 GiB of memory) with the following procedure:

The cord-19_2020-09-01.tar.gz (release) dataset was downloaded and extracted in the following download path: ~/cordata
This extraction created a directory ~/cordata/2020-09-01 containing the following files:
~/cordata/2020-09-01/document_parses.tar.gz
~/cordata/2020-09-01/metadata.csv
document_parses.tar.gz was further extracted as a directory named document_parses, which contained the following 2 subdirectories:
~/cordata/2020-09-01/document_parses/pdf_json
~/cordata/2020-09-01/document_parses/pmc_json

entry-dates.csv generated in Kaggle https://www.kaggle.com/davidmezzetti/cord-19-article-entry-dates?scriptVersionId=41813239 was also placed in this directory; therefore, the command
~/cordata/2020-09-01$ python3 -m paperetl.cord19 .
was executed from the ~/cordata/2020-09-01 directory containing the following:
~/cordata/2020-09-01/document_parses
~/cordata/2020-09-01/entry-dates.csv
~/cordata/2020-09-01/metadata.csv

The above procedure gave the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/.cord19/models/attribute'

This error resulted despite the fact that I had pre-trained attribute and design files:

~/.cord19/models/attribute (from https://www.kaggle.com/davidmezzetti/cord19-study-design/#attribute)
~/.cord19/models/design (from https://www.kaggle.com/davidmezzetti/cord19-study-design/#design )

In another attempt, without using these 2 pre-trained files (i.e. starting with an empty ~/.cord19/models directory), I still got the exact same error message.

See error details in the following screenshot:

(screenshot of the error)

Any help would be appreciated.

Add PubMed as source

Support loading full-text open access documents via PubMed API queries.

This will add support for both PubMed MEDLINE archives and articles pulled via the API.

Windows install issue

Fix issue caused by trailing slash in setup.py

ValueError: path 'src/python/' cannot end with '/'

sqlite3.OperationalError: database is locked

!python -m paperetl.file paperetl/file/data paperetl/models

get the following error:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Traceback (most recent call last):
File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/file/main.py", line 11, in
Execute.run(
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/file/execute.py", line 176, in run
db = Factory.create(url, replace)
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/factory.py", line 36, in create
return SQLite(url.replace("sqlite://", ""), replace)
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/sqlite.py", line 104, in init
self.create(SQLite.ARTICLES, "articles")
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/sqlite.py", line 198, in create
self.cur.execute(create)
sqlite3.OperationalError: database is locked

Update CORD-19 entry dates source

The CORD-19 releases page is no longer being updated consistently. Switch the entry date generation process to use the latest changelog file.

sample lines for running etl server and grobid instance

Apologies if my question is too silly.
In the description you wrote:
"PDF parsing relies on an existing GROBID instance to be up and running.
It is assumed that this is running locally on the ETL server"
Can you provide some sample lines about how to do that?
Best regards

Remove study attribute and design models and all related dependencies

Currently, paperetl has a couple of statistical study design models to detect common study design fields. This requires a large NLP pipeline backed by spaCy to run a series of NLP/grammar steps. While this was initially a good solution in mid-2020, there are now better ways to do this.

Furthermore, the NLP pipelines are slow and add significant processing overhead. Last but not least, paperetl can process both medical and technical/scientific papers, while these fields are medical-specific. This functionality is more appropriate for the paperai project and the NLP logic should reside within that project.

Improve PMB filtering logic

Make the following improvements:

  • Allow filtering on article ids in addition to codes
  • Check filters before processing full text to improve performance

Filter duplicate ids

Currently there is no duplicate detection within a single run during file processing. Add this capability, similar to what is in the CORD-19 process.

Remove citations table/index

There is no major known use case for storing citations. Remove as this wouldn't be easy to support for incremental loads.

Use XML id for file figure processing

Currently the file process attempts to find a caption/label/name to use as the section name for TEI files. This is error prone. xml:id is unique and more reliable.

Zotero connection

Greetings,

Thanks for working on this! Is there a neat way to borrow the metadata of my pdf files from the Zotero database instead of relying on parsing from the PDF?

Add component to build entry-dates.csv

Currently, for the CORD-19 dataset, entry-dates.csv is required to be manually downloaded using the following instructions:

# Download entry-dates.csv and place in <download path>
# https://www.kaggle.com/davidmezzetti/cord-19-article-entry-dates/output

It should be possible to build entry-dates.csv outside of Kaggle to allow automation/Docker builds. The Kaggle entry-dates component should be updated to call this new component.

Remove legacy merge logic

#34 removed all study design and attribute detection in paperetl in favor of paperai. paperetl is now significantly faster without spaCy pipelines slowing things down. With that, the legacy merge process designed to overcome performance concerns can be removed. It can be replaced with simple duplicate detection and replacement based on entry date.

Detect month changes in CORD-19 entry date process

Currently, the entry date download process assumes there is a metadata.csv file for each day. Since the datasource changed to biweekly updates, there may not be a metadata.csv file for the 1st of the month. Add logic to detect month changes and use the earliest metadata.csv file per month instead.

What am I missing? KeyError: '47235b96c07e8066195b6521882340408b9bdd34'

ghSrc/paperetl % python -m paperetl.cord19 2020-08-12
.........
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TfidfVectorizer from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
Traceback (most recent call last):
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/main.py", line 11, in
Execute.run(sys.argv[1],
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/execute.py", line 281, in run
article.metadata = article.metadata + (dates[sha],)
KeyError: '47235b96c07e8066195b6521882340408b9bdd34'
ghSrc/paperetl %

my directory:

paperetl/2020-08-12 % ll
total 14642544
drwxr-xr-x 14 yuanke staff 448 8 14 10:06 .
drwxr-xr-x 13 yuanke staff 416 8 14 09:39 ..
drwxr-xr-x 3 yuanke staff 96 8 14 09:44 __results___files
-rw-r--r--@ 1 yuanke staff 455816 8 5 15:01 attribute
-rw-r--r--@ 1 yuanke staff 206732 8 5 15:01 attribute.csv
-rw-r--r-- 1 yuanke staff 24504 8 13 05:52 changelog
-rw-r--r-- 1 yuanke staff 1375487377 8 13 05:53 cord_19_embeddings.tar.gz
-rw-r--r-- 1 yuanke staff 3143476778 8 13 05:23 cord_19_embeddings_2020-08-12.csv
-rw-r--r--@ 1 yuanke staff 4185255 8 5 15:01 design
-rw-r--r--@ 1 yuanke staff 61843 8 5 15:01 design.csv
drwxr-xr-x 4 yuanke staff 128 8 14 10:04 document_parses
-rw-r--r-- 1 yuanke staff 2638941522 8 13 05:53 document_parses.tar.gz
-rw-r--r--@ 1 yuanke staff 15487674 8 14 01:44 entry-dates.csv
-rw-r--r-- 1 yuanke staff 297398784 8 13 05:53 metadata.csv
paperetl/2020-08-12 %

Issue processing into Elasticsearch

Hi,

I have both paperetl and Elasticsearch set up in Docker containers running on my machine. When I try to process a .pdf file and add it to Elasticsearch, I get the error:

python -m paperetl.file paperetl/data http://localhost:9200 paperetl/models
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/paperetl/file/__main__.py", line 15, in <module>
    sys.argv[4] == "True" if len(sys.argv) > 4 else False,
  File "/usr/local/lib/python3.7/dist-packages/paperetl/file/execute.py", line 176, in run
    db = Factory.create(url, replace)
  File "/usr/local/lib/python3.7/dist-packages/paperetl/factory.py", line 29, in create
    return Elastic(url, replace)
  File "/usr/local/lib/python3.7/dist-packages/paperetl/elastic.py", line 44, in __init__
    exists = self.connection.indices.exists("articles")
  File "/usr/local/lib/python3.7/dist-packages/elasticsearch/_sync/client/utils.py", line 308, in wrapped
    "Positional arguments can't be used with Elasticsearch API methods. "
TypeError: Positional arguments can't be used with Elasticsearch API methods. Instead only use keyword arguments.

I assume it has something to do with Elasticsearch changes in v8, but I'm not sure.

KeyError: 'pdf_json_files'

ghSrc/paperetl % python -m paperetl.cord19 2020-03-27
Building articles database from 2020-03-27
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/execute.py", line 184, in process
sections, citations = Section.parse(row, indir)
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/section.py", line 49, in parse
for path in Section.files(row):
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/section.py", line 100, in files
if row[column]:
KeyError: 'pdf_json_files'
"""

Support spaCy 3.0

Currently, scispacy doesn't have models for spacy 3.0 - see allenai/scispacy#303

A temporary workaround is to install spacy 2.x via

pip install spacy==2.3.5

If scispacy isn't updated in the near term, a dot release will be put out to limit setup.py to spacy 2.x

Scaling to create a process per CPU core overwhelms the GROBID service

When I try to index PDFs on my development machine, I get frequent failures because the GROBID service is overwhelmed by the number of processes allocated for ingesting the PDFs:

ERROR [2023-12-03 11:57:31,564] org.grobid.service.process.GrobidRestProcessFiles: Could not get an engine from the pool within configured time. Sending service unavailable.

The default concurrency setting in GROBID is 10. On my machine os.cpu_count() returns 16, so we are creating more processes than available engines in the GROBID pool.

Whilst this is not an issue in paperetl itself, I think anyone for whom os.cpu_count() returns > 10 will hit this issue. The impact could be mitigated by adding a note to the documentation suggesting users increase the default concurrency limit in GROBID if they hit this error: https://grobid.readthedocs.io/en/latest/Configuration/#service-configuration

I am happy to create a PR for this if you agree.

Fix bug with JSON export

Set a default export function to allow the json.dump() method to write objects it doesn't explicitly have a converter for.
