
neuml / paperetl

๐Ÿ“„ โš™๏ธ ETL processes for medical and scientific papers

License: Apache License 2.0

Python 96.56% Makefile 0.78% Dockerfile 2.18% Shell 0.48%
python etl scientific-papers parse medical

paperetl's Introduction

ETL processes for medical and scientific papers



paperetl is an ETL library for processing medical and scientific papers.

(architecture diagram)

paperetl supports the following sources:

  • File formats:
    • PDF
    • XML (arXiv, PubMed, TEI)
    • CSV
  • COVID-19 Research Dataset (CORD-19)

paperetl supports the following output options for storing articles:

  • SQLite
  • Elasticsearch
  • JSON files
  • YAML files

Installation

The easiest way to install is via pip and PyPI

pip install paperetl

Python 3.8+ is supported. Using a Python virtual environment is recommended.
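
For example, a clean environment can be created with Python's built-in venv module before installing (the .venv directory name is arbitrary):

python -m venv .venv
source .venv/bin/activate
pip install paperetl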

paperetl can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+https://github.com/neuml/paperetl

Additional dependencies

PDF parsing relies on an existing GROBID instance being up and running. It is assumed that GROBID is running locally on the ETL server. This is only necessary when processing PDF files.
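
One common way to start a local GROBID instance is via Docker. A minimal sketch using the community lfoppiano/grobid image on GROBID's default port 8070 (the image tag below is an example, check the GROBID documentation for the current release):

# start a local GROBID instance; adjust the tag to the current GROBID release
docker run --rm --init -p 8070:8070 lfoppiano/grobid:0.8.0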

Note: In some cases, the GROBID engine pool can be exhausted, resulting in a 503 error. This can be fixed by increasing concurrency and/or poolMaxWait in the GROBID configuration file.

Docker

A Dockerfile with commands to install paperetl, all of its dependencies and supporting scripts is available in this repository.

wget https://raw.githubusercontent.com/neuml/paperetl/master/docker/Dockerfile
docker build -t paperetl -f Dockerfile .
docker run --name paperetl --rm -it paperetl

This brings up a paperetl command shell. Standard Docker commands can be used to copy files into the container, or commands can be run directly in the shell to retrieve input content.
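
For example, local input files can be staged into the running container with docker cp (the /data destination below is only an illustrative path, adjust it as needed):

# copy local input files into the container; /data is an illustrative destination path
docker cp paperetl/data paperetl:/data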

Examples

Notebooks

Notebook: Introducing paperetl
Description: Overview of the functionality provided by paperetl (Open In Colab)

Load Articles into SQLite

The following example shows how to use paperetl to load a set of medical/scientific articles into a SQLite database.

  1. Download the desired medical/scientific articles into a local directory. For this example, it is assumed the articles are in a directory named paperetl/data.

  2. Build the database

    python -m paperetl.file paperetl/data paperetl/models
    

Once complete, there will be an articles.sqlite file in paperetl/models
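
To spot-check the load, the database can be queried with the sqlite3 command line client (assuming it is installed locally):

sqlite3 paperetl/models/articles.sqlite "SELECT count(*) FROM articles"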

Load into Elasticsearch

Elasticsearch is also a supported datastore, as shown below. This example assumes Elasticsearch is running locally; change the URL to point to a remote server as appropriate.

python -m paperetl.file paperetl/data http://localhost:9200

Once complete, there will be an articles index in Elasticsearch with the metadata and full text stored.
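
A quick sanity check is to ask Elasticsearch for the document count in the articles index using its standard REST API:

curl http://localhost:9200/articles/_count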

Convert articles to JSON/YAML

paperetl can also be used to convert articles into JSON or YAML files. This is useful if the data is to be fed into another system or for manual inspection/debugging of a single file.

JSON:

python -m paperetl.file paperetl/data json://paperetl/json

YAML:

python -m paperetl.file paperetl/data yaml://paperetl/yaml

Converted files will be stored in paperetl/(json|yaml)
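
The generated files can be inspected directly from the command line, for example (the filename below is a placeholder, actual names are derived from the input articles):

ls paperetl/json
python -m json.tool paperetl/json/<article>.json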

Load CORD-19

Note: The final version of CORD-19 was released on 2022-06-22, but it remains a large, valuable set of medical documents.

The following example shows how to use paperetl to load the CORD-19 dataset into a SQLite database.

  1. Download and extract the dataset from the Allen Institute for AI CORD-19 Release Page.

    scripts/getcord19.sh cord19/data
    

    The script above retrieves and unpacks the latest copy of CORD-19 into a directory named cord19/data. An optional second argument sets a specific date of the dataset in the format YYYY-MM-DD (ex. 2021-01-01), which defaults to the latest date.

  2. Generate entry-dates.csv for the current version of the dataset

    python -m paperetl.cord19.entry cord19/data
    

    An optional second argument sets a specific date of the dataset in the format YYYY-MM-DD (ex. 2021-01-01), which defaults to the latest date. This should match the date used in Step 1 (a combined example pinned to a specific date follows Step 3).

  3. Build database

    python -m paperetl.cord19 cord19/data cord19/models
    

    Once complete, there will be an articles.sqlite file in cord19/models. As with earlier examples, the data can also be loaded into Elasticsearch.

    python -m paperetl.cord19 cord19/data http://localhost:9200
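
For reference, here is a combined sketch of all three steps pinned to a specific dataset date (2021-01-01), using the optional date arguments described in Steps 1 and 2:

scripts/getcord19.sh cord19/data 2021-01-01
python -m paperetl.cord19.entry cord19/data 2021-01-01
python -m paperetl.cord19 cord19/data cord19/models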
    

paperetl's People

Contributors

davidmezzetti, elshimone, nialov


paperetl's Issues

Add example notebook

Currently, the only examples are old Kaggle notebooks from the CORD-19 challenge. Add a more recent example.

Additional installation steps and bug for CORD-19

Hi David,
In addition to following your paperetl installation instructions, I had to take these steps to get rid of the following error and warning:

(screenshot of the error)

~$ python3
>>> import nltk
>>> nltk.download('punkt')
>>> exit()        

This created a directory ~/nltk_data/tokenizers/punkt
and fixed the above error.

Also, the UserWarning below was eliminated as follows:

$ pip3 uninstall scikit-learn==0.23.2
$ pip3 install scikit-learn==0.23.1

(screenshot of the warning)

Then, unfortunately, after a full run the resulting articles.sqlite database came out with the Study Design fields, Tags and Labels all NULL (see screenshots below).

(screenshots of the articles.sqlite query results showing NULL values)

Any ideas on how to solve this NULL issue would be appreciated.

Feature: Incremental database update

Currently, ETL processes assume operations are a full database reload each run. This works well for smaller datasets but is inefficient for larger ones.

Add the ability to set the path to an existing database and copy unmodified records from the existing source. This way only new/updated records are processed each run.

SQLite needs a system for reading and inserting articles/sections from another database.

Elasticsearch already handles most of this, just needs a small change to only create the articles index if it doesn't already exist. Merges will be handled by Elasticsearch based on the article id.

Add pre-trained study design models to GitHub

Currently, the pre-trained study design models are stored on Kaggle. Put a copy of these files on the next GitHub release. This will allow automation/docker builds.

# Download pre-trained study design/attribute models
# https://www.kaggle.com/davidmezzetti/cord19-study-design/#attribute
# https://www.kaggle.com/davidmezzetti/cord19-study-design/#design

Add generic CSV source

Add the ability to import article metadata from CSV files similar to the CORD-19 metadata.csv file.

Review and update README.md

Some of the information in README.md is inaccurate, such as the default location for the study design models. Review and update.

Add database flag to determine if database should be replaced

The current functionality of paperetl is to create a new database each run. With the merge/duplicate changes in #36, it now makes more sense for the default action to be to "create or update" a database each run. A flag will be available to force the old behavior and replace the database each run.

AttributeError: 'NoneType' object has no attribute 'upper'

paperetl is great and has been useful for my work! It has been working well for most of the PDF papers I feed it. I am having some issues with certain PDFs. I am new to Python, so it's very likely I am doing something wrong, but I thought I'd reach out.

When I run this for a specific PDF:

python3.10 -m paperetl.file /home/bill/brokenone /home/bill/brokenone /home/bill/brokenone

I get this error:

Processing: /home/bill/brokenone/20 Immune Cells Enhance Selectivity of Nanosecond-Pulsed DBD Plasma Against Tumor Cells.pdf
/usr/local/lib/python3.10/dist-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.
warnings.warn(
Process Process-1:
Total articles inserted: 0
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/execute.py", line 94, in process
for result in Execute.parse(*params):
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/execute.py", line 67, in parse
yield PDF.parse(stream, source)
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/pdf.py", line 34, in parse
return TEI.parse(xml, source) if xml else None
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py", line 55, in parse
sections = TEI.text(soup, title)
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py", line 247, in text
name = figure.get("xml:id").upper()
AttributeError: 'NoneType' object has no attribute 'upper'

PDF extraction improvements

Modify the PDF file extraction process as follows:

  • Use dateutil to parse and format the published/date field
  • Extract content from tables
  • Build uid off the title but fallback to doi if title is not found
  • Add tag of PDF to articles

Error either with or without pre-trained attribute file

I tried running paperetl in AWS (Ubuntu 20.04 LTS t2.small instance with 50 GiB of memory) with the following procedure:

The cord-19_2020-09-01.tar.gz (release) dataset was downloaded and extracted in the following download path: ~/cordata
This extraction created a directory ~/cordata/2020-09-01 containing the following files:
~/cordata/2020-09-01/document_parses.tar.gz
~/cordata/2020-09-01/metadata.csv
document_parses.tar.gz was further extracted as a directory named document_parses, which contained the following 2 subdirectories:
~/cordata/2020-09-01/document_parses/pdf_json
~/cordata/2020-09-01/document_parses/pmc_json

entry-dates.csv generated in Kaggle https://www.kaggle.com/davidmezzetti/cord-19-article-entry-dates?scriptVersionId=41813239 was also placed in this directory; therefore, the command
~/cordata/2020-09-01$ python3 -m paperetl.cord19 .
was executed from the ~/cordata/2020-09-01 directory containing the following:
~/cordata/2020-09-01/document_parses
~/cordata/2020-09-01/entry-dates.csv
~/cordata/2020-09-01/metadata.csv

The above procedure gave the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/.cord19/models/attribute'

This error resulted despite the fact that I had pre-trained attribute and design files:

~/.cord19/models/attribute (from https://www.kaggle.com/davidmezzetti/cord19-study-design/#attribute)
~/.cord19/models/design (from https://www.kaggle.com/davidmezzetti/cord19-study-design/#design )

In another attempt, without using these 2 pre-trained files (i.e. starting with an empty ~/.cord19/models directory), I still got the exact same error message.

See error details in the following screenshot:

(screenshot of the error)

Any help would be appreciated.

Add PubMed as source

Support loading full-text open access documents via PubMed API queries.

This will add support for both PubMed MEDLINE archives and articles pulled via the API.

Windows install issue

Fix issue caused by trailing slash in setup.py

ValueError: path 'src/python/' cannot end with '/'

sqlite3.OperationalError: database is locked

!python -m paperetl.file paperetl/file/data paperetl/models

get the following error:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Traceback (most recent call last):
File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/file/main.py", line 11, in
Execute.run(
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/file/execute.py", line 176, in run
db = Factory.create(url, replace)
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/factory.py", line 36, in create
return SQLite(url.replace("sqlite://", ""), replace)
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/sqlite.py", line 104, in init
self.create(SQLite.ARTICLES, "articles")
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/sqlite.py", line 198, in create
self.cur.execute(create)
sqlite3.OperationalError: database is locked

Update CORD-19 entry dates source

The CORD-19 releases page is no longer being updated consistently. Switch the entry date generation process to use the latest changelog file.

sample lines for running etl server and grobid instance

Apologies if my question is too silly.
In the description you wrote:
"PDF parsing relies on an existing GROBID instance to be up and running.
It is assumed that this is running locally on the ETL server"
Can you provide some sample lines about how to do that?
Best regards

Remove study attribute and design models and all related dependencies

Currently, paperetl has a couple of statistical study design models to detect common study design fields. This requires a large NLP pipeline backed by spaCy to run a series of NLP/grammar steps. While this was initially a good solution in mid-2020, there are now better ways to do this.

Furthermore, the NLP pipelines are slow and add significant processing overhead. Last but not least, paperetl can process both medical and technical/scientific papers, while these fields are medical-specific. This functionality is more appropriate for the paperai project and the NLP logic should reside within that project.

Improve PMB filtering logic

Make the following improvements:

  • Allow filtering on article ids in addition to codes
  • Check filters before processing full text to improve performance

Filter duplicate ids

Currently there is no duplicate detection within a single run during file processing. Add this capability, similar to what is in the CORD-19 process.

Remove citations table/index

There is no major known use case for storing citations. Remove as this wouldn't be easy to support for incremental loads.

Use XML id for file figure processing

Currently the file process attempts to find a caption/label/name to use as the section name for TEI files. This is error prone. xml:id is unique and more reliable.

Zotero connection

Greetings,

Thanks for working on this! Is there a neat way to borrow the metadata of my pdf files from the Zotero database instead of relying on parsing from the PDF?

Add component to build entry-dates.csv

Currently, for the CORD-19 dataset, entry-dates.csv is required to be manually downloaded using the following instructions:

# Download entry-dates.csv and place in <download path>
# https://www.kaggle.com/davidmezzetti/cord-19-article-entry-dates/output

It should be possible to build entry-dates.csv outside of Kaggle to allow automation/Docker builds. The Kaggle entry-dates component should be updated to call this new component.

Remove legacy merge logic

#34 removed all study design and attribute detection in paperetl in favor of paperai. paperetl is now significantly faster without spaCy pipelines slowing things down. With that, the legacy merge process designed to overcome performance concerns can be removed. It can be replaced with simple duplicate detection and replacement based on entry date.

Detect month changes in CORD-19 entry date process

Currently, the entry date download process assumes there is a metadata.csv file for each day. Since the datasource changed to biweekly updates, there may not be a metadata.csv file for the 1st of the month. Add logic to detect month changes and use the earliest metadata.csv file per month instead.

What am I missing? KeyError: '47235b96c07e8066195b6521882340408b9bdd34'

ghSrc/paperetl % python -m paperetl.cord19 2020-08-12
.........
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TfidfVectorizer from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
Traceback (most recent call last):
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/main.py", line 11, in
Execute.run(sys.argv[1],
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/execute.py", line 281, in run
article.metadata = article.metadata + (dates[sha],)
KeyError: '47235b96c07e8066195b6521882340408b9bdd34'
ghSrc/paperetl %

my directory:

paperetl/2020-08-12 % ll
total 14642544
drwxr-xr-x 14 yuanke staff 448 8 14 10:06 .
drwxr-xr-x 13 yuanke staff 416 8 14 09:39 ..
drwxr-xr-x 3 yuanke staff 96 8 14 09:44 __results___files
-rw-r--r--@ 1 yuanke staff 455816 8 5 15:01 attribute
-rw-r--r--@ 1 yuanke staff 206732 8 5 15:01 attribute.csv
-rw-r--r-- 1 yuanke staff 24504 8 13 05:52 changelog
-rw-r--r-- 1 yuanke staff 1375487377 8 13 05:53 cord_19_embeddings.tar.gz
-rw-r--r-- 1 yuanke staff 3143476778 8 13 05:23 cord_19_embeddings_2020-08-12.csv
-rw-r--r--@ 1 yuanke staff 4185255 8 5 15:01 design
-rw-r--r--@ 1 yuanke staff 61843 8 5 15:01 design.csv
drwxr-xr-x 4 yuanke staff 128 8 14 10:04 document_parses
-rw-r--r-- 1 yuanke staff 2638941522 8 13 05:53 document_parses.tar.gz
-rw-r--r--@ 1 yuanke staff 15487674 8 14 01:44 entry-dates.csv
-rw-r--r-- 1 yuanke staff 297398784 8 13 05:53 metadata.csv
paperetl/2020-08-12 %

Issue processing into Elasticsearch

Hi,

I have both paperetl and Elasticsearch set up in Docker containers running on my machine. When I try to process a .pdf file and add it to Elasticsearch, I get the error:

python -m paperetl.file paperetl/data http://localhost:9200 paperetl/models
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/paperetl/file/__main__.py", line 15, in <module>
    sys.argv[4] == "True" if len(sys.argv) > 4 else False,
  File "/usr/local/lib/python3.7/dist-packages/paperetl/file/execute.py", line 176, in run
    db = Factory.create(url, replace)
  File "/usr/local/lib/python3.7/dist-packages/paperetl/factory.py", line 29, in create
    return Elastic(url, replace)
  File "/usr/local/lib/python3.7/dist-packages/paperetl/elastic.py", line 44, in __init__
    exists = self.connection.indices.exists("articles")
  File "/usr/local/lib/python3.7/dist-packages/elasticsearch/_sync/client/utils.py", line 308, in wrapped
    "Positional arguments can't be used with Elasticsearch API methods. "
TypeError: Positional arguments can't be used with Elasticsearch API methods. Instead only use keyword arguments.

I assume it has something to do with Elasticsearch changes in v8, but I'm not sure.

KeyError: 'pdf_json_files'

ghSrc/paperetl % python -m paperetl.cord19 2020-03-27
Building articles database from 2020-03-27
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/execute.py", line 184, in process
sections, citations = Section.parse(row, indir)
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/section.py", line 49, in parse
for path in Section.files(row):
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/section.py", line 100, in files
if row[column]:
KeyError: 'pdf_json_files'
"""

Support spaCy 3.0

Currently, scispacy doesn't have models for spacy 3.0 - see allenai/scispacy#303

A temporary workaround is to install spacy 2.x via

pip install spacy==2.3.5

If scispacy isn't updated in the near term, a dot release will be put out to limit setup.py to spacy 2.x

Scaling to create a process per CPU core overwhelms the GROBID service

When I try to index PDFs on my development machine, I get frequent failures because the GROBID service is overwhelmed by the number of processes allocated for ingesting the PDFs:

ERROR [2023-12-03 11:57:31,564] org.grobid.service.process.GrobidRestProcessFiles: Could not get an engine from the pool within configured time. Sending service unavailable.

The default concurrency setting in GROBID is 10. On my machine os.cpu_count() returns 16, so we are creating more processes than available engines in the GROBID pool.

Whilst this is not an issue in paperetl itself, I think anyone for whom os.cpu_count() returns > 10 will hit this issue. The impact could be mitigated by adding a note to the documentation suggesting users increase the default concurrency limit in GROBID if they hit this error: https://grobid.readthedocs.io/en/latest/Configuration/#service-configuration

I am happy to create a PR for this if you agree.

Fix bug with JSON export

Set a default export function to allow the json.dump() method to write objects it doesn't explicitly have a converter for.
