neuml / paperetl
📄 ⚙️ ETL processes for medical and scientific papers
License: Apache License 2.0
Make the following improvements:
Hi David,
In addition to following your paperetl installation instructions I had to take these steps to get rid of the following error and warning:
~$ python3
>>> import nltk
>>> nltk.download("punkt")
>>> exit()
this created a directory ~/nltk_data/tokenizers/punkt
and fixed the above error
Also, the UserWarning below was eliminated as follows:
$ pip3 uninstall scikit-learn==0.23.2
$ pip3 install scikit-learn==0.23.1
Then, unfortunately, after a full run the resulting articles.sqlite database came out with the Study Design, Tags and Labels fields all NULL (see screenshots below).
Any ideas on how to solve this NULL issue would be appreciated.
Some of the information in README.md is inaccurate, such as the default location for the study design models. Review and update.
Currently, the file ETL process is single-threaded. With #34, it is now much easier to parallelize this process. This change will use a multiprocessing pool to process files.
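The pooled approach can be sketched as follows, assuming a hypothetical `process_file` worker; the names are illustrative, not paperetl's actual API:

```python
import multiprocessing


def process_file(path):
    # Placeholder for the per-file parse step (hypothetical name)
    return f"processed {path}"


def run(paths):
    # Fan the file list out across a pool of worker processes
    with multiprocessing.Pool() as pool:
        return pool.map(process_file, paths)
```

Each worker handles one file independently, so results come back in input order via `pool.map`.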
Add .pre-commit-config.yaml file to enable checks for code quality.
For file-based jobs with an input directory, support traversing multi-level directory structures to process all PDF articles.
Currently, ETL processes assume each run is a full database reload. This works well for smaller datasets, but it's inefficient for larger ones.
Add the ability to set the path to an existing database and copy unmodified records from the existing source. This way only new/updated records are processed each run.
SQLite needs a system for reading and inserting articles/sections from another database.
Elasticsearch already handles most of this, just needs a small change to only create the articles index if it doesn't already exist. Merges will be handled by Elasticsearch based on the article id.
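For the SQLite side, one way to copy unmodified rows from an existing database is SQLite's ATTACH. A minimal sketch, assuming a simplified `articles` table keyed by an `id` column (the function name and schema are illustrative):

```python
import sqlite3


def copy_articles(current_path, existing_path, ids):
    # Attach the prior run's database and copy over rows whose ids
    # are unchanged, so only new/updated records need reprocessing
    db = sqlite3.connect(current_path)
    db.execute("ATTACH DATABASE ? AS prior", (existing_path,))
    db.executemany(
        "INSERT INTO articles SELECT * FROM prior.articles WHERE id = ?",
        [(i,) for i in ids],
    )
    db.commit()
    db.close()
```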
Set a default export function to allow the json.dump() method to write objects it doesn't explicitly have a converter for.
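A common pattern for this (not necessarily the exact converter paperetl adopted) is a fallback that stringifies anything `json` can't serialize natively:

```python
import datetime
import json


def default(obj):
    # Fallback converter: stringify objects json has no built-in encoder for
    return str(obj)


# dates are not JSON-serializable by default; the fallback handles them
data = {"entry": datetime.date(2020, 8, 12)}
print(json.dumps(data, default=default))  # {"entry": "2020-08-12"}
```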
Hi,
I have both paperetl and elasticsearch set up in docker containers running on my machine. When I try and process a .pdf file and add it to elasticsearch I get the error:
python -m paperetl.file paperetl/data http://localhost:9200 paperetl/models
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/dist-packages/paperetl/file/__main__.py", line 15, in <module>
sys.argv[4] == "True" if len(sys.argv) > 4 else False,
File "/usr/local/lib/python3.7/dist-packages/paperetl/file/execute.py", line 176, in run
db = Factory.create(url, replace)
File "/usr/local/lib/python3.7/dist-packages/paperetl/factory.py", line 29, in create
return Elastic(url, replace)
File "/usr/local/lib/python3.7/dist-packages/paperetl/elastic.py", line 44, in __init__
exists = self.connection.indices.exists("articles")
File "/usr/local/lib/python3.7/dist-packages/elasticsearch/_sync/client/utils.py", line 308, in wrapped
"Positional arguments can't be used with Elasticsearch API methods. "
TypeError: Positional arguments can't be used with Elasticsearch API methods. Instead only use keyword arguments.
I assume it has something to do with Elasticsearch changing in v8, but I'm not sure.
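In elasticsearch-py 8.x, API methods no longer accept positional arguments, so the call in paperetl/elastic.py needs the index name passed as a keyword. A sketch of the likely one-line fix, against the 8.x client API:

```python
# paperetl/elastic.py — elasticsearch-py 8.x rejects positional arguments

# 7.x style (raises TypeError on the 8.x client):
#   exists = self.connection.indices.exists("articles")

# 8.x style:
exists = self.connection.indices.exists(index="articles")
```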
The current functionality of paperetl is to create a new database each run. With the merge/duplicate changes in #36, it now makes more sense for the default action to be "create or update" a database each run. A flag will be available to force the old behavior and replace the database each run.
paperoni allows searching for scientific papers and downloading their corresponding PDFs (when available). Evaluate possible ways to integrate with paperetl.
ghSrc/paperetl % python -m paperetl.cord19 2020-08-12
.........
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TfidfVectorizer from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
Traceback (most recent call last):
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/__main__.py", line 11, in <module>
Execute.run(sys.argv[1],
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/execute.py", line 281, in run
article.metadata = article.metadata + (dates[sha],)
KeyError: '47235b96c07e8066195b6521882340408b9bdd34'
ghSrc/paperetl %
my directory:
paperetl/2020-08-12 % ll
total 14642544
drwxr-xr-x 14 yuanke staff 448 8 14 10:06 .
drwxr-xr-x 13 yuanke staff 416 8 14 09:39 ..
drwxr-xr-x 3 yuanke staff 96 8 14 09:44 __results___files
-rw-r--r--@ 1 yuanke staff 455816 8 5 15:01 attribute
-rw-r--r--@ 1 yuanke staff 206732 8 5 15:01 attribute.csv
-rw-r--r-- 1 yuanke staff 24504 8 13 05:52 changelog
-rw-r--r-- 1 yuanke staff 1375487377 8 13 05:53 cord_19_embeddings.tar.gz
-rw-r--r-- 1 yuanke staff 3143476778 8 13 05:23 cord_19_embeddings_2020-08-12.csv
-rw-r--r--@ 1 yuanke staff 4185255 8 5 15:01 design
-rw-r--r--@ 1 yuanke staff 61843 8 5 15:01 design.csv
drwxr-xr-x 4 yuanke staff 128 8 14 10:04 document_parses
-rw-r--r-- 1 yuanke staff 2638941522 8 13 05:53 document_parses.tar.gz
-rw-r--r--@ 1 yuanke staff 15487674 8 14 01:44 entry-dates.csv
-rw-r--r-- 1 yuanke staff 297398784 8 13 05:53 metadata.csv
paperetl/2020-08-12 %
CORD-19 is no longer updated. Update scripts and documentation to note that.
!python -m paperetl.file paperetl/file/data paperetl/models
I get the following error:
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers
before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Traceback (most recent call last):
File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/file/__main__.py", line 11, in <module>
Execute.run(
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/file/execute.py", line 176, in run
db = Factory.create(url, replace)
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/factory.py", line 36, in create
return SQLite(url.replace("sqlite://", ""), replace)
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/sqlite.py", line 104, in __init__
self.create(SQLite.ARTICLES, "articles")
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/paperetl/sqlite.py", line 198, in create
self.cur.execute(create)
sqlite3.OperationalError: database is locked
Currently, each source has its own local method for creating a global Grammar object. Standardize this logic.
Currently, the pre-trained study design models are stored on Kaggle. Put a copy of these files on the next GitHub release. This will allow automation/docker builds.
# Download pre-trained study design/attribute models
# https://www.kaggle.com/davidmezzetti/cord19-study-design/#attribute
# https://www.kaggle.com/davidmezzetti/cord19-study-design/#design
#34 removed all study design and attribute detection in paperetl in favor of paperai. paperetl is now significantly faster without spaCy pipelines slowing things down. With that, the legacy merge process designed to overcome performance concerns can be removed and replaced with simple duplicate detection that replaces records based on entry date.
Python 3.6 reaches EOL in a matter of days. Update scripts and requirements to 3.7.
Support loading full-text open access documents via PubMed API queries.
This will add support for both PubMed MEDLINE archives and articles pulled via the API.
Currently, the entry date download process assumes there is a metadata.csv file for each day. Since the datasource changed to biweekly updates, there may not be a metadata.csv file for the 1st of the month. Add logic to detect month changes and use the earliest metadata.csv file per month instead.
Fix issue caused by trailing slash in setup.py
ValueError: path 'src/python/' cannot end with '/'
Support loading full-text arXiv documents via arXiv API queries
Greetings,
Thanks for working on this! Is there a neat way to borrow the metadata of my pdf files from the Zotero database instead of relying on parsing from the PDF?
Need to update the hyperparameter names as they are currently not correct.
For file-based jobs interfacing with GROBID, add better error handling around publication date parsing.
Improve the accuracy of sample size extraction. Add unit tests.
Add the ability to import article metadata from CSV files similar to the CORD-19 metadata.csv file.
Coverage is currently 67% - need to improve it.
I tried running paperetl in AWS (ubuntu 20.04 LTS t2.small instance with 50 GiB of storage) with the following procedure:
The cord-19_2020-09-01.tar.gz (release) dataset was downloaded and extracted in the following download path: ~/cordata
This extraction created a directory ~/cordata/2020-09-01 containing the following files:
~/cordata/2020-09-01/document_parses.tar.gz
~/cordata/2020-09-01/metadata.csv
document_parses.tar.gz was further extracted as a directory named document_parses, which contained the following 2 subdirectories:
~/cordata/2020-09-01/document_parses/pdf_json
~/cordata/2020-09-01/document_parses/pmc_json
entry-dates.csv generated in Kaggle https://www.kaggle.com/davidmezzetti/cord-19-article-entry-dates?scriptVersionId=41813239 was also placed in this directory; therefore, the command
~/cordata/2020-09-01$ python3 -m paperetl.cord19 .
was executed from the ~/cordata/2020-09-01 directory containing the following:
~/cordata/2020-09-01/document_parses
~/cordata/2020-09-01/entry-dates.csv
~/cordata/2020-09-01/metadata.csv
The above procedure gave the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/.cord19/models/attribute'
This error occurred despite the fact that I had pre-trained attribute and design files:
~/.cord19/models/attribute (from https://www.kaggle.com/davidmezzetti/cord19-study-design/#attribute)
~/.cord19/models/design (from https://www.kaggle.com/davidmezzetti/cord19-study-design/#design )
In another attempt, without using these 2 pre-trained files (i.e. starting with an empty ~/.cord19/models directory), I still got the exact same error message.
See error details in the following screenshot:
Any help would be appreciated.
Fix an error when a full reprocess is set when there are no updates. This should instead return with no updates.
Apologies if my question is too silly.
In the description you wrote:
"PDF parsing relies on an existing GROBID instance to be up and running.
It is assumed that this is running locally on the ETL server"
Can you provide some sample lines about how to do that?
Best regards
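For reference, one common way to stand up a local GROBID instance is via Docker; the image tag below is illustrative, so check the GROBID documentation for the current release:

```shell
# Start GROBID on its default port (8070)
docker run --rm --init -p 8070:8070 lfoppiano/grobid:0.7.2

# Confirm the service is up before running paperetl
curl http://localhost:8070/api/isalive
```

paperetl's PDF parsing would then talk to the service at http://localhost:8070.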
Currently, paperetl has a couple of statistical study design models to detect common study design fields. This requires a large NLP pipeline backed by spaCy to run a series of NLP/grammar steps. While this was a good solution in mid-2020, there are now better ways to do this.
Furthermore, the NLP pipelines are slow and add significant processing overhead. Last but not least, paperetl can process both medical and technical/scientific papers, yet these fields are medical-specific. This functionality is more appropriate for the paperai project, and the NLP logic should reside within that project.
Currently, scispacy doesn't have models for spacy 3.0 - see allenai/scispacy#303
A temporary workaround is to install spacy 2.x via
pip install spacy==2.3.5
If scispacy isn't updated in the near term, a dot release will be put out to limit setup.py to spacy 2.x
Currently, the only examples are old Kaggle notebooks from the CORD-19 challenge. Add a more recent example.
Modify the PDF file extraction process as follows:
Check the file extension of each input file and read using a gzip stream if it's a compressed file
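The extension check described above can be sketched as follows; the helper name is illustrative:

```python
import gzip


def open_stream(path):
    # Read through a gzip stream when the extension indicates compression,
    # otherwise open the file directly
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")
```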
Some text sections in CORD-19 are extremely long, often with large RNA sequences as text which won't split into sentences. Add logic to handle these types of sections.
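One simple guard for such sections is to cap the size of any single chunk; a sketch with an illustrative threshold:

```python
def split_section(text, maxlen=100000):
    # Break extremely long unsplittable sections (e.g. raw RNA sequences)
    # into fixed-size chunks instead of passing them to a sentence splitter
    return [text[i:i + maxlen] for i in range(0, len(text), maxlen)]
```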
Update setup.py to only show standard image on PyPI
paperetl is great and has been useful for my work! It has been working well for most of the PDF papers I feed it, but I am having some issues with certain PDFs. I am new to Python, so it's very likely I am doing something wrong, but I thought I'd reach out.
When I run this for a specific PDF:
python3.10 -m paperetl.file /home/bill/brokenone /home/bill/brokenone /home/bill/brokenone
I get this error:
Processing: /home/bill/brokenone/20 Immune Cells Enhance Selectivity of Nanosecond-Pulsed DBD Plasma Against Tumor Cells.pdf
/usr/local/lib/python3.10/dist-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml"
into the BeautifulSoup constructor.
warnings.warn(
Process Process-1:
Total articles inserted: 0
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/execute.py", line 94, in process
for result in Execute.parse(*params):
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/execute.py", line 67, in parse
yield PDF.parse(stream, source)
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/pdf.py", line 34, in parse
return TEI.parse(xml, source) if xml else None
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py", line 55, in parse
sections = TEI.text(soup, title)
File "/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py", line 247, in text
name = figure.get("xml:id").upper()
AttributeError: 'NoneType' object has no attribute 'upper'
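The crash occurs when a `<figure>` element lacks an `xml:id` attribute: `figure.get("xml:id")` returns None, and calling `.upper()` on it fails. A defensive sketch of the kind of guard tei.py needs; the fallback name is illustrative:

```python
def figure_name(figure, default="FIGURE"):
    # figure.get returns None when the attribute is missing; fall back
    # to a default instead of calling .upper() on None
    xmlid = figure.get("xml:id")
    return xmlid.upper() if xmlid else default
```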
Python 3.7 is now EOL. The minimum supported version of Python for releases moving forward should be 3.8+
Currently there is no duplicate detection within a single run during file processing. Add this capability, similar to what is in the CORD-19 process.
The CORD-19 releases page is no longer being updated consistently. Switch the entry date generation process to use the latest changelog file.
There is no major known use case for storing citations. Remove as this wouldn't be easy to support for incremental loads.
Build a dockerfile for instantiating a paperetl environment
Currently for file processes, PDF is hard-coded into the source column. The file name should be used instead.
Add unit tests to paperetl to help with quality assurance
When I try and index PDFs on my development machine, I get frequent failures because the grobid service is overwhelmed by the number of processes allocated for ingesting the PDFs:
ERROR [2023-12-03 11:57:31,564] org.grobid.service.process.GrobidRestProcessFiles: Could not get an engine from the pool within configured time. Sending service unavailable.
The default concurrency setting in grobid is 10. On my machine os.cpu_count() returns 16, so we are creating more processes than there are available engines in the grobid pool.
Whilst this is not an issue in paperetl itself, I think anyone for whom os.cpu_count() returns > 10 will hit this issue. The impact could be mitigated by adding a note to the documentation suggesting users increase the default concurrency limit in grobid if they see this error. https://grobid.readthedocs.io/en/latest/Configuration/#service-configuration
I am happy to create a PR for this if you agree.
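Alternatively, paperetl could cap its worker count at GROBID's default engine pool size; a sketch, assuming the pool is sized from os.cpu_count():

```python
import os

# GROBID's default service concurrency limit
GROBID_POOL_SIZE = 10


def worker_count():
    # Never spawn more workers than GROBID engines are available
    return min(os.cpu_count() or 1, GROBID_POOL_SIZE)
```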
Currently the file process attempts to find a caption/label/name to use as the section name for TEI files. This is error prone. xml:id is unique and more reliable.
Currently, for the CORD-19 dataset, entry-dates.csv is required to be manually downloaded using the following instructions:
# Download entry-dates.csv and place in <download path>
# https://www.kaggle.com/davidmezzetti/cord-19-article-entry-dates/output
entry-dates.csv should be able to be built outside of Kaggle, to allow automation/docker builds. The Kaggle entry-dates component should be updated to call this new component.
ghSrc/paperetl % python -m paperetl.cord19 2020-03-27
Building articles database from 2020-03-27
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/execute.py", line 184, in process
sections, citations = Section.parse(row, indir)
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/section.py", line 49, in parse
for path in Section.files(row):
File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/section.py", line 100, in files
if row[column]:
KeyError: 'pdf_json_files'
"""