This project forked from allenai/pubmedextract

extracting demographics from tables in pubmed papers

License: Other

Python 100.00%

PubMed-Extract

Quantifying demographic bias in clinical trials using a corpus of academic papers.

This repo accompanies the paper: Quantifying Sex Bias in Clinical Studies at Scale with Automated Data Extraction by Sergey Feldman, Waleed Ammar, Kyle Lo, Elly Trepman, Madeleine van Zuylen, and Oren Etzioni.

This code takes as input a clinical trial paper parsed by Omnipage and returns the extracted number of participating women and men.

This package is being released for (a) algorithmic documentation and (b) reproduction of the statistical analysis. Because it depends on Omnipage, it will not work for extracting clinical trial participant counts from new PDFs you may have. For an example of the type of input that PubMed-Extract expects, see tests/test_sex/papers/.

The code for (a) is located in pubmedextract/. The code for (b) is located in analysis_scripts/.

Note that the paper compares PubMed-Extract to an algorithm referred to as AACT-Query. This is a relatively simple algorithm and its execution (essentially a SQL query) is contained entirely within analysis_scripts/04_analysis.py.

Installation

This project requires Python 3.6. We recommend you set up a conda environment:

conda create -n pubmedextract python=3.6
conda activate pubmedextract
conda install spacy=1.9 thinc=6.5.2 matplotlib=3.0.1 seaborn=0.9.0 joblib=0.12.0 psycopg2=2.7.5 pandas=0.23.4 statsmodels=0.9.0 patsy pytest pylint

You may need to run source activate pubmedextract instead of conda activate pubmedextract, depending on your Anaconda version.

Then clone the repo, and install it (along with remaining requirements).

git clone https://github.com/allenai/pubmedextract.git
cd pubmedextract
python setup.py install

Tests

After installing, you can run the linter and all the unit tests:

pylint --disable=R,C,W pubmedextract
python -m pytest tests/

Usage Example: Extracting Gender Counts from Available JSON Inputs

A simple example is in scripts/parse_paper_example.py, and also reproduced in its entirety below:

import pickle
from pubmedextract.sex import get_sex_counts
from pubmedextract.table_utils import PaperTable

# load some example papers
# assumes the cwd is pubmedextract/
with open('tests/test_sex/test_papers_and_counts.pickle', 'rb') as f:
    s2ids_and_true_counts, _ = pickle.load(f)

# get the counts and print them out
for s2id, true_counts in s2ids_and_true_counts:
    paper = PaperTable(s2id, 'tests/test_sex/papers/')
    demographic_info = get_sex_counts(paper)
    print('True counts:', true_counts)
    print('Estimated counts:', demographic_info.counts_dict, '\n')
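
The true and estimated counts printed above can be reduced to one number per paper, such as the fraction of participants who are women. The following is a minimal sketch, assuming the counts are available as a dictionary. The helper female_fraction and the key names "females" and "males" are hypothetical and are not part of pubmedextract, so adjust them to whatever counts_dict actually contains.

```python
from typing import Mapping, Optional


def female_fraction(counts: Mapping[str, int],
                    female_key: str = "females",
                    male_key: str = "males") -> Optional[float]:
    """Return women / (women + men), or None if no participants were counted.

    The key names are assumptions made for this sketch; they are not taken
    from the pubmedextract API.
    """
    women = counts.get(female_key, 0)
    men = counts.get(male_key, 0)
    total = women + men
    if total == 0:
        # Avoid dividing by zero when nothing was extracted.
        return None
    return women / total


# Hypothetical counts, not taken from any paper in the test set.
example_counts = {"females": 30, "males": 70}
print(female_fraction(example_counts))
```

Returning None for an empty result keeps papers with no extracted counts distinguishable from papers that report zero women.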

Paper Analysis Reproduction

The scripts needed to reproduce the analyses in the paper Quantifying Sex Bias in Clinical Studies at Scale with Automated Data Extraction can be found here: https://github.com/allenai/pubmedextract/tree/master/analysis_scripts.
