saiajaym / fexrep

This project forked from amm-kun/score_psu

Feature EXtraction and Representation framework to extract features from scientific publications

FREX: Feature Representation Extraction Framework for Claim Reproducibility Prediction

Feature Extraction Pipeline

The pipeline is designed to extract features from scholarly work. Given a scholarly work, it extracts various features and writes them to a CSV file.

The pipeline cannot process PDFs directly. Instead, it takes PDF files that have been preprocessed with GROBID and pdf2text. This preprocessing can be done either by the pipeline itself or separately, before the features are extracted.

Preprocessing PDFs

Every PDF file has to be preprocessed with two tools:

  1. GROBID
  2. PDF2Text

Preprocessing can be done either through the pipeline or by running GROBID and pdf2text separately. In both cases, a working installation of GROBID and pdf2text is required.
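A quick pre-flight check for both tools could look like the sketch below. It is not part of the repository. It assumes that GROBID listens on its default local address http://localhost:8070 and that the converter is on the PATH under the name pdf2text; adjust both to your setup. /api/isalive is GROBID's health-check endpoint.

```python
import shutil
import urllib.request


def missing_tools(names):
    """Return the executables from `names` that are not found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


def grobid_is_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if the GROBID service answers its health-check endpoint.

    base_url is an assumption (GROBID's default local address).
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts and HTTP errors raised by urllib.
        return False


if __name__ == "__main__":
    # "pdf2text" is the tool name used in this README; the executable name
    # on your system may differ.
    print("missing executables:", missing_tools(["pdf2text"]))
    print("GROBID reachable:", grobid_is_alive())
```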

When preprocessing with GROBID, the conversion must use full-text mode, i.e. the /api/processFulltextDocument endpoint. Please refer to the GROBID documentation for more details.
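For readers who run GROBID by hand, a minimal sketch of a full-text request using only the Python standard library follows. It is an illustration, not the code used by process_docs.py. The base URL (GROBID's default local port) and the `<name>.tei.xml` output naming are assumptions; `input` is the form field GROBID expects for the uploaded PDF.

```python
import urllib.request
import uuid
from pathlib import Path

GROBID_URL = "http://localhost:8070"  # assumption: default local GROBID address


def build_multipart(field, filename, payload):
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def pdf_to_tei(pdf_path, out_dir, base_url=GROBID_URL):
    """Send one PDF to GROBID in full-text mode and save the returned TEI XML."""
    pdf_path = Path(pdf_path)
    body, content_type = build_multipart("input", pdf_path.name, pdf_path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/api/processFulltextDocument",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        tei = response.read()
    # Output naming is an assumption made for this sketch.
    out_path = Path(out_dir) / f"{pdf_path.stem}.tei.xml"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(tei)
    return out_path
```

Calling `pdf_to_tei("paper.pdf", "tei_out")` with a running GROBID instance would write `tei_out/paper.tei.xml`.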

Once GROBID is installed and running and pdf2text is installed, the pipeline can preprocess the PDF files with the following command:

python process_docs.py --mode process-pdfs --pdf_input DIR_TO_PDFs -out OUTPUT_DIR

Alternatively, one can process them separately without using the pipeline.

Running the pipeline feature extraction

Once the PDFs have been processed with GROBID and pdf2text, run the feature extraction pipeline with the following command:

python process_docs.py -out PROCESSED_GROBID_FILES -in TEXT_FILES -m generate-train -csv OUTPUT_DIR

-out: path to the preprocessed PDF files in tei.xml format (GROBID output)

-in: path to the preprocessed PDF files in txt format (pdf2text output)

-m: run mode, here generate-train

-csv: CSV output directory

For more details, refer to the process_docs.py file.
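Because feature extraction reads both a TEI file and a text file for each paper, it can help to confirm that the two directories line up before starting a run. The sketch below is a hypothetical helper, not part of the repository. It assumes the files are named `<name>.tei.xml` and `<name>.txt`, which may not match the naming your preprocessing step produces.

```python
from pathlib import Path


def stems(directory, suffix):
    """Return the base names of the files in `directory` that end with `suffix`."""
    return {
        path.name[: -len(suffix)]
        for path in Path(directory).iterdir()
        if path.is_file() and path.name.endswith(suffix)
    }


def unpaired_documents(tei_dir, txt_dir):
    """Return (names with TEI but no text, names with text but no TEI)."""
    tei = stems(tei_dir, ".tei.xml")  # assumed GROBID output naming
    txt = stems(txt_dir, ".txt")      # assumed pdf2text output naming
    return sorted(tei - txt), sorted(txt - tei)
```

If both returned lists are empty, every document has both inputs and the directories can be passed to -out and -in.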

Project Structure

Important files for reference:

| File | Description |
| --- | --- |
| process_docs.py | Code execution starts here. There are 2 main modes: (1) preprocess, (2) generate feature set. |
| extractor.py | The GROBID output is broken down into various features, and the extracted information is used to call the Elsevier, Crossref, and Semantic Scholar APIs. |
| elsevier.py | Output from the Elsevier, Crossref, and Semantic Scholar APIs is parsed and returned. |
| XIN.py | The acknowledgement section is processed to identify funding information. |

NOTE: The Elsevier API key may expire after a certain number of hits. For batch processing, it is better to update the API key details from the Elsevier developer portal first. Check the same for the Semantic Scholar API.

NOTE: Download the citation sentiment model from the link and place it at pipeline/tamu_features/rec_model/pytorch_model.bin.
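A small check that the model file is where the pipeline expects it can avoid a failed run. The sketch below is a hypothetical helper, not part of the repository. It assumes the path in the note is relative to the repository root.

```python
from pathlib import Path

# Path taken from the note above; assumed to be relative to the repository root.
MODEL_PATH = Path("pipeline/tamu_features/rec_model/pytorch_model.bin")


def model_is_present(repo_root="."):
    """Return True if the citation sentiment model exists and is not empty."""
    path = Path(repo_root) / MODEL_PATH
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    if not model_is_present():
        print(f"Model not found: place pytorch_model.bin at {MODEL_PATH}")
```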

Contributors

amm-kun, rajaln-3197, saiajaym, sreesaiteja
