Haskell: Library/Executable to extract citations from scientific papers

License: Other

Haskell 22.17% TeX 77.83%
haskell paper citation extraction text nlp

grabcite

GrabCite is a tool to generate data sets for tasks like citation recommendation. It supports various input formats such as:

  • Plain text (e.g. pdftotext output)
  • PDF
  • GROBID TEI XML
  • Ice Cite JSON
  • CiteSeerX Databases (experimental)
  • TeX (+ Bib) Files

For each input file, the output consists of three files: one containing the individual sentences and global citation markers, one containing meta information for the citation markers, and one containing the citation markers for the paper itself. To use the tool, build it from source.

Building from source

# install haskell stack
curl -sSL https://get.haskellstack.org/ | sh

# clone the repo
git clone https://github.com/agrafix/grabcite && cd grabcite

# install dependencies and build
stack setup
stack build

# run it
stack exec -- grabcite-datagen --help

Command line interface

The main tool is called grabcite-datagen. It can be run from any Linux command line. There are two supported use cases:

Unpack an arXiv.org dump

To unpack an arXiv.org dump (i.e. extract all archives and find the correct .tex/.bib file), you can run:

grabcite-datagen --in-mode InPdf --in-dir [DUMP_DIR] --recursive --arxiv-to-tex-mode --arxiv-meta-xml [LOCATION_OF_META_XML] --out-dir [TARGET_DIR] --jobs [JOBS]

Generating a data set

To build a data set, first determine the format of your input. As listed above, the supported input modes are:

  • InText: Plain text (e.g. pdftotext output)
  • InPdf: PDF
  • InGrobid: GROBID TEI XML
  • InIceCite: Ice Cite JSON
  • InIceCiteBasic: Ice Cite JSON, but ignore the roles detected
  • InCiteSeerX: CiteSeerX Databases (experimental)
  • InTex: TeX (+ Bib) Files

Then, run the command line:

grabcite-datagen --in-mode [MODE_FROM_ABOVE] --in-dir [DATA_DIR] --recursive --out-dir [OUT_DIR] --jobs 4

If you wish to write the output to a database, add the following flags:

--out-db [POSTGRESQL_CONNECTION_STRING] --db-mig-dir [MIGRATION_FILES] --data-source-name [NAME_OF_DATA_SOURCE]

[MIGRATION_FILES] is the path to the database folder in this repository. You can add --db-overwrite if you wish to overwrite previous data with the same data-source-name.

The experimental InCiteSeerX mode requires you to supply --in-db-conn-str, pointing to the CiteSeerX Oracle DB. This mode has not been properly tested yet.

Add --debug to output debug messages.
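
As a minimal sketch, the basic data set invocation can be wrapped in a small shell function that checks the mode name against the list above and prints the resulting command instead of running it. The function name datagen_cmd and its argument order are hypothetical and not part of grabcite; only the flags and mode names come from this README.

```shell
# Sketch only: print the grabcite-datagen call for a given input mode.
# datagen_cmd is a hypothetical helper, not something shipped with grabcite.
# Arguments: MODE DATA_DIR OUT_DIR [JOBS] (JOBS defaults to 4, as in the README).
# Note: paths containing spaces are printed unquoted, so quote them yourself.
datagen_cmd() {
  mode="$1"
  data_dir="$2"
  out_dir="$3"
  jobs="${4:-4}"

  # Reject anything that is not one of the documented input modes.
  case "$mode" in
    InText|InPdf|InGrobid|InIceCite|InIceCiteBasic|InCiteSeerX|InTex) ;;
    *) echo "unknown input mode: $mode" >&2; return 1 ;;
  esac

  printf 'grabcite-datagen --in-mode %s --in-dir %s --recursive --out-dir %s --jobs %s\n' \
    "$mode" "$data_dir" "$out_dir" "$jobs"
}
```

For example, `datagen_cmd InTex ./papers ./out 2` prints `grabcite-datagen --in-mode InTex --in-dir ./papers --recursive --out-dir ./out --jobs 2`, which you can inspect before running it yourself.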

Hints

To speed up the process between runs, you can copy the ref_cache.json from a previous run into your new output directory before launching the task. This will prevent unneeded DBLP-ID lookups.
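
The copy step could look like the following sketch. The function name seed_ref_cache and both directory arguments are placeholders chosen for illustration; the only name taken from grabcite is ref_cache.json.

```shell
# Sketch only: seed a new output directory with the reference cache
# from an earlier run. seed_ref_cache is a hypothetical helper.
# Arguments: PREV_OUT_DIR NEW_OUT_DIR
seed_ref_cache() {
  prev_out="$1"
  new_out="$2"

  # Create the new output directory if it does not exist yet.
  mkdir -p "$new_out"

  if [ -f "$prev_out/ref_cache.json" ]; then
    cp "$prev_out/ref_cache.json" "$new_out/ref_cache.json"
  else
    # Nothing to reuse; the next run simply starts without a cache.
    echo "no ref_cache.json found in $prev_out" >&2
  fi
}
```

Run it before launching grabcite-datagen, passing the new directory as --out-dir afterwards.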

grabcite's Issues

build steps use wrong version of stack and are missing some libraries

In the README.md, the following instruction:

# install haskell stack
curl -sSL https://get.haskellstack.org/ | sh

pulls down the latest stack version and installs it.

However, the repository has not been updated in a while, and the latest version of stack no longer supports the repository's stack YAML config file.

I got it to work by downloading a stack version from 2018 instead.

I also found that I needed to install several system packages that are not listed in the build steps:

apt install -y pkg-config libpcre++-dev libpq-dev unixodbc-dev

(I am running on Ubuntu 22.04)
