avorio / science-parse

This project forked from allenai/science-parse

Library and tools for extracting metadata from PDFs

License: Apache License 2.0

Scala 37.37% Python 1.47% Java 61.10% Shell 0.06%

Science Parse

Science Parse parses scientific papers (in PDF form) and returns them in structured form. As of today, it supports these fields:

  • Title
  • Authors
  • Abstract
  • Sections (each with heading and body text)
  • Bibliography, each with
    • Title
    • Authors
    • Venue
    • Year
  • Mentions, i.e., places in the paper where bibliography entries are mentioned
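
To make the shape of these fields concrete, here is a minimal sketch of how they could be modeled in Java (16 or later, for records). All type and field names (`ParsedPaper`, `Section`, `BibEntry`, `Mention`, `bibEntryIndex`, and so on) are hypothetical and chosen for illustration; they are not taken from Science Parse's own classes or JSON keys, and the sample values are made up.

```java
import java.util.List;

// Hypothetical model of the fields listed above. The names are illustrative
// and are NOT Science Parse's actual classes or JSON keys.
record Section(String heading, String body) {}

record BibEntry(String title, List<String> authors, String venue, int year) {}

// A mention links a place in the paper's text to one bibliography entry.
record Mention(int bibEntryIndex, String context) {}

record ParsedPaper(
        String title,
        List<String> authors,
        String abstractText,
        List<Section> sections,
        List<BibEntry> bibliography,
        List<Mention> mentions) {

    // Resolve a mention to the bibliography entry it points at.
    BibEntry entryFor(Mention m) {
        return bibliography.get(m.bibEntryIndex());
    }
}

class PaperModelSketch {
    // Made-up sample data, for illustration only.
    static ParsedPaper example() {
        BibEntry ref = new BibEntry(
                "An Example Reference", List.of("A. Author"), "Example Venue", 2015);
        return new ParsedPaper(
                "An Example Paper",
                List.of("First Author", "Second Author"),
                "A short abstract.",
                List.of(new Section("Introduction", "Prior work [1] shows ...")),
                List.of(ref),
                List.of(new Mention(0, "Prior work [1] shows ...")));
    }

    public static void main(String[] args) {
        ParsedPaper paper = example();
        // Print each mention next to the title of the entry it cites.
        for (Mention m : paper.mentions()) {
            System.out.println(m.context() + " -> " + paper.entryFor(m).title());
        }
    }
}
```

Storing a mention as an index into the bibliography is one possible design; it keeps the link between a citation in the text and its entry explicit without duplicating the entry.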

In JSON format, the output looks like this (or like this, if you want sections). The easiest way to get started is to use the output from this server.

Get started

There are three different ways to get started with Science Parse (SP). Each has its own document:

  • Server: This contains the SP server. It's useful for PDF parsing as a service. It's also probably the easiest way to get going.
  • CLI: This contains the command line interface to SP. That's most useful for batch processing.
  • Core: This contains SP as a library. It has all the extraction code, plus training and evaluation. Both server and CLI use this to do the actual work.

Alternatively, you can run the Docker image:

docker run -p 8080:8080 --rm allenai-docker-public-docker.bintray.io/s2/scienceparse:1.2.8-SNAPSHOT
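
Once the container is running, the Docker command above exposes the server on port 8080 of the local machine. The following is a minimal sketch of a Java (11 or later) client that sends a PDF to it. The `/v1` path and the `application/pdf` content type are assumptions made for this example, not something stated in this document; check the Server document for the actual endpoint. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

class ScienceParseClientSketch {
    // ASSUMPTION: the server accepts a POST of raw PDF bytes at this path.
    // Verify against the Server document before relying on it.
    static final String ASSUMED_PATH = "/v1";

    // Build the request separately so it can be inspected without a server.
    static HttpRequest buildRequest(String baseUrl, byte[] pdfBytes) {
        return HttpRequest.newBuilder(URI.create(baseUrl + ASSUMED_PATH))
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ScienceParseClientSketch <paper.pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Path.of(args[0]));
        // Port 8080 matches the -p 8080:8080 mapping in the docker command.
        HttpRequest request = buildRequest("http://localhost:8080", pdf);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```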

How to include it in your own project

The current version is 1.2.7. If you want to include it in your own project, use this:

For SBT:

libraryDependencies += "org.allenai" %% "science-parse" % "1.2.7"

For Maven:

<dependency>
  <groupId>org.allenai</groupId>
  <artifactId>science-parse_2.11</artifactId>
  <version>1.2.7</version>
  <type>pom</type>
</dependency>

The first time you run it, SP will download some rather large model files. Don't be alarmed! The model files are cached, and startup is much faster the second time.

For licensing reasons, SP does not include libraries for some image formats. Without these libraries, SP cannot process PDFs that contain images in these formats. If you have no licensing restrictions in your project, we recommend you add these additional dependencies to your project as well:

  "com.github.jai-imageio" % "jai-imageio-core" % "1.2.1",
  "com.github.jai-imageio" % "jai-imageio-jpeg2000" % "1.3.0", // For handling jpeg2000 images
  "com.levigo.jbig2" % "levigo-jbig2-imageio" % "1.6.5", // For handling jbig2 images
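
To check whether such image plugins are visible at runtime, one option is to ask the JDK's standard `ImageIO` registry for a reader by format name. This is a minimal sketch, not part of SP; the format names `"jpeg2000"` and `"jbig2"` are assumptions about what the optional libraries register, and `ImageFormatCheck` is a hypothetical class name.

```java
import javax.imageio.ImageIO;

class ImageFormatCheck {
    // True if some ImageIO reader plugin on the classpath claims this format name.
    static boolean hasReader(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // ASSUMPTION: these are the format names the optional plugins register.
        for (String name : new String[] {"jpeg2000", "jbig2"}) {
            System.out.println(name + ": "
                    + (hasReader(name) ? "reader found" : "no reader on classpath"));
        }
    }
}
```

If a format reports no reader, PDFs that embed images in that format are the ones likely to be affected.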

Development

This project is a hybrid between Java and Scala. The interaction between the languages is fairly seamless, and SP can be used as a library in any JVM-based language.

Lombok

This project uses Lombok, which requires annotation processing to be enabled in your IDE. For IntelliJ, install the Lombok plugin and turn on annotation processing in the compiler settings.

Lombok has a lot of useful annotations that give you some of the nice things in Scala:

  • val declares a final local variable whose type is inferred from the right-hand-side expression, much like val in Scala
  • @Data generates getters, setters for non-final fields, equals, hashCode, and toString for a class
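
As a rough illustration of what these annotations save you, the sketch below is plain Java written by hand to approximate what Lombok would generate. The `Author` class and its fields are hypothetical and do not come from the SP code base, and the real generated `hashCode` uses a different formula than the one shown here.

```java
import java.util.ArrayList;
import java.util.Objects;

// Hand-written approximation of what Lombok's @Data generates for:
//   @Data class Author { private final String name; private int paperCount; }
class Author {
    private final String name;
    private int paperCount;

    // @Data adds a constructor taking the final fields.
    public Author(String name) { this.name = name; }

    // Getters for every field, setters only for non-final fields.
    public String getName() { return name; }
    public int getPaperCount() { return paperCount; }
    public void setPaperCount(int paperCount) { this.paperCount = paperCount; }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Author)) return false;
        Author other = (Author) o;
        return paperCount == other.paperCount && Objects.equals(name, other.name);
    }

    // Lombok's generated hashCode differs in detail; this one is merely consistent with equals.
    @Override public int hashCode() { return Objects.hash(name, paperCount); }

    @Override public String toString() {
        return "Author(name=" + name + ", paperCount=" + paperCount + ")";
    }
}

class LombokSketch {
    public static void main(String[] args) {
        // With Lombok this line could be written as: val names = new ArrayList<String>();
        final ArrayList<String> names = new ArrayList<String>();
        names.add("Ada");

        Author author = new Author(names.get(0));
        author.setPaperCount(3);
        System.out.println(author);
    }
}
```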

Thanks

Special thanks go to @kermitt2, whose work on kermitt2/grobid inspired Science Parse and helped us get started with some labeled data.

science-parse's People

Contributors

amosjyng, aria42, chrisc36, dcdowney, dirkgr, nalourie-ai2, rjpower, rodneykinney
