
jeekim / europepmc-identifier-extractor


A program to extract identifiers, such as grant IDs and accession numbers, from free text



Identifiers Extractor


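A text-mining pipeline to extract identifiers, such as European Research Council grant IDs, from free text. The pipeline consists of the Java components listed below; the tagger and the validator are implemented, and the dictionary builder is still to be done.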

  1. (TODO) Dictionary builder. Given a TSV file, build an MWT-based dictionary.
  2. Dictionary-based tagger. Given a dictionary, the tagger identifies dictionary terms in the text using a Java finite automata library.
  3. Validator. For each identified term, the validator discards the term if it is erroneous, using several mechanisms (contextual information, online validation, etc.).
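
The tagging step can be sketched in a few lines. The real tagger relies on a Java finite automata library and an MWT dictionary; the Python sketch below swaps in a plain regular expression instead, purely for illustration. The `PATTERNS` table, the PDB-like pattern, and the `tag` function are assumptions and are not taken from this project.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical dictionary: database name -> regular expression.
# The pattern is illustrative only, not one of the project's dictionary entries.
PATTERNS = {
    "pdb": r"[1-9][a-z0-9]{3}",
}

# One alternation with a named group per database, so a match tells us its source.
COMBINED = re.compile(
    "|".join(rf"(?P<{db}>\b(?:{pat})\b)" for db, pat in PATTERNS.items())
)


def tag(sentence):
    """Wrap every dictionary match in an <acc> element and escape the rest."""
    out = []
    pos = 0
    for m in COMBINED.finditer(sentence):
        out.append(escape(sentence[pos:m.start()]))  # text before the match
        out.append(f"<acc db={quoteattr(m.lastgroup)}>{escape(m.group())}</acc>")
        pos = m.end()
    out.append(escape(sentence[pos:]))  # trailing text after the last match
    return "".join(out)


print(tag("pdb 1aj9"))
```

A validator would then inspect each `<acc>` element and drop those that fail its checks.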

How to build?

sbt assembly

How to use?

Create a dictionary in MWT format and format your input documents as XML.

MWT dictionary format

<mwt>
  <template><acc db="%1" valmethod="%2" domain="%3" context="%4" wsize="%5">%0</acc></template>
  <r p1="$DBNAME" p2="$VALMETHOD" p3="$DOMAIN" p4="$CONTEXT" p5="$WINDOW_SIZE">$PATTERN</r>
</mwt>
  • valmethod: noval (no validation), contextOnly (keyword-based constraints), online (online validation), onlineWithContext (keyword-based constraints combined with online validation)
  • domain: one of the domain identifiers listed at https://www.ebi.ac.uk/ebisearch/overview.ebi
  • context: a list of keywords
  • wsize: the size of the window to the left of a matched term
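
How `context` and `wsize` might interact in keyword-based validation can be sketched as follows. This is not the project's validator. It assumes that the window is counted in whitespace-separated tokens and that keywords are compared case-insensitively as whole tokens; neither assumption is confirmed by this README. The function name `has_context` is hypothetical.

```python
def has_context(sentence, match_start, keywords, wsize):
    """Return True if a keyword occurs within `wsize` tokens left of the match.

    Assumptions (not confirmed by the project): the window is measured in
    whitespace-separated tokens, and keywords are matched case-insensitively.
    """
    # Tokens to the left of the matched term, stripped of common punctuation.
    left_tokens = [
        t.strip(".,;:()").lower() for t in sentence[:match_start].split()
    ]
    window = left_tokens[-wsize:] if wsize > 0 else []
    wanted = {k.lower() for k in keywords}
    return any(t in wanted for t in window)


sentence = "The structure was deposited in PDB as 1aj9."
start = sentence.index("1aj9")
print(has_context(sentence, start, ["pdb"], 3))
print(has_context(sentence, start, ["pdb"], 1))
```

With a window of three tokens the keyword is found; with a window of one token it is out of reach, so the candidate term would be rejected.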

How to build a dictionary?

  • from owl
  • (TODO) from tsv
  • (TODO) from identifiers.org
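
Since building a dictionary from TSV is still a TODO, the following is only a sketch of one way it could be done, written in Python rather than as part of the Java pipeline. The column order (db, valmethod, domain, context, wsize, pattern) is an assumption, as are the function name `tsv_to_mwt` and the sample row values, which are placeholders.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tsv_to_mwt(tsv_text):
    """Build an MWT dictionary string from TSV rows.

    Assumed (hypothetical) column order:
    db, valmethod, domain, context, wsize, pattern
    """
    root = ET.Element("mwt")
    template = ET.SubElement(root, "template")
    acc = ET.SubElement(
        template,
        "acc",
        {"db": "%1", "valmethod": "%2", "domain": "%3",
         "context": "%4", "wsize": "%5"},
    )
    acc.text = "%0"  # placeholder for the matched text

    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        db, valmethod, domain, context, wsize, pattern = row
        r = ET.SubElement(
            root, "r",
            {"p1": db, "p2": valmethod, "p3": domain,
             "p4": context, "p5": wsize},
        )
        r.text = pattern
    return ET.tostring(root, encoding="unicode")


# Placeholder row; the values are illustrative, not a real dictionary entry.
sample = "pdb\tcontextOnly\tpdbe\tstructure\t20\t[1-9][a-z0-9]{3}\n"
print(tsv_to_mwt(sample))
```

Using an XML library rather than string concatenation means special characters in patterns or keywords are escaped automatically.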

Input document format

The pipeline takes sentences as input. Those sentences have to be formatted as follows:

<article>
  <text>
    <SENT sid="0" pm="."><plain>$FIRST_SENTENCE</plain></SENT>
    <SENT sid="1" pm="."><plain>$SECOND_SENTENCE</plain></SENT>
  </text>
</article>
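
A short script can produce this format from a list of sentences. The sketch below is an illustration, not part of the project: the function name `to_article` is hypothetical, sentence splitting is assumed to have happened already, and the `pm="."` attribute is copied verbatim from the example above because its meaning is not documented here.

```python
from xml.sax.saxutils import escape


def to_article(sentences):
    """Wrap pre-split sentences in the <article>/<text>/<SENT> structure."""
    lines = ["<article>", "  <text>"]
    for sid, sentence in enumerate(sentences):
        # escape() protects &, < and > so the output stays well-formed XML.
        lines.append(
            f'    <SENT sid="{sid}" pm="."><plain>{escape(sentence)}</plain></SENT>'
        )
    lines += ["  </text>", "</article>"]
    return "\n".join(lines)


print(to_article(["Funded by the ERC.", "See pdb 1aj9 & related entries."]))
```

The output can be piped into the tagger in place of a prepared test file.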

Examples

European Research Council funding ID extraction

sbt testERC

or

cat test/ercfunds.txt | \
java -cp lib/monq-1.7.1.jar monq.programs.DictFilter -t elem -e plain -ie UTF-8 -oe UTF-8 automata/grants150714.mwt | \
java -cp target/scala-2.10/europepmc-identifier-extractor-assembly-0.1-SNAPSHOT.jar ukpmc.ValidateAccessionNumber -stdpipe

Accession number mining

sbt testAcc

or

cat test/accnums.txt | \
java -cp lib/monq-1.7.1.jar monq.programs.DictFilter -t elem -e plain -ie UTF-8 -oe UTF-8 automata/acc150612.mwt | \
java -cp target/scala-2.10/europepmc-identifier-extractor-assembly-0.1-SNAPSHOT.jar ukpmc.ValidateAccessionNumber -stdpipe

Running as server

java -cp lib/monq-1.7.1.jar monq.programs.DictFilter -t elem -e plain -ie UTF-8 -oe UTF-8 automata/acc150612.mwt -p 3333 &
java -cp target/scala-2.10/europepmc-identifier-extractor-assembly-0.1-SNAPSHOT.jar ukpmc.ValidateAccessionNumber &
echo "<SENT><plain>pdb 1aj9</plain></SENT>" | java -cp lib/monq-1.7.1.jar monq.programs.DistFilter -c . 'host=localhost;port=3333' 'host=localhost;port=7811'

TODO

  • Implement ! for negation.
  • Provide an sbt plugin for plug and play.
  • Build a container.
  • Run as a container (e.g., Docker).
  • Run on AWS.

Acknowledgements

This work was supported by the European Research Council (H2020 ERC-EuropePMC-2-2014 637529).
