This project forked from ink-usc/usc-ds-relationextraction

CoType: Joint Typing Entities and Relations with Knowledge Bases (WWW'17)

CoType: Joint Typing of Entities and Relations with Knowledge Bases

Source code and data for the WWW'17 paper CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases.

Given a text corpus with entity mentions detected and heuristically labeled by distant supervision, this code determines the entity type of each entity mention, and identifies relationships between entities together with their relation types.

An end-to-end tool (corpus to typed entities/relations) is under development. Please keep track of our updates.

Performance

Performance comparison with several distantly-supervised relation extraction systems on the KBP 2013 dataset.

Method                                            Precision  Recall  F1
Mintz (our implementation, Mintz et al., 2009)    0.296      0.387   0.335
LINE + Dist Sup (Tang et al., 2015)               0.360      0.257   0.299
MultiR (Hoffmann et al., 2011)                    0.325      0.278   0.301
FCM + Dist Sup (Gormley et al., 2015)             0.151      0.498   0.300
CoType (Ren et al., 2017)                         0.348      0.406   0.369

Dependencies

The instructions below use Ubuntu as an example.

  • Python 2.7
  • Python library dependencies
$ pip install pexpect ujson tqdm
$ cd code/DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip

Data

We processed three public datasets into our JSON format using our data pipeline. We ran Stanford NER on the training sets to detect entity mentions, and performed distant supervision using DBpedia Spotlight to assign type labels:

  • BioInfer: 100k PubMed paper abstracts as training data and 1,530 manually labeled biomedical paper abstracts from BioInfer (Pyysalo et al., 2007) as test data. It consists of 94 relation types and over 2,000 entity types. (Download JSON)
  • NYT (Riedel et al., 2011): 1.18M sentences sampled from 294K New York Times news articles. 395 sentences are manually annotated with 24 relation types and 47 entity types. (Download JSON)
  • Wiki-KBP: the training corpus contains 1.5M sentences sampled from 780k Wikipedia articles (Ling & Weld, 2012) plus ~7,000 sentences from the 2013 KBP corpus. The test data consists of 14k manually labeled sentences from the 2013 KBP slot filling assessment results. It has 19 relation types and 126 entity types. (Download JSON)

Please put the data files in the corresponding subdirectories under CoType/data/source.

Makefile

We have included compiled binaries. If you need to recompile retype.cpp under your own g++ environment, run:

$ cd CoType/code/Model/retype; make

Default Run

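Run CoType for the relation extraction task on the Wiki-KBP dataset. Start the Stanford CoreNLP server first, then launch run.sh. The server command stays in the foreground, so run the script from a second terminal.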

$ java -mx4g -cp "code/DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
$ ./run.sh  
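Since run.sh presumably relies on the CoreNLP server started by the first command, it can be useful to confirm that the server is reachable before launching the pipeline. The following is a minimal Python 3 sketch, not part of the repository: it assumes the server listens on its default port 9000 (the command above does not pass a port option), and the function name corenlp_server_is_up is hypothetical.

```python
import http.client
import urllib.request


def corenlp_server_is_up(url="http://localhost:9000", timeout=5.0):
    """Return True if an HTTP server answers `url` with a 2xx status.

    Assumption: the CoreNLP server runs on its default port 9000.
    Any connection error, timeout, or HTTP error status yields False.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


if __name__ == "__main__":
    if corenlp_server_is_up():
        print("CoreNLP server is reachable; run.sh can be started.")
    else:
        print("CoreNLP server is not reachable; start it first.")
```

The check only verifies that something answers HTTP requests at that address; it does not validate that the annotators needed by the data pipeline are loaded.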

Parameters - run.sh

Dataset to run on.

Data="KBP"
  • Parameters for learning CoType embeddings:
- KBP: -negative 3 -iters 400 -lr 0.02 -transWeight 1.0
- NYT: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0
- BioInfer: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0
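The per-dataset settings above can be kept in one place and turned into a flag list, which makes it harder to mix up values when switching datasets. This is an illustrative sketch only: the flag names and values are copied from the list above, while COTYPE_PARAMS, FLAG_ORDER, and embedding_flags are hypothetical names that do not appear in run.sh, and how run.sh actually passes these flags to the binary is not shown here.

```python
# Hyperparameters copied from the README list; the names below are hypothetical.
COTYPE_PARAMS = {
    "KBP": {"negative": 3, "iters": 400, "lr": 0.02, "transWeight": 1.0},
    "NYT": {"negative": 5, "iters": 700, "lr": 0.02, "transWeight": 7.0},
    "BioInfer": {"negative": 5, "iters": 700, "lr": 0.02, "transWeight": 7.0},
}

# Emit flags in the same order as the README lists them.
FLAG_ORDER = ("negative", "iters", "lr", "transWeight")


def embedding_flags(dataset):
    """Return the embedding-learning flags for `dataset` as a list of strings."""
    if dataset not in COTYPE_PARAMS:
        raise ValueError("unknown dataset: %r" % (dataset,))
    params = COTYPE_PARAMS[dataset]
    flags = []
    for name in FLAG_ORDER:
        flags.extend(["-" + name, str(params[name])])
    return flags


if __name__ == "__main__":
    for name in ("KBP", "NYT", "BioInfer"):
        print(name, " ".join(embedding_flags(name)))
```

Returning a list rather than a single string keeps the result usable with subprocess-style APIs that expect one argument per element.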

Evaluation

After learning the embedding vectors, the following scripts evaluate relation extraction performance (precision, recall, F1).

$ python code/Evaluation/emb_test.py extract KBP retype cosine 0.0
$ python code/Evaluation/tune_threshold.py extract KBP emb retype cosine
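To make the reported metrics concrete, here is a small sketch of how precision, recall, and F1 can be computed from predicted and gold relation sets, and how a score threshold could be chosen by sweeping candidate values. It is not the code in emb_test.py or tune_threshold.py: the function names, the (item, score) input format, and the reading of the final argument 0.0 as a score threshold are all assumptions made for illustration.

```python
from __future__ import division  # true division under Python 2.7 as well


def precision_recall_f1(predicted, gold):
    """Compute (precision, recall, F1) for two collections of items."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_threshold(scored, gold, thresholds):
    """Return (threshold, precision, recall, F1) with the highest F1.

    `scored` is a list of (item, score) pairs (hypothetical format); an item
    counts as predicted when its score is >= the threshold. On ties, the
    first threshold tried is kept.
    """
    best = None
    for threshold in thresholds:
        kept = [item for item, score in scored if score >= threshold]
        p, r, f1 = precision_recall_f1(kept, gold)
        if best is None or f1 > best[3]:
            best = (threshold, p, r, f1)
    return best


if __name__ == "__main__":
    scored = [("a", 0.9), ("b", 0.6), ("c", 0.3), ("d", 0.1)]
    gold = {"a", "c"}
    print(best_threshold(scored, gold, [0.0, 0.2, 0.5]))
```

Tuning the threshold on the same data used for the final report would overstate performance, so a sweep like this is normally run on a held-out development split.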

Reference

Please cite the following paper if you find the code and datasets useful:

@inproceedings{ren2017cotype,
 author = {Ren, Xiang and Wu, Zeqiu and He, Wenqi and Qu, Meng and Voss, Clare R. and Ji, Heng and Abdelzaher, Tarek F. and Han, Jiawei},
 title = {CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases},
 booktitle = {Proceedings of the 26th International Conference on World Wide Web},
 year = {2017},
 pages = {1015--1024},
} 
