
cogcomp / zoe

43.0 6.0 5.0 368 KB

Zero-Shot Open Entity Typing as Type-Compatible Grounding, EMNLP'18.

Python 70.12% Shell 0.48% JavaScript 0.62% HTML 27.66% CSS 1.12%
named-entities entity-typing natural-language-processing

zoe's Introduction

ZOE (Zero-shot Open Entity Typing)

A state-of-the-art system for zero-shot fine-grained entity typing with minimal supervision

Introduction

This is a demo system for our paper "Zero-Shot Open Entity Typing as Type-Compatible Grounding", which at the time of publication represents the state-of-the-art of zero-shot entity typing.

The original experiments that produced all the results in the paper were done with a package written in Java. This re-written package exists solely to demo the algorithm and validate key results.

The results may differ slightly from the published numbers due to randomness in the iteration order of Java's HashSet and Python's set. The difference should be negligible.

This system may take a long time if run on a large number of new sentences, due to ELMo processing. We have cached the ELMo results for the provided experiments.

The package also contains an online demo; please refer to the Publication Page for more details.

Usage

Install the system

Prerequisites

  • At least 20 GB of free disk space and 16 GB of memory (strict requirement)
  • Python 3.x (mostly tested on 3.5)
  • A POSIX OS (Windows is not supported)
  • Java JDK and Maven
  • virtualenv, if you are installing with the script
  • wget, if you are installing with the script (use brew to install it on macOS)
  • unzip, if you are installing with the script

Install using a one-line command

To make life easier, we provide a one-line install: run `sh install.sh`.

This script does everything mentioned in the next section, plus creating a virtualenv. Activate it with `source venv/bin/activate`.

Install manually

See the manual-installation page on the wiki.

Run the system

Currently you can do the following without changes to the code:

  • Run the experiment on the FIGER test set (randomly sampled as in the paper): `python3 main.py figer`
  • Run the experiment on the BBN test set: `python3 main.py bbn`
  • Run the experiment on the first 1000 OntoNotes_fine test instances (due to size issues): `python3 main.py ontonotes`

Additionally, you can run server mode, which initializes the online demo, with `python3 server.py`. However, this requires some additional files that are not yet available for download; please contact the authors directly.

Running on a large number of new sentences is generally an expensive operation, but you are welcome to do it. Please refer to main.py and the Engineering Details wiki page to see how you can test on your own data.

Citation

See the following paper:

@inproceedings{ZKTR18,
    author = {Ben Zhou and Daniel Khashabi and Chen-Tse Tsai and Dan Roth},
    title = {Zero-Shot Open Entity Typing as Type-Compatible Grounding},
    booktitle = {EMNLP},
    year = {2018},
}

zoe's People

Contributors

heglertissot · slash0bz


zoe's Issues

No such file or directory: 'data/log/runlog_figer.pickle'

After I ran `python3 main.py figer`, the system appeared to run and produced a lot of output on my terminal, but then exited with the error `No such file or directory: 'data/log/runlog_figer.pickle'`.

Below is some of the output:
```
ELMo Candidate Titles: ['United_States', 'Beltsville,Maryland', 'United_States_district_court', 'Driving_under_the_influence', 'Kuala_Lumpur', 'Vision(Marvel_Comics)', 'SM_Entertainment', 'United_Kingdom', 'Jung_Yong_Hwa', 'Super_Junior', 'England_and_Wales', 'IU_(singer)', 'Baden-W%C3%BCrttemberg', 'South_Korea', 'Jabba_the_Hutt', 'Broadcasting_(television_and_radio)', 'Choi_Siwon', 'West_Memphis_3', 'Uthai_Thani_Province', 'Sima_Qian']
Selected Candidate: Federal_Way,_Washington
--Performance--
Strict Accuracy: 0.5916334661354582

Micro Precision: 0.7616191904047976
Micro Recall: 0.678237650200267
Micro F1: 0.7175141242937854

Macro Precision: 0.7740703851261619
Macro Recall: 0.7231739707835325
Macro F1: 0.7477571070722246

Traceback (most recent call last):
  File "main.py", line 99, in <module>
    runner.save("data/log/runlog_figer.pickle")
  File "main.py", line 80, in save
    with open(file_name, "wb") as handle:
FileNotFoundError: [Errno 2] No such file or directory: 'data/log/runlog_figer.pickle'
```
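The traceback shows that `open(..., "wb")` fails because the `data/log` directory does not exist; Python's `open` does not create parent directories. A minimal workaround (a sketch, not the authors' fix; `safe_save` is a hypothetical helper) is to create the directory before saving:

```python
import os

def safe_save(runner, file_name="data/log/runlog_figer.pickle"):
    # open(..., "wb") fails if the parent directory is missing, so create it first.
    os.makedirs(os.path.dirname(file_name), exist_ok=True)
    runner.save(file_name)
```

Alternatively, simply running `mkdir -p data/log` before the experiment avoids the error.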

Could you please provide full test dataset used in the paper

Hi,

Thanks for your work.
I see you only released the test datasets (i.e., FIGER, BBN, OntoNotes_fine) for fine-grained entity typing. Are these the full datasets or just samples of the full datasets?

Also, could you please provide the test datasets used for coarse entity typing (Table 3 in the paper) and biology entity typing (Table 5 in the paper)?
Thanks a lot.

Question about cached target embedding map

Hi,

According to your cached target embedding map (target.min.embedding.pickle), every unique target mention has a static ELMo embedding. In this case, when two sentences contain the same mention, the two mentions would have the same embedding.

My question is: according to your paper, the ELMo embedding of a given mention should be context-aware, which means mentions with the same surface form should have different embeddings in different sentences. Could you please clarify how the embeddings in the cache file were calculated? Did you take the mean over each surface mention's occurrences in different sentences, or is the cache file just an example to speed things up? Thank you!
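For reference, a surface-keyed cache like the one described is often built by averaging the contextual vectors of every occurrence of a surface form. The sketch below illustrates that averaging scheme; the function name and input format are assumptions, not the authors' actual code.

```python
import numpy as np

def build_surface_cache(occurrences):
    """Average contextual vectors per surface form.

    occurrences: list of (mention_surface, contextual_vector) pairs,
    one pair per occurrence of a mention in some sentence.
    Returns {surface: mean vector} — a static, context-free embedding per surface.
    """
    sums, counts = {}, {}
    for surface, vec in occurrences:
        vec = np.asarray(vec, dtype=float)
        sums[surface] = sums.get(surface, 0.0) + vec
        counts[surface] = counts.get(surface, 0) + 1
    return {s: sums[s] / counts[s] for s in sums}
```

Under this scheme, two occurrences of "Washington" in different sentences would collapse to one averaged vector, which matches the behavior the question observes.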

Question about testing on new data

Hi, I'm trying to run ZOE on a new dataset, and the following questions came up:

  1. In main.py, should I comment out `runner.elmo_processor.load_cached_embeddings("target.min.embedding.pickle", "wikilinks.min.embedding.pickle")`? If yes, could you show me how these two files are generated and what the format of their raw versions is? Currently I find running on new data extremely slow (30 sentences processed after one night). Any idea how I can speed things up?

  2. Are there any other files or data I need to generate for testing on a new dataset (maybe vocab_test.txt)?

Thank you!

AttributeError: module 'tensorflow' has no attribute 'placeholder'

I'm getting the error `AttributeError: module 'tensorflow' has no attribute 'placeholder'` after running `python3 main.py bbn`.

```
Traceback (most recent call last):
  File "main.py", line 101, in <module>
    runner = ZoeRunner(allow_tensorflow=False)
  File "main.py", line 20, in __init__
    self.elmo_processor = ElmoProcessor(allow_tensorflow)
  File "/scratch/chs298/zoe/zoe_utils.py", line 34, in __init__
    self.batcher, self.ids_placeholder, self.ops, self.sess = initialize_sess(self.vocab_file, self.opt$
  File "/scratch/chs298/zoe/venv/lib/python3.6/site-packages/bilm-0.1-py3.6.egg/bilm/model.py", line 71$
    ids_placeholder = tf.placeholder('int32',
AttributeError: module 'tensorflow' has no attribute 'placeholder'
```
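This error typically means a TensorFlow 2.x install: `tf.placeholder` was removed in TF 2, and the bilm package used for ELMo was written against the TF 1.x API. A common workaround (an assumption about the environment, not a fix shipped with this repo) is to pin a 1.x release inside the project's virtualenv:

```shell
# bilm calls tf.placeholder, which TensorFlow 2 removed.
# Pin a 1.x release inside the project's virtualenv (requires Python <= 3.7):
source venv/bin/activate
pip install 'tensorflow>=1.13,<2.0'
```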

Questions about title2freebase mapping

Hi,

I noticed that the file title2freebase.pickle is generated from the function def convert_freebase(freebase_file_name, freebase_sup_file_name). My questions are:

  1. It seems freebase_file_name and freebase_sup_file_name have the same format; could you explain what these two files are and where to get them?
  2. Related to the first question: it looks like the key-value pairs in freebase_file_name and freebase_sup_file_name are (wiki title, Freebase type) pairs. How did you get them?
  3. Another related one: how did you map wiki titles to Freebase entities and then get the corresponding Freebase types?

Thank you very much!
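Absent the actual files, here is a hedged sketch of what a function like `convert_freebase` might do, assuming (as the question infers) that both inputs are tab-separated (wiki title, Freebase type) pairs; the file format, merging behavior, and output layout are all assumptions, not confirmed by the repo.

```python
import pickle
from collections import defaultdict

def convert_freebase(freebase_file_name, freebase_sup_file_name,
                     out_name="title2freebase.pickle"):
    """Merge two assumed (wiki_title \t freebase_type) files into one
    {title: [types]} mapping and pickle it."""
    title2types = defaultdict(set)
    for path in (freebase_file_name, freebase_sup_file_name):
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) == 2:
                    title, fb_type = parts
                    title2types[title].add(fb_type)
    with open(out_name, "wb") as handle:
        pickle.dump({t: sorted(v) for t, v in title2types.items()}, handle)
```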

Java issue

I'm having difficulties installing the software due to Java issues. What version of Java would you recommend for your software? My OS is macOS High Sierra.

Below is the output I get:
```
(venv) (base) MacBook-Pro:zoe-master usr$ python3 scripts.py CHECKFILES
Traceback (most recent call last):
  File "scripts.py", line 6, in <module>
    from ccg_nlpy import local_pipeline
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/ccg_nlpy/local_pipeline.py", line 31, in <module>
    from jnius import autoclass
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/jnius/__init__.py", line 13, in <module>
    from .reflect import *  # noqa
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/jnius/reflect.py", line 15, in <module>
    class Class(with_metaclass(MetaJavaClass, JavaClass)):
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/six.py", line 827, in __new__
    return meta(name, bases, d)
  File "jnius/jnius_export_class.pxi", line 114, in jnius.MetaJavaClass.__new__
  File "jnius/jnius_export_class.pxi", line 164, in jnius.MetaJavaClass.resolve_class
  File "jnius/jnius_env.pxi", line 11, in jnius.get_jnienv
  File "jnius/jnius_jvm_dlopen.pxi", line 90, in jnius.get_platform_jnienv
  File "jnius/jnius_jvm_dlopen.pxi", line 45, in jnius.create_jnienv
  File "/Users/usr/zoe-master/venv/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'JAVA_HOME'
```

So I set $JAVA_HOME to point to Java 8:
```
(venv) (base) MacBook-Pro:zoe-master usr$ export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
(venv) (base) MacBook-Pro:zoe-master usr$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home
```

Yet I still get an error:
```
(venv) (base) MacBook-Pro:zoe-master usr$ python3 scripts.py CHECKFILES
Traceback (most recent call last):
  File "scripts.py", line 6, in <module>
    from ccg_nlpy import local_pipeline
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/ccg_nlpy/local_pipeline.py", line 31, in <module>
    from jnius import autoclass
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/jnius/__init__.py", line 13, in <module>
    from .reflect import *  # noqa
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/jnius/reflect.py", line 15, in <module>
    class Class(with_metaclass(MetaJavaClass, JavaClass)):
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/six.py", line 827, in __new__
    return meta(name, bases, d)
  File "jnius/jnius_export_class.pxi", line 114, in jnius.MetaJavaClass.__new__
  File "jnius/jnius_export_class.pxi", line 164, in jnius.MetaJavaClass.resolve_class
  File "jnius/jnius_env.pxi", line 11, in jnius.get_jnienv
  File "jnius/jnius_jvm_dlopen.pxi", line 90, in jnius.get_platform_jnienv
  File "jnius/jnius_jvm_dlopen.pxi", line 59, in jnius.create_jnienv
SystemError: Error calling dlopen(b'/Library/Java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/lib/server/libjvm.dylib': b'dlopen(/Library/Java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/lib/server/libjvm.dylib, 10): image not found'
```
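One detail worth checking: on stock macOS builds of JDK 8, libjvm.dylib lives under Contents/Home/jre/lib/server, while the dlopen error above is looking in Contents/Home/lib/server. A hedged sanity check (paths assume a standard Oracle/Apple JDK 8 layout, which may differ on other installs):

```shell
# JDK 8 on macOS keeps the JVM library under jre/lib/server, not lib/server,
# so verify where libjvm.dylib actually is before debugging further:
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
ls "$JAVA_HOME/jre/lib/server/libjvm.dylib"
```

If the file exists there, the mismatch is in where pyjnius resolves the JVM path, not in the JDK install itself.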

Output

Does the system save the results of the experiments, or is the terminal output the only output?

Investigate efficiency problem

Currently, even with all ELMo results cached for the FIGER experiment, the package still runs for hours on a few hundred sentences, which is much slower than the original Java implementation (usually about 10 minutes).

This is probably due to the use of native data structures like lists and dicts (especially large ones).

Investigate what is slowing the package down and replace those structures with better implementations.
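A first step in such an investigation is to profile a run with the standard library's cProfile and check for avoidable list scans. The helper below is a hypothetical sketch (not code from this repo); the toy pair of lookup functions illustrates the kind of list-versus-set difference that often dominates in large native data structures.

```python
import cProfile
import io
import pstats

def profile_top(func, *args, n=10):
    """Run func under cProfile; return (result, report of the n costliest calls)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(n)
    return result, buf.getvalue()

# Membership tests against a list are O(n) per query; a set makes them O(1).
def slow_lookup(items, queries):
    return sum(1 for q in queries if q in items)       # linear scan per query

def fast_lookup(items, queries):
    item_set = set(items)                              # build the set once
    return sum(1 for q in queries if q in item_set)    # hash lookup per query
```

Profiling the actual FIGER run the same way would show whether large list/dict traversals, rather than ELMo itself, account for the hours-long runtime.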

Question about freebase mapping

Dear authors, I am trying to customize the mapping from Freebase types to my own target taxonomies; however, I am having trouble finding good coverage of all related Freebase types. Is there any way or resource I can use to find the related Freebase types for a given target type? Thank you!

Question about ELMo embedding caching design

Hi,

I have a question about your ELMo embedding caching design. When querying cached ELMo vectors for a standard dataset like FIGER, you query the vector using the mention string only, regardless of the context:

```python
if sentence.get_mention_surface() not in self.target_embedding_map and self.allow_tensorflow:
```

Is this based on the assumption that the standard datasets you used don't have two mentions with exactly the same surface form?
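For comparison, a context-sensitive cache would key entries by sentence and surface together rather than by surface alone, so identical surfaces in different sentences stay distinct. A hypothetical sketch (the key format is an assumption, not the repo's design; the vectors are dummy values):

```python
def cache_key(sentence_tokens, mention_surface):
    # Include the sentence in the key so the same surface in two
    # different contexts maps to two different cache entries.
    return (" ".join(sentence_tokens), mention_surface)

cache = {}
cache[cache_key(["He", "visited", "Washington", "."], "Washington")] = [0.1, 0.2]
cache[cache_key(["Washington", "signed", "the", "bill", "."], "Washington")] = [0.3, 0.4]
```

The trade-off is cache size: a surface-only key stays small and gets high hit rates, while a context-aware key grows with the corpus.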
