
cogcomp / zoe

43.0 6.0 5.0 368 KB

Zero-Shot Open Entity Typing as Type-Compatible Grounding, EMNLP'18.

Python 70.12% Shell 0.48% JavaScript 0.62% HTML 27.66% CSS 1.12%
named-entities entity-typing natural-language-processing

zoe's Introduction

ZOE (Zero-shot Open Entity Typing)

A state-of-the-art system for zero-shot fine-grained entity typing with minimal supervision

Introduction

This is a demo system for our paper "Zero-Shot Open Entity Typing as Type-Compatible Grounding", which at the time of publication represents the state-of-the-art of zero-shot entity typing.

The original experiments that produced all the results in the paper were done with a package written in Java. This re-written package exists solely to demo the algorithm and validate key results.

The results may differ slightly from the published numbers due to randomness in the iteration order of Java's HashSet and Python's set. The difference should be negligible.

This system may take a long time if run on a large number of new sentences, due to ELMo processing. We have cached the ELMo results for the provided experiments.

The package also contains an online demo; please refer to the Publication Page for more details.

Usage

Install the system

Prerequisites

  • At least 20 GB of free disk space and 16 GB of memory (strict requirement)
  • Python 3.x (mostly tested on 3.5)
  • A POSIX OS (Windows is not supported)
  • Java JDK and Maven
  • virtualenv, if you are installing with the script
  • wget, if you are installing with the script (use brew to install it on macOS)
  • unzip, if you are installing with the script

Install using a one-line command

To make life easier, we provide a one-line install: run `sh install.sh`.

This script does everything mentioned in the next section, plus creating a virtualenv. Activate it with `source venv/bin/activate`.

Install manually

See the manual-installation page on the wiki.

Run the system

Currently you can do the following without changes to the code:

  • Run the experiment on the FIGER test set (randomly sampled as in the paper): `python3 main.py figer`
  • Run the experiment on the BBN test set: `python3 main.py bbn`
  • Run the experiment on the first 1000 OntoNotes_fine test instances (due to size issues): `python3 main.py ontonotes`

Additionally, you can run server mode, which initializes the online demo, with `python3 server.py`. However, this requires some additional files that are not yet available for download; please contact the authors directly.

Running on a large number of new sentences is generally an expensive operation, but you are welcome to do it. Please refer to main.py and the Engineering Details wiki page to see how you can test on your own data.

Citation

See the following paper:

@inproceedings{ZKTR18,
    author = {Ben Zhou and Daniel Khashabi and Chen-Tse Tsai and Dan Roth},
    title = {Zero-Shot Open Entity Typing as Type-Compatible Grounding},
    booktitle = {EMNLP},
    year = {2018},
}

zoe's People

Contributors

heglertissot · slash0bz


zoe's Issues

No such file or directory: 'data/log/runlog_figer.pickle'

After I ran `python3 main.py figer`, the system appeared to run and produced a lot of output on my terminal, but then exited with the error `No such file or directory: 'data/log/runlog_figer.pickle'`.

Below is some of the output:
```
ELMo Candidate Titles: ['United_States', 'Beltsville,Maryland', 'United_States_district_court', 'Driving_under_the_influence', 'Kuala_Lumpur', 'Vision(Marvel_Comics)', 'SM_Entertainment', 'United_Kingdom', 'Jung_Yong_Hwa', 'Super_Junior', 'England_and_Wales', 'IU_(singer)', 'Baden-W%C3%BCrttemberg', 'South_Korea', 'Jabba_the_Hutt', 'Broadcasting_(television_and_radio)', 'Choi_Siwon', 'West_Memphis_3', 'Uthai_Thani_Province', 'Sima_Qian']
Selected Candidate: Federal_Way,_Washington
--Performance--
Strict Accuracy: 0.5916334661354582

Micro Precision: 0.7616191904047976
Micro Recall: 0.678237650200267
Micro F1: 0.7175141242937854

Macro Precision: 0.7740703851261619
Macro Recall: 0.7231739707835325
Macro F1: 0.7477571070722246

Traceback (most recent call last):
  File "main.py", line 99, in <module>
    runner.save("data/log/runlog_figer.pickle")
  File "main.py", line 80, in save
    with open(file_name, "wb") as handle:
FileNotFoundError: [Errno 2] No such file or directory: 'data/log/runlog_figer.pickle'
```
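The traceback shows that `open(..., "wb")` fails because the `data/log` directory does not exist; Python's `open` does not create parent directories. A minimal workaround (a sketch, not the authors' fix; `safe_save` is a hypothetical helper) is to create the directory before saving:

```python
import os

def safe_save(runner, file_name="data/log/runlog_figer.pickle"):
    # open(..., "wb") fails if the parent directory is missing, so create it first.
    os.makedirs(os.path.dirname(file_name), exist_ok=True)
    runner.save(file_name)
```

Alternatively, simply running `mkdir -p data/log` before the experiment avoids the error.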

Could you please provide full test dataset used in the paper

Hi,

Thanks for your work.
I see you only released the test datasets (i.e., FIGER, BBN, OntoNotes_fine) for fine-grained entity typing. Are these the full datasets or just samples of the full datasets?

Also, could you please provide the test datasets used for coarse entity typing (Table 3 in the paper) and biology entity typing (Table 5 in the paper)?
Thanks a lot.

Question about cached target embedding map

Hi,

According to your cached target embedding map (target.min.embedding.pickle), every unique target mention has a static ELMo embedding. In this case, when two sentences contain the same mention, the two mentions would have the same embedding.

My question is: according to your paper, the ELMo embedding of a given mention should be context-aware, which means mentions with the same surface form should have different embeddings in different sentences. Could you please clarify how the embeddings in the cache file were calculated? Did you take the mean over each surface mention's occurrences in different sentences, or is the cache file just an example to speed things up? Thank you!
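For reference, a surface-keyed cache like the one described is often built by averaging the contextual vectors of every occurrence of a surface form. The sketch below illustrates that averaging scheme; the function name and input format are assumptions, not the authors' actual code.

```python
import numpy as np

def build_surface_cache(occurrences):
    """Average contextual vectors per surface form.

    occurrences: list of (mention_surface, contextual_vector) pairs,
    one pair per occurrence of a mention in some sentence.
    Returns {surface: mean vector} — a static, context-free embedding per surface.
    """
    sums, counts = {}, {}
    for surface, vec in occurrences:
        vec = np.asarray(vec, dtype=float)
        sums[surface] = sums.get(surface, 0.0) + vec
        counts[surface] = counts.get(surface, 0) + 1
    return {s: sums[s] / counts[s] for s in sums}
```

Under this scheme, two occurrences of "Washington" in different sentences would collapse to one averaged vector, which matches the behavior the question observes.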

Question about testing on new data

Hi, I'm trying to run ZOE on a new dataset, and the following questions came up:

  1. In main.py, should I comment out `runner.elmo_processor.load_cached_embeddings("target.min.embedding.pickle", "wikilinks.min.embedding.pickle")`? If yes, could you show me how these two files are generated and what the format of their raw versions is? Currently I find running on new data extremely slow (30 sentences processed after one night). Any idea how I can speed things up?

  2. Are there any other files or data I need to generate for testing on a new dataset (maybe vocab_test.txt)?

Thank you!

AttributeError: module 'tensorflow' has no attribute 'placeholder'

I'm getting the error `AttributeError: module 'tensorflow' has no attribute 'placeholder'` after running `python3 main.py bbn`.

```
Traceback (most recent call last):
  File "main.py", line 101, in <module>
    runner = ZoeRunner(allow_tensorflow=False)
  File "main.py", line 20, in __init__
    self.elmo_processor = ElmoProcessor(allow_tensorflow)
  File "/scratch/chs298/zoe/zoe_utils.py", line 34, in __init__
    self.batcher, self.ids_placeholder, self.ops, self.sess = initialize_sess(self.vocab_file, self.opt$
  File "/scratch/chs298/zoe/venv/lib/python3.6/site-packages/bilm-0.1-py3.6.egg/bilm/model.py", line 71$
    ids_placeholder = tf.placeholder('int32',
AttributeError: module 'tensorflow' has no attribute 'placeholder'
```
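This error typically means a TensorFlow 2.x install: `tf.placeholder` was removed in TF 2, and the bilm package used for ELMo was written against the TF 1.x API. A common workaround (an assumption about the environment, not a fix shipped with this repo) is to pin a 1.x release inside the project's virtualenv:

```shell
# bilm calls tf.placeholder, which TensorFlow 2 removed.
# Pin a 1.x release inside the project's virtualenv (requires Python <= 3.7):
source venv/bin/activate
pip install 'tensorflow>=1.13,<2.0'
```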

Questions about title2freebase mapping

Hi,

I noticed that the file title2freebase.pickle is generated from the function def convert_freebase(freebase_file_name, freebase_sup_file_name). My questions are:

  1. It seems freebase_file_name and freebase_sup_file_name have the same format; could you explain what these two files are and where to get them?
  2. Related to the first question: it looks like the key-value pairs in freebase_file_name and freebase_sup_file_name are (wiki title, Freebase type) pairs. How did you get them?
  3. Another related one: how did you map wiki titles to Freebase entities and then get the corresponding Freebase types?

Thank you very much!
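Absent the actual files, here is a hedged sketch of what a function like `convert_freebase` might do, assuming (as the question infers) that both inputs are tab-separated (wiki title, Freebase type) pairs; the file format, merging behavior, and output layout are all assumptions, not confirmed by the repo.

```python
import pickle
from collections import defaultdict

def convert_freebase(freebase_file_name, freebase_sup_file_name,
                     out_name="title2freebase.pickle"):
    """Merge two assumed (wiki_title \t freebase_type) files into one
    {title: [types]} mapping and pickle it."""
    title2types = defaultdict(set)
    for path in (freebase_file_name, freebase_sup_file_name):
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) == 2:
                    title, fb_type = parts
                    title2types[title].add(fb_type)
    with open(out_name, "wb") as handle:
        pickle.dump({t: sorted(v) for t, v in title2types.items()}, handle)
```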

Java issue

I'm having difficulties installing the software due to Java issues. What version of Java would you recommend for your software? My OS is macOS High Sierra.

Below is the output I get:
```
(venv) (base) MacBook-Pro:zoe-master usr$ python3 scripts.py CHECKFILES
Traceback (most recent call last):
  File "scripts.py", line 6, in <module>
    from ccg_nlpy import local_pipeline
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/ccg_nlpy/local_pipeline.py", line 31, in <module>
    from jnius import autoclass
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/jnius/__init__.py", line 13, in <module>
    from .reflect import *  # noqa
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/jnius/reflect.py", line 15, in <module>
    class Class(with_metaclass(MetaJavaClass, JavaClass)):
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/six.py", line 827, in __new__
    return meta(name, bases, d)
  File "jnius/jnius_export_class.pxi", line 114, in jnius.MetaJavaClass.__new__
  File "jnius/jnius_export_class.pxi", line 164, in jnius.MetaJavaClass.resolve_class
  File "jnius/jnius_env.pxi", line 11, in jnius.get_jnienv
  File "jnius/jnius_jvm_dlopen.pxi", line 90, in jnius.get_platform_jnienv
  File "jnius/jnius_jvm_dlopen.pxi", line 45, in jnius.create_jnienv
  File "/Users/usr/zoe-master/venv/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'JAVA_HOME'
```

So I set $JAVA_HOME to point to Java 8:
```
(venv) (base) MacBook-Pro:zoe-master usr$ export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
(venv) (base) MacBook-Pro:zoe-master usr$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home
```

Yet I still get an error:
```
(venv) (base) MacBook-Pro:zoe-master usr$ python3 scripts.py CHECKFILES
Traceback (most recent call last):
  File "scripts.py", line 6, in <module>
    from ccg_nlpy import local_pipeline
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/ccg_nlpy/local_pipeline.py", line 31, in <module>
    from jnius import autoclass
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/jnius/__init__.py", line 13, in <module>
    from .reflect import *  # noqa
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/jnius/reflect.py", line 15, in <module>
    class Class(with_metaclass(MetaJavaClass, JavaClass)):
  File "/Users/usr/zoe-master/venv/lib/python3.6/site-packages/six.py", line 827, in __new__
    return meta(name, bases, d)
  File "jnius/jnius_export_class.pxi", line 114, in jnius.MetaJavaClass.__new__
  File "jnius/jnius_export_class.pxi", line 164, in jnius.MetaJavaClass.resolve_class
  File "jnius/jnius_env.pxi", line 11, in jnius.get_jnienv
  File "jnius/jnius_jvm_dlopen.pxi", line 90, in jnius.get_platform_jnienv
  File "jnius/jnius_jvm_dlopen.pxi", line 59, in jnius.create_jnienv
SystemError: Error calling dlopen(b'/Library/Java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/lib/server/libjvm.dylib': b'dlopen(/Library/Java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/lib/server/libjvm.dylib, 10): image not found'
```
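One detail worth checking: on stock macOS builds of JDK 8, libjvm.dylib lives under Contents/Home/jre/lib/server, while the dlopen error above is looking in Contents/Home/lib/server. A hedged sanity check (paths assume a standard Oracle/Apple JDK 8 layout, which may differ on other installs):

```shell
# JDK 8 on macOS keeps the JVM library under jre/lib/server, not lib/server,
# so verify where libjvm.dylib actually is before debugging further:
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
ls "$JAVA_HOME/jre/lib/server/libjvm.dylib"
```

If the file exists there, the mismatch is in where pyjnius resolves the JVM path, not in the JDK install itself.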

Output

Does the system save the results of the experiments, or is the terminal output the only output?

Investigate efficiency problem

Currently, even with all ELMo results cached for the FIGER experiment, the package still runs for hours on a few hundred sentences, which is much slower than the original Java implementation (usually about 10 minutes).

This is probably due to the use of native data structures like lists and dicts (especially large ones).

Investigate what is slowing the package down and replace those structures with better implementations.
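A first step in such an investigation is to profile a run with the standard library's cProfile and check for avoidable list scans. The helper below is a hypothetical sketch (not code from this repo); the toy pair of lookup functions illustrates the kind of list-versus-set difference that often dominates in large native data structures.

```python
import cProfile
import io
import pstats

def profile_top(func, *args, n=10):
    """Run func under cProfile; return (result, report of the n costliest calls)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(n)
    return result, buf.getvalue()

# Membership tests against a list are O(n) per query; a set makes them O(1).
def slow_lookup(items, queries):
    return sum(1 for q in queries if q in items)       # linear scan per query

def fast_lookup(items, queries):
    item_set = set(items)                              # build the set once
    return sum(1 for q in queries if q in item_set)    # hash lookup per query
```

Profiling the actual FIGER run the same way would show whether large list/dict traversals, rather than ELMo itself, account for the hours-long runtime.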

Question about freebase mapping

Dear authors, I am trying to customize the mapping from Freebase types to my own target taxonomies; however, I am having trouble finding good coverage of all related Freebase types. Is there any way or resource I can use to find the related Freebase types for a given target type? Thank you!

Question about ELMo embedding caching design

Hi,

I have a question about your ELMo embedding caching design. When querying cached ELMo vectors for a standard dataset like FIGER, you query the vector using the mention string only, regardless of the context:

```python
if sentence.get_mention_surface() not in self.target_embedding_map and self.allow_tensorflow:
```

Is this based on the assumption that the standard datasets you used don't have two mentions with exactly the same surface form?
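For comparison, a context-sensitive cache would key entries by sentence and surface together rather than by surface alone, so identical surfaces in different sentences stay distinct. A hypothetical sketch (the key format is an assumption, not the repo's design; the vectors are dummy values):

```python
def cache_key(sentence_tokens, mention_surface):
    # Include the sentence in the key so the same surface in two
    # different contexts maps to two different cache entries.
    return (" ".join(sentence_tokens), mention_surface)

cache = {}
cache[cache_key(["He", "visited", "Washington", "."], "Washington")] = [0.1, 0.2]
cache[cache_key(["Washington", "signed", "the", "bill", "."], "Washington")] = [0.3, 0.4]
```

The trade-off is cache size: a surface-only key stays small and gets high hit rates, while a context-aware key grows with the corpus.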
