
uclnlp / jack


Jack the Reader

License: MIT

Python 58.13% Shell 0.84% Jupyter Notebook 8.54% Makefile 0.14% HTML 19.69% CSS 2.65% JavaScript 10.01%
deep-learning knowledge-base natural-language-inference natural-language-processing question-answering tensorflow

jack's People

Contributors

dirkweissenborn, georgwiese, isabelleaugenstein, jg8610, johannesmaxwel, marziehsaeidi, mbosnjak, mitchelljeff, narad, pminervini, riedelcastro, rockt, scienceie, tdmeeste, timdettmers, undwie, vict0rsch

jack's Issues

Evaluation Script

Read in gold data in format #1 and predictions in some format (open question: how is system output stored?), and calculate accuracy/ranking metrics.

Features:

  • Rank accuracy
  • Hits@K
  • MRR
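
A minimal sketch of the ranking metrics, assuming each prediction is a ranked list of candidate answers and gold answers are a set per question; the data-loading side is left out because the prediction storage format is still an open question above, and the function names are just placeholders.

def hits_at_k(ranked, gold, k):
    """1.0 if any gold answer appears in the top-k predictions, else 0.0."""
    return 1.0 if any(a in gold for a in ranked[:k]) else 0.0

def reciprocal_rank(ranked, gold):
    """1 / rank of the first correct answer, 0.0 if none is found."""
    for i, a in enumerate(ranked, start=1):
        if a in gold:
            return 1.0 / i
    return 0.0

def evaluate(predictions, gold_answers, k=10):
    """predictions: list of ranked candidate lists; gold_answers: list of sets."""
    n = len(predictions)
    return {
        "hits@%d" % k: sum(hits_at_k(r, g, k) for r, g in zip(predictions, gold_answers)) / n,
        "MRR": sum(reciprocal_rank(r, g) for r, g in zip(predictions, gold_answers)) / n,
        "accuracy": sum(hits_at_k(r, g, 1) for r, g in zip(predictions, gold_answers)) / n,
    }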

Convert (Elsevier) ScienceQA data

As part of #2. Maybe something like:

"question": "dataset_for(parsing, ?)",
"support": [ "list of paragraphs created by a heuristic" ],
"answer": ["list of answers from Wikipedia"]

We should also generate cloze questions from paragraphs that mention the same entity:

"question": "we used the ? corpus for our parsing model",
"support": [ "list of paragraphs created by a heuristic involving parsing, but removing the sentence that is the basis for the cloze question" ],
"answer": ["answer as taken from the cloze question sentence"]

We can also create cloze questions from the highlights.

Wikipedia multi-hop Dataset creation

Create a multi-hop QA dataset with NL supporting facts, exploiting the link structure of Wikipedia.
So far I have a few todos:

  • extract the link matrix of all Wikipedia articles towards each other
  • identify triangulation cases -- articles whose outlink articles are connected (see the sketch below)
  • filter out the relevant sentences that link the entities, and extract those sentences/paragraphs
  • transform into cloze-style questions with supporting sentences
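
A small sketch of the triangulation step, assuming the link structure has already been extracted into a dict mapping each article title to its set of outlinked titles (the toy data below is made up for illustration).

from itertools import combinations

def triangulation_cases(outlinks):
    """Yield (article, a, b) where a and b are outlinks of article and one links to the other."""
    for article, targets in outlinks.items():
        for a, b in combinations(sorted(targets), 2):
            if b in outlinks.get(a, set()) or a in outlinks.get(b, set()):
                yield article, a, b

# toy example
links = {"Parsing": {"Penn Treebank", "Statistical NLP"},
         "Penn Treebank": {"Statistical NLP"},
         "Statistical NLP": set()}
print(list(triangulation_cases(links)))
# [('Parsing', 'Penn Treebank', 'Statistical NLP')]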

Delexicalized Alignment Model

I believe there is a relatively task-independent model that relies on the following assumptions:

  • Good answers are close to contexts that are well aligned with the question.
  • A question utterance is well aligned with a context if
    • there are a lot of words that are semantically close (e.g. based on word embeddings), and
    • the aligned words are locally coherent (for any phrase w1 w2 in the question, the aligned words c1 and c2 in the context are close).
  • (Additionally) the Wh-word in the question should be aligned with the answer phrase, and the same spatial consistency constraints should hold on the complete alignment including the Wh-word.

The model may only have a few parameters that capture spatial consistency, weight the word embedding dimensions, etc. Further, one could build in a simple latent "coreference" mechanism.
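
A rough numpy sketch of the scoring idea, assuming word embeddings are available as a dict of unit-normalised vectors; the coherence_weight parameter and the exact penalty are illustrative placeholders, not a fixed design.

import numpy as np

def alignment_score(question, context, emb, coherence_weight=0.1):
    """question, context: lists of tokens; emb: dict token -> unit-normalised numpy vector."""
    sims = np.array([[float(np.dot(emb[q], emb[c])) for c in context] for q in question])
    aligned = sims.argmax(axis=1)          # best context position for each question word
    semantic = float(sims.max(axis=1).mean())  # how semantically close the aligned words are
    # local coherence: adjacent question words should align to nearby context positions
    jumps = np.abs(np.diff(aligned))
    coherence_penalty = float(jumps.mean()) if len(jumps) else 0.0
    return semantic - coherence_weight * coherence_penalty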

Converters for datasets into our format

Implement loaders that save into the format of #1. To do:

  • FB15k (plus the FB15K-237 MSR version) #6
  • MCTest
  • Squad #13
  • Children Book Test
  • AI2 Science Exam
  • ScienceQA #9
  • Stanford EE dataset
  • RTE datasets (SNLI?)
  • TAC KBP #62
  • bAbI
  • MovieQA (without the movies)
  • SimpleQuestions #22
  • Arithmetic Word Problems #21
  • AI2 Animal Tensor #31
  • Kinship #32
  • LAMBADA #34
  • Who did What cloze-style QA dataset #44

(Add as you see fit).

Whoever is assigned (or assigns themselves) to one of these should create a new issue specific to that format.

SNLI loader

Loader for SNLI data into the quebap format, with different options for support:

  • support from training data
  • add support from external data (WordNet / PPDB)

Follow PEP8 python styleguides

If we are going to work on this code more collaboratively, we should follow a consistent style guide such as PEP8. The pep8 checker (https://pypi.python.org/pypi/pep8) is easy to install and run on the code. PyCharm also flags and can automatically fix certain style issues; if your editor isn't PyCharm, consider installing a formatter for it.

Snippets for datasets

As Sebastian suggested, it would be great to have some tiny toy versions of the datasets for debugging and validation.

FB-15k and FB15K-237 loader

Load FB-15k and FB15K-237 into our format in #1. This requires the definition of a support neighbourhood for a given entity (and question). For example, we can use all entities that are direct neighbours:

"question": "born_in(BarackObama, ? )",
"support": [
  "BarackObama was born in Hawaii",
  "president_of(BarackObama, USA)"
]
"candidates": { "filename": "filename for file with list of candidates" }

Part of #2.

In the future we could also support a more "global" (Johannes) view of support by providing the list of all facts as support, for example via a filename pointing to the facts:

  "support" : {"filename":"filename of file with all training/test facts"}

Quebap reader for Arithmetic Word Problems

Options for conversion:

  • use the last sentence as question, other sentences as support.
  • use all sentences as support, and use a generic "answer?" question as question

For answers:

  • Could use the templates as answer candidates?

Column-less "USchema" Baseline

  • Get a representation from each context for each answer candidate.
  • Calculate a score for the answer candidate by aggregating over (functions of) the context representations, and comparing to a question representation.
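
A minimal numpy sketch of this baseline, using bag-of-embeddings encoders and max-aggregation as placeholder choices; the encoder and the aggregation function are assumptions, not part of the issue text.

import numpy as np

def bag_of_embeddings(tokens, emb, dim):
    """Mean of token embeddings; zero vector if nothing is in the vocabulary."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def candidate_score(question_tokens, contexts, emb, dim=50):
    """contexts: list of token lists, each a context mentioning the candidate."""
    q = bag_of_embeddings(question_tokens, emb, dim)
    context_scores = [float(np.dot(q, bag_of_embeddings(c, emb, dim))) for c in contexts]
    return max(context_scores) if context_scores else float("-inf")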

Automatically suggest candidate spans

For short-answer datasets like Squad, the quebap way to approach this problem would be to generate candidate answer spans. This can be done offline and allows multiple-choice style models to work on these datasets without modification.

There are a few ways to do this. One is checked into preprocess/suggest_span_candidates.py, using POS-tag sequences of the training answers. However, coverage is not great: using the top 300 POS sequences, we get about 70% coverage, and around 14% of answers have POS-tag sequences unique to that instance.

There are other ways to do this which might be complementary to this approach. One would be to use parse trees to make candidate suggestions.

One disadvantage of the POS-tag approach is that the sequence ['DT'] sneaks in, which makes all determiners possible answers. In some cases it would make sense to look at the word itself and, if it is, say, 'the', not suggest it.
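
A rough sketch of the POS-sequence idea (this is not the checked-in preprocess/suggest_span_candidates.py): it assumes the support document is already tagged as (token, tag) pairs and that a whitelist of frequent answer POS sequences is available, and it includes the 'the'-style filter mentioned above.

STOP_SINGLETONS = {"the", "a", "an"}

def suggest_spans(tagged_tokens, allowed_pos_seqs, max_len=6):
    """tagged_tokens: list of (token, tag); allowed_pos_seqs: set of POS-tag tuples."""
    spans = []
    for start in range(len(tagged_tokens)):
        for end in range(start + 1, min(start + max_len, len(tagged_tokens)) + 1):
            window = tagged_tokens[start:end]
            pos_seq = tuple(tag for _, tag in window)
            if pos_seq not in allowed_pos_seqs:
                continue
            if pos_seq == ("DT",) and window[0][0].lower() in STOP_SINGLETONS:
                continue  # don't suggest bare determiners like "the"
            spans.append((start, end, " ".join(tok for tok, _ in window)))
    return spans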

Squad Reader

Conversion to #1 as part of #2.

(own issue for Squad-related work)

Web Shell/Interface for Demo paper

It would be great to have a simple web user interface on top of our models.

  • Questions can be entered, direct answers are returned
  • Provenance is returned

This will require an API for the models (and an integrated IR step).

Terminology for quebap datasets

It would be good to have names for the individual components of a quebap file (and not call instances 'quebaps'). Here is a proposal (sort of obvious, but it's good to have it official somewhere):

  • Top level: reading dataset
  • Element that contains a set of support documents and questions for those: reading instance
  • Support documents, questions, candidates, answers etc. (obvious based on keys)

Error Analysis

Error analysis to use for the analysis and discussion part of papers. Leave this for some time in the future, e.g.

  • Error analysis of weights for different candidates
  • Pairwise comparison of different models
  • Heat maps for attention

Reading Architecture and Script

I created a lightweight reading architecture and script here:

https://github.com/uclmr/quebap/blob/master/quebap/model/reader.py#L430

You can get help for the script like so:

python3 quebap/model/reader.py -h

and running it would look like:

python3 quebap/model/reader.py --train quebap/data/NYT/naacl2013_train.quebap.json --batch_size 100 --epochs 10 --train_end 1 --model boe

The main "architecture" choice is to wrap all model dependent state, conversion between quebap instances and tensorflow, and batching, into the Batcher class. Everything on top of that is stateless construction of TF nodes based on input TF nodes. So composing a reader will look like this:

https://github.com/uclmr/quebap/blob/master/quebap/model/reader.py#L334

Currently the reader supports two models:

  • model_f: represents questions and candidates using embeddings, score via dot product
  • boe: bag of embeddings, represent questions and candidates as the sum of their token embeddings, score via dot product (if you tokenize strings as single tokens you actually get model_f)

Very lightly tested. The training loop, evaluation etc. work, but they are still very much placeholders. Biggest missing parts:

  • support for support (the support documents are not used yet)
  • assumes a single global set of candidates right now
  • support for several questions per support
  • tests for the batchers. The batcher is the most tedious bit to write, and so it needs ample testing. One way to make these tests easy is to require batchers to do round-trip conversions (from quebap reading dataset to feed_dicts and back), because then the tests could simply check that convert_backward(convert_forward(data)) == data (see the sketch after this list).
  • RNN representations of sequences.
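
A pytest-style sketch of the round-trip test pattern, written against a trivial stand-in batcher so it runs on its own; the real Batcher in quebap/model/reader.py is TF-specific and these method names are hypothetical.

class IdentityBatcher:
    """Stand-in honouring the round-trip contract convert_backward(convert_forward(x)) == x."""
    def convert_forward(self, instances):
        return [dict(inst) for inst in instances]   # real code would build feed_dicts here
    def convert_backward(self, batches):
        return [dict(batch) for batch in batches]   # real code would map feed_dicts back

def test_batcher_round_trip():
    data = [{"question": "born_in(BarackObama, ?)",
             "candidates": ["Hawaii", "USA"],
             "answer": "Hawaii"}]
    batcher = IdentityBatcher()
    assert batcher.convert_backward(batcher.convert_forward(data)) == data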

To push for model reuse and sharing I hope that we could try to get all our models into this framework where possible (it makes very few assumptions beyond multiple choice so this should be doable), and extend/improve as we go.

JSON schema validation

As a safety check for these format readers, it probably makes sense to release a proper JSON schema, and also to have a script that checks that everything we're producing with these readers adheres to it. I have a small script in eqa-tools that can do the verification, so it is simple enough to bring that over.

But I don't actually know how to specify the schema file when there is a lot of nesting. I wound up using an online tool to reverse-engineer the schema file from the JSON for eqa-tools!
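
A rough sketch of what a schema for the single-instance format from #1 could look like, validated with the Python jsonschema package; the real schema would still need the nested variants (filename-based support, structured answers with spans, etc.).

from jsonschema import validate

INSTANCE_SCHEMA = {
    "type": "object",
    "required": ["question", "support"],
    "properties": {
        "question": {"type": "string"},
        "support": {"type": "array", "items": {"type": "string"}},
        "candidates": {"type": "array", "items": {"type": "string"}},
        "answer": {"type": ["string", "array", "object"]},
    },
}

instance = {
    "question": "Where was Barack Obama born",
    "support": ["Barack Obama was born in Hawaii"],
    "candidates": ["Hawaii", "USA"],
    "answer": "Hawaii",
}
validate(instance, INSTANCE_SCHEMA)  # raises ValidationError if the instance doesn't conform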

convert MovieQA data

Prepare the MovieQA dataset for our shared dataset format.
It's from this paper here.

Note: One of the main takeaways of our reading group on this paper was that the video segments don't actually contribute a lot to model performance. It's probably best to skip the videos and only focus on the text.

Stats for various dataset issues

It would be great to have some code for dumping out data statistics to help us brainstorm strategies and compare datasets at a glance.

Some natural things if your answer is a span:

  • what is the length distribution of the spans
  • what are their POS tags like
  • how frequently do they align to constituents
  • which constituents do they align to

Please add other useful statistics here.
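
A sketch of the span-length statistic, assuming the file holds a JSON list of instances whose answers carry provenance spans as proposed in the format issue ("answer": {"value": ..., "span": [start, end]}); POS-tag and constituent statistics are left out here since they need a tagger/parser.

import json
from collections import Counter

def span_length_distribution(path):
    with open(path) as f:
        instances = json.load(f)
    lengths = Counter()
    for inst in instances:
        answer = inst.get("answer")
        if isinstance(answer, dict) and "span" in answer:
            start, end = answer["span"]
            lengths[end - start] += 1   # character length; token length would need offset annotation
    return lengths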

Unit Tests

We are using pytest for now. CI finds test_*.py files in the source tree and executes the test_* functions in those (standard configuration). We have a simple test in quebap/projects/modelF/test_evaluation.py.

Decision to be made:

  1. Test files "inline" in the same directory as the corresponding source code
  2. Test files in separate test source tree.

I like the simplicity of 1 and don't like creating too many directories and synchronising directory structures, so I am voting for 1...

Global root directory variable

Readers should use a global root directory variable to search for data. On local machines this could be set to ./quebap/data (instead of ./quebap/quebap/data). On cannon/emerald it should be set to /cluster/project2/mr/data/.
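
One possible way to do this is to resolve the root from an environment variable with a local default; the variable name QUEBAP_DATA_ROOT is made up here, while the paths are the ones quoted above.

import os

def data_root():
    default = "./quebap/data"                       # local checkout layout
    return os.environ.get("QUEBAP_DATA_ROOT", default)

def data_path(*parts):
    return os.path.join(data_root(), *parts)

# On cannon/emerald: export QUEBAP_DATA_ROOT=/cluster/project2/mr/data/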

Simple Overlap/Sliding Window baseline

Ideally this should work at least a little on several datasets, reading in format #1. Several versions:

  • Vanilla version, preset hyperparameters
  • One for which hyperparameters can be tuned on dev sets
  • One with a bit of learning on training corpus.

Quite similar (if not identical) to the MCTest baseline.
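
A sketch of the vanilla version in the spirit of the MCTest sliding-window baseline: score each candidate by the best token overlap between a support window and the bag of words of question plus candidate, with the window size as the preset hyperparameter.

def sliding_window_score(support_tokens, question_tokens, candidate_tokens, window=10):
    """Best overlap between any support window and the question+candidate bag of words."""
    target = set(question_tokens) | set(candidate_tokens)
    best = 0
    for start in range(max(1, len(support_tokens) - window + 1)):
        window_tokens = support_tokens[start:start + window]
        best = max(best, sum(1 for t in window_tokens if t in target))
    return best

def predict(support_tokens, question_tokens, candidates):
    """candidates: dict mapping candidate string -> its token list."""
    return max(candidates, key=lambda c: sliding_window_score(
        support_tokens, question_tokens, candidates[c]))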

Setup continuous integration

  • Travis requires an open-source project
  • wercker.com doesn't, but may at some point soon (when it's out of beta?)
  • Get a VM from TSG to do this

NYT Loader + Model F

Goal: write a simple version of model F for the NYT data (the simplest possible proof of concept for MF models).

  • NYT2quebap (with single list of candidates; train: 1 quebap/instance; test: 1 quebap/relation with multiple answers)
  • generate train / test data set
  • Model F graph
  • quebap loader (with mapping symbols <> indices)
  • batcher (with naive negative sampling)
  • trainer (generic)
  • training loop
  • evaluation: measure effectiveness on test facts per relation
  • tune parameters
  • usage simplification and documentation
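
A framework-free sketch of the core Model F score, a dot product between a question (relation pattern) embedding and a candidate embedding; the vocabularies and random embeddings below are toy placeholders, and the actual implementation would build this as a TensorFlow graph trained with negative sampling.

import numpy as np

rng = np.random.RandomState(0)
dim = 10
question_vocab = {"born_in": 0, "president_of": 1}
candidate_vocab = {"(BarackObama, Hawaii)": 0, "(BarackObama, USA)": 1}

Q = rng.randn(len(question_vocab), dim) * 0.1   # question/relation embeddings
C = rng.randn(len(candidate_vocab), dim) * 0.1  # candidate embeddings

def score(question, candidate):
    return float(np.dot(Q[question_vocab[question]], C[candidate_vocab[candidate]]))

def rank_candidates(question):
    return sorted(candidate_vocab, key=lambda c: -score(question, c))

print(rank_candidates("born_in"))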

Define simple (Json?) format for reading tasks

Something like:

{
   "question": "Where was Barack Obama born",
   "support": [
       "Barack Obama was born in Hawaii",
       "Barack Obama is the president of the USA"
  ],
  "candidates": [
      "Hawaii",
      "USA",
      "whatever"
  ],
  "answer": "Hawaii"
}

Comments:

  1. support is a list of support documents. For now these can be strings, but we should also support a structured version with additional offset annotation. In many settings we have several questions on the same support documents, in which case this will create a lot of duplication. We should therefore also allow support documents to be filenames.
  2. A document can contain several sentences.
  3. candidates is a list of candidate answers. Again, to avoid duplication I can imagine this also being a filename pointing to a list of entities (e.g. for FB15k, where the candidate entities for each question are usually identical).
  4. The Squad dataset (and likely others) doesn't have an explicit candidate answer set. Instead the answer needs to be a span of the support. We can handle this either by leaving the candidate list out, or by preparing candidates based on heuristics.
  5. Some datasets/questions can have multiple answers.
  6. Tokenisation etc. should be provided via offset annotation.
  7. We need a format for storing the answers of a system. This could just be a file with answers in the order of the gold file.
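
A minimal loader sketch for this format, including the filename indirection for support and candidates discussed in the comments above; it assumes the file holds a JSON list of instances, and everything beyond that (offsets, multiple answers) is deliberately left out.

import json

def _resolve(field):
    """Return a list of strings, reading from a file if a {'filename': ...} dict is given."""
    if isinstance(field, dict) and "filename" in field:
        with open(field["filename"]) as f:
            return [line.strip() for line in f if line.strip()]
    return field or []

def load_instances(path):
    with open(path) as f:
        instances = json.load(f)   # assumes a JSON list of instances
    for inst in instances:
        inst["support"] = _resolve(inst.get("support"))
        inst["candidates"] = _resolve(inst.get("candidates"))
    return instances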

Spans, Provenance

For some datasets we may have provenance information. That is, for the correct answer we know where the answer is mentioned (in the form of an answer span). We can include this information in the answer representation:

 "answer": { "value": "Hawaii", "doc_id": 0, "span": [26, 32] }

Personally I would also like to produce an alignment between the question and the context around the answer (in order to score it), but I don't think this needs to be in this format now.
