
uclnlp / jack


Jack the Reader

License: MIT

Python 58.13% Shell 0.84% Jupyter Notebook 8.54% Makefile 0.14% HTML 19.69% CSS 2.65% JavaScript 10.01%
deep-learning knowledge-base natural-language-inference natural-language-processing question-answering tensorflow

jack's People

Contributors

dirkweissenborn, georgwiese, isabelleaugenstein, jg8610, johannesmaxwel, marziehsaeidi, mbosnjak, mitchelljeff, narad, pminervini, riedelcastro, rockt, scienceie, tdmeeste, timdettmers, undwie, vict0rsch

jack's Issues

Evaluation Script

Read in gold data in format #1 and predictions in some format (open question: how is system output stored?), and calculate accuracy/ranking metrics.

Features:

  • Rank accuracy
  • Hits@K
  • MRR
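
A minimal sketch of the ranking metrics, assuming each prediction is a ranked list of candidate answers and gold answers are a set per question; the data-loading side is left out because the prediction storage format is still an open question above, and the function names are just placeholders.

def hits_at_k(ranked, gold, k):
    """1.0 if any gold answer appears in the top-k predictions, else 0.0."""
    return 1.0 if any(a in gold for a in ranked[:k]) else 0.0

def reciprocal_rank(ranked, gold):
    """1 / rank of the first correct answer, 0.0 if none is found."""
    for i, a in enumerate(ranked, start=1):
        if a in gold:
            return 1.0 / i
    return 0.0

def evaluate(predictions, gold_answers, k=10):
    """predictions: list of ranked candidate lists; gold_answers: list of sets."""
    n = len(predictions)
    return {
        "hits@%d" % k: sum(hits_at_k(r, g, k) for r, g in zip(predictions, gold_answers)) / n,
        "MRR": sum(reciprocal_rank(r, g) for r, g in zip(predictions, gold_answers)) / n,
        "accuracy": sum(hits_at_k(r, g, 1) for r, g in zip(predictions, gold_answers)) / n,
    }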

Convert (Elsevier) ScienceQA data

As part of #2. Maybe something like:

"question": "dataset_for(parsing, ?)",
"support": [ "list of paragraphs created by a heuristic" ],
"answer": ["list of answers from Wikipedia"]

We should also generate cloze questions from paragraphs that mention the same entity:

"question": "we used the ? corpus for our parsing model",
"support": [ "list of paragraphs created by a heuristic involving parsing, but removing the sentence that is the basis for the cloze question" ],
"answer": ["answer as taken from the cloze question sentence"]

We can also create cloze questions from the highlights.

Wikipedia multi-hop Dataset creation

Create a multi-hop QA dataset with NL supporting facts, exploiting the link structure of Wikipedia.
So far I have a few todos:

  • extract the link matrix of all Wikipedia articles towards each other
  • identify triangulation cases -- articles whose outlink articles are connected (see the sketch below)
  • filter out the relevant sentences that link the entities, and extract those sentences/paragraphs
  • transform into cloze-style questions with supporting sentences
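
A small sketch of the triangulation step, assuming the link structure has already been extracted into a dict mapping each article title to its set of outlinked titles (the toy data below is made up for illustration).

from itertools import combinations

def triangulation_cases(outlinks):
    """Yield (article, a, b) where a and b are outlinks of article and one links to the other."""
    for article, targets in outlinks.items():
        for a, b in combinations(sorted(targets), 2):
            if b in outlinks.get(a, set()) or a in outlinks.get(b, set()):
                yield article, a, b

# toy example
links = {"Parsing": {"Penn Treebank", "Statistical NLP"},
         "Penn Treebank": {"Statistical NLP"},
         "Statistical NLP": set()}
print(list(triangulation_cases(links)))
# [('Parsing', 'Penn Treebank', 'Statistical NLP')]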

Delexicalized Alignment Model

I believe there is a relatively task-independent model that relies on the following assumptions:

  • Good answers are close to contexts that are well aligned with the question.
  • A question utterance is well aligned with a context if
    • there are a lot of words that are semantically close (e.g. based on word embeddings), and
    • the aligned words are locally coherent (for any phrase w1 w2 in the question, the aligned words c1 and c2 in the context are close).
  • (Additionally) the Wh-word in the question should be aligned with the answer phrase, and the same spatial consistency constraints should hold on the complete alignment including the Wh-word.

The model may only have a few parameters that capture spatial consistency, weight the word embedding dimensions, etc. Further, one could build in a simple latent "coreference" mechanism.
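
A rough numpy sketch of the scoring idea, assuming word embeddings are available as a dict of unit-normalised vectors; the coherence_weight parameter and the exact penalty are illustrative placeholders, not a fixed design.

import numpy as np

def alignment_score(question, context, emb, coherence_weight=0.1):
    """question, context: lists of tokens; emb: dict token -> unit-normalised numpy vector."""
    sims = np.array([[float(np.dot(emb[q], emb[c])) for c in context] for q in question])
    aligned = sims.argmax(axis=1)          # best context position for each question word
    semantic = float(sims.max(axis=1).mean())  # how semantically close the aligned words are
    # local coherence: adjacent question words should align to nearby context positions
    jumps = np.abs(np.diff(aligned))
    coherence_penalty = float(jumps.mean()) if len(jumps) else 0.0
    return semantic - coherence_weight * coherence_penalty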

Converters for datasets into our format

Implement loaders that save into the format of #1. To do:

  • FB15k (plus the FB15K-237 MSR version) #6
  • MCTest
  • Squad #13
  • Children Book Test
  • AI2 Science Exam
  • ScienceQA #9
  • Stanford EE dataset
  • RTE datasets (SNLI?)
  • TAC KBP #62
  • bAbI
  • MovieQA (without the movies)
  • SimpleQuestions #22
  • Arithmetic Word Problems #21
  • AI2 Animal Tensor #31
  • Kinship #32
  • LAMBADA #34
  • Who did What cloze-style QA dataset #44

(Add as you see fit).

Whoever is assigned (or assigns themselves) to one of these should create a new issue specific to that format.

SNLI loader

Loader for SNLI data into the quebap format, with different options for support:

  • support from training data
  • add support from external data (WordNet / PPDB)

Follow PEP8 python styleguides

If we are going to work on this code more collaboratively, we should follow a consistent style guide such as PEP8. The pep8 checker (https://pypi.python.org/pypi/pep8) is easy to install and run on the code. PyCharm also flags and can automatically fix certain style issues; if your editor isn't PyCharm, consider installing a formatter for it.

Snippets for datasets

As Sebastian suggested, it would be great to have some tiny toy versions of the datasets for debugging and validation.

FB-15k and FB15K-237 loader

Load FB-15k and FB15K-237 into our format in #1. This requires the definition of a support neighbourhood for a given entity (and question). For example, we can use all entities that are direct neighbours:

"question": "born_in(BarackObama, ? )",
"support": [
  "BarackObama was born in Hawaii",
  "president_of(BarackObama, USA)"
]
"candidates": { "filename": "filename for file with list of candidates" }

Part of #2.

In the future we could also support a more "global" (Johannes) view of support by providing the list of all facts as support, for example via a filename pointing to the facts:

  "support" : {"filename":"filename of file with all training/test facts"}

Quebap reader for Arithmetic Word Problems

Options for conversion:

  • use the last sentence as question, other sentences as support.
  • use all sentences as support, and use a generic "answer?" question as question

For answers:

  • Could use the templates as answer candidates?

Column-less "USchema" Baseline

  • Get a representation from each context for each answer candidate.
  • Calculate a score for the answer candidate by aggregating over (functions of) the context representations, and comparing to a question representation.
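
A minimal numpy sketch of this baseline, using bag-of-embeddings encoders and max-aggregation as placeholder choices; the encoder and the aggregation function are assumptions, not part of the issue text.

import numpy as np

def bag_of_embeddings(tokens, emb, dim):
    """Mean of token embeddings; zero vector if nothing is in the vocabulary."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def candidate_score(question_tokens, contexts, emb, dim=50):
    """contexts: list of token lists, each a context mentioning the candidate."""
    q = bag_of_embeddings(question_tokens, emb, dim)
    context_scores = [float(np.dot(q, bag_of_embeddings(c, emb, dim))) for c in contexts]
    return max(context_scores) if context_scores else float("-inf")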

Automatically suggest candidate spans

For short-answer datasets like Squad, the quebap way to approach this problem would be to generate candidate answer spans. This can be done offline and allows multiple-choice style models to work on these datasets without modification.

There are a few ways to do this. One is checked into preprocess/suggest_span_candidates.py, using POS-tag sequences of the training answers. However, coverage is not great: using the top 300 POS sequences, we get about 70% coverage, and around 14% of answers have POS-tag sequences unique to that instance.

There are other ways to do this which might be complementary to this approach. One would be to use parse trees to make candidate suggestions.

One disadvantage of the POS-tag approach is that the sequence ['DT'] sneaks in, which makes all determiners possible answers. In some cases it would make sense to look at the word itself and, if it is, say, 'the', not suggest it.
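
A rough sketch of the POS-sequence idea (this is not the checked-in preprocess/suggest_span_candidates.py): it assumes the support document is already tagged as (token, tag) pairs and that a whitelist of frequent answer POS sequences is available, and it includes the 'the'-style filter mentioned above.

STOP_SINGLETONS = {"the", "a", "an"}

def suggest_spans(tagged_tokens, allowed_pos_seqs, max_len=6):
    """tagged_tokens: list of (token, tag); allowed_pos_seqs: set of POS-tag tuples."""
    spans = []
    for start in range(len(tagged_tokens)):
        for end in range(start + 1, min(start + max_len, len(tagged_tokens)) + 1):
            window = tagged_tokens[start:end]
            pos_seq = tuple(tag for _, tag in window)
            if pos_seq not in allowed_pos_seqs:
                continue
            if pos_seq == ("DT",) and window[0][0].lower() in STOP_SINGLETONS:
                continue  # don't suggest bare determiners like "the"
            spans.append((start, end, " ".join(tok for tok, _ in window)))
    return spans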

Squad Reader

Conversion to #1 as part of #2.

(own issue for Squad-related work)

Web Shell/Interface for Demo paper

It would be great to have a simple web user interface on top of our models.

  • Questions can be entered, direct answers are returned
  • Provenance is returned

This will require an API for the models (and an integrated IR step).

Terminology for quebap datasets

It would be good to have names for the individual components of a quebap file (and not call instances 'quebaps'). Here is a proposal (sort of obvious, but it's good to have it official somewhere):

  • Top level: reading dataset
  • Element that contains a set of support documents and questions for those: reading instance
  • Support documents, questions, candidates, answers etc. (obvious based on keys)

Error Analysis

Error analysis to use for the analysis and discussion part of papers. Leave this for some time in the future, e.g.

  • Error analysis of weights for different candidates
  • Pairwise comparison of different models
  • Heat maps for attention

Reading Architecture and Script

I created a lightweight reading architecture and script here:

https://github.com/uclmr/quebap/blob/master/quebap/model/reader.py#L430

You can get help for the script like so:

python3 quebap/model/reader.py -h

and running it would look like:

python3 quebap/model/reader.py --train quebap/data/NYT/naacl2013_train.quebap.json --batch_size 100 --epochs 10 --train_end 1 --model boe

The main "architecture" choice is to wrap all model dependent state, conversion between quebap instances and tensorflow, and batching, into the Batcher class. Everything on top of that is stateless construction of TF nodes based on input TF nodes. So composing a reader will look like this:

https://github.com/uclmr/quebap/blob/master/quebap/model/reader.py#L334

Currently the reader supports two models:

  • model_f: represents questions and candidates using embeddings, score via dot product
  • boe: bag of embeddings, represent questions and candidates as the sum of their token embeddings, score via dot product (if you tokenize strings as single tokens you actually get model_f)

Very lightly tested. The training loop, evaluation etc. work, but they are still very much placeholders. Biggest missing parts:

  • support for support (the support documents are not used yet)
  • assumes a single global set of candidates right now
  • support for several questions per support
  • tests for the batchers. The batcher is the most tedious bit to write, and so it needs ample testing. One way to make these tests easy is to require batchers to do round-trip conversions (from quebap reading dataset to feed_dicts and back), because then the tests could simply check that convert_backward(convert_forward(data)) == data (see the sketch after this list).
  • RNN representations of sequences.
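
A pytest-style sketch of the round-trip test pattern, written against a trivial stand-in batcher so it runs on its own; the real Batcher in quebap/model/reader.py is TF-specific and these method names are hypothetical.

class IdentityBatcher:
    """Stand-in honouring the round-trip contract convert_backward(convert_forward(x)) == x."""
    def convert_forward(self, instances):
        return [dict(inst) for inst in instances]   # real code would build feed_dicts here
    def convert_backward(self, batches):
        return [dict(batch) for batch in batches]   # real code would map feed_dicts back

def test_batcher_round_trip():
    data = [{"question": "born_in(BarackObama, ?)",
             "candidates": ["Hawaii", "USA"],
             "answer": "Hawaii"}]
    batcher = IdentityBatcher()
    assert batcher.convert_backward(batcher.convert_forward(data)) == data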

To push for model reuse and sharing I hope that we could try to get all our models into this framework where possible (it makes very few assumptions beyond multiple choice so this should be doable), and extend/improve as we go.

JSON schema validation

As a safety check for these format readers, it probably makes sense to release a proper JSON schema, and also to have a script that checks that everything we're producing with these readers adheres to it. I have a small script in eqa-tools that can do the verification, so it is simple enough to bring that over.

But I don't actually know how to specify the schema file when there is a lot of nesting. I wound up using an online tool to reverse-engineer the schema file from the JSON for eqa-tools!
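
A rough sketch of what a schema for the single-instance format from #1 could look like, validated with the Python jsonschema package; the real schema would still need the nested variants (filename-based support, structured answers with spans, etc.).

from jsonschema import validate

INSTANCE_SCHEMA = {
    "type": "object",
    "required": ["question", "support"],
    "properties": {
        "question": {"type": "string"},
        "support": {"type": "array", "items": {"type": "string"}},
        "candidates": {"type": "array", "items": {"type": "string"}},
        "answer": {"type": ["string", "array", "object"]},
    },
}

instance = {
    "question": "Where was Barack Obama born",
    "support": ["Barack Obama was born in Hawaii"],
    "candidates": ["Hawaii", "USA"],
    "answer": "Hawaii",
}
validate(instance, INSTANCE_SCHEMA)  # raises ValidationError if the instance doesn't conform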

convert MovieQA data

Prepare the MovieQA dataset for our shared dataset format.
It's from this paper here.

Note: One of the main takeaways of our reading group on this paper was that the video segments don't actually contribute a lot to model performance. It's probably best to skip the videos and only focus on the text.

Stats for various dataset issues

It would be great to have some code for dumping out data statistics to help us brainstorm strategies and compare datasets at a glance.

Some natural things if your answer is a span:

  • what is the length distribution of the spans
  • what are their POS tags like
  • how frequently do they align to constituents
  • which constituents do they align to

Please add other useful statistics here.
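
A sketch of the span-length statistic, assuming the file holds a JSON list of instances whose answers carry provenance spans as proposed in the format issue ("answer": {"value": ..., "span": [start, end]}); POS-tag and constituent statistics are left out here since they need a tagger/parser.

import json
from collections import Counter

def span_length_distribution(path):
    with open(path) as f:
        instances = json.load(f)
    lengths = Counter()
    for inst in instances:
        answer = inst.get("answer")
        if isinstance(answer, dict) and "span" in answer:
            start, end = answer["span"]
            lengths[end - start] += 1   # character length; token length would need offset annotation
    return lengths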

Unit Tests

We are using pytest for now. CI finds test_*.py files in the source tree and executes the test_* functions in those (standard configuration). We have a simple test in quebap/projects/modelF/test_evaluation.py.

Decision to be made:

  1. Test files "inline" in the same directory as the corresponding source code
  2. Test files in separate test source tree.

I like the simplicity of 1 and don't like creating too many directories and synchronising directory structures, so I am voting for 1...

Global root directory variable

Readers should use a global root directory variable to search for data. On local machines this could be set to ./quebap/data (instead of ./quebap/quebap/data). On cannon/emerald it should be set to /cluster/project2/mr/data/.
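
One possible way to do this is to resolve the root from an environment variable with a local default; the variable name QUEBAP_DATA_ROOT is made up here, while the paths are the ones quoted above.

import os

def data_root():
    default = "./quebap/data"                       # local checkout layout
    return os.environ.get("QUEBAP_DATA_ROOT", default)

def data_path(*parts):
    return os.path.join(data_root(), *parts)

# On cannon/emerald: export QUEBAP_DATA_ROOT=/cluster/project2/mr/data/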

Simple Overlap/Sliding Window baseline

Ideally this should work at least a little on several datasets, reading in format #1. Several versions:

  • Vanilla version, preset hyperparameters
  • One for which hyperparameters can be tuned on dev sets
  • One with a bit of learning on training corpus.

Quite similar (if not identical) to the MCTest baseline.
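
A sketch of the vanilla version in the spirit of the MCTest sliding-window baseline: score each candidate by the best token overlap between a support window and the bag of words of question plus candidate, with the window size as the preset hyperparameter.

def sliding_window_score(support_tokens, question_tokens, candidate_tokens, window=10):
    """Best overlap between any support window and the question+candidate bag of words."""
    target = set(question_tokens) | set(candidate_tokens)
    best = 0
    for start in range(max(1, len(support_tokens) - window + 1)):
        window_tokens = support_tokens[start:start + window]
        best = max(best, sum(1 for t in window_tokens if t in target))
    return best

def predict(support_tokens, question_tokens, candidates):
    """candidates: dict mapping candidate string -> its token list."""
    return max(candidates, key=lambda c: sliding_window_score(
        support_tokens, question_tokens, candidates[c]))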

Setup continuous integration

  • Travis requires an open-source project
  • wercker.com doesn't, but may at some point soon (when it's out of beta?)
  • Get a VM from TSG to do this

NYT Loader + Model F

Goal: write a simple version of model F for the NYT data (the simplest possible proof of concept for MF models).

  • NYT2quebap (with single list of candidates; train: 1 quebap/instance; test: 1 quebap/relation with multiple answers)
  • generate train / test data set
  • Model F graph
  • quebap loader (with mapping symbols <> indices)
  • batcher (with naive negative sampling)
  • trainer (generic)
  • training loop
  • evaluation: measure effectiveness on test facts per relation
  • tune parameters
  • usage simplification and documentation
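
A framework-free sketch of the core Model F score, a dot product between a question (relation pattern) embedding and a candidate embedding; the vocabularies and random embeddings below are toy placeholders, and the actual implementation would build this as a TensorFlow graph trained with negative sampling.

import numpy as np

rng = np.random.RandomState(0)
dim = 10
question_vocab = {"born_in": 0, "president_of": 1}
candidate_vocab = {"(BarackObama, Hawaii)": 0, "(BarackObama, USA)": 1}

Q = rng.randn(len(question_vocab), dim) * 0.1   # question/relation embeddings
C = rng.randn(len(candidate_vocab), dim) * 0.1  # candidate embeddings

def score(question, candidate):
    return float(np.dot(Q[question_vocab[question]], C[candidate_vocab[candidate]]))

def rank_candidates(question):
    return sorted(candidate_vocab, key=lambda c: -score(question, c))

print(rank_candidates("born_in"))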

Define simple (Json?) format for reading tasks

Something like:

{
   "question": "Where was Barack Obama born",
   "support": [
       "Barack Obama was born in Hawaii",
       "Barack Obama is the president of the USA"
  ],
  "candidates": [
      "Hawaii",
      "USA",
      "whatever"
  ],
  "answer": "Hawaii"
}

Comments:

  1. support is a list of support documents. For now these can be strings, but we should also support a structured version with additional offset annotation. In many settings we have several questions on the same support documents, in which case this will create a lot of duplication. We should therefore also allow support documents to be filenames.
  2. A document can contain several sentences.
  3. candidates is a list of candidate answers. Again, to avoid duplication I can imagine this also being a filename pointing to a list of entities (e.g. for FB15k, where the candidate entities for each question are usually identical).
  4. The Squad dataset (and likely others) doesn't have an explicit candidate answer set. Instead the answer needs to be a span of the support. We can handle this either by leaving the candidate list out, or by preparing candidates based on heuristics.
  5. Some datasets/questions can have multiple answers.
  6. Tokenisation etc. should be provided via offset annotation.
  7. We need a format for storing the answers of a system. This could just be a file with answers in the order of the gold file.
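
A minimal loader sketch for this format, including the filename indirection for support and candidates discussed in the comments above; it assumes the file holds a JSON list of instances, and everything beyond that (offsets, multiple answers) is deliberately left out.

import json

def _resolve(field):
    """Return a list of strings, reading from a file if a {'filename': ...} dict is given."""
    if isinstance(field, dict) and "filename" in field:
        with open(field["filename"]) as f:
            return [line.strip() for line in f if line.strip()]
    return field or []

def load_instances(path):
    with open(path) as f:
        instances = json.load(f)   # assumes a JSON list of instances
    for inst in instances:
        inst["support"] = _resolve(inst.get("support"))
        inst["candidates"] = _resolve(inst.get("candidates"))
    return instances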

Spans, Provenance

For some datasets we may have provenance information. That is, for the correct answer we know where the answer is mentioned (in the form of an answer span). We can include this information in the answer representation:

 "answer": { "value": "Hawaii", "doc_id": 0, "span": [26, 32] }

Personally I would also like to produce an alignment between the question and the context around the answer (in order to score it), but I don't think this needs to be in this format now.
