
allenai / ike


Build tables of information by extracting facts from indexed text corpora via a simple and effective query language.

Home Page: http://allenai.org/software/interactive-knowledge-extraction/

License: Apache License 2.0

Scala 73.20% Shell 0.60% CSS 1.83% HTML 0.56% JavaScript 23.81%

ike's People

Contributors

afader, aimichal, bhavanadalvi, carissas, chrisc36, codeviking, colinarenz, dirkgr, hiramhye, iellenberger, jkinkead, mmichelsonif, sbhaktha, schmmd, systemride


ike's Issues

Query log

Issue by afader
Wed Mar 18 23:58:30 2015
Originally opened as https://github.com/allenai/okcorpus/issues/44


OKC should save your query history, just like bash saves your command history. Query logs should be persisted across browser sessions. You should be able to view the query log and scroll through it, then click on one to re-execute it. When named patterns are available, there should be a link to create a new named pattern from an entry in the query log.
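A minimal sketch of how such a log could be persisted, assuming the browser's `localStorage` (or any object with `getItem`/`setItem`). The `QueryLog` class, the `okc.queryLog` key, and the 500-entry cap are hypothetical names and values chosen for illustration, not part of OKC.

```javascript
// Hypothetical sketch: a query log persisted through a Storage-like object.
// `storage` is anything with getItem/setItem (window.localStorage in a browser).
class QueryLog {
  constructor(storage, key = "okc.queryLog", maxEntries = 500) {
    this.storage = storage;
    this.key = key;
    this.maxEntries = maxEntries;
  }

  // Returns all logged queries, oldest first.
  entries() {
    const raw = this.storage.getItem(this.key);
    return raw ? JSON.parse(raw) : [];
  }

  // Appends a query with a timestamp and drops the oldest entries past the cap.
  record(query, now = Date.now()) {
    const next = [...this.entries(), { query, time: now }].slice(-this.maxEntries);
    this.storage.setItem(this.key, JSON.stringify(next));
    return next;
  }
}

// In-memory stand-in for localStorage so the sketch also runs outside a browser.
function memoryStorage() {
  const data = new Map();
  return {
    getItem: (k) => (data.has(k) ? data.get(k) : null),
    setItem: (k, v) => { data.set(k, String(v)); },
  };
}
```

Because every `QueryLog` reads from the store on demand, a new instance created in a later session sees the same entries, which is what re-executing an old query from the log would need.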

[CLOSED] Generalize to multiple capture groups

Issue by schmmd
Wed Feb 18 17:07:41 2015
Originally opened as https://github.com/allenai/okcorpus/issues/5


I realize this isn't what you want to target immediately, but it's worth considering now lest it become more difficult in the future.

Presently you always show the first capture group--and from that capture group you build a dictionary. What if the "dictionary" could have multiple columns? For example, someone might search for (JJ* NN+) is a part of (JJ* NN+) and get rows such as (mining, processing) and (liquidity, growth). This might be sufficient for the tool to be applied to create tables.
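To make the idea concrete, here is a rough sketch in which every capture group becomes one column of a row. A plain JavaScript `RegExp` over raw sentences stands in for the real token-level query language, and `extractRows` is a hypothetical helper, so treat this as an illustration of the table shape only.

```javascript
// Hypothetical sketch: every capture group becomes a table column.
// A plain JavaScript RegExp stands in for the token-level query language.
function extractRows(pattern, sentences) {
  const rows = [];
  const seen = new Set();
  for (const sentence of sentences) {
    // matchAll requires a pattern with the global (g) flag.
    for (const match of sentence.matchAll(pattern)) {
      const row = match.slice(1); // groups 1..n, one per column
      const key = JSON.stringify(row);
      if (!seen.has(key)) { // keep each distinct row once
        seen.add(key);
        rows.push(row);
      }
    }
  }
  return rows;
}

const partOf = /(\w+) is a part of (\w+)/g;
const partOfRows = extractRows(partOf, [
  "mining is a part of processing",
  "liquidity is a part of growth",
  "mining is a part of processing",
]);
// partOfRows: [["mining", "processing"], ["liquidity", "growth"]]
```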

Support Named Patterns

Issue by schmmd
Wed Feb 18 19:09:50 2015
Originally opened as https://github.com/allenai/okcorpus/issues/8


OpenRegex supports pattern hierarchies which allows people to write complex but understandable patterns. I understand that BlackLab doesn't support adding new types to its index without reindexing, but hierarchical patterns could be compiled down into base patterns. It's an open question how fast such complex queries would be.

For example, it would be nice to run the following ReVerb extractor in BlackLab.

https://github.com/allenai/taggers/blob/master/examples/chunkedextractor/reverb-parts.taggers
https://github.com/allenai/taggers/blob/master/examples/chunkedextractor/reverb-extr.taggers
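One way to picture "compiling down" is plain textual inlining of named patterns until only base syntax is left. The sketch below assumes a `$name` reference syntax and borrows the `(?:...)` non-capturing group from regular expressions to keep each expansion a single unit; both are assumptions for illustration, as are `expandPattern` and the pattern names.

```javascript
// Hypothetical sketch: inline named patterns until only base syntax remains.
// `named` is a Map from pattern name to pattern body.
function expandPattern(pattern, named, inProgress = []) {
  return pattern.replace(/\$(\w+)/g, (ref, name) => {
    if (!named.has(name)) throw new Error(`Unknown pattern: ${name}`);
    if (inProgress.includes(name)) throw new Error(`Cyclic pattern: ${name}`);
    // Wrap the expansion so it behaves as one unit inside the parent pattern.
    const body = expandPattern(named.get(name), named, [...inProgress, name]);
    return `(?:${body})`;
  });
}

const namedPatterns = new Map([
  ["np", "JJ* NN+"],
  ["partOf", "$np is a part of $np"],
]);
const expanded = expandPattern("$partOf", namedPatterns);
// expanded: "(?:(?:JJ* NN+) is a part of (?:JJ* NN+))"
```

Inlining keeps the index untouched, but the expanded query grows with every level of the hierarchy, which is where the open question about query speed comes in.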

[CLOSED] Use word2vec for phrase similarity

Issue by afader
Wed Mar 18 16:42:32 2015
Originally opened as https://github.com/allenai/okcorpus/issues/40


Right now, we are using Brown clusters for a notion of word similarity in OKC. The benefit of this is that the clusters are discrete and prefix-based, so we can easily add them as token annotations in BlackLab. The disadvantages are:

  • It only works on single words (no multiword phrases)
  • It's very, very slow
  • It may not be the best way to compute similarity

I was playing around with word2vec, and the results on the ACL corpus are great. For example, here are the phrases most similar to information_extraction:

                                              Word       Cosine distance
------------------------------------------------------------------------
                                       text_mining      0.826766
                                question_answering      0.796270
                              knowledge_extraction      0.772408
                                 ontology_learning      0.764749
                           automatic_summarization      0.752856
                      automatic_text_summarization      0.748179
                                                ie      0.742152
                                 entity_extraction      0.737166
                               knowledge_discovery      0.737004
                                   fact_extraction      0.735164
                                   image_retrieval      0.731172
                             information_retrieval      0.725849
                 biomedical_information_extraction      0.723953
                                information_fusion      0.721746
       applications_such_as_information_extraction      0.721209
                                 literature_mining      0.716144
                               relation_extraction      0.713560
                                text_summarization      0.710251
                          named_entity_recognition      0.708391
                                    record_linkage      0.705008
                                text_understanding      0.704244
                     such_as_information_retrieval      0.698317
                    such_as_information_extraction      0.698231
                               ontology_population      0.697375
                    information_extraction_systems      0.694957
            many_natural_language_processing_tasks      0.686736
                                    opinion_mining      0.684321
                                       open_domain      0.683641
                                entity_recognition      0.682989
                        cross_-_language_retrieval      0.681546
                        question_answering_systems      0.681245
                           document_classification      0.680587
                                   computer_vision      0.679659
                           named_entity_extraction      0.679576
                    textual_entailment_recognition      0.677568
                        named_-_entity_recognition      0.675705
                       natural_language_processing      0.675098
                                       data_mining      0.674606
           applications_such_as_question_answering      0.674527
                                opinion_extraction      0.673862

Here are the most similar phrases to outperforms:

                                              Word       Cosine distance
------------------------------------------------------------------------
                              performs_better_than      0.914765
                         significantly_outperforms      0.894385
                                      outperformed      0.879147
                                     improves_over      0.859142
                         performs_much_better_than      0.842823
                               performs_worse_than      0.832298
                          consistently_outperforms      0.831501
                                  outperforms_both      0.828320
                               clearly_outperforms      0.821622
                performs_significantly_better_than      0.817346
                                    is_superior_to      0.816786
                                             beats      0.806530
                             performed_better_than      0.805698
                     performs_slightly_better_than      0.804796
                                     outperforming      0.800169
                               performs_as_well_as      0.792212
                                 still_outperforms      0.788039
                                   outperforms_all      0.785506
                                  does_better_than      0.784221
                                  also_outperforms      0.782234
                  achieves_better_performance_than      0.774746
                                         surpasses      0.770587
                        significantly_outperformed      0.770434
                                 works_better_than      0.765581
                                    can_outperform      0.762990
                                 model_outperforms      0.762042
                              slightly_outperforms      0.759185
                               does_not_outperform      0.758555
                             our_model_outperforms      0.757645
                         substantially_outperforms      0.752425
                                     underperforms      0.751935
                                method_outperforms      0.749556
                                always_outperforms      0.740629
                                     performs_best      0.736030
                            performs_comparably_to      0.735505
                              performed_worse_than      0.734069
                              approach_outperforms      0.733126
                       is_consistently_better_than      0.729947
                                 models_outperform      0.723421
                                  even_outperforms      0.722097

So, word2vec:

  • Is many orders of magnitude faster than Brown clustering
  • Works on multiword phrases
  • Produces results that look really good

I think that using this information could have a big impact on usability, since you can stop having to think in terms of single words, but still get a good notion of similarity.

The downside is that multiword phrases cannot be indexed like Brown clusters.

To add word2vec similarity into OKC, we can do the following:

  • Add a new web API similarPhrases(phrase: Seq[String], threshold: Double): Seq[Seq[String]] that returns the phrases that have similarity within the given threshold
  • Have the slider in the UI control the threshold
  • The phrases returned by similarPhrases are combined into a disjunction, which can then be queried against the index
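The steps above could look roughly like this. The signature in the issue is Scala-style; this sketch mirrors it in JavaScript, passes the vectors in explicitly, and reads "within the given threshold" as "cosine similarity at least the threshold", which is an assumption. The toy two-dimensional vectors are made up for illustration and are not word2vec output, and the `(a b) | (c d)` disjunction syntax is illustrative rather than the real query language.

```javascript
// Hypothetical sketch of similarPhrases; toy vectors stand in for a trained model.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / Math.sqrt(normA * normB);
}

// Returns phrases (as token arrays) whose cosine similarity to `phrase`
// is at least `threshold`, most similar first.
function similarPhrases(phrase, threshold, vectors) {
  const key = phrase.join("_");
  const target = vectors.get(key);
  if (!target) return [];
  return [...vectors.entries()]
    .filter(([other]) => other !== key)
    .map(([other, vec]) => ({ other, score: cosine(target, vec) }))
    .filter(({ score }) => score >= threshold)
    .sort((x, y) => y.score - x.score)
    .map(({ other }) => other.split("_"));
}

// Combines the original phrase and its neighbours into one disjunction string.
function toDisjunction(phrase, neighbours) {
  return [phrase, ...neighbours].map((p) => `(${p.join(" ")})`).join(" | ");
}

// Made-up vectors, for illustration only.
const toyVectors = new Map([
  ["information_extraction", [1, 0]],
  ["text_mining", [0.9, 0.1]],
  ["computer_vision", [0.1, 0.9]],
]);
const seedPhrase = ["information", "extraction"];
const neighbours = similarPhrases(seedPhrase, 0.8, toyVectors);
const disjunction = toDisjunction(seedPhrase, neighbours);
// disjunction: "(information extraction) | (text mining)"
```

Lowering the threshold (the slider in the UI) lets more neighbours through and so widens the disjunction.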

@dirkgr and @chrisc36 any thoughts/opinions on this?

[CLOSED] Adds logo

Issue by dirkgr
Thu Mar 5 19:20:10 2015
Originally opened as https://github.com/allenai/okcorpus/pull/11


[Image: screen shot 2015-03-05 at 11 16 52]

It's my first foray into React!

The image is half-size, which looks a little lumpy on normal displays, but super sweet on retina displays.

This is kind of a joke, so I don't really expect to merge this, but I'd still appreciate @codeviking's comments if I messed anything up.

@afader, FYI


dirkgr included the following code: https://github.com/allenai/okcorpus/pull/11/commits

[CLOSED] first step towards tables

Issue by afader
Mon Mar 9 20:20:08 2015
Originally opened as https://github.com/allenai/okcorpus/pull/20


Started moving the code from dictionaries to tables. Instead of doing this in one giant leap, I'm going to break it into small parts.

This first part does two things:

  • Replaces the Dictionary class with a Table class.
  • Updates the front-end code to use Table objects, but keeps the same functionality as before (i.e. only works with 1-column tables).

Will comment more below...


afader included the following code: https://github.com/allenai/okcorpus/pull/20/commits

[CLOSED] Record provenance for each table row

Issue by afader
Wed Mar 18 23:52:17 2015
Originally opened as https://github.com/allenai/okcorpus/issues/43


When you click the "add" button, OKC should record:

  • The query that got you to that row
  • The context that you saw when adding it

In the Tables pane, there should be a little info button that you can click on to view a particular row's provenance.

Also, when exporting the table, you should also get the provenance info with it.
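A possible shape for this, as a sketch only: each row stores its values alongside the query and context it came from, and the export appends both as extra columns. The `addRow` and `exportWithProvenance` helpers, the field names, and the choice of tab-separated output are all assumptions.

```javascript
// Hypothetical sketch: each table row carries the query and context that produced it.
function addRow(table, values, query, context) {
  const row = { values, provenance: { query, context } };
  // Return a new table instead of mutating the old one.
  return { ...table, rows: [...table.rows, row] };
}

// Collapses tabs and line breaks so a field cannot break the TSV layout.
const oneLine = (text) => String(text).replace(/[\t\r\n]+/g, " ");

// Exports the table as TSV with provenance columns appended to every row.
function exportWithProvenance(table) {
  const header = [...table.columns, "query", "context"].join("\t");
  const lines = table.rows.map((r) =>
    [...r.values, r.provenance.query, r.provenance.context].map(oneLine).join("\t"));
  return [header, ...lines].join("\n");
}
```

The info button in the Tables pane would then only need to read `row.provenance` for the row it belongs to.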

Generalize POS tags

Issue by schmmd
Wed Feb 18 16:58:36 2015
Originally opened as https://github.com/allenai/okcorpus/issues/4


It'd be neat if there were a feature that would generalize POS tags. For example, if you wrote a pattern person VBG you could select a dropdown that would offer to generalize VBG to /VB./ (a regular expression matching any verb).
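A small sketch of what the dropdown could compute, assuming Penn Treebank tags. `POS_FAMILIES`, `generalizeTag`, and `generalizeInPattern` are hypothetical names. Note that under ordinary regular-expression semantics `/VB./` needs one character after `VB`, so it would cover VBG, VBD, VBZ and so on but not the bare base-form tag VB; the sketch keeps the form used above and only generalizes tags with exactly one suffix character.

```javascript
// Hypothetical sketch: suggest a broader regex for a Penn Treebank POS tag.
const POS_FAMILIES = ["VB", "NN", "JJ", "RB"];

// Returns e.g. "/VB./" for "VBG", or null when no generalization applies.
function generalizeTag(tag) {
  const family = POS_FAMILIES.find(
    (f) => tag.startsWith(f) && tag.length === f.length + 1);
  return family ? `/${family}./` : null;
}

// Replaces every occurrence of `tag` in a space-separated pattern.
function generalizeInPattern(pattern, tag) {
  const general = generalizeTag(tag);
  if (general === null) return pattern;
  return pattern
    .split(" ")
    .map((token) => (token === tag ? general : token))
    .join(" ");
}

const generalized = generalizeInPattern("person VBG", "VBG");
// generalized: "person /VB./"
```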

[CLOSED] Make deployment easier

Issue by afader
Tue Mar 17 18:42:19 2015
Originally opened as https://github.com/allenai/okcorpus/issues/35


We want to have an instance of okcorpus running on some EC2 machine that answers to http://okcorpus.dev.ai2. This is useful for doing quick demos.

Right now the deployment is a little clunky:

  • Need to set those AWS_* env variables
  • Need to curl 'http://10.12.0.4/ddns/?host=okcorpus' in order to set the domain name
  • Need to mess with iptables to get the thing able to run on port 80
  • Need to pass the deploy.user.ssh_keyfile argument when running sbt deploy

[CLOSED] Adds local persistence

Issue by afader
Mon Mar 16 22:30:19 2015
Originally opened as https://github.com/allenai/okcorpus/pull/32


Adds the functionality from #26

  • All of the interaction with localStorage takes place in TableManager
  • Writing: TableManager serializes and saves all of the tables on every change. This is inefficient, but I think it should be fine for now.
  • Reading: The DictApp component makes a call to TableManager when loading its initial state.
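For readers who have not opened the commits, the write-on-every-change and read-on-startup flow can be sketched as follows. This is not the code from the pull request: `TableStore`, the `okc.tables` key, and the in-memory stand-in for `localStorage` are hypothetical.

```javascript
// Hypothetical sketch of the persistence flow; not the project's actual TableManager.
class TableStore {
  constructor(storage, key = "okc.tables") {
    this.storage = storage;
    this.key = key;
    this.tables = this.load(); // reading: restore state once at startup
  }

  load() {
    const raw = this.storage.getItem(this.key);
    return raw ? JSON.parse(raw) : {};
  }

  // Writing: serialize all tables on every change (simple, not efficient).
  save() {
    this.storage.setItem(this.key, JSON.stringify(this.tables));
  }

  addRow(tableName, row) {
    const exists = Object.prototype.hasOwnProperty.call(this.tables, tableName);
    const rows = exists ? this.tables[tableName] : [];
    this.tables = { ...this.tables, [tableName]: [...rows, row] };
    this.save();
  }
}

// In-memory stand-in for localStorage so the sketch also runs outside a browser.
function inMemoryStore() {
  const data = new Map();
  return {
    getItem: (k) => (data.has(k) ? data.get(k) : null),
    setItem: (k, v) => { data.set(k, String(v)); },
  };
}
```

Serializing everything on each change costs time proportional to the total table size, which is the inefficiency accepted above.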

afader included the following code: https://github.com/allenai/okcorpus/pull/32/commits

[CLOSED] Adds download button

Issue by dirkgr
Fri Mar 13 00:16:19 2015
Originally opened as https://github.com/allenai/okcorpus/pull/21


@afader, this adds a simply download button to the dictionaries.

@codeviking, do you have any idea how to make this thing work in Safari? I tried a bunch of stuff, most promisingly FileSaver.js, but it fails on both Chrome and Safari for two different reasons. I'd love to go over it with you. I'm sure there is a way to make it work, because the FileSaver.js demo works just fine on Safari.


dirkgr included the following code: https://github.com/allenai/okcorpus/pull/21/commits

Support for relational operators

Issue by afader
Thu Mar 19 00:07:11 2015
Originally opened as https://github.com/allenai/okcorpus/issues/45


We should be able to write queries that combine the outputs of multiple patterns. For example, intersect($pattern1, $pattern2) should return the intersection of the captured groups of both patterns.

This will require:

  • Defining some structure for representing capture groups as "tuples"
  • Operational semantics that can execute intersect($x, $y), potentially making multiple calls to BlackLab
  • Query language syntax

Short list of operators:

  • intersection
  • union
  • set difference
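As a starting point for the "tuples" structure, the three operators can be sketched over arrays of captured-group tuples that have already been fetched. Everything here is an assumption for illustration: tuples are plain arrays of strings, equality is by content, and the calls to BlackLab that would produce the inputs are left out.

```javascript
// Hypothetical sketch: set operators over tuples of captured groups.
// Tuples are compared by content via their JSON encoding.
const tupleKey = (tuple) => JSON.stringify(tuple);

// Removes duplicate tuples, keeping first occurrences in order.
function dedupe(tuples) {
  const seen = new Set();
  return tuples.filter((t) => {
    const k = tupleKey(t);
    if (seen.has(k)) return false;
    seen.add(k);
    return true;
  });
}

function intersect(xs, ys) {
  const keys = new Set(ys.map(tupleKey));
  return dedupe(xs.filter((t) => keys.has(tupleKey(t))));
}

function union(xs, ys) {
  return dedupe([...xs, ...ys]);
}

function difference(xs, ys) {
  const keys = new Set(ys.map(tupleKey));
  return dedupe(xs.filter((t) => !keys.has(tupleKey(t))));
}
```

With `x = [["a", "b"], ["c", "d"]]` and `y = [["c", "d"], ["e", "f"]]`, `intersect(x, y)` gives `[["c", "d"]]` and `difference(x, y)` gives `[["a", "b"]]`.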
