Issue by afader
Wed Mar 18 16:42:32 2015
Originally opened as https://github.com/allenai/okcorpus/issues/40
Right now, we are using Brown clusters for a notion of word similarity in OKC. The benefit of this is that the clusters are discrete and prefix-based, so we can easily add them as token annotations in BlackLab. The disadvantages are:
- It only works on single words (no multiword phrases)
- It's very, very slow
- It may not be the best way to compute similarity
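To make the "prefix-based" point concrete, here is a minimal Scala sketch of how similarity falls out of Brown cluster paths. This is not OKC code: the `clusterOf` table, its bit strings, and the `similarWords` helper are all made up for illustration. The only thing it relies on is that Brown clustering assigns each word a bit-string path in a binary hierarchy, so a shared prefix means a shared (coarser) cluster.

```scala
object BrownPrefixSketch {
  // Hypothetical word -> Brown cluster path table; the bit strings are invented.
  val clusterOf: Map[String, String] = Map(
    "outperforms" -> "110100",
    "beats"       -> "110101",
    "surpasses"   -> "110111",
    "corpus"      -> "011010"
  )

  // Words whose cluster path shares the first `prefixLength` bits with the
  // query word's path. A shorter prefix is a coarser cluster, so it matches more words.
  def similarWords(word: String, prefixLength: Int): Set[String] =
    clusterOf.get(word) match {
      case Some(path) =>
        val prefix = path.take(prefixLength)
        clusterOf.collect { case (w, p) if w != word && p.startsWith(prefix) => w }.toSet
      case None => Set.empty
    }
}
```

Because each word maps to exactly one path, a prefix of that path can be stored per token, which is what makes this easy to index. It also shows the first limitation: the table is keyed by single words only.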
I was playing around with word2vec, and the results on the ACL corpus are great. For example, here are the phrases most similar to information_extraction:
Word Cosine distance
------------------------------------------------------------------------
text_mining 0.826766
question_answering 0.796270
knowledge_extraction 0.772408
ontology_learning 0.764749
automatic_summarization 0.752856
automatic_text_summarization 0.748179
ie 0.742152
entity_extraction 0.737166
knowledge_discovery 0.737004
fact_extraction 0.735164
image_retrieval 0.731172
information_retrieval 0.725849
biomedical_information_extraction 0.723953
information_fusion 0.721746
applications_such_as_information_extraction 0.721209
literature_mining 0.716144
relation_extraction 0.713560
text_summarization 0.710251
named_entity_recognition 0.708391
record_linkage 0.705008
text_understanding 0.704244
such_as_information_retrieval 0.698317
such_as_information_extraction 0.698231
ontology_population 0.697375
information_extraction_systems 0.694957
many_natural_language_processing_tasks 0.686736
opinion_mining 0.684321
open_domain 0.683641
entity_recognition 0.682989
cross_-_language_retrieval 0.681546
question_answering_systems 0.681245
document_classification 0.680587
computer_vision 0.679659
named_entity_extraction 0.679576
textual_entailment_recognition 0.677568
named_-_entity_recognition 0.675705
natural_language_processing 0.675098
data_mining 0.674606
applications_such_as_question_answering 0.674527
opinion_extraction 0.673862
Here are the phrases most similar to outperforms:
Word Cosine distance
------------------------------------------------------------------------
performs_better_than 0.914765
significantly_outperforms 0.894385
outperformed 0.879147
improves_over 0.859142
performs_much_better_than 0.842823
performs_worse_than 0.832298
consistently_outperforms 0.831501
outperforms_both 0.828320
clearly_outperforms 0.821622
performs_significantly_better_than 0.817346
is_superior_to 0.816786
beats 0.806530
performed_better_than 0.805698
performs_slightly_better_than 0.804796
outperforming 0.800169
performs_as_well_as 0.792212
still_outperforms 0.788039
outperforms_all 0.785506
does_better_than 0.784221
also_outperforms 0.782234
achieves_better_performance_than 0.774746
surpasses 0.770587
significantly_outperformed 0.770434
works_better_than 0.765581
can_outperform 0.762990
model_outperforms 0.762042
slightly_outperforms 0.759185
does_not_outperform 0.758555
our_model_outperforms 0.757645
substantially_outperforms 0.752425
underperforms 0.751935
method_outperforms 0.749556
always_outperforms 0.740629
performs_best 0.736030
performs_comparably_to 0.735505
performed_worse_than 0.734069
approach_outperforms 0.733126
is_consistently_better_than 0.729947
models_outperform 0.723421
even_outperforms 0.722097
So, word2vec:
- Is many orders of magnitude faster than Brown clustering
- Works on multiword phrases
- Produces results that look really good
I think that using this information could have a big impact on usability, since you no longer have to think in terms of single words but still get a good notion of similarity.
The downside is that multiword phrases cannot be indexed as token annotations the way Brown clusters can.
To add word2vec similarity into OKC, we can do the following:
- Add a new web API, similarPhrases(phrase: Seq[String], threshold: Double): Seq[Seq[String]], that returns the phrases whose similarity to the given phrase is within the given threshold
- Have the slider in the UI control the threshold
- Combine the phrases returned by similarPhrases into a disjunction, which can then be queried against the index
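The steps above could look roughly like the Scala sketch below. It is only a sketch under several assumptions: the `vectors` table is a tiny hand-made stand-in for vectors loaded from word2vec output, "within the given threshold" is read as cosine similarity at or above the threshold, and `toDisjunction` emits a placeholder syntax rather than the index's real query language. Only the `similarPhrases` signature comes from the proposal itself.

```scala
object SimilarPhrasesSketch {
  // Hypothetical in-memory phrase vectors; real ones would be loaded from word2vec output.
  // Keys use the underscore-joined phrase form seen in the listings above.
  val vectors: Map[String, Array[Double]] = Map(
    "outperforms"          -> Array(1.0, 0.0, 0.0),
    "performs_better_than" -> Array(0.9, 0.1, 0.0),
    "beats"                -> Array(0.8, 0.0, 0.6),
    "computer_vision"      -> Array(0.0, 1.0, 0.0)
  )

  // Cosine similarity of two vectors of equal length; 0.0 if either has zero norm.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Phrases whose cosine similarity to `phrase` is at least `threshold`
  // (an assumed reading of the proposal), most similar first.
  def similarPhrases(phrase: Seq[String], threshold: Double): Seq[Seq[String]] = {
    val key = phrase.mkString("_")
    vectors.get(key) match {
      case None => Seq.empty
      case Some(query) =>
        vectors.toSeq
          .collect { case (other, vec) if other != key => (other, cosine(query, vec)) }
          .filter { case (_, sim) => sim >= threshold }
          .sortBy { case (_, sim) => -sim }
          .map { case (other, _) => other.split("_").toList }
    }
  }

  // Joins the phrases into one disjunction string.
  // The output syntax is a placeholder, not the actual query language of the index.
  def toDisjunction(phrases: Seq[Seq[String]]): String =
    phrases.map(_.mkString("\"", " ", "\"")).mkString("(", " | ", ")")
}
```

With this toy table, `similarPhrases(Seq("outperforms"), 0.75)` returns the two closest phrases, and raising the threshold to 0.9 drops `beats`, which is the behavior the UI slider would control. A linear scan over all vectors is the simplest thing that works; whether it is fast enough for the full phrase vocabulary is an open question.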
@dirkgr and @chrisc36 any thoughts/opinions on this?