This project forked from thoppe/transorthogonal-linguistics

Uses an orthogonal vector space to find words close to the hyperchord of two input words.

Home Page: https://transorthogonal-linguistics.herokuapp.com/

transorthogonal-linguistics

Travis Hoppe

If Heroku is running, check out the live demo (it may take 30 seconds to warm up):

https://transorthogonal-linguistics.herokuapp.com/

Introduction

Words rarely exist in a vacuum. To understand the meaning of the word cat, it's useful to know that it is an animal (hypernym), that it is the same as a feline (synonym), that a tabby is a type of cat (hyponym), and that in some reasonable sense it is the opposite of a dog (antonym). Since words are connected in a rich network of linguistic information, why not (literally) follow that path and see where it takes us?

Instead of looking at a single word in isolation, this project tries to elucidate which words should lie between a start word and an end word.

Grouping words together is a classic problem in computational linguistics. Typical approaches use LSA, LSI, LDA or Pachinko allocation. Personally, I prefer Word2Vec, which was developed by some lovely engineers from Google. Partly because there exists an excellent port to Python via gensim, but mostly because it's awesome.

Word2Vec maps each word to a vector; once normalized, each word is a point on a unit hypersphere. Words that are "close" on this sphere often share some kind of semantic relation. If we pick two words, say "boy" and "man", we can trace the shortest path that connects them. We parameterize this curve with a "time" t, where t=0 is at boy and t=1 is at man. Words that are close to this timeline are selected and ordered by their t value (i.e. the t at which they are closest to the connecting curve). In theory, this timeline should be a semantic map from one word to another, smoothly varying across meaning.
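To make the "timeline" concrete, here is a minimal sketch of the shortest path between two points on a unit hypersphere, using spherical linear interpolation. The function names and the toy 3-dimensional vectors are illustrative assumptions, not code from this repo.

```python
import math

def normalize(v):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def slerp(a, b, t):
    """Point at 'time' t on the great-circle arc from unit vector a to unit vector b.

    Assumes a and b are not antipodal (the shortest path is then not unique).
    """
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)            # angle between the two words
    if omega < 1e-12:                 # identical points: the path is a single point
        return list(a)
    s = math.sin(omega)
    weight_a = math.sin((1.0 - t) * omega) / s
    weight_b = math.sin(t * omega) / s
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

# Toy 3-dimensional stand-ins for real word vectors (hypothetical values).
boy = normalize([1.0, 0.0, 0.0])
man = normalize([0.0, 1.0, 0.0])
midpoint = slerp(boy, man, 0.5)       # halfway along the timeline
```

Every point returned by `slerp` stays on the sphere, which is what makes this the "true" curve, and also what makes nearest-point queries against it awkward.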

In practice, however, computing the true curve across the hypersphere is rather tricky, and numerically finding the nearest points efficiently is even harder. If we cheat a little, we can draw a straight line (a chord) connecting the two points as an approximation to the curve. The problem then reduces to a fast linear algebra solution. Since we are moving across (trans) the orthogonal space spanned by Word2Vec's construction, we call this method transorthogonal linguistics.
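The straight-line shortcut can be sketched as a point-to-segment projection: for a word vector w and endpoints a and b, the closest point on the chord is at t = (w - a)·(b - a) / |b - a|². The code below is a minimal illustration under stated assumptions, not necessarily how word_path.py does it: the clamping of t to [0, 1], the `cutoff` parameter, and the toy 2-dimensional vocabulary are all hypothetical.

```python
import math

def closest_point_on_chord(a, b, w):
    """Return (t, distance) of vector w relative to the straight segment a -> b."""
    d = [y - x for x, y in zip(a, b)]                 # chord direction, b - a
    rel = [z - x for x, z in zip(a, w)]               # w - a
    t = sum(p * q for p, q in zip(rel, d)) / sum(q * q for q in d)
    t = max(0.0, min(1.0, t))                         # assumption: stay between the endpoints
    nearest = [x + t * q for x, q in zip(a, d)]
    dist = math.sqrt(sum((z - n) ** 2 for z, n in zip(w, nearest)))
    return t, dist

def word_path(vectors, start, end, cutoff):
    """Words within `cutoff` of the chord from start to end, ordered by t."""
    a, b = vectors[start], vectors[end]
    hits = []
    for word, w in vectors.items():
        t, dist = closest_point_on_chord(a, b, w)
        if dist <= cutoff:
            hits.append((t, word))
    return [word for t, word in sorted(hits)]

# Toy unit vectors standing in for real embeddings (hypothetical values).
toy_vectors = {
    "boy": [1.0, 0.0],
    "teenager": [0.8, 0.6],
    "guy": [0.6, 0.8],
    "man": [0.0, 1.0],
    "banana": [-1.0, 0.0],
}
path = word_path(toy_vectors, "boy", "man", cutoff=0.5)
```

With real embeddings the loop would be replaced by a few matrix operations over the whole vocabulary, which is where the speed of the linear approach comes from.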

Data construction

The database contained within this repo was constructed from a full English dump of Wikipedia that was sentence- and word-tokenized by NLTK. Word2Vec training was done with a single pass, 300 dimensions, and a minimum vocabulary count of 800. These choices were found to give the best results while keeping the database small enough to query online reasonably quickly.
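As a rough guide, the training settings above might map onto gensim as follows. This is a sketch assuming gensim 4.x keyword names (older releases called them `size` and `iter`), and assuming "single pass" means one training epoch; the names `TRAINING_PARAMS` and `train_word2vec` are hypothetical and not taken from this repo.

```python
# Hyperparameters quoted in the text, expressed with gensim 4.x keyword names.
TRAINING_PARAMS = {
    "vector_size": 300,  # 300 dimensions
    "min_count": 800,    # ignore words that appear fewer than 800 times
    "epochs": 1,         # assumption: "single pass" means one epoch
}

def train_word2vec(sentences):
    """Train on an iterable of token lists (e.g. NLTK-tokenized sentences).

    Requires gensim; it is imported lazily so this module loads without it.
    """
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, **TRAINING_PARAMS)
```

A high `min_count` shrinks the vocabulary considerably, which fits the stated goal of keeping the database small enough for online queries.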

Command-line interface

python transorthogonal_linguistics/word_path.py boy man

Examples

With the input of boy and man we get:

boy to man

boy
- 
sixteen-year-old, orphan
teenager, girl, schoolgirl
youngster, shepherd, lad, kid
kitten, lonely, maid
beggar, policeman
prostitute, thug, villager, handsome, loner, thief, cop
gentleman, stranger, lady, Englishman, guy
-
woman
person
man

sun to moon

sun
sunlight, mist
glow, shine, clouds
skies, shines, shining, glare, moonlight, sky, darkness
shadows, heavens
horizon, crescent
earth, eclipses
constellations, comet, planets, orbits, orbiting, Earth, Io
Jupiter, planet, Venus, Pluto, Uranus, orbit
-
moons, lunar
moon

Other interesting examples:

girl woman
lover sinner
fate destiny
god demon
good bad
mind body
heaven hell
American Soviet
idea action
socialism capitalism
Marxism Stalinism
man machine
sustenance starvation
war peace
predictable idiosyncratic
acceptance uproar
