fagan2888 / fuzzycategory

This project forked from dedupeio/fuzzycategory

:triangular_ruler: Fuzzy Categorical Distances

License: MIT License

Python 100.00%

fuzzycategory

Fuzzy Categorical Distances

For cases in which the number of classes is large, but still much smaller than the number of records, we can compute something like a "semantic" distance between categories. A good example is the occupation field in campaign finance data.

{'name' : 'Jim Bob', 'employer' : 'JP Morgan Chase', 'occupation' : 'lawyer'}
{'name' : 'James Bob', 'employer' : 'JP Morgan Chase', 'occupation' : 'lawyer'}
{'name' : 'Jim Bob', 'employer' : 'JP Morgan Chase', 'occupation' : 'attorney'}

We can create, for each category, a vector of all the terms that appear outside the focal field:

lawyer : {'Jim' : 1, 'James' : 1, 'Bob' : 2, 'JP' : 2, 'Morgan' : 2, 'Chase' : 2}
attorney : {'Jim' : 1, 'Bob' : 1, 'JP' : 1, 'Morgan' : 1, 'Chase' : 1}

The "distance" between attorney and lawyer is then the TF-IDF-weighted cosine distance between those vectors.
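As a rough illustration of the term-vector approach, here is a minimal sketch using only the standard library. The function names (`term_vectors`, `tfidf`, `cosine_distance`) are hypothetical and not part of this project. The smoothed IDF formula is an assumption: with only two categories, an unsmoothed IDF would zero out every shared term.

```python
import math
from collections import Counter, defaultdict

records = [
    {'name': 'Jim Bob', 'employer': 'JP Morgan Chase', 'occupation': 'lawyer'},
    {'name': 'James Bob', 'employer': 'JP Morgan Chase', 'occupation': 'lawyer'},
    {'name': 'Jim Bob', 'employer': 'JP Morgan Chase', 'occupation': 'attorney'},
]


def term_vectors(records, focal):
    """For each focal-field value, count the whitespace-separated
    terms found in all the other fields."""
    vectors = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            if field != focal:
                vectors[record[focal]].update(value.split())
    return {category: dict(vec) for category, vec in vectors.items()}


def tfidf(vectors):
    """Reweight raw counts by a smoothed inverse document frequency.
    Each category is treated as one 'document' (an assumption)."""
    n = len(vectors)
    df = Counter(term for vec in vectors.values() for term in vec)
    return {
        category: {
            term: count * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in vec.items()
        }
        for category, vec in vectors.items()
    }


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


counts = term_vectors(records, 'occupation')
weighted = tfidf(counts)
distance = cosine_distance(weighted['lawyer'], weighted['attorney'])
print(distance)
```

On the three example records the distance comes out small, since the two occupations share almost all of their surrounding terms.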

Alternatively, create a vector of exact field matches:

lawyer : {'Jim Bob' : 1, 'James Bob' : 1, 'JP Morgan Chase' : 2}
attorney : {'Jim Bob' : 1, 'JP Morgan Chase' : 1}
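A minimal sketch of how exact-field-match vectors could be built from the example records. The helper name `exact_match_vectors` is hypothetical. Keying on the raw field value alone is an assumption; keying on `(field, value)` pairs would avoid collisions when the same string shows up in different fields.

```python
from collections import Counter, defaultdict

records = [
    {'name': 'Jim Bob', 'employer': 'JP Morgan Chase', 'occupation': 'lawyer'},
    {'name': 'James Bob', 'employer': 'JP Morgan Chase', 'occupation': 'lawyer'},
    {'name': 'Jim Bob', 'employer': 'JP Morgan Chase', 'occupation': 'attorney'},
]


def exact_match_vectors(records, focal):
    """For each focal-field value, count the whole (unsplit) values
    of every other field."""
    vectors = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            if field != focal:
                vectors[record[focal]][value] += 1
    return {category: dict(vec) for category, vec in vectors.items()}


exact = exact_match_vectors(records, 'occupation')
print(exact['lawyer'])
print(exact['attorney'])
```

These vectors can be compared with the same weighted cosine distance as the term vectors.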

Or even a vector of exact matches for everything except the focal field:

lawyer : {'Jim Bob, JP Morgan Chase' : 1, 'James Bob, JP Morgan Chase' : 1}
attorney : {'Jim Bob, JP Morgan Chase' : 1}
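The whole-record variant could be sketched as follows. The helper name `joint_match_vectors` is hypothetical, and joining the remaining fields with `', '` in record order is an assumption chosen to mirror the keys shown above.

```python
from collections import Counter, defaultdict

records = [
    {'name': 'Jim Bob', 'employer': 'JP Morgan Chase', 'occupation': 'lawyer'},
    {'name': 'James Bob', 'employer': 'JP Morgan Chase', 'occupation': 'lawyer'},
    {'name': 'Jim Bob', 'employer': 'JP Morgan Chase', 'occupation': 'attorney'},
]


def joint_match_vectors(records, focal):
    """For each focal-field value, count the combination of all
    other field values, treated as a single key."""
    vectors = defaultdict(Counter)
    for record in records:
        key = ', '.join(
            value for field, value in record.items() if field != focal
        )
        vectors[record[focal]][key] += 1
    return {category: dict(vec) for category, vec in vectors.items()}


joint = joint_match_vectors(records, 'occupation')
print(joint['lawyer'])
print(joint['attorney'])
```

Two categories overlap here only when some record is identical in every field except the focal one, so this is the strictest of the three representations.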

This last version is very similar to what http://www.naviddianati.com/fec is doing with their Maximum Likelihood Filter: http://arxiv.org/abs/1503.04085

If we wanted to get even more fancy we could use word2vec instead of the tfidf business: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors

Contributors

fgregg
