veryluckyxyz / keywordfinder

This project forked from lvsh/keywordfinder

Fork from @lvsh -- Automatic keyword extraction - no alchemy required!

keywordfinder's Introduction

Automatic keyword extraction

As an Insight Data Science Fellow, I completed a 3-week project that involved building a keyword extraction algorithm. Given a block of text as input, the algorithm selects keywords that describe what the text is about. Keywords are useful, compact descriptions of the original text, and they are widely used in information retrieval applications.

Background

For this project, I partnered with URX, a San Francisco-based startup in the mobile advertising space. URX matches advertisers with content providers in a context-specific way. For example, if the content is a news article about hip-hop music, URX will serve ads for hip-hop albums on Spotify, a music streaming service. URX accomplishes this matching by extracting keywords from a content page, using those keywords to search a database of advertisers, and then serving the best-matching ad.

In my project, I focused on the keyword extraction step, and I built a prototype keyword extractor for URX. The deliverables were: (i) an algorithm for keyword extraction; and (ii) Python scripts to implement the algorithm. To learn about the algorithm I developed, check out the project page.

Running the code

To get started, run the example:

python example.py inspec.txt

To evaluate the algorithm on the Crowd500 dataset from Marujo et al., 2012, run:

python evaluatemodel.py 

Note that the algorithm is trained on the train set of Crowd500, but it is evaluated only on the test set of Crowd500.

For comparison, I provide two baselines: random and AlchemyAPI. The random baseline selects words at random from the given text, whereas the AlchemyAPI baseline consists of keywords returned by the Alchemy analytics engine. The top-15 keyword evaluation methodology is similar to that of Jean-Louis et al., 2014.

python evaluaterandom.py 
python evaluatealchemy.py 
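
To make the random baseline concrete, here is a minimal sketch of how such a baseline could be written. The function name `random_keywords`, the regex tokenization, and the default of 15 keywords are assumptions made for illustration; the actual logic in evaluaterandom.py may differ.

```python
import random
import re


def random_keywords(text, k=15, seed=None):
    """Return up to k distinct words chosen uniformly at random from text.

    Hypothetical helper for illustration only; not taken from the repository.
    """
    rng = random.Random(seed)
    # Lowercase the text and keep alphabetic tokens only (an assumed tokenization).
    # Sorting the vocabulary makes the draw reproducible for a fixed seed.
    vocab = sorted(set(re.findall(r"[a-z]+", text.lower())))
    # Sample without replacement; short texts simply yield fewer keywords.
    return rng.sample(vocab, min(k, len(vocab)))


if __name__ == "__main__":
    sample = "URX serves ads for hip-hop albums next to articles about hip-hop music"
    print(random_keywords(sample, k=5, seed=0))
```

Because the words are drawn without regard to the content, any reasonable extractor should score well above this baseline.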

My algorithm (F1 score = 23.95) outperforms AlchemyAPI (F1 score = 21.19) and easily beats the random baseline (F1 score = 8.41). It is worth noting that my algorithm was trained on the Crowd500 train set, whereas the Alchemy keyword extractor (presumably) was not. Additionally, Alchemy excels at returning keyphrases rather than keywords, which this benchmark does not assess.
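
For readers unfamiliar with the metric, the sketch below shows one common way to compute an F1 score for the top-k keywords of a single document. It is an illustration under stated assumptions, not the repository's evaluation code: the function name `f1_at_k`, exact case-insensitive matching, and the absence of stemming are all assumptions, and the scripts above may normalize words or average over documents differently.

```python
def f1_at_k(predicted, gold, k=15):
    """F1 between the top-k predicted keywords and a gold keyword set.

    Hypothetical helper for illustration; matching is exact and case-insensitive.
    """
    # Keep only the k highest-ranked predictions, deduplicated.
    top = {w.lower() for w in predicted[:k]}
    gold_set = {w.lower() for w in gold}
    hits = len(top & gold_set)
    if hits == 0:
        return 0.0
    precision = hits / len(top)     # share of predictions that are correct
    recall = hits / len(gold_set)   # share of gold keywords that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = ["hip", "hop", "music", "album"]
    gold = ["music", "album", "spotify"]
    print(round(f1_at_k(predicted, gold), 4))
```

In this toy case two of the four predictions are correct (precision 0.5) and two of the three gold keywords are found (recall 2/3), giving an F1 of 4/7. A corpus-level figure would typically aggregate such per-document values over the test set.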
