GithubHelp home page GithubHelp logo

charlie80 / w2v Goto Github PK

View Code? Open in Web Editor NEW

This project forked from castanan/w2v

0.0 1.0 0.0 1.43 MB

Word2Vec models with Twitter data using Spark. Blog:

Home Page: https://medium.com/ibm-data-science-experience/spark-based-machine-learning-tools-for-capturing-word-meanings-b89d2cb90a5f

License: MIT License

Python 5.47% Jupyter Notebook 94.53%

w2v's Introduction

Spark-based machine learning for capturing word meanings

In this repo, you will find out how to build Word2Vec models with Twitter data. For an end to end tutorial on how to build models on IBM's Watson Studio, please chech this repo.

Pre-reqs: install Python, numpy and Apache Spark

I.) Installing Anaconda installs Python, numpy, among other Python packages. If interested go here https://www.continuum.io/downloads

II.) Download and Install Apache Spark go here: http://spark.apache.org/downloads.html

This steps were useful for me to install Spark 1.5.1 on a Mac https://github.com/castanan/w2v/blob/master/Install%20Spark%20On%20Mac.txt

III.) Added a notebook here https://github.com/castanan/w2v/blob/master/mllib-scripts/Word2Vec with Twitter Data usign Spark RDDs.ipynb and the good news are that Spark comes with Jupyter + Pyspark integrated. This notebook can be invoked from the shell by typing the command: IPYTHON_OPTS="notebook" ./bin/pyspark if you are sitting on YOUR-SPARK-HOME.

Make sure that your pyspark is working

I.) Go to your spark home directory

cd YOUR-SPARK-HOME/bin

II.) Open a pyspark shell by typing the command

./pyspark

or Pyspark with Jupyter by typing the command

IPYTHON_OPTS="notebook" ./bin/pyspark

III.) print your spark context by typing sc in the pyspark shell, you should get something like this:

![image of pyspark shell] (https://github.com/castanan/w2v/blob/master/images/pyspark-shell.png)

Get the Repo

git clone https://github.com/castanan/w2v.git

cd /YOUR-PATH-TO-REPO/w2v

Get the Data

Download (without uncompressing) some tweets from here. The tweets.gz file contains a 10% sample (using Twitter decahose API) of a 15 minute batch of the public tweets from December 23rd. The size of this compressed file is 116MB (compression ratio is about 10 to 1).

Note: there is no need to uncompress the file, just download the tweets.gz file and save it on the repo /YOUR-PATH-TO-REPO/w2v/data/.

There are 2 options to perform the Twitter analysis:

  1. (suggested) use dataframes and Spark ML (March 2016). see ml-scripts/README.md

  2. use rdd's and Spark MLlib (October 2015). see mllib-scripts/README.md

w2v's People

Contributors

castanan avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.