
tweet2vec's Introduction

Tweet2Vec

Please Cite: Soroush Vosoughi, Prashanth Vijayaraghavan and Deb Roy. (2016). Tweet2Vec: Learning Tweet Embeddings using Character-level CNN-LSTM Encoder-Decoder. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016). Pisa, Italy.

##Requirements

  1. numpy
  2. Theano
  3. joblib

##Training

Create subdirectories to store logs and models: logs/ and all_data/. You can also specify your own directories from the command line.
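A minimal Python sketch of this setup step, assuming the directory names logs/ and all_data/ given above. The helper name `make_dirs` and its `root` parameter are illustrative and not part of the repository.

```python
import os

def make_dirs(root=".", names=("logs", "all_data")):
    """Create each named subdirectory under root, skipping any that exist."""
    paths = []
    for name in names:
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            os.makedirs(path)
        paths.append(path)
    return paths
```

Calling `make_dirs()` from the repository root creates both directories there; calling it again is harmless.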

Use the following command to see all the training options and hyperparameters that can be specified from the command line:

THEANO_FLAGS=mode=FAST_RUN,device=gpu1,floatX=float32 PYTHONPATH=. python model/train.py -h

Using default parameters, training can be performed with:

THEANO_FLAGS=mode=FAST_RUN,device=gpu1,floatX=float32 PYTHONPATH=. python model/train.py -t -i ./train_file.json -m "new_model" -E 3

The training data format is shown in sample.json. Note that tweet1 and tweet2 should form a pair in which one is a modified version of the other: for example, a reply with a similar meaning, or the same tweet with words replaced by synonyms.

NOTE: Training takes a long time! It takes more than a week to get good vectors. The cost is the sum of the log probabilities over every batch, timestep and decoder. Expect the cost to fluctuate a lot. You can experiment with other loss functions, such as those in loss.py.
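To make the cost definition concrete, here is a small pure-Python sketch that sums log probabilities over decoders, batch items and timesteps. The nested-list layout, the function name `total_cost`, and the negation (so that lower is better) are assumptions for illustration; the repository's Theano implementation may differ.

```python
import math

def total_cost(probs, targets):
    """Summed negative log-probability of the target characters.

    probs[d][b][t] is a list of per-character probabilities from decoder d
    for batch item b at timestep t (hypothetical layout).
    targets[d][b][t] is the index of the correct character.
    """
    cost = 0.0
    for d, decoder_probs in enumerate(probs):
        for b, seq_probs in enumerate(decoder_probs):
            for t, dist in enumerate(seq_probs):
                cost -= math.log(dist[targets[d][b][t]])
    return cost

# One decoder, one batch item, two timesteps, a vocabulary of two characters.
probs = [[[[0.5, 0.5], [0.25, 0.75]]]]
targets = [[[0, 1]]]
cost = total_cost(probs, targets)
```

Because the value is a sum rather than a mean, it grows with the number of batch items and timesteps it covers, which is worth keeping in mind when comparing costs across batches.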

##Embeddings

Use the following command to see all the options and parameters for getting embeddings that can be specified from the command line:

THEANO_FLAGS=mode=FAST_RUN,device=gpu1,floatX=float32 PYTHONPATH=. python model/embeddings.py -h

Get embeddings using the following command:

THEANO_FLAGS=mode=FAST_RUN,device=gpu1,floatX=float32 PYTHONPATH=. python model/embeddings.py -R -i /res/tweets.json -m "new_model" -e 3 -o "/res/emb.npy"
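Once the embeddings file is written, it can be read back with NumPy. The sketch below assumes the file holds a 2-D array with one row per tweet (an assumption; check the shape of your own output). It uses a small stand-in array saved to a temporary file so that it runs on its own. `cosine_similarity` is an illustrative helper, not a function from the repository.

```python
import os
import tempfile

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in for the file produced by embeddings.py (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), "emb.npy")
np.save(path, np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]], dtype=np.float32))

emb = np.load(path)  # assumed shape: (number of tweets, embedding size)
sim = cosine_similarity(emb[0], emb[1])
```

To use it on real output, replace `path` with the location of your own .npy file and drop the `np.save` line.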

tweet2vec's People

Contributors

soroushv


tweet2vec's Issues

Cannot retrieve encodings

I specified the output file as test1, but I get the error: No such file or directory 10//encoding_10_test1.npy

Problem running example

Running example sample.json.

[paulo@wormhole Tweet2Vec-master]$ THEANO_FLAGS=mode=FAST_RUN,floatX=float32 PYTHONPATH=. python model/train.py -t -i sample.json -m "new_model" -E 3
Using TensorFlow backend.
2017-06-12 18:59:47,981 - utils - INFO - Loading Data from: sample.json
2017-06-12 18:59:47,982 - utils - INFO - Loaded Data from: sample.json
Creating Training Function...
2017-06-12 19:00:29,165 - main - INFO - Starting epoch 1
2017-06-12 19:00:29,166 - main - INFO - Training with the Adam update function
2017-06-12 19:00:29,166 - main - INFO - Training epoch 1.
2017-06-12 19:00:29,166 - main - INFO - Epoch 1 results
Traceback (most recent call last):
File "model/train.py", line 145, in <module>
trainer.train_enc_dec(train_file=args.train_file, model_name=args.model_name, num_epochs=args.epochs,out_dir=args.out_dir)
File "model/train.py", line 98, in train_enc_dec
self.logger.info("Saving model parameters for epoch {0} at path {1}".format(e, ))
IndexError: tuple index out of range
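The traceback points at a `str.format` call whose template has two placeholders (`{0}` and `{1}`) but only one argument, which is what raises the `IndexError`. A minimal reproduction and one possible fix follow. `model_path` is a hypothetical variable standing in for whatever path the trainer saves to, which the log does not show.

```python
template = "Saving model parameters for epoch {0} at path {1}"
e = 1

# Reproduce the failure: {1} has no matching positional argument.
try:
    template.format(e, )
    raised = False
except IndexError:
    raised = True

# One possible fix: pass a second argument for {1}.
model_path = "path/to/model"  # hypothetical; substitute the real save path
message = template.format(e, model_path)
```

In train.py, the equivalent change would be to pass the saved model's path as the second argument to `.format` on line 98.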

Switch to other loss function

I'm not very familiar with Theano. Can anyone help me switch to a different loss function? A modified version of the code would be much appreciated.

get_embeddings and output vector size?

Hi, thanks for sharing the code with the community. I want to know whether the output vector size is 256.

I also ran into a problem when I tried to run the code.

in embeddings.py

Line 61

for i,xx,yy in data_manager.batch_generator_test(tweets,tweets,batch_size=256):

The function 'batch_generator_test' returns four values instead of three.
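If the generator does yield four values per batch, unpacking into three names fails with a `ValueError`. The sketch below uses a stand-in generator, `fake_batch_generator`, whose four fields are invented for illustration (the real meaning of the values returned by `batch_generator_test` is not stated here). It shows unpacking all four values and ignoring the unused one.

```python
def fake_batch_generator(items, batch_size=2):
    """Hypothetical stand-in that yields 4-tuples, like the reported behavior."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        yield start, batch, batch, len(batch)

tweets = ["a", "b", "c"]

# Unpacking into three names raises ValueError (too many values to unpack).
try:
    for i, xx, yy in fake_batch_generator(tweets):
        pass
    failed = False
except ValueError:
    failed = True

# Unpack all four values; "_" marks the one that is not used.
batches = [(i, xx, yy) for i, xx, yy, _ in fake_batch_generator(tweets)]
```

Which of the four values can safely be ignored depends on the actual definition of `batch_generator_test`, so check it before applying the same pattern in embeddings.py.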
