yumoxu / stocknet-dataset

A comprehensive dataset for stock movement prediction from tweets and historical stock prices.

License: MIT License

stocknet-dataset's Introduction

stocknet-dataset

This repository releases a comprehensive dataset for stock movement prediction from tweets and historical stock prices. Please cite the following paper [bib] if you use this dataset:

Yumo Xu and Shay B. Cohen. 2018. Stock Movement Prediction from Tweets and Historical Prices. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne, Australia, volume 1.

Stock movement prediction is a challenging problem: the market is highly stochastic, and we make temporally-dependent predictions from chaotic data. We treat these three complexities and present a novel deep generative model jointly exploiting text and price signals for this task. Unlike the case with discriminative or topic modeling, our model introduces recurrent, continuous latent variables for a better treatment of stochasticity, and uses neural variational inference to address the intractable posterior inference. We also provide a hybrid objective with temporal auxiliary to flexibly capture predictive dependencies. We demonstrate the state-of-the-art performance of our proposed model on a new stock movement prediction dataset which we collected.

You might also be interested in our code for stock movement prediction.

If you have any questions, please contact me at [email protected].

Dataset Overview

The dataset targets two years of price movements, from 01/01/2014 to 01/01/2016, for 88 stocks: all 8 stocks in the Conglomerates sector and the top 10 stocks by capital size in each of the other 8 sectors (8 + 10 × 8 = 88). The full list of the 88 stocks and their companies, selected from 9 sectors, is available in StockTable, a facsimile of the paper appendix appendix_table_of_target_stocks.pdf.

Data Component

This dataset comprises two main components:

./tweet: tweet data from Twitter
./price: price data from Yahoo Finance

Each component contains its raw data and preprocessed data, organized by stock:

./tweet/raw
./tweet/preprocessed

and

./price/raw
./price/preprocessed

Data Format

Raw Tweet Data

Format: JSON
Keys: see Introduction to Tweet JSON

Preprocessed Tweet Data

Format: JSON
Keys: 'text', 'user_id_str', 'created_at'
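
As a minimal sketch of reading the preprocessed tweets, the snippet below assumes each file stores one JSON object per line; this layout is an assumption, not something the description above confirms. The function name `iter_tweets` and the sample records are hypothetical and exist only for illustration.

```python
import json
from typing import Dict, Iterable, Iterator

# Keys listed above for preprocessed tweets.
EXPECTED_KEYS = {"text", "user_id_str", "created_at"}


def iter_tweets(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one tweet dict per non-empty line (assumed one JSON object per line)."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = EXPECTED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        yield record


# Made-up sample lines, for illustration only (not taken from the dataset).
sample_lines = [
    '{"text": "example tweet about $AAPL", "user_id_str": "1", '
    '"created_at": "Wed Jan 01 10:00:00 +0000 2014"}',
    "",
    '{"text": "another example", "user_id_str": "2", '
    '"created_at": "Wed Jan 01 11:00:00 +0000 2014"}',
]
tweets = list(iter_tweets(sample_lines))
print(len(tweets), tweets[0]["user_id_str"])
```

To read a real file, pass an open file handle instead of `sample_lines`; if the files turn out to hold a single JSON array, `json.load` on the whole file would be the simpler choice.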

Raw Price Data

Format: CSV
Entries: date, open price, high price, low price, close price, adjusted close price, volume
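
The raw price files could be read along the following lines. This is a sketch that assumes a comma-separated file with one header row and the seven columns in the order listed above; the field names in `RAW_COLUMNS`, the function `read_raw_prices`, and the sample rows are hypothetical.

```python
import csv
import io
from typing import Dict, List

# Column order as listed above; the field names themselves are hypothetical.
RAW_COLUMNS = ["date", "open", "high", "low", "close", "adj_close", "volume"]


def read_raw_prices(handle, has_header: bool = True) -> List[Dict[str, object]]:
    """Parse raw price rows into dicts, converting numeric fields to float."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # assumed: the first row is a header
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(RAW_COLUMNS):
            raise ValueError(f"expected {len(RAW_COLUMNS)} fields, got {len(fields)}")
        row = dict(zip(RAW_COLUMNS, fields))
        for name in RAW_COLUMNS[1:]:
            row[name] = float(row[name])
        rows.append(row)
    return rows


# Made-up sample data, for illustration only.
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2014-01-02,10.0,11.0,9.5,10.5,10.4,1000\n"
    "2014-01-03,10.5,12.0,10.0,11.5,11.4,1500\n"
)
raw_rows = read_raw_prices(sample_csv)
print(raw_rows[0]["date"], raw_rows[0]["adj_close"])
```

Dates are kept as strings here so that no date format is assumed.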

Preprocessed Price Data

Format: TXT
Entries: date, movement percent, open price, high price, low price, close price, volume
Note: open, high, low, close prices are normalized values.
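
A sketch for the preprocessed price files follows. It assumes one record per line with seven whitespace-separated fields in the order listed above; the separator, the `PriceRow` class, the function `parse_preprocessed_prices`, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PriceRow:
    # Field order follows the "Entries" line above.
    date: str
    movement_percent: float
    open: float
    high: float
    low: float
    close: float
    volume: float


def parse_preprocessed_prices(lines: Iterable[str]) -> List[PriceRow]:
    """Parse whitespace-separated rows (assumed layout) into PriceRow objects."""
    rows = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 7:
            raise ValueError(f"line {line_no}: expected 7 fields, got {len(fields)}")
        date, *numbers = fields
        rows.append(PriceRow(date, *(float(x) for x in numbers)))
    return rows


# Made-up sample rows, for illustration only.
sample_txt = [
    "2014-01-03\t0.012\t0.010\t0.020\t-0.005\t0.012\t1500.0",
    "2014-01-02\t-0.004\t0.001\t0.006\t-0.010\t-0.004\t1000.0",
]
price_rows = parse_preprocessed_prices(sample_txt)
print(price_rows[0].date, price_rows[0].movement_percent)
```

Because the open, high, low, and close values are normalized, they should not be compared directly with the raw prices.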

stocknet-dataset's Issues

Request for the code of the corresponding paper

Can you provide the code of the corresponding paper "Stock Movement Prediction from Tweets and Historical Prices"? Thank you very much!

Missing GMRE tweets?

Hi, I cannot find the textual data for GMRE in the tweet folder.
Am I missing something?

Why is the normalized close price not equal to close / last_adj_price - 1?

Why is the second-to-last column of the preprocessed price data not equal to close_t / adj_price_{t-1} - 1?

Tools to scrape data

Hi, I am wondering what tools you used to scrape the Twitter data.
The official API restricts searches to the past week only, and some Python libraries like GetOldtweets return results with too much noise, like 2017-12-30 17:09:53 $ BTC $ SPX $ NASDAQ $ DJIA $ CAC $ DAX $ FTSE $ JPM $ TSLA $ ES_F $ CL_F $ GC_F $ TLT $ WMT $ JNJ $ FB $ GOOGL $ MSFT $ INTC $ AMD $ TWTR $ NFLX $ AMZN $ MA $ AAPL $ MO $ PG $ GE $ BA $ UTX $ LOW $ HD $ ORCL $ CSCO $ VZ $ DIS $ PCLN $ TSLA $ USB $ AMGN $ SLB $ V $ ICT $ FTI $ IBM $ MS # bitcoin # speculations # GaryShillingpic.twitter.com/EZ4d6Fh6Fs
Or did you use some preprocessing techniques to eliminate such sequences?

new

My name is Luis, I'm a big-data machine-learning developer, I'm a fan of your work, and I usually check your updates.

I was afraid that my savings would be eaten by inflation, so I have created a powerful tool based on past technical patterns (volatility, moving averages, statistics, trends, candlesticks, support and resistance, stock index indicators).
It covers all the ones you know (RSI, MACD, STOCH, Bollinger Bands, SMA, DeMark, Japanese candlesticks, Ichimoku, Fibonacci, Williams %R, balance of power, Murrey Math, etc.) and more than 200 others.

The tool creates prediction models of correct trading points (buy and sell signals, so that every stock is traded well in both timing and direction).
For this I have used big data tools like pandas in Python and stock market libraries such as tablib, TAcharts, and pandas_ta for data collection and calculation,
and powerful machine-learning libraries such as Sklearn.RandomForest, Sklearn.GradientBoosting, XGBoost, Google TensorFlow, and Google TensorFlow LSTM.

With the models trained with the selection of the best technical indicators, the tool is able to predict trading points (where to buy, where to sell) and send real-time alerts to Telegram or Mail. The points are calculated based on the learning of the correct trading points of the last 2 years (including the change to bear market after the rate hike).

I think it could be useful to you, and I would like to share it with you. If you are interested in improving it and collaborating, I am willing; if not, feel free to discard this message.

Dataset doesn't match the description in paper

Hi, I cloned the stock-net repo and tried to reproduce the results you mentioned in your paper.
I found there are only 656 examples in the Devset and 1008 examples in the Testset, but according to your paper the numbers are 2555 and 3720.
I wonder whether something is missing from the dataset you uploaded, or whether the DataPipe code is wrong.