Using Natural Language Processing to predict Tesla stock movement based on news article sentiment from the New York Times
News articles from The New York Times discussing Tesla were sourced for sentiment analysis. Article information was retrieved through API requests and web scraping.
The New York Times Developer Network: Article Search API https://developer.nytimes.com/docs/articlesearch-product/1/overview
The Article Search API of The New York Times was queried for articles containing the term 'tesla' published between January 1, 2010 (the year of Tesla's IPO) and May 31, 2019. The query returned a total of 2,540 article hits.
The following information was requested for each article document:
- Web URL
- Snippet (Headline)
- Publication Date
- Identifier
- Lead Paragraph
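The request described above could be assembled roughly as follows. This is a minimal sketch, not the project's actual code: the endpoint and the `q`, `begin_date`, `end_date`, `fl`, `page` and `api-key` parameters follow the public Article Search API documentation, while `build_search_url`, `extract_fields`, the placeholder key and the sample document are hypothetical and introduced only for illustration.

```python
from urllib.parse import urlencode

# Endpoint from the public Article Search API documentation.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

# Response fields corresponding to the list above.
FIELDS = ["web_url", "snippet", "pub_date", "_id", "lead_paragraph"]


def build_search_url(api_key, page=0):
    """Build the URL for one page of results for the term 'tesla'."""
    params = {
        "q": "tesla",
        "begin_date": "20100101",  # January 1, 2010
        "end_date": "20190531",    # May 31, 2019
        "fl": ",".join(FIELDS),    # limit the response to the fields of interest
        "page": page,              # the API returns results one page at a time
        "api-key": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def extract_fields(doc):
    """Keep only the requested fields from one article document (a dict)."""
    return {name: doc.get(name) for name in FIELDS}


# Example with a made-up document; no network request is made here.
sample_doc = {"web_url": "https://www.nytimes.com/example", "snippet": "Example", "extra": 1}
print(build_search_url("YOUR_API_KEY", page=0))
print(extract_fields(sample_doc))
```

Collecting every hit would mean looping over `page` and sending each URL with an HTTP client of choice, while respecting the API's rate limits.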
The text of the article body was retrieved by accessing each web URL and extracting the main body. Of the 2,540 web URLs, 1,829 full-text articles were captured. The remaining 711 articles, generally published before 2013, were not included in the dataset.
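One way to pull body text out of a downloaded page is sketched below, using only the standard library. It assumes the article text lives in `<p>` elements, which is a simplification; the actual page structure and the extraction code used in the project are not described here. `ParagraphExtractor` and `extract_body` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0       # greater than 0 while inside a <p> element
        self._current = []    # text pieces of the paragraph being read
        self.paragraphs = []  # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._current).strip()
                if text:
                    self.paragraphs.append(text)
                self._current = []

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <b>) is kept as well.
        if self._depth > 0:
            self._current.append(data)


def extract_body(html):
    """Return all paragraph text in the page, one paragraph per line."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p><p>Second.</p>"
    "</body></html>"
)
print(extract_body(sample_html))
```

A page for which this returns an empty string would correspond to an article whose full text could not be captured.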
Natural Language Processing (NLP) was applied to each article snippet (headline), lead paragraph and article body. Two sentiment analysis toolsets were applied to the article texts.
Valence Aware Dictionary and Sentiment Reasoner (VADER) https://www.nltk.org/_modules/nltk/sentiment/vader.html
Because VADER is designed to evaluate short sentences, it was applied to the snippet (headline) and lead paragraph of each article. Negative, neutral and positive proportions were recorded for each text, along with a compound score.
TextBlob https://textblob.readthedocs.io/en/dev/
TextBlob Sentiment Analysis was applied to the snippet (headline), lead paragraph and article body of each article. Polarity and subjectivity scores were recorded for each article.
To merge the sentiment data with Tesla's daily closing stock price, articles were grouped by publication date. For dates with multiple articles, each sentiment score was averaged across all articles published that day. Additionally, the total number of articles retrieved for a given day was recorded to quantify news intensity.
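The daily aggregation can be sketched with a pandas `groupby`. The column names and values below are made up for illustration; only the approach (mean score per date plus an article count) comes from the description above.

```python
import pandas as pd

# Made-up article-level scores; two articles share the same date.
articles = pd.DataFrame({
    "pub_date": pd.to_datetime(["2019-05-01", "2019-05-01", "2019-05-02"]),
    "body_polarity": [0.2, 0.4, -0.1],
    "snippet_compound": [0.5, -0.5, 0.3],
})

grouped = articles.groupby("pub_date")

# Mean of every sentiment column per day.
daily = grouped.mean()

# Number of articles per day, used as a measure of news intensity.
daily["article_count"] = grouped.size()

print(daily)
```

Because `daily` is indexed by date, it can then be joined to a price table that uses the same date index.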
The final pandas DataFrame of sentiment features for New York Times articles discussing Tesla contains 1,829 articles grouped into 1,032 unique days. Fifteen feature columns were engineered from article information using NLP:
Daily Records Beginning January 25, 2013
- TextBlob polarity and subjectivity: article body, lead paragraph and snippet (headline)
- VADER negative, neutral, positive and compound scores: lead paragraph and snippet (headline)
- Total article count
The distribution of each sentiment feature is shown below:
Classification models were used to predict the binary outcome of whether Tesla's stock price moved up (1) or down (0) on a given day, using the following features:
- VADER compound of snippet (continuous)
- VADER positive sentiment of snippet (continuous)
- VADER negative sentiment of snippet (continuous)
- VADER neutral sentiment of snippet (continuous)
- TextBlob article polarity (continuous)
- TextBlob article subjectivity (continuous)
- Article count for the day (continuous)
- Daily trading volume of Tesla (continuous)
- Nasdaq movement (binary)
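A model on features like these could be fitted as in the sketch below. Several things here are assumptions: the model is taken to be the logistic regression mentioned later, "moved up" is defined as a close above the previous day's close, features are standardized before fitting (a choice made here because trading volume is on a much larger scale than the sentiment scores), and all data is randomly generated, so the printed coefficients mean nothing. The column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_days = 200

# Synthetic placeholder data; values are random and carry no real signal.
neg_neu_pos = rng.dirichlet([1, 1, 1], n_days)  # three proportions summing to 1
data = pd.DataFrame({
    "snippet_compound": rng.uniform(-1, 1, n_days),
    "snippet_pos": neg_neu_pos[:, 2],
    "snippet_neg": neg_neu_pos[:, 0],
    "snippet_neu": neg_neu_pos[:, 1],
    "body_polarity": rng.uniform(-1, 1, n_days),
    "body_subjectivity": rng.uniform(0, 1, n_days),
    "article_count": rng.integers(1, 6, n_days),
    "volume": rng.integers(1_000_000, 20_000_000, n_days),
    "nasdaq_up": rng.integers(0, 2, n_days),
    "close": 200 + rng.normal(0, 5, n_days).cumsum(),
})

# Assumed label: 1 if the close is above the previous day's close, else 0.
data["moved_up"] = (data["close"].diff() > 0).astype(int)
data = data.iloc[1:]  # the first day has no previous close

features = [c for c in data.columns if c not in ("close", "moved_up")]
X = data[features]
y = data["moved_up"]

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# One coefficient per (standardized) feature.
coefficients = pd.Series(
    model.named_steps["logisticregression"].coef_[0], index=features
)
print(coefficients)
```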
The model provided the following coefficients:
Accuracy and F1 score:
The Gaussian Naive Bayes model provided the following results:
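A Gaussian Naive Bayes classifier and its accuracy and F1 score could be computed as in the following sketch. The data is randomly generated, the chronological split without shuffling is an assumption made here because the records are daily, and the resulting numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)

# Synthetic features and a noisy binary label, for illustration only.
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

# Keep the row order so that the test set comes after the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```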
Several versions of the models were tested, adding and removing various features. The best results were obtained with the logistic regression model using the following features:
- TextBlob article polarity (continuous)
- TextBlob article subjectivity (continuous)
- Article count for the day (continuous)
- Daily trading volume of Tesla (continuous)
- Nasdaq movement (binary)
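Comparing feature subsets in this way can be sketched as a small loop over candidate column lists. Everything below is illustrative: the data and labels are random, the column names and the two feature sets are hypothetical, and the scaling step and chronological split are assumptions rather than the project's documented setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n_days = 300

# Synthetic placeholder data and labels; no real signal is present.
data = pd.DataFrame({
    "snippet_compound": rng.uniform(-1, 1, n_days),
    "body_polarity": rng.uniform(-1, 1, n_days),
    "body_subjectivity": rng.uniform(0, 1, n_days),
    "article_count": rng.integers(1, 6, n_days),
    "volume": rng.integers(1_000_000, 20_000_000, n_days),
    "nasdaq_up": rng.integers(0, 2, n_days),
})
y = rng.integers(0, 2, n_days)

context = ["article_count", "volume", "nasdaq_up"]
feature_sets = {
    "vader_snippet": ["snippet_compound"] + context,
    "textblob_body": ["body_polarity", "body_subjectivity"] + context,
}


def evaluate(columns):
    """Fit a logistic regression on the given columns and score it."""
    X_train, X_test, y_train, y_test = train_test_split(
        data[columns], y, test_size=0.25, shuffle=False
    )
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred, zero_division=0),
    }


# One row per feature set, with accuracy and F1 as columns.
results = pd.DataFrame(
    {name: evaluate(cols) for name, cols in feature_sets.items()}
).T
print(results)
```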
In this case, TextBlob sentiment analysis of article bodies proved to have better predictive power than VADER sentiment analysis of article snippets (headlines).