
vtoliveira / advanced-data-science-capstone



Advanced Data Science Capstone

This repository has a series of notebooks used in the course Advanced Data Science Capstone project where I developed models to work with sentiment analysis from Twitter US Airline Sentiment.

Content

  • Initial Data Exploration: Notebook for exploratory analysis: checking features and missing values, plotting the data, and gathering insights for feature extraction.
  • Feature Creation (Bag of Words): Here I extract features using a classical text-mining approach: polarity scores, negation, POS counts, emoticons, etc. The data is saved as data_preprocessed.csv.
  • Feature Creation (Word Embedding): Another approach is to create embeddings based on the given tweets. Here I use gensim, a Python library, to train embeddings from raw text.
  • Model Definition (Classical Algorithms): Here I develop the pipeline to train classical models such as SVMs, Naive Bayes, and Random Forests. I also apply further text transformations such as CountVectorizer and TF-IDF, and cross-validate the models with a variety of features.
  • Model Definition (Deep Learning): In this notebook I define an LSTM network for text classification using PyTorch. I also import GloVe embeddings pre-trained on Twitter data, found on Kaggle. You can download them here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/50350
  • Model Evaluation: Here I explore some metrics to check how the models perform and briefly discuss the best models.
  • Sentiment Analysis: A demo in which a few tweets are classified, presenting the model to a non-technical audience.
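
As a rough illustration of the hand-crafted features listed under Feature Creation (Bag of Words), the sketch below counts negations, emoticons, and exclamation marks in a tweet. The word lists, the feature names, and the `extract_features` function are hypothetical and chosen for this example; the notebook's actual lexicons and columns are not shown in this README.

```python
import re

# Hypothetical mini-lexicons for illustration only (assumptions, not the notebook's lists).
NEGATIONS = {"not", "no", "never", "cannot"}
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}


def extract_features(tweet: str) -> dict:
    """Return simple count-based features for one tweet."""
    raw_tokens = tweet.split()
    # Lowercase and strip everything except letters and apostrophes for word matching.
    words = [re.sub(r"[^a-z']", "", tok.lower()) for tok in raw_tokens]
    n_negations = sum(1 for w in words if w in NEGATIONS or w.endswith("n't"))
    return {
        "n_tokens": len(raw_tokens),
        "n_negations": n_negations,
        # Emoticons are matched on the raw tokens so that case (":D") is preserved.
        "n_pos_emoticons": sum(1 for tok in raw_tokens if tok in POSITIVE_EMOTICONS),
        "n_neg_emoticons": sum(1 for tok in raw_tokens if tok in NEGATIVE_EMOTICONS),
        "n_exclamations": tweet.count("!"),
    }


features = extract_features("@airline my flight was NOT on time :( never again!")
```

Features like these could then be written out as one row per tweet, in the spirit of data_preprocessed.csv.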

Summary

My major contribution with this project was producing different models in order to systematically compare classical machine learning techniques with feature extraction against deep learning recurrent neural networks with Twitter word embeddings.

Below are the confusion matrices for the three best classical approaches (Random Forest, SVM, and Naive Bayes). I used TF-IDF features with SVM and Random Forest, and count vectors with Naive Bayes. The results were obtained after cross-validation for hyperparameter tuning. SVM with bi-grams clearly achieved the best results in this case.

[Figure: confusion matrices for Random Forest, SVM, and Naive Bayes]
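
To make the TF-IDF with bi-grams idea concrete, here is a minimal pure-Python sketch that builds uni-gram and bi-gram terms and weights them. It assumes a textbook weighting, tf × (ln(N/df) + 1) with L2 normalization; library implementations may smooth the weights differently, and the notebooks' exact settings are not stated here. The `ngrams` and `tfidf` helpers are hypothetical names.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all contiguous n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(docs, n_max=2):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [ngrams(doc.lower().split(), n_max) for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for terms in tokenized for term in set(terms))
    n_docs = len(docs)
    vectors = []
    for terms in tokenized:
        counts = Counter(terms)
        weights = {t: c * (math.log(n_docs / df[t]) + 1.0) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["flight was late", "flight was great"]
vectors = tfidf(docs)
```

Terms that occur in only one tweet, such as the bi-gram "was late", receive a higher weight than terms shared by every tweet, which is what lets a linear classifier pick up on short phrases.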

Next, we used word embeddings trained on Twitter data, with 50, 100, and 200 dimensions, and an LSTM network to classify the tweets' sentiment.

[Figure: LSTM results with 50-, 100-, and 200-dimensional embeddings]
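
One step implied by using pre-trained embeddings is turning the GloVe text file into a matrix aligned with the model's vocabulary. The sketch below assumes the standard GloVe plain-text layout (a token followed by its vector components on each line) and uses a tiny in-memory sample instead of the real file. The `load_glove` and `build_embedding_matrix` helpers and the toy vocabulary are hypothetical.

```python
import io


def load_glove(handle):
    """Parse GloVe's plain-text format: a token, then its vector components, per line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, vectors, dim):
    """Row i holds the vector for vocab[i]; unknown words get a zero vector."""
    return [list(vectors[word]) if word in vectors else [0.0] * dim for word in vocab]


# Toy 3-dimensional sample standing in for a real 50/100/200-dimensional file.
sample = io.StringIO("flight 0.1 0.2 0.3\nlate -0.4 0.5 0.6\n")
glove = load_glove(sample)
matrix = build_embedding_matrix(["<pad>", "flight", "delayed", "late"], glove, dim=3)
```

In PyTorch, a matrix like this could be converted to a tensor and passed to `torch.nn.Embedding.from_pretrained` to initialize the embedding layer in front of the LSTM; whether the notebook does exactly this is an assumption.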

Finally, we computed precision, accuracy, F1, and recall scores for each model.

[Figures: scores for each model]
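
For reference, accuracy, precision, recall, and F1 can be computed from true and predicted labels as sketched below. The README does not say how per-class scores were averaged, so the macro averaging here is an assumption, and the labels and the `classification_scores` function are hypothetical.

```python
def classification_scores(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[label] = (precision, recall, f1)
    n_labels = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "macro_precision": sum(s[0] for s in per_class.values()) / n_labels,
        "macro_recall": sum(s[1] for s in per_class.values()) / n_labels,
        "macro_f1": sum(s[2] for s in per_class.values()) / n_labels,
        "per_class": per_class,
    }


scores = classification_scores(
    ["negative", "negative", "neutral", "positive"],
    ["negative", "neutral", "neutral", "positive"],
)
```

Macro averaging weights every class equally, which matters when one sentiment class dominates the dataset.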

We can see that the LSTM performs better overall, so it was our chosen model.

Observation: A comment on training the LSTM network. Although it achieved the better result, our test set is small and the sample is considerably small for text classification, so the best results were only achieved after intensive training and tuning, which can sometimes be impractical.

Video Presentation

One of the tasks was also to produce a video presentation about the project. It covers the details of the product developed, the model definitions, the evaluation, etc.

Video: https://www.youtube.com/watch?v=oLX73CMh8Fc&t=43s
