arka57 / ted-talk-views-prediction

A predictive model for the number of views of TED Talks on YouTube, built with machine learning models on a dataset of past talks

License: MIT License

Dockerfile 0.01% Python 0.01% Jupyter Notebook 99.97% HTML 0.01%
decision-trees eda feature-engineering linear-regression machine-learning pandas random-forest skicit-learn xgboost-regression

Ted-Talk-Views-Prediction

TED is devoted to spreading powerful ideas on just about any topic. Founded in 1984 by Richard Saul Wurman as a nonprofit organization that aimed at bringing together experts from the fields of Technology, Entertainment, and Design, TED Conferences have gone on to become the Mecca of ideas from virtually all walks of life.
This project aims to develop a predictive model for the number of views of a TED Talk on YouTube. It uses a dataset with details of about 4,000 past TED talks, including transcripts in many languages, and compares several machine learning models.

Dataset info

Number of records: 4,005
Number of attributes: 19

Features information:

The dataset contains features like:

talk_id: Talk identification number provided by TED
title: Title of the talk
speaker_1: First speaker in TED's speaker list
all_speakers: Speakers in the talk
occupations: Occupations of the speakers
about_speakers: Blurb about each speaker
recorded_date: Date the talk was recorded
published_date: Date the talk was published to TED.com
event: Event or medium in which the talk was given
native_lang: Language the talk was given in
available_lang: All available languages (lang_code) for a talk
comments: Count of comments
duration: Duration in seconds
topics: Related tags or topics for the talk
related_talks: Related talks (key='talk_id',value='title')
url: URL of the talk
description: Description of the talk
transcript: Full transcript of the talk

Target variable:
views: Count of views of each talk

Data Pre-Processing and Feature Engineering

1) Duplicate rows: No duplicate rows were present

2) Handling missing values:

occupations: NULL values were replaced with 'Others'
about_speakers: NULL values were replaced with 'NA'
comments: NULL values were replaced with the extreme value 0
recorded_date, all_speakers: These columns have very few NULL values (<1%), so the affected rows were deleted
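
The missing-value rules above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `handle_missing_values` and the tiny `sample` frame are made up for the example, and the column names are taken from the feature list.

```python
import pandas as pd


def handle_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the missing-value rules listed above to a copy of the talks frame."""
    out = df.copy()
    out["occupations"] = out["occupations"].fillna("Others")
    out["about_speakers"] = out["about_speakers"].fillna("NA")
    out["comments"] = out["comments"].fillna(0)
    # Rows missing recorded_date or all_speakers are rare (<1%), so drop them.
    out = out.dropna(subset=["recorded_date", "all_speakers"])
    return out.reset_index(drop=True)


# Hypothetical three-row frame, only to exercise the function.
sample = pd.DataFrame({
    "occupations": ["writer", None, "scientist"],
    "about_speakers": [None, "bio", "bio"],
    "comments": [12.0, None, 3.0],
    "recorded_date": ["2019-01-01", "2019-02-01", None],
    "all_speakers": ["A", "B", "C"],
})
cleaned = handle_missing_values(sample)
print(cleaned)
```

Working on a copy keeps the raw frame available for comparison, for example to confirm how many rows the `dropna` step removed.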

3) Outlier handling: For numerical columns, outlier values were replaced with the median
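
A possible sketch of median replacement is shown below. The README does not say how outliers were detected, so the 1.5 x IQR rule used here is an assumption, as are the function name `replace_outliers_with_median` and the sample durations.

```python
import pandas as pd


def replace_outliers_with_median(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the series median.

    The IQR rule is an assumed detection method, not one confirmed by the README.
    """
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (series < lower) | (series > upper)
    return series.mask(is_outlier, series.median())


# Hypothetical durations in seconds; the last value is an obvious outlier.
durations = pd.Series([600, 620, 640, 660, 680, 700, 9000], dtype="float64")
cleaned = replace_outliers_with_median(durations)
print(cleaned.tolist())
```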

4) Feature engineering:
-- New column time_since_published stores how old the video is. It is the difference between the latest published date of any video in the dataset (taken as the reference date) and the published date of the current video
-- New column daily_views was created because it is a better metric than raw views. It is calculated as views/time_since_published
-- New column avg_speaker_1_views stores the average views of the first speaker's videos, to capture the correlation between a particular speaker and views
-- New column is_weekend records whether the video was released on a weekday or a weekend
-- New column available_lang_count stores the number of languages in which a video is available
-- New column topics_count stores the number of topics attached to a video
-- New column avg_event_views stores the average views of the videos from the same event
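
The engineered columns could be derived along these lines. This is a sketch under stated assumptions rather than the notebook's code: `add_features` and the sample frame are hypothetical, `available_lang` and `topics` are assumed to be already parsed into Python lists, `avg_event_views` is assumed to be the mean views per event, and zero-day ages are clipped to 1 to avoid division by zero.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the engineered columns described above to a copy of the frame."""
    out = df.copy()
    out["published_date"] = pd.to_datetime(out["published_date"])
    # Reference date: the latest published date found in the dataset.
    last_date = out["published_date"].max()
    out["time_since_published"] = (last_date - out["published_date"]).dt.days
    # Assumption: clip to 1 day so the newest talk does not divide by zero.
    out["daily_views"] = out["views"] / out["time_since_published"].clip(lower=1)
    out["avg_speaker_1_views"] = out.groupby("speaker_1")["views"].transform("mean")
    # Assumption: avg_event_views is the mean views of talks from the same event.
    out["avg_event_views"] = out.groupby("event")["views"].transform("mean")
    # dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
    out["is_weekend"] = (out["published_date"].dt.dayofweek >= 5).astype(int)
    # Assumption: these two columns already hold Python lists.
    out["available_lang_count"] = out["available_lang"].apply(len)
    out["topics_count"] = out["topics"].apply(len)
    return out


# Hypothetical talks, only to exercise the function.
talks = pd.DataFrame({
    "published_date": ["2020-01-04", "2020-01-06", "2020-01-14"],
    "views": [1000, 800, 500],
    "speaker_1": ["A", "A", "B"],
    "event": ["TED2020", "TEDx", "TED2020"],
    "available_lang": [["en", "fr"], ["en"], ["en", "es", "de"]],
    "topics": [["science"], ["art", "design"], ["tech"]],
})
features = add_features(talks)
print(features[["time_since_published", "daily_views", "is_weekend"]])
```

Because the speaker and event averages are computed from the target column, it is generally safer to compute them on the training split only and map them onto the test split.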

EDA

EDA was performed to gain insights into the dataset.
The following types of visualizations were used:
--Bar Plot
--Box Plot
--Scatter Plot
--Count Plot

Models Used

The dataset was split into training and test sets in an 80:20 ratio. Several models were tried on the dataset: Linear Regression and tree-based models such as Decision Tree and Random Forest. To reduce overfitting, hyperparameter tuning was applied to each model.
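
The split and tuning workflow can be illustrated with scikit-learn as below. This is only a sketch: it uses synthetic data from `make_regression` instead of the TED dataset, and the parameter grid, `n_estimators`, `cv` and `random_state` values are illustrative assumptions, not the settings used in the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered feature matrix and the target.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# 80:20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid: shallower trees and larger leaves act as regularization.
param_grid = {"max_depth": [3, 6], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
print(search.best_params_)
print("Test R2:", best_model.score(X_test, y_test))
```

The grid search is fitted on the training split only, so the test split stays untouched until the final score is computed.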

Results

Random Forest with regularization using GridSearch performed the best among the models, with the least overfitting. Due to training-time resource constraints, XGBoost was not applied. The results were evaluated on several metrics: MSE, MSRT (the square root of MSE) and R2 score.
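
The metrics named above can be computed as in the following sketch. The MSRT values reported in this README match the square roots of the corresponding MSE values, so MSRT is treated here as the root mean squared error; the helper name `regression_scores` and the sample arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def regression_scores(y_true, y_pred) -> dict:
    """Return MSE, its square root (reported as MSRT in this README) and R2."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),
        "r2": float(r2_score(y_true, y_pred)),
    }


# Hypothetical predictions, each off by exactly 10 views.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])
scores = regression_scores(y_true, y_pred)
print(scores)
```

Calling the helper once on the training predictions and once on the test predictions gives the paired scores needed to judge overfitting.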

Scores

| Model | MSE (Train) | MSE (Test) | MSRT (Train) | MSRT (Test) | R2 (Train) | R2 (Test) |
|---|---|---|---|---|---|---|
| Linear Regression | 13734743.9833 | 9048703.5828 | 3706.0415 | 3008.1063 | 0.7622 | 0.8102 |
| L1 Regression | 13811142.7076 | 8688681.1133 | 3716.3345 | 2947.6568 | 0.7608 | 0.8178 |
| L2 Regression | 13764151.3081 | 8907230.4988 | 3710.0069 | 2984.4983 | 0.7616 | 0.8132 |
| Decision Tree | 11233973.1642 | 18367479.3015 | 3351.7119 | 4285.7297 | 0.8054 | 0.6149 |
| Random Forest | 2629275.8123 | 7230347.4333 | 1621.5041 | 2688.9305 | 0.9544 | 0.8484 |
| Gradient Boosting | 459258.8654 | 8088584.9908 | 677.6864 | 2844.0437 | 0.9920 | 0.8304 |
| XGBoosting | 5734634.3142 | 11262460.5049 | 2394.7096 | 3355.9589 | 0.9007 | 0.7638 |

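**Conclusion**
Random Forest and Gradient Boosting are the two best performing models on the test set (R2 of 0.8484 and 0.8304), and both achieve an R2 score above 0.9 on the training set. Random Forest has the lowest test MSE and a smaller gap between training and test scores. Hence Random Forest with hyperparameter tuning is the best working model for the given problem.
The model was then containerized with Docker and deployed on an AWS EC2 instance.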

Tools and Technologies used

Colab IDE, VSCode, AWS EC2, WinSCP, PuTTY, Docker

Further scope

  1. Apply NLP concepts to include the title feature and check its effect on views
  2. Use the related topics feature of a video in the prediction
