Jeopardy--NLP

Problem Statement:

 
Build a model to predict the dollar value of a question in the TV game show "Jeopardy!".
The data can be downloaded from Kaggle: https://www.kaggle.com/tunguz/200000-jeopardy-questions

Data description 
▪ 'category' : the question category, e.g. "HISTORY"
▪ 'value' : $ value of the question as a string, e.g. "$200" (Note: "None" for Final Jeopardy! and Tiebreaker questions)
▪ 'question' : text of the question (Note: this sometimes contains hyperlinks and other messy text, such as when there's a picture or video question)
▪ 'answer' : text of the answer
▪ 'round' : one of "Jeopardy!", "Double Jeopardy!", "Final Jeopardy!" or "Tiebreaker" (Note: Tiebreaker questions do happen but they're very rare, about once every 20 years)
▪ 'show_number' : string of the show number, e.g. '4680'
▪ 'air_date' : the show air date in the format YYYY-MM-DD

Data Preparation

1- The first 100k samples from the dataset were taken.
2- Only samples from the "Jeopardy!" round were selected.
3- Redundant features such as 'round', 'show_number' and 'air_date' were dropped.
4- The text was preprocessed: stopword removal, stemming, lemmatization, lower-casing, etc.
5- Depending on the task (binary or multi-class classification), a class-balanced dataset was prepared.
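
The preparation steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used in this project: the tiny stopword list, the helper names (parse_value, clean_text, prepare, balance_classes) and the downsampling strategy are all assumptions, and stemming and lemmatization are left out because they would normally come from a library such as NLTK or spaCy.

```python
import random
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not the list used in the project).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "is", "was",
             "this", "that", "to", "for", "and"}


def parse_value(raw):
    """Turn a string such as "$200" or "$1,000" into an int; None if there is no value."""
    if raw is None or raw == "None":
        return None
    digits = re.sub(r"[^0-9]", "", raw)
    return int(digits) if digits else None


def clean_text(text):
    """Lower-case, strip HTML tags, keep alphanumeric tokens, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text.lower())
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOPWORDS]


def prepare(rows, limit=100_000):
    """Keep "Jeopardy!" round rows from the first `limit` samples and clean the text fields."""
    prepared = []
    for row in rows[:limit]:
        if row["round"] != "Jeopardy!":
            continue
        value = parse_value(row["value"])
        if value is None:
            continue
        # 'round', 'show_number' and 'air_date' are simply not copied over.
        prepared.append({
            "category": clean_text(row["category"]),
            "question": clean_text(row["question"]),
            "answer": clean_text(row["answer"]),
            "value": value,
        })
    return prepared


def balance_classes(samples, label_of, seed=0):
    """Downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[label_of(sample)].append(sample)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced
```

Here label_of is whatever function maps a sample to its class, for example a hypothetical threshold on 'value' for the binary case or a binning function for the 3-class and 5-class cases.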

Approach:

1- The important features are 'question', 'answer' and 'category'.
2- The value is predicted from these three features.
3- To generate word embeddings, a fastText model is fine-tuned from the pre-trained wiki-news vectors.
   Pre-trained embeddings were downloaded from: https://fasttext.cc/docs/en/english-vectors.html
4- To generate sentence vectors from these word embeddings, the concatenated power means (p-means) method is used.
   P-means paper: https://arxiv.org/pdf/1803.01400.pdf
5- The sentence vectors of the question, answer and category were concatenated to form the final feature matrix.
6- Various ML and DL models were trained on this feature matrix.
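
The sentence-vector and concatenation steps above can be sketched as follows. This is a simplified illustration and not the project's code: it assumes the word embeddings are available as a plain dict from token to list of floats, supports only the power means p = -inf (minimum), p = 1 (arithmetic mean) and p = +inf (maximum), and uses hypothetical function names. Skipping tokens missing from the dict is also a simplification, since a real fastText model can build vectors for unseen words from subwords.

```python
import math


def power_mean(vectors, p):
    """Element-wise power mean over a list of equal-length word vectors."""
    columns = list(zip(*vectors))
    if p == math.inf:
        return [max(col) for col in columns]
    if p == -math.inf:
        return [min(col) for col in columns]
    if p == 1:
        return [sum(col) / len(col) for col in columns]
    raise ValueError("this sketch only supports p in {-inf, 1, +inf}")


def sentence_vector(tokens, embeddings, dim, powers=(-math.inf, 1, math.inf)):
    """Concatenate several power means of the word vectors of one text."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known word: fall back to a zero vector of the right length.
        return [0.0] * (dim * len(powers))
    result = []
    for p in powers:
        result.extend(power_mean(vectors, p))
    return result


def feature_row(question, answer, category, embeddings, dim):
    """Concatenate the question, answer and category sentence vectors."""
    return (sentence_vector(question, embeddings, dim)
            + sentence_vector(answer, embeddings, dim)
            + sentence_vector(category, embeddings, dim))


# Toy 2-dimensional embeddings, for illustration only.
toy_embeddings = {"paris": [1.0, -2.0], "france": [3.0, 4.0]}
row = feature_row(["capital", "france"], ["paris"], ["geography"], toy_embeddings, dim=2)
```

With three power means, each text contributes 3 x dim values, so one row of the feature matrix has 9 x dim entries.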

Results

Case A: Binary Classification

Baseline for binary classification: https://github.com/yashajoshi/Predicting-Value-of-Jeopardy-Questions

Best reported metrics:

fastText with p-means:
The best results were obtained with an XGBoost classifier using these hyperparameters:
learning_rate=0.1, max_depth=2, n_estimators=140, objective="binary:logistic"

Case B: Multi-class Classification

3 Classes:

5 Classes:

The best performance was obtained with an XGBoost classifier using these hyperparameters:
learning_rate=0.1, max_depth=2, n_estimators=140, objective="multi:softmax"
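
As a rough sketch, the two hyperparameter sets reported above could be passed to the scikit-learn style XGBClassifier from the xgboost package like this. The dictionary names and the build_classifier helper are hypothetical, and the README does not state whether the scikit-learn wrapper or the native xgboost API was used.

```python
# Hyperparameters as reported above for the binary case.
BINARY_PARAMS = {
    "learning_rate": 0.1,
    "max_depth": 2,
    "n_estimators": 140,
    "objective": "binary:logistic",
}

# Same settings for the multi-class case, with only the objective changed.
MULTICLASS_PARAMS = {**BINARY_PARAMS, "objective": "multi:softmax"}


def build_classifier(params):
    """Create an XGBClassifier from one of the parameter sets above."""
    # Imported lazily so the parameter dictionaries can be inspected
    # even when xgboost is not installed.
    from xgboost import XGBClassifier
    return XGBClassifier(**params)


# Typical use (X_train and y_train are the feature matrix and class labels):
#   clf = build_classifier(BINARY_PARAMS)
#   clf.fit(X_train, y_train)
#   predictions = clf.predict(X_test)
```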
