luigi_mlpipeline_multiclass_classification's Introduction

My Comments

The main task of this data challenge was to handle class imbalance using techniques such as random oversampling or SMOTE. See the EDA folder for visualizations of the provided data.
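
As a rough illustration of random oversampling, the sketch below duplicates minority-class samples until every class matches the largest one. It is not the code from this repo: `random_oversample` and its arguments are hypothetical names, and it uses only the standard library. Resample only the training split, so that duplicated rows cannot leak into the validation data.

```python
import random
from collections import Counter, defaultdict


def random_oversample(features, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for x, y in zip(features, labels):
        by_class[y].append(x)

    # Every class is topped up to the size of the largest class.
    target = max(len(samples) for samples in by_class.values())

    out_x, out_y = [], []
    for y, samples in by_class.items():
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        for x in samples + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy training split: class 0 dominates classes 1 and 2.
train_x = ["a", "b", "c", "d", "e", "f"]
train_y = [0, 0, 0, 0, 1, 2]
balanced_x, balanced_y = random_oversample(train_x, train_y)
print(Counter(balanced_y))
```

SMOTE differs in that it synthesizes new minority samples instead of repeating existing ones; this sketch covers plain random oversampling only.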

I also plotted the tweet sentiments on a map and produced further analyses of the data using data profiling.

If you find this repo useful in some way, please star it.

I still need to improve the code and will keep working on it as I get more time. Many thanks :)

Useful URLs for understanding the solution:

The right way to oversample in predictive modelling: https://beckernick.github.io/oversampling-modeling/

Cross-validation pipeline with random forest: https://www.kaggle.com/alexisbcook/cross-validation https://towardsdatascience.com/yet-another-twitter-sentiment-analysis-part-1-tackling-class-imbalance-4d7a7f717d44

Preventing data leakage: https://www.kaggle.com/alexisbcook/data-leakage

ML Pipeline Problem

This question will test some basic skills in cleaning data and building a machine learning pipeline.

The focus of this test is to evaluate:

  • Ability to quickly learn a new framework (luigi)
  • Ability to manipulate and process data (cleaning, processing, feature engineering)
  • Competency in software development

This test does not focus on modelling accuracy, ability to use a fancy model, or efficiency. It is mainly about the mechanics of building a proper machine learning pipeline.

Datasets

There are two files: airline_tweets.csv and cities.csv.

airline_tweets.csv has twitter data regarding airline sentiment augmented with some extra columns. The relevant columns are:

  • airline_sentiment: a string indicating if the tweet had positive, neutral or negative sentiment.
  • tweet_coord: a string of the form "[, ]" if a geo-coordinate exists for that tweet, or an empty string otherwise.
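
A minimal sketch of how a tweet_coord value might be parsed and validated. It assumes the string is a JSON-style pair of numbers such as "[40.7, -73.9]" (the exact format and the latitude/longitude order are assumptions), and it applies the rule from CleanDataTask that an empty value or (0.0, 0.0) is invalid. `parse_tweet_coord` is a hypothetical helper name.

```python
import json


def parse_tweet_coord(raw):
    """Return the coordinate as a (float, float) tuple, or None when the
    value is empty, malformed, or equal to (0.0, 0.0)."""
    if raw is None or not str(raw).strip():
        return None
    try:
        # Assumed format: a two-element list such as "[40.7, -73.9]".
        first, second = json.loads(raw)
        coord = (float(first), float(second))
    except (ValueError, TypeError):
        # Covers invalid JSON, wrong element counts and non-numeric values.
        return None
    if coord == (0.0, 0.0):
        return None
    return coord


print(parse_tweet_coord("[40.5, -73.9]"))
print(parse_tweet_coord(""))
print(parse_tweet_coord("[0.0, 0.0]"))
```

Returning None for every invalid case keeps the cleaning step to a single filter on the parsed value.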

The cities.csv contains information about latitude and longitude for large cities. The relevant columns are:

  • name: The name of the city.
  • latitude: The latitude of the city.
  • longitude: The longitude of the city.
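
The closest-city lookup required by TrainingDataTask could be sketched as follows. The function name `closest_city` and the tuple layout `(name, latitude, longitude)` are assumptions for illustration. The distance is plain Euclidean distance in degree space, as the task asks, not great-circle distance.

```python
import math


def closest_city(coord, cities):
    """Return the name of the city nearest to coord.

    coord  -- (latitude, longitude) of the tweet
    cities -- iterable of (name, latitude, longitude) tuples
    """
    lat, lon = coord
    nearest = min(cities, key=lambda c: math.hypot(c[1] - lat, c[2] - lon))
    return nearest[0]


# Made-up coordinates, not rows from cities.csv.
example_cities = [("Alpha", 0.0, 0.0), ("Beta", 10.0, 10.0)]
print(closest_city((9.0, 9.5), example_cities))
```

A linear scan like this costs one pass over the cities per tweet. Whether that is fast enough depends on the size of both files, which this sketch does not assume.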

Problem

Build a basic ML pipeline using the luigi Python framework. The pipeline should clean the tweet data, prepare features for building a model, train a classifier and score using the model. The pipeline should have these steps:

  • CleanDataTask: Cleans the input tweet CSV file by removing any rows without valid geo-coordinates.
    • A coordinate is invalid if the tweet_coord column is empty or the coordinate is (0.0, 0.0).
  • TrainingDataTask: Extracts features/outcome variable in preparation for training a model.
    • This prepares the cleaned data into the exact form that is able to be fit by the model.
    • The "y" variable will be the multi-class sentiment (0, 1, 2 for negative, neutral and positive respectively).
    • The "X" variables will be the closest city to the "tweet_coord" using Euclidean distance.
    • You should use the cities.csv file to find the closest city.
    • You probably will need to one-hot encode the city names.
  • TrainModelTask: Trains a classifier to predict negative, neutral, positive based only on the input city.
    • Train a classifier that uses closest cities as features.
    • Dump the fitted model to the output file.
  • ScoreTask: Uses the trained model to compute the sentiment for each city.
    • Use the trained model to predict the probability/score of negative, neutral and positive sentiment for each city.
    • Output a list of cities sorted by predicted positive sentiment score to the output file.

Notes/Hints/Suggestions

  • We have provided a skeleton file to get you started named pipeline.py, and a script run.sh that will execute this luigi pipeline.
  • You must use the luigi package.
  • You must use Python (any version is fine).
  • Feel free to use any Python packages. We used pandas, scikit-learn, numpy (as seen in the included requirements.txt).
  • Do not worry too much about run-time/memory efficiency. So long as it runs within 15 minutes, it should be fine.

References

 * Luigi package: `http://luigi.readthedocs.io/en/stable/`
