
HSE_ML_final

Our team repository for the final ML task.

This is a multiclass classification problem.

The dataset has 3 columns:

  • cuisine - the cuisine category, the target column;
  • id - an object identifier;
  • ingredients - the ingredients of the recipe, the features (string type).

There are 20 classes, and they are imbalanced:

  • italian - 7838
  • mexican - 6438
  • southern_us - 4320
  • indian - 3003
  • chinese - 2673
  • french - 2646
  • cajun_creole - 1546
  • thai - 1539
  • japanese - 1423
  • greek - 1175
  • spanish - 989
  • korean - 830
  • vietnamese - 825
  • moroccan - 821
  • british - 804
  • filipino - 755
  • irish - 667
  • jamaican - 526
  • russian - 489
  • brazilian - 467

There are 6714 unique ingredients and 3074 unique words in the ingredients column.
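
A quick sketch of how these statistics could be obtained; the file name and the comma-separated ingredient format are assumptions for illustration, not taken from the repo:

```python
import pandas as pd

# Assumed file name; the dataset has three columns: id, cuisine, ingredients (string).
df = pd.read_csv("train.csv")

# Class distribution over the 20 cuisines (imbalanced).
print(df["cuisine"].value_counts())

# Assuming ingredients are stored as one comma-separated string per recipe:
ingredients = df["ingredients"].str.split(",").explode().str.strip()
print("unique ingredients:", ingredients.nunique())

# Unique words across the ingredients column.
words = df["ingredients"].str.lower().str.split().explode()
print("unique words:", words.nunique())
```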

This classification problem can be approached in several ways (a simple baseline sketch follows the list):

  • one-hot encoding;
  • NLP text preprocessing and vectorization techniques;
  • class balancing;
  • different models;
  • hyperparameter tuning;
  • some deep learning.
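
A minimal baseline sketch combining several of these ideas (TF-IDF vectorization, class weighting, a linear model); this is an illustration under the assumption that df is the training DataFrame described above, not the team's actual notebook code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# df: training DataFrame with columns id, cuisine, ingredients.
X_train, X_val, y_train, y_val = train_test_split(
    df["ingredients"], df["cuisine"],
    test_size=0.2, stratify=df["cuisine"], random_state=42
)

# TF-IDF over the raw ingredient strings plus a linear model;
# class_weight="balanced" partially compensates for the class imbalance.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, token_pattern=r"[a-zA-Z]+")),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

baseline.fit(X_train, y_train)
print("validation accuracy:", baseline.score(X_val, y_val))
```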

We tried spaCy and Spark NLP, and also TensorFlow with an LSTM.

# BASELINE: two submissions on Kaggle.com - 0.28278 & 0.27544

sparknlp ClassifierDLApproach with embeddings

Vectorization is done with the UniversalSentenceEncoder.pretrained model from the Spark NLP toolkit. (The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks. The model is trained and optimized for greater-than-word-length text, such as sentences, phrases or short paragraphs.) In other words, it expects full sentences rather than a list of loosely related words like our ingredient strings. The resulting mathematical representation of the text was passed to ClassifierDLApproach, a deep learning classifier also from Spark NLP. (ClassifierDL uses the state-of-the-art Universal Sentence Encoder as input for text classification. The ClassifierDL annotator uses a deep learning model (DNN) built inside TensorFlow and supports up to 100 classes.)
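
A minimal sketch of such a pipeline (not the exact notebook code): the column names follow the dataset description above, the hyperparameters match the first training log below, and train_df is assumed to be a Spark DataFrame with the training data.

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach

spark = sparknlp.start()

# Wrap the raw ingredients string into Spark NLP's document structure.
document = DocumentAssembler() \
    .setInputCol("ingredients") \
    .setOutputCol("document")

# Pretrained Universal Sentence Encoder produces sentence embeddings.
use = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# ClassifierDL trains a TensorFlow DNN on top of the sentence embeddings.
classifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("cuisine") \
    .setMaxEpochs(30) \
    .setLr(0.01) \
    .setBatchSize(256) \
    .setEnableOutputLogs(True)

pipeline = Pipeline(stages=[document, use, classifier])
model = pipeline.fit(train_df)  # train_df: Spark DataFrame with 'ingredients' and 'cuisine'
```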

## Then I had to look at the logs

Spark NLP ClassifierDLApproach training logs:

Training started - epochs: 30 - learning_rate: 0.01 - batch_size: 256 - training_examples: 35797 - classes: 20

  • Epoch 0/30 - 10.02s - loss: 391.01025 - acc: 0.29464805 - val_acc: 43.02238 - batches: 140
  • Epoch 1/30 - 9.30s - loss: 368.92258 - acc: 0.4756007 - val_acc: 50.138294 - batches: 140
  • Epoch 2/30 - 9.87s - loss: 363.7496 - acc: 0.5127525 - val_acc: 51.898415 - batches: 140
  • Epoch 3/30 - 9.31s - loss: 363.78198 - acc: 0.5221782 - val_acc: 52.225296 - batches: 140

...

  • Epoch 26/30 - 9.45s - loss: 356.01282 - acc: 0.5420299 - val_acc: 53.43224 - batches: 140
  • Epoch 27/30 - 9.41s - loss: 355.9558 - acc: 0.54219854 - val_acc: 53.43224 - batches: 140
  • Epoch 28/30 - 9.15s - loss: 355.9018 - acc: 0.5423109 - val_acc: 53.40709 - batches: 140
  • Epoch 29/30 - 9.28s - loss: 355.84833 - acc: 0.54250765 - val_acc: 53.45738 - batches: 140

Training started - epochs: 1000 - learning_rate: 1.0E-4 - batch_size: 256 - training_examples: 35797 - classes: 20

  • Epoch 0/1000 - 9.82s - loss: 417.42383 - acc: 0.19734961 - val_acc: 19.58763 - batches: 140
  • Epoch 1/1000 - 5.69s - loss: 408.16028 - acc: 0.19864231 - val_acc: 19.58763 - batches: 140
  • Epoch 2/1000 - 5.88s - loss: 405.27988 - acc: 0.19864231 - val_acc: 19.58763 - batches: 140
  • Epoch 3/1000 - 5.72s - loss: 404.9568 - acc: 0.19864231 - val_acc: 19.58763 - batches: 140

...

  • Epoch 996/1000 - 9.16s - loss: 391.82648 - acc: 0.33264804 - val_acc: 33.31657 - batches: 140
  • Epoch 997/1000 - 9.28s - loss: 391.82486 - acc: 0.33264804 - val_acc: 33.31657 - batches: 140
  • Epoch 998/1000 - 9.49s - loss: 391.82343 - acc: 0.33264804 - val_acc: 33.31657 - batches: 140
  • Epoch 999/1000 - 9.19s - loss: 391.822 - acc: 0.33264804 - val_acc: 33.31657 - batches: 140

It seems the model is undertrained and our vectors carry too little information - "Never use a shotgun when a flyswatter will do".

So, in Jupyter notebooks we moved to classic ML and some TensorFlow with simpler vectorization (a sketch follows below).
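
As an illustration of what "TensorFlow with simple vectorization" could look like - a sketch under assumptions (multi-hot ingredient vectors, a small untuned dense network, df as the training DataFrame), not the team's actual notebook code:

```python
import tensorflow as tf
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Simple binary (multi-hot) vectorization of the ingredient strings.
vectorizer = CountVectorizer(binary=True, token_pattern=r"[a-zA-Z]+")
X = vectorizer.fit_transform(df["ingredients"]).toarray().astype("float32")

# Integer-encode the 20 cuisine labels.
labels = LabelEncoder()
y = labels.fit_transform(df["cuisine"])

# A small dense network; layer sizes are illustrative, not tuned values from the notebooks.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X.shape[1],)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(len(labels.classes_), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=256, validation_split=0.1)
```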
