sandy4321 / vtb-data-fusion

This project is forked from pskliff/vtb-data-fusion

This repository provides a code solution for Task 1 of the Data Fusion Contest.

Jupyter Notebook 82.80% Python 17.20%

vtb-data-fusion

Short description: Single distilbert
Place: 7/265 (top 3%)
Public LB = 0.8683
Private LB = 0.8674

Requirements

To install requirements:

pip install -r requirements.txt

Datasets

Boosters

Data description

The task is to predict the predefined category of a receipt item from its name.

Solution description

  • Baseline (Russian part of multilingual DistilBERT as is; spoiler: it was the cased model): Public = 0.7875
  • + Pretraining on masked language modeling task: Public = 0.8261
  • + Label Smoothing: Public = 0.8323
  • + Custom Model Arch (Weighted sum of hidden states + multisample dropout): Public = 0.8354
  • + Lowercase: Public = 0.8459
  • + Increase number of training epochs to 50: Public = 0.8532
  • + Pseudolabeling (distilbert-distilbert): Public = 0.8626
  • + Pseudolabeling (RuBERT-distilbert): Public = 0.8683
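
Three of the steps above (label smoothing, the weighted sum of hidden states, and multisample dropout) come down to simple arithmetic. The sketch below illustrates that arithmetic in plain Python on lists of floats. It is not the repository's code: all function names are hypothetical, the smoothing variant (eps spread uniformly over all classes) is an assumption, and eps = 0.1, p = 0.5 and five dropout samples are illustrative values only. The notebooks presumably work on framework tensors rather than lists.

```python
import math
import random


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    top = max(scores)
    log_total = math.log(sum(math.exp(s - top) for s in scores))
    return [s - top - log_total for s in scores]


def smoothed_targets(true_class, num_classes, eps=0.1):
    """Target distribution: eps spread uniformly, the rest on the true class."""
    base = eps / num_classes
    return [base + (1.0 - eps if i == true_class else 0.0)
            for i in range(num_classes)]


def label_smoothing_loss(logits, true_class, eps=0.1):
    """Cross-entropy between the smoothed targets and softmax(logits)."""
    log_probs = log_softmax(logits)
    targets = smoothed_targets(true_class, len(logits), eps)
    return -sum(t * lp for t, lp in zip(targets, log_probs))


def weighted_layer_sum(layer_vectors, layer_scores):
    """Mix one vector per hidden layer using softmax-normalised layer scores.

    In a trained model the layer scores would be learnable parameters.
    """
    weights = [math.exp(lp) for lp in log_softmax(layer_scores)]
    dim = len(layer_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
            for d in range(dim)]


def dropout(vec, p, rng):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def multisample_dropout(vec, classifier, p=0.5, num_samples=5, seed=0):
    """Average the classifier's logits over several independent dropout masks."""
    rng = random.Random(seed)
    runs = [classifier(dropout(vec, p, rng)) for _ in range(num_samples)]
    return [sum(column) / num_samples for column in zip(*runs)]
```

With eps = 0 the loss reduces to ordinary cross-entropy, and with equal layer scores the weighted sum reduces to a plain average of the layers, which makes both functions easy to sanity-check.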

How to run

  • Pretrain RuBERT and distilbert on all unique texts using the masked language modeling task: train_mlm_base_tokenizer.ipynb
  • Finetune pretrained RuBERT on the texts with labels (~40k unique texts): rubert_base.ipynb
  • Create pseudolabels (~1M unique texts) for all unique texts using finetuned RuBERT: pseudo_label.ipynb
  • Finetune distilbert on these pseudolabels: pseudo_label.ipynb
  • Create submission .zip with finetuned distilbert: pseudo_label.ipynb
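
The pseudolabeling steps above can be sketched as follows. This is a minimal illustration, not the contents of pseudo_label.ipynb: `teacher_predict` is a hypothetical stand-in for the finetuned RuBERT (any callable that maps a batch of texts to per-class scores), the lowercasing mirrors the "Lowercase" step of the solution, and the use of hard argmax labels and a batch size of 256 are assumptions.

```python
def normalize(text):
    """Lowercase and strip an item name."""
    return text.strip().lower()


def unique_texts(texts):
    """Normalise texts and drop duplicates while preserving first-seen order."""
    seen = set()
    result = []
    for text in texts:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


def make_pseudolabels(texts, teacher_predict, batch_size=256):
    """Label every unique text with the teacher's highest-scoring class.

    teacher_predict(batch) must return one list of class scores per text.
    """
    uniq = unique_texts(texts)
    labels = []
    for start in range(0, len(uniq), batch_size):
        batch = uniq[start:start + batch_size]
        for scores in teacher_predict(batch):
            # Hard label: index of the largest score (assumption).
            labels.append(max(range(len(scores)), key=scores.__getitem__))
    return list(zip(uniq, labels))


def toy_teacher(batch):
    """Toy two-class teacher used only to exercise the pipeline."""
    return [[1.0, 0.0] if "milk" in text else [0.0, 1.0] for text in batch]
```

Calling `make_pseudolabels(all_texts, toy_teacher)` returns a list of (text, class index) pairs; in the workflow described above, such pairs would form the training set for the student distilbert.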
