
Vision Transformer (ViT) based Food Classification

After the breakthrough of the transformer architecture in natural language processing tasks (Attention Is All You Need), transformers have recently been applied to image classification as well. In 2020, a Google team consisting of Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby published the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Just as a sentence is a sequence of words, an image is split into a sequence of patches that are fed through the encoder block to extract features, and a multi-layer perceptron (MLP) head finally classifies the image. In this work, I used the Food-101 dataset, which contains 101 classes, and implemented the Vision Transformer (ViT) model for classification. Starting from the pretrained model provided by Google and fine-tuning it properly, the model achieves strong classification accuracy.

Project Overview

Food classification plays a vital role in various domains, such as nutrition analysis, dietary monitoring, and meal planning. Traditional approaches to food classification rely on handcrafted features or convolutional neural networks (CNNs). However, recent advancements in computer vision have introduced a new paradigm called Vision Transformers (ViTs), which have shown remarkable performance in image recognition tasks. This project aims to leverage the power of Vision Transformers for food classification tasks.
For a sequence of words, the transformer has an encoder-decoder structure, but for an image we only need the encoder block to extract features. Just like a sequence of words, an image can be divided into a number of patches and fed through the encoder, which computes self-attention among the patches. Finally, the MLP head classifies the image. Vision_Transformer
The Vision Transformer applies a standard Transformer encoder to fixed-size patches to achieve state-of-the-art (SOTA) results on image recognition. To perform classification, the authors follow the conventional strategy of prepending an additional learnable "classification token" to the sequence. In this project I follow several steps to obtain a good classification result:
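The classification-token idea can be sketched in a few lines: a learnable vector is prepended to the patch embeddings, so a 14×14 grid of patches becomes a sequence of 197 tokens. The NumPy stand-ins below are illustrative only (the real parameters are learned by the model):

```python
import numpy as np

# For ViT-B/16 at 224x224 input: 196 patches, embedding dimension 768.
N, dim = 196, 768
patch_embeddings = np.zeros((N, dim))     # stand-in for the projected patches
cls_token = np.ones((1, dim))             # stand-in for the learnable [class] token

# The encoder input is the class token followed by all patch tokens.
sequence = np.concatenate([cls_token, patch_embeddings])
print(sequence.shape)  # (197, 768)
```

After the encoder, only the output at the class-token position is passed to the MLP head for classification.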

  • Data Collection and Observation
  • Data Preprocessing
  • Create the Model and Modify
  • Observe the Test result

Data Collection and Observation

The dataset used in this project is Food-101. It contains 101 classes with a total of 101,000 images (1,000 per class). Some sample images: sample_food_image
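To sanity-check the download, one can count the images per class; each of the 101 class folders under food-101/images should contain 1,000 images. The helper below is a minimal sketch, not part of the repository:

```python
from pathlib import Path

def count_images_per_class(images_dir):
    """Count .jpg files under each class folder (food-101/images/<class>/)."""
    counts = {}
    for class_dir in sorted(Path(images_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(1 for _ in class_dir.glob("*.jpg"))
    return counts
```

With a correct download, every value in the returned dictionary should be 1000.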

Data Preprocessing

The first task is to process and split the dataset for the experiments. The original dataset layout is:

     food-101
     .  images
     .      apple_pie
     .          all_images
     .      baby_back_ribs
     .          all_images
     .      .
     .      .
     .      waffles
     .          all_images 


Run the Python script to rearrange the dataset into train/test splits (python rearrange.py), which produces:

     food-101
     .  train
     .      apple_pie
     .          all_images
     .      baby_back_ribs
     .          all_images
     .      .
     .      .
     .      waffles
     .          all_images
     .  test
     .      apple_pie
     .          all_images
     .      baby_back_ribs
     .          all_images
     .      .
     .      .
     .      waffles
     .          all_images
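The repository's rearrange.py is not reproduced here, but Food-101 ships official split lists in meta/train.txt and meta/test.txt (one class/image_id per line), so a rearrangement script might look like the following sketch (illustrative, assuming those split files are present):

```python
import shutil
from pathlib import Path

def rearrange(root):
    """Split food-101/images into train/ and test/ folders using the
    official split files shipped with the dataset (meta/train.txt, meta/test.txt)."""
    root = Path(root)
    for split in ("train", "test"):
        listing = (root / "meta" / f"{split}.txt").read_text().splitlines()
        for entry in listing:                     # e.g. "apple_pie/1005649"
            if not entry:
                continue
            src = root / "images" / f"{entry}.jpg"
            dst = root / split / f"{entry}.jpg"
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
```

This yields the train/test layout shown above, with 750 training and 250 test images per class under the official split.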
          

Create the Model and Modify

The Vision Transformer extracts image features following the four equations from the paper. four-equations-vit-paper
The initial image has shape (H×W×C); after converting it into patches, the input becomes N×(P·P·C), where (P, P) is the resolution of each patch and N = HW/P² is the number of patches.
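This reshaping can be checked in a few lines of NumPy; the `patchify` helper below is illustrative, not code from the repository:

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, C) image into N = (H/P)*(W/P) flattened patches,
    each of length P*P*C, matching the ViT input reshaping."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by patch size"
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = img.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)
    return patches.reshape(-1, P * P * C)

img = np.zeros((224, 224, 3))
print(patchify(img, 16).shape)  # (196, 768)
```

For a 224×224 RGB image with 16×16 patches, this gives N = (224/16)² = 196 patches of length 16·16·3 = 768.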

Before

pizza_global

After converting into patches

pizza_patchified

Configs

According to the original paper of Vision Transformer:

  • image_size: 224.
    Image size. If you have rectangular images, make sure your image size is the maximum of the width and height
  • patch_size: 16
    Size of each square patch. image_size must be divisible by patch_size.
  • dim: 768.
    Last dimension of output tensor after linear transformation nn.Linear(..., dim).
  • depth: 12.
    Number of Transformer blocks.
  • heads: 12.
    Number of heads in Multi-head Attention layer.
  • mlp_dim: 3072.
    Dimension of the MLP (FeedForward) layer.
  • channels: int, default 3.
    Number of image's channels.
  • dropout: float between [0, 1], default 0.
    Dropout rate.
  • emb_dropout: float between [0, 1], default 0.
    Embedding dropout rate.
  • learning_rate: 3e-2
    Learning rate used to update the parameters.
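These hyperparameters can be collected in a plain dictionary and sanity-checked: the derived quantities below (number of patches, flattened patch dimension, per-head dimension) follow directly from the values above. Names are illustrative, not the repository's actual config keys:

```python
# ViT-B/16 configuration from the paper, as a plain dict (hypothetical key names).
config = dict(
    image_size=224, patch_size=16, dim=768, depth=12,
    heads=12, mlp_dim=3072, channels=3,
    dropout=0.0, emb_dropout=0.0, learning_rate=3e-2,
)

num_patches = (config["image_size"] // config["patch_size"]) ** 2  # 196 patches
patch_dim = config["channels"] * config["patch_size"] ** 2         # 768 values per patch
head_dim = config["dim"] // config["heads"]                        # 64 dims per head
```

Note that the flattened patch dimension (768) happens to equal the embedding dimension for ViT-B/16, which is why the linear patch projection maps 768 → 768.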

Pre-trained model download (Google's Official Checkpoint)

Several models are available; in this work I use ViT-B_16 (85.8M parameters). Available models. The path to the pre-trained checkpoint must be set in train.py.

Run the Python script to train the model: python train.py
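The training loop in train.py is not reproduced here, but a fine-tuning step typically looks like the sketch below. The `model` is a tiny linear stand-in for the pretrained ViT so that the snippet runs anywhere; in the real project the ViT's MLP head is replaced with a 101-way classifier and the whole model is optimized with SGD at the learning rate above. All names are illustrative:

```python
import torch
from torch import nn

# Stand-in for the pretrained ViT: in practice, load the checkpoint and
# replace the MLP head with a 101-class output layer.
model = nn.Linear(768, 101)
optimizer = torch.optim.SGD(model.parameters(), lr=3e-2)
criterion = nn.CrossEntropyLoss()

def train_step(features, labels):
    """One SGD update: forward pass, cross-entropy loss, backward pass, step."""
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(8, 768), torch.randint(0, 101, (8,)))
```

In the actual project, `features` would instead be batches of 224×224 images flowing through the full ViT encoder.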

Observe the Test result

The model achieves a test accuracy of 90.07%.
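Top-1 accuracy here means the fraction of test images whose predicted class matches the ground-truth label; a minimal sketch (the `accuracy` helper is hypothetical, not from the repository):

```python
def accuracy(preds, labels):
    """Top-1 accuracy: fraction of predictions matching the ground-truth label."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

print(accuracy([3, 1, 4, 1], [3, 1, 4, 2]))  # 0.75
```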

References

@article{vaswani2017attention,
  title={Attention is all you need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
  journal={Advances in neural information processing systems},
  volume={30},
  year={2017}
}
@article{dosovitskiy2020,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={arXiv preprint arXiv:2010.11929},
  year={2020}
}
@inproceedings{bossard14,
  title = {Food-101 -- Mining Discriminative Components with Random Forests},
  author = {Bossard, Lukas and Guillaumin, Matthieu and Van Gool, Luc},
  booktitle = {European Conference on Computer Vision},
  year = {2014}
}
