
GazeEstimation

This repository contains my efforts to solve the gaze estimation task in order to join the Computer Vision Lab at CMC MSU.

Here you can find:

  • Working demo
  • Some thoughts on the approach
  • Possible further improvements of the used algorithm

Demo

First, run pip install -r requirements.txt to install all modules necessary to run the demo (you may edit the file if you don't want to update torch; I believe the demo will work on earlier versions too).

Also, don't forget to download the model weights and place them into the weights folder.

Hotkeys:

  • i to disable/enable gaze drawing on input image.
  • n to disable/enable gaze drawing on normalized image.
  • v to disable/enable verbose mode.
  • f to disable/enable FPS.

To run the demo: python gaze_estimation.py
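The hotkey behaviour above can be sketched as a small toggle table (a hypothetical sketch with names of my own choosing; the actual demo in gaze_estimation.py may be structured differently):

```python
# Display flags toggled by the demo hotkeys listed above.
flags = {"input_gaze": True, "normalized_gaze": True, "verbose": False, "fps": False}
keymap = {"i": "input_gaze", "n": "normalized_gaze", "v": "verbose", "f": "fps"}

def handle_key(key: str) -> None:
    """Flip the corresponding display flag when a hotkey is pressed."""
    if key in keymap:
        flags[keymap[key]] = not flags[keymap[key]]
```

In the demo loop the pressed key would come from cv2.waitKey and each flag would gate the corresponding drawing step.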

Pipeline

Repository structure

SynthesEyes.ipynb contains a step-by-step implementation of the training environment for the SynthesEyes dataset.

Hourglass.ipynb contains my implementation of the Hourglass neural network, its training, and its evaluation for pupil heatmap extraction.

Gaze-Estimation-using-XGaze-dataset.ipynb contains the training environment used to train and test the ResGaze model on the XGaze dataset.

Spatial-Net.ipynb contains my (partial) implementation of the DenseNet neural network and also SpaNet, which I tried to fit on the XGaze dataset, with no luck: after 50-60 hours of training it achieved only 10 degrees angular error, and on top of that it was slower than ResGaze.

  • modules - all implemented models
  • src - all important helper functions
  • face_detection - a BlazeNet implementation (not mine, but I forgot to save the link to the source). It predicts a face bounding box in under 1 ms on a GPU. I didn't use it in this project, though.

Algorithms for gaze estimation

Regression from eye images

| Model | Test error | Train size / epochs | Model size |
|---|---|---|---|
| GazeNet (7 conv, 1 dense, w/o BN) | 0.91 | 10240 / 70 | 8.7 MB |
| GazeNet_v2 (7 conv, 2 dense, w/ BN) | 0.79 | 10240 / 70 | 15.6 MB |

GazeNet (7 conv, 1 dense, w/o BN)

Test error is quite large because it is the L1 loss over Euler angles in 3D space (we predict two angles in spherical coordinates that define a unit gaze-direction vector on the 3D sphere).
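The angle-to-vector conversion described above can be sketched as follows (the sign convention here is an assumption borrowed from common MPIIGaze/XGaze code; this repository may use a different one):

```python
import numpy as np

def angles_to_vector(pitch: float, yaw: float) -> np.ndarray:
    """Convert the two predicted spherical angles (radians) into a
    unit gaze-direction vector on the 3D sphere."""
    return np.array([
        -np.cos(pitch) * np.sin(yaw),  # x: horizontal component
        -np.sin(pitch),                # y: vertical component
        -np.cos(pitch) * np.cos(yaw),  # z: toward the camera
    ])
```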

GazeNet_v2 (8 conv, 2 dense, w/ BN)

Learning curves show that the model is underfitting; my guess is that it is hard to learn a direct mapping from the image feature space (HxWx3) to gaze (just two values). Maybe we should try to learn intermediate features first.

UPD: the legend is wrong; it should read "Train loss and test loss".

Pupil landmark estimation (i.e., regression from intermediate features)

| Model | Test error | Train size / epochs | Model size | Evaluation time |
|---|---|---|---|---|
| PupilNet-3Hourglass w/ BN | ~3000 | 10240 / 153 | 2 MB | 52 ms on a fairly old Intel Core i5-2310 CPU |

PupilNet-3Hourglass-sigma_10 w/ BN

Test error is around 3000, which is actually 3000 / 32 ≈ 93.75 per prediction, because I accidentally measured it over a batch rather than a single image. This means the model's error is below roughly 0.01 per pixel (one prediction contains 8 heatmaps, each of 80x120 pixels), which is enough to predict usable heatmaps.

The loss used here is the same as in Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings: the sum of squared per-pixel differences between predicted and ground-truth heatmaps,

L = Σ_{i=1}^{8} Σ_p (h_i(p) − ĥ_i(p))²,

where the suspicious 8 is the number of pupil landmarks and "p" ranges over every pixel in a heatmap. You can check how the heatmaps are generated in Hourglass.ipynb.
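This loss can be sketched in numpy (shapes taken from the 8 heatmaps of 80x120 pixels mentioned above; the actual training code uses torch tensors):

```python
import numpy as np

def heatmap_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Sum of squared per-pixel differences over all 8 landmark
    heatmaps, shape (8, 80, 120) each."""
    assert pred.shape == target.shape == (8, 80, 120)
    return float(np.sum((pred - target) ** 2))
```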

On how the ground-truth heatmaps were generated: suppose we have a single landmark located at point (x, y). All we do is take an empty image (filled with zeroes) and place a 2D Gaussian centered at (x, y) with peak value 1 and some variance sigma². Examples of generated heatmaps are also shown in Hourglass.ipynb.
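A minimal sketch of this generation step (the heatmap size and the default sigma are assumptions based on the 80x120 heatmaps and the "sigma_10" model name above):

```python
import numpy as np

def make_heatmap(x: int, y: int, h: int = 80, w: int = 120,
                 sigma: float = 10.0) -> np.ndarray:
    """Ground-truth heatmap: a 2D Gaussian with peak value 1
    centered at landmark position (x, y)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
```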

The heatmaps encode the per-pixel confidence on a specific landmark’s location.

Actual heatmaps of pupil landmarks:

Regression directly from face image

Spa-Net

| Model | Test error | Train size / epochs | Model size | Evaluation time |
|---|---|---|---|---|
| Spa-Net (3Hourglass w/ BN + small DenseNet as the regressor) | 10 degrees angular error on XGaze | 750k / 2.7 | 2 MB | 20 ms on an RTX 3060 Ti |

Although the Hourglass network was good at pupil-landmark heatmap estimation, it lacks feature-extraction capacity. And even though the model itself is quite small, it is hard to train because it uses a lot of memory during training (7.3 GB VRAM with batch_size == 8).

I think this model could achieve far better results with longer training, but the next model is much easier to train and has also shown SOTA performance on classification tasks (so it was sure to work well for feature extraction).

ResGaze

| Model | Test error | Train size / epochs | Model size | Evaluation time |
|---|---|---|---|---|
| ResGaze (resnet50 backbone + a single Linear layer) | 2 degrees angular error (derived from cosine similarity) on XGaze | 750k / 10 | 100 MB | 10 ms per sample on an RTX 3060 Ti |
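The "angular error derived from cosine similarity" in the table can be sketched as follows (a hypothetical helper, not the repository's actual evaluation code):

```python
import numpy as np

def angular_error_deg(pred: np.ndarray, gt: np.ndarray) -> float:
    """Angle in degrees between predicted and ground-truth gaze
    vectors, computed from their cosine similarity."""
    cos = np.dot(pred, gt) / (np.linalg.norm(pred) * np.linalg.norm(gt))
    # Clip to guard against rounding just outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```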

This model is inspired by the RT-GENE paper, where a VGG-16 network was used for feature extraction; I decided to use ResNet50 for the job.

Another very important point is that the XGaze dataset was used to train a robust gaze predictor.

As was said, the model was able to achieve an angular error of 2 degrees per sample, which is impressive, because this dataset has a very rich distribution of head and gaze rotations.

The variance in appearance allowed us to skip explicit head-pose estimation, because the neural net learns it by itself.

This is how the model performed on XGaze dataset.

Train predictions (green is the prediction, blue is the ground-truth gaze vector) and test predictions:

Further improvements

  • Implement face and facial landmark detection that can be executed on the GPU (faster inference)
  • Pruning
  • Try model compression (for faster inference, less VRAM consumed during evaluation)

