
Unpaired face to face translation using CycleGAN

The Cycle Generative Adversarial Network, or CycleGAN, is an approach to training a deep convolutional neural network for image-to-image translation tasks.

Unlike other GAN models for image translation, the CycleGAN does not require a dataset of paired images. For example, if we are interested in translating photographs of oranges to apples, we do not require a training dataset of oranges that have been manually converted to apples. This allows the development of a translation model on problems where training datasets may not exist, such as translating paintings to photographs.

In this project we develop a generic model based on the CycleGAN architecture that can translate photographs of faces along multiple attributes such as age, ethnicity and gender.

Data set

Source: UTKFace - Large Scale Face Dataset

  • Consists of 20,000+ face images (a single face per image)
  • All images are aligned and cropped so that they contain only the face, not the neck or hair
  • Images are labelled by age, gender and ethnicity.

Model

A Generative Adversarial Network (GAN) is a type of unsupervised generative model first introduced by Ian Goodfellow in 2014. These networks typically consist of a generator network and a discriminator network that contest with each other in a zero-sum game. More specifically, given a training set, the generator learns to generate new data with the same statistics as the training data, while the discriminator learns to distinguish the fake data (produced by the generator) from the real data. By alternately training the generator and the discriminator, both networks improve, and we end up with a model that can produce high-fidelity data which is nearly indistinguishable from real data.
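As a minimal TensorFlow/Keras sketch of this alternating scheme for a generic GAN (not yet the CycleGAN used in this project): the discriminator is pushed to score real images as 1 and generated images as 0, while the generator is pushed to make the discriminator score its outputs as 1. All model and optimizer objects are assumed to be built elsewhere; this is for illustration only, not the repository's code.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(real_images, noise, generator, discriminator, g_opt, d_opt):
    # Discriminator update: score real images as 1 and generated images as 0
    with tf.GradientTape() as tape:
        fake = generator(noise, training=True)
        real_score = discriminator(real_images, training=True)
        fake_score = discriminator(fake, training=True)
        d_loss = (bce(tf.ones_like(real_score), real_score) +
                  bce(tf.zeros_like(fake_score), fake_score))
    d_opt.apply_gradients(zip(tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))

    # Generator update: try to make the discriminator score its outputs as 1
    with tf.GradientTape() as tape:
        fake_score = discriminator(generator(noise, training=True), training=True)
        g_loss = bce(tf.ones_like(fake_score), fake_score)
    g_opt.apply_gradients(zip(tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss
```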

When trained on the UTKFace dataset, such GAN models can produce high-quality faces, but it is very difficult to control the outputs and produce images with the desired attributes. This problem is common to all GAN models, and there have been many approaches to controlling the outputs by exploring the latent space of the training data distribution (InfoGAN, VAE-GAN, etc.). One way to control the outputs generated by GANs is to use a cycle-consistency loss. These models (called CycleGANs) consist of two GANs connected through the cycle-consistency loss, which ensures that the content of the image remains the same while the style gets translated across domains.
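As a minimal illustration (assumed names, not the repository's code), the cycle-consistency constraint for the A -> B -> A direction can be written as an L1 (absolute) error between the input and its reconstruction:

```python
import tensorflow as tf

def cycle_consistency_loss(real_a, gen_a2b, gen_b2a):
    fake_b = gen_a2b(real_a, training=True)    # translate A -> B
    recon_a = gen_b2a(fake_b, training=True)   # translate back B -> A
    # The reconstruction should match the original, so only the style may change
    return tf.reduce_mean(tf.abs(real_a - recon_a))
```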

For example, if we want to translate images from domain A to domain B, the input data consists of two datasets, one containing images from domain A and the other containing images from domain B. Note that the images need not be paired: if domain A is a set of face images of people aged 20 to 30 and domain B is a set of face images of people aged 50 to 60, all the images can be of different people. The CycleGAN consists of two generators and two discriminators. Let's look at the details of each module below:

Generator A2B:

  • Input:
    • I_A - Real image from domain A
    • I_AG - Generated image produced by Generator B2A (for input I_B)
  • Output:
    • I_BG - Generated image with the content of I_A translated to domain B
    • I_Bcycle - Generated image with the content of I_AG (i.e. the content of I_B) translated back to domain B
  • Model architecture: ResNet-style model
  • Losses:
    • Discriminator loss (mean absolute error between I_DAG and I_ones)
    • Cyclic loss (absolute error between I_B and I_Bcycle)

Generator B2A:

  • Input:
    • I_B - Real image from domain B
    • I_BG - Generated image produced by Generator A2B (for input I_A)
  • Output:
    • I_AG - Generated image with the content of I_B translated to domain A
    • I_Acycle - Generated image with the content of I_BG (i.e. the content of I_A) translated back to domain A
  • Model architecture: ResNet-style model
  • Losses:
    • Discriminator loss (mean absolute error between I_DBG and I_ones)
    • Cyclic loss (absolute error between I_A and I_Acycle)

Discriminator A:

  • Input:
    • I_AG - Generated image from Generator B2A
    • I_A - Real image from domain A
  • Output:
    • I_DAG - Discriminator output for I_AG, indicating whether the input is real or fake
    • I_DA - Discriminator output for I_A, indicating whether the input is real or fake
  • Model architecture: CNN classifier model
  • Losses:
    • Discriminator loss (mean absolute error between I_DAG and I_zeros)
    • Discriminator loss (mean absolute error between I_DA and I_ones)

Discriminator B:

  • Input:
    • I_BG - Generated image from Generator A2B
    • I_B - Real image from domain B
  • Output:
    • I_DBG - Discriminator output for I_BG, indicating whether the input is real or fake
    • I_DB - Discriminator output for I_B, indicating whether the input is real or fake
  • Model architecture: CNN classifier model
  • Losses:
    • Discriminator loss (mean absolute error between I_DBG and I_zeros)
    • Discriminator loss (mean absolute error between I_DB and I_ones)

Losses

As mentioned above, there are two main types of losses. Each loss can be thought of as a constraint we impose on the model to produce the desired images; a sketch of how they fit together follows the list below.

  • Discriminative losses
    • Force the model to produce realistic-looking images from the target domain.
  • Cyclic losses
    • Force the model to preserve the content of the input image in the translated output.
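Below is a minimal TensorFlow/Keras sketch of how the four modules and the two loss types fit together, using the I_A, I_BG, ... notation above. The model builders, the cycle-loss weight lambda_cyc, and the use of mean absolute error for the adversarial terms are assumptions made for illustration; the repository's tf-slim implementation may differ.

```python
import tensorflow as tf

mae = tf.keras.losses.MeanAbsoluteError()

def cyclegan_losses(i_a, i_b, gen_a2b, gen_b2a, disc_a, disc_b, lambda_cyc=10.0):
    # Forward passes through both generators
    i_bg = gen_a2b(i_a, training=True)        # A -> B
    i_ag = gen_b2a(i_b, training=True)        # B -> A
    i_acycle = gen_b2a(i_bg, training=True)   # A -> B -> A
    i_bcycle = gen_a2b(i_ag, training=True)   # B -> A -> B

    # Discriminator outputs (patch maps, see "Discriminator model details")
    i_dag, i_da = disc_a(i_ag, training=True), disc_a(i_a, training=True)
    i_dbg, i_db = disc_b(i_bg, training=True), disc_b(i_b, training=True)

    # Combined generator objective: fool both discriminators and keep
    # the cycle reconstructions close to the originals
    adversarial = mae(tf.ones_like(i_dag), i_dag) + mae(tf.ones_like(i_dbg), i_dbg)
    cyclic = mae(i_a, i_acycle) + mae(i_b, i_bcycle)
    gen_loss = adversarial + lambda_cyc * cyclic

    # Discriminator objectives: real images -> ones, generated images -> zeros
    disc_a_loss = mae(tf.ones_like(i_da), i_da) + mae(tf.zeros_like(i_dag), i_dag)
    disc_b_loss = mae(tf.ones_like(i_db), i_db) + mae(tf.zeros_like(i_dbg), i_dbg)
    return gen_loss, disc_a_loss, disc_b_loss
```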

Generator model details:

  • The generator architecture is mostly taken from the fast neural style transfer models
  • The generator can be divided into 3 parts
    • Encoder stage
    • Transformations stage
    • Decoder stage
  • The encoder stage consists of a simple CNN (with increasing channels and decreasing dimensions) and is responsible for encoding the image information into a smaller latent space.
  • The transformation stage consists of residual layers that keep the channel count and spatial dimensions fixed. This stage is responsible for translating the image into the other domain.
  • The decoder stage consists of deconvolution layers (transposed convolutions) that increase the spatial dimensions and decrease the channel count. This stage is responsible for decoding the translated latent representation and producing the final output image, as sketched below.
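A rough Keras equivalent of this encoder / transformation / decoder layout is sketched below. The layer counts, channel sizes and activations are assumptions for illustration, not the repository's exact tf-slim configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, channels):
    # Same channel count and spatial size in and out, as in the transformation stage
    y = layers.Conv2D(channels, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(channels, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.add([x, y])

def build_generator(n_res_blocks=6):
    # Fully convolutional, so any input size works (needed for progressive training)
    inp = layers.Input(shape=(None, None, 3))

    # Encoder: increasing channels, decreasing spatial dimensions
    x = layers.Conv2D(32, 7, strides=1, padding="same", activation="relu")(inp)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)

    # Transformation: residual blocks that keep channels and dimensions fixed
    for _ in range(n_res_blocks):
        x = residual_block(x, 128)

    # Decoder: transposed convolutions restore the original resolution
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2D(3, 7, padding="same", activation="tanh")(x)
    return tf.keras.Model(inp, out)
```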

Discriminator model details:

  • A simple classifier-style CNN is used so that the discriminator stays balanced with the generator; a very strong discriminator would not allow the generator to learn
  • Convolutions with stride 2 are used instead of max-pooling to prevent loss of information
  • Every convolution in the model is followed by a batch normalisation layer and a leaky ReLU layer
  • A 1 x 1 convolution with stride 1 is used to reduce the number of channels at the end
  • The output of the discriminator is of size (input_size//8) x (input_size//8) x 1 instead of a single binary output like a typical binary classifier. This gives both the discriminator and generator models more gradient signal to learn from, as illustrated in the sketch below.
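A rough Keras equivalent of this discriminator is sketched below: stride-2 convolutions instead of max-pooling, batch normalisation and leaky ReLU after each convolution, a final 1 x 1 convolution, and a (input_size//8) x (input_size//8) x 1 patch output. The channel counts are assumptions, not the repository's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator():
    inp = layers.Input(shape=(None, None, 3))
    x = inp
    for channels in (64, 128, 256):           # three stride-2 stages -> /8 in each dimension
        x = layers.Conv2D(channels, 3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
    # 1 x 1 convolution reduces the channels to a single "real vs fake" map
    out = layers.Conv2D(1, 1, strides=1, padding="same")(x)
    return tf.keras.Model(inp, out)
```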

Some points to note:

  • In both the generator and the discriminator we use convolutions with stride 2 instead of max-pooling to prevent any loss of information
  • The generator and discriminator are well balanced (by using a slightly weaker discriminator). This allows us to train the generator and discriminator alternately (one epoch each), which keeps the training process simple.
  • The whole model is built so that the same kernels can be used for any input image size, which makes progressive training possible (see the snippet below)
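Because both networks are fully convolutional, the same weights accept any input resolution that is divisible by the downsampling factor. A quick check using the hypothetical build_generator helper from the sketch above:

```python
import tensorflow as tf

gen = build_generator()  # assumed helper from the generator sketch above
for size in (48, 64, 96, 128, 256):
    batch = tf.random.uniform((1, size, size, 3))
    print(size, gen(batch).shape)  # the output keeps the input resolution
```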

Training:

Using the same base CycleGAN architecture described above, three different face-translation models were trained. They are:

  • Age translation from 20s to 50s and vice-versa
  • Ethnicity translation from white to black and vice-versa
  • Gender translation from male to female and vice-versa

Progressive training:

  • All the above models were trained using the progressive training approach.
  • This reduces the training time significantly
  • The model is constructed in such a way that all the kernels can be used on any input size
  • The input sizes for the images and the corresponding number of epochs are (a training-loop sketch follows this list):
    • 48x48 for 15 epochs
    • 64x64 for 25 epochs
    • 96x96 for 30 epochs
    • 128x128 for 20 epochs
    • 256x256 for 20 epochs
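A minimal sketch of driving this schedule is shown below. The images_a / images_b datasets and the train_one_epoch routine are hypothetical placeholders, since the actual training loops live in the repository's notebooks.

```python
import tensorflow as tf

# (input size, number of epochs) pairs from the schedule above
SCHEDULE = [(48, 15), (64, 25), (96, 30), (128, 20), (256, 20)]

def run_progressive_training(images_a, images_b, train_one_epoch):
    """images_a / images_b: tf.data.Dataset of full-resolution face images."""
    for size, epochs in SCHEDULE:
        # Resize both domains to the current stage's resolution
        ds_a = images_a.map(lambda img, s=size: tf.image.resize(img, (s, s)))
        ds_b = images_b.map(lambda img, s=size: tf.image.resize(img, (s, s)))
        for _ in range(epochs):
            train_one_epoch(ds_a, ds_b)
```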

Environment:

  • Used the TensorFlow Slim (tf-slim) library
  • Trained the models on Google Cloud Platform
  • GPU - Tesla P4

Results:

Ethnicity translation

White to Black

Black to White

Observations:

  • The model is able to change the skin colour while preserving the hair colour, clothing colour, etc. from the inputs
  • Since the model is trained on faces of all ages, its performance is almost uniform across ages
  • The resolution of the outputs depends on the resolution of the input images

Age translation

20s to 50s

50s to 20s

Observations:

  • The model is able to add or remove wrinkles, folds and bags under the eyes with fairly good accuracy
  • The model does not change the colour of the hair or beard, although in some cases we might want it to (e.g. for the 50s-to-20s translation, we would ideally want a white beard to turn black)
  • The model does a better job converting 50s to 20s than 20s to 50s

Gender translation

Male to Female

Female to male

Observations:

  • The model is able to add or remove lipstick appropriately based on the target gender
  • The model can change the thickness of the eyebrows and tends to change the skin tone a bit
  • The model is able to add a thin beard for female -> male, but generally struggles to completely remove thick beards for male -> female

Code details:

  • The training notebooks for ethnicity, age and gender translation are present in the notebooks folder as ethnicity_translation.ipynb, age_translation.ipynb and gender_translation.ipynb respectively.
  • The final checkpoints for each model are present in the checkpoints directory
  • The inference code for each of these models and the result plots are present in final_result.ipynb
