Machine Learning at Berkeley partnered with Unity Technologies to apply ML methods to de-lighting surface textures.
Realistic in-game objects are often captured from the real world through a process called photogrammetry, which involves photographing an object (e.g. a rock) from all angles and then reconstructing it with 3D reconstruction techniques. De-lighting is necessary to remove the effects of non-uniform real-world lighting and shadow on the object, so that it can be re-lit by the lighting within the game environment. Currently, de-lighting is done manually by artists.
We aimed to build models that operate on surface texture maps (i.e. the unwrapped surface of a 3D object) instead of operating on meshes directly. Our model takes in a lit texture map and seeks to generate a de-lit texture map. Unity Technologies already has several de-lit texture maps (produced by artists), and these serve as our desired outputs. To generate the lit inputs to our models, Unity Technologies placed the de-lit meshes in various lighting conditions and rendered the corresponding lit texture maps. Our dataset consists entirely of rock textures.
Below: left is the texture de-lit by an artist (our ground truth); right is the lit texture.
Our core model consists of a 4-layer encoder followed by a 4-layer decoder, with residual connections between corresponding layers. The model is fully convolutional, so during training it takes in 32x32 randomly cropped, rotated, and flipped patches (this speeds up training dramatically and produces better output), while at test time it takes in the entire texture map.
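A minimal PyTorch sketch of this kind of architecture (the channel widths, kernel sizes, and activation choices below are illustrative assumptions, not our exact configuration):

```python
import torch
import torch.nn as nn

class DelightNet(nn.Module):
    """Fully convolutional encoder-decoder with residual (skip) connections
    between corresponding encoder and decoder layers. Widths are illustrative."""
    def __init__(self, channels=(3, 32, 64, 128, 256)):
        super().__init__()
        self.encoders = nn.ModuleList()
        self.decoders = nn.ModuleList()
        # 4 encoder layers: stride-2 convs halve the spatial size each step
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            self.encoders.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU()))
        # 4 decoder layers: transposed convs double the spatial size back
        rev = channels[::-1]
        for c_in, c_out in zip(rev[:-1], rev[1:]):
            self.decoders.append(nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1), nn.ReLU()))

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            skips.append(x)          # save pre-downsampling activations
            x = enc(x)
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = dec(x) + skip        # residual connection from matching encoder layer
        return x

net = DelightNet()
patch = torch.randn(8, 3, 32, 32)    # a batch of random 32x32 training crops
full = torch.randn(1, 3, 256, 256)   # a full texture map at test time
out_patch = net(patch)               # same spatial size as the input
out_full = net(full)
```

Because the network contains only convolutions, the same weights apply unchanged to 32x32 training crops and to full-resolution texture maps at test time.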
We experimented with several loss functions on top of our core model.
This is the pixelwise L2 loss between the predicted texture map and the desired texture map.
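As a concrete sketch in PyTorch (the tensors here are random stand-ins for real texture maps):

```python
import torch
import torch.nn.functional as F

pred = torch.rand(1, 3, 32, 32)    # model output (random stand-in)
target = torch.rand(1, 3, 32, 32)  # artist-de-lit ground truth (random stand-in)

# Pixelwise L2 loss: mean squared difference over every pixel and channel
l2_loss = F.mse_loss(pred, target)
```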
Let I(0, 0) be an image, let I(1, 0) be the image shifted one pixel to the right, and let I(0, 1) be the image shifted one pixel up. We call I(0, 0) - I(1, 0) the horizontal gradient and I(0, 0) - I(0, 1) the vertical gradient. We compute the horizontal and vertical gradients of both the model output and the desired output, then take the L2 loss between them as our gradient difference loss.
This particular loss penalizes differences in the relative change from pixel to pixel, which allows the output to keep more of the fine details from the input.
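A sketch of this loss in PyTorch, following the gradient definitions above (the implementation details are our assumptions):

```python
import torch
import torch.nn.functional as F

def gradient_difference_loss(pred, target):
    """L2 loss between the horizontal/vertical gradients of the prediction
    and the target, using one-pixel finite differences."""
    def grads(img):
        gh = img[..., :, :-1] - img[..., :, 1:]   # horizontal: I minus I shifted right
        gv = img[..., :-1, :] - img[..., 1:, :]   # vertical: I minus I shifted up
        return gh, gv
    ph, pv = grads(pred)
    th, tv = grads(target)
    return F.mse_loss(ph, th) + F.mse_loss(pv, tv)

pred = torch.rand(1, 3, 32, 32)
target = torch.rand(1, 3, 32, 32)
loss = gradient_difference_loss(pred, target)
```

Note that adding a constant brightness offset to the prediction leaves this loss unchanged, which is why it constrains local detail rather than absolute intensity.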
We train a convolutional discriminator to predict whether a texture map is a de-lit ground truth (as opposed to one produced by our generator).
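A minimal sketch of such a discriminator in PyTorch (the layer counts, widths, and activations are illustrative assumptions, not our exact architecture):

```python
import torch
import torch.nn as nn

# Convolutional discriminator: stride-2 convs downsample the texture map,
# then global average pooling reduces it to a single real/fake logit.
discriminator = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, 1),  # logit: higher means "ground-truth de-lit texture"
)

texture = torch.rand(4, 3, 32, 32)  # random stand-in batch
logits = discriminator(texture)     # one logit per texture map
```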
Since our inputs have regions where alpha = 0 (see the examples above), we tried applying a mask that takes only the alpha > 0 regions into account when computing the losses above.
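A masked L2 loss along these lines might look like the following (a sketch; the exact masking scheme we used may differ):

```python
import torch

def masked_l2(pred, target, alpha):
    """L2 loss computed only over pixels where alpha > 0.
    alpha has shape (N, 1, H, W); pred/target have shape (N, C, H, W)."""
    mask = (alpha > 0).float()
    diff = (pred - target) ** 2 * mask  # mask broadcasts across channels
    # Normalize by the number of valid pixels (times channels), not the full image
    return diff.sum() / (mask.sum() * pred.shape[1]).clamp(min=1)

pred = torch.rand(1, 3, 8, 8)
target = torch.rand(1, 3, 8, 8)
alpha = torch.ones(1, 1, 8, 8)
loss = masked_l2(pred, target, alpha)
```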
We show results on the test set (the ground truth has never been seen by the model before). All result series show, in order: lit texture (model input), model output, ground truth. Each title indicates the loss function used.
This model combines the three losses in the title, and we directly add a scaled copy of the input (the scale factor is trainable) to the output. We hoped this would transfer the fine details directly to the output, but it transferred too much of the lighting as well. In addition, we use a mask to ignore loss contributions from regions where alpha = 0.
The lighting and shadow effects are removed, and the resolution is better than in all the other models we tried. Some mid-level details (e.g. dark regions ~3-5 pixels in radius) are lost.
The output is much blurrier than our best, training took significantly longer, not all lighting effects are removed, and the color of the red rim is off.
The output resolution is better and the color is closer (especially around the rim), but still not all lighting effects are removed, and the red rim's color remains slightly off.
The gradient difference loss greatly improves the amount of fine details kept.
The adversarial loss didn't seem to help much beyond L2 + gradient difference alone, but we didn't get to do more tuning.
The adversarial component of the generator loss never plateaued, so we probably should have spent more time tuning hyperparameters.
This has the same loss as above, but we directly add a scaled copy of the input (the scale factor is trainable) to the output. We hoped this would transfer the fine details directly to the output, but it transferred too much of the lighting as well.
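The scaled-input skip described here can be sketched as a thin wrapper around the de-lighting model (the initial scale value is an illustrative assumption):

```python
import torch
import torch.nn as nn

class ScaledInputSkip(nn.Module):
    """Wraps a de-lighting model and adds alpha * input to its output,
    where alpha is a single trainable scalar (illustrative sketch)."""
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.alpha = nn.Parameter(torch.tensor(0.1))  # assumed initial value

    def forward(self, x):
        return self.model(x) + self.alpha * x

base = nn.Identity()  # stand-in for the encoder-decoder de-lighting model
wrapped = ScaledInputSkip(base)
x = torch.rand(2, 3, 4, 4)
out = wrapped(x)  # model output plus alpha * input
```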