Teach an agent to replicate an image by decomposing it into a set of strokes painted on a canvas in sequence. At each step the agent learns to paint the next stroke conditioned on the current canvas and the target image. We follow the approach of Huang et al. alongside some modifications of our own: we design a model-based TD3 approach as well as a model-based PPO approach and compare them to the paper's model-based DDPG for the task of learning to paint.
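The painting loop above can be sketched in a few lines. This is a minimal illustration, not the repo's actual environment: `render_stroke` is a hypothetical renderer that applies one stroke to the canvas, and the reward is assumed to be the reduction in pixel distance to the target image (the signal used in this style of setup).

```python
import numpy as np

def l2_distance(canvas, target):
    # Mean squared pixel distance between the canvas and the target image.
    return float(np.mean((canvas - target) ** 2))

def paint_step(canvas, target, render_stroke):
    """One environment step: paint a stroke, reward = distance reduction.

    `render_stroke` is a hypothetical stand-in for the differentiable
    stroke renderer; it maps the current canvas to a new canvas with
    one more stroke applied.
    """
    before = l2_distance(canvas, target)
    new_canvas = render_stroke(canvas)
    after = l2_distance(new_canvas, target)
    # Positive reward when the stroke moves the canvas toward the target.
    return new_canvas, before - after
```

The agent's policy chooses the stroke parameters; rolling `paint_step` forward produces the stroke sequence that reconstructs the image.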
Try it out on Colaboratory
- Trained the actor model using TD3 instead of DDPG. Within TD3, we implemented the policy update using both Proximal Policy Optimization (PPO) and naive policy optimization. TD3 adds a second critic and takes the minimum of the two value estimates, which counters DDPG's well-known tendency to overestimate the learned value function. PPO is an efficient variant of TRPO that uses a trust region to bound the direction and magnitude of each gradient update.
- Compared the quantitative and qualitative performance of the TD3 models (PPO and naive) against the DDPG baseline.
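The two ideas in the list above can be shown compactly. This is a hedged sketch, not the repo's implementation: `td3_target` shows TD3's clipped double-Q target (the minimum over two target critics that counters DDPG's overestimation), and `ppo_clip_objective` shows PPO's clipped surrogate, which bounds the policy update without TRPO's second-order machinery.

```python
import numpy as np

def td3_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    # TD3's clipped double-Q target: taking the minimum of the two
    # target critics' estimates counters value overestimation in DDPG.
    q_min = min(q1_next, q2_next)
    return reward + (0.0 if done else gamma * q_min)

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # PPO's clipped surrogate: `ratio` is pi_new(a|s) / pi_old(a|s).
    # Clipping the ratio to [1 - eps, 1 + eps] keeps each update inside
    # an approximate trust region around the old policy.
    return min(ratio * advantage,
               float(np.clip(ratio, 1 - eps, 1 + eps)) * advantage)
```

In training, `td3_target` is the regression target for both critics, and the negated mean of `ppo_clip_objective` over a batch is the actor loss for the PPO variant.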
```
pip install tensorboardX
tensorboard --logdir=<path to log folder> --port=<port>
```