
Malmo Challenge Overview

Introduction

We approached the challenge mainly from the classic setting of reinforcement learning with function approximation, as we were more interested in building models capable of learning complex policies and inferring the behaviour of other agents than in developing hard-coded heuristics informed by human experience.

To build an understanding of the task at hand and of the dynamics of the Malmo-Challenge world, we experimented with various deep reinforcement learning methods, starting with well-known value-based algorithms and arriving at a variant of the Asynchronous Advantage Actor-Critic (A3C) with recurrent units, augmented with two auxiliary cost functions that help the learning algorithm internalize the episodic behaviour of the Challenger Agent.

To speed up the experimentation cycle we built a secondary environment approximating the dynamics of the Malmo-Challenge task in top-view mode, and we successfully used it for the transfer-learning experiments detailed below. We performed all experiments on the top-view symbolic representation, which we deemed a good computational trade-off that still provides a sufficient statistic for our reinforcement learning algorithms.

Methods

Early experimentation involved training feed-forward parametrized estimators with DQN, Double DQN (to compensate for over-estimation effects early in training), and policy-gradient-based methods. We concluded that policy-gradient methods with recurrent units would provide a good baseline to build upon. To this end we implemented a state-of-the-art Advantage Actor-Critic inspired by Mnih2016: a four-layer convolutional neural network for feature extraction feeds two successive GRU layers, followed by two fully connected layers and the final softmax, value, and auxiliary-reward heads. The state representation used during training is an 18x9x9 tensor, with three channels for the sand, grass, and lapis blocks and five channels for each of the two agents and the pig, encoding their positions and orientations. We provide code for all the models discussed in this overview.
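For orientation, the sketch below shows one plausible PyTorch rendering of this architecture. The channel counts, hidden sizes, and the three-action space are illustrative assumptions, not the exact values from our code.

```python
import torch
import torch.nn as nn

class ActorCriticNet(nn.Module):
    """Sketch: 4-layer CNN -> two GRU layers -> FC -> policy/value/aux heads."""

    def __init__(self, in_channels=18, n_actions=3, hidden=128):
        super().__init__()
        # Four convolutional layers over the 18x9x9 symbolic state, with
        # batch normalization in between (see the Training section).
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.gru1 = nn.GRUCell(64 * 9 * 9, hidden)
        self.gru2 = nn.GRUCell(hidden, hidden)
        self.fc = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy = nn.Linear(hidden, n_actions)  # softmax head (logits)
        self.value = nn.Linear(hidden, 1)           # state-value head
        self.aux_reward = nn.Linear(hidden, 1)      # auxiliary reward head

    def forward(self, x, h1=None, h2=None):
        z = self.features(x).flatten(1)
        h1 = self.gru1(z, h1)   # GRUCell defaults a None hidden state to zeros
        h2 = self.gru2(h1, h2)
        z = self.fc(h2)
        return self.policy(z), self.value(z), self.aux_reward(z), (h1, h2)
```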

Auxiliary tasks

While the recurrent A3C model was able to learn a good policy with good sample-efficiency, we also provided our model with additional cost functions designed to help it learn features relevant to the task at hand, as first developed in Jaderberg2016.

Specifically, we trained the agent to predict the instantaneous reward at the next step, so that our model learns faster about states and situations leading to high reward.
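In loss terms this amounts to an extra regression target alongside the usual actor-critic objective; a minimal sketch, where the weighting coefficient is an assumption:

```python
import torch.nn.functional as F

def reward_prediction_loss(pred_reward, next_reward, coef=0.5):
    # Regress the auxiliary head output onto the reward actually observed
    # at the next step; `coef` is an illustrative weight, not a tuned value.
    return coef * F.mse_loss(pred_reward.squeeze(-1), next_reward)
```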

The second auxiliary task we trained with was next-map prediction. We first considered fully generating the next map, complete with the future positions of the Challenger Agent and the Pig, hoping this would help our agent determine the unknown policy of the Challenger Agent based on its moves. Feeding the hidden states of the recurrent layers into a deconvolution that generates the next state of the map, however, caused a severe slow-down in learning. We therefore settled on predicting a single random coordinate of the (18, 9, 9) state representation: at the start of each episode we pick a random coordinate, and the agent predicts its next value at every time-step. We hypothesize that this additional cost function helps our agent learn faster about the dynamics of the environment and about the policy the Challenger Agent follows during each episode.
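A minimal sketch of the random-coordinate targeting, assuming the (18, 9, 9) layout above; how the chosen coordinate conditions an extra scalar head is our own illustrative choice, not the exact mechanism in our code.

```python
import torch
import torch.nn.functional as F

def sample_coordinate():
    # Drawn once at the start of each episode; the same coordinate is
    # predicted at every time-step until the episode ends.
    c = torch.randint(0, 18, ()).item()  # channel: block / agent / pig layer
    y = torch.randint(0, 9, ()).item()
    x = torch.randint(0, 9, ()).item()
    return c, y, x

def map_prediction_loss(pred_cell, next_state, coord, coef=0.5):
    # `pred_cell` is the output of a hypothetical scalar head conditioned
    # on `coord`; the target is that cell of the *next* state tensor.
    c, y, x = coord
    target = next_state[:, c, y, x]
    return coef * F.mse_loss(pred_cell.squeeze(-1), target)
```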

Training

We employed a two-stage training process as follows:

  1. Pre-training on the secondary environment. As mentioned above, we developed a secondary environment that approximates the dynamics of the Malmo-Challenge world in the top-down view. We used this environment to generate large batches of 1024 variable-length episodes, performing an optimization step on each batch with RMSProp (a condensed sketch of this update follows the list). We used batch normalization between the convolutional layers, as we noticed it improves the sample-complexity of our model and allows for higher learning rates. This initial pre-training phase allowed quick experimentation with various models and, more importantly, provided a good prior when training our model on the Malmo-Challenge. A main difference from A3C is that we did not use an asynchronous set-up; instead, we leveraged the ability of our environment to serve large batches of episodes.

  2. Training on the Malmo-Challenge environment. We used the full pre-trained model and a custom StateBuilder to further train our agent on the Malmo-Challenge environment. For this phase we started multiple environments and employed a training scheme inspired by GA3C Babaeizadeh2017: prediction requests are collected from all the workers and served by batched prediction on a single model, while a separate training process performs optimization steps on batches of 128 episodes. We obtained the best results in this phase using Adam with a smaller learning rate.
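The condensed sketch below illustrates the synchronous update used in both stages, assuming the model sketched in the Methods section and a collected `batch` of episodes are in scope; the discount, loss coefficients, and learning rate are illustrative, not the values we tuned.

```python
import torch
import torch.nn.functional as F

def a2c_loss(model, episode, gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """Actor-critic loss for one episode of (state, action, reward) steps;
    states are (1, 18, 9, 9) tensors and the GRU state is threaded through."""
    h1 = h2 = None
    logps, values, entropies, rewards = [], [], [], []
    for state, action, reward in episode:
        logits, value, _, (h1, h2) = model(state, h1, h2)
        logp = F.log_softmax(logits, dim=-1)
        logps.append(logp[0, action])
        entropies.append(-(logp.exp() * logp).sum())
        values.append(value[0, 0])
        rewards.append(reward)
    # Discounted returns, accumulated backwards from the end of the episode.
    R, loss = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        advantage = R - values[t]
        loss = (loss
                - logps[t] * advantage.detach()   # policy-gradient term
                + value_coef * advantage.pow(2)   # value regression
                - entropy_coef * entropies[t])    # exploration bonus
    return loss

# Stage 1: RMSProp over large batches of episodes from the secondary
# environment (stage 2 switches to Adam with a smaller learning rate).
optimizer = torch.optim.RMSprop(model.parameters(), lr=7e-4)
batch_loss = sum(a2c_loss(model, ep) for ep in batch) / len(batch)
optimizer.zero_grad()
batch_loss.backward()
optimizer.step()
```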

Other experiments

We also considered a hierarchical model learning at different time resolutions, inspired by FeUdal Networks Vezhnevets2017, reasoning that the Malmo-Challenge set-up is a good example of learning different skills or options (exiting the pigsty and collaborating to catch the Pig) while also taking higher-level decisions about which of the learned options to follow, depending on the policy of the Challenger Agent. Although this is a research direction we plan to pursue further, the preliminary results on the Malmo-Challenge suggest that less involved policy-gradient methods are perfectly capable of solving the task.

Running

You can evaluate our trained model with:

    python test_challenge.py

Video

Link to video

Screencapture

Training curves for Actor-Critic with auxiliary tasks

[figures: loss and average reward per game for the Adam run; loss and reward per step for the RMSProp run]

Contributors

andreicnica, floringogianu


flying-pig's Issues

Use image input?

Hi all,
I find your work interesting, but I am not sure: does your model use the image input, or just the feature vector?

Best wishes, and apologies if my English is unclear.
