Orchestrated Value Mapping

This repository hosts the code release for the paper "Orchestrated Value Mapping for Reinforcement Learning", published at ICLR 2022. This work was done by Mehdi Fatemi (Microsoft Research) and Arash Tavakoli (Max Planck Institute for Intelligent Systems).

We release a flexible framework, built upon Dopamine (Castro et al., 2018), for building and orchestrating various mappings over different reward decomposition schemes. This enables the research community to easily explore the design space that our theory opens up and investigate new convergent families of algorithms.

The code has been developed by Arash Tavakoli.

Citing

If you make use of our work, please cite our paper:

@inproceedings{Fatemi2022Orchestrated,
  title={Orchestrated Value Mapping for Reinforcement Learning},
  author={Mehdi Fatemi and Arash Tavakoli},
  booktitle={International Conference on Learning Representations},
  year={2022},
  url={https://openreview.net/forum?id=c87d0TS4yX}
}

Getting started

We install the required packages within a virtual environment.

Virtual environment

Create a virtual environment using conda via:

conda create --name maprl-env python=3.8
conda activate maprl-env

Prerequisites

Atari benchmark. To set up the Atari suite, please follow the steps outlined here.
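As a quick sanity check that the Atari environments are available (this assumes the classic gym Atari setup that Dopamine 3.x builds on), you can try:

python -c "import gym; gym.make('PongNoFrameskip-v4')"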

Install Dopamine. Install a compatible version of Dopamine with pip:

pip install dopamine-rl==3.1.10
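You can verify the installation with a quick import check (dopamine.discrete_domains is where Dopamine's training machinery lives):

python -c "import dopamine.discrete_domains.run_experiment"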

Installing from source

To experiment within our framework, install it from source so that you can modify the code directly:

git clone https://github.com/microsoft/orchestrated-value-mapping.git
cd orchestrated-value-mapping
pip install -e .
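If the editable install succeeded, the package should now be importable from anywhere (assuming the map_rl layout implied by the training commands below):

python -c "import map_rl"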

Training an agent

Change directory to the workspace directory:

cd map_rl

To train a LogDQN agent, similar to that introduced by van Seijen, Fatemi & Tavakoli (2019), run the following command:

python -um map_rl.train \
  --base_dir=/tmp/log_dqn \
  --gin_files='configs/map_dqn.gin' \
  --gin_bindings='MapDQNAgent.map_func_id="[log,log]"' \
  --gin_bindings='MapDQNAgent.rew_decomp_id="polar"' &
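The trailing & runs training in the background. Following Dopamine's conventions, logs and checkpoints are written under the directory passed via --base_dir (here, /tmp/log_dqn).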

Here, polar refers to the two-channel reward decomposition scheme described in Equation 13 of Fatemi & Tavakoli (2022), and [log,log] applies a logarithmic mapping to each of the two reward channels.
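For intuition, here is a minimal Python sketch of these two ingredients; the logarithmic mapping follows the general form used by van Seijen et al. (2019), but the exact parameterization in map_dqn_agent.py may differ, and the hyperparameters c and d below are purely illustrative:

import numpy as np

# Polar reward decomposition: split a scalar reward r into two
# non-negative channels such that r = r_pos - r_neg.
def polar_decompose(r):
    return max(r, 0.0), max(-r, 0.0)

# Illustrative logarithmic mapping f(x) = c * (ln(x + d) - ln(d));
# it is strictly increasing with f(0) = 0.
def log_map(x, c=0.5, d=0.02):
    return c * (np.log(x + d) - np.log(d))

# Its inverse, f^{-1}(y) = exp(y / c + ln(d)) - d, maps values back
# to the original reward space.
def log_map_inv(y, c=0.5, d=0.02):
    return np.exp(y / c + np.log(d)) - d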

Train a LogLinDQN agent, similar to that described by Fatemi & Tavakoli (2022), using:

python -um map_rl.train \
  --base_dir=/tmp/loglin_dqn \
  --gin_files='configs/map_dqn.gin' \
  --gin_bindings='MapDQNAgent.map_func_id="[loglin,loglin]"' \
  --gin_bindings='MapDQNAgent.rew_decomp_id="polar"' &
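Here, loglin denotes the log-linear mapping of Fatemi & Tavakoli (2022), which combines a logarithmic and a linear mapping within each of the two reward channels.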

Creating custom agents

To instantiate a custom agent, simply set a mapping function for each channel and choose a reward decomposition scheme. For instance, the following setting

MapDQNAgent.map_func_id="[log,identity]"
MapDQNAgent.rew_decomp_id="polar"

results in a logarithmic mapping for the positive-reward channel and the identity mapping (same as in DQN) for the negative-reward channel.
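For example, this can be passed on the command line in the same way as before (the base_dir path is only illustrative):

python -um map_rl.train \
  --base_dir=/tmp/log_identity_dqn \
  --gin_files='configs/map_dqn.gin' \
  --gin_bindings='MapDQNAgent.map_func_id="[log,identity]"' \
  --gin_bindings='MapDQNAgent.rew_decomp_id="polar"' &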

To use more complex reward decomposition schemes, such as Configurations 1 and 2 from Fatemi & Tavakoli (2022), specify them as follows:

MapDQNAgent.map_func_id="[identity,identity,log,log,loglin,loglin]"
MapDQNAgent.rew_decomp_id="config_1"
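The number of entries in map_func_id should match the number of reward channels induced by the chosen decomposition scheme (six in this example, versus two for polar).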

To instantiate an ensemble of two learners, each using a polar reward decomposition, use the following syntax:

MapDQNAgent.map_func_id="[loglin,loglin,log,log]"
MapDQNAgent.rew_decomp_id="two_ensemble_polar"
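Reading the mapping list in order, this example effectively ensembles a LogLinDQN-style learner (the first pair of channels) with a LogDQN-style learner (the second pair).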

Custom mappings and reward decomposition schemes

To implement custom mapping functions and reward decomposition schemes, we suggest drawing on insights from Fatemi & Tavakoli (2022) and following the format of the existing methods in map_dqn_agent.py.
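As a purely hypothetical illustration (not part of the released code), a custom mapping would typically come as a forward function together with its inverse, with the forward map strictly increasing so that the inverse is well defined:

import numpy as np

# Hypothetical custom mapping: a signed square-root and its inverse.
# The actual interface and registration mechanism are defined in
# map_dqn_agent.py and may differ from this sketch.
def sqrt_map(x):
    return np.sign(x) * np.sqrt(np.abs(x))

def sqrt_map_inv(y):
    return np.sign(y) * np.square(y)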
