horizonrobotics / alf

Agent Learning Framework https://alf.readthedocs.io

License: Apache License 2.0


alf's Introduction

ALF


Agent Learning Framework (ALF) is a reinforcement learning framework that emphasizes the flexibility and ease of implementing complex algorithms involving many different components. ALF is built on PyTorch. Development of the previous version, based on TensorFlow 2.1, stopped as of Feb 2020.

Tutorial

A draft tutorial can be accessed on RTD. The tutorial is still under construction and some chapters are unfinished.

Documentation

Read the ALF documentation here.

Algorithms

Algorithm Type Reference
A2C On-policy RL OpenAI Baselines: ACKTR & A2C
PPO On-policy RL Schulman et al. "Proximal Policy Optimization Algorithms" arXiv:1707.06347
PPG On-policy RL Cobbe et al. "Phasic Policy Gradient" arXiv:2009.04416
DDQN Off-policy RL Hasselt et al. "Deep Reinforcement Learning with Double Q-learning" arXiv:1509.06461
DDPG Off-policy RL Lillicrap et al. "Continuous control with deep reinforcement learning" arXiv:1509.02971
QRSAC Off-policy RL Dabney et al. "Distributional Reinforcement Learning with Quantile Regression" arXiv:1710.10044
SAC Off-policy RL Haarnoja et al. "Soft Actor-Critic Algorithms and Applications" arXiv:1812.05905
OAC Off-policy RL Ciosek et al. "Better Exploration with Optimistic Actor-Critic" arXiv:1910.12807
HER Off-policy RL Andrychowicz et al. "Hindsight Experience Replay" arXiv:1707.01495
TAAC Off-policy RL Yu et al. "TAAC: Temporally Abstract Actor-Critic for Continuous Control" arXiv:2104.06521
SEditor Off-policy/Safe RL Yu et al. "Towards Safe Reinforcement Learning with a Safety Editor Policy" NeurIPS 2022
DIAYN Intrinsic motivation/Exploration Eysenbach et al. "Diversity is All You Need: Learning Diverse Skills without a Reward Function" arXiv:1802.06070
ICM Intrinsic motivation/Exploration Pathak et al. "Curiosity-driven Exploration by Self-supervised Prediction" arXiv:1705.05363
RND Intrinsic motivation/Exploration Burda et al. "Exploration by Random Network Distillation" arXiv:1810.12894
MuZero Model-based RL Schrittwieser et al. "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" arXiv:1911.08265
BC Offline RL Pomerleau "ALVINN: An Autonomous Land Vehicle in a Neural Network" NeurIPS 1988
Bain et al. "A framework for behavioural cloning" Machine Intelligence 1999
Causal BC Offline RL Swamy et al. "Causal Imitation Learning under Temporally Correlated Noise" ICML2022
IQL Offline RL Kostrikov, et al. "Offline Reinforcement Learning with Implicit Q-Learning" arXiv:2110.06169
MERLIN Unsupervised learning Wayne et al. "Unsupervised Predictive Memory in a Goal-Directed Agent"arXiv:1803.10760
MoNet Unsupervised learning Burgess et al. "MONet: Unsupervised Scene Decomposition and Representation" arXiv:1901.11390
Amortized SVGD General Feng et al. "Learning to Draw Samples with Amortized Stein Variational Gradient Descent" arXiv:1707.06626
HyperNetwork General Ratzlaff and Fuxin. "HyperGAN: A Generative Model for Diverse, Performant Neural Networks" arXiv:1901.11058
MCTS General Grill et al. "Monte-Carlo tree search as regularized policy optimization" arXiv:2007.12509
MINE General Belghazi et al. "Mutual Information Neural Estimation" arXiv:1801.04062
ParticleVI General Liu and Wang. "Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm" arXiv:1608.04471
Liu et al. "Understanding and accelerating particle-based variational inference" arXiv:1807.01750
GPVI General Ratzlaff, Bai et al. "Generative Particle Variational Inference via Estimation of Functional Gradients" arXiv:2103.01291
SVGD optimizer General Liu et al. "Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm" arXiv:1608.04471
VAE General Higgins et al. "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework" ICLR2017
RealNVP General Dinh et al. "Density estimation using Real NVP" arXiv:1605.08803
SpatialBroadcastDecoder General Watters et al. "Spatial Broadcast Decoder: A Simple Architecture for Learning Disentangled Representations in VAEs" arXiv:1901.07017
VQ-VAE General A van den Oord et al. "Neural Discrete Representation Learning" NeurIPS2017

Installation

OS software

The following installation was tested on Ubuntu 22.04 with CUDA 11.8.

Python 3.11 is currently supported by ALF. Note that some pip packages (e.g., pybullet) need Python dev files, so make sure python3.11-dev is installed:

sudo apt install -y python3.11 python3.11-dev

Boost is also required by ALF for fast parallel environments.

sudo apt install libboost-all-dev

Python environment

Virtualenv is recommended for the installation. After creating and activating a virtual env, you can run the following commands to install ALF:

git clone https://github.com/HorizonRobotics/alf
cd alf
pip install pybind11
pip install -e . --extra-index-url https://download.pytorch.org/whl/cu118

For Nix Users

There is a built-in Nix-based development environment defined in flake.nix. To activate it, run

$ nix develop

in the root of your local repository.

Docker

We also provide a docker image of ALF for convenience. To use this image, you need to have docker and nvidia-docker (for GPU usage with ALF) installed first.

docker run --gpus all -it horizonrobotics/cuda:11.8.0-py3.11-torch2.2-ubuntu22.04 /bin/bash

This will give you a shell with ALF and all of its dependencies pre-installed.

The current docker image contains the ALF version as of Feb 21, 2024. Regular image updates are expected in the future.

Examples

You can train any _conf.py file under alf/examples as follows:

python -m alf.bin.train --conf=CONF_FILE --root_dir=LOG_DIR
  • CONF_FILE is the path to your conf file, which follows the ALF configuration file format (basically Python; see the sketch below).
  • LOG_DIR is the directory where you want to store the training results. Note that if you want to train from scratch, LOG_DIR must point to a location that doesn't exist. Otherwise, training is assumed to resume from a previous checkpoint (if any).
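For reference, a conf file is ordinary Python that binds parameters through alf.config. Below is a minimal, hypothetical sketch; the configurable names such as create_environment and TrainerConfig follow the examples under alf/examples, but the exact values are placeholders rather than a tested configuration:

import alf
from alf.algorithms.actor_critic_algorithm import ActorCriticAlgorithm

# Because a conf file is plain Python, values can be computed freely,
# which is one advantage over the older gin format.
num_envs = 8

alf.config(
    "create_environment",
    env_name="CartPole-v0",
    num_parallel_environments=num_envs)

alf.config(
    "TrainerConfig",
    algorithm_ctor=ActorCriticAlgorithm,
    unroll_length=8,
    num_iterations=1000)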

During training, you can use TensorBoard to monitor progress:

tensorboard --logdir=LOG_DIR

After training, you can evaluate the trained model and visualize environment frames using the following command:

python -m alf.bin.play --root_dir=LOG_DIR

Deprecated

An older version of ALF used gin for job configuration. Its syntax is not as flexible as ALF conf (e.g., you can't easily do math computation in a gin file). There are still some examples with .gin under alf/examples. We are in the process of converting all .gin examples to _conf.py examples.

You can train any .gin file under alf/examples using the following command:

cd alf/examples; python -m alf.bin.train --gin_file=GIN_FILE --root_dir=LOG_DIR
  • GIN_FILE is the path to the gin conf (some .gin files under alf/examples might be invalid; they have not been converted to use the latest pytorch version of ALF).
  • LOG_DIR has the same meaning as in the ALF conf example above.

Warning: When using gin, ALF has to be launched from the same directory as the gin file(s). If an error says that no configuration file is found, you've probably launched ALF from the wrong place.

All the examples below were trained on a single machine with an Intel(R) Core(TM) i9-7960X CPU @ 2.80GHz (32 logical CPUs) and one RTX 2080Ti GPU.

A2C

  • Cart pole. The training score took only 30 seconds to reach 200, using 8 environments.

    breakout-training-curve cartpole-video

  • Atari games. The Python package atari-py needs to be installed for Atari game environments. The evaluation score (by taking argmax of the policy) took 1.5 hours to reach 800 on Breakout, using 64 environments.

    breakout-training-curve breakout-playing-screen

  • Simple navigation with visual input. Follow the instructions at SocialRobot to install the environment.

    simple-navigation-curve simple-navigation-video

PPO

  • PR2 grasping (state input only). Follow the instructions at SocialRobot to install the environment.

    ppo-pr2-curve pr2-video

  • Humanoid. Learning to walk using the pybullet Humanoid environment. The Python package pybullet>=2.5.0 needs to be installed for the environment. The evaluation score reaches 3k in 50M steps, using 96 parallel environments.

    Humanoid-training-curve Humanoid-video

PPG

DDQN

DDPG

  • FetchSlide (sparse rewards). Need to install the MuJoCo simulator first. This example reproduces the performance of vanilla DDPG reported in the OpenAI's Robotics environment paper. Our implementation doesn't use MPI, but obtains (evaluation) performance on par with the original implementation. (The original MPI implementation has 19 workers, each worker containing 2 environments for rollout and sampling a minibatch of size 256 from its replay buffer for computing gradients. All the workers' gradients will be summed together for a centralized optimizer step. Our implementation simply samples a minibatch of size 5000 from a common replay buffer per optimizer step.) The training took about 1 hour with 38 (19*2) parallel environments on a single GPU.

    ddpg-fetchslide-training-curve

SAC

  • Bipedal Walker.

    bipedal-walker-training-curve bipedal-walker-video

  • FetchReach (sparse rewards). Need to install the MuJoCo simulator first. The training took about 20 minutes with 20 parallel environments on a single GPU.

    sac-fetchreach-training-curve

  • FetchSlide (sparse rewards). Need to install the MuJoCo simulator first. This is the same task as the DDPG example above, but with SAC as the learning algorithm. It also uses only 20 (instead of 38) parallel environments to improve sample efficiency. The training took about 2 hours on a single GPU.

    sac-fetchslide-training-curve

  • Fetch Environments (sparse rewards) w/ Action Repeat. We are able to achieve even better performance than reported by DDPG + Hindsight Experience Replay in some cases, simply by using SAC + Action Repeat with a length of 3 timesteps (see the wrapper sketch below). See this note to view learning curves, videos, and more details.
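For readers unfamiliar with action repeat, here is a minimal wrapper sketch in the style of a classic gym wrapper. It is illustrative only; ALF ships its own environment wrappers, so this is not the code used in these experiments.

import gym

class ActionRepeat(gym.Wrapper):
    # Repeat each action for a fixed number of environment steps,
    # accumulating the reward (classic 4-tuple gym API assumed).
    def __init__(self, env, repeat=3):
        super().__init__(env)
        self._repeat = repeat

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self._repeat):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info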

ICM

  • Super Mario. Playing Super Mario using only intrinsic reward. The Python package gym-retro>=0.7.0 is required for this experiment, and a suitable SuperMarioBros-Nes ROM needs to be obtained and imported (ROMs are not included in gym-retro). See this doc on how to import ROMs.

    super-mario-training-curve super-mario-video

RND

  • Montezuma's Revenge. Training the hard exploration game Montezuma's Revenge with intrinsic rewards generated by RND. A lucky agent can get an episodic score of 6600 in 160M frames (40M steps with frame_skip=4). A normal agent would get an episodic score of 4000~6000 in the same number of frames. The training took about 6.5 hours with 128 parallel environments on a single GPU.

mrevenge-training-curve mrevenge-video

DIAYN

  • Pendulum. Learning diverse skills without external rewards.

    Discriminator loss Skills learned with DIAYN

BC

Merlin

  • Collect Good Objects. Learn to collect good objects and avoid bad objects. DeepmindLab is required; follow the instructions at DeepmindLab to install the environment.

    room-collect-good-objects-training-curve room-collect-good-objects

MuZero

  • 6x6 Go. It took about a day to train a reasonable agent to play 6x6 go using one GPU.

    6x6-go

Citation

If you use ALF for research and find it useful, please consider citing:

@software{Xu2021ALF,
  title={{{ALF}: Agent Learning Framework}},
  author={Xu, Wei and Yu, Haonan and Zhang, Haichao and Hong, Yingxiang and Yang, Break and Zhao, Le and Bai, Jerry and ALF contributors},
  url={https://github.com/HorizonRobotics/alf},
  year={2021}
}

Contribute to ALF

You are welcome to contribute to ALF. Please follow the guideline here.

alf's People

Contributors

7gao, bayesian, breakds, emailweixu, haichao-zhang, hnyu, jesbu1, jialn, le-horizon, neale, pd-perry, pinzhang, quantumope, resuscitated, ruizhaogit, runjerry, witwolf, www2171668


alf's Issues

_cached_opt_and_var_sets inconsistent between initial and sequential calls

I tried printing out how many variables each optimizer is responsible for optimizing. I printed in two places. The first place is inside algorithm._get_opt_and_var_sets(), which runs before self._cached_opt_and_var_sets is set, and I got something like the following:

{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 3e-05,
 'name': 'Adam'} 24
{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 1e-05,
 'name': 'Adam'} 26
{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 0.001,
 'name': 'Adam'} 1

For each entry, the first part is the optimizer config and the second is len(vars), which makes sense given my job. However, if I print inside train_complete, after calling self._get_cached_opt_and_var_sets(), the output is:

{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 3e-05,
 'name': 'Adam'} 75
{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 1e-05,
 'name': 'Adam'} 26
{'amsgrad': False,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'decay': 0.0,
 'epsilon': 1e-07,
 'learning_rate': 0.001,
 'name': 'Adam'} 1

A simple calculation suggests that the parent algorithm (which holds the first optimizer above) somehow also takes into account all leaf trainable variables somewhere in the code (24 + 24 + 26 + 1 = 75). So some variables will also be optimized by the parent optimizer even though I have specified child optimizers for them (our tape is "persistent" and grads can be computed multiple times).

I verified that this also happens in eager mode.

Add evaluation to on_policy_trainer

Similar to tf_agents' train_eval, we need to periodically evaluate the policy during training. The evaluation usually uses greedy_predict, whose results may differ significantly from non-greedy predict.

Figures showing metric against time

Right now, we have figures showing metrics against global count, environment steps, etc. In order to compare wall-clock speed between algorithms/settings, we also need figures with time on the x-axis and the metric on the y-axis.

can not work with tf.functions

1) Code patch (examples/actor_critic.py):

    # driver = PyDriver(
    #     tf_env,
    #     policy,
    #     observers=train_metrics,
    #     max_steps=num_steps_per_iter)

    driver = DynamicStepDriver(
        tf_env,
        policy,
        observers=train_metrics,
        num_steps=num_steps_per_iter)
    # _ = algorithm.variables  # build networks
    driver.run = tfa_common.function(driver.run)

It runs successfully with use_icm=0: python actor_critic.py --root_dir=~/tmp/icm/ --num_parallel_environments=1 --gin_param=train_eval.use_icm=0

but fails with use_icm=1: python actor_critic.py --root_dir=~/tmp/icm/ --num_parallel_environments=1 --gin_param=train_eval.use_icm=1

TypeError: An op outside of the function building code is being passed
a "Graph" tensor. It is possible to have Graph tensors
leak out of the function building context by including a
tf.init_scope in your function building code.
For example, the following function will fail:
@tf.function
def has_init_scope():
my_constant = tf.constant(1.)
with tf.init_scope():
added = my_constant * 2
The graph tensor has name: ActorDistributionNetwork/CategoricalProjectionNetwork/Categorical/sample/Reshape_1:0
In call to configurable 'train_eval' (<function train_eval at 0x134018950>)

The behavior is strange since there's nothing special about the ICM module (ICMAlgorithm).

The experiments above run successfully when we build the networks before running the graph:
_ = algorithm.variables  # build networks

2) The training procedure is incorrect in the successful cases with tf.functions.

extend grid search

Towards two directions:

  1. Early stopping. For some obviously inferior hyperparameter combinations, it's possible to stop training at an early stage. For that, we could periodically compare a run's performance with the top-k runs at the same training iteration, via some communication channel like a Queue. Basically, this eventually becomes an evolution-like search over hyperparameters. Of course, more advanced methods can be employed:
    https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f

  2. If each training run needs a GPU, then it's impractical to launch all runs on a single machine. So a distributed search is needed (after our cluster is ready?).

I noticed that there are existing Python-based libraries for searching hyperparameters. One example is Hyperopt:
https://github.com/hyperopt/hyperopt

We should investigate these libraries further. If they are ready for use in our case, we might not want to reinvent the wheel.

Refactoring: Move OffPolicyDriver.train to RLAlgorithm

And also move the replay buffer to RLAlgorithm.

The exact training procedure should be decided by the algorithm. Moving it into the algorithm will make algorithms more flexible, so that we can minimize the need to change or write policy drivers for future algorithms.

Unittests for running all examples

Even though it's not reasonable to train all the examples until they reach the desired performance, we should at least make sure they can run a few iterations and play correctly.

Make sure environments in SocialRobot work with tf_agents

We want to be able to run SocialRobot environment with tf_agents. There are two issues:

  1. Since tf_agents runs step() of an environment in a thread different from the one it was created in, SocialRobot somehow crashes when used this way.

  2. We want to run multiple SocialRobot environments, and each environment needs to be in a separate process because of Gazebo. Hence we need an environment wrapper that runs an environment in a separate process while letting us use it as if it were in the same process. This is similar to https://github.com/deepmind/scalable_agent/blob/master/py_process.py, which is used here https://github.com/deepmind/scalable_agent/blob/6c0c8a701990fab9053fb338ede9c915c18fa2b1/experiment.py#L437

images are not stored as tf.uint8 in replay buffers

Due to the tf_agents implementation of _spec_from_gym_space() in gym_wrapper.py, all inputs with spec gym.spaces.Box are mapped to the same dtype specified in dtype_map. This causes issues when the inputs are a mixture of uint8 and float32, or when the action is float32. So the default type for now is always float32, and we are spending 4x the memory storing image inputs (see the quick calculation below).
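A quick back-of-envelope check of the 4x factor, assuming a standard 84x84x4 stacked Atari observation (the exact shape doesn't matter, only the dtype sizes do):

import numpy as np

frame_shape = (84, 84, 4)
uint8_bytes = np.prod(frame_shape) * np.dtype(np.uint8).itemsize      # 28,224 bytes
float32_bytes = np.prod(frame_shape) * np.dtype(np.float32).itemsize  # 112,896 bytes
print(float32_bytes / uint8_bytes)  # 4.0 -> the 4x storage overhead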

async train sometime fails

When testing with ppo_async_icm_super_mario_intrinsic_only:

rm -rf tmp && python3 -m alf.bin.train \
 --root_dir=tmp \
 --gin_file=ppo_async_icm_super_mario_intrinsic_only.gin \
 --gin_param=TrainerConfig.random_seed=0 \
 --gin_param=create_environment.num_parallel_environments=1 \
 --gin_param=TrainerConfig.num_iterations=2 \
 --gin_param=TrainerConfig.num_steps_per_iter=1 \
 --gin_param=TrainerConfig.num_updates_per_train_step=1 \
 --gin_param=TrainerConfig.mini_batch_length=2 \
 --gin_param=TrainerConfig.mini_batch_size=4 \
 --gin_param=TrainerConfig.num_envs=2 \
 --gin_param=ReplayBuffer.max_length=64 \
 --gin_param=TrainerConfig.unroll_length=2 \
 --gin_param=TrainerConfig.num_updates_per_train_step=2 \
 --gin_param=TrainerConfig.use_tf_functions=False

I get the error message:

  ...
  File "/home/hongyingxiang/FLA/alf/drivers/threads.py", line 410, in _step
    self._env.step(action), action, first_env_id=self._first_env_id)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_environment.py", line 232, in step
    return self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_py_environment.py", line 319, in _step
    name='step_py_func')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py", line 591, in numpy_function
    return py_func_common(func, inp, Tout, stateful=True, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py", line 488, in py_func_common
    result = func(*[x.numpy() for x in inp])
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_py_environment.py", line 302, in _isolated_step_py
    return self._execute(_step_py, *flattened_actions)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_py_environment.py", line 195, in _execute
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/tf_py_environment.py", line 298, in _step_py
    self._time_step = self._env.step(packed)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/py_environment.py", line 174, in step
    self._current_time_step = self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/parallel_py_environment.py", line 136, in _step
    time_steps = [promise() for promise in time_steps]
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/parallel_py_environment.py", line 136, in <listcomp>
    time_steps = [promise() for promise in time_steps]
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/parallel_py_environment.py", line 338, in _receive
    raise Exception(stacktrace)
Exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/parallel_py_environment.py", line 377, in _worker
    result = getattr(env, name)(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/py_environment.py", line 174, in step
    self._current_time_step = self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/wrappers.py", line 105, in _step
    time_step = self._env.step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/py_environment.py", line 174, in step
    self._current_time_step = self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/gym_wrapper.py", line 197, in _step
    observation, reward, self._done, self._info = self._gym_env.step(action)
  File "/usr/local/lib/python3.6/dist-packages/gym/core.py", line 282, in step
    return self.env.step(self.action(action))
  File "/home/hongyingxiang/FLA/alf/environments/mario_wrappers.py", line 121, in action
    for i in self._actions[a]:
IndexError: list index out of range

and with the following diff

diff --git a/alf/algorithms/actor_critic_algorithm.py b/alf/algorithms/actor_critic_algorithm.py
index 20216fa..836e9dc 100644
--- a/alf/algorithms/actor_critic_algorithm.py
+++ b/alf/algorithms/actor_critic_algorithm.py
@@ -110,6 +110,8 @@ class ActorCriticAlgorithm(OnPolicyAlgorithm):
             step_type=time_step.step_type,
             network_state=state.actor)

+        import threading
+        print(action_distribution.logits[0][:4], threading.current_thread().ident)
         action = common.sample_action_distribution(action_distribution)
         return PolicyStep(
             action=action,

I get:

  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/py_environment.py", line 174, in step
    self._current_time_step = self._step(action)
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/environments/gym_wrapper.py", line 197, in _step
    observation, reward, self._done, self._info = self._gym_env.step(action)
  File "/usr/local/lib/python3.6/dist-packages/gym/core.py", line 282, in step
    return self.env.step(self.action(action))
  File "/home/hongyingxiang/FLA/alf/environments/mario_wrappers.py", line 121, in action
    for i in self._actions[a]:
IndexError: list index out of range

tf.Tensor([nan nan nan nan], shape=(4,), dtype=float32) 140210483619648
tf.Tensor([nan nan nan nan], shape=(4,), dtype=float32) 140210483619648
tf.Tensor([nan nan nan nan], shape=(4,), dtype=float32) 140210483619648

The logits for distributions.Categorical might be NaN (it's very easy to reproduce this issue).

Can you help take a look at this issue @emailweixu @hnyu?

Speeding up the loading of SocialRobot environment

When using 30 or 60 parallel environments, loading the environments takes quite a long time. It seems that the environments are loaded sequentially. We might be able to speed this up by loading the environments in parallel.

alf build failing

This is the current status on the GitHub page:

[screenshot]

Looking into the Travis logs, the following one fails:

[screenshot]

Weird Tensorboard behaviors

Sometimes when I train intrinsic-reward-only agents, I observe non-zero "AverageReturn" points but at the same time all-zero points in "extrinsic/mean". See the two figures below.

[figure: Metrics/AverageReturn]
[figure: reward/extrinsic/mean]

Without any extrinsic rewards, it's basically impossible for "AverageReturn" to be nonzero.
Note that this has nothing to do with averaging, because the second curve is exactly 0, and I didn't downsample any curve.

Does anyone know why this happens?

Actor critic RNN policy unit test fails

There is a shape error when running test alf/algorithms/test_actor_critic_rnn_policy or alf/drivers/on_policy_driver_test.py

File "alf/algorithms/actor_critic_algorithm_test.py", line 165, in test_actor_critic_rnn_policy policy_step = policy.action(time_step, policy_state) ValueError: Incompatible shape for value ((100, 100)), expected ((100, 1))

Refactoring: remove create_???_algorithm

We should make it possible to directly configure Algorithm from gin instead of relying on create_???_algorithm.
Since we need to pass observation_spec, action_spec, time_step_spec, etc. to many of the constructors, we can implement several functions (observation_spec(), action_spec(), time_step_spec()) to get those specs from gin.

Support rendering on multiple GPUs

Currently, for suite_socialbot, when using ParallelPyEnvironment, all the environments perform rendering on the same GPU (GPU 0). We should check whether allowing rendering on multiple GPUs can speed up training.

Better histogram for the summary of discrete action

In alf.utils.common.add_action_summaries, tf.summary.histogram does not generate a good histogram for discrete actions. Since we know the min and max of a discrete action, we should generate a histogram with the known number of buckets and the right min and max (see the sketch below).
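One possible fix, sketched here with plain TensorFlow calls rather than ALF's actual summary utilities, so treat it as an assumption about the intended behavior, not the implementation:

import tensorflow as tf

def discrete_action_counts(actions, num_actions):
    # actions: integer tensor of sampled actions in [0, num_actions)
    # Returns one count per action value, i.e. a histogram whose buckets
    # exactly match the known min (0) and max (num_actions - 1).
    return tf.math.bincount(actions, minlength=num_actions, maxlength=num_actions)

actions = tf.constant([0, 2, 2, 1, 3, 2])
print(discrete_action_counts(actions, num_actions=4))  # [1 1 3 1]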

missing curves and gifs in README

We are missing training curves and GIFs showing the trained agent play in the following two environments in the README file:

"Simple navigation with visual input. Follow the instruction at SocialRobot to install the environment."
and
"PR2 grasping state only. Follow the instruction at SocialRobot to install the environment."

@witwolf and @Jialn Maybe these are easy for you guys to add?

TrainTest fails sometimes

======================================================================
FAIL: test_ppo_cart_pole (bin.train_test.TrainTest)
test_ppo_cart_pole (bin.train_test.TrainTest)

Traceback (most recent call last):
File "/ALF/alf/bin/train_test.py", line 100, in test_ppo_cart_pole
self._test_train('ppo_cart_pole.gin', _test_func)
File "/ALF/alf/bin/train_test.py", line 126, in _test_train
assert_func(episode_returns, episode_lengths)
File "/ALF/alf/bin/train_test.py", line 97, in _test_func
self.assertGreater(np.mean(returns[-2:]), 198)
AssertionError: 197.8499984741211 not greater than 198

@witwolf It seems that the determinism isn't working as expected? There are some other cases that fail sometimes, and I have to manually restart the testing job each time.

can't init submodule with current master

The current master (e5e903b) reports an error on git submodule update:

03:10:03 (py3env) immars@immars-brick alf ±|master ✗|→ git submodule update
error: Server does not allow request for unadvertised object ba9cf75787469554483874efcf07a4f00306443c
Fetched in submodule path 'tf_agents', but it did not contain ba9cf75787469554483874efcf07a4f00306443c. Direct fetching of that commit failed.

It seems that #22 updated tf_agents to version ba9cf75787, which no longer exists?

remove all specs

I believe that specs make our framework strongly typed, which is good for avoiding mistakes, but they also make the code much less flexible and more verbose (not Pythonic). There are two main reasons for having specs:

  1. tf_agents networks require specs to initialize
  2. a replay buffer requires specs to allocate memories in __init__

For 1), we can probably rewrite some networks (Keras doesn't require specs); for 2), when we create a replay buffer, we could pass in an example experience and let the replay buffer class extract the specs internally (see the sketch below).

Am I missing other reasons for specs?
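A sketch of idea 2), extracting specs from an example experience with tf.nest; the helper name is hypothetical and only illustrates that the replay buffer could derive specs internally:

import tensorflow as tf

def specs_from_example(example):
    # example: an arbitrary nest of tensors, e.g. one Experience sample
    return tf.nest.map_structure(
        lambda t: tf.TensorSpec(shape=t.shape, dtype=t.dtype), example)

example = {"observation": tf.zeros([84, 84, 4], tf.uint8),
           "reward": tf.zeros([], tf.float32)}
print(specs_from_example(example))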

trac_ddpg_pendulum failed

When testing with trac_ddpg_pendulum:

python -m alf.bin.train --root_dir=tdp --gin_file=trac_ddpg_pendulum

I get the error message below; still investigating it.

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hongyingxiang/FLA/alf/bin/train.py", line 88, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/hongyingxiang/FLA/alf/bin/train.py", line 79, in main
    train_eval(FLAGS.root_dir)
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1032, in wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/usr/local/lib/python3.6/dist-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1009, in wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/hongyingxiang/FLA/alf/bin/train.py", line 73, in train_eval
    trainer.train()
  File "/home/hongyingxiang/FLA/alf/trainers/policy_trainer.py", line 315, in train
    summary_max_queue=self._summary_max_queue)
  File "/home/hongyingxiang/FLA/alf/utils/common.py", line 265, in run_under_record_context
    func()
  File "/home/hongyingxiang/FLA/alf/trainers/policy_trainer.py", line 345, in _train
    time_step=time_step)
  File "/home/hongyingxiang/FLA/alf/trainers/off_policy_trainer.py", line 74, in _train_iter
    update_counter_every_mini_batch=self._config.
  File "/home/hongyingxiang/FLA/alf/algorithms/off_policy_algorithm.py", line 105, in train
    mini_batch_length, update_counter_every_mini_batch)
  File "/home/hongyingxiang/FLA/alf/utils/common.py", line 985, in __call__
    return tf_func_instance(get_current_scope(), *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 696, in _call
    return function_lib.defun(fn_with_cond)(*canon_args, **canon_kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError:  assertion failed: [100]
   [[{{node cond/else/_1/StatefulPartitionedCall/while/body/_694/while/body/_1961/StatefulPartitionedCall/Assert/AssertGuard/else/_7599/Assert}}]] [Op:__inference_fn_with_cond_12063]

Function call stack:
fn_with_cond

  In call to configurable 'train_eval' (<function train_eval at 0x7f35b018a840>)

inoperative config info seems incorrect

I included "atari.gin" in my gin file to run a job. In TensorBoard "text", I see the following inoperative info:

[screenshot]

However, as far as I can check, these configs are indeed used by my job. Is there a bug here?

grocery_ground_goal_task training taking 3x more memory than expected?

4 bytes (float) * 80 * 80 (image size) * 3 (channels) * 100 (unroll length) * 12 (input + two conv layers + backprop + framestack) * 30 (parallel envs) / 1,000,000,000 ≈ 2.7 GB

Currently CUDA seems to be taking ~9 GB of GPU memory (rendering is taking another ~4 GB):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:17:00.0 On | N/A |
| 18% 59C P5 46W / 250W | 5274MiB / 10988MiB | 31% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:65:00.0 Off | N/A |
| 30% 66C P2 81W / 250W | 8856MiB / 10989MiB | 3% Default |
+-------------------------------+----------------------+----------------------+

This seems to suggest it is taking 3x the memory vs what we expect?

Did we miss anything in the calculation?

Thanks,
Le
-----some details-----
conv_layer_params = ((16, 3, 2), (32, 3, 2))
1st conv layer 40*40*16, 2nd conv layer 20*20*32, which roughly adds up to 2x the input layer;
then *2 for the actor and critic networks, and *2 again for forward and backprop;

  • plus 4x the input layer for FrameStack
    = 8 + 4 = 12x the size of the input layer
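The same estimate written out explicitly (all numbers are the assumptions above, not measured values):

bytes_per_float = 4
pixels = 80 * 80 * 3        # one observation
unroll_length = 100
activation_factor = 12      # input + conv layers + backprop + FrameStack, per the breakdown above
num_envs = 30

estimate_gb = bytes_per_float * pixels * unroll_length * activation_factor * num_envs / 1e9
print(round(estimate_gb, 2))  # ~2.76 GB expected, versus ~9 GB observed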

Incorporating long-term entropy bonus reward in on-policy AC

For on-policy AC, we have the entropy regularizer E_{\pi}[-\log\pi(a_t|s_t)] at every step. As an unbiased estimation, we can simply use the negative log-likelihood -\log\pi(a_t|s_t) as an additional reward added to the computed advantage towards the policy gradient loss computation. This is essentially a one-step entropy bonus reward where the policy only cares about the single-step bonus.

We can extend this to long-term entropy maximization by adding -\log\pi(a_t|s_t) as an intrinsic reward (just like the ICM rewards), which will be absorbed in the advantage and return computations. This could potentially further improve our on-policy AC performance on top of the current single-step entropy bonus.

Note that this simple treatment only applies to on-policy algorithms. For off-policy algorithms, we then need SAC or soft-Q's formulation.
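In symbols, the proposal amounts to maximizing a discounted entropy-augmented return rather than a per-step bonus. A sketch consistent with the description above, with \alpha as the entropy weight:

% One-step bonus (current on-policy AC): the advantage is augmented as
%   \hat{A}_t \leftarrow \hat{A}_t + \alpha\,\bigl(-\log\pi(a_t \mid s_t)\bigr).
% Long-term version: treat the entropy term as an intrinsic reward, so the
% objective becomes
%   J(\pi) = \mathbb{E}_{\pi}\Bigl[\sum_{t \ge 0} \gamma^{t}\bigl(r_t - \alpha\,\log\pi(a_t \mid s_t)\bigr)\Bigr].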

Inoperative gin config

Sometimes the configuration provided in a gin config file may be overridden by Python code. One example is alf/examples/ppo_pr2.sh in PR #20, where the command line specifies --gin_param='train_eval.num_epochs=10', but this is unintentionally overridden by a FLAG '--num_epochs' with default value 10.

It would be nice if we could find a way to write all the unused configs from gin_file or gin_params to TensorBoard.

ParallelPyEnvironment will cause sys.exit() to hang after main() finishes

I've spent quite some time pinning down an issue (reason unknown at this moment): once we create tf_agents' ParallelPyEnvironment in main(), after the code finishes, sys.exit() hangs forever without releasing the GPU memory (running in CPU mode also has this hanging issue). This makes grid_search.py fail. The minimal reproducible example is to replace the train_eval(root_dir) function with the following:

@gin.configurable
def train_eval(root_dir):
    env = create_environment()
    env.pyenv.close()

and run any training job. The script will never exit.

I have located the issue exactly at sys.exit(main(argv)) in absl's app.py. It seems to be an issue inside sys.exit() itself rather than in absl's code.

@witwolf When you ran grid_search.py, did you ever have such a problem?
