
phasic-policy-gradient's Introduction

Status: Archive (code is provided as-is, no updates expected)

Phasic Policy Gradient

This is code for training agents using Phasic Policy Gradient (Cobbe et al., 2020; see the Citation section below).

Supported platforms:

  • macOS 10.14 (Mojave)
  • Ubuntu 16.04

Supported Pythons:

  • 3.7 64-bit

Install

You can get miniconda from https://docs.conda.io/en/latest/miniconda.html if you don't have it, or install the dependencies from environment.yml manually.

git clone https://github.com/openai/phasic-policy-gradient.git
conda env update --name phasic-policy-gradient --file phasic-policy-gradient/environment.yml
conda activate phasic-policy-gradient
pip install -e phasic-policy-gradient

Reproduce and Visualize Results

PPG with default hyperparameters (results/ppg-runN):

mpiexec -np 4 python -m phasic_policy_gradient.train
python -m phasic_policy_gradient.graph --experiment_name ppg

PPO baseline (results/ppo-runN):

mpiexec -np 4 python -m phasic_policy_gradient.train --n_epoch_pi 3 --n_epoch_vf 3 --n_aux_epochs 0 --arch shared
python -m phasic_policy_gradient.graph --experiment_name ppo

PPG, varying E_pi (results/e-pi-N):

mpiexec -np 4 python -m phasic_policy_gradient.train --n_epoch_pi N
python -m phasic_policy_gradient.graph --experiment_name e_pi

PPG, varying E_aux (results/e-aux-N):

mpiexec -np 4 python -m phasic_policy_gradient.train --n_aux_epochs N
python -m phasic_policy_gradient.graph --experiment_name e_aux

PPG, varying N_pi (results/n-pi-N):

mpiexec -np 4 python -m phasic_policy_gradient.train --n_pi N
python -m phasic_policy_gradient.graph --experiment_name n_pi

PPG, using L_KL instead of L_clip (results/ppgkl-runN):

mpiexec -np 4 python -m phasic_policy_gradient.train --clip_param 0 --kl_penalty 1
python -m phasic_policy_gradient.graph --experiment_name ppgkl

PPG, single network variant (results/ppgsingle-runN):

mpiexec -np 4 python -m phasic_policy_gradient.train --arch detach
python -m phasic_policy_gradient.graph --experiment_name ppg_single_network

Pass --normalize_and_reduce to compute and visualize the mean normalized return with phasic_policy_gradient.graph.

Citation

Please cite using the following BibTeX entry:

@article{cobbe2020ppg,
  title={Phasic Policy Gradient},
  author={Cobbe, Karl and Hilton, Jacob and Klimov, Oleg and Schulman, John},
  journal={arXiv preprint arXiv:2009.04416},
  year={2020}
}

phasic-policy-gradient's People

Contributors

kcobbe

phasic-policy-gradient's Issues

VideoRecorder?

Quick question: how would you go about implementing a video recorder so that, after some number of episodes has passed, you can record one to view? I can do it in a normal environment, but I think MPI is throwing things off, if I'm not mistaken.
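
One possible workaround (not something this repo provides out of the box): wrap the environment with gym3's video recorder on MPI rank 0 only, so the other workers keep running unwrapped. The sketch below assumes the Procgen environments and the gym3 version pinned in environment.yml; check the wrapper's arguments against your installed version.

# Hedged sketch: record video from the first MPI worker only, leaving the other ranks untouched.
import gym3
from mpi4py import MPI
from procgen import ProcgenGym3Env

def make_recording_env(num_envs, env_name="coinrun", video_dir="videos"):
    env = ProcgenGym3Env(num=num_envs, env_name=env_name, render_mode="rgb_array")
    if MPI.COMM_WORLD.Get_rank() == 0:
        # Only rank 0 writes video files; the rgb frames come from render_mode="rgb_array".
        env = gym3.VideoRecorderWrapper(env=env, directory=video_dir, info_key="rgb")
    return env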

Does this implementation of ppg support only discrete observation and action types?

I was trying out this code with a custom gym environment (a non-gaming, time-series environment tested to work with baselines), converted to a gym3 environment using gym3.interop.FromGymEnv(), and ended up getting the following error:

Expected ScalarType, got <class 'gym3.types.TensorType'>
  File "/home/user/workspace/phasic_policy_gradient/distr_builder.py", line 35, in tensor_distr_builder
    raise ValueError(f"Expected ScalarType, got {type(ac_space)}")
  File "/home/user/workspace/phasic_policy_gradient/distr_builder.py", line 47, in distr_builder
    return tensor_distr_builder(ac_type)
  File "/home/user/workspace/phasic_policy_gradient/ppg.py", line 97, in __init__
    pi_outsize, self.make_distr = distr_builder(actype)
  File "/home/user/workspace/phasic_policy_gradient/train.py", line 58, in train_fn
    model = ppg.PhasicValueModel(venv.ob_space, venv.ac_space, enc_fn, arch=arch)

I debugged and found that venv.ac_space is of type R[30] and venv.ob_space is of type R[301]. (I made some changes to use this implementation in a non-gaming / time-series environment, e.g. added an MlpEncoder to replace the ImpalaEncoder.) This is because my custom gym environment has an observation_space of type Box(-inf, inf, (301,), float32) and an action_space of type Box(-inf, inf, (301,), float32), which get converted to gym3.types.Real. It seems that the ppg distr_builder only allows Discrete observation and action spaces. Is that so?
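
Not an official answer, but from the traceback above, distr_builder raises for anything that is not a scalar (Discrete-style) type, so Real/Box action spaces would need an extra branch. A minimal sketch of what such a branch could look like, using a diagonal Gaussian with a learned, state-independent log-std; the function name and details are hypothetical, and only the (output_size, make_distr) return shape mirrors how ppg.py unpacks distr_builder.

# Hypothetical continuous-action branch for a distr_builder-style helper:
# a diagonal Gaussian whose mean comes from the policy head.
import torch
import torch.nn as nn
import torch.distributions as td

def gaussian_distr_builder(ac_dim):
    # In real code this parameter must be registered on the model so the optimizer sees it.
    log_std = nn.Parameter(torch.zeros(ac_dim))

    def make_distr(pi_out):
        # pi_out: (..., ac_dim) tensor of means produced by the policy head
        return td.Independent(td.Normal(pi_out, log_std.exp()), 1)

    return ac_dim, make_distr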

About auxiliary loss

Hi! I wonder whether the code computes the aux loss and its gradient for both arch == "detach" and arch == "dual". (If I missed something important, I'm sorry.)
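
For reference, the paper's auxiliary phase optimizes a joint objective L_joint = L_aux + beta_clone * KL[pi_old || pi], with L_aux = 0.5 * E[(V_theta(s_t) - V_targ_t)^2], regardless of architecture; whether its gradient reaches the shared encoder then depends on whether the value head's input is detached. A rough PyTorch-style sketch of that objective (variable names are illustrative, not taken from the repo):

# Illustrative sketch of the PPG auxiliary-phase joint objective.
import torch
import torch.distributions as td

def aux_joint_loss(pd, pd_old, vpred, vtarg, beta_clone=1.0):
    # pd, pd_old: torch.distributions objects for the current and pre-aux-phase policies
    # vpred, vtarg: value predictions and value targets for the sampled states
    l_aux = 0.5 * (vpred - vtarg).pow(2).mean()      # value regression
    l_clone = td.kl_divergence(pd_old, pd).mean()    # behavioral cloning / KL to the old policy
    return l_aux + beta_clone * l_clone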

RCALL_LOGDIR

What is RCALL_LOGDIR supposed to be? I cannot find any documentation stating what it relates to. It is only used in one place, to print to the terminal, but I'm worried I've screwed up some MPI config (even though I can't find it in the MPI documentation either).

CUDA out of memory issue

Hi there,
I am trying to run the example but am getting an out-of-memory error on my 8 GB RTX 2070:

/home/jd/anaconda3/envs/phasic-policy-gradient/lib/python3.7/runpy.py:125: RuntimeWarning: 'phasic_policy_gradient.train' found in sys.modules after import of package 'phasic_policy_gradient', but prior to execution of 'phasic_policy_gradient.train'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
Traceback (most recent call last):
File "/home/jd/anaconda3/envs/phasic-policy-gradient/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/jd/anaconda3/envs/phasic-policy-gradient/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/train.py", line 106, in
main()
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/train.py", line 103, in main
comm=comm)
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/train.py", line 76, in train_fn
comm=comm,
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/ppg.py", line 256, in learn
**ppo_hps,
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/ppo.py", line 244, in learn
verbose=verbose,
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/minibatch_optimize.py", line 60, in minibatch_optimize
train_fn(**mb) for mb in minibatch_gen(tensordict, nminibatch=nminibatch)
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/minibatch_optimize.py", line 60, in
train_fn(**mb) for mb in minibatch_gen(tensordict, nminibatch=nminibatch)
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/ppo.py", line 192, in train_pi_and_vf
return train_with_losses_and_opt(["pi", "vf"], opts["pi"], **arrays)
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/ppo.py", line 173, in train_with_losses_and_opt
**arrays,
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/ppo.py", line 88, in compute_losses
pd, vpred, aux, _state_out = model(ob=ob, first=first, state_in=state_in)
File "/home/jd/anaconda3/envs/phasic-policy-gradient/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/ppg.py", line 140, in forward
x_out[k], state_out[k] = self.get_encoder(k)(ob, first, state_in[k])
File "/home/jd/anaconda3/envs/phasic-policy-gradient/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/impala_cnn.py", line 182, in forward
x = self.cnn(x)
File "/home/jd/anaconda3/envs/phasic-policy-gradient/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/impala_cnn.py", line 149, in forward
x = tu.sequential(self.stacks, x, diag_name=self.name)
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/torch_util.py", line 363, in sequential
x = layer(x, *args)
File "/home/jd/anaconda3/envs/phasic-policy-gradient/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/impala_cnn.py", line 109, in forward
x = block(x)
File "/home/jd/anaconda3/envs/phasic-policy-gradient/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/impala_cnn.py", line 85, in forward
return x + self.residual(x)
File "/home/jd/competition/phasic-policy-gradient/phasic_policy_gradient/impala_cnn.py", line 76, in residual
x = F.relu(x, inplace=False)
File "/home/jd/anaconda3/envs/phasic-policy-gradient/lib/python3.7/site-packages/torch/nn/functional.py", line 914, in relu
result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.79 GiB total capacity; 1.19 GiB already allocated; 151.88 MiB free; 1.21 GiB reserved in total by PyTorch)

I would say this seems rather unreasonable. Is there anything I can do to debug it?
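
One generic way to narrow this down (not specific to this repo): print PyTorch's allocator statistics just before the failing step and check how much memory each MPI worker claims, since mpiexec -np 4 puts four copies of the model and its buffers on the same GPU unless ranks are spread across devices. A hedged debugging sketch; the rank environment variables depend on your MPI implementation:

# Hypothetical helper: report per-process CUDA memory usage.
import os
import torch

def report_cuda_memory(tag=""):
    if not torch.cuda.is_available():
        return
    rank = os.environ.get("OMPI_COMM_WORLD_RANK") or os.environ.get("PMI_RANK", "?")
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20  # torch.cuda.memory_cached() on older PyTorch
    print(f"[rank {rank}] {tag}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")
    # torch.cuda.memory_summary() gives a more detailed breakdown on recent PyTorch versions.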

Issue about n_epoch_pi != n_epoch_pi in ppo.py

https://github.com/openai/phasic-policy-gradient/blob/master/phasic_policy_gradient/ppo.py#L224
When reading the code for the policy phase in ppo.py, the condition n_epoch_pi != n_epoch_pi confuses me.

If n_epoch_pi is not specified, this condition is false, so the value network and policy network train together.
If n_epoch_pi is specified, the condition is still false, so the value network and policy network again train together.

Is this condition a typo for n_epoch_vf != n_epoch_pi? Then, when n_epoch_pi is not specified or n_epoch_pi != n_epoch_vf, the optimization would be performed separately for the policy network and the value network.
Should https://github.com/openai/phasic-policy-gradient/blob/master/phasic_policy_gradient/ppo.py#L230 also use nepoch=n_epoch_vf to update the value network?
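
If that reading is right, the intended control flow would look roughly like the sketch below. This is the reporter's proposed fix, not the code as it stands; only train_pi_and_vf is a name that actually appears in ppo.py, the other helper names are illustrative.

# Sketch of the condition the report suggests was intended:
# train policy and value function separately only when their epoch counts differ.
if n_epoch_vf != n_epoch_pi:
    train_policy_only(nepoch=n_epoch_pi)          # illustrative helper name
    train_value_function_only(nepoch=n_epoch_vf)  # illustrative helper name
else:
    train_pi_and_vf(nepoch=n_epoch_pi)            # joint pass over both losses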

detach of value function for single-network PPG

According to the code (https://github.com/openai/phasic-policy-gradient/blob/master/phasic_policy_gradient/train.py#L14), arch 'detach' seems to correspond to the single-network variant described in Section 3.6 of the paper. According to the paper and the comment in the code, the value function should not be detached from the encoder during the aux phase. However, the value function (vfvec) always seems to be detached according to the code:

for k in self.vf_keys:
    if self.detach_value_head:
        x_out[k] = x_out[k].detach()
    aux[k] = self.get_vhead(k)(x_out[k])[..., 0]
vfvec = aux[self.true_vf_key]
aux.update({"vpredaux": self.aux_vf_head(pi_x)[..., 0], "vpredtrue": vfvec})

Can you clarify whether it should be detached or not in the aux phase and whether it affects the results reported in the paper?

Thanks
