aravindr93 / mjrl
Reinforcement learning algorithms for MuJoCo tasks
License: Apache License 2.0
Hi,
In the original code base, the algorithm does not learn the reward function; instead, the agent computes the cumulative reward using the true reward function obtained directly from the env. That means we are actually using ground-truth information from the environment. Is that allowed in the offline setting? Also, after changing the 'learn_reward' option to True in the config file, the performance is much worse.
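To make the distinction in the question concrete, here is a hypothetical illustration (all names are mine, not from the mjrl codebase): in the strict offline setting the true reward oracle below would not be available, and a reward model would have to be fit on the logged transitions instead.

```python
import numpy as np

def true_reward(s, a):
    # stand-in for querying the env's reward function directly;
    # using this at training time leaks ground-truth information
    return float(-s.sum() - 0.1 * a.sum())

def fit_reward_model(states, actions, rewards):
    # least-squares fit r_hat(s, a) = w . [s, a] on the offline transitions
    X = np.hstack([states, actions])
    w, *_ = np.linalg.lstsq(X, rewards, rcond=None)
    return lambda s, a: float(np.concatenate([s, a]) @ w)

rng = np.random.default_rng(0)
S, A = rng.normal(size=(256, 3)), rng.normal(size=(256, 2))
R = np.array([true_reward(s, a) for s, a in zip(S, A)])

r_hat = fit_reward_model(S, A, R)  # usable without touching the env again
print(abs(true_reward(S[0], A[0]) - r_hat(S[0], A[0])))
```

A learned model like `r_hat` only approximates the oracle on states outside the dataset, which is one plausible reason setting learn_reward=True degrades performance.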
I do not understand why, when calculating the Hessian-vector product using gradients of the KL divergence, the KL is always 0. That doesn't make any sense to me.
mjrl/mjrl/utils/make_train_plots.py
Lines 5 to 6 in af58ff2
Hi,
Thanks for open sourcing the code! Great work!
I am trying to clone the repos as instructed in hand_dapg. I am using Ubuntu 18.04 and conda version 4.7.12. However, running the following command throws up a host of errors:
conda env create -f setup/linux.yml
Result:
Collecting package metadata (repodata.json): done
Solving environment: \
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
....
There is a long list of packages that turn out to be incompatible with versions of other packages. I tried removing the version numbers from the yml file and running it again, i.e. with the latest versions of all packages (gym, mujoco-2.0, pytorch-1.2, etc.). All the packages are successfully installed, but the tests in mj_envs and dapg fail (to be expected, I suppose).
mj_envs -->python utils/visualize_env.py --env_name hammer-v0
dapg --> python utils/visualize_demos.py --env_name relocate-v0
I tried a bunch of things to get mujoco-1.5 working, but it refuses to build successfully using pip. This seems to be a common problem with mujoco, for which the usual solution is to build from source (which always builds the latest version, I suppose?).
Should I be using an older version of conda? What is the workaround for this? Or better yet, do you have an updated codebase that compiles with newer package versions?
Thanks!
Hello!!
I want to know the optimal hyperparameters to replicate the experiments presented in the paper with D4RL, in particular the configurations for Walker2d and HalfCheetah, because I am not achieving good results.
Thanks for the code and all the work!!
On line 69 here: https://github.com/aravindr93/mjrl/blob/master/mjrl/algos/dapg.py you have:
all_adv = 1e-2*np.concatenate([advantages/(np.std(advantages) + 1e-8), demo_adv])
I'm just curious what the rationale is behind multiplying by 1e-2 here, especially since, if you don't concatenate the demo paths, you don't scale in this way.
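A minimal numpy sketch of what that line computes (my own illustration, not repo code): the sampled advantages are normalized to unit standard deviation and then the whole concatenated vector is shrunk by 1e-2, so the sampled part ends up with standard deviation around 0.01.

```python
import numpy as np

rng = np.random.default_rng(0)
advantages = rng.normal(loc=2.0, scale=5.0, size=100)  # on-policy advantages
demo_adv = np.ones(20)                                 # demo weighting terms

# normalize sampled advantages to unit std, then scale everything by 1e-2
all_adv = 1e-2 * np.concatenate([advantages / (np.std(advantages) + 1e-8),
                                 demo_adv])

print(np.std(all_adv[:100]))  # ~0.01: sampled part has unit std before scaling
```

Note that the 1e-2 multiplies both the sampled and demo parts, so it changes the overall magnitude of the policy-gradient term rather than the mix between them; whether that matters depends on how the step size is normalized downstream.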
Holds a doctorate in radiation chemistry.
Looking at https://github.com/aravindr93/mjrl/blob/master/examples/linear_nn_comparison.py, together with a linear policy an mlp is used as the value function approximator. Is this what was used for the experiments in the paper as well?
Hi.
In the MOReL experiments using the D4RL dataset, I think actions fed into the dynamics models should be clipped into the range [-1, 1].
Thank you.
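The suggested fix is a one-liner; a minimal sketch (the array values are made up for illustration):

```python
import numpy as np

# clip dataset actions to the environment's action bounds before
# feeding them to the dynamics models
actions = np.array([[1.3, -0.2], [-1.7, 0.8]])
clipped = np.clip(actions, -1.0, 1.0)
print(clipped)  # [[ 1.  -0.2], [-1.   0.8]]
```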
Hi Aravind, looking at https://github.com/aravindr93/mjrl/blob/master/mjrl/algos/npg_cg.py#L68 the mean KL is computed between the new and old policy distributions. But I believe at this point the distributions are always the same since both the new and old policies are updated at the end of each iteration. Is this the case? And why does it make sense to do this?
I may be misunderstanding the code segment but why wouldn't the Fisher be the Hessian of the KL between the distribution defined by the current params and that defined by a step given current solution to the conjugate gradient optimization?
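A minimal PyTorch sketch (my own, not mjrl's code) of why a mean KL that evaluates to 0 at the current parameters is still useful: for a diagonal Gaussian, the KL between the live distribution and a detached copy of itself is exactly 0 and its gradient is 0, yet the Hessian of that KL with respect to the live parameters is the Fisher matrix, which is all the conjugate-gradient solver needs via Hessian-vector products.

```python
import torch

# a 2-D diagonal Gaussian policy "head" with live parameters
mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_std = torch.tensor([0.1, 0.2], requires_grad=True)

def mean_kl():
    # KL(old || new) with "old" = detached copy of the current parameters
    mu_old, log_std_old = mu.detach(), log_std.detach()
    var, var_old = (2 * log_std).exp(), (2 * log_std_old).exp()
    kl = log_std - log_std_old + (var_old + (mu_old - mu) ** 2) / (2 * var) - 0.5
    return kl.sum()

kl = mean_kl()
print(kl.item())  # 0.0 at the current parameters

# Hessian-vector product: differentiate (grad(kl) . v) a second time
v = torch.ones(4)
g = torch.autograd.grad(kl, [mu, log_std], create_graph=True)
flat_g = torch.cat([p.reshape(-1) for p in g])
hvp = torch.autograd.grad(flat_g @ v, [mu, log_std])
hvp_flat = torch.cat([h.reshape(-1) for h in hvp])
print(hvp_flat)  # nonzero Fisher-vector product despite kl == 0
```

So recomputing the KL against a distribution that currently equals the new one is deliberate: the value and first derivative vanish, but the curvature does not.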
If I start many concurrent jobs, I run into timeouts of the policy sampler. If I see it correctly, the code has been designed explicitly to handle such timeouts and restart the sampler to mitigate the problem. Is it known what the source of such timeouts can be?
Hello, do you have the code to implement Gail for this library?
Hi, I got here after reading your paper "Towards Generalization and Simplicity in Continuous Control" and I was wondering if the file https://github.com/aravindr93/mjrl/blob/master/mjrl/policies/gaussian_mlp.py was the one used in your paper? The paper claims to use linear policies, but it seems this network is an MLP?
obs = e.obs_mask * raw_obs[::act_repeat]
act = np.array([np.mean(raw_act[i * act_repeat : (i+1) * act_repeat], axis=0) for i in range(traj_length // act_repeat)])
rew = np.array([np.sum(raw_rew[i * act_repeat : (i+1) * act_repeat]) for i in range(traj_length // act_repeat)])
What is act_repeat's significance?
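A small runnable sketch of the frame-skip bookkeeping in the snippet above, as I read it: with act_repeat = k, each policy step spans k low-level env steps, so observations are subsampled every k steps, actions are averaged over each window, and rewards are summed over each window (the toy data below is mine).

```python
import numpy as np

act_repeat, traj_length = 2, 6
raw_obs = np.arange(traj_length, dtype=float).reshape(-1, 1)
raw_act = np.ones((traj_length, 1))
raw_rew = np.ones(traj_length)

obs = raw_obs[::act_repeat]  # one observation per policy step
act = np.array([raw_act[i * act_repeat:(i + 1) * act_repeat].mean(axis=0)
                for i in range(traj_length // act_repeat)])
rew = np.array([raw_rew[i * act_repeat:(i + 1) * act_repeat].sum()
                for i in range(traj_length // act_repeat)])
print(obs.shape, act.shape, rew)  # (3, 1) (3, 1) [2. 2. 2.]
```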
Hi,
I want to check out the code for the RBF kernel in your paper Towards Generalization and Simplicity in Continuous Control, but it seems that you didn't implement it in this repo. Can you upload the RBF code to this repo?
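In case it helps while the official code is unavailable, here is a hedged sketch of the random-Fourier-feature (RBF) construction as the paper describes it: y_i(s) = sin((P s)_i / nu + phi_i), with P drawn from a standard normal and the phase phi uniform in [-pi, pi); a linear policy then acts on y(s). The function and parameter names below are mine, not from mjrl.

```python
import numpy as np

def make_rbf_features(obs_dim, num_features, bandwidth, seed=0):
    # random projection P and random phases phi are sampled once and frozen
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(num_features, obs_dim))
    phi = rng.uniform(-np.pi, np.pi, size=num_features)
    return lambda s: np.sin(P @ s / bandwidth + phi)

features = make_rbf_features(obs_dim=11, num_features=500, bandwidth=5.0)
y = features(np.zeros(11))
print(y.shape)  # (500,)
```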
To be fixed in batch_reinforce.py
and possibly other locations.
Thanks for your code. However, I find that the implementation is different from the MOReL paper.
This implementation truncates the uncertain rollouts instead of assigning the negative reward.
If I haven't misunderstood your code, could you explain why there is this difference? And can you release code that exactly follows the algorithm described in your paper?
Looking forward to your replies. Thanks a lot.
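To make the distinction concrete, here is a hedged sketch of the paper-style pessimistic reward (the names kappa, disagreement, and threshold are mine, following the paper's description, not this repo's code): transitions whose ensemble disagreement exceeds the unknown-state threshold get reward -kappa, whereas truncation simply ends the rollout there.

```python
import numpy as np

kappa = 100.0  # large penalty for entering the "unknown" region

def pessimistic_reward(reward, disagreement, threshold):
    # paper-style: replace the reward with -kappa wherever the model
    # ensemble disagrees more than the threshold
    return np.where(disagreement > threshold, -kappa, reward)

rewards = np.array([1.0, 1.2, 0.9])
disagreement = np.array([0.01, 0.50, 0.02])
shaped = pessimistic_reward(rewards, disagreement, threshold=0.1)
print(shaped)  # [   1.  -100.     0.9]
```

One plausible reading is that truncating at the first uncertain step approximates the -kappa penalty when kappa is very large, since any rollout entering the unknown region would receive a dominating negative return either way; only the author can confirm the intended equivalence.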
Hello,
I am using the morel code and am trying to understand how the obs_mask variable was decided for Hopper-v3 task. If I set all the values in the obs_mask to be 1.0, the generalization loss for the dynamics model is much higher (order 1e-3 instead of 1e-7) and morel does not perform as well. Is obs_mask necessary for getting the pipeline to work, and if so, how can we obtain the correct obs_mask for other environments such as HalfCheetah, Walker-2D or hand_dapg tasks?
Thanks!
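For reference, a minimal sketch of how such a mask acts (the values below are hypothetical, not the Hopper-v3 mask): it is an elementwise scaling that suppresses or rescales selected observation dimensions before they reach the dynamics model, e.g. to bring dimensions with very different magnitudes onto a comparable scale.

```python
import numpy as np

obs_mask = np.array([1.0, 1.0, 1e-2, 1.0])   # down-weight the 3rd dimension
raw_obs = np.array([0.5, -1.0, 200.0, 0.3])  # 3rd dim has a much larger scale
obs = obs_mask * raw_obs
print(obs)  # [ 0.5 -1.   2.   0.3]
```

If rescaling is indeed the purpose, one could presumably derive a mask for other environments from per-dimension statistics of the dataset, but that is a guess pending the author's answer.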
Hi,
Thanks so much for the open-sourced work. The results presented here are very impressive, but I really cannot reproduce the experimental results shown on the README page, especially for the random datasets in the D4RL benchmarks. Could you please share the config files and reward_functions.py used for the D4RL benchmarks?
Btw:
Here are my results on the hopper-random-v0 and hopper-medium-v0 datasets, using the default hyperparameters from d4rl_hopper_medium.txt:
Hello,
looking at the code in that repo, I came across this variable:
mjrl/mjrl/baselines/linear_baseline.py
Lines 16 to 17 in 4a7c219
the last time I saw something similar, that was in the OpenAI baselines code (here the stable baselines fork):
I assume this encodes some information about the time (cf. the plot below of the different features), but I'm still wondering what al stands for and where it comes from (there is no mention of such a feature in either paper).
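For context, a sketch of the feature construction in linear_baseline.py as I read it (variable names follow the repo, the toy data is mine): al appears to be a normalized timestep index, arange(T)/1000, appended together with its powers so the linear value baseline can depend on time within the episode.

```python
import numpy as np

T = 5  # episode length
obs = np.random.default_rng(0).normal(size=(T, 3))
al = np.arange(T).reshape(-1, 1) / 1000.0  # normalized timestep feature

# feature matrix: observation, squared observation, time, time^2, time^3, bias
feat = np.concatenate([obs, obs**2, al, al**2, al**3, np.ones((T, 1))], axis=1)
print(feat.shape)  # (5, 10)
```

Under that reading, al is simply "arange over the episode length", scaled to keep the polynomial time terms small.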
implimentations -> implementations
I am working based on the code from repo hand_dapg. First of all thank you very much for open sourcing this great work!
Because I need to work with mujoco-py 2.0, I have switched to the new 'redesign' code base for both mjrl and mj_envs. The problem is that now the dapg example from https://github.com/aravindr93/hand_dapg/tree/master/dapg/examples does not learn anymore; please have a look at the training results below.
For the code from master I get the following results for iteration 78:
---------------- ------------
VF_error_after 0.108092
VF_error_before 0.129821
alpha 208.393
delta 0.1
eval_score 2932.27
kl_dist 0.0504399
running_score 1543.33
stoc_pol_max 4359.31
stoc_pol_mean 2119.18
stoc_pol_min -11.1661
stoc_pol_std 1347.49
success_rate 79
surr_improvement 0.045797
time_VF 3.60636
time_npg 2.59354
time_sampling 12.1133
time_vpg 0.136289
---------------- ------------
and with redesign (here I switched from behavior_cloning_2 to behavior_cloning, because the former was removed from the repo):
---------------- -------------
VF_error_after 0.385029
VF_error_before 0.671899
alpha 383.606
delta 0.1
eval_score 366.002
kl_dist 0.0500591
num_samples 40000
running_score 23.3569
stoc_pol_max 667.282
stoc_pol_mean 29.4223
stoc_pol_min -15.7963
stoc_pol_std 87.9433
success_rate 2.5
surr_improvement 0.0288742
time_VF 3.47379
time_npg 2.64448
time_sampling 8.92309
time_vpg 0.137907
---------------- -------------
This difference in performance was confirmed by subsequent runs of both code versions.
Is there anything that should be adjusted in the dapg example to achieve the same learning performance as before?