aravindr93 / mjrl
Reinforcement learning algorithms for MuJoCo tasks
License: Apache License 2.0
Hi,
In the original code base, the algorithm does not learn the reward function; instead, the agent computes the cumulative reward using the true reward function obtained directly from the env. That means we are actually using ground-truth information from the environment. Is that allowed in the offline setting? Also, after changing the 'learn_reward' option to True in the config file, the performance is much worse.
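To make the distinction in the question concrete, here is a hypothetical illustration (all names are mine, not from the mjrl codebase): in the strict offline setting the true reward oracle below would not be available, and a reward model would have to be fit on the logged transitions instead.

```python
import numpy as np

def true_reward(s, a):
    # stand-in for querying the env's reward function directly;
    # using this at training time leaks ground-truth information
    return float(-s.sum() - 0.1 * a.sum())

def fit_reward_model(states, actions, rewards):
    # least-squares fit r_hat(s, a) = w . [s, a] on the offline transitions
    X = np.hstack([states, actions])
    w, *_ = np.linalg.lstsq(X, rewards, rcond=None)
    return lambda s, a: float(np.concatenate([s, a]) @ w)

rng = np.random.default_rng(0)
S, A = rng.normal(size=(256, 3)), rng.normal(size=(256, 2))
R = np.array([true_reward(s, a) for s, a in zip(S, A)])

r_hat = fit_reward_model(S, A, R)  # usable without touching the env again
print(abs(true_reward(S[0], A[0]) - r_hat(S[0], A[0])))
```

A learned model like `r_hat` only approximates the oracle on states outside the dataset, which is one plausible reason setting learn_reward=True degrades performance.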
I do not understand why, when calculating the Hessian-vector product using gradients of the KL divergence, the KL is always 0. That doesn't make any sense to me.
mjrl/mjrl/utils/make_train_plots.py
Lines 5 to 6 in af58ff2
Hi,
Thanks for open sourcing the code! Great work!
I am trying to clone the repos as instructed in hand_dapg. I am using Ubuntu 18.04 and conda version 4.7.12. However, running the following command throws up a host of errors:
conda env create -f setup/linux.yml
Result:
Collecting package metadata (repodata.json): done
Solving environment: \
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
....
There is a long list of packages that turn out to be incompatible with versions of other packages. I tried removing the version numbers from the yml file and running it again, i.e. with the latest versions of all packages (gym, mujoco-2.0, pytorch-1.2, etc.). All the packages are successfully installed, but the tests in mj_envs and dapg fail (to be expected, I suppose).
mj_envs -->python utils/visualize_env.py --env_name hammer-v0
dapg --> python utils/visualize_demos.py --env_name relocate-v0
I tried a bunch of things to get mujoco-1.5 working, but it refuses to build successfully using pip. This seems to be a common problem with mujoco, for which the usual solution is to build from source (which always builds the latest version, I suppose?).
Should I be using an older version of conda? What is the workaround for this? Or better yet, do you have an updated codebase that compiles with newer package versions?
Thanks!
Hello!!
I want to know the optimal hyperparameters to replicate the experiments presented in the paper with D4RL, in particular the configurations for Walker2d and HalfCheetah, because I am not achieving good results.
Thanks for the code and all the work!!
On line 69 here: https://github.com/aravindr93/mjrl/blob/master/mjrl/algos/dapg.py you have:
all_adv = 1e-2*np.concatenate([advantages/(np.std(advantages) + 1e-8), demo_adv])
I'm just curious what the rationale is behind multiplying by 1e-2 here, especially since, if you don't concatenate the demo paths, you don't scale in this way.
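A minimal numpy sketch of what that line computes (my own illustration, not repo code): the sampled advantages are normalized to unit standard deviation and then the whole concatenated vector is shrunk by 1e-2, so the sampled part ends up with standard deviation around 0.01.

```python
import numpy as np

rng = np.random.default_rng(0)
advantages = rng.normal(loc=2.0, scale=5.0, size=100)  # on-policy advantages
demo_adv = np.ones(20)                                 # demo weighting terms

# normalize sampled advantages to unit std, then scale everything by 1e-2
all_adv = 1e-2 * np.concatenate([advantages / (np.std(advantages) + 1e-8),
                                 demo_adv])

print(np.std(all_adv[:100]))  # ~0.01: sampled part has unit std before scaling
```

Note that the 1e-2 multiplies both the sampled and demo parts, so it changes the overall magnitude of the policy-gradient term rather than the mix between them; whether that matters depends on how the step size is normalized downstream.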
Holds a doctorate in radiation chemistry.
Looking at https://github.com/aravindr93/mjrl/blob/master/examples/linear_nn_comparison.py, together with a linear policy an mlp is used as the value function approximator. Is this what was used for the experiments in the paper as well?
Hi.
In the MOReL experiments using the D4RL dataset, I think actions fed into the dynamics models should be clipped into the range [-1, 1].
Thank you.
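The suggested fix is a one-liner; a minimal sketch (the array values are made up for illustration):

```python
import numpy as np

# clip dataset actions to the environment's action bounds before
# feeding them to the dynamics models
actions = np.array([[1.3, -0.2], [-1.7, 0.8]])
clipped = np.clip(actions, -1.0, 1.0)
print(clipped)  # [[ 1.  -0.2], [-1.   0.8]]
```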
Hi Aravind, looking at https://github.com/aravindr93/mjrl/blob/master/mjrl/algos/npg_cg.py#L68 the mean KL is computed between the new and old policy distributions. But I believe at this point the distributions are always the same since both the new and old policies are updated at the end of each iteration. Is this the case? And why does it make sense to do this?
I may be misunderstanding the code segment but why wouldn't the Fisher be the Hessian of the KL between the distribution defined by the current params and that defined by a step given current solution to the conjugate gradient optimization?
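A minimal PyTorch sketch (my own, not mjrl's code) of why a mean KL that evaluates to 0 at the current parameters is still useful: for a diagonal Gaussian, the KL between the live distribution and a detached copy of itself is exactly 0 and its gradient is 0, yet the Hessian of that KL with respect to the live parameters is the Fisher matrix, which is all the conjugate-gradient solver needs via Hessian-vector products.

```python
import torch

# a 2-D diagonal Gaussian policy "head" with live parameters
mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_std = torch.tensor([0.1, 0.2], requires_grad=True)

def mean_kl():
    # KL(old || new) with "old" = detached copy of the current parameters
    mu_old, log_std_old = mu.detach(), log_std.detach()
    var, var_old = (2 * log_std).exp(), (2 * log_std_old).exp()
    kl = log_std - log_std_old + (var_old + (mu_old - mu) ** 2) / (2 * var) - 0.5
    return kl.sum()

kl = mean_kl()
print(kl.item())  # 0.0 at the current parameters

# Hessian-vector product: differentiate (grad(kl) . v) a second time
v = torch.ones(4)
g = torch.autograd.grad(kl, [mu, log_std], create_graph=True)
flat_g = torch.cat([p.reshape(-1) for p in g])
hvp = torch.autograd.grad(flat_g @ v, [mu, log_std])
hvp_flat = torch.cat([h.reshape(-1) for h in hvp])
print(hvp_flat)  # nonzero Fisher-vector product despite kl == 0
```

So recomputing the KL against a distribution that currently equals the new one is deliberate: the value and first derivative vanish, but the curvature does not.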
If I start many concurrent jobs, I run into timeouts of the policy sampler. If I see it correctly, the code has been designed explicitly to handle such timeouts and restart the sampler to mitigate the problem. Is it known what the source of such timeouts can be?
Hello, do you have the code to implement Gail for this library?
Hi, I got here after reading your paper "Towards Generalization and Simplicity in Continuous Control" and I was wondering if the file https://github.com/aravindr93/mjrl/blob/master/mjrl/policies/gaussian_mlp.py was the one used in your paper? The paper claims to use linear policies, but it seems this network is an MLP?
obs = e.obs_mask * raw_obs[::act_repeat]
act = np.array([np.mean(raw_act[i * act_repeat : (i+1) * act_repeat], axis=0) for i in range(traj_length // act_repeat)])
rew = np.array([np.sum(raw_rew[i * act_repeat : (i+1) * act_repeat]) for i in range(traj_length // act_repeat)])
What is act_repeat's significance?
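A small runnable sketch of the frame-skip bookkeeping in the snippet above, as I read it: with act_repeat = k, each policy step spans k low-level env steps, so observations are subsampled every k steps, actions are averaged over each window, and rewards are summed over each window (the toy data below is mine).

```python
import numpy as np

act_repeat, traj_length = 2, 6
raw_obs = np.arange(traj_length, dtype=float).reshape(-1, 1)
raw_act = np.ones((traj_length, 1))
raw_rew = np.ones(traj_length)

obs = raw_obs[::act_repeat]  # one observation per policy step
act = np.array([raw_act[i * act_repeat:(i + 1) * act_repeat].mean(axis=0)
                for i in range(traj_length // act_repeat)])
rew = np.array([raw_rew[i * act_repeat:(i + 1) * act_repeat].sum()
                for i in range(traj_length // act_repeat)])
print(obs.shape, act.shape, rew)  # (3, 1) (3, 1) [2. 2. 2.]
```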
Hi,
I want to check out the code for the RBF kernel in your paper Towards Generalization and Simplicity in Continuous Control, but it seems that you didn't implement it in this repo. Can you upload the RBF code to this repo?
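In case it helps while the official code is unavailable, here is a hedged sketch of the random-Fourier-feature (RBF) construction as the paper describes it: y_i(s) = sin((P s)_i / nu + phi_i), with P drawn from a standard normal and the phase phi uniform in [-pi, pi); a linear policy then acts on y(s). The function and parameter names below are mine, not from mjrl.

```python
import numpy as np

def make_rbf_features(obs_dim, num_features, bandwidth, seed=0):
    # random projection P and random phases phi are sampled once and frozen
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(num_features, obs_dim))
    phi = rng.uniform(-np.pi, np.pi, size=num_features)
    return lambda s: np.sin(P @ s / bandwidth + phi)

features = make_rbf_features(obs_dim=11, num_features=500, bandwidth=5.0)
y = features(np.zeros(11))
print(y.shape)  # (500,)
```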
To be fixed in batch_reinforce.py
and possibly other locations.
Thanks for your code. However, I find that the implementation is different from the MOReL paper.
This implementation truncates the uncertain rollouts instead of assigning the negative reward.
If I haven't misunderstood your code, could you explain why there is this difference? And can you release code that exactly follows the algorithm described in your paper?
Looking forward to your replies. Thanks a lot.
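To make the distinction concrete, here is a hedged sketch of the paper-style pessimistic reward (the names kappa, disagreement, and threshold are mine, following the paper's description, not this repo's code): transitions whose ensemble disagreement exceeds the unknown-state threshold get reward -kappa, whereas truncation simply ends the rollout there.

```python
import numpy as np

kappa = 100.0  # large penalty for entering the "unknown" region

def pessimistic_reward(reward, disagreement, threshold):
    # paper-style: replace the reward with -kappa wherever the model
    # ensemble disagrees more than the threshold
    return np.where(disagreement > threshold, -kappa, reward)

rewards = np.array([1.0, 1.2, 0.9])
disagreement = np.array([0.01, 0.50, 0.02])
shaped = pessimistic_reward(rewards, disagreement, threshold=0.1)
print(shaped)  # [   1.  -100.     0.9]
```

One plausible reading is that truncating at the first uncertain step approximates the -kappa penalty when kappa is very large, since any rollout entering the unknown region would receive a dominating negative return either way; only the author can confirm the intended equivalence.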
Hello,
I am using the morel code and am trying to understand how the obs_mask variable was decided for Hopper-v3 task. If I set all the values in the obs_mask to be 1.0, the generalization loss for the dynamics model is much higher (order 1e-3 instead of 1e-7) and morel does not perform as well. Is obs_mask necessary for getting the pipeline to work, and if so, how can we obtain the correct obs_mask for other environments such as HalfCheetah, Walker-2D or hand_dapg tasks?
Thanks!
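For reference, a minimal sketch of how such a mask acts (the values below are hypothetical, not the Hopper-v3 mask): it is an elementwise scaling that suppresses or rescales selected observation dimensions before they reach the dynamics model, e.g. to bring dimensions with very different magnitudes onto a comparable scale.

```python
import numpy as np

obs_mask = np.array([1.0, 1.0, 1e-2, 1.0])   # down-weight the 3rd dimension
raw_obs = np.array([0.5, -1.0, 200.0, 0.3])  # 3rd dim has a much larger scale
obs = obs_mask * raw_obs
print(obs)  # [ 0.5 -1.   2.   0.3]
```

If rescaling is indeed the purpose, one could presumably derive a mask for other environments from per-dimension statistics of the dataset, but that is a guess pending the author's answer.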
Hi,
Thanks so much for the open-sourced work. The results presented here are very impressive, but I really cannot reproduce the experimental results shown on the README page, especially for the random datasets in the D4RL benchmarks. Could you please share the config files and reward_functions.py used for the D4RL benchmarks?
Btw:
Here are my results on the hopper-random-v0 and hopper-medium-v0 datasets, using the default hyperparameters from d4rl_hopper_medium.txt:
Hello,
looking at the code in that repo, I came across this variable:
mjrl/mjrl/baselines/linear_baseline.py
Lines 16 to 17 in 4a7c219
the last time I saw something similar, that was in the OpenAI baselines code (here the stable baselines fork):
I assume this encodes some information about the time (cf. the plot below of the different features), but I'm still wondering what al stands for and where it comes from (there is no mention of such a feature in either paper).
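For context, a sketch of the feature construction in linear_baseline.py as I read it (variable names follow the repo, the toy data is mine): al appears to be a normalized timestep index, arange(T)/1000, appended together with its powers so the linear value baseline can depend on time within the episode.

```python
import numpy as np

T = 5  # episode length
obs = np.random.default_rng(0).normal(size=(T, 3))
al = np.arange(T).reshape(-1, 1) / 1000.0  # normalized timestep feature

# feature matrix: observation, squared observation, time, time^2, time^3, bias
feat = np.concatenate([obs, obs**2, al, al**2, al**3, np.ones((T, 1))], axis=1)
print(feat.shape)  # (5, 10)
```

Under that reading, al is simply "arange over the episode length", scaled to keep the polynomial time terms small.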
implimentations -> implementations
I am working based on the code from repo hand_dapg. First of all thank you very much for open sourcing this great work!
Because I need to work with mujoco-py 2.0, I have switched to the new 'redesign' code base for both mjrl and mj_envs. The problem is that now the dapg example from https://github.com/aravindr93/hand_dapg/tree/master/dapg/examples does not learn anymore; please have a look at the training results below.
For the code from master I get the following results for iteration 78:
---------------- ------------
VF_error_after 0.108092
VF_error_before 0.129821
alpha 208.393
delta 0.1
eval_score 2932.27
kl_dist 0.0504399
running_score 1543.33
stoc_pol_max 4359.31
stoc_pol_mean 2119.18
stoc_pol_min -11.1661
stoc_pol_std 1347.49
success_rate 79
surr_improvement 0.045797
time_VF 3.60636
time_npg 2.59354
time_sampling 12.1133
time_vpg 0.136289
---------------- ------------
and with redesign (here I switched from behavior_cloning_2 to behavior_cloning, because the former was removed from the repo):
---------------- -------------
VF_error_after 0.385029
VF_error_before 0.671899
alpha 383.606
delta 0.1
eval_score 366.002
kl_dist 0.0500591
num_samples 40000
running_score 23.3569
stoc_pol_max 667.282
stoc_pol_mean 29.4223
stoc_pol_min -15.7963
stoc_pol_std 87.9433
success_rate 2.5
surr_improvement 0.0288742
time_VF 3.47379
time_npg 2.64448
time_sampling 8.92309
time_vpg 0.137907
---------------- -------------
This difference in performance was confirmed by subsequent runs of both code versions.
Is there anything that should be adjusted in the dapg example to achieve the same learning performance as before?