aravindr93 / mjrl

Reinforcement learning algorithms for MuJoCo tasks

License: Apache License 2.0

Language: Python (100.00%)
Topics: mujoco, reinforcement-learning, robotics, simulation

mjrl's People

Contributors: aravindr93, bennevans, vikashplus, wookayin, zafarali

mjrl's Issues

Not learning reward?

Hi,

In the original code base, the algorithm does not learn the reward function; instead, the agent computes the cumulative reward using the true reward function (obtained directly from the env). This means we are actually using ground-truth information from the environment; is that allowed in the offline setting? Also, after changing the 'learn_reward' option to True in the config file, the performance is much worse.
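For concreteness, the distinction being asked about can be sketched as follows (the names rollout_reward, env_reward_fn, and reward_model are hypothetical, not the repo's API):

def rollout_reward(s, a, env_reward_fn, reward_model=None, learn_reward=False):
    # learn_reward=True: score transitions with a reward model fit purely on the offline dataset.
    if learn_reward and reward_model is not None:
        return reward_model(s, a)
    # learn_reward=False: query the environment's ground-truth reward function directly.
    return env_reward_fn(s, a)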

NPG kl_mean is always 0

I do not understand why, when calculating the Hessian-vector product using the gradients of the KL divergence, the KL is always 0. That doesn't seem to make sense.
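For context, the usual reason for this is that the KL between the current policy and a frozen copy of itself is identically zero, yet its second derivative is the Fisher information matrix, which is exactly what the conjugate-gradient solver needs. A minimal PyTorch sketch of that Fisher-vector-product trick (illustrative, not the repo's exact code):

import torch

def fisher_vector_product(kl, params, vector, damping=1e-4):
    # kl: scalar KL(pi_old || pi_new) evaluated at the current parameters (numerically zero).
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    # Contract the (zero-valued) gradient with the vector, then differentiate again:
    # the result is F v, the curvature of the KL, even though the KL itself is zero.
    grad_vector_dot = (flat_grad * vector).sum()
    hvp = torch.autograd.grad(grad_vector_dot, params)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * vector  # small damping for numerical stability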

Unable to complete the installation from the provided yml file

Hi,

Thanks for open sourcing the code! Great work!
I am trying to clone the repos as instructed in hand_dapg. I am using Ubuntu 18.04 and conda version 4.7.12. However, running the following command throws up a host of errors:
conda env create -f setup/linux.yml

Result:

Collecting package metadata (repodata.json): done
Solving environment: \
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

....

There is a long list of packages that turn out to be incompatible with versions of other packages. I tried removing the version numbers from the yml file and running it again, i.e. with the latest versions of all packages (gym, mujoco-2.0, pytorch-1.2, etc.). All the packages install successfully, but the tests in mj_envs and dapg fail (to be expected, I suppose):
mj_envs --> python utils/visualize_env.py --env_name hammer-v0
dapg --> python utils/visualize_demos.py --env_name relocate-v0

I tried a bunch of things to get mujoco-1.5 working, but it refuses to build successfully using pip. This seems to be a common problem with MuJoCo, for which the usual solution is to build from source (which always builds the latest version, I suppose?).
Should I be using an older version of conda? What is the workaround for this? Or, better yet, do you have an updated codebase that works with newer package versions?

Thanks!

Hyperparams for D4RL Mujoco tasks

Hello!!

I want to know the optimal hyperparameters to replicate the experiments presented in the paper with D4RL, in particular the configuration for Walker2d and HalfCheetah, because I am not achieving good results.

Thanks for the code and all the work!!

Morocco

Holds a doctorate in radiation chemistry

Is mean KL always zero?

Hi Aravind, looking at https://github.com/aravindr93/mjrl/blob/master/mjrl/algos/npg_cg.py#L68, the mean KL is computed between the new and old policy distributions. But I believe at this point the distributions are always the same, since both the new and old policies are updated at the end of each iteration. Is this the case? And why does it make sense to do this?
I may be misunderstanding the code segment, but why wouldn't the Fisher be the Hessian of the KL between the distribution defined by the current params and the one defined by a step away from them, given the current solution of the conjugate gradient optimization?
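For reference, the general identity behind this construction (a standard fact about natural gradient methods, not specific to this repo) is that the KL and its gradient both vanish at the current parameters, while its Hessian equals the Fisher information matrix:

$$
\mathrm{KL}\!\left(\pi_{\theta}\,\|\,\pi_{\theta'}\right)\Big|_{\theta'=\theta}=0,\qquad
\nabla_{\theta'}\,\mathrm{KL}\!\left(\pi_{\theta}\,\|\,\pi_{\theta'}\right)\Big|_{\theta'=\theta}=0,\qquad
\nabla_{\theta'}^{2}\,\mathrm{KL}\!\left(\pi_{\theta}\,\|\,\pi_{\theta'}\right)\Big|_{\theta'=\theta}=F(\theta).
$$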

Sampler timeouts when running many concurrent trainings

If I start many concurrent jobs, I run into timeouts of the policy sampler. If I see it correctly, the code has been designed explicitly to handle such timeouts and restart the sampler to mitigate the problem. Is it known what the source of such timeouts can be?
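For what it's worth, the restart-on-timeout pattern described above typically looks something like this sketch (sample_with_restart and its arguments are hypothetical names, not the repo's sampler API):

from multiprocessing import Pool, TimeoutError

def sample_with_restart(work_fn, args_list, num_workers=4, timeout=120, max_retries=3):
    for _ in range(max_retries):
        pool = Pool(num_workers)
        try:
            handles = [pool.apply_async(work_fn, args) for args in args_list]
            paths = [h.get(timeout=timeout) for h in handles]
            pool.close()
            pool.join()
            return paths
        except TimeoutError:
            # A worker hung (e.g. because many concurrent jobs oversubscribe the machine):
            # kill the pool and retry with a fresh one.
            pool.terminate()
            pool.join()
    raise RuntimeError("sampler timed out on every retry")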

GAIL

Hello, do you have code implementing GAIL for this library?

What's the intuition behind act_repeat?

obs = e.obs_mask * raw_obs[::act_repeat]  # keep every act_repeat-th observation, scaled by the mask
act = np.array([np.mean(raw_act[i * act_repeat : (i+1) * act_repeat], axis=0) for i in range(traj_length // act_repeat)])  # average the actions within each block of act_repeat steps
rew = np.array([np.sum(raw_rew[i * act_repeat : (i+1) * act_repeat]) for i in range(traj_length // act_repeat)])  # sum the rewards within each block of act_repeat steps

What is act_repeat's significance?
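As a toy illustration of what this downsampling does, assuming act_repeat = 4 and hypothetical array shapes (and ignoring obs_mask):

import numpy as np

act_repeat, traj_length = 4, 100
raw_obs = np.random.randn(traj_length, 11)   # hypothetical observation trajectory
raw_act = np.random.randn(traj_length, 3)    # hypothetical action trajectory
raw_rew = np.random.randn(traj_length)       # hypothetical reward trajectory

obs = raw_obs[::act_repeat]                                    # (25, 11): keep every 4th observation
act = np.array([raw_act[i*act_repeat:(i+1)*act_repeat].mean(axis=0)
                for i in range(traj_length // act_repeat)])    # (25, 3): average actions in each block
rew = np.array([raw_rew[i*act_repeat:(i+1)*act_repeat].sum()
                for i in range(traj_length // act_repeat)])    # (25,): sum rewards in each block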

No RBF code?

Hi,

I want to check out the code for the RBF kernel in your paper Towards Generalization and Simplicity in Continuous Control, but it seems that you didn't implement it in this repo. Can you upload the RBF code to this repo?
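For reference, the RBF features in that paper are random Fourier features; a minimal sketch of that general construction (my reading of the paper, not the authors' code) is:

import numpy as np

class RBFFeatures:
    def __init__(self, obs_dim, num_features=500, bandwidth=1.0, seed=0):
        rng = np.random.RandomState(seed)
        self.P = rng.randn(num_features, obs_dim)               # random projection directions ~ N(0, 1)
        self.phase = rng.uniform(-np.pi, np.pi, num_features)   # random phase shifts
        self.bandwidth = bandwidth                               # kernel bandwidth, tuned per task

    def __call__(self, obs):
        # obs: (batch, obs_dim) -> features: (batch, num_features)
        return np.sin(obs @ self.P.T / self.bandwidth + self.phase)

A linear policy or value function can then be trained on top of these fixed random features.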

For MOReL, do you truncate the uncertain rollouts instead of assigning the negative reward?

Thanks for your code. However, I find that the implementation differs from the MOReL paper: this implementation truncates the uncertain rollouts instead of assigning the negative reward.

If I haven't misunderstood your code, could you explain why there is this difference? And could you release code that exactly follows the algorithm described in your paper?

Looking forward to your reply. Thanks a lot.
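For concreteness, the two variants being contrasted look roughly like this sketch (model, disagreement, threshold, and halt_penalty are hypothetical names, not the repo's code):

def step_pessimistic(s, a, model, disagreement, threshold, halt_penalty):
    # Paper-style pessimistic MDP: treat the state as absorbing (HALT) and pay a large
    # negative reward whenever the ensemble disagreement exceeds the threshold.
    s_next, r = model(s, a)
    if disagreement(s, a) > threshold:
        return s_next, -halt_penalty, True   # done=True, penalized
    return s_next, r, False

def step_truncate(s, a, model, disagreement, threshold):
    # Truncation variant: simply end the model rollout at uncertain states, with no penalty.
    s_next, r = model(s, a)
    if disagreement(s, a) > threshold:
        return s_next, r, True               # done=True, no penalty
    return s_next, r, False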

Understanding obs_mask

Hello,

I am using the MOReL code and am trying to understand how the obs_mask variable was decided for the Hopper-v3 task. If I set all the values in obs_mask to 1.0, the generalization loss of the dynamics model is much higher (order 1e-3 instead of 1e-7) and MOReL does not perform as well. Is obs_mask necessary for getting the pipeline to work, and if so, how can we obtain the correct obs_mask for other environments such as HalfCheetah, Walker2d, or the hand_dapg tasks?

Thanks!

Experimental results of MOReL for D4RL benchmarks

Hi,
Thanks so much for the open-sourced work. The results presented here are very impressive, but I really cannot reproduce the experimental results shown on the README page, especially for the random datasets in the D4RL benchmarks. Could you please share the config files and the "reward_functions" .py files used for the D4RL benchmarks?

Btw, here are my results on the hopper-random-v0 and hopper-medium-v0 datasets, using the default hyperparameters from d4rl_hopper_medium.txt:

[Plot: eval_score]

[Plot: eval_score_medium]

[Question] Meaning of the "al" variable?

Hello,

Looking at the code in this repo, I came across this variable:

al = np.arange(l).reshape(-1, 1) / 1000.0  # scaled timestep index within the trajectory: 0.000, 0.001, 0.002, ...
feat = np.concatenate([o, al, al**2, al**3, np.ones((l, 1))], axis=1)  # observations augmented with polynomial time features and a constant bias

The last time I saw something similar was in the OpenAI Baselines code (here, the Stable Baselines fork):

https://github.com/hill-a/stable-baselines/blob/333c59379f23e1f5c5c9e8bf93cbfa56ac52d13b/stable_baselines/acktr/value_functions.py#L58-L64

I assume this encodes some information about time (cf. the plot below of the different features), but I'm still wondering what al stands for and where it comes from (there is no mention of such a feature in either paper).

[Plot of the features al, al**2, al**3 over the trajectory]

Much worse learning performance with new code base

I am working with the code from the hand_dapg repo. First of all, thank you very much for open-sourcing this great work!

Because I need to work with mujoco-py 2.0, I have switched to the new 'redesign' code base for both mjrl and mj_envs. The problem is that the DAPG example from https://github.com/aravindr93/hand_dapg/tree/master/dapg/examples no longer learns; please have a look at the training results.

For the code from master, I get the following results at iteration 78:

----------------  ------------
VF_error_after       0.108092
VF_error_before      0.129821
alpha              208.393
delta                0.1
eval_score        2932.27
kl_dist              0.0504399
running_score     1543.33
stoc_pol_max      4359.31
stoc_pol_mean     2119.18
stoc_pol_min       -11.1661
stoc_pol_std      1347.49
success_rate        79
surr_improvement     0.045797
time_VF              3.60636
time_npg             2.59354
time_sampling       12.1133
time_vpg             0.136289
----------------  ------------

and with redesign (here I switched from behavior_cloning_2 to behavior_cloning, because the former was removed from the repo):

----------------  -------------
VF_error_after        0.385029
VF_error_before       0.671899
alpha               383.606
delta                 0.1
eval_score          366.002
kl_dist               0.0500591
num_samples       40000
running_score        23.3569
stoc_pol_max        667.282
stoc_pol_mean        29.4223
stoc_pol_min        -15.7963
stoc_pol_std         87.9433
success_rate          2.5
surr_improvement      0.0288742
time_VF               3.47379
time_npg              2.64448
time_sampling         8.92309
time_vpg              0.137907
----------------  -------------

This difference in performance was confirmed by subsequent runs of both code versions.

Is there anything that should be adjusted in the dapg example to achieve the same learning performance as before?
