ritchiehuang / deeprl_algorithms

306 stars, 11 watchers, 78 forks, 8.42 MB

DeepRL algorithm implementations in PyTorch and TensorFlow 2, written to be easy to read and understand (DQN, REINFORCE, VPG, A2C, TRPO, PPO, DDPG, TD3, SAC)

Python 97.57% Shell 2.43%
reinforcement reinforcement-learning-algorithms pytorch-implementation deep-reinforcement-learning dqn policygradient ppo trpo mujoco policy-gradient

deeprl_algorithms's Introduction

Hi there 👋

I'm currently working as a Machine Learning Engineer intern, mainly in the reinforcement learning area.

I love reading, writing, watching TV series, and especially sleeping 😄.


deeprl_algorithms's People

Contributors

ritchiehuang

deeprl_algorithms's Issues

too slow when using SAC and SAC_Alpha

Compared with the PPO algorithm (about 10000 fps on BipedalWalker-v2), the SAC family of algorithms runs much slower (about 100 fps on BipedalWalker-v2). I tested it both on Linux (12 CPU cores, no GPU) and in WSL2 on Windows 10 (i9-10700, no GPU).

At first I thought it was caused by the mismatched shapes of two tensors (q_value_1 and target_next_value at line 23 of sac_step.py; q_value_1 and target_q_value at line 27 of sac_alpha_step.py), but it was still very slow after fixing that.
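
For reference, a minimal sketch (assumed PyTorch; the variable names follow the issue text, not necessarily the repo's sac_step.py) of how such a shape mismatch silently broadcasts inside an MSE loss:

import torch
import torch.nn.functional as F

batch_size = 256
q_value_1 = torch.randn(batch_size, 1)        # critic output, shape (B, 1)
target_next_value = torch.randn(batch_size)   # bootstrap target, shape (B,)

# Mismatched shapes broadcast to (B, B) inside mse_loss (PyTorch also emits a
# UserWarning), so the loss averages over B*B pairs instead of B.
loss_broadcast = F.mse_loss(q_value_1, target_next_value)

# Aligning the shapes keeps the loss element-wise over the batch.
loss_aligned = F.mse_loss(q_value_1.squeeze(-1), target_next_value)

As noted above, fixing this alone did not remove the slowdown, so the broadcast is probably not the main cost; off-policy SAC also typically performs one or more gradient updates per environment step, which by itself can make it much slower per sampled frame than PPO's batched updates.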

shape of PPO advantages

The advantages returned by the estimate_advantages function in GAE.py have shape (batch_size, 1), while the importance ratio in the ppo_step function has shape (batch_size,). Multiplying them broadcasts surr1 and surr2 to shape (batch_size, batch_size), which noticeably hurts the convergence speed of the algorithm.

Only a small change is required:

surr1 = ratio * advantages.reshape(-1)
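
A minimal sketch (assumed PyTorch) of the broadcasting problem this one-line change fixes:

import torch

batch_size = 4
advantages = torch.randn(batch_size, 1)      # estimate_advantages output: (B, 1)
ratio = torch.randn(batch_size)              # importance ratio in ppo_step: (B,)

surr_broadcast = ratio * advantages          # broadcasts to (B, B): every ratio times every advantage
surr_fixed = ratio * advantages.reshape(-1)  # element-wise product, shape (B,)

print(surr_broadcast.shape)  # torch.Size([4, 4])
print(surr_fixed.shape)      # torch.Size([4])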

Results in GridWorld before the change:

Iter: 1, num steps: 4000, total reward: -178.9500, min reward: -12.0000, max reward: 5.9500, average reward: -3.1955, sample time: 0.3478
Iter: 2, num steps: 4000, total reward: -128.9500, min reward: -16.0000, max reward: 5.6000, average reward: -2.3027, sample time: 0.2542
Iter: 3, num steps: 4000, total reward: -44.6000, min reward: -10.0000, max reward: 7.5500, average reward: -0.7690, sample time: 0.2599
Iter: 4, num steps: 4000, total reward: 60.2000, min reward: -9.0000, max reward: 6.7000, average reward: 0.8853, sample time: 0.2566
Iter: 5, num steps: 4000, total reward: 172.0000, min reward: -8.0000, max reward: 7.3000, average reward: 2.1235, sample time: 0.2605
Iter: 6, num steps: 4000, total reward: 232.6500, min reward: -9.0000, max reward: 7.3000, average reward: 2.6140, sample time: 0.2673
Iter: 7, num steps: 4000, total reward: 314.7000, min reward: -10.0000, max reward: 8.4500, average reward: 2.9971, sample time: 0.2619
Iter: 8, num steps: 4000, total reward: 291.3500, min reward: -7.0000, max reward: 6.9000, average reward: 2.5116, sample time: 0.2615
Iter: 9, num steps: 4000, total reward: 394.7500, min reward: -10.0000, max reward: 7.6000, average reward: 3.1083, sample time: 0.2657
Iter: 10, num steps: 4000, total reward: 432.3500, min reward: -10.0000, max reward: 7.8500, average reward: 3.1104, sample time: 0.2624

Results after the change:

Iter: 1, num steps: 4000, total reward: -173.5000, min reward: -15.0000, max reward: 6.3500, average reward: -2.7540, sample time: 0.3204
Iter: 2, num steps: 4000, total reward: 77.4000, min reward: -8.0000, max reward: 8.1500, average reward: 1.0320, sample time: 0.2714
Iter: 3, num steps: 4000, total reward: 585.7500, min reward: -8.8000, max reward: 7.4000, average reward: 3.6609, sample time: 0.2593
Iter: 4, num steps: 4000, total reward: 574.3000, min reward: -7.0000, max reward: 8.5500, average reward: 3.6119, sample time: 0.2592
Iter: 5, num steps: 4000, total reward: 1946.3500, min reward: -1.1500, max reward: 8.6500, average reward: 4.5264, sample time: 0.2648
Iter: 6, num steps: 4000, total reward: 3432.3500, min reward: 0.9500, max reward: 11.5000, average reward: 4.9816, sample time: 0.2682
Iter: 7, num steps: 4000, total reward: 4349.9500, min reward: -0.1000, max reward: 8.6500, average reward: 4.9319, sample time: 0.2734
Iter: 8, num steps: 4000, total reward: 4214.1000, min reward: 1.8000, max reward: 9.0500, average reward: 4.8718, sample time: 0.3343
Iter: 9, num steps: 4000, total reward: 4549.7000, min reward: 0.6500, max reward: 12.4500, average reward: 5.0721, sample time: 0.2909
Iter: 10, num steps: 4000, total reward: 4280.2500, min reward: 1.7000, max reward: 8.6500, average reward: 5.0475, sample time: 0.2748

No module named 'Common.fixed_size_replay_memory' when running DDPG main.py

When I run python DeepRL_Algorithms-master/Algorithms/pytorch/DDPG/main.py, I get the error 'No module named 'Common.fixed_size_replay_memory'', but I cannot find this .py file anywhere in the repository. Could you tell me what happened? Thank you.

Some questions about the result figures

Excuse me, your result figures are very nice! Can you tell me how I can plot figures like yours? The code in your plt_utils cannot produce the shaded regions that yours show.
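
One common way to get that shaded-band look (assuming matplotlib and numpy; this is not the repo's plt_utils code) is fill_between around the mean of several training runs:

import numpy as np
import matplotlib.pyplot as plt

rewards = np.random.randn(5, 200).cumsum(axis=1)   # placeholder data: 5 runs x 200 iterations
iters = np.arange(rewards.shape[1])
mean, std = rewards.mean(axis=0), rewards.std(axis=0)

plt.plot(iters, mean, label="PPO")                               # mean reward curve
plt.fill_between(iters, mean - std, mean + std, alpha=0.3)       # shaded +/- 1 std band
plt.xlabel("iteration")
plt.ylabel("average reward")
plt.legend()
plt.show()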

RuntimeError in TRPO

Thanks for providing the wonderful repo! It helps.

Unfortunately, when I ran ./Algorithms/pytorch/TRPO/main.py, I got the following error.

Could you please give some suggestions? Thanks in advance.

D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\gym\logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Iter: 1, num steps: 4000, total reward: -3180.0536, min reward: -126.2744, max reward: -3.1223, average reward: -106.0018, sample time: 25.8737
Traceback (most recent call last):
  File "main.py", line 47, in <module>
    main()
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 35, in main
    trpo.learn(writer, i_iter)
  File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo.py", line 134, in learn
    trpo_step(self.policy_net, self.value_net, batch_state, batch_action,
  File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo_step.py", line 77, in trpo_step
    update_policy(policy_net, states, actions, old_log_probs, advantages, max_kl, damping)
  File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo_step.py", line 173, in update_policy
    step_dir = conjugate_gradient(Hvp, loss_grad)  # approximation solution of H^(-1)g
  File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo_step.py", line 94, in conjugate_gradient
    Hvp = Hvp_f(p)  # A @ p
  File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo_step.py", line 152, in Hvp
    kl = policy_net.get_kl(states)
  File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\Models\Policy.py", line 79, in get_kl
    mean = self.policy(x)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\torch\nn\modules\linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\torch\nn\functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4000x24 and 128x4)

BTW, I encountered similar errors when I tried to run SAC_Alpha (/Algorithms/pytorch/SAC_Alpha/main.py):

D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\gym\logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Traceback (most recent call last):
  File "sac_alpha_main.py", line 69, in <module>
    main()
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "sac_alpha_main.py", line 57, in main
    sac_alpha.learn(writer, i_iter)
  File "D:\DeepRL_Algorithms-master\DeepRL_Algorithms-master\Algorithms\pytorch\SAC_Alpha\sac_alpha.py", line 160, in learn
    self.update(batch, k)
  File "D:\DeepRL_Algorithms-master\DeepRL_Algorithms-master\Algorithms\pytorch\SAC_Alpha\sac_alpha.py", line 203, in update
    sac_alpha_step(self.policy_net, self.q_net_1, self.q_net_2, self.alpha, self.q_net_target_1,
  File "D:\DeepRL_Algorithms-master\DeepRL_Algorithms-master\Algorithms\pytorch\SAC_Alpha\sac_alpha_step.py", line 18, in sac_alpha_step
    next_actions, next_log_probs = policy_net.rsample(next_states)
  File "D:\DeepRL_Algorithms-master\DeepRL_Algorithms-master\Algorithms\pytorch\Models\Policy.py", line 69, in rsample
    log_prob -= (torch.log(1. - action.pow(2) + eps)).sum(dim=-1)
RuntimeError: The size of tensor a (4) must match the size of tensor b (256) at non-singleton dimension 1

An issue when running main.py

There is an error like this:
AttributeError: 'MountainCarEnv' object has no attribute 'seed'
How can I solve it? Thank you!
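
This error usually comes from a newer Gym release: from gym 0.26 onward, env.seed() was removed and the seed is passed to reset() instead. A minimal sketch of the newer convention (exactly where to change this in the repo's main.py is an assumption):

import gym

env = gym.make("MountainCar-v0")
# env.seed(2021)                    # old API, removed in gym >= 0.26
state, info = env.reset(seed=2021)  # new API: seed via reset()
env.action_space.seed(2021)         # optionally seed the action space as well

Alternatively, pinning an older version (e.g. pip install "gym<0.26") keeps env.seed() available.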
