ritchiehuang / deeprl_algorithms

306 stars, 11 watchers, 78 forks, 8.42 MB

DeepRL algorithm implementations in PyTorch and TensorFlow 2, written to be easy to read and understand (DQN, REINFORCE, VPG, A2C, TRPO, PPO, DDPG, TD3, SAC)

Python 97.57% Shell 2.43%
reinforcement reinforcement-learning-algorithms pytorch-implementation deep-reinforcement-learning dqn policygradient ppo trpo mujoco policy-gradient

deeprl_algorithms's Introduction

Hi there 👋

I'm currently working as a Machine Learning Engineer intern, mainly in the reinforcement learning area.

I love reading, writing, watching TV series, and especially sleeping 😄.


deeprl_algorithms's People

Contributors

ritchiehuang

deeprl_algorithms's Issues

too slow when using SAC and SAC_Alpha

Compared with the PPO algorithm (about 10000 fps on BipedalWalker-v2), the SAC family of algorithms runs much slower (about 100 fps on BipedalWalker-v2). I tested it both on Linux (12 CPU cores, no GPU) and in WSL2 on Windows 10 (i9-10700, no GPU).

At first I thought it was caused by the mismatched shapes of two tensors (q_value_1 and target_next_value at line 23 of sac_step.py; q_value_1 and target_q_value at line 27 of sac_alpha_step.py), but it was still very slow after fixing that.
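
For reference, a minimal sketch (assumed PyTorch; the variable names follow the issue text, not necessarily the repo's sac_step.py) of how such a shape mismatch silently broadcasts inside an MSE loss:

import torch
import torch.nn.functional as F

batch_size = 256
q_value_1 = torch.randn(batch_size, 1)        # critic output, shape (B, 1)
target_next_value = torch.randn(batch_size)   # bootstrap target, shape (B,)

# Mismatched shapes broadcast to (B, B) inside mse_loss (PyTorch also emits a
# UserWarning), so the loss averages over B*B pairs instead of B.
loss_broadcast = F.mse_loss(q_value_1, target_next_value)

# Aligning the shapes keeps the loss element-wise over the batch.
loss_aligned = F.mse_loss(q_value_1.squeeze(-1), target_next_value)

As noted above, fixing this alone did not remove the slowdown, so the broadcast is probably not the main cost; off-policy SAC also typically performs one or more gradient updates per environment step, which by itself can make it much slower per sampled frame than PPO's batched updates.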

shape of PPO advantages

The advantages returned by the estimate_advantages function in GAE.py have shape (batch_size, 1), while the importance ratio in the ppo_step function has shape (batch_size,). Multiplying them broadcasts surr1 and surr2 to shape (batch_size, batch_size), which noticeably hurts the convergence speed of the algorithm.

Only a small change is required:

surr1 = ratio * advantages.reshape(-1)
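
A minimal sketch (assumed PyTorch) of the broadcasting problem this one-line change fixes:

import torch

batch_size = 4
advantages = torch.randn(batch_size, 1)      # estimate_advantages output: (B, 1)
ratio = torch.randn(batch_size)              # importance ratio in ppo_step: (B,)

surr_broadcast = ratio * advantages          # broadcasts to (B, B): every ratio times every advantage
surr_fixed = ratio * advantages.reshape(-1)  # element-wise product, shape (B,)

print(surr_broadcast.shape)  # torch.Size([4, 4])
print(surr_fixed.shape)      # torch.Size([4])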

Results in GridWorld before the change:

Iter: 1, num steps: 4000, total reward: -178.9500, min reward: -12.0000, max reward: 5.9500, average reward: -3.1955, sample time: 0.3478
Iter: 2, num steps: 4000, total reward: -128.9500, min reward: -16.0000, max reward: 5.6000, average reward: -2.3027, sample time: 0.2542
Iter: 3, num steps: 4000, total reward: -44.6000, min reward: -10.0000, max reward: 7.5500, average reward: -0.7690, sample time: 0.2599
Iter: 4, num steps: 4000, total reward: 60.2000, min reward: -9.0000, max reward: 6.7000, average reward: 0.8853, sample time: 0.2566
Iter: 5, num steps: 4000, total reward: 172.0000, min reward: -8.0000, max reward: 7.3000, average reward: 2.1235, sample time: 0.2605
Iter: 6, num steps: 4000, total reward: 232.6500, min reward: -9.0000, max reward: 7.3000, average reward: 2.6140, sample time: 0.2673
Iter: 7, num steps: 4000, total reward: 314.7000, min reward: -10.0000, max reward: 8.4500, average reward: 2.9971, sample time: 0.2619
Iter: 8, num steps: 4000, total reward: 291.3500, min reward: -7.0000, max reward: 6.9000, average reward: 2.5116, sample time: 0.2615
Iter: 9, num steps: 4000, total reward: 394.7500, min reward: -10.0000, max reward: 7.6000, average reward: 3.1083, sample time: 0.2657
Iter: 10, num steps: 4000, total reward: 432.3500, min reward: -10.0000, max reward: 7.8500, average reward: 3.1104, sample time: 0.2624

Results after the change:

Iter: 1, num steps: 4000, total reward: -173.5000, min reward: -15.0000, max reward: 6.3500, average reward: -2.7540, sample time: 0.3204
Iter: 2, num steps: 4000, total reward: 77.4000, min reward: -8.0000, max reward: 8.1500, average reward: 1.0320, sample time: 0.2714
Iter: 3, num steps: 4000, total reward: 585.7500, min reward: -8.8000, max reward: 7.4000, average reward: 3.6609, sample time: 0.2593
Iter: 4, num steps: 4000, total reward: 574.3000, min reward: -7.0000, max reward: 8.5500, average reward: 3.6119, sample time: 0.2592
Iter: 5, num steps: 4000, total reward: 1946.3500, min reward: -1.1500, max reward: 8.6500, average reward: 4.5264, sample time: 0.2648
Iter: 6, num steps: 4000, total reward: 3432.3500, min reward: 0.9500, max reward: 11.5000, average reward: 4.9816, sample time: 0.2682
Iter: 7, num steps: 4000, total reward: 4349.9500, min reward: -0.1000, max reward: 8.6500, average reward: 4.9319, sample time: 0.2734
Iter: 8, num steps: 4000, total reward: 4214.1000, min reward: 1.8000, max reward: 9.0500, average reward: 4.8718, sample time: 0.3343
Iter: 9, num steps: 4000, total reward: 4549.7000, min reward: 0.6500, max reward: 12.4500, average reward: 5.0721, sample time: 0.2909
Iter: 10, num steps: 4000, total reward: 4280.2500, min reward: 1.7000, max reward: 8.6500, average reward: 5.0475, sample time: 0.2748

No module named 'Common.fixed_size_replay_memory' when running DDPG main.py

When I run python DeepRL_Algorithms-master/Algorithms/pytorch/DDPG/main.py, I get the error 'No module named 'Common.fixed_size_replay_memory'', but I cannot find this .py file anywhere in the repository. Could you tell me what happened? Thank you.

Some questions about the result figures

Excuse me, your result figures are very nice! Can you tell me how I can plot figures like yours? The code in your plt_utils cannot produce the shaded regions that yours show.
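
One common way to get that shaded-band look (assuming matplotlib and numpy; this is not the repo's plt_utils code) is fill_between around the mean of several training runs:

import numpy as np
import matplotlib.pyplot as plt

rewards = np.random.randn(5, 200).cumsum(axis=1)   # placeholder data: 5 runs x 200 iterations
iters = np.arange(rewards.shape[1])
mean, std = rewards.mean(axis=0), rewards.std(axis=0)

plt.plot(iters, mean, label="PPO")                               # mean reward curve
plt.fill_between(iters, mean - std, mean + std, alpha=0.3)       # shaded +/- 1 std band
plt.xlabel("iteration")
plt.ylabel("average reward")
plt.legend()
plt.show()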

RuntimeError in TRPO

Thanks for providing the wonderful repo! It helps.

Unfortunately, when I ran ./Algorithms/pytorch/TRPO/main.py, I got the following error.

Could you please give some suggestions? Thanks in advance.

D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\gym\logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Iter: 1, num steps: 4000, total reward: -3180.0536, min reward: -126.2744, max reward: -3.1223, average reward: -106.0018, sample time: 25.8737
Traceback (most recent call last):
  File "main.py", line 47, in <module>
    main()
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 35, in main
    trpo.learn(writer, i_iter)
  File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo.py", line 134, in learn
    trpo_step(self.policy_net, self.value_net, batch_state, batch_action,
  File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo_step.py", line 77, in trpo_step
    update_policy(policy_net, states, actions, old_log_probs, advantages, max_kl, damping)
  File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo_step.py", line 173, in update_policy
    step_dir = conjugate_gradient(Hvp, loss_grad)  # approximation solution of H^(-1)g
  File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo_step.py", line 94, in conjugate_gradient
    Hvp = Hvp_f(p)  # A @ p
  File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo_step.py", line 152, in Hvp
    kl = policy_net.get_kl(states)
  File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\Models\Policy.py", line 79, in get_kl
    mean = self.policy(x)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\torch\nn\modules\linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\torch\nn\functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4000x24 and 128x4)

BTW, I encountered similar errors when I tried to run SAC_Alpha (/Algorithms/pytorch/SAC_Alpha/main.py):

D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\gym\logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Traceback (most recent call last):
  File "sac_alpha_main.py", line 69, in <module>
    main()
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "sac_alpha_main.py", line 57, in main
    sac_alpha.learn(writer, i_iter)
  File "D:\DeepRL_Algorithms-master\DeepRL_Algorithms-master\Algorithms\pytorch\SAC_Alpha\sac_alpha.py", line 160, in learn
    self.update(batch, k)
  File "D:\DeepRL_Algorithms-master\DeepRL_Algorithms-master\Algorithms\pytorch\SAC_Alpha\sac_alpha.py", line 203, in update
    sac_alpha_step(self.policy_net, self.q_net_1, self.q_net_2, self.alpha, self.q_net_target_1,
  File "D:\DeepRL_Algorithms-master\DeepRL_Algorithms-master\Algorithms\pytorch\SAC_Alpha\sac_alpha_step.py", line 18, in sac_alpha_step
    next_actions, next_log_probs = policy_net.rsample(next_states)
  File "D:\DeepRL_Algorithms-master\DeepRL_Algorithms-master\Algorithms\pytorch\Models\Policy.py", line 69, in rsample
    log_prob -= (torch.log(1. - action.pow(2) + eps)).sum(dim=-1)
RuntimeError: The size of tensor a (4) must match the size of tensor b (256) at non-singleton dimension 1

An issue when running main.py

There is an error like this:
AttributeError: 'MountainCarEnv' object has no attribute 'seed'
How can I solve it? Thank you!
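
This error usually comes from a newer Gym release: from gym 0.26 onward, env.seed() was removed and the seed is passed to reset() instead. A minimal sketch of the newer convention (exactly where to change this in the repo's main.py is an assumption):

import gym

env = gym.make("MountainCar-v0")
# env.seed(2021)                    # old API, removed in gym >= 0.26
state, info = env.reset(seed=2021)  # new API: seed via reset()
env.action_space.seed(2021)         # optionally seed the action space as well

Alternatively, pinning an older version (e.g. pip install "gym<0.26") keeps env.seed() available.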
