I'm currently working as a Machine Learning Engineer intern, focusing mainly on reinforcement learning.
I love reading, writing, and watching TV series, and especially sleeping.
Deep RL algorithm implementations in PyTorch and TensorFlow 2, written to be easy to read and understand (DQN, REINFORCE, VPG, A2C, TRPO, PPO, DDPG, TD3, SAC).
Compared with the PPO algorithm (about 10000 fps on BipedalWalker-v2), the SAC family of algorithms runs much slower (about 100 fps on BipedalWalker-v2). I tested this on both Linux (12 CPU cores, no GPU) and WSL2 on Windows 10 (i9-10700, no GPU).
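To make such fps numbers reproducible, here is one way a figure like that could be measured (a sketch only; I don't know how the original numbers were obtained, and `step_fn` is a stand-in for one environment/update step):

```python
import time

def measure_fps(step_fn, n_steps=1000):
    """Return steps per second over n_steps calls of step_fn."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return n_steps / (time.perf_counter() - start)

# Example with a trivial no-op step; a real measurement would call
# env.step plus the algorithm's update inside step_fn.
fps = measure_fps(lambda: None, n_steps=10_000)
print(fps > 0)  # True
```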
At first I thought this was caused by the unequal dimensions of two tensors (q_value_1 and target_next_value at line 23 in sac_step.py, and q_value_1 and target_q_value at line 27 in sac_alpha_step.py), but it was still very slow after fixing that.
You have explicitly commented out the observation normalization. Is there any reason for that?
The advantages returned by the estimate_advantages function in GAE.py have shape (batch_size, 1), but the importance ratio in the ppo_step function has shape (batch_size,). Broadcasting therefore gives surr1 and surr2 the shape (batch_size, batch_size), which noticeably slows the convergence of the algorithm.
Only a small change is required:
surr1 = ratio * advantages.reshape(-1)
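NumPy follows the same broadcasting rules as PyTorch, so the shape blow-up can be reproduced in a few lines (illustrative shapes, not the repo's actual tensors):

```python
import numpy as np

ratio = np.ones(4)               # shape (4,), like the importance ratio
advantages = np.ones((4, 1))     # shape (4, 1), like GAE's output

# Broadcasting (4,) against (4, 1) silently yields a (4, 4) matrix,
# so the surrogate loss averages over batch_size**2 cross terms.
bad = ratio * advantages
print(bad.shape)                 # (4, 4)

# Flattening the advantages restores the intended element-wise product.
good = ratio * advantages.reshape(-1)
print(good.shape)                # (4,)
```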
Results on Gridworld before the fix:
Iter: 1, num steps: 4000, total reward: -178.9500, min reward: -12.0000, max reward: 5.9500, average reward: -3.1955, sample time: 0.3478
Iter: 2, num steps: 4000, total reward: -128.9500, min reward: -16.0000, max reward: 5.6000, average reward: -2.3027, sample time: 0.2542
Iter: 3, num steps: 4000, total reward: -44.6000, min reward: -10.0000, max reward: 7.5500, average reward: -0.7690, sample time: 0.2599
Iter: 4, num steps: 4000, total reward: 60.2000, min reward: -9.0000, max reward: 6.7000, average reward: 0.8853, sample time: 0.2566
Iter: 5, num steps: 4000, total reward: 172.0000, min reward: -8.0000, max reward: 7.3000, average reward: 2.1235, sample time: 0.2605
Iter: 6, num steps: 4000, total reward: 232.6500, min reward: -9.0000, max reward: 7.3000, average reward: 2.6140, sample time: 0.2673
Iter: 7, num steps: 4000, total reward: 314.7000, min reward: -10.0000, max reward: 8.4500, average reward: 2.9971, sample time: 0.2619
Iter: 8, num steps: 4000, total reward: 291.3500, min reward: -7.0000, max reward: 6.9000, average reward: 2.5116, sample time: 0.2615
Iter: 9, num steps: 4000, total reward: 394.7500, min reward: -10.0000, max reward: 7.6000, average reward: 3.1083, sample time: 0.2657
Iter: 10, num steps: 4000, total reward: 432.3500, min reward: -10.0000, max reward: 7.8500, average reward: 3.1104, sample time: 0.2624
Results after the fix:
Iter: 1, num steps: 4000, total reward: -173.5000, min reward: -15.0000, max reward: 6.3500, average reward: -2.7540, sample time: 0.3204
Iter: 2, num steps: 4000, total reward: 77.4000, min reward: -8.0000, max reward: 8.1500, average reward: 1.0320, sample time: 0.2714
Iter: 3, num steps: 4000, total reward: 585.7500, min reward: -8.8000, max reward: 7.4000, average reward: 3.6609, sample time: 0.2593
Iter: 4, num steps: 4000, total reward: 574.3000, min reward: -7.0000, max reward: 8.5500, average reward: 3.6119, sample time: 0.2592
Iter: 5, num steps: 4000, total reward: 1946.3500, min reward: -1.1500, max reward: 8.6500, average reward: 4.5264, sample time: 0.2648
Iter: 6, num steps: 4000, total reward: 3432.3500, min reward: 0.9500, max reward: 11.5000, average reward: 4.9816, sample time: 0.2682
Iter: 7, num steps: 4000, total reward: 4349.9500, min reward: -0.1000, max reward: 8.6500, average reward: 4.9319, sample time: 0.2734
Iter: 8, num steps: 4000, total reward: 4214.1000, min reward: 1.8000, max reward: 9.0500, average reward: 4.8718, sample time: 0.3343
Iter: 9, num steps: 4000, total reward: 4549.7000, min reward: 0.6500, max reward: 12.4500, average reward: 5.0721, sample time: 0.2909
Iter: 10, num steps: 4000, total reward: 4280.2500, min reward: 1.7000, max reward: 8.6500, average reward: 5.0475, sample time: 0.2748
I'm not sure why you removed the Polyak update in Double DQN by commenting it out. Is there any specific reason?
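For reference, the Polyak (soft) target update blends the target parameters toward the online parameters, θ_target ← (1 − τ)·θ_target + τ·θ. A minimal plain-Python sketch of that rule, with flat lists standing in for the network weights:

```python
def polyak_update(target_params, online_params, tau=0.005):
    """Soft update: move each target parameter a fraction tau toward the online one."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target_params, online_params)]

# With tau=1.0 this degenerates into a hard copy of the online parameters.
print(polyak_update([0.0, 2.0], [1.0, 4.0], tau=0.5))  # [0.5, 3.0]
```

In PyTorch code the same rule is usually applied parameter-by-parameter over `net.parameters()` under `torch.no_grad()`.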
When I run (python DeepRL_Algorithms-master/Algorithms/pytorch/DDPG/main.py), I get the error 'No module named 'Common.fixed_size_replay_memory'', and I really cannot find that .py file in the repo. Could you tell me what happened? Thank you.
The terminal freezes without any error message. There seems to be a bug related to multiprocessing in the memory collector class.
Excuse me, your result figures are very beautiful! Could you tell me how to plot figures like yours? The code in your plt_utils does not show shaded regions like your figures do.
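For what it's worth, shaded bands like those are commonly drawn with matplotlib's `fill_between` around the mean curve across seeds. A sketch under that assumption (I don't know what plt_utils actually does; the reward curves here are fake stand-in data):

```python
import io
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Fake reward curves from 5 random seeds over 100 iterations.
rng = np.random.default_rng(0)
curves = np.cumsum(rng.normal(0.5, 1.0, size=(5, 100)), axis=1)

mean = curves.mean(axis=0)
std = curves.std(axis=0)
x = np.arange(len(mean))

fig, ax = plt.subplots()
ax.plot(x, mean, label="mean reward")
# The translucent band is what produces the "shadow" around the curve.
ax.fill_between(x, mean - std, mean + std, alpha=0.2)
ax.legend()

buf = io.BytesIO()
fig.savefig(buf, format="png")
print(buf.getbuffer().nbytes > 0)  # True
```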
Warning: the pictures in your Markdown are missing.
Thanks for providing the wonderful repo! It helps.
Unfortunately, when I ran ./Algorithms/pytorch/TRPO/main.py, I got the following error. Could you please give some suggestions? Thanks in advance.
D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\gym\logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Iter: 1, num steps: 4000, total reward: -3180.0536, min reward: -126.2744, max reward: -3.1223, average reward: -106.0018, sample time: 25.8737
Traceback (most recent call last):
File "main.py", line 47, in <module>
main()
File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "main.py", line 35, in main
trpo.learn(writer, i_iter)
File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo.py", line 134, in learn
trpo_step(self.policy_net, self.value_net, batch_state, batch_action,
File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo_step.py", line 77, in trpo_step
update_policy(policy_net, states, actions, old_log_probs, advantages, max_kl, damping)
File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo_step.py", line 173, in update_policy
step_dir = conjugate_gradient(Hvp, loss_grad) # approximation solution of H^(-1)g
File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo_step.py", line 94, in conjugate_gradient
Hvp = Hvp_f(p) # A @ p
File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\TRPO\trpo_step.py", line 152, in Hvp
kl = policy_net.get_kl(states)
File "D:\DeepRL_Algorithms-master\Algorithms\pytorch\Models\Policy.py", line 79, in get_kl
mean = self.policy(x)
File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\torch\nn\modules\linear.py", line 93, in forward
return F.linear(input, self.weight, self.bias)
File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\torch\nn\functional.py", line 1690, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4000x24 and 128x4)
BTW, I encountered a similar error when I tried to run SAC alpha (/Algorithms/pytorch/SAC_Alpha/main.py):
D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\gym\logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Traceback (most recent call last):
File "sac_alpha_main.py", line 69, in <module>
main()
File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "D:\Programs\anaconda3\envs\pytorch17\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "sac_alpha_main.py", line 57, in main
sac_alpha.learn(writer, i_iter)
File "D:\DeepRL_Algorithms-master\DeepRL_Algorithms-master\Algorithms\pytorch\SAC_Alpha\sac_alpha.py", line 160, in learn
self.update(batch, k)
File "D:\DeepRL_Algorithms-master\DeepRL_Algorithms-master\Algorithms\pytorch\SAC_Alpha\sac_alpha.py", line 203, in update
sac_alpha_step(self.policy_net, self.q_net_1, self.q_net_2, self.alpha, self.q_net_target_1,
File "D:\DeepRL_Algorithms-master\DeepRL_Algorithms-master\Algorithms\pytorch\SAC_Alpha\sac_alpha_step.py", line 18, in sac_alpha_step
next_actions, next_log_probs = policy_net.rsample(next_states)
File "D:\DeepRL_Algorithms-master\DeepRL_Algorithms-master\Algorithms\pytorch\Models\Policy.py", line 69, in rsample
log_prob -= (torch.log(1. - action.pow(2) + eps)).sum(dim=-1)
RuntimeError: The size of tensor a (4) must match the size of tensor b (256) at non-singleton dimension 1
There is an error like this:
AttributeError: 'MountainCarEnv' object has no attribute 'seed'
How can I solve it? Thank you!
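For context, newer Gym/Gymnasium releases (0.26 and later) removed `env.seed()` and moved seeding into `env.reset(seed=...)`, which is what produces this AttributeError with older training code. A compatibility sketch that works with both APIs (the DummyNewStyleEnv is just a stand-in for illustration):

```python
def reset_with_seed(env, seed):
    """Seed an env across old and new Gym APIs.

    Old Gym: env.seed(s) then env.reset().
    New Gym/Gymnasium (>=0.26): env.reset(seed=s).
    """
    if hasattr(env, "seed"):
        env.seed(seed)
        return env.reset()
    return env.reset(seed=seed)


class DummyNewStyleEnv:
    """Stand-in for a post-0.26 env: no seed() method, reset takes seed."""
    def reset(self, seed=None):
        self.last_seed = seed
        return [0.0], {}  # (observation, info), as in the new API


env = DummyNewStyleEnv()
obs, info = reset_with_seed(env, 42)
print(env.last_seed)  # 42
```

Alternatively, pinning an older gym version that still has `env.seed()` avoids touching the code at all.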