Hi, in the original PPO paper, it runs T timesteps(e.g. 1 actor) and then update K tim


When to Update about ppo-pytorch HOT 5 CLOSED

nikhilbarhate99 commented on May 9, 2024

When to Update

from ppo-pytorch.

Comments (5)

nikhilbarhate99 commented on May 9, 2024 1

PPO Algorithm (paper):

for iteration = 1, 2, ... do
  for actor = 1, 2, ..., N do
    Run policy π_θold in environment for T timesteps
    Compute advantage estimates Â_1, ..., Â_T
  end for
  Optimize surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
  θold ← θ
end for

In this repo, N = 1 (one actor) and the minibatch size M = T, i.e. each update uses the entire batch as a single minibatch.

Since the algorithm's performance depends on the environment, I am not sure how this choice affects overall efficiency. It is a hyperparameter and needs to be tuned for the environment.

But using parallel workers (N > 1) is generally more useful, since the expectations are approximated with experience generated from different random seeds.
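
To make the relation between N, T, K and M concrete, here is a minimal sketch that counts the optimizer steps in one iteration of the algorithm above. The function name and the numbers passed to it are illustrative assumptions, not values taken from the repo, and the function assumes M ≤ NT.

```python
import math


def updates_per_iteration(n_actors: int, horizon: int, k_epochs: int, minibatch: int) -> int:
    """Count gradient steps in one PPO iteration (hypothetical helper, assumes M <= N*T)."""
    batch = n_actors * horizon                      # N * T transitions per iteration
    steps_per_epoch = math.ceil(batch / minibatch)  # minibatches needed to cover the batch once
    return k_epochs * steps_per_epoch               # K passes over the batch


# One actor with full-batch updates (M = T): one step per epoch.
single = updates_per_iteration(n_actors=1, horizon=2000, k_epochs=4, minibatch=2000)

# A hypothetical multi-actor setting with the same total batch size, for comparison only.
multi = updates_per_iteration(n_actors=8, horizon=250, k_epochs=4, minibatch=500)

print(single, multi)  # prints: 4 16
```

With M = T and N = 1, every epoch is a single full-batch step, so K alone sets how many times the policy is updated per batch.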


xunzhang commented on May 9, 2024

In PPO.py, T = 300 (max_timesteps = 300) and M = 2000 (update_timestep = 2000), so why do you say M = T? I'm a little confused here. Are you simulating multiple actors (N) by setting M > T? In the PPO.py example, that would give 300 (T) × 6.66 (N) ≈ 2000 (M). Correct me if I am wrong.


nikhilbarhate99 commented on May 9, 2024

Update Timestep (T) = 2000
Mini-Batch size (M) = 2000

max_timesteps is the maximum number of timesteps in ONE episode. One update may contain experience from multiple episodes.
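
The difference between the two settings can be sketched with a toy loop. This is only an illustration, not the repo's training loop: the names max_timesteps and update_timestep come from this thread, while simulate_updates and the random episode lengths are stand-ins for a real environment.

```python
import random


def simulate_updates(num_episodes, max_timesteps, update_timestep, seed=0):
    """Return (batch_size, episodes_in_batch) for every update that would be triggered."""
    rng = random.Random(seed)
    buffered = 0            # timesteps collected since the last update
    batch_episodes = set()  # ids of episodes contributing to the current batch
    log = []
    for episode in range(num_episodes):
        # Toy stand-in for an environment: the episode ends after a random
        # number of steps, capped at max_timesteps.
        episode_len = rng.randint(1, max_timesteps)
        for _ in range(episode_len):
            batch_episodes.add(episode)
            buffered += 1
            if buffered == update_timestep:
                # A real agent would run its K-epoch update here and clear its memory.
                log.append((buffered, len(batch_episodes)))
                buffered = 0
                batch_episodes.clear()
    return log


log = simulate_updates(num_episodes=100, max_timesteps=300, update_timestep=2000)
for batch_size, n_episodes in log:
    print(f"update on {batch_size} timesteps drawn from {n_episodes} episodes")
```

Because no episode can exceed 300 steps, every batch of 2000 timesteps in this sketch spans at least 7 episodes, which is why max_timesteps does not play the role of T.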

for iteration = 1, 2, ... do
  for actor = 1, 2, ..., N do
    Run policy π_θold in environment for T timesteps
    Compute advantage estimates Â_1, ..., Â_T
  end for
  Optimize surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
  θold ← θ
end for

Using multiple actors (N) means running multiple actor instances (parallel / multithreaded), each collecting experience of length T.
For updating, the minibatch size (M) can NOT be greater than the total batch size (NT).
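
As a rough sketch of the M ≤ NT constraint, the hypothetical helper below splits a batch of N*T transition indices into shuffled minibatches for K epochs and rejects an M larger than the batch. Shuffling each epoch is one common choice assumed here, not something stated in this thread.

```python
import random


def minibatch_indices(n_actors, horizon, minibatch, k_epochs, seed=0):
    """Return a list of index lists, one per gradient step (hypothetical helper)."""
    batch = n_actors * horizon  # N * T transitions available for this update
    if not 1 <= minibatch <= batch:
        raise ValueError(f"need 1 <= M <= N*T, got M={minibatch}, N*T={batch}")
    rng = random.Random(seed)
    steps = []
    for _ in range(k_epochs):
        order = list(range(batch))
        rng.shuffle(order)  # new sample order every epoch
        for start in range(0, batch, minibatch):
            steps.append(order[start:start + minibatch])
    return steps


# One actor, M = T = 2000: each of the 4 epochs is a single full-batch step.
batches = minibatch_indices(n_actors=1, horizon=2000, minibatch=2000, k_epochs=4)
print(len(batches), len(batches[0]))  # prints: 4 2000
```

Asking for M = 2001 with the same N and T raises a ValueError, since there are only 2000 transitions to draw from.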


xunzhang commented on May 9, 2024

I see. I misread max_timesteps in your code as T in the paper. So update_timestep in your code plays the role of both M and T.

One more point of confusion about multiple actors: using parallel environments makes sense, but why can't I run N*T timesteps sequentially to simulate the parallel environments?


nikhilbarhate99 commented on May 9, 2024

All the instances will be running with different random seeds. This will lead to more varied experience, thus approximating the expectation better.

Source: skip to 54:19 of (https://www.youtube.com/watch?v=EKqxumCuAAY&list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A&index=6)
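
A toy illustration of the seed point, under loud assumptions: a one-dimensional random walk stands in for an environment rollout, and rollout and collect are hypothetical helpers. It only shows that actors with distinct seeds produce distinct trajectories; it does not measure how much this improves the advantage or gradient estimates.

```python
import random


def rollout(seed, horizon):
    """Toy stand-in for one actor's trajectory: a seeded random walk of length `horizon`."""
    rng = random.Random(seed)
    state = 0
    states = []
    for _ in range(horizon):
        state += rng.choice((-1, 1))  # pretend this is a stochastic environment step
        states.append(state)
    return states


def collect(seeds, horizon):
    """Gather one trajectory per actor, each actor using its own seed."""
    return [rollout(seed, horizon) for seed in seeds]


parallel = collect(seeds=[0, 1, 2, 3], horizon=50)  # four actors, four seeds
repeated = collect(seeds=[0, 0, 0, 0], horizon=50)  # four actors sharing one seed

distinct_parallel = len({tuple(traj) for traj in parallel})
distinct_repeated = len({tuple(traj) for traj in repeated})
print(distinct_parallel, distinct_repeated)
```

Actors that share a seed replay the same trajectory, so the batch carries no more information than a single rollout; distinct seeds are what make the N trajectories different samples.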

