seungeunrho / minimalRL
Implementations of basic RL algorithms with minimal lines of code! (PyTorch based)
License: MIT
Hey there,
Would it be wise to include an entropy term in your PPO implementation? How would one do that?
My second question is: why not use 0.5 * MSE loss instead of F.smooth_l1_loss?
Here are some snippets as a suggestion - but I am not absolutely sure:
surr1 = ratio * advantage
surr2 = torch.clamp(ratio, 1-eps_clip, 1+eps_clip) * advantage
actor_loss = -torch.min(surr1, surr2)
critic_loss = F.smooth_l1_loss(self.v(s), td_target.detach())  # alternative: 0.5 * self.MseLoss(state_values, torch.tensor(rewards))
# beta = 0.01  # encourages exploring different policies; leave at 0.01
total_loss = critic_loss + actor_loss  # - beta * dist_entropy
To include entropy, we need a function like this:
def evaluate(self, state, action):
    # what values are returned here?
    action_probs = self.action_layer(state)
    dist = Categorical(action_probs)
    action_logprobs = dist.log_prob(action)
    dist_entropy = dist.entropy()
    state_value = self.value_layer(state)
    return action_logprobs, torch.squeeze(state_value), dist_entropy
However, I am not sure about the best way to include entropy in your implementation.
I'd be glad for some help.
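For what it's worth, here is a minimal, self-contained sketch of how an entropy bonus could be folded into the clipped PPO loss. It is only an illustration under assumed names: entropy_coef is a new hyperparameter, and all the tensors below are dummy placeholders standing in for what ppo.py computes in train_net().

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

eps_clip, entropy_coef = 0.1, 0.01                     # entropy_coef is an assumed new hyperparameter

pi        = torch.tensor([[0.6, 0.4], [0.3, 0.7]])     # current policy probs (dummy batch of 2)
prob_a    = torch.tensor([[0.5], [0.4]])               # old probs of the taken actions
a         = torch.tensor([[0], [1]])                   # taken actions
advantage = torch.tensor([[1.0], [-0.5]])
v_s       = torch.tensor([[0.9], [0.2]])
td_target = torch.tensor([[1.0], [0.1]])

pi_a    = pi.gather(1, a)
ratio   = torch.exp(torch.log(pi_a) - torch.log(prob_a))
surr1   = ratio * advantage
surr2   = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantage
entropy = Categorical(pi).entropy().mean()             # average policy entropy over the batch

# Subtracting the entropy term lowers the loss when the policy stays stochastic.
loss = -torch.min(surr1, surr2) + F.smooth_l1_loss(v_s, td_target) - entropy_coef * entropy
print(loss.mean())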
If you train PPO long enough, say 3000 episodes or more, the reward drops (e.g. from 500 to 30).
By any chance, did you forget to multiply the critic target by the done mask?
Line 89 in 7095e0f
Hi, I think the ratio in ppo.py should be ratio.detach().
I doubt whether the asynchronous update made by the current a3c adheres to what is suggested in the paper. Suppose the workers share the shared_model. Then each worker should:
1. Copy the weights of shared_model into its own local_model.
2. Take n steps (or run until the end of the episode) with local_model.
3. Push the gradients from local_model to shared_model.
4. Update shared_model and go to step 1.
Thus, while the local_model is taking the n steps, its weights do not change.
However, in the current implementation we directly use the shared_model for taking those n steps. Hence, it may happen that some process P1 updates the weights of shared_model, which might affect another process P2: P2 might have started with some weight configuration of shared_model, which is then modified before its n steps are completed.
I think we can make the following change to the train method to avoid the above phenomenon:
def train(model):
    local_model = ActorCritic()
    local_model.load_state_dict(model.state_dict())
    # Create optimizer for the shared model
    # Create environment
    # Take n steps using local_model
    optimizer.zero_grad()
    # Calculate loss and get the gradients
    loss_fn(local_model(data), labels).backward()
    for param, shared_param in zip(local_model.parameters(), model.parameters()):
        if shared_param.grad is not None:
            shared_param._grad = param.grad
    optimizer.step()
I am not very familiar with asynchronous model updates in PyTorch, but looking at the docs at https://pytorch.org/docs/stable/notes/multiprocessing.html#asynchronous-multiprocess-training-e-g-hogwild, I think we are using the shared_model all the time.
If you think what I say is correct, I can make a PR.
What changes would be required to employ your ppo algorithm in a continuous action space like Pendulum-v0?
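Not an official answer, just a hedged sketch of one common approach: swap the categorical head for a Gaussian head and sample actions from torch.distributions.Normal. The layer sizes and names below are illustrative assumptions, not code from the repo.

# Sketch of a Gaussian policy head for continuous actions (e.g. Pendulum-v0).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim=3, act_dim=1):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, 128)
        self.fc_mu = nn.Linear(128, act_dim)       # mean of the action distribution
        self.fc_std = nn.Linear(128, act_dim)      # raw std, made positive below

    def forward(self, x):
        x = F.relu(self.fc1(x))
        mu = 2.0 * torch.tanh(self.fc_mu(x))       # Pendulum actions lie in [-2, 2]
        std = F.softplus(self.fc_std(x)) + 1e-5    # keep std strictly positive
        return Normal(mu, std)

dist = GaussianPolicy()(torch.randn(1, 3))
a = dist.sample()                                  # continuous action
log_prob = dist.log_prob(a)                        # used in place of prob[a] for the PPO ratio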
There are some unused imports in the code, such as
Line 8 in 8c364c3
Hi, @seungeunrho
Thanks for creating wonderful repo like this.
I think I've spotted a typo in actor_critic.py, but since I'm new to RL I'm not sure whether it is actually a typo.
Shouldn't https://github.com/seungeunrho/minimalRL/blob/master/actor_critic.py#L48 use s_prime_lst instead of s_prime? In other words,
s_batch, a_batch, r_batch, s_prime_batch, done_batch = torch.tensor(s_lst, dtype=torch.float), torch.tensor(a_lst), \
torch.tensor(r_lst, dtype=torch.float), torch.tensor(s_prime, dtype=torch.float), \
torch.tensor(done_lst, dtype=torch.float)
should be
s_batch, a_batch, r_batch, s_prime_batch, done_batch = torch.tensor(s_lst, dtype=torch.float), torch.tensor(a_lst), \
torch.tensor(r_lst, dtype=torch.float), torch.tensor(s_prime_lst, dtype=torch.float), \
torch.tensor(done_lst, dtype=torch.float)
Interestingly, both of them work.
Hi, I am trying to create an environment that is a variation of Cartpole.
From the Cartpole definition:
The studied system is a cart to which a rigid pole is hinged (see figure). The cart is free to move within the bounds of a one-dimensional track. The pole can move in the vertical plane parallel to the track. The controller can apply a force F to the cart, parallel to the track.
Suppose you can apply a force F but also a multiplier of this force M, so the total force applied is F * M.
The following is the code:
#PPO-LSTM
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
import time
import math
import numpy as np
import gym.envs.classic_control
from gym import logger  # needed for logger.warn() in CustomCartpole.step()

#Hyperparameters
learning_rate = 0.0005
gamma = 0.98
lmbda = 0.95
eps_clip = 0.1
K_epoch = 2
T_horizon = 20
class CustomCartpole(gym.envs.classic_control.CartPoleEnv):
    """Add a dimension to the cartpole action space that is used as 'speed' button."""

    def __init__(self, env_config):
        super().__init__()
        self.force_mag = 5.0
        self.action_space = gym.spaces.MultiDiscrete([2, 4])

    def step(self, action):
        err_msg = "%r (%s) invalid" % (action, type(action))
        assert self.action_space.contains(action), err_msg

        x, x_dot, theta, theta_dot = self.state
        force = self.force_mag if action[0] == 1 else -self.force_mag
        force *= (action[1] + 1)
        costheta = math.cos(theta)
        sintheta = math.sin(theta)
        temp = (force + self.polemass_length * theta_dot ** 2 * sintheta) / self.total_mass
        thetaacc = (self.gravity * sintheta - costheta * temp) / (self.length * (4.0 / 3.0 - self.masspole * costheta ** 2 / self.total_mass))
        xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass

        if self.kinematics_integrator == 'euler':
            x = x + self.tau * x_dot
            x_dot = x_dot + self.tau * xacc
            theta = theta + self.tau * theta_dot
            theta_dot = theta_dot + self.tau * thetaacc
        else:  # semi-implicit euler
            x_dot = x_dot + self.tau * xacc
            x = x + self.tau * x_dot
            theta_dot = theta_dot + self.tau * thetaacc
            theta = theta + self.tau * theta_dot

        self.state = (x, x_dot, theta, theta_dot)
        done = bool(
            x < -self.x_threshold
            or x > self.x_threshold
            or theta < -self.theta_threshold_radians
            or theta > self.theta_threshold_radians
        )

        if not done:
            reward = 1.0
        elif self.steps_beyond_done is None:
            # Pole just fell!
            self.steps_beyond_done = 0
            reward = 1.0
        else:
            if self.steps_beyond_done == 0:
                logger.warn(
                    "You are calling 'step()' even though this "
                    "environment has already returned done = True. You "
                    "should always call 'reset()' once you receive 'done = "
                    "True' -- any further steps are undefined behavior."
                )
            self.steps_beyond_done += 1
            reward = 0.0

        return np.array(self.state), reward, done, {}
class PPO(nn.Module):
    def __init__(self):
        super(PPO, self).__init__()
        self.data = []

        self.fc1 = nn.Linear(4,64)
        self.lstm = nn.LSTM(64,32)
        self.fc_pi = nn.Linear(32,2)
        self.fc_v = nn.Linear(32,2)
        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)

    def pi(self, x, hidden):
        x = F.relu(self.fc1(x))
        x = x.view(-1, 1, 64)
        x, lstm_hidden = self.lstm(x, hidden)
        x = self.fc_pi(x)
        prob = F.softmax(x, dim=2)
        return prob, lstm_hidden

    def v(self, x, hidden):
        x = F.relu(self.fc1(x))
        x = x.view(-1, 1, 64)
        x, lstm_hidden = self.lstm(x, hidden)
        v = self.fc_v(x)
        return v

    def put_data(self, transition):
        self.data.append(transition)

    def make_batch(self):
        s_lst, a_lst, r_lst, s_prime_lst, prob_a_lst, h_in_lst, h_out_lst, done_lst = [], [], [], [], [], [], [], []
        for transition in self.data:
            s, a, r, s_prime, prob_a, h_in, h_out, done = transition

            s_lst.append(s)
            a_lst.append([a])
            r_lst.append([r])
            s_prime_lst.append(s_prime)
            prob_a_lst.append([prob_a])
            h_in_lst.append(h_in)
            h_out_lst.append(h_out)
            done_mask = 0 if done else 1
            done_lst.append([done_mask])

        s,a,r,s_prime,done_mask,prob_a = torch.tensor(s_lst, dtype=torch.float), torch.tensor(a_lst), \
                                         torch.tensor(r_lst), torch.tensor(s_prime_lst, dtype=torch.float), \
                                         torch.tensor(done_lst, dtype=torch.float), torch.tensor(prob_a_lst)
        self.data = []
        return s, a, r, s_prime, done_mask, prob_a, h_in_lst[0], h_out_lst[0]

    def train_net(self):
        s, a, r, s_prime, done_mask, prob_a, (h1_in, h2_in), (h1_out, h2_out) = self.make_batch()
        first_hidden = (h1_in.detach(), h2_in.detach())
        second_hidden = (h1_out.detach(), h2_out.detach())

        for i in range(K_epoch):
            v_prime = self.v(s_prime, second_hidden).squeeze(1)
            td_target = r + gamma * v_prime * done_mask
            v_s = self.v(s, first_hidden).squeeze(1)
            delta = td_target - v_s
            delta = delta.detach().numpy()

            advantage_lst = []
            advantage = 0.0
            for item in delta[::-1]:
                advantage = gamma * lmbda * advantage + item[0]
                advantage_lst.append([advantage])
            advantage_lst.reverse()
            advantage = torch.tensor(advantage_lst, dtype=torch.float)

            pi, _ = self.pi(s, first_hidden)
            pi_a = pi.squeeze(1).gather(1,a)
            ratio = torch.exp(torch.log(pi_a) - torch.log(prob_a))  # a/b == exp(log(a)-log(b))
            surr1 = ratio * advantage
            surr2 = torch.clamp(ratio, 1-eps_clip, 1+eps_clip) * advantage
            loss = -torch.min(surr1, surr2) + F.smooth_l1_loss(v_s, td_target.detach())

            self.optimizer.zero_grad()
            loss.mean().backward(retain_graph=True)
            self.optimizer.step()
def main():
    #env = gym.make('CartPole-v1')
    env = CustomCartpole({'override_actions': False})
    model = PPO()
    score = 0.0
    print_interval = 20

    for n_epi in range(10000):
        h_out = (torch.zeros([1, 1, 32], dtype=torch.float), torch.zeros([1, 1, 32], dtype=torch.float))
        s = env.reset()
        done = False

        while not done:
            for t in range(T_horizon):
                h_in = h_out
                prob, h_out = model.pi(torch.from_numpy(s).float(), h_in)
                prob = prob.view(-1)
                m = Categorical(prob)
                a = m.sample().item()
                s_prime, r, done, info = env.step(a)

                model.put_data((s, a, r/100.0, s_prime, prob[a].item(), h_in, h_out, done))
                s = s_prime

                score += r
                if done:
                    break

            model.train_net()

        if n_epi%print_interval==0 and n_epi!=0:
            print("# of episode :{}, avg score : {:.1f}".format(n_epi, score/print_interval))
            score = 0.0

    env.close()

if __name__ == '__main__':
    main()
This code fails with the following error:
Could you please tell me how to adjust the main loop:
while not done:
    for t in range(T_horizon):
        h_in = h_out
        prob, h_out = model.pi(torch.from_numpy(s).float(), h_in)
        prob = prob.view(-1)
        m = Categorical(prob)
        a = m.sample().item()
        s_prime, r, done, info = env.step(a)

        model.put_data((s, a, r/100.0, s_prime, prob[a].item(), h_in, h_out, done))
        s = s_prime

        score += r
        if done:
            break

    model.train_net()
to this environment?
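I am not sure which error you are seeing, but as a hedged guess (an assumption, not the author's answer): env.step(a) now expects a MultiDiscrete([2, 4]) action, i.e. an array of two sub-actions, while the loop above passes a single integer. One way to adapt the policy is to use two softmax heads and sample each sub-action separately. A minimal illustrative sketch with made-up names:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class TwoHeadPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 64)
        self.fc_dir = nn.Linear(64, 2)    # push left / push right
        self.fc_mul = nn.Linear(64, 4)    # force multiplier 1..4

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc_dir(x), dim=-1), F.softmax(self.fc_mul(x), dim=-1)

p_dir, p_mul = TwoHeadPolicy()(torch.randn(1, 4))
a_dir = Categorical(p_dir).sample().item()
a_mul = Categorical(p_mul).sample().item()
action = np.array([a_dir, a_mul])          # shape matches MultiDiscrete([2, 4])
# env.step(action) would then receive a valid MultiDiscrete action;
# the stored log-prob would be log p_dir[a_dir] + log p_mul[a_mul].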
First of all, this is a really nice repo - simple and clean.
I have two issues with the a3c implementation.

1. The td_target calculated in https://github.com/seungeunrho/minimalRL/blob/master/a3c.py#L70 gives the same weight of gamma to the value of s_prime (the last state visited). Let's say s is your starting state and you are doing an n-step return; then the target will be

\sum_{i=0}^{n-1} gamma^{i} r_i + gamma^{n} v(s_prime)

You can change https://github.com/seungeunrho/minimalRL/blob/master/a3c.py#L59 to the following,

R = 0.0 if done else model.v(s_prime).detach().item()

and then td_target = R_batch.
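A self-contained sketch of the n-step target proposed above (illustrative only; the variable names mirror a3c.py but the values are dummies):

gamma = 0.98
r_lst = [1.0, 1.0, 1.0]            # rewards collected over n = 3 steps
done = False
v_s_prime = 0.5                    # stands in for model.v(s_prime).detach().item()

R = 0.0 if done else v_s_prime
td_target_lst = []
for r in r_lst[::-1]:              # walk the rollout backwards
    R = r + gamma * R              # v(s_prime) ends up weighted by gamma^n
    td_target_lst.append([R])
td_target_lst.reverse()
print(td_target_lst)               # td_target_lst[0] = r_0 + gamma*r_1 + gamma^2*r_2 + gamma^3*v(s_prime)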
2. test might be executed before the training is complete. If you plan to probe how good the model is during training, this is alright. But if you wish to see the model's performance once training is complete, you should fire the test process only after calling .join() on the train processes.

It would be nice to add the following algorithms:
I will submit a PR if I finish any of them.
Hello, nice and clear implementation! I want to ask something about the LSTM usage. While gathering experience, the input to the LSTM has dimension [1, 1, 64], which represents 1 timestep of 1 episode along with the 64 FC features?
Also, when training on a batch, you use a size like [20, 1, 64], which corresponds to 20 timesteps?
Finally, shouldn't the hidden state have the same dimensions except the last, i.e. correspond to the timestep dimension? What is the best way to handle using an LSTM, or is it just an implementation choice?
In the ReplayBuffer implementation, you can use
self.buffer = collections.deque(maxlen=buffer_limit)
to simplify the put() method -- deque will automatically drop the oldest elements.
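A short sketch of what the simplified buffer might look like (the interface is assumed from dqn.py; this is illustrative, not the repo's exact code):

import collections, random

class ReplayBuffer:
    def __init__(self, buffer_limit=50000):
        self.buffer = collections.deque(maxlen=buffer_limit)  # old items are dropped automatically

    def put(self, transition):
        self.buffer.append(transition)      # no manual popleft() needed

    def sample(self, n):
        return random.sample(self.buffer, n)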
Keep up the amazing work!
Line 104 in 46f9b32
According to the original paper, the gradient for the bias correction term is defined as below,
and as pi serves as the probability for the expectation calculation, it seems it is not the target of optimization.
Shouldn't we detach pi from the computational graph at the above line?
You may want to rename it, since .train() and .eval() are important methods of the base class, and new people might be picking up a wrong habit here. Just a thought.
The traditional implementation of REINFORCE, without importance sampling should only use data collected by the current policy to update the parameters. However, in reinforce.py, the data buffer doesn't seem to reset after every policy update. Thoughts?
From my understanding, the policy network outputs a mean and variance for a single action. After that, torch.gather is used to calculate the log_prob. Can someone help me understand the process?
Thanks for the help. 😃
Hi,
First of all, congratulations on this project.
A minimal implementation of the MuZero algorithm would be great.
The paper is here: https://arxiv.org/pdf/1911.08265
The pseudocode is: https://arxiv.org/src/1911.08265v2/anc/pseudocode.py
Thanks.
Hello, thanks for your great work!
I have one dumb question.
In the PPO-LSTM implementation, I noticed that the same first_hidden value is used when calculating both v_prime and v_s. My question: should v_prime use a different hidden value, or is this just an approximation?
Thank you!
v_prime = self.v(s_prime, first_hidden).squeeze(1)
td_target = r + gamma * v_prime * done_mask
v_s = self.v(s, first_hidden).squeeze(1)
Hi, I got a RuntimeError when I executed DDPG.py.
It seems to occur during the training of the QNet.
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm
I'm using torch 1.5.0 and Python 3.7.3
Is there anyone with the same problem as me?
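Not a confirmed fix, but one common cause of this error is feeding float64 numpy observations into float32 network weights. A hedged workaround sketch (an assumption about the cause, not the author's solution):

import numpy as np
import torch

s = np.random.randn(3).astype(np.float64)   # e.g. an observation coming back from gym
s_t = torch.from_numpy(s).float()           # cast to float32 before the forward pass
# equivalently: torch.tensor(s, dtype=torch.float)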
All the implementations of SAC I've seen use a paid physics sim; any plans to implement it here?
Hi, first of all, thanks for your awesome codes. This is not about any technical issue, but about the algorithm of the DDPG code.
As far as I know, the DDPG method can exploit online parameter updates thanks to TD learning. But in your code, the parameters are updated only after an episode is over.
I would like to ask if there is some theoretical background behind this parameter update interval.
Thank you in advance.
Hello,
It's a fantastic job and really helpful for me! Is it possible to add IMPALA by revamping A3C?
IMPALA is more efficient than A2C and A3C; all the code I find on GitHub for it is detailed and complicated.
pytorch-lightning allows for less boilerplate and more optimization. Maybe it should be used to allow for easier reuse of the code.
In line 110:
for n_epi in range(10000):
    s = env.reset()
    done = False
    while not done:
        for t in range(T_horizon):
            prob = model.pi(torch.from_numpy(s).float())
            m = Categorical(prob)
            a = m.sample().item()
            s_prime, r, done, info = env.step(a)
            model.put_data((s, a, r/100.0, s_prime, prob[a].item(), done))
            s = s_prime
            score += r
            if done:
                break
        model.train_net()  # <------- HERE
I think it should be shifted left to align with while not done, i.e. after collecting the data of one episode, we update the network's parameters. I have tested this and it gives stable performance.
for n_epi in range(10000):
    s = env.reset()
    done = False
    while not done:
        for t in range(T_horizon):
            prob = model.pi(torch.from_numpy(s).float())
            m = Categorical(prob)
            a = m.sample().item()
            s_prime, r, done, info = env.step(a)
            model.put_data((s, a, r/100.0, s_prime, prob[a].item(), done))
            s = s_prime
            score += r
            if done:
                break
    model.train_net()  # <------- UPDATED
https://github.com/seungeunrho/minimalRL/blob/master/dqn.py
Line 63 in 7597b9a
I am wondering why the train method internally loops 10 times. Shouldn't the policy network train once per action?
Thanks for the high-quality implementations. But I have a question about train_net() in the REINFORCE algorithm:
def train_net(self):
    R = 0
    for r, prob in self.data[::-1]:
        R = r + gamma * R
        loss = -torch.log(prob) * R
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
    self.data = []
The policy is updated at each step of an episode, but it is supposed to be updated only after a complete episode (or a batch of episodes),
since after we do an update the policy has changed, and the previously collected data is no longer valid according to the theory.
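A hedged sketch of the kind of restructuring suggested above: accumulate gradients over the whole episode and take a single optimizer step afterwards. This mirrors the method quoted above and is an illustration only, not the repo's code.

def train_net(self):
    # Sketch: one parameter update per episode, as the comment above suggests.
    R = 0
    self.optimizer.zero_grad()
    for r, prob in self.data[::-1]:
        R = r + gamma * R
        loss = -torch.log(prob) * R
        loss.backward()                  # accumulate gradients over the episode
    self.optimizer.step()                # single update after the full episode
    self.data = []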
Thank you for this wonderful repo. It would be even better if you also implemented SAC.
Hello,
I have enjoyed reading your good examples! Is it possible for you to add a few meta RL algorithms? Thanks!
Hi,
Thanks for your simple and awesome A3C code.
I got some of the questions to ask.
Q. Why did you put the 'test' process in the multiprocessing group together with the other actor-learner processes?
I think the 'test' process should be called after all of the other actor-learner processes are done.
Or is it just to check the training performance of global_model during training (= test)?
Thank you.
Hi.
I think REINFORCE.py:44 should be placed after REINFORCE.py:46, because once a single episode terminates, the value of "done" stays True (it is never reset), causing the main function to skip the while loop in the subsequent episodes.
That said, I'm not sure about this issue; I'm a total newbie.
BTW, all of the implementations are highly efficient, easy-to-customize, easy-to-understand, and very helpful. Thank you for sharing.
I'm somewhat new to the field of reinforcement learning, and I find these simplistic examples to be extremely helpful -- thank you!
Would you be able to help me with understanding a minimal way to save / replace these trained models?
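Not a documented recipe from this repo, just a minimal sketch using standard PyTorch state_dict handling. Qnet below is a simplified stand-in for any of the repo's model classes, and the file name is only an example:

import torch
import torch.nn as nn

class Qnet(nn.Module):                           # stand-in for any model class in this repo
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

model = Qnet()
# ... train the model ...
torch.save(model.state_dict(), 'qnet_cartpole.pt')     # save trained weights

restored = Qnet()
restored.load_state_dict(torch.load('qnet_cartpole.pt'))
restored.eval()                                  # switch to evaluation mode before rollouts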
Would it be possible to implement these algorithms in a continuous env like BipedalWalker?
Also, SAC is a cool algorithm.
Finally, it would be wonderful if you posted scores for each algorithm in the readme (so we can compare performance without having to clone and run everything).
My only negative feedback: in some places you use a single letter to describe something where a word would be clearer and would not add lines. If you want this to be the clearest/simplest RL repo, it would help if readers could understand the algorithm without having to guess what each letter means.