
seungeunrho / minimalRL

2.7K 2.7K 450.0 62 KB

Implementations of basic RL algorithms with minimal lines of code! (PyTorch based)

License: MIT License

Python 100.00%
a2c a3c acer ddpg deep-learning deep-reinforcement-learning dqn machine-learning policy-gradients ppo pytorch reinforce reinforcement-learning sac simple

minimalrl's People

Contributors

jsrimr, rahulptel, rossrho, seungeunrho


minimalrl's Issues

PPO has no entropy factor

Hey there,

Would it be wise to include an entropy factor in your PPO implementation?

How would one do that?

My second question is: why do you not use 0.5 * MSE loss instead of F.smooth_l1_loss?

Here are some snippets as a suggestion, but I am not absolutely sure:

            surr1 = ratio * advantage
            surr2 = torch.clamp(ratio, 1-eps_clip, 1+eps_clip) * advantage
            actor_loss  = -torch.min(surr1, surr2)
            critic_loss = F.smooth_l1_loss(self.v(s), td_target.detach())  # alternative: 0.5 * self.MseLoss(state_values, torch.tensor(rewards))
            # beta      = 0.01  # encourages exploring different policies; leave at 0.01
            total_loss  = critic_loss + actor_loss  # - beta * dist_entropy

To include entropy, we need a function like this:

    def evaluate(self, state, action):
        #what values are returned here?
        action_probs = self.action_layer(state)
        dist = Categorical(action_probs)

        action_logprobs = dist.log_prob(action)
        dist_entropy = dist.entropy()

        state_value = self.value_layer(state)

        return action_logprobs, torch.squeeze(state_value), dist_entropy

However, I am not sure about the best way to include entropy in your implementation.

I would be glad for some help.
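
For reference, a minimal sketch of how the entropy bonus could be folded into the existing train_net() loss (the names ratio, advantage, td_target, s and eps_clip follow ppo.py; the coefficient beta and the Categorical construction here are assumptions, not something the repo defines):

    # Sketch only: entropy-regularized PPO loss inside train_net().
    # Assumes pi has shape [batch, num_actions] (softmax output of the policy head).
    dist = Categorical(pi)
    entropy = dist.entropy()                  # per-sample policy entropy

    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantage
    beta = 0.01                               # assumed entropy coefficient
    loss = -torch.min(surr1, surr2) \
           + F.smooth_l1_loss(self.v(s), td_target.detach()) \
           - beta * entropy                   # subtracting entropy encourages exploration

    self.optimizer.zero_grad()
    loss.mean().backward()
    self.optimizer.step()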

Improper asynchronous update in a3c

I doubt whether the asynchronous update made by the current a3c adheres to what is suggested in the paper. Suppose the workers share shared_model. Then each worker should:

  1. Copy the weights of the shared network into its local_model
  2. Run for n steps or until the end of the episode
  3. Calculate the gradient
  4. Pass the gradient of the local_model to shared_model
  5. Update the shared_model and go to step 1

Thus, while the local_model is taking the n steps, its weights do not change.

However, in the current implementation, we use the shared_model directly for taking those n steps. Hence, it may happen that some process P1 updates the weights of shared_model, which affects another process P2: P2 might have started with some weight configuration of shared_model that is then modified before its n steps are completed.

I think we can make the following change to the train method to avoid the above phenomenon:

def train(model):
    local_model = ActorCritic()
    local_model.load_state_dict(model.state_dict())
    # Create optimizer for the shared model
    # Create environment

    # Take n steps using local_model

    optimizer.zero_grad()
    # Calculate loss and get the gradients
    loss_fn(local_model(data), labels).backward()

    # Copy the local gradients onto the shared model's parameters
    for param, shared_param in zip(local_model.parameters(), model.parameters()):
        if shared_param.grad is not None:
            shared_param._grad = param.grad

    optimizer.step()

I am not very familiar with asynchronous model updates in PyTorch, but looking at the docs at https://pytorch.org/docs/stable/notes/multiprocessing.html#asynchronous-multiprocess-training-e-g-hogwild, I think we are using the shared_model all the time.

If you think what I say is correct, I can make a PR.

PPO Continuous Action Space

What changes would be required to employ your ppo algorithm in a continuous action space like Pendulum-v0?
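
One common way to handle a continuous action space (a sketch only, under the assumption of Pendulum-v0's 3-dimensional state and single action bounded in [-2, 2]; this is not code from the repo) is to replace the softmax policy head with a Gaussian head and sample from a Normal distribution, keeping the log-probability for the PPO ratio:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.distributions import Normal

    class GaussianPolicy(nn.Module):
        """Sketch of a continuous policy head for Pendulum-v0."""
        def __init__(self):
            super().__init__()
            self.fc1    = nn.Linear(3, 128)
            self.fc_mu  = nn.Linear(128, 1)   # mean of the action distribution
            self.fc_std = nn.Linear(128, 1)   # state-dependent std, kept positive below

        def pi(self, x):
            x   = F.relu(self.fc1(x))
            mu  = 2.0 * torch.tanh(self.fc_mu(x))    # Pendulum actions lie in [-2, 2]
            std = F.softplus(self.fc_std(x)) + 1e-3  # ensure std > 0
            return mu, std

    # Acting: sample the action and store its log-prob; the PPO ratio becomes
    # exp(log_prob_new - log_prob_old) instead of the Categorical gather.
    policy = GaussianPolicy()
    mu, std = policy.pi(torch.randn(3))
    dist = Normal(mu, std)
    a = dist.sample()
    log_prob = dist.log_prob(a)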

Remove unused import

There are some unused imports in the code, such as

import time

I believe it would be better if we remove those unused imports. The code will become shorter and cleaner.

Typo of actor_critic.py?

Hi, @seungeunrho
Thanks for creating a wonderful repo like this.

I think I've spotted a typo in actor_critic.py, but since I'm new to RL I'm not sure whether it really is one.

Shouldn't https://github.com/seungeunrho/minimalRL/blob/master/actor_critic.py#L48 use s_prime_lst instead of s_prime? In other words,

s_batch, a_batch, r_batch, s_prime_batch, done_batch = torch.tensor(s_lst, dtype=torch.float), torch.tensor(a_lst), \
                                                       torch.tensor(r_lst, dtype=torch.float), torch.tensor(s_prime, dtype=torch.float), \
                                                       torch.tensor(done_lst, dtype=torch.float)

should be

s_batch, a_batch, r_batch, s_prime_batch, done_batch = torch.tensor(s_lst, dtype=torch.float), torch.tensor(a_lst), \
                                                       torch.tensor(r_lst, dtype=torch.float), torch.tensor(s_prime_lst, dtype=torch.float), \
                                                       torch.tensor(done_lst, dtype=torch.float)

Interestingly, both of them work.

Cartpole environment with Multidiscrete action space

Hi, I am trying to create an environment that is a variation of Cartpole.
From the CartPole definition:

The studied system is a cart to which a rigid pole is hinged (see figure). The cart is free to move within the bounds of a one-dimensional track. The pole can move in the vertical plane parallel to the track. The controller can apply a force F to the cart, parallel to the track.

Suppose you can apply a force F but also a multiplier of this force M, so the total force applied is F * M.

The following is the code:

#PPO-LSTM
import gym
from gym import logger  # used by logger.warn() in step() below
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
import time
import math
import numpy as np
import gym.envs.classic_control

#Hyperparameters
learning_rate = 0.0005
gamma         = 0.98
lmbda         = 0.95
eps_clip      = 0.1
K_epoch       = 2
T_horizon     = 20

class CustomCartpole(gym.envs.classic_control.CartPoleEnv):
    """Add a dimension to the cartpole action space that is used as 'speed' button."""

    def __init__(self, env_config):
        super().__init__()
        self.force_mag = 5.0
        self.action_space = gym.spaces.MultiDiscrete([2, 4])

    def step(self, action):
        err_msg = "%r (%s) invalid" % (action, type(action))
        assert self.action_space.contains(action), err_msg

        x, x_dot, theta, theta_dot = self.state
        force = self.force_mag if action[0] == 1 else -self.force_mag
        force *= (action[1] + 1)
        costheta = math.cos(theta)
        sintheta = math.sin(theta)

        temp = (force + self.polemass_length * theta_dot ** 2 * sintheta) / self.total_mass
        thetaacc = (self.gravity * sintheta - costheta * temp) / (self.length * (4.0 / 3.0 - self.masspole * costheta ** 2 / self.total_mass))
        xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass

        if self.kinematics_integrator == 'euler':
            x = x + self.tau * x_dot
            x_dot = x_dot + self.tau * xacc
            theta = theta + self.tau * theta_dot
            theta_dot = theta_dot + self.tau * thetaacc
        else:  # semi-implicit euler
            x_dot = x_dot + self.tau * xacc
            x = x + self.tau * x_dot
            theta_dot = theta_dot + self.tau * thetaacc
            theta = theta + self.tau * theta_dot

        self.state = (x, x_dot, theta, theta_dot)

        done = bool(
            x < -self.x_threshold
            or x > self.x_threshold
            or theta < -self.theta_threshold_radians
            or theta > self.theta_threshold_radians
        )

        if not done:
            reward = 1.0
        elif self.steps_beyond_done is None:
            # Pole just fell!
            self.steps_beyond_done = 0
            reward = 1.0
        else:
            if self.steps_beyond_done == 0:
                logger.warn(
                    "You are calling 'step()' even though this "
                    "environment has already returned done = True. You "
                    "should always call 'reset()' once you receive 'done = "
                    "True' -- any further steps are undefined behavior."
                )
            self.steps_beyond_done += 1
            reward = 0.0

        return np.array(self.state), reward, done, {}


class PPO(nn.Module):
    def __init__(self):
        super(PPO, self).__init__()
        self.data = []

        self.fc1   = nn.Linear(4,64)
        self.lstm  = nn.LSTM(64,32)
        self.fc_pi = nn.Linear(32,2)
        self.fc_v  = nn.Linear(32,2)
        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)

    def pi(self, x, hidden):
        x = F.relu(self.fc1(x))
        x = x.view(-1, 1, 64)
        x, lstm_hidden = self.lstm(x, hidden)
        x = self.fc_pi(x)
        prob = F.softmax(x, dim=2)
        return prob, lstm_hidden

    def v(self, x, hidden):
        x = F.relu(self.fc1(x))
        x = x.view(-1, 1, 64)
        x, lstm_hidden = self.lstm(x, hidden)
        v = self.fc_v(x)
        return v

    def put_data(self, transition):
        self.data.append(transition)

    def make_batch(self):
        s_lst, a_lst, r_lst, s_prime_lst, prob_a_lst, h_in_lst, h_out_lst, done_lst = [], [], [], [], [], [], [], []
        for transition in self.data:
            s, a, r, s_prime, prob_a, h_in, h_out, done = transition

            s_lst.append(s)
            a_lst.append([a])
            r_lst.append([r])
            s_prime_lst.append(s_prime)
            prob_a_lst.append([prob_a])
            h_in_lst.append(h_in)
            h_out_lst.append(h_out)
            done_mask = 0 if done else 1
            done_lst.append([done_mask])

        s,a,r,s_prime,done_mask,prob_a = torch.tensor(s_lst, dtype=torch.float), torch.tensor(a_lst), \
                                         torch.tensor(r_lst), torch.tensor(s_prime_lst, dtype=torch.float), \
                                         torch.tensor(done_lst, dtype=torch.float), torch.tensor(prob_a_lst)
        self.data = []
        return s,a,r,s_prime, done_mask, prob_a, h_in_lst[0], h_out_lst[0]

    def train_net(self):
        s,a,r,s_prime,done_mask, prob_a, (h1_in, h2_in), (h1_out, h2_out) = self.make_batch()
        first_hidden  = (h1_in.detach(), h2_in.detach())
        second_hidden = (h1_out.detach(), h2_out.detach())

        for i in range(K_epoch):
            v_prime = self.v(s_prime, second_hidden).squeeze(1)
            td_target = r + gamma * v_prime * done_mask
            v_s = self.v(s, first_hidden).squeeze(1)
            delta = td_target - v_s
            delta = delta.detach().numpy()

            advantage_lst = []
            advantage = 0.0
            for item in delta[::-1]:
                advantage = gamma * lmbda * advantage + item[0]
                advantage_lst.append([advantage])
            advantage_lst.reverse()
            advantage = torch.tensor(advantage_lst, dtype=torch.float)

            pi, _ = self.pi(s, first_hidden)
            pi_a = pi.squeeze(1).gather(1,a)
            ratio = torch.exp(torch.log(pi_a) - torch.log(prob_a))  # a/b == log(exp(a)-exp(b))

            surr1 = ratio * advantage
            surr2 = torch.clamp(ratio, 1-eps_clip, 1+eps_clip) * advantage
            loss = -torch.min(surr1, surr2) + F.smooth_l1_loss(v_s, td_target.detach())

            self.optimizer.zero_grad()
            loss.mean().backward(retain_graph=True)
            self.optimizer.step()

def main():
    #env = gym.make('CartPole-v1')
    env = CustomCartpole({'override_actions': False})
    model = PPO()
    score = 0.0
    print_interval = 20

    for n_epi in range(10000):
        h_out = (torch.zeros([1, 1, 32], dtype=torch.float), torch.zeros([1, 1, 32], dtype=torch.float))
        s = env.reset()
        done = False

        while not done:
            for t in range(T_horizon):
                h_in = h_out
                prob, h_out = model.pi(torch.from_numpy(s).float(), h_in)
                prob = prob.view(-1)
                m = Categorical(prob)
                a = m.sample().item()
                s_prime, r, done, info = env.step(a)

                model.put_data((s, a, r/100.0, s_prime, prob[a].item(), h_in, h_out, done))
                s = s_prime

                score += r
                if done:
                    break

            model.train_net()

        if n_epi%print_interval==0 and n_epi!=0:
            print("# of episode :{}, avg score : {:.1f}".format(n_epi, score/print_interval))
            score = 0.0

    env.close()

if __name__ == '__main__':
    main()

This code fails with the following error:

(screenshot of the error traceback not included)

Could you please tell me how to adjust the main loop:

while not done:
    for t in range(T_horizon):
        h_in = h_out
        prob, h_out = model.pi(torch.from_numpy(s).float(), h_in)
        prob = prob.view(-1)
        m = Categorical(prob)
        a = m.sample().item()
        s_prime, r, done, info = env.step(a)

        model.put_data((s, a, r/100.0, s_prime, prob[a].item(), h_in, h_out, done))
        s = s_prime

        score += r
        if done:
            break

    model.train_net()

to this environment?
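
For what it's worth, one way to support a MultiDiscrete([2, 4]) action space (a sketch of the pieces that would change; the two-head split and the joint-probability treatment are assumptions, not something from the repo) is to give the policy one softmax head per action dimension, sample each, and pass both indices to the environment:

    # Sketch: replace the single fc_pi head with one head per action dimension,
    # reusing the same fc1/lstm backbone as the PPO class above.
    self.fc_pi_dir  = nn.Linear(32, 2)   # direction head (2 choices)
    self.fc_pi_mult = nn.Linear(32, 4)   # force-multiplier head (4 choices)

    def pi(self, x, hidden):
        x = F.relu(self.fc1(x))
        x = x.view(-1, 1, 64)
        x, lstm_hidden = self.lstm(x, hidden)
        prob_dir  = F.softmax(self.fc_pi_dir(x), dim=2)
        prob_mult = F.softmax(self.fc_pi_mult(x), dim=2)
        return prob_dir, prob_mult, lstm_hidden

    # Rollout loop: sample each dimension and give step() a NumPy array.
    prob_dir, prob_mult, h_out = model.pi(torch.from_numpy(s).float(), h_in)
    a_dir  = Categorical(prob_dir.view(-1)).sample().item()
    a_mult = Categorical(prob_mult.view(-1)).sample().item()
    s_prime, r, done, info = env.step(np.array([a_dir, a_mult]))

    # Store the joint probability (product of the two independent heads) as prob_a,
    # and gather both heads in train_net() when forming the PPO ratio.
    prob_a = prob_dir.view(-1)[a_dir].item() * prob_mult.view(-1)[a_mult].item()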

Wrong td_target and test() call in a3c implementation

First of all, this is a really nice repo - simple and clean.

I have two issues with the a3c implementation

  1. The td_target calculated in https://github.com/seungeunrho/minimalRL/blob/master/a3c.py#L70 weights the value of s_prime (the last state visited) with the same single factor of gamma. Say s is your starting state and you are doing an n-step return; then the target should be \sum_{i=0}^{n-1} \gamma^{i} r_i + \gamma^{n} V(s_prime).

You can change https://github.com/seungeunrho/minimalRL/blob/master/a3c.py#L59 to
R = 0.0 if done else model.v(s_prime).detach().item()
and then td_target = R_batch (see the sketch after the second point).

  2. test might be executed before the training is complete. If you plan to probe how good the model is during training, then this is alright. But if you wish to see the model's performance once training is complete, you should fire the test after calling .join() on the train processes.
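
A minimal sketch of the n-step target described in the first point (r_lst, done, s_prime and model.v follow a3c.py; how the batch tensors are named is an assumption):

    # Bootstrap once from the last state, then discount backwards through the rollout.
    R = 0.0 if done else model.v(torch.tensor(s_prime, dtype=torch.float)).detach().item()
    td_target_lst = []
    for reward in r_lst[::-1]:        # walk the n-step rollout backwards
        R = reward + gamma * R         # R_i = r_i + gamma * R_{i+1}
        td_target_lst.append([R])
    td_target_lst.reverse()
    td_target = torch.tensor(td_target_lst, dtype=torch.float)   # the "R_batch" above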

Add new algorithms

It would be nice to add the following algorithms:

  • RAINBOW
  • A2C (multiprocessing)

I will submit a PR if I finish any of them.

Query about LSTM

Hello, nice and clear implementation! I want to ask something about the LSTM usage. While gathering experience, the input to the LSTM has dimension [1, 1, 64], which represents 1 timestep of 1 episode along with the 64 FC features?

Also, when training on a batch you use a size like [20, 1, 64], which corresponds to 20 timesteps?

Finally, shouldn't the hidden state have the same dimensions except for the last one, for example corresponding to the timestep dimension? What is the best way to handle the LSTM, or is it just an implementation choice?
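
For reference, a small sketch of the shapes PyTorch's nn.LSTM expects (this is the library's default convention, not something specific to the repo): the input is (seq_len, batch, input_size), while each hidden/cell state is (num_layers, batch, hidden_size), so the hidden state has no timestep dimension at all:

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=64, hidden_size=32)   # same 64/32 sizes as in the snippets above

    # Acting: one timestep, batch of one episode.
    x_step = torch.zeros(1, 1, 64)                              # (seq_len=1, batch=1, features=64)
    h0 = (torch.zeros(1, 1, 32), torch.zeros(1, 1, 32))         # (num_layers=1, batch=1, hidden=32)
    out, h1 = lstm(x_step, h0)                                  # out: (1, 1, 32)

    # Training: the whole rollout as one sequence, started from the stored hidden state.
    x_seq = torch.zeros(20, 1, 64)                              # 20 timesteps
    out_seq, _ = lstm(x_seq, h0)                                # out_seq: (20, 1, 32)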

Training speed is very slow!!!

The README says every algorithm can be trained within 30 seconds, even without a GPU, but that is not what I observe.
(screenshot not included)
The two places marked in the screenshot stall for a long time, and DQN training did not finish after more than an hour.

Use maxlen in deque initializer

In the ReplayBuffer implementation, you can use

self.buffer = collections.deque(maxlen=buffer_limit)

to simplify the put() method -- deque will automatically drop the oldest elements.
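
For illustration, a minimal sketch of what the buffer could then look like (the put()/sample() names follow the issue and, I assume, the repo's ReplayBuffer; buffer_limit is the existing hyperparameter):

    import collections
    import random

    class ReplayBuffer:
        def __init__(self, buffer_limit=50000):
            # maxlen makes the deque evict its oldest element automatically once full
            self.buffer = collections.deque(maxlen=buffer_limit)

        def put(self, transition):
            self.buffer.append(transition)      # no manual size check or popleft() needed

        def sample(self, n):
            return random.sample(self.buffer, n)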
Keep up the amazing work!

Wrong gradient flow in bias correction term of ACER?

loss2 = -correction_coeff * pi * torch.log(pi) * (q.detach()-v) # bias correction term

According to the original paper, the gradient of the bias correction term is defined as in the paper's equation (the screenshot is not included here), and since pi only serves as the probability weight of the expectation, it seems it is not a target of the optimization.

Shouldn't we detach pi from the computational graph in the line above?
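
If that reading is right, the change would look something like the sketch below (whether this matches the authors' intent is exactly what the issue asks):

    # Sketch: keep the gradient path through log(pi) only; pi acts as the
    # (fixed) probability weight of the expectation over actions.
    loss2 = -correction_coeff * pi.detach() * torch.log(pi) * (q.detach() - v)  # bias correction term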

torch.gather in relation to the policy gradient

From my understanding, the policy network gives an output of a mean and variance for a single action. After that, torch.gather is used to calculate the log_prob. Can someone help me understand the process?
Thanks for the help. 😃
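
For what it's worth, in the discrete-action scripts the policy outputs one probability per action, and gather simply picks out the probability of the action that was actually taken; a small standalone sketch:

    import torch

    # Probabilities over 2 actions for a batch of 3 states (e.g. a softmax output).
    pi = torch.tensor([[0.7, 0.3],
                       [0.2, 0.8],
                       [0.5, 0.5]])
    a = torch.tensor([[0], [1], [0]])   # actions taken, shape [batch, 1]

    pi_a = pi.gather(1, a)              # pi(a|s) per row -> [[0.7], [0.8], [0.5]]
    log_prob = torch.log(pi_a)          # used in the policy-gradient / PPO ratio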

LSTM + PPO value fitting

Hello, thanks for your great work!
I have one dumb question.
In the LSTM PPO implementation, I noticed that the same first_hidden value is used when calculating v_prime and v_s. My question is: should v_prime use a different first hidden value, or is this just an approximation?
Thank you!

v_prime = self.v(s_prime, first_hidden).squeeze(1)
td_target = r + gamma * v_prime * done_mask
v_s = self.v(s, first_hidden).squeeze(1)

RuntimeError while running DDPG.py

Hi, I got a RuntimeError when I executed ddpg.py.
It seems to occur during the training of QNet.

RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm

I'm using torch 1.5.0 and Python 3.7.3.
Is anyone else having the same problem?
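
This kind of error usually means a float64 NumPy array reached a Linear layer whose weights are float32; a sketch of the usual workaround (whether that is the actual cause here is an assumption):

    import numpy as np
    import torch

    s = np.array([0.1, -0.2, 0.05], dtype=np.float64)   # e.g. an observation from gym

    # torch.from_numpy keeps float64, which nn.Linear (float32 weights) rejects.
    x = torch.from_numpy(s).float()          # cast to float32 before the forward pass
    # or build the tensor with an explicit dtype:
    x = torch.tensor(s, dtype=torch.float)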

Soft Actor Critic?

All the implementations of SAC I have seen use a paid physics simulator; any plans to implement it here?

A naive question about updating parameters in DDPG.

Hi, first of all, thanks for your awesome code. This is not about a technical issue, but about the algorithm in the DDPG code.

As far as I know, DDPG can exploit online parameter updates thanks to TD learning. But in your code, the parameters are updated only after an episode is over.

I would like to ask whether there is some theoretical background behind this parameter update interval?

Thank you in advance.

Add minimal IMPALA?

Hello,

It's a fantastic job and really helpful for me! Would it be possible to add IMPALA by revamping A3C?
IMPALA is more efficient than A2C and A3C, but all the code I can find on GitHub for it is detailed and complicated.

PPO update mistake?

In line 110:

for n_epi in range(10000):
        s = env.reset()
        done = False
        while not done:
            for t in range(T_horizon):
                prob = model.pi(torch.from_numpy(s).float())
                m = Categorical(prob)
                a = m.sample().item()
                s_prime, r, done, info = env.step(a)

                model.put_data((s, a, r/100.0, s_prime, prob[a].item(), done))
                s = s_prime

                score += r
                if done:
                    break

            model.train_net() # <------- HERE

I think it should be shifted left to align with the while not done loop, i.e., after collecting the data of one episode, we update the network's parameters. I have tested this and it gives stable performance.

for n_epi in range(10000):
        s = env.reset()
        done = False
        while not done:
            for t in range(T_horizon):
                prob = model.pi(torch.from_numpy(s).float())
                m = Categorical(prob)
                a = m.sample().item()
                s_prime, r, done, info = env.step(a)

                model.put_data((s, a, r/100.0, s_prime, prob[a].item(), done))
                s = s_prime

                score += r
                if done:
                    break

        model.train_net() # <------- UPDATED

TypeError: expected np.ndarray (got tuple)

(traceback screenshot not included)

My system environment is below

  • virtual machine ubuntu 18.04 on windows
  • miniconda
  • python 3.9 version

I just copied and pasted the minimalRL code into my workspace...
I cannot execute the example code.

Thank you
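
A likely cause (an assumption, since the traceback is not shown): recent gym releases return an (observation, info) tuple from env.reset() and a 5-tuple from env.step(), so the old unpacking ends up feeding a tuple to torch.from_numpy. A sketch of the adjustment:

    import gym

    env = gym.make('CartPole-v1')

    # gym >= 0.26: reset() returns (obs, info) and step() returns 5 values.
    s, _ = env.reset()
    s_prime, r, terminated, truncated, info = env.step(env.action_space.sample())
    done = terminated or truncated

    # Alternatively, pin an older gym release where the single-return reset()
    # and 4-tuple step() API that the repo assumes still applies.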

Problem of `train_net()` in REINFORCE algorithm.

Thanks for the high-quality implementations, but I have a question about train_net() in the REINFORCE algorithm:

    def train_net(self):
        R = 0
        for r, prob in self.data[::-1]:
            R = r + gamma * R
            loss = -torch.log(prob) * R
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
        self.data = []

The policy is updated at every step of the episode, but it is supposed to be updated only after a complete episode (or a batch of episodes).

Since, once we perform an update, the policy changes, the data collected under the old policy is no longer valid according to the theory.
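
A minimal sketch of one way to do this, accumulating the gradient over the whole episode and stepping once (this mirrors the quoted method, with zero_grad/step moved outside the loop):

    def train_net(self):
        R = 0
        self.optimizer.zero_grad()
        for r, prob in self.data[::-1]:
            R = r + gamma * R             # discounted return from this step onward
            loss = -torch.log(prob) * R
            loss.backward()               # gradients accumulate across the episode
        self.optimizer.step()             # a single update from one fixed policy's data
        self.data = []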

Add SAC?

Thank you for this wonderful repo. If you also implemented SAC, that would be even better.

Add meta RL algorithms?

Hello,

I have enjoyed reading your good examples! Is it possible for you to add a few meta RL algorithms? Thanks!

Questions about A3C

Hi,
Thanks for your simple and awesome A3C code.

I got some of the questions to ask.

Q. Why did you put the 'test' process into the multiprocessing setup alongside the other actor-learner processes?
I think the 'test' process should be called after all of the other actor-learner processes are done.
Or is it just there to check the training performance of global_model while it trains?

Thank you.

Termination of a CartPole episode in REINFORCE.py

Hi.
I think REINFORCE.py:44 should be placed after REINFORCE.py:46, because once a single episode terminates, the value of "done" stays True (and is not reset), causing the main function to skip the while loop in subsequent episodes.
That said, I'm not sure about this issue; I'm a total newbie.
BTW, all of the implementations are highly efficient, easy to customize, easy to understand, and very helpful. Thank you for sharing.

Minimal way to save / replay trained model?

I'm somewhat new to the field of reinforcement learning, and I find these simplistic examples to be extremely helpful -- thank you!

Would you be able to help me understand a minimal way to save and replay these trained models?
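
A minimal sketch with plain PyTorch (the PPO class, env, and the greedy replay loop are assumptions for illustration; any of the repo's models can be saved the same way):

    import torch

    # After training: save only the parameters.
    torch.save(model.state_dict(), 'ppo_cartpole.pt')

    # Later: rebuild the network, load the weights, and roll out without training.
    model = PPO()
    model.load_state_dict(torch.load('ppo_cartpole.pt'))
    model.eval()

    s = env.reset()
    done = False
    while not done:
        with torch.no_grad():
            prob = model.pi(torch.from_numpy(s).float())
        a = prob.argmax().item()        # greedy action for replay
        s, r, done, info = env.step(a)
        env.render()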

Please add 1 continuous env

Would it be possible to implement these algorithms in a continuous env like bipedalwalker?

Also, SAC is a cool algorithm.

Finally, it would be wonderful if you posted scores for each algorithm in the readme (so we can compare performance without having to clone and run everything).

My only negative feedback would be that, in some places, you use a single letter to describe something where a word would be clearer and would not add lines. If you want this to be the clearest/simplest RL repo, it would help if readers could understand the algorithm without having to guess what each letter means.
