ericyangyu / ppo-for-beginners

A simple and well styled PPO implementation. Based on my Medium series: https://medium.com/@eyyu/coding-ppo-from-scratch-with-pytorch-part-1-4-613dfc1b14c8.

License: MIT License

Python 97.08% Shell 2.92%
ppo reinforcement-learning reinforcement-learning-algorithms machine-learning pytorch


PPO for Beginners

Introduction

Hi! My name is Eric Yu, and I wrote this repository to help beginners get started writing Proximal Policy Optimization (PPO) from scratch using PyTorch. My goal is to provide a PPO implementation that's bare-bones (little to no fancy tricks) and extremely well documented, styled, and structured. I'm especially targeting people who are tired of reading endless PPO implementations and having absolutely no idea what's going on.

If you're not coming from Medium, please read my series first.

I wrote this code assuming you have some experience with Python and Reinforcement Learning (RL), including how policy gradient (pg) algorithms and PPO work (for PPO, you only need to be familiar with it at a theoretical level; after all, this code should help you put PPO into practice). If you're unfamiliar with RL, pg, or PPO, follow the three links below in order:

If unfamiliar with RL, read OpenAI Introduction to RL (all 3 parts)
If unfamiliar with pg, read An Intuitive Explanation of Policy Gradient
If unfamiliar with PPO theory, read PPO stack overflow post
If unfamiliar with all 3, go through those links above in order from top to bottom.

Please note that this PPO implementation assumes a continuous observation and action space, but you can change either to discrete relatively easily. I follow the pseudocode provided in OpenAI's Spinning Up for PPO: https://spinningup.openai.com/en/latest/algorithms/ppo.html; pseudocode line numbers are specified as "ALG STEP #" in ppo.py.

Hope this is helpful, as I wish I had a resource like this when I started my journey into Reinforcement Learning.

Usage

First, I recommend creating a Python virtual environment:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

To train from scratch:

python main.py

To test the model:

python main.py --mode test --actor_model ppo_actor.pth

To train with existing actor/critic models:

python main.py --actor_model ppo_actor.pth --critic_model ppo_critic.pth

NOTE: to change hyperparameters, environments, etc., edit main.py; I didn't expose them as command-line arguments because I don't like how long that makes the command.
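For a sense of what that edit looks like, here is a minimal sketch (the key names are taken from a snippet in the issues below; check main.py and ppo.py for the full set and defaults):

hyperparameters = {
    'timesteps_per_batch': 2048,
    'max_timesteps_per_episode': 200,
    'gamma': 0.99,
    'n_updates_per_iteration': 10,
    'lr': 3e-4,
    'clip': 0.2,
}
model = PPO(env=env, **hyperparameters)  # env and PPO are set up earlier in main.py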

How it works

main.py is our executable. It parses arguments using arguments.py, then initializes our environment and PPO model. Depending on the mode you specify (train by default), it will train or test our model. To train our model, all we have to do is call the learn function! This was designed with how you train PPO2 with stable_baselines in mind.
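If you just want the shape of that API, here is a rough sketch of the train path, mirroring the Medium-series snippet quoted in the issues below (the import path is assumed, and the exact constructor arguments may differ between the blog's incremental versions and the final repo; see main.py for the real wiring):

import gym
from ppo import PPO     # assumed import; ppo.py defines the PPO class

env = gym.make('Pendulum-v1')
model = PPO(env)        # constructor arguments may differ in the final repo; see main.py
model.learn(10000)      # train for 10,000 timesteps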

arguments.py is what main.py calls to parse arguments from the command line.

ppo.py contains our PPO model. All the learning magic happens in this file. Please read my Medium series to see how it works. Another approach I recommend is using pdb, the Python debugger, and stepping through my code starting from where learn is called in main.py.

network.py contains a sample Feed Forward Neural Network we can use to define our actor and critic networks in PPO.
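If you just want the general shape before opening the file, here is a minimal sketch of such a network (layer sizes and activations are illustrative, not necessarily the exact ones in network.py):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForwardNN(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, 64)
        self.layer2 = nn.Linear(64, 64)
        self.layer3 = nn.Linear(64, out_dim)

    def forward(self, obs):
        # Accept raw numpy observations for convenience during rollouts.
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float)
        x = F.relu(self.layer1(obs))
        x = F.relu(self.layer2(x))
        return self.layer3(x)

The same class can serve as both the actor (out_dim = action dimension) and the critic (out_dim = 1).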

eval_policy.py contains the code for evaluating the policy. It's a completely separate module from the other code.

The graph_code directory contains the code to automatically collect data and generate graphs. It takes ~10 hours on a decent computer to generate all the data in my Medium article. All the data from the Medium article should still be in graph_code/graph_data too, in case you're interested; if you want, you can regenerate the graphs I use with that data. For more details, read the README in graph_code.

Here's a great pdb tutorial to get started: https://www.youtube.com/watch?v=VQjCx3P89yk&ab_channel=TutorialEdge
Or if you're an expert with debuggers, here's the documentation: https://docs.python.org/3/library/pdb.html

Environments

Here's a list of environments you can try out. Note that in this PPO implementation, you can only use the ones with Box for both observation and action spaces.
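A quick sanity check you could run before training on a new environment (a small sketch, assuming the classic gym API):

import gym
from gym.spaces import Box

env = gym.make('Pendulum-v1')
# This implementation expects continuous (Box) observation and action spaces.
assert isinstance(env.observation_space, Box)
assert isinstance(env.action_space, Box)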

Hyperparameters can be found here.

Results

Please refer to my Medium article.

Contact

If you have any questions or would like to reach out to me, you can find me here:
Email: [email protected]
LinkedIn: https://www.linkedin.com/in/eric-yu-engineer/

ppo-for-beginners's People

Contributors

clemens-tolboom, ericyangyu, raymondxzr


ppo-for-beginners's Issues

Why the critic's loss is the mean squared error of the predicted values with the rewards-to-go

Thanks very much for the tutorial, but I have a question.
From my understanding, the critic's loss should be 'sqr(predicted value - true value)',
but from the code and paper, it is
critic_loss = nn.MSELoss()(V, batch_rtgs)
V is the predicted value, but why can we treat 'batch_rtgs' as the true value? It was previously used as the Q value in the advantage function:
A_k = batch_rtgs - V.detach()
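For context, batch_rtgs holds the rewards-to-go: the discounted sum of rewards from each timestep to the end of its episode, which is a Monte Carlo estimate of the value of that state, and that is why it serves as the critic's regression target. A minimal sketch of the computation, following the discounted-sum form described in the Medium series (the gamma value is illustrative):

# Rewards-to-go: for each timestep, the discounted sum of rewards from that
# point until the end of the episode (a Monte Carlo estimate of V(s)).
def compute_rtgs(batch_rews, gamma=0.99):
    batch_rtgs = []
    for ep_rews in reversed(batch_rews):    # iterate over episodes in reverse
        discounted_reward = 0
        for rew in reversed(ep_rews):       # accumulate backwards through the episode
            discounted_reward = rew + gamma * discounted_reward
            batch_rtgs.insert(0, discounted_reward)
    return batch_rtgs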

PPO gets stuck in custom environment

Hello, and thank you for the implementation; it really helps. I have an environment that is sparse and only gives a reward on task completion. I followed your code and implemented a PPO algorithm that uses a simple actor-critic network. I am attaching my code for the network and PPO here.

ActorCritic

class ActorCritic(nn.Module):

    def __init__(self):
        super(ActorCritic, self).__init__()

        self.fc1 = nn.Linear(33, 128)
        self.fc2 = nn.Linear(128, 128)

        self.critic = nn.Linear(128, 1)
        self.actor = nn.Linear(128, 3)
        self.apply(init_weights)

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))

        return self.critic(x), F.softmax(self.actor(x), dim=-1)

PPO

class PPO():
    
    def __init__(
        self,
        env,
        policy,
        lr,
        gamma,
        betas,
        gae_lambda,
        eps_clip,
        entropy_coef,
        value_coef,
        max_grad_norm,
        timesteps_per_batch,
        n_updates_per_itr,
        summary_writer,
        norm_obs = True):

        self.policy = policy
        self.env = env
        self.lr = lr
        self.gamma = gamma
        self.betas = betas
        self.gae_lambda = gae_lambda
        self.eps_clip = eps_clip
        self.entropy_coef = entropy_coef
        self.value_coef = value_coef
        self.max_grad_norm = max_grad_norm
        self.timesteps_per_batch = timesteps_per_batch
        self.n_updates_per_itr = n_updates_per_itr
        self.summary_writer = summary_writer
        self.norm_obs = norm_obs
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.total_updates = 0

        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=self.lr, betas=self.betas)

    def learn(self, total_timesteps=1000000, callback=None):
        
        timesteps = 0
        while timesteps < total_timesteps:
            batch_obs, batch_actions, batch_log_probs, batch_rtgs, batch_advantages, batch_lens = self.rollout()

            timesteps += np.sum(batch_lens)
            
            advantage_k = (batch_advantages - batch_advantages.mean()) / (batch_advantages.std() + 1e-10)

            for i in range(self.n_updates_per_itr):
                state_values, action_probs = self.policy(batch_obs)
                state_values = state_values.squeeze()

                dist = Categorical(action_probs)
                curr_log_probs = dist.log_prob(batch_actions)

                ratios = torch.exp(curr_log_probs - batch_log_probs)

                surr1 = ratios * advantage_k
                surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantage_k

                policy_loss = (-torch.min(surr1, surr2)).mean()
                value_loss = F.mse_loss(state_values, batch_rtgs)
                total_loss = policy_loss + self.value_coef * value_loss
                self.optimizer.zero_grad()
                total_loss.backward()
                self.optimizer.step()

                self.total_updates += 1
                self.summary_writer.add_scalar("policy_loss", policy_loss, self.total_updates)
                self.summary_writer.add_scalar("value_loss", value_loss, self.total_updates)
                self.summary_writer.add_scalar("total_loss", total_loss, self.total_updates)


            if callback:
                callback.eval_policy(self.policy, self.summary_writer, self.norm_obs)
        

    def rollout(self):
        batch_obs = []
        batch_acts = []
        batch_state_values = []
        batch_log_probs = []
        batch_rewards = []
        batch_rtgs = []
        batch_lens = []
        batch_advantages = []
        batch_terminals = []

        timesteps_collected = 0
        while timesteps_collected < self.timesteps_per_batch:
            eps_rewards = []
            eps_state_values = []
            eps_terminals = []
            obs = self.env.reset()

            done = False
            eps_timesteps = 0
            for _ in range(50):
                timesteps_collected += 1
                if self.norm_obs:
                    obs = (obs - obs.mean()) / (obs.std() - 1e-10)
                batch_obs.append(obs)

                state_value, action_probs = self.policy(torch.from_numpy(obs).type(torch.float).to(self.device))

                dist = Categorical(action_probs)
                action = dist.sample()
                act_log_prob = dist.log_prob(action)

                obs, reward, done, _ = self.env.step(action.cpu().detach().item())
                
                eps_rewards.append(reward)
                eps_state_values.append(state_value.squeeze().cpu().detach().item())
                eps_terminals.append(0 if done else 1)

                batch_acts.append(action.cpu().detach().item())
                batch_log_probs.append(act_log_prob.cpu().detach().item())
                
                eps_timesteps += 1
                if done:
                    break
            batch_lens.append(eps_timesteps)
            batch_rewards.append(eps_rewards)
            batch_state_values.append(eps_state_values)
            batch_terminals.append(eps_terminals)

        batch_obs = torch.tensor(batch_obs, dtype=torch.float).to(self.device)
        batch_acts = torch.tensor(batch_acts, dtype=torch.float).to(self.device)
        batch_log_probs = torch.tensor(batch_log_probs, dtype=torch.float).flatten().to(self.device)
        
        for eps_rewards, eps_state_values, eps_terminals in zip(reversed(batch_rewards), reversed(batch_state_values), reversed(batch_terminals)):
            discounted_reward = 0
            gae = 0
            next_state_value = 0
            next_terminal = 0
            for reward, state_value, terminal in zip(reversed(eps_rewards), reversed(eps_state_values), reversed(eps_terminals)):
                discounted_reward = reward + self.gamma * discounted_reward
                delta = reward + self.gamma * next_state_value * next_terminal - state_value
                gae = delta + self.gamma * self.gae_lambda * next_terminal * gae
                batch_rtgs.insert(0, discounted_reward)
                batch_advantages.insert(0, gae)
                next_state_value = state_value
                next_terminal = terminal

        batch_rtgs = torch.tensor(batch_rtgs, dtype=torch.float).to(self.device)
        batch_advantages = torch.tensor(batch_advantages, dtype=torch.float).to(self.device)
        return batch_obs, batch_acts, batch_log_probs, batch_rtgs, batch_advantages, batch_lens

I am using the Generalised Advantage Estimate in my case, but even when using the simpler advantage function, R - V(s), my implementation still gets stuck and always chooses the same action when I evaluate. This is how I evaluate the policy deterministically (I am not sampling from a categorical distribution):

_, action_probs = policy(torch.from_numpy(obs).type(torch.float).to(device))
action = torch.argmax(action_probs).item()

Can you provide any pointers as to where the problem might be? I have used stable-baselines3 with the same environment implementation; however, because I want more control over the model, I opted for a custom implementation, and I can't figure out where the problem is.

Using for custom environment with different actions

Hi,
I use your code in different gym environments, and I get good results. I am trying to control a robot navigating a dynamic environment by changing its linear and angular velocities. The robot's interaction time with the environment for a given action is also not fixed (that is, the length of the time step varies as well, and is selected from the options 0.2, 0.4, 0.6, and 0.8).

Since the action space the agent must learn has three parts, how do I use log_prob to calculate the ratio in my problem?

  1. Robot linear velocity in the range [-3, 3], from a tanh activation on the actor
  2. Robot angular velocity in the range [-pi/12, pi/12], from a tanh activation on the actor
  3. Robot time_step length from the set [0.2, 0.4, 0.6, 0.8], using a softmax on the actor

You use MultivariateNormal, which gives the probability of selecting all actions together, but my action distributions are Normal with different means.
If I use a MultivariateNormal for the velocities, how do I add the categorical probability for my step_time?

Dummy way:

Can I use this method?
I get 3 outputs from the actor network in the range [-1, 1], then map output[0] to [-3, 3], output[1] to [-pi/12, pi/12], and output[2] to [0, 1]; output[0] is for linear velocity, output[1] for angular velocity, and output[2] for time. I then manually snap the mean for time to one of [0.2, 0.4, 0.6, 0.8] with an if condition, and use a MultivariateNormal with different stds: for the first two actions the std you use, and for time a very small std like 1e-17, and then take their probabilities?
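One common way to handle a mixed action space like this (a general sketch, not necessarily what the repo author would recommend) is to keep a separate distribution per action component and sum the log-probabilities, since the joint log-probability of independent components is the sum of the individual ones; that joint log-probability is what goes into the PPO ratio. All module and variable names below are illustrative:

import torch
import torch.nn as nn
from torch.distributions import Normal, Categorical

class MixedActor(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.vel_mean = nn.Linear(64, 2)        # means for linear and angular velocity
        self.time_logits = nn.Linear(64, 4)     # logits over time steps [0.2, 0.4, 0.6, 0.8]
        self.log_std = nn.Parameter(torch.zeros(2))

    def forward(self, obs):
        h = self.body(obs)
        return self.vel_mean(h), self.time_logits(h)

actor = MixedActor(obs_dim=8)                   # obs_dim is illustrative
obs = torch.randn(8)
vel_mean, time_logits = actor(obs)

vel_dist = Normal(vel_mean, actor.log_std.exp())
time_dist = Categorical(logits=time_logits)
vel_action = vel_dist.sample()                  # still needs scaling/clipping to your ranges
time_action = time_dist.sample()                # index into [0.2, 0.4, 0.6, 0.8]

# Joint log-prob of the full action: sum over the independent components.
log_prob = vel_dist.log_prob(vel_action).sum() + time_dist.log_prob(time_action)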

covariance matrix

I don't understand why you chose a fixed covariance matrix. Shouldn't the covariance matrix be learned by the actor network?
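For reference, a common alternative (a sketch of the general technique, not the repo's code) is to make the log standard deviation a learnable, state-independent parameter of the actor, so the covariance is optimized along with the rest of the network:

import torch
import torch.nn as nn
from torch.distributions import MultivariateNormal

class GaussianActor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        # Learnable log standard deviation (state-independent), trained by the
        # same optimizer as the rest of the actor's parameters.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def distribution(self, obs):
        mean = self.mean_net(obs)
        cov = torch.diag(self.log_std.exp() ** 2)
        return MultivariateNormal(mean, covariance_matrix=cov)

actor = GaussianActor(obs_dim=3, act_dim=1)     # dimensions are illustrative
dist = actor.distribution(torch.randn(3))
action = dist.sample()
log_prob = dist.log_prob(action)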

How to fix: Broken with latest gym pip package

The env.step return values changed, so the rollout code now needs to look like this to get it going:

        # Number of timesteps run so far this batch
        t = 0 
        while t < self.timesteps_per_batch:
            # Rewards this episode
            ep_rews = []

            obs = self.env.reset()
            if isinstance(obs, tuple):
                obs = obs[0]  # Assuming the first element of the tuple is the relevant data

            terminated = False
            for ep_t in range(self.max_timesteps_per_episode):
                # Increment timesteps ran this batch so far
                t += 1
                # Collect observation
                batch_obs.append(obs)
                action, log_prob = self.get_action(obs)

                obs, rew, terminated, truncated, _ = self.env.step(action)
                if isinstance(obs, tuple):
                    obs = obs[0]  # Assuming the first element of the tuple is the relevant data

                # Collect reward, action, and log prob
                ep_rews.append(rew)
                batch_acts.append(action)
                batch_log_probs.append(log_prob)

                # End the episode early if the env signals termination or truncation
                if terminated or truncated:
                    break

Note that you now have to check both the terminated and truncated return values. The latest documentation is here: https://www.gymlibrary.dev/api/core/

Without this, if you follow along with the blog post, it will fail at the end of Blog 3 at this step:

import gym
env = gym.make('Pendulum-v1')
model = PPO(env)
model.learn(10000)

Also, you need to update Pendulum-v0 to Pendulum-v1.

The Average Episodic Return and Average Loss are nan

Hi Eric,

I am a beginner with PPO, and I tried your code with the _log_summary() function implemented, using the following main block.

if __name__ == '__main__':
	hyperparameters = {
				'timesteps_per_batch': 2048, 
				'max_timesteps_per_episode': 200, 
				'gamma': 0.99, 
				'n_updates_per_iteration': 10,
				'lr': 3e-4, 
				'clip': 0.2
			  }
	env = gym.make('Pendulum-v0')
	model = PPO(env=env, **hyperparameters)
	print(f'Model information ====== {model}')
	model.learn(10000)

The output keeps giving nan values for the Episodic Return and Average Loss. For example:

-------------------- Iteration #50 --------------------
Average Episodic Length: 1.0
Average Episodic Return: nan
Average Loss: nan
Timesteps So Far: 50

Can you kindly help with this? The policy network is FeedForwardNN.
