sweetice / deep-reinforcement-learning-with-pytorch

PyTorch implementation of DQN, AC, ACER, A2C, A3C, PG, DDPG, TRPO, PPO, SAC, TD3 and ....

License: MIT License

policy-gradient pytorch actor-critic-algorithm alphago deep-reinforcement-learning a2c dqn sarsa ppo a3c

deep-reinforcement-learning-with-pytorch's Introduction

Status: Active (under active development, breaking changes may occur)

This repository implements both classic and state-of-the-art deep reinforcement learning algorithms. Its aim is to provide clear PyTorch code for people learning deep reinforcement learning algorithms.

In the future, more state-of-the-art algorithms will be added, and the existing code will continue to be maintained.

[demo animation]

Requirements

  • python <=3.6
  • tensorboardX
  • gym >= 0.10
  • pytorch >= 0.4

Note that TensorFlow 1.12 (installed below alongside tensorboardX) does not support Python 3.7, hence the Python <= 3.6 requirement.

Installation

pip install -r requirements.txt

If that fails, install the dependencies individually:

  • Install gym
pip install gym
  • Install PyTorch
Please follow the instructions on the official website: https://pytorch.org/

We recommend using an Anaconda virtual environment to manage your packages.
  • Install tensorboardX
pip install tensorboardX
pip install tensorflow==1.12
  • Test
cd Char10\ TD3/
python TD3_BipedalWalker-v2.py --mode test

If the installation succeeded, you should see a BipedalWalker being rendered.

BipedalWalker:

    1. Install OpenAI Baselines (optional)
# clone the openai baselines
git clone https://github.com/openai/baselines.git
cd baselines
pip install -e .

DQN

Here I uploaded two DQN models trained on CartPole-v0 and MountainCar-v0.

Tips for MountainCar-v0

This is a sparse, binary-reward task: there is a non-zero reward only when the car reaches the top of the mountain. With a stochastic policy it may take on the order of 1e5 steps before that happens. You can add a shaping term to the reward, for example one that is positively related to the car's current position, as sketched below. A more advanced approach is inverse reinforcement learning.
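
As an illustration of that reward-shaping idea, here is a minimal sketch (not part of this repository) that adds a bonus proportional to the car's position via a gym.RewardWrapper; the access to env.unwrapped.state is an assumption about the classic MountainCarEnv internals.

import gym

class ShapedMountainCar(gym.RewardWrapper):
    """Reward-shaping sketch: add a small bonus that grows with the car's position."""
    def __init__(self, env, scale=0.1):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        # MountainCarEnv stores (position, velocity); position lies in [-1.2, 0.6].
        position = self.env.unwrapped.state[0]
        return reward + self.scale * (position + 1.2)

env = ShapedMountainCar(gym.make("MountainCar-v0"))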

value_loss vs. step (figure): This is the value loss for DQN. We can see that the loss increases to about 1e13, yet the network still works well. This is because target_net and act_net become very different as training goes on, so the computed loss grows large. The earlier loss was small because the reward was very sparse, which led to only small updates of the two networks.

Papers Related to the DQN

  1. Playing Atari with Deep Reinforcement Learning [arxiv] [code]
  2. Deep Reinforcement Learning with Double Q-learning [arxiv] [code]
  3. Dueling Network Architectures for Deep Reinforcement Learning [arxiv] [code]
  4. Prioritized Experience Replay [arxiv] [code]
  5. Noisy Networks for Exploration [arxiv] [code]
  6. A Distributional Perspective on Reinforcement Learning [arxiv] [code]
  7. Rainbow: Combining Improvements in Deep Reinforcement Learning [arxiv] [code]
  8. Distributional Reinforcement Learning with Quantile Regression [arxiv] [code]
  9. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation [arxiv] [code]
  10. Neural Episodic Control [arxiv] [code]

Policy Gradient

Use the following command to run a saved model

python Run_Model.py

Use the following command to train model

python pytorch_MountainCar-v0.py

policyNet.pkl

This is a model that I have trained.

Actor-Critic

Actor-Critic is an algorithmic framework; the classic REINFORCE method is also stored under the Actor-Critic directory.

DDPG

Episode reward in Pendulum-v0:

[episode reward curve]

PPO

A2C

Advantage Policy Gradient: a 2017 paper pointed out that the difference in performance between A2C and A3C is not significant.

The Asynchronous Advantage Actor Critic method (A3C) has been very influential since the paper was published. The algorithm combines a few key ideas:

  • An updating scheme that operates on fixed-length segments of experience (say, 20 timesteps) and uses these segments to compute estimators of the returns and advantage function (a minimal sketch of this computation is given after the list).
  • Architectures that share layers between the policy and value function.
  • Asynchronous updates.
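
As a rough illustration of the first point, here is a minimal sketch (assumed names, not code taken from this repository) of how returns and advantages can be computed from one fixed-length segment of experience:

import torch

def compute_returns_and_advantages(rewards, values, last_value, dones, gamma=0.99):
    """rewards, values, dones: tensors of shape [T] for one fixed-length segment."""
    T = rewards.shape[0]
    returns = torch.zeros(T)
    R = last_value                      # bootstrap from the value of the state after the segment
    for t in reversed(range(T)):
        R = rewards[t] + gamma * R * (1.0 - dones[t])
        returns[t] = R
    advantages = returns - values       # advantage estimate A(s_t, a_t) = R_t - V(s_t)
    return returns, advantages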

A3C

Original paper: https://arxiv.org/abs/1602.01783

SAC

Note: this is not the implementation by the authors of the paper.

Episode reward in Pendulum-v0:

[episode reward curve]

TD3

Note: this is not the implementation by the authors of the paper.

Episode reward in Pendulum-v0:

[episode reward curve]

Episode reward in BipedalWalker-v2:
[episode reward curve]

If you want to test your trained model:

python TD3_BipedalWalker-v2.py --mode test

Papers Related to Deep Reinforcement Learning

[01] A Brief Survey of Deep Reinforcement Learning
[02] The Beta Policy for Continuous Control Reinforcement Learning
[03] Playing Atari with Deep Reinforcement Learning
[04] Deep Reinforcement Learning with Double Q-learning
[05] Dueling Network Architectures for Deep Reinforcement Learning
[06] Continuous control with deep reinforcement learning
[07] Continuous Deep Q-Learning with Model-based Acceleration
[08] Asynchronous Methods for Deep Reinforcement Learning
[09] Trust Region Policy Optimization
[10] Proximal Policy Optimization Algorithms
[11] Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation
[12] High-Dimensional Continuous Control Using Generalized Advantage Estimation
[13] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
[14] Addressing Function Approximation Error in Actor-Critic Methods

TO DO

  • DDPG
  • SAC
  • TD3

Best RL courses

deep-reinforcement-learning-with-pytorch's Issues

About PPO

I don't think this code can solve the problem (Pendulum). Another question: why is the reward computed as running_reward * 0.9 + score * 0.1?

Big bug in PPO2

In dist = Normal(mu, sigma), sigma should be a positive value, but the actor_net output can be negative, so action_log_prob = dist.log_prob(action) can be NaN.

Try:

import torch
from torch.distributions import Normal

a = torch.FloatTensor([1]).cuda()   # mean
b = torch.FloatTensor([-1]).cuda()  # negative scale -- invalid for a Normal distribution
dist = Normal(a, b)
action = dist.sample()
action_log_prob = dist.log_prob(action)

print(action.cpu().numpy())
print(action_log_prob.item())  # NaN on older PyTorch; newer versions raise a ValueError at construction
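
One common way to avoid this (a sketch, not necessarily how the repository resolves it) is to treat the network's scale output as unconstrained and map it through softplus so the value passed to Normal is always positive:

import torch
import torch.nn.functional as F
from torch.distributions import Normal

# Hypothetical raw outputs of an actor head (the second value may be negative).
mu = torch.tensor([0.3])
raw_sigma = torch.tensor([-1.0])

sigma = F.softplus(raw_sigma) + 1e-5   # map any real number to a strictly positive scale
dist = Normal(mu, sigma)
action = dist.sample()
print(dist.log_prob(action).item())    # finite, no NaN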

Bugs in PPO

  1. counter

  2. In the line for index in BatchSampler(SubsetRandomSampler(range(self.buffer_capacity), self.batch_size, True)): the parentheses are misplaced: self.batch_size and the drop_last flag are passed to SubsetRandomSampler instead of BatchSampler (see the corrected call below).
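
For reference, torch.utils.data expects the batch size and drop_last flag as arguments of BatchSampler, not SubsetRandomSampler, so the corrected call would presumably look like this (placeholder values for the capacity and batch size):

from torch.utils.data.sampler import BatchSampler, SubsetRandomSampler

buffer_capacity, batch_size = 1000, 32   # placeholder values

# batch_size and drop_last belong to BatchSampler; SubsetRandomSampler only takes the indices
for index in BatchSampler(SubsetRandomSampler(range(buffer_capacity)), batch_size, True):
    pass  # index is a list of batch_size sampled positions in the buffer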

Cannot solve the Pendulum problem with the PPO implementation in Chapter 07

I cannot solve the continuous control problem of Pendulum with your implementation in Chapter 07, i.e., PPO.

When the program finally exits, the problem is still not solved. Could you please verify it and tell me how to reproduce your solution? Thanks.
[screenshot]

A problem in Chapter 5: DDPG

I don't know whether it is because of my device or the program, but Pendulum-v0 just doesn't work well on my machine.
The pendulum only moves in one circle and then, for some reason, it simply stops moving. I have tried multiple things and haven't figured out why. Could you tell me what is going wrong?

A small suggestion

No conflict of interest here.
First of all, thank you very much for the code you have collected.
The code quality is high, the implementations are classic, and they are easy to understand.
However, I hope you could attach links to the original source code, e.g.:
actor-critic
A2C

confused about the calculation of R in PPO

Hello, I am confused about the calculation of R in PPO. In the file PPO_CartPole_v0.py you calculate R in the update function, but I think the rewards in the buffer may come from two different trajectories.

About updating.

Thank you for publishing your A2C code.
In the update block you use torch's detach method, which seems to me the same as computing the advantage under no_grad, as my code does.
But my code doesn't learn at all. Is my understanding wrong?
Thanks.
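
For context on the question above, here is a minimal illustration (toy values, not this repository's code) of the two ways to keep gradients from flowing through the advantage; they produce the same gradient-free tensor:

import torch

returns = torch.tensor([1.0, 2.0, 3.0])
values = torch.tensor([0.5, 1.5, 2.5], requires_grad=True)

# (a) detach the advantage so the policy term receives no gradient through it
advantage_a = (returns - values).detach()

# (b) compute it under no_grad -- numerically the same tensor, also gradient-free
with torch.no_grad():
    advantage_b = returns - values

assert torch.allclose(advantage_a, advantage_b)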

Confused about different action_sample way in SAC

I notice that in SAC, in the function select_action(), sample() is simply used to randomly draw an action,
but in the function evaluate(), the code is written as batch_mu + batch_sigma*z.

Why not just use sample() as in the first case? Is there any important difference?
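
For context, the batch_mu + batch_sigma * z form is the reparameterization trick: unlike sample(), it keeps the drawn action differentiable with respect to mu and sigma, which the actor's gradient update needs. A minimal illustration (toy tensors, not this repository's code):

import torch
from torch.distributions import Normal

batch_mu = torch.zeros(4, 1, requires_grad=True)
batch_sigma = torch.ones(4, 1, requires_grad=True)

# Plain sampling: the result carries no gradient w.r.t. mu and sigma.
a1 = Normal(batch_mu, batch_sigma).sample()

# Reparameterized sampling (mu + sigma * z with z ~ N(0, 1)):
# gradients can flow back into mu and sigma.
z = Normal(torch.zeros_like(batch_mu), torch.ones_like(batch_sigma)).sample()
a2 = batch_mu + batch_sigma * z

print(a1.requires_grad, a2.requires_grad)  # False True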

wrong code in SAC

If a NotImplementedError is raised, just rename the methods of the NormalizedActions class: change _action to action and _reverse_action to reverse_action.

SAC Bugs

In SAC.py and SAC_BipedalWalker-v2.py, the code:

class NormalizedActions(gym.ActionWrapper):
    def _action(self, action):
        low = self.action_space.low
        high = self.action_space.high

        action = low + (action + 1.0) * 0.5 * (high - low)
        action = np.clip(action, low, high)

        return action

    def _reverse_action(self, action):
        low = self.action_space.low
        high = self.action_space.high

        action = 2 * (action - low) / (high - low) - 1
        action = np.clip(action, low, high)

        return action

now should be changed as follows:

class NormalizedActions(gym.ActionWrapper):
    def action(self, action):
        low = self.action_space.low
        high = self.action_space.high

        action = low + (action + 1.0) * 0.5 * (high - low)
        action = np.clip(action, low, high)

        return action

    def reverse_action(self, action):
        low = self.action_space.low
        high = self.action_space.high

        action = 2 * (action - low) / (high - low) - 1
        action = np.clip(action, low, high)

        return action

in order to adapt to the latest OpenAI Gym core.py

about TRPO

I cannot find TRPO. Is it included in the PPO part? There still seems to be no TRPO implementation.

Char 05 DDPG missing exploration noise

The authors of DDPG add noise to the action (Section 7, Experimental details) in order to explore more options. This feature is missing from the DDPG.py script.
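
For reference, here is a minimal sketch of this kind of exploration noise (Gaussian here for simplicity; the paper itself uses an Ornstein-Uhlenbeck process), with names such as agent.select_action assumed rather than taken from the repository:

import numpy as np

def select_action_with_noise(agent, state, env, noise_std=0.1):
    action = agent.select_action(state)                    # deterministic actor output (numpy array)
    noise = np.random.normal(0, noise_std, size=action.shape)
    # add exploration noise, then clip back into the valid action range
    return np.clip(action + noise, env.action_space.low, env.action_space.high)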

bug in reinforce with baseline

the value-network update should be:

    alpha_w = 1e-3  # initialization

    optimizer_w = optim.Adam(s_value_func.parameters(), lr=alpha_w)
    optimizer_w.zero_grad()
    policy_loss_w = -delta
    policy_loss_w.backward(retain_graph=True)
    clip_grad_norm_(s_value_func.parameters(), 0.1)  # clip the value network's gradients, not the loss
    optimizer_w.step()

Problem about DDPG

What is the role of args.update_iteration? It means the update is repeated two hundred times within each call to learn. Is this inconsistent with the original algorithm?

program error in gridworld.py

In gridworld.py, line 77, self.position = [np.random.randint(tot_row), np.random.randint(tot_col)] should, I think, be changed to self.position = [np.random.randint(self.world_row), np.random.randint(self.world_col)].

SAC_Bug

In SAC.py:

s = torch.tensor([t.s for t in self.replay_buffer]).float().to(device)

Traceback (most recent call last):
  File "D:\PycharmProject\Deep-reinforcement-learning-with-pytorch-master\Char09 SAC\SAC.py", line 307, in <module>
    main()
  File "D:\PycharmProject\Deep-reinforcement-learning-with-pytorch-master\Char09 SAC\SAC.py", line 293, in main
    agent.update()
  File "D:\PycharmProject\Deep-reinforcement-learning-with-pytorch-master\Char09 SAC\SAC.py", line 244, in update
    Q_loss.backward(retain_graph = True)
  File "C:\Users\lx\anaconda3\envs\torch\lib\site-packages\torch\_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\lx\anaconda3\envs\torch\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Double but expected Float
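
A typical cause of this error is a float64 (Double) tensor entering the loss, since NumPy arrays default to float64; a minimal illustration of the mismatch and the cast that avoids it (a generic example, not a verified fix for this file):

import numpy as np
import torch

r = np.array([1.0, 0.5])            # NumPy defaults to float64
t_double = torch.tensor(r)          # dtype is torch.float64 (Double)
t_float = torch.tensor(r).float()   # explicit cast to torch.float32
print(t_double.dtype, t_float.dtype)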

Char 05 DDPG: step index and episode index

for i in range(args.num_iteration):
    state = env.reset()
    for t in range(args.max_episode):

From the code above we can infer that i stands for the i-th step and t stands for the t-th episode.
However, the code
print('Episode {}, The memory size is {} '.format(i, len(agent.replay_buffer.storage)))
shows that i is used for counting episodes.

So do we need to swap the positions of args.max_episode and args.num_iteration?
