Practical Deep Reinforcement Learning

This is a practical resource that makes it easier to learn about and apply deep reinforcement learning (project page: https://ibrahimsobh.github.io/Practical-DRL/). For practitioners and researchers, it provides practical implementations of reinforcement learning algorithms applied to different environments, enabling easy experimentation and comparison.


Reinforcement Learning (RL) is a machine learning approach for teaching agents how to solve tasks through interaction with an environment. Deep Reinforcement Learning (DRL) refers to the combination of RL with deep learning.

Code for RL Algorithms:

  • Simple RL algorithms implemented from scratch in NumPy, such as Q-Learning, SARSA and REINFORCE, applied to simple grid-world environments.
  • Advanced RL algorithms using Stable Baselines, which extends and improves the OpenAI Baselines.


1- Hello Environments!

Open In Colab

Gym comes with a diverse suite of environments, ranging from classic control and toy text to Atari games and 2D/3D robots.

import gym

env = gym.make('CartPole-v0')  # example environment; the notebook may use a different one
observation = env.reset()
rewards_list = []
for t in range(1000):
    action = env.action_space.sample()  # sample a random action
    env.render()
    observation, reward, done, info = env.step(action)
    rewards_list.append(reward)
    if done:
        break
env.close()
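
Each environment also exposes its observation and action spaces, which is useful to inspect before wiring up an agent. A quick sketch, using CartPole-v0 purely for illustration:

import gym

env = gym.make('CartPole-v0')
print(env.observation_space)      # Box(4,): cart position/velocity, pole angle/velocity
print(env.action_space)           # Discrete(2): push left or push right
print(env.action_space.sample())  # a random valid action
env.close()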


2- Hello RL!

Open In Colab

Some RL methods must wait until the end of an episode to update the value-function estimate. Temporal-Difference (TD) methods, which are more practical, update the value function after every time step. Two main algorithms are implemented:

  • 2.1 SARSA: Updates Q after the (S, A, R, S', A') sequence, where A' is chosen from the ε-greedy policy:
Q[s,a] = Q[s,a] + alpha * (r + gamma * Q[s1,a1] - Q[s,a])
  • 2.2 Q-Learning: Updates Q after (S, A, R, S'), using the maximum over next actions (the greedy policy):
Q[s,a] = Q[s,a] + alpha * (r + gamma * np.max(Q[s1,:]) - Q[s,a])
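
As a concrete illustration, a minimal tabular Q-Learning loop could look like the sketch below; the environment name and hyperparameters are placeholders, not taken from the notebooks:

import gym
import numpy as np

env = gym.make('FrozenLake-v0')  # any small discrete environment
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q[s, :])
        s1, r, done, info = env.step(a)
        # Q-Learning update: bootstrap with the greedy value of the next state
        Q[s, a] = Q[s, a] + alpha * (r + gamma * np.max(Q[s1, :]) - Q[s, a])
        s = s1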

[Figure: Cliff Walking]


Advanced Deep RL:

3- DQN Human-level control through deep reinforcement learning

Open In Colab

A value-based RL algorithm in which a deep neural network is used as a function approximator to estimate the action-value function Q(s, a).

[Figure: LunarLander with DQN]

import gym
from stable_baselines import DQN
from stable_baselines.deepq.policies import MlpPolicy

# DQN with prioritized experience replay and TensorBoard logging
total_timesteps = 150000
env = gym.make('LunarLander-v2')
model = DQN(MlpPolicy, env, verbose=0, prioritized_replay=True, tensorboard_log="./DQN_LunarLander_tensorboard/")
model.learn(total_timesteps=total_timesteps, tb_log_name="DQN_prioreplay")
model.save("dqn_LunarLander_prioreplay")

[Figure: DQN training curves in TensorBoard]

4- REINFORCE

Open In Colab

A policy-based RL algorithm that learns the optimal policy directly (a mapping from states to actions) without estimating a value function. REINFORCE samples a few trajectories with the current policy and uses them to estimate a gradient that increases or decreases action probabilities based on the return.
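
The central quantity is the discounted return of each trajectory, which weights the gradient of the log-probability of every action taken. A minimal sketch of the return computation (illustrative, not the notebook's exact code):

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * r_{t+1} + ... for every step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# the policy-gradient loss then scales each log pi(a_t | s_t) by its (possibly baseline-subtracted) return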

5- PPO Proximal Policy Optimization

Open In Colab

An on-policy algorithm that reuses old trajectories, instead of just throwing them away, by re-weighting them so that they are representative of the new policy, using an approximate re-weighting factor.
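
The re-weighting factor is the probability ratio between the new and old policies, clipped so that updates stay conservative. Schematically (a sketch of the clipped surrogate objective, not Stable Baselines internals):

import numpy as np

def clipped_surrogate(new_logp, old_logp, advantage, clip_range=0.2):
    """PPO objective for one sample: clip the probability ratio around 1."""
    ratio = np.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    return np.minimum(unclipped, clipped)  # this quantity is maximized during training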

[Figure: CartPole]

import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv

# multiprocess environment: 4 copies of CartPole collect experience in parallel
total_timesteps = 100000  # training budget (an assumed value; set as needed)
n_cpu = 4
env = SubprocVecEnv([lambda: gym.make('CartPole-v0') for i in range(n_cpu)])
model = PPO2(MlpPolicy, env, verbose=0, tensorboard_log="./ppo_cartpole_tensorboard/")
model.learn(total_timesteps=total_timesteps, tb_log_name="PPO2_4")
model.save("ppo_cartpole_4")

[Figure: PPO training curves in TensorBoard]

6- A3C and A2C Asynchronous Methods for Deep Reinforcement Learning

Open In Colab

Actor-Critic (AC) methods are a hybrid of value-based and policy-based methods: a Critic measures how good the chosen action is by estimating a value function, while an Actor controls how the agent behaves (the policy). In asynchronous methods, multiple agents run on different threads to explore the state space in parallel and make decorrelated updates to the actor and the critic. In A3C (Asynchronous Advantage Actor-Critic) each agent updates the network on its own, while A2C is the synchronous variant that waits for all agents and then updates the network at once.
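
The Critic's value estimate enters the Actor's update through the advantage, roughly the one-step TD error (a schematic helper, not the library's internals):

def one_step_advantage(reward, value, value_next, done, gamma=0.99):
    """A(s, a) ~ r + gamma * V(s') - V(s); the bootstrap term is dropped at episode end."""
    return reward + gamma * value_next * (1.0 - float(done)) - value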

# assumes `from stable_baselines import A2C` plus the imports, CartPole env
# and total_timesteps from the PPO example above
model = A2C(MlpPolicy, env, verbose=0, tensorboard_log="./a2c_cartpole_tensorboard/")
model.learn(total_timesteps=total_timesteps)

[Figure: A2C results]

7- DDPG Deep Deterministic Policy Gradient

Open In Colab

In DDPG, DQN is adapted to continuous action domains: a deterministic policy (the Actor) outputs the best believed action for any given state, so no argmax over actions is needed.

[Figure: Pendulum]

# assumes gym, numpy (np), DummyVecEnv and total_timesteps from the earlier examples; param noise is disabled here
from stable_baselines import DDPG
from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.ddpg.noise import OrnsteinUhlenbeckActionNoise

env = DummyVecEnv([lambda: gym.make('Pendulum-v0')])
n_actions = env.action_space.shape[-1]
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))
model = DDPG(MlpPolicy, env, verbose=0, param_noise=None, action_noise=action_noise)
model.learn(total_timesteps=total_timesteps)  # the notebook also passes a checkpointing callback here
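
Because the policy is deterministic, acting with the trained model is a single forward pass through the Actor network (a usage sketch with the Stable Baselines predict API, reusing env and model from above):

obs = env.reset()
for _ in range(200):
    action, _states = model.predict(obs)  # deterministic action from the Actor
    obs, reward, done, info = env.step(action)
    env.render()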

8- TD3 Twin Delayed Deep Deterministic Policy Gradients

Open In Colab

TD3 addresses the overestimation of Q-values in DDPG by introducing Clipped Double-Q Learning: TD3 learns two Q-functions instead of one and uses the smaller of the two to form the target.
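
The target value simply bootstraps with the more pessimistic of the two critics (a schematic helper; the next-state Q-values are assumed to come from the target networks):

def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped Double-Q target: use the minimum of the two target critics."""
    return reward + gamma * (1.0 - float(done)) * min(q1_next, q2_next)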

[Figure: TD3 on BipedalWalker]

# assumes gym, numpy (np), DummyVecEnv and total_timesteps from the earlier examples
from stable_baselines import TD3
from stable_baselines.td3.policies import MlpPolicy
from stable_baselines.ddpg.noise import NormalActionNoise

env = gym.make('BipedalWalker-v2')
env = DummyVecEnv([lambda: env])
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
model = TD3(MlpPolicy, env, action_noise=action_noise, verbose=0, tensorboard_log="./td3_BipedalWalker_tensorboard/")
model.learn(total_timesteps=total_timesteps)

[Figure: TD3 training curves in TensorBoard]

9- Behavior Cloning (BC)

Open In Colab

BC uses expert demonstrations (observation-action pairs) and treats imitation as a supervised learning problem: the policy network is first trained to reproduce the expert behavior, and the RL model is then trained further for self-improvement (a sketch using Stable Baselines' pretraining helpers follows the steps below).

Steps:

  • Generate and save expert trajectories (e.g., using a trained DQN agent)
  • Load the expert trajectories
  • Pretrain the RL model in a supervised way
  • Evaluate the pre-trained model
  • Train the RL model further for self-improvement (RL)
  • Evaluate the final RL model
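
With Stable Baselines, the first steps map onto the generate_expert_traj / ExpertDataset / pretrain helpers. A sketch, where the environment, file names and hyperparameters are illustrative:

from stable_baselines import DQN
from stable_baselines.gail import generate_expert_traj, ExpertDataset

# 1) train a DQN agent briefly and record its trajectories
expert_model = DQN('MlpPolicy', 'CartPole-v0', verbose=0)
generate_expert_traj(expert_model, 'expert_cartpole', n_timesteps=100000, n_episodes=10)

# 2) + 3) load the dataset and pretrain a fresh model with supervised learning
dataset = ExpertDataset(expert_path='expert_cartpole.npz', traj_limitation=-1, batch_size=128)
model = DQN('MlpPolicy', 'CartPole-v0', verbose=0)
model.pretrain(dataset, n_epochs=1000)

# 5) continue with regular RL for self-improvement
model.learn(total_timesteps=100000)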

[Figure: Behavior Cloning results]

10- GAIL Generative Adversarial Imitation Learning

Open In Colab

In Generative Adversarial Networks (GANs), two networks learn together:

  • Generator network: tries to fool the discriminator by generating real-looking images
  • Discriminator network: tries to distinguish between real and fake images

GAIL uses a discriminator that tries to separate expert trajectories from trajectories of the learned policy, which plays the role of the generator here. A sketch using Stable Baselines follows the steps below.

Steps:

  • Generate and save expert dataset
  • Load the expert dataset
  • Train GAIL agent and evaluate
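
With Stable Baselines, this maps onto the GAIL class together with the same ExpertDataset helper (a sketch; the expert file and environment are illustrative):

from stable_baselines import GAIL
from stable_baselines.gail import ExpertDataset

# load previously recorded expert trajectories (see the Behavior Cloning steps)
dataset = ExpertDataset(expert_path='expert_cartpole.npz', traj_limitation=-1, verbose=0)
model = GAIL('MlpPolicy', 'CartPole-v0', dataset, verbose=0)
model.learn(total_timesteps=100000)  # the discriminator and the policy (generator) are trained together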
