
GAC²EIR

Clean and self-contained implementations of Reinforcement Learning agents for solving GAC²E.

  • PPO (Probabilistic)
  • TD3 (Deterministic)
  • SAC (Probabilistic)

Usage

Cleanup:

$ make clean wipe kill

This cleans the model directory, wipes the run directory, and kills all running Spectre instances.

Notes on Algorithms

My personal notes on implementing the algorithms.

Proximal Policy Optimization (PPO)

Proximal Policy Optimization Algorithms

  • Keep track of a small, fixed-length batch of trajectories (s, a, r, d, v, l)
  • Run multiple epochs over each batch
  • Update on batch-sized chunks of the stored memories (see the memory sketch below)
  • The critic only criticises states
  • The actor outputs probabilities for taking an action (probabilistic)
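
A minimal sketch of such a memory, assuming NumPy and the (s, a, r, d, v, l) layout above; the class and field names are illustrative, not taken from the repository:

```python
import numpy as np

class PPOMemory:
    """Fixed-length on-policy memory for (state, action, reward, done, value, log_prob)."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.clear()

    def clear(self):
        self.states, self.actions, self.rewards = [], [], []
        self.dones, self.values, self.log_probs = [], [], []

    def store(self, s, a, r, d, v, l):
        self.states.append(s); self.actions.append(a); self.rewards.append(r)
        self.dones.append(d);  self.values.append(v);  self.log_probs.append(l)

    def batches(self):
        """Shuffle indices and yield batch-sized chunks of the stored memories."""
        idx = np.random.permutation(len(self.states))
        for start in range(0, len(idx), self.batch_size):
            yield idx[start:start + self.batch_size]
```

After every update phase the memory is cleared, since PPO is on-policy.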

Hyper Parameters

  • Memory Size
  • Batch size
  • Number of Epochs

Network Updates

Actor

Conservative Policy Iteration (CPI):

$$ L^{CPI}(\theta) = E_{t} \left[ \frac{\pi_{\theta}(a_{t} | s_{t})}{\pi_{\theta_{old}}(a_{t} | s_{t})} \cdot A_{t} \right] = E_{t} \left[ r_{t}(\theta) \cdot A_{t} \right] $$

Where

  • A: Advantage
  • E: Expectation
  • π: Actor network returning the probability of taking action a in state s at time t
  • θ: Current network parameters

$$ L^{CLIP}(\theta) = E_{t} \left[ \min \left( r_{t}(\theta) \cdot A_{t},\; \text{clip}(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon) \cdot A_{t} \right) \right] $$

Where

  • ε ≈ 0.2

This clipped objective is a pessimistic lower bound on the CPI objective.
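
A minimal PyTorch sketch of this clipped surrogate loss, assuming advantages and the old log-probabilities come from memory; the function and argument names are illustrative:

```python
import torch

def clipped_surrogate(new_log_probs, old_log_probs, advantages, eps=0.2):
    """L^CLIP: pessimistic minimum of the unclipped and clipped ratio terms."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()                 # objective to maximize
```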

Advantage

The advantage expresses the benefit of the new state over the previous state.

$$ A_{t} = \delta_{t} + (\gamma \lambda) \cdot \delta_{t + 1} + ... + (\gamma \lambda)^{T - (t + 1)} \cdot \delta_{T - 1} $$ with

$$ \delta_{t} = r_{t} + \gamma \cdot V(s_{t + 1}) - V(s_{t}) $$

Where

  • V(s_t): Critic output, a.k.a. estimated value (stored in memory)
  • γ ≈ 0.95, discount factor
  • λ: GAE smoothing factor
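
A minimal NumPy sketch of this generalized advantage estimate, assuming episode boundaries are marked by the stored done flags and λ ≈ 0.95; the function name and signature are illustrative:

```python
import numpy as np

def gae_advantages(rewards, values, dones, last_value, gamma=0.95, lam=0.95):
    """A_t = delta_t + (gamma*lam)*delta_{t+1} + ..., resetting at episode ends."""
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        non_terminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * non_terminal - values[t]
        gae = delta + gamma * lam * non_terminal * gae
        advantages[t] = gae
    return advantages
```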

Critic

return = advantage + value

where value is the critic output stored in memory.

$$ L^{VF} = E_{t} \left[ ( \text{return}_{t} - V_{\theta}(s_{t}) )^{2} \right] $$

with V_θ(s_t) being the critic's current output for the sampled state.

Total Loss

$$ L^{CLIP + VF + S}_{t}(\theta) = E_{t} \left[ L^{CLIP}_{t}(\theta) - c_{1} \cdot L^{VF}_{t}(\theta) + c_{2} \cdot S[\pi_{\theta}](s_{t}) \right] $$

Gradient Ascent, not Descent!

  • S: entropy bonus, only used with a shared actor-critic network
  • c1 = 0.5
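
A minimal PyTorch sketch of one PPO update step combining the pieces above, assuming separate actor and critic networks (so the entropy term is dropped) and an actor that returns a torch distribution; all names are illustrative:

```python
import torch
import torch.nn.functional as F

def ppo_step(actor, critic, optimizer, states, actions, old_log_probs,
             advantages, values, eps=0.2, c1=0.5):
    """One gradient step: ascend L^CLIP - c1 * L^VF by descending its negative."""
    dist = actor(states)                                  # e.g. Normal or Categorical
    new_log_probs = dist.log_prob(actions)
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    returns = advantages + values                         # return = advantage + value
    critic_loss = F.mse_loss(critic(states).squeeze(-1), returns)

    loss = actor_loss + c1 * critic_loss                  # gradient ascent via negation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```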

Twin Delayed Deep Deterministic Policy Gradient (TD3)

Addressing Function Approximation Error in Actor-Critic Methods

Hyper Parameters

  • Update Interval
  • Number of Epochs
  • Number of Samples

Loss

$$ \nabla_{\phi} J(\phi) = \frac{1}{N} \sum \nabla_{a} Q_{\theta_{1}}(s, a) \big|_{a = \pi_{\phi}(s)} \cdot \nabla_{\phi} \pi_{\phi}(s) $$

Where

  • π: Policy network with parameters φ
  • Gradient of the first critic w.r.t. the actions chosen by the actor
  • Gradient of the policy network w.r.t. its own parameters

This is the chain rule applied to the loss function; in practice autograd handles it (see the actor update sketch below).

Network Updates

Initialize Target Networks with parameters from online networks.

$$ \theta_{i}' \leftarrow \tau \cdot \theta_{i} + (1 - \tau) \cdot \theta_{i}' $$

$$ \phi' \leftarrow \tau \cdot \phi + (1 - \tau) \cdot \phi' $$

Where

  • τ ≈ 0.005

Soft update with heavy weight on the current target parameters and only a heavily discounted contribution from the online network's parameters.

This happens not on every step, only after an actor update.
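
A minimal PyTorch sketch of this soft (Polyak) update, assuming online and target networks with matching parameter layouts; the function name is illustrative:

```python
import torch

@torch.no_grad()
def soft_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.005):
    """target <- tau * online + (1 - tau) * target, parameter by parameter."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)
```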

Actor

  • Randomly sample transitions (s, a, r, s') from the replay buffer
  • Use the actor to determine actions for the sampled states (don't use the actions from memory)
  • Use the sampled states and the newly found actions to get values from the critic
    • Only the first critic, never the second!
  • Take the gradient w.r.t. the actor network parameters
  • Only every n-th step (a hyperparameter of the algorithm); see the sketch below
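
A minimal PyTorch sketch of this delayed actor update, assuming an actor, a first critic, and a batch of states sampled from the replay buffer; names are illustrative, and autograd applies the chain rule from the loss above:

```python
import torch

def td3_actor_update(actor, critic_1, actor_optimizer, states):
    """Maximize Q_theta1(s, pi_phi(s)) by descending its negated mean."""
    actions = actor(states)                          # fresh actions, not the stored ones
    actor_loss = -critic_1(states, actions).mean()   # only the first critic
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```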

Critic

  • Randomly sample transitions (s, a, r, s') from the replay buffer
  • Run the new states through π'(s'), where π' is the target actor
  • Add noise and clip

$$ \tilde{a} \leftarrow \pi_{\phi'}(s') + \epsilon $$

with

$$ \epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, c) $$

Where

  • σ ≈ 0.2, noise standard deviation
  • c ≈ 0.5, noise clipping
  • γ ≈ 0.99, discount factor

$$ y \leftarrow r + \gamma \cdot \min \left( Q'_{\theta_{1}}(s', \tilde{a}),\; Q'_{\theta_{2}}(s', \tilde{a}) \right) $$

$$ \theta_{i} \leftarrow \text{argmin}_{\theta_{i}} \left( N^{-1} \cdot \sum ( y - Q_{\theta_{i}}(s, a) )^{2} \right) $$
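
A minimal PyTorch sketch of this critic update with target policy smoothing, assuming target networks, a shared critic optimizer, and a replay batch of tensors; all names are illustrative:

```python
import torch
import torch.nn.functional as F

def td3_critic_update(critic_1, critic_2, target_actor, target_critic_1, target_critic_2,
                      critic_optimizer, states, actions, rewards, next_states, dones,
                      gamma=0.99, sigma=0.2, c=0.5):
    """Regress both critics towards y = r + gamma * min(Q'_1, Q'_2)(s', a~)."""
    with torch.no_grad():
        noise = (torch.randn_like(actions) * sigma).clamp(-c, c)
        a_tilde = target_actor(next_states) + noise               # smoothed target action
        q_next = torch.min(target_critic_1(next_states, a_tilde),
                           target_critic_2(next_states, a_tilde)).squeeze(-1)
        y = rewards + gamma * (1.0 - dones) * q_next              # done masks the bootstrap
    loss = F.mse_loss(critic_1(states, actions).squeeze(-1), y) \
         + F.mse_loss(critic_2(states, actions).squeeze(-1), y)
    critic_optimizer.zero_grad()
    loss.backward()
    critic_optimizer.step()
    return loss.item()
```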

Soft Actor Critic (SAC)

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Note: Entropy in this case means something like the randomness of actions, and its influence is modeled by reward scaling.

$$ \log( \pi(a|s) ) = \log( \mu(a|s) ) - \sum_{i=1}^{D} \log \left( 1 - \tanh^{2}(a_{i}) \right) $$

Where

  • μ: Likelihood of the unsquashed (pre-tanh) sample under the Gaussian policy (NOT the mean)
  • π: Probability of selecting this particular action a given state s
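
A minimal PyTorch sketch of this squashed log-probability, assuming a Gaussian policy squashed by tanh; the small constant avoids log(0), and the function name is illustrative:

```python
import torch

def squashed_sample(mean, std, eps=1e-6):
    """Sample a = tanh(u), u ~ N(mean, std), and return log pi(a|s) with the tanh correction."""
    normal = torch.distributions.Normal(mean, std)
    u = normal.rsample()                                   # reparameterized pre-tanh sample
    a = torch.tanh(u)                                      # squashed action
    log_prob = normal.log_prob(u).sum(dim=-1)              # log mu(u|s)
    log_prob -= torch.log(1.0 - a.pow(2) + eps).sum(dim=-1)
    return a, log_prob
```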

Hyper Parameters

  • Target smoothing coefficient
  • Target update interval
  • Replay buffer size
  • Gradient steps

Actor Update

$$ J = N^{-1} \cdot \sum \left( \log( \pi(a_{t} | s_{t}) ) - Q_{min}(s_{t}, a_{t}) \right) $$

Where

  • s_t is sampled from replay buffer / memory
  • a_t is generated with actor network given sampled states
  • Q_min is the minimum of the two critics' outputs
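
A minimal PyTorch sketch of this actor objective, assuming an actor head that returns mean and standard deviation, the squashed_sample helper from above, and two critics; names are illustrative:

```python
import torch

def sac_actor_update(actor, critic_1, critic_2, actor_optimizer, states):
    """Minimize J = mean( log pi(a|s) - min(Q1, Q2)(s, a) ) over a replay batch."""
    mean, std = actor(states)                     # assumed actor output: mean and std
    actions, log_probs = squashed_sample(mean, std)
    q_min = torch.min(critic_1(states, actions),
                      critic_2(states, actions)).squeeze(-1)
    actor_loss = (log_probs - q_min).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```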

Value Update

$$ J = N^{-1} \cdot \sum \frac{1}{2} \cdot \left( V(s_{t}) - \left( Q_{min}(s_{t}, a_{t}) - \log( \pi(a_{t} | s_{t}) ) \right) \right)^{2} $$

Where

  • V(s_t): value network output for the states sampled from memory
  • s_t: sampled states from memory
  • a_t: newly computed actions
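
A minimal PyTorch sketch of this value update, assuming a separate value network alongside the actor and critics above; names are illustrative:

```python
import torch
import torch.nn.functional as F

def sac_value_update(value_net, actor, critic_1, critic_2, value_optimizer, states):
    """Regress V(s_t) towards min(Q1, Q2)(s_t, a_t) - log pi(a_t|s_t) using fresh actions."""
    with torch.no_grad():
        mean, std = actor(states)
        actions, log_probs = squashed_sample(mean, std)
        q_min = torch.min(critic_1(states, actions),
                          critic_2(states, actions)).squeeze(-1)
        target = q_min - log_probs
    value = value_net(states).squeeze(-1)
    value_loss = 0.5 * F.mse_loss(value, target)
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()
    return value_loss.item()
```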

Critic

$$ J_{1} = N^{-1} \sum \frac{1}{2} \cdot \left( Q_{1}(s_{t}, a_{t}) - Q'(s_{t}, a_{t}) \right)^{2} $$

$$ J_{2} = N^{-1} \sum \frac{1}{2} \cdot \left( Q_{2}(s_{t}, a_{t}) - Q'(s_{t}, a_{t}) \right)^{2} $$

$$ Q' = r_{scaled} + \gamma \cdot V'(s_{t + 1}) $$

Where

  • Both critics get updated
  • Both actions and states are sampled from memory
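
A minimal PyTorch sketch of this twin-critic update, assuming scaled rewards, a target value network, and a replay batch of tensors; names are illustrative:

```python
import torch
import torch.nn.functional as F

def sac_critic_update(critic_1, critic_2, target_value_net, critic_optimizer,
                      states, actions, scaled_rewards, next_states, dones, gamma=0.99):
    """Regress both critics towards Q' = r_scaled + gamma * V'(s_{t+1})."""
    with torch.no_grad():
        v_next = target_value_net(next_states).squeeze(-1)
        q_target = scaled_rewards + gamma * (1.0 - dones) * v_next
    q1 = critic_1(states, actions).squeeze(-1)
    q2 = critic_2(states, actions).squeeze(-1)
    critic_loss = 0.5 * F.mse_loss(q1, q_target) + 0.5 * F.mse_loss(q2, q_target)
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()
    return critic_loss.item()
```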

Network Updates

$$ \psi' \leftarrow \tau \cdot \psi + (1 - \tau) \cdot \psi' $$

Where

  • τ ≈ 0.005

Things TODO

  • Implement the replay buffer / memory as an algebraic data type
  • Include the step count in the reward
  • Try discrete action spaces
  • Normalize and/or reduce the observation space
  • Consider the previous reward
  • Return trained models instead of the loss
  • Handle done for parallel environments
  • Give a higher reward for finishing earlier

