Clean and self-contained implementations of Reinforcement Learning agents for solving GAC²E.
- PPO (Probabilistic)
- TD3 (Deterministic)
- SAC (Probabilistic)
Cleanup:
$ make clean wipe kill
This will clean the model directory, wipe the run directory, and kill all spectre instances.
My personal notes on implementing the algorithms.
Proximal Policy Optimization Algorithm
- Keep track of a small, fixed-length batch of trajectories (s, a, r, d, v, l): state, action, reward, done flag, critic value, and action log-probability
- Multiple epochs for each batch
- Batch-sized chunks of memories (see the memory sketch below)
- Critic only criticises states
- Actor outputs probabilities for taking an action (probabilistic)
- Memory Size
- Batch Size
- Number of Epochs
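A minimal sketch of such a memory, assuming numpy and the (s, a, r, d, v, l) layout above (class and method names are illustrative, not the repo's API):

```python
import numpy as np

class PPOMemory:
    """Small, fixed-length batch of trajectories (s, a, r, d, v, l)."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.clear()

    def store(self, state, action, reward, done, value, log_prob):
        # One transition per step; all six fields go into memory.
        for buf, x in zip(self.buffers, (state, action, reward, done, value, log_prob)):
            buf.append(x)

    def batches(self):
        # Shuffle once, then split into batch-sized chunks of memories.
        idx = np.random.permutation(len(self.buffers[0]))
        return [idx[i:i + self.batch_size]
                for i in range(0, len(idx), self.batch_size)]

    def clear(self):
        # Wipe after the epochs over the current batch are done.
        self.buffers = tuple([] for _ in range(6))
```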
Conservative Policy Iteration (CPI):

$$ L^{CPI} (\theta) = E_{t} [ \frac{\pi_{\theta} (a_{t} | s_{t})}{\pi_{\theta_{old}} (a_{t} | s_{t})} \cdot A_{t} ] = E_{t} [ r_{t} (\theta) \cdot A_{t} ] $$

Where
- A: Advantage
- E: Expectation
- π Actor Network returning Probability of an action a for a given state s at a given time t
- θ: Current network parameters; θ_old: parameters before the update (used in the ratio above)
$$ L^{CLIP} (\theta) = E_{t} [ min( r_{t} (\theta) \cdot A_{t}, clip( r_{t} (\theta), 1 - \varepsilon, 1 + \varepsilon ) \cdot A_{t} ) ] $$

Where
- ε ≈ 0.2
Pessimistic lower bound of loss
Gives benefit of new state over previous state
$$ A_{t} = \delta_{t} + (\gamma \lambda) \cdot \delta_{t + 1} + ... + (\gamma \lambda)^{T - (t + 1)} \cdot \delta_{T - 1} $$

with

$$ \delta_{t} = r_{t} + \gamma \cdot V(s_{t + 1}) - V(s_{t}) $$

Where
- V(s_t): Critic output, aka Estimated Value (stored in memory)
- γ ≈ 0.95
- λ: GAE smoothing parameter
return = advantage + value
Where value is critic output stored in memory
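A minimal sketch of this advantage computation, assuming numpy arrays, λ = 0.95, and a bootstrap value appended to `values` (those conventions are my assumptions, not the repo's):

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.95, lam=0.95):
    """values holds T+1 entries V(s_0)..V(s_T); the last is the bootstrap."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), masked at episode ends
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        # accumulate backwards with decay (gamma * lambda)
        last = delta + gamma * lam * (1 - dones[t]) * last
        adv[t] = last
    returns = adv + np.asarray(values[:T])  # return = advantage + value
    return adv, returns
```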
$$ L^{CLIP + VF + S}_{t} (\theta) = E_{t} [ L^{CLIP}_{t} (\theta) - c_{1} \cdot L^{VF}_{t} (\theta) + c_{2} \cdot S[\pi_{\theta}] (s_{t}) ] $$
Gradient Ascent, not Descent!
- S: only used for shared AC Network
- c1 = 0.5
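Putting the pieces together, a hedged PyTorch sketch of this combined objective (the entropy coefficient c2 = 0.01 and all tensor names are my assumptions; the sign is flipped so a standard optimizer's descent performs the ascent above):

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, eps=0.2, c1=0.5, c2=0.01):
    # Probability ratio r_t(theta), computed in log space for stability.
    ratio = (new_log_probs - old_log_probs).exp()
    # Pessimistic lower bound: min of unclipped and clipped surrogate.
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    l_clip = surrogate.mean()
    # Critic (value function) loss against the computed returns.
    l_vf = (returns - values).pow(2).mean()
    # Entropy bonus S, only relevant for a shared actor-critic network.
    return -(l_clip - c1 * l_vf + c2 * entropy.mean())
```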
Addressing Function Approximation Error in Actor-Critic Methods
- Update Interval
- Number of Epochs
- Number of Samples
$$ \nabla_{\phi} J (\phi) = \frac{1}{N} \sum \nabla_{a} Q_{\theta_{1}} (s, a) |_{a = \pi_{\phi} (s)} \cdot \nabla_{\phi} \pi_{\phi} (s) $$
Where
- π: Policy Network with parameters φ
- Gradient of the first critic w.r.t. actions chosen by the actor
- Gradient of the policy network w.r.t. its own parameters
Chain rule applied to loss function
Initialize Target Networks with parameters from online networks.
$$ \theta' \leftarrow \tau \cdot \theta + (1 - \tau) \cdot \theta' $$

Where
- τ ≈ 0.005
Soft update: the current target parameters keep almost all of their weight, while the online network's parameters enter heavily discounted by τ.
Not every step, only after actor update.
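A minimal sketch of this soft (Polyak) update in PyTorch (the module-based signature is my assumption):

```python
import torch

@torch.no_grad()
def soft_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.005):
    # theta' <- tau * theta + (1 - tau) * theta'
    for p, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(1.0 - tau).add_(p, alpha=tau)
```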
- Randomly sample trajectories from replay buffer (s,a,r,s')
- Use actor to determine actions for sampled states (don't use actions from memory)
- Use sampled states and newly found actions to get values from critic
- Only the first critic, never the second!
- Take gradient w.r.t. actor network parameters
- Every n-th step (hyperparameter of the algorithm); see the sketch below
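A hedged sketch of these actor-update steps (network and optimizer names, and the `critic(states, actions)` call signature, are illustrative assumptions):

```python
import torch

def td3_actor_update(actor, critic_1, actor_optim, states):
    # Re-derive actions with the current actor; do NOT use memory actions.
    actions = actor(states)
    # Score them with the FIRST critic only, never the second.
    actor_loss = -critic_1(states, actions).mean()  # negate for ascent
    actor_optim.zero_grad()
    actor_loss.backward()  # chain rule: dQ/da * d pi/d phi
    actor_optim.step()
    return actor_loss.item()
```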
- Randomly sample trajectories from replay buffer (s,a,r,s')
- Run the new states through π'(s'), where π' is the target actor
- Add noise and clip
with

$$ \tilde{a} = \pi' (s') + \epsilon, \quad \epsilon \sim clip( \mathcal{N} (0, \sigma), -c, c ) $$
Where
- σ ≈ 0.2, noise standard deviation
- c ≈ 0.5, noise clipping
- γ ≈ 0.99, discount factor
$$ y \leftarrow r + \gamma \cdot min( Q'_{\theta_{1}} (s', \tilde{a}), Q'_{\theta_{2}} (s', \tilde{a}) ) $$
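A minimal sketch of this target computation in PyTorch, assuming actions live in [-1, 1] (that bound and all names are my assumptions):

```python
import torch

@torch.no_grad()
def td3_target(target_actor, target_critic_1, target_critic_2,
               rewards, next_states, dones,
               sigma=0.2, c=0.5, gamma=0.99):
    # Target policy smoothing: a~ = pi'(s') + clipped Gaussian noise.
    a_tilde = target_actor(next_states)
    noise = torch.clamp(torch.randn_like(a_tilde) * sigma, -c, c)
    a_tilde = torch.clamp(a_tilde + noise, -1.0, 1.0)
    # Clipped double Q: the smaller of the two target critics.
    q1 = target_critic_1(next_states, a_tilde)
    q2 = target_critic_2(next_states, a_tilde)
    return rewards + gamma * (1.0 - dones) * torch.min(q1, q2)
```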
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Note: entropy here roughly means the randomness of actions; in this implementation it is controlled via reward scaling.
$$ log( \pi (a|s) ) = log( \mu (a|s) ) - \sum_{i=1}^{D} log( 1 - tanh^{2} (a_{i}) ) $$
Where
- μ: Sample of a distribution (NOT MEAN)
- π: Probability of selecting this particular action a given state s
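A hedged sketch of this tanh-squashing correction (the `torch.distributions` usage and the ε added for numerical stability are my assumptions):

```python
import torch
from torch.distributions import Normal

def squashed_log_prob(mean, std, eps=1e-6):
    dist = Normal(mean, std)
    u = dist.rsample()       # a SAMPLE of the distribution, NOT the mean
    action = torch.tanh(u)   # squash into (-1, 1)
    # log pi(a|s) = log mu(u|s) - sum_i log(1 - tanh^2(u_i))
    log_prob = dist.log_prob(u).sum(dim=-1)
    log_prob -= torch.log(1.0 - action.pow(2) + eps).sum(dim=-1)
    return action, log_prob
```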
- Target smoothing coefficient
- Target update interval
- Replay buffer size
- Gradient steps
$$ \hat{V} (s_{t}) = Q_{min} (s_{t}, a_{t}) - log( \pi (a_{t} | s_{t}) ) $$

Where
- s_t is sampled from replay buffer / memory
- a_t is generated with actor network given sampled states
- Qmin is minimum of 2 critics
$$ J_{V} = N^{-1} \sum \frac{1}{2} \cdot ( V(s_{t}) - \hat{V} (s_{t}) )^{2} $$

Where
- V(s_t): sampled values from memory
- s_t: sampled states from memory
- a_t: newly computed actions
$$ J_{1} = N^{-1} \sum \frac{1}{2} \cdot ( Q_{1} (s_{t}, a_{t}) - Q'_{1} (s_{t}, a_{t}) )^{2} $$

$$ J_{2} = N^{-1} \sum \frac{1}{2} \cdot ( Q_{2} (s_{t}, a_{t}) - Q'_{2} (s_{t}, a_{t}) )^{2} $$
Where
- Both critics get updated
- Both actions and states are sampled from memory
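A hedged end-to-end sketch of these SAC updates (the value-network variant with reward scaling; the reward_scale value and all network/tensor names are my assumptions):

```python
import torch
import torch.nn.functional as F

def sac_losses(value_net, target_value_net, critic_1, critic_2,
               states, mem_actions, rewards, next_states, dones,
               new_actions, log_probs, reward_scale=2.0, gamma=0.99):
    # Qmin over both critics for the NEWLY computed actions.
    q_min = torch.min(critic_1(states, new_actions),
                      critic_2(states, new_actions))
    # Value loss: V(s_t) regressed onto Qmin - log pi (the entropy term).
    value_loss = 0.5 * F.mse_loss(value_net(states),
                                  (q_min - log_probs).detach())
    # Actor loss: ascend Qmin while keeping actions random.
    actor_loss = (log_probs - q_min).mean()
    # Critic targets use MEMORY actions and the scaled reward.
    with torch.no_grad():
        q_hat = (reward_scale * rewards
                 + gamma * (1.0 - dones) * target_value_net(next_states))
    critic_loss = 0.5 * (F.mse_loss(critic_1(states, mem_actions), q_hat)
                         + F.mse_loss(critic_2(states, mem_actions), q_hat))
    return value_loss, actor_loss, critic_loss
```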
$$ \theta' \leftarrow \tau \cdot \theta + (1 - \tau) \cdot \theta' $$

Where
- τ ≈ 0.005
- Implement replay buffer / memory as algebraic data type
- Include step count in reward
- Try Discrete action spaces
- Normalize and/or reduce observation space
- Consider previous reward
- Return trained models instead of loss
- Handling of `done` for parallel envs: higher reward for finishing earlier