rl,ceteke

rl's Issues

Question about Linear Approximation SARSA

When updating w for a non-terminal state, why does the code use this:

w += alpha*(reward - discount*q_hat_next)*q_hat_grad

rather than the TD error (from Sutton's book, 2nd edition, page 244, Episodic Semi-gradient Sarsa), as below?

w += alpha*(reward + discount*q_hat_next - q_hat)*q_hat_grad
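For comparison, here is a minimal sketch of the TD-error form of the update for a linear approximator. It is not the repository's code: the feature vectors x and x_next and the helper names q_hat and sarsa_update are hypothetical, and it assumes q_hat is linear in w, so its gradient with respect to w is the feature vector itself.

```python
import numpy as np

def q_hat(w, x):
    """Linear action-value estimate: the dot product of weights and features."""
    return float(np.dot(w, x))

def sarsa_update(w, x, reward, x_next, alpha, discount, terminal):
    """One episodic semi-gradient SARSA step for a linear q_hat.

    x and x_next are hypothetical feature vectors for (s, a) and (s', a').
    For a linear approximator the gradient of q_hat w.r.t. w is just x.
    """
    q = q_hat(w, x)
    # At a terminal transition there is no next action value to bootstrap from.
    target = reward if terminal else reward + discount * q_hat(w, x_next)
    td_error = target - q
    return w + alpha * td_error * x

# Illustrative values only.
w = np.zeros(3)
x = np.array([1.0, 0.0, 0.5])
x_next = np.array([0.0, 1.0, 0.5])
w_new = sarsa_update(w, x, reward=1.0, x_next=x_next,
                     alpha=0.1, discount=0.9, terminal=False)
print(w_new)
```

The two expressions in the question differ in the sign of the bootstrapped term and in whether the current estimate q_hat is subtracted; the sketch follows the second expression.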

Issue with reproducibility of Linear SARSA notebook

Hello. I am learning reinforcement learning and have been using your repository as a reference for my own implementations. However, when I cloned the repository and ran the linear approximation SARSA notebook without any changes, I could not reproduce your results. Specifically, I got neither of the printouts shown in your notebook (the warning "WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype." and the line "Episodes before solve 99"), and the rewards vs. episodes graph was very different from yours (see below).

Do you have any thoughts on how to fix the issue? Any help is appreciated.

Thank you for your time!
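One common source of run-to-run differences in this kind of notebook is unseeded random number generation, for example in epsilon-greedy exploration. Whether that explains the mismatch reported here is not established. The sketch below only illustrates the general idea with NumPy; the function name, the fixed greedy action, and all parameter values are assumptions made for illustration, and it does not touch the environment's own randomness.

```python
import numpy as np

def epsilon_greedy_actions(seed, n_steps=10, n_actions=2, epsilon=0.5):
    """Draw a sequence of epsilon-greedy choices from a seeded generator.

    The greedy action is fixed at 0 here purely for illustration.
    """
    rng = np.random.default_rng(seed)
    actions = []
    for _ in range(n_steps):
        if rng.random() < epsilon:
            actions.append(int(rng.integers(n_actions)))  # explore
        else:
            actions.append(0)  # exploit (placeholder greedy action)
    return actions

# Two runs with the same seed produce the same action sequence.
run_a = epsilon_greedy_actions(seed=0)
run_b = epsilon_greedy_actions(seed=0)
print(run_a == run_b)
```

The missing warning may also point to a different installed gym version than the one used for the published notebook, so comparing package versions could be a reasonable first check.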
