karpathy / reinforcejs

Reinforcement Learning Agents in Javascript (Dynamic Programming, Temporal Difference, Deep Q-Learning, Stochastic/Deterministic Policy Gradients)


reinforcejs's Introduction

REINFORCEjs

REINFORCEjs is a Reinforcement Learning library that implements several common RL algorithms, all with web demos. In particular, the library currently includes:

  • Dynamic Programming methods
  • (Tabular) Temporal Difference Learning (SARSA/Q-Learning)
  • Deep Q-Learning, i.e. Q-Learning with function approximation using neural networks
  • Stochastic/Deterministic Policy Gradients and Actor Critic architectures for dealing with continuous action spaces. (very alpha, likely buggy or at the very least finicky and inconsistent)

See the main webpage for many more details, documentation and demos.

Code Sketch

The library exports two global variables: R, and RL. The former contains various kinds of utilities for building expression graphs (e.g. LSTMs) and performing automatic backpropagation, and is a fork of my other project recurrentjs. The RL object contains the current implementations:

  • RL.DPAgent for finite state/action spaces with environment dynamics
  • RL.TDAgent for finite state/action spaces
  • RL.DQNAgent for continuous state features but discrete actions

A typical usage might look something like:

// create an environment object
var env = {};
env.getNumStates = function() { return 8; };     // dimensionality of the state feature vector
env.getMaxNumActions = function() { return 4; }; // number of discrete actions

// create the DQN agent
var spec = { alpha: 0.01 }; // see full options on the DQN page
var agent = new RL.DQNAgent(env, spec);

setInterval(function(){ // start the learning loop
  var action = agent.act(s); // s is an array of length 8
  // ... execute action in the environment and observe the reward
  agent.learn(reward); // the agent improves its Q function, policy, model, etc. reward is a float
}, 0);

The full documentation and demos are on the main webpage.

License

MIT.

reinforcejs's People

Contributors

edersantana, gaapt, karpathy


reinforcejs's Issues

How to export and re-use agent brain information?

I made a program which trains the agent. But if I now want to export the trained data and re-import it later, how can I do that?

I want something like this:

var data = agent.exportData(); // this might give me an object of the data for its NN or whatever

Then I just save it to a txt file or database.

Then later, I can import it like this

agent.SetData(data); // imports the data

This way I don't have to re-train it.

Does anyone know how to do something like this?

Thanks
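For reference, the demos' "Load a Pretrained Agent" buttons appear to rely on toJSON()/fromJSON() methods on the agent (at least for DQNAgent); a minimal sketch along those lines, assuming your agent type exposes them:

var json = agent.toJSON();              // plain object holding the learned network weights
var serialized = JSON.stringify(json);  // save this string to a txt file or database

// ... later, after recreating the agent with the same env and spec:
agent.fromJSON(JSON.parse(serialized)); // restore the weights, so no re-training is needed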

act and learn input ranges

  1. Should all state inputs to act be 0<=stateX<1?
  2. Should all reward inputs be 0<=reward<1?
  3. Is there any way to get out "nope, that wasn't a good reply. I want a second opinion!" (second place answer, etc)

Globals and exports

Any reason for deciding to go with globals over checking for the browser and delivering window.RL or module.exports.RL?

Something you'd accept as a PR, if it was done in a way that fits?
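For reference, a common pattern for this kind of change is a small shim that prefers module.exports when a module system is present and falls back to the global otherwise; a hypothetical sketch, not how the library is currently packaged:

var RL = {}; // ... library code attaches the agents here ...
if (typeof module !== 'undefined' && module.exports) {
  module.exports.RL = RL;   // require('reinforcejs').RL under Node or a bundler
} else if (typeof window !== 'undefined') {
  window.RL = RL;           // a plain <script> include keeps the current global behaviour
}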

SARSA not working in GridWorld_td

Using the default agent parameters but setting spec.update to 'sarsa', the model simply does not converge to the optimal solution.

// agent parameter spec to play with (this gets eval()'d on Agent reset)
var spec = {}
spec.update = 'sarsa'; // 'qlearn' or 'sarsa'
spec.gamma = 0.9; // discount factor, [0, 1)
spec.epsilon = 0.2; // initial epsilon for epsilon-greedy policy, [0, 1)
spec.alpha = 0.1; // value function learning rate
spec.lambda = 0.1; // eligibility trace decay, [0,1). 0 = no eligibility traces
spec.replacing_traces = true; // use replacing or accumulating traces
spec.planN = 0; // number of planning steps per iteration. 0 = no planning

spec.smooth_policy_update = true; // non-standard, updates policy smoothly to follow max_a Q
spec.beta = 0.1; // learning rate for smooth policy update


waterworld.js - forward()

Hi,

First of all, thank you very much for sharing great demos!

In waterworld.js, I don't get this part in forward()

forward: function() {
  // in forward pass the agent simply behaves in the environment
  // create input to brain
  // .....
  for(var i=0;i<num_eyes;i++) {
    var e = this.eyes[i];
    input_array[i*5] = 1.0;    // ???
    input_array[i*5+1] = 1.0;  // ???
    input_array[i*5+2] = 1.0;  // ???
    input_array[i*5+3] = e.vx; // velocity information of the sensed target
    input_array[i*5+4] = e.vy;
    if(e.sensed_type !== -1) {
      // sensed_type is 0 for wall, 1 for food and 2 for poison.
      // lets do a 1-of-k encoding into the input array
      input_array[i*5 + e.sensed_type] = e.sensed_proximity/e.max_range; // normalize to [0,1]
    }
  }

I don't understand why the first three inputs are all 1.0. Shouldn't it be the type of sensed object or something?

On the demo page, it says:

The agent has 30 eye sensors pointing in all directions, and in each direction it observes 5 variables: the range, the type of sensed object (green, red), and the velocity of the sensed object. The agent's proprioception includes two additional sensors for its own speed in both x and y directions. This is a total of 152-dimensional state space.
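One hedged reading of the snippet above (an interpretation, not an authoritative answer): the three slots seem to be per-type proximity channels (wall/food/poison), with 1.0 acting as the default "nothing sensed within max_range" value, and the slot for whatever type was actually sensed is then overwritten with the normalized proximity:

// interpretation only: per-eye layout would be
// [wall proximity, food proximity, poison proximity, vx, vy], defaulting to 1.0 (max range)
input_array[i*5 + e.sensed_type] = e.sensed_proximity/e.max_range; // e.g. a nearby wall -> small value in slot 0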

Failed to load file

For the problem waterworld, when I click the button 'Load a Pretrained Agent', it prompts

Failed to load file:///C:/Users/xuxiyang/Desktop/reinforcejs-master/agentzoo/wateragent.json: Cross origin requests are only supported for protocol schemes: http, data, chrome, chrome-extension, https.
jquery-2.1.3.min.js:4

Does anyone know how to resolve this and load the saved data?
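A general workaround for this class of error (not specific to reinforcejs): browsers block XHR requests to file:// URLs, so serve the checkout over HTTP and open the page from localhost instead, for example by running python -m http.server 8000 in the reinforcejs-master directory and browsing to http://localhost:8000; the pretrained-agent JSON should then load.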

Multiple Workers

Is it possible to get this to run with multiple workers?
Is there a paper I can look at that explains how this is done?

Multiple layers of neurons?

How hard would it be to implement this?

I'm trying ReinforceJS the 2048 game here: https://github.com/NullVoxPopuli/doctor-who-thirteen-game-ai/blob/master/worker.js#L105

and I've noticed a couple things:

  • the AI gets to its best score (which is not very high) pretty quickly
  • it seems to have trouble beating its best score
  • achieving the best score is likely a fluke of the random nature of tile spawns

Additionally,

  • I'm not sure how long I should expect training to take
  • is a day too long?

idk :D

Question about strategy

Fantastic library!
I have a ton of questions, most of which likely have answers along the lines of "it depends" :) But, the top questions:

  1. In a basic game with many invalid moves (don't crash into a wall, don't play an invalid move, etc) and only a few valid moves, is it normally better to let the system "work out" the rules? I was considering an alternate of only offering the agent a list of valid moves and having it pick among them, but intuitively it seems more confusing - a small shift in valid moves would "off-by-one" the list, and that would be hard for the agent to learn.
  2. I combined a few examples for the spec. Are any of these "bad" in a general game solver?
    spec.update = 'qlearn'; // 'qlearn' or 'sarsa'
    spec.gamma = 0.9; // discount factor, [0, 1)
    spec.epsilon = 0.2; // initial epsilon for epsilon-greedy policy, [0, 1)
    spec.lambda = 0.8; // eligibility trace decay, [0,1). 0 = no eligibility traces
    spec.replacing_traces = false; // use replacing or accumulating traces
    spec.planN = 50; // number of planning steps per iteration. 0 = no planning
    spec.smooth_policy_update = true; // non-standard, updates policy smoothly to follow max_a Q
    spec.beta = 0.1; // learning rate for smooth policy update
    spec.alpha = 0.005; // value function learning rate
    spec.experience_add_every = 5; // number of time steps before we add another experience to replay memory
    spec.experience_size = 10000; // size of experience
    spec.learning_steps_per_iteration = 5;
    spec.tderror_clamp = 1.0; // for robustness
    spec.num_hidden_units = 100; // number of neurons in hidden layer
  3. In your examples, you have

env.getNumStates = function() {
  return 9;
};
env.getMaxNumActions...

Any particular reason to have it be a function? (relates back to my #1 question)

Using DQN

Hello,

In the example library usage, the environment is created with the number of states and also the maximum number of actions possible.

For example, if I have 5 possible states, each defined by 2 values, and 3 possible actions, also defined by 2 values, what should the relevant variables of the environment object be? Moreover, what should the state array given to the 'act(state)' method be?

I normally have more states and actions; however, I couldn't get the general idea behind it.

P.S. I know that this question is more suitable for StackOverflow; however, I also think that it would be beneficial for other newcomers.
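As a rough illustration for the setup described above (a sketch only, with made-up numbers): getNumStates() is the length of the feature vector passed to act(), and getMaxNumActions() is how many discrete actions act() can return.

// sketch: a state described by 2 values and 3 possible actions (values are illustrative)
var env = {};
env.getNumStates = function() { return 2; };     // act() will be given a length-2 array
env.getMaxNumActions = function() { return 3; }; // act() returns an integer in {0, 1, 2}

var agent = new RL.DQNAgent(env, { alpha: 0.01 });
var action = agent.act([0.4, -1.2]); // the two values describing the current state
// ... apply `action` in your own simulation ...
var reward = 0.5;                    // a float you compute yourself from the outcome
agent.learn(reward);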

hidden_size is most likely undefined on line 460...

model['Whd'] = new RandMat(output_size, hidden_size, 0, 0.08);

On line 460 there is a call to make a new random matrix, and the second argument is hidden_size, which is only actually defined inside the for loop above it, meaning the argument d should resolve to undefined when calling the RandMat function because it is out of scope.

I only caught this because I am porting your rl.js to C++, and while doing a unit test I came across this one. I then realized JavaScript would most likely have let this one slip right under your nose, or anyone's nose, as these RL learners are so good at learning regardless of coding errors.

I hope you update this one day, as it is a 7-year-old repository. I can imagine what you have learned about RL in 7 years and from working at Tesla. It would be amazing to see some of the newer stuff, like the recent DeepMind paper about continuous action spaces and their "Director" agent; this lib could provide even more generalization. Anyway, thanks for your wonderful code and keep up the great work. I like the way you go about things, and while porting this I can imagine that quite possibly you already made this in C++ and were actually porting it to JS, as you refer to some things in comments as structs.

Thanks, and I hope you see this.

GridWorld: TD, Demo Page: Discounted Reward greater than 1.0?

Using the initial settings, how can the discounted reward of the center field be 1.1? The max reward the agent can get is 1.0 and then the goal is reached and the agent is reset.

Also, if changing the field below to R 1.0, I'd expect the discounted reward to be 10 instead of 9.9, and here 50 instead of 49.90 (screenshots omitted).
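One hedged explanation for values above 1.0, assuming the task is treated as continuing so the return keeps accumulating across resets: if the goal reward of 1.0 can be collected again roughly every T steps, the discounted value is a geometric sum that exceeds the one-shot maximum.

// illustrative arithmetic only; gamma matches the default spec, T is a made-up revisit period
var gamma = 0.9, T = 22;              // suppose the goal is re-reached about every 22 steps
var V = 1 / (1 - Math.pow(gamma, T)); // sum over k of gamma^(k*T) * 1.0
console.log(V);                       // ~1.11, i.e. above the single-step maximum of 1.0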

Could throw error when input array mismatches getNumStates

I had neglected to set getNumStates yet nothing complains.

I guess it's an extra check every time you give an input array to act; a comparison of sizes in each call to setFrom would likely be over the top...

You could validate in Agent.forward before passing to DQNAgent.act, but then, if doing it there, why not in act?

Maybe there is a solution in checking the sizes once for the first call to act?
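A minimal sketch of the one-time check suggested above, done as a wrapper around act() rather than a change to the library itself (makeCheckedAct is a hypothetical helper name):

// validate the state length once, on the first call only
function makeCheckedAct(agent, env) {
  var checked = false;
  return function(s) {
    if (!checked) {
      if (s.length !== env.getNumStates()) {
        throw new Error('state length ' + s.length + ' does not match getNumStates() = ' + env.getNumStates());
      }
      checked = true; // later calls skip the comparison
    }
    return agent.act(s);
  };
}
// var act = makeCheckedAct(agent, env); act(stateArray);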

Question about DQN inputs

I am trying to understand the inputs for the example given here
http://cs.stanford.edu/people/karpathy/reinforcejs/index.html

env.getNumStates()
This is the size of the vector that represents the variables of the current game configuration?

env.getMaxNumActions
For this one, is it the total number of configurations the game can have? Or is it the number of actions the player can currently do in the current game configuration, such as in a grid maze, where the player has up to 4 directions to move, so it would be 4?

Inside the "setInterval" function, "s" is not defined. It is the vector of the variables of the current game configuration that I have to get myself?

And "reward" is something I have to calculate too based on the current "s" vector?

Also, why are getNumStates and getMaxNumActions functions, when they seem to return a constant value? Are they supposed to support returning dynamic values? Can the vector size be allowed to be different at any time? And the max number of actions, is that dynamic too?
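To tie these questions back to the README sketch (a hedged reading, not documentation): getNumStates() is the length of the state vector, getMaxNumActions() is the number of discrete actions act() can return, and both s and reward are produced by your own game loop. getState(), applyAction() and computeReward() below are hypothetical functions you would write yourself:

setInterval(function(){
  var s = getState();           // build the length-getNumStates() feature vector from your game
  var action = agent.act(s);    // an integer in [0, getMaxNumActions()-1]
  applyAction(action);          // hypothetical: advance your game by one step
  var reward = computeReward(); // a float you define; higher means better
  agent.learn(reward);
}, 0);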
