
tinyverse's Introduction

Universe RL trainer platform. Simple. Supple. Scalable.

Why should I care?

tinyverse is a reinforcement learning platform for gym/universe/custom environments that lets you utilize any resources you have to train a reinforcement learning algorithm.

Key features

  • Simple: the core is currently under 400 lines, split between code (~50%), comments (~40%) and whitespace (~10%).
  • Supple: tinyverse assumes almost nothing about your agent and environment. The environment does not need to be interruptible. The agent may have any algorithm/structure. Agents [will soon](#14) support any framework, from numpy to pure tensorflow/theano to keras/lasagne+agentnet.
  • Scalable: you can train and play 10 parallel games on your GPU desktop/server, 20 more sessions on your MacBook, and another 5 on your friend's laptop while he isn't looking (and 1000 more games and 10 trainers in the cloud, of course).

The core idea is to have two types of processes:

  • play-er - interacts with the environment, records sessions to the database, and periodically loads new params
  • train-er - reads sessions from the database, trains the agent via experience replay, and sends updated params to the database

Those processes revolve around a database that stores experience sessions and weights. The database is currently implemented with Redis since it is simple to set up and swift with key-value operations. You can, however, implement the database interface with whatever database you prefer.
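
A minimal sketch of what such a database interface might look like, assuming Redis and purely illustrative method names (the real tinyverse code may differ):

```python
import pickle

import numpy as np
import redis


class Database(object):
    """Sketch of the experience/weights storage, assuming Redis.
    Method names here are illustrative, not the actual tinyverse API."""

    def __init__(self, host="localhost", port=6379, password=None):
        self.redis = redis.Redis(host=host, port=port, password=password)

    def record_session(self, session, queue="sessions", max_size=100000):
        """Player side: append one pickled session to a capped list of recent sessions."""
        self.redis.lpush(queue, pickle.dumps(session, protocol=2))
        self.redis.ltrim(queue, 0, max_size - 1)  # keep only the most recent sessions

    def sample_sessions(self, n_sessions, queue="sessions"):
        """Trainer side: draw random recent sessions for experience replay."""
        size = self.redis.llen(queue)
        indices = np.random.randint(0, size, n_sessions)
        return [pickle.loads(self.redis.lindex(queue, int(i))) for i in indices]

    def save_all_params(self, params, key="weights"):
        """Trainer side: publish the latest parameters (a list of numpy arrays)."""
        self.redis.set(key, pickle.dumps(params, protocol=2))

    def load_all_params(self, key="weights"):
        """Player side: fetch the latest parameters, or None if nothing is stored yet."""
        raw = self.redis.get(key)
        return pickle.loads(raw) if raw is not None else None
```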

Quickstart

  1. Install redis server
  • (Ubuntu) sudo apt-get install redis-server
  • Mac OS version HERE.
  • Otherwise search "Install redis your_OS" or ask on gitter.
  • If you want to run on multiple machines, configure redis-server to listen on 0.0.0.0 (and consider setting a password).
  2. Install python packages
  • gym and universe
    • pip install gym[atari]
    • pip install universe - most likely needs dependencies, see urls above.
  • install bleeding-edge theano, lasagne and agentnet for the agentnet examples to work.
    • Preferably set up theano to use floatX=float32 in .theanorc
  • pip install joblib redis prefetch_generator six
  • examples require opencv: conda install -y -c https://conda.binstar.org/menpo opencv3
  3. Spawn several player processes. Each process simply interacts and saves results. -b stands for batch size.

```bash
for i in `seq 1 10`;
do
        python tinyverse atari.py play -b 3 &
done
```

  4. Spawn the trainer process (the demo below runs on GPU; change to CPU if you have to).

```bash
THEANO_FLAGS=device=gpu python tinyverse atari.py train -b 10 &
```

  5. Evaluate results at any time (records video to ./records).

```bash
python tinyverse atari.py eval -n 5
```

Devs: see workbench.ipynb

tinyverse's People

Contributors: justheuristic, kuzmichevdima

tinyverse's Issues

Database

We need some sort of storage for

  • game sessions
  • actual network and target network params
  • whatever metadata we might want to store as well

For an ugly version 1, I used a minimalistic mongoDB wrapper with numpy array support.
After we assemble the crude prototype, we may want to switch to some other DB that's more suitable.
Please tell us if you know which DB would fit here.
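
Whichever database we pick, the core requirement is small: turn numpy arrays (weights, observations) into bytes and back. A hedged sketch of that piece, independent of the storage backend:

```python
import io

import numpy as np


def dumps_array(arr):
    """Serialize a numpy array to raw bytes suitable for any key-value store."""
    buffer = io.BytesIO()
    np.save(buffer, arr)
    return buffer.getvalue()


def loads_array(blob):
    """Inverse of dumps_array: bytes back to a numpy array."""
    return np.load(io.BytesIO(blob))
```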

Framework agnosticism

Right now tinyverse is linked to agentnet in two spots:

  • wants agentnet agent here
  • takes params from agentnet agent here

It now takes three steps to make it backend-agnostic:

  • replace agent reference with some Experiment.get_all_params
  • replace agent link with experiment link in database
  • remove any theano imports from core
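
A hedged sketch of what such a backend-agnostic interface could look like; the class and method names below are hypothetical, chosen so that the core only ever sees numpy arrays:

```python
class Experiment(object):
    """Hypothetical backend-agnostic interface: the core talks to this class only,
    so theano/tensorflow/numpy agents can all plug in underneath."""

    def get_all_params(self):
        """Return agent parameters as a list of numpy arrays."""
        raise NotImplementedError

    def set_all_params(self, param_values):
        """Load a list of numpy arrays back into the agent."""
        raise NotImplementedError

    def play(self, env, n_steps):
        """Interact with env for n_steps and return one session (observations, actions, rewards)."""
        raise NotImplementedError

    def train_step(self, sessions):
        """Run one experience-replay update on a batch of sessions."""
        raise NotImplementedError
```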

Sanity check

To make sure that we didn't mess up at this early stage, let's try training this thing on ANY RL problem - even some classic control task will do.

  • use cartpole/mountaincar/lunarlander/whatever - the faster it trains, the better
  • the goal is to replicate the single-process baseline (and hopefully not get much slower)
  • gym and universe have the same environment interface, so we won't need to refactor for universe
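
For reference, the single-process baseline being matched is just the usual gym loop; a rough sketch assuming the classic gym API (4-tuple step), with a random policy standing in for the agent:

```python
import gym

env = gym.make("CartPole-v0")
for episode in range(100):
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()  # placeholder for agent.step(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print("episode %i: reward %.1f" % (episode, total_reward))
```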

Off-policy learning

Right now we only use on-policy algorithms (in fact, just one: advantage actor-critic).

This restricts the algorithm in at least two ways:

  • we can't reuse old sessions since they represent an older policy (so the experience replay pool is super small)
  • we depend on fast sync between learner and player (else the player plays an older policy)

It would probably be a good alternative to have a classic value-based off-policy agent like DQN, but with all the improvements there are. E.g. it would be nice to use optimality tightening (https://arxiv.org/abs/1611.01606) or intrinsic motivation.

It would also be great to use some LARGE replay buffer, because now we can. Such an algorithm could even work well with a minimal number of player processes, since it can just store the last 100500 games for training.

Start with go9x9, then make sure it generalizes.
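
The replay buffer itself is simple; a hedged in-memory sketch (the capped Redis list from the database sketch above would serve the same role across processes):

```python
import random
from collections import deque


class ReplayBuffer(object):
    """Sketch of a large uniform replay buffer for an off-policy learner (e.g. DQN)."""

    def __init__(self, max_size=100500):
        self.storage = deque(maxlen=max_size)  # oldest sessions are dropped automatically

    def add(self, session):
        self.storage.append(session)

    def sample(self, batch_size):
        """Uniformly mix old and new sessions - only valid for off-policy algorithms."""
        return random.sample(list(self.storage), batch_size)
```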

clear mode

The current run script is bare-bones; there is still a lot of auxiliary functionality to add.
One such thing is a 'clear' mode that should have options to

  • kill all local processes registered for this experiment and remove their heartbeats (please check that they are really tinyverse processes)
  • clear sessions
  • clear weights
  • clear everything at once

One simple way to design it is through arguments like

  • python tinyverse clear --all - does everything
  • python tinyverse clear --kill-players - only kills local players, keeps sessions intact
  • python tinyverse clear --del-sessions - only removes sessions, keeps processes alive

The goal is to implement this functionality so that the runner script is still readable afterwards.
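
A hedged sketch of how the command-line side could stay readable; the flag names follow the proposal above, while the helper function and database methods are hypothetical:

```python
import argparse


def build_clear_parser(subparsers):
    """Register a 'clear' subcommand with one flag per cleanup action."""
    parser = subparsers.add_parser("clear", help="clean up processes, sessions and weights")
    parser.add_argument("--all", action="store_true", help="kill players, drop sessions and weights")
    parser.add_argument("--kill-players", action="store_true", help="only kill local player processes")
    parser.add_argument("--del-sessions", action="store_true", help="only remove stored sessions")
    parser.add_argument("--del-weights", action="store_true", help="only remove stored weights")
    return parser


def run_clear(args, database):
    """Dispatch each flag to one small helper so the runner script stays readable."""
    if args.all or args.kill_players:
        kill_local_players(database)   # hypothetical helper: verify heartbeats before killing
    if args.all or args.del_sessions:
        database.clear_sessions()      # hypothetical database method
    if args.all or args.del_weights:
        database.clear_weights()       # hypothetical database method
```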

Player process 1.0

A process that

  • compiles the agent
  • takes the stored dqn params every once in a while
  • if there are no params in the database, puts them there
  • interacts :) (cpu)
  • stores the interactions in the database
  • all interactions are of the same length (e.g. 10 ticks)
  • loads new weights every once in a while
  • does not break everything if restarted at a random point in time
  • does not erase old sessions
  • does not intentionally break down if there are several such processes running in parallel :)

Basically do that but with an agent and all the trouble that comes with it.
Example agent setup for mountaincar.
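
A hedged sketch of that loop, reusing the hypothetical Database/Experiment interfaces from the sketches above:

```python
import time


def run_player(experiment, env, database, n_steps=10, reload_period=30.0):
    """Sketch of a player: interact, record fixed-length sessions, refresh weights."""
    params = database.load_all_params()
    if params is None:
        # no trained weights in the database yet: publish this agent's initial params
        database.save_all_params(experiment.get_all_params())
    else:
        experiment.set_all_params(params)

    last_reload = time.time()
    while True:
        session = experiment.play(env, n_steps)  # every session has the same length (e.g. 10 ticks)
        database.record_session(session)         # append only, never erase old sessions
        if time.time() - last_reload > reload_period:
            params = database.load_all_params()  # pick up whatever the trainer published
            if params is not None:
                experiment.set_all_params(params)
            last_reload = time.time()
```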

Insta-deploy

We may want to think about some one-line deployment kit for the player or learner or database processes to allow quickly adding a lot of CPU (e.g. a classroom-full of desktops :) ).

So far I have only considered docker containers, but are there any better/easier ways to do this?

Also, it would require some minor changes, like changing the database ip from localhost to the actual host.

Learner process 1.0

A process that

  • takes params for the actual [and target] network from the database
  • if they aren't there, initializes them at random
  • samples sessions at random from the database
  • trains the agent (uses GPU)
  • saves updated [and target] params to the database every N iterations (e.g. ~30 sec)
  • replaces the old ones so that the player processes will use the updated params automatically
  • cleans up old sessions
  • every N iterations, deletes all but the last K (e.g. 10^5) sessions
  • does not break down if killed and restarted at a random point in time

So far we can assume that there's only one learner process.
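
A hedged sketch of that loop in the same hypothetical vocabulary as the player sketch above (trim_sessions is an assumed database method):

```python
def run_learner(experiment, database, batch_size=10, save_period=100, keep_last=100000):
    """Sketch of the (single) learner: replay, update, publish, prune."""
    params = database.load_all_params()
    if params is None:
        database.save_all_params(experiment.get_all_params())  # init the store with random weights
    else:
        experiment.set_all_params(params)

    iteration = 0
    while True:
        sessions = database.sample_sessions(batch_size)  # random experience replay batch
        experiment.train_step(sessions)                  # the GPU-heavy part
        iteration += 1
        if iteration % save_period == 0:
            database.save_all_params(experiment.get_all_params())  # players reload these automatically
            database.trim_sessions(keep_last)            # hypothetical: delete all but the last K sessions
```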

Survivors

Please check in if you feel recovered from the new year and ready to rock :)

LSTM/GRU agent

The current A3C feedforward policy only uses a 4-frame history to predict the policy and may fail when the environment remains partially observable despite that window.
It would probably be nice to implement a policy with GRU/LSTM hidden units [instead of the frame window, or along with it].

A good proving ground would be a game with a limited field of view like doom - DefendCenter or HealthGathering - just make sure that the image preprocessing works fine with it.

Bonus kudos for implementing a soft attention mechanism that actually improves results [or at least does no harm and gives clues about what the agent looks at]. link link
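
A hedged, lasagne-only sketch of the recurrent head (shapes and layer sizes are illustrative; wiring it into agentnet's one-step recurrence is not shown here):

```python
from lasagne.layers import InputLayer, DenseLayer, GRULayer

# illustrative shapes: a batch of per-frame feature vectors over time, (batch, time, features)
features_seq = InputLayer((None, None, 256), name="conv features over time")

# GRU memory replaces (or complements) the 4-frame window
memory = GRULayer(features_seq, num_units=256, only_return_final=True, name="gru memory")

# actor-critic heads on top of the recurrent state
policy_logits = DenseLayer(memory, num_units=6, nonlinearity=None, name="policy logits")
state_value = DenseLayer(memory, num_units=1, nonlinearity=None, name="V(s)")
```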

Parameter server support.

Currently the train-er simply saves all params to the database, assuming it is the only one.
This makes running several parallel training processes useless [kind of a bootstrapped DQN at a higher price].

There are, however, techniques that allow parallel updates with periodic synchronization:
https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
These may or may not be usable here.

The goal is to allow such parallelism with minimum lines of code.

  • in this method, handle the coefficient by which to change params on the server (default 1).
  • in this method, add flags:
    • whether to also LOAD params every save_period to synchronize with other trainers
    • a coefficient by which to change params on the server (default 1) - pass it to save_all_params
  • a flag here that allows partially updating params on the server. Default 1. Warn if > 1. If != 1, also make the trainer load weights from the server (previous point).

Also, it may be wise to avoid locks in case someone wants this to work in 100500 processes - or at least measure the time lost to locking and make sure it is small.

It would be super nice if you first created an implementation with maximum readability / minimum lines of code.
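
The core of the partial-update scheme fits in a few lines; a hedged sketch on plain numpy arrays (the coefficient here corresponds to the one discussed above):

```python
def merge_params(server_params, local_params, alpha=1.0):
    """Move server params towards this trainer's params by a factor alpha.

    alpha=1.0 reproduces the current behaviour (plain overwrite);
    alpha<1.0 lets several trainers update the same server softly."""
    return [old + alpha * (new - old) for old, new in zip(server_params, local_params)]
```

The trainer would then load the server params every save_period, merge, and save the result back; keeping that read-merge-write window short (or wrapping it in a Redis WATCH/MULTI transaction) is what keeps the time lost to locking small.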

Wrapping universe

By default, Universe provides low-level keyboard control with a very high-dimensional action space AND a high-dimensional image (plus occasional other stuff) as observations.

While it is a noble quest to tackle the environment as it is, we had better first try learning something in a more RL-friendly setup:

  • smaller-resolution image
  • discrete action space: wrap only the essential actions like "turn car left/right", not "move cursor to that random location"
  • repeat the same action for N (e.g. 4) timesteps to make effective sessions shorter

So, we need to create a wrapper environment that takes an openai universe game and alters its reset() and step() methods to make the env easier for the agent to master.

Here's one game that seems simple enough for this; I suggest sticking to it unless we bump into some insurmountable obstacle.

One example of altering an environment this way can be found here
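
A hedged sketch of such a wrapper in plain gym terms, assuming the classic 4-tuple step API; the class name, action list, resolution and repeat count are placeholders, and the real universe observation format (vectorized, with a 'vision' field) is only hinted at:

```python
import cv2
import gym
import numpy as np


class SimplifiedEnv(gym.Wrapper):
    """Smaller observations, a handful of discrete actions, and action repeat."""

    def __init__(self, env, actions, image_size=(64, 64), action_repeat=4):
        gym.Wrapper.__init__(self, env)
        self.actions = actions            # e.g. a short list of essential key events
        self.image_size = image_size
        self.action_repeat = action_repeat

    def _preprocess(self, observation):
        frame = observation["vision"] if isinstance(observation, dict) else observation
        frame = cv2.resize(frame, self.image_size)
        return frame.astype(np.float32) / 255.0

    def reset(self):
        return self._preprocess(self.env.reset())

    def step(self, action_index):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.action_repeat):  # repeat the chosen action for N timesteps
            observation, reward, done, info = self.env.step(self.actions[action_index])
            total_reward += reward
            if done:
                break
        return self._preprocess(observation), total_reward, done, info
```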

PEP8

All of the code surely needs to be PEP8 formatted.
