
tinyverse's Introduction

Universe RL trainer platform. Simple. Supple. Scalable.

Why should I care?

tinyverse is a reinforcement learning platform for gym/universe/custom environments that lets you utilize any resources you have to train a reinforcement learning algorithm.

Key features

  • Simple: the core is currently under 400 lines, split between code (~50%), comments (~40%) and whitespace (~10%).
  • Supple: tinyverse assumes almost nothing about your agent and environment. The environment does not need to be interruptible. The agent may have any algorithm/structure. Agents [will soon](#14) support any framework, from numpy to pure tensorflow/theano to keras/lasagne+agentnet.
  • Scalable: you can train and play 10 parallel games on your GPU desktop/server, 20 more sessions on your MacBook, and another 5 on your friend's laptop while he isn't looking (and 1000 more games and 10 trainers in the cloud, of course).

The core idea is to have two types of processes:

  • play-er - interacts with the environment, records sessions to the database, and periodically loads new params
  • train-er - reads sessions from the database, trains the agent via experience replay, and sends updated params to the database

Those processes revolve around a database that stores experience sessions and weights. The database is currently implemented with Redis since it is simple to set up and swift with key-value operations. You can, however, implement the database interface with whatever database you prefer.
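
A minimal sketch of what such a database interface might look like, assuming Redis and purely illustrative method names (the real tinyverse code may differ):

```python
import pickle

import numpy as np
import redis


class Database(object):
    """Sketch of the experience/weights storage, assuming Redis.
    Method names here are illustrative, not the actual tinyverse API."""

    def __init__(self, host="localhost", port=6379, password=None):
        self.redis = redis.Redis(host=host, port=port, password=password)

    def record_session(self, session, queue="sessions", max_size=100000):
        """Player side: append one pickled session to a capped list of recent sessions."""
        self.redis.lpush(queue, pickle.dumps(session, protocol=2))
        self.redis.ltrim(queue, 0, max_size - 1)  # keep only the most recent sessions

    def sample_sessions(self, n_sessions, queue="sessions"):
        """Trainer side: draw random recent sessions for experience replay."""
        size = self.redis.llen(queue)
        indices = np.random.randint(0, size, n_sessions)
        return [pickle.loads(self.redis.lindex(queue, int(i))) for i in indices]

    def save_all_params(self, params, key="weights"):
        """Trainer side: publish the latest parameters (a list of numpy arrays)."""
        self.redis.set(key, pickle.dumps(params, protocol=2))

    def load_all_params(self, key="weights"):
        """Player side: fetch the latest parameters, or None if nothing is stored yet."""
        raw = self.redis.get(key)
        return pickle.loads(raw) if raw is not None else None
```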

Quickstart

  1. Install redis server
  • (Ubuntu) sudo apt-get install redis-server
  • Mac OS version HERE.
  • Otherwise search "Install redis your_OS" or ask on gitter.
  • If you want to run on multiple machines, configure redis-server to listen on 0.0.0.0 (and consider setting a password).
  2. Install python packages
  • gym and universe
    • pip install gym[atari]
    • pip install universe - most likely needs dependencies, see urls above.
  • install bleeding-edge theano, lasagne and agentnet for the agentnet examples to work.
    • Preferably set up theano to use floatX=float32 in .theanorc
  • pip install joblib redis prefetch_generator six
  • examples require opencv: conda install -y -c https://conda.binstar.org/menpo opencv3
  3. Spawn several player processes. Each process simply interacts and saves results. -b stands for batch size.

```bash
for i in `seq 1 10`;
do
        python tinyverse atari.py play -b 3 &
done
```

  4. Spawn the trainer process (the demo below runs on GPU; change to CPU if you have to).

```bash
THEANO_FLAGS=device=gpu python tinyverse atari.py train -b 10 &
```

  5. Evaluate results at any time (records video to ./records).

```bash
python tinyverse atari.py eval -n 5
```

Devs: see workbench.ipynb

tinyverse's People

Contributors: justheuristic, kuzmichevdima

tinyverse's Issues

Database

We need some sort of storage for

  • game sessions
  • actual network and target network params
  • whatever metadata we might want to store as well

For an ugly version 1, I used a minimalistic mongoDB wrapper with numpy array support.
After we assemble the crude prototype, we may want to switch to some other DB that's more suitable.
Please tell us if you know which DB would fit here.
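
Whichever database we pick, the core requirement is small: turn numpy arrays (weights, observations) into bytes and back. A hedged sketch of that piece, independent of the storage backend:

```python
import io

import numpy as np


def dumps_array(arr):
    """Serialize a numpy array to raw bytes suitable for any key-value store."""
    buffer = io.BytesIO()
    np.save(buffer, arr)
    return buffer.getvalue()


def loads_array(blob):
    """Inverse of dumps_array: bytes back to a numpy array."""
    return np.load(io.BytesIO(blob))
```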

Framework agnosticism

Right now tinyverse is linked to agentnet in two spots:

  • wants agentnet agent here
  • takes params from agentnet agent here

It now takes three steps to make it backend-agnostic:

  • replace agent reference with some Experiment.get_all_params
  • replace agent link with experiment link in database
  • remove any theano imports from core
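
A hedged sketch of what such a backend-agnostic interface could look like; the class and method names below are hypothetical, chosen so that the core only ever sees numpy arrays:

```python
class Experiment(object):
    """Hypothetical backend-agnostic interface: the core talks to this class only,
    so theano/tensorflow/numpy agents can all plug in underneath."""

    def get_all_params(self):
        """Return agent parameters as a list of numpy arrays."""
        raise NotImplementedError

    def set_all_params(self, param_values):
        """Load a list of numpy arrays back into the agent."""
        raise NotImplementedError

    def play(self, env, n_steps):
        """Interact with env for n_steps and return one session (observations, actions, rewards)."""
        raise NotImplementedError

    def train_step(self, sessions):
        """Run one experience-replay update on a batch of sessions."""
        raise NotImplementedError
```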

Sanity check

To make sure that we didn't mess up at this early stage, let's try training this thing on ANY RL problem - even some classic control task will do.

  • use cartpole/mountaincar/lunarlander/whatever - the faster it trains, the better
  • the goal is to replicate the single-process baseline (and hopefully not get much slower)
  • gym and universe have the same environment interface, so we won't need to refactor for universe
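
For reference, the single-process baseline being matched is just the usual gym loop; a rough sketch assuming the classic gym API (4-tuple step), with a random policy standing in for the agent:

```python
import gym

env = gym.make("CartPole-v0")
for episode in range(100):
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()  # placeholder for agent.step(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print("episode %i: reward %.1f" % (episode, total_reward))
```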

Off-policy learning

Right now we only use on-policy algorithms (in fact, just one: advantage actor-critic).

This restricts the algorithm in at least two ways:

  • we can't reuse old sessions since they represent an older policy (so the experience replay pool is super small)
  • we depend on fast sync between learner and player (else the player plays an older policy)

It would probably be a good alternative to have a classic value-based off-policy agent like DQN, but with all the improvements there are. E.g. it would be nice to use optimality tightening (https://arxiv.org/abs/1611.01606) or intrinsic motivation.

It would also be great to use some LARGE replay buffer, because now we can. Such an algorithm could even work well with a minimal number of player processes, since it can just store the last 100500 games for training.

Start with go9x9, then make sure it generalizes.
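
The replay buffer itself is simple; a hedged in-memory sketch (the capped Redis list from the database sketch above would serve the same role across processes):

```python
import random
from collections import deque


class ReplayBuffer(object):
    """Sketch of a large uniform replay buffer for an off-policy learner (e.g. DQN)."""

    def __init__(self, max_size=100500):
        self.storage = deque(maxlen=max_size)  # oldest sessions are dropped automatically

    def add(self, session):
        self.storage.append(session)

    def sample(self, batch_size):
        """Uniformly mix old and new sessions - only valid for off-policy algorithms."""
        return random.sample(list(self.storage), batch_size)
```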

clear mode

The current run script is bare-bones; there is still a lot of auxiliary functionality to add.
One such thing is a 'clear' mode that should have options to

  • kill all local processes registered for this experiment and remove their heartbeats (please check that they are really tinyverse processes)
  • clear sessions
  • clear weights
  • clear everything at once

One simple way to design it is through arguments like

  • python tinyverse clear --all - does everything
  • python tinyverse clear --kill-players - only kills local players, keeps sessions intact
  • python tinyverse clear --del-sessions - only removes sessions, keeps processes alive

The goal is to implement this functionality so that the runner script is still readable afterwards.
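
A hedged sketch of how the command-line side could stay readable; the flag names follow the proposal above, while the helper function and database methods are hypothetical:

```python
import argparse


def build_clear_parser(subparsers):
    """Register a 'clear' subcommand with one flag per cleanup action."""
    parser = subparsers.add_parser("clear", help="clean up processes, sessions and weights")
    parser.add_argument("--all", action="store_true", help="kill players, drop sessions and weights")
    parser.add_argument("--kill-players", action="store_true", help="only kill local player processes")
    parser.add_argument("--del-sessions", action="store_true", help="only remove stored sessions")
    parser.add_argument("--del-weights", action="store_true", help="only remove stored weights")
    return parser


def run_clear(args, database):
    """Dispatch each flag to one small helper so the runner script stays readable."""
    if args.all or args.kill_players:
        kill_local_players(database)   # hypothetical helper: verify heartbeats before killing
    if args.all or args.del_sessions:
        database.clear_sessions()      # hypothetical database method
    if args.all or args.del_weights:
        database.clear_weights()       # hypothetical database method
```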

Player process 1.0

A process that

  • compiles the agent
  • takes the stored dqn params every once in a while
  • if there are no params in the database, puts them there
  • interacts :) (cpu)
  • stores the interactions in the database
  • all interactions are of the same length (e.g. 10 ticks)
  • loads new weights every once in a while
  • does not break everything if restarted at a random point in time
  • does not erase old sessions
  • does not intentionally break down if there are several such processes running in parallel :)

Basically do that but with an agent and all the trouble that comes with it.
Example agent setup for mountaincar.
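
A hedged sketch of that loop, reusing the hypothetical Database/Experiment interfaces from the sketches above:

```python
import time


def run_player(experiment, env, database, n_steps=10, reload_period=30.0):
    """Sketch of a player: interact, record fixed-length sessions, refresh weights."""
    params = database.load_all_params()
    if params is None:
        # no trained weights in the database yet: publish this agent's initial params
        database.save_all_params(experiment.get_all_params())
    else:
        experiment.set_all_params(params)

    last_reload = time.time()
    while True:
        session = experiment.play(env, n_steps)  # every session has the same length (e.g. 10 ticks)
        database.record_session(session)         # append only, never erase old sessions
        if time.time() - last_reload > reload_period:
            params = database.load_all_params()  # pick up whatever the trainer published
            if params is not None:
                experiment.set_all_params(params)
            last_reload = time.time()
```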

Insta-deploy

We may want to think about some one-line deployment kit for the player or learner or database processes to allow quickly adding a lot of CPU (e.g. a classroom-full of desktops :) ).

So far I have only considered docker containers, but are there any better/easier ways to do this?

Also, it would require some minor changes, like changing the database ip from localhost to the actual host.

Learner process 1.0

A process that

  • takes params for the actual [and target] network from the database
  • if they aren't there, initializes them at random
  • samples sessions at random from the database
  • trains the agent (uses GPU)
  • saves updated [and target] params to the database every N iterations (e.g. ~30 sec)
  • replaces the old ones so that the player processes will use the updated params automatically
  • cleans up old sessions
  • every N iterations, deletes all but the last K (e.g. 10^5) sessions
  • does not break down if killed and restarted at a random point in time

So far we can assume that there's only one learner process.
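
A hedged sketch of that loop in the same hypothetical vocabulary as the player sketch above (trim_sessions is an assumed database method):

```python
def run_learner(experiment, database, batch_size=10, save_period=100, keep_last=100000):
    """Sketch of the (single) learner: replay, update, publish, prune."""
    params = database.load_all_params()
    if params is None:
        database.save_all_params(experiment.get_all_params())  # init the store with random weights
    else:
        experiment.set_all_params(params)

    iteration = 0
    while True:
        sessions = database.sample_sessions(batch_size)  # random experience replay batch
        experiment.train_step(sessions)                  # the GPU-heavy part
        iteration += 1
        if iteration % save_period == 0:
            database.save_all_params(experiment.get_all_params())  # players reload these automatically
            database.trim_sessions(keep_last)            # hypothetical: delete all but the last K sessions
```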

Survivors

Please check in if you feel recovered from the new year and ready to rock :)

LSTM/GRU agent

The current A3C feedforward policy only uses a 4-frame history to predict the policy and may fail when the environment remains partially observable despite that window.
It would probably be nice to implement a policy with GRU/LSTM hidden units [instead of the frame window, or along with it].

A good proving ground would be a game with a limited field of view like doom - DefendCenter or HealthGathering - just make sure that the image preprocessing works fine with it.

Bonus kudos for implementing a soft attention mechanism that actually improves results [or at least does no harm and gives clues about what the agent looks at]. link link
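
A hedged, lasagne-only sketch of the recurrent head (shapes and layer sizes are illustrative; wiring it into agentnet's one-step recurrence is not shown here):

```python
from lasagne.layers import InputLayer, DenseLayer, GRULayer

# illustrative shapes: a batch of per-frame feature vectors over time, (batch, time, features)
features_seq = InputLayer((None, None, 256), name="conv features over time")

# GRU memory replaces (or complements) the 4-frame window
memory = GRULayer(features_seq, num_units=256, only_return_final=True, name="gru memory")

# actor-critic heads on top of the recurrent state
policy_logits = DenseLayer(memory, num_units=6, nonlinearity=None, name="policy logits")
state_value = DenseLayer(memory, num_units=1, nonlinearity=None, name="V(s)")
```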

Parameter server support.

Currently the train-er simply saves all params to the database, assuming it is the only one.
This makes running several parallel training processes useless [kind of a bootstrapped DQN at a higher price].

There are, however, techniques that allow parallel updates with periodic synchronization:
https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
These may or may not be usable here.

The goal is to allow such parallelism with minimum lines of code.

  • in this method, handle the coefficient by which to change params on the server (default 1).
  • in this method, add flags:
    • whether to also LOAD params every save_period to synchronize with other trainers
    • a coefficient by which to change params on the server (default 1) - pass it to save_all_params
  • a flag here that allows partially updating params on the server. Default 1. Warn if > 1. If != 1, also make the trainer load weights from the server (previous point).

Also, it may be wise to avoid locks in case someone wants this to work in 100500 processes - or at least measure the time lost to locking and make sure it is small.

It would be super nice if you first created an implementation with maximum readability / minimum lines of code.
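
The core of the partial-update scheme fits in a few lines; a hedged sketch on plain numpy arrays (the coefficient here corresponds to the one discussed above):

```python
def merge_params(server_params, local_params, alpha=1.0):
    """Move server params towards this trainer's params by a factor alpha.

    alpha=1.0 reproduces the current behaviour (plain overwrite);
    alpha<1.0 lets several trainers update the same server softly."""
    return [old + alpha * (new - old) for old, new in zip(server_params, local_params)]
```

The trainer would then load the server params every save_period, merge, and save the result back; keeping that read-merge-write window short (or wrapping it in a Redis WATCH/MULTI transaction) is what keeps the time lost to locking small.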

Wrapping universe

By default, Universe provides low-level keyboard control with a very high-dimensional action space AND a high-dimensional image (plus occasional other stuff) as observations.

While it is a noble quest to tackle the environment as it is, we had better first try learning something in a more RL-friendly setup:

  • smaller-resolution image
  • discrete action space: wrap only the essential actions like "turn car left/right", not "move cursor to that random location"
  • repeat the same action for N (e.g. 4) timesteps to make effective sessions shorter

So, we need to create a wrapper environment that takes an openai universe game and alters its reset() and step() methods to make the env easier for the agent to master.

Here's one game that seems simple enough for this; I suggest sticking to it unless we bump into some insurmountable obstacle.

One example of altering an environment this way can be found here
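
A hedged sketch of such a wrapper in plain gym terms, assuming the classic 4-tuple step API; the class name, action list, resolution and repeat count are placeholders, and the real universe observation format (vectorized, with a 'vision' field) is only hinted at:

```python
import cv2
import gym
import numpy as np


class SimplifiedEnv(gym.Wrapper):
    """Smaller observations, a handful of discrete actions, and action repeat."""

    def __init__(self, env, actions, image_size=(64, 64), action_repeat=4):
        gym.Wrapper.__init__(self, env)
        self.actions = actions            # e.g. a short list of essential key events
        self.image_size = image_size
        self.action_repeat = action_repeat

    def _preprocess(self, observation):
        frame = observation["vision"] if isinstance(observation, dict) else observation
        frame = cv2.resize(frame, self.image_size)
        return frame.astype(np.float32) / 255.0

    def reset(self):
        return self._preprocess(self.env.reset())

    def step(self, action_index):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.action_repeat):  # repeat the chosen action for N timesteps
            observation, reward, done, info = self.env.step(self.actions[action_index])
            total_reward += reward
            if done:
                break
        return self._preprocess(observation), total_reward, done, info
```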

PEP8

All of the code surely needs to be PEP8 formatted.
