
chainer / chainerrl

ChainerRL is a deep reinforcement learning library built on top of Chainer.

License: MIT License

Languages: Python 97.82%, Shell 2.18%
Topics: actor-critic, chainer, deep-learning, dqn, machine-learning, python, reinforcement-learning

chainerrl's Introduction

Notice: As announced, Chainer is under the maintenance phase and further development will be limited to bug-fixes and maintenance only.


Chainer: A deep learning framework


Website | Docs | Install Guide | Tutorials (ja) | Examples (Official, External) | Concepts | ChainerX

Forum (en, ja) | Slack invitation (en, ja) | Twitter (en, ja)

Chainer is a Python-based deep learning framework aiming at flexibility. It provides automatic differentiation APIs based on the define-by-run approach (a.k.a. dynamic computational graphs) as well as object-oriented high-level APIs to build and train neural networks. It also supports CUDA/cuDNN using CuPy for high performance training and inference. For more details about Chainer, see the documents and resources listed above and join the community in Forum, Slack, and Twitter.
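
As a small illustration of the define-by-run style (a sketch, not taken from the README): the computational graph is recorded while ordinary Python code executes, and gradients are obtained by calling backward().

# Minimal define-by-run sketch (illustrative): the graph is built as this
# code runs, and backward() differentiates through it.
import numpy as np
import chainer.functions as F
from chainer import Variable

x = Variable(np.array([[1.0, 2.0]], dtype=np.float32))
y = F.sum(x ** 2 + 3 * x)   # graph is recorded while this line executes
y.backward()
print(x.grad)               # [[5. 7.]] because dy/dx = 2x + 3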

Installation

For more details, see the installation guide.

To install Chainer, use pip.

$ pip install chainer

To enable CUDA support, CuPy is required. Refer to the CuPy installation guide.

Docker image

We provide an official Docker image. This image supports nvidia-docker. Log in to the environment with the following command, and run the Python interpreter to use Chainer with CUDA and cuDNN support.

$ nvidia-docker run -it chainer/chainer /bin/bash

Contribution

See the contribution guide.

ChainerX

See the ChainerX documentation.

License

MIT License (see LICENSE file).

More information

References

Tokui, Seiya, et al. "Chainer: A Deep Learning Framework for Accelerating the Research Cycle." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2019. (URL, BibTex)

Tokui, S., Oono, K., Hido, S. and Clayton, J. "Chainer: a Next-Generation Open Source Framework for Deep Learning." Proceedings of the Workshop on Machine Learning Systems (LearningSys) at the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015. (URL, BibTex)

Akiba, T., Fukuda, K. and Suzuki, S. "ChainerMN: Scalable Distributed Deep Learning Framework." Proceedings of the Workshop on ML Systems at the Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), 2017. (URL, BibTex)

chainerrl's People

Contributors

corochann, delta2323, imos, iory, katrinleinweber, keisuke-nakata, kiyukuta, knorth55, kuni-kuni, ljvmiranda921, lyx-x, marioyc, mitmul, mmilk1231, monado3, mr4msm, muupan, okuta, prabhatnagarajan, seann999, toslunar, uidilr, ummavi, xinyuewang1


chainerrl's Issues

env.spec.timestep_limit has been deprecated

gym now complains:

DEPRECATION WARNING: env.spec.timestep_limit has been deprecated. Replace your call to env.spec.timestep_limit with env.spec.tags.get('wrapper_config.TimeLimit.max_episode_steps'). This change was made 12/28/2016 and is included in version 0.7.0
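
The warning itself names the replacement; a minimal sketch (Pendulum-v0 chosen only as an example env):

# Read the limit from the spec's tags instead of the deprecated attribute.
import gym

env = gym.make('Pendulum-v0')
timestep_limit = env.spec.tags.get('wrapper_config.TimeLimit.max_episode_steps')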

ValueError: On entry to SGEMV parameter number 8 had an illegal value

Travis CI failed on examples/gym/train_ddpg_gym.py:

Traceback (most recent call last):
  File "examples/gym/train_ddpg_gym.py", line 173, in <module>
    main()
  File "examples/gym/train_ddpg_gym.py", line 170, in main
    max_episode_len=timestep_limit)
  File "/home/travis/build/pfnet/chainerrl/chainerrl/experiments/train_agent.py", line 144, in train_agent_with_evaluation
    logger=logger)
  File "/home/travis/build/pfnet/chainerrl/chainerrl/experiments/train_agent.py", line 52, in train_agent
    action = agent.act_and_train(obs, r)
  File "/home/travis/build/pfnet/chainerrl/chainerrl/agents/ddpg.py", line 314, in act_and_train
    self.replay_updater.update_if_necessary(self.t)
  File "/home/travis/build/pfnet/chainerrl/chainerrl/replay_buffer.py", line 327, in update_if_necessary
    self.update_func(transitions)
  File "/home/travis/build/pfnet/chainerrl/chainerrl/agents/ddpg.py", line 246, in update
    self.actor_optimizer.update(lambda: self.compute_actor_loss(batch))
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/chainer/optimizer.py", line 416, in update
    loss.backward()
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/chainer/variable.py", line 398, in backward
    gxs = func.backward(in_data, out_grad)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/chainer/functions/connection/linear.py", line 59, in backward
    gW = gy.T.dot(x).astype(W.dtype, copy=False)
ValueError: On entry to SGEMV parameter number 8 had an illegal value

This may be the same issue as chainer/chainer#2744

average_loss always 0 when using episodic_replay=True (DQN)

Trying these two different q_functions:

(non recurrent)

# Imports implied by the snippet (module paths assumed): MLP and StateQFunction
# are ChainerRL utilities, F/L are Chainer's functions/links shortcuts.
import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
from chainerrl.links.mlp import MLP
from chainerrl.q_function import StateQFunction


class QFunction(chainer.Chain, StateQFunction):

    def __init__(self, n_input_channels=3, n_actions=4, bias=0.1):
        self.n_actions = n_actions
        self.n_input_channels = n_input_channels
        conv_layers = chainer.ChainList(
            L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
            L.Convolution2D(32, 64, 4, stride=2, bias=bias),
            L.Convolution2D(64, 64, 3, stride=1, bias=bias),
            L.Convolution2D(64, 128, 7, stride=1, bias=bias))

        lin_layer = L.Linear(128, 128)

        a_stream = MLP(128, n_actions, [2])
        v_stream = MLP(128, 1, [2])

        super().__init__(conv_layers=conv_layers, lin_layer=lin_layer,
                         a_stream=a_stream, v_stream=v_stream)

    def __call__(self, x, test=False):
        """
        Args:
            x (ndarray or chainer.Variable): An observation
            test (bool): a flag indicating whether it is in test mode
        """
        h = x
        for l in self.conv_layers:
            h = F.relu(l(h))
        h = self.lin_layer(h)

        # Dueling-style heads: mean-centered advantage stream plus value stream.
        batch_size = x.shape[0]
        ya = self.a_stream(h, test=test)
        mean = F.reshape(F.sum(ya, axis=1) / self.n_actions, (batch_size, 1))
        ya, mean = F.broadcast(ya, mean)
        ya -= mean

        ys = self.v_stream(h, test=test)

        ya, ys = F.broadcast(ya, ys)
        q = ya + ys
        return chainerrl.action_value.DiscreteActionValue(q)


(recurrent)

class QFunctionRecurrent(chainer.Chain, StateQFunction):

    def __init__(self, n_input_channels=3, n_actions=4, bias=0.1):
        self.n_actions = n_actions
        self.n_input_channels = n_input_channels
        conv_layers = chainer.ChainList(
            L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
            L.Convolution2D(32, 64, 4, stride=2, bias=bias),
            L.Convolution2D(64, 64, 3, stride=1, bias=bias),
            L.Convolution2D(64, 128, 7, stride=1, bias=bias))

        lstm_layer = L.LSTM(128, 128)

        a_stream = MLP(128, n_actions, [2])
        v_stream = MLP(128, 1, [2])

        super().__init__(conv_layers=conv_layers, lstm_layer=lstm_layer,
                         a_stream=a_stream, v_stream=v_stream)

    def __call__(self, x, test=False):
        """
        Args:
            x (ndarray or chainer.Variable): An observation
            test (bool): a flag indicating whether it is in test mode
        """
        h = x
        for l in self.conv_layers:
            h = F.relu(l(h))
        h = self.lstm_layer(h)

        batch_size = x.shape[0]
        ya = self.a_stream(h, test=test)
        mean = F.reshape(F.sum(ya, axis=1) / self.n_actions, (batch_size, 1))
        ya, mean = F.broadcast(ya, mean)
        ya -= mean

        ys = self.v_stream(h, test=test)

        ya, ys = F.broadcast(ya, ys)
        q = ya + ys
        return chainerrl.action_value.DiscreteActionValue(q)

I found that for the non-recurrent version the loss is not zero and the agent eventually masters the provided gym environment.

However, changing nothing other than adding an LSTM layer and setting episodic_replay to True, the average_loss becomes 0 all the time and the agent is not able to learn to interact better with its environment.

At first I thought this was due to some kind of rounding issue, so I set minibatch_size=1 and episodic_update_len=1 (assuming that one episodic replay would then only contain one time step), but still no change.

I wonder if this is some kind of bug or (which I think is more likely) an error on my side.

Any help is very much appreciated!
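
For context, a hedged sketch of the setup under discussion; the DQN argument names episodic_update and episodic_update_len are inferred from tracebacks elsewhere on this page and may differ between ChainerRL versions:

import chainerrl

# Placeholders standing in for the reporter's objects (assumptions, not real code):
q_func = None       # the recurrent Q-function shown above
opt = None          # a chainer optimizer already set up with q_func
explorer = None     # e.g. an epsilon-greedy explorer

# Episodic replay: an episode-aware buffer plus episodic updates on the agent.
rbuf = chainerrl.replay_buffer.EpisodicReplayBuffer(capacity=10 ** 5)
agent = chainerrl.agents.DQN(
    q_func, opt, rbuf, gamma=0.99, explorer=explorer,
    minibatch_size=4, episodic_update=True, episodic_update_len=16)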

Question on gym action space

Hi, I've defined my own OpenAI Gym environment and have specified my actions as follows:

self.actions = ["NOOP", "LEFT", "RIGHT", "FIRE", "CLOAK"]

self.action_space = spaces.Discrete(len(self.actions))

When I try my environment with the 'train_dqn_gym.py' example, I can see from my debug output that training correctly results in a variety of different actions being tried.

However, with both 'train_a3c_gym.py' and 'train_acer_gym.py', the action provided to my step function is always 0 (NOOP) - it never tries any other action.

Have I coded something wrong in my environment? I would appreciate any tips on how to investigate my issue further.
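
For reference, a minimal custom-environment sketch (hypothetical, not the reporter's code) following the classic gym.Env API, where step() receives an integer index into the Discrete action space:

# Hypothetical environment skeleton: step() must return (obs, reward, done, info).
import gym
import numpy as np
from gym import spaces

class MyEnv(gym.Env):
    def __init__(self):
        self.actions = ["NOOP", "LEFT", "RIGHT", "FIRE", "CLOAK"]
        self.action_space = spaces.Discrete(len(self.actions))
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,))

    def reset(self):
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        assert self.action_space.contains(action)  # action is an int index
        obs = np.zeros(4, dtype=np.float32)
        return obs, 0.0, False, {}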

The tutorial code causes TypeError on python 3.4

On Python 3.4, random.sample doesn't accept a collections.deque, so I got the following error.

Traceback (most recent call last):
  File "quickstart.py", line 111, in <module>
    action = agent.act_and_train(obs, reward)
  File "/opt/rl/lib/python3.4/site-packages/chainerrl/agents/dqn.py", line 340, in act_and_train
    self.replay_updator.update_if_necessary(self.t)
  File "/opt/rl/lib/python3.4/site-packages/chainerrl/replay_buffer.py", line 194, in update_if_necessary
    transitions = self.replay_buffer.sample(self.batchsize)
  File "/opt/rl/lib/python3.4/site-packages/chainerrl/replay_buffer.py", line 42, in sample
    return random.sample(self.memory, n)
  File "/opt/rl/lib/python3.4/random.py", line 311, in sample
    raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")
TypeError: Population must be a sequence or set.  For dicts, use list(d).

Python 2.7 works fine, and maybe 3.5+ as well.
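
A minimal sketch of the usual workaround (illustrative values): copy the deque into a list before sampling, which is portable across Python versions.

import random
from collections import deque

memory = deque(range(10))
batch = random.sample(list(memory), 3)  # list() makes sampling portable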

Specify successful configurations for examples

Current examples don't specify the configurations in which they work well, except for the newer ones (train_pcl_gym.py and train_reinforce_gym.py). Such instructions are important because they let users easily confirm that the implementations actually work.

  • ale/train_a3c_ale.py
  • ale/train_acer_ale.py
  • ale/train_dqn_ale.py
  • ale/train_nsq_ale.py
  • gym/train_a3c_gym.py
  • gym/train_acer_gym.py
  • gym/train_ddpg_gym.py
  • gym/train_dqn_gym.py
  • gym/train_pcl_gym.py
  • gym/train_reinforce_gym.py

env.monitor has been deprecated as of 12/23/2016

gym.error.Error: env.monitor has been deprecated as of 12/23/2016. Remove your call to env.monitor.start(directory) and instead wrap your env with env = gym.wrappers.Monitor(env, directory) to record data.
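
The error message itself names the replacement; a small sketch, with 'monitor_out' as an assumed output directory:

# Wrap the env with gym.wrappers.Monitor instead of calling env.monitor.start().
import gym

env = gym.make('Pendulum-v0')
env = gym.wrappers.Monitor(env, 'monitor_out')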

Windows Bash Run Chainerrl unknown cuda error

Recently I got a Windows 10 computer and successfully installed Bash on Ubuntu on Windows, CUDA, cuDNN, Chainer, and ChainerRL. But when running the example, I got the following error. Any suggestions?

(py2env) neil@DESKTOP-C22605O:~/chainerrl$ xvfb-run -s "-screen 0 1400x900x24" python examples/gym/train_dqn_gym.py
Output files are saved in dqn_out/20170324141722891586
INFO:gym.envs.registration:Making new env: Pendulum-v0
Traceback (most recent call last):
  File "examples/gym/train_dqn_gym.py", line 179, in <module>
    main()
  File "examples/gym/train_dqn_gym.py", line 154, in main
    episodic_update=args.episodic_replay, episodic_update_len=16)
  File "/home/neil/py2env/local/lib/python2.7/site-packages/chainerrl/agents/dqn.py", line 115, in __init__
    cuda.get_device(gpu).use()
  File "cupy/cuda/device.pyx", line 75, in cupy.cuda.device.Device.use (cupy/cuda/device.cpp:2083)
  File "cupy/cuda/device.pyx", line 81, in cupy.cuda.device.Device.use (cupy/cuda/device.cpp:2035)
  File "cupy/cuda/runtime.pyx", line 178, in cupy.cuda.runtime.setDevice (cupy/cuda/runtime.cpp:2915)
  File "cupy/cuda/runtime.pyx", line 130, in cupy.cuda.runtime.check_status (cupy/cuda/runtime.cpp:2241)
cupy.cuda.runtime.CUDARuntimeError: cudaErrorUnknown: unknown error

Date and time format for experiments

Human-readability of the name of the subdirectory for an experiment might be improved. The current implementation (time_str = datetime.datetime.now().strftime('%Y%m%d%H%M%S%f') in chainerrl/experiments/prepare_output_dir.py) produces e.g. 21120903182945898662. How about one of the following (sketched in code after the list)?

  • strftime('%Y%m%d-%H%M%S-%f') (e.g. 21120903-182945-898662), or
  • the basic format in ISO 8601 (e.g. 21120903T182945.898662+0900)?
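
A small sketch comparing the current and proposed formats (illustrative; the ISO 8601 variant with a UTC offset would additionally need a timezone-aware datetime):

import datetime

now = datetime.datetime.now()
print(now.strftime('%Y%m%d%H%M%S%f'))    # current, e.g. 20170903182945898662
print(now.strftime('%Y%m%d-%H%M%S-%f'))  # proposed, e.g. 20170903-182945-898662
print(now.strftime('%Y%m%dT%H%M%S.%f'))  # ISO 8601 basic format, without offset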

Documentation on usage of recurrent models

In ChainerRL, to use user-defined recurrent models, you need to make sure they implement the chainerrl.recurrent.Recurrent interface; otherwise they won't be treated as recurrent models.

When your model's recurrent-ness comes from chainer.links.LSTM, all you have to do is inherit chainerrl.recurrent.RecurrentChainMixin.

This kind of information is missing from the documentation.
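
A minimal sketch of the point above (illustrative, not taken from the documentation): a Q-function whose recurrent-ness comes from chainer.links.LSTM only needs to inherit chainerrl.recurrent.RecurrentChainMixin to be treated as recurrent.

import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
from chainerrl.recurrent import RecurrentChainMixin

class MyRecurrentQFunction(chainer.Chain, RecurrentChainMixin):

    def __init__(self, obs_size, n_actions, n_hidden=64):
        super().__init__(
            l1=L.Linear(obs_size, n_hidden),
            lstm=L.LSTM(n_hidden, n_hidden),
            out=L.Linear(n_hidden, n_actions),
        )

    def __call__(self, x):
        h = F.relu(self.l1(x))
        h = self.lstm(h)  # the mixin lets ChainerRL find and reset this state
        return chainerrl.action_value.DiscreteActionValue(self.out(h))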

PyTorch as an additional backend

I'm curious about whether ChainerRL can support PyTorch as an additional NN backend. Its interface is similar to Chainer's, but I'm not sure how easy it would be to support both. Any suggestions and opinions are welcome.

Add suppression option for print messages during training loop?

In chainerrl.experiments.train_agent, statistical information is reported via print for each episode during the training loop. However, this can be quite verbose, and there is currently no good way to suppress these messages. Adding an option that enables/disables these prints would be beneficial.
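
A hedged workaround sketch, not a ChainerRL feature: since the statistics are printed to stdout as described above, redirecting stdout around the training call silences them; the argument names and values below are placeholders and may differ between versions.

import contextlib
import io
import chainerrl

agent, env = None, None  # placeholders for the user's agent and environment

with contextlib.redirect_stdout(io.StringIO()):
    chainerrl.experiments.train_agent_with_evaluation(
        agent=agent, env=env, steps=10 ** 5,
        eval_n_runs=10, eval_interval=10 ** 4, outdir='result')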

Extend gym.Wrapper instead of env_modifiers

Since gym has introduced its own interface for modifying envs, gym.Wrapper, I think it is better to use it in ChainerRL instead of directly modifying methods as env_modifiers does.
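
A minimal gym.Wrapper sketch (illustrative) of the interface the issue proposes using instead of patching env methods directly:

import gym

class ScaleReward(gym.Wrapper):
    """Multiplies rewards by a constant, as a tiny example of env modification."""

    def __init__(self, env, scale=0.01):
        super().__init__(env)
        self.scale = scale

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, reward * self.scale, done, info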

REINFORCE

A simple REINFORCE implementation that doesn't require a value function would be helpful.
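
A bare-bones sketch of such an update (illustrative, not ChainerRL's implementation): scale each step's log-probability by the discounted return and ascend the gradient, with no value function involved.

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """optimizer: a chainer optimizer already set up with the policy;
    log_probs: list of chainer.Variable log pi(a_t|s_t) from one episode;
    rewards: list of float rewards for the same episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):       # discounted return G_t, computed backwards
        g = r + gamma * g
        returns.insert(0, g)
    loss = -sum(lp * g for lp, g in zip(log_probs, returns))
    optimizer.target.cleargrads()
    loss.backward()
    optimizer.update()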

Type of observation and action space

As the gym examples show, the agent and q_functions seem to expect the observation space and action space each to be a Box or a Discrete. Is that correct?

If so, how should I use observation and action spaces of other types, especially a Tuple?
Do I need to modify the environment so that it returns a Box, or do I have another choice?

Thanks,
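
One common workaround, sketched here as an assumption rather than an official answer: wrap the env so a Tuple observation is flattened into a single numpy vector before it reaches the agent, preserving the Box-style assumption made by the examples.

import gym
import numpy as np

def _flatten(obs):
    # Concatenate every component of the Tuple observation into one vector.
    return np.concatenate([np.asarray(o, dtype=np.float32).ravel() for o in obs])

class FlattenTupleObs(gym.Wrapper):
    def reset(self, **kwargs):
        return _flatten(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return _flatten(obs), reward, done, info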

MuJoCo-ACER Examples

Is there any example code for ACER in continuous action spaces, using the MuJoCo environments?
