
Comments (7)

muupan commented on August 24, 2024

Hi, you need to make sure your model implements the chainerrl.recurrent.Recurrent interface so that it can be treated as a recurrent model. I guess the easiest way to do that is to inherit from chainerrl.recurrent.RecurrentChainMixin, like

class QFunctionRecurrent(chainer.Chain, StateQFunction, RecurrentChainMixin):

, which will find L.LSTM by searching recursively in chainer.Chain and chainer.ChainList.
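
For reference, here is a minimal self-contained sketch of that pattern, written in the same old-style Chainer idioms as the code later in this thread. The class name, layer sizes, and the FC-plus-LSTM layout are made up for illustration only; the point is just that RecurrentChainMixin in the class hierarchy lets ChainerRL locate the L.LSTM child link and manage its hidden state at episode boundaries.

import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
from chainerrl.q_function import StateQFunction
from chainerrl.recurrent import RecurrentChainMixin


class RecurrentQFunction(chainer.Chain, StateQFunction, RecurrentChainMixin):
    """Hypothetical recurrent Q-function, only to illustrate the mixin."""

    def __init__(self, obs_size=4, n_actions=2, n_hidden=64):
        super().__init__(
            fc=L.Linear(obs_size, n_hidden),
            lstm=L.LSTM(n_hidden, n_hidden),  # found by RecurrentChainMixin
            out=L.Linear(n_hidden, n_actions),
        )

    def __call__(self, x, test=False):
        h = F.relu(self.fc(x))
        h = self.lstm(h)  # hidden state is kept inside the L.LSTM link
        return chainerrl.action_value.DiscreteActionValue(self.out(h))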

Documentation on the usage of recurrent models is largely missing, so I opened another issue for it: #83. Thanks for reporting the issue!

muupan commented on August 24, 2024

Thanks for your code!

Just for clarification:

If episodic_replay=True, then:

minibatch_size corresponds to the number of episodes used for the experience replay

and

episodic_update_len corresponds to the number of time steps within each of those episodes, right?

Thus, if one batch has e.g. 50 time steps and episodic_update_len=16, will it draw 16 consecutive time steps from this episode for replay? Furthermore, if episodic_update_len=None, will it use all time steps within this episode?

You are correct. minibatch_size is the number of episodes to sample for an update. Each sampled episode's length is at most episodic_update_len.
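
As a rough sketch of what those two parameters map to inside the replay buffer (the sample_episodes call and its max_len argument are my reading of chainerrl.replay_buffer, and the concrete numbers are invented for illustration):

from chainerrl import replay_buffer

# Buffer that stores whole episodes instead of flat transitions.
rbuf = replay_buffer.EpisodicReplayBuffer(capacity=10 ** 5)

# ... during training, transitions are appended with rbuf.append(...) and
# episodes are closed with rbuf.stop_current_episode() ...

# minibatch_size=4 and episodic_update_len=16 roughly correspond to:
episodes = rbuf.sample_episodes(4, max_len=16)
for ep in episodes:
    assert len(ep) <= 16  # with max_len=None, the whole episode is used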

As for average_loss, it turned out to be a bug in ChainerRL. Losses are computed and the model is updated as usual. However, the value of average_loss is not updated at all when episodic_update=True. I'll open an issue for it and fix it soon. Thanks for reporting it!

kfeeeeee commented on August 24, 2024

Hi, thank you very much for your fast response! That indeed solved the issue.

Just for clarification:

If episodic_replay=True, then:

minibatch_size corresponds to the number of episodes used for the experience replay

and

episodic_update_len corresponds to the number of time steps within each of those episodes, right?

Thus, if one batch has e.g. 50 time steps and episodic_update_len=16, will it draw 16 consecutive time steps from this episode for replay? Furthermore, if episodic_update_len=None, will it use all time steps within this episode?

Thanks again!

kfeeeeee commented on August 24, 2024

Unfortunately, my statement above about the issue being solved was a bit hasty.

I modified the Q-function, i.e.:

class QFunction(chainer.Chain, StateQFunction, RecurrentChainMixin):

    def __init__(self, n_input_channels=3, n_actions=4, bias=0.1):
        self.n_actions = n_actions
        self.n_input_channels = n_input_channels
        conv_layers = chainer.ChainList(
            L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
            L.Convolution2D(32, 64, 4, stride=2, bias=bias),
            L.Convolution2D(64, 64, 3, stride=1, bias=bias),
            L.Convolution2D(64, 128, 7, stride=1, bias=bias),
        )

        lstm_layer = L.LSTM(128, 128)

        a_stream = MLP(128, n_actions, [2])
        v_stream = MLP(128, 1, [2])

        super().__init__(conv_layers=conv_layers, lstm_layer=lstm_layer,
                         a_stream=a_stream, v_stream=v_stream)

    def __call__(self, x, test=False):
        """
        Args:
            x (ndarray or chainer.Variable): An observation
            test (bool): a flag indicating whether it is in test mode
        """
        h = x
        for l in self.conv_layers:
            h = F.relu(l(h))
        h = self.lstm_layer(h)

        batch_size = x.shape[0]
        ya = self.a_stream(h, test=test)
        mean = F.reshape(F.sum(ya, axis=1) / self.n_actions, (batch_size, 1))
        ya, mean = F.broadcast(ya, mean)
        ya -= mean

        ys = self.v_stream(h, test=test)

        ya, ys = F.broadcast(ya, ys)
        q = ya + ys
        return chainerrl.action_value.DiscreteActionValue(q)

with

episodic_replay = True
minibatch_size = 4
episodic_update_len = None

but still, the average_loss is 0, whereas it isn't in the non-recurrent case.

However, there is a good chance that this is due to some bug in my code.

Below I have attached the full source code (without the environment), based on the train_dqn_gym.py you have provided.

Thanks again!

def main():
    import logging
    logging.basicConfig(level=logging.INFO)

    parser = argparse.ArgumentParser()
    parser.add_argument('--outdir', type=str, default='dqn_out')
    parser.add_argument('--env', type=str, default='Pendulum-v0')
    parser.add_argument('--seed', type=int, default=None)
    parser.add_argument('--gpu', type=int, default=0)
    parser.add_argument('--final-exploration-steps',
                        type=int, default=1000 * 50)
    parser.add_argument('--start-epsilon', type=float, default=1.0)
    parser.add_argument('--end-epsilon', type=float, default=.05)
    parser.add_argument('--demo', action='store_true', default=False)
    parser.add_argument('--load', type=str, default=None)
    parser.add_argument('--steps', type=int, default=500000)
    parser.add_argument('--prioritized-replay', action='store_true')
    parser.add_argument('--episodic-replay',type=bool, default=True)
    parser.add_argument('--replay-start-size', type=int, default=None)
    parser.add_argument('--target-update-frequency', type=int, default=1)
    parser.add_argument('--target-update-method', type=str, default='soft')
    parser.add_argument('--soft-update-tau', type=float, default=0.001)
    parser.add_argument('--update-frequency', type=int, default=1)
    parser.add_argument('--eval-n-runs', type=int, default=10)
    parser.add_argument('--eval-frequency', type=int, default=50*10)
    parser.add_argument('--n-hidden-channels', type=int, default=100)
    parser.add_argument('--n-hidden-layers', type=int, default=2)
    parser.add_argument('--gamma', type=float, default=0.99)
    parser.add_argument('--minibatch-size', type=int, default=None)
    parser.add_argument('--render-train', action='store_true')
    parser.add_argument('--render-eval', action='store_true')
    parser.add_argument('--monitor', action='store_true')
    parser.add_argument('--reward-scale-factor', type=float, default=.1)
    args = parser.parse_args()

    

    args.outdir = experiments.prepare_output_dir(
        args, args.outdir, argv=sys.argv)
    print('Output files are saved in {}'.format(args.outdir))

    if args.seed is not None:
        misc.set_random_seed(args.seed)

    def clip_action_filter(a):
        return np.clip(a, action_space.low, action_space.high)

    def make_env(for_eval):
        env = gym.make(args.env)
        if args.monitor:
            env = gym.wrappers.Monitor(env, args.outdir)
        if isinstance(env.action_space, spaces.Box):
            misc.env_modifiers.make_action_filtered(env, clip_action_filter)
        if not for_eval:
            misc.env_modifiers.make_reward_filtered(
                env, lambda x: x * args.reward_scale_factor)
        if ((args.render_eval and for_eval) or
                (args.render_train and not for_eval)):
            misc.env_modifiers.make_rendered(env)
        return env

    env = make_env(for_eval=False)
    timestep_limit = env.spec.tags.get(
        'wrapper_config.TimeLimit.max_episode_steps')
    obs_size = env.observation_space.low.size
    action_space = env.action_space

    n_actions = action_space.n

    class QFunction(chainer.Chain, StateQFunction,RecurrentChainMixin):

        def __init__(self, n_input_channels=3, n_actions = 4, bias=0.1):
            self.n_actions = n_actions
            self.n_input_channels = n_input_channels
            conv_layers = chainer.ChainList(
                L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
                L.Convolution2D(32, 64, 4, stride=2, bias=bias),
                L.Convolution2D(64, 64, 3, stride=1, bias=bias),
                L.Convolution2D(64, 128, 7, stride=1, bias=bias)
                )

            lstm_layer = L.LSTM(128, 128)                     

            a_stream = MLP(128,n_actions,[2])
            v_stream = MLP(128,1,[2])

            super().__init__(conv_layers=conv_layers, lstm_layer=lstm_layer, a_stream=a_stream,v_stream=v_stream)

        def __call__(self, x, test=False):
            """
            Args:
                x (ndarray or chainer.Variable): An observation
                test (bool): a flag indicating whether it is in test mode
            """
            h = x
            for l in self.conv_layers:
                h = F.relu(l(h))
            h = self.lstm_layer(h)

            batch_size = x.shape[0]
            ya = self.a_stream(h, test=test)
            mean = F.reshape(F.sum(ya,axis=1) / self.n_actions, (batch_size,1))
            ya, mean = F.broadcast(ya,mean)
            ya -= mean

            ys = self.v_stream(h,test=test)
            
            ya,ys = F.broadcast(ya,ys)
            q = ya+ys
            return chainerrl.action_value.DiscreteActionValue(q)

    
    explorer = explorers.LinearDecayEpsilonGreedy(
        args.start_epsilon, args.end_epsilon, args.final_exploration_steps,
        action_space.sample)
    q_func = QFunction(3, 4)

    opt = optimizers.Adam()
    opt.setup(q_func)

    rbuf_capacity = 100000
    if args.episodic_replay:
        print('episodic replay')
        if args.minibatch_size is None:
            args.minibatch_size = 4
        if args.replay_start_size is None:
            args.replay_start_size = 10
        if args.prioritized_replay:
            betasteps = \
                (args.steps - timestep_limit * args.replay_start_size) \
                // args.update_frequency
            rbuf = replay_buffer.PrioritizedEpisodicReplayBuffer(
                rbuf_capacity, betasteps=betasteps)
        else:
            rbuf = replay_buffer.EpisodicReplayBuffer(rbuf_capacity)
    else:
        if args.minibatch_size is None:
            args.minibatch_size = 32
        if args.replay_start_size is None:
            args.replay_start_size = 1000
        if args.prioritized_replay:
            betasteps = (args.steps - args.replay_start_size) \
                // args.update_frequency
            rbuf = replay_buffer.PrioritizedReplayBuffer(
                rbuf_capacity, betasteps=betasteps)
        else:
            rbuf = replay_buffer.ReplayBuffer(rbuf_capacity)

    def phi(obs):
        return (np.swapaxes(obs,0,2).astype(np.float32))/255.

    
    

    gym.undo_logger_setup()  # Turn off gym's default logger settings
    logging.basicConfig(level=logging.DEBUG, stream=sys.stdout, format='')

    agent = DoubleDQN(q_func, opt, rbuf, gpu=args.gpu, gamma=args.gamma,
                explorer=explorer, replay_start_size=args.replay_start_size,
                target_update_interval=args.target_update_frequency,
                update_interval=args.update_frequency,
                phi=phi, minibatch_size=args.minibatch_size,
                target_update_method=args.target_update_method,
                soft_update_tau=args.soft_update_tau,
                episodic_update=args.episodic_replay, episodic_update_len=None)

    if args.load:
        agent.load(args.load)

    eval_env = make_env(for_eval=True)

    if args.demo:
        mean, median, stdev = experiments.eval_performance(
            env=eval_env,
            agent=agent,
            n_runs=args.eval_n_runs,
            max_episode_len=50)
        print('n_runs: {} mean: {} median: {} stdev: {}'.format(
            args.eval_n_runs, mean, median, stdev))
    else:
        experiments.train_agent_with_evaluation(
            agent=agent, env=env, steps=args.steps,
            eval_n_runs=args.eval_n_runs, eval_interval=args.eval_frequency,
            outdir=args.outdir, eval_env=eval_env,
            max_episode_len=50)


if __name__ == '__main__':
    main()

muupan commented on August 24, 2024

@kfeeeeee Can you give me the complete code (including import statements) and the command line arguments?

kfeeeeee commented on August 24, 2024

@muupan Sure thing.

Gym-Environment:
gridworld.py

Train-Script:
train_dqn_gym.py

And command execution:
python train_dqn_gym.py --env 'Gridworld-v0'

For the rest, I use the default args defined in train_dqn_gym.py.

Thanks again for looking into this!

kfeeeeee commented on August 24, 2024

Thank you very much for your effort!
