
reinforcementzerotoall's Issues

DQN implementations should be updated

Summary

  • DQN implementations need to be updated/modified
  • I'm going to leave this issue as a future reference because I am not going to work on it now
  • Anyone is welcome to contribute

Problem

After the max step is set to 200, the following DQN implementations won't clear the CartPole-v0

  • 07_1_q_net_cartpole.py
  • 07_2_dqn_2013_cartpole.py
  • 07_3_dqn_2015_cartpole.py

[CartPole-v0 clear condition: average reward >= 195 over 100 games]

Possible fix

  • Currently the code runs 50 update iterations in a row, which can be a problem: the initial policy is bad, and the network ends up fitting to that bad policy.
if episode % 10 == 1:  # train every 10 
    # Get a random batch of experiences.
    for _ in range(50):
        minibatch = random.sample(replay_buffer, 10)
        loss, _ = ddqn_replay_train(mainDQN, targetDQN, minibatch)
  • Note that DQN is actually an online algorithm: in the original paper (Mnih et al., 2013), one update is performed per environment step (see the algorithm listing in the paper).
  • Add a target network, as introduced in the Nature paper (Mnih et al., 2015), though CartPole should also be solvable without one. A rough sketch of the per-step update schedule is below.
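For reference, here is a hedged sketch of the per-step update schedule described above. It is not the repo's code: build_dqn, train_on_batch, and sync_target are hypothetical placeholders for whatever network and update helpers (e.g. ddqn_replay_train) the scripts actually use, and the classic gym API (4-tuple step return) is assumed.

# Hedged sketch: one gradient update per environment step, plus periodic target sync.
# build_dqn, train_on_batch, and sync_target are hypothetical stubs, not repo functions.
import random
from collections import deque

import gym

REPLAY_SIZE = 50000
BATCH_SIZE = 32
TARGET_SYNC_EVERY = 500      # environment steps between target-network syncs
MAX_EPISODES = 1000

def build_dqn():
    """Placeholder for whatever DQN class/graph the repo builds."""
    return object()

def train_on_batch(main_dqn, target_dqn, minibatch):
    """Placeholder for one gradient step (e.g. the repo's ddqn_replay_train)."""
    return 0.0

def sync_target(main_dqn, target_dqn):
    """Placeholder for copying main-network weights into the target network."""

env = gym.make("CartPole-v0")
main_dqn, target_dqn = build_dqn(), build_dqn()
replay_buffer = deque(maxlen=REPLAY_SIZE)
global_step = 0

for episode in range(MAX_EPISODES):
    state, done = env.reset(), False
    while not done:
        action = env.action_space.sample()        # epsilon-greedy in practice
        next_state, reward, done, _ = env.step(action)
        replay_buffer.append((state, action, reward, next_state, done))
        state = next_state
        global_step += 1

        # Key change: one update per environment step,
        # instead of 50 updates every 10 episodes.
        if len(replay_buffer) >= BATCH_SIZE:
            minibatch = random.sample(replay_buffer, BATCH_SIZE)
            train_on_batch(main_dqn, target_dqn, minibatch)

        # Periodically refresh the target network.
        if global_step % TARGET_SYNC_EVERY == 0:
            sync_target(main_dqn, target_dqn)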

Question about 10_1_Actor_Critic.ipynb

In create_op() of the ActorCriticNetwork class, the loss is built from policy_gain, value_loss, and entropy.
Could you explain what theory this construction is based on, or which reference you used when writing it?
In particular, I don't really understand why each of those terms is computed the way it is.

Thank you.
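Not an official answer, but below is a minimal sketch of how these three terms are usually defined in A2C-style actor-critic methods. It is the standard formulation, not a quote from 10_1_Actor_Critic.ipynb, and logits, values, actions, and returns are assumed placeholder tensors with the shapes noted in the docstring.

import tensorflow as tf

def actor_critic_losses(logits, values, actions, returns, entropy_beta=0.01):
    """logits: [B, n_actions], values: [B], actions: [B] int32, returns: [B] float32."""
    # Advantage: how much better the sampled action did than the critic expected.
    # stop_gradient so the advantage trains only the actor, not the critic.
    advantages = tf.stop_gradient(returns - values)

    # policy_gain: log pi(a|s) weighted by the advantage (a quantity to maximize).
    log_probs = -tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)
    policy_gain = tf.reduce_mean(log_probs * advantages)

    # value_loss: squared error between the critic's estimate and the observed return.
    value_loss = tf.reduce_mean(tf.square(returns - values))

    # entropy: bonus that keeps the policy from collapsing to a deterministic one too early.
    probs = tf.nn.softmax(logits)
    entropy = -tf.reduce_mean(tf.reduce_sum(probs * tf.log(probs + 1e-8), axis=1))

    # Total objective to minimize: -policy_gain + value loss - entropy bonus.
    total_loss = -policy_gain + 0.5 * value_loss - entropy_beta * entropy
    return policy_gain, value_loss, entropy, total_loss

The usual reference for this combination of terms (policy gradient with an advantage baseline, a value-regression loss, and an entropy regularizer) is the A3C paper, Mnih et al., 2016.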

Question about the image reshape in 08_4_softmax_pg_pong.py

08_4_softmax_pg_pong.py contains the following:

X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.reshape(X, [-1, 80, 80, 4])

The array passed to this placeholder is produced by collecting four (80x80) images into shape (4, 6400) and then
flattening again to (25600,).
Reshaping that to (80, 80, 4) destroys the spatial structure of the images.

X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.transpose(tf.reshape(X, [-1, 4, 80, 80]),[0,2,3,1])

Shouldn't it be done like this?
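A quick NumPy check of the point above (my own illustration, not code from the repo), assuming the input was built by stacking four 80x80 frames and flattening:

import numpy as np

frames = np.random.rand(4, 80, 80).astype(np.float32)   # 4 stacked 80x80 frames
flat = frames.reshape(-1)                                # (25600,), as fed to the placeholder

naive = flat.reshape(80, 80, 4)                          # what tf.reshape(X, [-1, 80, 80, 4]) does
fixed = flat.reshape(4, 80, 80).transpose(1, 2, 0)       # reshape, then move frames to the last axis

print(np.allclose(naive[..., 0], frames[0]))             # False: channels are scrambled
print(np.allclose(fixed[..., 0], frames[0]))             # True: each channel is one original frame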

Bug?

In 10_1_Actor_Critic.ipynb, there is

policy_gain = tf.reduce_sum(policy_gain, name="policy_gain")

but it seems it should be changed to the following:

policy_gain = tf.reduce_mean(policy_gain, name="policy_gain")
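A small illustration of why the choice matters (my own example, not from the notebook): with reduce_sum the magnitude of the objective, and therefore of the gradient, grows with the number of transitions in the batch, so the effective learning rate changes with episode length; reduce_mean keeps it independent of batch size.

import numpy as np

per_step_gain = np.ones(64)       # pretend each of 64 transitions contributes 1.0
print(per_step_gain.sum())        # 64.0 -> scales with batch/episode length
print(per_step_gain.mean())       # 1.0  -> independent of batch size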

08_4_softmax_pg_pong_y.py ---> model restore BUG

agent = Agent(new_HW + [repeat], output_dim=action_dim, logdir='logdir/train',
                      checkpoint_dir="checkpoints")
init = tf.global_variables_initializer()
sess.run(init)

The model is restored while the Agent is being constructed, but the initialization is run after that, which overwrites the restored weights.
The order should be swapped:

init = tf.global_variables_initializer()
sess.run(init)
agent = Agent(new_HW + [repeat], output_dim=action_dim, logdir='logdir/train',
                      checkpoint_dir="checkpoints")

Need to Improve Discounted Reward

Issue

I noticed that the current discounted reward function does not sum up the future rewards.
I'm not sure whether this is intended, but even if it is, the policy gradient will not behave as intended, because it ends up focusing on learning the very first action of each episode.

Recall that the discounted return is

R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... = sum_{k=t}^{T} gamma^(k-t) * r_k

Example

  • r = [r0, r1, r2] = [1, 1, 1]
  • gamma = 0.99
  • expected discount r = [1 + 0.99 + 0.99 ** 2, 1 + 0.99, 1]
  • current discounted r implementation = [1, 0.99, 0.99**2] (see the quick check below)
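A quick numeric check of the example above (my own illustration, not repo code):

import numpy as np

r, gamma = np.array([1.0, 1.0, 1.0]), 0.99

current = np.array([v * gamma ** i for i, v in enumerate(r)])            # what discount_rewards computes
expected = np.array([sum(gamma ** (k - t) * r[k] for k in range(t, len(r)))
                     for t in range(len(r))])                            # discounted *returns*

print(current)    # [1.     0.99   0.9801]
print(expected)   # [2.9701 1.99   1.    ]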

Implementation in this repo

def discount_rewards(r, gamma=0.99):
    """Takes 1d float array of rewards and computes discounted reward
    e.g. f([1, 1, 1], 0.99) -> [1, 0.99, 0.9801] -> [1.22 -0.004 -1.22]
    """
    d_rewards = np.array([val * (gamma ** i) for i, val in enumerate(r)])

    # Normalize/standardize rewards
    d_rewards -= d_rewards.mean()
    d_rewards /= d_rewards.std()
    return d_rewards

Correct Implementation (from Karpathy's code)

def discount_correct_rewards(r, gamma=0.99):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r)
  running_add = 0
  for t in reversed(range(0, r.size)):
    #if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)
    running_add = running_add * gamma + r[t]
    discounted_r[t] = running_add

  discounted_r -= discounted_r.mean()
  discounted_r /= discounted_r.std()
  return discounted_r

With LaTeX

Therefore, the above function should change so that it computes the discounted return

R_t = r_t + gamma * R_{t+1}   (with R = 0 after the final step),

which is exactly what the backward loop in discount_correct_rewards above does.
