
reinforcementzerotoall's Issues

DQN implementations should be updated

Summary

  • DQN implementations need to be updated/modified
  • I'm going to leave this issue as a future reference because I am not going to work on it now
  • Anyone is welcome to contribute

Problem

After the max step is set to 200, the following DQN implementations won't clear the CartPole-v0

  • 07_1_q_net_cartpole.py
  • 07_2_dqn_2013_cartpole.py
  • 07_3_dqn_2015_cartpole.py

[CartPole-v0 clear condition: average reward >= 195 over 100 games]

Possible fix

  • Currently the code runs 50 update iterations in a row, which can be a problem: the initial policy is bad, and the network ends up fitting to that bad policy.
if episode % 10 == 1:  # train every 10 
    # Get a random batch of experiences.
    for _ in range(50):
        minibatch = random.sample(replay_buffer, 10)
        loss, _ = ddqn_replay_train(mainDQN, targetDQN, minibatch)
  • Note that DQN is actually an online algorithm: in the original paper (Mnih et al., 2013), one update is performed per environment step (see the algorithm listing in the paper).
  • Add a target network, as introduced in the Nature paper (Mnih et al., 2015), though CartPole should also be solvable without one. A rough sketch of the per-step update schedule is below.
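For reference, here is a hedged sketch of the per-step update schedule described above. It is not the repo's code: build_dqn, train_on_batch, and sync_target are hypothetical placeholders for whatever network and update helpers (e.g. ddqn_replay_train) the scripts actually use, and the classic gym API (4-tuple step return) is assumed.

# Hedged sketch: one gradient update per environment step, plus periodic target sync.
# build_dqn, train_on_batch, and sync_target are hypothetical stubs, not repo functions.
import random
from collections import deque

import gym

REPLAY_SIZE = 50000
BATCH_SIZE = 32
TARGET_SYNC_EVERY = 500      # environment steps between target-network syncs
MAX_EPISODES = 1000

def build_dqn():
    """Placeholder for whatever DQN class/graph the repo builds."""
    return object()

def train_on_batch(main_dqn, target_dqn, minibatch):
    """Placeholder for one gradient step (e.g. the repo's ddqn_replay_train)."""
    return 0.0

def sync_target(main_dqn, target_dqn):
    """Placeholder for copying main-network weights into the target network."""

env = gym.make("CartPole-v0")
main_dqn, target_dqn = build_dqn(), build_dqn()
replay_buffer = deque(maxlen=REPLAY_SIZE)
global_step = 0

for episode in range(MAX_EPISODES):
    state, done = env.reset(), False
    while not done:
        action = env.action_space.sample()        # epsilon-greedy in practice
        next_state, reward, done, _ = env.step(action)
        replay_buffer.append((state, action, reward, next_state, done))
        state = next_state
        global_step += 1

        # Key change: one update per environment step,
        # instead of 50 updates every 10 episodes.
        if len(replay_buffer) >= BATCH_SIZE:
            minibatch = random.sample(replay_buffer, BATCH_SIZE)
            train_on_batch(main_dqn, target_dqn, minibatch)

        # Periodically refresh the target network.
        if global_step % TARGET_SYNC_EVERY == 0:
            sync_target(main_dqn, target_dqn)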

Question about 10_1_Actor_Critic.ipynb

In create_op() of the ActorCriticNetwork class, the loss is built from policy_gain, value_loss, and entropy.
Could you explain what theory this construction is based on, or which reference you used when writing it?
In particular, I don't really understand why each of those terms is computed the way it is.

Thank you.
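Not an official answer, but below is a minimal sketch of how these three terms are usually defined in A2C-style actor-critic methods. It is the standard formulation, not a quote from 10_1_Actor_Critic.ipynb, and logits, values, actions, and returns are assumed placeholder tensors with the shapes noted in the docstring.

import tensorflow as tf

def actor_critic_losses(logits, values, actions, returns, entropy_beta=0.01):
    """logits: [B, n_actions], values: [B], actions: [B] int32, returns: [B] float32."""
    # Advantage: how much better the sampled action did than the critic expected.
    # stop_gradient so the advantage trains only the actor, not the critic.
    advantages = tf.stop_gradient(returns - values)

    # policy_gain: log pi(a|s) weighted by the advantage (a quantity to maximize).
    log_probs = -tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)
    policy_gain = tf.reduce_mean(log_probs * advantages)

    # value_loss: squared error between the critic's estimate and the observed return.
    value_loss = tf.reduce_mean(tf.square(returns - values))

    # entropy: bonus that keeps the policy from collapsing to a deterministic one too early.
    probs = tf.nn.softmax(logits)
    entropy = -tf.reduce_mean(tf.reduce_sum(probs * tf.log(probs + 1e-8), axis=1))

    # Total objective to minimize: -policy_gain + value loss - entropy bonus.
    total_loss = -policy_gain + 0.5 * value_loss - entropy_beta * entropy
    return policy_gain, value_loss, entropy, total_loss

The usual reference for this combination of terms (policy gradient with an advantage baseline, a value-regression loss, and an entropy regularizer) is the A3C paper, Mnih et al., 2016.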

Question about the image reshape in 08_4_softmax_pg_pong.py

08_4_softmax_pg_pong.py contains the following:

X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.reshape(X, [-1, 80, 80, 4])

The array passed to this placeholder is produced by collecting four (80x80) images into shape (4, 6400) and then
flattening again to (25600,).
Reshaping that to (80, 80, 4) destroys the spatial structure of the images.

X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.transpose(tf.reshape(X, [-1, 4, 80, 80]),[0,2,3,1])

Shouldn't it be done like this?
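A quick NumPy check of the point above (my own illustration, not code from the repo), assuming the input was built by stacking four 80x80 frames and flattening:

import numpy as np

frames = np.random.rand(4, 80, 80).astype(np.float32)   # 4 stacked 80x80 frames
flat = frames.reshape(-1)                                # (25600,), as fed to the placeholder

naive = flat.reshape(80, 80, 4)                          # what tf.reshape(X, [-1, 80, 80, 4]) does
fixed = flat.reshape(4, 80, 80).transpose(1, 2, 0)       # reshape, then move frames to the last axis

print(np.allclose(naive[..., 0], frames[0]))             # False: channels are scrambled
print(np.allclose(fixed[..., 0], frames[0]))             # True: each channel is one original frame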

Bug?

In 10_1_Actor_Critic.ipynb, there is

policy_gain = tf.reduce_sum(policy_gain, name="policy_gain")

but it seems it should be changed to the following:

policy_gain = tf.reduce_mean(policy_gain, name="policy_gain")
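A small illustration of why the choice matters (my own example, not from the notebook): with reduce_sum the magnitude of the objective, and therefore of the gradient, grows with the number of transitions in the batch, so the effective learning rate changes with episode length; reduce_mean keeps it independent of batch size.

import numpy as np

per_step_gain = np.ones(64)       # pretend each of 64 transitions contributes 1.0
print(per_step_gain.sum())        # 64.0 -> scales with batch/episode length
print(per_step_gain.mean())       # 1.0  -> independent of batch size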

08_4_softmax_pg_pong_y.py ---> model restore BUG

agent = Agent(new_HW + [repeat], output_dim=action_dim, logdir='logdir/train',
                      checkpoint_dir="checkpoints")
init = tf.global_variables_initializer()
sess.run(init)

The model is restored while the Agent is being constructed, but the initialization is run after that, which overwrites the restored weights.
The order should be swapped:

init = tf.global_variables_initializer()
sess.run(init)
agent = Agent(new_HW + [repeat], output_dim=action_dim, logdir='logdir/train',
                      checkpoint_dir="checkpoints")

Need to Improve Discounted Reward

Issue

I noticed that the current discounted reward function does not sum up the future rewards.
I'm not sure whether this is intended, but even if it is, the policy gradient will not behave as intended, because it ends up focusing on learning the very first action of each episode.

Recall that the discounted return is

R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... = sum_{k=t}^{T} gamma^(k-t) * r_k

Example

  • r = [r0, r1, r2] = [1, 1, 1]
  • gamma = 0.99
  • expected discount r = [1 + 0.99 + 0.99 ** 2, 1 + 0.99, 1]
  • current discounted r implementation = [1, 0.99, 0.99**2] (see the quick check below)
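A quick numeric check of the example above (my own illustration, not repo code):

import numpy as np

r, gamma = np.array([1.0, 1.0, 1.0]), 0.99

current = np.array([v * gamma ** i for i, v in enumerate(r)])            # what discount_rewards computes
expected = np.array([sum(gamma ** (k - t) * r[k] for k in range(t, len(r)))
                     for t in range(len(r))])                            # discounted *returns*

print(current)    # [1.     0.99   0.9801]
print(expected)   # [2.9701 1.99   1.    ]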

Implementation in this repo

def discount_rewards(r, gamma=0.99):
    """Takes 1d float array of rewards and computes discounted reward
    e.g. f([1, 1, 1], 0.99) -> [1, 0.99, 0.9801] -> [1.22 -0.004 -1.22]
    """
    d_rewards = np.array([val * (gamma ** i) for i, val in enumerate(r)])

    # Normalize/standardize rewards
    d_rewards -= d_rewards.mean()
    d_rewards /= d_rewards.std()
    return d_rewards

Correct Implementation (from Karpathy's code)

def discount_correct_rewards(r, gamma=0.99):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r)
  running_add = 0
  for t in reversed(range(0, r.size)):
    #if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)
    running_add = running_add * gamma + r[t]
    discounted_r[t] = running_add

  discounted_r -= discounted_r.mean()
  discounted_r /= discounted_r.std()
  return discounted_r

With LaTeX

Therefore, the above function should change so that it computes the discounted return

R_t = r_t + gamma * R_{t+1}   (with R = 0 after the final step),

which is exactly what the backward loop in discount_correct_rewards above does.
