reinforcementzerotoall's People
Forkers
zeran4 kimsungjin kkweon oppa3109 charlie13 wsjeon imcomking seanlee10 rubythonode jeilove 4skynet gm06041 mulan101 hal2001 danielwshim 404akhan tapattan lkh-1 markjingnb yungbyun niilante labyrins parksunwoo hanssoo shagru jaehyek picopoco qqiang00 yeolpyeong hjhjpark grayhong dhlee421 dkfromsd necronia siscos kyungkoo70 shirishgoyal osirisjs collector-m kgeneral meelement sungreong superpiggy again4you paulpaul91 shinjaehun ririro93 yongyongyooo yms9654 ryfan-rs lab930boss hccho2 dronerl2020 jinyeong resoliwan ehdgks0627 foranything lifegear yangwooseong nhatnguyen12 seedfac juny1905 ethan-cho jerry4897 liviust pb-pravin brunoson batermj bjo9280 moon0823 yuhyeonkim llejo3 milyangparkjaehun yskim525 learnaidrist w0lv3r1nix hishoss substage kevintrannz dsp6414 bilgehannevruz oceavos tungk noname2048 459548764 adldotori kyuhwas 8bitscoding hyungjunelee weehad jukyellow munki-chung harimyi harkyoo srshin jayjun911 ooksang seungminjang meng216073 munjuhyeok
reinforcementzerotoall's Issues
DQN implementations should be updated
Summary
- DQN implementations need to be updated/modified
- I'm going to leave this issue as a future reference because I am not going to work on it now
- Anyone is welcome to contribute
Problem
After the max step was set to 200, the following DQN implementations no longer clear CartPole-v0:
- 07_1_q_net_cartpole.py
- 07_2_dqn_2013_cartpole.py
- 07_3_dqn_2015_cartpole.py
[CartPole-v0 clear condition: average reward >= 195 over 100 games]
Possible fix
- Currently the code runs 50 update iterations every 10 episodes. This can be a problem because the initial policy is bad, and the network ends up fitting to that bad policy.
if episode % 10 == 1:  # train every 10 episodes
    # Get a random batch of experiences.
    for _ in range(50):
        minibatch = random.sample(replay_buffer, 10)
        loss, _ = ddqn_replay_train(mainDQN, targetDQN, minibatch)
- Note that DQN is actually an online algorithm: the original paper (Mnih et al., 2013) performs one update per environment step.
- Add a target network, as introduced in the follow-up Nature paper (Mnih et al., 2015), though the 2013 variant should also work without one. A sketch combining both fixes follows.
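A minimal sketch of a per-step training loop with a periodic target-network sync. It reuses ddqn_replay_train, mainDQN, targetDQN, and replay_buffer from the snippet above and assumes the surrounding script's env, sess, and max_episodes; select_action, copy_ops, BATCH_SIZE, and TARGET_UPDATE_FREQ are illustrative names, not the repo's actual API.

import random

BATCH_SIZE = 64            # illustrative
TARGET_UPDATE_FREQ = 100   # illustrative; sync target <- main every N steps
step_count = 0

for episode in range(max_episodes):
    state = env.reset()
    done = False
    while not done:
        action = select_action(mainDQN, state)            # e.g. epsilon-greedy
        next_state, reward, done, _ = env.step(action)
        replay_buffer.append((state, action, reward, next_state, done))

        # One gradient update per environment step, as in Mnih et al. (2013).
        if len(replay_buffer) > BATCH_SIZE:
            minibatch = random.sample(replay_buffer, BATCH_SIZE)
            loss, _ = ddqn_replay_train(mainDQN, targetDQN, minibatch)

        # Periodically copy the main network's weights into the target network.
        step_count += 1
        if step_count % TARGET_UPDATE_FREQ == 0:
            sess.run(copy_ops)

        state = next_state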
Question about 10_1_Actor_Critic.ipynb
In create_op() of the ActorCriticNetwork class, the loss is built from policy_gain, value_loss, and entropy. Could you explain what theory this construction is based on, or which reference you used to write it? (In particular, I don't quite understand why each term is computed the way it is.)
Thank you.
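For reference, these three terms typically follow the standard advantage actor-critic objective (e.g. Mnih et al., 2016). A minimal sketch under that assumption; all tensor names here are illustrative, not the notebook's actual variables.

import tensorflow as tf

n_actions = 2
states = tf.placeholder(tf.float32, [None, 4], name="states")
actions = tf.placeholder(tf.int32, [None], name="actions")
returns = tf.placeholder(tf.float32, [None], name="returns")  # discounted returns

logits = tf.layers.dense(states, n_actions)              # policy head
values = tf.squeeze(tf.layers.dense(states, 1), axis=1)  # value head V(s)

# policy_gain: log pi(a|s) weighted by the advantage (stop_gradient keeps the
# policy term from pushing gradients into the value head).
advantage = returns - tf.stop_gradient(values)
log_prob = -tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=actions, logits=logits)
policy_gain = tf.reduce_mean(log_prob * advantage)

# value_loss: regression of V(s) toward the observed returns.
value_loss = tf.reduce_mean(tf.square(returns - values))

# entropy: bonus that discourages premature collapse to a deterministic policy.
probs = tf.nn.softmax(logits)
entropy = -tf.reduce_mean(tf.reduce_sum(probs * tf.log(probs + 1e-8), axis=1))

# Maximize policy_gain and entropy, minimize value_loss (coefficients are typical).
loss = -policy_gain + 0.5 * value_loss - 0.01 * entropy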
the PowerPoint
Question about the image reshape in 08_4_softmax_pg_pong.py
08_4_softmax_pg_pong.py contains the following:
X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.reshape(X, [-1, 80, 80, 4])
The array passed into this placeholder was built by stacking four (80x80) images into shape (4, 6400) and then flattening the result into shape (25600,).
Reshaping that to (80, 80, 4) destroys the spatial structure of the images.
X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.transpose(tf.reshape(X, [-1, 4, 80, 80]),[0,2,3,1])
Shouldn't it be done like this instead?
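A quick NumPy check of the two reshapes, using toy 2x2 frames instead of 80x80 (a sketch to illustrate the point, not code from the repo):

import numpy as np

frames = np.arange(4 * 2 * 2).reshape(4, 2, 2)  # 4 stacked "frames", each 2x2
flat = frames.reshape(-1)                       # what the placeholder receives

wrong = flat.reshape(-1, 2, 2, 4)               # interleaves pixels across frames
right = flat.reshape(-1, 4, 2, 2).transpose(0, 2, 3, 1)  # recovers each frame

assert (right[0, :, :, 0] == frames[0]).all()      # channel 0 is frame 0
assert not (wrong[0, :, :, 0] == frames[0]).all()  # spatial structure is broken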
Bug?
10_1_Actor_Critic.ipynb contains
policy_gain = tf.reduce_sum(policy_gain, name="policy_gain")
but it looks like it should be changed as below, so that the magnitude of the policy term does not scale with the batch size:
policy_gain = tf.reduce_mean(policy_gain, name="policy_gain")
08_4_softmax_pg_pong_y.py ---> model restore BUG
agent = Agent(new_HW + [repeat], output_dim=action_dim, logdir='logdir/train',
              checkpoint_dir="checkpoints")
init = tf.global_variables_initializer()
sess.run(init)
The model is restored while the Agent is being constructed, but the initialization happens below it. The order needs to be swapped:
init = tf.global_variables_initializer()
sess.run(init)
agent = Agent(new_HW + [repeat], output_dim=action_dim, logdir='logdir/train',
              checkpoint_dir="checkpoints")
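The general pattern, as a minimal TF1-style sketch separate from the repo's Agent class: the initializer must run before restoring, because whichever runs last wins, and an init after a restore wipes out the checkpoint weights.

import tensorflow as tf

w = tf.get_variable("w", shape=[1])
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())       # 1) initialize every variable
    ckpt = tf.train.latest_checkpoint("checkpoints")  # 2) then restore, overwriting
    if ckpt is not None:                              #    the freshly initialized values
        saver.restore(sess, ckpt)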
Need to Improve Discounted Reward
Issue
I noticed that the current discounted-reward function does not sum up the future rewards.
I'm not sure whether this is intended, but even if it is, the policy gradient will not behave as intended, because it will focus on learning only the very first action of each episode.
Recall that the discounted return is $G_t = \sum_{k=0}^{T-t} \gamma^{k} r_{t+k}$.
Example
- r = [r0, r1, r2] = [1, 1, 1]
- gamma = 0.99
- expected discounted r = [1 + 0.99 + 0.99**2, 1 + 0.99, 1] = [2.9701, 1.99, 1]
- current discounted r implementation = [1, 0.99, 0.99**2] (before normalization)
Implementation in this repo
def discount_rewards(r, gamma=0.99):
    """Takes 1d float array of rewards and computes discounted reward
    e.g. f([1, 1, 1], 0.99) -> [1, 0.99, 0.9801] -> [1.22 -0.004 -1.22]
    """
    d_rewards = np.array([val * (gamma ** i) for i, val in enumerate(r)])
    # Normalize/standardize rewards
    d_rewards -= d_rewards.mean()
    d_rewards /= d_rewards.std()
    return d_rewards
Correct Implementation (from Karpathy's code)
def discount_correct_rewards(r, gamma=0.99):
    """Take 1D float array of rewards and compute discounted reward."""
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(0, r.size)):
        # if r[t] != 0: running_add = 0  # reset the sum at a game boundary (Pong-specific!)
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    discounted_r -= discounted_r.mean()
    discounted_r /= discounted_r.std()
    return discounted_r
With LaTeX: the current implementation computes $\gamma^{t} r_t$ for each step $t$, whereas it should compute the sum of discounted future rewards, $G_t = \sum_{k=0}^{T-t} \gamma^{k} r_{t+k}$, as the corrected function above does.
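A quick check of the example above, with the normalization stripped out so the raw discounted sums are visible:

import numpy as np

r = np.array([1.0, 1.0, 1.0])
gamma = 0.99

current = np.array([val * gamma**i for i, val in enumerate(r)])
print(current)    # [1.     0.99   0.9801]  <- just gamma^t * r_t

expected = np.zeros_like(r)
running_add = 0.0
for t in reversed(range(r.size)):
    running_add = running_add * gamma + r[t]  # accumulate future rewards
    expected[t] = running_add
print(expected)   # [2.9701 1.99   1.    ]   <- matches the hand computation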