
colinskow / move37

179 stars, 15 watchers, 115 forks, 61 KB

Coding Demos from the School of AI's Move37 Course

Python 100.00%
reinforcement-learning dynamic-programming markov-decision-processes

move37's Introduction

Move37 Coding Exercises

From the School of AI's free Move37 Reinforcement Learning Course.

move37's People

Contributors

colinskow


move37's Issues

ARS agent's score is poor when exploring with saved weights

Hey @colinskow, I have implemented ars.py for the bipedal walker problem. The score at 1500 iterations is around 330.
At each step of the training loop we evaluate once using the code below:

            # Play an episode with the new weights and print the score
            reward_evaluation = self.explore()

Now I have saved theta at 1500 iterations, along with all the other parameters.
Next I initialized theta with this pretrained value while creating an instance of the Policy() class and explored 10 times, but the score is around 6.23, not even close to 330.
Can you tell me why this is happening?

Each call to explore() does self.env.reset(), which just restarts the environment, so why is the reward from explore() so different when it is called inside the training loop versus when I call it manually?

Let me know if my query is not clear, thanks.
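One common cause of this in ARS implementations (I have not verified it is what is happening in ars.py) is that the observation normalizer's running mean and variance are part of the learned policy: theta was trained on normalized observations, so evaluating it with a freshly initialized normalizer feeds it inputs on a different scale. A minimal sketch of saving and restoring both together, where Normalizer, save_policy, and load_policy are illustrative names rather than code from the repo:

```python
import numpy as np

class Normalizer:
    """Online observation normalizer (Welford-style running mean/variance)."""
    def __init__(self, num_inputs):
        self.n = 0
        self.mean = np.zeros(num_inputs)
        self.mean_diff = np.zeros(num_inputs)

    def observe(self, x):
        self.n += 1
        last_mean = self.mean.copy()
        self.mean += (x - self.mean) / self.n
        self.mean_diff += (x - last_mean) * (x - self.mean)

    def normalize(self, x):
        var = self.mean_diff / self.n if self.n > 1 else np.ones_like(self.mean)
        return (x - self.mean) / np.sqrt(np.clip(var, 1e-8, None))

def save_policy(path, theta, normalizer):
    # Persist theta AND the normalizer's running statistics together.
    np.savez(path, theta=theta, n=normalizer.n,
             mean=normalizer.mean, mean_diff=normalizer.mean_diff)

def load_policy(path, num_inputs):
    # Restore both, so evaluation sees observations on the training scale.
    data = np.load(path)
    normalizer = Normalizer(num_inputs)
    normalizer.n = int(data["n"])
    normalizer.mean = data["mean"]
    normalizer.mean_diff = data["mean_diff"]
    return data["theta"], normalizer
```

If only theta is restored, the first few hundred evaluation steps run against an untrained normalizer, which alone can explain a drop from ~330 to single digits.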

Bugs in code

Colin,

Please check your code for ars.py. There seem to be some bugs:

Line 90: np.random.seed(self.hp.seed)

Lines 156-162: they are indented too much; some lines will need rework as a result of fixing that.
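For anyone else reading: over-indentation in Python silently moves statements into an enclosing block, which changes behavior rather than raising an error. A generic illustration of the failure mode (not the actual ars.py code):

```python
def sum_with_report_buggy(values):
    total = 0
    reports = []
    for v in values:
        total += v
        # Over-indented: this runs on EVERY loop iteration...
        reports.append(total)
    return total, reports

def sum_with_report_fixed(values):
    total = 0
    reports = []
    for v in values:
        total += v
    # ...whereas at this indent level it runs once, after the loop.
    reports.append(total)
    return total, reports
```

This is why fixing the indentation on lines 156-162 may require reworking the surrounding lines: code that was accidentally running every iteration may have been compensating elsewhere.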

convergence of PPO

Hi @colinskow
I am running your implementation of PPO. Below is the output I received. There are episodes with very good rewards, but after those episodes the agent goes back to performing poorly for quite a long time. Can you explain this? Why does the network perform well and then regress? For example, the best reward is updated to 1949, after which the reward still drops as low as 107.

Best reward updated: 1948.110 -> 1949.720
Frame 1633280. reward: 1075.2007614799948
Frame 1635840. reward: 1670.997139271826
Frame 1638400. reward: 1018.4468468976963
Frame 1640960. reward: 1420.2155009716134
Frame 1643520. reward: 1588.7992440253583
Frame 1646080. reward: 1753.0279339305634
Frame 1648640. reward: 1860.2275200387144
Frame 1651200. reward: 1767.266627974523
Frame 1653760. reward: 1481.847750864609
Frame 1656320. reward: 1582.9256486460663
Frame 1658880. reward: 1542.20711622449
Frame 1661440. reward: 1493.876897049993
Frame 1664000. reward: 1152.1284198423598
Frame 1666560. reward: 751.9248701423015
Frame 1669120. reward: 905.2771644728124
Frame 1671680. reward: 560.908236078274
Frame 1674240. reward: 1082.013090584239
Frame 1676800. reward: 1046.646806467438
Frame 1679360. reward: 876.3824661804499
Frame 1681920. reward: 1062.4438298249572
Frame 1684480. reward: 1088.2959040471396
Frame 1687040. reward: 707.492522855422
Frame 1689600. reward: 1030.025091353886
Frame 1692160. reward: 1282.4198527064523
Frame 1694720. reward: 1229.895262817485
Frame 1697280. reward: 988.6682842766364
Frame 1699840. reward: 1195.186973661666
Frame 1702400. reward: 1324.756614240514
Frame 1704960. reward: 1453.9949632353496
Frame 1707520. reward: 1335.0087984155216
Frame 1710080. reward: 953.082310479385
Frame 1712640. reward: 810.0918137141756
Frame 1715200. reward: 1122.4453270261279
Frame 1717760. reward: 950.7951475575255
Frame 1720320. reward: 1104.2244063831204
Frame 1722880. reward: 1012.5089169708348
Frame 1725440. reward: 1098.1972955642696
Frame 1728000. reward: 1234.9955152755679
Frame 1730560. reward: 1009.9835758461932
Frame 1733120. reward: 767.4947807888421
Frame 1735680. reward: 785.9173723735277
Frame 1738240. reward: 761.1041693491346
Frame 1740800. reward: 968.5392057393674
Frame 1743360. reward: 900.4641580833337
Frame 1745920. reward: 882.700614149994
Frame 1748480. reward: 706.6923274090474
Frame 1751040. reward: 939.397385555623
Frame 1753600. reward: 665.0867534186814
Frame 1756160. reward: 804.7112795027496
Frame 1758720. reward: 891.6484742302937
Frame 1761280. reward: 967.5101638971204
Frame 1763840. reward: 744.7142625163355
Frame 1766400. reward: 749.4700675517472
Frame 1768960. reward: 506.6182054831338
Frame 1771520. reward: 1014.2107352288695
Frame 1774080. reward: 809.1423389541403
Frame 1776640. reward: 876.0433867101992
Frame 1779200. reward: 813.8414440885408
Frame 1781760. reward: 515.4108240397534
Frame 1784320. reward: 651.6691073419961
Frame 1786880. reward: 731.5255149421589
Frame 1789440. reward: 320.39684629005984
Frame 1792000. reward: 375.93090401349946
Frame 1794560. reward: 457.54616876119127
Frame 1797120. reward: 587.8317794052147
Frame 1799680. reward: 712.8826882744027
Frame 1802240. reward: 464.1343310987787
Frame 1804800. reward: 872.8156115813341
Frame 1807360. reward: 481.3579957367894
Frame 1809920. reward: 161.0999360510515
Frame 1812480. reward: 477.2615145556106
Frame 1815040. reward: 421.3701411550719
Frame 1817600. reward: 414.8453931932833
Frame 1820160. reward: 107.71140715633244
Frame 1822720. reward: 266.0511582891612

Edit: these are the current rewards, with the best reward still at 1949:
Frame 6200320. reward: 17.11311706593478
Frame 6202880. reward: 23.981910791527774
Frame 6205440. reward: 23.762318308465773
Frame 6208000. reward: 41.586614347813445
Frame 6210560. reward: 31.07887714373325
Frame 6213120. reward: 62.13878456088249
Frame 6215680. reward: 17.634756136540812
Frame 6218240. reward: 37.18480424747613
Frame 6220800. reward: 15.499178082800498
Frame 6223360. reward: 30.817446702089804
Frame 6225920. reward: 19.912517506566367
Frame 6228480. reward: 16.267924045076658
Frame 6231040. reward: 31.301011149751425
Frame 6233600. reward: 46.27317145211332
Frame 6236160. reward: 45.98502002805296
Frame 6238720. reward: 59.11215199097131
Frame 6241280. reward: 25.44306679687535
Frame 6243840. reward: 16.138228368208512
Frame 6246400. reward: 29.348333472325617
Frame 6248960. reward: 28.718363679404575
Frame 6251520. reward: 49.57073109564953
Frame 6254080. reward: 33.13798087679316
Frame 6256640. reward: 42.5639778683189
Frame 6259200. reward: 12.61570759066788
Frame 6261760. reward: 40.1821534914928
Frame 6264320. reward: 36.18557595023129
Frame 6266880. reward: 14.499046552244106
Frame 6269440. reward: 14.631654686564143
Frame 6272000. reward: 16.704796792095586
Frame 6274560. reward: 63.7560888899812
Frame 6277120. reward: 16.32953072405434
Frame 6279680. reward: 64.83239870415508
Frame 6282240. reward: 27.41974425498833
Frame 6284800. reward: 23.451732698694975
Frame 6287360. reward: 26.787834144481895
Frame 6289920. reward: 46.04774589335807
Frame 6292480. reward: 35.86749520874131
Frame 6295040. reward: 20.264629524957854
Frame 6297600. reward: 29.750538883366396
Frame 6300160. reward: 17.3407210517615
Frame 6302720. reward: 31.23824488422949
Frame 6305280. reward: 19.869568862758605
Frame 6307840. reward: 16.27222037414918
Frame 6310400. reward: 25.12550263318975
Frame 6312960. reward: 17.435229246157828
Frame 6315520. reward: 20.641396756504403
Frame 6318080. reward: 14.340213175797981
Frame 6320640. reward: 28.532613680261427
Frame 6323200. reward: 28.2661781896805
Frame 6325760. reward: 31.43462487334007
Frame 6328320. reward: 18.87424070271256
Frame 6330880. reward: 36.18999628824575
Frame 6333440. reward: 39.74114942337283

Your code's outputs are still pretty good. I have implemented my own PPO on my own environment, and it behaves differently: it sees some good episodes and then goes back to performing badly, as if it is not learning anything. What can be done here? For example, how can I confirm that the reward function is correct, or that the implementation is OK?
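Regression like the logs above is common with on-policy methods: a single bad update can move the policy away from a good region, and without a saved snapshot the best behavior is lost. One practical step (a generic sketch, not code from this repo) is to track a moving average of episode rewards and snapshot the parameters whenever the smoothed average hits a new best, so a later collapse can be compared against, or rolled back to, the best policy seen so far:

```python
import copy
from collections import deque

class BestTracker:
    """Track a moving average of episode rewards and snapshot the best params."""
    def __init__(self, window=20):
        self.rewards = deque(maxlen=window)
        self.best_avg = float("-inf")
        self.best_params = None

    def update(self, episode_reward, params):
        self.rewards.append(episode_reward)
        avg = sum(self.rewards) / len(self.rewards)
        # Only snapshot once the window is full and the SMOOTHED average improves.
        if len(self.rewards) == self.rewards.maxlen and avg > self.best_avg:
            self.best_avg = avg
            # Deep-copy so later training updates don't mutate the snapshot.
            self.best_params = copy.deepcopy(params)
        return avg
```

Judging "best" on the smoothed average rather than a single lucky episode (like the 1949 above) avoids keeping a checkpoint that does not reflect the policy's typical performance. As for checking a reward function: a cheap sanity test is to score a random policy and a hand-coded obviously-good policy on your environment and confirm their rewards are ordered as you expect.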
