TOPIC

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
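The termination rule described above can be written as a small check. This is only a sketch: the function name and constants are made up for illustration, and it assumes the pole angle is given in radians, so it is converted to degrees before being compared with the 15 degree limit.

```python
import math

# Limits taken from the description above.
MAX_ANGLE_DEG = 15.0
MAX_CART_POSITION = 2.4


def episode_over(cart_position, pole_angle_rad):
    """Return True when the episode should end.

    Assumption: pole_angle_rad is the angle from vertical in radians.
    """
    pole_angle_deg = math.degrees(pole_angle_rad)
    too_tilted = abs(pole_angle_deg) > MAX_ANGLE_DEG
    too_far = abs(cart_position) > MAX_CART_POSITION
    return too_tilted or too_far
```

Since every surviving time step gives a reward of +1, the score of an episode is simply the number of steps taken before this check becomes true.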

Approaches

1. random movements

In this approach we choose a random action (left or right) for any given state of the environment. Needless to say, this approach performs very poorly because it does not take the present state into consideration.
Because of its random nature, this approach is quite unpredictable. Over 10 trial runs the maximum survival time was 118 time steps and the average survival time was about 21 time steps, which is pretty bad.
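A rough sketch of the random approach is shown here. It is not the code from this repository: `random_policy` and `run_episode` are illustrative names, and the loop assumes the classic Gym interface where `reset()` returns an observation and `step(action)` returns `(observation, reward, done, info)`. Actions are assumed to be 0 (left) and 1 (right).

```python
import random


def random_policy(observation):
    """Ignore the observation and pick 0 (left) or 1 (right) at random."""
    return random.choice([0, 1])


def run_episode(env, policy, max_steps=1000):
    """Run one episode and return the number of time steps survived.

    Assumption: env follows the classic Gym interface,
    reset() -> observation and
    step(action) -> (observation, reward, done, info).
    """
    observation = env.reset()
    for t in range(max_steps):
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        if done:
            return t + 1
    return max_steps
```

With an older Gym release this could be called as `run_episode(gym.make("CartPole-v0"), random_policy)`. Newer Gymnasium releases return extra values from `reset()` and `step()`, so the unpacking would need to be adjusted there.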

2. using weight vector

In this approach we take a random weight vector of size 4, which is equal to the dimension of the environment's state. We take the dot product of the weight vector and the state, and depending on the value of the output we choose an action, i.e. either left or right. This method outperforms the previous one even though it does not use any machine learning algorithm. The results of this approach are very impressive: given enough games played, it can last for more than 1000 time steps.
Over 10 trial runs of this algorithm the max score achieved was 762 and the avg score was about 315.
Note that these numbers can change from one trial run to the next, and even better results are possible with appropriate parameter tuning.
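The weight vector idea can be sketched as follows. Several details are assumptions, not taken from this repository: weights are drawn uniformly from [-1, 1], a positive dot product means "right" (1) and anything else means "left" (0), and the best vector is found by plain random search over a hypothetical `evaluate` function that plays a game with the given weights and returns its score.

```python
import random


def random_weights(n=4):
    """Draw a random weight vector, one weight per state dimension."""
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


def linear_action(weights, observation):
    """Dot product of weights and state, thresholded at zero."""
    score = sum(w * s for w, s in zip(weights, observation))
    return 1 if score > 0 else 0


def random_search(evaluate, n_trials=100):
    """Try n_trials random weight vectors and keep the best one.

    evaluate(weights) is a caller-supplied function (hypothetical here)
    that returns the score obtained with those weights.
    """
    best_weights, best_score = None, float("-inf")
    for _ in range(n_trials):
        weights = random_weights()
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```

Nothing is learned from the state transitions here; the search only keeps whichever random vector happened to score best, which is why the results vary between trial runs.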

3. using deep neural networks

In this approach we generate training data by taking random actions in the environment. If a run is successful, that is, the pole stays balanced on the cart for more than 100 time steps, we add it to our training set. The idea is to learn how to balance the pole from good training examples. We then fit the model to this training data and use it to predict the action for any new observation.
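The data collection step might look roughly like this. It is a sketch under assumptions: the function name and the `max_steps` cap are made up, the environment is assumed to follow the classic Gym interface (`reset()` returns an observation, `step(action)` returns `(observation, reward, done, info)`), and each training example is stored as an (observation, action) pair. The network itself is left out.

```python
import random


def collect_training_data(env, n_episodes, min_steps=100, max_steps=500):
    """Collect (observation, action) pairs from successful random episodes.

    An episode counts as successful when it lasts more than min_steps
    time steps. Assumption: env follows the classic Gym interface.
    """
    training_data = []
    for _ in range(n_episodes):
        observation = env.reset()
        episode = []
        for _ in range(max_steps):
            action = random.choice([0, 1])
            # Store the state we saw together with the action we took in it.
            episode.append((list(observation), action))
            observation, reward, done, info = env.step(action)
            if done:
                break
        if len(episode) > min_steps:
            training_data.extend(episode)
    return training_data
```

The observations then serve as inputs and the actions as labels when fitting the model. Because purely random play rarely survives long, many episodes may be needed before enough successful runs are collected.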

4. using deep Q networks

This approach uses a technique in which the model is rewarded if it takes a correct action given the observations of a state and is penalized otherwise. Initially the model will not be very good at predicting the output, but it slowly improves. Exploration and exploitation are carried out simultaneously: exploration looks for new, improved solutions, while exploitation finds a good solution in the search space explored so far.
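The two core pieces of this idea, the exploration/exploitation trade-off and the reward-driven target, can be sketched without any deep learning library. This assumes the usual epsilon-greedy rule and the standard one-step Q-learning target; the function names and the values of `gamma`, `decay` and `minimum` are illustrative and not taken from this repository.

```python
import random


def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit.

    q_values holds the model's predicted value for each action.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: random action
    # exploit: action with the highest predicted value
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, done, gamma=0.95):
    """One-step Q-learning target for the action that was taken."""
    if done:
        return reward  # no future reward after the episode ends
    return reward + gamma * max(next_q_values)


def decay_epsilon(epsilon, decay=0.995, minimum=0.01):
    """Shrink epsilon so the agent explores less as training goes on."""
    return max(minimum, epsilon * decay)
```

In a full DQN the network predicts `q_values` for a state, and its output for the chosen action is trained toward `q_target`. A high initial epsilon explains the poor early performance, since most early actions are random.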

comparison of how the model performs in the beginning and after a few epochs

We can see that initially the model was not able to perform very well, but eventually it learns from its mistakes and performs very well (1199 is the upper time limit; after this the game is forcibly closed). An even higher avg score can be achieved by training longer and increasing the time limit.

plot of score during various episodes

The pole was balanced on the cart for more than 2000 time steps, outperforming all the approaches used above.

references

Sentdex
Machine Learning with Phil
Medium blog

Link to other OpenAI-GYM Environments

mountain car
