
Add a more elaborate example about pilco (5 comments, closed)

nrontsis commented on May 29, 2024
Add a more elaborate example

from pilco.

Comments (5)

kyr-pol commented on May 29, 2024

So I did some experimenting with different gym environments, mostly the continuous mountain car ('MountainCarContinuous-v0'). I created a new branch, 'more_envs', to work on this and other environments.

  1. Increasing the car's power, allowing it to climb the hill without momentum: this works, but it makes the task trivial.

  2. Subsampling and longer horizons. To subsample from the environment, each action is repeated a number of times and the intermediate states are discarded. I experimented with rates of 1 (no subsampling), 2, 4 and 5.

  3. Different controller initialisation. Experimented with larger and smaller variances in the initialisation of the controllers, without significant effect.

  4. Restarting the model. Retraining the model and the controller from scratch, but keeping the data collected so far, to give the optimisation process a fresh start. No significant improvement, at least with the few restarts (fewer than 5) I tried.

  5. Tuning the weight matrix of the reward function. The values of the 'W' matrix in the reward function control how steeply the reward decays as the state moves away from the goal in each dimension (smaller weights, slower reward decay).

  6. Fixing the model uncertainty (kernel variance). I tested fixed values of 1, 0.1 and 0.02.

  7. RBF controllers with more basis functions, hoping that in a higher-dimensional space the optimisation will have fewer issues with local minima. I tried 10, 15 and 20.
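The subsampling in item 2 can be sketched as follows. This is a minimal sketch with a toy stand-in for the gym environment: `env_step`, `policy` and the linear dynamics below are hypothetical, not code from the repo.

```python
def rollout_subsampled(env_step, state, policy, horizon, subs):
    """Collect (state, action, next_state) triples, repeating each action
    `subs` times and discarding the intermediate states."""
    data = []
    for _ in range(horizon):
        action = policy(state)
        s = state
        for _ in range(subs):        # repeat the same action `subs` times
            s = env_step(s, action)  # intermediate states are thrown away
        data.append((state, action, s))
        state = s
    return data

# Toy 1-D linear dynamics as a stand-in environment:
step = lambda s, a: s + 0.1 * a
traj = rollout_subsampled(step, 0.0, lambda s: 1.0, horizon=5, subs=4)
```

The effective episode is `horizon` model steps long, even though the underlying environment took `horizon * subs` steps, which is what makes the longer-horizon experiments tractable.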

In most cases the algorithm gets stuck on the greedy behaviour of pushing right as hard as possible (action = 1 throughout the episode), or on some other more or less constant action that pushes to the right (0 < action < 1).
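For reference, the reward shaping in item 5 can be sketched in one dimension; this is a minimal sketch where `w` stands in for a diagonal entry of the W matrix, and the exact form used in the repo may differ.

```python
import math

def saturating_reward(x, target, w):
    """Exponential reward: 1 at the goal, decaying as x moves away.
    Smaller w -> slower decay, i.e. a flatter, more forgiving reward."""
    return math.exp(-0.5 * w * (x - target) ** 2)

# Mountain-car-style numbers: goal at +0.45, start near -0.5.
flat = saturating_reward(-0.5, 0.45, 1.0)    # still informative far away
steep = saturating_reward(-0.5, 0.45, 16.0)  # nearly zero far from goal
```

With a very steep W the reward is almost zero everywhere except near the goal, which gives the optimiser little signal to follow from the start state.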

Possible additions:
Diagnostics: we could add plots comparing the predicted trajectory, given an initial state and a controller, to the actual trajectory when the policy is run, or the predicted reward to the actual reward. I don't think the model is the issue in this case, but it'd be good to know for sure, and this is functionality that the original implementation has.

Restarts: the optimisation is local, so it might benefit from random restarts, for example when no progress has been made for several consecutive steps (this changes the algorithm though, since the original PILCO has no such restarts, even though similar works in the literature do).
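The restart idea above can be sketched as follows; this is assumed wiring, not code from the repo, and `local_opt`, `sample_init` and the toy demo are hypothetical stand-ins for the existing controller optimisation.

```python
def optimise_with_restarts(objective, sample_init, local_opt, n_restarts=5):
    """Run a local optimiser from several random initialisations and keep
    the best result found across all restarts."""
    best_x, best_f = None, float("inf")
    for _ in range(n_restarts):
        x = local_opt(objective, sample_init())
        f = objective(x)
        if f < best_f:
            best_x, best_f = x, f
    return best_x, best_f

# Toy demo: quadratic objective, crude one-step local search,
# three deterministic "random" starts.
_inits = iter([-8.0, 2.0, 7.0])
_local = lambda obj, x0: min((x0 - 1.0, x0, x0 + 1.0), key=obj)
best_x, best_f = optimise_with_restarts(
    lambda v: (v - 3.0) ** 2, lambda: next(_inits), _local, n_restarts=3)
```

Only the start whose basin contains the optimum finds it; the loop keeps that one and discards the rest, which is exactly what helps against the constant-action local minima described earlier.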

As we said, we can connect the mountain car to the original Matlab implementation of PILCO and see whether it solves the task properly. If it does, we'd know for sure that something crucial is missing from our implementation; if not, it might still be a matter of tuning parameters. Meanwhile I might give some other scenario a shot; maybe the algorithm is just not a good match for this particular environment.


nrontsis commented on May 29, 2024

It sounds like a complicated task with many variables to tune.

I can write (probably this week) the connection for Matlab with gym to test PILCO's original implementation. However, I won't have time to tune it (adjust gains for controllers etc) before the end of October. @kyr-pol would you be up for doing that?


kyr-pol commented on May 29, 2024

Yes, thanks, that would be helpful and I can work on the testing and tuning afterwards if necessary.

I have done some more experiments though, working on the pendulum swing-up task as well as the mountain car, and it seems the model becomes quite inaccurate over longer time horizons, more so than the Matlab version does in similar scenarios. I'll give some examples below.


kyr-pol commented on May 29, 2024

Some more observations.

Mountain Car
Model inaccuracy on longer-term predictions persists, even with more data points, and even when subsampling makes each step long enough that an episode consists of only a few steps.
For example (edc6d81), with a subsampling rate of 20 every rollout has just 5 time steps. After collecting 260 data points:

[figure: mountain_car_subs20_260points]

and

[figure: mountain_car_subs20_260points_s_pred]

where x_pred and s_pred are predictions for the position of the car and X_new are the real values. The task here is to move the car from -0.5 to +0.45, so the differences are quite big.

Pendulum (swing-up)
Firstly, this environment causes crashes, probably similar to the ones mentioned in #7 (comment). In this case, the 3rd GP, which predicts the angular velocity, fails to learn after the initial random rollout and causes numerical errors in the controller optimisation. This doesn't occur when we subsample, or when we fix the noise to 1e-4 instead of the minimum of 1e-6 it otherwise ends up at.
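A toy illustration of why raising the noise floor can help; this assumes the failure mode is numerical (with near-duplicate inputs the kernel matrix becomes effectively singular, and a larger noise term on the diagonal restores positive definiteness), shown here on a hand-rolled 2x2 Cholesky rather than the repo's actual GP code.

```python
import math

def chol2x2(a, b, d, noise=0.0):
    """Cholesky factor of [[a, b], [b, d]] + noise * I.
    Raises ValueError when the matrix is not positive definite."""
    a, d = a + noise, d + noise
    l11 = math.sqrt(a)
    l21 = b / l11
    t = d - l21 * l21          # Schur complement; must stay positive
    if t <= 0.0:
        raise ValueError("matrix not positive definite")
    return l11, l21, math.sqrt(t)

# Two identical training inputs give kernel matrix [[1, 1], [1, 1]]:
# singular with noise 0, but factorisable once noise is added.
```

With `noise=0.0` the factorisation of the duplicated-input kernel fails; with `noise=1e-4` it succeeds, which matches the observation that the crashes disappear when the noise is fixed at 1e-4.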

There is a similar scenario implemented in Matlab, so I used most hyperparameter settings from it.
With subsampling rate 3 and 30 timesteps per episode, after 240 data points (876faac):

[figure: pendulum_swing_no_stab]

where the plotted dimension corresponds to the cosine of the pendulum's angle. From the Matlab version, with 40-timestep episodes, after training on 120 data points the error looks like:

[figure: matlab_predictions_accuracy]

where the dimension corresponds to the angle itself, which goes from 0 to pi in a successful run.

One difference between the two models is that the Matlab version takes cos, sin, angular velocity and the control input as model inputs and predicts angular velocity and angle, while ours goes from sin, cos, angular velocity and control to sin, cos and angular velocity. It doesn't look like this should matter that much, though.
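One practical note on the two representations: the (sin, cos) encoding is continuous at the wrap-around, and the raw angle can always be recovered from it. A minimal sketch, with a hypothetical helper name:

```python
import math

def angle_from_sincos(s, c):
    """Recover the pendulum angle from its (sin, cos) encoding.
    atan2 handles all four quadrants, avoiding the discontinuity at +/-pi."""
    return math.atan2(s, c)
```

So the two models predict equivalent information; any difference would come from how easy each representation is for the GP to fit, not from what it encodes.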

Another possibility, mentioned by the original author, is the integrator used in the forward dynamics (Euler for gym, a dopri integrator in Matlab).


nrontsis commented on May 29, 2024

Okay, thanks for the detailed results.

Another possibility might be differences in the training of the GP: Matlab's implementation has a special optimiser that penalises extreme lengthscales and SNR (see hypCurb.m).
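For reference, the idea behind hypCurb.m can be sketched as a penalty added to the negative log marginal likelihood. The caps (100 for lengthscales, 500 for SNR) and the even exponent 30 are my reading of the Matlab code and should be treated as assumptions; the real penalty is also relative to the scale of the training data.

```python
import math

def curb_penalty(log_lengthscales, log_signal, log_noise,
                 ls_cap=math.log(100.0), snr_cap=math.log(500.0), p=30):
    """High even power of (log hyperparameter) / (log cap): negligible for
    moderate values, huge once a lengthscale or the SNR exceeds its cap."""
    pen = sum((l / ls_cap) ** p for l in log_lengthscales)
    pen += ((log_signal - log_noise) / snr_cap) ** p  # log SNR term
    return pen
```

Adding this to the training objective leaves ordinary fits essentially untouched but walls off the extreme lengthscale/SNR solutions that make long-horizon predictions unreliable.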

Either way, it seems complicated. I'm now even more convinced that the best way forward is to link gym with MATLAB's PILCO and see the differences there.

