
Comments (8)

cbfinn commented on July 27, 2024

Negative numbers are normal. It simply depends on how the cost is defined.

NaNs are not normal, though the mjc_peg_images example is meant as an example of how you could interface the codebase with image inputs, not necessarily one that succeeds at the peg insertion task (but it does improve). I believe there is a working vision example in the rfgps branch. (@wmontgomery4 Can you comment?)


wmontgomery4 commented on July 27, 2024

I didn't create mjc_peg_images, but usually when the simulation blows up it's because you've already converged and the policy covariance is too small, leading to weird effects when you try to linearize the resulting nearly deterministic trajectories. Are the NaNs occurring before or after the task starts being solved consistently? If it's after, I wouldn't worry too much; just drop the number of iterations or something.

P.S. Did you change the camera angle? I remember it being directly top-down before, but it looks skewed in your screenshot (although it's hard to tell). The vision experiments get the pixels straight from the window buffer, so you have to leave the camera unchanged while running the experiment.


henrykmichalewski commented on July 27, 2024

Thanks @cbfinn @wmontgomery4 for the immediate response. This seems to be the scenario described by @wmontgomery4: the trajectories learned by the model already look very good, and only later do NaNs start to appear while the average cost shown in the GUI explodes to very large numbers.

I would like to understand which number @wmontgomery4 refers to as the "policy covariance". Is it shown in the GUI? In the console? In general, what kind of heuristic do you use to decide that no more learning iterations are needed?

Regarding the camera position, I am using the following hyperparameter:

'camera_pos': np.array([0., 0., 0., 2., -90, +90])


wmontgomery4 commented on July 27, 2024

Sorry, the policy covariance is not actually displayed in the GUI, but it's stored in algorithm.policy_opt.var if you need to access it.
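
For example, here is a minimal sketch of inspecting it offline, assuming the experiment's data logger has written per-iteration pickles like algorithm_itr_11.pkl into the experiment's data_files directory (the path below is illustrative; adjust it to your setup):

    import pickle

    # Hypothetical path; one algorithm pickle is assumed to be saved per iteration.
    path = 'experiments/mjc_peg_images/data_files/algorithm_itr_11.pkl'
    with open(path, 'rb') as f:
        algorithm = pickle.load(f)  # requires the gps package on the PYTHONPATH

    # policy_opt.var holds the policy's output (action) variance.
    print(algorithm.policy_opt.var)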

For the number of iterations, I usually just set it to 12 and adjust based on results. When running multiple seeds for experiments, it's easiest to run more iterations than you need and clip the extra ones (where things might blow up past convergence) when plotting.
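
A toy sketch of the clipping I mean (the array names and shapes are made up):

    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in for logged costs: one row per seed, one column per iteration.
    costs = np.random.rand(5, 20)
    n_keep = 12                              # drop iterations past convergence
    mean_cost = costs[:, :n_keep].mean(axis=0)

    plt.plot(mean_cost)
    plt.xlabel('iteration')
    plt.ylabel('mean cost across seeds')
    plt.show()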

Camera position should be fine if you're getting good results, just wanted to make sure you weren't changing the angle during training.


chillacile commented on July 27, 2024

@cbfinn @wmontgomery4 @henrykmichalewski

I dug quite deep into this Python code and have run GPS roughly a thousand times, each time setting the number of iterations to around 100 to 200. The source code can be tricky in places; in particular, it does not always match what is described in the papers.

First, training is not stable with the BADMM algorithm and becomes more stable with MDGPS. The original BADMM example is set to 10 iterations, but it blows up around iterations 13-15. So the answer to your first question is simple: mjc_peg_images uses BADMM, while reacher_images uses MDGPS.

I can also tell you why BADMM is worse. Look at the last few lines of both algorithm_badmm.py and algorithm_mdgps.py: the calculations of fCm and fcv differ. If you look carefully, you will see that as "eta" grows, the MDGPS computation tends toward convergence while the BADMM computation blows up.

Also, during training, when the optimizer runs into trouble it increases "eta", multiplying it by 10 or adding something like 2^n, in order to find a better iLQG solution (a better trajectory).
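
A rough sketch of that retry loop (this is a paraphrase of the idea, not the actual traj_opt_lqr_python.py code; compute_Quu is a hypothetical callback returning the action-action Hessian at one time step):

    import numpy as np

    def backward_pass_eta_schedule(compute_Quu, T, eta=1e-4, max_tries=20):
        """Increase eta until Quu is positive definite at every time step."""
        for _ in range(max_tries):
            ok = True
            for t in reversed(range(T)):
                try:
                    np.linalg.cholesky(compute_Quu(t, eta))
                except np.linalg.LinAlgError:
                    ok = False
                    break
            if ok:
                return eta
            eta *= 10.0  # bump the dual variable and retry the backward pass
        raise RuntimeError('backward pass did not stabilize')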


chillacile commented on July 27, 2024

@cbfinn @wmontgomery4 @henrykmichalewski

Regarding algorithm.policy_opt.var: judging from the results, the "variance" and "covariance" used throughout training may not be necessary. There is a quantity called "Qtt" (I do not know exactly what it is; maybe it is the same as Qxx in the original iLQG) in the backward function of traj_opt_lqr_python.py (where iLQG generates the optimized trajectory), which is later used in the update function of policy_opt_caffe.py and algorithm.py (both BADMM and MDGPS) as something called "prc" (maybe a precision matrix) when the neural network trains itself. But even though "var" is computed in the update function, it is recomputed after each neural network update (independently of the network's output), which means it is not necessary to compute it in the update function.

As for the covariance (the precision matrix), it is used as a multiplier on the neural network's loss in the final loss layer, which may be a way of changing the effective learning rate for each output (the torques of the 7 joints). However, you can set tgt_prc (the covariance, or precision matrix) to a constant value that is large enough (e.g. tgt_prc = np.eye(7) * 10000.0 or even 1000000.0; the exact value does not matter, although something as small as 10 is not enough), and the neural network trains just as well as before.

I also cannot understand the loss calculation that uses the precision matrix, in which the loss of output0 can affect the loss of output1. I would expect the neural network's outputs to be independent, especially since in practice the precision matrix (the covariance) does not seem to be needed (at least in the experiments I tried).

In the "end to end" paper, it is said using the precision matrix in the training of neural network is xxxx
and cited a paper which is the original Guided Policy Search paper. But even in the original Guided Policy Search paper, I cannot find something related to the precision matrix


cbfinn commented on July 27, 2024


wmontgomery4 commented on July 27, 2024

each time setting the number of iterations to around 100 to 200

I'm not sure why you're setting the iterations so high. I believe all of the tasks in the codebase work in <12 iterations, and GPS methods generally shouldn't need >100 iterations for any task.

Also, as I mentioned previously, running GPS past convergence in simulation can lead to stability issues, since the linearizations are poorly behaved when the policy/trajectories are nearly deterministic.
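
A toy illustration of why that happens (made-up numbers, not code from the repo):

    import numpy as np

    # A nearly deterministic controller has a tiny covariance; its inverse
    # (the precision) then has huge entries and the downstream math blows up.
    cov = np.eye(2) * 1e-16
    prec = np.linalg.inv(cov)
    print(prec)  # entries on the order of 1e16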

Look at the last few lines of both algorithm_badmm.py and algorithm_mdgps.py: the calculations of fCm and fcv differ.

Yes, they should be different; that is the point of MDGPS vs. BADMM. MDGPS uses a simpler formulation of the modified cost during the LQR step, which makes fCm/fcv simpler. That difference is part of why BADMM is worse in practice, but I don't think it's for the reason you're describing.

There is a quantity called "Qtt" (I do not know exactly what it is; maybe it is the same as Qxx in the original iLQG) in the backward function of traj_opt_lqr_python.py (where iLQG generates the optimized trajectory), which is later used in the update function of policy_opt_caffe.py and algorithm.py (both BADMM and MDGPS) as something called "prc" (maybe a precision matrix)

Qtt is a large matrix which holds Quu, Qxx, and Qux. The Quu portion of Qtt is the precision matrix of the optimal controller as explained in the iLQR paper.
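
Schematically (illustrative shapes only, not the repo's exact code):

    import numpy as np

    # At each time step t, Qtt stacks the state/action blocks:
    #   Qtt[t] = [[Qxx, Qxu],
    #             [Qux, Quu]]
    dX, dU = 4, 2
    Qtt_t = np.eye(dX + dU)      # placeholder for the real Qtt[t]
    Qxx_t = Qtt_t[:dX, :dX]
    Qux_t = Qtt_t[dX:, :dX]
    Quu_t = Qtt_t[dX:, dX:]      # action-action block = controller precision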

As for the covariance (the precision matrix), it is used as a multiplier on the neural network's loss in the final loss layer, which may be a way of changing the effective learning rate for each output (the torques of the 7 joints). However, you can set tgt_prc (the covariance, or precision matrix) to a constant value that is large enough (e.g. tgt_prc = np.eye(7) * 10000.0 or even 1000000.0; the exact value does not matter, although something as small as 10 is not enough), and the neural network trains just as well as before.

I'm not sure what you're asking here, but the precision matrix is used in the final layer in order to compute the KL divergence between the local controller and the neural network global policy. This should be explained in the appendix of the BADMM/MDGPS papers.
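
In other words, the per-timestep policy loss is roughly a quadratic form weighted by the precision, as in this sketch (the names are illustrative, not the codebase's exact API):

    import numpy as np

    def weighted_policy_loss(u_target, mu_policy, prc):
        """0.5 * (u - mu)^T * Prc * (u - mu): the KL surrogate term."""
        diff = u_target - mu_policy
        return 0.5 * diff.dot(prc).dot(diff)

    u = np.array([1.0, 0.0])
    mu = np.array([0.8, 0.1])
    prc = np.array([[2.0, 0.5],
                    [0.5, 1.0]])   # off-diagonal terms couple the outputs
    print(weighted_policy_loss(u, mu, prc))

The off-diagonal entries of prc are why the loss on one output can depend on the error of another output.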

I'm closing this thread, as the original issue has been solved.

