
Comments (11)

JacopoPan avatar JacopoPan commented on July 18, 2024

Thank you again @chris-aeviator,
the whole repo is still partially a work in progress and all feedback is appreciated!

Just a few remarks to make sure we are on the same page:

  • fly.py per se does not involve learning; it is purely a flight simulation using PID control
  • Your implementation of wind can make sense; you might want to randomize the external force. Check out PyBullet's Quick Start Guide to understand how to use applyExternalForce (see the sketch after this list)
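A minimal sketch of that idea, assuming drone_id is the PyBullet body unique id of the quadrotor and that the function is called once per simulation step (PyBullet clears external forces after each stepSimulation); the mean force and noise magnitude are illustrative values:

import numpy as np
import pybullet as p

def apply_wind(drone_id, mean_force=(0.0, 0.005, 0.0), noise_std=0.001):
    # Sample a wind force (in Newtons) around a constant mean along +y.
    force = np.random.normal(loc=mean_force, scale=noise_std)
    # Apply it at the drone's current position, expressed in the world frame.
    pos, _ = p.getBasePositionAndOrientation(drone_id)
    p.applyExternalForce(objectUniqueId=drone_id,
                         linkIndex=-1,              # base link
                         forceObj=force.tolist(),
                         posObj=pos,
                         flags=p.WORLD_FRAME)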

I guess that your problem arises from the fact that the default ActionType used in singleagent.py is one_d_rpm

parser.add_argument('--act', default='one_d_rpm', type=ActionType, help='Help (default: ..)', metavar='')

I.e., the RPMs of all propellers are the same in the learned controller: you would want to change that to just rpm to apply 4 different actions to the propellers. I have to warn you that this is where the learning problem gets much more complicated (even a simple hover might take hours, rather than minutes, to learn).
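For reference, the same default can be switched directly in singleagent.py (a one-line tweak to the argument shown above, equivalent to passing --act rpm on the command line):

# Same argument as above, with the default changed so each propeller
# receives its own RPM command:
parser.add_argument('--act', default='rpm', type=ActionType, help='Help (default: ..)', metavar='')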


chris-aeviator avatar chris-aeviator commented on July 18, 2024

@JacopoPan thanks for your explanation, yes we are on the same page. Since I'm evaluating an airframe design, I mentioned fly.py to test and compare certain parameters, and I also evaluated the wind there.

I'm OK training for hours and this is expected. With the `one_d_rpm` setting (thanks for making that clear!) my loss converged to around -5576 and I stopped the training after 3 hrs. I will now try to understand and implement the necessary separate actions for each of the propellers.

EDIT: are you saying I will just need to set --act rpm to have the 4 rotors controlled independently?

EDIT 2: it seems like so 👍 🚀 🥳


EDIT 3: for anybody interested in my use case: after 1 hr of training with A2C I still got really poor results (not even flying), so I switched to PPO, which shows much higher GPU utilization (up to 75%, compared to a max of 20% on A2C) and seems to perform much better: the time before the craft crashes is 1.5 s after 1 hr of training with A2C, versus about 4.5 s after only ~15 min of training with --alg ppo. Even though I've set --cpu 12, I can only see one core being utilized. I'll keep posting results here.

GPU Utilization --alg ppo


GPU Utilization --alg a2c

steady 20%

EDIT 4:

Training 1hr 20 min in

The vehicle manages to counteract the constant wind force along Y, though the desired Z position (z = 1) is not reached yet.

video-01.10.2021_13.40.48.mp4


JacopoPan avatar JacopoPan commented on July 18, 2024

Yes, not all algorithms might be equally successful. You might want to look at changing the number of neurons and layers in the networks in singleagent.py. Reward shaping and limiting the range of RPMs for each propeller might also be options (e.g., if your wind is along the global y axis and the quad is in the x configuration facing +x, you could simplify the problem by commanding the same RPMs to props 0-1 and 2-3; see the sketch below).
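One way to picture that simplification (purely a sketch, not part of the repo's API: the propeller pairing, the hover_rpm value, and the max_delta range are illustrative assumptions) is to map a learned 2-dimensional action onto the 4-motor RPM command:

import numpy as np

def pairwise_rpm(action_2d, hover_rpm, max_delta=2000.0):
    # Clip the learned 2-dim action to [-1, 1] and mirror it onto the
    # propeller pairs (0, 1) and (2, 3), e.g. to counteract a wind along global y.
    a = np.clip(np.asarray(action_2d), -1.0, 1.0)
    rpm_01 = hover_rpm + max_delta * a[0]   # props 0 and 1
    rpm_23 = hover_rpm + max_delta * a[1]   # props 2 and 3
    return np.array([rpm_01, rpm_01, rpm_23, rpm_23])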

The goal of this repo is to give you the tools to try all these things; I don't think I've solved the entire problem of addressing generic control with RL yet :D

This is an example of hover that was learned by PPO over ~8hrs.

video-10.28.2020_09.45.37.mp4


chris-aeviator avatar chris-aeviator commented on July 18, 2024

In your video the vehicle has a crazy spin around the Z axis when hovering; my training vehicle is tending to flip over (no new best reward for 2 hrs). Would

Reward shaping

mean "punishing it" for this behaviour? Would I, for example, apply a reward-decreasing factor when experiencing high angular velocities, or could I even punish no-go scenarios like flipping over with a -1?


JacopoPan avatar JacopoPan commented on July 18, 2024

Yes, if the reward function does not account for yaw or the z-axis turn rate (as is the case in that example), the learning agent cannot distinguish between a spinning and a non-spinning hover.

You can try to speed up/"guide" your learning agent by customizing _computeReward() (the reward of a given state or state-action pair) and _computeDone() (the conditions for terminating an episode); see the sketch below.
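A hedged sketch of what those two overrides could look like for the "punish spinning / terminate on a flip-over" idea. The import path, the _getDroneStateVector helper, and the state layout (position at indices 0:3, roll-pitch-yaw at 7:10, angular velocity at 13:16) reflect my reading of the repo at the time and may differ between versions; the weights and thresholds are arbitrary:

import numpy as np
from gym_pybullet_drones.envs.single_agent_rl.HoverAviary import HoverAviary

class ShapedHoverAviary(HoverAviary):

    def _computeReward(self):
        state = self._getDroneStateVector(0)
        pos, rpy, ang_vel = state[0:3], state[7:10], state[13:16]
        reward = -abs(1.0 - pos[2])                # distance to the desired z = 1
        reward -= 0.1 * np.linalg.norm(ang_vel)    # discourage spinning
        if abs(rpy[0]) > np.pi / 2 or abs(rpy[1]) > np.pi / 2:
            reward -= 1.0                          # "no-go": flipped over
        return reward

    def _computeDone(self):
        state = self._getDroneStateVector(0)
        rpy = state[7:10]
        # End the episode early if the drone has rolled or pitched past 90 degrees.
        return abs(rpy[0]) > np.pi / 2 or abs(rpy[1]) > np.pi / 2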


chris-aeviator avatar chris-aeviator commented on July 18, 2024

Thanks Jacopo for all your help and responsiveness! I'll plan these next steps and leave more findings in this GH issue within the next few days; please feel free to close it (or not) :0 .


JacopoPan avatar JacopoPan commented on July 18, 2024

👌 Note that I don't think there's a silver bullet for those implementations, and every use case is very much of general interest (I'll keep the issue open).


rogerscristo avatar rogerscristo commented on July 18, 2024

Hi @JacopoPan and @chris-aeviator. I've been following this issue and it has helped me a lot to better understand the simulator.
Regarding training the hover task with PPO: @JacopoPan, can you please provide the hyperparameters and reward used to achieve such good results? I've tried training with the default parameters over 50 million timesteps but did not reach anything like your video. I've also tried some reward shaping, as well as tuning the PPO parameters, but again without success.

Thank you in advance!


JacopoPan avatar JacopoPan commented on July 18, 2024

@rogerscristo I didn't tag the commit that result came from, but I remember it was one of those I obtained when testing sa_script.bash and sa_script.slrm on the computing cluster (for 8+ hrs). I don't think the reward has changed much (even if I tried a few variations of it, it has always been either a stepwise function, a distance, or a quadratic distance along the z axis). Originally I had 256 instead of 512 units in the first layer of the networks; I did not touch any other hyperparameter.

I don't think it should be too surprising if some of the training runs do not succeed: during that set of experiments only PPO and SAC produced "decent" policies. My general suggestion is to start a few experiments in parallel and make sure that the network capacities are appropriate for the task at hand by checking that the learning curves are somewhat stable.
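For anyone reproducing this, the layer widths mentioned above are the kind of thing set through stable_baselines3's policy_kwargs. A minimal sketch, assuming the repo registers its single-agent hover env under the id used below (check singleagent.py for the exact env name and architecture in your version):

import gym
import gym_pybullet_drones  # assumed to register the aviary environments
from stable_baselines3 import PPO

env = gym.make("hover-aviary-v0")

# First layer widened from 256 to 512 units, as discussed above; the remaining
# layer sizes are illustrative.
policy_kwargs = dict(net_arch=[512, 256, 128])
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=1_000_000)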


rogerscristo avatar rogerscristo commented on July 18, 2024

Thank you @JacopoPan for the directions. I will try to generate new learning examples to complement the documentation.

Thanks a lot!


4ku avatar 4ku commented on July 18, 2024

@rogerscristo @chris-aeviator Did you solve this problem? I also have some problems with training. I am trying to train PPO from stable_baselines3 but I don't have any good results.

