@ArminBaz the reward function in the latest commit is not the same as the one from when I wrote the message above.
Have you tried looking at the performance of the trained agent using the script test_singleagent.py?
It should be under the folder gym-pybullet-drones/experiments/learning/results
$ python ./test_singleagent.py --exp ./results/save-<env>-<algo>-<obs>-<act>-<time-date>
(around -30 over the episode should be OK, as there are negative rewards at every point except the desired hover one)
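A reward that is negative everywhere except the hover point suggests a distance-based penalty; the formula below is an illustrative assumption for this kind of shaping, not the repo's actual implementation:

```python
def shaped_reward(z: float, z_target: float = 1.0) -> float:
    """Illustrative shaped reward (an assumption, not the repo's exact
    formula): penalize squared distance from the target hover altitude,
    reaching zero only at the target."""
    return -(z - z_target) ** 2

# The penalty shrinks as the drone approaches the target altitude.
print(shaped_reward(0.0))  # -1.0
print(shaped_reward(0.5))  # -0.25
print(shaped_reward(1.0))  # 0.0
```

Summing small per-step penalties like this over a 242-step episode can plausibly accumulate to a total on the order of -30.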
@JacopoPan That makes a lot of sense, thank you for getting back so quickly!
Hello @amijeet,
apologies if I break workflows (especially around learn.py) as I am actively modifying the code.
If you want to get started on single-agent RL, look at this commit and, in particular, these 2 scripts:
- singleagent.py, which trains using a few of stable-baselines3's algorithms
- test_singleagent.py, which re-runs a model trained with the previous script
And these 2 classes:
This is a much simplified take-off and hover scenario with a 2-D obs space (z and velocity in z) and a 1-D action space (the RPM for all motors).
The reward is 1 for z between 0.75 and 0.99 and 0 otherwise.
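The sparse reward described above can be sketched in a few lines (the standalone function name is illustrative; in the repo this logic lives inside the environment class):

```python
def compute_reward(z: float) -> float:
    """Sparse hover reward: 1 while the drone's altitude z is inside
    the target band [0.75, 0.99], 0 everywhere else."""
    return 1.0 if 0.75 <= z <= 0.99 else 0.0

# Reward accrues only while the drone stays inside the band.
print(compute_reward(0.80))  # 1.0
print(compute_reward(0.50))  # 0.0
```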
In this example, running stable-baselines3's PPO finds a solution in just a few minutes.
$ cd gym-pybullet-drones/experiments/learning/
$ python singleagent.py --env takeoff --algo ppo --pol mlp --input rpm
Output:
Eval num_timesteps=10000, episode_reward=26.00 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=20000, episode_reward=29.00 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=30000, episode_reward=58.00 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=40000, episode_reward=173.00 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Stopping training because the mean reward 173.00 is above the threshold 100
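The run above stops as soon as an evaluation's mean reward crosses the threshold (stable-baselines3 provides this via its StopTrainingOnRewardThreshold callback). The stopping rule itself is simple; the plain-Python function below is an illustration, not part of the library:

```python
def should_stop(mean_rewards, threshold=100.0):
    """Return True once the latest evaluation's mean reward reaches
    the threshold, mirroring the log output above."""
    return bool(mean_rewards) and mean_rewards[-1] >= threshold

# Mean rewards from the evaluations at 10k/20k/30k/40k timesteps.
evals = [26.0, 29.0, 58.0, 173.0]
print(should_stop(evals))  # True: 173.0 is above the threshold 100
```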
Of course, more complicated tasks, using higher-dimensional observation and action vectors, can require:
- More sophisticated reward engineering (see TakeoffAviary.py)
- And/or customizing the learning network architecture (see singleagent.py)
as well as much longer training times. E.g. simply making the input 4-D complicates the problem enough that PPO only collects 1/5 of the reward in 15x the number of iterations:
Eval num_timesteps=680000, episode_reward=31.00 +/- 0.00
Episode length: 86.00 +/- 0.00
New best mean reward!
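The network-architecture customization mentioned above goes through stable-baselines3's policy_kwargs argument. A minimal sketch, where the layer sizes are illustrative (not necessarily what singleagent.py uses) and the list-of-dict net_arch format is the one accepted by SB3 versions from this period:

```python
# Hypothetical policy_kwargs for an SB3 model: separate 2-layer MLP
# heads for the policy (pi) and value function (vf).
policy_kwargs = dict(net_arch=[dict(pi=[512, 512], vf=[512, 512])])

# Passed to the model constructor, e.g.:
# model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs)
print(policy_kwargs["net_arch"][0]["pi"])  # [512, 512]
```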
I don't have all the answers; the purpose of this gym is exactly to try (and let others try) these things.
Hey @JacopoPan, forgive me if this is a naive question as I am still relatively new to reinforcement learning and your library. I just ran singleagent.py (from the most recent commit) on takeoff and I noticed that my model seems to be far slower than the one you showed.
It seems that you were able to break the mean reward threshold after 40000 timesteps, while I am stuck at around -30 after 120000. Do you know why this may be happening, and do you have any suggestions on how to speed up the training? Thanks!
Here is the output for reference:
Eval num_timesteps=110000, episode_reward=-30.23 +/- 0.00
Episode length: 242.00 +/- 0.00
Eval num_timesteps=115000, episode_reward=-30.18 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=120000, episode_reward=-30.15 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!
Eval num_timesteps=125000, episode_reward=-30.12 +/- 0.00
Episode length: 242.00 +/- 0.00
New best mean reward!