
Comments (7)

JacopoPan commented on July 18, 2024

Hi @nbenave,

when you run

python gym-pybullet-drones/examples/learn.py

what you see at the end is a trained model applied to the quadrotor, i.e. line 88:

action, _states = model.predict(obs,

the resulting performance is not great because learn.py is an example script that learns over "only" 10000 steps
model.learn(total_timesteps=10000) # Typically not enough

if you want to look at those 10000 steps, you only need to change this line
env = gym.make("takeoff-aviary-v0")

to

env = gym.make("takeoff-aviary-v0", gui=True) 

however, I think you'll realize that adding the frontend and rendering can make learning prohibitively time-consuming
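
Roughly, the overall flow looks like this. This is only a sketch, not learn.py verbatim: I'm assuming stable-baselines3's A2C and that importing gym_pybullet_drones registers the aviary environments; the idea is to train on a headless env and use gui=True only to watch the result.

import gym
import gym_pybullet_drones  # assumption: importing the package registers the aviary envs
from stable_baselines3 import A2C

train_env = gym.make("takeoff-aviary-v0")              # headless: training stays fast
model = A2C("MlpPolicy", train_env, verbose=1)         # assumption: learn.py may use a different algorithm
model.learn(total_timesteps=10000)                     # "only" 10000 steps

show_env = gym.make("takeoff-aviary-v0", gui=True)     # rendering only for playback
obs = show_env.reset()
for _ in range(3 * show_env.SIM_FREQ):                 # a few seconds of simulation
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = show_env.step(action)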

in singleagent.py I used stable-baselines3's EvalCallback to save a model every time it improves performance

eval_callback = EvalCallback(eval_env,

you might want to do something similar to visualize how the agent changes during learning "offline"
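
For reference, the pattern looks roughly like this (a sketch with an assumed save path and eval_freq, not singleagent.py verbatim); the best model so far is written to disk, so you can replay the checkpoints afterwards in a gui=True environment.

from stable_baselines3.common.callbacks import EvalCallback

eval_env = gym.make("takeoff-aviary-v0")                          # separate env used only for evaluation
eval_callback = EvalCallback(eval_env,
                             best_model_save_path="./logs/",      # hypothetical path: best_model.zip lands here
                             log_path="./logs/",
                             eval_freq=1000,                      # assumption: evaluate every 1000 steps
                             deterministic=True)
model.learn(total_timesteps=10000, callback=eval_callback)

You can then load the saved best_model.zip and run it in a gui=True environment to see how the agent behaved at that point in training.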


nbenave commented on July 18, 2024

Thank you for your quick and detailed answer!

In the "Show performance" code section (lines 72-101), will the environment display the model's top performance?

I have a few short questions, if you can clarify a few things:

  1. Are 10,000 timesteps equivalent to 10 seconds of training?
  2. The reward changes from about -200 in the initial steps and can reach about -20. What is the optimal reward, and what does this numeric value represent in this environment?
  3. On line 81, the range of the for loop is range(3*env.SIM_FREQ). Can you explain why you iterate over SIM_FREQ, and why it's multiplied by 3?

Thanks again.


JacopoPan commented on July 18, 2024

Briefly

  • yes
  1. no, 10'000 calls to env.step(), i.e. 10'000/(env.SIM_FREQ/env.AGGR_PHY_STEPS) seconds of simulated time (see the worked example after this list)
  2. the reward depends on the environment/task; the one you are referring to penalizes the quadrotor for not being at the desired hover position, so it cannot be 0 (the quadrotor does not start at that position), but it gets closer to 0 the faster the quadrotor reaches that position and stays there
  3. it only means you'll be shown an arbitrary number (3*env.AGGR_PHY_STEPS) of seconds of simulation (different environments have different episode lengths)
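
To make the arithmetic concrete, here is a hypothetical worked example; SIM_FREQ=240 Hz and AGGR_PHY_STEPS=5 are assumptions, so check your own environment's values:

SIM_FREQ = 240                                    # assumption: physics stepped at 240 Hz
AGGR_PHY_STEPS = 5                                # assumption: 5 physics steps per env.step()
env_steps_per_second = SIM_FREQ / AGGR_PHY_STEPS  # 48 env.step() calls per simulated second
print(10000 / env_steps_per_second)               # ~208 s of simulated time for 10'000 training steps
print(3 * SIM_FREQ / env_steps_per_second)        # 15 s shown by the range(3*env.SIM_FREQ) loop, i.e. 3*AGGR_PHY_STEPS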


nbenave commented on July 18, 2024

Thank you again, mate.
Now it's more clear to me :)

Another question, about the multi-agent learning.
Is the training for both of the quadcopters? Is each quadcopter trained separately?
Does each of them observe simultaneously, or is there a joint observation?

Is the reward at each step related to the follower, the leader, or both of them?

Thanks!


JacopoPan commented on July 18, 2024

The MARL example in multiagent.py is based on RLlib's centralized critic examples, so yes, both agents learn; there is some postprocessing that goes into creating the observations of each agent, and each agent has its own reward signal.
The multi-agent script, in my intention, was meant as a demonstration of how a multi-agent environment can be used.
The best way to do MARL is still a bit up for debate, imho.
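
Schematically, the per-agent plumbing follows RLlib's MultiAgentEnv convention: every step returns dictionaries keyed by drone id. The sketch below is illustrative only, with placeholder observation sizes and reward values, not the repository's actual aviary code.

import numpy as np

class TwoDroneEnvSketch:                               # illustrative name, not a real class in the repo
    def step(self, action_dict):                       # e.g. {0: rpms_for_drone_0, 1: rpms_for_drone_1}
        obs = {i: np.zeros(12) for i in (0, 1)}        # placeholder: one observation per drone
        rewards = {i: -1.0 for i in (0, 1)}            # placeholder: each drone gets its own reward signal
        dones = {0: False, 1: False, "__all__": False} # "__all__" ends the episode for every agent
        infos = {i: {} for i in (0, 1)}
        return obs, rewards, dones, infos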


nbenave commented on July 18, 2024

Thanks again,
how is the reward calculated in multi-agent learning?

There's a reward for drone 0 and a reward for drone 1, but I don't understand how you calculate the overall reward.

Is there an equation for combining these two rewards into one overall reward?

I'm using TensorBoard, and the mean-reward graph displays only one value, not one per drone.


JacopoPan commented on July 18, 2024

The multi-agent aviary returns a dictionary of rewards because each agent can receive its own signal. How to use these to learn multiple critics/value functions depends on the MARL approach you are implementing (see parameter sharing vs. fully independent learning vs. centralized critic, etc.). Off the top of my head, I don't remember what value you'd see on TB.
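
As an assumption on my part (not a statement about what RLlib actually logs), one common convention for getting a single curve is to sum or average the per-agent rewards, something like:

step_rewards = {0: -1.7, 1: -2.3}                  # hypothetical per-drone rewards from one env.step()
combined_sum = sum(step_rewards.values())          # -4.0: one scalar covering both drones
combined_mean = combined_sum / len(step_rewards)   # -2.0: per-drone average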

