Hi, Mr. Watanabe Sorry for posting a question here. I understand the

Question regarding SAC discrete about rljax HOT 8 CLOSED

zmce2018 commented on July 17, 2024

Question regarding SAC discrete

from rljax.

Comments (8)

toshikwa commented on July 17, 2024

Hi @zmce2018

Let me confirm what kind of env you want to use.
Which of these is your situation?

Multiple environments (N obs, N actions)
One env with multiple actions (1 obs, N actions)

Thanks :)

zmce2018 commented on July 17, 2024

Hi @ku2482. One env with multiple actions (1 obs, N actions). Thank you.

toshikwa commented on July 17, 2024

I think you have two options.

1
Treat the N action spaces A as a single joint action space A^N.
This is a common approach. For example, take an env with the action spaces (left, None, right) and (forward, None, backward).
Combined, the joint action space has 3 x 3 = 9 actions.
However, the number of joint actions grows exponentially with N.

2
Train SAC-Discrete with a MultiCategorical distribution. In other words, train SAC-Discrete with N actor heads and N critic heads.
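
As a rough illustration of option 1, the sketch below flattens the two example action spaces into one joint space and shows how quickly the count grows. It is plain Python, not rljax code; the helper names `encode` and `decode` are made up for this example.

```python
from itertools import product

# The two sub-action spaces from the example above.
horizontal = ["left", "None", "right"]
vertical = ["forward", "None", "backward"]

# Option 1: enumerate the joint action space as a flat list of tuples.
joint_actions = list(product(horizontal, vertical))

def encode(sub_indices, sizes):
    """Map per-dimension indices to one flat action index (mixed radix)."""
    flat = 0
    for idx, size in zip(sub_indices, sizes):
        flat = flat * size + idx
    return flat

def decode(flat, sizes):
    """Inverse of encode: recover per-dimension indices from a flat index."""
    sub_indices = []
    for size in reversed(sizes):
        sub_indices.append(flat % size)
        flat //= size
    return tuple(reversed(sub_indices))

sizes = (len(horizontal), len(vertical))
num_joint = len(joint_actions)                # 3 * 3 joint actions
roundtrip = decode(encode((2, 1), sizes), sizes)
growth = [3 ** n for n in (2, 5, 10)]         # joint actions for n ternary sub-actions
print(num_joint, roundtrip, growth)
```

A single categorical policy over `num_joint` outputs then works unchanged with SAC-Discrete, at the cost of the exponential growth shown in `growth`.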

I think option 2 can be a reasonable candidate.
Does it answer your question?

Thanks:)

zmce2018 commented on July 17, 2024

Hi @ku2482

Option 1 would be infeasible. Option 2 seems okay at first glance. However, if you build N actor and critic heads, the agent learns each action space independently. (The agent would not be able to tell which action drives the reward.)

Am I understanding it correctly?
Thank you.

toshikwa commented on July 17, 2024

Suppose, you have a special cart pole game where you have to handle 10 cart poles simultaneously

From this explanation, I assumed that the underlying dynamics of each cart pole were independent. Are these dynamics actually dependent?

zmce2018 commented on July 17, 2024

Thank you, Ku

Yes, the dynamics are dependent. You can think of it as a humanoid, except that each action space is discrete.

Thank you for your patience.

toshikwa commented on July 17, 2024

I see.
You can still model the policy as multiple categorical distributions that share some layers.
\pi(a|s) = \pi_1(a_1|s) * \pi_2(a_2|s) * ...
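
A minimal sketch of what this factorization gives you, assuming the shared layers have already produced per-head logits (the `head_logits` values are made up): the joint log-probability is the sum of the per-head log-probabilities, and the joint entropy is the sum of the per-head entropies. This is plain Python for illustration, not rljax code.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for N = 3 heads with 3 choices each.
head_logits = [[0.1, 0.5, -0.3], [1.0, 0.0, 0.0], [-0.2, 0.2, 0.4]]
head_probs = [softmax(logits) for logits in head_logits]

def sample_joint(head_probs, rng):
    """Sample each sub-action independently from its own head."""
    return tuple(rng.choices(range(len(p)), weights=p)[0] for p in head_probs)

def joint_log_prob(head_probs, action):
    """log pi(a|s) = sum_i log pi_i(a_i|s) for the factorized policy."""
    return sum(math.log(p[a]) for p, a in zip(head_probs, action))

def joint_entropy(head_probs):
    """Entropy of a product distribution is the sum of the head entropies."""
    return sum(-sum(q * math.log(q) for q in p) for p in head_probs)

rng = random.Random(0)
action = sample_joint(head_probs, rng)
log_p = joint_log_prob(head_probs, action)
entropy = joint_entropy(head_probs)
print(action, log_p, entropy)
```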

However, if the action spaces are dependent, you have to evaluate the values of all |A|^N joint actions.
So you need a Q function that outputs |A|^N values, because you need the Q values of every joint action to calculate the expectations. That becomes infeasible when N is large.
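
To make the cost concrete, here is a small sketch of the exact soft value E_{a~pi}[Q(s,a) - alpha * log pi(a|s)] computed by enumerating every joint action. The probabilities, the temperature `alpha`, and the toy `q_value` function (with an interaction term so the sub-actions are dependent) are all assumptions for illustration.

```python
import math
from itertools import product

alpha = 0.2  # hypothetical entropy temperature
head_probs = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]  # hypothetical pi_i(a_i|s)

def q_value(action):
    """Toy joint Q(s, a) with an interaction term between the sub-actions."""
    a1, a2 = action
    return float(a1 + a2) + 0.5 * a1 * a2

def soft_value(head_probs, q_fn, alpha):
    """Exact expectation over all |A|^N joint actions of Q - alpha * log pi."""
    value = 0.0
    for action in product(*(range(len(p)) for p in head_probs)):
        prob = math.prod(p[a] for p, a in zip(head_probs, action))
        value += prob * (q_fn(action) - alpha * math.log(prob))
    return value

v = soft_value(head_probs, q_value, alpha)
num_terms = math.prod(len(p) for p in head_probs)  # |A|^N terms in the sum
print(v, num_terms)
```

With 2 heads of 3 choices the loop has 9 terms; with 10 such heads it would already have 59049.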

Alternatively, you may be able to model a Q function that takes the state and N one-hot actions as input and outputs a scalar. In this case, you can estimate the expectations with sample means. But then it is no longer SAC-Discrete; it's Soft Actor-Critic. (I don't think it's a smart approach.)
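
A rough sketch of this second idea, again with made-up numbers: `q_scalar` is a hypothetical stand-in for a critic network that takes the state plus N one-hot actions and returns a scalar, and the expectation is estimated as a sample mean instead of a sum over all joint actions.

```python
import math
import random

def one_hot(index, size):
    """One-hot encode a single sub-action."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def q_scalar(state, one_hot_actions):
    """Hypothetical critic stand-in: (state, N one-hot actions) -> scalar."""
    flat = [x for oh in one_hot_actions for x in oh]
    weights = [0.1 * (i + 1) for i in range(len(flat))]
    return sum(state) + sum(w * x for w, x in zip(weights, flat))

def sampled_soft_value(state, head_probs, alpha, num_samples, rng):
    """Monte Carlo estimate of E_{a~pi}[Q(s,a) - alpha * log pi(a|s)]."""
    total = 0.0
    for _ in range(num_samples):
        action = [rng.choices(range(len(p)), weights=p)[0] for p in head_probs]
        log_p = sum(math.log(p[a]) for p, a in zip(head_probs, action))
        q = q_scalar(state, [one_hot(a, len(p)) for p, a in zip(head_probs, action)])
        total += q - alpha * log_p
    return total / num_samples

rng = random.Random(0)
state = [0.5, -0.5]                               # hypothetical observation
head_probs = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]   # hypothetical pi_i(a_i|s)
estimate = sampled_soft_value(state, head_probs, alpha=0.2, num_samples=2000, rng=rng)
print(estimate)
```

The cost per update now scales with the number of samples rather than |A|^N, but the estimate is noisy, which is the trade-off mentioned above.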

I'm sorry that I can't come up with a smart solution...

BTW, please call me Toshiki.
Thanks.

zmce2018 commented on July 17, 2024

Thank you so much for your explanation, Toshiki.
That answers my question.
