iffix / machin Goto Github PK

Reinforcement learning library(framework) designed for PyTorch, implements DQN, DDPG, A2C, PPO, SAC, MADDPG, A3C, APEX, IMPALA ...

License: MIT License

Python 71.66% Shell 0.06% Batchfile 0.03% HTML 28.25%

reinforcement-learning deep-learning pytorch pytorch-reinforcement-learning dqn ddpg sac ppo td3 prioritized-experience-replay

machin's Issues

Algorithm impala cannot use GPU[ALTER]

machin/frame/algorithms/impala.py, line 363, 373

vs[idx] = (value[idx] + delta_v[idx] + self.discount * c[idx] * (vs[idx + 1] - value[idx + 1]))

This should be corrected to the following code:

vs[idx] = (value[idx].to('cpu') + delta_v[idx] + self.discount * c[idx] * (vs[idx + 1] - value[idx + 1].to('cpu')))

Do the same for line 373.

len(tmp_observations) < 2 on PPO raise ValueError: The parameter probs has invalid values

It seems that your code raises an error if the length of the trajectory is less than 2 (len(tmp_observations) < 2). I tested this with PPO; I don't know whether it happens with all algorithms.

The error:

ValueError: The parameter probs has invalid values

Error importing PPO

Hi,

I was using machin until today, but I think there was a very recent update, and now my code has stopped working (only for PPO; DQN still works).

With version 0.4.0 I get the following error when importing with "from machin.frame.algorithms import PPO":
ModuleNotFoundError: No module named 'machin.frame.helpers'

And if I install version 0.3.4, I get the following error when updating PPO:
AttributeError: 'PPO' object has no attribute 'grad_max'

Joao

[Question] Hybrid action space

Hey, I'm trying to implement a hybrid action space with an A2C agent; maybe you have some advice.
My expected output is two actions: one discrete, one continuous. The network predicts three things:

logits for the discrete action (sampled from a Categorical distribution)
mean of the distribution of the continuous action
std of the same distribution (sampled from a Normal)

The net outputs the sum of the log probabilities of the actions from both distributions (and likewise for the entropy). The network successfully learns the mean and std, but the weights of the logits layer are not updated at all. What could be the reason?
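
For reference, the "sum of log probabilities" described above can be written out in plain Python. This is only a minimal sketch that assumes the discrete and continuous parts are independent given the state; the function names are hypothetical and are not part of machin or PyTorch.

```python
import math


def categorical_log_prob(logits, index):
    """Log-probability of `index` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_z


def normal_log_prob(mean, std, value):
    """Log-density of `value` under Normal(mean, std)."""
    return (
        -((value - mean) ** 2) / (2 * std * std)
        - math.log(std)
        - 0.5 * math.log(2 * math.pi)
    )


def hybrid_log_prob(logits, discrete_action, mean, std, continuous_action):
    """Joint log-probability, assuming the two action parts are independent."""
    return categorical_log_prob(logits, discrete_action) + normal_log_prob(
        mean, std, continuous_action
    )


# Two equally likely discrete actions, standard normal continuous action.
joint = hybrid_log_prob([0.0, 0.0], 0, mean=0.0, std=1.0, continuous_action=0.0)
```

In a PyTorch version of this, the discrete term can only pass gradient to the logits layer if it is computed from the logits tensor itself rather than from a detached copy. That is one thing worth checking, not a diagnosis of the problem reported here.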

Hierarchical discrete action space

Hi! Thanks for your excellent work!
I have tried several RL frameworks based on PyTorch, and machin is one of the few libraries that discusses hybrid action spaces.

I'm working on a complex environment with a hierarchical action space. I hope you can give me some advice!

To explain what a hierarchical action space means, here is an example from the paper Generalising Discrete Action Spaces with Conditional Action Trees. Figure 2 in the paper shows the actions decomposed into an action tree. One first selects a first-level action and then a second-level action. The first level has 3 actions, and the action space of the second level depends on the first-level choice.

Here is one possible solution:

change the transition part:

transition = { "state": {"some_state": old_state} , ...  } # old
transition = {"state": {"some_state": old_state, "valid_actions": valid_action_set } , ... }

Here, "valid_actions" contains all the possible second-level actions given the first-level action.

change the agent sampling(explore) flow:

state = env.reset()  # state has the keys "some_state" and "valid_actions"; initially, valid_actions is pre-defined manually. For example, if we choose the first-level action 'use', valid_actions will be 'food'
done = False
while not done:
    action2 = agent.act2(state)  # choose action2 from the valid action set; here, action2 is 'food'
    action1 = agent.act1(state)  # choose action1 from the first-level action space; here, suppose action1 is 'move'
    env.first_level(action1)  # tell env that the valid action set in the next step is 'up, down, right, left' under the 'move' branch
    next_state, reward, done = env.step(action2)  # next_state has the keys "some_state" and "valid_actions"; valid_actions is 'up, down, right, left'
    state = next_state

change the actor network agent.act2 to:

class act2(nn.Module):
    def forward(self, state):
        # state = {"some_state": <...>, "valid_actions": <...>}
        some_state, second_level_actions = state["some_state"], state["valid_actions"]
        ...  # calculate the similarity between the state and the valid actions, and output logits
        return (...), None

Do you think this solution is reasonable? Is there any better way to support such a conditional hierarchical action space?
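
One common alternative to scoring a variable set of valid actions is to keep a fixed second-level action list and mask out the entries that are invalid under the chosen first-level action. The sketch below shows that idea in plain Python. The action tree, the names, and the ordering (first level chosen before second level) are assumptions made for illustration and are not taken from machin or from the paper.

```python
import math
import random

# Hypothetical action tree, used only for illustration.
ACTION_TREE = {
    "move": ["up", "down", "left", "right"],
    "use": ["food", "tool"],
    "wait": ["noop"],
}
SECOND_LEVEL = ["up", "down", "left", "right", "food", "tool", "noop"]


def masked_softmax(logits, valid_mask):
    """Softmax over the valid entries only; invalid entries get probability 0."""
    m = max(x for x, ok in zip(logits, valid_mask) if ok)
    exps = [math.exp(x - m) if ok else 0.0 for x, ok in zip(logits, valid_mask)]
    total = sum(exps)
    return [e / total for e in exps]


def sample_second_level(first_action, logits, rng):
    """Sample a second-level action that is valid under `first_action`."""
    valid = set(ACTION_TREE[first_action])
    mask = [name in valid for name in SECOND_LEVEL]
    probs = masked_softmax(logits, mask)
    return rng.choices(SECOND_LEVEL, weights=probs, k=1)[0], probs


rng = random.Random(0)
action2, probs = sample_second_level("use", [0.1] * len(SECOND_LEVEL), rng)
```

With a network, the same mask would be applied to the output logits before sampling, so that the log-probability stored for training matches the distribution the action was actually drawn from.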

Multi Discrete Action Spaces

Hello,

Does machin support multi-discrete action spaces (two different actions in the same time step)?
I've looked through the documentation but cannot find anything related to that.

João

Apex-ddpg cannot use GPU

Apex-ddpg cannot use the GPU. The tests for apex-ddpg all seem to use the CPU; is this feature currently not supported? I tried changing the actor and critic networks in ddpg_apex.py to use cuda:0, but I get the following error. I tried the default example and also world size=2 (1 worker and 1 sampler) to rule out self-deadlocks caused by the multiple processes.

RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Variable lengths samples in batch in update()

Hey, I'm using DQN, and my Q-values are variable-length sequences because each state has a different number of actions (my states also have different shapes). When sampling batches, the default Buffer concatenates them, which leads to a tensor error. Using update() with concatenate_samples=False still doesn't solve the problem, because the samples are then just lists and all torch operations fail. Of course, I can pad the sequences, but then argmax() can return one of the padded indexes, since it's not possible to pass the original length of each sample in the batch to the update() function. Is there any way to solve this problem right now, or is it yet to be implemented?
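
The padding concern can be illustrated without any library: if the true length of each row is kept next to the padded values, the argmax can be restricted to the real entries. This is a minimal plain-Python sketch with hypothetical helper names; it does not use or modify machin's Buffer or update().

```python
def pad_q_values(q_lists, pad_value=float("-inf")):
    """Pad variable-length Q-value lists to a rectangle and keep the true lengths."""
    lengths = [len(q) for q in q_lists]
    width = max(lengths)
    padded = [list(q) + [pad_value] * (width - len(q)) for q in q_lists]
    return padded, lengths


def masked_argmax(padded, lengths):
    """Argmax restricted to the first `length` entries of each row."""
    return [max(range(n), key=lambda i: row[i]) for row, n in zip(padded, lengths)]


padded, lengths = pad_q_values([[0.2, 0.9], [-3.0, -1.0, -2.0], [0.5]])
best = masked_argmax(padded, lengths)
```

Padding with negative infinity has a similar effect for a plain tensor argmax, since a padded slot can then never beat a real Q-value, provided every row has at least one real entry.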

Can your_first_program be trained on a GPU?

I've tried to train your_first_program on a GPU by
uncommenting static_module_wrapper lines like so:

q_net = static_module_wrapper(q_net, "cuda", "cuda")
q_net_t = static_module_wrapper(q_net_t, "cuda", "cuda")

and got an error:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

The combination

q_net = static_module_wrapper(q_net,  "cuda", "cpu")
q_net_t = static_module_wrapper(q_net_t,  "cuda", "cpu")

fails too. The following setting runs, but doesn't use the GPU:

q_net = static_module_wrapper(q_net, "cpu", "cuda")
q_net_t = static_module_wrapper(q_net_t, "cpu", "cuda")

What should I do to train your_first_program on a GPU?
Here is requirements.txt:

absl-py==0.10.0
astor==0.8.1
astunparse==1.6.3
backcall==0.2.0
brotlipy==0.7.0
cachetools==4.1.1
certifi==2020.6.20
cffi==1.13.2
chardet==3.0.4
cloudpickle==1.6.0
colorlog==4.4.0
cryptography @ file:///tmp/build/80754af9/cryptography_1601046817403/work
cycler==0.10.0
decorator==4.4.2
dill==0.3.2
dm-reverb==0.1.0
dm-tree==0.1.5
EasyProcess==0.3
future==0.18.2
gast==0.3.3
gin-config==0.3.0
google-auth==1.22.1
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
GPUtil==1.4.0
graphviz==0.14.2
grpcio==1.33.1
gym==0.17.3
h5py @ file:///tmp/build/80754af9/h5py_1593454119955/work
idna==2.10
imageio==2.9.0
imageio-ffmpeg==0.4.2
importlib-metadata==2.0.0
install==1.3.4
ipython @ file:///tmp/build/80754af9/ipython_1598883837425/work
ipython-genutils==0.2.0
jedi @ file:///tmp/build/80754af9/jedi_1596490743326/work
Keras-Applications @ file:///tmp/build/80754af9/keras-applications_1594366238411/work
Keras-Preprocessing==1.1.2
kiwisolver==1.2.0
machin==0.3.4
Markdown==3.3.2
matplotlib==3.3.2
mkl-fft==1.2.0
mkl-random==1.1.1
mkl-service==2.3.0
moviepy==1.0.3
numpy==1.18.5
oauthlib==3.1.0
opt-einsum==3.3.0
pandas @ file:///tmp/build/80754af9/pandas_1602088128026/work
parso==0.7.0
pexpect @ file:///tmp/build/80754af9/pexpect_1594383317248/work
pickleshare @ file:///tmp/build/80754af9/pickleshare_1594384075987/work
Pillow==8.0.1
portpicker==1.3.1
proglog==0.1.9
progressbar==2.5
prompt-toolkit @ file:///tmp/build/80754af9/prompt-toolkit_1602688806899/work
protobuf==3.13.0
psutil==5.7.2
ptyprocess==0.6.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.19
pyglet==1.5.0
Pygments @ file:///tmp/build/80754af9/pygments_1600458456400/work
pyOpenSSL @ file:///tmp/build/80754af9/pyopenssl_1594392929924/work
pyparsing==2.4.7
PySocks @ file:///tmp/build/80754af9/pysocks_1594394576006/work
python-dateutil==2.8.1
pytz==2020.1
PyVirtualDisplay @ file:///home/conda/feedstock_root/build_artifacts/pyvirtualdisplay_1602367622068/work
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
scipy==1.5.3
six==1.15.0
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorboardX==2.1
tensorflow==2.3.1
tensorflow-estimator==2.3.0
tensorflow-probability==0.11.1
termcolor==1.1.0
torch==1.6.0
torchvision==0.5.0
torchviz==0.0.1
tqdm==4.50.2
traitlets @ file:///tmp/build/80754af9/traitlets_1602787416690/work
urllib3==1.25.11
wcwidth @ file:///tmp/build/80754af9/wcwidth_1593447189090/work
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.3.1

[FEATURE] Is there a tutorial for maddpg.py

Hi,
this project is really awesome and the code is well structured!

Is your feature request related to a problem? Please describe.
I have run some of the code in examples/tutorials, but cannot find anything about MARL algorithms such as MADDPG.

Describe the solution you'd like
Since you have already implemented maddpg.py and test_maddpg.py, I am wondering whether you could write a tutorial for maddpg.py too. (It would be even better to implement more MARL algorithms, such as COMA, QMIX, and VDN.)

"index_add_(): self and source must have the same scalar type" while training DQN rainbow

When I try to train the DQN rainbow agent, I get the error "index_add_(): self and source must have the same scalar type" when the weights are updated.

I'm using a simple model to understand the problem:

import torch as t
import torch.nn as nn


class QNet(nn.Module):
    # this test setup lacks the noisy linear layer and dueling structure.
    def __init__(self, action_num, atom_num=10):
        super(QNet, self).__init__()

        self.hidden_in = nn.Conv2d(4, 64, kernel_size = 4, stride = 2)
        self.fc1 = nn.Linear(105280, 64)
        self.fc2 = nn.Linear(64, 16)
        self.fc3 = nn.Linear(16, action_num * atom_num)
        self.action_num = action_num
        self.atom_num = atom_num
    
        self.flat = nn.Flatten()

    def forward(self, state):
        a = t.relu(self.hidden_in(state))
        a = self.flat(a)
    
        a = t.relu(self.fc1(a))
        a = t.relu(self.fc2(a))
        return t.softmax(self.fc3(a)
                         .view(-1, self.action_num, self.atom_num),
                     dim=-1)

I'm unable to understand where the tensor changes its type.
The input has shape (4, 50, 50) and dtype float32.

AttributeError: module 'torch.distributed.rpc' has no attribute 'rpc_sync' when running tutorials

Hello, when I am trying to run a tutorial script, e.g. the your_first_program example, I always encounter this AttributeError during the imports:

AttributeError: module 'torch.distributed.rpc' has no attribute 'rpc_sync'

However, I fulfill the listed requirements. Is there anything I am missing, or how can I solve this?

Failed to run IMPALA and MADDPG from examples folder

First of all, I want to express my gratitude for your efforts in writing such a cool library and such nice documentation. I discovered your work a few days ago and I am trying to use it.

SYSTEM:

OS: Manjaro [Arch Linux]
Version: v0.4.1
Python version: 3.9.5

Getting the repo:

git clone https://github.com/iffiX/machin
cd machin
git checkout tags/v0.4.1

[MADDPG] Describe the bug
I am trying to run this code from your example folder, but I get the following error message:

    [optimizer(acc.parameters(), lr=actor_learning_rate) for acc in ac]
TypeError: 'list' object is not callable

It seems that the optimizer is not a callable but instead I have the following data on it

[IMPALA] Describe the bug
Also, I am trying to run this code, but I am facing another problem. The application runs without any kind of feedback, and at some point I receive a connection error due to a timeout.

  store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.

As far as I can tell from debugging, the cause is World(world_size=4, rank=rank, name=str(rank), rpc_timeout=20), and more precisely this part of world.py:

dist.init_process_group(
    backend=dist_backend,
    init_method=dist_init_method,
    timeout=timedelta(seconds=dist_timeout),
    rank=rank,
    world_size=world_size,
)

[TEST FAILS] Describe the bug
One more thing I tried was to run run_linux_test.sh, this time from the main branch (everything else was from version v0.4.1). Here is the output, in case it matters for the two bugs described above or for anything else.

================================== test session starts ===================================
platform linux -- Python 3.9.5, pytest-6.0.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/vlad/Documents/TradingBotRL/machin, configfile: pytest.ini
plugins: metadata-1.11.0, repeat-0.8.0, html-1.22.1
collected 949 items / 3 errors / 31 deselected / 915 selected                            

========================================= ERRORS =========================================
_______________________ ERROR collecting test/auto/test_dataset.py _______________________
test/auto/test_dataset.py:1: in <module>
    from machin.auto.dataset import determine_precision, DatasetResult
machin/auto/__init__.py:2: in <module>
    from . import envs
machin/auto/envs/__init__.py:1: in <module>
    from . import openai_gym
machin/auto/envs/openai_gym.py:5: in <module>
    from ..dataset import DatasetResult, RLDataset, log_video, determine_precision
machin/auto/dataset.py:94: in <module>
    class RLDataset(IterableDataset):
venv/lib/python3.9/site-packages/torch/utils/data/_typing.py:273: in __new__
    return super().__new__(cls, name, bases, namespace, **kwargs)  # type: ignore[call-overload]
/usr/lib/python3.9/abc.py:85: in __new__
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
venv/lib/python3.9/site-packages/torch/utils/data/_typing.py:370: in _dp_init_subclass
    raise TypeError("Expected 'Iterator' as the return annotation for `__iter__` of {}"
E   TypeError: Expected 'Iterator' as the return annotation for `__iter__` of RLDataset, but found typing.Iterable
______________________ ERROR collecting test/auto/test_launcher.py _______________________
test/auto/test_launcher.py:2: in <module>
    from machin.auto.launcher import Launcher
machin/auto/__init__.py:2: in <module>
    from . import envs
machin/auto/envs/__init__.py:1: in <module>
    from . import openai_gym
machin/auto/envs/openai_gym.py:5: in <module>
    from ..dataset import DatasetResult, RLDataset, log_video, determine_precision
machin/auto/dataset.py:94: in <module>
    class RLDataset(IterableDataset):
venv/lib/python3.9/site-packages/torch/utils/data/_typing.py:273: in __new__
    return super().__new__(cls, name, bases, namespace, **kwargs)  # type: ignore[call-overload]
/usr/lib/python3.9/abc.py:85: in __new__
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
venv/lib/python3.9/site-packages/torch/utils/data/_typing.py:370: in _dp_init_subclass
    raise TypeError("Expected 'Iterator' as the return annotation for `__iter__` of {}"
E   TypeError: Expected 'Iterator' as the return annotation for `__iter__` of RLDataset, but found typing.Iterable
___________________ ERROR collecting test/auto/env/test_openai_gym.py ____________________
test/auto/env/test_openai_gym.py:11: in <module>
    from machin.auto.envs.openai_gym import (
machin/auto/__init__.py:2: in <module>
    from . import envs
machin/auto/envs/__init__.py:1: in <module>
    from . import openai_gym
machin/auto/envs/openai_gym.py:5: in <module>
    from ..dataset import DatasetResult, RLDataset, log_video, determine_precision
machin/auto/dataset.py:94: in <module>
    class RLDataset(IterableDataset):
venv/lib/python3.9/site-packages/torch/utils/data/_typing.py:273: in __new__
    return super().__new__(cls, name, bases, namespace, **kwargs)  # type: ignore[call-overload]
/usr/lib/python3.9/abc.py:85: in __new__
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
venv/lib/python3.9/site-packages/torch/utils/data/_typing.py:370: in _dp_init_subclass
    raise TypeError("Expected 'Iterator' as the return annotation for `__iter__` of {}"
E   TypeError: Expected 'Iterator' as the return annotation for `__iter__` of RLDataset, but found typing.Iterable
--- generated html file: file:///home/vlad/Documents/TradingBotRL/machin/test_api.html ---
================================ short test summary info =================================
ERROR test/auto/test_dataset.py - TypeError: Expected 'Iterator' as the return annotati...
ERROR test/auto/test_launcher.py - TypeError: Expected 'Iterator' as the return annotat...
ERROR test/auto/env/test_openai_gym.py - TypeError: Expected 'Iterator' as the return a...
!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 3 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!
============================ 31 deselected, 3 errors in 2.87s ============================
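
The three collection errors above all say that the return annotation of `__iter__` on RLDataset is typing.Iterable where the installed PyTorch expects 'Iterator'. The sketch below shows the annotation shape the message asks for, using a stand-in class so that it runs without PyTorch. Whether changing the annotation is the right fix for machin itself is an assumption; the class name here is hypothetical.

```python
from typing import Iterator, get_type_hints


class CountingDataset:
    """Stand-in for an IterableDataset subclass; torch is deliberately not imported."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __iter__(self) -> Iterator[int]:  # annotated as Iterator, not Iterable
        return iter(range(self.n))


# Inspect the annotation the same way a metaclass check could.
hints = get_type_hints(CountingDataset.__iter__)
items = list(CountingDataset(3))
```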

[FEATURE] Large Transition Batch Size

Is your feature request related to a problem? Please describe.

Transition batch size

I read the RL tutorial in Spinning Up. I found that for on-policy RL, their pseudocode has a step that collects a set of trajectories. However, in your documentation "Data flow in Machin", you point out that

Currently, the constructor of the default transition implementation Transition requires batch size to be 1

and

Buffer.store_episode(): If you pass in a dict type transition object, it will be automatically converted to Transition

In your PPO example, examples/ppo.py, it seems that you save only one trajectory per iteration and then update. What should I do if I want to save a set of trajectories? Will such a change affect the update() part?

Multiple trajectories with one reward

In general, one trajectory (episode) has one total reward. However, I encountered a case where multiple trajectories share only one total reward. For example:

(trajectory1: [s,a,0,s,a,0,...,s,a], trajectory2: [s,a,0,s,a,0,...,s,a], trajectory3: [s,a,0,s,a,0,...,s,a]) ---> final reward
Imagine many football players playing the same game. They all receive the same reward, and only when a goal is scored.
Imagine generating a batch of noise sequences to attack a neural network. Only one reward is returned, which indicates the degraded performance of the network.

I give two solutions to this problem in the next section, but I am not sure which one is better. Could you give me some advice?

Describe the solution you'd like
For feature 1, it may be realized as below (I am not sure whether this affects the update() part):

while episode < max_episodes:
    for i in range(sub_episode_size):  # add a loop here
        episode += 1
        tmp_observations = []
        while not terminal and step <= max_steps:
            # ....
            tmp_observations.append(...)  # store transition
        ppo.store_episode(tmp_observations)
    ppo.update()
    # clean buffer

For feature 2, there may be two solutions. One is to assign the final reward to every trajectory:

from collections import defaultdict

while episode < max_episodes:
    batch_trajectory = defaultdict(list)
    episode += batch_size
    while not terminal and step <= max_steps:
        # use the batch state to generate the batch action
        reward = env.step(batch_action)
        for i in range(batch_size):
            # assign the same reward to each trajectory
            batch_trajectory[i].append((batch_state[i], batch_action[i], reward))

    for i in range(batch_size):
        ppo.store_episode(batch_trajectory[i])
    ppo.update()
    # clean buffer
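
The first solution for feature 2 can be made concrete with a small runnable sketch: every trajectory gets zero reward on intermediate steps and the shared final reward on its last step. The dictionary keys below are simplified and hypothetical; a real machin transition also needs fields such as next_state, so this only shows the reward assignment.

```python
def assign_shared_reward(trajectories, final_reward):
    """Copy the trajectories, putting `final_reward` on each last step and 0 elsewhere."""
    rewarded = []
    for steps in trajectories:
        new_steps = []
        for i, (state, action) in enumerate(steps):
            is_last = i == len(steps) - 1
            new_steps.append(
                {
                    "state": state,
                    "action": action,
                    "reward": final_reward if is_last else 0.0,
                    "terminal": is_last,
                }
            )
        rewarded.append(new_steps)
    return rewarded


# Two trajectories of different lengths that share one final reward.
batch = [[("s0", "a0"), ("s1", "a1")], [("s0", "a2")]]
episodes = assign_shared_reward(batch, final_reward=1.0)
```

Each element of `episodes` could then be stored as its own episode, so the existing per-episode return computation would see the shared reward at the end of every trajectory.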

Describe alternatives you've considered

For feature 2, another solution may be to resort to multi-agent RL. Each agent manages one trajectory, and they all receive the same reward from the environment. I found that Machin has a multi-agent algorithm implementation called MADDPG. From Spinning Up, I understand that this algorithm is only for continuous action spaces. Is there any plan to implement other multi-agent RL algorithms, such as multi-agent PPO for discrete action spaces?

Additional context

ImportError: cannot import name 'FileStore'

File "D:\Anaconda3\envs\universe\lib\site-packages\torch\distributed\rendezvous.py", line 9, in <module>
from . import FileStore, TCPStore
ImportError: cannot import name 'FileStore'

After installing machin, running PPO.py reports an error, and the other examples report the same error, as follows:

D:\Anaconda3\envs\universe\python.exe F:/machin/machin-master/examples/framework_examples/dqn.py
Traceback (most recent call last):
File "F:/machin/machin-master/examples/framework_examples/dqn.py", line 1, in <module>
from machin.frame.algorithms import DQN
File "D:\Anaconda3\envs\universe\lib\site-packages\machin\__init__.py", line 1, in <module>
from . import env, frame, model, parallel, utils
File "D:\Anaconda3\envs\universe\lib\site-packages\machin\env\__init__.py", line 1, in <module>
from . import utils, wrappers
File "D:\Anaconda3\envs\universe\lib\site-packages\machin\env\wrappers\__init__.py", line 1, in <module>
from . import base, openai_gym
File "D:\Anaconda3\envs\universe\lib\site-packages\machin\env\wrappers\openai_gym.py", line 8, in <module>
from machin.parallel.exception import ExceptionWithTraceback
File "D:\Anaconda3\envs\universe\lib\site-packages\machin\parallel\__init__.py", line 2, in <module>
from . import distributed, server, assigner, exception, pickle, thread, pool, queue
File "D:\Anaconda3\envs\universe\lib\site-packages\machin\parallel\distributed\__init__.py", line 1, in <module>
from .world import (
File "D:\Anaconda3\envs\universe\lib\site-packages\machin\parallel\distributed\world.py", line 14, in <module>
import torch.distributed.distributed_c10d as dist_c10d
File "D:\Anaconda3\envs\universe\lib\site-packages\torch\distributed\distributed_c10d.py", line 10, in <module>
from .rendezvous import rendezvous, register_rendezvous_handler # noqa: F401
File "D:\Anaconda3\envs\universe\lib\site-packages\torch\distributed\rendezvous.py", line 9, in <module>
from . import FileStore, TCPStore
ImportError: cannot import name 'FileStore'

Process finished with exit code 1

[FEATURE] Custom replay buffers

Sparse rewards are a problem for everyone in RL who deals with robotics and similar domains.
Just as it's possible to create your own networks (which is awesome), I think it would be useful to be able to create your own replay buffers to implement things like Hindsight Experience Replay.
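
To make the request concrete, here is a minimal plain-Python sketch of the relabeling step that a Hindsight Experience Replay buffer would need, using the "final" strategy (the last achieved state is treated as the goal). The transition fields and function names are hypothetical and do not correspond to machin's buffer API.

```python
def sparse_reward(achieved, goal):
    """0 when the goal is reached, -1 otherwise."""
    return 0.0 if achieved == goal else -1.0


def her_relabel_final(episode, reward_fn):
    """Relabel an episode as if its last achieved state had been the goal."""
    new_goal = episode[-1]["achieved"]
    return [
        {
            "state": step["state"],
            "action": step["action"],
            "achieved": step["achieved"],
            "goal": new_goal,
            "reward": reward_fn(step["achieved"], new_goal),
        }
        for step in episode
    ]


# A failed episode: the original goal 5 is never reached.
episode = [
    {"state": 0, "action": 1, "achieved": 1, "goal": 5, "reward": -1.0},
    {"state": 1, "action": 1, "achieved": 2, "goal": 5, "reward": -1.0},
]
relabeled = her_relabel_final(episode, sparse_reward)
```

A custom buffer would store both the original and the relabeled transitions, so that even failed episodes contribute at least one non-negative reward.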

[ALTER] Readability - black standard formatting

Where to alter
It's best to alter throughout the entire library, but my specific pain started with ppo.py.

Why to alter

Custom formatting is less readable than standard black formatting.
Standard formatting means minimal diffs on any future edits - easier to review

How to alter

Run black https://github.com/psf/black on entire repository
[OPTIONAL] add this commit to list of ignored commits for git blame as described here https://akrabat.com/ignoring-revisions-with-git-blame/ - I think this can be skipped because so far you are the main author of the library and git history is not that rich.
[OPTIONAL] add pre-commit hook to run black on all future commits automatically https://black.readthedocs.io/en/stable/version_control_integration.html

Example: originally, this is how the code that I needed to debug looks:

            batch_size, (state, action, advantage) = \
                self.replay_buffer.sample_batch(self.batch_size,
                                                sample_method="random_unique",
                                                concatenate=concatenate_samples,
                                                sample_attrs=[
                                                    "state", "action", "gae"],
                                                additional_concat_attrs=[
                                                    "gae"
                                                ])

That's a festival of different indentation levels.

Here is how it looks after black reformatting:

            batch_size, (state, target_value) = self.replay_buffer.sample_batch(
                self.batch_size,
                sample_method="random_unique",
                concatenate=concatenate_samples,
                sample_attrs=["state", "value"],
                additional_concat_attrs=["value"],
            )

A2C entropy minimized instead of maximized

Hi,

I think the sign of the entropy term in A2C is wrong:

if new_action_entropy is not None:
    act_policy_loss += self.entropy_weight * new_action_entropy.mean()

Instead, it should be:

if new_action_entropy is not None:
    act_policy_loss -= self.entropy_weight * new_action_entropy.mean()

Best,

Lorenzo
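
The effect of the sign can be checked with a small plain-Python calculation. In this sketch the loss is minimized, so subtracting the weighted entropy favors higher-entropy (more exploratory) policies. The function names are hypothetical, and whether the library instead expects a negative entropy_weight for the same purpose is not stated in this report.

```python
import math


def entropy(probs):
    """Shannon entropy of a categorical distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def loss_with_entropy_bonus(policy_loss, probs, entropy_weight):
    """Loss to be minimized; the entropy term is subtracted as a bonus."""
    return policy_loss - entropy_weight * entropy(probs)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

# With the same policy loss, the higher-entropy policy gets the lower total loss.
loss_uniform = loss_with_entropy_bonus(1.0, uniform, entropy_weight=0.01)
loss_peaked = loss_with_entropy_bonus(1.0, peaked, entropy_weight=0.01)
```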
