
michaelnny / deep_rl_zoo

92 stars, 5 watchers, 7 forks, 282.51 MB

A collection of Deep Reinforcement Learning algorithms implemented with PyTorch to solve Atari games and classic control tasks like CartPole, LunarLander, and MountainCar.

License: Apache License 2.0

Python 99.53% Shell 0.26% TeX 0.21%
dqn r2d2 never-give-up agent57 retrace rainbow ppo c51 iqn qr-dqn

deep_rl_zoo's People

Contributors

michaelnny


deep_rl_zoo's Issues

R2D2 quickly eats up memory

I'm not exactly sure what I'm doing wrong here. I have the replay_buffer size set to 1e6, 1 TB of system memory, an A100 GPU, and only 8 actors running. It quickly fills up my memory (my VRAM is fine), even when I halve the replay_buffer size. Can you suggest parameters that would let me start testing R2D2? Thank you
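
A rough back-of-the-envelope estimate (a sketch with assumed R2D2 Atari settings, not this repository's exact defaults) shows why a 1e6-capacity sequence replay can exhaust even 1 TB of system memory:

    # Rough replay memory estimate for an R2D2-style sequence replay on Atari.
    # Every number below is an illustrative assumption, not this repository's default.
    capacity = 1_000_000          # replay capacity, counted in unroll sequences
    unroll_length = 80 + 40       # learn steps + burn-in per sequence (assumed)
    frame_bytes = 84 * 84 * 4     # one stacked uint8 observation (84x84, 4 frames)
    per_step = frame_bytes * 2    # s_t and s_tp1 stored separately (worst case)
    per_sequence = unroll_length * per_step
    total_bytes = capacity * per_sequence
    print(f'~{total_bytes / 1e12:.1f} TB')  # roughly 6.8 TB, far beyond 1 TB of RAM

Under those assumptions, dropping the capacity to around 1e5 sequences, or storing single unstacked frames and re-stacking at sample time, brings the footprint back into a workable range.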

nonlinear_bellman.py

c_t = torch.minimum(torch.tensor(1.0), pi_a_t / (mu_t + eps)) * lambda_

mu_t should be mu_a_t, right? Is this a bug, or is it intentional?
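
For context, here is a minimal sketch of the truncated importance weights the reporter seems to have in mind, where the behaviour probability of the taken action (a hypothetical mu_a_t, gathered the same way as pi_a_t) is used in the ratio:

    import torch

    eps = 1e-8
    lambda_ = 0.95

    # pi_a_t and mu_a_t: [T, B] probabilities of the action actually taken under the
    # target and behaviour policies, e.g. gathered with probs.gather(-1, a_t.unsqueeze(-1)).
    def truncated_importance_weights(pi_a_t: torch.Tensor, mu_a_t: torch.Tensor) -> torch.Tensor:
        # c_t = lambda * min(1, pi(a_t|s_t) / mu(a_t|s_t)), i.e. the clipped ratio in the quoted line.
        return torch.clamp(pi_a_t / (mu_a_t + eps), max=1.0) * lambda_

Whether the repository's mu_t already holds the per-action probability (which would make the original line correct) is exactly the question above.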

Autopilot Updating Notes

Hello, the content of your repository is very comprehensive and I have benefited from it a lot. Could you give a mention to my notes, where I share my understanding of autonomous driving? I hope everyone will join me in continuously improving the material. Thank you!

Autopilot-Updating-Notes

agent57 priorities

I am not sure how you arrived at this in Agent57 (agent.py); I don't see it mentioned in the original paper.
line 676 "priorities = 0.8 * ext_priorities + 0.2 * int_priorities"
Thanks

the self.add() of "Unroll" in replay.py

Hi, @michaelnny ,

Thanks for your repository, it has helped me a lot. I encountered an issue while using it and would like to ask for your advice.

When using the R2D2 method, data generated by the interaction between the actor and the environment is first stored in an Unroll. Then, when the Unroll is full or when done=True, the data inside the Unroll is placed in a queue.

    def add(self, transition: Any, done: bool) -> Union[ReplayStructure, None]:
        """Add new transition into storage."""
        self._storage.append(transition)

        if self.full:
            return self._pack_unroll_into_single_transition()
        if done:
            return self._handle_episode_end()
        return None

    def _pack_unroll_into_single_transition(self) -> Union[ReplayStructure, None]:
        """Return a single transition object with transitions stacked with the unroll structure."""
        if not self.full:
            return None

        _sequence = list(self._storage)
        # Save for later use.
        self._last_unroll = copy.deepcopy(_sequence)
        self._storage.clear()

        # Handle overlap between adjacent unroll sequences.
        if self._overlap > 0:
            for transition in _sequence[-self._overlap :]:  # noqa: E203
                self._storage.append(transition)
        return self._stack_unroll(_sequence)

    def _handle_episode_end(self) -> Union[ReplayStructure, None]:
        """Handle episode end, incase no cross episodes, try to build a full unroll if last unroll is available."""
        if self._cross_episode:
            return None
        if self.size > 0 and self._last_unroll is not None:
            # In case the episode ends without reaching a full 'unroll length',
            # use whatever we got from the current unroll and fill in the missing part from the previous sequence.
            _suffix = list(self._storage)
            _prefix_indices = self._full_unroll_length - len(_suffix)
            _prefix = self._last_unroll[-_prefix_indices:]
            _sequence = list(itertools.chain(_prefix, _suffix))
            return self._stack_unroll(_sequence)
        else:
            return None
  1. The first question: does 'unroll_length' have to be smaller than the maximum episode length of the task? And does this value need to be adjusted based on the maximum episode length of each task?

I think the parameter 'unroll_length' should be set to less than the maximum episode length of the environment; otherwise the Unroll may never fill before done=True, and because self._last_unroll is still None the collected transitions are dropped. I'm unsure about this reasoning and would like to ask for your advice, thank you.

  2. I am currently using the MiniGrid environment. For instance, in MiniGrid-MultiroomS2N4 the maximum episode length is 40. I have set 'unroll_length' to 30 and 'burn_in' to 10. Despite running the R2D2 algorithm for one million steps, it has not converged, whereas Rainbow DQN converges within two hundred thousand steps. Have you tested the R2D2 algorithm on MiniGrid environments? I initially expected this kind of environment to do better with R2D2. This issue has been blocking me for two weeks, so I'm seeking your advice. Thank you very much.

Numerical issues when advantage_t.std() is 0 in PPO

Hi @michaelnny ,

When I ran experiments, I found cases of division by zero when advantage_t.std() is 0 during advantage normalization, which then yields -inf or inf values for advantage_t. Would it make sense to add a small epsilon value, e.g. 1e-10, to avoid the division by zero?
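
A minimal sketch of the suggested guard (the epsilon value and tensor name are illustrative, not this repository's exact code):

    import torch

    def normalize_advantages(advantage_t: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
        # Guard against a zero standard deviation (e.g. a batch of identical advantages),
        # which would otherwise turn the normalization into a division by zero.
        return (advantage_t - advantage_t.mean()) / (advantage_t.std() + eps)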

On a side note, thank you for this amazing repository. 💯

about data_queue in main_loop

I don't understand why the actor doesn't use data_queue; the only place the actor uses data_queue is data_queue.put('PROCESS_DONE'). I want to know how data is transferred from the actor to the learner.
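
For reference, a generic actor-to-learner handoff through a multiprocessing queue looks like the sketch below; the names (transition_queue, actor_process, learner_process) are hypothetical and this is only an illustration of the pattern, not a description of how deep_rl_zoo wires its processes:

    import multiprocessing as mp

    def actor_process(transition_queue: mp.Queue, num_steps: int) -> None:
        # Actor: interact with the environment and push transitions to the learner.
        for step in range(num_steps):
            transition = {'step': step}       # placeholder for (s, a, r, s', done)
            transition_queue.put(transition)
        transition_queue.put('PROCESS_DONE')  # signal that this actor has finished

    def learner_process(transition_queue: mp.Queue, num_actors: int) -> None:
        # Learner: consume transitions until every actor has signalled completion.
        finished = 0
        while finished < num_actors:
            item = transition_queue.get()
            if item == 'PROCESS_DONE':
                finished += 1
                continue
            # ... add `item` to the replay buffer and run learning steps ...

    if __name__ == '__main__':
        queue = mp.Queue()
        actors = [mp.Process(target=actor_process, args=(queue, 100)) for _ in range(2)]
        learner = mp.Process(target=learner_process, args=(queue, len(actors)))
        for p in actors + [learner]:
            p.start()
        for p in actors + [learner]:
            p.join()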

Bug in the gym_env module and trackers

The wrapper that collects 'raw_reward' in the info dict should be applied after frame skip and frame stack. Because the current code applies it before frame skip and frame stack, it re-counts the same reward multiple times.

The issue can be reproduced by running any agent on the Atari Pong game. The expected episode return should be around -20 or -21 at the start of training; with the current code, however, we get seemingly random values for the episode return.
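
As a hedged illustration of the ordering the report suggests (the wrapper names below are made up, not the repository's), a raw-reward-recording wrapper would sit outside the frame-skip and frame-stack wrappers so that each environment reward reaches the trackers exactly once per agent step:

    import gym

    class RecordRawReward(gym.Wrapper):
        """Writes the unclipped per-step reward into the info dict exactly once."""

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            info['raw_reward'] = reward  # already aggregated over skipped frames by the inner wrapper
            return obs, reward, done, info

    def make_env(env_id: str) -> gym.Env:
        env = gym.make(env_id)
        # Hypothetical inner wrappers, applied first:
        # env = FrameSkip(env, skip=4)
        # env = FrameStack(env, num_stack=4)
        # Record the raw reward on the outside, so it is counted once per agent step.
        return RecordRawReward(env)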

retro env

I'm a bit of a newbie; how can I replace the Atari env so that it supports retro environments?

multigpu agent57

I'm having issues running Agent57 on multiple GPUs the same way I run, say, R2D2, where I change
actor_devices = [runtime_device] * FLAGS.num_actors
to
actor_devices = [torch.device('cuda:0'), torch.device('cuda:1'), ...] * FLAGS.num_actors
This works for R2D2 but not for Agent57. How come?
Thank you.
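
As an aside, multiplying a multi-device list by FLAGS.num_actors yields more entries than there are actors; a round-robin assignment such as the sketch below (which only assumes the FLAGS.num_actors flag quoted above) is the usual way to spread actors across GPUs:

    import itertools

    import torch

    def assign_actor_devices(num_actors: int) -> list:
        # Round-robin the available CUDA devices across actors, e.g. with 2 GPUs and
        # 8 actors: cuda:0, cuda:1, cuda:0, cuda:1, ...
        available = [torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]
        if not available:
            available = [torch.device('cpu')]
        return [device for device, _ in zip(itertools.cycle(available), range(num_actors))]

    # actor_devices = assign_actor_devices(FLAGS.num_actors)

This does not by itself explain why Agent57 behaves differently from R2D2, but it keeps the device list the same length as the actor list.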

C51 agent not working on Pong

It seems like the C51 agent is not working on Pong, and it can be unstable on CartPole sometimes.

Normally, for DQN-like agents using an e-greedy policy, we'd expect the agent to make some progress once 'exploration_epsilon' drops below 0.7, but that's not the case for the C51 agent. What makes this issue more interesting is that the Rainbow agent works very well on Pong, and the code for Rainbow is almost identical to the C51 agent, except that C51 doesn't use noisy layers.

Things we've tried so far that still didn't solve the issue:

  • Using the same dueling architecture as in Rainbow
  • Increasing/decreasing the learning rate
  • Using a smaller/larger update frequency for the target Q-network
  • Training much longer than normal DQN and Rainbow
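
Since the C51 and Rainbow code paths are nearly identical here, one step worth sanity-checking is the categorical projection of the target distribution. Below is a self-contained sketch of the standard C51 projection (illustrative names and shapes, not this repository's implementation):

    import torch

    def project_categorical_target(next_probs, rewards, dones, gamma, support):
        """Project r + gamma * z onto the fixed support (standard C51 projection).

        next_probs: [B, num_atoms] probabilities of the greedy next action.
        rewards, dones: [B] float tensors; support: [num_atoms] atom values.
        """
        batch_size, num_atoms = next_probs.shape
        v_min, v_max = support[0].item(), support[-1].item()
        delta_z = (v_max - v_min) / (num_atoms - 1)

        # Bellman-backed-up atom locations, clipped to the support range.
        tz = rewards.unsqueeze(1) + gamma * (1.0 - dones.unsqueeze(1)) * support.unsqueeze(0)
        tz = tz.clamp(v_min, v_max)

        # Distribute each atom's probability mass onto its two neighbouring atoms.
        b = (tz - v_min) / delta_z
        lower, upper = b.floor().long(), b.ceil().long()
        lower[(upper > 0) & (lower == upper)] -= 1
        upper[(lower < num_atoms - 1) & (lower == upper)] += 1

        projected = torch.zeros_like(next_probs)
        projected.scatter_add_(1, lower, next_probs * (upper.float() - b))
        projected.scatter_add_(1, upper, next_probs * (b - lower.float()))
        return projected

The loss is then the cross-entropy between this projected distribution and the predicted log-probabilities for the taken action; a subtle sign or indexing slip in this step can still appear to learn on CartPole while failing on Pong.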

PPO returns and advantages should be fixed

According to the original PPO paper, the estimated returns and advantages are pre-computed for the unroll sequences and kept fixed across the K epochs of updates.

However, this implementation computes the returns and advantages on the fly during the parameter updates; that may not be correct, and we may need to fix it.
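
A minimal sketch of the schedule the paper describes: compute GAE returns and advantages once per rollout, detach them, and reuse the same fixed targets across all K epochs of minibatch updates (names and shapes here are illustrative, not this repository's code):

    import torch

    def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
        """Pre-compute GAE advantages and returns for a rollout of length T.

        rewards, dones: [T]; values: [T + 1], including the bootstrap value.
        """
        T = rewards.shape[0]
        advantages = torch.zeros_like(rewards)
        last_gae = 0.0
        for t in reversed(range(T)):
            not_done = 1.0 - dones[t]
            delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
            last_gae = delta + gamma * lam * not_done * last_gae
            advantages[t] = last_gae
        returns = advantages + values[:-1]
        return advantages.detach(), returns.detach()

    # advantages, returns = compute_gae(rewards, values, dones)
    # for epoch in range(K):                       # K epochs over the same rollout
    #     for minibatch in make_minibatches(advantages, returns, ...):
    #         ...                                  # clipped-surrogate update against fixed targets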

MEME paper

I'm currently reimplementing Agent57 to understand the concepts thoroughly. I'm almost done, except for batch inference, multi-actor support, Retrace, the meta controller, bandits, and the multi-head Q-value. Would you be interested in working together on an open-source MEME agent, the ultimate reinforcement learning agent, outside of this repository? I'm also considering swapping in GTrXL for the main model.

AttributeError: 'str' object has no attribute 'reset'

Hi,

I'm trying to get this running but I'm getting an attribute error when evaluating a DQN agent on LunarLander-v2 as a test. The training seems to have completed correctly and a ckpt file was created (although it seems small, only 38 kB). The traceback I'm getting is below. I'm running on Ubuntu 22.04 inside a virtual environment; I've had the same error with both venv and conda.

python3 -m deep_rl_zoo.dqn.eval_agent --environment_name=LunarLander-v2 --load_checkpoint_file=saved_checkpoints/DQN_LunarLander-v2_1.ckpt
I1117 23:38:12.875501 139881514845120 eval_agent.py:91] Environment: LunarLander-v2
I1117 23:38:12.875638 139881514845120 eval_agent.py:92] Action spec: 4
I1117 23:38:12.875766 139881514845120 eval_agent.py:93] Observation spec: 8
I1117 23:38:14.595612 139881514845120 main_loop.py:662] Testing iteration 0
Traceback (most recent call last):
File "/home/sdmeers/miniconda3/envs/deep_rl_zoo/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/sdmeers/miniconda3/envs/deep_rl_zoo/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/sdmeers/OneDrive/Steve/Code/deep_rl_zoo/deep_rl_zoo/dqn/eval_agent.py", line 125, in
app.run(main)
File "/home/sdmeers/miniconda3/envs/deep_rl_zoo/lib/python3.9/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/home/sdmeers/miniconda3/envs/deep_rl_zoo/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/home/sdmeers/OneDrive/Steve/Code/deep_rl_zoo/deep_rl_zoo/dqn/eval_agent.py", line 114, in main
main_loop.run_evaluation_iterations(
File "/home/sdmeers/OneDrive/Steve/Code/deep_rl_zoo/deep_rl_zoo/main_loop.py", line 666, in run_evaluation_iterations
eval_stats = run_env_steps(num_eval_frames, eval_agent, eval_env, eval_tb_log_dir)
File "/home/sdmeers/OneDrive/Steve/Code/deep_rl_zoo/deep_rl_zoo/main_loop.py", line 133, in run_env_steps
stats = trackers_lib.generate_statistics(trackers, seq_truncated)
File "/home/sdmeers/OneDrive/Steve/Code/deep_rl_zoo/deep_rl_zoo/trackers.py", line 375, in generate_statistics
tracker.reset()
AttributeError: 'str' object has no attribute 'reset'

pip freeze is as follows

absl-py==1.3.0
ale-py==0.7.5
AutoROM==0.4.2
AutoROM.accept-rom-license==0.4.2
black==22.10.0
box2d-py==2.3.8
cachetools==5.2.0
certifi @ file:///croot/certifi_1665076670883/work/certifi
cfgv==3.3.1
charset-normalizer==2.1.1
click==8.1.3
cloudpickle==2.2.0
distlib==0.3.6
filelock==3.8.0
glfw==2.5.5
google-auth==2.14.1
google-auth-oauthlib==0.4.6
grpcio==1.50.0
gym==0.25.2
gym-notices==0.0.8
identify==2.5.8
idna==3.4
imageio==2.22.2
importlib-metadata==5.0.0
importlib-resources==5.10.0
Markdown==3.4.1
MarkupSafe==2.1.1
mujoco==2.2.2
mypy-extensions==0.4.3
nodeenv==1.7.0
numpy==1.23.4
oauthlib==3.2.2
opencv-python==4.6.0.66
pathspec==0.10.2
Pillow==9.3.0
platformdirs==2.5.4
pre-commit==2.20.0
protobuf==3.20.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
pygame==2.1.2
pyglet==1.5.27
PyOpenGL==3.1.6
python-snappy==0.6.1
PyYAML==6.0
requests==2.28.1
requests-oauthlib==1.3.1
rsa==4.9
six==1.16.0
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
toml==0.10.2
tomli==2.0.1
torch==1.12.1
torchsummary==1.5.1
torchvision==0.13.1
tqdm==4.64.1
typing_extensions==4.4.0
urllib3==1.26.12
virtualenv==20.16.7
Werkzeug==2.2.2
zipp==3.10.0
