
Comments (3)

puyuan1996 commented on May 29, 2024

Hello, we currently support multi-GPU training on a single node via PyTorch's Distributed Data Parallel (DDP); please refer to the discussion in issue #196. As for multi-node training, we plan to consider adding it in a future release.

At present, LightZero integrates the experiment monitoring and performance analysis support provided by DI-engine. For detailed information, please consult this document (https://github.com/opendilab/DI-engine-docs/blob/main/source/04_best_practice/training_generated_folders_zh.rst, Chinese version). For your convenience, we provide the following English summary and will subsequently integrate it fully into the codebase documentation. Thank you for your suggestion. Best regards.


puyuan1996 commented on May 29, 2024

Experimental monitoring and logging system in LightZero

LightZero generates log and checkpoint folders during the training process. The file tree generated is as follows:

cartpole_muzero
├── ckpt
│   ├── ckpt_best.pth.tar
│   ├── iteration_0.pth.tar
│   └── iteration_10000.pth.tar
├── log
│   ├── buffer
│   │   └── buffer_logger.txt
│   ├── collector
│   │   └── collector_logger.txt
│   ├── evaluator
│   │   └── evaluator_logger.txt
│   ├── learner
│   │   └── learner_logger.txt
│   └── serial
│       └── events.out.tfevents.1626453528.CN0014009700M.local
├── formatted_total_config.py
└── total_config.py

log/collector

The collector folder contains a file named collector_logger.txt, which records information about the collector's interaction with the environment, including the following statistics (a short sketch of how they relate to one another follows the list):

  • episode_count: the number of episodes collected
  • envstep_count: the number of envsteps collected
  • train_sample_count: the number of training samples collected
  • avg_envstep_per_episode: the average envstep per episode
  • avg_sample_per_episode: the average number of samples per episode
  • avg_envstep_per_sec: the average number of envsteps per second
  • avg_train_sample_per_sec: the average number of training samples per second
  • avg_episode_per_sec: the average number of episodes per second
  • collect_time: collection time
  • reward_mean: the average reward
  • reward_std: the standard deviation of the reward
  • each_reward: the reward for each episode of the collector's interaction with the environment.
  • reward_max: the maximum reward
  • reward_min: the minimum reward
  • total_envstep_count: the total envstep count
  • total_train_sample_count: the total number of training samples
  • total_episode_count: the total number of episodes
  • total_duration: the total duration
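
As a rough illustration of how these fields fit together, the averaged statistics are simple ratios of the raw counters above. The snippet below is a minimal, purely illustrative sketch of those assumed relationships; the numbers are hypothetical and this is not LightZero's actual logging code.

# Illustrative sketch only: assumed relationships between the collector
# statistics listed above, with hypothetical numbers.
each_reward = [195.0, 200.0, 187.0]       # reward of each collected episode
episode_count = len(each_reward)          # episodes collected this cycle
envstep_count = 600                       # env steps collected this cycle
train_sample_count = 120                  # training samples produced
collect_time = 1.5                        # seconds spent collecting

avg_envstep_per_episode = envstep_count / episode_count
avg_sample_per_episode = train_sample_count / episode_count
avg_envstep_per_sec = envstep_count / collect_time
avg_train_sample_per_sec = train_sample_count / collect_time
avg_episode_per_sec = episode_count / collect_time

reward_mean = sum(each_reward) / episode_count
reward_max, reward_min = max(each_reward), min(each_reward)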

log/evaluator

In the evaluator folder, there is a file named evaluator_logger.txt, which contains information about the evaluator's interaction with the environment.

  • [INFO]: [EVALUATOR]env x completes an episode, final reward: xxx, current episode: xxx
  • train_iter: the number of training iterations
  • ckpt_name: the model path, such as iteration_0.pth.tar
  • episode_count: episode count
  • envstep_count: envstep count
  • evaluate_time: the time spent by the evaluator
  • avg_envstep_per_episode: the average envstep per episode
  • avg_envstep_per_sec: the average envstep per second
  • avg_time_per_episode: the average time per episode (in seconds)
  • reward_mean: the average reward
  • reward_std: the standard deviation of the reward
  • each_reward: the reward for each episode of the evaluator's interaction with the environment.
  • reward_max: the maximum reward
  • reward_min: the minimum reward

log/learner

In the learner folder, there is a file named learner_logger.txt, which contains information about the learner.
The following information is generated during MuZero training:

Policy neural network architecture:

[04-08 13:12:59] INFO     [RANK0]: DI-engine DRL Policy                                                                                                base_learner.py:338
                          MuZeroModelMLP(                                                                                                                                 
                            (representation_network): RepresentationNetworkMLP(                                                                                           
                              (fc_representation): Sequential(                                                                                                            
                                (0): Linear(in_features=4, out_features=128, bias=True)                                                                                   
                                (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)                                                     
                                (2): ReLU(inplace=True)                                                                                                                   
                                (3): Linear(in_features=128, out_features=128, bias=True)                                                                                 
                                )
                              ...
Learner information:
    Grid table:
    | Name  | cur_lr_avg | total_loss_avg |
    |-------|------------|----------------|
    | Value | 0.001000   | 0.098996       |

log/serial

The statistics of the buffer, collector, evaluator, and learner are all written to a single events.out.tfevents file, which can be viewed with TensorBoard.

LightZero writes all TensorBoard data for an experiment into this single serial folder, rather than into one folder per component. When running a large number of experiments, say n, it is hard to keep track of 4*n separate TensorBoard files; keeping one serial folder per experiment makes the results much easier to manage.
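
To inspect these curves, you can point TensorBoard at the serial folder (tensorboard --logdir cartpole_muzero/log/serial) or read the scalars programmatically. Below is a minimal sketch using the standard TensorBoard EventAccumulator API; the experiment path and the commented tag name are assumptions, so list the available tags first.

# Minimal sketch: read scalars from the merged events file with the standard
# TensorBoard API. The path and the example tag name are assumptions; check
# ea.Tags()['scalars'] for the tags your run actually logged.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

ea = EventAccumulator('cartpole_muzero/log/serial')
ea.Reload()                              # parse the events.out.tfevents file

print(ea.Tags()['scalars'])              # list all logged scalar tags
# Once you know the exact tag name (hypothetical example below):
# events = ea.Scalars('collector_iter/reward_mean')
# steps = [e.step for e in events]
# values = [e.value for e in events]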

ckpt

In the ckpt folder, there are model parameter checkpoints:

  • ckpt_best.pth.tar: the best model so far, i.e. the checkpoint that achieved the highest evaluation score.
  • iteration_<iter>.pth.tar: checkpoints saved periodically, every fixed number of training iterations (e.g. iteration_10000.pth.tar).
    You can load a checkpoint with torch.load('ckpt_best.pth.tar'); a short sketch follows below.
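
The checkpoint is a Python dict whose exact keys depend on the learner (DI-engine checkpoints typically store the network weights under a 'model' entry), so it is safest to inspect the keys before loading. A minimal sketch, assuming that layout:

# Minimal sketch of restoring weights from a LightZero checkpoint.
# The 'model' key is an assumption based on DI-engine's usual checkpoint
# layout; print ckpt.keys() to confirm what your checkpoint contains.
import torch

ckpt = torch.load('cartpole_muzero/ckpt/ckpt_best.pth.tar', map_location='cpu')
print(ckpt.keys())                       # e.g. dict_keys(['model', ...])

# `policy_model` stands for the MuZero network built from your config
# (a hypothetical variable here); load the saved parameters into it:
# policy_model.load_state_dict(ckpt['model'])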


selfsim commented on May 29, 2024

Thanks for the information.
