Comments (5)

srama2512 commented on August 15, 2024

Hi @vincent341,

You might want to change the num_processes and map_batch_size to fit your requirements. The models were trained on 8 GPUs with 16 GB of memory each. This allowed training on 36 parallel environments spread across the GPUs, with DataParallel training for the Mapper.
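For reference, a minimal sketch of what "DataParallel training for the Mapper" could look like in plain PyTorch. `ToyMapper`, its layer sizes, and the device handling are placeholders and not the repo's actual mapper class or configuration; this only illustrates the general pattern described above.

```python
import torch
import torch.nn as nn

# Illustrative only: a stand-in mapper wrapped in nn.DataParallel so its
# forward/backward work is split across the available GPUs.
class ToyMapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 2, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
mapper = ToyMapper().to(device)  # module must live on the first DataParallel device
if torch.cuda.device_count() > 1:
    # Split mapper batches across all visible GPUs (placeholder device list).
    mapper = nn.DataParallel(mapper, device_ids=list(range(torch.cuda.device_count())))

out = mapper(torch.randn(8, 2, 64, 64, device=device))  # batch is scattered across GPUs
```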

from occupancyanticipation.

vincent341 commented on August 15, 2024

> Hi @vincent341,
>
> You might want to change the num_processes and map_batch_size to fit your requirements. The models were trained on 8 GPUs with 16 GB of memory each. This allowed training on 36 parallel environments spread across the GPUs, with DataParallel training for the Mapper.

Hi @srama2512,
Thanks for your reply. The program currently runs after setting 'NUM_PROCESSES' to 1 in the *.yaml config. I have some questions.

  1. Once 'NUM_PROCESSES' is set to a number larger than 1, I get an EOFError ("raise EOFError" in the traceback). I'm not sure whether it is caused by a shortage of CUDA memory. My current PC has a single GeForce 1080, which is an old 8 GB GPU. Is it possible to run your program on a PC with a single GPU? How long does training take on your 8-GPU machine? It would be great if you could give me some hints.

  2. In addition, would you mind letting me know what the 'NUM_PROCESSES' variable actually controls? Is there any relationship between NUM_PROCESSES and the number of GPUs?

  3. Regarding "NUM_GLOBAL_UPDATES" in this line: I suppose this line is the outer training loop. As far as I understand, the outer loop in RL training iterates over episodes. I found that NUM_GLOBAL_UPDATES is computed as NUM_GLOBAL_UPDATES = self.config.NUM_EPISODES * NUM_GLOBAL_UPDATES_PER_EPISODE // self.config.NUM_PROCESSES. Could you please explain the meaning of this computation?

  4. Training on 36 parallel environments can accelerate training. Does it also benefit the learning performance itself in any way?

  5. If I understand correctly, the mapper module is trained in a separately created process via "map_update_func". I tried to run it in debug mode (PyCharm IDE) to see what is going on, but it seems hard to debug a multiprocessing Python program. Would you mind letting me know how you debug it and which tools/IDE you use?

I would truly appreciate any help you could offer on the above questions.


srama2512 commented on August 15, 2024

Hi @vincent341,

  1. You could try NUM_PROCESSES=4 and reduce MAPPER.map_batch_size significantly (from 420 to, say, 16). Unfortunately, we have not tested on an 8 GB GPU. All our experiments were run on 8 GPUs, each with 16/32 GB of memory, for around 2 days.
  2. NUM_PROCESSES controls the number of parallel habitat environment instances. Ideally, you should aim for about 6 environments per 16 GB GPU. In the sample config, the 36 environments are spread over GPUs [2, 3, 4, 5, 6, 7], and the mapper training is spread over GPUs [1, 2, 3, 4, 5, 6, 7].
  3. Following the convention from habitat-baselines, we define the training code based on the number of policy updates. A global action is sampled every ans_config.goal_interval steps of environment interaction. NUM_GLOBAL_STEPS is the number of such global actions to take before updating the global policy. NUM_GLOBAL_UPDATES_PER_EPISODE measures how many such updates happen in an episode. NUM_GLOBAL_UPDATE therefore measures the number of global updates corresponding to the given number of total episodes (NUM_EPISODES) spread over all the parallel environments (NUM_PROCESSES). I hope this clarifies the computation (a worked numerical sketch follows this list).
  4. The reason for training on 36 parallel environments is that it gives diverse training data for the policy and the mapper. Each environment typically spawns the agent in a different 3D scene from Gibson (72 in total). This was adopted from the ActiveNeuralSLAM project (they use 72 environments instead of our 36).
  5. This was an optimization to train the mapper in parallel while data is being collected. You could modify the code to update sequentially instead of in parallel by making appropriate changes to the mapper update; that is what I did in early versions of the code (a sketch of both variants also follows this list).
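To make item 3 concrete, here is a small worked sketch of the computation, using the config values quoted later in this thread (NUM_EPISODES=10000, T_EXP=1000, NUM_GLOBAL_STEPS=20, goal_interval=25, NUM_PROCESSES=36). The variable names mirror the config keys, but the exact expressions in the repo may differ slightly.

```python
# Worked sketch of the NUM_GLOBAL_UPDATES computation described in item 3.
NUM_EPISODES = 10000      # total training episodes
T_EXP = 1000              # environment steps per episode
goal_interval = 25        # env steps between two global (long-term goal) actions
NUM_GLOBAL_STEPS = 20     # global actions collected before each global-policy update
NUM_PROCESSES = 36        # parallel habitat environments

# How many global-policy updates fit inside one episode.
NUM_GLOBAL_UPDATES_PER_EPISODE = T_EXP // (NUM_GLOBAL_STEPS * goal_interval)   # = 2

# Total updates for NUM_EPISODES episodes, shared across the parallel environments.
NUM_GLOBAL_UPDATES = NUM_EPISODES * NUM_GLOBAL_UPDATES_PER_EPISODE // NUM_PROCESSES

print(NUM_GLOBAL_UPDATES_PER_EPISODE, NUM_GLOBAL_UPDATES)  # prints: 2 555
```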

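For item 5, this is a minimal, self-contained sketch of the pattern being described: running the mapper update in a background process versus calling it inline so it is easy to step through in a debugger. All names here (`toy_mapper_update`, `map_update_worker`, the queue plumbing) are hypothetical and only illustrate the idea; the repo's actual map_update_func has a different signature.

```python
import torch.multiprocessing as mp  # stdlib multiprocessing works the same for this toy

def toy_mapper_update(batch):
    # Placeholder for one mapper update step (loss + backward + optimizer step).
    return sum(batch)

def map_update_worker(batch_queue, result_queue):
    # Background-process variant: keep pulling batches until a None sentinel arrives.
    while True:
        batch = batch_queue.get()
        if batch is None:
            break
        result_queue.put(toy_mapper_update(batch))

if __name__ == "__main__":
    batches = [[1, 2], [3, 4], [5, 6]]
    DEBUG_SEQUENTIAL = True  # flip to False to exercise the multiprocessing path

    if DEBUG_SEQUENTIAL:
        # Inline variant: breakpoints and stack traces behave normally in an IDE.
        results = [toy_mapper_update(b) for b in batches]
    else:
        # Parallel variant: hand batches to a worker process through queues.
        batch_q, result_q = mp.Queue(), mp.Queue()
        proc = mp.Process(target=map_update_worker, args=(batch_q, result_q))
        proc.start()
        for b in batches:
            batch_q.put(b)
        batch_q.put(None)
        proc.join()
        results = [result_q.get() for _ in batches]

    print(results)  # [3, 7, 11]
```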

vincent341 commented on August 15, 2024

Hi @srama2512,

Thanks so much for your detailed explanation. It truly helps a lot.


AgentEXPL commented on August 15, 2024

> Hi @vincent341,
>
>   1. You could try NUM_PROCESSES=4 and reduce MAPPER.map_batch_size significantly (from 420 to, say, 16). Unfortunately, we have not tested on an 8 GB GPU. All our experiments were run on 8 GPUs, each with 16/32 GB of memory, for around 2 days.
>   2. NUM_PROCESSES controls the number of parallel habitat environment instances. Ideally, you should aim for about 6 environments per 16 GB GPU. In the sample config, the 36 environments are spread over GPUs [2, 3, 4, 5, 6, 7], and the mapper training is spread over GPUs [1, 2, 3, 4, 5, 6, 7].
>   3. Following the convention from habitat-baselines, we define the training code based on the number of policy updates. A global action is sampled every ans_config.goal_interval steps of environment interaction. NUM_GLOBAL_STEPS is the number of such global actions to take before updating the global policy. NUM_GLOBAL_UPDATES_PER_EPISODE measures how many such updates happen in an episode. NUM_GLOBAL_UPDATE therefore measures the number of global updates corresponding to the given number of total episodes (NUM_EPISODES) spread over all the parallel environments (NUM_PROCESSES).
>   4. The reason for training on 36 parallel environments is that it gives diverse training data for the policy and the mapper. Each environment typically spawns the agent in a different 3D scene from Gibson (72 in total). This was adopted from the ActiveNeuralSLAM project (they use 72 environments instead of our 36).
>   5. This was an optimization to train the mapper in parallel while data is being collected. You could modify the code to update sequentially instead of in parallel by making appropriate changes to the mapper update; that is what I did in early versions of the code.

Hi @srama2512. Based on the config parameters, the number of global updates NUM_GLOBAL_UPDATE can be computed as NUM_EPISODES * (T_EXP // (NUM_GLOBAL_STEPS * goal_interval)) // NUM_PROCESSES = 10000 * (1000 // (20 * 25)) // 36 = 555.
In a typical case, how many updates does a DRL network need to achieve good performance? Previously I thought that at least tens of thousands of updates were needed (maybe this is a wrong assumption).

In the equation, a larger NUM_PROCESSES means fewer global updates. This also confuses me. Why is the number of global updates affected by the number of parallel processes that generate the data for the batches? I hope some explanation can be provided. Thanks!
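As a purely numerical illustration of the equation quoted above (not an answer from the authors), the snippet below tabulates what the same formula gives for a few hypothetical NUM_PROCESSES values, alongside the number of environment transitions gathered per global update (NUM_PROCESSES * NUM_GLOBAL_STEPS * goal_interval, following the definitions in the earlier comment).

```python
# Tabulate what the formula above implies for different NUM_PROCESSES values.
NUM_EPISODES = 10000
T_EXP = 1000
NUM_GLOBAL_STEPS = 20
goal_interval = 25

updates_per_episode = T_EXP // (NUM_GLOBAL_STEPS * goal_interval)  # = 2

for num_processes in (1, 4, 36):
    num_global_updates = NUM_EPISODES * updates_per_episode // num_processes
    # Env transitions collected for each global update, summed over the parallel envs.
    steps_per_update = num_processes * NUM_GLOBAL_STEPS * goal_interval
    print(num_processes, num_global_updates, steps_per_update,
          num_global_updates * steps_per_update)

# Output:
# 1 20000 500 10000000
# 4 5000 2000 10000000
# 36 555 18000 9990000
# More processes -> fewer updates but more data per update; the total number of
# environment steps stays roughly constant (up to integer division).
```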

