
Comments (8)

bokveizen commented on August 25, 2024

True, so I am only using ResNet and VGG


weiaicunzai commented on August 25, 2024

I've just updated my code and fixed this bug.

I've tested the updated code on Google Colab with:

Python 3.6
PyTorch 1.6
a K80 GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here is the output during training (seresnet152, batch_size=64). You can see that the GPU memory consumption (reserved memory) is 7832 MB; you could try it yourself. If you get a different result, please let me know, thanks. @monkeyDemon @bokveizen

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  772034 KB |    6276 MB |   37813 GB |   37812 GB |
|       from large pool |  438784 KB |    5926 MB |   37354 GB |   37354 GB |
|       from small pool |  333250 KB |     480 MB |     458 GB |     458 GB |
|---------------------------------------------------------------------------|
| Active memory         |  772034 KB |    6276 MB |   37813 GB |   37812 GB |
|       from large pool |  438784 KB |    5926 MB |   37354 GB |   37354 GB |
|       from small pool |  333250 KB |     480 MB |     458 GB |     458 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    7832 MB |    7832 MB |    7832 MB |       0 B  |
|       from large pool |    7350 MB |    7350 MB |    7350 MB |       0 B  |
|       from small pool |     482 MB |     482 MB |     482 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  354366 KB |    1425 MB |   19691 GB |   19690 GB |
|       from large pool |  351744 KB |    1423 MB |   19197 GB |   19197 GB |
|       from small pool |    2622 KB |      33 MB |     493 GB |     493 GB |
|---------------------------------------------------------------------------|
| Allocations           |    2940    |    3808    |    6708 K  |    6705 K  |
|       from large pool |     141    |     549    |    2429 K  |    2429 K  |
|       from small pool |    2799    |    3414    |    4278 K  |    4275 K  |
|---------------------------------------------------------------------------|
| Active allocs         |    2940    |    3808    |    6708 K  |    6705 K  |
|       from large pool |     141    |     549    |    2429 K  |    2429 K  |
|       from small pool |    2799    |    3414    |    4278 K  |    4275 K  |
|---------------------------------------------------------------------------|
| GPU reserved segments |     499    |     499    |     499    |       0    |
|       from large pool |     258    |     258    |     258    |       0    |
|       from small pool |     241    |     241    |     241    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |     106    |     121    |    3384 K  |    3384 K  |
|       from large pool |      44    |      83    |    1058 K  |    1058 K  |
|       from small pool |      62    |      77    |    2325 K  |    2325 K  |
|===========================================================================|
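If you want to check this on your own machine, here is a minimal sketch (my own illustration, not the repository's training script) that runs one training step on a stand-in model and prints the same CUDA memory report as above; substitute the network you actually train (e.g. seresnet152 from this repo) for the placeholder model:

import torch
import torch.nn as nn

# Minimal sketch: one training step on a stand-in model, then print the CUDA
# memory report quoted above. Replace `model` with the network you actually train.
device = torch.device('cuda')

model = nn.Sequential(                     # stand-in model, not seresnet152
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 100),
).to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(64, 3, 32, 32, device=device)    # batch_size=64, CIFAR-sized inputs
labels = torch.randint(0, 100, (64,), device=device)

optimizer.zero_grad()
loss_fn(model(images), labels).backward()
optimizer.step()

print(torch.cuda.memory_summary(device=device))             # the report shown above
print(torch.cuda.max_memory_reserved(device) / 1024 ** 2)   # peak reserved memory in MB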


ShaoZeng commented on August 25, 2024

I found that mobilenet.py has a similar problem; it occupies more GPU memory than expected.
Can you check it?
Thanks!


Vickeyhw commented on August 25, 2024

I found that googlenet.py also occupies so much GPU memory that when I train it on the ImageNet dataset, even 4 GPUs with 20 GB each are not enough.


weiaicunzai commented on August 25, 2024

I found that googlenet.py also occupies so much GPU memory that when I train it on the ImageNet dataset, even 4 GPUs with 20 GB each are not enough.

Could you please tell me what your input image size and batch size are during training?


Vickeyhw commented on August 25, 2024

I found that googlenet.py also occupies so much GPU memory that when I train it on the ImageNet dataset, even 4 GPUs with 20 GB each are not enough.

Could you please tell me what your input image size and batch size are during training?

@weiaicunzai My input image size is 224x224. I tried setting the batch size to 128, 256, and 64, but none of them worked.


weiaicunzai commented on August 25, 2024

I found that googlenet.py also occupies so much GPU memory that when I train it on the ImageNet dataset, even 4 GPUs with 20 GB each are not enough.

Could you please tell me what your input image size and batch size are during training?

@weiaicunzai My input image size is 224x224. I tried setting the batch size to 128, 256, and 64, but none of them worked.

Thanks, I will try to reproduce the bug you mentioned. My GPU server is currently down due to hardware problems and has been sent out for repair, so it might take a while, sorry.


weiaicunzai commented on August 25, 2024

I found that googlenet.py also occupies so much GPU memory that when I train it on the ImageNet dataset, even 4 GPUs with 20 GB each are not enough.

Could you please tell me what your input image size and batch size are during training?

@weiaicunzai My input image size is 224x224. I tried setting the batch size to 128, 256, and 64, but none of them worked.

I use only 3 downsampling stages in GoogLeNet, which results in larger feature maps during training; that is why the memory consumption is so high. Fewer downsampling stages are beneficial for small inputs like 32x32. I added one more downsampling layer to my GoogLeNet implementation, and the GPU memory usage drops from 14 GB to 7 GB during training on CIFAR-100, but accuracy also drops by about 2 percent.
If you are going to train on large input images (224x224), you could use 5 downsampling stages, just as in the original paper, to further reduce memory usage without losing much network performance.
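To make the size trade-off concrete, here is a small stand-alone sketch (an illustration, not the actual code in googlenet.py) comparing the stem output for 3 vs. 4 downsampling steps on a 32x32 CIFAR input; each extra stride-2 pooling halves the height and width, so every later feature map holds roughly a quarter of the activations, which is where most of the training memory goes:

import torch
import torch.nn as nn

# Illustration only: how one extra stride-2 pooling shrinks the feature maps
# that dominate activation memory during training.
def stem(num_downsamples):
    layers = [nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True)]
    for _ in range(num_downsamples):
        layers += [
            nn.MaxPool2d(2, stride=2),                # halves H and W
            nn.Conv2d(64, 64, 3, padding=1),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

x = torch.randn(64, 3, 32, 32)                        # batch_size=64, CIFAR-sized input
for n in (3, 4):
    out = stem(n)(x)
    mb = out.numel() * 4 / 1024 ** 2                  # fp32 = 4 bytes per element
    print(f'{n} downsamples -> output {tuple(out.shape)}, {mb:.3f} MB')
# 3 downsamples -> (64, 64, 4, 4); 4 downsamples -> (64, 64, 2, 2), a quarter of
# the elements per map, which is roughly why the extra layer cut training memory
# on CIFAR-100 from about 14 GB to 7 GB.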


