
Comments (8)

bokveizen commented on August 25, 2024

True, so I am only using ResNet and VGG


weiaicunzai commented on August 25, 2024

I've just updated my code and fixed this bug.

I've tested the updated code on Google Colab with:

Python 3.6
PyTorch 1.6
a K80 GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here is the output during training (seresnet152, batch_size=64). You can see that the GPU memory consumption (reserved memory) is 7832 MB; you could try it yourself. If you get a different result, please let me know, thanks. @monkeyDemon @bokveizen

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  772034 KB |    6276 MB |   37813 GB |   37812 GB |
|       from large pool |  438784 KB |    5926 MB |   37354 GB |   37354 GB |
|       from small pool |  333250 KB |     480 MB |     458 GB |     458 GB |
|---------------------------------------------------------------------------|
| Active memory         |  772034 KB |    6276 MB |   37813 GB |   37812 GB |
|       from large pool |  438784 KB |    5926 MB |   37354 GB |   37354 GB |
|       from small pool |  333250 KB |     480 MB |     458 GB |     458 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    7832 MB |    7832 MB |    7832 MB |       0 B  |
|       from large pool |    7350 MB |    7350 MB |    7350 MB |       0 B  |
|       from small pool |     482 MB |     482 MB |     482 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  354366 KB |    1425 MB |   19691 GB |   19690 GB |
|       from large pool |  351744 KB |    1423 MB |   19197 GB |   19197 GB |
|       from small pool |    2622 KB |      33 MB |     493 GB |     493 GB |
|---------------------------------------------------------------------------|
| Allocations           |    2940    |    3808    |    6708 K  |    6705 K  |
|       from large pool |     141    |     549    |    2429 K  |    2429 K  |
|       from small pool |    2799    |    3414    |    4278 K  |    4275 K  |
|---------------------------------------------------------------------------|
| Active allocs         |    2940    |    3808    |    6708 K  |    6705 K  |
|       from large pool |     141    |     549    |    2429 K  |    2429 K  |
|       from small pool |    2799    |    3414    |    4278 K  |    4275 K  |
|---------------------------------------------------------------------------|
| GPU reserved segments |     499    |     499    |     499    |       0    |
|       from large pool |     258    |     258    |     258    |       0    |
|       from small pool |     241    |     241    |     241    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |     106    |     121    |    3384 K  |    3384 K  |
|       from large pool |      44    |      83    |    1058 K  |    1058 K  |
|       from small pool |      62    |      77    |    2325 K  |    2325 K  |
|===========================================================================|
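If you want to check this on your own machine, here is a minimal sketch (my own illustration, not the repository's training script) that runs one training step on a stand-in model and prints the same CUDA memory report as above; substitute the network you actually train (e.g. seresnet152 from this repo) for the placeholder model:

import torch
import torch.nn as nn

# Minimal sketch: one training step on a stand-in model, then print the CUDA
# memory report quoted above. Replace `model` with the network you actually train.
device = torch.device('cuda')

model = nn.Sequential(                     # stand-in model, not seresnet152
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 100),
).to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(64, 3, 32, 32, device=device)    # batch_size=64, CIFAR-sized inputs
labels = torch.randint(0, 100, (64,), device=device)

optimizer.zero_grad()
loss_fn(model(images), labels).backward()
optimizer.step()

print(torch.cuda.memory_summary(device=device))             # the report shown above
print(torch.cuda.max_memory_reserved(device) / 1024 ** 2)   # peak reserved memory in MB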


ShaoZeng commented on August 25, 2024

I found that mobilenet.py has a similar problem; it occupies more GPU memory than expected.
Can you check it?
Thanks!


Vickeyhw commented on August 25, 2024

I found that googlenet.py also occupies so much GPU memory that when I train it on the ImageNet dataset, even 4 GPUs with 20 GB each are not enough.


weiaicunzai commented on August 25, 2024

I found that googlenet.py also occupies so much GPU memory that when I train it on the ImageNet dataset, even 4 GPUs with 20 GB each are not enough.

Could you please tell me what your input image size and batch size are during training?


Vickeyhw commented on August 25, 2024

I found that googlenet.py also occupies so much GPU memory that when I train it on the ImageNet dataset, even 4 GPUs with 20 GB each are not enough.

Could you please tell me what your input image size and batch size are during training?

@weiaicunzai My input image size is 224x224. I tried setting the batch size to 128, 256, and 64, but none of them worked.


weiaicunzai commented on August 25, 2024

I found that googlenet.py also occupies so much GPU memory that when I train it on the ImageNet dataset, even 4 GPUs with 20 GB each are not enough.

Could you please tell me what your input image size and batch size are during training?

@weiaicunzai My input image size is 224x224. I tried setting the batch size to 128, 256, and 64, but none of them worked.

Thanks, I will try to reproduce the bug you mentioned. My GPU server is currently down due to hardware problems and has been sent out for repair, so it might take a while, sorry.


weiaicunzai commented on August 25, 2024

I found that googlenet.py also occupies so much GPU memory that when I train it on the ImageNet dataset, even 4 GPUs with 20 GB each are not enough.

Could you please tell me what your input image size and batch size are during training?

@weiaicunzai My input image size is 224x224. I tried setting the batch size to 128, 256, and 64, but none of them worked.

I use only 3 downsampling stages in GoogLeNet, which results in larger feature maps during training; that is why the memory consumption is so high. Fewer downsampling stages are beneficial for small inputs like 32x32. I added one more downsampling layer to my GoogLeNet implementation, and the GPU memory usage drops from 14 GB to 7 GB during training on CIFAR-100, but accuracy also drops by about 2 percent.
If you are going to train on large input images (224x224), you could use 5 downsampling stages, just as in the original paper, to further reduce memory usage without losing much network performance.
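To make the size trade-off concrete, here is a small stand-alone sketch (an illustration, not the actual code in googlenet.py) comparing the stem output for 3 vs. 4 downsampling steps on a 32x32 CIFAR input; each extra stride-2 pooling halves the height and width, so every later feature map holds roughly a quarter of the activations, which is where most of the training memory goes:

import torch
import torch.nn as nn

# Illustration only: how one extra stride-2 pooling shrinks the feature maps
# that dominate activation memory during training.
def stem(num_downsamples):
    layers = [nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True)]
    for _ in range(num_downsamples):
        layers += [
            nn.MaxPool2d(2, stride=2),                # halves H and W
            nn.Conv2d(64, 64, 3, padding=1),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

x = torch.randn(64, 3, 32, 32)                        # batch_size=64, CIFAR-sized input
for n in (3, 4):
    out = stem(n)(x)
    mb = out.numel() * 4 / 1024 ** 2                  # fp32 = 4 bytes per element
    print(f'{n} downsamples -> output {tuple(out.shape)}, {mb:.3f} MB')
# 3 downsamples -> (64, 64, 4, 4); 4 downsamples -> (64, 64, 2, 2), a quarter of
# the elements per map, which is roughly why the extra layer cut training memory
# on CIFAR-100 from about 14 GB to 7 GB.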


