Comments (13)

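shishishu commented on August 22, 2024

> Is multi-GPU training not necessarily much faster than single-GPU training? Is the following normal?
> single GPU: 2.66 global steps/sec
> 8 GPUs: 2.9 global steps/sec

My understanding: multi-GPU training is often aimed at long text sequences. On a single GPU, the batch_size that fits in memory is very small; anything larger causes an OOM error. For training to converge, you then need multiple GPUs to get the effect of a larger batch size (global_batch_size = num_gpu * batch_size). As for training speed, it probably depends on load balancing, the distribution strategy, and how well the CPU is matched to the GPUs.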
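To make the relation between steps/sec and actual throughput concrete, here is a minimal sketch. It assumes the per-GPU batch_size is the same in both runs (the value 32 is a made-up placeholder, not taken from this thread) and that one global step consumes global_batch_size examples, as in the formula above.

```python
def global_batch_size(num_gpu: int, batch_size: int) -> int:
    """Effective batch size per global step under data parallelism."""
    return num_gpu * batch_size


def examples_per_sec(steps_per_sec: float, num_gpu: int, batch_size: int) -> float:
    """Throughput, assuming each global step consumes one global batch."""
    return steps_per_sec * global_batch_size(num_gpu, batch_size)


# Hypothetical per-GPU batch size; the thread does not state the real value.
PER_GPU_BATCH_SIZE = 32

# steps/sec figures are the ones quoted in the question.
single = examples_per_sec(2.66, num_gpu=1, batch_size=PER_GPU_BATCH_SIZE)
multi = examples_per_sec(2.9, num_gpu=8, batch_size=PER_GPU_BATCH_SIZE)

print(f"1 GPU : {single:.1f} examples/sec")
print(f"8 GPUs: {multi:.1f} examples/sec")
print(f"speedup in examples/sec: {multi / single:.2f}x")
```

Under these assumptions, a similar steps/sec figure still means the 8-GPU run processes roughly 8.7 times as many examples per second, because each global step is 8 times larger.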

from bert-multi-gpu.

Jhangsy commented on August 22, 2024

Then, while it is running, this error appears:
"Resource exhausted: OOM when allocating tensor with shape[4096,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"

haoyuhu commented on August 22, 2024

The GPU ran out of memory (OOM).
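For scale, the size of the tensor named in the error above can be worked out from its shape. This is a small sketch; it assumes "type float" means 32-bit floats (4 bytes per element), and the helper name tensor_bytes is made up for illustration.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for a dense tensor of the given shape."""
    return prod(shape) * bytes_per_element


size = tensor_bytes([4096, 3072])
print(f"{size} bytes = {size / 2**20:.0f} MiB")  # 50331648 bytes = 48 MiB
```

The failing allocation is only about 48 MiB, which suggests that GPU:0 was already nearly full when the request arrived, rather than that this one tensor is too large. Lowering the per-GPU batch size or the maximum sequence length is the usual first thing to try.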

Jhangsy commented on August 22, 2024

> The GPU ran out of memory (OOM).

Why does this happen? I have 8 Tesla K80 cards. Do I need to set any limits when running? There is no problem when running on a single GPU.

Jhangsy commented on August 22, 2024

[Screenshot: Screen Shot 2020-02-28 at 11 59 43 AM]

This log looks wrong. Does it mean that the mirrored strategy is not being used?

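haoyuhu commented on August 22, 2024

> [Screenshot: Screen Shot 2020-02-28 at 11 59 43 AM]
>
> This log looks wrong. Does it mean that the mirrored strategy is not being used?

After training starts, you can use nvidia-smi to check whether the GPUs are actually in use.
REF: tensorflow/tensorflow#26020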
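A minimal sketch of that check from Python, assuming nvidia-smi is on PATH. The --query-gpu and --format options are standard nvidia-smi flags; the function names and the sample text below are made up for illustration.

```python
import subprocess

QUERY = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = (field.strip() for field in line.split(","))
        rows.append({
            "index": int(index),
            "util_percent": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return one dict per visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_csv(out)


# Made-up sample output, only to show the parsed structure.
sample = "0, 97, 10873, 11441\n1, 95, 10871, 11441\n"
for gpu in parse_gpu_csv(sample):
    print(gpu)
```

If all replicas are active, calling query_gpus() during training should show high utilization on every GPU, not just GPU 0. Fields that the driver reports as [N/A] would need extra handling that this sketch omits.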

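Jhangsy commented on August 22, 2024

After training starts, GPU Util is very high, 90%~100%. It is only the log that is confusing: whatever value I set log_n_every_steps to, the number shown in the log is double that value. With the 8 from your code it shows 16; if I change it to 10, it shows 20.
[image]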

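haoyuhu commented on August 22, 2024

> After training starts, GPU Util is very high, 90%~100%. It is only the log that is confusing: whatever value I set log_n_every_steps to, the number shown in the log is double that value. With the 8 from your code it shows 16; if I change it to 10, it shows 20.
> [image]

Are you training on all 8 GPUs together? Someone raised a similar issue before.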

pengxia24 commented on August 22, 2024

Is multi-GPU training not necessarily much faster than single-GPU training? Is the following normal?
single GPU: 2.66 global steps/sec
8 GPUs: 2.9 global steps/sec

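Jhangsy commented on August 22, 2024

> > After training starts, GPU Util is very high, 90%~100%. It is only the log that is confusing: whatever value I set log_n_every_steps to, the number shown in the log is double that value. With the 8 from your code it shows 16; if I change it to 10, it shows 20.
> > [image]
>
> Are you training on all 8 GPUs together? Someone raised a similar issue before.

Yes, all 8 GPUs at once. I saw that issue, but it has no answer. After switching to 4 GPUs, the speed improved noticeably. I suspect a problem with the GPU interconnect, or perhaps a CPU bottleneck?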

haoyuhu commented on August 22, 2024

@Jhangsy This logging problem has indeed puzzled me for a long time; fortunately it does not affect training. The slowdown could be a CPU bottleneck.

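Jhangsy commented on August 22, 2024

> @Jhangsy This logging problem has indeed puzzled me for a long time; fortunately it does not affect training. The slowdown could be a CPU bottleneck.

Haha, yes. You replied so quickly, thank you very much~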

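haoyuhu commented on August 22, 2024

> > @Jhangsy This logging problem has indeed puzzled me for a long time; fortunately it does not affect training. The slowdown could be a CPU bottleneck.
>
> Haha, yes. You replied so quickly, thank you very much~

You're welcome :P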
