Comments (2)

albertz commented on May 27, 2024

It might be a hardware issue. On that node, I now get:

zeyer@cn-244 ~ % nvidia-smi                        
Unable to determine the device handle for GPU0000:03:00.0: Unknown Error

And dmesg shows:

[  +0.000844] nvidia 0000:03:00.0: AER: can't recover (no error_detected callback)
[  +0.000001] snd_hda_intel 0000:03:00.1: AER: can't recover (no error_detected callback)
[  +0.000008] pcieport 0000:00:03.0: AER: device recovery failed
[  +0.000001] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Fatal) error received: 0000:00:03.0
[  +0.000004] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[  +0.000898] pcieport 0000:00:03.0:   device [8086:6f08] error status/mask=00004020/00000000
[  +0.000889] pcieport 0000:00:03.0:    [ 5] SDES                  
[  +0.000894] pcieport 0000:00:03.0:    [14] CmpltTO                (First)
[  +0.000921] nvidia 0000:03:00.0: AER: can't recover (no error_detected callback)
[  +0.000002] snd_hda_intel 0000:03:00.1: AER: can't recover (no error_detected callback)
[  +1.050439] pcieport 0000:00:03.0: AER: Root Port link has been reset (0)
[  +0.000041] pcieport 0000:00:03.0: AER: device recovery failed
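
As a rough sketch, the fatal AER reports can be picked out of a dmesg dump like the one above with a small script. The function name `fatal_aer_devices` and the regular expression are my own assumptions based only on the line shape seen in this log, not on any kernel-defined format.

```python
import re

# Assumed line shape, taken from the log above:
#   "<driver> <pci-address>: PCIe Bus Error: severity=Uncorrected (Fatal), ..."
_FATAL = re.compile(
    r"(?P<driver>\S+) (?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=Uncorrected \(Fatal\)"
)


def fatal_aer_devices(dmesg_text):
    """Return (driver, pci_address) pairs that reported a fatal AER bus error."""
    return [(m.group("driver"), m.group("addr")) for m in _FATAL.finditer(dmesg_text)]


# Three lines copied from the dmesg output above.
sample = """\
[  +0.000001] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Fatal) error received: 0000:00:03.0
[  +0.000004] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[  +0.000921] nvidia 0000:03:00.0: AER: can't recover (no error_detected callback)
"""
print(fatal_aer_devices(sample))
```

On a live node the input would be the text of `dmesg` itself; here only the root port 0000:00:03.0 reports the fatal bus error, while the GPU at 0000:03:00.0 sits behind it.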

from espnet.

albertz commented on May 27, 2024

I was just looking at the code. There is:

mask_length = torch.randint(
    mask_width_range[0],
    mask_width_range[1],
    (B, num_mask),
    device=spec.device,
)

And the function is called like this (as you can see from the stack trace):

    line: return mask_along_axis(
              spec,
              spec_lengths,
              mask_width_range=self.mask_width_range,
              dim=self.dim,
              num_mask=self.num_mask,
              replace_with_zero=self.replace_with_zero,
          )
    locals:
      spec = <local> tensor[6, 1220, 80] n=585600 (2.2Mb) x∈[-0.051, 0.252] μ=-0.051 σ=0.252 cuda:1
      spec_lengths = <local> tensor[6] i32 x∈[1112, 1220] μ=1.194e+03 σ=40.913 [1203, 1206, 1207, 1218, 1220, 1112]
      self = <local> MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
      self.mask_width_range = <local> [0, 27]
      self.dim = <local> 2
      self.num_mask = <local> 2
      self.replace_with_zero = <local> True

Since the upper bound of torch.randint is exclusive, mask_length should only have values between 0 and 26 (inclusive).

But later in the stack trace you see:

      mask_length = <local> tensor[6, 2, 1] i64 n=12 x∈[-4825293490701537652, 4474139002465713060] μ=-9.419e+17 σ=4.744e+18 cuda:1

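These values are far outside that range, and torch.randint cannot produce them when it works correctly. So I think it's clear that this is some hardware issue, and I guess we can close this.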
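
To make the range argument concrete, here is a minimal sketch of the check in plain Python. The helper name `out_of_range` is hypothetical and not part of espnet or PyTorch. The two observed values are the minimum and maximum reported for mask_length in the stack trace.

```python
def out_of_range(values, low, high):
    """Return the values that fall outside the half-open range [low, high)."""
    return [v for v in values if not (low <= v < high)]


# torch.randint(low, high, ...) samples from [low, high), so with
# mask_width_range = [0, 27] every entry should be in 0..26.
low, high = 0, 27

# Extreme values reported for mask_length in the stack trace.
observed = [-4825293490701537652, 4474139002465713060]

bad = out_of_range(observed, low, high)
print(bad)  # both values are outside [0, 27)

# Assumption: with PyTorch available, a real tensor could be checked by
# passing mask_length.flatten().tolist() as `values`.
```

A check like this right after the torch.randint call would have flagged the corrupted tensor at its source instead of failing later in the masking code.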
