I ran the fast_rnnt.get_rnnt_prune_ranges() function with a RuntimeError: Invalid devi

RuntimeError: invalid device ordinal (fast_rnnt issue, CLOSED)

sanzimu commented on August 18, 2024

RuntimeError: invalid device ordinal

from fast_rnnt.

Comments (19)

sanzimu commented on August 18, 2024

I have solved this problem by using cuda-10.1.

pkufool commented on August 18, 2024

How do you run it with fast_rnnt? Can you show the command you run? It looks to me like a wrong configuration of CUDA devices.

sanzimu commented on August 18, 2024

How do you run it with fast_rnnt? Can you show the command you run? It looks to me like a wrong configuration of CUDA devices.

For data security reasons, I can't show the full code here, only part of it.
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8"

import torch


class E2E(torch.nn.Module):
    def forward(self, feats, feats_len, target, target_len):
        # encoder_out, encoder_out_len and decoder_out come from the encoder
        # and decoder code, which is omitted here.
        target = target.to(torch.int64)
        am = self.lin_enc_out(encoder_out)
        lm = self.lin_dec_out(decoder_out)
        # One row per utterance; columns 2 and 3 hold the target and frame lengths.
        boundary = torch.zeros((lm.size(0), 4), dtype=torch.int64, device=am.device)
        boundary[:, 2] = target_len
        boundary[:, 3] = encoder_out_len
        simple_loss, (px_grad, py_grad) = self.fast_rnnt.rnnt_loss_smoothed(
            lm=lm.float(),
            am=am.float(),
            symbols=target,
            termination_symbol=self.blank_id,
            lm_only_scale=0.25,
            am_only_scale=0.0,
            boundary=boundary,
            reduction="sum",
            return_grad=True,
        )

        # The RuntimeError is raised by this call.
        ranges = self.fast_rnnt.get_rnnt_prune_ranges(
            px_grad=px_grad,
            py_grad=py_grad,
            boundary=boundary,
            s_range=2,
        )
        am_pruned, lm_pruned = self.fast_rnnt.do_rnnt_pruning(
            am=self.joint_network.lin_enc(encoder_out),
            lm=self.joint_network.lin_dec(decoder_out),
            ranges=ranges,
        )
        logits = self.joint_network(am_pruned, lm_pruned)
        pruned_loss = self.fast_rnnt.rnnt_loss_pruned(
            logits=logits.float(),
            symbols=target,
            termination_symbol=self.blank_id,
            boundary=boundary,
            reduction="sum",
        )


def main():
    # args, init_method and the input batch are defined in omitted code.
    rank = args.rank
    local_rank = args.gpu_id
    torch.distributed.init_process_group(
        backend="nccl", init_method=init_method, world_size=args.world_size, rank=rank
    )
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)
    model = E2E()
    model.to(device)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    model(feats, feats_len, target, target_len)


if __name__ == '__main__':
    main()

pkufool commented on August 18, 2024

For data security reasons, I can't show the full code here, only part of it.

Thanks! I see.

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8"

BTW, do you have 9 cards in your machine? What's the world_size?

csukuangfj commented on August 18, 2024

Are you using the same machine to compile and run fast_rnnt?

sanzimu commented on August 18, 2024

For data security reasons, I can't show the full code here, only part of it.

Thanks! I see.

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8"

BTW, do you have 9 cards in your machine? What's the world_size?

Sorry, there are only 8 cards. I made a mistake when writing this up; my training script actually uses 'export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"'.
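
A mismatch like the nine-entry list above can be caught before any CUDA call is made. The following is a minimal sketch, not part of fast_rnnt or PyTorch: the helper name check_device_ordinal is hypothetical, and the device count is passed in explicitly so the check runs without a GPU. In a training script the count would typically come from torch.cuda.device_count().

```python
def check_device_ordinal(local_rank: int, device_count: int) -> int:
    """Return local_rank if it is a valid CUDA ordinal, else raise ValueError.

    device_count is an explicit argument (an assumption of this sketch) so the
    check can be exercised without a GPU.
    """
    if device_count <= 0:
        raise ValueError("no CUDA devices are visible to this process")
    if not 0 <= local_rank < device_count:
        raise ValueError(
            f"device ordinal {local_rank} is out of range for "
            f"{device_count} visible device(s); valid ordinals are "
            f"0..{device_count - 1}"
        )
    return local_rank


if __name__ == "__main__":
    # With 8 cards, ordinal 7 is the last valid one; 8 would be rejected.
    print(check_device_ordinal(7, device_count=8))
```

Calling such a check right before torch.cuda.set_device(local_rank) turns a late "invalid device ordinal" into an early, explicit message.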

sanzimu commented on August 18, 2024

Are you using the same machine to compile and run fast_rnnt?

yes

pkufool commented on August 18, 2024

Can you run your training with torch.rnnt_loss successfully? It seems that it gets a wrong device id somewhere. I checked the code and did not see any obvious bugs in the functions listed in the backtrace.

pkufool commented on August 18, 2024

Does this error occur at the beginning or in the middle of your training?

sanzimu commented on August 18, 2024

Can you run your training with torch.rnnt_loss successfully? It seems that it gets a wrong device id somewhere. I checked the code and did not see any obvious bugs in the functions listed in the backtrace.

Training runs successfully with the warp-transducer loss.

sanzimu commented on August 18, 2024

Does this error occur at the beginning or in the middle of your training?

The error happens at the beginning of training.

sanzimu commented on August 18, 2024

Is it possible that the way I launch multiple jobs causes this error?

ngpu=8
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
node_rank=0
node_num=1
for ((i = 0; i < ngpu; i++)); do
  (
    # Pick the (i+1)-th entry of CUDA_VISIBLE_DEVICES as this job's GPU id.
    gpu_id=$(echo "$CUDA_VISIBLE_DEVICES" | cut -d',' -f$((i + 1)))
    rank=$((node_rank * ngpu + i))
    python main.py --node-num "$node_num" \
      --ngpu "$ngpu" \
      --gpu-id "$gpu_id" \
      --rank "$rank"
  ) &
done
wait
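
One thing to keep in mind with a launcher like this: once CUDA_VISIBLE_DEVICES is set, each process addresses the visible GPUs by their position in that list (0, 1, 2, ...), not by their physical ID. With "0,1,2,3,4,5,6,7" the two coincide, so this alone would not explain the error reported here, but with a list such as "4,5,6,7" passing the physical ID to torch.cuda.set_device would be out of range. A minimal sketch of the index arithmetic follows; the helper names parse_visible_devices and launch_plan are hypothetical.

```python
from __future__ import annotations


def parse_visible_devices(value: str) -> list[str]:
    """Split a CUDA_VISIBLE_DEVICES string into its non-empty entries."""
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def launch_plan(visible: str, node_rank: int = 0) -> list[dict]:
    """Return one entry per process: global rank, logical ordinal, physical ID."""
    devices = parse_visible_devices(visible)
    ngpu = len(devices)
    return [
        {
            "rank": node_rank * ngpu + i,  # same formula as the shell script
            "local_rank": i,               # position in the list, i.e. the logical ordinal
            "physical_id": physical_id,    # entry taken from CUDA_VISIBLE_DEVICES
        }
        for i, physical_id in enumerate(devices)
    ]


if __name__ == "__main__":
    for entry in launch_plan("4,5,6,7", node_rank=1):
        print(entry)
```

In this sketch the value handed to the training process as its device index would be local_rank, while physical_id is only informational.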

csukuangfj commented on August 18, 2024

Is there a similar issue if you use the pruned rnnt loss from k2?

csukuangfj commented on August 18, 2024

Also, does it happen if you use a single GPU or fewer GPUs for training?

sanzimu commented on August 18, 2024

Also, does it happen if you use a single GPU or fewer GPUs for training?

The same error happens when using 1 or 4 GPUs.

sanzimu commented on August 18, 2024

Is there a similar issue if you use the pruned rnnt loss from k2?

I'll try it. Installing the runtime environment is annoying, though.

sanzimu commented on August 18, 2024

Hmm, I can run fast_rnnt successfully with cuda-10.1, but it fails with 11.2. And I traced the error to the cub::DeviceScan::InclusiveScan call in https://github.com/danpovey/fast_rnnt/blob/master/fast_rnnt/csrc/utils.cu#:~:text=auto%20s%20%3D%20cub%3A%3ADeviceScan%3A%3AInclusiveScan(nullptr%2C%20temp_storage_bytes%2C
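
Since the failure depends on the CUDA version, it can help to record which versions are actually in play at run time. The following is a minimal sketch that reads only attributes PyTorch exposes (torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count()); the function name cuda_env_report is hypothetical, and the report falls back to a reduced form when PyTorch is not installed.

```python
import os
import platform


def cuda_env_report() -> dict:
    """Collect the Python, PyTorch and CUDA details visible to this process."""
    report = {
        "python": platform.python_version(),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; return what is known so far.
        report["torch"] = None
        return report
    report["torch"] = torch.__version__
    report["torch_cuda"] = torch.version.cuda  # CUDA version PyTorch was built with
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_env_report().items():
        print(f"{key}: {value}")
```

Comparing torch_cuda with the toolkit used to build fast_rnnt (for example, the output of nvcc --version) shows whether the extension and PyTorch were built against the same CUDA release.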

csukuangfj commented on August 18, 2024

And I traced the error to

Are there error logs and backtraces?

sanzimu commented on August 18, 2024

And I traced the error to

Are there error logs and backtraces?

No. I printed the error returned by cub::DeviceScan::InclusiveScan; the error code is 101 (invalid device ordinal).
