The outputs have shape [32, 490, 16121], where 16121 is the length of my vocabulary. What does the 490 dimension represent?
Also, the outputs are probabilities, right?
(outputs)
tensor([[[-9.7001, -9.6490, -9.6463, ..., -9.6936, -9.6430, -9.7431],
[-9.6997, -9.6487, -9.6470, ..., -9.6903, -9.6450, -9.7416],
[-9.6999, -9.6477, -9.6479, ..., -9.6898, -9.6453, -9.7417],
...,
[-9.7006, -9.6449, -9.6513, ..., -9.6889, -9.6477, -9.7405],
[-9.7003, -9.6448, -9.6512, ..., -9.6893, -9.6477, -9.7410],
[-9.7007, -9.6453, -9.6513, ..., -9.6892, -9.6466, -9.7403]],
[[-9.6844, -9.6316, -9.6387, ..., -9.6880, -9.6269, -9.7657],
[-9.6834, -9.6299, -9.6404, ..., -9.6872, -9.6283, -9.7642],
[-9.6834, -9.6334, -9.6387, ..., -9.6864, -9.6290, -9.7616],
...,
[-9.6840, -9.6299, -9.6431, ..., -9.6830, -9.6304, -9.7608],
[-9.6838, -9.6297, -9.6428, ..., -9.6834, -9.6303, -9.7609],
[-9.6842, -9.6300, -9.6428, ..., -9.6837, -9.6292, -9.7599]],
[[-9.6966, -9.6386, -9.6458, ..., -9.6896, -9.6375, -9.7521],
[-9.6974, -9.6374, -9.6462, ..., -9.6890, -9.6369, -9.7516],
[-9.6974, -9.6405, -9.6456, ..., -9.6876, -9.6378, -9.7491],
...,
[-9.6978, -9.6336, -9.6493, ..., -9.6851, -9.6419, -9.7490],
[-9.6971, -9.6334, -9.6487, ..., -9.6863, -9.6411, -9.7501],
[-9.6972, -9.6338, -9.6489, ..., -9.6867, -9.6396, -9.7497]],
...,
[[-9.7005, -9.6249, -9.6588, ..., -9.6762, -9.6557, -9.7555],
[-9.7028, -9.6266, -9.6597, ..., -9.6765, -9.6574, -9.7542],
[-9.7016, -9.6240, -9.6605, ..., -9.6761, -9.6576, -9.7553],
...,
[-9.7036, -9.6237, -9.6624, ..., -9.6728, -9.6590, -9.7524],
[-9.7034, -9.6235, -9.6620, ..., -9.6735, -9.6589, -9.7530],
[-9.7038, -9.6240, -9.6622, ..., -9.6738, -9.6582, -9.7524]],
[[-9.7058, -9.6305, -9.6566, ..., -9.6739, -9.6557, -9.7466],
[-9.7061, -9.6273, -9.6569, ..., -9.6774, -9.6564, -9.7499],
[-9.7046, -9.6280, -9.6576, ..., -9.6772, -9.6575, -9.7498],
...,
[-9.7060, -9.6263, -9.6609, ..., -9.6714, -9.6561, -9.7461],
[-9.7055, -9.6262, -9.6605, ..., -9.6723, -9.6558, -9.7469],
[-9.7058, -9.6270, -9.6606, ..., -9.6725, -9.6552, -9.7460]],
[[-9.7101, -9.6312, -9.6570, ..., -9.6736, -9.6551, -9.7420],
[-9.7102, -9.6307, -9.6579, ..., -9.6733, -9.6576, -9.7418],
[-9.7078, -9.6281, -9.6598, ..., -9.6704, -9.6596, -9.7418],
...,
[-9.7084, -9.6288, -9.6605, ..., -9.6706, -9.6588, -9.7399],
[-9.7081, -9.6286, -9.6600, ..., -9.6714, -9.6584, -9.7406],
[-9.7085, -9.6291, -9.6601, ..., -9.6717, -9.6577, -9.7398]]],
device='cuda:0', grad_fn=<LogSoftmaxBackward0>)
(output_lengths)
tensor([312, 260, 315, 320, 317, 275, 308, 291, 272, 300, 262, 227, 303, 252,
298, 256, 303, 251, 284, 259, 263, 286, 209, 262, 166, 194, 149, 212,
121, 114, 110, 57], device='cuda:0', dtype=torch.int32)
(target_lengths)
tensor([57, 55, 54, 50, 49, 49, 49, 48, 48, 47, 43, 42, 41, 40, 40, 39, 37, 37,
36, 36, 36, 35, 34, 33, 29, 27, 26, 24, 20, 19, 17, 9])
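For reference, a quick sanity check I can run on the tensor above (a sketch, assuming outputs is exactly the tensor printed there): the grad_fn=<LogSoftmaxBackward0> points at log-probabilities rather than probabilities, in which case exp() should recover rows that sum to 1 over the vocabulary dimension.

# Sanity check (sketch): probabilities or log-probabilities?
# grad_fn=<LogSoftmaxBackward0> above suggests log-probabilities, so exp()
# should give values in [0, 1] that sum to ~1 along the vocabulary axis.
probs = outputs.exp()
print(probs.sum(dim=-1))  # expect ~1.0 at every (batch, time) position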
import torch
import time
import sys
from google.colab import output
import torch.nn as nn
from conformer import Conformer
import torchmetrics
import random
cuda = torch.cuda.is_available()
device = torch.device('cuda' if cuda else 'cpu')
print('Device:', device)
################################################################################
def train_model(model, optimizer, criterion, loader, metric):
    running_loss = 0.0
    for i, (audio, audio_len, translations, translation_len) in enumerate(loader):
        # with output.use_tags('some_outputs'):
        #     sys.stdout.write('Batch: ' + str(i + 1) + '/290')
        #     sys.stdout.flush()
        # Sort the batch so targets are in descending length order, keeping audio,
        # lengths, and translations aligned. Indexing with sorted_indices also
        # handles a final batch smaller than 32, unlike preallocated zero tensors.
        sorted_translation_len, sorted_indices = torch.sort(translation_len, descending=True)
        sorted_audio = audio[sorted_indices]
        sorted_audio_len = audio_len[sorted_indices]
        sorted_translations = translations[sorted_indices]
        # Transpose inputs from (batch, dim, seq_len) to (batch, seq_len, dim)
        inputs = torch.transpose(sorted_audio.to(device), 1, 2)
        input_lengths = sorted_audio_len
        targets = sorted_translations.to(device)
        target_lengths = sorted_translation_len
        optimizer.zero_grad()
        # Forward propagate
        outputs, output_lengths = model(inputs, input_lengths)
        # Calculate CTC loss; CTCLoss expects (T, N, C) log-probabilities,
        # hence the transpose from (N, T, C)
        loss = criterion(outputs.transpose(0, 1), targets, output_lengths, target_lengths)
        loss.backward()
        optimizer.step()
        # Accumulate statistics
        running_loss += loss.item()
        output.clear(output_tags='some_outputs')
    loss_per_epoch = running_loss / (i + 1)
    return loss_per_epoch
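For context on the transpose(0, 1) before the loss above: nn.CTCLoss expects log-probabilities of shape (T, N, C) together with per-sample input and target lengths. A minimal self-contained sketch with small dummy sizes (not my real dimensions):

# CTC shape sketch: T = time frames, N = batch, C = classes (0 = blank by
# default), S = max target length. Dummy sizes; the real run is (490, 32, 16121).
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 20, 10
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)     # (T, N, C) log-probabilities
targets = torch.randint(1, C, (N, S), dtype=torch.long)  # class 0 reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(1, S + 1, (N,), dtype=torch.long)
loss = nn.CTCLoss()(log_probs, targets, input_lengths, target_lengths)
print(loss.item())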
################################################################################
def eval_model(model, criterion, loader, metric):
    running_loss = 0.0
    wer_calc = 0.0
    random_index_per_epoch = random.randint(0, 178)
    for i, (audio, audio_len, translations, translation_len) in enumerate(loader):
        # with output.use_tags('some_outputs'):
        #     sys.stdout.write('Batch: ' + str(i + 1) + '/72')
        #     sys.stdout.flush()
        # Sort the batch so targets are in descending length order (as in training)
        sorted_translation_len, sorted_indices = torch.sort(translation_len, descending=True)
        sorted_audio = audio[sorted_indices]
        sorted_audio_len = audio_len[sorted_indices]
        sorted_translations = translations[sorted_indices]
        # Transpose inputs from (batch, dim, seq_len) to (batch, seq_len, dim)
        inputs = torch.transpose(sorted_audio.to(device), 1, 2)
        input_lengths = sorted_audio_len
        targets = sorted_translations.to(device)
        target_lengths = sorted_translation_len
        # Forward propagate
        outputs, output_lengths = model(inputs, input_lengths)
        # Calculate CTC loss
        loss = criterion(outputs.transpose(0, 1), targets, output_lengths, target_lengths)
        print(output_lengths)
        print(target_lengths)
        # outputs_in_words = words_vocab.convert_pred_to_words(outputs.transpose(0, 1))
        # targets_in_words = words_vocab.convert_pred_to_words(targets)
        # wer = metrics_calculation(metric, outputs_in_words, targets_in_words)
        # if i == random_index_per_epoch:
        #     print(outputs_in_words, targets_in_words)
        running_loss += loss.item()
        # wer_calc += wer
        output.clear(output_tags='some_outputs')
        break  # debugging: stop after the first batch
    loss_per_epoch = running_loss / (i + 1)
    wer_per_epoch = wer_calc / (i + 1)
    return loss_per_epoch, wer_per_epoch
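Since words_vocab.convert_pred_to_words is commented out above (words_vocab is my own helper, not shown), a rough greedy CTC decode could stand in for it. This is only a sketch: it assumes blank index 0 (the nn.CTCLoss default) and a hypothetical id2word dict mapping class indices to words:

# Greedy CTC decode (sketch): (N, T, C) log-probabilities -> list of sentences.
# Assumes blank index 0 and a hypothetical id2word mapping; both are stand-ins.
def greedy_ctc_decode(outputs, output_lengths, id2word, blank=0):
    best = outputs.argmax(dim=-1)                      # (N, T) best class per frame
    sentences = []
    for ids, length in zip(best, output_lengths):
        ids = torch.unique_consecutive(ids[:length])   # collapse repeated frames
        words = [id2word[int(i)] for i in ids if int(i) != blank]  # drop blanks
        sentences.append(' '.join(words))
    return sentences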
################################################################################
def train_eval_model(epochs):
    # Conformer model init
    model = nn.DataParallel(Conformer(num_classes=16121, input_dim=201,
                                      encoder_dim=32, num_encoder_layers=1)).to(device)
    # Optimizer from the torch.optim package
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
    # Loss function
    criterion = nn.CTCLoss().to(device)
    # Metrics init
    metric = torchmetrics.WordErrorRate()
    for epoch in range(epochs):
        print("Epoch", epoch + 1)
        ########################################################################
        # TRAINING
        model.train()
        print("Training")
        # epoch_loss = train_model(model=model, optimizer=optimizer,
        #                          criterion=criterion, loader=train_loader, metric=metric)
        # print(f'Loss: {epoch_loss:.3f}')
        ########################################################################
        # EVALUATION
        model.eval()
        print("Validation")
        epoch_val_loss, epoch_val_wer = eval_model(model=model, criterion=criterion,
                                                   loader=test_loader, metric=metric)
        print(f'Loss: {epoch_val_loss:.3f}')
        print(f'WER: {epoch_val_wer:.3f}')
################################################################################
def metrics_calculation(metric, predictions, targets):
    print(predictions)
    print(targets)
    wer = metric(predictions, targets)
    return wer
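For reference, the WordErrorRate metric used here takes strings (or lists of strings) for both predictions and targets; a tiny usage example:

# WER usage sketch: one deleted word ('there') against a 3-word reference -> 1/3.
metric = torchmetrics.WordErrorRate()
print(metric(["hello world"], ["hello there world"]))  # tensor(0.3333)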
train_eval_model(1)