DeepForcedAligner

With this tool you can create accurate text-audio alignments given a set of audio files and their transcriptions. The alignments can, for example, be used to train text-to-speech models such as FastSpeech. In comparison to other forced alignment tools, this repo has the following advantages:

  • Multilingual: By design, the DFA is language-agnostic and can align either characters or phonemes.
  • Robustness: The alignment extraction is highly tolerant of text errors and silent characters.
  • Convenience: Easy installation with no extra dependencies. You can provide your own data in the standard LJSpeech format without special preprocessing (such as applying phonetic dictionaries, non-speech annotations, etc.).

The approach is based on training a simple speech recognition model with CTC loss on mel spectrograms extracted from the wav files.
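
To make the idea concrete, here is a minimal PyTorch sketch of such a CTC-trained recognizer, not the repo's actual model or training loop; the architecture, symbol count, and shapes are illustrative:

  import torch
  import torch.nn as nn

  # A recognition model predicts per-frame symbol probabilities from mels;
  # CTC loss then aligns those frames against the target character sequence.
  class ToyAligner(nn.Module):
      def __init__(self, n_mels=80, n_symbols=50, hidden=256):
          super().__init__()
          self.rnn = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
          self.proj = nn.Linear(2 * hidden, n_symbols)  # index 0 is the CTC blank

      def forward(self, mels):  # mels: (batch, frames, n_mels)
          out, _ = self.rnn(mels)
          return self.proj(out).log_softmax(-1)

  model = ToyAligner()
  ctc_loss = nn.CTCLoss(blank=0)

  mels = torch.randn(2, 400, 80)            # dummy batch: 2 clips, 400 mel frames
  targets = torch.randint(1, 50, (2, 60))   # dummy character ids (0 = blank)
  log_probs = model(mels).transpose(0, 1)   # CTCLoss expects (frames, batch, symbols)
  loss = ctc_loss(log_probs, targets,
                  input_lengths=torch.full((2,), 400),
                  target_lengths=torch.full((2,), 60))
  loss.backward()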

Installation

Running on Python >=3.6

pip install -r requirements.txt

Example Training and Extraction

Check out the following demo notebook for training and character duration extraction on the LJSpeech dataset:

Open In Colab

(1) Download the LJSpeech dataset, set paths in config.yaml:

  dataset_dir: LJSpeech
  metadata_path: LJSpeech/metadata.csv

(2) Preprocess the data and train aligner:

  python preprocess.py
  python train.py

(3) Extract durations with the latest model checkpoint (60k steps should be sufficient):

  python extract_durations.py

By default, the durations are stored as NumPy (.npy) files in:

  output/durations 

Each character duration corresponds to one mel time step, which translates to hop_length / sample_rate seconds in the wav file.
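
For example, to load a duration file and convert it to seconds (a sketch: the file name is hypothetical, and hop_length / sample_rate below are common LJSpeech-style values; check your config.yaml):

  import numpy as np

  hop_length, sample_rate = 256, 22050          # assumed values; see config.yaml
  durs = np.load('output/durations/00001.npy')  # hypothetical id, one integer per character
  seconds = durs * hop_length / sample_rate     # per-character duration in seconds
  print(f'{durs.sum()} mel frames = {seconds.sum():.2f} s total')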

Tensorboard

You can monitor the training with

  tensorboard --logdir dfa_checkpoints

Using Your Own Dataset

Just bring your dataset into the LJSpeech format. We recommend cleaning and preprocessing the text in the metadata file before running the DFA, e.g. lower-casing, phonemization, etc.
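
For reference, the expected layout is roughly the following (a sketch; the exact folder and file names depend on your setup):

  my_dataset/
  ├── metadata.csv      # pipe-separated lines: id|transcription
  └── wavs/
      ├── 00001.wav
      └── 00002.wav

  metadata.csv:
  00001|First sample text
  00002|Second sample text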

Using Preprocessed Mel Spectrograms

You can provide your own mel spectrograms by setting the following in config.yaml:

  precomputed_mels: /path/to/mels

Make sure that the mel names match the ids in the metafile, e.g.

  00001.mel ---> 00001|First sample text
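
A minimal sketch of precomputing mels with librosa (the parameters below are assumptions and must match what the aligner expects in config.yaml; the exact saving convention is also an assumption, just keep the names consistent with the metafile ids):

  import librosa
  import numpy as np

  wav, sr = librosa.load('wavs/00001.wav', sr=22050)   # assumed sample rate
  mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                       hop_length=256, n_mels=80)  # assumed params
  mel = np.log(np.clip(mel, 1e-5, None))               # log compression (one common choice)
  np.save('/path/to/mels/00001.mel', mel)              # note: np.save appends .npy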


deepforcedaligner's Issues

About training on a Bangla dataset and finding the duration of each character

Hi @cschaefer26,
I have trained the DFA model on a Bangla dataset and predicted durations for a sample audio file. The duration file (.npy format) contains only integer values, one per character. I then used the following formula:
duration of a character = (hop_length * predicted value for that character) / sample_rate. Is this correct? Kindly help me in this regard.
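
For reference, this matches the hop_length / sample_rate relation from the README; a small worked example with assumed parameters:

  hop_length, sample_rate = 256, 22050   # assumed; use your own config values
  pred_value = 7                         # predicted mel frames for one character
  seconds = hop_length * pred_value / sample_rate
  print(round(seconds, 4))               # 0.0813 s for this character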

Align Larger Audio File

Hi @cschaefer26,
You have done a nice job, and I'm using your repo. However, when aligning a longer audio file (> 1 minute) with its character (phoneme) sequence at inference time, the number of predicted values in the duration file (.npy file) does not match the number of characters (phonemes) that I input with the audio file. What is the problem here? I want to use the pretrained model (trained on a Bangla dataset [audio, phoneme sequences]) for phoneme duration prediction, so accuracy is a major concern for me.

Note that during training I used audio files 10-15 seconds long and their corresponding transcriptions (phoneme sequences), and I customized your code (preprocess.py and extract_durations.py) to run inference on a single audio file and its transcription.
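
One quick sanity check for this kind of mismatch (a sketch; the file names are hypothetical) is to compare the number of extracted durations against the number of symbols actually fed to the aligner:

  import numpy as np

  durs = np.load('output/durations/sample_01.npy')  # hypothetical duration file
  text = open('sample_01.txt').read().strip()       # the exact aligned symbol sequence
  print(len(durs), 'durations vs', len(text), 'symbols')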
