DeepForcedAligner

With this tool you can create accurate text-audio alignments given a set of audio files and their transcriptions. The alignments can, for example, be used to train text-to-speech models such as FastSpeech. In comparison to other forced alignment tools, this repo has the following advantages:

  • Multilingual: By design, the DFA is language-agnostic and can align either characters or phonemes.
  • Robustness: The alignment extraction is highly tolerant of text errors and silent characters.
  • Convenience: Easy installation with no extra dependencies. You can provide your own data in the standard LJSpeech format without special preprocessing (such as applying phonetic dictionaries, non-speech annotations, etc.).

The approach is based on training a simple speech recognition model with CTC loss on mel spectrograms extracted from the wav files.
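
To make the idea concrete, here is a minimal PyTorch sketch of such a CTC-trained recognizer, not the repo's actual model or training loop; the architecture, symbol count, and shapes are illustrative:

  import torch
  import torch.nn as nn

  # A recognition model predicts per-frame symbol probabilities from mels;
  # CTC loss then aligns those frames against the target character sequence.
  class ToyAligner(nn.Module):
      def __init__(self, n_mels=80, n_symbols=50, hidden=256):
          super().__init__()
          self.rnn = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
          self.proj = nn.Linear(2 * hidden, n_symbols)  # index 0 is the CTC blank

      def forward(self, mels):  # mels: (batch, frames, n_mels)
          out, _ = self.rnn(mels)
          return self.proj(out).log_softmax(-1)

  model = ToyAligner()
  ctc_loss = nn.CTCLoss(blank=0)

  mels = torch.randn(2, 400, 80)            # dummy batch: 2 clips, 400 mel frames
  targets = torch.randint(1, 50, (2, 60))   # dummy character ids (0 = blank)
  log_probs = model(mels).transpose(0, 1)   # CTCLoss expects (frames, batch, symbols)
  loss = ctc_loss(log_probs, targets,
                  input_lengths=torch.full((2,), 400),
                  target_lengths=torch.full((2,), 60))
  loss.backward()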

Installation

Running on Python >=3.6

pip install -r requirements.txt

Example Training and Extraction

Check out the following demo notebook for training and character duration extraction on the LJSpeech dataset:

Open In Colab

(1) Download the LJSpeech dataset, set paths in config.yaml:

  dataset_dir: LJSpeech
  metadata_path: LJSpeech/metadata.csv

(2) Preprocess the data and train aligner:

  python preprocess.py
  python train.py

(3) Extract durations with the latest model checkpoint (60k steps should be sufficient):

  python extract_durations.py

By default, the durations are stored as NumPy (.npy) files in:

  output/durations 

Each character duration corresponds to one mel time step, which translates to hop_length / sample_rate seconds in the wav file.
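
For example, to load a duration file and convert it to seconds (a sketch: the file name is hypothetical, and hop_length / sample_rate below are common LJSpeech-style values; check your config.yaml):

  import numpy as np

  hop_length, sample_rate = 256, 22050          # assumed values; see config.yaml
  durs = np.load('output/durations/00001.npy')  # hypothetical id, one integer per character
  seconds = durs * hop_length / sample_rate     # per-character duration in seconds
  print(f'{durs.sum()} mel frames = {seconds.sum():.2f} s total')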

Tensorboard

You can monitor the training with

  tensorboard --logdir dfa_checkpoints

Using Your Own Dataset

Just bring your dataset into the LJSpeech format. We recommend cleaning and preprocessing the text in the metadata file before running the DFA, e.g. lower-casing, phonemization, etc.
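
For reference, the expected layout is roughly the following (a sketch; the exact folder and file names depend on your setup):

  my_dataset/
  ├── metadata.csv      # pipe-separated lines: id|transcription
  └── wavs/
      ├── 00001.wav
      └── 00002.wav

  metadata.csv:
  00001|First sample text
  00002|Second sample text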

Using Preprocessed Mel Spectrograms

You can provide your own mel spectrograms by setting the following in config.yaml:

  precomputed_mels: /path/to/mels

Make sure that the mel names match the ids in the metafile, e.g.

  00001.mel ---> 00001|First sample text
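
A minimal sketch of precomputing mels with librosa (the parameters below are assumptions and must match what the aligner expects in config.yaml; the exact saving convention is also an assumption, just keep the names consistent with the metafile ids):

  import librosa
  import numpy as np

  wav, sr = librosa.load('wavs/00001.wav', sr=22050)   # assumed sample rate
  mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                       hop_length=256, n_mels=80)  # assumed params
  mel = np.log(np.clip(mel, 1e-5, None))               # log compression (one common choice)
  np.save('/path/to/mels/00001.mel', mel)              # note: np.save appends .npy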


deepforcedaligner's Issues

About training on a Bangla dataset and finding the duration of each character

Hi @cschaefer26,
I have trained the DFA model on a Bangla dataset and predicted durations for a sample audio file. The duration file (.npy format) contains only integer values, one per character. I then used the following formula:
duration of a character = (hop_length * predicted value for that character) / sample_rate. Is this correct? Kindly help me in this regard.
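
For reference, this matches the hop_length / sample_rate relation from the README; a small worked example with assumed parameters:

  hop_length, sample_rate = 256, 22050   # assumed; use your own config values
  pred_value = 7                         # predicted mel frames for one character
  seconds = hop_length * pred_value / sample_rate
  print(round(seconds, 4))               # 0.0813 s for this character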

Align Larger Audio File

Hi @cschaefer26,
You have done a nice job, and I'm using your repo. However, when aligning a longer audio file (> 1 minute) with its character (phoneme) sequence at inference time, the number of predicted values in the duration file (.npy file) does not match the number of characters (phonemes) that I input with the audio file. What is the problem here? I want to use the pretrained model (trained on a Bangla dataset [audio, phoneme sequences]) for phoneme duration prediction, so accuracy is a major concern for me.

Note that during training I used audio files 10-15 seconds long and their corresponding transcriptions (phoneme sequences), and I customized your code (preprocess.py and extract_durations.py) to run inference on a single audio file and its transcription.
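
One quick sanity check for this kind of mismatch (a sketch; the file names are hypothetical) is to compare the number of extracted durations against the number of symbols actually fed to the aligner:

  import numpy as np

  durs = np.load('output/durations/sample_01.npy')  # hypothetical duration file
  text = open('sample_01.txt').read().strip()       # the exact aligned symbol sequence
  print(len(durs), 'durations vs', len(text), 'symbols')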
