
This project forked from keonlee9420/dailytalk


Official repository of DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech, ICASSP 2023

License: MIT License



DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech

Keon Lee*, Kyumin Park*, Daeyoung Kim

In our paper, we introduce DailyTalk, a high-quality conversational speech dataset designed for Text-to-Speech.

Abstract: The majority of current Text-to-Speech (TTS) datasets, which are collections of individual utterances, contain few conversational aspects. In this paper, we introduce DailyTalk, a high-quality conversational speech dataset designed for conversational TTS. We sampled, modified, and recorded 2,541 dialogues from the open-domain dialogue dataset DailyDialog inheriting its annotated attributes. On top of our dataset, we extend prior work as our baseline, where a non-autoregressive TTS is conditioned on historical information in a dialogue. From the baseline experiment with both general and our novel metrics, we show that DailyTalk can be used as a general TTS dataset, and more than that, our baseline can represent contextual information from DailyTalk. The DailyTalk dataset and baseline code are freely available for academic use with CC-BY-SA 4.0 license.

Dataset

You can download our dataset. Please refer to Statistic Details for details.

Pretrained Models

You can download our pretrained models. There are two directories: 'history_none' and 'history_guo'. The former uses no history encoding, so it is not a conversational context-aware model. The latter uses history encoding following Conversational End-to-End TTS for Voice Agent (Guo et al., 2020).

Toggle the history encoding type with

# In the model.yaml
history_encoder:
  type: "Guo" # ["none", "Guo"]
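
As a minimal sketch of the toggle above, the following Python helper validates the two values listed in the config comment and maps each one to the matching pretrained-model directory. The function names and the dict-based config are hypothetical and not part of the repository; the sketch assumes model.yaml has already been loaded into a plain dict.

```python
from copy import deepcopy

# Values listed in the model.yaml comment of this README.
ALLOWED_HISTORY_TYPES = ("none", "Guo")


def set_history_encoder(model_config: dict, history_type: str) -> dict:
    """Return a copy of the config with history_encoder.type set.

    Hypothetical helper: `model_config` stands for model.yaml loaded as a dict.
    """
    if history_type not in ALLOWED_HISTORY_TYPES:
        raise ValueError(
            f"history_encoder.type must be one of {ALLOWED_HISTORY_TYPES}, "
            f"got {history_type!r}"
        )
    updated = deepcopy(model_config)  # leave the caller's config untouched
    updated.setdefault("history_encoder", {})["type"] = history_type
    return updated


def checkpoint_subdir(history_type: str) -> str:
    """Map a history type to the directory names given in this README
    ('history_none' and 'history_guo'). The mapping rule is an assumption."""
    if history_type not in ALLOWED_HISTORY_TYPES:
        raise ValueError(f"unknown history type: {history_type!r}")
    return f"history_{history_type.lower()}"
```

Validating the value before writing it back avoids silently training with a misspelled encoder type.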

Quickstart

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

A Dockerfile is also provided for Docker users.

Inference

First, download our dataset. Then download the pretrained models and put them in output/ckpt/DailyTalk/. Also unzip generator_LJSpeech.pth.tar or generator_universal.pth.tar in the hifigan folder. The models are trained with unsupervised duration modeling, the transformer building block, and the respective history encoding type.

Only batch inference is supported, because generating a turn may require the contextual history of the conversation. Try

python3 synthesize.py --source preprocessed_data/DailyTalk/val_*.txt --restore_step RESTORE_STEP --mode batch --dataset DailyTalk

to synthesize all utterances in preprocessed_data/DailyTalk/val_*.txt.
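
If you prefer to drive the command above from a script, a sketch along these lines may help. It only assembles the argument list shown in this README; the helper names are hypothetical, and running one command per matched file is an assumption about how synthesize.py handles `--source`, not something the repository documents here.

```python
import glob
from typing import List


def synthesize_argv(source: str, restore_step: int,
                    dataset: str = "DailyTalk") -> List[str]:
    """Build the batch-inference command from this README for one source file."""
    return [
        "python3", "synthesize.py",
        "--source", source,
        "--restore_step", str(restore_step),
        "--mode", "batch",  # only batch inference is supported
        "--dataset", dataset,
    ]


def commands_for_pattern(pattern: str, restore_step: int) -> List[List[str]]:
    """Expand a glob such as 'preprocessed_data/DailyTalk/val_*.txt' in Python
    (the shell would normally do this) and build one command per match."""
    return [synthesize_argv(path, restore_step)
            for path in sorted(glob.glob(pattern))]
```

Each returned list can be passed to `subprocess.run(argv, check=True)` from the repository root.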

Training

Preprocessing

  • For a multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/. Please note that our pretrained models are not trained with this (they are trained with speaker_embedder: "none").

  • Run

    python3 prepare_align.py --dataset DailyTalk
    

    for some preparations.

    For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files in preprocessed_data/DailyTalk/TextGrid/. Alternatively, you can run the aligner yourself. Please note that our pretrained models are not trained with supervised duration modeling (they are trained with learn_alignment: True).

    After that, run the preprocessing script by

    python3 preprocess.py --dataset DailyTalk
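
The preprocessing steps above depend on the alignments being unzipped in the right place. The sketch below, which assumes you use the pre-extracted alignments and that they carry MFA's usual `.TextGrid` extension, checks for them before building the final command. The helper names are hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import List

# Location given in this README for the unzipped alignments.
DEFAULT_TEXTGRID_DIR = Path("preprocessed_data/DailyTalk/TextGrid")


def has_textgrids(textgrid_dir: Path) -> bool:
    """True if the directory exists and holds at least one .TextGrid file,
    searched recursively (the nested layout is an assumption)."""
    return textgrid_dir.is_dir() and any(textgrid_dir.rglob("*.TextGrid"))


def preprocess_argv(textgrid_dir: Path = DEFAULT_TEXTGRID_DIR,
                    dataset: str = "DailyTalk") -> List[str]:
    """Return the preprocess.py command, failing early if alignments are missing."""
    if not has_textgrids(textgrid_dir):
        raise FileNotFoundError(
            f"No .TextGrid files under {textgrid_dir}; unzip the alignments first."
        )
    return ["python3", "preprocess.py", "--dataset", dataset]
```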
    

Training

Train your model with

python3 train.py --dataset DailyTalk

Useful options:

  • To use Automatic Mixed Precision, append the --use_amp argument to the above command.
  • The trainer assumes single-node multi-GPU training. To use specific GPUs, specify CUDA_VISIBLE_DEVICES=<GPU_IDs> at the beginning of the above command.
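
The two options above can be combined when launching training from Python. This is a minimal sketch, not the repository's own launcher: `build_train_launch` is a hypothetical helper that returns the command and an environment with `CUDA_VISIBLE_DEVICES` set.

```python
import os
from typing import Dict, List, Mapping, Optional, Sequence, Tuple


def build_train_launch(
    gpu_ids: Optional[Sequence[int]] = None,
    use_amp: bool = False,
    dataset: str = "DailyTalk",
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for the training command shown in this README."""
    argv = ["python3", "train.py", "--dataset", dataset]
    if use_amp:
        argv.append("--use_amp")  # Automatic Mixed Precision
    # Copy the environment so the caller's process is not modified.
    env = dict(os.environ if base_env is None else base_env)
    if gpu_ids is not None:
        # Restrict training to the listed GPUs, e.g. "0,1".
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return argv, env
```

For example, `argv, env = build_train_launch(gpu_ids=[0, 1], use_amp=True)` followed by `subprocess.run(argv, env=env, check=True)` corresponds to prefixing the command with `CUDA_VISIBLE_DEVICES=0,1` and appending `--use_amp`.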

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audios are shown.

Notes

  • Convolutional embedding is used, as in StyleSpeech, for phoneme-level variance in unsupervised duration modeling. Otherwise, bucket-based embedding is used, as in FastSpeech2.
  • Phoneme-level unsupervised duration modeling takes longer than frame-level, since the additional computation of phoneme-level variance is activated at runtime.
  • There are two options for speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pretrained DeepSpeaker model (as STYLER did). You can toggle it by setting the config (between 'none' and 'DeepSpeaker').
  • For vocoder, HiFi-GAN is used for all experiments in our paper.

Citation

If you would like to use our dataset and code or refer to our paper, please cite as follows.

@misc{lee2022dailytalk,
    title={DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech},
    author={Keon Lee and Kyumin Park and Daeyoung Kim},
    year={2022},
    eprint={2207.01063},
    archivePrefix={arXiv},
    primaryClass={eess.AS}
}

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
