lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.

Home Page: https://lhotse.readthedocs.io/en/latest/

License: Apache License 2.0

Python 99.97% Shell 0.03%
speech audio kaldi machine-learning ai deep-learning pytorch data python speech-recognition

lhotse's Introduction

Lhotse

Lhotse is a Python library aiming to make speech and audio data preparation flexible and accessible to a wider community. Alongside k2, it is a part of the next generation Kaldi speech processing library.

Tutorial presentations and materials

About

Main goals

  • Attract a wider community to speech processing tasks with a Python-centric design.
  • Accommodate experienced Kaldi users with an expressive command-line interface.
  • Provide standard data preparation recipes for commonly used corpora.
  • Provide PyTorch Dataset classes for speech and audio related tasks.
  • Flexible data preparation for model training with the notion of audio cuts.
  • Efficiency, especially in terms of I/O bandwidth and storage capacity.

Tutorials

We currently have the following tutorials available in the examples directory:

  • Basic complete Lhotse workflow Colab
  • Transforming data with Cuts Colab
  • WebDataset integration Colab
  • How to combine multiple datasets Colab
  • Lhotse Shar: storage format optimized for sequential I/O and modularity Colab

Examples of use

Check out the following links to see how Lhotse is being put to use:

  • Icefall recipes: where k2 and Lhotse meet.
  • Minimal ESPnet+Lhotse example: Colab

Main ideas

Like Kaldi, Lhotse provides standard data preparation recipes, but extends that with a seamless PyTorch integration through task-specific Dataset classes. The data and meta-data are represented in human-readable text manifests and exposed to the user through convenient Python classes.

Lhotse introduces the notion of audio cuts, designed to ease training data construction with operations such as mixing, truncation, and padding that are performed on-the-fly to minimize the amount of storage required. Data augmentation and feature extraction are supported both in pre-computed mode, with highly-compressed feature matrices stored on disk, and on-the-fly mode that computes the transformations upon request. Additionally, Lhotse introduces feature-space cut mixing to get the best of both worlds.
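
Below is a minimal sketch of these on-the-fly operations in Python; the manifest paths are placeholders, and the exact keyword arguments may differ between Lhotse versions.

from lhotse import CutSet

cuts = CutSet.from_file("cuts.jsonl.gz")           # placeholder manifest path
noise = CutSet.from_file("noise_cuts.jsonl.gz")    # placeholder noise manifest

augmented = (
    cuts
    .truncate(max_duration=10.0, offset_type="start")  # keep at most 10 s of each cut
    .pad(duration=10.0)                                 # pad shorter cuts up to 10 s
    .mix(cuts_to_mix=noise, snr=10)                     # overlay noise at roughly 10 dB SNR
)
# No audio has been read yet; the operations are applied lazily when
# samples or features are actually requested.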

Installation

Lhotse supports Python version 3.7 and later.

Pip

Lhotse is available on PyPI:

pip install lhotse

To install the latest, unreleased version, do:

pip install git+https://github.com/lhotse-speech/lhotse

Development installation

For development installation, you can fork/clone the GitHub repo and install with pip:

git clone https://github.com/lhotse-speech/lhotse
cd lhotse
pip install -e '.[dev]'
pre-commit install  # installs pre-commit hooks with style checks

# Running unit tests
pytest test

# Running linter checks
pre-commit run

This is an editable installation (-e option), meaning that your changes to the source code are automatically reflected when importing lhotse (no re-install needed). The [dev] part means you're installing extra dependencies that are used to run tests, build documentation, or launch Jupyter notebooks.

Environment variables

Lhotse uses several environment variables to customize its behavior. They are as follows:

  • LHOTSE_REQUIRE_TORCHAUDIO - when it's set to a value other than 1|True|true|yes, we will not check whether torchaudio is installed and will remove it from the requirements. This disables many of Lhotse's functionalities, but the basic capabilities will remain (including reading audio with soundfile).
  • LHOTSE_AUDIO_DURATION_MISMATCH_TOLERANCE - used when we load audio from a file and receive a different number of samples than declared in Recording.num_samples. This is sometimes necessary because different codecs (or even different versions of the same codec) may use different padding when decoding compressed audio. Typically values up to 0.1, or even 0.3 (second) are still reasonable, and anything beyond that indicates a serious issue.
  • LHOTSE_AUDIO_BACKEND - may be set to any of the values returned from CLI lhotse list-audio-backends to override the default behavior of trial-and-error and always use a specific audio backend.
  • LHOTSE_AUDIO_LOADING_EXCEPTION_VERBOSE - when set to 1 we'll emit full exception stack traces when every available audio backend fails to load a given file (they might be very large).
  • LHOTSE_DILL_ENABLED - when it's set to 1|True|true|yes, we will enable dill-based serialization of CutSet and Sampler across processes (it's disabled by default even when dill is installed).
  • LHOTSE_LEGACY_OPUS_LOADING - (=1) reverts to a legacy OPUS loading mechanism that triggered a new ffmpeg subprocess for each OPUS file.
  • LHOTSE_PREPARING_RELEASE - used internally by developers when releasing a new version of Lhotse.
  • TORCHAUDIO_USE_BACKEND_DISPATCHER - when set to 1 and torchaudio version is below 2.1, we'll enable the experimental ffmpeg backend of torchaudio.
  • RANK, WORLD_SIZE, WORKER, and NUM_WORKERS are internally used to inform Lhotse Shar dataloading subprocesses.
  • READTHEDOCS is internally used for documentation builds.
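
For instance, a script might set a few of these variables from Python before importing lhotse. The values below are illustrative only; whether a given variable takes effect at import time or lazily at runtime depends on the variable.

import os

# Illustrative values - see the list above for what each variable means.
os.environ["LHOTSE_AUDIO_DURATION_MISMATCH_TOLERANCE"] = "0.1"  # tolerate up to 0.1 s of mismatch
os.environ["LHOTSE_DILL_ENABLED"] = "1"                         # allow dill-based serialization
os.environ["LHOTSE_AUDIO_LOADING_EXCEPTION_VERBOSE"] = "1"      # full stack traces on audio load failure

import lhotse  # noqa: E402  (imported after setting the variables so they take effect)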

Optional dependencies

Other pip packages. You can leverage optional features of Lhotse by installing the relevant supporting package like this: pip install lhotse[package_name]. The supported optional packages include:

  • pip install lhotse[kaldi] for a maximal feature set related to Kaldi compatibility. It includes libraries such as kaldi_native_io (a more efficient variant of kaldi_io) and kaldifeat that port some of Kaldi's functionality into Python.
  • pip install lhotse[orjson] for up to 50% faster reading of JSONL manifests.
  • pip install lhotse[webdataset]. We support "compiling" your data into WebDataset tarball format for more effective IO. You can still interact with the data as if it was a regular lazy CutSet. To learn more, check out the following tutorial: Colab
  • pip install h5py if you want to extract speech features and store them as HDF5 arrays.
  • pip install dill. When dill is installed, we'll use it to pickle a CutSet that uses a lambda function in calls such as .map or .filter. This is helpful when using a PyTorch DataLoader with num_workers > 0. Without dill, depending on your environment, you'll see an exception or a hanging script.
  • pip install smart_open to read and write manifests and data in any location supported by smart_open (e.g. cloud, http).
  • pip install opensmile for feature extraction using the OpenSmile toolkit's Python wrapper.

sph2pipe. For reading older LDC SPHERE (.sph) audio files that are compressed with codecs unsupported by ffmpeg and sox, please run:

# CLI
lhotse install-sph2pipe

# Python
from lhotse.tools import install_sph2pipe
install_sph2pipe()

This will download sph2pipe to ~/.lhotse/tools, compile it, and auto-register it in PATH. The program should then be automatically detected and used by Lhotse.

Examples

We have example recipes showing how to prepare data and load it in Python as a PyTorch Dataset. They are located in the examples directory.

A short snippet to show how Lhotse can make audio data preparation quick and easy:

from torch.utils.data import DataLoader
from lhotse import CutSet, Fbank
from lhotse.dataset import VadDataset, SimpleCutSampler
from lhotse.recipes import prepare_switchboard

# Prepare data manifests from a raw corpus distribution.
# The RecordingSet describes the metadata about audio recordings;
# the sampling rate, number of channels, duration, etc.
# The SupervisionSet describes metadata about supervision segments:
# the transcript, speaker, language, and so on.
swbd = prepare_switchboard('/export/corpora3/LDC/LDC97S62')

# CutSet is the workhorse of Lhotse, allowing for flexible data manipulation.
# We create 5-second cuts by traversing SWBD recordings in windows.
# No audio data is actually loaded into memory or stored to disk at this point.
cuts = CutSet.from_manifests(
    recordings=swbd['recordings'],
    supervisions=swbd['supervisions']
).cut_into_windows(duration=5)

# We compute the log-Mel filter energies and store them on disk;
# Then, we pad the cuts to 5 seconds to ensure all cuts are of equal length,
# as the last window in each recording might have a shorter duration.
# The padding will be performed once the features are loaded into memory.
cuts = cuts.compute_and_store_features(
    extractor=Fbank(),
    storage_path='feats',
    num_jobs=8
).pad(duration=5.0)

# Construct a Pytorch Dataset class for Voice Activity Detection task:
dataset = VadDataset()
sampler = SimpleCutSampler(cuts, max_duration=300)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=None)
batch = next(iter(dataloader))

The VadDataset will yield a batch with pairs of feature and supervision tensors such as the following - the speech starts roughly at the first second (100 frames):

(figure: example VadDataset batch showing the feature matrix and the voice activity supervision)

lhotse's People

Contributors

amirhussein96, boren-ms, csukuangfj, desh2608, flyingleafe, freewym, janvainer, jimbozhang, jinzr, jtrmal, karelvesely84, lassewolter, lifeiteng, luomingshuang, m-wiesner, marcinwitkowski, marcoyang1998, mikuchar, oplatek, pkufool, popcornell, pzelasko, s-mousmita, shanguanma, teowenshen, tomiinek, wgb14, yfyeung, yuekaizhang, zjwang21

lhotse's Issues

Re-structure the corpus preparation module

Currently, the corpora are documented by a top-level docstring in each lhotse.recipes module. For better discoverability to the users, we should attach that description to the docstrings of prepare_X functions and lhotse prepare X CLI help messages. We could achieve that with a similar approach to what the transformers library does here.

Sometimes start_frame + num_frames > T in K2SpeechRecognitionIterableDataset

Here is a failure example :

feature shape (B,T,F) : torch.Size([10, 3017, 40])
start_frame + num_frames: tensor([1460, 3018, 1476, 2968, 1472, 2968, 1532, 2836, 1491, 2809, 1536, 2688,
        1545, 2681, 1557, 2593, 1594, 2085, 2580, 1640], dtype=torch.int32)

As you can see, 3018 > T = 3017.

I compute (and save) features with
https://github.com/k2-fsa/snowfall/blob/811e5333281a279e27e3008f5d025111d19cc487/egs/librispeech/asr/simple_v1/prepare.py#L42-L48,
then load train-clean-100 with https://github.com/k2-fsa/snowfall/blob/811e5333281a279e27e3008f5d025111d19cc487/egs/librispeech/asr/simple_v1/train.py#L177-L181

Note that there is also a warning about the torch version; should I upgrade PyTorch?

/ceph-hw/lhotse/lhotse/augmentation/torchaudio.py:13: UserWarning: Torchaudio SoX effects chains are only introduced in version 0.7 - please upgrade your PyTorch to 1.7+ and torchaudio to 0.7+ to use them.
  warnings.warn('Torchaudio SoX effects chains are only introduced in version 0.7 - '

GPU/batch feature extraction

As I noticed, we are using torchaudio to extract features from the raw waveform (see https://github.com/pzelasko/lhotse/blob/master/lhotse/features.py#L122).
But there are two potential issues with torchaudio, especially with torchaudio.compliance.kaldi:

  • From my experiments, I found that these functions can only work with CPU tensors (waveforms). In normal situations, we are most likely to put our data and model on GPUs, and in that case this function will raise an error. (I haven't found any official documentation for this point, and it would be great if someone could help me double-check it.)
  • The waveform should be of shape (c, t) where c can only be in [0, 2) (https://pytorch.org/audio/compliance.kaldi.html#mfcc). In other words, it doesn't support batch-level feature extraction and can only be used before you form a batch (e.g. in the __getitem__ function). That is probably OK in normal cases, but will cause trouble if you'd like to do feature extraction outside the dataset, in the model's forward instead.
    As an alternative, torchaudio.transforms (torch.nn.Module) seems more flexible, and we can even customize it as a normal nn.Module to fit our use cases.
    Please let me know what you think.
    Thanks!

Originally posted by @YiwenShaoStephen in #44 (comment)

Release date?

Hi, we are very interested in this project (willing to contribute) as it aligns well with our own plans for ASR datapipelines.
However, we do not want to start integrating Lhotse until there is a stable version.
Is there a firm release date for a stable version of this project, or at least a realistic timeline for when a stable API will be released?
What features need work so that this project can be released soon?

Regards
Jan Vainer

High IO performance storage for features

I'm putting this here so that I don't forget - I have some reservations about storing a huge number of feature files on disk; the IO might get slow on network disks. We might want to look into some kind of storage format that's optimized for random access. Maybe Apache Arrow is a promising choice. I recall a project called petastorm developed by Uber for highly scalable data loading (including distributed), but I'm not sure if it's usable for non-tabular data formats. There's also HDF5, which I think can be tuned via the chunked storage layout.

Create documentation

I'm thinking about a "read-the-docs" style doc page and adding some rationale, main features, and some examples in the README.md.

Refactoring Cut class hierarchy

The *Cut classes code readability would benefit from a traditional inheritance structure with a common interface/abstract base class. Currently, it's somewhat messy because cuts are dataclasses, and these do not work that well with inheritance. Consider making cuts regular Python classes to overcome this, and define the common API in a clean way.
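
One hypothetical shape for such a hierarchy is sketched below; the names and methods are illustrative only, not a finished design.

from abc import ABCMeta, abstractmethod

import numpy as np

class BaseCut(metaclass=ABCMeta):
    """Common interface shared by all cut types (hypothetical name)."""

    @property
    @abstractmethod
    def duration(self) -> float: ...

    @abstractmethod
    def load_audio(self) -> np.ndarray: ...

    @abstractmethod
    def load_features(self) -> np.ndarray: ...

class Cut(BaseCut): ...         # a span of a single recording
class MixedCut(BaseCut): ...    # a combination of several cuts
class PaddingCut(BaseCut): ...  # synthetic silence/padding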

Support for working with raw audio

Currently, Lhotse is somewhat based on the assumption that a user will mostly work with features such as log-mel energies. It'll be useful to have PyTorch Datasets providing raw audio as well. I'm thinking we should make it a convention to put Waveform in the class name of datasets that provide raw audio (e.g. UnsupervisedWaveformDataset). I'll try to list the steps that are needed to get there.

  • Create Cuts directly from RecordingSet (with an empty Features field - currently, you have to extract some features to make cuts, just to ignore them later)
  • load_audio() in Cut
  • load_audio() in MixedCut
  • load_audio() in PaddingCut
  • Implement UnsupervisedWaveformDataset
  • Implement other waveform datasets

Repository with downloadable manifests for standard recipes

Dan's original comment:

I wonder if the next step could be creating an example setup for some dataset, e.g. mini_librispeech?
I'm thinking for now we can provide scripts that create the manifest files; making them downloadable could be an optional next step.

I don't think we necessarily have to be too purist about this, in terms of having the scripts be just Python; there may be some datasets where having shell scripts, like Kaldi does, or using multiple languages will be necessary. One possibility is to structure it a little like Kaldi, with different egs directories. We also don't have to have it in the lhotse repo if we don't feel that's right (but including it is fine with me, I think). I am slightly concerned about versioning issues: if people make recipes based on datasets and we then change something, what happens?

I am thinking about this a little like the standard datasets available in things like PyTorch, where people import them into their own setups/repos.

Bear in mind that at some point we'll want to be writing and loading compressed features using lilcom. We could extract the features using kaldi10feat or some other method (I'm thinking of log-mel features). For now we can probably have one recording per file (?).

Before making an example script it's OK to work on the lhotse stuff for handling features, though. We should just be thinking about what the scripts will look like.

Parallel feature extraction with HDF5 storage backend testing/examples

In the current implementation, it's not possible to use a thread/process/distributed executor and store the features in HDF5 files because the HDF5Writer opens a single filehandle.

We should at least test and write an example showing that it's possible to split the manifest, extract features for each split in a parallel way and then re-combine into a single manifest again. Each split will store the features in a separate HDF5 file (I'm not sure if it makes sense to merge them too, but I don't think so).
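
Until that is supported out of the box, a rough sketch of the split-extract-recombine workaround could look like this. It assumes the Python API names CutSet.split, LilcomHdf5Writer, and combine; the paths are placeholders, and HDF5 storage requires pip install h5py.

from lhotse import CutSet, Fbank, LilcomHdf5Writer, combine

cuts = CutSet.from_file("cuts.jsonl.gz")   # placeholder input manifest
splits = cuts.split(num_splits=4)

extracted = []
for i, split in enumerate(splits):
    extracted.append(
        split.compute_and_store_features(
            extractor=Fbank(),
            storage_path=f"feats/split_{i}",   # each split writes to its own HDF5 file
            storage_type=LilcomHdf5Writer,
        )
    )

cuts = combine(*extracted)                     # merge back into a single manifest
cuts.to_file("cuts_with_feats.jsonl.gz")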

Consistency between duration and number of frames

Original comment from Dan:

It would probably make sense to have duration always be n * frame_shift, but understand that this is kind of approximate and actually it may use slightly more context than that?

I'd rather make it so that downstream, these cuts behave quite simply w.r.t. the relationship of duration and num-frames, and the user doesn't have to worry about the frame-length and how it may impact the num-frames. That was a rabbit-hole in Kaldi itself.

Originally posted by @danpovey in #16
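
A toy illustration of the convention Dan describes, where duration and num_frames stay trivially consistent (an assumption about the intended behaviour, not Lhotse's actual code):

import math

def num_frames(duration: float, frame_shift: float) -> int:
    # The number of frames depends only on duration and frame_shift,
    # never on the frame length / acoustic context.
    return round(duration / frame_shift)

assert num_frames(duration=5.0, frame_shift=0.01) == 500
assert math.isclose(500 * 0.01, 5.0)  # duration == n * frame_shift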

Sync the style of example bash scripts with Kaldi

The style of librimix.sh is not quite the same as Kaldi's:

  • librimix/s5/run.sh (kaldi) vs. librimix/librimix.sh (lhotse)
  • use stage blocks (kaldi) vs. no stage (lhotse)
  • some other differences

I'm writing the mini_librispeech recipe. I'm considering using the kaldi-like style for the bash version. Not sure if it's a good idea. @pzelasko @danpovey

If we choose to use the kaldi-like style, I'll update the librimix recipe as well.

Source Separation Integration: sum(sources + background_noise) != mixture with mels.

I am experimenting a bit with lhotse integration in asteroid here:
https://github.com/mpariente/asteroid/blob/lhotse_integration_test/egs/MiniLibriMix/lhotse/

One thing I noticed (unless I did something completely wrong) is that the sum of the sources plus background noise features is different from the mixture features:
https://github.com/mpariente/asteroid/blob/lhotse_integration_test/egs/MiniLibriMix/lhotse/test_additive.py
This could be a problem when training a separation model, as the underlying assumption is basically that the process is additive.
I guess this is due to the fact that the feature computation via torchaudio.compliance.kaldi.fbank must involve some non-linear operations (aside from the log operation, of course!).
I guess so because dithering is disabled by default (see pytorch/audio#371).
Does any of you have a clue why this happens? The difference seems too substantial (first decimal digit) to be ascribed to truncation etc.

BTW, the problem is easily side-stepped by summing the source and noise mels at training time to get the mixture. It is inexpensive, and you save disk space by not dumping the mixture feats as well.

Split and combine manifests

Introduce python functions and CLI modes: lhotse split and lhotse combine. Similar to Kaldi split_data_dir.sh and combine_data_dirs.sh.

Randomized testing of *Cut classes invariants

Motivation:

MixedCut has grown into a more complex class than it seems, since it can be composed of many underlying cuts that have either Cut or PaddingCut type. Recently I fixed some off-by-one num_frames errors between the meta-data and the actual feature matrices due to rounding. While things seem to be working okay now, I'd like to be sure we are free of this sort of error, but the space of possible cut combinations is too large to cover with standard unit test cases.

Goal:

Test that MixedCut created in various ways always has consistent num_samples and num_frames metadata with the actual data shapes when samples/features are loaded into memory.

We should create the MixedCut by initializing fake Recordings + Cuts with random sampling rates and durations, extracting features for them, and then performing a number of randomly selected operations: pad, mix, and append. We can use a randomized testing library like hypothesis if it is useful.
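
A sketch of what such a property-based test could look like with hypothesis; make_random_cut_with_features is a hypothetical helper that would build a fake Recording + Cut and extract features for it.

from hypothesis import given, strategies as st

FRAME_SHIFT = 0.01

@given(
    duration=st.floats(min_value=0.1, max_value=30.0),
    pad_to=st.floats(min_value=0.1, max_value=30.0),
)
def test_padded_cut_metadata_matches_features(duration, pad_to):
    # Hypothetical helper: builds a fake Recording + Cut and extracts features.
    cut = make_random_cut_with_features(duration=duration, frame_shift=FRAME_SHIFT)
    padded = cut.pad(duration=max(duration, pad_to))
    feats = padded.load_features()
    # The invariant under test: metadata agrees with the actual data shape.
    assert feats.shape[0] == padded.num_frames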

Interfacing with externally extracted features

Example use-case: somebody has their own feature extractor (e.g., a pre-trained network), and they want to use it with Lhotse.

It is currently possible, but somewhat ugly - basically they have to ignore our FeatureExtractor and FeatureSetBuilder and construct the FeatureSet themselves. They will still need to pass a default FeatureExtractor to the FeatureSet.

I'm thinking of the following:

  • decouple FeatureSetBuilder from the FeatureExtractor so that somebody can supply their own feature extraction function
  • make FeatureExtractor config optional in the FeatureSet (but added there by default in Lhotse)

Update:

  • Generic mechanism for adding custom feature extractors (feature extractor registry); should also specify for each feature type how to mix them, e.g. simple add, or exp-sum-log (or raise exceptions)
  • Decoupling FeatureSetBuilder from FeatureExtractor
  • Decoupling configuration for different feature extractor types
  • Separate feature config from feature manifest (keep vital settings as part of individual feature documents)
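
As a thought experiment, a registry-based custom extractor could take a shape like the following; all names are hypothetical and only illustrate the decoupling described above, including a per-extractor rule for mixing features.

from dataclasses import dataclass

import numpy as np

# Hypothetical global registry mapping extractor names to classes.
FEATURE_EXTRACTORS = {}

def register_extractor(cls):
    FEATURE_EXTRACTORS[cls.name] = cls
    return cls

@dataclass
class MyExtractorConfig:
    frame_shift: float = 0.01
    num_bins: int = 80

@register_extractor
class MyExtractor:
    name = "my-extractor"
    config_type = MyExtractorConfig

    def __init__(self, config: MyExtractorConfig = MyExtractorConfig()):
        self.config = config

    def feature_dim(self, sampling_rate: int) -> int:
        return self.config.num_bins

    def extract(self, samples: np.ndarray, sampling_rate: int) -> np.ndarray:
        # Placeholder: a real extractor (e.g. a pre-trained network) would go here.
        num_frames = int(len(samples) / (sampling_rate * self.config.frame_shift))
        return np.zeros((num_frames, self.feature_dim(sampling_rate)), dtype=np.float32)

    @staticmethod
    def mix(features_a: np.ndarray, features_b: np.ndarray, energy_scaling_factor_b: float) -> np.ndarray:
        # How this feature type is mixed: log-domain energies need exp-sum-log.
        return np.log(np.exp(features_a) + energy_scaling_factor_b * np.exp(features_b))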

Progress indicator

@pzelasko how hard would it be to implement a progress indicator?
IDK how that would interact with the multiprocessing?
It's a bit disconcerting to start running something when you have no idea how long it will take.

Two issues kaldi style data formats have (that I'm hoping will be improved)

Not sure if this is the right place, sorry if not. I have a vague memory of hearing in the recent kaldi community meetings that one should make an issue for suggestions.

  1. In kaldi the utterances are already split by speaker; there are no speaker changes inside an utterance. Depending on the use case, this is not at all realistic and ends up overestimating the impact of things like ivectors, as in reality speaker changes will cause degradations in performance (when using ivectors, for example).
    Because of that, I think one should add the option of decoding a test set across speaker changes.

  2. In kaldi decoding is never done on long utterances. This is of course because of the need to do alignment, but in practice decoding long audio files is a common use case, and by segmenting them the LM loses context and one loses the ability (or rather, it does not make sense) to try out techniques that could take advantage of having hundreds of tokens as context. So I think there should be the capability of decoding a test set of audio files where each audio file is at least several minutes long.

I guess by doing 2. one would get 1. for free as well.

Thank you for reading, Rudolf

torchaudio.sox_signalinfo_t is going to be deprecated.

Hi

torchaudio is performing an overhaul of its I/O and libsox bindings. The details can be found here.
As part of this plan, in the 0.9.0 release we plan to remove the libsox class from the API.

I looked at lhotse and it refers to torchaudio.sox_signalinfo_t, for example here.
It is not immediately clear to me how this can be migrated, but the torchaudio team is eager to help this project.
Please let us know what you think of torchaudio.

cc @dongreenberg @astaff @cpuhrsch @vincentqb

Source separation dataset improvements

  • return a mask in SourceSeparationDataset
  • a variant of LibriMix that uses pre-mixed audio files
  • add option for on-the-fly noise augmentation in SourceSeparationDataset
  • implement the correct method for overlaying spectrogram features
  • max speech separation mode - pad the shorter source to match the longer one
  • min speech separation mode - trim the mix to the length of the shorter recording

Distributed processing

Some time ago I was looking for alternatives to the run.pl/queue.pl style of distributing work on a cluster, hoping for something that would both interface nicely with SGE and allow staying 100% in Python at the same time. I discovered that dask is able to do that, and I was able to run some simple computations on the CLSP cluster this way. I know it also supports slurm and some other grid engines; we could have a "local" backend too.

The minimal code to achieve that looks more or less like this:

from dask_jobqueue import SGECluster
from dask.distributed import Client

def run_jobs(
        fn,
        inputs,
        jobs=1,
        memory='1GB',
        timeout_s='600',
        queue='all.q',
        proc_per_worker=1,
        cores_per_proc=1,
        env_extra=None,
        **kwargs
):
    if env_extra is None:
        env_extra = []
    with SGECluster(
            queue=queue,
            walltime=timeout_s,
            processes=proc_per_worker,
            memory=memory,
            cores=cores_per_proc,
            env_extra=env_extra  # e.g. ['export ENV_VARIABLE="SOMETHING"', 'source myscript.sh']
    ) as cluster:
        with Client(cluster) as client:
            cluster.scale(jobs)
            futures = client.map(fn, inputs)
            results = client.gather(futures)
    return results

and can be used like:

def process_item(x):
    pass  # do processing

items = [ ... ]

results = run_jobs(process_item, items)

I'd like to know if you'd be interested in giving it a try and exploring this option. @danpovey @jtrmal

Couple comments on reading

https://github.com/pzelasko/lhotse/blob/7555df605def57836c9454ae44aac95c504d86b0/lhotse/audio.py#L21

There should possibly be at least a TODO here to support getting a subset of channels. I'm thinking of scenarios like Switchboard. Since I think typical storage formats store the different channels interleaved, it may be just as efficient to read all of them and then select the needed ones. But there are scenarios where different channels are stored separately (AMI). We have to figure out at what level to do this.

Regarding the raw-PCM thing: I think if there is any common/universal format it would probably be wav, not raw pcm, as raw pcm is too easy to get wrong (no error if wrong format). I didn't really understand the AudioSource code well as your Python style is very modern... e.g. no init function.

Long-recording ASR recipe example

I'm thinking that the next step (after merging #24) is to create an example data preparation recipe for something conversational (e.g. SWBD), where we have 10 mins or more of continuous recordings. So far, I only tested the cut mechanism on LibriMix, where the recordings are already cut into short utterances, so the 'cutting' procedure was more simplistic than in a typical scenario we want to support.

We'd create Cuts from the transcribed segments by "padding" them with the actual acoustic context. These cuts could have different length "bins", e.g. 5, 10, 20 seconds so that we can construct the batches from utterances of similar lengths (and maybe scale the batch size accordingly). We'd also mix in some noise, e.g. MUSAN.

We'll need to provide the supervision for the ASR. Once K2 is ready, these will be K2 graphs. For now we can start with PyChain, or just provide a word/character sequence (e2e seq2seq style). We might actually tackle preparing the supervisions in a separate PR, as we might need to introduce bigger changes such as some kind of vocabulary manifest, etc.

[proposal+infoshare] apt/pip-like manager for speech dataset

Recently I came across this interesting project (https://github.com/activeloopai/Hub). It provides AI dataset (un)installation/versioning/upgrades, like what we do with pip/yum/apt for traditional software packages.

This is closely related to our current solution (those "download_and_untar_xxx.py" scripts + OpenSLR hosting); I believe there is something we can learn from it to enhance lhotse usability.

The hosting part may shift from OpenSLR to blob/object storage services (AWS/Azure/Aliyun) and so on, reducing the effort to maintain the OpenSLR data hosting server and protecting it from DDOS attacks (I've heard some ignorant guy in China spawned hundreds of downloading threads to OpenSLR that pulled down the whole kaldi website).

Say, an end K2 user would need only one line of Python or one bash command to get librispeech, something like:

lhotse.dataset_install('librispeech', 'v1.0', 'AWS', '/home/kaldi/database_warehouse/')
vinnie-the-pooh@ubuntu$ lhotse install librispeech

This is conceptually much more friendly to beginners, and behind the scenes, the essential work is pretty much the same as those download_and_untar_xxx.py scripts. With a bootstrap design, I believe contributors can make this better and easier over time.

Another benefit of doing this is that datasets MAY evolve; bringing centralized versioning and clear install/upgrade paths to speech dataset management may be a good thing in the future.

This functionality may also apply to other speech-related resources such as lexicons, vocabularies, text normalization rewrite grammars, standard benchmark test sets, and so on.

I know you guys are busy preparing the next-gen kaldi release, so the proposal is not urgent at all, just a thought to share :)

Progress bars for time-consuming operations

Lhotse's UX could be improved by leveraging some progress bar library (e.g. tqdm) in corpus downloading. Maybe it would also work for feature extraction. I'm not sure yet if any other operations would take that much time.

Validate manifest

We could use a python function validate_manifest and a corresponding CLI mode lhotse validate at some point to check data integrity (like Kaldi validate_data_dir.sh).
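
A sketch of how this might be used from Python, taking the names from the proposal above; the import path is hypothetical and the eventual API may differ, and the CLI counterpart would be lhotse validate.

from lhotse import CutSet
from lhotse import validate_manifest  # hypothetical import path for the proposed helper

cuts = CutSet.from_file("cuts.jsonl.gz")  # placeholder manifest path
validate_manifest(cuts)  # raise or warn when metadata is inconsistent,
                         # e.g. a supervision extends past the recording duration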

AudioSource interface

I get the impression that the AudioSource interface might assume that different channels will be located in different files? In general this won't be the case, often multiple channels will be interleaved in one file, like in the wav format.

@jtrmal please comment on this next, larger issue:

I think maybe the AudioSet interface, and its dependencies, are too concrete. Or at least there should be a more abstract version of it somewhere, that is intended to be overloaded. Issues include:

  • We might at some point want to have some kind of dynamically created AudioSet where we add noise and/or reverberation. We could use an AudioSet child class that does reverberation and takes the base AudioSet as a constructor arg. Similarly for duplicating and merging and speed-perturbing datasets.
  • The same consideration might apply to the recording stuff, namely that it is too concrete. We don't know what kinds of formats and reading methods, or what simulation methods, we or others might want to use in the future. I'm thinking that maybe we could somehow make it more changeable/overridable? I'm not sure what mechanism to suggest here. One possibility is to separate the metadata from the actual audio, so that the AudioSet has a mechanism whereby you can get metadata somehow, but the interface to retrieve the audio goes through AudioSet and just returns the float array in a standardized format without revealing much about where it came from.
  • With metadata: bear in mind that we may later need to add new metadata fields that we may not be anticipating now. So perhaps we could avoid making any decisions that will pin us down to only a specified set of fields. On the other hand, sometimes concreteness is a blessing, so I'm willing to argue about this point.

Also, having a default implementation of AudioSet is fine. But let's be clear what is the standard supported interface of AudioSet (and of its mechanism for supplying metadata) and what is just the default implementation.

[Sorry, I'm having trouble with GitHub interface, apologies for dupes, the following is a small unrelated issue: ]

It would be nice to have a comment here mentioning that DummySet contains everything:

https://github.com/pzelasko/lhotse/blob/7555df605def57836c9454ae44aac95c504d86b0/lhotse/audio.py#L77

Reading SPH files directly from Python

Current mechanism is very similar to how Kaldi does it - we can leverage external shell tools to pipe the result and read it in Lhotse.

To do that, one has to create a RecordingSet (recording manifest), where each Recording has an AudioSource of type ‘command’. Then, put something like sph2pipe file.sph - in there (the pipe symbol at the end is not needed), and when “load_audio()” is called somewhere in Lhotse, it will take care of spawning a subprocess and reading the data from it.
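
For reference, the 'command'-based mechanism described above looks roughly like this in Python; the id, path, flags, and duration below are illustrative values only.

from lhotse import Recording
from lhotse.audio import AudioSource

recording = Recording(
    id="sw02001-A",  # illustrative id
    sources=[
        AudioSource(
            type="command",
            channels=[0],
            source="sph2pipe -f wav -p -c 1 /path/to/file.sph",  # decodes to wav on stdout
        )
    ],
    sampling_rate=8000,
    num_samples=240_000,
    duration=30.0,
)
samples = recording.load_audio()  # spawns the subprocess and reads its output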

We could add support for reading SPH files directly in Python with https://pypi.org/project/sphfile/ (sounds like a good first issue).

FeatureSet - scope

I'm thinking about the FeatureSet, and I'm not sure what scope of operations we'd like to support in lhotse. We will use lilcom to load/store the feature matrices, but what about feature extraction? Should we just use something precomputed, e.g. with Kaldi, or also extract features on the fly at the FeatureSet API level? If the latter, we'll either need to use some other library (e.g. librosa) or delegate feature extraction to Kaldi by running it as a subprocess (unless there are some Python bindings available). I guess the same questions apply to data augmentation (we'll get to that after having something initial working for features and some example dataset represented in lhotse).

Of course, having the whole data augmentation + feature extraction pipeline as a part of lhotse would be more convenient in the long run. It'll just take longer to get there. @danpovey @jtrmal WDYT?

use `mp_context=multiprocessing.get_context("spawn")` in ProcessPoolExecutor will crash

With this PR k2-fsa/snowfall#5, I get the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/linuxbrew/.linuxbrew/opt/[email protected]/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/linuxbrew/.linuxbrew/opt/[email protected]/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/home/linuxbrew/.linuxbrew/opt/[email protected]/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/linuxbrew/.linuxbrew/opt/[email protected]/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/home/linuxbrew/.linuxbrew/opt/[email protected]/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/linuxbrew/.linuxbrew/opt/[email protected]/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/linuxbrew/.linuxbrew/opt/[email protected]/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/ceph-hw/snowfall/egs/librispeech/asr/simple_v1/prepare.py", line 47, in <module>
    cut_set = CutSet.from_manifests(
  File "/ceph-hw/lhotse/lhotse/cut.py", line 1319, in compute_and_store_features
    executor.submit(
  File "/home/linuxbrew/.linuxbrew/Cellar/[email protected]/3.8.6_1/lib/python3.8/concurrent/futures/process.py", line 645, in submit
    self._start_queue_management_thread()
  File "/home/linuxbrew/.linuxbrew/Cellar/[email protected]/3.8.6_1/lib/python3.8/concurrent/futures/process.py", line 584, in _start_queue_management_thread
    self._adjust_process_count()
  File "/home/linuxbrew/.linuxbrew/Cellar/[email protected]/3.8.6_1/lib/python3.8/concurrent/futures/process.py", line 608, in _adjust_process_count
Traceback (most recent call last):
  File "./prepare.py", line 47, in <module>
    cut_set = CutSet.from_manifests(
  File "/ceph-hw/lhotse/lhotse/cut.py", line 1328, in compute_and_store_features
    cut_set = CutSet.from_cuts(f.result() for f in futures)
  File "/ceph-hw/lhotse/lhotse/cut.py", line 989, in from_cuts
    return CutSet({cut.id: cut for cut in cuts})
  File "/ceph-hw/lhotse/lhotse/cut.py", line 989, in <dictcomp>
    return CutSet({cut.id: cut for cut in cuts})
  File "/ceph-hw/lhotse/lhotse/cut.py", line 1328, in <genexpr>
    cut_set = CutSet.from_cuts(f.result() for f in futures)
  File "/home/linuxbrew/.linuxbrew/Cellar/[email protected]/3.8.6_1/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/linuxbrew/.linuxbrew/Cellar/[email protected]/3.8.6_1/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
/home/linuxbrew/.linuxbrew/opt/[email protected]/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 10 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

BTW, as we pass spawn, it starts a fresh python interpreter process, which then outputs too many duplicate logs:
https://github.com/k2-fsa/snowfall/blob/7201fdebd18231df4c3a6a4c198e1d0a7d7c7d22/egs/librispeech/asr/simple_v1/prepare.py#L17-L21
This is a little bit annoying; it would be great if you could fix that together with the error above, but it's not urgent.

Fbank energies are much greater in Kaldi than torchaudio

Relevant part of discussion in the first feature extraction PR:

As to the large 0th MFCC coefficient, the plot thickens. When I compared the Kaldi and lhotse (torchaudio) output for the same file, but using log-mel energies instead - the output still has the same shape, but Kaldi energies are way larger, see the image:

(figure: comparison of Kaldi and lhotse/torchaudio log-mel energies for the same file)

The Kaldi result can be recreated by: steps/make_fbank.sh --fbank-config librifbank.conf --nj 1 libridata fbank_libri fbank_libri

where librifbank.conf has:

--dither=0
--sample-frequency=16000

and the wav.scp in libridata is:

rec1 /path/to/lhotse/test/fixtures/libri-1088-134315-0000.wav

Then, to analyse it in Python, I read it like this:

import kaldi_io
fbank_kaldi = list(kaldi_io.read_mat_scp('/path/to/fbank_libri/raw_fbank_libridata.1.scp'))[0][1]

Paranoid check with copy-feats ark:fbank_libri/raw_fbank_libridata.1.ark ark,t:- yields the same output:

rec1  [
  7.189558 9.037251 10.99918 11.80182 13.34635 14.90742 14.93212 16.49876 16.68227 16.07505 17.88986 17.97858 18.57075 17.19065 16.88927 17.6966 17.28112 17.44429 16.39793 15.73682 15.18271 14.46821 13.25515
  8.695339 10.86374 11.10041 11.28711 12.6426 13.6858 15.10104 14.55601 15.04671 15.78543 17.72975 17.83109 17.31431 16.33819 16.76282 17.25342 17.55506 17.83777 16.99071 15.81893 15.01264 13.81689 13.02168
  8.378333 9.902431 10.18937 10.38635 12.07959 13.96713 15.26997 15.65272 15.58296 15.4958 18.21008 17.68359 17.81999 16.76442 16.32026 17.28751 17.46375 17.26385 16.55256 15.46311 14.84257 13.89831 13.44193
(...)

Any ideas? Anyway, I think we can merge this one - let me know once you review.

Originally posted by @pzelasko in #10 (comment)

Directory structure of features

I notice that by default lhotse seems to put all the .llc files in one directory.
On many (maybe even most) file systems this scales very poorly, because you spend too much time manipulating the meta-info of the directory - particularly when writing.
I think a better default would be to use the first few characters of the hash as a sub-directory name.
I was thinking of making this dependent on how much data there is to process, but I'm concerned people will write their own scripts that assume a certain directory structure, and then get surprised when things are not as they expect.
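
For illustration, the kind of sub-directory scheme suggested above might look like this (an assumption about the proposal, not Lhotse's actual layout):

from pathlib import Path

def feature_storage_path(feature_id: str, root: str = "feats") -> Path:
    # Shard by the first three characters of the (hash-like) id to keep
    # any single directory from growing too large.
    return Path(root) / feature_id[:3] / f"{feature_id}.llc"

print(feature_storage_path("a1b2c3d4e5f6"))  # feats/a1b/a1b2c3d4e5f6.llc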

Map-style vs Iterable-style Dataset

I want to start a discussion to determine what's the right way to go forward with data loading efficiency and flexibility for Lhotse PyTorch datasets.

PyTorch documentation for dataset API describes two types of datasets - map-style and iterable-style. Short summary:

Map-style:

for index in sampler:
    sample = dataset[index]
  • loads a single item at a time
  • DataLoader samples indices of the items to be loaded (using a Sampler) based on len() of the dataset
  • DataLoader specifies a fixed batch size and takes care of collation (possibly with a custom collate_fn)

Iterable-style:

for batch in iter(dataset):  # batch/sample depending on the implementation
    pass
  • may load multiple items at the same time
  • has to take care of collation, shuffling, batch size (which can be dynamic), etc. itself
  • does not have to specify len()
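
For illustration, a minimal iterable-style dataset that forms dynamic-size batches by total duration might look like this; it is a sketch of the paradigm, not a proposed Lhotse API.

import torch

class DynamicBatchCutDataset(torch.utils.data.IterableDataset):
    """Yields lists of cuts whose total duration stays under a budget."""

    def __init__(self, cuts, max_duration: float = 300.0):
        self.cuts = cuts              # any iterable of Lhotse cuts
        self.max_duration = max_duration

    def __iter__(self):
        batch, total = [], 0.0
        for cut in self.cuts:
            if batch and total + cut.duration > self.max_duration:
                yield batch           # dynamic batch size
                batch, total = [], 0.0
            batch.append(cut)
            total += cut.duration
        if batch:
            yield batch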

Comments

Our current examples are based on the map-style paradigm, which offloads more things to the DataLoader default settings. I'm wondering if we should explore the iterable-style dataset more going forward. It seems like it might make it easier to work with sharded data, or support non-random-access reading (e.g. block-random sampling for batches). I think it'd be good to start collecting some insights, experiences and requirements to make sure the design is solid.

Another question is should Lhotse concern itself with that, and if so, to what extent? I don't think we'll be able to provide an optimal solution for every use case, but I think it'd be good to provide an option that (perhaps with some tuning) is at least okay in most compute infrastructures. We'll also gain more insights regarding that from building and running K2 recipes.

Add MUSAN data preparation

We'll need to:

  1. Download MUSAN
  2. Create separate recording manifests for the subsets: noise, music, babble. I don't think we need a supervision manifest.

It should be done similarly to how mini_librispeech and librimix are being prepared.

Support PyChain graph supervision for ASR dataset

Some things to think about:

  • Does Lhotse need to have PyChain as a dependency? Probably yes. But we should probably take care to use it only with local imports so that users not interested in PyChain do not have to install it.
  • Can we assume that PyChain graphs are created outside of Lhotse, and we are just referencing an archive?
  • We'll need to extend the SupervisionSegment to support these graphs - we'll probably need to add two (optional) fields, like graph_archive_path and graph_archive_key?

Let's see what Yiwen thinks.

Example data preparation recipes

We should start creating example recipes for some data sets and tasks. I'll post an initial list here, and we can modify or extend it based on discussions. I'll sort it by the level of implementation difficulty.

Source separation

  • (mini) LibriMix

Speech enhancement

Should be fairly simple to achieve with source separation in place, maybe it's even already possible.

  • (mini) LibriMix

Speaker identification

This could be a simpler first Dataset to build, with supervision provided by SupervisionSegment, than ASR.

  • ? VoxCeleb 1/2 (widely studied dataset)

TTS

  • LJSpeech

ASR

  • (mini) Librispeech (widely studied dataset)
  • ? Wall Street Journal (widely studied dataset)
  • 1997 Broadcast News
  • TED-LIUM v3
  • Switchboard (conversational, two-channel)
  • AMI (conversational, multi-channel)
  • ? CHiME 6 (conversational, multi-channel)

Wake word detection

  • Mobvoi HotWords

Feel free to propose any other tasks and datasets.

CLI integration/regression tests

We're currently running only unit tests, which prevent regressions to a limited extent. We could use a separate top-level test directory, e.g. integration_tests, which tests the Lhotse CLI. Ideally, it would also be pytest-based (it could run the tests with subprocess.run).

We should wait a bit before carrying this one out, though - let's implement a few example recipes first to get a bit closer to a stable CLI.
