Comments (3)
@pyf98, can you answer it?
from espnet.
Hi, thanks for the question!
For LibriSpeech, I do not use the standard segmented version. Instead, I used the "original-mp3" subset. I believe it is released along with the segmented version; you might need to check the original source of the LibriSpeech distribution.
Here are some relevant paragraphs from the README file in LibriSpeech.
2. Structure
============
The corpus is split into several parts to enable users to selectively download
subsets of it, according to their needs. The subsets with "clean" in their name
are supposedly "cleaner" (at least on average) than the rest of the audio, and
US English accented. That classification was obtained using very crude automated
means, and should not be considered completely reliable. The subsets are
disjoint, i.e. the audio of each speaker is assigned to exactly one subset.
The parts of the corpus are as follows:
* dev-clean, test-clean - development and test set containing "clean" speech.
* train-clean-100 - training set, of approximately 100 hours of "clean" speech
* train-clean-360 - training set, of approximately 360 hours of "clean" speech
* dev-other, test-other - development and test set, with speech which was
automatically selected to be more "challenging" to
recognize
* train-other-500 - training set of approximately 500 hours containing speech
that was not classified as "clean", for some (possibly wrong)
reason
* intro - subset containing only LibriVox's intro disclaimers for some of the
readers.
* mp3 - the original MP3-encoded audio on which the corpus is based
* texts - the original Project Gutenberg texts on which the reference transcripts
for the utterances in the corpus are based.
* raw_metadata - SQLite databases which record various pieces of information about
the source text/audio materials used, and the alignment process.
(mostly for completeness - probably not very interesting or useful)
2.3 Organization of the "original-mp3" subset
---------------------------------------------
This part contains the original MP3-compressed recordings as downloaded from the
Internet Archive. It is intended to serve as a secure reference "snapshot" for
the original audio chapters, but also to preserve (most of) the information
about both the audio selected for the corpus and the audio that was discarded.
I decided to try to make the corpus relatively balanced in terms of per-speaker
durations, so part of the audio available for some of the speakers was
discarded. Also, for the speakers in the training sets, only up to 10 minutes
of audio is used, to introduce more speaker diversity during evaluation time.
There should be enough information in the "mp3" subset to enable the re-cutting
of an extended "LibriSpeech+" corpus, containing around 150 extra hours of
speech, if needed.
The directory hierarchy follows the already familiar pattern. In each
speaker directory there is a file named "utterance_map", which lists, for each
of the utterances in the corpus, the original "raw" aligned utterance.
In the "header" of that file there are also two lines that show whether the
sentence-aware segmentation was used in the LibriSpeech corpus (i.e. whether the
reader is assigned to a test set) and the maximum allowed duration for
the set to which this speaker was assigned.
Then in the chapter directory, besides the original audio chapter .mp3 file,
there are two sets of ".seg.txt" and ".trans.txt" files. The former contain
the time range (in seconds) for each of the original (what I called "raw" above)
utterances. The latter contain the respective transcriptions. There are two
sets for the two possible segmentations of each chapter. The ".sents"
segmentation is "sentence-aware", that is, we only split on silence intervals
coinciding with (automatically obtained) sentence boundaries in the text.
The other segmentation was derived by allowing splitting on every silence
interval longer than 300 ms, which leads to better utilization of the aligned
audio.
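To use the "original-mp3" subset you essentially pair each chapter's ".seg.txt" time ranges with its ".trans.txt" transcriptions. A minimal sketch of that pairing is below; the exact column layout (utterance id, then start and end times in seconds, and id-prefixed transcript lines) is an assumption based on the README's description, not a documented spec, so check a real file before relying on it.

```python
# Sketch: join a chapter's ".seg.txt" (time ranges) with its ".trans.txt"
# (transcriptions) from the LibriSpeech "original-mp3" subset.
# ASSUMPTION: seg lines look like "<utt_id> <start_s> <end_s>" and trans
# lines look like "<utt_id> <TRANSCRIPT>"; verify against an actual chapter.

def parse_chapter(seg_text: str, trans_text: str):
    """Return a list of (utt_id, start_s, end_s, transcript) tuples."""
    # Map utterance id -> (start, end) in seconds.
    segments = {}
    for line in seg_text.strip().splitlines():
        utt_id, start, end = line.split()
        segments[utt_id] = (float(start), float(end))

    # Attach the transcript for each segmented utterance.
    utterances = []
    for line in trans_text.strip().splitlines():
        utt_id, transcript = line.split(maxsplit=1)
        start, end = segments[utt_id]
        utterances.append((utt_id, start, end, transcript))
    return utterances
```

With the (start, end) ranges in hand, the corresponding spans can then be cut from the chapter .mp3 with any audio tool (e.g. ffmpeg or torchaudio) to rebuild segmented utterances.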
thanks
from espnet.