m-bain / whisperX
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
License: BSD 4-Clause "Original" or "Old" License
Thank you for this repo, first of all!
When I tried to align an audio file, I ran into this error.
File "/home/ubuntu/.conda/envs/whisper/lib/python3.9/site-packages/whisperx/alignment.py", line 67, in backtrack
    raise ValueError("Failed to align")
ValueError: Failed to align
Any idea to fix this?
Currently only English translation is supported.
How can I do translation in other languages?
Hi,
Thanks for sharing this work. You wrote that it still needs testing... can I test it in French 😉?
I am not sure what I should change. I saw that the wav2vec2 model can be passed in as a parameter (see the README), but in the code there are some hardcoded pipelines referring to the English model. For French there is a wav2vec2 model (which I have never tested, since I was relying on Whisper only).
Looking forward to testing this in French!
I tried to run WhisperX on a file that starts with non-speech: the initial 13 seconds of the file are non-speech.
The results WhisperX gave me ignore that silent time and start the first word at 00:00 instead of 00:14.
Any idea?
@m-bain Thank you for whisperX!
In some audio files, I get IndexError: list index out of range.
The error happens at the alignment stage.
Traceback (most recent call last):
  File "/home/syoyo/miniconda3/envs/whisperx/bin/whisperx", line 8, in <module>
    sys.exit(cli())
  File "/home/syoyo/miniconda3/envs/whisperx/lib/python3.8/site-packages/whisperx/transcribe.py", line 505, in cli
    result_aligned = align(result["segments"], align_model, align_metadata, audio_path, device,
  File "/home/syoyo/miniconda3/envs/whisperx/lib/python3.8/site-packages/whisperx/transcribe.py", line 374, in align
    word_segments_list[-1]['text'] += ' ' + curr_word
IndexError: list index out of range
https://commonvoice.mozilla.org/en/datasets
Download Japanese -> Common Voice Corpus 12.0
$ whisperx --model large --language ja cv-corpus-12.0-2022-12-07/ja/clips/common_voice_ja_35797612.mp3
whisperX/whisperx/transcribe.py
Line 382 in 2aa074e
Here is the dump of t_local and t_words before the for x in range(len(t_local)): loop.
t_local = [None, None, (0.842688, 1.083456), (1.424544, 1.5048), (1.524864, 1.544928), (1.665312, 1.7455679999999998), (1.7656319999999999, 1.785696), (1.865952, 2.026464), (2.046528, 2.186976), (2.20704, 2.2271039999999998), (2.247168, 2.327424), (2.347488, 2.367552), (2.427744, 2.508), (2.548128, 2.568192), (2.688576, 2.969472), None, (3.089856, 3.10992), (3.129984, 3.2102399999999998), (3.31056, 3.8322239999999996), (3.852288, 3.9325439999999996), (4.052928, 4.072992), (4.133184, 4.2535680000000005), (4.333824, 4.434144), (4.494336, 4.715039999999999), (4.735104, 4.755168), (4.775232, 5.196576), (5.31696, 5.457407999999999), (5.597856, 5.678112), (5.738303999999999, 5.81856), (5.838624, 5.898815999999999), (5.9790719999999995, 6.0793919999999995), (6.159648, 6.199776), (6.21984, 6.280032), (6.360288, 6.460608000000001), (6.480671999999999, 6.5207999999999995), (6.641183999999999, 6.761568), (6.781632, 6.801696), None]
t_words = ['童', '貞', '助', 'か', 'ら', 'な', 'い', 'と', '思', 'っ', 'て', 'い', 'る', 'か', 'ら', '、', 'い', 'る', 'と', 'う', 'っ', 'と', 'さ', 'り', 'っ', 'と', '音', 'が', 'し', 'て', '目', 'か', 'ら', '火', 'が', '出', 'た', '。']
Probably we need to consider a situation where t_local may contain multiple Nones from the start?
Thank you very much.
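A minimal sketch of the kind of guard that could handle leading Nones, assuming t_local and t_words are the parallel lists dumped above (illustrative only, not the actual whisperx fix):

# t_local / t_words as dumped above: timings may be None, including at the start.
t_local = [None, None, (0.842688, 1.083456), (1.424544, 1.5048)]
t_words = ["童", "貞", "助", "か"]

word_segments_list = []
for timing, word in zip(t_local, t_words):
    if timing is None:
        # No alignment for this character: attach it to the previous timed
        # word if one exists, otherwise skip it instead of indexing [-1] blindly.
        if word_segments_list:
            word_segments_list[-1]["text"] += word
        continue
    start, end = timing
    word_segments_list.append({"text": word, "start": start, "end": end})
print(word_segments_list)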
What a great tool! Is it somehow possible to use also a version of Whisper that has been fine-tuned? I have one model trained with transformers on Hugging Face.
Thanks.
I have cloned WhisperX and I'm getting an error about the torch requirement. There seems to be a syntax error in the requirement "torch (>=1.8.*)"; it should probably be replaced with "torch >=1.8.0", or "torch >=1.8,<1.9" if torch 1.9 also works.
pkg_resources.extern.packaging.requirements.InvalidRequirement: Expected closing RIGHT_PARENTHESIS - torch (>=1.8.*)
I'm also interested in learning how to add parameters in python.
--vad_filter
can sometimes cause GPU OOM because the input segments are too large.
VAD segments that are too long need to be divided and cut up into smaller segments.
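A minimal sketch of what such splitting could look like, assuming the VAD output is a list of (start, end) tuples in seconds; the 30-second cap and the function name are illustrative, not whisperx's actual implementation:

def split_long_segments(segments, max_duration=30.0):
    # Cap each speech segment at max_duration seconds so no single chunk
    # blows up GPU memory during transcription/alignment.
    out = []
    for start, end in segments:
        while end - start > max_duration:
            out.append((start, start + max_duration))
            start += max_duration
        out.append((start, end))
    return out

# Example: a 75 s segment becomes three chunks of at most 30 s.
print(split_long_segments([(0.0, 75.0)]))
# [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]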
Hi,
I was trying out your package, it seems like a pretty useful addition to whisper.
I was wondering if you had any plans to add word-level confidences / logprobs, e.g. similar to this ticket?
openai/whisper#284
Thanks!
You provide installation instructions. It would be nice to also mention how to update in the same area of the README, so that people who know enough to follow the install command are also given the information to update (even if that is just to run the same command again).
Hi, do I just find a compatible Wav2Vec2 model for the language and add it to the models list, or are any other steps needed? Also, are there any modifications needed for RTL languages such as Arabic?
I keep getting this error after the new commit to Pandas:
File "//whisper.py", line 47, in whisper
result_aligned = whisperx.align(segments, model_a, metadata, audio_file, device)
File "/usr/local/lib/python3.9/site-packages/whisperx/alignment.py", line 309, in align
word_segments_arr["segment-text-start"] = per_word_grp["level_1"].min().reset_index()["level_1"]
File "/usr/local/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 1416, in getitem
return super().getitem(key)
File "/usr/local/lib/python3.9/site-packages/pandas/core/base.py", line 248, in getitem
raise KeyError(f"Column not found: {key}")
KeyError: 'Column not found: level_1'
The generated .ass is giving constant full segments with occasional word highlighting. How can I make it so the segments are joined for a longer on-screen session?
ty @m-bain, looking forward to a sweet built-in diarization implementation! This will also be very useful for quick preview of audio sound clips when searching.
I would like to see if the above question is easily achievable, because my eyes follow the scrolling of the auto-generated captions on YouTube videos, similar to RSVP (https://en.wikipedia.org/wiki/Rapid_serial_visual_presentation), and I only watch videos for that reason. It is sort of a speed-reading tool, but instead of presenting one word at a time, which can be nauseating, a more relaxed method is possible where you can look back at older text.
The generated .ass files can be viewed alongside audio files in mpv on Android 12 when renaming them to .srt.
A display that can hold more text than 2 lines (unlike youtube) would be ideal.
Hi, I am using this, but it doesn't seem to release the GPU VRAM. I am using the large-v2 model and it's using 16 GB of GPU RAM, which I don't think is normal at all; moreover, it doesn't free up the RAM afterwards. Is this normal?
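Not a whisperx-specific answer, but the usual PyTorch pattern for releasing VRAM once you are done with a model is roughly the following (the load call mirrors the Python usage shown later on this page; note the CUDA context itself still reserves some memory until the process exits):

import gc
import torch
import whisperx

device = "cuda"
model = whisperx.load_model("large-v2", device)
# ... run transcription ...

# Drop every reference to the model, then ask PyTorch to release cached blocks.
del model
gc.collect()
torch.cuda.empty_cache()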
I saw that support for portuguese was added a few commits ago and decided to give it a go. But when loading the align model this error happens:
ValueError: The chosen align_model "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese" could not be found in huggingface (https://huggingface.co/models) or torchaudio (https://pytorch.org/audio/stable/pipelines.html#id14)
I assume that the resulting segments resemble the black lines in the image. In the image, the segments have been lengthened at the beginning (red line), and at the end (green line).
Is it possible to add parameters for the optional lengthening of all the resulting segments at the beginning and at the end?
[parameter 1:] The red line is an example of the result of a parameter to lengthen all segments at the beginning (in milliseconds).
[parameter 2:] The green line is an example of the result of a parameter to lengthen all segments at the end (in milliseconds).
[parameter 3:] A third parameter is for the minimum distance (in milliseconds) between the end of a (lengthened) segment and the beginning of the next (lengthened) segment. This third parameter may overrule the other two parameters to prevent a collision.
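A minimal sketch of what such a post-processing step could look like, with the three proposed parameters in milliseconds (the parameter names and the rule of splitting a collision at the midpoint are illustrative assumptions, not whisperx behaviour):

def pad_segments(segments, pad_start_ms=0, pad_end_ms=0, min_gap_ms=0):
    # segments: list of dicts with "start"/"end" in seconds, sorted by start.
    pad_start = pad_start_ms / 1000.0
    pad_end = pad_end_ms / 1000.0
    min_gap = min_gap_ms / 1000.0
    padded = [dict(seg, start=max(0.0, seg["start"] - pad_start), end=seg["end"] + pad_end)
              for seg in segments]
    # Parameter 3: if a lengthened segment would collide with the next one,
    # trim both around the midpoint so at least min_gap remains between them.
    for prev, nxt in zip(padded, padded[1:]):
        if nxt["start"] - prev["end"] < min_gap:
            midpoint = (prev["end"] + nxt["start"]) / 2.0
            prev["end"] = max(prev["start"], midpoint - min_gap / 2.0)
            nxt["start"] = min(nxt["end"], midpoint + min_gap / 2.0)
    return padded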
It would be nice if there was a flag to pass an SRT or other text file directly to the alignment, skipping the transcription step.
E.g. sometimes I already have a ground-truth transcription and I only want to align it to the audio.
Is this possible?
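In the meantime, this looks doable from Python by skipping transcription and calling the align API directly (the same calls that appear later on this page); the segments only need "text" plus rough "start"/"end" values that bracket each line, e.g. taken from your SRT cues. A minimal sketch, assuming an English recording (file names and timings are placeholders):

import whisperx

device = "cuda"
audio_file = "audio.wav"

# Ground-truth transcript with rough timings, e.g. parsed from an SRT file.
segments = [
    {"text": "Hello and welcome to the show.", "start": 0.0, "end": 4.0},
    {"text": "Today we talk about forced alignment.", "start": 4.0, "end": 9.5},
]

model_a, metadata = whisperx.load_align_model(language_code="en", device=device)
result_aligned = whisperx.align(segments, model_a, metadata, audio_file, device)
print(result_aligned)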
Hi @m-bain,
This is a very cool repository and definitely useful for getting more reliable and accurate timestamps for the generated transcriptions.
I was wondering if you'd like to extend the current transcription codebase to also support transformers fine-tuned Whisper checkpoints.
For context, we recently ran a Whisper fine-tuning event powered by 🤗 transformers and over the course of the event we managed to fine-tune 650+ Whisper checkpoints, across 112 languages. You can find the leaderboard here: https://huggingface.co/spaces/whisper-event/leaderboard
In almost all cases the fine-tuned models beat the original Whisper model's zero-shot performance by a huge margin.
I think it'll be of huge benefit for the community to be able to utilise these models with your repo. Happy to support you if you have any questions from the 🤗 transformers side. :)
Cheers,
VB
Thanks for this wonderful effort. Do you have any plans to enable Python code usage (instead of CLI)?
I noticed that when I transcribe videos the subtitles aren't displayed anymore. Apparently the start and end timestamps are much closer together than before. I noticed this for all my new transcriptions.
This is a comparison between a transcription I did with the same settings in December vs now
red: old transcription
green: new transcription
--- <unnamed>
+++ <unnamed>
@@ -1,55 +1,55 @@
WEBVTT
-00:06.854 --> 00:09.218
+00:06.854 --> 00:06.874
REDACTED
-00:10.747 --> 00:10.990
+00:10.747 --> 00:10.869
REDACTED
-01:30.038 --> 01:31.039
+01:30.038 --> 01:30.098
REDACTED
-01:32.560 --> 01:36.980
+01:32.560 --> 01:32.600
REDACTED
-01:37.100 --> 01:37.685
+01:37.100 --> 01:37.141
REDACTED
-01:39.860 --> 01:42.014
+01:39.860 --> 01:39.900
REDACTED
-01:42.960 --> 01:43.124
+01:42.960 --> 01:43.062
REDACTED
-01:44.538 --> 01:44.620
+01:44.538 --> 01:44.559
REDACTED
-01:45.820 --> 01:47.458
+01:45.820 --> 01:45.861
REDACTED
-01:47.680 --> 01:49.620
+01:47.680 --> 01:47.741
REDACTED
-01:49.660 --> 01:51.518
+01:49.660 --> 01:49.761
REDACTED
-01:54.140 --> 01:57.878
+01:54.140 --> 01:54.301
REDACTED
-01:58.400 --> 02:03.240
+01:58.400 --> 01:58.541
REDACTED
-02:18.761 --> 02:20.514
+02:18.761 --> 02:18.781
REDACTED
-02:21.280 --> 02:23.316
+02:21.280 --> 02:21.381
REDACTED
-02:25.300 --> 02:27.217
+02:25.300 --> 02:25.361
REDACTED
-02:27.600 --> 02:32.379
+02:27.600 --> 02:27.620
REDACTED
-02:32.840 --> 02:34.756
+02:32.840 --> 02:32.921
REDACTED
I wanted to use WhisperX to do forced alignment on the Mozilla Common Voice German dataset, but the words are often cut off or the segments do not align at all.
Additionally, some audio tracks are recognized as Farsi instead of German.
Is it because of the short duration of these clips (under 2-5 seconds each)?
And how can I improve the accuracy?
Is the accuracy of the English models (for English audio) better?
I think we need to explore multilingual models such as wav2vec2-xls-r-300m-21-to-en to see if the 300M models are better than the XLSR-53 models currently used for low-resource languages, and to see if we could use a single model for multilingual alignment.
I have the code written but not thoroughly tested, so I might share it after #53 is merged, but I wanted to hear your thoughts about this first.
hello team, thank you for this project!
Just wondering, is there a possibility to incorporate whisper mic for real-time processing?
Hi,
I get this error message when trying to run whisperx with the latest features (VAD and parallel processing).
I have already upgraded to the latest version with all the requirements satisfied...
Any clue as to what is missing/wrong in my installation?
Thanks&Regards,
I tried to translate from Japanese to English using whisperx.
Exactly one entry is missing the word-level timestamps in its alignment dict, causing utils.write_ass to fail.
Alignment output entry:
{ "id": 654, "seek": 518644, "start": 5198.74, "end": 5198.7404, "text": " Shizu", "tokens": [ 1160, 590, 84 ], "temperature": 1, "avg_logprob": -4.464975124452172, "compression_ratio": 1.0123456790123457, "no_speech_prob": 0.07407991588115692 },
I'm trying it out in my code alongside WhisperX, might want it built-in.
If someone speaks for too long, the whole text appears across the entire screen for that timeframe, so is there any option to cut it into pieces?
Failed to align segment: no characters in this segment found in model dictionary, resorting to original...
I was trying to align an audio file, but it didn't work and gave the above error. It was a plain English .wav file.
Thank you for your work, but unless one also does 'pip install soundfile' during the installation, they will get an error during initialization of WhisperX.
I would suggest adding that step to the README, as even though it doesn't stop operation, it does throw a warning.
Great work!
Is it possible to expand this to phoneme-level timestamps, instead of word-level timestamps?
For example, instead of
"[00:13.50->00:13.60] smiles"
have
"[00:13.50->00:13.53] s
[00:13.53->00:13.57] mi
[00:13.57->00:13.58] l
[00:13.58->00:13.60] es"
Currently, Arabic numerals and symbols in the Whisper transcript cannot be aligned; they need to be in phonetic alphabet form.
We need to perform the inverse of the normalization in https://github.com/m-bain/whisperX/blob/main/whisperx/normalizers/english.py,
such that numbers and currencies are converted to their phonetic word form.
E.g.
"$300" -> "three hundred dollars"
To perform wav2vec alignment.
Then convert back to symbol form, and assign timestamps.
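A minimal illustration of that round trip, using the third-party num2words package for the number-to-words step; the bookkeeping for restoring the original symbols afterwards is an assumption about how the restore step could work, not whisperx code:

import re
from num2words import num2words  # pip install num2words

def spell_out_currency(text):
    # Replace e.g. "$300" with "three hundred dollars" and remember the original
    # so it can be restored after wav2vec2 alignment assigns timestamps.
    replacements = []
    def repl(match):
        spoken = num2words(int(match.group(1))) + " dollars"
        replacements.append((spoken, match.group(0)))
        return spoken
    return re.sub(r"\$(\d+)", repl, text), replacements

phonetic, mapping = spell_out_currency("It costs $300 today.")
print(phonetic)  # It costs three hundred dollars today.
print(mapping)   # [('three hundred dollars', '$300')] -> restore after alignment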
I can use --diarize alone and it works, but if I add --vad_filter I get this error.
huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'pyannote\segmentation'.
Which didn't make sense to me, since the hf_token is only validated when diarizing. Aaaaaaand it's here, while writing this, that I just realized that the segmentation model is used to improve the diarization and is another repository.
Also, a simple cross-talking answer like "yes" or "no" is often under 300 ms long. Pyannote only accurately detects a speaker change about every 2 seconds of spoken audio, which is almost 7 times coarser than what is needed if we want to use as little "logic" (as I call it) as possible.
"Logic": Whisper almost always detects when a new sentence is needed, including when it is an answer to a question. So, taking the timestamps of the spoken word(s) and rerunning the diarization, but with empty space at the beginning (or, as I do, doubling the length of the audio file and then dividing the times), will in about 90% of cases treat it as a different speaker than the one who spoke before. But it still can't accurately detect who is talking and will often just create a new speaker, e.g. Speaker_03, even if only two people were talking. This means there are a lot of annoying but simple steps of listening to each unknown speaker and labelling them correctly. I would say that with my method the accuracy of guessing the speaker in the stretched audio is around 50%, so not the worst.
I also briefly checked a few stats from NVIDIA NeMo diarization/segmentation, and it seems to be a tiny bit better at handling speaker switches. But accuracy drops drastically when approaching the 0.5-1 second mark (40%, I think I saw). Then again, maybe the same method of stretching the audio would help a bit, like it does with Pyannote. The downside of NVIDIA NeMo is that it can only be trained on GPUs that support CUDA, or maybe it just only runs well on CUDA GPUs; I'm not sure and haven't tried, as I only have an RX 580 8GB and a Ryzen 2700X.
Also, a maximum segment length in characters would be nice, as once you get past about 70 characters it begins to fill a lot of the subtitle area. Like this port that uses --max-len "characters" to limit it very well; see the sketch at the end of this message.
Now, I would not call myself a programmer; I just play around with things and cross my fingers that I made them work. I will look into it myself and maybe make a pull request if I can improve or implement something. 😃
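As a sketch of the --max-len idea above: given whisperx's word-level timestamps, a greedy character-count split could look roughly like this (the word dict keys mirror the outputs discussed on this page but are still an assumption):

def split_by_chars(word_segments, max_chars=70):
    # word_segments: ordered list of {"text", "start", "end"} word entries.
    # Greedily groups words into subtitle segments of at most max_chars characters.
    segments, current = [], []
    def flush():
        segments.append({
            "text": " ".join(w["text"] for w in current),
            "start": current[0]["start"],
            "end": current[-1]["end"],
        })
    for word in word_segments:
        candidate = " ".join(w["text"] for w in current + [word])
        if current and len(candidate) > max_chars:
            flush()
            current = []
        current.append(word)
    if current:
        flush()
    return segments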
I'm trying to use whisperX on the Korean language and came across some issues with how to do so. Since there's no default model, I went over to HF to find a model to use. As expected, there are many models and I'm not too sure which one to choose, especially because some of the recent models trained by slplab lack evaluation results. I'm also not really sure what they, or any of the other models, are evaluated on. Personally I am not familiar with wav2vec2 or the evaluation benchmarks it uses, let alone what these other trained models are benchmarked against.
Let me get back on topic: does it matter if the model I select was fine-tuned on wav2vec2-large-xlsr-53? I'm asking because the current default Japanese model was fine-tuned on it and is now the default. For Chinese, another model fine-tuned on wav2vec2-large-xlsr-53 was again selected according to #7.
Do I have to choose a model like slplab/wav2vec2-large-xlsr-53-korean-samsung-54k or could I use thisisHJLee/wav2vec2-large-xls-r-1b-korean-sample5?
The diarize version just puts the output in an .ass file at the moment, as far as I can see. Is it possible in the current version to get the diarized output in a txt file?
Been trying this out with KBLab/wav2vec2-large-voxrex-swedish for Swedish, and while I lack the hardware for extensive testing atm, it seems to be working fine.
pyannote.audio brings some great features, but it is rightfully guarded in code since it requires additional credentials. Additionally, it appears to be currently incompatible with the newest Macs, meaning the install fails when it is included. The recommendation here would be to make the additional features it brings an extra instead of part of the default install. Maybe just add:
#added to the setup() call in setup.py
#additionally, it should be removed from the requirements.txt
#I've never been good at naming things.
extras_require={"pyannote": ["pyannote.audio"]},
Alternatively, the code could remove it from the setup entirely and detect, and gracefully handle, when it isn't installed and VAD or diarization is called. This may be preferred, since decoupling the diarization/VAD implementation could be useful in the future when there are multiple valid options.
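And the graceful-degradation side could be as simple as an import guard on the VAD/diarization code path (an illustrative sketch; the helper name and error message are made up):

# At the top of the VAD / diarization code path:
try:
    from pyannote.audio import Pipeline  # optional dependency
except ImportError:
    Pipeline = None

def load_vad_pipeline(*args, **kwargs):
    if Pipeline is None:
        raise RuntimeError(
            "VAD/diarization requires the optional 'pyannote' extra: "
            "pip install whisperx[pyannote]"
        )
    return Pipeline.from_pretrained(*args, **kwargs)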
Hi,
I couldn't find an align model for Chinese in "torchaudio.pipelines". Does it support Chinese?
Thanks
Hi @m-bain. Thank you so much for your amazing work!
I wanted to test your new VAD feature, but I get the following error:
Traceback (most recent call last):
File "/usr/local/bin/whisperx", line 8, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.8/dist-packages/whisperx/transcribe.py", line 451, in cli
result = transcribe_with_vad(model, audio_path, vad_pipeline, temperature=temperature, **args)
File "/usr/local/lib/python3.8/dist-packages/whisperx/transcribe.py", line 310, in transcribe_with_vad
vad_segments = vad_pipeline(audio)
File "/usr/local/lib/python3.8/dist-packages/pyannote/audio/core/pipeline.py", line 238, in __call__
return self.apply(file, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pyannote/audio/pipelines/voice_activity_detection.py", line 197, in apply
segmentations: SlidingWindowFeature = self._segmentation(file)
File "/usr/local/lib/python3.8/dist-packages/pyannote/audio/core/inference.py", line 328, in __call__
waveform, sample_rate = self.model.audio(file)
File "/usr/local/lib/python3.8/dist-packages/pyannote/audio/core/io.py", line 278, in __call__
waveform, sample_rate = torchaudio.load(file["audio"])
File "/usr/local/lib/python3.8/dist-packages/torchaudio/backend/soundfile_backend.py", line 205, in load
with soundfile.SoundFile(filepath, "r") as file_:
File "/usr/local/lib/python3.8/dist-packages/soundfile.py", line 629, in __init__
self._file = self._open(file, mode_int, closefd)
File "/usr/local/lib/python3.8/dist-packages/soundfile.py", line 1183, in _open
_error_check(_snd.sf_error(file_ptr),
File "/usr/local/lib/python3.8/dist-packages/soundfile.py", line 1357, in _error_check
raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening 'xxxx.mp3': File contains data in an unknown format.
I've tried with different (valid) mp3 files and each time it results in this error.
I'm using whisperx on Google Colab.
// EDIT: I get the same error when I try the diarization feature.
I cannot find this anywhere in the documentation. In the whisperx transcribe function there is a massive section of optional parameters that can be passed in. How can I actually use these in Python?
# parser.add_argument("--model", default="small", choices=available_models(), help="name of the Whisper model to use")
# parser.add_argument("--model_dir", type=str, default=None, help="the path to save model files; uses ~/.cache/whisper by default")
# parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu", help="device to use for PyTorch inference")
# # alignment params
# parser.add_argument("--align_model", default=None, help="Name of phoneme-level ASR model to do alignment")
# parser.add_argument("--align_extend", default=2, type=float, help="Seconds before and after to extend the whisper segments for alignment")
# parser.add_argument("--align_from_prev", default=True, type=bool, help="Whether to clip the alignment start time of current segment to the end time of the last aligned word of the previous segment")
# parser.add_argument("--drop_non_aligned", action="store_true", help="For word .srt, whether to drop non aliged words, or merge them into neighbouring.")
# parser.add_argument("--output_dir", "-o", type=str, default=".", help="directory to save the outputs")
# parser.add_argument("--output_type", default="srt", choices=['all', 'srt', 'vtt', 'txt'], help="File type for desired output save")
# parser.add_argument("--verbose", type=str2bool, default=True, help="whether to print out the progress and debug messages")
# parser.add_argument("--task", type=str, default="transcribe", choices=["transcribe", "translate"], help="whether to perform X->X speech recognition ('transcribe') or X->English translation ('translate')")
# parser.add_argument("--language", type=str, default=None, choices=sorted(LANGUAGES.keys()) + sorted([k.title() for k in TO_LANGUAGE_CODE.keys()]), help="language spoken in the audio, specify None to perform language detection")
# parser.add_argument("--temperature", type=float, default=0, help="temperature to use for sampling")
# parser.add_argument("--best_of", type=optional_int, default=5, help="number of candidates when sampling with non-zero temperature")
# parser.add_argument("--beam_size", type=optional_int, default=5, help="number of beams in beam search, only applicable when temperature is zero")
# parser.add_argument("--patience", type=float, default=None, help="optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search")
# parser.add_argument("--length_penalty", type=float, default=None, help="optional token length penalty coefficient (alpha) as in https://arxiv.org/abs/1609.08144, uses simple length normalization by default")
# parser.add_argument("--suppress_tokens", type=str, default="-1", help="comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations")
# parser.add_argument("--initial_prompt", type=str, default=None, help="optional text to provide as a prompt for the first window.")
# parser.add_argument("--condition_on_previous_text", type=str2bool, default=False, help="if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop")
# parser.add_argument("--fp16", type=str2bool, default=True, help="whether to perform inference in fp16; True by default")
# parser.add_argument("--temperature_increment_on_fallback", type=optional_float, default=0.2, help="temperature to increase when falling back when the decoding fails to meet either of the thresholds below")
# parser.add_argument("--compression_ratio_threshold", type=optional_float, default=2.4, help="if the gzip compression ratio is higher than this value, treat the decoding as failed")
# parser.add_argument("--logprob_threshold", type=optional_float, default=-1.0, help="if the average log probability is lower than this value, treat the decoding as failed")
# parser.add_argument("--no_speech_threshold", type=optional_float, default=0.6, help="if the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence")
# parser.add_argument("--threads", type=optional_int, default=0, help="number of threads used by torch for CPU inference; supercedes MKL_NUM_THREADS/OMP_NUM_THREADS")
Where would I actually put these? transcribe does not seem to have input parameters for them, and neither does load_model.
model = whisperx.load_model(modelSize, device)
result = model.transcribe(audio_file)
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result_aligned = whisperx.align(result["segments"], model_a, metadata, audio_file, device)
Specifically interested in --threads, --beam_size, --patience, and --best_of
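For what it's worth, upstream Whisper's transcribe() forwards extra keyword arguments to its decoding options, so assuming whisperx keeps that behaviour, something like the following should cover --beam_size, --best_of and --patience from Python, while --threads maps to torch.set_num_threads. This is a sketch under that assumption, not documented whisperx API:

import torch
import whisperx

torch.set_num_threads(4)  # CLI: --threads 4 (CPU inference only)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisperx.load_model("small", device)

# --beam_size / --best_of / --patience map to Whisper decoding options,
# assuming transcribe() forwards extra kwargs to DecodingOptions.
result = model.transcribe(
    "audio.wav",
    beam_size=5,      # used when temperature == 0
    best_of=5,        # used when sampling with temperature > 0
    patience=1.0,
    temperature=0.0,
)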
@m-bain Thank you so much for your amazing work.
There are still some incoherent timestamps in the word-level SRT files (it was the case for 139 files out of 360 in my data). I'm about to write a Python script to parse all the SRT files and fix the affected timestamps, but maybe there is a way to avoid them from the beginning? It makes it hard to convert them into TextGrid files... (I use https://github.com/rctatman/SrtToTextgrid). Besides that, WhisperX is working so well!!
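In case it helps while the root cause is being tracked down, here is a minimal sketch of the repair pass you describe, using the third-party srt package; the repair rule (give any cue whose end is not after its start a small minimum duration) is an assumption about what the incoherent timestamps look like:

import datetime
import srt  # pip install srt

def fix_incoherent(path, min_duration=0.05):
    with open(path, encoding="utf-8") as f:
        subs = list(srt.parse(f.read()))
    for sub in subs:
        # If the end timestamp is not after the start, give the cue a tiny
        # minimum duration so the TextGrid conversion does not choke on it.
        if sub.end <= sub.start:
            sub.end = sub.start + datetime.timedelta(seconds=min_duration)
    with open(path, "w", encoding="utf-8") as f:
        f.write(srt.compose(subs))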
E:\Applications\WhisperX>whisperx --model large --language ko GM2Ki_FmF5U.mp4 --align_model wav2vec2-xls-r-300m-korean --output_dir examples/whisperx --align_extend 2
C:\Users\REDACTED\AppData\Roaming\Python\Python39\site-packages\torchaudio\backend\utils.py:62: UserWarning: No audio backend is available.
warnings.warn("No audio backend is available.")
[00:00.000 --> 00:03.360] 안녕하세요. 저는 현우입니다. 한국어로 말해주세요.
[00:03.360 --> 00:09.120] 아무도 불가능한 길게 말하는 사람을 듣고 싶지 않습니다.
[00:09.120 --> 00:18.240] 그러나, 이런 길게 말하는 말은 한국어로 말하는 새로운 언어를 공부하는 경우에 정말 도움이 될 것입니다.
[00:18.240 --> 00:23.160] 영어로는 짧은 말과 길게 말하는 것과는 차이가 없죠.
[00:23.160 --> 00:24.160] 예를 들어,
[00:24.160 --> 00:30.620] I am a lion. You are a bunny. We can't be friends.
[00:30.620 --> 00:35.160] You simply add more words like and, and so, and you say
[00:35.160 --> 00:41.260] I'm a lion. And you're a bunny. So we can't be friends.
[00:41.260 --> 00:43.460] The verbs themselves don't really change,
[00:43.460 --> 00:47.180] so it's relatively easier to make longer sentences in English.
[00:47.180 --> 00:52.480] But in Korean, with these three short sentences,
[00:52.480 --> 00:58.680] 나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.680 --> 01:03.060] You have to change the verb endings to form a longer sentence using them.
[01:03.060 --> 01:08.500] 나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.
[01:08.500 --> 01:15.060] So, without understanding how the verbs change forms to be linked with the following part,
[01:15.060 --> 01:18.660] you can't really make your sentences more fluid and flexible,
[01:18.660 --> 01:24.860] and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.860 --> 01:28.440] Again, you don't have to talk like this.
[01:30.920 --> 01:34.580] 나는 사자고 너는 토끼니까 우리는 친구가 될 수 없지만
[01:34.580 --> 01:38.320] 내가 배가 안 고플 때는 너를 잡아먹지 않으려고 노력하겠다는 약속은
[01:38.320 --> 01:41.000] 지금은 일단 해줄 수 있다고 볼 수 있는데
[01:41.000 --> 01:44.560] 100% 보장할 수는 없다는 점을 이해해줬으면 좋겠는데
[01:44.560 --> 01:45.600] 가능할까?
[01:45.600 --> 01:49.600] But you don't want to always talk like this either.
[01:49.600 --> 01:54.720] 나는 사자야, 너는 토끼야. 우리는 친구가 될 수 없어.
[01:54.720 --> 01:59.680] 내가 배가 안 고파. 그러면 너를 안 잡아먹어. 노력할게.
[02:24.720 --> 02:32.720] 재미있는 책, 한국어 공부를 좀 하려고 어디로 가면 좋을지 아직 모르겠어요.
[02:32.720 --> 02:37.320] 마침 and 우연히 are connected together here.
[02:37.320 --> 02:41.120] This one is so, this is but.
[02:41.120 --> 02:43.920] 집에만 있을 생각이었지만.
[02:43.920 --> 02:47.020] Then, I'll be waiting for you at TalkToMeInKorean.com.
[02:47.020 --> 02:49.020] TALK TO ME IN KOREAN 에서 만나요!
[02:49.020 --> 02:49.520] Bye!
Performing alignment...
[00:00.000 --> 00:00.502] 안녕하세요. 저는 현우입니다. 한국어로 말해주세요.
[00:01.360 --> 00:02.021] 아무도 불가능한 길게 말하는 사람을 듣고 싶지 않습니다.
[00:07.120 --> 00:08.282] 그러나, 이런 길게 말하는 말은 한국어로 말하는 새로운 언어를 공부하는 경우에 정말 도움이 될 것입니다.
[00:16.240 --> 00:16.821] 영어로는 짧은 말과 길게 말하는 것과는 차이가 없죠.
[00:21.160 --> 00:21.260] 예를 들어,
[00:24.160 --> 00:30.620] I am a lion. You are a bunny. We can't be friends.
[00:30.620 --> 00:35.160] You simply add more words like and, and so, and you say
[00:35.160 --> 00:41.260] I'm a lion. And you're a bunny. So we can't be friends.
[00:41.260 --> 00:43.460] The verbs themselves don't really change,
[00:43.460 --> 00:47.180] so it's relatively easier to make longer sentences in English.
[00:47.180 --> 00:52.480] But in Korean, with these three short sentences,
[00:50.480 --> 00:51.061] 나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.680 --> 01:03.060] You have to change the verb endings to form a longer sentence using them.
[01:01.060 --> 01:01.681] 나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.
[01:08.500 --> 01:15.060] So, without understanding how the verbs change forms to be linked with the following part,
[01:15.060 --> 01:18.660] you can't really make your sentences more fluid and flexible,
[01:18.660 --> 01:24.860] and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.860 --> 01:28.440] Again, you don't have to talk like this.
[01:28.920 --> 01:29.522] 나는 사자고 너는 토끼니까 우리는 친구가 될 수 없지만
[01:32.580 --> 01:33.322] 내가 배가 안 고플 때는 너를 잡아먹지 않으려고 노력하겠다는 약속은
[01:36.320 --> 01:36.781] 지금은 일단 해줄 수 있다고 볼 수 있는데
[01:39.000 --> 01:39.501] 100% 보장할 수는 없다는 점을 이해해줬으면 좋겠는데
[01:42.560 --> 01:42.640] 가능할까?
[01:45.600 --> 01:49.600] But you don't want to always talk like this either.
[01:47.600 --> 01:48.161] 나는 사자야, 너는 토끼야. 우리는 친구가 될 수 없어.
[01:52.720 --> 01:53.321] 내가 배가 안 고파. 그러면 너를 안 잡아먹어. 노력할게.
[02:22.720 --> 02:23.581] 재미있는 책, 한국어 공부를 좀 하려고 어디로 가면 좋을지 아직 모르겠어요.
[02:30.720 --> 02:30.840] 마침 and 우연히 are connected together here.
[02:37.320 --> 02:41.120] This one is so, this is but.
[02:39.120 --> 02:39.381] 집에만 있을 생각이었지만.
[02:43.920 --> 02:47.020] Then, I'll be waiting for you at TalkToMeInKorean.com.
[02:45.020 --> 02:45.160] TALK TO ME IN KOREAN 에서 만나요!
[02:49.020 --> 02:49.520] Bye!
E:\Applications\WhisperX>
E:\Applications\WhisperX>whisperx --model large --language ko GM2Ki_FmF5U.mp4 --align_model wav2vec2-xls-r-300m-korean --output_dir examples/whisperx
C:\Users\REDACTED\AppData\Roaming\Python\Python39\site-packages\torchaudio\backend\utils.py:62: UserWarning: No audio backend is available.
warnings.warn("No audio backend is available.")
[00:00.000 --> 00:03.320] 안녕하세요. 저는 현우에요. 한국어로 말해주세요.
[00:03.320 --> 00:09.060] 아무도 누군가에게 불가능한 길게 말하는 말을 듣고 싶지 않습니다.
[00:09.060 --> 00:18.280] 그러나, 이런 길게 말하는 말은 한국어 같은 새로운 언어를 공부할 때 정말 도움이 될 것입니다.
[00:18.280 --> 00:23.100] 영어에서는 짧은 말과 길게 말의 차이가 적습니다.
[00:23.100 --> 00:24.200] 예를 들어,
[00:24.200 --> 00:30.540] I am a lion. You are a bunny. We can't be friends.
[00:30.540 --> 00:35.200] You simply add more words like and, so, and you say
[00:35.200 --> 00:41.300] I'm a lion. And you're a bunny. So, we can't be friends.
[00:41.300 --> 00:43.420] The verbs themselves don't really change,
[00:43.420 --> 00:47.220] so it's relatively easier to make longer sentences in English.
[00:47.220 --> 00:52.480] But in Korean, with these three short sentences
[00:52.480 --> 00:58.720] 나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.720 --> 01:03.220] You have to change the verb endings to form a longer sentence using them.
[01:03.220 --> 01:08.440] 나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.
[01:08.440 --> 01:15.080] So without understanding how the verbs change forms to be linked with the following part,
[01:15.080 --> 01:18.580] You can't really make your sentences more fluid and flexible
[01:18.580 --> 01:24.920] and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.920 --> 01:28.340] Again, you don't have to talk like this.
[01:30.840 --> 01:34.580] 나는 사자고 너는 토끼니까 우리는 친구가 될 수 없지만
[01:34.580 --> 01:38.280] 내가 배가 안 고플 때는 너를 잡아먹지 않으려고 노력하겠다는 약속은
[01:38.280 --> 01:40.920] 지금은 일단 해줄 수 있다고 볼 수 있는데
[01:40.920 --> 01:44.520] 100% 보장할 수는 없다는 점을 이해해줬으면 좋겠는데
[01:44.520 --> 01:45.880] 가능할까?
[01:49.880 --> 01:54.920] 나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[01:54.920 --> 01:59.720] 내가 배가 안 고파. 그러면 너를 안 잡아먹어. 노력할게.
[02:24.920 --> 02:32.920] 재미있는 책, 한국어 공부를 좀 하려고 어디로 가면 좋을지 아직 모르겠어요.
[02:32.920 --> 02:37.360] 마침 and 우연히 are connected together here.
[02:37.360 --> 02:41.120] This one is so, this is but.
[02:41.120 --> 02:44.000] 집에만 있을 생각이었지만.
[02:44.000 --> 02:47.000] Then I'll be waiting for you at TalkToMeInKorean.com
[02:47.000 --> 02:49.000] TalkToMeInKorean에서 만나요!
[02:49.000 --> 02:49.500] Bye!
Performing alignment...
[00:00.000 --> 00:00.482] 안녕하세요. 저는 현우에요. 한국어로 말해주세요.
[00:01.340 --> 00:02.062] 아무도 누군가에게 불가능한 길게 말하는 말을 듣고 싶지 않습니다.
[00:07.060 --> 00:08.102] 그러나, 이런 길게 말하는 말은 한국어 같은 새로운 언어를 공부할 때 정말 도움이 될 것입니다.
[00:16.280 --> 00:16.801] 영어에서는 짧은 말과 길게 말의 차이가 적습니다.
[00:21.100 --> 00:21.220] 예를 들어,
[00:24.200 --> 00:30.540] I am a lion. You are a bunny. We can't be friends.
[00:30.540 --> 00:35.200] You simply add more words like and, so, and you say
[00:35.200 --> 00:41.300] I'm a lion. And you're a bunny. So, we can't be friends.
[00:41.300 --> 00:43.420] The verbs themselves don't really change,
[00:43.420 --> 00:47.220] so it's relatively easier to make longer sentences in English.
[00:47.220 --> 00:52.480] But in Korean, with these three short sentences
[00:50.480 --> 00:51.061] 나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.720 --> 01:03.220] You have to change the verb endings to form a longer sentence using them.
[01:01.220 --> 01:01.821] 나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.
[01:08.440 --> 01:15.080] So without understanding how the verbs change forms to be linked with the following part,
[01:15.080 --> 01:18.580] You can't really make your sentences more fluid and flexible
[01:18.580 --> 01:24.920] and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.920 --> 01:28.340] Again, you don't have to talk like this.
[01:28.840 --> 01:29.442] 나는 사자고 너는 토끼니까 우리는 친구가 될 수 없지만
[01:32.580 --> 01:33.322] 내가 배가 안 고플 때는 너를 잡아먹지 않으려고 노력하겠다는 약속은
[01:36.280 --> 01:36.761] 지금은 일단 해줄 수 있다고 볼 수 있는데
[01:38.920 --> 01:39.441] 100% 보장할 수는 없다는 점을 이해해줬으면 좋겠는데
[01:42.520 --> 01:42.600] 가능할까?
[01:47.880 --> 01:48.481] 나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[01:52.920 --> 01:53.501] 내가 배가 안 고파. 그러면 너를 안 잡아먹어. 노력할게.
[02:22.920 --> 02:23.721] 재미있는 책, 한국어 공부를 좀 하려고 어디로 가면 좋을지 아직 모르겠어요.
[02:30.920 --> 02:31.040] 마침 and 우연히 are connected together here.
[02:37.360 --> 02:41.120] This one is so, this is but.
[02:39.120 --> 02:39.381] 집에만 있을 생각이었지만.
[02:44.000 --> 02:47.000] Then I'll be waiting for you at TalkToMeInKorean.com
[02:45.000 --> 02:45.120] TalkToMeInKorean에서 만나요!
[02:49.000 --> 02:49.500] Bye!
E:\Applications\WhisperX>
OS: Windows 10
Python: 3.9.9
WhisperX: e909f2f
Whisper Model: Large
Alignment Model: w11wo/wav2vec2-xls-r-300m-korean
https://www.youtube.com/watch?v=GM2Ki_FmF5U (720p version, audio + video, mp4, I let WhisperX preprocess it)
whisperx --model large --language ko GM2Ki_FmF5U.mp4 --align_model wav2vec2-xls-r-300m-korean --output_dir examples/whisperx --align_extend 2
Note: Everything below is described using align_extend 2, as shown in the command above and in the Terminal (Main) details above.
In the input, the speaker uses a mix of English and Korean. In the video's introduction, English is spoken, and later in the video, such as during the example sentences, he switches to Korean. Instead of returning the introduction in English, you can see that it instead translated the English sentences into Korean for some reason.
This behaviour is inconsistent too. For example, at 0:00:24, English is spoken and WhisperX transcribes it in English, which is fine. However, as mentioned above, during the introduction it transcribed the English into Korean. I have no clue why that is.
EDIT: SOLVED. I remembered that in #7 there was mention of having to use Chinese instead of cn, so I took a look at that issue again and saw it was regarding alignment. I changed kr to Korean for the language parameter and this issue was resolved.
At first I tried passing no align_extend parameter. That made the transcribed captions even worse. I then used align_extend 2, as given in the Japanese example in the README, which improved my results. It made everything better, and the English portions are lined up. However, the issue is that the first occurrence (0:00:52) of the following line is not aligned properly at all:
나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
The transcription is correct; the issue is the alignment. As you can see from the command-line output, everything is initially correct before the alignment is performed:
[00:47.180 --> 00:52.480] But in Korean, with these three short sentences,
[00:52.480 --> 00:58.680] 나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.680 --> 01:03.060] You have to change the verb endings to form a longer sentence using them.
[01:03.060 --> 01:08.500] 나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.
However, after the alignment, it seems like all of the Korean sentences' alignments are less than a second long. Here is just one example, but if you look at the output above, you will notice it is the case for all Korean sentences post-alignment.
[00:47.180 --> 00:52.480] But in Korean, with these three short sentences,
[00:50.480 --> 00:51.061] 나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.680 --> 01:03.060] You have to change the verb endings to form a longer sentence using them.
[01:01.060 --> 01:01.681] 나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.
I'm currently trying to work with the new VAD feature but I'm getting the following error:
TypeError: transcribe_with_vad() missing 1 required positional argument: 'vad_pipeline'
Is there sample code anywhere for transcribing with vad?
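I haven't found official sample code either. Based on the call signature visible in the traceback further up this page, a rough and unverified sketch would be the following; the pyannote loading path and the VAD hyperparameter values are my assumptions, not necessarily what whisperx does internally:

import torch
import whisperx
from whisperx.transcribe import transcribe_with_vad  # module path taken from the traceback
from pyannote.audio import Model
from pyannote.audio.pipelines import VoiceActivityDetection

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisperx.load_model("large", device)

# "pyannote/segmentation" is gated on Hugging Face, so an access token is needed.
seg_model = Model.from_pretrained("pyannote/segmentation", use_auth_token="YOUR_HF_TOKEN")
vad_pipeline = VoiceActivityDetection(segmentation=seg_model)
vad_pipeline.instantiate({"onset": 0.5, "offset": 0.5,
                          "min_duration_on": 0.0, "min_duration_off": 0.0})

# Signature as shown in the traceback:
# transcribe_with_vad(model, audio_path, vad_pipeline, temperature=..., **args)
result = transcribe_with_vad(model, "audio.wav", vad_pipeline, temperature=0.0)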
For a longer file on M1 Pro, I keep getting this error after about 22min of alignment:
Reproduction repo: https://github.com/akparhi/pyvtt/tree/dev
Request with a 1 hr YouTube video:
curl --location --request POST 'http://localhost:8000/speech-to-text' \
  --header 'Content-Type: application/json' \
  --data-raw '{ "url": "https://www.youtube.com/watch?v=MYrfLmm_cT4", "accuracy": "word" }'
Is it possible to add a function for determining the different speakers in a conversation and identifying them in the subtitles?
Is there a function within WhisperX that can generate an SRT file?
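The CLI already writes SRT via --output_type srt (see the argument list quoted elsewhere on this page). If you need it from Python and don't want to rely on internal helpers, a tiny writer over the aligned segments works too; the dict keys follow the segment outputs discussed above, with times in seconds:

def write_srt_file(segments, path):
    def fmt(t):
        h, rem = divmod(int(t * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(f"{i}\n{fmt(seg['start'])} --> {fmt(seg['end'])}\n{seg['text'].strip()}\n\n")

# e.g. write_srt_file(result_aligned["segments"], "out.srt"), assuming the aligned
# result exposes its segments under that key.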
the most recent commit 286a2f2 references a diarize.py file which doesn't exist
whisperX/whisperx/transcribe.py
Line 12 in 286a2f2
is it going to be added in a future commit?