This script modifies OpenAI's Whisper to produce more reliable timestamps.
pip install -U stable-ts
To install the latest commit:
pip install -U git+https://github.com/jianfch/stable-ts.git
import stable_whisper
model = stable_whisper.load_model('base')
result = model.transcribe('audio.mp3')
result.to_srt_vtt('audio.srt')
CLI
stable-ts audio.mp3 -o audio.srt
Parameters: load_model(), transcribe(), transcribe_minimal()
faster-whisper
Use with faster-whisper:
model = stable_whisper.load_faster_whisper('base')
result = model.transcribe_stable('audio.mp3')
Parameters: transcribe_stable()
Stable-ts supports various text output formats.
result.to_srt_vtt('audio.srt') #SRT
result.to_srt_vtt('audio.vtt') #VTT
result.to_ass('audio.ass') #ASS
result.to_tsv('audio.tsv') #TSV
Parameters: to_srt_vtt(), to_ass(), to_tsv()
There are word-level and segment-level timestamps. All output formats support them.
All formats except TSV also support both levels simultaneously.
By default, segment_level and word_level are both True for all the formats that support both simultaneously.
Examples in VTT.
Default: segment_level=True + word_level=True
CLI: --segment_level true + --word_level true
00:00:07.760 --> 00:00:09.900
But<00:00:07.860> when<00:00:08.040> you<00:00:08.280> arrived<00:00:08.580> at<00:00:08.800> that<00:00:09.000> distant<00:00:09.400> world,
segment_level=True + word_level=False
00:00:07.760 --> 00:00:09.900
But when you arrived at that distant world,
segment_level=False + word_level=True
00:00:07.760 --> 00:00:07.860
But
00:00:07.860 --> 00:00:08.040
when
00:00:08.040 --> 00:00:08.280
you
00:00:08.280 --> 00:00:08.580
arrived
...
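To make the three combinations concrete, here is a minimal sketch that renders one segment into VTT cues in each mode. It is not the library's implementation: `vtt_time` and `render_vtt_cues` are hypothetical helpers, and words are assumed to be plain `(text, start, end)` tuples.

```python
def vtt_time(seconds):
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f'{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}'


def render_vtt_cues(words, segment_level=True, word_level=True):
    """Render one segment (a list of (text, start, end) tuples) as cue strings."""
    if not (segment_level or word_level):
        raise ValueError('segment_level and word_level cannot both be False')
    if not segment_level:
        # Word level only: one cue per word.
        return [f'{vtt_time(s)} --> {vtt_time(e)}\n{t}' for t, s, e in words]
    header = f'{vtt_time(words[0][1])} --> {vtt_time(words[-1][2])}'
    if not word_level:
        # Segment level only: one cue with plain text.
        return [f'{header}\n' + ' '.join(t for t, _, _ in words)]
    # Both levels: one cue, with an inline timestamp tag before each later word.
    body = words[0][0]
    for text, start, _ in words[1:]:
        body += f'<{vtt_time(start)}> {text}'
    return [f'{header}\n{body}']


words = [('But', 7.76, 7.86), ('when', 7.86, 8.04), ('you', 8.04, 8.28)]
for cue in render_vtt_cues(words):
    print(cue)
```

Passing `word_level=False` or `segment_level=False` to the helper gives the other two layouts shown above.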
The result can also be saved as a JSON file to preserve all the data for future reprocessing. This is useful for testing different sets of postprocessing arguments without the need to redo inference.
result.save_as_json('audio.json')
CLI
stable-ts audio.mp3 -o audio.json
To process a JSON file of the results into SRT:
result = stable_whisper.WhisperResult('audio.json')
result.to_srt_vtt('audio.srt')
CLI
stable-ts audio.json -o audio.srt
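The idea of saving once and reprocessing later can be sketched with the standard library alone. The JSON layout below (a `segments` list with `start`, `end` and `text`) is an assumption made for illustration and may not match the file that `save_as_json()` writes; `srt_time` and `segments_to_srt` are hypothetical helpers.

```python
import json
import os
import tempfile


def srt_time(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f'{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}'


def segments_to_srt(segments):
    """Build SRT text from a list of segment dicts."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}"
        blocks.append(f"{index}\n{timing}\n{seg['text'].strip()}\n")
    return '\n'.join(blocks)


# Assumed, simplified stand-in for a saved result.
saved = {'segments': [{'start': 7.76, 'end': 9.9,
                       'text': ' But when you arrived at that distant world,'}]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'audio.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(saved, f)      # done once, right after inference
    with open(path, encoding='utf-8') as f:
        reloaded = json.load(f)  # done as often as needed, no inference

srt_text = segments_to_srt(reloaded['segments'])
print(srt_text)
```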
Audio can be aligned/synced with plain text at the word level.
text = 'Machines thinking, breeding. You were to bear us a new, promised land.'
result = model.align('audio.mp3', text)
When the text is correct but the timestamps need more work, align() is a faster alternative for testing various settings/models.
new_result = model.align('audio.mp3', result)
Parameters: align()
Timestamps are adjusted after the model predicts them.
When suppress_silence=True (default), transcribe()/transcribe_minimal()/align() adjust them based on silence/non-speech.
The timestamps can be further adjusted based on another result with adjust_by_result(), which acts as a logical AND operation for the timestamps of both results, further reducing the duration of each word.
Note: both results are required to have word timestamps and matching words.
# the adjustments are in-place for `result`
result.adjust_by_result(new_result)
Parameters: adjust_by_result()
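The logical AND can be pictured as intersecting the two time intervals of each word. The following is a minimal sketch of that idea, not the code behind adjust_by_result(): `intersect_word_timings` is a hypothetical helper, words are assumed to be `(text, start, end)` tuples, and keeping the first result's timing when the intervals do not overlap is an assumption.

```python
def intersect_word_timings(words_a, words_b):
    """Intersect per-word intervals of two results with matching words."""
    if [w[0] for w in words_a] != [w[0] for w in words_b]:
        raise ValueError('both results must contain the same words in the same order')
    adjusted = []
    for (text, start_a, end_a), (_, start_b, end_b) in zip(words_a, words_b):
        # AND of two intervals: latest start, earliest end.
        start, end = max(start_a, start_b), min(end_a, end_b)
        if start >= end:
            # Assumption: with no overlap, fall back to the first result.
            start, end = start_a, end_a
        adjusted.append((text, start, end))
    return adjusted


first = [('But', 7.70, 7.90), ('when', 7.86, 8.10)]
second = [('But', 7.76, 7.86), ('when', 7.80, 8.04)]
print(intersect_word_timings(first, second))
```

Each word ends up no longer than it was in either input, which is why the operation can only shorten durations.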
Stable-ts has a preset for regrouping words into different segments with more natural boundaries.
This preset is enabled by regroup=True (default).
But there are other built-in regrouping methods that allow you to customize the regrouping algorithm.
This preset is just a predefined combination of those methods.
# The following results are all functionally equivalent:
result0 = model.transcribe('audio.mp3', regroup=True) # regroup is True by default
result1 = model.transcribe('audio.mp3', regroup=False)
(
    result1
    .clamp_max()
    .split_by_punctuation([('.', ' '), '。', '?', '？', (',', ' '), '，'])
    .split_by_gap(.5)
    .merge_by_gap(.3, max_words=3)
    .split_by_punctuation([('.', ' '), '。', '?', '？'])
)
result2 = model.transcribe('audio.mp3', regroup='cm_sp=.* /。/?/？/,* /，_sg=.5_mg=.3+3_sp=.* /。/?/？')
# To undo all regrouping operations:
result0.reset()
Any regrouping algorithm can be expressed as a string. Please feel free to share your strings here.
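Judging only from the example string above, methods appear to be separated by `_`, arguments introduced by `=`, and multiple arguments separated by `+`. The sketch below splits a string along those lines. The grammar is inferred from that single example, the abbreviation table covers only the four methods it uses, and `parse_regroup_string` is a hypothetical helper rather than part of Stable-ts.

```python
# Abbreviations inferred from the example: cm, sp, sg, mg.
ABBREVIATIONS = {
    'cm': 'clamp_max',
    'sp': 'split_by_punctuation',
    'sg': 'split_by_gap',
    'mg': 'merge_by_gap',
}


def parse_regroup_string(spec):
    """Split a regroup string into (method_name, raw_argument_strings) steps."""
    steps = []
    for part in spec.split('_'):
        name, _, raw_args = part.partition('=')
        if name not in ABBREVIATIONS:
            raise ValueError(f'unknown method abbreviation: {name!r}')
        args = raw_args.split('+') if raw_args else []
        steps.append((ABBREVIATIONS[name], args))
    return steps


print(parse_regroup_string('cm_sg=.5_mg=.3+3'))
```

The arguments are left as raw strings here; converting them to numbers or punctuation lists is deliberately out of scope for this sketch.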
- regroup()
- split_by_gap()
- split_by_punctuation()
- split_by_length()
- merge_by_gap()
- merge_by_punctuation()
- merge_all_segments()
- clamp_max()
- lock()
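As an illustration of what gap-based regrouping does, here is a minimal sketch of splitting and merging on pauses between words. It is a stand-in, not the library's code: segments are assumed to be lists of `(text, start, end)` tuples, the functions return new lists instead of working in place, and treating `max_words` as a cap on the merged segment's length is an assumption.

```python
def split_by_gap(segments, max_gap):
    """Split a segment wherever the pause between two words exceeds max_gap."""
    result = []
    for words in segments:
        if not words:
            continue
        current = [words[0]]
        for previous, word in zip(words, words[1:]):
            if word[1] - previous[2] > max_gap:
                result.append(current)
                current = []
            current.append(word)
        result.append(current)
    return result


def merge_by_gap(segments, min_gap, max_words=None):
    """Merge adjacent segments when the pause between them is at most min_gap."""
    merged = []
    for segment in segments:
        if merged:
            previous = merged[-1]
            gap = segment[0][1] - previous[-1][2]
            fits = max_words is None or len(previous) + len(segment) <= max_words
            if gap <= min_gap and fits:
                previous.extend(segment)
                continue
        merged.append(list(segment))
    return merged


words = [('Machines', 0.0, 0.5), ('thinking,', 0.6, 1.0), ('breeding.', 1.8, 2.3)]
pieces = split_by_gap([words], 0.5)
print(pieces)
print(merge_by_gap(pieces, 1.0))
```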
You can locate words with regular expressions.
# Find every sentence that contains "and"
matches = result.find(r'[^.]+and[^.]+\.')
# Print all the matches, if there are any
for match in matches:
    print(f'match: {match.text_match}\n'
          f'text: {match.text}\n'
          f'start: {match.start}\n'
          f'end: {match.end}\n')
# Find the word before and after "and" in the matches
matches = matches.find(r'\s\S+\sand\s\S+')
for match in matches:
    print(f'match: {match.text_match}\n'
          f'text: {match.text}\n'
          f'start: {match.start}\n'
          f'end: {match.end}\n')
Parameters: find()
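One way to picture how a regex match can carry timestamps is to map the matched character span back onto the words it overlaps. The sketch below does that with the standard `re` module. It is not how find() is implemented: `find_words` is a hypothetical helper, and words are assumed to be `(text, start, end)` tuples whose text keeps its leading space.

```python
import re


def find_words(words, pattern):
    """Return regex matches with the start/end time of the words they cover."""
    text = ''.join(w[0] for w in words)
    # Character span occupied by each word inside the joined text.
    spans, position = [], 0
    for word in words:
        spans.append((position, position + len(word[0])))
        position += len(word[0])
    matches = []
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            continue  # ignore empty matches
        covered = [w for w, (a, b) in zip(words, spans)
                   if a < m.end() and b > m.start()]
        matches.append({'text_match': m.group(),
                        'start': covered[0][1],
                        'end': covered[-1][2]})
    return matches


words = [(' You', 0.0, 0.2), (' were', 0.2, 0.5), (' to', 0.5, 0.6),
         (' bear', 0.6, 1.0), (' us', 1.0, 1.2)]
print(find_words(words, r'\s\S+\sto\s\S+'))
```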
- do not disable word timestamps with word_timestamps=False for reliable segment timestamps
- use vad=True for more accurate non-speech detection
- use demucs=True to isolate vocals with Demucs; it is also effective at isolating vocals even if there is no music
- use demucs=True and vad=True for music
- use --dq true (CLI) or dq=True (for stable_whisper.load_model) to enable dynamic quantization for inference on CPU
- use encode_video_comparison() to encode multiple transcripts into one video for synced comparison; see Encode Comparison
- use visualize_suppression() to visualize the differences between non-VAD and VAD options; see Visualizing Suppression
- if the non-speech/silence seems to be detected but the starting timestamps do not reflect that, then try min_word_dur=0
You can visualize which parts of the audio will likely be suppressed (i.e. marked as silent). Requires: Pillow or opencv-python.
import stable_whisper
# regions on the waveform colored red are where it will likely be suppressed and marked as silent
# [q_levels]=20 and [k_size]=5 (default)
stable_whisper.visualize_suppression('audio.mp3', 'image.png', q_levels=20, k_size=5)
With Silero VAD
# [vad_threshold]=0.35 (default)
stable_whisper.visualize_suppression('audio.mp3', 'image.png', vad=True, vad_threshold=0.35)
Parameters: visualize_suppression()
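For intuition about the two non-VAD settings, the sketch below treats `k_size` as the width of a moving average over a loudness envelope and `q_levels` as the number of quantization steps, with the lowest step counted as silent. This reading of the parameters is an assumption and the code is not the library's algorithm; `silence_mask` is a hypothetical helper that takes a plain list of non-negative loudness values.

```python
def silence_mask(loudness, q_levels=20, k_size=5):
    """Return one boolean per frame, True where the frame counts as silent."""
    if not loudness:
        return []
    half = k_size // 2
    # Smooth with a centered moving average (shorter windows at the edges).
    smoothed = []
    for i in range(len(loudness)):
        window = loudness[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    peak = max(smoothed)
    if peak == 0:
        return [True] * len(loudness)
    # Quantize to q_levels steps; frames that land on step 0 are silent.
    return [round(value / peak * q_levels) == 0 for value in smoothed]


envelope = [0.0] * 6 + [1.0] * 6
print(silence_mask(envelope, q_levels=20, k_size=3))
print(silence_mask(envelope, q_levels=1, k_size=3))
```

In this toy model, fewer quantization levels raise the loudness a frame needs before it stops counting as silent.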
You can encode videos similar to the ones in the doc for comparing transcriptions of the same audio.
stable_whisper.encode_video_comparison(
    'audio.mp3',
    ['audio_sub1.srt', 'audio_sub2.srt'],
    output_videopath='audio.mp4',
    labels=['Example 1', 'Example 2']
)
Parameters: encode_video_comparison()
Transcribe multiple audio files then process the results directly into SRT files.
stable-ts audio1.mp3 audio2.mp3 audio3.mp3 -o audio1.srt audio2.srt audio3.srt
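For longer file lists, the matching output names can be derived instead of typed by hand. This is a small sketch that only builds the argument list in the shape shown above; `build_batch_command` is a hypothetical helper, and actually running the command (for example with `subprocess.run`) is left out.

```python
from pathlib import Path


def build_batch_command(audio_paths, suffix='.srt'):
    """Build a stable-ts command with one output path per input file."""
    outputs = [str(Path(p).with_suffix(suffix)) for p in audio_paths]
    return ['stable-ts', *audio_paths, '-o', *outputs]


command = build_batch_command(['audio1.mp3', 'audio2.mp3', 'audio3.mp3'])
print(' '.join(command))
```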
You can use most of the features of Stable-ts to improve the results of any ASR model or API. Just follow this notebook.
- updated to use Whisper's more reliable word-level timestamps method.
- the more reliable word timestamps allow regrouping all words into segments with more natural boundaries.
- can now suppress silence with Silero VAD (requires PyTorch 1.12.0+)
- non-VAD silence suppression is also more robust
results_to_sentence_srt(result, 'audio.srt') → result.to_srt_vtt('audio.srt', word_level=False)
results_to_word_srt(result, 'audio.srt') → result.to_srt_vtt('output.srt', segment_level=False)
results_to_sentence_word_ass(result, 'audio.srt') → result.to_ass('output.ass')
- there's no need to stabilize segments after inference because they're already stabilized during inference
transcribe() returns a WhisperResult object, which can be converted to a dict with .to_dict(), e.g. result.to_dict()
This project is licensed under the MIT License; see the LICENSE file for details.
Includes slight modification of the original work: Whisper