crimeisdown / trunk-transcribe

Transcription of calls from trunk-recorder using OpenAI Whisper

Shell 5.38% Dockerfile 0.36% Python 93.70% Batchfile 0.25% Makefile 0.32%

celery meilisearch openai-whisper telegram-bot trunk-recorder whisper

trunk-transcribe's Issues

Improve detection of non-voice audio and gibberish transcripts

Often with analog audio, the Morse code used for FCC station identification gets transcribed when it shouldn't be. Additionally, some noise gets transcribed as gibberish because Whisper hallucinates text for it.

The detection of such issues should be improved so that the amount of transcript "noise" can be reduced.
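One possible starting point, sketched under assumptions rather than taken from this project: each segment dict returned by openai-whisper's `transcribe()` carries `no_speech_prob`, `avg_logprob`, and `compression_ratio`, and these can be thresholded after the fact. The function names below are hypothetical, and the default thresholds simply reuse Whisper's own decoding defaults (0.6, -1.0, 2.4); they would likely need tuning against real radio traffic.

```python
def is_probably_noise(
    segment: dict,
    no_speech_threshold: float = 0.6,
    logprob_threshold: float = -1.0,
    compression_ratio_threshold: float = 2.4,
) -> bool:
    """Heuristically flag a Whisper segment as non-voice audio or gibberish.

    Assumption: `segment` is one entry of result["segments"] from Whisper.
    """
    # Whisper thinks the audio window most likely contains no speech.
    if segment["no_speech_prob"] > no_speech_threshold:
        return True
    # The decoder was very unsure about the tokens it produced.
    if segment["avg_logprob"] < logprob_threshold:
        return True
    # Highly compressible text usually means repetitive, looping output.
    if segment["compression_ratio"] > compression_ratio_threshold:
        return True
    return False


def drop_noise_segments(segments: list) -> list:
    """Keep only the segments that do not look like noise."""
    return [s for s in segments if not is_probably_noise(s)]
```

Filtering after transcription keeps the raw Whisper output available, so the thresholds can be adjusted later without re-running the model.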

Refactor worker logic into separate Celery tasks

Right now, the transcription job does the following:

1. Downloads the audio
2. Parses metadata and prepares the audio to be fed to Whisper
3. Feeds a chunk of audio to Whisper for each speaker (digital only; analog just feeds the whole recording)
4. Compiles the resulting transcript
5. Geocodes any addresses
6. Sends the transcript to the search API to be indexed
7. Determines which notifications should be made, and sends the appropriate ones

Only step 3 really relies on the GPU; the other steps are CPU-only. To make the architecture more scalable, split the logic that requires a GPU (and its immediate prerequisites) from the logic that doesn't into separate Celery tasks, so that different workers can take them on for improved performance.
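As a rough illustration of the split, the steps above could each become a task that is routed to either a GPU or a CPU queue. The task names and queue names below are hypothetical, not the project's; the only thing borrowed from Celery is the shape of the resulting dict, which matches what the `task_routes` setting accepts (task name mapped to `{"queue": ...}`).

```python
# Hypothetical task names, one per step of the current transcription job.
# The boolean marks whether the step needs a GPU.
PIPELINE = [
    ("fetch_audio", False),
    ("prepare_audio", False),
    ("transcribe_audio", True),  # the only step that runs Whisper
    ("compile_transcript", False),
    ("geocode_addresses", False),
    ("index_transcript", False),
    ("send_notifications", False),
]


def build_task_routes(pipeline, gpu_queue="gpu", cpu_queue="cpu"):
    """Map each task name to a queue, in the shape of Celery's `task_routes`."""
    return {
        name: {"queue": gpu_queue if needs_gpu else cpu_queue}
        for name, needs_gpu in pipeline
    }


TASK_ROUTES = build_task_routes(PIPELINE)
```

With routes like these, a GPU machine would start a worker that consumes only the GPU queue (for example with Celery's `-Q gpu` worker option), while any number of cheaper CPU workers consume the other queue.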

Transcribe fails if S3 is not enabled

If the S3 env variables are not defined, the audio is successfully sent as base64, but it appears that the worker doesn't know how to process it. It fails with:

trunk-transcribe-worker-1       | [2023-03-28 19:23:54,679: INFO/ForkPoolWorker-2] Task transcribe[9ef51669-f8ea-4d1b-b74c-9c804702305c] retry: Retry in 2s: InvalidSchema("No connection adapters were found for 'data:audio/mpeg;base64,SUQzBAAA<...>'")

I worked around this by using S3 instead, but it would probably be good to either support base64 audio or document that it doesn't work.
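The error suggests the worker hands the `data:` URI straight to an HTTP client, which has no adapter for that scheme. A minimal sketch of one way to support it, assuming the worker receives the audio location as a single string: decode `data:` URIs locally and only go to the network for everything else. The function name is hypothetical, and `urllib.request.urlopen` stands in for whatever HTTP client the worker actually uses.

```python
import base64
from urllib.request import urlopen


def fetch_audio_bytes(audio_url: str) -> bytes:
    """Return raw audio bytes from either a base64 data URI or a remote URL."""
    if audio_url.startswith("data:"):
        # A data URI looks like "data:audio/mpeg;base64,<payload>".
        header, _, payload = audio_url.partition(",")
        if not header.endswith(";base64"):
            raise ValueError("only base64-encoded data URIs are handled in this sketch")
        return base64.b64decode(payload)
    # Anything else is treated as a regular URL (e.g. an S3 object URL).
    with urlopen(audio_url) as response:
        return response.read()
```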

OpenAI Whisper docker image does not actually use CUDA even when enabled

The publicly published Docker image does not actually use CUDA:

trunk-transcribe-worker-1       | [2023-03-28 18:56:16,152: WARNING/ForkPoolWorker-2] /usr/local/lib/python3.10/dist-packages/whisper/transcribe.py:114: UserWarning: FP16 is not supported on CPU; using FP32 instead
trunk-transcribe-worker-1       |   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
trunk-transcribe-worker-1       |
trunk-transcribe-worker-1       | [2023-03-28 18:56:23,461: WARNING/ForkPoolWorker-2] /usr/local/lib/python3.10/dist-packages/whisper/transcribe.py:114: UserWarning: FP16 is not supported on CPU; using FP32 instead
trunk-transcribe-worker-1       |   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
trunk-transcribe-worker-1       |
trunk-transcribe-worker-1       | [2023-03-28 18:56:24,606: WARNING/ForkPoolWorker-2] /usr/local/lib/python3.10/dist-packages/whisper/transcribe.py:114: UserWarning: FP16 is not supported on CPU; using FP32 instead
trunk-transcribe-worker-1       |   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
trunk-transcribe-worker-1       |
trunk-transcribe-meilisearch-1  | [2023-03-28T18:56:27Z INFO  actix_web::middleware::logger] 127.0.0.1 "GET /health HTTP/1.1" 200 22 "-" "Wget" 0.000401
trunk-transcribe-worker-1       | [2023-03-28 18:56:27,331: WARNING/ForkPoolWorker-2] /usr/local/lib/python3.10/dist-packages/whisper/transcribe.py:114: UserWarning: FP16 is not supported on CPU; using FP32 instead
trunk-transcribe-worker-1       |   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
trunk-transcribe-worker-1       |
trunk-transcribe-worker-1       | [2023-03-28 18:56:28,558: WARNING/ForkPoolWorker-2] /usr/local/lib/python3.10/dist-packages/whisper/transcribe.py:114: UserWarning: FP16 is not supported on CPU; using FP32 instead
trunk-transcribe-worker-1       |   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
trunk-transcribe-worker-1       |

It appears that 'nvidia-smi' is not installed on the system image, and the CUDA libraries are also not installed.

I fixed this by changing Dockerfile.whisper to build off Nvidia's CUDA image instead:

diff --git a/Dockerfile.whisper b/Dockerfile.whisper
index 8f9d690..3b1c265 100644
--- a/Dockerfile.whisper
+++ b/Dockerfile.whisper
@@ -3,7 +3,7 @@
 #
 # PLEASE DO NOT EDIT IT DIRECTLY.
 #
-FROM ubuntu:22.04
+FROM nvidia/cuda:11.7.1-base-ubuntu22.04

 RUN apt-get update && \
     apt-get -y upgrade && \

I then rebuilt the image and confirmed that it works. This might not be the ideal fix, but it at least confirms where the issue is happening.
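To catch this earlier than a stream of FP16 warnings, the worker could log a small GPU report at startup. The sketch below is an assumption about how that might look, not something the project does: it checks whether `nvidia-smi` is on the `PATH` and, if PyTorch is importable, what `torch.cuda.is_available()` reports. The function name and report keys are hypothetical.

```python
import shutil


def gpu_diagnostics() -> dict:
    """Collect basic hints about whether CUDA is usable in this environment."""
    report = {
        # True when the nvidia-smi binary can be found on PATH.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }
    try:
        import torch  # Whisper depends on PyTorch, so this is normally present
    except ImportError:
        # None means "could not check", as opposed to False ("checked, no CUDA").
        report["torch_cuda_available"] = None
    else:
        report["torch_cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    print(gpu_diagnostics())
```

Running this inside the container before and after changing the base image would show whether the rebuild actually changed what PyTorch can see.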

Start collecting start and end timestamps of transcript segments

Whisper provides a start and end time for each transcript segment, for the purpose of making accurate subtitles. However, this data can also be used to fine-tune a Whisper model, in conjunction with corrected transcripts.

In order to make this happen, we need to first start collecting start and end timestamps for all transcript segments, and ensure the raw transcript data stores these segments. This involves modifying how we use the result dicts we get back from Whisper, and updating the transcript data structure in various places.
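As a sketch of the first step, assuming the result dict has the shape openai-whisper's `transcribe()` returns (a `"segments"` list whose entries include `"start"`, `"end"`, and `"text"`), the timestamps could be pulled out like this. The function name and the output structure are hypothetical and not the project's transcript format.

```python
def extract_timed_segments(result: dict) -> list:
    """Reduce a Whisper result dict to start/end/text triples per segment."""
    return [
        {
            "start": float(segment["start"]),  # seconds from start of audio
            "end": float(segment["end"]),      # seconds from start of audio
            "text": segment["text"].strip(),
        }
        for segment in result.get("segments", [])
    ]
```

Storing these triples alongside the raw transcript would let corrected text be paired with the matching audio span later on.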
