crimeisdown / trunk-transcribe

Transcription of calls from trunk-recorder using OpenAI Whisper

Shell 5.88% Dockerfile 0.41% Python 93.07% Batchfile 0.28% Makefile 0.36%

celery meilisearch openai-whisper telegram-bot trunk-recorder whisper

trunk-transcribe's Introduction

CrimeIsDown.com

CrimeIsDown.com takes the confusion out of monitoring Chicago crime with data-driven tools. This repository contains the code that powers the website, where the tools will live (tools may be stored in a separate repository as they are developed).

Getting started

Clone the repository
Ensure you have Ruby with Bundler, Node.js, NPM (Node Package Manager), Bower, and Gulp.js installed globally.
In the root of the repository, run the following commands:

bundle install
npm install
bower install
gulp

This should install all the necessary packages and generate a folder called /dist which contains the compiled HTML, CSS, and JS needed to make the website work. The contents of this /dist folder can be uploaded to a web server and viewed. To edit the website and view the changes instantly, use the command gulp serve instead of the standard gulp command. See the client-side documentation below for a full list of Gulp tasks.

Server-side

No server-side technologies are being used currently. An API that interacts with some datasets from the City of Chicago Data Portal is planned. This may be developed in a separate repository.

Client-side

The original project structure and development tool configuration were made possible by the Yeoman generator for web apps with Gulp.js and have been modified for the purposes of this website.

Directory structure

/
|-app         # Application goes here (index.haml, 404.haml, etc.)
|---images    # Static images used inside the app
|---scripts   # Various JavaScript files to be loaded
|---styles    # SCSS being used in the application, in main.scss for app stylesheets and vendor.scss for vendor stylesheets
|---templates # Master templates used by application pages
|-dist        # Where the files go after being compiled
|---images    # Compressed versions of images in /app/images
|---scripts   # Compiled, minified scripts used in application (both original and vendor JS)
|---styles    # Compiled SCSS from /app/styles
|-test        # JavaScript tests should go here
|---spec      # Mocha tests

Use Gulp tasks to "compile" the website

These Gulp.js tasks were included in the Yeoman generator listed above. More may be added as the need arises.

gulp or gulp build to build an optimized version of your application in /dist
gulp serve to launch a browser sync server on your source files
gulp serve:dist to launch a server on your optimized application
gulp wiredep to fill bower dependencies in your .html file(s)

Other client-side technologies currently being used include:

SCSS split up into separate files in the app/styles folder, compiled in a Gulp.js task
More technologies not listed in this README yet

Contributing

To be decided in a new CONTRIBUTING.md file. Stay tuned for further information.

Changelog and versioning

Various releases of this website and the tools under it will make use of the Semantic Versioning guidelines. There may be some errors in protocol, but generally we try to adhere to this.

Releases should be numbered with the format of <major>.<minor>.<patch>. What is defined as a "major", "minor", or "patch" release has yet to be decided.

1.1.0

Finished Tumblr theme, some design changes to main website

1.0.0

Finished multi-page website, more multimedia and tools

0.1.0

Finished single-page website with multimedia

0.0.1

Initial release, with a draft of single-page website (idea pitch site)

License

To be decided. Currently all code is under standard copyright law, except where third-party material is used, in which case the license of that material applies.

trunk-transcribe's Issues

Transcribe fails if S3 is not enabled

If the S3 env variables are not defined, the audio is successfully sent as base64, but it appears that the worker doesn't know how to process it. It fails with:

trunk-transcribe-worker-1       | [2023-03-28 19:23:54,679: INFO/ForkPoolWorker-2] Task transcribe[9ef51669-f8ea-4d1b-b74c-9c804702305c] retry: Retry in 2s: InvalidSchema("No connection adapters were found for 'data:audio/mpeg;base64,SUQzBAAA<...>'")

I worked around this by enabling S3, but it would probably be good to either support base64 audio in the worker or document that it doesn't work.
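
One possible way to support this case is to decode data: URIs before handing anything to an HTTP client. The following is a minimal sketch, not the project's actual code: the function name read_audio is hypothetical, and it assumes the worker only needs the raw audio bytes.

```python
import base64
from urllib.parse import unquote_to_bytes
from urllib.request import urlopen


def read_audio(url: str) -> bytes:
    """Return raw audio bytes from a data: URI or a regular URL.

    Hypothetical helper for illustration; not part of trunk-transcribe.
    """
    if url.startswith("data:"):
        # A data URI looks like "data:<mime>[;base64],<payload>".
        header, _, payload = url.partition(",")
        if header.endswith(";base64"):
            return base64.b64decode(payload)
        # Without ";base64" the payload is percent-encoded.
        return unquote_to_bytes(payload)
    # Anything else is fetched over the network as before.
    with urlopen(url) as response:
        return response.read()
```

With a helper like this, the prefix shown in the log above ('data:audio/mpeg;base64,SUQzBAAA') decodes to the first bytes of an MP3 ID3 header instead of raising InvalidSchema.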

Start collecting start and end timestamps of transcript segments

Whisper provides a start and end time for each transcript segment, for the purpose of making accurate subtitles. However, this data can also be used to fine-tune a Whisper model, in conjunction with corrected transcripts.

In order to make this happen, we need to first start collecting start and end timestamps for all transcript segments, and ensure the raw transcript data stores these segments. This involves modifying how we use the result dicts we get back from Whisper, and updating the transcript data structure in various places.
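
As a rough sketch of what collecting this data could look like: the result dict returned by Whisper's transcribe() includes a "segments" list whose entries carry "start", "end", and "text" keys. The TranscriptSegment class and extract_segments function below are hypothetical names used only for illustration; the project's real transcript structure may differ.

```python
from dataclasses import asdict, dataclass


@dataclass
class TranscriptSegment:
    """One timed piece of a transcript (hypothetical structure)."""
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str


def extract_segments(result: dict) -> list[TranscriptSegment]:
    """Keep the timing Whisper reports for every segment in a result dict."""
    return [
        TranscriptSegment(float(s["start"]), float(s["end"]), s["text"].strip())
        for s in result.get("segments", [])
    ]


if __name__ == "__main__":
    # Hand-written sample in the shape of a Whisper result, not real output.
    sample = {
        "text": " Engine 5 responding. Copy.",
        "segments": [
            {"start": 0.0, "end": 1.8, "text": " Engine 5 responding."},
            {"start": 2.1, "end": 2.6, "text": " Copy."},
        ],
    }
    for segment in extract_segments(sample):
        print(asdict(segment))
```

Storing the segments as plain dicts (via asdict) keeps them easy to serialize alongside the raw transcript.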

Refactor worker logic into separate Celery tasks

Right now, the transcription job does the following:

1. Downloads the audio
2. Parses metadata and gets the audio ready to be fed to Whisper
3. Feeds a chunk of audio to Whisper for each speaker (digital only, analog will just feed the whole thing)
4. Compiles the resulting transcript
5. Geocodes any addresses
6. Sends the transcript to the search API to be indexed
7. Determines which notifications should be made, and sends the appropriate ones

Only step 3 really relies on the GPU; the other steps are CPU-only. To make the architecture more scalable, split the logic that requires a GPU (and its immediate prerequisites) from the logic that doesn't into separate Celery tasks, so that different workers can take them on for improved performance.
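
To make the proposed split concrete, here is a standard-library-only sketch of how the seven steps might be grouped. The task names and the "cpu"/"gpu" queue names are hypothetical. The routing dict has the shape accepted by Celery's task_routes setting (task name mapped to {"queue": ...}), but nothing here imports or configures Celery.

```python
# Hypothetical task names for the seven steps listed above. Only the
# Whisper call is assigned to the GPU queue.
PIPELINE = [
    ("download_audio", "cpu"),
    ("prepare_audio", "cpu"),
    ("transcribe_audio", "gpu"),
    ("compile_transcript", "cpu"),
    ("geocode_addresses", "cpu"),
    ("index_transcript", "cpu"),
    ("send_notifications", "cpu"),
]


def build_task_routes(pipeline):
    """Return a mapping in the shape of Celery's task_routes setting."""
    return {name: {"queue": queue} for name, queue in pipeline}


def group_into_stages(pipeline):
    """Merge consecutive steps that share a queue into one stage each."""
    stages = []
    for name, queue in pipeline:
        if stages and stages[-1][0] == queue:
            stages[-1][1].append(name)
        else:
            stages.append((queue, [name]))
    return stages


if __name__ == "__main__":
    for queue, steps in group_into_stages(PIPELINE):
        print(queue, steps)
```

Under these assumptions the job falls into three stages: a CPU stage that prepares the audio, a GPU stage that runs Whisper, and a CPU stage for everything after. Each stage could become its own Celery task, with GPU workers consuming only the "gpu" queue.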

Improve detection of non-voice audio and gibberish transcripts

Often with analog audio, FCC station identification in Morse code gets transcribed when it shouldn't be. Additionally, some noise can get transcribed as gibberish due to hallucination.

The detection of such issues should be improved so that the amount of transcript "noise" can be reduced.
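
One starting point, sketched below, is to use the per-segment statistics Whisper already returns: "no_speech_prob", "avg_logprob", and "compression_ratio". The function name is hypothetical, and the default thresholds simply reuse Whisper's own transcribe() defaults (0.6, -1.0, and 2.4). They are assumptions that would need tuning against real radio traffic, and this heuristic alone may not catch Morse code.

```python
def is_probably_noise(
    segment: dict,
    no_speech_threshold: float = 0.6,
    logprob_threshold: float = -1.0,
    compression_ratio_threshold: float = 2.4,
) -> bool:
    """Heuristically flag a Whisper segment as non-speech or gibberish.

    Hypothetical helper; thresholds are starting values, not tuned ones.
    """
    # Highly repetitive text compresses very well, a common sign of
    # hallucinated loops.
    if segment["compression_ratio"] > compression_ratio_threshold:
        return True
    # The model thinks there is no speech AND is unsure about its own text.
    if (
        segment["no_speech_prob"] > no_speech_threshold
        and segment["avg_logprob"] < logprob_threshold
    ):
        return True
    return False


def drop_noise(segments: list[dict]) -> list[dict]:
    """Keep only the segments that do not look like noise."""
    return [s for s in segments if not is_probably_noise(s)]
```

Filtering per segment rather than per call means one burst of static does not discard an otherwise usable transcript.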

OpenAI Whisper docker image does not actually use CUDA even when enabled

The docker image that is publicly published does not actually use CUDA:

trunk-transcribe-worker-1       | [2023-03-28 18:56:16,152: WARNING/ForkPoolWorker-2] /usr/local/lib/python3.10/dist-packages/whisper/transcribe.py:114: UserWarning: FP16 is not supported on CPU; using FP32 instead
trunk-transcribe-worker-1       |   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
trunk-transcribe-worker-1       |
trunk-transcribe-worker-1       | [2023-03-28 18:56:23,461: WARNING/ForkPoolWorker-2] /usr/local/lib/python3.10/dist-packages/whisper/transcribe.py:114: UserWarning: FP16 is not supported on CPU; using FP32 instead
trunk-transcribe-worker-1       |   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
trunk-transcribe-worker-1       |
trunk-transcribe-worker-1       | [2023-03-28 18:56:24,606: WARNING/ForkPoolWorker-2] /usr/local/lib/python3.10/dist-packages/whisper/transcribe.py:114: UserWarning: FP16 is not supported on CPU; using FP32 instead
trunk-transcribe-worker-1       |   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
trunk-transcribe-worker-1       |
trunk-transcribe-meilisearch-1  | [2023-03-28T18:56:27Z INFO  actix_web::middleware::logger] 127.0.0.1 "GET /health HTTP/1.1" 200 22 "-" "Wget" 0.000401
trunk-transcribe-worker-1       | [2023-03-28 18:56:27,331: WARNING/ForkPoolWorker-2] /usr/local/lib/python3.10/dist-packages/whisper/transcribe.py:114: UserWarning: FP16 is not supported on CPU; using FP32 instead
trunk-transcribe-worker-1       |   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
trunk-transcribe-worker-1       |
trunk-transcribe-worker-1       | [2023-03-28 18:56:28,558: WARNING/ForkPoolWorker-2] /usr/local/lib/python3.10/dist-packages/whisper/transcribe.py:114: UserWarning: FP16 is not supported on CPU; using FP32 instead
trunk-transcribe-worker-1       |   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
trunk-transcribe-worker-1       |

It appears that 'nvidia-smi' is not installed on the system image, and the CUDA libraries are also not installed.

I fixed this by changing Dockerfile.whisper to build off Nvidia's CUDA image instead:

diff --git a/Dockerfile.whisper b/Dockerfile.whisper
index 8f9d690..3b1c265 100644
--- a/Dockerfile.whisper
+++ b/Dockerfile.whisper
@@ -3,7 +3,7 @@
 #
 # PLEASE DO NOT EDIT IT DIRECTLY.
 #
-FROM ubuntu:22.04
+FROM nvidia/cuda:11.7.1-base-ubuntu22.04

 RUN apt-get update && \
     apt-get -y upgrade && \

I then rebuilt the image and confirmed that it works. This might not be the ideal fix, but it at least confirms where the issue is happening.
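
A quick check like the following, run inside the worker container, could help confirm whether an image can see the GPU before and after such a change. It is a sketch with a hypothetical function name. It only looks for nvidia-smi on the PATH and, if PyTorch is importable, asks torch.cuda.is_available().

```python
import shutil


def cuda_report() -> dict:
    """Summarize whether GPU tooling is visible in this environment."""
    report = {
        # True when the nvidia-smi binary can be found on PATH.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # None means PyTorch is not installed, so nothing could be checked.
        "torch_sees_cuda": None,
    }
    try:
        import torch
    except ImportError:
        return report
    report["torch_sees_cuda"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    print(cuda_report())
```

If both values come back false inside the container, the "FP16 is not supported on CPU" warnings above are expected, since Whisper falls back to the CPU.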