WhoSpeaks

Toolkit for Enhanced Voice Training Datasets

Note: realtime_diarize.py depends on recent changes to RealtimeSTT. Please upgrade RealtimeSTT to the latest version.

WhoSpeaks emerged from the need for better speaker diarization tools. Existing libraries are heavyweight and often fall short in reliability, speed and efficiency, so this project offers a more refined alternative.

Hint: If you are interested in state-of-the-art voice solutions, please also have a look at Linguflex. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.

Here's the core concept:

  • Voice Characteristic Extraction: For each sentence in your audio, unique voice characteristics are extracted, creating audio embeddings.
  • Sentence Similarity Comparison: Cosine similarity is then used to compare each sentence's embedding against the embedding of every other sentence, identifying similarities.
  • Grouping and Averaging: Similar-sounding sentences are grouped together. This approach averages out anomalies and minimizes errors from individual data points.
  • Identification of Distinct Groups: By analyzing these groups, we can isolate the most distinct ones, which represent unique speaker characteristics.

These steps allow us to match any sentence against the established speaker profiles with remarkable precision.
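
The four steps above can be illustrated with a small, self-contained Python sketch. It is a simplified stand-in, not the code from auto_diarize.py or speaker_diarize.py: the embeddings are made-up three-dimensional vectors rather than real voice embeddings, the greedy grouping with a fixed similarity threshold of 0.8 is an assumption chosen for brevity, and the function names (cosine_similarity, group_sentences, assign_speaker) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_sentences(embeddings, threshold=0.8):
    """Greedy grouping (an assumption, not the project's algorithm).

    Each sentence joins the most similar existing group if the similarity to
    that group's averaged embedding reaches the threshold; otherwise it
    starts a new group.
    """
    groups = []  # each group: {"members": [sentence indices], "centroid": [floats]}
    for idx, emb in enumerate(embeddings):
        best, best_sim = None, threshold
        for group in groups:
            sim = cosine_similarity(emb, group["centroid"])
            if sim >= best_sim:
                best, best_sim = group, sim
        if best is None:
            groups.append({"members": [idx], "centroid": list(emb)})
        else:
            best["members"].append(idx)
            n = len(best["members"])
            # Running average: smooths out anomalies in single sentences.
            best["centroid"] = [
                (c * (n - 1) + e) / n for c, e in zip(best["centroid"], emb)
            ]
    return groups


def assign_speaker(embedding, groups):
    """Index of the group whose averaged embedding is most similar."""
    sims = [cosine_similarity(embedding, g["centroid"]) for g in groups]
    return max(range(len(groups)), key=sims.__getitem__)


# Toy "sentence embeddings": sentences 0, 1 and 4 point one way, 2 and 3 another.
embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
    [1.0, 0.0, 0.1],
]
groups = group_sentences(embeddings)
members = [g["members"] for g in groups]
new_sentence_group = assign_speaker([0.2, 0.0, 1.0], groups)
print(members, new_sentence_group)
```

With real data, each group's averaged embedding would serve as a speaker profile, and a new sentence would be matched to the profile with the highest cosine similarity, as assign_speaker does here.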

Feature Modules

  • fetch_youtube_mp3.py: Extracts and converts YouTube audio, like podcasts, to MP3 for voice analysis.
  • split_dataset.py: This tool divides your input audio into distinct sentences.
  • convert_wav.py: Converts the sentence-based MP3 files into WAV format.
  • auto_diarize.py/speaker_diarize.py: Heart of WhoSpeaks. Categorizes sentences into speaker groups and selects training sentences based on the unique algorithm described above.
  • pyannote_diarize.py: Used for comparison against pyannote audio, a current state-of-the-art speaker diarization model.

Note: auto_diarize.py is for multiple speakers; speaker_diarize.py is for two speakers only.

I initially developed this as a personal project, but was astounded by its effectiveness. In my first tests it outperformed existing solutions like pyannote audio in both reliability and speed while being the more lightweight approach. For me it could be a significant step up in voice diarization capabilities. That's why I've decided to release this rather raw, yet powerful code for others to experiment with.

Performance and Testing

To demonstrate WhoSpeaks' capabilities, I ran a test using a challenging audio sample: the 4:38 Coin Toss scene from "No Country for Old Men". In this scene, the two male speakers have very similar voice profiles, presenting a difficult scenario for diarization libraries.

Process:

  1. Download: Using fetch_youtube_mp3.py, download the MP3 from the scene's YouTube video.
  2. Diarization Comparison: Run the scene through pyannote_diarize.py (which uses pyannote audio) with the number of speakers set to 2.
    • Pyannote's output was inaccurate, incorrectly assigning most sentences to a single speaker.
  3. WhoSpeaks Analysis:
    • Sentence Splitting: Use split_dataset.py with tiny.en for efficiency, though large-v2 offers higher accuracy.
    • Conversion: The MP3 segments are converted to WAV format using convert_wav.py.
    • Diarization: Then run auto_diarize.py and visually inspect the dendrogram file to confirm the presence of two speakers.

To run auto_diarize.py and speaker_diarize.py, set the environment variable COQUI_MODEL_PATH to the path containing the "v2.0.2" model folder for Coqui XTTS.
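
A quick way to check this setting before starting a long run is a small pre-flight script such as the sketch below. The helper name resolve_model_dir is hypothetical, and the assumption that the "v2.0.2" folder sits directly inside COQUI_MODEL_PATH is taken from the sentence above. How the WhoSpeaks scripts read the variable internally is not shown here.

```python
import os
from pathlib import Path


def resolve_model_dir(env=None):
    """Return the expected XTTS model folder, or raise if the variable is unset.

    Hypothetical helper for illustration; it is not part of WhoSpeaks.
    """
    env = os.environ if env is None else env
    base = env.get("COQUI_MODEL_PATH")
    if not base:
        raise RuntimeError("COQUI_MODEL_PATH is not set")
    # Assumption: the "v2.0.2" folder lives directly under COQUI_MODEL_PATH.
    return Path(base) / "v2.0.2"


if __name__ == "__main__":
    try:
        model_dir = resolve_model_dir()
    except RuntimeError as exc:
        print(f"Configuration problem: {exc}")
    else:
        status = "found" if model_dir.is_dir() else "missing"
        print(f"{model_dir}: {status}")
```

Passing a plain dictionary instead of the real environment, for example resolve_model_dir({"COQUI_MODEL_PATH": "/models"}), makes the check easy to try without changing any system settings.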

Results:

  • WhoSpeaks' algorithm assigned 53 sentences correctly to Javier Bardem's voice with only 2 minor errors.
  • Of the 33 sentences assigned to the other actor, only one was incorrect.
  • The overall error rate was approximately 3.5%, meaning that about 96.5% of the sentences were assigned correctly.
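
The error rate follows directly from the counts in the list above. The short calculation below reproduces it under one assumption: that the errors are counted among the 53 and 33 assigned sentences, giving 86 sentences in total.

```python
# Counts taken from the results above.
# Assumption: the errors are included in the assigned totals.
assigned = {"Javier Bardem": 53, "other actor": 33}
errors = {"Javier Bardem": 2, "other actor": 1}

total_assigned = sum(assigned.values())  # 86 sentences
total_errors = sum(errors.values())      # 3 misassigned sentences

error_rate = total_errors / total_assigned
accuracy = 1 - error_rate

print(f"error rate: {error_rate:.1%}")
print(f"accuracy:   {accuracy:.1%}")
```

Three errors out of 86 sentences give an error rate of roughly 3.5%, so about 96.5% of the sentences were assigned to the correct speaker.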

The effectiveness of WhoSpeaks in this test, particularly against pyannote audio, showcases its potential in handling complex diarization scenarios with high accuracy and efficiency.
