bamert / stm32_speech_commands

Efficient real-time keyword spotting on STM32L4 microcontrollers

Efficient Keyword Spotting for Embedded Systems

This repository demonstrates an efficient keyword spotting system tailored for STM32L4 microcontrollers, balancing accuracy and speed for real-time audio processing in resource-constrained embedded systems.

Deployed on an STM32L475 (128 KB SRAM, 80 MHz), it recognizes 35 different keywords and achieves a post-rejection accuracy of 96% at an inference latency of 190 ms, which is suitable for streaming applications.

Demo: For reference, the model can be tested in the browser here.

Model Specifications

  • Uses a modified M5 model that processes raw waveforms (no spectrogram).
  • Dataset: recognizes 35 keywords from the Speech Commands dataset.
  • Audio sampling rate: 8 kHz, 1-second frames.
  • Inference time (Cortex-M4): ~190 ms at 80 MHz.
  • Inference time (browser): ~1-5 ms, depending on the device.
  • Memory usage (Cortex-M4): about 60 KB of RAM.

Repository Structure

  • model_training: contains the PyTorch Lightning training code.
  • browser_inference: contains the browser-based demo inference code. Try it here.
  • stm32_inference: contains the STM32-specific inference engine and firmware for the B-L475E-IOT01A board.

Getting Started

  • The Python requirements are managed with Poetry. Install them with cd model_training && poetry install.
  • The STM32 code requires the Arm GCC toolchain (arm-none-eabi-gcc). Build the code with cd stm32_inference && make.
    • A firmware binary is available at stm32_inference/build/speechmodel_code.bin.
  • A no-frills browser inference engine is included in browser_inference/browser_demo_inference.html.

Model accuracy / inference time tradeoff

Model        val acc. [%]   pr val acc. [%] (% rejected)   STM32 inference time [ms]   MFLOP   kParams
M5-c32-k80   86.6           96.9 (23.1)                    603                         3.8     166
M5-c16-k80   81.7           96.3 (37.4)                    -                           -       -
M5-c32-k40   87.6           97.2 (23.0)                    595                         2.4     99
M5-c32-k20   86.2           96.6 (23.8)                    246                         1.8     98
M5-c32-k10   84.5           96.5 (28.4)                    180                         1.6     97

The above table shows some of the model configurations that were tried. The first row shows the original configuration of the M5 model by Dai et al.

The STM32 inference engine acquires overlapping audio frames of 1 second length (8 kHz, 8000 samples) and runs inference on a new frame every 250 ms, so consecutive frames overlap by 750 ms. This ensures that the longer keywords ("visual", "marvin", ...) are more likely to be fully contained in one of the frames rather than being cut in half. To sustain 4 inferences per second, the inference time of the model has to stay under 250 ms.
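The framing scheme described above can be sketched in a few lines of Python. This is only an illustration of the 1-second window and 250 ms hop, not the firmware's actual implementation (which is written in C); the function name `overlapping_frames` and the constants are hypothetical names chosen for this example.

```python
from collections import deque
from typing import Iterable, Iterator, List

SAMPLE_RATE_HZ = 8000          # from the text: 8 kHz audio
FRAME_LEN = SAMPLE_RATE_HZ     # 1 second of audio -> 8000 samples
HOP_LEN = SAMPLE_RATE_HZ // 4  # a new frame every 250 ms -> 2000 samples


def overlapping_frames(samples: Iterable[int],
                       frame_len: int = FRAME_LEN,
                       hop_len: int = HOP_LEN) -> Iterator[List[int]]:
    """Yield the latest `frame_len` samples each time `hop_len` new samples arrive.

    Hypothetical helper: it mimics a streaming audio buffer with a deque.
    """
    window = deque(maxlen=frame_len)  # oldest samples drop out automatically
    since_last_frame = 0
    for sample in samples:
        window.append(sample)
        since_last_frame += 1
        # Emit a frame once the window is full and a full hop has elapsed.
        if len(window) == frame_len and since_last_frame >= hop_len:
            since_last_frame = 0
            yield list(window)


# Three seconds of dummy audio: each sample's value is simply its index.
frames = list(overlapping_frames(range(3 * SAMPLE_RATE_HZ)))
print(len(frames), frames[0][0], frames[1][0])
```

With this framing, each frame shares its first 6000 samples with the end of the previous frame, so a keyword shorter than 750 ms always fits completely inside at least one frame.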

Experiments with a smaller kernel length for the initial 1D convolution showed that reasonable performance can also be reached with a much smaller kernel (k=10). The accuracy on the validation split with this model is 84.5%. For keyword spotting applications, missing an unclear keyword is more acceptable than making a false positive prediction. For this reason, we use the difference between the highest and second-highest class probabilities as a proxy for the confidence of the prediction, and we only make a prediction if this difference is greater than 75% (0.75 on a 0-1 probability scale). Given this additional criterion to avoid false positives, all models reach a post-rejection accuracy (pr val acc.) in excess of 96% on the non-rejected validation samples.
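The rejection rule can be illustrated with a short Python sketch. It assumes the model output has already been converted to per-class probabilities that sum to 1; the function name `predict_or_reject` and the constant name are hypothetical and do not come from the repository.

```python
from typing import Optional, Sequence

MARGIN_THRESHOLD = 0.75  # from the text: predict only if the gap is > 75%


def predict_or_reject(probs: Sequence[float],
                      threshold: float = MARGIN_THRESHOLD) -> Optional[int]:
    """Return the index of the top class, or None if the prediction is rejected.

    The confidence proxy is the gap between the largest and the
    second-largest class probability.
    """
    if len(probs) < 2:
        raise ValueError("need at least two classes to compute a margin")
    # Class indices sorted by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    margin = probs[best] - probs[runner_up]
    return best if margin > threshold else None


print(predict_or_reject([0.90, 0.05, 0.05]))  # gap of 0.85: accepted
print(predict_or_reject([0.60, 0.30, 0.10]))  # gap of 0.30: rejected
```

A rejected frame produces no keyword at all, which is how this rule trades missed detections for fewer false positives.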

The model used in the STM32 and browser inference engines is M5-c32-k10.
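The choice of model follows from the 250 ms real-time budget. The sketch below checks that budget against the STM32 inference times from the table; the names `INFERENCE_MS` and `fits_budget` are hypothetical, and M5-c16-k80 is left out because the table lists no measurement for it.

```python
FRAME_PERIOD_MS = 250  # one inference every 250 ms, i.e. 4 per second

# STM32 inference times in milliseconds, copied from the table.
INFERENCE_MS = {
    "M5-c32-k80": 603,
    "M5-c32-k40": 595,
    "M5-c32-k20": 246,
    "M5-c32-k10": 180,
}


def fits_budget(inference_ms: float, period_ms: float = FRAME_PERIOD_MS) -> bool:
    """True if one inference finishes before the next frame is due."""
    return inference_ms < period_ms


for name, ms in INFERENCE_MS.items():
    headroom_ms = FRAME_PERIOD_MS - ms
    status = "ok" if fits_budget(ms) else "too slow"
    print(f"{name}: {ms} ms, headroom {headroom_ms} ms ({status})")
```

Only the k=20 and k=10 variants fit the budget. By the table's numbers, k=20 leaves just 4 ms of headroom, while k=10 leaves 70 ms for audio acquisition and other work.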
