create-speech-to-text-pipeline / pipeline

License: MIT License

Jupyter Notebook 99.61% HTML 0.04% CSS 0.01% JavaScript 0.12% Python 0.21% Shell 0.01% Dockerfile 0.01%
apache-airflow apache-kafka apache-spark kafka-js kafka-python pyspark reactjs amazon-msk amazon-s3-storage

pipeline's Introduction

Logo

Speech-to-Text Data Collection

A deployable tool that posts text and audio files to a data lake and receives them from it, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.

Data Capture Pipeline

Pipeline Diagram

Directory Structure

.
├── airflow
│   ├── dags
│   │   ├── extract_load.py
│   │   └── scripts
│   │       ├── dataloader.py
│   │       ├── db_connection.py
│   │       ├── __init__.py
│   │       └── schema
│   │           └── amharicnews.sql
│   ├── data
│   │   └── AmharicNewsDataset.csv
│   ├── docker-compose.yaml
│   └── logs
│       └── scheduler
│           └── latest -> /opt/airflow/logs/scheduler/2022-10-05
├── backend
│   └── dummy.txt
├── frontend
│   ├── dummy.txt
│   ├── frontend
│   │   ├── package.json
│   │   ├── package-lock.json
│   │   ├── public
│   │   │   ├── favicon.ico
│   │   │   ├── index.html
│   │   │   ├── logo192.png
│   │   │   ├── logo512.png
│   │   │   ├── manifest.json
│   │   │   └── robots.txt
│   │   ├── README.md
│   │   └── src
│   │       ├── App.css
│   │       ├── App.js
│   │       ├── App.test.js
│   │       ├── index.css
│   │       ├── index.js
│   │       ├── logo.svg
│   │       ├── reportWebVitals.js
│   │       └── setupTests.js
│   └── proto.png
├── img
│   ├── logo.png
│   └── pipelineDiagram.png
├── LICENSE
├── logging
│   └── dummy.txt
├── notebook
│   └── Amharic_news_Classification.ipynb
├── README.md
├── requirements.txt
├── screenshots
│   ├── airflowscreenshoot.png
│   └── design diagram.png
└── testing
    ├── dummy.txt
    └── test_dataloading.py

17 directories, 39 files

Run Locally

Clone the project

  git clone https://github.com/create-speech-to-text-pipeline/pipeline

Go to the project directory

  cd pipeline

Install dependencies

  pip3 install -r requirements.txt

Set up pipeline

  python3 setup.py

Screenshots

App Screenshot

Authors

pipeline's People

Contributors

akrobi, haylemicheal, kaydeejr, mohammedesamaldin, nahomhmichael, yonamg

pipeline's Issues

Create a Kafka cluster

  • Following "Installing a Kafka Cluster and Creating a Topic" (Hands-on Labs, A Cloud Guru), set up a cluster on your assigned AWS machine.

  • Your cluster will be responsible for creating a Delta Lake: an S3 bucket that stores the Spark-transformed streaming data from users reading the texts you showed them. (Hint: you will write code that generates an ID for a randomly selected text and its audio equivalent, receives an ID from an API, and sends the ID and audio back as JSON to a Kafka-like URL.)
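The hint above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the function names `pick_text` and `build_audio_event`, the field names `id`, `text`, and `audio`, and the choice of base64 for embedding audio in JSON are all assumptions.

```python
import base64
import json
import random
import uuid


def pick_text(corpus, rng=random):
    """Pick a random sentence from the corpus and attach a fresh ID to it."""
    return {"id": str(uuid.uuid4()), "text": rng.choice(corpus)}


def build_audio_event(text_id, audio_bytes):
    """Serialize the ID and the recorded audio as one JSON message value.

    Raw audio bytes are not valid JSON, so they are base64-encoded here
    (an assumption; the project may store audio differently).
    """
    payload = {
        "id": text_id,
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return json.dumps(payload).encode("utf-8")
```

The bytes returned by `build_audio_event` could then be handed to whichever Kafka client the backend uses as the message value.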

Backend

Prepare API endpoints for Kafka using Flask.
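One possible shape for such an endpoint is sketched below. The route `/audio`, the topic name `audio-topic`, the JSON fields, and the injected `send` callable are all hypothetical; the repository's `backend` directory contains only a placeholder file, so none of this reflects actual project code.

```python
from flask import Flask, jsonify, request


def create_app(send):
    """Build the API. `send(topic, value)` is an injected publish function,
    which keeps the Flask layer testable without a running Kafka broker."""
    app = Flask(__name__)

    @app.route("/audio", methods=["POST"])
    def post_audio():
        # Treat a missing or malformed JSON body as an empty payload.
        body = request.get_json(silent=True) or {}
        if "id" not in body or "audio" not in body:
            return jsonify(error="both 'id' and 'audio' are required"), 400
        send("audio-topic", body)  # hypothetical topic name
        return jsonify(status="queued", id=body["id"]), 202

    return app
```

In a deployment, `send` might wrap kafka-python's `KafkaProducer.send(topic, value=...)` with a JSON `value_serializer`; in tests it can simply append to a list.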

Planning and design

  • Build or simulate a Kafka event source for the text corpus. You should read "Breaking News: Everything Is An Event! (Streams, Kafka And You)" (florimond.dev).

  • Develop an overview of your approach and document it. Explain why you chose this approach and these tools, and how the approach will provide a good data source for the clients’ speech-to-text ML engine. Explain the purpose of each tool; you should be able to defend each choice if asked, not simply present Python code.
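For the "simulate a Kafka event source" item above, a generator can stand in for a topic while the design is still being worked out. This is a sketch under assumptions: the function name `simulate_text_events` and the event fields `offset`, `text_index`, and `text` are illustrative and do not come from the project.

```python
import itertools
import json
import time


def simulate_text_events(corpus, delay_seconds=0.0, limit=None):
    """Yield one JSON-encoded event per sentence, cycling through the corpus.

    `limit` caps the number of events; None means the stream never ends,
    which mimics an unbounded topic.
    """
    stream = itertools.cycle(enumerate(corpus))
    for offset, (index, text) in enumerate(itertools.islice(stream, limit)):
        event = {"offset": offset, "text_index": index, "text": text}
        # ensure_ascii=False keeps non-Latin scripts such as Amharic readable.
        yield json.dumps(event, ensure_ascii=False).encode("utf-8")
        if delay_seconds:
            time.sleep(delay_seconds)
```

A consumer under development can iterate over this generator in place of a real Kafka consumer, then switch to the cluster once it is available.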

Create a JavaScript tag

EDA

A Jupyter notebook that illustrates your data exploration with professional plots: readable axis labels, titles, and legends, and a good choice of colors.
