Arabic-Voice-Interface-for-City-Operation-Center

About

A city operations center (COC) enables smart-city operators to integrate data from different sectors and agencies, manage resources, and engage with citizens to address their concerns.

Giza Systems offers a software platform for city operation centers that enables operators to manage IoT assets in smart cities: collecting data from these assets, creating alarms based on the received data, calculating KPIs, configuring schedulers, managing Standard Operating Procedures (SOPs), building dashboards, and training ML models.

Description

In this project, we aimed to build a voice interface for the Asset-360 view screen. The COC operator can use this voice interface simply by asking questions related to an asset, and the interface replies with the answers to the operator's questions.

The answer to the operator's question is given in the same language as the question (AR -> AR or ENG -> ENG). This saves operators time that would otherwise be spent navigating through the screens of different assets.

The goal of this project is to map the operator's questions to the relevant pieces of information about the asset.

Table of Contents

  1. Introduction
  2. Project Structure
  3. Getting Started
  4. Pipeline
    1. Speech to Text
    2. Text Translation 1
    3. Rasa Chatbot
    4. Text Translation 2
    5. Text to Speech
    6. LipSync
    7. Face Restoration
    8. Django Integration
  5. Running The Pipeline
  6. Execution Time
  7. Examples
  8. Team Members
  9. Contributing
  10. Future Work
  11. Acknowledgments

Introduction

This repository contains the full code for an Arabic & English virtual assistant.

It was developed as a final graduation project for ITI Intake 43 AI Mansoura Branch in July 2023, under the supervision of Giza Systems.

Project Structure

├── Interface
│   ├── google_app
│   ├── interface
├── data
├── notebooks
├── src
│   ├── rasa
│   ├── speechtotext
│   ├── texttospeech
│   ├── translation
│   └── wav2lip
└── utils

The repository is organized as follows:

  • Interface/: This directory contains the Django project, with google_app as the Django app.

  • data/: This directory contains the dataset used for training and evaluation. It includes both the raw data and preprocessed versions, if applicable.

  • notebooks/: This directory contains Jupyter notebooks that provide step-by-step explanations of the data exploration, preprocessing, model training, and evaluation processes.

  • src/: This directory contains the source code for the project, including data preprocessing scripts, model training scripts, and evaluation scripts.

  • utils/: This directory contains utility functions and helper scripts used throughout the project.

Getting Started

  • It is recommended to set up a virtual environment for this project using Python 3.8.16.
  • You need to provide API keys for Google Cloud Services and Azure Cognitive Speech Services in the following modules (a hedged configuration sketch follows this list):
    • utils/detect_language.py
    • src/translation/azure_translator.py
    • src/translation/google_translator.py
    • src/texttospeech/google_text_to_speech.py
    • src/texttospeech/azure_text_to_speech.py
    • src/speechtotext/google_speech_to_text.py
    • src/speechtotext/azure_speech_to_text.py
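
The snippet below is a hypothetical pattern for supplying those keys rather than the repository's actual configuration: it reads the credentials from environment variables so they stay out of the source files. The Azure variable names are placeholders; GOOGLE_APPLICATION_CREDENTIALS is the standard variable that the Google Cloud client libraries look for. The listed modules may instead expect the keys to be pasted in directly.

    # Hypothetical pattern: read credentials from environment variables (names are placeholders).
    import os

    AZURE_SPEECH_KEY = os.environ["AZURE_SPEECH_KEY"]
    AZURE_SPEECH_REGION = os.environ["AZURE_SPEECH_REGION"]
    # Standard variable used by Google Cloud client libraries: path to a service-account JSON key.
    google_credentials_path = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]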

To get started with the project, follow these steps:

  1. Clone the repository:

    git clone https://github.com/Aylore/Arabic-Voice-Interface-for-City-Operation-Center
    
  2. Change directory into the repository:

    cd Arabic-Voice-Interface-for-City-Operation-Center
    
  3. Install the required dependencies:

    make install
    

You will only need to do the following steps the first time (feel free to use your own models instead):

  1. Download the pretrained weights for the Wav2Lip model using
    make wav2lip-model
    
  2. Train the Rasa chatbot using
    make rasa-train
    

Pipeline

  1. Speech To Text

    The first step of the pipeline is to transcribe the user's spoken question into text using a speech-to-text system. We use the Azure Speech Services API for this task. For more information, check the SST-online branch README, where we compare speech-to-text services, including AWS and Google Cloud. (A hedged end-to-end sketch of steps 1-5 appears after this list.)

  2. Text Translation 1

    If the user asks the question in Arabic, the text is translated to English before the question is fed to the chatbot.

  3. Rasa Chatbot

    After getting the transcript of the question, the chatbot generates a response based on the intent and entities identified in the question. To retrieve the answer, it calls an API endpoint.

  4. Text Translation 2

    If the question was asked in Arabic, the answer is translated from English back to Arabic after it is returned by the chatbot and before the audio file is generated.

  5. Text To Speech

    After getting the response from our chatbot, we use the Azure Speech SDK to synthesize it into an audio file. The audio file can be played back to the user as the chatbot's spoken response.

  6. LipSync

    After getting the audio response, we needed to present the answer to the user in a convenient way, so we trained a lip-sync model, on an agent of our choosing, using the current SOTA model Wav2Lip. For more information, check the training notebook and refer to this branch. (An invocation sketch appears after this list.)

  7. Face Restoration

    Because the output quality of the Wav2Lip model is limited, we used an image enhancement model, CodeFormer, to restore the face quality.

  8. Django Integration

    After the video response is generated, we send the response to a Django web application. The Django application can then display the video response to the user, along with any additional information or functionality needed.
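
To make the data flow concrete, here is a minimal sketch that strings steps 1-5 together in plain Python. It is an illustration, not the project's actual code: it assumes the Azure Speech SDK for speech-to-text and text-to-speech, the Google Cloud Translate v2 client for language detection and translation, and a Rasa server exposing its standard REST webhook on localhost:5005; keys, voice names, and file paths are placeholders.

    # Minimal sketch of pipeline steps 1-5 (illustration only, not the project's code).
    import azure.cognitiveservices.speech as speechsdk
    import requests
    from google.cloud import translate_v2 as translate

    AZURE_KEY, AZURE_REGION = "YOUR_AZURE_KEY", "YOUR_AZURE_REGION"  # placeholders
    translator = translate.Client()  # uses GOOGLE_APPLICATION_CREDENTIALS

    def answer_question(question_wav: str, answer_wav: str = "answer.wav") -> str:
        # 1. Speech to Text (Azure): transcribe the operator's recorded question.
        stt_cfg = speechsdk.SpeechConfig(subscription=AZURE_KEY, region=AZURE_REGION)
        stt_cfg.speech_recognition_language = "ar-EG"  # or "en-US"
        audio_in = speechsdk.audio.AudioConfig(filename=question_wav)
        question = speechsdk.SpeechRecognizer(
            speech_config=stt_cfg, audio_config=audio_in
        ).recognize_once().text

        # 2. Text Translation 1: detect the language; translate Arabic questions to English.
        is_arabic = translator.detect_language(question)["language"].startswith("ar")
        question_en = (
            translator.translate(question, target_language="en")["translatedText"]
            if is_arabic else question
        )

        # 3. Rasa Chatbot: the REST channel returns a list of bot messages.
        messages = requests.post(
            "http://localhost:5005/webhooks/rest/webhook",
            json={"sender": "operator", "message": question_en},
            timeout=10,
        ).json()
        answer_en = " ".join(m.get("text", "") for m in messages)

        # 4. Text Translation 2: translate the answer back into the question's language.
        answer = (
            translator.translate(answer_en, target_language="ar")["translatedText"]
            if is_arabic else answer_en
        )

        # 5. Text to Speech (Azure): synthesize the answer into an audio file.
        tts_cfg = speechsdk.SpeechConfig(subscription=AZURE_KEY, region=AZURE_REGION)
        tts_cfg.speech_synthesis_voice_name = (
            "ar-EG-SalmaNeural" if is_arabic else "en-US-JennyNeural"
        )
        audio_out = speechsdk.audio.AudioOutputConfig(filename=answer_wav)
        speechsdk.SpeechSynthesizer(
            speech_config=tts_cfg, audio_config=audio_out
        ).speak_text_async(answer).get()
        return answer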
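
For step 6, the lip-synced video can be produced by driving the inference script from the upstream Wav2Lip repository. The call below is a hedged sketch: it assumes the standard inference.py interface of that repository, and the checkpoint, face video, and output paths are placeholders to be adapted to this repo's layout. Step 7 is run analogously with CodeFormer's inference script (see the CodeFormer repository for its exact arguments).

    # Sketch only: run the upstream Wav2Lip inference.py on the TTS output (step 6).
    import subprocess

    subprocess.run(
        [
            "python", "inference.py",                            # script from the Wav2Lip repository
            "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # weights fetched by `make wav2lip-model` (path is a placeholder)
            "--face", "agent.mp4",                               # video of the chosen agent's face
            "--audio", "answer.wav",                             # audio produced by the TTS step
            "--outfile", "results/answer_lipsync.mp4",           # placeholder output path
        ],
        check=True,
    )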

Running The Pipeline

  1. Run the FastAPI uvicorn server.
    make fastapi
    
  2. Start the Rasa API (a quick smoke-test sketch follows these steps).
    make rasa-run
    
  3. Run the Rasa action server to fetch data from the API.
    make rasa-actions
    
  4. Run the Django server to use the interface.
    make django
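
Once the servers are up, a quick way to verify that the Rasa side of the pipeline responds is to hit its standard REST endpoints directly. The sketch below uses Rasa's default port (5005), its /version route, and the REST webhook; the question text is only an example.

    # Smoke test against a locally running Rasa server (default port 5005).
    import requests

    print(requests.get("http://localhost:5005/version", timeout=5).json())

    reply = requests.post(
        "http://localhost:5005/webhooks/rest/webhook",
        json={"sender": "smoke-test", "message": "What is the status of the asset?"},
        timeout=10,
    ).json()
    print(reply)  # a list of bot messages, e.g. [{"recipient_id": "...", "text": "..."}]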
    

Execution Time

  • Speech To Text: ~2 s
  • Translation: ~2 s
  • Chatbot: ~250 ms
  • Text To Speech: ~2 s
  • Wav2Lip: ~30-40 s
  • Face restoration: ~4-7 min

These numbers were measured on an M1 MacBook Air with 16 GB of RAM.

Examples

Chatbot



English

english-example.mp4


Arabic

arabic-example.mp4

Team Members

  • Ahmed Elghitany
  • Israa Okil
  • Khaled Ehab
  • Osama Oun

Contributing

If you would like to contribute to this project, feel free to make a pull request or contact one of the team members listed above.

Future Work

  • Modify the face restoration step to use a simpler face detection model, or find a way to combine it with Wav2Lip; this needs further research.
  • Collect feedback from users after they receive an answer, to find areas of development and further enhance the pipeline.
  • Build an end-to-end Arabic pipeline with an Arabic chatbot so that no translation is needed.

Acknowledgments

  1. Wav2Lip: "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", ACM Multimedia 2020.

  2. CodeFormer: "Towards Robust Blind Face Restoration with Codebook Lookup Transformer", NeurIPS 2022.
