psmit / finnish-parliament-scripts Goto Github PK

Scripts for retrieving and aligning speech and meeting transcripts from the web portal of the Parliament of Finland (https://www.eduskunta.fi)

License: GNU General Public License v3.0

Perl 18.09% Python 81.91%

finnish-parliament-scripts

Dependencies:

sox
avconv
sclite
python3
python3-lxml
wget
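Before running the scripts it can help to confirm that the dependencies listed above are installed. The following is a minimal sketch, not part of the repository: the function name `missing_dependencies` is hypothetical, and it assumes that the `python3-lxml` package provides a Python module named `lxml`.

```python
import importlib.util
import shutil

# Command-line tools named in the dependency list.
REQUIRED_TOOLS = ["sox", "avconv", "sclite", "python3", "wget"]
# Python modules; python3-lxml is assumed to provide the "lxml" module.
REQUIRED_MODULES = ["lxml"]


def missing_dependencies(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES):
    """Return the names of tools and modules that cannot be found."""
    # shutil.which returns None when the executable is not on PATH.
    missing = [tool for tool in tools if shutil.which(tool) is None]
    # find_spec returns None when a top-level module is not importable.
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing
```

Calling `missing_dependencies()` with no arguments returns an empty list when everything in the list above was found.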

An ASR system is also required to produce the first-pass hypotheses.

Download videos and meeting transcripts and save into DATA-FOLDER:

retrieve/retrieve_sessions.py DATA-FOLDER

Four different files will be saved for each session:

*.mp4 - video of the session
*.wav - audio file stored in WAV format (16 kHz, mono)
*.transcript - meeting transcript with speaker information for each paragraph
*.metadata - metadata file containing date information and links to the original video and meeting transcript
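A quick way to see which sessions were only partly downloaded is to group the files in DATA-FOLDER by base name and report the missing extensions. This is a sketch under one assumption: the four files of a session share a base name, as the example files under test/ do (session_79_2008.*). The function name `incomplete_sessions` is hypothetical and not part of the repository.

```python
from pathlib import Path

# The four per-session file types listed above.
SESSION_EXTENSIONS = (".mp4", ".wav", ".transcript", ".metadata")


def incomplete_sessions(data_folder):
    """Map each session base name to the sorted list of missing extensions."""
    found = {}
    for path in Path(data_folder).iterdir():
        if path.is_file() and path.suffix in SESSION_EXTENSIONS:
            found.setdefault(path.stem, set()).add(path.suffix)
    expected = set(SESSION_EXTENSIONS)
    return {
        stem: sorted(expected - extensions)
        for stem, extensions in found.items()
        if extensions != expected
    }
```

A session that has all four files does not appear in the result, so an empty dictionary means every session found in the folder is complete.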

EDIT: Currently the retrieval of the meeting transcripts fails because the publishing format has changed.

Produce first-pass recognition output with an ASR system, preferably one whose language model (LM) has been biased by training it on the meeting transcripts.

Store recognition output in the following format:

start-time-in-seconds end-time-in-seconds word
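As an illustration of this format, the sketch below writes and reads such lines. It assumes one word per line, whitespace-separated fields, and times as decimal seconds; the exact precision and separator expected by the alignment script are not stated here, so the two-decimal formatting is only a guess. Both function names are hypothetical.

```python
def format_asr_line(start, end, word):
    """Format one recognized word as 'start end word' (times in seconds)."""
    return f"{start:.2f} {end:.2f} {word}"


def parse_asr_line(line):
    """Parse one 'start end word' line into (start, end, word)."""
    # Assumes exactly three whitespace-separated fields per line.
    start, end, word = line.split()
    return float(start), float(end), word
```

For example, `format_asr_line(0.5, 1, "arvoisa")` yields the line `0.50 1.00 arvoisa`.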

Align the first-pass output with the meeting transcript using sclite:

align/asr_align_2_elan.py asr-output transcript-file metadata-filename elan-filename

The output is in the ELAN EAF format.
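EAF files are XML, so the aligned segments can be inspected with the Python standard library. The sketch below reads every time-aligned annotation from every tier. It relies only on the general EAF layout (TIME_SLOT elements with times in milliseconds, and ALIGNABLE_ANNOTATION elements inside TIER elements); which tiers the alignment script actually writes is not specified here, so no tier names are assumed. The function name `read_eaf_annotations` is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_eaf_annotations(eaf_path):
    """Return (tier_id, start_seconds, end_seconds, text) for each annotation."""
    root = ET.parse(eaf_path).getroot()
    # Map time slot ids to their values, which EAF stores in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    annotations = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without resolved times
            text = ann.findtext("ANNOTATION_VALUE", default="")
            annotations.append((tier.get("TIER_ID"), start / 1000, end / 1000, text))
    return annotations
```

Running it on the file produced by the test command further down (test/session_79_2008.eaf) would list the aligned segments with their start and end times in seconds.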

Test the alignment script with example files:

align/asr_align_2_elan.py test/session_79_2008.asr test/session_79_2008.transcript test/session_79_2008.metadata test/session_79_2008.eaf

Extract individual speech segments from a list of EAF-files:

extract/elan_wav_extractor.py eaf-list wav-segment-dir

Stores both the audio file (.wav) and the transcript (.trn) for each segment.

Extract individual speech segments from a list of metadata files:

extract/corpus_extractor.py metadata-file-list

Stores the audio file (.wav) for each segment.

André Mansikkaniemi, [email protected]
