This project forked from aalto-speech/finnish-parliament-scripts

Scripts for retrieving and aligning speech and meeting transcripts from the web portal of the Parliament of Finland (https://www.eduskunta.fi)

License: GNU General Public License v3.0

Perl 18.09% Python 81.91%

finnish-parliament-scripts

Dependencies:

  • sox
  • avconv
  • sclite
  • python3
  • python3-lxml
  • wget

An ASR system is also required to produce the first-pass hypotheses.

Download videos and meeting transcripts and save into DATA-FOLDER:

retrieve/retrieve_sessions.py DATA-FOLDER

Four different files will be saved for each session:

  • *.mp4 - video of the session
  • *.wav - audio file stored in WAV format (16 kHz, mono)
  • *.transcript - meeting transcript with speaker information for each paragraph
  • *.metadata - metadata file containing date information and links to the original video and meeting transcript
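As a quick sanity check on the retrieved audio, the stated WAV format (16 kHz, mono) can be verified with Python's standard `wave` module. This is a minimal sketch and not part of the repository; the function name `is_16khz_mono` is hypothetical.

```python
import wave


def is_16khz_mono(wav_path):
    """Return True if the WAV file has a 16 kHz sample rate and one channel.

    Hypothetical helper, not part of the repository's scripts.
    """
    with wave.open(wav_path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```

Running this over the *.wav files in DATA-FOLDER before recognition may catch files that were converted with different settings.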

EDIT: Retrieval of the meeting transcripts currently fails because the publishing format on the web portal has changed.

Produce first-pass recognition output with an ASR system (preferably train a biased LM with the meeting transcripts).

Store recognition output in the following format:

  • start-time-in-seconds end-time-in-seconds word
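The format above can be read with a few lines of Python. The sketch below assumes one word per line and whitespace-separated fields, which the README does not state explicitly; `TimedWord` and `parse_asr_output` are hypothetical names, not part of the repository.

```python
from typing import NamedTuple


class TimedWord(NamedTuple):
    """One recognized word with its time span in seconds."""
    start: float
    end: float
    word: str


def parse_asr_output(lines):
    """Parse lines of the form 'start-time end-time word'.

    Assumption: fields are separated by whitespace and each line
    holds exactly one word. Blank lines are skipped.
    """
    words = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(
                "line %d: expected 3 fields, got %d" % (line_number, len(fields))
            )
        start, end = float(fields[0]), float(fields[1])
        if end < start:
            raise ValueError("line %d: end time precedes start time" % line_number)
        words.append(TimedWord(start, end, fields[2]))
    return words
```

For a file on disk, passing the open file object works, since the function only iterates over lines: `parse_asr_output(open("asr-output"))`.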

Align the first-pass output with the meeting transcript using sclite:

align/asr_align_2_elan.py asr-output transcript-file metadata-filename elan-filename

The output is written in the ELAN EAF format.
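EAF is an XML format, so the aligned segments can be inspected with the standard library. The sketch below relies only on the general EAF layout (`TIME_SLOT` elements with millisecond `TIME_VALUE` attributes, and `ALIGNABLE_ANNOTATION` elements inside `TIER` elements); the tier names written by the alignment script are not documented here, so all tiers are returned. `read_eaf_annotations` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def read_eaf_annotations(eaf_source):
    """Return (tier_id, start_seconds, end_seconds, text) for each annotation.

    eaf_source may be a file name or an open file object.
    Hypothetical helper, not part of the repository's scripts.
    """
    root = ET.parse(eaf_source).getroot()

    # Map time slot ids to their values in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }

    annotations = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without resolved time slots
            text = ann.findtext("ANNOTATION_VALUE", default="")
            annotations.append((tier_id, start / 1000.0, end / 1000.0, text.strip()))
    return annotations
```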

Test the alignment script with example files:

align/asr_align_2_elan.py test/session_79_2008.asr test/session_79_2008.transcript test/session_79_2008.metadata test/session_79_2008.eaf

Extract individual speech segments from a list of EAF-files:

extract/elan_wav_extractor.py eaf-list wav-segment-dir

Stores both the audio file (.wav) and the transcript (.trn) for each segment.
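To illustrate what extracting one segment involves, the sketch below cuts a time span out of a WAV file using Python's standard `wave` module. It is an assumption-laden illustration and not how `elan_wav_extractor.py` is implemented (the dependency list suggests the script may rely on sox instead); `cut_wav_segment` is a hypothetical name.

```python
import wave


def cut_wav_segment(source_path, target_path, start_seconds, end_seconds):
    """Copy the span [start_seconds, end_seconds) into a new WAV file.

    Returns the number of audio frames written.
    Hypothetical helper, not part of the repository's scripts.
    """
    if start_seconds < 0:
        raise ValueError("start time must not be negative")

    with wave.open(source_path, "rb") as source:
        params = source.getparams()
        start_frame = int(round(start_seconds * params.framerate))
        # Clamp the end of the segment to the end of the file.
        end_frame = min(int(round(end_seconds * params.framerate)), params.nframes)
        if start_frame >= end_frame:
            raise ValueError("segment is empty or lies outside the file")
        source.setpos(start_frame)
        frames = source.readframes(end_frame - start_frame)

    with wave.open(target_path, "wb") as target:
        target.setnchannels(params.nchannels)
        target.setsampwidth(params.sampwidth)
        target.setframerate(params.framerate)
        target.writeframes(frames)

    return end_frame - start_frame
```

Combined with segment times taken from an EAF file, this yields one audio file per speech segment.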

Extract individual speech segments from a list of metadata files:

extract/corpus_extractor.py metadata-file-list 

Stores only the audio file (.wav) for each segment.

André Mansikkaniemi, [email protected]
