GithubHelp home page GithubHelp logo

shipping-forecasting's Introduction

Shipping Forecasting

(Live app)

Shipping forecaster visualizes the latest Shipping Forecast bulletin, using the voices of BBC Radio 4 speakers to read it.

It downloads the text from Met Office's website using PhantomJS; the audio from BBC using get_iplayer; splits the audio files into individual words using audiogrep; and plays these audio files sequentially using p5.js.

Data sources

Extracting audio samples for individual words

I transcribed the audio broadcasts using the amazing audiogrep. Audiogrep uses pocketsphinx for audio transcription.

The problem is pocketsphinx is not accurate apparently for British broadcasting Enlgish. So I tried first to adapt its acoustic model. You can do that by "training" the existing US English model (tutorial here).

Basically you can record (or use existing) audio files of someone reading a few sentences, and use these and their transcriptions as input to train the model. I thought this was going to be a good approach, since I could take an existing broadcast and cut it up in sentences, then pair it to the official text of the broadcast as its transcription. After doing this process, which took some time, I noticed no improvement at all when compared to the original US English model.

So I tried a different approach, which was to build my own custom dictionary using only words from the Shipping Forecast.

Using the new dictionary helped a lot to improve accuracy of the transcription. It still didn't get the new words that I inputed manually into the dictionary (such as "thundery," or "squally"), but overall the transcription was much cleaner.

After the audio is transcribed, audigrep lets you do many operations with it. I was interested in extracted individual words from the audio, which can be done like this: audiogrep --input broadcast-file.mp3 --extract

Audiogrep will proceed to extract every word it has found in the transcription into individual audio files. I could use this files to play in my web app.

Creating a shipping forecast dictionary for transcription

This was done by taking the text of a sample Shipping Forecast and extracting all its words. I adapted this text cleanup function:

String.prototype.cleanup = function() {
  return this.toLowerCase().replace(/[^a-zA-Z0-9]+/g, "|");
}

(source)

The result was a string like this:

"shipping|forecast|the|shipping|forecast|issued|by|the|met|office|on|behalf|..."

I could then use this string to run a grep expression on the original pocketshpinx English dictionary, to extract only the words contained in the Shipping Forecast. The full dictionary file is a text file with one word per line, accompanied by its phonetic equivalent, like this:

accommodate AH K AA M AH D EY T
accommodated AH K AA M AH D EY T IH D
accommodates AH K AA M AH D EY T S
accommodating AH K AA M AH D EY T IH NG
accommodation AH K AA M AH D EY SH AH N
accommodations AH K AA M AH D EY SH AH N Z
accommodative AH K AA M AH D EY T IH V
accompanied AH K AH M P AH N IY D
accompanies AH K AH M P AH N IY Z
...and so on.

Grep expressions allow you to extract lines from a file that contain a pattern (could be a word, a list of words, or a Regular Expression). The expression I used looked for lines which contained any of the words in the shipping forecast:

egrep -wi 'shipping|forecast|the|shipping|forecast|issued|by|the|met|office|...' original.dict > new-dictionary.txt

(why I used egrep insted of just grep)

I had to use another expression to get rid of everything with hyphens and apostrophes:

egrep -v "'|-" new-dictionary.txt > new-dictionary-clean.txt

Another expression was used to get all the numbers:

egrep -wi 'one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eight|ninety|hundred|thousand|first|second|third|fourth|fifth|sixth|seventh|eight|ninth|tenth|eleventh|twelveth|thirteenth|fourteenth|fifteenth|sixteeth|seventeeth|eigtheenth|nineteenth|twentieth|thirtieth' original.dict > numbers.txt

After combining the numbers and cleaned up dictionary, I had a dictionary file that contained only words used in the Shipping Forecast. It looked something like this:

a AH
a(2) EY
agency EY JH AH N S IY
all AO L
and AH N D
and(2) AE N D
are AA R
are(2) ER
area EH R IY AH
areas EH R IY AH Z
at AE T
backing B AE K IH NG
bailey B EY L IY
becoming B IH K AH M IH NG
behalf B IH HH AE F
...etc.

I substituted pocketsphinx original English dictionary for this "Shipping Forecast dictionary," by replacing the file in the pocketsphinx folder.

shipping-forecasting's People

Contributors

bplmp avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.