finos / greenkey-asrtoolkit
A collection of useful tools for handling speech recognition data
License: Apache License 2.0
Describe the bug
Given a list such as "a,b", clean_formatting currently just removes the comma. It should replace the comma with a space instead, in case badly formatted data has entered a data processing pipeline.
To Reproduce
Steps to reproduce the behavior:
echo a,b > foo.txt
clean_formatting foo.txt
cat foo.txt
ab
Expected behavior
a b
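A minimal sketch of the requested behavior, assuming a standalone helper (the name replace_commas_with_spaces is hypothetical and is not part of asrtoolkit). It swaps each comma for a space and collapses repeated spaces, so "a, b" does not end up with a double space. Note that this would also split numerals such as "1,000", which a real fix would probably need to handle separately.

```python
import re


def replace_commas_with_spaces(text: str) -> str:
    """Replace commas with spaces, then collapse runs of spaces and tabs."""
    spaced = text.replace(",", " ")
    return re.sub(r"[ \t]+", " ", spaced).strip()


print(replace_commas_with_spaces("a,b"))
```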
Describe the bug
Occasionally, <#s> can be output by some decoders to indicate silence.
Expected behavior
We should amend the regex matches for noise tags to incorporate this specific tag.
Additional context
This may be somewhat unclear if we group more tagged non-speech events in the same variable, so we may want to rename the variable.
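To illustrate why this tag can slip through, here is a rough sketch. It assumes the existing match is a letters-only pattern; that is a guess, and both variable names below are hypothetical. The "#" in <#s> is not a letter, so a letters-only pattern skips it, while allowing an optional "#" catches both forms.

```python
import re

# Assumed shape of a letters-only tag pattern (not taken from the repo).
LETTERS_ONLY_TAG = re.compile(r"<[A-Za-z]+>")

# Hypothetical renamed pattern that also accepts the "<#s>" silence marker.
NON_SPEECH_TAG = re.compile(r"<#?[A-Za-z]+>")

sample = "hello <#s> world <noise>"
print(LETTERS_ONLY_TAG.findall(sample))
print(NON_SPEECH_TAG.findall(sample))
```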
Given the lack of activity on this repo, I propose archiving it.
If and when activity can be restored, the repository can be easily unarchived, by either:
We will wait until Friday the 25th of November to collect feedback from the community, then we will proceed with the archival.
Archiving the repo will consist of:
Subtitle files in SRT format should be supported by this tool
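For reference, an SRT cue is a counter line, a timing line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, and the text, followed by a blank line. A small sketch of writing one cue follows; the function names are hypothetical and this is not how asrtoolkit implements it.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT cue: counter, timing line, then the subtitle text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


print(srt_cue(1, 0.0, 2.5, "okay"))
```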
With this file as test.txt
okay
yeah
okay
I get this as an output
okayyeahokay
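Presumably the words should stay separated when lines are merged. A sketch of joining lines with a single space instead of concatenating them directly (join_lines is a hypothetical helper, not an asrtoolkit function):

```python
def join_lines(raw: str) -> str:
    """Join non-empty lines with single spaces so words do not run together."""
    return " ".join(line.strip() for line in raw.splitlines() if line.strip())


print(join_lines("okay\nyeah\nokay\n"))
```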
The current audio corpora prep seems to only work on SPH files. In addition, the current description says this:
Note that filenames with hyphens will be sanitized to underscores and that audio files will be forced to single channel, 16 kHz, signed PCM format. If two channels are present, only the first will be used.
Many corpora come in WAV files instead of SPH files, and many also have two unmixed channels that need to be mixed to properly account for all audio.
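As a sketch of what mixing could look like, the snippet below averages interleaved left/right 16-bit samples into one channel instead of dropping the second channel. It only shows the arithmetic on in-memory samples; downmix_to_mono is a hypothetical name, and reading WAV or SPH files is left out.

```python
from array import array


def downmix_to_mono(interleaved: array) -> array:
    """Average interleaved left/right signed 16-bit samples into one channel."""
    left = interleaved[0::2]
    right = interleaved[1::2]
    # Averaging keeps the result inside the signed 16-bit range.
    return array("h", ((l + r) // 2 for l, r in zip(left, right)))


stereo = array("h", [100, 200, -100, 100])
print(downmix_to_mono(stereo).tolist())
```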
Is your feature request related to a problem? Please describe.
Noise tags [noise] or <noise> are not removed by the text clean up routines.
Describe the solution you'd like
These should be removed using a regex that matches A-Za-z words in square brackets or angle brackets.
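One possible shape for that regex, as a sketch only (NOISE_TAG and remove_noise_tags are hypothetical names, not existing asrtoolkit code). Each tag is replaced with a space and the whitespace is then normalized, so neighboring words are not glued together.

```python
import re

# Matches [word] or <word>, where word is ASCII letters only.
NOISE_TAG = re.compile(r"\[[A-Za-z]+\]|<[A-Za-z]+>")


def remove_noise_tags(line: str) -> str:
    """Drop bracketed noise tags and normalize the remaining whitespace."""
    without_tags = NOISE_TAG.sub(" ", line)
    return " ".join(without_tags.split())


print(remove_noise_tags("so [noise] we <noise> agreed"))
```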
In some cases, it's desirable to simply extract a readable transcript out of JSON. For example, a text representation or well-formatted subtitles. In these cases, it would be useful to have convert_transcript output the formatted / punctuated transcript instead of the unformatted one.
As agreed in our last PMC meeting, please move the project roadmap from Confluence to this project, wherever it fits best (I suggest the Wiki, with a link from the README).
https://finosfoundation.atlassian.net/wiki/spaces/VOICE/pages/906133835/GreenKey+ASR+Toolkit+Roadmap
What kind of inputs does the wer tool support? Could you include some examples in the documentation?
Describe the bug
Line 58 of align_json.py: https://github.com/finos/greenkey-asrtoolkit/blob/master/asrtoolkit/align_json.py
Should this be "align_json(", not "align_gk_json("? I don't see align_gk_json anywhere else in the repo.
Is your feature request related to a problem? Please describe.
GK JSON schema has been better defined since asrtoolkit first output to it.
Describe the solution you'd like
They should be floats
Describe alternatives you've considered
We could maintain both or switch to strings
Additional context
GK schema last edited by @burrows
Hi, I use Python 3.5.3 :: Anaconda custom (x86_64), and when I ran the command wer answer.txt stt.txt the result was a WER of 400%.
A coworker tried this with the same files and the same command on Python 3.6.8, and he got a result that seemed normal, a WER of 49.485%. His computer runs Ubuntu and I use Mac OS.
Other coworkers (one on Ubuntu, one on Windows) also tried and got the 400% result like me. Can you help me figure out why the calculated result differs between us even though we used the same files and the same command?
Thank you!!!
Is your feature request related to a problem? Please describe.
We need to provide clean HTML output that can be served in front end applications
Describe the solution you'd like
We should be able to return HTML output as a string or as a file
Many use cases call sanitize hyphens directly. However, there is no option in these cases to disable the warning message.
Need to support VTT format for file conversion
Computing the word error rate frequently takes a long time and a lot of memory.
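One way to reduce memory, sketched under the assumption that WER is computed as word-level edit distance divided by the reference length: keep only two rows of the dynamic programming table instead of the full matrix. This is an illustration, not the toolkit's implementation, and word_error_rate is a hypothetical name. Running time is still proportional to the product of the two lengths. Since insertions count as errors, the result can exceed 100% when the hypothesis is much longer than the reference.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance over reference length, as a percentage."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


print(word_error_rate("a b c d", "a x c d"))
```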
Describe the solution you'd like
validate_stm should be a command line tool
Describe alternatives you've considered
Users can invoke this by writing python code
Additional context
Requested by users of asrtoolkit
Describe the bug
The wer script accepts pretty much anything as input and will spit out a random WER. This conflates a problem with the wer script and a simple filename error. I would suggest adding
To Reproduce
Steps to reproduce the behavior:
> wer asdf asd
WER: 0.000%
> wer ^ ^
WER: 0.000%
> wer aaaaaaaaaaaaaaa eeeeeeeeeeeeeeeeeee
WER: 100.000%
> wer --ignore-nsns I-like to-eat-garbage
WER: 150.000%
> wer seg_data/gkt_corpora_earnings_test_AAPL2017q1TranscriptMturkTest_seg_0001. seg_data/gkt_corpora_earnings_test_AAPL2017q1TranscriptMturkTest_seg_0001.st
WER: 11.111%
Expected behavior
I would expect the wer script to error when given inputs that aren't valid (i.e. not the expected file format, imaginary files, etc.).
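A rough sketch of the kind of check meant here, assuming both arguments are supposed to be paths to existing files (check_input_files and main are hypothetical names; the real wer entry point may be structured differently). It reports every bad path on stderr and returns a non-zero status instead of printing a WER.

```python
import os
import sys


def check_input_files(paths):
    """Return one error message per path that is not an existing file."""
    return [f"not a file: {path}" for path in paths if not os.path.isfile(path)]


def main(argv):
    """Return exit status 1 if any input path is invalid, else 0."""
    errors = check_input_files(argv)
    for message in errors:
        print(message, file=sys.stderr)
    return 1 if errors else 0
```

A wrapper script could then call sys.exit(main(sys.argv[1:])) before doing any scoring.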
Is your feature request related to a problem? Please describe.
Phone numbers in transcripts are mapped to long series of numerals
Describe the solution you'd like
They should be mapped to spelled-out single digits
Examples:
1-317-222-222 should map to the series of numbers 'one', 'three', 'one', 'seven', etc.
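A sketch of this mapping, assuming phone numbers appear as hyphen-separated digit groups like the example above. The pattern and function names are hypothetical, and the pattern is deliberately narrow: a real implementation would need to cover more formats.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# Assumed format: a 1-3 digit group followed by two or three groups of 3-4 digits.
PHONE_NUMBER = re.compile(r"\b\d{1,3}(?:-\d{3,4}){2,3}\b")


def spell_digits(match):
    """Spell out every digit of the matched phone number, dropping hyphens."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in match.group() if ch.isdigit())


def spell_phone_numbers(text: str) -> str:
    """Replace each phone-number-like span with its spelled-out digits."""
    return PHONE_NUMBER.sub(spell_digits, text)


print(spell_phone_numbers("call 1-317-222-222 now"))
```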
Describe the solution you'd like
The quantity "$6.5B" should map to "six point five billion" in the un-formatted text
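A possible sketch for this case, limited to amounts with a one- or two-digit whole part and a scale suffix. The suffix table (K, M, B, T) and all names are assumptions, and the word "dollars" is left out only because the example above leaves it out.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
# Assumed suffix meanings; only "B" appears in the request.
SCALES = {"K": "thousand", "M": "million", "B": "billion", "T": "trillion"}

AMOUNT = re.compile(r"\$([0-9]{1,2})(?:\.([0-9]+))?([KMBT])\b")


def two_digit_words(number: int) -> str:
    """Spell out an integer from 0 to 99."""
    if number < 20:
        return ONES[number]
    tens, ones = divmod(number, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def spell_amount(match):
    """Spell the whole part, then 'point' plus each decimal digit, then the scale."""
    whole, fraction, scale = match.groups()
    words = [two_digit_words(int(whole))]
    if fraction:
        words.append("point")
        words.extend(ONES[int(digit)] for digit in fraction)
    words.append(SCALES[scale])
    return " ".join(words)


def spell_amounts(text: str) -> str:
    return AMOUNT.sub(spell_amount, text)


print(spell_amounts("$6.5B"))
```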
Testing coverage gap here
The Excel spreadsheet reader in asrtoolkit (as written) requires pandas, which depends on numpy. The latter module was removed from requirements.txt on 1/28/19 to address test failures on CircleCI (cause undetermined). This script is rarely used internally and does not affect the rest of the codebase, but it is used by clients.
To Do:
Fix the Excel spreadsheet reader in asrtoolkit using a library other than pandas. One option would be to use csv with the excel format option.
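A sketch of the csv option using only the standard library. Note that this reads CSV files exported from Excel, not binary .xlsx workbooks, so it only fits if clients can supply CSV. The column names in the sample are hypothetical.

```python
import csv
import io


def read_excel_csv(handle):
    """Read rows as dicts using the csv module's built-in 'excel' dialect."""
    return list(csv.DictReader(handle, dialect="excel"))


# In-memory stand-in for a CSV file exported from Excel.
sample = io.StringIO('start,stop,text\r\n0.0,1.5,"hello, world"\r\n')
rows = read_excel_csv(sample)
print(rows[0]["text"])
```

When reading from disk, the csv documentation recommends opening the file with newline="" so that quoted line breaks are handled correctly.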
Basic JSON formatting should be supported