
automated-transcription-service's Introduction

Social science researchers using qualitative methods, especially in-depth interviews and focus groups, typically need audio recordings transcribed into accurate text for analysis. Currently, many researchers use commercial automated transcription services such as Temi, Trint, or Otter.ai, which are well known to social science researchers and provide easy-to-use web interfaces for uploading audio files and downloading transcripts. These services are also more accessible to graduate students, who do not have internal departmental account numbers for billing and typically pay out-of-pocket for external services. However, these services come with important data security concerns. Most do not provide the kind of security documentation required for data steward approval, and many have not signed a Business Associate Agreement with the university, meaning they are not approved for use with HIPAA-protected data.

We believe that cloud machine learning APIs provide a powerful alternative for researchers. Thus far, social scientists have not made full use of this option, in part, we believe, because using these services efficiently requires technical skills that many social scientists do not have, or do not have time to learn. Other social scientists, especially graduate students, have used these services but do not have access to the same cloud environment as faculty, meaning that their data, when stored in a free or student account, do not receive the same security protections.

Thus, we seek to provide a new service to researchers that will make audio transcription convenient, efficient, and accessible to them, even without technical skills. For researchers, this will provide an affordable and secure option for quickly producing automated transcripts of research-related recordings.

This project contains the following folders:

  • aws: A Terraform pipeline that accepts audio files in an S3 input bucket and converts them to docx with the help of a Python script. Output files are placed in a separate S3 output bucket.
  • google: A Python script that converts JSON to docx only.

automated-transcription-service's People

Contributors

alan-walsh, emilymeanwell, pcberg


automated-transcription-service's Issues

Create ECR images using Terraform

Can/should we automate the creation of the ECR container images using Terraform? CI/CD? Probably.

Becca has done some preliminary work on this:

resource "aws_ecr_repository" "repo" {
  name = local.ecr_repository_name
}

# Rebuild and push the container image whenever the Python source or the
# Dockerfile changes.
resource "null_resource" "ecr_image" {
  triggers = {
    python_file = var.python
    docker_file = var.docker
  }

  provisioner "local-exec" {
    command = <<EOF
aws ecr get-login-password --region ${var.region} | docker login --username AWS --password-stdin ${local.account_id}.dkr.ecr.${var.region}.amazonaws.com
cd ${path.module}/lambdas/git_client
docker build -t ${aws_ecr_repository.repo.repository_url}:${local.ecr_image_tag} .
docker push ${aws_ecr_repository.repo.repository_url}:${local.ecr_image_tag}
EOF
  }

  # TODO: do we need to include a path to requirements.txt?
}

data "aws_ecr_image" "lambda_image" {
  depends_on = [
    null_resource.ecr_image
  ]
  repository_name = local.ecr_repository_name
  image_tag       = local.ecr_image_tag
}

data "aws_iam_policy_document" "lambda_docx" {
  statement {
    sid    = "CreateWatchLogs"
    effect = "Allow"
    actions = [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents"
    ]
    resources = [""] # TODO: scope to the log group ARN
  }

  statement {
    sid    = "CodeCommit"
    effect = "Allow"
    actions = [
      "codecommit:GitPull",
      "codecommit:GitPush",
      "codecommit:GetBranch",
      "codecommit:ListBranches",
      "codecommit:CreateCommit",
      "codecommit:GetCommit",
      "codecommit:GetCommitHistory",
      "codecommit:GetDifferences",
      "codecommit:GetReferences",
      "codecommit:BatchGetCommits",
      "codecommit:GetTree",
      "codecommit:GetObjectIdentifier",
      "codecommit:GetMergeCommit"
    ]
    resources = [""] # TODO: scope to the repository ARN
  }
}

resource "aws_iam_policy" "lambda_docx" {
  name   = "${local.prefix}-lambda-policy"
  path   = "/"
  policy = data.aws_iam_policy_document.lambda_docx.json
}

# Environment variables for the Lambda function (fragment; belongs inside the
# aws_lambda_function resource):
environment {
  variables = {
    # BUCKET     = aws_s3_bucket.upload.id
    MPLCONFIGDIR = "/tmp"
    WEBHOOK_URL  = "https://indiana.webhook.office.com/webhookb2/xxxxxxxxxxx..."
    BUCKET       = aws_s3_bucket.download.id
  }
}

Missing transcript

If no speech is detected in the audio file, Transcribe returns a JSON document without the expected speech segments, which currently results in an exception. We need to check for that case instead of assuming the segments are always present. The error below points to at least one location in the code where this is an issue:

[ERROR] TypeError: 'NoneType' object is not subscriptable
Traceback (most recent call last):
  File "/function/transcribe_to_docx.py", line 756, in lambda_handler
    speech_segments = create_turn_by_turn_segments(transcript, isSpeakerMode = True)
  File "/function/transcribe_to_docx.py", line 538, in create_turn_by_turn_segments
    for segment in data["results"]["speaker_labels"]["segments"]:
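The traceback shows `data["results"]["speaker_labels"]` being subscripted when it is `None`. A small defensive accessor (a sketch; the helper name is ours, and the field names follow the Transcribe output schema) would let `create_turn_by_turn_segments` return an empty transcript instead of crashing:

```python
def get_speaker_segments(data):
    """Return the speaker-label segments from a Transcribe JSON document,
    or an empty list when no speech was detected (in that case the
    'speaker_labels' key may be missing or null)."""
    labels = (data.get("results") or {}).get("speaker_labels") or {}
    return labels.get("segments") or []
```

The loop at line 538 would then become `for segment in get_speaker_segments(data):`, which is a no-op for silent recordings.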

Update webhook code for notifications.

Microsoft change requires action: https://devblogs.microsoft.com/microsoft365dev/retirement-of-office-365-connectors-within-microsoft-teams/. As of Aug 2024 Microsoft has delayed the retirement of webhooks to Dec 2025.

We should consider changing the code to post these messages to SNS, so people can configure delivery as they like. Subscribers could include a Power Automate workflow, Slack webhooks, or even direct email. That is, we should not assume Teams is the only target for notifications.
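A minimal sketch of the SNS approach (the payload shape is our own convention, not an AWS requirement; subscribers format it however they like):

```python
import json


def build_notification(job_name, status, detail=None):
    """Target-agnostic notification body; Power Automate, Slack, or
    email subscribers can each render this JSON as they see fit."""
    return json.dumps({"job": job_name, "status": status, "detail": detail})


def notify(topic_arn, job_name, status):
    import boto3  # imported here so the pure payload builder is testable offline
    sns = boto3.client("sns")
    sns.publish(
        TopicArn=topic_arn,
        Subject=f"Transcription {status}: {job_name}"[:100],  # SNS subject limit
        Message=build_notification(job_name, status),
    )
```

This keeps the Lambda code free of any Teams-specific webhook formatting; the `WEBHOOK_URL` environment variable could then be replaced with a topic ARN.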

Ruff

Add Ruff GitHub actions CI/CD workflow for PRs.
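A minimal PR-triggered workflow might look like this (a sketch; the workflow name and plain `pip install` step are our choices, not an established convention in this repo):

```yaml
name: lint
on: [pull_request]

jobs:
  ruff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install ruff
      - run: ruff check .
```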

Also add a GitHub Action that auto-updates the image (for vulnerabilities) and deploys to AWS.

Multiple EventBridge messages

Under some circumstances, EventBridge may generate duplicate events for a triggered rule:

https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-troubleshooting.html#eb-rule-triggered-more-than-once

We saw this behavior in ATS on 2023-08-15, when two messages were sent to SQS for the same event (Transcribe job end). The messages were sent 2 minutes apart, so there is some kind of delay. This is most likely related to the massive scalability and redundancy that AWS builds into these managed services behind the scenes.

If we want to prevent processing the same JSON file multiple times, we would need logic to check whether a given job has already been processed. This would be best accomplished with a step function, adding a step that writes a record to DynamoDB, which we could then check in another step.
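Even without a step function, a DynamoDB conditional write gives an atomic "first one wins" check. A sketch (table and attribute names are hypothetical; the string match on the error avoids a botocore import in this example, though real code would catch `botocore.exceptions.ClientError` and inspect its error code):

```python
def claim_job(table, job_name):
    """Record a Transcribe job in DynamoDB exactly once.

    Returns True if this invocation won the claim and should process the
    JSON, or False if a duplicate EventBridge message already claimed it.
    `table` is a boto3 DynamoDB Table resource (or anything with put_item).
    """
    try:
        table.put_item(
            Item={"JobName": job_name},
            ConditionExpression="attribute_not_exists(JobName)",
        )
        return True
    except Exception as exc:
        # botocore raises ClientError with code ConditionalCheckFailedException
        if "ConditionalCheckFailed" in str(exc):
            return False
        raise
```

The duplicate invocation simply exits early when `claim_job` returns False.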

Enable AWS batch jobs for long-running transcribe-to-docx?

We ran into the 15-minute Lambda limit for JSON-to-docx translation. We made tweaks to lower the chance of hitting the limit, but it seems that with large files (approximately 4+ hours?) we would eventually still hit it. Does it make sense to convert this to an AWS Batch job? What is the effort? Should it dynamically branch to optimize cost savings, if any? How often would we hit the limit? The only fallback when hitting the limit is running the JSON through command-line conversion.
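If we did branch dynamically, the routing logic is simple; the Batch submission side is sketched below (the queue and job-definition names are hypothetical, and the 4-hour threshold is only a rough cutoff taken from the failures observed so far):

```python
LAMBDA_SAFE_AUDIO_SECONDS = 4 * 60 * 60  # rough cutoff from observed failures


def needs_batch(audio_seconds):
    """Route long recordings to AWS Batch, short ones to Lambda."""
    return audio_seconds > LAMBDA_SAFE_AUDIO_SECONDS


def submit_docx_job(job_name, json_key):
    import boto3
    batch = boto3.client("batch")
    return batch.submit_job(
        jobName=job_name,
        jobQueue="ats-docx-queue",        # hypothetical queue name
        jobDefinition="ats-docx-jobdef",  # hypothetical job definition
        containerOverrides={
            "environment": [{"name": "JSON_KEY", "value": json_key}]
        },
    )
```

Since the container image already exists for the Lambda function, reusing it as a Batch job definition may keep the effort low.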

Capture key data in DynamoDB

Emily comments

Hi all! As I was running jobs this morning, I had a thought about the reporting stuff we talked about--extracting information from logs. In addition to knowing the file name that was transcribed (to match it to a project/person) and knowing the audio length (to help approximate the cost/value), I wonder if it would be possible to pull out the confidence score? That might give us information on at least how well Amazon thinks it's doing! The other thing is that if this is complicated to do, I'm still at a point with this service where I could manually log audio files and their lengths moving forward, or maybe write a little script I could run on the json files to extract that information without having to go to the logs? (In Amazon as well as GCP) Just a thought.

Step function and DynamoDB

Step function is probably a better way to perform Transcribe-to-Docx anyway, so this could be a step in that process. Write a record for each job into DynamoDB, which would make reporting that much easier.

  • Note: should probably happen after the Docx is created, as we would have all of the necessary data from the transcription. So pass that as a message to the next task in the step function.

Test multi channel recording

Emily will try to provide a multi-channel recording file. If she does not have one available, we can probably create one via Zoom (or other meeting) recordings. Test it in our pipeline to confirm it works.

Use SecureShare for download

Investigate the possibility of having a Lambda function create SecureShare downloads with a notification to the user (based on file prefix).

Does SecureShare support group accounts? Does it have an API?

Related: #4

Transcribe retries

Need to look at the retry configuration for the audio-to-transcribe function. Right now it might be attempting to retry a failed submission, but it should probably just notify and then stop. This requires the user to upload again, but that is really what we want. The example case was a file with AAC compression that errored.
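Since the pipeline is already managed with Terraform, disabling async retries could be a one-resource change (a sketch; the function reference here is hypothetical):

```hcl
# Stop async retries for the audio-to-transcribe Lambda so a failed
# submission (e.g. unsupported AAC input) notifies once and stops.
resource "aws_lambda_function_event_invoke_config" "audio_to_transcribe" {
  function_name          = aws_lambda_function.audio_to_transcribe.function_name
  maximum_retry_attempts = 0
}
```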

Vocabulary files

Allow users to include a vocabulary file with their transcription, either a list or a table. This is likely to make a huge difference in recordings with a lot of domain-specific language.

In the current implementation this would require some kind of clue in the audio filename: either the vocab file has the same name as the recording, or some kind of prefix signals the audio-to-transcribe Lambda function to look for the vocab file.
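One way the filename clue could work (a sketch; the `__` prefix convention is hypothetical, and Transcribe expects the vocabulary to already be registered by name via CreateVocabulary):

```python
def vocab_name_for(audio_key):
    """Hypothetical convention: an S3 key like
    'uploads/projectA__interview1.mp3' maps to the custom vocabulary
    'projectA'; keys without a '__' prefix use no vocabulary."""
    base = audio_key.rsplit("/", 1)[-1]
    return base.split("__", 1)[0] if "__" in base else None


def start_job(job_name, media_uri, audio_key):
    import boto3
    settings = {"ShowSpeakerLabels": True, "MaxSpeakerLabels": 10}
    vocab = vocab_name_for(audio_key)
    if vocab is not None:
        settings["VocabularyName"] = vocab  # must already exist in Transcribe
    boto3.client("transcribe").start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": media_uri},
        LanguageCode="en-US",
        Settings=settings,
    )
```

The same-name alternative (vocab file uploaded alongside the recording) would instead have the Lambda check the input bucket for a sibling key and create the vocabulary on the fly.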

Rich web UI presence or AWS S3 UI

We can develop a simple web application that allows for:

  • User (IU network ID) access
      • Individual file upload / download
      • Admin role
  • Reflecting the job queue that is in AWS, with some sort of "pending" state

Alternatively, this may not require a custom web page at all; it could be done via the AWS S3 UI:

https://github.com/aws-solutions/content-localization-on-aws
https://aws.amazon.com/solutions/implementations/content-localization-on-aws/
https://iu.mediaspace.kaltura.com/media/t/1_1mjwy1qi

Existing AWS quickstarts

This solution could be a starting place: https://aws.amazon.com/solutions/implementations/content-localization-on-aws/
Which is based on: https://docs.aws.amazon.com/solutions/latest/media-insights-on-aws/architecture-overview.html
Either or both could be useful.
