
automated-transcription-service's Introduction

Social science researchers using qualitative methods, especially in-depth interviews and focus groups, typically need audio recordings transcribed into accurate text for analysis. Currently, many researchers use commercial automated transcription services such as Temi, Trint, or Otter.ai, which are well known to social science researchers and provide easy-to-use web interfaces for uploading audio files and downloading transcripts. These services are also more accessible to graduate students, who do not have internal departmental account numbers for billing and typically pay out-of-pocket for external services. However, these services come with important data security concerns. Most do not provide the kind of security documentation required for data steward approval, and many have not signed a Business Associate Agreement with the university, meaning they are not approved for use with HIPAA-protected data.

We believe that cloud machine learning APIs provide a powerful alternative for researchers. Thus far, social scientists have not made full use of this option, in part, we believe, because using these services efficiently requires technical skills that many social scientists do not have, or do not have time to learn. Other social scientists, especially graduate students, have used these services but do not have access to the same cloud environment as faculty, meaning that their data, when stored in a free or student account, do not receive the same security protections.

Thus, we seek to provide a new service to researchers that will make audio transcription convenient, efficient, and accessible to them, even without technical skills. For researchers, this will provide an affordable and secure option for quickly producing automated transcripts of research-related recordings.

This project contains the following folders:

  • aws: A Terraform pipeline that accepts audio files in an S3 input bucket and converts them to docx with the help of a Python script. Output files are placed in a separate S3 output bucket.
  • google: A Python script that converts JSON to docx only.

automated-transcription-service's People

Contributors

alan-walsh, emilymeanwell, pcberg


automated-transcription-service's Issues

Create ECR images using Terraform

Can/should we automate the creation of the ECR container images using Terraform? CI/CD? Probably.

Becca has done some preliminary work on this:

resource "aws_ecr_repository" "repo" {
  name = local.ecr_repository_name
}

# Rebuild and push the container image whenever the Python source or the
# Dockerfile changes.
resource "null_resource" "ecr_image" {
  triggers = {
    python_file = var.python
    docker_file = var.docker
  }

  provisioner "local-exec" {
    command = <<EOF
aws ecr get-login-password --region ${var.region} | docker login --username AWS --password-stdin ${local.account_id}.dkr.ecr.${var.region}.amazonaws.com
cd ${path.module}/lambdas/git_client
docker build -t ${aws_ecr_repository.repo.repository_url}:${local.ecr_image_tag} .
docker push ${aws_ecr_repository.repo.repository_url}:${local.ecr_image_tag}
EOF
  }

  # TODO: do we need to include a path to requirements.txt?
}

data "aws_ecr_image" "lambda_image" {
  depends_on = [
    null_resource.ecr_image
  ]
  repository_name = local.ecr_repository_name
  image_tag       = local.ecr_image_tag
}

data "aws_iam_policy_document" "lambda_docx" {
  statement {
    sid    = "CreateWatchLogs"
    effect = "Allow"
    actions = [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents"
    ]
    resources = [""] # TODO: scope to the log group ARN
  }

  statement {
    sid    = "CodeCommit"
    effect = "Allow"
    actions = [
      "codecommit:GitPull",
      "codecommit:GitPush",
      "codecommit:GetBranch",
      "codecommit:ListBranches",
      "codecommit:CreateCommit",
      "codecommit:GetCommit",
      "codecommit:GetCommitHistory",
      "codecommit:GetDifferences",
      "codecommit:GetReferences",
      "codecommit:BatchGetCommits",
      "codecommit:GetTree",
      "codecommit:GetObjectIdentifier",
      "codecommit:GetMergeCommit"
    ]
    resources = [""] # TODO: scope to the repository ARN
  }
}

resource "aws_iam_policy" "lambda_docx" {
  name   = "${local.prefix}-lambda-policy"
  path   = "/"
  policy = data.aws_iam_policy_document.lambda_docx.json
}

# Environment variables for the Lambda function (fragment; belongs inside the
# aws_lambda_function resource):
environment {
  variables = {
    # BUCKET     = aws_s3_bucket.upload.id
    MPLCONFIGDIR = "/tmp"
    WEBHOOK_URL  = "https://indiana.webhook.office.com/webhookb2/xxxxxxxxxxx..."
    BUCKET       = aws_s3_bucket.download.id
  }
}

Missing transcript

If no speech is detected in the audio file, Transcribe returns a JSON document without the expected speech segments, which currently results in an exception. We need to check for that case instead of assuming the segments are always present. The error below points to at least one location in the code where this is an issue:

[ERROR] TypeError: 'NoneType' object is not subscriptable
Traceback (most recent call last):
  File "/function/transcribe_to_docx.py", line 756, in lambda_handler
    speech_segments = create_turn_by_turn_segments(transcript, isSpeakerMode = True)
  File "/function/transcribe_to_docx.py", line 538, in create_turn_by_turn_segments
    for segment in data["results"]["speaker_labels"]["segments"]:
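The traceback shows `data["results"]["speaker_labels"]` being subscripted when it is `None`. A small defensive accessor (a sketch; the helper name is ours, and the field names follow the Transcribe output schema) would let `create_turn_by_turn_segments` return an empty transcript instead of crashing:

```python
def get_speaker_segments(data):
    """Return the speaker-label segments from a Transcribe JSON document,
    or an empty list when no speech was detected (in that case the
    'speaker_labels' key may be missing or null)."""
    labels = (data.get("results") or {}).get("speaker_labels") or {}
    return labels.get("segments") or []
```

The loop at line 538 would then become `for segment in get_speaker_segments(data):`, which is a no-op for silent recordings.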

Update webhook code for notifications.

Microsoft change requires action: https://devblogs.microsoft.com/microsoft365dev/retirement-of-office-365-connectors-within-microsoft-teams/. As of Aug 2024 Microsoft has delayed the retirement of webhooks to Dec 2025.

We should consider changing the code to post these messages to SNS, so people can configure delivery as they like. Subscribers could include a Power Automate workflow, Slack webhooks, or even direct email. That is, we should not assume Teams is the only target for notifications.
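A minimal sketch of the SNS approach (the payload shape is our own convention, not an AWS requirement; subscribers format it however they like):

```python
import json


def build_notification(job_name, status, detail=None):
    """Target-agnostic notification body; Power Automate, Slack, or
    email subscribers can each render this JSON as they see fit."""
    return json.dumps({"job": job_name, "status": status, "detail": detail})


def notify(topic_arn, job_name, status):
    import boto3  # imported here so the pure payload builder is testable offline
    sns = boto3.client("sns")
    sns.publish(
        TopicArn=topic_arn,
        Subject=f"Transcription {status}: {job_name}"[:100],  # SNS subject limit
        Message=build_notification(job_name, status),
    )
```

This keeps the Lambda code free of any Teams-specific webhook formatting; the `WEBHOOK_URL` environment variable could then be replaced with a topic ARN.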

Ruff

Add Ruff GitHub actions CI/CD workflow for PRs.
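A minimal PR-triggered workflow might look like this (a sketch; the workflow name and plain `pip install` step are our choices, not an established convention in this repo):

```yaml
name: lint
on: [pull_request]

jobs:
  ruff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install ruff
      - run: ruff check .
```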

Also add a GitHub Action that auto-updates the image (for vulnerabilities) and deploys to AWS.

Multiple EventBridge messages

Under some circumstances, EventBridge may generate duplicate events for a triggered rule:

https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-troubleshooting.html#eb-rule-triggered-more-than-once

We saw this behavior in ATS on 2023-08-15, when two messages were sent to SQS for the same event (Transcribe job end). The messages were sent 2 minutes apart, so there is some kind of delay. This is most likely related to the massive scalability and redundancy that AWS builds into these managed services behind the scenes.

If we want to prevent processing the same JSON file multiple times, we would need logic to check whether a given job has already been processed. This would be best accomplished with a step function, adding a step that writes a record to DynamoDB, which we could then check in another step.
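Even without a step function, a DynamoDB conditional write gives an atomic "first one wins" check. A sketch (table and attribute names are hypothetical; the string match on the error avoids a botocore import in this example, though real code would catch `botocore.exceptions.ClientError` and inspect its error code):

```python
def claim_job(table, job_name):
    """Record a Transcribe job in DynamoDB exactly once.

    Returns True if this invocation won the claim and should process the
    JSON, or False if a duplicate EventBridge message already claimed it.
    `table` is a boto3 DynamoDB Table resource (or anything with put_item).
    """
    try:
        table.put_item(
            Item={"JobName": job_name},
            ConditionExpression="attribute_not_exists(JobName)",
        )
        return True
    except Exception as exc:
        # botocore raises ClientError with code ConditionalCheckFailedException
        if "ConditionalCheckFailed" in str(exc):
            return False
        raise
```

The duplicate invocation simply exits early when `claim_job` returns False.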

Enable AWS batch jobs for long-running transcribe-to-docx?

We ran into the 15-minute Lambda limit for JSON-to-docx translation. We made tweaks to lower the chance of hitting the limit, but it seems that with large files (approximately 4+ hours?) we would eventually still hit it. Does it make sense to convert this to an AWS Batch job? What is the effort? Should it dynamically branch to optimize cost savings, if any? How often would we hit the limit? The only fallback when hitting the limit is running the JSON through command-line conversion.
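If we did branch dynamically, the routing logic is simple; the Batch submission side is sketched below (the queue and job-definition names are hypothetical, and the 4-hour threshold is only a rough cutoff taken from the failures observed so far):

```python
LAMBDA_SAFE_AUDIO_SECONDS = 4 * 60 * 60  # rough cutoff from observed failures


def needs_batch(audio_seconds):
    """Route long recordings to AWS Batch, short ones to Lambda."""
    return audio_seconds > LAMBDA_SAFE_AUDIO_SECONDS


def submit_docx_job(job_name, json_key):
    import boto3
    batch = boto3.client("batch")
    return batch.submit_job(
        jobName=job_name,
        jobQueue="ats-docx-queue",        # hypothetical queue name
        jobDefinition="ats-docx-jobdef",  # hypothetical job definition
        containerOverrides={
            "environment": [{"name": "JSON_KEY", "value": json_key}]
        },
    )
```

Since the container image already exists for the Lambda function, reusing it as a Batch job definition may keep the effort low.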

Capture key data in DynamoDB

Emily comments

Hi all! As I was running jobs this morning, I had a thought about the reporting stuff we talked about--extracting information from logs. In addition to knowing the file name that was transcribed (to match it to a project/person) and knowing the audio length (to help approximate the cost/value), I wonder if it would be possible to pull out the confidence score? That might give us information on at least how well Amazon thinks it's doing! The other thing is that if this is complicated to do, I'm still at a point with this service where I could manually log audio files and their lengths moving forward, or maybe write a little script I could run on the json files to extract that information without having to go to the logs? (In Amazon as well as GCP) Just a thought.

Step function and DynamoDB

Step function is probably a better way to perform Transcribe-to-Docx anyway, so this could be a step in that process. Write a record for each job into DynamoDB, which would make reporting that much easier.

  • Note: should probably happen after the Docx is created, as we would have all of the necessary data from the transcription. So pass that as a message to the next task in the step function.

Test multi channel recording

Emily will try to provide a multi-channel recording file. If she does not have one available, we can probably create one via Zoom (or other meeting) recordings. Test it in our pipeline to confirm it works.

Use SecureShare for download

Investigate the possibility of having a Lambda function create SecureShare downloads with a notification to the user (based on file prefix).

Does SecureShare support group accounts? Does it have an API?

Related: #4

Transcribe retries

Need to look at the retry configuration for the audio-to-transcribe function. Right now it might be attempting to retry a failed submission, but it should probably just notify and then stop. This requires the user to upload again, but that is really what we want. The example case was a file with AAC compression that errored.
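Since the pipeline is already managed with Terraform, disabling async retries could be a one-resource change (a sketch; the function reference here is hypothetical):

```hcl
# Stop async retries for the audio-to-transcribe Lambda so a failed
# submission (e.g. unsupported AAC input) notifies once and stops.
resource "aws_lambda_function_event_invoke_config" "audio_to_transcribe" {
  function_name          = aws_lambda_function.audio_to_transcribe.function_name
  maximum_retry_attempts = 0
}
```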

Vocabulary files

Allow users to include a vocabulary file with their transcription, either a list or a table. This is likely to make a huge difference in recordings with a lot of domain-specific language.

In the current implementation this would require some kind of clue in the audio filename: either the vocab file has the same name as the recording, or some kind of prefix signals the audio-to-transcribe Lambda function to look for the vocab file.
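One way the filename clue could work (a sketch; the `__` prefix convention is hypothetical, and Transcribe expects the vocabulary to already be registered by name via CreateVocabulary):

```python
def vocab_name_for(audio_key):
    """Hypothetical convention: an S3 key like
    'uploads/projectA__interview1.mp3' maps to the custom vocabulary
    'projectA'; keys without a '__' prefix use no vocabulary."""
    base = audio_key.rsplit("/", 1)[-1]
    return base.split("__", 1)[0] if "__" in base else None


def start_job(job_name, media_uri, audio_key):
    import boto3
    settings = {"ShowSpeakerLabels": True, "MaxSpeakerLabels": 10}
    vocab = vocab_name_for(audio_key)
    if vocab is not None:
        settings["VocabularyName"] = vocab  # must already exist in Transcribe
    boto3.client("transcribe").start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": media_uri},
        LanguageCode="en-US",
        Settings=settings,
    )
```

The same-name alternative (vocab file uploaded alongside the recording) would instead have the Lambda check the input bucket for a sibling key and create the vocabulary on the fly.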

Rich web UI presence or AWS S3 UI

We can develop a simple web application that allows for:

  • User (IU network ID) access
      • Individual file upload / download
      • Admin role
  • Reflecting the job queue that is in AWS, with some sort of "pending" state

Alternatively, this may not require a custom web page at all; it could be done via the AWS S3 UI:

https://github.com/aws-solutions/content-localization-on-aws
https://aws.amazon.com/solutions/implementations/content-localization-on-aws/
https://iu.mediaspace.kaltura.com/media/t/1_1mjwy1qi

Existing AWS quickstarts

This solution could be a starting place: https://aws.amazon.com/solutions/implementations/content-localization-on-aws/
Which is based on: https://docs.aws.amazon.com/solutions/latest/media-insights-on-aws/architecture-overview.html
Either or both could be useful.
