
nasa-pds / data-upload-manager


Data Upload Manager (DUM) component for managing the interface for data uploads to the Planetary Data Cloud from Data Providers and PDS Nodes.

Home Page: https://nasa-pds.github.io/data-upload-manager

License: Apache License 2.0

Languages: Python 73.72%, JavaScript 3.16%, HCL 23.12%
Topics: s3-storage, upload

data-upload-manager's Introduction

PDS Data Upload Manager

The PDS Data Upload Manager provides the client application and server interface for managing data deliveries and retrievals between Data Providers and the Planetary Data Cloud.

Prerequisites

The PDS Data Upload Manager has the following prerequisites:

  • python3 for running the client application and unit tests
  • awscli (optional) for deploying the service components to AWS (TBD)

User Quickstart

Install with:

pip install pds-data-upload-manager

To deploy the service components to an AWS environment:

TBD

To execute the client, run:

pds-ingress-client.py <ingress_path> [<ingress_path> ...]

Code of Conduct

All users and developers of the NASA-PDS software are expected to abide by our Code of Conduct. Please read this to ensure you understand the expectations of our community.

Development

To develop this project, use your favorite text editor, or an integrated development environment with Python support, such as PyCharm.

Contributing

For information on how to contribute to NASA-PDS codebases please take a look at our Contributing guidelines.

Installation

Install in editable mode and with extra developer dependencies into your virtual environment of choice:

pip install --editable '.[dev]'

Configure the pre-commit hooks:

pre-commit install && pre-commit install -t pre-push

Packaging

To isolate and reproduce the environment for this package, you should use a Python Virtual Environment. To do so, run:

python -m venv venv

Then exclusively use venv/bin/python, venv/bin/pip, etc. (It is no longer recommended to use venv/bin/activate.)

If you have tox installed and would like it to create your environment and install dependencies for you run:

tox --devenv <name you'd like for env> -e dev

Dependencies for development are specified as the dev extras_require in setup.cfg; they are installed into the virtual environment as follows:

pip install --editable '.[dev]'

Tooling

The dev extras_require included in this repo installs black, flake8 (plus some plugins), and mypy along with default configuration for all of them. You can run all of these (and more!) with:

tox -e lint

Tests

A complete "build" including test execution, linting (mypy, black, flake8, etc.), and documentation build is executed via:

tox

Unit tests

Our unit tests are launched with the command:

pytest

Documentation

You can build this project's docs with:

sphinx-build docs/source docs/build

You can access the build files in the following directory relative to the project root:

docs/build/

data-upload-manager's People

Contributors

collinss-jpl, dependabot[bot], jordanpadams, nutjob4life, pdsen-ci, tloubrieu-jpl


data-upload-manager's Issues

As a user, I want to upload only data products that have not been previously ingested

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can upload only data I do not already have in the Planetary Data Cloud.

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

As a user, I want to parallelize upload of data products to PDC

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

No response

💪 Motivation

...so that I can [why do you want to do this?]

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

Implement automatic refresh of Cognito authentication token

💡 Description

The authentication token returned from Cognito has a default expiration of 1 hour, which is typically shorter than what is expected for large file transfers. Cognito authentication tokens can be refreshed by providing the "refresh" token that is supplied after initial authentication.

The DUM client script needs to be updated to support an automatic refresh of the authentication token based on when the token is expected to expire. This should allow long running transfers to complete without interruption.
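
As a rough illustration, a client-side refresh could look like the boto3 sketch below; the helper names and the five-minute skew are assumptions, and it presumes a Cognito app client without a client secret (otherwise a SECRET_HASH parameter would also be required):

```python
# Hedged sketch: exchange the Cognito refresh token for a new access token
# shortly before the current one expires. Names here are illustrative.
import time
import boto3

cognito = boto3.client("cognito-idp", region_name="us-west-2")

def refresh_access_token(client_id, refresh_token):
    """Request a fresh access token using the REFRESH_TOKEN_AUTH flow."""
    response = cognito.initiate_auth(
        AuthFlow="REFRESH_TOKEN_AUTH",
        AuthParameters={"REFRESH_TOKEN": refresh_token},
        ClientId=client_id,
    )
    result = response["AuthenticationResult"]
    return result["AccessToken"], time.time() + result["ExpiresIn"]

def token_expiring_soon(expires_at, skew=300):
    """True if the current token expires within `skew` seconds."""
    return time.time() >= expires_at - skew

# During a long-running transfer loop:
#   if token_expiring_soon(expires_at):
#       access_token, expires_at = refresh_access_token(client_id, refresh_token)
```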

⚔️ Parent Epic / Related Tickets

No response

As an admin, I want access to buckets to be restricted by subnet

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Cloud Admin / Operator

💪 Motivation

...so that I can add another layer of security to S3 bucket access.

📖 Additional Details

No response

Acceptance Criteria

Given a bucket that I have a write-access policy for via the Data Upload Manager, and I am within the allowed IP subnet
When I perform a DUM upload
Then I expect the data to upload successfully

Given a bucket that I have a write-access policy for via the Data Upload Manager, and I am outside the allowed IP subnet
When I perform a DUM upload
Then I expect the upload to be rejected

⚙️ Engineering Details

No response

DUM Client does not properly sanitize double-quotes from INI config

Checked for duplicates

No - I haven't checked

๐Ÿ› Describe the bug

When double-quotes are present in the string values within the INI config utilized by the DUM client, they end up being erroneously included in the JSON-serialized payload sent to API Gateway. This causes escaped double-quotes (\") to appear in the HTML header that defines the CloudWatch log group to submit client logs to. The CloudWatch API then rejects the log stream creation request with a SeriallizationError, causing client logs to not appear in CloudWatch.

๐Ÿ•ต๏ธ Expected behavior

The DUM Client INI config parser should be properly sanitizing double-quotes from parsed strings to ensure they are not escaped when serializing a payload to JSON. This will ensure that DUM client logs will populate in CloudWatch as-expected.
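
A minimal sketch of the kind of sanitization described above, assuming configparser is used to read the INI file; the helper name is illustrative, not DUM's actual code:

```python
# Hedged sketch: strip surrounding quotes from every parsed INI value so they
# cannot leak into the JSON-serialized payload sent to API Gateway.
from configparser import ConfigParser

def load_sanitized_config(path):
    parser = ConfigParser()
    parser.read(path)
    return {
        section: {key: value.strip('"').strip("'")
                  for key, value in parser[section].items()}
        for section in parser.sections()
    }

# With the reproduce case below, config["OTHER"]["log_group_name"] becomes
# /pds/nucleus/dum/client-log-group even if the INI value was double-quoted.
```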

📜 To Reproduce

Run the DUM pds-ingress-client using an INI config that surrounds the value for log_group_name in double-quotes:

...
[OTHER]
log_level = DEBUG
log_format = %(levelname)s %(threadName)s %(name)s:%(funcName)s %(message)s
log_group_name = "/pds/nucleus/dum/client-log-group"

After completing an ingest to S3, there will not be a corresponding log in the CloudWatch log group specified by the INI config. If the double-quotes are removed and the ingress client is rerun, then logs should appear in CloudWatch.

🖥 Environment Info

  • Version of this software [e.g. vX.Y.Z]
  • Operating System: [e.g. MacOSX with Docker Desktop vX.Y]
    ...

📚 Version of Software Used

v1.2.0

🩺 Test Data / Additional context

No response

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

🎉 Integration & Test

No response

Upload test data set with manual trigger of Nucleus

💡 Description

  • Needs new IAM roles for DUM. Get help from @sjoshi-jpl @viviant100
  • Test out deployment to MCP
  • Test uploading data from the internal pdsmcp-dev EC2 instance that can reach the private API Gateway
    • Ask SAs to help set up the EC2 instance and give access to the specific EN operator user group
  • Test uploading data from an on-prem EC2 instance to the public API Gateway
    • Ask SAs to set up the IP whitelist
  • Move on to deploying and running with SBN: #32

As a user, I want to skip upload of files already in S3 (nucleus staging bucket)

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can avoid duplicate copies of data.

📖 Additional Details

The current design overwrites the data when a user uploads via DUM. Propose adding a capability to verify whether the file modification time and size remain unchanged, allowing the copy to S3 to be skipped. Additionally, an optional flag (e.g., --force-overwrite) would let users overwrite the file when needed.
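
A minimal sketch of the proposed skip/overwrite check, assuming boto3 access to the staging bucket; the helper name, bucket/key layout, and the "last_modified" metadata key are assumptions:

```python
# Hedged sketch: skip the copy to S3 when size and recorded mtime are unchanged,
# unless a --force-overwrite style flag is set.
import os
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def should_upload(local_path, bucket, key, force_overwrite=False):
    if force_overwrite:
        return True
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return True  # object not present (or not visible): upload it
    stat = os.stat(local_path)
    same_size = head["ContentLength"] == stat.st_size
    same_mtime = head.get("Metadata", {}).get("last_modified") == str(int(stat.st_mtime))
    return not (same_size and same_mtime)
```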

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

Note: Per #91, rclone handles this functionality for us.

The user should also have the ability to force overwrite data that is already out there.

Log upload to CloudWatch fails during batch upload

Checked for duplicates

No - I haven't checked

๐Ÿ› Describe the bug

When testing upload of CSS sample data to DUM, the following warning is generated on every ingest:

`WARNING:root:Unable to submit to CloudWatch Logs, reason: 'LogRecord' object has no attribute 'message'

๐Ÿ•ต๏ธ Expected behavior

The LogRecords should still have a message field assigned, and all logs should be uploaded to CloudWatch without issue.
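
For context, a LogRecord only gains a .message attribute once it has been formatted, so a handler that reads record.message directly fails exactly as in the warning above. The sketch below shows the safe pattern; the handler class is hypothetical, not DUM's actual code:

```python
# Hedged sketch: call self.format()/record.getMessage() instead of reading
# record.message, which only exists after formatting has run.
import logging

class CloudWatchHandler(logging.Handler):
    def emit(self, record):
        text = self.format(record)   # sets record.message internally; always safe
        # text = record.message      # raises AttributeError if format() never ran
        self.send_to_cloudwatch(text)

    def send_to_cloudwatch(self, text):
        ...  # forward the formatted line to the CloudWatch Logs endpoint
```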

📜 To Reproduce

  1. Configure an instance of DUM on an EC2 instance that can communicate with the (currently Private) API gateway
  2. Use the pds-ingress-client.py script to upload a file
  3. Verify the above warning is reproduced in the output log

🖥 Environment Info

  • Version of this software [e.g. vX.Y.Z]
  • Operating System: [e.g. MacOSX with Docker Desktop vX.Y]
    ...

📚 Version of Software Used

No response

🩺 Test Data / Additional context

No response

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

As a user, I want timestamps added to the ongoing logs printed to stdout while running the job

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can have a better gauge of the execution time of the application.

📖 Additional Details

From @mdrum:

  1. Add timestamps to the ongoing logs that are printed to stdout while running the job. We ran four separate jobs, and we had it configured to only print warnings, so we didn't get to see how far it got in the third job before pausing.

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

🎉 I&T

No response

Develop Cost Model

💡 Description

Cost model for data upload manager components. Should go hand-in-hand with Design Doc but this will be managed and maintained in a secure location. This Epic will also include consideration of deployment strategies, as needed.

Verify the node of the user against Cognito

💡 Description

  1. We need groups for each node.
  2. The Cognito user will be assigned node groups (at least one).
  3. The client will forward the access token to the API Gateway.
  4. The Lambda authorizer will decode the groups and check that they match the node specified in the header of the request (see the sketch below).
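
A minimal sketch of the authorizer check in step 4, assuming PyJWT; signature verification against the Cognito JWKS is omitted here, and the claim/header handling is illustrative:

```python
# Hedged sketch: read the cognito:groups claim from the access token and compare
# it against the node named in the request. Not DUM's actual authorizer code.
import jwt  # PyJWT

def node_is_authorized(access_token, requested_node):
    # In production the token signature would be verified against the user pool's JWKS.
    claims = jwt.decode(access_token, options={"verify_signature": False})
    groups = claims.get("cognito:groups", [])
    return requested_node.lower() in (group.lower() for group in groups)
```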

Motivation

So that the PDS users have a single login and password for all the PDS services.

Develop Initial Proof-of-Concept

💡 Description

Per discussions with team, looking at producing a POC for a few different architectures.

Some options:

  • Data provider stages data in their own S3 bucket, publish CNM to DAAC SNS topic
  • Data provider stages data in DAAC S3 bucket, publishes CNM to DAAC SNS topic
  • Data provider stages data in DAAC S3 bucket, DAAC ingest themselves using some crawler tool one time (for one-off collections)
  • Data provider stages data in DAAC S3 bucket, DAAC continuously ingest data as it comes in (new mechanism that is not yet completed)

Per discussion with the team, going to pursue an approach similar to what @collinss-jpl has proposed:

The approach I had been thinking about utilizes AWS API gateway connected to a Lambda (similar to TEA) to allow a client application on the SBN host to request an upload/sync of a local file or files. The Lambda uses information from the request to determine where in our S3 bucket hierarchy the requested files should get uploaded to (based on product type, PDS submitter node, or whatever other criteria we derive). The S3 URI(s) are then returned back through the API gateway to the client. The SBN client application then uses the returned URI(s) to perform the sync using the CLI or boto library. Eventually we could work in a job queue on the SBN client app so the uploads can be performed asynchronously from the upload requests. We would also be able to use the built-in throttling capability on API Gateway to control how much data or how many requests we'll allow within a window of time etc…

DUM Lambda Service can return pre-signed S3 URLs to non-existing buckets

Checked for duplicates

No - I haven't checked

๐Ÿ› Describe the bug

The DUM Lambda Service function utilizes a bucket map to determine the correct bucket that incoming data should be routed to, however, the function does not currently check if said bucket has actually be created in S3. This results in an invalid pre-signed URL being returned to the client script, which will encounter a error when attempting to use the URL to push data to S3.

๐Ÿ•ต๏ธ Expected behavior

The DUM Lambda Service should be checking for existence of the bucket read from the bucket map file, and return an error to the client if the bucket does not already exist.
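
A minimal sketch of the proposed existence check, assuming boto3; the helper name and error handling are illustrative:

```python
# Hedged sketch: verify the bucket from the bucket map exists before generating
# a pre-signed URL, so the client gets a clear error instead of a dead URL.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def bucket_exists(bucket_name):
    try:
        s3.head_bucket(Bucket=bucket_name)
        return True
    except ClientError as err:
        # "404" means the bucket is missing; "403" means it exists but this role
        # cannot access it, which is a different (configuration) problem.
        return err.response["Error"]["Code"] not in ("404", "NoSuchBucket")
```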

📜 To Reproduce

Configure a bucket-map.yaml file for use with the DUM Lambda service that routes files to a non-existing bucket for one of the PDS groups (ex: eng, sbn, img). Then use the client script to submit a file for ingress as a user assigned to said PDS group.

🖥 Environment Info

  • Version of this software [e.g. vX.Y.Z]
  • Operating System: [e.g. MacOSX with Docker Desktop vX.Y]
    ...

📚 Version of Software Used

1.2.0

🩺 Test Data / Additional context

No response

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

🎉 Integration & Test

No response

Develop Ingress Client Interface

💡 Description

The current command-line interface for the Ingress client script only allows a user to provide a single file path, as well as an (arbitrary) node ID. This interface needs to be developed to allow at a minimum:

  • Validation of the provided node ID against the standard set of PDS identifiers
  • Support for specifying multiple input paths
  • Support for distinguishing paths to files vs. paths to directories and performing the appropriate S3 sync logic

DUM client is unable to create CloudWatch Log Stream pds-ingress-client-sbn-* when uploading data to the cloud

Checked for duplicates

No - I haven't checked

๐Ÿ› Describe the bug

When the css data was uploaded to cloud via DUM client, the following error occured:

WARNING:root:Unable to submit to CloudWatch Logs, reason: Failed to create CloudWatch Log Stream pds-ingress-client-sbn-1709312322, reason: 403 Client Error: Forbidden for url: https://yofdsuex7g.execute-api.us-west-2.amazonaws.com/prod/createstream

🕵️ Expected behavior

No errors, and logs get pushed to the cloud.

📜 To Reproduce

Run DUM on any data and push data to the cloud.

🖥 Environment Info

Linux OS

📚 Version of Software Used

0.3.0

🩺 Test Data / Additional context

Any PDS4 data

🦄 Related requirements

⚙️ Engineering Details

As a workaround, let's plan to comment out the code that is causing this for the time being. Logging into AWS is a lower priority ("should") requirement.

As a user, I want status summary reports during a long running execution (batching functionality)

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can regularly check the overall status of an execution in file upload "chunks".

📖 Additional Details

From @mdrum:

  1. Perhaps allow for batching of some sort. Rather than accepting all files and generating one big report at the end, you would want to split it up into groups of 1000 files or something and create mid-reports along the way, with a final tally being generated at the end. We could do this on our end, of course, but then it would mean we would have to combine the mid-reports manually ourselves. Something to think about.

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

Initial thoughts: new flag(s) to allow someone to specify the reporting type (cumulative, batch, both) and select the batching "chunk" size (default: 1000 files)

🎉 I&T

No response

Develop initial design doc

💡 Description

After initial rapid prototyping has completed, develop a design and architecture diagram/document.

Ideally this document would be posted as part of the online documentation for this repository.

As a user, I want an end summary report in logs to show statistics of files uploaded

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can have a summary of how things were uploaded

📖 Additional Details

  • files uploaded
  • files skipped
  • files overwritten

Acceptance Criteria

Given a set of files to be uploaded to S3
When I perform a nominal upload
Then I expect a final report to be output showing metrics of files read, files successfully uploaded, files skipped, and files overwritten

⚙️ Engineering Details

No response

Backoff/Retry logic masks errors from urllib3 exceptions

Checked for duplicates

No - I haven't checked

๐Ÿ› Describe the bug

Within the DUM client script, backoff/retry decorators are used to capture and inspect exceptions and from the requests.exceptions package to see if the error is recoverable (i.e. intermittent outtage). However, it's been observed that exceptions from the urllib3 package can also trigger the backoff/retry handlers, which results in the following error since the format of the exception is different from what is expected:

Ingress failed, reason: 'NoneType' object has no attribute 'status_code'

This essentially "masks" the true error, which makes debugging the underlying issue much more difficult.

🕵️ Expected behavior

The backoff/retry decorator code should gracefully handle exception classes that trigger the logic but do not conform to the expected structure (e.g., missing a response field with the HTTP status code).
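
A minimal sketch of the defensive check, assuming the backoff package is used for the retry decorators; the function names are illustrative:

```python
# Hedged sketch: use getattr() so exceptions without a populated .response
# (e.g. urllib3 errors surfaced through requests) are retried or reported
# cleanly instead of raising a secondary AttributeError.
import backoff
import requests

def fatal_code(exc):
    response = getattr(exc, "response", None)
    if response is None:
        return False  # no HTTP status available; let backoff keep retrying
    return 400 <= response.status_code < 500 and response.status_code != 429

@backoff.on_exception(backoff.expo, requests.exceptions.RequestException,
                      max_tries=5, giveup=fatal_code)
def put_with_retries(url, data):
    response = requests.put(url, data=data)
    response.raise_for_status()
    return response
```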

📜 To Reproduce

This bug can be reproduced when usage of the pre-signed S3 URL returned from the DUM lambda service results in a transfer failure that raises an exception from the urllib3 package, such as SSLError. This will likely need to be simulated using a mock function for the requests.put call made in pds_ingress_client.ingress_file_to_s3()

🖥 Environment Info

  • Version of this software [e.g. vX.Y.Z]
  • Operating System: [e.g. MacOSX with Docker Desktop vX.Y]
    ...

📚 Version of Software Used

1.2.0

🩺 Test Data / Additional context

No response

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

🎉 Integration & Test

No response

As a user, I want to skip upload of files that are already in the Registry

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I do not try to reload the data

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

The easiest way to do this would be to search the registry either by the file path OR by checksum OR both? We could do this with the LID/LIDVID but I think that will add some significant overhead.

Do we want to figure out some sort of auto-generated UUID for every file we upload to the cloud and add this as metadata? Maybe this is something we could actually store then in the Nucleus database and eventually in the registry. It could link throughout the whole system, agnostic of the LIDVID for the products themselves.

Add argument to client script to follow symlinks

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

The current behavior of the pds-ingress-client.py script is to ignore any symbolic links encountered when traversing paths to be uploaded. Going forward, it could be useful to add a command-line option to instruct the client to follow encountered symlinks, rather than ignore them by default.

💪 Motivation

Would allow pds-ingress-client.py to be used with datasets that are compiled from pre-existing data via symlink to avoid data duplication.

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

As Nucleus, I want to use a lock file to know when DUM is writing to an S3 bucket folder

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

No response

💪 Motivation

...so that I can know when writing to a directory has completed, and we can fully evaluate all the products (XML + data files) in the directory and its sub-directories.

📖 Additional Details

  • Crawl the file system
  • For each directory you come across, write a dum.lock file with TBD information in it
  • Continue to crawl and write data; as you complete a directory and all its sub-directories, remove the dum.lock file (see the sketch below)
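
A minimal sketch of the lock-file convention described above; the bottom-up walk order and the lock-file contents are assumptions, and upload_file stands in for the actual upload call:

```python
# Hedged sketch: drop a dum.lock file in each directory while it is being written
# and remove it once that directory is finished, so a downstream reader (Nucleus)
# can tell when it is safe to evaluate the directory's products.
import os
from pathlib import Path

def upload_tree(root, upload_file):
    # topdown=False walks children before parents, so by the time a directory's
    # lock is removed, its sub-directories have already been completed.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        lock = Path(dirpath) / "dum.lock"
        lock.write_text("upload in progress")  # TBD contents
        try:
            for name in filenames:
                upload_file(os.path.join(dirpath, name))
        finally:
            lock.unlink()  # directory complete; downstream readers may proceed
```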

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

As a user, I want to include the modification datetime in the user-defined object metadata being sent in the upload payload

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Archivist

💪 Motivation

...so that I can match the modification datetime from the source system where the data is being copied.

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

Develop IaC for Deployment

💡 Description

Develop the necessary documentation, Terraform scripts, and/or other definitions/scripts needed to deploy the app on a user system and to the cloud.

Add External Config Support to Ingress Client Script

💡 Description

The current Ingress client script contains a number of hardcoded constants related to AWS configuration (such as API Gateway ID and region) that should be refactored into an external .ini config (or similar) to allow easy customization without requiring code redeployment.

Populate Sphinx documentation for entire DUM service

💡 Description

Ticket to add Sphinx documentation for the entire DUM repository. Topics covered should include:

  • Installation instructions
  • Terraform deployment procedure
  • Cognito account creation
  • INI config format
  • Client script usage

Deploy v1.2.0 DUM to Production

💡 Description

  • Tag new DUM with v1.2.0
    • new summary report
    • new logging group /pds/nucleus/dum/client-log-group
    • token refresh
  • Update DUM client to test new capabilities
  • Debug session with SA
  • Request SBN to upgrade

⚔️ Parent Epic / Related Tickets

No response

Upgrade SBN to latest and Rename Bucket Folder

💡 Description

  • Tag new DUM with logging fixed
  • Deploy to production
  • Request SBN to upgrade
  • Rename root folder in S3 bucket from SBN to sbn (create sbn folder, move gbo... to the sbn folder)

⚔️ Parent Epic / Related Tickets

No response

Update lambda function to lowercase the node prefix in buckets

💡 Description

Right now data is being pushed to bucket paths like, for SBN, /SBN/my/data/here. The /SBN prefix seems a bit redundant, but we can leave it for now. However, we definitely want this lowercase.

⚔️ Parent Epic / Related Tickets

No response

As a user, I want to force an upload of a file that is already in S3 or the Registry

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can overwrite data that has already been loaded into the PDC

📖 Additional Details

No response

Acceptance Criteria

Given a file that has already been loaded into the registry
When I perform a DUM upload with the --overwrite flag enabled
Then I expect the data to overwrite the existing file in the system, and DUM to note that this occurred in the logs

⚙️ Engineering Details

No response

Develop Ingress Client Logging Capabilities

💡 Description

Ticket to develop the logging capabilities of the pds-ingress-client.py script. Added capabilities should include:

  • Addition of a logging or log utility module to control initialization of the global logger object
  • Implement features to allow control of the logging level/format from the command line
  • Addition of an API Gateway endpoint to submit client logs to a CloudWatch log group
  • Implement submission of all logged information during client execution to the API Gateway endpoint

Motivation

To help support the discipline nodes.

As a user, I want each product to take no more than X seconds to upload to AWS

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

One potential issue was raised for awareness: currently it seems to take about 5-10 seconds per CSS product to upload. That's going to need to be improved somewhere in the chain, because at that rate it will take more than 24 hours to upload the number of products that are generated every 24 hours.

💪 Motivation

...so that I can [why do you want to do this?]

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

Variables we need to take into account:

  • Bandwidth to AWS - is there a way we could improve upload throughput?
  • Size of each file / product
  • Number of files
  • Not currently generating checksums

Develop Ingress Service Routing Logic

💡 Description

The current prototype of the Data ingress lambda function contains some dummy logic for determining the product type from the provided file path/node ID. This ticket is to track the planning and implementation for the initial logic for determining the s3 path convention from the data payload provided by the client script.

Also within scope for this ticket is defining the input payload schema itself that will determine what is sent by the client.

As a result, a back-end service component is developed for the Data-Upload-Manager. It receives the upload request from the client and returns an S3 path (or eventually a pre-signed S3 URL) where the data should be uploaded by the client.

Add support for presigned upload URL usage

💡 Description

To further secure the data upload process, the ingress service Lambda needs to incorporate generation of pre-signed S3 URLs that the client can use to securely upload files to S3. This will allow all the PDS buckets in Nucleus to be private (so trying to guess an S3 upload URI should not work), while still providing a means for outside users to push to S3 without additional credentials or permissions.
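
A minimal sketch of the flow, assuming boto3 on the service side and requests on the client side; the bucket, key, and expiry values are illustrative:

```python
# Hedged sketch: the service generates a time-limited pre-signed PUT URL for a
# private bucket; the client then uploads with plain HTTP and no AWS credentials.
import boto3
import requests

s3 = boto3.client("s3")

def make_upload_url(bucket, key, expires=3600):
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )

def upload(local_path, url):
    with open(local_path, "rb") as handle:
        response = requests.put(url, data=handle)
    response.raise_for_status()
```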

Develop Ingress Lambda Logging Conventions

💡 Description

The Ingress Lambda function can log messages directly to AWS CloudWatch via the built-in logging library. This will likely be the primary mechanism for tracking incoming requests, so we need to define exactly what we would like to see logged for each request.
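
For illustration, a handler can emit per-request log lines with the standard logging module, which Lambda routes to the function's CloudWatch log group automatically; the fields shown below are examples, not a settled convention:

```python
# Hedged sketch: structured request logging from inside the ingress Lambda.
# The event fields ("node", "path") are assumptions about the request payload.
import json
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    logger.info("ingress request: node=%s path=%s request_id=%s",
                event.get("node"), event.get("path"), context.aws_request_id)
    return {"statusCode": 200, "body": json.dumps({"status": "ok"})}
```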

As a user, I want to include an MD5 checksum in the user-defined object metadata being sent in the upload payload

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

Archivist

💪 Motivation

...so that I can include a checksum with the files being uploaded to ensure data integrity as the files flow through the system

📖 Additional Details

No response

Acceptance Criteria

Given a file to be uploaded to S3
When I perform data upload manager execution on that file
Then I expect data upload manager to generate a checksum and add to the object metadata for the S3 object

⚙️ Engineering Details

https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html
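
A minimal sketch of attaching an MD5 as user-defined object metadata (an x-amz-meta-* header), using a direct boto3 put_object purely for illustration since DUM itself uploads via pre-signed URLs; the metadata key name is an assumption:

```python
# Hedged sketch: compute an MD5 locally and store it as user-defined S3 metadata
# so downstream services can verify data integrity.
import hashlib
import boto3

s3 = boto3.client("s3")

def upload_with_checksum(local_path, bucket, key):
    with open(local_path, "rb") as handle:
        data = handle.read()
    checksum = hashlib.md5(data).hexdigest()
    s3.put_object(Bucket=bucket, Key=key, Body=data,
                  Metadata={"md5checksum": checksum})
    return checksum
```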
