aind-data-transfer's Introduction

DEPRECATION

We are deprecating this library. We plan to drop support on 2024-08-01 and will archive the repository on 2024-09-01.

Please reach out to the Scientific Computing Department with any questions or concerns.

aind-data-transfer

Tools for transferring large data to and between cloud storage providers.

Installation

To upload data to AWS S3, you may need to install and configure awscli. To upload data to GCP, you may need to install and configure gsutil.

Generic upload

If you get errors on Windows, you may need to first install pyminizip from conda: conda install -c mzh pyminizip

  • From PyPI: pip install aind-data-transfer
  • From source: pip install -e .

Imaging

  • Run pip install -e .[imaging]
  • Run ./post_install.sh

Ephys

  • From PyPI: pip install aind-data-transfer[ephys]
  • From source: pip install -e .[ephys]

Full

  • Run pip install -e .[full]
  • Run ./post_install.sh

Development

  • Run pip install -e .[dev]
  • Run ./post_install.sh

MPI

To run scripts on a cluster, you need to install dask-mpi. This requires compiling mpi4py with the MPI implementation used by your cluster (Open MPI, MPICH, etc). The following example is for the Allen Institute HPC, but should be applicable to other HPC systems.

SSH into your cluster login node

ssh user.name@hpc-login

On the Allen cluster, the MPI modules are only available on compute nodes, so SSH into a compute node (n256 chosen arbitrarily).

ssh user.name@n256

Now load the MPI module and compiler. It is important that you use the latest MPI version and compiler, or else dask-mpi may not function properly.

module load gcc/10.1.0-centos7 mpi/mpich-3.2-x86_64

Install mpi4py

python -m pip install --no-cache-dir mpi4py

Now install dask-mpi

python -m pip install dask_mpi --upgrade
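
Once dask-mpi and mpi4py are installed, a minimal sketch of a dask-mpi script (launched with something like mpirun -np 4 python my_script.py; the script name and worker count are placeholders, not taken from this repo) looks like:

from dask_mpi import initialize
from distributed import Client

# initialize() turns MPI ranks 0 and 1 into the scheduler and this client process;
# the remaining ranks become dask workers.
initialize()

client = Client()  # connects to the scheduler started by initialize()
print(client.submit(lambda x: x + 1, 41).result())  # -> 42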

Usage

Running one or more upload jobs

The jobs can be defined inside a csv file. The first row of the csv file needs the following headers. Some are required for the job to run, and others are optional.

Required

s3_bucket: S3 Bucket name
platform: One of [behavior, confocal, ecephys, exaSPIM, FIP, HCR, HSFP, mesoSPIM, merfish, MRI, multiplane-ophys, single-plane-ophys, SLAP2, smartSPIM] (pulled from the Platform.abbreviation field)
modality: One of [behavior-videos, confocal, ecephys, fMOST, icephys, fib, merfish, MRI, ophys, slap, SPIM, trained-behavior] (pulled from the Modality.abbreviation field)
subject_id: ID of the subject
acq_datetime: Format can be either YYYY-MM-DD HH:mm:ss or MM/DD/YYYY I:MM:SS P
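
As an illustration (an assumption about the exact parsing, not code from this repo), the two accepted formats correspond to these strptime patterns:

from datetime import datetime

# YYYY-MM-DD HH:mm:ss
datetime.strptime("2020-10-10 14:10:10", "%Y-%m-%d %H:%M:%S")
# MM/DD/YYYY I:MM:SS P (12-hour clock with AM/PM)
datetime.strptime("10/10/2020 2:10:10 PM", "%m/%d/%Y %I:%M:%S %p")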

One or more modalities need to be set. The csv headers can look like:

modality0: [behavior-videos, confocal, ecephys, fMOST, icephys, fib, merfish, MRI, ophys, slap, SPIM, trained-behavior]
modality0.source: path to modality0 raw data folder
modality0.compress_raw_data (Optional): Override default compression behavior. True if ECEPHYS, False otherwise.
modality0.skip_staging (Optional): If modality0.compress_raw_data is False and this is True, upload directly to s3. Default is False.
modality0.extra_configs (Optional): path to config file to override compression defaults
modality1 (Optional): [behavior-videos, confocal, ecephys, fMOST, icephys, fib, merfish, MRI, ophys, slap, SPIM, trained-behavior]
modality1.source (Optional): path to modality1 raw data folder
modality1.compress_raw_data (Optional): Override default compression behavior. True if ECEPHYS, False otherwise.
modality1.skip_staging (Optional): If modality1.compress_raw_data is False and this is True, upload directly to s3. Default is False.
modality1.extra_configs (Optional): path to config file to override compression defaults
...

Somewhat optional. Set aws_param_store_name to retrieve common endpoints, or define custom endpoints if desired.

aws_param_store_name: Path to aws_param_store_name to retrieve common endpoints

If aws_param_store_name is not set, define these endpoints individually:

codeocean_domain: Domain of Code Ocean platform
codeocean_trigger_capsule_id: Launch a Code Ocean pipeline
codeocean_trigger_capsule_version: Optional if Code Ocean pipeline is versioned
metadata_service_domain: Domain name of the metadata service
aind_data_transfer_repo_location: The link to this project
video_encryption_password: Password with which to encrypt video files
codeocean_api_token: Code Ocean token used to run a capsule

Optional

temp_directory: The job will use your OS's file system to create a temp directory as default. You can override the location by setting this parameter.
behavior_dir: Location where behavior data associated with the raw data is stored.
metadata_dir: Location where metadata associated with the raw data is stored.
log_level: Default log level is warning. Can be set here.

Optional Flags

metadata_dir_force: Default is false. If true, the metadata in the metadata folder will be regarded as the source of truth instead of the metadata pulled from aind_metadata_service
dry_run: Default is false. If set to true, it will perform a dry-run of the upload portion and not actually upload anything.
force_cloud_sync: Use with caution. If set to true, it will sync the local raw data to the cloud even if the cloud folder already exists.
compress_raw_data: Override all compress_raw_data defaults and set them to True.
skip_staging: For each modality, copy uncompressed data directly to s3.

After creating the csv file, you can run through the jobs with

python -m aind_data_transfer.jobs.s3_upload_job --jobs-csv-file "path_to_jobs_list"

Any Optional Flags attached will persist and override those set in the csv file. For example,

python -m aind_data_transfer.jobs.s3_upload_job --jobs-csv-file "path_to_jobs_list" --dry-run --compress-raw-data

will compress the raw data source and run a dry run for all jobs defined in the csv file.

An example csv file might look like:

data-source, s3-bucket, subject-id, modality, platform, acq-datetime, aws_param_store_name
dir/data_set_1, some_bucket, 123454, ecephys, ecephys, 2020-10-10 14:10:10, /aind/data/transfer/endpoints
dir/data_set_2, some_bucket2, 123456, ophys, multiplane-ophys, 2020-10-11 13:10:10, /aind/data/transfer/endpoints
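
If you prefer to generate the jobs csv programmatically, a minimal sketch using Python's csv module (the file name jobs_list.csv and the row values are placeholders) could look like:

import csv

headers = [
    "data-source", "s3-bucket", "subject-id", "modality",
    "platform", "acq-datetime", "aws_param_store_name",
]
rows = [
    ["dir/data_set_1", "some_bucket", "123454", "ecephys",
     "ecephys", "2020-10-10 14:10:10", "/aind/data/transfer/endpoints"],
]

with open("jobs_list.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(rows)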

Defining a custom processing capsule to run in Code Ocean

Read the previous section on defining a csv file. Retrieve the capsule id from the Code Ocean platform. You can add an extra parameter to define a custom processing capsule that gets executed after the data is uploaded:

codeocean_process_capsule_id, data-source, s3-bucket, subject-id, modality, platform, acq-datetime, aws_param_store_name
xyz-123-456, dir/data_set_1, some_bucket, 123454, ecephys, ecephys, 2020-10-10 14:10:10, /aind/data/transfer/endpoints
xyz-123-456, dir/data_set_2, some_bucket2, 123456, ophys, multiplane-ophys, 2020-10-11 13:10:10, /aind/data/transfer/endpoints

Contributing

Linters and testing

There are several libraries used to run linters, check documentation, and run tests.

  • Please test your changes using the coverage library, which will run the tests and log a coverage report:
coverage run -m unittest discover && coverage report
  • Use interrogate to check that modules, methods, etc. have been documented thoroughly:
interrogate .
  • Use flake8 to check that code is up to standards (no unused imports, etc.):
flake8 .
  • Use black to automatically format the code into PEP standards:
black .
  • Use isort to automatically sort import statements:
isort .

Pull requests

For internal members, please create a branch. For external members, please fork the repo and open a pull request from the fork. We'll primarily use Angular style for commit messages. Roughly, they should follow the pattern:

<type>(<scope>): <short summary>

where scope (optional) describes the packages affected by the code changes and type (mandatory) is one of:

  • build: Changes that affect the build system or external dependencies (example scopes: pyproject.toml, setup.py)
  • ci: Changes to our CI configuration files and scripts (examples: .github/workflows/ci.yml)
  • docs: Documentation only changes
  • feat: A new feature
  • fix: A bug fix
  • perf: A code change that improves performance
  • refactor: A code change that neither fixes a bug nor adds a feature
  • test: Adding missing tests or correcting existing tests
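
For example, a hypothetical commit message following this pattern (the scope and summary are made up for illustration) might look like:

fix(ephys): handle missing settings.xml during clipping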

aind-data-transfer's People

Contributors

alejoe91, aricai, bjhardcastle, camilolaiton, carshadi, cbrrrry, dyf, github-actions[bot], jtyoung84, jwong-nd, mekhlakapoor, sun-flow

aind-data-transfer's Issues

register ecephys data with codeocean after transfer

Once data has landed in a cloud bucket, we need to tell CodeOcean about it.

Acceptance Criteria:

  • Data Asset exists with a standards-conformant name (<modality>_<subject-id>_<acq_date>_<acq_time>)
  • Data Asset is kept on external storage
  • Data Asset is indexed
  • Data Asset is tagged by modality (e.g. ecephys)
  • Tag is not hardcoded, but read from either asset name or from data_description.json

Notes:

Example curl request that does the right thing:

curl --location --request POST 'https://codeocean.allenneuraldynamics.org/api/v1/data_assets' \
--header 'Content-Type: application/json' \
-u '{API_TOKEN}:' \
--data '{ 
    "name": "ecephys_625463_2022-09-28_16-34-22", 
    "description": "", 
    "mount": "ecephys_625463_2022-09-28_16-34-22", 
    "tags": [ "ecephys" ],
    "source": {
        "aws": {
            "bucket": "{BUCKET_NAME}",
            "prefix": "ecephys_625463_2022-09-28_16-34-22", 
            "keep_on_external_storage": true,
            "index_data": true,
            "access_key_id": "'"{ACCESS_KEY_ID}"'",
            "secret_access_key": "'"{ACCESS_KEY_ID}"'"
        }
    }
}'
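
For reference, a rough Python equivalent of the curl request above using the requests library (the token, bucket, and AWS credentials are placeholders):

import requests

API_TOKEN = "..."           # Code Ocean API token
BUCKET_NAME = "..."         # bucket holding the uploaded data

response = requests.post(
    "https://codeocean.allenneuraldynamics.org/api/v1/data_assets",
    auth=(API_TOKEN, ""),   # token as username, blank password, as in the curl -u flag
    json={
        "name": "ecephys_625463_2022-09-28_16-34-22",
        "description": "",
        "mount": "ecephys_625463_2022-09-28_16-34-22",
        "tags": ["ecephys"],
        "source": {
            "aws": {
                "bucket": BUCKET_NAME,
                "prefix": "ecephys_625463_2022-09-28_16-34-22",
                "keep_on_external_storage": True,
                "index_data": True,
                "access_key_id": "...",        # AWS credentials with access to the bucket
                "secret_access_key": "...",
            }
        },
    },
)
response.raise_for_status()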

BIDS-like ingest

for nd-data-transfer, expose an option to upload data in a BIDS-like directory structure.

https://bids-specification.readthedocs.io/en/stable/01-introduction.html
https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/10-microscopy.html
this may help: https://github.com/bids-standard/pybids

Proposed structure:

  • <modality>_<subject-id>_<acquisition-date> (root directory, optional suffix for _<acquisition-time>)
    • LICENSE (CC-BY-4.0)
    • dataset_description.json, minimally:
      • name (directory name above)
      • subject id
      • acquisition-datetime
      • stain label
      • creation datetime
      • license name (CC-BY-4.0)
    • <modality>
      • chunk-<N>_stain-<label>.<ext>

This does not need to validate for now.
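
As a sketch only (the exact key names are assumptions, since the issue does not fix them), the minimal dataset_description.json could be written like this:

import json
from datetime import datetime, timezone

root_name = "SPIM_625463_2022-09-28"   # <modality>_<subject-id>_<acquisition-date>; values are placeholders
description = {
    "name": root_name,
    "subject_id": "625463",
    "acquisition_datetime": "2022-09-28T16:34:22",
    "stain_label": "nuclei",            # hypothetical stain label
    "creation_datetime": datetime.now(timezone.utc).isoformat(),
    "license": "CC-BY-4.0",
}

with open("dataset_description.json", "w") as f:
    json.dump(description, f, indent=2)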

Add utility to compress ephys data

User story

As a user, I want to compress ephys data, so I can upload a smaller data set to the cloud.

Acceptance criteria

  • When a script is run with an input directory and output directory, then the raw contents in the input directory are compressed and saved to the output directory.

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

Ephys team can run ephys transfer job themselves

Originally this issue was about automating file reorganization. I've updated it to be about simply making sure that the ephys team can run this job themselves.

Acceptance Criteria:

  • ephys team can run transfer job themselves
  • ephys team can find their data in codeocean
  • ephys team can find their results in codeocean

old version below:

Right now the ephys team is manually reorganizing files from different hard drives.

  • Confirm with team (anna) about what they are doing
  • Update the ephys job step that is reorganizing videos and such to automate this

Encrypt video folder before exporting to AWS

User story

As a user, I want to encrypt the video data folder, so I can manage who can access it.

Acceptance criteria

  • When the ephys job pipeline is run, the job that uploads the video folder encrypts the video folder first.

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

We can try using pyminizip, but it might require extra installation on a Windows machine.

ephys job not working on windows

Describe the bug
The Ephys job doesn't appear to work on Windows

To Reproduce
Steps to reproduce the behavior:

  1. Run the open ephys job script in windows
  2. Notice it doesn't work

Expected behavior
The job should work on Windows

Desktop (please complete the following information):

  • OS: Windows

Additional context
Add any other context about the problem here.

Add jobs for transcoding + uploading imaging datasets

User story

As a user, I want to be able to run a transcode job easily using a configuration file

Acceptance criteria

  • When a script is run, then the imaging acquisition folder will be copied to a destination location (s3, gcs, or filesystem) with the same directory structure, except the raw data will be converted to OME-Zarr format.

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Streamline installing dependencies

Describe the bug
It takes over 4 minutes to install the dependencies. We should explore ways to streamline this process.

To Reproduce
Steps to reproduce the behavior:

  1. Install the dependencies (or check some of the github action logs on a pull request)
  2. Notice it takes several minutes

Expected behavior
We can save time by streamlining this operation.

Additional context
Add any other context about the problem here.

github actions are failing

Describe the bug
Github actions are failing.

To Reproduce
Steps to reproduce the behavior:

  1. Open a Pull Request
  2. Notice the automated checks are failing

Expected behavior
The checks during github actions should be passing.

Screenshots
Check the stack trace here:

https://github.com/AllenNeuralDynamics/nd-data-transfer/runs/8218772895?check_suite_focus=true

Desktop (please complete the following information):

  • N/A

Smartphone (please complete the following information):

  • N/A

Additional context
One of the unit tests is implicitly creating a client to Google Cloud Storage. As a quick fix, the unit test can be suppressed. The long-term fix is to separate the client from the class where it's instantiated and mock it in the unit test. The file that's probably causing the issue is:

https://github.com/AllenNeuralDynamics/nd-data-transfer/blob/main/tests/test_gcs_uploader.py

Add processing.json file to datasets before export.

User story

  • As an engineer, I want to keep the information I used to run the ephys job, so I can replicate the job if I need to.
  • As a scientist, I want to understand what was done to the primary data.

Acceptance criteria

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

Save to Zarr fails on Windows if filename exceeds 256 characters

Describe the bug
Windows has a built-in maximum file path length of 256 characters. When Zarr attempts to create files beyond this limit, it can throw a cryptic FileNotFoundError.

To Reproduce
This simple code shows the faulty behavior:

import shutil
from pathlib import Path

import zarr

# base_folder and zarr_name were defined elsewhere in the original snippet;
# placeholder values for illustration:
base_folder = "C:\\Users\\user\\Documents\\zarr_path_length_test"
zarr_name = "\\test.zarr"

# simple function to create a dataset in a group
def create_zarr_group_dataset(root_path):
    if Path(root_path).is_dir():
        shutil.rmtree(root_path)
    zarr_root = zarr.open(root_path, mode="w", storage_options=None)
    zarr_group = zarr_root.create_group("a_group")
    g_dset = zarr_group.create_dataset(
        name="group_data", data=[str(i) for i in range(100)], compressor=None
    )

# here we extend the file path by appending \\new_folder n_iter times
for n_iter in range(20):
    zarr_path = base_folder
    for i in range(n_iter):
        zarr_path += "\\new_folder"
    zarr_path += zarr_name
    print(f"N iter: {n_iter} - len file path {len(zarr_path)}")
    try:
        create_zarr_group_dataset(zarr_path)
    except Exception as e:
        print(f"Failed for iter {n_iter}")

Which produces:

N iter: 0 - len file path 57
N iter: 1 - len file path 68
N iter: 2 - len file path 79
N iter: 3 - len file path 90
N iter: 4 - len file path 101
N iter: 5 - len file path 112
N iter: 6 - len file path 123
N iter: 7 - len file path 134
N iter: 8 - len file path 145
N iter: 9 - len file path 156
N iter: 10 - len file path 167
N iter: 11 - len file path 178
N iter: 12 - len file path 189
N iter: 13 - len file path 200
Failed for iter 13
N iter: 14 - len file path 211
Failed for iter 14
N iter: 15 - len file path 222
Failed for iter 15
N iter: 16 - len file path 233
Failed for iter 16
N iter: 17 - len file path 244
Failed for iter 17
N iter: 18 - len file path 255
Failed for iter 18
N iter: 19 - len file path 266
Failed for iter 19

Expected behavior
We should raise an error with an informative message telling the user to reduce the depth of the destination folder.
Also opened an issue on the zarr project: zarr-developers/zarr-python#1235
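
A minimal sketch of such a check (the 256-character limit is the one referenced in this issue; the function name and headroom value are illustrative):

import os
import sys

WINDOWS_MAX_PATH = 256  # limit referenced above

def check_destination_depth(dest_path, headroom=64):
    # Fail early with a clear message instead of letting zarr raise a
    # cryptic FileNotFoundError deep inside the write.
    if sys.platform != "win32":
        return
    projected = len(os.path.abspath(dest_path)) + headroom  # headroom for group/chunk subpaths
    if projected > WINDOWS_MAX_PATH:
        raise ValueError(
            f"Destination '{dest_path}' is too deep for Windows path limits "
            f"(~{projected} characters projected); choose a shallower output folder."
        )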

Desktop (please complete the following information):

  • OS: Windows

write_ome_zarr.py: incorrect behavior with --resume flag

Describe the bug
Currently, calling write_ome_zarr.py with the --resume flag will perform the following check to see if a tile
has already been written

def _tile_exists(zarr_path, tile_name, n_levels):
    z = zarr.open(zarr_path, "r")
    try:
        # TODO: only re-upload missing levels
        a = z[f"{tile_name}/{n_levels - 1}"]
        return a.nbytes > 0
    except KeyError:
        return False

This was meant as a placeholder and is incorrect for a couple of reasons: 1) it only takes into account the shape of the array, not the stored data, and 2) it only checks the lowest resolution level, since at the time the levels were written in order from high to low resolution. All levels are written simultaneously now.

To actually resume in case of failure, what I've been doing is checking the logs to see which tile was being written, deleting that (partially written) tile from the output location, and finally restarting the job using the --resume flag, which skips over all the existing arrays (whether or not they were fully written) in the output store.

This issue zarr-developers/zarr-python#587 describes a few different ways to detect missing chunks in a Zarr array. The one that seemed most promising to me was taking the ratio of nchunks_initialized and nchunks, which should be 1 if all chunks were written (assuming there are no empty "zero" chunks). The methods which scan the entire array and/or compute checksums feel less appealing to me since it might end up being faster to just re-write the tile, but could be worth investigating depending on how thorough we want to be.
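
A minimal sketch of that ratio-based check (assuming no legitimately empty "zero" chunks, as noted above; the function name is illustrative):

import zarr

def tile_fully_written(zarr_path, tile_name, n_levels):
    z = zarr.open(zarr_path, "r")
    for level in range(n_levels):
        try:
            a = z[f"{tile_name}/{level}"]
        except KeyError:
            return False
        # nchunks_initialized counts the chunks actually present in the store
        if a.nchunks_initialized < a.nchunks:
            return False
    return True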

Deprecation warning when running ephys reader

Describe the bug
A lot of warnings show up when running the ephys upload job.

warnings.warn(
/miniconda3/envs/nd-data-transfer/lib/python3.8/site-packages/packaging/version.py:111: DeprecationWarning: Creating a LegacyVersion has been deprecated and will be removed in the next major release

To Reproduce
Steps to reproduce the behavior:

  1. Run the test_read unit test
  2. See warning shows up

Expected behavior
There shouldn't be any warnings if we can avoid it

Additional context
Add any other context about the problem here.

Tech-debt, re-organize regex patterns, and double-check ephys gcp uploader

User story

As a developer, I want a clean code base, so I can maintain it more easily.

Acceptance criteria

  • It might make more sense to move the RegexPatterns enums from EphysJobConfigurationLoader into EphysReaders.
  • Double-check that the ephys job is uploading to gcp correctly.

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

Ephys job uses codeocean trigger capsule

Acceptance Criteria:

  • The last step of the ephys upload job triggers a codeocean job triggering capsule with the spike sorting capsule ID
  • The job triggering capsule registers a data asset, triggers the desired capsule, registers the run output as a data asset
  • raw and result data assets are named according to aind-data-schema conventions (see: RawDataAsset and DerivedDataAsset).

Update spike sorting capsule to handle compressed data

The spike sorting capsule does not work out of the box on compressed data.

Acceptance Criteria

  • wavpack-numcodecs is installed
  • Address error: 'inter_sample_shift' is not a property!
  • Also: get the capsule pushed to our Github repository

Add version bump check and tag creation to Github actions

User story

As a user, I want to see which version of the code I'm working on, so I can update or retrace my steps if I want to.

Acceptance criteria

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

Correct for NP-opto electrode locations before compression

Is your feature request related to a problem? Please describe.
The OpenEphys GUI does not correctly handle the geometry of NP-opto probes when saving the settings.xml, and saves the NP-opto as having the NP1.0 configuration.
Since we automatically read the probe information from the settings file, this can lead to errors in downstream analysis.

Describe the solution you'd like
Before compressing and clipping, we should check for opto probes (easy from the settings file) and correct the electrode locations, so that the correct probe configuration is loaded in SpikeInterface.

Describe alternatives you've considered
An alternative could be to correct the probe geometry a posteriori, but since we plan to trigger the computational pipeline as soon as a new data asset is created, this is not possible.

Import metadata classes from aind-data-schema

User story

As a developer, I want to re-use code from aind-data-schema, so I can maintain the code base more easily.

Acceptance criteria

  • The MetadataSchemaClient class is removed from the project
  • Classes from aind-data-schema should be used instead

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

We might need to wait until the aind-data-schema project is published to pypi

Autobump Version

User story

As a developer, I want a github action to manage the version number, so I don't have to worry about the updates manually.

Acceptance criteria

  • When a PR is merged into main, then the version is automatically bumped based on the commit message (default to patch)

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

Add mapping of config string to Blosc Enums.

Describe the bug
Currently, the configs are parsed from json, but some of the parameters need to be mapped to enums.

To Reproduce
Steps to reproduce the behavior:

  1. Try to run with Blosc.
  2. Notice the shuffle parameter doesn't get parsed.

Expected behavior
The config parser should parse configs correctly

Additional context
We'll want to modify the ephys job config parser to handle this.
The enum values are documented here: https://numcodecs.readthedocs.io/en/stable/blosc.html
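
A minimal sketch of the mapping (using the shuffle constants exposed by numcodecs.Blosc; the config key names are assumptions about what the parser would receive):

from numcodecs import Blosc

SHUFFLE_MAP = {
    "noshuffle": Blosc.NOSHUFFLE,
    "shuffle": Blosc.SHUFFLE,
    "bitshuffle": Blosc.BITSHUFFLE,
}

def blosc_from_config(cfg: dict) -> Blosc:
    # cfg is the parsed json compressor section, e.g.
    # {"cname": "zstd", "clevel": 9, "shuffle": "bitshuffle"}
    return Blosc(
        cname=cfg.get("cname", "zstd"),
        clevel=cfg.get("clevel", 9),
        shuffle=SHUFFLE_MAP[cfg.get("shuffle", "shuffle").lower()],
    )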

Add logging to ephys job pipeline

User story

As a user and developer, I want to see useful log messages, so I can more easily track the progress or debug the processing pipeline.

Acceptance criteria

  • The components of the ephys job pipeline import logging
  • The logging level can be configured in the ephys job configs yml file.

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

Update lsb and median unit test to check values of scaled data

User story

As a developer, I want efficient and robust unit tests, so I can have a healthy and maintainable code base.

Acceptance criteria

  • Given the ephys test data is scaled, then the unit test should check that the lsb value of the scaled recording is 1 and that the median values are all 0.

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

Upload compressed ephys data sets to cloud storage

User story

As a user, I want to upload datasets to the cloud, so I can run my analysis in the cloud.

Acceptance criteria

  • Given an input directory and an output bucket in AWS or GCP, when a script is run, then the data set in the input directory is uploaded to the output bucket.

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Do any file re-organization locally.

Add unit tests for ephys writers

User story

As a developer, I want to test the writing methods easily, so I can catch bugs with writes before commits.

Acceptance criteria

  • Test coverage for ephys job pipeline should be 100%
  • There should be unit tests with mocked methods to test ephys writer class

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

Scripts to transcode and upload ephys data to cloud storage

This task is to create an easily installable script that rig operators can use to reliably upload ephys data to cloud storage.

This script contains a prototypical workflow for doing a related upload + transcode job for raw ephys data:

https://github.com/AllenNeuralDynamics/ephys-framework-tests/blob/main/rig-utils/upload_session_to_cloud.py

nd-data-transfer is being developed to take raw imaging data, compress and convert as OME-Zarr, and upload it to AWS or GCP object storage.

Non-requirements:

  • scripts do not have to be integrated into nd-data-transfer, but this would be preferable.
  • compression and transcoding do not have to be on the cluster (as is being done in nd-data-transfer for imaging data).

Acceptance Criteria:

  • script is pip-installable
  • script can upload data to GCS
  • script can upload data to S3
  • script converts relevant portions of raw data to Zarr
  • script compresses data with Blosc-zstd compressor

In the future we will migrate to a WavPack compressor.

Tech-debt, add Numpy formatted docstrings and type hints to methods

User story

As a user, I want to see documentation about the input parameters of a method, so I can understand how to use it more easily.
As a developer, I want clean, well-documented code, so I can maintain it more easily.

Acceptance criteria

  • Interrogate report has 100% coverage
  • Automate a way to check that type hints are included in all the function signatures.

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

https://peps.python.org/pep-0484/

Import CodeOcean API as a dependency for ephys jobs

User story

As a developer, I want to re-use code, so I can maintain things better.

Acceptance criteria

  • The codeocean.py module is removed from this project.
  • aind-codeocean-api is installed via pip

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

We might need to wait until aind-codeocean-api is published to pypi

Make nd-data-transfer library pip-installable

User story

As a user, I want to install nd-data-transfer via pip, so I can easily run the code on local machines.

Acceptance criteria

  • The nd-data-transfer library should be released to pypi?

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

reorganize ephys data during transfer

  • move the Videos subdirectory to the same level as ecephys_clipped and ephys_compressed
  • downcase Videos to videos to be consistent
  • name the top level directory ecephys_<subject_id>_<acq_date>_<acq_time>
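
A small sketch of building the proposed top-level name (the function name is illustrative; the date/time separators follow the asset names used elsewhere in this document):

from datetime import datetime

def ecephys_dir_name(subject_id: str, acq: datetime) -> str:
    # ecephys_<subject_id>_<acq_date>_<acq_time>
    return f"ecephys_{subject_id}_{acq:%Y-%m-%d}_{acq:%H-%M-%S}"

print(ecephys_dir_name("625463", datetime(2022, 9, 28, 16, 34, 22)))
# ecephys_625463_2022-09-28_16-34-22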

Don't copy np opto settings.xml correction directly into raw folder

User story

As a user, I want to keep my data set without any modifications from the ephys pipeline, so I can keep the data sets as raw as possible.

Acceptance criteria

  • The settings.xml file corrections should be done within the clipped/compressed datasets rather than modifying the original data sets.

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

Directory options for ephys job

User story

As a user, I want to control a few directory options, so I can manage data better.

Acceptance criteria

  • Allow user to delete original directory (or move it to a tombstoned folder?)
  • Currently, aws sync and gsutil rsync commands are used to upload folders to cloud storage. We may want to look into whether we want to overwrite the existing cloud folder, update the data if a cloud folder already exists, or send back a warning if the cloud folder already exists.

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

Move direct dependencies to install script and publish project to pypi

User story

As a user, I want to be able to pip install from pypi, so I can run the jobs without cloning the repo.

Acceptance criteria

  • There should be a separate script to install the direct dependencies.
  • The publish to pypi github action is uncommented

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

Copy translation metadata from IMS to OME-Zarr `.zattrs`

User story

As a user, I want to retain tile position metadata from the Imaris files when converting to OME-Zarr, so that I can
stitch the dataset.

According to the OME-NGFF spec (https://ngff.openmicroscopy.org/latest/), the translation field specifies the offset from the origin, in physical coordinates, and must go in the .zattrs under coordinateTransformations, like so:

{
  "multiscales": [
    {
      "datasets": [
        {
          "coordinateTransformations": [
            {
              "scale": [
                1.0,
                0.75,
                0.75
              ],
              "type": "scale"
            },
            {
              "translation": [
                -12000.0,
                0.0,
                0.0
              ],
              "type": "translation"
            }
          ],
          "path": "0"
        }
      ]
    }
  ]
}

translation must be specified after scale
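
A minimal sketch of attaching the translation to an existing OME-Zarr group's .zattrs with zarr (the store path and transform values are placeholders):

import zarr

group = zarr.open_group("tile_000000.zarr", mode="a")   # hypothetical tile store
multiscales = group.attrs["multiscales"]
multiscales[0]["datasets"][0]["coordinateTransformations"] = [
    {"type": "scale", "scale": [1.0, 0.75, 0.75]},
    {"type": "translation", "translation": [-12000.0, 0.0, 0.0]},  # offset from the IMS metadata
]
group.attrs["multiscales"] = multiscales  # reassign so the attribute is persisted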

Acceptance criteria

  • translation field shows up in the .zattrs file

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

Clip open ephys dat files

User story

As a user, I want a clipped dat file in addition to a compressed dat file, so I can still use the SpikeInterface API.

Acceptance criteria

  • When a script is run, then the ephys data folder will be copied to a destination folder with exactly the same directory structure, except all of the dat files will be clipped so that they are significantly smaller than the original.

Sprint Ready Checklist

    • Acceptance criteria defined
    • Team understands acceptance criteria
    • Team has defined solution / steps to satisfy acceptance criteria
    • Acceptance criteria is verifiable / testable
    • External / 3rd Party dependencies identified
    • Ticket is prioritized and sized

Notes

Add any helpful notes here.

Move NI-DAQ filter into scale function

Describe the bug
Currently, the NI-DAQ filter is in the read stream, and so those recordings aren't being compressed.

To Reproduce
Steps to reproduce the behavior:

  1. Run the compress ephys job
  2. Notice the NI-DAQ recording blocks do not get compressed

Expected behavior
It will make things easier if the NI-DAQ recording blocks are also compressed

Additional context
That data doesn't need to be scaled though, so we can move the filter into the scale_read_blocks method
