neicnordic / sda-pipeline

A federated storage for sensitive data, NeIC SDA

Home Page: https://neicnordic.github.io/sda-pipeline/

License: GNU Affero General Public License v3.0

Languages: Go 94.79%, Dockerfile 0.33%, Shell 4.89%

Topics: federatedega, sensitive-data, sensitive-data-archive, neic-sda

sda-pipeline's Introduction

Archival notice

โš ๏ธ This repository is no longer maintained. The code has been integrated and it is further developed at: https://github.com/neicnordic/sensitive-data-archive

sda-pipeline

Badges: License, GoDoc, Build Status, Go Report Card, Code Coverage, DeepSource. Join the chat at https://gitter.im/neicnordic/sda-pipeline

sda-pipeline is part of the NeIC Sensitive Data Archive and implements the components required for data submission. It can be used as part of a Federated EGA or as an isolated Sensitive Data Archive. sda-pipeline was built with support for both S3 and POSIX storage.

Deployment

Recommended provisioning method for production is:

For local development and testing, see the README in the dev_utils folder; it has sections on running the pipeline locally using Docker Compose.

Core Components

| Component | Role |
| --------- | ---- |
| intercept | The intercept service relays messages between the queue provided by the federated service and the local queues. (Required only for the Federated EGA use case.) |
| ingest | The ingest service accepts messages for files uploaded to the inbox, registers the files in the database with their headers, and stores them header-stripped in the archive storage. |
| verify | The verify service reads and decrypts ingested files from the archive storage and sends accession requests. |
| finalize | The finalize service accepts messages with accessionIDs for ingested files and registers them in the database. |
| mapper | The mapper service registers the mapping of accessionIDs (IDs for files) to datasetIDs. |
| backup | The backup service accepts messages with accessionIDs for ingested files and copies them to the second/backup storage. |

Internal Components

| Component | Role |
| --------- | ---- |
| broker | Package for communication with the message broker (SDA-MQ). |
| config | Package for managing configuration. |
| database | Provides functionality for using the database, as well as high-level functions for working with the SDA-DB. |
| storage | Provides an interface for storage areas, either a regular file system (POSIX) or an S3 object store. |
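
The storage package hides whether POSIX or S3 is configured behind a common interface. A minimal sketch of what such a backend-agnostic interface might look like (method names are illustrative, not the actual sda-pipeline API):

```go
// Package storagesketch is an illustrative stand-in for the storage package.
package storagesketch

import "io"

// Backend abstracts over POSIX and S3 so the services never need to know
// which storage type is configured.
type Backend interface {
	// NewFileReader opens the file at filePath for streaming reads.
	NewFileReader(filePath string) (io.ReadCloser, error)
	// NewFileWriter opens (or creates) the file at filePath for writing.
	NewFileWriter(filePath string) (io.WriteCloser, error)
	// GetFileSize returns the size of the stored object in bytes.
	GetFileSize(filePath string) (int64, error)
	// RemoveFile deletes the stored object.
	RemoveFile(filePath string) error
}
```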

Documentation

sda-pipeline documentation can be found at: https://neicnordic.github.io/sda-pipeline/pkg/sda-pipeline/

NeIC Sensitive Data Archive documentation can be found at: https://neic-sda.readthedocs.io/en/latest/ along with documentation about other components for data access.

Contributing

We happily accept contributions. Please see our contributing documentation for some tips on getting started.

sda-pipeline's People

Contributors

aaperis, blankdots, dbampalikis, dependabot-preview[bot], dependabot[bot], dtitov, gitter-badger, jbygdell, jonandernovella, kjellp, kostas-kou, kusalananda, lilachic, nanjiangshu, norling, pahatz, pontus, sstli, viklund

sda-pipeline's Issues

Send messages on error

Upon failures that aren't likely to be local or temporary, we need to make sure we send a message to the error queue for communication to the user, and also nack the message so our queues do not fill up with messages that only lead to failures.
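
A rough sketch of the intended behaviour using github.com/streadway/amqp; the exchange and routing-key names ("sda", "error") and the handlePermanentFailure helper are placeholders, not the actual configuration or code:

```go
// Package mqsketch illustrates the error-handling idea; it is not the
// broker package from this repository.
package mqsketch

import "github.com/streadway/amqp"

// handlePermanentFailure reports a non-recoverable failure to the user via
// the error queue and then nacks the message without requeueing, so the
// work queues do not fill up with messages that can never succeed.
func handlePermanentFailure(ch *amqp.Channel, delivery amqp.Delivery, reason string) error {
	err := ch.Publish(
		"sda",   // exchange name (placeholder)
		"error", // routing key for the error queue (placeholder)
		false,   // mandatory
		false,   // immediate
		amqp.Publishing{
			ContentType:   "application/json",
			CorrelationId: delivery.CorrelationId,
			Body:          []byte(reason),
		},
	)
	if err != nil {
		return err
	}
	return delivery.Nack(false, false) // do not requeue
}
```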

Use pipes to connect readers and writers

At the moment we are not able to perform batched S3 uploads since we cannot perform append-like operations using the S3 API. The solution might be to put all writes into a pipe and then let the uploader consume the pipe's reader.
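
A minimal sketch of the idea using io.Pipe from the standard library; the Uploader interface is a stand-in for the S3 client (which also consumes an io.Reader as the object body), not the project's actual storage API:

```go
// Package uploadsketch shows how a pipe connects a producer to an uploader.
package uploadsketch

import "io"

// Uploader stands in for the S3 upload call, which consumes an io.Reader.
type Uploader interface {
	Upload(key string, body io.Reader) error
}

// StreamUpload connects the writer side to the uploader with an io.Pipe,
// so data is uploaded as it is produced instead of being appended to the
// object afterwards (which the S3 API does not allow).
func StreamUpload(u Uploader, key string, src io.Reader) error {
	pr, pw := io.Pipe()

	go func() {
		// Any per-file processing (e.g. stripping the crypt4gh header)
		// would happen on this side of the pipe.
		_, err := io.Copy(pw, src)
		pw.CloseWithError(err) // propagate write errors to the reader side
	}()

	return u.Upload(key, pr)
}
```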

Sync reads whole file in memory

The sync service currently reads the whole file into memory, causing it to crash when the file size grows. We need to allow a specific buffer size to be used instead.
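
A minimal sketch of streaming with a bounded buffer instead of reading the whole file into memory; the bufferSize parameter is a hypothetical option, not an existing configuration key:

```go
// Package backupsketch illustrates chunked copying with a fixed-size buffer.
package backupsketch

import "io"

// CopyChunked streams src to dst using a buffer of bufferSize bytes, so
// memory use stays constant regardless of the file size.
func CopyChunked(dst io.Writer, src io.Reader, bufferSize int) (int64, error) {
	buf := make([]byte, bufferSize) // bufferSize must be > 0
	return io.CopyBuffer(dst, src, buf)
}
```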

Standardize error queue messages

The error messages that pour into the error queue should be made more uniform. This could be done e.g. by using one of the structs that we already have in place for errors.
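
One way this could look; the struct and field names below are illustrative, not the struct currently in the code base:

```go
// Package brokersketch shows one possible uniform error message shape.
package brokersketch

import "encoding/json"

// infoError is a single JSON shape for everything published to the error queue.
type infoError struct {
	Error           string `json:"error"`
	Reason          string `json:"reason"`
	OriginalMessage string `json:"original-message"`
}

// marshalError renders any failure in the same standard format.
func marshalError(errType, reason string, original []byte) ([]byte, error) {
	return json.Marshal(infoError{
		Error:           errType,
		Reason:          reason,
		OriginalMessage: string(original),
	})
}
```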

Add `sync/backup` service to test suite

DoD: an ingestion test run with the sync/backup service backing up the file from the archive before finalize receives the message to start its work. sync/backup listens to accessionIDs and publishes with a routing key such that the message ends up in the queue that finalize listens to. finalize cannot listen to accessionIDs in this scenario.

Create dataset id that is not dependent on path

Currently the orchestrator is creating dataset ids based on the path of the file. That needs to change, given that files might be uploaded in the same folder.

Ideas:

  • Deploy a service that gets the message from the completed queue and, upon user interaction, creates the dataset id.
    The dataset id will be given manually (possibly by the data stewards, e.g. a DOI).

Inputs:

  1. File paths as they are defined in the metadata
  2. DOI for the specific dataset
  3. Submitter's username

Output:
Message for the mapper service (see the sketch below)

Could even be a Kubernetes job

Note: The dataset id creation is not an automatic process
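
A sketch of what producing the mapper message could look like once the dataset id (e.g. a DOI) has been decided manually; the field names mirror the mapping message the mapper consumes, but treat the exact schema as an assumption:

```go
// Package orchestratesketch builds a mapping message for the mapper service.
package orchestratesketch

import "encoding/json"

// datasetMapping is the assumed shape of the mapping message.
type datasetMapping struct {
	Type         string   `json:"type"`
	DatasetID    string   `json:"dataset_id"`
	AccessionIDs []string `json:"accession_ids"`
}

// NewMappingMessage ties a set of accession IDs to a dataset id that is
// independent of any file path (for example a DOI provided by the data stewards).
func NewMappingMessage(doi string, accessionIDs []string) ([]byte, error) {
	return json.Marshal(datasetMapping{
		Type:         "mapping",
		DatasetID:    doi,
		AccessionIDs: accessionIDs,
	})
}
```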

create integration test

  • ingest on posix + posix followed by verification
  • ingest on posix + S3 followed by verification
  • ingest on S3 + posix followed by verification
  • ingest on S3 + S3 followed by verification

Publish the initial message to the files routing key in the localega exchange to start the ingestion process. This should be possible to do using curl via RabbitMQ's API.

{
"user": "test",
"filepath": "dummy_data.c4gh"
}

Messages, s3 and eventual consistency

Since S3 is eventually consistent, it's quite possible that we get a message for a file in the inbox before the file is actually available when we try to access it. I believe this (or some similar issue) is currently seen in #141.

The fix for that will likely be different, but for ingestion as well as verification, we should probably handle the case where the message arrives before the S3 provider gets its things in order. That is, we should not treat a failure to get a file size or reader as fatal, but rather as a signal to retry (up to a certain limit).

I'm not sure how that is best done through the broker (requeue or send a new message) and what can be done (include a counter? Do we have a timestamp we can look at?) but we will probably need to do something.
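
Whatever the broker-side mechanism ends up being, the service-side change is roughly a bounded retry instead of failing on the first attempt. A minimal sketch (the names and the backoff policy are assumptions):

```go
// Package ingestsketch shows a bounded retry around a storage lookup.
package ingestsketch

import (
	"fmt"
	"time"
)

// withRetry retries fn a bounded number of times with a growing delay,
// so a not-yet-consistent S3 backend gets a chance to catch up before we
// treat the failure as fatal and send the message to the error queue.
func withRetry(attempts int, delay time.Duration, fn func() (int64, error)) (int64, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		size, err := fn()
		if err == nil {
			return size, nil
		}
		lastErr = err
		time.Sleep(delay * time.Duration(i+1)) // linear backoff
	}
	return 0, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}
```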

Fix broken test

8_bad_json.sh and 10_trigger_failures.sh are broken and need investigating.

Rename `Sync` to `backup`

There is a service called sync in the pipeline. This should more aptly be named backup (since it does backups). This service needs to be renamed, documentation updated, and references updated.

Refactor db logic

We should encapsulate the db structs in a generic Db struct instead of using the psql struct directly.
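
A rough sketch of the suggested direction (names are illustrative, not the current code): callers program against a small interface, and only the database package knows that the concrete implementation is Postgres:

```go
// Package dbsketch illustrates wrapping the SQL connection behind an interface.
package dbsketch

import "database/sql"

// Database is what the services depend on instead of the psql struct.
type Database interface {
	RegisterFile(filePath, user string) (string, error)
	Close() error
}

// SQLdb wraps the concrete connection; only this package knows it is Postgres.
type SQLdb struct {
	db *sql.DB
}

// RegisterFile inserts a file row and returns its id (placeholder query,
// not the real schema).
func (s *SQLdb) RegisterFile(filePath, user string) (string, error) {
	var id string
	err := s.db.QueryRow(
		"SELECT register_file($1, $2)",
		filePath, user,
	).Scan(&id)
	return id, err
}

// Close releases the underlying connection pool.
func (s *SQLdb) Close() error { return s.db.Close() }
```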

FEGA Accession IDs problematic if same file uploaded several times

Accession ids from FEGA are based on the checksum of the (probably decrypted) file, so if the same file is uploaded several times there will be two entries in the database with the same accession id.
This makes it difficult to figure out which row belongs to which dataset. This matters because, for example, the filename needs to be correct for the private metadata (which is uploaded to the FEGA node). Also, there should not be private metadata in the filenames, but this could happen, and it should not leak by mistake between datasets.

DoD
We need to be able to better distinguish between files in the database and when we ingest them.
This might involve some db changes to https://github.com/neicnordic/sda-db if we want to modify the db schema

Add docker-compose without TLS

Currently, dev_utils has TLS enabled by default. We need a docker-compose setup for local development that starts the services, sda-db and sda-mq, without TLS.

Add mapper

  • Rework mapper to use our shared configuration
  • Include in functionality tests

Extend sda-sync service to also copy header

For the big picture, we need to sync files between Sweden and Finland. To do this, we need to be able to sync the header as well.

Extend the s3sync service so it can get the header from the database and reencrypt it using another public key.

Question about the protocol: should the header and body be concatenated, or should the header be sent separately?

  • The simplest solution would be to fetch the header, re-encrypt it and pass it through the MQ message (see the sketch below)

  • re-encrypt with the Finnish public key

  • standalone deployment template
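
A rough sketch of the "pass the header through the MQ message" option; the message shape and the reEncryptHeader helper are hypothetical, and a real implementation would re-encrypt using the crypt4gh header functions with the receiving node's public key:

```go
// Package syncsketch illustrates carrying a re-encrypted header in the MQ message.
package syncsketch

import (
	"encoding/base64"
	"encoding/json"
)

// syncMessage is a hypothetical message shape for the sync service.
type syncMessage struct {
	User     string `json:"user"`
	Filepath string `json:"filepath"`
	// Header is the re-encrypted crypt4gh header, base64 encoded so the
	// binary data survives the JSON body.
	Header string `json:"header"`
}

// buildSyncMessage assumes the stored header has already been read from the
// database; it re-encrypts it for the recipient and embeds it in the message.
func buildSyncMessage(user, path string, header []byte, recipientPubKey [32]byte) ([]byte, error) {
	newHeader, err := reEncryptHeader(header, recipientPubKey)
	if err != nil {
		return nil, err
	}
	return json.Marshal(syncMessage{
		User:     user,
		Filepath: path,
		Header:   base64.StdEncoding.EncodeToString(newHeader),
	})
}

// reEncryptHeader is a placeholder so the sketch compiles; the actual
// implementation would decrypt the header and re-encrypt it with the
// recipient's (e.g. the Finnish node's) public key.
func reEncryptHeader(header []byte, _ [32]byte) ([]byte, error) {
	return header, nil
}
```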

Deployment documentation

Update and/or add documentation on how it's supposed to be set up and configured.

Figure out how to cross-reference documentation we have in the project.
