bd2kgenomics / dcc-dockstore-tool-runner

A Dockstore tool designed to perform file downloads from Redwood, run another Dockstore tool, and then upload to Redwood.

Home Page: https://dockstore.org/containers/quay.io/ucsc_cgl/dockstore-tool-runner

License: Apache License 2.0


dcc-dockstore-tool-runner's Introduction

dcc-dockstore-tool-runner

A Dockstore tool designed to perform file downloads from Redwood, run another Dockstore tool, and then prepare a metadata.json and upload results to Redwood.

Running Locally

Normally you would not run this tool directly; you will almost always run it via Dockstore or, perhaps, via Docker. For development purposes, though, you may want to set up a local environment for debugging and extending this tool.

Install Deps

Ubuntu 14.04

You need to make sure system-level dependencies are installed in the appropriate way for your OS. On Ubuntu 14.04, run:

sudo apt-get install python-dev libxml2-dev libxslt-dev lib32z1-dev

Python and Packages

Use Python 2.7.x.

See the virtualenv documentation for information on setting up a virtual environment for Python.

If you haven't already installed pip and virtualenv, depending on your system you may (or may not) need to use sudo for these:

sudo easy_install pip
sudo pip install virtualenv

Now to set up the environment:

virtualenv env
source env/bin/activate
pip install jsonschema jsonmerge openpyxl sets json-spec elasticsearch semver luigi python-dateutil setuptools==28.8.0 cwl-runner cwltool==1.0.20160712154127 schema-salad==1.14.20160708181155 avro==1.8.1 typing

Alternatively, you may want to use Conda; see the Conda documentation for more information.

conda create -n dockstore-tool-runner-project python=2.7.11
source activate dockstore-tool-runner-project
pip install jsonschema jsonmerge openpyxl sets json-spec elasticsearch semver luigi python-dateutil setuptools==28.8.0 cwl-runner cwltool==1.0.20160712154127 schema-salad==1.14.20160708181155 avro==1.8.1 typing

Patch cwltool

Unfortunately, we need to patch cwltool so we can properly handle calling nested Docker containers through it. Specifically, we need to pass in the Docker socket and also ensure the working directory paths are consistent between the various layers of Docker calls. If you have installed cwltool via pip in a virtualenv or conda environment make sure you patch that one and not the system version. Customize the below for your environment.

patch -d /usr/local/lib/python2.7/dist-packages/cwltool/ < job.patch

Install Dockstore CLI

Take a look at http://dockstore.org and go through the onboarding process.

That being said, this particular tool is designed to work with Dockstore CLI 1.0.1. A copy is located in the dockstore directory so you can make sure you're using the version that matches the version of cwltool you patched above.

Redwood Client

You will need a copy of the Redwood client (assuming you are calling the tool outside of Docker/Dockstore); you can download it from here. You will also need a token to upload/download data. This assumes you're pulling from/pushing to Redwood at UCSC; if not, you don't need the client or token (instead, for example, you'll pull/push data to S3).

Make a Temp Dir

The dcc-dockstore-tool-runner is a Dockstore-based tool that calls another Dockstore-based tool. This is complex, since it means we need to 1) patch cwltool to consistently pass in the Docker socket and 2) use a common shared data path for the nested containers. The patch hardcodes this shared location to /datastore, so please create that directory before running these tools.

sudo mkdir /datastore
sudo chmod a+rwx /datastore

You should make sure /datastore is a large volume if you intend to work with large inputs/outputs.

If you're on a Mac, make sure you allow Docker to mount this directory; otherwise you'll see an error like the one shown in docs/mac_mount.png.

Testing

The tests below are ordered from lowest level to highest level. Running the released version of this tool through the Dockstore CLI is the way most users will call it.

Testing Python Command Directly

The command below will download samples from Redwood, run the fastqc tool from Dockstore on two FASTQ files, and then upload the results back to a Redwood storage system. This uses real files and requires controlled access and a token to work. You need to have cwltool installed and patched, and Docker installed; see above.

# example with real files
python DockstoreRunner.py --redwood-path `pwd`/ucsc-storage-client --redwood-token `cat accessToken` --redwood-host storage.ucsc-cgl.org --json-encoded ew0KCSJmYXN0cV9maWxlcyI6IFt7DQoJCSJjbGFzcyI6ICJGaWxlIiwNCgkJInBhdGgiOiAicmVkd29vZDovL3N0b3JhZ2UudWNzYy1jZ2wub3JnL2ZiM2RkODVkLTM2N2UtNWI4ZC05OTI5LTk1MTY0MDg5ZDEwZi83M2YyMzYyNS04ZTU1LTU0MDgtOWY0ZS1hMmRlZDg0MGE2NWQvTkExMjg3OC1OR3YzLUxBQjEzNjAtQV8xLmZhc3RxLmd6Ig0KCX1dLA0KCSJ6aXBwZWRfZmlsZSI6IHsNCgkJImNsYXNzIjogIkZpbGUiLA0KCQkicGF0aCI6ICIvdG1wL2Zhc3RxY19yZXBvcnRzLnRhci5neiINCgl9DQp9 --docker-uri quay.io/briandoconnor/fastqc:0.11.5 --dockstore-url https://dockstore.org/containers/quay.io/briandoconnor/fastqc --workflow-type sequence_upload_qc_report --parent-uuid d2545c4e-dd7d-5a07-b598-e9acba87228f --vm-instance-type m4.4xlarge --vm-region us-west-2 --vm-instance-cores 16 --vm-instance-mem-gb 64 --vm-location aws --tmpdir /datastore

This encoded string corresponds to the contents of sample_fastqc.json.

To encode and decode online see: https://www.base64encode.org/
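If you prefer not to use the website, the same encoding round-trip can be done with a few lines of Python. The paths below are placeholders for illustration, not the real bundle paths from the example above:

```python
import base64
import json

# Parameters of the kind normally kept in sample_fastqc.json
# (placeholder paths, not real Redwood bundle IDs)
params = {
    "fastq_files": [{"class": "File",
                     "path": "redwood://storage.ucsc-cgl.org/some-bundle/sample_1.fastq.gz"}],
    "zipped_file": {"class": "File", "path": "/tmp/fastqc_reports.tar.gz"},
}

# Encode for the --json-encoded flag
encoded = base64.b64encode(json.dumps(params).encode("utf-8")).decode("ascii")
print(encoded)

# Decode to verify the round-trip
decoded = json.loads(base64.b64decode(encoded).decode("utf-8"))
assert decoded == params
```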

Testing Via cwltool

This may be useful for debugging; it's one layer above calling the Python script directly.

NOTE: THE ENVIRONMENT VARIABLE TMPDIR MUST BE SET TO A DIRECTORY WITH ENOUGH SPACE TO HOLD INPUT, OUTPUT, AND INTERMEDIATE FILES. Otherwise cwltool will default to /var/spool/cwl, which may not have enough space.

cwltool --debug --enable-dev --non-strict --enable-net  <path to>/Dockstore.cwl --redwood-path `pwd`/ucsc-storage-client --redwood-token `cat accessToken` --redwood-host storage.ucsc-cgl.org --json-encoded  ew0KCSJmYXN0cV9maWxlcyI6IFt7DQoJCSJjbGFzcyI6ICJGaWxlIiwNCgkJInBhdGgiOiAicmVkd29vZDovL3N0b3JhZ2UudWNzYy1jZ2wub3JnL2ZiM2RkODVkLTM2N2UtNWI4ZC05OTI5LTk1MTY0MDg5ZDEwZi83M2YyMzYyNS04ZTU1LTU0MDgtOWY0ZS1hMmRlZDg0MGE2NWQvTkExMjg3OC1OR3YzLUxBQjEzNjAtQV8xLmZhc3RxLmd6Ig0KCX1dLA0KCSJ6aXBwZWRfZmlsZSI6IHsNCgkJImNsYXNzIjogIkZpbGUiLA0KCQkicGF0aCI6ICIvdG1wL2Zhc3RxY19yZXBvcnRzLnRhci5neiINCgl9DQp9 --dockstore-uri quay.io/briandoconnor/fastqc:0.11.5 --parent-uuid d2545c4e-dd7d-5a07-b598-e9acba87228f --tmpdir /datastore

Testing Via Dockstore CLI

Most people, other than developers of this tool, will use the Dockstore CLI to invoke it. Here's an example running on an Ubuntu box. Build the Docker image:

# patch in /usr/local/lib/python2.7/dist-packages/cwltool
# make a tmpdir like /datastore
docker build -t quay.io/ucsc_cgl/dockstore-tool-runner:1.0.20 .
# fill in your JSON from Dockstore.json template as Dockstore.my.json
mkdir /datastore; chown ubuntu:ubuntu /datastore/
# local execution
TMPDIR=/datastore dockstore tool launch --entry Dockstore.cwl --local-entry --json Dockstore.my.json
# as root in /datastore
TMPDIR=/datastore dockstore tool launch --entry ~ubuntu/gitroot/BD2KGenomics/dcc-dockstore-tool-runner/Dockstore.cwl --local-entry --json ~ubuntu/gitroot/BD2KGenomics/dcc-dockstore-tool-runner/Dockstore.my.json
# execute published on dockstore (this is the way most people will use this tool!)
dockstore tool launch --entry quay.io/ucsc_cgl/dockstore-tool-runner:1.0.20 --json Dockstore.my.json

# when running you'll see it launch the cwltool command; you might find this useful while debugging
cwltool --enable-dev --non-strict --enable-net --outdir /datastore/./datastore/launcher-ff6b55b3-52e8-430c-9a70-1ff295332698/outputs/ --tmpdir-prefix /datastore/./datastore/launcher-ff6b55b3-52e8-430c-9a70-1ff295332698/working/ /home/ubuntu/gitroot/BD2KGenomics/dcc-dockstore-tool-runner/Dockstore.cwl /datastore/./datastore/launcher-ff6b55b3-52e8-430c-9a70-1ff295332698/workflow_params.json

Known Issues

AttributeError: 'str' object has no attribute 'append'

This looks like a bug in the Python 2.7 shipped with macOS, since it has been fixed in the version shipped with Ubuntu.

The params section of metadata.json needs to be fixed:

"workflow_params" : {
  "%s": "%s","%s": "%s","%s": "%s","%s": "%s","%s": "%s","%s": "%s"
}
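The uninterpolated `"%s"` pairs suggest a printf-style template whose substitution never runs. A likely fix, sketched here as an assumption rather than the project's actual code, is to build a real dict and serialize it with json.dumps, so keys and values are substituted and escaped automatically (the parameter names below are illustrative):

```python
import json

# Hypothetical parameter pairs that should end up in metadata.json
pairs = [("redwood-host", "storage.ucsc-cgl.org"),
         ("workflow-type", "sequence_upload_qc_report")]

# Build a real dict instead of formatting a '"%s": "%s"' template string,
# so keys and values are actually substituted and properly JSON-escaped.
workflow_params = {key: value for key, value in pairs}
print(json.dumps({"workflow_params": workflow_params}, indent=2))
```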

TODO

  • need to fix the params issue

dcc-dockstore-tool-runner's People

Contributors

briandoconnor, caaespin, gpelayo, tboser, wshands

dcc-dockstore-tool-runner's Issues

Metadata should be uploaded to storage system after result files are uploaded

The metadata specifying the result files produced by a pipeline is currently uploaded before the result files themselves are uploaded to the storage system. These steps should be reversed, since the result-file upload is the more likely step to fail. If it fails after the metadata upload, the metadata will indicate that a result file is in the storage system when it is not; when that metadata is later used to locate files, the attempted download will fail, and the browser will display details for a file that does not actually exist in the storage system.
If, instead, the result files are uploaded before the metadata and the result-file upload fails, the process stops and no metadata is uploaded for the pipeline results. If the metadata upload itself fails, which is unlikely, the result files will exist in the storage system but neither the user nor the browser will know about them, and the pipeline will simply need to be rerun.

No Program assigned to results

There is no assignment of program/project in the resulting metadata.json; as a result, registration and upload to redwood cannot happen, as it now requires scoping.

Autoscaling: Dockstore Tool Runner

Dockstore tool runner

  • Add parameter for output URL (S3 bucket)
  • Code to convert Redwood URL to signed AWS https URL
  • Code to download results from output URL (S3 bucket)
    Or code to upload directly from output URL (S3 bucket) to Redwood storage system

update redwood client

To be compatible with dcc-ops/redwood, the old redwood-client from s3://beni-storage-dev shouldn't be used.

The Dockerfile can inherit from quay.io/ucsc_cgl/redwood-client:1.1.1 (which itself inherits from ubuntu:16.04) instead of from ubuntu:16.04 directly, and then use the exposed upload, download, redwood-download (bundle download), dcc-metadata-client, or icgc-storage-client commands. This will make it much easier to update in the future.

Tool run metadata start and stop time backwards

The tool_run and download start and stop times are reversed:

"bundle_uuid": "fad0c4f0-0551-4331-9fee-0a327bb6624b", 
"timing_metrics": {
    "step_timing": {
        "download": {
            "walltime_seconds": 388, 
            "stop_time_utc": "2017-08-31T16:33:59.503111", 
            "start_time_utc": "2017-08-31T16:40:28.102868"
        }, 
        "tool_run": {
            "walltime_seconds": 9079, 
            "stop_time_utc": "2017-08-31T16:40:28.102946", 
            "start_time_utc": "2017-08-31T19:11:47.120607"
        }
    }, 
    "overall_walltime_seconds": 9467, 
    "overall_stop_time_utc": "2017-08-31T19:11:47.120683", 
    "overall_start_time_utc": "2017-08-31T16:33:59.503111"
},
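The fix is presumably just to record the timestamp before a step starts and again after it finishes; a minimal sketch of the intended ordering (the sleep stands in for the actual download or tool_run step):

```python
import datetime
import time

step_timing = {}

start = datetime.datetime.utcnow()  # record *before* the step runs
time.sleep(0.01)                    # stand-in for the download / tool_run work
stop = datetime.datetime.utcnow()   # record *after* the step finishes

step_timing["download"] = {
    "start_time_utc": start.isoformat(),
    "stop_time_utc": stop.isoformat(),
    "walltime_seconds": (stop - start).total_seconds(),
}
print(step_timing["download"])
```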

Add option to pass signed urls to workflow

Rather than downloading the Redwood URLs ourselves, we should generate signed URLs and allow the pipeline to download them. This is necessary for autoscaling Docker workers.

To do this, I plan on modifying DockstoreRunner.download_and_transform_json to look at an argument along the lines of --generate_signed_urls. If that argument is passed, we can map the redwood url to an s3 url and use boto's generate_url function to create our signed url.
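boto's generate_url implements S3 query-string authentication (signature v2); as a stdlib-only sketch of the same scheme (the bucket, key, and credentials below are placeholders, not real values):

```python
import base64
import hashlib
import hmac
import time
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

def generate_signed_url(bucket, key, access_key, secret_key, expires_in=3600):
    """Build an S3 query-string-authenticated URL (signature v2),
    equivalent to what boto's generate_url produces."""
    expires = int(time.time()) + expires_in
    # Per the S3 v2 spec: verb, content-md5, content-type, expires, resource
    string_to_sign = "GET\n\n\n%d\n/%s/%s" % (expires, bucket, key)
    signature = base64.b64encode(
        hmac.new(secret_key.encode("utf-8"),
                 string_to_sign.encode("utf-8"),
                 hashlib.sha1).digest()
    ).decode("ascii")
    return ("https://%s.s3.amazonaws.com/%s"
            "?AWSAccessKeyId=%s&Expires=%d&Signature=%s"
            % (bucket, key, access_key, expires, quote(signature, safe="")))

# Hypothetical bucket, key, and credentials, for illustration only
url = generate_signed_url("my-bucket", "results/output.tar.gz",
                          "AKIAEXAMPLE", "example-secret")
print(url)
```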

Test Kallisto-only RNA-Seq feature

Test the feature and collect timing with the feature on and off, to determine the compute cost of a Kallisto-only run vs. running both RSEM and Kallisto.
