
workflows_survey

A survey of bioinformatics workflow and pipeline frameworks


scripts

Scripts written in a Unix shell or another scripting language such as Perl or Python can be regarded as the most basic form of pipeline framework. Their robustness tends to be brittle: dependencies between the upstream and downstream steps of the pipeline have to be resolved manually by the authors themselves, and reentrancy, i.e. the ability of the framework to recover from the last interrupted point, is difficult to implement.

For example, Redundans is a pipeline that assists the assembly of heterozygous/polymorphic genomes. To run it, your OS needs to satisfy the prerequisites; a single command typed in the terminal starts the pipeline, which runs until it generates its output or is interrupted by an exception.
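To make the reentrancy problem concrete, here is a minimal Python sketch of such a hand-rolled pipeline (the commands and file names are hypothetical): the only recovery mechanism is skipping a step whose output file already exists, and a step that crashes halfway through leaves a partial output that the next run will wrongly treat as complete.

```python
import os
import subprocess

def run_step(cmd, output):
    # Naive "reentrancy": skip the step if its output already exists.
    # A half-written output from a crashed run is wrongly treated as done.
    if os.path.exists(output):
        print(f"skipping, {output} already exists")
        return
    subprocess.run(cmd, shell=True, check=True)

# Hypothetical two-step pipeline; the dependency between the steps exists
# only in the ordering of these calls and must be maintained by hand.
run_step("fastqc reads.fastq", "reads_fastqc.html")
run_step("bwa mem ref.fa reads.fastq > aln.sam", "aln.sam")
```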

Recently, as Docker container technology has become more and more popular, many of these pipelines have been packaged into images in order to avoid additional installation steps.

make

The make utility has been used successfully to manage the file transformations common in scientific computing pipelines. It introduced the concept of 'implicit wildcard rules', which resolve files and dependencies based on file suffixes. However, make itself was not designed for scientific pipelines and consequently has several limitations, such as no built-in support for distributed computing, a lack of powerful data structures, and no way to implement sophisticated logic, making it impractical for modern bioinformatics analysis.


The frameworks surveyed below are classified along the following dimensions:

  • Syntax: Implicit -> tasks (jobs) specify their input/output file names or file-name wildcards, from which dependencies are derived.
  • Syntax: Explicit -> tasks (jobs) depend on other tasks, not on file targets.
  • Paradigm: Class -> pipelines are developed and described based on inherited classes.
  • Paradigm: Configuration -> pipelines are developed and described based on configurations.


Arvados

It enables you to quickly begin using cloud computing resources in your data science work and allows you to track your methods and datasets, share them and re-run analysis.

  • Syntax: Explicit
  • Paradigm: Configuration
  • Interaction: CLI
  • Distributed Computing Support: Yes (see the Curoverse documentation)
  • Extensive: SDK
  • Language: Python, Go, Ruby
  • License: AGPL-3.0, Apache-2.0, CC-BY-SA-3.0
  • Pros: CWL and YAML are supported; the runtime environment is packaged into a Docker image
  • Cons: Compute nodes must have Docker installed to run containers, so root privileges are required.
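A minimal sketch of listing collections with the Arvados Python SDK (an assumption here: the arvados-python-client package, which the survey does not name explicitly); it expects ARVADOS_API_HOST and ARVADOS_API_TOKEN to be set in the environment, and field names may differ between versions.

```python
import arvados

# Assumes ARVADOS_API_HOST and ARVADOS_API_TOKEN are set in the environment.
api = arvados.api('v1')

# List a few collections visible to this token.
result = api.collections().list(limit=5).execute()
for collection in result['items']:
    print(collection['uuid'], collection.get('name', ''))
```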

Apache Taverna

Taverna Workbench is a redesign of the Taverna Workbench 1.7.x series from the ground up. In addition to the improvements to the engine, Taverna Workbench provides more support for workflow design with a graphical workflow editor for direct manipulation of the workflow diagram, context-specific views over Web services and other resources, and standard editing facilities such as copy/paste and undo/redo.

  • Syntax: Explicit
  • Paradigm: Configuration
  • Interaction: CLI (Apache Taverna Command-line Tool) & workbench (Taverna Workbench)
  • Distributed Computing Support: Yes, Service Sets (WSDL based)
  • Extensive: SDK
  • Language: Java
  • License: Apache License 2.0
  • Pros: Data viewer for viewing results. Provenance management, which captures the provenance of workflow definitions, workflow runs and data, recording everything generated at runtime. Taverna Player is a web-based interface for executing existing workflows with new data. Taverna Server is used as the back-end processing server. Taverna Language is a Java API that gives programmatic access for inspecting, modifying and converting SCUFL2 workflow definitions and Research Object Bundles.
  • Cons: It seems that only a few service sets are available, and it is difficult to create your own service from the current documentation. Pipeline development requires learning an additional language.

Galaxy

Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.

  1. Accessible: Users without programming experience can easily specify parameters and run tools and workflows.
  2. Reproducible: Galaxy captures information so that any user can repeat and understand a complete computational analysis.
  3. Transparent: Users share and publish analyses via the web and create Pages, interactive, web-based documents that describe a complete analysis.
  • Syntax: Explicit

  • Paradigm: Configuration

  • Interaction: workbench

  • Distributed Computing Support: Yes, CloudMan

  • Extensive: API

  • Language: Python

  • License: Academic Free License version 3.0

  • Pros: Integrated graphical interface and visualization tools with easy drag-and-drop. There are several programming-language bindings for the Galaxy API; the documentation covers Java, PHP, Python, JavaScript and the CLI.

  • Cons: By default, one can deploy Galaxy on AWS, the Jetstream cloud, etc. To create a private cloud following the documentation, one needs to build galaxyIndicesFS, install dependencies and start the relevant services, which may take some time and require root privileges.
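As an example of the Python binding mentioned above, here is a minimal sketch using the third-party BioBlend client (an assumption; the survey does not name the specific binding) to list workflows on a Galaxy server. The URL and API key are placeholders.

```python
from bioblend.galaxy import GalaxyInstance

# Placeholder server URL and API key; point these at your own Galaxy instance.
gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

# List the workflows visible to this account.
for workflow in gi.workflows.get_workflows():
    print(workflow["id"], workflow["name"])
```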


Agave

The Agave Platform is an open source, science-as-a-service API platform for powering your digital lab. Agave allows you to bring together your public, private, and shared high performance computing (HPC), high throughput computing (HTC), Cloud, and Big Data resources under a single, web-friendly REST API.

  1. Run code
  2. Manage data
  3. Collaborate meaningfully
  4. Integrate anywhere
  • Syntax: Explicit

  • Paradigm: Configuration

  • Interaction: CLI and Agave ToGo

  • Distributed Computing Support: Yes, Execution Sys Config

    • HPC, Condor -> batch scheduler
    • CLI -> processes
  • Extensive: Web API

  • Language: Php, Java

  • License: Unknown

  • Pros: JSON files are used to configure everything. The core service is deployed and distributed as Docker images. It has a file service to manage data storage, and the system itself transfers data and ensures the transfer completes. It has an AngularJS-based web administration console.

  • Cons: It implements the basic functions of a workflow system but lacks reproducibility features. Data is transferred among nodes, which may be unsuitable for huge datasets. Docker is used, so root privileges may be required.


Snakemake

The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human-readable, Python-based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.

  • Syntax: Implicit

  • Paradigm: Configuration

  • Interaction: CLI

  • Distributed Computing Support: Yes, Kubernetes, Singularity, Docker

  • Extensive: API

  • Language: Python

  • License: MIT License

  • Pros: Packages are managed by conda. Google cloud engine is supported and cluster execution is easily integrated. DAG is introduced to visualize workflow. Configuration is based on YAML or JSON. Sustainable and reproducible archiving. Scheduling algorithm provides general support for distributed computing.

  • Cons: One has to learn Snakemake's own grammar (specified in extended Backus-Naur form, EBNF) to write pipelines. No other languages are supported. There is no community for sharing pipelines written and run with Snakemake. Cloud file/storage management support is unclear.
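A minimal sketch of the implicit, filename-pattern style in Snakemake's Python-based rule syntax; the samtools command and file names are illustrative only.

```
# Snakefile: the wildcard pattern ties outputs to inputs by file name,
# so Snakemake infers the dependency graph rather than being told a task order.
rule all:
    input:
        "sampleA.sorted.bam"

rule sort_bam:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    shell:
        "samtools sort -o {output} {input}"
```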


Bpipe

  1. Simple definition of tasks to run - Bpipe runs shell commands almost as-is - no programming required.

  2. Transactional management of tasks - commands that fail get outputs cleaned up, log files saved and the pipeline cleanly aborted. No out of control jobs going crazy.

  3. Automatic Connection of Pipeline Stages - Bpipe manages the file names for input and output of each stage in a systematic way so that you don't need to think about it. Removing or adding new stages "just works" and never breaks the flow of data.

  4. Easy Restarting of Jobs - when a job fails, cleanly restart from the point of failure.

  5. Easy Parallelism - Bpipe makes it simple to split jobs into many pieces and run them all in parallel whether on a cluster or locally on your own machine

  6. Audit Trail - Bpipe keeps a journal of exactly which commands executed and what their inputs and outputs were.

  7. Integration with Cluster Resource Managers - if you use Torque PBS, Oracle Grid Engine or Platform LSF then Bpipe will make your life easier by allowing pure Bpipe scripts to run on your cluster virtually unchanged from how you run them locally.

  8. Notifications by Email or Instant Message - Bpipe can send you alerts to tell you when your pipeline finishes or even as each stage completes.

  • Syntax: Explicit

  • Paradigm: Class

  • Interaction: CLI

  • Distributed Computing Support: Partially; based on the documentation, parallel tasks appear to be executed only at small scale.

  • Extensive: NA

  • Language: Groovy

  • License: BSD 3-clause "New" or "Revised" License

  • Pros: Tiny, low-level and efficient. No need to learn another language, while it still supports executing several embedded languages. Every command the pipeline executes is audited, and it can recover from interruptions and errors. Pipelines are written by combining built-in commands.

  • Cons: Compared to other workflow systems, it provides no GUI interaction, no visualization of workflows or results, no sharing features and no container-based dispatch. Also, its small-scale parallel tasks do not appear to be fully distributed.


Ruffus

The ruffus module has the following design goals:

  1. Simplicity. Can be picked up in 10 minutes
  2. Elegance
  3. Light weight
  4. Unintrusive
  5. Flexible/Powerful

Automatic support for

  1. Managing dependencies
  2. Parallel jobs
  3. Re-starting from arbitrary points, especially after errors
  4. Display of the pipeline as a flowchart
  5. Reporting
  • Syntax: Explicit

  • Paradigm: Class

  • Interaction: CLI

  • Distributed Computing Support: Yes; it uses Python multiprocessing to run each job in a separate process, and from version 2.4 onwards it supports the Open Grid Forum API specification.

  • Extensive: NA

  • Language: Python

  • License: MIT License

  • Pros: Lightweight and elegant, with distributed computing support. Easy for those familiar with Python development.

  • Cons: No visualization of results or workflows. Requires Python development knowledge. Depends on pip and easy_install for installation. See also the official to-do list.
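A minimal sketch of a Ruffus pipeline using its decorator API; the input files and the trivial task body are placeholders for real analysis steps.

```python
from ruffus import transform, suffix, pipeline_run

# Create two small placeholder input files so the example is self-contained.
starting_files = ["sample1.txt", "sample2.txt"]
for name in starting_files:
    with open(name, "w") as fh:
        fh.write("dummy data\n")

# Derive each output name from its input name via suffix substitution.
@transform(starting_files, suffix(".txt"), ".copy.txt")
def copy_file(input_file, output_file):
    # Placeholder task body; a real step would call an analysis tool here.
    with open(input_file) as src, open(output_file, "w") as dst:
        dst.write(src.read())

if __name__ == "__main__":
    # Run up to the named target task, using two parallel processes.
    pipeline_run([copy_file], multiprocess=2)
```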


Nextflow

The Nextflow framework is based on the dataflow programming model, which greatly simplifies writing parallel and distributed pipelines without adding unnecessary complexity, letting you concentrate on the flow of data, i.e. the functional logic of the application/algorithm. A Nextflow script is defined by composing many different processes. Each process can be written in any scripting language that can be executed by the Linux platform (Bash, Perl, Ruby, Python, etc.), to which is added the ability to coordinate and synchronize process execution by simply specifying their inputs and outputs.

Feature:

  1. Fast prototyping, Nextflow allows you to write a computational pipeline by making it simpler to put together many different tasks.
  2. Reproducibility, docker and singularity supported.
  3. Portable, it provides abstraction layers between pipeline's logic and the execution layer.
  4. Unified parallelism, it is based on the dataflow programming model, simplifying the writing of complex pipelines.
  5. Continuous checkpoints, all the intermediate results are tracked.
  6. Stream oriented, it extends the Unix pipes model with a fluent DSL.
  • Syntax: Implicit

  • Paradigm: Class

  • Interaction: CLI

  • Distributed Computing Support: Yes, it provides out-of-the-box support for the SGE, LSF, SLURM, PBS and HTCondor batch schedulers, as well as for Kubernetes and AWS

  • Extensive: NA

  • Language: Groovy

  • License: GNU GPLv3 License

  • Pros: Powerful; it includes many features such as cluster support, execution reports, resource and task reports, pipeline sharing and container technology support. Pipelines are developed in a Groovy-based language that is quite easy to write and read. CWL is supported. Processes can be written in multiple scripting languages that run on Unix, including Python, Ruby and Bash.

  • Cons: No rootless container support; many dependencies, including Apache Ignite and a JDK.


BioQueue

BioQueue is a lightweight and easy-to-use queue system to accelerate the processing of bioinformatics workflows. Based on machine learning methods, BioQueue can maximize efficiency and, at the same time, reduce the possibility of errors caused by unsupervised concurrency (like memory overflow). BioQueue can run both on POSIX-compatible systems (Linux, Solaris, OS X, etc.) and on Windows.

  • Syntax: Implicit

  • Paradigm: Class

  • Interaction: CLI and GUI(Web-based)

  • Distributed Computing Support: Yes (see the documentation)

  • Extensive: API

  • Language: Python

  • License: Apache License 2.0

  • Pros: A machine learning method is used to estimate the resource usage (CPU, memory and disk) needed by each step. It uses a shell-command-like syntax instead of implementing a new scripting language. It reads from and writes to SQLite rather than plain disk files.

  • Cons: It lacks some features such as pipeline sharing and container technology support. Also, the performance and accuracy of the resource usage estimation have not been tested.


Cluster Flow

Cluster Flow is designed to be quick and easy to install, with flexible configuration and simple customization.

  1. Simple. Installation walkthroughs and a large module toolset mean you get up and running quickly.
  2. Powerful. Comes packaged with support for 24 different bioinformatics tools (RNA, ChIP, Bisulfite and more).
  3. Flexible. Pipelines are fast to assemble, making it trivial to change on the fly.
  4. Traceable. Commands, software versions, everything is logged for reproducibility.
  • Syntax: Implicit

  • Paradigm: Class

  • Interaction: CLI

  • Distributed Computing Support: Yes, it supports the Sun GridEngine, LSF and SLURM job managers.

  • Extensive: NA

  • Language: Perl

  • License: GNU General Public License v3.0

  • Pros: Simple; it is essentially a Perl script requiring only basic core Perl packages, which makes it runnable on most machines.

  • Cons: Currently it supports only a fixed list of tools. For clusters, one needs to do manual configuration, such as resource estimation and environment setup.


Toil

A scalable, efficient, cross-platform pipeline management system written entirely in Python and designed around the principles of functional programming.

  1. Pythonic. Easily mastered, the Python user API for defining and running workflows is built on one core class. Also, everything is open source under the Apache License.
  2. Robust. Toil workflows support arbitrary worker and leader failure, with strong check-pointing that always allows resumption.
  3. Efficient. Caching, fine grained, per task, resource requirement specifications, and support for the AWS spot market mean workflows can be executed with little waste.
  4. Built for the cloud. Develop and test on your laptop, then deploy on Microsoft Azure, Amazon Web Services (including the spot market), Google Compute Engine, OpenStack, or on an individual machine.
  5. Strongly scalable. Build a workflow on your laptop, then scale to the cloud and run it concurrently on hundreds of nodes and thousands of cores with ease. We've tested it with 32,000 preemptable cores so far, but Toil can handle more.
  6. Service integration. Toil plays nice with databases and services, such as Apache Spark. Service clusters can be created quickly and easily integrated with a Toil workflow, with precisely defined start and end times that fits with the flow of other jobs in the workflow.
  • Syntax: Implicit

  • Paradigm: Class

  • Interaction: CLI

  • Distributed Computing Support: Yes, it supports AWS, Azure, Openstack, GCE and HPC.

  • Extensive: API

  • Language: Python

  • License: Apache License, Version 2.0

  • Pros: Full feature support, including reproducibility, cloud and container technology support. CWL is supported.

  • Cons: No support for multiple languages. No visualization of workflows or reports. Pythonic development requires managing package dependencies with pip or easy_install.
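A minimal sketch along the lines of Toil's hello-world quick start, showing the single core Job class; the job-store path is a placeholder.

```python
from toil.common import Toil
from toil.job import Job

def hello(job, name):
    # A Toil job function: the first argument is always the job itself.
    return "Hello, %s!" % name

if __name__ == "__main__":
    # The job store records state so an interrupted workflow can be resumed.
    options = Job.Runner.getDefaultOptions("./toil_jobstore")
    options.logLevel = "INFO"
    options.clean = "always"

    root_job = Job.wrapJobFn(hello, "world")
    with Toil(options) as toil:
        print(toil.start(root_job))
```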


bcbio

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis. You write a high level configuration file specifying your inputs and analysis parameters. This input drives a parallel run that handles distributed execution, idempotent processing restarts and safe transactional steps. bcbio provides a shared community resource that handles the data processing component of sequencing analysis, providing researchers with more time to focus on the downstream biology.

  1. Quantifiable: Doing good science requires being able to accurately assess the quality of results and re-verify approaches as new algorithms and software become available.
  2. Analyzable: Results feed into tools to make it easy to query and visualize the results.
  3. Scalable: Handle large datasets and sample populations on distributed heterogeneous compute environments.
  4. Reproducible: Track configuration, versions, provenance and command lines to enable debugging, extension and reproducibility of results.
  5. Community developed: The development process is fully open and sustained by contributors from multiple institutions. By working together on a shared framework, we can overcome the challenges associated with maintaining complex pipelines in a rapidly changing area of research.
  6. Accessible: Bioinformaticians, biologists and the general public should be able to run these tools on inputs ranging from research materials to clinical samples to personal genomes.
  • Syntax: Implicit

  • Paradigm: Configuration

  • Interaction: CLI

  • Distributed Computing Support: Yes, it supports multiple cores and parallel messaging and AWS

  • Extensive: NA

  • Language: Python

  • License: MIT License

  • Pros: It is designed for a specific audience and purpose, and it provides the basic features of a workflow system. CWL is supported.

  • Cons: Resource configuration for parallel tasks must be set manually. ZeroMQ and the IPython parallel framework are used to implement parallelism. No package management component.


GenePattern

A platform for reproducible bioinformatics

  1. Powerful genomics tools in a user-friendly interface, GenePattern provides hundreds of analytical tools for the analysis of gene expression (RNA-seq and microarray), sequence variation and copy number, proteomic, flow cytometry, and network analysis. These tools are all available through a Web interface with no programming experience required.
  2. GenePattern Notebooks, The GenePattern Notebook environment extends the Jupyter Notebook system, allowing researchers to create documents that interleave formatted text, graphics and other multimedia, executable code, and GenePattern analyses, creating a single "research narrative" that puts scientific discussion and analyses in the same place.
  3. Analysis Pipelines, GenePattern pipelines allow you to capture, automate, and share the complex series of steps required to analyze genomic data. By providing a way to create and distribute an entire computational analysis methodology in a single executable script, pipelines enable a form of in silico reproducible research.
  4. Reproducible Research, Published research, particularly in silico research, should contain sufficient information to completely reproduce the research results. By capturing the analysis methods, parameters, and data used to produce the research results, GenePattern pipelines enable reproducible research. By versioning every pipeline and its methods, GenePattern ensures that each version of a pipeline (and its results) remain static, even as your research and the pipeline continue to evolve.
  5. Programming Environment, GenePattern provides a simple application interface that gives users access to computational analysis methods and tools, regardless of their computational experience. GenePattern also provides a programmatic interface that makes those analysis modules available to computational biologists and developers from Java, MATLAB, and R.
  • Syntax: Implicit
  • Paradigm: Configuration
  • Interaction: GUI
  • Distributed Computing Support: Yes, GenePattern Server.
  • Extensive: API
  • Language: Java
  • License: BSD-style License
  • Pros: Full GUI interface with powerful drag and drop feature. It supports development in different languages including R, Java, Python and Matlab.
  • Cons: It doesn't support container technology, which means it is not easy to share and may be a little difficult to deploy on a cluster.

Makeflow

Makeflow is a workflow system for executing large complex workflows on clusters, clouds, and grids.

  1. Makeflow is easy to use. The Makeflow language is similar to traditional Make, so if you can write a Makefile, then you can write a Makeflow. A workflow can be just a few commands chained together, or it can be a complex application consisting of thousands of tasks. It can have an arbitrary DAG structure and is not limited to specific patterns.
  2. Makeflow is production-ready. Makeflow is used on a daily basis to execute complex scientific applications in fields such as data mining, high energy physics, image processing, and bioinformatics. It has run on campus clusters, the Open Science Grid, NSF XSEDE machines, and NCSA Blue Waters.
  3. Makeflow is portable. A workflow is written in a technology neutral way, and then can be deployed to a variety of different systems without modification, including local execution on a single multicore machine as well as batch systems like HTCondor, SGE, PBS, Torque, SLURM, or the bundled Work Queue system. Makeflow can also easily run your jobs in a container environment like Docker or Singularity on top of an existing batch system. The same specification works for all systems, so you can easily move your application from one system to another without rewriting everything.
  4. Makeflow is powerful. Makeflow can handle workloads of millions of jobs running on thousands of machines for months at a time. Makeflow is highly fault tolerant: it can crash or be killed, and upon resuming, will reconnect to running jobs and continue where it left off. A variety of analysis tools are available to understand the performance of your jobs, measure the progress of a workflow, and visualize what is going on.
  • Syntax: Implicit
  • Paradigm: Class
  • Interaction: CLI
  • Distributed Computing Support: Yes, it supports Amazon EC2, general cluster systems and batch systems.
  • Extensive: NA
  • Language: C, Python
  • License: GNU General Public License v2.0
  • Pros: Make-style pipeline development. It focuses on cloud, grid and cluster deployment. It provides reproducibility and supports container technology.
  • Cons: No support for multiple languages. Container deployment needs root privileges.

Apache Airavata

Apache Airavata is a software framework for executing and managing computational jobs on distributed computing resources including local clusters, supercomputers, national grids, and academic and commercial clouds. Airavata builds on general concepts of service-oriented computing, distributed messaging, and workflow composition and orchestration. Airavata bundles a server package with an API, client software development kits and a general-purpose reference UI implementation - the Apache Airavata PHP reference gateway.


  • Syntax: Explicit
  • Paradigm: Configuration
  • Interaction: GUI
  • Distributed Computing Support: Yes
  • Extensive: API
  • Language: Java
  • License: Apache License 2.0
  • Pros: It is a project belonging to the Apache community, whose goal is to develop middleware sitting between users and computing resources. It has both desktop (client) and web interfaces and exposes data through an Apache Thrift-based API. It provides resource monitoring features and a user API in multiple languages.
  • Cons: It is not a workflow system designed specifically for bioinformatics and may need to be modified to meet one's own requirements. The application factory needs a connection to the computational resources, which may require root privileges to install plugins/software.

Pegasus

Pegasus WMS is a configurable system for mapping and executing scientific workflows over a wide range of computational infrastructures including laptops, campus clusters, supercomputers, grids, and commercial and academic clouds. Pegasus has been used to run workflows with up to 1 million tasks that process tens of terabytes of data at a time. Pegasus has a number of features that contribute to its usability and effectiveness:

  1. Portability / Reuse – User created workflows can easily be run in different environments without alteration. Pegasus currently runs workflows on top of Condor pools, Grid infrastructures such as Open Science Grid and XSEDE, Amazon EC2, Google Cloud, and HPC clusters. The same workflow can run on a single system or across a heterogeneous set of resources.
  2. Performance – The Pegasus mapper can reorder, group, and prioritize tasks in order to increase overall workflow performance.
  3. Scalability – Pegasus can easily scale both the size of the workflow, and the resources that the workflow is distributed over. Pegasus runs workflows ranging from just a few computational tasks up to 1 million. The number of resources involved in executing a workflow can scale as needed without any impediments to performance.
  4. Provenance – By default, all jobs in Pegasus are launched using the Kickstart wrapper that captures runtime provenance of the job and helps in debugging. Provenance data is collected in a database, and the data can be queried with tools such as pegasus-statistics, pegasus-plots, or directly using SQL.
  5. Data Management – Pegasus handles replica selection, data transfers and output registration in data catalogs. These tasks are added to a workflow as auxiliary jobs by the Pegasus planner.
  6. Reliability – Jobs and data transfers are automatically retried in case of failures. Debugging tools such as pegasus-analyzer help the user to debug the workflow in case of non-recoverable failures.
  7. Error Recovery – When errors occur, Pegasus tries to recover when possible by retrying tasks, by retrying the entire workflow, by providing workflow-level checkpointing, by re-mapping portions of the workflow, by trying alternative data sources for staging data, and, when all else fails, by providing a rescue workflow containing a description of only the work that remains to be done. It cleans up storage as the workflow is executed so that data-intensive workflows have enough space to execute on storage-constrained resources. Pegasus keeps track of what has been done (provenance) including the locations of data used and produced, and which software was used with which parameters.
  • Syntax: Explicit
  • Paradigm: Configuration
  • Interaction: CLI
  • Distributed Computing Support: Yes, it supports Amazon EC2/S3, Google Cloud, PBS clusters, campus clusters, etc. (see the documentation)
  • Extensive: Python API
  • Language: Java & Python
  • License: Apache License 2.0
  • Pros: It is successfully used in many fields. It automates the search for resources and data locations and allows users to debug their pipelines via debugging tools and an online workflow monitoring dashboard.
  • Cons: Container technology is not supported. Only XML is supported as the pipeline description format, and one needs to use a DAX generator for each workflow to produce the XML (sketched below).
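A minimal sketch of the DAX generator step mentioned above, using the Pegasus DAX3 Python API to describe one job and write the workflow as XML; the transformation name and file names are illustrative.

```python
from Pegasus.DAX3 import ADAG, Job, File, Link

# Build an abstract workflow (DAX) with a single job and serialize it to XML.
dax = ADAG("example-workflow")

raw = File("input.txt")
cleaned = File("output.txt")

job = Job(name="preprocess")          # hypothetical transformation name
job.addArguments("-i", raw, "-o", cleaned)
job.uses(raw, link=Link.INPUT)
job.uses(cleaned, link=Link.OUTPUT)
dax.addJob(job)

with open("example-workflow.dax", "w") as fh:
    dax.writeXML(fh)
```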

BigDataScript

BigDataScript is intended as a scripting language for big data pipelines. With BigDataScript, creating jobs for big data is as easy as creating a shell script, and it runs seamlessly on any computer system, no matter how small or big. If you normally use specialized programs to perform heavyweight computations, then BigDataScript is the glue for those commands that you need to create a reliable pipeline.

  • Reduced development time. Spend less time debugging your work on big systems with huge data volumes. Now you can debug the same jobs using a smaller sample on your computer. Get immediate feedback, debug, fix and deploy when it's done. Shorter development cycles mean better software.

  • System independent. Cross-system, seamless execution, the same program runs on a laptop, server, server farm, cluster or cloud. No changes to the program required. Work once.

  • Easy to learn. The syntax is intuitive and it resembles the syntax of most commonly used programming languages. Reading the code is easy as pi.

  • Automatic Checkpointing. If any task fails to execute, BigDataScript creates a checkpoint file, serializing all the information from the program. Want to restart where it stopped? No problem, just resume the execution from the checkpoint.

  • Automatic logging. Everything is logged (-log command line option), no explicit actions required. Every time you execute a system command or a task, BigDataScript logs the executed commands, stdout & stderr and exit codes.

  • Clean stop with no mess behind. You have a BigDataScript running on a terminal and suddenly you realized there is something wrong... Just hit Ctrl-C. All scheduled tasks and running jobs will be terminated, removed from the queue, deallocated from the cluster. A clean stop allows you to focus on the problem at hand without having to worry about restoring a clean state.

  • Task dependencies. In complex pipelines, tasks usually depend on each other. BigDataScript provides ways to easily manage task dependencies.

  • Avoid re-work. Executing the pipeline over and over should not re-do jobs that were completed successfully and moreover are time consuming. Task dependency based on timestamps is a built-in functionality, thus making it easy to avoid starting from scratch every time.

  • Built in debugger. Debugging is an integral part of programming, so it is part of bds language. Statements breakpoint and debug make debugging part of the language, instead of requiring platform specific tools.

  • Built in test cases facility. Code testing is performed in everyday programming, so testing is built in bds.

  • Syntax: Implicit

  • Paradigm: Class

  • Interaction: CLI

  • Distributed Computing Support: Yes, this language is born with running on cloud or cluster

  • Extensive: NA

  • Language: go?

  • License: Apache License 2.0

  • Pros: It is a purpose-built scripting language for developing pipelines, with many built-in features.

  • Cons: Integration with existing pipelines and tools still needs to be done; one may need to spend time writing code to use other pipelines. No visualization support.


Biomake

Biomake is a make-like utility for managing builds (or analysis workflows) involving multiple dependent files. It supports most of the functionality of GNU Make, along with neat extensions like cluster-based job processing, multiple wildcards per target, MD5 checksums instead of timestamps, and declarative logic programming in Prolog.

  • Syntax: Implicit
  • Paradigm: Class
  • Interaction: CLI
  • Distributed Computing Support: Yes, Biomake already adds extensions like cluster-based job processing.
  • Extensive: NA
  • Language: perl
  • License: BSD-3-Clause
  • Pros: Make-like development style; it supports most of the functionality of GNU Make, though the two differ slightly in how they process variables.
  • Cons: No cloud-based computing framework. No container technology support. No visualization.

Loom

Loom is a platform-independent tool to create, execute, track, and share workflows.

  • Ease of use. Loom runs out-of-the-box locally or in the cloud.

  • Repeatable analysis. Loom makes sure you can repeat your analysis months and years down the road after you've lost your notebook, your data analyst has found a new job, and your server has had a major OS version upgrade. Loom uses Docker to reproduce your runtime environment, records file hashes to verify analysis inputs, and keeps fully reproducible records of your work.

  • Traceable results. Loom remembers anything you ever run and can tell you exactly how each result was produced.

  • Portability between platforms. Exactly the same workflow can be run on your laptop or on a public cloud service.

  • Open architecture. Not only is Loom open source and free to use, it uses an inside-out architecture that minimizes lock-in and lets you easily share your work with other people.

  • Graphical user interface. While you may want to automate your analysis from the command line, a graphical user interface is useful for interactively browsing workflows and results.

  • Security and compliance. Loom is designed with clinical compliance in mind.

  • Syntax: Explicit

  • Paradigm: Configuration

  • Interaction: CLI

  • Distributed Computing Support: Yes, Google cloud server is supported by default.

  • Extensive: NA

  • Language: Python

  • License: GNU Affero General Public License v3.0

  • Pros: Easy configuration and development; YAML-based configuration allows users to develop rapidly.

  • Cons: Configuration may become complex when the pipeline itself is complex; judging from the official examples, the configuration becomes long and difficult to read and write for complex pipelines. The author could introduce hierarchical YAML support.


dagr

dagr is a task and pipeline execution system for directed acyclic graphs to support scientific, and more specifically, genomic analysis workflows. There are many toolkits available for creating and executing pipelines of dependent jobs; dagr does not aim to be all things to all people but to make certain types of pipelines easier and more pleasurable to write. It is specifically focused on:

  • Writing pipelines that are concise, legible, and type-safe

  • Easy composition of pipelines into bigger pipelines

  • Providing safe and coherent ways to dynamically change the graph during execution

  • Making the full power and expressiveness of scala available to pipeline authors

  • Efficiently executing tasks concurrently within the constraints of a single machine/instance

It is a tool for working data scientists, programmers and bioinformaticians.

  • Syntax: Implicit

  • Paradigm: Class

  • Interaction: CLI

  • Distributed Computing Support: No?

  • Extensive: NA

  • Language: Scala

  • License: MIT License

  • Pros: It manages complex dependencies among tasks and pipelines. It contains a small set of predefined genomic analysis tasks and pipelines. Resource-aware scheduling across tasks and pipelines.

  • Cons: Currently in alpha and still unstable; it may not be usable in a production environment.


Butler

Butler is a collection of tools whose goal is to aid researchers in carrying out scientific analyses on a multitude of cloud computing platforms (AWS, OpenStack, Google Compute Platform, Azure, and others). Butler builds on many other open-source projects such as Apache Airflow, Terraform, Saltstack, Grafana, InfluxDB, PostgreSQL, Celery, Elasticsearch, Consul, and others.

  • Provisioning - Creation and teardown of clusters of Virtual Machines on various clouds.

  • Configuration Management - Installation and configuration of software on Virtual Machines.

  • Workflow Management - Definition and execution of distributed scientific workflows at scale.

  • Operations Management - A set of tools for maintaining operational control of the virtualized environment as it performs work.

  • Syntax: Explicit

  • Paradigm: Class

  • Interaction: CLI & GUI

  • Distributed Computing Support: Yes; AWS, OpenStack, Google Compute Platform, Azure and others are supported.

  • Extensive: NA

  • Language: Python

  • License: GNU General Public License v3.0

  • Pros: It is a collection of powerful tools, with support for Docker technology and CWL. It also ships with some ready-made workflows that can be used immediately.

  • Cons: The installation steps are a little complex, as it consists of several different pieces of software and middleware.


FireWorks

FireWorks is a free, open-source code for defining, managing, and executing workflows. Complex workflows can be defined using Python, JSON, or YAML, are stored using MongoDB, and can be monitored through a built-in web interface. Workflow execution can be automated over arbitrary computing resources, including those that have a queueing system. FireWorks has been used to run millions of workflows encompassing tens of millions of CPU-hours across diverse application areas and in long-term production projects over the span of multiple years.

  • A clean and flexible Python API, a powerful command-line interface, and a built-in web service for monitoring workflows.

  • A database backend (MongoDB) lets you add, remove, and search the status of workflows.

  • Detect failed jobs (both soft and hard failures), and rerun them as needed.

  • Multiple execution modes - directly on a multicore machines or through a queue, on a single machine or multiple machines. Assign priorities and where jobs run.

  • Support for dynamic workflows - workflows that modify themselves or create new ones based on what happens during execution.

  • Automatic duplicate handling at the sub-workflow level - skip duplicated portions between two workflows while still running unique sections

  • Built-in tasks for creating templated inputs, running scripts, and copying files to remote machines

  • Remotely track the status of output files during execution.

  • Package many small jobs into a single large job (e.g., automatically run 100 serial workflows in parallel over 100 cores)

  • Support for several queueing systems such as PBS/Torque, Sun Grid Engine, SLURM, and IBM LoadLeveler.

  • Syntax: Implicit

  • Paradigm: Configuration(YAML) & Class(Python classes)

  • Interaction: CLI

  • Distributed Computing Support: Partially; it supports several supercomputing centers such as NERSC, but does not seem to support cloud platforms such as AWS, Google Cloud Platform or Azure.

  • Extensive: Python API

  • Language: Python

  • License: BSD-style License

  • Pros: Pipeline development is based on either configuration (YAML) or the Python API, which makes it quite easy to develop pipelines. It supports dynamic workflows. Built-in FireTasks allow users to create pipelines rapidly.

  • Cons: No pipeline dependency management. No container technology support. No cloud framework support. It depends on MongoDB, which users need to maintain themselves.
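A minimal sketch in the style of the FireWorks quick start: a single Firework wrapping a ScriptTask is added to the LaunchPad (which assumes a reachable MongoDB with default connection settings) and then executed locally.

```python
from fireworks import Firework, LaunchPad, ScriptTask
from fireworks.core.rocket_launcher import launch_rocket

# Connect to MongoDB (default localhost settings assumed here).
launchpad = LaunchPad()

# A one-task workflow that simply runs a shell command.
firework = Firework(ScriptTask.from_str('echo "hello from FireWorks"'))
launchpad.add_wf(firework)

# Pull the job from the database and run it on this machine.
launch_rocket(launchpad)
```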
