Daisy: A Blockwise Task Scheduler

Blockwise task scheduler for processing large volumetric data

What is Daisy?

Daisy is a library that facilitates distributed processing of big nD datasets across clusters of computers. It combines the best of MapReduce/Hadoop (the ability to map a process function across elements) and Luigi (the ability to chain dependent tasks together) in one lightweight and efficient package with a focus on processing nD datasets.

The Daisy documentation is at https://daisy-docs.readthedocs.io/

Updates

Daisy v1.0 is now available on PyPI!

  • Install it now with pip install -U daisy
  • Besides quality-of-life improvements, we have also refactored I/O-related utilities into funlib.persistence to make code maintenance easier. This includes everything that was in daisy.persistence, along with daisy.Array and helper functions such as daisy.open_ds and daisy.prepare_ds.
    • Just run pip install git+https://github.com/funkelab/funlib.persistence.
    • These functions remain the same. They provide an easy-to-use interface to common array formats such as zarr, n5, and hdf5, as well as interfaces for storing large spatial graphs in MongoDB or in files.

Overview

Developed by researchers at HHMI Janelia and Harvard, Daisy was intended to be a scalable and fast distributed block-wise scheduler for processing very large (TBs to PBs) 3D/4D bio image datasets. We needed a scheduler that is fast and scalable, but also resilient to failures and able to recover and resume after hardware errors. Daisy should also be general enough to support efficient processing of different tasks with different computation and input/output modalities.

Daisy is lightweight

  • Daisy uses high performance TCP/IP libraries for communications between the scheduler and workers.

  • It minimizes network overheads by sending only coordinates and status checks. Daisy does not enforce the exact method of data transfers to/between workers so that maximum performance is achieved for different tasks.

Daisy's API is easy-to-use and extensible

  • Built on Python, Daisy provides an easy-to-use native interface for Python scripts useful for both simple and complex use cases.

  • At its simplest, Daisy is a framework for mapping a function across independent sub-blocks in the dataset.

  • More complex usages include specifying inter-block dependencies and inter-task dependencies, and using Daisy's array interface and geometric graph interface.

Daisy chains complex pipelines of tasks

  • Inspired by powerful workflow management frameworks like Luigi, which automate long-running tasks and decrease overall processing time through task pipelining, Daisy allows users to specify dependencies between tasks. This enables task chaining and running multiple tasks in a pipeline with dynamic, concurrent per-block execution.

  • For instance, Daisy can chain a map task and a reduce task to implement a map-reduce task for nD datasets. Of course, any other combinations of map and reduce tasks are composable.

  • By tracking dependencies at the block level, tasks can be executed concurrently to maximize pipelining parallelism.

Daisy is tuned for processing datasets with real-world units

  • Daisy has a native interface for representing regions of a volume in real-world units. It handles voxel sizes easily and provides convenience functions for region manipulation (intersection, union, growing or shrinking, etc.).
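
To make the idea of regions in real-world units concrete, here is a minimal 1D sketch. The `Interval` class and its methods are hypothetical stand-ins written for this example; they are not Daisy's or funlib.geometry's `Roi` API, and the sketch assumes region bounds are multiples of the voxel size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical 1D region in world units (e.g. nanometers); not Daisy's Roi."""
    begin: int
    end: int

    def intersect(self, other):
        b, e = max(self.begin, other.begin), min(self.end, other.end)
        return Interval(b, max(b, e))  # empty interval if there is no overlap

    def union(self, other):
        # smallest interval that covers both regions
        return Interval(min(self.begin, other.begin), max(self.end, other.end))

    def grow(self, neg, pos):
        # negative amounts shrink the region
        return Interval(self.begin - neg, self.end + pos)

    def to_voxels(self, voxel_size):
        # assumes begin and end are multiples of voxel_size
        return slice(self.begin // voxel_size, self.end // voxel_size)


a = Interval(0, 400)    # 0 nm to 400 nm
b = Interval(200, 800)  # 200 nm to 800 nm
voxel_size = 40         # nm per voxel

overlap = a.intersect(b)                     # Interval(begin=200, end=400)
covering = a.union(b)                        # Interval(begin=0, end=800)
with_context = overlap.grow(40, 40)          # Interval(begin=160, end=440)
voxels = with_context.to_voxels(voxel_size)  # slice(4, 11)
```

Keeping regions in world units and converting to voxel indices only at the point of array access means the same region can address arrays stored at different resolutions.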

Installation

pip install daisy

Alternatively, you can install from GitHub for the latest development version:

pip install -e git+https://github.com/funkelab/daisy#egg=daisy

Quickstart

See the following code in an IPython notebook!

Map task

First, let's run a simple map task with Daisy. Suppose we have an array a, and we want to compute the square of each element and store it in b:

import numpy as np
shape = 4096000
a = np.arange(shape, dtype=np.int64)
b = np.empty_like(a, dtype=np.int64)
print(a)
# prints [0 1 2 ... 4095997 4095998 4095999]

We can use the following process_fn:

def process_fn():
    # iterate over a, square each element, and store the result in b
    with np.nditer([a, b],
                   op_flags=[['readonly'], ['readwrite']]) as it:
        for x, y in it:
            y[...] = x**2
%timeit process_fn()  # 3.55 s ± 22.7 ms per loop
print(b)
# prints [0 1 4 ... 16777191424009 16777199616004 16777207808001]

Since process_fn processes a linearly in a single thread, it is quite slow. Let's use Daisy to break a into blocks and run process_fn in parallel.

First, we'll wrap a in an Array (from funlib.persistence) and make a b array backed by zarr that multiple concurrent processes can write to. We will also define block_shape, the granularity at which each worker operates.

import daisy
from funlib.persistence import Array
from funlib.geometry import Roi, Coordinate
import zarr
shape = 4096000
block_shape = 1024*16
# input array is wrapped in `Array` for ease of `Roi` indexing
a = Array(np.arange(shape, dtype=np.int64),
          roi=Roi((0,), (shape,)),
          voxel_size=(1,))
# to parallelize across processes, we need persistent read/write arrays
# we'll use zarr here to do that
b = zarr.open_array(zarr.TempStore(), 'w', (shape,),
                    chunks=(block_shape,),
                    dtype=np.int64)
# output array is wrapped in `Array` for ease of `Roi` indexing
b = Array(b,
          roi=Roi((0,), (shape,)),
          voxel_size=(1,))

The process_fn is then modified slightly to take in a block object and perform read/write using the ROIs given by it.

# same process function as before, with additional code
# to read from and write to persistent arrays
def process_fn_daisy(block):
    a_sub = a[block.read_roi].to_ndarray()
    b_sub = np.empty_like(a_sub)
    with np.nditer([a_sub, b_sub],
                   op_flags=[['readonly'], ['readwrite']]) as it:
        for x, y in it:
            y[...] = x**2

    b[block.write_roi] = b_sub

Next, we define total_roi based on total amount of work (shape) and block_roi based on scheduling block size (block_shape). We then make a daisy.Task and run it.

total_roi = Roi((0,), (shape,))  # total ROI to map process over
block_roi = Roi((0,), (block_shape,))  # block ROI for parallel processing
# creating a Daisy task, note that we do not specify how each
# worker should read/write to input/output arrays
task = daisy.Task(
    total_roi=total_roi,
    read_roi=block_roi,
    write_roi=block_roi,
    process_function=process_fn_daisy,
    num_workers=8,
    task_id='square',
)
daisy.run_blockwise([task])
'''
prints Execution Summary
-----------------
  Task square:
    num blocks : 250
    completed ✔: 250 (skipped 0)
    failed    ✗: 0
    orphaned  ∅: 0
    all blocks processed successfully
'''
# %timeit daisy.run_blockwise([task])  # 1.26 s ± 16.1 ms per loop
print(b.to_ndarray())
# prints [0 1 4 ... 16777191424009 16777199616004 16777207808001]
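
The block count in the execution summary can be checked by hand. The sketch below is plain Python, not Daisy code, and assumes blocks tile the total ROI from offset 0 in steps of block_shape; the order in which Daisy actually schedules them may differ.

```python
shape = 4096000
block_shape = 1024 * 16  # 16384 elements per block

# ceiling division: number of blocks needed to cover [0, shape)
num_blocks = -(-shape // block_shape)

# (begin, end) of every block, clipped to the total extent
blocks = [
    (begin, min(begin + block_shape, shape))
    for begin in range(0, shape, block_shape)
]

print(num_blocks)   # 250
print(blocks[0])    # (0, 16384)
print(blocks[-1])   # (4079616, 4096000)
```

Because 4096000 is an exact multiple of 16384, the last block is full-sized and no block hangs over the end of the array.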

With just a minor modification, using Daisy to run multiple workers in parallel yields a speedup of roughly 2.8x (3.55 s down to 1.26 s) on a computer with 6 cores. For longer-running tasks with larger block sizes (which minimize process spawning/joining overheads), the speedup should come closer to the number of cores running in parallel.

Reduce task

Now we'll write and run a reduce task. This task performs a sum of blocks of shape reduce_shape from b and stores the results to c.

import multiprocessing
reduce_shape = shape // 16  # integer division keeps ROI shapes and indices integral
# While using zarr with `Array` can be easier to understand and less error-prone, it is not a requirement.
# Here we make a shared memory array for collecting results from different workers
c = multiprocessing.Array('Q', range(shape // reduce_shape))
def process_fn_sum_reduce(block):
    b_sub = b[block.write_roi].to_ndarray()
    s = np.sum(b_sub)
    # compute c idx based on block offset and shape
    idx = block.write_roi.offset[0] // block.write_roi.shape[0]
    c[idx] = int(s)  # store as a plain Python int
total_roi = Roi((0,), (shape,))  # total ROI to map process over
block_roi = Roi((0,), (reduce_shape,))  # block ROI for parallel processing
task1 = daisy.Task(
    total_roi=total_roi,
    read_roi=block_roi,
    write_roi=block_roi,
    process_function=process_fn_sum_reduce,
    num_workers=8,
    task_id='sum_reduce',
)
daisy.run_blockwise([task1])
print(c[:])
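
The original does not show the output of the reduce task. As a sanity check, the expected per-block sums can be computed independently with the closed form for a sum of squares. This is a plain-Python sketch; the helper name `sum_of_squares_below` is made up for this example.

```python
shape = 4096000
reduce_shape = shape // 16  # 256000 elements per reduce block


def sum_of_squares_below(n):
    """Return 0**2 + 1**2 + ... + (n - 1)**2 using the closed form."""
    return (n - 1) * n * (2 * n - 1) // 6


# one expected sum per reduce block [begin, begin + reduce_shape)
expected = [
    sum_of_squares_below(begin + reduce_shape) - sum_of_squares_below(begin)
    for begin in range(0, shape, reduce_shape)
]

# every per-block sum fits into a signed 64-bit integer
assert max(expected) < 2**63
```

After the reduce task has run, comparing `list(c)` against `expected` should confirm that every worker wrote its result to the right slot.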

This concludes our quickstart tutorial. For more examples/tutorials please see the examples/ directory.

Citing Daisy

To cite this repository please use the following bibtex entry:

@software{daisy2022github,
  author = {Tri Nguyen and Caroline Malin-Mayor and William Patton and Jan Funke},
  title = {Daisy: block-wise task dependencies for luigi.},
  url = {https://github.com/funkelab/daisy},
  version = {1.0},
  year = {2022},
}

In the above bibtex entry, the version number is intended to be that from daisy/setup.py, and the year corresponds to the project's 1.0 release.

daisy's People

Contributors

abred, cmalinmayor, funkey, hanslovsky, lathomas42, mzouink, pattonw, rhoadesscholar, sheridana, trivoldus28, yajivunev

daisy's Issues

Tqdm Progress bar can become too big

When submitting a cluster task, tqdm writes a new line to the log twice for each block, so the log can grow into a gigabyte-sized file.

Is there a way to add an option that sets the update frequency, or another option that saves only the last line?

daisy.run_blockwise incompatible with Python 3.5 and older

The local scheduler uses asyncio functionality that is not available in Python 3.5:

Traceback (most recent call last):
  File "/usr/local/bin/predict-affinities", line 11, in <module>
    load_entry_point('eqip==0.3.1.dev0', 'console_scripts', 'predict-affinities')()
  File "/usr/local/lib/python3.5/dist-packages/eqip/inference/backend_daisy.py", line 328, in predict_affinities_daisy
    read_write_conflict=False)
  File "/usr/local/lib/python3.5/dist-packages/daisy/scheduler.py", line 574, in run_blockwise
    return distribute([{'task': BlockwiseTask()}])
  File "/usr/local/lib/python3.5/dist-packages/daisy/scheduler.py", line 602, in distribute
    return Scheduler().distribute(dependency_graph)
  File "/usr/local/lib/python3.5/dist-packages/daisy/scheduler.py", line 69, in distribute
    self._start_tcp_server()
  File "/usr/local/lib/python3.5/dist-packages/daisy/scheduler.py", line 162, in _start_tcp_server
    asyncio._set_running_loop(new_event_loop)
AttributeError: module 'asyncio' has no attribute '_set_running_loop'

It probably does not make sense to make daisy.run_blockwise compatible with Python 3.5 because

the asyncio module has received new features, significant usability and performance improvements, and a fair amount of bug fixes. Starting with Python 3.6 the asyncio module is no longer provisional and its API is considered stable.

Unfortunately, tensorflow docker images only come with Ubuntu 16.04 and Python3.5. The solution for users is to use conda (or pip) and install tensorflow-gpu into a virtual environment. Unfortunately, I do not see a good solution for daisy to ensure that users have an appropriate version of Python on their system:

  • adding python_requires='>=3.6' to setup.py would prohibit the use of the tensorflow docker containers (come only with Python3.5), which is probably required for lsf scenarios

I suggest using this issue as a note to users who would like to run daisy.run_blockwise in a non-lsf setting, and maybe also linking to it in README.md.
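
Short of changing setup.py, a script that calls daisy.run_blockwise could check the interpreter version itself and fail with a readable message. This is only a sketch of that idea; `check_python` and `MIN_PYTHON` are hypothetical names, not part of Daisy.

```python
import sys

MIN_PYTHON = (3, 6)  # first release in which asyncio is no longer provisional


def check_python(version_info=sys.version_info):
    """Raise a clear error instead of failing later inside asyncio."""
    found = tuple(version_info[:2])
    if found < MIN_PYTHON:
        raise RuntimeError(
            "daisy.run_blockwise needs Python %d.%d or newer, found %d.%d"
            % (MIN_PYTHON + found)
        )


# call once at the top of the script, before importing daisy
check_python()
```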

Can only import tensorflow (implicit or explicit) inside process_function of daisy.run_blockwise

When trying to use tensorflow in the workers of daisy.run_blockwise, tensorflow may only be imported inside the process_function. This includes implicit imports, e.g. through gunpowder. For some reason, tensorflow does not play well with multiprocessing. The solution is to make sure that tensorflow is imported exclusively in the relevant subprocesses (and not in the main process). This is not an issue that arises from within daisy but it would be helpful for users to know how to resolve this issue (maybe even add a link to this issue to README.md).

A simple example of what does not work:

import daisy
import gunpowder

def process_function():
    scheduler = ClientScheduler()
    while True:
        # do something with gunpowder, e.g. build pipeline and run prediction on requested blocks

daisy.run_blockwise(
    ...
    process_function=process_function,
    num_workers=2) # no issues if only one worker

and what works:

import daisy

def process_function():
    import gunpowder
    scheduler = ClientScheduler()
    while True:
        # do something with gunpowder, e.g. build pipeline and run prediction on requested blocks

daisy.run_blockwise(
    ...
    process_function=process_function,
    num_workers=2)
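
To catch an accidental import early, the main process can check which modules are already loaded before it starts any workers. The helper below is a hypothetical sketch built only on the standard library's `sys.modules`; the module names are the ones mentioned in this issue.

```python
import sys


def modules_already_imported(names, loaded=None):
    """Return the entries of `names` that are already imported in this process."""
    loaded = sys.modules if loaded is None else loaded
    return [name for name in names if name in loaded]


# run this in the main process, right before calling daisy.run_blockwise(...)
offenders = modules_already_imported(["tensorflow", "gunpowder"])
if offenders:
    raise RuntimeError(
        "import these only inside process_function: " + ", ".join(offenders)
    )
```

An implicit import (for example through a helper module that itself imports gunpowder) shows up in `sys.modules` just like an explicit one, so this check covers both cases.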

Why force the size of chunks to be smaller than 256

In line 182 of datasets.py, 'chunk_size' is obtained from 'get_chunk_size', which restricts 'chunk_size' to be smaller than 256. The value 256 is hard-coded rather than adjustable. I suggest making it an adjustable parameter so that users can choose the chunk size they want.
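
The requested change could look roughly like the following. This is a hypothetical sketch, not the code in datasets.py: it assumes the rule is "the largest divisor of the block size that does not exceed a limit" and exposes that limit as a `max_chunk` parameter.

```python
def chunk_size_dim(block_size, max_chunk=256):
    """Return the largest divisor of block_size that is <= max_chunk."""
    for candidate in range(min(block_size, max_chunk), 0, -1):
        if block_size % candidate == 0:
            return candidate


def get_chunk_size(block_shape, max_chunk=256):
    """Apply chunk_size_dim to every dimension of block_shape."""
    return tuple(chunk_size_dim(b, max_chunk) for b in block_shape)


print(get_chunk_size((1024, 1000, 40)))                 # (256, 250, 40)
print(get_chunk_size((1024, 1000, 40), max_chunk=512))  # (512, 500, 40)
```

Keeping 256 as the default would preserve the current behavior for existing callers.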

Separate completion checking from block acquisition

In the current implementation, blocks are only checked for completion once they are requested by workers. This checking should instead be run in parallel, such that the master process is eliminating already completed blocks from the queue independently of the worker operations.

pre_check_ret = self.__precheck(block)
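
One simplified way to decouple the two steps is to run the completion checks in a pool of threads and hand workers only the blocks that still need processing. The sketch below is not Daisy's scheduler code; `pending_blocks` and `is_done` are hypothetical names, and it filters the whole queue up front rather than continuously alongside the workers.

```python
from concurrent.futures import ThreadPoolExecutor


def pending_blocks(blocks, is_done, max_threads=4):
    """Return the blocks for which is_done(block) is False.

    The checks run in background threads; the order of blocks is preserved.
    """
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        done_flags = list(pool.map(is_done, blocks))
    return [block for block, done in zip(blocks, done_flags) if not done]


# toy example: blocks 0..9, even-numbered blocks were finished by an earlier run
finished = {0, 2, 4, 6, 8}
queue = pending_blocks(list(range(10)), lambda block: block in finished)
print(queue)  # [1, 3, 5, 7, 9]
```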

Fixing block ID numbering with cantor number breaks compatibility

In dev-0.3, the block_id is assigned the cantor number directly, which counts from 0. In a personal communication with Logan, he said that @funkey prefers to have the blocks counted from 1 instead. This would be an easy change, and personally I have been using the offset of 1 for some time now without any problem.

The problem is that making the change now would break compatibility with the previous numbering system, and can be a problem for usages where block_id is stored in a DB and used (it is used extensively in my proofreading pipeline).

So I propose that we fix the block ID enumeration by adding 1 to the value returned by the cantor number function, and bump the daisy version to dev-0.4. Thoughts?
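
For readers unfamiliar with the term, the sketch below shows the classic Cantor pairing function, folded over an nD block coordinate, together with the proposed offset of 1. It illustrates the idea only; the exact nD formula used in Daisy may differ, and `block_id` is a hypothetical helper name.

```python
def cantor_pair(x, y):
    """Classic Cantor pairing of two non-negative integers."""
    return (x + y) * (x + y + 1) // 2 + y


def cantor_number(coordinate):
    """Fold an nD block coordinate into one integer by repeated pairing."""
    number = coordinate[0]
    for c in coordinate[1:]:
        number = cantor_pair(number, c)
    return number


def block_id(coordinate, offset=1):
    """Block ID counted from `offset` instead of from 0."""
    return cantor_number(coordinate) + offset


print(cantor_number((0, 0, 0)))  # 0
print(block_id((0, 0, 0)))       # 1
print(block_id((1, 2)))          # 9
```

Any IDs stored under the 0-based scheme differ from the 1-based ones by exactly the offset, which is why changing it breaks compatibility with existing databases.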

TypeError: Pickling an AuthenticationString object is disallowed for security reasons

Hey,

I'm trying to update an old script from daisy 0.2.1 to 1.1.1. I (probably naively) attempted to just use my old process function as the new process function in a daisy Task, but the error log gives me the error in the title.

Would love any suggestions, happy to give more as needed, full traceback here:

tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f9567cc44d0>, <Task finished coro=<TCPStream._send_message() done, defined at /home/griffin/anaconda3/envs/synsev/lib/python3.7/site-packages/daisy/tcp/tcp_stream.py:64> exception=TypeError('Pickling an AuthenticationString object is disallowed for security reasons')>)
Traceback (most recent call last):
  File "/home/griffin/anaconda3/envs/synsev/lib/python3.7/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/home/griffin/anaconda3/envs/synsev/lib/python3.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/griffin/anaconda3/envs/synsev/lib/python3.7/site-packages/tornado/ioloop.py", line 779, in _discard_future_result
    future.result()
  File "/home/griffin/anaconda3/envs/synsev/lib/python3.7/site-packages/daisy/tcp/tcp_stream.py", line 69, in _send_message
    pickled_data = pickle.dumps(message)
  File "/home/griffin/anaconda3/envs/synsev/lib/python3.7/multiprocessing/process.py", line 330, in __reduce__
    'Pickling an AuthenticationString object is '
TypeError: Pickling an AuthenticationString object is disallowed for security reasons
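
The TypeError itself comes from Python's multiprocessing module and can be reproduced without Daisy: a process's authkey refuses to be pickled outside of process spawning. The snippet below only demonstrates that standard-library behavior. That the message sent in the traceback above contained a reference to a process object is an assumption, not something the traceback proves.

```python
import multiprocessing
import pickle

# every process object carries an AuthenticationString as its authkey
authkey = multiprocessing.current_process().authkey

try:
    pickle.dumps(authkey)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

If that assumption holds, a likely place to look is any object returned or sent by the process function that holds on to a `multiprocessing.Process`.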

fit: number of overhanging blocks can be more than 1

Since "valid" fit would be missing at most 1 write roi block whose read roi might go out of the total roi, I would expect that an "overhang" fit would add at most 1 block on each axis. This is not the case though since a valid fit checks:
total_roi.contains(b.read_roi), and the "overhang" checks total_roi.contains(b.write_roi.get_begin()).

I've added an example to the documentation:

        "valid": Skip blocks that would lie outside of ``total_roi``. This
        is the default::

            |---------------------------|     total ROI

            |rrrr|wwwwww|rrrr|                block 1
                   |rrrr|wwwwww|rrrr|         block 2
                                            no further block

        "overhang": Add all blocks that overlap with ``total_roi``, even if
        they leave it. Client code has to take care of safe access beyond
        ``total_roi`` in this case.::

            |---------------------------|     total ROI

            |rrrr|wwwwww|rrrr|                block 1
                   |rrrr|wwwwww|rrrr|         block 2
                          |rrrr|wwwwww|rrrr|  block 3 (overhanging)

            |---------------------------|     total ROI

            |rrrrrr|www|rrrrrr|                      block 1
                |rrrrrr|www|rrrrrr|                  block 2
                    |rrrrrr|www|rrrrrr|              block 3
                        |rrrrrr|www|rrrrrr|          block 4 (overhanging)
                            |rrrrrr|www|rrrrrr|      block 5 (overhanging)
                                |rrrrrr|www|rrrrrr|  block 6 (overhanging)

        "shrink": Like "overhang", but shrink the boundary blocks' read and
        write ROIs such that they are guaranteed to lie within
        ``total_roi``. The shrinking will preserve the context, i.e., the
        difference between the read ROI and write ROI stays the same.::

            |---------------------------|     total ROI

            |rrrr|wwwwww|rrrr|                block 1
                   |rrrr|wwwwww|rrrr|         block 2
                          |rrrr|www|rrrr|     block 3 (shrunk)

Documentation accurately describes the current functionality.

Is this the intended behavior, or should we change the "overhang" check to something like total_write_roi.contains(b.write_roi.get_begin())?
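
The difference between the checks can be counted in a simplified 1D model. This is a sketch, not Daisy's scheduler code: ROIs are (begin, end) tuples, blocks advance by the write size, and the numbers (total ROI of 28, context of 6, write size of 3) are chosen for illustration.

```python
def count_blocks(total, context, write_size, accept):
    """Count 1D blocks whose (read, write) intervals pass accept(read, write)."""
    total_begin, total_end = total
    count = 0
    read_begin = total_begin
    # stop once the write ROI would start beyond the total ROI
    while read_begin + context < total_end:
        write = (read_begin + context, read_begin + context + write_size)
        read = (read_begin, write[1] + context)
        if accept(read, write):
            count += 1
        read_begin += write_size
    return count


total, context, write_size = (0, 28), 6, 3
total_write = (total[0] + context, total[1] - context)  # total ROI shrunk by the context


def valid(read, write):
    # current "valid" check: the read ROI lies inside the total ROI
    return total[0] <= read[0] and read[1] <= total[1]


def overhang(read, write):
    # current "overhang" check: the write ROI begins inside the total ROI
    return total[0] <= write[0] < total[1]


def proposed(read, write):
    # proposed check: the write ROI begins inside the total write ROI
    return total_write[0] <= write[0] < total_write[1]


n_valid = count_blocks(total, context, write_size, valid)
n_overhang = count_blocks(total, context, write_size, overhang)
n_proposed = count_blocks(total, context, write_size, proposed)
print(n_valid, n_overhang, n_proposed)  # 5 8 6
```

In this model the current "overhang" check adds three blocks beyond the "valid" fit, while the proposed check adds exactly one.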

Enable blocking worker resubmission

When I submit a task with a bug in it, daisy tries to recover by submitting a seemingly endless number of workers.
Is it possible to make worker resubmission optional?

Add funlib.persistence to dependencies

Need to add funlib.persistence to dependencies.
Ideally, make any dependencies automatically installed during setup - so it "comes with batteries included" ;)
