
Comments (7)

benjeffery commented on September 26, 2024

I'm hitting this exact same exception; the encode keeps going, which isn't ideal.

from bio2zarr.

jeromekelleher commented on September 26, 2024

I pushed an update in #80 which should help debug this @shz9. If you run encode with -v it should give you helpful messages about the minimum RAM required per array.

Doing things better will require some refactoring, which we should probably do as part of making the encode job work in parallel over a cluster.


jeromekelleher commented on September 26, 2024

I just hit the same issue and it was due to the worker getting killed by the OOM killer.

I suspect what happened here is that you had just enough memory reserved for 4 workers for all the fields except PL. These fields are huge (each chunk is nearly 1GB), so I'm not surprised the cluster killed it.
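For intuition on why a PL-like field is so large, a back-of-envelope estimate of one chunk's memory footprint looks like this (the chunk dimensions and dtype below are purely illustrative, not bio2zarr's actual defaults):

```python
# Back-of-envelope memory for one chunk of a PL-like field with shape
# (variants, samples, genotypes). Numbers here are illustrative only.
variant_chunk = 10_000   # variants per chunk
sample_chunk = 10_000    # samples per chunk
genotypes = 3            # PL entries per call (diploid, biallelic site)
itemsize = 4             # bytes per value for int32

chunk_bytes = variant_chunk * sample_chunk * genotypes * itemsize
print(f"{chunk_bytes / 1024**3:.2f} GiB per chunk")  # ~1.12 GiB
```

Multiply that by the number of concurrent workers and it is easy to blow past a typical cluster memory reservation.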

This is not obvious, so we should potentially intercept the BrokenProcessPool exception in the main process and add an informative message like "you probably ran out of memory".
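The interception could look something like this sketch. The worker and wrapper names are hypothetical, and SIGKILL stands in for the OOM killer; the "fork" context just keeps the example self-contained on Linux:

```python
import multiprocessing
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def fragile_worker():
    # Stand-in for an encode task: simulate the OOM killer's SIGKILL.
    os.kill(os.getpid(), signal.SIGKILL)


def run_with_oom_hint(fn, *args):
    # Hypothetical wrapper, not bio2zarr's actual worker setup.
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=1, mp_context=ctx) as pool:
        try:
            return pool.submit(fn, *args).result()
        except BrokenProcessPool as exc:
            # Translate the opaque pool failure into an actionable hint.
            raise RuntimeError(
                "encode worker died unexpectedly: you probably ran out of "
                "memory (try fewer workers, or drop large fields such as PL)"
            ) from exc
```

The key point is that the main process sees only BrokenProcessPool when a child is killed, so any OOM diagnosis has to be a guess attached to that exception.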

I think the simplest thing for now is to just remove PL from your experiments. Edit the schema JSON and delete the PL field, and it should all work fine.
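As a sketch, deleting a field from the schema JSON could be scripted like this, assuming the schema keeps its fields in a top-level "fields" list with per-field "name" entries (check the generated schema, as the real layout may differ):

```python
import json


def drop_schema_field(path, field_name):
    # Assumed layout: {"fields": [{"name": ...}, ...]} -- adjust the key
    # names to match the schema file bio2zarr actually generated.
    with open(path) as f:
        schema = json.load(f)
    before = len(schema["fields"])
    schema["fields"] = [
        fld for fld in schema["fields"] if fld["name"] != field_name
    ]
    with open(path, "w") as f:
        json.dump(schema, f, indent=2)
    return before - len(schema["fields"])  # how many fields were removed
```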

Also a general question related to this: Do you think it's possible to pick up the encoding work from where it left off if things like this happen instead of starting over?

I think that's closely related to how we're going to split this up into manageable bits for cluster scheduling. See #71 and #77 for discussions on how we're doing this for explode (and I think some high-level discussion about encode too).


jeromekelleher commented on September 26, 2024

Good to know, I think I know how to fix this.


shz9 commented on September 26, 2024

In addition to the informative exception, do you think it'd be possible to let the user set a --max-memory flag, from which we could determine memory-friendly chunk sizes for the encoding stage? We could re-chunk afterwards for optimal compression if needed. Alternatively, if we don't want to change the chunk sizes, we could automatically reduce the number of workers for arrays that may have large chunks?

If it's of interest, I have this function that determines chunking patterns based on the number of cores / data type:
https://github.com/shz9/magenpy/blob/579504c7cd8a61808ab8b880e1627ef3ffe5fc8d/magenpy/stats/ld/utils.py#L547

def optimize_chunks_for_memory(chunked_array, cpus=None, max_mem=None):
    """
    Rechunk `chunked_array` so that one chunk per CPU fits within `max_mem`.
    `max_mem` is given in GiB; both arguments default to what psutil
    reports for the current machine.
    Modified from: Sergio Hleap
    """

    import psutil
    import dask.array as da

    if cpus is None:
        cpus = psutil.cpu_count()

    if max_mem is None:
        # Available (not total) memory, converted from bytes to GiB.
        max_mem = psutil.virtual_memory().available / (1024.0 ** 3)

    # Give each worker an equal share of the memory budget.
    chunk_mem = max_mem / cpus
    chunks = da.core.normalize_chunks(f"{chunk_mem}GiB", shape=chunked_array.shape, dtype=chunked_array.dtype)

    # dask arrays use .rechunk(); .chunk() is the xarray spelling.
    return chunked_array.rechunk(chunks)


jeromekelleher commented on September 26, 2024

Ooh, max-memory is a great idea! We could associate a memory value with each future (say 3 times the number of bytes in one chunk of the array) and then stop submitting when the total for the outstanding futures exceeds this. I expect this would work quite well, especially if we try to mix the big chunks in with smaller ones.

We should follow this up in a separate issue.
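A minimal sketch of that budgeted submission loop (illustrative names, not bio2zarr's actual scheduler; a thread pool is used here only for brevity):

```python
from concurrent.futures import FIRST_COMPLETED, wait


def submit_within_budget(executor, jobs, max_memory):
    """Submit (fn, args, cost_bytes) jobs, keeping the summed cost of
    outstanding futures at or below max_memory."""
    outstanding = {}  # future -> estimated memory cost in bytes
    results = []
    for fn, args, cost in jobs:
        # Block until enough in-flight work finishes to free the budget.
        # The first job always submits, so an oversized job can't deadlock.
        while outstanding and sum(outstanding.values()) + cost > max_memory:
            done, _ = wait(outstanding, return_when=FIRST_COMPLETED)
            for fut in done:
                results.append(fut.result())
                outstanding.pop(fut)
        outstanding[executor.submit(fn, *args)] = cost
    # Drain whatever is still in flight.
    for fut in list(outstanding):
        results.append(fut.result())
    return results
```

With per-future costs derived from chunk sizes (e.g. 3x one chunk's bytes), this naturally lets many small-field futures run alongside a few PL-sized ones.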


jeromekelleher commented on September 26, 2024

Closing this as we've added the max-memory argument as well.

