
Comments (12)

benjeffery commented on June 29, 2024

The Snakemake would be quite straightforward:

# INFILES, OUTDIR and total_jobs are assumed to be defined above or in the config.

rule all:
    input:
        f"{OUTDIR}/metadata.json"

rule explode_start:
    input:
        INFILES
    output:
        f"{OUTDIR}/partitions.json"
    params:
        outdir=OUTDIR,
        total_jobs=total_jobs,
    shell:
        "bio2zarr vcf2zarr explode start --total-jobs {params.total_jobs} "
        "{input} {params.outdir} && touch {output}"

rule explode_write:
    input:
        f"{OUTDIR}/partitions.json"
    output:
        flag=f"{OUTDIR}/job_{{jobid}}.flag"
    params:
        outdir=OUTDIR,
        infiles=INFILES,
        total_jobs=total_jobs,
    shell:
        "bio2zarr vcf2zarr explode write --job {wildcards.jobid} "
        "--total-jobs {params.total_jobs} {params.infiles} {params.outdir} "
        "&& touch {output.flag}"

rule explode_finalize:
    input:
        expand(f"{OUTDIR}/job_{{jobid}}.flag", jobid=range(total_jobs))
    output:
        f"{OUTDIR}/metadata.json"
    params:
        outdir=OUTDIR,
        infiles=INFILES,
    shell:
        "bio2zarr vcf2zarr explode finalize {params.infiles} {params.outdir}"

I think we need both the Python way and the CLI, as for classic batch-array submission a CLI would be much simpler.


benjeffery commented on June 29, 2024

I think a CLI that reads the JSON might work. I don't want the user to have to parse JSON if they are using bash, etc.
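
A minimal sketch of what such a subcommand could look like, assuming a click-based CLI and a "partitions" list in the WIP metadata (both the command name and the field name are assumptions, not existing bio2zarr features):

# Hypothetical sketch only: a subcommand that prints the partition count,
# so batch scripts never have to parse JSON themselves. The command name
# and the "partitions" field are illustrative assumptions.
import json
import pathlib

import click


@click.command()
@click.argument("destdir", type=click.Path(exists=True))
def num_partitions(destdir):
    """Print the number of explode partitions in DESTDIR."""
    metadata = json.loads(
        (pathlib.Path(destdir) / "wip.metadata.json").read_text()
    )
    click.echo(len(metadata["partitions"]))

A batch script could then capture the count with command substitution rather than parsing JSON itself.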


benjeffery commented on June 29, 2024

I'm happy to pick this one up.


jeromekelleher commented on June 29, 2024

Let me have a think about this...

What would the corresponding snakemake look like?

Would it be simpler to expose this as Python functions, which Snakemake can hook into (and keep the CLI for interactive work)?
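
For illustration, a rough sketch of the Snakemake side if the steps were plain Python functions called from run: blocks rather than shell: ones (the import path and function name are placeholders, not an existing bio2zarr API):

# Hypothetical sketch: the write step as a Python call inside a run: block.
# INFILES, OUTDIR and total_jobs are assumed defined as in the earlier sketch.
rule explode_write:
    input:
        f"{OUTDIR}/partitions.json"
    output:
        flag=f"{OUTDIR}/job_{{jobid}}.flag"
    run:
        # Placeholder import and function name; the real API may differ.
        import pathlib
        from bio2zarr import vcf2zarr
        vcf2zarr.explode_write(INFILES, OUTDIR, int(wildcards.jobid), total_jobs)
        pathlib.Path(output.flag).touch()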


jeromekelleher commented on June 29, 2024

Fair enough... I'm just slightly queasy about adding another level of command hierarchy to the CLI.


jeromekelleher commented on June 29, 2024

In principle this is pretty straightforward. In the initial bit, we need to influence how many partitions we split the file into, and then save the metadata as wip.metadata.json (or something) around here.

Then there are the actual "explode" tasks, which take a slice of partition IDs, starting here.
Each of the explode tasks first reads the metadata from wip.metadata.json.

Then, once all the explode tasks are done, we just finalise by renaming wip.metadata.json to metadata.json.

From a UI perspective, this is then:

bio2zarr explode-init  <vcfs> <destdir> [num partitions]

Splits into N partitions, writes wip.metadata.json (and header.txt), and exits

bio2zarr explode-slice <destdir> start stop

Assumes explode-init has already happened, and tries to explode the specified partition slice

bio2zarr explode-finalise <destdir> 

The explode command on its own just does these things sequentially.

I think that's better than more hierarchy?
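
Sketching that last point, the plain explode command is then little more than a sequential driver over the same three steps (the subcommand names follow the proposal above and don't exist yet; the driver shown is illustrative, not the planned implementation):

# Illustrative only: "explode" run as the three proposed steps in sequence.
# A real driver would call internal functions and would read the actual
# partition count back from wip.metadata.json after the init step.
import subprocess


def explode(vcfs, destdir, num_partitions):
    subprocess.run(
        ["bio2zarr", "explode-init", *vcfs, destdir, str(num_partitions)],
        check=True,
    )
    subprocess.run(
        ["bio2zarr", "explode-slice", destdir, "0", str(num_partitions)],
        check=True,
    )
    subprocess.run(["bio2zarr", "explode-finalise", destdir], check=True)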


jeromekelleher commented on June 29, 2024

Ah, it's not that simple; we need to keep the summaries per partition.

Let's just change the format to save the partition summaries as JSON, one per partition. This can be done as part of the work packet rather than being returned from the future. Then the finalise step reads these files, and saves the final metadata.json.

This is also a good check that the partition has been fully completed, so it helps with integrity.

Good, I like this!
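
A minimal sketch of that finalise step, assuming one partition_<j>.json summary per partition (the file and field names here are illustrative):

# Illustrative sketch: finalise gathers the per-partition summary JSONs,
# checks that every partition produced one (the integrity check mentioned
# above), merges them into the WIP metadata, and writes the final metadata.json.
import json
import pathlib


def explode_finalise(destdir):
    destdir = pathlib.Path(destdir)
    wip = json.loads((destdir / "wip.metadata.json").read_text())
    num_partitions = len(wip["partitions"])  # illustrative field name
    summaries = []
    for j in range(num_partitions):
        path = destdir / f"partition_{j}.json"
        if not path.exists():
            raise ValueError(f"partition {j} has not completed")
        summaries.append(json.loads(path.read_text()))
    wip["partition_summaries"] = summaries  # illustrative field name
    (destdir / "metadata.json").write_text(json.dumps(wip, indent=2))
    (destdir / "wip.metadata.json").unlink()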


jeromekelleher commented on June 29, 2024

Snakemake can also use partition_j.json to flag whether partition j has completed too, right?


benjeffery commented on June 29, 2024

Snakemake can also use partition_j.json to flag whether partition j has completed too, right?

Yes, that's probably better.
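
For example, the explode_write rule from the sketch above could then use the summary file itself as its target, with no separate .flag files (paths illustrative):

# Sketch: the per-partition summary JSON doubles as Snakemake's completion
# marker, so no "touch" is needed. Paths and file names are illustrative.
rule explode_write:
    input:
        f"{OUTDIR}/partitions.json"
    output:
        f"{OUTDIR}/partition_{{jobid}}.json"
    params:
        outdir=OUTDIR,
        infiles=INFILES,
        total_jobs=total_jobs,
    shell:
        "bio2zarr vcf2zarr explode write --job {wildcards.jobid} "
        "--total-jobs {params.total_jobs} {params.infiles} {params.outdir}"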


benjeffery commented on June 29, 2024

Thinking about how num_partitions is essentially impossible to predict exactly, I don't think we can have an upper bound. The number of partitions depends on the contigs inside the VCF, which aren't known before we scan. My current approach is to have a target number that the code aims for, but to write the actual number to a file that Snakemake can then read in to determine the jobs for the next step. This isn't ideal, but I can't see a way around it.
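
One way to express that in Snakemake is a checkpoint: the init step writes the actual count to a file, and the list of downstream partition jobs is only evaluated after it has run. A sketch, assuming the init step writes a num_partitions.txt (the file name and layout are illustrative):

# Sketch: a checkpoint lets Snakemake re-evaluate the DAG after the init
# step, so the number of partition jobs can depend on what the scan found.
# It assumes the init step writes the actual count to num_partitions.txt.
checkpoint explode_start:
    input:
        INFILES
    output:
        f"{OUTDIR}/num_partitions.txt"
    params:
        outdir=OUTDIR,
        target=total_jobs,
    shell:
        "bio2zarr vcf2zarr explode start --total-jobs {params.target} "
        "{input} {params.outdir}"


def partition_summaries(wildcards):
    # Read the actual partition count written by the init step.
    with open(checkpoints.explode_start.get().output[0]) as f:
        n = int(f.read())
    return expand(f"{OUTDIR}/partition_{{jobid}}.json", jobid=range(n))


rule explode_finalize:
    input:
        partition_summaries
    output:
        f"{OUTDIR}/metadata.json"
    params:
        outdir=OUTDIR,
        infiles=INFILES,
    shell:
        "bio2zarr vcf2zarr explode finalize {params.infiles} {params.outdir}"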


jeromekelleher commented on June 29, 2024

What about reading the metadata.wip.json (or whatever)? It should be easy to parse, or we could provide a Python API (or CLI)?
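
For instance, getting the partition count in plain Python is a couple of lines (the path and the "partitions" field name are assumptions about the file layout):

# Sketch: read the WIP metadata directly to get the partition count.
import json

with open("out_dir/metadata.wip.json") as f:
    num_partitions = len(json.load(f)["partitions"])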


jeromekelleher commented on June 29, 2024

Closing as completed (for explode)

