
Comments (12)

benjeffery commented on June 29, 2024

The Snakemake would be quite straightforward:

# INFILES, OUTDIR and total_jobs are assumed to be defined above or in the config.

rule all:
    input:
        f"{OUTDIR}/metadata.json"

rule explode_start:
    input:
        INFILES
    output:
        f"{OUTDIR}/partitions.json"
    params:
        outdir=OUTDIR,
        total_jobs=total_jobs,
    shell:
        "bio2zarr vcf2zarr explode start --total-jobs {params.total_jobs} "
        "{input} {params.outdir} && touch {output}"

rule explode_write:
    input:
        f"{OUTDIR}/partitions.json"
    output:
        flag=f"{OUTDIR}/job_{{jobid}}.flag"
    params:
        outdir=OUTDIR,
        infiles=INFILES,
        total_jobs=total_jobs,
    shell:
        "bio2zarr vcf2zarr explode write --job {wildcards.jobid} "
        "--total-jobs {params.total_jobs} {params.infiles} {params.outdir} "
        "&& touch {output.flag}"

rule explode_finalize:
    input:
        expand(f"{OUTDIR}/job_{{jobid}}.flag", jobid=range(total_jobs))
    output:
        f"{OUTDIR}/metadata.json"
    params:
        outdir=OUTDIR,
        infiles=INFILES,
    shell:
        "bio2zarr vcf2zarr explode finalize {params.infiles} {params.outdir}"

I think we need both the Python way and the CLI, as for classic batch-array submission a CLI would be much simpler.


benjeffery commented on June 29, 2024

I think a CLI that reads the JSON might work. I don't want the user to have to parse JSON if they are using bash, etc.
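
A minimal sketch of what such a subcommand could look like, assuming a click-based CLI and a "partitions" list in the WIP metadata (both the command name and the field name are assumptions, not existing bio2zarr features):

# Hypothetical sketch only: a subcommand that prints the partition count,
# so batch scripts never have to parse JSON themselves. The command name
# and the "partitions" field are illustrative assumptions.
import json
import pathlib

import click


@click.command()
@click.argument("destdir", type=click.Path(exists=True))
def num_partitions(destdir):
    """Print the number of explode partitions in DESTDIR."""
    metadata = json.loads(
        (pathlib.Path(destdir) / "wip.metadata.json").read_text()
    )
    click.echo(len(metadata["partitions"]))

A batch script could then capture the count with command substitution rather than parsing JSON itself.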


benjeffery commented on June 29, 2024

I'm happy to pick this one up.


jeromekelleher commented on June 29, 2024

Let me have a think about this...

What would the corresponding snakemake look like?

Would it be simpler to expose this as Python functions, which Snakemake can hook into (and keep the CLI for interactive work)?
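
For illustration, a rough sketch of the Snakemake side if the steps were plain Python functions called from run: blocks rather than shell: ones (the import path and function name are placeholders, not an existing bio2zarr API):

# Hypothetical sketch: the write step as a Python call inside a run: block.
# INFILES, OUTDIR and total_jobs are assumed defined as in the earlier sketch.
rule explode_write:
    input:
        f"{OUTDIR}/partitions.json"
    output:
        flag=f"{OUTDIR}/job_{{jobid}}.flag"
    run:
        # Placeholder import and function name; the real API may differ.
        import pathlib
        from bio2zarr import vcf2zarr
        vcf2zarr.explode_write(INFILES, OUTDIR, int(wildcards.jobid), total_jobs)
        pathlib.Path(output.flag).touch()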


jeromekelleher commented on June 29, 2024

Fair enough... I'm just slightly queasy about adding another level of command hierarchy to the CLI.


jeromekelleher commented on June 29, 2024

In principle this is pretty straightforward. In the initial bit, we need to influence how many partitions we split the file into, and then save the metadata as wip.metadata.json (or something) around here.

Then there are the actual "explode" tasks, which take a slice of partition IDs, starting here.
Each of the explode tasks first reads the metadata from wip.metadata.json.

Then, once all the explode tasks are done, we just finalise by renaming wip.metadata.json to metadata.json.

From a UI perspective, this is then:

bio2zarr explode-init  <vcfs> <destdir> [num partitions]

Splits into N partitions, writes wip.metadata.json (and header.txt), and exits

bio2zarr explode-slice <destdir> start stop

Assumes explode-init has already happened, and tries to explode the specified partition slice

bio2zarr explode-finalise <destdir> 

The explode command on its own just does these things sequentially.

I think that's better than more hierarchy?
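
Sketching that last point, the plain explode command is then little more than a sequential driver over the same three steps (the subcommand names follow the proposal above and don't exist yet; the driver shown is illustrative, not the planned implementation):

# Illustrative only: "explode" run as the three proposed steps in sequence.
# A real driver would call internal functions and would read the actual
# partition count back from wip.metadata.json after the init step.
import subprocess


def explode(vcfs, destdir, num_partitions):
    subprocess.run(
        ["bio2zarr", "explode-init", *vcfs, destdir, str(num_partitions)],
        check=True,
    )
    subprocess.run(
        ["bio2zarr", "explode-slice", destdir, "0", str(num_partitions)],
        check=True,
    )
    subprocess.run(["bio2zarr", "explode-finalise", destdir], check=True)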


jeromekelleher commented on June 29, 2024

Ah, it's not that simple; we need to keep the summaries per partition.

Let's just change the format to save the partition summaries as JSON, one per partition. This can be done as part of the work packet rather than being returned from the future. Then the finalise step reads these files, and saves the final metadata.json.

This is also a good check that the partition has been fully completed, so it helps with integrity.

Good, I like this!
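
A minimal sketch of that finalise step, assuming one partition_<j>.json summary per partition (the file and field names here are illustrative):

# Illustrative sketch: finalise gathers the per-partition summary JSONs,
# checks that every partition produced one (the integrity check mentioned
# above), merges them into the WIP metadata, and writes the final metadata.json.
import json
import pathlib


def explode_finalise(destdir):
    destdir = pathlib.Path(destdir)
    wip = json.loads((destdir / "wip.metadata.json").read_text())
    num_partitions = len(wip["partitions"])  # illustrative field name
    summaries = []
    for j in range(num_partitions):
        path = destdir / f"partition_{j}.json"
        if not path.exists():
            raise ValueError(f"partition {j} has not completed")
        summaries.append(json.loads(path.read_text()))
    wip["partition_summaries"] = summaries  # illustrative field name
    (destdir / "metadata.json").write_text(json.dumps(wip, indent=2))
    (destdir / "wip.metadata.json").unlink()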


jeromekelleher commented on June 29, 2024

Snakemake can also use partition_j.json to flag whether partition j has completed too, right?


benjeffery commented on June 29, 2024

Snakemake can also use partition_j.json to flag whether partition j has completed too, right?

Yes, that's probably better.
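
For example, the explode_write rule from the sketch above could then use the summary file itself as its target, with no separate .flag files (paths illustrative):

# Sketch: the per-partition summary JSON doubles as Snakemake's completion
# marker, so no "touch" is needed. Paths and file names are illustrative.
rule explode_write:
    input:
        f"{OUTDIR}/partitions.json"
    output:
        f"{OUTDIR}/partition_{{jobid}}.json"
    params:
        outdir=OUTDIR,
        infiles=INFILES,
        total_jobs=total_jobs,
    shell:
        "bio2zarr vcf2zarr explode write --job {wildcards.jobid} "
        "--total-jobs {params.total_jobs} {params.infiles} {params.outdir}"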


benjeffery commented on June 29, 2024

Thinking about how num_partitions is essentially impossible to predict exactly, I don't think we can have an upper bound. The number of partitions depends on the contigs inside the VCF, which aren't known before we scan. My current approach is to have a target number that the code aims for, but to write the actual number to a file that Snakemake can then read in to determine the jobs for the next step. This isn't ideal, but I can't see a way around it.
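
One way to express that in Snakemake is a checkpoint: the init step writes the actual count to a file, and the list of downstream partition jobs is only evaluated after it has run. A sketch, assuming the init step writes a num_partitions.txt (the file name and layout are illustrative):

# Sketch: a checkpoint lets Snakemake re-evaluate the DAG after the init
# step, so the number of partition jobs can depend on what the scan found.
# It assumes the init step writes the actual count to num_partitions.txt.
checkpoint explode_start:
    input:
        INFILES
    output:
        f"{OUTDIR}/num_partitions.txt"
    params:
        outdir=OUTDIR,
        target=total_jobs,
    shell:
        "bio2zarr vcf2zarr explode start --total-jobs {params.target} "
        "{input} {params.outdir}"


def partition_summaries(wildcards):
    # Read the actual partition count written by the init step.
    with open(checkpoints.explode_start.get().output[0]) as f:
        n = int(f.read())
    return expand(f"{OUTDIR}/partition_{{jobid}}.json", jobid=range(n))


rule explode_finalize:
    input:
        partition_summaries
    output:
        f"{OUTDIR}/metadata.json"
    params:
        outdir=OUTDIR,
        infiles=INFILES,
    shell:
        "bio2zarr vcf2zarr explode finalize {params.infiles} {params.outdir}"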


jeromekelleher commented on June 29, 2024

What about reading the metadata.wip.json (or whatever)? It should be easy to parse, or we could provide a Python API (or CLI)?
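
For instance, getting the partition count in plain Python is a couple of lines (the path and the "partitions" field name are assumptions about the file layout):

# Sketch: read the WIP metadata directly to get the partition count.
import json

with open("out_dir/metadata.wip.json") as f:
    num_partitions = len(json.load(f)["partitions"])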


jeromekelleher commented on June 29, 2024

Closing as completed (for explode)

