Comments (12)
The Snakemake would be quite straightforward (OUTDIR, INFILES, and TOTAL_JOBS being ordinary Python variables defined at the top of the Snakefile):

rule all:
    input:
        f"{OUTDIR}/metadata.json"

rule explode_start:
    input:
        INFILES
    output:
        f"{OUTDIR}/partitions.json"
    params:
        total_jobs=TOTAL_JOBS,
        outdir=OUTDIR,
    shell:
        """
        bio2zarr vcf2zarr explode start --total-jobs {params.total_jobs} \
            {input} {params.outdir} && touch {output}
        """

rule explode_write:
    input:
        partitions=f"{OUTDIR}/partitions.json",
        input_files=INFILES,
    output:
        flag=f"{OUTDIR}/job_{{jobid}}.flag"
    params:
        total_jobs=TOTAL_JOBS,
        outdir=OUTDIR,
    shell:
        """
        bio2zarr vcf2zarr explode write --job {wildcards.jobid} \
            --total-jobs {params.total_jobs} {input.input_files} {params.outdir} \
            && touch {output.flag}
        """

rule explode_finalize:
    input:
        flags=expand(f"{OUTDIR}/job_{{jobid}}.flag", jobid=range(TOTAL_JOBS)),
        input_files=INFILES,
    output:
        f"{OUTDIR}/metadata.json"
    params:
        outdir=OUTDIR,
    shell:
        """
        bio2zarr vcf2zarr explode finalize {input.input_files} {params.outdir}
        """
I think we need both the Python way and the CLI, as for classic batch-array submission a CLI would be much simpler.
from bio2zarr.
I think a CLI that reads the json might work. I don't want the user to have to parse JSON if they are using bash etc.
I'm happy to pick this one up.
Let me have a think about this...
What would the corresponding snakemake look like?
Would it be simpler to expose this as Python functions, which Snakemake can hook into (and keep the CLI for interactive work)?
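One possible shape for such a function-level interface (all names, signatures, and file layouts here are hypothetical illustrations, not bio2zarr's actual API) might be:

```python
import json
import pathlib

# Hypothetical sketch: three functions a Snakemake run: block could call
# directly, with the plain explode() just running them sequentially.

def explode_init(vcfs, destdir, target_num_partitions):
    """Scan the inputs, choose partitions, and write wip.metadata.json."""
    destdir = pathlib.Path(destdir)
    destdir.mkdir(parents=True, exist_ok=True)
    # A real implementation would scan the VCFs here; we just record the target.
    meta = {"num_partitions": target_num_partitions, "vcfs": list(vcfs)}
    (destdir / "wip.metadata.json").write_text(json.dumps(meta))
    return meta["num_partitions"]

def explode_slice(destdir, start, stop):
    """Explode the partitions in [start, stop)."""
    destdir = pathlib.Path(destdir)
    meta = json.loads((destdir / "wip.metadata.json").read_text())
    for j in range(start, min(stop, meta["num_partitions"])):
        pass  # convert partition j here

def explode_finalise(destdir):
    """Promote the work-in-progress metadata to its final name."""
    destdir = pathlib.Path(destdir)
    (destdir / "wip.metadata.json").rename(destdir / "metadata.json")

def explode(vcfs, destdir, target_num_partitions):
    """The interactive entry point: the three steps run sequentially."""
    n = explode_init(vcfs, destdir, target_num_partitions)
    explode_slice(destdir, 0, n)
    explode_finalise(destdir)
```

A Snakemake rule could then call, say, `explode_slice(OUTDIR, start, stop)` in a `run:` block rather than shelling out, while the CLI wraps the same functions for interactive use.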
Fair enough... I'm just slightly queasy about adding another level of command hierarchy to the CLI.
In principle this is pretty straightforward. In the initial step, we need to influence how many partitions we split the file into, and then save the metadata as wip.metadata.json (or something). Then there are the actual "explode" tasks, which each take a slice of partition IDs. Each of the explode tasks first reads the metadata from wip.metadata.json. Then, once all the explode tasks are done, we just finalise by renaming wip.metadata.json to metadata.json.
From a UI perspective, this is then:

bio2zarr explode-init <vcfs> <destdir> [num partitions]
Splits into N partitions, writes `wip.metadata.json` (and header.txt) and exits.

bio2zarr explode-slice <destdir> start stop
Assumes explode-init has already happened, and tries to explode the specified partition slice.

bio2zarr explode-finalise <destdir>

The `explode` command on its own just does these things sequentially. I think that's better than more hierarchy?
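With that UI, a classic batch-array submission just needs to turn the array task index into a partition slice. A minimal SLURM-style sketch (NUM_PARTITIONS, NUM_TASKS, and the destdir are illustrative; only the arithmetic is the point here):

```shell
# Map a 0-based array-task index to a slice of partitions for explode-slice.
NUM_PARTITIONS=${NUM_PARTITIONS:-100}
NUM_TASKS=${NUM_TASKS:-10}
TASK_ID=${SLURM_ARRAY_TASK_ID:-0}

# Ceiling division, so the last task picks up any remainder.
PER_TASK=$(( (NUM_PARTITIONS + NUM_TASKS - 1) / NUM_TASKS ))
START=$(( TASK_ID * PER_TASK ))
STOP=$(( START + PER_TASK ))
if [ "$STOP" -gt "$NUM_PARTITIONS" ]; then STOP=$NUM_PARTITIONS; fi

echo "task $TASK_ID: partitions [$START, $STOP)"
# bio2zarr explode-slice <destdir> "$START" "$STOP"
```

With the defaults above, task 0 gets partitions [0, 10), task 1 gets [10, 20), and so on.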
Ah, it's not that simple, we need to keep the summaries per partition.
Let's just change the format to save the partition summaries as JSON, one per partition. This can be done as part of the work packet rather than being returned from the future. Then the finalise step reads these files, and saves the final metadata.json.
This is actually a good check that the partition has been fully completed too, so helps with integrity.
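A sketch of that finalise step (the partition_<j>.json file names and the summary fields are assumptions for illustration, not the actual on-disk format):

```python
import json
import pathlib

def finalise(destdir):
    """Merge per-partition summaries into the final metadata.json.

    A missing partition_<j>.json means partition j never completed, so
    this doubles as the integrity check discussed above.
    """
    destdir = pathlib.Path(destdir)
    meta = json.loads((destdir / "wip.metadata.json").read_text())
    summaries = []
    for j in range(meta["num_partitions"]):
        path = destdir / f"partition_{j}.json"
        if not path.exists():
            raise RuntimeError(f"partition {j} has not completed")
        summaries.append(json.loads(path.read_text()))
    meta["partitions"] = summaries
    (destdir / "metadata.json").write_text(json.dumps(meta))
    (destdir / "wip.metadata.json").unlink()
```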
Good, I like this!
Snakemake can also use partition_j.json to flag whether partition j has completed too, right?
Yes, that's probably better.
Thinking about how the exact num_partitions is essentially unpredictable, I don't think we can have an upper bound. The number of partitions is dependent on the contigs inside the VCF, which aren't known before we scan. My current approach is to have a target number that the code aims for, but to write the actual number to a file that the Snakemake can then read in to determine the jobs for the next step. This isn't ideal, but I can't see a way around it.
What about reading the metadata.wip.json or whatever? Should be easy to parse, or could provide a python API (or cli)?
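For example, a small helper that a Snakefile (or any driver script) could use to read the actual partition count (the metadata.wip.json file name and the num_partitions key are assumptions about the format):

```python
import json

def read_num_partitions(path):
    """Read the partition count that the init step actually produced."""
    with open(path) as f:
        return json.load(f)["num_partitions"]

# In a Snakefile this would drive the expand() for the per-partition jobs,
# e.g. (sketch):
#   NUM_PARTITIONS = read_num_partitions(f"{OUTDIR}/metadata.wip.json")
#   rule finalise:
#       input: expand(f"{OUTDIR}/partition_{{j}}.json", j=range(NUM_PARTITIONS))
```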
Closing as completed (for explode)