
bio2zarr's Introduction


bio2zarr

Convert bioinformatics file formats to Zarr

See the documentation for details.

bio2zarr's People

Contributors

jeromekelleher, benjeffery, ravwojdyla, tomwhite


bio2zarr's Issues

More informative errors on IntermediateColumnarFormat missing files

We use missing files as a way of ensuring integrity, but we currently don't tell the user anything useful when those files are missing.

We should define some exceptions that capture the situation, and make sure that useful information makes its way back to the user (when, say, the ICF hasn't completed exploding).
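For illustration, a minimal sketch of what such an exception might look like (the class name and wording are hypothetical):

class IcfFileNotFoundError(FileNotFoundError):
    # Hypothetical exception: point the user at the likely cause rather
    # than just reporting a missing path.
    def __init__(self, path):
        super().__init__(
            f"Intermediate file {path} not found. The ICF appears to be "
            "incomplete - did the explode step finish successfully?"
        )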

Refactor encode code path

Currently we're using the ThreadedZarrEncoder to handle writing Zarr chunks out, assigning one process per column. This doesn't work very well for a few reasons; in particular, progress reporting is very misleading when you use a lot of processes.

It would be better to encode columns in vertical slices, assigning (chunk-aligned) start and end coordinates to each worker process. These could then be chunk-buffered and written out incrementally and synchronously, which would be much simpler.

To do this we would need start and stop values for the iter_values method (which we can compute using the per-partition counts). We might also need per-chunk counts to avoid reading in lots of chunks unnecessarily.

Keeping track of progress is also an issue: the current approach is not working all that well (related: #48).

Compressor not set for contig_length

We're using Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) for contig_length in the Zarr, presumably because we're not setting the compressor explicitly somewhere.

Improve compression on PL fields

On recent 1000 genomes data, we have the following:

39G   20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz

The zarr using defaults is:

43G     tmp/1kg_chr20_all.zarr

which is 4G more.

This is dominated by the FORMAT cols:

1.1G    tmp/1kg_chr20_all.zarr/call_AB                                                        
7.1G    tmp/1kg_chr20_all.zarr/call_AD                                                        
4.5G    tmp/1kg_chr20_all.zarr/call_DP                                                        
174M    tmp/1kg_chr20_all.zarr/call_genotype                                                  
31M     tmp/1kg_chr20_all.zarr/call_genotype_mask                                             
4.7M    tmp/1kg_chr20_all.zarr/call_genotype_phased    
6.2G    tmp/1kg_chr20_all.zarr/call_GQ                                                        
4.7M    tmp/1kg_chr20_all.zarr/call_MIN_DP                                                    
4.7M    tmp/1kg_chr20_all.zarr/call_MQ0                                                       
552M    tmp/1kg_chr20_all.zarr/call_PGT
966M    tmp/1kg_chr20_all.zarr/call_PID
22G     tmp/1kg_chr20_all.zarr/call_PL
4.7M    tmp/1kg_chr20_all.zarr/call_RGQ
4.7M    tmp/1kg_chr20_all.zarr/call_SB

In particular, call_PL is 22G. Hopefully there's some reasonably straightforward combination of filters and compressor options that'll bring this down.
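As a starting point, here's a rough sketch of the kind of experiment to run, assuming zarr-python 2.x; the codec settings below are guesses to try, not recommendations:

import numcodecs
import zarr

# Compare candidate codecs for call_PL on a slice of the data; a couple of
# variant chunks is enough for a rough size comparison.
src = zarr.open("tmp/1kg_chr20_all.zarr/call_PL", mode="r")
sample = src[:20_000]
candidates = {
    "zstd-shuffle": numcodecs.Blosc("zstd", clevel=7, shuffle=numcodecs.Blosc.SHUFFLE),
    "zstd-bitshuffle": numcodecs.Blosc("zstd", clevel=7, shuffle=numcodecs.Blosc.BITSHUFFLE),
}
for name, compressor in candidates.items():
    a = zarr.array(sample, chunks=src.chunks, compressor=compressor)
    print(name, a.nbytes_stored / a.nbytes)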

Use Rich for UI

Rich provides some really nice utilities for CLIs, and helps give a CLI real polish. Using rich would definitely make things look nicer.

I tried out Rich progress bars and log formatting in #64. The results are very pretty, but I found the progress bars lacking some key features that I find useful:

  1. Support for arbitrary units isn't present (though this could be worked around)
  2. There's no instantaneous rate (e.g. chunks per second)
  3. The estimated completion time seems much less robust than tqdm's

So, until these key features are supported in Rich progress bars, I think we'll stick with tqdm.

Add testing and validation for multiple-vcfs

We're currently not testing how we handle multiple VCFs.

For validation we should test chunking up the input in various ways, and then validate the final product against the original files.

Add CLI flags to support batch array job processing

To distribute the work of explode and encode it would be useful to run commands such as:

bio2zarr vcf2zarr explode --job 5 --total-jobs 50 INFILES OUTDIR

Where the --job argument would be different on each worker.

Implementation-wise, I think this would mean breaking explode into three parts:

explode start --total-jobs M - a single local task that does the scan and writes the VCF partition metadata to OUTDIR
explode write --job N - a parallel job that does the conversion, reading the metadata from OUTDIR and writing partition summaries
explode finalise - a single local job that writes the updated metadata after reading the partition summaries, and errors out if a partition summary is missing

If none of these are specified then the usual explode proceeds locally as it does now.
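For example, on a SLURM-style cluster the workflow might then look something like this (the subcommand names are the ones proposed above, so purely illustrative):

bio2zarr vcf2zarr explode start --total-jobs 50 INFILES OUTDIR
bio2zarr vcf2zarr explode write --job $SLURM_ARRAY_TASK_ID INFILES OUTDIR   # array job, indices 0-49
bio2zarr vcf2zarr explode finalise OUTDIR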

Add within-VCF file partitioning

It would be useful to partition input VCF files so that we can explode them in parallel. This can directly reuse the code in sgkit, which uses the indexes to partition files.

Note that we need to be careful about indels that span partitions. This is handled by filtering out variants that don't start in the region, as sgkit does here.

@tomwhite I think it would be simplest if we just copied the relevant code from sgkit over here. I can do this, but it would be nice if you got the due git-cred for it. Would you be interested in opening a PR that copies in the relevant code and tests with minimal modifications, and then I can factor it into the current structure?

It doesn't matter which files they go in - we could just make a "ported_vcf.py" module with the new code, and copy the tests directly into the tests directory or something?
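For reference, a minimal sketch of the start-position filter described above, assuming a cyvcf2-style reader (the sgkit implementation differs in detail):

def variants_in_partition(vcf, contig, start, end):
    # Region queries return every record overlapping the interval, so an
    # indel spanning a partition boundary would otherwise appear twice.
    # Keep only records whose POS lies in [start, end) so each variant is
    # assigned to exactly one partition.
    for variant in vcf(f"{contig}:{start}-{end}"):
        if start <= variant.POS < end:
            yield variant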

Weird progress at end of Encode 1D

On lots of files we seem to get a disappearing progress bar for Encode 1D, like:

Encode 1D: 1.47kchunks [00:09, 150chunks/s] 

validation-tmp/1kg_2020_chr20_annotations.vcf.gz.zarr

Handle named VCF dimensions

We don't currently use the VCF Numbers R, A and G, but instead determine dimensions from the observed data.

A minimal thing we need to do is to update the dimensions for these fields as described in the spec.
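For reference, the array sizes these named Numbers imply, given the maximum number of alleles at any site (a sketch; the function name and ploidy handling are assumptions):

import math

def dimension_size(vcf_number, max_alleles, ploidy=2):
    if vcf_number == "A":
        # one value per alternate allele
        return max_alleles - 1
    if vcf_number == "R":
        # one value per allele, including the reference
        return max_alleles
    if vcf_number == "G":
        # one value per possible genotype: the number of multisets of size
        # `ploidy` drawn from `max_alleles` alleles, C(n + p - 1, p)
        return math.comb(max_alleles + ploidy - 1, ploidy)
    raise ValueError(f"not a named dimension: {vcf_number}")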

Reorganise with top-level modules vcf2zarr, plink2zarr

The Python API that's emerging is to do things like:

from bio2zarr import vcf

vcf.convert(["my_file.vcf.gz"], "out.zarr")

It would probably be better to do it like

from bio2zarr import vcf2zarr

vcf2zarr.convert(["my_file.vcf.gz"], "out.zarr")

Then, all the exported functions are in that flat namespace vcf2zarr, and how the code is split across files within the package is private.

Refactor basic file structure

Currently all the code is in the vcf.py file. It would be good to add a plink.py file, and maybe a core.py file that takes care of the business of BufferedArrays etc.

Test on VCF/BCF indexes not produced by htslib

It's not clear how much we're depending on properties of the indexes produced by htslib. I think everything in the test suite has been produced by bcftools index or tabix, so it would be good to check that we get the same results using indexes generated by, e.g., htsjdk.

Notebook progress support

Should be as simple as

from tqdm.auto import tqdm

We'll need to check how well it works in both notebook and terminal environments.

We should wait until we've abstracted out the "scan" bit to use the standard parallel worker pattern.

Fill out test field corner cases

The info_field_type_combos.vcf.gz file copied from sgkit has a lot of wacky corner cases, and is a pretty exhaustive summary of the things that can happen. It currently fails in various ways.

Also fill out all the corner cases in validate.

Parse annotations into separate columns

Variant level annotations are often included as INFO tags with substructure, e.g.

##SnpEffVersion="4.3i (build 2016-12-15 22:33), by Pablo Cingolani"
##SnpEffCmd="SnpEff  -noStats -lof GRCh38.86 /gpfs/commons/home/usevani/compbio/CCDG/Project_CCDG_14151_B01_GRM_WGS/final_annotated_vcfs/tmp_dir_annt/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.annotated.normalize.vcf.gz "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA>
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">

It would be very useful to split these into their own Zarr arrays. We could add this as an option, like --parse-snpeff or something (though I'm not sure how stable these formats are across versions, etc.).
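A rough sketch of what the parsing could look like for ANN (the subfield names come from the header Description above and vary between SnpEff versions; the function name is illustrative):

# The first few ANN subfields, in the order given by the header Description
ANN_FIELDS = ["Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID"]

def split_ann(ann_value):
    # ANN holds one comma-separated entry per (allele, transcript) pair,
    # each entry being a pipe-separated list of subfields
    for entry in ann_value.split(","):
        yield dict(zip(ANN_FIELDS, entry.split("|")))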

Add options for compressor in explode/dexplode-init

In #91 we added the ability to store the compressor used for the ICF, and switched the default to LZ4. In my example, I had 2m13s and a 100M ICF with zstd, versus 56s and a 214M ICF with LZ4 - so roughly half the time and twice the size. There was no real difference in the Zarr encode time in either case (which isn't surprising, because both LZ4 and zstd are super fast at decoding).

Given that this is an intermediate format and we're not going to keep it around for long, it seems OK to use the faster and less compact option as the default. However, we can totally provide the option of specifying the compressor.

So, I'd suggest we add the option --compressor=[list of Blosc compressors] to the explode and dexplode-init commands.
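A sketch of how the flag could map onto numcodecs (the default level and shuffle setting here are assumptions):

import numcodecs

def icf_compressor(cname="lz4"):
    # Validate against the compressors this Blosc build actually supports
    if cname not in numcodecs.blosc.list_compressors():
        raise ValueError(f"unknown Blosc compressor: {cname}")
    return numcodecs.Blosc(cname=cname, clevel=7, shuffle=numcodecs.Blosc.NOSHUFFLE)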

Rename VCF commands

Currently we have:

Usage: python -m bio2zarr vcf2zarr [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  convert
  explode
  genspec
  summarise
  to-zarr
  validate

I think convert is a good name for the top-level all-in-one command (so, plink2zarr convert would just work).

We should probably change genspec to mkschema (or something; generally changing "spec" to "schema")

Not sure about summarise. Maybe there should be an inspect command that will look at either an exploded intermediate format or zarr and return some info?

I dislike to-zarr quite a lot. Maybe encode or something?

Change ``dexplode_slice`` to ``dexplode_partition``?

I'm not sure I see much point in supporting a slice of partitions in the distributed explode operation. If you're distributing across a cluster anyway, you may as well just use a simple array job and provision one core per partition. There's not really much to be gained from doing multiple partitions per job, and there's no point in over-partitioning either.

It's also quite a bit easier to explain...

Am I missing something @benjeffery?

Provide mechanism to change types of fields

Some programs don't do a great job of typing or sizing the output VCF fields. For example, this is some SnpEff output (I think):

##INFO=<ID=GDI-Phred,Number=.,Type=String,Description="Phred-scaled GDI scores">
##INFO=<ID=GDI,Number=.,Type=String,Description="gene damage index score, a genome-wide, gene-level metric of the mutational damage that has accumulated in the general population from doi 1>

This has been marked as String, Number=".", where it probably should be Integer, Number=1.

One way we could do this is to check the correspondence between the types in the VCF metadata and the types in the Zarr schema, and to push a field through a "conversion" code path if they are not compatible. So, here, we'd see that the original VCF type is String and the Zarr type is int, and we'd pass the field through an integer parsing code path.

A nice feature to add at some point would be to take an input Zarr schema, automatically detect some common mistyped fields, and correct them.
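A minimal sketch of that integer-parsing path (the function name and missing sentinel are illustrative):

INT_MISSING = -1

def coerce_to_int(values):
    # Values arrive as strings (or None when missing); anything that doesn't
    # parse cleanly falls back to the missing sentinel.
    out = []
    for v in values:
        try:
            out.append(int(v))
        except (TypeError, ValueError):
            out.append(INT_MISSING)
    return out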

Investigate/explain size vs compressed size in FORMAT cols

On 1000 Genomes chr20 we get this:

name                      type       chunks  size       compressed      max_n  min_val    max_val                                                                                             
------------------------  -------  --------  ---------  ------------  -------  ---------  ---------
CHROM                     String        160  1.25 MB    138.06 KB           1  n/a        n/a                                                                                                 
POS                       Integer       160  606.64 KB  4.43 MB             1  6e+04      6.4e+07                                                                                             
QUAL                      Float         160  606.64 KB  11.33 MB            1  30         1.2e+08
ID                        String        160  16 KB      9.6 KB              0  n/a        n/a                                                                                                 
FILTERS                   String        160  1.41 MB    739.29 KB           1  n/a        n/a                                                                                                 
REF                       String        160  1.32 MB    3.57 MB             1  n/a        n/a                                                                                                 
ALT                       String        160  1.91 MB    4.82 MB             6  n/a        n/a                                                                                                 
FORMAT/AB                 Float        1097  178.45 MB  584.71 MB           1  0.04       0.95                                                                                                
FORMAT/AD                 Integer      2496  466.49 MB  7.78 GB             7  0          5.8e+03                                                                                             
FORMAT/DP                 Integer      1224  204.99 MB  6.08 GB             1  0          5.8e+03                                                                                             
FORMAT/GQ                 Integer      1224  204.99 MB  5.25 GB             1  0          99                                                                                                  
FORMAT/GT                 Integer      1788  307.2 MB   417.48 MB           3  -1         6                                                                                                   
FORMAT/MIN_DP             Integer       160  16 KB      9.6 KB              0  n/a        n/a                                                                                                 
FORMAT/MQ0                Integer       160  16 KB      9.6 KB              0  n/a        n/a                                                                                                 
FORMAT/PGT                String        849  46.13 MB   100.24 MB           1  n/a        n/a                                                                                                 
FORMAT/PID                String        849  59.74 MB   129.69 MB           1  n/a        n/a                                                                                                 
FORMAT/PL                 Integer      3992  807.39 MB  18.42 GB           28  0          1.7e+05                                                                                             
FORMAT/RGQ                Integer       160  16 KB      9.6 KB              0  n/a        n/a                                                                                                 
FORMAT/SB                 Integer       160  16 KB      9.6 KB              0  n/a        n/a          

Note that columns like FORMAT/DP etc. are reported as much larger compressed than uncompressed. Investigate.

Reorder vcf commands in cli

Currently have:

Usage: python -m bio2zarr vcf2zarr [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  convert   Convert input VCF(s) directly to vcfzarr (not recommended for...
  encode    Encode intermediate format (see explode) to vcfzarr
  explode   Convert VCF(s) to columnar intermediate format
  inspect   Inspect an intermediate format file
  mkschema  Generate a schema for zarr encoding
  validate

which is confusing.

Improve default chunk size

Default chunk sizes are chosen arbitrarily at the moment. We're currently following sgkit's chunk_length=10_000, chunk_width=1_000 (see here), but we can probably do a bit better than that.

Add non-genotype data to plink conversion

See also #19

The current plink conversion is very much a proof of concept, just doing the actual genotypes. We need to convert all the other stuff too, including providing APIs to open the linked files.

Tabix indexes sometimes return empty last partition

See assert_part_counts_non_zero in test_vcf_utils and where it's called. This happens with the file CEUTrio.20.21.gatk3.4.g.vcf.bgz.tbi.

Not sure whether this is a general thing, or just happens with this small file.

Insufficient testing of dimension padding

The all_info_fields example VCF doesn't have enough coverage of Number="." fields where we need to do dimension padding. It's probably simplest to update this upstream in sgkit, to make sure it's handling these fields OK as well.

Generalise "inspect" command

It would be very nice to also support inspecting a set of VCFs and Zarrs. Creating a similar summary table should be easy enough in both cases.

BrokenProcessPool when using multiple workers

I've run into this issue a couple of times while attempting to convert the 1000G chr22 VCF to Zarr format. The first time I tried calling vcf2zarr convert ... directly. The second time I used the staggered approach suggested in the documentation. However, it still produced this error at the encode stage, particularly when encoding 2D matrices:

(bio2zarr_env) [szabad@narval2 bio2zarr_experiments]$ vcf2zarr encode -p 4 converted/chr22.exploded converted/chr22.zarr
Encode 1D: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 19.7k/19.7k [09:35<00:00, 34.2chunks/s]
Encode 2D:  79%|███████████████████████████████████████████████████████████████████████████████▎                     | 8.49k/10.8k [28:22<29:35, 1.30chunks/s]Traceback (most recent call last):
  File "/home/szabad/bio2zarr_env/bin/vcf2zarr", line 8, in <module>
    sys.exit(vcf2zarr())
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/cli.py", line 127, in encode
    vcf.encode(
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1538, in encode
    SgvcfZarr.encode(
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1498, in encode
    pwm.submit(
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/core.py", line 219, in __exit__
    wait_on_futures(self.futures)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/core.py", line 56, in wait_on_futures
    raise exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

I'm running this on a cluster, and it's possible that the processes are killed silently if they use more resources than they should. But in this case I'm only using 4 workers, and they didn't seem to be using a lot of memory.

Also, a general question related to this: do you think it's possible to pick up the encoding work from where it left off when things like this happen, instead of starting over?

Basic shell completion

Click supports shell completion; we should document how to set it up.

It should work for basic things like paths, but we can look at how to autocomplete things like column names (if CLI commands take them as parameters)
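For a Click 8 console script installed as vcf2zarr, activating bash completion should be along these lines (treat as a sketch to verify when we document it):

eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)"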

Progress meter does not complete

We often don't completely close the progress meter, e.g.:

 Scan:  50%|█████████████████████                     | 1.00/2.00 [00:00<00:00, 21.8files/s

This will be something to do with how we're joining the progress thread.

Add option to encode first n-chunks only

This should truncate the columns down to this size, so that we're not falsely reporting large unpopulated arrays.

Will require breaking up the fixed columns into slices, which is no harm anyway as they are quite slow.

Add header and group metadata

Need:

vcf_zarr_version	0.2
vcf_header	The VCF header from ##fileformat to #CHROM inclusive, stored as a single string.
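For illustration, a minimal sketch of writing these as Zarr group attributes (assuming zarr-python; header_text stands in for the full header string):

import zarr

root = zarr.open_group("out.zarr", mode="a")
root.attrs["vcf_zarr_version"] = "0.2"
# Full header, from the ##fileformat line to the #CHROM line inclusive
root.attrs["vcf_header"] = header_text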

Review fsspec dependency

The code for inspecting tabix files etc., imported from sgkit in #22, depends on fsspec.

If fsspec is a light dependency, then we can just include it. If not, then it's probably simplest to mock it out for now, while evaluating whether it's worth updating the code to use fsspec more generally.

To me, any file that's small enough to convert reliably over the network could just be downloaded and converted. Being dependent on tens of gigabytes downloading without problems seems like unnecessary complexity to add to this tool. Likewise for writing directly to cloud stores - handling failures and retries etc. well seems like a pretty tricky thing, which we can reasonably delegate to dedicated tools.
