Comments (10)
> Assuming we can find a practical way to deal with the large arbitrary number, another possible solution that could benefit other data fields in the Zarr VCF format is integer packing with differential coding. `numcodecs` has a bit packing class for `bool`, but we can create a similar one for integers by linking to some of Lemire's C libraries to do the packing and unpacking super fast.
For now I think we need to stick with what's available in numcodecs - adding new codecs etc. would be something done at the Zarr level, rather than something we want to get involved in at this stage. If this catches on and there's a demonstrated need for better integer packing then that could certainly be brought forward as a proposal to the wider Zarr community.
I think we can probably get a long way with what we have, though (or at least, do an awful lot better than VCF!)
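For illustration, here's a minimal sketch of what "what's available" can already do: a stock `numcodecs` `Delta` filter for differential coding in front of Blosc compression. The array, chunking, and codec settings below are made up for the example; they are not bio2zarr's actual defaults.

```python
import numpy as np
import zarr
from numcodecs import Blosc, Delta

# Smoothly varying integers compress far better once differenced.
data = np.arange(1_000_000, dtype="i4")
z = zarr.array(
    data,
    chunks=100_000,
    filters=[Delta(dtype="i4")],  # differential coding, stock numcodecs
    compressor=Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE),
)
print(z.info)  # compare "No. bytes" vs "No. bytes stored"
```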
Ah - so the negative numbers here are how vcf-zarr encodes missing and fill values.
I don't think there's any need to look at the VCF itself here, we're pretty certain that the data itself is being faithfully transcoded into zarr. The question here is whether we can use some simple combination of pre-existing Zarr filters and/or compression methods from Blosc to make compression a bit better.
Like I said above though, there's only one real fix for this specific problem with PL values and that's to use the approach described in the SVCR paper.
So, I think the time here would be better spent on figuring out how we can get better default settings for non PL fields, as discussed over in #74.
Part of the problem here is that `call_PL` contains some very large values (absurdly so, given they are log10 values), and the column gets encoded as 32-bit ints by default. Chunks are roughly 24-25 MB.
Reducing to 8-bit ints is probably quite defensible. However, just changing the dtype to `i1` here resulted in larger chunks (31-32 MB)! Setting shuffle to 0 seems to bring things down to something more sensible (17-18 MB). I've not looked at it systematically, though.
Really the only actual solution here is to encode the PL values using the "observed alleles" idea. Hopefully we can make this good enough for now, and save that one for later implementation. Although it's probably fairly straightforward, and it may be quicker to just do it and leave it as an option rather than trying to explain why our files aren't that much smaller than VCFs.
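As a rough illustration of the shuffle experiment described above (synthetic data and codec settings, not the actual bio2zarr configuration), one can compare Blosc shuffle modes on an `i1` array and look at the stored sizes:

```python
import numpy as np
import zarr
from numcodecs import Blosc

rng = np.random.default_rng(0)
# Synthetic stand-in for a (variants, samples, genotypes) PL chunk.
pl = rng.integers(0, 100, size=(1_000, 1_000, 3)).astype("i1")

for shuffle in (Blosc.BITSHUFFLE, Blosc.NOSHUFFLE):
    z = zarr.array(pl, chunks=(100, 1_000, 3),
                   compressor=Blosc(cname="zstd", clevel=7, shuffle=shuffle))
    print(shuffle, z.nbytes_stored)  # stored bytes under each shuffle mode
```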
Doing `i1` with no shuffle brings the total for PL down to 16 GB, so a 6 GB saving. But that's only a 2 GB saving over the original VCF, which isn't that exciting.
See sgkit-dev/vcf-zarr-publication#5 for discussion of the local-alleles approach described in the SVCR paper.
On the sample VCF that I'm experimenting with, I'm noticing that a lot of `call_PL` values are negative, presumably because they're outside the range of `int16`, the current default data type. I have two suggestions here:
- If possible, make the data type an unsigned integer (`uint16`), because this field does not accommodate negative values anyway (unless missing?). This should automatically clip values larger than `2**16`.
- If we want to keep `int16` nonetheless, I'd suggest clipping the `PL` values to the maximum that the integer data type allows before conversion (see the sketch below).
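A minimal sketch of that clipping suggestion (the helper name and default dtype are illustrative, not bio2zarr API): saturate values at the target dtype's limits before casting, so out-of-range sentinels clip rather than wrap around to negative numbers.

```python
import numpy as np

def clip_to_dtype(values, dtype=np.int16):
    # np.iinfo gives the representable range of the target integer dtype.
    info = np.iinfo(dtype)
    return np.clip(values, info.min, info.max).astype(dtype)

pl = np.array([0, 9, 104, 2147483647])
print(clip_to_dtype(pl))  # [    0     9   104 32767] -- no wrap-around to -1
```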
In the VCF that I'm working with, the largest recurring value that I'm observing is `2147483647`, which is the maximum value for `int32` (`2**31 - 1`). Probably capped because `cyvcf2` (and `htslib`?) are using `int32` to represent PL values. So, this is probably an arbitrary number that can be represented more compactly in another way.
The rest of the numbers are more reasonable, usually less than 100 and rarely exceeding `2**12`.
Assuming we can find a practical way to deal with the large arbitrary number, another possible solution that could benefit other data fields in the Zarr VCF format is integer packing with differential coding. `numcodecs` has a bit packing class for `bool`, but we can create a similar one for integers by linking to some of Lemire's C libraries to do the packing and unpacking super fast.
Depending on the sizes of these integers (or their differences), we could potentially losslessly store them in 6 bits or less. This may also depend on the data layout and how we compute the differences (within an individual or across individuals for the same genotype?).
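To make that concrete, here's a quick back-of-envelope check (synthetic PL triples, purely illustrative) of how many bits the raw values or their within-genotype differences would actually need:

```python
import numpy as np

def bits_needed(a):
    # Bits for the largest magnitude present, plus one sign bit.
    return int(np.ceil(np.log2(np.abs(a).max() + 1))) + 1

pl = np.array([[0, 3, 35], [0, 6, 70], [0, 9, 104]])  # typical small PL triples
diffs = np.diff(pl, axis=1)                           # differential coding
print(bits_needed(pl), bits_needed(diffs))            # 8 8 -- far below 32
```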
Happy to discuss this more.
Both of these issues (negative PLs and PL value of ~2**32) sound pretty pathological @shz9, would you mind giving a bit more detail? (PLs are phred-scaled log10 values, so anything greater than like 100 seems insanely large to me.) Where did you get the VCF from? What does the output of `vcf2zarr inspect <exploded-if>` look like?
After digging a bit more into the VCF file I have, it seems that this is an issue with missing data. I'm working with a tiny subset of 1000G data that I found on our servers (not sure how it was processed), consisting of 91 samples and ~6k variants on chr22. The first line looks like this:
./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:3,0:3:9:0,9,104 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:1,0:1:3:0,3,35 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:1,0:1:3:0,3,35 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:3,0:3:9:0,9,104 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:1,0:1:3:0,3,35 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:1,0:1:3:0,3,35 ./.:.:0,0:0:.:. 0/0:.:2,0:2:6:0,6,70 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:2,0:2:6:0,6,59 0/0:.:2,0:2:6:0,6,70 ./.:.:0,0:0:.:. 0/0:.:1,0:1:3:0,3,35 ./.:.:0,0:0:.:. 0/0:.:2,0:2:6:0,6,70 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:1,0:1:3:0,3,35 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:.
Mostly missing data, which, for some reason, `cyvcf2` just sets to the largest possible integer value:
In [8]: for i, rec in enumerate(VCF("mini.vcf.gz")):
...: print(rec.gt_phred_ll_homref)
...: print(rec.gt_phred_ll_het)
...: print(rec.gt_phred_ll_homalt)
...: break
...:
[2147483647 2147483647 2147483647 2147483647 0 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 0 2147483647
2147483647 2147483647 2147483647 0 2147483647 2147483647
2147483647 2147483647 0 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 0
2147483647 2147483647 2147483647 0 2147483647 0
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 0 0 2147483647
0 2147483647 0 2147483647 2147483647 2147483647
0 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647]
...
When this gets translated to Zarr, the large values become negative:
In [1]: import zarr
In [2]: z = zarr.open("tmp/sample.zarr/")
In [4]: z.call_PL[:]
Out[4]:
array([[[ -1, -2, -2, ..., -2, -2, -2],
[ -1, -2, -2, ..., -2, -2, -2],
[ -1, -2, -2, ..., -2, -2, -2],
...,
[ -1, -2, -2, ..., -2, -2, -2],
[ -1, -2, -2, ..., -2, -2, -2],
[ -1, -2, -2, ..., -2, -2, -2]],
...
Again, this is a consequence of using the `int16` data type. If I cast `2147483647` to `int16` I get `-1`.
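For what it's worth, the wrap-around is easy to reproduce directly (a two-line check, nothing bio2zarr-specific):

```python
import numpy as np

# Casting the int32 sentinel down to int16 keeps only the low 16 bits,
# which are all ones, i.e. -1 in two's complement.
print(np.array([2147483647], dtype=np.int32).astype(np.int16))  # [-1]
```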
`inspect` didn't show me anything unusual about the `call_PL` array:
name type chunks size compressed max_n min_val max_val
------------------------ ------- -------- --------- ------------ ------- --------- ---------
CHROM String 4 57.34 KB 915 bytes 1 n/a n/a
POS Integer 4 57.34 KB 9.53 KB 1 1.1e+07 1.1e+07
QUAL Float 4 57.34 KB 30.36 KB 1 30 3.7e+06
ID String 4 8.19 KB 234 bytes 0 n/a n/a
FILTERS String 4 65.54 KB 902 bytes 1 n/a n/a
REF String 4 57.34 KB 13.48 KB 1 n/a n/a
ALT String 4 62.76 KB 18.92 KB 6 n/a n/a
FORMAT/AB Float 4 233.82 KB 32.1 KB 1 n/a n/a
FORMAT/AD Integer 4 464.12 KB 526.93 KB 7 0 1.2e+02
FORMAT/DP Integer 4 251.9 KB 392.14 KB 1 0 1.4e+02
FORMAT/GQ Integer 4 251.9 KB 418.42 KB 1 0 99
FORMAT/GT Integer 4 345.09 KB 53.48 KB 3 -1 6
FORMAT/MIN_DP Integer 4 8.19 KB 234 bytes 0 n/a n/a
FORMAT/MQ0 Integer 4 8.19 KB 234 bytes 0 n/a n/a
FORMAT/PGT String 4 186.27 KB 10.27 KB 1 n/a n/a
FORMAT/PID String 4 186.27 KB 13.54 KB 1 n/a n/a
FORMAT/PL Integer 4 728.02 KB 1.22 MB 28 0 8e+03
...
Closing in favour of #185 - there's no point in trying to make PL fields slightly better here; the local-alleles approach is the right thing to do.