Comments (10)
> Assuming we can find a practical way to deal with the large arbitrary number, another possible solution that could benefit other data fields in the Zarr VCF format is integer packing with differential coding. `numcodecs` has a bit packing class for `bool`, but we can create a similar one for integers by linking to some of Lemire's C libraries to do the packing and unpacking super fast.
For now I think we need to stick with what's available in numcodecs - adding new codecs etc. would be something done at the Zarr level, rather than something we want to get involved in at this stage. If this catches on and there's a demonstrated need for better integer packing then that could certainly be brought forward as a proposal to the wider Zarr community.
I think we can probably get a long way with what we have, though (or at least, do an awful lot better than VCF!)
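For illustration, here's a minimal sketch of what "what's available" can already do: a stock `numcodecs` `Delta` filter for differential coding in front of Blosc compression. The array, chunking, and codec settings below are made up for the example; they are not bio2zarr's actual defaults.

```python
import numpy as np
import zarr
from numcodecs import Blosc, Delta

# Smoothly varying integers compress far better once differenced.
data = np.arange(1_000_000, dtype="i4")
z = zarr.array(
    data,
    chunks=100_000,
    filters=[Delta(dtype="i4")],  # differential coding, stock numcodecs
    compressor=Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE),
)
print(z.info)  # compare "No. bytes" vs "No. bytes stored"
```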
Ah - so the negative numbers here are how vcf-zarr encodes missing and fill values.
I don't think there's any need to look at the VCF itself here, we're pretty certain that the data itself is being faithfully transcoded into zarr. The question here is whether we can use some simple combination of pre-existing Zarr filters and/or compression methods from Blosc to make compression a bit better.
Like I said above though, there's only one real fix for this specific problem with PL values and that's to use the approach described in the SVCR paper.
So, I think the time here would be better spent on figuring out how we can get better default settings for non PL fields, as discussed over in #74.
Part of the problem here is that `call_PL` contains some very large values (absurdly so, given they are log10 values), and the column gets encoded as 32-bit ints by default. Chunks are roughly 24-25 MB.
Reducing to 8-bit ints is probably quite defensible. However, just changing the dtype to `i1` here resulted in larger chunks (31-32 MB)! Setting shuffle to 0 seems to bring things down to something more sensible (17-18 MB). I've not looked at it systematically, though.
Really the only actual solution here is to encode the PL values using the "observed alleles" idea. Hopefully we can make this good enough for now, and save that one for later implementation. Although it's probably fairly straightforward, and it may be quicker to just do it and leave it as an option rather than trying to explain why our files aren't that much smaller than VCFs.
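As a rough illustration of the shuffle experiment described above (synthetic data and codec settings, not the actual bio2zarr configuration), one can compare Blosc shuffle modes on an `i1` array and look at the stored sizes:

```python
import numpy as np
import zarr
from numcodecs import Blosc

rng = np.random.default_rng(0)
# Synthetic stand-in for a (variants, samples, genotypes) PL chunk.
pl = rng.integers(0, 100, size=(1_000, 1_000, 3)).astype("i1")

for shuffle in (Blosc.BITSHUFFLE, Blosc.NOSHUFFLE):
    z = zarr.array(pl, chunks=(100, 1_000, 3),
                   compressor=Blosc(cname="zstd", clevel=7, shuffle=shuffle))
    print(shuffle, z.nbytes_stored)  # stored bytes under each shuffle mode
```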
Doing `i1` with no shuffle brings the total for PL down to 16 GB, so a 6 GB saving. But that's only a 2 GB saving over the original VCF, which isn't that exciting.
See sgkit-dev/vcf-zarr-publication#5 for discussion of the local-alleles approach described in the SVCR paper.
On the sample VCF that I'm experimenting with, I'm noticing that a lot of `call_PL` values are negative, presumably because they're outside the range of `int16`, the current default data type. I have two suggestions here:
- If possible, make the data type an unsigned integer (`uint16`), because this field does not accommodate negative values anyway (unless missing?). This should automatically clip values larger than `2**16`.
- If we want to keep `int16` nonetheless, I'd suggest clipping the `PL` values to the maximum that the integer data type allows before conversion (see the sketch below).
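A minimal sketch of that clipping suggestion (the helper name and default dtype are illustrative, not bio2zarr API): saturate values at the target dtype's limits before casting, so out-of-range sentinels clip rather than wrap around to negative numbers.

```python
import numpy as np

def clip_to_dtype(values, dtype=np.int16):
    # np.iinfo gives the representable range of the target integer dtype.
    info = np.iinfo(dtype)
    return np.clip(values, info.min, info.max).astype(dtype)

pl = np.array([0, 9, 104, 2147483647])
print(clip_to_dtype(pl))  # [    0     9   104 32767] -- no wrap-around to -1
```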
In the VCF that I'm working with, the largest recurring value that I'm observing is `2147483647`, which is the maximum value for `int32` (`2**31 - 1`). Probably capped because `cyvcf2` (and `htslib`?) are using `int32` to represent PL values. So, this is probably an arbitrary number that can be represented more compactly in another way.
The rest of the numbers are more reasonable, usually less than 100 and rarely exceeding `2**12`.
Assuming we can find a practical way to deal with the large arbitrary number, another possible solution that could benefit other data fields in the Zarr VCF format is integer packing with differential coding. `numcodecs` has a bit packing class for `bool`, but we can create a similar one for integers by linking to some of Lemire's C libraries to do the packing and unpacking super fast.
Depending on the sizes of these integers (or their differences), we could potentially losslessly store them in 6 bits or less. This may also depend on the data layout and how we compute the differences (within an individual or across individuals for the same genotype?).
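To make that concrete, here's a quick back-of-envelope check (synthetic PL triples, purely illustrative) of how many bits the raw values or their within-genotype differences would actually need:

```python
import numpy as np

def bits_needed(a):
    # Bits for the largest magnitude present, plus one sign bit.
    return int(np.ceil(np.log2(np.abs(a).max() + 1))) + 1

pl = np.array([[0, 3, 35], [0, 6, 70], [0, 9, 104]])  # typical small PL triples
diffs = np.diff(pl, axis=1)                           # differential coding
print(bits_needed(pl), bits_needed(diffs))            # 8 8 -- far below 32
```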
Happy to discuss this more.
Both of these issues (negative PLs and PL value of ~2**32) sound pretty pathological @shz9, would you mind giving a bit more detail? (PLs are phred-scaled log10 values, so anything greater than like 100 seems insanely large to me.) Where did you get the VCF from? What does the output of `vcf2zarr inspect <exploded-if>` look like?
After digging a bit more into the VCF file I have, it seems that this is an issue with missing data. I'm working with a tiny subset of 1000G data that I found on our servers (not sure how it was processed), consisting of 91 samples and ~6k variants on chr22. The first line looks like this:
./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:3,0:3:9:0,9,104 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:1,0:1:3:0,3,35 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:1,0:1:3:0,3,35 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:3,0:3:9:0,9,104 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:1,0:1:3:0,3,35 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:1,0:1:3:0,3,35 ./.:.:0,0:0:.:. 0/0:.:2,0:2:6:0,6,70 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:2,0:2:6:0,6,59 0/0:.:2,0:2:6:0,6,70 ./.:.:0,0:0:.:. 0/0:.:1,0:1:3:0,3,35 ./.:.:0,0:0:.:. 0/0:.:2,0:2:6:0,6,70 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. 0/0:.:1,0:1:3:0,3,35 ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:. ./.:.:0,0:0:.:.
Mostly missing data, which, for some reason, `cyvcf2` just sets to the largest possible integer value:
In [8]: for i, rec in enumerate(VCF("mini.vcf.gz")):
...: print(rec.gt_phred_ll_homref)
...: print(rec.gt_phred_ll_het)
...: print(rec.gt_phred_ll_homalt)
...: break
...:
[2147483647 2147483647 2147483647 2147483647 0 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 0 2147483647
2147483647 2147483647 2147483647 0 2147483647 2147483647
2147483647 2147483647 0 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 0
2147483647 2147483647 2147483647 0 2147483647 0
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 0 0 2147483647
0 2147483647 0 2147483647 2147483647 2147483647
0 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647]
...
When this gets translated to Zarr, the large values become negative:
In [1]: import zarr
In [2]: z = zarr.open("tmp/sample.zarr/")
In [4]: z.call_PL[:]
Out[4]:
array([[[ -1, -2, -2, ..., -2, -2, -2],
[ -1, -2, -2, ..., -2, -2, -2],
[ -1, -2, -2, ..., -2, -2, -2],
...,
[ -1, -2, -2, ..., -2, -2, -2],
[ -1, -2, -2, ..., -2, -2, -2],
[ -1, -2, -2, ..., -2, -2, -2]],
...
Again, this is a consequence of using the `int16` data type. If I cast `2147483647` to `int16` I get `-1`.
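For what it's worth, the wrap-around is easy to reproduce directly (a two-line check, nothing bio2zarr-specific):

```python
import numpy as np

# Casting the int32 sentinel down to int16 keeps only the low 16 bits,
# which are all ones, i.e. -1 in two's complement.
print(np.array([2147483647], dtype=np.int32).astype(np.int16))  # [-1]
```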
`inspect` didn't show me anything unusual about the `call_PL` array:
name type chunks size compressed max_n min_val max_val
------------------------ ------- -------- --------- ------------ ------- --------- ---------
CHROM String 4 57.34 KB 915 bytes 1 n/a n/a
POS Integer 4 57.34 KB 9.53 KB 1 1.1e+07 1.1e+07
QUAL Float 4 57.34 KB 30.36 KB 1 30 3.7e+06
ID String 4 8.19 KB 234 bytes 0 n/a n/a
FILTERS String 4 65.54 KB 902 bytes 1 n/a n/a
REF String 4 57.34 KB 13.48 KB 1 n/a n/a
ALT String 4 62.76 KB 18.92 KB 6 n/a n/a
FORMAT/AB Float 4 233.82 KB 32.1 KB 1 n/a n/a
FORMAT/AD Integer 4 464.12 KB 526.93 KB 7 0 1.2e+02
FORMAT/DP Integer 4 251.9 KB 392.14 KB 1 0 1.4e+02
FORMAT/GQ Integer 4 251.9 KB 418.42 KB 1 0 99
FORMAT/GT Integer 4 345.09 KB 53.48 KB 3 -1 6
FORMAT/MIN_DP Integer 4 8.19 KB 234 bytes 0 n/a n/a
FORMAT/MQ0 Integer 4 8.19 KB 234 bytes 0 n/a n/a
FORMAT/PGT String 4 186.27 KB 10.27 KB 1 n/a n/a
FORMAT/PID String 4 186.27 KB 13.54 KB 1 n/a n/a
FORMAT/PL Integer 4 728.02 KB 1.22 MB 28 0 8e+03
...
Closing in favour of #185 - there's no point in trying to make PL fields slightly better here; the local-alleles approach is the right thing to do.