Comments (4)
Type inference for these sub-columns would be an issue also. Hopefully the output of various annotation programs will be fairly dependable, and we can bake in lookup tables, defaulting to String if not known.
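A minimal sketch of the lookup-table idea, assuming a plain dictionary keyed by sub-field name. The table name, the helper name and the dtype strings are all hypothetical, and the `Distance` entry is only an illustration of a non-string field rather than something taken from this thread:

```python
# Hypothetical baked-in table: annotation sub-field name -> dtype name.
# The entries are illustrative assumptions, not a vetted list.
KNOWN_SUBFIELD_DTYPES = {
    "Allele": "str",
    "Annotation": "str",
    "Annotation_Impact": "str",
    "Distance": "int32",  # assumed to be integer-valued
}


def infer_subfield_dtype(name):
    """Return the dtype name for a sub-field, defaulting to "str" if unknown."""
    return KNOWN_SUBFIELD_DTYPES.get(name, "str")


print(infer_subfield_dtype("Distance"))
print(infer_subfield_dtype("Some_Unknown_Field"))
```

Anything missing from the table falls back to a string, so an unrecognised annotation program still gets converted, just without typed columns.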
from bio2zarr.
There's a basic question here about whether this should be done in vcf2zarr as part of the VCF conversion process, or whether we should post-process some VCF columns that have been stored as Zarr arrays to extract annotations. I'm inclined to go with parsing the Zarr arrays, perhaps as something like
vcf2zarr extract-annotations <ZARR DIR>
It would look for some known annotation INFO fields (like variant_ANN, etc. above) and do the necessary thing to extract the required Zarr columns.
Re naming these, the simplest thing is to do something like variant_ANN_Allele, etc., i.e., follow the nested naming.
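As a rough illustration of the nested naming, the sub-field names could be read from the INFO field's description string and prefixed with the parent array name. The function name is hypothetical, and the description below is a shortened stand-in for a real ANN header, so treat this as a sketch under those assumptions:

```python
def subfield_array_names(description, parent="variant_ANN"):
    """Derive nested array names from a pipe-separated header description."""
    # Keep only the text between the first and last single quote, if present.
    start = description.find("'")
    end = description.rfind("'")
    if start != -1 and end > start:
        description = description[start + 1 : end]
    # Each pipe-separated token becomes one nested array name.
    return [f"{parent}_{name.strip()}" for name in description.split("|")]


# Shortened example description (assumption: real headers list more fields).
desc = "Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name'"
for name in subfield_array_names(desc):
    print(name)
```

This does no sanitising of the sub-field names, so any name containing characters that are awkward in an array name would need extra handling.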
from bio2zarr.
This is not straightforward... Looking at an example from recent 1000 Genomes data, we have
<zarr.core.Array '/variant_ANN' (96475, 18) object>
Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcri>
[['A|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60070G>A||||||'
'' '' ... '' '' '']
['C|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60083T>C||||||'
'' '' ... '' '' '']
['C|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60114T>C||||||'
'' '' ... '' '' '']
...
['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35483G>A|||||4306>
'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2152861G>A||||||'
'' ... '' '' '']
['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35600T>A|||||4423>
'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2152978T>A||||||'
'' ... '' '' '']
['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35631T>A|||||4454>
'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2153009T>A||||||'
'' ... '' '' '']]
So, the ANN column is 2D, with (it looks like) a maximum of 18 annotations for a given variant in this set. Each of these annotations is a pipe-separated list of mostly string data. So, we could separate this out into ~15 arrays of dimension (variants, 18), but I don't think there's much point. It's not going to map well to the Zarr model (because of all the strings).
I think this is a place where integrating with a different technology designed for handling sparse string data is the right approach.
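For reference, the split described above might look roughly like the following, assuming the ANN data is already in memory as a 2D NumPy object array of strings with `''` as padding. The function name and the tiny input are made up for illustration, and only the first three sub-fields are extracted:

```python
import numpy as np


def split_annotations(ann, num_fields):
    """Split a (variants, max_annotations) array of pipe-separated strings
    into num_fields object arrays of the same shape."""
    out = [np.full(ann.shape, "", dtype=object) for _ in range(num_fields)]
    num_variants, max_annotations = ann.shape
    for i in range(num_variants):
        for j in range(max_annotations):
            value = ann[i, j]
            if value == "":
                continue  # padding slot: this variant has fewer annotations
            parts = value.split("|")
            for k in range(min(num_fields, len(parts))):
                out[k][i, j] = parts[k]
    return out


# Tiny made-up input in the same layout as variant_ANN.
ann = np.array(
    [
        ["A|intergenic_region|MODIFIER", ""],
        ["A|downstream_gene_variant|MODIFIER", "A|intragenic_variant|MODIFIER"],
    ],
    dtype=object,
)
allele, annotation, impact = split_annotations(ann, 3)
# Fraction of cells that are just padding in one of the output arrays.
empty_fraction = float(np.mean(annotation == ""))
print(annotation)
print(empty_fraction)
```

Every output array keeps the full (variants, max_annotations) shape, so most cells end up as empty strings whenever a few variants carry many annotations, which is the poor fit with the Zarr model noted above.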
from bio2zarr.
Going to close this as a "wontfix" as it's out of scope for the moment.
from bio2zarr.