Variant-level annotations are often included as INFO tags with substructure, e.g.

Parse annotations into separate columns (bio2zarr issue, closed)

jeromekelleher commented on June 25, 2024

Parse annotations into separate columns


Comments (4)

jeromekelleher commented on June 25, 2024

Type inference for these sub-columns would be an issue also. Hopefully the output of various annotation programs will be fairly dependable, and we can bake in lookup tables, defaulting to String if not known.
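A minimal sketch of the lookup-table idea, in plain Python. The table contents are assumptions for illustration only: `Distance` being integer-valued is a guess about the annotation format and is not confirmed in this thread, and `subfield_type` and `convert` are hypothetical helper names, not bio2zarr functions.

```python
# Hypothetical lookup table: annotation sub-field name -> Python type.
# The entries are illustrative assumptions, not a vetted specification.
KNOWN_SUBFIELD_TYPES = {
    "Allele": str,
    "Annotation_Impact": str,
    "Distance": int,  # assumed integer-valued; not confirmed in this thread
}


def subfield_type(name):
    """Return the type recorded for a sub-field, defaulting to str if unknown."""
    return KNOWN_SUBFIELD_TYPES.get(name, str)


def convert(name, raw):
    """Convert one raw sub-field value; empty strings are treated as missing."""
    if raw == "":
        return None
    return subfield_type(name)(raw)
```

Anything not in the table stays a string, so an unrecognised annotation program would still convert, just without typed columns.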


jeromekelleher commented on June 25, 2024

There's a basic question here about whether this should be done in vcf2zarr as part of the VCF conversion process, or whether we should post-process some VCF columns that have been stored as Zarr arrays to extract annotations. I'm inclined to go with parsing the Zarr arrays, perhaps as something like

vcf2zarr extract-annotations <ZARR DIR>

It would look for some known annotation INFO fields (like variant_ANN etc above) and do the necessary thing to extract the required Zarr columns.

Re naming these, the simplest thing is to do something like variant_ANN_Allele, etc., i.e., follow the nested naming.
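As a rough sketch of the nested naming, assuming the sub-field names can be read from a field description of the form `...: 'A | B | C'` with both quotes present. The helper names are hypothetical and the description string is shortened to four sub-fields for illustration; `extract-annotations` itself is only a proposal here and is not implemented by this code.

```python
def subfield_names(description):
    """Extract sub-field names from a description like "...: 'A | B | C'".

    Assumes the pipe-separated list is enclosed in single quotes.
    """
    start = description.index("'") + 1
    end = description.rindex("'")
    return [part.strip() for part in description[start:end].split("|")]


def nested_array_names(parent, description):
    """Build names such as variant_ANN_Allele by prefixing the parent array name."""
    return [f"{parent}_{name}" for name in subfield_names(description)]


# Shortened description, for illustration only.
desc = "Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name'"
print(nested_array_names("variant_ANN", desc))
# ['variant_ANN_Allele', 'variant_ANN_Annotation',
#  'variant_ANN_Annotation_Impact', 'variant_ANN_Gene_Name']
```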


jeromekelleher commented on June 25, 2024

This is not straightforward... Looking at an example from recent 1000 Genomes data, we have

<zarr.core.Array '/variant_ANN' (96475, 18) object>
Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcri>
[['A|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60070G>A||||||'
  '' '' ... '' '' '']
 ['C|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60083T>C||||||'
  '' '' ... '' '' '']
 ['C|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60114T>C||||||'
  '' '' ... '' '' '']
 ...
 ['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35483G>A|||||4306>
  'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2152861G>A||||||'
  '' ... '' '' '']
 ['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35600T>A|||||4423>
  'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2152978T>A||||||'
  '' ... '' '' '']
 ['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35631T>A|||||4454>
  'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2153009T>A||||||'
  '' ... '' '' '']]

So, the ANN column is 2D, with (it looks like) a maximum of 18 annotations for a given variant in this set. Each of these annotations is a pipe-separated list of mostly string data. So, we could separate this out into ~15 arrays of dimension (variants, 18) but I don't think there's much point. It's not going to map well to the Zarr model (because of all the strings).
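To make the "separate arrays" option concrete, here is a small sketch using nested lists in place of Zarr arrays. The field list and the annotation strings are shortened to four sub-fields for illustration, and `split_annotations` is a hypothetical helper. Each output grid has the same padded shape as the input, which is exactly why most of the result ends up as empty strings.

```python
# Shortened field list and data, for illustration only.
FIELDS = ["Allele", "Annotation", "Annotation_Impact", "Gene_Name"]

ann = [
    ["A|intergenic_region|MODIFIER|DEFB125", "", ""],
    ["A|downstream_gene_variant|MODIFIER|STK35",
     "A|intragenic_variant|MODIFIER|STK35", ""],
]


def split_annotations(ann, fields):
    """Return {field: grid}, where each grid has the same shape as ann."""
    columns = {name: [[""] * len(row) for row in ann] for name in fields}
    for i, row in enumerate(ann):
        for j, entry in enumerate(row):
            if entry == "":
                continue  # padding slot: leave every sub-field empty
            for name, value in zip(fields, entry.split("|")):
                columns[name][i][j] = value
    return columns


columns = split_annotations(ann, FIELDS)
print(columns["Annotation"][1])
# ['downstream_gene_variant', 'intragenic_variant', '']
```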

I think this is a place where integrating with a different technology designed for handling sparse string data is the right approach.
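For a sense of what a sparse representation buys, the sketch below flattens a padded grid into one record per annotation that is actually present. It is plain Python and not tied to any particular storage technology (none is named in this thread); the data is shortened and the helper names are hypothetical.

```python
# Shortened data, for illustration only: two variants, padded to three slots.
ann = [
    ["A|intergenic_region|MODIFIER|DEFB125", "", ""],
    ["A|downstream_gene_variant|MODIFIER|STK35",
     "A|intragenic_variant|MODIFIER|STK35", ""],
]


def to_records(ann):
    """Flatten a padded grid into (variant_index, annotation_index, text) tuples."""
    return [
        (i, j, entry)
        for i, row in enumerate(ann)
        for j, entry in enumerate(row)
        if entry != ""
    ]


def fill_fraction(ann):
    """Fraction of slots in the padded grid that hold a real annotation."""
    total = sum(len(row) for row in ann)
    used = sum(1 for row in ann for entry in row if entry != "")
    return used / total if total else 0.0


records = to_records(ann)
print(len(records), fill_fraction(ann))
# 3 0.5
```

With a maximum of 18 slots per variant and most variants using only one or two, the fill fraction on real data would presumably be much lower than in this toy example.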


jeromekelleher commented on June 25, 2024

Going to close this as a "wontfix" as it's out of scope for the moment.
