
Comments (4)

jeromekelleher commented on June 25, 2024

Type inference for these sub-columns would be an issue also. Hopefully the output of various annotation programs will be fairly dependable, and we can bake in lookup tables, defaulting to String if not known.
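A minimal sketch of what such a baked-in lookup table could look like. The names `KNOWN_SUBFIELD_TYPES` and `infer_subfield_type` are hypothetical and not part of bio2zarr, and the type assignments are illustrative assumptions. In particular, `Distance` is assumed to be an integer-valued ANN sub-field; it does not appear in the truncated header quoted in this thread.

```python
# Hypothetical lookup table keyed by (INFO field, sub-field name).
# The type assignments here are assumptions for illustration only.
KNOWN_SUBFIELD_TYPES = {
    ("ANN", "Allele"): "String",
    ("ANN", "Annotation_Impact"): "String",
    ("ANN", "Distance"): "Integer",  # assumed sub-field and type
}


def infer_subfield_type(info_field, subfield):
    """Return the VCF-style type of a sub-column, defaulting to String."""
    return KNOWN_SUBFIELD_TYPES.get((info_field, subfield), "String")


known_type = infer_subfield_type("ANN", "Distance")
unknown_type = infer_subfield_type("ANN", "Some_Unlisted_Field")
```

Anything missing from the table falls through to `String`, so output from an unfamiliar annotation program is still stored, just without a narrower type.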

from bio2zarr.

jeromekelleher commented on June 25, 2024

There's a basic question here about whether this should be done in vcf2zarr as part of the VCF conversion process, or whether we should post-process some VCF columns that have been stored as Zarr arrays to extract annotations. I'm inclined to go with parsing the Zarr arrays, perhaps as something like

vcf2zarr extract-annotations <ZARR DIR> 

It would look for some known annotation INFO fields (like variant_ANN etc above) and do the necessary thing to extract the required Zarr columns.

Re naming these, the simplest thing is to do something like variant_ANN_Allele, etc., i.e., follow the nested naming.
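As a rough sketch of the nested naming, the sub-field names could be derived from the INFO field's description string. The function name `subfield_array_names` is hypothetical, and the description below is a shortened, made-up version of an ANN header, used only for illustration.

```python
def subfield_array_names(array_name, description):
    """Build nested array names such as variant_ANN_Allele.

    Assumes the description has the form "<text>: 'A | B | C'".
    """
    # Keep only the part after the first colon, then drop the quotes.
    _, _, fields = description.partition(":")
    fields = fields.strip().strip("'\"")
    return [f"{array_name}_{name.strip()}" for name in fields.split("|")]


# Shortened, illustrative description (not the full header).
description = "Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name'"
names = subfield_array_names("variant_ANN", description)
# names[0] is "variant_ANN_Allele"
```

Sub-field names containing spaces or other awkward characters would probably need sanitising before being used as array names; that is not handled here.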


jeromekelleher commented on June 25, 2024

This is not straightforward... Looking at an example from recent 1000 Genomes data, we have

<zarr.core.Array '/variant_ANN' (96475, 18) object>
Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcri>
[['A|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60070G>A||||||'
  '' '' ... '' '' '']
 ['C|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60083T>C||||||'
  '' '' ... '' '' '']
 ['C|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60114T>C||||||'
  '' '' ... '' '' '']
 ...
 ['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35483G>A|||||4306>
  'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2152861G>A||||||'
  '' ... '' '' '']
 ['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35600T>A|||||4423>
  'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2152978T>A||||||'
  '' ... '' '' '']
 ['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35631T>A|||||4454>
  'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2153009T>A||||||'
  '' ... '' '' '']]

So, the ANN column is 2D, with (it looks like) a maximum of 18 annotations for a given variant in this set. Each of these annotations is a pipe-separated list of mostly string data. So, we could separate this out into ~15 arrays of dimension (variants, 18) but I don't think there's much point. It's not going to map well to the Zarr model (because of all the strings).
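For reference, the split itself is simple to sketch with NumPy. The function `split_annotations` is a hypothetical illustration, not bio2zarr code. It assumes the array has already been read into memory as an object array, that empty strings are padding, and that the number of sub-fields is known. The first row reuses one annotation string from the output above.

```python
import numpy as np


def split_annotations(ann, num_fields):
    """Split a (variants, max_annotations) array of pipe-separated strings
    into a (variants, max_annotations, num_fields) object array."""
    out = np.full(ann.shape + (num_fields,), "", dtype=object)
    for i in range(ann.shape[0]):
        for j in range(ann.shape[1]):
            value = ann[i, j]
            if value:  # empty strings are treated as padding
                parts = value.split("|")[:num_fields]
                out[i, j, : len(parts)] = parts
    return out


ann = np.array(
    [
        [
            "A|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|"
            "intergenic_region|ENSG00000178591|||n.60070G>A||||||",
            "",
        ],
        ["", ""],
    ],
    dtype=object,
)
fields = split_annotations(ann, 16)
allele = fields[:, :, 0]  # would correspond to something like variant_ANN_Allele
```

Each slice `fields[:, :, k]` is one of the per-field arrays of dimension (variants, max_annotations). Most entries are empty strings, which is the poor fit with the Zarr model described above.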

I think this is a place where integrating with a different technology designed for handling sparse string data is the right approach.


jeromekelleher commented on June 25, 2024

Going to close this as a "wontfix" as it's out of scope for the moment.

