hg19 about yleaf HOT 7 CLOSED

genid commented on July 25, 2024
hg19

from yleaf.

Comments (7)

bramvanwersch commented on July 25, 2024

Hey,

We do not expect this to make much of a difference. All the testing has been done using hg19 downloaded from UCSC.

Regarding the ModuleNotFoundError, this is a specific issue with conda and python 3.7. It does not influence how the code runs but it is a bit annoying. You can get rid of it by updating setuptools: pip install -U pip setuptools.

npsonis commented on July 25, 2024

UCSC hg19 has different coordinates than GRCh37.
So, if I have mapped my data to the latter, shouldn't I have issues when using those BAM files?

RandyHarr commented on July 25, 2024

UCSC hg19 has different coordinates than GRCh37. So, if I have mapped my data to the latter, shouldn't I have issues when using those BAM files?

hg19 and GRCh37 had better not have different coordinates. They are the same model. Only the values of some base pairs differ, not the coordinates (at least not in the primary chromosomes). The original, first-released hg19 does have a Yoruba mitochondrial model, but its later release and all other Build 37 releases use rCRS. You can see my doc on Reference Models (https://bit.ly/34CO0vj) and its attached spreadsheet for more info (https://bit.ly/2ZmYPAg). In the spreadsheet, pay particular attention to the third tab, which has the md5 checksums of each chromosome (similar to those used in CRAM processing). The second tab has the lengths of the primary chromosomes in each model. hg19 and GRCh37 differ in their checksums and even in the number of N's in each chromosome, but they have identical lengths and the same coordinate system. (Note: as stated in the document, to avoid confusion, what I suspect you are calling GRCh37 models are labeled EBI37.)

Note that GRCh models also have various patch versions. These never change the coordinates of the chromosomes either. They only add or modify the extra contigs, in some cases to offer fixes or improvements to the chromosomes, but as an add-on piece of information that does not change the base model coordinate system.
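As a rough illustration of the length and checksum comparison described above, here is a minimal Python sketch. It assumes the CRAM-style convention of hashing the uppercased sequence with whitespace removed; the function name `fasta_stats` and the two toy records are made up for this example and are not taken from the spreadsheet or from yleaf.

```python
import hashlib
import io


def fasta_stats(handle):
    """Yield (name, length, md5 hex digest) for each record in a FASTA stream.

    The digest is computed over the uppercased sequence with line breaks
    removed, which is the convention assumed here for CRAM-style checksums.
    """
    name, length, digest = None, 0, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length, digest.hexdigest()
            # Keep only the first word of the header as the sequence name.
            name = line[1:].split()[0]
            length, digest = 0, hashlib.md5()
        elif name is not None:
            seq = line.upper()
            length += len(seq)
            digest.update(seq.encode("ascii"))
    if name is not None:
        yield name, length, digest.hexdigest()


# Two toy "chromosomes": same length, different content (N's vs. real bases).
toy_a = ">chrY\nACGTNNNN\nacgt\n"
toy_b = ">Y\nACGTACGT\nACGT\n"

stats_a = list(fasta_stats(io.StringIO(toy_a)))
stats_b = list(fasta_stats(io.StringIO(toy_b)))

for name, length, md5 in stats_a + stats_b:
    print(name, length, md5)
```

Both toy records report a length of 12, so a position in one refers to the same offset in the other, while their checksums differ because the bases differ. To run this on a real reference, open the FASTA file and pass the file object to `fasta_stats`.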

npsonis commented on July 25, 2024

Thanks for the detailed reply. I am aware of their content differences. I had the impression that they also have a different coordinate system (1- vs 0-based), as stated here: https://www.ogc.ox.ac.uk/guide-reference-genome-selection/

So, UCSC uses a different coordinate system only in their genome browser (https://genome-blog.soe.ucsc.edu/blog/2016/12/12/the-ucsc-genome-browser-coordinate-counting-systems/), but their actual reference genome file available for download, hg19, has the same coordinates as GRCh37.

Is that right?

Thanks again.

You may close this upon agreement.

RandyHarr commented on July 25, 2024

Excellent point. And sorry if the first response did not answer your question in that context. Those are great references. I will try and clarify this in my document. Thanks.

The FASTA itself has no coordinate system. Look at it internally: it is just a sequence of base-pair values. Reference assemblies hg19 (UCSC) and hs37d5 (1000 Genomes Project), or one of the other two classes, are all identical in being just a sequence of base pairs with no coordinate system. All reference models, in their base definition, are simply FASTAs.
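To make that concrete, here is a small Python sketch with a toy FASTA string. The helpers `parse_fasta` and `base_at` are hypothetical names for this example. The point is only that the sequence is a plain string, and the reader decides whether the first base is called position 1 or position 0.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {sequence name: sequence string}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def base_at(seq, pos, one_based=True):
    """Return the base at `pos`, interpreting it as 1-based or 0-based."""
    index = pos - 1 if one_based else pos
    if not 0 <= index < len(seq):
        raise IndexError(f"position {pos} is outside the sequence")
    return seq[index]


toy = ">chrY\nACGT\nTTGA\n"      # toy data, not a real reference
seq = parse_fasta(toy)["chrY"]   # "ACGTTTGA"

print(base_at(seq, 1))                   # first base under a 1-based convention
print(base_at(seq, 1, one_based=False))  # second base under a 0-based convention
```

The same number, 1, picks out different bases depending on the convention. The convention belongs to the annotation or tool, not to the FASTA.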

A coordinate system is implied when trying to define an annotation with reference to the original FASTA. An annotation may be, for example, the definition of a SNP. VCFs are 1-based, and the named SNPs used in Y phylogeny tend to be defined in relation to those files. I tend to see 0-based coordinates only in range systems like BED files. (BAM files are also 0-based, but that is another discussion.)
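For reference, the usual conversion between the two conventions can be sketched in Python as follows. It assumes the standard definitions: a VCF position is 1-based, and a BED interval is 0-based with an exclusive end. The function names are made up for this example.

```python
def vcf_pos_to_bed(pos):
    """Convert a 1-based single-base position to a 0-based, half-open (start, end)."""
    if pos < 1:
        raise ValueError("1-based positions start at 1")
    return pos - 1, pos


def bed_to_one_based(start, end):
    """Convert a 0-based, half-open interval to a 1-based, closed (first, last) range."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


print(vcf_pos_to_bed(100))       # (99, 100)
print(bed_to_one_based(99, 100))  # (100, 100)
```

A single-base SNP at 1-based position 100 becomes the BED interval 99 to 100. Converted back, it is the closed range 100 to 100, with the same number on both ends.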

Look at yBrowse.org, which has become the de facto common definition location for the named Y variants used in haplogroup phylogenetic tree work. Enter any SNP into yBrowse and see its definition of the range location for that SNP. The same number is given in both entries of the range for a single base-pair SNP. Click on the SNP to get a similar single location. yFull is consistent with yBrowse in the locations defined for the same "named" SNP, and so is also 1-based. If you look in yleaf's definition tables (e.g. data/hg19/new_positions.txt), they give single values and not ranges. Those tables are consistent with yBrowse.

So it comes down to consistency. yLeaf appears to be consistently using a 1-based system. Likely because its tables are based on comparing / annotating a VCF with named SNPs as retrieved from yFull (in release 3.1). When yLeaf takes in FASTQs or BAMs, it is internally extracting a VCF before performing the common look-up.

Obviously, Bram can answer for his tool yLeaf himself. But this is my take on it. (FYI, you can easily confirm this with our Python tool WGS Extract. It allows you to easily generate BAMs aligned to the various reference model files, from which you can then generate annotated VCFs and run them through yLeaf. We still only use yLeaf 2.2 if you do the Y analysis there, but hope to get 3.1 incorporated soon. Start with a Y-only BAM, which you can generate there, as that is much quicker to realign and accurate enough for this purpose.)

npsonis commented on July 25, 2024

Thank you again for the explanation. I appreciate it.

bramvanwersch commented on July 25, 2024

My apologies for the late reply.

Thank you @RandyHarr for your explanation. I have nothing to add to it. It seems the issues have been resolved. So I will close this issue.
