Hi HoneyBADGER developers, Thank you for developing this tool! I tri

showing NULL in the step of calcGexpCnvBoundaries - getting started tutorial [CLOSED]

jefworks-lab commented on July 17, 2024

showing NULL in the step of calcGexpCnvBoundaries - getting started tutorial

from honeybadger.

Comments (12)

Rongtingting commented on July 17, 2024

When I tried mart.obj <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", dataset="hsapiens_gene_ensembl"), the error went away.
I think the default version is now hg38 (useMart with host="www.ensembl.org"), but the HoneyBADGER demo dataset uses hg19 (host="grch37.ensembl.org"), and the "jul2015.archive.ensembl.org" archive may be private to the author's account.
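
A minimal sketch of the workaround described above, wrapped in a function so that Ensembl is only contacted when it is called. The helper name make_hg19_mart is hypothetical; the useMart arguments are the ones quoted in this comment. Newer biomaRt releases may expect the host as a full URL (for example with an https:// prefix); that is an assumption to check against the installed version.

```r
# Hypothetical helper: build a biomaRt mart pinned to GRCh37 (hg19),
# the assembly the HoneyBADGER demo data was aligned to.
make_hg19_mart <- function(host = "grch37.ensembl.org") {
  if (!requireNamespace("biomaRt", quietly = TRUE)) {
    stop("biomaRt is not installed")
  }
  biomaRt::useMart(
    biomart = "ENSEMBL_MART_ENSEMBL",
    host    = host,
    dataset = "hsapiens_gene_ensembl"
  )
}

# Usage (needs network access):
# mart.obj <- make_hg19_mart()
```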

biojiangke commented on July 17, 2024

I got the exact "NULL" result from default ensembl genes (and all the following warnings etc.), but the solution is not working for me. I still got "NULL" from hg19. Any other thoughts?

Rongtingting commented on July 17, 2024

Have you tried running more steps? Please have a look at the results of the calcGexpCnvBoundaries step.

biojiangke commented on July 17, 2024

If host="grch37.ensembl.org" is used for the "mart.obj" object, "calcGexpCnvBoundaries" gives a "NULL" result, and "regions.genes" is completely empty.

Rongtingting commented on July 17, 2024

Using hg19:

hb$calcGexpCnvBoundaries(init=TRUE, verbose=FALSE)
ERROR: Error: subscript contains invalid names
ERROR: Error: subscript contains invalid names
NULL

It seems to give an error again; however, it can be ignored, since the following steps still run and also produce some results (though different from the demo's results).

biojiangke commented on July 17, 2024

Is "regions.genes" empty, without any genomic intervals? Then "summarizeResults" would complain about empty results? In the end, this may not matter. Custom data may run through without problems. But, the fact that the tutorial example is not reproducible (even somewhat), is kinda disturbing.

Rongtingting commented on July 17, 2024

The regions.genes is not empty, but the tutorial example is indeed not reproducible...

print(regions.genes)
GRanges object with 4 ranges and 0 metadata columns:
      seqnames              ranges strand
  [1]    chr11     167784-77185680      *
  [2]     chr7    855528-158749438      *
  [3]     chr9 134000948-141019076      *
  [4]    chr10    320130-135187193      *
  seqinfo: 52 sequences from an unspecified genome; no seqlengths

biojiangke commented on July 17, 2024

Now it seems weirder: I got "seqinfo: 28 sequences from an unspecified genome; no seqlengths" from this step with an empty "regions.genes". Clearly there are multiple versions of reference assemblies going around here. The question is where the root is. I'd assume this comes from the mart.obj, but shouldn't we get the same "hsapiens_gene_ensembl" if we use the same "host" and "dataset" at the "useMart" step?

Rongtingting commented on July 17, 2024

Yes, I think we shouldn't get totally different output.

JEFworks commented on July 17, 2024

Dear Rongtingting,

Thank you for taking the initiative to address the issue you discovered and sharing the solution. Indeed, the data included with the package was aligned to hg19. Back when this paper was originally published and this subsequent tutorial released, biomaRt's default version was hg19 and has indeed since been updated. The exact version used for both the paper and tutorial is the assembly from July 2015! The full set of changes to the human genome since hg19vJuly2015 can be found here: http://useast.ensembl.org/info/website/archives/index.html

There are at least a few reasons why using a different genome version may produce slightly different results. One, the gene symbols/names may have changed, so a gene that is included in the built-in data can no longer be found in the new assembly. Two, the gene coordinates may have changed. This will affect the exact genomic coordinates represented by the genes and subsequently the exact genomic coordinates of the CNVs identified. Three, newer genome assemblies may also have different alternative contig names (regions.genes@seqnames that are not chromosomes 1 through 22), though this should not impact the final results, which are limited to chromosomes 1 through 22 anyway (though you may find a different number of sequences from unspecified genomes than what is noted in the tutorial).
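
The first and third points can be checked with a quick set comparison before running the model. The sketch below uses made-up toy data in base R; the gene symbols, the annot table, and its column names are all hypothetical stand-ins for the expression matrix row names and the annotation pulled from the mart, not HoneyBADGER objects.

```r
# Toy stand-ins (hypothetical): genes in the expression data, and an
# annotation table as it might come back from a given genome version.
expr_genes <- c("GENE_A", "GENE_B", "GENE_C", "GENE_D")
annot <- data.frame(
  symbol = c("GENE_A", "GENE_B", "GENE_D", "GENE_E"),
  chrom  = c("1", "HG1_PATCH", "22", "X"),
  stringsAsFactors = FALSE
)

# Reason one: symbols present in the data but absent from this annotation.
missing_genes <- setdiff(expr_genes, annot$symbol)

# Reason three: keep only annotated genes on chromosomes 1 through 22,
# dropping alternative contigs and sex chromosomes.
autosomes <- as.character(1:22)
kept <- annot[annot$symbol %in% expr_genes & annot$chrom %in% autosomes, ]

print(missing_genes)
print(kept$symbol)
```

A large missing_genes set would suggest the annotation does not match the assembly the data was aligned to.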

The version of JAGS and the seed used for your runs may also play a minor role, since HMMs are stochastic after all. You should also double-check that JAGS is installed and running correctly, since it is external to the R environment.
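
A small sketch of such a check, assuming the rjags package is the bridge to JAGS. The helper name check_jags is hypothetical. Loading rjags normally fails when the external JAGS library cannot be found, so a failed load is reported as a problem here. Whether set.seed() fully determines the sampler output depends on how the model is initialized, so treat it as a best-effort step rather than a guarantee.

```r
# Hypothetical helper: report whether rjags (and the JAGS library it
# links against) can be loaded, and which JAGS version it sees.
check_jags <- function() {
  if (!requireNamespace("rjags", quietly = TRUE)) {
    return("rjags could not be loaded (is JAGS installed?)")
  }
  paste("JAGS library version:", as.character(rjags::jags.version()))
}

set.seed(0)  # best-effort reproducibility for the R side of the run
message(check_jags())
```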

All of this may affect the exact coordinates of the CNVs identified, in particular before retestIdentifiedCnvs is used to filter out spurious or low-confidence CNVs. However, the final set of identified CNVs on chromosomes 5, 7, 20, 10, 13, and 14 should be reproducible, especially if you are able to reproduce the figure from hb$plotGexpProfile().

The tutorial is compiled from the Rmarkdown under https://github.com/JEFworks-Lab/HoneyBADGER/blob/master/vignettes/Getting_Started.Rmd in case you would like to recompile it from there instead of copying and pasting from the tutorial.

Hope that clarifies some things.

Stay healthy and safe,
Prof. Jean Fan

biojiangke commented on July 17, 2024

Since "jul2015.archive.ensembl.org" is not available at this point (the archive used in the Rmd), the closest ones on ensembl archive list are "may2015" and "sep2015" archives. I tried both and "sep2015" is generating the closest results to the tutorial. With a few snags, the tutorial will run up to the "using allele information" part, which I haven't tested yet. Indeed, the "amps" on chr5, 7, 20, and "dels" on chr10, 13, 14 showed up (with some extra "dels" on chr6, 9, 11). One minor suggestion might be to update the documentation with some specific information about which ensembl archive(s) might generate similar results, because the original "jul2015" is not accessible now.
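
One way to script this trial-and-error is to try the candidate archives in order and keep the first one that connects. This is only a sketch: archive_host and first_working_mart are hypothetical helpers, the archive tags are the two mentioned above, and the connect argument exists so the fallback logic can be exercised without network access. Newer biomaRt versions may need an https:// prefix on the host.

```r
# Hypothetical helper: turn an archive tag into an Ensembl archive host.
archive_host <- function(tag) paste0(tag, ".archive.ensembl.org")

# Hypothetical helper: return the first archive that connects, or NULL.
first_working_mart <- function(tags = c("sep2015", "may2015"),
                               connect = NULL) {
  if (is.null(connect)) {
    # Default connector: a real biomaRt connection (needs network).
    connect <- function(host) {
      biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                       host = host,
                       dataset = "hsapiens_gene_ensembl")
    }
  }
  for (tag in tags) {
    host <- archive_host(tag)
    mart <- tryCatch(connect(host), error = function(e) NULL)
    if (!is.null(mart)) {
      return(list(host = host, mart = mart))
    }
  }
  NULL
}

# Usage (needs network access):
# res <- first_working_mart()
# mart.obj <- res$mart
```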

Rongtingting commented on July 17, 2024

Dear Prof. Fan,

Thank you for taking the time to help us with this issue. Yes, different versions of rjags might cause small differences during the sampling.

With the demo data provided by the package, both the expression and the allele info parts can be run following the getting started tutorial.

However, I found that the last step, which combines the expression and allele information, does not produce results! Could you give me some instructions on how to figure it out? (The log of the last step is attached.)

Thanks a lot for your time!!!

hb$retestIdentifiedCnvs(retestBoundGenes=TRUE, retestBoundSnps=TRUE, verbose=FALSE)
WARNING! ONLY 9 SNPS IN REGION!
WARNING! ONLY 2 SNPS IN REGION!
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 30095
Unobserved stochastic nodes: 37372
Total graph size: 431029

Initializing model

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
|| 100%
|| 100%
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 46958
Unobserved stochastic nodes: 64285
Total graph size: 754590

Initializing model

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
|| 100%
|| 100%
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 3548
Unobserved stochastic nodes: 4150
Total graph size: 27465

Initializing model

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
|| 100%
|| 100%
ERROR! ONLY 1 GENES IN REGION!
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 13842
Unobserved stochastic nodes: 10094
Total graph size: 69120

Initializing model

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
|| 100%
|| 100%
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 20082
Unobserved stochastic nodes: 13934
Total graph size: 102837

Initializing model

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
|| 100%
|| 100%
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 8578
Unobserved stochastic nodes: 7005
Total graph size: 40610

Initializing model

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
|| 100%
|| 100%

results <- hb$summarizeResults(geneBased=TRUE, alleleBased=TRUE)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 7, 6
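
For reference, this message is what base R's data.frame() reports when it is handed columns of unequal length. The sketch below reproduces it with made-up vectors of length 7 and 6. The variable names are hypothetical, and the idea that the gene-based and allele-based results ended up with different numbers of regions is an unconfirmed assumption, not a diagnosis of summarizeResults.

```r
# Made-up stand-ins for two per-region result vectors of unequal length.
allele_based <- seq_len(7)
gene_based   <- seq_len(6)

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  {
    data.frame(allele = allele_based, gene = gene_based, check.names = FALSE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Comparing the number of regions carried through the gene-based and allele-based retests would be one way to test that assumption.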
