bioinformatics-centre / bayestyper

A method for variant graph genotyping based on exact alignment of k-mers

C++ 99.36% CMake 0.64%

variant-calling genotyping ngs variant-graphs dna-sequencing snvs indels structural-variation bayesian-inference gibbs-sampling

bayestyper's Issues

error bayestyper cluster

Dear,

I tried to run BayesTyper. I first ran KMC and bayesTyperTools without any problem, but when I run bayesTyper cluster the following error appears:

bayesTyper: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.4.1_static/BayesTyper-1.4.1/src/bayesTyper/VariantFileParser.cpp:307: void VariantFileParser::parseVariants(ProducerConsumerQueue<std::vector<std::unordered_map<unsigned int, VariantCluster*>>>*, uint, const Chromosomes&): Assertion `prev_position < cur_position' failed.
/scratch/slurm/job5138623/slurm_script: line 14: 328998 Aborted

I ran the program like this:
./bayesTyper cluster -v insertions_filtered.vcf -s insilico3.tsv -g GRCh37_canon.fa -d GRCh37_decoy.fa -p 8

The VCF comes only from the Pamir variant caller, which detects only insertions; I have not combined it with output from any other variant caller. The insilico3.tsv file contains this tab-delimited information:
insilico3 M path/insilico3

So I don't know where the problem is. Do you have any idea why I get this error? Could you help me?

Thanks a lot for your time.

Jordi
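
The failed assertion, `prev_position < cur_position`, suggests that the parser expects positions to increase within each chromosome, so an unsorted VCF (or possibly repeated positions) is a plausible cause. That reading is an assumption drawn from the assertion text, not from the BayesTyper source. A minimal Python sketch, with a hypothetical function name, that lists such records:

```python
from typing import Iterable, Iterator, Tuple


def find_position_problems(
    vcf_lines: Iterable[str],
) -> Iterator[Tuple[int, str, int, int, str]]:
    """Yield (line_number, chrom, previous_pos, pos, kind) for suspicious records.

    kind is "decreasing" when POS goes backwards within a chromosome and
    "repeated" when it equals the previous POS on the same chromosome.
    """
    last_chrom = None
    last_pos = 0
    for line_number, line in enumerate(vcf_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        if chrom == last_chrom:
            if pos < last_pos:
                yield line_number, chrom, last_pos, pos, "decreasing"
            elif pos == last_pos:
                yield line_number, chrom, last_pos, pos, "repeated"
        last_chrom, last_pos = chrom, pos
```

For an uncompressed file, iterating over `find_position_problems(open("insertions_filtered.vcf"))` and printing each tuple shows the offending records. Whether repeated positions alone trigger the assertion is not confirmed here.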

Genotyping 808 samples

Dear Jonas,

I read the paper and I have to congratulate you, because it is really interesting! I am contacting you because I'm interested in using BayesTyper to genotype my samples. I have 808 samples (at 30X) that were run independently through different variant callers, among them CNVnator, Platypus and Manta. First I merged all the VCFs with a tool named SURVIVOR. Now I want to genotype all my samples together in order to obtain the prior and posterior genotype likelihoods. But as I read in the documentation, I have to split the samples into batches to do that. So could you tell me whether it is possible to use BayesTyper to genotype all my samples by caller? Or do I have to merge all the VCFs and then use BayesTyper?

Thanks for your help and time

Jordi
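
The batch approach mentioned in the documentation amounts, as far as this thread describes it, to splitting the sample list into several smaller samples files that are run separately. A minimal Python sketch of the splitting step only; the batch size is a placeholder that should be taken from the BayesTyper documentation, and the function name is hypothetical:

```python
from typing import Iterable, List


def split_into_batches(sample_rows: Iterable[str], batch_size: int) -> List[List[str]]:
    """Group non-empty samples.tsv rows into consecutive batches.

    Each batch holds at most batch_size rows; row order is preserved.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rows = [row.rstrip("\n") for row in sample_rows if row.strip()]
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
```

Each returned list can then be written to its own tab-delimited samples file, one per batch.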

Error when combining vcfs using bayesTyperTools combine

Hi,

I ran into the errors below while combining my VCF files using bayesTyperTools combine:

$ bayesTyperTools combine -v indel:BS_SK_indel.vcf -o OUTindel

[12/08/2018 17:42:30] You are using BayesTyperTools (v1.3.1)

[12/08/2018 17:42:30] Running BayesTyperTools (v1.3.1) combine on 1 files ...

[12/08/2018 17:42:30] Finished chromosome chr1
[12/08/2018 17:42:31] Finished chromosome chr2
[12/08/2018 17:42:31] Finished chromosome chr3
[12/08/2018 17:42:31] Finished chromosome chr4
[12/08/2018 17:42:31] Finished chromosome chr5
[12/08/2018 17:42:31] Finished chromosome chr6
[12/08/2018 17:42:31] Finished chromosome chr7
bayesTyperTools: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.3.1_static/BayesTyper/src/bayesTyperTools/Combine.cpp:301: uint Combine::addVariant(Variant*, std::map<unsigned int, Variant*>*, const string&, bool): Assertion `contig_variants_it.first->second->ref().seq().substr(0, min_ref_length) == cur_var->ref().seq().substr(0, min_ref_length)' failed.
Aborted (core dumped)

It seems that something is wrong with my VCF file, but I couldn't figure out what. Since it breaks after chr7, I tried running the bayesTyperTools combine command on the VCF records without "chr8" and it finished successfully, but I cannot see any irregularity in the chr8 VCF records. Could you tell me what the error message means?

Thank you!

Best wishes,

Songtao Gui
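
The assertion compares the leading bases of two reference alleles, which suggests (as an assumption based only on the assertion text) that two records at the same position carry REF sequences that disagree with each other. A minimal Python sketch, with a hypothetical function name, that looks for such pairs:

```python
from typing import Dict, Iterable, Iterator, Tuple


def find_conflicting_refs(
    vcf_lines: Iterable[str],
) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (chrom, pos, first_ref, other_ref) when two records at the same
    position have REF alleles that differ within their shared prefix."""
    first_ref_at: Dict[Tuple[str, int], str] = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref = line.rstrip("\n").split("\t")[:4]
        key = (chrom, int(pos))
        # Remember the first REF seen at this position, compare later ones to it.
        first = first_ref_at.setdefault(key, ref)
        shared = min(len(first), len(ref))
        if first[:shared] != ref[:shared]:
            yield chrom, int(pos), first, ref
```

The comparison is case-sensitive on purpose, so a soft-masked (lowercase) REF next to an uppercase one would also be reported.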

filtering on merged samples

Great tool!

I have a quick question on filtering by kmer coverage in samples which are run as batches. How should the "bayestyper_genomic_parameters.txt" file be specified after runs have been combined (or does filtering need to be done prior to combining samples)?

Thanks

ChildIOException: File/directory is a child to another output

I have installed the latest version along with all the dependencies and have configured it to test run on a single sample using all three tools (Manta, GATK, Platypus). It fails on start with the following error:

snakemake --snakefile=/share/Codes/output/revision/bayestyper/call_candidates_and_genotype.smk  --cores=24
/share/Codes/venv3/lib/python3.5/site-packages/snakemake/workflow.py:14: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  from functools import partial
Building DAG of jobs...
ChildIOException:
File/directory is a child to another output:
/share/Codes/output/revision/bayestyper/manta/HG00514
/share/Codes/output/revision/bayestyper/manta/HG00514/results/variants/candidateSV.vcf.gz

Seems like an error internal to Snakemake. I tried replacing all mkdir commands with mkdir -p as suggested here, but no such commands are executed by the jobs in the snakefiles. Removing Manta

[feature request] Genotyping more than 500 samples

I need to genotype more than 500 samples. I hope this feature can be added to the software.

killed by SIGKILL

Hi,

I keep getting this error when trying BayesTyper with the following options:

./bayesTyper -o integrated_calls -c chr20 -s samples.tsv -v /genotyping/vcf-files/test_bayes.vcf -g /genotyping/ref/hs38DHp.fa

Last few lines of the output:

[11/09/2017 19:24:58] Parsed 1200 alternative alleles and excluded:

	- Alleles on chromosome(s) not in genome: 0
	- Alleles with reference not equal to genome sequence: 0
	- Alleles with non-canonical bases (not ACGTN): 0
	- Alleles within 55 bases of chromosome end: 0
	- Alleles longer than 3000000 bases: 0

[11/09/2017 19:24:58] Out of 1200 variants:

	- Single nucleotides polymorphism: 0
	- Insertion: 102
	- Deletion: 170
	- Complex: 928
	- Mixture: 0
	- Unsupported (excluded): 0

[11/09/2017 19:24:58] Merged variants into 1200 clusters and further into 1198 groups

[11/09/2017 19:24:58] Shuffling intercluster regions ...
[11/09/2017 19:24:58] Finished shuffling

[11/09/2017 19:24:58] Sorting variant clusters by decreasing complexity (number of variants) ...
[11/09/2017 19:24:58] Finished sorting


[11/09/2017 19:24:58] Maximum resident set size: 3.20937 Gb

[11/09/2017 19:24:58] Counting smallmers ...
Killed

samples.tsv:

sample_1	M	/data/ERR1955396

with 36 GB kmc_suf size and 33.6 MB kmc_pre size (gathered using kmc_tools union since the data is paired).

Could you help me find out the cause?

Assertion error when run bayesTyper cluster

Hi, Jonas!
When I run bayesTyper cluster, the following error appears:

[20/02/2020 18:01:01] Parsing variants in unit 1 ...
bayesTyper: /opt/conda/conda-bld/bayestyper_1574250450004/work/src/bayesTyper/Chromosomes.cpp:145: bool Chromosomes::isDecoy(const string&) const: Assertion `order.find(name) != order.end()' failed.

Do you have any ideas on how to fix this so that it runs successfully?

Thanks
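
The assertion `order.find(name) != order.end()` in `Chromosomes::isDecoy` looks like a failed lookup of a sequence name. One possible cause, stated here as an assumption, is a chromosome name in the VCF that does not occur in the reference or decoy FASTA (for example `1` versus `chr1`). A minimal Python sketch to list such names; the function names are hypothetical:

```python
from typing import Iterable, List, Set


def fasta_names(fasta_lines: Iterable[str]) -> Set[str]:
    """Return sequence names: the first word after '>' on each header line."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names


def vcf_chroms_missing_from_fasta(
    vcf_lines: Iterable[str], fasta_lines: Iterable[str]
) -> List[str]:
    """Return the sorted CHROM values used in VCF records but absent from the FASTA."""
    known = fasta_names(fasta_lines)
    seen = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        seen.add(line.rstrip("\n").split("\t", 1)[0])
    return sorted(seen - known)
```

When a separate decoy FASTA is used, its lines can simply be chained after the reference lines before calling the function.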

bayesTyperTools combine Assertion `contig_id_to_idx_it != contig_id_to_idx.end()' failed

I got this error message:

bayesTyperTools: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.3.1_static/BayesTyper/src/vcf++/VcfMetaData.cpp:338: uint VcfMetaData::getContigIndex(const string&): Assertion `contig_id_to_idx_it != contig_id_to_idx.end()' failed.
/ebi/lsf/yoda-spool/02/1529441822.3144105: line 8: 29650 Aborted                 (core dumped) /homes/mhunt/bin/bayesTyper_v1.3.1_linux_x86_64/bin/bayesTyperTools combine -v samtools:../samtools.vcf,cortex:../cortex.vcf -o 03.typertools.combine -z

Maybe I'm not using it as intended?

My use case is a bacterium where I have calls for one sample from samtools and cortex, not from GATK/Platypus/Manta. I don't have the "variation prior", just calls for that sample from samtools and cortex, hence no prior:<prior>.vcf in my command line.

Can this type of data be used with BayesTyper?
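
The failing function, `VcfMetaData::getContigIndex`, appears to look up a contig ID, so one thing worth checking (an assumption based on the function name alone) is whether every chromosome used in the records of each VCF is declared in a `##contig` header line. A minimal Python sketch with a hypothetical function name:

```python
import re
from typing import Iterable, List

# Matches "##contig=<ID=name,...>" and captures the name.
CONTIG_HEADER = re.compile(r"^##contig=<.*?\bID=([^,>]+)")


def undeclared_contigs(vcf_lines: Iterable[str]) -> List[str]:
    """Return the sorted CHROM values that have no matching ##contig header line."""
    declared = set()
    used = set()
    for line in vcf_lines:
        match = CONTIG_HEADER.match(line)
        if match:
            declared.add(match.group(1))
        elif line.strip() and not line.startswith("#"):
            used.add(line.rstrip("\n").split("\t", 1)[0])
    return sorted(used - declared)
```

Running it on each input file separately (here, the samtools and cortex VCFs) shows which file, if any, lacks the declarations.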

bayestyper cluster assertion failure

I'm running into an error with the cluster tool

[08/08/2019 16:08:15] You are using BayesTyper (v1.4)

[08/08/2019 16:08:15] Seeding pseudo-random number generator with 1565298495 ...
[08/08/2019 16:08:15] Setting the kmer size to 55 ...

[08/08/2019 16:08:15] Parsed information for 1 sample(s)

[08/08/2019 16:08:15] Parsing reference genome ...
[08/08/2019 16:08:16] Parsed 14 reference genome chromosomes(s) (24214674 nucleotides)

[08/08/2019 16:08:16] Parsing decoy sequence(s) ...
[08/08/2019 16:08:16] Parsed 3 decoy sequence(s) (4837922 nucleotides)

[08/08/2019 16:08:16] Setting the number of inference units to 1 across 951939 variants ...

[08/08/2019 16:08:23] Maximum resident set size: 1.80473 Gb

[08/08/2019 16:08:23] Parsing variants in unit 1 ...
bayesTyper: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.4_static/BayesTyper-1.4/src/bayesTyper/VariantFileParser.cpp:218: bool VariantFileParser::constructVariantClusterGroups(InferenceUnit*, uint, const Chromosomes&): Assertion `num_variant_clusters_pre < num_variant_clusters' failed.
./variantcluster.sh: line 8: 14522 Aborted /master/ianc/BayesTyper/bayesTyper_v1.4_linux_x86_64/bin/bayesTyper cluster -p 8 -v candidates.vcf.gz -s ${1}.tsv -g PvivaxP01.genome_canon_2.fa -d PvivaxP01.genome_decoy_2.fa -o ${1}_bayestyper

I'm not certain what's going wrong; as near as I can tell the variants file looks good and has been sorted. Any clues about what's happening here?

Many Thanks,

Ian

terminate called after throwing an instance of 'std::out_of_range'

I get the following error after running BayesTyper on a VCF file generated by Paragraph:

bayesTyper cluster -v /share/hormozdiarilab/Codes/data/HGSV/Unified/HG00514_HG00733.merged_nonredundant.unified.paragraph.sorted.vcf.gz -s samples.tsv -g /share/Data/ReferenceGenomes/Hg38/hg38.fa -p 16

[08/05/2020 12:22:37] You are using BayesTyper (v1.5)

[08/05/2020 12:22:37] Seeding pseudo-random number generator with 1588965757 ...
[08/05/2020 12:22:37] Setting the kmer size to 55 ...

[08/05/2020 12:22:37] Parsed information for 1 sample(s)

[08/05/2020 12:22:37] Parsing reference genome ...
[08/05/2020 12:22:48] Parsed 58 reference genome chromosomes(s) (2638915210 nucleotides)

[08/05/2020 12:22:48] Parsing decoy sequence(s) ...
[08/05/2020 12:22:48] Parsed 0 decoy sequence(s) (0 nucleotides)

[08/05/2020 12:22:59] Setting the number of inference units to 1 across 11882 variants ...

[08/05/2020 12:23:13] Maximum resident set size: 11.6129 Gb


[08/05/2020 12:23:13] Parsing variants in unit 1 ...
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 40917102) > this->size() (which is 40799484)
[1]    30690 abort       cluster -v  -s samples.tsv -g  -p 16

I tried looking for variants at or near position 40917102 on any chromosome and the only thing I find is a deletion on chr8 at 40917103:

chr8    40917103        DEL00004034;DEL00004141 G       <DEL>   30      .       END=40922258;SVTYPE=DEL;CIPOS=-10,10;CIEND=-10,10;IMPRECISE;MAPQ=60     GT      ./.     ./.

These coordinates fall within the boundaries of the chromosome so I don't see why I'm getting this error.
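
The exception message says a `substr` call asked for position 40917102 in a sequence of only 40799484 bases, so the sequence held in memory for that record's contig seems to be shorter than expected. A minimal Python sketch (function names hypothetical) that measures sequence lengths directly from the FASTA text and reports records whose POS or INFO END lies beyond them; it assumes an uncompressed FASTA and the standard `END=` INFO key:

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

END_KEY = re.compile(r"(?:^|;)END=(\d+)")


def fasta_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence name: number of bases}, counted from the sequence lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else None
            if name is not None:
                lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def records_past_contig_end(
    vcf_lines: Iterable[str], lengths: Dict[str, int]
) -> Iterator[Tuple[str, int, int, int]]:
    """Yield (chrom, pos, end, contig_length) for records extending past the contig."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        info = fields[7] if len(fields) > 7 else ""
        match = END_KEY.search(info)
        # Without an END key, the record ends at POS + len(REF) - 1.
        end = int(match.group(1)) if match else pos + len(fields[3]) - 1
        if chrom in lengths and max(pos, end) > lengths[chrom]:
            yield chrom, pos, end, lengths[chrom]
```

Comparing the lengths from `fasta_lengths` with the expected chromosome sizes is also a quick way to see whether the FASTA itself was parsed as intended.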

how to prepare canon.fa and decoy.fa for non-human organism?

Hi,
I cannot work out how to prepare the canon.fa and decoy.fa files for a non-human organism. Could you tell me how to do that?
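
The file names used elsewhere on this page (`GRCh37_canon.fa`, `GRCh37_decoy.fa`) suggest a split of the reference into canonical chromosomes and the remaining sequences. A minimal Python sketch of such a split, assuming you supply the names you consider canonical; which sequences belong in the decoy file for a given organism is a judgment call that this sketch does not make, and the function name is hypothetical:

```python
from typing import Iterable, List, Set, Tuple


def split_fasta(
    fasta_lines: Iterable[str], canonical_names: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split FASTA lines into (canonical, other) according to sequence name.

    A record goes to the first list when the first word of its header is in
    canonical_names, otherwise to the second list.
    """
    canon: List[str] = []
    other: List[str] = []
    target = None  # list receiving the current record; None before the first header
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            target = canon if name in canonical_names else other
        if target is not None:
            target.append(line)
    return canon, other
```

The two returned lists can be written out with `writelines` to produce the two FASTA files.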

genotype failed: Assertion `parameters.first > 0' failed.

Hi,

I am using BayesTyper to genotype SNPs of HG002 but I'm getting an assertion failure. The reference SNPs were taken from a file processed by GIAB using freebayes.

The commands I used are listed below:

kmc -k55 -fbam 20_snps_reads.sorted.bam 55mers .

num_threads=1
bayesTyperTools makeBloom -k 55mers -p $num_threads 

ref_build=genome/bayestyper_GRCh38_bundle_v1.3/GRCh38
bayesTyper cluster -v test_20_snps.with_chr.vcf.gz -s samples.tsv -g ${ref_build}_canon.fa -d ${ref_build}_decoy.fa -p $num_threads

bayesTyper genotype -v bayestyper_unit_1/variant_clusters.bin -c bayestyper_cluster_data \
    -s samples.tsv -g ${ref_build}_canon.fa -d ${ref_build}_decoy.fa \
    -o bayestyper_unit_1/bayestyper -z -p $num_threads

The logs of bayesTyper genotype process are below:

[08/02/2023 12:52:36] You are using BayesTyper (v1.5)

[08/02/2023 12:52:36] Seeding pseudo-random number generator with 1675831956 ...
[08/02/2023 12:52:36] Setting the kmer size to 55 ...

[08/02/2023 12:52:36] Parsed information for 1 sample(s)

[08/02/2023 12:52:36] Parsing reference genome ...
[08/02/2023 12:53:29] Parsed 65 reference genome chromosomes(s) (3095211400 nucleotides)

[08/02/2023 12:53:29] Parsing decoy sequence(s) ...
[08/02/2023 12:53:30] Parsed 2515 decoy sequence(s) (10503663 nucleotides)

[08/02/2023 12:53:37] Maximum resident set size: 3.28199 Gb


[08/02/2023 12:53:37] Parsing variant clusters ...
[08/02/2023 12:53:38] Parsed 53 variant clusters (100 variants)

[08/02/2023 12:53:39] Parsing parameter kmers ...
[08/02/2023 12:53:41] Parsed 1000000 kmers

[08/02/2023 12:53:41] Maximum resident set size: 5.02302 Gb


[08/02/2023 12:53:41] Counting kmers in variant cluster paths ...
[08/02/2023 12:53:41] Counting kmers in inter-cluster regions and decoy sequence(s) ...

[08/02/2023 13:25:38] Parsing KMC table containing 19636 kmers for sample HG002 ...

[08/02/2023 13:25:38] Classifying kmers in variant cluster paths ...
[08/02/2023 13:25:38] Out of 2113360 kmers:

	- 5163 have a match to a single variant cluster
	- 0 have a match to single variant cluster group and multiple variant clusters

	- 0 have match to at least one variant cluster and has match to a decoy sequence (not used for inference)
	- 6 have match to at least one variant cluster and has a maximum haploid multiplicity higher than 127 (not used for inference)
	- 0 have matches to multiple variant cluster groups within or across inference units (not used for inference)

	- 2108191 have no match to a variant cluster (includes parameter kmers)

[08/02/2023 13:25:38] Maximum resident set size: 5.2132 Gb


[08/02/2023 13:25:38] Estimating genomic haploid kmer count distribution(s) from parameter kmers ...

bayesTyper: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.5_static/BayesTyper-1.5/src/bayesTyper/NegativeBinomialDistribution.cpp:94: void NegativeBinomialDistribution::setParameters(const std::pair<double, double>&): Assertion `parameters.first > 0' failed.

Do you have any suggestions about this error?

Many thanks,
Jesson-mark.

how to generate test cases

how to run test

how to run bayestyper

Is it necessary to install MANTA?

Assertion `genome_seqs_it != genome_seqs.end()' failed.

When running a conversion of a manta vcf, with the following command line, an assertion fails. What does that error indicate? Best way to resolve?

../bin/bayesTyperTools convertAllele -v adsp5k.manta/manta.gcad.vcf.gz --keep-partial 1 -g /restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa -z -o  adsp5k.manta.sv.converted

BayesTyperTools: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.5_static/BayesTyper-1.5/src/bayesTyperTools/ConvertAllele.cpp:138: void ConvertAllele::convertAllele(const string&, const string&, const string&, const string&, const string&, bool, bool): Assertion `genome_seqs_it != genome_seqs.end()' failed.
/var/spool/sge/scc-yn3/job_scripts/1931498: line 3: 78426 Aborted                 ../bin/bayesTyperTools convertAllele -v adsp5k.manta/manta.gcad.vcf.gz --keep-partial 1 -g /restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa -z -o adsp5k.manta.sv.converted.vcf.gz
tbx_index_build failed: adsp5k.manta.sv.converted.vcf.gz

Feature Request for prior vcf with bayesTyperTools combine

In the prior VCF file, there are a lot of details on the origin of the calls. However, they are not retained when the file is combined with another VCF, and all these variants get ACO=prior rather than dbSNP150, GDK, etc.

chr1    10019   .       TA      T       .       .       ACO=dbSNP150
chr1    10055   .       T       TA      .       .       ACO=dbSNP150
chr1    10128   .       A       AC      .       .       ACO=dbSNP150
chr1    10144   .       TA      T       .       .       ACO=dbSNP150
chr1    10146   .       AC      A       .       .       ACO=GDK:dbSNP150
chr1    10149   .       CCT     C       .       .       ACO=GDK
chr1    10165   .       A       AC      .       .       ACO=dbSNP150
chr1    10177   .       A       AC      .       .       ACO=1000g:dbSNP150
chr1    10218   .       AC      A       .       .       ACO=GDK
chr1    10228   .       TAACCCCTAACCCTAACCCTAAACCCTA    TACCCCTAACCCTAACCCTAAACCCTA,T   .       .       ACO=dbSNP150,dbSNP150
chr1    10230   .       AC      A       .       .       ACO=dbSNP150
chr1    10235   .       T       TA      .       .       ACO=1000g:dbSNP150
chr1    10249   .       AAC     A       .       .       ACO=dbSNP150
chr1    10254   .       TA      T       .       .       ACO=dbSNP150
chr1    10328   .       AACCCCTAACCCTAACCCTAACCCT       A       .       .       ACO=dbSNP150
chr1    10329   .       AC      A       .       .       ACO=dbSNP150
chr1    10352   .       T       TA      .       .       ACO=1000g:dbSNP150
chr1    10371   .       ACCCTAACCCTAACCCTAAC    A       .       .       ACO=GDK
chr1    10383   .       A       AC      .       .       ACO=dbSNP150
chr1    10389   .       AC      A       .       .       ACO=dbSNP150
chr1    10433   .       A       AC      .       .       ACO=dbSNP150
chr1    10439   .       AC      A       .       .       ACO=dbSNP150
chr1    10458   .       A       AC      .       .       ACO=dbSNP150
chr1    10616   .       CCGCCGTTGCAAAGGCGCGCCG  C       .       .       ACO=1000g:dbSNP150
chr1    10642   .       G       A       .       .       ACO=dbSNP150
chr1    10891   .       CA      C       .       .       ACO=dbSNP150
chr1    11008   .       C       G       .       .       ACO=dbSNP150
chr1    11012   .       C       G       .       .       ACO=dbSNP150
chr1    11063   .       T       G       .       .       ACO=dbSNP150
chr1    11666   .       TAACAGG T       .       .       ACO=GDK
chr1    12938   .       GCAAA   G       .       .       ACO=dbSNP150

Everything gets set to prior rather than the original origin. It would be nice if that information were retained when the VCF tag is prior.

chr1 10019 . TA T . . ACO=prior
chr1 10055 . T TA . . ACO=prior
chr1 10128 . A AC . . ACO=prior
chr1 10144 . TA T . . ACO=prior
chr1 10146 . AC A . . ACO=prior
chr1 10149 . CCT C . . ACO=prior
chr1 10165 . A AC . . ACO=prior
chr1 10177 . A AC . . ACO=prior
chr1 10218 . AC A . . ACO=prior
chr1 10228 . TAACCCCTAACCCTAACCCTAAACCCTA TACCCCTAACCCTAACCCTAAACCCTA,T . . ACO=prior,prior
chr1 10230 . AC A . . ACO=prior
chr1 10235 . T TA . . ACO=prior
chr1 10249 . AAC A . . ACO=prior
chr1 10254 . TA T . . ACO=prior
chr1 10328 . AACCCCTAACCCTAACCCTAACCCT A . . ACO=prior
chr1 10329 . AC A . . ACO=prior
chr1 10352 . T TA . . ACO=prior
chr1 10371 . ACCCTAACCCTAACCCTAAC A . . ACO=prior
chr1 10383 . A AC . . ACO=prior
chr1 10389 . AC A . . ACO=prior
chr1 10433 . A AC . . ACO=prior
chr1 10439 . AC A . . ACO=prior
chr1 10458 . A AC . . ACO=prior
chr1 10464 . A AC . . ACO=adsp5k_gatk
chr1 10616 . CCGCCGTTGCAAAGGCGCGCCG C . . ACO=adsp5k_gatk:prior
chr1 10642 . G A . . ACO=prior
chr1 10744 . A AC . . ACO=adsp5k_gatk
chr1 10815 . T TC . . ACO=adsp5k_gatk
chr1 10891 . CA C . . ACO=prior
chr1 11008 . C G . . ACO=prior
chr1 11012 . C G . . ACO=prior
chr1 11063 . T G . . ACO=prior
chr1 11666 . TAACAGG T . . ACO=prior

Genotyping 500 samples

I followed the method “Executing BayesTyper on sample batches” to genotype my samples. When I combined the batch VCF files using bcftools merge, an error occurred:
Failed to open bayestyper_rmdup_DH_00_unit_1/bayestyper-sk-b73.vcf.gz: not compressed with bgzip
It seems that BayesTyper uses gzip to compress the VCF files, but bcftools requires bgzip-compressed VCF files.
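
bcftools expects BGZF (the blocked gzip variant written by bgzip), which an ordinary gzip stream is not, even though both commonly use the `.gz` extension. A minimal Python sketch, with a hypothetical function name, that tells the two apart by inspecting the gzip header for the BGZF `BC` extra subfield:

```python
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the bytes start with a BGZF block header.

    BGZF is gzip with the FEXTRA flag set and an extra subfield whose
    identifier is 'B', 'C' and whose payload length is 2.
    """
    if len(header) < 12 or header[:2] != b"\x1f\x8b" or header[2] != 8:
        return False  # not a gzip/deflate stream at all
    if not header[3] & 0x04:
        return False  # FEXTRA flag missing: plain gzip
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    offset = 0
    while offset + 4 <= len(extra):
        si1, si2 = extra[offset], extra[offset + 1]
        (slen,) = struct.unpack_from("<H", extra, offset + 2)
        if si1 == 66 and si2 == 67 and slen == 2:
            return True
        offset += 4 + slen
    return False
```

Passing the first few hundred bytes of the file, read in binary mode, is enough. If the file turns out to be plain gzip, decompressing it and recompressing with `bgzip` from htslib before running `bcftools merge` is the usual remedy.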

how to get test case

test data

SV Typing only

If one would like to genotype only SVs (to minimize the CPU time), what would be the recommendation for the candidate variants file? For example, would limiting the SNVs to those with a high minor allele frequency speed things up at all? Or limiting the SNVs to those found in the samples rather than the larger dbSNP list? Or is it best to include the comprehensive set of SNVs for the algorithm to work optimally? We have VCFs from Lumpy, Delly, Manta, Strelka2, Scalpel and GATK HaplotypeCaller.

I would like to try BayesTyper on 5000 30x WGS samples for genotyping SVs. Also, are there any benefits to running in batches versus 5000 single runs for these CRAMs? What would you recommend?

manta_run_workflow fails with Unexpected character in reference sequence.

I get the following error during the manta_run_workflow stage:

[2020-05-17T19:56:16.703129Z] [23217_1] [WorkflowRunner] [ERROR] [makeLocusGraph_chromId_438_chr19_KI270933v1_alt_0000] Error Message:
[2020-05-17T19:56:16.705237Z] [23217_1] [WorkflowRunner] [ERROR] [makeLocusGraph_chromId_438_chr19_KI270933v1_alt_0000] Last 6 stderr lines from task (of 6 total lines):
[2020-05-17T19:56:16.705237Z] [23217_1] [WorkflowRunner] [ERROR] [2020-05-17T19:53:07.332513Z] [23217_1] [makeLocusGraph_chromId_438_chr19_KI270933v1_alt_0000] ERROR::
Unexpected character in reference sequence.
[2020-05-17T19:56:16.705237Z] [23217_1] [WorkflowRunner] [ERROR] [2020-05-17T19:53:07.334080Z] [23217_1] [makeLocusGraph_chromId_438_chr19_KI270933v1_alt_0000]   reference_sequence_file: '/share/data/BayesTyper/bayestyper_GRCh38_bundle_v1.3/GRCh38.fa'
[2020-05-17T19:56:16.705237Z] [23217_1] [WorkflowRunner] [ERROR] [2020-05-17T19:53:07.346773Z] [23217_1] [makeLocusGraph_chromId_438_chr19_KI270933v1_alt_0000]   chromosome: 'chr19_KI270933v1_alt'
[2020-05-17T19:56:16.705237Z] [23217_1] [WorkflowRunner] [ERROR] [2020-05-17T19:53:07.347916Z] [23217_1] [makeLocusGraph_chromId_438_chr19_KI270933v1_alt_0000]   character: '>'
[2020-05-17T19:56:16.705237Z] [23217_1] [WorkflowRunner] [ERROR] [2020-05-17T19:53:07.348516Z] [23217_1] [makeLocusGraph_chromId_438_chr19_KI270933v1_alt_0000]   character_decimal_index: 62
[2020-05-17T19:56:16.705237Z] [23217_1] [WorkflowRunner] [ERROR] [2020-05-17T19:53:07.349091Z] [23217_1] [makeLocusGraph_chromId_438_chr19_KI270933v1_alt_0000]   character_position_in_chromosome: 118438
gzip: manta/NA19240/results/variants

I get a similar error for all makeLocusGraph workflows launched for alt contigs. The stage eventually fails with the following error:

gzip: manta/NA19240/results/variants/candidateSV.vcf.gz: No such file or directory
gzip: manta/NA19240/results/variants/candidateSV.vcf: No such file or directory

As for this particular case, I checked the sequence for chr19_KI270933v1_alt in the reference file provided by the BayesTyper bundle just in case and it doesn't include any > characters in the reported position. I'm assuming this stage does not look at the BAM file yet, so it shouldn't be a problem with the mapping. Any insights on this? Thanks.
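
One way a `>` can show up inside a sequence is when a reader jumps to a byte offset that does not match the FASTA, which can happen if the `.fai` index is out of date relative to the file. That is only one possible explanation, offered as an assumption. A minimal Python sketch (hypothetical function name) that compares the lengths recorded in the index with those counted from the FASTA itself:

```python
from typing import Dict, Iterable, List, Optional, Tuple


def stale_index_entries(
    fasta_lines: Iterable[str], fai_lines: Iterable[str]
) -> List[Tuple[str, int, Optional[int]]]:
    """Return (name, indexed_length, actual_length) for index rows that disagree
    with the FASTA; actual_length is None when the sequence is absent."""
    actual: Dict[str, int] = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else None
            if name is not None:
                actual[name] = 0
        elif name is not None:
            actual[name] += len(line)

    mismatches = []
    for line in fai_lines:
        if not line.strip():
            continue
        # .fai columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
        fields = line.rstrip("\n").split("\t")
        indexed_name, indexed_length = fields[0], int(fields[1])
        if actual.get(indexed_name) != indexed_length:
            mismatches.append((indexed_name, indexed_length, actual.get(indexed_name)))
    return mismatches
```

An empty result means the index and the FASTA agree on sequence lengths; it does not rule out other causes, and rebuilding the index with `samtools faidx` is a cheap thing to try either way.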

Classifying kmers error with bayesTyper genotype

Hi,

I'm trying to genotype variant clusters using bayesTyper genotype and running into the following error:

[05/03/2019 14:03:52] You are using BayesTyper (v1.4)

[05/03/2019 14:03:52] Seeding pseudo-random number generator with 1551812632 ...
[05/03/2019 14:03:52] Setting the kmer size to 55 ...

[05/03/2019 14:03:52] Parsed information for 1 sample(s)

[05/03/2019 14:03:52] Parsing reference genome ...
[05/03/2019 14:04:03] Parsed 65 reference genome chromosomes(s) (3095211400 nucleotides)

[05/03/2019 14:04:03] Parsing decoy sequence(s) ...
[05/03/2019 14:04:03] Parsed 2515 decoy sequence(s) (10503663 nucleotides)

[05/03/2019 14:04:11] Maximum resident set size: 3.28256 Gb


[05/03/2019 14:04:11] Parsing variant clusters ...
[05/03/2019 14:04:42] Parsed 1811780 variant clusters (5202843 variants)

[05/03/2019 14:04:49] Parsing parameter kmers ...
[05/03/2019 14:04:51] Parsed 1000000 kmers

[05/03/2019 14:04:51] Maximum resident set size: 24.6832 Gb


[05/03/2019 14:04:51] Counting kmers in variant cluster paths ...
[05/03/2019 14:09:35] Counting kmers in inter-cluster regions and decoy sequence(s) ...

[05/03/2019 14:11:06] Parsing KMC table containing 3219629908 kmers for sample altai_neand.res ...

[05/03/2019 14:16:03] Classifying kmers in variant cluster paths ...
bayesTyper: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.4_static/BayesTyper-1.4/src/bayesTyper/KmerHash.cpp:306: std::vector<std::vector<std::vector<KmerStats> > > ObservedKmerCountsHash<sample_bin>::calculateKmerStats(const std::vector<Sample>&) [with unsigned char sample_bin = 3u]: Assertion `!(*hash_it).second.isExcluded()' failed.
Aborted

This is the command I'm using:

bayesTyper genotype -v bayestyper_unit_1/variant_clusters.bin -c bayestyper_cluster_data -s samples.tsv -g ../vcf/bayestyper_GRCh38_bundle_v1.3/GRCh38_canon.fa -d ../vcf/bayestyper_GRCh38_bundle_v1.3/GRCh38_decoy.fa -o bayestyper_unit_1/bayestyper -z -p 12

Could you tell me what the error is referring to and how to debug it?

Thanks,
Steph

What to do with smaller genomes?

Hi there

Congratulations on a very nicely written paper. I'm still absorbing it, but I wondered what I ought to do if using it on small genomes. Your documentation says you need 1 million SNPs in your input candidate VCFs but for small genomes you won't get that many variants, commonly. Is this ok?

added "*" in ALT after genotyping

Hi,

After running BayesTyper genotype, the output VCF file has a * allele added to the ALT field.

For example:
This is the input record; the ALT field contains only the T allele:

1	119885	.	TCTCTTTTTCTCGAACACGCAGGAGAACTGTGCGTCATTATATTAAGAGGAAAAAGGTCCCAAGTGGACTAAGAAAACAAAGTGCCCGAGAAGGCAGCAAACGAGAAGAGGGGGACAAAAAGAAAAAAAGAAAGAAACCAAAATAAAAGAAAGAAAAACTAGAAACTAGAAACAAGGGGGGGGGGTGCAACCCCCACCACCCCACTTAAATTAAGGCCACAATTGTCTAATATCTTTTGCTCCTGCCATTCCCCAAAGCTTAACCTCCTCATGAATCTCTTGCAACATGCTGCGGATACTCGGGGATACCCCATCAAACACACAAGCATTCCTTTGTTTCCACAACCTCCAAGACACCAAGATAACTAGGGAATTAAACCCCTTTCTTTTACTATTTGGCACTTTCTGCTCAGCCTTCCTCCACCATTCTTGGAAAACCACATCTGTTATTTCTGGAGCCAAAGGCAGCAATCCCACCTTGTTCAAAGTCTGGGCCCAAATGTCTCTAGCAAAAACGCAAGCCACCAGAATGTGTTGTGCTGTTTCCTCTTGCTGATCACAAAGAAGACATTTGTCAGGATGGTTCAAACCCCTACGAGCCAGTCTATCTGCTGTCCAACATTTGTTAAGGGATGCAAGCCAAATAAAGAATTTGCATTTCTGAGGTGCCCATGTCCGCCAAATTCGCTCGGACGGTTCAAAATAAACGGAACCAGCGAAGAAACGATCATAAGCTGACTTAGATGAATACTGCCCATTGGCCGTTGGCAGCCATTTATGCTGATCTGAAATTCCAGGTTGCAAATGAATTCCCCGAGTGACATCCCATATATAAAAGAAGCCCATAAGCACTTCGGCCGGCAAACTACCAGTAATGTCCGACACCCATCTATTATTTAGCAAAGCCTCGTACACAGATCTGCTCTTCTGAATTTTCATGGGGATGCGACTCAGGAGAACTGGAGCAAGCTCACCCACAGATTTCCCATGTAACCATTTATCAGTCCAAAACAAAGTATTCTGGCCATCTCCCACAATAGAGCAAACAGAGACTGAGAAAAATGCTGCTGCATTTGGATGCACCTGAATGTCAAAATCTGACCATGACCGTTCAGGCTGGGTCTTTTTAAGCCACATCCAACGCATATTCAGAGACCAGCCAAGCACCTCCAAATTGTGGATTCCGAGGCCCCCCCTACTGATAGGCCTGCAAACCTTGGACCAACCGACAACACAATGGCCTCCTCTAACATCTGTCCTTCCCTTCCATAAAAACCCCCTGCGAATCTTATCAATAGCTCTAATTAC	T	.	.	ACO=B73_mantaSV

And this is the same record in the output; the ALT field has become T,* and one sample was genotyped as 2/2:

1	119885	.	TCTCTTTTTCTCGAACACGCAGGAGAACTGTGCGTCATTATATTAAGAGGAAAAAGGTCCCAAGTGGACTAAGAAAACAAAGTGCCCGAGAAGGCAGCAAACGAGAAGAGGGGGACAAAAAGAAAAAAAGAAAGAAACCAAAATAAAAGAAAGAAAAACTAGAAACTAGAAACAAGGGGGGGGGGTGCAACCCCCACCACCCCACTTAAATTAAGGCCACAATTGTCTAATATCTTTTGCTCCTGCCATTCCCCAAAGCTTAACCTCCTCATGAATCTCTTGCAACATGCTGCGGATACTCGGGGATACCCCATCAAACACACAAGCATTCCTTTGTTTCCACAACCTCCAAGACACCAAGATAACTAGGGAATTAAACCCCTTTCTTTTACTATTTGGCACTTTCTGCTCAGCCTTCCTCCACCATTCTTGGAAAACCACATCTGTTATTTCTGGAGCCAAAGGCAGCAATCCCACCTTGTTCAAAGTCTGGGCCCAAATGTCTCTAGCAAAAACGCAAGCCACCAGAATGTGTTGTGCTGTTTCCTCTTGCTGATCACAAAGAAGACATTTGTCAGGATGGTTCAAACCCCTACGAGCCAGTCTATCTGCTGTCCAACATTTGTTAAGGGATGCAAGCCAAATAAAGAATTTGCATTTCTGAGGTGCCCATGTCCGCCAAATTCGCTCGGACGGTTCAAAATAAACGGAACCAGCGAAGAAACGATCATAAGCTGACTTAGATGAATACTGCCCATTGGCCGTTGGCAGCCATTTATGCTGATCTGAAATTCCAGGTTGCAAATGAATTCCCCGAGTGACATCCCATATATAAAAGAAGCCCATAAGCACTTCGGCCGGCAAACTACCAGTAATGTCCGACACCCATCTATTATTTAGCAAAGCCTCGTACACAGATCTGCTCTTCTGAATTTTCATGGGGATGCGACTCAGGAGAACTGGAGCAAGCTCACCCACAGATTTCCCATGTAACCATTTATCAGTCCAAAACAAAGTATTCTGGCCATCTCCCACAATAGAGCAAACAGAGACTGAGAAAAATGCTGCTGCATTTGGATGCACCTGAATGTCAAAATCTGACCATGACCGTTCAGGCTGGGTCTTTTTAAGCCACATCCAACGCATATTCAGAGACCAGCCAAGCACCTCCAAATTGTGGATTCCGAGGCCCCCCCTACTGATAGGCCTGCAAACCTTGGACCAACCGACAACACAATGGCCTCCTCTAACATCTGTCCTTCCCTTCCATAAAAACCCCCTGCGAATCTTATCAATAGCTCTAATTAC	T,*	99	PASS	AC=34,2;AF=0.944444,0.0555556;AN=36;ACP=1,1,1;VCS=1;VCR=1:119885-121192;VCGS=7;VCGR=1:96019-130431;HC=2;ACO=B73_mantaSV,.	
GT:GPP:APP:NAK:FAK:MAC:SAF	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,6.3984,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,10.0334,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,6.05643,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,5.12502,-1:0,0,0	2/2:0,0,0,0,0,1:0,0,1:-1,-1,6:-1,-1,1:-1,-1,2.34679:0,0,0	./.:0,0,0,0,0.7462,0.2538:0,0.7462,1:-1,5.48432,5.80061:-1,0,1:-1,0,11.4544:0,2,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,6.11511,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,3.27049,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,3.36197,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,4.12725,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,5.13398,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,8.84972,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,3.88092,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,5.42057,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,5.15327,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,9.88321,-1:0,0,0	./.:0.9884,0.0116,0,0,0,0:1,0.0116,0:14.462,3.36207,-1:1,0.336207,-1:5.04096,2.85776,-1:0,2,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,5.0996,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,7.37556,-1:0,0,0	1/1:0,0,1,0,0,0:0,1,0:-1,5.2,-1:-1,1,-1:-1,4.16548,-1:0,0,0

How did this happen and what does it mean?

Please find attached the run log and the first 1000 lines of the input and output VCF files.
xab-geno.zip

Thank you.

Best wishes,

Songtao Gui

Genotyping assertion failure

Hi,

I am using Bayestyper to genotype my yeast strains after the clustering step but I'm getting an assertion failure error. I ran bayestyper genotype with the --noise-genotyping parameter due to the smaller genome size but it produces the following error:

[02/08/2019 10:36:21] You are using BayesTyper (v1.5)

[02/08/2019 10:36:21] Seeding pseudo-random number generator with 1564565781 ...
[02/08/2019 10:36:21] Setting the kmer size to 55 ...

[02/08/2019 10:36:21] Parsed information for 19 sample(s)

[02/08/2019 10:36:21] Parsing reference genome ...
[02/08/2019 10:36:21] Parsed 17 reference genome chromosomes(s) (12157105 nucleotides)

[02/08/2019 10:36:21] Parsing decoy sequence(s) ...
[02/08/2019 10:36:21] Parsed 0 decoy sequence(s) (0 nucleotides)

[02/08/2019 10:36:21] Maximum resident set size: 0.017216 Gb


[02/08/2019 10:36:21] Parsing variant clusters ...
[02/08/2019 10:36:59] Parsed 3299 variant clusters (1776718 variants)

[02/08/2019 10:37:09] Parsing parameter kmers ...
[02/08/2019 10:37:09] Parsed 5028 kmers

[02/08/2019 10:37:09] Maximum resident set size: 18.2243 Gb


[02/08/2019 10:37:09] Counting kmers in variant cluster paths ...
[02/08/2019 11:04:13] Counting kmers in inter-cluster regions and decoy sequence(s) ...

[02/08/2019 11:04:16] Parsing KMC table containing 16766385 kmers for sample 1 ...
[02/08/2019 11:05:20] Parsing KMC table containing 20050264 kmers for sample 2 ...
[02/08/2019 11:06:18] Parsing KMC table containing 23337507 kmers for sample 3 ...
[02/08/2019 11:07:43] Parsing KMC table containing 14484357 kmers for sample 4 ...
[02/08/2019 11:08:38] Parsing KMC table containing 19351468 kmers for sample 5 ...
[02/08/2019 11:09:47] Parsing KMC table containing 190540458 kmers for sample 6 ...
[02/08/2019 11:14:59] Parsing KMC table containing 17861269 kmers for sample 7 ...
[02/08/2019 11:15:57] Parsing KMC table containing 16852594 kmers for sample 8 ...
[02/08/2019 11:16:55] Parsing KMC table containing 16318134 kmers for sample 9 ...
[02/08/2019 11:17:46] Parsing KMC table containing 22293085 kmers for sample 10 ...
[02/08/2019 11:18:47] Parsing KMC table containing 14517377 kmers for sample 11 ...
[02/08/2019 11:19:36] Parsing KMC table containing 14126999 kmers for sample 12 ...
[02/08/2019 11:20:30] Parsing KMC table containing 18857437 kmers for sample 13 ...
[02/08/2019 11:21:27] Parsing KMC table containing 17636840 kmers for sample 14 ...
[02/08/2019 11:22:21] Parsing KMC table containing 20092469 kmers for sample 15 ...
[02/08/2019 11:23:22] Parsing KMC table containing 230337421 kmers for sample 16 ...
[02/08/2019 11:28:47] Parsing KMC table containing 18801072 kmers for sample 17 ...
[02/08/2019 11:30:04] Parsing KMC table containing 18662345 kmers for sample 18 ...
[02/08/2019 11:31:14] Parsing KMC table containing 6201477 kmers for sample 19 ...

[02/08/2019 11:31:37] Classifying kmers in variant cluster paths ...
[02/08/2019 12:06:09] Out of 26475843 kmers:

        - 21164548 have a match to a single variant cluster
        - 480057 have a match to single variant cluster group and multiple variant clusters

        - 0 have match to at least one variant cluster and has match to a decoy sequence (not used for inference)
        - 199 have match to at least one variant cluster and has a maximum haploid multiplicity higher than 127 (not used for inference)
        - 4784833 have matches to multiple variant cluster groups within or across inference units (not used for inference)

        - 46206 have no match to a variant cluster (includes parameter kmers)

[02/08/2019 12:06:09] Maximum resident set size: 18.7509 Gb

[02/08/2019 12:06:09] Estimating genomic haploid kmer count distribution(s) from parameter kmers ...


WARNING: Low number of kmers used for negative binomial parameters estimation for sample 1 (0 < 10000)
WARNING: The mean and variance estimates might be biased due to the genome used being too small, too variant dense and/or too repetitive

bayesTyper: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.5_static/BayesTyper-1.5/src/bayesTyper/CountDistribution.cpp:115: void CountDistribution::setGenomicCountDistributions(const std::vector<std::vector<std::vector<KmerStats> > >&, const string&): Assertion `max_genomic_kmer_multiplicity > 0' failed. 
Aborted

Do you have any suggestions as to what may be causing this error?

Many thanks,
Prithika

Genotyping for the inversions

Hi,
BayesTyper is a powerful tool. Can it also genotype inversions?

genotyping error: intercluster_diploid_kmer_stats.at(sample_idx).at(bias_idx).front().getCount() == 0

Hi,
When I run BayesTyper v1.4, I get this error. Could anyone please help me fix it? Thanks.

[14/02/2019 06:13:18] You are using BayesTyper (v1.4)

[14/02/2019 06:13:18] Seeding pseudo-random number generator with 1550146398 ...
[14/02/2019 06:13:18] Setting the kmer size to 55 ...

[14/02/2019 06:13:18] Parsed information for 1 sample(s)

[14/02/2019 06:13:18] Parsing reference genome ...
[14/02/2019 06:14:49] Parsed 24 reference genome chromosomes(s) (3095677412 nucleotides)

[14/02/2019 06:14:49] Parsing decoy sequence(s) ...
[14/02/2019 06:15:03] Parsed 62 decoy sequence(s) (41777093 nucleotides)

[14/02/2019 06:15:14] Maximum resident set size: 3.37096 Gb

[14/02/2019 06:15:14] Parsing variant clusters ...
[14/02/2019 06:16:04] Parsed 1829850 variant clusters (5052955 variants)

[14/02/2019 06:16:12] Parsing parameter kmers ...
[14/02/2019 06:16:24] Parsed 1000000 kmers

[14/02/2019 06:16:24] Maximum resident set size: 25.0474 Gb

[14/02/2019 06:16:24] Counting kmers in variant cluster paths ...
[14/02/2019 06:29:30] Counting kmers in inter-cluster regions and decoy sequence(s) ...

[14/02/2019 06:34:57] Parsing KMC table containing 8070384125 kmers for sample UDN765115 ...

[14/02/2019 07:48:11] Classifying kmers in variant cluster paths ...
[14/02/2019 07:50:42] Out of 211574032 kmers:

    - 180492874 have a match to a single variant cluster
    - 17803661 have a match to single variant cluster group and multiple variant clusters

    - 1011777 have match to at least one variant cluster and has match to a decoy sequence (not used for inference)
    - 3551 have match to at least one variant cluster and has a maximum haploid multiplicity higher than 127 (not used for inference)
    - 10721402 have matches to multiple variant cluster groups within or across inference units (not used for inference)

    - 1540767 have no match to a variant cluster (includes parameter kmers)

[14/02/2019 07:50:42] Maximum resident set size: 27.0064 Gb

[14/02/2019 07:50:42] Estimating genomic haploid kmer count distribution(s) from parameter kmers ...

bayesTyper: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.4_static/BayesTyper-1.4/src/bayesTyper/CountDistribution.cpp:92: void CountDistribution::setGenomicCountDistributions(const std::vector<std::vector<std::vector > >&, const string&): Assertion `intercluster_diploid_kmer_stats.at(sample_idx).at(bias_idx).front().getCount() == 0' failed.

Best,
Youhuang

Genotyping error on one cluster

When running bayestyper genotype on the 9 clusters, the 7th cluster generates an error. Any suggestions? This ran successfully before I dropped -ci1 from the kmc command line to minimize storage requirements.

seq 1 $N_UNITS|xargs -I {} -n1 $BAYESTYPER/bin/bayesTyper genotype -v bayestyper_unit_{}/variant_clusters.bin -c bayestyper_cluster_data -s sample.tsv -g $CANON -d $DECOY -o bayestyper_unit_{}/bayestyper -z -p $NSLOTS

This is the error

bayesTyper: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.5_static/BayesTyper-1.5/src/bayesTyper/GenotypeWriter.cpp:415: uint GenotypeWriter::finalise(const string&, const Chromosomes&, const string&, const OptionsContainer&, const Filters&): Assertion `getline(tmp_infile_fstream, genotyped_variants_it->second.back().genotypes, '\n')' failed.

This is the full output.

[29/12/2019 20:45:39] You are using BayesTyper (v1.5)

[29/12/2019 20:45:39] Seeding pseudo-random number generator with 1577670339 ...
[29/12/2019 20:45:39] Setting the kmer size to 55 ...

[29/12/2019 20:45:39] Parsed information for 1 sample(s)

[29/12/2019 20:45:39] Parsing reference genome ...
[29/12/2019 20:45:47] Parsed 65 reference genome chromosomes(s) (3095211400 nucleotides)

[29/12/2019 20:45:47] Parsing decoy sequence(s) ...
[29/12/2019 20:45:47] Parsed 2515 decoy sequence(s) (10503663 nucleotides)

[29/12/2019 20:45:54] Maximum resident set size: 3.28256 Gb


[29/12/2019 20:45:54] Parsing variant clusters ...
[29/12/2019 20:46:28] Parsed 2004766 variant clusters (5509182 variants)

[29/12/2019 20:46:39] Parsing parameter kmers ...
[29/12/2019 20:46:42] Parsed 1000000 kmers

[29/12/2019 20:46:42] Maximum resident set size: 27.0824 Gb


[29/12/2019 20:46:42] Counting kmers in variant cluster paths ...
[29/12/2019 20:54:17] Counting kmers in inter-cluster regions and decoy sequence(s) ...

[29/12/2019 20:56:38] Parsing KMC table containing 3141728545 kmers for sample A-ADC-AD008276-BL-NCR-13AD63452 ...

[29/12/2019 21:04:25] Classifying kmers in variant cluster paths ...
[29/12/2019 21:06:16] Out of 231207366 kmers:

        - 187516823 have a match to a single variant cluster
        - 30409265 have a match to single variant cluster group and multiple variant clusters

        - 224078 have match to at least one variant cluster and has match to a decoy sequence (not used for inference)
        - 20713 have match to at least one variant cluster and has a maximum haploid multiplicity higher than 127 (not used for inference)
        - 11835346 have matches to multiple variant cluster groups within or across inference units (not used for inference)

        - 1201141 have no match to a variant cluster (includes parameter kmers)

[29/12/2019 21:06:16] Maximum resident set size: 28.9024 Gb


[29/12/2019 21:06:16] Estimating genomic haploid kmer count distribution(s) from parameter kmers ...

[29/12/2019 21:06:16] Estimated negative binomial (mean = 12.5918, var = 25.1094) for sample A-ADC-AD008276-BL-NCR-13AD63452 using 882855 parameter kmers (multiplicity = 2)

[29/12/2019 21:06:16] Wrote genomic parameters to bayestyper_unit_7/bayestyper_genomic_parameters.txt

[29/12/2019 21:06:16] Maximum resident set size: 28.9024 Gb


[29/12/2019 21:06:16] Estimating noise model parameters using 20 independent gibbs sampling chains each with 350 iterations (100 burn-in) ...
[29/12/2019 21:11:45] Calculated final noise model parameters by averaging 5000 parameter estimates (250 per gibbs sampling chain)

[29/12/2019 21:11:45] Wrote noise parameters to bayestyper_unit_7/bayestyper_noise_parameters.txt

[29/12/2019 21:11:45] Maximum resident set size: 29.648 Gb


[29/12/2019 21:11:45] Estimating genotypes using 20 independent gibbs sampling chains each with 350 iterations (100 burn-in) ...

[29/12/2019 21:15:27] Genotyped 100000 variants
[29/12/2019 21:17:38] Genotyped 200000 variants
[29/12/2019 21:19:48] Genotyped 300000 variants
[29/12/2019 21:21:55] Genotyped 400000 variants
[29/12/2019 21:23:55] Genotyped 500000 variants
[29/12/2019 21:25:51] Genotyped 600000 variants
[29/12/2019 21:27:28] Genotyped 700000 variants
[29/12/2019 21:28:58] Genotyped 800000 variants
[29/12/2019 21:30:15] Genotyped 900000 variants
[29/12/2019 21:33:04] Genotyped 1000000 variants
[29/12/2019 21:34:05] Genotyped 1100000 variants
bayesTyper: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.5_static/BayesTyper-1.5/src/bayesTyper/GenotypeWriter.cpp:415: uint GenotypeWriter::finalise(const string&, const Chromosomes&, const string&, const OptionsContainer&, const Filters&): Assertion `getline(tmp_infile_fstream, genotyped_variants_it->second.back().genotypes, '\n')' failed.

ConvertAllele and Variant normalization

Hi,

Firstly thank you for creating such a great resource. I have two questions (using v1.5):

I am trying to genotype SVs identified by Manta (SVs only). When following the steps and running bayesTyperTools convertAllele, almost 20% of the variants are skipped, for example:

Skipped 1219 unsupported allele(s):

	- 307 <INS> alternative allele(s)
	- 912 translocation alternative allele(s)

In your new version I understand that there is added support for insertions, is there a way to rescue these skipped insertions?

At the variant normalization step using bcftools norm a significant number of variants suffer from errors and are not normalized, for example:

Non-ACGTN reference allele at chr3:52803269

Do you have any recommendations for this? I tried the bcftools norm --check-ref ws to fix 'bad sites'.

Best wishes,
Mo
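Before normalizing, it can help to list exactly which records carry a REF allele with characters other than A, C, G, T or N. The following is only a minimal sketch for doing that on an uncompressed VCF; `find_bad_ref_sites` is a hypothetical helper (not part of bcftools or bayesTyperTools), the example records are made up, and treating lower-case bases as acceptable is an assumption.

```python
ACGTN = set("ACGTN")

def find_bad_ref_sites(vcf_lines):
    """Return (chrom, pos, ref) for records whose REF has non-ACGTN characters."""
    bad = []
    for line in vcf_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        # Assumption: case is ignored, so "a" is treated like "A".
        if set(ref.upper()) - ACGTN:
            bad.append((chrom, pos, ref))
    return bad

# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr3\t100\t.\tA\tAT\t.\t.\t.",
    "chr3\t200\t.\tR\tG\t.\t.\t.",
]
print(find_bad_ref_sites(example))
```

For a real file, the same function can be fed an open file handle, e.g. `find_bad_ref_sites(open("my.vcf"))`.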

A question about convertAllele.

Hi,
I am using 'bayesTyperTools convertAllele' to convert my VCF file. The ALT column contains many allele types, so I am trying to add '--alt-file' to my command. I don't understand this description: "alternative allele file (fasta format). Sequence name in fasta (>"name") should match <"name">." I am not sure about the format of my alt-file.
Looking forward to your reply!
Best regards!

My code:
bayesTyperTools convertAllele -v my.vcf -g Zm-B73-chr2.fa.masked --alt-file alt-seq.fa -o convert1.vcf

My error log:

[11/01/2021 22:29:21] Parsed 1 chromosome(s)
bayesTyperTools: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.5_static/BayesTyper-1.5/src/bayesTyperTools/ConvertAllele.cpp:79: void ConvertAllele::convertAllele(const string&, const string&, const string&, const string&, const string&, bool, bool): Assertion `alt_seqs.emplace("<" + cur_fasta_rec->id() + ">", make_pair(cur_fasta_rec, cur_fasta_rec_rv)).second' failed.
bayesConvert.sh: line 14: 14071 Aborted                 (core dumped)

My alt-seq.fa file head:

>INS
CCTTGTTTAGGGACTGGCAGGACACCCTAGACAACTCTAATCGACATAGAGTCTGTAACA
CCTGGGTTTTAAGGAACAAAGTCGGGTGCATCTCATACAT
>INS
ACTGTGTTCAGCGGTTCCCTCTAAATTTCTCCCCCTATATCTCACTCACGTGCCACGTCA
GCGTTCTCTTTCGCTCTATATCTCCACGCTCTACAGCGGTTCCCCCTATATCAAACCTCT
ATACCACACCACACCAATATTTTATACTTTCATCATCAACTAACTCAACTATCATCCAAT
ATTTGTTTTATTTTTATTTGCTCTATAAACAGTGCGCGGCCGTAACTTAGTACACAATTA
ATTAGCAAACTTCGTAAGCCTTGCCAAGCCAAAATCAGAAATCTTAGGCTTATAGTCATC
ATCAAGCAATATGCACTTGGAACTAATCTTAGGCTCATAGAATGCGAGGGTTGCAACTGT
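The failed assertion is on the `.second` of `alt_seqs.emplace(...)`. Assuming `alt_seqs` is a standard C++ map-like container, `.second` is false when the key is already present, and the file above has two records named `>INS`, so one plausible reading is that every FASTA record name must be unique. A minimal sketch for spotting repeated names, where `duplicate_fasta_names` is a hypothetical helper and using the first whitespace-delimited token as the record name is an assumption:

```python
from collections import Counter

def duplicate_fasta_names(fasta_lines):
    """Return {name: count} for FASTA record names that occur more than once."""
    names = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: the record name is the first token of the header.
            name = header.split()[0] if header else ""
            names[name] += 1
    return {name: n for name, n in names.items() if n > 1}

# Shortened version of the alt-seq.fa head shown above.
example = [
    ">INS",
    "CCTTGTTTAGGGACTGGCAGG",
    ">INS",
    "ACTGTGTTCAGCGGTTCCCTC",
]
print(duplicate_fasta_names(example))
```

If this is indeed the cause, each sequence would need its own name, matched by a symbolic allele of the same name in the VCF ALT column.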

Feature Request- Add in gnomAD SV call set to prior

The prior VCF distributed with BayesTyper has a very comprehensive list of SVs. gnomAD has recently released an SV VCF that would be great to add to the current prior list: the gnomAD SV v2.1 VCF is available at https://gnomad.broadinstitute.org/downloads#v2-structural-variants.

Combine assertion error

When running combine with Scalpel calls and the prior VCF, I am getting this error. What type of VCF issue would trigger this?

bayesTyperTools: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.5_static/BayesTyper-1.5/src/bayesTyperTools/Combine.cpp:411: void Combine::combine(const std::vector<std::__cxx11::basic_string<char> >&, const string&, bool): Assertion contig_variants_it->second->alt(alt_allele_idx) != contig_variants_it->second->ref() failed.
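The assertion text compares an alternative allele with the reference allele, so records where an ALT is identical to REF are an obvious thing to look for in the input. This is a minimal sketch under that assumption; `alt_equals_ref` is a hypothetical helper, the example records are made up, and the case-insensitive comparison is also an assumption.

```python
def alt_equals_ref(vcf_lines):
    """Return (chrom, pos, ref, alt_field) for records with an ALT identical to REF."""
    hits = []
    for line in vcf_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref = fields[3].upper()
        alts = [alt.upper() for alt in fields[4].split(",")]
        if ref in alts:
            hits.append((fields[0], int(fields[1]), fields[3], fields[4]))
    return hits

# Made-up records for illustration only.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10\t.\tAT\tA\t.\t.\t.",
    "chr1\t20\t.\tG\tG,GT\t.\t.\t.",
]
print(alt_equals_ref(example))
```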

downloading variation priors fails

Hi,
I am trying to download bayestyper_GRCh38_bundle.tar.gz using http://people.binf.ku.dk/~lassemaretty/bayesTyper/bayestyper_GRCh38_bundle.tar.gz but I can't get past 44%.
Thanks,
Moustafa

Identifying SV type from output

Hello @jonassibbesen,
Just reaching out for a bit of assistance

Out of 33,462 candidate SVs called with Manta, I have successfully genotyped 26,606 SVs. To compare the relative frequency of SV type and size with those identified with other pipelines, I would like to count the number of deletions, duplications, insertions and inversions, and estimate their sizes.

I have tried to add symbolic alleles as per this approach, however only a few hundred deletions and insertions could be identified.

I have also tried to identify SVs that overlap with the candidate SV file before and after running bayestyperTools convertAllele and found that very few sites intersect or overlap (again, all deletions or insertions).

After trawling through the internet I wasn't able to find much about converting from the long-sequence SV annotation format to symbolic alleles, which has me thinking that I'm missing something super obvious. In the original paper for bayesTyper, how did you identify the different SV types for comparisons?

I have thought about splitting the candidate SV calls into different groups then carrying on from the convertAllele step, but I wasn't sure if this would negatively impact the genotype outputs. Especially since looking for an intersection between these type specific converted VCFs did not overlap well with the final output with all SVs run together.

Anyway, really like the tool and super keen for any help or advice you may have!
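One rough way to get type and size counts from sequence-resolved alleles is to compare the REF and ALT lengths. This is a minimal sketch, not how the paper did it; `classify_allele` is a hypothetical helper. Length alone cannot tell a duplication from any other insertion, and an inversion shows up as a same-length replacement, so those classes would still need the original Manta annotations.

```python
def classify_allele(ref, alt):
    """Classify a sequence-resolved allele by length; return (kind, size)."""
    # Symbolic, missing or spanning-deletion alleles carry no sequence to measure.
    if alt.startswith("<") or alt in ("*", "."):
        return ("symbolic_or_missing", 0)
    diff = len(alt) - len(ref)
    if diff > 0:
        return ("insertion", diff)
    if diff < 0:
        return ("deletion", -diff)
    if len(ref) == 1:
        return ("snv", 0)
    # Same length, longer than one base: substitution or possibly an inversion.
    return ("same_length", 0)

print(classify_allele("A", "ATTT"))
print(classify_allele("ATTT", "A"))
print(classify_allele("ACGT", "TGCA"))
```

Tallying the returned kinds with `collections.Counter` over all ALT alleles in the genotyped VCF would give the per-type counts and size distributions.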

bayesTyper cluster error

I have run into the same problem as #23 (comment),
but I do not know how to handle it.
It looks like this:
Parsing variants in unit 2 ...
bayesTyper: /opt/conda/conda-bld/bayestyper_1574250450004/work/src/bayesTyper/Chromosomes.cpp:145: bool Chromosomes::isDecoy(const string&) const: Assertion `order.find(name) != order.end()' failed.
Aborted (core dumped)
I am using BayesTyper (v1.5), installed via conda.
My platform info:
Linux version 2.6.32-431.23.3.el6.x86_64 ([email protected]) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014

conda 4.9.2

when cmake, Could NOT find Boost (missing: Boost_INCLUDE_DIR iostreams program_options system filesystem serialization)

Cram Support?

KMC3 presently does not read CRAMs. Converting CRAMs to BAM is an extra step that adds to the storage and CPU requirements. Is there another kmer program that reads CRAMs and could replace that step to generate the kmers needed for BayesTyper?

Classifying kmers error with bayesTyper 1.4.1 and 1.5

I'm having problems during the kmer classification step in bayesTyper genotype with both versions 1.4.1 and 1.5:

[19/10/2022 21:21:47] You are using BayesTyper (v1.4.1)

[19/10/2022 21:21:47] Seeding pseudo-random number generator with 1666239707 ...
[19/10/2022 21:21:47] Setting the kmer size to 55 ...

[19/10/2022 21:21:47] Parsed information for 1 sample(s)

[19/10/2022 21:21:47] Parsing reference genome ...
[19/10/2022 21:21:55] Parsed 66 reference genome chromosomes(s) (3095248640 nucleotides)

[19/10/2022 21:21:55] Parsing decoy sequence(s) ...
[19/10/2022 21:21:55] Parsed 129 decoy sequence(s) (4673901 nucleotides)

[19/10/2022 21:22:03] Maximum resident set size: 3.30645 Gb

[19/10/2022 21:22:03] Parsing variant clusters ...
[19/10/2022 21:22:05] Parsed 38805 variant clusters (68140 variants)

[19/10/2022 21:22:07] Parsing parameter kmers ...
[19/10/2022 21:22:09] Parsed 1000000 kmers

[19/10/2022 21:22:09] Maximum resident set size: 5.97004 Gb

[19/10/2022 21:22:09] Counting kmers in variant cluster paths ...
[19/10/2022 21:22:54] Counting kmers in inter-cluster regions and decoy sequence(s) ...

[19/10/2022 22:03:33] Parsing KMC table containing 14107141124 kmers for sample <REDACTED> ...

[20/10/2022 01:48:25] Classifying kmers in variant cluster paths ...

bayesTyper: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.4.1_static/BayesTyper-1.4.1/src/bayesTyper/KmerHash.cpp:269: std::vector<std::vector<std::vector<KmerStats> > > ObservedKmerCountsHash<sample_bin>::calculateKmerStats(const std::vector<Sample>&) [with unsigned char sample_bin = 3u]: Assertion `!(*hash_it).second.isParameter()' failed.

[19/10/2022 10:01:54] You are using BayesTyper (v1.5)

[19/10/2022 10:01:54] Seeding pseudo-random number generator with 1666198914 ...
[19/10/2022 10:01:54] Setting the kmer size to 55 ...

[19/10/2022 10:01:54] Parsed information for 1 sample(s)

[19/10/2022 10:01:54] Parsing reference genome ...
[19/10/2022 10:02:01] Parsed 66 reference genome chromosomes(s) (3095248640 nucleotides)

[19/10/2022 10:02:01] Parsing decoy sequence(s) ...
[19/10/2022 10:02:01] Parsed 129 decoy sequence(s) (4673901 nucleotides)

[19/10/2022 10:02:08] Maximum resident set size: 3.30645 Gb


[19/10/2022 10:02:08] Parsing variant clusters ...
[19/10/2022 10:02:10] Parsed 38805 variant clusters (68140 variants)

[19/10/2022 10:02:11] Parsing parameter kmers ...
[19/10/2022 10:02:12] Parsed 1000000 kmers

[19/10/2022 10:02:12] Maximum resident set size: 5.97024 Gb

[19/10/2022 10:02:12] Counting kmers in variant cluster paths ...
[19/10/2022 10:02:41] Counting kmers in inter-cluster regions and decoy sequence(s) ...

[19/10/2022 10:36:05] Parsing KMC table containing 14107141124 kmers for sample <REDACTED> ...

[19/10/2022 13:39:02] Classifying kmers in variant cluster paths ...

bayesTyper: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.5_static/BayesTyper-1.5/src/bayesTyper/KmerHash.cpp:278: std::vector<std::vector<std::vector<KmerStats> > > ObservedKmerCountsHash<sample_bin>::calculateKmerStats(const std::vector<Sample>&) [with unsigned char sample_bin = 3u]: Assertion `!(*hash_it).second.isParameter()' failed.

This looks related to #13, but that appeared to be resolved in 1.4.1. Do you have any advice?

Where are the test cases

A question about VCF format of results from the 'genotype' function.

Hi,

When I check my results from the 'genotype' function, there are many "*" symbols in the ALT column. I want to know why they are present and whether they are biologically meaningful.

Looking forward to your reply!
Best regards!

Example of my question:

chr1    42842   .       T       TT,*  
chr1    42969   .       A       AA,*  
chr1    42993   .       G       GT,*  
chr1    43026   .       TA      T,*   
chr1    43090   .       C       CA,*  
chr1    43333   .       TC      T,*   
chr1    43450   .       AA      A,*   
chr1    43459   .       TA      T,*   
chr1    43488   .       TA      T,*   
chr1    43491   .       A       AGG,* 
chr1    43631   .       TT      T,*   
chr1    43684   .       A       AA,*  
chr1    43709   .       C       CT,*

Questions about bayesTyperTools combine

Hi,

First, when using bayesTyper cluster, I got this error: ERROR: Variants on the same position need to be multi-allelic; multiple variants observed on position "537199" on contig "chr2". I then tried to use bayesTyperTools combine to convert the file. However, my VCF file contains many duplication-type variants, and after converting, the REF allele has become the ALT allele. Is that a bug?
Another question: does bayesTyper support genotyping duplications? If it does, the first question confuses me even more!

For example:
Before combine:

#CHROM  POS     ID      REF     ALT
chr2    3106    DUP6079 CTTCGGTCCTGCGGAAGGCAAAGGTACAAGCGTGATG    .
chr2    6812772    DUP6080 CTTCGGTCCTGCGGAAGGCAAAGGTACAAGCGTGATG    .
chr2    6812772    DUP6081 CTTCGGTCCTGCGGAAGGCAAAGGTA    .

After combine:

#CHROM  POS     ID      REF     ALT
chr2    3106    .    C     CTTCGGTCCTGCGGAAGGCAAAGGTACAAGCGTGATG
chr2    6812772    .    C    CTTCGGTCCTGCGGAAGGCAAAGGTACAAGCGTGATG,CTTCGGTCCTGCGGAAGGCAAAGGTA
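The cluster error means that several records share one position instead of being merged into a single multi-allelic record. A minimal sketch for listing such positions in the input VCF, using the "Before combine" records above (tab-separated here); `positions_with_multiple_records` is a hypothetical helper, not part of bayesTyperTools:

```python
from collections import defaultdict

def positions_with_multiple_records(vcf_lines):
    """Return {(chrom, pos): [ids]} for positions covered by more than one record."""
    seen = defaultdict(list)
    for line in vcf_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        seen[(fields[0], int(fields[1]))].append(fields[2])
    return {key: ids for key, ids in seen.items() if len(ids) > 1}

example = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr2\t3106\tDUP6079\tCTTCGGTCCTGCGGAAGGCAAAGGTACAAGCGTGATG\t.",
    "chr2\t6812772\tDUP6080\tCTTCGGTCCTGCGGAAGGCAAAGGTACAAGCGTGATG\t.",
    "chr2\t6812772\tDUP6081\tCTTCGGTCCTGCGGAAGGCAAAGGTA\t.",
]
print(positions_with_multiple_records(example))
```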

Assertion failures in genotyping step

Hey BayesTyper team!

Thank you for your hard work and congratulations on your newly published Nature paper. I am the developer of Graphtyper. Like BayesTyper, Graphtyper is a novel genotyping method that incorporates known sequence variants. I think it is pretty cool that both methods are trying to improve some of the same drawbacks of previous methods, but with a very different approach!

I want to try out BayesTyper with some simulated sequences of 3 samples with different haplotypes (simulated Illumina 151bp reads at 30x coverage). I am giving BayesTyper the sequence variant truth set (without calls) and I wanted to see if it could make the correct calls. However, I am having some problems running it.

Here are the messages I get:

$ bayesTyper -o integrated_calls -s samples.tsv -v bayesTyper_input.vcf -g chr20_10M_10k/reference.fa

[16/08/2017 13:35:01] You are using BayesTyper (v1.1 23a460559ff5fb1598826b9a894ffc857a0c6804)

[16/08/2017 13:35:01] Seeding pseudo-random number generator with 1502890501 ...
[16/08/2017 13:35:01] Setting the kmer size to 55 ...

[16/08/2017 13:35:01] Parsed information for 3 sample(s)

[16/08/2017 13:35:01] Parsing reference genome ...
[16/08/2017 13:35:01] Parsed 1 sequence(s) from the reference genome (10000 nucleotides)

[16/08/2017 13:35:01] Parsing decoy sequence(s) ...
[16/08/2017 13:35:01] Parsed 0 decoy sequence(s) (0 nucleotides)

[16/08/2017 13:35:01] Parsing variant file ...
[16/08/2017 13:35:01] Parsed 38 variants on chromosome chr20

[16/08/2017 13:35:01] Parsed 42 alternative alleles and excluded:

	- Alleles on chromosome(s) not in genome: 0
	- Alleles with reference not equal to genome sequence: 0
	- Alleles with non-canonical bases (not ACGTN): 0
	- Alleles within 55 bases of chromosome end: 0
	- Alleles longer than 3000000 bases: 0

[16/08/2017 13:35:01] Out of 38 variants:

	- Single nucleotides polymorphism: 26
	- Insertion: 6
	- Deletion: 5
	- Complex: 0
	- Mixture: 1
	- Unsupported (excluded): 0

[16/08/2017 13:35:01] Merged variants into 6 clusters and further into 6 groups

[16/08/2017 13:35:01] Shuffling intercluster regions ...
[16/08/2017 13:35:01] Finished shuffling

[16/08/2017 13:35:01] Sorting variant clusters by decreasing complexity (number of variants) ...
[16/08/2017 13:35:01] Finished sorting


[16/08/2017 13:35:01] Maximum resident set size: 0.003644 Gb

[16/08/2017 13:35:01] Counting smallmers ...
[16/08/2017 13:35:06] Counted 6578 unique smallmers (18 nt)


[16/08/2017 13:35:06] Maximum resident set size: 9.27463 Gb

[16/08/2017 13:35:06] Counting kmers in genomic regions between variant clusters including the decoy sequence(s) ...
[16/08/2017 13:35:06] Counted 0 unique kmers passing the smallmer filter


[16/08/2017 13:35:06] Maximum resident set size: 9.27465 Gb

[16/08/2017 13:35:06] Parsing and filtering KMC k-mer tables from 3 sample(s) ...

[16/08/2017 13:35:06] Parsing kmer table with 49085 kmers for sample SAMP1 (only kmers observed at least 1 time(s) were included by KMC) ...
[16/08/2017 13:35:06] Parsing kmer table with 47490 kmers for sample SAMP2 (only kmers observed at least 1 time(s) were included by KMC) ...
[16/08/2017 13:35:06] Parsing kmer table with 28944 kmers for sample SAMP3 (only kmers observed at least 1 time(s) were included by KMC) ...

[16/08/2017 13:35:06] Counted 1296 unique kmers passing smallmer filter


[16/08/2017 13:35:06] Maximum resident set size: 9.36827 Gb

[16/08/2017 13:35:06] Sorting variant cluster groups by decreasing complexity (number of variants) ...
[16/08/2017 13:35:06] Finished sorting

[16/08/2017 13:35:06] Counting kmers in variant clusters ...
[16/08/2017 13:35:06] Out of 1296 unique kmers:

	- 1286 have a unique match to a single variant cluster
	- 0 have a match to single variant cluster group and multiple internal variant clusters

	- 0 have match to at least one variant cluster and has match to a decoy sequence (not used for inference)
	- 0 have match to at least one variant cluster and has a maximum haploid multiplicity higher than 127 (not used for inference)
	- 0 have matches to multiple variant cluster groups (not used for inference)
	- 4 have match to at least one variant cluster and has constant multiplicity across all haplotype candidates (not used for inference)

	- 6 have no match to a variant cluster


[16/08/2017 13:35:06] Estimating haploid kmer count distribtion(s) ...

bayesTyper: ~/git/BayesTyper/src/bayesTyper/NegativeBinomialDistribution.cpp:57: static std::pair<double, double> NegativeBinomialDistribution::methodOfMomentsEst(const std::vector<long unsigned int>&): Assertion `num_obs > 1' failed.
Aborted (core dumped)

Do you have any idea what this assertion failure could mean? I checked the num_obs variable just before the assertion and it was 0.

Best regards,
Hannes
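The function name in the assertion, `methodOfMomentsEst`, suggests a method-of-moments fit, which needs a sample mean and a sample variance; with zero observations neither is defined. The following is a generic textbook sketch of such an estimator for a negative binomial, not BayesTyper's implementation. The parameterisation (size `r`, success probability `p`, mean r(1-p)/p) and the name `nb_method_of_moments` are assumptions made for illustration.

```python
import statistics

def nb_method_of_moments(counts):
    """Estimate negative binomial (r, p) from counts via mean and sample variance."""
    if len(counts) < 2:
        # A sample variance needs at least two observations.
        raise ValueError("need at least two observations")
    mean = statistics.mean(counts)
    var = statistics.variance(counts)  # sample variance, n - 1 denominator
    if var <= mean:
        # The negative binomial requires overdispersion (variance > mean).
        raise ValueError("variance must exceed the mean")
    p = mean / var
    r = mean * mean / (var - mean)
    return r, p

# Made-up counts: mean 6, sample variance 22.
r, p = nb_method_of_moments([2, 4, 4, 6, 14])
print(r, p)

try:
    nb_method_of_moments([])
except ValueError as err:
    print("cannot estimate:", err)
```

In the log above, 0 unique kmers passed the smallmer filter in the regions between variant clusters, which would be consistent with an empty set of observations on a 10,000 nt reference.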

Running BayesTyper on custom VCF file

I've been trying to run BayesTyper on a VCF file (uploaded here which is basically the merged set of calls from HG00514 and HG00733 from here with duplicates removed). BayesTyper's clustering puts all the input variants into the Complex category although they are deletions and insertions. The genotyping stage also runs without errors but almost all the events are skipped (./.) and the remaining few are genotyped as '0/0'. Around 60% of the events here should get a 1/0 or 1/1 genotype.

Note that I have not combined the VCF file with BayesTyper's variation priors as I'm not interested in SNVs or any other variants. When I try to do that however, all of my input events are skipped and the combined set includes only events from the prior.

This is the command I use for running BayesTyper:

bayesTyper cluster -v  HG00514_HG00733.merged_nonredundant.unified.sorted.vcf -s samples.tsv -g /data/BayesTyper/bayestyper_GRCh38_bundle_v1.3/GRCh38_canon.fa -d /data/BayesTyper/bayestyper_GRCh38_bundle_v1.3/GRCh38_decoy.fa -p 16
bayesTyper genotype -v bayestyper_unit_1/variant_clusters.bin -c bayestyper_cluster_data -s samples.tsv -g /data/BayesTyper/bayestyper_GRCh38_bundle_v1.3/GRCh38_canon.fa -d /data/BayesTyper/bayestyper_GRCh38_bundle_v1.3/GRCh38_decoy.fa -o bayestyper_unit_1/bayestyper -z -p 4

Am I missing anything here? Thanks.

Is BayesTyper suitable for non-human species?

Dear BayesTyper developer,

I would like to genotype SVs using plant WGS data. Is BayesTyper suitable for non-human species (for example, diploid plants)?

Thank you !

Error with bayesTyperTools combine

Hi,

I'm trying to use bayesTyperTools combine to combine two VCF files. When I run the program, I get this error message:

[14/02/2019 15:29:52] You are using BayesTyperTools (v1.4)

[14/02/2019 15:29:52] Running BayesTyperTools (v1.4) combine on 2 files ...

bayesTyperTools: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.4_static/BayesTyper-1.4/src/vcf++/Allele.cpp:63: Allele::Allele(const string&): Assertion `_seq.find_first_not_of("ACGTRYSWKMBDHVN") == string::npos' failed.
Aborted

Here's the command I'm using:

bayesTyperTools combine -v sgdp:sgdp_snvs.vcf.gz,gorilla:gorilla_snvs.vcf.gz -o combined.vcf.gz -z

And here are the first 1000 lines of the VCFs for reference:
sgdp_snvs_1000.vcf.zip
gorilla_snvs_1000.vcf.zip

I'm not sure what's off about the VCF files. Could you tell me what the error message means and how to go about debugging it?

Thanks,
Steph
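The assertion checks that an allele sequence contains only characters from "ACGTRYSWKMBDHVN", so a first debugging step could be to list the alleles in each VCF that contain anything else. This is a minimal sketch; `alleles_outside_alphabet` is a hypothetical helper and the example records are made up. It flags lower-case bases, ".", "*" and symbolic alleles as well, and whether BayesTyperTools handles any of those before reaching this check is not something the log shows.

```python
ALLOWED = set("ACGTRYSWKMBDHVN")  # character set taken from the assertion message

def alleles_outside_alphabet(vcf_lines):
    """Return (chrom, pos, allele, bad_chars) for alleles with disallowed characters."""
    hits = []
    for line in vcf_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Check REF and every comma-separated ALT allele.
        for allele in [fields[3]] + fields[4].split(","):
            bad = sorted(set(allele) - ALLOWED)
            if bad:
                hits.append((fields[0], int(fields[1]), allele, "".join(bad)))
    return hits

# Made-up records for illustration only.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10\t.\ta\tG\t.\t.\t.",
    "chr1\t20\t.\tC\t.\t.\t.\t.",
    "chr1\t30\t.\tC\tT\t.\t.\t.",
]
print(alleles_outside_alphabet(example))
```

For the gzipped inputs in the command above, the lines can be read with `gzip.open(path, "rt")`.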

Error during combine of Manta SV sites

When running combine, the following error occurs. Any suggestions?

[09/03/2020 23:18:18] You are using BayesTyperTools (v1.5)

[09/03/2020 23:18:18] Running BayesTyperTools (v1.5) combine on 2 files ...

[09/03/2020 23:19:28] Finished chromosome chr1
[09/03/2020 23:20:43] Finished chromosome chr2
[09/03/2020 23:21:46] Finished chromosome chr3
[09/03/2020 23:22:46] Finished chromosome chr4
[09/03/2020 23:23:43] Finished chromosome chr5
[09/03/2020 23:24:38] Finished chromosome chr6
[09/03/2020 23:25:32] Finished chromosome chr7
[09/03/2020 23:26:20] Finished chromosome chr8
[09/03/2020 23:27:03] Finished chromosome chr9
[09/03/2020 23:27:46] Finished chromosome chr10
[09/03/2020 23:28:30] Finished chromosome chr11
[09/03/2020 23:29:10] Finished chromosome chr12
[09/03/2020 23:29:40] Finished chromosome chr13
[09/03/2020 23:30:08] Finished chromosome chr14
[09/03/2020 23:30:36] Finished chromosome chr15
[09/03/2020 23:31:07] Finished chromosome chr16
[09/03/2020 23:31:36] Finished chromosome chr17
[09/03/2020 23:32:03] Finished chromosome chr18
[09/03/2020 23:32:25] Finished chromosome chr19
[09/03/2020 23:32:44] Finished chromosome chr20
[09/03/2020 23:32:56] Finished chromosome chr21
bayesTyperTools: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.5_static/BayesTyper-1.5/src/vcf++/VcfFile.cpp:146: void VcfFileReader::updateVariantLine(): Assertion `getline(vcf_infile_fstream, cur_var_line.at(7), '\n')' failed.
./run_combine.sh: line 5: 25393 Aborted                 ../bin/bayesTyperTools combine -o SNP_dbSNP150common_SV_1000g_dbSNP150all_GDK_GoNL_GTEx_GRCh38.$1 -z -v $VCF_LIST,$PRIOR_VCF

Running BayesTyper with a different kmer length

Hi,

I'm interested in running BayesTyper with a kmer length other than the default, as in supplementary section S13 of this paper: http://science.sciencemag.org/content/358/6363/655.

After running KMC3 with -k31, I attempted to run BayesTyper's makeBloom function, and got an error that I assume comes from the discrepancy in kmer lengths.

[13/03/2019 13:31:03] You are using BayesTyperTools (v1.4)

[13/03/2019 13:31:03] Running BayesTyperTools (v1.4) makeBloom ...

bayesTyperTools: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.4_static/BayesTyper-1.4/src/bayesTyperTools/MakeBloom.cpp:213: kmer_bloom_t* MakeBloom::kmc2bloomThreaded(const string&, float, uint): Assertion `kmc_table_info.kmer_length == kmer_size' failed.
Aborted

There doesn't seem to be an immediate option for adjusting kmer length when running makeBloom. Do you know how I would be able to do this?

Thanks,
Steph

Question of the format of VCF file for BayesTyper.

Hi,

I used the SyRI package to obtain a candidate VCF file. When I use the following code to convert the VCF file, many structural variation types are skipped. So I have the following questions:

Which specific types of variation does BayesTyper support?
What format should the other variation types (besides SNPs and indels) have in a VCF file?
If possible, I hope you can share a reference file in the relevant format.

Looking forward to your reply!
Best regards!

My code:
bayesTyperTools convertAllele -v syri.vcf -g chr2.fa -o syri.convert.vcf
My log:

[11/01/2021 11:39:22] Parsed 2398884 alternative allele(s)

        - Included 51 <INV> alternative allele(s) 
        - Included 729 <DEL> alternative allele(s) 
        - Included 20984 <DUP> alternative allele(s) 
        - Included 2276995 sequence alternative allele(s) 

        - Skipped 100125 unsupported allele(s):

                - 97 <TDM> alternative allele(s)
                - 829 <INVAL> alternative allele(s)
                - 468 <CPL> alternative allele(s)
                - 9331 <SYN> alternative allele(s)
                - 16551 <SYNAL> alternative allele(s)
                - 434 <CPG> alternative allele(s)
                - 22272 <DUPAL> alternative allele(s)
                - 668 <INS> alternative allele(s)
                - 4614 <INVTR> alternative allele(s)
                - 5890 <TRANSAL> alternative allele(s)
                - 5456 <INVTRAL> alternative allele(s)
                - 4845 <TRANS> alternative allele(s)
                - 21185 <NOTAL> alternative allele(s)
                - 7485 <HDR> alternative allele(s)
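
Before running the conversion, the candidate VCF can be tallied by ALT type to see how many alleles fall outside the types the log reports as included. The following is a minimal sketch, not part of BayesTyper: the function names are hypothetical, and treating INV, DEL and DUP as the complete set of accepted symbolic alleles is an assumption based only on the "Included" lines of the log above.

```python
import re
from collections import Counter

# Symbolic ALT types listed as "Included" in the convertAllele log above.
# Assumption: this is inferred from the log only, not from BayesTyper docs.
INCLUDED_SYMBOLIC = {"INV", "DEL", "DUP"}


def tally_alt_alleles(vcf_lines):
    """Count ALT alleles by kind: a symbolic ID such as 'DUP', or 'sequence'."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        for alt in fields[4].split(","):
            if alt == ".":
                continue  # no alternative allele at this site
            match = re.fullmatch(r"<([^>]+)>", alt)
            counts[match.group(1) if match else "sequence"] += 1
    return counts


def split_by_support(counts):
    """Split the tally into alleles the log included and alleles it skipped."""
    included = {kind: n for kind, n in counts.items()
                if kind == "sequence" or kind in INCLUDED_SYMBOLIC}
    skipped = {kind: n for kind, n in counts.items() if kind not in included}
    return included, skipped


if __name__ == "__main__":
    # Made-up records for illustration only.
    example = [
        "##fileformat=VCFv4.3",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr2\t100\t.\tA\tT\t.\tPASS\t.",
        "chr2\t200\t.\tN\t<INS>\t.\tPASS\tEND=200",
        "chr2\t300\t.\tN\t<DUP>,<SYN>\t.\tPASS\tEND=400",
    ]
    included, skipped = split_by_support(tally_alt_alleles(example))
    print("included:", included)
    print("skipped:", skipped)
```

On a real file the lines could come from `open("syri.vcf")`. The tally only describes the input; it does not say how a skipped type should be rewritten.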

The header lines of my VCF file (I hope they are helpful):

##fileformat=VCFv4.3
##fileDate=20210109
##source=syri
##ALT=<ID=SYN,Description="Syntenic region">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=TRANS,Description="Translocation">
##ALT=<ID=INVTR,Description="Inverted Translocation">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INVDP,Description="Inverted Duplication">
##ALT=<ID=SYNAL,Description="Syntenic alignment">
##ALT=<ID=INVAL,Description="Inversion alignment">
##ALT=<ID=TRANSAL,Description="Translocation alignment">
##ALT=<ID=INVTRAL,Description="Inverted Translocation alignment">
##ALT=<ID=DUPAL,Description="Duplication alignment">
##ALT=<ID=INVDPAL,Description="Inverted Duplication alignment">
##ALT=<ID=HDR,Description="Highly diverged regions">
##ALT=<ID=INS,Description="Insertion in non-reference genome">
##ALT=<ID=DEL,Description="Deletion in non-reference genome">
##ALT=<ID=CPG,Description="Copy gain in non-reference genome">
##ALT=<ID=CPL,Description="Copy loss in non-reference genome">
##ALT=<ID=SNP,Description="Single nucleotide polymorphism">
##ALT=<ID=TDM,Description="Tandem repeat">
##ALT=<ID=NOTAL,Description="Not Aligned region">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position on reference genome">
##INFO=<ID=ChrB,Number=1,Type=String,Description="Chromoosme ID on the non-reference genome">
##INFO=<ID=StartB,Number=1,Type=Integer,Description="Start position on non-reference genome">
##INFO=<ID=EndB,Number=1,Type=Integer,Description="End position on non-reference genome">
##INFO=<ID=Parent,Number=1,Type=String,Description="ID of the parent SR">
##INFO=<ID=VarType,Number=1,Type=String,Description="Start position on non-reference genome">
##INFO=<ID=DupType,Number=1,Type=String,Description="Copy gain or loss in the non-reference genome">

Questions on data bundle and SV genotyping

Hi @jonassibbesen ,

Thanks for providing this great tool.

I am now trying to download the data bundle from the link you provide. However, the download always fails after ~1 GB has been transferred, no matter which tool I use. For example, wget keeps raising the error: Connection closed at byte 1073725440. Retrying.

Would you consider providing an alternative download link? Or could you confirm whether it is OK to generate the reference data for GRCh38 in the following way:

Reference genome: put chr1-22, X, Y and chrrandom in canon.fa; put chrUn and chrdecoy in decoy.fa; skip chralt and HLA
Variant prior VCF: sequence-resolved, site-only VCF
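
The proposed split can be written down as a small classifier over contig names. This is only a sketch of the scheme described in this post, not a confirmed recipe for building the BayesTyper reference files. It assumes the `_random`, `_decoy` and `_alt` suffixes and the `chrUn` and `HLA` prefixes commonly seen in GRCh38 analysis sets; the function names are hypothetical, and contigs the post does not mention (for example chrM) are reported as unassigned rather than guessed.

```python
import re

# Assumption: GRCh38 analysis-set style names such as
# chr1, chr1_KI270706v1_random, chrUn_KI270302v1,
# chrUn_JTFH01000001v1_decoy, chr1_KI270762v1_alt, HLA-A*01:01:01:01.
CANON_PATTERN = re.compile(r"chr([1-9]|1[0-9]|2[0-2]|X|Y)(_\w+_random)?")


def classify_contig(name):
    """Return 'canon', 'decoy', 'skip' or 'unassigned' for a contig name."""
    if name.startswith("HLA") or name.endswith("_alt"):
        return "skip"
    if name.endswith("_decoy") or name.startswith("chrUn"):
        return "decoy"
    if CANON_PATTERN.fullmatch(name):
        return "canon"
    return "unassigned"  # not covered by the scheme in the post


def split_fasta(lines):
    """Group FASTA lines by the class of the record they belong to."""
    groups = {"canon": [], "decoy": [], "skip": [], "unassigned": []}
    bucket = None
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            bucket = classify_contig(line[1:].split()[0])
        if bucket is not None:
            groups[bucket].append(line)
    return groups


if __name__ == "__main__":
    for name in ["chr1", "chr22", "chrX", "chr1_KI270706v1_random",
                 "chrUn_KI270302v1", "chrUn_JTFH01000001v1_decoy",
                 "chr1_KI270762v1_alt", "HLA-A*01:01:01:01", "chrM"]:
        print(name, "->", classify_contig(name))
```

Writing `groups["canon"]` to canon.fa and `groups["decoy"]` to decoy.fa would then reproduce the split, with the unassigned contigs left for a manual decision.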

By the way, if I only want to genotype large SVs from a long-read sequencing-based callset, can I skip the variant calling step and estimate the SV genotypes of short-read sequenced samples using the SV callset plus the SNV/indel prior file?

Thanks,
Han

convertAllele assertion failure

Hello BayesTyper team,

First of all, congratulations on your recent publication.

I would like to evaluate BayesTyper and compare it with my own work, but I am running into some issues, so I was wondering if you could help me. I am using your snakemake workflow on three Platinum Genomes samples (NA12878, NA12891, NA12892), but in the manta_conv_all_id rule I get this assertion failure with all three of my samples:

bayesTyperTools: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.3.1_static/BayesTyper/src/bayesTyperTools/ConvertAllele.cpp:119: void ConvertAllele::convertAllele(const string&, const string&, const string&, const string&, bool): Assertion `genome_seqs_it != genome_seqs.end()' failed.
/usr/bin/bash: line 1: 162862 Aborted                 (core dumped) /nfs/odinn/tmp/hannese/requests/180704_bayestyper_dataset/bayestyper/bayesTyper_v1.3.1_linux_x86_64/bin/bayesTyperTools convertAllele -v manta/NA12892/results/variants/candidateSV.vcf.gz -g /odinn/tmp/hannese/requests/180704_bayestyper_dataset/data_bundle/bayestyper_GRCh38_bundle/GRCh38.fa -z -o manta/NA12892/results/variants/candidateSV_converted > manta/NA12892_convert.log

I am using BayesTyper 1.3.1 and Manta 1.0.3 (I have also tried Manta 1.4 but I got the same assertion failure) on GRCh38.

The .log file:

[17/07/2018 19:43:05] You are using BayesTyperTools (v1.3.1)

[17/07/2018 19:43:05] Running BayesTyperTools (v1.3.1) convertAllele ...

[17/07/2018 19:44:09] Parsed 3366 chromosome(s)
[17/07/2018 19:44:09] Parsed 0 mobile element insertion sequence(s)

[17/07/2018 19:53:03] Parsed 100000 variant(s) ...

Full slurm output file:

[Tue Jul 17 19:43:05 2018] Building DAG of jobs...
[Tue Jul 17 19:43:05 2018] Using shell: /usr/bin/bash
[Tue Jul 17 19:43:05 2018] Provided cores: 24
[Tue Jul 17 19:43:05 2018] Rules claiming more threads will be scaled down.
[Tue Jul 17 19:43:05 2018] Job counts:
[Tue Jul 17 19:43:05 2018]      count   jobs
[Tue Jul 17 19:43:05 2018]      1       manta_conv_all_id
[Tue Jul 17 19:43:05 2018]      1

[Tue Jul 17 19:43:05 2018] rule manta_conv_all_id:
[Tue Jul 17 19:43:05 2018]     input: manta/NA12892/results/variants/candidateSV.vcf.gz
[Tue Jul 17 19:43:05 2018]     output: manta/NA12892/results/variants/candidateSV_converted.vcf.gz
[Tue Jul 17 19:43:05 2018]     log: manta/NA12892_convert.log
[Tue Jul 17 19:43:05 2018]     jobid: 0
[Tue Jul 17 19:43:05 2018]     wildcards: sample_id=NA12892

bayesTyperTools: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.3.1_static/BayesTyper/src/bayesTyperTools/ConvertAllele.cpp:119: void ConvertAllele::convertAllele(const string&, const string&, const string&, const string&, bool): Assertion `genome_seqs_it != genome_seqs.end()' failed.
/usr/bin/bash: line 1: 162862 Aborted                 (core dumped) /nfs/odinn/tmp/hannese/requests/180704_bayestyper_dataset/bayestyper/bayesTyper_v1.3.1_linux_x86_64/bin/bayesTyperTools convertAllele -v manta/NA12892/results/variants/candidateSV.vcf.gz -g /odinn/tmp/hannese/requests/180704_bayestyper_dataset/data_bundle/bayestyper_GRCh38_bundle/GRCh38.fa -z -o manta/NA12892/results/variants/candidateSV_converted > manta/NA12892_convert.log
[Tue Jul 17 19:53:26 2018] Error in rule manta_conv_all_id:
[Tue Jul 17 19:53:26 2018]     jobid: 0
[Tue Jul 17 19:53:26 2018]     output: manta/NA12892/results/variants/candidateSV_converted.vcf.gz
[Tue Jul 17 19:53:26 2018]     log: manta/NA12892_convert.log

[Tue Jul 17 19:53:26 2018] RuleException:
[Tue Jul 17 19:53:26 2018] CalledProcessError in line 101 of /nfs/odinn/tmp/hannese/requests/180704_bayestyper_dataset/workflows/rules/call_candidates.smk:
[Tue Jul 17 19:53:26 2018] Command ' set -euo pipefail;  /nfs/odinn/tmp/hannese/requests/180704_bayestyper_dataset/bayestyper/bayesTyper_v1.3.1_linux_x86_64/bin/bayesTyperTools convertAllele -v manta/NA12892/results/variants/candidateSV.vcf.gz -g /odinn/tmp/hannese/requests/180704_bayestyper_dataset/data_bundle/bayestyper_GRCh38_bundle/GRCh38.fa -z -o manta/NA12892/results/variants/candidateSV_converted > manta/NA12892_convert.log ' returned non-zero exit status 134.
[Tue Jul 17 19:53:26 2018]   File "/nfs/odinn/tmp/hannese/requests/180704_bayestyper_dataset/workflows/rules/call_candidates.smk", line 101, in __rule_manta_conv_all_id
[Tue Jul 17 19:53:26 2018]   File "/nfs/prog/bioinfo/apps-x86_64/python/3.6.3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
[Tue Jul 17 19:53:26 2018] Removing output files of failed job manta_conv_all_id since they might be corrupted:
[Tue Jul 17 19:53:26 2018] manta/NA12892/results/variants/candidateSV_converted.vcf.gz
[Tue Jul 17 19:53:27 2018] Will exit after finishing currently running jobs.
[Tue Jul 17 19:53:27 2018] Shutting down, this might take some time.
[Tue Jul 17 19:53:27 2018] Exiting because a job execution failed. Look above for error message
[Tue Jul 17 19:53:27 2018] Complete log: /nfs/odinn/tmp/hannese/requests/180704_bayestyper_dataset/workflows/.snakemake/log/2018-07-17T194304.495091.snakemake.log

Any help appreciated.
Hannes
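
The failing assertion, `genome_seqs_it != genome_seqs.end()`, reads like a failed lookup of a sequence name among the parsed genome sequences. One possible cause, which is not confirmed anywhere in this thread, is a contig named in the VCF that is absent from the FASTA passed with -g. The sketch below checks for that. The helper names are hypothetical, and it only compares the CHROM column with the FASTA record names; it does not inspect mate chromosomes inside breakend ALT fields.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_names(lines):
    """Collect record names (first word after '>') from FASTA lines."""
    return {line[1:].split()[0]
            for line in lines
            if line.startswith(">") and line[1:].strip()}


def vcf_contigs(lines):
    """Collect the CHROM values of all records in a VCF."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def missing_contigs(vcf_lines, fasta_lines):
    """Return contigs used in the VCF that the FASTA does not contain."""
    return sorted(vcf_contigs(vcf_lines) - fasta_names(fasta_lines))


if __name__ == "__main__":
    # Made-up data for illustration; real input would come from open_text().
    vcf = ["#CHROM\tPOS\tID\tREF\tALT\n",
           "chr1\t100\t.\tA\tT\n",
           "1\t200\t.\tG\tC\n"]
    fasta = [">chr1 primary\n", "ACGT\n", ">chr2\n", "GGCC\n"]
    print(missing_contigs(vcf, fasta))
```

An empty result would rule out this particular explanation, in which case the cause lies elsewhere.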
