gymreklab / gangstr


A tool for profiling long STRs from short reads

License: GNU General Public License v2.0

C++ 77.08% Shell 3.66% C 12.42% Python 5.81% CMake 1.03%

gangstr's People

nmmsv seanzombias radygenomics richyanicky philpalmer haoziyeung ileenamitra bw2 oodnadatta pacificanalytics eclipsezhao rk28-04 danielnaro shrishtee-kandoi noma-m

gangstr's Issues

ERROR: Not enough reads for [sample ID]. Please set insert size distribution manually.

Hello!
When I run GangSTR on 3079 samples, it reports that there are not enough reads for some samples, and I don't know how to set the insert size distribution manually.
I would like to remove these samples from my BAM list. How can I identify them without running the whole GangSTR process?

[GangSTR-2.4.2] ProgressMeter: Loading read group id R18072932LD01-2967813 for sample R18072932LD01-2967813
[GangSTR-2.4.2] ERROR: Not enough reads for 086600D 92. Please set insert size distribution manually.

Thank you very much!

Verify checksums for files downloaded over ftp or http

Hello GangSTR team,

During installation, files downloaded over insecure connections (ftp and http) could be tampered with by malicious actors. Checksum verification after download could be added to the installation script.
Thank you!
This tool has a very nice documentation. Thank you for the care you have put in!
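
A minimal sketch of the check requested above, assuming the installer could pin a known SHA-256 digest for each archive it downloads. The function names and the idea of a pinned digest are illustrative assumptions; nothing here is taken from the actual install script.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Raise ValueError if the file does not match the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return True
```

An installer would call `verify_download` right after each download and stop before unpacking if it raises.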

Cannot build with the latest version

Hello,

I was trying to install GangSTR on my Debian machine. I installed the required libraries as listed in the prerequisites. When compiling the software, I encountered the following error:

make  all-recursive
make[1]: Entering directory '/media/Script/prog/TandemRepeatSoftwares/GangSTR'
Making all in m4
make[2]: Entering directory '/media/Script/prog/TandemRepeatSoftwares/GangSTR/m4'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/media/Script/prog/TandemRepeatSoftwares/GangSTR/m4'
Making all in src
make[2]: Entering directory '/media/Script/prog/TandemRepeatSoftwares/GangSTR/src'
g++ -DHAVE_CONFIG_H -I. -I..  -I../src/ -I..  -I/usr/local/include    -g -O2  -D_GIT_VERSION="\"2.4.3.3-faac\"" -D_MACHTYPE="\"x86_64\"" -std=c++11      -g -O2  -D_GIT_VERSION="\"2.4.3.3-faac\"" -D_MACHTYPE="\"x86_64\"" -std=c++11 -MT GangSTR-main_gangstr.o -MD -MP -MF .deps/GangSTR-main_gangstr.Tpo -c -o GangSTR-main_gangstr.o `test -f 'main_gangstr.cpp' || echo './'`main_gangstr.cpp
In file included from ../src/bam_info_extract.h:26:0,
                 from main_gangstr.cpp:28:
../src/bam_io.h: In member function ‘bool BamCramReader::file_exists(const string&)’:
../src/bam_io.h:458:34: error: ‘F_OK’ was not declared in this scope
     return (access(path.c_str(), F_OK) != -1);
                                  ^~~~
../src/bam_io.h:458:38: error: ‘access’ was not declared in this scope
     return (access(path.c_str(), F_OK) != -1);
                                      ^
In file included from ../src/bam_info_extract.h:26:0,
                 from main_gangstr.cpp:28:
../src/bam_io.h: In copy constructor ‘BamAlignment::BamAlignment(const BamAlignment&)’:
../src/bam_io.h:77:26: warning: ignoring return value of ‘bam1_t* bam_copy1(bam1_t*, const bam1_t*)’, declared with attribute warn_unused_result [-Wunused-result]
     bam_copy1(b_, aln.b_);
                          ^
Makefile:587: recipe for target 'GangSTR-main_gangstr.o' failed
make[2]: *** [GangSTR-main_gangstr.o] Error 1
make[2]: Leaving directory '/media/Script/prog/TandemRepeatSoftwares/GangSTR/src'
Makefile:442: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/media/Script/prog/TandemRepeatSoftwares/GangSTR'
Makefile:374: recipe for target 'all' failed
make: *** [all] Error 2

Do you have any idea what the issue could be?

Thanks.

No Locus contains enough reads to extract read length

Hello,
Thank you for the nice tool!

I just installed GangSTR and tried to run it on a BAM file, but I got the following error message:
[GangSTR-1.4] ERROR: No Locus contains enough reads to extract read length. (Possible mismatch in chromosome names)

Thanks for the help in advance!
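
Since the error message itself points to a possible mismatch in chromosome names, a quick check is to compare the names in the regions file with the `@SQ` names in the alignment header (for example from `samtools view -H`). The sketch below is illustrative only; the function names are made up for this example and it is not part of GangSTR.

```python
def bed_chromosomes(bed_lines):
    """Collect chromosome names from the first column of a regions file."""
    names = set()
    for line in bed_lines:
        fields = line.split()
        if fields and not fields[0].startswith("#"):
            names.add(fields[0])
    return names


def header_chromosomes(sam_header_lines):
    """Collect SN: values from the @SQ lines of a SAM/BAM text header."""
    names = set()
    for line in sam_header_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def diagnose(bed_names, bam_names):
    """Describe how the two naming schemes relate to each other."""
    if bed_names & bam_names:
        return "ok"
    if {n[3:] for n in bed_names if n.startswith("chr")} & bam_names:
        return "regions use 'chr' prefix, alignments do not"
    if {"chr" + n for n in bed_names} & bam_names:
        return "alignments use 'chr' prefix, regions do not"
    return "no shared names"
```

If the two sets share no names, either the regions file or the reference and alignments need to be switched to the other naming scheme.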

Out of range error when trying to compute ins/sdev on regions with no reads

I got this error when running on a BAM file with only chr22 extracted but without setting the insert size mean/sdev on the GangSTR command line

GangSTR fails on small input files

I am part of a team doing targeted sequencing of PolyG loci. We have previously been using LobSTR for genotyping polyG sequences from reads prior to a consensus making step (see https://github.com/risqueslab/PolyG-DS). I was investigating GangSTR on the recommendation of one of our collaborators to see if we might want to switch to using it. However, I keep running into the following message (for most of our loci):

[GangSTR-2.5.0] ProgressMeter: Processing chr1:46990539
[GangSTR-2.5.0] ProgressMeter: 	Setting flanking regions
[GangSTR-2.5.0] ProgressMeter: 	Loading read data
[GangSTR-2.5.0] ProgressMeter: 	Not enough reads extracted. Skipping locus..

and the .readinfo.tab file is empty. I know that these input files generated good polyG data using lobSTR, so I was wondering if you had any idea what might be happening and how we might avoid it happening in the future. An example line from our regions file (hg19-based) is:
chr1 46990539 46990839 1 C

Read group

Could you suggest a way to add read group information to the BAM files that will be used as input for GangSTR?
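
One common way to do this is `samtools addreplacerg`, which writes a new BAM with an `@RG` header line and tags the reads with it. The sketch below only builds the command lines. The choice of `ID` and `SM` as the fields to set is an assumption based on the "Loading read group id ... for sample ..." log lines quoted elsewhere in these issues, and the file names are placeholders.

```python
import subprocess


def add_read_group_commands(in_bam, out_bam, rg_id, sample):
    """Build samtools commands that tag every read with one read group."""
    tag = [
        "samtools", "addreplacerg",
        "-r", f"ID:{rg_id}",   # read group identifier
        "-r", f"SM:{sample}",  # sample name
        "-o", out_bam,
        in_bam,
    ]
    index = ["samtools", "index", out_bam]  # re-index the new BAM
    return [tag, index]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

For example, `run_all(add_read_group_commands("sample_sorted.bam", "sample_sorted.RG.bam", "rg1", "sample1"))` would produce a tagged, indexed copy, provided samtools is on the PATH.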

WARNING: Region exceeds maximum total processed reads per sample.

I get the message "WARNING: Region exceeds maximum total processed reads per sample."
What does it mean?

Reference .bed file difference

Hello,

I'm trying several tools mentioned in trtools, and am curious about the following observation:

b37 reference BED file provided for GangSTR lists 829,231 repeats spanning ~ 12 Mb;
b37 reference BED file provided for HipSTR lists 1,620,030 repeats spanning ~ 41 Mb.

I think they are supposed to be generated using the same algorithm (TRF). Do you know why there is such a difference?

Thank you in advance!

STR info example

Do you have an example of an STR info file? Do the loci have to be an exact match? Do you want bed (0-based) or the 1-based coordinate scheme?

Thanks,
Phil

genome index

I am getting an error saying there is no index for the reference genome (FASTA file). What kind of index does GangSTR need? I couldn't find this in the manual.
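
A likely candidate, stated here as an assumption rather than something confirmed in this thread, is the standard FASTA index written by `samtools faidx`, which sits next to the FASTA file with a `.fai` suffix. The helper below is a hypothetical pre-flight check that returns the command to run only when that file is missing.

```python
import os


def fasta_index_path(fasta):
    """Path where samtools faidx writes the index for this FASTA."""
    return fasta + ".fai"


def faidx_command_if_missing(fasta):
    """Return the indexing command, or None if the .fai already exists."""
    if os.path.exists(fasta_index_path(fasta)):
        return None
    return ["samtools", "faidx", fasta]
```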

error after trimming sequences

Hi,

I regularly run GangSTR on my BAM files, but recently we ran an experiment trimming our FASTQ reads from 150 bp to 100 bp and 75 bp, and with these new BAMs I get this error:

2021-11-08_15-23-42: START
[GangSTR-2.4] ProgressMeter: Loading read group id 4 for sample 20
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check
/usr/ccub/sge-8.1.8/ccub/spool/davis004/job_scripts/6443774 : ligne 23 : 4850 Abandon (core dumped)"$GANGSTR" --bam $OUTPUTDIR/$SAMPLE/$SAMPLE".RG.sort.mark.bam" --ref "$REF" --regions "$GANGSTR_REGIONS" --out $OUTPUTDIR/$SAMPLE/gangstr --verbose --readlength 75
gangstr exit code : 134
2021-11-08_15-23-45: END

I can send you my BAM file if needed.
Do you have any idea what the problem could be?
Thanks!
Marine

Expansions detection

Hi GangSTR team,

I want to use GangSTR to detect repeat expansions and I have a few questions.

Is there a way to output a VCF file containing only the STRs whose ALT allele differs from the reference?
I have also been trying to use the --str-info flag so that I can identify expansions with DumpSTR. For a first attempt I used a very basic file containing this single line:

chr4 3074877 3074933 100

But when I run GangSTR, I get this warning:
WARNING: Unknown STR info column detected... 100
In the output VCF file, the three values of the QEXP field are -1, even at the location given in the str-info file.
Could you tell me what is wrong with the formatting of my str-info file?

In the str-info file, what threshold would you advise, relative to the reference STR length at each location, given that I want to scan the whole genome rather than look at one specific pathogenic location?

Thanks !

[GangSTR-2.4.6] ERROR: No read group specified in BAM file

GangSTR --bam sample_sorted.bam --ref Homo_sapiens.GRCh38.dna.primary_assembly.fa --regions hg38_ver12.bed --out Sample_
[GangSTR-2.4.6] ERROR: No read group specified in BAM file

Bam file -
samtools view sample_sorted.bam | head
ERR000044.1 97 9 26830129 0 45M = 26830192 108 GAACAGTCATTGCCCAATTCCCAACAGCAGTTGGGGTGTCCTGTT IIIIIIIIIIIIIIIIIIIIIIHIIEIF<I0I9C?I;IH.<I0AI NM:i:0 MD:Z:45 AS:i:45 XS:i:45
ERR000044.1 145 9 26830192 60 45M = 26830129 -108 GTGAAACCAGCTGGTTTTCTGGGTCGAGCGGGGACTTGGAGAACT IIIIIIIIIIIBIIIII=IIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 MD:Z:45 AS:i:45 XS:i:27

zcat sample_sorted.bam | head
BAM�g�@SQ SN:1 LN:248956422
@SQ SN:10 LN:133797422
@SQ SN:11 LN:135086622
@SQ SN:12 LN:133275309
@SQ SN:13 LN:114364328
@SQ SN:14 LN:107043718
@SQ SN:15 LN:101991189
@SQ SN:16 LN:90338345
@SQ SN:17 LN:83257441
@SQ SN:18 LN:80373285

script for trimming reference repeats?

I'm experimenting with different RepeatFinder parameters. The GangSTR paper mentions

"To avoid errors in the local realignment step of GangSTR, all repeating regions are trimmed until they no longer contain any imperfections in their first and last four copies of the motif. Next we require that the trimmed repeating region is a perfect repetition of the motif. This step ensures there are no errors in longer STRs that may pass the trimming step. Finally, we set a threshold of at least four surviving copies for motifs of length 2-8bp and at least three copies for motifs of length greater than 8bp."

I found some post-filtering in
https://github.com/gymreklab/GangSTR/blob/master/reference/chr_make_reference.sh#L74-L81
but couldn't find the trimming code described above.
Would you by chance be able to share the code for this?
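
Until the original code is shared, the last two rules in the quoted passage (perfect repetition, then a minimum copy count) can be sketched as follows. This is an independent reading of the quoted text, not the authors' implementation. The trimming step is not attempted here, and the function names and the handling of 1 bp motifs (which the quote does not cover) are assumptions.

```python
def is_perfect_repeat(sequence, motif):
    """True if sequence is an exact whole-number repetition of motif."""
    seq, unit = sequence.upper(), motif.upper()
    if not unit or not seq or len(seq) % len(unit) != 0:
        return False
    return seq == unit * (len(seq) // len(unit))


def min_copies(motif_length):
    """Copy-number threshold from the quoted passage, or None if unspecified."""
    if 2 <= motif_length <= 8:
        return 4
    if motif_length > 8:
        return 3
    return None  # homopolymers are not covered by the quoted rule


def keep_region(sequence, motif):
    """Apply the perfect-repeat and minimum-copy rules to one trimmed region."""
    threshold = min_copies(len(motif))
    if threshold is None or not is_perfect_repeat(sequence, motif):
        return False
    return len(sequence) // len(motif) >= threshold
```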

Pre-defined off-target loci for target pathogenic loci links are not working

Hi Nima,

Many thanks for the nice tool!

Many thanks,
Sergey

terminate called after throwing an instance of 'std::out_of_range'

Hi,
Thanks for the nice tool.

When I use GangSTR to run some of my bam files, I keep getting the error:
terminate called after throwing an instance of 'std::out_of_range' what(): basic_string::substr: __pos (which is 146) > this->size() (which is 102) Aborted
But GangSTR works well on the rest of my BAM files. I checked all my BAM files and they are all in good format.

This is the simple command I used.
GangSTR --bam JU3169.bam --ref WS256.fa --regions whle_for_GangSTR.bed --genomewide --out JU3169

You can find the bam file with error here:
https://northwestern.box.com/s/sgsizzw7uoynluilswurs098l7nbmdot

Thanks for help!
Ye

reference .bed files should have 0-based start coordinate

The reference .bed files currently use a 1-based start coordinate, but should be 0-based according to the .bed spec.
Converting these to 0-based and re-running GangSTR significantly changes the output - at least in a small test dataset.
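
For reference, the conversion between the two conventions only touches the start coordinate. The helpers below are an illustrative sketch (the names are invented for this example): BED intervals are 0-based and half-open, while VCF-style coordinates are 1-based and inclusive.

```python
def one_based_to_bed(start, end):
    """1-based inclusive [start, end] -> 0-based half-open [start-1, end)."""
    return start - 1, end


def bed_to_one_based(start, end):
    """0-based half-open [start, end) -> 1-based inclusive [start+1, end]."""
    return start + 1, end


def bed_length(start, end):
    """Number of bases covered by a BED interval."""
    return end - start
```

As a worked example, the chr1 locus reported elsewhere in these issues with POS 14070 and END=14081 (12 bases, three copies of a 4 bp motif) corresponds to BED columns 14069 and 14081.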

std::out_of_range

For some samples I get this error. I don't know the cause or how to fix it. Here is the error message:

terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr: __pos (which is 145) > this->size() (which is 105)

Error when using dumpSTR

Hello, I installed GangSTR, dumpSTR and the related dependencies, and I run into an error when using dumpSTR to filter a VCF file obtained with GangSTR:

GangSTR command

GangSTR --bam --ref /path/ucsc.hg19.fasta --regions hg19_ver13.bed --out ASD

dumpSTR command

dumpSTR --filter-spanbound-only --filter-badCI --max-call-DP 2000 --min-call-DP 15 --filter-regions /path/filter_files/hg19_segmentalduplications.bed.gz --vcf ASD.vcf --out ASD_filtered

Error

Traceback (most recent call last):
  File "/usr/local/bin/dumpSTR", line 9, in <module>
    load_entry_point('strtools==1.0.0', 'console_scripts', 'dumpSTR')()
  File "/usr/local/lib/python3.5/dist-packages/strtools-1.0.0-py3.5.egg/dumpSTR/dumpSTR.py", line 383, in main
  File "/usr/local/lib/python3.5/dist-packages/strtools-1.0.0-py3.5.egg/dumpSTR/filters.py", line 86, in __call__
  File "/usr/local/lib/python3.5/dist-packages/strtools-1.0.0-py3.5.egg/strtools/utils/utils.py", line 29, in GetHomopolymerRun
ValueError: max() arg is an empty sequence

I was wondering what could be the cause for this.

Thank you in advance.
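
The final line of the traceback is what Python raises whenever `max()` receives an empty iterable without a `default`. The sketch below is not the strtools implementation of `GetHomopolymerRun`. It is a hypothetical stand-in that shows how an empty sequence can trigger the error and how `default=` guards against it. Whether an empty allele sequence is the actual cause in this report is an assumption.

```python
from itertools import groupby


def longest_homopolymer_run(sequence):
    """Length of the longest run of identical characters; 0 for empty input."""
    run_lengths = (sum(1 for _ in group) for _, group in groupby(sequence))
    return max(run_lengths, default=0)


def unguarded_longest_run(sequence):
    """Same computation without a default; fails on empty input."""
    return max(sum(1 for _ in group) for _, group in groupby(sequence))
```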

Help reading output (Genotype given in bp difference from reference)

chr:1 14070 . cctccctccctc . . . END=14081;RU=cctc;REF=3 GT:DP:GB:CI:RC:Q:INS 0/0:79:3,3:3-3,3-3:63,13,0,3:4.89157:344.217,96.4342

I was wondering why in the output the GB (Genotype given in bp difference from reference) is not a multiple of the motif length. For example, the motif cctc is 4 bp and the reference copy number is 3. Shouldn’t the genotype for a sample be a multiple of 4, whereas in this case the GB is 3 bp?

Any help would be appreciated.

Thank you

Remove trailing whitespace from ENCLREADS and FLNKREADS in VCF output

See https://github.com/gymreklab/GangSTR/blob/master/src/vcf_writer.cpp#L61 and L62.
This makes vcftools output warnings:

Leading or trailing space in attr_key-attr_value pairs is discouraged:
	[Description] [Summary of reads in enclosing class. Keys are number of copies and values show number of reads with that many copies. ]
	FORMAT=<ID=ENCLREADS,Number=1,Type=String,Description="Summary of reads in enclosing class. Keys are number of copies and values show number of reads with that many copies. ">
Leading or trailing space in attr_key-attr_value pairs is discouraged:
	[Description] [Summary of reads in flanking class. Keys are number of copies and values show number of reads with that many copies. ]
	FORMAT=<ID=FLNKREADS,Number=1,Type=String,Description="Summary of reads in flanking class. Keys are number of copies and values show number of reads with that many copies. ">

Seg fault error on Ubuntu 20.04

GangSTR doesn't work on Ubuntu 20.04 but works on 16.04 and 18.04. On Ubuntu 20.04 it segfaults with any combination of parameters.

Downstream filtering for haploid TRs

Nice tool!
I am trying to use this tool to identify TR expansions in a bacterial genome. Although GangSTR works perfectly well, the problem is with the subsequent filtering with dumpSTR (from TRTools). All sites are filtered out even after lowering the thresholds, which I guess is an issue that starts at the GangSTR stage. Are there specific considerations when --ploidy 1 is set and the genome is haploid?

how to calculate off-target loci

How can I generate the list of off-target regions for a particular locus?

For HTT, running bwa mem -a -M on reads like

@test_read
CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

got me close to the list in
https://s3.amazonaws.com/gangstr/hg38/HTT_hg38.bed
but still missed a few of the regions.

Using GangSTR on HiFi PacBio reads

Hi,
would GangSTR work with the PacBio HiFi reads instead of Illumina reads? We are planning to sequence some individuals and are considering getting HiFi reads for looking at structural variations and such. Could we use the same reads to look at STRs also?

Thank you.

Ole

Issue about "Not enough reads extracted. Skipping locus.."

Hello GangSTR team,

I am trying to apply GangSTR to exome sequencing samples. However, I get 'Not enough reads extracted' for all the TR regions. Could you help me figure it out? How many reads are required for a given TR region? (I imagine the scenario is like Figure 1 in your NAR paper.) The commands I used are listed below. Thank you in advance!

Calculate coverage for the given cram file

mosdepth -n --fast-mode \
        --fasta /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa \
        --by /path/to/xgen_plus_spikein.b38.bed \
        /path/to/${sampleID}.coverage \
        /path/to/${sampleID}.cram

Calculate the average coverage

gunzip -c /path/to/${sampleID}.coverage.regions.bed.gz |\
        sort -n -k 4 | awk '{ sum += $4; n++ } END { if (n > 0) print sum / n; }' \
        > /path/to/${sampleID}.avgcov

Calculate the insertmean and insertsdev using samtools

samtools stats \
        --reference /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa \
        --target-regions /path/to/xgen_plus_spikein.b38.bed \
        /path/to/${sampleID}.cram  1 \
        > /path/to/${sampleID}_chr1.stats

Run GangSTR

GangSTR --bam /path/to/${sampleID}.cram \
        --ref /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa \
        --regions /path/to/hg38_ver13.bed \
        --out /path/to/${sampleID}.vcf \
        --nonuniform \
        --coverage  43 \
        --readlength 76 --insertmean 164.9 --insertsdev 78.7 --targeted  #paired-end reads 76*2

Output errors

[GangSTR-2.5.0] ProgressMeter: Loading read group id sampleID for sample sampleID
[GangSTR-2.5.0] ProgressMeter: Processing chr1:14070
[GangSTR-2.5.0] ProgressMeter:  Not enough reads extracted. Skipping locus..
[GangSTR-2.5.0] ProgressMeter: Processing chr1:16620
[GangSTR-2.5.0] ProgressMeter:  Not enough reads extracted. Skipping locus..
[GangSTR-2.5.0] ProgressMeter: Processing chr1:22812
[GangSTR-2.5.0] ProgressMeter:  Not enough reads extracted. Skipping locus..
[GangSTR-2.5.0] ProgressMeter: Processing chr1:26454
[GangSTR-2.5.0] ProgressMeter:  Not enough reads extracted. Skipping locus..
[GangSTR-2.5.0] ProgressMeter: Processing chr1:31556
[GangSTR-2.5.0] ProgressMeter:  Not enough reads extracted. Skipping locus..
...
...
...
[GangSTR-2.5.0] ProgressMeter: Processing chrY:56886704
[GangSTR-2.5.0] ProgressMeter:  Not enough reads extracted. Skipping locus..
[GangSTR-2.5.0] ProgressMeter: Processing chrY:56886966
[GangSTR-2.5.0] ProgressMeter:  Not enough reads extracted. Skipping locus..
[GangSTR-2.5.0] ProgressMeter: Processing chrY:56887112
[GangSTR-2.5.0] ProgressMeter:  Not enough reads extracted. Skipping locus..

Meanwhile, the output VCF is empty with only header lines.

munmap_chunk(): invalid pointer

Hello,

I am trying to run GangSTR with the GRCh38 (version 35) FASTA and your regions file (hg38_ver13). It fails with this munmap_chunk() error on the very first entry:

ProgressMeter: Processing chr1:10486
munmap_chunk(): invalid pointer
Aborted (core dumped)

The BAMs were generated by mapping with bwa mem against the same reference FASTA.

This is the call:

GangSTR --bam SRR6761495.bam,SRR6761497.bam,SRR6761499.bam --ref GRCh38.p13.v35.genome.fa --regions hg38_ver13.bed --out GangSTR_test

I have no idea what is wrong here.

Thanks for your help already.

Best Regards

REPCN number is incorrect for x chromosome

REPCN is currently specified as

##FORMAT=<ID=REPCN,Number=2,Type=Integer,Description="Genotype given in number of copies of the repeat motif">

but the number isn't 2 - it's the ploidy of the sample, e.g. 2 for all autosomes and 1 for male non-pseudoautosomal X or Y. The current VCF spec doesn't allow for specifying things based on ploidy, so the number should be '.' and this should be explained in the description.

This might apply to other fields too, but I don't see any off the top of my head.
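
As a stopgap for downstream tools that enforce the declared count, the header line can be rewritten after the fact. This is a sketch of a post-processing workaround, not a change to GangSTR itself, and the function name is made up for illustration.

```python
import re


def relax_repcn_number(header_line):
    """Rewrite the REPCN FORMAT header so that Number is '.' (unknown count)."""
    if not header_line.startswith("##FORMAT=<ID=REPCN,"):
        return header_line  # leave every other line untouched
    return re.sub(r"Number=[^,]+", "Number=.", header_line, count=1)
```

Applying it to each header line while streaming a VCF leaves all records and all other header lines unchanged.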

Help with understanding GGL values

Each entry in the VCF file has a bunch of GGL (gangstr genotype likelihoods). What do these scores mean?

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT WGS
chr1 14070 . cctccctccctc . . . END=14081;RU=cctc;PERIOD=4;REF=3;GRID=1,6;STUTTERUP=0.05;STUTTERDOWN=0.05;STUTTERP=0.9;EXPTHRESH=-1 GT:DP:Q:REPCN:REPCI:RC:ENCLREADS:FLNKREADS:ML:INS:STDERR:QEXP:GGL 0/0:4:0.493238:3,3:3-3,3-3:3,1,0,0:3,3:NULL:19.5377:379.445,151.3:0,0:-1,-1,-1:-15.8183,-13.4988,-12.6286,-9.55703,-9.43554,-8.48511,-13.4849,-12.6063,-9.40548,-12.5884,-15.7571,-13.424,-9.50102,-13.4195,-15.7181,-16.4592,-13.4957,-9.48244,-13.4981,-16.4644,-18.6745
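
One observation that may help while waiting for an authoritative answer: the record above has GRID=1,6 and 21 GGL values, and 21 is exactly the number of unordered diploid genotypes over six allele lengths (6 × 7 / 2). The sketch below assumes that GRID gives an inclusive range of copy numbers and that GGL lists one likelihood per genotype in the usual VCF genotype order. Both are assumptions, although under them the highest-scoring entry lands on (3, 3), which matches the REPCN of 3,3 in the same record.

```python
def genotype_grid(lowest, highest):
    """Unordered allele pairs in VCF genotype order (second allele outermost)."""
    alleles = range(lowest, highest + 1)
    return [(a, b) for b in alleles for a in alleles if a <= b]


# GGL values copied from the record above.
GGL = [
    -15.8183, -13.4988, -12.6286, -9.55703, -9.43554, -8.48511, -13.4849,
    -12.6063, -9.40548, -12.5884, -15.7571, -13.424, -9.50102, -13.4195,
    -15.7181, -16.4592, -13.4957, -9.48244, -13.4981, -16.4644, -18.6745,
]

grid = genotype_grid(1, 6)                        # assumed meaning of GRID=1,6
best_index = max(range(len(GGL)), key=GGL.__getitem__)
best_genotype = grid[best_index]                  # pair with the largest value
```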

install errors

Hi,

I am having trouble installing this package. I am wondering if the issue is the compiler version?

This is where it fails:

# Install GangSTR
./configure --prefix=$PREFIX || die "Error configuring GangSTR"
make || die "Error compiling GangSTR"

In file included from ../src/enclosing_class.h:24:0,
                 from ../src/likelihood_maximizer.h:24,
                 from ../src/genotyper.h:28,
                 from main_gangstr.cpp:31:
../src/read_class.h:45:34: error: ‘constexpr’ needed for in-class initialization of static data member ‘const double ReadClass::NEG_INF’ of non-integral type [-fpermissive]
   const static double NEG_INF = -100; // TODO make smaller?
                                  ^~~
../src/read_class.h:83:41: error: ‘constexpr’ needed for in-class initialization of static data member ‘const double ReadClass::allele1_weight_’ of non-integral type [-fpermissive]
   const static double allele1_weight_ = 0.5;
                                         ^~~
../src/read_class.h:84:41: error: ‘constexpr’ needed for in-class initialization of static data member ‘const double ReadClass::allele2_weight_’ of non-integral type [-fpermissive]
   const static double allele2_weight_ = 0.5;
                                         ^~~
make[2]: *** [GangSTR-main_gangstr.o] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

Thanks!

Request: add to bioconda

Hi,

Would it be possible to add a GangSTR recipe to bioconda? That makes installation generally lots easier. I haven't added anything more complicated than python scripts, but here are some guidelines...

Thanks,
Wouter

hg38_ver5.bed.gz not in sorted order

The hg38_ver5.bed.gz file is not in sorted order. So when running GangSTR, the resulting vcf output is also out of sorted order.

This will sort it into the correct order:
bedtools sort -faidx GRCh38_full_analysis_set_plus_decoy_hla.fa.fai -i hg38_ver5.bed >hg38_ver6.bed
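
For anyone without bedtools at hand, the same ordering (chromosomes in `.fai` order, then by start and end) can be sketched in plain Python. This is an illustrative equivalent, not a replacement recommended by the authors, and it assumes every chromosome in the BED file also appears in the `.fai`.

```python
def fai_order(fai_lines):
    """Map chromosome name -> rank, following the order of the .fai file."""
    names = [line.split("\t")[0] for line in fai_lines if line.strip()]
    return {name: rank for rank, name in enumerate(names)}


def sort_bed(bed_lines, order):
    """Sort tab-separated BED records by .fai rank, then start, then end."""
    records = [line.rstrip("\n").split("\t") for line in bed_lines if line.strip()]
    # A KeyError here means the BED names a chromosome missing from the .fai.
    records.sort(key=lambda f: (order[f[0]], int(f[1]), int(f[2])))
    return ["\t".join(fields) for fields in records]
```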

Crash soon after started

I aligned my RNA-seq data using STAR and ran GangSTR with default settings, which resulted in the following error:

[GangSTR-2.4] ProgressMeter: Loading read group id g1 for sample s1
[GangSTR-2.4] ProgressMeter: Processing chr1:14070
[GangSTR-2.4] ProgressMeter: 	Genotyper Results:  3, 3	likelihood = 9.24477
[GangSTR-2.4] ProgressMeter: Processing chr1:16620
....
[GangSTR-2.4] ProgressMeter: Processing chr1:948423
[GangSTR-2.4] ProgressMeter: 	Genotyper Results:  1, 1	likelihood = -25
[GangSTR-2.4] ProgressMeter: Processing chr1:948930
[GangSTR-2.4] ERROR: Invalid CIGAR option encountered in TrimAlignment

When I use a bwa-aligned BAM file, the error message disappears, so this might be caused by the aligner.
My STAR options are:

STAR --genomeDir $star_genome_dir \
--readFilesIn $fq1 $fq2 \
--runThreadN $threads \
--limitBAMsortRAM 0 \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix $outdir \
--outFilterMultimapScoreRange 1 \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 10 \
--alignIntronMax 500000 \
--alignMatesGapMax 1000000 \
--sjdbScore 2 \
--alignSJDBoverhangMin 1 \
--genomeLoad NoSharedMemory \
--outFilterMatchNminOverLread 0.33 \
--outFilterScoreMinOverLread 0.33 \
--sjdbOverhang 100 \
--outSAMstrandField intronMotif \
--outSAMattributes NM MD MC AS XS \
--outSAMunmapped Within \
--outTmpDir $tmp_dir \
--outSAMattrRGline ID:$id LB:$id SM:$id PL:Illumina PU:$id

Is there a way to use STAR-aligned bam as input?

HELP

My experimental data is not whole genome sequencing but restriction site-associated DNA sequencing (RAD-seq), and I would like to know if I can use your software under such conditions.
Thanks again!

What is the total processed reads per sample for a region?

Hi,

I was running GangSTR on WGS BAM files with an average coverage of 30X, and for a long list of positions I received the message "WARNING: Region exceeds maximum total processed reads per sample."
I thought the reads per region were comparable to the local read depth, so that the maximum number of total processed reads at any given site for a 30X genome would be approximately 30. Apparently I was wrong, since 30 would not exceed the default --max-proc-read (3000).
Could you help me clarify how I should translate between local read depth and total processed reads for a region?

Unable to download Target TR loci for hg38

Hi,
Would you be able to check on the links for the reference target TR regions files for hg38 - they are not working for me currently.

hg38_ver13.bed.gz
hg38_ver12.bed.gz
hg38_ver5.bed.gz
hg38_ver6.sorted.bed.gz

Thank you!
Alex

Crashing something about size

docker run -v $PWD:/data -i -t gymreklab/str-toolkit GangSTR --bam /data/A.bam --regions /data/hg38_ver13.bed --ref /data/chr21.fa  --out /data/GangSTR_PA_000000019 

[GangSTR-2.4] ProgressMeter: Loading read group id g1 for sample s1
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)

Warning in Gangstr run

I am getting the following warning during my GangSTR runs:
[GangSTR-2.5.0] WARNING: Region exceeds maximum total processed reads per sample.
What does this mean ?

best,
hasna

Running in parallel by splitting hs37_ver8.bed

Hello,

I split the original reference STR file into 100 small pieces to run in parallel, but got an error saying "Low coverage or targeted data. Please set coverage manually".
I believe this is due to the small number of STR reference sites that I provided. Also, the coverage stats (I checked using -v) are a little different for each run of the same sample. Is there any downside to running GangSTR this way, or is it not recommended at all?

Best,
Nick
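
One way to keep the chunks consistent, sketched here under the assumption that the per-sample statistics are estimated once (for example from a run on the full regions file) and then passed explicitly to every chunk. The flags used are the ones that appear in commands elsewhere in these issues. The helper names and the chunking scheme are illustrative.

```python
def split_into_chunks(records, n_chunks):
    """Split a list into up to n_chunks contiguous, nearly equal pieces."""
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        stop = start + size + (1 if i < extra else 0)
        if stop > start:  # skip empty pieces when n_chunks > len(records)
            chunks.append(records[start:stop])
        start = stop
    return chunks


def gangstr_args(bam, ref, regions, out, coverage, readlength, insertmean, insertsdev):
    """Command line for one chunk, with sample statistics fixed up front."""
    return [
        "GangSTR", "--bam", bam, "--ref", ref,
        "--regions", regions, "--out", out,
        "--coverage", str(coverage),
        "--readlength", str(readlength),
        "--insertmean", str(insertmean),
        "--insertsdev", str(insertsdev),
    ]
```

With fixed values every chunk uses the same coverage and insert size, so results should no longer depend on which loci happen to fall into a chunk.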

Cram Support

Is there any timeline for CRAM support? I am interested in running this on 5000 CRAMs.

As a workaround, could the CRAM reads overlapping the regions in the GangSTR reference BED files be extracted and converted to a BAM, and GangSTR then run on that BAM of TRE region reads? If it works, it would be faster than converting the whole CRAM to a BAM and would use much less space.
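
The extraction step proposed above maps onto a single `samtools view` call. The sketch below only assembles that command, and the file names are placeholders. One caveat, stated as an assumption about how it would interact with GangSTR: reads that map outside the listed regions (including mates and off-target reads) are dropped by this filter, which may affect any statistics or read classes that depend on them.

```python
def cram_region_extract_command(cram, reference_fasta, regions_bed, out_bam):
    """samtools command that writes reads overlapping the BED regions as BAM."""
    return [
        "samtools", "view",
        "-b",                   # write BAM output
        "-T", reference_fasta,  # reference used to decode the CRAM
        "-L", regions_bed,      # keep only alignments overlapping these regions
        "-o", out_bam,
        cram,
    ]
```

The resulting BAM would still need `samtools index` before use. Note also that a later report in this list passes a .cram file directly to --bam with version 2.5.0.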

Interesting Compiling issue

I am attempting to install GangSTR on the HPC that I administer, and I am using gcc/5.4.0, htslib/1.9, nlopt/2.4.2 and gsl/2.5. My configuration is:

./configure --prefix=/cm/shared/apps/GangSTR/2.4.6 HTSLIB_CFLAGS="-I$htslib_CF -I$nlopt_CF" HTSLIB_LIBS="-L$htslib_LD -L$nlopt_LD

where the CF and LD variables point to the appropriate things. When I compile, I get the following rather odd error:
...
g++ -DHAVE_CONFIG_H -I. -I.. -I../src/ -I.. -I -I -g -O2 -D_GIT_VERSION=""2.4.6"" -D_MACHTYPE=""x86_64"" -std=c++11 -g -O2 -D_GIT_VERSION=""2.4.6"" -D_MACHTYPE=""x86_64"" -std=c++11 -MT GangSTR-str_info.o -MD -MP -MF .deps/GangSTR-str_info.Tpo -c -o GangSTR-str_info.o test -f 'str_info.cpp' || echo './'str_info.cpp
mv -f .deps/GangSTR-str_info.Tpo .deps/GangSTR-str_info.Po
/bin/sh ../libtool --tag=CXX --mode=link g++ -g -O2 -D_GIT_VERSION=""2.4.6"" -D_MACHTYPE=""x86_64"" -std=c++11 -g -O2 -D_GIT_VERSION=""2.4.6"" -D_MACHTYPE=""x86_64"" -std=c++11 -lgsl -lgslcblas -lm -L -L -lnlopt_cxx -lm -o GangSTR GangSTR-main_gangstr.o GangSTR-common.o GangSTR-options.o GangSTR-locus.o GangSTR-region_reader.o GangSTR-gc_region_reader.o GangSTR-ref_genome.o GangSTR-genotyper.o GangSTR-read_class.o GangSTR-frr_class.o GangSTR-flanking_class.o GangSTR-enclosing_class.o GangSTR-spanning_class.o GangSTR-likelihood_maximizer.o GangSTR-mathops.o GangSTR-read_extractor.o GangSTR-bam_io.o GangSTR-sample_info.o GangSTR-stringops.o GangSTR-read_pair.o GangSTR-realignment.o GangSTR-ssw.o GangSTR-ssw_cpp.o GangSTR-vcf_writer.o GangSTR-bam_info_extract.o GangSTR-str_info.o -lgsl -lgslcblas -lm -L -L -lnlopt_cxx -lm
libtool: error: require no space between '-L' and '-L'
make[2]: *** [GangSTR] Error 1
make[2]: Leaving directory `/root/temp/GangSTR-2.4.6/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/temp/GangSTR-2.4.6'
make: *** [all] Error 2

Any thoughts? I would appreciate any pointers.

GangSTR gives very high GB values. Possibly related to piling FRRs

Hi,

Thanks again for the great tool.

I encountered another problem with GangSTR in a new run. Some loci in our data have suspiciously high GB values. Additionally there is a cluster at the highest value observed which is exactly 600. This seems rather unnatural and unlikely to me.

GB values over all samples and loci (y scale log10 transformed)

I first ran GangSTR with the standard settings providing only --bam, --ref, --regions and --out. After that gave me strange results I tried adding the --genomewide flag but still got the same problem.

I think the problem might be related to some of the loci having a very high number of FRRs due to a piling of the reads at certain loci during alignment.

GB values over all samples and loci indicating FRR count > 100(y scale log10 transformed)

There seems to be a trend of the high FRR count being correlated with a high GB value. This makes sense I guess since a larger repeat region increases the probability of producing FRRs but I am still suspicious regarding the clustering of values at 600 which makes me doubt the rest of the values too.

I saw that you mention the option of specifying a list of off-target regions for each TR in the Readme. I guess that could solve that problem or at least decrease its impact.
My question regarding that is: what's the recommended way to incorporate off-target FRRs? Should I specify, for each locus, all other regions with the same motif? Or filter regions by a certain read depth and only include those? If so, did you experiment with that, and do you have a recommended cut-off value for the read depth? Our median RD is 33, median FRR is 0.

FRR count over all samples and loci(y scale log10 transformed)

Thanks so much for your help!

Matthias

Request: parameter for input file containing bam files

Hi,

I see that GangSTR accepts a comma-separated list of BAM files via --bam.
It would be great if the tool had a parameter that takes a file listing the BAM files (with their paths). I suggest this based on HipSTR's --bam-files option.

Thanks.
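
In the meantime, a list file can be turned into the comma-separated value that --bam already accepts. A minimal sketch, assuming one path per line with blank lines and `#` comments ignored. The helper name and the list-file conventions are made up for illustration.

```python
def bam_list_to_argument(list_lines):
    """Join the paths from a BAM list file into one comma-separated string."""
    paths = []
    for line in list_lines:
        path = line.strip()
        if path and not path.startswith("#"):
            paths.append(path)
    return ",".join(paths)
```

With a file `bams.txt`, the value for --bam would be `bam_list_to_argument(open("bams.txt"))`.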

GangSTR --help outputs error

Hi,

I am quite new to bioinformatics. I recently tried to install GangSTR via the tarball and git clone. I also have cmake (version 3.20.0), libz-dev, libbz2-dev, and liblzma-dev installed.

I believe there were no errors during installation, as the last line of output from make is:
[100%] Built target GangSTR

But when I enter GangSTR --help I see the following error:
GangSTR: error while loading shared libraries: libhts.so.3: cannot open shared object file: No such file or directory

I am not sure whether the output of sudo cmake --install . helps, but it is shown below:
-- Install configuration: ""
-- Installing: /usr/local/bin/GangSTR
-- Set runtime path of "/usr/local/bin/GangSTR" to ""

Do let me know if you need other information. Thank you !

Erroneous results: GB value lies outside of the confidence interval and is unreasonably high

Hey,

first of all thanks again for the nice tool!

I have already exchanged some emails with Nima about this problem. This issue is for keeping track of the problem-solving process.

Short description

I encountered this problem while running GangSTR with standard settings on a subset of the regions file hg19_ver8.bed that includes only CAG and CTG regions (without --genomewide mode, although testing suggests this does not influence the problem). The problem became apparent on a first look at the data, which showed a clear-cut cluster of loci that deviate from the reference by ~230, while the mean of the non-outliers is around 0.
After some investigation I realized that all of the outliers had a GB value much larger than the upper bound of the confidence interval. This is not restricted to certain loci. This seems to be erroneous behaviour in GangSTR.

This Plot shows the clustering of erroneous values around 230 (DIF is the difference from the reference, POS the position on the chromosome):

The samples that we observed this behaviour for have a lower insert size than the average of our data.
The outliers also seem to differ in the Quality value GangSTR assigns to them.

While putting together some data for the developers I found that whether the error reproduces at a locus depends on the size of the region available around it. Subsetting the .bam file to only 1 MB around the locus gave normal values for the targeted loci, whereas a 2 MB file showed the erroneous behaviour.

Thanks for the help

Best Matthias

Key Error from REF and ALT fields when running DumpSTR on GangSTR vcf file

DumpSTR throws KeyError: 'csf1poatct' when running on a GangSTR vcf file (Platinum Genomes pedigree). I noticed the error was due to strings such as ‘csf1poatct,’ ‘d7s820tatc,’ and ‘d8s1179tcta’ in the REF and ALT fields in chr5 pos 149455887, chr7 pos 83789542, and chr8 pos 125907115.

reported low number of copies for the known expansion

Hi,
I am analyzing a WES sample with GangSTR 2.4. This sample has an SCA3 repeat expansion in the ATXN3 gene. I used the --nonuniform and --targeted parameters with this WES BAM and default values for all other parameters. The genotype GangSTR reports at this locus is 19,26 copies, which is surprisingly low. For the same locus, STRetch and ExpansionHunter reported copy numbers in the expected pathogenic range.
I am wondering whether there are any other parameters I have to set for WES. Also, does the number of reported copies depend on read length?
Thanks,
Bharati

Installation error and doubts for libraries installed

Hi Devs,

Thanks for the tool. I'm running into two issues/things:

Installation fails for gangSTR on my machine with following error:
./install-gangstr.sh ./tools_STRs/GangSTR/bin_GangSTR

./install-gangstr.sh: line 126: ./configure: No such file or directory
install-gangstr.sh error: Error configuring GangSTR

There's no configure file in the directory.

Next, I have htsfile in my path, but GangSTR still installs htslib in the dependencies folder, and it installs an older version of HTSlib (htslib-1.8.tar.bz2).

In my data processing I used samtools version 1.9.

Linux version:

uname -a
 4.9.0-4-amd64 #1 SMP Debian 4.9.51-1 (2017-09-28) x86_64 GNU/Linux

Let me know if other details are needed.

[GangSTR-2.5.0] ERROR: Error extracting read length

Hi, I encountered the following error when running GangSTR on nanopore reads:
[GangSTR-2.5.0] ERROR: Error extracting read length

My command line is as follows: GangSTR --bam gu_ccpp_mm2_RG_SM.sorted.bam --ref ../../GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --regions gangSTR_str.bed --out gu_ccpp_mm2_RG_SM_

Looking forward to your reply!
