gt1 / biobambam2

Tools for early stage alignment file processing

License: Other

Shell 2.50% C++ 82.91% Makefile 1.64% M4 0.86% Roff 12.09%

biobambam2's Issues

bamsort - multiple unsorted SAM input fails when no records

If you give bamsort a set of SAM files as input and none of them contains any actual reads, it fails. One such file causes no issue, but adding further files causes a failure:

$ grep -cv '^\@' 1_pindel_wt.sam
0
$ bamsort inputformat=sam O=test.bam I=1_pindel_wt.sam
[V] Reading alignments from source.
[V] read 0 alignments
[V] producing sorted output
[V] wrote 0 alignments
$ grep -cv '^\@' 2_pindel_wt.sam
0
$ bamsort inputformat=sam O=test.bam I=1_pindel_wt.sam I=2_pindel_wt.sam
BamMergeTemplate::BamMergeTemplate(): cannot merge, not all files are marked as sorted.

/software/CGP/external-apps/biobambam2-2.0.18/bin/../lib/libmaus2.so.2(libmaus2::util::StackTrace::StackTrace()+0x4c)[0x7f08b5d0d33c:??:0]
/software/CGP/external-apps/biobambam2/bin/bamsort(libmaus2::exception::LibMausException::LibMausException()+0x20)[0x418390:??:0]
/software/CGP/external-apps/biobambam2/bin/bamsort()[0x476d57:??:0]
/software/CGP/external-apps/biobambam2/bin/bamsort()[0x48f37e:??:0]
/software/CGP/external-apps/biobambam2/bin/bamsort()[0x48f8ec:??:0]
/software/CGP/external-apps/biobambam2/bin/bamsort()[0x4111c0:??:0]
/software/CGP/external-apps/biobambam2/bin/bamsort()[0x40d9a6:??:0]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f08b313376d:??:0]
/software/CGP/external-apps/biobambam2/bin/bamsort()[0x40e692:??:0]

Works fine for multiple unsorted SAM files when records are present.
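
A possible stopgap is to drop record-free inputs before building the bamsort command line. This is a minimal sketch, assuming plain-text SAM inputs; the helper names are hypothetical and not part of biobambam2. If every file is empty, the first one is kept, since the report above shows that a single empty input works.

```python
def count_sam_records(path):
    """Count alignment lines (non-empty lines not starting with '@')."""
    with open(path) as handle:
        return sum(1 for line in handle
                   if line.strip() and not line.startswith("@"))


def inputs_for_bamsort(paths):
    """Return I= arguments, skipping SAM files that contain no records.

    Hypothetical helper: if all files are empty, keep the first one so
    that at least one header reaches bamsort.
    """
    kept = [p for p in paths if count_sam_records(p) > 0]
    if not kept and paths:
        kept = [paths[0]]
    return ["I=" + p for p in kept]
```

Note that skipping a file also drops any header lines unique to it, so this only suits cases where the headers are interchangeable.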

disablevalidation option for bammerge possible ?

Hi,

I'm using biobambam2 and I have a few FASTQs with some '@' in the read names.

I have used bamsort etc. with disablevalidation=1 to avoid manipulating the raw FASTQ files by hand before launching the pipeline.

But when I then need to merge the BAMs into a single one, bammerge will refuse to proceed because of these 'non-valid' read names. Is it possible to have disablevalidation option in this tool ?

Best wishes,
Anthony

bamtofastq - put specified tags in fastq comment if present

If remapping a BAM file it would be useful to be able to retain information such as barcode (BC) information in the downstream files.

BWA mem supports taking data from the fastq comment and appending it to the mapped read in its output:

-C Append append FASTA/Q comment to SAM output. This option can be used to transfer read meta information (e.g. barcode) to the SAM output. Note that the FASTA/Q comment (the string after a space in the header line) must conform the SAM spec (e.g. BC:Z:CGTAC). Malformated comments lead to incorrect SAM output.

Could an option along the lines of the following be added?

tags=<[]>   BAM tags to be copied in to fastq comment, e.g. BC.
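
To make the request concrete, here is a sketch of the FASTQ header such an option might produce. The function and its tuple layout are hypothetical and only illustrate the output format; joining several tags with tabs is an assumption based on the aligner appending the comment verbatim to the SAM record.

```python
def fastq_record(name, seq, qual, tags=None):
    """Format one FASTQ record with SAM-style tags in the header comment.

    tags is a list of (tag, type, value) tuples, e.g. [("BC", "Z", "CGTAC")].
    Multiple tags are tab-separated so the comment remains valid SAM
    when it is copied verbatim into the alignment record.
    """
    header = "@" + name
    if tags:
        header += " " + "\t".join("%s:%s:%s" % t for t in tags)
    return "%s\n%s\n+\n%s\n" % (header, seq, qual)
```

For a read named r1 with tags=[("BC", "Z", "CGTAC")], the header line becomes `@r1 BC:Z:CGTAC`.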

Query: does bb2 support new bgzip fasta reference?

I recently tried to convert a BAM to CRAM with scramble providing a bgzip compressed fasta file and index as generated by samtools 1.3+ and found it wasn't compatible.

Is this supported in biobambam2 or would this limitation carry over (from io_lib)?

bamsormadup hangs on specific bam

Hi German,

I've been using biobambam2 for a while now and it's been a very fast, very solid tool -- thanks!

I've run into a weird problem with bamsormadup on one particular bam, and am wondering how I should debug it. For just one 165 GB bam generated by bwa mem, bamsormadup seems to hang during the duplicate marking step. bamsormadup has worked fine for all of the other bams in the same project (which go up to 158 GB in size), and re-mapping the problem bam (just in case odd random file corruption was to blame) hasn't helped.

The tail of the bamsormadup log is:

[V] 1906565714  01:10:34:63528599       MemUsage(size=8280.09,rss=7356.48,peak=8724.07) AutoArrayMemUsage(memusage=7351.46,peakmemusage=7466.17,maxmem=1.75922e+13)
[V] 1908071306  01:10:39:29409300       MemUsage(size=8280.09,rss=7356.48,peak=8724.07) AutoArrayMemUsage(memusage=7351.46,peakmemusage=7466.17,maxmem=1.75922e+13)
[V] 1908823598  01:10:40:24747099       MemUsage(size=8280.09,rss=7356.48,peak=8724.07) AutoArrayMemUsage(memusage=7351.55,peakmemusage=7466.17,maxmem=1.75922e+13)
[V] 1909575594  01:10:41:84439600       MemUsage(size=8280.09,rss=7356.48,peak=8724.07) AutoArrayMemUsage(memusage=7351.46,peakmemusage=7466.17,maxmem=1.75922e+13)
[V] 1910203706  01:10:46:07156000       MemUsage(size=8280.09,rss=7356.48,peak=8724.07) AutoArrayMemUsage(memusage=7351.46,peakmemusage=7466.17,maxmem=1.75922e+13)     final
[V] flushing read ends lists...done.
[V] merging read ends lists/computing duplicates...=>> PBS: job killed: walltime 10868 exceeded limit 10800

I've tried running this job for up to 12 hours with no luck. The process was given 16 cores, 24 GB RAM, and 400 GB scratch space; it used at most 7.2 GB RAM and 240 GB scratch.

Do you have any suggestions on how I can debug this? Are there any special flags to give to bamsormadup that might produce a more verbose log? The command used was:

bamsormadup threads=8 M=SK13-2B.sorted.markdup.txt tmpfile=${PBS_JOBFS}/ indexfilename=SK13-2B.sorted.markdup.bam.bai < SK13-2B.bam > SK13-2B.sorted.markdup.bam

Thanks again for your work,

Mark

Feature request: add rmdup capability to bamsort

Bamsort has a markduplicates flag that is quite useful. It would also be useful if it, like bammarkduplicates, had an rmdup=<[]> flag to remove the duplicates.

bamclipreinsert error when the sequence contains a single base with a quality of 9

If, after clipping, the sequence in a BAM file has a single base with a quality of 9, which corresponds to '*' in the SAM format, bamclipreinsert does not restore the quality values properly.

I've created two BAM files, each containing a single cluster, which illustrate the problem: input2.bam, in which the clipped forward read has a single base with a quality value of 9, and fixed2.bam, where the quality value has been changed to 10.

Compare
bamclipreinsert < input2.bam | samtools view
bamclipreinsert < fixed2.bam | samtools view

We actually noticed this error because bam12auxmerge failed with the error 'Invalid quality value', when we tried re-inserting the clipped bases. The sam representation of the resulting bam file is somewhat different to that produced by bamclipreinsert but I assume the cause is the same.
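
The collision described above follows from the Phred+33 encoding used in SAM text: quality 9 maps to ASCII 42, which is '*', the same character SAM uses to mean "no qualities stored". A small sketch (the function names are illustrative only):

```python
def phred_to_sam_qual(quals):
    """Encode a list of Phred scores as a SAM QUAL string (Phred+33)."""
    return "".join(chr(q + 33) for q in quals)


def is_ambiguous_qual(quals):
    """True when the encoded string is exactly '*', the SAM 'missing' marker."""
    return phred_to_sam_qual(quals) == "*"
```

Only a single base with quality 9 triggers the ambiguity; a quality of 10 encodes as '+', which is why fixed2.bam behaves differently.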

bamtofastq

Hi!

I have downloaded a pair of tumor/normal BAM files and I want to realign them. So I tried to convert them into FASTQ files, using biobambam2 like this:

bamtofastq collate=1 inputformat=bam level=5 exclude=QCFAIL,SECONDARY,SUPPLEMENTARY filename=name_normal.bam gz=1 outputdir=/output/dir/ outputperreadgroup=1 outputperreadgroupsuffixF=_1.fq.gz outputperreadgroupsuffixF2=_2.fq.gz outputperreadgroupsuffixO=_o1.fq.gz outputperreadgroupsuffixO2=_o2.fq.gz outputperreadgroupsuffixS=_s.fq.gz tryoq=1

The FASTQ files obtained are named with the @RG ID of the BAM file plus the corresponding suffix (e.g. _1.fq.gz), instead of the name of the file plus the read group ID (@RG ID) plus the suffix that I wrote in the command (name_normal + @RG ID + suffix).

Is there a way to tell biobambam2 to also put the original BAM filename in the FASTQ output names? Or is there a way to tell biobambam2, apart from using the RG ID, to use the sample name (@RG SM tag) in the FASTQ file name too?

"Too many open files"

I am working with very large data: the gzipped FASTQ is 250 GB. I split the FASTQ file into ~1,200 smaller FASTQ files and aligned them with BWA, using standard parameters.

Now I am trying to merge the 1,200 small BAM files (~350 Mb each) with biobambam2.

Immediately upon calling biobambam2 bammerge, it fails with error message: "Too many open files"
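
One way around a per-process descriptor limit is to merge in stages: merge batches that fit under the limit, then merge the intermediate results. The sketch below only plans the batches and queries the current soft limit; it assumes a Unix system (the `resource` module) and the helper names are hypothetical.

```python
import resource


def open_file_limit():
    """Return the soft limit on open file descriptors for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def plan_merge_batches(paths, max_open):
    """Split paths into consecutive batches of at most max_open files."""
    if max_open < 2:
        raise ValueError("need at least two inputs per merge")
    return [paths[i:i + max_open] for i in range(0, len(paths), max_open)]
```

With 1,200 inputs and a batch size of 500, this yields three first-stage merges followed by one final merge of three files. Where permitted, raising the limit with `ulimit -n` may be simpler.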

default: extend output file name

Many programs indicate that specifying true for index or md5 will result in a file name extended from the output file name. In many cases this fails, e.g.

$ biobambam2-2.0.33-release-20160317091357-x86_64-etch-linux-gnu/bin/bamdownsamplerandom p=0.1 filename=full.bam O=ds.bam index=1
[V] no filename for index given, not creating index

Do not crash if no argument was provided

Hi,
I find it unusual that bamsort, bammerge and probably many other utils crash if they are executed without an argument. Please assume '-h' was passed to the utility and display appropriate usage help text.

$ bammerge
Refusing write binary data to terminal, please redirect standard output to pipe or file.

/usr/lib64/libmaus2.so.2(libmaus2::util::StackTrace::StackTrace()+0x5f)[0x7fffceaed4df]
bammerge(libmaus2::exception::LibMausException::LibMausException()+0x20)[0x4128c0]
bammerge()[0x410fd3]
bammerge()[0x40ce06]
/lib64/libc.so.6(__libc_start_main+0xf0)[0x7fffcd8ba280]
bammerge()[0x40d4da]

$

Improper header parsing resulting in incorrect 'Malformed SAM header line:' error message

The bamreset tool parses the header incorrectly.

It complains that the @RG line is malformed if it is tab delimited, whereas if @RG and the first value are separated by a tab and all other values are separated by a space, the 'Malformed SAM header line' error is not reported.

Per the header section in https://samtools.github.io/hts-specs/SAMv1.pdf

In the header, each line is TAB-delimited and, apart from @CO lines, each data field follows a format ‘TAG:VALUE’ where TAG is a two-character string that defines the format and content of VALUE. Thus header lines match /^@[A-Z][A-Z](\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/
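
The two patterns from the specification can be checked directly against candidate header lines. A minimal sketch; the function name and the sample lines are made up for illustration.

```python
import re

# Patterns as given in the header section of the SAM specification.
HEADER_LINE = re.compile(r"^@[A-Z][A-Z](\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$")
COMMENT_LINE = re.compile(r"^@CO\t.*")


def is_valid_header_line(line):
    """Check one header line (without trailing newline) against the spec."""
    return bool(HEADER_LINE.match(line) or COMMENT_LINE.match(line))
```

A fully tab-delimited line such as "@RG\tID:rg1\tSM:s1" passes. So does "@RG\tID:rg1 SM:s1", but only because the space is a legal character inside VALUE, which makes the whole remainder the value of ID.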

Feature request: split by chromosome

Biobambam2 is an incredibly useful package and it seems like I am constantly discovering new features and applications. It would be helpful if the readme had a short description or if there were a description file somewhere.

Also, with all of the functionality, I have not yet seen a function to split a BAM into separate chromosomes. I'm looking for something that splits each contig into a new BAM file for the purpose of parallelized variant calling. It would also be helpful if several small contigs were aggregated so there is a lower limit on file size. Does biobambam2 already have something for this?

Thanks!

MC:Z: tag

Hi,
I've got a situation where bwa mem | bamsormadup can write reads with an empty "MC:Z:" tag. The resulting BAM file passes biobambam2 bamvalidate, but fails Picard ValidateSamFile and thus likely GATK tools. Reading the SAM spec it seems that an empty "MC:Z:" is invalid, so can you please clarify? I'm using the latest 2.0.49. Here's some output:

Picard ValidateSamFile output:

ERROR: Record 551, Read name ST-E00118:53:H02GVALXX:1:1113:3172:2135, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR: Record 552, Read name ST-E00118:53:H02GVALXX:1:1113:3172:2135, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR: Record 551, Read name ST-E00118:53:H02GVALXX:1:1113:3172:2135, Mate CIGAR string does not match CIGAR string of mate
ERROR: Record 552, Read name ST-E00118:53:H02GVALXX:1:1113:3172:2135, Mate CIGAR string does not match CIGAR string of mate
ERROR: Record 553, Read name ST-E00118:53:H02GVALXX:1:1206:29541:52380, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR: Record 554, Read name ST-E00118:53:H02GVALXX:1:1206:29541:52380, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR: Record 553, Read name ST-E00118:53:H02GVALXX:1:1206:29541:52380, Mate CIGAR string does not match CIGAR string of mate
ERROR: Record 554, Read name ST-E00118:53:H02GVALXX:1:1206:29541:52380, Mate CIGAR string does not match CIGAR string of mate

bamvalidate output

$ cat NA12878.SPRR.R1.bam | bamvalidate 
NULL

the bad reads

after bwa mem | bamsormadup

$ samtools view NA12878.SPRR.R1.bam | grep "\tMC:Z:$"
ST-E00118:53:H02GVALXX:1:1113:3172:2135 77  *   0   0   *   *   0   0   TGGTGTCCGTGCCCGGTTTCCTTTAGGCTCAACTGTTGTTAGAGTGATGTTTTCGGAGGGGGAGCAGCGGTGGAAGCAGGAGTGGCTACGATAGAGGGATGAGGGGAAGGGAGTGAAGGAGGTTTGTGAGCAAGTAAGTGNNNNNTGTTAN ><=>??-9-<<+--5<===-4><=5==+>--,*7<?=(+36933+/,5==+9=0#(23'(4(,-*-4+*8*).,+6+.6)-89=)+76*#5*+59>>>-+)-7)):--1)(62<@>-;).7*,46?@*.8-/06-,.+::#####<=:-/#AS:i:0   XS:i:0  RG:Z:NA12878.SPRR   ms:i:1872   mc:i:0  MC:Z:
ST-E00118:53:H02GVALXX:1:1113:3172:2135 141 *   0   0   *   *   0   0   TGGTGTCCGTGCCCGGTTTCCTTTAGGCTCAACTGTTGTTAGAGTGATGTTTTCGGAGGGGGAGCAGCGGTGGAAGCAGGAGTGGCTACGATAGAGGGATGAGGGGAAGGGAGTGAAGGAGGTTTGTGAGCAAGTAAGTGNNNNNTGTTAN ><=>??-9-<<+--5<===-4><=5==+>--,*7<?=(+36933+/,5==+9=0#(23'(4(,-*-4+*8*).,+6+.6)-89=)+76*#5*+59>>>-+)-7)):--1)(62<@>-;).7*,46?@*.8-/06-,.+::#####<=:-/#AS:i:0   XS:i:0  RG:Z:NA12878.SPRR   ms:i:1872   mc:i:0  MC:Z:
ST-E00118:53:H02GVALXX:1:1206:29541:52380   77  *   0   0   *   *   0   0   CTTTGAACATCCTCCTGACATCCGTTGGCTCCACTCATCTACTTCGCTGGCCCGCGCGCTTCCCAGGTCTTTGTCCGGGGCTCGAGCCACTCTCCTGTCGCCACCTACCACTTGCCTTCTCCTCCCAGCGTTATNNNNNNNNNCNNCNGNG >>.>+-;>;<??8?>.),8-5=>59:')?.?<,*--,*>,?+?4,#@<?(>:>":#)#=-*=9<48(,>4):?++-$*))+.-$5)55,=.:.@>=)<-6A56>-.++-->:36A??,9/-:.7,@-*,&.-..#########+##.#5#- AS:i:0  XS:i:0  RG:Z:NA12878.SPRR   ms:i:1807   mc:i:0  MC:Z:
ST-E00118:53:H02GVALXX:1:1206:29541:52380   141 *   0   0   *   *   0   0   CTTTGAACATCCTCCTGACATCCGTTGGCTCCACTCATCTACTTCGCTGGCCCGCGCGCTTCCCAGGTCTTTGTCCGGGGCTCGAGCCACTCTCCTGTCGCCACCTACCACTTGCCTTCTCCTCCCAGCGTTATNNNNNNNNNCNNCNGNG >>.>+-;>;<??8?>.),8-5=>59:')?.?<,*--,*>,?+?4,#@<?(>:>":#)#=-*=9<48(,>4):?++-$*))+.-$5)55,=.:.@>=)<-6A56>-.++-->:36A??,9/-:.7,@-*,&.-..#########+##.#5#- AS:i:0  XS:i:0  RG:Z:NA12878.SPRR   ms:i:1807   mc:i:0  MC:Z:

cheers,
Mark
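
For anyone wanting to locate such records without relying on how grep handles "\t", a small sketch that reports Z-type tags with an empty value in a SAM line. The function name is hypothetical, and whether an empty value is actually invalid depends on the SAM specification version in use.

```python
def empty_z_tags(sam_line):
    """Return the names of optional Z-type tags whose value is empty."""
    fields = sam_line.rstrip("\n").split("\t")
    empty = []
    for field in fields[11:]:  # optional fields start at column 12
        parts = field.split(":", 2)
        if len(parts) == 3 and parts[1] == "Z" and parts[2] == "":
            empty.append(parts[0])
    return empty
```

Feeding each line of `samtools view` output through this function would flag the four records shown above with ["MC"].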

bamtofastq - output ordering

Is there any guarantee that a non-paired read will not be output between the two reads of a pair when F, F2 and S all go to the same file/pipe?

I'm looking to have a piped process digest collated fastq data but I also want to be able to accept any reads that may be single-end (flag bit 1 not set). Knowing that these will not be output between reads of a pair would be helpful.

Thanks

bamsormadup bin size errors with SAM input from bwa

German;
We're running into an issue with using bamsormadup directly on SAM output from bwa. If you run with any htsjdk based tools you'll get errors/warnings about the BAM record:

ERROR: Record 80117, Read name HWI-ST1124:106:C15APACXX:1:1306:21079:31074, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned

The BAMs appear to work okay but generate a ton of messages and potentially will have slower look ups.

This appears to only happen when running from SAM input. This type of input generates the issue:

bwa mem hg19/bwa/hg19.fa 1.fq 2.fq | bamsormadup inputformat=sam SO=coordinate > piped.bam

while first feeding to samtools does not:

bwa mem hg19/bwa/hg19.fa 1.fq 2.fq | samtools view -b | bamsormadup inputformat=bam SO=coordinate > intermediate.bam

This is a self-contained test case that demonstrates the issue using picard ValidateSamFile:

wget https://s3.amazonaws.com/chapmanb/testcases/bamsormadup_bin_field.tar.gz

Let me know if I can provide any other details to help debug. Thanks as always for the awesome tools and all the help.
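
For reference, the bin value that validators recompute is defined in the SAM specification by the reg2bin function, given there in C. Below is a Python transcription, assuming a 0-based, half-open interval as in the specification.

```python
def reg2bin(beg, end):
    """Compute the BAM bin for a 0-based, half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0
```

Comparing this value with the bin stored in a record, using the alignment start and the end derived from the CIGAR, shows which records a validator would flag.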

Empty MC:Z tags from bamsormadup in 2.0.57

German;
I'm running into the same issue as reported in #24 in the latest release and will get empty MC:Z tags from bamsormadup:

HWI-EAS264:7:101:6410:6415#0    121     chrM    389     60      76M     =       389     0       CAGATTTCAAATTTTATCTTTTGGCGGTATGCACTTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCC    BA?BABDBBDDBBBBDADCDDCDCBDCCCCCDCCCDDCCDCCC@@BCCCCCCCCCCCCCDCB@CCCCC@BCCCCCC    NM:i:1  MD:Z:21A54      AS:i:71 XS:i:0  RG:Z:Test1      ms:i:179        mc:i:388        MC:Z:
HWI-EAS264:7:101:6410:6415#0    181     chrM    389     0       *       =       389     0       GTTGGGGTTTTTGTTTTTGGGGGTTGGGAGGGGTGGGGTTAAGGGGTTGGGGCAGGAGGGGGGGGGGGGGGGGGGG    ######################################################################A=A==@    AS:i:0  XS:i:0  RG:Z:Test1      MQ:i:60 ms:i:2563       mc:i:464

I put together a small test case that demonstrates the problem:

wget http://s3.amazonaws.com/chapmanb/testcases/biobambam_empty_mc.tar.gz

Thanks so much for the help and let me know if I can provide any other information.

incomplete pairs

Hi!

I have a very naive question. When using bamtofastq, the BAM file is split into first and second mates of paired reads, and then there are the incomplete reads, which also come in first-mate and second-mate files.

What exactly does "incomplete" refer to in the documentation (https://github.com/gt1/biobambam2/blob/master/src/programs/bamtofastq.1)? Can you please tell me whether incomplete means unpaired (= unmatched) reads? Does this have anything to do with unmapped reads?

-outputperreadgroupsuffixO=<_o1.fq>
output file name suffix for first mates of incomplete pairs if outputperreadgroup=1.
Default is _o1.fq if gz=0 and _o1.fq.gz for gz=1.
-outputperreadgroupsuffixO2=<_o2.fq>
output file name suffix for second mates of incomplete pairs if outputperreadgroup=1.
Default is _o2.fq if gz=0 and _o2.fq.gz for gz=1.
-outputperreadgroupsuffixS=<_s.fq>
output file name suffix for single end reads if outputperreadgroup=1.
Default is _s.fq if gz=0 and _s.fq.gz for gz=1.

Besides, what is the difference between single end reads and unmatched (orphan) reads when defining the output files of bamtofastq? Is the *_s.fastq.gz file the sum of *_o1.fastq.gz and *_o2.fastq.gz?
-S=:
output file for single end reads if collation is active
-O=:
output file for unmatched (orphan) first mates if collation is active.
-O2=:
output file for unmatched (orphan) second mates if collation is active.

Thanks!
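
One reading of the man page text quoted above, not confirmed by the author, is that the split depends only on the pairing flags and on whether both mates are present in the file, not on mapping status. The sketch below illustrates that interpretation; the function and bucket names are hypothetical.

```python
PAIRED, FIRST, SECOND = 0x1, 0x40, 0x80  # SAM flag bits


def classify_reads(records):
    """Sort (name, flag) records into F, F2, O, O2 and S buckets.

    Assumed semantics: S = reads not flagged as paired; F/F2 = both
    mates present; O/O2 = flagged as paired but the mate is missing.
    """
    firsts, seconds = {}, {}
    out = {"F": [], "F2": [], "O": [], "O2": [], "S": []}
    for name, flag in records:
        if not flag & PAIRED:
            out["S"].append(name)
        elif flag & FIRST:
            firsts[name] = flag
        elif flag & SECOND:
            seconds[name] = flag
    for name in firsts:
        if name in seconds:
            out["F"].append(name)
            out["F2"].append(name)
        else:
            out["O"].append(name)
    for name in seconds:
        if name not in firsts:
            out["O2"].append(name)
    return out
```

Under this interpretation the S file is not the sum of the O and O2 files: a read lands in S only if it was never part of a pair, and an unmapped pair with both mates present still goes to F and F2.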

FYI: biobambam2 on dockstore.org

Hi,

I've started a repository for mapping biobambam2 tools to CWL. I intend to add tools to the mappings as I require them or on request.

Repo is here.
Dockstore entry for biobambam2 here.

bammerge - option for output file not documented

bammerge has options to generate md5 and index filenames based on the output file name; however, the option for setting the output file is not documented in the man page or in the -h output.

I've tried various values used in other steps that have this same option (O, outputfile, filename, file).

Having a look at the code I can't see anything that attempts to load any output filenames.

Also the option IL is not in the short help given by -h.

bamcollate2 - errors but returns 0 exit code

Hi,

I've had an instance of converting CRAM to BAM with bamcollate2 (so I can get the index at the same time) where the conversion from CRAM failed but a valid BAM and index were still generated. The BAM is incomplete but the exit code is 0, see below.

bamcollate2 inputformat=cram outputformat=bam collate=0 index=1 outputthreads=8 exclude= filename=/datastore/input/COLO-829-BL.cram O=/datastore/input/COLO-829-BL.bam indexfilename=/datastore/input/COLO-829-BL.bam.bai
1               247074
2               248435
...
320             262795
ERROR: md5sum reference mismatch for ref 15 pos 35230902..46388026
CRAM: 9997330cdbd9d427108443772e563135
Ref : 147f605857d732deab981519ab7a4a05
Failure to decode slice
[V] 336243699
[V] MemUsage(size=648.18,rss=42.5898,peak=727.109) wall clock time 18:48:99723199
cgpbox@3a0b5a18e6ff:~$ echo $?
0
cgpbox@3a0b5a18e6ff:~$ bamcollate2 -v
This is biobambam2 version 2.0.50.
biobambam2 is distributed under version 3 of the GNU General Public License.
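
Until the exit code reflects such failures, a pipeline can guard itself by also scanning stderr. This is a minimal sketch of a wrapper; the marker strings are taken from the log above, and the wrapper itself is hypothetical, not part of biobambam2.

```python
import subprocess

# Strings seen in the failing run above; extend as needed.
ERROR_MARKERS = ("ERROR:", "Failure to decode")


def run_checked(argv):
    """Run a command; fail on non-zero exit or on error markers in stderr."""
    proc = subprocess.run(argv, stderr=subprocess.PIPE, text=True)
    bad = [line for line in proc.stderr.splitlines()
           if any(marker in line for marker in ERROR_MARKERS)]
    if proc.returncode != 0 or bad:
        raise RuntimeError("command failed (exit %d): %s"
                           % (proc.returncode, "; ".join(bad)))
    return proc
```

Standard output is left untouched, so the wrapper can still be used when the tool writes its result to stdout.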

Question - bammarkduplicates2 - does it invoke 'top'?

Hi,

Just found a very odd log, which appears to include the output of top, that indicated a memory allocation error.

AutoArray<unsigned char,alloc_type_cxx> failed to allocate 268435456 elements (268435456 bytes)
current total allocation 306663325

Is biobambam invoking 'top' or do I need to raise this with our systems group?

Thanks

Conflicting executable names

Biobambam2's bamtofastq conflicts with Bedtools' bamToFastq and
Biobambam2's fastaexplode conflicts with Exonerate's fastaexplode.
You may want to consider renaming these two tools. Adding a prefix to the tool names, such as biobambam-fastaexplode is a good way to avoid collisions.

Not working on CentOS 7: libmaus2-related error

Hi there, I tried installing it with conda install biobambam, which installs version 2.0.62-0. When I run bamtofastq it gives me this error, which seems to be related to libmaus2:

Unable to parse argument as type std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >

/root/miniconda2/share/biobambam-2.0.62-0/bin/../lib/libmaus2.so.2(libmaus2::util::StackTrace::StackTrace()+0x55)[0x7f6bda09bfd5]
bamtofastq(libmaus2::exception::LibMausException::LibMausException()+0x20)[0x41aec0]
bamtofastq()[0x430655]
bamtofastq()[0x4307a3]
bamtofastq()[0x4173a9]
bamtofastq()[0x4183a2]
bamtofastq()[0x4121b5]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bd7210b35]
bamtofastq()[0x4133d2]

I tried to compile it (version 2.0.70) with libmaus2 as described in README.md, but it again throws a similar-looking error:

Unable to parse argument as type std::string

/arun/...../libmaus2/lib/libmaus2.so.2(libmaus2::util::StackTrace::StackTrace()+0x54)[0x7f27d374e194]
bamtofastq(libmaus2::exception::LibMausException::LibMausException()+0x20)[0x436c80]
bamtofastq(std::string libmaus2::util::ArgInfoParseBase::parseArg<std::string>(std::string const&)+0xea)[0x44ccda]
bamtofastq(std::string libmaus2::util::ArgInfo::getValue<std::string>(std::string const&, std::string) const+0x43)[0x44ce13]
bamtofastq(bamtofastqCollating(libmaus2::util::ArgInfo const&)+0x2a6)[0x4334d6]
bamtofastq(bamtofastq(libmaus2::util::ArgInfo const&)+0x3bc)[0x4346ec]
bamtofastq(main+0x16b3)[0x42e243]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f27d1c12b35]
bamtofastq()[0x42f73f]

bammarkduplicates2 - improving throughput

Hi,

I'm running biobambam2 on a high cpu/RAM host and have been attempting to improve the throughput by modifying various parameters but I'm not having much luck.

I'm assuming that to modify the profile I need to change several parameters together. It's also not clear where in the process I should expect 'markthreads' to affect the timings in the log (or whether high values here are actually detrimental).

The process involves many input files.

Thanks

P.s. I found this thread from the biobambam repo (now gone) but not sure if it's still relevant:

There are some things coming up, but they are not production ready yet. If you wanted to do the task with biobambam I would suggest to perform the following steps:

merge the files:
bammerge level=0 in1.bam in2.bam ... in5.bam | bamrecompress numthreads=64 > merged.bam

run the duplicate marking:
bammarkduplicates2 I=merged.bam O=marked.bam M=marked.metrics inputthreads=64 outputthreads=64 markthreads=64

Installation Error. "programs/bamadapterfind.cpp: using ‘typename’ outside of template"

I successfully installed libmaus2, and have also successfully installed biobambam2 on other operating systems, but I always get the error below when running make install on CentOS 6.8:

...
programs/bamadapterfind.cpp:143: error: using ‘typename’ outside of template
programs/bamadapterfind.cpp:144: error: using ‘typename’ outside of template
make[1]: *** [programs/bamadapterfind-bamadapterfind.o] Error 1
...

bamcollate2 not reporting all reads

Hi,

I am trying to use bamcollate2 on an RNA-seq BAM file aligned with map-splice.
But after processing with bamcollate2 I am missing reads that were mapped to multiple locations.
Can you let me know how I can get all the reads out of bamcollate2?
The file that I am using is the RNA-seq BAM file found on the GDC legacy website with file id 9b1a94fa-d6e8-49c5-a552-2da0e0ffe893.
The outputs from the file before and after collating are below.
Before, there are three reads (two lines for one read, which is a fusion alignment), and after bamcollate2 there are only two reads in the BAM file.

[mirahan@hanlab-dell1 pre-mrna]$ samtools view bam/TCGA-CG-5720-01A.ver2.bam | grep HS2_251:8:1101:1049:197409
HS2_251:8:1101:1049:197409/2 115 chr2 133038633 255 21M54S = 230045563 -97006910 GTTCAACTGCTGTTCACATGGTCGCCCGTCCCTTCGGAACGGCGCTCGCCCATCTCTCAGGACCGACTGACCCAT @b@FFFEFDDBHHGGHHFGIIIIG<A?CBF9EHIGH>GHHH?G8BGHIIIIJBBEEF=ACC@BB@@b@B>B@BB# XF:Z:ATAC, ZF:Z:FUS_133038633_230045616(--) RG:Z:EXTERNAL_9ed6bfd4-c1c1-44ae-88d8-ff01224de6de_20141215_110858_1_ IH:i:2 YH:Z:1.1 HI:i:1 YI:i:1 NM:i:3 XS:A:+
HS2_251:8:1101:1049:197409/2 499 chr2 230045563 255 21S54M = 133038633 97006910 GTTCAACTGCTGTTCACATGGTCGCCCGTCCCTTCGGAACGGCGCTCGCCCATCTCTCAGGACCGACTGACCCAT @b@FFFEFDDBHHGGHHFGIIIIG<A?CBF9EHIGH>GHHH?G8BGHIIIIJBBEEF=ACC@BB@@b@B>B@BB# XF:Z:ATAC, ZF:Z:FUS_133038633_230045616(--) RG:Z:EXTERNAL_9ed6bfd4-c1c1-44ae-88d8-ff01224de6de_20141215_110858_1_ IH:i:2 YH:Z:1.2 HI:i:2 YI:i:1 NM:i:3 XS:A:+
HS2_251:8:1101:1049:197409/1 69 * 0 0 * * 0 0 CNGGGGATCTGAACCCGACTCCCTTTCGATCGGCCGAGGGCAACGGAGGCCATCGCCCGTCCCTTCGGAACGGCG @#1=ADBDFHFHFGHIIIBHIIG9CGFF;?DABBDEGIEGHEHEEB?C/;?<C?7(38<7?@BCCCBBBBBBBBB RG:Z:EXTERNAL_9ed6bfd4-c1c1-44ae-88d8-ff01224de6de_20141215_110858_1_ IH:i:0 HI:i:0
[mirahan@hanlab-dell1 pre-mrna]$ samtools view TCGA-CG-5620-01A.collated.bam | grep HS2_251:8:1101:1049:197409
HS2_251:8:1101:1049:197409/1 69 * 0 0 * * 0 0 CNGGGGATCTGAACCCGACTCCCTTTCGATCGGCCGAGGGCAACGGAGGCCATCGCCCGTCCCTTCGGAACGGCG @#1=ADBDFHFHFGHIIIBHIIG9CGFF;?DABBDEGIEGHEHEEB?C/;?<C?7(38<7?@BCCCBBBBBBBBB RG:Z:EXTERNAL_9ed6bfd4-c1c1-44ae-88d8-ff01224de6de_20141215_110858_1_ IH:i:0 HI:i:0
HS2_251:8:1101:1049:197409/2 115 chr2 133038633 255 21M54S = 230045563 -97006910 GTTCAACTGCTGTTCACATGGTCGCCCGTCCCTTCGGAACGGCGCTCGCCCATCTCTCAGGACCGACTGACCCAT @b@FFFEFDDBHHGGHHFGIIIIG<A?CBF9EHIGH>GHHH?G8BGHIIIIJBBEEF=ACC@BB@@b@B>B@BB# XF:Z:ATAC, ZF:Z:FUS_133038633_230045616(--) RG:Z:EXTERNAL_9ed6bfd4-c1c1-44ae-88d8-ff01224de6de_20141215_110858_1_ IH:i:2 YH:Z:1.1 HI:i:1 YI:i:1 NM:i:3 XS:A:+
[mirahan@hanlab-dell1 pre-mrna]$

bammerge: Accept wildcard regexp when parsing I= input

It is confusing that bammerge wants to write to STDOUT. I think it is a very bad idea because the shell redirect will buffer the data and, for example, if the target disk fills up during writing, the buffer will keep growing until the kernel runs out of memory.

Please introduce O= flag.

Further, it would be nice if say I=dir/file_prefix.[0-9].bam was possible.

The error message seems funny.

$ bammerge level=9 index=1 I=ee_16AUT1C3/HM2YTCCXX.?.ee_16AUT1C3.bwa.sorted.bam IL=ee_16AUT1C3/HM2YTCCXX.ee_16AUT1C3.bwa.sorted. indexfilename=ee_16AUT1C3/HM2YTCCXX.ee_16AUT1C3.bwa.sorted.bam.bai > ee_16AUT1C3/HM2YTCCXX.ee_16AUT1C3.bwa.sorted.bam
PosixFdInput(ee_16AUT1C3/HM2YTCCXX.ee_16AUT1C3.bwa.sorted.,0): No such file or directory

/usr/lib64/libmaus2.so.2(libmaus2::util::StackTrace::StackTrace()+0x5f)[0x7fffceaed4df]
bammerge(libmaus2::exception::LibMausException::LibMausException()+0x20)[0x4128c0]
/usr/lib64/libmaus2.so.2(libmaus2::aio::PosixFdInput::PosixFdInput(std::string const&, int)+0x19d)[0x7fffcead629d]
/usr/lib64/libmaus2.so.2(libmaus2::aio::PosixFdInputStreamFactory::constructUnique(std::string const&)+0x272)[0x7fffcead6ce2]
bammerge(libmaus2::aio::InputStreamFactoryContainer::constructUnique(std::string const&)+0x51)[0x4197d1]
bammerge()[0x40fdd1]
bammerge()[0x40ce06]
/lib64/libc.so.6(__libc_start_main+0xf0)[0x7fffcd8ba280]
bammerge()[0x40d4da]
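
As a workaround for the missing wildcard support, the pattern can be expanded before the command line is built, producing one I= argument per file. A sketch under that assumption; the helper name is hypothetical.

```python
import glob


def merge_inputs(pattern):
    """Expand a shell-style pattern into sorted I= arguments."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError("no input matches %r" % pattern)
    return ["I=" + p for p in paths]
```

The shell does not expand the pattern on its own here because the word it sees starts with "I=", so no existing file name matches it and the text is passed through literally.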

bammarkduplicates piping to stdout problem

Hi,
thanks for this efficient tool suite!

I have an issue when I don't provide an output file directly with "O=" to bammarkduplicates2 in order to
pipe the stdout to samtools again. It seems that bammarkduplicates itself has some problem with this and stops with an error after the first phase:

$ samtools view -h -f 2 -F 780 MY.BAM | samtools view -b -u - | ~/install/biobambam2-2.0.60/bin/bammarkduplicates2 markthreads=10 inputbuffersize=1310720 level=0 verbose=1 rewritebam=2 2>bamdup.log | samtools view -@10 -b - >MYBAM.bammarkdup.bam

...
[V] 657457153 als, 657457153 mapped frags, 326713918 mapped pairs, 202604 frags/s MemUsage(size=928.965,rss=178.41,peak=1276.28) time 2.90766 total 54:05:70748000
[V] 658505728 als, 658505728 mapped frags, 327220159 mapped pairs, 202668 frags/s MemUsage(size=928.965,rss=178.41,peak=1276.28) time 4.15273 total 54:09:86047099
[D] excntpairs=58079412 fincntpairs=169475253 strcntpairs=100066447
[D] excntfrags=138983750 fincntfrags=520351682 strcntfrags=0
[V] fragment and pair data computed in time 3253.13 (54:13:12949200)
[V] 659335432 lines, 659335432 als, 659335432 mapped frags, 327621112 mapped pairs, 202719 frags/s MemUsage(size=784.934,rss=82.3555,peak=1276.28)
[V] Checking pairs...done, rate 1.80079e+06
[V] Checking single fragments...done, rate 2.06009e+06
[V] number of alignments marked as duplicates: 412919402 time 3414.26 (56:54:26030099)
# /home/heyne/install/biobambam2-2.0.60/bin/bammarkduplicates2 markthreads=10 inputbuffersize=1310720 level=0 verbose=1 rewritebam=2

##METRICS
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 0 327621112 0 0 206459701 0 0.630178 132273373

## HISTOGRAM
BIN VALUE
1 1
2 1.08401
3 1.09106
4 1.09166
...
100 1.09171
[D] using incremental BAM header parser on parallel recoder.

BgzfInflateHeaderBase::readHeader(): invalid header data (unexpected bytes)
/home/heyne/install/libmaus2--2.0.281/lib/libmaus2.so.2(_ZN8libmaus24util10StackTraceC1Ev+0x54) [0x7f25c0747f94]
/home/heyne/install/biobambam2-2.0.60/bin/bammarkduplicates2(_ZN8libmaus29exception16LibMausExceptionC1Ev+0x20) [0x444020]
/home/heyne/install/biobambam2-2.0.60/bin/bammarkduplicates2(_ZN8libmaus22lz15BgzfInflateBase9readBlockISiEENS1_13BaseBlockInfoERT_+0x50) [0x448920]
/home/heyne/install/biobambam2-2.0.60/bin/bammarkduplicates2(_ZN8libmaus22lz16BgzfInflateBlock9readBlockISiEEbRT_+0x41) [0x44caf1]
/home/heyne/install/biobambam2-2.0.60/bin/bammarkduplicates2(_ZN8libmaus22lz32BgzfInflateDeflateParallelThread3runEv+0xc8) [0x495928]
/home/heyne/install/biobambam2-2.0.60/bin/bammarkduplicates2(_ZN8libmaus28parallel11PosixThread8dispatchEPv+0x18) [0x442aa8]
/lib64/libpthread.so.0(+0x7dc5) [0x7f25befc2dc5]
/lib64/libc.so.6(clone+0x6d) [0x7f25becefced]

When I use "O=MYBAM.bammarkdup.bam" without piping all works fine.

$ samtools view -h -f 2 -F 780 -q 10 MY.BAM | samtools view -b -u - | ~/install/biobambam2-2.0.60/bin/bammarkduplicates2 markthreads=10 inputbuffersize=1310720 verbose=1 O=MYBAM.bammarkdup.tmp.bam

What is the right way to use piping to stdout?
(I have to admit that I never tried piping with "verbose=0")

bamadapterfind: clip default '-h' vs man

Just a clarification for next release:

$ bamadapterfind -h
This is biobambam2 version 2.0.31.
...
clip=<[0]>  
...

vs.

$ man bamadapterfind
...
       clip=<1>
...

Thanks,
Keiran

indexing fails on sparse BAMs

Hi,

We have a BAM file that only has a very small number of reads which only hit a handful of the reference sequences. It appears that this triggers a check failure:

$ bamindex -v
This is biobambam2 version 2.0.25.
biobambam2 is distributed under version 3 of the GNU General Public License.

$ bamindex tmpfile=tmpbamindex < 4911821_sorted_rmdup.bam > 4911821_sorted_rmdup.bam.bai
[V] 1 554204 3047
[V] 2 334746 1804
[V] 6 1 0
[V] Y 0 1
[V] 893803
BamIndexGenerator::checkConsisteny(): inconsistent binCIS.peek()=23 != linCIS.peek()=-1
bin index and linear index are inconsistent.

/software/CGP/pancan/bin/../lib/libmaus2.so.2(libmaus2::util::StackTrace::StackTrace()+0x4c)[0x7f572ed17a8c:??:0]
/software/CGP/pancan/bin/bamindex(libmaus2::exception::LibMausException::LibMausException()+0x20)[0x40d410:??:0]
/software/CGP/pancan/bin/bamindex()[0x421a35:??:0]
/software/CGP/pancan/bin/bamindex()[0x40b90e:??:0]
/software/CGP/pancan/bin/bamindex()[0x407e68:??:0]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f572bf1876d:??:0]
/software/CGP/pancan/bin/bamindex()[0x4080f2:??:0]

The behaviour was initially noticed running bammarkduplicates2 with index=1.

Samtools indexes the file fine with the following stats:

$ samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.18 (r982:295)
...
$ samtools index 4911821_sorted_rmdup.bam
$ samtools idxstats 4911821_sorted_rmdup.bam
1   249250621   554204  3047
2   243199373   334746  1804
3   198022430   0   0
4   191154276   0   0
5   180915260   0   0
6   171115067   1   0
7   159138663   0   0
8   146364022   0   0
9   141213431   0   0
10  135534747   0   0
11  135006516   0   0
12  133851895   0   0
13  115169878   0   0
14  107349540   0   0
15  102531392   0   0
16  90354753    0   0
17  81195210    0   0
18  78077248    0   0
19  59128983    0   0
20  63025520    0   0
21  48129895    0   0
22  51304566    0   0
X   155270560   0   0
Y   59373566    0   1
MT  16569   0   0
*   0   0   0
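
For anyone trying to reproduce this, one quick way to see how sparse a file is would be to count the references that carry any reads in the `samtools idxstats` output (columns: reference name, length, mapped reads, unmapped reads). The Python sketch below is only an illustration using a few rows from the table above. The function name `references_with_reads` is made up for this example and is not part of samtools or biobambam2.

```python
def references_with_reads(idxstats_text):
    """Return names of references with at least one mapped or unmapped read."""
    hits = []
    for line in idxstats_text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        name, _length, mapped, unmapped = fields
        if int(mapped) + int(unmapped) > 0:
            hits.append(name)
    return hits


# A subset of the idxstats rows quoted in this report.
sample = """\
1   249250621   554204  3047
2   243199373   334746  1804
3   198022430   0   0
6   171115067   1   0
Y   59373566    0   1
MT  16569   0   0
*   0   0   0
"""

print(references_with_reads(sample))
```

On the full table, only references 1, 2, 6 and Y have any reads, which matches the four per-reference lines that bamindex printed before failing.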

Specifying parameters on the command line

Hi,

When I started trying to use bamsormadup (and bamsort), it took me quite some time to realise how the parameters should be specified.

My first attempts were with:

bamsort -I input.bam -O output.bam
bamsort --I=input.bam --O=output.bam

These seem to be the standard in other applications. Would it be possible to implement them in biobambam2? If not, could the README include some worked examples that show how a command should be formed?

Thank you,

Mark
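
For reference, other reports on this page call the tools with bare key=value arguments (for example `bamsort inputformat=sam O=test.bam I=1_pindel_wt.sam`), with no leading dashes, and a key such as `I` may be repeated. A minimal Python sketch of building such a command line follows. The helper name `biobambam_command` is hypothetical, and the sketch only formats arguments without running anything.

```python
import shlex


def biobambam_command(tool, **params):
    """Build an argv list in the key=value style used by biobambam2 tools.

    A list or tuple value repeats the key once per element (e.g. several I= inputs).
    """
    argv = [tool]
    for key, value in params.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            argv.append(f"{key}={item}")
    return argv


cmd = biobambam_command("bamsort", I="input.bam", O="output.bam")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` as-is. Which keys a given tool accepts still has to be checked against its `-h` output.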

bamdownsamplerandom - bam vs cram

Hi,

I took a 29GB BAM file and converted it to CRAM using scramble with default parameters (resulting file 13GB).

I then ran bamdownsamplerandom (p=0.1) on both files. The BAM downsample completed in ~45 minutes. The CRAM version has now been running for 23h.

(biobambam2-2.0.33-release-20160317091357-x86_64-etch-linux-gnu)

bammarkduplicates vs bammarkduplicates2 documentation

Hello German,
I was looking for documentation about the difference between bammarkduplicates and bammarkduplicates2 and did not see any on the help pages for those respective tools. Is there some description of this difference?

Kind regards,
Jeff

bamcollate2 - segfaults

Under v2.0.54 (pre-compiled binaries) we've seen some segfaults. When they reoccur, they seem to happen at different points in the input BAM.

I've had a look at the changes and commit info and I can't see any specific items relating to issues in bamtofastq.

There didn't seem to be any error message; however, the output stream was being consumed by a wrapper, which may have 'hidden' it somehow. Information pulled by our admins:

May 18 12:04:42 cgp-7-1-16 kernel: [4226734.618114] show_signal_msg: 9 callbacks suppressed
May 18 12:04:42 cgp-7-1-16 kernel: [4226734.618121] bamcollate2[25298]: segfault at 1 ip 0000000000000001 sp 00007ffd10f74478 error 14 in bamcollate2[400000+b9000]
May 18 12:04:42 cgp-7-1-16 kernel: [4226734.621187] bamcollate2[25461]: segfault at 1 ip 0000000000000001 sp 00007ffd7f499e78 error 14 in bamcollate2[400000+b9000]
May 18 12:04:42 cgp-7-1-16 kernel: [4226734.627157] bamcollate2[25454]: segfault at 1 ip 0000000000000001 sp 00007ffc742608f8 error 14 in bamcollate2[400000+b9000]
May 18 12:04:42 cgp-7-1-16 kernel: [4226734.627372] bamcollate2[25297]: segfault at 1 ip 0000000000000001 sp 00007ffe856a91f8 error 14 in bamcollate2[400000+b9000]

The first message indicates nine similar messages were also logged, so there were quite a few of these in quick succession. I can't see any indication of a hardware fault on cgp-7-1-16, so it may be a rare bug in the software or possibly a kernel issue.

I'm retrying with the latest binaries (2.0.72) on Ubuntu 12.04.5 LTS (GNU/Linux 3.2.0-105-generic x86_64). Is it safe to use the etch binaries like this?

libmaus2 detected but gives can not compile error in configure script

Hello,
I am running into difficulty building biobambam2 from source on Ubuntu 16.04.3 LTS (Xenial), kernel version 4.4.0-103-generic.

I successfully compiled libmaus2 (version libmaus2-2.0.432-release-20171214144141) and used "make install" to install into the prefix /home/ubuntu/tools.

wget https://github.com/gt1/libmaus2/archive/2.0.432-release-20171214144141.tar.gz
tar xvf 2.0.432-release-20171214144141.tar.gz
cd libmaus2-2.0.432-release-20171214144141/
./configure --prefix=/home/ubuntu/tools
make
make install

However, when I follow the instructions to compile biobambam2, it does detect libmaus2, but then gives a subsequent error about not being able to use it for compilation.

These are the commands I ran:

wget https://github.com/gt1/biobambam2/archive/2.0.82-release-20171214120547.tar.gz
tar xvf 2.0.82-release-20171214120547.tar.gz
cd biobambam2-2.0.82-release-20171214120547/
./configure --with-libmaus2=/home/ubuntu/tools --prefix=/home/ubuntu/tools

This is the error I see after running the configure script (only showing last 5 lines of output):

checking for libmaus2... yes
checking for libmaus2digests... yes
checking for libmaus2seqchksumsfactory... yes
checking whether we can compile a program using libmaus2... no
configure: error: Required libmaus2 is not available.

Is there some compatibility error here, or do I need to add some additional settings in the build process? Thank you very much for your help.

I've tried setting the following environment variables before running configure, but I still get the same error.

export LDFLAGS="-L/home/ubuntu/tools/lib"
export CPPFLAGS="-I/home/ubuntu/tools/include/libmaus2"

Best,
Jeff

question about bamtofastq

Really amazing tool kits!

I have an already name-sorted bam file with paired-end reads, and I want to convert it to fastq files.
It seems that I have to specify collate=1 even though the file is already collated by read name; otherwise the results are printed to the screen (stdout) rather than written to the files specified by F and F2.

In this case, what values should I pass to 'colhlog' and 'colsbs' to convert the file most efficiently, given that it is already name-sorted?

Or can I specify collate=0 and still let the results output to F and F2 files?

Thanks!
Hurley

Input and Output Files

Hi,

This is a great tool (I mostly use bamsormadup), but would it be possible to allow for parameters that specify an input and an output file?

Thank you,

Mark

bamsort - SO:coordinate not being added to @HD line

Hi there,
I'm running a bamsort command on my RNA-Seq mapped BAM file which has been aligned with STAR.
The sort command completes successfully with no errors, and I am able to use the BAM file in downstream analysis that requires coordinate-sorted BAM files. However, the header line looks like this, with no SO:coordinate tag added, which is not what I would expect:
@HD VN:1.4

The version of bamsort I am using is: 2.0.25

The command I am running is this:
bamsort I=Aligned.out.bam fixmate=1 inputformat=bam level=1 tmpfile=./star/tmp O=Aligned.sortedByCoord.out.bam inputthreads=4 outputthreads=4

When I run a samtools sort command, the SO:coordinate tag is added.
Can you think of why this might be happening?

Thanks
Angela
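
To check this across many files, the sort order can be read from the first header line (for example the output of `samtools view -H file.bam | head -n 1`). Below is a small Python sketch. It assumes tab-separated fields as in the SAM header format, and the function name `sort_order` is just for illustration.

```python
def sort_order(hd_line):
    """Return the SO value of an @HD header line, or None if no SO tag is present."""
    fields = hd_line.rstrip("\n").split("\t")
    if fields[0] != "@HD":
        raise ValueError("not an @HD header line")
    for field in fields[1:]:
        tag, _, value = field.partition(":")
        if tag == "SO":
            return value
    return None


print(sort_order("@HD\tVN:1.4"))                 # header as reported here
print(sort_order("@HD\tVN:1.5\tSO:coordinate"))  # header with the tag set
```

A result of None corresponds to the situation described here, where the header carries only the VN field.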

bamtofastq - is multi-thread possible?

Is there any way it would be possible to use multiple cores in this stage?

Is 1 to decompress BAM, 1 to process, and 1 to compress per output file (if gz=1) possible?

This would be very useful if possible.

Thanks

High memory usage for bamsormadup on inputs with many reference contigs

German;
We're using bamcat/bamsormadup within bcbio to merge 27 input BAMs with:

bamcat level=0 `cat files.txt` | bamsormadup threads=8

We expect a ~130Gb output file from the merge.

The merge process has very high memory usage, filling up a machine with 330Gb of memory:

[V] 442783933   03:37:23:67015100 MemUsage(size=279998,rss=198773,peak=280062)
AutoArrayMemUsage(memusage=32544.6,peakmemusage=33442.9,maxmem=1.75922e+13)

We've been trying to understand why we have such poor memory usage for this analysis and wonder if it has to do with the number of reference contigs. This is run on the monkey genome with ~7500 contigs. Would that have any relationship to memory usage? Any other tips as to why we see such high memory usage for this merge? Thanks much for any suggestions or thoughts.

bamcollate2 - split bam by readgroup

Would it be possible to add the ability to split a BAM by read group during the collate phase, similar to the way that bamtofastq does?

outputperreadgroup=<[0]>                     : split output per read group (for collate=1 only)
outputdir=<>                                 : directory for output if outputperreadgroup=1 (default: current directory)

Thanks,

bamcat syscall error

Hi,

When executing the bamcat command, I get the following message:

 login01 /scratch/gcarrasco/tests/test_bcbio_cwl ~> bamcat
libmaus2::util::PosixExecute::execute() donotthrow: "addr2line --exe=bamcat 0x4140a0" exited with status 1
libmaus2::util::PosixExecute::execute() donotthrow: failing command was addr2line --exe=bamcat 0x4140a0
libmaus2::util::PosixExecute::execute() donotthrow: failed syscall  login01

Besides that, the program seems to work. We're running CentOS 6.6 on our cluster, and the package was installed via conda.

Any clue as to why this may be happening?

Thanks a lot in advance,

bamdownsamplerandom - sort order issues with cram input

Hi,

This gives warnings that the sort order is unknown on files where the field is appropriately set in the header:

$ samtools-1.2 view -H cram/MD5146a.cram | head -n 1
@HD VN:1.5  SO:coordinate
$ biobambam2-2.0.33-release-20160317091357-x86_64-etch-linux-gnu/bin/bamdownsamplerandom reference=/nfs/cancer_ref01/mouse/37/genome.fa p=0.1 filename=cram/MD5146a.cram inputformat=cram outputformat=cram O=ds.cram
Unknown sort order field: unknown
...

This doesn't occur for BAM input.

Regards,
Keiran

bam2fastq failed

Hi,

I get a strange error using bamtofastq:

BIOBAMBAM/bin/bamtofastq filename=D1_sorted.bam inputformat=bam gz=1 F=D1_R1.fastq.gz F2=D1_R2.fastq.gz O=orphan_D1_1.fastq.gz O2=orphan_D1_2.fastq.gz

BAM header is not consistent (binary and text do not match) for @SQ SN:8 LN:11660

LIBMAUS2/lib/libmaus2.so.2(libmaus2::util::StackTrace::StackTrace()+0x54)[0x7f71ec496414]
BIOBAMBAM/bin/bamtofastq(libmaus2::exception::LibMausException::LibMausException()+0x20)[0x437f00]
/BIOBAMBAM/bin/bamtofastq(libmaus2::bambam::BamHeader::initSetup()+0xbbb)[0x49a55b]
BIOBAMBAM/bin/bamtofastq(void libmaus2::bambam::BamHeader::init<libmaus2::lz::BgzfInflateStream>(libmaus2::lz::BgzfInflateStream&)+0x299)[0x49ec19]
BIOBAMBAM/bin/bamtofastq(libmaus2::bambam::BamDecoderWrapper::BamDecoderWrapper(std::unique_ptr<libmaus2::aio::InputStream, std::default_delete<libmaus2::aio::InputStream> >&, bool)+0x341)[0x49f901]
BIOBAMBAM/bin/bamtofastq(libmaus2::bambam::BamAlignmentDecoderFactory::construct(std::istream&, std::string const&, std::string const&, unsigned long, std::string const&, bool, std::ostream*, std::string const&)+0xd2d)[0x4a07bd]
BIOBAMBAM/bin/bamtofastq(libmaus2::bambam::BamMultiAlignmentDecoderFactory::construct(libmaus2::util::ArgInfo const&, bool, std::ostream*, std::istream&, bool, bool)+0x381)[0x4a1fa1]
BIOBAMBAM/bin/bamtofastq(bamtofastqCollating(libmaus2::util::ArgInfo const&)+0x5de)[0x43344e]
BIOBAMBAM/bin/bamtofastq(bamtofastq(libmaus2::util::ArgInfo const&)+0x3c1)[0x4344c1]
BIOBAMBAM/bin/bamtofastq(main+0x1a02)[0x42cbc2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f71ea906b45]
BIOBAMBAM/bin/bamtofastq()[0x42e79f]

Do I need to reformat my BAM input file and, if so, how?

Thank you

bamtofastq gets confused over near-identical read names

Hi there,
Many thanks for all the work on biobambam2, much appreciated. I've come across what is probably a (version?) name-sorting problem in bamtofastq.

The eventual error manifests itself in bwa as

[mem_sam_pe] paired reads have different names: "0302t6w", "0302t06w"

Looking at the original bam file there are reads with similar read names in this order:

0302t6wM        99      chr1    16254844
...
0302t6wM        147     chr1    16254937
...
0302t06w        65      chr3    45533685
...
0302t6ww        99      chr7    74710731
...
0302t6ww        147     chr7    74710799
...
0302t6wO        99      chr12   50005111
...
0302t6wO        147     chr12   50005241
...
0302t06w        129     chr13   37453392
...
0302t6w 97      chr13   43262597
...
0302t6w 145     chrX    33099514
...

After bamtofastq the files have these reads only:

@0302t6wM/1
@0302t6ww/1
@0302t6wO/1
@0302t6w/1
@0302t6wM/2
@0302t6ww/2
@0302t6wO/2
@0302t06w/2

in other words, @0302t6w/2 and @0302t06w/1 are missing.

The command is roughly like this:

bamtofastq filename=my.sorted.bam T=temp.sorted-1.fq-sort F=>(bgzip -c /dev/stdin > my.sorted-1.fq.gz) F2=>(bgzip -c /dev/stdin > my.sorted-2.fq.gz) S=/dev/null O=/dev/null O2=/dev/null collate=1 colsbs=2097152

Could this have something to do with internal version-name sorting and the near identical read names (save for the zero in between t and 6)?
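
To make the hypothesis concrete: under a version-style ("natural") ordering, runs of digits are compared as numbers, so `6` and `06` are treated as the same value. The Python sketch below only illustrates that general idea. It is not taken from bamtofastq or libmaus2, and whether biobambam2 compares names this way is exactly the open question.

```python
import re


def version_key(name):
    """Version-style sort key: digit runs compare numerically, other text literally."""
    parts = re.split(r"(\d+)", name)
    return [int(part) if part.isdigit() else part for part in parts]


a, b = "0302t6w", "0302t06w"
print(a == b)                             # the names differ as plain strings
print(version_key(a) == version_key(b))   # but they collapse to the same key
```

If a collation step keyed reads this way, the four records named 0302t6w and 0302t06w would look like a single name, which would be consistent with one mate from each pair going missing.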

Improving documentation and commandline argument names/behavior

Hi,
I am new to using this tool (2.0.69) and find it odd that command-line arguments are not shared systematically between tools, and that some applications lack arguments that are supposedly common to all of them.

bamsormadup uses reference="$ref" while bamsort uses calmdnmreference="$ref"

bamsormadup uses threads="$threads" while bamsort uses inputthreads="$input_threads" outputthreads="$output_threads". Could bamsort also accept just threads and figure out how to split their distribution on its own?

Both tools use SO=coordinate, but it would be clearer if SI={coordinate,queryname,hash} were also possible. If the input BAM header conflicts with the given value, then just exit.

bamsormadup has no option to restrict memory usage, while bamsort uses blockmb, but in 1MB units. Weird. Couldn't it be more user friendly, so that I could specify memory=22G or memory=22g?
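
Until something like that exists, a wrapper script can do the conversion itself. The following Python sketch turns a size such as 22G or 22g into a whole number of megabytes that could then be passed as blockmb. The `memory_to_mb` helper and the idea of a `memory=` option are hypothetical, and treating a bare number as megabytes is an assumption made to match blockmb's 1MB units.

```python
import re

# Multipliers to megabytes; a bare number is assumed to already be in MB.
_UNITS_MB = {"": 1, "M": 1, "G": 1024, "T": 1024 * 1024}


def memory_to_mb(text):
    """Convert a size such as '22G', '22g' or '512' into whole megabytes."""
    match = re.fullmatch(r"(\d+)([mMgGtT]?)", text.strip())
    if match is None:
        raise ValueError(f"cannot parse memory size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS_MB[unit.upper()]


print(f"blockmb={memory_to_mb('22G')}")
```

Note that this only computes a number. How blockmb relates to the total memory footprint of bamsort is not something the sketch can answer.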

bamsormadup lacks an index=1 option, although the same is probably achieved through indexfilename="$prefix".bam.bai.
bamsort understands index=1, and presumably I do not have to specify indexfilename="$prefix".bam.bai (that should be the default, shouldn't it?).

The bamsormadup -h text is silent on whether I="$infile" O="$outfile" is accepted. Judging from the command-line options of bamsort, I would at least hope so.

It is not clear to me why bamsormadup docs merely guide me to use an SSD drive or a ramdisk to store TMP files. Couldn't it be done in memory? I mean, I do not have an SSD drive so I will create a ramdisk with ext3 filesystem without journal most likely. I wonder whether in-memory store wouldn't be better right away without filesystem overhead.

While it seems Picard MarkDuplicates is superior to samtools rmdup, I wonder how bamsormadup stands in that comparison. I was certainly tempted to misuse bamsormadup to sort and index my BAM files in parallel without using it to mark duplicate reads. Could that be done? This basically stems from the fact that bamsort asks me to specify input and output threads separately. Why in such a complicated way?

I know, "patches are welcome". ;) Thank you anyway for your current efforts.

build directories leaked in .la files in binary tarball

Not affecting me (but caught someone else out when trying to build and deploy iolib to the same hierarchy I think - worked around with --disable-static on iolib build), and I don't know if it's worthwhile to fix:

dj3@deskpro108240:/tmp$ grep tischler biobambam2-2.0.22-release-20151029100516-x86_64-etch-linux-gnu/lib/libcurl.la 
dependency_libs=' -L/home/tischler/src/build_biobambam/compile-x86_64-etch-linux-gnu/lib /home/tischler/src/build_biobambam/compile-x86_64-etch-linux-gnu/lib/libidn.la /home/tischler/src/build_biobambam/compile-x86_64-etch-linux-gnu/lib/libgnutls.la /home/tischler/src/build_biobambam/compile-x86_64-etch-linux-gnu/lib/libp11-kit.la -ldl -lpthread /home/tischler/src/build_biobambam/compile-x86_64-etch-linux-gnu/lib/libtasn1.la -lnettle -lhogweed /home/tischler/src/build_biobambam/compile-x86_64-etch-linux-gnu/lib/libgmp.la -lz -lrt'
libdir='/home/tischler/src/build_biobambam/compile-x86_64-etch-linux-gnu/lib'

curl in binary release

Hi,

The bin folder of the release includes 'curl'. I've found that it doesn't work for SSL connections, so I have to make sure it is not included in installs.

Is there an actual need for the curl binary in the final bin?

Thanks

alternative to libmaus2?

Hi there
I am unable to install biobambam2 because of a compilation problem with libmaus2. Is there any alternative to this package? Or the binaries somewhere?

Eckart
