oicr-gsi / bam-qc-metrics

Metrics for BAM file QC

License: GNU General Public License v3.0

Python 100.00%

bioinformatics bioinformatics-pipeline samtools

bam-qc-metrics's Issues

Remove downsampling and filtering?

In new versions of the bam-qc workflow, filtering and downsampling will be done upstream by other workflow tasks. So, the filtering/downsampling capabilities of bam-qc-metrics itself will no longer be used. We could:

Remove the downsampling functionality from bam-qc-metrics to simplify code
Remove output fields referring to filtering/downsampling results in bam-qc-metrics (e.g. "total reads", "unmapped reads"), to simplify output and reduce potential confusion.

Low-priority, but could be useful. Should we do it?

CIGAR of read reverse strand

If the read is reversed by the aligner (flag 16), the CIGAR string is stored in reference order, so it will not match the sequencing cycle order of the machine.

The iterator needs to be reversed if flag 16 is set:

bam-qc-metrics/bam_qc_metrics/bam_qc.py

Line 80 in f923853

for (op, length) in read.cigartuples:
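One way the reversal could look, as a minimal sketch rather than the project's actual fix. The helper name cigar_in_cycle_order is hypothetical; it assumes the caller passes in pysam's read.cigartuples and read.is_reverse, and it uses plain tuples so it runs without pysam installed.

```python
from typing import List, Optional, Tuple

CigarTuples = List[Tuple[int, int]]


def cigar_in_cycle_order(cigartuples: Optional[CigarTuples],
                         is_reverse: bool) -> CigarTuples:
    """Return CIGAR (op, length) pairs ordered by sequencing cycle.

    Hypothetical helper: the BAM record stores the CIGAR in reference
    order, so for a reverse-strand read (flag 16) the list is reversed.
    """
    if cigartuples is None:  # pysam returns None when no CIGAR is present
        return []
    if is_reverse:
        return list(reversed(cigartuples))
    return list(cigartuples)


# Example record: 5 soft-clipped bases (op 4) followed by 95 matches (op 0)
ops = [(4, 5), (0, 95)]
forward = cigar_in_cycle_order(ops, is_reverse=False)
reverse = cigar_in_cycle_order(ops, is_reverse=True)
```

In the loop above this would be used as `for (op, length) in cigar_in_cycle_order(read.cigartuples, read.is_reverse):`.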

trim_quality usage

This variable is used in

bam-qc-metrics/bam_qc_metrics/bam_qc.py

Line 132 in f923853

result = pysam.stats("-q", str(self.trim_quality), self.bam_path)

where, for samtools stats, it means: -q, --trim-quality INT The BWA trimming parameter (https://sourceforge.net/p/bio-bwa/mailman/message/25597301/)

The same variable is compared to the read MAPQ value

bam-qc-metrics/bam_qc_metrics/bam_qc.py

Line 74 in f923853

if self.trim_quality != None and read.mapping_quality < self.trim_quality:

These two uses are incompatible.

samtools uses the -q flag differently depending on context. In samtools view: -q INT Skip alignments with MAPQ smaller than INT [0].
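A possible way to untangle the two meanings, sketched under the assumption that the MAPQ threshold gets its own parameter. The names min_mapq, passes_mapq_filter and samtools_stats_args are hypothetical and do not exist in bam-qc-metrics.

```python
from typing import List, Optional


def passes_mapq_filter(mapping_quality: int, min_mapq: Optional[int]) -> bool:
    """True if the read is kept; min_mapq is a hypothetical, separate
    parameter with the samtools view -q meaning (minimum MAPQ)."""
    return min_mapq is None or mapping_quality >= min_mapq


def samtools_stats_args(bam_path: str, trim_quality: Optional[int]) -> List[str]:
    """Build the argument list for samtools stats; here -q keeps its
    samtools stats meaning (BWA trimming parameter)."""
    args: List[str] = []
    if trim_quality is not None:
        args.extend(["-q", str(trim_quality)])
    args.append(bam_path)
    return args


stats_args = samtools_stats_args("example.bam", trim_quality=20)
```

The argument list could then be passed on as pysam.stats(*stats_args), matching the call quoted above.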

sample_rate parameter is not passed on

The instance variable is always set to 1 at

bam-qc-metrics/bam_qc_metrics/bam_qc.py

Line 44 in 2f72712

self.sample_rate = 1

regardless of the sample_rate parameter passed in. I think this needs changing to self.sample_rate = sample_rate

Failure with incompatible reference file

If sequences in the BAM file do not appear in the given alignment reference, analysis dies (see below for error).

Make a more informative error message, or (if possible) prevent the error from happening.

(bamqc) ibancarz@ld5312-ibanca:~/playground/bamqc_test_data/20180816/A00469_0047/test$ run_bam_qc.py -b ../../../SWID_14343630_TGL41_0004_nn_R_PE_320_CM_HMC_4_190531_M00146_0054_000000000-D6CW8_GTTACGCA-ATCGCCAT_L001_001.annotated.bam -o test.json -t ../../../hg19_random.genome.sizes.bed -r ../../../hg19.fa
[E::faidx_adjust_position] The sequence "chr1_gl000192_random" not found
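A more informative failure could come from comparing sequence names up front. The sketch below is only an illustration: the function names are hypothetical, it assumes a FASTA index (.fai) exists for the reference, and it assumes the BAM sequence names are supplied by the caller (for example from pysam's AlignmentFile.references).

```python
from typing import Iterable, Set


def read_fai_names(fai_lines: Iterable[str]) -> Set[str]:
    """Collect sequence names from .fai lines (the name is the first
    tab-separated column)."""
    names = set()
    for line in fai_lines:
        line = line.rstrip("\n")
        if line:
            names.add(line.split("\t")[0])
    return names


def check_reference_compatible(bam_names: Iterable[str],
                               reference_names: Iterable[str]) -> None:
    """Raise a descriptive error if BAM sequences are absent from the
    reference. Hypothetical helper, not part of bam-qc-metrics."""
    missing = sorted(set(bam_names) - set(reference_names))
    if missing:
        raise ValueError(
            "Reference is incompatible with the BAM file: %d sequence(s) "
            "in the BAM header are not in the reference, e.g. %s"
            % (len(missing), ", ".join(missing[:5]))
        )


fai = ["chr1\t249250621\t6\t50\t51\n", "chr2\t243199373\t254235646\t50\t51\n"]
ref_names = read_fai_names(fai)
```

Calling check_reference_compatible(["chr1", "chr1_gl000192_random"], ref_names) would then raise a ValueError naming the missing sequence before any metrics are computed.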

Include a genome reference in the repository for tests

samtools stats might allow you to use just one chromosome in the fasta file. If you pick one of the runt chromosomes (outside of chr1-22, X), it would add less than 1MB to the repo.

Originally posted by @slazicoicr in https://github.com/_render_node/MDIzOlB1bGxSZXF1ZXN0UmV2aWV3VGhyZWFkMTkyMjc3NTU2OnYy/pull_request_review_threads/discussion

Version update script

Small utility script to update the workflow version number, in JSON files of expected test data.

Failure if --target not given

Analysis crashes if the --target option is not specified, as follows:

(bamqc) ibancarz@ld5312-ibanca:~/playground/bamqc_test_data/20180816/A00469_0047/test$ run_bam_qc.py -b ../../../SWID_14343630_TGL41_0004_nn_R_PE_320_CM_HMC_4_190531_M00146_0054_000000000-D6CW8_GTTACGCA-ATCGCCAT_L001_001.annotated.bam -o test.json
Traceback (most recent call last):
  File "/home/ibancarz/playground/bam-qc-metrics-v0.1.6/bin/run_bam_qc.py", line 138, in <module>
    main()
  File "/home/ibancarz/playground/bam-qc-metrics-v0.1.6/bin/run_bam_qc.py", line 134, in main
    qc = bam_qc(config)
  File "/home/ibancarz/playground/bam-qc-metrics-v0.1.6/bam_qc_metrics/bam_qc.py", line 122, in __init__
    fast_finder.read_length_summary())
  File "/home/ibancarz/playground/bam-qc-metrics-v0.1.6/bam_qc_metrics/bam_qc.py", line 625, in __init__
    self.metrics = self.evaluate_all_metrics()
  File "/home/ibancarz/playground/bam-qc-metrics-v0.1.6/bam_qc_metrics/bam_qc.py", line 630, in evaluate_all_metrics
    self.evaluate_bedtools_metrics(),
  File "/home/ibancarz/playground/bam-qc-metrics-v0.1.6/bam_qc_metrics/bam_qc.py", line 642, in evaluate_bedtools_metrics
    metrics['number of targets'] = targetBedTool.count()
  File "/home/ibancarz/playground/bam-qc-metrics-v0.1.6/pybedtools/bedtool.py", line 2507, in count
    return sum(1 for _ in iter(self))
  File "/home/ibancarz/playground/bam-qc-metrics-v0.1.6/pybedtools/bedtool.py", line 2507, in <genexpr>
    return sum(1 for _ in iter(self))
  File "pybedtools/cbedtools.pyx", line 754, in pybedtools.cbedtools.IntervalIterator.__next__
TypeError: NoneType object is not an iterator
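The traceback suggests the target-based metrics are evaluated even when no target file was given. A guard along the following lines might avoid the crash; this is a sketch with hypothetical function names, and it counts BED lines with the standard library instead of pybedtools so that it stays self-contained.

```python
from typing import Dict, Iterable, Optional


def count_bed_intervals(lines: Iterable[str]) -> int:
    """Count interval lines in BED text, skipping blanks, comments and
    track/browser header lines."""
    count = 0
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "track ", "browser ")):
            continue
        count += 1
    return count


def target_metrics(target_lines: Optional[Iterable[str]]) -> Dict[str, int]:
    """Return target-based metrics, or an empty dict when no target was
    supplied (hypothetical behaviour; raising a clear error is another
    option)."""
    if target_lines is None:
        return {}
    return {"number of targets": count_bed_intervals(target_lines)}


bed = ["# comment\n", "chr1\t100\t200\n", "chr1\t300\t400\n", "\n"]
with_target = target_metrics(bed)
without_target = target_metrics(None)
```

Whether the missing metrics should be omitted, set to a null value, or reported as an error is a design decision for the output schema; the sketch only shows where the check would sit.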

read_mark_duplicates_metrics

bam-qc-metrics/bam_qc_metrics/bam_qc.py

Line 237 in f923853

msg = "Failed to parse duplicate metrics path %s, section %d, line %d" % params

Here line will be a string, not a number, so formatting it with %d raises a TypeError instead of producing the intended message.
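A small illustration of the formatting problem, assuming params is a tuple of (path, section, line) in which section is an integer and line is the text of the offending line. The function name and that tuple layout are assumptions.

```python
def duplicate_metrics_error(input_path: str, section: int, line: str) -> str:
    """Format the parse-failure message; %r is used for the line text so
    that a string argument is accepted and whitespace stays visible."""
    params = (input_path, section, line)
    return "Failed to parse duplicate metrics path %s, section %d, line %r" % params


msg = duplicate_metrics_error("metrics.txt", 1, "LIBRARY\tUNPAIRED_READS_EXAMINED")

# With the original format string, a string argument for %d fails:
try:
    "line %d" % "LIBRARY"
    old_format_failed = False
except TypeError:
    old_format_failed = True
```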

Example output for a MiSeq analysis

GLCS_0001_Lv_R_PE_279_WG|1|190305_M00146_0024_000000000-D5N29.txt

bam-qc-metrics/bam_qc_metrics/bam_qc.py

Line 215 in f923853

 if re.match('## METRICS CLASS\s+net\.sf\.picard\.sam\.DuplicationMetrics', line): 

This won't match ## METRICS CLASS picard.sam.DuplicationMetrics, which lacks the net.sf. prefix.
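One possible pattern that accepts both class names makes the net.sf. prefix optional. This is a sketch; the constant and function names are hypothetical.

```python
import re

# Optional non-capturing group lets the same pattern match both the older
# net.sf.picard.sam... and the newer picard.sam... class names.
METRICS_CLASS_RE = re.compile(
    r'## METRICS CLASS\s+(?:net\.sf\.)?picard\.sam\.DuplicationMetrics'
)


def is_duplication_metrics_header(line: str) -> bool:
    """True if the line is a DuplicationMetrics class header."""
    return METRICS_CLASS_RE.match(line) is not None


old_style = is_duplication_metrics_header(
    "## METRICS CLASS\tnet.sf.picard.sam.DuplicationMetrics")
new_style = is_duplication_metrics_header(
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics")
```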

For low-coverage runs, ESTIMATED_LIBRARY_SIZE is left empty. Unfortunately, the Picard gods did not see fit to add a \t to signify an empty field, so the values line has one field fewer than the header and the error below is raised:

bam-qc-metrics/bam_qc_metrics/bam_qc.py

Line 240 in f923853

 raise ValueError("Key and value lists from %s are of unequal length" % input_path) 
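A tolerant parser could pad the missing trailing field before comparing lengths. The sketch below assumes ESTIMATED_LIBRARY_SIZE is the final column of the header line; the function name pair_metrics is hypothetical, and the example uses a shortened header rather than the full Picard column list.

```python
from typing import Dict


def pair_metrics(keys_line: str, values_line: str, input_path: str = "") -> Dict[str, str]:
    """Pair header keys with values, padding an absent trailing
    ESTIMATED_LIBRARY_SIZE with an empty string."""
    # Strip only the newline: a trailing tab must survive as an empty field
    keys = keys_line.rstrip("\n").split("\t")
    values = values_line.rstrip("\n").split("\t")
    if len(values) == len(keys) - 1 and keys[-1] == "ESTIMATED_LIBRARY_SIZE":
        values.append("")  # field omitted by Picard for low-coverage runs
    if len(keys) != len(values):
        raise ValueError(
            "Key and value lists from %s are of unequal length" % input_path)
    return dict(zip(keys, values))


header = "LIBRARY\tPERCENT_DUPLICATION\tESTIMATED_LIBRARY_SIZE\n"
low_coverage = pair_metrics(header, "lib1\t0.01\n")
normal = pair_metrics(header, "lib1\t0.01\t123456\n")
```

Downstream code would still need to decide how an empty ESTIMATED_LIBRARY_SIZE is reported in the JSON output.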

Speed up tests

Tests are a little slow now at ~60s. Speed up by using a smaller input dataset for non-critical tests.

Version update script failure

Using the version update script, or even editing the VERSION file without updating test data, causes apparently unrelated errors. See JIRA ticket: https://jira.oicr.on.ca/browse/GP-2242

Until such time as this issue is resolved, the script update_test_data_version.py is deprecated.
