msk-access / sequence_qc

Noise calculations from BAM file

Home Page: https://cmo-ci.gitbook.io/sequence-qc/

License: Other

Python 96.08% Dockerfile 3.92%

noise-detection

sequence_qc's Introduction

sequence_qc

Package for running various ad hoc quality-control steps on MSK-ACCESS-generated FASTQ or BAM files

Free software: Apache Software License 2.0
Documentation: https://msk-access.gitbook.io/sequence-qc/

Installation

From PyPI:

pip install sequence_qc

From conda:

conda install -c ionox0 -c conda-forge -c bioconda sequence-qc

sequence_qc's Issues

set up travis testing

Some modifications based on Mike's email

Thanks very much for preparing this. Just a couple of minor suggestions for the graphs:

Top graph: Change the y-axis label to "Alt Count / (Alt Count + Ref Count) x 10^6" and remove the mu.
Label as Duplex consensus (assuming that's what's in the calculation).
Noise Positions graph: Change the y-axis label to "Alt Count / Total Count". The genomic positions can be removed from the x-axis label, since there is not enough space to show all of them and they are not very informative.

Output pdf plot for "top noisy positions"

Use matplotlib to produce plots of the top noisy positions, as was previously done in Excel

Include noise from heterozygous / tri-allelic sites

Splitting this out into a separate issue, as it is a bit more complicated than I originally thought, and I want to get it right before coding it.

Currently, the noise calculation works by designating the most common base as the "genotype" at that position, and if any of the other 3 bases pass the "threshold" (2%), that position is skipped.
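The current rule described above can be sketched as follows. This is a minimal illustration, not the package's actual code: the function names `site_counts` and `noise_fraction`, the dict-based input, and the choice of `>=` for "passing" the threshold are all assumptions.

```python
BASES = ("A", "C", "G", "T")


def site_counts(counts, threshold=0.02):
    """Return (geno_count, alt_count) for one position, or None if it is skipped.

    `counts` maps each base to its read count at the position (hypothetical input).
    """
    total = sum(counts.get(base, 0) for base in BASES)
    if total == 0:
        return None
    # The most common base is designated as the genotype.
    genotype = max(BASES, key=lambda base: counts.get(base, 0))
    geno_count = counts.get(genotype, 0)
    alt_counts = [counts.get(base, 0) for base in BASES if base != genotype]
    # Assumption: a base "passes" the threshold when its fraction is >= threshold.
    if any(count / total >= threshold for count in alt_counts):
        return None
    return geno_count, sum(alt_counts)


def noise_fraction(sites, threshold=0.02):
    """Pool alt and genotype counts over all positions that were not skipped."""
    geno_total = 0
    alt_total = 0
    for counts in sites:
        result = site_counts(counts, threshold)
        if result is None:
            continue
        geno_total += result[0]
        alt_total += result[1]
    denominator = geno_total + alt_total
    return alt_total / denominator if denominator else 0.0
```

Under this sketch a heterozygous site such as A at 49% and T at 49% contributes nothing, because the second allele exceeds the threshold and the whole position is dropped.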

Instead of skipping heterozygous / tri-allelic sites, we thought it might make sense to apply these rules:

if there is an A at 49% and T at 49%, the alt allele count should still include the C at 1% and G at 1%
if there is an A at 33%, C at 33% and T at 33%, the alt allele count should still include the G at 1%
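One way to read the two rules above is that every base at or above the threshold is treated as a genotype allele, and every base below it counts as alt. The sketch below follows that reading; the function name, the dict input, and the `>=` comparison are assumptions, and the issue itself does not settle how the final rule should be coded.

```python
BASES = ("A", "C", "G", "T")


def multiallelic_counts(counts, threshold=0.02):
    """Return (geno_count, alt_count) for one position without skipping it.

    Assumption: bases whose fraction is >= threshold are genotype alleles;
    all remaining bases are counted as alt (noise).
    """
    total = sum(counts.get(base, 0) for base in BASES)
    if total == 0:
        return 0, 0
    geno_count = 0
    alt_count = 0
    for base in BASES:
        count = counts.get(base, 0)
        if count / total >= threshold:
            geno_count += count
        else:
            alt_count += count
    return geno_count, alt_count


# The two cases from the issue, using 100 reads so counts equal percentages.
heterozygous = {"A": 49, "T": 49, "C": 1, "G": 1}
tri_allelic = {"A": 33, "C": 33, "T": 33, "G": 1}
```

For a homozygous site this reduces to the usual split between the most common base and the other three, provided none of the other three reaches the threshold.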

Also, let's update the LaTeX equations to describe these cases as well

consistency with biometrics package

As @murphycj2 has done in the biometrics package, let's make this package use pysamstats (which really just calls pysam) in the same way:

Ensure stepper is not eliminating duplicate reads (use nofilter?) alimanfoo/pysamstats#91
Figure out a way to avoid double-counting overlapping reads (ignore_overlap param does not get passed to pysam... alimanfoo/pysamstats#98)
Set the min_mapping_quality, truncate, and max_depth defaults to the same values as biometrics (and ensure they are exposed as parameters)
Can we be consistent with min_base_quality or do the noise and fingerprinting require different values for this to work?
We might still want to consider the possibility of combining these two calculations into the same package

Template repository to host sequence-based quality control scripts as a package.

Currently, we have identified two scripts that need to be in here.

UMI Quality Control BASH script (https://github.com/mskcc/ACCESS-Pipeline/blob/master/python_tools/workflow_tools/qc/make_umi_qc_tables.sh)
Noise calculation BASH script (https://github.com/mskcc/ACCESS-Pipeline/blob/master/python_tools/workflow_tools/qc/calculate_noise.sh).

Each of these needs to be converted to a Python script.

Explore Alfred and Fastp if Picard takes too much time.

Alfred: https://www.gear-genomics.com/docs/alfred/
Fastp: https://github.com/OpenGene/fastp

Add insert size calculation metrics

Given a particular amount of noise, plot the insert size distribution of the reads supporting the noise.

Noise output file format

How should the qc metrics for:

noise (%)
number of contributing sites
noise by substitution type

be represented in this module's output?

I propose to have three separate files with the following columns (plus additional columns inherent to pysamstats):

<sample>_noise.tsv:

alt_count_acgt
geno_count_acgt
noise_percent_acgt
contributing_sites_acgt

alt_count_inc_deletions
geno_count_inc_deletions
noise_percent_inc_deletions
contributing_sites_inc_deletions

alt_count_inc_n
geno_count_inc_n
noise_percent_inc_n
contributing_sites_inc_n

<sample>_noise_by_substitution.tsv:

C>A - geno_count, alt_count, noise, contributing_sites
C>G - geno_count, alt_count, noise, contributing_sites
C>T - geno_count, alt_count, noise, contributing_sites
T>A - geno_count, alt_count, noise, contributing_sites
T>C - geno_count, alt_count, noise, contributing_sites
T>G - geno_count, alt_count, noise, contributing_sites
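The six rows above imply that substitutions on G or A genotypes are folded onto their complements (for example G>T is reported as C>A). A minimal sketch of building such a table is shown below. The per-site input tuples, the function names, the definition of a contributing site as one with a non-zero alt count, and the noise formula `100 * alt / (alt + geno)` are all assumptions made for illustration.

```python
import csv

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
SUBSTITUTION_TYPES = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")
FIELDNAMES = ["substitution", "geno_count", "alt_count", "noise", "contributing_sites"]


def substitution_type(geno_base, alt_base):
    """Fold a substitution onto the pyrimidine (C or T) genotype base."""
    if geno_base in ("A", "G"):
        geno_base, alt_base = COMPLEMENT[geno_base], COMPLEMENT[alt_base]
    return f"{geno_base}>{alt_base}"


def summarize_by_substitution(observations):
    """Aggregate (geno_base, alt_base, geno_count, alt_count) tuples into six rows."""
    totals = {
        sub: {"geno_count": 0, "alt_count": 0, "contributing_sites": 0}
        for sub in SUBSTITUTION_TYPES
    }
    for geno_base, alt_base, geno_count, alt_count in observations:
        row = totals[substitution_type(geno_base, alt_base)]
        row["geno_count"] += geno_count
        row["alt_count"] += alt_count
        if alt_count > 0:  # assumption: only sites with alt reads contribute
            row["contributing_sites"] += 1
    rows = []
    for sub in SUBSTITUTION_TYPES:
        row = totals[sub]
        denominator = row["geno_count"] + row["alt_count"]
        noise = 100.0 * row["alt_count"] / denominator if denominator else 0.0
        rows.append({"substitution": sub, "noise": noise, **row})
    return rows


def write_tsv(rows, handle):
    """Write the rows as tab-separated text to an open file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
```

Writing to a handle rather than a path keeps the sketch easy to test with `io.StringIO`; a real module would presumably open `<sample>_noise_by_substitution.tsv` and pass that handle in.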

multi-sample mode

Feature request: noise calculation over multiple bams

Subtasks:

decide on command line usage - simple list of bam paths?
decide on outputs:
- multiple pileups / noise files?
- pdf output of noise level across samples
use multiprocessing ? (or another threading library?)
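For the command-line question, the "simple list of bam paths" option could look like the sketch below. Everything here is hypothetical: the program name `multi_sample_noise`, the `--output-dir` and `--threads` options, and the convention of deriving the sample name from the BAM file name are assumptions, not the package's actual interface.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser that accepts one or more BAM paths (hypothetical CLI)."""
    parser = argparse.ArgumentParser(
        prog="multi_sample_noise",
        description="Calculate noise over multiple BAM files.",
    )
    parser.add_argument("bams", nargs="+", type=Path, help="paths to input BAM files")
    parser.add_argument("--output-dir", type=Path, default=Path("."),
                        help="directory for per-sample noise files")
    parser.add_argument("--threads", type=int, default=1,
                        help="number of samples to process in parallel")
    return parser


def planned_outputs(bams, output_dir):
    """Map each BAM to a per-sample output path, <sample>_noise.tsv.

    Assumption: the sample name is the BAM file name without its extension.
    """
    return {bam: output_dir / f"{bam.stem}_noise.tsv" for bam in bams}


# Example invocation with an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["sampleA.bam", "sampleB.bam", "--output-dir", "out"])
outputs = planned_outputs(args.bams, args.output_dir)
```

With one output file per input, the per-sample work is independent, so it could later be handed to `multiprocessing` or `concurrent.futures` without changing the command line.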

check noise precision

Why are there 9 contributing sites but 0% noise?
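One plausible explanation, which the issue does not confirm, is rounding: a handful of alt reads over a very large genotype count gives a non-zero noise percentage that rounds to zero at two decimal places. The sketch below uses made-up counts and a hypothetical helper, `noise_percent`, purely to illustrate the effect.

```python
def noise_percent(alt_count, geno_count):
    """Noise as a percentage of all counted reads (assumed formula)."""
    total = alt_count + geno_count
    return 100.0 * alt_count / total if total else 0.0


# Hypothetical counts: 9 alt reads against one million genotype reads.
value = noise_percent(9, 1_000_000)

rounded_two_places = round(value, 2)   # collapses to zero at this precision
scientific = f"{value:.3e}"            # scientific notation keeps the signal visible
```

If this is the cause, reporting the value with more decimal places or in scientific notation would keep the noise column consistent with the contributing-sites column.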
