
msk-access / sequence_qc


Noise calculations from BAM file

Home Page: https://cmo-ci.gitbook.io/sequence-qc/

License: Other

Languages: Python 96.08%, Dockerfile 3.92%

Topics: noise-detection

sequence_qc's Introduction

sequence_qc

Package for running various ad hoc quality control steps on MSK-ACCESS-generated FASTQ or BAM files.


Installation

From PyPI:

pip install sequence_qc

From conda:

conda install -c ionox0 -c conda-forge -c bioconda sequence-qc

sequence_qc's People

Contributors

dependabot[bot], ionox0, murphycj2, rhshah


sequence_qc's Issues

Some modifications based on Mike's email

Thanks very much for preparing this. Just a couple of minor suggestions for the graphs:

  1. Top graph: change the y-axis label to “Alt Count / (Alt Count + Ref Count) x 10^6” and remove the mu.
  2. Label it as duplex consensus (assuming that’s what’s in the calculation).
  3. Noise Positions graph: change the y-axis label to “Alt Count / Total Count”. The genomic positions can be removed from the x-axis labels, since there is not enough space to show all of them and they are not very informative.

Include noise from heterozygous / tri-allelic sites

Splitting this into its own issue, as it is a bit more complicated than I originally thought, and I want to get it right before coding it.

Currently, the noise calculation works by designating the most common base as the "genotype" at that position, and if any of the other 3 bases pass the "threshold" (2%), that position is skipped.
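The current behavior described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the function name `current_rule`, the dict-of-counts input, and the choice of a strict `>` comparison for "passing" the threshold are all assumptions.

```python
from typing import Dict, Optional, Tuple

BASES = "ACGT"


def current_rule(counts: Dict[str, int], threshold: float = 0.02) -> Optional[Tuple[int, int]]:
    """Return (geno_count, alt_count) for one position, or None if it is skipped.

    Hypothetical helper: `counts` maps each base to its read count at the position.
    """
    total = sum(counts.get(b, 0) for b in BASES)
    if total == 0:
        return None  # no coverage, nothing to count

    # The most common base is designated the "genotype".
    genotype = max(BASES, key=lambda b: counts.get(b, 0))

    # If any other base passes the threshold, the whole position is skipped.
    for base in BASES:
        if base != genotype and counts.get(base, 0) / total > threshold:
            return None

    geno_count = counts.get(genotype, 0)
    return geno_count, total - geno_count


print(current_rule({"A": 990, "C": 5, "G": 3, "T": 2}))     # low-level alts are counted
print(current_rule({"A": 490, "T": 490, "C": 10, "G": 10}))  # heterozygous site is skipped
```

Under this rule the low-level C and G reads at the heterozygous site contribute nothing to the noise estimate.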

We thought it might make sense that, instead of skipping heterozygous / tri-allelic sites, these rules should apply:

  • if there is an A at 49% and T at 49%, the alt allele count should still include the C at 1% and G at 1%

  • if there is an A at 33%, C at 33% and T at 33%, the alt allele count should still include the G at 1%
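One way to express the two rules above is to treat every base above the threshold as part of the genotype and count only the remaining bases as alt. The sketch below assumes that interpretation; the function name `proposed_rule`, the input format, and the strict `>` comparison are hypothetical and not taken from the package.

```python
from typing import Dict, Optional, Tuple

BASES = "ACGT"


def proposed_rule(counts: Dict[str, int], threshold: float = 0.02) -> Optional[Tuple[int, int]]:
    """Return (geno_count, alt_count) for one position without skipping multi-allelic sites.

    Hypothetical helper: every base whose fraction exceeds `threshold` is treated
    as a genotype allele; all other bases are counted as alt (noise).
    """
    total = sum(counts.get(b, 0) for b in BASES)
    if total == 0:
        return None  # no coverage at this position

    geno_count = sum(
        counts.get(b, 0) for b in BASES if counts.get(b, 0) / total > threshold
    )
    return geno_count, total - geno_count


# Heterozygous site: A and T are genotype alleles, C and G are alt.
print(proposed_rule({"A": 49, "T": 49, "C": 1, "G": 1}))
# Tri-allelic site: A, C and T are genotype alleles, G is alt.
print(proposed_rule({"A": 33, "C": 33, "T": 33, "G": 1}))
```

For a homozygous site this gives the same counts as using only the most common base as the genotype, since no other base exceeds the threshold.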

Also, let's update the LaTeX equations to describe these cases as well.
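As a starting point for that update, one possible formulation is sketched here. The symbols are assumptions introduced for illustration and may not match the notation of the existing equations: c_{i,b} is the count of base b at position i, n_i is the total A/C/G/T count at position i, and t is the genotype threshold (2% in the description above).

```latex
% G_i: set of bases treated as genotype alleles at position i (assumed notation)
\[
  G_i = \left\{\, b \in \{A, C, G, T\} : \frac{c_{i,b}}{n_i} > t \,\right\},
  \qquad
  \mathrm{noise} = \frac{\sum_i \sum_{b \notin G_i} c_{i,b}}{\sum_i n_i}
\]
```

With A at 49% and T at 49%, G_i = {A, T} and the numerator picks up the C and G counts; with A, C and T at 33% each, G_i = {A, C, T} and only G contributes. Multiplying by 100 would give noise as a percentage.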

consistency with biometrics package

As @murphycj2 has done in the biometrics package, let's make this package use pysamstats (which really just calls pysam) in the same way:

  • Ensure the stepper is not eliminating duplicate reads (use nofilter?) alimanfoo/pysamstats#91
  • Figure out a way to avoid double-counting overlapping reads (ignore_overlap param does not get passed to pysam... alimanfoo/pysamstats#98)
  • Set the min_mapping_quality, truncate, and max_depth defaults to the same values as biometrics (and ensure they are exposed as parameters)
  • Can we be consistent with min_base_quality, or do the noise and fingerprinting calculations require different values for this to work?
  • We might still want to consider the possibility of combining these two calculations into the same package
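For the double-counting item, one possible approach is to count each fragment (read pair) at most once per position. The sketch below is pure Python and does not use pysam or pysamstats; the `Obs` record, the `count_fragments` name, and the tie-break of keeping the mate with the higher base quality are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Obs(NamedTuple):
    """One read's base call at a single position (hypothetical record type)."""
    query_name: str    # shared by both mates of a pair
    base: str
    base_quality: int


def count_fragments(observations: Iterable[Obs]) -> Counter:
    """Count bases at one position, using at most one observation per fragment."""
    best: Dict[str, Obs] = {}
    for obs in observations:
        kept = best.get(obs.query_name)
        # When both mates overlap the position, keep the higher-quality call.
        if kept is None or obs.base_quality > kept.base_quality:
            best[obs.query_name] = obs
    return Counter(o.base for o in best.values())


pileup = [
    Obs("r1", "A", 30), Obs("r1", "A", 35),  # overlapping mates that agree
    Obs("r2", "C", 20), Obs("r2", "A", 40),  # overlapping mates that disagree
    Obs("r3", "A", 30),                      # no overlap
]
print(count_fragments(pileup))
```

Other policies are possible for disagreeing mates, such as dropping the fragment at that position entirely; which one is appropriate for the noise calculation is an open question here.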

Noise output file format

How should the qc metrics for:

  • noise (%)
  • number of contributing sites
  • noise by substitution type

be represented in this module's output?

I propose to have three separate files with the following columns (plus additional columns inherent to pysamstats):

<sample>_noise.tsv:

alt_count_acgt
geno_count_acgt
noise_percent_acgt
contributing_sites_acgt

alt_count_inc_deletions
geno_count_inc_deletions
noise_percent_inc_deletions
contributing_sites_inc_deletions

alt_count_inc_n
geno_count_inc_n
noise_percent_inc_n
contributing_sites_inc_n

<sample>_noise_by_substitution.tsv:

C>A - geno_count, alt_count, noise, contributing_sites
C>G - geno_count, alt_count, noise, contributing_sites
C>T - geno_count, alt_count, noise, contributing_sites
T>A - geno_count, alt_count, noise, contributing_sites
T>C - geno_count, alt_count, noise, contributing_sites
T>G - geno_count, alt_count, noise, contributing_sites
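A minimal sketch of writing the by-substitution file with the standard library is shown below. The column name `substitution`, the function name, the input format, and the definition of noise as a percentage of alt / (alt + geno) are assumptions; the proposal above only fixes the six substitution types and the four per-row metrics.

```python
import csv
import io
from typing import Dict, TextIO, Tuple

SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
COLUMNS = ["substitution", "geno_count", "alt_count", "noise", "contributing_sites"]


def write_noise_by_substitution(stats: Dict[str, Tuple[int, int, int]], handle: TextIO) -> None:
    """Write one TSV row per substitution type.

    `stats` maps a substitution to (geno_count, alt_count, contributing_sites);
    substitution types missing from `stats` are written as zeros.
    """
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for sub in SUBSTITUTIONS:
        geno, alt, sites = stats.get(sub, (0, 0, 0))
        total = geno + alt
        noise = 100.0 * alt / total if total else 0.0  # percent; 0 when there is no coverage
        writer.writerow({
            "substitution": sub,
            "geno_count": geno,
            "alt_count": alt,
            "noise": noise,
            "contributing_sites": sites,
        })


buffer = io.StringIO()
write_noise_by_substitution({"C>T": (9990, 10, 42)}, buffer)
print(buffer.getvalue())
```

Writing to any text handle keeps the function easy to test; in practice the handle would be an open `<sample>_noise_by_substitution.tsv` file.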

multi-sample mode

Feature request: noise calculation over multiple bams

Subtasks:

  • decide on command line usage - simple list of bam paths?
  • decide on outputs:
    • multiple pileups / noise files?
    • pdf output of noise level across samples
  • use multiprocessing? (or a threading library?)
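One way to structure the multi-sample run is with `concurrent.futures` from the standard library, which allows switching between processes and threads without changing the calling code. In the sketch below, `noise_for_samples` and the `noise_fn` callback are hypothetical names; `noise_fn` stands in for whatever single-BAM noise function the package exposes.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, Optional


def noise_for_samples(
    bam_paths: Iterable[str],
    noise_fn: Callable[[str], Any],
    executor_cls=ProcessPoolExecutor,
    max_workers: Optional[int] = None,
) -> Dict[str, Any]:
    """Run `noise_fn` on every BAM path in parallel and return {path: result}.

    With ProcessPoolExecutor, `noise_fn` must be picklable (a module-level
    function); pass executor_cls=ThreadPoolExecutor to use threads instead.
    """
    paths = list(bam_paths)
    with executor_cls(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with paths.
        return dict(zip(paths, pool.map(noise_fn, paths)))


# Usage with a stand-in function (len) instead of a real noise calculation:
print(noise_for_samples(["a.bam", "bb.bam"], len, executor_cls=ThreadPoolExecutor))
```

A plain list of BAM paths on the command line would map directly onto the `bam_paths` argument.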

investigate drop off in noise around .00039

Why does the noise suddenly drop to zero? Does this calculation need higher precision?

Make sure the value is not being cut off, and that Python correctly handles very small fractions (for example, 1/20,000).
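As a quick check, Python floats have no trouble representing a fraction like 1/20,000, so a drop to zero is more likely to come from integer division, rounding, or output formatting. The snippet below only illustrates those candidate causes; it is not a diagnosis of this particular plot.

```python
from fractions import Fraction

alt, total = 1, 20000

as_float = alt / total           # true division keeps the small value
floor_div = alt // total         # floor (integer) division silently gives 0
rounded = round(0.00039, 3)      # rounding to 3 decimals gives 0.0
formatted = f"{0.00039:.3f}"     # formatting to 3 decimals prints zeros
exact = Fraction(alt, total)     # exact rational arithmetic, if ever needed

print(as_float, floor_div, rounded, formatted, exact)
```

If the values are rounded or formatted before plotting or writing to file, increasing the number of decimals (or writing the raw counts) would avoid the cut-off.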

(Screenshot: Screen Shot 2020-07-07 at 5 30 28 PM)

Confirm that insertions will only count as a single "alt" in this calculation

In the tests for this noise module, there is a 5bp insertion in the test BAM.

This results in a single alt base in the test, which indicates that an insertion of any length is counted as a single alt read. In other words, insertions are included in the base "acgt" noise numbers, but each one counts as only one alt.

catch and log error

Use the logger for this error, with level=ERROR:

	string index out of range (repeated 11 times)
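A minimal sketch of catching this error and reporting it through the standard `logging` module is shown below. The logger name and the `base_at` helper are hypothetical; the issue does not say which call raises the `IndexError`.

```python
import logging
from typing import Optional

logger = logging.getLogger("sequence_qc")  # assumed logger name


def base_at(sequence: str, index: int) -> Optional[str]:
    """Return sequence[index], logging at ERROR level instead of printing on failure."""
    try:
        return sequence[index]
    except IndexError as exc:
        # str(exc) is "string index out of range" for an out-of-bounds string index
        logger.error("Could not read base %d of a %d-base sequence: %s",
                     index, len(sequence), exc)
        return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    base_at("ACGT", 10)
```

Logging the index and sequence length alongside the message makes it easier to find which reads trigger the error.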
