epigen / genome_tracks

A Snakemake workflow for easy visualization of genome browser tracks of aligned BAM files (e.g., RNA-seq, ATAC-seq, scRNA-seq, ...) powered by the wrapper gtracks for the package pyGenomeTracks, and IGV-reports.

Home Page: https://epigen.github.io/genome_tracks/

License: MIT License

Python 100.00%
atac-seq bioinformatics biomedical-data-science genome-browser genome-track genomic-regions pipeline python rna-seq scrna-seq snakemake visualization workflow

Contributors: sreichl

genome_tracks's Issues

5 mouse genes cause errors referencing very large genes on the same chromosome

used mouse genome 12-column BED file:
mm10 from UCSC as gzip https://genome.ucsc.edu/cgi-bin/hgTables assembly:mm10 -> track:NCBI RefSeq -> table:refFlat; output format: BED

error genes

  • Ccl19 -> chr4:42,754,525-42,756,543 2,019 bp
  • Ccl21a -> chr4:42,772,860-42,773,993 1,134 bp
  • Ccl21c -> chr4:42,612,123-42,613,253 1,131 bp
  • Il11ra2 -> chr4:42,656,001-42,665,763 9,763 bp
  • Rarres2 -> chr6:48,546,630-48,549,721 3,092 bp

Mdn1 pops up in all Chr4 error messages:
chr4:32,657,119-32,775,217
118,099 bp

Sspo pops up in Rarres2 (on Chr6) error message
chr6:48,425,163-48,478,169
53,007 bp

Ccl19 (error)
chr4:42,754,525-42,756,543
2,019 bp

Cxcl17: not displayed correctly (no gene shown, coordinates incorrect)
chr7:25,099,478-25,112,311
12,834 bp

Ccl17 completely fine
chr8:95,537,081-95,538,664
1,584 bp
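
As a sanity check on the coordinates listed above, here is a minimal Python sketch that parses the region strings, recomputes the lengths, and tests whether the failing genes overlap the large genes named in the error messages. The helper names (`parse_region`, `region_length`, `overlaps`) are hypothetical and not part of the workflow. With the coordinates as listed, neither Ccl19 and Mdn1 nor Rarres2 and Sspo overlap, so the error messages point at genes that sit elsewhere on the same chromosome.

```python
import re

# Matches strings such as "chr4:42,754,525-42,756,543" (commas optional).
REGION_RE = re.compile(r"^(chr[\w.]+):([\d,]+)-([\d,]+)$")


def parse_region(region):
    """Return (chrom, start, end) as parsed from a genome-browser region string."""
    match = REGION_RE.match(region.strip())
    if match is None:
        raise ValueError(f"not a region string: {region!r}")
    chrom, start, end = match.groups()
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


def region_length(region):
    """Length in bp, counting both end coordinates (as in the listing above)."""
    _, start, end = parse_region(region)
    return end - start + 1


def overlaps(region_a, region_b):
    """True if both regions are on the same chromosome and share at least one base."""
    chrom_a, start_a, end_a = parse_region(region_a)
    chrom_b, start_b, end_b = parse_region(region_b)
    return chrom_a == chrom_b and start_a <= end_b and start_b <= end_a


ccl19 = "chr4:42,754,525-42,756,543"
mdn1 = "chr4:32,657,119-32,775,217"
rarres2 = "chr6:48,546,630-48,549,721"
sspo = "chr6:48,425,163-48,478,169"

print(region_length(ccl19), region_length(mdn1), region_length(sspo))
print(overlaps(ccl19, mdn1), overlaps(rarres2, sspo))
```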

https://www.informatics.jax.org/marker/MGI:109123

https://genome.ucsc.edu/cgi-bin/hgTracks?db=mm39&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr6%3A48425163-48478169&hgsid=1597741285_OPZS1OsEmzcw5yAStALoyZTLcBbj

test & clean everything

  • clean code and comments
  • add reporting for IGV-report
  • macroStim RNA-seq
  • macroStim ATACseq
  • macroStim PT141 for NT, Csf1r subsets (check KO)

make MR.PARETO module

  • #11
  • update and extend docs
    • config.yaml
    • config/README: how to configure single-cell data; it is important that the single-cell metadata.tsv is sample-specific, because different samples could by chance share the same cell barcodes...
    • README
      • remove ALL feature -> always all
      • add new features: ymax per gene/region; color per group; UCSC genome browser hub; IGV-Report, single-cell bam file support,...
      • move all genome browser track related docs/instructions from ATAC-seq pipeline to genome_tracks
      • performance
        • 78 ATAC-seq samples in 31 groups took 22 minutes with max 4 cores and 4GB memory
        • 64 RNA-seq samples in 31 groups took 16 minutes with max 4 cores and 4GB memory
        • 2 10x genomics 5' scRNA-seq reactions each ~10k cells split in 2 small subset groups took 31 minutes with max 4 cores and 8GB memory
  • go through MRP checklist

change input to CSV instead of txt file

  • consider gene/region input as csv
  • first column gene/region
  • second column: ymax or "" (then remove it from the config) → then it is at least configurable at the gene/region level → can be extracted like the DEA parameters in the dea_limma workflow
  • ymax parameter: in addition to "" (auto), also support "max", which would determine the maximum value across everything to be plotted and use it as ymax → not possible, because that value is not known beforehand
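
A minimal sketch of how such a two-column CSV could be read, assuming no header row, the gene/region in the first column, and an optional ymax in the second (an empty value meaning auto). The function name `read_regions` and the example rows are hypothetical. Regions are written without thousands separators here, since commas would otherwise split the field.

```python
import csv
import io


def read_regions(handle):
    """Map each gene/region to its ymax (float), or None when ymax is empty (auto)."""
    regions = {}
    for row in csv.reader(handle):
        # Skip blank lines and rows without a gene/region name.
        if not row or not row[0].strip():
            continue
        name = row[0].strip()
        raw_ymax = row[1].strip() if len(row) > 1 else ""
        regions[name] = float(raw_ymax) if raw_ymax else None
    return regions


# Hypothetical file content; a real file would be opened with open(path, newline="").
example = io.StringIO("Ccl17,\nCxcl17,50\nchr4:42754525-42756543,12.5\n")
regions = read_regions(example)
print(regions)
```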

check and implement parameters

  • check normalization methods in deeptools::bamCoverage (e.g., RPKM vs RPGC) -> configurable?
  • check if scaling is necessary in deeptools::bamCoverage, i.e., depending on the number of samples merged, or during merging with samtools -> no, samtools merge literally only merges.
    • no scaling is necessary, as it is done by all normalization methods. RPGC* takes into account not only the total number of reads but also the effective genome size, making it even more robust.
  • check if the parameters work with tracks (they are from pyGenomeTracks): --dpi 300 --fontSize 12
    • nope

from ATAC-seq pipeline

bamCoverage --bam {input.bam} \
            -p max --binSize 10  --normalizeUsing RPGC \
            --effectiveGenomeSize {params.genome_size} --extendReads 175 \
            -o "{output.bigWig}" > "{output.bigWig_log}" 2>&1;
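
One way the normalization method could be made configurable is sketched below in Python: the function assembles the same bamCoverage call as an argument list and only adds --effectiveGenomeSize when RPGC is selected. The function name `bamcoverage_cmd`, its parameter names, and the file names in the example are hypothetical; the defaults simply mirror the command above, and this is not how the workflow currently does it.

```python
# Values accepted by bamCoverage --normalizeUsing.
NORMALIZATIONS = ("RPKM", "CPM", "BPM", "RPGC", "None")


def bamcoverage_cmd(bam, bigwig, normalize="RPGC", genome_size=None,
                    bin_size=10, extend_reads=175, threads="max"):
    """Build a bamCoverage argument list with a configurable normalization."""
    if normalize not in NORMALIZATIONS:
        raise ValueError(f"unknown normalization: {normalize!r}")
    cmd = [
        "bamCoverage", "--bam", bam,
        "-p", str(threads),
        "--binSize", str(bin_size),
        "--normalizeUsing", normalize,
        "--extendReads", str(extend_reads),
        "-o", bigwig,
    ]
    if normalize == "RPGC":
        # RPGC is the only method that needs the effective genome size.
        if genome_size is None:
            raise ValueError("RPGC requires an effective genome size")
        cmd += ["--effectiveGenomeSize", str(genome_size)]
    return cmd


# Hypothetical file names and genome size, for illustration only.
print(" ".join(bamcoverage_cmd("group.bam", "group.bw", "RPGC", genome_size=2150570000)))
print(" ".join(bamcoverage_cmd("group.bam", "group.bw", "RPKM")))
```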

*RPGC, on the other hand, does not only take the total number of reads into consideration, it also needs the effective genome size (which will differ from the "real" genome size because, for mapping reads, those regions of the genome where the sequence is either not determined or too repetitive to be covered should not be taken into consideration for calculating the coverage). Note that the exact effective genome size might be bigger than the values we indicate in the help texts if you have very long sequencing reads. For the example above, RPGC would work as follows:
sequencing depth = (total number of mapped reads * fragment length) / effective genome size = (50 x 10^6 * 200) / (2.15057 x 10^9) = 4.65
RPGC scaling factor = 1/sequencing depth = 1/4.65 = 0.22
RPGC(bin1) = 0.22 * 10 = 2.2
RPGC(bin2) = 0.22 * 12 = 2.64
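
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the worked example, not deepTools code; the function name `rpgc_scaling_factor` is hypothetical, and the raw bin values 10 and 12 are taken from the example. Note that the quoted results use the scaling factor rounded to two decimals (0.22); the unrounded factor is about 0.215.

```python
def rpgc_scaling_factor(mapped_reads, fragment_length, effective_genome_size):
    """Return (sequencing depth, RPGC scaling factor) as defined in the example."""
    depth = mapped_reads * fragment_length / effective_genome_size
    return depth, 1.0 / depth


depth, factor = rpgc_scaling_factor(50e6, 200, 2.15057e9)

# The worked example rounds the factor to two decimals before scaling the bins.
rounded_factor = round(factor, 2)
bin1 = rounded_factor * 10
bin2 = rounded_factor * 12

print(f"depth={depth:.2f} factor={factor:.4f} bin1={bin1:.2f} bin2={bin2:.2f}")
```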

check & provide instructions for single cell modalities

should also work for scRNA-seq and scATAC-seq data/samples that get grouped and merged by metadata

  • works in principle, need to find a way to differentiate between normal and single cell bams in rule sc_bams
  • the order has to be reversed for single-cell data:
    • First split per sample into sample/group.bam
    • Then merge the same groups together, put the result into merged_band, and continue from there as before.
    • So for single-cell data, there is an additional preprocessing step with sinto.
    • Why? Because of potential barcode duplicates across samples (can happen by chance)
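
The split-then-merge order can be illustrated with a small Python sketch that only plans the file layout and does not call sinto or samtools. The sample names, barcodes, group labels, and the function `plan_split_then_merge` are hypothetical. The example shows one barcode that occurs in two samples with different groups, which is why the barcode-to-group lookup has to stay sample-specific.

```python
from collections import defaultdict


def plan_split_then_merge(metadata_by_sample):
    """Given {sample: {barcode: group}}, return {group: [sample/group.bam, ...]}.

    Step 1 (split) yields one BAM per sample and group; step 2 (merge)
    combines the per-sample BAMs that belong to the same group.
    """
    merge_plan = defaultdict(list)
    for sample, barcode_to_group in sorted(metadata_by_sample.items()):
        for group in sorted(set(barcode_to_group.values())):
            merge_plan[group].append(f"{sample}/{group}.bam")
    return dict(merge_plan)


# Hypothetical per-sample metadata; the first barcode occurs in both samples.
metadata = {
    "sampleA": {"AAACCTGA-1": "NT", "TTTGGTCA-1": "KO"},
    "sampleB": {"AAACCTGA-1": "KO"},
}

plan = plan_split_then_merge(metadata)
shared = set(metadata["sampleA"]) & set(metadata["sampleB"])

print(plan)
print("barcodes present in both samples:", shared)
```

Splitting a pooled BAM by barcode alone would assign every read carrying the shared barcode to a single group, whereas here each sample resolves it with its own metadata.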

make output format configurable

from the docs: The file type of the plot will be determined by the output file extension.

currently hard coded: .svg
extend to: .png and .pdf
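
Because the plot file type follows from the output file extension, making the format configurable could be as simple as validating a config value and building the output path from it. A minimal sketch, assuming a hypothetical helper `plot_path` and a hypothetical output directory; the default stays at svg to match the current behavior.

```python
from pathlib import Path

# Formats named above: the current default plus the two proposed extensions.
ALLOWED_FORMATS = ("svg", "png", "pdf")


def plot_path(out_dir, name, fmt="svg"):
    """Return the plot path for a gene/region, with the file type set by the extension."""
    fmt = fmt.lower().lstrip(".")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format {fmt!r}; choose one of {ALLOWED_FORMATS}")
    return Path(out_dir) / f"{name}.{fmt}"


print(plot_path("tracks", "Ccl17"))
print(plot_path("tracks", "Ccl17", ".PDF"))
```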
