epigen / genome_tracks

A Snakemake workflow for easy visualization of genome browser tracks of aligned BAM files (e.g., RNA-seq, ATAC-seq, scRNA-seq, ...) powered by the wrapper gtracks for the package pyGenomeTracks, and IGV-reports.

Home Page: https://epigen.github.io/genome_tracks/

License: MIT License

Python 100.00%

atac-seq bioinformatics biomedical-data-science genome-browser genome-track genomic-regions pipeline python rna-seq scrna-seq snakemake visualization workflow

genome_tracks's Issues

5 mouse genes cause errors referencing very large genes on the same chromosome

used mouse genome 12-column BED file:
mm10 from UCSC as gzip https://genome.ucsc.edu/cgi-bin/hgTables assembly:mm10 -> track:NCBI RefSeq -> table:refFlat; output format: BED

error genes

Ccl19 -> chr4:42,754,525-42,756,543 2,019 bp
Ccl21a -> chr4:42,772,860-42,773,993 1,134 bp
Ccl21c -> chr4:42,612,123-42,613,253 1,131 bp
Il11ra2 -> chr4:42,656,001-42,665,763 9,763 bp
Rarres2 -> chr6:48,546,630-48,549,721 3,092 bp

Mdn1 pops up in all Chr4 error messages:
chr4:32,657,119-32,775,217
118,099 bp

Sspo pops up in Rarres2 (on Chr6) error message
chr6:48,425,163-48,478,169
53,007 bp

Ccl19 (error)
chr4:42,754,525-42,756,543
2,019 bp

Cxcl17 is not displayed correctly: no gene is shown and the coordinates are wrong
chr7:25,099,478-25,112,311
12,834 bp

Ccl17 completely fine
chr8:95,537,081-95,538,664
1,584 bp

https://www.informatics.jax.org/marker/MGI:109123

https://genome.ucsc.edu/cgi-bin/hgTracks?db=mm39&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr6%3A48425163-48478169&hgsid=1597741285_OPZS1OsEmzcw5yAStALoyZTLcBbj
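
The sizes listed in this issue are consistent with 1-based, inclusive coordinates (size = end - start + 1). A minimal sketch for checking a region string against its stated size is shown here; `parse_region` is a hypothetical helper for illustration and not part of the workflow.

```python
import re


def parse_region(text):
    """Parse 'chr4:42,754,525-42,756,543' into (chrom, start, end, size_bp).

    Assumes 1-based inclusive coordinates, so size = end - start + 1.
    """
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", text.strip())
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    chrom, start, end = match.groups()
    # Drop the thousands separators used in the issue text.
    start = int(start.replace(",", ""))
    end = int(end.replace(",", ""))
    return chrom, start, end, end - start + 1


ccl19 = parse_region("chr4:42,754,525-42,756,543")
mdn1 = parse_region("chr4:32,657,119-32,775,217")
```

With these inputs the computed sizes match the values quoted above (2,019 bp for Ccl19 and 118,099 bp for Mdn1).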

test & clean everything

consider switching to ggcoverage package

make MR.PARETO module

change input to CSV instead of txt file

consider gene/region input as csv
first column gene/region
second column ymax/"" (then remove from config) → then it's at least configurable on a gene/region level → can be extracted like DEA parameters in dea_limma workflow
ymax parameter: in addition to "" (== auto), also allow "max", which determines the max value across everything to be plotted and uses that as ymax → not possible, because that value is not known beforehand
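
A minimal sketch of how such a CSV could be read, assuming no header row, the gene/region in the first column, and an optional ymax in the second column (empty means auto). The function name `read_regions` and the example rows are hypothetical.

```python
import csv
import io


def read_regions(handle):
    """Return {gene_or_region: ymax}, where ymax is None for 'auto'."""
    regions = {}
    for row in csv.reader(handle):
        # Skip blank lines and rows without a gene/region name.
        if not row or not row[0].strip():
            continue
        name = row[0].strip()
        raw_ymax = row[1].strip() if len(row) > 1 else ""
        # An empty second column means pyGenomeTracks should pick ymax itself.
        regions[name] = float(raw_ymax) if raw_ymax else None
    return regions


# Region strings must not contain thousands separators, since ',' is the delimiter.
example = io.StringIO("Ccl17,\nMdn1,25\nchr4:42754525-42756543,10.5\n")
regions = read_regions(example)
```

This keeps ymax configurable per gene/region, so the global ymax entry could then be dropped from the config.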

check and implement parameters

check normalization methods in deeptools::bamCoverage (e.g., RPKM vs RPGC) -> configurable?
check if scaling is necessary in deeptools::bamCoverage, i.e., depending on the number of samples merged, or during merging with samtools -> no, samtools merge literally only merges.
- no scaling is necessary, as all normalization methods already do it. RPGC* takes into account not only the total number of reads but also the effective genome size, making it even more robust.
check if the parameters work with gtracks (they are from pyGenomeTracks): --dpi 300 --fontSize 12
- nope

from ATAC-seq pipeline

bamCoverage --bam {input.bam} \
            -p max --binSize 10  --normalizeUsing RPGC \
            --effectiveGenomeSize {params.genome_size} --extendReads 175 \
            -o "{output.bigWig}" > "{output.bigWig_log}" 2>&1;

*RPGC, on the other hand, does not only take the total number of reads into consideration, it also needs the effective genome size (which will differ from the "real" genome size because for mapping reads those regions of the genome where the sequence is either not determined or too repetitive to be covered should not be taken into consideration for calculating the coverage). Note that the exact effective genome size might be bigger than the values we indicate in the help texts if you have very long sequencing reads. For the example above, RPGC would work as follows:
sequencing depth = (total number of mapped reads * fragment length) / effective genome size = (50 x 10^6 * 200) / (2.15057 x 10^9) = 4.65
RPGC scaling factor = 1/sequencing depth = 1/4.65 = 0.22
RPGC(bin1) = 0.22 * 10 = 2.2
RPGC(bin2) = 0.22 * 12 = 2.64
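
The arithmetic above can be reproduced with a short sketch. The function name `rpgc_scale_factor` is hypothetical, the input values are the ones from the quoted example, and this only illustrates the formula, not how bamCoverage is implemented internally.

```python
def rpgc_scale_factor(mapped_reads, fragment_length, effective_genome_size):
    """Return the RPGC scaling factor, i.e. 1 / sequencing depth."""
    sequencing_depth = mapped_reads * fragment_length / effective_genome_size
    return 1.0 / sequencing_depth


# Values from the quoted example: 50 million mapped reads, 200 bp fragments.
factor = rpgc_scale_factor(50e6, 200, 2.15057e9)

# Raw per-bin values used in the quoted example (bin1 = 10, bin2 = 12).
raw_bins = [10, 12]

# The quoted example rounds the factor to 0.22 before multiplying.
rounded_bins = [round(round(factor, 2) * value, 2) for value in raw_bins]
# Without intermediate rounding the results are slightly smaller.
exact_bins = [factor * value for value in raw_bins]
```

Note that the quoted values (2.2 and 2.64) come from the rounded factor 0.22; the unrounded factor is about 0.215, which gives roughly 2.15 and 2.58.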

works in principle, but I need to find a way to differentiate between normal and single-cell BAMs in rule sc_bams
I have to reverse the order for single-cell data
- First split per sample into sample/group.bam
- Then merge the same groups together and put into merged_band and take from there as before.
- So for single-cell data, add a preprocessing step with sinto.
- Why? Because of potential barcode duplicates across samples (can happen by chance)
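
The reversed order can be sketched as a small planning step, assuming each cell is annotated as a (sample, barcode, group) tuple. The function name `plan_single_cell_merge` and the example data are hypothetical; the actual splitting would be done by sinto and the merging by samtools, neither of which is called here.

```python
from collections import defaultdict


def plan_single_cell_merge(cells):
    """Plan split-then-merge for single-cell BAMs.

    cells: iterable of (sample, barcode, group) tuples.
    Returns (split, merge):
      split maps (sample, group) -> set of barcodes to extract from that sample,
      merge maps group -> list of per-sample BAM paths 'sample/group.bam'.
    """
    # Step 1: split per sample, so identical barcodes in different samples stay apart.
    split = defaultdict(set)
    for sample, barcode, group in cells:
        split[(sample, group)].add(barcode)

    # Step 2: merge the per-sample BAMs that belong to the same group.
    merge = defaultdict(list)
    for sample, group in sorted(split):
        merge[group].append(f"{sample}/{group}.bam")
    return dict(split), dict(merge)


cells = [
    ("s1", "AAAC", "Tcell"),
    ("s1", "GGGT", "Bcell"),
    ("s2", "AAAC", "Bcell"),  # same barcode as in s1, but a different sample and group
]
split, merge = plan_single_cell_merge(cells)
```

Because the split is keyed by sample first, the barcode shared by s1 and s2 ends up in the correct group for each sample instead of being mixed after a merge.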

https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/igv-reports.html
https://github.com/igvteam/igv-reports
If I understand correctly you can create one report summarizing all samples, which would be great
try it manually on one test example, if easy -> implement.
- in bioconda: https://bioconda.github.io/recipes/igv-reports/README.html
