caleblareau / mgatk

mgatk: mitochondrial genome analysis toolkit

Home Page: http://caleblareau.github.io/mgatk

License: MIT License

Python 72.95% R 27.05%

mitochondrial genotype genomics single-cell epigenetics python3 r

mgatk's Issues

circular mitochondrial genome

Mitochondrial DNA is circular (plasmid-like).

Basically, we need a workflow that creates a surrogate second mitochondrial chromosome
that joins, say, the last 50 bp of the chromosome to the first 50 bp. This should be made into
its own chromosome. Then, a new reference genome build for the favorite tool has to be made.

For mgatk purposes, we need something intelligent to process 2-chromosome .fasta files of mitochondrial chromosomes that is also sensitive to multi-mapping when filtering the .bam file. And finally, variant quantification has to be more intelligent to handle multiple chromosomes, etc.
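
A minimal sketch of the surrogate-contig idea described above, assuming the mitochondrial sequence is already available as a plain string. The function names and the contig name `chrM_junction` are hypothetical and not part of mgatk.

```python
def make_junction_contig(seq: str, flank: int = 50) -> str:
    """Join the last `flank` bases of a circular sequence to its first `flank` bases."""
    if flank <= 0 or flank > len(seq):
        raise ValueError("flank must be between 1 and len(seq)")
    return seq[-flank:] + seq[:flank]


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format a sequence as a single FASTA record with fixed line width."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy = "ACGTACGTAAGGCCTT"  # toy sequence, not a real mitochondrial genome
    print(to_fasta("chrM_junction", make_junction_contig(toy, flank=4)))
```

Reads spanning the origin would then be able to align to the junction contig; how to reconcile them with positions on the primary chromosome is left open here, as in the issue.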

empty bam file

messes everything up (literally 0 mito reads); handle as an edge case in probably all scripts unfortunately... https://github.com/aryeelab/mgatk/tree/master/Pypkg/mgatk/mgatk/bin/python

That or just remove it from the first

Integrating multiple data

Hello, I am applying mgatk on several datasets and at a certain point I will need to integrate them. Given that the same cell barcode can be in different experiments, I was thinking I could concatenate the rds objects so that every cell gets a unique name across multiple datasets. Is this legit for downstream analysis? I thought I could then perform a single run of variant calling on the merged dataset.
I admit I'm not proficient with R, so any hint on how to merge multiple SummarizedExperiment objects is much appreciated :-)
Another question that comes after integration is how to evaluate the similarities among cells. Once variants have been identified (and filtered), would it be appropriate to calculate distances on the allele_frequency values in the data slot?
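
The renaming step mentioned above (making every cell name unique across experiments) can be sketched as follows. This only illustrates the mechanics in Python; the function names are hypothetical, and it says nothing about whether a merged variant-calling run is statistically appropriate.

```python
from collections import Counter


def prefix_barcodes(barcodes, dataset):
    """Prepend the dataset name so identical barcodes from different runs stay distinct."""
    return [f"{dataset}_{bc}" for bc in barcodes]


def merge_barcodes(datasets):
    """datasets: dict mapping dataset name -> list of cell barcodes."""
    merged = []
    for name, barcodes in datasets.items():
        merged.extend(prefix_barcodes(barcodes, name))
    # Guard against names that still collide (e.g. duplicated input barcodes).
    dupes = [bc for bc, n in Counter(merged).items() if n > 1]
    if dupes:
        raise ValueError(f"non-unique cell names after prefixing: {dupes[:5]}")
    return merged
```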

NA

Improper folder specification

Describe the bug

I am running mgatk on some BAM files, all processed in the same way, and it works for many of them but for one in particular I keep having this error:

[…]
REMOVED:  ACCTAGTGAGGGTTCC
REMOVED:  TGGAATCACGGGATTG
REMOVED:  TTGTAACCTATAACCC
Tue Nov 23 16:17:37 CET 2021: Genotyping samples with 4 threads
Error in checkGrep(grep(".A.txt", files)) : 
  Improper folder specification; file missing / extra file present. See documentation
Calls: importMito -> checkGrep
Execution halted

A summary of .log files

The relevant part is here

Error in rule make_final_sparse_matrices:
    jobid: 0
    output: HCT116_FOLFIRI_LT_tnH/final/HCT116_FOLFIRI_LT_tnH.A.txt.gz, HCT116_FOLFIRI_LT_tnH/final/HCT116_FOLFIRI_LT_tnH.C.txt.gz, HCT116_FOLFIRI_LT_tnH/final/HCT116_FOLFIRI_LT_tnH.G.txt.gz, HCT116_FOLFIRI_LT_tnH/final/HCT116_FOLFIRI_LT_tnH.T.txt.gz, HCT116_FOLFIRI_LT_tnH/final/HCT116_FOLFIRI_LT_tnH.coverage.txt.gz

RuleException:
AttributeError in line 54 of /home/cittaro.davide/miniforge3/envs/getseq/lib/python3.9/site-packages/mgatk/bin/snake/Snakefile.Gather:
'InputFiles' object has no attribute 'As'
  File "/home/cittaro.davide/miniforge3/envs/getseq/lib/python3.9/site-packages/mgatk/bin/snake/Snakefile.Gather", line 54, in __rule_make_final_sparse_matrices
  File "/home/cittaro.davide/miniforge3/envs/getseq/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Exiting because a job execution failed. Look above for error message

Apparently the As are not found (is that even possible?), so instead of an empty file, there's no file at all.

UMIs for 3' / 5' sequencing data

Environment setup

Hi there. I was excited to start using your tool but I'm having some issues getting it going. I've installed mgatk, but when I run it I get the following error:

/home/anaconda3/envs/mgatk/lib/R/bin/exec/R: error while loading shared libraries: libreadline.so.6: cannot open shared object file: No such file or directory
ERROR: cannot find the following R package: {'Matrix', 'SummarizedExperiment', 'data.table', 'GenomicRanges'} Install it in your R console and then try rerunning mgatk (but there may be other missing dependencies).

I have these packages installed in my conda environment, and I still get this error. I'm used to performing analyses in Python rather than R, so forgive me if this is a stupid question, but any help in resolving it would be much appreciated.

Thanks,
John

read tag

For these giant single cell datasets, would be good to get a version of the pipeline going that just works for single .bam files with a read group ID

Using call mode for bulk RNA-seq data.

Hi there!

I am trying to use mgatk's call mode to call variants on bulk RNA-seq data. It appears the following files are not generated:

*.variant_stats.tsv.gz
*.cell_heteroplasmic_df.tsv.gz
*.vmr_strand_plot.png

I used the following command on a directory that contains multiple aligned bams from different individuals.
mgatk call -i bams_og -n GTEX_4 -o /global/scratch/fran_catalan/Data/MitoData/GTEX_v8/Liver/mgatk_test/Liver_4 -c 24 -g hg38 --snake-stdout

A summary of .log files

cat Liver_4/logs/*log
Wed Jul 21 14:51:45 PDT 2021: Starting analysis with mgatk
Wed Jul 21 14:51:45 PDT 2021: mgatk will process 226 samples
Wed Jul 21 14:52:43 PDT 2021: Processing samples with 24 threads
Wed Jul 21 20:17:34 PDT 2021: mgatk successfully processed the supplied .bam files
Wed Jul 21 20:40:40 PDT 2021: Successfully created final output files
Wed Jul 21 20:40:53 PDT 2021: Intermediate files successfully removed.

Describe the sequencing assay being analyzed

I'm analyzing bulk RNA data from GTEx v8. I want to compare the variants I find through mgatk with the variants found in Ludwig et al. 2020 (awesome paper btw!). I previously tried my own custom variant pipeline where I used GATK's Mutect2 in mitochondrial mode, but it returned very few unique heteroplasmic variants.

Clarify if the execution is successful on the test data provided in the repository

The three extra files aren't generated when I run the call or tenx mode on the test data provided.

Additional context

I'm hoping you might have some advice on how to best compare heteroplasmic variants between groups of individuals from the output that mgatk generates. Apologies that my question is a little off topic, mundane, and doesn't necessarily pertain to single cell, but thank you so much for your time!

Force upper case in reference

check .snakemake directory

if it exists, get rid of it on execution
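
A minimal sketch of that check, assuming the stale directory lives directly under the working directory. The function name is hypothetical and this is not mgatk's actual implementation.

```python
import shutil
from pathlib import Path


def remove_stale_snakemake_dir(workdir="."):
    """Delete a leftover .snakemake directory; return True if one was removed."""
    target = Path(workdir) / ".snakemake"
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```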

How to limit the virtual memory use of mgatk

I tried to run the mgatk package in an assigned node (memory 192GB), and the session was killed because the occupied virtual memory exceeded the limit of the server:

Job XXX Aborted
Exit Status = 137
Signal = KILL
User = XXX
Queue = XXX
Host = XXX
Start Time = 05/21/2021 09:58:48.949
End Time = 05/21/2021 17:06:26.175
CPU = 3:12:13:14
Max vmem = 749.543G
Max rss = NA
failed execd enforced h_vmem limit because:
job 7921630.1 died through signal KILL (9).

I tried to limit the general use of virtual memory of the node with the unix command ulimit. But the session was still aborted.
I added the cluster limit to be 12. It was of no use.

I would appreciate it very much if you could find the solution to the issue.

P.S.: The code I ran:

#!/bin/bash
#$ -cwd
# error = Merged with joblog
#$ -o joblog.$JOB_ID
#$ -j y
#$ -l h_rt=12:00:00,h_data=192G,rh7
# Email address to notify
#$ -M $USER@mail
# Notify when
#$ -m bea


# load the job environment:
. /u/local/Modules/default/init/modules.sh
module use /u/project/CCN/apps/modulefiles

# Load the required modules
module load gcc
module load java
module load R
module load python/3.7.3
 
python3 -m venv venv3
source venv3/bin/activate
pip3 install mgatk
cd /u/project/XXX/XXX
ulimit -v 67108864
mgatk tenx -i /u/project/XXX/XXX/XXX/XXX/atac_possorted_bam.bam -n XXX -o XXX -bt CB -b /u/project/XXX/XXX/XXX/XXX/XXX_atac_barcode.tsv -g hg38 -c 12
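
For reference, bash's `ulimit -v` takes its value in units of 1024 bytes, so the limit in the script above can be compared against the reported Max vmem with a little arithmetic. The helper names below are hypothetical; this is only a unit conversion, not a fix for the memory use itself.

```python
KIB = 1024  # bash `ulimit -v` counts in 1024-byte units


def ulimit_v_to_gib(kib_value: int) -> float:
    """Convert a `ulimit -v` value to GiB."""
    return kib_value * KIB / 1024**3


def gib_to_ulimit_v(gib: float) -> int:
    """Convert a GiB budget to the value expected by `ulimit -v`."""
    return int(gib * 1024**3 // KIB)


if __name__ == "__main__":
    print(ulimit_v_to_gib(67108864))  # limit used in the job script
    print(gib_to_ulimit_v(192))       # value matching the 192G node
```

The script's value of 67108864 therefore corresponds to 64 GiB, far below the 749.543G peak that the scheduler reported.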

chunk_barcoded_bam.py issue

Hi there! When I try to run mgatk on our cluster, I get the following printed to the error file repeated around 20-30 times:

"/systempath..../chunk_barcoded_bam.py", line 59, in
faux_umi = barcode_id[0:16] + umi_id + fauxdon[(int(barcode_id[17:]) - 1)]
ValueError: invalid literal for int() with base 10: '

and then after the repeats of the above, the final lines of the error file say:

Error in checkGrep(grep(".A.txt", files)) :
Improper folder specification; file missing / extra file present. See documentation
Calls: importMito -> checkGrep
Execution halted
r: invalid literal for int() with base 10: '' invalid literal for int() with base 10: ''

Any help would be great.
Thanks :)
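
The `ValueError` in the traceback above is what Python raises when `int()` receives an empty string, which happens if `barcode_id[17:]` is empty, for example when a 16-base barcode carries no `-1` style suffix. Whether that is the cause in this run is an assumption. A defensive parse might look like the sketch below, where `split_barcode` is a hypothetical helper and not part of mgatk.

```python
def split_barcode(barcode_id: str, default_suffix: int = 1):
    """Split 'ACGT...-1' into (core, suffix); fall back to a default when no suffix exists."""
    core, sep, suffix = barcode_id.partition("-")
    if not sep or not suffix:
        return core, default_suffix
    return core, int(suffix)


if __name__ == "__main__":
    try:
        int("ACCTAGTGAGGGTTCC"[17:])  # slice is '' for a 16-character barcode
    except ValueError as err:
        print(err)
    print(split_barcode("ACCTAGTGAGGGTTCC-1"))
    print(split_barcode("ACCTAGTGAGGGTTCC"))
```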

Failed to open file "CRA_test1_mgatk/temp/temp_bam/barcodes.8.temp0.bam" : No such file or directory

Hi,

I am running mgatk tenx as advised in the documentation:

mgatk tenx -i $folder_bam/possorted_bam.bam -n CRA_test1 -o CRA_test1_mgatk -c 12 -bt CB -b $folder_bam/filtered_peak_bc_matrix/barcodes.tsv

I get the following error:

Error in checkGrep(grep(".A.txt", files)) :
Improper folder specification; file missing / extra file present. See documentation
Calls: importMito -> checkGrep
Execution halted
(myenv)jovyan@jupyter-bio-2druxandra-2dtesloianu:~$ [E::hts_open_format] Failed to open file "CRA_test1_mgatk/temp/temp_bam/barcodes.8.temp0.bam" : No such file or directory
[Sat Nov 7 14:38:19 2020]
Error in rule process_one_slice:
jobid: 0
output: CRA_test1_mgatk/qc/depth/barcodes.8.depth.txt, CRA_test1_mgatk/temp/sparse_matrices/barcodes.8.A.txt, CRA_test1_mgatk/temp/sparse_matrices/barcodes.8.C.txt, CRA_test1_mgatk/temp/sparse_matrices/barcodes.8.G.txt, CRA_test1_mgatk/temp/sparse_matrices/barcodes.8.T.txt, CRA_test1_mgatk/temp/sparse_matrices/barcodes.8.coverage.txt

RuleException:
SamtoolsError in line 93 of /home/jovyan/my-conda-envs/myenv/lib/python3.7/site-packages/mgatk/bin/snake/Snakefile.tenx:
'samtools returned with error 1: stdout=, stderr=samtools sort: can't open "CRA_test1_mgatk/temp/temp_bam/barcodes.8.temp0.bam": No such file or directory\n'
File "/home/jovyan/my-conda-envs/myenv/lib/python3.7/site-packages/snakemake/executors/init.py", line 2252, in run_wrapper
File "/home/jovyan/my-conda-envs/myenv/lib/python3.7/site-packages/mgatk/bin/snake/Snakefile.tenx", line 93, in __rule_process_one_slice
File "/home/jovyan/my-conda-envs/myenv/lib/python3.7/site-packages/pysam/utils.py", line 75, in call
File "/home/jovyan/my-conda-envs/myenv/lib/python3.7/site-packages/snakemake/executors/init.py", line 560, in _callback
File "/home/jovyan/my-conda-envs/myenv/lib/python3.7/concurrent/futures/thread.py", line 57, in run
File "/home/jovyan/my-conda-envs/myenv/lib/python3.7/site-packages/snakemake/executors/init.py", line 546, in cached_or_run
File "/home/jovyan/my-conda-envs/myenv/lib/python3.7/site-packages/snakemake/executors/init.py", line 2264, in run_wrapper
Exiting because a job execution failed. Look above for error message
[E::hts_open_format] Failed to open file "CRA_test1_mgatk/temp/temp_bam/barcodes.4.temp0.bam" : No such file or directory

When I do ls -lR on the CRA_test1_mgatk directory, it seems like it completely lacks the temp directory

(myenv)jovyan@jupyter-bio-2druxandra-2dtesloianu:~/CRA_test1_mgatk$ ls -lR
.:
total 12
drwxr-sr-x 2 jovyan users 4096 Nov 7 13:05 final
drwxr-sr-x 4 jovyan users 4096 Nov 7 13:11 logs
drwxr-sr-x 3 jovyan users 4096 Nov 7 14:38 qc

./final:
total 120
-rw-r--r-- 1 jovyan users 121446 Nov 7 14:31 chrM_refAllele.txt

./logs:
total 16
-rw-r--r-- 1 jovyan users 1392 Nov 7 14:37 base.mgatk.log
-rw-r--r-- 1 jovyan users 602 Nov 7 14:37 CRA_test1.parameters.txt
-rw-r--r-- 1 jovyan users 0 Nov 7 14:37 CRA_test1.snakemake_tenx.log
drwxr-sr-x 2 jovyan users 4096 Nov 7 14:26 filterlogs
drwxr-sr-x 2 jovyan users 4096 Nov 7 13:11 rmdupslogs

./logs/filterlogs:
total 96
-rw-r--r-- 1 jovyan users 26 Nov 7 14:38 barcodes.10.filter.log
-rw-r--r-- 1 jovyan users 26 Nov 7 14:38 barcodes.11.filter.log
-rw-r--r-- 1 jovyan users 26 Nov 7 14:38 barcodes.12.filter.log
-rw-r--r-- 1 jovyan users 24 Nov 7 14:25 barcodes.13.filter.log
-rw-r--r-- 1 jovyan users 24 Nov 7 14:26 barcodes.14.filter.log
-rw-r--r-- 1 jovyan users 24 Nov 7 14:26 barcodes.15.filter.log
-rw-r--r-- 1 jovyan users 24 Nov 7 14:26 barcodes.16.filter.log
-rw-r--r-- 1 jovyan users 24 Nov 7 14:26 barcodes.17.filter.log
-rw-r--r-- 1 jovyan users 24 Nov 7 14:25 barcodes.18.filter.log
-rw-r--r-- 1 jovyan users 24 Nov 7 14:25 barcodes.19.filter.log
-rw-r--r-- 1 jovyan users 26 Nov 7 14:38 barcodes.1.filter.log
-rw-r--r-- 1 jovyan users 24 Nov 7 14:25 barcodes.20.filter.log
-rw-r--r-- 1 jovyan users 24 Nov 7 14:26 barcodes.21.filter.log
-rw-r--r-- 1 jovyan users 24 Nov 7 14:26 barcodes.22.filter.log
-rw-r--r-- 1 jovyan users 24 Nov 7 14:26 barcodes.23.filter.log
-rw-r--r-- 1 jovyan users 24 Nov 7 14:26 barcodes.24.filter.log
-rw-r--r-- 1 jovyan users 26 Nov 7 14:38 barcodes.2.filter.log
-rw-r--r-- 1 jovyan users 26 Nov 7 14:38 barcodes.3.filter.log
-rw-r--r-- 1 jovyan users 26 Nov 7 14:38 barcodes.4.filter.log
-rw-r--r-- 1 jovyan users 26 Nov 7 14:38 barcodes.5.filter.log
-rw-r--r-- 1 jovyan users 26 Nov 7 14:38 barcodes.6.filter.log
-rw-r--r-- 1 jovyan users 26 Nov 7 14:38 barcodes.7.filter.log
-rw-r--r-- 1 jovyan users 26 Nov 7 14:38 barcodes.8.filter.log
-rw-r--r-- 1 jovyan users 26 Nov 7 14:38 barcodes.9.filter.log

./logs/rmdupslogs:
total 0

./qc:
total 4
drwxr-sr-x 2 jovyan users 4096 Nov 7 13:11 quality

./qc/quality:
total 0

MultiAssayExperiment

Move primary R object from RSE to MAE

allows for multiple alleles per position
allows for integration of meta-data in a straightforward way
Don't have to duplicate per-allele coverage

correctly specify reference genome or .fasta file

I am running the following:

mgatk tenx -i /mypath//atac_possorted_bam.bam
-n Pilot1 -o Pitlo1_mgatk -c 12
-bt CB -b /mypath/outs/barcodes.tsv

and I get the following output:

Found file of barcodes to be parsed: /mypath/outs/barcodes.tsv
Mon Jan 10 11:34:57 GMT 2022: User specified mitochondrial genome does NOT match .bam file; correctly specify reference genome or .fasta file

Any ideas? Many thanks in advance
Chris
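
One way to investigate this message is to compare the contig names and lengths in the BAM header with the mitochondrial contig the run expects. The sketch below parses header text such as the output of `samtools view -H`; the exact comparison mgatk performs is not shown in this thread, so the name-and-length check and the function names are assumptions.

```python
def parse_sq_lines(header_text):
    """Return {contig_name: length} from the @SQ lines of a SAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def mito_matches(header_text, mito_chr, mito_length):
    """True if the header has a contig with the expected name and length."""
    return parse_sq_lines(header_text).get(mito_chr) == mito_length
```

A mismatch in either the name (for example `MT` versus `chrM`) or the length would make this check fail.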

can be used for snATAC?

Hi, thanks for the tool! Have you tried using mgatk for single-nucleus ATAC-seq? How do you think it would perform for snATAC? I guess the assumption would be that mtDNA reads in snATAC come from mitochondria of the same cell?!

Thanks

samtools external dependency / fastq mean BAQ

Currently this only occurs in one place, as far as I can tell; need to verify by removing samtools from the environment (or actually I can't remember: did I expunge this?). Minimally, there's a loop in Python that is probably the bottleneck of processing an individual bam:

https://github.com/aryeelab/mgatk/blob/f671d8ce4800b185de1c4b75d69921772f9d7def/Pypkg/mgatk/mgatk/bin/python/sumstatsBP.py#L47

A more recent version of pysam affords this internally:
pysam-developers/pysam#528

AF > 1

ERR1146421
Position 289
Allele C
"Depth" is 84; "Count" is 91

Fairly rare but still annoying
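
Assuming the allele frequency is computed as count divided by depth, the reported values give 91 / 84, which is above 1. A small sketch for flagging such entries follows; the record layout and function names are hypothetical.

```python
def allele_frequency(count, depth):
    """Return count / depth, or None when there is no coverage."""
    if depth <= 0:
        return None
    return count / depth


def find_inconsistent(records):
    """Return records whose allele count exceeds the reported depth."""
    return [r for r in records if r["count"] > r["depth"]]


if __name__ == "__main__":
    records = [
        {"sample": "ERR1146421", "position": 289, "allele": "C", "depth": 84, "count": 91},
    ]
    for r in find_inconsistent(records):
        print(r["sample"], r["position"], r["allele"], allele_frequency(r["count"], r["depth"]))
```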

snake make log / stats

do it

Using Mgatk for non mitochondrial sequences

Hi,

I tried to run mgatk on my data (10x scRNA-seq), and I believe that because the cells aren't very enriched for mtDNA, I wasn't able to find distinct variants to identify subclones. Do you know if I can use mgatk for non-mitochondrial sequences?

Thank you,

Sunita

Missing files due to improper folder specification

Hi,

I am trying to run the example provided for mgatk bcall under mgatk/tests, which results in an error due to missing files. Within bc1d/final/, there is only chrM_refAllele.txt generated. I get the same error when trying to run the genotyping for my own 10x-scATAC data. Do you have an idea what might be the problem here?

cd mgatk/tests
mgatk bcall -i barcode/test_barcode.bam -n bc1 -o bc1d -bt CB -b barcode/test_barcodes.txt -z
Fri Jun 12 10:44:30 CEST 2020: mgatk v0.5.3
Fri Jun 12 10:44:30 CEST 2020: Found bam file: barcode/test_barcode.bam for genotyping.
Fri Jun 12 10:44:30 CEST 2020: Found file of barcodes to be parsed: barcode/test_barcodes.txt
Fri Jun 12 10:44:30 CEST 2020: User specified mitochondrial genome matches .bam file
Fri Jun 12 10:44:33 CEST 2020: Finished determining/splitting barcodes for genotyping.
Fri Jun 12 10:44:33 CEST 2020: Genotyping samples with 1 threads

Error in checkGrep(grep(".A.txt", files)) :
Improper folder specification; file missing / extra file present. See documentation
Calls: importMito -> checkGrep
Execution halted

Thanks and best wishes,
Malte

wiki

Error with bcall mode

I want to perform genotyping for scATAC-seq data generated using the 10x multiomics protocol. The bam file was generated using cellranger-arc.

The command I am running is as follows :

mgatk bcall -i atac_possorted_bam.bam -g hg38 -bt CB -b filtered.barcodes.txt -ns 100 -c 8 -n sample1

the error msg is as follows:

Thu Jun 03 17:04:43 PDT 2021: Genotyping samples with 8 threads
Error in checkGrep(grep(".A.txt", files)) :
Improper folder specification; file missing / extra file present. See documentation
Calls: importMito -> checkGrep
Execution halted

I was hoping you could help me with this as I am not sure what I am doing wrong

Thanks in advance

mgatk does not generate .rds output files

Hi Caleb,

amazing work, providing this tool for looking into MtDNA variants!

I've been running mgatk on samples profiled by 10X scATAC-seq, but have encountered some issues when looking at the output files. In some cases, although all previous files have been successfully generated and the workflow has been executed correctly (according to the snakemake log files), the .rds objects are missing.

Tue Mar 23 16:45:56 CET 2021: Starting analysis with mgatk
Tue Mar 23 16:45:56 CET 2021: Processing samples with 12 threads
Tue Mar 23 17:59:54 CET 2021: mgatk successfully processed the supplied .bam files
Tue Mar 23 18:04:05 CET 2021: Successfully created final output files
Tue Mar 23 18:04:08 CET 2021: Intermediate files successfully removed.
Tue Mar 23 20:11:49 CET 2021: Starting analysis with mgatk
Tue Mar 23 20:11:49 CET 2021: Processing samples with 12 threads
Tue Mar 23 21:25:31 CET 2021: mgatk successfully processed the supplied .bam files
Tue Mar 23 21:25:40 CET 2021: Successfully created final output files
Tue Mar 23 21:25:42 CET 2021: Intermediate files successfully removed.

I am not sure what's wrong, as apart from the missing files, everything seems to be executed correctly. Any assistance or thoughts would be welcome! :)

Thanks so much in advance,
Moritz

Error in rule make_depth_table: 'InputFiles' object has no attribute 'depths'

Hi,

I'm trying to run the example for mgatk call using tests/humanbam, and I'm getting an error in rule make_depth_table that 'InputFiles' object has no attribute 'depths'. Here's the full output:

(mgatk) rh476@CFCE2:~$ mgatk call -i humanbam -o outdir -c 8 -g hg19 -n test -kd
Wed Aug 19 17:44:27 EDT 2020: mgatk v0.5.8
Wed Aug 19 17:44:27 EDT 2020: Found designated mitochondrial chromosome: chrM
Wed Aug 19 17:44:27 EDT 2020: Genotyping samples with 8 threads
Building DAG of jobs...
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	1	make_sample_list
	1	process_one_sample
	3

[Wed Aug 19 17:44:28 2020]
rule process_one_sample:
    input: outdir/.internal/samples/MGH60-P6-A11.mito.bam.txt
    output: outdir/temp/ready_bam/MGH60-P6-A11.mito.qc.bam, outdir/temp/ready_bam/MGH60-P6-A11.mito.qc.bam.bai, outdir/qc/depth/MGH60-P6-A11.mito.depth.txt, outdir/temp/sparse_matrices/MGH60-P6-A11.mito.A.txt, outdir/temp/sparse_matrices/MGH60-P6-A11.mito.C.txt, outdir/temp/sparse_matrices/MGH60-P6-A11.mito.G.txt, outdir/temp/sparse_matrices/MGH60-P6-A11.mito.T.txt, outdir/temp/sparse_matrices/MGH60-P6-A11.mito.coverage.txt
    jobid: 2
    wildcards: sample=MGH60-P6-A11.mito

Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	1	make_depth_table
	1	make_final_sparse_matrices
	3

[Wed Aug 19 17:44:28 2020]
rule make_depth_table:
    output: outdir/final/test.depthTable.txt
    jobid: 1

Job counts:
	count	jobs
	1	process_one_sample
	1
Job counts:
	count	jobs
	1	make_depth_table
	1
[Wed Aug 19 17:44:29 2020]
Error in rule make_depth_table:
    jobid: 0
    output: outdir/final/test.depthTable.txt

RuleException:
AttributeError in line 30 of /mnt/cfce-stor1/home/rh476/miniconda3/envs/mgatk/lib/python3.8/site-packages/mgatk/bin/snake/Snakefile.Gather:
'InputFiles' object has no attribute 'depths'
  File "/mnt/cfce-stor1/home/rh476/miniconda3/envs/mgatk/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 2168, in run_wrapper
  File "/mnt/cfce-stor1/home/rh476/miniconda3/envs/mgatk/lib/python3.8/site-packages/mgatk/bin/snake/Snakefile.Gather", line 30, in __rule_make_depth_table
  File "/mnt/cfce-stor1/home/rh476/miniconda3/envs/mgatk/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 529, in _callback
  File "/mnt/cfce-stor1/home/rh476/miniconda3/envs/mgatk/lib/python3.8/concurrent/futures/thread.py", line 57, in run
  File "/mnt/cfce-stor1/home/rh476/miniconda3/envs/mgatk/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
  File "/mnt/cfce-stor1/home/rh476/miniconda3/envs/mgatk/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 2199, in run_wrapper
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/cfce-stor1/home/rh476/.snakemake/log/2020-08-19T174428.325464.snakemake.log
[Wed Aug 19 17:44:35 2020]
Finished job 2.
1 of 3 steps (33%) done

[Wed Aug 19 17:44:35 2020]
rule make_sample_list:
    input: outdir/qc/depth/MGH60-P6-A11.mito.depth.txt
    output: outdir/temp/scattered.allSamples.txt
    jobid: 1

Job counts:
	count	jobs
	1	make_sample_list
	1
[Wed Aug 19 17:44:36 2020]
Finished job 1.
2 of 3 steps (67%) done

[Wed Aug 19 17:44:36 2020]
localrule all:
    input: outdir/temp/scattered.allSamples.txt
    jobid: 0

[Wed Aug 19 17:44:36 2020]
Finished job 0.
3 of 3 steps (100%) done
Complete log: /mnt/cfce-stor1/home/rh476/.snakemake/log/2020-08-19T174428.332329.snakemake.log
Error in checkGrep(grep(".A.txt", files)) : 
  Improper folder specification; file missing / extra file present. See documentation
Calls: importMito -> checkGrep
Execution halted

Would you know what might be causing this?

Thanks,
Ruiyang

`bcount` mode

Helps to facilitate when 10X run looks sub-par

Missing files in output final folder: *.variant_stats.tsv.gz *.cell_heteroplasmic_df.tsv.gz *.vmr_strand_plot.png

Describe the bug

I have tried to use mgatk on your test humanbam dataset and on my own dataset (mouse) and in both runs I don't get these files in the final output folder:

*.variant_stats.tsv.gz
*.cell_heteroplasmic_df.tsv.gz
*.vmr_strand_plot.png

I get all the other files mentioned in your wiki.

For humanbam I have run the command:
mgatk call -i humanbam/ -o humanbam/ --jobs 1 -c 1 -g hg19_chrM

A summary of .log files

Thu Jan 06 14:09:03 GMT 2022: Starting analysis with mgatk
Thu Jan 06 14:09:03 GMT 2022: mgatk will process 4 samples
Thu Jan 06 14:09:03 GMT 2022: Processing samples with 1 threads
Thu Jan 06 14:09:22 GMT 2022: mgatk successfully processed the supplied .bam files
Thu Jan 06 14:09:28 GMT 2022: Successfully created final output files
Thu Jan 06 14:09:28 GMT 2022: Intermediate files successfully removed.

Parameters:
input_directory: 'humanbam/'
output_directory: 'humanbam/'
script_dir: xxx
fasta_file: 'humanbam//fasta/chrM.fasta'
mito_chr: 'chrM'
mito_length: '16571'
name: 'mgatk'
base_qual: '0'
remove_duplicates: 'True'
handle_overlap: 'False'
low_coverage_threshold: '10'
barcode_tag: 'X'
umi_barcode: ''
alignment_quality: '0'
emit_base_qualities: 'False'
proper_paired: 'False'
NHmax: '1'
NMmax: '4'
max_javamem: '8000m'

Describe the sequencing assay being analyzed

my dataset is scRNAseq, so my expectation was that it may not work, but I expected to see all final output files for the test dataset at least.

Clarify if the execution is successful on the test data provided in the repository

As mentioned above, I used your humanbam dataset but several output files were missing.

I would be grateful for your help. Many thanks.

cambridge mitochondrial genome

@julirsch vaguely remember you talking about this

https://www.mitomap.org/foswiki/bin/view/MITOMAP/HumanMitoSeq

What is there to do here / how important is this?

cluster scatter

mgatk call -i Glioma/analysis/io -o mgatk_glioma -n glioma --cluster "bsub -q normal -o /dev/null" --jobs 4100

Filtering variants

Hi there.
When working with my own scATAC-seq data derived from pre FACS-sorted cells, I was never able to identify mitochondrial variants that pass the thresholds (log10(VMR) > -2 & strand > 0.65) used to filter the called variants, e.g. in the IdentifyVariants function in signac (1.0.0) or in the example workflow for mtscATAC-seq data.
I wanted to ask whether these thresholds are only suitable for mtscATAC-seq data or can also be applied to basic scATAC-seq data. If not, do you have any recommendations for which thresholds to use for basic scATAC-seq data?

Thank you and greetings.
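
The filter quoted above (log10(VMR) > -2 and strand > 0.65) can be written as a small predicate, which makes it easy to experiment with looser cutoffs. The argument names below are assumptions for illustration and may not match the column names in a given output table.

```python
import math


def passes_filter(vmr, strand_correlation, min_log10_vmr=-2.0, min_strand=0.65):
    """Apply the VMR and strand-concordance thresholds to one variant."""
    if vmr <= 0:
        return False  # log10 is undefined for non-positive values
    return math.log10(vmr) > min_log10_vmr and strand_correlation > min_strand


if __name__ == "__main__":
    variants = [("v1", 0.05, 0.80), ("v2", 0.001, 0.90), ("v3", 0.05, 0.50)]
    kept = [name for name, vmr, strand in variants if passes_filter(vmr, strand)]
    print(kept)
```

Passing different `min_log10_vmr` or `min_strand` values shows how many variants each cutoff removes; which values suit basic scATAC-seq data remains the open question here.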

checking dependencies...

Hi there,

Forgive me if this is very basic, I am very new to this kind of data analysis.

I have 10X genomics scATAC-seq data that I am trying to run mgatk on. I was trying to run bcall on the data and was getting this error:

Error in checkGrep(grep(".A.txt", files)) : 
  Improper folder specification; file missing / extra file present. See documentation
Calls: importMito -> checkGrep
Execution halted

I have now discovered the check function, but I am having issues with the dependencies. When I run

mgatk check -i {sample}_bam.bam -o mgatk_outs -g hg38 -b barcodes.txt -c 30

I get the following output:

Traceback (most recent call last):
  File "/home/klee/anaconda3/envs/seurat4/bin/mgatk", line 8, in <module>
    sys.exit(main())
  File "/home/klee/anaconda3/envs/seurat4/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/klee/anaconda3/envs/seurat4/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/klee/anaconda3/envs/seurat4/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/klee/anaconda3/envs/seurat4/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/klee/anaconda3/envs/seurat4/lib/python3.7/site-packages/mgatk/cli.py", line 232, in main
    if bams[0] == '':
IndexError: list index out of range

I have tried in a conda environment and a pip environment, neither have been successful. Please let me know if you can tell what step I am missing - it would be much appreciated.

Thank you in advance,

Kiera
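
The traceback above ends in `bams[0]` on an empty list, meaning no .bam files were collected before that line ran. Why the list is empty in this run is not established here. A guard of the following kind would turn the crash into a readable message; `find_bams` and `first_bam_or_error` are hypothetical helpers, not mgatk functions.

```python
from pathlib import Path


def find_bams(input_path):
    """Collect .bam files from a directory, or accept a single .bam file."""
    p = Path(input_path)
    if p.is_dir():
        return sorted(str(f) for f in p.glob("*.bam"))
    if p.is_file() and p.suffix == ".bam":
        return [str(p)]
    return []


def first_bam_or_error(bams):
    """Return the first entry, exiting with a clear message if the list is empty."""
    if not bams:
        raise SystemExit("no .bam files found for the supplied input")
    return bams[0]
```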

Process 10x-scRNA-seq data using mgatk

Hi, greetings! I tried to submit my 10x scRNA-seq data for analysis using the code below (with my bam file and barcodes.tsv, not shown here):

mgatk tenx -i ${outdir}/outs/possorted_bam.bam
-n CRR_test1 -o CRR_test1_mgatk -c 12 -ub UB
-bt CB -b ${outdir}/outs/filtered_feature_bc_matrix/barcodes.tsv

After I submitted this analysis, the following error showed up: "User specified mitochondrial genome does NOT match .bam file; correctly specify reference genome or .fasta file"

Since this .bam file is not generated by us, is there any other common reference genome that I can use to try?
Thank you very much!

*My mgatk version is v0.6.4
*My PC is macOS Catalina V 10.15.7

Using scRNA

Hey Caleb,

I am currently working on scATAC data but was wondering whether it is possible to use mgatk with scRNA data.
Do you know how to fit scRNA data into the mgatk workflow, or which prerequisites are needed to work with scRNA?
As far as I am aware, SNV calling does not seem to work. Is it possible to circumvent this problem by providing a list of SNVs coming from scATAC data?

Thanks and greetings.

Error in .local(assays, ...) : unused argument (rowData = <S4 object of class "GRanges">)

Hi Caleb,

I am trying to run the example provided for mgatk tenx under mgatk/tests and am running into some issues. I tried adding the --skip-R and/or --snake-stdouts flag to my command but still end up with the same error. I'm running the program on an HPC.

Thu Feb 11 17:20:19 PST 2021: Found bam file: barcode/test_barcode.bam for genotyping.
Thu Feb 11 17:20:19 PST 2021: Found file of barcodes to be parsed: barcode/test_barcodes.txt
Thu Feb 11 17:20:19 PST 2021: User specified mitochondrial genome matches .bam file
Thu Feb 11 17:20:24 PST 2021: Finished determining/splitting barcodes for genotyping.
Thu Feb 11 17:20:24 PST 2021: Genotyping samples with 2 threads
Error in .local(assays, ...) :
  unused argument (rowData = <S4 object of class "GRanges">)
Calls: importMito ... SummarizedExperiment -> SummarizedExperiment -> .local
Execution halted

Here is a list of files from the output directory

ls -lRh bc1dmem
.:

total 12K
4.0K drwxr-xr-x 2 fran_catalan ucb 4.0K Feb 11 22:36 final
4.0K drwxr-xr-x 4 fran_catalan ucb 4.0K Feb 11 17:20 logs
4.0K drwxr-xr-x 4 fran_catalan ucb 4.0K Feb 11 17:14 qc

./final:
total 648K
 96K -rw-r--r-- 1 fran_catalan ucb  96K Feb 11 22:36 bc1.A.txt.gz
100K -rw-r--r-- 1 fran_catalan ucb  96K Feb 11 22:36 bc1.C.txt.gz
 52K -rw-r--r-- 1 fran_catalan ucb  50K Feb 11 22:36 bc1.G.txt.gz
 80K -rw-r--r-- 1 fran_catalan ucb  77K Feb 11 22:36 bc1.T.txt.gz
196K -rw-r--r-- 1 fran_catalan ucb 194K Feb 11 22:36 bc1.coverage.txt.gz
4.0K -rw-r--r-- 1 fran_catalan ucb   76 Feb 11 22:36 bc1.depthTable.txt
120K -rw-r--r-- 1 fran_catalan ucb 119K Feb 11 22:35 chrM_refAllele.txt

./logs:
total 28K
4.0K -rw-r--r-- 1 fran_catalan ucb 1.1K Feb 11 22:36 base.mgatk.log
4.0K -rw-r--r-- 1 fran_catalan ucb  474 Feb 11 22:35 bc1.parameters.txt
4.0K -rw-r--r-- 1 fran_catalan ucb 3.0K Feb 11 22:36 bc1.snakemake_tenx.log
8.0K -rw-r--r-- 1 fran_catalan ucb 6.8K Feb 11 22:36 bc1.snakemake_tenx.stats
4.0K drwxr-xr-x 2 fran_catalan ucb 4.0K Feb 11 17:14 filterlogs
4.0K drwxr-xr-x 2 fran_catalan ucb 4.0K Feb 11 17:14 rmdupslogs

./logs/filterlogs:
total 8.0K
4.0K -rw-r--r-- 1 fran_catalan ucb 22 Feb 11 22:35 barcodes.1.filter.log
4.0K -rw-r--r-- 1 fran_catalan ucb 21 Feb 11 22:35 barcodes.2.filter.log

./logs/rmdupslogs:
total 8.0K
4.0K -rw-r--r-- 1 fran_catalan ucb 1.5K Feb 11 22:35 barcodes.1.rmdups.log
4.0K -rw-r--r-- 1 fran_catalan ucb 1.5K Feb 11 22:35 barcodes.2.rmdups.log

./qc:
total 8.0K
4.0K drwxr-xr-x 2 fran_catalan ucb 4.0K Feb 11 22:36 depth
4.0K drwxr-xr-x 2 fran_catalan ucb 4.0K Feb 11 17:14 quality

./qc/depth:
total 8.0K
4.0K -rw-r--r-- 1 fran_catalan ucb 50 Feb 11 22:36 barcodes.1.depth.txt
4.0K -rw-r--r-- 1 fran_catalan ucb 26 Feb 11 22:36 barcodes.2.depth.txt

./qc/quality:
total 0

Thank you in advance for your help!
-Fran

Missing input files for rule call_variants

Hi @caleblareau ,

thanks for developing this great package!

I wanted to try and run a test of using mgatk on 10x mouse scRNA-seq data.

I used the .bam file from the following public dataset:
https://www.10xgenomics.com/resources/datasets/1-k-heart-cells-from-an-e-18-mouse-v-3-chemistry-3.0.0

I installed mgatk using venv and tried to run the following:

ref_dir="/test_data"
mgatk tenx --mito-genome GRCm38 \
        -i ${ref_dir}/heart_1k_v3_possorted_genome_bam.bam \
        -n tenx_heart_test -o tenx_heart_test_mgatk \
        -c 1 \
        -ub UB \
        -bt CB \
        -b ${ref_dir}/filtered_feature_bc_matrix/barcodes.tsv \
        --snake-stdout \
        --keep-temp-files

I got the following error and don't know how to address this.

Thanks for your help!

Wed May 12 14:57:09 EDT 2021: mgatk v0.6.1
Wed May 12 14:57:09 EDT 2021: Found bam file: /test_data/heart_1k_v3_possorted_genome_bam.bam for genotyping.
Wed May 12 14:57:09 EDT 2021: Found file of barcodes to be parsed: /test_data/filtered_feature_bc_matrix/barcodes.tsv
[W::hts_idx_load3] The index file is older than the data file: /test_data/heart_1k_v3_possorted_genome_bam.bam.bai
Wed May 12 14:57:10 EDT 2021: User specified mitochondrial genome matches .bam file
[W::hts_idx_load3] The index file is older than the data file: /test_data/heart_1k_v3_possorted_genome_bam.bam.bai
Wed May 12 15:00:04 EDT 2021: Finished determining/splitting barcodes for genotyping.
Wed May 12 15:00:04 EDT 2021: Genotyping samples with 1 threads
Building DAG of jobs...
MissingInputException in line 160 of /mgatk_venv/lib/python3.8/site-packages/mgatk/bin/snake/Snakefile.tenx:
Missing input files for rule call_variants:
tenx_heart_test_mgatk/final/chrM_refAllele.txt
Error in checkGrep(grep(".A.txt", files)) : 
  Improper folder specification; file missing / extra file present. See documentation
Calls: importMito -> checkGrep
Execution halted

Check does not work for Tenx input parameters.

Describe the bug
I set up the mgatk parameters to run in tenx mode.

I used the check mode to validate my input parameters and got this error:

Mon Dec 13 11:05:30 CET 2021: mgatk v0.6.4
Mon Dec 13 11:05:30 CET 2021: checking dependencies...
Traceback (most recent call last):
  File "/home/mg000001/miniconda3/envs/Maegtk_Python38_V_0_6_4/bin/mgatk", line 8, in <module>
    sys.exit(main())
  File "/cm/shared/apps/scRNA/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/cm/shared/apps/scRNA/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/cm/shared/apps/scRNA/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/cm/shared/apps/scRNA/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/mg000001/miniconda3/envs/Maegtk_Python38_V_0_6_4/lib/python3.8/site-packages/mgatk/cli.py", line 232, in main
    if bams[0] == '':
IndexError: list index out of range

Tenx requires a bam file as input, while call requires a folder. The check assumes that you have supplied a folder and will always fail for a tenx setup.
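The traceback above ends at `bams[0]` being read from an empty list. A minimal sketch of the kind of guard that would avoid the IndexError is shown below. It assumes the list is empty because the input is a single .bam file rather than a folder; `find_bams` and `check_input` are hypothetical names for illustration, not mgatk's actual code.

```python
import os


def find_bams(input_path):
    """Collect .bam paths from a folder input or a single-file input.

    Hypothetical helper for illustration only; not part of mgatk.
    """
    if os.path.isdir(input_path):
        # Folder input (as in call mode): list every .bam inside it.
        return sorted(
            os.path.join(input_path, name)
            for name in os.listdir(input_path)
            if name.endswith(".bam")
        )
    if os.path.isfile(input_path) and input_path.endswith(".bam"):
        # Single-file input (as in tenx mode): the file itself is the input.
        return [input_path]
    return []


def check_input(input_path):
    """Return a status message instead of indexing into a possibly empty list."""
    bams = find_bams(input_path)
    if not bams:
        return "no .bam files found for input: " + input_path
    return "found %d .bam file(s)" % len(bams)
```

Testing for an empty list before reading `bams[0]` turns the crash into a readable message for both kinds of input.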

Clarification on marking duplicates

Hello,

I was wondering whether the duplicate marking with Picard is done at cell resolution. I didn't realize until yesterday, with the release of cellranger-atac 2.0, that CR duplicate marking was at bulk resolution. I'm assuming that in mgatk it is at cell resolution?

Thanks,
Chang

Use mitochondria genome only or use all chromosomes

Hello Caleb,
I was wondering if mgatk will only recognize mitochondrial reads even if I give it a BAM containing all chromosomes.

For example, I am running mgatk on 10x scATAC data aligned to mm10.

Step-1. Having created a custom reference, do I let cellranger align reads to all chromosomes?

1.1. ✅ Mask the mitochondrial blacklist regions.

bedtools maskfasta -fi old_mm10.fa -bed mito_blacklist.bed  -fo new_mm10.fa

1.2. Execute cellranger-atac mkref following the 10x instructions. Here, should I include 'chrM' in the PRIMARY_ASSEMBLY for mkref so that cellranger-atac count will align reads to all chromosomes including chrM? Or should I list only 'chrM' for PRIMARY_ASSEMBLY so that the resulting BAM only contains chrM reads?

Step-2. Run mgatk. How should I set the --mito-genome parameter? Should I simply specify 'mm10' or should I supply a FASTA file? Which FASTA file? Should it include all chromosomes or chrM only?

Many thanks.

Currently, I let cellranger align reads to all chromosomes so that the BAM contains all chromosomes. Then I specify mm10 for the --mito-genome parameter when running mgatk.

mgatk consensus read formation

On this page https://github.com/caleblareau/mgatk/wiki/Process-mtDNA-from-CellRanger-ATAC it's stated that "for scRNA, we can utilize UMI-aware PCR deduplication with mgatk". I was wondering how exactly this is done. In particular, how do you form a consensus/representative read for each UMI when you perform deduplication? I've seen a variety of different methods for this in the literature.

Thanks,
John
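For context only: one simple strategy among those in the literature is a per-position majority vote across the reads that share a UMI. The sketch below illustrates that generic idea. It is not a description of what mgatk does, and it assumes the reads are already grouped by UMI, aligned to the same start position, and of equal length; `consensus_base_calls` is a hypothetical name.

```python
from collections import Counter


def consensus_base_calls(reads):
    """Majority-vote consensus over equal-length reads sharing one UMI.

    Generic illustration only; ties resolve to the base seen first.
    """
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")
    consensus = []
    # zip(*reads) yields one tuple of bases per alignment column.
    for column in zip(*reads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)
```

Other approaches keep a single representative read per UMI (for example the one with the highest mapping or base quality) instead of building a consensus.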

click echo bam file being analyzed

just to keep multiple files coherent

Watson/Crick strand

More of something to check: is there a systematic bias in which strand has more mutations when we can measure them?

del viz

discard base quality by default but keep it with flag

mgatk does not export cell_heteroplasmic_df.tsv.gz

I want to reproduce the results in "Lineage Tracing in Humans Enabled by Mitochondrial Mutations and Single-Cell Genomics" with the AML datasets (aml035_pre_transplant and aml035_post_transplant), but after running the following commands I could not get cell_heteroplasmic_df.tsv.gz, vmr_strand_plot.png, or variant_stats.tsv.gz.

# mgatk, version 0.6.4

mgatk bcall -i ./AML/aml035_pre_transplant/aml035_pre_transplant_possorted_genome_bam.bam -n aml035_pt -o aml035_pt_bcall -c 6 -ub UB -mb 200 -bt CB -b ./AML/aml035_pre_transplant/filtered_matrices_mex/hg19/barcodes.tsv -g AML/hg19_MT.fa --snake-stdout

# or

mgatk tenx -i ./AML/aml035_pre_transplant/aml035_pre_transplant_possorted_genome_bam.bam -n aml035_pt -o aml035_pt -c 6 -ub UB -bt CB -b ./AML/aml035_pre_transplant/filtered_matrices_mex/hg19/barcodes.tsv -g AML/hg19_MT.fa --snake-stdout

base.mgatk.log:

Thu Nov 25 16:18:25 CST 2021: Starting analysis with mgatk
Thu Nov 25 16:18:25 CST 2021: mgatk will process 3571 samples
Thu Nov 25 16:18:29 CST 2021: Processing samples with 6 threads
Thu Nov 25 17:19:41 CST 2021: mgatk successfully processed the supplied .bam files
Thu Nov 25 17:24:25 CST 2021: Successfully created final output files
Thu Nov 25 17:24:48 CST 2021: Intermediate files successfully removed.

output folder detail:

.:
total 4.0K
drwxr-xr-x. 2 user lab 4.0K Nov 25 04:24 final
drwxr-xr-x. 4 user lab  205 Nov 25 04:23 logs
drwxr-xr-x. 4 user lab   46 Nov 25 03:18 qc

./final:
total 111M
-rw-r--r--. 1 user lab  34M Nov 25 04:24 aml035_pt.signac.rds
-rw-r--r--. 1 user lab  20M Nov 25 04:24 aml035_pt.rds
-rw-r--r--. 1 user lab  77K Nov 25 04:23 aml035_pt.depthTable.txt
-rw-r--r--. 1 user lab  28M Nov 25 04:23 aml035_pt.coverage.txt.gz
-rw-r--r--. 1 user lab 9.0M Nov 25 04:23 aml035_pt.A.txt.gz
-rw-r--r--. 1 user lab 9.3M Nov 25 04:23 aml035_pt.C.txt.gz
-rw-r--r--. 1 user lab 4.2M Nov 25 04:23 aml035_pt.G.txt.gz
-rw-r--r--. 1 user lab 7.6M Nov 25 04:23 aml035_pt.T.txt.gz
-rw-r--r--. 1 user lab 119K Nov 25 03:16 MT_refAllele.txt

./logs:
total 11M
-rw-r--r--. 1 user lab  409 Nov 25 04:24 base.mgatk.log
-rw-r--r--. 1 user lab 2.9K Nov 25 04:23 aml035_pt.snakemake_gather.stats
-rw-r--r--. 1 user lab  11M Nov 25 04:19 aml035_pt.snakemake_scatter.stats
drwxr-xr-x. 2 user lab 144K Nov 25 04:19 rmdupslogs
drwxr-xr-x. 2 user lab 144K Nov 25 04:19 filterlogs
-rw-r--r--. 1 user lab  509 Nov 25 03:18 aml035_pt.parameters.txt

./logs/rmdupslogs:
total 14M
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 CATACTACGAATAG-1.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 GAGCATACCCCTTG-4.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 CAGACTGATCACCC-1.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 TACTACACGCCAAT-3.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 TTGACACTAAAACG-3.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 CTTACAACTGTGAC-3.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 AGGCTAACCGATAC-3.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 TGACCGCTCCTGTC-2.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 TTCGTATGTGTGAC-3.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 GAGGATCTCCTTAT-4.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 CAGGCCGATTCGTT-1.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 TTAGGTCTTTCTCA-4.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 TTGGTACTGCAAGG-3.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 TAGAATTGACACCA-3.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 CTTGAACTTAAAGG-3.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 AGGTCATGTTGACG-3.rmdups.log
-rw-r--r--. 1 user lab 1.5K Nov 25 04:19 TGAGACACTCCTCG-2.rmdups.log
.........


./logs/filterlogs:
total 14M
-rw-r--r--. 1 user lab 18 Nov 25 04:19 CATACTACGAATAG-1.filter.log
-rw-r--r--. 1 user lab 19 Nov 25 04:19 CAGACTGATCACCC-1.filter.log
-rw-r--r--. 1 user lab 20 Nov 25 04:19 GAGCATACCCCTTG-4.filter.log
-rw-r--r--. 1 user lab 20 Nov 25 04:19 TTGACACTAAAACG-3.filter.log
-rw-r--r--. 1 user lab 19 Nov 25 04:19 TACTACACGCCAAT-3.filter.log
-rw-r--r--. 1 user lab 20 Nov 25 04:19 CTTACAACTGTGAC-3.filter.log
-rw-r--r--. 1 user lab 18 Nov 25 04:19 AGGCTAACCGATAC-3.filter.log
-rw-r--r--. 1 user lab 20 Nov 25 04:19 TGACCGCTCCTGTC-2.filter.log
-rw-r--r--. 1 user lab 18 Nov 25 04:19 TTCGTATGTGTGAC-3.filter.log
-rw-r--r--. 1 user lab 19 Nov 25 04:19 TTAGGTCTTTCTCA-4.filter.log
-rw-r--r--. 1 user lab 18 Nov 25 04:19 CAGGCCGATTCGTT-1.filter.log
-rw-r--r--. 1 user lab 19 Nov 25 04:19 GAGGATCTCCTTAT-4.filter.log
-rw-r--r--. 1 user lab 18 Nov 25 04:19 TTGGTACTGCAAGG-3.filter.log
-rw-r--r--. 1 user lab 18 Nov 25 04:19 TAGAATTGACACCA-3.filter.log
-rw-r--r--. 1 user lab 19 Nov 25 04:19 CTTGAACTTAAAGG-3.filter.log
-rw-r--r--. 1 user lab 20 Nov 25 04:19 TGAGACACTCCTCG-2.filter.log
-rw-r--r--. 1 user lab 20 Nov 25 04:19 AGGTCATGTTGACG-3.filter.log
-rw-r--r--. 1 user lab 17 Nov 25 04:19 CATCGCTGTTCGGA-1.filter.log
-rw-r--r--. 1 user lab 20 Nov 25 04:19 AGGTCATGGCAGTT-3.filter.log
-rw-r--r--. 1 user lab 17 Nov 25 04:19 TGACTTTGGGTCAT-2.filter.log
-rw-r--r--. 1 user lab 17 Nov 25 04:19 TAAGATTGTCCTGC-1.filter.log
.........


./qc:
total 188K
drwxr-xr-x. 2 user lab 144K Nov 25 04:19 depth
drwxr-xr-x. 2 user lab   10 Nov 25 03:18 quality

./qc/depth:
total 14M
-rw-r--r--. 1 user lab 22 Nov 25 04:19 CATACTACGAATAG-1.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 GAGCATACCCCTTG-4.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 CAGACTGATCACCC-1.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 TTGACACTAAAACG-3.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 TACTACACGCCAAT-3.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 CTTACAACTGTGAC-3.depth.txt
-rw-r--r--. 1 user lab 21 Nov 25 04:19 AGGCTAACCGATAC-3.depth.txt
-rw-r--r--. 1 user lab 21 Nov 25 04:19 TGACCGCTCCTGTC-2.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 TTCGTATGTGTGAC-3.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 GAGGATCTCCTTAT-4.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 CAGGCCGATTCGTT-1.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 TTAGGTCTTTCTCA-4.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 TTGGTACTGCAAGG-3.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 TAGAATTGACACCA-3.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 CTTGAACTTAAAGG-3.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 AGGTCATGTTGACG-3.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 TGAGACACTCCTCG-2.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 CATCGCTGTTCGGA-1.depth.txt
-rw-r--r--. 1 user lab 22 Nov 25 04:19 AGGTCATGGCAGTT-3.depth.txt
.........

./qc/quality:
total 0

Running the test data with the following command also did not produce cell_heteroplasmic_df.tsv.gz, vmr_strand_plot.png, or variant_stats.tsv.gz:

mgatk bcall -i barcode/test_barcode.bam -n bc1 -o bc1d -bt CB -b barcode/test_barcodes.txt -z

Python unicode error

Heya Caleb,

Thanks so much for this package, it's super useful and interesting. I was wondering if you could help me with the error below:

Traceback (most recent call last):
  File "/u/oknight/bin/miniconda3/bin/mgatk", line 8, in <module>
    sys.exit(main())
  File "/u/oknight/bin/miniconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/u/oknight/bin/miniconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/u/oknight/bin/miniconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/u/oknight/bin/miniconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/u/oknight/bin/miniconda3/lib/python3.9/site-packages/mgatk/cli.py", line 208, in main
    barcode_files = split_barcodes_file(barcodes, math.ceil(file_len(barcodes)/int(ncores)), output)
  File "/u/oknight/bin/miniconda3/lib/python3.9/site-packages/mgatk/mgatkHelp.py", line 167, in file_len
    for i, l in enumerate(f):
  File "/u/oknight/bin/miniconda3/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Thanks a million

Ollie
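The byte 0x8b at position 1 matches the second byte of the gzip magic number (1f 8b), so one plausible cause is that the barcodes file passed in was gzip-compressed while being read as plain text. A minimal sketch of a line counter that handles both cases follows; `open_text_maybe_gzip` and `count_lines` are hypothetical helpers, not mgatk functions.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_text_maybe_gzip(path):
    """Open a text file for reading, decompressing it if it is gzip."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def count_lines(path):
    """Count lines in a plain or gzip-compressed text file."""
    with open_text_maybe_gzip(path) as handle:
        return sum(1 for _ in handle)
```

If this is the cause, decompressing the barcodes file before passing it to mgatk should be enough to get past this error.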

Cell count difference in the mgatk outputs and scATAC-seq cell ranger outputs

Hi,

I am running the downstream analysis of my outputs in Signac. However, when I subset the mito data to the cells present in the scATAC-seq data, the cell count decreased from 3k to only around 500. Both come from the same sample, so even if we expect to lose some cells when subsetting, is it normal to lose this many?
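One way to start diagnosing a drop like this is to compare the two barcode sets directly. The sketch below reports the exact overlap and the overlap after removing a trailing "-<number>" suffix. A suffix mismatch is only one possible cause and is an assumption here, as are the helper names.

```python
import re

_SUFFIX = re.compile(r"-\d+$")  # trailing "-1", "-2", ... on a barcode


def strip_suffix(barcode):
    """Remove a trailing -<number> suffix from a barcode, if present."""
    return _SUFFIX.sub("", barcode)


def overlap_report(mito_barcodes, atac_barcodes):
    """Count shared barcodes exactly and after suffix stripping."""
    mito, atac = set(mito_barcodes), set(atac_barcodes)
    stripped_mito = {strip_suffix(b) for b in mito}
    stripped_atac = {strip_suffix(b) for b in atac}
    return {
        "mito": len(mito),
        "atac": len(atac),
        "exact": len(mito & atac),
        "suffix_stripped": len(stripped_mito & stripped_atac),
    }
```

If "suffix_stripped" is much larger than "exact", the barcodes differ only in naming. If both are small, the two cell sets genuinely differ, for example because different barcode lists or filters were used.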

BioRad scATAC-seq data

Hi, I have sets of scATAC-seq data generated with the BioRad platform, which have around 3% of total reads mapped to mtDNA. I tried to run mgatk with --barcode-tag changed to "DB" and ran into an error. Do I need to change anything else in order to run the pipeline? Thanks.

Jason

mgatk tenx -i deconvoluted_data/final/alignments.possorted.tagged.bap.bam  -n mito_test1 -o mito_mgatk -c 5 -bt DB -b count_matrix/barcodes_out.csv -g genomes/mm10/fasta/chrM.fa
Tue Nov 02 12:06:30 HKT 2021: mgatk v0.6.4
Tue Nov 02 12:06:30 HKT 2021: Found bam file: deconvoluted_data/final/alignments.possorted.tagged.bap.bam for genotyping.
Tue Nov 02 12:06:30 HKT 2021: Found file of barcodes to be parsed: count_matrix/barcodes_out.csv
Tue Nov 02 12:06:30 HKT 2021: User specified mitochondrial genome matches .bam file
Tue Nov 02 12:06:30 HKT 2021: Finished determining/splitting barcodes for genotyping.
Tue Nov 02 12:06:30 HKT 2021: Genotyping samples with 5 threads
Error in .subset2(x, i, exact = exact) : subscript out of bounds
Calls: importMito ... importMito.explicit -> levels -> [[ -> [[.data.frame -> <Anonymous>
Execution halted

split R package and python package

Max cell IDs

To put an upper bound on the number of cells written at once, we need to split the cells into chunks whose size is a user-defined parameter. Write a clever function for this and then iterate through the chunks in a loop.

As part of this, we will need to split the "unknown" nomination into two scripts: the first will nominate barcodes based on abundance [to do: make parallel], whereas the second will just be the "known" genotyping.

We shouldn't have to change the backend snakemake since it can run in parallel ad lib.
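A minimal sketch of the chunking idea described above, assuming the barcodes are already known and held in a list. `chunk_barcodes` and `max_cells` are hypothetical names standing in for the function and the user-defined parameter.

```python
def chunk_barcodes(barcodes, max_cells):
    """Yield consecutive slices of at most max_cells barcodes each."""
    if max_cells < 1:
        raise ValueError("max_cells must be at least 1")
    for start in range(0, len(barcodes), max_cells):
        yield barcodes[start:start + max_cells]


def process_in_chunks(barcodes, max_cells, process):
    """Apply process() to each chunk in turn and collect the results."""
    return [process(chunk) for chunk in chunk_barcodes(barcodes, max_cells)]
```

Because each chunk is processed before the next one is sliced, the number of cells handled at once stays at or below `max_cells`, and `process` could be whatever hands one chunk to the existing backend.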
