blachlylab / mucor3

Parses VCF data into tabular spreadsheets and aggregates data by sample

License: MIT License

Topics: bioinformatics, vcf, jsonlines, aggregation


Mucor3

Introduction

Mucor3 is an iteration on the original Mucor. It encompasses not only the creation of VCF variant reports but also a range of line-delimited JSON (JSONL) manipulation. Mucor3 translates VCF files into tabular data and aggregates them into useful pivoted tables. VCFs are first converted to line-delimited JSON objects, which allows great flexibility in filtering the data before pivoting. After combining all variant JSONL into one file, Mucor3 can convert it to a tabular format and generate pivoted tables; by default these show each variant pivoted by sample, displaying the allele frequency of that variant for each sample. A companion tool, depthgauge, creates a pivoted spreadsheet of the read depth at every position in the pivoted allele-frequency table. Mucor3 is broken into several steps that can be performed by a variety of programs to suit your needs; generally these are annotation, atomization, filtering, manipulation, and report generation.

Quick Guide

Installation

git clone --recurse-submodules https://github.com/blachlylab/mucor3.git
cd mucor3
make
python setup.py install

Step 0: Annotation

For VCFs to contain useful information about the mutations they describe, annotation is usually a necessary step, and it is not included in most variant callers. Annotation can be performed by a variety of programs; unlike previous versions of this software, Mucor3 does not perform it. This keeps the goals of this project small and within scope, and many existing programs are already optimized for this purpose. Some notable annotation software we use:

  1. snpEff
  2. snpSift
  3. vcfanno
  4. vep

Step 1: Atomization

To allow greater flexibility in the tools that can be used with Mucor3, we use JSON as an intermediate representation, with atomizers to convert tables and VCFs to line-delimited JSON. The VCF atomizer converts your VCFs into line-delimited JSON objects. A single JSON object represents an individual VCF record for a single sample, or an individual annotation of a single VCF record for a single sample (if you intend to use Elasticsearch for filtering). Read more about the VCF atomizer here.

atomization/atomize_vcf/atomize_vcf sample1.vcf.gz >sample1.jsonl
vcf_atomizer sample2.vcf >sample2.jsonl

Step 2: Combine VCF JSON information into one file

cat sample1.jsonl sample2.jsonl ... > data.jsonl

Step 2.5: Linking sample information

You can use the table atomizer to create JSON records from a sample spreadsheet.

table_atomizer samples.tsv > samples.jsonl
table_atomizer samples.xlsx >samples.jsonl

This data can then be used with the previously generated VCF data to link sample information to VCF variant data. After this data is linked, it can be used for filtering in later steps.

code for linking here
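The linking code is not shown in this README; as an illustration only, a minimal pure-Python sketch of the idea might look like the following. The field names (`sample`, `clinical_trial_name`) are hypothetical, not part of the mucor3 API.

```python
import json

# toy JSONL inputs: one variant record, one sample-metadata record
variants = [
    '{"CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "sample": "sample1"}',
]
samples = [
    '{"sample": "sample1", "clinical_trial_name": "SAM0001"}',
]

# index the sample metadata by sample name
meta = {rec["sample"]: rec for rec in map(json.loads, samples)}

# attach metadata fields to each variant record, without clobbering existing keys
linked = []
for rec in map(json.loads, variants):
    for k, v in meta.get(rec["sample"], {}).items():
        rec.setdefault(k, v)
    linked.append(rec)
```

The linked records can then be written back out as JSONL and filtered on the metadata fields in the next step.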

Step 3: Filtering

VCFs are often filled with millions of variants, which can make the tables generated by Mucor3 very large. Filtering lets us reduce the spreadsheets to only the variants that are important to us. A number of different tools can be used for this, for example:

  1. jq
  2. elasticsearch
  3. apache drill
  4. couchdb

We provide python scripts to aid in the use of elasticsearch for filtering. We also provide a program called varquery that can perform filtering in a similar way to elasticsearch without the bulk of a full database.

varquery index data.jsonl > data.index
varquery query data.index data.jsonl "/AF > 0.5 AND /INFO/ANN/EFFECT=(missense OR 5_prime_utr)"

More info on using varquery for filtering can be found here. More info on using elasticsearch for filtering can be found here.
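For readers not using varquery or Elasticsearch, the same style of filter can be sketched in plain Python. The record layout below (an `AF` value nested under `FORMAT`) is our assumption about the atomizer output, for illustration only:

```python
import json

# two toy JSONL records
jsonl = "\n".join([
    '{"CHROM": "chr1", "POS": 2, "FORMAT": {"AF": 0.7}}',
    '{"CHROM": "chr1", "POS": 5, "FORMAT": {"AF": 0.25}}',
])

# roughly equivalent to: jq -c 'select(.FORMAT.AF > 0.5)' data.jsonl
filtered = [r for r in map(json.loads, jsonl.splitlines())
            if r["FORMAT"]["AF"] > 0.5]
```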

Running Mucor3

Provide Mucor3 with your combined data and an output folder.

mucor3 data.jsonl output_folder

Mucor3 will output a pivoted table, with every variant pivoted by sample, in this general format:

CHROM POS REF ALT ANN_gene_name ANN_effect sample1 sample2
chr1 2 G T foo missense . 0.7
chr1 5 C T foo synonymous 1 0.25
chr1 1000 TA T bar ... 0.45 .
chr1 3000 G GATAGC oncogene ... 0.01 .

The values under sample1 and sample2 come from the AF field of the FORMAT region of the VCF.

The master table, however, represents the same data in this format:

CHROM POS REF ALT AF sample ANN_gene_name ANN_hgvs_p ANN_effect
chr1 2 G T 0.7 sample2 foo p.Met1Ala missense
chr1 5 C T 1 sample1 foo ... synonymous
chr1 5 C T 0.25 sample2 foo ... ...
chr1 1000 TA T 0.45 sample1 bar ... ...
chr1 3000 G GATAGC 0.01 sample1 oncogene ... ...

Note: The ANN_ fields will not be present for VCFs that have not been annotated using SnpEff.

DepthGauge

Before running depthgauge, we need to know the first sample-name column in our AF.tsv spreadsheet. In the data above, that is column 7 (sample1). We also provide a folder containing the BAM files depthgauge needs. Important: the BAMs must have the same names as the samples in the spreadsheet and VCFs, and must be sorted and indexed. For our example, the BAMs folder should contain sample1.bam, sample2.bam, sample1.bam.bai, and sample2.bam.bai.

depthgauge -t 4 output_folder/AF.tsv 7 BAMS/ depthgauge.tsv

This creates a table identical to our first, except with read depths instead of allele frequencies.

CHROM POS REF ALT ANN_gene_name ANN_effect sample1 sample2
chr1 2 G T foo missense 10 37
chr1 5 C T foo synonymous 100 4
chr1 1000 TA T bar ... 20 45
chr1 3000 G GATAGC oncogene ... 300 78

Datastore

The key advancement of using JSONL as an intermediate data type is its flexibility and use in NoSQL datastores. For a large number of samples, or a more permanent dataset that may be analyzed several times, a NoSQL database may offer more flexibility and robustness. We provide Python scripts to upload data to, and query VCF data from, an Elasticsearch instance. Other JSONL querying mechanisms can also be used, e.g. Apache Drill, AWS Athena, newer versions of PostgreSQL, and many others.

Custom Tables

The main mucor3 Python script creates a pivot table by taking the JSONL directly from the vcf_atomizer, setting the fields CHROM, POS, REF, and ALT as an index, pivoting on sample, and displaying the AF for each combination of index and pivot value. Using the mucor scripts directly allows greater flexibility and manipulation. All scripts, with the exception of jsonlcsv.py, take JSONL as input and output JSONL.
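A minimal sketch of that pivot in pure Python, assuming records shaped like the master table above (this is an illustration of the operation, not the mucor3 implementation):

```python
# master-table-like records: one row per variant per sample
records = [
    {"CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "sample": "sample2", "AF": 0.7},
    {"CHROM": "chr1", "POS": 5, "REF": "C", "ALT": "T", "sample": "sample1", "AF": 1.0},
    {"CHROM": "chr1", "POS": 5, "REF": "C", "ALT": "T", "sample": "sample2", "AF": 0.25},
]
index_cols = ("CHROM", "POS", "REF", "ALT")

# one output row per unique index; one column per sample; cell value is AF
pivot = {}
for rec in records:
    key = tuple(rec[c] for c in index_cols)
    pivot.setdefault(key, {})[rec["sample"]] = rec["AF"]
```

Missing cells (a sample without that variant) simply have no entry and would be rendered as "." in the TSV output.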

Merge

merge.py combines rows so that, when a pivot is performed, rows are unique and duplications are avoided. The main mucor3 script uses this to ensure we have unique rows for any given variant, so there is only one occurrence of any combination of CHROM, POS, REF, ALT, and sample. merge.py can combine rows in other ways: simply specify which column combinations define a unique row. mucor3's merge would look like this using the script directly:

cat data.jsonl | python merge.py sample CHROM POS REF ALT

merge.py will concatenate columns for rows that are duplicate based on the provided indices.

CHROM POS REF ALT AF sample ANN_gene_name ANN_hgvs_p ANN_effect ANN_transcript_id
chr1 2 G T 0.7 sample2 foo p.Met1Ala missense 1
chr1 2 G T 0.7 sample2 foo . synonymous 2
chr1 5 C T 1 sample1 foo . synonymous 3

The above table would be changed to this:

CHROM POS REF ALT AF sample ANN_gene_name ANN_hgvs_p ANN_effect ANN_transcript_id
chr1 2 G T 0.7 sample2 foo p.Met1Ala;. missense;synonymous 1;2
chr1 5 C T 1 sample1 foo . synonymous 3

This step is necessary because the vcf_atomizer reports duplicate variant rows for multiple SnpEff annotations, which is most efficient for filtering data with Elasticsearch or jq. We must use merge.py to later coalesce the rows back into a single row per variant.
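The merge rule described above can be sketched in a few lines of Python; this is an illustration of the behavior, not merge.py itself:

```python
# duplicate rows per variant, one per SnpEff annotation
rows = [
    {"CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "sample": "sample2",
     "ANN_effect": "missense", "ANN_transcript_id": "1"},
    {"CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "sample": "sample2",
     "ANN_effect": "synonymous", "ANN_transcript_id": "2"},
    {"CHROM": "chr1", "POS": 5, "REF": "C", "ALT": "T", "sample": "sample1",
     "ANN_effect": "synonymous", "ANN_transcript_id": "3"},
]
key_cols = ("sample", "CHROM", "POS", "REF", "ALT")

merged = {}
for row in rows:
    key = tuple(row[c] for c in key_cols)
    if key not in merged:
        merged[key] = dict(row)
    else:
        # concatenate non-key columns that differ, mirroring the ";" output above
        for col, val in row.items():
            if col not in key_cols and merged[key][col] != val:
                merged[key][col] = f"{merged[key][col]};{val}"
merged_rows = list(merged.values())
```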

Aggregate

Jsonl to TSV

mucor3's People

Contributors: charlesgregory, kekananen

mucor3's Issues

Merged annotation fields out of order

If the annotations for each sample variant are in a different order, then when merging on sample and variant, samples with the same variant will have their annotation fields merged in different orders.

sample CHROM POS REF ALT meta1
sample1 chr1 2 G T BAD
sample1 chr1 2 G T GOOD
sample2 chr1 2 G T GOOD
sample2 chr1 2 G T BAD

A merge on the above dataset would result in the below dataset.

sample CHROM POS REF ALT meta1
sample1 chr1 2 G T BAD;GOOD
sample2 chr1 2 G T GOOD;BAD

This can make comparison and pivoting difficult. We recommend sorting on sample, CHROM, POS, REF, ALT, and the ANN fields before any merges.
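The recommended sort can be sketched as follows, using toy records mirroring the example above (`meta1` stands in for the ANN fields):

```python
# unsorted records: the two annotations appear in a different order per sample
records = [
    {"sample": "sample1", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "BAD"},
    {"sample": "sample1", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "GOOD"},
    {"sample": "sample2", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "GOOD"},
    {"sample": "sample2", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "BAD"},
]

# sorting on the annotation fields as well guarantees a stable concatenation order
records.sort(key=lambda r: (r["sample"], r["CHROM"], r["POS"],
                            r["REF"], r["ALT"], r["meta1"]))
order = [r["meta1"] for r in records]
```

After this sort, both samples merge to "BAD;GOOD" rather than one of each ordering.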

Pivoted dataset can have duplicate variant records

When pivoting and including metadata columns, if the metadata columns differ between samples for the same variant, the variant will be reported as separate rows.

CHROM POS REF ALT meta1 sample1 sample2
chr1 1 G A BAD 0.1 .
chr1 1 G A GOOD . 0.2
chr1 2 G T BAD;GOOD . 0.1
chr1 2 G T GOOD;BAD 0.5 .

Suggest pivoting with only CHROM, POS, REF, ALT, then adding the metadata columns back with a left join. The right dataset in the join should first be merged on CHROM, POS, REF, ALT.
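A sketch of the suggested workaround: pivot on the variant key alone, then left-join the merged metadata back (plain-Python illustration, not the mucor3 code):

```python
# pivot produced with only CHROM, POS, REF, ALT as the index
pivoted = [
    {"CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "sample1": 0.5, "sample2": 0.1},
]
# metadata previously merged on CHROM, POS, REF, ALT
meta = {
    ("chr1", 2, "G", "T"): {"meta1": "BAD;GOOD"},
}

# left join: every pivoted row is kept; metadata is added where the key matches
for row in pivoted:
    key = (row["CHROM"], row["POS"], row["REF"], row["ALT"])
    row.update(meta.get(key, {}))
```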

Need different vcf atomizer output modes

Need different modes for outputting data in different orientations from the vcf atomizer:

Variant centric:

One row per unique variant. Multi-sample vcfs would have multiple sample objects under the FORMAT object.

{ 
    "CHROM" : "chr1", 
    "POS" : "1", 
    "REF" : "G", 
    "ALT" : "A", 
    "INFO" : { 
        "ANN" : [ 
            {
                "effect": "one"
            }, 
            {
                "effect": "two"
            }
        ]
    }, 
    "FORMAT" : { 
        "SAM1": { 
            "AF": 0.2
        },
        "SAM2": { 
            "AF": 0.1
        } 
    }
}

Sample variant centric (current):

One row per each variant for each sample. We expand the FORMAT object into multiple rows (while duplicating all other information).

{ 
    # all other values the same
    "FORMAT" : { 
        "AF": 0.2 # AF for SAM1
    },
    "sample":"SAM1" 
}
{ 
    # all other values the same
    "FORMAT" : { 
        "AF": 0.1 # AF for SAM2
    },
    "sample":"SAM2"
}

Annotation centric (previously):

One row per each annotation per each variant for each sample. We further expand the ANN object into multiple rows (while duplicating all other information).

{ 
    # all other values the same
    "INFO" : { 
        "ANN" : {
            "effect": "one" # first ANN annotation for SAM1
        } 
    }, 
    "FORMAT" : { 
        "AF": 0.2 
    },
    "sample":"SAM1"
}
{ 
    # all other values the same
    "INFO" : { 
        "ANN" : {
            "effect": "two" # second ANN annotation for SAM1
        }
    }, 
    "FORMAT" : { 
        "AF": 0.2
    },
    "sample":"SAM1"
}
{ 
    # all other values the same
    "INFO" : { 
        "ANN" : {
            "effect": "one" # first ANN annotation for SAM2
        } 
    }, 
    "FORMAT" : { 
        "AF": 0.1
    },
    "sample":"SAM2"
}
{ 
    # all other values the same
    "INFO" : { 
        "ANN" : {
            "effect": "two" # second ANN annotation for SAM2
        } 
    }, 
    "FORMAT" : { 
        "AF": 0.1
    },
    "sample":"SAM2"
}

Pivot table positive rate wrong denominator

(See also #1)

Commit 4367c9e introduced positive no. and positive rate in aggregate.pivot. However, main, which calls pivot, later adds back missing columns (if any). This means the positive rate's denominator is wrong.

Suggest removing it, or refactoring positive no. and positive rate into a separate function that main can call after adding back the missing columns.

Speed up depthgauge

Is there a way to increase the speed of depthgauge besides setting/increasing threads? When running on ~100 viral samples it takes around a day and a half to complete, and it seems to run more slowly as the number of samples increases. In the beginning it quickly works through the first 50 or so, but then slows to roughly 1 sample per ~40 minutes toward the end of processing.

With threads set to 4, on ~190 samples it took >3 days to finish. So the expected average time for calculating the depth of viral samples is around ~40 minutes per sample.

sample_indexer hardcodes some columns

We should expect to always have a "sample" column in our data file(s) unless it is a pivoted table, but we have no expectation that the index/key file has specific columns. It currently seems to rely on the columns "accession" and "status" being present. So we really need two flags instead of just --column, or --column needs to take a parsed value like colname1=colname2.

Since our index/key table could be as such in our samples.xlsx:

Sequencing ID clinical_trial name
sample1 SAM0001
sample2 SAM0002
... ...

In which case we would need to run the script as such:

sample_indexer -f data.tsv -i samples.xlsx -c "Sequencing ID=clinical_trial name" -o converted.tsv
# or 
sample_indexer -f data.tsv -i samples.xlsx --from "Sequencing ID" --to "clinical_trial name" -o converted.tsv

Pivot table improvement

Mucor pivot table improvement suggestion

After the pivot, add a column, right before the first sample column, that shows the total number of nonzero entries in the row.
That will help prioritize variants.
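A sketch of the suggested column (the name `positive_n` is made up here), counting entries that are present and nonzero per row:

```python
# toy pivot rows; "." marks a missing call, as in the AF.tsv output
pivot_rows = [
    {"CHROM": "chr1", "POS": 2, "sample1": ".", "sample2": 0.7},
    {"CHROM": "chr1", "POS": 5, "sample1": 1.0, "sample2": 0.25},
]
sample_cols = ["sample1", "sample2"]

for row in pivot_rows:
    # count sample cells that are present and nonzero
    row["positive_n"] = sum(1 for c in sample_cols
                            if row[c] != "." and float(row[c]) > 0)
```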

Merge python and D code

Need to merge the python functionality into the D binary. Most of this is done, just need to wrap it all together as a single invocation that goes from VCF to pivoted table.

Off by 1 index error between AF calculations and depth calculations.

The output of depthgauge and mucor3's AF.tsv file are off by 1 from each other. This is likely due to not accounting for the 1-based coordinate system of a VCF vs the 0-based coordinate system of a BAM. Positions should be reported in one system or the other, but not both, as mixing them may make downstream analysis more difficult.

AF.tsv is as follows:

chrEBV_B_95_8_Raji      25      C       G       2       0.015037593984962405    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.88    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.084   .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .
chrEBV_B_95_8_Raji      28      A       C       1       0.007518796992481203    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.667   .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .
chrEBV_B_95_8_Raji      47      C       T       2       0.015037593984962405    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.8     .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.667   .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .
chrEBV_B_95_8_Raji      53      C       A       1       0.007518796992481203    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.244   .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .
chrEBV_B_95_8_Raji      72      G       GATCGTCT        1       0.007518796992481203    .       .       .       .       .       .       .       .       

DP is as follows from depthgauge:

chrEBV_B_95_8_Raji  26  C   G   2   0.015037593984962405    1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   5955    0   0   0   0   0   0   0   0   1087    0   0   0   0   1653    0   0   0   0   0   0   0   0   1211    0   0   640 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1245    0   156 0   0   0   0   899 1128    1640    0   104 6   0   0   0   33  701 0   0   0   16252   0   36  3098    0   0   0   8334    0   0   0   0   0   0   0   1   0   0   1   551 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   331 0
chrEBV_B_95_8_Raji  29  A   C   1   0.007518796992481203    1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   5956    0   0   0   0   0   0   0   0   1087    0   0   0   0   1670    0   0   0   0   0   0   0   0   1211    0   0   640 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1245    0   156 0   0   0   0   899 1128    1639    0   104 6   0   0   0   33  841 0   0   0   16327   0   36  3110    0   0   0   8334    0   0   0   0   0   0   0   1   0   0   1   551 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   331 0
chrEBV_B_95_8_Raji  48  C   T   2   0.015037593984962405    1   0   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   5950    0   0   0   0   0   0   0   0   1236    0   0   0   0   1693    0   2513    0   0   0   0   0   0   1211    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1245    0   156 0   0   0   0   899 1132    1576    0   28397   6   0   0   0   34  896 0   0   0   17022   0   36  8758    0   0   0   8335    0   0   0   0   0   0   0   1   0   2   1   551 0   0   0   34  0   0   0   0   0   0   0   0   0   0   0   0   21620   662 0
chrEBV_B_95_8_Raji  54  C   A   1   0.007518796992481203    1   0   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   5943    0   0   0   0   0   0   0   0   1236    0   0   0   0   1695    0   2513    0   0   0   0   0   0   1211    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1245    0   156 0   0   0   0   899 1132    342 0   28477   6   0   0   0   34  897 0   0   1572    17022   0   36  8765    0   0   0   8336    0   0   0   0   0   0   0   1   0   2   1   551 0   0   0   34  0   0   0   0   0   0   0   0   0   0   0   0   21635   664 0
chrEBV_B_95_8_Raji  73  G   GATCGTCT    1   0.007518796992481203    0   0   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   62  1236    0   0   0   0   1676    0   2513    0   0   0   0   0   0   1211    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   156 0   0   0   0   899 1128    342 0   28588   3   0   7016    0   18  898 0   0   1577    17022   0   36  14400   0   0   0   8335    0   0   0   0   0   0   0   1   0   2   1   550 0   0   0   34  0   0   0   0   0   0   0   0   0   0   0   0   21649   663 0
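A minimal sketch of the coordinate bookkeeping that would keep both tables consistent (illustration only, not the depthgauge code):

```python
def vcf_to_bam_pos(pos):
    # VCF POS is 1-based; BAM/pileup coordinates are 0-based
    return pos - 1

def bam_to_vcf_pos(pos):
    # convert back before writing output so both tables report VCF coordinates
    return pos + 1
```

Applied consistently, AF.tsv's POS 25 would query 0-based offset 24 in the BAM and still be reported as 25, instead of drifting to 26 as above.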

review of outputs

We currently output an AF.tsv, DP.tsv, master.tsv, and Variants.tsv.
AF and DP are VAF and depth pivot tables.
master.tsv is the raw dataset (post-filtering).
Variants.tsv is the master dataset merged on CHROM, POS, REF, and ALT, which makes it easier to see variants across samples.

Other outputs to be considered?
