blachlylab / mucor3

Parses VCF data into tabular spreadsheets and aggregates data by sample

License: MIT License

Topics: bioinformatics, vcf, jsonlines, aggregation


Mucor3

Introduction

Mucor3 is an iteration on the original Mucor. It encompasses not only the creation of VCF variant reports but also a range of line-delimited JSON (JSONL) manipulation. Mucor3 translates VCF files into tabular data and aggregates them into useful pivoted tables. VCFs are first converted to line-delimited JSON objects, which allows great flexibility in filtering the data before pivoting. After combining all variant JSONL into one file, Mucor3 can convert it to a tabular format and generate pivoted tables; by default these show each variant pivoted by sample, displaying the allele frequency of that variant for each sample. A companion tool, depthgauge, creates a pivoted spreadsheet of the read depth at every position in the pivoted allele-frequency table. Mucor3 is broken into several steps that can be performed by a variety of programs to suit your needs; generally these are annotation, atomization, filtering, manipulation, and report generation.

Quick Guide

Installation

git clone --recurse-submodules https://github.com/blachlylab/mucor3.git
cd mucor3
make
python setup.py install

Step 0: Annotation

For VCFs to contain useful information about the mutations they describe, annotation is usually a necessary step, and it is not included in most variant callers. Annotation can be performed by a variety of programs; unlike previous versions of this software, Mucor3 does not perform it. This keeps the goals of this project small and within scope, and many existing programs are already optimized for this purpose. Some notable annotation software we use:

  1. snpEff
  2. snpSift
  3. vcfanno
  4. vep

Step 1: Atomization

To allow greater flexibility in the tools that can be used with Mucor3, we use JSON as an intermediate representation, with atomizers to convert tables and VCFs to line-delimited JSON. The VCF atomizer converts your VCFs into line-delimited JSON objects. A single JSON object represents an individual VCF record for a single sample, or an individual annotation of a single VCF record for a single sample (if you intend to use Elasticsearch for filtering). Read more about the VCF atomizer here.

atomization/atomize_vcf/atomize_vcf sample1.vcf.gz >sample1.jsonl
vcf_atomizer sample2.vcf >sample2.jsonl

Step 2: Combine VCF JSON information into one file

cat sample1.jsonl sample2.jsonl ... > data.jsonl

Step 2.5: Linking sample information

You can use the table atomizer to create JSON records from a sample spreadsheet.

table_atomizer samples.tsv > samples.jsonl
table_atomizer samples.xlsx >samples.jsonl

This data can then be used with the previously generated VCF data to link sample information to VCF variant data. After this data is linked, it can be used for filtering in later steps.

code for linking here
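The linking code is not shown in this README; as an illustration only, a minimal pure-Python sketch of the idea might look like the following. The field names (`sample`, `clinical_trial_name`) are hypothetical, not part of the mucor3 API.

```python
import json

# toy JSONL inputs: one variant record, one sample-metadata record
variants = [
    '{"CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "sample": "sample1"}',
]
samples = [
    '{"sample": "sample1", "clinical_trial_name": "SAM0001"}',
]

# index the sample metadata by sample name
meta = {rec["sample"]: rec for rec in map(json.loads, samples)}

# attach metadata fields to each variant record, without clobbering existing keys
linked = []
for rec in map(json.loads, variants):
    for k, v in meta.get(rec["sample"], {}).items():
        rec.setdefault(k, v)
    linked.append(rec)
```

The linked records can then be written back out as JSONL and filtered on the metadata fields in the next step.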

Step 3: Filtering

VCFs are often filled with millions of variants, which can make the tables generated by Mucor3 very large. Filtering lets us reduce the spreadsheets to only the variants that are important to us. A number of different tools can be used for this, for example:

  1. jq
  2. elasticsearch
  3. apache drill
  4. couchdb

We provide python scripts to aid in the use of elasticsearch for filtering. We also provide a program called varquery that can perform filtering in a similar way to elasticsearch without the bulk of a full database.

varquery index data.jsonl > data.index
varquery query data.index data.jsonl "/AF > 0.5 AND /INFO/ANN/EFFECT=(missense OR 5_prime_utr)"

More info on using varquery for filtering can be found here. More info on using elasticsearch for filtering can be found here.
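For readers not using varquery or Elasticsearch, the same style of filter can be sketched in plain Python. The record layout below (an `AF` value nested under `FORMAT`) is our assumption about the atomizer output, for illustration only:

```python
import json

# two toy JSONL records
jsonl = "\n".join([
    '{"CHROM": "chr1", "POS": 2, "FORMAT": {"AF": 0.7}}',
    '{"CHROM": "chr1", "POS": 5, "FORMAT": {"AF": 0.25}}',
])

# roughly equivalent to: jq -c 'select(.FORMAT.AF > 0.5)' data.jsonl
filtered = [r for r in map(json.loads, jsonl.splitlines())
            if r["FORMAT"]["AF"] > 0.5]
```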

Running Mucor3

Provide Mucor3 with your combined data and an output folder.

mucor3 data.jsonl output_folder

Mucor3 will output a pivoted table, with every variant pivoted by sample, in this general format:

CHROM POS REF ALT ANN_gene_name ANN_effect sample1 sample2
chr1 2 G T foo missense . 0.7
chr1 5 C T foo synonymous 1 0.25
chr1 1000 TA T bar ... 0.45 .
chr1 3000 G GATAGC oncogene ... 0.01 .

The values under sample1 and sample2 come from the AF field of the FORMAT region of the VCF.

The master table, however, represents the same data in this format:

CHROM POS REF ALT AF sample ANN_gene_name ANN_hgvs_p ANN_effect
chr1 2 G T 0.7 sample2 foo p.Met1Ala missense
chr1 5 C T 1 sample1 foo ... synonymous
chr1 5 C T 0.25 sample2 foo ... ...
chr1 1000 TA T 0.45 sample1 bar ... ...
chr1 3000 G GATAGC 0.01 sample1 oncogene ... ...

Note: The ANN_ fields will not be present for VCFs that have not been annotated using SnpEff.

DepthGauge

Before running depthgauge, we need to know the first sample-name column in our AF.tsv spreadsheet. In the data above, that is column 7 (sample1). We also provide a folder containing the BAM files depthgauge needs. Important: the BAMs must have the same names as the samples in the spreadsheet and VCFs, and must be sorted and indexed. For our example, the BAMs folder should contain sample1.bam, sample2.bam, sample1.bam.bai, and sample2.bam.bai.

depthgauge -t 4 output_folder/AF.tsv 7 BAMS/ depthgauge.tsv

This creates a table identical to our first, except with read depths instead of allele frequencies.

CHROM POS REF ALT ANN_gene_name ANN_effect sample1 sample2
chr1 2 G T foo missense 10 37
chr1 5 C T foo synonymous 100 4
chr1 1000 TA T bar ... 20 45
chr1 3000 G GATAGC oncogene ... 300 78

Datastore

The key advancement of using JSONL as an intermediate data type is its flexibility and use in NoSQL datastores. For a large number of samples, or a more permanent dataset that may be analyzed several times, a NoSQL database may offer more flexibility and robustness. We provide Python scripts to upload data to, and query VCF data from, an Elasticsearch instance. Other JSONL querying mechanisms can also be used, e.g. Apache Drill, AWS Athena, newer versions of PostgreSQL, and many others.

Custom Tables

The main mucor3 Python script creates a pivot table by taking the JSONL directly from the vcf_atomizer, setting the fields CHROM, POS, REF, and ALT as an index, pivoting on sample, and displaying the AF for each combination of index and pivot value. Using the mucor scripts directly allows greater flexibility and manipulation. All scripts, with the exception of jsonlcsv.py, take JSONL as input and output JSONL.
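A minimal sketch of that pivot in pure Python, assuming records shaped like the master table above (this is an illustration of the operation, not the mucor3 implementation):

```python
# master-table-like records: one row per variant per sample
records = [
    {"CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "sample": "sample2", "AF": 0.7},
    {"CHROM": "chr1", "POS": 5, "REF": "C", "ALT": "T", "sample": "sample1", "AF": 1.0},
    {"CHROM": "chr1", "POS": 5, "REF": "C", "ALT": "T", "sample": "sample2", "AF": 0.25},
]
index_cols = ("CHROM", "POS", "REF", "ALT")

# one output row per unique index; one column per sample; cell value is AF
pivot = {}
for rec in records:
    key = tuple(rec[c] for c in index_cols)
    pivot.setdefault(key, {})[rec["sample"]] = rec["AF"]
```

Missing cells (a sample without that variant) simply have no entry and would be rendered as "." in the TSV output.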

Merge

merge.py combines rows so that, when a pivot is performed, rows are unique and duplications are avoided. The main mucor3 script uses this to ensure we have unique rows for any given variant, so there is only one occurrence of any combination of CHROM, POS, REF, ALT, and sample. merge.py can combine rows in other ways: simply specify which column combinations define a unique row. mucor3's merge would look like this using the script directly:

cat data.jsonl | python merge.py sample CHROM POS REF ALT

merge.py will concatenate columns for rows that are duplicate based on the provided indices.

CHROM POS REF ALT AF sample ANN_gene_name ANN_hgvs_p ANN_effect ANN_transcript_id
chr1 2 G T 0.7 sample2 foo p.Met1Ala missense 1
chr1 2 G T 0.7 sample2 foo . synonymous 2
chr1 5 C T 1 sample1 foo . synonymous 3

The above table would be changed to this:

CHROM POS REF ALT AF sample ANN_gene_name ANN_hgvs_p ANN_effect ANN_transcript_id
chr1 2 G T 0.7 sample2 foo p.Met1Ala;. missense;synonymous 1;2
chr1 5 C T 1 sample1 foo . synonymous 3

This step is necessary because the vcf_atomizer reports duplicate variant rows for multiple SnpEff annotations, which is most efficient for filtering data with Elasticsearch or jq. We must use merge.py to later coalesce the rows back into a single row per variant.
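The merge rule described above can be sketched in a few lines of Python; this is an illustration of the behavior, not merge.py itself:

```python
# duplicate rows per variant, one per SnpEff annotation
rows = [
    {"CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "sample": "sample2",
     "ANN_effect": "missense", "ANN_transcript_id": "1"},
    {"CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "sample": "sample2",
     "ANN_effect": "synonymous", "ANN_transcript_id": "2"},
    {"CHROM": "chr1", "POS": 5, "REF": "C", "ALT": "T", "sample": "sample1",
     "ANN_effect": "synonymous", "ANN_transcript_id": "3"},
]
key_cols = ("sample", "CHROM", "POS", "REF", "ALT")

merged = {}
for row in rows:
    key = tuple(row[c] for c in key_cols)
    if key not in merged:
        merged[key] = dict(row)
    else:
        # concatenate non-key columns that differ, mirroring the ";" output above
        for col, val in row.items():
            if col not in key_cols and merged[key][col] != val:
                merged[key][col] = f"{merged[key][col]};{val}"
merged_rows = list(merged.values())
```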

Aggregate

Jsonl to TSV

mucor3's People

Contributors: charlesgregory, kekananen

mucor3's Issues

Merged annotation fields out of order

If the annotations for each sample variant are in a different order, then when merging on sample and variant, samples with the same variant will have their annotation fields merged in different orders.

sample CHROM POS REF ALT meta1
sample1 chr1 2 G T BAD
sample1 chr1 2 G T GOOD
sample2 chr1 2 G T GOOD
sample2 chr1 2 G T BAD

A merge on the above dataset would result in the below dataset.

sample CHROM POS REF ALT meta1
sample1 chr1 2 G T BAD;GOOD
sample2 chr1 2 G T GOOD;BAD

This can make comparison and pivoting difficult. We recommend sorting on sample, CHROM, POS, REF, ALT, and the ANN fields before any merges.
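The recommended sort can be sketched as follows, using toy records mirroring the example above (`meta1` stands in for the ANN fields):

```python
# unsorted records: the two annotations appear in a different order per sample
records = [
    {"sample": "sample1", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "BAD"},
    {"sample": "sample1", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "GOOD"},
    {"sample": "sample2", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "GOOD"},
    {"sample": "sample2", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "BAD"},
]

# sorting on the annotation fields as well guarantees a stable concatenation order
records.sort(key=lambda r: (r["sample"], r["CHROM"], r["POS"],
                            r["REF"], r["ALT"], r["meta1"]))
order = [r["meta1"] for r in records]
```

After this sort, both samples merge to "BAD;GOOD" rather than one of each ordering.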

Pivoted dataset can have duplicate variant records

When pivoting and including metadata columns, if the metadata columns differ between samples for the same variant, the variant will be reported as separate rows.

CHROM POS REF ALT meta1 sample1 sample2
chr1 1 G A BAD 0.1 .
chr1 1 G A GOOD . 0.2
chr1 2 G T BAD;GOOD . 0.1
chr1 2 G T GOOD;BAD 0.5 .

Suggest pivoting with only CHROM, POS, REF, ALT, then adding the metadata columns back with a left join. The right dataset in the join should first be merged on CHROM, POS, REF, ALT.
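A sketch of the suggested workaround: pivot on the variant key alone, then left-join the merged metadata back (plain-Python illustration, not the mucor3 code):

```python
# pivot produced with only CHROM, POS, REF, ALT as the index
pivoted = [
    {"CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "sample1": 0.5, "sample2": 0.1},
]
# metadata previously merged on CHROM, POS, REF, ALT
meta = {
    ("chr1", 2, "G", "T"): {"meta1": "BAD;GOOD"},
}

# left join: every pivoted row is kept; metadata is added where the key matches
for row in pivoted:
    key = (row["CHROM"], row["POS"], row["REF"], row["ALT"])
    row.update(meta.get(key, {}))
```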

Need different vcf atomizer output modes

Need different modes for outputting data in different orientations from the vcf atomizer:

Variant centric:

One row per unique variant. Multi-sample vcfs would have multiple sample objects under the FORMAT object.

{ 
    "CHROM" : "chr1", 
    "POS" : "1", 
    "REF" : "G", 
    "ALT" : "A", 
    "INFO" : { 
        "ANN" : [ 
            {
                "effect": "one"
            }, 
            {
                "effect": "two"
            }
        ]
    }, 
    "FORMAT" : { 
        "SAM1": { 
            "AF": 0.2
        },
        "SAM2": { 
            "AF": 0.1
        } 
    }
}

Sample variant centric (current):

One row per each variant for each sample. We expand the FORMAT object into multiple rows (while duplicating all other information).

{ 
    # all other values the same
    "FORMAT" : { 
        "AF": 0.2 # AF for SAM1
    },
    "sample":"SAM1" 
}
{ 
    # all other values the same
    "FORMAT" : { 
        "AF": 0.1 # AF for SAM2
    },
    "sample":"SAM2"
}

Annotation centric (previously):

One row per each annotation per each variant for each sample. We further expand the ANN object into multiple rows (while duplicating all other information).

{ 
    # all other values the same
    "INFO" : { 
        "ANN" : {
            "effect": "one" # first ANN annotation for SAM1
        } 
    }, 
    "FORMAT" : { 
        "AF": 0.2 
    },
    "sample":"SAM1"
}
{ 
    # all other values the same
    "INFO" : { 
        "ANN" : {
            "effect": "two" # second ANN annotation for SAM1
        }
    }, 
    "FORMAT" : { 
        "AF": 0.2
    },
    "sample":"SAM1"
}
{ 
    # all other values the same
    "INFO" : { 
        "ANN" : {
            "effect": "one" # first ANN annotation for SAM2
        } 
    }, 
    "FORMAT" : { 
        "AF": 0.1
    },
    "sample":"SAM2"
}
{ 
    # all other values the same
    "INFO" : { 
        "ANN" : {
            "effect": "two" # second ANN annotation for SAM2
        } 
    }, 
    "FORMAT" : { 
        "AF": 0.1
    },
    "sample":"SAM2"
}

Pivot table positive rate wrong denominator

(See also #1)

Commit 4367c9e introduced positive no. and positive rate in aggregate.pivot. However, main, which calls pivot, later adds back missing columns (if any). This means the positive rate's denominator is wrong.

Suggest removing it, or refactoring positive no. and positive rate into a separate function that main can call after adding back the missing columns.

Speed up depthgauge

Is there a way to increase the speed of depthgauge besides setting/increasing threads? When running on ~100 viral samples it takes around a day and a half to complete, and it seems to run more slowly as the number of samples increases. In the beginning it quickly works through the first 50 or so, but then slows to roughly 1 sample per ~40 minutes toward the end of processing.

With threads set to 4, on ~190 samples it took >3 days to finish. So the expected average time for calculating the depth of viral samples is around ~40 minutes per sample.

sample_indexer hardcodes some columns

We should expect to always have a "sample" column in our data file(s) unless it is a pivoted table, but we have no expectation that the index/key file has specific columns. It currently seems to rely on the columns "accession" and "status" being present. So we really need two flags instead of just --column, or --column needs to take a parsed value like colname1=colname2.

Since our index/key table could be as such in our samples.xlsx:

Sequencing ID clinical_trial name
sample1 SAM0001
sample2 SAM0002
... ...

In which case we would need to run the script as such:

sample_indexer -f data.tsv -i samples.xlsx -c "Sequencing ID=clinical_trial name" -o converted.tsv
# or 
sample_indexer -f data.tsv -i samples.xlsx --from "Sequencing ID" --to "clinical_trial name" -o converted.tsv

Pivot table improvement

Mucor pivot table improvement suggestion

After the pivot, add a column, right before the first sample column, that shows the total number of nonzero entries in the row.
That will help prioritize variants.
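A sketch of the suggested column (the name `positive_n` is made up here), counting entries that are present and nonzero per row:

```python
# toy pivot rows; "." marks a missing call, as in the AF.tsv output
pivot_rows = [
    {"CHROM": "chr1", "POS": 2, "sample1": ".", "sample2": 0.7},
    {"CHROM": "chr1", "POS": 5, "sample1": 1.0, "sample2": 0.25},
]
sample_cols = ["sample1", "sample2"]

for row in pivot_rows:
    # count sample cells that are present and nonzero
    row["positive_n"] = sum(1 for c in sample_cols
                            if row[c] != "." and float(row[c]) > 0)
```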

Merge python and D code

Need to merge the python functionality into the D binary. Most of this is done, just need to wrap it all together as a single invocation that goes from VCF to pivoted table.

Off by 1 index error between AF calculations and depth calculations.

The output of depthgauge and mucor3's AF.tsv file are off by 1 from each other. This is likely due to not accounting for the 1-based coordinate system of a VCF vs the 0-based coordinate system of a BAM. Positions should be reported in one system or the other, but not both, as mixing them may make downstream analysis more difficult.

AF.tsv is as follows:

chrEBV_B_95_8_Raji      25      C       G       2       0.015037593984962405    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.88    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.084   .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .
chrEBV_B_95_8_Raji      28      A       C       1       0.007518796992481203    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.667   .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .
chrEBV_B_95_8_Raji      47      C       T       2       0.015037593984962405    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.8     .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.667   .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .
chrEBV_B_95_8_Raji      53      C       A       1       0.007518796992481203    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.244   .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .
chrEBV_B_95_8_Raji      72      G       GATCGTCT        1       0.007518796992481203    .       .       .       .       .       .       .       .       

DP is as follows from depthgauge:

chrEBV_B_95_8_Raji  26  C   G   2   0.015037593984962405    1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   5955    0   0   0   0   0   0   0   0   1087    0   0   0   0   1653    0   0   0   0   0   0   0   0   1211    0   0   640 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1245    0   156 0   0   0   0   899 1128    1640    0   104 6   0   0   0   33  701 0   0   0   16252   0   36  3098    0   0   0   8334    0   0   0   0   0   0   0   1   0   0   1   551 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   331 0
chrEBV_B_95_8_Raji  29  A   C   1   0.007518796992481203    1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   5956    0   0   0   0   0   0   0   0   1087    0   0   0   0   1670    0   0   0   0   0   0   0   0   1211    0   0   640 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1245    0   156 0   0   0   0   899 1128    1639    0   104 6   0   0   0   33  841 0   0   0   16327   0   36  3110    0   0   0   8334    0   0   0   0   0   0   0   1   0   0   1   551 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   331 0
chrEBV_B_95_8_Raji  48  C   T   2   0.015037593984962405    1   0   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   5950    0   0   0   0   0   0   0   0   1236    0   0   0   0   1693    0   2513    0   0   0   0   0   0   1211    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1245    0   156 0   0   0   0   899 1132    1576    0   28397   6   0   0   0   34  896 0   0   0   17022   0   36  8758    0   0   0   8335    0   0   0   0   0   0   0   1   0   2   1   551 0   0   0   34  0   0   0   0   0   0   0   0   0   0   0   0   21620   662 0
chrEBV_B_95_8_Raji  54  C   A   1   0.007518796992481203    1   0   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   5943    0   0   0   0   0   0   0   0   1236    0   0   0   0   1695    0   2513    0   0   0   0   0   0   1211    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1245    0   156 0   0   0   0   899 1132    342 0   28477   6   0   0   0   34  897 0   0   1572    17022   0   36  8765    0   0   0   8336    0   0   0   0   0   0   0   1   0   2   1   551 0   0   0   34  0   0   0   0   0   0   0   0   0   0   0   0   21635   664 0
chrEBV_B_95_8_Raji  73  G   GATCGTCT    1   0.007518796992481203    0   0   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   62  1236    0   0   0   0   1676    0   2513    0   0   0   0   0   0   1211    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   156 0   0   0   0   899 1128    342 0   28588   3   0   7016    0   18  898 0   0   1577    17022   0   36  14400   0   0   0   8335    0   0   0   0   0   0   0   1   0   2   1   550 0   0   0   34  0   0   0   0   0   0   0   0   0   0   0   0   21649   663 0
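A minimal sketch of the coordinate bookkeeping that would keep both tables consistent (illustration only, not the depthgauge code):

```python
def vcf_to_bam_pos(pos):
    # VCF POS is 1-based; BAM/pileup coordinates are 0-based
    return pos - 1

def bam_to_vcf_pos(pos):
    # convert back before writing output so both tables report VCF coordinates
    return pos + 1
```

Applied consistently, AF.tsv's POS 25 would query 0-based offset 24 in the BAM and still be reported as 25, instead of drifting to 26 as above.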

review of outputs

We currently output an AF.tsv, DP.tsv, master.tsv, and Variants.tsv.
AF and DP are VAF and depth pivot tables.
master.tsv is the raw dataset (post-filtering).
Variants.tsv is the master dataset merged on CHROM, POS, REF, and ALT, which makes it easier to see variants across samples.

Other outputs to be considered?
