blachlylab / mucor3

Parses VCF data into tabular spreadsheets and aggregates data by sample

License: MIT License

Python 9.32% Makefile 0.17% D 90.07% Dockerfile 0.04% Shell 0.40%
bioinformatics vcf jsonlines aggregation

Contributors: charlesgregory, kekananen

mucor3's Issues

review of outputs

We currently output an AF.tsv, DP.tsv, master.tsv, and Variants.tsv.
AF and DP are VAF and depth pivot tables.
master.tsv is the raw dataset (post-filtering).
Variants.tsv is the master dataset merged on CHROM, POS, REF, and ALT, which makes it easier to see variants across samples.

Other outputs to be considered?

Need different vcf atomizer output modes

Need different modes for outputting data in different orientations from the vcf atomizer:

Variant centric:

One row per unique variant. Multi-sample VCFs would have multiple sample objects under the FORMAT object.

{ 
    "CHROM" : "chr1", 
    "POS" : "1", 
    "REF" : "G", 
    "ALT" : "A", 
    "INFO" : { 
        "ANN" : [ 
            {
                "effect": "one"
            }, 
            {
                "effect": "two"
            }
        ]
    }, 
    "FORMAT" : { 
        "SAM1": { 
            "AF": 0.2
        },
        "SAM2": { 
            "AF": 0.1
        } 
    }
}

Sample variant centric (current):

One row per variant per sample. We expand the FORMAT object into multiple rows (while duplicating all other information).

{ 
    # all other values the same
    "FORMAT" : { 
        "AF": 0.2 # AF for SAM1
    },
    "sample":"SAM1" 
}
{ 
    # all other values the same
    "FORMAT" : { 
        "AF": 0.1 # AF for SAM2
    },
    "sample":"SAM2"
}

Annotation centric (previously):

One row per annotation per variant per sample. We further expand the ANN object into multiple rows (while duplicating all other information).

{ 
    # all other values the same
    "INFO" : { 
        "ANN" : {
            "effect": "one" # first ANN annotation for SAM1
        } 
    }, 
    "FORMAT" : { 
        "AF": 0.2 
    },
    "sample":"SAM1"
}
{ 
    # all other values the same
    "INFO" : { 
        "ANN" : {
            "effect": "two" # second ANN annotation for SAM1
        }
    }, 
    "FORMAT" : { 
        "AF": 0.2
    },
    "sample":"SAM1"
}
{ 
    # all other values the same
    "INFO" : { 
        "ANN" : {
            "effect": "one" # first ANN annotation for SAM2
        } 
    }, 
    "FORMAT" : { 
        "AF": 0.1
    },
    "sample":"SAM2"
}
{ 
    # all other values the same
    "INFO" : { 
        "ANN" : {
            "effect": "two" # second ANN annotation for SAM2
        } 
    }, 
    "FORMAT" : { 
        "AF": 0.1
    },
    "sample":"SAM2"
}
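
The three orientations above can be sketched as two successive expansions of a variant-centric record. This is only an illustration in Python, not the atomizer's actual D implementation; the function names `expand_samples` and `expand_annotations` are hypothetical, and the record layout is assumed to match the JSON examples in this issue.

```python
import copy
import json

def expand_samples(record):
    """Variant-centric -> sample variant centric: one row per sample.

    Assumes FORMAT is keyed by sample name, as in the first example.
    """
    rows = []
    for sample, fields in record["FORMAT"].items():
        row = copy.deepcopy(record)      # duplicate all other information
        row["FORMAT"] = dict(fields)     # keep only this sample's values
        row["sample"] = sample
        rows.append(row)
    return rows

def expand_annotations(row):
    """Sample variant centric -> annotation centric: one row per ANN entry."""
    anns = row.get("INFO", {}).get("ANN", [])
    if not anns:
        return [copy.deepcopy(row)]      # nothing to expand
    out = []
    for ann in anns:
        new_row = copy.deepcopy(row)
        new_row["INFO"]["ANN"] = dict(ann)
        out.append(new_row)
    return out

variant = {
    "CHROM": "chr1", "POS": "1", "REF": "G", "ALT": "A",
    "INFO": {"ANN": [{"effect": "one"}, {"effect": "two"}]},
    "FORMAT": {"SAM1": {"AF": 0.2}, "SAM2": {"AF": 0.1}},
}

sample_rows = expand_samples(variant)
annotation_rows = [r for s in sample_rows for r in expand_annotations(s)]

# Emit one JSON object per line (jsonlines), one line per row.
for r in annotation_rows:
    print(json.dumps(r))
```

An output mode flag would then only need to decide how many of these expansions to apply: none, the first, or both.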

Merged annotation fields out of order

If the annotations for a variant appear in a different order for each sample, merging on sample and variant concatenates the annotation fields in a different order for each sample, even though the variant is the same.

sample CHROM POS REF ALT meta1
sample1 chr1 2 G T BAD
sample1 chr1 2 G T GOOD
sample2 chr1 2 G T GOOD
sample2 chr1 2 G T BAD

A merge on the above dataset would result in the below dataset.

sample CHROM POS REF ALT meta1
sample1 chr1 2 G T BAD;GOOD
sample2 chr1 2 G T GOOD;BAD

This can make comparison and pivoting difficult. Recommend sorting on sample, CHROM, POS, REF, ALT, and the ANN fields before any merges.
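
A minimal sketch of the sort-before-merge idea, in plain Python rather than the project's own code. The helper name `merge_rows` and the `;` separator are assumptions taken from the example tables above; only the single `meta1` field is handled.

```python
from itertools import groupby

KEY = ("sample", "CHROM", "POS", "REF", "ALT")

def merge_rows(rows, field="meta1", sep=";"):
    """Merge rows sharing sample + variant, joining `field` in sorted order."""
    def key(r):
        return tuple(r[k] for k in KEY)

    # Sort on the merge key AND the annotation field, so the joined
    # string no longer depends on the input order.
    ordered = sorted(rows, key=lambda r: (key(r), r[field]))
    merged = []
    for k, group in groupby(ordered, key=key):
        out = dict(zip(KEY, k))
        out[field] = sep.join(r[field] for r in group)
        merged.append(out)
    return merged

rows = [
    {"sample": "sample1", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "BAD"},
    {"sample": "sample1", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "GOOD"},
    {"sample": "sample2", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "GOOD"},
    {"sample": "sample2", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "BAD"},
]

merged = merge_rows(rows)
for m in merged:
    print(m["sample"], m["meta1"])
```

With the sort in place, both samples end up with the same merged value, so later comparisons and pivots see one variant rather than two.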

Merge python and D code

Need to merge the Python functionality into the D binary. Most of this is done; we just need to wrap it all together as a single invocation that goes from VCF to pivoted table.

Pivot table positive rate wrong denominator

(See also #1)

Commit 4367c9e introduced positive no. and positive rate in aggregate.pivot. However, main, which calls pivot, later adds back any missing columns. Because the rate is computed before those columns are restored, its denominator is wrong.

Suggest removing it, or refactoring positive no. and positive rate into a separate function that can be called from main after the missing columns are added back.
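
To illustrate the second option, here is a rough Python sketch, not the code in aggregate.pivot. The helper names `add_missing_samples` and `positive_stats` are hypothetical, and it assumes that a "positive" cell is any cell that is not the missing marker `.`.

```python
MISSING = "."

def add_missing_samples(header, rows, all_samples):
    """Append every sample column absent from the pivot, filled with '.'."""
    absent = [s for s in all_samples if s not in header]
    new_header = header + absent
    new_rows = [row + [MISSING] * len(absent) for row in rows]
    return new_header, new_rows

def positive_stats(row, n_index_cols):
    """Return (positive no., positive rate) over the sample columns only."""
    values = row[n_index_cols:]
    positive = sum(1 for v in values if v != MISSING)
    return positive, positive / len(values)

header = ["CHROM", "POS", "REF", "ALT", "sample1", "sample2"]
rows = [["chr1", 1, "G", "A", 0.1, "."]]
all_samples = ["sample1", "sample2", "sample3", "sample4"]

# Computed too early: the denominator is only the 2 columns present so far.
early = positive_stats(rows[0], n_index_cols=4)

# Computed after the missing columns are restored: denominator is all 4 samples.
header, rows = add_missing_samples(header, rows, all_samples)
late = positive_stats(rows[0], n_index_cols=4)

print(early, late)
```

The count is the same in both cases; only the rate changes, which is why the statistics belong after the step that restores the columns.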

Pivot table improvement

Mucor pivot table improvement suggestion:

After the pivot, add a column, right before the first sample, that shows the total number of nonzero entries in the row.
That will help prioritize rows.
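
One way the suggested column could be placed, sketched in Python. The column name `n_nonzero` and the helper `insert_nonzero_count` are made up for illustration, and the sketch assumes that `.` and 0 both count as empty.

```python
EMPTY = (".", 0, "0")

def insert_nonzero_count(header, rows, first_sample_idx, name="n_nonzero"):
    """Insert a per-row count of nonzero sample cells before the first sample."""
    new_header = header[:first_sample_idx] + [name] + header[first_sample_idx:]
    new_rows = []
    for row in rows:
        count = sum(1 for v in row[first_sample_idx:] if v not in EMPTY)
        new_rows.append(row[:first_sample_idx] + [count] + row[first_sample_idx:])
    return new_header, new_rows

header = ["CHROM", "POS", "REF", "ALT", "sample1", "sample2", "sample3"]
rows = [
    ["chr1", 1, "G", "A", 0.1, ".", 0.3],
    ["chr1", 2, "G", "T", ".", ".", 0.5],
]

new_header, new_rows = insert_nonzero_count(header, rows, first_sample_idx=4)

# Rows with the most nonzero samples first, to help prioritize.
ranked = sorted(new_rows, key=lambda r: r[4], reverse=True)
print(new_header)
for r in ranked:
    print(r)
```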

Pivoted dataset can have duplicate variant records

When pivoting and including metadata columns, if metadata columns are different for some samples for the same variant, they will be reported separately.

CHROM POS REF ALT meta1 sample1 sample2
chr1 1 G A BAD 0.1 .
chr1 1 G A GOOD . 0.2
chr1 2 G T BAD;GOOD . 0.1
chr1 2 G T GOOD;BAD 0.5 .

Suggest pivoting with only CHROM, POS, REF, ALT. Then add metadata columns back with a left join. Columns from the right dataset in the join should be merged on CHROM, POS, REF, ALT.
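
The suggestion above might look roughly like this in Python. It is a sketch under assumptions: the function names are hypothetical, the value column is called `AF`, and the metadata for a variant is merged as the sorted set of its distinct `;`-separated values.

```python
KEY = ("CHROM", "POS", "REF", "ALT")

def pivot(rows, samples, value="AF"):
    """Pivot on the variant key only: one entry per unique variant."""
    table = {}
    for r in rows:
        k = tuple(r[c] for c in KEY)
        table.setdefault(k, {s: "." for s in samples})[r["sample"]] = r[value]
    return table

def merged_metadata(rows, field="meta1", sep=";"):
    """Collapse metadata to one value per variant (sorted, de-duplicated)."""
    meta = {}
    for r in rows:
        k = tuple(r[c] for c in KEY)
        meta.setdefault(k, set()).update(r[field].split(sep))
    return {k: sep.join(sorted(v)) for k, v in meta.items()}

def pivot_with_metadata(rows, samples):
    """Left join: every pivoted variant keeps its row; metadata is looked up."""
    meta = merged_metadata(rows)
    out = []
    for k, cells in sorted(pivot(rows, samples).items()):
        out.append(list(k) + [meta.get(k, ".")] + [cells[s] for s in samples])
    return out

rows = [
    {"sample": "sample1", "CHROM": "chr1", "POS": 1, "REF": "G", "ALT": "A", "meta1": "BAD", "AF": 0.1},
    {"sample": "sample2", "CHROM": "chr1", "POS": 1, "REF": "G", "ALT": "A", "meta1": "GOOD", "AF": 0.2},
    {"sample": "sample2", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "BAD;GOOD", "AF": 0.1},
    {"sample": "sample1", "CHROM": "chr1", "POS": 2, "REF": "G", "ALT": "T", "meta1": "GOOD;BAD", "AF": 0.5},
]

result = pivot_with_metadata(rows, ["sample1", "sample2"])
for line in result:
    print(line)
```

On the example data from this issue, the four pivoted records collapse to two, one per variant.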

sample_indexer hardcodes some columns

We should always expect a "sample" column in our data file(s) unless it is a pivoted table, but we cannot expect the index/key file to have specific columns. I think it currently relies on the columns "accession" and "status" being present? So we really need two flags instead of just --column, or --column needs to take a parsed value like colname1=colname2.

For example, the index/key table in our samples.xlsx could look like this:

Sequencing ID    clinical_trial name
sample1          SAM0001
sample2          SAM0002
...              ...

In that case we would need to run the script like this:

sample_indexer -f data.tsv -i samples.xlsx -c "Sequencing ID=clinical_trial name" -o converted.tsv
# or 
sample_indexer -f data.tsv -i samples.xlsx --from "Sequencing ID" --to "clinical_trial name" -o converted.tsv
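
Both proposed forms could be accepted by one parser. The following is a sketch with Python's argparse, not the current sample_indexer code; the attribute names `data_file`, `index_file`, `out_file`, `from_col`, and `to_col` are hypothetical, and the flags are the ones proposed in this issue.

```python
import argparse

def parse_args(argv=None):
    p = argparse.ArgumentParser(prog="sample_indexer")
    p.add_argument("-f", dest="data_file", required=True, help="data table")
    p.add_argument("-i", dest="index_file", required=True, help="index/key table")
    p.add_argument("-o", dest="out_file", required=True, help="converted output")
    p.add_argument("-c", "--column", help="mapping written as FROM=TO")
    # 'from' is a Python keyword, so store it under another attribute name.
    p.add_argument("--from", dest="from_col", help="key column to translate from")
    p.add_argument("--to", dest="to_col", help="key column to translate to")
    args = p.parse_args(argv)

    if args.column:
        if args.from_col or args.to_col:
            p.error("use either --column or --from/--to, not both")
        from_col, sep, to_col = args.column.partition("=")
        if not sep or not from_col.strip() or not to_col.strip():
            p.error("--column must look like FROM=TO")
        args.from_col, args.to_col = from_col.strip(), to_col.strip()
    elif not (args.from_col and args.to_col):
        p.error("give --column FROM=TO, or both --from and --to")
    return args

a = parse_args(["-f", "data.tsv", "-i", "samples.xlsx",
                "-c", "Sequencing ID=clinical_trial name", "-o", "converted.tsv"])
b = parse_args(["-f", "data.tsv", "-i", "samples.xlsx",
                "--from", "Sequencing ID", "--to", "clinical_trial name",
                "-o", "converted.tsv"])
print(a.from_col, "->", a.to_col)
print(b.from_col, "->", b.to_col)
```

Splitting on the first `=` keeps column names with spaces intact, but a column name that itself contains `=` would only work with the two-flag form.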

Off by 1 index error between AF calculations and depth calculations.

The output of depthgauge and mucor3's AF.tsv file are off by 1 from each other. This is likely due to not accounting for the 1-based coordinate system of a VCF vs. the 0-based coordinate system of a BAM. Positions should be reported in one system or the other, not both, since mixing them makes downstream analysis more difficult.

AF.tsv is as follows:

chrEBV_B_95_8_Raji      25      C       G       2       0.015037593984962405    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.88    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.084   .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .
chrEBV_B_95_8_Raji      28      A       C       1       0.007518796992481203    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.667   .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .
chrEBV_B_95_8_Raji      47      C       T       2       0.015037593984962405    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.8     .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.667   .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .
chrEBV_B_95_8_Raji      53      C       A       1       0.007518796992481203    .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       0.244   .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .
chrEBV_B_95_8_Raji      72      G       GATCGTCT        1       0.007518796992481203    .       .       .       .       .       .       .       .       

DP from depthgauge is as follows:

chrEBV_B_95_8_Raji  26  C   G   2   0.015037593984962405    1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   5955    0   0   0   0   0   0   0   0   1087    0   0   0   0   1653    0   0   0   0   0   0   0   0   1211    0   0   640 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1245    0   156 0   0   0   0   899 1128    1640    0   104 6   0   0   0   33  701 0   0   0   16252   0   36  3098    0   0   0   8334    0   0   0   0   0   0   0   1   0   0   1   551 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   331 0
chrEBV_B_95_8_Raji  29  A   C   1   0.007518796992481203    1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   5956    0   0   0   0   0   0   0   0   1087    0   0   0   0   1670    0   0   0   0   0   0   0   0   1211    0   0   640 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1245    0   156 0   0   0   0   899 1128    1639    0   104 6   0   0   0   33  841 0   0   0   16327   0   36  3110    0   0   0   8334    0   0   0   0   0   0   0   1   0   0   1   551 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   331 0
chrEBV_B_95_8_Raji  48  C   T   2   0.015037593984962405    1   0   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   5950    0   0   0   0   0   0   0   0   1236    0   0   0   0   1693    0   2513    0   0   0   0   0   0   1211    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1245    0   156 0   0   0   0   899 1132    1576    0   28397   6   0   0   0   34  896 0   0   0   17022   0   36  8758    0   0   0   8335    0   0   0   0   0   0   0   1   0   2   1   551 0   0   0   34  0   0   0   0   0   0   0   0   0   0   0   0   21620   662 0
chrEBV_B_95_8_Raji  54  C   A   1   0.007518796992481203    1   0   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   5943    0   0   0   0   0   0   0   0   1236    0   0   0   0   1695    0   2513    0   0   0   0   0   0   1211    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1245    0   156 0   0   0   0   899 1132    342 0   28477   6   0   0   0   34  897 0   0   1572    17022   0   36  8765    0   0   0   8336    0   0   0   0   0   0   0   1   0   2   1   551 0   0   0   34  0   0   0   0   0   0   0   0   0   0   0   0   21635   664 0
chrEBV_B_95_8_Raji  73  G   GATCGTCT    1   0.007518796992481203    0   0   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   62  1236    0   0   0   0   1676    0   2513    0   0   0   0   0   0   1211    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   156 0   0   0   0   899 1128    342 0   28588   3   0   7016    0   18  898 0   0   1577    17022   0   36  14400   0   0   0   8335    0   0   0   0   0   0   0   1   0   2   1   550 0   0   0   34  0   0   0   0   0   0   0   0   0   0   0   0   21649   663 0
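
A quick way to confirm the offset and to make the conversion explicit, sketched in Python. The position lists are copied from the two excerpts above; the helper names are hypothetical, and which tool should change is left open, since the issue only suspects the cause.

```python
def vcf_pos_to_zero_based(pos):
    """1-based VCF POS -> 0-based start (as used in BAM records)."""
    return pos - 1

def zero_based_to_vcf_pos(start):
    """0-based start -> 1-based VCF POS."""
    return start + 1

# POS column of the first five rows in each excerpt above.
af_positions = [25, 28, 47, 53, 72]   # AF.tsv
dp_positions = [26, 29, 48, 54, 73]   # depthgauge output

# Every row differs by the same amount, which points at a systematic
# coordinate shift rather than at different variants.
offsets = {dp - af for af, dp in zip(af_positions, dp_positions)}
print(offsets)

# Undoing a single +1 shift makes the two tables line up again.
realigned = [vcf_pos_to_zero_based(p) for p in dp_positions]
print(realigned == af_positions)
```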

Speed up depthgauge

Is there a way to speed up depthgauge besides increasing the number of threads? On ~100 viral samples it takes around a day and a half to complete, and it seems to run more slowly as the number of samples increases. It works through the first 50 or so quickly, but then slows to roughly 1 sample per ~40 min towards the end.

With threads set to 4, ~190 samples took >3 days to finish. So the expected time for calculating the depth of viral samples is around 40 min per sample.
