blachlylab / mucor3

Parses VCF data into tabular spreadsheets and aggregates data by sample

License: MIT License
We currently output an AF.tsv, DP.tsv, master.tsv, and Variants.tsv.
AF.tsv and DP.tsv are VAF and depth pivot tables.
master.tsv is the raw dataset (post-filtering).
Variants.tsv is the master dataset merged on CHROM, POS, REF, and ALT. This makes it easier to see variants across samples.
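The merge behind Variants.tsv can be sketched as below. This is a minimal, pure-Python sketch of the idea only; the function name and the `;` join are illustrative assumptions, not Mucor3's actual implementation:

```python
def merge_on_variant(rows):
    """Group rows by (CHROM, POS, REF, ALT) and join the remaining
    fields across samples with ';' so one row describes one variant."""
    merged = {}
    for row in rows:
        key = (row["CHROM"], row["POS"], row["REF"], row["ALT"])
        merged.setdefault(key, []).append(row)
    out = []
    for (chrom, pos, ref, alt), group in merged.items():
        rec = {"CHROM": chrom, "POS": pos, "REF": ref, "ALT": alt}
        # every non-key field becomes a ';'-joined list across the group
        extra_keys = [k for k in group[0] if k not in rec]
        for k in extra_keys:
            rec[k] = ";".join(str(r[k]) for r in group)
        out.append(rec)
    return out
```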
Other outputs to be considered?
Only outputs md5 sum?
Need different modes for outputting data in different orientations from the vcf atomizer:
One row per unique variant. Multi-sample vcfs would have multiple sample objects under the FORMAT object.
{
  "CHROM": "chr1",
  "POS": "1",
  "REF": "G",
  "ALT": "A",
  "INFO": {
    "ANN": [
      {
        "effect": "one"
      },
      {
        "effect": "two"
      }
    ]
  },
  "FORMAT": {
    "SAM1": {
      "AF": 0.2
    },
    "SAM2": {
      "AF": 0.1
    }
  }
}
One row per variant per sample. We expand the FORMAT object into multiple rows (while duplicating all other information).
{
  # all other values the same
  "FORMAT" : {
    "AF": 0.2  # AF for SAM1
  },
  "sample": "SAM1"
}
{
  # all other values the same
  "FORMAT" : {
    "AF": 0.1  # AF for SAM2
  },
  "sample": "SAM2"
}
One row per annotation per variant per sample. We further expand the ANN object into multiple rows (while duplicating all other information).
{
  # all other values the same
  "INFO" : {
    "ANN" : {
      "effect": "one"  # first ANN annotation for SAM1
    }
  },
  "FORMAT" : {
    "AF": 0.2
  },
  "sample": "SAM1"
}
{
  # all other values the same
  "INFO" : {
    "ANN" : {
      "effect": "two"  # second ANN annotation for SAM1
    }
  },
  "FORMAT" : {
    "AF": 0.2
  },
  "sample": "SAM1"
}
{
  # all other values the same
  "INFO" : {
    "ANN" : {
      "effect": "one"  # first ANN annotation for SAM2
    }
  },
  "FORMAT" : {
    "AF": 0.1
  },
  "sample": "SAM2"
}
{
  # all other values the same
  "INFO" : {
    "ANN" : {
      "effect": "two"  # second ANN annotation for SAM2
    }
  },
  "FORMAT" : {
    "AF": 0.1
  },
  "sample": "SAM2"
}
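The ANN expansion can be sketched the same way; again a hypothetical helper, not the atomizer's actual code:

```python
import copy

def expand_annotations(record):
    """Yield one record per INFO.ANN entry, duplicating everything else.
    Records without an ANN list pass through unchanged."""
    anns = record.get("INFO", {}).get("ANN")
    if not anns:
        yield record
        return
    for ann in anns:
        row = copy.deepcopy(record)
        row["INFO"]["ANN"] = ann  # replace the list with a single entry
        yield row
```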
Suggest rocksdb
If the annotations for each sample's variant appear in a different order, then when merging on sample and variant, samples with the same variant will have their annotation fields merged in different orders.
sample | CHROM | POS | REF | ALT | meta1 |
---|---|---|---|---|---|
sample1 | chr1 | 2 | G | T | BAD |
sample1 | chr1 | 2 | G | T | GOOD |
sample2 | chr1 | 2 | G | T | GOOD |
sample2 | chr1 | 2 | G | T | BAD |
A merge on the above dataset would result in the below dataset.
sample | CHROM | POS | REF | ALT | meta1 |
---|---|---|---|---|---|
sample1 | chr1 | 2 | G | T | BAD;GOOD |
sample2 | chr1 | 2 | G | T | GOOD;BAD |
This can make comparison and pivoting difficult. Recommend sorting on sample, CHROM, POS, REF, ALT, and the ANN fields before any merges.
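The recommended sort-before-merge can be sketched as below; a pure-Python sketch under the assumption that annotations live in plain row dicts (field names like `meta1` are taken from the example above, the function name is illustrative):

```python
def sorted_merge(rows, ann_fields=("meta1",)):
    """Sort on sample, CHROM, POS, REF, ALT and the annotation fields
    before merging, so joined annotation strings are order-stable
    across samples that share a variant."""
    rows = sorted(rows, key=lambda r: (r["sample"], r["CHROM"], r["POS"],
                                       r["REF"], r["ALT"],
                                       tuple(r[f] for f in ann_fields)))
    keyed = {}
    for r in rows:
        k = (r["sample"], r["CHROM"], r["POS"], r["REF"], r["ALT"])
        keyed.setdefault(k, []).append(r)
    merged = []
    for (sample, chrom, pos, ref, alt), group in keyed.items():
        rec = {"sample": sample, "CHROM": chrom, "POS": pos,
               "REF": ref, "ALT": alt}
        for f in ann_fields:
            rec[f] = ";".join(g[f] for g in group)
        merged.append(rec)
    return merged
```

With the example table above, both sample1 and sample2 come out as `BAD;GOOD`.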
Need to merge the Python functionality into the D binary. Most of this is done; we just need to wrap it all together as a single invocation that goes from VCF to pivoted table.
This seems to be an issue with earlier software in our pipeline, namely MuTect2. We can use the `INFO_dbsnp_ids` field instead of `ID`.
(See also #1)
Commit 4367c9e introduced positive no. and positive rate in `aggregate.pivot`. However, `main`, which calls `pivot`, later adds back missing columns (if any). This means the positive rate's denominator is wrong.
Suggest removing it, or refactoring positive no. and positive rate into a separate function you can call from `main` after adding back missing columns.
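The suggested refactor could look roughly like this. The function name and column labels are hypothetical, and this is a sketch of the intent only, not the actual `aggregate` module:

```python
def add_positive_stats(rows, sample_cols):
    """Hypothetical helper: compute 'positive no.' and 'positive rate'
    for each pivoted row AFTER missing sample columns have been added
    back, so the denominator reflects the full sample count."""
    for row in rows:
        n = sum(1 for s in sample_cols
                if row.get(s) not in (None, ".", 0, "0"))
        row["positive no."] = n
        row["positive rate"] = n / len(sample_cols)
    return rows
```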
The documentation does not accurately describe how to install the tools or where the executables are located; this needs to be updated.
Mucor pivot table improvement suggestion
After pivoting, add a column, right before the first sample column, that shows the total number of nonzero entries in the row.
That will help prioritize variants.
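The suggested column can be sketched as below; a pure-Python sketch with a hypothetical function and column name (placing the column right before the first sample column would be handled when the TSV is written out):

```python
def add_positive_count(rows, sample_cols):
    """Add a 'positives' column counting nonzero, non-missing entries
    across the sample columns of each pivoted row."""
    out = []
    for row in rows:
        # '.' marks a missing value in the pivoted tables
        n = sum(1 for s in sample_cols
                if row.get(s) not in (None, ".", 0, "0"))
        new = dict(row)
        new["positives"] = n
        out.append(new)
    return out
```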
When pivoting and including metadata columns, if metadata columns are different for some samples for the same variant, they will be reported separately.
CHROM | POS | REF | ALT | meta1 | sample1 | sample2 |
---|---|---|---|---|---|---|
chr1 | 1 | G | A | BAD | 0.1 | . |
chr1 | 1 | G | A | GOOD | . | 0.2 |
chr1 | 2 | G | T | BAD;GOOD | . | 0.1 |
chr1 | 2 | G | T | GOOD;BAD | 0.5 | . |
Suggest pivoting with only CHROM, POS, REF, ALT. Then add metadata columns back with a left join. Columns from the right dataset in the join should be merged on CHROM, POS, REF, ALT.
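The pivot-then-join suggestion can be sketched as follows; a minimal pure-Python sketch of the idea (names and the sorted `;` join are illustrative assumptions):

```python
def pivot_then_join(rows, meta_field="meta1"):
    """Pivot AF on (CHROM, POS, REF, ALT) only, then left-join metadata
    that was itself merged per variant key, so each variant is one row
    regardless of per-sample metadata differences."""
    key = lambda r: (r["CHROM"], r["POS"], r["REF"], r["ALT"])
    pivot, meta = {}, {}
    for r in rows:
        k = key(r)
        pivot.setdefault(k, {})[r["sample"]] = r["AF"]
        meta.setdefault(k, set()).add(r[meta_field])
    out = []
    for k, samples in pivot.items():
        chrom, pos, ref, alt = k
        rec = {"CHROM": chrom, "POS": pos, "REF": ref, "ALT": alt,
               meta_field: ";".join(sorted(meta[k]))}
        rec.update(samples)  # one column per sample
        out.append(rec)
    return out
```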
We should expect to always have a "sample" column in our data file(s) unless it is a pivoted table, but we have no expectations for the index/key file to have specific columns. I think it currently relies on the columns "accession" and "status" being present? So we really need two flags instead of just `--column`, or `--column` needs to take a parsed value like `colname1=colname2`.
Since our index/key table in `samples.xlsx` could look like this:
Sequencing ID | clinical_trial name |
---|---|
sample1 | SAM0001 |
sample2 | SAM0002 |
... | ... |
In which case we would need to run the script as such:
sample_indexer -f data.tsv -i samples.xlsx -c "Sequencing ID=clinical_trial name" -o converted.tsv
# or
sample_indexer -f data.tsv -i samples.xlsx --from "Sequencing ID" --to "clinical_trial name" -o converted.tsv
The output of `depthgauge` and mucor3's `AF.tsv` file are off by one from each other. This is likely due to not accounting for the 1-based coordinate system of a VCF vs. the 0-based coordinate system of a BAM. These should be reported in one system or the other, but not both, as mixing them may make downstream analysis more difficult.
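Whichever convention is chosen, the conversion itself is a single offset; a trivial sketch (function names are illustrative):

```python
def vcf_pos_to_zero_based(pos):
    """Convert a 1-based VCF POS to a 0-based coordinate, as used for
    BAM-style offsets; the reverse direction adds 1."""
    return pos - 1

def zero_based_to_vcf_pos(pos0):
    """Convert a 0-based coordinate back to a 1-based VCF POS."""
    return pos0 + 1
```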
`AF.tsv` is as follows:
chrEBV_B_95_8_Raji 25 C G 2 0.015037593984962405 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0.88 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0.084 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chrEBV_B_95_8_Raji 28 A C 1 0.007518796992481203 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0.667 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chrEBV_B_95_8_Raji 47 C T 2 0.015037593984962405 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0.667 . . . . . . . . . . . . . . . . . . . .
chrEBV_B_95_8_Raji 53 C A 1 0.007518796992481203 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0.244 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chrEBV_B_95_8_Raji 72 G GATCGTCT 1 0.007518796992481203 . . . . . . . .
DP from `depthgauge` is as follows:
chrEBV_B_95_8_Raji 26 C G 2 0.015037593984962405 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5955 0 0 0 0 0 0 0 0 1087 0 0 0 0 1653 0 0 0 0 0 0 0 0 1211 0 0 640 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1245 0 156 0 0 0 0 899 1128 1640 0 104 6 0 0 0 33 701 0 0 0 16252 0 36 3098 0 0 0 8334 0 0 0 0 0 0 0 1 0 0 1 551 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 331 0
chrEBV_B_95_8_Raji 29 A C 1 0.007518796992481203 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5956 0 0 0 0 0 0 0 0 1087 0 0 0 0 1670 0 0 0 0 0 0 0 0 1211 0 0 640 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1245 0 156 0 0 0 0 899 1128 1639 0 104 6 0 0 0 33 841 0 0 0 16327 0 36 3110 0 0 0 8334 0 0 0 0 0 0 0 1 0 0 1 551 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 331 0
chrEBV_B_95_8_Raji 48 C T 2 0.015037593984962405 1 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5950 0 0 0 0 0 0 0 0 1236 0 0 0 0 1693 0 2513 0 0 0 0 0 0 1211 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1245 0 156 0 0 0 0 899 1132 1576 0 28397 6 0 0 0 34 896 0 0 0 17022 0 36 8758 0 0 0 8335 0 0 0 0 0 0 0 1 0 2 1 551 0 0 0 34 0 0 0 0 0 0 0 0 0 0 0 0 21620 662 0
chrEBV_B_95_8_Raji 54 C A 1 0.007518796992481203 1 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5943 0 0 0 0 0 0 0 0 1236 0 0 0 0 1695 0 2513 0 0 0 0 0 0 1211 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1245 0 156 0 0 0 0 899 1132 342 0 28477 6 0 0 0 34 897 0 0 1572 17022 0 36 8765 0 0 0 8336 0 0 0 0 0 0 0 1 0 2 1 551 0 0 0 34 0 0 0 0 0 0 0 0 0 0 0 0 21635 664 0
chrEBV_B_95_8_Raji 73 G GATCGTCT 1 0.007518796992481203 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 62 1236 0 0 0 0 1676 0 2513 0 0 0 0 0 0 1211 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 156 0 0 0 0 899 1128 342 0 28588 3 0 7016 0 18 898 0 0 1577 17022 0 36 14400 0 0 0 8335 0 0 0 0 0 0 0 1 0 2 1 550 0 0 0 34 0 0 0 0 0 0 0 0 0 0 0 0 21649 663 0
Is there a way to increase the speed of `depthgauge` besides increasing the thread count? When running on ~100 viral samples it takes around a day and a half to complete, and it seems to run more slowly as the number of samples increases. In the beginning it quickly works through the first 50 or so, but then slows to roughly 1 sample per ~40 minutes toward the end of processing.
With threads set to 4, ~190 samples took more than 3 days to finish. So the expected average for calculating the depth of viral samples is around ~40 minutes per sample.
Appears that filters could be getting duplicated during atomization
Currently we drop sample records that have GT `./.` but not GT `0/0`.
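The current drop rule can be sketched as a predicate on the GT string; a pure-Python sketch with an illustrative function name, keeping homozygous-reference calls:

```python
def keep_sample_record(gt):
    """Drop only fully missing genotypes ('./.' or '.|.');
    keep homozygous-reference calls like '0/0'."""
    alleles = gt.replace("|", "/").split("/")
    return not all(a == "." for a in alleles)
```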