Comments (10)
Hi, thanks for the response. Is a VCF of the LRGT set available?
I don't see svtyper deletions on the GiaB set in the supplement, only Manta and Paragraph in Table S2.
On lumpy output (where svtyper genotypes result in ~90% precision and 84% recall), Paragraph gives only 20% recall and 98% precision.
I will likely leave this for now; feel free to close the issue. It would have saved me some time to know that Paragraph currently won't work with lumpy output (due to lack of precision in breakpoints?).
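For readers comparing the numbers above, here is a minimal sketch of how precision and recall are derived from true-positive, false-positive and false-negative counts. The function name and the counts are hypothetical and chosen only for illustration; they are not the counts from this evaluation.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from TP, FP and FN counts."""
    # Precision: fraction of reported calls that match the truth set.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of truth-set events that were reported.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts illustrating a high-precision, low-recall genotyper.
p, r = precision_recall(tp=200, fp=4, fn=800)
print(f"precision={p:.2f} recall={r:.2f}")
```

A genotyper can look very precise while still missing most events, which is why both numbers are quoted together in this thread.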
from paragraph.
and here is the source of add_ci to get around svtyper requiring CIPOS and CIEND and a header parsing bug in svtyper:
import hts/vcf

var ivcf:VCF
var ovcf:VCF

# read the input VCF from stdin
if not open(ivcf, "/dev/stdin"):
  quit "bad"

# svtyper requires CIPOS and CIEND, so declare them in the header
doAssert ivcf.header.add_info("CIPOS", "2", "Integer", "ci") == Status.OK
doAssert ivcf.header.add_info("CIEND", "2", "Integer", "ci") == Status.OK
doAssert ivcf.header.add_info("CIPOS95", "2", "Integer", "ci") == Status.OK
doAssert ivcf.header.add_info("CIEND95", "2", "Integer", "ci") == Status.OK

# remove these INFO fields from the header (works around the svtyper header parsing bug)
doAssert ivcf.header.remove_info("MultiTechExact") == Status.OK
doAssert ivcf.header.remove_info("MultiTech") == Status.OK
doAssert ivcf.header.remove_info("DistPASSHG2gt49Minlt1000") == Status.OK

if not open(ovcf, "with-ci.vcf", mode="w"):
  quit "bad"

ovcf.copy_header(ivcf.header)
doAssert ovcf.write_header()

# the same fixed +/- 2 bp interval is written for every variant
var ci = @[-2'i32, 2]

for v in ivcf:
  doAssert v.info.set("CIPOS", ci) == Status.OK
  doAssert v.info.set("CIEND", ci) == Status.OK
  doAssert v.info.set("CIPOS95", ci) == Status.OK
  doAssert v.info.set("CIEND95", ci) == Status.OK
  # drop the removed fields from each record as well
  discard v.info.delete("MultiTech")
  discard v.info.delete("MultiTechExact")
  discard v.info.delete("DistPASSHG2gt49Minlt1000")
  doAssert ovcf.write_variant(v)

ovcf.close()
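For anyone without hts-nim installed, the CI-adding part of the script above can be approximated in plain Python. This is a rough sketch, not a port: it works on VCF text lines with the standard library only, it does not remove the MultiTech fields, and it assumes the input has no existing CIPOS/CIEND entries. The function name `add_ci` mirrors the Nim tool; the demo record is hypothetical.

```python
import io

CI_KEYS = ("CIPOS", "CIEND", "CIPOS95", "CIEND95")


def add_ci(lines, ci=(-2, 2)):
    """Yield VCF lines with CI header definitions and fixed CI INFO values added."""
    ci_text = ",".join(str(x) for x in ci)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#CHROM"):
            # Declare the new INFO fields just before the column header line.
            for key in CI_KEYS:
                yield f'##INFO=<ID={key},Number=2,Type=Integer,Description="ci">'
            yield line
        elif line.startswith("#") or not line:
            # Other header lines (and blank lines) pass through unchanged.
            yield line
        else:
            fields = line.split("\t")
            extra = ";".join(f"{key}={ci_text}" for key in CI_KEYS)
            # INFO is the 8th column; "." means the field is empty.
            fields[7] = extra if fields[7] == "." else fields[7] + ";" + extra
            yield "\t".join(fields)


# Hypothetical one-record VCF, used only to show the transformation.
demo = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=1500\n"
)
out = list(add_ci(demo))
print("\n".join(out))
```

Working on raw text keeps the sketch dependency-free; a real pipeline would more likely use a VCF library so that header and record stay validated against each other.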
from paragraph.
Hi Brent,
Thanks for looking into this!
For the recall of Paragraph, we got similar results to yours on the NIST truth
set v0.6 (see supplementary materials). The NIST truth set is not guaranteed
to be breakpoint accurate, so some Paragraph FNs are likely due to inaccurate
breakpoints.
We also observed that svtyper has good recall for >300bp deletions (Fig 2a),
but for smaller deletions its performance appears to drop sharply. Since a large
fraction of deletions are smaller than 300bp (Fig 2b), we get a much lower
estimate for the overall recall (Table 1).
I'd say the test set matters too. NIST tier1 contains only confident regions of
the genome. And deletion size is important when making such comparisons.
And yes, we did add CIPOS and CIEND when evaluating svtyper.
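To make the size argument concrete, here is a small sketch of how overall recall is pulled toward the recall of whichever size bin holds the most events. The per-bin counts are hypothetical and are not taken from the paper's figures or tables.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of truth-set events that were recovered."""
    return tp / (tp + fn)


# Hypothetical (tp, fn) counts per deletion-size bin; not from the paper.
bins = {"<300bp": (300, 700), ">=300bp": (270, 30)}

per_bin = {name: recall(tp, fn) for name, (tp, fn) in bins.items()}
total_tp = sum(tp for tp, _ in bins.values())
total_fn = sum(fn for _, fn in bins.values())
overall = recall(total_tp, total_fn)

for name, value in per_bin.items():
    print(f"{name}: recall={value:.2f}")
print(f"overall: recall={overall:.2f}")
```

Because the small-deletion bin has more events in this made-up example, the overall figure lands much closer to its recall than to the large-deletion recall, which is the effect described above.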
from paragraph.
Hey @brentp
Thanks for looking at this. Always interesting to see how things run in other people's hands.
Our calls are here: https://github.com/Illumina/paragraph/blob/master/data/download-instructions.txt
Would be interesting to know why the calls from Lumpy are showing reduced performance.
Thanks
Fritz
from paragraph.
hi Fritz, thanks for the reply. If you make the cram/bam that you used available, I'll be glad to retry the evaluation on those variants+alignments.
from paragraph.
You mean the PacBio data?
Or the Illumina reads?
from paragraph.
I mean the Illumina HG002 ~35X sample.
from paragraph.
@brentp Yes, the HG002 long-read ground truth (LRGT) is available in the data/ directory. Minimum event length is 30bp, and it covers del + ins + inv + dup. Note that in the paper we only used del and ins events of 50bp to 10kbp for evaluation.
In Table S2 we only tested Manta and Paragraph. We didn't test everything there because that comparison had already been done on LRGT.
It's interesting to see such a different recall on lumpy calls. We have never tested our method on lumpy calls before. For now, my guess is that it's mostly because of breakpoints, as the paired-end (PE) method is unlikely to achieve base-pair accuracy. But we're going to double-check.
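To reproduce the subset described above (del and ins events of 50bp to 10kbp), a filter along these lines could be applied to a truth VCF. This is only a sketch under assumptions: it assumes each record carries `SVTYPE` plus either `SVLEN` or `END` in INFO, and that both size bounds are inclusive. Neither assumption is confirmed in this thread, and the helper names and demo records are hypothetical.

```python
def parse_info(text: str) -> dict:
    """Split a VCF INFO string into a key -> value dict (flags map to '')."""
    info = {}
    for item in text.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info


def sv_length(info: dict, pos: int) -> int:
    """Event length from SVLEN if present, otherwise END - POS."""
    if "SVLEN" in info:
        return abs(int(info["SVLEN"].split(",")[0]))
    return int(info["END"]) - pos


def keep(line: str, min_len: int = 50, max_len: int = 10_000,
         types: tuple = ("DEL", "INS")) -> bool:
    """True if the record is a DEL/INS whose length is within [min_len, max_len]."""
    fields = line.rstrip("\n").split("\t")
    info = parse_info(fields[7])
    if info.get("SVTYPE") not in types:
        return False
    return min_len <= sv_length(info, int(fields[1])) <= max_len


# Hypothetical records, for illustration only.
records = [
    "1\t1000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=1500",
    "1\t2000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=2030",
    "1\t3000\t.\tN\t<INV>\t.\tPASS\tSVTYPE=INV;END=4000",
    "1\t5000\t.\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=120",
]
kept = [keep(r) for r in records]
print(kept)
```

A record with neither `SVLEN` nor `END` would raise a `KeyError` here; that is left unhandled on purpose so malformed input is noticed rather than silently dropped.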
from paragraph.
@traxexx I think he is asking for the Illumina reads that we used. I don't know where they are currently hosted.
from paragraph.
@brentp that's not public yet. We will eventually make it public. For now, please send me an email at [email protected] and I'll share the BAM with you via BaseSpace.
from paragraph.