Comments (5)
Hi @tb-fyr ,
thanks for digging into this issue. I compared the releases from a code perspective and there are 2 changes that should affect this between the versions.
min length default settings
- The default settings for the read min length was reduced based on an issue raised here
- currently read_min_length = 100
- previously 350 for V1-V4 and 500 for V1200
- this can be easily tested by using poreCov version 1.1.0 and forcing the previous length filters of 0.11.0 with:
--minLength 500 --maxLength 1500
- if this was the issue 1.1.0 should have the same results as 0.11.0 with the above option flags
barcode mid strand detection
- We added barcode mid strand detection to the demultiplexing settings
--detect_mid_strand_barcodes --min_score_mid_barcodes 50
- this might cause reads assigned to the wrong barcodes? if you ruled out the minLength part I could create a branch for you with this part removed for testing
Code Source Comparision: 0.11.0...1.1.0
from porecov.
thanks for reporting. currently, it's difficult to reproduce this on my end without your data as an example. e.g. different demultiplexing (barcode both ends, mid strand detection...) and basecalling settings (q score cutoffs - used model) might change already the number of reads within a fastq file
from porecov.
Here are our two negative barcodes from one experiment but two separate versions of porecov basecalling (I was getting an error trying to add any more files)
barcode01_0-11-0porecov.fastq.gz
barcode01_1-1-0porecov.fastq.gz
I use default params for demultiplexing using Midnight Kit (mid strand detection at 50) and default basecalling settings with guppy model: dna_r9.4.1_450bps_fast.cfg
I've ran these both multiple times and I get the same result. Under 0.11.0 porecov, water doesn't yield any results or show up in summary report.
Under 1.1.0, water shows as having omicron. That is where I compared the nanoplot and noticed the added reads.
I have confirmed all these same experiments in EPI2ME and in all our march runs, if I run the data under 0.11.0, our results are clean; but if I run them under 1.1.0 then I get different results. Taking the exact same samples and basecalling them on our MinION MK1C, everything is clean no matter what version of porecov.
Also, here are two screenshots from EPI2ME output for the same experiments the above fastq's come from (again 0.11.0 vs 1.1.0). I noticed with the 1.1.0 result (the second png image), that the distribution of coverage and N distribution seems weird to me, as compared to the clean 0.11.0 version.
Thanks for any help you can provide!
from porecov.
Hello,
Adding the min and max length did clear up our negatives as expected and confirmed in our other analyses with same data so it appears that worked. However, it did result in less samples being recognized it appears. Our pass/fail QC summary was out of 80 samples, rather than 96 on our original 0.11.0 run.
Also, out of curiosity what was the reasoning for changing read min length to 100 for all schemes?
I hope the mid strand detection parameter wouldn't be assigning wrong barcodes, as this is recommended setting from Oxford nanopore, however, during my troubleshooting I did look at the barcode summary for our negatives once and was identifying a few barcodes that weren't the negative barcodes (out of like a million reads though). I've heard barcode hopping is possible with oxford tech but that shouldn't be enough reads to change the result, I would hope...
Also, just to confirm my understanding, what determines that a sample doesn't get added to the report summary? I've been assuming it just didn't pass all QC filters? For example, you have 48 samples but the report summary shows 35/40.
from porecov.
As the whole "covid assembly" is just based around mapping we experienced a few false-positive samples due to barcode hopping. We discussed this in our manuscript too (we called it "sample or barcode bleeding") mostly attributed a barcode pool with at least one sample being overloaded (thus lots of "free" DNA) and at least one sample completely underloaded with DNA (lots of free barcodes). As 20x times coverage on the small genome is enough, a few thousand reads already get you a signal. The min read length reduction was included as was increasing for some experiments the coverage quite noticeable and thus improve genome reconstruction and subsequently, way more strains being recognized or "QC-valid" based on completeness. It seemed to improve results more generally speaking as i got quite a few emails about this :)
In the latest porecov version a read screen method is included via --screen_reads
this tries to determine the pangolineage of every read. (its by default of as it takes some hours per sample). But you could use that to investigate how "clean" the barcodes you are using are. btw since the tool was developed for accurate short reads expect some noise depending on the accuracy of the basecaller you are using. for sup basecalled reads we identify other sars-cov lineages with around 0 - 1%. a "clean" sample should be at least 95 % of a lineage. results look like this:
sample variant_group proportion mean std_error
SARSCoV2_filtered A.23.1 0 0.0 0.0
SARSCoV2_filtered AV.1 0.02107885561393124 0.02151797843491451 0.005105415907809195
SARSCoV2_filtered Alpha_B.1.1.7 0.008255186842329024 0.008313999542467122 0.0026089935944567525
SARSCoV2_filtered B.1 4.677849064695906e-10 8.876766030813688e-10 1.2800009054288062e-09
SARSCoV2_filtered B.1.1 1.1621285428771163e-10 5.754328413083417e-10 9.877091767211538e-10
SARSCoV2_filtered B.1.1.318 0 0.0 0.0
SARSCoV2_filtered B.1.177 0.7024952547613489 0.7044517999333074 0.020328715359556118
SARSCoV2_filtered B.1.617.3 0 0.0 0.0
SARSCoV2_filtered B.1.623 0 0.0 0.0
SARSCoV2_filtered Beta_B.1.351 0.2556522779374451 0.2536694420134167 0.021667675821820425
SARSCoV2_filtered Delta_B.1.617.2 0.0004445407879727177 0.00048448068661852194 0.0004949322336048902
SARSCoV2_filtered Epsilon_B.1.427_429 0 0.0 0.0
SARSCoV2_filtered Eta_B.1.525 0 2.0539993885939775e-11 9.01935044212674e-11
SARSCoV2_filtered Gamma_P.1 0.0018766985950762869 0.0016466757712217091 0.002038137642675043
SARSCoV2_filtered Iota_B.1.526 0 3.659593917460922e-10 6.485911094918895e-10
SARSCoV2_filtered Kappa_B.1.617.1 0 7.167376434181761e-10 1.0773584061104731e-09
SARSCoV2_filtered Lambda_C.37 0 0.0 0.0
SARSCoV2_filtered Mu_B.1.621 0.0008949069846917494 0.0008868070308837934 0.0008862490897717564
SARSCoV2_filtered Omicron_BA.1 0 3.184555144123045e-05 0.00022817950467739884
SARSCoV2_filtered Omicron_BA.2 0.0038189007877782737 0.003559979687520783 0.0011243997285608296
SARSCoV2_filtered Omicron_BA.3 0 9.689481363893597e-05 0.0005993846545283763
SARSCoV2_filtered Theta_P.3 0.001851423019144237 0.0021062543590370003 0.0019484222501466196
SARSCoV2_filtered Zeta_P.2 0.0036319880617552086 0.0032338856626401786 0.0028578754111950678
e.g. this sample has B.1.177 70% and Beta_B.1.351 25% might help to checkout whats going on in a sample.
from porecov.
Related Issues (20)
- add skip scorpio parameter to pangolin HOT 1
- Only calculate NanoPlot after read filtering step HOT 5
- Add new V5 ARTIC primer BED HOT 5
- Medaka step fails in the -profile fastq_test HOT 3
- summary_report.py fails HOT 7
- publish primersitereport from medaka output
- VarSkipV2b primer does not work as expected HOT 7
- Update Medaka to support R10.4.1 models HOT 14
- Update --help to list up-to-date primer schemes that are supported
- MinKNOW/Guppy update needs new model for R10.4.1 5 kHz HOT 6
- Warning when execution report and timeline already exists HOT 1
- The pipeline fails in artic_ncov_wf_artic_medaka HOT 7
- new pangolin table columns HOT 1
- CovarPlot fails w/ custom BED HOT 5
- Process retry in slurm profile HOT 1
- Publish VCF files HOT 2
- Update medaka to get the latest models HOT 1
- Test Freyja Update function HOT 11
- [Question] CI for Variant Calling HOT 5
- Singularity container execution of pangolin crashes with recent nextflow versions HOT 6
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from porecov.