I've been having results lately that make my negatives appear to be contaminated (i.e.

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Possible filtering issue about porecov HOT 5 CLOSED

tb-fyr commented on June 29, 2024

Possible filtering issue

from porecov.

Comments (5)

replikation commented on June 29, 2024 2

Hi @tb-fyr ,

thanks for digging into this issue. I compared the releases from a code perspective and there are 2 changes that should affect this between the versions.

min length default settings

The default settings for the read min length was reduced based on an issue raised here
- currently read_min_length = 100
- previously 350 for V1-V4 and 500 for V1200
this can be easily tested by using poreCov version 1.1.0 and forcing the previous length filters of 0.11.0 with:
- --minLength 500 --maxLength 1500
- if this was the issue 1.1.0 should have the same results as 0.11.0 with the above option flags

barcode mid strand detection

We added barcode mid strand detection to the demultiplexing settings --detect_mid_strand_barcodes --min_score_mid_barcodes 50
this might cause reads assigned to the wrong barcodes? if you ruled out the minLength part I could create a branch for you with this part removed for testing

Code Source Comparision: 0.11.0...1.1.0

from porecov.

replikation commented on June 29, 2024

thanks for reporting. currently, it's difficult to reproduce this on my end without your data as an example. e.g. different demultiplexing (barcode both ends, mid strand detection...) and basecalling settings (q score cutoffs - used model) might change already the number of reads within a fastq file

from porecov.

tb-fyr commented on June 29, 2024

Here are our two negative barcodes from one experiment but two separate versions of porecov basecalling (I was getting an error trying to add any more files)
barcode01_0-11-0porecov.fastq.gz
barcode01_1-1-0porecov.fastq.gz
I use default params for demultiplexing using Midnight Kit (mid strand detection at 50) and default basecalling settings with guppy model: dna_r9.4.1_450bps_fast.cfg
I've ran these both multiple times and I get the same result. Under 0.11.0 porecov, water doesn't yield any results or show up in summary report.
Under 1.1.0, water shows as having omicron. That is where I compared the nanoplot and noticed the added reads.

I have confirmed all these same experiments in EPI2ME and in all our march runs, if I run the data under 0.11.0, our results are clean; but if I run them under 1.1.0 then I get different results. Taking the exact same samples and basecalling them on our MinION MK1C, everything is clean no matter what version of porecov.

Also, here are two screenshots from EPI2ME output for the same experiments the above fastq's come from (again 0.11.0 vs 1.1.0). I noticed with the 1.1.0 result (the second png image), that the distribution of coverage and N distribution seems weird to me, as compared to the clean 0.11.0 version.

Thanks for any help you can provide!

from porecov.

tb-fyr commented on June 29, 2024

Hello,
Adding the min and max length did clear up our negatives as expected and confirmed in our other analyses with same data so it appears that worked. However, it did result in less samples being recognized it appears. Our pass/fail QC summary was out of 80 samples, rather than 96 on our original 0.11.0 run.
Also, out of curiosity what was the reasoning for changing read min length to 100 for all schemes?

I hope the mid strand detection parameter wouldn't be assigning wrong barcodes, as this is recommended setting from Oxford nanopore, however, during my troubleshooting I did look at the barcode summary for our negatives once and was identifying a few barcodes that weren't the negative barcodes (out of like a million reads though). I've heard barcode hopping is possible with oxford tech but that shouldn't be enough reads to change the result, I would hope...

Also, just to confirm my understanding, what determines that a sample doesn't get added to the report summary? I've been assuming it just didn't pass all QC filters? For example, you have 48 samples but the report summary shows 35/40.

from porecov.

replikation commented on June 29, 2024

As the whole "covid assembly" is just based around mapping we experienced a few false-positive samples due to barcode hopping. We discussed this in our manuscript too (we called it "sample or barcode bleeding") mostly attributed a barcode pool with at least one sample being overloaded (thus lots of "free" DNA) and at least one sample completely underloaded with DNA (lots of free barcodes). As 20x times coverage on the small genome is enough, a few thousand reads already get you a signal. The min read length reduction was included as was increasing for some experiments the coverage quite noticeable and thus improve genome reconstruction and subsequently, way more strains being recognized or "QC-valid" based on completeness. It seemed to improve results more generally speaking as i got quite a few emails about this :)

In the latest porecov version a read screen method is included via --screen_reads this tries to determine the pangolineage of every read. (its by default of as it takes some hours per sample). But you could use that to investigate how "clean" the barcodes you are using are. btw since the tool was developed for accurate short reads expect some noise depending on the accuracy of the basecaller you are using. for sup basecalled reads we identify other sars-cov lineages with around 0 - 1%. a "clean" sample should be at least 95 % of a lineage. results look like this:

sample	variant_group	proportion	mean	std_error
SARSCoV2_filtered	A.23.1	0	0.0	0.0
SARSCoV2_filtered	AV.1	0.02107885561393124	0.02151797843491451	0.005105415907809195
SARSCoV2_filtered	Alpha_B.1.1.7	0.008255186842329024	0.008313999542467122	0.0026089935944567525
SARSCoV2_filtered	B.1	4.677849064695906e-10	8.876766030813688e-10	1.2800009054288062e-09
SARSCoV2_filtered	B.1.1	1.1621285428771163e-10	5.754328413083417e-10	9.877091767211538e-10
SARSCoV2_filtered	B.1.1.318	0	0.0	0.0
SARSCoV2_filtered	B.1.177	0.7024952547613489	0.7044517999333074	0.020328715359556118
SARSCoV2_filtered	B.1.617.3	0	0.0	0.0
SARSCoV2_filtered	B.1.623	0	0.0	0.0
SARSCoV2_filtered	Beta_B.1.351	0.2556522779374451	0.2536694420134167	0.021667675821820425
SARSCoV2_filtered	Delta_B.1.617.2	0.0004445407879727177	0.00048448068661852194	0.0004949322336048902
SARSCoV2_filtered	Epsilon_B.1.427_429	0	0.0	0.0
SARSCoV2_filtered	Eta_B.1.525	0	2.0539993885939775e-11	9.01935044212674e-11
SARSCoV2_filtered	Gamma_P.1	0.0018766985950762869	0.0016466757712217091	0.002038137642675043
SARSCoV2_filtered	Iota_B.1.526	0	3.659593917460922e-10	6.485911094918895e-10
SARSCoV2_filtered	Kappa_B.1.617.1	0	7.167376434181761e-10	1.0773584061104731e-09
SARSCoV2_filtered	Lambda_C.37	0	0.0	0.0
SARSCoV2_filtered	Mu_B.1.621	0.0008949069846917494	0.0008868070308837934	0.0008862490897717564
SARSCoV2_filtered	Omicron_BA.1	0	3.184555144123045e-05	0.00022817950467739884
SARSCoV2_filtered	Omicron_BA.2	0.0038189007877782737	0.003559979687520783	0.0011243997285608296
SARSCoV2_filtered	Omicron_BA.3	0	9.689481363893597e-05	0.0005993846545283763
SARSCoV2_filtered	Theta_P.3	0.001851423019144237	0.0021062543590370003	0.0019484222501466196
SARSCoV2_filtered	Zeta_P.2	0.0036319880617552086	0.0032338856626401786	0.0028578754111950678

e.g. this sample has B.1.177 70% and Beta_B.1.351 25% might help to checkout whats going on in a sample.

from porecov.

Possible filtering issue about porecov HOT 5 CLOSED

Comments (5)

min length default settings

barcode mid strand detection

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs