Comments (10)
Thank you for the report and for using DeepConsensus. When dealing with questions of base/read quality, one issue to discuss is the difference between empirical (true) quality and estimated quality. Ideally, a program would calibrate its quality estimates exactly to the true underlying quality, but this isn't always easy to do. So one general question is: are you seeing lower Q30 values because the empirical quality is lower (one sort of issue), or because the DeepConsensus v1.1 model is more conservative at the upper range of quality in certain genome contexts in your data, so that the difference is only in estimated quality (a different sort of issue)?
I assume the current quality distributions you have are from the estimated confidence from DeepConsensus outputs. Do you have any orthogonal measure (e.g. kmer analysis or a reasonable assembly) that might be informative with respect to the empirical quality of the data?
One difference between v0.3 and v1.1 is that we trained DeepConsensus v1.1 on the T2T genome, which extends farther into difficult parts of the human genome than the HG002-based training used for v0.3. In both cases, we do try to exclude certain regions which are not mappable or where we can't be entirely confident in the assembly. All of the feedback we have received on empirical quality in human and non-human species does indicate that the T2T training is somewhat better.
It looks like your read qualities at the top end of the distribution are about 1-2 QV points lower in v1.1, while the Q20-Q25 range rescues a few more reads in v1.1. My first instinct is that the v1.1 model, having seen harder regions during training, may have learned to be a bit more conservative in estimating qualities in those regions.
To confirm or reject that possibility, I think we'd need some other estimate of empirical quality or of the calibration between empirical and estimated qualities in these regions.
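To make that concrete, a calibration check of this kind usually means binning bases by their predicted quality and comparing the predicted value to the empirical error rate in each bin. Here is a minimal Python sketch of that idea (my own illustration with hypothetical inputs, not part of DeepConsensus; it assumes you already know, from a truth alignment, which bases are errors):

```python
import math
from collections import defaultdict

def calibration_table(pred_quals, is_error, bin_width=5):
    """Group bases by predicted Phred quality and report the
    empirical quality observed in each bin.

    pred_quals: predicted Phred quality per base
    is_error:   True where the base disagrees with the truth alignment
    """
    bins = defaultdict(lambda: [0, 0])  # bin -> [n_bases, n_errors]
    for q, err in zip(pred_quals, is_error):
        b = (q // bin_width) * bin_width
        bins[b][0] += 1
        bins[b][1] += int(err)
    table = {}
    for b, (n, e) in sorted(bins.items()):
        # Empirical Phred quality; use half an error as a floor so
        # a bin with zero observed errors still gets a finite QV.
        emp_q = -10 * math.log10(max(e, 0.5) / n)
        table[b] = (n, round(emp_q, 1))
    return table

# Toy example: 1000 bases predicted at Q30 with 1 observed error
# gives an empirical QV of 30, i.e. well calibrated in that bin.
print(calibration_table([30] * 1000, [True] + [False] * 999))
```

A model that is conservative in hard regions would show empirical QVs consistently above the predicted bin values there, which is the signature to look for.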
Thank you,
Andrew
from deepconsensus.
Hi Andrew,
Thank you for your detailed response and I believe I understand your reasoning.
You are correct that the quality distributions I reported are the estimated confidence from DeepConsensus outputs. Unfortunately, we do not have a reasonably well-assembled genome for this species or (if I am understanding your kmer analysis point) whole-genome Illumina short reads to enable a comparison through kmer analysis. The only other data we have available for this species is 1.6 billion paired-end reads of Hi-C data prepared from the same individual. However, it is my understanding that the RE digestion step in our Hi-C prep (Proximo kit from Phase Genomics) results in biased coverage across the genome, making the data unsuitable for many forms of kmer analysis.
I had a similar gut feeling that the DeepConsensus v1.1 model, due to training on the T2T genome, is more conservative in its base quality estimates for repetitive regions, which make up most of my land snail genome.
As I do not have an empirical measure for comparison, I plan to attempt a second genome assembly using the DeepConsensus v1.1 output and compare it to our assembly built from the DeepConsensus v0.3 output. If you have any other advice, I would appreciate it. Otherwise, I consider this issue closed.
If you or other members of the DeepConsensus team are interested in the v0.3 vs. v1.1 genome assembly comparison, I can post the results here in a couple of weeks' time.
All the best and thank you for your help,
Mason
Thank you for the fast reply. You understand me correctly when asking about either a well-assembled reference or Illumina reads for kmers. Given what you have available, I think a comparison between assemblies does make sense, and we'd be quite interested if you can post the statistics when you have them. Comparing assembly quality can be complicated, but it's likely the best we can do in this situation.
Again, we appreciate the feedback and your effort in detailing the issue and analyzing the data.
Hi Mason,
A naive question: how did you get the v0.3.qchist.txt and v1.1.qchist.txt files?
Do you know which reads are Q20 and which reads are Q30?
Hi Wenfei,
I calculated the quality histograms with BBMap's reformat.sh script using the 'aqhist' flag. The same tool can also filter or trim reads on quality (see its 'minavgquality' and 'trimq' flags).
Hope this helps!
I will post my deepC 0.3 vs. 1.1 genome stats soon.
Many thanks !!!
I'll close this issue. Feel free to reopen it if you have more questions, and please feel free to share any updates later. Thanks!
I also had this problem with plants. I counted the quality values in the fastq files of pbccs and deepconsensus, as shown in the figure below.
I am puzzled by this result: the paper states that the Q30 yield should increase, but in my case it did not.
Is there something wrong with my process?
I'm looking forward to your answer, thanks.
Hi @pxxiao-hz
The base quality outputs of pbccs and DeepConsensus are predicted values from each type of model. The degree to which predicted confidence reflects the real or empirical accuracy is called calibration.
We have looked at the calibration of pbccs and DeepConsensus, with a detailed analysis here. In short, pbccs is overconfident at predicted qualities of Q20 and above: it is less conservative and places far more of its bases at Q93 (as you also see in your plot). DeepConsensus is well calibrated up to about Q35.
Note that the most important place to be well calibrated is at Q20, as that value is the cutoff for filtering a HiFi read (and pbccs and DeepConsensus are both reasonably calibrated at this point).
Our conclusion from this is that DeepConsensus should still be producing more accurate reads; it is just more realistic about the likely quality of the bases (and as a result less over-optimistic) than pbccs.
Thank you for your prompt answer, @AndrewCarroll.
I will use these reads for relevant downstream analysis.