
Comments (10)

AndrewCarroll commented on May 17, 2024

Hi @mason-linscott

Thank you for the report and for using DeepConsensus. When dealing with questions of base/read quality, one issue to discuss is the difference between empirical (true) quality and estimated quality. Ideally, a program should calibrate its quality estimates exactly to the true underlying quality, but this isn't always easy to do. So one general question is: are you seeing lower Q30 values because the empirical quality is lower (one sort of issue), or because the DeepConsensus v1.1 model is more conservative at the upper range of quality in certain genome contexts you are encountering, so the difference shows up only in the estimated quality (a different sort of issue)?
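For context, estimated and empirical quality are both expressed on the Phred scale, where Q = -10·log10(error probability). A minimal, self-contained sketch of the conversion (the 1-in-500 error rate below is an illustrative assumption, not data from this thread):

```python
import math

def phred_q(error_prob):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(error_prob)

def error_prob(q):
    """Convert a Phred quality score back to an error probability."""
    return 10 ** (-q / 10)

# An estimated Q30 base implies a 1-in-1000 error rate. If the observed
# (empirical) error rate in some region were actually 1-in-500, the model
# would be over-confident there by ~3 QV points:
#   phred_q(0.001)   -> ~30 (estimated)
#   phred_q(1 / 500) -> ~27 (empirical)
```
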

I assume the current quality distributions you have are from the estimated confidence from DeepConsensus outputs. Do you have any orthogonal measure (e.g. kmer analysis or a reasonable assembly) that might be informative with respect to the empirical quality of the data?

One difference between v0.3 and v1.1 is that we train DeepConsensus with the T2T genome, which extends farther into difficult parts of the human genome than the HG002-based training used for v0.3. In both cases, we do try to exclude certain regions that are not mappable or where we can't be entirely confident in the assembly. All of the feedback we have received on empirical quality in human and non-human species does indicate that the T2T training is somewhat better.

It looks like your read qualities at the top end of the distribution are about 1-2 QV points lower in v1.1, with the Q20-Q25 range rescuing a few more reads. My first instinct is that the v1.1 model had more exposure to harder regions during training and may have learned to be a bit more conservative in estimating qualities in those regions.

To confirm or reject that possibility, I think we'd need some other estimate of empirical quality or of the calibration between empirical and estimated qualities in these regions.

Thank you,
Andrew

from deepconsensus.

mason-linscott commented on May 17, 2024

Hi Andrew,

Thank you for your detailed response; I believe I understand your reasoning.

You are correct that the quality distributions I reported are the estimated confidences from DeepConsensus outputs. Unfortunately, we do not have a reasonably complete genome assembly for this species or (if I am understanding your kmer analysis point) whole-genome Illumina short reads to enable a comparison through kmer analysis. The only other data we have available for this species is 1.6 billion paired-end reads of Hi-C data prepared from the same individual. However, it is my understanding that the RE digestion step in our Hi-C prep (Proximo kit from Phase Genomics) results in biased coverage across different portions of the genome, making it unsuitable for many forms of kmer analysis.

I had a similar gut feeling that the DeepConsensus v1.1 model, having been trained on the T2T genome, is more conservative in its base quality estimates for repetitive regions, which compose most of my land snail genome.

As I do not have an empirical measure for comparison, I plan to attempt a second genome assembly using DeepConsensus v1.1 outputs and compare it to our assembly that used DeepConsensus v0.3 outputs. If you have any other advice, I would appreciate it. Otherwise, I consider this issue closed.

If you or other members of the DeepConsensus team are interested in the v0.3 vs. v1.1 genome assembly comparison, I can post the results here in a couple of weeks' time.
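(Editor's aside: contiguity statistics such as N50 are the usual first-pass numbers in an assembly comparison like this. As a reference for what N50 means, a minimal sketch; dedicated tools such as QUAST report this and much more.)

```python
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L together
    cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# Example: lengths [100, 80, 50, 30, 20] sum to 280; the two longest
# contigs (100 + 80 = 180) cover more than half, so N50 = 80.
```
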

All the best and thank you for your help,
Mason


AndrewCarroll commented on May 17, 2024

Hi @mason-linscott

Thank you for the fast reply. You understand me correctly when asking about either well-assembled reference or Illumina reads for kmers. Given what you have available, I think a comparison between assemblies does make sense and we'd be quite interested if you can post the statistics when you have them. Comparing assembly quality can be complicated, but it's likely the best we can do in this situation.

Again, we appreciate the feedback and your effort in detailing the issue and analyzing the data.


Wenfei-Xian commented on May 17, 2024

Hi Mason,
A naive question: how did you get the v0.3.qchist.txt and v1.1.qchist.txt files?
Do you know which reads are Q20 and which reads are Q30?


mason-linscott commented on May 17, 2024

Hi Wenfei,

I calculated the quality histograms with BBMap's reformat.sh script using the 'aqhist' flag. The same tool can be used to filter on average read quality using the 'trimq' flag.

Hope this helps!

I will post my deepC 0.3 vs. 1.1 genome stats soon.


Wenfei-Xian commented on May 17, 2024

Many thanks !!!


pichuan commented on May 17, 2024

I'll close this issue. Feel free to open it again if you have more questions, and please feel free to share any updates later. Thanks!


pxxiao-hz commented on May 17, 2024

I also had this problem with plants. I counted the quality values in the fastq files from pbccs and DeepConsensus, as shown in the figure below.
I am puzzled by this result: the paper states that Q30 should increase, but my results do not show this.
Is there something wrong with my process?
I'm looking forward to your answer, thanks.
[figure: ccs_vs_deep, quality-value comparison between pbccs and DeepConsensus]


AndrewCarroll commented on May 17, 2024

Hi @pxxiao-hz

The base quality outputs of pbccs and DeepConsensus are predicted values from each type of model. The degree to which predicted confidence reflects the real or empirical accuracy is called calibration.

We have looked at the calibration of pbccs and DeepConsensus, with a detailed analysis here. In short, pbccs is overconfident in its predicted values when it predicts with confidence Q20+. pbccs is less conservative and predicts far more of its bases at Q93 (as you also see in your plot). DeepConsensus is well-calibrated up to a quality of Q35.

Note that the most important place to be well-calibrated is at Q20, as that value is the cutoff for filtering a HiFi read (and both pbccs and DeepConsensus are reasonably calibrated at this point).
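To make "calibration" concrete: given aligned bases with a predicted quality and a known error/match outcome (e.g. from alignment to a trusted truth set), the empirical quality per predicted-quality bin can be computed roughly like this (a sketch, not the pipeline used in the linked analysis):

```python
import math
from collections import defaultdict

def calibration_table(base_calls, bin_size=5):
    """base_calls: iterable of (predicted_q, is_error) pairs, one per base.
    Returns {bin_start: empirical_q}. A well-calibrated model yields an
    empirical_q close to each bin's predicted quality."""
    bins = defaultdict(lambda: [0, 0])  # bin -> [n_errors, n_bases]
    for q, is_error in base_calls:
        b = int(q // bin_size) * bin_size
        bins[b][0] += int(is_error)
        bins[b][1] += 1
    table = {}
    for b, (n_err, n) in sorted(bins.items()):
        p = (n_err + 0.5) / (n + 1)  # pseudocount keeps zero-error bins finite
        table[b] = -10 * math.log10(p)
    return table
```

An overconfident caller shows empirical_q well below the bin's predicted quality; a conservative one shows the opposite.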

Our conclusion from this is that DeepConsensus should still be producing more accurate reads; it is just being more realistic about the likely quality of the bases (and as a result less over-optimistic) relative to pbccs.


pxxiao-hz commented on May 17, 2024

Thank you for your prompt answer, @AndrewCarroll.
I will use these reads for relevant downstream analysis.

