Comments (29)
Thank you Brad!
Actually fastp only implemented polyG trimming at 3' ends, since I thought polyG is much more serious than polyA/T/C for NextSeq/NovaSeq. It's not difficult to support polyX trimming based on the polyG trimming implementation, and I will implement it, following your good suggestion.
Since the paired-end processing will output two files, I cannot find a good solution to stream the output to aligners.
Thank you again for your support and good suggestions. I will continue to improve this tool.
from fastp.
Shifu,
thank you so much -- we are one of the NovaSeq sites that would positively benefit from a fast trimming implementation, particularly if it can handle polyX sequences. A +1 on the roadmap from us.
Hi Brad and Oliver,
I'd like to ask some questions about polyX trimming:
- We know that polyG is caused by signal attenuation of SBS. When the signal is very weak, the base will be detected as G in a two-colour system like NovaSeq or NextSeq. But what causes poly A/T/C?
- If polyX trimming is implemented, do we still need polyG trimming options?
- Do you think polyX trimming should be enabled by default for preprocessing NovaSeq / NextSeq data?
Thanks
Shifu
Shifu;
Thanks so much for considering this feature. In terms of motivation, I think polyX trimming is separate from the NovaSeq polyG trims. There isn't a NovaSeq mechanism contributing to polyA/T/C, since these also appear in non-NovaSeq (HiSeq and friends) reads. Rather, this is due to some kind of slippage issue, or is sequencing real signal. Either way, during variant calling these reads pile up on A stretches of the genome, creating very deep pileups of noise which take a long time to call through, especially when identifying low-frequency somatic variants.
So, I'd suggest having these as separate options. Some users might want only polyG and to keep the other poly stretches. I think having both as optional flags is probably the best option for now until trimming these becomes more standard accepted practice.
Regarding streaming for paired end reads, there are a couple of use cases where we could take advantage of this;
- Streaming interleaved reads directly into mappers like bwa and minimap2. So instead of `-o` and `-O`, stream both ends, interleaved, directly to standard out.
- Streaming into bgzip preparation of reads, doing something like `-o >(bgzip --threads 8 > out_R1.fq.gz) -O >(bgzip --threads 8 > out_R2.fq.gz)`. This would help us prepare bgzipped outputs which are indexable by tools like grabix (https://github.com/arq5x/grabix), so we could parallelize subsequent alignment over regions of the fastq file.
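To make the process-substitution idea concrete, here is a hypothetical full invocation, assuming a bash-compatible shell; the file names and thread counts are placeholders, not recommended values:

```shell
# Compress both paired-end outputs with bgzip in parallel via
# bash process substitution, so the results are indexable by grabix.
fastp -i in_R1.fq.gz -I in_R2.fq.gz \
    -o >(bgzip --threads 8 > out_R1.fq.gz) \
    -O >(bgzip --threads 8 > out_R2.fq.gz)
```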
Thanks again for considering all these options and all the work on fastp.
Shifu,
I think both options are useful. polyG is a 2-color-chemistry specific problem and causes variant callers to fail depending on coverage depth (or at least stall for a long time). That in itself is valuable, but based on the preliminary results removing other polyX regions helps with both speed and precision, something I didn't anticipate.
For two color chemistry I'd suggest polyG as a default, with everything else up to the user.
Okay.
Another question: do you want to discard reads containing a long polyX in the middle of the read, i.e. not at the 3' or 5' end?
This function may have side effects on MSI detection.
Hi @chapmanb ,
Streaming both ends, interleaved, directly into STDOUT is a good suggestion. I have just confirmed that minimap2 also supports interleaved FASTQ, as bwa does.
I will add this feature to the roadmap.
Thanks
Shifu
> But what causes poly A/T/C?
In RNA-seq one has polyA due to polyadenylation of mRNA, and therefore one would want to trim the polyA ends from the reads before aligning the reads. Some RNA-seq datasets, which come from non-stranded library preparation, will have the polyA showing up as polyT in the reads.
Hi all, I am just starting to implement 3' end polyX trimming.
Two more important questions:
1. In your opinion, how many consecutive bases must be detected to call a polyX? How about 10 bp?
2. To be error tolerant, how many mismatches are allowed? How about 1 mismatch per 10 bp?
Although I will make these options configurable on the command line, I still need this information to set the default configuration.
Thanks
Shifu
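To make these proposed defaults concrete, here is a minimal Python sketch of 3' polyX detection with a minimum run length and a mismatch allowance. This is an illustration of the idea only, not fastp's actual implementation (fastp is written in C++ and its exact logic may differ):

```python
def trim_polyx_3p(seq, min_len=10, one_mismatch_per=10):
    """Trim a 3' polyX tail from a read (illustrative sketch).

    A tail is trimmed when a run of one repeated base reaches
    `min_len`, allowing one mismatch per `one_mismatch_per` bases
    of run length.
    """
    best_cut = len(seq)  # len(seq) means "no trimming"
    for base in "ACGT":
        mismatches = 0
        cut = len(seq)
        for i in range(len(seq) - 1, -1, -1):
            run_len = len(seq) - i
            if seq[i] != base:
                mismatches += 1
                # too many mismatches for this run length: stop extending
                if mismatches * one_mismatch_per > run_len:
                    break
            elif run_len >= min_len:
                cut = i  # longest qualifying tail seen so far
        best_cut = min(best_cut, cut)
    return seq[:best_cut]


print(trim_polyx_3p("ACCTA" + "G" * 12))  # -> "ACCTA"
```

Note that in this sketch a mismatch at the very last base immediately breaks the run, so a tail like `TTTT...A` would be left untrimmed.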
https://s3-ap-southeast-2.amazonaws.com/umccr/umccr/qc/polyg/ipmn2219-2_33_tumor_hotspot_fastqc.html#M9 is a FastQC report including just one problematic region. I can only speak for poly-G tracks, but even removing just reads that consist of 50 consecutive Gs would allow most variant callers to finish the region in a reasonable amount of time; allowing one mismatch in 50 would get rid of all but a small minority.
I think at least for the poly-G issue a conservative setting would work. Not sure what the Atropos defaults are though.
fastp's current polyG implementation is a bit aggressive: 10 consecutive G bases will be categorised as polyG, with one mismatch allowed per 8 bp.
Shifu;
The defaults you use for polyG -- 10 Gs with 1 mismatch per 8bp -- also seem reasonable to me for polyX. Having them consistent as defaults seems like a worthwhile strategy. Thanks again for working on this.
I just implemented this feature.
Please build fastp from the latest code on master, or download http://opengene.org/fastp/fastp to test.
Thanks so much for the quick work integrating this, we really appreciate it. I've re-run a test on the same dataset used above and this does provide some nice improvements in removing these stretches. In comparison to what we were seeing before, here are the top 3' ends for the first 10 million reads:
```
GTGTGTGTGT 1217
TGTGTGTGTG 1175
CACACACACA 1122
ACACACACAC 1089
TTTTTTTTTG 1083
TTTTTTTTTA 1009
```
The low-complexity dinucleotide repeats are still expected with our current trimming, but the last two, polyT with a different final base, are ones I'd expect to remove. I dug into them and here are some example reads with these ends left after trimming:
```
AGGAATTCTGCAGCTTTTTCTTTTCTTAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCTGTTTTTTTTTATTTTTTTGTTTTTTTTTTATTTTTTTTTTTTTTTTTTTTTTTTA
CCCTTCTTTACGGTGAAGCTTATTCTGATTAAGCCTAGACTGTGTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTA
TGAAGGCCTGGGGATGGTGACTGAAGAAGGAACACGTAAGTAACTAATGAATGTGAAGGCCATTCTCTTCCTGATTAAAATCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTG
TGGGTGTGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTTTG
```
Happy to provide more examples if it would help. Thanks again for the work on this.
@chapmanb how about trimming the last base with the `-t 1` option? Are they from the 151st cycle?
Anyway, I can make a minor revision to trim polyX like `TTTTTTTTTTTTTTTTTTTTTTTTA`, but we should consider whether it will cause over-trimming. Will it impact MSI detection?
Shifu -- thanks for this. Generally I thought these should fall under the logic of being > `poly_x_min_len` (which I left at 10), with 1 allowed mismatch per 8bp and a maximum of 5 mismatches. Are you requiring that the polyX stretch start at the 3'-most base, implicitly disallowing mismatches there?
MSIs should primarily be in more complex di-nucleotide+ repeats (like the first 4 examples in the remaining trims), and I agree polyX trimming shouldn't touch those. Exploring low-complexity filters and the impact of this type of detection would be a useful secondary filter, but something more long term. We're hoping to isolate the smaller set of trimming changes which help most with runtimes as a first pass, so clearing out the remaining noisy polyT and other reads would be helpful.
Thanks again for the discussion and help.
@chapmanb I just made an update to trim polyX like `TTTTTTTTTTTTTTTTTTTTTTTTA`. Could you please try the latest build?
Shifu;
Thanks much for the fix and continuing to work on this. I've been testing this and comparing with atropos trimming, and finding some reads where atropos trims and fastp does not. This is a list of ~1000 reads that atropos will trim after being fed the fastp-trimmed outputs. I don't think atropos does the right thing in all of these cases, but I would especially like to remove a lot of these problematic reads that are almost all polyX. The file has two columns, the first the remaining read and the second the part that gets trimmed, so concatenating the two columns gives the original post-fastp read:
https://gist.github.com/chapmanb/a0f3ccb645079634dc1e733be29f3de5
I'm continuing to try to understand which reads specifically help with downstream speed improvements, and appreciate all the help trying to be able to be more aggressive in removing this polyX noise. Thanks again.
Thank you Brad, I will do a test with the file you provided.
Hi @chapmanb I cannot download the file you uploaded at https://gist.github.com/chapmanb/a0f3ccb645079634dc1e733be29f3de5
Would you please attach it in this issue? I think it's small enough to be attached here.
Thanks
Shifu
Shifu -- what issues are you running into downloading it? This is the direct link for the raw version for download if that helps:
This direct link works, thanks.
Hi @chapmanb
With the latest build, I implemented low-complexity filtering, which can filter out most of these reads. You can test it by specifying `-y`.
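As an illustration of how such a filter can work, here is a sketch of one simple low-complexity metric: the fraction of positions whose base differs from the next one. fastp's README describes its complexity filter in similar terms, but this code and the 30% threshold here are assumptions for illustration, not fastp's source:

```python
def sequence_complexity(seq):
    # Fraction of positions whose base differs from the next base.
    if len(seq) < 2:
        return 0.0
    diffs = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return diffs / (len(seq) - 1)


def passes_complexity_filter(seq, threshold=0.30):
    # Keep a read only if its complexity reaches the threshold.
    return sequence_complexity(seq) >= threshold


print(passes_complexity_filter("T" * 50))  # -> False
```

Note that a polyT read scores 0.0 and is dropped, while a dinucleotide repeat like `ACACAC...` scores 1.0 and is kept, consistent with leaving MSI-relevant repeats untouched.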
Shifu;
Thanks so much for the implementation. I've been using this and working through different trimming comparisons with both fastp and atropos, to try to find a good minimal combination of quality and polyX trimming that improves calling runtimes and helps with sensitivity/specificity.
It looks in the end like the primary difference is due to quality trimming differences, and the polyX at 3' ends is a reflection of that:
https://github.com/bcbio/bcbio_validations/tree/master/somatic_trim
I'm trying to harmonize fastp and atropos trimming to better understand the differences, but am having trouble replicating the runtime improvements we find with atropos trimming. For fastp I use `--cut_by_quality3 --cut_mean_quality 5 --disable_quality_filtering` to get 3'-only quality trimming, and with atropos I use `--quality-cutoff 5`, which I think should be roughly equivalent. I've tried increasing `--cut_mean_quality` without much change, so I must be missing something.
I'll continue to dig but welcome any thoughts about how best to synchronize these, or if I'm missing anything obvious. Thank you again for all the help and happy to have two quality options to test with.
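For reference, the 3' quality-trimming algorithm that cutadapt (and hence atropos) documents is the BWA-style one: subtract the cutoff from each quality, form running sums from the 3' end, and cut where the partial sum is minimal. A Python sketch of that documented algorithm, not either tool's actual code:

```python
def quality_trim_index_3p(quals, cutoff=5):
    """Return the index at which to cut the 3' end, BWA/cutadapt style.

    quals: list of Phred quality scores.
    """
    running = 0
    min_sum = 0
    cut = len(quals)  # no trimming by default
    for i in range(len(quals) - 1, -1, -1):
        running += quals[i] - cutoff
        if running < min_sum:  # new minimum of the partial sums
            min_sum = running
            cut = i
    return cut


print(quality_trim_index_3p([30, 30, 30, 2, 2, 2]))  # -> 3
```

Because this uses partial sums rather than a windowed mean, it can trim through short high-quality islands inside a low-quality tail, which a sliding-window mean approach may keep; that could account for some of the differences between the two tools.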
Hi @chapmanb
A quick update: fastp 0.16.0 supports streaming to STDOUT, and also supports interleaved input. Please see the update in the README: https://github.com/OpenGene/fastp#input-and-output
output to STDOUT
`fastp` supports streaming the passing-filter reads to STDOUT, so that it can be passed to compressors like `bzip2`, or to aligners like `bwa` and `bowtie2`.
- specify `--stdout` to enable this mode to stream output to STDOUT
- for PE data, the output will be interleaved FASTQ, which means the files will contain records like `record1-R1 -> record1-R2 -> record2-R1 -> record2-R2 -> record3-R1 -> record3-R2 ...`
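As a usage illustration of the new mode, a hypothetical pipeline might look like this; the file and reference names are placeholders, and `-p` is bwa mem's flag for interleaved FASTQ on stdin:

```shell
# Trim with fastp, stream interleaved PE reads to STDOUT
# (fastp's run log goes to stderr), and align with bwa mem.
fastp -i sample_R1.fq.gz -I sample_R2.fq.gz --stdout 2> fastp.log \
  | bwa mem -p ref.fa - \
  | samtools sort -o sample.bam -
```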
I hope you will like this new feature.
BTW, I think it's you who maintains `fastp` on BioConda, am I right? If yes, could you please add me as a collaborator so that I can update it after a new version is released?
Shifu;
Thanks so much for these improvements, that is really helpful. I'll look at incorporating this into bcbio's use of fastp.
For the bioconda package, you don't need to be a collaborator or anything special. If you update the recipe, send a PR and cc me I'd be happy to merge:
https://bioconda.github.io/contributing.html
or you can also ask to become a contributor to the project to merge them yourself:
https://bioconda.github.io/contrib-setup.html#request-to-be-added-to-the-bioconda-team-optional
It's pretty lightweight, meant to enable as many contributors as we can. Note that right now we're in the middle of a huge transition to a new compiler system and rebuild, so things are bogged down on new recipes. Hopefully that backlog will get cleared soon. Thanks again.
Hi @chapmanb
I sent you an email yesterday. Just wondering whether you have received it?
Thanks
Shifu
Shifu;
Sorry, I don't think I got this. I'm not sure what happened, but could you resend or we can discuss here? My e-mail is in my GitHub profile. Thanks again.
I sent the email to your Harvard mailbox, and it was probably filtered into the junk box. So I just resent it to your fastmail.com mailbox :)