Comments (29)
Thank you Brad!
Actually fastp only implemented polyG trimming at 3' ends, since I thought polyG is much more serious than polyA/T/C for NextSeq/NovaSeq. It's not difficult to support polyX trimming based on the polyG trimming implementation, and I will implement it, following your good suggestion.
Since the paired-end processing will output two files, I cannot find a good solution to stream the output to aligners.
Thank you again for your support and good suggestions. I will continue to improve this tool.
from fastp.
Shifu,
thank you so much -- we are one of the NovaSeq sites that would positively benefit from a fast trimming implementation, particularly if it can handle polyX sequences. A +1 on the roadmap from us.
Hi Brad and Oliver,
I'd like to ask some questions about polyX trimming:
- We know that polyG is caused by signal attenuation of SBS. When the signal is very weak, the base will be detected as G in a two-colour system like NovaSeq or NextSeq. But what causes poly A/T/C?
- If polyX trimming is implemented, do we still need polyG trimming options?
- Do you think polyX trimming should be enabled by default for preprocessing NovaSeq / NextSeq data?
Thanks
Shifu
Shifu;
Thanks so much for considering this feature. In terms of motivation, I think polyX trimming is separate from the NovaSeq polyG trims. There isn't a NovaSeq mechanism contributing to polyA/T/C, since these also appear in non-NovaSeq (HiSeq and friends) reads. Rather, this is due to some kind of slippage issue, or is sequencing real signal. Either way, during variant calling these reads pile up on A stretches of the genome, creating very deep pileups of noise which take a long time to call through, especially when identifying low-frequency somatic variants.
So, I'd suggest having these as separate options. Some users might want only polyG and to keep the other poly stretches. I think having both as optional flags is probably the best option for now until trimming these becomes more standard accepted practice.
Regarding streaming for paired end reads, there are a couple of use cases where we could take advantage of this;
- Streaming interleaved reads directly into mappers like bwa and minimap2. So instead of `-o` and `-O`, stream both ends, interleaved, directly to standard out.
- Streaming into bgzip preparation of reads, doing something like `-o >(bgzip --threads 8 > out_R1.fq.gz) -O >(bgzip --threads 8 > out_R2.fq.gz)`. This would help us prepare bgzipped outputs which are indexable by tools like grabix (https://github.com/arq5x/grabix), so we could parallelize subsequent alignment over regions of the fastq file.
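To make the process-substitution idea concrete, here is a hypothetical full invocation, assuming a bash-compatible shell; the file names and thread counts are placeholders, not recommended values:

```shell
# Compress both paired-end outputs with bgzip in parallel via
# bash process substitution, so the results are indexable by grabix.
fastp -i in_R1.fq.gz -I in_R2.fq.gz \
    -o >(bgzip --threads 8 > out_R1.fq.gz) \
    -O >(bgzip --threads 8 > out_R2.fq.gz)
```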
Thanks again for considering all these options and all the work on fastp.
Shifu,
I think both options are useful. polyG is a 2-color-chemistry specific problem and causes variant callers to fail depending on coverage depth (or at least stall for a long time). That in itself is valuable, but based on the preliminary results removing other polyX regions helps with both speed and precision, something I didn't anticipate.
For two color chemistry I'd suggest polyG as a default, with everything else up to the user.
Okay.
Another question: do you want to discard reads containing a long polyX in the middle of the read, i.e. not at the 3' or 5' end?
This function may have side effects on MSI detection.
Hi @chapmanb ,
Streaming both ends, interleaved, directly into STDOUT is a good suggestion. I have just confirmed that minimap2 also supports interleaved FASTQ, as bwa does.
I will add this feature to the roadmap.
Thanks
Shifu
> But what causes poly A/T/C?
In RNA-seq one has polyA due to polyadenylation of mRNA, and therefore one would want to trim the polyA ends from the reads before aligning the reads. Some RNA-seq datasets, which come from non-stranded library preparation, will have the polyA showing up as polyT in the reads.
Hi all, I am just starting to implement 3' end polyX trimming.
Two more important questions:
1. In your opinion, how many consecutive bases must be detected to call a polyX? How about 10 bp?
2. To be error tolerant, how many mismatches are allowed? How about 1 mismatch per 10 bp?
Although I will make these options configurable on the command line, I still need this information to set the default configuration.
Thanks
Shifu
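To make these proposed defaults concrete, here is a minimal Python sketch of 3' polyX detection with a minimum run length and a mismatch allowance. This is an illustration of the idea only, not fastp's actual implementation (fastp is written in C++ and its exact logic may differ):

```python
def trim_polyx_3p(seq, min_len=10, one_mismatch_per=10):
    """Trim a 3' polyX tail from a read (illustrative sketch).

    A tail is trimmed when a run of one repeated base reaches
    `min_len`, allowing one mismatch per `one_mismatch_per` bases
    of run length.
    """
    best_cut = len(seq)  # len(seq) means "no trimming"
    for base in "ACGT":
        mismatches = 0
        cut = len(seq)
        for i in range(len(seq) - 1, -1, -1):
            run_len = len(seq) - i
            if seq[i] != base:
                mismatches += 1
                # too many mismatches for this run length: stop extending
                if mismatches * one_mismatch_per > run_len:
                    break
            elif run_len >= min_len:
                cut = i  # longest qualifying tail seen so far
        best_cut = min(best_cut, cut)
    return seq[:best_cut]


print(trim_polyx_3p("ACCTA" + "G" * 12))  # -> "ACCTA"
```

Note that in this sketch a mismatch at the very last base immediately breaks the run, so a tail like `TTTT...A` would be left untrimmed.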
https://s3-ap-southeast-2.amazonaws.com/umccr/umccr/qc/polyg/ipmn2219-2_33_tumor_hotspot_fastqc.html#M9 is a FastQC report including just one problematic region. I can only speak for poly-G tracks, but even removing just reads that consist of 50 consecutive Gs would allow most variant callers to finish the region in a reasonable amount of time; allowing one mismatch in 50 would get rid of all but a small minority.
I think at least for the poly-G issue a conservative setting would work. Not sure what the Atropos defaults are though.
fastp's current polyG implementation is a bit aggressive: 10 consecutive G bases will be categorised as polyG, with one mismatch allowed per 8 bp.
Shifu;
The defaults you use for polyG -- 10 Gs with 1 mismatch per 8bp -- also seem reasonable to me for polyX. Having them consistent as defaults seems like a worthwhile strategy. Thanks again for working on this.
I just implemented this feature.
Please build fastp from the latest code on master, or download http://opengene.org/fastp/fastp to test.
Thanks so much for the quick work integrating this, we really appreciate it. I've re-run a test on the same dataset used above and this does provide some nice improvements in removing these stretches. In comparison to what we were seeing before, here are the top 3' ends for the first 10 million reads:
```
GTGTGTGTGT 1217
TGTGTGTGTG 1175
CACACACACA 1122
ACACACACAC 1089
TTTTTTTTTG 1083
TTTTTTTTTA 1009
```
The low-complexity dinucleotide repeats are still expected with our current trimming, but the last two, polyT with a different final base, are ones I'd expect to remove. I dug into them and here are some example reads with these ends left after trimming:
```
AGGAATTCTGCAGCTTTTTCTTTTCTTAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCTGTTTTTTTTTATTTTTTTGTTTTTTTTTTATTTTTTTTTTTTTTTTTTTTTTTTA
CCCTTCTTTACGGTGAAGCTTATTCTGATTAAGCCTAGACTGTGTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTA
TGAAGGCCTGGGGATGGTGACTGAAGAAGGAACACGTAAGTAACTAATGAATGTGAAGGCCATTCTCTTCCTGATTAAAATCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTG
TGGGTGTGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTTTG
```
Happy to provide more examples if it would help. Thanks again for the work on this.
@chapmanb how about trimming the last base with the `-t 1` option? Are they from the 151st cycle?
Anyway, I can make a minor revision to trim polyX like `TTTTTTTTTTTTTTTTTTTTTTTTA`, but we should consider whether it will cause over-trimming. Will it impact MSI detection?
Shifu -- thanks for this. Generally I thought these should fall under the logic of being > `poly_x_min_len` (which I left at 10), with 1 allowed mismatch per 8bp and a maximum of 5 mismatches. Are you requiring that the polyX stretch start at the 3'-most base, implicitly disallowing mismatches there?
MSIs should primarily be in more complex di-nucleotide+ repeats (like the first 4 examples in the remaining trims), and I agree polyX trimming shouldn't touch those. Exploring low-complexity filters and the impact of this type of detection would be a useful secondary filter, but something more long term. We're hoping to isolate the smaller set of trimming changes which help most with runtimes as a first pass, so clearing out the remaining noisy polyT and other reads would be helpful.
Thanks again for the discussion and help.
@chapmanb I just made an update to trim polyX like `TTTTTTTTTTTTTTTTTTTTTTTTA`. Could you please try the latest build?
Shifu;
Thanks much for the fix and continuing to work on this. I've been testing this and comparing with atropos trimming, and finding some reads where atropos trims and fastp does not. This is a list of ~1000 reads that atropos will trim after being fed the fastp-trimmed outputs. I don't think atropos does the right thing in all of these cases, but I would especially like to remove a lot of these problematic reads that are almost all polyX. The file has two columns, the first the remaining read and the second the part that gets trimmed, so concatenating the two columns gives the original post-fastp read:
https://gist.github.com/chapmanb/a0f3ccb645079634dc1e733be29f3de5
I'm continuing to try to understand which reads specifically help with downstream speed improvements, and appreciate all the help trying to be able to be more aggressive in removing this polyX noise. Thanks again.
Thank you Brad, I will do a test with the file you provided.
Hi @chapmanb I cannot download the file you uploaded at https://gist.github.com/chapmanb/a0f3ccb645079634dc1e733be29f3de5
Would you please attach it in this issue? I think it's small enough to be attached here.
Thanks
Shifu
Shifu -- what issues are you running into downloading it? This is the direct link for the raw version for download if that helps:
This direct link works, thanks.
Hi @chapmanb
With the latest build, I implemented low-complexity filtering, which can filter out most of these reads. You can test it by specifying `-y`.
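As an illustration of how such a filter can work, here is a sketch of one simple low-complexity metric: the fraction of positions whose base differs from the next one. fastp's README describes its complexity filter in similar terms, but this code and the 30% threshold here are assumptions for illustration, not fastp's source:

```python
def sequence_complexity(seq):
    # Fraction of positions whose base differs from the next base.
    if len(seq) < 2:
        return 0.0
    diffs = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return diffs / (len(seq) - 1)


def passes_complexity_filter(seq, threshold=0.30):
    # Keep a read only if its complexity reaches the threshold.
    return sequence_complexity(seq) >= threshold


print(passes_complexity_filter("T" * 50))  # -> False
```

Note that a polyT read scores 0.0 and is dropped, while a dinucleotide repeat like `ACACAC...` scores 1.0 and is kept, consistent with leaving MSI-relevant repeats untouched.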
Shifu;
Thanks so much for the implementation. I've been using this and working through different trimming comparisons with both fastp and atropos, to try to find a good minimal combination of quality and polyX trimming that improves calling runtimes and helps with sensitivity/specificity.
It looks in the end like the primary difference is due to quality trimming differences, and the polyX at 3' ends is a reflection of that:
https://github.com/bcbio/bcbio_validations/tree/master/somatic_trim
I'm trying to harmonize fastp and atropos trimming to better understand the differences, but am having trouble replicating the runtime improvements we find with atropos trimming. For fastp I use `--cut_by_quality3 --cut_mean_quality 5 --disable_quality_filtering` to get 3'-only quality trimming, and with atropos I use `--quality-cutoff 5`, which I think should be roughly equivalent. I've tried increasing `--cut_mean_quality` without much change, so I must be missing something.
I'll continue to dig but welcome any thoughts about how best to synchronize these, or if I'm missing anything obvious. Thank you again for all the help and happy to have two quality options to test with.
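For reference, the 3' quality-trimming algorithm that cutadapt (and hence atropos) documents is the BWA-style one: subtract the cutoff from each quality, form running sums from the 3' end, and cut where the partial sum is minimal. A Python sketch of that documented algorithm, not either tool's actual code:

```python
def quality_trim_index_3p(quals, cutoff=5):
    """Return the index at which to cut the 3' end, BWA/cutadapt style.

    quals: list of Phred quality scores.
    """
    running = 0
    min_sum = 0
    cut = len(quals)  # no trimming by default
    for i in range(len(quals) - 1, -1, -1):
        running += quals[i] - cutoff
        if running < min_sum:  # new minimum of the partial sums
            min_sum = running
            cut = i
    return cut


print(quality_trim_index_3p([30, 30, 30, 2, 2, 2]))  # -> 3
```

Because this uses partial sums rather than a windowed mean, it can trim through short high-quality islands inside a low-quality tail, which a sliding-window mean approach may keep; that could account for some of the differences between the two tools.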
Hi @chapmanb
A quick update: fastp 0.16.0 supports streaming to STDOUT, and also supports interleaved input. Please see the update in the README: https://github.com/OpenGene/fastp#input-and-output
output to STDOUT
`fastp` supports streaming the passing-filter reads to STDOUT, so that it can be passed to compressors like `bzip2`, or to aligners like `bwa` and `bowtie2`.
- specify `--stdout` to enable this mode to stream output to STDOUT
- for PE data, the output will be interleaved FASTQ, which means the files will contain records like `record1-R1 -> record1-R2 -> record2-R1 -> record2-R2 -> record3-R1 -> record3-R2 ...`
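As a usage illustration of the new mode, a hypothetical pipeline might look like this; the file and reference names are placeholders, and `-p` is bwa mem's flag for interleaved FASTQ on stdin:

```shell
# Trim with fastp, stream interleaved PE reads to STDOUT
# (fastp's run log goes to stderr), and align with bwa mem.
fastp -i sample_R1.fq.gz -I sample_R2.fq.gz --stdout 2> fastp.log \
  | bwa mem -p ref.fa - \
  | samtools sort -o sample.bam -
```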
I hope you will like this new feature.
BTW, I think it's you who maintains `fastp` on BioConda, am I right? If yes, could you please add me as a collaborator so that I can update it after a new version is released?
Shifu;
Thanks so much for these improvements, that is really helpful. I'll look at incorporating this into bcbio's use of fastp.
For the bioconda package, you don't need to be a collaborator or anything special. If you update the recipe, send a PR and cc me I'd be happy to merge:
https://bioconda.github.io/contributing.html
or you can also ask to become a contributor to the project to merge them yourself:
https://bioconda.github.io/contrib-setup.html#request-to-be-added-to-the-bioconda-team-optional
It's pretty lightweight, meant to enable as many contributors as we can. Note that right now we're in the middle of a huge transition to a new compiler system and rebuild, so things are bogged down on new recipes. Hopefully that backlog will get cleared soon. Thanks again.
Hi @chapmanb
I sent you an email yesterday. Just wondering whether you have received it?
Thanks
Shifu
Shifu;
Sorry, I don't think I got this. I'm not sure what happened, but could you resend or we can discuss here? My e-mail is in my GitHub profile. Thanks again.
I sent the email to your Harvard mailbox, and it was probably filtered into the junk box. So I just resent it to your fastmail.com mailbox :)