Shifu; Thanks for this great tool and adding pre-processing for UMIs. I've been lo

I also commented on bcbio : <a class="commit-link" dat

Support UMI pre-processing with a 3rd file containing UMIs about fastp HOT 10 CLOSED

opengene commented on August 29, 2024

Support UMI pre-processing with a 3rd file containing UMIs

from fastp.

Comments (10)

sfchen commented on August 29, 2024 1

Hi Brad,

Although you've closed this issue, I still have an important note about the method I proposed to run R1+R3 and R2+R3 individually.

You should always run these two commands with only 1 worker thread by specifying -w 1. If you run them with more than 1 thread, the read order of output.R1 and output.R2 may be inconsistent since read order is not kept by multi-threading output.

So the commands should be:

fastp -i R1.fq -I R3.fq -o R1.out.fq -O R3.out.fq --umi --umi_loc=read2 --umi_len=6 -Q -A -L -w 1
fastp -i R2.fq -I R3.fq -o R2.out.fq -O R3.out.fq --umi --umi_loc=read2 --umi_len=6 -Q -A -L -w 1

I hope you have noted this issue already.

Thanks
Shifu

from fastp.

sfchen commented on August 29, 2024

Brad,

It's possible to support, but it may take much time to implement since it breaks the design of paired-end reading. I need to check whether this case is common.

However, I can't understand why you put the UMIs in a 3rd file. If UMIs are presented in the read (either R1 or R2), bcl2fastq can shift them to the corresponding read headers directly.

from fastp.

chapmanb commented on August 29, 2024

Shifu;
Thanks much for considering this. The problem with putting them in the header using bcl2fastq is that the UMIs aren't present in read 1 or read 2, they're in a separate index read so are only supported coming out as the third file. This is fairly common method but unfortunately bcl2fastq doesn't support doing anything clever like attaching them into the other paired reads. Thanks again for looking into support for this.

from fastp.

sfchen commented on August 29, 2024

If my guess is right, you're using one of the two Illumina sample indexes (I7 or I5) as UMI. This is exactly the same as my lab does, and the UMI processing for such kind of data is already supported by fastp.

Note that the sample indexes are also present in the reads (both R1 and R2), so actually you don't have to input a 3rd index-only file. For example, a typical read is like:

@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA

In this case, index 1 (I7) is TATAGCCT, and index 2 (I5) is GACCCCCA

By specifying --umi_loc=index2, fastp will append index 2 (I5) to the end of the first part of the read header (both R1 and R2), and the read will become:

@NS500713:64:HFKJJBGXY:1:11101:1675:1101:GACCCCCA 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA

And --umi_loc=index1 will do the same thing for index 1 (I7).

If my guess was right, you can look into the README (https://github.com/OpenGene/fastp#unique-molecular-identifer-umi-processing) for more details.

But if my guess was wrong, please kindly let me know.

from fastp.

chapmanb commented on August 29, 2024

Shifu;
Thanks for this discussion. It would be great if we had a way to export directly into the indexes but am not sure how to do this with bcl2fastq. The setup we have is dual indexes with the UMI in I7 so it looks like IIIIIIIINNNNNN where I is the index and N is the UMI. We want to use the dual index for demultiplexing and then export the UMIs. So assuming 150 bp reads, 8 bp dual indices and single 6bp UMI in the first index, the bcl2fastq looks like 150Y, 8I6Y, 8I, 150Y where the 150Y, 6Y and 150Y generate the three files: read1, UMI and read2, respectively. The 8Is generate the two indexes used in demultiplexing.

Is there a way to handle both the demultiplexing and export into the header from bcl2fastq? I definitely see how to use fastp for this case if we could export either to an index or one of read1 or read2, but I might easily be ignorant about how best to setup bcl2fastq to generate something compatible. Thanks again for helping with this.

from fastp.

sfchen commented on August 29, 2024

Thanks Brad, I now understood your difficulties. You're right, AFAIK, bcl2fastq cannot use a part of index as UMI.

If you use fastp only for processing UMI (without any filtering/adapter/correction), there is a way to do: processing R1+R3 and R2+R3 individually. For example, your command must look like:

fastp -i R1.fq -I R3.fq -o R1.out.fq -O R3.out.fq --umi --umi_loc=read2 --umi_len=6 -Q -A -L -w 1
fastp -i R2.fq -I R3.fq -o R2.out.fq -O R3.out.fq --umi --umi_loc=read2 --umi_len=6 -Q -A -L -w 1

The -Q -A -L options disable quality filtering, adapter trimming and length filtering so all input reads will be output, to keep R1.out.fq and R2.out.fq consistent.

You may also specify -G to disable polyG end trimming if you're processing data from NextSeq / NovaSeq, but polyG end trimming will not discard reads so it will not break the read consistency.

The performance will be mainly unaffected since only R3.fq is loaded twice, which is very small since it's only 6 bp long. And actually you can run these two commands in parallel so that the performance may be even better :)

Updated on 1/11/2018: add -w 1 to the commands to limit the worker threads to keep read order of R1 and R2.

from fastp.

chapmanb commented on August 29, 2024

Shifu;
That is perfect, thanks so much for the clever idea. I was trying to envision all these more complicated approaches to concatenating R1 + R2 and streaming into fastp but this is much simpler. We don't need trimming/filtering so will turn off all those options to get identical outputs. Thanks again for helping think through this, much appreciated.

from fastp.

sfchen commented on August 29, 2024

I also commented on bcbio: bcbio/bcbio-nextgen@c5b60bc

from fastp.

chapmanb commented on August 29, 2024

Thanks so much for the heads up, apologies for missing this on my side. I've pushed the fix to use a single core, thank you again.

from fastp.

dh10 commented on August 29, 2024

It seems to help to add: -u 100 -n $UMI_LENGTH -Y 100 -G

Low quality sequences otherwise seem to break fastq sync using just the options originally proposed

from fastp.

Support UMI pre-processing with a 3rd file containing UMIs about fastp HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs