
Comments (10)

sfchen commented on August 29, 2024

I just updated the behaviour of UMI preprocessing for the per_index and per_read modes.

  • per_index: index1_index2 is used as the UMI for both read1 and read2.
  • per_read: umi1 is taken from the head of read1 and umi2 from the head of read2; umi1_umi2 is used as the UMI for both read1 and read2.
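
To make the per_read behaviour concrete, here is a minimal Python sketch of the tagging described above. It is not fastp's implementation; the function names and the `Read` tuple are made up for illustration, and the placement of the UMI (appended with `:` to the read ID, before the first space) is inferred from the output pasted later in this thread.

```python
from typing import Tuple

# Hypothetical record type for this sketch: (name line, sequence, quality)
Read = Tuple[str, str, str]


def tag_name(name: str, umi: str) -> str:
    """Append ':<umi>' to the read ID, i.e. the part before the first space."""
    head, sep, rest = name.partition(" ")
    return f"{head}:{umi}{sep}{rest}"


def per_read_umi(r1: Read, r2: Read, umi_len: int) -> Tuple[Read, Read]:
    """Tag both mates with umi1_umi2 and trim the UMI bases off each read."""
    (n1, s1, q1), (n2, s2, q2) = r1, r2
    # umi1 is the head of read1, umi2 is the head of read2
    umi = f"{s1[:umi_len]}_{s2[:umi_len]}"
    return (
        (tag_name(n1, umi), s1[umi_len:], q1[umi_len:]),
        (tag_name(n2, umi), s2[umi_len:], q2[umi_len:]),
    )
```

With `umi_len=8`, a pair whose reads start with `TGGGGGAA` and `TCTGACCT` ends up with `:TGGGGGAA_TCTGACCT` appended to the read ID of both mates, and each read loses its first 8 bases and quality values.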

Could you please try building fastp from the latest code on master, or download http://opengene.org/fastp/fastp to test?

from fastp.

carlandt commented on August 29, 2024

Thanks for the fast response! I'll hit it on Monday and let you know. Sorry for the poor spelling in the name of the bug, that's a bit embarrassing =P

Thanks again for the wonderful tool, loving it so far =)


sfchen commented on August 29, 2024

Any update?


carlandt commented on August 29, 2024

Tried out
fastp -i read_1.fastq.gz -I read_2.fastq.gz -o read_umi_1.fastq -O read_umi_2.fastq -U --umi_loc=per_read --umi_len=8
Was that what you were thinking?

If so, I'm still getting only each read's own UMI on that read, not one shared across both reads, e.g. umi1_umi2.

Used version 0.12.6 - the one at http://opengene.org/fastp/fastp


sfchen commented on August 29, 2024

Can you paste some reads here?


carlandt commented on August 29, 2024

Sure! Here are the input reads, the output reads, and what I was hoping to see:


$ zcat 782404_V1_L1.N701_505_1.fastq.gz | head -n 8
@M01378:492:000000000-BLK46:1:1101:9647:3917 1:N:0: TAAGGCGA-TTAAGGAG
CACGCGAGTGGAGCTGAGCAGCCTGAGATCTGACCGTCTCCTCAGGTACGCACCCTCACTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAGGTATGTTGGTTTATGTTTTTTTTGGTTGGATGTGGTATGGTTTTTTTTTGTTTGTTTTTTGTTAT
+
ABBBBBBBBBBFGGGGFGFGFFHHHHHHGFHFHFFGGGHGGHHFHFFHD1AEEEEHHFHGH5FFGFH5GHHHHHGFFHHGCEEEFHGC?FDFFGFBBG//<E/F?CF/GFBDDG@GF2G2@DHF0F2FFC#######################################################################
@M01378:492:000000000-BLK46:1:1101:11212:3937 1:N:0: TAAGGCGA-TTAAGGAG
TGGGGGAATGGAGCTGAGCAGCCTGAGATCTGGGCGTTCACCCAGGCTTCCACGTTCCCCTCGCTTGGGTCACCGTCTCCTCAGGTAAGAGGTCAGCCTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACTCAGGTGCGATGTGCAGCTGTCTTGTG
+
BBBBBBBBFFFFGGGGDACE4FGEFFHGHGDHCBGGGGFHGFH1G1BAFGHHHDCGFFGFHG?EFHGGEEFFFGGHGHHHHHHBGGHBB4FEFGGFBGHFHHHHHHFGEFHFGFHGHG/@/<BFFFC?/?DHHH1GFCFGCEHHFCEGFFDFDFAEFHBGHFF.00CC#################################


$ zcat 782404_V1_L1.N701_505_2.fastq.gz | head -n 8
@M01378:492:000000000-BLK46:1:1101:9647:3917 2:N:0: TAAGGCGA-TTAAGGAG
TGAGGGTGCTTACCTGCGGCGACGGTCAGATCTCTTTCTTCTCATCTCCACTCGCTTGCTGTCTCTTATACACATCTGACGCTGCCGACTACTCCTTTCTTGTTTTTTTTTTTTTTCTTCTTTTTTTTTTTTTTCATCTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTT
+
11111111AFFFGGGE1A0000AAAE?/01D1G22212A12DA22ADA11ADE/////>D1B1@@FG1BB22@1BBF12@/>?######################################################################################################################
@M01378:492:000000000-BLK46:1:1101:11212:3937 2:N:0: TAAGGCGA-TTAAGGAG
TCTGACCTCTTACCTGCGGAGACGGTGACCCAAGCTAGTTGAACTTGGAAGCCTGTTTGAACTCCCAGATCTCAGGCTGCTCAGCTCCATTCCCCCACTGTCTCTTATACACATCTGTCGCTGCCGACGTCTCCTTACGTGTTGTTCTCTGTTTTCTTCTTTTCTTTTTTATCTCTTTTTTTGTTGTTCTTTTGTTTTTTC
+
11>1>11BFFF@FGGG1A10AABABECE1A000//11112122A1A11110/0B0111011A1AAA01/BAFE220>>0>FE101BB@1@F2@1>//>01BBGFHG1D22211B>E122//?//0//////0011121/@F@0/@/11111?1<01111111?######################################


# here is the fastp command, correct me if I'm wrong
$ fastp -i 782404_V1_L1.N701_505_1.fastq.gz -I 782404_V1_L1.N701_505_2.fastq.gz -o 782404_V1_L1.N701_505_umi_1.fastq -O 782404_V1_L1.N701_505_umi_2.fastq -U --umi_loc=per_read --umi_len=8


# the output reads only have the UMI per read, not both as I had hoped and perhaps explained incorrectly

$ head -n 8 782404_V1_L1.N701_505_umi_1.fastq
@M01378:492:000000000-BLK46:1:1101:11212:3937:TGGGGGAA 1:N:0: TAAGGCGA-TTAAGGAG
TGGAGCTGAGCAGCCTGAGATCTGGGCGTTCACCCAGGCTTCCACGTTCCCCTCGCTTGGGTCACCGTCTCCTCAGGTAAGAGGTCAGCCTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACTCAGGTGCGATGTGCAGCTGTCTTGTG
+
FFFFGGGGDACE4FGEFFHGHGDHCBGGGGFHGFH1G1BAFGHHHDCGFFGFHG?EFHGGEEFFFGGHGHHHHHHBGGHBB4FEFGGFBGHFHHHHHHFGEFHFGFHGHG/@/<BFFFC?/?DHHH1GFCFGCEHHFCEGFFDFDFAEFHBGHFF.00CC#################################
@M01378:492:000000000-BLK46:1:1101:11675:3965:CAGGGGGG 1:N:0: TAAGGCGA-CTAAGGAG
TGGAGCTGAGCAGCCTGAAGTGCAACCGGGAAGGGAAGGAGTGGGAGACGGTACTCACCAGCCGGACCCTCACTGCTGCGGGCAGCTGTGACGTGGTGTGTGTCGCCTGTGAAAAAAGGATGCTGTCAGTGTTCTCCACCTGTGGTCACCGTCTCCTCAGGTAAGGCGCTTCTCTGTCTCTTATACACATCTC
+
@DDAGGF1@GFGGGF0BFHB>FGF1C0E/<B//C//0/<0/?FFAGFCFGAAGCFHE1FD0FF--<-<CGHHGCGHEHEG-?@@.9FFC/0;CEBAAAABAFFF--;B?9B/;/9-9B?BFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFB/;FFFF?@@FFFFFFFFFFFBFFFFFFFFFF9


$ head -n 8 782404_V1_L1.N701_505_umi_2.fastq
@M01378:492:000000000-BLK46:1:1101:11212:3937:TCTGACCT 2:N:0: TAAGGCGA-TTAAGGAG
CTTACCTGCGGAGACGGTGACCCAAGCTAGTTGAACTTGGAAGCCTGTTTGAACTCCCAGATCTCAGGCTGCTCAGCTCCATTCCCCCACTGTCTCTTATACACATCTGTCGCTGCCGACGTCTCCTTACGTGTTGTTCTCTGTTTTCTTCTTTTCTTTTTTATCTCTTTTTTTGTTGTTCTTTTGTTTTTTC
+
FFF@FGGG1A10AABABECE1A000//11112122A1A11110/0B0111011A1AAA01/BAFE220>>0>FE101BB@1@F2@1>//>01BBGFHG1D22211B>E122//?//0//////0011121/@F@0/@/11111?1<01111111?######################################
@M01378:492:000000000-BLK46:1:1101:11675:3965:TGAAGCGC 2:N:0: TAAGGCGA-CTAAGGAG
CTTACCTGAGGAGACGGTGACCACAGGTTGTTCACACTGACAGCCTCCTTTTTTCACAGTCTACACACACCACGTCACAGCTGCCCGCAGCAGTGAGGTTCCGGCTGTTGTGTACCGTCTCCCACTCCTTCCCTTCCCGTTTGCACTTCAGTCTGCTCAGCTCCACCCCCCTGCTGTCTCTTATACACATCTG
+
>AADGGGF1B01A0BA0AA01B1A00AB10//1A12A011111//AB/A1FDFGB2D211@121101?/?>///B//>B1/B110//////00>11100B1<E//>/0B10/B2@2<</@0100<?0<GDF0FGHG00..0<A<11>B1<11=F0=D<0//=</C/..-:;@ACB0;BFFFFGB0B00CFFF0


carlandt commented on August 29, 2024

As an example, I was hoping that first forward read would have come out with the UMI of both itself and the reverse read delimited in some way:

@M01378:492:000000000-BLK46:1:1101:9647:3917:CACGCGAG-TGAGGGTG 1:N:0: TAAGGCGA-TTAAGGAG
CACGCGAGTGGAGCTGAGCAGCCTGAGATCTGACCGTCTCCTCAGGTACGCACCCTCACTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAGGTATGTTGGTTTATGTTTTTTTTGGTTGGATGTGGTATGGTTTTTTTTTGTTTGTTTTTTGTTAT
+
ABBBBBBBBBBFGGGGFGFGFFHHHHHHGFHFHFFGGGHGGHHFHFFHD1AEEEEHHFHGH5FFGFH5GHHHHHGFFHHGCEEEFHGC?FDFFGFBBG//<E/F?CF/GFBDDG@GF2G2@DHF0F2FFC#######################################################################

That way the read name of the forward and the reverse read would be the same (except for the 1:N:0 part) and BWA would still take it.
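
As a rough way to check this requirement, the Python sketch below compares the read ID (the name line up to the first whitespace) of each mate. The helper names are hypothetical, and it assumes the two lists of name lines are in the same order, as they are in paired FASTQ files.

```python
from typing import Iterable, Iterator, Tuple


def read_id(name_line: str) -> str:
    """Return the read ID: the name line up to the first whitespace, without '@'."""
    parts = name_line.split(maxsplit=1)
    return parts[0].lstrip("@") if parts else ""


def mismatched_pairs(
    names1: Iterable[str], names2: Iterable[str]
) -> Iterator[Tuple[int, str, str]]:
    """Yield (pair index, id1, id2) for every pair whose mates have different IDs."""
    for i, (n1, n2) in enumerate(zip(names1, names2)):
        id1, id2 = read_id(n1), read_id(n2)
        if id1 != id2:
            yield i, id1, id2
```

Names tagged with only each read's own UMI (`...:3937:TGGGGGAA` versus `...:3937:TCTGACCT`) are reported as mismatched, while names that both carry a combined UMI such as `CACGCGAG-TGAGGGTG` pass.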


sfchen commented on August 29, 2024

Could you please confirm you used the latest fastp?

It seems your result was obtained with an old version of fastp. The fastp on bioconda is still old, so you have to download it from http://opengene.org/fastp/fastp, or use git to clone the latest code and build it.

With the --umi_loc=per_read option, the latest version of fastp outputs reads like:

@NS500713:64:HFKJJBGXY:1:11101:1675:1101:TAGGAGGC_TAGGGCAA 1:N:0:TATAGCCT+GACCCCCA
TTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAATTTTTAAACCCAGGCAGCTTCCTGGCAGTGACATTTGGAGCATCAAAGTGGTAAATAAAATTTCATTTACATTAATAT
+
EEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAAA/E/A/6E/6//6<EAAEEE/EEEA/EA/EEEEEE/<<EEEE//A/EE<AEEEEE/</AA</E<AAAE/E<E/


carlandt commented on August 29, 2024

The work up above should have been 0.12.6

Sure, downloaded again:

$ ./fastp --version
fastp: an ultra-fast all-in-one FASTQ preprocessor
version 0.13.1

Yup, output looks mostly as you described. Thank you!

Checked a handful of reads, here is the output. Of the first four read pairs, two pairs gave output. Were the others just low quality? This sample did lose a great many of the reads due to that...

# the first four forward reads:
$ zcat 782404_V1_L1.N701_505_1.fastq.gz | head -n 16
@M01378:492:000000000-BLK46:1:1101:9647:3917 1:N:0: TAAGGCGA-TTAAGGAG
CACGCGAGTGGAGCTGAGCAGCCTGAGATCTGACCGTCTCCTCAGGTACGCACCCTCACTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAGGTATGTTGGTTTATGTTTTTTTTGGTTGGATGTGGTATGGTTTTTTTTTGTTTGTTTTTTGTTAT
+
ABBBBBBBBBBFGGGGFGFGFFHHHHHHGFHFHFFGGGHGGHHFHFFHD1AEEEEHHFHGH5FFGFH5GHHHHHGFFHHGCEEEFHGC?FDFFGFBBG//<E/F?CF/GFBDDG@GF2G2@DHF0F2FFC#######################################################################
@M01378:492:000000000-BLK46:1:1101:11212:3937 1:N:0: TAAGGCGA-TTAAGGAG
TGGGGGAATGGAGCTGAGCAGCCTGAGATCTGGGCGTTCACCCAGGCTTCCACGTTCCCCTCGCTTGGGTCACCGTCTCCTCAGGTAAGAGGTCAGCCTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACTCAGGTGCGATGTGCAGCTGTCTTGTG
+
BBBBBBBBFFFFGGGGDACE4FGEFFHGHGDHCBGGGGFHGFH1G1BAFGHHHDCGFFGFHG?EFHGGEEFFFGGHGHHHHHHBGGHBB4FEFGGFBGHFHHHHHHFGEFHFGFHGHG/@/<BFFFC?/?DHHH1GFCFGCEHHFCEGFFDFDFAEFHBGHFF.00CC#################################
@M01378:492:000000000-BLK46:1:1101:20250:3942 1:N:0: TAAGGCGA-TTAAGGAG
TCCCCAAATGGAGCTGAGCAGCCTGAGATCTGTCACCGTCTCCTCAGGTAAGCCTACCGACTGTCTCTTATACACATCTCCTATCCCACGATACTAAGGCGAATCTCGTATTCCTTCTTCTTCTTTAAAAAAAAAATTTTTGGTTTATTTTATTTTGTTTTGTTGTTGTTTTTTATTTTTTTGTTTTTGTTTTTTTGTTGT
+
@AAAAFAFFFFFGGGGGF1F1C0A0BB1FHBHH321B0FCGBF1BF10G121A11B0B//AEA1AG2FEFHFB2F1A@F1D1121B1BEE/B>>G2210>///>/2B0?/01222B1FGBF122B12B1B#######################################################################
@M01378:492:000000000-BLK46:1:1101:11675:3965 1:N:0: TAAGGCGA-CTAAGGAG
CAGGGGGGTGGAGCTGAGCAGCCTGAAGTGCAACCGGGAAGGGAAGGAGTGGGAGACGGTACTCACCAGCCGGACCCTCACTGCTGCGGGCAGCTGTGACGTGGTGTGTGTCGCCTGTGAAAAAAGGATGCTGTCAGTGTTCTCCACCTGTGGTCACCGTCTCCTCAGGTAAGGCGCTTCTCTGTCTCTTATACACATCTC
+
@AAAADDD@DDAGGF1@GFGGGF0BFHB>FGF1C0E/<B//C//0/<0/?FFAGFCFGAAGCFHE1FD0FF--<-<CGHHGCGHEHEG-?@@.9FFC/0;CEBAAAABAFFF--;B?9B/;/9-9B?BFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFB/;FFFF?@@FFFFFFFFFFFBFFFFFFFFFF9


# the first four reverse reads
$ zcat 782404_V1_L1.N701_505_2.fastq.gz | head -n 16
@M01378:492:000000000-BLK46:1:1101:9647:3917 2:N:0: TAAGGCGA-TTAAGGAG
TGAGGGTGCTTACCTGCGGCGACGGTCAGATCTCTTTCTTCTCATCTCCACTCGCTTGCTGTCTCTTATACACATCTGACGCTGCCGACTACTCCTTTCTTGTTTTTTTTTTTTTTCTTCTTTTTTTTTTTTTTCATCTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTT
+
11111111AFFFGGGE1A0000AAAE?/01D1G22212A12DA22ADA11ADE/////>D1B1@@FG1BB22@1BBF12@/>?######################################################################################################################
@M01378:492:000000000-BLK46:1:1101:11212:3937 2:N:0: TAAGGCGA-TTAAGGAG
TCTGACCTCTTACCTGCGGAGACGGTGACCCAAGCTAGTTGAACTTGGAAGCCTGTTTGAACTCCCAGATCTCAGGCTGCTCAGCTCCATTCCCCCACTGTCTCTTATACACATCTGTCGCTGCCGACGTCTCCTTACGTGTTGTTCTCTGTTTTCTTCTTTTCTTTTTTATCTCTTTTTTTGTTGTTCTTTTGTTTTTTC
+
11>1>11BFFF@FGGG1A10AABABECE1A000//11112122A1A11110/0B0111011A1AAA01/BAFE220>>0>FE101BB@1@F2@1>//>01BBGFHG1D22211B>E122//?//0//////0011121/@F@0/@/11111?1<01111111?######################################
@M01378:492:000000000-BLK46:1:1101:20250:3942 2:N:0: TAAGGCGA-TTAAGGAG
TCTTTCTTCTTACCTGAGGAGACGGTGACATTTCTCCTTCTTCTCTGCTCCTTTTTTTTTCTTTCTCTTTTACTCTTCTGTCGCTGCCGTCTTCTCCTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTT
+
111>13BBBDFFGGGG1AA11AB00A0A133332A21111A12212111111121B////0A1B2A2@D1B211@##############################################################################################################################
@M01378:492:000000000-BLK46:1:1101:11675:3965 2:N:0: TAAGGCGA-CTAAGGAG
TGAAGCGCCTTACCTGAGGAGACGGTGACCACAGGTTGTTCACACTGACAGCCTCCTTTTTTCACAGTCTACACACACCACGTCACAGCTGCCCGCAGCAGTGAGGTTCCGGCTGTTGTGTACCGTCTCCCACTCCTTCCCTTCCCGTTTGCACTTCAGTCTGCTCAGCTCCACCCCCCTGCTGTCTCTTATACACATCTG
+
11111111>AADGGGF1B01A0BA0AA01B1A00AB10//1A12A011111//AB/A1FDFGB2D211@121101?/?>///B//>B1/B110//////00>11100B1<E//>/0B10/B2@2<</@0100<?0<GDF0FGHG00..0<A<11>B1<11=F0=D<0//=</C/..-:;@ACB0;BFFFFGB0B00CFFF0


# looking for the read name of the first read pair in the output fastq files, no luck
$ cat *.fastq | grep "@M01378:492:000000000-BLK46:1:1101:9647:3917"
<nothing>

# the second pair works
$ cat *.fastq | grep "@M01378:492:000000000-BLK46:1:1101:11212:3937"
@M01378:492:000000000-BLK46:1:1101:11212:3937:TGGGGGAA_TCTGACCT 1:N:0: TAAGGCGA-TTAAGGAG
@M01378:492:000000000-BLK46:1:1101:11212:3937:TGGGGGAA_TCTGACCT 2:N:0: TAAGGCGA-TTAAGGAG

# the third pair doesn't show
$ cat *.fastq | grep "@M01378:492:000000000-BLK46:1:1101:20250:3942"
<nothing>

# the fourth pair does
$ cat *.fastq | grep "@M01378:492:000000000-BLK46:1:1101:11675:3965"
@M01378:492:000000000-BLK46:1:1101:11675:3965:CAGGGGGG_TGAAGCGC 1:N:0: TAAGGCGA-CTAAGGAG
@M01378:492:000000000-BLK46:1:1101:11675:3965:CAGGGGGG_TGAAGCGC 2:N:0: TAAGGCGA-CTAAGGAG


carlandt commented on August 29, 2024

Ah, yes, it was quality: if I turn off quality filtering they all come back. Looks like this issue is closed, thank you again!

$ cat *.fastq | grep "@M01378:492:000000000-BLK46:1:1101:20250:3942"
@M01378:492:000000000-BLK46:1:1101:20250:3942:TCCCCAAA_TCTTTCTT 1:N:0: TAAGGCGA-TTAAGGAG
@M01378:492:000000000-BLK46:1:1101:20250:3942:TCCCCAAA_TCTTTCTT 2:N:0: TAAGGCGA-TTAAGGAG

$ cat *.fastq | grep "@M01378:492:000000000-BLK46:1:1101:9647:3917"
@M01378:492:000000000-BLK46:1:1101:9647:3917:CACGCGAG_TGAGGGTG 1:N:0: TAAGGCGA-TTAAGGAG
@M01378:492:000000000-BLK46:1:1101:9647:3917:CACGCGAG_TGAGGGTG 2:N:0: TAAGGCGA-TTAAGGAG
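
For readers wondering why some pairs disappeared, here is a rough Python sketch of a per-read quality filter of the kind described. The thresholds (Phred 15, at most 40% low-quality bases) are assumptions based on what fastp documents as its defaults, so check `fastp --help` for your version. The rule that a pair is dropped when either mate fails is also an assumption, and the function names are hypothetical. This is not fastp's code.

```python
def low_quality_fraction(quality: str, min_phred: int = 15, offset: int = 33) -> float:
    """Fraction of bases whose Phred score (ASCII code minus offset) is below min_phred."""
    if not quality:
        return 0.0
    low = sum(1 for ch in quality if ord(ch) - offset < min_phred)
    return low / len(quality)


def pair_passes(
    q1: str, q2: str, min_phred: int = 15, max_low_fraction: float = 0.40
) -> bool:
    """Assumed rule: keep the pair only if both mates stay within the limit."""
    return all(
        low_quality_fraction(q, min_phred) <= max_low_fraction for q in (q1, q2)
    )
```

In Phred+33 encoding `#` is Q2 and `/` is Q14, so both count as low quality under the assumed threshold. The reverse reads of the two missing pairs above end in long runs of `#`, which is consistent with the pairs being dropped by a filter like this.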
