linsalrob / fastq-pair

Match up paired end fastq files quickly and efficiently.

Home Page: https://edwards.flinders.edu.au/sorting-and-paring-fastq-files/

License: MIT License

CMake 1.44% C 98.56%

fastq-pair's Introduction

FASTQ PAIR

Rewrite paired end fastq files to make sure that all reads have a mate and to separate out singletons.

This code does one thing: it takes two fastq files, and generates four fastq files. That's right, for free it doubles the number of fastq files that you have!!

Usually when you get paired end read files you have two files with a /1 sequence in one and a /2 sequence in the other (or a /f and /r, or just two reads with the same ID). However, when working with files from a third party source (e.g. the SRA) there are often different numbers of reads in each file, because some reads fail QC. SPAdes, bowtie2 and other tools break because they demand that paired end files have the same number of reads.
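To make the naming conventions above concrete, here is a minimal sketch of how a header could be reduced to a key shared by both mates. The function name `pairing_key` and the exact suffix rule are assumptions for illustration only; this is not necessarily how fastq_pair itself compares IDs.

```python
import re

# Assumed mate suffixes, taken from the conventions described above: /1, /2, /f, /r.
_MATE_SUFFIX = re.compile(r"/[12fr]$")


def pairing_key(header):
    """Return the part of a FASTQ header that both mates are assumed to share."""
    fields = header.strip().split()
    if not fields:
        return ""
    read_id = fields[0]  # ignore any comment after the first whitespace
    if read_id.startswith("@"):
        read_id = read_id[1:]
    return _MATE_SUFFIX.sub("", read_id)


# Both mates of a hypothetical read map to the same key.
print(pairing_key("@read7/1"), pairing_key("@read7/2"))
```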

This program solves that problem.

It rewrites the two files provided on the command line so that the paired sequences appear in the same order in both, and places any reads without a mate in two separate files, one for each original file.

This code is designed to be fast and memory efficient, and works with large fastq files. It does not store the whole file in memory, but rather keeps in memory only an index of the locations of the reads in the first file provided.
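The idea of indexing one file by position and then streaming the other can be sketched in a few lines of Python. This is an illustration of the general technique under simplifying assumptions (well-formed four-line records, identical IDs in both files, a Python dict standing in for the hash table); the function names are hypothetical and this is not the C implementation.

```python
import io


def index_records(handle):
    """Map each read ID in a binary FASTQ stream to the byte offset of its record."""
    offsets = {}
    while True:
        start = handle.tell()
        header = handle.readline()
        if not header:
            break
        for _ in range(3):  # skip the sequence, '+' and quality lines
            handle.readline()
        offsets[header.split()[0]] = start
    return offsets


def read_record(handle, offset):
    """Seek to a stored offset and return the four-line record found there."""
    handle.seek(offset)
    return b"".join(handle.readline() for _ in range(4))


def pair(first, second):
    """Return (paired_first, paired_second, single_first, single_second)."""
    offsets = index_records(first)  # only IDs and offsets are held in memory
    paired_first, paired_second, single_second = [], [], []
    while True:
        lines = [second.readline() for _ in range(4)]
        if not lines[0]:
            break
        key = lines[0].split()[0]
        if key in offsets:
            paired_first.append(read_record(first, offsets.pop(key)))
            paired_second.append(b"".join(lines))
        else:
            single_second.append(b"".join(lines))
    # Whatever is left in the index never found a mate.
    single_first = [read_record(first, off) for off in sorted(offsets.values())]
    return paired_first, paired_second, single_first, single_second


# Tiny in-memory demo: 'b' is in both files, 'a' and 'c' are singletons.
left = io.BytesIO(b"@a\nAC\n+\nII\n@b\nGT\n+\nII\n")
right = io.BytesIO(b"@b\nTT\n+\nII\n@c\nGG\n+\nII\n")
paired_left, paired_right, single_left, single_right = pair(left, right)
```

Because records are fetched with `seek`, the input has to support random access, which is also why compressed streams are awkward for this approach.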

Speed and efficiency considerations

The most efficient way to use this code is to provide the smallest file first (though it works whichever way round you provide the files), and then to tune the -t parameter on the command line. The implementation is based on a hash table, and the size of that table has the biggest effect on how fast the code runs. If you set the table size too low, the buckets quickly fill up and each lookup degrades toward a linear search, O(n). On the other hand, if you set the table size too big, you waste a lot of memory and it takes longer to initialize the data structures safely.

The optimal table size is roughly the number of sequences in your fastq files. You can quickly count the lines in your fastq file:

wc -l fastq_filename

The number of sequences will be the number printed here, divided by 4.
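The same arithmetic can be written as a small Python helper, shown here as a sketch. The function name `count_sequences` is hypothetical, and the strict "four lines per record" check is an assumption about well-formed input rather than something fastq_pair is known to enforce.

```python
def count_sequences(lines):
    """Count FASTQ records in an iterable of lines, assuming four lines per record."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


# With a real file you would pass the open handle, for example:
#     with open("file1.fastq") as handle:
#         print(count_sequences(handle))
# and use the result as a starting value for -t.
print(count_sequences(["@r1", "ACGT", "+", "IIII"] * 3))
```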

Note: If you get an error that looks like

"We cannot allocate the memory for a table size of -436581356. Please try a smaller value for -t"

you are probably suffering from an integer overflow, so try reducing the value you are providing to the -t option. See issue 12 for more details.
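A negative table size in that message is what you would expect if the value were stored in a 32-bit signed integer, which is an assumption here rather than something the message confirms. The sketch below shows the wrap-around; `as_int32` is a hypothetical helper, and 3858385940 is simply one value that wraps to the number in the error, not necessarily what the user typed.

```python
INT32_MAX = 2**31 - 1


def as_int32(value):
    """Interpret the low 32 bits of value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


# Values up to INT32_MAX survive; anything larger wraps around.
print(as_int32(50021))
print(as_int32(3858385940))
```

Under that assumption, keeping -t at or below 2147483647 avoids the wrap.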

If you are not sure, you can run this code with the -p parameter. Before it prints out the matched pairs of sequences, it will print out the number of sequences in each "bucket" in the table. If these numbers are more than about a dozen, you need to increase the value you provide to -t. If most of the entries are zero, you should decrease the value of -t.
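To see why bucket occupancy depends on the table size, here is a rough sketch that spreads read IDs over a table and counts the entries per bucket. The CRC32 hash is a stand-in chosen only because it is deterministic; fastq_pair's own hash function may well differ, and the names used here are hypothetical.

```python
import zlib


def bucket_counts(read_ids, table_size):
    """Count how many IDs land in each bucket of a table with table_size slots."""
    counts = [0] * table_size
    for read_id in read_ids:
        # Stand-in hash: CRC32 of the ID, reduced modulo the table size.
        counts[zlib.crc32(read_id.encode()) % table_size] += 1
    return counts


ids = [f"read{i}" for i in range(1000)]
for size in (10, 1000, 100000):
    counts = bucket_counts(ids, size)
    empty = sum(1 for c in counts if c == 0)
    print(size, "fullest bucket:", max(counts), "empty buckets:", empty)
```

A very small table gives a few crowded buckets, while a very large one leaves most buckets empty, which matches the two symptoms described above.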

As an aside, this code is also really slow if none of your sequences are paired. You should most likely use this after taking a peek at your files and making sure there are at least some paired sequences in your files!

Installing fastq_pair

We recommend installing fastq-pair using bioconda:

mamba install -c bioconda fastq-pair

or in its own environment:

mamba create --name fastq-pair -c bioconda fastq-pair

Installing from source

To install the code, clone the GitHub repository, then make a build directory and build from there (on some systems the command is cmake rather than cmake3):

mkdir build && cd build
cmake3 ..
make && sudo make install

There are more instructions on the installation page.

Running fastq_pair

fastq_pair takes two primary arguments: the names of the two fastq files that you want to pair up.

fastq_pair file1.fastq file2.fastq

You can also change the size of the hash table using the -t parameter:

fastq_pair -t 50021 file1.fastq file2.fastq

You can also print out the number of elements in each bucket using the -p parameter:

fastq_pair -p -t 100 file1.fastq file2.fastq

Testing fastq_pair

In the test directory there are two fastq files that you can use to test fastq_pair. There are 250 sequences in the left file and 75 sequences in the right file. Only 50 sequences are common between the two files.

You can test the code with:

fastq_pair -t 1000 test/left.fastq test/right.fastq

This will make four files in the test/ directory:

left.fastq.paired.fq
left.fastq.single.fq
right.fastq.paired.fq
right.fastq.single.fq

The paired files have 50 sequences each, and the two single files have 200 and 25 sequences (left and right respectively).
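The expected counts follow directly from the totals given above, as this small sketch shows. The function name `expected_counts` is hypothetical and only restates the arithmetic.

```python
def expected_counts(left_total, right_total, shared):
    """Return the number of records expected in each of the four output files."""
    return {
        "left_paired": shared,
        "right_paired": shared,
        "left_single": left_total - shared,
        "right_single": right_total - shared,
    }


# 250 left reads, 75 right reads, 50 in common.
print(expected_counts(250, 75, 50))
```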

A note about gzipped fastq files

Unfortunately fastq_pair doesn't work with gzipped files at the moment, because it relies heavily on random access of the file stream. That is complex with gzipped files, especially when the uncompressed file exceeds available memory (which is exactly the situation that fastq_pair was designed to handle).

Therefore, at this time, fastq_pair does not support gzipped files. You need to uncompress the files before using fastq_pair.

If you really need to use gzipped files, and can accept slightly worse performance, then we have some alternative approaches written in Python that you can try.

Testing for gzipped files (issue #6)

We take a peek at the first couple of bytes in the file to see if the file is gzip compressed. Per the standard, a gzip file starts with the two bytes 0x1F and 0x8B. There is a small test program for this check, called test_gzip.c, that takes a single argument and reports whether that file is gzipped or not. You can compile that tester with the command:

gcc -std=gnu99 -o testgz ./test_gzip.c is_gzipped.c

We now test both files and exit (hopefully gracefully) if either is gzip compressed. The easiest solution is to uncompress your files, and we recommend and love pigz because it is awesome!
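The same magic-byte check is easy to reproduce outside the tool. The following Python sketch mirrors the idea described above; the function names are hypothetical and it is not a port of is_gzipped.c.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip member


def looks_gzipped(first_bytes):
    """Return True if the given leading bytes carry the gzip magic number."""
    return first_bytes[:2] == GZIP_MAGIC


def file_looks_gzipped(path):
    """Peek at the first two bytes of a file on disk."""
    with open(path, "rb") as handle:
        return looks_gzipped(handle.read(2))


# In-memory demo: compress a tiny FASTQ record and compare with the plain text.
plain = b"@r1\nACGT\n+\nIIII\n"
compressed = gzip.compress(plain)
print(looks_gzipped(plain), looks_gzipped(compressed))
```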

Citing fastq_pair

Please see the CITATION file for the current citation for fastq-pair.

fastq-pair's Issues

pairing bug

original left read

@SRR8996821.1 1/1
CTCCGTTTCCGACCTGGGCCGGTTCNCCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNNNNNNNNNNNCNTNGNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCAANCGATAG
+
AAFFFKAKA,<AKKKKKKF7AFFKK#7K##A###########################K###########################K#K#K#F###############################################A<,#AAFFKK
@SRR8996821.2 2/1
CTGGAGTGCAGTGGCTATACACAGGNGCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNNNNNNNNNNNNNNNNNNNNNNNNTNCNCNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGANGCCGAA
+
A<FFFFKKKK<AKAAAFA,,FKKKK#KF##K###########################7###########################F#,#A#F###############################################F7<#,,<FKF
@SRR8996821.3 3/1
AGATACCATGATCACGAAGGTGGTTNTCNNANNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNTNTNGNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCGGNTGAACT

original right read

@SRR8996821.1 1/1
CTCCGTTTCCGACCTGGGCCGGTTCNCCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNNNNNNNNNNNCNTNGNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCAANCGATAG
+
AAFFFKAKA,<AKKKKKKF7AFFKK#7K##A###########################K###########################K#K#K#F###############################################A<,#AAFFKK
@SRR8996821.2 2/1
CTGGAGTGCAGTGGCTATACACAGGNGCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNNNNNNNNNNNNNNNNNNNNNNNNTNCNCNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGANGCCGAA
+
A<FFFFKKKK<AKAAAFA,,FKKKK#KF##K###########################7###########################F#,#A#F###############################################F7<#,,<FKF
@SRR8996821.3 3/1
AGATACCATGATCACGAAGGTGGTTNTCNNANNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNTNTNGNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCGGNTGAACT

output left read

@SRR8996821.1 1/1
CTCCGTTTCCGACCTGGGCCGGTTCNCCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNNNNNNNNNNNCNTNGNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCAANCGATAG
+
AAFFFKAKA,<AKKKKKKF7AFFKK#7K##A###########################K###########################K#K#K#F###############################################A<,#AAFFKK
@SRR8996821.1 1/1
CTCCGTTTCCGACCTGGGCCGGTTCNCCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNNNNNNNNNNNCNTNGNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCAANCGATAG
+
AAFFFKAKA,<AKKKKKKF7AFFKK#7K##A###########################K###########################K#K#K#F###############################################A<,#AAFFKK
@SRR8996821.3 3/1
AGATACCATGATCACGAAGGTGGTTNTCNNANNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNTNTNGNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCGGNTGAACT
AGATACCATGATCACGAAGGTGGTTNTCNNANNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNTNTNGNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCGGNTGAACT
AGATACCATGATCACGAAGGTGGTTNTCNNANNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNTNTNGNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCGGNTGAACT

output right read

@SRR8996821.1 1/2
ATCGCTTGAGTACAGGNGTTCTGGGNTGNAGTNNNNNNTNNCNANCNGGTNTNCGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNGGANNGGGGNACCACCANGNTGCCTACNGNGGNNNGANCCNGCNAAGGTCGNGANNNGNGAG
+
AAFFFKKKKKK,7FKK#KKKKKKKK#KK#KKK######K##K#K#K#KAF#F#7K###############################K#KFK##,FFA#,FKFKK7#F#7F,AFA7#7#,F###FF#KF#A7#,7<A,<,#,,###<#A7A
@SRR8996821.2 2/2
CCGCACTAAGTTCGGCNTCAATATGNTGNCCTNNNNNNANNGNGNGNCCANCNGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNAAANNGAGCNGGTCAAAACNCCCGTGCNGNTCNNNAGNGGNATNGCGCCTGNGAANNGNCAC
+
,A<FFFKKKKKKFKKA#KFFKKKKK#AF#KFK######7##7#F#K#7,A#K#7F###############################<#FKK##KFFF#KKF,AFFAA#FKAFKKK#,#,,###F,#<K#KF#,,AA<,7#,AA##,#,AF
@SRR8996821.3 3/2
CCCCCACTACCACAAANTATGCAGTNGANTTTNNNNCNTNNGNGNANATCNCNGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNCGCNNTGGGNAAAGCACCTNCGTGATCNTNCTNNNTANATNGGNAGAGCGTNGTGTNGNGAA
CCCCCACTACCACAAANTATGCAGTNGANTTTNNNNCNTNNGNGNANATCNCNGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNCGCNNTGGGNAAAGCACCTNCGTGATCNTNCTNNNTANATNGGNAGAGCGTNGTGTNGNGAA
CCCCCACTACCACAAANTATGCAGTNGANTTTNNNNCNTNNGNGNANATCNCNGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNCGCNNTGGGNAAAGCACCTNCGTGATCNTNCTNNNTANATNGGNAGAGCGTNGTGTNGNGAA

Output corrupted with gzipped reads

When I give two gzipped read files to fastq_pair, it seems to function normally. The output files are named with the .fq extension, but they are neither text files nor gzipped files.

$ fastq_pair SAMPLE_1.fq.gz SAMPLE_2.fq.gz
Writing the paired reads to SAMPLE_1.fq.gz.paired.fq and SAMPLE_2.fq.gz.paired.fq.
Writing the single reads to SAMPLE_1.fq.gz.single.fq and SAMPLE_2.fq.gz.single.fq
Left paired: 11619              Right paired: 11619
Left single: 381534             Right single: 418985

$ file -i SAMPLE_1.fq.gz.paired.fq
SAMPLE_1.fq.gz.paired.fq: application/octet-stream; charset=binary

$ gzip -t SAMPLE_1.fq.gz.paired.fq
gzip: SAMPLE_1.fq.gz.paired.fq: not in gzip format

Sequence header must start with @

Hi,
Thank you for developing this useful tool. I have an issue using it. I found that the sequence headers start with 1, 2, and so on. It looks like a number has been added at the front of each line, which blocks the downstream analysis. Is there a way to get the output files in the original .fastq format?
Thank you so much.

Error on very large fastq file

Hello,
I am analyzing a very large fastq file (~240 Gb for each pair).

I have this error when running fastq-pair.

"We cannot allocate the memory for a table size of -436581356. Please try a smaller value for -t"

Could you provide some solution for this?

Thanks
Sam

Very large files - >265 gb

Hello,

I have some very large metagenomes. R1 and R2 are each >265 GB. I gave fastq_pair a large amount of memory (2 TB), but it is still taking a long time.

Any thoughts?

Added to homebrew-bio

brew install brewsci/bio/fastq-pair

brewsci/homebrew-bio#726

Pair mismatched when read name is close

There is a read in R2 named
NS500207:121:HTFVJAFXX:1:11111:13010:1958
Later there is another named
NS500207:121:HTFVJAFXX:1:11111:13010:19581

R1 only has a read named
NS500207:121:HTFVJAFXX:1:11111:13010:1958

This tool will mistakenly write into R1.paired.fq
NS500207:121:HTFVJAFXX:1:11111:13010:1958
a second time thinking that such a pair existed.

This leads to mismatches in tools that require absolute name matching for read merging, such as usearch.

First read repeated with reads from SRA

I have a problem with reads downloaded from SRA. When the first read is present in the dataset, it gets repeated in the paired output.

attaching two 20 line files as an example.
r1short.fq.txt
r2short.fq.txt

These are the headers of the output for read 1:

@SRR6913986.1 HISEQ:556:HCHYJBCXY:2:1101:3383:2114 length=301
@SRR6913986.1 HISEQ:556:HCHYJBCXY:2:1101:3383:2114 length=301
@SRR6913986.3 HISEQ:556:HCHYJBCXY:2:1101:3856:2081 length=301
@SRR6913986.4 HISEQ:556:HCHYJBCXY:2:1101:3919:2100 length=301
@SRR6913986.6 HISEQ:556:HCHYJBCXY:2:1101:4387:2166 length=301

and read 2

@SRR6913986.1 HISEQ:556:HCHYJBCXY:2:1101:3383:2114 length=301
@SRR6913986.2 HISEQ:556:HCHYJBCXY:2:1101:3520:2051 length=301
@SRR6913986.3 HISEQ:556:HCHYJBCXY:2:1101:3856:2081 length=301
@SRR6913986.4 HISEQ:556:HCHYJBCXY:2:1101:3919:2100 length=301
@SRR6913986.6 HISEQ:556:HCHYJBCXY:2:1101:4387:2166 length=301

read length > 0 filter option

Feature suggestion:

It would be great to have an option to treat reads with zero length the same as missing reads.
These reads come up from time to time in SRA fastq files, and cause issues with downstream tools if not removed.

Cheers.

Add -V version flag

% fastq_pair -V
fastq_pair 1.0

To stdout + exit code 0

Complement the steps for local installation

Where
cmake3 -DCMAKE_INSTALL_PREFIX=$HOME/bin

update to

cmake3 -DCMAKE_INSTALL_PREFIX=$HOME/bin ..

Discarding unmatched reads

Hey!
Is there any way to keep only the matched reads from both files and discard all reads that do not match, in order to save some storage space?
Thanks

Singleton reads

Provide an option to write the singleton reads to a single file. There is no real reason to keep them separate, though most tools can take multiple singleton reads.

Support gzip file

Hi, crAssphage man,

Any plan to support gzip input ?

https://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files

Good tool, thanks ~

Can't install it in Ubuntu 22

Hi, I am trying to install it, and I consistently get this error:

cmake3 ../
Command 'cmake3' not found, did you mean:
command 'cmake' from snap cmake (3.27.7)
command 'cmake' from deb cmake (3.22.1-1ubuntu1.22.04.1)
See 'snap info ' for additional versions.

Could you suggest how to install it?

Thanks

problem with compilation

I am attempting to install fastq_pair locally and my system doesn't allow a cmake version higher than 3.0.2. As a result, it seems impossible to compile fastq_pair: "CMake error at CMakeLists.txt:1 (cmake_minimum_required): CMake 3.6 or higher required. You are running version 3.0.2"

Any alternative ways to compile?

Why the final report "fprintf(stderr, "Left paired: %d\t\tR..." is stderr?

It is better to put it in stdout.

issues with compilation

I am unable to run make3, so I followed the instructions suggested to another user who experienced a similar problem. However, when creating the /build directory and executing

sudo gcc  ../*.c -o fastq_pair

I get the following error which I can't make sense of:

/usr/bin/ld: /tmp/ccPYVGea.o: in function `main':
test_gzip.c:(.text+0x0): multiple definition of `main'; /tmp/ccSZ6ste.o:main.c:(.text+0x7d): first defined here
collect2: error: ld returned 1 exit status

Any suggestions on how to proceed?

What would happen if line number cannot be exactly divided by four

Thanks for the amazing tool.

I want to ask what happens if the fastq file is noisy and some reads contain fewer than four lines. Does the tool handle this situation?

Best,
Shuai