
Comments (15)

marcelm avatar marcelm commented on August 26, 2024 1

The problem is that tagfastq reads the entire barcodes.fasta file into memory, which is huge (36 GB on my dataset). Since the input reads (in trimmed.[12].fastq.gz) are in the same order as in barcodes.fasta, it’s not actually necessary to do this. I think you should therefore focus on changing the algorithm. Let me know if you need help.

from blr.

FrickTobias avatar FrickTobias commented on August 26, 2024

@marcelm Do you have any insight to dnaio? Or do you think it might be related to something else?


marcelm avatar marcelm commented on August 26, 2024

The problem occurs when dnaio calls into xopen, which tries to open a subprocess to pigz for reading the input gzip file. All of these (dnaio, xopen, pigz) use only small amounts of memory, so possibly some other process is running at the same time, eating up all the memory, so that nothing is left for tagfastq.

But I’ll take a closer look.


FrickTobias avatar FrickTobias commented on August 26, 2024

Awesome, thanks!


marcelm avatar marcelm commented on August 26, 2024

I had also started a pipeline run a couple of days ago, and it also crashed during tagfastq. Rerunning the command manually, I can see that it uses more and more memory while it runs. After six minutes, it was already using 18 GB of memory. I think the tagfastq script itself is using too much memory.


FrickTobias avatar FrickTobias commented on August 26, 2024

Ok, I'll look into the code and see if I can optimize it a bit.


marcelm avatar marcelm commented on August 26, 2024

I had an idea I want to mention: I would suggest changing the barcodes.fasta file so that it includes all the reads, i.e. using no filtering options in Cutadapt. Then one can iterate over the barcodes.fasta file and the input reads simultaneously (similar to how one would merge two sorted lists). I would also change tagfastq so that it reads barcodes.fasta not from disk but from stdin; the Cutadapt output can then be piped directly into it, and no huge intermediate file needs to be created.
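The lockstep iteration could look something like this. This is only an illustrative sketch, not the actual blr code: records are simplified to (name, sequence) tuples, and tag_reads is a made-up name.

```python
# Illustrative sketch of the proposed lockstep iteration: because
# barcodes.fasta and the trimmed reads are in the same order, each read
# can be paired with its barcode without loading either file into memory.
def tag_reads(barcodes, reads):
    """Merge two equally ordered record streams, attaching each barcode
    sequence to the matching read's name."""
    for (bc_name, bc_seq), (read_name, read_seq) in zip(barcodes, reads):
        assert bc_name == read_name, "streams out of sync"
        yield f"{read_name}_BC:{bc_seq}", read_seq

# Stand-ins for the two (equally ordered) files:
barcodes = [("r1", "ACGT"), ("r2", "TTGG")]
reads = [("r1", "CCCTTT"), ("r2", "GGGAAA")]
tagged = list(tag_reads(barcodes, reads))
```

Since both inputs are generators in practice, memory use stays constant regardless of file size.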


pontushojer avatar pontushojer commented on August 26, 2024

I made some modifications to the tagfastq script so that it reads barcodes.fasta in parallel with the reads, and implemented a buffering/cache system to catch any missing reads or barcode entries. Comparing FASTQs with 1 million reads, the improvement was quite big.

Parameter                                      Old script   New script
User time (seconds)                            158.88       160.43
System time (seconds)                          9.54         3.34
Percent of CPU this job got                    182%         300%
Elapsed (wall clock) time (h:mm:ss or m:ss)    1:32.07      0:54.50
Maximum resident set size (kbytes)             271,344      168,844

The updated script can be found in the tagfastq-improvement branch.

This way we don't need to change any other parts of the pipe.

If we were to keep all the entries in barcodes.fasta, this would include untrimmed reads in the clustering. We would then need to implement a way to catch and remove these so they are not treated as legitimate barcodes.

Also, is it really possible to avoid writing the barcodes to disk? We need the file both for the starcode clustering and for tagging the read FASTQs. We would need to split the pipe from extracting barcodes.fasta to both tagfastq and starcode, and then wait until starcode finished before starting tagfastq. Is this really possible, or is there another solution that I am missing?


marcelm avatar marcelm commented on August 26, 2024

I overlooked that starcode also needs the barcodes.fasta file. So that needs to exist on disk. I think what would work is to pipe the output of the other Cutadapt process (that removes barcodes and adapters) into tagfastq instead. That would allow us to skip the trimmed.[12].fastq.gz file and go directly to trimmed_barcoded.[12].fastq.gz. But that should be done as a separate improvement later.
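Consuming the trimmed reads from a pipe rather than from trimmed.[12].fastq.gz could look roughly like the sketch below. The real tagfastq uses dnaio, which can likewise read from a pipe; the hand-rolled parser here is just to show that no intermediate file is required.

```python
import io

def fastq_records(stream):
    """Minimal FASTQ parser yielding (header, sequence, qualities).
    Sketch only: in the actual pipeline the Cutadapt output would be
    piped into tagfastq and consumed via sys.stdin."""
    while True:
        header = stream.readline()
        if not header:
            return
        seq = stream.readline().rstrip("\n")
        stream.readline()  # the '+' separator line
        qual = stream.readline().rstrip("\n")
        yield header.rstrip("\n"), seq, qual

# Works on any text stream, e.g. sys.stdin or a StringIO stand-in:
demo = io.StringIO("@r1\nACGT\n+\nFFFF\n")
records = list(fastq_records(demo))
```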

By the way, if I understand correctly that you only keep the last ten records from barcodes.fasta in memory, then the memory usage should go down a lot more on bigger datasets than what your table above implies.


pontushojer avatar pontushojer commented on August 26, 2024

Yes piping from the read trimming step is a possibility!

So the new script is built to handle either missing barcode entries or missing read entries. For each read header, we want to find the corresponding header in the barcode file. For this I have made a cache that stores barcodes (header: barcode_seq pairs) that don't match the current read header, and a function that searches the cache and reads the barcode file. The function will move forward up to 10 positions in the barcode file to look for the correct header, storing any header that does not match. So the function first looks in the cache and only reads from the file if the entry is not found there. This means that more than ten records might be stored; in fact, all barcode records that never appear in the read file will be kept, as in the current implementation.

The reason I only look ahead into the next 10 barcode records is that I expect missing read entries to be few. Should there be more than 10 missing read entries in a row, this will fail, but I assume such events to be rare. In the 1 million reads in my test, 38 barcode records (0.0038%) lacked corresponding read records. Therefore I think this is fine for now.
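The cache-plus-lookahead scheme described above could be sketched like this. The function name and record format are illustrative, not the actual blr implementation.

```python
MAX_LOOKAHEAD = 10  # as described: scan at most ten records ahead

def find_barcode(read_name, barcode_iter, cache):
    """Return the barcode for read_name: consult the cache first,
    otherwise read ahead (up to MAX_LOOKAHEAD records) in the barcode
    stream, caching every non-matching entry along the way."""
    if read_name in cache:
        return cache.pop(read_name)
    for _ in range(MAX_LOOKAHEAD):
        try:
            name, seq = next(barcode_iter)
        except StopIteration:
            break
        if name == read_name:
            return seq
        cache[name] = seq  # barcode without a (yet) matching read
    return None  # read without a barcode entry
```

Barcodes whose reads never appear stay in the cache indefinitely, which is why memory can still grow slowly with the number of unmatched entries.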


marcelm avatar marcelm commented on August 26, 2024

Can you open a PR so we can discuss this next to the code?


pontushojer avatar pontushojer commented on August 26, 2024

Sure!


FrickTobias avatar FrickTobias commented on August 26, 2024

I'll just test this before I close the issue.


pontushojer avatar pontushojer commented on August 26, 2024

[image: memory profile plot]

Memory profile for running tag_fastq on ~400 million reads on 20 cores on uppmax.


FrickTobias avatar FrickTobias commented on August 26, 2024

[image: memory profile from an earlier run (rackham-snic2018-3-501-tobiasf-10551985)]

Awesome! For future records I thought I'd include a reference from before.

