
Comments (15)

marcelm avatar marcelm commented on August 26, 2024 1

The problem is that tagfastq reads the entire barcodes.fasta file into memory, which is huge (36 GB on my dataset). Since the input reads (in trimmed.[12].fastq.gz) are in the same order as in barcodes.fasta, it’s not actually necessary to do this. I think you should therefore focus on changing the algorithm. Let me know if you need help.

from blr.

FrickTobias avatar FrickTobias commented on August 26, 2024

@marcelm Do you have any insight to dnaio? Or do you think it might be related to something else?


marcelm avatar marcelm commented on August 26, 2024

The problem occurs when dnaio calls into xopen, which tries to open a subprocess to pigz for reading the input gzip file. All of these (dnaio, xopen, pigz) use only small amounts of memory, so possibly some other process is running at the same time, eating up all the memory, so that nothing is left for tagfastq.

But I’ll take a closer look.


FrickTobias avatar FrickTobias commented on August 26, 2024

Awesome, thanks!


marcelm avatar marcelm commented on August 26, 2024

I had also started a pipeline run a couple of days ago, and it also crashed during tagfastq. Rerunning the command manually, I can see that it uses more and more memory while it runs. After six minutes, it was already using 18 GB of memory. I think the tagfastq script itself is using too much memory.


FrickTobias avatar FrickTobias commented on August 26, 2024

Ok, I'll look into the code and see if I can optimize it a bit.


marcelm avatar marcelm commented on August 26, 2024

I had an idea I want to mention: I would suggest changing the barcodes.fasta file so that it includes all the reads, i.e. using no filtering options in Cutadapt. Then one can iterate over the barcodes.fasta file and the input reads simultaneously (similar to how one would merge two sorted lists). I would also change tagfastq so that it reads barcodes.fasta not from disk but from stdin; the Cutadapt output can then be piped directly into it, and no huge intermediate file needs to be created.
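The lockstep iteration could look something like this. This is only an illustrative sketch, not the actual blr code: records are simplified to (name, sequence) tuples, and tag_reads is a made-up name.

```python
# Illustrative sketch of the proposed lockstep iteration: because
# barcodes.fasta and the trimmed reads are in the same order, each read
# can be paired with its barcode without loading either file into memory.
def tag_reads(barcodes, reads):
    """Merge two equally ordered record streams, attaching each barcode
    sequence to the matching read's name."""
    for (bc_name, bc_seq), (read_name, read_seq) in zip(barcodes, reads):
        assert bc_name == read_name, "streams out of sync"
        yield f"{read_name}_BC:{bc_seq}", read_seq

# Stand-ins for the two (equally ordered) files:
barcodes = [("r1", "ACGT"), ("r2", "TTGG")]
reads = [("r1", "CCCTTT"), ("r2", "GGGAAA")]
tagged = list(tag_reads(barcodes, reads))
```

Since both inputs are generators in practice, memory use stays constant regardless of file size.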


pontushojer avatar pontushojer commented on August 26, 2024

I made some modifications to the tagfastq script so that it reads barcodes.fasta in parallel with the reads, and implemented a buffering/cache system to catch any missing reads or barcode entries. Comparing FASTQs with 1 million reads, the improvement was quite big.

Parameter                                      Old script   New script
User time (seconds)                            158.88       160.43
System time (seconds)                          9.54         3.34
Percent of CPU this job got                    182%         300%
Elapsed (wall clock) time (h:mm:ss or m:ss)    1:32.07      0:54.50
Maximum resident set size (kbytes)             271,344      168,844

The updated script can be found in the tagfastq-improvement branch.

This way we don't need to change any other parts of the pipe.

If we were to keep all the entries in barcodes.fasta, this would include untrimmed reads in the clustering. We would then need to implement a way to catch and remove these so they are not treated as legitimate barcodes.

Also, is it really possible to avoid writing the barcodes to disk? We need the file both for the starcode clustering and for tagging the read FASTQs. We would need to split the pipe from extracting barcodes.fasta to both tagfastq and starcode, and then wait until starcode finished before starting tagfastq. Is this really possible, or is there another solution that I am missing?


marcelm avatar marcelm commented on August 26, 2024

I overlooked that starcode also needs the barcodes.fasta file. So that needs to exist on disk. I think what would work is to pipe the output of the other Cutadapt process (that removes barcodes and adapters) into tagfastq instead. That would allow us to skip the trimmed.[12].fastq.gz file and go directly to trimmed_barcoded.[12].fastq.gz. But that should be done as a separate improvement later.
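Consuming the trimmed reads from a pipe rather than from trimmed.[12].fastq.gz could look roughly like the sketch below. The real tagfastq uses dnaio, which can likewise read from a pipe; the hand-rolled parser here is just to show that no intermediate file is required.

```python
import io

def fastq_records(stream):
    """Minimal FASTQ parser yielding (header, sequence, qualities).
    Sketch only: in the actual pipeline the Cutadapt output would be
    piped into tagfastq and consumed via sys.stdin."""
    while True:
        header = stream.readline()
        if not header:
            return
        seq = stream.readline().rstrip("\n")
        stream.readline()  # the '+' separator line
        qual = stream.readline().rstrip("\n")
        yield header.rstrip("\n"), seq, qual

# Works on any text stream, e.g. sys.stdin or a StringIO stand-in:
demo = io.StringIO("@r1\nACGT\n+\nFFFF\n")
records = list(fastq_records(demo))
```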

By the way, if I understand correctly that you only keep the last ten records from barcodes.fasta in memory, then the memory usage should go down a lot more on bigger datasets than what your table above implies.


pontushojer avatar pontushojer commented on August 26, 2024

Yes piping from the read trimming step is a possibility!

So the new script is built to handle either missing barcode entries or missing read entries. For each read header, we want to find the corresponding header in the barcode file. For this I have made a cache that stores barcodes (header: barcode_seq pairs) that don't match the current read header, and a function that searches the cache and reads the barcode file. The function will move forward up to 10 positions in the barcode file to look for the correct header, storing any header that does not match. So the function first looks in the cache and only reads from the file if the entry is not found there. This means that more than ten records might be stored; in fact, all barcode records that never appear in the read file will be kept, as in the current implementation.

The reason I only look ahead into the next 10 barcode records is that I expect missing read entries to be few. Should there be more than 10 missing read entries in a row, this will fail, but I assume such events to be rare. In the 1 million reads in my test, 38 barcode records (0.0038%) lacked corresponding read records. Therefore I think this is fine for now.
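The cache-plus-lookahead scheme described above could be sketched like this. The function name and record format are illustrative, not the actual blr implementation.

```python
MAX_LOOKAHEAD = 10  # as described: scan at most ten records ahead

def find_barcode(read_name, barcode_iter, cache):
    """Return the barcode for read_name: consult the cache first,
    otherwise read ahead (up to MAX_LOOKAHEAD records) in the barcode
    stream, caching every non-matching entry along the way."""
    if read_name in cache:
        return cache.pop(read_name)
    for _ in range(MAX_LOOKAHEAD):
        try:
            name, seq = next(barcode_iter)
        except StopIteration:
            break
        if name == read_name:
            return seq
        cache[name] = seq  # barcode without a (yet) matching read
    return None  # read without a barcode entry
```

Barcodes whose reads never appear stay in the cache indefinitely, which is why memory can still grow slowly with the number of unmatched entries.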


marcelm avatar marcelm commented on August 26, 2024

Can you open a PR so we can discuss this next to the code?


pontushojer avatar pontushojer commented on August 26, 2024

Sure!


FrickTobias avatar FrickTobias commented on August 26, 2024

I'll just test this before I close the issue.


pontushojer avatar pontushojer commented on August 26, 2024

[image: memory profile plot]

Memory profile for running tag_fastq on ~400 million reads on 20 cores on uppmax.


FrickTobias avatar FrickTobias commented on August 26, 2024

[image: memory profile from an earlier run (rackham-snic2018-3-501-tobiasf-10551985)]

Awesome! For future records I thought I'd include a reference from before.

