REUSE

Rapid Elimination of Useless Sequences

Quick start

reuse build hg38.fa hg38

reuse filter -x hg38 -U input.fq -o filtered.fq

Getting REUSE

REUSE will run on most flavo(u)rs of Linux. Prerequisites include:

  • pigz

The easiest way to get REUSE is to download the binary from https://github.com/chorltsd/REUSE/releases/latest. Alternatively, the source can be downloaded and compiled.

Building an index

-Identify all k-mers within a reference dataset and store that library on disk.

-Optionally remove k-mers from the reference that also occur in a target dataset, to reduce false-positive filtering. For example, remove all viral k-mers from the human genome index so that reads originating from viruses in the sample are not filtered.
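The two steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not REUSE's implementation: the function names are hypothetical, the index is a plain in-memory set, only the forward strand is considered, and windows containing characters other than A, C, G and T are skipped. All of these are assumptions made for the example.

```python
def kmers(seq, k=21):
    """Yield every k-mer in seq (uppercased), skipping windows with non-ACGT characters."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            yield window


def build_index(reference_seqs, k=21, mask_seqs=()):
    """Collect all reference k-mers, then remove those that also occur in mask_seqs."""
    index = set()
    for seq in reference_seqs:
        index.update(kmers(seq, k))
    for seq in mask_seqs:
        # Masking: drop k-mers shared with the dataset we do not want to filter.
        index.difference_update(kmers(seq, k))
    return index


# Toy data (hypothetical); a small k keeps the example readable.
reference = ["GGTCATCCT", "ACGGGTCGT"]
masked = ["TCATC"]
index = build_index(reference, k=4, mask_seqs=masked)
print(sorted(index))
```

In this toy run the k-mers TCAT and CATC occur in the masked sequence, so they are absent from the resulting index and can no longer cause a read to be filtered.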

Usage:

reuse build [options] <reference_in>

Main arguments

<reference_in> = A comma-separated list of FASTA files containing the reference sequences to index, or, if -c is specified, the sequences themselves. E.g., <reference_in> might be chr1.fa,chr2.fa,chrX.fa,chrY.fa, or, if -c is specified, this might be GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA.

Second positional argument (hg38 in the quick start) = Location to save the index k-mer dataset to disk

Options:

-p/--threads = Threads to use (default: 1)

-m = Maximum RAM usage (default: all available RAM)

-k = k-mer length (default: 21)

-c = The reference sequences are given on the command line. I.e. <reference_in> is a comma-separated list of sequences rather than a list of FASTA files.

-h = Hide (i.e. mask) k-mers found in this FASTA file from the reference database. This option is used to minimize false-positive filtering of related species or species of interest.

-g = Compress index when saving to disk. May take longer to generate the index and load when searching.

-h/--help = Print usage information and quit

-v/--version = Print version information and quit

Searching the index

-Eliminate every read or read pair in which one or more indexed k-mers are found.
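As a rough illustration of this filtering rule, the Python sketch below checks each read against a set of indexed k-mers and stops counting as soon as the threshold is reached. It is not REUSE's code: the function names are hypothetical, min_kmers stands in for the -mk option, and dropping a whole pair when either mate matches is an assumption about pair handling.

```python
def kmers(seq, k):
    """Return a generator over all k-mers of seq (uppercased)."""
    seq = seq.upper()
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def is_filtered(read, index, k, min_kmers=1):
    """True if at least min_kmers k-mers of the read are in the index."""
    hits = 0
    for kmer in kmers(read, k):
        if kmer in index:
            hits += 1
            if hits >= min_kmers:
                return True  # early exit: no need to scan the rest of the read
    return False


def filter_pairs(pairs, index, k, min_kmers=1):
    """Keep only pairs in which neither mate matches (assumed pair behaviour)."""
    return [
        (m1, m2)
        for m1, m2 in pairs
        if not is_filtered(m1, index, k, min_kmers)
        and not is_filtered(m2, index, k, min_kmers)
    ]


# Toy index and reads (hypothetical).
index = {"GGTC", "GTCA"}
reads = ["AAGGTCAA", "TTTTTTTT"]
kept = [r for r in reads if not is_filtered(r, index, k=4)]
print(kept)
```

The early return is why a threshold of one matching k-mer is the cheapest setting: a matching read is rejected at its first hit.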

Usage:

reuse filter [options] -x {-1 -2 | -U | --interleaved <i>}

Main arguments

-x The basename of the index for the reference dataset. The index can be generated with reuse build (.db.gz) or with another k-mer counting program, such as Jellyfish or KAnalyze. K-mer count files in Jellyfish dump format or KAnalyze default output format (both two-column text files) may be compressed with gzip.

-1 Comma-separated list of files containing mate 1s (filename usually includes _1), e.g. -1 flyA_1.fq,flyB_1.fq. Sequences specified with this option must correspond file-for-file and read-for-read with those specified with -2. Reads may be a mix of different lengths. If - is specified, reuse will read the mate 1s from the “standard in” or “stdin” filehandle. Reads may be in FASTQ or FASTA format.

-2 Comma-separated list of files containing mate 2s (filename usually includes _2), e.g. -2 flyA_2.fq,flyB_2.fq. Sequences specified with this option must correspond file-for-file and read-for-read with those specified with -1. Reads may be a mix of different lengths. If - is specified, reuse will read the mate 2s from the “standard in” or “stdin” filehandle. Reads may be in FASTQ or FASTA format.

-U Comma-separated list of files containing unpaired reads to be filtered, e.g. lane1.fq,lane2.fq,lane3.fq,lane4.fq. Reads may be a mix of different lengths. If - is specified, reuse gets the reads from the “standard in” or “stdin” filehandle. Reads may be in FASTQ or FASTA format.
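The -x description mentions two-column k-mer count files that may be gzip-compressed. To show what such a file looks like to a consumer, here is a minimal Python sketch that loads one into a set. It assumes the k-mer is in the first column and the count in the second, separated by whitespace, and it picks the opener from a .gz suffix; the function name and file names are hypothetical and this is not how REUSE itself reads the file.

```python
import gzip
import os
import tempfile


def load_kmer_counts(path):
    """Read a two-column text file (k-mer, count) into a set of k-mers."""
    opener = gzip.open if path.endswith(".gz") else open
    index = set()
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:  # skip blank or malformed lines
                index.add(fields[0].upper())
    return index


# Write a tiny gzip-compressed count file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "counts.txt.gz")
    with gzip.open(path, "wt") as out:
        out.write("GGTC 3\nGTCA 1\n")
    print(sorted(load_kmer_counts(path)))
```

The counts are ignored here because the filter only needs to know whether a k-mer is present.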

Options:

-o = Save reads not matching the k-mer filter to .fast(q/a) for single-end reads, or _1.fast(q/a) and _2.fast(q/a) for paired-end reads. By default, reads are output to STDOUT.

-f = Save reads matching the k-mer filter to .fast(q/a) for single-end reads, or _1.fast(q/a) and _2.fast(q/a) for paired-end reads. By default, reads are discarded.

-g = Compress output reads with gzip

-z = Compress output reads with an alternate command, such as "bzip2"

-m = Maximum RAM usage (default: all available RAM)

-p/--threads = Threads to use (default: available number of threads)

-l = Log file

-mk = Minimum number of k-mers per read to filter it (default: 1)

-sp = Split pairs
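The relationship between the -o and -f outputs can be pictured as a simple routing step: each read goes to one of two destinations depending on whether it matches the k-mer filter. The Python sketch below is a hypothetical illustration for single-end FASTQ records using in-memory handles. The function names and the record tuple layout are assumptions, and passing None for the matched handle mimics the default of discarding matching reads.

```python
import io


def has_hit(seq, index, k, min_kmers=1):
    """True if at least min_kmers k-mers of seq are present in the index."""
    hits = sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] in index)
    return hits >= min_kmers


def route_reads(records, index, k, kept_out, matched_out=None, min_kmers=1):
    """Write non-matching reads to kept_out and matching reads to matched_out.

    records is an iterable of (header, sequence, quality) tuples.
    If matched_out is None, matching reads are discarded.
    """
    for header, seq, qual in records:
        target = matched_out if has_hit(seq, index, k, min_kmers) else kept_out
        if target is not None:
            target.write(f"@{header}\n{seq}\n+\n{qual}\n")


# Toy records (hypothetical).
index = {"GGTC", "GTCA"}
records = [("r1", "AAGGTCAA", "IIIIIIII"), ("r2", "TTTTTTTT", "IIIIIIII")]
kept, matched = io.StringIO(), io.StringIO()
route_reads(records, index, 4, kept, matched)
print(kept.getvalue(), end="")
```

Any text-mode file object works as a destination, so a handle from gzip.open(path, "wt") could be passed in to get compressed output.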

Performance optimization:

REUSE runs fastest when a read is filtered as soon as its first matching k-mer is found (-mk 1), when it is given the maximum number of threads and RAM, and when a lower k-mer size is used. Lower k-mer sizes reduce the index size but are less accurate at differentiating species.
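The trade-off in k-mer size can be seen on a toy example. The Python sketch below compares two short, made-up sequences that differ at two positions and reports, for several values of k, the upper bound on distinct k-mers (4^k) and the fraction of the first sequence's k-mers that also occur in the second. The sequences and function names are hypothetical, and real genomes will behave less cleanly than this.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(a, b, k):
    """Fraction of a's distinct k-mers that also occur in b."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return len(ka & kb) / len(ka) if ka else 0.0


# Two toy sequences that differ at two positions (hypothetical data).
seq_a = "ACGTACGTTAGC"
seq_b = "ACGTTCGTTAGG"

for k in (3, 6, 8):
    # 4**k is the number of possible k-mers, an upper bound on index size.
    print(k, 4 ** k, round(shared_fraction(seq_a, seq_b, k), 3))
```

As k grows, fewer k-mers are shared between the two sequences, which is what makes longer k-mers better at telling related sources apart, while the number of possible k-mers, and with it the potential index size, grows quickly.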

Contributors

innovate-invent, f0t1h
