umiextract's Introduction

UMI-tools Extract

What are UMIs?:

UMI is an acronym for Unique Molecular Identifier. UMIs are short sequences of bases added to each molecule before PCR amplification, which allows PCR duplicates to be identified accurately.

Purpose of the Extract Workflow:

UMIs exist within the sequence of the read inside the fastq file. The extract tool removes the UMIs from the sequence and places them in the header of the read. This will allow specific reads to be easily identified by other tools downstream.
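As a rough illustration of what "moving the UMI to the header" means, the sketch below applies the simpler of the two pipeline patterns to a single made-up read using Python's built-in `re` module. It is not UMI-tools' own code: the function name `extract_umi`, the example read, and the choice to join the UMI to the read name with an underscore are assumptions made for illustration.

```python
import re

# Simple pattern from this README: a 3-base UMI, then 2 bases to discard.
PATTERN = re.compile(r"(?P<umi_1>.{3})(?P<discard_1>.{2})")


def extract_umi(header, sequence, quality):
    """Return (header, sequence, quality) with the UMI moved to the header.

    Returns None when the read does not match the pattern.
    Hypothetical helper for illustration only.
    """
    match = PATTERN.match(sequence)
    if match is None:
        return None
    umi = match.group("umi_1")
    end = match.end()  # the UMI and the discarded bases are both trimmed off
    # Append the UMI to the read identifier (the first whitespace-delimited token).
    name, _, rest = header.partition(" ")
    new_header = f"{name}_{umi}" + (f" {rest}" if rest else "")
    return new_header, sequence[end:], quality[end:]


# Made-up read: UMI "ACG", discarded bases "TT", remaining insert "GGCCAAT".
record = extract_umi("@read1 1:N:0", "ACGTTGGCCAAT", "IIIIIFFFFFFF")
print(record)
```

The quality string is trimmed by the same number of characters as the sequence so that the two stay the same length.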

Github Repository of UMI-Tools:

https://github.com/CGATOxford/UMI-tools

Inputs Needed

  • 2 fastq files: The paired-end raw fastq files, which still contain the UMIs in the read sequence (not yet extracted)
  • 2 output file name prefixes: The name prefixes of the fastq files written with the UMIs extracted (the output extension is always .fastq)
  • Log name: The name prefix of the log file (the extension is always .log)
  • Regex: This workflow supports extracting UMIs with any regular expression. However, two are used in the pipeline (keep the single quotes):
    '(?P<umi_1>.{3}[ACGN])(?P<discard_1>T)|(?P<umi_1>.{3})(?P<discard_1>T.)'
    '(?P<umi_1>.{3})(?P<discard_1>.{2})'
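To make the first pipeline pattern easier to read, the sketch below splits its two alternatives and tries them left to right, which is how a top-level `|` behaves when matching from the start of a read. Two things here are assumptions rather than statements about UMI-tools: that the pattern is matched at the start of the read, and the helper name `split_read`. The alternatives are kept as separate patterns because Python's built-in `re` module rejects a single pattern that defines the same group name twice; UMI-tools is understood to rely on the third-party `regex` module, which accepts it.

```python
import re

# The two alternatives of the first pipeline pattern, tried left to right.
ALTERNATIVES = [
    # 4-base UMI (3 of any base, then A, C, G or N), followed by a T that is discarded
    re.compile(r"(?P<umi_1>.{3}[ACGN])(?P<discard_1>T)"),
    # 3-base UMI, followed by a T plus one more base, both discarded
    re.compile(r"(?P<umi_1>.{3})(?P<discard_1>T.)"),
]


def split_read(sequence):
    """Return (umi, discarded, remainder), or None if no alternative matches.

    Hypothetical helper for illustration only.
    """
    for pattern in ALTERNATIVES:
        match = pattern.match(sequence)
        if match:
            return (
                match.group("umi_1"),
                match.group("discard_1"),
                sequence[match.end():],
            )
    return None


print(split_read("ACGATGGCC"))  # first alternative: 4-base UMI
print(split_read("ACGTAGGCC"))  # second alternative: 3-base UMI
print(split_read("ACGCCGGCC"))  # neither alternative matches
```

Either way, five bases are consumed from the start of the read: four UMI bases plus one discarded, or three UMI bases plus two discarded.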

Outputs Given

  • 2 fastq files: These fastq files will have the UMIs extracted as specified by the pattern given in the regular expression
  • Log file (autogenerated by UMI-tools): Records how many reads have been parsed, with a timestamp after every 100,000 reads, plus the total number of reads that matched the regex and the total that did not.

umiextract's People

Contributors

alexjfortuna, rishi-shah12, lheisler

umiextract's Issues

Only extracts UMIs in read 1. UMIs in read 2 are ignored.

The current workflow only extracts UMIs from read 1, ignoring any potential UMIs in read 2.
The regex passed to --bc-pattern in the command only extracts UMIs from read 1.
Option --bc-pattern2 is needed to extract UMIs from fastq 2.

The command should add the option --bc-pattern2 and use an additional variable, regex2 (see below).
Because both --bc-pattern and --bc-pattern2 are then used in the command, the regular expressions can be constructed to extract UMIs from read 1 only or from read 2 only:

  • extract from read 2 only
    regex='.*' --> matches read 1, no extraction
    regex2='(?P<umi_1>.{3})(?P<discard_1>.{2})'

  • extract from read 1 only
    regex='(?P<umi_1>.{3})(?P<discard_1>.{2})'
    regex2='.*' --> matches read 2, no extraction

umi_tools extract --extract-method=~{method} \
                    --bc-pattern=~{regex} \
                    --bc-pattern2=~{regex2} \
                    --stdin=~{fastq1} \
                    --stdout=~{outFileNamePrefix1}.fastq \
                    --read2-in=~{fastq2} \
                    --read2-out=~{outFileNamePrefix2}.fastq \
                    --log=~{logNamePrefix}.log
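For trying the proposed command outside the workflow, the sketch below assembles the same options as an argument list in Python. It only builds and prints the command and does not run UMI-tools. The function name `build_extract_command` and the file names are placeholders, and the values follow the "extract from read 1 only" case proposed above. Passing the regex as a list element, for example to `subprocess.run`, avoids shell quoting problems with characters such as `(`, `?` and `*`.

```python
import shlex


def build_extract_command(method, regex, regex2, fastq1, fastq2,
                          out_prefix1, out_prefix2, log_prefix):
    """Assemble the paired-end `umi_tools extract` call as an argument list.

    Hypothetical helper; it uses only the options shown in the command above.
    """
    return [
        "umi_tools", "extract",
        f"--extract-method={method}",
        f"--bc-pattern={regex}",
        f"--bc-pattern2={regex2}",
        f"--stdin={fastq1}",
        f"--stdout={out_prefix1}.fastq",
        f"--read2-in={fastq2}",
        f"--read2-out={out_prefix2}.fastq",
        f"--log={log_prefix}.log",
    ]


# Placeholder inputs for the "extract from read 1 only" case.
command = build_extract_command(
    method="regex",
    regex="(?P<umi_1>.{3})(?P<discard_1>.{2})",
    regex2=".*",
    fastq1="sample_R1.fastq",
    fastq2="sample_R2.fastq",
    out_prefix1="sample_R1.extracted",
    out_prefix2="sample_R2.extracted",
    log_prefix="sample.extract",
)

# shlex.join adds the quoting a shell would need, for display or copy-paste.
print(shlex.join(command))
```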
