umiextract's Introduction

UMI-tools Extract

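What are UMIs?

UMI stands for Unique Molecular Identifier. UMIs are short sequences of bases added to each molecule before PCR amplification, so reads that carry the same UMI can be recognized as copies of the same original molecule. This allows PCR duplicates to be identified accurately.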
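
To make the idea concrete, here is a minimal sketch of how UMIs can separate PCR duplicates from distinct molecules. It assumes that reads sharing both a mapping position and a UMI come from the same original molecule; the positions and UMIs below are made up for illustration and do not come from this workflow.

```python
from collections import Counter

# Made-up reads as (mapping position, UMI) pairs; not from any real data set.
reads = [(100, "ACG"), (100, "ACG"), (100, "TTA"), (250, "ACG")]

# Assumption: reads sharing both position and UMI are copies of one molecule.
molecule_counts = Counter(reads)
unique_molecules = len(molecule_counts)
pcr_duplicates = len(reads) - unique_molecules

print(unique_molecules, pcr_duplicates)
```

Without the UMI, the three reads at position 100 could not be told apart; with it, two of them are recognized as duplicates and the third as a separate molecule.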

Purpose of the Extract Workflow:

In the raw fastq file, the UMI is part of each read's sequence. The extract tool removes the UMI from the sequence and places it in the header of the read, so that downstream tools can easily identify reads by their UMI.
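
The effect on a single read can be sketched as follows. This is an illustration only, not the UMI-tools implementation: the function name `move_umi_to_header`, the `umi_length` parameter, and the example read are hypothetical, and the sketch assumes the UMI sits at the start of the read and is appended to the read name after an underscore.

```python
def move_umi_to_header(record, umi_length):
    """Illustrative only: strip a leading UMI and append it to the read name.

    record is a (name, sequence, plus, quality) tuple for one fastq read.
    """
    name, seq, plus, qual = record
    umi = seq[:umi_length]
    # Keep any comment after the first space in the header untouched.
    read_id, _, comment = name.partition(" ")
    new_name = f"{read_id}_{umi}" + (f" {comment}" if comment else "")
    # Trim the quality string so it stays the same length as the sequence.
    return (new_name, seq[umi_length:], plus, qual[umi_length:])


# Made-up read for demonstration.
before = ("@read1 1:N:0", "ACGTTGGCCAAT", "+", "ABCDEFGHIJKL")
after = move_umi_to_header(before, umi_length=3)
print(after)
```

The sequence and quality strings are shortened together, which keeps the record a valid fastq entry.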

GitHub Repository of UMI-tools:

https://github.com/CGATOxford/UMI-tools

Inputs Needed

2 fastq files: The paired-end raw fastq files that contain the UMIs in the read sequence (not yet extracted).
2 output file name prefixes: The names of the output fastq files with the UMIs extracted (the output extension is always .fastq).
Log name: The name of the log file (the extension is always .log).
Regex: This workflow supports extracting UMIs with any regular expression. However, 2 are used in the pipeline (keep the single quotes):
'(?P<umi_1>.{3}[ACGN])(?P<discard_1>T)|(?P<umi_1>.{3})(?P<discard_1>T.)'
OR
'(?P<umi_1>.{3})(?P<discard_1>.{2})'
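
To see what the first pattern above captures, here is a minimal sketch using Python's built-in `re` module. One caveat: the original pattern reuses the group names `umi_1` and `discard_1` in both branches of the alternation, which the built-in `re` module rejects (the third-party `regex` module accepts it). The sketch therefore renames the groups to `umi_a`/`umi_b` and `discard_a`/`discard_b`; those names, the `split_read` helper, and the test sequences are assumptions made for illustration. It also assumes the pattern is matched at the start of the read.

```python
import re

# Stdlib-only restatement of the first pipeline pattern. The group names
# umi_a/umi_b and discard_a/discard_b are assumptions made for this sketch:
# the original reuses umi_1/discard_1 in both branches, which Python's
# built-in re module does not allow.
PATTERN = re.compile(
    r"(?P<umi_a>.{3}[ACGN])(?P<discard_a>T)"  # 4-base UMI, then a discarded T
    r"|(?P<umi_b>.{3})(?P<discard_b>T.)"      # 3-base UMI, then T plus one base
)


def split_read(seq):
    """Return (umi, discarded, remaining) or None if the read does not match."""
    m = PATTERN.match(seq)
    if m is None:
        return None
    # Only one branch participates in a match; the other groups are None.
    umi = m.group("umi_a") or m.group("umi_b")
    discarded = m.group("discard_a") or m.group("discard_b")
    return umi, discarded, seq[m.end():]


for seq in ("ACGATGGCC", "ACGTAGGCC", "ACGGGCC"):
    print(seq, split_read(seq))
```

The first branch takes a 4-base UMI whose last base is not T, followed by a T; the second takes a 3-base UMI followed by a T and one more base. Reads that fit neither branch are reported as not matching.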

Outputs Given

2 fastq files: The input reads with the UMIs extracted according to the pattern given in the regular expression.
Log file (autogenerated by UMI-tools): Reports how many reads have been parsed, a timestamp for every 100,000 reads parsed, the total number of reads matching the regex, and the total number of reads not matching it.

umiextract's Issues

Only extracts UMIs in read 1. UMIs in read 2 are ignored.

The current workflow extracts UMIs from read 1 only and ignores any UMIs in read 2.
The regex passed to --bc-pattern in the command applies to read 1 only.
The --bc-pattern2 option is needed to extract UMIs from fastq 2.

The command should add the --bc-pattern2 option and take an additional variable, regex2 (see below).
With both --bc-pattern and --bc-pattern2 in the command, the regular expressions can also be written to extract UMIs from read 1 only or from read 2 only:

Extract from read 2 only:
regex='.*' (matches read 1, no extraction)
regex2='(?P<umi_1>.{3})(?P<discard_1>.{2})'

Extract from read 1 only:
regex='(?P<umi_1>.{3})(?P<discard_1>.{2})'
regex2='.*' (matches read 2, no extraction)

umi_tools extract --extract-method=~{method} \
                    --bc-pattern=~{regex} \
                    --bc-pattern2=~{regex2} \
                    --stdin=~{fastq1} \
                    --stdout=~{outFileNamePrefix1}.fastq \
                    --read2-in=~{fastq2} \
                    --read2-out=~{outFileNamePrefix2}.fastq \
                    --log=~{logNamePrefix}.log
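
For testing the proposed command outside the workflow, the same arguments can be assembled in Python. This is a sketch under stated assumptions: the helper `build_extract_command` and all file names are hypothetical, and the value `regex` for `--extract-method` is assumed because the patterns are regular expressions. The command is only built and printed here, not run.

```python
import shlex


def build_extract_command(method, regex, regex2, fastq1, fastq2,
                          out_prefix1, out_prefix2, log_prefix):
    """Assemble the umi_tools extract argument list (not executed here)."""
    return [
        "umi_tools", "extract",
        f"--extract-method={method}",
        f"--bc-pattern={regex}",
        f"--bc-pattern2={regex2}",
        f"--stdin={fastq1}",
        f"--stdout={out_prefix1}.fastq",
        f"--read2-in={fastq2}",
        f"--read2-out={out_prefix2}.fastq",
        f"--log={log_prefix}.log",
    ]


# Hypothetical inputs: extract from read 1 only, as described above.
cmd = build_extract_command(
    method="regex",  # assumed value for --extract-method
    regex="(?P<umi_1>.{3})(?P<discard_1>.{2})",
    regex2=".*",
    fastq1="sample_R1.fastq.gz",
    fastq2="sample_R2.fastq.gz",
    out_prefix1="sample_R1.extracted",
    out_prefix2="sample_R2.extracted",
    log_prefix="sample.extract",
)

# shlex.join quotes each argument so the printed line is safe to paste into a shell.
print(shlex.join(cmd))
```

Building the command as a list keeps the regular expressions intact without manual quoting; `shlex.join` adds the single quotes only when the line is rendered for a shell.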
