UMI stands for Unique Molecular Identifier. UMIs are short sequences of bases added before PCR amplification, and they allow PCR duplicates to be identified accurately.
In the raw fastq file, each UMI is part of the read sequence. The extract tool removes the UMI from the sequence and places it in the read header, so that downstream tools can easily identify specific reads.
https://github.com/CGATOxford/UMI-tools
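To illustrate what extraction does to a single read, here is a minimal Python sketch. It is not the UMI-tools implementation: the function name `extract_umi` and the example read are hypothetical, and it assumes the pattern is matched at the start of the read and that the UMI is appended to the read name after an underscore. It uses the shorter of the two patterns listed for the Regex input.

```python
import re

# The shorter pipeline pattern: a 3-base UMI followed by 2 discarded bases.
PATTERN = re.compile(r"(?P<umi_1>.{3})(?P<discard_1>.{2})")


def extract_umi(header, seq, qual, pattern=PATTERN):
    """Move the UMI from the read sequence into the read header.

    Returns (new_header, new_seq, new_qual), or None if the read does not
    match the pattern. Hypothetical helper, for illustration only.
    """
    match = pattern.match(seq)  # assumption: pattern is anchored at the read start
    if match is None:
        return None
    umi = match.group("umi_1")
    end = match.end()  # everything up to here (UMI + discarded bases) is trimmed
    # Append the UMI to the read name, keeping any comment after the first space.
    name, sep, comment = header.partition(" ")
    new_header = f"{name}_{umi}{sep}{comment}"
    # Trim the quality string by the same amount so it stays aligned with the sequence.
    return new_header, seq[end:], qual[end:]


# Made-up example read (not real data).
result = extract_umi("@read1 1:N:0", "ACGTTGGCCA", "IIIIIHHHHH")
print(result)
```

The quality string is trimmed together with the sequence because a fastq record is only valid when both have the same length.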
- 2 input fastq files: The paired-end raw fastq files, which still contain the UMIs in the read sequence (not yet extracted)
- 2 output file name prefixes: The names of the fastq files that will be written with the UMIs extracted (the output files always have a .fastq extension)
- Log name: The name of the log file (the log file always has a .log extension)
- Regex: This workflow supports extracting UMIs with any regular expression. However, two patterns are used in the pipeline (keep the single quotes):
  - '(?P<umi_1>.{3}[ACGN])(?P<discard_1>T)|(?P<umi_1>.{3})(?P<discard_1>T.)' captures either a 4-base UMI (any three bases plus one of A, C, G or N) followed by a discarded T, or a 3-base UMI followed by a discarded T and one further discarded base
  - '(?P<umi_1>.{3})(?P<discard_1>.{2})' captures a 3-base UMI followed by two discarded bases
- 2 output fastq files: The reads with the UMIs extracted as specified by the regular expression
- Log file (autogenerated by UMI-tools): Records how many reads have been parsed, a timestamp after every 100,000 reads parsed, the total number of reads matching the regex and the total number of reads not matching it.
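The inputs and outputs listed above map naturally onto a `umi_tools extract` command line. The Python sketch below assembles such a command as an argument list. The helper name `build_extract_command` and the example file names are hypothetical, and applying the same pattern to both reads is an assumption; the actual workflow may invoke UMI-tools differently.

```python
import shlex

# The shorter pipeline pattern, without the surrounding shell quotes.
PATTERN = "(?P<umi_1>.{3})(?P<discard_1>.{2})"


def build_extract_command(read1, read2, out_prefix1, out_prefix2, log_name, pattern):
    """Build an argument list for `umi_tools extract` (hypothetical helper)."""
    return [
        "umi_tools", "extract",
        "--extract-method=regex",
        f"--bc-pattern={pattern}",
        f"--bc-pattern2={pattern}",  # assumption: same pattern for read 2
        f"--stdin={read1}",
        f"--read2-in={read2}",
        f"--stdout={out_prefix1}.fastq",     # .fastq is appended to the prefix
        f"--read2-out={out_prefix2}.fastq",
        f"--log={log_name}.log",             # .log is appended to the log name
    ]


# Example file names are placeholders, not files from the pipeline.
cmd = build_extract_command("r1.fastq", "r2.fastq", "out1", "out2", "extract", PATTERN)
print(shlex.join(cmd))  # shell-quoted form, suitable for pasting into a terminal
```

When the list is passed directly to `subprocess.run(cmd, check=True)`, no shell is involved, so the pattern needs no quoting. `shlex.join` adds the single quotes only when the command is rendered as a shell string.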