This project forked from xuanrzhang/mirec


An error correction tool for miRNA reads at single-base resolution



miREC

miREC is an error correction tool for miRNA reads at single-base resolution. It proposes a novel 3-layer lattice structure combining kmers, (k-1)mers and (k+1)mers to solve, for the first time, the problem of correcting erroneous bases in miRNA sequencing data. The novelty of our method is the use of this 3-layer (k-1)mer-kmer-(k+1)mer lattice structure to maintain the frequency differences of the kmers. The method is particularly effective for the accurate correction of indel errors.
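To make the lattice idea concrete, here is a minimal Python sketch. It is not miREC's C++ implementation. It counts substrings of lengths k-1, k and k+1 in a few toy reads and looks up the neighbours of one kmer in the layers below and above. The function names `count_kmers` and `lattice_neighbours`, the toy reads and the choice k = 4 are assumptions made for illustration only.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def lattice_neighbours(kmer, layers):
    """Return the frequencies of a kmer's neighbours one layer below and above.

    `layers` maps a substring length to the Counter for that length.
    """
    k = len(kmer)
    # Layer below: the two (k-1)mers contained in the kmer (its prefix and suffix).
    below = {sub: layers[k - 1][sub] for sub in (kmer[:-1], kmer[1:])}
    # Layer above: observed (k+1)mers that extend the kmer by one base on either side.
    above = {sup: n for sup, n in layers[k + 1].items()
             if sup[:-1] == kmer or sup[1:] == kmer}
    return below, above


# Toy reads (hypothetical): three identical reads and one that differs at one base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
k = 4
layers = {size: count_kmers(reads, size) for size in (k - 1, k, k + 1)}
below, above = lattice_neighbours("GTAC", layers)
print(layers[k]["GTAC"], below, above)
```

In this toy input, the kmer GTAC and each of its neighbours occur 3 times, while kmers that cover the odd base in the last read, such as GTTC, occur only once. Frequency differences of this kind across adjacent layers are what the lattice records.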

miREC is a user-friendly tool that suits different research needs. It provides several parameters that users can set for specific tasks. The three most useful settings are the error types, the frequency threshold τ, and the kmer range [k_1,k_end]. miREC has two running modes: one corrects substitution errors only, and the other (the default, mix) corrects both substitution and indel errors. Unlike existing methods, which use a fixed threshold, miREC provides options to obtain a good threshold for identifying erroneous kmers, thereby avoiding over-correction or insufficient performance when datasets change.
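The role of the frequency threshold τ can be illustrated with a short Python sketch. It assumes that a kmer is treated as suspect when it occurs fewer than τ times; whether miREC uses a strict or non-strict comparison is not stated here, so that detail, the function names and the toy reads are all assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_threshold(counts, tau):
    """Split kmers into (trusted, suspect); suspect kmers occur fewer than tau times."""
    trusted = {kmer for kmer, n in counts.items() if n >= tau}
    suspect = {kmer for kmer, n in counts.items() if n < tau}
    return trusted, suspect


# Toy reads (hypothetical): three identical reads and one that differs at one base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
counts = count_kmers(reads, 4)

trusted_low, suspect_low = split_by_threshold(counts, tau=2)
trusted_high, suspect_high = split_by_threshold(counts, tau=5)
print(sorted(suspect_low))
print(sorted(suspect_high))
```

With τ = 2 only the four kmers that cover the odd base are suspect. With τ = 5 every kmer except ACGT is suspect, including kmers shared by three of the four reads, which shows how a threshold that is too high for a dataset can lead to over-correction.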

The miREC program is written in C++11 and tested on Red Hat Enterprise Linux Workstation release 7.7. It is available under an open-source license.

Dependencies

KMC3, a kmer counter, is used to obtain kmer frequencies. Instructions for KMC3 are available at http://sun.aei.polsl.pl/REFRESH/kmc. The script miREC.sh calls kmc and kmc_dump. Please make sure the KMC tools can be run on your machine and are stored in the current miREC folder.

Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from high-throughput sequencing reads (https://cutadapt.readthedocs.io/en/stable/). miREC uses cutadapt. Please make sure the cutadapt tool can be run on your machine and is stored in the current miREC folder.
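For readers who want to inspect kmer frequencies themselves, the sketch below parses a plain-text kmer count listing in Python. It assumes that each line holds a kmer and its count separated by whitespace, which is how the text output of kmc_dump is commonly laid out; check this against your KMC version. The function name `parse_kmer_counts` and the sample lines are hypothetical.

```python
def parse_kmer_counts(lines):
    """Parse 'KMER<whitespace>COUNT' lines into a dict mapping kmer to count."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kmer, count = fields
        counts[kmer] = int(count)
    return counts


# Hypothetical sample lines in the assumed two-column layout.
sample = ["ACGTACGT\t12\n", "ACGTTCGT\t1\n", "\n"]
counts = parse_kmer_counts(sample)
print(counts)
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly.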

Download & Usage

Download

git clone https://github.com/XuanrZhang/miREC
cd miREC
make
chmod +x miREC.sh
chmod +x kmc*      # only needed if you have not installed the KMC tools yourself
chmod +x cutadapt  # only needed if you have not installed cutadapt yourself

Usage: ./miREC.sh -f [Input_File] -s [k_1] -e [k_end] -t [the number of threads] -r [the value of threshold] -o [Output_File] [-u] [-c adapter sequence]

Required OPTIONS:
-f [File_Name]: miRNA sequence dataset in fastq format;

Optional OPTIONS:
-t [the number of threads]: default is 8;
-r [the value of threshold]: default is 5;
-s [k_1]: default is 8;
-e [k_end]: default is 15;
-o [Output_FileName]: default is correct_read.fastq;
-u [run_type]: correct substitution errors only; without "-u", the default mix mode corrects both substitution and indel errors;
-c [adapter sequence]: turn on adapter trimming with cutadapt (e.g. -c GCCTTGGCACCCGAGAATTCCA);
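When miREC is run from a pipeline, the options above can be assembled programmatically. The following Python sketch builds the argument list from the flags and defaults listed in this README; the helper name `build_mirec_command` and its keyword names are hypothetical, and the script is not executed here.

```python
def build_mirec_command(fastq, k_start=8, k_end=15, threads=8, threshold=5,
                        output="correct_read.fastq", substitution_only=False,
                        adapter=None, script="./miREC.sh"):
    """Return the miREC.sh argument list for the given settings."""
    if k_start > k_end:
        raise ValueError("k_start must not exceed k_end")
    cmd = [script, "-f", fastq,
           "-s", str(k_start), "-e", str(k_end),
           "-t", str(threads), "-r", str(threshold),
           "-o", output]
    if substitution_only:
        cmd.append("-u")        # substitution errors only
    if adapter:
        cmd += ["-c", adapter]  # trim this adapter sequence first
    return cmd


cmd = build_mirec_command("user_input.fq", threads=26, substitution_only=True)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the miREC folder.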

Examples:

# Test using the simulated datasets in the GitHub repository.
./miREC.sh -f ./Data/simulated_data/mix_data/simumD1.fq -s 8 -e 15 -t 26  # correct substitution and indel errors, with the default threshold 5, k from 8 to 15, and 26 threads
./miREC.sh -f ./Data/simulated_data/mix_data/simumD1.fq -s 8 -e 15 -t 26 -r 6  # correct substitution and indel errors, with kmer frequency threshold 6, k from 8 to 15, and 26 threads
./miREC.sh -f ./Data/simulated_data/mix_data/simumD1.fq -s 8 -e 15 -t 26 -c TGGAATTCTCGGGTGCCAAGG  # trim the adapter "TGGAATTCTCGGGTGCCAAGG" with cutadapt, then correct substitution and indel errors, with threshold 5, k from 8 to 15, and 26 threads
./miREC.sh -f ./Data/simulated_data/mix_data/simumD1.fq -s 8 -e 15 -t 26 -u -o Correct.fastq  # correct substitution errors only, with threshold 5, k from 8 to 15, and 26 threads; write the output to Correct.fastq

# Test on your own dataset (user_input.fq).
./miREC.sh -f user_input.fq -s 8 -e 15 -t 26  # correct substitution and indel errors, with threshold 5, k from 8 to 15, and 26 threads
./miREC.sh -f user_input.fq -s 8 -e 15 -u -t 26  # correct substitution errors only, with threshold 5, k from 8 to 15, and 26 threads

Tips for k value selection:

**The recommended kmer range is [8,15].**
Each iteration of miREC increases the kmer length by 1 within the range [k1, kend], and each iteration corrects a different number of errors. According to experiments, with k1 set to 8, about 99.61% of substitution errors, 88.77% of insertion errors and 94.63% of deletion errors are corrected after five consecutive values of k. Further iterations detect and correct more erroneous bases. Because every iteration has the same order of time complexity, users with very large miRNA sequencing datasets can narrow the kmer range (by setting kend smaller) to shorten the running time while still correcting almost all of the errors.
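The iteration over the kmer range can be sketched in Python as follows. This is a generic kmer-spectrum correction loop for substitution errors only, written for illustration; it does not use the lattice and is not miREC's algorithm. The function names, the toy reads, the small range [4, 5] and the rule "replace a rare kmer only when exactly one frequent kmer is one substitution away" are all assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_pass(reads, k, tau):
    """One pass: replace each rare kmer by its unique frequent one-substitution neighbour."""
    counts = count_kmers(reads, k)
    corrected = []
    for read in reads:
        bases = list(read)
        for i in range(len(bases) - k + 1):
            kmer = "".join(bases[i:i + k])
            if counts[kmer] >= tau:
                continue  # frequent kmer, treated as correct
            candidates = set()
            for pos in range(k):
                for base in "ACGT":
                    if base != kmer[pos]:
                        alt = kmer[:pos] + base + kmer[pos + 1:]
                        if counts[alt] >= tau:
                            candidates.add(alt)
            if len(candidates) == 1:  # only correct when the choice is unambiguous
                bases[i:i + k] = candidates.pop()
        corrected.append("".join(bases))
    return corrected


def correct_over_range(reads, k_start, k_end, tau):
    """Run one correction pass for each k in [k_start, k_end], feeding results forward."""
    for k in range(k_start, k_end + 1):
        reads = correct_pass(reads, k, tau)
    return reads


# Toy reads (hypothetical): three identical reads and one that differs at one base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
result = correct_over_range(reads, k_start=4, k_end=5, tau=2)
print(result)
```

Each value of k costs one full pass over the reads, so lowering `k_end` removes passes and shortens the run, which is the trade-off described above.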

Data format

Input: A clean miRNA read dataset in fastq format (after adapter trimming)

- read.fq: stores all the reads to be corrected.

Output: A corrected read dataset

- correct_read.fq: stores the corrected reads.
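As a reminder of the fastq layout that both files follow, here is a minimal Python sketch that reads and writes four-line fastq records (header, sequence, separator, quality). It assumes unwrapped records with exactly four lines each; the function names and the sample record are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a four-line-per-record fastq stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed fastq record: {header!r}")
        yield header, sequence, quality


def write_fastq(records, handle):
    """Write (header, sequence, quality) tuples as four-line fastq records."""
    for header, sequence, quality in records:
        handle.write(f"{header}\n{sequence}\n+\n{quality}\n")


# Hypothetical single-record input held in memory instead of a file.
sequence = "TGAGGTAGTAGGTTGTATAGTT"
sample = "@read1\n" + sequence + "\n+\n" + "I" * len(sequence) + "\n"
records = list(read_fastq(io.StringIO(sample)))

out = io.StringIO()
write_fastq(records, out)
print(records)
```

The length check between sequence and quality strings matters for correction tools in general, because inserting or deleting a base changes the read length and the quality string has to stay the same length as the sequence.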

Data materials

Simulated data: generated by scripts (gene_simu_mixerr.c / gene_simu_sub.c)

Real salmon data: can be downloaded from https://www.ncbi.nlm.nih.gov//bioproject/PRJNA202973 (Accession Number PRJNA202973).

963 miRXplore Universal Reference miRNAs read data: can be downloaded from https://github.com/Jappy0/miREC/tree/main/src, together with the related code and results.

Copies are also provided in the miREC/data folder.

Citation

Please cite the following work: Zhang, Xuan, et al. "Aberration-corrected ultrafine analysis of miRNA reads at single-base resolution: a k-mer lattice approach." Nucleic Acids Research (2021).

Questions

If you encounter any bugs while running miREC, please email [email protected]
