GithubHelp home page GithubHelp logo

metagenome_ana_pipeline's Introduction

metagenome_ana_pipeline

参考流程

  博客地址

软件安装

Bowtie2

bowtie2 -h

metaphlan2

  下载地址(压缩文件500M+,用的谷歌云盘,需要挂梯子)

  ​官网教程

  文献链接

  直接下载解压即可。

  报错Error: Unable to find the mpa_pkl file at: /usr/bin/db_v20/mpa_v20_m200.pkl,解决方法是把对应的数据库放到指定的位置,软连接即可:

sudo ln -s /opt/software/metaphlan2/db_v20/ /usr/bin/

MegaHit

  软件地址

  文献链接

  直接用conda安装:

conda install -c bioconda megahit

  检测是否安装成功:

megahit -h

quast

  官网地址

  文献链接

  安装流程:

wget https://downloads.sourceforge.net/project/quast/quast-5.0.2.tar.gz
tar -xzf quast-5.0.2.tar.gz
cd quast-5.0.2
./setup.py install

sudo ln -s /opt/software/quast/quast.py /usr/bin/
sudo ln -s /opt/software/quast/metaquast.py /usr/bin/

拆分fastq

  项目地址

#!/bin/bash
# Usage: deinterleave_fastq.sh < interleaved.fastq f.fastq r.fastq [compress]
#
# Deinterleaves a FASTQ file of paired reads into two FASTQ
# files specified on the command line. Optionally GZip compresses the output
# FASTQ files using pigz if the 3rd command line argument is the word "compress"
#
# Can deinterleave 100 million paired reads (200 million total
# reads; a 43Gbyte file), in memory (/dev/shm), in 4m15s (255s)
#
# Latest code: https://gist.github.com/3521724
# Also see my interleaving script: https://gist.github.com/4544979
#
# Inspired by Torsten Seemann's blog post:
# http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html

# Set up some defaults
GZIP_OUTPUT=0
PIGZ_COMPRESSION_THREADS=10

# If the third argument is the word "compress" then we'll compress the output using pigz
if [[ $3 == "compress" ]]; then
  GZIP_OUTPUT=1
fi

if [[ ${GZIP_OUTPUT} == 0 ]]; then
  paste - - - - - - - -  | tee >(cut -f 1-4 | tr "\t" "\n" > $1) | cut -f 5-8 | tr "\t" "\n" > $2
else
  paste - - - - - - - -  | tee >(cut -f 1-4 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > $1) | cut -f 5-8 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > $2
fi

  修改为可执行文件:

sudo chmod +x /usr/bin/deinterleave_fastq.sh

Prodigal

  项目地址

  文献链接

MMseqs2

  官网地址

  安装:

conda install -c conda-forge -c bioconda mmseqs2

创建文件夹

mkdir data/host-genome
mkdir results
mkdir results/0-fastqc
mkdir results/1-sambybowtie
mkdir results/2-bamfile
mkdir results/3-unmapped.bam
mkdir results/4-sorted.bam
mkdir results/5-new.fastq
mkdir results/6-new.fastqc
mkdir results/7-metaphlan
mkdir results/8-megahit
mkdir results/9-metaquast
mkdir results/10-prodigal
mkdir results/11-mmseqs2
mkdir results/12-diamond

数据预处理

  如果是从NCBI SRA数据库下载的数据,那么需要先对数据进行转换和质控。

  • 将SRA数据转换成fastq数据

    pfastq-dump --split-3 -t 30 -s data/download/SRR14784195 -O data/fastq
  • 数据质控

    fastqc data/fastq/* -o results/0-fastqc
    cd results/0-fastqc
    multiqc .

去除宿主基因组

  • 构建宿主基因组索引

    cd data/host-genome
    /usr/bin/bowtie2-build --seed 707 --threads 30 /home/publicdata/refgenome/panaxnotoginseng/yangshengchao/genome/Panax_notoginseng.genome.2.fasta data/host-genome/sanqi
  • 将数据mapping到宿主基因组上

    /usr/bin/bowtie2 --very-sensitive-local -p 15 -x data/host-genome/sanqi -1 data/fastq/SRR14784195_1.fastq -2 data/fastq/SRR14784195_2.fastq -S results/1-sambybowtie/SRR14784159.sam
  • 将sam文件转换成bam文件

    samtools view -@ 30 -bS results/1-sambybowtie/SRR14784159.sam > results/2-bamfile/SRR14784159.bam
  • 提取没有mapping上宿主基因组的序列

    samtools view -@ 30 -b -f 12 -F 256 results/2-bamfile/SRR14784159.bam > results/3-unmapped.bam/SRR14784159.bam
  • bam文件排序

    samtools sort -n results/3-unmapped.bam/SRR14784159.bam -o results/4-sorted.bam/SRR14784159.sorted.bam
  • bam文件转换成fastq文件

    samtools bam2fq results/4-sorted.bam/SRR14784159.sorted.bam > results/5-new.fastq/SRR14784159.fastq
  • 重新质控新的结果

    fastqc results/5-new.fastq/SRR14784159.fastq -o results/6-new.fastqc/
  • 质控完成后将所有的质控结果整合成一个

    cd 

MetaPhlan2分类

/usr/bin/metaphlan2.py ../10-idba.results/fq2fa/SRR14784159.fa --bowtie2out metagenome.bowtie2.2.bz2 --nproc 10 --input_type fasta > ./SRR14784159.fa.txt

MegaHit组装

megahit --12 results/5-new.fastq/SRR14784159.fastq --k-list 35 -o results/11-megahit.results/out

  组装完后成生成的文件:

  image-20220323194850-5f3zb5y

Metaquast评估组装质量

/usr/bin/quast.py -o results/13-metaquast.results results/11-megahit.results/out/final.contigs.fa

  image-20220323195214-hmdug9k

分箱binning

mkdir -p results/binning
mkdir -p results/binning/deinterleave_fastq
mkdir -p results/binning/index
mkdir -p results/binning/samfile
  • 先将先前合并的fastq文件拆成两个。输入文件为用于组装的fastq文件。
/usr/bin/deinterleave_fastq.sh < results/5-new.fastq/SRR14784159.fastq results/binning/deinterleave_fastq/R1.fastq results/binning/deinterleave_fastq/R2.fastq
  • 构建组装索引。输入文件是组装好的fasta文件。
bowtie2-build --seed 707 results/11-megahit.results/out/final.contigs.fa results/binning/index/index
  • 将拆分的两个fastq文件比对到构建的基因组上去。
bowtie2 --sensitive-local -p 10 --seed 707 -x results/binning/index/index -1 results/binning/deinterleave_fastq/R1.fastq -2 results/binning/deinterleave_fastq/R2.fastq -S results/binning/samfile/sample.sam

Prodigal预测基因

  输入MegaHit组装的结果,直接预测基因。输出蛋白序列、CDS序列及gff文件等。

prodigal -i results/11-megahit.results/out/final.contigs.fa -a results/prodigal.results/res.pep -d results/prodigal.results/res.cds -f gff -g 11 -o results/prodigal.results/res.gff -p single -s results/prodigal.results/res.stat

  筛选特定长度的基因:

import os
import re
import argparse

parser = argparse.ArgumentParser(description="You should add those parameters")
parser.add_argument("-i","--input", type = str, help = "The input fasta file")
parser.add_argument("-min","--min_length", type = int, help = "The minimal length of sequences")
parser.add_argument("-max","--max_length", type = int, default = 0, help = "The maximal length of sequences")
parser.add_argument("-o","--output", type = str, help = "The output fasta file")
#args = parser.parse_args()
args =parser.parse_known_args()[0]

# 输入的fasta文件转换成字典
file_open = open(args.input,"r")
file_read = file_open.readlines()
res_dict = {}
for line in file_read:
    if re.match(">",line):
        #print(line)
        res_dict[line] = ""
        flag = line
    else:
        res_dict[flag] = res_dict[flag] + line

# 根据长度进行筛选
# 根据长度进行筛选
final_results = []
for key, value in res_dict.items():
    if args.max_length == 0:
        if len(value) >= args.min_length:
            final_results.append(key)
            final_results.append(value)
    else:
        if args.min_length <= len(value) <= args.max_length:
            final_results.append(key)
            final_results.append(value)

# 保存文件
file_new = open(args.output,"w")
for i in final_results:
    file_new.write(i + "\n")

file_new.close()
python /usr/bin/get_gene_from_prodigal_with_length.py -i results/prodigal.results/res.cds -min 100 -o results/prodigal.results/res.cds.new.fa

  删除生成的gff文件中的注释行:

sed '/#/d' res.gff > res.new.gff

MMseq2构建基因集

mkdir results/mmseq2.results
mmseqs easy-cluster results/prodigal.results/res.cds.new.fa results/mmseq2.results/clusterRes results/mmseq2.results/tmp --min-seq-id 0.5 -c 0.8 --cov-mode 1
  • results/mmseq2.results/clusterRes是输出文件
  • results/mmseq2.results/tmp是临时文件

  image-20220324112938-f8c21nt

蛋白序列比对到NCBI NR库

  NCBI NR库下载地址

metagenome_ana_pipeline's People

Contributors

lixiang117423 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.