PacBio long read analysis
Contributors
Qingxiang Guo
Notes
Written by Qingxiang Guo, [email protected], distributed without any guarantees or restrictions.
Codes
1. Use SMART to filter the PacBio raw data, and obtain the subreads
# The content of Input.fofn is as follows:
/Work/data/Analysis_Results/m150730_090551_42199_c100821272550000001823174411031557_s1_p0.1.bax.h5
/Work/data/Analysis_Results/m150730_090551_42199_c100821272550000001823174411031557_s1_p0.2.bax.h5
/Work/data/Analysis_Results/m150730_090551_42199_c100821272550000001823174411031557_s1_p0.3.bax.h5
# Start analysis, do Filter.sh
# To find isoseq, run flnc.sh
2. Map the long-read iso-seq to the reference genome using Gmap
# Run gmap.sh
3. PacBio genome assembly
3.1 Use HGAP to assembly the genome
# Filter the PacBio raw data, and obtain the subreads
# Run Filter.pl
perl filter.pl -i /backup01/qingxiangguo/HGAP/MCB/input.fofn -d /backup01/AS_HGAP_MCB/Filter
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=head1 Usage
perl filter.pl -i <your_input.fofn> -d <work_dir>
=head1 Case
perl filter.pl -i /backup01/qingxiangguo/HGAP/MCB/input.fofn -d /backup01/AS_HGAP_MCB/Filter
=cut
# The content of input.fofn is as follows:
/lustre/Work/data/HBE/E06_1/Analysis_Results/m150514_130050_42199_c100795172550000001823166309091574_s1_p0.1.bax.h5
/lustre/Work/data/HBE/E06_1/Analysis_Results/m150514_130050_42199_c100795172550000001823166309091574_s1_p0.2.bax.h5
/lustre/Work/data/HBE/E06_1/Analysis_Results/m150514_130050_42199_c100795172550000001823166309091574_s1_p0.3.bax.h5
/lustre/Work/data/HBE/F06_1/Analysis_Results/m150514_172003_42199_c100795172550000001823166309091575_s1_p0.1.bax.h5
/lustre/Work/data/HBE/F06_1/Analysis_Results/m150514_172003_42199_c100795172550000001823166309091575_s1_p0.2.bax.h5
/lustre/Work/data/HBE/F06_1/Analysis_Results/m150514_172003_42199_c100795172550000001823166309091575_s1_p0.3.bax.h5
# Run Filter.pbs
# We have already obtained the subreads, next we calculated the distribution of seed length to screen the candidate seeds
perl /lustre/Work/qingxiangguo/dev/script/PacBio/scripts/len_cal.pl filtered_subreads.fasta
# Then do seed length calculation
perl /lustre/Work/qingxiangguo/dev/script/PacBio/scripts/seed_length.pl -g 5000000 filtered_subreads.fasta_len
# Run the HGAP assembly for each seeds
perl /lustre/Work/qingxiangguo/dev/script/PacBio/HGAP/params_seed.pl /backup01/guoqing/genome/01QC/HGAP_10.lst
# The content of configuration file HGAP.list is as follows:
genomesize 5000000
minCorCov 10
input.fofn /backup01/01data/01QC/input.fofn
seed_length 16000 17000
work_path /backup01/02HGAP/min10
ppn 3
PBS_q cu
# The result is in data directory
3.2 Use MHAP to assembly the genome
# Need two files, mhap.pbs and configuration file mhap_pacbio.spec
# The content of mhap.pbs is as follows:
#PBS -N MHAP_test
#PBS -l nodes=1:ppn=3
#PBS -q cu
#PBS -S /bin/bash
cd /backup01/03MHAP/ECOLI_relax
PBcR -l K12 -s mhap_pacbio.spec -pbCNS -fastq /backup01/01data/02MHAP/ecoli_filtered.fastq genomeSize=4650000
# The content of mhap_pacbio.spec is as follows:
merSize=16
mhap=-k 16 --num-hashes 512 --num-min-matches 3 --threshold 0.04 --weighted
useGrid=0
scriptOnGrid=0
ovlMemory=32
ovlStoreMemory=32000
threads=32
ovlConcurrency=1
cnsConcurrency=8
merylThreads=32
merylMemory=32000
ovlRefBlockSize=20000
frgCorrThreads = 16
frgCorrBatchSize = 100000
ovlCorrBatchSize = 100000
asmOvlErrorRate=0.10
asmUtgErrorRate=0.07
asmCgwErrorRate=0.10
asmCnsErrorRate=0.10
utgGraphErrorRate=0.07
utgGraphErrorLimit=3.25
utgMergeErrorRate=0.0825
utgMergeErrorLimit=5.25
asmOBT=0
qsub mhap.pbs
3.3 Use Ectools to assembly the genome using both Illumina and PacBio reads
# Filter reads shorter than 1k
perl /lustre/Work/qingxiangguo/dev/script/seq/len_flt.pl -d 1000 filtered_subreads.fasta
# Then you will get long reads, filtered_subreads.fasta_len_flt.fasta
# Split the fasta file
python /lustre/Work/qingxiangguo/dev/sofs/SMRT/ectools-master/partition.py 500 500 filtered_subreads.fasta_len_flt.fasta
# Feed the Illumina reads
perl /lustre/Work/qingxiangguo/dev/sofs/SMRT/ectools-master/correct.pl -MIN_READ_LEN 1000 -UNITIG_file /backup01/01data/03ECtools/ecoli_illumina.fa -WORK_DIR /backup01/guoqing/genome/06ECTOOLS/
# Then run .sh file in correction.SH directory
# Combine all the corrected sequence
Cat ????/*.cor.fa > cor.fa
# Use Celera to assembly
perl /lustre/Work/qingxiangguo/dev/sofs/SMRT/wgs-8.3rc2/Linux-amd64/bin/convert-fasta-to-v2.pl -l organism_pbcor -s organism.cor.fa -q <(python /lustre/Work/qingxiangguo/dev/sofs/SMRT/ectools-master/qualgen.py cor.fa)> cor.frg
convert-fasta-to-v2.pl
runCA โd $work_dir โp ec_cor cor.frg
4. Use PacBio long reads to do gap filling for Illumina sequences
# Run .sh
#PBS -N pbjelly_test
#PBS -l nodes=1:ppn=3
#PBS -q cu
#PBS -S /bin/bash
source /lustre/Work/qingxiangguo/dev/sofs/SMRT/smrtanalysis_v3/current/etc/setup.sh
cd /backup01/06PBJelly/ref
/lustre/Work/software/Assembly/PBSuite_15.2.20/bin/fakeQuals.py lambda.fasta lambda.qual
cd /backup01/06PBJelly
Jelly.py setup Protocol.xml
Jelly.py mapping Protocol.xml
Jelly.py support Protocol.xml
Jelly.py extraction Protocol.xml
Jelly.py assembly Protocol.xml
Jelly.py output Protocol.xml
License
All source code, i.e. scripts/*.pl, scripts/*.sh or scripts/*.py are under the MIT license.