SPLICE

Analysis pipeline for long-read RNA-seq data using Nanopore technology

Requirements

python3 (v3.7 or higher)
minimap2 (v2.17 or higher)
luigi

Input file

FASTQ file

Usage

Step 1: Preparation of the reference transcriptome file

Download the reference transcriptome file from the UCSC Genome Browser (https://genome.ucsc.edu/):

Select Table Browser from the Tools tab.
Select "Genes and Gene Predictions" in the group section.
Select the desired database in the Track section and download the file.

Number the exons of the reference transcriptome with ref_exonnum.py.
If you use files from two databases (e.g. GENCODE and RefSeq), sort them by gene name and transcript name, and remove redundant transcripts.
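
The merge described above could be sketched as follows. This is only an illustration under stated assumptions: both inputs are taken to be tab-separated Table Browser exports whose header line starts with `#` and contains the columns `name`, `chrom`, `strand`, `exonStarts`, `exonEnds` and `name2`, and "redundant" is taken to mean identical chromosome, strand and exon coordinates. SPLICE's own definition of redundancy is not documented here, and the function names are hypothetical.

```python
import io


def read_table(handle):
    """Read a tab-separated table whose header line may start with '#'."""
    header = handle.readline().rstrip("\n").lstrip("#").split("\t")
    return [dict(zip(header, line.rstrip("\n").split("\t")))
            for line in handle if line.strip()]


def merge_transcripts(*tables):
    """Concatenate tables, sort by gene then transcript name, drop duplicates.

    ASSUMPTION: two transcripts are redundant when chromosome, strand and
    all exon coordinates are identical. The first one in sort order is kept.
    """
    rows = [row for table in tables for row in table]
    rows.sort(key=lambda r: (r["name2"], r["name"]))
    seen = set()
    unique = []
    for row in rows:
        key = (row["chrom"], row["strand"], row["exonStarts"], row["exonEnds"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Tiny in-memory demonstration with made-up records (not real annotations).
HEADER = "#name\tchrom\tstrand\texonStarts\texonEnds\tname2\n"
gencode = io.StringIO(HEADER + "ENST_A\tchr1\t+\t100,300,\t200,400,\tGENE1\n")
refseq = io.StringIO(HEADER
                     + "NM_A\tchr1\t+\t100,300,\t200,400,\tGENE1\n"
                     + "NM_B\tchr1\t+\t100,500,\t200,600,\tGENE1\n")
merged = merge_transcripts(read_table(gencode), read_table(refseq))
for row in merged:
    print(row["name2"], row["name"])
```

Keeping the first record in sort order is an arbitrary choice; a different tie-break (for example, preferring one database) may suit your analysis better.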

$ cd <path to SPLICE>
$ python ref_exonnum.py -h
usage: ref_exonnum.py [-h] -i INPUT -g GENOME [-o OUTPUT] [-p PREFIX] [-w WORKERS] [--tmp TMP]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        reference transcriptome (UCSC table browser output file)
  -g GENOME, --genome GENOME
                        reference genome sequence (FASTA)
  -o OUTPUT, --output OUTPUT
                        output directory
  -p PREFIX, --prefix PREFIX
                        prefix for the output files
  -w WORKERS, --workers WORKERS
                        number of threads
  --tmp TMP             temporary directory

Step 2: Annotation to reference transcriptome

$ python annot.py -h
usage: annot.py [-h] -i INPUT -g GENOME -r REF -p PREFIX [-o OUTPUT] [--tmp TMP] [-w WORKERS] [--bq_filt BQ_FILT]
                [--min_sc_len MIN_SC_LEN] [--mq_filt MQ_FILT] [--min_fusion_dist MIN_FUSION_DIST]
                [--max_fusion_bp_merge MAX_FUSION_BP_MERGE] [--min_fusion_read MIN_FUSION_READ]
                [--min_fusion_freq MIN_FUSION_FREQ] [--mq_filt_novel_exon MQ_FILT_NOVEL_EXON]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        query sequence data (FASTQ)
  -g GENOME, --genome GENOME
                        reference genome sequence (FASTA)
  -r REF, --ref REF     reference transcriptome file (output file of ref_exonnum.py)
  -p PREFIX, --prefix PREFIX
                        prefix for the output files
  -o OUTPUT, --output OUTPUT
                        output directory
  --tmp TMP             temporary directory
  -w WORKERS, --workers WORKERS
                        number of threads
  --minimap2 MINIMAP2   execution file of minimap2
  --bq_filt BQ_FILT     read quality cutoff. Minimum average base quality score (15)
  --min_sc_len MIN_SC_LEN
                        minimum length of the softclip region to be remapped (60)
  --mq_filt MQ_FILT     mapping quality cutoff (0)
  --min_fusion_dist MIN_FUSION_DIST
                        minimum distance of each transcript in the fusion transcript (200000)
  --max_fusion_bp_merge MAX_FUSION_BP_MERGE
                        maximum distance to merge fusion gene breakpoints (5)
  --min_fusion_read MIN_FUSION_READ
                        minimum number of support reads for fusion transcripts (1)
  --min_fusion_freq MIN_FUSION_FREQ
                        minimum frequency of the fusion transcript in the total amount of the gene (percentage) (0.1)
  --mq_filt_novel_exon MQ_FILT_NOVEL_EXON
                        mapping quality cutoff of novel exon (1)
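
To illustrate what a read quality cutoff such as --bq_filt means, here is a minimal sketch of a mean base quality filter on FASTQ records. It assumes Phred+33 encoding, an arithmetic mean of the per-base scores, and an inclusive threshold. How annot.py actually computes the average is not stated in this README, so treat `mean_base_quality` and `filter_fastq` as hypothetical helpers, not as the pipeline's implementation.

```python
def mean_base_quality(quality_string, offset=33):
    """Arithmetic mean of Phred scores in a FASTQ quality string (Phred+33)."""
    if not quality_string:
        return 0.0
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def filter_fastq(lines, min_mean_quality=15):
    """Yield (header, sequence, quality) for reads passing the cutoff.

    `lines` is any iterable of FASTQ lines in the 4-lines-per-record layout.
    ASSUMPTION: a read passes when its mean quality is >= the cutoff.
    """
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        header, seq, _plus, qual = records[i:i + 4]
        if mean_base_quality(qual) >= min_mean_quality:
            yield header, seq, qual


# Two made-up reads: 'I' encodes Phred 40, '%' encodes Phred 4.
demo_fastq = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT", "+", "%%%%"]
kept = list(filter_fastq(demo_fastq, min_mean_quality=15))
for header, seq, qual in kept:
    print(header, mean_base_quality(qual))
```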

Step 3: Analysis of expression levels (Option for multiple analysis)

$ python exp.py -h
usage: exp.py [-h] -i INPUT -r REF [-o OUTPUT] [--tmp TMP] [-w WORKERS] [--min_read_num MIN_READ_NUM]
              [--min_read_freq MIN_READ_FREQ] [--range_sj_eva RANGE_SJ_EVA] [--err_rate_filt ERR_RATE_FILT]
              [--min_novel_len_gap MIN_NOVEL_LEN_GAP] [--min_novel_exon_len MIN_NOVEL_EXON_LEN]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        input directory (output directory of annot.py)
  -r REF, --ref REF     reference transcriptome file (output file of ref_exonnum.py)
  -o OUTPUT, --output OUTPUT
                        output directory ('./out/exp')
  --tmp TMP             temporary directory ('./tmp/exp_tmp')
  -w WORKERS, --workers WORKERS
                        number of threads
  --min_read_num MIN_READ_NUM
                        minimum number of support reads (3)
  --min_read_freq MIN_READ_FREQ
                        minimum frequency of the transcript in the total amount of the gene (percentage) (1)
  --range_sj_eva RANGE_SJ_EVA
                        maximum change of novel exon length for evaluating the error rate (20)
  --err_rate_filt ERR_RATE_FILT
                        mapping error rate cutoff at splicing junction sites (percentage) (20)
  --min_novel_len_gap MIN_NOVEL_LEN_GAP
                        minimum change of novel exon length (5)
  --min_novel_exon_len MIN_NOVEL_EXON_LEN
                        minimum length of novel exon (60)
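
The two support thresholds, --min_read_num and --min_read_freq, can be read as a filter on per-transcript read counts. The sketch below shows one plausible interpretation: a transcript is kept when it has at least `min_read_num` supporting reads and accounts for at least `min_read_freq` percent of all reads assigned to its gene. Whether exp.py applies these per sample or across samples, and whether the comparisons are inclusive, is not stated here, so both are assumptions, and `filter_transcripts` is a hypothetical name.

```python
from collections import defaultdict


def filter_transcripts(read_counts, min_read_num=3, min_read_freq=1.0):
    """Filter a {(gene, transcript): supporting_reads} mapping.

    ASSUMPTION: frequency is the transcript's share (in percent) of all
    reads assigned to the same gene, and both thresholds are inclusive.
    """
    gene_totals = defaultdict(int)
    for (gene, _transcript), n in read_counts.items():
        gene_totals[gene] += n

    kept = {}
    for (gene, transcript), n in read_counts.items():
        total = gene_totals[gene]
        freq = 100.0 * n / total if total else 0.0
        if n >= min_read_num and freq >= min_read_freq:
            kept[(gene, transcript)] = n
    return kept


# Made-up counts for illustration.
counts = {
    ("GENE1", "tx1"): 500,
    ("GENE1", "tx2"): 4,   # enough reads, but under 1% of GENE1
    ("GENE1", "tx3"): 2,   # too few reads
    ("GENE2", "tx4"): 3,
}
kept_transcripts = filter_transcripts(counts)
print(sorted(kept_transcripts))
```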

Output

expression.tsv - Output file of transcript expression levels (number of supporting reads)

gene	transcript	known/novel	coding/non-coding	transcript length	novel information	sample1	sample2	sampleN
GAPDH	NM_001256799.2	known	coding	full	-	71	50	31
CSDE1	NM_001242891.1	known	coding	partial	-	346	40	88
SAP18	-	novel_exon_length	coding	-	,6,8,10,/,k,l,k,/21140681,21147186/,0,0,0,-35,0,0,0,0,	8	0	9
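
A minimal sketch for loading a table with this layout, assuming the file is tab-separated, the first six columns are the fixed annotation columns shown in the header above, and every remaining column is a per-sample read count. The function name `load_expression` is hypothetical, and the demonstration uses an in-memory copy of the first two example rows instead of a real file.

```python
import csv
import io

FIXED_COLUMNS = 6  # gene ... novel information, as in the header shown above


def load_expression(handle):
    """Return (sample_names, records) from an expression.tsv-style table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[FIXED_COLUMNS:]
    table = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        record = dict(zip(header[:FIXED_COLUMNS], fields[:FIXED_COLUMNS]))
        record["counts"] = {
            sample: int(value)
            for sample, value in zip(samples, fields[FIXED_COLUMNS:])
        }
        table.append(record)
    return samples, table


demo = "\n".join([
    "\t".join(["gene", "transcript", "known/novel", "coding/non-coding",
               "transcript length", "novel information",
               "sample1", "sample2", "sampleN"]),
    "\t".join(["GAPDH", "NM_001256799.2", "known", "coding", "full", "-",
               "71", "50", "31"]),
    "\t".join(["CSDE1", "NM_001242891.1", "known", "coding", "partial", "-",
               "346", "40", "88"]),
]) + "\n"

samples, table = load_expression(io.StringIO(demo))
for record in table:
    print(record["gene"], sum(record["counts"].values()))
```

For a real run, replace the `io.StringIO(demo)` handle with `open(path, newline="")` on your own expression.tsv.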

Notation of novel transcript

.fusion - Output file of fusion transcript expression levels (number of supporting reads) for each sample

Number of reads	Read frequency(%)	GeneA/B	ChrA/B	BreakpointA/B	Read IDs
150	28.571	UQCRFS1/YWHAE	chr19/chr17	29207585-29207585/1364879-1364879	86ada1...
29	7.143	RPS6KA5/TMSB4X	chr14/chrX	91060446-91060446/12977041-12977041	38a936...
12	5.128	TMED10/VPS4B	chr14/chr18	75132140-75132140/63390367-63390367	1e0cbc...

Breakpoint indicates the coordinate range obtained after merging neighboring breakpoints.
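
The paired columns of a .fusion row (GeneA/B, ChrA/B, BreakpointA/B) can be split into separate fields as sketched below. The sketch assumes tab-separated columns, `/` between the A and B values, and `-` between the start and end of each breakpoint range, as in the example rows. The Read IDs column is left unparsed because its delimiter is not shown. `Fusion` and `parse_fusion_line` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Fusion(NamedTuple):
    reads: int
    freq_percent: float
    gene_a: str
    gene_b: str
    chrom_a: str
    chrom_b: str
    breakpoint_a: Tuple[int, int]
    breakpoint_b: Tuple[int, int]


def parse_range(text):
    """Turn 'start-end' into a pair of integers."""
    start, end = text.split("-")
    return int(start), int(end)


def parse_fusion_line(line):
    """Parse the first five columns of one data row of a .fusion file."""
    fields = line.rstrip("\n").split("\t")
    reads, freq, genes, chroms, breakpoints = fields[:5]
    gene_a, gene_b = genes.split("/")
    chrom_a, chrom_b = chroms.split("/")
    bp_a, bp_b = breakpoints.split("/")
    return Fusion(int(reads), float(freq), gene_a, gene_b,
                  chrom_a, chrom_b, parse_range(bp_a), parse_range(bp_b))


# First example row from above; the truncated read ID is kept as shown.
row = "\t".join(["150", "28.571", "UQCRFS1/YWHAE", "chr19/chr17",
                 "29207585-29207585/1364879-1364879", "86ada1..."])
fusion = parse_fusion_line(row)
print(fusion.gene_a, fusion.chrom_a, fusion.breakpoint_a)
print(fusion.gene_b, fusion.chrom_b, fusion.breakpoint_b)
```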

Example

Preparation of the reference transcriptome file

$ git clone https://github.com/hkiyose/SPLICE
$ cd SPLICE
$ python ref_exonnum.py -i ./example/gencode_v29_chr1.tsv -g <path to reference genome sequence (FASTA)>

Annotation to reference transcriptome.

$ python annot.py -i ./example/fastq/sample1_test.fastq -g <path to reference genome sequence (FASTA)> -r ./out/ref_exonnum -p sample1 --tmp ./tmp/annot_tmp/sample1
$ python annot.py -i ./example/fastq/sample2_test.fastq -g <path to reference genome sequence (FASTA)> -r ./out/ref_exonnum -p sample2 --tmp ./tmp/annot_tmp/sample2

Analysis of expression levels

$ python exp.py -i ./out/annot -r ./out/ref_exonnum
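
With more than a couple of samples, the per-sample annot.py calls can be generated in a loop. The sketch below only builds and prints the command lines, using the flags shown in the example above. The genome path `genome.fa` is a placeholder, `build_annot_commands` is a hypothetical helper, and the execution line is left commented out so nothing runs by accident.

```python
import shlex


def build_annot_commands(samples, genome, ref, tmp_root="./tmp/annot_tmp"):
    """Build one annot.py argument list per sample.

    `samples` maps an output prefix to the path of its FASTQ file.
    """
    commands = []
    for prefix, fastq in samples.items():
        commands.append([
            "python", "annot.py",
            "-i", fastq,
            "-g", genome,
            "-r", ref,
            "-p", prefix,
            "--tmp", f"{tmp_root}/{prefix}",
        ])
    return commands


samples = {
    "sample1": "./example/fastq/sample1_test.fastq",
    "sample2": "./example/fastq/sample2_test.fastq",
}
commands = build_annot_commands(
    samples,
    genome="genome.fa",          # placeholder: path to your reference FASTA
    ref="./out/ref_exonnum",     # as in the example above
)
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # import subprocess; subprocess.run(cmd, check=True)  # run from the SPLICE directory
```

Giving each sample its own temporary directory, as in the example commands, keeps intermediate files from different runs apart.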

Installation and usage via Docker

Install Docker on your computer, then build a Docker image with the following commands.

$ git clone https://github.com/hkiyose/SPLICE.git
$ cd <path to SPLICE>
$ docker build -t splice .

The following command mounts the host directories containing the data into the container. Refer to Step 1 of Usage to download the reference data.

$ docker run --rm -it \
  -v <path to directory of reference transcriptome and reference genome sequence (FASTA)>:/ref \
  -v <path to directory of sample data(FASTQ)>:/sample_input \
  -v <path to output directory>:/sample_output \
  splice

Then run the pipeline inside the container as described in Usage.

License

GPLv3

Contact

Hiroki Kiyose - [email protected]

demo error

Dear developer,

I tried to run the demo data from the SPLICE GitHub repo with Docker, but the second step failed with several errors:

python3 annot.py -i /home/ubuntu/LR_Fusion_Benchmarking/LR-fusion-tools/SPLICE/SPLICE/example/fastq/sample1_test.fastq -g /home/ubuntu/LR_Fusion_Benchmarking/LR-fusion-tools/SPLICE/SPLICE/GRCh38.primary_assembly.genome.fa -r $(pwd)/out/ref_exonnum -p sample1  --tmp $(pwd)/tmp/annot_tmp/sample1

Are you still maintaining the software? Could you please take a look at the errors below?

INFO: [pid 59] Worker Worker(salt=7559400199, workers=1, host=dc7342cb9a7f, username=root, pid=59) running   tasks.ConvSam5()
INFO: [pid 59] Worker Worker(salt=7559400199, workers=1, host=dc7342cb9a7f, username=root, pid=59) done      tasks.ConvSam5()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   tasks.ConvSam5__99914b932b   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 34
INFO: [pid 59] Worker Worker(salt=7559400199, workers=1, host=dc7342cb9a7f, username=root, pid=59) running   tasks.ConvSam6()
ERROR: [pid 59] Worker Worker(salt=7559400199, workers=1, host=dc7342cb9a7f, username=root, pid=59) failed    tasks.ConvSam6()
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/luigi/worker.py", line 203, in run
    new_deps = self._run_get_new_deps()
  File "/usr/local/lib/python3.8/dist-packages/luigi/worker.py", line 138, in _run_get_new_deps
    task_gen = self.task.run()
  File "annot.py", line 376, in run
    tools.sam_convert2(self.input()[0].path, self.input()[1].path, "0.3")
  File "/SPLICE/tools/sam_converts.py", line 307, in sam_convert2
    print(pre_read_info_l_l[0],read_len_dict[pre_read_info_l_l[0]],pre_read_info_l[2],pre_read_info_l[3],pre_read_info_l[4],pre_read_info_l[5],pre_read_info_l[6],pre_read_info_l[7],pre_read_info_l[8],pre_read_info_l[9],pre_read_info_l[10],sep="\t")
KeyError: ''
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   tasks.ConvSam6__99914b932b   has status   FAILED
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 34
INFO: [pid 59] Worker Worker(salt=7559400199, workers=1, host=dc7342cb9a7f, username=root, pid=59) running   tasks.GetMatchRate()
INFO: [pid 59] Worker Worker(salt=7559400199, workers=1, host=dc7342cb9a7f, username=root, pid=59) done      tasks.GetMatchRate()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   tasks.GetMatchRate__99914b932b   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 33
INFO: [pid 59] Worker Worker(salt=7559400199, workers=1, host=dc7342cb9a7f, username=root, pid=59) running   tasks.Minimap2_1()
INFO: Informed scheduler that task   tasks.ConvSam3__99914b932b   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 27
INFO: [pid 59] Worker Worker(salt=7559400199, workers=1, host=dc7342cb9a7f, username=root, pid=59) running   tasks.ConvSam4()
INFO: [pid 59] Worker Worker(salt=7559400199, workers=1, host=dc7342cb9a7f, username=root, pid=59) done      tasks.ConvSam4()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   tasks.ConvSam4__99914b932b   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 26
INFO: [pid 59] Worker Worker(salt=7559400199, workers=1, host=dc7342cb9a7f, username=root, pid=59) running   tasks.AnnotExonnum()
ERROR: [pid 59] Worker Worker(salt=7559400199, workers=1, host=dc7342cb9a7f, username=root, pid=59) failed    tasks.AnnotExonnum()
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/luigi/worker.py", line 203, in run
    new_deps = self._run_get_new_deps()
  File "/usr/local/lib/python3.8/dist-packages/luigi/worker.py", line 138, in _run_get_new_deps
    task_gen = self.task.run()
  File "annot.py", line 143, in run
    tools.annot_exonnum(args.ref, self.input().path, args.mq_filt, "1", "1")
  File "/SPLICE/tools/annots.py", line 6, in annot_exonnum
    f1 = open(f)
IsADirectoryError: [Errno 21] Is a directory: '/SPLICE/out/ref_exonnum'
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   tasks.AnnotExonnum__99914b932b   has status   FAILED
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: There are 26 pending tasks possibly being run by other workers
DEBUG: There are 26 pending tasks unique to this worker
DEBUG: There are 26 pending tasks last scheduled by this worker
INFO: Worker Worker(salt=7559400199, workers=1, host=dc7342cb9a7f, username=root, pid=59) was stopped. Shutting down Keep-Alive thread
INFO:
===== Luigi Execution Summary =====

Best,
Qian
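
The second traceback above shows annot.py calling open() on the path given to -r, which in this run is a directory. A minimal pre-flight check, assuming only that -r must name a regular file, might look like the sketch below. The name of the file that ref_exonnum.py writes is not shown in this log, so the file name in the demonstration is hypothetical, as is the helper `check_reference_path`.

```python
import os
import tempfile


def check_reference_path(path):
    """Return `path` if it is a regular file; otherwise raise with a hint."""
    if os.path.isfile(path):
        return path
    if os.path.isdir(path):
        candidates = sorted(
            name for name in os.listdir(path)
            if os.path.isfile(os.path.join(path, name))
        )
        raise ValueError(
            f"{path!r} is a directory; pass one of its files instead: {candidates}"
        )
    raise FileNotFoundError(path)


# Demonstration on a throwaway directory containing one hypothetical file.
with tempfile.TemporaryDirectory() as tmp:
    ref_file = os.path.join(tmp, "reference.txt")  # hypothetical file name
    with open(ref_file, "w") as fh:
        fh.write("placeholder\n")
    checked = check_reference_path(ref_file)
    try:
        check_reference_path(tmp)
        message = ""
    except ValueError as err:
        message = str(err)

print(message)
```

This does not address the first failure (the KeyError in sam_convert2), which needs to be looked at separately.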
