yanglab / clear

direct comparison of circular and linear RNA expression

Python 100.00%

clear's Introduction

CLEAR/CIRCexplorer3

A computational pipeline for Circular and Linear RNA Expression Analysis from Ribosomal-RNA depleted (Ribo–) RNA-seq (CLEAR/CIRCexplorer3)

Schema

Installation requirements

Software
- CIRCexplorer2 (>=2.3.6)
- HISAT2 (>=2.0.5)
- StringTie (>1.3.6)
Package (python 2.7 +)
- pysam (>=0.8.4)
- pybedtools
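
Before installing, it can help to confirm that the external programs are reachable on PATH. The sketch below is not part of CLEAR. The executable names are assumptions based on the commands that appear in the issue tracebacks on this page (tophat2, CIRCexplorer2, gtfToGenePred, genePredToGtf) plus the usual hisat2, stringtie, bowtie and samtools binaries, and missing_tools is a hypothetical helper. It uses shutil.which, so it needs Python 3.

```python
import shutil

# Executables the pipeline appears to call; this list is an assumption,
# adjust it to match your installation.
REQUIRED_TOOLS = [
    "hisat2", "stringtie", "tophat2", "bowtie", "samtools",
    "CIRCexplorer2", "gtfToGenePred", "genePredToGtf",
]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    for tool in missing_tools():
        print("not found on PATH:", tool)
```

The check only tests presence, not the version constraints listed above.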

Installation

git clone https://github.com/YangLab/CLEAR
cd CLEAR
python ./setup.py install

Usage

Start from fastq file:

usage: clear_quant [-h] -1 M1 [-2 M2] -g GENOME -i HISAT -j BOWTIE1 -G GTF
              [-o OUTPUT] [-p THREAD]

optional arguments:
  -h, --help            Show this help message and exit.
  -1 M1                 Comma-separated list of read sequence files in FASTQ
                        format. When running with pair-end read, this should
                        contain #1 mates.
  -2 M2                 Comma-separated list of read sequence files in FASTQ
                        format. -2 is only used when running with pair-end
                        read. This should contain #2 mates.
  -g GENOME, --genome GENOME
                        Genome FASTA file.
  -i HISAT, --hisat HISAT
                        Index files for HISAT2.
  -j BOWTIE1, --bowtie1 BOWTIE1
                        Index files for TopHat-Fusion.
  -G GTF, --gtf GTF     Annotation GTF file.
  -o OUTPUT, --output OUTPUT
                        The output directory.
  -p THREAD, --thread THREAD
                        Running threads. [default: 5]

Start from CIRCexplorer2 output file:

usage: circ_quant [-h] -c CIRC -b BAM -r REF [--threshold THRESHOLD]
                     [--ratio RATIO] [-l] [-t] [-o OUTPUT]

optional arguments:
  -h, --help            Show this help message and exit.
  -c CIRC, --circ CIRC  Input circular RNA file from CIRCexplorer2.
  -b BAM, --bam BAM     Input mapped reads from HISAT2 in BAM format.
  -r REF, --ref REF     The refFlat format gene annotation file.
  --threshold THRESHOLD
                        Threshold of FPB for choose circRNAs to filter linear
                        SJ.[default: 1]
  --ratio RATIO         The ratio is used for adjust comparison between circ
                        and linear.[default: 1]
  -l, --length          Whether to consider all reads' length? [default: False]
  -t, --tmp             Keep tmp dir? [default: False]
  -o OUTPUT, --output OUTPUT
                        Output file. [default: circRNA_quant.txt]

Example

Start from fastq file:

clear_quant -1 mate_1.fastq -2 mate_2.fastq -g hg38.fa -i hg38.hisat_index -j hg38.bowtie_index -G annotation.gtf -o output_dir

Start from CIRCexplorer2 output file:

circ_quant -c CIRCexplorer2_output.txt -b hisat_aligned.bam -t -r annotation.refFlat -o quant.txt

hisat_aligned.bam should not contain unmapped reads.
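
If the BAM file still contains unmapped reads, one way to drop them is samtools view with -F 4 (exclude records that have the "unmapped" flag set). The following is a minimal sketch, not something CLEAR provides: mapped_only_cmd and write_mapped_only are hypothetical helper names, and the file names are placeholders.

```python
import subprocess


def mapped_only_cmd(in_bam, out_bam):
    """Build a samtools command that keeps only mapped reads (-F 4)."""
    return ["samtools", "view", "-b", "-F", "4", "-o", out_bam, in_bam]


def write_mapped_only(in_bam, out_bam):
    """Filter the BAM and index the result (input must be coordinate-sorted)."""
    subprocess.check_call(mapped_only_cmd(in_bam, out_bam))
    subprocess.check_call(["samtools", "index", out_bam])


# Usage (requires samtools on PATH):
# write_mapped_only("hisat_aligned.bam", "hisat_aligned.mapped.bam")
```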

Output

output_dir/quant/quant.txt

Field	Description
chrom	Chromosome
start	Start of circular RNA
end	End of circular RNA
name	Circular RNA/Junction reads
score	Flag of fusion junction realignment
strand	+ or - for strand
thickStart	No meaning
thickEnd	No meaning
itemRgb	0,0,0
exonCount	Number of exons
exonSizes	Exon sizes
exonOffsets	Exon offsets
readNumber	Number of junction reads
circType	Type of circular RNA
geneName	Name of gene
isoformName	Name of isoform
index	Index of exon or intron
flankIntron	Left intron/Right intron
FPBcirc	Expression of circRNA
FPBlinear	Expression of cognate linear RNA
CIRCscore	Relative expression of circRNA
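
For downstream analysis, the table above can be read into dictionaries keyed by field name. This is a sketch under assumptions that the README does not state explicitly: that quant.txt is tab-separated, has no header line, and lists the columns in the order shown above. read_quant is a hypothetical helper, not part of CLEAR.

```python
# Column names in the order given in the output table (assumed file order).
QUANT_FIELDS = [
    "chrom", "start", "end", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "exonCount", "exonSizes",
    "exonOffsets", "readNumber", "circType", "geneName", "isoformName",
    "index", "flankIntron", "FPBcirc", "FPBlinear", "CIRCscore",
]


def _number(text, cast):
    """Convert text with `cast`; leave it unchanged if it is not numeric."""
    try:
        return cast(text)
    except ValueError:
        return text


def read_quant(lines):
    """Yield one dict per circRNA from tab-separated quant.txt lines."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != len(QUANT_FIELDS):
            continue  # skip blank lines or lines with an unexpected layout
        rec = dict(zip(QUANT_FIELDS, cols))
        for key in ("start", "end", "readNumber"):
            rec[key] = _number(rec[key], int)
        for key in ("FPBcirc", "FPBlinear", "CIRCscore"):
            rec[key] = _number(rec[key], float)
        yield rec


# Usage:
# with open("output_dir/quant/quant.txt") as handle:
#     records = list(read_quant(handle))
```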

Citation

Ma XK*, Wang MR, Liu CX, Dong R, Carmichael GG, Chen LL and Yang L#. A CLEAR pipeline for direct comparison of circular and linear RNA expression. bioRxiv, 2019. doi: 10.1101/668657

License


clear's Issues

exonFrames field is being added, -genePredExt but no valid frames

Hello.
I am trying to use CLEAR for my data set and running the following command:
clear_quant -1 /userdata/sharmishtha/Hela/trimmedFastqFiles/trim_HeLa-AMT-1_R1.fastq.gz -2 /userdata/sharmishtha/Hela/trimmedFastqFiles/trim_HeLa-AMT-1_R2.fastq.gz -g /userdata/sharmishtha/ref_and_anno/hg38/hg38.fa -i /userdata/sharmishtha/IndexFiles/hg38/hisat2index/hg38_hisat2_index -j /userdata/sharmishtha/IndexFiles/hg38/bowtie1_index/bowtie1_index -G /userdata/sharmishtha/IndexFiles/hg38/hg38_kg.gtf -o HelaAMT1_output_dir

The steps up to TopHat-Fusion worked, but I got an error after TopHat-Fusion:
###Start circRNA annotation
Error: exonFrames field is being added, but I found a gene (ENST00000602051.5) with CDS but no valid frames. This can happen if program is invoked with -genePredExt but no valid frames are given in the file. If the 8th field of GFF/GTF file is always a placeholder, then don't use -genePredExt.
Traceback (most recent call last):
File "/userdata/sharmishtha/tools/anaconda3/envs/myenv/bin/clear_quant", line 11, in
load_entry_point('CLEAR==1.0.0', 'console_scripts', 'clear_quant')()
File "build/bdist.linux-x86_64/egg/src/run.py", line 262, in main
File "build/bdist.linux-x86_64/egg/src/run.py", line 173, in circ_annot
File "/userdata/sharmishtha/tools/anaconda3/envs/myenv/lib/python2.7/subprocess.py", line 223, in check_output
raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['gtfToGenePred', '-genePredExt', '/userdata/sharmishtha/IndexFiles/hg38/hg38_kg.gtf', 'HelaAMT1_output_dir/circ/genePred.tmp']' returned non-zero exit status 255

I used the CIRCexplorer2 command to get the GTF file:
cut -f2-11 hg38_ref.txt|genePredToGtf file stdin hg38_ref.gtf

So I don't know what is going on. Why is the GTF file giving this error? Kindly help.
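
The gtfToGenePred message above points at CDS records whose 8th column (the frame) is a placeholder. A quick way to see which transcripts are affected is to scan the GTF for transcripts that have CDS rows but no CDS row with frame 0, 1 or 2. This is only a diagnostic sketch with a hypothetical function name; it does not modify the file and does not reproduce gtfToGenePred's own validation.

```python
import re


def transcripts_without_cds_frame(gtf_lines):
    """Return transcript_ids that have CDS rows but no CDS row with a frame."""
    has_cds = set()
    has_frame = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        match = re.search(r'transcript_id "([^"]+)"', cols[8])
        if match is None:
            continue
        tid = match.group(1)
        has_cds.add(tid)
        if cols[7] in ("0", "1", "2"):  # 8th GTF column is the frame
            has_frame.add(tid)
    return sorted(has_cds - has_frame)


# Usage:
# with open("hg38_kg.gtf") as handle:
#     print(transcripts_without_cds_frame(handle)[:10])
```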

Installation issue

When installing as per instructions, the installer runs without error, however, there seem to be some underlying linking/module placing and finding issues.
(I've had the same issue with multiple python3 installations; for reproducibility, here it is in conda on Linux.)

conda activate clear
conda install circexplorer2 -c bioconda
git clone https://github.com/YangLab/CLEAR.git
cd CLEAR
python ./setup.py install

. . .
verbose checking creating copying
. . .

Adding CLEAR 1.0.0 to easy-install.pth file
Installing circ_quant script to /usr2/collab/pessinj/.conda/envs/clear/bin
Installing clear_quant script to /usr2/collab/pessinj/.conda/envs/clear/bin
Installed /usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages/CLEAR-1.0.0-py3.6.egg
Processing dependencies for CLEAR==1.0.0
Searching for pybedtools==0.8.0
Best match: pybedtools 0.8.0
Adding pybedtools 0.8.0 to easy-install.pth file
Using /usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages
Searching for pysam==0.15.3
Best match: pysam 0.15.3 
Adding pysam 0.15.3 to easy-install.pth file
Using /usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages 
Searching for six==1.12.0
Best match: six 1.12.0
Adding six 1.12.0 to easy-install.pth file
Using /usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages
Finished processing dependencies for CLEAR==1.0.0

clear_quant and circ_quant are now in the path

(clear) [pessinj@scc2 clear]$ which clear_quant
~/.conda/envs/clear/bin/clear_quant 
(clear) [pessinj@scc2 clear]$ which circ_quant
~/.conda/envs/clear/bin/circ_quant

But clear_quant cannot find circ_quant

(clear) [pessinj@scc2 CLEAR]$ clear_quant -h
Traceback (most recent call last):
  File "/usr2/collab/pessinj/.conda/envs/clear/bin/clear_quant", line 11, in <module>
    load_entry_point('CLEAR==1.0.0', 'console_scripts', 'clear_quant')()
  File "/usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2843, in load_entry_point
    return ep.load()
  File "/usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2434, in load
    return self.resolve()
  File "/usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2440, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages/CLEAR-1.0.0-py3.6.egg/src/run.py", line 11, in <module>
ModuleNotFoundError: No module named 'circ_quant'

and circ_quant cannot find spReads

(clear) [pessinj@scc2 CLEAR]$ circ_quant -h
Traceback (most recent call last):
  File "/usr2/collab/pessinj/.conda/envs/clear/bin/circ_quant", line 11, in <module>
    load_entry_point('CLEAR==1.0.0', 'console_scripts', 'circ_quant')()
  File "/usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2843, in load_entry_point
    return ep.load()
  File "/usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2434, in load
    return self.resolve()
  File "/usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2440, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr2/collab/pessinj/.conda/envs/clear/lib/python3.6/site-packages/CLEAR-1.0.0-py3.6.egg/src/circ_quant.py", line 19, in <module>
ModuleNotFoundError: No module named 'spReads'

spReads is in the git repo

(clear) [pessinj@scc2 CLEAR]$ find $PWD -name spRead* 
/usr2/collab/pessinj/CLEAR/build/lib/src/spReads.py 
/usr2/collab/pessinj/CLEAR/src/spReads.py

but some quick use of find does not turn it up in the conda-python space

(clear) [pessinj@scc2 CLEAR]$ cd ~/.conda/envs/clear/
(clear) [pessinj@scc2 clear]$ find $PWD -name spRead*
(clear) [pessinj@scc2 clear]$ find $PWD -name clear*
/usr2/collab/pessinj/.conda/envs/clear
/usr2/collab/pessinj/.conda/envs/clear/bin/clear
/usr2/collab/pessinj/.conda/envs/clear/bin/clear_quant
(clear) [pessinj@scc2 clear]$ find $PWD -name circ*
/usr2/collab/pessinj/.conda/envs/clear/bin/circ_quant
/usr2/collab/pessinj/.conda/envs/clear/conda-meta/circexplorer2-2.3.6-py_0.json
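
A note on the tracebacks above: src/run.py imports circ_quant, and src/circ_quant.py imports spReads, as bare top-level names even though both files live inside the src package in the egg. Python 2 resolves such sibling imports implicitly, while Python 3 does not, which would be consistent with the failure appearing under Python 3.6; this is an interpretation of the log, not something the authors have confirmed. The sketch below (hypothetical helper name) only reports which of the given names the current interpreter cannot resolve as top-level modules.

```python
import importlib.util


def unresolved_top_level(names):
    """Return the names that cannot be imported as top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Module names taken from the two tracebacks above.
    print(unresolved_top_level(["circ_quant", "spReads"]))
```

If both names are reported, running the tools in a Python 2.7 environment is one option to try.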

CLEAR availability in anaconda

Hi,

Is this version available in anaconda yet? Are you planning to add it soon?

Thanks.

IOError: [Errno 2] No such file or directory: circ/genePred.tmp

Hi,
Thanks for CLEAR.
I am having trouble using CLEAR and am looking for a solution. I need your help; this is the error log.

###Parameters:
Namespace(bowtie1='/home/ionadmin/likun/13SequencingData/00.ref/GRCh38_bowtie_index/GCA_000001405.15_GRCh38_no_alt_analysis_set', genome='/home/ionadmin/likun/13SequencingData/00.ref/02.circ/hg38.fa', gtf='/home/ionadmin/likun/13SequencingData/00.ref/gencode.v39.annotation.gff3', hisat='/home/ionadmin/likun/13SequencingData/00.ref/grch38_snp_tran/genome_snp_tran', m1='temp/test.r1.fq', m2='temp/test.r2.fq', output='04.circRNA/test', thread='8')
###Parameters

###Start hisat2 mapping
# start to get sp sites for hisat mapping
# start to align to genome by hisat
# get mapped and unmapped reads
# sort bam file
# index bam file
###End hisat2 mapping


###Start tophat-fusion mapping
###End tophat-fusion mapping


###Start circRNA annotation
Traceback (most recent call last):
  File "/results/likun/13SequencingData/.snakemake/conda/11a4808895ed45c4508ebd2a16b6e45c/bin/clear_quant", line 11, in <module>
    load_entry_point('CLEAR==1.0.1', 'console_scripts', 'clear_quant')()
  File "/results/likun/13SequencingData/.snakemake/conda/11a4808895ed45c4508ebd2a16b6e45c/lib/python2.7/site-packages/CLEAR-1.0.1-py2.7.egg/src/run.py", line 269, in main
    args.genome, args.gtf, circ_dir)
  File "/results/likun/13SequencingData/.snakemake/conda/11a4808895ed45c4508ebd2a16b6e45c/lib/python2.7/site-packages/CLEAR-1.0.1-py2.7.egg/src/run.py", line 181, in circ_annot
    with open('{}/genePred.tmp'.format(circ_dir), 'r') as inf,\
IOError: [Errno 2] No such file or directory: '04.circRNA/test/circ/genePred.tmp'
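
One detail visible in the parameters above is that -G was given a file ending in .gff3, whereas the option is documented as an annotation GTF file. Whether that is the cause of the missing genePred.tmp is a guess, but it is cheap to check which flavor a file actually is by looking at its attribute column: GTF uses key "value"; pairs and GFF3 uses key=value pairs. A minimal sketch with a hypothetical function name:

```python
import re


def annotation_flavor(lines):
    """Guess 'gtf', 'gff3' or 'unknown' from the header or first feature line."""
    for line in lines:
        if line.startswith("##gff-version 3"):
            return "gff3"
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            return "unknown"
        attrs = cols[8]
        if re.search(r'\b\w+ "[^"]*"', attrs):  # e.g. gene_id "ENSG..."
            return "gtf"
        if re.search(r"\b\w+=", attrs):  # e.g. ID=exon:1;Parent=...
            return "gff3"
        return "unknown"
    return "unknown"


# Usage:
# with open("gencode.v39.annotation.gff3") as handle:
#     print(annotation_flavor(handle))
```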

TypeError when Running CLEAR: 'NoneType' object is not iterable

Hello! First, thanks a lot for providing a fantastic tool for circRNA analysis!
I have an issue when running circ_quant: the nohup.out file reports TypeError: 'NoneType' object is not iterable.

Here is the log file:

###Parameters:
Namespace(bam='/home/zhangfy/data_projects/RNAseq/LW_MPP/sambamfiles/C-1.sam.sorted.bam', circ='/home/zhangfy/data_projects/RNAseq/LW_MPP/circ_RNA_proj/4_annotate/C-1_circularRNA_known.txt', length=False, output='C-1_quant.txt', ratio=1, ref='/home/zhangfy/data_projects/RNAseq/LW_MPP/circ_RNA_proj/4_annotate/Rattus.genepred', threshold=1, tmp=True)
###Parameters
genePredToGtf is required for maximal isoform selection!
###Parameters:
Namespace(bam='/home/zhangfy/data_projects/RNAseq/LW_MPP/sambamfiles/C-1.sam.sorted.bam', circ='/home/zhangfy/data_projects/RNAseq/LW_MPP/circ_RNA_proj/4_annotate/C-1_circularRNA_known.txt', length=False, output='C-1_quant.txt', ratio=1, ref='/home/zhangfy/data_projects/RNAseq/LW_MPP/circ_RNA_proj/4_annotate/Rattus.genepred', threshold=1, tmp=True)
###Parameters
Traceback (most recent call last):
File "/home/zhangfy/miniconda3/envs/RNAseq/bin/circ_quant", line 11, in
load_entry_point('CLEAR==1.0.1', 'console_scripts', 'circ_quant')()
File "/home/zhangfy/miniconda3/envs/RNAseq/lib/python3.6/site-packages/CLEAR-1.0.1-py3.6.egg/src/circ_quant.py", line 334, in main
spReads.extract(extract_args)
File "/home/zhangfy/miniconda3/envs/RNAseq/lib/python3.6/site-packages/CLEAR-1.0.1-py3.6.egg/src/spReads.py", line 83, in extract
cigar = getCigar(read)
File "/home/zhangfy/miniconda3/envs/RNAseq/lib/python3.6/site-packages/CLEAR-1.0.1-py3.6.egg/src/spReads.py", line 30, in getCigar
cigar = [ (num2op[num], length) for num, length in read.cigartuples]
TypeError: 'NoneType' object is not iterable

And this is my code:

nohup circ_quant -c /home/zhangfy/data_projects/RNAseq/LW_MPP/circ_RNA_proj/4_annotate/C-1_circularRNA_known.txt \
-b /home/zhangfy/data_projects/RNAseq/LW_MPP/sambamfiles/C-1.sam.sorted.bam \
-t \
-r /home/zhangfy/data_projects/RNAseq/LW_MPP/circ_RNA_proj/4_annotate/Rattus.genepred \
-o C-1_quant.txt &

C-1_circularRNA_known.txt is the output file of CIRCExplorer2 annotate
C-1.sam.sorted.bam is the bam file of one sample. I have used samtools sort and samtools index to get this bam file
Rattus.genepred is the genepred file

Thanks a lot!
Zhangfy
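
For context on the traceback above: in pysam, read.cigartuples is None for records that have no CIGAR string, which is typically the case for unmapped reads, so iterating over it fails. The README also states that the BAM passed to circ_quant should not contain unmapped reads. The sketch below shows a guarded version of the conversion; safe_cigar and NUM2OP are illustrative names (the num2op table in spReads.py is not shown on this page, so its contents here are assumed from the standard BAM operation codes).

```python
# BAM CIGAR operation codes 0..8 in specification order: M I D N S H P = X
NUM2OP = dict(enumerate("MIDNSHP=X"))


def safe_cigar(read):
    """Return [(op, length), ...], or None when the read has no CIGAR."""
    tuples = getattr(read, "cigartuples", None)
    if tuples is None:
        return None  # e.g. an unmapped read
    return [(NUM2OP[num], length) for num, length in tuples]
```

Filtering the BAM so that it only holds mapped reads avoids the problem without touching the installed code.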

Issue when using clear_quant

I ran into an issue when trying to run clear_quant:
"[2019-08-29 12:58:42] Mapping left_kept_reads to genome hg19 with Bowtie
[FAILED]
Error running bowtie:
Error while flushing and closing output
terminate called after throwing an instance of 'int'"

I looked this error up online, and it was attributed to running out of storage. But that should not be possible, since I am using a server.

Tophat2, Bowtie, Samtools compatibility problems

Hi, there

I appreciate your hard work in keeping the CIRCexplorer pipeline up to date. When I found that your latest version, CIRCexplorer3/CLEAR, uses HISAT2 for the first alignment, I could not wait to try it.

However, I hit the same problem I had with CIRCexplorer2. TopHat2 (version 2.1.0) kept giving me errors like: 'Error running bowtie: Error while flushing and closing output'. Then I realized it was another compatibility problem between TopHat2 and Bowtie.

I then looked for a solution on Biostars and GitHub. Finally I found your answer in another issue (#3) saying that bowtie (v0.12.9) and tophat2 (v2.0.12) work very well together. I installed those older versions and tried again, but then got an error suggesting that tophat2 does not recognize my Samtools version (probably my Samtools v1.10 is too new).

Is it possible to use another aligner, such as STAR, to align the unmapped reads? Or would you please share your software versions? I would really appreciate your help.

Regards

tophat fusion error

I have installed all the Bowtie, TopHat and Samtools packages recommended for the TopHat-Fusion run, but I am getting this error: it cannot locate my samtools. I have the version you asked to install. Kindly help.

Beginning TopHat run (v2.0.12)

[2021-01-09 16:21:33] Checking for Bowtie
Bowtie version: 1.0.0.0
[2021-01-09 16:21:33] Checking for Samtools
Traceback (most recent call last):
File "/userdata/sharmishtha/tools/tophat-2.0.12.Linux_x86_64/tophat", line 4087, in
sys.exit(main())
File "/userdata/sharmishtha/tools/tophat-2.0.12.Linux_x86_64/tophat", line 3885, in main
check_samtools()
File "/userdata/sharmishtha/tools/tophat-2.0.12.Linux_x86_64/tophat", line 1559, in check_samtools
samtools_version_str, samtools_version_arr = get_samtools_version()
File "/userdata/sharmishtha/tools/tophat-2.0.12.Linux_x86_64/tophat", line 1541, in get_samtools_version
samtools_version_arr = [int(version_match.group(x)) for x in [1,2,3]]
AttributeError: 'NoneType' object has no attribute 'group'
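
The traceback shows that version_match is None, meaning TopHat's pattern did not match the version text of the samtools it found on PATH, and the next line reads three numeric groups. One situation in which that can happen is a two-part version string such as 1.10, as opposed to a three-part one such as 0.1.19. The sketch below illustrates the effect with an assumed pattern; the exact expression in TopHat 2.0.12 is not shown on this page and may differ.

```python
import re

# Illustrative pattern only: three dot-separated numbers after "Version:".
THREE_PART = re.compile(r"Version:\s+(\d+)\.(\d+)\.(\d+)")


def parse_three_part_version(text):
    """Return (major, minor, patch), or None when the text does not match."""
    match = THREE_PART.search(text)
    if match is None:
        return None
    return tuple(int(match.group(i)) for i in (1, 2, 3))
```

Checking what "samtools" actually resolves to, and what version text it prints, is a reasonable first step.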

FPBcirc vs FPBlinear comparison from different sequencing data

Hi,

I have two sets of sequencing data for the same cell lines:

Total RNA-Seq (ribo-)
circRNA-Seq (ribo- and RNaseR+)

Referring to the CLEAR article, it says:

Different from poly(A)+ RNA-seq datasets that are used to detect polyadenylated cognate linear RNAs, all three types of non-polyadenylated RNA-seq can be used to determine circRNA expression by FPB. However, only ribo− RNA-seq datasets that profile both polyadenylated linear and non-polyadenylated circular RNAs in parallel are suitable for direct circular and linear RNA expression comparison by CIRCscore. In contrast, in poly(A)−/ribo−, and RNase R-treated RNA-seq datasets, polyadenylated linear RNAs are largely depleted, which is unsuitable for accurate linear RNA quantification and subsequent CIRCscore evaluation.

It means I cannot compare circRNA expression to linearRNA expression directly in either of these datasets by utilizing CIRCscore. Rather, I would like to compare them by calculating:

FPBlinear score from RNA-Seq
FPBcirc score from circRNA-Seq

Does this make sense? Are they comparable, given that they are derived from the same cell lines but come from different sequencing data? Do I need an additional normalization step?

Thanks a lot.
@xingma


Tophat-fusion error

Hi,

In the mapping with tophat-fusion step, I get the below error:

$ clear_quant -1 206_1.fq.gz -2 206_2.fq.gz -g /library/hg19.fa -i /library/hg19.fa -j /library/hg19.fa -G /library/hg19.genes.gtf -o 206 -p 40
###Parameters:
Namespace(bowtie1='/library/hg19.fa', genome='/library/hg19.fa', gtf='/library/hg19.genes.gtf', hisat='/library/hg19.fa', m1='206_1.fq.gz', m2='206_2.fq.gz', output='206', thread='40')
###Parameters

###Start hisat2 mapping
# start to get sp sites for hisat mapping
# start to align to genome by hisat
# get mapped and unmapped reads
# sort bam file
# index bam file
###End hisat2 mapping


###Start tophat-fusion mapping
Traceback (most recent call last):
  File "/software/anaconda2/bin/clear_quant", line 11, in <module>
    load_entry_point('CLEAR==1.0.0', 'console_scripts', 'clear_quant')()
  File "build/bdist.linux-x86_64/egg/src/run.py", line 256, in main
  File "build/bdist.linux-x86_64/egg/src/run.py", line 159, in fusion_align
  File "/software/anaconda2/lib/python2.7/subprocess.py", line 223, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['tophat2', '-o', '206/fusion', '-p', '40', '--fusion-search', '--keep-fasta-order', '--bowtie1', '--no-coverage-search', '/library/hg19.fa', '206/hisat/unmapped.fq']' returned non-zero exit status 1

I am not sure what the error says and how to fix it. Can you help?
@xingma
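
In the command above, -i and -j are both set to the genome FASTA path, whereas the usage text describes them as index files for HISAT2 and for TopHat-Fusion (Bowtie1). Aligners take an index prefix, so it may be worth checking that the expected index files exist next to the prefix before a long run. A minimal sketch, assuming the standard small-index file suffixes (large indexes use .ebwtl and .ht2l instead); missing_index_files is a hypothetical helper.

```python
import os

# Standard file suffixes written by bowtie-build and hisat2-build
# for small indexes (an assumption about how the indexes were built).
BOWTIE1_SUFFIXES = [".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
                    ".rev.1.ebwt", ".rev.2.ebwt"]
HISAT2_SUFFIXES = [".%d.ht2" % i for i in range(1, 9)]


def missing_index_files(prefix, suffixes):
    """Return the index files expected for `prefix` that do not exist."""
    return [prefix + s for s in suffixes if not os.path.exists(prefix + s)]


if __name__ == "__main__":
    # Paths taken from the command above.
    print(missing_index_files("/library/hg19.fa", HISAT2_SUFFIXES))
    print(missing_index_files("/library/hg19.fa", BOWTIE1_SUFFIXES))
```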

single end reads

What flag should be used for single-end reads?
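
According to the usage text in the README, -2 is optional and only used for paired-end data, so a single-end run would pass -1 alone. The sketch below builds the argument list both ways; clear_quant_args is a hypothetical wrapper and the file names in the usage comment are placeholders.

```python
def clear_quant_args(m1, genome, hisat, bowtie1, gtf, output,
                     m2=None, threads=5):
    """Build a clear_quant command line; omit -2 for single-end reads."""
    args = ["clear_quant", "-1", m1]
    if m2 is not None:
        args += ["-2", m2]
    args += ["-g", genome, "-i", hisat, "-j", bowtie1,
             "-G", gtf, "-o", output, "-p", str(threads)]
    return args


# Usage (single-end):
# import subprocess
# subprocess.check_call(clear_quant_args(
#     "reads.fastq", "hg38.fa", "hg38.hisat_index",
#     "hg38.bowtie_index", "annotation.gtf", "output_dir"))
```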

CodingErrorFound

Hi, when I try to run the command 'clear_quant -h', the following error is raised:
Traceback (most recent call last):
  File "/data/users/dqgu/anaconda3/bin/clear_quant", line 11, in <module>
    load_entry_point('CLEAR==1.0.1', 'console_scripts', 'clear_quant')()
  File "/data/users/dqgu/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 490, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/data/users/dqgu/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2859, in load_entry_point
    return ep.load()
  File "/data/users/dqgu/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2450, in load
    return self.resolve()
  File "/data/users/dqgu/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2456, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/data/users/dqgu/anaconda3/lib/python3.7/site-packages/CLEAR-1.0.1-py3.7.egg/src/run.py", line 178
    DEVNULL = open(os.devnull, 'wb')
    ^
TabError: inconsistent use of tabs and spaces in indentation

Best wishes!
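
Python 3 rejects files whose indentation mixes tabs and spaces inconsistently, while Python 2 is more lenient, which is why the same file can load under 2.7 and fail under 3.7. To locate the offending line in a source file without importing it, the text can be compiled and the TabError caught. This is a small diagnostic sketch with a hypothetical function name; any other syntax error in the file is deliberately left to propagate.

```python
def tab_error_line(source, filename="<string>"):
    """Return the line number of the first TabError, or None if there is none."""
    try:
        compile(source, filename, "exec")
    except TabError as err:
        return err.lineno
    return None


# Usage:
# with open("src/run.py") as handle:
#     print(tab_error_line(handle.read(), "src/run.py"))
```

The standard library also ships "python -m tabnanny <file>" for the same purpose.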

CLEAR with STAR alignment

I attach below the full CLEAR pipeline with STAR alignment in case someone needs it:

# define parameters
file_extension="_1.fq.gz"
read_length=100
ref_genome="hg19"

# download reference files
fetch_ucsc.py "$ref_genome" fa "$ref_genome.fa"
fetch_ucsc.py "$ref_genome" ref "$ref_genome.ref.txt"
cut -f2-11 "$ref_genome.ref.txt" | genePredToGtf file stdin "$ref_genome.ref.gtf"

# generate genome index file
STAR --runMode genomeGenerate --genomeDir "STAR_$ref_genome/$read_length" --limitIObufferSize 1000000000 --runThreadN 16 --genomeFastaFiles "$ref_genome.fa" --outFileNamePrefix ./ --sjdbGTFfile "$ref_genome.ref.gtf" --sjdbOverhang "$(($read_length-1))"

# run pipeline
for read1 in $(find . -type l -name "*$file_extension"); do
        name="${read1%_1.fq.gz}" && \
        read2="${name}_2.fq.gz" && \
        mkdir -p "$name" && \
        STAR --chimSegmentMin 20 --runThreadN 16 --genomeLoad LoadAndRemove --limitBAMsortRAM 50000000000 --limitIObufferSize 1000000000 --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat --outFileNamePrefix "$name/" --genomeDir "STAR_$ref_genome/$read_length" --readFilesIn "$read1" "$read2" > "$name/$name.circRNA_alignment.log" 2>&1 && \
        samtools index "$name/Aligned.sortedByCoord.out.bam" && \
        fast_circ.py parse -r "$ref_genome.ref.txt" -g "$ref_genome.fa" -t STAR -o "$name/circRNA_out" "$name/Chimeric.out.junction" > "$name/$name.circRNA_parse.log" 2>&1 && \
        circ_quant -c "$name/circRNA_out/circularRNA_known.txt" -b "$name/Aligned.sortedByCoord.out.bam" -r "$ref_genome.ref.txt" -o "$name.circRNA_quant.txt" > "$name/$name.circRNA_quant.log" 2>&1 &
done


CLEAR: python2 or python3?

Greetings,

I tried to use the CLEAR pipeline under a python3 environment, and it worked well until the TopHat, where I got:

TopHat2 is not compable to python3. You can change to a python2 environment or change tophat shebang "#!/usr/bin/env python" to "#!/usr/bin/env python2"

Therefore, I created a python2 environment as suggested. This time I couldn't even get to the TopHat step; I got an error as soon as I launched CLEAR:

File "/opt/miniconda3/envs/TopHat/lib/python2.7/site-packages/pybedtools/contrib/long_range_interaction.py", line 357
print("%d (%.1f%%)\r" % (c, c / float(n) * 100), end="")

What can I do to solve the issue? Thanks in advance.

Error in run clear_quant when "start tophat-fusion mapping"

Hi.
Here is the bug,
Traceback (most recent call last):
File "/zs32/home/jcdai/zouhan/ChIP/miniconda2/bin/clear_quant", line 9, in
load_entry_point('CLEAR==1.0.0', 'console_scripts', 'clear_quant')()
File "build/bdist.linux-x86_64/egg/src/run.py", line 256, in main
File "build/bdist.linux-x86_64/egg/src/run.py", line 159, in fusion_align
File "/zs32/home/jcdai/zouhan/ChIP/miniconda2/lib/python2.7/subprocess.py", line 223, in check_output
raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['tophat2', '-o', '/hw2600/JcDai/02ChineseBrain/01CLEAR/fusion', '-p', '5', '--fusion-search', '--keep-fasta-order', '--bowtie1', '--no-coverage-search', '/zs32/home/jcdai/04circRNA/03database/bowtie1hg38index/hg38.bowtie1_index', '/hw2600/JcDai/02ChineseBrain/01CLEAR/hisat/unmapped.fq']' returned non-zero exit status 1

I think the problem is that the reads in unmapped.fq do not map to the index. Do you have any idea how to solve this problem?
Thanks!

STAR bam files?

This is not an issue as such, but I have used CIRCexplorer2 for my analyses and would now like to upgrade.
I tried rerunning a previous analysis with circ_quant, starting from CIRCexplorer2 output and BAM files mapped with STAR.
I don't get any expression values for cognate linear RNA.

Can I do the analysis with STAR BAM files?

Feature Request: Calculate FPBlinear for all genes

Hi,

I think it will be very useful to obtain FPBlinear values in a sample for all the linear RNAs from the ref file, even though there is no circRNA generated from them. The FPBcirc values for those could be set to 0 in the output, hence CIRCscore will be also 0, but FPBlinear will be non-zero as long as the gene is expressed.

Best.


Problem with pysam: ValueError: fetch called on bamfile without index

I'm trying to run clear_quant on an RNA-seq file (Ribo-) that I obtained online, but I cannot get past the following error:

###Parameters:
Namespace(bowtie1='/home/nilu/Alignment_db/base_files/ensembl/bowtie/bowtie_index', genome='/home/nilu/Alignment_db/base_files/ensembl/base_files/Homo_sapiens.GRCh38.dna.primary_assembly.fa', gtf='/home/nilu/Alignment_db/base_files/ensembl/base_files/Homo_sapiens.GRCh38.109.gtf', hisat='/home/nilu/Alignment_db/base_files/ensembl/HISAT2/genome_tran_ensembl', m1='SRR1637089_1.fastq', m2='SRR1637089_2.fastq', output='/home/nilu/Alignment_db/analysis_files/SRX749316/SRR1637089/CLEAR', thread='30')
###Parameters

###Start hisat2 mapping
start to get sp sites for hisat mapping
start to align to genome by hisat
get mapped and unmapped reads
Traceback (most recent call last):
File "/home/nilu/.conda/envs/CLEAR/bin/clear_quant", line 11, in
load_entry_point('CLEAR==1.0.1', 'console_scripts', 'clear_quant')()
File "/home/nilu/.conda/envs/CLEAR/lib/python2.7/site-packages/CLEAR-1.0.1-py2.7.egg/src/run.py", line 258, in main
hisat_align(args.m1, args.m2, args.hisat, args.gtf, hisat_dir, args.thread)
File "/home/nilu/.conda/envs/CLEAR/lib/python2.7/site-packages/CLEAR-1.0.1-py2.7.egg/src/run.py", line 121, in hisat_align
for read in samfile.fetch():
File "pysam/calignmentfile.pyx", line 874, in pysam.calignmentfile.AlignmentFile.fetch (pysam/calignmentfile.c:10986)
ValueError: fetch called on bamfile without index

I read run.py and it does index the BAM file later on, but I cannot get past the pysam error to the actual indexing step.
Thanks in advance.
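
For background: pysam's AlignmentFile.fetch() with no region relies on the BAM index, and raises this ValueError on an unindexed file unless it is called with until_eof=True, which streams every record (including unmapped ones) in file order. The sketch below wraps that call; iter_all_reads is a hypothetical helper, and editing the installed run.py to use it would be a local workaround, not a fix confirmed by the authors.

```python
def iter_all_reads(samfile):
    """Iterate over every record of an open pysam.AlignmentFile, indexed or not."""
    return samfile.fetch(until_eof=True)


# Usage (requires pysam; the file name is a placeholder):
# import pysam
# with pysam.AlignmentFile("align.bam", "rb") as samfile:
#     for read in iter_all_reads(samfile):
#         pass
```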

NO quant or circular output

I ran the software with the following parameters:
clear_quant -1 ~/biodata/hcc-ribo/rmrRNA/LC501_tumor_totalRNA.derrRNA.fq.gz -g ~/biodata/index/GRCh38/GRCh38.primary_assembly.genome.fa -i ~/biodata/index/GRCh38/GRCh38 -j ~/biodata/index/GRCh38/GRCh38 -G ~/biodata/annotation/gencode.v35.annotation.sorted.gtf -o ~/tools/CLEAR-master/test2 -p 18
No error was reported during the process. The following is the output log:

###Parameters:
Namespace(bowtie1='/home/leelee/biodata/index/GRCh38/GRCh38', genome='/home/leelee/biodata/index/GRCh38/GRCh38.primary_assembly.genome.fa', gtf='/home/leelee/biodata/annotation/gencode.v35.annotation.sorted.gtf', hisat='/home/leelee/biodata/index/GRCh38/GRCh38', m1='/home/leelee/biodata/hcc-ribo/rmrRNA/LC501_tumor_totalRNA.derrRNA.fq.gz', m2=None, output='/home/leelee/tools/CLEAR-master/test2', thread='18')
###Parameters

###Start hisat2 mapping
# start to get sp sites for hisat mapping
# start to align to genome by hisat
# get mapped and unmapped reads
# sort bam file
# index bam file
###End hisat2 mapping

###Start tophat-fusion mapping
###End tophat-fusion mapping

###Start circRNA annotation
###End circRNA annotation

###Start circRNA quantification
###Parameters:
Namespace(bam='/home/leelee/tools/CLEAR-master/test2/hisat/align.bam', circ='/home/leelee/tools/CLEAR-master/test2/circ/circular.txt', length=False, output='/home/leelee/tools/CLEAR-master/test2/quant/quant.txt', ratio=1, ref='/home/leelee/tools/CLEAR-master/test2/circ/annotation.txt', threshold=1, tmp=True)
###Parameters
###End circRNA quantification

But I found that my circular.txt and quant.txt files are both 0 bytes, while the accepted_hits.bam file is 65M. I am confused about this. Is it because no circular RNA was found in my sequencing file, or because an error occurred at some step of the pipeline?

ValueError: too many values to unpack

When I run clear_quant, hisat2 and tophat-fusion seem to finish correctly, but CIRCexplorer2 (2.3.8) gives this error.

Traceback (most recent call last):
  File "/opt/conda/bin/CIRCexplorer2", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python2.7/site-packages/circ2/command_parse.py", line 47, in main
    command=command_log, name='parse')
  File "/opt/conda/lib/python2.7/site-packages/circ2/helper.py", line 38, in wrapper
    fn(*args)
  File "/opt/conda/lib/python2.7/site-packages/circ2/parse.py", line 49, in parse
    options['-f'])
  File "/opt/conda/lib/python2.7/site-packages/circ2/parse.py", line 71, in tophat_fusion_parse
    for i, read in enumerate(parse_fusion_bam(fusion, pair_flag)):
  File "/opt/conda/lib/python2.7/site-packages/circ2/parser.py", line 44, in parse_fusion_bam
    chr1, chr2 = read.get_tag('XF').split()[1].split('-')
ValueError: too many values to unpack
Traceback (most recent call last):
  File "/opt/conda/bin/clear_quant", line 11, in <module>
    load_entry_point('CLEAR==1.0.1', 'console_scripts', 'clear_quant')()
  File "/opt/conda/lib/python2.7/site-packages/CLEAR-1.0.1-py2.7.egg/src/run.py", line 269, in main
    args.genome, args.gtf, circ_dir)
  File "/opt/conda/lib/python2.7/site-packages/CLEAR-1.0.1-py2.7.egg/src/run.py", line 195, in circ_annot
    subprocess.check_output(circ_parse)
  File "/opt/conda/lib/python2.7/subprocess.py", line 223, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['CIRCexplorer2', 'parse', '-f', '-t', 'TopHat-Fusion', '/data/Ziegelbauer_lab/circRNADetection/ccbr983_v4/results/Rep6_KO_72h/CLEAR/fusion/accepted_hits.bam', '-b', '/data/Ziegelbauer_lab/circRNADetection/ccbr983_v4/results/Rep6_KO_72h/CLEAR/circ/bsj.bed']' returned non-zero exit status 1

Hisat2 runs extremely slow

Hi,

I am running the tool with the suggested commandline for the given example:

clear_quant -1 mate_1.fastq -2 mate_2.fastq -g hg38.fa -i hg38.hisat_index -j hg38.bowtie_index -G annotation.gtf -o output_dir

My code and sample:

$ clear_quant -1 205_1.fq.gz -2 205_2.fq.gz -g /library/hg19.fa -i /library/hg19.fa -j /library/hg19.fa -G /library/hg19.genes.gtf -o 205 -p 40
###Parameters:
Namespace(bowtie1='/library/hg19.fa', genome='/library/hg19.fa', gtf='/library/hg19.genes.gtf', hisat='/library/hg19.fa', m1='205_1.fq.gz', m2='205_2.fq.gz', output='205', thread='40')
###Parameters

###Start hisat2 mapping
# start to get sp sites for hisat mapping
# start to align to genome by hisat

However, it takes extremely long. The mapping step has been stuck there for 3 days; it is still running and I am not sure how long it will take. Is this normal? My samples have about 100M reads on average and are either RNA-Seq (ribo-) or circRNA-Seq (ribo- and RNaseR+) samples. How can I speed it up starting from the fastq files (without running CIRCexplorer2 with STAR first to get BAM files)? Could you also integrate the STAR aligner into CIRCexplorer3/CLEAR?

@xingma
