junchaoshi / sports1.1
Small non-coding RNA annotation Pipeline Optimized for rRNA- and tRNA-Derived Small RNAs
License: GNU General Public License v3.0
Hi, Junchao,
I got the results from sports1.1, some similar sequences like:
AAAAAACATTAGACTGTGAATCTGACAACAGGAAATAAACCTCCT 42 30 232 53 296 24 46 104 197 229
AAAAAACATTAGACTGTGAATCTGACAACAGGAAATAAACCTCC 16 9 63 22 120 12 16 20 48 30
AAAAAACATTAGACTGTGAATCTGACAACAGGAAATAAACCTC 29 3 41 22 47 7 4 0 21 21
My supervisor hopes that I can merge these three sequences into one before performing the downstream differential analysis.
I don't know whether this is appropriate for small RNA data processing, and I couldn't find any literature describing such an approach.
Can you give me some ideas or suggestions?
Thank you very much in advance!
Best,
Haifeng Sun
Ph.D. candidate at Nanjing Medical University, China
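One possible way to do the merge asked about above, as a minimal sketch and not a recommendation from the SPORTS authors: treat a sequence as a 3'-truncated variant when it is a prefix of a longer sequence in the table, and add its counts to that longer sequence. The function name `collapse_prefix_variants` and the input layout are hypothetical. Whether collapsing such variants is appropriate for a given study remains a judgment call.

```python
def collapse_prefix_variants(counts):
    """Sum per-sample counts of sequences that are prefixes of a longer sequence.

    counts: dict mapping sequence -> list of per-sample read counts.
    Returns a dict keyed by the longest representative of each group.
    """
    merged = {}
    # Visit longer sequences first so each representative exists before its prefixes.
    for seq in sorted(counts, key=len, reverse=True):
        # If several representatives start with seq, the first (longest) one is used.
        parent = next((rep for rep in merged if rep.startswith(seq)), None)
        if parent is None:
            merged[seq] = list(counts[seq])
        else:
            merged[parent] = [a + b for a, b in zip(merged[parent], counts[seq])]
    return merged


# The three rows quoted in the post above
table = {
    "AAAAAACATTAGACTGTGAATCTGACAACAGGAAATAAACCTCCT": [42, 30, 232, 53, 296, 24, 46, 104, 197, 229],
    "AAAAAACATTAGACTGTGAATCTGACAACAGGAAATAAACCTCC": [16, 9, 63, 22, 120, 12, 16, 20, 48, 30],
    "AAAAAACATTAGACTGTGAATCTGACAACAGGAAATAAACCTC": [29, 3, 41, 22, 47, 7, 4, 0, 21, 21],
}
collapsed = collapse_prefix_variants(table)
for seq, per_sample in collapsed.items():
    print(seq, per_sample)
```

The merge is done on raw counts, so any normalization would be applied afterwards on the collapsed table.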
Hello,
Thank you for developing the tool. I used the provided Mus musculus pre-compiled database files and realized they are not the latest release. I want to use the updated files, so I manually downloaded the mm39-tRNA fasta file. I then saw that you provide a script, tRNA_db_processing.pl, that adds CCA to the 3' end and G to the 5' end of histidine tRNAs. However, in the recommended command line you don't give the name of the tRNA_CCA index files, even though those bowtie index files are included in the mm10 pre-compiled database:
sports.pl [...] -t /foo/bar/Mus_musculus/GtRNAdb/mm10/mm10-tRNAs
Instead of:
sports.pl [...] -t /foo/bar/Mus_musculus/GtRNAdb/mm10/mm10-tRNAs_CCA
I wonder what to use with the updated files that I downloaded. Should I pass the *_CCA indexes to sports.pl?
Thanks,
Sergio
Hi!
I ran the analysis twice using the command below. The first run completed without any problems. For the second run, I changed the files listed in seq_address.txt and ran it again, but I got the following error:
"Command: bowtie-build --wrapper basic-0 --threads 24 -q /home/yildize/sports/Mus_musculus/genome/mm10/genome.fa /home/yildize/sports/Mus_musculus/genome/mm10/genome
mv: cannot stat '/home/yildize/sports/output/sh_file/.sh': No such file or directory
rmdir: failed to remove '/home/yildize/sports/output/sh_file': No such file or directory
rm: cannot remove '/home/yildize/sports/output/.sh': No such file or directory"
The command I used in both analyses is: "sports.pl -i seq_address.txt -p 24 -g /home/yildize/sports/Mus_musculus/genome/mm10/genome -m /home/yildize/sports/Mus_musculus/miRBase/21/miRBase_21-mmu -r /home/yildize/sports/Mus_musculus/rRNAdb/mouse_rRNA -t /home/yildize/sports/Mus_musculus/GtRNAdb/mm10/mm10-tRNAs -w /home/yildize/sports/Mus_musculus/piRBase/piR_mouse -f /home/yildize/sports/Mus_musculus/Rfam/12.3/Rfam-12.3-mouse -o /home/yildize/sports/untroutput/"
Thank you in advance for your time.
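The file name ".sh" in the error above has an empty sample name, which would be consistent with an empty or unusable entry in seq_address.txt. That is only a guess, not a confirmed diagnosis. A minimal sketch for checking the list file before a rerun follows; the function name `check_address_file` is hypothetical and the check assumes one path per line.

```python
import os


def check_address_file(path):
    """Return (line_number, problem) pairs for suspicious entries in an input list file."""
    problems = []
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            entry = line.strip()
            if not entry:
                problems.append((number, "blank line"))
            elif not os.path.exists(entry):
                problems.append((number, "path does not exist"))
    return problems


# Usage, with the list file named in the post:
# for number, problem in check_address_file("seq_address.txt"):
#     print(number, problem)
```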
Hi,
I am trying to interpret the processing report generated by Sports1.1.
After cutadapt, the first step is to match all reads from cutadapt to the genome, right? So is the number we see here the number of reads left after cutadapt?
In the example below, is it right that after adapter trimming, 32,432 reads are left from the 1,224,009 starting reads in the fastq file?
After that, the reads are split up and mapped to the different libraries accordingly. Is that right?
remove 5' adapter
This is cutadapt 2.3 with Python 3.6.9
......
=== Summary ===
Total reads processed: 1,224,009
Reads with adapters: 1,022,193 (83.5%)
Reads with too many N: 0 (0.0%)
Reads written (passing filters): 1,224,009 (100.0%)
Total basepairs processed: 83,232,612 bp
Total written (filtered): 31,674,723 bp (38.1%)
=== Adapter 1 ===
Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC; Type: regular 5'; Length: 34; Trimmed: 1022193 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-34 bp: 3
Overview of removed sequences
length count expect max.err error counts
3 2258 19125.1 0 2258
4 205 4781.3 0 205
.....
68 33556 0.0 3 6860 10078 9085 7533
match to genome
reads processed: 32432
reads with at least one reported alignment: 332 (1.02%)
reads that failed to align: 32100 (98.98%)
Reported 332 alignments
Thank you so much for your help in advance
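The percentages in this report can be rechecked from the raw numbers. The sketch below only redoes the arithmetic; it does not explain why 1,224,009 written reads become 32,432 reads at the genome step. One possible cause, stated here as an assumption, is a length filter applied between the two steps (logs elsewhere on this page show filtering limits of 15 and 45 nt). The helper name `percent` is hypothetical.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


reads_total = 1_224_009        # Total reads processed
reads_with_adapter = 1_022_193 # Reads with adapters
bp_total = 83_232_612          # Total basepairs processed
bp_written = 31_674_723        # Total written (filtered)
genome_input = 32_432          # reads processed (match to genome)
genome_aligned = 332           # reads with at least one reported alignment

print("reads with adapter (%):", percent(reads_with_adapter, reads_total))
print("basepairs written (%):", percent(bp_written, bp_total))
print("aligned to genome (%):", percent(genome_aligned, genome_input))
print("genome-step reads as share of starting reads (%):", percent(genome_input, reads_total))
```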
Hi,
Thanks so much for answering my previous questions.
Another question regarding tRNA annotation: what are the criteria for annotating a read as a 3' or 5' end tRNA?
Thanks
Hi @junchaoshi,
I'm trying to get a figure like the one generated by overall_RNA_length_distribution.R, showing the same histogram, but for the samples of a specific treatment instead of individual samples. In my experiment I have 2 treatments with 4 samples each. I was wondering what the best approach would be. Since the script computes the RPM values from the counts for the plot, I was thinking about leveraging that to get the RPM values for the individual samples and computing a mean for each of the groups. Do you think this is the right approach to quantify the abundance per treatment? Or would it be better to just merge the respective samples (by averaging the raw counts, for example) and rerun the whole SPORTS pipeline?
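The first option in the question (RPM per sample, then a mean per treatment) can be sketched as follows. This is an illustration under assumptions: the denominator used here is the sum of the listed counts, which may differ from what overall_RNA_length_distribution.R uses, and the function names `rpm` and `group_mean` as well as the counts are made up.

```python
def rpm(counts_by_length):
    """Convert raw counts per read length to reads per million for one sample."""
    total = sum(counts_by_length.values())
    return {length: 1e6 * n / total for length, n in counts_by_length.items()}


def group_mean(samples):
    """Average RPM values across samples, length by length (missing lengths count as 0)."""
    lengths = sorted({length for s in samples for length in s})
    return {
        length: sum(s.get(length, 0.0) for s in samples) / len(samples)
        for length in lengths
    }


# Made-up counts for two samples of one treatment
treated = [rpm({20: 100, 22: 300, 30: 600}), rpm({22: 500, 30: 500})]
mean_treated = group_mean(treated)
print(mean_treated)
```

Averaging per-sample RPM values weights every sample equally, whereas pooling raw counts before normalizing lets the most deeply sequenced sample dominate the group profile.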
Best regards,
Dear Shi,
I would like to use sports1.1 to analyse a small RNA-seq dataset, but I'm running into some issues. I installed it following the installation recipe and all the programs seem to be functioning. However, when I run SPORTS1.1 I get:
Class Sub_Class Reads
Clean_Reads - 3321509
Match_Genome - 0
Unannotated_Match_Genome - 0
Unannotated_Unmatch_Genome - 3321509
So bowtie is not mapping any reads. I have tried bowtie2 with the same datasets and they map properly. As far as I can see, the issue is not with SPORTS1.1 but with bowtie. I'm using your pre-built database.
Checking the processing report file, I can see the error "Error reading _rstarts[] array: 7376, 14208", but I can't find this error on Google.
match to genome
Error reading _rstarts[] array: 7376, 14208
Command: bowtie-align --wrapper basic-0 -f -v 0 -k 1 -p 8 --al /rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/SPORTS1.1_output/1_S463_A1_R1_val_1/S463_A1_R1_val_1_match_genome.fa --un /rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/SPORTS1.1_output/1_S463_A1_R1_val_1/S463_A1_R1_val_1_unmatch_genome.fa /home/ah2192/rds/rds-mrc_tox-XUr6B1Jhndg/ah2192_backup/annotations/smallRNA/Homo_sapiens/genome/hg38/genome /rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/SPORTS1.1_output/1_S463_A1_R1_val_1/S463_A1_R1_val_1.fa
rm: cannot remove ‘/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/SPORTS1.1_output/1_S463_A1_R1_val_1/S463_A1_R1_val_1_processed/S463_A1_R1_val_1_output_tRNA’: No such file or directory
rm: cannot remove ‘/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/SPORTS1.1_output/1_S463_A1_R1_val_1/S463_A1_R1_val_1_processed/S463_A1_R1_val_1_tRNA_mapping.txt’: No such file or directory
Any input is more than welcome!
Thanks,
Andres
Hi there,
In the final output, all piRNAs are labeled the same, as "piRNA". Is there a way to include the piRNA ID?
Dear Junchaoshi,
Thanks for this great software tool. I would like to know whether it is possible to change the order in which the databases are used for annotation.
I would like to annotate against piRNAdb right after miRNA, if possible.
I'm aware of the publication "Non-coding RNA fragments account for the majority of annotated piRNAs expressed in somatic non-gonadal tissues", but I would still like to have that option.
Hi Junchao,
I ran sports.pl and it produced results. But when I run annotation.pl, I get this error:
Can't open '/smallrna/qi/out.fa': No such file or directory at annotation.pl line 33.
My sports.pl output is in "/smallrna/qi/", in a folder named "out". The "out" folder contains "1_S1_trimmed", "processing_report", "sh_file" and "run_this.sh".
I am looking forward to your reply!
Best wishes!
heming wang
Dear sir,
When I used sports1.1 there was no error, but I don't know why there is no output.
Please give me some advice, thank you very much.
Best wishes,
Crane
$./source/sports.pl -i Sperm.txt -p 4 -a -x GTTCAGAGTTCTACAGTCCGACGATC -y TGGAATTCTCGGGTGCCAAGG -g ./Mus_musculus/genome/mm10/genome -m ./Mus_musculus/miRBase/21/miRBase_21-mmu -r ./Mus_musculus/rRNAdb/mouse_rRNA -t ./Mus_musculus/GtRNAdb/mm10/mm10-tRNAs -w ./Mus_musculus/piRBase/piR_mouse -e ./Mus_musculus/Ensembl/release-89/Mus_musculus.GRCm38.ncrna -f ./Mus_musculus/Rfam/12.3/Rfam-12.3-mouse -o ../Reads/output -k
SPORTS version: 1.1.1
Citation: Junchao Shi, Eun-A Ko, Kenton M. Sanders, Qi Chen, Tong Zhou. “SPORTS1.0: a tool for annotating and profiling non-coding RNAs optimized for rRNA-and tRNA-derived small RNAs.” Genomics, Proteomics & Bioinformatics (2018) doi.org/10.1016/j.gpb.2018.04.004.
Input file address: /media/EXTend2018/Wanghe2019/zahuo/tsRNA/sports1.1-master/Sperm.txt
output file address: ../Reads/output
Reference genome address: ./Mus_musculus/genome/mm10/genome
Reference miRNA database address: ./Mus_musculus/miRBase/21/miRBase_21-mmu
Reference tRNA database address: ./Mus_musculus/GtRNAdb/mm10/mm10-tRNAs
Reference rRNA database address: ./Mus_musculus/rRNAdb/mouse_rRNA
Reference piRNA database address: ./Mus_musculus/piRBase/piR_mouse
Reference ensembl ncRNA database address: ./Mus_musculus/Ensembl/release-89/Mus_musculus.GRCm38.ncrna
Reference rfam ncRNA database address: ./Mus_musculus/Rfam/12.3/Rfam-12.3-mouse
Trimming 5' adapter: GTTCAGAGTTCTACAGTCCGACGATC
Trimming 3' adapter: TGGAATTCTCGGGTGCCAAGG
Filtering min lenth: 15
Filtering max lenth: 45
Mapping mismatch tolerance: 0
Debugging mode ON: keep all the intermediate files generated during the running progress
Use of uninitialized value $tmp in -d at ./source/sports.pl line 352, <FILE_IN> line 1.
Use of uninitialized value $tmp in -f at ./source/sports.pl line 387, <FILE_IN> line 1.
Use of uninitialized value $tmp in -d at ./source/sports.pl line 352, <FILE_IN> line 2.
Use of uninitialized value $tmp in -f at ./source/sports.pl line 387, <FILE_IN> line 2.
Use of uninitialized value $tmp in -d at ./source/sports.pl line 352, <FILE_IN> line 3.
Use of uninitialized value $tmp in -f at ./source/sports.pl line 387, <FILE_IN> line 3.
Use of uninitialized value $tmp in -d at ./source/sports.pl line 352, <FILE_IN> line 4.
Use of uninitialized value $tmp in -f at ./source/sports.pl line 387, <FILE_IN> line 4.
Use of uninitialized value $tmp in -d at ./source/sports.pl line 352, <FILE_IN> line 5.
Use of uninitialized value $tmp in -f at ./source/sports.pl line 387, <FILE_IN> line 5.
Use of uninitialized value $tmp in -d at ./source/sports.pl line 352, <FILE_IN> line 6.
Use of uninitialized value $tmp in -f at ./source/sports.pl line 387, <FILE_IN> line 6.
Processing input files...
Done!
Hi,
I have another question, about histidine tRNA: https://github.com/junchaoshi/sports1.1#trna_db_processingpl
You mentioned that a G is added to the 5' of the sequence.
If I check the sequences, a G is already present at the start of the sequence: http://gtrnadb.ucsc.edu/genomes/eukaryota/Mmusc10/mm10-tRNAs.fa
So, I guess, you are adding a G only if the sequence does not start with G. Is this correct?
Also, is there any reference for this?
Thank you in advance for clarifying.
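As an illustration of the transformation described for tRNA_db_processing.pl (CCA added to the 3' end, G added to the 5' end of histidine tRNAs), here is a minimal Python sketch. It is not the script itself: the function name `add_mature_ends`, the "tRNA-His" header test and the toy sequences are hypothetical, and the G is added unconditionally here because the description does not say whether an existing 5' G is checked.

```python
def add_mature_ends(name, seq):
    """Append CCA to the 3' end; prepend G to the 5' end for histidine tRNAs."""
    mature = seq + "CCA"
    # Assumption: histidine tRNAs can be recognised by "tRNA-His" in the FASTA header.
    if "tRNA-His" in name:
        mature = "G" + mature
    return mature


# Toy sequences, not real tRNA genes
print(add_mature_ends("Mus_musculus_tRNA-His-GTG-1-1", "GCCGTGATCG"))
print(add_mature_ends("Mus_musculus_tRNA-Ala-AGC-1-1", "GGGGAATTAG"))
```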
Is there already a precompiled database for hg38? The existing link provided contains both hg19 and hg38, and it is hard to know if all are from hg38.
Hi,
I am trying to understand how the pre-compiled databases were obtained.
-rRNA database (Original source: https://www.ncbi.nlm.nih.gov/nuccore)
I am not sure how exactly rRNAs were obtained. Does one just have to search for the species and take all the fasta sequences?
-mitotRNAdb database [6] (Original source: http://mttrna.bioinf.uni-leipzig.de/mtDataOutput/)
It looks like there are only 22 mt_tRNAs for mouse: http://mttrna.bioinf.uni-leipzig.de/mtDataOutput/Result
Whereas, there are 45 in the pre-compiled ones.
Thank you for clarifying.
Hi,
Thank you so much for answering my previous questions promptly.
I have another question about the CCA end annotation: does it only count reads with intact non-template CCA tails or count those with non-template C or CC tails as well? I couldn't tell from my result because I see only intact CCA tails.
Thank you very much!
Dear Shi, bowtie cannot locate the index, although I already have the index files built with bowtie-build. The error is:
match to miRNA-match_genome
Could not locate a Bowtie index corresponding to basename "Arabidopsis_thaliana/miRBase_21/miRBase_21-ath"
Command: bowtie --wrapper basic-0 -f -v 0 -a -p 1 --fullref --norc --al /home/cuimingyang/sports-sRNA/text/1_S1_1_trimmed/S1_1_trimmed_match_miRNA_match_genome.fa --un /home/cuimingyang/sports-sRNA/text/1_S1_1_trimmed/S1_1_trimmed_unmatch_miRNA_match_genome.fa Arabidopsis_thaliana/miRBase_21/miRBase_21-ath /home/cuimingyang/sports-sRNA/text/1_S1_1_trimmed/S1_1_trimmed_match_genome.fa
Hi Junchao,
I have only one small question regarding the Human precompiled database you provided. I noticed that the file hg19-mt_tRNAs.fa contains 29 sequences, but when I searched mitotRNAdb, I only found 22 results. Could you please explain why there is a discrepancy? Additionally, I'm just wondering whether you have plans to provide a precompiled database for hg38 in the future?
Thank you for your help and expertise.
Best regards,
Arya
Hi,
Thanks for the great tool.
I'm using your Mus musculus annotation (and -r $ref/rRNA_db/mouse_rRNA) with standard TruSeq small RNA data, and there's never any overlap detected with the rRNA_db annotation: the _processed/*_output_rRNA_* files are empty (but present), and in the _result/*_summary.txt file there are (many) reads assigned to ensembl-rRNA and Rfam-rRNA, but nothing from rRNA_db. The processing_report/*.txt is empty in that section, and has no other indication except plotting errors at the end, i.e.:
match to rRNA database
match to tRNA database
[...]
generating graph
Error: No enough input parameters!
Execution halted
Error: No enough input parameters!
Execution halted
It seems difficult to imagine that there would really be no trace of the 4S/5S/etc (across many samples)...
Looking at the different annotation subfolders, I noticed that while all annotations have a single bowtie index (i.e. prefix), rRNA_db has several - could the problem stem from this?
Any pointer on how to debug this would be appreciated.
Dear Shi,
If I have paired-end small RNA-seq data, can I analyse it using SPORTS?
Hi Junchao,
I am using sports1.1 to analyse sRNA-seq data. I installed it as instructed. The programs bowtie (1.2.1.1) and sports.pl seem to be functioning properly. However, when I ran SPORTS1.1 with my cleaned data, I got "Could not locate a Bowtie index corresponding to basename "Homo_sapiens/genome/hg38/genome"" in the processing report.
I tried with these parameters:
nohup time sports.pl -i test.fq -p 8 -g Homo_sapiens/genome/hg38/genome -m /Homo_sapiens/miRBase/21/miRBase_21-has -r Homo_sapiens/rRNAdb/human_rRNA -t Homo_sapiens/GtRNAdb/hg19/hg19-tRNAs -w Homo_sapiens/piRBase/piR_human -o output_test/ -k > log_test 2>&1
When I checked the Homo_sapiens folder, I saw that the indexes (.ebwt files) are there.
After the issue occurred, I tried reinstalling SPORTS1.1 and the corresponding software, and also re-downloading the pre-compiled database. In addition, I used bowtie to build the indexes directly and then ran sports.pl. None of this worked; the latest run still shows the same problem.
I'm not sure which part is wrong. How can I deal with this issue? I hope to hear from you soon.
Thanks!
Arya
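Both "Could not locate a Bowtie index" reports on this page pass a relative basename, which is resolved against the directory the command runs in. A minimal sketch for checking which index files are visible from a given working directory follows. The function name `missing_index_files` is hypothetical, and the suffix list is the set of six files that make up a standard Bowtie 1 small index (large indexes end in ".ebwtl" instead). Whether the generated scripts run from a different directory than the one sports.pl was started in is not known here.

```python
import os

EBWT_SUFFIXES = (".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt", ".rev.1.ebwt", ".rev.2.ebwt")


def missing_index_files(basename, suffixes=EBWT_SUFFIXES):
    """Return the index files that cannot be found for a Bowtie basename."""
    # Relative basenames are resolved against the current working directory.
    base = os.path.abspath(basename)
    return [base + s for s in suffixes if not os.path.isfile(base + s)]


# Usage, with the basename from the post:
# print(missing_index_files("Homo_sapiens/genome/hg38/genome"))
```

An empty list means all six files were found at the resolved location; passing absolute paths to sports.pl removes the dependence on the working directory.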
Hi,
Thanks for your wonderful work. It's very convenient to use sports1.1 to annotate small RNA-seq data. Now that I have the results, I have some questions:
For tsRNA results, I saw three types: *_5_end, *_3_end and *_CCA_end. What's the difference between 3_end and CCA_end?
Also, some tRNAs don't belong to any of these three types, for example "tRNA-Val-AAC". How should I deal with these if I want to analyse tsRNAs in my samples?
Looking forward to your reply!
Thanks,
Xiaozhuan
Hi:
Thanks very much for your work; Sports1.0 is very nice and useful.
My species genome is mm10, and I think some tsRNAs are derived from tRNA. But when I add the MT-tRNA sequences to mm10-tRNAs.fa like this:
Mus_musculus_tRNA-MTSer-TGA-1-1
GAGAAAGACATATAGGATATGAGATTGGCTTGAAACCAATTTTAGGGGGTTCGATTCCTTCCTTTCTTA
Mus_musculus_tRNA-MTThr-TGT-1-1
GTCTTGATAGTATAAACATTACTCTGGTCTTGTAAACCTGAAATGAAGATCTTCTCTTCTCAAGACA
and then run sports.pl, the result looks like:
t00000651 TCCCTGGTGGTCTAGTGGTTAGGATTCGGC 30 2 Yes tRNA-Glu-CTC_5_end;tRNA--_5_end
t00000769 AAGAAAGATTGCAAGAACTG 20 2 Yes tRNA--_5_end
t00001268 GCATTGGTGGTTCAGTGGTAGAATTCTCGCCT 32 2 Yes tRNA-Gly-GCC_5_end;tRNA-Gly-CCC_5_end
which is not what I expected. I don't know how to solve this problem. Can you give me some advice?
Thank you in advance for any help!
Best wishes
Haifeng Sun
2020.01.10
Hi,
Thanks for your wonderful work. If I want to add other types of ncRNA database to sports, for example snoRNA, vault RNA, etc., how can I do it?
Thanks
Hi Junchaoshi,
Thank you very much for creating such a wonderful tool! Would it be possible to update the versions of the genome, Rfam, miRBase and GtRNAdb? Also, would it be possible to clarify the format requirements for the different references? That would be very helpful for people applying sports to unannotated ncRNAs or new species.
Thanks again for your excellent tool!
Best,
Louis
Hi, I would like to ask why the results do not contain any pre-tRNA match information.
The command is:
sports.pl -i /etc/SRR/245out.fastq -p 4 -g /etc/SRR/UCSC/rn6/Sequence/BowtieIndex/genome -t /etc/SRR/GtRNAdb/rn5-tRNAs -M 1 -o /etc/SRR/output/
Thank you for your attention.
Best wishes!
Hi,
Thanks in advance for this github project.
sports1.1 is really nice for small RNA-seq data analysis, especially the update from 1.0 to 1.1.
I ran the pipeline successfully and the next step is differential analysis.
But I'm puzzled that the results are reported per sample, e.g. the quantitative result data comprising Sequence and Reads.
My design is 3 vs. 3 samples. How do I merge the samples into one combined table? Can I use the Sequence as the key between samples?
And for the subsequent differential analysis, can I use the R package DESeq2, or are there other packages you would recommend?
Thanks again and looking forward to your reply!
Best,
Haifeng Sun
Nanjing Medical University, China
2022-09-29
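The merge asked about above can be sketched as building one count matrix keyed by sequence, with a zero wherever a sequence is absent from a sample. This is an illustration only: the function name `build_count_matrix`, the sample names and the counts are made up, and reading the per-sample SPORTS output files into dictionaries is left out.

```python
def build_count_matrix(samples):
    """Combine per-sample {sequence: reads} tables into one matrix keyed by sequence.

    samples: dict mapping sample name -> dict of sequence -> read count.
    Returns (sample_names, rows), where rows maps sequence -> counts in sample order.
    """
    names = sorted(samples)
    sequences = sorted({seq for table in samples.values() for seq in table})
    rows = {seq: [samples[n].get(seq, 0) for n in names] for seq in sequences}
    return names, rows


# Made-up example with two samples
names, rows = build_count_matrix({
    "ctrl_1": {"ACGTACGTACGTACGTAC": 12, "TTGGCCAATTGGCCAATT": 3},
    "treat_1": {"ACGTACGTACGTACGTAC": 7},
})
print(names)
for seq, counts in rows.items():
    print(seq, counts)
```

A table of this shape, with raw integer counts and one column per sample, is the kind of input DESeq2 expects.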
Hi Junchao,
I have a question: when you map to rRNA_5S (your script below), there are two steps: match genome and unmatch genome. But the bowtie index for rRNA 5S is the same in both steps. I don't understand why you map the reads that did not match the genome to rRNA 5S again.
As I understand it, if a read can't be mapped to the human genome, it can't be mapped to the human small RNA databases either.
It would be awesome if you could explain this part to me. Thanks.
'''
name=rRNA_5S
bowtie_address=/storeData/project/user/daixiaozhuan/reference/SPORTS1.0_smallRNAdb/Homo_sapiens/rRNAdb/human_rRNA_5S
######match genome part
echo ""
echo "match to ${name}-match_genome"
output_match_match_genome=${output_address}${input_query_name}_match_${name}_match_genome.fa
output_unmatch_match_genome=${output_address}${input_query_name}_unmatch_${name}_match_genome.fa
touch ${output_match_match_genome}
touch ${output_unmatch_match_genome}
bowtie ${bowtie_address} -f ${input_match} -v ${mismatch} -a -p ${thread} --fullref --norc --al ${output_match_match_genome} --un ${output_unmatch_match_genome} >> ${output_detail_match_genome}
######unmatch genome part
echo ""
echo "match to ${name}-unmatch_genome"
output_match_unmatch_genome=${output_address}${input_query_name}_match_${name}_unmatch_genome.fa
output_unmatch_unmatch_genome=${output_address}${input_query_name}_unmatch_${name}_ummatch_genome.fa
touch ${output_match_unmatch_genome}
touch ${output_unmatch_unmatch_genome}
bowtie ${bowtie_address} -f ${input_unmatch} -v ${mismatch} -a -p ${thread} --fullref --norc --al ${output_match_unmatch_genome} --un ${output_unmatch_unmatch_genome} >> ${output_detail_unmatch_genome}
'''
Hi,
Thank you so much for the prompt reply to my previous questions. I really appreciate it.
I have more questions about interpreting the output from sports:
Is there overlap between different classes? For example, in the output xxx.summary.txt I saw the classes listed below for tRNA. Are reads counted in tRNA_5_end (and the other ends) also counted in the tRNA_Match_Genome class?
The output PDF files show the percentage of each type of tRNA in a pie chart. How is the tRNA percentage calculated (5-end, 3-end and CCA-end divided by tRNA_match_genome plus tRNA_unmatch_genome)? Are mitochondrial tRNAs also included in the calculation?
Thank you so much for your help!
--------- the following are the classes in the summary.txt file in my case ---
GtRNAdb-mature-tRNA_Match_Genome
GtRNAdb-mature-tRNA_5_end_Match_Genome
GtRNAdb-mature-tRNA_3_end_Match_Genome
GtRNAdb-mature-tRNA_CCA_end_Match_Genome
GtRNAdb-mature-tRNA_Unmatch_Genome
GtRNAdb-mature-tRNA_CCA_end_Unmatch_Genome
mitotRNAdb-mature-mt_tRNA_Match_Genome
mitotRNAdb-mature-mt_tRNA_3_end_Match_Genome
mitotRNAdb-mature-mt_tRNA_CCA_end_Match_Genome
mitotRNAdb-mature-mt_tRNA_Unmatch_Genome
mitotRNAdb-mature-mt_tRNA_CCA_end_Unmatch_Genome
Hi, thank you for your work in small RNA research. I would like to ask whether the entries in the summary such as tRNA_5_end and tRNA_3_end are equivalent to tsRNAs. If so, do these tsRNAs have a uniform nomenclature?
Thank you for your attention.
Hi,
If a sequence maps to the genome and to GtRNAdb but cannot be found in tsRNA databases such as tsRbase (http://www.tsrbase.org/) or tRFdb (http://genome.bioch.virginia.edu/trfdb/statistics.php), can we consider it a new tsRNA?
Thank you for your attention.
Best
Hi, I am trying to run sports.pl as in example use 3. But the only output I get is an empty file named run_this.sh, with no warnings or errors, so I would like to know what is happening.
Hi Junchao,
I would like to know the difference between the two Rfam ncRNA database sources, ensemblgenomes and the Rfam database, in the section "Instruction for compiling annotation database by user". Which one should we follow: X_rfam_ncrna.fa or X_rfam.fa? Should we just choose one of them for the Rfam ncRNAs?
Download and extract the noncoding RNA sequences in .fa format from Rfam database (http://ensemblgenomes.org/) and put the file X_rfam_ncrna.fa into the defined folder address: <your_defined_address>; (optional)
Download and extract the noncoding RNA sequences belong to the species in .fa format from Rfam database (https://rfam.xfam.org/) and put the file X_rfam.fa into the defined folder address: <your_defined_address>; (optional)
Thank you for your help.
Cheers,
Louis
Hi Junchao,
I want to compute the ratio of the different RNA types. First, I extracted the information directly from the file "*.mapped.sorted.txt" in the "processing_report" folder, but the read numbers do not look correct. Then I checked the summary file as below: why is the Match_Genome number larger than Clean_Reads? Do you have any advice on which file I should use to produce the summary?
Thanks,
Xiaozhuan
Hi Junchao,
I used this command line to annotate small RNAs (the input file is in fastq format):
sports.pl -i seq_address.txt -p 4 -g /BIGDATA2/sysu_hshwang_1/soft/database/Homo_sapiens/genome/hg38/genome
-m /BIGDATA2/sysu_hshwang_1/soft/database/Homo_sapiens/miRBase/21/miRBase_21-hsa
-r /BIGDATA2/sysu_hshwang_1/soft/database/Homo_sapiens/rRNAdb/human_rRNA
-t /BIGDATA2/sysu_hshwang_1/soft/database/Homo_sapiens/GtRNAdb/hg19/hg19-tRNAs
-w /BIGDATA2/sysu_hshwang_1/soft/database/Homo_sapiens/piRBase/piR_human
-e /BIGDATA2/sysu_hshwang_1/soft/database/Homo_sapiens/Ensembl/release-89/Homo_sapiens.GRCh38.ncrna
-f /BIGDATA2/sysu_hshwang_1/soft/database/Homo_sapiens/Rfam/12.3/Rfam-12.3-human
-o /BIGDATA2/sysu_hshwang_1/sports/output
The output is as follows:
Hi Sports team!
Thanks for making this pipeline. I am trying to run annotation.pl but I get an error about an incorrect file path.
The error is below:
$ sports.pl -i 1_trim_out/1_0262_X1_18_S18_R1_001_trimmed.fastq -p 20 -k -z -M 2
-g /home/wkq953/scop/SCOP_2024_0262/pipeline-out/foo/bar/Homo_sapiens/genome/hg38/genome
-m /home/wkq953/scop/SCOP_2024_0262/pipeline-out/foo/bar/Homo_sapiens/miRBase/21/miRBase_21-hsa
-o /home/wkq953/scop/SCOP_2024_0262/pipeline-out/test/
$ annotation.pl test/
Can't open 'test/.fa': No such file or directory at /maps/projects/scop/apps/sports1.1-master/source/annotation.pl line 55.
$ annotation.pl test/1_0262_X1_18_S18_R1_001_trimmed/
Can't open 'test/1_0262_X1_18_S18_R1_001_trimmed/.fa': No such file or directory at /maps/projects/scop/apps/sports1.1-master/source/annotation.pl line 55.
$ annotation.pl test/1_0262_X1_18_S18_R1_001_trimmed/0262_X1_18_S18_R1_001_trimmed_fa/
Can't open 'test/1_0262_X1_18_S18_R1_001_trimmed/0262_X1_18_S18_R1_001_trimmed_fa/.fa': No such file or directory at /maps/projects/scop/apps/sports1.1-master/source/annotation.pl line 55.
$ annotation.pl test/1_0262_X1_18_S18_R1_001_trimmed/0262_X1_18_S18_R1_001_trimmed_fa/0262_X1_18_S18_R1_001_trimmed_match_genome
readline() on closed filehandle $file_handle{...} at /maps/projects/scop/apps/sports1.1-master/source/annotation.pl line 120.
And the output folder looks like
$ ll test/
total 144
1_0262_X1_18_S18_R1_001_trimmed
processing_report
run_this.sh
sh_file
Thanks for your help!
The README example for rRNA_length_distribution.R has only 2 parameters (file.address and file.name), but rRNA_length_distribution.R also requires an rRNA.length parameter.
I was checking some of the resulting summaries to see whether the sum of the "matched to genome" reads was correct, and I found that it differs from the total reported in "summary.txt".
Could you have a look on that?
random2_summary.txt
random1_summary.txt
These files were produced from the original summary files by using grep to keep only the "Match_Genome" entries.
Hi,
I tried running sports with these parameters:
sports.pl -i files.txt -m /media/nsg/Data/Genome-assemblies/Rat/sports/Rattus_norvegicus/miRBase_21/miRBase_21-rno -r /media/nsg/Data/Genome-assemblies/Rat/sports/Rattus_norvegicus/rRNAdb/rat_rRNA -t /media/nsg/Data/Genome-assemblies/Rat/sports/Rattus_norvegicus/GtRNAdb/rn5-tRNAs -w /media/nsg/Data/Genome-assemblies/Rat/sports/Rattus_norvegicus/piRBase/piR_rat -f /media/nsg/Data/Genome-assemblies/Rat/sports/Rattus_norvegicus/Rfam_12.3/Rfam-12.3-rat -e /media/nsg/Data/Genome-assemblies/Rat/sports/Rattus_norvegicus/Ensembl/Rattus_norvegicus.Rnor_6.0.ncrna -g /media/nsg/Data/Genome-assemblies/Rat/sports/Rattus_norvegicus/UCSC/rn6/Sequence/BowtieIndex/genome -p 4 -o ./output
However I got this error message:
Filtering min lenth: 15
Filtering max lenth: 45
Mapping mismatch tolerance: 0
Use of uninitialized value $tmp in -d at /home/nsg/sports1.1-master/source/sports.pl line 352, <FILE_IN> line 1.
Use of uninitialized value $tmp in -f at /home/nsg/sports1.1-master/source/sports.pl line 387, <FILE_IN> line 1.
can not open '/media/nsg/Data/Sherif/Gtpbps small RNA seq/sports/output/sh_file/1_Gtpbp2.sh' at /home/nsg/sports1.1-master/source/sports.pl line 479, <FILE_IN> line 4.
Can you please help me figure out what the issue is?
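The path in the "can not open" message above contains spaces ("Gtpbps small RNA seq"). Whether that is what breaks the run is not confirmed here, but it is cheap to rule out. A minimal sketch follows; the function name `paths_with_whitespace` is hypothetical.

```python
import os


def paths_with_whitespace(paths):
    """Return the absolute form of every path that contains whitespace."""
    risky = []
    for p in paths:
        full = os.path.abspath(p)  # "./output" expands to the full working directory
        if any(ch.isspace() for ch in full):
            risky.append(full)
    return risky


# Usage from the directory where sports.pl was started:
# print(paths_with_whitespace(["./output", "files.txt"]))
```

If any path is reported, rerunning from a directory without spaces in its name would show whether the spaces are the cause.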
Thanks!
Hi !
I am running sports1.1 with the following command:
DIR=/media/sergio/HDD1/smallrna
sports.pl -i sports_pipeline/seq_address.txt -p 8 -g $DIR/features/SPORTS_db/Mus_musculus/genome/mm10/genome -m $DIR/features/SPORTS_db/Mus_musculus/miRBase/21/miRBase_21-mmu -r $DIR/features/SPORTS_db/Mus_musculus/rRNAdb/mouse_rRNA -t $DIR/features/SPORTS_db/Mus_musculus/GtRNAdb/mm10/mm10-tRNAs -w $DIR/features/SPORTS_db/Mus_musculus/piRBase/piR_mouse -e $DIR/features/SPORTS_db/Mus_musculus/Ensembl/release-89/Mus_musculus.GRCm38.ncrna -f $DIR/features/SPORTS_db/Mus_musculus/Rfam/12.3/Rfam-12.3-mouse -o sports_pipeline/output/ -k
Note that I'm running Bowtie version 1.3.1 and I used Trim Galore to trim the read adapters.
The files are processed very fast and the processing report for every sample looks like this, with no alignments in any database:
I guess that the message about the index comes from the more recent version of Bowtie. However, I do not know where the "rm" error comes from. My "seq_address.txt" file looks like this:
Thank you in advance!
Sergio
Hi,
I have a question about one of the output files, xxx_summary.txt: why are the read numbers not integers in this case? Are these normalized read counts? For example, the following is part of the result I got:
1_S1_1/ncRNA-Run3-Sample1_S1_1_result$ more ncRNA-Run3-Sample1_S1_1_summary.txt
Class Sub_Class Reads
Clean_Reads - 639656
Match_Genome - 205787
miRBase-miRNA_Match_Genome - 1740
miRBase-miRNA_Match_Genome mmu-let-7a-1 10.50
miRBase-miRNA_Match_Genome mmu-let-7a-2 8.50
miRBase-miRNA_Match_Genome mmu-let-7b 4.00
miRBase-miRNA_Match_Genome mmu-let-7c-1 2.50
miRBase-miRNA_Match_Genome mmu-let-7c-2 3.50
Thank you very much for your help in advance
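Non-integer values like these are what one would see if a read that matches several entries is split evenly among them. That SPORTS counts reads this way is an assumption here, not something confirmed on this page; the sketch below only illustrates the arithmetic. The function name `fractional_counts` and the read data are made up.

```python
from collections import defaultdict


def fractional_counts(read_hits):
    """Split each read's count evenly across all entries it matches.

    read_hits: iterable of (read_count, [matched entry names]).
    """
    totals = defaultdict(float)
    for count, entries in read_hits:
        share = count / len(entries)
        for entry in entries:
            totals[entry] += share
    return dict(totals)


# Made-up reads: one sequence matching two precursors, one matching a single precursor
hits = [
    (5, ["mmu-let-7a-1", "mmu-let-7a-2"]),
    (8, ["mmu-let-7a-1"]),
]
result = fractional_counts(hits)
print(result)
```

Under this scheme the per-entry values can be fractional while their sum still equals the total number of assigned reads.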