egluckthaler / starfish


starfish: a modular toolkit for giant mobile element annotation

License: GNU Affero General Public License v3.0

Languages: Perl 76.72%, Python 1.38%, Shell 2.74%, Makefile 0.05%, C++ 18.35%, C 0.76%
Topics: bioinformatics, genomics, mobile-elements, starships, transposable-elements

starfish's Issues

typo in install instructions

From: https://github.com/egluckthaler/starfish/wiki/Installation

when replacing the cnef.cc file, your instructions say:
cp ../scripts/cneff.cc .

but it should say:
cp ../scripts/cnef.cc .

I think this is a somewhat dangerous error, because the failed copy can easily go unnoticed among other install errors. Since the original file and the new file have the same name, it is hard to tell which one is in place.

One option would be to rename your cnef.cc to cnef_starfish.cc and then patch the Makefile with a sed command, something like this:
sed -i "s/cnef.cc/cnef_starfish.cc/" Makefile

Typo in step-by-step tutorial

I believe that the annotate command in the step-by-step tutorial has a typo:

starfish annotate -T 2 -x macpha6_tyr -a ome2assembly.txt -g macpha6.gff3 -p ../database/YRsuperfams.p1-512.hmm -P ../database/YRsuperfamRefs.faa -i tyr -o geneFinder/

The -g parameter should be ome2gff.txt instead of macpha6.gff3.

Thanks for the fantastic programme!
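
For context, a minimal sketch of how a file like ome2gff.txt could be generated. It assumes the file is a two-column, tab-separated table of genome ID and GFF path, and that the GFFs end in .gff3; both are assumptions, and make_ome2gff is a hypothetical helper, not part of starfish.

```shell
# Hypothetical helper: print "<genomeID><TAB><absolute path>" for every
# *.gff3 file in a directory. The genome ID is taken from the file name.
make_ome2gff() {
    dir=$1
    for f in "$dir"/*.gff3; do
        [ -e "$f" ] || continue            # skip if the glob matched nothing
        id=$(basename "$f" .gff3)
        abs="$(cd "$(dirname "$f")" && pwd)/$(basename "$f")"
        printf '%s\t%s\n' "$id" "$abs"
    done
}

# Usage (assumed layout): make_ome2gff gff3 > ome2gff.txt
```

The resulting file would then be passed to -g in place of the single GFF path.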

Trying to run on my own samples

Hello Emile and group! I'm excited to try your program on my Histoplasma datasets, but I have hit a couple of snags. I've solved some of them, but I'm currently stuck on this one.

I have been following your tutorial for your test dataset and the portion I have been stuck on is your eggnog annotation sorting.
My genomes have been annotated using Funannotate, but I also attempted an eggnog-mapper run and my output does not look like yours.

Your step that cuts your emapper into a text file for later steps is giving me an issue:

cut -f1,10 ann/*emapper.annotations | grep -v '#' | perl -pe 's/^([^\s]+?)\t([^\|]+).+$/\1\t\2/' > ann/macph6.gene2og.txt

The cut -f1,10 step expects the "bestOG|evalue|score" column and pulls out the first part, "bestOG", to create the text file. Is there maybe a flag I am missing to get this specific emapper output?

Funannotate:
GeneID TranscriptID Feature Contig Start Stop Strand Name Product Alias/Synonyms EC_number BUSCO PFAM InterPro EggNog COG GO Terms Secreted Membrane Protease CAZyme Notes gDNA mRNA CDS-transcript Translation
19VMG-15_000001 19VMG-15_000001-T1 mRNA scaffold_1 63793 65897 + hypothetical protein 2.1.1.310 EOG091P0ENB PF01189;PF17125 IPR001678 SAM-dependent methyltransferase RsmB-F/NOP2-type domain;IPR011023 Nop2p;IPR018314 Bacterial Fmu (Sun)/eukaryotic nucleolar NOL1/Nop2p, conserved site;IPR023267 RNA (C5-cytosine) methyltransferase;IPR023273 RNA (C5-cytosine) methyltransferase, NOP2;IPR029063 S-adenosyl-L-methionine-dependent methyltransferase superfamily;IPR031341 Ribosomal RNA small subunit methyltransferase F, N-terminal ENOG503NUZ7 L:(L) Replication, recombination and repair GO_component: GO:0005730 - nucleolus [Evidence IEA];GO_function: GO:0008757 - S-adenosylmethionine-dependent methyltransferase activity [Evidence IEA];GO_function: GO:0008168 - methyltransferase activity [Evidence IEA];GO_function: GO:0003723 - RNA binding [Evidence IEA];GO_function: GO:0009383 - rRNA (cytosine-C5-)-methyltransferase activity [Evidence IEA];GO_process: GO:0006396 - RNA processing [Evidence IEA];GO_process: GO:0001510 - RNA methylation [Evidence IEA];GO_process: GO:0070475 - rRNA base methylation [Evidence IEA];GO_process: GO:0000470 - maturation of LSU-rRNA [Evidence IEA]

My eggnog-mapper:
#query seed_ortholog evalue score eggNOG_OGs max_annot_lvl COG_category Description Preferred_name GOs EC KEGG_ko KEGG_Pathway KEGG_Module KEGG_Reaction KEGG_rclass BRITE KEGG_TC CAZy BiGG_Reaction PFAMs
01_16_1 5037.XP_001538943.1 0 3312 COG1020@1|root,KOG1178@2759|Eukaryota,38V1T@33154|Opisthokonta,3Q08R@4751|Fungi,3R2AF@4890|Ascomycota,20DTP@147545|Eurotiomycetes,3AZS6@33183|Onygenales 4751|Fungi Q Condensation domain - - - ko:K22152 - - - ko00000 - - - AMP-binding,Condensation,PP-binding

Your eggnog-mapper:
#query_name seed_eggNOG_ortholog seed_ortholog_evalue seed_ortholog_score predicted_gene_name GO_terms KEGG_pathways Annotation_tax_scope OGs bestOG|evalue|score COG cat eggNOG annot
mp040_13792 441959.XP_002479979.1 5.00E-71 232.2 FG02084.1 GO:0005575,GO:0005623,GO:0005886,GO:0006810,GO:0008150,GO:0015886,GO:0016020,GO:0016021,GO:0031224,GO:0044425,GO:0044464,GO:0044699,GO:0044765,GO:0051179,GO:0051181,GO:0051234,GO:0071702,GO:0071705,GO:0071944,GO:1901678 fuNOG[21] 03RFZ@ascNOG,0IWNB@euNOG,0M5FN@euroNOG,0MQ5B@eurotNOG,0PN5J@fuNOG,11Q0B@NOG,13QP8@opiNOG 0PN5J|7.70053432588e-86|288.860290527 S RTA1 domain protein

Any guidance would be appreciated!

-Tania
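
The two emapper outputs above have different column layouts, so cut -f1,10 no longer lands on an orthogroup column. Below is a minimal sketch of one way to build a gene-to-orthogroup table from the newer layout. It assumes column 5 (eggNOG_OGs) holds a comma-separated list of ID@taxid|name entries and column 6 (max_annot_lvl) names the level to keep, as in the sample row. The function name emapper_gene2og is hypothetical, and choosing the OG at max_annot_lvl is an assumption, not the tutorial's confirmed method.

```shell
# Hypothetical helper: print "<gene><TAB><OG id>" for each annotated gene,
# picking the orthogroup whose taxid matches the max_annot_lvl column.
emapper_gene2og() {
    awk -F'\t' '
        /^#/ { next }                        # skip header and comment lines
        {
            split($6, lvl, "|")              # "4751|Fungi" -> lvl[1] = 4751
            n = split($5, ogs, ",")          # eggNOG_OGs entries
            for (i = 1; i <= n; i++) {
                split(ogs[i], parts, "@")    # "3Q08R@4751|Fungi" -> id, taxon
                split(parts[2], tax, "|")
                if (tax[1] == lvl[1]) { print $1 "\t" parts[1]; next }
            }
        }
    ' "$1"
}

# Usage (assumed file names):
#   emapper_gene2og ann/sample.emapper.annotations > ann/sample.gene2og.txt
```

Genes with no orthogroup at that level are simply left out of the table.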

Starfish failing to identify known starships with verified insertion site polymorphisms

I have now analyzed two genomes that we know to contain 8 full-length starship elements. Starfish failed to identify any of them. Maybe this is due to the aforementioned failure in the bedtools intersect command:

[Tue Aug 6 08:33:45 2024] checking formatting of GFFs in ome2gff.txt..
sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `bedtools intersect -a Arcadia_SF_starships/Arcadia.filt.gff -b <(grep -w mRNA /scratch/farman/STARFISH/Arcadia_processed.gff) -wao >> Arcadia_SF_starships/Arcadia.intersect.gff'

[Tue Aug 6 08:33:45 2024] error: could not execute bedtools intersect on commandline for Arcadia_SF_starships/Arcadia.filt.gff and /scratch/farman/STARFISH/Arcadia_processed.gff, exiting..

I tried to bypass the failure by running the above bedtools command directly from the command line. However, I suspect that the subsequent failure to identify elements results from the absence of the intersect.ids file, or possibly from other operations that starfish annotate is supposed to perform after bedtools intersect. I am unable to troubleshoot the issue because intermediate "files" are only held in memory and not written to disk.
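
When reproducing the logged step by hand, the `<(...)` construct can be avoided by writing the filtered GFF to a temporary file first. The sketch below mirrors only the command shown in the log; the function names are hypothetical, and it does not recreate intersect.ids or any other step starfish annotate may run afterwards.

```shell
# Keep only lines containing the whole word mRNA, as `grep -w mRNA`
# does in the logged command.
extract_mrna() {
    grep -w mRNA "$1"
}

# Hypothetical wrapper: same bedtools call as in the log, but the second
# input comes from a temp file instead of process substitution.
intersect_without_procsub() {
    filt_gff=$1; full_gff=$2; out=$3
    tmpdir=$(mktemp -d) || return 1
    extract_mrna "$full_gff" > "$tmpdir/mrna.gff"
    bedtools intersect -a "$filt_gff" -b "$tmpdir/mrna.gff" -wao >> "$out"
    status=$?
    rm -rf "$tmpdir"
    return $status
}

# Usage (paths from the log):
#   intersect_without_procsub Arcadia_SF_starships/Arcadia.filt.gff \
#       /scratch/farman/STARFISH/Arcadia_processed.gff \
#       Arcadia_SF_starships/Arcadia.intersect.gff
```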

Add color values to bed/gff files

I'd like to suggest that color values be automatically added to bed/gff files to distinguish different features (e.g. TYRs, DRs, TIRs, etc.). This will make interpretation in browsers much easier. Perhaps allow overriding through command line options?
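
Until something like this exists, colors can be added after the fact. The sketch below assumes a 6-column BED whose name column (column 4) contains the feature type, which may not match the actual starfish output. The function name add_bed_colors and the color choices are hypothetical. It pads the file to BED9 so that column 9 can carry the itemRgb value.

```shell
# Hypothetical helper: turn a 6-column BED into BED9 with an itemRgb value
# chosen from the feature type found in the name column.
add_bed_colors() {
    awk -F'\t' -v OFS='\t' '
        {
            rgb = "128,128,128"                       # default: gray
            if ($4 ~ /tyr/)      rgb = "228,26,28"    # captain genes: red
            else if ($4 ~ /TIR/) rgb = "77,175,74"    # TIRs: green
            else if ($4 ~ /DR/)  rgb = "55,126,184"   # DRs: blue
            # thickStart and thickEnd are set to the feature start and end
            print $1, $2, $3, $4, $5, $6, $2, $3, rgb
        }
    ' "$1"
}

# Usage (assumed file name): add_bed_colors elements.bed > elements.colored.bed
```

A command line option could replace the hard-coded colors with user-supplied ones.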

Some questions about the parameter "-p" in starfish annotate module

Hi! We encountered some difficulties while predicting the starship transposons.
The parameter "-p" is the profile HMM file in the starfish annotate module. How can I obtain the HMM files when using the module? Which HMM file should we download to predict starship transposons? Are the HMM files available in a public database? We would appreciate it if you could provide us with a web address for the database.

Thank you!
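
Other commands quoted on this page pass -p ../database/YRsuperfams.p1-512.hmm or -p ../db/YRsuperfams.p1-512.hmm, which suggests the HMM file ships inside the starfish repository itself. A minimal sketch for locating it in a local clone follows; the clone path is an assumption and find_tyr_hmm is a hypothetical helper.

```shell
# Hypothetical helper: print the path of the first YRsuperfams HMM file
# found anywhere under the given directory (for example a starfish clone).
find_tyr_hmm() {
    find "$1" -type f -name 'YRsuperfams.p1-512.hmm' 2>/dev/null | head -n 1
}

# Usage (assumed clone location):
#   hmm=$(find_tyr_hmm "$HOME/starfish")
#   [ -n "$hmm" ] && echo "pass this to -p: $hmm"
```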

Bug or unfriendly pipeline breach?

ElementFinder is throwing an error but I suspect it is simply encountering a normal but perhaps unexpected eventuality. Do I interpret this output correctly to mean that of the 13 tyr captain candidates, none show insertional polymorphism when compared with the reference genome? In this case, why is the program attempting to build a blast directory from an empty candidate element file?

Also, it might help to have some additional blast hit info. For example, were there 0 tyr captains with candidate insertion sites, or were there simply no blast hits between the candidate tyr captains and the reference? No blast input/output files are saved (as far as I can tell) so it's not possible to investigate this manually.

(base) [farman@mcc-login001 STARFISH]$ singularity run --app starfish100 /share/singularity/images/ccs/conda/amd-conda16-rocky8.sinf starfish insert -T 6 -a ome2assembly.txt -d elementFinder/B71v2sh.fasta -b CD156_SF_starships/CD156.tyr.bed -i tyr -x CD156 -o elementFinder
[Mon Aug 5 22:03:37 2024] executing command: starfish insert -T 6 -a ome2assembly.txt -d elementFinder/B71v2sh.fasta -b CD156_SF_starships/CD156.tyr.bed -i tyr -x CD156 -o elementFinder
Key parameters:
--upstream 0-17000
--downstream 0-14000
--length 2000-700000
--pid 90
--hsp 1000
minDR 4
maxDR 40
maxEmptySiteLength 2000
maxElementLengthFlag 700000
minElementLengthFlag 2000
maxInsertCoverage 0.25
maxFlankCoverage 0.25
blastn -task dc-megablast -evalue 0.0001 -max_target_seqs 1000000
nucmer --mum
delta-filt -m -l 1000 -i 90

[Mon Aug 5 22:03:37 2024] reading in data..
[Mon Aug 5 22:03:37 2024] parsing upstream regions of 13 candidate tyr captains..
[Mon Aug 5 22:03:37 2024] searching for hits to the upstream regions of 13 candidate tyr captains..
[Mon Aug 5 22:03:37 2024] found 0 tyr captains with candidate insertion sites
[Mon Aug 5 22:03:37 2024] searching for hits to the downstream regions of candidate insertion sites for 0 tyr captains that are also downstream of those captains..
BLAST options error: File elementFinder/elementDB.CD156.fna is empty

[Mon Aug 5 22:03:37 2024] error: could not execute makeblastdb on commandline, exiting..
Inappropriate ioctl for device
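
The log shows makeblastdb being called on an empty FASTA after zero candidates were found. A minimal sketch of the kind of guard that would turn this into a clean "nothing to do" message is shown below. The function names are hypothetical and this is not how starfish itself is written.

```shell
# True if the file exists, is non-empty, and has at least one FASTA header.
has_fasta_records() {
    [ -s "$1" ] && grep -q '^>' "$1"
}

# Hypothetical wrapper: only build a nucleotide BLAST database when there
# is something to index; otherwise report it and return success.
safe_makeblastdb() {
    if has_fasta_records "$1"; then
        makeblastdb -in "$1" -dbtype nucl
    else
        echo "skipping makeblastdb: $1 has no sequences" >&2
        return 0
    fi
}

# Usage (path from the log): safe_makeblastdb elementFinder/elementDB.CD156.fna
```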

Bug in STARFISH annotate code

There is a bug in the code that is preventing execution of the bedtools intersect command:

[Mon Aug 5 21:10:14 2024] checking formatting of GFFs in ome2gff.txt..
sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `bedtools intersect -a CD156_SF_starships/CD156.filt.gff -b <(grep -w mRNA /scratch/farman/STARFISH/CD156_processed.gff) -wao >> CD156_SF_starships/CD156.intersect.gff'

[Mon Aug 5 21:10:14 2024] error: could not execute bedtools intersect on commandline for CD156_SF_starships/CD156.filt.gff and /scratch/farman/STARFISH/CD156_processed.gff, exiting..
No such file or directory

Troubleshooting is complicated by the fact that the error message is incorrect. Both files referenced in the message are present and correct in the specified locations. However, bedtools intersect is looking for a temp.gff that, apparently, is never created.

Note: This step also fails with the starfish example datasets.
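
The `sh: -c:` prefix in the error suggests the command string is handed to `sh`, and `<(...)` is process substitution, a bash feature that a POSIX `sh` (for example dash, or some bash versions running as `sh`) may reject as a syntax error. A small probe to check what a given shell accepts is sketched below; supports_procsub is a hypothetical name.

```shell
# Return 0 if the named shell can run a command that uses <(...),
# non-zero otherwise. Output and error messages are discarded.
supports_procsub() {
    "$1" -c 'cat <(printf ok)' >/dev/null 2>&1
}

# Usage:
#   supports_procsub sh   || echo "sh cannot parse <(...)"
#   supports_procsub bash && echo "bash can"
```

If `sh` fails the probe while `bash` passes, that would explain why the same command works when typed into an interactive bash session.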

starfish consolidate question

Hi there,

the step-by-step instructions say to use:

xxx_tyr.filt_intersect.gff.

However, after running the geneFinder module I don't have a file like that. I do however have:
xx_tyr.filt.gff
xx_tyr.intersect.gff
xx_tyr.gff

Which of those should I use as the -G argument for starfish consolidate?

Cheers

interpreting starfish insert error

Hi Emile,

Thanks for the great tool! I've been running it on a couple of genomes I'm working with. I followed your tutorial and was able to identify some tyr genes there. However, when I got to the element annotation step I got the following:

[Mon May 20 15:37:25 2024] executing command: starfish insert -T 2 -a analysis_and_temp_files/12_starship/ome2assembly.txt -d analysis_and_temp_files/12_starship/blastdb/macpha6.assemblies -b analysis_and_temp_files/12_starship/geneFinder/macpha6.tyr.bed -i tyr -x macpha6 -o analysis_and_temp_files/12_starship/elementFinder/
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LANG = "en_GB.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Key parameters:
--upstream             0-17000
--downstream           0-14000
--length               2000-700000
--pid                  90
--hsp                  1000
minDR                  4
maxDR                  40
maxEmptySiteLength     2000
maxElementLengthFlag   700000
minElementLengthFlag   2000
maxInsertCoverage      0.25
maxFlankCoverage       0.25
blastn                 -task dc-megablast -evalue 0.0001 -max_target_seqs 1000000
nucmer                 --mum
delta-filt             -m -l 1000 -i 90

[Mon May 20 15:37:25 2024] reading in data..
[Mon May 20 15:37:25 2024] parsing upstream regions of 11 candidate tyr captains..
[Mon May 20 15:37:25 2024] searching for hits to the upstream regions of 11 candidate tyr captains..
[Mon May 20 15:37:37 2024] found 4 tyr captains with candidate insertion sites
[Mon May 20 15:37:37 2024] searching for hits to the downstream regions of candidate insertion sites for 4 tyr captains that are also downstream of those captains..
[Mon May 20 15:37:42 2024] filtering out similar upstream and downstream flanks of insertion sites associated with 0 candidate tyr captains..
[Mon May 20 15:37:43 2024] refining boundary predictions of 0 candidate tyr captains with insertion sites..
BLAST options error: File analysis_and_temp_files/12_starship/elementFinder//emptyDB.macpha6.fa is empty


[Mon May 20 15:37:43 2024] error: could not execute makeblastdb on commandline, exiting..
Inappropriate ioctl for device

Does this mean that the 4 tyr genes that had upstream candidate insertion sites didn't have downstream sites, and that caused the error? If so, could it be because of assembly fragmentation? One of the assemblies I'm working with isn't very contiguous (~60 contigs), so I thought that might be the problem here.

Thanks!
Gulnara
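
To judge how fragmented an assembly is before attributing the failure to it, contig count and N50 can be summarized with standard tools. This is a minimal sketch; assembly_stats is a hypothetical helper, and it says nothing about whether fragmentation is the actual cause here.

```shell
# Hypothetical helper: print contig count, total length, and N50 for a
# FASTA file. Sequences may span several lines; empty records are ignored.
assembly_stats() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # new record starts
        { len += length($0) }                            # add sequence line
        END { if (len > 0) print len }                   # last record
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) {
                    printf "contigs=%d total=%d N50=%d\n", NR, total, lens[i]
                    exit
                }
            }
        }'
}

# Usage (assumed file name): assembly_stats assembly/sample.fasta
```

An N50 that is small relative to the flank and element lengths listed under "Key parameters" would be one hint that contig ends are limiting the search.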

sh: 1: Syntax error: "(" unexpected

Hey, I don't know what is wrong...

(starfish) marcin@marcin-:~/starfish/test$ starfish annotate -T 2 -x macpha6_tyr -a ome2assembly.txt -g ome2gff.txt -p ../db/YRsuperfams.p1-512.hmm -P ../db/YRsuperfamRefs.faa -i tyr -o geneFinder/
[Thu Apr 25 21:33:47 2024] executing command: starfish annotate -T 2 -x macpha6_tyr -a ome2assembly.txt -g ome2gff.txt -p ../db/YRsuperfams.p1-512.hmm -P ../db/YRsuperfamRefs.faa -i tyr -o geneFinder/
Key parameters:
metaeuk        -v 3 -s 7.5 -e 0.0001 --max-intron 2000 --max-seqs 300 --metaeuk-eval 0.001 --exhaustive-search 1 --metaeuk-tcov 0.25 --allow-deletion 1 --protein 1 --disk-space-limit 100G --compressed 1
hmmsearch      --max -E 0.001

[Thu Apr 25 21:33:47 2024] geneFinder//macpha6_tyr.fas exists, skipping metaeuk annotation
[Thu Apr 25 21:33:47 2024] geneFinder//macpha6_tyr.hmmout exists, skipping HMM annotation of metaeuk results
[Thu Apr 25 21:33:47 2024] filtering metaeuk annotations based on hmmsearch results..
[Thu Apr 25 21:33:47 2024] lifting over names of overlapping feature from GFFs in ome2gff.txt to metaeuk annotations..
[Thu Apr 25 21:33:47 2024] checking formatting of GFFs in ome2gff.txt..
sh: 1: Syntax error: "(" unexpected

cheers

skimage_feature

Hi,

Just did a clean install of starfish, but I get an error at the annotate step:

File "/mnt/LTR_userdata/auxie001/programs/conda/envs/starfish/bin/starfish", line 7, in <module>
from starfish.core.starfish import starfish
File "/mnt/LTR_userdata/auxie001/programs/conda/envs/starfish/lib/python3.8/site-packages/starfish/__init__.py", line 1, in <module>
from . import (
File "/mnt/LTR_userdata/auxie001/programs/conda/envs/starfish/lib/python3.8/site-packages/starfish/image.py", line 1, in <module>
from starfish.core.image import ( # noqa: F401
File "/mnt/LTR_userdata/auxie001/programs/conda/envs/starfish/lib/python3.8/site-packages/starfish/core/image/__init__.py", line 2, in <module>
from ._registration import LearnTransform
File "/mnt/LTR_userdata/auxie001/programs/conda/envs/starfish/lib/python3.8/site-packages/starfish/core/image/_registration/LearnTransform/__init__.py", line 2, in <module>
from .translation import Translation
File "/mnt/LTR_userdata/auxie001/programs/conda/envs/starfish/lib/python3.8/site-packages/starfish/core/image/_registration/LearnTransform/translation.py", line 2, in <module>
from skimage.feature import register_translation
ImportError: cannot import name 'register_translation' from 'skimage.feature' (/mnt/LTR_userdata/auxie001/programs/conda/envs/starfish/lib/python3.8/site-packages/skimage/feature/__init__.py)

According to the scikit-image blog, I think the function register_translation has been removed as of v0.19.3, and they now suggest using skimage.registration.phase_cross_correlation() instead.
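
The traceback paths sit under a Python site-packages/starfish directory, which hints that the starfish executable on PATH may belong to a Python package of the same name rather than to this toolkit. That is an assumption, not something confirmed in this thread. A quick check of the executable's shebang line is sketched below; is_python_entrypoint is a hypothetical helper.

```shell
# Return 0 if the first line of the file is a shebang that mentions python.
is_python_entrypoint() {
    head -n 1 "$1" | grep -q '^#!.*python'
}

# Usage:
#   exe=$(command -v starfish)
#   is_python_entrypoint "$exe" && echo "$exe is a Python entry point"
```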
