oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids

License: GNU General Public License v3.0

Python 93.32% Shell 4.03% Nextflow 0.80% Dockerfile 0.39% Common Workflow Language 1.38% JavaScript 0.08%

bioinformatics microbial-genomics genome-annotation bacteria bacterial-genomes plasmids metagenome-assembled-genomes annotation mag

bakta's Issues

output nucleotide sequences for each CDS

Would it be possible to also output the nucleotide sequence for each CDS, similar to what is already output for the translated protein?
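A minimal sketch of what such an export could look like, assuming 1-based inclusive coordinates and a CDS that does not span the sequence origin. The function name `extract_cds_nt` is hypothetical and not part of Bakta.

```python
# Complement table covering upper- and lower-case bases plus N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_cds_nt(contig_seq: str, start: int, stop: int, strand: str) -> str:
    """Return the CDS nucleotide sequence in 5'->3' orientation.

    Assumes 1-based, inclusive coordinates with start <= stop.
    """
    sub = contig_seq[start - 1:stop]
    if strand == "-":
        # Reverse complement for features on the reverse strand.
        sub = sub.translate(_COMPLEMENT)[::-1]
    return sub
```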

Add VFDB to alignment-based expert system

Add VFDB protein sequences to protein sequence blast expert annotation system.

rank: 75
min identity: 90
min query coverage: 90
min model coverage: 90
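The thresholds above could be applied as a simple hit filter. This is only a sketch: the dictionary keys (`identity`, `query_cov`, `model_cov`, all in percent) are assumed names, not Bakta's internal data model.

```python
def passes_thresholds(hit, min_identity=90.0, min_query_cov=90.0, min_model_cov=90.0):
    """Return True if an alignment hit meets all three minimum thresholds (percent)."""
    return (
        hit["identity"] >= min_identity
        and hit["query_cov"] >= min_query_cov
        and hit["model_cov"] >= min_model_cov
    )
```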

download/update the database within Bakta

Provide a CLI command to download or update the database from within Bakta.

Improve plasmid annotation

Supersede UniRef100/UniRef90 annotations by the following specialized plasmid protein DBs:

Mobilization proteins:

MobScan: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2937521 https://castillo.dicom.unican.es/mobscan_about/ https://castillo.dicom.unican.es/mobscan_about/MOBfamDB.gz

Conjugation proteins:

CONJScan https://github.com/gem-pasteur/Macsyfinder_models/tree/master/models/Conjugation http://dx.doi.org/10.1371/journal.pone.0110726 http://dx.doi.org/10.1038/srep23080

Improve internal DB downloads

Improve the internal DB download workflow:

Improve the progress feedback during the DB download process by using alive-progress.
Improve download speed by increasing the download chunk size.
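A sketch of a chunked copy loop with a progress hook, assuming a file-like source such as an HTTP response body. The `on_progress` callback is a hypothetical stand-in for the place where a progress bar (e.g. alive-progress) would be updated; the 1 MiB default chunk size is an illustrative value, not a measured optimum.

```python
def copy_in_chunks(src, dst, chunk_size=1024 * 1024, on_progress=None):
    """Copy src to dst in fixed-size chunks and report the running byte count."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
        if on_progress is not None:
            on_progress(total)
    return total
```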

Add chromosome / plasmid rotation

For complete genomes, rotate chromosomes and plasmids to dnaA and RepA/RepB genes, respectively.

Ideas:

UniRef90: Search for dnaA and RepA/RepB genes in annotations
Dfast: https://github.com/nigyta/dfast_core/blob/master/dfc/components/DnaAfinder.py
Unicycler: https://github.com/rrwick/Unicycler/blob/master/unicycler/unicycler.py

If this is conducted before any annotation, Prodigal could be executed with the -c option to disallow the prediction of genes running off edges.
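Once the start position of the anchor gene is known, the rotation itself is simple for a circular sequence. A minimal sketch, assuming a 1-based start position on the forward strand; a gene on the reverse strand would additionally require reverse-complementing the sequence, which is not shown here. The function name is hypothetical.

```python
def rotate_to(seq: str, new_start: int) -> str:
    """Rotate a circular sequence so that the 1-based position new_start comes first."""
    if not 1 <= new_start <= len(seq):
        raise ValueError("new_start is outside the sequence")
    i = new_start - 1
    return seq[i:] + seq[:i]
```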

CDS prediction but not annotation option

Dear all,
Thank you for developing this tool, it looks very comprehensive and the documentation is really thorough.
I went through your documentation and I see you already have an option --skip-cds where you can skip CDS prediction and annotation. However, I was wondering if it would be possible to have the option to still have the CDS prediction but to skip the subsequent CDS functional annotation.
This is simply because I'd like to use an alternative protein function annotation tool but would still need to have the CDS.
In this regard, if you do add this option, which DBs would not be necessary?

Thanks and regards,
Pedro

Add SwissProt annotations

Add SwissProt annotations:

parse uniprot_sprot.xml.gz
calc aa seq hash
lookup UPS entries
a) either update UPS gene/product if no UniRef90 id available in UPS
b) otherwise update PSC via UPS-UniRef90

Available annotation:

 <recommendedName>
   <fullName evidence="1">5'-deoxynucleotidase YfbR</fullName>
   <ecNumber evidence="1">3.1.3.89</ecNumber>
 </recommendedName>

 <gene>
   <name evidence="1" type="primary">yfbR</name>
   <name type="ordered locus">SeAg_B2472</name>
 </gene>

 <dbReference type="GO" id="GO:0005737">
   <property type="term" value="C:cytoplasm"/>
   <property type="evidence" value="ECO:0000501"/>
   <property type="project" value="UniProtKB-SubCell"/>
 </dbReference>
 <dbReference type="GO" id="GO:0002953">
   <property type="term" value="F:5'-deoxynucleotidase activity"/>
   <property type="evidence" value="ECO:0000501"/>
   <property type="project" value="InterPro"/>
 </dbReference>
 <dbReference type="GO" id="GO:0046872">
   <property type="term" value="F:metal ion binding"/>
   <property type="evidence" value="ECO:0000501"/>
   <property type="project" value="UniProtKB-KW"/>
 </dbReference>
 <dbReference type="GO" id="GO:0000166">
   <property type="term" value="F:nucleotide binding"/>
   <property type="evidence" value="ECO:0000501"/>
   <property type="project" value="UniProtKB-KW"/>
 </dbReference>
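The parsing and hashing steps could be sketched as follows with the standard library. Several things here are assumptions: the embedded entry is a simplified, namespace-free fragment (real UniProt XML declares an XML namespace), the sequence `MKQSH` is a placeholder rather than the real YfbR sequence, and MD5 is used as an example hash function.

```python
import hashlib
import xml.etree.ElementTree as ET

# Simplified entry without the UniProt XML namespace; the sequence is a placeholder.
ENTRY = """<entry>
  <protein>
    <recommendedName>
      <fullName evidence="1">5'-deoxynucleotidase YfbR</fullName>
      <ecNumber evidence="1">3.1.3.89</ecNumber>
    </recommendedName>
  </protein>
  <gene>
    <name evidence="1" type="primary">yfbR</name>
    <name type="ordered locus">SeAg_B2472</name>
  </gene>
  <dbReference type="GO" id="GO:0005737"/>
  <sequence>MKQSH</sequence>
</entry>"""


def parse_entry(xml_text: str) -> dict:
    """Extract product, EC number, primary gene name, GO ids and a sequence hash."""
    root = ET.fromstring(xml_text)
    gene = root.find("gene/name[@type='primary']")
    seq = (root.findtext("sequence") or "").strip()
    return {
        "product": root.findtext("protein/recommendedName/fullName"),
        "ec": root.findtext("protein/recommendedName/ecNumber"),
        "gene": gene.text if gene is not None else None,
        "go": [ref.get("id") for ref in root.findall("dbReference[@type='GO']")],
        # Hash of the amino acid sequence, usable as a lookup key (MD5 is an assumption).
        "seq_hash": hashlib.md5(seq.encode("ascii")).hexdigest(),
    }
```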

Add dependency checks

Add check if required dependencies are available at runtime.

Also check the dependencies' software versions.
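A sketch of both checks using only the standard library. The helper names are hypothetical, and the version strings in the usage note are taken from the tRNAscan-SE issue further down this page; how each tool reports its version on the command line is not covered here.

```python
import re
import shutil


def is_available(tool: str) -> bool:
    """Return True if the executable can be found on PATH."""
    return shutil.which(tool) is not None


def parse_version(text: str):
    """Extract the first major.minor[.patch] version from a string as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return tuple(int(g) if g is not None else 0 for g in match.groups())


def meets_minimum(version_text: str, minimum: tuple) -> bool:
    """Return True if the parsed version is at least the required minimum."""
    version = parse_version(version_text)
    return version is not None and version >= minimum
```

For example, `meets_minimum("2.0.5-1", (2, 0, 6))` is false while `meets_minimum("2.0.7+ds-1", (2, 0, 6))` is true.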

Add IS element detection

Bakta needs an IS element detection and annotation feature.
A promising candidate (detection, annotation, recently updated, BioConda) is ISEScan:

Add GenBank export

Add KEGG KofamKOALA annotation

Annotate all clusters by KEGG's KoFams:
https://www.genome.jp/ftp/db/kofam/

Duplicated riboswitch in GCF_002007885.1

In assembly GCF_002007885.1 a riboswitch is detected and duplicated in the final annotation.

Position: [126426,126511] on sequence NZ_CP019870.1

Products:

c-di-GMP-II-GAG riboswitch, score 48.9
Cyclic di-GMP-II riboswitch, score 74.3
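One possible way to resolve such duplicates is to keep only the highest-scoring hit per position. This is a sketch of that idea, not necessarily the fix applied in Bakta; the dictionary keys are assumed names.

```python
def dedupe_by_position(hits):
    """Keep only the best-scoring hit for each (contig, start, stop) position."""
    best = {}
    for hit in hits:
        key = (hit["contig"], hit["start"], hit["stop"])
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    return list(best.values())
```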

Bakta exits with amrfinder error code 1

Hi,

I'm trying to test out Bakta on our newly sequenced bacterial genome (1 chromosome, 6 other genomic compartments in one multifasta file), hopefully comparing against existing Prokka+custom database pipeline. However, the process exits with error message once it gets to 'conduct expert systems...' stage under 'predict & annotate CDSs...' I can replicate this issue on both our lab's server and on my personal laptop (both running Ubuntu, 20.04 and 21.04).

Bakta was installed with conda/bioconda in both instances using the command below, running on Ubuntu.

conda create -n bakta bakta

The annotation process command was:

bakta --db ~/bakta_data/db/ --output test --genus genusname --species speciesname --threads 20 --replicon replicon.tsv our_assembly.fasta

The process exits with the following error message:

Traceback (most recent call last):
  File "/home/user/miniconda3/envs/bakta/bin/bakta", line 10, in <module>
    sys.exit(main())
  File "/home/user/miniconda3/envs/bakta/lib/python3.9/site-packages/bakta/main.py", line 276, in main
    expert_amr_found = exp_amr.search(genome['features'][bc.FEATURE_CDS], cds_fasta_path)
  File "/home/user/miniconda3/envs/bakta/lib/python3.9/site-packages/bakta/expert/amrfinder.py", line 33, in search
    raise Exception(f'amrfinder error! error code: {proc.returncode}')
Exception: amrfinder error! error code: 1

Please let me know if you need more info/tests on my end for troubleshooting.

Thank you!

Add analysis of hypothetical proteins

Analyze hypotheticals via:

hmmsearch vs Pfam-A
compute protein statistics via BioPython
- molecular weight
- isoelectric point
trans-membrane analysis via tmhmm?

Add TSV export

Catch thread argument values larger than the number of available threads

Catch thread argument values larger than the number of available threads.

Thanks to @LuisFF for bringing this up in #62
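A minimal sketch of such a check, assuming that silently clamping the value is acceptable (raising an error would be the alternative). Note that `os.cpu_count()` reports logical CPUs of the machine, which may exceed what a container or scheduler actually grants.

```python
import os


def clamp_threads(requested: int, available=None) -> int:
    """Limit the requested thread count to the number of available CPUs."""
    if requested < 1:
        raise ValueError("thread count must be at least 1")
    if available is None:
        available = os.cpu_count() or 1
    return min(requested, available)
```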

This line causes the bakta-docker.sh to crash

bakta/bakta-docker.sh

Line 36 in cdcb0ba

args[$((argount-1))]=/bakta/$GENOME_FILENAME

The variable in this line should be argcount

Dockerfile not OCI compliant

Describe the bug
When building the docker image with buildah an error message occurs that states that the "SHELL" command is not OCI compliant:
ERRO[0000] SHELL is not supported for OCI image format, [bash -l -c] will be ignored. Must use docker format

Should we change this to be OCI compliant?

Add signal peptide predictions

Use DeepSig (GPL3) to predict signal peptides and cleavage sites.

Add skip options for each feature type

Add skip options for each feature type:

--skip-trna
--skip-tmrna
--skip-rrna
--skip-ncrna
--skip-cds
--skip-sorf
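With `argparse`, the flags listed above could be generated from a single list of feature types. This is a sketch; the program name and help texts are placeholders.

```python
import argparse

FEATURE_TYPES = ["trna", "tmrna", "rrna", "ncrna", "cds", "sorf"]


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with one boolean --skip-<feature> flag per feature type."""
    parser = argparse.ArgumentParser(prog="annotate-sketch")
    for feature in FEATURE_TYPES:
        parser.add_argument(
            f"--skip-{feature}",
            action="store_true",
            help=f"skip {feature} detection and annotation",
        )
    return parser
```

`build_parser().parse_args(["--skip-trna"])` yields a namespace in which `skip_trna` is true and all other skip flags are false.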

adding sequence at the end of gff3 (similar to prokka gff output)

First of all, thanks for this great tool.
I was wondering if it would be feasible to add the FASTA sequence at the end of the GFF3 file (similar to the Prokka GFF output). It would be useful for downstream analysis (e.g. Roary pangenome pipelines). Thanks for your effort.
Paolo
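The GFF3 format allows embedded sequences after a `##FASTA` directive. A sketch of appending such a section to existing annotation lines; the function name and the line width of 60 are illustrative choices.

```python
def append_fasta(gff_lines, sequences, width=60):
    """Return GFF3 text with a trailing ##FASTA section.

    gff_lines: iterable of GFF3 lines without newlines
    sequences: mapping of sequence id -> nucleotide sequence
    """
    out = list(gff_lines)
    out.append("##FASTA")
    for seq_id, seq in sequences.items():
        out.append(f">{seq_id}")
        # Wrap the sequence at a fixed line width.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```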

Add OrthoSearch of reference files

Dear all,
Thank you for this amazing annotation tool for bacterial genomes! I would like to know if you could add an ortholog search option for cases where we have GenBank or protein FASTA file(s) from which genes should be annotated with first priority.
This option is provided in dfast_core as "--references" (https://github.com/nigyta/dfast_core) and in Prokka as "--proteins".

Thanks and regards,
Tonny_Z

Design & implement a replicon information file

A configuration file for all replicons within an assembly / genome should be parsed in order to use this information in output files, e.g. GenBank.

Format idea:

original locus id	new locus id	type	topology	name
`old id`	[`new id` / `<empty>`]	[`chromosome` / `plasmid` / `contig` / `<empty>`]	[`circular` / `linear` / `<empty>`]	`name`

Available short cuts:

chromosome: c
plasmid: p
circular: c
linear: l

<empty> values (- / '') will be replaced by defaults. If new locus id is empty, a new contig name will be autogenerated.

Defaults:
replicon type: contig
topology: linear

Example:

original locus id	new locus id	type	topology	name
NODE_1	chrom	`chromosome`	`circular`	`-`
NODE_2	p1	`plasmid`	`c`	`pXYZ1`
NODE_3	p2	`p`	`c`	`pXYZ2`
NODE_4	special-contig-name-xyz	`-`	-
NODE_5	``	`-`	-

Usage:

bakta --replicons <file.tsv>
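A sketch of parsing the proposed table, applying the short cuts and defaults described above. It assumes a tab-separated file without a header row and without the backticks shown in the format idea; the function name and the returned dictionary layout are hypothetical.

```python
import csv
import io

TYPE_SHORTCUTS = {"c": "chromosome", "p": "plasmid"}
TOPOLOGY_SHORTCUTS = {"c": "circular", "l": "linear"}
EMPTY = {"", "-"}


def parse_replicons(tsv_text: str) -> dict:
    """Map each original locus id to its replicon settings, filling in defaults."""
    replicons = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        # Pad short rows so that missing trailing columns count as empty.
        row = [cell.strip() for cell in row] + [""] * (5 - len(row))
        old_id, new_id, rtype, topology, name = row[:5]
        rtype = TYPE_SHORTCUTS.get(rtype, rtype)
        topology = TOPOLOGY_SHORTCUTS.get(topology, topology)
        replicons[old_id] = {
            "new_id": None if new_id in EMPTY else new_id,  # None: autogenerate later
            "type": "contig" if rtype in EMPTY else rtype,
            "topology": "linear" if topology in EMPTY else topology,
            "name": None if name in EMPTY else name,
        }
    return replicons
```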

Add CRISPR detection

Add CRISPR detection with either:

hmmsearch error on empty hypotheticals.faa sequence file

analyze hypotheticals...
Traceback (most recent call last):
  File "<xyz>/bin/bakta", line 10, in <module>
    sys.exit(main())
  File "<xyz>/bakta/main.py", line 291, in main
    pfam_hits = feat_cds.predict_pfam(hypotheticals)
  File "<xyz>/bakta/features/cds.py", line 264, in predict_pfam
    raise Exception(f'hmmsearch error! error code: {proc.returncode}')
Exception: hmmsearch error! error code: 1

Thanks to Matthew Croxen for pointing this out.

Add comprehensive logging of database creation

Add a comprehensive logging of all initialization steps and applied annotations.
This logging file will be publicly provided in order to check all annotations in the bakta db.

Provide pre-calculated Prodigal training files

For fragmented draft assemblies as well as short plasmids, the insufficient amount of genetic information might result in far from optimal training data for subsequent CDS prediction.
For these cases, it might be beneficial to provide pre-calculated Prodigal training files for bacterial taxa (genus or species level) as well as for plasmids, possibly clustered by Inc groups.

Add oriC / oriT annotation

Detect oriC and oriT for (complete) chromosomes and plasmids.

oriC sequences: DoriC
oriT sequences: MOB-suite

Add NCBI BlastRules as first alignment-based expert system

Introduce an abstract system for protein sequence based expert systems providing:

reference sequences
minimal alignment thresholds: identity, query coverage, subject coverage
gene label
product description
DbXrefs: EC, etc ...
rank: a precedence rank to order & select potentially multiple annotations from several expert systems

Implement NCBI BlastRules as a first protein sequence based expert annotation systems
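The role of the precedence rank can be illustrated with a small selection function. This sketch assumes that a higher rank means higher precedence, which the issue text does not state explicitly; the candidate dictionaries and their keys are hypothetical.

```python
def select_annotation(candidates):
    """Pick the candidate annotation with the highest precedence rank, or None."""
    if not candidates:
        return None
    # Assumption: a larger rank value takes precedence over a smaller one.
    return max(candidates, key=lambda candidate: candidate["rank"])
```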

Add GFF3 export

Print summarizing genome stats

Add some general summarizing genome stats at the end of the annotation process

Add tmRNA prediction via aragorn

As tRNAscan-SE 2.0 only detects tRNAs we need to predict tmRNAs separately via e.g. aragorn

Compatible file for NCBI submission?

Hi,
I'm in the process of submitting annotated genomes with Bakta to the NCBI, hence while checking for the quality and errors of annotation (via table2asn and the gff3 file https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/#run), I encountered several issues with it.
I understand if it's not something this tool will be compatible with as it can be quite tricky but anyhow, here is a list of some of the issues that I had that may be addressed in the gff file:

add a gene line

contig_1	Prodigal	CDS	3	179	.	-	0	ID=DOCECA_00005;locus_tag=DOCECA_00005;product=hypothetical protein
contig_1	Bakta	gene	3	179	.	-	0	ID=DOCECA_00005;locus_tag=DOCECA_00005

remove commas in the "product=" category with the exception of EC numbers

Commas that are intended to be part of a name should be encoded (%2C) according to the GFF3 specifications. However, literal commas should only be included when they are part of enzymatic names. Semi-colons generally should not be included in product names.

Add an option to remove the SO: in dbxref as they are not yet recognized (https://www.ncbi.nlm.nih.gov/genbank/collab/db_xref/)

Best
Greg
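Regarding the comma and semicolon handling in `product=` values, the GFF3 specification requires reserved characters in attribute values to be percent-encoded. A sketch of such an encoder, with a hypothetical function name; whether NCBI's tools accept every encoded product name is a separate question not addressed here.

```python
# Characters with a reserved meaning in GFF3 column 9, plus common control characters.
_RESERVED = {
    "%": "%25",
    ";": "%3B",
    "=": "%3D",
    "&": "%26",
    ",": "%2C",
    "\t": "%09",
    "\n": "%0A",
    "\r": "%0D",
}


def encode_gff3_value(value: str) -> str:
    """Percent-encode reserved characters in a GFF3 attribute value."""
    return "".join(_RESERVED.get(char, char) for char in value)
```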

Add prediction of transcription terminators

Add detection/prediction of Rho (in)dependent transcription terminators

Rho independent

transtermhp: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2007-8-2-r22; Bioconda:https://anaconda.org/bioconda/transtermhp

Filter spurious CDS by AntiFam

hmmsearch --tblout out.tsv --noali --cpu 4 --cut_ga AntiFam_Bacteria.hmm aa.faa
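Since the HMM file is the query and `aa.faa` is the target database in this call, the first column of each non-comment line in `out.tsv` is the protein sequence identifier. A sketch of using that output to drop spurious CDSs; the function names are hypothetical and every reported hit is treated as spurious because `--cut_ga` already applies the gathering thresholds.

```python
def parse_tblout_targets(tblout_text: str) -> set:
    """Collect target sequence names from hmmsearch --tblout output."""
    targets = set()
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        targets.add(line.split()[0])
    return targets


def drop_spurious(cds_ids, tblout_text: str) -> list:
    """Return the CDS ids that have no AntiFam hit."""
    spurious = parse_tblout_targets(tblout_text)
    return [cds_id for cds_id in cds_ids if cds_id not in spurious]
```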

Some annotations in the GFF do not have an ID assigned?

Some annotations don't have an ID in the 9th field.
The majority of the annotations do have an ID, but a few don't. I can't readily use the annotation for RNA-seq counting with HTSeq. Is this expected?

Here are all of them in one of my annotations without ID (last one is a sample annotation with an ID):

1	2	3	4	5	6	7	8	9
HYI41_1	Blast+	oriC	247475	247846	.	?	.	Name=oriC;product=oriC
HYI41_1	Infernal	regulatory_region	3904850	3905064	3.4e-39	-	.	Name=rncO;product=rncO;Dbxref=RFAM:RF00552,SO:0005836
HYI41_2	Blast+	oriV	3931	4898	.	?	.	Name=oriV;product=oriV
HYI41_2	Blast+	oriT	4781	4868	.	?	.	Name=oriT;product=oriT
HYI41_1	Prodigal	CDS	2509	3582	.	+	0	ID=EOAEPG_00020;Name=DNA replication and repair protein RecF;locus_tag=EOAEPG_00020;product=DNA replication and repair protein RecF;Dbxref=COG:COG1195,COG:L,GO:0003697,GO:0005524,GO:0005737,GO:0006260,GO:0006281,GO:0009432,RefSeq:WP_000060112.1,SO:0001217,UniParc:UPI000016552D,UniRef100:UniRef100_A7ZTQ6,UniRef50:UniRef50_Q8Z9U9,UniRef90:UniRef90_Q8Z2N4;gene=recF
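Features lacking an `ID` attribute can be listed with a short check like the following sketch, which assumes tab-separated GFF3 lines with nine columns; the function name is hypothetical.

```python
def features_without_id(gff_text: str) -> list:
    """Return (seqid, type, start, end) for every feature line lacking an ID attribute."""
    missing = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # skips e.g. sequence lines of an embedded FASTA section
        attributes = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        if "ID" not in attributes:
            missing.append((cols[0], cols[2], int(cols[3]), int(cols[4])))
    return missing
```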

Add Tests

Add integration tests. Therefore, a tiny mock db as well as some small test genomes are needed.

Add assembly gap detection

Detect assembly gaps (>5 Ns) and add annotations (assembly_gap) accordingly.

https://github.com/nigyta/dfast_core/blob/master/dfc/tools/gap.py
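A minimal regex-based sketch of the detection step, independent of the linked DFAST implementation. "More than 5 Ns" is read as runs of at least six `N` characters, matched case-insensitively; coordinates are returned 1-based and inclusive.

```python
import re

# Runs of six or more N characters, i.e. strictly more than 5 Ns.
GAP_PATTERN = re.compile(r"N{6,}", re.IGNORECASE)


def find_gaps(seq: str) -> list:
    """Return (start, stop) pairs of assembly gaps, 1-based and inclusive."""
    return [(match.start() + 1, match.end()) for match in GAP_PATTERN.finditer(seq)]
```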

Add detection of pseudo genes

First hints for ideas:

Improve annotation of phage-borne genes

Pre-annotation of UniRef90 clusters could be improved using a dedicated phage database like:

VOGDB: http://vogdb.org/download
PHROGs: https://phrogs.lmge.uca.fr/READMORE.php

archaeal genomes

Does this tool also work for archaeal genomes or metagenome-assembled genomes? Thanks!

refine rRNA detection

A refinement of the rRNA detection is necessary as currently too many truncated and false positive sequences are predicted.

Inspired by Barrnap and certain issues (tseemann/prokka#423, tseemann/barrnap#39), the cmsearch based approach (cmsearch --noali --cut_tc --notrunc) therefore gets replaced by a cmscan approach in glocal mode (cmscan --noali --cut_tc -g --nohmmonly --rfam)

Implement a general feature overlap detection filter

Many features overlap with each other. A hierarchical bacteria specific overlap detection filter needs to be implemented.
Current hierarchical overlap preference (descending):

Feature to remove	Feature overlap type
`tmRNA`	`-`
`tRNA`	`tmRNA`
`rRNA`	`-`
`ncRNA`	`-`
`ncRNA region`	`-`
`CRISPR`	`-`
`CDS`	`tmRNA`, `tRNA`, `rRNA`, `CRISPR`
`sORF`	`tmRNA`, `tRNA`, `rRNA`, `CRISPR`, `inframe CDS`, `shorter inframe sORF`

Add --debug option

Add a --debug option in order to keep tmp files and activate additional debugging logs.

Add a CWL description file

Add AMRFinderPlus as a first expert system

Implement a common ground for expert annotation systems, for example AMRFinderPlus, thus allowing the incorporation of specific expert knowledge.

Additional "expert systems" might be:

NCBI BlastRules
HAMAP
UniProt annotation rules

Error version tRNAscan-SE with ubuntu 20.04

With Ubuntu 20.04, the version in the repo of tRNAscan-SE is 2.0.5-1

I got this error in the call of bakta

ERROR: Wrong tRNAscan-SE version installed. Please, either install tRNAscan-SE version v2.0.6 or use ['--skip-trna']!

With Ubuntu 21.04, the version of tRNAscan-SE in the repo is 2.0.7+ds-1, and I don't encounter this error in Bakta.

Improve AMRFinderPlus integration

Execute AMRFinderPlus with a custom $TMPDIR location pointing to Bakta's internal tmp dir.
Write AMRFinderPlus's db to a subdir within the Bakta database directory via amrfinder_update --database <db>/amrfinderplus/ --force_update and amrfinder --protein <input> --database <db>/amrfinderplus/latest as suggested in https://github.com/oschwengers/bakta/discussions/64
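The two points above could be combined when building the AMRFinderPlus call. This sketch only assembles the command and environment using the `--protein` and `--database` options quoted above; the function name and directory arguments are hypothetical, and nothing is executed here.

```python
import os
from pathlib import Path


def build_amrfinder_call(db_dir, protein_fasta, tmp_dir):
    """Return (command, environment) for running amrfinder with a custom TMPDIR."""
    database = Path(db_dir) / "amrfinderplus" / "latest"
    cmd = ["amrfinder", "--protein", str(protein_fasta), "--database", str(database)]
    # Copy the current environment and point TMPDIR to the internal tmp dir.
    env = dict(os.environ, TMPDIR=str(tmp_dir))
    return cmd, env
```

The returned pair can be passed on as `subprocess.run(cmd, env=env)`.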

Use Prodigal training files pre-trained on high-quality reference genomes

For common genomes, use Prodigal training files pre-trained on high-quality reference genomes.
These could be selected by a quick Mash lookup via a RefSeq db of complete reference and representative genomes.
