oschwengers / bakta
Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
License: GNU General Public License v3.0
Would it be possible to also output the nucleotide sequence for each CDS, similar to what is already output for the translated protein?
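A minimal sketch of how the nucleotide sequence of a CDS could be extracted, assuming 1-based inclusive coordinates and a `+`/`-` strand flag as used in GFF3. The helper names `reverse_complement` and `extract_cds_nt` are hypothetical and not part of Bakta.

```python
# Hypothetical helpers for illustration only, not Bakta's API.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_cds_nt(contig_seq: str, start: int, stop: int, strand: str) -> str:
    """Extract the nucleotide sequence of a CDS.

    Assumes 1-based, inclusive coordinates and a feature that does not
    span the sequence edge.
    """
    nt = contig_seq[start - 1:stop]
    return reverse_complement(nt) if strand == "-" else nt
```

CDSs spanning the origin of a circular replicon would need extra handling that this sketch leaves out.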
Add VFDB protein sequences to protein sequence blast expert annotation system.
Provide a CLI command to download or update the database from within Bakta.
Supersede UniRef100/UniRef90 annotations by the following specialized plasmid protein DBs:
Mobilization proteins:
Conjugation proteins:
Improve the internal DB download workflow:
For complete genomes, rotate chromosomes and plasmids to dnaA and RepA/RepB genes, respectively.
Ideas:
If this is conducted before any annotation, Prodigal could be executed with the -c option to disallow the prediction of genes running off edges.
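The rotation itself can be sketched as follows, assuming the 1-based start coordinate of the anchor gene (dnaA or RepA/RepB) is already known and the gene lies on the forward strand. The function name `rotate_circular` is hypothetical.

```python
def rotate_circular(seq: str, new_start: int) -> str:
    """Rotate a circular sequence so that 1-based `new_start` becomes position 1."""
    if not 1 <= new_start <= len(seq):
        raise ValueError("new_start must lie within the sequence")
    offset = new_start - 1
    return seq[offset:] + seq[:offset]
```

For an anchor gene on the reverse strand, the replicon would additionally have to be reverse complemented, which is not shown here.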
Dear all,
Thank you for developing this tool, it looks very comprehensive and the documentation is really thorough.
I went through your documentation and I see you already have an option --skip-cds where you can skip CDS prediction and annotation. However, I was wondering if it would be possible to have the option to still have the CDS prediction but to skip the posterior CDS functional annotation.
This is simply because I'd like to use an alternative protein function annotation tool but would still need to have the CDS.
In this regard, if you do add this option, which DBs would not be necessary?
Thanks and regards,
Pedro
Add SwissProt annotations:
uniprot_sprot.xml.gz
UPS entries
UPS gene/product if no UniRef90 id available in UPS
PSC via UPS-UniRef90
Available annotation:
```
<recommendedName>
  <fullName evidence="1">5'-deoxynucleotidase YfbR</fullName>
  <ecNumber evidence="1">3.1.3.89</ecNumber>
</recommendedName>
```
```
<gene>
  <name evidence="1" type="primary">yfbR</name>
  <name type="ordered locus">SeAg_B2472</name>
</gene>
```
```
<dbReference type="GO" id="GO:0005737">
  <property type="term" value="C:cytoplasm"/>
  <property type="evidence" value="ECO:0000501"/>
  <property type="project" value="UniProtKB-SubCell"/>
</dbReference>
<dbReference type="GO" id="GO:0002953">
  <property type="term" value="F:5'-deoxynucleotidase activity"/>
  <property type="evidence" value="ECO:0000501"/>
  <property type="project" value="InterPro"/>
</dbReference>
<dbReference type="GO" id="GO:0046872">
  <property type="term" value="F:metal ion binding"/>
  <property type="evidence" value="ECO:0000501"/>
  <property type="project" value="UniProtKB-KW"/>
</dbReference>
<dbReference type="GO" id="GO:0000166">
  <property type="term" value="F:nucleotide binding"/>
  <property type="evidence" value="ECO:0000501"/>
  <property type="project" value="UniProtKB-KW"/>
</dbReference>
```
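A rough sketch of how these fields could be pulled out of such an entry with Python's standard library. The `<entry>` wrapper and the function name `parse_entry` are assumptions for illustration; the real UniProt XML also declares an XML namespace, which is ignored here.

```python
import xml.etree.ElementTree as ET

# Fragment based on the example above, wrapped in a hypothetical <entry> root.
ENTRY_XML = """
<entry>
  <recommendedName>
    <fullName evidence="1">5'-deoxynucleotidase YfbR</fullName>
    <ecNumber evidence="1">3.1.3.89</ecNumber>
  </recommendedName>
  <gene>
    <name evidence="1" type="primary">yfbR</name>
    <name type="ordered locus">SeAg_B2472</name>
  </gene>
  <dbReference type="GO" id="GO:0005737"/>
  <dbReference type="GO" id="GO:0002953"/>
</entry>
"""


def parse_entry(xml_text: str) -> dict:
    """Collect product, EC number, primary gene name and GO ids from one entry."""
    entry = ET.fromstring(xml_text)
    return {
        "product": entry.findtext("recommendedName/fullName"),
        "ec": entry.findtext("recommendedName/ecNumber"),
        "gene": entry.findtext("gene/name[@type='primary']"),
        "go": [ref.get("id") for ref in entry.findall("dbReference[@type='GO']")],
    }


annotation = parse_entry(ENTRY_XML)
```

For the full `uniprot_sprot.xml.gz` file, a streaming parser such as `ET.iterparse` would likely be preferable to loading everything at once.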
Add check if required dependencies are available at runtime.
Also check the dependencies' software versions.
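One possible shape for such a check, as a sketch only. The function names, the version flag passed by the caller and any minimum versions are hypothetical and do not reflect Bakta's actual requirements.

```python
import re
import shutil
import subprocess
from typing import List, Optional, Tuple


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Extract the first dotted version number (e.g. '2.0.6') from tool output."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def check_dependency(tool: str, version_args: List[str],
                     min_version: Tuple[int, ...]) -> bool:
    """Return True if `tool` is on PATH and reports at least `min_version`."""
    if shutil.which(tool) is None:
        return False
    # Some tools print their version to stderr, so both streams are inspected.
    proc = subprocess.run([tool, *version_args], capture_output=True, text=True)
    version = parse_version(proc.stdout + proc.stderr)
    return version is not None and version >= min_version
```

Comparing versions as integer tuples avoids the pitfalls of string comparison, e.g. `(2, 0, 10) > (2, 0, 9)`.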
Bakta needs an IS element detection and annotation feature.
A promising candidate (detection, annotation, recently updated, BioConda) is ISEScan.
Annotate all clusters by KEGG's KoFams:
https://www.genome.jp/ftp/db/kofam/
In assembly GCF_002007885.1 a riboswitch is detected and duplicated in the final annotation.
Position: [126426,126511] on sequence NZ_CP019870.1
Products:
c-di-GMP-II-GAG riboswitch, score 48.9
Cyclic di-GMP-II riboswitch, score 74.3
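One way such duplicates could be collapsed is to keep only the best-scoring hit per locus. This is a sketch under the assumption that hits are plain dictionaries with the keys shown; it is not how Bakta necessarily resolves the issue.

```python
from typing import Dict, List


def deduplicate_hits(hits: List[Dict]) -> List[Dict]:
    """Keep only the best-scoring hit per (sequence, start, stop, strand) locus."""
    best = {}
    for hit in hits:
        key = (hit["sequence"], hit["start"], hit["stop"], hit.get("strand"))
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    return list(best.values())


# The two hits reported above.
hits = [
    {"sequence": "NZ_CP019870.1", "start": 126426, "stop": 126511,
     "product": "c-di-GMP-II-GAG riboswitch", "score": 48.9},
    {"sequence": "NZ_CP019870.1", "start": 126426, "stop": 126511,
     "product": "Cyclic di-GMP-II riboswitch", "score": 74.3},
]
unique_hits = deduplicate_hits(hits)
```

Hits that overlap only partially would need an interval-based comparison instead of exact coordinate matching.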
Hi,
I'm trying to test Bakta on our newly sequenced bacterial genome (1 chromosome, 6 other genomic compartments in one multi-FASTA file), hoping to compare it against our existing Prokka+custom database pipeline. However, the process exits with an error message once it gets to the 'conduct expert systems...' stage under 'predict & annotate CDSs...'. I can replicate this issue on both our lab's server and my personal laptop (both running Ubuntu, 20.04 and 21.04).
In both instances, Bakta was installed with conda/bioconda using the command below.
conda create -n bakta bakta
The annotation command was:
bakta --db ~/bakta_data/db/ --output test --genus genusname --species speciesname --threads 20 --replicon replicon.tsv our_assembly.fasta
The process exits with the error message below:
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/bakta/bin/bakta", line 10, in <module>
    sys.exit(main())
  File "/home/user/miniconda3/envs/bakta/lib/python3.9/site-packages/bakta/main.py", line 276, in main
    expert_amr_found = exp_amr.search(genome['features'][bc.FEATURE_CDS], cds_fasta_path)
  File "/home/user/miniconda3/envs/bakta/lib/python3.9/site-packages/bakta/expert/amrfinder.py", line 33, in search
    raise Exception(f'amrfinder error! error code: {proc.returncode}')
Exception: amrfinder error! error code: 1
Please let me know if you need more info/tests on my end for troubleshooting.
Thank you!
Analyze hypotheticals via:
hmmsearch vs Pfam-A
molecular weight
isoelectric point
tmhmm?
Line 36 in cdcb0ba
The variable in this line should be argcount
Describe the bug
When building the docker image with buildah an error message occurs that states that the "SHELL" command is not OCI compliant:
ERRO[0000] SHELL is not supported for OCI image format, [bash -l -c] will be ignored. Must use docker format
Should we change this to be OCI compliant?
Use DeepSig (GPL3) to predict signal peptides and cleavage sites.
Add skip options for each feature type:
--skip-trna
--skip-tmrna
--skip-rrna
--skip-ncrna
--skip-cds
--skip-sorf
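The flags listed above could be registered in a loop with `argparse`. This is only a sketch; the program name `bakta-sketch` and the help texts are placeholders, not Bakta's actual CLI definition.

```python
import argparse

# Feature types taken from the list of proposed flags.
FEATURE_TYPES = ["trna", "tmrna", "rrna", "ncrna", "cds", "sorf"]


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with one boolean --skip-<feature> flag per feature type."""
    parser = argparse.ArgumentParser(prog="bakta-sketch")
    for feature in FEATURE_TYPES:
        parser.add_argument(
            f"--skip-{feature}",
            action="store_true",
            help=f"skip {feature} detection and annotation",
        )
    return parser


args = build_parser().parse_args(["--skip-trna", "--skip-sorf"])
```

`argparse` converts the dashes to underscores, so the values are available as `args.skip_trna`, `args.skip_cds` and so on.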
First of all, thanks for this great tool.
I was wondering if it would be feasible to add the FASTA sequence at the end of the GFF3 file (similar to Prokka's GFF output). It would be useful for downstream analysis (e.g. Roary pangenome pipelines). Thanks for your effort.
Paolo
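For illustration, GFF3 allows sequences to be embedded after a `##FASTA` directive. A minimal writer might look like this; the function name `write_gff3_with_fasta` and the example feature line are made up.

```python
import io
from typing import Dict, Iterable, TextIO


def write_gff3_with_fasta(handle: TextIO, feature_lines: Iterable[str],
                          sequences: Dict[str, str], width: int = 60) -> None:
    """Write GFF3 feature lines followed by a ##FASTA section."""
    handle.write("##gff-version 3\n")
    for line in feature_lines:
        handle.write(line.rstrip("\n") + "\n")
    handle.write("##FASTA\n")
    for seq_id, seq in sequences.items():
        handle.write(f">{seq_id}\n")
        # Wrap sequence lines at a fixed width.
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


buffer = io.StringIO()
write_gff3_with_fasta(
    buffer,
    ["contig_1\tProdigal\tCDS\t3\t11\t.\t+\t0\tID=cds_1"],
    {"contig_1": "AAATGCCCTAAGG"},
)
gff3_text = buffer.getvalue()
```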
Dear all,
Thank you for this amazing annotation tool for bacterial genomes! I would like to know if you could add an orthology search option for cases where we have GenBank or protein FASTA file(s) from which genes should be annotated with first priority.
Such an option is provided by dfast_core as "--references" (https://github.com/nigyta/dfast_core) and by Prokka as "--proteins".
Thanks and regards,
Tonny_Z
A configuration file for all replicons within an assembly / genome should be parsed in order to use this information in output files, e.g. GenBank.
Format idea:
original locus id | new locus id | type | topology | name |
---|---|---|---|---|
old id | [new id / <empty>] | [chromosome / plasmid / contig / <empty>] | [circular / linear / <empty>] | name |
Available shortcuts:
chromosome: c
plasmid: p
circular: c
linear: l
<empty> values (- / '') will be replaced by defaults. If new locus id is empty, a new contig name will be autogenerated.
Defaults:
replicon type: contig
topology: linear
Example:
original locus id | new locus id | type | topology | name |
---|---|---|---|---|
NODE_1 | chrom | chromosome | circular | - |
NODE_2 | p1 | plasmid | c | pXYZ1 |
NODE_3 | p2 | p | c | pXYZ2 |
NODE_4 | special-contig-name-xyz | - | - | |
NODE_5 | `` | - | - | |
Usage:
bakta --replicons <file.tsv>
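A sketch of how such a table could be parsed, assuming a tab-separated file without a header line. The function name `parse_replicons` and the returned dictionary layout are hypothetical; the shortcuts and defaults follow the format idea above.

```python
import csv
import io

TYPE_SHORTCUTS = {"c": "chromosome", "p": "plasmid"}
TOPOLOGY_SHORTCUTS = {"c": "circular", "l": "linear"}
EMPTY = {"", "-", "''"}


def parse_replicons(handle) -> dict:
    """Parse a tab-separated replicon table into {original id: settings}."""
    replicons = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        # Pad short rows so that trailing empty columns may be omitted.
        old_id, new_id, rtype, topology, name = (
            cell.strip() for cell in (row + [""] * 5)[:5]
        )
        replicons[old_id] = {
            # None means: autogenerate a new contig name later.
            "new_id": None if new_id in EMPTY else new_id,
            "type": "contig" if rtype in EMPTY else TYPE_SHORTCUTS.get(rtype, rtype),
            "topology": "linear" if topology in EMPTY
            else TOPOLOGY_SHORTCUTS.get(topology, topology),
            "name": None if name in EMPTY else name,
        }
    return replicons


table = "NODE_1\tchrom\tchromosome\tcircular\t-\nNODE_3\tp2\tp\tc\tpXYZ2\nNODE_5\t\t-\t-\n"
replicons = parse_replicons(io.StringIO(table))
```

Note that `c` is resolved per column, so it means chromosome in the type column and circular in the topology column.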
Add a CRISPR detection with either:
analyze hypotheticals...
Traceback (most recent call last):
  File "<xyz>/bin/bakta", line 10, in <module>
    sys.exit(main())
  File "<xyz>/bakta/main.py", line 291, in main
    pfam_hits = feat_cds.predict_pfam(hypotheticals)
  File "<xyz>/bakta/features/cds.py", line 264, in predict_pfam
    raise Exception(f'hmmsearch error! error code: {proc.returncode}')
Exception: hmmsearch error! error code: 1
Thanks to Matthew Croxen for pointing this out.
Add a comprehensive logging of all initialization steps and applied annotations.
This logging file will be publicly provided in order to check all annotations in the bakta db.
For fragmented draft assemblies as well as short plasmids, the insufficient amount of genetic information might result in far from optimal training data for subsequent CDS prediction.
For these cases, it might be beneficial to provide pre-calculated Prodigal training files for bacterial taxa (genus or species level) as well as plasmids, maybe clustered to Inc groups.
Introduce an abstract system for protein sequence based expert systems providing:
Implement NCBI BlastRules as a first protein sequence based expert annotation system
Add some general summarizing genome stats at the end of the annotation process
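Which statistics to report is not specified in the request. As a sketch, genome size, contig count, N50 and GC content could be computed like this; the function name `genome_stats` and the choice of metrics are assumptions.

```python
from typing import Dict, List


def genome_stats(contigs: List[str]) -> Dict[str, float]:
    """Compute basic assembly statistics from a list of contig sequences."""
    lengths = sorted((len(c) for c in contigs), reverse=True)
    total = sum(lengths)
    # N50: length of the contig at which the running sum of the
    # length-sorted contigs reaches half of the genome size.
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    gc = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    return {
        "size": total,
        "contigs": len(lengths),
        "n50": n50,
        "gc": gc / total if total else 0.0,
    }


stats = genome_stats(["GGCCAATT", "GCGC", "AT"])
```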
As tRNAscan-SE 2.0 only detects tRNAs we need to predict tmRNAs separately via e.g. aragorn
Hi,
I'm in the process of submitting genomes annotated with Bakta to the NCBI. While checking the annotation for quality and errors (via table2asn and the gff3 file, https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/#run), I encountered several issues with it.
I understand if it's not something this tool will be compatible with as it can be quite tricky but anyhow, here is a list of some of the issues that I had that may be addressed in the gff file:
contig_1 Prodigal CDS 3 179 . - 0 ID=DOCECA_00005;locus_tag=DOCECA_00005;product=hypothetical protein
contig_1 Bakta gene 3 179 . - 0 ID=DOCECA_00005;locus_tag=DOCECA_00005
Commas that are intended to be part of a name should be encoded (%2C) according to the GFF3 specifications. However, literal commas should only be included when they are part of enzymatic names. Semi-colons generally should not be included in product names.
SO: in dbxref as they are not yet recognized (https://www.ncbi.nlm.nih.gov/genbank/collab/db_xref/)
Best
Greg
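Regarding the comma encoding mentioned in the report above, a small sketch of percent-encoding reserved characters in a GFF3 attribute value. The function name `encode_gff3_value` is hypothetical and the product name is only an illustrative input.

```python
# Characters with a reserved meaning in GFF3 column 9, plus '%' itself.
_GFF3_RESERVED = {
    "%": "%25",
    ";": "%3B",
    "=": "%3D",
    "&": "%26",
    ",": "%2C",
    "\t": "%09",
    "\n": "%0A",
}


def encode_gff3_value(value: str) -> str:
    """Percent-encode reserved characters in a single GFF3 attribute value."""
    # Encoding character by character avoids double-encoding the '%' sign.
    return "".join(_GFF3_RESERVED.get(char, char) for char in value)


encoded = encode_gff3_value("1,4-alpha-glucan branching enzyme")
```

The encoding should be applied to each value separately, before the values are joined with `;` and `,`, so that the separators themselves stay intact.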
Add detection/prediction of Rho (in)dependent translation terminators
hmmsearch --tblout out.tsv --noali --cpu 4 --cut_ga AntiFam_Bacteria.hmm aa.faa
Some annotations don't have an ID in the 9th field
Majority of the annotations do have ID, but a few don't. I can't use the annotation readily for RNAseq counting using HTSeq. Is this expected?
Here are all of them in one of my annotations without ID (last one is a sample annotation with an ID):
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|
HYI41_1 | Blast+ | oriC | 247475 | 247846 | . | ? | . | Name=oriC;product=oriC |
HYI41_1 | Infernal | regulatory_region | 3904850 | 3905064 | 3.4e-39 | - | . | Name=rncO;product=rncO;Dbxref=RFAM:RF00552,SO:0005836 |
HYI41_2 | Blast+ | oriV | 3931 | 4898 | . | ? | . | Name=oriV;product=oriV |
HYI41_2 | Blast+ | oriT | 4781 | 4868 | . | ? | . | Name=oriT;product=oriT |
HYI41_1 | Prodigal | CDS | 2509 | 3582 | . | + | 0 | ID=EOAEPG_00020;Name=DNA replication and repair protein RecF;locus_tag=EOAEPG_00020;product=DNA replication and repair protein RecF;Dbxref=COG:COG1195,COG:L,GO:0003697,GO:0005524,GO:0005737,GO:0006260,GO:0006281,GO:0009432,RefSeq:WP_000060112.1,SO:0001217,UniParc:UPI000016552D,UniRef100:UniRef100_A7ZTQ6,UniRef50:UniRef50_Q8Z9U9,UniRef90:UniRef90_Q8Z2N4;gene=recF |
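To list the affected records in a given file, a quick check along these lines could be used. It is a sketch that assumes standard tab-separated GFF3 lines; the function name `records_without_id` is made up.

```python
from typing import List


def records_without_id(gff3_lines: List[str]) -> List[str]:
    """Return feature lines whose attribute column lacks an ID tag."""
    missing = []
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # sequence section: no more feature lines
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue
        tags = {attr.split("=", 1)[0] for attr in columns[8].split(";") if attr}
        if "ID" not in tags:
            missing.append(line)
    return missing


lines = [
    "##gff-version 3",
    "HYI41_1\tBlast+\toriC\t247475\t247846\t.\t?\t.\tName=oriC;product=oriC",
    "HYI41_1\tProdigal\tCDS\t2509\t3582\t.\t+\t0\tID=EOAEPG_00020;gene=recF",
]
without_id = records_without_id(lines)
```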
Add integration tests. Therefore, a tiny mock db as well as some small test genomes are needed.
Detect assembly gaps (>5 Ns) and add annotations (assembly_gap) accordingly.
https://github.com/nigyta/dfast_core/blob/master/dfc/tools/gap.py
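A regular-expression based sketch of the gap detection, independent of the DFAST implementation linked above. It reads ">5 Ns" as runs of at least six N characters; the function name `find_assembly_gaps` and the `min_length` parameter are assumptions.

```python
import re
from typing import List, Tuple


def find_assembly_gaps(seq: str, min_length: int = 6) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, stop) coordinates of runs of N."""
    pattern = re.compile(rf"[Nn]{{{min_length},}}")
    return [(m.start() + 1, m.end()) for m in pattern.finditer(seq)]


# An 8 bp gap is reported, a 3 bp run of Ns is not.
sequence = "ACGT" + "N" * 8 + "ACGT" + "N" * 3 + "ACGT"
gaps = find_assembly_gaps(sequence)
```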
Pre-annotation of UniRef90 clusters could be improved using a dedicated phage database like:
Does this tool also work for archaeal genomes or metagenome-assembled genomes? Thanks!
A refinement of the rRNA detection is necessary as currently too many truncated and false positive sequences are predicted.
Inspired by Barrnap and certain issues (tseemann/prokka#423, tseemann/barrnap#39), the cmsearch based approach (cmsearch --noali --cut_tc --notrunc) therefore gets replaced by a cmscan approach in glocal mode (cmscan --noali --cut_tc -g --nohmmonly --rfam)
Many features overlap with each other. A hierarchical bacteria specific overlap detection filter needs to be implemented.
Current hierarchical overlap preference (descending):
Feature to remove | Feature overlap type |
---|---|
tmRNA | - |
tRNA | tmRNA |
rRNA | - |
ncRNA | - |
ncRNA region | - |
CRISPR | - |
CDS | tmRNA, tRNA, rRNA, CRISPR |
sORF | tmRNA, tRNA, rRNA, CRISPR, inframe CDS, shorter inframe sORF |
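The table could be turned into a simple rule-based filter. The sketch below makes several assumptions: features are dictionaries with `type`, `sequence`, `start` and `stop` keys, features are processed in the listed order and compared only against features already kept, and the in-frame CDS / shorter in-frame sORF rules are left out. All names are hypothetical.

```python
from typing import Dict, List

# Processing order, following the hierarchy in the table.
PRIORITY = ["tmRNA", "tRNA", "rRNA", "ncRNA", "ncRNA region", "CRISPR", "CDS", "sORF"]

# Feature type -> overlapping feature types that cause its removal.
OVERLAP_RULES = {
    "tRNA": {"tmRNA"},
    "CDS": {"tmRNA", "tRNA", "rRNA", "CRISPR"},
    "sORF": {"tmRNA", "tRNA", "rRNA", "CRISPR"},
}


def overlaps(a: Dict, b: Dict) -> bool:
    """True if two features on the same sequence share at least one position."""
    return (a["sequence"] == b["sequence"]
            and a["start"] <= b["stop"] and b["start"] <= a["stop"])


def filter_overlaps(features: List[Dict]) -> List[Dict]:
    """Drop features that overlap an already kept feature of a blocking type."""
    kept = []
    for feat in sorted(features, key=lambda f: PRIORITY.index(f["type"])):
        blockers = OVERLAP_RULES.get(feat["type"], set())
        if not any(k["type"] in blockers and overlaps(feat, k) for k in kept):
            kept.append(feat)
    return kept


features = [
    {"type": "CDS", "sequence": "c1", "start": 150, "stop": 500},
    {"type": "tRNA", "sequence": "c1", "start": 100, "stop": 180},
    {"type": "CDS", "sequence": "c1", "start": 600, "stop": 900},
    {"type": "rRNA", "sequence": "c1", "start": 1000, "stop": 2500},
    {"type": "sORF", "sequence": "c1", "start": 1100, "stop": 1190},
]
filtered = filter_overlaps(features)
```

In this example the first CDS is dropped because it overlaps the tRNA, and the sORF is dropped because it lies within the rRNA.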
Add a --debug option in order to keep tmp files and activate additional debugging logs.
Implement a common ground for expert annotation systems, such as AMRFinderPlus, thus allowing the incorporation of certain expert knowledge.
Additional "expert systems" might be:
With Ubuntu 20.04, the version of tRNAscan-SE in the repo is 2.0.5-1.
I got this error when calling bakta:
ERROR: Wrong tRNAscan-SE version installed. Please, either install tRNAscan-SE version v2.0.6 or use ['--skip-trna']!
With Ubuntu 21.04, the version of tRNAscan-SE in the repo is 2.0.7+ds-1, and I don't encounter this error in bakta.
AMRFinderPlus with a custom $TMPDIR location pointing to Bakta's internal tmp dir.
AMRFinderPlus's db to a subdir within the Bakta database directory via amrfinder_update --database <db>/amrfinderplus/ --force_update and amrfinder --protein <input> --database <db>/amrfinderplus/latest as suggested in https://github.com/oschwengers/bakta/discussions/64
For common genomes, use Prodigal training files pre-trained on high-quality reference genomes.
These could be selected by a quick Mash lookup via a RefSeq db of complete reference and representative genomes.