oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids

License: GNU General Public License v3.0

Python 93.32% Shell 4.03% Nextflow 0.80% Dockerfile 0.39% Common Workflow Language 1.38% JavaScript 0.08%
bioinformatics microbial-genomics genome-annotation bacteria bacterial-genomes plasmids metagenome-assembled-genomes annotation mag

bakta's Issues

Improve plasmid annotation

Improve internal DB downloads

Improve the internal DB download workflow:

  1. Improve the progress feedback during the DB download process by using alive-progress.
  2. Improve download speed by increasing the download chunk size.
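
A minimal sketch of the second point, assuming the download boils down to copying a binary stream into a file. `copy_in_chunks` and `on_progress` are hypothetical names for illustration, not Bakta functions; the callback marks the place where a progress bar (for example one from alive-progress) could be advanced.

```python
import io


def copy_in_chunks(src, dst, chunk_size=1024 * 1024, on_progress=None):
    """Copy a binary stream to another in fixed-size chunks.

    Returns the total number of bytes copied.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        dst.write(chunk)
        total += len(chunk)
        if on_progress is not None:
            on_progress(len(chunk))  # e.g. advance a progress bar here
    return total


# In-memory streams stand in for an HTTP response and an output file.
source = io.BytesIO(b"x" * 2500)
target = io.BytesIO()
steps = []
copied = copy_in_chunks(source, target, chunk_size=1000, on_progress=steps.append)
```

A larger `chunk_size` means fewer read/write calls per file; which value is actually fastest depends on the network and storage and would need to be measured.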

Add chromosome / plasmid rotation

For complete genomes, rotate chromosomes and plasmids to dnaA and RepA/RepB genes, respectively.

Ideas:

If this is conducted before any annotation, Prodigal could be executed with the -c option to disallow the prediction of genes running off edges.
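
The rotation itself can be sketched as follows, assuming the start position of the anchor gene is already known and lies on the forward strand. `rotate_circular` and `shift_position` are hypothetical helper names; handling a reverse-strand anchor would additionally require reverse-complementing the sequence, which is left out here.

```python
def rotate_circular(sequence, new_start):
    """Rotate a circular sequence so that 0-based index new_start comes first."""
    if not sequence:
        return sequence
    new_start %= len(sequence)
    return sequence[new_start:] + sequence[:new_start]


def shift_position(position, new_start, length):
    """Map a 0-based coordinate of the original sequence onto the rotated one."""
    return (position - new_start) % length


# Toy example: pretend the anchor gene starts at index 6.
rotated = rotate_circular("AAACCCGGGTTT", 6)
```

Any feature coordinates computed before the rotation would have to be remapped with something like `shift_position`, which is one reason to rotate before annotation.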

CDS prediction but not annotation option

Dear all,
Thank you for developing this tool, it looks very comprehensive and the documentation is really thorough.
I went through your documentation and see that you already have an option --skip-cds to skip both CDS prediction and annotation. However, I was wondering whether it would be possible to add an option that keeps the CDS prediction but skips the subsequent functional annotation of the CDSs.
This is simply because I'd like to use an alternative protein function annotation tool but would still need to have the CDS.
In this regard, if you do add this option, which DBs would not be necessary?

Thanks and regards,
Pedro

Add SwissProt annotations

Add SwissProt annotations:

  1. parse uniprot_sprot.xml.gz
  2. calc aa seq hash
  3. lookup UPS entries
  4. a) either update UPS gene/product if no UniRef90 id available in UPS
    b) otherwise update PSC via UPS-UniRef90

Available annotation:

  •  <recommendedName>
       <fullName evidence="1">5'-deoxynucleotidase YfbR</fullName>
       <ecNumber evidence="1">3.1.3.89</ecNumber>
     </recommendedName>

  •  <gene>
       <name evidence="1" type="primary">yfbR</name>
       <name type="ordered locus">SeAg_B2472</name>
     </gene>

  •  <dbReference type="GO" id="GO:0005737">
       <property type="term" value="C:cytoplasm"/>
       <property type="evidence" value="ECO:0000501"/>
       <property type="project" value="UniProtKB-SubCell"/>
     </dbReference>
     <dbReference type="GO" id="GO:0002953">
       <property type="term" value="F:5'-deoxynucleotidase activity"/>
       <property type="evidence" value="ECO:0000501"/>
       <property type="project" value="InterPro"/>
     </dbReference>
     <dbReference type="GO" id="GO:0046872">
       <property type="term" value="F:metal ion binding"/>
       <property type="evidence" value="ECO:0000501"/>
       <property type="project" value="UniProtKB-KW"/>
     </dbReference>
     <dbReference type="GO" id="GO:0000166">
       <property type="term" value="F:nucleotide binding"/>
       <property type="evidence" value="ECO:0000501"/>
       <property type="project" value="UniProtKB-KW"/>
     </dbReference>
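
Steps 1 and 2 can be sketched with the standard library as below. This is an illustration under several assumptions: the entry is a trimmed, namespace-free fragment (the real uniprot_sprot.xml declares an XML namespace that would have to be included in the paths), the sequence is a made-up placeholder rather than the real YfbR sequence, and MD5 is used as the hash only because the issue does not name a hash function. `parse_entry` is a hypothetical helper.

```python
import hashlib
import xml.etree.ElementTree as ET

ENTRY_XML = """
<entry>
  <protein>
    <recommendedName>
      <fullName evidence="1">5'-deoxynucleotidase YfbR</fullName>
      <ecNumber evidence="1">3.1.3.89</ecNumber>
    </recommendedName>
  </protein>
  <gene>
    <name evidence="1" type="primary">yfbR</name>
    <name type="ordered locus">SeAg_B2472</name>
  </gene>
  <dbReference type="GO" id="GO:0005737">
    <property type="term" value="C:cytoplasm"/>
  </dbReference>
  <dbReference type="GO" id="GO:0002953"/>
  <sequence>MKQSHFFAHL</sequence>
</entry>
"""


def parse_entry(xml_text):
    """Extract product, EC number, gene name, GO ids and a sequence hash."""
    entry = ET.fromstring(xml_text)
    rec = entry.find("./protein/recommendedName")
    # Sequences in the XML may be wrapped; drop all whitespace before hashing.
    seq = "".join(entry.findtext("sequence", default="").split())
    return {
        "product": rec.findtext("fullName") if rec is not None else None,
        "ec": rec.findtext("ecNumber") if rec is not None else None,
        "gene": entry.findtext("./gene/name[@type='primary']"),
        "go": [ref.get("id") for ref in entry.findall("./dbReference[@type='GO']")],
        # Assumption: MD5 hex digest serves as the lookup key.
        "seq_hash": hashlib.md5(seq.encode("ascii")).hexdigest(),
    }


info = parse_entry(ENTRY_XML)
```

For the full compressed file, a streaming parser such as `ET.iterparse` over a `gzip.open` handle would avoid loading everything into memory.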
    

Add dependency checks

Add check if required dependencies are available at runtime.

Also check the dependencies' software versions.
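
A minimal sketch of both checks, assuming each tool prints a version string that contains a dotted number. The helper names are hypothetical, and treating the requirement as a minimum version (rather than an exact match) is an assumption.

```python
import re
import shutil


def is_available(tool):
    """Return True if an executable named tool is found on PATH."""
    return shutil.which(tool) is not None


def parse_version(text):
    """Extract the first 'major.minor[.patch]' number in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def meets_minimum(version_text, minimum):
    """Compare a reported version string against a minimum version tuple."""
    found = parse_version(version_text)
    return found is not None and found >= minimum
```

In practice `version_text` would come from running the tool with its version flag via `subprocess.run`; the exact flag and output format differ per tool and are not covered here.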

Duplicated riboswitch in GCF_002007885.1

In assembly GCF_002007885.1 a riboswitch is detected and duplicated in the final annotation.

Position: [126426,126511] on sequence NZ_CP019870.1

Products:

  • c-di-GMP-II-GAG riboswitch, score 48.9
  • Cyclic di-GMP-II riboswitch, score 74.3
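
One possible remedy, sketched under the assumption that duplicates share identical coordinates on the same sequence and that the higher-scoring hit should win. The dictionary keys and the function name are hypothetical, not Bakta's internal data model.

```python
def deduplicate_by_position(features):
    """Keep only the best-scoring feature per (contig, start, stop)."""
    best = {}
    for feature in features:
        key = (feature["contig"], feature["start"], feature["stop"])
        if key not in best or feature["score"] > best[key]["score"]:
            best[key] = feature
    return list(best.values())


# The two hits reported above.
hits = [
    {"contig": "NZ_CP019870.1", "start": 126426, "stop": 126511,
     "product": "c-di-GMP-II-GAG riboswitch", "score": 48.9},
    {"contig": "NZ_CP019870.1", "start": 126426, "stop": 126511,
     "product": "Cyclic di-GMP-II riboswitch", "score": 74.3},
]
unique = deduplicate_by_position(hits)
```

Hits that overlap without sharing exact coordinates would need a proper interval comparison instead of a dictionary key.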

Bakta exits with amrfinder error code 1

Hi,

I'm trying to test Bakta on our newly sequenced bacterial genome (1 chromosome and 6 other genomic compartments in one multi-FASTA file), hoping to compare it against our existing Prokka + custom database pipeline. However, the process exits with an error message once it reaches the 'conduct expert systems...' stage under 'predict & annotate CDSs...'. I can reproduce this issue on both our lab's server and my personal laptop (running Ubuntu 20.04 and 21.04).

Bakta was installed with conda/bioconda in both instances using the command below.

conda create -n bakta bakta

The annotation process command was:

bakta --db ~/bakta_data/db/ --output test --genus genusname --species speciesname --threads 20 --replicon replicon.tsv our_assembly.fasta

The process exits with the error message below:

Traceback (most recent call last):
  File "/home/user/miniconda3/envs/bakta/bin/bakta", line 10, in <module>
    sys.exit(main())
  File "/home/user/miniconda3/envs/bakta/lib/python3.9/site-packages/bakta/main.py", line 276, in main
    expert_amr_found = exp_amr.search(genome['features'][bc.FEATURE_CDS], cds_fasta_path)
  File "/home/user/miniconda3/envs/bakta/lib/python3.9/site-packages/bakta/expert/amrfinder.py", line 33, in search
    raise Exception(f'amrfinder error! error code: {proc.returncode}')
Exception: amrfinder error! error code: 1

Please let me know if you need more info/tests on my end for troubleshooting.

Thank you!

Dockerfile not OCI compliant

Describe the bug
When building the docker image with buildah an error message occurs that states that the "SHELL" command is not OCI compliant:
ERRO[0000] SHELL is not supported for OCI image format, [bash -l -c] will be ignored. Must use docker format

Should we change this to be OCI compliant?

adding sequence at the end of gff3 (similar to prokka gff output)

First of all, thanks for this great tool.
I was wondering if it would be feasible to add the FASTA sequence at the end of the GFF3 file (similar to the Prokka GFF output). It would be useful for downstream analyses (e.g. Roary pangenome pipelines). Thanks for your effort.
Paolo
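
The GFF3 format allows sequences to be embedded after a `##FASTA` directive, which is the layout requested here. A minimal sketch, with a hypothetical function name and toy input:

```python
import io


def write_gff3_with_fasta(handle, gff_lines, sequences, width=60):
    """Write feature lines followed by a ##FASTA section with wrapped sequences."""
    handle.write("##gff-version 3\n")
    for line in gff_lines:
        handle.write(line.rstrip("\n") + "\n")
    handle.write("##FASTA\n")
    for seq_id, seq in sequences.items():
        handle.write(f">{seq_id}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


out = io.StringIO()
write_gff3_with_fasta(
    out,
    ["contig_1\tProdigal\tCDS\t3\t179\t.\t-\t0\tID=DOCECA_00005"],
    {"contig_1": "ACGT" * 20},  # toy sequence of 80 bases
)
gff_text = out.getvalue()
```

The sequence identifiers in the FASTA section have to match column 1 of the feature lines for downstream tools to link them.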

Add OrthoSearch of reference files

Dear all,
Thank you for this amazing annotation tool for bacterial genomes! I would like to know whether you could add an ortholog search option for cases where we have GenBank or protein FASTA file(s) from which genes should be annotated with first priority.
This option is available in dfast_core as "--references" (https://github.com/nigyta/dfast_core) and in Prokka as "--proteins".

Thanks and regards,
Tonny_Z

Design & implement a replicon information file

A configuration file for all replicons within an assembly / genome should be parsed in order to use this information in output files, e.g. GenBank.

Format idea:

original locus id  new locus id        type                                       topology                       name
old id             [new id / <empty>]  [chromosome / plasmid / contig / <empty>]  [circular / linear / <empty>]  name

Available short cuts:

  • chromosome: c
  • plasmid: p
  • circular: c
  • linear: l

<empty> values (- / '') will be replaced by defaults. If new locus id is empty, a new contig name will be autogenerated.

Defaults:
replicon type: contig
topology: linear

Example:

original locus id  new locus id             type        topology  name
NODE_1             chrom                    chromosome  circular  -
NODE_2             p1                       plasmid     c         pXYZ1
NODE_3             p2                       p           c         pXYZ2
NODE_4             special-contig-name-xyz  -           -
NODE_5             ``                       -           -

Usage:

bakta --replicons <file.tsv>
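
A minimal parser for the proposed format might look like this. It assumes a tab-separated file without a header row, treats `-`, empty cells and empty quote pairs as `<empty>`, and returns `None` for a missing new locus id so that a caller can autogenerate one. All names are hypothetical.

```python
import csv
import io

TYPE_SHORTCUTS = {"c": "chromosome", "p": "plasmid"}
TOPOLOGY_SHORTCUTS = {"c": "circular", "l": "linear"}
EMPTY = {"", "-", "``", "''"}


def parse_replicons(handle):
    """Map each original locus id to its replicon settings, applying defaults."""
    replicons = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        # Pad short rows so that trailing columns may be omitted.
        row = [cell.strip() for cell in row] + [""] * (5 - len(row))
        old_id, new_id, rtype, topology, name = row[:5]
        replicons[old_id] = {
            "new_id": None if new_id in EMPTY else new_id,
            "type": "contig" if rtype in EMPTY
                    else TYPE_SHORTCUTS.get(rtype, rtype),
            "topology": "linear" if topology in EMPTY
                        else TOPOLOGY_SHORTCUTS.get(topology, topology),
            "name": None if name in EMPTY else name,
        }
    return replicons


table = (
    "NODE_1\tchrom\tchromosome\tcircular\t-\n"
    "NODE_3\tp2\tp\tc\tpXYZ2\n"
    "NODE_4\tspecial-contig-name-xyz\t-\t-\n"
    "NODE_5\t``\t-\t-\n"
)
parsed = parse_replicons(io.StringIO(table))
```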

hmmsearch error on empty hypotheticals.faa sequence file

analyze hypotheticals...
Traceback (most recent call last):
  File "<xyz>/bin/bakta", line 10, in <module>
    sys.exit(main())
  File "<xyz>/bakta/main.py", line 291, in main
    pfam_hits = feat_cds.predict_pfam(hypotheticals)
  File "<xyz>/bakta/features/cds.py", line 264, in predict_pfam
    raise Exception(f'hmmsearch error! error code: {proc.returncode}')
Exception: hmmsearch error! error code: 1

Thanks to Matthew Croxen for pointing this out.
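
One plausible guard, assuming (as the title suggests) that the error is triggered by handing hmmsearch an empty sequence file. `predict_pfam_safe` and the injected `run_hmmsearch` callable are hypothetical and only illustrate the early return.

```python
def predict_pfam_safe(hypotheticals, run_hmmsearch):
    """Skip the external search entirely when there is nothing to search."""
    if not hypotheticals:
        return []
    return run_hmmsearch(hypotheticals)


# A stand-in for the real search, recording how often it is invoked.
calls = []


def fake_search(cdss):
    calls.append(len(cdss))
    return ["hit"]


empty_result = predict_pfam_safe([], fake_search)
normal_result = predict_pfam_safe(["cds_1"], fake_search)
```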

Provide pre-calculated Prodigal training files

For fragmented draft assemblies as well as short plasmids, the insufficient amount of genetic information might result in far from optimal training data for subsequent CDS prediction.
For these cases, it might be useful and beneficial to provide pre-calculated Prodigal training files for bacterial taxa (genus or species level) as well as plasmids maybe clustered to Inc groups.

Add NCBI BlastRules as first alignment-based expert system

Introduce an abstract system for protein sequence based expert systems providing:

  • reference sequences
  • minimal alignment thresholds: identity, query coverage, subject coverage
  • gene label
  • product description
  • DbXrefs: EC, etc ...
  • rank: a precedence rank to order & select potentially multiple annotations from several expert systems

Implement NCBI BlastRules as a first protein sequence based expert annotation system.
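
The listed properties could be modelled roughly as below. This is a sketch, not Bakta's implementation: the class, field and function names are hypothetical, reference sequences and the alignment step are left out, and "a higher rank wins" is an assumed convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertRule:
    system: str             # e.g. "BlastRules"
    gene: str
    product: str
    min_identity: float     # minimal alignment thresholds, as fractions
    min_query_cov: float
    min_subject_cov: float
    rank: int               # precedence between expert systems
    dbxrefs: tuple = ()

    def accepts(self, identity, query_cov, subject_cov):
        """Check an alignment against all three thresholds."""
        return (identity >= self.min_identity
                and query_cov >= self.min_query_cov
                and subject_cov >= self.min_subject_cov)


def select_annotation(candidates):
    """Pick a rule from (rule, identity, query_cov, subject_cov) tuples.

    Only hits passing their rule's thresholds are considered; ties on
    rank are broken by identity.
    """
    passing = [c for c in candidates if c[0].accepts(*c[1:])]
    if not passing:
        return None
    return max(passing, key=lambda c: (c[0].rank, c[1]))[0]


# Placeholder rules with invented gene names and thresholds.
rule_a = ExpertRule("SystemA", "geneA", "product A", 0.9, 0.9, 0.9, rank=90)
rule_b = ExpertRule("SystemB", "geneB", "product B", 0.5, 0.8, 0.8, rank=80)
chosen = select_annotation([(rule_a, 0.85, 1.0, 1.0), (rule_b, 0.85, 1.0, 1.0)])
```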

Compatible file for NCBI submission?

Hi,
I'm in the process of submitting genomes annotated with Bakta to NCBI. While checking the annotation for quality and errors (via table2asn and the GFF3 file, https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/#run), I encountered several issues.
I understand if this is not something the tool will be compatible with, as it can be quite tricky, but here is a list of some of the issues I had that could be addressed in the GFF file:

  • add a gene line
contig_1	Prodigal	CDS	3	179	.	-	0	ID=DOCECA_00005;locus_tag=DOCECA_00005;product=hypothetical protein
contig_1	Bakta	gene	3	179	.	-	0	ID=DOCECA_00005;locus_tag=DOCECA_00005
  • remove commas in the "product=" category with the exception of EC numbers

Commas that are intended to be part of a name should be encoded (%2C) according to the GFF3 specifications. However, literal commas should only be included when they are part of enzymatic names. Semi-colons generally should not be included in product names.

Best
Greg
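
The encoding rule quoted above can be sketched as a small escape function. It covers the characters the GFF3 specification reserves in column 9 (`;`, `=`, `&`, `,`) plus tab, newline and the percent sign itself; the function name and the product strings are made up for illustration, and the EC-number exception mentioned in the issue is not handled.

```python
# Percent-encodings for characters with a reserved meaning in GFF3 attributes.
RESERVED = {
    "%": "%25",
    ";": "%3B",
    "=": "%3D",
    "&": "%26",
    ",": "%2C",
    "\t": "%09",
    "\n": "%0A",
}


def encode_attribute_value(value):
    """Escape reserved characters one by one, so '%' is never encoded twice."""
    return "".join(RESERVED.get(char, char) for char in value)


encoded = encode_attribute_value("ATP-binding protein, putative; fragment")
```

Spaces are left untouched because they are allowed unescaped in the attributes column.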

Some annotations in the GFF do not have an ID assigned?

Some annotations don't have an ID in the 9th field.
The majority of the annotations do have an ID, but a few don't, so I can't readily use the annotation for RNA-seq counting with HTSeq. Is this expected?

Here are all of them in one of my annotations without ID (last one is a sample annotation with an ID):

1 2 3 4 5 6 7 8 9
HYI41_1 Blast+ oriC 247475 247846 . ? . Name=oriC;product=oriC
HYI41_1 Infernal regulatory_region 3904850 3905064 3.4e-39 - . Name=rncO;product=rncO;Dbxref=RFAM:RF00552,SO:0005836
HYI41_2 Blast+ oriV 3931 4898 . ? . Name=oriV;product=oriV
HYI41_2 Blast+ oriT 4781 4868 . ? . Name=oriT;product=oriT
HYI41_1 Prodigal CDS 2509 3582 . + 0 ID=EOAEPG_00020;Name=DNA replication and repair protein RecF;locus_tag=EOAEPG_00020;product=DNA replication and repair protein RecF;Dbxref=COG:COG1195,COG:L,GO:0003697,GO:0005524,GO:0005737,GO:0006260,GO:0006281,GO:0009432,RefSeq:WP_000060112.1,SO:0001217,UniParc:UPI000016552D,UniRef100:UniRef100_A7ZTQ6,UniRef50:UniRef50_Q8Z9U9,UniRef90:UniRef90_Q8Z2N4;gene=recF
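
To locate such features in a larger file, a small check along these lines could be used. It assumes a regular tab-separated GFF3 file; `features_without_id` is a hypothetical helper, and the CDS attributes in the example are shortened.

```python
def features_without_id(gff_lines):
    """Return (seqid, type, start, end) for feature lines lacking an ID attribute."""
    missing = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line, e.g. part of a FASTA section
        keys = {item.split("=", 1)[0] for item in cols[8].split(";") if item}
        if "ID" not in keys:
            missing.append((cols[0], cols[2], cols[3], cols[4]))
    return missing


example = [
    "##gff-version 3",
    "HYI41_1\tBlast+\toriC\t247475\t247846\t.\t?\t.\tName=oriC;product=oriC",
    "HYI41_1\tProdigal\tCDS\t2509\t3582\t.\t+\t0\tID=EOAEPG_00020;gene=recF",
]
without_id = features_without_id(example)
```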

Add Tests

Add integration tests. This requires a tiny mock DB as well as some small test genomes.

archaeal genomes

Does this tool also work for archaeal genomes or metagenome-assembled genomes? Thanks!

refine rRNA detection

A refinement of the rRNA detection is necessary as currently too many truncated and false positive sequences are predicted.

Inspired by Barrnap and certain issues (tseemann/prokka#423, tseemann/barrnap#39), the cmsearch based approach (cmsearch --noali --cut_tc --notrunc) therefore gets replaced by a cmscan approach in glocal mode (cmscan --noali --cut_tc -g --nohmmonly --rfam)

Implement a general feature overlap detection filter

Many features overlap with each other. A hierarchical bacteria specific overlap detection filter needs to be implemented.
Current hierarchical overlap preference (descending):

Feature to remove  Feature overlap type
tmRNA              -
tRNA               tmRNA
rRNA               -
ncRNA              -
ncRNA region       -
CRISPR             -
CDS                tmRNA, tRNA, rRNA, CRISPR
sORF               tmRNA, tRNA, rRNA, CRISPR, inframe CDS, shorter inframe sORF
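
The table above can be read as "remove the feature on the left if it overlaps any feature type on the right". A simplified sketch of such a filter follows; it assumes all features lie on the same sequence with 1-based inclusive coordinates, counts any shared base as an overlap, and leaves out the in-frame CDS and sORF rules. The names and the feature dictionaries are hypothetical.

```python
# For each feature type: the types it must give way to when overlapping.
YIELDS_TO = {
    "tmRNA": set(),
    "tRNA": {"tmRNA"},
    "rRNA": set(),
    "ncRNA": set(),
    "ncRNA-region": set(),
    "CRISPR": set(),
    "CDS": {"tmRNA", "tRNA", "rRNA", "CRISPR"},
    "sORF": {"tmRNA", "tRNA", "rRNA", "CRISPR"},
}


def overlaps(a, b):
    """True if two closed intervals share at least one position."""
    return a["start"] <= b["stop"] and b["start"] <= a["stop"]


def filter_overlaps(features):
    """Drop every feature that overlaps a feature of a dominating type."""
    kept = []
    for feat in features:
        dominant = YIELDS_TO.get(feat["type"], set())
        discard = any(
            other is not feat and other["type"] in dominant and overlaps(feat, other)
            for other in features
        )
        if not discard:
            kept.append(feat)
    return kept


features = [
    {"type": "tmRNA", "start": 100, "stop": 450},
    {"type": "tRNA", "start": 400, "stop": 475},    # overlaps the tmRNA
    {"type": "CDS", "start": 500, "stop": 900},     # overlaps the rRNA
    {"type": "rRNA", "start": 850, "stop": 2400},
    {"type": "CDS", "start": 3000, "stop": 4000},   # overlaps nothing
]
kept = filter_overlaps(features)
```

Note that each feature is compared against all input features, including ones that are themselves discarded; whether that is the desired behaviour is a design decision this sketch does not settle.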

Add --debug option

Add a --debug option in order to keep tmp files and activate additional debugging logs.
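
A minimal sketch of how such a flag could be wired up with argparse, assuming it only needs to switch the log level and suppress the cleanup of a temporary directory. The program name `annotate`, the function names and the scratch file are placeholders.

```python
import argparse
import logging
import shutil
import tempfile
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="annotate")
    parser.add_argument("--debug", action="store_true",
                        help="keep temporary files and log at DEBUG level")
    return parser.parse_args(argv)


def run(argv=None):
    """Run a dummy workflow and return the path of its temporary directory."""
    args = parse_args(argv)
    logging.basicConfig(level=logging.DEBUG if args.debug else logging.INFO)
    tmp_dir = Path(tempfile.mkdtemp())
    try:
        (tmp_dir / "intermediate.txt").write_text("scratch data\n")
        logging.debug("temporary directory: %s", tmp_dir)
    finally:
        if not args.debug:
            shutil.rmtree(tmp_dir)  # normal runs clean up after themselves
    return tmp_dir
```

With `run(["--debug"])` the directory is left in place for inspection; without the flag it is removed even if the workflow raises.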

Add AMRFinderPlus as a first expert system

Implement a common ground for expert annotation systems, for example AMRFinderPlus, thus allowing the incorporation of certain expert knowledge.

Additional "expert systems" might be:

  • NCBI BlastRules
  • HAMAP
  • UniProt annotation rules

Error version tRNAscan-SE with ubuntu 20.04

With Ubuntu 20.04, the version in the repo of tRNAscan-SE is 2.0.5-1

I got this error in the call of bakta

ERROR: Wrong tRNAscan-SE version installed. Please, either install tRNAscan-SE version v2.0.6 or use ['--skip-trna']!

With Ubuntu 21.04, the version of tRNAscan-SE in the repo is 2.0.7+ds-1, and I don't encounter this error in bakta.
