Comments (7)

oschwengers commented on July 3, 2024

Hi @michoug , thanks a lot for reporting this. So far we've tested the submission only for ENA. Of course, we're keen to make NCBI submissions as smooth as possible, too.

I'll encode the products as requested in the GFF3 specifications.

For the 1st and 3rd points, I think it might be best to add a --compliant option, in line with the Prokka option, to explicitly activate this behavior, since it might not be desired in other situations.

Is this a complete list of all issues you encountered?
Also, could you provide an example of the commands you used to generate the submission files? This could help other users go through this process. Maybe I'll add a section to the readme as well.

from bakta.

michoug commented on July 3, 2024

Hi,
Thanks for the super-fast response. The issues highlighted here are the main ones (e.g. FATAL). There are others that depend more on the names of the products (see the attached list for one genome):
Issues.txt

Here is the command that I used to generate submission files:

  • First, you need a template submission file
  • Then, the software table2asn
  • The command was for Linux:
    table2asn_GFF.Linux -M n -J -c w -t template.sbt -l paired-ends -j "[organism=Pseudomonas sp][strain=E102] [gcode=11]" -i E102_bakta/E102.fna -f E102_bakta/E102.gff3 -o E102_bakta/E102.sqn -Z

Here is the link to the documentation: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/#run
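
For anyone scripting this step, the invocation above can be assembled programmatically. The following is a minimal Python sketch that only rearranges the flags from the command shown in this comment. The function name `build_table2asn_cmd` and its parameters are hypothetical, and the meaning of each flag should be checked against the linked NCBI documentation.

```python
import shlex


def build_table2asn_cmd(prefix, template, organism, strain, gcode=11,
                        binary="table2asn_GFF.Linux"):
    """Return the argument list for the table2asn_GFF call quoted above.

    `prefix` is the path shared by the Bakta outputs, e.g. "E102_bakta/E102".
    All flags are copied verbatim from the command in this thread.
    """
    source = f"[organism={organism}][strain={strain}] [gcode={gcode}]"
    return [
        binary,
        "-M", "n", "-J", "-c", "w",
        "-t", template,
        "-l", "paired-ends",
        "-j", source,
        "-i", f"{prefix}.fna",
        "-f", f"{prefix}.gff3",
        "-o", f"{prefix}.sqn",
        "-Z",
    ]


if __name__ == "__main__":
    cmd = build_table2asn_cmd("E102_bakta/E102", "template.sbt",
                              "Pseudomonas sp", "E102")
    # Print a shell-quoted version; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list keeps the bracketed source qualifiers in one argument, so no manual shell quoting is needed when it is passed to `subprocess.run`.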

oschwengers commented on July 3, 2024

Thanks for the detailed information - that helps a lot.
I've already addressed the lacking gene and product encoding issues.

However, fixing the Dbxrefs and fatal product descriptions might take somewhat longer. I've put this on the list for the upcoming 1.1 version, which will hopefully be released in the coming weeks.

oschwengers commented on July 3, 2024

Interesting side effect: adhering to the GFF3 comma encoding convention (%2C) leads to FATAL: SUSPECT_PRODUCT_NAMES: 62 features contain '%'. Any idea how that could be bypassed? Or is this something that should maybe be reported upstream to be fixed in the table2asn_GFF tool?
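
To make the side effect concrete, here is a minimal Python sketch of the GFF3 attribute escaping in question. It assumes the reserved set from the GFF3 specification for column 9 (`;`, `=`, `&`, `,`, plus `%`, tab, newline and carriage return); the function names are hypothetical and this is not necessarily how Bakta implements it.

```python
import re
from urllib.parse import unquote

# Characters that must be percent-encoded inside a GFF3 attribute value.
_GFF3_RESERVED = re.compile(r"[%;=&,\t\n\r]")


def encode_gff3_value(value: str) -> str:
    """Percent-encode reserved characters, e.g. ',' becomes '%2C'."""
    return _GFF3_RESERVED.sub(lambda m: f"%{ord(m.group()):02X}", value)


def decode_gff3_value(value: str) -> str:
    """Reverse the escaping so the original product name is recovered."""
    return unquote(value)


if __name__ == "__main__":
    product = "Phosphotransferase system, HPr-related proteins"
    encoded = encode_gff3_value(product)
    print(encoded)  # the comma is now written as %2C
    assert decode_gff3_value(encoded) == product
```

Any product name containing a comma therefore carries a literal '%' in the GFF3 file, which is what the check above reacts to if the consuming tool does not decode the value first.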

oschwengers commented on July 3, 2024

Hi @michoug, I've added a couple of fixes and improvements for GFF3-based GenBank submissions via table2asn_GFF.
All of the points you've raised above should be addressed and all issues should be solved. If this is not the case, please do not hesitate to reach out and re-open this issue.

I'll release v1.1.0 containing these improvements soon - most certainly next week.

Please let me know if there are any further issues - I'm looking forward to your feedback.
Thanks again for reporting and
best regards!

michoug commented on July 3, 2024

Hi,
Congrats on all the fast work. I have a few other "issues" that may eventually be addressed, even though I'm well aware that this process is sometimes a bottomless pit and quite tricky to automate...

SUSPECT_PRODUCT_NAMES: 8 features May contain plural
E141.sqn:CDS	Urea carboxylase without Allophanate hydrolase 2 domains	lcl|contig_1:c493999-492260	GKKCFE_02155
E141.sqn:CDS	Phosphotransferase system, HPr-related proteins	lcl|contig_1:c658214-657810	GKKCFE_02965
E141.sqn:CDS	Hemolysins-related protein containing CBS domains	lcl|contig_1:c830356-829115	GKKCFE_03775
E141.sqn:CDS	Phage tail assembly chaperone proteins, E, or 41 or 14	lcl|contig_1:952781-953356	GKKCFE_04360
E141.sqn:CDS	Peptidoglycan/LPS O-acetylase OafA/YrhL, contains acyltransferase and SGNH-hydrolase domains	lcl|contig_1:c1007564-1006416	GKKCFE_04650
E141.sqn:CDS	Diguanylate cyclase with PAS/PAC and GAF sensors	lcl|contig_1:1171567-1172943	GKKCFE_05445


SUSPECT_PRODUCT_NAMES: 31 features contain 'unknown'
E141.sqn:CDS	Family of unknown function (DUF6124)	lcl|contig_1:c109230-108889	GKKCFE_00500
E141.sqn:CDS	Family of unknown function (DUF6124)	lcl|contig_1:254342-254698	GKKCFE_01120
E141.sqn:CDS	Family of unknown function (DUF6124)	lcl|contig_1:580095-580460	GKKCFE_02580

SUSPECT_PRODUCT_NAMES: 34 features contains three or more numbers together that may be identifiers more appropriate in note
E141.sqn:CDS	Uvs098	lcl|contig_1:252015-252467	GKKCFE_01095
E141.sqn:CDS	UPF0313 protein PSPTO_4928	lcl|contig_1:302226-304526	GKKCFE_01330
E141.sqn:CDS	L-pipecolate oxidase (1537)	lcl|contig_1:320431-321714	GKKCFE_01405
E141.sqn:CDS	HI0933-like protein	lcl|contig_1:c490707-489466	GKKCFE_02145
E141.sqn:CDS	Putative hydro-lyase B723_09185	lcl|contig_1:c496428-495622	GKKCFE_02165
E141.sqn:CDS	UPF0114 protein C7528_102400	lcl|contig_1:554275-554763	GKKCFE_02435
E141.sqn:CDS	UPF0225 protein CD58_06560	lcl|contig_1:c1018229-1017732	GKKCFE_04695
E141.sqn:CDS	UPF0276 protein SAMN03159293_01947	lcl|contig_1:c1039843-1038974	GKKCFE_04820


SUSPECT_PRODUCT_NAMES: 188 features contain underscore
E141.sqn:CDS	GBBH-like_N domain-containing protein	lcl|contig_1:c125879-125502	GKKCFE_00600
E141.sqn:CDS	FAD_binding_3 domain-containing protein	lcl|contig_1:c168453-167206	GKKCFE_00760
E141.sqn:CDS	ABC_trans_aux domain-containing protein	lcl|contig_1:261845-262549	GKKCFE_01150
E141.sqn:CDS	MotA_ExbB domain-containing protein	lcl|contig_1:272991-273842	GKKCFE_01195
E141.sqn:CDS	UPF0313 protein PSPTO_4928	lcl|contig_1:302226-304526	GKKCFE_01330
E141.sqn:CDS	Peripla_BP_6 domain-containing protein	lcl|contig_1:322080-323216	GKKCFE_01410
E141.sqn:CDS	Znf/thioredoxin_put domain-containing protein	lcl|contig_1:c389672-388437	GKKCFE_01700
E141.sqn:CDS	Cupin_3 domain-containing protein	lcl|contig_1:c469729-469385	GKKCFE_02035
E141.sqn:CDS	ZT_dimer domain-containing protein	lcl|contig_1:c476803-475910	GKKCFE_02080

SUSPECT_PRODUCT_NAMES: 1 feature contains '(TC'
E141.sqn:CDS	Sodium/proton antiporter, CPA1 family (TC 2A36)	lcl|contig_1:c3087019-3085778	GKKCFE_13940

SUSPECT_PRODUCT_NAMES: 1 feature contains 'FOG'
E141.sqn:CDS	FOG: TPR repeat, SEL1 subfamily	lcl|contig_1:c4136402-4136001	GKKCFE_18665

FATAL: SUSPECT_PRODUCT_NAMES: 1 feature contains '?'
E141.sqn:CDS	ABC transporter, substrate-binding protein (Cluster 15, trp?)	lcl|contig_1:4026495-4027427	GKKCFE_18180

FATAL: SUSPECT_PRODUCT_NAMES: 2 features contain '@'
E141.sqn:CDS	Deblocking aminopeptidase @ Cyanophycinase 2	lcl|contig_1:c1423448-1422258	GKKCFE_06635
E141.sqn:CDS	Maleylacetoacetate isomerase @ Glutathione S-transferase, zeta	lcl|contig_1:c4755920-4755285	GKKCFE_21485

SUSPECT_PRODUCT_NAMES: Use short product name instead of descriptive phrase
SUSPECT_PRODUCT_NAMES: 1 feature ends with 'activity'
E141.sqn:CDS	HD-like signal output (HDOD) domain, no enzymatic activity	lcl|contig_1:5955114-5956325	GKKCFE_27025

SUSPECT_PRODUCT_NAMES: 4 features Is longer than 100 characters. Remove descriptive phrases or synonyms from product names. Keep valid long product names, eg long enzyme names
E141.sqn:CDS	Multicopper oxidase with three cupredoxin domains (Includes cell division protein FtsP and spore coat protein CotA)	lcl|contig_1:819899-821275	GKKCFE_03735
E141.sqn:CDS	GTP pyrophosphokinase, (P)ppGpp synthetase I / Guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase	lcl|contig_1:c4197461-4195215	GKKCFE_18975
E141.sqn:CDS	Glyoxylate reductase / Glyoxylate reductase / Hydroxypyruvate reductase 2-ketoaldonate reductase, broad specificity	lcl|contig_1:4747621-4748592	GKKCFE_21440
E141.sqn:CDS	Glycine betaine/carnitine/choline ABC transporter, periplasmic glycine betaine/carnitine/choline-binding protein	lcl|contig_1:4859680-4860582	GKKCFE_21975

SUSPECT_PRODUCT_NAMES: 1 feature contains 'possibly'
E141.sqn:CDS	Membrane protein TerC, possibly involved in tellurium resistance	lcl|contig_1:c5854787-5854020	GKKCFE_26505

SUSPECT_PRODUCT_NAMES: 3 features contain 'gene'
E141.sqn:CDS	Yibq gene product, putative divergent polysaccharide deacetylase	lcl|contig_1:c43395-42619	GKKCFE_00250
E141.sqn:CDS	ABC transporter in pyoverdin gene cluster, ATP-binding component	lcl|contig_1:3868307-3869059	GKKCFE_17350
E141.sqn:CDS	YebG, DNA damage-inducible gene in SOS regulon, expressed in stationary phase	lcl|contig_1:4752788-4753048	GKKCFE_21470

BAD_GENE_NAME: 6 genes contain suspect phrase or characters
E141.sqn:Gene	5_ureB_sRNA	lcl|contig_1:346126-346411	GKKCFE_01530
E141.sqn:Gene	epd,gap,gapA	lcl|contig_1:c1070158-1069157	GKKCFE_04950
E141.sqn:Gene	Bacteria_small_SRP	lcl|contig_1:1650897-1650993	GKKCFE_07575
E141.sqn:Gene	RNaseP_bact_a	lcl|contig_1:c4698601-4698249	GKKCFE_21195
E141.sqn:Gene	epd,gap,gapA	lcl|contig_1:c5325075-5324020	GKKCFE_24085
E141.sqn:Gene	Pseudomon-1	lcl|contig_1:5829418-5829534	GKKCFE_26390
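
A report like the one above can be grouped by category with a short script. This is a minimal Python sketch that assumes the layout seen in this excerpt: a header line of the form `[FATAL: ]TEST_NAME: message`, followed by tab-separated detail lines with four fields (feature, name, location, locus tag). The function name `parse_report` is hypothetical, and real reports may contain other line shapes.

```python
import re

# Header lines look like "FATAL: SUSPECT_PRODUCT_NAMES: 2 features contain '@'".
HEADER = re.compile(r"^(?P<fatal>FATAL: )?(?P<test>[A-Z_]+): (?P<message>.+)$")


def parse_report(text):
    """Group detail lines under their (is_fatal, test, message) header."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) == 4 and current is not None:
            feature, name, location, locus_tag = fields
            groups[current].append({
                "feature": feature,
                "name": name,
                "location": location,
                "locus_tag": locus_tag,
            })
            continue
        match = HEADER.match(line)
        if match:
            current = (match.group("fatal") is not None,
                       match.group("test"),
                       match.group("message"))
            groups.setdefault(current, [])
    return groups


if __name__ == "__main__":
    sample = (
        "FATAL: SUSPECT_PRODUCT_NAMES: 1 feature contains '?'\n"
        "E141.sqn:CDS\tABC transporter, substrate-binding protein (Cluster 15, trp?)"
        "\tlcl|contig_1:4026495-4027427\tGKKCFE_18180\n"
    )
    for (fatal, test, message), rows in parse_report(sample).items():
        print("FATAL" if fatal else "WARN", test, message, len(rows))
```

Sorting the resulting groups with the FATAL ones first gives a quick view of what blocks a submission and what is only a warning.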

oschwengers commented on July 3, 2024

Hi,
I've tried to address as many SUSPECT_PRODUCT_NAMES as possible:

  • contains '?'
  • contains '@'
  • contains 'FOG'
  • contains underscore -> underscore in domain names
  • contains 'unknown' -> replace product by "DUF....-containing protein"
  • contains three or more numbers together -> replace product by "UPF....-containing protein"

These are the low-hanging fruit. All the other remaining issues are far more complex to fix, if they can be handled automatically at all.
I'll try to add some more "fix&replace" rules from time to time and I'm open to all sorts of ideas, suggestions and improvements from the community!
Thanks for all the reports! I'll release a patch version soon.
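
For readers curious what such "fix&replace" rules could look like, here is a minimal Python sketch. It is not Bakta's actual implementation: the function name `revise_product` and every concrete rule are assumptions for illustration. Only the DUF and UPF replacements follow the wording of the list above; the handling of '?', '@', 'FOG' and underscores is one guessed possibility.

```python
import re

RE_UPF = re.compile(r"\b(UPF\d+)\b")
RE_DUF = re.compile(r"\b(DUF\d+)\b")
RE_FOG = re.compile(r"^FOG:\s*")
RE_AT = re.compile(r"\s*@\s*")
RE_DOMAIN = re.compile(r"^(\S+)( domain-containing protein)$")


def revise_product(product: str) -> str:
    """Apply a few hypothetical clean-up rules to a product name."""
    upf = RE_UPF.search(product)
    if upf:
        # 'UPF0313 protein PSPTO_4928' -> 'UPF0313-containing protein'
        return f"{upf.group(1)}-containing protein"
    duf = RE_DUF.search(product)
    if duf and "unknown" in product.lower():
        # 'Family of unknown function (DUF6124)' -> 'DUF6124-containing protein'
        return f"{duf.group(1)}-containing protein"
    product = RE_FOG.sub("", product)    # assumption: drop the 'FOG: ' prefix
    product = product.replace("?", "")   # assumption: drop question marks
    product = RE_AT.sub(" / ", product)  # assumption: '@' becomes ' / '
    domain = RE_DOMAIN.match(product)
    if domain:
        # assumption: hyphens instead of underscores in the domain name
        product = domain.group(1).replace("_", "-") + domain.group(2)
    return product.strip()


if __name__ == "__main__":
    for name in ("Family of unknown function (DUF6124)",
                 "Cupin_3 domain-containing protein",
                 "FOG: TPR repeat, SEL1 subfamily"):
        print(name, "->", revise_product(name))
```

Rules that replace the whole name run first, so an identifier-like suffix such as PSPTO_4928 is dropped before the character-level clean-up is applied.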
