Motivation
The gff-toolbox convert module is capable of converting a GFF to a MongoDB database; however, it seems that we cannot manipulate the GFF information stored in the database without relying on raw MongoDB commands. A routine task in many analyses is the insertion of new information into a GFF, e.g., annotating a gene/transcript. This task could be done with many tools, such as gffutils, BCBio, or even a bash/other-language script, by including new attributes in column 9 of the raw GFF, using as input a file stating which set of annotations (e.g. GO, PFAM, EC number) correlates with each gene. However, the same annotation task can also be done in a different way: converting the GFF to MongoDB, inserting the annotations into the corresponding MongoDB collection and, if desired, converting back to GFF afterwards. Although this may seem more involved than annotating a raw GFF, and may consume more computational resources, it has some advantages:
- Possibility to include a description, link, or other metadata related to an annotation. The GFF format spec declares the column-9 fields Ontology_term and Dbxref to accommodate, respectively, annotations from GO/ontology servers and from other databases (e.g. PFAM, PANTHER, EC). Even so, GFF lacks a description field for each annotated term, something that can easily be done in MongoDB. Admittedly, descriptions can be squeezed into the GFF description field, but that leads to my next GFF issue: a noisy/polluted GFF;
- Clean visualization/reading of gene attributes in GFFs/genomes that carry an enormous quantity of annotations; and
- Going beyond GFFs. In some situations we would like to export the information contained in a GFF to a different format. For instance, the higlass visualization tool requires a RefSeq-style format to display genes. Generating a file with those specs from a stored MongoDB collection is easier than manipulating a raw GFF.
Proposed solution
The list of advantages and disadvantages of using MongoDB as an intermediate store for annotations is probably longer than I can think of, but I see this approach as a facilitator. Hence, I propose a new gff-toolbox module to perform this task, i.e. annotate a MongoDB collection created by gff-toolbox convert. In the following I will try to explain the main architecture of this module, which at first I have named ingest.
We would like the ingest module to receive a set of annotations and include them in the corresponding gene/transcript entry in MongoDB. Thus, assume that the MongoDB was created by the gff-toolbox convert module - parameters XXX; XXX; - and that we also have a tab-separated txt/tsv file with annotations such as the following:
##ID Id IdType description
gene-KPHS_00170 PTHR30520:SF0 PANTHER TRANSPORTER-RELATED
gene-KPHS_00170 GO:0006810 GO transport
gene-KPHS_00170 3.4.16.2 EC Lysosomal Pro-Xaa carboxypeptidase
gene-KPHS_00170 GO:0005215 GO transporter activity
gene-KPHS_02590 GO:0003735 GO structural constituent of ribosome
gene-KPHS_02590 PTHR36029 PANTHER
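For illustration, a minimal sketch (in Python; the helper name `parse_annotation_table` and the exact column handling are my assumptions, not existing gff-toolbox code) of how such a table could be parsed and grouped by feature ID:

```python
import csv
import io
from collections import defaultdict

def parse_annotation_table(handle):
    """Group annotation rows by feature ID.

    Expects tab-separated columns: feature ID, term ID, term type
    (the DBTAG) and an optional free-text description. Lines
    starting with '##' are treated as headers and skipped.
    """
    annotations = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("##"):
            continue
        feature_id, term_id, term_type = row[0], row[1], row[2]
        description = row[3] if len(row) > 3 else ""
        annotations[feature_id].append(
            {"DBTAG": term_type, "ID": term_id, "Description": description}
        )
    return dict(annotations)

# Example with two of the rows shown above
table = (
    "##ID\tId\tIdType\tdescription\n"
    "gene-KPHS_00170\tGO:0006810\tGO\ttransport\n"
    "gene-KPHS_00170\tPTHR30520:SF0\tPANTHER\tTRANSPORTER-RELATED\n"
)
parsed = parse_annotation_table(io.StringIO(table))
```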
Inspecting the MongoDB document that corresponds to gene-KPHS_00170, we can retrieve the JSON listing its information:
{'_id': ObjectId('612e788a94ee11baab643fb0'),
'recid': 'NC_016845.1',
'source': 'RefSeq',
'type': 'gene',
'start': '22533',
'end': '22802',
'score': '.',
'strand': '+',
'phase': '.',
'attributes': {'ID': 'gene-KPHS_00170',
'Dbxref': 'GeneID:11844995',
'Name': 'KPHS_00170',
'gbkey': 'Gene',
'gene_biotype': 'protein_coding',
'locus_tag': 'KPHS_00170'}}
The aim of the proposed gff-toolbox ingest module is to insert the annotations into the corresponding gene in MongoDB. After this procedure, we would like the MongoDB entry for gene-KPHS_00170 to be stored as:
{'_id': ObjectId('612e788a94ee11baab643fb0'),
'recid': 'NC_016845.1',
'source': 'RefSeq',
'type': 'gene',
'start': '22533',
'end': '22802',
'score': '.',
'strand': '+',
'phase': '.',
'attributes': {'ID': 'gene-KPHS_00170',
'Dbxref': [ 'GeneID:11844995' ,
{'DBTAG': 'PANTHER', 'ID': 'PTHR30520:SF0', 'Description': 'FORMATE TRANSPORTER-RELATED'},
{'DBTAG': 'PANTHER', 'ID': 'PTHR30520', 'Description': 'FORMATE TRANSPORTER-RELATED'},
{'DBTAG': 'PFAM', 'ID': 'PF01226', 'Description': 'Formate/nitrite transporter'}
],
'Ontology_term': [ {'DBTAG': 'GO', 'ID': 'GO:0006810', 'Description': 'transport'},
{'DBTAG': 'GO', 'ID': 'GO:0016020', 'Description': 'membrane'},
{'DBTAG': 'GO', 'ID': 'GO:0005215', 'Description': 'transporter activity'}
],
'Name': 'KPHS_00170',
'gbkey': 'Gene',
'gene_biotype': 'protein_coding',
'locus_tag': 'KPHS_00170'}}
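A minimal, database-free sketch of the merge step that could produce an entry like the one above (the function name and the routing of GO terms to Ontology_term versus everything else to Dbxref are my assumptions about how ingest might behave):

```python
def merge_annotations(attributes, annotations):
    """Append annotation objects to a feature's attributes dict.

    GO terms go under 'Ontology_term'; all other databases go
    under 'Dbxref'. A pre-existing scalar value (e.g. the raw
    'GeneID:11844995' string) is kept as the first list element.
    """
    for ann in annotations:
        key = "Ontology_term" if ann["DBTAG"] == "GO" else "Dbxref"
        current = attributes.get(key)
        if current is None:
            attributes[key] = [ann]
        elif isinstance(current, list):
            current.append(ann)
        else:  # scalar left over from the original GFF parse
            attributes[key] = [current, ann]
    return attributes

attrs = {"ID": "gene-KPHS_00170", "Dbxref": "GeneID:11844995"}
merge_annotations(attrs, [
    {"DBTAG": "PANTHER", "ID": "PTHR30520:SF0", "Description": "TRANSPORTER-RELATED"},
    {"DBTAG": "GO", "ID": "GO:0006810", "Description": "transport"},
])
```

Against a live database the same merge would presumably translate to a pymongo `update_one` with a `$push` on the document matching `attributes.ID`; the in-memory version above is only meant to show the target schema.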
According to the GFF3 spec, "two reserved attributes, Ontology_term and Dbxref, can be used to establish links between a GFF3 feature and a data record contained in another database" (i.e. annotations). Also, "the value of both Ontology_term and Dbxref is the ID of the cross referenced object in the form "DBTAG:ID". The DBTAG indicates which database the referenced object can be found in, and ID indicates the identifier of the object within that database". Therefore, in the MongoDB schema we include an object for each annotation declaring DBTAG, ID and optional fields such as Description. Unfortunately, this is not the JSON schema produced by the gff-toolbox convert module: the Dbxref entries generated when parsing a GFF into MongoDB do not separate the DBTAG and ID fields. We can fix this by simply adjusting the code to separate those fields before inserting the JSON into the MongoDB collection. I propose to fix this, but I need to know whether it could break any other gff-toolbox module. @fmalmeida, can it?
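The DBTAG/ID separation mentioned above could be as simple as splitting on the first colon (a sketch; note that identifiers that themselves contain colons, such as GO terms, may need special-casing to match the schema shown earlier):

```python
def split_dbxref(value):
    """Split a 'DBTAG:ID' cross-reference string into its parts.

    'GeneID:11844995' -> {'DBTAG': 'GeneID', 'ID': '11844995'}
    Values without a colon are returned with an empty DBTAG.
    """
    if ":" not in value:
        return {"DBTAG": "", "ID": value}
    dbtag, obj_id = value.split(":", 1)
    return {"DBTAG": dbtag, "ID": obj_id}
```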
Another suggestion, on which I would like your opinion: should we decouple the "ingestion" of annotations into MongoDB - the solution proposed in this issue - from the "digestion" of a MongoDB collection back into a GFF/other file format? Another gff-toolbox module, or even gff-toolbox convert itself, could be the answer here.
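As a rough sketch of what the "digestion" side could look like, here is a hypothetical function (name and flattening logic are my assumptions) that serializes one MongoDB feature document back into a GFF3 line, converting annotation objects back to DBTAG:ID form:

```python
def _to_dbtag_id(entry):
    """Render one annotation back as DBTAG:ID (strings pass through)."""
    if isinstance(entry, str):
        return entry
    # some IDs (e.g. GO:0006810) already carry their DBTAG prefix
    if entry["ID"].startswith(entry["DBTAG"] + ":"):
        return entry["ID"]
    return f'{entry["DBTAG"]}:{entry["ID"]}'

def doc_to_gff_line(doc):
    """Serialize one MongoDB feature document into a GFF3 line.

    Description fields are dropped, since GFF3 column 9 has no
    standard place for them.
    """
    parts = []
    for key, value in doc["attributes"].items():
        if isinstance(value, list):
            parts.append(f'{key}={",".join(_to_dbtag_id(v) for v in value)}')
        else:
            parts.append(f"{key}={value}")
    columns = [
        doc["recid"], doc["source"], doc["type"], doc["start"],
        doc["end"], doc["score"], doc["strand"], doc["phase"],
        ";".join(parts),
    ]
    return "\t".join(columns)

doc = {
    "recid": "NC_016845.1", "source": "RefSeq", "type": "gene",
    "start": "22533", "end": "22802", "score": ".",
    "strand": "+", "phase": ".",
    "attributes": {
        "ID": "gene-KPHS_00170",
        "Ontology_term": [
            {"DBTAG": "GO", "ID": "GO:0006810", "Description": "transport"},
        ],
    },
}
line = doc_to_gff_line(doc)
```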
@fmalmeida, let me know what you think about it and whether I can submit the pull request - I already have some code that can be adjusted to become the aforementioned gff-toolbox ingest module.