Best practice: Top-level feature types can include gene and pseudogene. Optionally, include a so_term_name attribute in column 9 to specify the child (type) of gene - e.g. protein_coding_gene, ncRNA_gene, miRNA_gene and snoRNA_gene (http://purl.obolibrary.org/obo/SO_0000704). Transcript features should include the appropriate SO term in column 3 (e.g. mRNA, snoRNA, etc).
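To make the quoted recommendation concrete, here is a minimal sketch of how a consumer might read the optional so_term_name attribute from column 9. The row, its IDs, and the helper name parse_gff3_attributes are made up for illustration; only the tag=value, semicolon-separated attribute syntax comes from GFF3 itself.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9: str) -> dict:
    """Split a GFF3 column 9 string into {tag: [values]}.

    Hypothetical helper for illustration only, not part of any library.
    """
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        # Multiple values for one tag are comma-separated; both tags and
        # values may be percent-encoded.
        attrs[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attrs


# Made-up example row: column 3 carries the top-level type "gene",
# column 9 optionally narrows it with so_term_name.
row = (
    "chr1\texample\tgene\t1000\t5000\t.\t+\t.\t"
    "ID=gene0001;Name=EXAMPLE1;so_term_name=protein_coding_gene"
)
columns = row.split("\t")
feature_type = columns[2]
attributes = parse_gff3_attributes(columns[8])

# Fall back to the column 3 type when so_term_name is absent.
so_term = attributes.get("so_term_name", [feature_type])[0]
print(feature_type, so_term)
```

The fallback to the column 3 type reflects that the attribute is optional in the recommendation.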
I agree with all of this, but I think that the recommendation should be extended further to regularize non-transcribed features.
Right now there is no consistent convention for representing non-transcribed features, which makes them hard to parse. For example, the NCBI annotation of GRCh38 uses a wide array of top-level non-gene feature types. Additionally, I have not seen any spec define a grouping of non-transcribed features (analogous to the way a gene groups its isoforms).
In the specification I built under the BioCantor repo, I attempted to regularize top-level features by calling any grouping of non-transcribed features a biological region (which I chose based on SO:0001411), and then deviated from SO by calling any interval in that grouping a feature_interval. I also chose to call each "joined" interval of a non-transcribed feature (analogous to an exon) a subregion.
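As a rough sketch of the three-level hierarchy described above, the snippet below checks that each feature's parent has the expected type. The tuple layout, the IDs, and the names ROWS, EXPECTED_PARENT_TYPE, and build_tree are assumptions made for illustration, and the spelling biological_region is borrowed from the SO term name; none of this is taken from the BioCantor code.

```python
from collections import defaultdict

# Made-up rows: (type, start, end, ID, Parent). Not real annotation data.
ROWS = [
    ("biological_region", 100, 900, "region1", None),
    ("feature_interval", 100, 500, "interval1", "region1"),
    ("subregion", 100, 200, "sub1", "interval1"),
    ("subregion", 400, 500, "sub2", "interval1"),
    ("feature_interval", 700, 900, "interval2", "region1"),
    ("subregion", 700, 900, "sub3", "interval2"),
]

# Assumed nesting, mirroring gene -> transcript -> exon.
EXPECTED_PARENT_TYPE = {
    "biological_region": None,
    "feature_interval": "biological_region",
    "subregion": "feature_interval",
}


def build_tree(rows):
    """Return {parent_id: [child_ids]} after checking parent types."""
    types = {fid: ftype for ftype, _, _, fid, _ in rows}
    children = defaultdict(list)
    for ftype, _start, _end, fid, parent in rows:
        parent_type = types.get(parent) if parent else None
        if parent_type != EXPECTED_PARENT_TYPE[ftype]:
            raise ValueError(
                f"{fid}: {ftype} must have parent type "
                f"{EXPECTED_PARENT_TYPE[ftype]}, got {parent_type}"
            )
        if parent:
            children[parent].append(fid)
    return dict(children)


tree = build_tree(ROWS)
print(tree)
```

With a fixed nesting like this, a parser can treat non-transcribed features the same way it treats gene, transcript, and exon rows, instead of special-casing each top-level type.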