the-sequence-ontology / specifications
GFF and GVF specification documents
Moving the GFF3 specification has broken a web of links to this file. Can
you please add a redirect to fix the links?
Normally I would contact the website admin rather than file a bug report; however,
the link to the contacts is broken:
I use GFF3 as the primary export format for hundreds of prokaryotic and eukaryotic genomes and, while the structure is generally well defined in the specification for coding genes, it would be great to have some clarifications and even best practices for standards purposes in a future release. Considerations include:
Without some of these formally in the specification, there is room for competing standards from the large organizations, such as EMBL and now NCBI with its support for GFF3.
As we discovered here (https://www.biostars.org/p/406128/#406210), it is not necessarily obvious that a space is used as the delimiter in the headers. I haven't seen any mention of it in the specification; could it be stated officially?
From the spec (emphasis mine):
In addition, the following characters have reserved meanings in column 9 and must be escaped when used in other contexts:
; semicolon (%3B)
= equals (%3D)
& ampersand (%26)
, comma (%2C)
and
Column 9: "attributes"
A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ",=;". ...
I understand that semicolons (;) and equals signs (=) are for separating tag-value pairs, and that commas (,) are used when multiple values are assigned to a single tag, but what are ampersands (&) for? I don't see it mentioned anywhere in the spec, nor the pathological cases, and my best attempts to Google it have come up empty.
Thanks for any insight!
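For reference, the escaping rule quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `escape_attribute_value` is hypothetical, the four reserved characters are taken from the quoted spec text, and escaping a literal `%` as `%25` first is an extra assumption so that existing percent signs are not later misread as escapes. It does not answer what the ampersand is reserved for.

```python
# Reserved column 9 characters, as listed in the spec text quoted above.
RESERVED = {";": "%3B", "=": "%3D", "&": "%26", ",": "%2C"}


def escape_attribute_value(value):
    """Percent-encode the reserved column 9 characters in a tag or value.

    Hypothetical helper for illustration only.
    """
    # Assumption: encode '%' first so a literal percent sign is not
    # confused with the escapes introduced below.
    escaped = value.replace("%", "%25")
    for char, code in RESERVED.items():
        escaped = escaped.replace(char, code)
    return escaped


print(escape_attribute_value("kinase; ATP=dependent, A&B"))
```

Decoding can be done with the standard library's `urllib.parse.unquote`.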
I think this line of the specification is a little ambiguous.
There are many encodings of Unicode; UTF-8 is only one of them.
https://github.com/The-Sequence-Ontology/Specifications/blame/master/gff3.md#L24
Hello,
would you be interested in registering a text/gff3 media type with the IANA? I have experience with two applications (within the Debian project) and I found the process quite easy at that time. I see that some bioinformatics file formats have their magic numbers registered in databases such as magic and shared-mime-info. While registration with the IANA is not just about magic numbers, it would nicely close the loop.
The submission form is here: https://www.iana.org/form/media-types
In brief, the submission could be along the following lines:
Have a nice day,
Charles Plessy
Within the paragraph about Column 3: "type", there is a) and c) but b) is missing. I found the description by Scott Cain at http://gmod.org/wiki/GFF3 clearer.
Text in the gvf spec's definition of column 3 still refers to term SO:0002073 as 'no_variation' even though the term name has changed to 'no_sequence_alteration'. Please update text in spec.
In my opinion, the "Column 9: attributes" paragraph should state clearly, where the ID tag is first mentioned, that ID attributes are only mandatory for features that have children (and similarly for Parent...).
This is only mentioned later in the text: "The ID attributes are only mandatory for those features that have children".
Should GFF3 writers avoid trailing semicolons, or should GFF3 readers ignore them?
In my use case, for example, rtracklayer casually adds trailing semicolons, and jbrowse2 does not ignore them, which results in an error. Which behavior should be fixed? I guess both should.
The current spec only states "Multiple tag=value pairs are separated by semicolons".
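To make the two options concrete, here is a minimal Python sketch of a lenient reader and a strict writer. Both function names are hypothetical, and ignoring empty fields is just one possible reading of "separated by semicolons", not something the spec states.

```python
def parse_attributes(column9):
    """Parse a column 9 string into {tag: [values]}, tolerating stray semicolons."""
    attributes = {}
    for pair in column9.strip().split(";"):
        pair = pair.strip()
        if not pair:
            # Empty field produced by a trailing or doubled semicolon: skip it.
            continue
        tag, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"attribute without '=': {pair!r}")
        attributes[tag] = value.split(",")
    return attributes


def format_attributes(attributes):
    """Write attributes with semicolons strictly between pairs, none trailing."""
    return ";".join(f"{tag}={','.join(values)}" for tag, values in attributes.items())


parsed = parse_attributes("ID=cds01;Parent=tran01;")
print(parsed)
print(format_attributes(parsed))
```

With this pairing, a file that passes through the reader and writer loses its trailing semicolons, so stricter downstream tools would not see them.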
Hi, in the "Programmed frameshift" example (excerpt below), I'm confused about the phase of the second record:
chrX . CDS XXXX YYYY . + 0 ID=cds01;Parent=tran01
chrX . CDS YYYY-1 ZZZZ . + 1 ID=cds01;Parent=tran01
I'm not an expert, but shouldn't the phase of the second segment always be 0? From my informal survey of a couple dozen examples of ribosomal slippage found in NCBI's human and mouse GFF3 downloads, it is true that the second segment always has phase 0. Is it possible that NCBI's "ribosomal slippage" is just one subtype of programmed frameshift that is more strict than the general case?
The GFF3 specification is not really clear about how to treat CDS features of 1bp length that have a phase of 2 defined. According to
The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region.
it is apparently assumed that length >= phase, so that the number of bases indicated by the phase can be 'skipped'. However, we have encountered cases in draft genomes (cf. genometools/genometools#793) where such short CDS show up.
Am I correct in assuming that in such cases the remaining phase shift is supposed to be 'carried forward' to the next CDS?
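To illustrate the 'carried forward' reading, here is a small Python sketch. It is an assumption about the intended behaviour, not something the spec confirms, and `next_phase` is a hypothetical helper. It computes the phase of the following CDS segment from the length and phase of the current one; when the segment is shorter than its phase, the bases that could not be skipped remain as the phase of the next segment.

```python
def next_phase(length, phase):
    """Phase of the following CDS segment under the carry-forward reading.

    length: number of bases in the current CDS segment
    phase:  phase (0, 1 or 2) of the current CDS segment
    """
    # (length - phase) is the number of bases left after skipping `phase`
    # bases; it is negative when the segment is shorter than its phase.
    # Python's % always returns a non-negative result, so the deficit
    # carries over to the next segment.
    return (3 - (length - phase) % 3) % 3


print(next_phase(1, 2))   # 1 bp CDS with phase 2
print(next_phase(10, 0))  # 10 bp CDS with phase 0
```

Under this reading a 1 bp CDS with phase 2 is followed by a segment with phase 1, because only one of the two bases could be skipped.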
Hey Barrymoore, thanks for your detailed explanation of the GFF format. It is now clearer to me. I have a small question about the attributes, though. From http://mblab.wustl.edu/GTF2.html, I see there are attributes like gene_id and transcript_id, while in the GFF example in your .md file there is only ID. Are they the same? Is there a consistent standard form for GFF files?
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md#readme
In section Column 8: "phase" it says:
This is NOT to be confused with the frame, which is simply start modulo 3.
What does "start" refer to? The start column of the GFF3 file? In that case it would be the start position of the entire chromosome? Or does it refer to the start position of the start-codon?
An example that shows the differences between phase and frame would be appreciated.
Hi,
First of all, the link to the Exonerate documentation in the Gap attribute paragraph doesn't work.
Secondly, the exonerate manual web page no longer describes exactly what was available in the past.
It describes the CIGAR format and explains the operators like this:
Operator | Description |
---|---|
M | Match |
C | Codon |
G | Gap |
N | Non-equivalenced region |
5 | 5' splice site |
3 | 3' splice site |
I | Intron |
S | Split codon |
F | Frameshift |
The CIGAR format associated with Samtools, which can be found everywhere on the internet, looks like this:
Operator | Description |
---|---|
D | Deletion; the nucleotide is present in the reference but not in the read |
H | Hard Clipping; the clipped nucleotides are not present in the read. |
I | Insertion; the nucleotide is present in the read but not in the reference. |
M | Match; can be either an alignment match or mismatch. The nucleotide is present in the reference. |
N | Skipped region; a region of nucleotides is not present in the read |
P | Padding; padded area in the read and not in the reference |
S | Soft Clipping; the clipped nucleotides are present in the read |
X | Read Mismatch; the nucleotide is present in the reference |
= | Read Match; the nucleotide is present in the reference |
Meanwhile, old resources such as
FlyBase (2004): http://rice.bio.indiana.edu:7082/annot/gff3.html
WormBase (2010): http://wiki.wormbase.org/index.php/GFF3specProposal
describe the format like this:
Operator | Description |
---|---|
M | match |
I | insert a gap into the reference sequence |
D | insert a gap into the target (delete from reference) |
F | frameshift forward in the reference sequence |
R | frameshift reverse in the reference sequence |
To gather all the information in one place and not lose any, maybe a solution would be to create your own page describing the CIGAR format in full.
Here is the union of the values I have seen in the CIGAR format:
Operator | Description |
---|---|
M | Match ; can be either an alignment match or mismatch. The nucleotide is present in the reference. |
C | Codon |
G | Gap |
N | Non-equivalenced region |
5 | 5' splice site |
3 | 3' splice site |
I | Intron / the nucleotide is present in the read but not in the reference. / insert a gap into the reference sequence |
S | Split codon / Soft Clipping; the clipped nucleotides are present in the read |
H | Hard Clipping; the clipped nucleotides are not present in the read |
F | Frameshift / frameshift forward in the reference sequence |
D | Deletion; the nucleotide is present in the reference but not in the read / insert a gap into the target (delete from reference) |
P | Padding; padded area in the read and not in the reference |
X | Read Mismatch; the nucleotide is present in the reference |
= | Read Match; the nucleotide is present in the reference |
R | frameshift reverse in the reference sequence |
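
To show how the operator sets differ in practice, here is a hedged Python sketch of a parser for the GFF3 Gap attribute. It assumes the GFF3 form, in which each operation is a letter followed by a length and operations are separated by spaces (for example `M8 D3 M6 I1 M6`), and it accepts only the five operators from the FlyBase/WormBase table above. The name `parse_gap` is hypothetical.

```python
import re

# Operators from the FlyBase/WormBase table: match, insert, delete,
# frameshift forward, frameshift reverse.
GAP_OPS = {"M", "I", "D", "F", "R"}
_TOKEN = re.compile(r"^([A-Z])(\d+)$")


def parse_gap(gap):
    """Parse a GFF3 Gap value into a list of (operator, length) tuples."""
    operations = []
    for token in gap.split():
        match = _TOKEN.match(token)
        if match is None or match.group(1) not in GAP_OPS:
            raise ValueError(f"unsupported Gap operation: {token!r}")
        operations.append((match.group(1), int(match.group(2))))
    return operations


print(parse_gap("M8 D3 M6 I1 M6"))
```

A SAM-style string such as `8M3D` is rejected by this sketch, since SAM puts the length before the operator and does not separate operations with spaces.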
Hi,
Under the alignments section in the GFF3 spec, you list the non-existent "nucleotide_to_protein_match" term as a subclass of "match" in the SO.
I think the closest term would be "protein_match".
"nucleotide_motif" is also not a subclass of "match".
Minor details.
Seems like your markdown has gone all wonky there and things aren't rendering correctly.
Hi there!
We've recently encountered an issue while sharing GFF3 files with colleagues using Geneious.
I'm piecing this together from my colleague's interactions with Geneious support, but it seems that including ##sequence-region lines with a start coordinate > 1 affects how they display features or extract sequences.
I suspect they've interpreted it as an offset rather than as a boundary check, which is how I had interpreted it.
We typically use genometools to process/tidy GFF3 files, which automatically adds these lines based on the min and max coordinate in the file.
What is the correct way to specify these lines?
Should we just be setting the start and end to be 1 and the sequence length?
Thanks in advance!
Darcy
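For clarity, the "boundary check" interpretation described above could look like the following Python sketch. It is only one reading of the directive, and the function names are hypothetical. The directive is assumed to have the form `##sequence-region seqid start end`, with whitespace-separated fields.

```python
def parse_sequence_region(line):
    """Parse '##sequence-region seqid start end' into (seqid, start, end)."""
    parts = line.split()
    if len(parts) != 4 or parts[0] != "##sequence-region":
        raise ValueError(f"not a sequence-region directive: {line!r}")
    return parts[1], int(parts[2]), int(parts[3])


def within_region(region, feature_start, feature_end):
    """Boundary check: the feature must lie inside the declared region.

    Coordinates are NOT shifted by the region start; that would be the
    'offset' interpretation.
    """
    _, region_start, region_end = region
    return region_start <= feature_start <= feature_end <= region_end


region = parse_sequence_region("##sequence-region chr1 500 2000")
print(within_region(region, 600, 900))
print(within_region(region, 100, 900))
```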
Currently GFF3 does not use SO IDs. Instead, features are typed by their label. This imposes strictures on SO: SO can't change labels without potentially breaking GFF3 usage.
Inspired by JSON-LD contexts we could have in the header declarations of mappings between values in the type column and SO term IDs. E.g:
# context:
# transcript: SO:0000673
# exon: ...
# ...
Note that this would need to be widely implemented before SO would be able to change labels, but this would be a step in the right direction
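As a rough illustration of how a reader might consume such a header, here is a Python sketch. The `# context:` syntax is only the proposal sketched above, not part of GFF3, and `parse_context` is a hypothetical name. Entries whose value is not an SO ID, such as the elided `...` placeholders, are skipped.

```python
def parse_context(header_lines):
    """Collect label -> SO ID mappings from a proposed '# context:' header."""
    mapping = {}
    in_context = False
    for line in header_lines:
        text = line.lstrip("#").strip()
        if text == "context:":
            in_context = True
            continue
        if not in_context:
            continue
        label, sep, term = text.partition(":")
        term = term.strip()
        # Keep only entries that actually carry an SO term ID.
        if sep and term.startswith("SO:"):
            mapping[label.strip()] = term
    return mapping


header = ["# context:", "# transcript: SO:0000673", "# exon: ...", "# ..."]
print(parse_context(header))
```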
The spec points here: https://github.com/modENCODE-DCC/validator/blob/master/new_gff_validator.pl
This is 7-year-old Perl code.
The SO wiki has:
http://www.sequenceontology.org/so_wiki/index.php/GFF3_Validation_Tools
which has GFFO (not in use?), FALDO (not really a validator) and the modENCODE validator. The modENCODE validator link doesn't work. But it seems to be this code:
https://github.com/genometools/genometools
which is in C
Reciprocal ticket: genometools/genometools#910
A question here:
https://www.biostars.org/p/177319/
points to another validator, this one in Python: http://www.raetschlab.org/suppl/gff-tools
Which of these is supported? Is the behavior identical? What expectations does each have on the SO obo file?
I don't think the spec should link to specific validators. However, the spec should indicate the expected behavior of the validator. This could be modularized into different checks, and we could group checks into profiles. E.g. some validators may only validate a basic syntactic profile. Others could validate a sofa profile, where we check that the type column maps to a SO ID.
Understanding how validators use relationships is important for maintenance of SO:
The-Sequence-Ontology/SO-Ontologies#465
There could be a validator registry separate from the spec, and defined conformance tests for the validators
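A minimal Python sketch of what modular checks grouped into profiles could look like. All names here are hypothetical, and the set of type labels is a placeholder; a real validator would load the labels from the SO release it targets.

```python
def check_columns(fields):
    """Syntactic check: a feature line has exactly nine tab-separated columns."""
    return [] if len(fields) == 9 else [f"expected 9 columns, found {len(fields)}"]


def check_coordinates(fields):
    """Syntactic check: start and end are integers with 1 <= start <= end."""
    if len(fields) < 5:
        return []  # already reported by check_columns
    try:
        start, end = int(fields[3]), int(fields[4])
    except ValueError:
        return ["start and end must be integers"]
    return [] if 1 <= start <= end else ["start must be >= 1 and <= end"]


def make_type_check(known_labels):
    """Build a check that the type column is one of the known labels."""
    def check_type(fields):
        if len(fields) < 3 or fields[2] in known_labels:
            return []
        return [f"unknown type: {fields[2]}"]
    return check_type


# Placeholder labels for illustration only.
EXAMPLE_LABELS = {"gene", "mRNA", "exon", "CDS"}

PROFILES = {
    "syntax": [check_columns, check_coordinates],
    "sofa": [check_columns, check_coordinates, make_type_check(EXAMPLE_LABELS)],
}


def validate_line(line, profile="syntax"):
    """Run every check in the chosen profile and collect the error messages."""
    fields = line.rstrip("\n").split("\t")
    errors = []
    for check in PROFILES[profile]:
        errors.extend(check(fields))
    return errors


print(validate_line("chrX\t.\tfoo\t20\t10\t.\t+\t0\tID=x", profile="sofa"))
```

Conformance tests for a registry could then be expressed as pairs of input lines and expected error lists per profile.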
I would like to cite this specification in a paper. Do you have any suggestions?
Could you add the repo to Zenodo to get a citable object?
From the introductory material, "A Quick Explanation and Example of GVF Content", under "attributes:Variant_seq":
@ (at): An alias for the sequence found in the Reference_seq attribute.
Other options include: ., -, ~, !, and ^.
The full 'Variant_seq' section, where this required term is defined, reads:
In addition to the observed nucleic acid sequence, several other characters (.-~@!^) are valid values in the Variant_seq attribute. Use of these characters is described below with examples.
However the '@' option is not defined in the examples along with the other characters.
Possibly related: '@' is not accepted as a valid entry for 'Variant_seq' by the current gvf_validator.pl script.
In the section Ontology Associations and DB Cross References of the spec, it refers to two files hosted on ftp.geneontology.org: ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs and ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs_spec. I am not able to access these files, or anything on ftp.geneontology.org. Are there updated locations for those?
Hi there!
I've been looking at translating protein-vs-protein or HMM-vs-protein alignments to genome coordinates in GFF format for easy browsing by colleagues and gene curation.
But it's unclear to me how matches across introns are supposed to be done if we want to include Gap annotations.
The issue is that your guidance on Gap for protein matches is to have M, I, D operations in AA length.
If the match goes over an intron and the CDS/intron isn't a multiple of 3, how should that be specified?
I see three possible ways, but none of them quite seems to do it.
I'm just not really sure that the frameshift operators are meant to be used this way.
Exonerate for example has a split codon (S) operation.
Obviously this isn't super critical. We could just use "match_part"s and forget about the Gaps.
I'm adopting option 3 (splitting into match_parts) for now, because more genome browsers should display it properly.
But I like trying to stick to norms, so it would be nice to have an example in the docs. You have a few for nucleotide matches, but nothing for proteins.
Thanks in advance,
Darcy
PS. I think maybe your example for the frame shift is off by one in both of the second lines of the alignment?
For the CDS features in the Programmed frameshift example:
chrX . CDS XXXX YYYY 0 + . ID=cds01;Parent=tran01
chrX . CDS YYYY-1 ZZZZ 1 + . ID=cds01;Parent=tran01
Is it intended for the "0" and "1" to be in the phase (8th) column, rather than the score (6th) column? i.e.:
chrX . CDS XXXX YYYY . + 0 ID=cds01;Parent=tran01
chrX . CDS YYYY-1 ZZZZ . + 1 ID=cds01;Parent=tran01
Hello!
It is not clear from the specification whether a GFF3 file should be sorted by seqid when multiple seqids are present in the file.
I received a file where this is not the case: first there are lines of type gene for multiple seqids, and then multiple nRNA lines with the same set of seqids, whose parents are the genes described above.
The reader I use (the Sci-Kit Bio read function) treats each occurrence of a seqid as a new name. If a specific sequence ID is provided, it reads only the first record (I presume because it encounters a different seqid after that).
So my problem is that, because this is not specified, I cannot tell whether the reader's behaviour is incorrect, or whether it is being strict and correct and the file itself is formatted incorrectly.
Thank you very much for clarification.
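Whichever way the spec is clarified, a reader that wants to tolerate files like this can group lines by seqid without assuming they are contiguous. A minimal Python sketch follows; `group_by_seqid` is a hypothetical helper and is not how any particular library works.

```python
from collections import defaultdict


def group_by_seqid(lines):
    """Group GFF3 feature lines by seqid, regardless of their order in the file."""
    groups = defaultdict(list)
    for line in lines:
        # Skip blank lines, comments and directives.
        if not line.strip() or line.startswith("#"):
            continue
        seqid = line.split("\t", 1)[0]
        groups[seqid].append(line.rstrip("\n"))
    return dict(groups)


records = [
    "##gff-version 3",
    "chr1\t.\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chr2\t.\tgene\t1\t100\t.\t+\t.\tID=g2",
    "chr1\t.\tmRNA\t1\t100\t.\t+\t.\tID=t1;Parent=g1",
]
for seqid, features in group_by_seqid(records).items():
    print(seqid, len(features))
```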
Note that there is a formatting problem in the last line of the 3rd code chunk of the Trans-spliced transcript section:
chrX . exon XXXX YYYY . + . Parent=tran01 chrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01
Looks like 2 lines (exon and CDS) have been concatenated together:
Thanks!