the-sequence-ontology / specifications


GFF and GVF specification documents


specifications's Issues

Broken links to GFF3 specification.

Moving the location of the GFF3 specification has broken a web of links to this file. Can
you please add a redirect to fix the links?

Normally I wouldn't file a bug report for this and would contact the web site admin instead; however,
the links to the contact pages are broken:

http://www.sequenceontology.org/?page_id=259

http://www.sequenceontology.org/contacts/

Phase missing in example section GFF3

According to the specification, CDS always requires a phase to be annotated.

This requirement is violated in the example section. This should be fixed, as it's not good for a specification to be in violation of itself.

Questions/Clarifications for next GFF3 version

I use GFF3 as the primary export format for hundreds of prokaryotic and eukaryotic genomes and, while the structure is generally well defined in the specification for coding genes, it would be great to have some clarifications and even best practices for standards purposes in a future release. Considerations include:

Non-coding gene encoding in GFF. This should include examples of tRNAs, rRNAs, etc. What does the gene graph look like for these?
Functional annotation standards. In the 9th column, can we decide on some standardized keys for things like gene product names and gene symbols? Others, such as GO terms and EC numbers, are already well described using Dbxrefs, but even these could be expanded to allow for attribution of the sources of these terms as well as GO evidence codes.

Without some of these being formally in the specification, there is room for competing standards from the large organizations, such as EMBL and now NCBI with its support for GFF3.

gff3 header delimiter

As we ran into here (https://www.biostars.org/p/406128/#406210), the space used as a delimiter in the headers is not necessarily obvious. I haven't seen any mention of it in the specification; could it be stated officially?

What is the "reserved meaning" of an ampersand (&) in column 9?

From the spec (emphasis mine):

In addition, the following characters have reserved meanings in column 9 and must be escaped when used in other contexts:

; semicolon (%3B)
= equals (%3D)
& ampersand (%26)
, comma (%2C)

and

Column 9: "attributes"
A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ",=;". ...

I understand that semicolons (;) and equals signs (=) are for separating tag-value pairs, and that commas (,) are used when multiple values are assigned to a single tag, but what are ampersands (&) for? I don't see it mentioned anywhere in the spec, nor the pathological cases, and my best attempts to Google it have come up empty.

Thanks for any insight!
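
For readers who need to write column 9 values today, regardless of what the ampersand is reserved for, here is a minimal Python sketch that escapes the four characters quoted above. The helper names `escape_value` and `unescape_value` are hypothetical, and escaping `%` itself first is an assumption made so that values round-trip cleanly.

```python
from urllib.parse import unquote

# The four characters the quoted spec text lists as reserved in column 9.
RESERVED = {";": "%3B", "=": "%3D", "&": "%26", ",": "%2C"}


def escape_value(text: str) -> str:
    """Percent-encode '%' first (assumption), then the four reserved characters."""
    out = text.replace("%", "%25")
    for char, code in RESERVED.items():
        out = out.replace(char, code)
    return out


def unescape_value(text: str) -> str:
    """Reverse escape_value using standard URL unquoting."""
    return unquote(text)


note = "binds A&B; Kd=5,2"
encoded = escape_value(note)
print(encoded)
print(unescape_value(encoded) == note)
```

Escaping the ampersand unconditionally is the conservative choice while its reserved meaning remains undocumented.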

Encoding of GFF3 files

I think this line of the specification is a little ambiguous.
Unicode has several encodings, and UTF-8 is only one of them.
https://github.com/The-Sequence-Ontology/Specifications/blame/master/gff3.md#L24

Application for a text/gff3 Media Type ?

Hello,

would you be interested in registering a text/gff3 media type with the IANA? I have been through the process for two applications (within the Debian project) and found it quite easy at the time. I see that some bioinformatics file formats have their magic numbers registered in databases such as magic and shared-mime-info. While registration with the IANA is not just about magic numbers, it would nicely close the loop.

The submission form is here: https://www.iana.org/form/media-types

In brief, the submission could be around the following lines:

Type Name: text
Subtype Name: gff3
Required Parameters: None
Optional Parameters: None
Encoding Considerations: 8-bit, Unicode or Latin-1 recommended, URL escaping of some whitespace and delimiter characters.
Security Considerations: are there potential issues? Is it possible to craft a GFF3 file so that parsing enters an infinite loop, etc.?
Interoperability Considerations
Published specification: this GitHub repository ?
Application Usage: Genome browsers, bioinformatics tools, database dumps, ...
Fragment Identifier Considerations: None
Restrictions on Usage: None
Provisional Registrations: text/gff3
Additional Information:
- Deprecated alias names for this type: text/x-gff3
- Magic number(s): ##gff-version 3
- File extension(s): gff3
- Macintosh File Type Code(s): None
- Object Identifier(s) or OID(s): None
Intended Usage: Common
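
As a rough illustration of how the proposed magic number could be used for content sniffing, here is a small Python sketch. The function name `looks_like_gff3` is hypothetical, and accepting dotted versions such as `3.1.26` is an assumption about how the version pragma may be written.

```python
MAGIC = "##gff-version 3"


def looks_like_gff3(first_line: str) -> bool:
    """Return True if the first line looks like a GFF3 version pragma."""
    # Drop a possible byte-order mark and the line ending before comparing.
    line = first_line.lstrip("\ufeff").rstrip("\r\n")
    if not line.startswith(MAGIC):
        return False
    rest = line[len(MAGIC):]
    # Accept "3" alone, a dotted version such as "3.1.26" (assumption),
    # or trailing whitespace; reject other majors such as "30".
    return rest == "" or rest[0] in ". \t"


print(looks_like_gff3("##gff-version 3\n"))
print(looks_like_gff3("##gff-version 2\n"))
```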

Have a nice day,

Charles Plessy

typo error

Within the paragraph about Column 3: "type"
There is an a) and a c), but b) is missing. I found the description by Scott Cain at http://gmod.org/wiki/GFF3 clearer.

Update SO term referenced in "column 3 Type" definition

Text in the gvf spec's definition of column 3 still refers to term SO:0002073 as 'no_variation' even though the term name has changed to 'no_sequence_alteration'. Please update text in spec.

Precision about the 9th column

In my view, the "Column 9: attributes" paragraph should state clearly, at the point where the ID tag is first mentioned, that ID attributes are only mandatory for features that have children (and similarly for Parent).

This is only mentioned later in the text: "The ID attributes are only mandatory for those features that have children".

Trailing semicolons at GFF3 attributes should be avoided or ignored?

Should GFF3 writers avoid trailing semicolons, or should GFF3 readers ignore them?

In my use case, for example, rtracklayer adds trailing semicolons and jbrowse2 does not ignore them, which results in an error. Which behavior should be fixed? I guess both should.

The current spec only states "Multiple tag=value pairs are separated by semicolons".
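
One way to cover both sides is sketched below in Python: a reader that skips empty fields and a writer that never emits a trailing semicolon. The function names are hypothetical, and treating empty fields as harmless is an assumption, since the spec is silent on the point.

```python
from urllib.parse import unquote


def parse_attributes(column9: str) -> dict[str, list[str]]:
    """Parse a GFF3 attributes column, skipping empty fields.

    Empty fields arise from a trailing semicolon ("ID=x;") or a doubled one.
    """
    attributes: dict[str, list[str]] = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if not field:
            continue  # tolerate trailing or repeated semicolons
        tag, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"attribute without '=': {field!r}")
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


def format_attributes(attributes: dict[str, list[str]]) -> str:
    """Join tag=value pairs with ';' and no trailing semicolon.

    Values are assumed to be escaped already.
    """
    return ";".join(f"{tag}={','.join(values)}" for tag, values in attributes.items())


print(parse_attributes("ID=cds01;Parent=tran01;"))
```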

Clarification of GFF3 "Programmed frameshift" example

Hi, in the "Programmed frameshift" example (excerpt below), I'm confused about the phase of the second record:

chrX  . CDS                XXXX   YYYY .  +  0 ID=cds01;Parent=tran01
chrX  . CDS                YYYY-1 ZZZZ .  +  1 ID=cds01;Parent=tran01

I'm not an expert but shouldn't the phase of the second segment always be 0? From my informal survey of a couple dozen examples of ribosomal slippage found in NCBI's human and mouse GFF3 downloads, it is true that the second segment always has phase 0. Is it possible that NCBI's "ribosomal slippage" is just one subtype of programmed frameshift that is more strict than the general case?

GFF3: Phase > feature length not clearly defined

The GFF3 specification is not really clear about how to treat CDS features of 1bp length that have a phase of 2 defined. According to

The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region.

it is apparently assumed that length >= phase, so that the number of bases indicated by the phase can be 'skipped'. However, we have encountered cases in draft genomes (cf. genometools/genometools#793) where such short CDS features show up.
Am I correct in assuming that in such cases the remaining phase shift is supposed to be 'carried forward' to the next CDS?
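
The 'carried forward' reading can be written down as a short Python sketch. This is only the interpretation asked about above, not confirmed spec behavior, and `next_phase` is a hypothetical helper name.

```python
def next_phase(length: int, phase: int) -> int:
    """Phase expected for the CDS segment that follows this one.

    `length` is end - start + 1 and `phase` is the value in column 8.
    """
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1, or 2")
    if length < 1:
        raise ValueError("length must be positive")
    # length - phase can be negative when the segment is shorter than its
    # phase; Python's % still returns a value in 0..2, which carries the
    # unskipped bases over to the next segment.
    return (3 - (length - phase) % 3) % 3


# A 1 bp CDS with phase 2: one base is skipped here, one remains to skip.
print(next_phase(1, 2))
```

Under this reading, the 1 bp segment with phase 2 is followed by a segment with phase 1.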

Why are there no gene_id and transcript_id attributes in the example GFF file given in the .md file?

Hey Barrymoore, thanks for your detailed explanation of the GFF format; it is clearer to me now. I have a small question about the attributes, though. From http://mblab.wustl.edu/GTF2.html, I see there are attributes like gene_id and transcript_id, while in the GFF example in your .md file there is only ID. Are they the same? Is there a consistent standard form for GFF files?

Difference between phase and frame is unclear in the GFF3 spec

https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md#readme

In section Column 8: "phase" it says:

This is NOT to be confused with the frame, which is simply start modulo 3.
What does "start" refer to? The start column of the GFF3 file? In that case it would be the start position of the entire chromosome? Or does it refer to the start position of the start-codon?

An example that shows the differences between phase and frame would be appreciated.
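
To make the question concrete, here is a tiny Python sketch that assumes "start" means the value in column 4 of each CDS line. That reading is an assumption, and the coordinates are illustrative values only.

```python
# Illustrative CDS segments as (start, end, phase); coordinates are example values.
segments = [(1201, 1500, 0), (3000, 3902, 0), (5000, 5500, 0), (7000, 7600, 0)]


def frame_of(start: int) -> int:
    """Frame under the reading 'start modulo 3', with start taken from column 4."""
    return start % 3


for start, end, phase in segments:
    print(f"start={start} end={end} phase={phase} frame={frame_of(start)}")
```

Under that reading every segment has phase 0, yet the frames differ (1, 0, 2, 1), which shows that the two values are not interchangeable.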

Gap attribute - CIGAR description - dead link

Hi,

First of all, the link to the Exonerate documentation in the Gap attribute paragraph doesn't work.
Secondly, the current Exonerate manual page doesn't describe exactly what was available in the past.
It describes the CIGAR format and explains the operators like this:

Operator	Description
M	Match
C	Codon
G	Gap
N	Non-equivalenced region
5	5' splice site
3	3' splice site
I	Intron
S	Split codon
F	Frameshift

The Samtools flavor of the CIGAR format, which can be found everywhere on the internet, looks like this:

Operator	Description
D	Deletion; the nucleotide is present in the reference but not in the read
H	Hard Clipping; the clipped nucleotides are not present in the read.
I	Insertion; the nucleotide is present in the read but not in the reference.
M	Match; can be either an alignment match or mismatch. The nucleotide is present in the reference.
N	Skipped region; a region of nucleotides is not present in the read
P	Padding; padded area in the read and not in the reference
S	Soft Clipping; the clipped nucleotides are present in the read
X	Read Mismatch; the nucleotide is present in the reference
=	Read Match; the nucleotide is present in the reference

Meanwhile, older resources such as
FlyBase (2004): http://rice.bio.indiana.edu:7082/annot/gff3.html
WormBase (2010): http://wiki.wormbase.org/index.php/GFF3specProposal
describe the format like this:

Operator	Description
M	match
I	insert a gap into the reference sequence
D	insert a gap into the target (delete from reference)
F	frameshift forward in the reference sequence
R	frameshift reverse in the reference sequence

To gather all the information in one place and not lose any of it, one solution might be to create your own page describing the CIGAR format in full.

Here is the union of the values I have seen in the CIGAR format:

Operator	Description
M	Match ; can be either an alignment match or mismatch. The nucleotide is present in the reference.
C	Codon
G	Gap
N	Non-equivalenced region
5	5' splice site
3	3' splice site
I	Intron / the nucleotide is present in the read but not in the reference. / insert a gap into the reference sequence
S	Split codon / Soft Clipping; the clipped nucleotides are present in the read
H	Hard Clipping; the clipped nucleotides are not present in the read
F	Frameshift / frameshift forward in the reference sequence
D	Deletion; the nucleotide is present in the reference but not in the read / insert a gap into the target (delete from reference)
P	Padding; padded area in the read and not in the reference
X	Read Mismatch; the nucleotide is present in the reference
=	Read Match; the nucleotide is present in the reference
R	frameshift reverse in the reference sequence
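
As a sketch of what a parser for the narrower five-operator set (M, I, D, F, R) could look like, here is a short Python example. It assumes the space-separated, operator-then-length layout (for example `M8 D3 M6 I1 M6`); the function names are hypothetical, and frameshift operators are parsed but deliberately not interpreted.

```python
import re


def parse_gap(value: str) -> list[tuple[str, int]]:
    """Split a Gap value such as 'M8 D3 M6 I1 M6' into (operator, length) pairs."""
    ops = []
    for token in value.split():
        match = re.fullmatch(r"([MIDFR])(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognized Gap token: {token!r}")
        ops.append((match.group(1), int(match.group(2))))
    return ops


def aligned_lengths(ops: list[tuple[str, int]]) -> tuple[int, int]:
    """Return (reference_length, target_length) for M/I/D-only alignments."""
    reference = target = 0
    for op, n in ops:
        if op == "M":      # match: consumes both sequences
            reference += n
            target += n
        elif op == "D":    # gap in the target: consumes reference only
            reference += n
        elif op == "I":    # gap in the reference: consumes target only
            target += n
        else:
            raise ValueError("frameshift operators are not handled in this sketch")
    return reference, target


print(aligned_lengths(parse_gap("M8 D3 M6 I1 M6")))
```

Rejecting unknown operators outright, rather than guessing among the three dialects listed above, keeps the ambiguity visible to the caller.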

SO subclasses of "match" incorrect

Hi,

Under the alignments section in the GFF3 spec, you list the non-existent "nucleotide_to_protein_match" term as a subclass of "match" in the SO.
I think the closest term would be "protein_match".
"nucleotide_motif" is also not a subclass of "match".

Minor details.

GVF MD file

Seems like your markdown has gone all wonky there and things aren't rendering correctly.

Clarification on the use of the sequence region directive.

Hi there!

We've recently encountered an issue while sharing GFF3 files with colleagues using Geneious.
I'm piecing this together from my colleague's interactions with Geneious support, but it seems that including ##sequence-region lines with a start coordinate > 1 affects how features are displayed and how sequences are extracted.
I suspect they've interpreted it as an offset rather than a boundary check thing as I had interpreted it.

We typically use genometools to process/tidy GFF3 files, which automatically adds these lines based on the min and max coordinate in the file.

What is the correct way to specify these lines?
Should we just be setting the start and end to be 1 and the sequence length?
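
For reference, the "boundary check" interpretation can be sketched in a few lines of Python. This shows only that reading of the directive, not the offset reading, and the function names are hypothetical.

```python
def parse_sequence_region(line: str) -> tuple[str, int, int]:
    """Parse '##sequence-region seqid start end' into (seqid, start, end)."""
    parts = line.split()
    if len(parts) != 4 or parts[0] != "##sequence-region":
        raise ValueError(f"not a sequence-region directive: {line!r}")
    return parts[1], int(parts[2]), int(parts[3])


def within_region(region: tuple[str, int, int], seqid: str, start: int, end: int) -> bool:
    """True if the feature lies entirely inside the declared region."""
    region_id, region_start, region_end = region
    return seqid == region_id and region_start <= start <= end <= region_end


region = parse_sequence_region("##sequence-region ctg123 1000 9000")
# Coordinates are compared as given; nothing is shifted by the region start.
print(within_region(region, "ctg123", 1300, 1500))
print(within_region(region, "ctg123", 500, 1500))
```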

Thanks in advance!

Darcy

Allow a mapping between labels and ontology term IDs in the header of GFF

Currently GFF3 does not use SO IDs. Instead, features are typed by their label. This imposes strictures on SO: it can't change labels without potentially breaking GFF3 usage.

Inspired by JSON-LD contexts we could have in the header declarations of mappings between values in the type column and SO term IDs. E.g:

# context:
#   transcript: SO:0000673
#   exon: ...
#   ...

Note that this would need to be widely implemented before SO would be able to change labels, but this would be a step in the right direction
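
To show how little reader-side code such a header would need, here is a Python sketch that parses the comment layout proposed above. The layout itself is only a proposal, and the function name and the seven-digit `SO:` pattern are assumptions made for illustration.

```python
import re

# An entry looks like "#   transcript: SO:0000673"; placeholders such as "..." are skipped.
ENTRY = re.compile(r"^#\s+(\S+):\s+(SO:\d{7})\s*$")


def read_context(lines: list[str]) -> dict[str, str]:
    """Collect label -> SO ID pairs from a '# context:' comment block."""
    mapping: dict[str, str] = {}
    in_context = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "# context:":
            in_context = True
            continue
        if in_context:
            if not line.startswith("#"):
                break  # first non-comment line ends the block
            match = ENTRY.match(line)
            if match:
                mapping[match.group(1)] = match.group(2)
    return mapping


header = [
    "##gff-version 3",
    "# context:",
    "#   transcript: SO:0000673",
    "#   exon: ...",
    "chr1\t.\ttranscript\t1\t100\t.\t+\t.\tID=t1",
]
print(read_context(header))
```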

Is there a single canonical validator, or multiple implementations?

The spec points here: https://github.com/modENCODE-DCC/validator/blob/master/new_gff_validator.pl

This is 7-year-old Perl code.

The SO wiki has:
http://www.sequenceontology.org/so_wiki/index.php/GFF3_Validation_Tools

which has GFFO (not in use?), FALDO (not really a validator) and the modENCODE validator. The modENCODE validator link doesn't work. But it seems to be this code:
https://github.com/genometools/genometools
which is in C

Reciprocal ticket: genometools/genometools#910

A question here:
https://www.biostars.org/p/177319/
points to another validator, this one in Python: http://www.raetschlab.org/suppl/gff-tools

Which of these is supported? Is the behavior identical? What expectations does each have on the SO obo file?

I don't think the spec should link to specific validators. However, the spec should indicate the expected behavior of the validator. This could be modularized into different checks, and we could group checks into profiles. E.g. some validators may only validate a basic syntactic profile. Others could validate a sofa profile, where we check that the type column maps to a SO ID.

Understanding how validators use relationships is important for maintenance of SO:
The-Sequence-Ontology/SO-Ontologies#465

There could be a validator registry separate from the spec, and defined conformance tests for the validators
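
The idea of modular checks grouped into profiles might look roughly like the following Python sketch. The check names, the profile names, and the tiny `KNOWN_TYPES` set are all placeholders; a real "sofa" profile would consult the SO file instead.

```python
from typing import Callable, Optional

Check = Callable[[list], Optional[str]]  # returns an error message or None

KNOWN_TYPES = {"gene", "mRNA", "exon", "CDS"}  # placeholder, not the real SOFA list


def check_column_count(cols: list) -> Optional[str]:
    return None if len(cols) == 9 else f"expected 9 columns, found {len(cols)}"


def check_coordinates(cols: list) -> Optional[str]:
    if len(cols) < 5:
        return None  # the column-count check reports this case
    try:
        start, end = int(cols[3]), int(cols[4])
    except ValueError:
        return "start and end must be integers"
    return None if 1 <= start <= end else "require 1 <= start <= end"


def check_type(cols: list) -> Optional[str]:
    if len(cols) < 3 or cols[2] in KNOWN_TYPES:
        return None
    return f"unknown type: {cols[2]}"


SYNTAX = [check_column_count, check_coordinates]
PROFILES = {"syntax": SYNTAX, "sofa": SYNTAX + [check_type]}


def validate(line: str, profile: str) -> list:
    """Run every check in the named profile and return the error messages."""
    cols = line.rstrip("\n").split("\t")
    return [msg for check in PROFILES[profile] if (msg := check(cols)) is not None]


print(validate("chr1\t.\tfoo\t1\t10\t.\t+\t.\tID=x", "sofa"))
```

Keeping each check as a plain function makes it straightforward to define conformance tests per check rather than per validator.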

Citing the GFF3 spec

I would like to cite this specification in a paper. Do you have any suggestions?

Could you add the repo to Zenodo to get a citable object?

'@' Variant_seq option defined in introduction but not in Variant_seq definition section

From the introductory material, "A Quick Explanation and Example of GVF Content", under "attributes:Variant_seq":

@ (at): An alias for the sequence found in the Reference_seq attribute.

Other options include: ., -, ~, !, and ^.

The full 'Variant_seq' section, where this required term is defined, reads:

In addition to the observed nucleic acid sequence, several other characters (.-~@!^) are valid values in the Variant_seq attribute. Use of these characters is described below with examples.

However the '@' option is not defined in the examples along with the other characters.

Possibly related: '@' is not accepted as a valid entry for 'Variant_seq' by the current gvf_validator.pl script.

Can't access "Ontology Associations and DB Cross References" files

In the section Ontology Associations and DB Cross References of the spec, it refers to two files hosted on ftp.geneontology.org: ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs and ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs_spec. I am not able to access these files, or anything on ftp.geneontology.org. Are there updated locations for those?

Protein to nucleotide matches over introns

Hi there!

I've been looking at translating protein-vs-protein or HMM-vs-protein alignments to genome coordinates in GFF format for easy browsing by colleagues and gene curation.
But it's unclear to me how matches across introns are supposed to be done if we want to include Gap annotations.
The issue is that your guidance on Gap for protein matches is to have M, I, D operations in AA length.
If the match goes over an intron and the CDS/intron isn't a multiple of 3, how should that be specified?

I see three possible ways, but none of them quite seems to do it:

1. Use a single "protein match" as a feature. Use "I" operations for introns and correct the frame at either side of the intron using R and F operations, respectively.
2. Use match_parts mirroring the matched CDS structure in the aligned region, and correct the frame at the beginning and end using R and F operations.
3. Use a translated_nucleotide_match SO type instead. But this might contradict your guidance on using Gaps with proteins.

I'm just not really sure that the frameshift operators are meant to be used this way.
Exonerate for example has a split codon (S) operation.

Obviously this isn't super critical. We could just use "match_part"s and forget about the Gaps.
I'm adopting option 3 split into match_parts for now, because more genome browsers should display it properly.
But I like trying to stick to norms, so it would be nice to have an example in the docs. You have a few for nucleotide matches, but nothing for proteins.

Thanks in advance,
Darcy

PS. I think maybe your example for the frame shift is off by one in both of the second lines of the alignment?

Programmed frameshift example

For the CDS features in the Programmed frameshift example:

chrX  . CDS                XXXX   YYYY 0  +  . ID=cds01;Parent=tran01
chrX  . CDS                YYYY-1 ZZZZ 1  +  . ID=cds01;Parent=tran01

Is it intended for the "0" and "1" to be in the phase (8th) column, rather than the score (6th) column? i.e.:

chrX  . CDS                XXXX   YYYY .  +  0 ID=cds01;Parent=tran01
chrX  . CDS                YYYY-1 ZZZZ .  +  1 ID=cds01;Parent=tran01

Multiple sequences in a single file

Hello!

It is not clear from the specification whether a GFF3 file should be sorted by seqid when multiple seqids are present in the file.

I received a file where this is not the case: first there are lines of type gene for multiple seqids, and then multiple mRNA lines for the same set of seqids, whose parents are the genes described above.

The reader I use (the scikit-bio read function) treats each new run of a seqid as a new name. If a specific sequence ID is provided, it reads only the first record (I presume because it encounters a different seqid after that).

So my problem is that, because this is not specified, I cannot tell whether the reader's behaviour is incorrect, or whether the reader is being strict and correct and the file itself is formatted incorrectly.

Thank you very much for clarification.
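
As a workaround that does not depend on how the question is settled, features can be grouped by seqid before further processing. The Python sketch below assumes tab-separated feature lines and a hypothetical function name.

```python
from collections import defaultdict


def group_by_seqid(lines: list[str]) -> dict[str, list[str]]:
    """Group feature lines by column 1, regardless of their order in the file."""
    groups: dict[str, list[str]] = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence data follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        seqid = line.split("\t", 1)[0]
        groups[seqid].append(line)
    return dict(groups)


unsorted_lines = [
    "##gff-version 3",
    "chr1\t.\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chr2\t.\tgene\t1\t100\t.\t+\t.\tID=g2",
    "chr1\t.\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1",
    "chr2\t.\tmRNA\t1\t100\t.\t+\t.\tID=m2;Parent=g2",
]
for seqid, features in group_by_seqid(unsorted_lines).items():
    print(seqid, len(features))
```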

Code chunk formatting problem

Note that there is a formatting problem in the last line of the 3rd code chunk of the Trans-spliced transcript section:

chrX  . exon               XXXX YYYY  .  +  . Parent=tran01 chrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01

It looks like two lines (exon and CDS) have been concatenated.

Thanks!
