specifications's People

Contributors

alexhenrie, barrymoore, juke34, lbergelson, nicoleruiz, nsoranzo, srynobio, thefferon

specifications's Issues

Phase missing in example section GFF3

According to the specification, CDS always requires a phase to be annotated.

This requirement is violated in the example section. This should be fixed, as it's not good for a specification to be in violation of itself.

Questions/Clarifications for next GFF3 version

I use GFF3 as the primary export format for hundreds of prokaryotic and eukaryotic genomes and, while the structure is generally well defined in the specification for coding genes, it would be great to have some clarifications and even best practices for standards purposes in a future release. Considerations include:

  1. Non-coding gene encoding in GFF. This should include examples of tRNAs, rRNAs, etc. What does the gene graph look like for these?
  2. Functional annotation standards. In the 9th column, can we decide on some standardized keys for things like gene product names and gene symbols? Others, such as GO terms and EC numbers, are already well described using Dbxrefs, but even these could be expanded to allow attribution of the sources of these terms as well as GO evidence codes.

As long as these are not formally part of the specification, large organizations such as EMBL, and now NCBI with its GFF3 support, are free to develop competing standards.

What is the "reserved meaning" of an ampersand (&) in column 9?

From the spec (emphasis mine):

In addition, the following characters have reserved meanings in column 9 and must be escaped when used in other contexts:

; semicolon (%3B)
= equals (%3D)
& ampersand (%26)
, comma (%2C)

and

Column 9: "attributes"
A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ",=;". ...

I understand that semicolons (;) and equals signs (=) are for separating tag-value pairs, and that commas (,) are used when multiple values are assigned to a single tag, but what are ampersands (&) for? I don't see it mentioned anywhere in the spec, nor the pathological cases, and my best attempts to Google it have come up empty.

Thanks for any insight!
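
Whatever the ampersand is reserved for, a writer can stay on the safe side by escaping every character the quoted passage lists. The following is a minimal sketch of that idea in Python. The helper names are hypothetical, and the extra escaping of `%`, tab, newline and carriage return is an assumption based on the general escaping rules of the format, not on the passage quoted above.

```python
from urllib.parse import unquote

# Characters listed as reserved in column 9 in the quoted passage.
# Assumption: '%' is escaped as well so that decoding stays unambiguous,
# and tab/newline/carriage return are escaped because they delimit
# columns and records.
RESERVED = ";=&,%\t\n\r"


def escape_attr(text):
    """Percent-encode only the reserved characters, leaving the rest as-is."""
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in RESERVED else ch for ch in text
    )


def unescape_attr(text):
    """Decode any %XX sequences back to the original characters."""
    return unquote(text)


if __name__ == "__main__":
    raw = "R&D; x=1,2"
    print(escape_attr(raw))  # R%26D%3B x%3D1%2C2
```

Escaping only the listed characters, rather than everything a generic URL encoder would touch, keeps spaces and other readable characters intact in the output.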

Application for a text/gff3 Media Type ?

Hello,

Would you be interested in registering a text/gff3 media type with IANA? I have been through the process for two applications (within the Debian project) and found it quite easy at the time. I see that some bioinformatics file formats have their magic numbers registered in databases such as magic and shared-mime-info. While registration with IANA is not just about magic numbers, it would nicely close the loop.

The submission form is here: https://www.iana.org/form/media-types

In brief, the submission could be around the following lines:

  • Type Name: text
  • Subtype Name: gff3
  • Required Parameters: None
  • Optional Parameters: None
  • Encoding Considerations: 8-bit, Unicode or Latin-1 recommended, URL escaping of some whitespace and delimiter characters.
  • Security Considerations: Are there potential issues? Is it possible to craft a GFF3 file so that parsing enters an infinite loop, etc.?
  • Interoperability Considerations
  • Published specification: this GitHub repository ?
  • Application Usage: Genome browsers, bioinformatics tools, database dumps, ...
  • Fragment Identifier Considerations: None
  • Restrictions on Usage: None
  • Provisional Registrations: text/gff3
  • Additional Information:
    • Deprecated alias names for this type: text/x-gff3
    • Magic number(s): ##gff-version 3
    • File extension(s): gff3
    • Macintosh File Type Code(s): None
    • Object Identifier(s) or OID(s): None
  • Intended Usage: Common

Have a nice day,

Charles Plessy
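
To make the "Magic number(s)" entry above concrete, here is a minimal sketch of how a tool might sniff a file using the proposed magic string. The function name is hypothetical, and accepting a dotted version such as `3.1.26` after the `3` is an assumption about how version lines are written in practice.

```python
MAGIC = b"##gff-version 3"


def looks_like_gff3(head):
    """Return True if the first bytes of a file match the proposed magic number.

    The byte following the magic string must end the version number or
    continue it with a dot, so that e.g. '##gff-version 30' is rejected.
    """
    if not head.startswith(MAGIC):
        return False
    following = head[len(MAGIC):len(MAGIC) + 1]
    return following in (b"", b".", b"\n", b"\r", b" ", b"\t")


if __name__ == "__main__":
    print(looks_like_gff3(b"##gff-version 3\nchr1\t.\tgene\t1\t10\t.\t+\t.\tID=g1\n"))
```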

typo error

Within the paragraph about Column 3: "type"
There is a) and c), but b) is missing. I found the description given by Scott Cain at http://gmod.org/wiki/GFF3 clearer.

Precision about the 9th column

In my opinion, the "Column 9: attributes" paragraph should state clearly, where the ID tag is first mentioned, that ID attributes are only mandatory for features that have children (and similarly for the Parent...).

This is only mentioned later in the text: "The ID attributes are only mandatory for those features that have children".

Trailing semicolons at GFF3 attributes should be avoided or ignored?

Should GFF3 writers avoid trailing semicolons, or should GFF3 readers ignore them?

In my use case, for example, rtracklayer casually adds trailing semicolons, and JBrowse 2 cannot ignore them, which results in an error. Which behavior should be fixed? I guess both should.

The current spec only states "Multiple tag=value pairs are separated by semicolons".
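
For what it's worth, a lenient reader is cheap to write. Below is a minimal sketch of one, assuming that an empty segment between or after semicolons carries no meaning and can be dropped. The function name and the `strict` flag are hypothetical and not taken from rtracklayer, JBrowse 2 or any other tool.

```python
from urllib.parse import unquote


def parse_attributes(column9, strict=False):
    """Parse a column 9 string into {tag: [values]}.

    With strict=False, empty segments (such as the one produced by a
    trailing ';') are skipped. With strict=True they raise ValueError.
    """
    attrs = {}
    for pair in column9.strip().split(";"):
        pair = pair.strip()
        if not pair:
            if strict:
                raise ValueError("empty attribute segment (stray ';'?)")
            continue
        tag, sep, value = pair.partition("=")
        if not sep:
            raise ValueError("attribute without '=': {!r}".format(pair))
        # Split on commas before unescaping so that an escaped comma
        # (%2C) inside a value is not mistaken for a separator.
        attrs[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attrs


if __name__ == "__main__":
    print(parse_attributes("ID=cds01;Parent=tran01;"))
```

A writer that joins pairs with `";".join(...)` never emits the trailing semicolon in the first place, so fixing both sides is straightforward.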

Clarification of GFF3 "Programmed frameshift" example

Hi, in the "Programmed frameshift" example (excerpt below), I'm confused about the phase of the second record:

chrX  . CDS                XXXX   YYYY .  +  0 ID=cds01;Parent=tran01
chrX  . CDS                YYYY-1 ZZZZ .  +  1 ID=cds01;Parent=tran01

I'm not an expert but shouldn't the phase of the second segment always be 0? From my informal survey of a couple dozen examples of ribosomal slippage found in NCBI's human and mouse GFF3 downloads, it is true that the second segment always has phase 0. Is it possible that NCBI's "ribosomal slippage" is just one subtype of programmed frameshift that is more strict than the general case?

GFF3: Phase > feature length not clearly defined

The GFF3 specification is not really clear about how to treat CDS features of 1bp length that have a phase of 2 defined. According to

The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region.

it is apparently assumed that length >= phase, so that the number of bases indicated by the phase can be 'skipped'. However, we have encountered cases in draft genomes (cf. genometools/genometools#793) where such short CDS show up.
Am I correct in assuming that in such cases the remaining phase shift is supposed to be 'carried forward' to the next CDS?
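
The 'carried forward' reading can be written down as a small calculation. The sketch below is one possible interpretation, not something the specification confirms: it derives the phase of the next CDS segment from the length and phase of the current one, and modular arithmetic makes the case length < phase fall out naturally. The function names are hypothetical.

```python
def next_phase(length, phase):
    """Phase of the following CDS segment under the carry-forward reading.

    (length - phase) is the number of bases of this segment that belong to
    codons starting in it; when it is negative, the leftover bases still to
    be skipped are carried into the next segment.
    """
    return (3 - (length - phase) % 3) % 3


def phases(lengths, first_phase=0):
    """Phases for consecutive CDS segments, given in transcription order."""
    result = []
    phase = first_phase
    for length in lengths:
        result.append(phase)
        phase = next_phase(length, phase)
    return result


if __name__ == "__main__":
    # A 1 bp CDS with phase 2: one base is skipped here, one is left over,
    # so the next segment gets phase 1.
    print(phases([1, 50], first_phase=2))  # [2, 1]
```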

Difference between phase and frame is unclear in the GFF3 spec

https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md#readme

In section Column 8: "phase" it says:

This is NOT to be confused with the frame, which is simply start modulo 3.
What does "start" refer to? The start column of the GFF3 file? In that case it would be the start position of the entire chromosome? Or does it refer to the start position of the start-codon?

An example that shows the differences between phase and frame would be appreciated.
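
As a starting point for such an example, here is a minimal sketch contrasting the two quantities. It assumes that "start" means the value in the start column (column 4) of each line, which is exactly the point the question above asks the spec to confirm, and it uses made-up coordinates for a two-segment CDS on the plus strand.

```python
def frame(start):
    """Frame as 'start modulo 3', assuming 'start' is the column 4 value."""
    return start % 3


def plus_strand_phases(segments):
    """Phase of each CDS segment, given (start, end) pairs in 5' to 3' order.

    Assumes the first segment begins on a codon boundary (phase 0).
    """
    phases = []
    phase = 0
    for start, end in segments:
        phases.append(phase)
        length = end - start + 1
        phase = (3 - (length - phase) % 3) % 3
    return phases


if __name__ == "__main__":
    cds = [(100, 110), (201, 250)]       # hypothetical coordinates
    shifted = [(101, 111), (202, 251)]   # same gene moved by one base

    print([frame(s) for s, _ in cds])        # [1, 0]
    print(plus_strand_phases(cds))           # [0, 1]
    print([frame(s) for s, _ in shifted])    # [2, 1]
    print(plus_strand_phases(shifted))       # [0, 1]
```

Under this reading, the frame depends on where the feature sits on the sequence, while the phase depends only on how the coding segments divide into codons, which is why shifting the whole gene by one base changes the frames but not the phases.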

Gap attribute - CIGAR description - dead link

Hi,

First of all, the link to the Exonerate documentation in the Gap attribute paragraph doesn't work.
Secondly, the Exonerate manual web page no longer describes exactly what was available in the past.
It describes the CIGAR format and explains the operators as follows:

Operator Description
M Match
C Codon
G Gap
N Non-equivalenced region
5 5' splice site
3 3' splice site
I Intron
S Split codon
F Frameshift

The Samtools-style CIGAR format, which can be found everywhere on the internet, looks like this:

Operator Description
D Deletion; the nucleotide is present in the reference but not in the read
H Hard Clipping; the clipped nucleotides are not present in the read.
I Insertion; the nucleotide is present in the read  but not in the reference.
M Match; can be either an alignment match or mismatch. The nucleotide is present in the reference.
N Skipped region; a region of nucleotides is not present in the read
P Padding; padded area in the read and not in the reference
S Soft Clipping;  the clipped nucleotides are present in the read
X Read Mismatch; the nucleotide is present in the reference
= Read Match; the nucleotide is present in the reference

Meanwhile, older resources such as
the 2004 FlyBase page here: http://rice.bio.indiana.edu:7082/annot/gff3.html
and the 2010 WormBase page here: http://wiki.wormbase.org/index.php/GFF3specProposal
describe the format as follows:

Operator Description
M match
I insert a gap into the reference sequence
D insert a gap into the target (delete from reference)
F frameshift forward in the reference sequence
R frameshift reverse in the reference sequence

To gather all the information in one place and not lose any of it, one solution might be to create your own page describing the CIGAR format in full.

Here is the union of the values I have seen in the CIGAR format:

Operator Description
M Match ; can be either an alignment match or mismatch. The nucleotide is present in the reference.
C Codon
G Gap
N Non-equivalenced region
5 5' splice site
3 3' splice site
I Intron / the nucleotide is present in the read but not in the reference. / insert a gap into the reference sequence
S Split codon / Soft Clipping; the clipped nucleotides are present in the read
H Hard Clipping; the clipped nucleotides are not present in the read
F Frameshift / frameshift forward in the reference sequence
D Deletion; the nucleotide is present in the reference but not in the read / insert a gap into the target (delete from reference)
P Padding; padded area in the read and not in the reference
X Read Mismatch; the nucleotide is present in the reference
= Read Match; the nucleotide is present in the reference
R frameshift reverse in the reference sequence
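
To show why the operator semantics matter, here is a minimal sketch of a parser for the Gap attribute that uses only the M/I/D/F/R meanings from the FlyBase/WormBase table above (I inserts a gap into the reference, D inserts a gap into the target). The function names are hypothetical, and frameshift operators are parsed but deliberately not interpreted when computing lengths.

```python
import re

# One operation: a single letter followed by a positive length, e.g. "M8".
_OP = re.compile(r"^([MIDFR])(\d+)$")


def parse_gap(gap):
    """Split a Gap value such as 'M8 D3 M6 I1 M6' into (operator, length) pairs."""
    ops = []
    for token in gap.split():
        match = _OP.match(token)
        if match is None:
            raise ValueError("unrecognised Gap operation: {!r}".format(token))
        ops.append((match.group(1), int(match.group(2))))
    return ops


def aligned_lengths(ops):
    """Return (reference_length, target_length) covered by M/I/D operations."""
    ref = tgt = 0
    for op, n in ops:
        if op == "M":      # match: advances both sequences
            ref += n
            tgt += n
        elif op == "I":    # gap in the reference: only the target advances
            tgt += n
        elif op == "D":    # gap in the target: only the reference advances
            ref += n
        else:              # F/R frameshifts are not handled in this sketch
            raise ValueError("frameshift operator {!r} not handled".format(op))
    return ref, tgt


if __name__ == "__main__":
    ops = parse_gap("M8 D3 M6 I1 M6")
    print(aligned_lengths(ops))  # (23, 21)
```

A parser written against the Samtools table would read `I` and `D` relative to the read instead, and one written against the Exonerate table would read `I` as an intron, which is why a single authoritative description would help.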

SO subclasses of "match" incorrect

Hi,

Under the alignments section in the GFF3 spec, you list the non-existent "nucleotide_to_protein_match" term as a subclass of "match" in the SO.
I think the closest term would be "protein_match".
"nucleotide_motif" is also not a subclass of "match".

Minor details.

GVF MD file

Seems like your markdown has gone all wonky there and things aren't rendering correctly.

Clarification on the use of the sequence region directive.

Hi there!

We've recently encountered an issue while sharing GFF3 files with colleagues using Geneious.
I'm piecing this together from my colleague's interactions with Geneious support, but it seems that including ##sequence-region lines with a start coordinate > 1 affects how they display features or extract sequences.
I suspect they've interpreted it as an offset rather than as a boundary check, which is how I had interpreted it.

We typically use genometools to process/tidy GFF3 files, which automatically adds these lines based on the min and max coordinate in the file.

What is the correct way to specify these lines?
Should we just be setting the start and end to be 1 and the sequence length?

Thanks in advance!

Darcy
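
For reference, the "boundary check" reading described above can be sketched as follows. This is only an illustration of that interpretation, with hypothetical function names and made-up coordinates. It says nothing about how Geneious or genometools actually treat the directive.

```python
def parse_sequence_region(line):
    """Parse '##sequence-region seqid start end' into (seqid, start, end)."""
    parts = line.split()
    if len(parts) != 4 or parts[0] != "##sequence-region":
        raise ValueError("not a sequence-region directive: {!r}".format(line))
    return parts[1], int(parts[2]), int(parts[3])


def out_of_bounds(lines):
    """Return feature lines that fall outside their declared sequence-region.

    Feature coordinates are compared as-is with the declared bounds;
    no offset is applied to them.
    """
    regions = {}
    bad = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##sequence-region"):
            seqid, start, end = parse_sequence_region(line)
            regions[seqid] = (start, end)
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            seqid, start, end = cols[0], int(cols[3]), int(cols[4])
            if seqid in regions:
                low, high = regions[seqid]
                if start < low or end > high:
                    bad.append(line)
    return bad


if __name__ == "__main__":
    example = [
        "##gff-version 3",
        "##sequence-region ctg1 100 500",
        "ctg1\t.\tgene\t150\t400\t.\t+\t.\tID=g1",
        "ctg1\t.\tgene\t50\t120\t.\t+\t.\tID=g2",
    ]
    print(out_of_bounds(example))
```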

Allow a mapping between labels and ontology term IDs in the header of GFF

Currently GFF3 does not use SO IDs. Instead, features are typed by their label. This imposes strictures on SO: it can't change labels without potentially breaking GFF3 usage.

Inspired by JSON-LD contexts we could have in the header declarations of mappings between values in the type column and SO term IDs. E.g:

# context:
#   transcript: SO:0000673
#   exon: ...
#   ...

Note that this would need to be widely implemented before SO would be able to change labels, but this would be a step in the right direction
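
To gauge how little a reader would need to implement, here is a minimal sketch that parses such a header. The `# context:` syntax is only the proposal sketched above, not part of GFF3, and the function name is hypothetical.

```python
def parse_context(lines):
    """Collect label -> SO ID mappings from a proposed '# context:' header block.

    Reading stops at the first non-comment line. Entries that do not look
    like 'label: SO:nnnnnnn' are skipped rather than guessed at.
    """
    mapping = {}
    in_context = False
    for line in lines:
        if not line.startswith("#"):
            break
        if line.startswith("##"):
            # A regular directive ends any context block.
            in_context = False
            continue
        body = line[1:].strip()
        if body == "context:":
            in_context = True
            continue
        if in_context:
            label, sep, term = body.partition(":")
            term = term.strip()
            if sep and term.startswith("SO:"):
                mapping[label.strip()] = term
    return mapping


if __name__ == "__main__":
    header = [
        "##gff-version 3",
        "# context:",
        "#   transcript: SO:0000673",
        "chr1\t.\ttranscript\t1\t100\t.\t+\t.\tID=t1",
    ]
    context = parse_context(header)
    print(context.get("transcript"))  # SO:0000673
```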

Is there a single canonical validator, or multiple implementations?

The spec points here: https://github.com/modENCODE-DCC/validator/blob/master/new_gff_validator.pl

This is 7-year-old Perl code.

The SO wiki has:
http://www.sequenceontology.org/so_wiki/index.php/GFF3_Validation_Tools

which lists GFFO (not in use?), FALDO (not really a validator) and the modENCODE validator. The modENCODE validator link doesn't work, but it seems to be this code:
https://github.com/genometools/genometools
which is in C

Reciprocal ticket: genometools/genometools#910

There is a question here:
https://www.biostars.org/p/177319/
which points to another validator, this one in Python: http://www.raetschlab.org/suppl/gff-tools

Which of these is supported? Is the behavior identical? What expectations does each have on the SO obo file?

I don't think the spec should link to specific validators. However, the spec should indicate the expected behavior of the validator. This could be modularized into different checks, and we could group checks into profiles. E.g. some validators may only validate a basic syntactic profile. Others could validate a sofa profile, where we check that the type column maps to a SO ID.

Understanding how validators use relationships is important for maintenance of SO:
The-Sequence-Ontology/SO-Ontologies#465

There could be a validator registry separate from the spec, and defined conformance tests for the validators
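
As a rough illustration of what modular checks grouped into profiles could look like, here is a minimal sketch. The check names, the profile names and the error messages are all hypothetical, and the feature line in the usage example uses made-up coordinates. An SO-aware profile would add a check that looks up the type column in the ontology, which is omitted here.

```python
VALID_STRANDS = {"+", "-", ".", "?"}
VALID_PHASES = {"0", "1", "2", "."}


def check_coordinates(cols):
    try:
        start, end = int(cols[3]), int(cols[4])
    except ValueError:
        return ["start and end must be integers"]
    if start < 1 or start > end:
        return ["start must be >= 1 and <= end"]
    return []


def check_strand(cols):
    return [] if cols[6] in VALID_STRANDS else ["invalid strand"]


def check_phase(cols):
    if cols[7] not in VALID_PHASES:
        return ["phase must be 0, 1, 2 or '.'"]
    if cols[2] == "CDS" and cols[7] == ".":
        return ["CDS features require a phase"]
    return []


# Each profile is just an ordered list of checks.
PROFILES = {
    "syntax": [check_coordinates, check_strand],
    "syntax+phase": [check_coordinates, check_strand, check_phase],
}


def validate_line(line, profile="syntax"):
    """Run every check of the chosen profile on one feature line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return ["expected 9 tab-separated columns"]
    errors = []
    for check in PROFILES[profile]:
        errors.extend(check(cols))
    return errors


if __name__ == "__main__":
    line = "chrX\t.\tCDS\t100\t200\t0\t+\t.\tID=cds01;Parent=tran01"
    print(validate_line(line, profile="syntax+phase"))
```

Conformance tests for a registry could then be expressed as pairs of input lines and expected error lists per profile.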

Citing the GFF3 spec

I would like to cite this specification in a paper. Do you have any suggestions?

Could you add the repo to Zenodo to get a citable object?

'@' Variant_seq option defined in introduction but not in Variant_seq definition section

From the introductory material, "A Quick Explanation and Example of GVF Content", under "attributes:Variant_seq":

@ (at): An alias for the sequence found in the Reference_seq attribute.

Other options include: ., -, ~, !, and ^.

The full 'Variant_seq' section, where this required term is defined, reads:

In addition to the observed nucleic acid sequence, several other characters (.-~@!^) are valid values in the Variant_seq attribute. Use of these characters is described below with examples.

However the '@' option is not defined in the examples along with the other characters.

Possibly related: '@' is not accepted as a valid entry for 'Variant_seq' by the current gvf_validator.pl script.

Protein to nucleotide matches over introns

Hi there!

I've been looking at translating protein-vs-protein or HMM-vs-protein alignments to genome coordinates in GFF format for easy browsing by colleagues and gene curation.
But it's unclear to me how matches across introns are supposed to be done if we want to include Gap annotations.
The issue is that your guidance on Gap for protein matches is to have M, I, D operations in AA length.
If the match goes over an intron and the CDS/intron isn't a multiple of 3, how should that be specified?

I see three possible ways, but none of them quite seems to do it.

  1. Use a single "protein match" as a feature. Use "I" operations for introns and correct the frame at either side of the intron using R and F operations, respectively.
  2. Use match_parts mirroring the matched CDS structure in the aligned region, and correct the frame at the beginning and end using R and F operations.
  3. Use a translated_nucleotide_match SO type instead. But this might contradict your guidance on using Gaps with proteins

I'm just not really sure that the frameshift operators are meant to be used this way.
Exonerate for example has a split codon (S) operation.

Obviously this isn't super critical. We could just use "match_part"s and forget about the Gaps.
I'm adopting option 3 split into match_parts for now, because more genome browsers should display it properly.
But I like trying to stick to norms, so it would be nice to have an example in the docs. You have a few for nucleotide matches, but nothing for proteins.

Thanks in advance,
Darcy

PS. I think maybe your example for the frame shift is off by one in both of the second lines of the alignment?

Programmed frameshift example

For the CDS features in the Programmed frameshift example:

chrX  . CDS                XXXX   YYYY 0  +  . ID=cds01;Parent=tran01
chrX  . CDS                YYYY-1 ZZZZ 1  +  . ID=cds01;Parent=tran01

Is it intended for the "0" and "1" to be in the phase (8th) column, rather than the score (6th) column? i.e.:

chrX  . CDS                XXXX   YYYY .  +  0 ID=cds01;Parent=tran01
chrX  . CDS                YYYY-1 ZZZZ .  +  1 ID=cds01;Parent=tran01

Multiple sequences in a single file

Hello!

It is not clear from the specification whether a GFF3 file should be sorted by seqid when multiple seqids are present in the file.

I received a file where this is not the case: first there are lines of type gene for multiple seqids, and then multiple mRNA lines with the same set of seqids, whose parents are the genes described above.

The reader I use (the scikit-bio read function) treats each occurrence of a seqid as a new name. If a specific sequence ID is provided, it reads only the first record (I presume because it encounters a different seqid after that).

So my problem is that, because this is not specified, I cannot tell whether the reader's behaviour is incorrect, or whether it is being strict and correct and the file itself is formatted incorrectly.

Thank you very much for clarification.
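
Until the specification settles this, a reader can sidestep the question by not relying on order at all. The following is a minimal sketch of that approach, with a hypothetical function name and made-up feature lines. It does not describe how the scikit-bio reader works.

```python
from collections import defaultdict


def group_by_seqid(lines):
    """Collect feature lines per seqid, regardless of their order in the file."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence data follows, no more features
        if not line.strip() or line.startswith("#"):
            continue
        seqid = line.split("\t", 1)[0]
        groups[seqid].append(line)
    return dict(groups)


if __name__ == "__main__":
    unsorted_file = [
        "##gff-version 3",
        "chr1\t.\tgene\t1\t100\t.\t+\t.\tID=g1",
        "chr2\t.\tgene\t1\t100\t.\t+\t.\tID=g2",
        "chr1\t.\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1",
        "chr2\t.\tmRNA\t1\t100\t.\t+\t.\tID=m2;Parent=g2",
    ]
    for seqid, features in group_by_seqid(unsorted_file).items():
        print(seqid, len(features))
```

The trade-off is memory: grouping requires holding all features until the end of the file, whereas a reader that assumes sorted input can stream one seqid at a time.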

Code chunk formatting problem

Note that there is a formatting problem in the last line of the 3rd code chunk of the Trans-spliced transcript section:

chrX  . exon               XXXX YYYY  .  +  . Parent=tran01 chrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01

It looks like 2 lines (exon and CDS) have been concatenated together.

Thanks!
