Motivation
The gff-toolbox convert module is capable of converting a GFF to a MongoDB database; however, it seems that we cannot manipulate the GFF information stored in the database without relying on raw MongoDB commands. A routine task in many analyses is the insertion of new information into a GFF, e.g., annotating a gene/transcript. This task could be done with many tools, such as gffutils, BCBio, or even a bash/other-language script, by including new attributes in column 9 of the raw GFF, using as input a file stating which set of annotations (e.g. GO, PFAM, EC number) correlates with each gene. However, the same annotation task can also be done in a different way: converting the GFF to MongoDB, inserting the annotations into the corresponding MongoDB collection and, if desired, converting back to GFF afterwards. Although this may seem more involved than annotating a raw GFF, and may consume more computational resources, it has some advantages:
- Possibility to include a description, link, or other metadata related to an annotation. The GFF format spec declares the column-9 fields Ontology_term and Dbxref to accommodate, respectively, annotations from GO/ontology servers and from other databases (e.g. PFAM, PANTHER, EC). Even so, GFF lacks a description field for each annotated term, something that can easily be done in MongoDB. Admittedly, descriptions can be squeezed into the GFF description field, but that leads to my next GFF issue: a noisy/polluted GFF;
- Clean visualization/reading of gene attributes in GFFs/genomes that carry an enormous quantity of annotations; and
- Going beyond GFFs. In some situations we would like to export the information contained in a GFF to a different format. For instance, the higlass visualization tool requires a RefSeq-style format to display genes. Generating a file with those specs from a stored MongoDB collection is easier than manipulating a raw GFF.
Proposed solution
The list of advantages and disadvantages of using MongoDB as an intermediate store for annotations is probably longer than I can think of, but I see this approach as a facilitator. Hence, I propose a new gff-toolbox module to perform this task, i.e. annotate a MongoDB collection created by gff-toolbox convert. In the following I will try to explain the main architecture of this module, which at first I have named ingest.
We would like the ingest module to receive a set of annotations and include them in the corresponding gene/transcript entry in MongoDB. Thus, assume that the MongoDB was created by the gff-toolbox convert module - parameters XXX; XXX; - and that we also have a tab-separated txt/tsv file with annotations such as the following:
##ID Id IdType description
gene-KPHS_00170 PTHR30520:SF0 PANTHER TRANSPORTER-RELATED
gene-KPHS_00170 GO:0006810 GO transport
gene-KPHS_00170 3.4.16.2 EC Lysosomal Pro-Xaa carboxypeptidase
gene-KPHS_00170 GO:0005215 GO transporter activity
gene-KPHS_02590 GO:0003735 GO structural constituent of ribosome
gene-KPHS_02590 PTHR36029 PANTHER
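For illustration, a minimal sketch (in Python; the helper name `parse_annotation_table` and the exact column handling are my assumptions, not existing gff-toolbox code) of how such a table could be parsed and grouped by feature ID:

```python
import csv
import io
from collections import defaultdict

def parse_annotation_table(handle):
    """Group annotation rows by feature ID.

    Expects tab-separated columns: feature ID, term ID, term type
    (the DBTAG) and an optional free-text description. Lines
    starting with '##' are treated as headers and skipped.
    """
    annotations = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("##"):
            continue
        feature_id, term_id, term_type = row[0], row[1], row[2]
        description = row[3] if len(row) > 3 else ""
        annotations[feature_id].append(
            {"DBTAG": term_type, "ID": term_id, "Description": description}
        )
    return dict(annotations)

# Example with two of the rows shown above
table = (
    "##ID\tId\tIdType\tdescription\n"
    "gene-KPHS_00170\tGO:0006810\tGO\ttransport\n"
    "gene-KPHS_00170\tPTHR30520:SF0\tPANTHER\tTRANSPORTER-RELATED\n"
)
parsed = parse_annotation_table(io.StringIO(table))
```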
Inspecting the MongoDB document that corresponds to gene-KPHS_00170, we can retrieve the JSON listing its information:
{'_id': ObjectId('612e788a94ee11baab643fb0'),
'recid': 'NC_016845.1',
'source': 'RefSeq',
'type': 'gene',
'start': '22533',
'end': '22802',
'score': '.',
'strand': '+',
'phase': '.',
'attributes': {'ID': 'gene-KPHS_00170',
'Dbxref': 'GeneID:11844995',
'Name': 'KPHS_00170',
'gbkey': 'Gene',
'gene_biotype': 'protein_coding',
'locus_tag': 'KPHS_00170'}}
The aim of the proposed gff-toolbox ingest module is to insert the annotations into the corresponding gene in MongoDB. After this procedure, we would like the MongoDB entry for gene-KPHS_00170 to be stored as:
{'_id': ObjectId('612e788a94ee11baab643fb0'),
'recid': 'NC_016845.1',
'source': 'RefSeq',
'type': 'gene',
'start': '22533',
'end': '22802',
'score': '.',
'strand': '+',
'phase': '.',
'attributes': {'ID': 'gene-KPHS_00170',
'Dbxref': [ 'GeneID:11844995' ,
{'DBTAG': 'PANTHER', 'ID': 'PTHR30520:SF0', 'Description': 'FORMATE TRANSPORTER-RELATED'},
{'DBTAG': 'PANTHER', 'ID': 'PTHR30520', 'Description': 'FORMATE TRANSPORTER-RELATED'},
{'DBTAG': 'PFAM', 'ID': 'PF01226', 'Description': 'Formate/nitrite transporter'}
],
'Ontology_term': [ {'DBTAG': 'GO', 'ID': 'GO:0006810', 'Description': 'transport'},
{'DBTAG': 'GO', 'ID': 'GO:0016020', 'Description': 'membrane'},
{'DBTAG': 'GO', 'ID': 'GO:0005215', 'Description': 'transporter activity'}
],
'Name': 'KPHS_00170',
'gbkey': 'Gene',
'gene_biotype': 'protein_coding',
'locus_tag': 'KPHS_00170'}}
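A minimal, database-free sketch of the merge step that could produce an entry like the one above (the function name and the routing of GO terms to Ontology_term versus everything else to Dbxref are my assumptions about how ingest might behave):

```python
def merge_annotations(attributes, annotations):
    """Append annotation objects to a feature's attributes dict.

    GO terms go under 'Ontology_term'; all other databases go
    under 'Dbxref'. A pre-existing scalar value (e.g. the raw
    'GeneID:11844995' string) is kept as the first list element.
    """
    for ann in annotations:
        key = "Ontology_term" if ann["DBTAG"] == "GO" else "Dbxref"
        current = attributes.get(key)
        if current is None:
            attributes[key] = [ann]
        elif isinstance(current, list):
            current.append(ann)
        else:  # scalar left over from the original GFF parse
            attributes[key] = [current, ann]
    return attributes

attrs = {"ID": "gene-KPHS_00170", "Dbxref": "GeneID:11844995"}
merge_annotations(attrs, [
    {"DBTAG": "PANTHER", "ID": "PTHR30520:SF0", "Description": "TRANSPORTER-RELATED"},
    {"DBTAG": "GO", "ID": "GO:0006810", "Description": "transport"},
])
```

Against a live database the same merge would presumably translate to a pymongo `update_one` with a `$push` on the document matching `attributes.ID`; the in-memory version above is only meant to show the target schema.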
According to the GFF3 spec, "two reserved attributes, Ontology_term and Dbxref, can be used to establish links between a GFF3 feature and a data record contained in another database" (i.e. annotations). Also, "the value of both Ontology_term and Dbxref is the ID of the cross referenced object in the form "DBTAG:ID". The DBTAG indicates which database the referenced object can be found in, and ID indicates the identifier of the object within that database". Therefore, in the MongoDB schema we include an object for each annotation declaring DBTAG, ID and optional fields such as Description. Unfortunately, this is not the JSON schema produced by the gff-toolbox convert module: the Dbxref entries generated when parsing a GFF into MongoDB do not separate the DBTAG and ID fields. We can fix this by simply adjusting the code to separate those fields before inserting the JSON into the MongoDB collection. I propose to fix this, but I need to know whether it could break any other gff-toolbox module. @fmalmeida, can it?
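The DBTAG/ID separation mentioned above could be as simple as splitting on the first colon (a sketch; note that identifiers that themselves contain colons, such as GO terms, may need special-casing to match the schema shown earlier):

```python
def split_dbxref(value):
    """Split a 'DBTAG:ID' cross-reference string into its parts.

    'GeneID:11844995' -> {'DBTAG': 'GeneID', 'ID': '11844995'}
    Values without a colon are returned with an empty DBTAG.
    """
    if ":" not in value:
        return {"DBTAG": "", "ID": value}
    dbtag, obj_id = value.split(":", 1)
    return {"DBTAG": dbtag, "ID": obj_id}
```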
Another suggestion, on which I would like your opinion: should we decouple the "ingestion" of annotations into MongoDB - the solution proposed in this issue - from the "digestion" of a MongoDB collection back into a GFF/other file format? Another gff-toolbox module, or even gff-toolbox convert itself, could be the answer here.
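As a rough sketch of what the "digestion" side could look like, here is a hypothetical function (name and flattening logic are my assumptions) that serializes one MongoDB feature document back into a GFF3 line, converting annotation objects back to DBTAG:ID form:

```python
def _to_dbtag_id(entry):
    """Render one annotation back as DBTAG:ID (strings pass through)."""
    if isinstance(entry, str):
        return entry
    # some IDs (e.g. GO:0006810) already carry their DBTAG prefix
    if entry["ID"].startswith(entry["DBTAG"] + ":"):
        return entry["ID"]
    return f'{entry["DBTAG"]}:{entry["ID"]}'

def doc_to_gff_line(doc):
    """Serialize one MongoDB feature document into a GFF3 line.

    Description fields are dropped, since GFF3 column 9 has no
    standard place for them.
    """
    parts = []
    for key, value in doc["attributes"].items():
        if isinstance(value, list):
            parts.append(f'{key}={",".join(_to_dbtag_id(v) for v in value)}')
        else:
            parts.append(f"{key}={value}")
    columns = [
        doc["recid"], doc["source"], doc["type"], doc["start"],
        doc["end"], doc["score"], doc["strand"], doc["phase"],
        ";".join(parts),
    ]
    return "\t".join(columns)

doc = {
    "recid": "NC_016845.1", "source": "RefSeq", "type": "gene",
    "start": "22533", "end": "22802", "score": ".",
    "strand": "+", "phase": ".",
    "attributes": {
        "ID": "gene-KPHS_00170",
        "Ontology_term": [
            {"DBTAG": "GO", "ID": "GO:0006810", "Description": "transport"},
        ],
    },
}
line = doc_to_gff_line(doc)
```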
@fmalmeida, let me know what you think about it and whether I can submit the pull request - I already have some code that can be adjusted to become the aforementioned gff-toolbox ingest module.