i2bc / orfmine

ORFmine is an open-source tool for identifying and analyzing all Open Reading Frames (ORFs) in genomic data, focusing on their sequences, structures, evolution and translation activities.

Home Page: https://i2bc.github.io/ORFmine/

License: MIT License

Shell 0.53% Python 96.41% Dockerfile 0.31% Batchfile 0.01% R 2.74%
bioinformatics-tool de-novo-genes docker genomics non-coding-region python3

orfmine's Introduction

Overview

OMICS studies attribute a new role to the noncoding genome in the production of novel peptides. The widespread transcription of noncoding regions and the pervasive translation of the resulting RNAs offer a vast reservoir of novel peptides to the organisms.

ORFmine [1, 2] is an open-source package that extracts, annotates, and characterizes the sequence and structural properties of all Open Reading Frames (ORFs) of a genome (including coding and noncoding sequences), along with their translation activity. ORFmine consists of several programs that can be used together or independently:

  • ORFtrack
  • ORFold
  • ORFribo
  • ORFdate

Built with

  • python 3.6
  • miniconda 3
  • pyHCA [3]
  • R
  • bash
  • Docker

All programs and dependencies are listed here.

Getting started

Prerequisites

Installation

Simply pull the ORFmine image from Dockerhub.

For docker:

# pull the ORFmine docker image from Dockerhub
docker pull lopesi2bc/orfmine:latest

For singularity:

# create a directory that will host the singularity image of ORFmine (adapt the location and directory name)
mkdir ~/orfmine

# build a singularity image named orfmine_latest.sif that will be located in ~/orfmine (to adapt)
singularity build ~/orfmine/orfmine_latest.sif docker://lopesi2bc/orfmine:latest

If these commands fail, the cause may be a permissions problem; try running them again prefixed with sudo.

Usage

For usage examples, please check the Quick start section of our documentation page.

Documentation

Our full documentation is accessible here.

Issues

If you have suggestions to improve ORFmine or face technical issues, please post an issue here.

Contact

Anne Lopes - [email protected]

Citing

If you use only ORFtrack

Please cite:

Papadopoulos, C., Chevrollier, N., Lopes, A. Exploring the peptide potential of genomes. Meth. Mol. Biol. (2022).

If you use only ORFold with HCA, IUPred and Tango

Please cite:

Papadopoulos, C., Chevrollier, N., Lopes, A. Exploring the peptide potential of genomes. Meth. Mol. Biol. (2022).

Bitard-Feildel, T. & Callebaut, I. HCAtk and pyHCA: A Toolkit and Python API for the Hydrophobic Cluster Analysis of Protein Sequences. bioRxiv 249995 (2018).

Mészáros, B., Erdős, G. & Dosztányi, Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic acids research 46, W329–W337 (2018).

Linding, R., Schymkowitz, J., Rousseau, F., Diella, F. & Serrano, L. A comparative study of the relationship between protein structure and β-aggregation in globular and intrinsically disordered proteins. Journal of molecular biology 342, 345–353 (2004).

Otherwise, if you use ORFold with a combination of HCA, IUPred and Tango

Please cite:

Papadopoulos, C., Chevrollier, N., Lopes, A. Exploring the peptide potential of genomes. Meth. Mol. Biol. (2022).

For HCA, cite:

Bitard-Feildel, T. & Callebaut, I. HCAtk and pyHCA: A Toolkit and Python API for the Hydrophobic Cluster Analysis of Protein Sequences. bioRxiv 249995 (2018).

For IUPred, cite:

Mészáros, B., Erdős, G. & Dosztányi, Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic acids research 46, W329–W337 (2018).

For Tango, cite:

Linding, R., Schymkowitz, J., Rousseau, F., Diella, F. & Serrano, L. A comparative study of the relationship between protein structure and β-aggregation in globular and intrinsically disordered proteins. Journal of molecular biology 342, 345–353 (2004).

If you use ORFribo or ORFdate

Please cite:

Papadopoulos, C., Arbes, H., Chevrollier, N., Blanchet, S., Cornu, D., Roginski, P., Rabier, C., Atia, S., Lespinet, O., Namy, O., Lopes, A. (submitted).

Licence

The ORFmine project is under the MIT licence. Please check here for more details.

References

  1. Papadopoulos, C., Chevrollier, N., Lopes, A. Exploring the peptide potential of genomes. Meth. Mol. Biol. (2022).
  2. Papadopoulos, C., Arbes, H., Chevrollier, N., Blanchet, S., Cornu, D., Roginski, P., Rabier, C., Atia, S., Lespinet, O., Namy, O., Lopes, A. (submitted).
  3. Bitard-Feildel, T. & Callebaut, I. HCAtk and pyHCA: A Toolkit and Python API for the Hydrophobic Cluster Analysis of Protein Sequences. bioRxiv 249995 (2018).
  4. Dosztányi, Z., Csizmok, V., Tompa, P. & Simon, I. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. Journal of molecular biology 347, 827–839 (2005).
  5. Dosztányi, Z. Prediction of protein disorder based on IUPred. Protein Science 27, 331–340 (2018).
  6. Mészáros, B., Erdős, G. & Dosztányi, Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic acids research 46, W329–W337 (2018).
  7. Fernandez-Escamilla, A.-M., Rousseau, F., Schymkowitz, J. & Serrano, L. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nature biotechnology 22, 1302–1306 (2004).
  8. Linding, R., Schymkowitz, J., Rousseau, F., Diella, F. & Serrano, L. A comparative study of the relationship between protein structure and β-aggregation in globular and intrinsically disordered proteins. Journal of molecular biology 342, 345–353 (2004).
  9. Rousseau, F., Schymkowitz, J. & Serrano, L. Protein aggregation and amyloidosis: confusion of the kinds? Current opinion in structural biology 16, 118–126 (2006).

orfmine's People

Contributors

nchenche, cgpapado, proginski, annelopes

orfmine's Issues

ORFtrack empty lines

ORFtrack sometimes writes empty lines to the output GFF file and then stops with an error when it parses that file to build the summary.
To reproduce the behavior, just use the default options with the attached files (given a txt extension so that they can be uploaded to GitHub).
gff.txt
fasta.txt
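
One possible way to make the summary step tolerant of such lines is to skip blank and comment lines while reading the GFF. The sketch below is only an illustration under that assumption: `iter_gff_records` is a hypothetical helper, not a function taken from ORFtrack.

```python
def iter_gff_records(lines):
    """Yield the tab-separated fields of each GFF record.

    Hypothetical helper: blank lines and comment/directive lines
    (starting with '#') are skipped instead of raising an error.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # ignore empty lines and GFF comments/directives
        yield stripped.split("\t")


example = [
    "##gff-version 3\n",
    "chr1\tsrc\tCDS\t1\t9\t.\t+\t0\tID=a\n",
    "\n",
    "   \n",
    "chr1\tsrc\tCDS\t20\t40\t.\t-\t0\tID=b\n",
]
records = list(iter_gff_records(example))
print(len(records))
```

The same function accepts an open file handle, since iterating over a text file also yields one line at a time.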

install warnings

./install.sh
"WARNING: Skipping orftrack as it is not installed.
...
WARNING: Skipping orfold as it is not installed."

Yet, the installation is successful.

Add different codon tables usage

During the orftrack annotation process, ORFs are first defined from one stop codon to the next. The stop codons and all other codons are hardcoded and come from the standard genetic code. This is problematic if the species (or an organelle such as the human mitochondrion) uses a different genetic code.

The hardcoded genetic code used to define stop codons (among others) needs to be made adjustable so that users can choose the codon table that fits their needs. A parameter could be used to set the desired codon table.

(https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?chapter=tgencodes)

Information from the Biopython package

import Bio.Data.CodonTable as ct

# list every genetic code table known to Biopython
for k, v in ct.generic_by_id.items():
    print(k, v)

# k is the id of the genetic code table (same id as in the NCBI link above)
# v is an instance of the NCBICodonTable class with the following attributes
# (values shown for the last table of the loop):
v.id  # -> 33
v.names  # -> ['Cephalodiscidae Mitochondrial', None]
v.forward_table  # -> {'TTT': 'F', 'UUU': 'F', ...}
v.back_table  # -> {'K': 'AGG', 'N': 'AAU', ...}
v.start_codons  # -> ['TTG', 'UUG', 'CTG', 'CUG', 'ATG', 'AUG', 'GTG', 'GUG']
v.stop_codons   # -> ['TAG', 'UAG']
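
To illustrate why the stop codon set matters, here is a minimal standard-library sketch of stop-to-stop ORF definition with a configurable set of stop codons. It is not ORFtrack's implementation: `stop_to_stop_orfs` and the two constants are hypothetical names, and only one frame of the forward strand is scanned. The stop codons are those of NCBI tables 1 (standard) and 2 (vertebrate mitochondrial).

```python
# Hypothetical constants; stop codons of NCBI genetic code tables 1 and 2.
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})
VERTEBRATE_MITO_STOPS = frozenset({"TAA", "TAG", "AGA", "AGG"})


def stop_to_stop_orfs(seq, stop_codons=STANDARD_STOPS, frame=0):
    """Return (start, end) spans delimited by stop codons in one frame.

    Coordinates are 0-based and end-exclusive; stop codons themselves
    are excluded from the spans.
    """
    seq = seq.upper()
    spans = []
    start = frame
    for pos in range(frame, len(seq) - 2, 3):
        if seq[pos:pos + 3] in stop_codons:
            if pos > start:
                spans.append((start, pos))
            start = pos + 3
    # keep the trailing segment that is not closed by a stop codon
    end = frame + ((len(seq) - frame) // 3) * 3
    if end > start:
        spans.append((start, end))
    return spans


seq = "ATGAAATGACCCAGATTTTAA"  # ATG AAA TGA CCC AGA TTT TAA
print(stop_to_stop_orfs(seq))                         # [(0, 6), (9, 18)]
print(stop_to_stop_orfs(seq, VERTEBRATE_MITO_STOPS))  # [(0, 12), (15, 18)]
```

The same sequence yields different ORF boundaries under the two codes, because TGA is a stop codon only in the standard code and AGA only in the vertebrate mitochondrial one. With Biopython available, the stop set could presumably be taken from the `stop_codons` attribute of the chosen table instead of being hardcoded.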

Link to problematic files

This issue collects input files that orftrack handles incorrectly. Preferably link to the files rather than attaching them, to avoid storing huge files here.

Cumbersome features

Some GFF files contain features that cover most of a sequence.
For example: GCF_000247795.1
In the related GFF file (enclosed), a feature of type "match" fully overlaps the first chromosome:
NC_032650.1 RefSeq region 1 161108492 . + . ID=NC_032650.1:1..161108492;Dbxref=taxon:9915;Name=1;breed=Nelore;chromosome=1;country=Brazil;gb-synonym=Bos taurus indicus;gbkey=Src;genome=chromosome;isolate=QUIL7308;mol_type=genomic DNA;note=animal owned by Agropecuaria Quilombo Inc.;sex=male;tissue-type=peripheral blood mononuclear cells
Line 37235:
NC_032650.1 RefSeq match 1 161108492 . + . ID=aln0;Target=NC_032650.1 1 161108492 +;gap_count=0;num_mismatch=0;pct_coverage=100;pct_identity_gap=100

As a consequence, orfget cannot define any purely intergenic ORF:

NC_032650.1

ORF type                  Quantity   Average length (aa)
c_CDS                         7649   100.45
nc_ovp_opp-CDS               19987    58.68
nc_ovp_opp-cDNA_match          201    39.65
nc_ovp_opp-match           1983772    46.8
nc_ovp_same-CDS              11740    52.03
nc_ovp_same-cDNA_match         713    39.64
nc_ovp_same-lnc_RNA          15831    42.05
nc_ovp_same-mRNA            439133    44.33
nc_ovp_same-match          2449854    46.35
nc_ovp_same-pseudogene       10750    48.33
nc_ovp_same-tRNA                16    68.0
nc_ovp_same-transcript         281    65.47

Would it be possible, as a preliminary step in orftrack, to exclude features that cover more than, say, 90% of their region, to avoid this behavior?

Meanwhile, since all six genomes in which I have identified this error so far contain a 'match' feature, I suggest simply adding 'match' to the list on line 597 of gff_parser.py:
if element_type not in ['chromosome', 'region','match']:
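
The coverage-based exclusion suggested above could look roughly like the following sketch. It is an assumption-laden illustration rather than code from gff_parser.py: `feature_coverage`, `keep_feature`, and the `max_coverage` parameter are hypothetical, and coordinates are taken to be 1-based and inclusive, as in GFF.

```python
def feature_coverage(start, end, seq_length):
    """Fraction of the sequence covered by a feature (1-based, inclusive)."""
    return (end - start + 1) / seq_length


def keep_feature(feature_type, start, end, seq_length,
                 max_coverage=0.9, skipped_types=("chromosome", "region")):
    """Return False for skipped types and for features covering too much."""
    if feature_type in skipped_types:
        return False
    return feature_coverage(start, end, seq_length) <= max_coverage


chrom_length = 161108492  # length of NC_032650.1 in the example above
print(keep_feature("match", 1, 161108492, chrom_length))  # False
print(keep_feature("CDS", 1000, 2000, chrom_length))      # True
```

With a threshold of this kind, the 'match' feature in the example is dropped without being named explicitly, while ordinary gene-level features are unaffected.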

Add hardcoded element type list as a parameter

Each element type read from a GFF file is tested against the list ["chromosome", "region", "match"] to decide whether the element must be considered or not.

It would be more flexible to expose this list as a parameter, with the same values as the default.

For instance:
orftrack --get_element_type_considered -> returns ["chromosome", "region", "match"]
orftrack --delete_element_type_considered "match" -> returns ["chromosome", "region"]
orftrack --add_element_type_considered "any-long-overlapping-region-tag" -> returns ["chromosome", "region", "match", "any-long-overlapping-region-tag"]
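
A rough argparse sketch of this proposal follows. The option names are the ones suggested in this issue and do not exist in orftrack as far as this page shows; `build_parser`, `resolve_element_types`, and `DEFAULT_ELEMENT_TYPES` are hypothetical names used only for illustration.

```python
import argparse

# Default list quoted in this issue.
DEFAULT_ELEMENT_TYPES = ["chromosome", "region", "match"]


def build_parser():
    """Build a parser with the two proposed (hypothetical) options."""
    parser = argparse.ArgumentParser(prog="orftrack-sketch")
    parser.add_argument("--add_element_type_considered",
                        action="append", default=None, metavar="TYPE")
    parser.add_argument("--delete_element_type_considered",
                        action="append", default=None, metavar="TYPE")
    return parser


def resolve_element_types(argv):
    """Apply deletions, then additions, to the default list."""
    args = build_parser().parse_args(argv)
    deleted = args.delete_element_type_considered or []
    added = args.add_element_type_considered or []
    types = [t for t in DEFAULT_ELEMENT_TYPES if t not in deleted]
    for element_type in added:
        if element_type not in types:
            types.append(element_type)
    return types


print(resolve_element_types(["--delete_element_type_considered", "match"]))
```

Both options use `action="append"`, so each can be repeated on the command line to add or delete several types in one call.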
