genefamily-domain-composition-tool's Introduction

Tool for analyzing Pfam-domain compositions of gene families

This tool assigns Pfam domains to families of protein sequences. Domains are reported in order of their starting positions in each sequence. The domain composition of each family is summarised in two ways: the percentage of sequences in the family that contain each domain found in that family, and an average Jaccard score calculated from all pairwise Jaccard scores within the family. The Jaccard score between any two sequences in a family is:

Jaccard score = (number of unique domains common to seq1 and seq2) / ((total number of unique domains in seq1) + (total number of unique domains in seq2))
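The summary statistics described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names, the `(start, name)` hit tuples, and the `family` dictionary (sequence id mapped to a list of domain names) are all assumptions made for the example. The score follows the formula exactly as worded here, so the denominator is the sum of the two set sizes rather than the size of their union, which is what the standard Jaccard index uses. How the tool scores sequences with no detected domain is not stated; the sketch returns 0 in that case.

```python
from __future__ import division
from itertools import combinations


def order_domains(hits):
    """Return domain names sorted by start position; hits are (start, name) pairs."""
    return [name for start, name in sorted(hits)]


def domain_percentages(family):
    """Percentage of sequences in the family that contain each domain."""
    counts = {}
    for domains in family.values():
        for name in set(domains):  # count each domain once per sequence
            counts[name] = counts.get(name, 0) + 1
    return {name: 100.0 * n / len(family) for name, n in counts.items()}


def pair_score(domains1, domains2):
    """Shared unique domains divided by the sum of unique domains in each sequence."""
    set1, set2 = set(domains1), set(domains2)
    denominator = len(set1) + len(set2)
    if denominator == 0:
        return 0.0  # assumption: two domain-free sequences score 0
    return len(set1 & set2) / denominator


def family_average_score(family):
    """Mean of pair_score over all pairs of sequences in the family."""
    pairs = list(combinations(sorted(family), 2))
    if not pairs:
        return 0.0  # assumption: a single-sequence family scores 0
    return sum(pair_score(family[a], family[b]) for a, b in pairs) / len(pairs)


# Hypothetical three-sequence family with made-up domain names
family = {"s1": ["PF_A", "PF_B"], "s2": ["PF_A"], "s3": ["PF_C"]}
percentages = domain_percentages(family)
average = family_average_score(family)
```

In this toy family only the pair s1/s2 shares a domain, scoring 1/(2+1); the other two pairs score 0, so the family average is 1/9.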

Requirements

  • UNIX-based OS
  • Python 2.7
  • Python modules: re, sys, os, subprocess, argparse, operator
  • Pfam-A database files Pfam-A.hmm and Pfam-A.hmm.dat, saved in the Pfam/database directory. Pfam-A.hmm must be pressed with the hmmpress program from the HMMER package so that the database index files are created in the same directory
  • The scripts and the Pfam directories must be in the same working directory
  • Add the Pfam modules to your PERL5LIB with the following command:
bash% export PERL5LIB=/path/to/pfam_Dir:$PERL5LIB
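
Before running the tool, it can help to confirm that the database directory is complete. The sketch below is an illustrative helper, not part of the tool: the function name and the `pfam_dir` argument are hypothetical. It relies on hmmpress writing four index files next to the .hmm file, with the suffixes .h3f, .h3i, .h3m and .h3p.

```python
import os

# Suffixes of the binary index files that hmmpress creates beside the .hmm file
HMMPRESS_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_pfam_files(pfam_dir):
    """Return the expected Pfam-A files that are absent from pfam_dir."""
    hmm = os.path.join(pfam_dir, "Pfam-A.hmm")
    expected = [hmm, hmm + ".dat"] + [hmm + suffix for suffix in HMMPRESS_SUFFIXES]
    return [path for path in expected if not os.path.isfile(path)]
```

An empty list means the two database files and the pressed index are all present; otherwise the returned paths show what still needs to be downloaded or pressed.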

Tool Options

usage: gene-family-domain-arch-analysis.py [-h] --fasta_dir
                                           FAMILY_FASTA_FILE_DIR --output_dir
                                           OUTPUT_DIRECTORY --name
                                           DATASET_NAME

Tool for calculating domain composition and domain composition scores for given
set of gene families

Arguments:
  -h, --help            show this help message and exit
  --fasta_dir FAMILY_FASTA_FILE_DIR
                        Location of directory containing family fasta files
  --output_dir OUTPUT_DIRECTORY
                        Location of the output directory
  --name DATASET_NAME   Name of the dataset used to create output files
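
The usage message above is consistent with an argparse declaration along the following lines. This is a reconstruction from the help text only, not the script's actual source; the `build_parser` function name is hypothetical, and the three options are marked as required because the usage line shows them without brackets.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="gene-family-domain-arch-analysis.py",
        description="Tool for calculating domain composition and domain "
                    "composition scores for given set of gene families",
    )
    parser.add_argument("--fasta_dir", metavar="FAMILY_FASTA_FILE_DIR", required=True,
                        help="Location of directory containing family fasta files")
    parser.add_argument("--output_dir", metavar="OUTPUT_DIRECTORY", required=True,
                        help="Location of the output directory")
    parser.add_argument("--name", metavar="DATASET_NAME", required=True,
                        help="Name of the dataset used to create output files")
    return parser


# Example with made-up values for the three options
args = build_parser().parse_args(
    ["--fasta_dir", "families", "--output_dir", "results", "--name", "demo"]
)
```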

Output files

  • pfamscan_out directory contains raw pfamscan output files for each family.
  • domain_order_results directory contains domain order files for each family. Each domain order file contains Pfam-domains for each sequence in the family in order of their starting positions in the sequences. A **NULL** domain is reported for sequences where no Pfam-domain is detected.
  • *.family_domain_compositions file contains summarised domain compositions for each family. Format: <family_id> <family-size> <domain-1>-<% of sequences in the family containing the domain> ...
  • *.family_domain_jaccard_scores file contains domain composition Jaccard scores for all the families. Format: <family_id> <family-size> <Jaccard score>
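
The two summary files can be read back with a short parser. The sketch below assumes whitespace-separated fields and a plain numeric percentage after the last hyphen of each domain field; neither detail is stated in the format descriptions, and the function names are hypothetical. Splitting on the last hyphen matters because Pfam domain names can themselves contain hyphens.

```python
def parse_composition_line(line):
    """Parse '<family_id> <family-size> <domain>-<percent> ...' into a tuple."""
    fields = line.split()
    family_id, family_size = fields[0], int(fields[1])
    domains = {}
    for field in fields[2:]:
        # Split on the last hyphen only, so hyphenated domain names stay intact
        name, percent = field.rsplit("-", 1)
        domains[name] = float(percent.rstrip("%"))
    return family_id, family_size, domains


def parse_jaccard_line(line):
    """Parse '<family_id> <family-size> <Jaccard score>' into a tuple."""
    family_id, family_size, score = line.split()
    return family_id, int(family_size), float(score)


# Made-up example lines in the assumed layout
composition = parse_composition_line("fam1 4 zf-C2H2-75.0 PF_B-25.0")
jaccard = parse_jaccard_line("fam1 4 0.25")
```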

genefamily-domain-composition-tool's People

Contributors

akshayayadav
