This tool assigns Pfam domains to families of protein sequences. Domains are reported in order of their starting positions in each sequence. The domain composition of each family is summarised in two ways: the percentage of sequences in the family that contain each domain found in that family, and the average Jaccard score calculated over all pairwise Jaccard scores within the family. The Jaccard score between any two sequences in a family is defined as:

Jaccard score = (number of unique domains common to seq1 and seq2) / ((total number of unique domains in seq1) + (total number of unique domains in seq2))
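The score defined above can be sketched in a few lines of Python. This is a minimal illustration that follows the formula exactly as written here (the denominator is the sum of the two unique-domain counts), not the tool's actual code. The function names `pair_score` and `family_average` are hypothetical, and returning 0.0 for empty inputs or single-sequence families is an assumption.

```python
from itertools import combinations


def pair_score(domains_a, domains_b):
    """Score two sequences using the formula stated in this README."""
    a, b = set(domains_a), set(domains_b)
    denominator = len(a) + len(b)
    if denominator == 0:
        # Assumption: two sequences without any domains score 0.0.
        return 0.0
    return len(a & b) / float(denominator)


def family_average(seq_domains):
    """Average the score over all sequence pairs in one family.

    seq_domains: list of domain lists, one per sequence.
    """
    pairs = list(combinations(seq_domains, 2))
    if not pairs:
        # Assumption: a family with fewer than two sequences scores 0.0.
        return 0.0
    return sum(pair_score(a, b) for a, b in pairs) / len(pairs)


# Illustrative family of three sequences (domain names are examples only).
family = [["Pkinase", "SH2"], ["Pkinase"], ["SH3_1"]]
average = family_average(family)
```

With this input only the first pair shares a domain, so its pair score is 1 / (2 + 1) and the other two pairs score 0.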
- UNIX-based OS
- Python 2.7
- Python modules: re, sys, os, subprocess, argparse, operator
- Pfam-A database files Pfam-A.hmm and Pfam-A.hmm.dat, saved in the Pfam/database directory. The Pfam-A.hmm file must be pressed with the hmmpress program from the HMMER package, which creates the database index files in the same directory
- The scripts and the Pfam directories must be present in the same working directory
- Add the Pfam modules to your PERL5LIB using the following command:
bash% export PERL5LIB=/path/to/pfam_Dir:$PERL5LIB
usage: gene-family-domain-arch-analysis.py [-h] --fasta_dir
FAMILY_FASTA_FILE_DIR --output_dir
OUTPUT_DIRECTORY --name
DATASET_NAME
Tool for calculating domain composition and domain composition scores for given
set of gene families
Arguments:
-h, --help show this help message and exit
--fasta_dir FAMILY_FASTA_FILE_DIR
Location of directory containing family fasta files
--output_dir OUTPUT_DIRECTORY
Location of the output directory
--name DATASET_NAME Name of the dataset used to create output files
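As an illustration of the interface above, the following is a minimal sketch of an `argparse` parser that accepts the same three options. It is not taken from the script itself: the `build_parser` helper and the example values `families`, `results` and `my_dataset` are hypothetical, and all three options are assumed to be required because the usage line shows them without brackets.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="gene-family-domain-arch-analysis.py",
        description="Tool for calculating domain composition and domain "
                    "composition scores for given set of gene families")
    parser.add_argument("--fasta_dir", metavar="FAMILY_FASTA_FILE_DIR",
                        required=True,
                        help="Location of directory containing family fasta files")
    parser.add_argument("--output_dir", metavar="OUTPUT_DIRECTORY",
                        required=True,
                        help="Location of the output directory")
    parser.add_argument("--name", metavar="DATASET_NAME",
                        required=True,
                        help="Name of the dataset used to create output files")
    return parser


# Parse a hypothetical command line instead of sys.argv.
args = build_parser().parse_args(
    ["--fasta_dir", "families", "--output_dir", "results", "--name", "my_dataset"])
```

The equivalent call on the command line would pass the same three flags after the script name.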
- pfamscan_out directory contains raw pfamscan output files for each family.
- domain_order_results directory contains domain order files for each family. Each domain order file contains Pfam-domains for each sequence in the family in order of their starting positions in the sequences. A **NULL** domain is reported for sequences where no Pfam-domain is detected.
- *.family_domain_compositions file contains summarised domain compositions for each family. Format: <family_id> <family-size> <domain-1>-<% of sequences in the family containing the domain> ...
- *.family_domain_jaccard_scores file contains the domain composition Jaccard scores for all the families. Format: <family_id> <family-size> <Jaccard score>
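The domain order and family composition outputs described above could be derived roughly as in the sketch below. It assumes the Pfam hits for each sequence are already available as `(domain, start)` tuples. That input shape, the function names `domain_order` and `family_composition`, and the choice to count **NULL** as a domain in the percentages are all assumptions made for illustration, not the tool's confirmed behaviour.

```python
from collections import Counter


def domain_order(hits):
    """Return domain names sorted by start position.

    hits: list of (domain, start) tuples for one sequence.
    Sequences without hits get the NULL placeholder.
    """
    if not hits:
        return ["NULL"]
    return [name for name, start in sorted(hits, key=lambda hit: hit[1])]


def family_composition(family_hits):
    """Return {domain: percentage of sequences containing it}.

    family_hits: dict mapping sequence id to its list of hits.
    """
    size = len(family_hits)
    counts = Counter()
    for hits in family_hits.values():
        # A domain is counted once per sequence, even if it repeats.
        counts.update(set(domain_order(hits)))
    return dict((domain, 100.0 * n / size) for domain, n in counts.items())


# Illustrative family of four sequences (names and positions are examples).
family = {
    "seq1": [("SH2", 120), ("Pkinase", 10)],
    "seq2": [("Pkinase", 5)],
    "seq3": [],
    "seq4": [("Pkinase", 1), ("Pkinase", 200)],
}
order_seq1 = domain_order(family["seq1"])
composition = family_composition(family)
```

Here `seq1` is reported as Pkinase followed by SH2, and Pkinase is present in three of the four sequences.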