reynoldsk / pysca
A python implementation of the Statistical Coupling Analysis (SCA)
License: BSD 3-Clause "New" or "Revised" License
I have had trouble constructing a dictionary of phylogenetic groups after annotation with the annotate_MSA script, because some annotations contain a vertical bar symbol in the description string, for example:
"DNLJ_ZYMMO|DNA ligase {ECO:0000255|HAMAP-Rule:MF_01588}|Zymomonas mobilis subsp. mobilis (strain ATCC 31821 / ZM4 / CP4)|Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Zymomonas."
A different delimiter is probably necessary.
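Until the delimiter changes, one possible workaround is to split the header from both ends, so that extra bars inside the description are left alone. The sketch below assumes the four-field layout seen in the example header (ID, description, species, taxonomy) and that only the description can contain a bar; the function name split_annotated_header is hypothetical and not part of pySCA.

```python
def split_annotated_header(header):
    """Split 'ID|description|species|taxonomy', tolerating '|' inside the description."""
    # The ID is everything before the first bar.
    seq_id, rest = header.split("|", 1)
    # Species and taxonomy are the last two bar-separated fields;
    # whatever remains in the middle is the description.
    description, species, taxonomy = rest.rsplit("|", 2)
    return seq_id, description, species, taxonomy


header = (
    "DNLJ_ZYMMO|DNA ligase {ECO:0000255|HAMAP-Rule:MF_01588}|"
    "Zymomonas mobilis subsp. mobilis (strain ATCC 31821 / ZM4 / CP4)|"
    "Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,"
    "Sphingomonadaceae,Zymomonas."
)
seq_id, description, species, taxonomy = split_annotated_header(header)
```

This only helps when the stray bars are confined to the description; a header with bars in the species or taxonomy field would still be split incorrectly.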
I found an issue related to the MSAsearch function in scaTools.py. I am new to Python and only came up with a naive solution, and the resulting output is not consistent with that described in the pySCA tutorial.
First, I have ggsearch36 installed, so I can run ggsearch36 from the command line:
$ ggsearch36
USAGE
ggsearch36 [-options] query_file library_file
ggsearch36 -help for a complete option list
DESCRIPTION
GGSEARCH performs a global/global database searches
version: 36.3.8 Jul, 2015
COMMON OPTIONS (options must preceed query_file library_file)
-s: scoring matrix;
-f: gap-open penalty;
-g: gap-extension penalty;
-S filter lowercase (seg) residues;
-b: high scores reported (limited by -E by default);
-d: number of alignments shown (limited by -E by default);
-I interactive mode;
Then, under Python 3, I ran
./scaProcessMSA.py Inputs/s1Ahalabi_1470_nosnakes.an -s 3TGI -c E -t -n
to process the MSA for the S1A family.
It produces the following error output:
Trying MSASearch with ggsearch
Trying MSASearch with EMBOSS
Trying MSASearch with BioPython
Error!!! Something wrong with PDBid or path...
After debugging for days, I found that the issue comes from this line in scaTools.py:
i_0 = [i for i in range(len(hd)) if output.split('\t')[1] in hd[i]]
The problem is that the bytes type is not compatible with str. I worked around it by changing the line to:
i_0 = [i for i in range(len(hd)) if output.split(b'\t')[1] in bytes(hd[i],'utf-8')]
This works and produces the final MSA output, but with 205 positions for S1A instead of 245.
Thank you very much for your help.
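An alternative to converting each header to bytes is to decode the search output once, so every later comparison is str against str. This is a minimal sketch under the assumption that the output arrives as bytes from a subprocess call under Python 3; raw_output and hd below are hypothetical stand-ins, not real ggsearch36 output or real alignment headers.

```python
# Hypothetical stand-ins: bytes as a subprocess would return them under
# Python 3, and alignment headers held as ordinary str objects.
raw_output = b"query\t3TGI_E\t100.0\n"
hd = ["1ABC_A|trypsin", "3TGI_E|trypsin", "2XYZ_B|elastase"]

# Decode once, then work with str everywhere.
output = raw_output.decode("utf-8")
hit_id = output.split("\t")[1]

# Same search as the original list comprehension, now str against str.
i_0 = [i for i, header in enumerate(hd) if hit_id in header]
```

Whether this also changes the number of positions in the processed alignment is a separate question that the sketch does not address.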
Hello, running the following script (attached input files), I get the following error:
Code/./scaProcessMSA.py Inputs/aln-kaic.fasta --refseq ref_seq_KaiC_elongatus.fa -c A --output Outputs/KaiC1_processed.db
Using reference sequence but no position list provided! Just numbering positions 1 to length(sequence)
Traceback (most recent call last):
File "Code/./scaProcessMSA.py", line 85, in <module>
options.refpos = range(len(options.refseq))+1
TypeError: can only concatenate list (not "int") to list
I presume you meant options.refpos = range(len(options.refseq)+1).
After changing that line, I ran the same command and got the following error:
Using reference sequence but no position list provided! Just numbering positions 1 to length(sequence)
Using the reference sequence and position list...
Loaded alignment of 1194 sequences, 1530 positions.
Checking alignment for non-standard amino acids
Aligment size after removing sequences with non-standard amino acids: 1194
Trimming alignment for highly gapped positions (80% or more).
Alignment size post-trimming: 536 positions
Finding reference sequence using provided sequence file...
Trying MSASearch with ggsearch
Trying MSASearch with EMBOSS
Trying MSASearch with BioPython
Error!! Can't find reference sequence...
At line 174, I printed h_tmp and s_tmp[0] and got:
(['Elongatus_KaiC'], 'MTSAEMTSPNNNSEHQAIAKMRTMIEGFDDISHGGLPIGRSTLVSGTSGTGKTLFSIQFLYNGIIEFDEPGVFVTFEETPQDIIKNARSFGWDLAKLVDEGKLFILDASPDPEGQEVVGGFDLSALIERINYAIQKYRARRVSIDSVTSVFQQYDASSVVRRELFRLVARLKQIGATTVMTTERIEEYGPIARYGVEEFVSDNVVILRNVLEGERRRRTLEILKLRGTSHMKGEYPFTITDHGINIFPLGAMRLTQRSSNVRVSSGVVRLDEMCGGGFFKDSIILATGATGTGKTLLVSRFVENACANKERAILFAYEESRAQLLRNAYSWGMDFEEMERQNLLKIVCAYPESAGLEDHLQIIKSEINDFKPARIAIDSLSALARGVSNNAFRQFVIGVTGYAKQEEITGLFTNTSDQFMGAHSITDSHISTITDTIILLQYVEIRGEMSRAINVFKMRGSWHDKAIREFMISDKGPDIKDSFRNFERIISGSPTRITVDEKSELSRIVRGVQEKGPES')
Python version:
'2.7.13 |Enthought, Inc. (x86_64)| (default, Mar 2 2017, 08:20:50) \n[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]'
Biopython version:
1.71
Thanks in advance!
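On the failing line options.refpos = range(len(options.refseq))+1: the log message says positions are numbered from 1 to the length of the sequence. A minimal sketch of that numbering follows, assuming this is what the line intends; refseq here is a hypothetical short sequence.

```python
refseq = "MTSAEM"  # hypothetical short reference sequence

# One label per residue, starting at 1 and ending at len(refseq).
refpos = list(range(1, len(refseq) + 1))

# For comparison: range(len(refseq) + 1) starts at 0 and has one extra entry.
shifted = list(range(len(refseq) + 1))
```

Under that assumption, range(len(options.refseq)+1) would give len + 1 labels starting from 0, so the labels would not line up one-to-one with the residues.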
The main idea of SCA is to use an MSA as its foundation. My question: is it equally reliable for oligomeric proteins?
The FTP URL provided for downloading pfamseq.txt (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/database_files/) is not active. It would be helpful to know which release of pfamseq was used when writing the code for annotate_MSA.py, as the current release does not seem to work.
I have been having an issue with the scaSectorID.py tool. When I input the command:
./scaSectorID.py ./PF00028.db
I received the error
Selected kpos=4 significant eigenmodes.
Traceback (most recent call last):
File "./scaSectorID.py", line 70, in <module>
ics,icsize,sortedpos,cutoff,scaled_pd, pd = sca.icList(Vpica,kpos,Csca, p_cut=options.cutoff)
File "/home/dylan/pySca-master/scaTools.py", line 998, in icList
h_params = np.histogram(Vpica[:,k], nbins)
File "/home/dylan/anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py", line 719, in histogram
'`bins` must be an integer, a string, or an array')
TypeError: `bins` must be an integer, a string, or an array
Does anyone have insight into this error?
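One possible cause, offered as an assumption since the thread does not show how nbins is computed, is that nbins reaches np.histogram as a float rather than an integer. The sketch below shows a cast that would avoid the error in that case; values and the starting nbins are hypothetical stand-ins for Vpica[:,k] and the value computed in icList.

```python
import numpy as np

# Hypothetical stand-in for one column of Vpica.
values = np.linspace(-1.0, 1.0, 200)

# Hypothetical bin count that arrives as a float.
nbins = 20.0

# np.histogram accepts an integer bin count, so cast before calling it.
nbins = int(round(nbins))
counts, edges = np.histogram(values, nbins)
```

If nbins turns out to be something other than a float in the real code, the cast will not help and the computation of nbins itself needs checking.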