Comments (5)
Hi Cui,
I will look into this next week. I apologize for the delay, but I am swamped this week. Just wanted to give you a heads-up that it might be a bit.
If I happen to forget to follow up on this, feel free to send me a reminder message.
-Bradley
from raxml_ascbias.
Thanks a lot Bradley! That's okay, please take your time!
I have never used python, so I tried to ask help from chatgpt. Eventually I changed the code like this
line 49, changed to bases = ["A","G","C","T"]
line 68, changed to if any(value for value in column_unique if value not in bases):
The resulted file seems to work for me. But I have no idea whether this works correctly. Please do get back to me once you get a moment!
thanks,
Cui
from raxml_ascbias.
I just wanted to add here that I also experienced the same thing; the invariable sites (E.g. all samples having A nucleotides and one R, meaning A/G), are not removed, and raxml-ng is complaining about a large number of remaining invariable sites in the MSA.
The suggestion by @smallfishcui removes all columns with any IUPAC code completely, which is not what I would like to do.
from raxml_ascbias.
I added a section to the filter_invariants
function that looks for the above situation and also removes columns with multiple IUPAC code bases. Maybe it can be of use to you @btmartin721.
def filter_invariants(dframe):
initial_dframe_shape=dframe.shape
bases = ["A","G","C","T","R","Y","S","W","K","M","B","D","H","V"]
IUPAC_bases = ["R","Y","S","W","K","M","B","D","H","V","N"]
IUPAC_dict = {'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'K': 'GT', 'M': 'AC', 'N': 'ACGT', 'S': 'CG', 'R': 'AG', 'W': 'AT', 'V': 'ACG', 'Y': 'CT', 'X': 'ACGT'}
# collections::Counter library
stamatakis_cnt = Counter()
fels_cnt = 0
invariant_lst = list()
# Loop through each dataframe column
for i in dframe.columns:
# Gets unique values at each column and saves to list
column_unique = [x.upper() for x in dframe[i].unique().tolist()]
# Intersects column_unique with bases list
intersect = [value for value in bases if value in column_unique]
# If column contains only ambigous or IUPAC characters
# Save the column index for dropping later
if not any(value for value in bases if value in column_unique):
invariant_lst.append(i)
# If site is invariant (only A, C, G, or T); ignores N's and "-"
if len(intersect) == 1:
# Uses collections::Counter to get Stamatakis counts
stamatakis_cnt[intersect[0]] += 1
# Counts number of invariant sites for Felsenstein count
fels_cnt += 1
# Saves column indexes to list
invariant_lst.append(i)
# Assesses whether the column is considered invariate because a pattern like AAAAAARAAAAA (R being A or G)
# Save the column index for dropping later
IUPAC_bases_this_column = [x for x in intersect if x in IUPAC_bases]
column_unique_without_IUPAC = [x for x in intersect if x not in IUPAC_bases_this_column]
if len(IUPAC_bases_this_column) == 1 and \
len(column_unique_without_IUPAC) == 1 and \
column_unique_without_IUPAC[0] in list(IUPAC_dict[IUPAC_bases_this_column[0]]):
## if 1 IUPAC base AND 1 non-IUPAC base AND the non-IUPAC base is part of the IUPAC base set stored in IUPAC_dict
print(column_unique_without_IUPAC[0])
invariant_lst.append(i)
if len(IUPAC_bases_this_column) > 1:
## if there are more than one IUPAC bases in this column
invariant_lst.append(i)
# Drops invariant sites from dataframe
invariant_set=set(invariant_lst)
dframe.drop(invariant_set, axis=1, inplace=True)
print("## dimensions of the initial alignment (samples, nr of positions):",initial_dframe_shape)
print("##",len(invariant_set), "invariable sites removed")
print("## dimensions of the remaining alignment (samples, nr of positions):", dframe.shape)
return stamatakis_cnt, fels_cnt, dframe
from raxml_ascbias.
Awesome thanks! Sorry for the trouble. This was one of the first python scripts I ever wrote and I haven't touched it in a long time. @pvdam3 do you think you could do a pull request and I'll merge it with my original code so that this issue could be fixed?
from raxml_ascbias.
Related Issues (4)
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from raxml_ascbias.