GithubHelp home page GithubHelp logo

Comments (5)

btmartin721 avatar btmartin721 commented on June 14, 2024

Hi Cui,

I will look into this next week. I apologize for the delay, but I am swamped this week. Just wanted to give you a heads-up that it might be a bit.

If I happen to forget to follow up on this, feel free to send me a reminder message.

-Bradley

from raxml_ascbias.

smallfishcui avatar smallfishcui commented on June 14, 2024

Thanks a lot Bradley! That's okay, please take your time!
I have never used python, so I tried to ask help from chatgpt. Eventually I changed the code like this
line 49, changed to bases = ["A","G","C","T"]
line 68, changed to if any(value for value in column_unique if value not in bases):
The resulted file seems to work for me. But I have no idea whether this works correctly. Please do get back to me once you get a moment!

thanks,
Cui

from raxml_ascbias.

pvdam3 avatar pvdam3 commented on June 14, 2024

I just wanted to add here that I also experienced the same thing; the invariable sites (E.g. all samples having A nucleotides and one R, meaning A/G), are not removed, and raxml-ng is complaining about a large number of remaining invariable sites in the MSA.
The suggestion by @smallfishcui removes all columns with any IUPAC code completely, which is not what I would like to do.

from raxml_ascbias.

pvdam3 avatar pvdam3 commented on June 14, 2024

I added a section to the filter_invariants function that looks for the above situation and also removes columns with multiple IUPAC code bases. Maybe it can be of use to you @btmartin721.

def filter_invariants(dframe):
    initial_dframe_shape=dframe.shape
    bases = ["A","G","C","T","R","Y","S","W","K","M","B","D","H","V"]
    IUPAC_bases = ["R","Y","S","W","K","M","B","D","H","V","N"]
    IUPAC_dict = {'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'K': 'GT', 'M': 'AC', 'N': 'ACGT', 'S': 'CG', 'R': 'AG', 'W': 'AT', 'V': 'ACG', 'Y': 'CT', 'X': 'ACGT'}


    # collections::Counter library
    stamatakis_cnt = Counter()
    fels_cnt = 0

    invariant_lst = list()
    # Loop through each dataframe column
    for i in dframe.columns:

        # Gets unique values at each column and saves to list
        column_unique = [x.upper() for x in dframe[i].unique().tolist()]

        # Intersects column_unique with bases list
        intersect = [value for value in bases if value in column_unique]

        # If column contains only ambigous or IUPAC characters
        # Save the column index for dropping later
        if not any(value for value in bases if value in column_unique):
            invariant_lst.append(i)

        # If site is invariant (only A, C, G, or T); ignores N's and "-"
        if len(intersect) == 1:
            # Uses collections::Counter to get Stamatakis counts
            stamatakis_cnt[intersect[0]] += 1

            # Counts number of invariant sites for Felsenstein count
            fels_cnt += 1

            # Saves column indexes to list
            invariant_lst.append(i)

        # Assesses whether the column is considered invariate because a pattern like AAAAAARAAAAA (R being A or G)
        # Save the column index for dropping later
        IUPAC_bases_this_column = [x for x in intersect if x in IUPAC_bases]
        column_unique_without_IUPAC = [x for x in intersect if x not in IUPAC_bases_this_column]

        if len(IUPAC_bases_this_column) == 1 and \
            len(column_unique_without_IUPAC) == 1 and \
            column_unique_without_IUPAC[0] in list(IUPAC_dict[IUPAC_bases_this_column[0]]):  
            ## if 1 IUPAC base AND 1 non-IUPAC base AND the non-IUPAC base is part of the IUPAC base set stored in IUPAC_dict
            print(column_unique_without_IUPAC[0])
            invariant_lst.append(i)
        if len(IUPAC_bases_this_column) > 1: 
            ## if there are more than one IUPAC bases in this column
            invariant_lst.append(i)

    # Drops invariant sites from dataframe
    invariant_set=set(invariant_lst)
    dframe.drop(invariant_set, axis=1, inplace=True)
    print("## dimensions of the initial alignment (samples, nr of positions):",initial_dframe_shape)
    print("##",len(invariant_set), "invariable sites removed")
    print("## dimensions of the remaining alignment (samples, nr of positions):", dframe.shape)
    return stamatakis_cnt, fels_cnt, dframe

from raxml_ascbias.

btmartin721 avatar btmartin721 commented on June 14, 2024

Awesome thanks! Sorry for the trouble. This was one of the first python scripts I ever wrote and I haven't touched it in a long time. @pvdam3 do you think you could do a pull request and I'll merge it with my original code so that this issue could be fixed?

from raxml_ascbias.

Related Issues (4)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.