OriFinder

Origin of replication finder written in Python. It counts substrings with a hash table (collections.Counter), which makes it very fast.

The origin of replication (also called the replication origin) is a particular sequence in a genome at which replication is initiated.

from collections import Counter
# Because str.count() is slow and we have a HUGE dataset, we need a faster way to count substrings.
# collections.Counter is a dict subclass that counts items using a hash table, which is very fast.
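As a small illustration of the point in the comments above, the sketch below counts every window of one length in a single pass with Counter and compares the result with str.count(). The helper name count_windows and the sample string are made up for this example. Note that str.count() only counts non-overlapping matches, so the two approaches can disagree on repetitive sequences.

```python
from collections import Counter

def count_windows(text, k):
    """Count every length-k window of text in a single pass (hypothetical helper)."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

text = "ATATATGC"
counts = count_windows(text, 3)

# The sliding window sees overlapping occurrences of "ATA" (at index 0 and 2).
print(counts["ATA"])
# str.count() resumes searching after each match, so it skips the overlap.
print(text.count("ATA"))
```

One Counter pass yields the counts for all substrings of that length at once, whereas str.count() would need a separate scan of the string for each candidate substring.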

def diveandconquer(s):
    maxcount = 0
    maxsubstring = ""
    for u in range(5, 20):
        allsubstrings = []
        # A string of length n has n - u + 1 windows of length u.
        for i in range(len(s) - u + 1):
            allsubstrings.append(s[i:i+u])
        c = Counter(allsubstrings).most_common(1)
        # c is empty when s is shorter than u; on ties the longer substring wins.
        if c and c[0][1] >= maxcount:
            maxsubstring = c[0][0]
            maxcount = c[0][1]
    return maxsubstring + " " + str(maxcount)
# We keep two variables: the highest occurrence count seen so far and the substring that produced it. A list stores every possible substring of the current length.
# In the inner for loop we build the 5-character substrings by sliding a window across the original string.
# Then we create a Counter object from all of these substrings, retrieve the most common one, and assign it to c.
# Counter.most_common(1) returns a list holding one tuple in the format [("theword", 20)]: first the substring, then its number of occurrences.
# If that count is at least as high as the previous maxcount value, we assign the current substring and count to maxsubstring and maxcount.
# The outer loop repeats the same steps for substrings from 5 up to 19 characters long, since range(5, 20) excludes 20.
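To make the steps described above concrete, here is a minimal sketch of a single pass for one substring length on a toy string. The sample string and the variable names are invented for illustration; the real script runs this for every length in its loop.

```python
from collections import Counter

s = "AAAAAACGT"   # toy input, not real genome data
k = 5             # substring length for this pass

# Slide a window of length k across s: indices 0 .. len(s) - k.
windows = [s[i:i + k] for i in range(len(s) - k + 1)]
print(windows)

# most_common(1) returns a list with one (substring, count) tuple.
top = Counter(windows).most_common(1)
substring, count = top[0]
print(substring, count)
```

Unpacking top[0] into two names reads a little more clearly than indexing c[0][0] and c[0][1], but both access the same tuple.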


# Smaller datasets used while prototyping; swap one in for the file name below:
#   "vibrio_cholerae_light.txt"
#   "vibrio_cholerae_med.txt"
#   "deneme.txt"

with open("vibrio_cholerae.txt", "r") as f:
    data = f.read()
# We read the dataset from a file and store it in a variable named data.
# I created 3 alternative datasets by deleting much of the original dataset, to prevent memory errors and to reduce the running time while prototyping.
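As an alternative to keeping separate trimmed files, a prototype run could read only the beginning of the full dataset. This is a sketch under that assumption, not what the original script does; the helper name read_prefix and the limit value are hypothetical.

```python
def read_prefix(path, limit):
    """Return at most the first `limit` characters of the file at `path`."""
    with open(path, "r") as f:
        return f.read(limit)

# Example usage (the limit of 100_000 characters is an arbitrary choice):
# data = read_prefix("vibrio_cholerae.txt", 100_000)
```

A prefix is not a representative sample of the genome, so results from such a run are only useful for checking that the code works and how long it takes.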

sonuc = diveandconquer(data)
print(sonuc)
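TTTTT 3193
Lastly, we call our function and print the result. The line above is the output: the most frequent substring followed by its number of occurrences.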