Origin of replication finder using python. Thanks to hash tables its very fast
The origin of replication (also called the replication origin) is a particular sequence in a genome at which replication is initiated.
from collections import Counter
#Cause of string.count() is slow and we have a HUGE dataset we need a faster method to search.
#Collections.Counter is a data type for counting data using hash tables which is very fast
def diveandconquer(s):
maxcount = 0
maxsubstring = ""
for u in range(5, 20):
allsubstrings = []
for i in (range(len(s)-u+2)):
allsubstrings.append(s[i:i+u])
c = Counter(allsubstrings).most_common(1)
if(c[0][1] >= maxcount):
maxsubstring = c[0][0]
maxcount = c[0][1]
return maxsubstring + " " +str(maxcount)
#We have two variables to store max number of occurances and most frequent substring. Then we have a list to store all possible substring.
#In the for loop we create 5 charecter long substrings by shift throug original string.
# Then we create a counter object with all possible substring and retrive the most common one and assign it to C.
# Counter object return two tupples in a list in this format [("theword"),(20)] first the word and the number of occurances.
#If the number of occurance is higher than previus maxcount value we assign current substring and count to maxcount and maxsubstring.
#The we itterate the same code with 5 to 20 charecters long substrings.
#f = open("vibrio_cholerae_light.txt", "r")
#f = open("vibrio_cholerae_med.txt", "r")
# = open("deneme.txt", "r")
f = open("vibrio_cholerae.txt", "r")
data = f.read()
f.close()
#We read our dataset from a file and store it in a variable named data.
#I created 3 alternative datasets by deleting much of the original dataset to prevent memory errors and reduce the time taken by running the code while prototyping
sonuc = diveandconquer(data)
print(sonuc)
TTTTT 3193
And laslty we call our function and print it