mutux / ukkonen-s-suffix-tree-algorithm

Ukkonen's suffix tree algorithm, a complete version implemented in Python

Home Page: http://mutux.com

License: MIT License

Python 100.00%
string algorithm python suffix-tree ukkonen-algorithm

ukkonen-s-suffix-tree-algorithm's Introduction

Ukkonen's Suffix Tree Algorithm in Python Complete Version

Ukkonen's suffix tree algorithm implemented in Python. It might be the most complete version online, even more complete than the one demonstrated on Stack Overflow.

I underestimated the complexity of the algorithm and just wanted to have some fun. A primitive implementation was done in a couple of hours, and the demonstration example on Stack Overflow worked just fine. But when I tried some more complicated examples, I kept hitting the wall time and time again. It annoyed me, and it cost me several days of trying the different situations that arise when constructing a suffix tree.

Finally, this version came out. I think all the situations explained in the questions and answers had already been encountered and covered in the algorithm before I read the full post.

I also wrote a blog post explaining the implementation details, with flowcharts, on my blog MuTuX.

Examples

    docs = ['abcabxabcd', 'dedododeeodoeodooedeeododooodoede$', 'ooooooooo', 'mississippi']
    for text in docs:
        tree, pst = build(text, regularize=True)
        Node.draw(tree, pst, ed='#')

The running results:

abcabxabcd 
● (0)
|
|   ab
+----------------● (4->6)
|                |
|                |   xabcd
|                +---------------● (5)
|                |
|                |   c
|                +---------------● (9->11)
|                                |
|                                |   abxabcd
|                                +---------------● (1)
|                                |
|                                |   d
|                                +---------------● (10)
|
|   c
+----------------● (13)
|                |
|                |   abxabcd
|                +---------------● (3)
|                |
|                |   d
|                +---------------● (14)
|
|   b
+----------------● (6)
|                |
|                |   xabcd
|                +---------------● (7)
|                |
|                |   c
|                +---------------● (11->13)
|                                |
|                                |   abxabcd
|                                +---------------● (2)
|                                |
|                                |   d
|                                +---------------● (12)
|
|   xabcd
+----------------● (8)
|
|   d
+----------------● (15)

dedododeeodoeodooedeeododooodoede$ 
● (0)
|
|   e
+----------------------------------------● (28)
|                                        |
|                                        |   $
|                                        +---------------------------------------● (71)
|                                        |
|                                        |   eodo
|                                        +---------------------------------------● (48->37)
|                                                                                |
|                                                                                |   eodooedeeododooodoede$
|                                                                                +---------------------------------------● (29)
|                                                                                |
|                                                                                |   dooodoede$
|                                                                                +---------------------------------------● (49)
|                                        |
|                                        |   d
|                                        +---------------------------------------● (44->19)
|                                                                                |
|                                                                                |   ododeeodoeodooedeeododooodoede$
|                                                                                +---------------------------------------● (18)
|                                                                                |
|                                                                                |   e
|                                                                                +---------------------------------------● (68->26)
|                                                                                                                        |
|                                                                                                                        |   eododooodoede$
|                                                                                                                        +---------------------------------------● (45)
|                                                                                                                        |
|                                                                                                                        |   $
|                                                                                                                        +---------------------------------------● (69)
|                                        |
|                                        |   odo
|                                        +---------------------------------------● (37->31)
|                                                                                |
|                                                                                |   eodooedeeododooodoede$
|                                                                                +---------------------------------------● (30)
|                                                                                |
|                                                                                |   dooodoede$
|                                                                                +---------------------------------------● (50)
|                                                                                |
|                                                                                |   oedeeododooodoede$
|                                                                                +---------------------------------------● (38)
|
|   d
+----------------------------------------● (19)
|                                        |
|                                        |   e
|                                        +---------------------------------------● (26->28)
|                                                                                |
|                                                                                |   dododeeodoeodooedeeododooodoede$
|                                                                                +---------------------------------------● (17)
|                                                                                |
|                                                                                |   $
|                                                                                +---------------------------------------● (70)
|                                                                                |
|                                                                                |   eodo
|                                                                                +---------------------------------------● (46->48)
|                                                                                                                        |
|                                                                                                                        |   eodooedeeododooodoede$
|                                                                                                                        +---------------------------------------● (27)
|                                                                                                                        |
|                                                                                                                        |   dooodoede$
|                                                                                                                        +---------------------------------------● (47)
|                                        |
|                                        |   o
|                                        +---------------------------------------● (33->35)
|                                                                                |
|                                                                                |   e
|                                                                                +---------------------------------------● (64->42)
|                                                                                                                        |
|                                                                                                                        |   de$
|                                                                                                                        +---------------------------------------● (65)
|                                                                                                                        |
|                                                                                                                        |   odooedeeododooodoede$
|                                                                                                                        +---------------------------------------● (34)
|                                                                                |
|                                                                                |   d
|                                                                                +---------------------------------------● (22->24)
|                                                                                                                        |
|                                                                                                                        |   eeodoeodooedeeododooodoede$
|                                                                                                                        +---------------------------------------● (23)
|                                                                                                                        |
|                                                                                                                        |   o
|                                                                                                                        +---------------------------------------● (53->31)
|                                                                                                                                                                |
|                                                                                                                                                                |   deeodoeodooedeeododooodoede$
|                                                                                                                                                                +---------------------------------------● (20)
|                                                                                                                                                                |
|                                                                                                                                                                |   oodoede$
|                                                                                                                                                                +---------------------------------------● (54)
|                                                                                |
|                                                                                |   o
|                                                                                +---------------------------------------● (57->59)
|                                                                                                                        |
|                                                                                                                        |   edeeododooodoede$
|                                                                                                                        +---------------------------------------● (40)
|                                                                                                                        |
|                                                                                                                        |   odoede$
|                                                                                                                        +---------------------------------------● (58)
|
|   o
+----------------------------------------● (35)
|                                        |
|                                        |   e
|                                        +---------------------------------------● (42->28)
|                                                                                |
|                                                                                |   odooedeeododooodoede$
|                                                                                +---------------------------------------● (36)
|                                                                                |
|                                                                                |   de
|                                                                                +---------------------------------------● (66->68)
|                                                                                                                        |
|                                                                                                                        |   eododooodoede$
|                                                                                                                        +---------------------------------------● (43)
|                                                                                                                        |
|                                                                                                                        |   $
|                                                                                                                        +---------------------------------------● (67)
|                                        |
|                                        |   d
|                                        +---------------------------------------● (24->19)
|                                                                                |
|                                                                                |   eeodoeodooedeeododooodoede$
|                                                                                +---------------------------------------● (25)
|                                                                                |
|                                                                                |   o
|                                                                                +---------------------------------------● (31->33)
|                                                                                                                        |
|                                                                                                                        |   e
|                                                                                                                        +---------------------------------------● (62->64)
|                                                                                                                                                                |
|                                                                                                                                                                |   de$
|                                                                                                                                                                +---------------------------------------● (63)
|                                                                                                                                                                |
|                                                                                                                                                                |   odooedeeododooodoede$
|                                                                                                                                                                +---------------------------------------● (32)
|                                                                                                                        |
|                                                                                                                        |   d
|                                                                                                                        +---------------------------------------● (51->22)
|                                                                                                                                                                |
|                                                                                                                                                                |   eeodoeodooedeeododooodoede$
|                                                                                                                                                                +---------------------------------------● (21)
|                                                                                                                                                                |
|                                                                                                                                                                |   ooodoede$
|                                                                                                                                                                +---------------------------------------● (52)
|                                                                                                                        |
|                                                                                                                        |   o
|                                                                                                                        +---------------------------------------● (55->57)
|                                                                                                                                                                |
|                                                                                                                                                                |   edeeododooodoede$
|                                                                                                                                                                +---------------------------------------● (39)
|                                                                                                                                                                |
|                                                                                                                                                                |   odoede$
|                                                                                                                                                                +---------------------------------------● (56)
|                                        |
|                                        |   o
|                                        +---------------------------------------● (59->35)
|                                                                                |
|                                                                                |   edeeododooodoede$
|                                                                                +---------------------------------------● (41)
|                                                                                |
|                                                                                |   doede$
|                                                                                +---------------------------------------● (61)
|                                                                                |
|                                                                                |   odoede$
|                                                                                +---------------------------------------● (60)
|
|   $
+----------------------------------------● (72)

ooooooooo$ 
● (0)
|
|   o
+----------------● (89)
|                |
|                |   $
|                +---------------● (90)
|                |
|                |   o
|                +---------------● (87->89)
|                                |
|                                |   $
|                                +---------------● (88)
|                                |
|                                |   o
|                                +---------------● (85->87)
|                                                |
|                                                |   $
|                                                +---------------● (86)
|                                                |
|                                                |   o
|                                                +---------------● (83->85)
|                                                                |
|                                                                |   $
|                                                                +---------------● (84)
|                                                                |
|                                                                |   o
|                                                                +---------------● (81->83)
|                                                                                |
|                                                                                |   $
|                                                                                +---------------● (82)
|                                                                                |
|                                                                                |   o
|                                                                                +---------------● (79->81)
|                                                                                                |
|                                                                                                |   $
|                                                                                                +---------------● (80)
|                                                                                                |
|                                                                                                |   o
|                                                                                                +---------------● (77->79)
|                                                                                                                |
|                                                                                                                |   $
|                                                                                                                +---------------● (78)
|                                                                                                                |
|                                                                                                                |   o
|                                                                                                                +---------------● (75->77)
|                                                                                                                                |
|                                                                                                                                |   $
|                                                                                                                                +---------------● (76)
|                                                                                                                                |
|                                                                                                                                |   o$
|                                                                                                                                +---------------● (74)
|
|   $
+----------------● (91)

mississippi$ 
● (0)
|
|   i
+------------------● (104)
|                  |
|                  |   ppi$
|                  +-----------------● (105)
|                  |
|                  |   $
|                  +-----------------● (109)
|                  |
|                  |   ssi
|                  +-----------------● (98->100)
|                                    |
|                                    |   ppi$
|                                    +-----------------● (99)
|                                    |
|                                    |   ssippi$
|                                    +-----------------● (94)
|
|   p
+------------------● (107)
|                  |
|                  |   i$
|                  +-----------------● (108)
|                  |
|                  |   pi$
|                  +-----------------● (106)
|
|   s
+------------------● (96)
|                  |
|                  |   i
|                  +-----------------● (102->104)
|                                    |
|                                    |   ppi$
|                                    +-----------------● (103)
|                                    |
|                                    |   ssippi$
|                                    +-----------------● (97)
|                  |
|                  |   si
|                  +-----------------● (100->102)
|                                    |
|                                    |   ppi$
|                                    +-----------------● (101)
|                                    |
|                                    |   ssippi$
|                                    +-----------------● (95)
|
|   mississippi$
+------------------● (93)
|
|   $
+------------------● (110)
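
The trees above can be sanity-checked against a naive construction. The sketch below is not part of this repository: build_suffix_trie and count_leaves are hypothetical helper names, and the trie is an uncompressed, quadratic-time structure rather than Ukkonen's algorithm. It relies only on the fact that, when the text ends in a unique terminator, a suffix tree has exactly one leaf per suffix and one root edge per distinct first character.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (O(n^2) time and space)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Reuse the child for this character, or create it if missing.
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """Count nodes with no children; each one ends a distinct suffix."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


trie = build_suffix_trie("mississippi$")
print(sorted(trie))        # first characters of the root's outgoing edges
print(count_leaves(trie))  # number of leaves, one per suffix
```

For mississippi$ this reports five root edges ($, i, m, p, s) and 12 leaves, which agrees with the drawing above.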

Finally

Have fun!

ukkonen-s-suffix-tree-algorithm's Issues

Uses Python 2 syntax

I tried to translate to Python 3 but the meaning of the double parentheses in the setoutedge definition eludes me.

    def setoutedge(self, key, (anode, label_start_index, label_end_index, bnode)):

I guessed it would mean

    def setoutedge(self, key, anode=None, label_start_index=None, label_end_index=None, bnode=None):

but implementing this change just broke the code in new ways. Could you please explain what this definition means?
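
For what it's worth, the double parentheses look like Python 2 tuple parameter unpacking, which Python 3 removed (PEP 3113). Under that reading the method takes two arguments besides self: key and a single 4-tuple. A minimal Python 3 sketch follows. The Node class and its outedges dictionary here are hypothetical stand-ins, since the repository's actual method body is not shown in this issue.

```python
class Node:
    """Hypothetical stand-in for the repository's Node class."""

    def __init__(self):
        # Assumed storage for outgoing edges; the real attribute may differ.
        self.outedges = {}

    def setoutedge(self, key, edge):
        # Python 2 original:
        #   def setoutedge(self, key, (anode, label_start_index, label_end_index, bnode)):
        # Python 3: accept the tuple as one parameter and unpack it in the body.
        anode, label_start_index, label_end_index, bnode = edge
        self.outedges[key] = (anode, label_start_index, label_end_index, bnode)


root, child = Node(), Node()
# Callers keep passing a single 4-tuple, exactly as in the Python 2 version.
root.setoutedge('a', (root, 0, 3, child))
```

Turning the tuple into four separate keyword arguments changes the call signature, so existing callers that pass one tuple would no longer match, which may be why that attempt broke the code in new ways.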

O(n^2) time complexity due to list copy

The unfold function copies a list, which costs O(len(remains)) time per call. Because that call sits inside the while remainder > 0 loop, the whole implementation ends up with O(n^2) time complexity (e.g. for "aaaaab").
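
The effect described here can be illustrated without the repository's code. The sketch below is an assumption-laden stand-in, not the actual unfold function: consume_with_copy and consume_with_index are hypothetical names, and the loop simply drops one element per iteration. It counts how many elements get copied when a list is sliced inside a loop, compared with advancing a start index.

```python
def consume_with_copy(items):
    """Drop one item per iteration by slicing; every slice copies the rest."""
    copied = 0
    remains = list(items)
    while remains:
        remains = remains[1:]   # builds a new list of len(remains) - 1 elements
        copied += len(remains)
    return copied


def consume_with_index(items):
    """Same traversal, but advance a start index instead of copying."""
    copied = 0
    start = 0
    while start < len(items):
        start += 1              # constant-time step; nothing is copied
    return copied


print(consume_with_copy("aaaaab"))   # total elements copied by slicing
print(consume_with_index("aaaaab"))  # no elements copied
```

For an input of length n the slicing version copies n(n-1)/2 elements in total, which is where the quadratic behaviour comes from. Passing indices into the original sequence is one possible way to avoid it.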