cceh / suffix-tree

A Generalized Suffix Tree for any Python iterable using Ukkonen's algorithm, with Lowest Common Ancestor retrieval.

License: GNU General Public License v3.0

python suffix-tree suffixtree ukkonen lca

suffix-tree's Introduction

A Generalized Suffix Tree

py39 py310 py311 py312 pypy39 coverage

A Generalized Suffix Tree for any Python sequence, with Lowest Common Ancestor retrieval.

$ pip install suffix-tree

>>> from suffix_tree import Tree

>>> tree = Tree({"A": "xabxac"})
>>> tree.find("abx")
True
>>> tree.find("abc")
False

This suffix tree:

  • works with any Python sequence, not just strings, if the items are hashable,
  • is a generalized suffix tree for sets of sequences,
  • is implemented in pure Python,
  • builds the tree in time proportional to the length of the input,
  • does constant-time Lowest Common Ancestor retrieval.

Being implemented in pure Python, this tree is neither very fast nor memory efficient. Building the tree takes time proportional to the total length of the input sequences. A query takes time proportional to the length of the query sequence.
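To make the query-time claim concrete, here is a minimal sketch of a naive suffix trie. It is not the library's implementation and it does not use Ukkonen's algorithm: building it takes quadratic time and memory. The class name `NaiveSuffixTrie` is hypothetical. It only illustrates why a lookup visits one node per query item, and why any hashable items work, not just characters.

```python
class NaiveSuffixTrie:
    """Illustrative only: quadratic-time build, nested dicts as nodes."""

    def __init__(self, sequences):
        self.root = {}
        for seq in sequences.values():
            # Insert every suffix of every sequence, item by item.
            for start in range(len(seq)):
                node = self.root
                for item in seq[start:]:
                    node = node.setdefault(item, {})

    def find(self, query):
        """Return True if query occurs in any sequence; one step per item."""
        node = self.root
        for item in query:
            if item not in node:
                return False
            node = node[item]
        return True


trie = NaiveSuffixTrie({"A": "xabxac"})
print(trie.find("abx"))
print(trie.find("abc"))

# Items only need to be hashable, so a list of words works as well.
words = NaiveSuffixTrie({"W": ["to", "be", "or", "not", "to", "be"]})
print(words.find(["or", "not"]))
```

A real suffix tree compresses chains of single-child nodes into edges and is built in linear time, which this sketch deliberately leaves out.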

To get the best performance, turn on the Python optimizer: python -O.

Documentation: https://cceh.github.io/suffix-tree/

PyPI: https://pypi.org/project/suffix-tree/

Usage examples:

>>> from suffix_tree import Tree
>>> tree = Tree()
>>> tree.add(1, "xabxac")
>>> tree.add(2, "awyawxawxz")
>>> tree.find("abx")
True
>>> tree.find("awx")
True
>>> tree.find("abc")
False

>>> tree = Tree({"A": "xabxac", "B": "awyawxawxz"})
>>> tree.find_id("A", "abx")
True
>>> tree.find_id("B", "abx")
False
>>> tree.find_id("B", "awx")
True

>>> tree = Tree(
...     {
...         "A": "sandollar",
...         "B": "sandlot",
...         "C": "handler",
...         "D": "grand",
...         "E": "pantry",
...     }
... )
>>> for k, length, path in tree.common_substrings():
...     print(k, length, path)
...
2 4 s a n d
3 3 a n d
4 3 a n d
5 2 a n

>>> tree = Tree({"A": "xabxac", "B": "awyawxawxz"})
>>> for C, path in sorted(tree.maximal_repeats()):
...     print(C, path)
...
1 a w
1 a w x
2 a
2 x
2 x a

suffix-tree's People

Contributors

marcelloperathoner

suffix-tree's Issues

Dump tree to file

Hey,

Do you plan to add a function for dumping a tree into a file?
I've tried using pickle, which works for a small tree, but once the tree grows beyond a certain size I get a recursion error:
RecursionError: maximum recursion depth exceeded in comparison
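One possible workaround, which is not a feature of the library, is to pickle the input sequences instead of the tree and rebuild the tree after loading. The sketch below assumes the caller passes a factory such as `Tree`, whose constructor accepts a dict of sequences as in the README examples. The helper names `save_sequences` and `load_and_rebuild` are hypothetical.

```python
import pickle


def save_sequences(path, sequences):
    """Store the raw input sequences, which are flat and pickle safely."""
    with open(path, "wb") as fh:
        pickle.dump(dict(sequences), fh)


def load_and_rebuild(path, tree_factory):
    """Load the sequences and rebuild the tree with the given factory."""
    with open(path, "rb") as fh:
        sequences = pickle.load(fh)
    # tree_factory is assumed to accept a dict mapping id -> sequence.
    return tree_factory(sequences)
```

With the library installed, usage would look like `save_sequences("tree.pkl", {"A": "xabxac"})` followed by `load_and_rebuild("tree.pkl", Tree)`. This trades a full rebuild at load time for not having to serialize a deeply nested object graph.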

common_substrings documentation - possibility that path component of result may be None is not mentioned

The following example demonstrates the problem:

from suffix_tree import Tree

strings = ["Housing", "Lever"]
tree = Tree(dict(enumerate(strings)))
for k, length, path in tree.common_substrings():
    print(f"{k=} {length=} {path=}")

The results of common_substrings() contain None in place of path, a possibility that the documentation of the method does not mention at all. I believe this case should be documented; otherwise users of the method will not be aware of it and will not be able to handle it properly.
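Until the documentation covers this case, callers can guard against a None path themselves. The sketch below is a hypothetical helper, not part of the library. It accepts any iterable of (k, length, path) tuples, such as the one returned by common_substrings(), and substitutes a placeholder where path is None. The sample rows are made-up stand-ins, not actual library output.

```python
def printable_common_substrings(results, empty="<none>"):
    """Yield (k, length, text), replacing a None path with a placeholder."""
    for k, length, path in results:
        text = empty if path is None else str(path)
        yield k, length, text


# Stand-in rows for illustration only; with the library installed,
# pass tree.common_substrings() instead.
sample = [(2, 0, None), (3, 3, "a n d")]
for row in printable_common_substrings(sample):
    print(row)
```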

Linear in what?

Unfortunately, I was not able to construct a suffix tree in linear time using this library:

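import time
import random
from suffix_tree import Tree

random.seed(0)
# 1000 random "words", each a list of 10 to 20 lowercase letters
words = [random.choices("abcdefghijklmnopqrstuvwxyz", k=random.randint(10, 20)) for _ in range(1000)]

result = []
for n in range(0, len(words) + 1, 50):
    it_words = words[:n]
    it_times = []
    # build the tree three times and average the CPU time (in ns)
    for _ in range(3):
        start = time.process_time_ns()
        Tree(dict(enumerate(it_words)))
        end = time.process_time_ns()
        it_times.append(end - start)
    # record (total number of symbols, mean build time in ns)
    result.append((sum(map(len, it_words)), sum(it_times) / len(it_times)))

for total_symbols, mean_ns in result:
    print(f"{total_symbols}, {mean_ns}")

Results (total number of symbols, mean build time in ns):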
0, 11649.666666666666
750, 46233650.333333336
1487, 89753019.66666667
2255, 161773125.66666666
2975, 258553530.0
3702, 358599287.6666667
4461, 458972382.0
5216, 608262643.3333334
5914, 758510699.3333334
6683, 961070447.3333334
7401, 1300382351.0
8162, 1550146969.0
8921, 1846915976.6666667
9662, 2192115563.0
10416, 2549625218.0
11208, 3017123428.0
11997, 3529445659.6666665
12719, 4047462716.0
13422, 5052296462.0
14168, 6944860159.666667
14922, 8248390707.333333

(image: graph of $1.6239 \times 10^8 \cdot 1.00026^x - 4.7164 \times 10^7$)
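One way to quantify the growth, offered here as a rough sketch rather than a rigorous benchmark, is to fit a straight line to the measurements on a log-log scale. If the build time behaves like c * n^p, the slope of that line estimates p, and a slope near 1 would indicate linear scaling. The data points are copied from the results above; the helper name `loglog_slope` is hypothetical.

```python
import math

# (total symbols, mean build time in ns), copied from the results above.
# The n = 0 row is omitted because log(0) is undefined.
measurements = [
    (750, 46233650.333333336), (1487, 89753019.66666667),
    (2255, 161773125.66666666), (2975, 258553530.0),
    (3702, 358599287.6666667), (4461, 458972382.0),
    (5216, 608262643.3333334), (5914, 758510699.3333334),
    (6683, 961070447.3333334), (7401, 1300382351.0),
    (8162, 1550146969.0), (8921, 1846915976.6666667),
    (9662, 2192115563.0), (10416, 2549625218.0),
    (11208, 3017123428.0), (11997, 3529445659.6666665),
    (12719, 4047462716.0), (13422, 5052296462.0),
    (14168, 6944860159.666667), (14922, 8248390707.333333),
]


def loglog_slope(points):
    """Least-squares slope of log(time) against log(n)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


slope = loglog_slope(measurements)
print(f"estimated exponent: {slope:.2f}")
```

A single run on one machine only suggests a trend: measurement noise, garbage collection and cache effects can also bend the curve, so the estimate should be read with caution.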

Issue with `common_substrings`

First of all, thanks a lot for providing this implementation of a generalized suffix tree. Kudos.

I was playing a little bit with the function common_substrings and noticed some strange behavior. In certain cases, it misses substrings and presents duplicates:

    from suffix_tree import Tree

    strings = ["abcxyz", "abcxyz", "xyzabc", "xyzabc"]
    tree = Tree(dict(enumerate(strings)))
    for k, length, path in tree.common_substrings():
        print(k, length, path)

The output of this snippet is

    2 6 a b c x y z
    3 3 a b c
    4 3 a b c

In this example, we have four strings composed of abc and xyz. I was expecting common_substrings to return at least these two substrings. Furthermore, I cannot understand why the substring abc is returned twice with different (incorrect) $k$. What am I missing?
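One reading that is consistent with the README example, though not confirmed by the source, is that common_substrings() reports for each k a single longest substring shared by at least k of the sequences. Under that assumption, the brute-force sketch below lists all longest candidates per k, which makes it easy to compare against the output above. The function name `longest_common_by_k` is hypothetical, and the approach is quadratic per string, so it is only suitable for small inputs.

```python
from collections import defaultdict


def longest_common_by_k(strings):
    """Map k -> (length, set of longest substrings found in >= k strings)."""
    # substring -> indices of the input strings that contain it
    owners = defaultdict(set)
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                owners[s[i:j]].add(idx)

    best = {}
    for k in range(2, len(strings) + 1):
        shared = [sub for sub, ids in owners.items() if len(ids) >= k]
        if not shared:
            best[k] = (0, set())
            continue
        longest = max(len(sub) for sub in shared)
        best[k] = (longest, {sub for sub in shared if len(sub) == longest})
    return best


strings = ["abcxyz", "abcxyz", "xyzabc", "xyzabc"]
for k, (length, subs) in longest_common_by_k(strings).items():
    print(k, length, sorted(subs))
```

Under this reading, k is the number of sequences that share the substring, so abc appearing for both k = 3 and k = 4 would not be a duplicate, and xyz would be an equally long alternative that is simply not listed.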
