merterpam / find-similar-strings

A Java application which finds similar strings based on common substrings

License: Apache License 2.0

Java 100.00%
java suffixtree similarity longest-common-substring

Find Similar Strings

Find Similar Strings is a small Java application that finds similar documents using substring similarity. It uses a modified version of abahgat’s generalized suffix tree implementation as its underlying data structure.

The application is intended to find, in a list of documents, those that are similar to a given input string. The similarity between two documents s1 and s2 is defined as 2 * |LCS(s1, s2)| / (|s1| + |s2|), where LCS(s1, s2) is the longest common substring of s1 and s2 and |x| is the length of x. The score ranges from 0 (no character in common) to 1 (identical strings).
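
To make the similarity formula concrete, here is a minimal sketch that evaluates it for two strings. It is not the project's code: it uses a plain dynamic-programming longest-common-substring routine instead of the suffix tree, and the class name SimilaritySketch and both method names are hypothetical. Returning 0 when both strings are empty is also an assumption, made only to avoid dividing by zero.

```java
// Illustrative sketch only: SimilaritySketch and its methods are hypothetical
// names, not part of find-similar-strings.
final class SimilaritySketch {

    // Length of the longest common substring of a and b, computed with the
    // classic O(|a| * |b|) dynamic program using two rolling rows.
    static int longestCommonSubstringLength(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    // Extend the common substring ending at a[i-2], b[j-2].
                    curr[j] = prev[j - 1] + 1;
                    best = Math.max(best, curr[j]);
                } else {
                    curr[j] = 0;
                }
            }
            // Swap rows; every cell of the new curr is overwritten next round.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return best;
    }

    // similarity(s1, s2) = 2 * |LCS(s1, s2)| / (|s1| + |s2|)
    static double similarity(String s1, String s2) {
        int total = s1.length() + s2.length();
        if (total == 0) {
            return 0.0; // assumption: two empty strings score 0
        }
        return 2.0 * longestCommonSubstringLength(s1, s2) / total;
    }

    public static void main(String[] args) {
        // The longest common substring here is "carothers" (9 characters),
        // so the score is 2 * 9 / (19 + 23).
        System.out.println(similarity("carothersezealhouse", "carothersjohnhenryhouse"));
    }
}
```

A brute-force routine like this costs O(|s1| * |s2|) per pair, so it is mainly useful as a reference for checking results on small inputs.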

Usage

To use the application, first insert into the suffix tree the documents you want to search. Then supply a query document and a threshold; the algorithm returns the indexes of the documents whose similarity to the query is above the threshold.

Sample usage:

        // Requires: import java.util.HashSet;
        GeneralizedSuffixTree in = new GeneralizedSuffixTree();
        String[] words = new String[]{"libertypike",
                "franklintn",
                "carothersjohnhenryhouse",
                "carothersezealhouse",
                "acrossthetauntonriverfromdightonindightonrockstatepark",
                "dightonma",
                "dightonrock",
                "bethesda"};
        // Insert every document, keyed by its position in the array.
        for (int i = 0; i < words.length; ++i) {
            in.put(words[i], i);
        }

        // Query with a similarity threshold of 0.3.
        String word = "carothersezealhouse";
        HashSet<Integer> similarWordIndexes = in.getSimilarStringIndexes(word, 0.3);
        // Map the returned indexes back to the original documents.
        for (Integer index : similarWordIndexes) {
            System.out.println(words[index]);
        }

Complexity

The space complexity of the algorithm is O(n), where n is the total number of characters across all documents. A single similarity search takes O(m^2) time, where m is the length of the query document.
