apurvakatti19 / top-k-substring-matching-algorithms

This project implements four different top-k substring matching algorithms from a conference paper, using the DBLP dataset. A wide range of applications need to query a large database of texts to search for similar strings or substrings. Traditional approximate substring matching requires the user to specify a similarity threshold.

Java 100.00%
database string-matching sax-parser dblp-dataset paper conference

top-k-substring-matching-algorithms's Introduction

Top-K Substring Matching Algorithms

In this project I have implemented the paper titled “Efficient Top-k Algorithms for Approximate Substring Matching” from SIGMOD '13. The paper can be found at http://dbgroup.cs.tsinghua.edu.cn/dd/list/3.pdf

The main idea of the paper is to match a query string supplied by the user against the stored strings using approximate pattern matching on substrings. By generating q-grams, it is possible to extract the top-k strings that contain the most similar substrings. The four algorithms given in the paper are implemented and applied to the DBLP dataset.
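As a rough illustration of the q-gram step mentioned above, the sketch below splits a string into its overlapping substrings of length q. The class name `QGrams` and the method `qgrams` are hypothetical and are not taken from the repository; how the project itself generates and indexes q-grams may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the repository.
final class QGrams {
    /** Returns every contiguous substring of length q, in left-to-right order. */
    static List<String> qgrams(String s, int q) {
        if (q <= 0) {
            throw new IllegalArgumentException("q must be positive");
        }
        List<String> grams = new ArrayList<>();
        // A string of length n has n - q + 1 q-grams (none if n < q).
        for (int i = 0; i + q <= s.length(); i++) {
            grams.add(s.substring(i, i + q));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(qgrams("top-k", 2)); // prints [to, op, p-, -k]
    }
}
```

Two strings that share a similar substring must also share some q-grams, which is what makes q-grams usable as a cheap filter before computing exact edit distances.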

Programming Language: The implementation is written in Java using the NetBeans IDE.

Dataset: The algorithms use the DBLP.xml dataset, a large XML file containing the tags author, citation, paper and conference.

Database: The dataset is stored in a MySQL database. The connection is set up by the code in DBConnection.java.

Data Parsing: To extract only the four relevant tags, use Element.java, Paper.java and Conference.java, and run Parser.java to parse the file. It uses the SAX parser to extract the required data and store it in the database.
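The SAX approach described above can be sketched as follows. This is a minimal example under stated assumptions, not the code in Parser.java: the class name `AuthorSaxSketch`, the tiny inline XML document and the choice to collect only `author` text are all illustrative, and the database insert is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the repository's Parser.java.
final class AuthorSaxSketch {
    /** Collects the text content of every <author> element in the given XML. */
    static List<String> authors(String xml) {
        List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <author>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("author".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one text node in several chunks, so append.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("author".equals(qName)) {
                    found.add(text.toString().trim());
                    text = null;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<dblp><paper><author>A. Author</author><author>B. Author</author></paper></dblp>";
        System.out.println(authors(xml)); // prints [A. Author, B. Author]
    }
}
```

SAX streams the document element by element instead of loading it into memory, which is why it suits a file as large as DBLP.xml.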

The code that creates the four tables in the MySQL database is given in db.sql.

Algorithms: The implementations of the algorithms are in the packages ToPKNaive, TopKINDEX, TopKSplit and TopKLBNew.

The important classes used by these algorithms are in their respective packages.
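To give a feel for what a naive top-k baseline might look like, the sketch below scores every string by the smallest edit distance between the query and any of its substrings, then keeps the k best. It uses the standard dynamic program for approximate substring matching (Sellers' algorithm) and is an assumption about the general technique, not a copy of the ToPKNaive package; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not the repository's ToPKNaive implementation.
final class NaiveTopKSketch {
    /** Smallest edit distance between query and any substring of text. */
    static int substringDistance(String query, String text) {
        int m = query.length();
        int n = text.length();
        int[] prev = new int[n + 1]; // row 0 is all zeros: a match may start anywhere in text
        int[] curr = new int[n + 1];
        for (int i = 1; i <= m; i++) {
            curr[0] = i; // i query characters against an empty substring
            for (int j = 1; j <= n; j++) {
                int cost = query.charAt(i - 1) == text.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(prev[j - 1] + cost, Math.min(prev[j], curr[j - 1]) + 1);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        // The match may end anywhere, so take the minimum over the last row.
        int best = prev[0];
        for (int j = 1; j <= n; j++) {
            best = Math.min(best, prev[j]);
        }
        return best;
    }

    /** Returns up to k strings with the smallest substring distance; ties keep input order. */
    static List<String> topK(String query, List<String> strings, int k) {
        int[] dist = new int[strings.size()];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < strings.size(); i++) {
            dist[i] = substringDistance(query, strings.get(i));
            order.add(i);
        }
        order.sort(Comparator.comparingInt((Integer i) -> dist[i])); // stable sort
        List<String> result = new ArrayList<>();
        for (int r = 0; r < Math.min(k, order.size()); r++) {
            result.add(strings.get(order.get(r)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> titles = new ArrayList<>();
        titles.add("vldb journal");
        titles.add("acm sigmod record");
        titles.add("sigmd conference");
        titles.add("icde");
        System.out.println(topK("sigmod", titles, 2)); // prints [acm sigmod record, sigmd conference]
    }
}
```

This baseline scans every string in full, so its cost grows with the total size of the database; index-based and bound-based variants exist to avoid computing the distance for most strings.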

The DBLP dataset is available on Google Drive at https://drive.google.com/open?id=1sLVRM57sRB3rx3EcUkA7DcvHcsntVNnL

