lxndio / aas-benchmark

A collection of pattern matching algorithms and a tool to benchmark the algorithms against each other.

License: MIT License

Rust 91.93% Shell 0.03% TeX 8.04%

aas-benchmark's Introduction

aas-benchmark

Build Instructions

If you don't want to build aas-benchmark yourself, you can download a prebuilt release version instead. Building the tool yourself is the preferred way to get the latest features, however, because the release version may not be updated regularly.

Steps

You can follow these steps to compile aas-benchmark yourself:

  1. Make sure that you have the Rust compiler as well as Cargo installed. Preferably, use rustup to install the entire Rust toolchain.
  2. Clone or download this repository to a local directory.
  3. Open a terminal and navigate to the directory where the Cargo.toml file is located.
  4. Run cargo build --release to compile aas-benchmark.
     • Alternatively, you can run cargo run --release to compile and run aas-benchmark.
     • Using this, you can append command-line arguments after a double dash: cargo run --release -- --arguments --here.
  5. You will find an executable file in the target/release subdirectory.

Usage Instructions

This part of the README explains in more detail how to use aas-benchmark, with examples. Make sure you have read the Build Instructions section first.

Specifying Algorithms

The tool requires the parameter -a which specifies the algorithm or algorithms that you want to benchmark. You can either set a single or multiple algorithms.

aas-benchmark -a naive ...
aas-benchmark -a naive horspool kmp ...

Benchmark All Algorithms at Once

There is also a shortcut to benchmark all algorithms at once:

aas-benchmark -a all ...

Specifying a Number of Executions

If you like, you can specify a number of executions for each algorithm. You could for example use

aas-benchmark naive,horspool -n 10 ...

to run both the naive and the horspool algorithm 10 times each, which smooths out deviations in runtime. If you set different pattern lengths, the tool will run the set number of executions for each combination of algorithm and pattern length.
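The idea of repeated executions can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the name `time_executions` is hypothetical, and it only assumes that each execution is timed separately with a monotonic clock.

```rust
use std::time::{Duration, Instant};

/// Runs `run` the given number of times and records one duration per execution.
/// (Hypothetical helper; not taken from the aas-benchmark source.)
fn time_executions<F: FnMut()>(mut run: F, executions: usize) -> Vec<Duration> {
    (0..executions)
        .map(|_| {
            let start = Instant::now();
            run();
            start.elapsed()
        })
        .collect()
}
```

Keeping one duration per execution, rather than a single total, makes it possible to report each run individually or to aggregate them later.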

Specifying a Text Source

Randomly Generated Text

You can generate a random text with a length of m bytes by using the -t or --tr argument:

aas-benchmark naive -t m ...

Text From File

It is possible to load a text as a UTF-8 string from a file by using --tf:

aas-benchmark naive ... --tf text.txt

This would load the content of the file text.txt as the text.

Specifying a Pattern Source

Below, all possible arguments for specifying a pattern source are listed.

| Pattern(s) from... | Usage | Parameters | Multiple patterns? | Note |
| --- | --- | --- | --- | --- |
| ...fixed position in text | --pt a..b | Range¹ a..b of characters in text. | No. | |
| ...random position in text | --prt m or -p m | Pattern length m. | Yes, supply a range¹ for m or use --pmrt m1;m2;m3 with different lengths m_i. | |
| ...CLI argument | --pa pattern | Pattern as ASCII string pattern. | Yes, use --pa multiple times or enter multiple patterns separated by spaces after --pa. | |
| ...file | --pf pattern.txt | File pattern.txt | Yes, use --pmf and supply a file where each line contains one pattern. | |
| Randomly generated | --pr m | Pattern length m. | Yes, supply a range¹ for m. | |

¹ A range is written as a..b where a is the lower bound and b is the inclusive upper bound. You can also supply a step size c as in a..b,c.
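To make the range notation concrete, here is a small sketch of how a string such as `1..10,3` could be expanded into individual values. The function name `parse_range` and its error handling (returning `None` for malformed input, a zero step, or a lower bound above the upper bound) are assumptions for illustration; the tool's own parser may behave differently.

```rust
/// Expands "a..b" or "a..b,c" into the values it denotes.
/// The upper bound is inclusive and the step size defaults to 1.
/// (Hypothetical helper; not taken from the aas-benchmark source.)
fn parse_range(s: &str) -> Option<Vec<usize>> {
    // Split off the optional ",c" step suffix.
    let (bounds, step) = match s.split_once(',') {
        Some((bounds, step)) => (bounds, step.trim().parse::<usize>().ok()?),
        None => (s, 1),
    };
    // Split the remaining "a..b" into its two bounds.
    let (a, b) = bounds.split_once("..")?;
    let a: usize = a.trim().parse().ok()?;
    let b: usize = b.trim().parse().ok()?;
    if step == 0 || a > b {
        return None;
    }
    Some((a..=b).step_by(step).collect())
}
```

Under these assumptions, `2..5` expands to 2, 3, 4, 5 and `1..10,3` expands to 1, 4, 7, 10.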

Note that the names of those arguments all follow the same naming convention:

-- + p + Multiple? (m) + Random? (r) + Source

This may help you to remember the correct arguments.
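The convention can be expressed as a tiny function that assembles a flag name from its parts. This is only a mnemonic sketch: `pattern_flag` is a hypothetical name, and the source letters (`t` for text, `a` for argument, `f` for file, none for purely random patterns) are inferred from the table above.

```rust
/// Builds a pattern-source flag name following the naming convention.
/// `source` is the source letter, or None for randomly generated patterns.
/// (Hypothetical helper; not part of aas-benchmark.)
fn pattern_flag(multiple: bool, random: bool, source: Option<char>) -> String {
    let mut flag = String::from("--p");
    if multiple {
        flag.push('m');
    }
    if random {
        flag.push('r');
    }
    if let Some(letter) = source {
        flag.push(letter);
    }
    flag
}
```

For example, multiple patterns from random positions in the text give `--pmrt`, while a single pattern from a file gives `--pf`.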

Specifying a Seed

You can set a seed to make the generation of a random text and random patterns predictable using the -s or --seed argument:

aas-benchmark naive ... --seed 12345
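The following sketch shows why a fixed seed makes random text reproducible. It does not reflect the generator aas-benchmark actually uses, which is not described here: SplitMix64 is chosen only to keep the example free of dependencies, and `random_text` and its `alphabet_size` parameter are hypothetical names.

```rust
/// Small deterministic generator (SplitMix64), used here only as a stand-in.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Generates `len` bytes drawn from the alphabet {0, ..., alphabet_size - 1}.
/// The same seed always yields the same text.
/// (The modulo introduces a negligible bias, which is acceptable for a sketch.)
fn random_text(seed: u64, len: usize, alphabet_size: u16) -> Vec<u8> {
    assert!((1..=256).contains(&alphabet_size));
    let mut rng = SplitMix64(seed);
    (0..len)
        .map(|_| (rng.next_u64() % u64::from(alphabet_size)) as u8)
        .collect()
}
```

Because the generator's entire state is derived from the seed, two runs with the same seed, length, and alphabet size produce identical input, which makes measurements comparable across runs.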

Other Arguments

Here is a list of other arguments you can set:

| Argument | Description |
| --- | --- |
| --noheader | Disables the header in the CSV output |
| --alphabet n | Sets the alphabet size of randomly generated text and patterns to n |

List of Algorithms

Currently, these algorithms are supported:

Single Pattern Algorithms

| Algorithm | Command-line argument name |
| --- | --- |
| Backward Nondeterministic DAWG Matching (BNDM) | bndm |
| Backward Oracle Matching (BOM) | bom |
| Horspool | horspool |
| Naive Approach | naive |
| Knuth-Morris-Pratt (KMP) | kmp or kmp-classic |
| Shift-And | shift-and |
| Double Window Algorithm | dw |
| Bit-Parallel Length Independent Matching (BLIM) | blim |
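As an illustration of what one of these single pattern algorithms does, here is a minimal sketch of the Horspool algorithm. It is written independently of the aas-benchmark source, so the function signature and return type (a list of match start positions) are assumptions.

```rust
/// Returns the start positions of all occurrences of `pattern` in `text`.
/// (Independent sketch of Horspool; not the aas-benchmark implementation.)
fn horspool(text: &[u8], pattern: &[u8]) -> Vec<usize> {
    let m = pattern.len();
    let n = text.len();
    let mut matches = Vec::new();
    if m == 0 || m > n {
        return matches;
    }
    // shift[c]: how far the window may move when its last character is c.
    // Characters not in pattern[..m - 1] allow a full shift of m.
    let mut shift = [m; 256];
    for (i, &c) in pattern[..m - 1].iter().enumerate() {
        shift[c as usize] = m - 1 - i;
    }
    let mut pos = 0;
    while pos + m <= n {
        if &text[pos..pos + m] == pattern {
            matches.push(pos);
        }
        // Shift according to the last character of the current window.
        pos += shift[text[pos + m - 1] as usize];
    }
    matches
}
```

The shift table is what distinguishes Horspool from the naive approach: mismatching windows are often skipped by several positions at once, which tends to pay off for longer patterns and larger alphabets.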

Algorithms Using a Suffix Array

| Algorithm | Command-line argument name |
| --- | --- |
| Pattern Matching | sa-match |

See Suffix Array Generation Algorithms for more information on how the suffix array is generated.

Suffix Array Generation Algorithms

Algorithms that require a suffix array generate it using the SAIS algorithm by default. You can, however, choose the suffix array generation algorithm yourself with the --suffixarray argument:

aas-benchmark sa-match ... --suffixarray sais

Currently, these algorithms are available for suffix array generation:

| Algorithm | Command-line argument name |
| --- | --- |
| Naive approach | naive |
| SAIS | sais |
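To show how a suffix array supports pattern matching, here is a minimal sketch that builds the array by sorting all suffixes and then finds occurrences with two binary searches. The function names are hypothetical, and this is not the tool's implementation; SAIS produces the same array in linear time.

```rust
/// Builds a suffix array by sorting all suffixes lexicographically.
/// Simple but slow; shown only for illustration.
fn naive_suffix_array(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

/// Returns the suffix starting at `start`, truncated to at most `len` bytes.
fn truncated_suffix(text: &[u8], start: usize, len: usize) -> &[u8] {
    let end = text.len().min(start + len);
    &text[start..end]
}

/// Finds all occurrences of `pattern` using binary search on the suffix array.
/// All suffixes that start with `pattern` form one contiguous block in `sa`.
fn sa_match(text: &[u8], sa: &[usize], pattern: &[u8]) -> Vec<usize> {
    let m = pattern.len();
    let lo = sa.partition_point(|&i| truncated_suffix(text, i, m) < pattern);
    let hi = sa.partition_point(|&i| truncated_suffix(text, i, m) <= pattern);
    let mut positions = sa[lo..hi].to_vec();
    positions.sort_unstable();
    positions
}
```

Once the suffix array exists, each query needs only a logarithmic number of suffix comparisons. In a benchmark, how the array is generated therefore matters separately from how it is searched.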

Approximative Algorithms

| Algorithm | Command-line argument name |
| --- | --- |
| Ukkonen's DP Algorithm | ukkonen |
| Error Tolerant Shift-And | et-shift-and |

For approximative algorithms you can set a maximum allowed error value using the --maxerror argument:

aas-benchmark ukkonen ... --maxerror 2

This value defaults to 0 if not set.
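The role of the maximum error can be illustrated with the basic column-wise dynamic programming approach to approximate matching (often attributed to Sellers). This is a plain sketch without the cut-off optimization that Ukkonen's variant adds. It also assumes that errors are counted as edit distance (insertions, deletions, substitutions), which the text above does not state explicitly, and the function name is hypothetical.

```rust
/// Returns all end positions in `text` where `pattern` matches some substring
/// with edit distance at most `max_error`.
/// (Basic DP sketch; not the aas-benchmark implementation.)
fn approx_match_ends(text: &[u8], pattern: &[u8], max_error: usize) -> Vec<usize> {
    let m = pattern.len();
    // col[i]: minimal edit distance between pattern[..i] and any substring
    // of the text ending at the current position.
    let mut col: Vec<usize> = (0..=m).collect();
    let mut ends = Vec::new();
    for (j, &t) in text.iter().enumerate() {
        // col[0] stays 0 because a match may start at any text position.
        let mut diag = col[0]; // value of the previous column, one row up
        for i in 1..=m {
            let cost = usize::from(pattern[i - 1] != t);
            let new = (diag + cost).min(col[i] + 1).min(col[i - 1] + 1);
            diag = col[i];
            col[i] = new;
        }
        if col[m] <= max_error {
            ends.push(j);
        }
    }
    ends
}
```

With a maximum error of 0, this reports exactly the end positions of exact matches, consistent with the default value described above.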

List of Command-Line Arguments

You can run aas-benchmark --help to get a list of available arguments.

aas-benchmark's People

Contributors

lxndio, scholliyt

Forkers

scholliyt

aas-benchmark's Issues

Specifying multiple patterns

To benchmark multiple-pattern algorithms sensibly, it must be possible to specify multiple patterns. Therefore, all (or most?) pattern source arguments have to be able to take multiple patterns.

Examples:

aas-benchmark aho-corasick -t 1000000 -p 10 10 10
aas-benchmark aho-corasick -t 1000000 -p 1..10 2..5 10
aas-benchmark aho-corasick -t 1000000 --patternfromarg abc acb bac bca cab cba

Implementing more algorithms (tracking issue)

This issue is used to track which algorithms from the AaS script have already been implemented.

Single Pattern

  • Naive approach
  • Knuth-Morris-Pratt
  • Shift-And
  • BNDM
  • BOM

Full Text Index

  • Ukkonen's Algorithm
  • SAIS
  • Exact pattern matching
  • Backward-Search-Algorithm

Approximative

  • Ukkonen's DP Algorithm
  • Error tolerant Shift-And
  • Error tolerant BNDM
  • Error tolerant Backward-Search-Algorithm

Multiple Patterns

  • Naive approach
  • Shift-And
  • Aho-Corasick
  • PWM Pattern Matching

KMP algorithm is much slower than Classic KMP

Currently, fn kmp is much slower than fn kmp_classic. That may be the case because the lps function is calculated each time the DFA delta function is simulated.

There should be a way to generate the lps function only once and still keep the clean function signature of fn dfa_delta or fn dfa_delta_lps for use with the generic fn dfa_with_delta.
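One possible approach, sketched here under assumptions, is to compute the lps table once and move it into a closure that then serves as the delta function. The names `make_delta` and `simulate_dfa` are hypothetical stand-ins; the real signatures of fn dfa_delta and fn dfa_with_delta are not shown in this issue and may differ.

```rust
/// Computes the lps table (longest proper prefix that is also a suffix).
fn compute_lps(pattern: &[u8]) -> Vec<usize> {
    let mut lps = vec![0; pattern.len()];
    let mut len = 0;
    let mut i = 1;
    while i < pattern.len() {
        if pattern[i] == pattern[len] {
            len += 1;
            lps[i] = len;
            i += 1;
        } else if len > 0 {
            len = lps[len - 1];
        } else {
            i += 1;
        }
    }
    lps
}

/// Returns a delta function that owns a precomputed lps table,
/// so the table is built exactly once per pattern.
fn make_delta(pattern: &[u8]) -> impl Fn(usize, u8) -> usize + '_ {
    let lps = compute_lps(pattern);
    move |mut state: usize, c: u8| -> usize {
        // state = number of pattern characters matched so far.
        while state > 0 && (state == pattern.len() || pattern[state] != c) {
            state = lps[state - 1];
        }
        if state < pattern.len() && pattern[state] == c {
            state + 1
        } else {
            state
        }
    }
}

/// Generic DFA simulation; assumes a non-empty pattern of length `m`.
fn simulate_dfa(text: &[u8], m: usize, delta: impl Fn(usize, u8) -> usize) -> Vec<usize> {
    let mut state = 0;
    let mut matches = Vec::new();
    for (i, &c) in text.iter().enumerate() {
        state = delta(state, c);
        if state == m {
            matches.push(i + 1 - m);
        }
    }
    matches
}
```

The generic simulation still only sees a function from (state, character) to state, so its signature stays clean while the table lives inside the closure.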

Comma-separated output

The program should be able to generate comma-separated output that can be used in other applications.

The format should roughly be like this:

algorithm,text length,pattern length,execution,time ms
naive,1000000,5,0,25.3941
naive,1000000,5,1,24.3498
naive,1000000,5,2,27.4956
naive,1000000,5,3,25.0593
naive,1000000,5,4,26.2930
kmp,1000000,5,0,25.3941
kmp,1000000,5,1,24.3498
kmp,1000000,5,2,27.4956
kmp,1000000,5,3,25.0593
kmp,1000000,5,4,26.2930
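A minimal sketch of how such output could be produced is shown below. The `Measurement` struct, its field names, and the `to_csv` function are assumptions made for illustration; only the column order and the four decimal places are taken from the example above.

```rust
/// One timing result (hypothetical structure).
struct Measurement {
    algorithm: String,
    text_length: usize,
    pattern_length: usize,
    execution: usize,
    time_ms: f64,
}

/// Formats measurements as CSV, optionally preceded by the header line.
fn to_csv(measurements: &[Measurement], header: bool) -> String {
    let mut out = String::new();
    if header {
        out.push_str("algorithm,text length,pattern length,execution,time ms\n");
    }
    for m in measurements {
        out.push_str(&format!(
            "{},{},{},{},{:.4}\n",
            m.algorithm, m.text_length, m.pattern_length, m.execution, m.time_ms
        ));
    }
    out
}
```

Making the header optional lets several runs be concatenated into one file without repeated header lines.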

Text options

There should be different options for setting a text:

  • Predefined text
    • from CLI argument
    • from file
  • Random text
    • with given length
    • with random length?
    • with increasing/decreasing length by given step size

Make functions private

Not all functions which are currently public need to be. Change all functions to private which are not accessed from outside of the corresponding modules.

Shift-And panics with long pattern lengths

When using long pattern lengths like 100, the Shift-And algorithm panics.

Used command: shift-and -t 1000000 -p 100

Panic message:

thread 'main' panicked at 'attempt to multiply with overflow', src/algorithms/single_pattern/shift_and.rs:11:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
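The exact cause in shift_and.rs is not visible from the panic message alone. Shift-And typically stores one bit per pattern position in a machine word, though, so an overflow for long patterns is plausible. The sketch below assumes a 64-bit state word and shows one way to reject patterns that do not fit instead of panicking. It is an independent illustration with a hypothetical signature, not a patch for the actual code.

```rust
/// Shift-And with a single 64-bit state word.
/// Returns an error instead of panicking if the pattern does not fit.
fn shift_and(text: &[u8], pattern: &[u8]) -> Result<Vec<usize>, String> {
    let m = pattern.len();
    if m == 0 || m > 64 {
        return Err(format!("pattern length {} not supported (must be 1..=64)", m));
    }
    // masks[c] has bit i set if pattern[i] == c.
    let mut masks = [0u64; 256];
    for (i, &c) in pattern.iter().enumerate() {
        masks[c as usize] |= 1u64 << i;
    }
    let accept = 1u64 << (m - 1);
    let mut state = 0u64;
    let mut matches = Vec::new();
    for (i, &c) in text.iter().enumerate() {
        // Advance all active prefixes by one and start a new one at bit 0.
        state = ((state << 1) | 1) & masks[c as usize];
        if state & accept != 0 {
            matches.push(i + 1 - m);
        }
    }
    Ok(matches)
}
```

Supporting patterns longer than one word would require multi-word bit vectors, which is a larger change than this guard.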

Pattern options

There should be different options for setting a pattern:

  • Predefined pattern
    • from CLI argument
      • as comma-separated bytes
      • as UTF-8
    • from file
  • Random pattern
    • with given length
    • with random length?
    • with increasing/decreasing length by given step size
  • Pattern from fixed position in text (with given length)
  • Multiple patterns from fixed positions in text (with given lengths)
  • Pattern from random position in text
    • with given length
    • with random length?
    • with increasing/decreasing length by given step size

Aggregate measurements

Add CLI parameters to aggregate multiple measurements, e.g. to calculate the average etc.
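As a sketch of what such aggregation could compute, the following functions derive the mean and median from a list of measured times. The function names are hypothetical, and no CLI parameter names are proposed here.

```rust
/// Arithmetic mean of the samples, or None if there are none.
fn mean(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<f64>() / samples.len() as f64)
}

/// Median of the samples, or None if there are none.
/// The median is less sensitive to single slow outlier runs than the mean.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    } else {
        Some(sorted[mid])
    }
}
```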

Change alphabet size

Add an option to change the alphabet size (instead of using 8 bit characters). This would probably require many changes in the structure of the program itself as well as in all algorithms.
