zearch's Introduction

Regular Expression Search on Compressed Text

What is it?

zearch is a regular expression engine that takes as input a regular expression and a grammar-based compressed text, and returns every line of the uncompressed text that contains a match for the regular expression.
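To make the expected result concrete, here is a minimal sketch of the same line-oriented behaviour on uncompressed text, using plain grep rather than zearch. The sample text and the function names (`sample_text`, `matching_lines`, `matching_count`) are made up for illustration, and grep's extended syntax is not necessarily the syntax zearch accepts.

```shell
# Reference behaviour on *uncompressed* text, using grep (this is not zearch).
# sample_text, matching_lines and matching_count are hypothetical helper names.
sample_text() {
  printf '%s\n' 'alpha beta' 'gamma' 'beta gamma'
}

# Print every line that contains a match for the regular expression in $1.
matching_lines() {
  sample_text | grep -E -- "$1"
}

# Print the number of lines that contain a match.
matching_count() {
  sample_text | grep -E -c -- "$1"
}

matching_lines 'b.ta'    # prints "alpha beta" and "beta gamma"
matching_count 'gam+a'   # prints 2
```

zearch aims to produce this kind of output directly from the compressed file, without first materialising the uncompressed text.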

Limitations

  • no match across lines
  • no support for the invert-match option
  • only regular languages (like RE2) → e.g. no backreferences
  • no zero-width character
  • no named character classes (e.g. alnum, digit, lower, …)
  • no UTF8 support, only ASCII characters
  • no highlighted output, matching lines are only reported in full

Compiling

Installing libfa

Install the tools and libraries required to build augeas (in addition to the usual tools: gcc, autoconf, automake, etc.).

sudo apt-get install bison flex libreadline-dev libxml2-dev

Clone the augeas project repository.

git clone git@github.com:hercules-team/augeas.git

From the root folder run

./autogen.sh
make
sudo make install

Finally, update the library path so that zearch can find libfa

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LD_LIBRARY_PATH
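A slightly more defensive variant of the step above is sketched here. It assumes libfa was installed under /usr/local/lib, as in the command above. The helper name `add_lib_dir` is hypothetical. It avoids adding the directory twice, and it avoids the leading empty entry that the plain form produces when LD_LIBRARY_PATH is initially unset (the dynamic loader treats an empty entry as the current directory).

```shell
# Append a directory to LD_LIBRARY_PATH unless it is already listed.
# add_lib_dir is a hypothetical helper, not part of zearch or augeas.
add_lib_dir() {
  dir=$1
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$dir:"*)
      ;;  # already present, nothing to do
    *)
      # ${VAR:+...} only inserts the separator when the variable is non-empty.
      LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$dir"
      ;;
  esac
  export LD_LIBRARY_PATH
}

add_lib_dir /usr/local/lib
```

Placing these lines in a shell startup file makes the setting persistent across sessions.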

Compiling Zearch

From the root folder of this repository, run make zearch stats debug, which generates three executables:

  • zearch is the main program.
  • stats works like zearch but also reports statistics about memory usage.
  • debug prints extensive debugging information to stderr.

These three tools are invoked in the same way:

./{zearch,debug,stats} [-m] <option> <input_regex> <input_file>

where

  • -m is an optional argument. When present, the program determinizes and minimizes the automaton, following the algorithm used by libfa. Note that this option does not always improve performance.
  • option is one of -c, -l, -a or -b, which respectively print the number of matching lines, print the matching lines, print both, or only report whether there is at least one match.
  • input_regex is a regular expression following the format accepted by libfa.
  • input_file is a repair-compressed file.
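The invocation described above can be wrapped in a small helper that checks the arguments before calling the binary. This is only a sketch: `zearch_run`, the `ZEARCH` variable and the file name `corpus.rp` are hypothetical, and the only flags used are the ones listed above.

```shell
# Path to the zearch binary; the default location is an assumption of this sketch.
ZEARCH=${ZEARCH:-./zearch}

# zearch_run [-m] <-c|-l|-a|-b> <regex> <file>
# Hypothetical wrapper that validates its arguments, then calls zearch.
zearch_run() {
  minimize=
  if [ "${1:-}" = "-m" ]; then
    minimize=-m
    shift
  fi
  if [ "$#" -ne 3 ]; then
    echo "usage: zearch_run [-m] <-c|-l|-a|-b> <regex> <file>" >&2
    return 2
  fi
  case "$1" in
    -c|-l|-a|-b) ;;
    *) echo "unknown option: $1 (expected -c, -l, -a or -b)" >&2; return 2 ;;
  esac
  # ${minimize:+...} expands to nothing when -m was not requested.
  "$ZEARCH" ${minimize:+"$minimize"} "$1" "$2" "$3"
}

# Example (needs a built ./zearch and a repair-compressed file; corpus.rp is a made-up name):
#   zearch_run -c 'ab*c' corpus.rp
#   zearch_run -m -l 'ab*c' corpus.rp
```

Quoting the regular expression keeps the shell from expanding characters such as * before zearch sees them.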

Comparison with other tools

We have compared the performance of zearch with the state-of-the-art approach for regular expression matching on compressed text. Due to the limited functionality of our tool, the comparison only considers the running time these tools need to report the number of lines in the original file that contain a match. For the experiments, the state of the art is represented by the following command:

{lz4,zstd} -dc compressed_file | {grep,rg,hyperscan} -c regex

The experiments show that our tool outperforms the state of the art, even though that pipeline runs decompression and search in parallel. The detailed results of this comparison are available here
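The baseline command above uses brace notation to stand for several tool combinations. A minimal sketch of one concrete combination is given here. The function name `baseline_count` and the file name `corpus.zst` are hypothetical, and the sketch assumes that the chosen decompressor accepts -dc and the chosen search tool accepts -c, as in the command above.

```shell
# baseline_count <decompressor> <searcher> <regex> <compressed_file>
# Decompress to stdout and count the matching lines, as in the baseline above.
# The shell starts both sides of the pipe concurrently, so the search runs
# while the decompressor is still producing output.
baseline_count() {
  decompressor=$1
  searcher=$2
  regex=$3
  file=$4
  "$decompressor" -dc "$file" | "$searcher" -c -- "$regex"
}

# Example (corpus.zst is a made-up file name):
#   time baseline_count zstd grep 'ab*c' corpus.zst
```

Timing this call against a zearch -c run on the same text gives the kind of measurement the comparison relies on.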

Contributors

pevalme, pombredanne
