
Comments (3)

BurntSushi commented on August 15, 2024

Nice find. At first I thought this might be a case of the latency of the regex engine playing a role here (because it generally has more overhead than what you have here). But some other examples reveal this probably isn't the case?

$ time rg -c '[A-Z]+' enwik9
7421426

real    0.604
user    0.547
sys     0.057
maxmem  959 MB
faults  0

$ time ltrep-1297041 -c '[A-Z]+' enwik9
7421426

real    0.904
user    0.843
sys     0.060
maxmem  954 MB
faults  0

$ time rg -c '[A-Z][A-Z]+' enwik9
1212981

real    1.141
user    1.077
sys     0.063
maxmem  959 MB
faults  0

$ time ltrep-1297041 -c '[A-Z][A-Z]+' enwik9
1212981

real    0.925
user    0.884
sys     0.040
maxmem  954 MB
faults  0

So I think this probably warrants some investigation.

Also, just because something is a toy doesn't mean it is expected to never be faster than something that isn't a toy. For example, this is the "toy" version of memchr:

fn memchr(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

But sometimes this will be faster than a SIMD optimized version, because the SIMD optimized version needs a preamble, a fallback for the case when the haystack is smaller than the vector size, code to create the vector and so on. And the SIMD version might not get inlined if it's using AVX2 and the surrounding code wasn't compiled with AVX2 enabled. (Which is the common case, although the rise of things like x86-64-v3 is changing that.)
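
A rough sketch of the shape such an implementation tends to take (illustrative only, not the memchr crate's actual code; the scalar chunk scan below stands in for the real vector compare-and-mask):

fn memchr_vectorized_shape(needle: u8, haystack: &[u8]) -> Option<usize> {
    // One AVX2 register holds 32 bytes.
    const VECTOR_SIZE: usize = 32;

    // Preamble/fallback: short haystacks never reach the vector path, so
    // tiny inputs pay this branch on top of what the "toy" version does.
    if haystack.len() < VECTOR_SIZE {
        return haystack.iter().position(|&b| b == needle);
    }

    // Main loop: VECTOR_SIZE bytes per iteration. Real code would splat
    // `needle` into a vector, compare a whole chunk and extract a mask here.
    let mut i = 0;
    while i + VECTOR_SIZE <= haystack.len() {
        let chunk = &haystack[i..i + VECTOR_SIZE];
        if let Some(j) = chunk.iter().position(|&b| b == needle) {
            return Some(i + j);
        }
        i += VECTOR_SIZE;
    }

    // Tail: the last partial chunk needs separate handling.
    haystack[i..].iter().position(|&b| b == needle).map(|j| i + j)
}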

And like, despite your regex engine being a toy, your search code is still very tight and about as good as you can do with a DFA: you transition from state to state and report whether a match state was seen. ripgrep's regex engine basically does the same, but with more frills.
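
For illustration, a loop in that spirit might look something like this (a sketch only, not ltrep's or ripgrep's actual code):

fn dfa_contains_match(
    transitions: &[[usize; 256]], // transitions[state][byte] -> next state
    accepting: &[bool],           // accepting[state] -> is this a match state?
    start: usize,
    haystack: &[u8],
) -> bool {
    // Walk the DFA over every byte and remember whether a match state was
    // ever entered; this is about as tight as a full DFA scan gets.
    let mut state = start;
    let mut matched = accepting[state];
    for &b in haystack {
        state = transitions[state][b as usize];
        matched |= accepting[state];
    }
    matched
}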


BurntSushi commented on August 15, 2024

My hypothesis here is that your toy regex engine is benefiting precisely from the coupling that comes from the nature of it being a toy. That is, in your grep, the code for searching the input and the code for walking the DFA are effectively intertwined. That generally isn't true for a general purpose regex engine: there's usually an abstraction boundary that separates them. To test this, I generated a file with the same number of lines as enwik9 but where every line was just AZ:

use std::io::Write;

fn main() -> anyhow::Result<()> {
    let lines = 13_147_025;
    let mut out = std::io::stdout().lock();
    for _ in 0..lines {
        writeln!(out, "AZ")?;
    }
    Ok(())
}

Then I benchmarked ripgrep, ltrep and GNU grep with hyperfine:

$ hyperfine --output pipe "LC_ALL=C grep -E -c '[A-Z][A-Z]' enwik9-two-capitals" "rg --no-config -c '[A-Z][A-Z]' enwik9-two-capitals" "ltrep-1297041 -c '[A-Z][A-Z]' enwik9-two-capitals"
Benchmark 1: LC_ALL=C grep -E -c '[A-Z][A-Z]' enwik9-two-capitals
  Time (mean ± σ):     112.7 ms ±   2.5 ms    [User: 110.7 ms, System: 2.2 ms]
  Range (min … max):   110.5 ms … 120.4 ms    25 runs

Benchmark 2: rg --no-config -c '[A-Z][A-Z]' enwik9-two-capitals
  Time (mean ± σ):     327.9 ms ±   1.5 ms    [User: 326.3 ms, System: 1.7 ms]
  Range (min … max):   325.4 ms … 331.0 ms    10 runs

Benchmark 3: ltrep-1297041 -c '[A-Z][A-Z]' enwik9-two-capitals
  Time (mean ± σ):      22.5 ms ±   3.4 ms    [User: 20.6 ms, System: 2.0 ms]
  Range (min … max):    16.0 ms …  31.3 ms    91 runs

Summary
  ltrep-1297041 -c '[A-Z][A-Z]' enwik9-two-capitals ran
    5.00 ± 0.76 times faster than LC_ALL=C grep -E -c '[A-Z][A-Z]' enwik9-two-capitals
   14.55 ± 2.20 times faster than rg --no-config -c '[A-Z][A-Z]' enwik9-two-capitals

ltrep doesn't just beat ripgrep, it also beats GNU grep. GNU grep doesn't have as much abstraction as ripgrep, but it has more than ltrep. Also, at a certain point, feature support plays a role here. As the features and optimizations and use cases grow, so too do the abstractions.

But ltrep's advantage here is data dependent. Your particular implementation examines every byte of input; in exchange, your code is simpler, but it can be significantly slower. For example, I wrote this program to generate a similar file as above, but with 200 bytes of filler ("az" repeated 100 times) after the initial AZ:

use std::io::Write;

fn main() -> anyhow::Result<()> {
    let lines = 13_147_025;
    let mut out = std::io::stdout().lock();
    let filler = "az".repeat(100);
    for _ in 0..lines {
        write!(out, "AZ")?;
        write!(out, "{filler}")?;
        writeln!(out, "")?;
    }
    Ok(())
}

Then I re-ran the benchmarks:

$ hyperfine --output pipe "LC_ALL=C grep -E -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler" "rg --no-config -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler" "ltrep-1297041 -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler"
Benchmark 1: LC_ALL=C grep -E -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler
  Time (mean ± σ):     330.5 ms ±   7.5 ms    [User: 161.5 ms, System: 168.8 ms]
  Range (min … max):   321.7 ms … 341.6 ms    10 runs

Benchmark 2: rg --no-config -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler
  Time (mean ± σ):     531.5 ms ±  11.2 ms    [User: 460.0 ms, System: 71.1 ms]
  Range (min … max):   502.5 ms … 544.9 ms    10 runs

Benchmark 3: ltrep-1297041 -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler
  Time (mean ± σ):      2.294 s ±  0.019 s    [User: 2.238 s, System: 0.055 s]
  Range (min … max):    2.261 s …  2.312 s    10 runs

Summary
  LC_ALL=C grep -E -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler ran
    1.61 ± 0.05 times faster than rg --no-config -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler
    6.94 ± 0.17 times faster than ltrep-1297041 -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler

Both GNU grep and ripgrep know to stop searching on each line after seeing the AZ. But ltrep continues. Your code is less branchy because of it, but now it's doing a bunch of wasted work.

I think there's room for ripgrep to improve here, but I'd consider this difference to be overall small. And once the abstraction genie is out of the bottle, it's hard to put it back.


Bricktech2000 commented on August 15, 2024

Short-circuiting out of the scan when the DFA hits an accepting state in the middle of a partial match is such low-hanging fruit that I can't believe I didn't think of it. In the general case I believe this boils down to preprocessing the DFA by flagging states which either always or never reach an accepting state. And I wouldn't be surprised if the additional compare and branch needed in the hot loop nullified LTRE's "head start".
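
For illustration, the hot loop might end up looking something like this (a sketch only, not LTRE's actual code; the dead-state flag covers the "never reaches an accepting state" half, and the already-matched check stands in for the "always reaches" half):

fn line_matches_early_exit(
    transitions: &[[usize; 256]], // transitions[state][byte] -> next state
    accepting: &[bool],           // accepting[state] -> is this a match state?
    dead: &[bool],                // dead[state] -> no accepting state reachable
    start: usize,
    line: &[u8],
) -> bool {
    let mut state = start;
    let mut matched = accepting[state];
    for &b in line {
        state = transitions[state][b as usize];
        matched |= accepting[state];
        // The extra compare and branch in the hot loop: stop as soon as the
        // outcome for this line can no longer change.
        if matched || dead[state] {
            break;
        }
    }
    matched
}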

Thanks for taking the time to provide such a great response.

