Comments (3)
Nice find. At first I thought this might be a case of the latency of the regex engine playing a role here (because it generally has more overhead than what you have here). But some other examples reveal this probably isn't the case?
$ time rg -c '[A-Z]+' enwik9
7421426
real 0.604
user 0.547
sys 0.057
maxmem 959 MB
faults 0
$ time ltrep-1297041 -c '[A-Z]+' enwik9
7421426
real 0.904
user 0.843
sys 0.060
maxmem 954 MB
faults 0
$ time rg -c '[A-Z][A-Z]+' enwik9
1212981
real 1.141
user 1.077
sys 0.063
maxmem 959 MB
faults 0
$ time ltrep-1297041 -c '[A-Z][A-Z]+' enwik9
1212981
real 0.925
user 0.884
sys 0.040
maxmem 954 MB
faults 0
So I think this probably warrants some investigation.
Also, just because something is a toy doesn't mean it is expected to never be faster than something that isn't a toy. For example, this is the "toy" version of memchr:
    fn memchr(needle: u8, haystack: &[u8]) -> Option<usize> {
        haystack.iter().position(|&b| b == needle)
    }
But sometimes this will be faster than a SIMD optimized version. Because the SIMD optimized version needs a preamble, a fallback for the case when the haystack is smaller than the vector size, creating the vector and so on. And the SIMD version might not get inlined if it's using AVX2 and the surrounding code wasn't compiled with AVX2 enabled. (Which is the common case, although the rise of things like x86-64-v3 is changing that.)
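To make the "preamble and fallback" point concrete, here is a minimal sketch. It is not a real SIMD routine: it swaps in a portable word-at-a-time (SWAR) trick over `u64` as a stand-in for vector instructions, and the names `memchr_naive` and `memchr_swar` are made up for illustration. The shape is what matters: a size check, a broadcast of the needle, a chunked main loop, and a byte-wise tail.

```rust
const LO: u64 = 0x0101_0101_0101_0101;
const HI: u64 = 0x8080_8080_8080_8080;

/// The "toy" version: one byte at a time.
fn memchr_naive(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Word-at-a-time sketch (an assumption standing in for a SIMD version).
fn memchr_swar(needle: u8, haystack: &[u8]) -> Option<usize> {
    // Fallback: the haystack is smaller than one "vector".
    if haystack.len() < 8 {
        return memchr_naive(needle, haystack);
    }
    // Preamble: copy the needle into every byte of a word.
    let broadcast = u64::from(needle) * LO;
    let mut offset = 0usize;
    let mut chunks = haystack.chunks_exact(8);
    for chunk in &mut chunks {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        // Bytes equal to the needle become zero after the XOR.
        let x = u64::from_ne_bytes(buf) ^ broadcast;
        // Non-zero exactly when some byte of `x` is zero.
        if (x.wrapping_sub(LO) & !x & HI) != 0 {
            return memchr_naive(needle, chunk).map(|i| offset + i);
        }
        offset += 8;
    }
    // Tail: fewer than 8 bytes left over.
    memchr_naive(needle, chunks.remainder()).map(|i| offset + i)
}
```

For a three-byte haystack, `memchr_swar` pays for the length check and then runs the naive loop anyway, so it has no way to come out ahead of calling the naive version directly.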
And like, despite your regex engine being a toy, your search code is still very tight and about as good as you can do with a DFA: you transition from state to state and report whether a match state was seen. ripgrep's regex engine basically does the same, but with more frills.
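As a rough illustration of that kind of loop (this is not ltrep's actual code, just a sketch under my own assumptions), here is a table-driven DFA for `[A-Z][A-Z]` that counts matching lines while stepping on every byte. The state names, the two byte classes and the `NEXT` table are all invented for the example.

```rust
const START: usize = 0; // nothing useful seen yet
const ONE: usize = 1;   // one uppercase letter seen
const MATCH: usize = 2; // two in a row seen; sticky until end of line

/// Byte class: 0 for ASCII uppercase, 1 for everything else.
fn class(b: u8) -> usize {
    if b.is_ascii_uppercase() { 0 } else { 1 }
}

/// NEXT[state][class] is the successor state.
const NEXT: [[usize; 2]; 3] = [
    [ONE, START],   // from START
    [MATCH, START], // from ONE
    [MATCH, MATCH], // from MATCH
];

/// Counts lines containing two consecutive ASCII uppercase letters.
/// Every byte of the haystack is examined.
fn count_matching_lines(haystack: &[u8]) -> u64 {
    let mut count = 0u64;
    let mut state = START;
    for &b in haystack {
        if b == b'\n' {
            count += u64::from(state == MATCH);
            state = START;
        } else {
            state = NEXT[state][class(b)];
        }
    }
    // Account for a final line with no trailing newline.
    count + u64::from(state == MATCH)
}
```

Note how the line splitting and the DFA walk live in the same loop. There is no call per line and no interface between "the searcher" and "the regex engine".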
from ripgrep.
My hypothesis here is that your toy regex engine is benefiting precisely from the coupling that comes from the nature of it being a toy. That is, in your grep, the code for searching the input and the code for walking the DFA are effectively intertwined. That generally isn't true for a general purpose regex engine: there's usually an abstraction boundary that separates them. To test this, I generated a file with the same number of lines as enwik9 but where every line was just AZ:
    use std::io::Write;

    fn main() -> anyhow::Result<()> {
        let lines = 13_147_025;
        let mut out = std::io::stdout().lock();
        for _ in 0..lines {
            writeln!(out, "AZ")?;
        }
        Ok(())
    }
Then I benchmarked ripgrep, ltrep and GNU grep with hyperfine:
$ hyperfine --output pipe "LC_ALL=C grep -E -c '[A-Z][A-Z]' enwik9-two-capitals" "rg --no-config -c '[A-Z][A-Z]' enwik9-two-capitals" "ltrep-1297041 -c '[A-Z][A-Z]' enwik9-two-capitals"
Benchmark 1: LC_ALL=C grep -E -c '[A-Z][A-Z]' enwik9-two-capitals
Time (mean ± σ): 112.7 ms ± 2.5 ms [User: 110.7 ms, System: 2.2 ms]
Range (min … max): 110.5 ms … 120.4 ms 25 runs
Benchmark 2: rg --no-config -c '[A-Z][A-Z]' enwik9-two-capitals
Time (mean ± σ): 327.9 ms ± 1.5 ms [User: 326.3 ms, System: 1.7 ms]
Range (min … max): 325.4 ms … 331.0 ms 10 runs
Benchmark 3: ltrep-1297041 -c '[A-Z][A-Z]' enwik9-two-capitals
Time (mean ± σ): 22.5 ms ± 3.4 ms [User: 20.6 ms, System: 2.0 ms]
Range (min … max): 16.0 ms … 31.3 ms 91 runs
Summary
ltrep-1297041 -c '[A-Z][A-Z]' enwik9-two-capitals ran
5.00 ± 0.76 times faster than LC_ALL=C grep -E -c '[A-Z][A-Z]' enwik9-two-capitals
14.55 ± 2.20 times faster than rg --no-config -c '[A-Z][A-Z]' enwik9-two-capitals
ltrep doesn't just beat ripgrep, it also beats GNU grep. GNU grep doesn't have as much abstraction as ripgrep, but it has more than ltrep. Also, at a certain point, the feature support has a role to play here. As the features and optimizations and use cases grow, so too do the abstractions.
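For a feel of what an abstraction boundary looks like in code, here is a deliberately simplified sketch. The `LineMatcher` trait, `TwoCapitals` and `count_lines` are hypothetical names made up for this example and are not ripgrep's or GNU grep's actual interfaces. The searcher only knows how to split lines, and the matcher only knows how to test one line.

```rust
/// Hypothetical matcher interface (an assumption for illustration only).
trait LineMatcher {
    fn is_match(&self, line: &[u8]) -> bool;
}

/// Matches lines containing two consecutive ASCII uppercase letters.
struct TwoCapitals;

impl LineMatcher for TwoCapitals {
    fn is_match(&self, line: &[u8]) -> bool {
        line.windows(2)
            .any(|w| w[0].is_ascii_uppercase() && w[1].is_ascii_uppercase())
    }
}

/// The "searcher" side: splits on newlines and asks the matcher about
/// each line through the trait object.
fn count_lines(matcher: &dyn LineMatcher, haystack: &[u8]) -> u64 {
    haystack
        .split(|&b| b == b'\n')
        .filter(|line| matcher.is_match(line))
        .count() as u64
}
```

With this shape, every line costs at least one call across the boundary. When each line is only two bytes long, that fixed per-line cost is a plausible place for the time to go, which is consistent with the hypothesis above but is not something this sketch measures.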
But ltrep's advantage here is data dependent. Your particular implementation examines every byte of input, and in exchange, your code is simpler but potentially significantly slower. For example, I wrote this program to generate a file similar to the one above, but with an extra 200 bytes of filler (az repeated 100 times) after the initial AZ:
    use std::io::Write;

    fn main() -> anyhow::Result<()> {
        let lines = 13_147_025;
        let mut out = std::io::stdout().lock();
        let filler = "az".repeat(100);
        for _ in 0..lines {
            write!(out, "AZ")?;
            write!(out, "{filler}")?;
            writeln!(out)?;
        }
        Ok(())
    }
Then I re-ran the benchmarks:
$ hyperfine --output pipe "LC_ALL=C grep -E -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler" "rg --no-config -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler" "ltrep-1297041 -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler"
Benchmark 1: LC_ALL=C grep -E -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler
Time (mean ± σ): 330.5 ms ± 7.5 ms [User: 161.5 ms, System: 168.8 ms]
Range (min … max): 321.7 ms … 341.6 ms 10 runs
Benchmark 2: rg --no-config -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler
Time (mean ± σ): 531.5 ms ± 11.2 ms [User: 460.0 ms, System: 71.1 ms]
Range (min … max): 502.5 ms … 544.9 ms 10 runs
Benchmark 3: ltrep-1297041 -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler
Time (mean ± σ): 2.294 s ± 0.019 s [User: 2.238 s, System: 0.055 s]
Range (min … max): 2.261 s … 2.312 s 10 runs
Summary
LC_ALL=C grep -E -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler ran
1.61 ± 0.05 times faster than rg --no-config -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler
6.94 ± 0.17 times faster than ltrep-1297041 -c '[A-Z][A-Z]' enwik9-two-capitals-with-filler
Both GNU grep and ripgrep know to stop searching on each line after seeing the AZ. But ltrep continues. Your code is less branchy because of it, but now it's doing a bunch of wasted work.
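Here is a sketch of the early-exit idea, under my own assumptions and not taken from GNU grep or ripgrep. The function name `count_lines_early_exit` is made up. It returns the number of matching lines together with the number of bytes the matcher actually stepped on, so the saved work is visible.

```rust
/// Counts lines containing two consecutive ASCII uppercase letters and
/// stops matching on a line as soon as the answer is known.
/// Returns (matching lines, bytes fed to the matcher).
fn count_lines_early_exit(haystack: &[u8]) -> (u64, usize) {
    let mut count = 0u64;
    let mut stepped = 0usize;
    let mut pos = 0usize;
    while pos < haystack.len() {
        let rest = &haystack[pos..];
        // Find the end of the current line with a plain byte search.
        let line_len = rest
            .iter()
            .position(|&b| b == b'\n')
            .unwrap_or(rest.len());
        // `run` is the length of the current run of uppercase letters.
        let mut run = 0u8;
        for &b in &rest[..line_len] {
            stepped += 1;
            run = if b.is_ascii_uppercase() { run + 1 } else { 0 };
            if run == 2 {
                count += 1;
                break; // the rest of the line cannot change the answer
            }
        }
        // Skip past the line and its terminator.
        pos += line_len + 1;
    }
    (count, stepped)
}
```

On lines shaped like the filler file, the matcher steps on 2 bytes per line instead of 202. The newline search still touches the rest of the line, but a memchr-style scan is usually much cheaper per byte than a DFA transition.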
I think there's overall room for ripgrep to improve here, but I'd consider this difference to be "overall small." And once the abstraction genie is out of the bottle, it's hard to put it back.
from ripgrep.
Short-circuiting out when on an accepting state in the middle of a partial match is such low-hanging fruit I can't believe I didn't think of it. In the general case I believe this boils down to preprocessing the DFA by flagging states which either always or never reach an accepting state. And I wouldn't be surprised if the additional compare and branch needed in the hot loop nullified LTRE's "head start".
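A minimal sketch of that preprocessing step, assuming a dense DFA that must accept the whole line. The `Dfa` struct, the `Fate` enum and `classify` are hypothetical names for this example and are not LTRE's actual data structures. A state "always accepts" if every state reachable from it (itself included) is accepting, and "never accepts" if no accepting state is reachable from it.

```rust
/// Hypothetical dense DFA: `next[s][c]` is the successor of state `s`
/// on symbol class `c`.
struct Dfa {
    next: Vec<Vec<usize>>,
    accepting: Vec<bool>,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Fate {
    AlwaysAccepts, // the search can stop and report a match
    NeverAccepts,  // the search can stop and report no match
    Undecided,     // the remaining input still matters
}

/// Marks every state reachable from `start`, including `start` itself.
fn reachable(dfa: &Dfa, start: usize) -> Vec<bool> {
    let mut seen = vec![false; dfa.next.len()];
    let mut stack = vec![start];
    seen[start] = true;
    while let Some(s) = stack.pop() {
        for &t in &dfa.next[s] {
            if !seen[t] {
                seen[t] = true;
                stack.push(t);
            }
        }
    }
    seen
}

/// Classifies each state. This runs one traversal per state, which is
/// quadratic but simple; it only happens once, before searching.
fn classify(dfa: &Dfa) -> Vec<Fate> {
    (0..dfa.next.len())
        .map(|s| {
            let seen = reachable(dfa, s);
            let hits_accepting = (0..seen.len()).any(|t| seen[t] && dfa.accepting[t]);
            let hits_rejecting = (0..seen.len()).any(|t| seen[t] && !dfa.accepting[t]);
            if !hits_rejecting {
                Fate::AlwaysAccepts
            } else if !hits_accepting {
                Fate::NeverAccepts
            } else {
                Fate::Undecided
            }
        })
        .collect()
}
```

The hot loop would then check the precomputed flag after each transition and bail out of the line when it is not `Undecided`, which is exactly the extra compare and branch mentioned above.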
Thanks for taking the time to provide such a great response.
from ripgrep.