
Comments (5)

ashvardanian commented on July 3, 2024

Sure @JobLeonard, you can start and I can join a bit later.

For benchmarking I often take cloud instances (c7g and r7iz) to gain access to more hardware. Might be useful to you too 🤗

One thing to watch out for: on very short strings (well under 64 bytes) we are optimizing the for-loop. On longer strings, if we compare the first and the last characters, we end up fetching 2 cache lines from each string instead of just one.
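To make the cache-line point concrete, here is a small sketch that counts how many lines a first-and-last-byte probe touches. The 64-byte line size and the helper name `cache_lines_for_ends` are assumptions made for illustration; neither comes from StringZilla.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; 64 bytes is typical on x86-64 and many Arm cores. */
#define ASSUMED_CACHE_LINE 64u

/* Hypothetical helper: number of distinct cache lines touched when reading
 * only the first and the last byte of a `length`-byte string at `address`. */
static unsigned cache_lines_for_ends(uintptr_t address, size_t length) {
    assert(length > 0);
    uintptr_t first_line = address / ASSUMED_CACHE_LINE;
    uintptr_t last_line = (address + length - 1) / ASSUMED_CACHE_LINE;
    return first_line == last_line ? 1u : 2u;
}
```

Under these assumptions a line-aligned 64-byte string stays within one line, while a 65-byte one, or an 8-byte one starting at offset 60, already costs two.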

from stringzilla.

ashvardanian commented on July 3, 2024

Hi @JobLeonard,

I have a big update! I've generalized the substring search methods to be able to match different characters within the needle. The method that infers the best targets is called _sz_locate_needle_anomalies. For long needles and small alphabets, updating it may have a noticeable impact. Here is the code:

/**
 *  @brief  Chooses the offsets of the most interesting characters in a search needle.
 *
 *  Search throughput can significantly deteriorate if we are matching the wrong characters.
 *  Say the needle is "aXaYa", and we are comparing the first, second, and last character.
 *  If we use SIMD and compare many offsets at a time, comparing against "a" in every register is a waste.
 *
 *  Similarly, dealing with UTF8 inputs, we know that the lower bits of each character code carry more information.
 *  Cyrillic alphabet, for example, falls into [0x0410, 0x042F] code range for uppercase [А, Я], and
 *  into [0x0430, 0x044F] for lowercase [а, я]. In UTF8 each of those letters takes two bytes, and the first
 *  one is always 0xD0 or 0xD1. Scanning through a text written in Russian, about half of the
 *  bytes will carry almost no value.
 */
SZ_INTERNAL void _sz_locate_needle_anomalies(sz_cptr_t start, sz_size_t length, //
                                             sz_size_t *first, sz_size_t *second, sz_size_t *third) {
    *first = 0;
    *second = length / 2;
    *third = length - 1;

    // Check whether any two of the three default probes land on the same character.
    int has_duplicates =                   //
        start[*first] == start[*second] || //
        start[*first] == start[*third] ||  //
        start[*second] == start[*third];

    // Loop through letters to find non-colliding variants.
    if (length > 3 && has_duplicates) {
        // Pivot the middle point right, until we find a character different from the first one.
        for (; start[*second] == start[*first] && *second + 1 < *third; ++(*second)) {}
        // Pivot the third (last) point left, until we find a different character.
        for (; (start[*third] == start[*second] || start[*third] == start[*first]) && *third > (*second + 1);
             --(*third)) {}
    }

    // TODO: Investigate alternative strategies for long needles.
    // On very long needles we have the luxury to choose!
    // Often dealing with UTF8, we will likely benefit from shifting the first and second characters
    // further to the right, to achieve not only uniqueness within the needle, but also avoid common
    // rune prefixes of 2-, 3-, and 4-byte codes.
    if (length > 8) {
        // Pivot the first and second points right, until we find a character, that:
        // > is different from others.
        // > doesn't start with 0b'110x'xxxx - only 5 bits of relevant info.
        // > doesn't start with 0b'1110'xxxx - only 4 bits of relevant info.
        // > doesn't start with 0b'1111'0xxx - only 3 bits of relevant info.
        //
        // So we are practically searching for byte values that start with 0b0xxx'xxxx or 0b'10xx'xxxx.
        // Meaning they fall in the range [0, 127] and [128, 191], in other words any unsigned int up to 191.
        sz_u8_t const *start_u8 = (sz_u8_t const *)start;
        sz_size_t vibrant_first = *first, vibrant_second = *second, vibrant_third = *third;

        // Let's begin with the second character, as the termination criterion there is more obvious
        // and we may end up with more variants to check for the first candidate.
        for (; (start_u8[vibrant_second] > 191 || start_u8[vibrant_second] == start_u8[vibrant_third]) &&
               (vibrant_second + 1 < vibrant_third);
             ++vibrant_second) {}

        // Now check if we've indeed found a good candidate or should revert the `vibrant_second` to `second`.
        // The comparison is inclusive, as 191 itself is still a valid continuation byte.
        if (start_u8[vibrant_second] <= 191) { *second = vibrant_second; }
        else { vibrant_second = *second; }

        // Now check the first character.
        for (; (start_u8[vibrant_first] > 191 || start_u8[vibrant_first] == start_u8[vibrant_second] ||
                start_u8[vibrant_first] == start_u8[vibrant_third]) &&
               (vibrant_first + 1 < vibrant_second);
             ++vibrant_first) {}

        // Now check if we've indeed found a good candidate or should revert the `vibrant_first` to `first`.
        // We don't need to shift the third one when dealing with texts as the last byte of the text is
        // also the last byte of a rune and contains the most information.
        if (start_u8[vibrant_first] <= 191) { *first = vibrant_first; }
    }
}
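To see what the first stage of this routine does on concrete needles, here is a stripped-down, standalone sketch of the duplicate-avoidance step only. The name `pick_offsets_sketch` and the use of plain `char`/`size_t` instead of the library typedefs are assumptions for illustration, and the UTF-8 stage is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified sketch of the duplicate-avoidance stage only; hypothetical name,
 * not the library's API. Writes three probe offsets into the needle. */
static void pick_offsets_sketch(char const *start, size_t length,
                                size_t *first, size_t *second, size_t *third) {
    assert(length > 0);
    *first = 0;
    *second = length / 2;
    *third = length - 1;
    int has_duplicates = start[*first] == start[*second] ||
                         start[*first] == start[*third] ||
                         start[*second] == start[*third];
    if (length > 3 && has_duplicates) {
        /* Move the middle probe right until it differs from the first one. */
        while (start[*second] == start[*first] && *second + 1 < *third)
            ++(*second);
        /* Move the last probe left until it differs from both others,
         * or until it would run into the middle probe. */
        while ((start[*third] == start[*second] || start[*third] == start[*first]) &&
               *third > *second + 1)
            --(*third);
    }
}
```

For the needle "aXaaaaaYb" the probes move from offsets (0, 4, 8) to (0, 7, 8), so the three compared characters become 'a', 'Y', 'b'. For the "aXaYa" example from the doc comment they end at (0, 3, 4): the middle probe escapes 'a', but the last one has no room left to move, so the first and last probes still share a character.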

I'm not sure what the best dataset for such benchmarks would be. This seems related to #91.
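One property that might be worth measuring on any candidate dataset is the share of bytes that the `> 191` test in the code above would skip, since that is where the UTF-8 stage can make a difference. A minimal sketch, with the hypothetical helper name `count_bytes_above_191`:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts bytes above 191, i.e. UTF-8 lead bytes of
 * multi-byte sequences (plus byte values that never occur in valid UTF-8). */
static size_t count_bytes_above_191(unsigned char const *text, size_t length) {
    assert(text != NULL || length == 0);
    size_t count = 0;
    for (size_t i = 0; i != length; ++i)
        count += text[i] > 191; /* the comparison yields 0 or 1 */
    return count;
}
```

On the 12-byte UTF-8 encoding of the Russian word "Привет" this returns 6, one lead byte per letter, while on pure ASCII text it returns 0.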


JobLeonard commented on July 3, 2024

Thanks for the heads-up, looks like a worthwhile change for the examples given in the doc comment.

I was about to write that I'm still interested in giving this a go, but I have been very busy at work over the last month. I'm not entirely sure when I'll manage to free up some time again, but I just wanted to reassure you that I haven't forgotten about it!


ashvardanian commented on July 3, 2024

Hi @JobLeonard! Sorry for the late response, I didn’t see the issue.

That’s a good suggestion for the serial version! I haven’t spent much time optimizing it.

On most modern CPUs the forward and backward passes over memory are equally fast, AFAIK. It might be a good idea to also add a version that uses 64-bit integers, if misaligned reads are allowed. Would you be interested in trying a couple of such approaches and submitting a PR?
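As a rough idea of what the 64-bit variant could look like, here is a minimal sketch. It is not the library's `sz_equal`; the name `equal_u64_sketch` is hypothetical, and `memcpy` is used to express the misaligned load in portable C on the assumption that the compiler lowers it to a single load where the target allows that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: equality check that compares 8 bytes per iteration.
 * Returns 1 if the two buffers match over `length` bytes, 0 otherwise. */
static int equal_u64_sketch(char const *a, char const *b, size_t length) {
    assert((a != NULL && b != NULL) || length == 0);
    while (length >= 8) {
        uint64_t a_word, b_word;
        memcpy(&a_word, a, 8); /* unaligned-safe 8-byte load */
        memcpy(&b_word, b, 8);
        if (a_word != b_word) return 0;
        a += 8, b += 8, length -= 8;
    }
    /* Fewer than 8 bytes remain: finish with a byte-wise tail. */
    for (; length != 0; --length, ++a, ++b)
        if (*a != *b) return 0;
    return 1;
}
```

Whether this beats the byte-wise loop would have to be measured on the datasets mentioned below.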

In case you do, the CONTRIBUTING.md file references several datasets for benchmarks 🤗


JobLeonard commented on July 3, 2024

I'll take a shot at it, with the following caveats:

  • I only have one laptop to benchmark it with (a six-year-old Lenovo P51 with an Intel® Core™ i7-7820HQ processor, running KDE Neon, which is based on Ubuntu 22.04 LTS).
  • I don't have much spare time, so the GCC 12 I already have installed will have to do.
  • "Read the string as 64-bit integers" sounds great when everything aligns neatly, but wouldn't that require a lot of special casing? There would be checks for whether sz_size_t length is less than eight characters, for whether sz_string_start_t a and/or sz_string_start_t b start or end misaligned, and for whether the start and end fall within the same 8-byte word. That's a lot of variations to consider (unless there's an obvious bitmasking + aligned reads trick that I'm too tired to work out in my head right now). So I think I'll skip that for now.
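One commonly used way to sidestep most of that special casing, sketched here as an assumption rather than as anything StringZilla does, is to load words with `memcpy` so that start alignment stops mattering, and to cover the tail with one final 8-byte load that overlaps bytes already compared. Only lengths under 8 then need a separate path. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: 8-byte equality with an overlapping final load.
 * Returns 1 if the buffers match over `length` bytes, 0 otherwise. */
static int equal_overlapping_sketch(char const *a, char const *b, size_t length) {
    assert((a != NULL && b != NULL) || length == 0);
    if (length < 8) { /* the only remaining special case */
        for (size_t i = 0; i != length; ++i)
            if (a[i] != b[i]) return 0;
        return 1;
    }
    uint64_t a_word, b_word;
    /* Full words, stopping before the one that contains the last byte. */
    for (size_t i = 0; i + 8 < length; i += 8) {
        memcpy(&a_word, a + i, 8);
        memcpy(&b_word, b + i, 8);
        if (a_word != b_word) return 0;
    }
    /* The last word ends exactly at the final byte; when `length` is not a
     * multiple of 8 it re-checks a few bytes that were already compared. */
    memcpy(&a_word, a + length - 8, 8);
    memcpy(&b_word, b + length - 8, 8);
    return a_word == b_word;
}
```

The trade-off is a handful of redundantly compared bytes in exchange for no tail loop and no alignment branches; whether that pays off on short strings is exactly the kind of thing a benchmark would need to show.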

So my benchmark numbers will be limited to a few simple variations of sz_equal on x86-64, with AVX2 thrown in as a point of comparison, and will therefore only be useful as a first sanity check for whether this idea is worth investigating further. Is that ok?

