So I recently had to implement a fast text filter for work (in browser Java, so

Sure <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

I'll take a shot at it, with the following caveats: I only hav

Better Heuristics for Substring Search about stringzilla HOT 5 OPEN

ashvardanian commented on July 3, 2024 1

Better Heuristics for Substring Search

from stringzilla.

Comments (5)

ashvardanian commented on July 3, 2024 1

Sure @JobLeonard, you can start and I can join a bit later.

For benchmarking I often take cloud instances (c7g and r7iz) to gain access to more hardware. Might be useful to you too 🤗

One thing to watch out for - on very short strings (well under 64 bytes) we are optimizing the for-loop. On longer strings, if we take the first and the last character - we end up fetching 2 cache lines from each string, instead of just one.

from stringzilla.

ashvardanian commented on July 3, 2024 1

Hi @JobLeonard,

I have a big update! I've generalized the substring search methods to be able to match different characters within the needle. The method that infers the best targets is called _sz_locate_needle_anomalies. For long needles and small alphabets, updating it may have a noticeable impact. Here is the code:

StringZilla/include/stringzilla/stringzilla.h

Lines 1586 to 1657 in cbdae26

 /** 

  * @brief Chooses the offsets of the most interesting characters in a search needle. 

  * 

  * Search throughput can significantly deteriorate if we are matching the wrong characters. 

  * Say the needle is "aXaYa", and we are comparing the first, second, and last character. 

  * If we use SIMD and compare many offsets at a time, comparing against "a" in every register is a waste. 

  * 

  * Similarly, dealing with UTF8 inputs, we know that the lower bits of each character code carry more information. 

  * Cyrillic alphabet, for example, falls into [0x0410, 0x042F] code range for uppercase [А, Я], and 

  * into [0x0430, 0x044F] for lowercase [а, я]. Scanning through a text written in Russian, half of the 

  * bytes will carry absolutely no value and will be equal to 0x04. 

  */ 

 SZ_INTERNAL void _sz_locate_needle_anomalies(sz_cptr_t start, sz_size_t length, // 

 sz_size_t *first, sz_size_t *second, sz_size_t *third) { 

 *first = 0; 

 *second = length / 2; 

 *third = length - 1; 

 // 

 int has_duplicates = // 

 start[*first] == start[*second] || // 

 start[*first] == start[*third] || // 

 start[*second] == start[*third]; 

 // Loop through letters to find non-colliding variants. 

 if (length > 3 && has_duplicates) { 

 // Pivot the middle point right, until we find a character different from the first one. 

 for (; start[*second] == start[*first] && *second + 1 < *third; ++(*second)) {} 

 // Pivot the third (last) point left, until we find a different character. 

 for (; (start[*third] == start[*second] || start[*third] == start[*first]) && *third > (*second + 1); 

 --(*third)) {} 

 } 

 // TODO: Investigate alternative strategies for long needles. 

 // On very long needles we have the luxury to choose! 

 // Often dealing with UTF8, we will likely benfit from shifting the first and second characters 

 // further to the right, to achieve not only uniqness within the needle, but also avoid common 

 // rune prefixes of 2-, 3-, and 4-byte codes. 

 if (length > 8) { 

 // Pivot the first and second points right, until we find a character, that: 

 // > is different from others. 

 // > doesn't start with 0b'110x'xxxx - only 5 bits of relevant info. 

 // > doesn't start with 0b'1110'xxxx - only 4 bits of relevant info. 

 // > doesn't start with 0b'1111'0xxx - only 3 bits of relevant info. 

 // 

 // So we are practically searching for byte values that start with 0b0xxx'xxxx or 0b'10xx'xxxx. 

 // Meaning they fall in the range [0, 127] and [128, 191], in other words any unsigned int up to 191. 

 sz_u8_t const *start_u8 = (sz_u8_t const *)start; 

 sz_size_t vibrant_first = *first, vibrant_second = *second, vibrant_third = *third; 

 // Let's begin with the seccond character, as the termination criterea there is more obvious 

 // and we may end up with more variants to check for the first candidate. 

 for (; (start_u8[vibrant_second] > 191 || start_u8[vibrant_second] == start_u8[vibrant_third]) && 

 (vibrant_second + 1 < vibrant_third); 

 ++vibrant_second) {} 

 // Now check if we've indeed found a good candidate or should revert the `vibrant_second` to `second`. 

 if (start_u8[vibrant_second] < 191) { *second = vibrant_second; } 

 else { vibrant_second = *second; } 

 // Now check the first character. 

 for (; (start_u8[vibrant_first] > 191 || start_u8[vibrant_first] == start_u8[vibrant_second] || 

 start_u8[vibrant_first] == start_u8[vibrant_third]) && 

 (vibrant_first + 1 < vibrant_second); 

 ++vibrant_first) {} 

 // Now check if we've indeed found a good candidate or should revert the `vibrant_first` to `first`. 

 // We don't need to shift the third one when dealing with texts as the last byte of the text is 

 // also the last byte of a rune and contains the most information. 

 if (start_u8[vibrant_first] < 191) { *first = vibrant_first; } 

 } 

 }

Not sure about what would be the best dataset for such benchmarks, seems like this is related to #91.

from stringzilla.

JobLeonard commented on July 3, 2024 1

Thanks for the heads-up, looks like a worthwhile change for the examples given in the doc comment.

I was about to write that I'm still interested in giving this a go but have been very busy at work in the last month. Not entirely sure when I manage to free up some time again but just wanted to re-assure you that I haven't forgotten about it!

from stringzilla.

ashvardanian commented on July 3, 2024

Hi @JobLeonard! Sorry for a late response, didn’t see the issue.

That’s a good suggestion for the serial version! I haven’t spent much time optimizing it.

On most modern CPUs the forward and backward passes over memory are equally fast, AFAIK. It might be a good idea to also add a version that uses 64 bit integers, if misaligned reads are allowed. Would you be interested in trying a couple of such approaches and submitting a PR?

In case you do, the CONTRIBUTING.md file references several datasets for benchmarks 🤗

from stringzilla.

JobLeonard commented on July 3, 2024

I'll take a shot at it, with the following caveats:

I only have one laptop to benchmark it with (a six years old Lenovo P51 with an Intel® Core™ i7-7820HQ Processor running KDE Neon (Ubuntu LTS 22.04 based))
don't have too much spare time so the GCC 12 I already have installed will have to do.
"read string as 64 bit integers" sounds like it's great when everything aligns neatly, but wouldn't that require a lot of special casing? Checks for whether sz_size_t length is less than eight characters, and for whether sz_string_start_t a and/or sz_string_start_t b start misaligned, or end misaligned. Start or end within the same 8-byte word or not. That's a lot of variations to consider (unless there's an obvious bitmasking + aligned reads trick that I'm too tired to work out in my head right now). So I think I'll skip that for now.

So my benchmark nrs will be limited to a few simple variations of sz_equal on x86-64, with AVX2 thrown in as a point of comparison too, and therefore only useful as a first sanity check for whether this idea is worth investigating further. Is that ok?

from stringzilla.

Better Heuristics for Substring Search about stringzilla HOT 5 OPEN

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs

	/**
	* @brief Chooses the offsets of the most interesting characters in a search needle.
	*
	* Search throughput can significantly deteriorate if we are matching the wrong characters.
	* Say the needle is "aXaYa", and we are comparing the first, second, and last character.
	* If we use SIMD and compare many offsets at a time, comparing against "a" in every register is a waste.
	*
	* Similarly, dealing with UTF8 inputs, we know that the lower bits of each character code carry more information.
	* Cyrillic alphabet, for example, falls into [0x0410, 0x042F] code range for uppercase [А, Я], and
	* into [0x0430, 0x044F] for lowercase [а, я]. Scanning through a text written in Russian, half of the
	* bytes will carry absolutely no value and will be equal to 0x04.
	*/
	SZ_INTERNAL void _sz_locate_needle_anomalies(sz_cptr_t start, sz_size_t length, //
	sz_size_t first, sz_size_t second, sz_size_t *third) {
	*first = 0;
	*second = length / 2;
	*third = length - 1;

	//
	int has_duplicates = //
	start[first] == start[second] \|\| //
	start[first] == start[third] \|\| //
	start[second] == start[third];

	// Loop through letters to find non-colliding variants.
	if (length > 3 && has_duplicates) {
	// Pivot the middle point right, until we find a character different from the first one.
	for (; start[second] == start[first] && second + 1 < third; ++(*second)) {}
	// Pivot the third (last) point left, until we find a different character.
	for (; (start[third] == start[second] \|\| start[third] == start[first]) && third > (second + 1);
	--(*third)) {}
	}

	// TODO: Investigate alternative strategies for long needles.
	// On very long needles we have the luxury to choose!
	// Often dealing with UTF8, we will likely benfit from shifting the first and second characters
	// further to the right, to achieve not only uniqness within the needle, but also avoid common
	// rune prefixes of 2-, 3-, and 4-byte codes.
	if (length > 8) {
	// Pivot the first and second points right, until we find a character, that:
	// > is different from others.
	// > doesn't start with 0b'110x'xxxx - only 5 bits of relevant info.
	// > doesn't start with 0b'1110'xxxx - only 4 bits of relevant info.
	// > doesn't start with 0b'1111'0xxx - only 3 bits of relevant info.
	//
	// So we are practically searching for byte values that start with 0b0xxx'xxxx or 0b'10xx'xxxx.
	// Meaning they fall in the range [0, 127] and [128, 191], in other words any unsigned int up to 191.
	sz_u8_t const start_u8 = (sz_u8_t const )start;
	sz_size_t vibrant_first = first, vibrant_second = second, vibrant_third = *third;

	// Let's begin with the seccond character, as the termination criterea there is more obvious
	// and we may end up with more variants to check for the first candidate.
	for (; (start_u8[vibrant_second] > 191 \|\| start_u8[vibrant_second] == start_u8[vibrant_third]) &&
	(vibrant_second + 1 < vibrant_third);
	++vibrant_second) {}

	// Now check if we've indeed found a good candidate or should revert the `vibrant_second` to `second`.
	if (start_u8[vibrant_second] < 191) { *second = vibrant_second; }
	else { vibrant_second = *second; }

	// Now check the first character.
	for (; (start_u8[vibrant_first] > 191 \|\| start_u8[vibrant_first] == start_u8[vibrant_second] \|\|
	start_u8[vibrant_first] == start_u8[vibrant_third]) &&
	(vibrant_first + 1 < vibrant_second);
	++vibrant_first) {}

	// Now check if we've indeed found a good candidate or should revert the `vibrant_first` to `first`.
	// We don't need to shift the third one when dealing with texts as the last byte of the text is
	// also the last byte of a rune and contains the most information.
	if (start_u8[vibrant_first] < 191) { *first = vibrant_first; }
	}
	}