wojciechmula / sse4-strstr

SIMD (SWAR/SSE/SSE4/AVX2/AVX512F/ARM Neon) implementations of a modified Karp-Rabin algorithm

Home Page: http://0x80.pl/articles/simd-strfind.html

License: BSD 2-Clause "Simplified" License

Makefile 3.81% Shell 0.05% Python 1.49% C 13.39% C++ 81.26%

string-manipulation sse avx2 avx512 neon

sse4-strstr's Introduction

SIMD-friendly algorithms for substring searching

Sample programs for article "SIMD-friendly algorithms for substring searching" (http://0x80.pl/articles/simd-strfind.html).

The root directory contains C++11 procedures implemented using intrinsics for SSE, SSE4, AVX2, AVX512F, AVX512BW and ARM Neon (both ARMv7 and ARMv8).

The subdirectory original contains 32-bit programs with inline assembly, written in 2008 for another article.

Usage

To run unit and validation tests, type make test_ARCH; to run performance tests, type make run_ARCH. The value ARCH selects the CPU architecture:

sse4,
avx2,
avx512f,
avx512bw,
arm,
aarch64.

Performance results

The subdirectory results contains raw timings from various computers.

sse4-strstr's Issues

Support for multithreading

Is it possible to multithread these SIMD algorithms with a tool such as OpenMP? I suppose SIMD and multithreading are independent, so this should speed up the algorithm even further.
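A minimal sketch of how such a split could look, not taken from the repository. The helper name chunked_find and the chunk_size parameter are hypothetical, and std::search stands in for any of the SIMD procedures. Each chunk is extended by k - 1 bytes so that a match straddling a chunk boundary is still seen, and the smallest hit over all chunks is the answer. The pragma takes effect only when the compiler is run with OpenMP enabled; otherwise the loop simply runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the first match position or std::string::npos.
std::size_t chunked_find(const std::string& text, const std::string& needle,
                         std::size_t chunk_size) {
    assert(chunk_size > 0);
    assert(!needle.empty());
    const std::size_t n = text.size();
    const std::size_t k = needle.size();
    if (k > n) return std::string::npos;

    const std::size_t last_start = n - k;  // last position where a match can start
    const long long chunks = static_cast<long long>(last_start / chunk_size) + 1;
    std::size_t best = std::string::npos;  // npos is the identity for min

    // Chunks are independent, so the loop can be distributed over threads.
    #pragma omp parallel for reduction(min : best)
    for (long long c = 0; c < chunks; ++c) {
        const std::size_t begin = static_cast<std::size_t>(c) * chunk_size;
        // This chunk owns match starts in [begin, end_start].
        const std::size_t end_start = std::min(begin + chunk_size - 1, last_start);
        // Extend by k - 1 bytes so the last owned start can match fully.
        const char* first = text.data() + begin;
        const char* last = first + (end_start - begin + k);
        // Stand-in for a SIMD search over [first, last).
        const char* hit = std::search(first, last, needle.begin(), needle.end());
        if (hit != last) {
            best = std::min(best, static_cast<std::size_t>(hit - text.data()));
        }
    }
    return best;
}
```

Whether this pays off would depend on the text size: for short inputs the thread start-up cost is likely to outweigh the search itself.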

Performance vs std::boyer_moore_horspool_searcher

Hi, could you also add std::boyer_moore_horspool_searcher (C++17) to the performance comparison table? I'm curious about this comparison.
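For reference, a possible wrapper around the standard searcher might look like the sketch below. The name bmh_find and the (s, n, needle, k) signature are assumptions chosen to resemble the snippets in these issues, and returning n when nothing is found is likewise an assumption rather than the repository's convention. It needs a C++17 compiler.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>

// Hypothetical wrapper: first match position, or n when there is no match.
std::size_t bmh_find(const char* s, std::size_t n,
                     const char* needle, std::size_t k) {
    const char* end = s + n;
    // The searcher precomputes its skip table when it is constructed.
    const std::boyer_moore_horspool_searcher<const char*> searcher(needle, needle + k);
    const char* it = std::search(s, end, searcher);
    return it == end ? n : static_cast<std::size_t>(it - s);
}
```

Since the skip table is built in the constructor, a benchmark that searches for the same needle many times would probably want to construct the searcher once, outside the timed loop.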

Consider adding the following naive algorithm

The problem with comparing with, say, the standard library, is that it might use non-trivial implementations. It would be useful to have a really basic reference like this:

int naive(const char *hay, int size, const char *needle, int needlesize) {
  const char first = needle[0];
  const int maxpos = size - needlesize; // last position where a match can start
  for (int i = 0; i <= maxpos; i++) {
    if (hay[i] != first) {
      // skip ahead to the next occurrence of the first needle character
      i++;
      while (i <= maxpos && hay[i] != first) i++;
      if (i > maxpos) break;
    }
    int j = 1;
    for (; j < needlesize; ++j)
      if (hay[i + j] != needle[j]) break;
    if (j == needlesize) return i;
  }
  return size; // not found
}

Optimization when trying to find the longest substring match?

Context: for the Lempel-Ziv 77 family of compression algorithms, one has to search through a sliding window for the longest possible substring match. I figured that there were SIMD algorithms for that these days. Your page on SIMD-friendly algorithms for substring searching was one of the first things to show up. However, while a beautiful write-up of wonderful research and code, the algorithm described there appears to serve a subtly different purpose: finding a substring of known length. In the LZ77 scenario we don't know ahead of time what the longest substring we are looking for is. Instead we have a very long string as our input (the sliding window of LZ77), and ask the question "taking a substring at starting index i, what is the longest substring match inside the sliding window behind it?"

However, conceptually it seems that there is an obvious way to optimize the algorithm you described on your page.

Taking your algorithm as a starting point, I just came up with the following: first we start with a 2-length string. We can skip the memcmp part (since any 2-length first+last matches are guaranteed to be complete matches). Then in a loop:

- increase the string length,
- test the last char of the new length,
- instead of a memcmp, AND the result with the mask from the previous iteration:
  - if the new mask is non-zero, then basic induction tells us we still have a substring match at the new length (we have checked all characters so far, after all);
  - if no bits are left, then we know that the previous mask marks every position where a substring match occurred within the text.

So we basically loop until the "new mask" is empty, then use the second-to-last mask to determine where all the substring matches are.

This is just something I came up with on the spot while reading your article, but it feels like the most straightforward, naive way to do this efficiently. Are you aware of whether or not this is a sensible approach when trying to find the longest substring match?
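The loop described above can be sketched in portable C++ as follows. This is a scalar emulation under stated assumptions, not the article's algorithm: a 64-bit integer plays the role of the SIMD comparison mask, so the window is limited to 64 bytes, and the names LongestMatch and longest_match are hypothetical. For simplicity it starts at length 1 rather than with the first+last test at length 2. In a vectorized version, the inner loop would be replaced by a byte-wise compare followed by a movemask.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct LongestMatch {
    std::size_t length;       // longest matched prefix length (0 if none)
    std::uint64_t positions;  // bit p set => window[p, p + length) matches
};

// Finds the longest prefix of `query` occurring in `window` (at most 64 bytes).
LongestMatch longest_match(const char* window, std::size_t win_size,
                           const char* query, std::size_t query_size) {
    assert(win_size <= 64);
    // Initially every window position is a candidate.
    std::uint64_t mask = (win_size == 64) ? ~0ull : ((1ull << win_size) - 1);
    LongestMatch best = {0, 0};
    for (std::size_t len = 1; len <= query_size; ++len) {
        // Mark positions p whose byte at offset len - 1 equals the new char.
        std::uint64_t eq = 0;
        const char c = query[len - 1];
        for (std::size_t p = 0; p + len <= win_size; ++p) {
            if (window[p + len - 1] == c) eq |= 1ull << p;
        }
        mask &= eq;            // survivors match all `len` characters
        if (mask == 0) break;  // previous mask holds the longest matches
        best = {len, mask};
    }
    return best;
}
```

For the window "abcabcdab" and the query "abcdx", this reports length 4 with only bit 3 set, because "abcd" occurs once, starting at index 3.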

Consider wider algorithms to reduce branching

Something like this... using 64-bit masks... which can reduce the number of branches quite a bit...

size_t  avx2_strstr_anysize64(const char* s, size_t n, const char* needle, size_t k) {
    assert(k > 0);
    assert(n > 0);
    const __m256i first = _mm256_set1_epi8(needle[0]);
    const __m256i last  = _mm256_set1_epi8(needle[k - 1]);
    for (size_t i = 0; i < n; i += 64) {

        const __m256i block_first1 = _mm256_loadu_si256((const __m256i*)(s + i));
        const __m256i block_last1  = _mm256_loadu_si256((const __m256i*)(s + i + k - 1));

        const __m256i block_first2 = _mm256_loadu_si256((const __m256i*)(s + i + 32));
        const __m256i block_last2  = _mm256_loadu_si256((const __m256i*)(s + i + k - 1 + 32));

        const __m256i eq_first1 = _mm256_cmpeq_epi8(first, block_first1);
        const __m256i eq_last1  = _mm256_cmpeq_epi8(last, block_last1);

        const __m256i eq_first2 = _mm256_cmpeq_epi8(first, block_first2);
        const __m256i eq_last2  = _mm256_cmpeq_epi8(last, block_last2);

        uint32_t mask1 = _mm256_movemask_epi8(_mm256_and_si256(eq_first1, eq_last1));
        uint32_t mask2 = _mm256_movemask_epi8(_mm256_and_si256(eq_first2, eq_last2));
        uint64_t mask = mask1  | ((uint64_t)mask2 << 32);

        while (mask != 0) {
            int bitpos = __builtin_ctzll(mask);
            if (memcmp(s + i + bitpos + 1, needle + 1, k - 2) == 0) {
                return i + bitpos;
            }
            mask ^= mask & (-mask);
        }
    }

    return n;
}

Is this use of an uninitialised variable intended?

Hi!

I am trying to use and (somewhat) understand your sse4-strstr code and came across this:

sse4-strstr/avx2-strstr-v2.cpp

Line 62 in 9cdc4b6

__m256i next1;

Is this use of uninitialized next1 intended?

Thanks!

Present Day Performance Advantage

If I read your latest benchmarks correctly...

https://github.com/WojciechMula/sse4-strstr/blob/master/results/cascadelake-Gold-5217-gcc-7.4.0-avx512bw.txt

...nowadays there's no advantage to seeking std::strstr alternatives unless you're using AVX512F or AVX512BW?

Because I assume all the standard libraries have been made AVX2-aware? I poked around in glibc earlier and could see AVX2 usage for memcmp, memcpy, etc., but not for strstr, only SSE4.2. Maybe the handwritten SSE4.2 assembly is just that good?

clang produces different results than gcc-compiled program

GCC compilation: all procedures (except AVX2-wide) report the same reference result

naive scalar                            ... reference result = 8108076510, time =   6.299609 s
std::strstr                             ... reference result = 8108076510, time =   0.659882 s
SWAR 64-bit (generic)                   ... reference result = 8108076510, time =   1.446615 s
SWAR 32-bit (generic)                   ... reference result = 8108076510, time =   2.529733 s
SSE2 (generic)                          ... reference result = 8108076510, time =   0.498816 s
SSE4.1 (MPSADBW)                        ... reference result = 8108076510, time =   0.640781 s
SSE4.1 (MPSADBW unrolled)               ... reference result = 8108076510, time =   0.961995 s
SSE4.2 (PCMPESTRM)                      ... reference result = 8108076510, time =   1.373412 s
SSE (naive)                             ... reference result = 8108076510, time =   1.960058 s
AVX2 (MPSADBW)                          ... reference result = 8108076510, time =   0.578520 s
AVX2 (generic)                          ... reference result = 8108076510, time =   0.374598 s
AVX2 (naive)                            ... reference result = 8108076510, time =   1.147053 s
AVX2 (naive unrolled)                   ... reference result = 8108076510, time =   0.795070 s
AVX2-wide (naive)                       ... reference result = 8107771150, time =   0.541654 s

In the clang compilation, the two SSE4.1 MPSADBW variants report different values:

naive scalar                            ... reference result = 8108076510, time =   6.293796 s
std::strstr                             ... reference result = 8108076510, time =   0.660113 s
SWAR 64-bit (generic)                   ... reference result = 8108076510, time =   1.334720 s
SWAR 32-bit (generic)                   ... reference result = 8108076510, time =   2.518706 s
SSE2 (generic)                          ... reference result = 8108076510, time =   0.489896 s
SSE4.1 (MPSADBW)                        ... reference result = 5713208130, time =   1.787850 s
SSE4.1 (MPSADBW unrolled)               ... reference result = 7962617290, time =   0.985689 s
SSE4.2 (PCMPESTRM)                      ... reference result = 8108076510, time =   1.448608 s
SSE (naive)                             ... reference result = 8108076510, time =   1.946516 s
AVX2 (MPSADBW)                          ... reference result = 8108076510, time =   0.694087 s
AVX2 (generic)                          ... reference result = 8108076510, time =   0.353279 s
AVX2 (naive)                            ... reference result = 8108076510, time =   1.054814 s
AVX2 (naive unrolled)                   ... reference result = 8108076510, time =   0.795445 s
AVX2-wide (naive)                       ... reference result = 8107771150, time =   0.577752 s

Compilation errors with AVX2-strstr v1 and v2

Hi, I have found this gold mine of a repository, and have been running some of your SSE code successfully. An exception, however, is your AVX2 strstr code.

First, I copied and pasted the code from avx2-strstr-v2.cpp after the code that implements the bits namespace, and then added the following test procedure:

int main(int argc, char *argv[])
{
    const char* s1 = argv[1];
    const char* s2 = argv[2];
    size_t result = avx2_strstr_v2(s1, strlen(s1), s2, strlen(s2));
    std::cout << result << std::endl;
    bool result2 = (strstr(s1, s2)) ? true : false;
    std::cout << result2 << std::endl;
    return 0;
}

Compiling the above with g++ -std=gnu++11 -mavx2 strstrAVX2_v02.cpp gives me this error:

strstrAVX2.cpp:237:27: error: the last argument must be an 8-bit immediate
             const __m256i substring = _mm256_alignr_epi8(next1, curr, i);

Then I tried the code from avx2-strstr.cpp and compiled it with g++ -std=gnu++11 -mavx2 strstrAVX2_v01.cpp. This gives me a bunch of errors of the same type, all stating that:

error: inlining failed in call to always_inline

Both test files are attached.

Thanks!

strstrSIMD.zip

could not compile avx2 on macos

In file included from src/all.h:18:
./avx2-strstr-v2.cpp:66:39: fatal error: argument to '__builtin_ia32_palignr256' must be a constant integer
const __m256i substring = _mm256_alignr_epi8(next1, curr, i);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/9.0.0/include/avx2intrin.h:130:12: note:
expanded from macro '_mm256_alignr_epi8'
(__m256i)__builtin_ia32_palignr256((__v32qi)(__m256i)(a),
^
1 error generated.

I tried a lot, but still could not fix this.
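Both compiler messages about _mm256_alignr_epi8 point at the same constraint: the intrinsic takes its byte count as an immediate, so a plain runtime variable is rejected unless the compiler can reduce it to a constant. One common workaround, sketched here and not taken from the repository, is to pass the count as a template argument and pick the instantiation with a switch. The names Bytes32, alignr_const and alignr_runtime are hypothetical, and the scalar branch is only there so the sketch also builds without AVX2 enabled.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical 32-byte value type, usable with or without AVX2.
struct Bytes32 { std::uint8_t b[32]; };

// N is a template argument, hence always a compile-time constant.
template <int N>
Bytes32 alignr_const(const Bytes32& hi, const Bytes32& lo) {
    static_assert(N >= 0 && N < 16, "shift must be in [0, 16)");
    Bytes32 out;
#if defined(__AVX2__)
    const __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(hi.b));
    const __m256i c = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(lo.b));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.b),
                        _mm256_alignr_epi8(a, c, N));
#else
    // Scalar emulation: the instruction works on each 128-bit lane separately.
    for (int lane = 0; lane < 32; lane += 16) {
        for (int j = 0; j < 16; ++j) {
            const int idx = j + N;
            out.b[lane + j] = idx < 16 ? lo.b[lane + idx] : hi.b[lane + idx - 16];
        }
    }
#endif
    return out;
}

// Maps a runtime shift onto the matching template instantiation.
Bytes32 alignr_runtime(const Bytes32& hi, const Bytes32& lo, int n) {
    assert(n >= 0 && n < 16);
    switch (n) {
#define ALIGNR_CASE(K) case K: return alignr_const<K>(hi, lo);
        ALIGNR_CASE(0)  ALIGNR_CASE(1)  ALIGNR_CASE(2)  ALIGNR_CASE(3)
        ALIGNR_CASE(4)  ALIGNR_CASE(5)  ALIGNR_CASE(6)  ALIGNR_CASE(7)
        ALIGNR_CASE(8)  ALIGNR_CASE(9)  ALIGNR_CASE(10) ALIGNR_CASE(11)
        ALIGNR_CASE(12) ALIGNR_CASE(13) ALIGNR_CASE(14) ALIGNR_CASE(15)
#undef ALIGNR_CASE
        default: return lo;  // unreachable when the assertion holds
    }
}
```

Inside a loop over i, calling alignr_runtime(next1, curr, i) would then compile regardless of whether the optimizer unrolls the loop, at the cost of a switch per call.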

avx2_strstr_anysize does not work with pattern of size one

With patterns that are a single character, the line:

sse4-strstr/avx2-strstr-v2.cpp

Line 25 in 9308a59

if (memcmp(s + i + bitpos + 1, needle + 1, k - 2) == 0) {

will not behave as intended, because k - 2 underflows (k is an unsigned size_t, so for k == 1 the length wraps around to a huge value).

avx2_strstr_anysize should either require k > 1 as a precondition (rather than k > 0), change that line so it also accepts k == 1 (which penalizes the common case for the sake of an uncommon one), or start with a special case.
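The third option could look roughly like the sketch below. The names find_any_size and find_k_ge_2 are hypothetical; find_k_ge_2 is a scalar stand-in for any routine that, like the line quoted above, needs k >= 2. Returning n when nothing is found is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Stand-in for a first+last style search that requires k >= 2.
std::size_t find_k_ge_2(const char* s, std::size_t n,
                        const char* needle, std::size_t k) {
    assert(k >= 2);
    for (std::size_t i = 0; i + k <= n; ++i) {
        if (s[i] == needle[0] && s[i + k - 1] == needle[k - 1] &&
            std::memcmp(s + i + 1, needle + 1, k - 2) == 0) {
            return i;
        }
    }
    return n;
}

// Handles the single-character needle up front, then delegates.
std::size_t find_any_size(const char* s, std::size_t n,
                          const char* needle, std::size_t k) {
    assert(k > 0);
    if (k == 1) {
        // memchr covers this case, so k - 2 is never evaluated for k == 1.
        const void* hit = std::memchr(s, needle[0], n);
        return hit ? static_cast<std::size_t>(static_cast<const char*>(hit) - s) : n;
    }
    return find_k_ge_2(s, n, needle, k);
}
```

The extra branch is taken once per call rather than once per candidate position, so the main loop stays unchanged.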

add avx2 naive algo with loop unrolling

You added the naive AVX2 algorithm, but without loop unrolling. Consider adding something like this:

size_t naiveavx2_unrolled(const char* s, size_t text_size, const char* needle, size_t needle_size) {
      assert(text_size % 32 == 0);
      // todo: fix it so we can handle variable-length inputs and
      // can catch matches at the end of the data.
      for (size_t i = 0; i < text_size - needle_size; i += 32) {
        uint32_t found = 0xFFFFFFFF; // 32 1-bits
        size_t j = 0;
        for (; (j + 3 < needle_size) && (found != 0)  ; j += 4) {
          __m256i textvector1 = _mm256_loadu_si256((const __m256i *)(s + i + j));
          __m256i needlevector1 = _mm256_set1_epi8(needle[j]);
          __m256i textvector2 = _mm256_loadu_si256((const __m256i *)(s + i + j + 1));
          __m256i needlevector2 = _mm256_set1_epi8(needle[j + 1]);
          __m256i cmp1 = _mm256_cmpeq_epi8(textvector1, needlevector1);
          __m256i cmp2 = _mm256_cmpeq_epi8(textvector2, needlevector2);
          __m256i textvector3 = _mm256_loadu_si256((const __m256i *)(s + i + j + 2));
          __m256i needlevector3 = _mm256_set1_epi8(needle[j + 2]);
          __m256i textvector4 = _mm256_loadu_si256((const __m256i *)(s + i + j + 3));
          __m256i needlevector4 = _mm256_set1_epi8(needle[j + 3]);
          __m256i cmp3 = _mm256_cmpeq_epi8(textvector3, needlevector3);
          __m256i cmp4 = _mm256_cmpeq_epi8(textvector4, needlevector4);
          __m256i cmp12 = _mm256_and_si256(cmp1,cmp2);
          __m256i cmp34 = _mm256_and_si256(cmp3,cmp4);
          uint32_t bitmask = _mm256_movemask_epi8(_mm256_and_si256(cmp12,cmp34));
          found = found & bitmask;
        }
        for (; (j < needle_size) && (found != 0) ; ++j) {
          __m256i textvector = _mm256_loadu_si256((const __m256i *)(s + i + j));
          __m256i needlevector = _mm256_set1_epi8(needle[j]);
          uint32_t bitmask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(textvector, needlevector));
          found = found & bitmask;
        }
        if(found != 0) {
          // got a match... maybe
          return i + __builtin_ctz(found);
        }
      }
      return text_size;
}

SSE naive implementation is broken

$ make test_sse4
SSE naive... FAILED
   string = '$x#' (length 3)
   neddle = '$x#' (length 3)
   expected result = 0, actual result = 18446744073709551615
