wojciechmula / sse4-strstr

SIMD (SWAR/SSE/SSE4/AVX2/AVX512F/ARM Neon) implementations of a modified Karp-Rabin algorithm

Home Page: http://0x80.pl/articles/simd-strfind.html

License: BSD 2-Clause "Simplified" License

Makefile 3.81% Shell 0.05% Python 1.49% C 13.39% C++ 81.26%
string-manipulation sse avx2 avx512 neon

sse4-strstr's Introduction

SIMD-friendly algorithms for substring searching

Sample programs for article "SIMD-friendly algorithms for substring searching" (http://0x80.pl/articles/simd-strfind.html).

The root directory contains C++11 procedures implemented using intrinsics for SSE, SSE4, AVX2, AVX512F, AVX512BW and ARM Neon (both ARMv7 and ARMv8).

The subdirectory original contains 32-bit programs with inline assembly, written in 2008 for another article.

Usage

To run unit and validation tests, type make test_ARCH; to run performance tests, type make run_ARCH. The value ARCH selects the CPU architecture:

  • sse4,
  • avx2,
  • avx512f,
  • avx512bw,
  • arm,
  • aarch64.

Performance results

The subdirectory results contains raw timings from various computers.

sse4-strstr's People

Contributors

gdahlm, jserv, tbarbette, wojciechmula, xiehuc

sse4-strstr's Issues

Support for multithreading

Is it possible to use a tool such as OpenMP to multithread these SIMD algorithms? I suppose SIMD and multithreading are independent, so combining them should speed up the algorithm even further.
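
One way to picture the idea is to split the haystack into chunks that overlap by k - 1 bytes, so that a match straddling a chunk boundary is still seen, and to search the chunks independently. The sketch below is only an illustration under assumptions: chunked_find and its chunk_size parameter are hypothetical names, and std::search stands in for whichever search routine would actually be called per chunk. Without -fopenmp the pragma is ignored and the loop simply runs sequentially.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the repository): returns the offset of the
// first occurrence of needle in hay, or std::string::npos if there is none.
size_t chunked_find(const std::string& hay, const std::string& needle, size_t chunk_size) {
    assert(chunk_size > 0);
    const size_t n = hay.size();
    const size_t k = needle.size();
    if (k == 0) return 0;
    if (k > n) return std::string::npos;

    const size_t starts = n - k + 1;  // number of valid start positions
    const long long chunks = static_cast<long long>((starts + chunk_size - 1) / chunk_size);
    size_t best = std::string::npos;  // npos is the largest size_t, so min() works

    #pragma omp parallel for reduction(min : best)
    for (long long c = 0; c < chunks; ++c) {
        const size_t begin = static_cast<size_t>(c) * chunk_size;
        const size_t count = std::min(chunk_size, starts - begin);
        const char* first = hay.data() + begin;
        const char* last = first + count + k - 1;  // k - 1 bytes of overlap
        // Stand-in for a SIMD routine; any substring search could be used here.
        const char* hit = std::search(first, last, needle.begin(), needle.end());
        if (hit != last) best = std::min(best, static_cast<size_t>(hit - hay.data()));
    }
    return best;
}
```

For example, chunked_find("abcdefgh", "def", 4) finds the match at offset 3 even though it crosses the boundary between the first and second chunk. Whether this pays off in practice probably depends on memory bandwidth, since a fast SIMD search may already be limited by how quickly the haystack can be read.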

Consider adding the following naive algorithm

The problem with comparing with, say, the standard library, is that it might use non-trivial implementations. It would be useful to have a really basic reference like this:

int naive(const char *hay, int size, const char *needle, int needlesize) {
  const char first = needle[0];
  const int maxpos = size - needlesize;  // last valid start position (inclusive)
  for (int i = 0; i <= maxpos; i++) {
    if (hay[i] != first) {
      // skip ahead to the next occurrence of the first needle character
      i++;
      while (i <= maxpos && hay[i] != first) i++;
      if (i > maxpos) break;
    }
    // compare the rest of the needle
    int j = 1;
    for (; j < needlesize; ++j)
      if (hay[i + j] != needle[j]) break;
    if (j == needlesize) return i;
  }
  return size;  // not found
}

Optimization when trying to find the longest substring match?

Context: for the Lempel-Ziv 77 family of compression algorithms, one has to search through a sliding window for the longest possible substring match. I figured that there were SIMD algorithms for that these days. Your page on SIMD-friendly algorithms for substring searching was one of the first things to show up. However, while a beautiful write-up of wonderful research and code, the algorithm described there appears to serve a subtly different purpose: finding a substring of known length. In the LZ77 scenario we don't know ahead of time what the longest substring we are looking for is. Instead we have a very long string as our input (the sliding window of LZ77), and ask the question "taking a substring at starting index i, what is the longest substring match inside the sliding window behind it?"

However, conceptually it seems that there is an obvious way to optimize the algorithm you described on your page.

Taking your algorithm as a starting point, I just came up with the following: first we start with a 2-length string. We can skip the memcmp part (since any 2-length first+last matches are guaranteed to be complete matches). Then in a loop:

  1. increase the string length,
  2. test the last char of the new length,
  3. instead of a memcmp, we AND the result with the previous mask:
    • if the new mask is non-zero, then basic induction tells us we still have a substring match at the new length; we checked all characters so far, after all.
    • if no bits are left set, then we know that the previous mask contained however many substring matches occurred within the text.

So we basically loop until the "new mask" is empty, then use the second-to-last mask to determine where all the substring matches are.

This is just something I came up with on the spot while reading your article, but it feels like the most straightforward, naive way to do this efficiently. Are you aware of whether or not this is a sensible approach when trying to find the longest substring match?
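
The loop described in this issue can be sketched in portable C++ as follows. This is a scalar emulation rather than SIMD code: each bit of a 64-bit mask stands for one candidate start position, where a vectorized version would presumably build the per-length mask with a byte compare plus movemask. The names Match and longest_match_in_block and the 64-candidate block size are hypothetical, and the sketch starts from length 0 instead of 2 for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

struct Match {
    size_t length;  // longest match length found in the block
    uint64_t mask;  // bit p set: candidate w + p matches for `length` bytes
};

// Hypothetical helper: finds the longest match for the string starting at
// s[i] among up to 64 candidate start positions w, w + 1, ... (all below i).
Match longest_match_in_block(const std::string& s, size_t w, size_t i) {
    assert(w <= i && i <= s.size());
    const size_t candidates = std::min<size_t>(64, i - w);
    uint64_t mask = (candidates == 64) ? ~0ULL : ((1ULL << candidates) - 1);
    size_t len = 0;
    while (i + len < s.size()) {
        // Mask of candidates whose byte at offset len equals s[i + len].
        uint64_t eq = 0;
        for (size_t p = 0; p < candidates; ++p)
            if (s[w + p + len] == s[i + len]) eq |= 1ULL << p;
        const uint64_t next = mask & eq;  // AND with the previous mask
        if (next == 0) break;             // previous mask holds the longest matches
        mask = next;
        ++len;
    }
    if (len == 0) mask = 0;  // no candidate matched even one byte
    return {len, mask};
}
```

For the text "abcabcabd" with w = 0 and i = 6, the call returns length 2 with bits 0 and 3 set, since "ab" occurs at offsets 0 and 3 and neither candidate continues with 'd'. One thing the sketch makes visible is that every extra byte of length costs another pass over the whole block, so long matches may be cheaper to verify another way once the mask has only a few bits left.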

Consider wider algorithms to reduce branching

Something like this... using 64-bit masks... which can reduce the number of branches quite a bit...

size_t  avx2_strstr_anysize64(const char* s, size_t n, const char* needle, size_t k) {
    assert(k > 0);
    assert(n > 0);
    const __m256i first = _mm256_set1_epi8(needle[0]);
    const __m256i last  = _mm256_set1_epi8(needle[k - 1]);
    for (size_t i = 0; i < n; i += 64) {

        const __m256i block_first1 = _mm256_loadu_si256((const __m256i*)(s + i));
        const __m256i block_last1  = _mm256_loadu_si256((const __m256i*)(s + i + k - 1));

        const __m256i block_first2 = _mm256_loadu_si256((const __m256i*)(s + i + 32));
        const __m256i block_last2  = _mm256_loadu_si256((const __m256i*)(s + i + k - 1 + 32));

        const __m256i eq_first1 = _mm256_cmpeq_epi8(first, block_first1);
        const __m256i eq_last1  = _mm256_cmpeq_epi8(last, block_last1);

        const __m256i eq_first2 = _mm256_cmpeq_epi8(first, block_first2);
        const __m256i eq_last2  = _mm256_cmpeq_epi8(last, block_last2);

        uint32_t mask1 = _mm256_movemask_epi8(_mm256_and_si256(eq_first1, eq_last1));
        uint32_t mask2 = _mm256_movemask_epi8(_mm256_and_si256(eq_first2, eq_last2));
        uint64_t mask = mask1  | ((uint64_t)mask2 << 32);

        while (mask != 0) {
            int bitpos = __builtin_ctzll(mask);
            if (memcmp(s + i + bitpos + 1, needle + 1, k - 2) == 0) {
                return i + bitpos;
            }
            mask ^= mask & (-mask);
        }
    }

    return n;
}
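
The mask handling in the proposal above can be looked at in isolation. The sketch below combines two 32-bit masks into one 64-bit mask and walks its set bits; candidate_offsets is a hypothetical helper name, and __builtin_ctzll is a GCC/Clang builtin, as in the code above. Keeping each movemask result in a uint32_t before widening matters, because _mm256_movemask_epi8 returns an int and a negative value would otherwise be sign-extended into the upper half.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: lists the candidate offsets (0..63) encoded in two
// 32-bit masks, where lo covers bytes 0..31 and hi covers bytes 32..63.
std::vector<int> candidate_offsets(uint32_t lo, uint32_t hi) {
    uint64_t mask = lo | (static_cast<uint64_t>(hi) << 32);
    std::vector<int> offsets;
    while (mask != 0) {
        offsets.push_back(__builtin_ctzll(mask));  // index of the lowest set bit
        mask &= mask - 1;  // clear it; same effect as mask ^= mask & (-mask)
    }
    return offsets;
}
```

With lo = 0x5 and hi = 0x1 the offsets are 0, 2 and 32, so a single loop serves both 32-byte halves.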

Present Day Performance Advantage

If I read your latest benchmarks correctly...

https://github.com/WojciechMula/sse4-strstr/blob/master/results/cascadelake-Gold-5217-gcc-7.4.0-avx512bw.txt

...nowadays there's no advantage to seeking std::strstr alternatives unless you're using AVX512F or AVX512BW?

Is that because all the standard libraries have been made AVX2-aware? I poked around in glibc earlier and could see AVX2 usage for memcmp, memcpy etc., but not for strstr, only SSE4.2. Maybe the handwritten SSE4.2 assembly is just that good?

clang produces different results than gcc-compiled program

GCC compilation: all procedures (except AVX2-wide) report the same reference result.

naive scalar                            ... reference result = 8108076510, time =   6.299609 s
std::strstr                             ... reference result = 8108076510, time =   0.659882 s
SWAR 64-bit (generic)                   ... reference result = 8108076510, time =   1.446615 s
SWAR 32-bit (generic)                   ... reference result = 8108076510, time =   2.529733 s
SSE2 (generic)                          ... reference result = 8108076510, time =   0.498816 s
SSE4.1 (MPSADBW)                        ... reference result = 8108076510, time =   0.640781 s
SSE4.1 (MPSADBW unrolled)               ... reference result = 8108076510, time =   0.961995 s
SSE4.2 (PCMPESTRM)                      ... reference result = 8108076510, time =   1.373412 s
SSE (naive)                             ... reference result = 8108076510, time =   1.960058 s
AVX2 (MPSADBW)                          ... reference result = 8108076510, time =   0.578520 s
AVX2 (generic)                          ... reference result = 8108076510, time =   0.374598 s
AVX2 (naive)                            ... reference result = 8108076510, time =   1.147053 s
AVX2 (naive unrolled)                   ... reference result = 8108076510, time =   0.795070 s
AVX2-wide (naive)                       ... reference result = 8107771150, time =   0.541654 s

In the clang compilation, the SSE4.1 MPSADBW variants report different values:

naive scalar                            ... reference result = 8108076510, time =   6.293796 s
std::strstr                             ... reference result = 8108076510, time =   0.660113 s
SWAR 64-bit (generic)                   ... reference result = 8108076510, time =   1.334720 s
SWAR 32-bit (generic)                   ... reference result = 8108076510, time =   2.518706 s
SSE2 (generic)                          ... reference result = 8108076510, time =   0.489896 s
SSE4.1 (MPSADBW)                        ... reference result = 5713208130, time =   1.787850 s
SSE4.1 (MPSADBW unrolled)               ... reference result = 7962617290, time =   0.985689 s
SSE4.2 (PCMPESTRM)                      ... reference result = 8108076510, time =   1.448608 s
SSE (naive)                             ... reference result = 8108076510, time =   1.946516 s
AVX2 (MPSADBW)                          ... reference result = 8108076510, time =   0.694087 s
AVX2 (generic)                          ... reference result = 8108076510, time =   0.353279 s
AVX2 (naive)                            ... reference result = 8108076510, time =   1.054814 s
AVX2 (naive unrolled)                   ... reference result = 8108076510, time =   0.795445 s
AVX2-wide (naive)                       ... reference result = 8107771150, time =   0.577752 s

Compilation errors with AVX2-strstr v1 and v2

Hi, I have found this gold mine of a repository and have been running some of your SSE code successfully. The exception, however, is your AVX2 strstr code.

First, I copied and pasted the code from avx2-strstr-v2.cpp after the code that implements the bits namespace, and then added the following test procedure:

int main(int argc, char *argv[])
{
    const char* s1 = argv[1];
    const char* s2 = argv[2];
    size_t result = avx2_strstr_v2(s1, strlen(s1), s2, strlen(s2));
    std::cout << result << std::endl;
    bool result2 = (strstr(s1, s2)) ? true : false;
    std::cout << result2 << std::endl;
    return 0;
}

Compiling the above with g++ -std=gnu++11 -mavx2 strstrAVX2_v02.cpp gives me this error:

strstrAVX2.cpp:237:27: error: the last argument must be an 8-bit immediate
             const __m256i substring = _mm256_alignr_epi8(next1, curr, i);

Then I tried with the code from avx2-strstr.cpp and compiled with g++ -std=gnu++11 -mavx2 strstrAVX2_v01.cpp. This gives a bunch of errors of the same type, all stating:

error: inlining failed in call to always_inline

Both test files are attached.

Thanks!

strstrSIMD.zip

could not compile avx2 on macos

In file included from src/all.h:18:
./avx2-strstr-v2.cpp:66:39: fatal error: argument to '__builtin_ia32_palignr256' must be a constant integer
const __m256i substring = _mm256_alignr_epi8(next1, curr, i);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/9.0.0/include/avx2intrin.h:130:12: note: expanded from macro '_mm256_alignr_epi8'
(__m256i)__builtin_ia32_palignr256((__v32qi)(__m256i)(a),
^
1 error generated.

I tried a lot of things but still could not fix this.
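
The _mm256_alignr_epi8 error here and in the previous issue says that the intrinsic's last argument must be a compile-time constant (an immediate). A loop variable only qualifies if the compiler has already folded it into a constant, which may be why the outcome depends on the compiler and optimization level. A common workaround is to make the count a template parameter and dispatch on the runtime value. The sketch below shows that pattern with the SSE2 intrinsic _mm_srli_si128 as a stand-in, since it has the same immediate requirement but needs no -mavx2; shift_right_bytes, shift_right_bytes_runtime and shift16 are hypothetical names, not code from the repository.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>

// The count is a template parameter, so the intrinsic always sees a constant.
template <int N>
static inline __m128i shift_right_bytes(__m128i v) {
    return _mm_srli_si128(v, N);
}

// Maps a runtime count (0..15) onto the 16 template instantiations.
static inline __m128i shift_right_bytes_runtime(__m128i v, int n) {
    switch (n) {
        case 0:  return shift_right_bytes<0>(v);
        case 1:  return shift_right_bytes<1>(v);
        case 2:  return shift_right_bytes<2>(v);
        case 3:  return shift_right_bytes<3>(v);
        case 4:  return shift_right_bytes<4>(v);
        case 5:  return shift_right_bytes<5>(v);
        case 6:  return shift_right_bytes<6>(v);
        case 7:  return shift_right_bytes<7>(v);
        case 8:  return shift_right_bytes<8>(v);
        case 9:  return shift_right_bytes<9>(v);
        case 10: return shift_right_bytes<10>(v);
        case 11: return shift_right_bytes<11>(v);
        case 12: return shift_right_bytes<12>(v);
        case 13: return shift_right_bytes<13>(v);
        case 14: return shift_right_bytes<14>(v);
        case 15: return shift_right_bytes<15>(v);
        default: return _mm_setzero_si128();
    }
}
#endif

// Writes to out the 16 bytes of in shifted down by n positions, zero-filled.
void shift16(const uint8_t* in, int n, uint8_t* out) {
    assert(n >= 0 && n < 16);
#if defined(__SSE2__)
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), shift_right_bytes_runtime(v, n));
#else
    // Portable fallback so the sketch also builds on non-x86 targets.
    for (int i = 0; i < 16; ++i) out[i] = (i + n < 16) ? in[i + n] : 0;
#endif
}
```

The same shape should carry over to _mm256_alignr_epi8 by templating the function that contains the call and instantiating it for each shift amount, at the cost of a switch (or an unrolled sequence of calls) in place of the loop.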

add avx2 naive algo with loop unrolling

You added the naive AVX2 algorithm, but without loop unrolling. Consider adding something like this:

size_t naiveavx2_unrolled(const char* s, size_t text_size, const char* needle, size_t needle_size) {
      assert(text_size % 32 == 0);
      // todo: fix it so we can handle variable-length inputs and
      // can catch matches at the end of the data.
      for (size_t i = 0; i < text_size - needle_size; i += 32) {
        uint32_t found = 0xFFFFFFFF; // 32 1-bits
        size_t j = 0;
        for (; (j + 3 < needle_size) && (found != 0)  ; j += 4) {
          __m256i textvector1 = _mm256_loadu_si256((const __m256i *)(s + i + j));
          __m256i needlevector1 = _mm256_set1_epi8(needle[j]);
          __m256i textvector2 = _mm256_loadu_si256((const __m256i *)(s + i + j + 1));
          __m256i needlevector2 = _mm256_set1_epi8(needle[j + 1]);
          __m256i cmp1 = _mm256_cmpeq_epi8(textvector1, needlevector1);
          __m256i cmp2 = _mm256_cmpeq_epi8(textvector2, needlevector2);
          __m256i textvector3 = _mm256_loadu_si256((const __m256i *)(s + i + j + 2));
          __m256i needlevector3 = _mm256_set1_epi8(needle[j + 2]);
          __m256i textvector4 = _mm256_loadu_si256((const __m256i *)(s + i + j + 3));
          __m256i needlevector4 = _mm256_set1_epi8(needle[j + 3]);
          __m256i cmp3 = _mm256_cmpeq_epi8(textvector3, needlevector3);
          __m256i cmp4 = _mm256_cmpeq_epi8(textvector4, needlevector4);
          __m256i cmp12 = _mm256_and_si256(cmp1,cmp2);
          __m256i cmp34 = _mm256_and_si256(cmp3,cmp4);
          uint32_t bitmask = _mm256_movemask_epi8(_mm256_and_si256(cmp12,cmp34));
          found = found & bitmask;
        }
        for (; (j < needle_size) && (found != 0) ; ++j) {
          __m256i textvector = _mm256_loadu_si256((const __m256i *)(s + i + j));
          __m256i needlevector = _mm256_set1_epi8(needle[j]);
          uint32_t bitmask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(textvector, needlevector));
          found = found & bitmask;
        }
        if(found != 0) {
          // got a match... maybe
          return i + __builtin_ctz(found);
        }
      }
      return text_size;
}

SSE naive implementation is broken

$ make test_sse4
SSE naive... FAILED
   string = '$x#' (length 3)
   neddle = '$x#' (length 3)
   expected result = 0, actual result = 18446744073709551615
