zwegner / faster-utf8-validator

A very fast library for validating UTF-8 using AVX2/SSE4 instructions

Python 4.36% Lua 27.83% C 67.82%

faster-utf8-validator's Issues

AVX512 Validation

This is less an issue than a tip. You mention not having access to an AVX-512 machine:

"This algorithm should map fairly nicely to AVX-512, and should in fact be a bit faster than 2x the speed of AVX2 since a few instructions can be saved. But I don't have an AVX-512 machine, so I haven't tried it yet."

I wanted to mention that the Intel Software Development Emulator (SDE) can emulate AVX-512 instructions for any program you run under it, which would let you verify an AVX-512 implementation.

I'm also willing to benchmark any AVX-512 implementation you have on my Skylake-X i9-7900X, and on the upcoming Cascade Lake-X processor that I will be getting soon.

Provide separate functions/files for AVX2/SSE4/SSSE3

Thanks for the great work 👍. I now use the library in Notepad2 (https://github.com/zufuliu/notepad2, as of commit zufuliu/notepad2@4813772).
I found that manually expanded/unrolled versions of the two functions are about 2x faster than the current functions (on an x64 i5, built with MSVC 2017).

The attachment contains the expanded code. It is based on the preprocessor output of cl /EP (similar to gcc -E -P), with some formatting changes and the following modifications:

In the ASCII test (vmask_t high = v_test_bit(bytes, 7);), _mm_slli_epi16 / _mm256_slli_epi16 is removed.
_mm_srli_epi16 / _mm256_srli_epi16 is removed from vec_t e_2 = v_lookup(error_2, shifted_bytes, 0);
The for (int n = 1; n <= 3; n++) loop is manually unrolled to fix a compiler warning about a potentially uninitialized vmask_t cont.
A scalar fallback is added for the SSE4.1 intrinsic _mm_testz_si128, so the code also works with SSSE3. The scalar version seems even faster.

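#if defined(__SSE4_1__)
    // SSE4.1: fail if (e_1 & e_2 & e_3) has any bit set.
    if (!_mm_testz_si128(_mm_and_si128(e_1, e_2), e_3)) {
        return 0;
    }
#else
    // SSSE3 fallback: combine the three error vectors, store the result
    // to memory, and test both 64-bit halves with scalar code.
    e_3 = _mm_and_si128(_mm_and_si128(e_1, e_2), e_3);
    uint64_t dummy[2];
    _mm_storeu_si128((__m128i *)dummy, e_3);
    dummy[0] |= dummy[1];
    if (dummy[0]) {
        return 0;
    }
#endif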
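
For builds without SSE4.1, another way to test a vector for set bits without the store to memory is the SSE2 compare/movemask idiom. This is only a minimal sketch: the helper name any_bit_set_sse2 is made up for illustration, and whether it beats the scalar version above has not been measured.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Returns 1 if any bit of v is set, 0 if v is all zero. */
static int any_bit_set_sse2(__m128i v)
{
    /* Each byte of is_zero is 0xFF where the matching byte of v is 0. */
    __m128i is_zero = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    /* movemask gathers the top bit of all 16 bytes; 0xFFFF means every
       byte compared equal to zero. */
    return _mm_movemask_epi8(is_zero) != 0xFFFF;
}
```

With such a helper, the fallback branch could become if (any_bit_set_sse2(e_3)) return 0; after the three error vectors are combined.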

Applications can use separate functions for runtime dispatching, and providing separate functions/files would make the library easier to build.

The attachment is nearly identical to the code at https://github.com/zufuliu/notepad2/blob/master/src/EditEncoding.c#L1227
z_validate.zip
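
To illustrate the runtime dispatching mentioned above, here is a rough sketch of what an application could do if the library exposed one function per instruction set. All names here (validate_avx2, validate_sse4, validate_scalar, pick_validator, validate_utf8) are hypothetical and not the library's API. The two SIMD functions are placeholders that forward to a scalar validator so the example compiles on its own. CPU detection uses the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports; MSVC would need __cpuid instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*validate_fn)(const uint8_t *data, size_t len);

/* Portable scalar validator: returns 1 for well-formed UTF-8, else 0. */
static int validate_scalar(const uint8_t *p, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint8_t b = p[i], lo = 0x80, hi = 0xBF;
        size_t len;
        if (b < 0x80) { i++; continue; }
        if (b >= 0xC2 && b <= 0xDF) {
            len = 2;
        } else if (b >= 0xE0 && b <= 0xEF) {
            len = 3;
            if (b == 0xE0) lo = 0xA0; /* reject overlong forms */
            if (b == 0xED) hi = 0x9F; /* reject surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            len = 4;
            if (b == 0xF0) lo = 0x90; /* reject overlong forms */
            if (b == 0xF4) hi = 0x8F; /* reject values above U+10FFFF */
        } else {
            return 0; /* stray continuation byte or invalid lead byte */
        }
        if (len > n - i) return 0; /* truncated sequence */
        if (p[i + 1] < lo || p[i + 1] > hi) return 0;
        for (size_t k = 2; k < len; k++)
            if ((p[i + k] & 0xC0) != 0x80) return 0;
        i += len;
    }
    return 1;
}

/* Hypothetical placeholders: a real build would compile these from
   separate files with -msse4.1 / -mavx2 and use the SIMD code there. */
static int validate_sse4(const uint8_t *p, size_t n) { return validate_scalar(p, n); }
static int validate_avx2(const uint8_t *p, size_t n) { return validate_scalar(p, n); }

static validate_fn pick_validator(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return validate_avx2;
    if (__builtin_cpu_supports("sse4.1")) return validate_sse4;
#endif
    return validate_scalar;
}

int validate_utf8(const uint8_t *data, size_t len)
{
    static validate_fn impl; /* resolved on first call; not thread-safe */
    assert(data != NULL || len == 0);
    if (!impl) impl = pick_validator();
    return impl(data, len);
}
```

Callers then only see validate_utf8(buf, len), and each variant can live in its own translation unit with its own compiler flags.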

Investigate adaptive approaches

BeeOnRope brought up some good points in this subthread of my HN post: https://news.ycombinator.com/item?id=21550453

I can see two main approaches to adaptation: choosing whether to use an early exit for ASCII-only input, and computing the continuation mask for 4-byte (and maybe 3-byte?) sequences lazily, only if the normal checks fail. Here are some quick notes on both approaches, followed by some general notes.

ASCII approach:
The main idea is to take away the misprediction penalty that happens when input has long strings of only-ASCII or only-non-ASCII, and hoist the unpredictable branch to an outer loop of N vectors of input. The unpredictable branch would be based on a counter, updated in both the fast-ASCII-exit and branchless paths. This could be made very cheap. Where we have in the current code:

    if (!(high | *last_cont))
        return 1;

This would turn into something like this:

    vmask_t cond = !(high | *last_cont);
    *counter += cond;
    if (BRANCHY && cond)
        return 1;

...where BRANCHY is a macro/specialization that determines whether we're compiling the branchless version or not. If the counter indicates X out of N vectors in the outer loop were ASCII-only, we should use the fast-ASCII-path version in the next chunk.
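
To make the outer loop concrete, here is a scalar stand-in, not the SIMD code. Every name and constant in it is an assumption made for illustration: a "vector" is 16 bytes, N is CHUNK_VECS = 8, X is ASCII_MIN = 6, and s->need != 0 plays the role of *last_cont. The per-vector function takes branchy as a constant argument so that a compiler can specialize the two versions after inlining.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES  16 /* stand-in for one SIMD vector */
#define CHUNK_VECS 8  /* N: vectors per outer-loop chunk (assumed) */
#define ASCII_MIN  6  /* X: ASCII-only vectors needed to go branchy (assumed) */

/* Carried state: continuation bytes still expected, and the allowed
   range for the next one. */
typedef struct { unsigned need; uint8_t lo, hi; } utf8_state;

/* Consume one byte; returns 0 on a validation error. */
static int step(utf8_state *s, uint8_t b)
{
    if (s->need) {
        if (b < s->lo || b > s->hi) return 0;
        s->lo = 0x80; s->hi = 0xBF; s->need--;
        return 1;
    }
    if (b < 0x80) return 1;
    s->lo = 0x80; s->hi = 0xBF;
    if (b >= 0xC2 && b <= 0xDF) {
        s->need = 1;
    } else if (b >= 0xE0 && b <= 0xEF) {
        s->need = 2;
        if (b == 0xE0) s->lo = 0xA0;
        if (b == 0xED) s->hi = 0x9F;
    } else if (b >= 0xF0 && b <= 0xF4) {
        s->need = 3;
        if (b == 0xF0) s->lo = 0x90;
        if (b == 0xF4) s->hi = 0x8F;
    } else {
        return 0;
    }
    return 1;
}

static inline int validate_vec(const uint8_t *p, utf8_state *s,
                               unsigned *counter, int branchy)
{
    uint8_t acc = 0;
    for (int i = 0; i < VEC_BYTES; i++) acc |= p[i];
    /* ASCII-only vector with no sequence carried in from the last one. */
    int cond = !((acc >> 7) | (s->need != 0));
    *counter += cond;               /* updated in both versions */
    if (branchy && cond) return 1;  /* fast ASCII exit */
    for (int i = 0; i < VEC_BYTES; i++)
        if (!step(s, p[i])) return 0;
    return 1;
}

static inline int validate_chunk(const uint8_t *p, utf8_state *s,
                                 unsigned *counter, int branchy)
{
    for (int v = 0; v < CHUNK_VECS; v++)
        if (!validate_vec(p + v * VEC_BYTES, s, counter, branchy)) return 0;
    return 1;
}

int validate_adaptive(const uint8_t *data, size_t len)
{
    utf8_state s = {0, 0x80, 0xBF};
    int branchy = 1;
    size_t i = 0;
    assert(data != NULL || len == 0);
    for (; len - i >= CHUNK_VECS * VEC_BYTES; i += CHUNK_VECS * VEC_BYTES) {
        unsigned counter = 0;
        /* The only mode-dependent branch: taken once per chunk. */
        int ok = branchy ? validate_chunk(data + i, &s, &counter, 1)
                         : validate_chunk(data + i, &s, &counter, 0);
        if (!ok) return 0;
        branchy = counter >= ASCII_MIN; /* choose the next chunk's version */
    }
    for (; i < len; i++) /* tail shorter than one chunk */
        if (!step(&s, data[i])) return 0;
    return s.need == 0;
}
```

Both versions update the counter, so the decision for the next chunk is available regardless of which version ran on the current one.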

4-byte approach:
I'll just copy+paste part of one of my posts: because my algorithm uses lookup tables for error flags which have some free bits, and because the 4 byte sequences can be detected in the same indices that are used for these lookups, we can set another error bit that means "take the 4-byte slow path". Then, when some input fails validation, we only do the work there to check whether it's really a failure. This gets complicated, though: first off, the check for the proper number of continuation bytes is before the table lookup, so we'd need to put some logic in there. Secondly, this lookup table gets validation failures one byte later in the stream than the initial byte. So in the case that the 4-byte sequence starts on the last byte of a vector, we'd need to have special handling.
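
A very reduced scalar model of the "spare bit means slow path" idea, just to pin down the control flow. None of this is the real table layout: lead_info, SLOW4, and the scan function are invented for the sketch. The table is indexed by the high nibble of a byte and stores a sequence length in its low bits, with one otherwise unused bit marking possible 4-byte leads. The fast pass treats that bit as a failure, and only then is the full check run. For simplicity the slow pass restarts from the beginning, where a real implementation would presumably resume at the failing vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LEN_MASK = 0x07, SLOW4 = 0x80 };
enum { SCAN_OK = 0, SCAN_ERROR = 1, SCAN_SLOW_PATH = 2 };

/* Indexed by the high nibble of a byte. Low bits: sequence length
   (0 = not a valid lead). High bit: "take the 4-byte slow path". */
static const uint8_t lead_info[16] = {
    1, 1, 1, 1, 1, 1, 1, 1, /* 0x0_..0x7_: ASCII */
    0, 0, 0, 0,             /* 0x8_..0xB_: continuation bytes */
    2, 2,                   /* 0xC_, 0xD_: 2-byte leads */
    3,                      /* 0xE_: 3-byte leads */
    4 | SLOW4               /* 0xF_: 4-byte leads */
};

static int scan(const uint8_t *p, size_t n, int allow4)
{
    size_t i = 0;
    while (i < n) {
        uint8_t b = p[i], info = lead_info[b >> 4];
        size_t len = info & LEN_MASK;
        if ((info & SLOW4) && !allow4) return SCAN_SLOW_PATH;
        if (len == 0 || len > n - i) return SCAN_ERROR;
        if (len > 1) {
            uint8_t lo = 0x80, hi = 0xBF;
            if (b < 0xC2 || b > 0xF4) return SCAN_ERROR;
            if (b == 0xE0) lo = 0xA0; /* overlong 3-byte */
            if (b == 0xED) hi = 0x9F; /* surrogates */
            if (b == 0xF0) lo = 0x90; /* overlong 4-byte */
            if (b == 0xF4) hi = 0x8F; /* above U+10FFFF */
            if (p[i + 1] < lo || p[i + 1] > hi) return SCAN_ERROR;
            for (size_t k = 2; k < len; k++)
                if ((p[i + k] & 0xC0) != 0x80) return SCAN_ERROR;
        }
        i += len;
    }
    return SCAN_OK;
}

int validate_lazy(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    int r = scan(data, len, 0);      /* fast pass: 1- to 3-byte sequences */
    if (r == SCAN_SLOW_PATH)
        r = scan(data, len, 1);      /* was it really a failure? */
    return r == SCAN_OK;
}
```

Input without any 0xF_ byte never reaches the second pass, which is the case this approach is meant to speed up.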

General issues:

Realistic benchmarks: What sort of applications would a UTF-8 validator be used for, and what are typical inputs? BeeOnRope suggested Wikipedia articles in various languages, which is a pretty good starting point.
How will this affect code complexity? The C preprocessor "metaprogramming" already sucks. Would it be worth switching to C++? It looks like AVX-512 will already benefit from one extra specialization (VBMI or not); multiplying this by two extra flags will add a lot of code.
How will this affect code size? In the current microbenchmarks, code size is mostly irrelevant, but it matters more for large applications. Increasing code size/I$ pressure not only makes cache misses more likely, it also means missing out on the metadata (decoding hints and branch information) that Intel chips (IIRC) keep only in the L1I$.

If anyone has any thoughts on these issues, particularly anyone that has plans to use this code in a real project, I'd like to hear from them.

license file

Hey @zwegner this looks really interesting. Would you mind clarifying the license of the validator? I didn't see any mention of this. Or better yet add a LICENSE file to the project?

Chromium/firefox use

If it is truly the state of the art, browsers should probably begin to use it.
