Comments (10)
Sure, @Zabrane! Here is the C++ code for the benchmark, but it might be a bit convoluted. The function is pretty well documented in the C header, so it might be easier to start there.
from stringzilla.
Hi @luluna02! Thanks for participating! The baseline serial and AVX-512 implementations are present, but there is a lot of room for improvement. It all depends on your experience and technical expertise. If you look at the current AVX-512 variant, it uses Galois Field operations implemented in hardware and available with the GFNI extensions. That helps us emulate a multi-shift operation at the level of 8-bit words of the ZMM register. Changing the multiplication argument there can save us several CPU cycles, but it might be a hard place to start.
from stringzilla.
Yes, this can be a nice feature. What would be a good interface for that? How about this?
text.split(separators='abcd')
I was also considering adding specialized SWAR splitting functionality for CSV rows... I'd have to skip the escape symbols and support separators of different length: `,`, `",`, `,"`, and `","`.
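For reference, the semantics of a `separators` argument like the one proposed here can be sketched in plain Python. This is only an illustration of splitting on any character from a set; `split_on_any` is a hypothetical helper, not part of StringZilla's API, and it makes no attempt at CSV quoting or multi-character separators.

```python
def split_on_any(text: str, separators: str) -> list[str]:
    """Split `text` at every character that appears in `separators`.

    Hypothetical reference helper: a plain serial loop, with no SWAR or SIMD.
    """
    separator_set = frozenset(separators)
    parts = []
    start = 0
    for index, char in enumerate(text):
        if char in separator_set:
            # Close the current token; adjacent separators yield empty tokens.
            parts.append(text[start:index])
            start = index + 1
    parts.append(text[start:])
    return parts


print(split_on_any("a,b;c d", ",; "))  # ['a', 'b', 'c', 'd']
```

Like `str.split` with an explicit separator, this sketch keeps empty tokens between adjacent separators rather than collapsing them.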
from stringzilla.
Yeah that is an important concern.
Considering the nature of separators (ASCII punctuation, Unicode punctuation, ASCII whitespace, Unicode whitespace, and control characters that libraries often confuse for either punctuation or whitespace),
I think a robust set of tests would be necessary before a feature like this is merged.
One way to help canonicalize the string would be to only accept raw strings, like r" ",
but in that case escapes are treated as literals (so r"\n" is definitely \ and n), which might be difficult to handle as well.
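A quick check in plain Python shows the difference between the escaped and the raw literal. It uses only built-in behavior, and the variable names are just for illustration:

```python
escaped = "\n"   # one character: the newline itself
raw = r"\n"      # two characters: a backslash followed by 'n'

print(len(escaped), len(raw))       # 1 2
print(raw == "\\n")                 # True
print([hex(ord(c)) for c in raw])   # ['0x5c', '0x6e']
```

So a separator set built from a raw string would treat the backslash and `n` as two independent separators, not as a newline.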
from stringzilla.
@chris-ha458 I've added this functionality, and it will be coming with the v3 major release. With AVX-512 there is a noticeable uplift compared to serial code.
Parsed the file with 8388608 words of 5 mean length!
Benchmarking for whitespaces:
- std::string_view.find_first_of 0.2617 GB/s
- sz_find_from_set_serial 0.2548 GB/s
- sz_find_from_set_avx512 0.4394 GB/s
- std::string_view.find_last_of 0.2861 GB/s
- sz_find_last_from_set_serial 0.2604 GB/s
- sz_find_last_from_set_avx512 0.4402 GB/s
Benchmarking for punctuation marks:
- std::string_view.find_first_of 0.5188 GB/s
- sz_find_from_set_serial 0.5265 GB/s
- sz_find_from_set_avx512 0.6465 GB/s
- std::string_view.find_last_of 0.5308 GB/s
- sz_find_last_from_set_serial 0.5330 GB/s
- sz_find_last_from_set_avx512 0.7384 GB/s
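To put a number on the uplift, the AVX-512 to serial ratios can be computed directly from the throughput figures above. The dictionary keys are just labels chosen for this sketch:

```python
# (serial GB/s, AVX-512 GB/s), copied from the benchmark output above.
results_gb_per_s = {
    "whitespace, find first": (0.2548, 0.4394),
    "whitespace, find last": (0.2604, 0.4402),
    "punctuation, find first": (0.5265, 0.6465),
    "punctuation, find last": (0.5330, 0.7384),
}

speedups = {
    name: avx512 / serial
    for name, (serial, avx512) in results_gb_per_s.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

That works out to roughly 1.7x for whitespace and between 1.2x and 1.4x for punctuation marks.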
from stringzilla.
This would be helpful for tokenizers or anything relying on tokenizers, which is an important use case for string manipulation.
from stringzilla.
I would think the best interface would be taking a list.
For instance, if we take multi-character strings as a list, it becomes ambiguous how to handle
"\n" (a newline achieved by escape, but also technically '\\' + 'n') vs str(b"0x0A") (actual bit representation for ASCII newline).
Although the pitfalls still exist for lists like ['\n','\\','n',str(b'0x0A')], it becomes less ambiguous imo.
(So..many..pitfalls, especially how Python can silently coerce a string into a list and vice versa)
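A few of these pitfalls can be reproduced with built-in Python behavior alone. Note in particular that, under default interpreter settings, `str()` on a bytes object returns its printable representation and not a decoded newline:

```python
newline = "\n"

# str() on bytes gives the repr text, not the byte 0x0A.
print(str(b"0x0A"))                         # b'0x0A'
print(str(b"0x0A") == newline)              # False
print(b"\x0a".decode("ascii") == newline)   # True

# A string is itself an iterable of one-character strings,
# so it is accepted wherever a list of separators is expected.
print(list("\n,"))                          # ['\n', ',']

# Membership means different things for a string and for a list.
print("," in ",;", "," in [",;"])           # True False
```

The last line is the kind of silent mismatch a list-based interface would have to guard against: `",;"` as a string contains a comma, while `[",;"]` as a list holds a single two-character separator.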
from stringzilla.
From a UX perspective, you are right. From a performance perspective, depending on the time of the day, the interpreter may spend a lot of time building those lists… strings might be a cheaper option.
from stringzilla.
@ashvardanian amazing. I'm also interested in this.
Is there a C example I can look at to learn how to use it please?
from stringzilla.
Hello @ashvardanian, I find this project really interesting (good job!!) and I want to contribute.
I'm curious if this issue is closed yet?
from stringzilla.