Comments (10)

ashvardanian avatar ashvardanian commented on July 22, 2024 4

Sure, @Zabrane! Here is the C++ code for the benchmark, but it might be a bit convoluted. The function is pretty well documented in the C header, so it might be easier to start there.

from stringzilla.

ashvardanian avatar ashvardanian commented on July 22, 2024 2

Hi @luluna02! Thanks for participating! The baseline serial and AVX-512 implementations are present, but there is a lot of room for improvement. It all depends on your experience and technical expertise. If you look at the current AVX-512 variant, it uses Galois Field operations implemented in hardware and available with the GFNI extensions. That helps us emulate a multi-shift operation at the level of the 8-bit words of a ZMM register. Changing the multiplication argument there can save us several CPU cycles, but it might be a hard place to start.

ashvardanian avatar ashvardanian commented on July 22, 2024 1

Yes, this can be a nice feature. What would be a good interface for that? How about this?

text.split(separators='abcd')
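
One way to read that proposal is that every character of the separators string acts as its own delimiter. Below is a minimal pure-Python sketch of those semantics. It assumes single-character separators only, and split_on_any is a hypothetical helper used for illustration, not StringZilla's API.

```python
from __future__ import annotations


def split_on_any(text: str, separators: str) -> list[str]:
    """Split `text` at every character that appears in `separators`."""
    seps = set(separators)  # O(1) membership test per character
    parts: list[str] = []
    start = 0
    for i, ch in enumerate(text):
        if ch in seps:
            parts.append(text[start:i])  # may be empty for adjacent separators
            start = i + 1
    parts.append(text[start:])  # trailing piece after the last separator
    return parts


print(split_on_any("a,b;c d", ",; "))  # ['a', 'b', 'c', 'd']
print(split_on_any("a,,b", ","))       # ['a', '', 'b']
```

Like str.split with an explicit separator, this sketch keeps empty pieces between adjacent separators; a real implementation could reasonably choose otherwise.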

I was also considering adding specialized SWAR splitting functionality for CSV rows... I'd have to skip the escape symbols and support separators of different lengths: a bare comma (,), quote-comma (",), comma-quote (,"), and quote-comma-quote (",").
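
To illustrate why CSV rows need more than a plain split on commas, here is a small sketch using Python's standard csv module as a reference for the expected result. The sample row is made up for illustration, and this says nothing about how a SWAR implementation would work internally.

```python
import csv

# Hypothetical sample row: one field contains a comma, one contains escaped quotes.
row = 'id,"Doe, Jane","say ""hi""",42'

# A naive split breaks the quoted field apart and leaves the quotes in place.
naive = row.split(",")

# A CSV-aware parser treats quoted commas as data and unescapes doubled quotes.
parsed = next(csv.reader([row]))

print(naive)   # ['id', '"Doe', ' Jane"', '"say ""hi"""', '42']
print(parsed)  # ['id', 'Doe, Jane', 'say "hi"', '42']
```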

chris-ha458 avatar chris-ha458 commented on July 22, 2024 1

Yeah, that is an important concern.
Considering the nature of separators (ASCII punctuation, Unicode punctuation, ASCII whitespace, Unicode whitespace, control characters that libraries often confuse with either punctuation or whitespace, etc.), I think it would be necessary to have a robust set of tests before a feature like this is merged.

One way to help canonicalize the string would be to accept only raw strings (r" "), but in that case escapes are treated as literals (so r"\n" is definitely \ and n), which might be difficult to handle as well.
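
The difference between an escaped and a raw literal is easy to check in plain Python. The snippet below is only an illustration of the language behavior being discussed; the variable names are arbitrary.

```python
escaped = "\n"   # one character: the ASCII line feed
raw = r"\n"      # two characters: a backslash followed by 'n'

print(len(escaped), len(raw))  # 1 2

# The same text splits differently depending on which literal is the separator.
text = "a\nb\\nc"  # a, line feed, b, backslash, n, c
by_escaped = text.split(escaped)
by_raw = text.split(raw)

print(by_escaped)  # ['a', 'b\\nc']
print(by_raw)      # ['a\nb', 'c']
```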

ashvardanian avatar ashvardanian commented on July 22, 2024 1

@chris-ha458 I've added this functionality, and it will be coming with the v3 major release. With AVX-512 there is a noticeable uplift compared to serial code.

Parsed the file with 8388608 words of 5 mean length!
Benchmarking for whitespaces:
- std::string_view.find_first_of    0.2617 GB/s
- sz_find_from_set_serial           0.2548 GB/s
- sz_find_from_set_avx512           0.4394 GB/s
- std::string_view.find_last_of     0.2861 GB/s
- sz_find_last_from_set_serial      0.2604 GB/s
- sz_find_last_from_set_avx512      0.4402 GB/s
Benchmarking for punctuation marks:
- std::string_view.find_first_of    0.5188 GB/s
- sz_find_from_set_serial           0.5265 GB/s
- sz_find_from_set_avx512           0.6465 GB/s
- std::string_view.find_last_of     0.5308 GB/s
- sz_find_last_from_set_serial      0.5330 GB/s
- sz_find_last_from_set_avx512      0.7384 GB/s
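
For readers unfamiliar with what these functions compute, here is a rough Python sketch of the find-first and find-last semantics over a set of bytes, using a 256-bit bitmap for membership tests. The names build_bitset, find_first_in_set, and find_last_in_set are hypothetical, and this is an assumption-laden illustration of the general technique, not the library's serial or AVX-512 code.

```python
def build_bitset(chars: bytes) -> int:
    """Pack a set of byte values into one integer used as a 256-bit bitmap."""
    bits = 0
    for c in chars:
        bits |= 1 << c
    return bits


def find_first_in_set(haystack: bytes, bitset: int) -> int:
    """Index of the first byte that belongs to the set, or -1 if none does."""
    for i, c in enumerate(haystack):
        if (bitset >> c) & 1:
            return i
    return -1


def find_last_in_set(haystack: bytes, bitset: int) -> int:
    """Index of the last byte that belongs to the set, or -1 if none does."""
    for i in range(len(haystack) - 1, -1, -1):
        if (bitset >> haystack[i]) & 1:
            return i
    return -1


whitespace = build_bitset(b" \t\n\r\x0b\x0c")
print(find_first_in_set(b"hello world\n", whitespace))  # 5
print(find_last_in_set(b"hello world\n", whitespace))   # 11
```

The bitmap makes each membership test a constant-time shift and mask, regardless of how many characters are in the set.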

chris-ha458 avatar chris-ha458 commented on July 22, 2024

This would be helpful for tokenizers, or anything relying on tokenizers, which is an important use case for string manipulation.

chris-ha458 avatar chris-ha458 commented on July 22, 2024

I would think the best interface would be taking a list.
For instance, if we treat a multi-character string as a list of separators, it becomes ambiguous how to handle
"\n" (a newline achieved by an escape, but also technically '\' + 'n') vs. str(b"0x0A") (the actual bit representation of the ASCII newline).
Although the pitfalls still exist for lists like ['\n','\','n',str(b'0x0A')], it becomes less ambiguous, in my opinion.
(So many pitfalls, especially how Python can silently coerce a string into a list and vice versa.)
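
A few of these pitfalls can be shown directly. The sketch below checks what the expressions above actually evaluate to in Python 3 and adds a hypothetical normalize_separators helper that accepts either a string or a list and rejects anything that is not a single character. It is one possible validation approach, not something the thread settled on.

```python
from __future__ import annotations

# In Python 3, str() on bytes gives the repr text, not a decoded newline.
print(str(b"0x0A") == "b'0x0A'")           # True
print(b"\x0a".decode("ascii") == "\n")     # True

# A string silently becomes a list of its characters.
print(list("\n;") == ["\n", ";"])          # True


def normalize_separators(separators) -> set[str]:
    """Accept a str or an iterable of str; require single-character items."""
    items = list(separators)  # a str yields characters, a list yields elements
    for item in items:
        if not isinstance(item, str) or len(item) != 1:
            raise ValueError(f"separator must be a single character: {item!r}")
    return set(items)


print(normalize_separators("\n;") == normalize_separators(["\n", ";"]))  # True
```

With this check, a raw two-character entry such as r"\n" or an empty string inside a list raises an error instead of being interpreted silently.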

ashvardanian avatar ashvardanian commented on July 22, 2024

From a UX perspective, you are right. From a performance perspective, depending on the time of day, the interpreter may spend a lot of time building those lists… strings might be a cheaper option.

Zabrane avatar Zabrane commented on July 22, 2024

@ashvardanian Amazing. I'm also interested in this.
Is there a C example I can look at to learn how to use it, please?

luluna02 avatar luluna02 commented on July 22, 2024

Hello @ashvardanian, I find this project really interesting (good job!!) and I want to contribute.
I'm curious whether this issue has been closed yet?
