
Comments (21)

sneves commented on June 13, 2024

What do you mean by "you are processing those buffers as 2 * block size arrays of zero bytes"? I'm afraid I don't understand what you are asking.

Steppenwolfe65 commented on June 13, 2024

Our implementations are very different; mine houses 2b and 2bp in the same class (the version is set via the constructor), and the parallel update function is also a departure from your design. I see now that you are caching for the blake2b final function, so it's not all zeroes as I first thought, but I still think the design might be improved:
In my version, if I have more than 4 full blocks I compress them and store the remainder in the blake2bp_state buffer for the finalizer; the state buffers for each leaf are zero-sized and unused in parallel mode. I think this avoids unnecessary copying and reduces buffer allocations.
Because you are using two classes, blake2b update would need a modification to support this: a 'nocache' flag added to blake2b_state that bypasses the buffer stage when a full block is received, giving the blake2bp update function control over the caching mechanism (see the sketch below). Make sense?
I'm just writing this out now and trying to make sense of it, so I may have misconstrued something. When it is complete, I'll send you a link.
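
A minimal, self-contained sketch of that proposed bypass (not code from either library; the state layout, field names, and stubbed compress are all hypothetical):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define BLOCKBYTES 128

  /* Stub state: only what is needed to illustrate the 'nocache' idea. */
  typedef struct {
    uint8_t buf[BLOCKBYTES];
    size_t  buflen;
    int     nocache;    /* proposed flag: the 2bp update owns the caching */
    size_t  compressed; /* stub counter in place of real compression */
  } state_t;

  static void compress(state_t *S, const uint8_t *block)
  {
    (void)block;        /* real code would run the compression function here */
    S->compressed++;
  }

  /* Update with the proposed bypass: full blocks skip the buffer copy.
     (Simplified: real blake2b keeps the last block buffered for finalization.) */
  static void update(state_t *S, const uint8_t *in, size_t inlen)
  {
    while (inlen >= BLOCKBYTES) {
      if (S->nocache) {
        compress(S, in);                /* no memcpy into S->buf */
      } else {
        memcpy(S->buf, in, BLOCKBYTES); /* buffered path */
        compress(S, S->buf);
      }
      in += BLOCKBYTES;
      inlen -= BLOCKBYTES;
    }
    /* any remainder would be cached by the caller in nocache mode */
  }

  int main(void)
  {
    uint8_t msg[4 * BLOCKBYTES] = {0};
    state_t S = {0};
    S.nocache = 1;
    update(&S, msg, sizeof msg);
    printf("blocks compressed without buffering: %zu\n", S.compressed);
    return 0;
  }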

noloader commented on June 13, 2024

Our implementations are very different...

If you don't mind the question, which library or project?

Steppenwolfe65 commented on June 13, 2024

The Blake2 implementations are not published yet, but they will be going into CEX when I am done (+/- a week).

sneves commented on June 13, 2024

OK, I get what you're saying now. You're right, this is an instance of buffer bloat. It is absolutely something that can be improved. I don't think I am going to address this, though, since it slightly increases complexity in the blake2b implementation, which is by far the most used in practice. Will have to think about it.

Steppenwolfe65 commented on June 13, 2024

I'm still working on it; maybe when it's done you could take a look at the results and judge then (though I think the change will likely break the parallel output KATs). I should be done by next weekend (then on to parallel Skein?).
You can close this thread; I'll drop a link in another issue if that's better.
John

Steppenwolfe65 commented on June 13, 2024

Hi,
So I finally published my version: https://github.com/Steppenwolfe65/Blake2
Some of the highlights:
It works with or without intrinsics (both unrolled).
Parallel mode can use OpenMP, concurrency, or futures.
Each class (Blake2B512 and Blake2S256) can run either the sequential or the parallel version, and they pass your vector tests for each variant (2b, 2bp, 2s, and 2sp).
The public class members use standard types (no external state struct).
Speed is a mixed bag: it is 15-20% slower in sequential mode but 15-20% faster in parallel mode.
In sequential mode I take a hit for using vectors (but there are no leaks, no cleanup, and it's good RAII). In parallel mode I managed to reduce buffering to a single copy per input block for a big performance gain (yours calls copy over 16k times per MiB; mine can use a single copy regardless of input block size).
In parallel mode they align exactly up to a 512-byte block. I looked at it, and I think there may be a serious bug in blake2bp_update (and the sp variant). I could be wrong, but it looks like if a block size greater than 512 bytes is used, your loop is not being offset correctly and subsequently overlaps. In my test project I compare speeds with your C implementation, and in that copy of the update function I left notes on what I think I'm seeing and what the correct code might be.
I made some changes to the tree params structure: I split the key off into a MacParams and appropriated the first reserved flag as a ThreadCount. My version can use any even number of threads and has been tested up to 8, which on my box ran at over 3000 MB/s (and it should work fine no matter how many threads you set, so if 32 cores becomes the norm someday, we'll be ready ;).
You can do whatever you want with my version; if you want to add it to your distribution, that's fine, and I can even change the license to CC0 (I use MIT in my library).
Anyway, thanks for putting up the original source.
Let me know what you think.

Best
John Underhill

sneves commented on June 13, 2024

Can you elaborate on the serious bug in blake2bp_update? What is a block size in this context?

Steppenwolfe65 commented on June 13, 2024

Like I said, I'm not entirely sure, but it looks like the inner while loop is not getting the correct starting offset: it is only being offset by block size * iterator. The parallel block size should be the rounded input size divided by the processor count, and the input byte offset in the inner while loop should be the iterator times that size. Your offset looks like the iterator times the (128-byte) block size. This is why they align at a 512-byte input length but not beyond it. I could be mistaken though, and if I am, my apologies. A sketch of the partitioning I mean follows below.
By block size I mean the internal 128 bytes used by the compression function; the parallel block size is the chunk of bytes processed in the while loop by each thread.
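
A minimal, self-contained sketch of that chunked partitioning (illustrative only; not code from either library):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
    const uint64_t threads = 4;
    uint64_t inlen = 2048;

    inlen -= inlen % (threads * 128);          /* round down to a full parallel block */
    const uint64_t pblock = inlen / threads;   /* bytes per thread */

    /* each thread processes one contiguous slice of the input */
    for (uint64_t t = 0; t < threads; ++t)
      printf("thread %llu: offset %llu, length %llu\n",
             (unsigned long long)t,
             (unsigned long long)(t * pblock),
             (unsigned long long)pblock);
    return 0;
  }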

noloader commented on June 13, 2024

Parallel can use OpenMP, concurrent or future.

Out of curiosity, how well do OpenMP and BLAKE2 benchmark together? Or are you just using OpenMP for portable threads/parallel tasks?

I did some extensive benchmark tests with a C++ security library, OpenMP, and RW signatures a while back. When I OpenMP'd the base implementation, I lost about 1500 signatures per second.

Related: using Method 6 (the e and f tweaks) from Bernstein's "RSA signatures and Rabin–Williams signatures: the state of the art", I could increase throughput by 5000 signatures per second.

RW signatures and BLAKE2 are kind of apples and oranges, because RSA and RW signatures are heavy in private-exponent operations.

Steppenwolfe65 commented on June 13, 2024

@noloader I found concurrency (on Windows) to be a bit faster than OpenMP, and futures about the same, but I think that has a lot to do with the specific OS/architecture.
Really, I just wanted more options, to ensure a parallel mechanism is available on as many configurations as possible.
I'll look at the paper on the weekend; djb is always a good read ;)

sneves commented on June 13, 2024

OK, I assume you're speaking of this snippet here (removing the ifdefs for clarity):

  for( size_t id__ = 0; id__ < PARALLELISM_DEGREE; ++id__ )
  {
    uint64_t inlen__ = inlen;
    const uint8_t *in__ = ( const uint8_t * )in;
    in__ += id__ * BLAKE2B_BLOCKBYTES;

    while( inlen__ >= PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES )
    {
      blake2b_update( S->S[id__], in__, BLAKE2B_BLOCKBYTES );
      in__ += PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES;
      inlen__ -= PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES;
    }
  }

This performs 4 passes over the message. The first pass processes bytes 0-127, 512-639, etc. in state S->S[0]. The second pass processes bytes 128-255, 640-767, etc. in state S->S[1]. So the inner while loop starts at the correct offset for each lane indexed by id__. Is this what you meant?
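
To make the interleaving concrete, here is a small self-contained toy (not from the repository) that prints the byte ranges each lane visits for a 2048-byte input, using the same loop structure as the snippet above:

  #include <stdint.h>
  #include <stdio.h>

  #define PARALLELISM_DEGREE 4
  #define BLAKE2B_BLOCKBYTES 128

  int main(void)
  {
    const uint64_t inlen = 2048;

    for (size_t id__ = 0; id__ < PARALLELISM_DEGREE; ++id__) {
      uint64_t inlen__ = inlen;
      uint64_t off__ = id__ * BLAKE2B_BLOCKBYTES;   /* lane's first block */

      while (inlen__ >= PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES) {
        printf("lane %zu: bytes %llu..%llu\n", id__,
               (unsigned long long)off__,
               (unsigned long long)(off__ + BLAKE2B_BLOCKBYTES - 1));
        off__   += PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES;
        inlen__ -= PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES;
      }
    }
    return 0;
  }

Lane 0 prints 0..127, 512..639, 1024..1151, 1536..1663; lane 1 starts at 128, and so on. That is, 128-byte blocks are dealt round-robin across the four leaf states.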

This multi-pass strategy is clearly not the best for performance, but it was explicitly made this way as a simple fallback for OpenMP-less code. I have a more efficient single-threaded SIMD implementation of blake2{b,s}p in here.

Steppenwolfe65 commented on June 13, 2024

So.. if you pass in a length of, say, 2048 with 4 threads, the while loop starts at offsets 0, 512, 1024, 1536?
How can that be, given:
in__ += id__ * BLAKE2B_BLOCKBYTES;
It looks like the input pointer in__ is only being moved by the for iterator, id__ * 128 bytes, so 0, 128, 256, 384.
Sorry if I am misreading that; maybe I need to debug it again.
The way mine works is:
The input length is rounded down: inlen -= inlen % (threads * 128)
Then that is split: parallel block = inlen / threads
The starting offset for each thread is then thread number * parallel block.
Is this what your version is doing?

sneves commented on June 13, 2024

If you have a message of length 2048:

  • Pass 0 (id__ = 0) starts at offset 0, then in__ += PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES; makes this 512 = 4 * 128, then 1024, then 1536.
  • Pass 1 (id__ = 1) starts at offset 128, then in__ += PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES; makes this 640 = 128 + 4 * 128, then 1152, then 1664.

Steppenwolfe65 commented on June 13, 2024

I see.. this is why they don't match over 512 bytes. So sorry for that. I debugged the hell out of mine when they didn't align, and I was sure it was working properly, so I started looking for errors in yours (I should have known better; again, my apologies).
Still, getting rid of all the copies does make for a faster update mechanism, gotta give me that ;)
Thanks for the quick responses, and take a look at mine; I'd really be interested in what you think.

Steppenwolfe65 commented on June 13, 2024

OK, I aligned my parallel version to yours (the round-robin queuing is a better way, so...).
I verified this by running a test loop 1000 times with pseudo-random messages of random input sizes; now every variant passes.
Tomorrow is all about performance. I'm going to start by changing all the internal functions to pointers, which I think will speed it up a bit. Next week (when I'm done tinkering with the cpp version), I'll translate it to C# for my other library.

Steppenwolfe65 commented on June 13, 2024

OK, the C# version is up:
https://github.com/Steppenwolfe65/Blake2-NET
Just as with the C++ version: all variants, tested, MAC, built-in DRBG, etc.
The two versions have also been integrated into their libraries:
https://github.com/Steppenwolfe65/CEX-NET
https://github.com/Steppenwolfe65/CEX

Performance: it wasn't the code but the compiler. Visual Studio's abysmal support for intrinsics was the cause of the poor performance (though it seems to handle C code perfectly well). I installed the Intel compiler and added your library and mine as static libs for some speed tests, same settings. The best results came from forcing AVX (/arch:AVX). Speeds were within +/-10% on most runs on my dev box (just an off-the-shelf HP Pavilion, so no speed demon).
Average MB/s with 40 GB samples and a 10 MB input size:
2B: yours/mine 906/981; 2BP: yours/mine 1867/2245.
So the Intel compiler makes a huge difference, as did forcing AVX; together they gave me the best speed from both implementations.
I tested OpenMP against futures and concurrency (ppl.h) with various input sizes, and OpenMP actually comes out on top on my Win10/i7 configuration.
So, thanks again for this great project (I learned more about intrinsics from reading your code than from the manual).

sneves commented on June 13, 2024

Cool, this matches my experience; MSVC lags heavily behind in intrinsic codegen, though it has been a lot worse!

Steppenwolfe65 commented on June 13, 2024

Yeah, I couldn't figure out why it was slower; without the buffering it should have been at least a little bit faster. So, out of frustration, I dropped mine into C, and... a 25% performance gain. Now my 2b was about 10% faster, so it had to be the compiler.
It looks like their C compiler has better support, but none of the SSE/AVX macros work (very annoying), and I can't figure out why forcing AVX makes it faster (without it, my version really drags); I don't see any 256-bit API, so... strange.
The Intel compiler add-in isn't all that great either: I'm getting some strange errors in x86, and it only supports up to SSE3 (and they are Intel, why wouldn't their compiler support their own instruction sets!).
Well, I guess if programming wasn't so consistently challenging, it would lose its appeal.

noloader commented on June 13, 2024

I can't figure out why forcing AVX makes it faster,

AVX and AVX2 can memcpy and memmove 32 and 64 bytes at a time.

Under GCC you can usually get the benefits with GCC 4.8 and above at -O3, but you have to be careful about buffer alignment. If your buffers are not aligned, don't attempt to use -O3.

Under MSVC you need to use /arch:AVX or /arch:AVX2; also see /arch (x64) on MSDN. It's available in Visual Studio 2012 and above.
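
For illustration, a minimal sketch of keeping buffers 32-byte aligned so the compiler can emit aligned 256-bit moves (the ALIGN32 macro is ours, not from any of the libraries discussed):

  #include <stdint.h>
  #include <string.h>

  /* portable 32-byte alignment attribute */
  #if defined(_MSC_VER)
    #define ALIGN32 __declspec(align(32))
  #else
    #define ALIGN32 __attribute__((aligned(32)))
  #endif

  int main(void)
  {
    ALIGN32 uint8_t src[128] = {0};
    ALIGN32 uint8_t dst[128];
    /* with /arch:AVX[2] or -O3 -mavx2, eligible for 32-byte vector moves */
    memcpy(dst, src, sizeof src);
    return (int)dst[0];
  }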

Steppenwolfe65 commented on June 13, 2024

@noloader That would explain a lot. It runs faster with /arch:AVX2, even though AVX2 isn't supported by my CPU..
