Comments (21)
What do you mean by "you are processing those buffers as 2 * block size arrays of zero bytes"? I'm afraid I don't understand what you are asking.
from blake2.
Our implementations are very different: mine houses 2b and 2bp in the same class (the version is set via the constructor), and the parallel update function is also a departure from your design. I see now that you are caching for the blake2b final function, so it's not all zeroes as I first thought, but I still think the design might be improved:
In my version, if I have more than 4 full blocks, I compress, and store the remainder in the blake2bp_state buffer for the finalizer. The state buffers for each leaf are zero-sized and not used in parallel mode; I think this saves unnecessary copying, and buffer allocations are reduced.
Because you are using 2 classes, blake2b_update would need a modification to do this: a 'nocache' flag added to blake2b_state which bypasses the buffer stage when a full block is received, giving the blake2bp update function control over the caching mechanism. Make sense?
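A minimal sketch of that 'nocache' idea, assuming a toy state rather than the real blake2b_state (the names here are illustrative only, and the toy compresses eagerly, whereas real blake2b defers the final block for finalization):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCKBYTES 128

/* Toy state: just enough to show the caching logic. */
typedef struct {
    uint8_t buf[BLOCKBYTES];
    size_t  buflen;
    int     nocache;   /* hypothetical flag: parallel driver owns the tail */
    size_t  blocks;    /* counts blocks that reached "compress" */
} toy_state;

static void toy_compress(toy_state *S, const uint8_t *block) {
    (void)block;       /* a real implementation mixes the block into h[] */
    S->blocks++;
}

static void toy_update(toy_state *S, const uint8_t *in, size_t inlen) {
    if (S->nocache) {
        /* A blake2bp-style driver guarantees whole blocks, so full
         * blocks go straight to the compressor with no buffer copy. */
        while (inlen >= BLOCKBYTES) {
            toy_compress(S, in);
            in += BLOCKBYTES;
            inlen -= BLOCKBYTES;
        }
        return;        /* any remainder stays with the caller */
    }
    /* Sequential path: stage bytes through the internal buffer. */
    while (inlen > 0) {
        size_t take = BLOCKBYTES - S->buflen;
        if (take > inlen) take = inlen;
        memcpy(S->buf + S->buflen, in, take);
        S->buflen += take;
        in += take;
        inlen -= take;
        if (S->buflen == BLOCKBYTES) {
            toy_compress(S, S->buf);
            S->buflen = 0;
        }
    }
}
```

The point of the flag is that the parallel driver already knows its inputs are block-aligned, so the per-block memcpy in the sequential path is pure overhead there.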
I'm just writing this out now, and trying to make sense of it, so I may have misconstrued something.. when it is complete, I'll send you a link.
Our implementations are very different...
If you don't mind the question, which library or project?
The Blake2 implementations are not published yet, but they will be going into CEX when I am done (+/- a week).
OK, I get what you're saying now. You're right, this is an instance of buffer bloat. It is absolutely something that can be improved. I don't think I am going to address this, though, since it slightly increases complexity in the blake2b implementation, which is by far the most used in practice. Will have to think about it.
I'm still working on it; maybe when it's done, you could take a look at the results and then judge (though I think the change will likely break the parallel output KATs). I should be done by next weekend (then on to parallel Skein?).
You can close this thread, I'll drop a link in another issue if that's better.
John
Hi,
So I finally published my version: https://github.com/Steppenwolfe65/Blake2
Some of the highlights..
It works with or without intrinsics (both unrolled).
Parallel can use OpenMP, concurrent or future.
Each class (Blake2B512 and Blake2S256) can run either the sequential or parallel version, and they pass your vector tests for each variant (2b, 2bp, 2s and 2sp).
The public class members use standard types (no external state struct).
Speed is a mixed bag: it is 15-20% slower in sequential mode, but 15-20% faster in parallel mode.
In sequential mode I take a hit for using vectors (but no leaks, no cleanup, and it's good RAII); in parallel mode I managed to reduce buffering to just a single copy per input block for a big performance gain (yours calls copy 16k+ times per MiB; mine can use only a single copy regardless of input block size).
In parallel mode they align exactly up to a 512-byte block. I looked at it, and I think there may be a serious bug in blake2bp_update (and sp). I could be wrong, but it looks like if a block size greater than 512 bytes is used, your loop is not being offset correctly and subsequently overlaps. In my test project I compare speeds with your C implementation, and in that copy of the update function I left notes as to what I think I'm seeing, and what might be the correct code.
I made some changes to the tree params structure: split the key off into a MacParams, and appropriated the first reserved flag as a ThreadCount. My version can use any even number of threads, and has been tested up to 8, which on my box ran at over 3000 MB/s (and it should work fine no matter how many threads you set, so if 32 cores becomes the norm someday, we'll be ready ;).
You can do whatever you want with my version; if you want to add it to your distribution, that's fine. I can even change the license to CC0 (I use MIT in my library).
Anyways, thanks for putting up the original source.
Let me know what you think..
Best
John Underhill
Can you elaborate on the blake2bp_update serious bug? What is a block size in this context?
Like I said, I'm not entirely sure, but it looks like the inner while loop is not getting the correct starting offset: it is only being offset by block size * iterator. Parallel block size should be the rounded input size divided by the processor count, and the input byte offset in the inner while loop should be the iterator times that size. Yours looks like iterator times block size. This is why they align at 512 input length, but not beyond that. I could be mistaken though, and if I am, my apologies.
.. and by block size I mean the internal 128 bytes used by the compression function; parallel block size is the chunk of bytes processed by the while loop in each thread.
Parallel can use OpenMP, concurrent or future.
Out of curiosity, how well do OpenMP and BLAKE2 benchmark together? Or are you just using it for the portable threads/parallel tasks?
I did some extensive benchmark tests with a C++ security library, OpenMP and RW signatures a while back. When I OpenMP'd the base implementation, I lost about 1500 signatures per second.
Related: using Bernstein's "RSA signatures and Rabin–Williams signatures: the state of the art", Method 6 (e and f tweaks), I could increase throughput by 5000 signatures per second.
RW-signatures and BLAKE2 are kind of apples and oranges because RSA and RW signatures are heavy in private exponent operations.
@noloader I found concurrent (win) to be a bit faster than OpenMP, and future about the same, but I think that has a lot to do with the specific OS/architecture.
Really I just wanted more options to ensure a parallel mechanism was available to as many configurations as possible.
I'll look at the paper on the weekend, djb is always a good read ;)
OK, I assume you're speaking of this snippet here (removing the ifdefs for clarity):
for( size_t id__ = 0; id__ < PARALLELISM_DEGREE; ++id__ )
{
  uint64_t inlen__ = inlen;
  const uint8_t *in__ = ( const uint8_t * )in;
  in__ += id__ * BLAKE2B_BLOCKBYTES;

  while( inlen__ >= PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES )
  {
    blake2b_update( S->S[id__], in__, BLAKE2B_BLOCKBYTES );
    in__ += PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES;
    inlen__ -= PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES;
  }
}
This performs 4 passes over the message. The first pass processes bytes 0-127, 512-639, etc. in state S->S[0]. The second pass processes bytes 128-255, 640-767, etc. in state S->S[1]. Therefore the initial inner while loop starts at the correct offset for each lane indexed by id__. Was this what you meant?
This multi-pass strategy is clearly not the best for performance, but it was explicitly made this way as a simple fallback for OpenMP-less code. I have a more efficient single-threaded SIMD implementation of blake2{b,s}p in here.
So.. if you pass in a length of, say, 2048 with 4 threads, the while loop starts at offsets 0, 512, 1024, 1536?
How can that be if:
in__ += id__ * BLAKE2B_BLOCKBYTES;
Looks like the input pointer in__ is only being moved by the for iterator, id__ * 128 bytes, so 0, 128, 256, 384.
Sorry if I am misreading that; maybe I need to debug it again.
The way mine works is:
The input length is rounded: inlen -= inlen % (threads * 128)
Then that is split: parallel block = inlen / threads
The starting offset for each thread is then thread num * parallel block.
Is this what your version is doing?
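The chunked split described here can be sketched as follows (THREADS and the helper name are illustrative, not the actual CEX code):

```c
#include <stddef.h>

#define BLOCKBYTES 128
#define THREADS 4

/* Round inlen down to a multiple of THREADS * BLOCKBYTES, split the
 * rounded length into one contiguous chunk per thread, and record each
 * thread's starting offset. Returns the leftover byte count, which the
 * finalizer would buffer. */
static size_t split_chunks(size_t inlen, size_t offsets[THREADS], size_t *chunk)
{
    size_t remainder = inlen % (THREADS * BLOCKBYTES);
    *chunk = (inlen - remainder) / THREADS;  /* the "parallel block" */
    for (size_t t = 0; t < THREADS; ++t)
        offsets[t] = t * *chunk;             /* thread t starts here */
    return remainder;
}
```

For a 2048-byte input this yields a 512-byte chunk per thread at offsets 0, 512, 1024 and 1536, with no remainder.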
If you have a message of length 2048:
- Pass 0 (id__ = 0) starts at offset 0, then in__ += PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES; makes this 512 = 4 * 128, then 1024, then 1536.
- Pass 1 (id__ = 1) starts at offset 128, then in__ += PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES; makes this 640 = 128 + 4 * 128, then 1152, then 1664.
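That lane arithmetic can be checked with a standalone loop that just records the offsets the reference code visits (this mirrors the snippet quoted earlier; it is not library code):

```c
#include <stddef.h>

#define BLAKE2B_BLOCKBYTES 128
#define PARALLELISM_DEGREE 4

/* Record every input offset lane `id` consumes for a message of length
 * inlen, mirroring the reference blake2bp loop. Returns the block count. */
static size_t lane_offsets(size_t id, size_t inlen, size_t *out, size_t max)
{
    size_t n = 0;
    size_t off = id * BLAKE2B_BLOCKBYTES;   /* initial per-lane offset */
    while (inlen >= PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES) {
        if (n < max)
            out[n] = off;
        n++;
        off += PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES;  /* stride over the other lanes */
        inlen -= PARALLELISM_DEGREE * BLAKE2B_BLOCKBYTES;
    }
    return n;
}
```

For a 2048-byte message, lane 0 visits 0, 512, 1024, 1536 and lane 1 visits 128, 640, 1152, 1664.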
I see.. this is why they don't match over 512 bytes. So sorry for that. I debugged the hell out of mine when they didn't align, and I was sure it was working properly, so I started looking for errors in yours (should have known better.. again, my apologies).
Still, getting rid of all the copies does make a faster update mechanism, gotta give me that ;)
Thanks for the quick responses.. and take a look at mine, I'd really be interested in what you think..
Ok.. I aligned my parallel version to yours (the round-robin queuing is a better way, so..).
I proved this by running a test with random input sizes of p-rand messages in a loop 1000 times; now every variant passes.
Tomorrow is all about performance: I'm going to start by changing all the internal functions to pointers; I think that will speed it up a bit. Next week (when I'm done tinkering with the cpp version), I'll translate it to C# for my other library..
Ok, C# version is up:
https://github.com/Steppenwolfe65/Blake2-NET
Just as with the C++ version: all variants tested, MAC, built-in DRBG, etc.
The two versions have also been integrated into their libraries:
https://github.com/Steppenwolfe65/CEX-NET
https://github.com/Steppenwolfe65/CEX
Performance.. it wasn't the code but the compiler. Visual Studio's abysmal support for intrinsics was the cause of the poor performance (though it seems to handle C code perfectly well). I installed the Intel compiler, and added your library and mine as static libs for some speed tests, same settings. Best results came from forcing AVX (/arch:AVX). Speeds were within +/-10% on most runs on my dev box (just an off-the-shelf HP Pavilion, so no speed demon).
avg. MB/s with 40 GB samples, 10 MB input size:
2B: yours/mine 906/981; 2BP: 1867/2245.
So, the Intel compiler makes a huge difference, and so did forcing AVX; this gave me the best speed from both implementations.
Tested OpenMP against future and concurrency (ppl.h) with various input sizes, and OpenMP actually comes out on top with my Win10/i7 configuration..
So, thanks again for this great project (I learned more about intrinsics reading your code than from the manual).
Cool, this matches my experience; MSVC lags heavily behind in intrinsic codegen, though it has been a lot worse!
Yeah, I couldn't figure out why it was slower; without the buffering, it should have been at least a little bit faster.. so out of frustration, I dropped mine into C, and.. 25% performance gain. Now my 2b was about 10% faster, so it had to be the compiler.
Looks like their C compiler has more support, but none of the SSE/AVX macros work (very annoying), and I can't figure out why forcing AVX makes it faster (without that, my version really drags); I don't see any 256-bit API, so.. strange.
The Intel compiler add-in isn't all that great either: I'm getting some strange errors in x86, and it only supports up to SSE3 (and they are Intel, why wouldn't their compiler support their own instruction sets!).
Well, I guess if programming wasn't so consistently challenging, it would lose its appeal..
I can't figure out why forcing avx makes it faster,
AVX and AVX2 can memcpy and memmove 32 and 64 bytes at a time.
Under GCC, you can usually get the benefits with GCC 4.8 and above at -O3, but you have to be careful of buffer alignments. If your buffers are not aligned, then don't attempt to use -O3.
Under MSVC you need to use /arch:AVX or /arch:AVX2. Also see /arch (x64) on MSDN. It's available for Visual Studio 2012 and above.
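The alignment caveat is cheap to check at runtime; a small sketch (the names here are illustrative, standard C11):

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is aligned to `align` bytes; align must be a power of two. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* A buffer safe for 32-byte (AVX) aligned loads and stores. */
static alignas(32) uint8_t avx_buf[256];
```

Checking pointers this way before taking an aligned-load fast path avoids the crashes and slow paths that misaligned buffers cause under aggressive vectorization.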
@noloader That would explain a lot; it runs faster with /arch:AVX2, even though it's not supported by my CPU..