Comments (12)
Hey hey heeey :) @dot-asm
We hit this issue a few days ago and have been thinking a lot about how to work around it. For us it's kind of important to support pre-compiled binaries, since most people prefer Docker workflows; but even those who don't generally have different build and execution environments.
Currently the only two options we have are to either blindly disable the ADX optimisations and make it slower for everybody, or to leave the optimisation in and hope for the best. The former seems to entail a significant hit on processing performance (measured in hours of extra processing time), whereas the latter is problematic in cloud environments where there's no direct control over the CPU allocation (or in cross-cloud deployments, where which CPU we get is even more of a lottery).
Would it be possible to pick up this idea of runtime detection of ADX instruction sets and run the faster or slower algorithm based on that? If you don't have the time to dive into the implementation yourself, would you be open to us sending in a PR to do it?
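For concreteness, a minimal sketch of what such a probe could look like (illustrative only, not blst's API; ADX is reported in CPUID leaf 7, sub-leaf 0, EBX bit 19):

```c
/* Hedged sketch of an ADX runtime probe. cpu_has_adx() is a
 * hypothetical name, not an existing blst symbol. */
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang x86 intrinsic header */

static int cpu_has_adx(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;               /* leaf 7 unavailable: very old CPU */
    return (ebx >> 19) & 1;     /* CPUID.(EAX=7,ECX=0):EBX.ADX[bit 19] */
}
#else
static int cpu_has_adx(void) { return 0; }  /* non-x86: no ADX */
#endif
```

The result never changes at runtime, so it only needs to be evaluated once at startup.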
Thanks, Peter
from blst.
I'm also in favor of auto-detection.
one can provide two binaries to choose from at installation time
This is what we do presently. Here are some pain points:
- It's a pain for novice users. Many users have no idea what "portability" is, nor do they have any idea that their CPU has an instruction set that varies from others. I'm very fond of the idea of people using blockchains without knowing how their binary communicates with their CPU.
- Cloud instances don't always make promises about which CPU you'll be running on. We use AWS t2 instances and it seems like there's a roughly 1/4 chance of getting a CPU that only works with BLST portable. Given that there are no promises about which CPU you'll get after a reboot, the practical outcome of this is either switching to c-class compute instances (more $$) or just choosing the portable version and observing tangibly increased CPU usage.
- Additional runtime on build/test infra which costs real-world money.
- Also additional complexity in that infrastructure.
Or install both binaries and provide a wrapper that would evaluate processor capability at application startup and execute one of the two
Is it really best to do this at the layer above BLST (i.e., the application that imports BLST)? As I understand it, using the faster CPU instructions is always better if you support them and always worse if you don't. In my experience of using this library for several months, the only thing the application gets from the ability to choose is the risk of making a bad decision. If it turns out that some user really wants to ignore runtime detection for whatever reason, then an "ignore runtime detection" runtime/compile flag could be used, ultimately placing us in a position of being "as fast as is safe" out-of-the-box.
Evaluating processor capability inside BLST means less code duplication across applications and it keeps the "which instruction set to use" concerns contained. If BLST finds benefit from some new CPU instruction then with self-contained runtime detection it can be implemented without breaking upstream apps (i.e., most of Eth2).
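For reference, the wrapper approach dot-asm describes can be very small on Linux. A hedged sketch (the binary names and install paths are illustrative, not part of blst):

```shell
#!/bin/sh
# Hypothetical launcher: pick the build whose instruction set the
# CPU actually supports, based on the kernel's reported CPU flags.
if grep -qw adx /proc/cpuinfo 2>/dev/null; then
    variant=adx
else
    variant=portable
fi
echo "selected: myapp-$variant"
# a real wrapper would end with: exec "/usr/libexec/myapp-$variant" "$@"
```

The objection above still applies: every application shipping blst would have to maintain its own copy of this.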
Or one can wrap one binary into another and have first binary extract second one in case it's needed. Kind of self-wrapper.
As above.
from blst.
I'll try to substantiate the feature request
Context
Performance
Current clients are squashing bottlenecks in sync, IO and hashing via various techniques, including caching, in-memory DBs and, in general, avoiding work altogether.
However, there is one thing we can't avoid: verifying signatures. In all clients, pairings are still one of the major bottlenecks, so 20% faster verification is significant for the whole application.
For example, when syncing, the bottlenecks are disk speed, network speed and cryptography; on a fast network with an SSD, only cryptography remains. 20% faster verification translates to a couple of hours' difference in syncing speed, and will translate to days as the blockchain grows.
Ergonomics
Clients already have a difficult time making blockchains safe and ergonomic to use. They try to abstract theoretical and low-level concepts away from the uninitiated.
A SIGILL breaks that abstraction.
Furthermore, consumers of clients, like staking pools or packagers like DappNode, might run on cloud services where CPUs are not stable and would be forced to use the portable build. A 20% performance difference would translate to significant cost savings, and runtime detection would save significant maintenance, documentation and debugging headaches. I expect all clients have a "What to do if you receive SIGILL" section in their documentation at the moment.
Analysis of runtime detection.
Using branches
I'd argue that even if CPU runtime detection is done via a branch at the lowest possible level, the cost is negligible. Case in point, both Gurvy and my own library, constantine, use runtime detection just before fp_mul and are faster than BLST on G1 doubling.
source eccbench 2021-01-29 (https://hackmd.io/@zkteam/eccbench#G1-mixed-additiondoubling2):
Code for detection:
- https://github.com/ConsenSys/gurvy/blob/e350ead/bls381/fp/element_ops_amd64.s#L232-L233
- https://github.com/mratsim/constantine/blob/c4a2dee/constantine/arithmetic/limbs_montgomery.nim#L423-L427 (This even checks both BMI2 and ADX instead of just ADX.)
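The shape of such a low-level branch is roughly the following (a hedged sketch: a one-word multiply stands in for a real 384-bit Montgomery multiplication, and all names are hypothetical, not Gurvy's or Constantine's actual code):

```c
#include <stdint.h>

/* Set once at startup by a CPUID probe; never changes afterwards. */
static int have_adx = 0;

static uint64_t mul_portable(uint64_t a, uint64_t b) { return a * b; }
static uint64_t mul_adx(uint64_t a, uint64_t b)      { return a * b; /* would use ADCX/ADOX */ }

/* The branch sits at the top of the hot routine; since have_adx is
 * constant after startup, it predicts perfectly after the first call. */
static uint64_t fp_mul(uint64_t a, uint64_t b)
{
    return have_adx ? mul_adx(a, b) : mul_portable(a, b);
}
```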
Hardware branch predictors have improved significantly since Haswell, so much so that for interpreter dispatch (unpredictable, with up to 255 opcodes!) it's better to use switch dispatching than a table of function pointers:
- Rohou et al., Branch Prediction and the Performance of Interpreters - Don't Trust Folklore, https://hal.inria.fr/hal-01100647/document
Furthermore, while it's true that hardware predictors are limited in space, there are still 4096 branch target buffer (BTB) entries (https://www.agner.org/optimize/microarchitecture.pdf, https://xania.org/201602/bpu-part-one, https://stackoverflow.com/questions/16513943/how-can-i-get-my-cpus-branch-target-bufferbtb-size). As the library is fully constant-time with almost no branches otherwise, the predictor is essentially unused by it. I'm not sure what the situation on ARM is, but BLST barely touches the branch predictor due to its constant-time nature.
Also, the branch misprediction can only happen on the first call to the multiplication and would cost at most ~10 cycles of pipeline flush (https://lemire.me/blog/2019/10/15/mispredicted-branches-can-multiply-your-running-times/); every subsequent call would be free. An fp_mul operation takes about 100 cycles with ADCX/ADOX or 120 cycles without, so across the 12000+ multiplications needed for a full pairing, a one-off 10-cycle misprediction amounts to well under 0.001% of the total.
Lastly it's likely that the branch misprediction, if it happens, is hidden in the cache miss to fetch the Fp elements data from memory (which is needed whatever branch is chosen).
Not using branches
Alternatively, if branches need to be avoided for other reasons, the technique from GCC function multiversioning can be used: a dispatch function pointer stored in global memory initially points to a resolver, which on the first call replaces itself with the proper function for the architecture. This is detailed in https://www.agner.org/optimize/optimizing_cpp.pdf
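A minimal sketch of that self-replacing dispatcher (all names are illustrative, the CPU probe is stubbed out, and this single-threaded version ignores the benign data race a real one would have to consider):

```c
#include <stdint.h>

static uint64_t mul_portable(uint64_t a, uint64_t b) { return a * b; }
static uint64_t mul_adx(uint64_t a, uint64_t b)      { return a * b; }

static int cpu_has_adx(void) { return 0; }  /* stand-in for a CPUID probe */

static uint64_t mul_resolve(uint64_t a, uint64_t b);

/* Starts out pointing at the resolver... */
static uint64_t (*fp_mul)(uint64_t, uint64_t) = mul_resolve;

/* ...which patches the pointer on the first call, then forwards it,
 * so every later call jumps straight to the chosen implementation. */
static uint64_t mul_resolve(uint64_t a, uint64_t b)
{
    fp_mul = cpu_has_adx() ? mul_adx : mul_portable;
    return fp_mul(a, b);
}
```

This is essentially what the dynamic linker's IFUNC mechanism automates; the hand-rolled version works even for static linking.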
Detecting at a high-level
We can indeed do CPU detection at a higher level; however, given the previous sections, I think this is extra code that saves a couple of nanoseconds on a full pairing and significantly complicates builds.
from blst.
Is there a compiler flag we can turn on for this or does it require a fork?
objcopy --redefine-sym works for static libraries - it's not hard, just messy - it's the kind of thing that is likely easier to maintain in a fork of blst that exists only for the purpose of naming symbols at the compilation stage.
from blst.
This is especially interesting because the CPUs are now widespread
Yes, and it's been argued that they are so widespread that non-ADX code is on the brink of irrelevance. At least close enough to justify compile-time detection, especially since more and more software is distributed as source code.
Cloudflare BN256 does that for example
And? :-) :-) :-) On a serious note though, performance is not all about benchmarking a tight loop with a short subroutine. What I'm trying to say is that a conditional branch in a short subroutine is not an optimal placement. It surely looks adequate in a controlled benchmark environment, but in real life, where the branch predictor is a limited resource, you risk misprediction penalties, and said penalties are comparable to the subroutine's execution time. It can even be affected by linking order and other link-time optimizations. A run-time switch is better placed as high in the call hierarchy as possible, say at the point-operation level. Or at the whole-library level, i.e. at compile time ;-)
from blst.
At the end of the day performance is not necessarily about doing things in the shortest possible time, but rather about meeting some kind of deadline [preferably with confidence]. In other words, if a user-facing or "desktop-class" application is not dependent on 100% availability of CPU time 24 hours a day, one doesn't have to feel bad about not using the latest instruction extension over a not-that-big margin, as long as the application actually does the job 100%. Well, one can still make the case that faster code would consume less power, though one has to recognize that the relation is far from linear, in the sense that 20% faster code won't yield 20% power savings, because a sizable portion of power is spent on merely being up... In other words, I for one am not actually convinced that an application distributed in binary form for "mass consumption" and not using ADCX would actually be inadequate...
Alternatively (and apart from the most obvious "two-shared-libraries-to-choose-from-at-run-time"), one can provide two binaries to choose from at installation time. Or install both binaries and provide a wrapper that would evaluate processor capability at application startup and execute one of the two. Or one can wrap one binary into another and have the first binary extract the second one in case it's needed. Kind of a self-wrapper. Things might get tricky if you run as a Windows service though, in which case installation-time evaluation would have to do. And if that is still unsuitable for some reason, let's talk about it; I'm sure I'll be able to offer a solution...
from blst.
I'll be replying in smaller "bites." (Unfortunately maybe not as fast as you'd wish/expect;-) Also, these won't be categorical "no"-s, but we have to recognize certain things.
Not using branches
...
(*CriticalFunction)
Rendered unacceptable for cryptographic applications in January 2018. On a related note, here is a homework assignment: compile all the different things with clang[++] -mretpoline and re-compare the results.
As the library is fully constant-time with almost no branches otherwise, it is completely unused.
Constant-time-ness doesn't mean that there are no conditional branches, only that they don't depend on secret data.
while it's true that hardware predictors are limited in space, there are still ...
Going back to the beginning of this remark, an excessive amount of branch prediction logic is actually bad for security. It doesn't matter how well the processor can predict branches in a controlled[!] environment; as long as one can indirectly affect the logic in some way, the only viable strategy for cryptographic code is to have as few conditional branches as possible, especially on critical paths. You might say "you're being over-paranoid." Yeah, but unfortunately such statements only work until somebody figures something out...
from blst.
I think most of us think of many of these functions at a higher level than the individual math operations - we sign and verify things - thus, one possible way forward would be to have each of the specialized implementations for a "family" of instruction sets exported, and to allow both of them to be built into a single "build" by creating several "namespaces": blsSignGeneric, blsSignAdx and so on. We can then, when consuming the library, choose between these in any way we find reasonable (detection, compile-time selection etc.).
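Consumer-side, that selection would be a one-liner. A hedged sketch (the namespaced entry points blsSignGeneric/blsSignAdx are hypothetical, taken from the proposal above, and stubbed here; blst exports a single set of symbols today):

```c
/* Stubs standing in for the proposed namespaced entry points;
 * the return value marks which variant ran. */
static int blsSignGeneric(void) { return 0; }
static int blsSignAdx(void)     { return 1; }

typedef int (*sign_fn)(void);

/* The application picks once, by whatever policy it prefers:
 * runtime CPU detection, a config flag, compile-time choice... */
static sign_fn select_sign(int cpu_has_adx)
{
    return cpu_has_adx ? blsSignAdx : blsSignGeneric;
}
```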
from blst.
I agree that having runtime detection would be hugely beneficial for consumers of this library, and would love to see this issue get resolved, the sooner the better.
If you don't have the time to dive into the implementation yourself, would you be open to us sending in a PR to do it?
?
from blst.
Currently the only two options we have is to either blindly disable the ADX optimisations and make it slower for everybody; or to leave the optimisation in and hope for the best.
The third option is to load two copies of blst (one for ADX and one without) and at runtime check the CPU and call the right copy. It's ugly, but we are going to try this approach in Grandine unless a proper solution is in place by then.
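On POSIX systems the two-copies approach boils down to dlopen. A hedged sketch (the shared-object names are illustrative, not files blst ships; link with -ldl on older glibc):

```c
#include <dlfcn.h>

/* Pick and load one of two hypothetical blst builds at run time.
 * Returns the library handle, or NULL if the chosen copy is missing. */
static void *load_blst(int cpu_has_adx)
{
    const char *lib = cpu_has_adx ? "libblst_adx.so" : "libblst_portable.so";
    return dlopen(lib, RTLD_NOW | RTLD_LOCAL);
}
```

Each function is then resolved via dlsym from the chosen handle; RTLD_LOCAL keeps the two copies' identical symbol names from colliding.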
from blst.
How easy is it to auto-namespace the library? Is there a compiler flag we can turn on for this or does it require a fork?
from blst.
Addressed on Jun 12, 2023:
And SHA256 runtime detection as well:
from blst.