oasisprotocol / curve25519-voi

High-performance Curve25519/ristretto255 for Go.

License: BSD 3-Clause "New" or "Revised" License

Go 93.64% Assembly 6.34% Shell 0.02%

go golang cryptography-algorithms curve25519 ristretto255 x25519 ed25519 sr25519 batch-verification cryptography

curve25519-voi's Introduction

curve25519-voi

It was only machinery. I’m surprised it’s lasted as long as it has, frankly. There must still be some residual damage-repair capability. We Demarchists build for posterity, you know.

This package aims to provide a modern X25519/Ed25519/sr25519 implementation for Go, mostly derived from curve25519-dalek. The primary motivation is to hopefully provide a worthy alternative to the current state of available Go implementations, which is best described as "a gigantic mess of ref10 and donna ports". The irony of the previous statement in the light of curve25519-dalek's lineage does not escape this author.

WARNING

DO NOT BOTHER THE curve25519-dalek DEVELOPERS ABOUT THIS PACKAGE

Package structure

curve: A mid-level API in the spirit of curve25519-dalek.
primitives/x25519: An X25519 implementation like x/crypto/curve25519.
primitives/ed25519: An Ed25519 implementation like crypto/ed25519.
primitives/ed25519/extra/ecvrf: An implementation of the "Verifiable Random Functions" draft (v10, v13).
primitives/sr25519: An sr25519 implementation like https://github.com/w3f/schnorrkel.
primitives/merlin: A Merlin transcript implementation.
primitives/h2c: An implementation of the "Hashing to Elliptic Curves" draft (v16).

Ed25519 verification semantics

At the time of this writing, Ed25519 signature verification behavior varies based on the implementation. The implementation provided by this package aims to provide a sensible default, and to support compatibility with other implementations if required.

The default verification semantics are as follows, using the terminology from ed25519-speccheck:

Both iterative and batch verification are cofactored.
Small order A is rejected.
Small order R is accepted.
Non-canonical A is rejected.
Non-canonical R is rejected.
A signature's scalar component must be in canonical form (S < L).
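The canonical scalar rule (S < L) can be illustrated with a minimal standard-library sketch. This is not this package's code: the helper name `scalarIsCanonical` and the use of `math/big` are assumptions made for illustration, and the comparison is variable-time, which is tolerable only because signatures are public values.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
	"math/big"
)

// groupOrder is L = 2^252 + 27742317777372353535851937790883648493,
// the order of the Ed25519 prime-order subgroup.
var groupOrder = func() *big.Int {
	l, _ := new(big.Int).SetString("27742317777372353535851937790883648493", 10)
	return l.Add(l, new(big.Int).Lsh(big.NewInt(1), 252))
}()

// scalarIsCanonical (hypothetical helper) reports whether the S component
// of a 64-byte signature (R || S, with S encoded little-endian) is < L.
func scalarIsCanonical(sig []byte) bool {
	if len(sig) != ed25519.SignatureSize {
		return false
	}
	// big.Int expects big-endian input, so reverse the 32 bytes of S.
	var be [32]byte
	for i, b := range sig[32:] {
		be[31-i] = b
	}
	return new(big.Int).SetBytes(be[:]).Cmp(groupOrder) < 0
}

func main() {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	sig := ed25519.Sign(priv, []byte("hello"))
	fmt.Println("honest signature canonical:", scalarIsCanonical(sig))

	// Overwrite S with 2^256 - 1, which is far above L.
	bad := append([]byte(nil), sig...)
	for i := 32; i < 64; i++ {
		bad[i] = 0xff
	}
	fmt.Println("tampered signature canonical:", scalarIsCanonical(bad))
}
```

A production check would compare the bytes directly rather than allocate a `big.Int`; the sketch only shows which half of the signature the rule applies to.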

Pre-defined configuration presets for compatibility with the Go standard library, FIPS 186-5/RFC 8032, and ZIP-215 are provided for convenience.

For more details on this general problem, see Taming the many EdDSAs.

Notes

The curve25519-dalek crate makes use of "modern" programming language features not available in Go. This package's mid-level API attempts to provide something usable by developers familiar with idiomatic Go, and thus has more sharp edges, but in all honesty, developers that opt to use the mid-level API in theory already know what they are getting into. Stability of the mid-level API is currently NOT guaranteed.

The curve25519-dalek crate has a series of nice vectorized backends written using SIMD intrinsics. While Go has no SIMD intrinsics, and the assembly dialect is anything but nice, the AVX2 backend is also present in this implementation.

Memory sanitization while maintaining reasonable performance in Go is a hard/unsolved problem, and this package makes no attempt to do so. Anyone that mentions memguard will be asked to re-read the previous sentence, and then be mercilessly mocked. It is worth noting that the standard library does not do this appropriately either.

The minimum required Go version for this package follows the Go support policy: the latest release and the one before it. Attempting to subvert the toolchain checks will result in reduced performance or insecurity on certain platforms.

This package uses build tags to enable the 32-bit or 64-bit backend respectively. Note that for 64-bit targets, this primarily depends on whether the SSA code (src/cmd/compile/internal/ssagen/ssa.go) has the appropriate special cases to make math/bits.Mul64/math/bits.Add64 perform well.

64-bit: amd64, arm64, ppc64le, ppc64, s390x
32-bit: 386, arm, mips, mipsle, wasm, mips64, mips64le, riscv64, loong64
Unsupported: Everything else.
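To make the dependency on math/bits concrete, here is a minimal sketch of the kind of 128-bit multiply-accumulate a 64-bit backend leans on. The `uint128` type and `mulAdd` helper are illustrative names, not this package's internals. On targets where the compiler treats `bits.Mul64` and `bits.Add64` as intrinsics this compiles down to a handful of instructions; elsewhere it falls back to the slower generic Go code.

```go
package main

import (
	"fmt"
	"math/bits"
)

// uint128 is a hypothetical 128-bit accumulator split into two 64-bit halves.
type uint128 struct{ hi, lo uint64 }

// mulAdd returns acc + a*b, wrapping modulo 2^128.
func mulAdd(acc uint128, a, b uint64) uint128 {
	hi, lo := bits.Mul64(a, b)             // full 64x64 -> 128-bit product
	lo, carry := bits.Add64(lo, acc.lo, 0) // add the low halves, capture the carry
	hi, _ = bits.Add64(hi, acc.hi, carry)  // propagate the carry into the high half
	return uint128{hi: hi, lo: lo}
}

func main() {
	// 2^63 * 4 = 2^65, which does not fit in 64 bits.
	acc := mulAdd(uint128{}, 1<<63, 4)
	fmt.Printf("hi=%#x lo=%#x\n", acc.hi, acc.lo)
}
```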

WARNING: As a concession to the target's growing popularity, the wasm target is supported using the 32-bit backend. However, the WebAssembly specification does not mandate that any opcodes are constant time, making it difficult to provide assurances related to timing side-channels.

The lack of a generic "just use 32-bit" fallback can be blamed on the Go developers rejecting adding build tags for bit-width.

The lattice reduction implementation currently only has a 64-bit version, and thus it will be used on all platforms. Note that while Go 1.12 had a vartime implementation of math/bits routines, that version of the compiler is long unsupported, and the lattice reduction is verification only so the lack of a timing guarantee has no security impact.

Special credits

curve25519-voi would not exist if it were not for the amazing work done by various other projects. Any bugs in curve25519-voi are the fault of the curve25519-voi developers alone.

The majority of curve25519-voi is derived from curve25519-dalek.
The Ed25519 batch verification started off as a port of the implementation present in ed25519-dalek, but was later switched to be based off ed25519consensus.
The ABGLSV-Pornin multiplication implementation is derived from a curve25519-dalek pull request by Jack Grigg (@str4d), with additional inspiration taken from Thomas Pornin's paper and curve9767 implementation.
The assembly optimized field element multiplications were taken (with minor modifications) from George Tankersley's ristretto255 package.
The Elligator 2 mapping was taken from Loup Vaillant's Monocypher package.

curve25519-voi's Issues

perf: Consider using saturated 64-bit limbs for the field arithmetic

During the external review it was pointed out that the field multiply, square, and inverse would gain some performance if the implementation used 64-bit saturated limbs. Since the relevant math/bits intrinsics expose the equivalent of the carry and borrow flag, this should be possible to implement in a portable manner.

A cursory examination of the paper this would be based on suggests that the gains in the portable case would be < 5%, with more substantial gains if BMI2 was used, so this is low priority for now, as any system with BMI2 will also have AVX2.
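As a rough illustration of what "saturated limbs" means in portable Go, the sketch below adds and subtracts 256-bit values stored as four full 64-bit words, threading the carry and borrow through `bits.Add64` and `bits.Sub64`. The `limbs4` type and function names are assumptions for this example; a real field implementation would also need the reduction modulo 2^255 - 19, which is omitted here.

```go
package main

import (
	"fmt"
	"math/bits"
)

// limbs4 is a hypothetical 256-bit integer in four saturated
// (full-width) 64-bit limbs, least significant limb first.
type limbs4 [4]uint64

// add256 returns a + b mod 2^256 together with the carry out.
func add256(a, b limbs4) (sum limbs4, carry uint64) {
	for i := range a {
		sum[i], carry = bits.Add64(a[i], b[i], carry)
	}
	return sum, carry
}

// sub256 returns a - b mod 2^256 together with the borrow out.
func sub256(a, b limbs4) (diff limbs4, borrow uint64) {
	for i := range a {
		diff[i], borrow = bits.Sub64(a[i], b[i], borrow)
	}
	return diff, borrow
}

func main() {
	max := ^uint64(0)
	sum, carry := add256(limbs4{max, max, max, max}, limbs4{1})
	fmt.Println(sum, carry)

	diff, borrow := sub256(limbs4{}, limbs4{1})
	fmt.Println(diff, borrow)
}
```

The unsaturated representation (for example 51-bit limbs) avoids these carry chains at the cost of an extra limb, which is why the expected gain depends so heavily on how well the target handles add-with-carry.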

Make the intermediate API more pleasant to use

While functional, the intermediate API is somewhat cumbersome to use. It would be much nicer when Go gets exotic language features like operator overloading, but in the meanwhile, it could probably be improved.

I seem to recall filippo/gtank's packages providing something that's nicer, so stealing ideas from there might be worth the time investment.

../github.com/oasisprotocol/curve25519-voi/internal/toolchain/constraints.go:48:6: undefined: __SOFTWARE_REQUIRES_GO_VERSION_1_18__

../github.com/oasisprotocol/curve25519-voi/internal/toolchain/constraints.go:48:6: undefined: __SOFTWARE_REQUIRES_GO_VERSION_1_18__

perf: Replace the STROBE implementation

The STROBE implementation pulled in is functional but inefficient/allocation heavy. While I'm not thrilled at having to pull in a Keccak implementation either, replacing this should improve sr25519 performance, and make it more heap friendly.

perf: Improve the lattice reduction performance

The lattice reduction implementation could likely be optimized further in the following ways:

Add a 32-bit backend (Low priority, the current code provides adequate performance)
Take a further page out of curve9767's book and inline absolutely everything.
Assembly?

As a datapoint, in a branch a partial implementation (panics on edge cases) of a fully inlined version of the lattice reduction was faster, but I'm not sure if the unreadable mess I ended up with is worth a ~2% improvement in signature verification performance.

Add CI with tests

This should have CI that runs tests. Once tests are automated, they should be run with both the force32bit and force64bit build tags set (along with a way to disable assembly if I ever get around to adding AVX2 support).

ci: Figure out how to setup arm64 and arm32 builds/tests

While there are a number of build tags to try to force the no-assembly and 32-bit backends to be built and exercised, this really should build and test with an actual 32-bit toolchain/system so that stupid errors can be caught.

As we don't care about performance, QEMU or something should be sufficient here.

Go 1.19.x tracking

New architecture: loong64 (lacking intrinsics)

As of golang/go@969f48a, riscv64 should use the 64-bit path, but this change is not in released versions of the toolchain yet.

Improve performance on 32 bit architectures

The curve/scalar and internal/field packages currently assume a 64 bit architecture. While the code will work on 32 bit architectures (correctly, as the underlying 128 bit integer intrinsics are guaranteed to be constant time), this is rather sub-optimal for performance.

The abstractions are in place to allow adding such a thing, but this is low priority since 32 bit architectures are on their way out.

feature: Consider using avo for the assembly

Right now the assembly is hand-written. We could take a page from circl and Filippo's edwards25519 package and use avo as our macro assembler since writing Go that generates assembly might be more maintainable.

See: https://github.com/mmcloughlin/avo

housekeeping: Go 1.17 related cleanup

Go 1.17 was released, so do some housekeeping.

Drop support for Go 1.15
Re-benchmark. We've gotten faster as well (yay)

perf: Consider faster batch forgery identification

Right now, the strategy used for batch verification with forgery identification is to fall back to serial verification. Per "Faster batch forgery identification", it is possible to reuse intermediaries from the multiscalar multiply when identifying the forgeries.

Complications:

The experimental results with Straus have a cutoff of n/3 forgeries, beyond which serial verification is faster.
We use Pippenger for large batches.

api: Consider exposing the merlin implementation

The internal/merlin implementation may be useful to other people, so consider cleaning some of it up, and moving it under primitives, after #63 is merged. The benefits it has over the existing common implementation are:

Support for a transcript based RNG construct.
It creates much less garbage due to using a "better" STROBE implementation.

Note: I am reluctant/opposed to exposing the STROBE internal package because it explicitly only implements a subset that is useful for implementing merlin.

cleanup: Drop support for old versions of Go

We should track the officially supported toolchain/standard library version, and aggressively drop support for versions that aren't supported by the Go maintainers. As of the current policy, this is "each version is supported until there are two newer major releases".

The main benefit is that this lets us use marginally less complicated build tags, since we can ignore special cases for versions that are no longer relevant.

perf: Certain operations involve more heap allocs compared to the standard library

Compared to the standard library there are extra heap allocs as follows:

ScalarBasepointMultiply (1x, due to interface unboxing) (Fixed in #36)
Signing (1x, outlining the function doesn't make the signature get allocated on the stack)

While this isn't ideal, it is something that is probably ok for now.
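Counts like these can be reproduced with `testing.AllocsPerRun` from the standard library. The sketch below is only a measurement harness under assumed names (`allocsPerCall`, `allocating`, `nonAllocating`); it measures `crypto/ed25519` signing as a stand-in, and the same pattern would apply to any other signing function. The printed number depends on the Go version, so none is quoted here.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
	"testing"
)

// sink keeps results reachable so the compiler cannot elide the work.
var sink []byte

// allocsPerCall (hypothetical helper) reports the average number of
// heap allocations performed by one call to f.
func allocsPerCall(f func()) float64 {
	return testing.AllocsPerRun(100, f)
}

// allocating forces exactly one heap allocation per call,
// because the slice escapes through the package-level sink.
func allocating() { sink = make([]byte, 64) }

// nonAllocating only touches a stack-resident array.
func nonAllocating() {
	var buf [64]byte
	buf[0] = 1
	_ = buf
}

func main() {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	msg := []byte("allocation test")

	fmt.Println("baseline (allocating):    ", allocsPerCall(allocating))
	fmt.Println("baseline (non-allocating):", allocsPerCall(nonAllocating))
	fmt.Println("crypto/ed25519 Sign:      ", allocsPerCall(func() {
		sink = ed25519.Sign(priv, msg)
	}))
}
```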

perf: provide safepoints for pre-emptive scheduling

I'd like to raise this as more of a discussion, though it seems like there may be a way to mark safepoints in assembly code that the Go scheduler may pre-empt on.

@aclements (comment): ... with some extra annotations in the assembly to indicate registers containing pointers it will become preemptible without any extra work or run-time overhead to reach an explicit safe-point.

https://go.googlesource.com/proposal/+/master/design/24543/safe-points-everywhere.md: By default, the runtime cannot safely preempt assembly code since it won't know what registers contain pointers. As a follow-on to the work on safe-points everywhere, we should audit assembly in the standard library for non-preemptible loops and annotate them with register maps. In most cases this should be trivial since most assembly never constructs a pointer that isn't shadowed by an argument, so it can simply claim there are no pointers in registers. We should also document in the Go assembly guide how to do this for user code.

Given that crypto code in general is CPU-bound, and that there isn't an option to i.e. offload the execution of such code into a separate thread pool away from I/O-bound tasks, providing safepoints would improve the overall latency of all tasks in Go programs that rely on crypto.

Where this would prove valuable is that I've been working on a Go program that makes heavy use of network I/O and ed25519 batch verification - the biggest bottleneck of the program right now is that noticeably, almost all of the goroutines in the program that are I/O-bound are being starved and driven nearly to a complete halt whenever there is any sort of crypto code being executed in a separate goroutine. I noticed this behavior by recording traces of the program and opening them up in Go's trace viewer.

The Zig/C equivalent of the same program, which makes use of dedicated thread pools for I/O-bound code and CPU-bound code, has significantly lower latencies and scheduler contention in comparison.

Now, I'd like this to be more of a discussion because I have two open questions:

The possibility of marking safepoints in assembly code is only documented in the Go proposal, though I haven't seen any examples of assembly code doing it in i.e. the Go standard library so far. Is the ability to mark safepoints actually possible right now in the latest version of Go?
Are there any security implications to encouraging pre-emption for such crypto-related code?

Public key encryption, private key decryption

Hello, can public key encryption and private key decryption be implemented?

enhancement: Add paranoid ed25519 signing

There are several mitigations that can be put in place to harden the signing process against certain fault injection and side-channel attacks. The most promising of these is to add randomness when deriving r, a la XEdDSA/an expired IETF draft.

Since this is backwards-compatible, there is no reason not to provide this as an option.

Note: sr25519 already does this, so no changes are needed there.

tests: Increase test coverage

While a lot of the code is well exercised by the test cases, there are a number of gaps that should also be covered if only to prevent regressions when further development occurs. In particular

The ristretto code could use more test cases, particularly on functions that are just wrappers around the corresponding Ed25519 routines.
A number of "user-friendliness" routines in the curve package could use more coverage, such as the constructors and MarshalBinary/UnmarshalBinary routines.

As of right now, none of these routines are used, and they are likely correct, but it won't hurt to make sure they stay correct.

perf: primitives/x25519: Rethink calling x/crypto/curve25519

Currently on amd64 systems, ScalarMult will call x/crypto/curve25519's implementation because it WAS entirely optimized assembly lifted from SUPERCOP. It appears that fairly recently the backend was switched to use Filippo's field arithmetic, leading to a massive performance regression.

Since the differential is now ~5% it may not be worth falling back anymore (even if it leads to x25519 being slightly slower), though this depends on the exact version of the x/crypto import that people use.

While thinking about this, it may be worth implementing a more optimized x25519 scalar multiply, though this will require using ADCX/ADOX/MULX (a la https://eprint.iacr.org/2017/264.pdf) to really be substantial.

Browser wasm compatibility request

I can't seem to find another library in Go that supports ZIP 215 and is also wasm compatible.

cleanup: Use the 1.17 cast from slice to array syntax

Once the minimum compiler version is 1.17, use the slice to array pointer conversion syntax instead of doing an allocate + copy.
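For reference, a minimal sketch contrasting the two approaches. The helper names and the `keySize` constant are illustrative only. Note that the conversion yields a pointer that aliases the slice's backing storage rather than a copy, and that it panics at run time if the slice is shorter than the array.

```go
package main

import "fmt"

const keySize = 32

// copyToArray is the pre-1.17 pattern: declare an array, then copy into it.
func copyToArray(b []byte) [keySize]byte {
	var out [keySize]byte
	copy(out[:], b)
	return out
}

// viewAsArray uses the Go 1.17 slice to array pointer conversion.
// The result shares memory with b; it panics if len(b) < keySize.
func viewAsArray(b []byte) *[keySize]byte {
	return (*[keySize]byte)(b)
}

func main() {
	raw := make([]byte, keySize)
	raw[0] = 0x42

	copied := copyToArray(raw)
	view := viewAsArray(raw)

	// Writes through the view are visible in the slice; the copy is unaffected.
	view[0] = 0x99
	fmt.Printf("raw[0]=%#x copied[0]=%#x view[0]=%#x\n", raw[0], copied[0], view[0])
}
```

Because the result aliases the input, this is only a drop-in replacement where the callee does not retain or mutate the array independently of the caller's slice.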

ed25519: optimize reuse of instantiated BatchVerifier

Hello!

Just a small feature request - I'm verifying batches of ~10,000 signatures at the moment and it looks like there's quite a bit of allocation overhead from BatchVerifier.

A quick improvement I believe would be to include a Reset() method that would re-slice the internal entries []entry of a BatchVerifier into a slice of length zero. An alternative would be to also change up the BatchVerifier API to accept slices and not allocate on its own so that allocations can be managed by the caller.
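The proposed Reset() could look roughly like the sketch below. The `batchVerifier` and `entry` types here are stand-ins invented for illustration and do not mirror the package's actual internals; the point is only that re-slicing to length zero keeps the backing array, so later Add calls do not reallocate until the previous capacity is exceeded.

```go
package main

import "fmt"

// entry is a hypothetical batch entry; the real type holds more state.
type entry struct {
	publicKey, message, signature []byte
}

// batchVerifier is a hypothetical stand-in for a batch verifier.
type batchVerifier struct {
	entries []entry
}

// Add appends one (public key, message, signature) triple to the batch.
func (v *batchVerifier) Add(publicKey, message, signature []byte) {
	v.entries = append(v.entries, entry{publicKey, message, signature})
}

// Reset empties the batch while keeping the allocated capacity.
// Entries are zeroed first so the buffers they reference can be collected.
func (v *batchVerifier) Reset() {
	for i := range v.entries {
		v.entries[i] = entry{}
	}
	v.entries = v.entries[:0]
}

func main() {
	var v batchVerifier
	for i := 0; i < 1000; i++ {
		v.Add(make([]byte, 32), []byte("msg"), make([]byte, 64))
	}
	before := cap(v.entries)

	v.Reset()
	fmt.Println(len(v.entries), cap(v.entries) == before)
}
```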

Any thoughts on this?

I also have a quick question: is there any reason entry.signature is a byte slice rather than a fixed 32 byte array?

perf: Support the dalek AVX-512 IFMA backend

curve25519-dalek has some rather nice vector backends for the group and underlying field operations. It would be nice if those could be adapted for use in this package.

Orthogonal to this, voi already uses vector operations to accelerate the basepoint multiply.

perf: Think more about defaulting to ExpandedPublicKeys in the ed25519 batch verifier

The ed25519 BatchVerifier works under the assumption that using ExpandedPublicKey internally in entries is the right thing to do.

The choice to do this is to optimize performance for the cases where:

The batch size is not too large (as only the multiscalar multiplication with Straus' method leverages the precomputed table form of points).
The same public keys are used repeatedly across batches (via AddExpandedWithOptions).
It is actually possible for a batch to fail, and exact information on which entries in the batch failed is desired.

It has been brought to my attention that there are use cases that involve extremely large batches (~10k), where (presumably) there is a lot of churn in the public keys used per batch (See #68). If there is a moderately clean way to improve performance and memory allocation behavior in this case, improvements should be made.

feature: Add support for the ristretto based primitives

Since the required underlying group operations are implemented, it would be relatively easy to support things like sr25519. Due to the popularity in certain spaces, this library should do so at some point.

Obtain external review by qualified reviewers

While this passes tests, before anyone should even consider using it, this package really needs external review for correctness.

feature: Consider adding a way to enforce `order L` A/R when verifying Ed25519 signatures

I poked at this in a branch, before not doing it because it is extremely expensive and only applicable to people doing something rather exotic, but there's no reason why this couldn't optionally check for a non-zero torsion component of A and/or R when doing signature verification.

But really, at that point "you should be using Ristretto, or a different curve altogether".

doc: Provide helpful examples

The test cases for the various primitives are a good place to start, but there should be documentation on how to use the primitive implementations to accomplish common tasks, particularly in combination with the configurable parts of the verification behavior.

perf: Add sr25519 precomputation support

At some point the sr25519 code should also have precomputation support for feature parity with ed25519.

The gains here are not going to be as big because we do not need to maintain compatibility with the standard library, and thus will only do deserialization/decompression once.

cleanup: Fix assembly `go vet` issues

# github.com/oasisprotocol/curve25519-voi/curve
curve/edwards_vector_amd64.s:195:1: [amd64] vecConditionalSelect_AVX2: wrong argument size 32; expected $...-28
curve/edwards_vector_amd64.s:201:1: [amd64] vecConditionalSelect_AVX2: invalid VPBROADCASTD of mask+24(FP); uint32 is 4-byte value
curve/window_amd64.s:9:1: [amd64] lookupAffineNiels: wrong argument size 24; expected $...-17
curve/window_amd64.s:135:1: [amd64] lookupCached: wrong argument size 24; expected $...-17

As far as I can tell:

~~Argument sizes are an avo issue since it's autogenerating that from the function signature.~~
The VPBROADCASTD issue is a go vet bug, because it assumes opcodes use the messed up Go dialect suffixes where D means 8-bytes.

While all of these are annoying, I think it's ok to ignore for now, especially considering that the VPBROADCASTD issue is just a false positive, and I'd rather not edit avo output, so resolving this is blocked on other people fixing their stuff.

Platform support

This serves as a tracking issue to document which targets are supported, along with ancillary information.

GOARCH	Supported	Backend	Notes
amd64	✔️	64-bit + asm	Main development platform
arm64	✔️	64-bit
ppc64	✔️	64-bit
ppc64le	✔️	64-bit
s390x	✔️	64-bit
386	✔️	32-bit
arm	✔️	32-bit
mips	✔️	32-bit
mipsle	✔️	32-bit
mips64	✔️	32-bit	`bits.Add64`/`bits.Mul64` are slow
mips64le	✔️	32-bit	`bits.Add64`/`bits.Mul64` are slow
riscv64	✔️	32-bit	`bits.Add64` is slow in released versions
loong64	✔️	32-bit	`bits.Add64`/`bits.Mul64` are slow
wasm	✔️	32-bit	WebAssembly does not guarantee constant time integer operations

It may be the case that certain 64-bit platforms that currently use the 32-bit code path will perform better with the 64-bit code path, despite the lack of compiler optimization for the relevant math/bits intrinsics. As I do not have access to the various targets, benchmark results showing this will be welcome.

WASM is supported now due to the growing popularity of the target; however, WebAssembly does not mandate or guarantee instruction timings. The standard techniques used to mitigate timing side-channels work under the assumption that certain things are constant-time with regard to the inputs, which may not be true.

internal/field: Think about using fiat-crypto

fiat-crypto has neat formally verified/auto-generated field arithmetic that can be used to replace the fiddly bits of the internal/field package. It would be nice to be able to use it, since it would let me feel less scared of exposing the package, and it's a nice checkbox to have.

Actually doing the switch requires the fiat-crypto performance to not be horrifically bad, since the current code works (and has been reviewed externally for correctness). Benchmarks from the documentation of Novi's curve25519-dalek fork (which has been upstreamed) indicate that an 8-15% performance degradation is to be expected, but the actual observed results are significantly worse.

Write an initial naive experimental branch to get ballpark benchmarks.
Be really really really sad about performance.
Try to figure out why performance is significantly worse when the Rust code is just "slightly worse".
- CarryMul is ridiculously slow because of addcarryxU64
- CarrySquare is ridiculously slow because of addcarryxU64
- Go's shit inliner strikes again. The Rust code #[inline]s fiat_25519_carry_mul among other things. Carry is over the inliner budget.
Try to reduce the performance penalty further
- Switch CarrySquare over to CarryPow2k
- Add Add/Sub/Opp + Carry
- Add Add/Sub + CarryMul/CarrySquare
- Fix X25519 performance
Benchmark on 32-bit ARM or something.

Current status (2021/08/09)

The upstream fiat-crypto developers were kind enough to fold in some performance related changes (CarryMul/CarrySquare performance fixed, Add/Sub/Opp + Carry added).

My branch still adds CarryPow2k and Add/Sub + CarryMul/CarrySquare, which I suppose I can ask about.

Since the branch is close enough to what I envision a switch would look like, these are the current rough benchmarks:

With AVX2 used where it exists:

name                                old time/op    new time/op    delta
VerifyBatchOnly/1-8                    103µs ± 0%     109µs ± 0%   +5.67%
VerifyBatchOnly/2-8                    146µs ± 0%     157µs ± 0%   +7.56%
VerifyBatchOnly/4-8                    232µs ± 0%     252µs ± 0%   +8.59%
VerifyBatchOnly/8-8                    402µs ± 0%     444µs ± 0%  +10.49%
VerifyBatchOnly/16-8                   742µs ± 0%     826µs ± 0%  +11.31%
VerifyBatchOnly/32-8                  1.43ms ± 0%    1.58ms ± 0%  +10.91%
VerifyBatchOnly/64-8                  2.76ms ± 0%    3.11ms ± 0%  +12.55%
VerifyBatchOnly/128-8                 5.24ms ± 0%    5.95ms ± 0%  +13.61%
VerifyBatchOnly/256-8                 9.52ms ± 0%   10.91ms ± 0%  +14.57%
VerifyBatchOnly/384-8                 13.5ms ± 0%    15.6ms ± 0%  +15.35%
VerifyBatchOnly/512-8                 17.7ms ± 0%    20.4ms ± 0%  +15.28%
VerifyBatchOnly/768-8                 25.4ms ± 0%    29.6ms ± 0%  +16.32%
VerifyBatchOnly/1024-8                32.8ms ± 0%    38.4ms ± 0%  +17.11%
GenerateKey/voi-8                     25.5µs ± 0%    27.9µs ± 0%   +9.33%
GenerateKey/stdlib-8                  43.3µs ± 0%    43.1µs ± 0%   -0.41%
NewKeyFromSeed/voi-8                  25.3µs ± 0%    27.7µs ± 0%   +9.43%
NewKeyFromSeed/stdlib-8               42.7µs ± 0%    42.8µs ± 0%   +0.07%
Signing/voi-8                         28.0µs ± 0%    30.3µs ± 0%   +8.21%
Signing/stdlib-8                      52.2µs ± 0%    52.2µs ± 0%   -0.04%
Verification/voi-8                    72.9µs ± 0%    79.4µs ± 0%   +8.94%
Verification/voi_stdlib-8             83.3µs ± 0%    90.9µs ± 0%   +9.13%
Verification/stdlib-8                  124µs ± 0%     124µs ± 0%   +0.15%
Expanded/NewExpandedPublicKey-8       11.4µs ± 0%    14.2µs ± 0%  +24.98%
Expanded/Verification/voi-8           62.6µs ± 0%    65.5µs ± 0%   +4.65%
Expanded/Verification/voi_stdlib-8    74.9µs ± 0%    78.9µs ± 0%   +5.39%
Expanded/VerifyBatchOnly/1-8          91.9µs ± 0%    96.7µs ± 0%   +5.18%
Expanded/VerifyBatchOnly/2-8           123µs ± 0%     128µs ± 0%   +4.02%
Expanded/VerifyBatchOnly/4-8           184µs ± 0%     197µs ± 0%   +6.76%
Expanded/VerifyBatchOnly/8-8           308µs ± 0%     331µs ± 0%   +7.54%
Expanded/VerifyBatchOnly/16-8          556µs ± 0%     597µs ± 0%   +7.37%
Expanded/VerifyBatchOnly/32-8         1.05ms ± 0%    1.13ms ± 0%   +7.70%
Expanded/VerifyBatchOnly/64-8         2.02ms ± 0%    2.20ms ± 0%   +8.56%
Expanded/VerifyBatchOnly/128-8        3.87ms ± 0%    4.24ms ± 0%   +9.47%
Expanded/VerifyBatchOnly/256-8        7.03ms ± 0%    7.77ms ± 0%  +10.38%
Expanded/VerifyBatchOnly/384-8        10.0ms ± 0%    11.1ms ± 0%  +10.93%
Expanded/VerifyBatchOnly/512-8        13.1ms ± 0%    14.5ms ± 0%  +11.00%
Expanded/VerifyBatchOnly/768-8        18.6ms ± 0%    20.9ms ± 0%  +12.10%
Expanded/VerifyBatchOnly/1024-8       24.0ms ± 0%    26.9ms ± 0%  +11.68%

name                      old time/op    new time/op    delta
ScalarBaseMult/voi-8        24.5µs ± 0%    26.9µs ± 0%   +9.54%
ScalarMult/voi-8            80.2µs ± 0%   113.0µs ± 0%  +40.98%

Note: The massive regression for X25519 ScalarMult is due to the removal of assembly. Sufficiently recent x/crypto/curve25519 does away with it as well, and clocks in at ~107us/op (~161us/op purego).

purego:

name                                old time/op    new time/op    delta
VerifyBatchOnly/1-8                    164µs ± 0%     179µs ± 0%   +9.15%
VerifyBatchOnly/2-8                    238µs ± 0%     247µs ± 0%   +3.79%
VerifyBatchOnly/4-8                    351µs ± 0%     383µs ± 0%   +9.18%
VerifyBatchOnly/8-8                    603µs ± 0%     657µs ± 0%   +9.03%
VerifyBatchOnly/16-8                  1.10ms ± 0%    1.21ms ± 0%   +9.29%
VerifyBatchOnly/32-8                  2.09ms ± 0%    2.29ms ± 0%   +9.60%
VerifyBatchOnly/64-8                  4.09ms ± 0%    4.48ms ± 0%   +9.36%
VerifyBatchOnly/128-8                 8.04ms ± 0%    8.93ms ± 0%  +11.05%
VerifyBatchOnly/256-8                 14.3ms ± 0%    15.9ms ± 0%  +11.09%
VerifyBatchOnly/384-8                 20.2ms ± 0%    22.5ms ± 0%  +11.43%
VerifyBatchOnly/512-8                 26.3ms ± 0%    29.3ms ± 0%  +11.46%
VerifyBatchOnly/768-8                 37.3ms ± 0%    41.4ms ± 0%  +11.10%
VerifyBatchOnly/1024-8                48.1ms ± 0%    53.4ms ± 0%  +11.20%
GenerateKey/voi-8                     48.8µs ± 0%    52.6µs ± 0%   +7.80%
GenerateKey/stdlib-8                  58.3µs ± 0%    58.2µs ± 0%   -0.02%
NewKeyFromSeed/voi-8                  48.5µs ± 0%    52.8µs ± 0%   +8.84%
NewKeyFromSeed/stdlib-8               58.2µs ± 0%    58.1µs ± 0%   -0.12%
Signing/voi-8                         51.3µs ± 0%    55.0µs ± 0%   +7.20%
Signing/stdlib-8                      72.4µs ± 0%    72.2µs ± 0%   -0.27%
Verification/voi-8                     108µs ± 0%     117µs ± 0%   +8.55%
Verification/voi_stdlib-8              131µs ± 0%     145µs ± 0%  +10.17%
Verification/stdlib-8                  181µs ± 0%     181µs ± 0%   -0.22%
Expanded/NewExpandedPublicKey-8       14.5µs ± 0%    16.0µs ± 0%  +10.85%
Expanded/Verification/voi-8           93.4µs ± 0%   101.9µs ± 0%   +9.16%
Expanded/Verification/voi_stdlib-8     119µs ± 0%     132µs ± 0%  +11.36%
Expanded/VerifyBatchOnly/1-8           149µs ± 0%     163µs ± 0%   +9.48%
Expanded/VerifyBatchOnly/2-8           196µs ± 0%     215µs ± 0%   +9.80%
Expanded/VerifyBatchOnly/4-8           293µs ± 0%     319µs ± 0%   +9.03%
Expanded/VerifyBatchOnly/8-8           484µs ± 0%     530µs ± 0%   +9.44%
Expanded/VerifyBatchOnly/16-8          868µs ± 0%     943µs ± 0%   +8.67%
Expanded/VerifyBatchOnly/32-8         1.62ms ± 0%    1.77ms ± 0%   +9.44%
Expanded/VerifyBatchOnly/64-8         3.13ms ± 0%    3.43ms ± 0%   +9.75%
Expanded/VerifyBatchOnly/128-8        6.33ms ± 0%    7.00ms ± 0%  +10.72%
Expanded/VerifyBatchOnly/256-8        11.3ms ± 0%    12.6ms ± 0%  +11.41%
Expanded/VerifyBatchOnly/384-8        15.9ms ± 0%    17.7ms ± 0%  +11.16%
Expanded/VerifyBatchOnly/512-8        20.7ms ± 0%    23.0ms ± 0%  +11.17%
Expanded/VerifyBatchOnly/768-8        29.1ms ± 0%    32.4ms ± 0%  +11.34%
Expanded/VerifyBatchOnly/1024-8       37.4ms ± 0%    41.5ms ± 0%  +10.98%

name                      old time/op    new time/op    delta
ScalarBaseMult/voi-8        47.8µs ± 0%    51.7µs ± 0%  +8.18%
ScalarMult/voi-8             118µs ± 0%     113µs ± 0%  -4.05%

This is getting to "an acceptable slowdown" if people think that the fiat code is preferable to the code that came from dalek, but the issue of having to ship modified routines from the 64/curve25519 implementation remains.