Hi! I’m Jakub, @lilyminium’s partner. She suggested that I look at your implementation of vectorized distances and see if I can improve it.
## Scalar baseline
I used your scalar distance code as a baseline:
```c
void calculate_distances1(
    const float* coords1,
    const float* coords2,
    const float* box,
    float* out,
    size_t n
) {
    for (size_t i = 0; i < n; ++i) {
        float dist = 0.0f;
        for (size_t j = 0; j < 3; ++j) {
            float r = coords1[3 * i + j] - coords2[3 * i + j];
            float b = box[j];
            float adj = roundf(r / b);
            r -= adj * b;
            dist += r * r;
        }
        out[i] = sqrtf(dist);
    }
}
```
Running 10 million iterations on n = 4096 took 9m 38.5s on my machine.
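For context, my timing harness looked roughly like the sketch below. The setup details (box size, RNG, the `aligned_alloc` calls) are assumptions for illustration, not the exact code I ran:

```c
#include <stdio.h>
#include <stdlib.h>

void calculate_distances1(const float*, const float*, const float*, float*, size_t);

#define N 4096
#define ITERS 10000000

int main(void) {
    /* 32-byte alignment so the same buffers also work for the AVX2 versions later. */
    float *coords1 = aligned_alloc(32, 3 * N * sizeof(float));
    float *coords2 = aligned_alloc(32, 3 * N * sizeof(float));
    float *out = aligned_alloc(32, N * sizeof(float));
    float box[3] = {10.0f, 10.0f, 10.0f};

    /* Random positions inside the box. */
    for (size_t i = 0; i < 3 * N; ++i) {
        coords1[i] = box[i % 3] * ((float)rand() / (float)RAND_MAX);
        coords2[i] = box[i % 3] * ((float)rand() / (float)RAND_MAX);
    }

    for (int iter = 0; iter < ITERS; ++iter)
        calculate_distances1(coords1, coords2, box, out, N);

    /* Print something so the compiler can't discard the work. */
    printf("%f\n", out[0]);
    free(coords1);
    free(coords2);
    free(out);
    return 0;
}
```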
Looking at the disassembly, I noticed that Clang attempted to auto-vectorize the loop:
```asm
...
vmovups (%r12,%r15), %ymm0
vmovups 32(%r12,%r15), %ymm1
vmovups 64(%r12,%r15), %ymm2
vsubps 64(%r13,%r15), %ymm2, %ymm2
vmovaps %ymm2, 64(%rsp)  ## 32-byte Spill
vsubps 32(%r13,%r15), %ymm1, %ymm1
vmovaps %ymm1, 192(%rsp) ## 32-byte Spill
vsubps (%r13,%r15), %ymm0, %ymm0
vmovaps %ymm0, 384(%rsp) ## 32-byte Spill
vblendps $146, %ymm1, %ymm0, %ymm0 ## ymm0 = ymm0[0],ymm1[1],ymm0[2,3],ymm1[4],ymm0[5,6],ymm1[7]
...
```
With auto-vectorization turned off, the loop took 11m 26.9s instead.
Still, Clang’s vectorization is not very good. Firstly, it spends a lot of time swizzling data around the vectors. Secondly, Clang is unable to vectorize `roundf`, so it makes multiple function calls at every iteration:
```asm
...
callq _roundf
vmovaps 128(%rsp), %xmm1 ## 16-byte Reload
vinsertps $16, %xmm0, %xmm1, %xmm0 ## xmm0 = xmm1[0],xmm0[0],xmm1[2,3]
vmovaps %xmm0, 128(%rsp) ## 16-byte Spill
vpermilpd $1, (%rsp), %xmm0 ## 16-byte Folded Reload
                            ## xmm0 = mem[1,0]
callq _roundf
vmovaps 128(%rsp), %xmm1 ## 16-byte Reload
vinsertps $32, %xmm0, %xmm1, %xmm0 ## xmm0 = xmm1[0,1],xmm0[0],xmm1[3]
vmovaps %xmm0, 128(%rsp) ## 16-byte Spill
vpermilps $231, (%rsp), %xmm0 ## 16-byte Folded Reload
                              ## xmm0 = mem[3,1,2,3]
callq _roundf
...
```
## `nearbyintf`
Clang’s inability to vectorize `roundf` is a feature, not a bug. `roundf`, by the standard, rounds halfway cases away from zero, whereas the `vroundps` instruction rounds them to even. A compiler will only auto-vectorize when it can guarantee that it won’t change the result.
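To see the difference concretely (assuming the default round-to-nearest rounding mode):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* roundf breaks ties away from zero; nearbyintf follows the current
       rounding mode, which by default breaks ties to even. */
    printf("%.1f %.1f\n", roundf(0.5f), nearbyintf(0.5f)); /* 1.0 0.0 */
    printf("%.1f %.1f\n", roundf(2.5f), nearbyintf(2.5f)); /* 3.0 2.0 */
    return 0;
}
```

For our purposes the two are interchangeable: they only disagree when a separation lands exactly halfway across the box, and in that case both images are equally distant.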
We can change `roundf` to `nearbyintf`:
```c
float dist = 0.0f;
for (size_t j = 0; j < 3; ++j) {
    float r = coords1[3 * i + j] - coords2[3 * i + j];
    float b = box[j];
    float adj = nearbyintf(r / b);
    r -= adj * b;
    dist += r * r;
}
out[i] = sqrtf(dist);
```
This lets Clang inline and vectorize:
```asm
...
vsubps %ymm13, %ymm10, %ymm10
vmulps %ymm10, %ymm10, %ymm10
vroundps $12, %ymm12, %ymm12
vaddps %ymm5, %ymm11, %ymm11
vmulps %ymm2, %ymm12, %ymm12
...
```
This loop takes 36.4s, a 16× speedup!
## Fused multiply-add
We can do better still. AVX2 has a fused multiply-add (FMA) instruction which turns expressions of the form a + b × c into a single step. This is roughly twice as fast as two separate operations. The result is also more accurate, since we’re rounding once instead of twice. Unfortunately, more accurate implies different, so the optimizer can’t use FMA without our permission. `math.h` provides an `fmaf` function that we can use:
```c
for (size_t i = 0; i < n; ++i) {
    float dist;
    for (size_t j = 0; j < 3; ++j) {
        float r = coords1[3 * i + j] - coords2[3 * i + j];
        float b = box[j];
        float adj = nearbyintf(r / b);
        r = fmaf(-adj, b, r);
        dist = j == 0 ? r * r : fmaf(r, r, dist);
    }
    out[i] = sqrtf(dist);
}
```
The function with FMA takes 29.7s, so it’s 1.2× faster.
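As an aside on the accuracy claim: because `fmaf` rounds only once, it can even recover the rounding error that a plain multiplication discards. A minimal sketch (illustration only, not project code):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    float a = 1.0f + 0x1p-12f;     /* exactly representable: 1 + 2^-12 */
    float prod = a * a;            /* exact square is 1 + 2^-11 + 2^-24,
                                      which rounds to 1 + 2^-11 */
    float err = fmaf(a, a, -prod); /* no intermediate rounding, so the
                                      lost 2^-24 reappears */
    printf("%a\n", err);           /* prints 0x1p-24 */
    return 0;
}
```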
## On periodic boundaries
Applying periodic boundaries with `float adj = nearbyintf(r / b); r = fmaf(-adj, b, r);` is problematic.

Firstly, it is slow. In the best case (where we’ve precomputed `1/b`) it has a latency of 16 cycles (on Skylake; for comparison, square root has a latency of 12 cycles) and it dispatches 4 μops to the floating-point execution units.
Secondly, it has accuracy issues that lead to unexpected results for perfectly valid inputs. In particular, for values of `r` that are big enough, the integer nearest to `r / b` is not representable as a single-precision floating-point number. This means that I can construct an input that yields a distance much bigger than √(box_x² + box_y² + box_z²), which is clearly wrong. We can solve this problem with `remainderf`, but it has the same performance issues as `roundf`: the benchmark takes 17m 17.1s to run, 35× slower than our current best.
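To make the failure concrete, here is a small reproduction I would expect on any target with IEEE-754 arithmetic and default rounding (the specific numbers are my own illustration):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    float b = 17.0f;
    float r = 1.0e9f;                 /* a perfectly valid float separation */
    float adj = nearbyintf(r / b);    /* r / b rounds to 58823528.0f, about
                                         1.4 away from the true quotient */
    float wrapped = fmaf(-adj, b, r); /* 24.0: bigger than the box itself */
    float exact = remainderf(r, b);   /* 7.0: the correct minimum image */
    printf("wrapped=%.1f exact=%.1f box=%.1f\n", wrapped, exact, b);
    return 0;
}
```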
If you support positions that are arbitrary floats, you have two choices:
- Slow calculations.
- Absurd results for some valid inputs.
Given that, I suggest requiring that all positions be either in the range 0 ≤ x < box_x or in -box_x/2 ≤ x ≤ box_x/2 (and likewise for y and z). With this condition on the input, any separation satisfies |r| ≤ box_x, so the minimum image is simply min(|r|, box_x − |r|), and we can accurately apply the periodic boundary with
```c
float r = coords1[3 * i + j] - coords2[3 * i + j];
float b = box[j];
r = fabsf(r);
r = fminf(r, b - r);
```
This is also faster, with a latency of 9 cycles (or 7–8 with ✨t r i c k s✨) and 3 μops. It runs in 28.9s, a 1.03× improvement (the speedup is slightly bigger when using intrinsics).
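For intuition, here is the trick on concrete numbers (a standalone sketch):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Positions 0.5 and 9.5 in a box of length 10: the naive separation
       is 9, but across the periodic boundary it is only 1. */
    float b = 10.0f;
    float r = 0.5f - 9.5f; /* -9.0 */
    r = fabsf(r);          /*  9.0 */
    r = fminf(r, b - r);   /*  min(9, 1) = 1 */
    printf("%f\n", r);     /*  1.000000 */
    return 0;
}
```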
It’s worth noting that these computational shortcuts improve the speed of scalar code as well as vector code. When we compile the below function (with auto-vectorization disabled by the pragma), it runs in 2m 9.4s, a 5× speedup on the scalar starting point.
```c
void calculate_distances5_nonvector(
    const float* restrict coords1,
    const float* restrict coords2,
    const float* restrict box,
    float* restrict out,
    size_t n
) {
    #pragma clang loop vectorize(disable)
    for (size_t i = 0; i < n; ++i) {
        float dist;
        for (size_t j = 0; j < 3; ++j) {
            float r = coords1[3 * i + j] - coords2[3 * i + j];
            float b = box[j];
            r = fabsf(r);
            r = fminf(r, b - r);
            dist = j == 0 ? r * r : fmaf(r, r, dist);
        }
        out[i] = sqrtf(dist);
    }
}
```
## Intrinsics
We appear to have pushed the limits of (at least Clang’s) auto-vectorization. I implemented this computation with AVX2 intrinsics:
```c
__m256 mm256_abs_ps(__m256 a) {
    __m256 abs_mask = _mm256_set1_ps(-0.0f);
    return _mm256_andnot_ps(abs_mask, a);
}

__m256 mm256_min_nnve_ps(__m256 a, __m256 b) {
    return _mm256_castsi256_ps(
        _mm256_min_epu32(_mm256_castps_si256(a), _mm256_castps_si256(b)));
}

typedef struct {
    __m256 a;
    __m256 b;
    __m256 c;
} m256_3;

m256_3 mm256_transpose_8x3_ps(__m256 a, __m256 b, __m256 c) {
    /* a = x0y0z0x1y1z1x2y2 */
    /* b = z2x3y3z3x4y4z4x5 */
    /* c = y5z5x6y6z6x7y7z7 */
    __m256 m1 = _mm256_blend_ps(a, b, 0xf0);
    __m256 m2 = _mm256_permute2f128_ps(a, c, 0x21);
    __m256 m3 = _mm256_blend_ps(b, c, 0xf0);
    /* m1 = x0y0z0x1x4y4z4x5 */
    /* m2 = y1z1x2y2y5z5x6y6 */
    /* m3 = z2x3y3z3z6x7y7z7 */
    __m256 t1 = _mm256_shuffle_ps(m2, m3, _MM_SHUFFLE(2,1,3,2));
    __m256 t2 = _mm256_shuffle_ps(m1, m2, _MM_SHUFFLE(1,0,2,1));
    /* t1 = x2y2x3y3x6y6x7y7 */
    /* t2 = y0z0y1z1y4z4y5z5 */
    __m256 x = _mm256_shuffle_ps(m1, t1, _MM_SHUFFLE(2,0,3,0));
    __m256 y = _mm256_shuffle_ps(t2, t1, _MM_SHUFFLE(3,1,2,0));
    __m256 z = _mm256_shuffle_ps(t2, m3, _MM_SHUFFLE(3,0,3,1));
    /* x = x0x1x2x3x4x5x6x7 */
    /* y = y0y1y2y3y4y5y6y7 */
    /* z = z0z1z2z3z4z5z6z7 */
    m256_3 res = {x, y, z};
    return res;
}
```
```c
void calculate_distances_vectorized(
    const float * restrict arr1 __attribute__((align_value(32))),
    const float * restrict arr2 __attribute__((align_value(32))),
    const float * restrict box,
    float * restrict out __attribute__((align_value(32))),
    size_t n
) {
    const __m256 * restrict arr1_256 __attribute__((align_value(32))) = (const __m256 *) arr1;
    const __m256 * restrict arr2_256 __attribute__((align_value(32))) = (const __m256 *) arr2;
    __m256 * restrict out_256 __attribute__((align_value(32))) = (__m256 *) out;
    __m256 boxv = {box[0], box[1], box[2], NAN, box[1], box[2], box[0], NAN};
    __m256 box1 = _mm256_permute_ps(boxv, _MM_SHUFFLE(0,2,1,0));
    __m256 box2 = _mm256_permute_ps(boxv, _MM_SHUFFLE(2,1,0,2));
    __m256 box3 = _mm256_permute_ps(boxv, _MM_SHUFFLE(1,0,2,1));
    n >>= 3;
    #pragma unroll 2 // Any more and Clang will spill registers
    for (size_t i = 0; i < n; ++i) {
        size_t j = i * 3;
        __m256 m11 = arr1_256[j];
        __m256 m12 = arr1_256[j+1];
        __m256 m13 = arr1_256[j+2];
        __m256 m21 = arr2_256[j];
        __m256 m22 = arr2_256[j+1];
        __m256 m23 = arr2_256[j+2];
        __m256 diffm1 = m11 - m21;
        __m256 diffm2 = m12 - m22;
        __m256 diffm3 = m13 - m23;
        diffm1 = mm256_abs_ps(diffm1);
        diffm2 = mm256_abs_ps(diffm2);
        diffm3 = mm256_abs_ps(diffm3);
        diffm1 = mm256_min_nnve_ps(diffm1, box1 - diffm1);
        diffm2 = mm256_min_nnve_ps(diffm2, box2 - diffm2);
        diffm3 = mm256_min_nnve_ps(diffm3, box3 - diffm3);
        m256_3 transpose_res = mm256_transpose_8x3_ps(diffm1, diffm2, diffm3);
        __m256 x_diff = transpose_res.a;
        __m256 y_diff = transpose_res.b;
        __m256 z_diff = transpose_res.c;
        __m256 dist_sq = x_diff * x_diff;
        dist_sq = _mm256_fmadd_ps(y_diff, y_diff, dist_sq);
        dist_sq = _mm256_fmadd_ps(z_diff, z_diff, dist_sq);
        __m256 dist = _mm256_sqrt_ps(dist_sq);
        out_256[i] = dist;
    }
}
```
This runs in 17.5s, a 1.7× speedup over the best auto-vectorized version. There are six things to note here:
- I do most of the math before shuffling, so the out-of-order execution engine can compensate for random delays in memory access.
- There is no absolute value instruction, so `mm256_abs_ps` implements it by zeroing the sign bit.
- `mm256_min_nnve_ps` compares non-negative floats quickly: non-negative floats maintain their order when interpreted as 32-bit integers (see the sketch after this list). The integer min instruction has 1 cycle latency (+1 cycle penalty for moving the register between domains), whereas floating-point comparison has 4 cycles’ latency.
- `mm256_transpose_8x3_ps` has to shuffle data between 128-bit lanes and then perform a 4×3 transpose within each lane. It’s more efficient than Intel’s method because blends are way cheaper than inserts.
- Many of the usual operators are defined on `__m256`. For example, `*` and `-` are the same as (and nicer than) `_mm256_mul_ps` and `_mm256_sub_ps`.
- I’m using 256-bit vectors. There’s little reason not to use them on machines that support them: all the operations have the same cost for 128-bit and 256-bit vectors, and the only extra cost is that the transpose has a few more instructions. Indeed, I tried a 128-bit version of the same code and it ran for 29.7s, 1.7× longer.
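To illustrate the integer-min point, here is a tiny self-check of the ordering property (a sketch, not project code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Non-negative IEEE-754 floats compare the same way as their bit patterns
   read as unsigned integers, which is what makes the integer min valid. */
static uint32_t float_bits(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

int main(void) {
    float a = 0.75f, b = 1.5f;
    assert((a < b) == (float_bits(a) < float_bits(b)));
    assert((b < a) == (float_bits(b) < float_bits(a)));
    return 0;
}
```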
## Assembly?
I was curious how far I could push this, so I hand-wrote a distance function in assembly. It ran in 16.5s, which is a small improvement from code that is a lot less maintainable…
For comparison, I wrote a small function that reads integers from two arrays, adds them, and writes the result to memory:
```c
asm volatile ("xor %%eax, %%eax \n"
              "sum_arrs_loop: \n"
              " vmovdqa (%2,%%rax), %%ymm0 \n"
              " vmovdqa (%3,%%rax), %%ymm1 \n"
              " vpaddb 32(%2,%%rax), %%ymm0, %%ymm0 \n"
              " vpaddb 32(%3,%%rax), %%ymm1, %%ymm1 \n"
              " vpaddb 64(%2,%%rax), %%ymm0, %%ymm0 \n"
              " vpaddb 64(%3,%%rax), %%ymm1, %%ymm1 \n"
              " vpaddb %%ymm0, %%ymm1, %%ymm0 \n"
              " vmovdqa %%ymm0, (%0) \n"
              " addq $32, %0 \n"
              " addq $96, %%rax \n"
              " decq %1 \n"
              " jne sum_arrs_loop\n"
              : "+r" (out), "+r" (n)
              : "r" (arr2), "r" (arr1)
              : "ymm0", "ymm1", "rax", "memory");
```
This measures the cost of just accessing the positions and writing the results—we can’t hope to do better than this. It runs in 10.8s, which is only 1.5× faster than my assembly distance calculation (and 1.6× faster than the version using intrinsics).
## Memory boundedness
All the above benchmarks were run for 10 million iterations on 4096 pairs. The arrays’ small size means that they fit in the L2 cache, so the CPU does not have to read from main memory. If instead we run 1000 iterations on 40 960 000 pairs, we find that the starting code takes 9m 47.7s, the best auto-vectorized version takes 1m 26.4s, the intrinsics version runs for 1m 17.6s, and the memory access baseline is 1m 11.5s. We can see that for large inputs we are memory-bound, and this will not get any faster 🤷‍♂️.
## Tables
### 10 000 000 iterations × 4 096 pairs

| Version | Time | × faster than baseline | × slower than intrinsics |
| --- | --- | --- | --- |
| Baseline (non-autovectorized) | 11m 26.9s | 0.8 | 39.2 |
| Baseline | 9m 38.5s | 1 | 33.0 |
| ^ + nearbyint | 36.4s | 15.9 | 2.1 |
| ^ + FMA | 29.7s | 19.5 | 1.7 |
| ^ + faster boundary | 29.0s | 20.0 | 1.7 |
| Intrinsics (128-bit) | 29.7s | 19.5 | 1.7 |
| Intrinsics (256-bit) | 17.5s | 33.0 | 1 |
| Assembly | 16.5s | 35.1 | 0.9 |
| (Memory access) | 10.8s | 53.4 | 0.6 |
### 1 000 iterations × 40 960 000 pairs

| Version | Time | × faster than baseline | × slower than intrinsics |
| --- | --- | --- | --- |
| Baseline (non-autovectorized) | 11m 34.3s | 0.8 | 8.9 |
| Baseline | 9m 47.7s | 1 | 7.6 |
| ^ + nearbyint | 1m 21.5s | 7.2 | 1.0 |
| ^ + FMA | 1m 26.2s | 6.8 | 1.1 |
| ^ + faster boundary | 1m 27.3s | 6.7 | 1.1 |
| Intrinsics (128-bit) | 1m 19.2s | 7.4 | 1.0 |
| Intrinsics (256-bit) | 1m 17.6s | 7.6 | 1 |
| Assembly | 1m 17.2s | 7.6 | 1.0 |
| (Memory access) | 1m 11.5s | 8.2 | 0.9 |
## Proposal
Here are actionable items I think would speed things up:
- Assume that all inputs are between 0 and box (or between -box/2 and box/2) and use the method I described to apply the periodic boundary condition (a sketch for wrapping arbitrary inputs into range follows this list).
- Use fused multiply-add on platforms that support it.
- Use 256-bit vectors on platforms that support them.
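If callers can’t guarantee in-range positions, one option is to wrap them once on input, paying the `remainderf` cost per position instead of per distance calculation. A hypothetical helper (`wrap_positions` is my name for it, not existing project code):

```c
#include <math.h>
#include <stddef.h>

/* Wrap arbitrary positions into [0, box] once, so the fast fabsf/fminf
   minimum-image path stays valid for every later distance calculation. */
void wrap_positions(float* coords, const float* box, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < 3; ++j) {
            float b = box[j];
            float x = remainderf(coords[3 * i + j], b); /* in [-b/2, b/2] */
            coords[3 * i + j] = x < 0.0f ? x + b : x;   /* shift into [0, b] */
        }
    }
}
```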
Please let me know if this is helpful. I’m happy to make a PR with these changes.