Hi! I’m Jakub, @lilyminium’s partner. She suggested that I look at your implementation of vectorized distances and see if I can improve it.
## Scalar baseline
I used your scalar distance code as a baseline:
```c
void calculate_distances1(
    const float* coords1,
    const float* coords2,
    const float* box,
    float* out,
    size_t n
) {
    for (size_t i = 0; i < n; ++i) {
        float dist = 0.0f;
        for (size_t j = 0; j < 3; ++j) {
            float r = coords1[3 * i + j] - coords2[3 * i + j];
            float b = box[j];
            float adj = roundf(r / b);
            r -= adj * b;
            dist += r * r;
        }
        out[i] = sqrtf(dist);
    }
}
```
Running 10 million iterations on n = 4096 took 9m 38.5s on my machine.
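For context, my timing harness looked roughly like the sketch below. The setup details (box size, RNG, the `aligned_alloc` calls) are assumptions for illustration, not the exact code I ran:

```c
#include <stdio.h>
#include <stdlib.h>

void calculate_distances1(const float*, const float*, const float*, float*, size_t);

#define N 4096
#define ITERS 10000000

int main(void) {
    /* 32-byte alignment so the same buffers also work for the AVX2 versions later. */
    float *coords1 = aligned_alloc(32, 3 * N * sizeof(float));
    float *coords2 = aligned_alloc(32, 3 * N * sizeof(float));
    float *out = aligned_alloc(32, N * sizeof(float));
    float box[3] = {10.0f, 10.0f, 10.0f};

    /* Random positions inside the box. */
    for (size_t i = 0; i < 3 * N; ++i) {
        coords1[i] = box[i % 3] * ((float)rand() / (float)RAND_MAX);
        coords2[i] = box[i % 3] * ((float)rand() / (float)RAND_MAX);
    }

    for (int iter = 0; iter < ITERS; ++iter)
        calculate_distances1(coords1, coords2, box, out, N);

    /* Print something so the compiler can't discard the work. */
    printf("%f\n", out[0]);
    free(coords1);
    free(coords2);
    free(out);
    return 0;
}
```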
Looking at the disassembly, I noticed that Clang attempted to auto-vectorize the loop:
```asm
...
vmovups (%r12,%r15), %ymm0
vmovups 32(%r12,%r15), %ymm1
vmovups 64(%r12,%r15), %ymm2
vsubps 64(%r13,%r15), %ymm2, %ymm2
vmovaps %ymm2, 64(%rsp)  ## 32-byte Spill
vsubps 32(%r13,%r15), %ymm1, %ymm1
vmovaps %ymm1, 192(%rsp) ## 32-byte Spill
vsubps (%r13,%r15), %ymm0, %ymm0
vmovaps %ymm0, 384(%rsp) ## 32-byte Spill
vblendps $146, %ymm1, %ymm0, %ymm0 ## ymm0 = ymm0[0],ymm1[1],ymm0[2,3],ymm1[4],ymm0[5,6],ymm1[7]
...
```
With auto-vectorization turned off, the loop took 11m 26.9s instead.
Still, Clang’s vectorization is not very good. Firstly, it spends a lot of time swizzling data around the vectors. Secondly, Clang is unable to vectorize `roundf`, so it makes multiple function calls at every iteration:
```asm
...
callq _roundf
vmovaps 128(%rsp), %xmm1 ## 16-byte Reload
vinsertps $16, %xmm0, %xmm1, %xmm0 ## xmm0 = xmm1[0],xmm0[0],xmm1[2,3]
vmovaps %xmm0, 128(%rsp) ## 16-byte Spill
vpermilpd $1, (%rsp), %xmm0 ## 16-byte Folded Reload
                            ## xmm0 = mem[1,0]
callq _roundf
vmovaps 128(%rsp), %xmm1 ## 16-byte Reload
vinsertps $32, %xmm0, %xmm1, %xmm0 ## xmm0 = xmm1[0,1],xmm0[0],xmm1[3]
vmovaps %xmm0, 128(%rsp) ## 16-byte Spill
vpermilps $231, (%rsp), %xmm0 ## 16-byte Folded Reload
                              ## xmm0 = mem[3,1,2,3]
callq _roundf
...
```
## `nearbyintf`
Clang’s inability to vectorize `roundf` is a feature, not a bug. `roundf`, by the standard, rounds halfway cases away from zero, whereas the `vroundps` instruction rounds them to even. A compiler will only auto-vectorize when it can guarantee that it won’t change the result.
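To see the difference concretely (assuming the default round-to-nearest rounding mode):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* roundf breaks ties away from zero; nearbyintf follows the current
       rounding mode, which by default breaks ties to even. */
    printf("%.1f %.1f\n", roundf(0.5f), nearbyintf(0.5f)); /* 1.0 0.0 */
    printf("%.1f %.1f\n", roundf(2.5f), nearbyintf(2.5f)); /* 3.0 2.0 */
    return 0;
}
```

For our purposes the two are interchangeable: they only disagree when a separation lands exactly halfway across the box, and in that case both images are equally distant.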
We can change `roundf` to `nearbyintf`:
```c
float dist = 0.0f;
for (size_t j = 0; j < 3; ++j) {
    float r = coords1[3 * i + j] - coords2[3 * i + j];
    float b = box[j];
    float adj = nearbyintf(r / b);
    r -= adj * b;
    dist += r * r;
}
out[i] = sqrtf(dist);
```
This lets Clang inline and vectorize:
```asm
...
vsubps %ymm13, %ymm10, %ymm10
vmulps %ymm10, %ymm10, %ymm10
vroundps $12, %ymm12, %ymm12
vaddps %ymm5, %ymm11, %ymm11
vmulps %ymm2, %ymm12, %ymm12
...
```
This loop takes 36.4s, a 16× speedup!
## Fused multiply-add
We can do better still. AVX2 has a fused multiply-add (FMA) instruction which turns expressions of the form a + b × c into a single step. This is roughly twice as fast as two separate operations. The result is also more accurate, since we’re rounding once instead of twice. Unfortunately, more accurate implies different, so the optimizer can’t use FMA without our permission. `math.h` provides an `fmaf` function that we can use:
```c
for (size_t i = 0; i < n; ++i) {
    float dist;
    for (size_t j = 0; j < 3; ++j) {
        float r = coords1[3 * i + j] - coords2[3 * i + j];
        float b = box[j];
        float adj = nearbyintf(r / b);
        r = fmaf(-adj, b, r);
        dist = j == 0 ? r * r : fmaf(r, r, dist);
    }
    out[i] = sqrtf(dist);
}
```
The function with FMA takes 29.7s, so it’s 1.2× faster.
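As an aside on the accuracy claim: because `fmaf` rounds only once, it can even recover the rounding error that a plain multiplication discards. A minimal sketch (illustration only, not project code):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    float a = 1.0f + 0x1p-12f;     /* exactly representable: 1 + 2^-12 */
    float prod = a * a;            /* exact square is 1 + 2^-11 + 2^-24,
                                      which rounds to 1 + 2^-11 */
    float err = fmaf(a, a, -prod); /* no intermediate rounding, so the
                                      lost 2^-24 reappears */
    printf("%a\n", err);           /* prints 0x1p-24 */
    return 0;
}
```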
## On periodic boundaries
Applying periodic boundaries with `float adj = nearbyintf(r / b); r = fmaf(-adj, b, r);` is problematic.

Firstly, it is slow. In the best case (where we’ve precomputed `1/b`) it has a latency of 16 cycles (on Skylake; for comparison, square root has a latency of 12 cycles) and it dispatches 4 μops to the floating-point execution units.
Secondly, it has accuracy issues that lead to unexpected results for perfectly valid inputs. In particular, for values of `r` that are big enough, the integer nearest to `r / b` is not representable as a single-precision floating-point number. This means that I can construct an input that yields a distance much bigger than √(box_x² + box_y² + box_z²), which is clearly wrong. We can solve this problem with `remainderf`, but it has the same performance issues as `roundf`: the benchmark takes 17m 17.1s to run, 35× slower than our current best.
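To make the failure concrete, here is a small reproduction I would expect on any target with IEEE-754 arithmetic and default rounding (the specific numbers are my own illustration):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    float b = 17.0f;
    float r = 1.0e9f;                 /* a perfectly valid float separation */
    float adj = nearbyintf(r / b);    /* r / b rounds to 58823528.0f, about
                                         1.4 away from the true quotient */
    float wrapped = fmaf(-adj, b, r); /* 24.0: bigger than the box itself */
    float exact = remainderf(r, b);   /* 7.0: the correct minimum image */
    printf("wrapped=%.1f exact=%.1f box=%.1f\n", wrapped, exact, b);
    return 0;
}
```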
If you support positions that are arbitrary floats, you have two choices:
- Slow calculations.
- Absurd results for some valid inputs.
Given that, I suggest requiring that all positions be either in the range 0 ≤ x < box_x or in -box_x/2 ≤ x ≤ box_x/2 (and likewise for y and z). With this condition on the input, any separation satisfies |r| ≤ box_x, so the minimum image is simply min(|r|, box_x − |r|), and we can accurately apply the periodic boundary with
```c
float r = coords1[3 * i + j] - coords2[3 * i + j];
float b = box[j];
r = fabsf(r);
r = fminf(r, b - r);
```
This is also faster, with a latency of 9 cycles (or 7–8 with ✨t r i c k s✨) and 3 μops. It runs in 28.9s, a 1.03× improvement (the speedup is slightly bigger when using intrinsics).
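For intuition, here is the trick on concrete numbers (a standalone sketch):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Positions 0.5 and 9.5 in a box of length 10: the naive separation
       is 9, but across the periodic boundary it is only 1. */
    float b = 10.0f;
    float r = 0.5f - 9.5f; /* -9.0 */
    r = fabsf(r);          /*  9.0 */
    r = fminf(r, b - r);   /*  min(9, 1) = 1 */
    printf("%f\n", r);     /*  1.000000 */
    return 0;
}
```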
It’s worth noting that these computational shortcuts improve the speed of scalar code as well as vector code. When we compile the below function (with auto-vectorization disabled by the pragma), it runs in 2m 9.4s, a 5× speedup on the scalar starting point.
```c
void calculate_distances5_nonvector(
    const float* restrict coords1,
    const float* restrict coords2,
    const float* restrict box,
    float* restrict out,
    size_t n
) {
    #pragma clang loop vectorize(disable)
    for (size_t i = 0; i < n; ++i) {
        float dist;
        for (size_t j = 0; j < 3; ++j) {
            float r = coords1[3 * i + j] - coords2[3 * i + j];
            float b = box[j];
            r = fabsf(r);
            r = fminf(r, b - r);
            dist = j == 0 ? r * r : fmaf(r, r, dist);
        }
        out[i] = sqrtf(dist);
    }
}
```
## Intrinsics
We appear to have pushed the limits of (at least Clang’s) auto-vectorization. I implemented this computation with AVX2 intrinsics:
```c
__m256 mm256_abs_ps(__m256 a) {
    __m256 abs_mask = _mm256_set1_ps(-0.0f);
    return _mm256_andnot_ps(abs_mask, a);
}

__m256 mm256_min_nnve_ps(__m256 a, __m256 b) {
    return _mm256_castsi256_ps(
        _mm256_min_epu32(_mm256_castps_si256(a), _mm256_castps_si256(b)));
}

typedef struct {
    __m256 a;
    __m256 b;
    __m256 c;
} m256_3;

m256_3 mm256_transpose_8x3_ps(__m256 a, __m256 b, __m256 c) {
    /* a = x0y0z0x1y1z1x2y2 */
    /* b = z2x3y3z3x4y4z4x5 */
    /* c = y5z5x6y6z6x7y7z7 */
    __m256 m1 = _mm256_blend_ps(a, b, 0xf0);
    __m256 m2 = _mm256_permute2f128_ps(a, c, 0x21);
    __m256 m3 = _mm256_blend_ps(b, c, 0xf0);
    /* m1 = x0y0z0x1x4y4z4x5 */
    /* m2 = y1z1x2y2y5z5x6y6 */
    /* m3 = z2x3y3z3z6x7y7z7 */
    __m256 t1 = _mm256_shuffle_ps(m2, m3, _MM_SHUFFLE(2,1,3,2));
    __m256 t2 = _mm256_shuffle_ps(m1, m2, _MM_SHUFFLE(1,0,2,1));
    /* t1 = x2y2x3y3x6y6x7y7 */
    /* t2 = y0z0y1z1y4z4y5z5 */
    __m256 x = _mm256_shuffle_ps(m1, t1, _MM_SHUFFLE(2,0,3,0));
    __m256 y = _mm256_shuffle_ps(t2, t1, _MM_SHUFFLE(3,1,2,0));
    __m256 z = _mm256_shuffle_ps(t2, m3, _MM_SHUFFLE(3,0,3,1));
    /* x = x0x1x2x3x4x5x6x7 */
    /* y = y0y1y2y3y4y5y6y7 */
    /* z = z0z1z2z3z4z5z6z7 */
    m256_3 res = {x, y, z};
    return res;
}
```
```c
void calculate_distances_vectorized(
    const float * restrict arr1 __attribute__((align_value(32))),
    const float * restrict arr2 __attribute__((align_value(32))),
    const float * restrict box,
    float * restrict out __attribute__((align_value(32))),
    size_t n
) {
    const __m256 * restrict arr1_256 __attribute__((align_value(32))) = (const __m256 *) arr1;
    const __m256 * restrict arr2_256 __attribute__((align_value(32))) = (const __m256 *) arr2;
    __m256 * restrict out_256 __attribute__((align_value(32))) = (__m256 *) out;
    __m256 boxv = {box[0], box[1], box[2], NAN, box[1], box[2], box[0], NAN};
    __m256 box1 = _mm256_permute_ps(boxv, _MM_SHUFFLE(0,2,1,0));
    __m256 box2 = _mm256_permute_ps(boxv, _MM_SHUFFLE(2,1,0,2));
    __m256 box3 = _mm256_permute_ps(boxv, _MM_SHUFFLE(1,0,2,1));
    n >>= 3;
    #pragma unroll 2 // Any more and Clang will spill registers
    for (size_t i = 0; i < n; ++i) {
        size_t j = i * 3;
        __m256 m11 = arr1_256[j];
        __m256 m12 = arr1_256[j+1];
        __m256 m13 = arr1_256[j+2];
        __m256 m21 = arr2_256[j];
        __m256 m22 = arr2_256[j+1];
        __m256 m23 = arr2_256[j+2];
        __m256 diffm1 = m11 - m21;
        __m256 diffm2 = m12 - m22;
        __m256 diffm3 = m13 - m23;
        diffm1 = mm256_abs_ps(diffm1);
        diffm2 = mm256_abs_ps(diffm2);
        diffm3 = mm256_abs_ps(diffm3);
        diffm1 = mm256_min_nnve_ps(diffm1, box1 - diffm1);
        diffm2 = mm256_min_nnve_ps(diffm2, box2 - diffm2);
        diffm3 = mm256_min_nnve_ps(diffm3, box3 - diffm3);
        m256_3 transpose_res = mm256_transpose_8x3_ps(diffm1, diffm2, diffm3);
        __m256 x_diff = transpose_res.a;
        __m256 y_diff = transpose_res.b;
        __m256 z_diff = transpose_res.c;
        __m256 dist_sq = x_diff * x_diff;
        dist_sq = _mm256_fmadd_ps(y_diff, y_diff, dist_sq);
        dist_sq = _mm256_fmadd_ps(z_diff, z_diff, dist_sq);
        __m256 dist = _mm256_sqrt_ps(dist_sq);
        out_256[i] = dist;
    }
}
```
This runs in 17.5s, a 1.7× speedup over the best auto-vectorized version. There are six things to note here:
- I do most of the math before shuffling, so the out-of-order execution engine can compensate for random delays in memory access.
- There is no absolute value instruction, so `mm256_abs_ps` implements it by zeroing the sign bit.
- `mm256_min_nnve_ps` compares non-negative floats quickly: non-negative floats maintain their order when interpreted as 32-bit integers (see the sketch after this list). The integer min instruction has 1 cycle latency (+1 cycle penalty for moving the register between domains), whereas floating-point comparison has 4 cycles’ latency.
- `mm256_transpose_8x3_ps` has to shuffle data between 128-bit lanes and then perform a 4×3 transpose within each lane. It’s more efficient than Intel’s method because blends are way cheaper than inserts.
- Many of the usual operators are defined on `__m256`. For example, `*` and `-` are the same as (and nicer than) `_mm256_mul_ps` and `_mm256_sub_ps`.
- I’m using 256-bit vectors. There’s little reason not to use them on machines that support them: all the operations have the same cost for 128-bit and 256-bit vectors, and the only extra cost is that the transpose has a few more instructions. Indeed, I tried a 128-bit version of the same code and it ran for 29.7s, 1.7× longer.
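To illustrate the integer-min point, here is a tiny self-check of the ordering property (a sketch, not project code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Non-negative IEEE-754 floats compare the same way as their bit patterns
   read as unsigned integers, which is what makes the integer min valid. */
static uint32_t float_bits(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

int main(void) {
    float a = 0.75f, b = 1.5f;
    assert((a < b) == (float_bits(a) < float_bits(b)));
    assert((b < a) == (float_bits(b) < float_bits(a)));
    return 0;
}
```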
## Assembly?
I was curious how far I could push this, so I hand-wrote a distance function in assembly. It ran in 16.5s, which is a small improvement from code that is a lot less maintainable…
For comparison, I wrote a small function that reads integers from two arrays, adds them, and writes the result to memory:
```c
asm volatile ("xor %%eax, %%eax \n"
              "sum_arrs_loop: \n"
              " vmovdqa (%2,%%rax), %%ymm0 \n"
              " vmovdqa (%3,%%rax), %%ymm1 \n"
              " vpaddb 32(%2,%%rax), %%ymm0, %%ymm0 \n"
              " vpaddb 32(%3,%%rax), %%ymm1, %%ymm1 \n"
              " vpaddb 64(%2,%%rax), %%ymm0, %%ymm0 \n"
              " vpaddb 64(%3,%%rax), %%ymm1, %%ymm1 \n"
              " vpaddb %%ymm0, %%ymm1, %%ymm0 \n"
              " vmovdqa %%ymm0, (%0) \n"
              " addq $32, %0 \n"
              " addq $96, %%rax \n"
              " decq %1 \n"
              " jne sum_arrs_loop\n"
              : "+r" (out), "+r" (n)
              : "r" (arr2), "r" (arr1)
              : "ymm0", "ymm1", "rax", "memory");
```
This measures the cost of just accessing the positions and writing the results—we can’t hope to do better than this. It runs in 10.8s, which is only 1.5× faster than my assembly distance calculation (and 1.6× faster than the version using intrinsics).
## Memory boundedness
All the above benchmarks were run for 10 million iterations on 4096 pairs. The arrays’ small size means that they fit in the L2 cache, so the CPU does not have to read from main memory. If instead we run 1000 iterations on 40 960 000 pairs, we find that the starting code takes 9m 47.7s, the best auto-vectorized version takes 1m 26.4s, the intrinsics version runs for 1m 17.6s, and the memory access baseline is 1m 11.5s. We can see that for large inputs we are memory-bound, and this will not get any faster 🤷‍♂️.
## Tables
### 10 000 000 iterations × 4 096 pairs

| Version | Time | × faster than baseline | × slower than intrinsics |
| --- | --- | --- | --- |
| Baseline (non-autovectorized) | 11m 26.9s | 0.8 | 39.2 |
| Baseline | 9m 38.5s | 1 | 33.0 |
| ^ + nearbyint | 36.4s | 15.9 | 2.1 |
| ^ + FMA | 29.7s | 19.5 | 1.7 |
| ^ + faster boundary | 29.0s | 20.0 | 1.7 |
| Intrinsics (128-bit) | 29.7s | 19.5 | 1.7 |
| Intrinsics (256-bit) | 17.5s | 33.0 | 1 |
| Assembly | 16.5s | 35.1 | 0.9 |
| (Memory access) | 10.8s | 53.4 | 0.6 |
### 1 000 iterations × 40 960 000 pairs

| Version | Time | × faster than baseline | × slower than intrinsics |
| --- | --- | --- | --- |
| Baseline (non-autovectorized) | 11m 34.3s | 0.8 | 8.9 |
| Baseline | 9m 47.7s | 1 | 7.6 |
| ^ + nearbyint | 1m 21.5s | 7.2 | 1.0 |
| ^ + FMA | 1m 26.2s | 6.8 | 1.1 |
| ^ + faster boundary | 1m 27.3s | 6.7 | 1.1 |
| Intrinsics (128-bit) | 1m 19.2s | 7.4 | 1.0 |
| Intrinsics (256-bit) | 1m 17.6s | 7.6 | 1 |
| Assembly | 1m 17.2s | 7.6 | 1.0 |
| (Memory access) | 1m 11.5s | 8.2 | 0.9 |
## Proposal
Here are actionable items I think would speed things up:
- Assume that all inputs are between 0 and box (or between -box/2 and box/2) and use the method I described to apply the periodic boundary condition (a sketch for wrapping arbitrary inputs into range follows this list).
- Use fused multiply-add on platforms that support it.
- Use 256-bit vectors on platforms that support them.
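If callers can’t guarantee in-range positions, one option is to wrap them once on input, paying the `remainderf` cost per position instead of per distance calculation. A hypothetical helper (`wrap_positions` is my name for it, not existing project code):

```c
#include <math.h>
#include <stddef.h>

/* Wrap arbitrary positions into [0, box] once, so the fast fabsf/fminf
   minimum-image path stays valid for every later distance calculation. */
void wrap_positions(float* coords, const float* box, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < 3; ++j) {
            float b = box[j];
            float x = remainderf(coords[3 * i + j], b); /* in [-b/2, b/2] */
            coords[3 * i + j] = x < 0.0f ? x + b : x;   /* shift into [0, b] */
        }
    }
}
```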
Please let me know if this is helpful. I’m happy to make a PR with these changes.