mdanalysis / distopia

SIMD instructions for faster distance calculations.

Home Page: https://www.mdanalysis.org/distopia/

CMake 10.26% C++ 43.65% C 4.66% Python 37.98% Cython 3.45%

distopia's Introduction

MDAnalysis Repository README

MDAnalysis is a Python library for the analysis of computer simulations of many-body systems at the molecular scale, spanning use cases from interactions of drugs with proteins to novel materials. It is widely used in the scientific community and is written by scientists for scientists.

It works with a wide range of popular simulation packages including Gromacs, Amber, NAMD, CHARMM, DL_Poly, HooMD, LAMMPS and many others — see the lists of supported trajectory formats and topology formats. MDAnalysis also includes widely used analysis algorithms in the MDAnalysis.analysis module.

The MDAnalysis project uses an open governance model and is fiscally sponsored by NumFOCUS. Consider making a tax-deductible donation to help the project pay for developer time, professional services, travel, workshops, and a variety of other needs.

NumFOCUS (Fiscally Sponsored Project)

This project is bound by a Code of Conduct.

Powered by MDAnalysis

If you use MDAnalysis in your project consider letting your users and the world know about it by displaying the MDAnalysis badge! Embedding code is available for different markups.

Example analysis script

import MDAnalysis as mda

# Load simulation results with a single line
u = mda.Universe('topol.tpr','traj.trr')

# Select atoms
ag = u.select_atoms('name OH')

# Atom data made available as Numpy arrays
ag.positions
ag.velocities
ag.forces

# Iterate through trajectories
for ts in u.trajectory:
    print(ag.center_of_mass())

Documentation

New users should read the Quickstart Guide and might want to look at our videos, in which core developers explain various aspects of MDAnalysis.

All users should read the User Guide.

Developers may also want to refer to the MDAnalysis API docs.

A growing number of tutorials are available that explain how to conduct RMSD calculations, structural alignment, distance and contact analysis, and many more.

Installation and availability

The latest release can be installed via pip or conda as described in the Installation Quick Start.

Source code is hosted in a git repository at https://github.com/MDAnalysis/mdanalysis and is packaged under the GNU General Public License, version 3 or any later version. Individual source code components are provided under a mixture of GPLv3+ compatible licenses, including LGPLv2.1+ and GPLv2+. Please see the file LICENSE for more information.

Contributing

Please report bugs or enhancement requests through the Issue Tracker. Questions can also be asked on GitHub Discussions.

If you are a new developer who would like to start contributing to MDAnalysis, get in touch on GitHub Discussions. To set up a development environment and run the test suite, read the developer guide.

Citation

When using MDAnalysis in published work, please cite the following two papers:

  • R. J. Gowers, M. Linke, J. Barnoud, T. J. E. Reddy, M. N. Melo, S. L. Seyler, D. L. Dotson, J. Domanski, S. Buchoux, I. M. Kenney, and O. Beckstein. MDAnalysis: A Python package for the rapid analysis of molecular dynamics simulations. In S. Benthall and S. Rostrup, editors, Proceedings of the 15th Python in Science Conference, pages 102-109, Austin, TX, 2016. SciPy. doi: 10.25080/Majora-629e541a-00e
  • N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein. MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations. J. Comput. Chem. 32 (2011), 2319--2327. doi: 10.1002/jcc.21787

For citations of included algorithms and sub-modules please see the references.

distopia's People

Contributors

hmacdope, ialibay, nbgl, richardjgowers, rmeli

distopia's Issues

Support MSVC /arch flags

Currently we lack support for the MSVC instruction-set flags, e.g. /arch:SSE and /arch:AVX. This should be fixed.

Problems with flags etc.

We are starting to run into problems where you can only compile with certain flags enabled, e.g. -mavx.
We need to think about how we are going to handle this going forward.

A proper CMake solution is in the works in #22. @nbgl @richardjgowers, would you be happy if I try to make that more watertight? You would then just need to protect flag-dependent code with something like:

#if DISTOPIA_USE_AVX
	auto blah = __mm256__(blah2)
#endif
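As a rough sketch of how such a guard could look in practice (not the solution from #22): the block below derives DISTOPIA_USE_AVX from the compiler's predefined __AVX__ macro when the build system has not set it, and falls back to scalar code otherwise. The function add_arrays and the fallback logic are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: a real build would define DISTOPIA_USE_AVX from CMake.
   Here it is derived from the compiler's predefined __AVX__ macro. */
#ifndef DISTOPIA_USE_AVX
#  if defined(__AVX__)
#    define DISTOPIA_USE_AVX 1
#  else
#    define DISTOPIA_USE_AVX 0
#  endif
#endif

#if DISTOPIA_USE_AVX
#include <immintrin.h>
#endif

/* Hypothetical helper: element-wise sum of two float arrays of length n. */
void add_arrays(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
#if DISTOPIA_USE_AVX
    /* Only compiled when AVX is enabled: 8 floats per iteration. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#endif
    /* Scalar path: handles the tail, or everything when AVX is off. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

void example_add_arrays(void) {
    float a[11], b[11], out[11];
    for (int i = 0; i < 11; ++i) {
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }
    add_arrays(a, b, out, 11);
    for (int i = 0; i < 11; ++i) {
        assert(out[i] == 3.0f * (float)i);
    }
}
```

The same source then builds with or without -mavx, which is the property the guard is meant to provide.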

Segfault for YMM code on linux

@nbgl the YMM code CalcBonds256 segfaults on linux but not on macos (@lilyminium and I checked across 2 systems).
Valgrind tells me it's in mm256_periodic_boundary_distance_round, but the problem may be further up. I'll see if I can dig any deeper.

Investigate issues with pip installing in Windows

Follow up from #115

pip installing from source in Windows leads to issues with identifying the correct NumPy install.
Everything works fine when using python setup.py build, so it looks like some deeper pip wheel generation thing probably?

Current build/deploy workflow

Build and deploy workflow is currently just

  • git tag 0.1.0-rcX
  • python3 setup.py sdist
  • twine upload dist/*

I should have been using TestPyPI, but I have been going a bit too ham, adding releases to the real PyPI repo and then removing them. My apologies, I know this is bad practice.

Rsqrt benchmarks

I benchmarked calculations with vsqrtps against ones using vrsqrtps (with and without a Newton iteration). The results are (on 10 000 000 iterations × 4 096 pairs):

Version             Time
vsqrtps             17.7s
vrsqrtps            16.7s
vrsqrtps + Newton   19.8s

These results make sense. vsqrtps has a latency of 12 cycles (on Skylake; other generations will be similar), which is somewhat high, but not massive compared to the rest of the calculation (40 cycles total in my version, not including memory access). The biggest problem with vsqrtps is not its latency but its throughput: you can only do a square root once every 6 cycles. But since our entire calculation takes longer than that, this is not a problem.

We can replace vsqrtps with vrsqrtps and a multiplication. This has a latency of 8 and does indeed increase speed. Unfortunately, it only has a precision of about 11 bits. That’s not even close to the true result, according to numpy.isclose.

We can improve the accuracy with an iteration of Newton’s method. This involves another two multiplications and a fused multiply-add, increasing the latency to 16 (higher than vsqrtps, but also with a higher throughput). Unfortunately, the biggest bottleneck in this distance computation is the FMA unit (which is also responsible for multiplication and addition/subtraction). It can only begin two operations per cycle, so it already spends most of its time at capacity. Adding even more instructions for it to execute actually slows down the computation.

Timings

Hey @richardjgowers @nbgl, I did some basic timings with the code in #31

Working our way up from SSE1 to AVX2 with the following compiler flags:
-O3 -ffast-math -fno-math-errno -mfma. Each coordinate size is run over 100000 iterations.

SSE1
PLOT_msse1_O3_ffast_math_fno_matherrno_mfma
Not much difference between our three optimised versions, MDA quite slow.

SSE2
PLOT_msse2_O3_ffast_math_fno_matherrno_mfma
Again little difference between hand rolled vs @nbgl's autovectorised versions. Slightly faster overall, MDA, MDTraj etc closing in.

SSE3
PLOT_msse3_O3_ffast_math_fno_matherrno_mfma
Again little difference between hand rolled vs @nbgl's autovectorised versions. Not much faster than SSE2.

SSE4
PLOT_msse4_O3_ffast_math_fno_matherrno_mfma
Once again little difference between hand rolled vs @nbgl's autovectorised versions. Not much faster than SSE3.

AVX
PLOT_mavx_O3_ffast_math_fno_matherrno_mfma
Same story

AVX2
PLOT_mavx2_O3_ffast_math_fno_matherrno_mfma

Fastest overall with very different cache behaviour? Hand rolled is slower than autovectorised.

Let me know if you have any thoughts.

Switch dispatch build to be default?

We should make building for namespaced dispatch with multiple SSE targets the default as this is more compatible with building through pip and conda etc.

Investigate prefetching.

Can explicit prefetch instructions help improve cache locality? Especially for the IDX algorithms, which are cache-miss heavy.

AVX512 is not supported correctly

The AVX512 instruction set is not homogeneous; it consists of many separate feature subsets. We can either detect these explicitly or build with -march=native.

Make 3xVec structs

A common structure is 3 × __m256, which we should wrap in a class or struct.

Add documentation

We need to work on adding some documentation at least to the public headers. Do we want to go with some Doxygen docs? I am happy to set that up.

Support ARM

In the future we should add support for ARM intrinsics. This is a large undertaking, as we will need to wrap the ARM intrinsics alongside the x86 ones.

Fix rpath on linux

The attempted build with the conda recipe is showing some RPATH problems.

Periodic boundary implementation breaks triangle inequality

Consider the following example:

#include <math.h>
#include <stddef.h>
#include <stdio.h>

float periodic_boundary(float r, float b) {
    return r - b * nearbyintf(r / b);
}

void calculate_distances(
    const float* coords1,
    const float* coords2,
    const float* box,
    float* out,
    size_t n
) {
    for (size_t i = 0; i < n; ++i) {
        float dist = 0.0;
        for (size_t j = 0; j < 3; ++j) {
            float r = coords1[3 * i + j] - coords2[3 * i + j];
            float b = box[j];
            r = periodic_boundary(r, b);
            dist += r * r;
        }
        out[i] = sqrtf(dist);
    }
}

int main(int argc, const char * argv[]) {
    float x1 = 1e7f, x2 = 0.3f, x3 = 9.7f;
    float y1 = 1e7f, y2 = 0.3f, y3 = 9.7f;
    float z1 = 1e7f, z2 = 0.3f, z3 = 9.7f;
    float coords1[9] = {
        x1, y1, z1,
        x1, y1, z1,
        x2, y2, z2,
    };
    float coords2[9] = {
        x2, y2, z2,
        x3, y3, z3,
        x3, y3, z3,
    };
    float out[3];
    float box[3] = {1.0f, 1.0f, 1.0f};
    calculate_distances(coords1, coords2, box, out, 3);
    float d12 = out[0], d13 = out[1], d23 = out[2];
    printf("d(v1,v2) = %f, d(v1,v3) = %f, d(v2,v3) = %f\n", d12, d13, d23);
}

We are finding the distances between v1 = (1e7, 1e7, 1e7), v2 = (.3, .3, .3), and v3 = (9.7, 9.7, 9.7). The box has dimensions (1, 1, 1). The output is d(v1, v2) = 0, d(v1, v3) = 0, and d(v2, v3) = .692820. Notice that d(v2, v3) > d(v1, v2) + d(v1, v3), in violation of the triangle inequality.

This is happening because of floating-point rounding: x1 - x2 == 1e7. We can fix it by applying the periodic boundary condition to the positions before computing their distance:

void calculate_distances(
    const float* coords1,
    const float* coords2,
    const float* box,
    float* out,
    size_t n
) {
    for (size_t i = 0; i < n; ++i) {
        float dist = 0.0;
        for (size_t j = 0; j < 3; ++j) {
            float b = box[j];
            float x1 = periodic_boundary(coords1[3 * i + j], b);
            float x2 = periodic_boundary(coords2[3 * i + j], b);
            float r = fabsf(x1 - x2);
            r = fminf(r, b - r);
            dist += r * r;
        }
        out[i] = sqrtf(dist);
    }
}

This outputs d(v1, v2) = .519615, d(v1, v3) = .519616, and d(v2, v3) = .692820, which does not break maths.

(I realize that coordinates that high are a rare edge case. Nonetheless, they are valid positions, so they should yield sensible outputs.)

Clean up API and headers

The headers and API are currently a total mess. Once we are happy with the content we should try and do a big cleanup.

Relicensing

MDAnalysis is moving from GPL to LGPL in the near future. Is anyone averse to re-licensing to match the main package? @richardjgowers @nbgl ? If so let me know here.

Optimizing vectorized distances

Hi! I’m Jakub, @lilyminium’s partner. She suggested that I look at your implementation of vectorized distances and see if I can improve it.

Scalar baseline

I used your scalar distance code as a baseline:

void calculate_distances1(
    const float* coords1,
    const float* coords2,
    const float* box,
    float* out,
    size_t n
) {
    for (size_t i = 0; i < n; ++i) {
        float dist = 0.0;
        for (size_t j = 0; j < 3; ++j) {
            float r = coords1[3 * i + j] - coords2[3 * i + j];
            float b = box[j];
            float adj = roundf(r / b);
            r -= adj * b;
            dist += r * r;
        }
        out[i] = sqrtf(dist);
    }
}

10 million iterations on n = 4096 took 9m 38.5s on my machine.

Looking at the disassembly, I noticed that Clang attempted to auto-vectorize the loop:

...
vmovups (%r12,%r15), %ymm0
vmovups 32(%r12,%r15), %ymm1
vmovups 64(%r12,%r15), %ymm2
vsubps  64(%r13,%r15), %ymm2, %ymm2
vmovaps %ymm2, 64(%rsp)         ## 32-byte Spill
vsubps  32(%r13,%r15), %ymm1, %ymm1
vmovaps %ymm1, 192(%rsp)        ## 32-byte Spill
vsubps  (%r13,%r15), %ymm0, %ymm0
vmovaps %ymm0, 384(%rsp)        ## 32-byte Spill
vblendps    $146, %ymm1, %ymm0, %ymm0 ## ymm0 = ymm0[0],ymm1[1],ymm0[2,3],ymm1[4],ymm0[5,6],ymm1[7]
...

With auto-vectorization turned off, the loop took 11m 26.9s instead.

Still, Clang’s vectorization is not very good. Firstly, it spends a lot of time swizzling data around the vectors. Secondly, Clang is unable to vectorize roundf so it makes multiple function calls at every iteration:

...
callq   _roundf
vmovaps 128(%rsp), %xmm1        ## 16-byte Reload
vinsertps   $16, %xmm0, %xmm1, %xmm0 ## xmm0 = xmm1[0],xmm0[0],xmm1[2,3]
vmovaps %xmm0, 128(%rsp)        ## 16-byte Spill
vpermilpd   $1, (%rsp), %xmm0 ## 16-byte Folded Reload
                                    ## xmm0 = mem[1,0]
callq   _roundf
vmovaps 128(%rsp), %xmm1        ## 16-byte Reload
vinsertps   $32, %xmm0, %xmm1, %xmm0 ## xmm0 = xmm1[0,1],xmm0[0],xmm1[3]
vmovaps %xmm0, 128(%rsp)        ## 16-byte Spill
vpermilps   $231, (%rsp), %xmm0 ## 16-byte Folded Reload
                                    ## xmm0 = mem[3,1,2,3]
callq   _roundf
...

nearbyintf

Clang’s inability to vectorize roundf is a feature, not a bug. By the C standard, roundf rounds halfway cases away from zero, whereas the vroundps instruction has no such mode: it can round to nearest with ties to even, or toward zero, up, or down. A compiler will only auto-vectorize when it can guarantee that it won’t change the result.

We can change roundf to nearbyintf:

float dist = 0.0f;
for (size_t j = 0; j < 3; ++j) {
    float r = coords1[3 * i + j] - coords2[3 * i + j];
    float b = box[j];
    float adj = nearbyintf(r / b);
    r -= adj * b;
    dist += r * r;
}
out[i] = sqrtf(dist);

This lets Clang inline and vectorize:

...
vsubps  %ymm13, %ymm10, %ymm10
vmulps  %ymm10, %ymm10, %ymm10
vroundps    $12, %ymm12, %ymm12
vaddps  %ymm5, %ymm11, %ymm11
vmulps  %ymm2, %ymm12, %ymm12
...

This loop takes 36.4s, a 16× speedup!

Fused multiply-add

We can do better still. CPUs with AVX2 generally also support the FMA extension, whose fused multiply-add instructions turn expressions of the form a + b × c into a single step. This is roughly twice as fast as two separate operations. The result is also more accurate, since we’re rounding once instead of twice. Unfortunately, more accurate implies different, so the optimizer can’t use FMA without our permission. math.h provides the fma function (fmaf for floats) that we can use:

for (size_t i = 0; i < n; ++i) {
    float dist;
    for (size_t j = 0; j < 3; ++j) {
        float r = coords1[3 * i + j] - coords2[3 * i + j];
        float b = box[j];
        float adj = nearbyintf(r / b);
        r = fmaf(-adj, b, r);
        dist = j == 0 ? r * r : fmaf(r, r, dist);
    }
    out[i] = sqrtf(dist);
}

The function with FMA takes 29.7s, so it’s 1.2× faster.

On periodic boundaries

Applying periodic boundaries with float adj = nearbyintf(r / b); r = fmaf(-adj, b, r) is problematic.

Firstly, it is slow. In the best case (where we’ve precomputed 1/b) it has a latency of 16 cycles (on Skylake; for comparison square root has a latency of 12 cycles) and it dispatches 4 μops to the floating point execution units.

Secondly, it has accuracy issues that lead to unexpected results for perfectly valid inputs. In particular, for values of r that are big enough, the integer nearest to r / b is not representable as a single-precision floating-point number. This means that I can construct an input that yields a distance much bigger than √(box_x² + box_y² + box_z²), which is clearly wrong. We can solve this problem with remainderf, but it has the same performance issues as roundf: the benchmark takes 17m 17.1s to run, 35× slower than our current best.

If you support positions that are arbitrary floats, you have two choices:

  1. Slow calculations.
  2. Absurd results for some valid inputs.

Given that, I suggest requiring that all positions are either in the range 0 ≤ x < box_x or in -box_x/2 ≤ x ≤ box_x/2. With this condition on the input, we can accurately apply the periodic boundary with

float r = coords1[3 * i + j] - coords2[3 * i + j];
float b = box[j];
r = fabsf(r);
r = fminf(r, b - r);

This is also faster, with a latency of 9 (or 7-8 with ✨t r i c k s✨) and 3 μops. It runs in 28.9s, a 1.03× improvement (the speedup is slightly bigger when using intrinsics).

It’s worth noting that these computational shortcuts improve the speed of scalar code as well as vector code. When we compile the below function (with auto-vectorization disabled by the pragma), it runs in 2m 9.4s, a 5× speedup on the scalar starting point.

void calculate_distances5_nonvector(
    const float* restrict coords1,
    const float* restrict coords2,
    const float* restrict box,
    float* restrict out,
    size_t n
) {
#pragma clang loop vectorize(disable)
    for (size_t i = 0; i < n; ++i) {
        float dist;
        for (size_t j = 0; j < 3; ++j) {
            float r = coords1[3 * i + j] - coords2[3 * i + j];
            float b = box[j];
            r = fabsf(r);
            r = fminf(r, b - r);
            dist = j == 0 ? r * r : fmaf(r, r, dist);
        }
        out[i] = sqrtf(dist);
    }
}

Intrinsics

We appear to have pushed the limits of (at least Clang’s) auto-vectorization. I implemented this computation with AVX2 intrinsics:

__m256 mm256_abs_ps(__m256 a) {
    __m256 abs_mask = _mm256_set1_ps(-0.0f);
    return _mm256_andnot_ps(abs_mask, a);
}

__m256 mm256_min_nnve_ps(__m256 a, __m256 b) {
    return _mm256_castsi256_ps(_mm256_min_epu32(_mm256_castps_si256(a), _mm256_castps_si256(b)));
}

typedef struct {
    __m256 a;
    __m256 b;
    __m256 c;
} m256_3;

m256_3 mm256_transpose_8x3_ps(__m256 a, __m256 b, __m256 c) {
    /* a = x0y0z0x1y1z1x2y2 */
    /* b = z2x3y3z3x4y4z4x5 */
    /* c = y5z5x6y6z6x7y7z7 */
    
    __m256 m1 = _mm256_blend_ps(a, b, 0xf0);
    __m256 m2 = _mm256_permute2f128_ps(a, c, 0x21);
    __m256 m3 = _mm256_blend_ps(b, c, 0xf0);
    /* m1 = x0y0z0x1x4y4z4x5 */
    /* m2 = y1z1x2y2y5z5x6y6 */
    /* m3 = z2x3y3z3z6x7y7z7 */
    
    __m256 t1 = _mm256_shuffle_ps(m2, m3, _MM_SHUFFLE(2,1,3,2));
    __m256 t2 = _mm256_shuffle_ps(m1, m2, _MM_SHUFFLE(1,0,2,1));
    /* t1 = x2y2x3y3x6y6x7y7 */
    /* t2 = y0z0y1z1y4z4y5z5 */
    
    __m256 x = _mm256_shuffle_ps(m1, t1, _MM_SHUFFLE(2,0,3,0));
    __m256 y = _mm256_shuffle_ps(t2, t1, _MM_SHUFFLE(3,1,2,0));
    __m256 z = _mm256_shuffle_ps(t2, m3, _MM_SHUFFLE(3,0,3,1));
    /* x = x0x1x2x3x4x5x6x7 */
    /* y = y0y1y2y3y4y5y6y7 */
    /* z = z0z1z2z3z4z5z6z7 */
    
    m256_3 res = {x, y, z};
    return res;
}

void calculate_distances_vectorized(
    const float * restrict arr1 __attribute__((align_value(32))),
    const float * restrict arr2 __attribute__((align_value(32))),
    const float * restrict box,
    float * restrict out __attribute__((align_value(32))),
    size_t n
) {
    const __m256 * restrict arr1_256 __attribute__((align_value(32))) = (const __m256 *) arr1;
    const __m256 * restrict arr2_256 __attribute__((align_value(32))) = (const __m256 *) arr2;
    __m256 * restrict out_256 __attribute__((align_value(32))) = (__m256 *) out;

    __m256 boxv = {box[0], box[1], box[2], NAN, box[1], box[2], box[0], NAN};
    __m256 box1 = _mm256_permute_ps(boxv, _MM_SHUFFLE(0,2,1,0));
    __m256 box2 = _mm256_permute_ps(boxv, _MM_SHUFFLE(2,1,0,2));
    __m256 box3 = _mm256_permute_ps(boxv, _MM_SHUFFLE(1,0,2,1));
    
    n >>= 3;
#pragma unroll 2  // Any more and Clang will spill registers
    for (size_t i = 0; i < n; ++i) {
        size_t j = i * 3;
        
        __m256 m11 = arr1_256[j];
        __m256 m12 = arr1_256[j+1];
        __m256 m13 = arr1_256[j+2];
        __m256 m21 = arr2_256[j];
        __m256 m22 = arr2_256[j+1];
        __m256 m23 = arr2_256[j+2];

        __m256 diffm1 = m11 - m21;
        __m256 diffm2 = m12 - m22;
        __m256 diffm3 = m13 - m23;
        
        diffm1 = mm256_abs_ps(diffm1);
        diffm2 = mm256_abs_ps(diffm2);
        diffm3 = mm256_abs_ps(diffm3);
        
        diffm1 = mm256_min_nnve_ps(diffm1, box1 - diffm1);
        diffm2 = mm256_min_nnve_ps(diffm2, box2 - diffm2);
        diffm3 = mm256_min_nnve_ps(diffm3, box3 - diffm3);

        m256_3 transpose_res = mm256_transpose_8x3_ps(diffm1, diffm2, diffm3);
        __m256 x_diff = transpose_res.a;
        __m256 y_diff = transpose_res.b;
        __m256 z_diff = transpose_res.c;

        __m256 dist_sq = x_diff * x_diff;
        dist_sq = _mm256_fmadd_ps(y_diff, y_diff, dist_sq);
        dist_sq = _mm256_fmadd_ps(z_diff, z_diff, dist_sq);

        __m256 dist = _mm256_sqrt_ps(dist_sq);
        out_256[i] = dist;
    }
}

This runs in 17.5s, a 1.7× speedup over the best auto-vectorized version. There are six things to note here:

  1. I do most of the math before shuffling, so the out-of-order execution engine can compensate for random delays in memory access.
  2. There is no absolute value instruction, so mm256_abs_ps implements it by zeroing the sign bit.
  3. mm256_min_nnve_ps compares non-negative floats quickly. Non-negative floats maintain their order when interpreted as 32-bit integers. This instruction has 1 cycle latency (+1 cycle penalty for moving the register between domains), whereas floating point comparison has 4 cycles’ latency.
  4. mm256_transpose_8x3_ps has to shuffle data between 128-bit lanes and then perform a 4×3 transpose within each lane. It’s more efficient than Intel’s method because blends are way cheaper than inserts.
  5. Many of the usual operators are defined on __m256 (a GCC/Clang vector extension). For example, * and - are the same as (and nicer than) _mm256_mul_ps and _mm256_sub_ps.
  6. I’m using 256-bit vectors. There’s little reason not to use them on machines that support them. All the operations have the same cost for 128-bit and 256-bit vectors. The only extra cost is that the transpose has a few more instructions. Indeed, I tried a 128-bit version of the same code and it ran for 29.7s, 1.7× longer.
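Points 2 and 3 rely on the IEEE 754 bit layout of floats. A scalar sketch of the same two tricks follows; the function names are invented here, and memcpy is used for the type punning.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

float bits_float(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Scalar analogue of mm256_abs_ps: clear the sign bit. */
float abs_by_mask(float a) {
    return bits_float(float_bits(a) & 0x7FFFFFFFu);
}

/* Scalar analogue of mm256_min_nnve_ps. Valid only for non-negative,
   non-NaN inputs, whose bit patterns sort in the same order as their values. */
float min_nonneg(float a, float b) {
    uint32_t ua = float_bits(a), ub = float_bits(b);
    return bits_float(ua < ub ? ua : ub);
}

void example_bit_tricks(void) {
    assert(abs_by_mask(-1.5f) == 1.5f);
    assert(abs_by_mask(2.25f) == 2.25f);
    assert(min_nonneg(0.25f, 3.0f) == 0.25f);
    assert(min_nonneg(1e-30f, 0.0f) == 0.0f);
    assert(min_nonneg(7.0f, 7.0f) == 7.0f);
}
```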

Assembly?

I was curious how far I could push this so I hand-wrote a distance function in assembly. It ran in 16.5s, which is a small improvement from code that is a lot less maintainable…

For comparison, I wrote a small function that reads integers from two arrays, adds them, and writes the results to memory:

asm volatile ("xor %%eax, %%eax \n"
              "sum_arrs_loop: \n"
              "    vmovdqa    (%2,%%rax), %%ymm0 \n"
              "    vmovdqa    (%3,%%rax), %%ymm1 \n"
              "    vpaddb    32(%2,%%rax), %%ymm0, %%ymm0 \n"
              "    vpaddb    32(%3,%%rax), %%ymm1, %%ymm1 \n"
              "    vpaddb    64(%2,%%rax), %%ymm0, %%ymm0 \n"
              "    vpaddb    64(%3,%%rax), %%ymm1, %%ymm1 \n"
              "    vpaddb    %%ymm0, %%ymm1, %%ymm0 \n"
              "    vmovdqa    %%ymm0, (%0) \n"
              "    addq    $32, %0 \n"
              "    addq    $96, %%rax \n"
              "    decq    %1 \n"
              "    jne    sum_arrs_loop\n"
             : "+r" (out), "+r" (n)
             : "r" (arr2), "r" (arr1)
             : "ymm0", "ymm1", "rax", "memory");

This measures the cost of just accessing the positions and writing the results—we can’t hope to do better than this. It runs in 10.8s, which is only 1.5× faster than my assembly distance calculation (and 1.6× faster than the version using intrinsics).

Memory boundedness

All the above benchmarks were run for 10 million iterations on 4096 pairs. The arrays’ small size means that they fit in the L2 cache, so the CPU does not have to read from the main memory. If instead we run 1000 iterations on 40 960 000 pairs, we find that the starting code takes 9m 47.7s, the best auto-vectorized version takes 1m 26.4s, the intrinsics version runs for 1m 17.6s, and the memory access baseline is 1m 11.5s. We can see that for large inputs, we are memory-bounded and this will not get any faster 🤷‍♂️.
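For a sense of scale, the data touched per call can be estimated as below. This is a back-of-the-envelope sketch that assumes 4-byte floats and counts only the two coordinate arrays and the output; the function name is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched per call: two coordinate arrays with 3 floats per
   position, plus one output float per pair. */
size_t working_set_bytes(size_t n_pairs) {
    return n_pairs * (3 + 3 + 1) * sizeof(float);
}

void example_working_set(void) {
    /* Assumes sizeof(float) == 4. */
    assert(working_set_bytes(4096) == 114688);           /* 112 KiB */
    assert(working_set_bytes(40960000) == 1146880000ul); /* about 1.1 GB */
}
```

The small case is on the order of a typical L2 cache, while the large case is far beyond any cache level, which matches the timings converging on the memory access baseline.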

Tables

10 000 000 iterations × 4 096 pairs

Version                         Time        × faster than baseline   × slower than intrinsics
Baseline (non-autovectorized)   11m 26.9s   0.8                      39.2
Baseline                        9m 38.5s    1                        33.0
^ + nearbyint                   36.4s       15.9                     2.1
^ + FMA                         29.7s       19.5                     1.7
^ + faster boundary             29.0s       20.0                     1.7
Intrinsics (128-bit)            29.7s       19.5                     1.7
Intrinsics (256-bit)            17.5s       33.0                     1
Assembly                        16.5s       35.1                     0.9
(Memory access)                 10.8s       53.4                     0.6

1 000 iterations × 40 960 000 pairs

Version                         Time        × faster than baseline   × slower than intrinsics
Baseline (non-autovectorized)   11m 34.3s   0.8                      8.9
Baseline                        9m 47.7s    1                        7.6
^ + nearbyint                   1m 21.5s    7.2                      1.0
^ + FMA                         1m 26.2s    6.8                      1.1
^ + faster boundary             1m 27.3s    6.7                      1.1
Intrinsics (128-bit)            1m 19.2s    7.4                      1.0
Intrinsics (256-bit)            1m 17.6s    7.6                      1
Assembly                        1m 17.2s    7.6                      1.0
(Memory access)                 1m 11.5s    8.2                      0.9

Proposal

Here are actionable items I think would speed things up:

  1. Assume that all inputs are between 0 and box (or between -box/2 and box/2) and use the method I described to apply the periodic boundary condition.
  2. Use fused multiply-add on platforms that support it.
  3. Use 256-bit vectors on platforms that support them.

Please let me know if this is helpful. I’m happy to make a PR with these changes.

Move to inplace API rather than allocating ourselves?

We currently allocate the memory for the return array ourselves, which is slightly different to how it is done in MDA.
This makes us more standalone, as we don't rely on the right array being passed in, but it leads to more complicated code on the MDA side.

clock_gettime causing troubles

src/lib/tests/googlebench/src/libbenchmark.a(timers.cc.o): In function `benchmark::ProcessCPUUsage()':
timers.cc:(.text+0x1e): undefined reference to `clock_gettime'
src/lib/tests/googlebench/src/libbenchmark.a(timers.cc.o): In function `benchmark::ThreadCPUUsage()':
timers.cc:(.text+0x8e): undefined reference to `clock_gettime'
collect2: error: ld returned 1 exit status

I get this with my gcc 7.2.0 env

Create shim for trig functions.

We are not aiming to have vectorised trig for 0.1.0 so instead we will add a shim layer that can do the requisite trig and replace later.

CI

travispls
