vcdevel / std-simd

std::experimental::simd for GCC [ISO/IEC TS 19570:2018]

License: Other


std-simd's Introduction

std::experimental::simd

portable, zero-overhead C++ types for explicitly data-parallel programming

Development here is moving on to std::simd for C++26. For the TS implementation, use GCC/libstdc++: std::experimental::simd has shipped with GCC since version 11.

This package implements ISO/IEC TS 19570:2018 Section 9 "Data-Parallel Types". The implementation is derived from https://github.com/VcDevel/Vc.

By default, the install.sh script places the std::experimental::simd headers into the directory where the standard library of your C++ compiler (identified via $CXX) resides.

It is only tested and supported with GCC trunk, even though it may work with older GCC versions.

Target support

  • x86_64 is the main development platform and thoroughly tested. This includes support from SSE-only up to AVX512 on Xeon Phi or Xeon CPUs.
  • aarch64, arm, and ppc64le were tested and verified to work. No significant performance evaluation has been done.
  • In any case, a fallback to correct execution via builtin arithmetic types is available for all targets.

Installation Instructions

$ ./install.sh

Use --help to learn about the available options.

Example

Scalar Product

Let's start from the code for calculating a 3D scalar product using builtin floats:

using Vec3D = std::array<float, 3>;
float scalar_product(Vec3D a, Vec3D b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

Using simd, we can easily vectorize the code using the native_simd<float> type (Compiler Explorer):

using std::experimental::native_simd;
using Vec3D = std::array<native_simd<float>, 3>;
native_simd<float> scalar_product(Vec3D a, Vec3D b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

The above will scale to 1, 4, 8, 16, etc. scalar products calculated in parallel, depending on the target hardware's capabilities.
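In real code the operands usually come from memory rather than arriving as ready-made Vec3D values. As a hedged sketch (the helper name and data layout below are assumptions, not from the README), each component can be loaded with the element_aligned load constructor:

```cpp
#include <experimental/simd>
#include <array>

namespace stdx = std::experimental;
using floatv = stdx::native_simd<float>;
using Vec3D = std::array<floatv, 3>;

// Hypothetical helper: fill one Vec3D from three plain float arrays, each
// holding floatv::size() consecutive points of that component.
Vec3D load_vec3d(const float* x, const float* y, const float* z) {
  return {floatv(x, stdx::element_aligned),
          floatv(y, stdx::element_aligned),
          floatv(z, stdx::element_aligned)};
}

floatv scalar_product(Vec3D a, Vec3D b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```

This structure-of-arrays layout is what makes the vectorized scalar_product above profitable: each load fills a full vector register with one component of several points.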

For comparison, the same vectorization using Intel SSE intrinsics is more verbose, uses prefix notation (i.e. function calls), and neither scales to AVX or AVX512, nor is it portable to different SIMD ISAs:

using Vec3D = std::array<__m128, 3>;
__m128 scalar_product(Vec3D a, Vec3D b) {
  return _mm_add_ps(_mm_add_ps(_mm_mul_ps(a[0], b[0]), _mm_mul_ps(a[1], b[1])),
                    _mm_mul_ps(a[2], b[2]));
}

Build Requirements

None; the library is header-only.

However, to build the unit tests you will need:

  • cmake >= 3.0
  • GCC >= 9.1

To execute all AVX512 unit tests, you will need the Intel SDE.

Building the tests

$ make test

This will create a build directory, run cmake, compile the tests, and execute the tests.

Documentation

https://en.cppreference.com/w/cpp/experimental/simd

Publications

License

The simd headers, tests, and benchmarks are released under the terms of the 3-clause BSD license.

Note that the code in libstdc++ is distributed under GPL3 with runtime library exception.

std-simd's People

Contributors

brycelelbach, chr-engwer, j-stephan, jcowgill, kgnk, mattkretz, oshadura, sxleixer, themarix, vks


std-simd's Issues

clang++: abs of float vector comes out zero

clang++ (clang version 10.0.0-4ubuntu1) yields zero when abs() is called with a float vector. Here's code to reproduce the error:

#include <iostream>
#include <experimental/simd>

int main ( int argc , char * argv[] )
{
  std::experimental::simd
          < float ,
            std::experimental::simd_abi::fixed_size < 2 >
          > x ;

  x[0] = -1.0f ;
  x[1] =  1.0f ;

  auto y = abs ( x ) ;

  for ( int i = 0 ; i < 2 ; i++ )
  {
    std::cout << "x[" << i << "] : " << x[i] << " -> " ;
    std::cout << "y[" << i << "] : " << y[i] << std::endl ;
  }
}

Actual Results

Compiling with clang++ and running, I get the wrong result:

$ clang++ -std=c++17 stdsimd_abs.cc
$ ./a.out
x[0] : -1 -> y[0] : 0
x[1] : 1 -> y[1] : 0

Expected Results

g++ yields the correct result.

$ g++ -std=c++17 stdsimd_abs.cc
$ ./a.out
x[0] : -1 -> y[0] : 1
x[1] : 1 -> y[1] : 1

Kay

Inconsistency with n4808

This library, as well as the implementation in GCC, is inconsistent with N4808 and, surprisingly, has exactly the same mistakes that cppreference once had. I recently found and fixed some of them on cppreference:

  • missing noexcept for concat split simd_cast and static_simd_cast
  • missing operator+ and operator~ for const_where_expression
  • missing function split_by entirely

MacOS - fatal error: numeric_traits.h: No such file or directory

version / revision | Operating System | Compiler & Version | Compiler Flags | CPU
1.00 | MacOS Catalina | gcc version 10.2.0 (Homebrew GCC 10.2.0) | -g0 -O2 -std=c++17 | i7-3615QM

Testcase

I am trying to compile a dummy program just to learn a few things about vectorization. When compiling with g++-10 and the above mentioned flags, I get this error /usr/local/Cellar/gcc/10.2.0/include/c++/10.2.0/experimental/bits/simd.h:34:10: fatal error: numeric_traits.h: No such file or directory. I can see that numeric_traits.h is located at /usr/local/Cellar/gcc/10.2.0/include/c++/10.2.0/experimental/ext but I cannot find a way to link it.

Any suggestions?

Thank you in advance

building error and warning

version / revision | Operating System | Compiler & Version | Compiler Flags | CPU
std_simd::master | Linux | gcc (GCC) 11.0.0 20210119 (experimental) | make test | x86_64

Testcase

ninja/make build_tests

Expected Results

Build passes without errors or warnings.

Actual Results

  1. building warning from _MaskImplBuiltin::_S_set
$workspace/cpp_libs/std-simd/tests/mask.cpp:124:4:   required from ‘static void Tests::operators_<M>::run() [with M = ⠶simd_mask<char, ⠶simd_abi::_VecBuiltin<32> >]’
$workspace/cpp_libs/std-simd/tests/virtest/vir/test.h:1015:33:   required from ‘int vir::test::detail::addTestInstantiations(const char*, vir::Typelist<Bs ...>) [with 
TestWrapper = Tests::operators_; Ts = {⠶simd_mask<long long int, ⠶simd_abi::_Scalar>, ⠶simd_mask<long long int, ⠶simd_abi::_VecBuiltin<16> >, ⠶simd_mask<long long int, 
⠶simd_abi::_VecBuiltin<24> >, ⠶simd_mask<long long int, ⠶simd_abi::_VecBuiltin<32> >, ⠶simd_mask<long int, ⠶simd_abi::_Scalar>, ⠶simd_mask<long int, ⠶simd_abi::_VecBuiltin<16> >, 
⠶simd_mask<long int, ⠶simd_abi::_VecBuiltin<24> >, ⠶simd_mask<long int, ⠶simd_abi::_VecBuiltin<32> >, ⠶simd_mask<long long unsigned int, ⠶simd_abi::_Scalar>, ⠶simd_mask<long long unsigned int, 
⠶simd_abi::_VecBuiltin<16> >, ⠶simd_mask<long long unsigned int, ⠶simd_abi::_VecBuiltin<24> >, ⠶simd_mask<long long unsigned int, ⠶simd_abi::_VecBuiltin<32> >, ⠶simd_mask<long unsigned int, 
⠶simd_abi::_Scalar>, ⠶simd_mask<long unsigned int, ⠶simd_abi::_VecBuiltin<16> >, ⠶simd_mask<long unsigned int, ⠶simd_abi::_VecBuiltin<24> >, ⠶simd_mask<long unsigned int, 
⠶simd_abi::_VecBuiltin<32> >, ⠶simd_mask<char, ⠶simd_abi::_Scalar>, ⠶simd_mask<char, ⠶simd_abi::_VecBuiltin<8> >, ⠶simd_mask<char, ⠶simd_abi::_VecBuiltin<16> >, ⠶simd_mask<char, 
⠶simd_abi::_VecBuiltin<24> >, ⠶simd_mask<char, ⠶simd_abi::_VecBuiltin<32> >, ⠶simd_mask<wchar_t, ⠶simd_abi::_Scalar>, ⠶simd_mask<wchar_t, ⠶simd_abi::_VecBuiltin<8> >, ⠶simd_mask<wchar_t, 
⠶simd_abi::_VecBuiltin<16> >, ⠶simd_mask<wchar_t, ⠶simd_abi::_VecBuiltin<24> >, ⠶simd_mask<wchar_t, ⠶simd_abi::_VecBuiltin<32> >}]’
$workspace/cpp_libs/std-simd/tests/mask.cpp:108:1:   required from here
$workspace/cpp_libs/std-simd/tests/../experimental/bits/simd_builtin.h:2853:29: warning: comparison of integer expressions of different signedness: ‘int’ and 
‘⠶integral_constant<long unsigned int, 31>::value_type’ {aka ‘long unsigned int’} [-Wsign-compare]
  2. building error from _SimdImplBuiltin::_S_isfinite
    If we use _SimdImplBuiltin::_S_isfinite directly, there is a build error:
#workspace/cpp_libs/std-simd/tests/../experimental/bits/simd_builtin.h:2287:33: error: could not convert ‘((((__vector(16) int)__absn) <= ((__vector(16) int)__maxn)) ? (__vector(16) int){-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1} : (__vector(16) int){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0})’ from ‘__vector(16) int’ to ‘std::experimental::parallelism_v2::_SimdImplBuiltin<std::experimental::parallelism_v2::simd_abi::_VecBltnBtmsk<64> >::_MaskMember<float>’ {aka ‘std::experimental::parallelism_v2::_SimdWrapper<bool, 16, void>’}
 2287 |         return (__absn <= __maxn);
      |                                 ^
      |                                 |
      |                                 __vector(16) int
#workspace/cpp_libs/std-simd/tests/../experimental/bits/simd_builtin.h: In instantiation of ‘static std::experimental::parallelism_v2::_SimdImplBuiltin<_Abi>::_MaskMember<_Tp> std::experimental::parallelism_v2::_SimdImplBuiltin<_Abi>::_S_isfinite(std::experimental::parallelism_v2::_SimdWrapper<_Tp, _Np>) [with _Tp = double; long unsigned int _Np = 8; _Abi = std::experimental::parallelism_v2::simd_abi::_VecBltnBtmsk<64>; std::experimental::parallelism_v2::_SimdImplBuiltin<_Abi>::_MaskMember<_Tp> = std::experimental::parallelism_v2::_SimdWrapper<bool, 8, void>]’:
#workspace/cpp_libs/std-simd/tests/../experimental/bits/simd_x86.h:3012:29:   required from ‘static std::experimental::parallelism_v2::_SimdImplX86<_Abi>::_MaskMember<_Tp> std::experimental::parallelism_v2::_SimdImplX86<_Abi>::_S_isfinite(std::experimental::parallelism_v2::_SimdWrapper<_Tp, _Np>) [with _Tp = double; long unsigned int _Np = 8; _Abi = std::experimental::parallelism_v2::simd_abi::_VecBltnBtmsk<64>; std::experimental::parallelism_v2::_SimdImplX86<_Abi>::_MaskMember<_Tp> = std::experimental::parallelism_v2::_SimdWrapper<bool, 8, void>]’
#workspace/cpp_libs/std-simd/tests/../experimental/bits/simd_math.h:1334:1:   required from ‘std::enable_if_t<is_floating_point_v<_Tp>, _R> std::experimental::parallelism_v2::isfinite(std::experimental::parallelism_v2::simd<_Tp, _Ap>) [with _Tp = double; _Abi = std::experimental::parallelism_v2::simd_abi::_VecBltnBtmsk<64>; <template-parameter-1-3> = {}; _R = std::experimental::parallelism_v2::simd_mask<double, std::experimental::parallelism_v2::simd_abi::_VecBltnBtmsk<64> >; std::enable_if_t<is_floating_point_v<_Tp>, _R> = std::experimental::parallelism_v2::simd_mask<double, std::experimental::parallelism_v2::simd_abi::_VecBltnBtmsk<64> >]’
#workspace/cpp_libs/std-simd/tests/math.cpp:159:2:   required from ‘static void Tests::fpclassify_<V>::run() [with V = std::experimental::parallelism_v2::simd<double, std::experimental::parallelism_v2::simd_abi::_VecBltnBtmsk<64> >]’
#workspace/cpp_libs/std-simd/tests/virtest/vir/test.h:1015:33:   required from ‘int vir::test::detail::addTestInstantiations(const char*, vir::Typelist<Bs ...>) [with TestWrapper = Tests::fpclassify_; Ts = {std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_VecBltnBtmsk<64> >, std::experimental::parallelism_v2::simd<double, std::experimental::parallelism_v2::simd_abi::_VecBltnBtmsk<64> >}]’
#workspace/cpp_libs/std-simd/tests/math.cpp:125:1:   required from here
#workspace/cpp_libs/std-simd/tests/../experimental/bits/simd_builtin.h:2287:33: error: could not convert ‘((((__vector(8) long int)__absn) <= ((__vector(8) long int)__maxn)) ? (__vector(8) long int){-1, -1, -1, -1, -1, -1, -1, -1} : (__vector(8) long int){0, 0, 0, 0, 0, 0, 0, 0})’ from ‘__vector(8) long int’ to ‘std::experimental::parallelism_v2::_SimdImplBuiltin<std::experimental::parallelism_v2::simd_abi::_VecBltnBtmsk<64> >::_MaskMember<double>’ {aka ‘std::experimental::parallelism_v2::_SimdWrapper<bool, 8, void>’}
ninja: build stopped: subcommand failed.

Add SVE2 instructions to SIMD.

Enhancement
Hello, @std-simd
I would like to add SVE2 instructions via intrinsics/inline assembly to std-simd, to process data more quickly and further optimize the library.

unittest failed by math_avx512_ldouble_float_double_schar_

version / revision Operating System Compiler & Version Compiler Flags CPU
std_simd::master Linux gcc (GCC) 11.0.0 20210119 (experimental) make test x86_64

Testcase


Actual Results

29904 99% tests passed, 9 tests failed out of 4821
29905 
29906 Label Time Summary:
29907 AVX       = 2250.66 sec*proc (804 tests)
29908 AVX2      = 2282.78 sec*proc (803 tests)
29909 AVX512    = 1805.20 sec*proc (803 tests)
29910 KNL       =    0.13 sec*proc (3 tests)
29911 SSE       =    0.22 sec*proc (5 tests)
29912 SSE2      = 2840.78 sec*proc (801 tests)
29913 SSE4_2    = 2787.19 sec*proc (801 tests)
29914 SSSE3     = 2892.13 sec*proc (801 tests)
29915 
29916 Total Test time (real) = 14874.98 sec
29917 
29918 The following tests FAILED:
29919     3592 - math_avx512_ldouble_float_double_schar_uchar_0 (Failed)
29920     3593 - math_avx512_ldouble_float_double_schar_uchar_1 (Failed)
29921     3594 - math_avx512_ldouble_float_double_schar_uchar_2 (Failed)
29922     3595 - math_avx512_ldouble_float_double_schar_uchar_3 (Failed)
29923     3596 - math_avx512_ldouble_float_double_schar_uchar_4 (Failed)
29924     3597 - math_avx512_ldouble_float_double_schar_uchar_5 (Failed)
29925     3598 - math_avx512_ldouble_float_double_schar_uchar_6 (Failed)
29926     3599 - math_avx512_ldouble_float_double_schar_uchar_7 (Failed)
29927     3600 - math_avx512_ldouble_float_double_schar_uchar_8 (Failed)
29928 FAILED: CMakeFiles/test_random
29929 cd /home/zhongxiao.yzx/workspace/cpp_libs/std-simd/build-disk1-zhongxiao.yzx-compiler-gcc-release-bin-g++ && /usr/local/bin/ctest --schedule-random
29930 ninja: build stopped: subcommand failed.
29931 make: *** [Makefile:37: test] Error 1

Expected Results

100% Passed

Two questions for clarification

I'm sorry for posting this as an issue (which it most likely isn't), but I don't know the proper way of asking questions about the code.

I'm currently switching one of my libraries to make direct use of std::experimental::simd when available, and there are two things I can't work out for myself:

  • is there a way to determine if the underlying architecture natively supports certain SIMD types? Something like
    template<typename scalar_type, int len> constexpr inline bool simd_exists;
    where, e.g. on a system with SSE2, simd_exists<float,4> would be true, but simd_exists<float,2> would be false?
  • Is there an "official" way to cast a simd object to its underlying low-level type, like native_simd<double> to __m256d on a machine with AVX? There are a few pretty special situations where I have to use custom functions with intrinsics for best performance.

I have tried to find the answers in the specification, but I fear I'm not good enough at standardese...

Thanks a lot, and I'm happy to continue discussion in a more suitable place if needed,
Martin
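A sketch of what the first question might look like in code, assuming libstdc++'s simd_abi::deduce is SFINAE-friendly. The trait names here are hypothetical, not part of the TS, and because deduce_t can fall back to fixed_size this detects that a type is expressible, not that the hardware supports it natively; a rough proxy for native support is a comparison against native_simd<T>::size():

```cpp
#include <experimental/simd>
#include <type_traits>

namespace stdx = std::experimental;

// Hypothetical trait (not in the TS): true when an ABI tag can be deduced
// for element type T at width N.
template <class T, int N, class = void>
constexpr bool simd_exists = false;

template <class T, int N>
constexpr bool simd_exists<T, N, std::void_t<stdx::simd_abi::deduce_t<T, N>>>
  = true;

// Rough proxy for "natively supported width": does N fit into native_simd<T>?
template <class T, int N>
constexpr bool fits_native = N <= int(stdx::native_simd<T>::size());

static_assert(simd_exists<float, 1>);
static_assert(fits_native<float, 1>);
```

For the second question, the TS defines no official cast to __m128/__m256d; the underlying representation is implementation-defined.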

Thank you

This is just a heads-up. I am one of the maintainers of the Einstein Toolkit https://einsteintoolkit.org which contains code to solve the Einstein equations on HPC systems with various architectures. We have, over the years, created a module to support explicit SIMD vectorization in very much the same spirit as this library (see https://bitbucket.org/cactuscode/cactusutils/src/master/Vectors/).

I have always been looking for a community standard for such a library, and I hope that std-simd will be / become this standard. At first glance, you seem to have all the required features (various architectures, integer and logical vectors, masked operations, etc.), obviously with a much cleaner design than our code.

All the best! Depending on whether I can convince my colleagues, we might start contributing towards this library – or towards VcDevel/Vc instead if you think that would be more appropriate.

setZero() idiom generates suboptimal code on avx512 platforms (icelake-client)

When using where(mask, vector) = 0.0f, optimal code is generated only on AVX2 platforms.

Godbolt for AVX2 (skylake): https://godbolt.org/z/2Jqddn
Godbolt for AVX512 (icelake-client): https://godbolt.org/z/l36_Zm

Testcase

using float_v = std::experimental::native_simd<float>;
using int_v = std::experimental::fixed_size_simd<int32_t, float_v::size()>;


float_v f(float_v value, float_v::mask_type mask) {
    where(mask, value) = 0.0f;
    return value;
}

Actual Results

  vxorps %xmm1, %xmm1, %xmm1
  kmovw %edi, %k1
  vmovaps %zmm1, %zmm0{%k1}

Expected Results

vandnps %ymm0, %ymm1, %ymm0

[Question] Optimizing the multiplication of two large vectors

I am currently trying to familiarize myself with SIMD and am faced with the task of optimally multiplying large vectors. Here is my test case:

#ifndef Included_TestSIMDVectorizing_H
#define Included_TestSIMDVectorizing_H

#include <omp.h>     // omp_get_wtime
#include <iostream>
#include <iomanip>   // std::setprecision
#include <fstream>
#include <cstdlib>   // srand, rand
#include <ctime>     // time
#include <memory>
#include <valarray>
#include "gtest/gtest.h"
#include "experimental/simd"

using dataType = float;

namespace std_simd = std::experimental::parallelism_v2;
using StdSimdV = std_simd::native_simd<dataType>;


const size_t RND_MAX = 100;
const size_t SIMD_SIZE = StdSimdV::size();
const size_t DATA_SIZE_MIN = SIMD_SIZE;
const size_t DATA_SIZE_MAX = SIMD_SIZE * 1e8;
const size_t DATA_SIZE_STEP = SIMD_SIZE * 1e5;

int main()
{
    std::ofstream l_logFile;
    l_logFile.open("./plot.csv", std::ofstream::trunc);
    l_logFile << "vec_type,size,runtime" << std::endl;
    l_logFile.close();

    for (size_t n = DATA_SIZE_MIN; n < DATA_SIZE_MAX; n+=DATA_SIZE_STEP)
    {
        l_logFile.open("./plot.csv", std::ofstream::app);

        float *l_vX = new float[n];
        float *l_vY = new float[n];

        srand(time(NULL));
        for (size_t i = 0; i < n; i++)
        {
            l_vX[i] = rand() % RND_MAX;
            l_vY[i] = rand() % RND_MAX;
        }

        float *l_vZ_autovec = new float[n];
        double l_tB_autovec = omp_get_wtime();
        for (size_t i = 0; i < n; i++)
        {
            l_vZ_autovec[i] = l_vX[i] * l_vY[i];
        }
        double l_t_autovec = omp_get_wtime() - l_tB_autovec;
        l_logFile << std::setprecision(16) << "autovec," << n << "," << l_t_autovec << std::endl;

        float *l_vZ_std_simd_out = new float[n];
        StdSimdV l_vZ_std_simd;

        double l_tB_std_simd = omp_get_wtime();
        for (size_t i = 0; i < n; i+=SIMD_SIZE)
        {
            l_vZ_std_simd = StdSimdV(l_vX+i, std_simd::element_aligned) * StdSimdV(l_vY+i, std_simd::element_aligned);
            l_vZ_std_simd.copy_to(l_vZ_std_simd_out+i, std_simd::element_aligned);
        }
        double l_t_std_simd = omp_get_wtime() - l_tB_std_simd;
        l_logFile << std::setprecision(16) << "std_simd," << n << "," << l_t_std_simd << std::endl;

        l_logFile.close();

        delete [] l_vX;
        delete [] l_vY;
        delete [] l_vZ_autovec;
        delete [] l_vZ_std_simd_out;
    }
    return 0;
}

#endif      // Included_TestSIMDVectorizing_H

The following plot shows the comparison of the runtimes between the multiplication with the auto-vectorization of the compiler and the vectorization with std-simd:

[Plot (2020-05-19): runtime comparison of autovectorized vs. std-simd multiplication]

Is my use of std-simd correct and/or can the multiplication be further optimized?
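One caveat in the testcase above: the std-simd loop assumes n is a multiple of SIMD_SIZE. A sketch of the same kernel with a scalar tail for arbitrary n (the function name is mine, not from the testcase):

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// Hypothetical kernel: z[i] = x[i] * y[i] with a vector main loop and a
// scalar remainder loop, so n need not be a multiple of the vector width.
void mul_arrays(const float* x, const float* y, float* z, std::size_t n) {
  using V = stdx::native_simd<float>;
  std::size_t i = 0;
  for (; i + V::size() <= n; i += V::size()) {
    const V a(x + i, stdx::element_aligned);
    const V b(y + i, stdx::element_aligned);
    (a * b).copy_to(z + i, stdx::element_aligned);
  }
  for (; i < n; ++i)  // scalar tail
    z[i] = x[i] * y[i];
}
```

Note that an element-wise multiply streams through memory once, so both the autovectorized and the std-simd variants are typically memory-bandwidth-bound at large n; similar timings are expected.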

Reduce to a bigger type.

When reducing a SIMD data type containing, say, chars using addition, the resulting type is char, likely leading to overflows. Traditional solutions would use the SAD instructions (Sum of Absolute Differences) to do a partial sum into bigger types, then use a reduce on the bigger types.

I don't see the SAD functions used in the codebase, nor a way to ask for a different return type using reduce. Am I missing something? Is the standard offering that functionality?
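As far as I can tell the TS offers no reduce-to-wider-type overload. One workaround sketch (not an official API, and it will not use SAD instructions) is to widen the whole vector with static_simd_cast before reducing, so the accumulation happens in the wider type:

```cpp
#include <experimental/simd>

namespace stdx = std::experimental;

// Hypothetical helper: sum the lanes of a signed-char vector in int
// precision, avoiding the char overflow a plain reduce() would risk.
int sum_chars(stdx::native_simd<signed char> v) {
  using widened = stdx::rebind_simd_t<int, stdx::native_simd<signed char>>;
  return stdx::reduce(stdx::static_simd_cast<widened>(v));
}
```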

ldexp improvement

Only a small suggestion:

I don't understand very much of the code or why some things are implemented the way they are, but for implementing ldexp, _mm512_scalef_ps would probably be a good fit? 😉

Support for type traits

Hi, I have not read the technical specification, but shouldn't std::is_arithmetic<native_simd<float>>::value be true ?

More generally, it would be nice to support <type_traits>.
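For reference, a quick check of what the TS currently gives you: simd is specified as a class type, so the primary type categories hold for its value_type rather than for the vector itself.

```cpp
#include <experimental/simd>
#include <type_traits>

namespace stdx = std::experimental;

// simd<T, Abi> is a class type, so std::is_arithmetic is false for it;
// the arithmetic traits apply to its element type instead.
static_assert(!std::is_arithmetic_v<stdx::native_simd<float>>);
static_assert(std::is_arithmetic_v<stdx::native_simd<float>::value_type>);
static_assert(std::is_class_v<stdx::native_simd<float>>);
```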

How to cast simd_mask of type T to another type?

version / revision | Operating System | Compiler & Version | Compiler Flags | CPU
GCC (trunk/all) | x86-64 | GCC (trunk/all) | -mavx2 |

Testcase

using doublev = std::experimental::native_simd<double>;
using size_tv = std::experimental::rebind_simd_t<std::size_t, doublev>;

struct Bound
{
    std::vector<doublev> vals;
    std::vector<size_tv> idxs;
};

void min(Bound& a, const Bound& b) {
    for (std::size_t i = 0; i < a.vals.size(); ++i)
    {
        const auto otherIsLess = b.vals[i] < a.vals[i];
        std::experimental::where(otherIsLess, a.vals[i]) = b.vals[i];
        std::experimental::where(otherIsLess, a.idxs[i]) = b.idxs[i];
    }
}

Actual Results

This code doesn't compile (https://godbolt.org/z/666G5YTqK)

Expected Results

I couldn't find anything on cppreference indicating the correct pattern for this use case: how can I apply a mask to SIMD types with different element types?

installation issue

Hi there,
Does anybody know how can I solve the below issue?
Thanks in advance!

root@ubuntu:~/work/std-simd# ./install.sh 
Testing that g++ can include <experimental/simd> without errors:
In file included from experimental/simd:56,
                 from <stdin>:1:
experimental/bits/simd.h:177:14: error: '__bit_ceil' is not a member of 'std'
  177 |       = std::__bit_ceil(sizeof(_Up) * _Tp::size());
      |              ^~~~~~~~~~
compilation terminated due to -fmax-errors=1.

Compile test failed.
*********************************************
 Note that std-simd requires at least GCC 9.
*********************************************

root@ubuntu:~/work/std-simd# gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@ubuntu:~/work/std-simd# 

Is the reference returned by operator[] too restrictive?

By disallowing binding of simd<>::reference to an lvalue, the code below (ref. https://godbolt.org/z/7qW8bffe5) becomes clumsy:

#include <iostream>
#include <experimental/simd>

namespace stdx = std::experimental;

void foo(auto&& x)
{
    std::forward<decltype(x)>(x) = .2;
    //x = .3;
}

int main()
{
  stdx::simd<double> x = 0;
  x[0] = .1;
  foo(x[1]);
  for (int i = 0; i < x.size(); ++i) std::cout << x[i] << ' ';
}

The x = .3 statement doesn't compile, since the assignment operator is declared as reference::operator=(...) && and x is an lvalue. Why is operator= restricted to rvalues?
Is it known that the above code becomes clumsy because of this restriction?

Question about simd_abi::max_fixed_size

Hi,

I'm struggling to understand the purpose of simd_abi::max_fixed_size.

For example, on a Broadwell processor where native_simd<double>::size() is 4, simd_abi::max_fixed_size<double> is 32.

Why isn't it 64 or 128? In other words, what limits simd_abi::max_fixed_size<double> in this case?

Thanks,
Christos

About a potential simd matrix extension ..

Hi,
With GCC 14 adding ARM SVE1/2 std::experimental::simd support, target support is essentially complete (minus RISC-V).
So is anyone already looking at a potential std::experimental::simd_matrix proposal, with Intel AMX and ARM SME1/2 support and (via hacks) even Apple AMX?
Is it already planned, and are there early (software) implementations/prototypes?

Install script error

version / revision | Operating System | Compiler & Version | Compiler Flags | CPU
fade8f9 | Ubuntu 18.04.4 LTS | gcc version 7.4.0 | None | AMD Ryzen 7 3700X

Testcase

Actual Results

 jaehun@deepmi  ~/workspace/std-simd $ sudo ./install.sh       
Testing that g++ can include <experimental/simd> without errors:
In file included from experimental/simd:54:0,
                 from <stdin>:1:
experimental/bits/simd.h:450:27: warning: ‘__gnu__::__always_inline__’ scoped attribute directive ignored [-Wattributes]
 __bit_cast(const _From __x)
                           ^
experimental/bits/simd.h:450:27: warning: ‘__gnu__::__artificial__’ scoped attribute directive ignored [-Wattributes]
[... many further ‘__gnu__::__always_inline__’ / ‘__gnu__::__artificial__’ -Wattributes warnings of the same form elided ...]
experimental/bits/simd.h:927:32: error: ‘__remove_cvref_t’ was not declared in this scope
      __is_narrowing_conversion<__remove_cvref_t<_From>, _To>>::value>>
                                ^~~~~~~~~~~~~~~~
experimental/bits/simd.h:927:32: note: suggested alternative: ‘remove_cv_t’
      __is_narrowing_conversion<__remove_cvref_t<_From>, _To>>::value>>
                                ^~~~~~~~~~~~~~~~
                                remove_cv_t
compilation terminated due to -fmax-errors=1.
Failed.

missing gather/scatter

Going through the 'Working Draft, C++ Extensions for Parallelism Version 2', I could not find any reference to gather/scatter functions. How is this supposed to be done? Additionally, I think that strided gather/scatter, where the data are equidistant in memory, would also be desirable.
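For what it's worth, the TS's generator constructor can express both patterns portably. A sketch (the helpers `gather` and `strided_load` are hypothetical names, not library API, and nothing guarantees the compiler emits a hardware gather instruction):

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// Gather via the simd generator constructor: lane i reads data[idx[i]].
// Portable formulation only; hardware gathers are up to the compiler.
template <class T, int N>
stdx::fixed_size_simd<T, N> gather(const T* data, const int (&idx)[N])
{
  return stdx::fixed_size_simd<T, N>([&](auto i) { return data[idx[i]]; });
}

// Strided load as the special case idx[i] = i * stride.
template <class T, int N>
stdx::fixed_size_simd<T, N> strided_load(const T* data, std::size_t stride)
{
  return stdx::fixed_size_simd<T, N>([&](auto i) { return data[i * stride]; });
}
```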

sin and cos of std::simd come out wrong with clang++

version / revision: repo as per Dec 14 2020
Operating System: Ubuntu 20.04
Compiler & Version: clang++ 10.0.0
Compiler Flags: --std=c++17
CPU: Haswell Core i5

Here's a strange one, affecting only clang++, both for single and double precision. Other trigonometric functions I tested come out right.

Testcase

#include <iostream>
#include <experimental/simd>

using f4_t = std::experimental::simd<
    float, std::experimental::simd_abi::fixed_size<4>>;

int main()
{
  f4_t f4(.5f);

  f4_t r4 = sin(f4);
  std::cout << "sin " << f4[0] << " -> " << r4[0] << std::endl;

  r4 = cos(f4);
  std::cout << "cos " << f4[0] << " -> " << r4[0] << std::endl;
}

compiled with:
clang++ -std=c++17 sincos.cc -osincos

Actual Results

sin 0.5 -> 0
cos 0.5 -> 1

Compiled with g++ -std=c++17 sincos.cc -osincos, the output is as expected:

Expected Results

sin 0.5 -> 0.479426
cos 0.5 -> 0.877583

Latest version of std-simd is 1.38x slower, potentially because of preferring partial registers on AVX-512

The master version of std-simd (de882b7) runs 1.38x slower than a version from July (24be5a4). This happens because AVX-512 registers are disabled in master.

The picture below should explain the performance drop: Icelake has 3x256-bit pipelines but only 2x512-bit pipelines, which gives a theoretical performance difference of 1.33x.

(image: Icelake pipeline diagram)

version / revision: de882b7
Operating System: Win10 + MinGW
Compiler & Version: GCC 9.2
Compiler Flags: -march=icelake-client -ffixed-xmm16 -ffixed-xmm17 -ffixed-xmm18 -ffixed-xmm19 -ffixed-xmm20 -ffixed-xmm21 -ffixed-xmm22 -ffixed-xmm23 -ffixed-xmm24 -ffixed-xmm25 -ffixed-xmm26 -ffixed-xmm27 -ffixed-xmm28 -ffixed-xmm29 -ffixed-xmm30 -ffixed-xmm31
CPU: i7-1065G7 (Icelake)

Testcase

See attached zip file:
tst-int-oct.zip

Actual Results (de882b7)

Calculation finished (float): 313

Expected Results (24be5a4)

Calculation finished (float): 228

hmin and / or imin

Hi there,

Thanks a lot for your work on std-simd; it's looking very good so far!

I was attempting my first code at using your library. I have previously used Vc, as well as other vectorization libraries. I bumped into the following use-case: I would like to calculate the index of the lowest element in a vector, out of some active elements in a mask. Looking at the following document:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4808.pdf

I found that there is hmin, which can accept a const_where_expression. That sort of does what I want, but not quite, since it returns the value of the lowest element, not its index. To then fetch the index, I figured one can create another mask comparing against the obtained value and find the first set index of that, i.e. see the following pseudo-code:

Testcase

const auto best_scatter = std::experimental::parallelism_v2::hmin(std::experimental::const_where_expression(mask, scatter));
const auto best_mask = best_scatter == scatter;
const auto best_scatter_index = std::experimental::parallelism_v2::find_first_set(best_mask);

With this, I have two questions:

  • I cannot find hmin anywhere in the library. Perhaps I am looking at an outdated documentation?
  • The above code looks cumbersome. Would you suggest a better practice using std-simd?
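The two-step pattern described above can be packaged into a small helper; a sketch assuming GCC's <experimental/simd>, where hmin lives in namespace std::experimental::parallelism_v2 (`index_of_min` is a hypothetical name, not library API):

```cpp
#include <experimental/simd>

namespace stdx = std::experimental;

// Index of the smallest lane: hmin gives the value, an equality mask plus
// find_first_set gives its position. On ties the lowest index wins.
template <class T, class Abi>
int index_of_min(const stdx::simd<T, Abi>& v)
{
  const T m = stdx::hmin(v);
  return stdx::find_first_set(v == m);
}
```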

Missing documentation about how to use the library.

Hi there,
I am an experienced C++ programmer, but I'm completely lost when it comes to SIMD operations. I have been trying your library for over a week now and I still cannot figure out how to make it more performant than the straightforward way.

In my particular case, I am trying to implement a SAXPY operation according to the BLAS standard using SIMD operations. My vectors are huge, and still the straightforward way is much faster. I appended the two examples with my performance measurements at the bottom.

The copy operations to the buffer array are the most time-consuming part. My suspicion is that these copy operations are not needed and that filling the native_simd<float> can happen much more implicitly. However, I haven't figured out how to do it yet, and searching the source code confuses me even more.

By the way, I already copied the values directly to the native_simd<float> object, with very similar results.

Could you please provide an example of how to actually feed data into the native_simd<float> vector properly? I would really appreciate it and will surely contribute some examples on how to use the library, since I think documentation is the key to getting this neat library into the C++ core libraries.

Obvious way

template<class numeric_t>
void normal_axpy(const int n, const numeric_t a, 
        const numeric_t* x, const int inc_x, numeric_t* y, const int inc_y) {
    for(auto i = 0, i_x = 0, i_y = 0; i < n; ++i, i_x += inc_x, i_y += inc_y) {
        y[i_y] = a * x[i_x] + y[i_y];
    }
}

Results

Results for 'Conventional'; Vector Dimension: 1000; Iterations: 10000
 -- Min: 3 (3.00 µs)
 -- Max: 45 (45.00 µs)
 -- Average: 5.0503 (5.05 µs)
 -- Median: 5 (5.00 µs)
 -- Lower Quartile: 4 (4.00 µs)
 -- Upper Quartile: 5 (5.00 µs)

Most probably wrong SIMD way

template<class numeric_t>
void axpy(const int n, const numeric_t a, 
        const numeric_t* x, const int inc_x, numeric_t* y, const int inc_y) {
    constexpr int _chunk_size = native_simd<numeric_t>::size();

    // Broadcast `a` into every lane; no per-element loop needed.
    const native_simd<numeric_t> aa = a;
    native_simd<numeric_t> xx, yy;

    int i_x = 0, i_y = 0, i_yr = 0;

    numeric_t buffer[_chunk_size];
    for (int i = 0; i < n; i += _chunk_size) {
        const int num_elements = std::min(_chunk_size, n - i);
        for (int j = 0; j < num_elements; ++j, i_x += inc_x)
            buffer[j] = x[i_x];
        xx.copy_from(buffer, element_aligned);

        for (int j = 0; j < num_elements; ++j, i_y += inc_y)
            buffer[j] = y[i_y];
        yy.copy_from(buffer, element_aligned);

        yy = aa * xx + yy;

        yy.copy_to(buffer, element_aligned);
        for (int j = 0; j < num_elements; ++j, i_yr += inc_y)
            y[i_yr] = buffer[j];
    }
}

Results

Results for 'SIMD'; Vector Dimension: 1000; Iterations: 10000
 -- Min: 18 (18.00 µs)
 -- Max: 118 (118.00 µs)
 -- Average: 26.0297 (26.03 µs)
 -- Median: 27 (27.00 µs)
 -- Lower Quartile: 21 (21.00 µs)
 -- Upper Quartile: 28 (28.00 µs)
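For comparison, in the unit-stride case (inc_x == inc_y == 1) the buffer staging can be dropped entirely, since the load constructor and copy_to can touch the arrays directly; a sketch (`saxpy` is a hypothetical helper, and the strided case would still need a staging buffer or a gather):

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// SAXPY for the unit-stride case: load directly from the arrays instead of
// staging through a buffer; the tail that does not fill a whole vector is
// handled with a scalar loop.
void saxpy(std::size_t n, float a, const float* x, float* y)
{
  using V = stdx::native_simd<float>;
  std::size_t i = 0;
  for (; i + V::size() <= n; i += V::size()) {
    V xv(x + i, stdx::element_aligned);   // direct load, no copy loop
    V yv(y + i, stdx::element_aligned);
    yv = a * xv + yv;
    yv.copy_to(y + i, stdx::element_aligned);
  }
  for (; i < n; ++i)                      // scalar tail
    y[i] = a * x[i] + y[i];
}
```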

g++-9 compiler error

I am using g++-9 on macOS to compile your code, and I got the following errors:

Errors:
[1]: error: expected ')' before '&&' token  static constexpr void __unused(_Tp&&)
[2]: error: 'ushort' was not declared in this scope

Solution:
[1]: add "#undef __unused" in simd.h (the __unused macro is already defined in cdefs.h)
[2]: add "using ushort = unsigned short;" (ushort is not cross-platform)

virtest inconsistency

git clone https://github.com/VcDevel/std-simd -b master --depth 10 --recurse-submodules --shallow-submodules 
Cloning into 'std-simd'...
remote: Enumerating objects: 158, done.
remote: Counting objects: 100% (158/158), done.
remote: Compressing objects: 100% (104/104), done.
remote: Total 158 (delta 65), reused 103 (delta 52), pack-reused 0
Receiving objects: 100% (158/158), 231.28 KiB | 883.00 KiB/s, done.
Resolving deltas: 100% (65/65), done.
Submodule 'tests/virtest' (https://github.com/mattkretz/virtest) registered for path 'tests/virtest'
Cloning into '/home/philix/git/std-simd/tests/virtest'...
remote: Enumerating objects: 28, done.        
remote: Counting objects: 100% (28/28), done.        
remote: Compressing objects: 100% (26/26), done.        
remote: Total 28 (delta 6), reused 8 (delta 0), pack-reused 0        
remote: Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
error: Server does not allow request for unadvertised object 88a3c9ab64469fffff0c07e78f983f15a3e3c9e1
Fetched in submodule path 'tests/virtest', but it did not contain 88a3c9ab64469fffff0c07e78f983f15a3e3c9e1. Direct fetching of that commit failed.

status of AVX-512 and Vc

Hi,
I have been using Vc with our biomechanics code, where it proved very beneficial: around 30% of peak performance even in non-academic scenarios 👍
Now it would be nice to be able to use AVX-512 on an Intel Cascade Lake processor (using GCC 10.2, Ubuntu 20.04). I'm not really up to date with the status of Vc and std-simd, so I thought I'd just ask here about it:

  1. What is the status of Vc in terms of AVX-512? v1.4 doesn't support it. I checked out the branch mkretz/datapar and could run simple AVX-512 arithmetic; however, things like Vc::abs, Vc::exp, Vc::iif seem to not exist. I tried merging the branch mkretz/datapar into 1.4, which gave numerous conflicts. A lot of them were files that had been renamed differently, but some conflicts were also in the code. I am completely lost here and not able to do this merge.
  2. The alternative would be to switch to std-simd. How stable and feature-complete is this code base already?
    My use case would be including the headers directly instead of requiring our users to patch and rebuild their GCC. I would probably start by defining a compatibility layer such as
namespace Vc
{
  template<typename T,int n>
  using array = std::experimental::fixed_size_simd<T,n>;
  using double_v = std::experimental::native_simd<double>;
}

The symbols of Vc used in our code are: Vc::double_v, Vc::int_v and *::size(), Vc::Zero, Vc::One, Vc::Allocator, Vc::abs, Vc::log, Vc::exp, Vc::isnegative, Vc::isfinite, Vc::all_of, Vc::any_of, Vc::where.

I really appreciate your efforts for getting this into the C++ standard in the long term. Maybe it would be nice to have a short-term solution building on Vc, too? (At least a subset for Intel Cascade Lake.)
Would it be possible for somebody to implement a Vc 1.5 with AVX-512? Imho the big advantage of Vc from a user's perspective is its good documentation. Or is std-simd already on the same level as Vc feature-wise, with only the documentation not yet there? In that case, could somebody write me the mentioned compatibility layer or guide me in the right direction?
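A minimal sketch of such a shim, assuming GCC's <experimental/simd> is available; every mapping here is an assumption that needs verification against both libraries, and Vc::Zero, Vc::One, Vc::Allocator and Vc::where have no one-line equivalent:

```cpp
#include <experimental/simd>

// Hypothetical compatibility shim for a few of the listed Vc symbols,
// built on std::experimental::simd. Not a complete or verified mapping.
namespace Vc
{
  namespace stdx = std::experimental;

  using double_v = stdx::native_simd<double>;
  using int_v    = stdx::native_simd<int>;

  using stdx::all_of;   // mask reductions map directly
  using stdx::any_of;

  template <class T, class Abi>
  auto abs(const stdx::simd<T, Abi>& v) { return stdx::abs(v); }

  template <class T, class Abi>
  auto exp(const stdx::simd<T, Abi>& v) { return stdx::exp(v); }
}
```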

none_of() generates extra memory access on AVX2 in comparison to Vc 1.4

version / revision: master (dd9a3f3)
Operating System: Windows 10
Compiler & Version: GCC 9.2.0 (MSYS/MinGW)
Compiler Flags: -O2 -std=c++17 -DNDEBUG -fabi-version=0 -ffp-contract=fast -fext-numeric-literals -mavx2
CPU: AVX2 platform

Compiler Explorer for Vc 1.4: https://godbolt.org/z/6gNtWq
Compiler Explorer for std-simd: https://godbolt.org/z/3cQHE4

Testcase

// std-simd code
bool compareVectors2(float *a, float *b) {
    Vc::float_v aVec(a, Vc::vector_aligned);
    Vc::float_v bVec(b, Vc::vector_aligned);

    bVec *= 2.0f;

    return none_of(aVec == bVec);
}

// Vc 1.4 code
bool compareVectors2(float *a, float *b) {
    Vc::float_v aVec(a, Vc::Aligned);
    Vc::float_v bVec(b, Vc::Aligned);

    bVec *= 2.0f;

    return (aVec == bVec).isEmpty();
}

Actual Results

compareVectors2(float*, float*):
  vmovaps ymm0, YMMWORD PTR [rsi]
  vaddps ymm0, ymm0, ymm0
  vcmpeqps ymm0, ymm0, YMMWORD PTR [rdi]
  vtestps ymm0, YMMWORD PTR .LC0[rip]
  sete al
  vzeroupper
  ret
.LC0:
  .long 4294967295
  .long 4294967295
  .long 4294967295
  .long 4294967295
  .long 4294967295
  .long 4294967295
  .long 4294967295
  .long 4294967295

Expected Results

compareVectors2(float*, float*):
  vmovaps ymm0, YMMWORD PTR [rsi]
  vaddps ymm0, ymm0, ymm0
  vcmpeqps ymm0, ymm0, YMMWORD PTR [rdi]
  vtestps ymm0, ymm0
  sete al
  vzeroupper
  ret

PS:
In general, there is something really weird happening with std-simd: I see very heavy register spilling with std-simd on almost exactly the same code. I'm trying to debug it somehow; this is one of the traces I found. The other one involves the Vc::Zero and Vc::One constants, but it is not related to this particular bug.

Make `simd` and `simd_mask` ranges

version / revision: ?
Operating System: all
Compiler & Version: all
Compiler Flags: all
CPU: all

Motivation

The library already provides indexed access. Providing a range interface/iterator access would allow the type to be used with all STL functions. Even though it would degrade to element-wise operation, it would still vastly increase the ergonomics when some specific utility is not directly available for the simd type. It could also pave the way for specializing some STL functions, like std::min_element, instead of providing hmin, addressing #18.

Let me know if you think this is a direction worth taking and I can write a PR

Testcase

#include <experimental/simd>
#include <iostream>

namespace stdx = std::experimental;

int main() {
  stdx::native_simd<int> a{[](int i) { return 2 << (i - 1); }};

  for (auto v : a) {
    std::cout << v << '\n';
  }
}

Actual Results

$ g++ simd-range.cpp && ./a.out 
simd-range.cpp: In function ‘int main()’:
simd-range.cpp:13:17: error: ‘begin’ was not declared in this scope; did you mean ‘std::begin’?
   13 |   for (auto v : a) {
      |                 ^
      |                 std::begin
In file included from /usr/include/c++/13.2.1/string:53,
                 from /usr/include/c++/13.2.1/bitset:52,
                 from /usr/include/c++/13.2.1/experimental/bits/simd.h:33,
                 from /usr/include/c++/13.2.1/experimental/simd:74,
                 from simd-range.cpp:1:
/usr/include/c++/13.2.1/bits/range_access.h:114:37: note: ‘std::begin’ declared here
  114 |   template<typename _Tp> const _Tp* begin(const valarray<_Tp>&) noexcept;
      |                                     ^~~~~
simd-range.cpp:13:17: error: ‘end’ was not declared in this scope; did you mean ‘std::end’?
   13 |   for (auto v : a) {
      |                 ^
      |                 std::end
/usr/include/c++/13.2.1/bits/range_access.h:116:37: note: ‘std::end’ declared here
  116 |   template<typename _Tp> const _Tp* end(const valarray<_Tp>&) noexcept;
      | 

Expected Results

0
2
4
8
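Until such an interface exists, element-wise access has to go through operator[] and size(); a minimal sketch of the workaround:

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// Element-wise iteration today: an index loop over operator[] and size(),
// which is exactly the boilerplate a range interface would remove.
int sum_elements(const stdx::native_simd<int>& v)
{
  int s = 0;
  for (std::size_t i = 0; i < v.size(); ++i)
    s += v[i];
  return s;
}
```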

Support for `std::byte`

version / revision: ?
Operating System: Linux
Compiler & Version: gcc 13.2.1 20230801
Compiler Flags: none
CPU: N/A

Testcase

#include <experimental/simd>
#include <cstddef>

int main() {
  std::experimental::native_simd<std::byte> s;
}

Actual Results

$ g++ byte.cpp
In file included from /usr/include/c++/13.2.1/experimental/simd:74,
                 from byte.cpp:1:
/usr/include/c++/13.2.1/experimental/bits/simd.h: In substitution of ‘template<class _Tp> using std::experimental::parallelism_v2::native_simd = std::experimental::parallelism_v2::simd<_Tp, std::experimental::parallelism_v2::simd_abi::native<_Tp> > [with _Tp = std::byte]’:
byte.cpp:5:43:   required from here
/usr/include/c++/13.2.1/experimental/bits/simd.h:3018:9: error: no type named ‘type’ in ‘struct std::enable_if<false, void>’
 3018 |   using native_simd = simd<_Tp, simd_abi::native<_Tp>>;
      |         ^~~~~~~~~~~

Expected Results

Compiles with support for the same operators as std::byte

Let me know if you think this is a direction worth taking and I can write a PR
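As a possible workaround until std::byte is supported, the lane work can be done on unsigned char with conversions at the boundaries; a sketch (`load_bytes` is a hypothetical helper, not library API):

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// std::byte is not a vectorizable type in the TS; operate on unsigned char
// lanes instead and convert at the load/store boundaries.
stdx::native_simd<unsigned char> load_bytes(const std::byte* p)
{
  stdx::native_simd<unsigned char> v;
  v.copy_from(reinterpret_cast<const unsigned char*>(p), stdx::element_aligned);
  return v;
}
```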

Examples in README.md are not equivalent

The examples given in the front page (README.md) are not equivalent. That is, the examples showing native_simd and compiler intrinsics are not vectorization of the original scalar_product function because:

  • Their Vec3D array contains a different number of elements. In the original code, there are 3 floats in the array, and the modified examples use vector types for elements, where the vector type itself is a number of floats. The vectorized functions should operate on the same input data as the original. Yes, that means you should probably show how your library helps dealing with tail data that doesn't fit in a native vector.
  • The effect of the scalar_product function in the modified examples is different, as it does not produce a single float that is the sum of products of the input arrays. The vectorized code is missing the final reduction step.

I understand that you are trying to be concise in your front-page examples, but I still believe the examples should provide the user with a realistic comparison of the two ways to implement the same functionality. Without this equivalence, the comparison is pointless, as you are comparing apples to oranges.

clang support

libstdc++ is used with clang by many, too. It would be great if this would eventually also support clang. Do you already know the reasons (i.e. the list of missing compiler features and compiler bugs) why this doesn't work with clang? I think it would be nice to compile a list of the clang issue numbers here to track when clang will support this. Do you want help with that?

Are the benchmarks still executable?

version / revision: 1.0.0
Operating System: Arch Linux
Compiler & Version: GCC 11.1.0
Compiler Flags: see below
CPU: Ryzen 3700X

I tried to run the benchmarks like below but failed.

$ cd benchmarks
$ CXX=$(which g++) ./run.sh hypot2.cpp

The compiler gives a lot of errors. One of those complains that

bench.h:252:33: error: ‘_S_width’ is not a member of ‘std::experimental::parallelism_v2::_VectorTraitsImpl<__vector(4) float, void>’
  252 |         for (int i = 0; i < VT::_S_width; ++i) {                                         \
      |                                 ^~~~~~~~

So I searched through the code using ripgrep but found nothing in the lib.

$ cd std-simd
$ rg _S_width
benchmarks/bench.h
38:            return std::experimental::_VectorTraits<T>::_S_width;
252:        for (int i = 0; i < VT::_S_width; ++i) {                                         \

I guess the benchmarks are broken and unmaintained. If I am wrong, how does one run them?

Is this Out-of-bounds access?

version / revision: GCC 11
Operating System: Linux
Compiler & Version: GCC 11
Compiler Flags: -mavx2 -mfma -mavx
CPU: Tiger Lake

Testcase

namespace stdx = std::experimental;

static auto indexSimd = stdx::native_simd<int>{[](int i){ return i; }};

stdx::native_simd<int> b;
int a[3]{42, 43, 44};
stdx::where(indexSimd < 3, b).copy_from(a, stdx::element_aligned);

Does this access memory out of bounds? I am pretty sure this code would:

b.copy_from(a, stdx::element_aligned);

But I would expect the former to do a maskload such that it doesn't touch the shadow 4th element?

If this is UB, is there any good way to load a partial SIMD vector from a contiguous region safely? This would be very valuable for iterating over, for example, a std::vector whose length is not necessarily divisible by the vector width.
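One approach that is safe regardless of how the masked copy_from is implemented is to stage the tail through a zero-initialized, full-width buffer, at the cost of an extra copy; a sketch (`load_partial` is a hypothetical helper):

```cpp
#include <experimental/simd>
#include <cstddef>
#include <cstring>

namespace stdx = std::experimental;

// Safe partial load: copy the n valid elements into a zero-initialized
// buffer of full vector width, then load the whole buffer. Never reads
// past the end of `src`. Assumes n <= native_simd<int>::size().
stdx::native_simd<int> load_partial(const int* src, std::size_t n)
{
  using V = stdx::native_simd<int>;
  alignas(stdx::memory_alignment_v<V>) int buf[V::size()] = {};
  std::memcpy(buf, src, n * sizeof(int));
  return V(buf, stdx::vector_aligned);
}
```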

