db-tu-dresden / tsl

Template SIMD Library (+Generator)

License: GNU General Public License v3.0

Python 34.02% CMake 2.29% Shell 0.21% C++ 3.85% Verilog 0.38% HTML 58.19% SystemVerilog 1.06%

abstraction-database hardware-agnostic simd-intrinsics simd-programming

tsl's People

yuhta kgpai cmrschwarz ericmier ratusz tomschw dertuchi jpietrzyktud

tsl's Issues

Add support for logic operations (OR) in lscpu_flags

As it turns out, the flag that indicates ARM NEON support in lscpu (/proc/cpuinfo) is either "neon" or "asimd" (advanced SIMD). Currently, the list of lscpu_flags in a definition is treated as a conjunction, i.e., all flags must be available. However, there should be a way to express an either-or relationship.
This also applies to oneAPI FPGAs, since we can only identify the accelerator cards using lspci | grep accel (afaik). However, the Stratix 10 has a different hex value than the Agilex while providing the same set of functionalities in the scope of the TSL.

Maybe we can implement this quite "natively" using lists of lists for the lscpu_flags:

#old: lscpu_flags: ["neon"]
#new: lscpu_flags: [["neon"], ["asimd"]]
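The list-of-lists idea can be read as "any inner list must be fully available". A minimal sketch of such a check follows; the function name flags_satisfied and the treatment of a flat list as a single alternative (to stay compatible with the old format) are assumptions, not part of the current generator.

```python
def flags_satisfied(spec, available):
    """Return True if the available flags satisfy the lscpu_flags spec.

    A flat list (old format) is a conjunction: all flags are required.
    A list of lists (proposed format) is a disjunction of conjunctions:
    at least one inner list must be fully available.
    """
    available = set(available)
    # Old format: wrap the flat list so it becomes the only alternative.
    if all(isinstance(entry, str) for entry in spec):
        alternatives = [spec]
    else:
        alternatives = spec
    return any(set(alternative) <= available for alternative in alternatives)
```

With this reading, [["neon"], ["asimd"]] matches a machine that reports either flag, while ["neon"] keeps its current meaning.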

Some lscpu flags need an alias

When the TSL is generated, the required compiler arguments are derived from the lscpu list, obtained either through py-cpuinfo or lscpu.

However, some flags have no direct mapping to compiler flags, especially avx512 flags, e.g.

lscpu	g++/clang
avx512_fp16	-mavx512fp16
avx512_vpopcntdq	-mavx512vpopcntdq
avx512_vbmi2	-mavx512vbmi2

Maybe we can conventionally just remove the underscores, but this might not be true for all flags.
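One way to avoid relying on the underscore convention is an explicit alias table with a plain fallback. The sketch below only contains the three pairs listed above; the names LSCPU_TO_COMPILER_ALIAS and compiler_flag are hypothetical, and the fallback of prefixing the unchanged flag with -m is an assumption that will not hold for every flag.

```python
# Explicit aliases for lscpu flags whose compiler spelling differs.
# Only the pairs from the table above are listed here.
LSCPU_TO_COMPILER_ALIAS = {
    "avx512_fp16": "avx512fp16",
    "avx512_vpopcntdq": "avx512vpopcntdq",
    "avx512_vbmi2": "avx512vbmi2",
}


def compiler_flag(lscpu_flag):
    """Map an lscpu flag to a g++/clang argument, using an alias if one exists."""
    # Assumption: flags without an alias are passed through unchanged.
    name = LSCPU_TO_COMPILER_ALIAS.get(lscpu_flag, lscpu_flag)
    return f"-m{name}"
```

An explicit table keeps the exceptions visible and reviewable, at the cost of having to extend it whenever a new flag with a differing name appears.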

Imask_population_count counts more than necessary

The actual value is smaller than the type. Please check.

#59 has comments on that.

E.g., here __builtin_popcountll(mask); is used, but only 4 bits need to be checked, right?
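The arithmetic behind the question can be illustrated as follows: if only the lowest element_count bits of the mask are meaningful, they can be isolated before counting. This is a Python sketch of the idea only, not the generated C++ code; the function name masked_population_count is hypothetical, and whether the upper bits can actually be set in practice is not confirmed here.

```python
def masked_population_count(mask, element_count):
    """Count set bits among the lowest element_count bits of mask."""
    # Build a bit pattern with element_count ones, e.g. 0b1111 for 4 lanes.
    lane_bits = (1 << element_count) - 1
    # Discard everything above the valid lanes before counting.
    return bin(mask & lane_bits).count("1")
```

In C++ the same effect would come from applying the lane mask to the argument before the popcount call.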

Different requirements for test cases lead to multiple definitions of the same test case

If the requirements differ for multiple tests of a primitive, multiple test cases are generated (with the same test identifier).

Runner break

Known Bug with Ubuntu-latest and clang

Unwanted `Primitive <..> not implemented` warnings

The test suite currently produces a significant number of warnings related to unimplemented primitives:

Full Dump

./build/generator_output/src/test/tsl_test |& grep -A 1 warning | grep implemented | cut -d" " -f4-
shift_left<simd<double, avx2>> not implemented.
shift_left<simd<float, avx2>> not implemented.
shift_left<simd<int8_t, avx2>> not implemented.
shift_left<simd<uint8_t, avx2>> not implemented.
shift_left<simd<double, avx512>> not implemented.
shift_left<simd<float, avx512>> not implemented.
shift_left<simd<int8_t, avx512>> not implemented.
shift_left<simd<uint8_t, avx512>> not implemented.
shift_left<simd<double, scalar>> not implemented.
shift_left<simd<float, scalar>> not implemented.
shift_left<simd<double, sse>> not implemented.
shift_left<simd<float, sse>> not implemented.
shift_left<simd<int8_t, sse>> not implemented.
shift_left<simd<uint8_t, sse>> not implemented.
shift_left_vector<simd<double, avx2>> not implemented.
shift_left_vector<simd<float, avx2>> not implemented.
shift_left_vector<simd<int8_t, avx2>> not implemented.
shift_left_vector<simd<uint8_t, avx2>> not implemented.
shift_left_vector<simd<double, avx512>> not implemented.
shift_left_vector<simd<float, avx512>> not implemented.
shift_left_vector<simd<int8_t, avx512>> not implemented.
shift_left_vector<simd<uint8_t, avx512>> not implemented.
shift_left_vector<simd<double, scalar>> not implemented.
shift_left_vector<simd<float, scalar>> not implemented.
shift_left_vector<simd<double, sse>> not implemented.
shift_left_vector<simd<float, sse>> not implemented.
shift_left_vector<simd<int8_t, sse>> not implemented.
shift_left_vector<simd<uint8_t, sse>> not implemented.
shift_right<simd<double, avx2>> not implemented.
shift_right<simd<float, avx2>> not implemented.
shift_right<simd<int8_t, avx2>> not implemented.
shift_right<simd<uint8_t, avx2>> not implemented.
shift_right<simd<double, avx512>> not implemented.
shift_right<simd<float, avx512>> not implemented.
shift_right<simd<int8_t, avx512>> not implemented.
shift_right<simd<uint8_t, avx512>> not implemented.
shift_right<simd<double, scalar>> not implemented.
shift_right<simd<float, scalar>> not implemented.
shift_right<simd<double, sse>> not implemented.
shift_right<simd<float, sse>> not implemented.
shift_right<simd<int8_t, sse>> not implemented.
shift_right<simd<uint8_t, sse>> not implemented.
shift_right_logical<simd<double, avx2>> not implemented.
shift_right_logical<simd<float, avx2>> not implemented.
shift_right_logical<simd<int8_t, avx2>> not implemented.
shift_right_logical<simd<uint8_t, avx2>> not implemented.
shift_right_logical<simd<double, avx512>> not implemented.
shift_right_logical<simd<float, avx512>> not implemented.
shift_right_logical<simd<int8_t, avx512>> not implemented.
shift_right_logical<simd<uint8_t, avx512>> not implemented.
shift_right_logical<simd<double, scalar>> not implemented.
shift_right_logical<simd<float, scalar>> not implemented.
shift_right_logical<simd<double, sse>> not implemented.
shift_right_logical<simd<float, sse>> not implemented.
shift_right_logical<simd<int8_t, sse>> not implemented.
shift_right_logical<simd<uint8_t, sse>> not implemented.
shift_right_logical_vector<simd<double, avx2>> not implemented.
shift_right_logical_vector<simd<float, avx2>> not implemented.
shift_right_logical_vector<simd<int8_t, avx2>> not implemented.
shift_right_logical_vector<simd<uint8_t, avx2>> not implemented.
shift_right_logical_vector<simd<double, avx512>> not implemented.
shift_right_logical_vector<simd<float, avx512>> not implemented.
shift_right_logical_vector<simd<int8_t, avx512>> not implemented.
shift_right_logical_vector<simd<uint8_t, avx512>> not implemented.
shift_right_logical_vector<simd<double, scalar>> not implemented.
shift_right_logical_vector<simd<float, scalar>> not implemented.
shift_right_logical_vector<simd<double, sse>> not implemented.
shift_right_logical_vector<simd<float, sse>> not implemented.
shift_right_logical_vector<simd<int8_t, sse>> not implemented.
shift_right_logical_vector<simd<uint8_t, sse>> not implemented.
shift_right_vector<simd<double, avx2>> not implemented.
shift_right_vector<simd<float, avx2>> not implemented.
shift_right_vector<simd<int8_t, avx2>> not implemented.
shift_right_vector<simd<uint8_t, avx2>> not implemented.
shift_right_vector<simd<double, avx512>> not implemented.
shift_right_vector<simd<float, avx512>> not implemented.
shift_right_vector<simd<int8_t, avx512>> not implemented.
shift_right_vector<simd<uint8_t, avx512>> not implemented.
shift_right_vector<simd<double, scalar>> not implemented.
shift_right_vector<simd<float, scalar>> not implemented.
shift_right_vector<simd<double, sse>> not implemented.
shift_right_vector<simd<float, sse>> not implemented.
shift_right_vector<simd<int8_t, sse>> not implemented.
shift_right_vector<simd<uint8_t, sse>> not implemented.
equal<simd<double, sse>> not implemented.
equal<simd<float, sse>> not implemented.
equal<simd<double, sse>> not implemented.
equal<simd<float, sse>> not implemented.
mask_equal<simd<double, avx2>> not implemented.
mask_equal<simd<float, avx2>> not implemented.
mask_equal not implemented for avx512
mask_equal not implemented for scalar
mask_equal not implemented for sse
convert_down<simd<double, avx2>> not implemented.
convert_down<simd<float, avx2>> not implemented.
convert_down<simd<int16_t, avx2>> not implemented.
convert_down<simd<int8_t, avx2>> not implemented.
convert_down<simd<uint16_t, avx2>> not implemented.
convert_down<simd<uint8_t, avx2>> not implemented.
convert_down not implemented for avx512
convert_down not implemented for scalar
convert_up<simd<double, avx2>> not implemented.
convert_up<simd<float, avx2>> not implemented.
convert_up not implemented for avx512
convert_up not implemented for scalar
convert_up<simd<double, sse>> not implemented.
convert_up<simd<float, sse>> not implemented.
convert_up<simd<int64_t, sse>> not implemented.
convert_up<simd<uint64_t, sse>> not implemented.

Some of these are justified, mainly convert_up/convert_down/mask_equal/equal not being implemented for some extensions.

But others aren't:

convert_down doesn't make sense for uint8_t (no smaller type exists)
convert_up doesn't make sense for uint64_t (no larger type exists)
shift_*** does not make sense for float or double (C++ scalars and AVX don't support that, and neither should we. Users should cast instead.)
there are probably more cases that just don't occur because the corresponding primitive doesn't have tests yet, e.g.
cast

I see three ways to deal with this:

1. Get rid of this warning entirely
2. Only emit the warning if the implementation is missing because of lscpu flags, not for implementations that were never written at all
3. Add a yaml tag (e.g. implementation_omitted) to indicate implementations that are explicitly not desired.

My personal favorite is 3., since I would rather not remove a warning that is very helpful in development.
While we're at it, we might consider emitting different warnings for the two cases of

missing because of lscpu flags
missing because no implementation was written

Which mean very different things for the developer.
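The distinction between the cases could be sketched in the generator roughly as below. The dictionary layout of a definition, the names MissingReason, classify_missing and warning_for, and the warning texts are all assumptions for illustration; only the implementation_omitted tag is taken from the proposal above.

```python
from enum import Enum


class MissingReason(Enum):
    OMITTED = "omitted"              # explicitly not desired
    MISSING_FLAGS = "missing_flags"  # written, but lscpu flags are absent
    NOT_WRITTEN = "not_written"      # no implementation exists at all


def classify_missing(definitions, available_flags):
    """Classify one primitive/type/extension combination.

    definitions is a list of dicts, each with an optional "lscpu_flags"
    list and an optional "implementation_omitted" boolean.
    Returns None if a usable implementation exists.
    """
    available = set(available_flags)
    if any(d.get("implementation_omitted", False) for d in definitions):
        return MissingReason.OMITTED
    if not definitions:
        return MissingReason.NOT_WRITTEN
    if any(set(d.get("lscpu_flags", [])) <= available for d in definitions):
        return None
    return MissingReason.MISSING_FLAGS


def warning_for(name, reason):
    """Return a warning text, or None if nothing should be emitted."""
    if reason is MissingReason.MISSING_FLAGS:
        return f"{name} not available: required lscpu flags are missing."
    if reason is MissingReason.NOT_WRITTEN:
        return f"{name} not implemented."
    return None  # available or deliberately omitted: stay silent
```

Under this scheme, shift_left for float would carry the tag and produce no warning, while a genuinely unwritten implementation would still be reported.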

Solve ambiguity problems

As we allow function overloading through functor_name we can end up in a situation where the same function signature is generated twice (e.g., for scalar definition + mask/imask primitives, since the mask type equals the imask type).

Currently, we work around that issue by leaving out the offending definitions. However, this seems to be bad practice.
Thus, we need a better solution.

[NEON] Missing parentheses in `mask_ls_neon.hpp`

I'm getting the following warning with NEON:

generate_tsl_neon-asimd/include/generated/definitions/mask_ls/mask_ls_neon.hpp:850:36: 
warning: & has lower precedence than ==; == will be evaluated first [-Wparentheses]
  850 |                   if ((imask >> i) & 0b1 == true) {
      |                                    ^~~~~~~~~~~~~

This looks like the generated code is "wrong". If == is evaluated first, the code checks if (mask >> i) & true, where true will just be 1. So this has the same effect in the end. You can probably just remove the == true because any value != 0 should evaluate to true anyway.

Make CXX-Standard an option

I think the generated code (if not using C++20 concepts) can be compiled as C++14.
We should add an option to choose between C++14/17/20.

Usability: Add a CMakeLists.txt

As it currently stands, this repository does not have its own CMakeLists.txt.

The current 'TVL' repository also contains only a pre-generated version of the TVL for a single set of lscpu flags.

Adding a CMakeLists.txt to this repository that generates a customized version of the library
on demand would be a better way of distribution, and also help development.

A simple starting version is attached here:

CMakeLists.txt

Check lscpu-flags in primitives

Alex brought to my attention that there are some primitive definitions with wrong lscpu flags.
We should incorporate an lscpu flag check into the CI pipelines; this should do the trick.
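Such a check could start as a simple comparison of each definition's flags against a list of known flags. The sketch below assumes a mapping from primitive name to a flat flag list and a hypothetical function name unknown_flags; it only catches misspelled or unknown flags, not flags that are valid but wrong for the intrinsic used.

```python
def unknown_flags(definitions, known_flags):
    """Return {primitive name: [unknown flags]} for all offending definitions.

    definitions maps a primitive name to its list of lscpu flags.
    """
    known = set(known_flags)
    problems = {}
    for name, flags in definitions.items():
        bad = sorted(set(flags) - known)
        if bad:
            problems[name] = bad
    return problems
```

A CI job could fail whenever the returned dictionary is not empty.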

[AVX512] `convert_up` missing types

The following conversions lead to compile errors with AVX512. The errors all look like the following, but with the corresponding types.

 error: no member named 'apply' in 'tsl::functors::convert_up<tsl::simd<unsigned int, tsl::avx512>, 
                                                              tsl::simd<unsigned long, tsl::avx512>, 
                                                              tsl::workaround>'

uint16_t --> uint32_t
uint16_t --> uint64_t
uint32_t --> uint64_t

Add lspci

Since the lib should also support FPGAs through oneAPI and GPUs(?), we should query for those devices, too. This should be possible using lspci (or some Python package like [1] or with a few lines of code [2]).

[1] https://pypi.org/project/py-lspci/
[2] https://gist.github.com/alex-pat/cf69546d067b4ac50f18c1f0d8813ddf

Update requirements.txt

Set versions for packages

Proof-read Readme

Compressstore buggy

In the workaround version of compress store we have this little piece of code at the end:

if(((mask>>Vec::vector_element_count())&0b1) == 0) {
   *memory = safe[memory-orig_mem];
}

This is just wrong.

Primitive Table Inconsistencies

The primitive table almost always shows equal availability for oneAPIfpga and oneAPIfpgaRTL. That is due to the "fallback" mechanism that makes the C++ HLS code available if no RTL specification is present.

However, this is rather confusing, since there is no actual indication of the fallback. The table should be enhanced to distinguish fallback solutions from actual availability.

Sort include-order for definitions

As some primitives internally use other primitives declared and defined in different files, we need to build a dependency graph and sort the includes in tsl_generated.hpp accordingly.
Example:
calc.yaml:1236ff (mod<simd<uint32_t, avx512>>):

/*...*/
__m512 vec_d = tsl::cast<Vec, typename Vec::template transform_extension<T>>(vec);
/*...*/

Cast is defined in convert.yaml:360ff.
The include order would be:

#include "extensions/scalar.hpp"
#include "extensions/simd/intel/avx2.hpp"
#include "declarations/*"
#include "definitions/compare/compare_avx2.hpp"
#include "definitions/compare/compare_sse.hpp"
#include "definitions/compare/compare_scalar.hpp"
#include "definitions/calc/calc_avx2.hpp"
#include "definitions/calc/calc_sse.hpp"
#include "definitions/calc/calc_scalar.hpp"
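Sorting the includes by dependency amounts to a topological sort. A minimal sketch using Python's standard graphlib module follows; the function name sorted_includes, the shape of the dependency mapping, and the header name definitions/convert/convert_avx512.hpp used in the usage note are assumptions, and how the generator would actually extract the dependencies from the yaml files is left open.

```python
from graphlib import TopologicalSorter


def sorted_includes(dependencies):
    """Order headers so that every header comes after the headers it uses.

    dependencies maps a header to the set of headers it depends on.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

For the mod example, a mapping in which definitions/calc/calc_avx512.hpp depends on definitions/convert/convert_avx512.hpp would place the convert header first.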

Reproduce the workflow for the Prerequisites and Usage section in top-level Readme

add py-cpuinfo to requirements.txt

Missing Runtime with deb

When installing the TSL from the deb package and including it via
#include <tsl/tslintrin.hpp>

the CPU runtime header cannot be found:

In file included from /usr/include/tsl/tslintrin.hpp:33,
                 from main.cpp:5:
/usr/include/tsl/generated/tsl_generated.hpp:79:10: fatal error: tslCPUrt.hpp: No such file or directory
   79 | #include "tslCPUrt.hpp"
      |          ^~~~~~~~~~~~~~
compilation terminated.

The link to the tslCPUrt.hpp is originally formed through the CMake integration and has to be present through tslintrin as well.

Shift not supported for `uint8_t`

It looks like there is currently no support for shifts of uint8_t in AVX512 vectors.

Missing NEON Intrinsics

load_imask
compressstore
mask_storeu

[NEON] Unknown type name `neon` in tslCPUrt.hpp

When using tsl::runtime::cpu::max_width_extension_t on AArch64, I get the following error message:

runtime/cpu/include/tslCPUrt.hpp:15:29: error: unknown type name 'neon'
   15 |         using extension_t = neon;
      |                             ^

runtime/cpu/include/tslCPUrt.hpp:22:23: error: use of undeclared identifier 'scalar'
   22 |             (Par==1), scalar, typename details::simd_ext_helper_t<sizeof(T)*8*Par>::extension_t
      |                       ^

These types are not included, so they cannot be known.

Side note: same holds for VectorProcessingStyle and TSLArithmetic further down in the code. I'm not using them, but my IDE shows me that they are undefined. Probably just a few includes missing.

Sequence

As reported by @EricMier, the primitive sequence creates a register with the values in the wrong order.

[NEON] Shift left requires constant integer

When downloading the current tar.gz from v0.1.9-rc5 (and probably ones before that but I didn't check), I cannot compile the NEON stuff on Mac with LLVM/Clang 18. I get the following error multiple times.

cmake-build-release/_deps/tsl_gen-src/generate_tsl_neon-asimd/include/generated/definitions/binary/binary_neon.hpp:1117:104: error: argument to '__builtin_neon_vshlq_n_v' must be a constant integer
 1117 |                 return __extension__ ({ uint8x16_t __ret; uint8x16_t __s0 = data; __ret = (uint8x16_t) __builtin_neon_vshlq_n_v((int8x16_t)__s0, shift, 48); __ret; });

which is the expanded macro for:

[[nodiscard]] 
TSL_FORCE_INLINE 
static typename Vec::register_type apply(
    const typename Vec::register_type data, const unsigned int shift
) {
    return vshlq_n_u8(data, shift);  // <-- this is a macro in Clang
}

When looking at the NEON specs, this error is correct, as vshlq_n_u8 requires a const int as the second argument. I'm not sure how this should be handled, but probably this needs a vdup_n_* for the runtime value before the shift.

Improve PR-pipeline

Maybe we should change the github action workflow to make this less annoying?

Originally posted by @cmrschwarz in #37 (comment)

Allow TVLGen to generate only a certain set of ctypes

TVLGen already provides the possibility to limit the generated code to a set of extensions. However, especially for scientific paper writing, it would be great to limit the library also to a set of selected ctypes, e.g. float and uint16_t.
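Restricting the output to selected ctypes could be a filter over the loaded definitions, analogous to the existing extension filter. The sketch below assumes each definition is a dictionary whose "ctype" entry is a list of type names; the function name restrict_ctypes is hypothetical.

```python
def restrict_ctypes(definitions, selected):
    """Keep only the selected ctypes; drop definitions left without any."""
    selected = set(selected)
    result = []
    for definition in definitions:
        kept = [ctype for ctype in definition["ctype"] if ctype in selected]
        if kept:
            # Copy the definition so the input list is not modified.
            result.append({**definition, "ctype": kept})
    return result
```

Calling it with ["float", "uint16_t"] would leave only definitions that cover at least one of these two types.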
