Comments (11)

fpetrogalli commented on August 24, 2024

@shibatch I think we shouldn't talk about this until LLVM comes up with the names they need.
Most of the information needed to classify vector functions (mask, vector length, target architecture and vector extension, alignment of function parameters) is already available in the mangled names generated under the rules of the publicly available vector ABI, with the exception of:

  • accuracy and input domain
  • error number and exception support

Those two items (accuracy and input domain in particular) make the possible choices virtually infinite. I think it makes sense to specify which precision to use at compile time, with a set of compiler flags. But this is compiler work, not library work. Let's focus on having the library export the names with the state-of-the-art classification of vector routines, which is the ABI defined via the OpenMP declare simd directive.

As far as I understood, @hfinkel agrees on this approach of limiting the exported names to the vector ABI ones.
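
A minimal sketch of how the mangled names mentioned above are composed, assuming the x86 vector function ABI: the prefix `_ZGV`, an ISA letter (`b` for SSE, `c` for AVX, `d` for AVX2, `e` for AVX-512), `N` or `M` for unmasked or masked, the vector length, one letter per parameter (`v` for a vector parameter), then `_` and the scalar name. The helper `vabi_name` is hypothetical and covers only this simple subset.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of SLEEF or of any compiler.
 * Writes a vector-ABI style name into `out` and returns its length,
 * exactly as snprintf reports it. */
static int vabi_name(char *out, size_t cap, char isa, int masked,
                     unsigned vlen, const char *params, const char *scalar)
{
    assert(out != NULL && params != NULL && scalar != NULL);
    return snprintf(out, cap, "_ZGV%c%c%u%s_%s",
                    isa, masked ? 'M' : 'N', vlen, params, scalar);
}
```

With these inputs, `vabi_name(buf, sizeof buf, 'b', 0, 2, "v", "sin")` produces `_ZGVbN2v_sin`. Nothing in such a name encodes accuracy, input domain, or error-number behaviour, which is the gap described above.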

Thank you!

from sleef.

shibatch avatar shibatch commented on August 24, 2024

The generic vector ABI is only for generic use, and there are
additional requirements for vectorized math functions. Since we are
working on implementing a vectorized math library, I think it is
relevant and appropriate for us to talk about this.

The generic ABI has a specification for mask, but it still lacks a
specification for how to convert a function prototype of scalar function
to that of vectorized function with a mask.

As for the target architecture, it is not clear whether it specifies the
instructions used for execution or is only required for passing
parameters. We also need to discuss whether the library or the compiler
is responsible for providing a dispatcher.

We don't need to make the number of combinations of accuracy and input
domain finite. My idea is that, since the compiler needs to know
which functions are available in the library anyway, it would be convenient
if the compiler could determine availability just by checking the symbols in
the library. I know this is quite different from the current
implementation, but I decided to write down everything I am thinking now.

I expect this discussion could take several months.
I am also thinking about contributing to LLVM for this part.

I agree that at this point we should focus on what you are saying, and
we should implement a library with the current ABI.

gnzlbg avatar gnzlbg commented on August 24, 2024

> Instruction set required to execute the function (AVX2, SSE4.2+POPCNT, etc.)

Note that currently the dispatcher functions do not state any ISA requirements, but they are not callable on all ISAs, for example:

  • __m128d Sleef_sind2_u10(__m128d a); requires SSE (it won't work on machines that only have 64-bit wide MMX).
  • __m256d Sleef_sind4_u10(__m256d a); requires AVX (it won't work on machines that only have 128-bit wide SSE registers).

It is also not only that they won't work on the wrong machine, but that callers must uphold a particular ABI: that vectors like __m128d and __m256d are passed in a single register.

That is, trying to call __m256d Sleef_sind4_u10 from C code that's compiled without AVX will introduce undefined behavior, because the compiler will represent and pass the __m256d as two __m128d registers, while the function expects a single __m256d register.

If one is using the library from C directly, this is currently probably a non-issue, because these functions are only available if the appropriate features are enabled for the whole translation unit that includes the SLEEF headers (e.g. if __AVX__ is not defined, __m256d Sleef_sind4_u10 is not available). However, it could be desirable to call these functions from a translation unit compiled with fewer features:

// Compiled with SSE
#include <sleef.h> // __AVX__ not defined

void foo_fallback(void) { /* do stuff calling sleef SSE functions */ }

// GCC rejects an attribute placed after the declarator of a function
// definition, so it goes in front here.
__attribute__((__target__("avx"))) void foo_avx2(void) {
    /* do stuff but... can't call sleef __m256d functions
        since they are not exported in the sleef.h header ! */
}

void bar(void) {
    if (stars_align()) { foo_avx2(); } else { foo_fallback(); }
}

I think a way to fix this is to:

  • always expose all functions (that is, not hide the AVX functions if __AVX__ is not defined).
  • use __attribute__((target("TARGET"))) on all exposed functions
  • properly name functions according to TARGET, including dispatchers
  • provide dispatchers for lower targets (more below).

I think it would be particularly useful to provide e.g. __m256 (and AVX-512) dispatchers for targets without the feature. That is, for example taking a look at sind:

  • rename: __m256d Sleef_sind4_u10(__m256d a); to Sleef_sind4_u10_disp_avx (similar for sse and avx-512 dispatchers)
  • provide a __m128dx2 Sleef_sind4_u10_disp_sse(__m128dx2) that internally converts a __m128dx2 to a __m256d and calls the best __m256d function.
  • provide a __m128dx4 Sleef_sind4_u10_disp_sse(__m128dx4) that internally converts a
    __m128dx4 to a __m256dx2 or a __m512d and calls the best AVX-512 or AVX2 function.

This way users have maximum flexibility of either doing run-time feature detection themselves, or using a dispatcher that can scale across all ISA features.
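
A minimal sketch of the "run-time feature detection" option, assuming GCC or Clang on x86 (elsewhere it simply falls back to the baseline kernel). The kernel names and the doubling loop are placeholders, not SLEEF functions; `__builtin_cpu_supports` and the `target` attribute are compiler extensions. Arguments are passed by pointer so that no wide vector type crosses the call boundary, which sidesteps the calling-convention problem discussed above.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*scale_fn)(double *x, size_t n);

/* Baseline kernel: compiled with whatever the translation unit enables. */
static void scale_baseline(double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= 2.0;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_HAVE_AVX_KERNEL 1
/* Same loop, but the compiler is allowed to emit AVX instructions here. */
__attribute__((target("avx")))
static void scale_avx(double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= 2.0;
}
#else
#define SKETCH_HAVE_AVX_KERNEL 0
#endif

/* Choose a kernel based on what the running CPU reports. */
static scale_fn pick_scale(void)
{
#if SKETCH_HAVE_AVX_KERNEL
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return scale_avx;
#endif
    return scale_baseline;
}

/* Hypothetical dispatcher: callers never need to know which kernel ran. */
static void scale_dispatch(double *x, size_t n)
{
    assert(x != NULL || n == 0);
    pick_scale()(x, n);
}
```

Both kernels compute the same result, so `scale_dispatch` behaves identically on machines with and without AVX; only the instructions used inside the kernel differ.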

shibatch commented on August 24, 2024

> It is also not only that they won't work on the wrong machine, but that callers must uphold a particular ABI: that vectors like __m128d and __m256d are passed in a single register.

The situation is the same in SVML. To call __m256d functions, it is assumed that the __m256d data type is available on that computer.

> That is, trying to call __m256d Sleef_sind4_u10 from C code that's compiled without AVX will introduce undefined behavior

You cannot do that, since Sleef_sind4_u10 is not declared in the header if AVX is not available.

gnzlbg commented on August 24, 2024

> You cannot do that, since Sleef_sind4_u10 is not declared in the header if AVX is not available.

Technically, it is very easy to do this. One only has to add a function declaration, and call it, and it won't even fail to link since the function is present in the binary of the library. That is, one does not have to include the sleef.h header at all to call any functions in the library.
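
To illustrate the point about hand-written declarations with something that runs anywhere, the sketch below declares the C library's `abs` by hand instead of including `<stdlib.h>`. The helper `span` is hypothetical. The linker resolves the symbol by name alone; nothing checks that a hand-written prototype matches the real one, which is why a mismatched vector type can go unnoticed until run time.

```c
#include <assert.h>

/* Declared by hand instead of including <stdlib.h>.
 * The prototype happens to be correct here; the toolchain would not
 * complain at link time if it were not. */
extern int abs(int);

/* Hypothetical helper that calls the hand-declared function. */
static int span(int lo, int hi)
{
    assert(lo <= hi); /* precondition of this sketch */
    return abs(hi - lo);
}
```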

In practice, most people will call the function by using the library headers, but even then, exposing the definitions is as easy as:

#define __AVX512F__
#define ...
#include <sleef.h>
#undef ...
#undef __AVX512F__

Both approaches are, obviously, at least some level of "wrong". OTOH, this allows me to implement a "more generic" dispatcher as follows that works correctly without undefined behavior:

// ... expose all sleef library functions "somehow" (irrelevant how)

#include <string.h> // memcpy, for the bit-preserving conversions below

// appropriately wrap them with target features
// this sets the calling convention ABI for the function
// (the attribute goes before the declarator in a definition)

__attribute__((target("sse"))) __m128d Sleef_sind2_u10_wrapper(__m128d a) {
    /* call sleef function */
}
__attribute__((target("avx"))) __m256d Sleef_sind4_u10_wrapper(__m256d a) { /* ... */ }
__attribute__((target("avx512f"))) __m512d Sleef_sind8_u10avx512f_wrapper(__m512d a) { /* ... */ }

typedef struct { __m128d value[4]; } __m128dx4;
typedef struct { __m256d value[2]; } __m256dx2;

// Dispatches to the best implementation - requires SSE
// a values live in 128-bit registers, but will be probably passed by memory here
// depending on the platform ABI for SSE calls.
__attribute__((target("sse"))) __m128dx4 dispatch_sind4_u10_sse(__m128dx4 a) {
    if (vector512_available()) {
          // See the 256-bit case, the comment explains how and why this works.
          // A plain cast between the struct and the vector type does not
          // compile, so the bits are copied instead.
          __m512d v;
          memcpy(&v, &a, sizeof v);
          v = Sleef_sind8_u10avx512f_wrapper(v);
          memcpy(&a, &v, sizeof a);
          return a;
    } else if (vector256_available()) {
          // Here, each __m256d element of v.value will be stored in this function
          // as two 128-bit registers. Because the wrapper takes a single
          // __m256d and has the "avx" ABI on it, the compiler copies the two
          // 128-bit registers into a single one before calling the function,
          // and properly unwraps the result from one 256-bit register into
          // two 128-bit registers:
          __m256dx2 v;
          memcpy(&v, &a, sizeof v);
          v.value[0] = Sleef_sind4_u10_wrapper(v.value[0]);
          v.value[1] = Sleef_sind4_u10_wrapper(v.value[1]);
          memcpy(&a, &v, sizeof a);
          return a;
    } else {
          // this is obviously defined behavior since the ABIs match
          a.value[0] = Sleef_sind2_u10_wrapper(a.value[0]);
          a.value[1] = Sleef_sind2_u10_wrapper(a.value[1]);
          a.value[2] = Sleef_sind2_u10_wrapper(a.value[2]);
          a.value[3] = Sleef_sind2_u10_wrapper(a.value[3]);
          return a;
    }
}

As long as the translation unit is compiled with SSE or higher, this all will work correctly.
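
The reinterpretation step used by the dispatcher above can be tried out without any SIMD headers. This sketch uses plain structs of `double` as stand-ins for `__m128d`, `__m128dx2` and `__m256d`, and negation as a placeholder kernel; all names are hypothetical. It shows only the data movement and says nothing about which registers a real compiler would use.

```c
#include <assert.h>
#include <string.h>

typedef struct { double lane[2]; } vec2d;   /* stand-in for __m128d   */
typedef struct { vec2d value[2]; } vec2dx2; /* stand-in for __m128dx2 */
typedef struct { double lane[4]; } vec4d;   /* stand-in for __m256d   */

/* Placeholder "narrow" kernel: negates two lanes. */
static vec2d kernel2(vec2d a)
{
    a.lane[0] = -a.lane[0];
    a.lane[1] = -a.lane[1];
    return a;
}

/* Placeholder "wide" kernel: negates four lanes. */
static vec4d kernel4(vec4d a)
{
    for (int i = 0; i < 4; i++)
        a.lane[i] = -a.lane[i];
    return a;
}

/* Accepts the narrow representation and uses the wide kernel when the
 * caller says it is available, otherwise two narrow calls. */
static vec2dx2 dispatch4(vec2dx2 a, int wide_available)
{
    assert(sizeof(vec2dx2) == sizeof(vec4d));
    if (wide_available) {
        vec4d w;
        memcpy(&w, &a, sizeof w); /* two narrow halves -> one wide value */
        w = kernel4(w);
        memcpy(&a, &w, sizeof a); /* and back again */
    } else {
        a.value[0] = kernel2(a.value[0]);
        a.value[1] = kernel2(a.value[1]);
    }
    return a;
}
```

Both paths return the same bytes for the same input, which is the property a caller of such a dispatcher would rely on.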

Also, for the Rust headers I don't have a choice. I cannot include a C header in a Rust program. I have to write extern function definitions that map to the library binary, just as one can do in C.

sleef-sys currently goes way out of its way to detect the features that the Rust program is compiled with, and properly define __AVX__ and friends when calling the C pre-processor on the header before automatically generating the Rust extern definitions on the Rust side. But this is only one of the many ways there is to do this.

I also thought here about globally enabling the same Rust target features in the C compiler used to compile SLEEF, but this means that SLEEF's SSE functions might end up being compiled with AVX enabled. I would also have to go way out of my way to translate Rust target features into the command-line options and feature names of the more widely used C compilers (maybe someday).

shibatch commented on August 24, 2024

> Technically, it is very easy to do this. One only has to add a function declaration, and call it, and it won't even fail to link since the function is present in the binary of the library. That is, one does not have to include the sleef.h header at all to call any functions in the library.

Of course you can call any library functions with wrong argument types. That's nonsense.
The current design goal is to implement what is supported by SVML.
You can of course generalize the argument types, but that's endless work.

gnzlbg commented on August 24, 2024

Yeah, I was only commenting that, from the point of view of calling sleef from a different language, I cannot include sleef.h. I have to manually write a file in whatever programming language I am using and, via the language's C FFI facilities, declare the extern functions of the library binary and call them.

From there, some functions have _avx suffixes that denote that they require AVX, but others don't have anything, yet they do require AVX too (e.g. the __m256... dispatchers). I found that confusing, and it would have helped me if these would also have in their name which features they require.

I did call them with the wrong types, and got garbage, and had to debug it. It wasn't hard, and in hindsight obvious, but when wrapping the functions it just did not occur to me that they also needed e.g. AVX.

Also, as mentioned, even if a TU isn't globally compiled with say AVX enabled, such that the sleef.h header doesn't expose some functions, it is still possible to call these safely and correctly, so not exposing them might be overly restrictive. I'd rather have all functions always exposed, but making it clear in their name which features they do require to be invoked safely (which should express which calling ABI they require).
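
One way a binding generator could recover this information today is from the lane count in the name. The sketch below assumes the pattern suggested by the three names quoted in this thread (`Sleef_sind2_u10`, `Sleef_sind4_u10`, `Sleef_sind8_u10avx512f`): a type letter (`d` for double and, as an assumption, `f` for float), a lane count, then `_u` and the accuracy. The helper `sleef_name_bits` is hypothetical and not part of SLEEF.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns the vector width in bits implied by a name such as
 * "Sleef_sind4_u10", or 0 if the name does not fit the assumed pattern. */
static unsigned sleef_name_bits(const char *name)
{
    assert(name != NULL);

    /* The accuracy suffix is assumed to follow the last underscore. */
    const char *u = strrchr(name, '_');
    if (u == NULL || u[1] != 'u')
        return 0;

    /* Walk backwards over the decimal lane count just before "_u". */
    const char *p = u;
    unsigned lanes = 0, scale = 1;
    while (p > name && isdigit((unsigned char)p[-1])) {
        lanes += (unsigned)(p[-1] - '0') * scale;
        scale *= 10;
        p--;
    }
    if (lanes == 0 || p == name)
        return 0;

    /* The character before the lane count is assumed to be the type. */
    switch (p[-1]) {
    case 'd': return lanes * 64; /* double lanes */
    case 'f': return lanes * 32; /* float lanes (assumption) */
    default:  return 0;
    }
}
```

A result of 256 would then imply AVX on x86 and 512 would imply AVX-512F; that mapping from width to feature is specific to x86.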

shibatch commented on August 24, 2024

> From there, some functions have _avx suffixes that denote that they require AVX, but others don't have anything, yet they do require AVX too (e.g. the __m256... dispatchers). I found that confusing, and it would have helped me if these would also have in their name which features they require.

The same problem exists in SVML. There are both merits and demerits in adding AVX to the function names. I chose the current naming scheme since it would have better compatibility between different architectures.

> I did call them with the wrong types, and got garbage, and had to debug it. It wasn't hard, and in hindsight obvious, but when wrapping the functions it just did not occur to me that they also needed e.g. AVX.

Please carefully read the reference. They are well defined.

> Also, as mentioned, even if a TU isn't globally compiled with say AVX enabled, such that the sleef.h header doesn't expose some functions, it is still possible to call these safely and correctly, so not exposing them might be overly restrictive.

SLEEF is already pretty complex. I don't think that adding such a feature is a good decision. We can think of so many new features that might be useful, but I don't want to make the library bloated.

gnzlbg commented on August 24, 2024

> SLEEF is already pretty complex. I don't think that adding such a feature is a good decision. We can think of so many new features that might be useful, but I don't want to make the library bloated.

I think that to properly support this feature it would be enough to:

  • remove the ifdef __AVX__ and friends
  • use __attribute__((target("..."))) on all the functions in the library properly

That's pretty much it AFAICT. That would do, even without renaming the dispatchers (people would need to be careful while using them, but that's already kind of the case if they are calling this via C FFI).

Obviously, if the objective is compatibility with SVML, and it doesn't do this, then SLEEF probably shouldn't either. But this is something that might remove complexity instead of adding it (for users, all functions are always available, they must make sure that they call them from the proper places).

> I chose the current naming scheme since it would have better compatibility between different architectures.

That's a good point that I hadn't thought about. Does this apply to the 128-bit wide functions, or also to the 256/512-bit wide ones (AFAIK only x86 exposes functions taking 256 and 512 bit wide registers)? For the 128-bit wide functions, writing _sse might not make much sense, since all x86 targets that the library supports probably have SSE enabled anyway, and that is a very reasonable expectation to have (e.g. most compilers' i586 targets have SSE enabled, and i686 ones typically have SSE2 enabled).

> Please carefully read the reference. They are well defined.

The reference said:

> Vector extension specifier.
>
>   • (Nothing) : Dispatcher automatically chooses the fastest available vector extension

Where does it say that one needs AVX to call say __m256d Sleef_sind4_u10(__m256d a); ?

I read the reference very quickly, as my posts show, so I probably overlooked that. But since calling these functions from a TU with only SSE enabled invokes undefined behavior, this might be worth repeating in a couple of places to prevent idiots like me from scratching their heads :D

shibatch commented on August 24, 2024

> use __attribute__((target("..."))) on all the functions in the library properly

I think that attribute is not supported by MSVC.

> But this is something that might remove complexity instead of adding it (for users, all functions are always available, they must make sure that they call them from the proper places).

If I make all functions always available, it would be more complex.

> Does this apply to the 128-bit wide functions, or also to the 256/512-bit wide ones (AFAIK only x86 exposes functions taking 256 and 512 bit wide registers)?

Currently 128-bit only.

> Where does it say that one needs AVX to call say __m256d Sleef_sind4_u10(__m256d a); ?

It assumes that the __m256d data type is defined.
The usage is the same as in SVML. It's not hard to understand.

gnzlbg commented on August 24, 2024

> I think that attribute is not supported by MSVC.

You are correct, it seems that it is not possible to do that portably in C and C++.

> It's not hard to understand.

I think that my confusion here is that in clang and GCC you can always use types that convert to __m256d even if __m256d is not defined. That is, you don't have to use __m256d to use the intrinsics, you can just pass them v4f64 and it will work just fine.
