libxsmm / libxsmm

Library for specialized dense and sparse matrix operations, and deep learning primitives.

Home Page: https://libxsmm.readthedocs.io/

License: BSD 3-Clause "New" or "Revised" License

Shell 1.45% C++ 1.73% C 92.64% Python 1.00% Makefile 1.93% Batchfile 0.04% Fortran 1.14% HTML 0.01% JavaScript 0.01% CSS 0.01% Starlark 0.01% CMake 0.03%

jit simd avx512 machine-learning sparse blas matrix-multiplication transpose bfloat16 avx2

libxsmm's Issues

Full xGEMM interface and LD_PRELOADable library

Add procedures with the exact LAPACK/xGEMM signature including the appropriate code dispatch. Implement a libxsmm_proxy library (so, dll) which is able to intercept existing xGEMM calls. Document the way to achieve a similar effect using static linkage (no code changes but adjusting the link-line).

support for alpha -1

A common special case for alpha is -1. Currently, the generator does not generally support this case (alpha = -1).

Add a version stamp to LIBXSMM's interface

Add a version stamp (compile-time), and perhaps a runtime API to query the version of the library (C/C++ and Fortran interfaces).
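
A minimal sketch of what such a version stamp could look like in C, assuming a packed integer for easy comparison. All names (`DEMO_VERSION_*`, `demo_version_number`, `demo_version_string`) are hypothetical and are not LIBXSMM's actual interface.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical compile-time version stamp (names are illustrative only). */
#define DEMO_VERSION_MAJOR 1
#define DEMO_VERSION_MINOR 2
#define DEMO_VERSION_PATCH 3

/* Pack the components into one integer so versions compare with <, >, ==. */
#define DEMO_VERSION_NUMBER(MAJOR, MINOR, PATCH) \
  (((MAJOR) * 10000) + ((MINOR) * 100) + (PATCH))
#define DEMO_VERSION DEMO_VERSION_NUMBER( \
  DEMO_VERSION_MAJOR, DEMO_VERSION_MINOR, DEMO_VERSION_PATCH)

/* Runtime query: reports the version the library binary was built with,
 * which may differ from the header the application was compiled against. */
int demo_version_number(void)
{
  return DEMO_VERSION;
}

/* Writes "major.minor.patch" into buffer.
 * Returns 0 on success, or -1 if the buffer is too small. */
int demo_version_string(char* buffer, size_t size)
{
  int n;
  assert(NULL != buffer);
  n = snprintf(buffer, size, "%d.%d.%d",
    DEMO_VERSION_MAJOR, DEMO_VERSION_MINOR, DEMO_VERSION_PATCH);
  return (0 <= n && (size_t)n < size) ? 0 : -1;
}
```

An application could compare `DEMO_VERSION` (header) against `demo_version_number()` (binary) at startup to detect a mismatch between the two.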

Move libxsmm_generator_dense_add_isa_check_header and libxsmm_generator_dense_add_isa_check_footer into generator_common

The functions:
libxsmm_generator_dense_add_isa_check_header
libxsmm_generator_dense_add_isa_check_footer
are not specific to dense matrix multiplication and therefore they should be renamed and moved to generator_common.c/h

remove generation for aligned stores and loads from the generator backend

All recent Intel architectures (Sandy Bridge or later) do not suffer a performance penalty when executing an unaligned vector load (vmovups/vmovupd) on aligned data, i.e., in cases where vmovaps/vmovapd could in theory be used. Therefore, we can take this complexity out of the generator backend.

Side note: this would also mean that the Intel Knights Corner backend needs to be removed, or at least limited to aligned LDx. The reason is that the previous statement does not hold on Intel Knights Corner, as this architecture does not offer simple unaligned vector data move instructions.

the internal jit_generator tester is currently broken

After the latest refactoring, the jit_generator doesn't compile anymore. This seems to be a simple include issue. However, we moved many function definitions between several headers, so we need to check which header needs to be included in jit_validation.c.

clBLAS-master\src\library\blas\xgemm.cc(394): error C2065: 'gemmSelectKernel': undeclared identifier

Hi there,
I have been struggling for days to create a cblas.lib using Visual Studio (desktop) 2012 and I just can't get it to compile correctly.
The long list of errors starts with

clBLAS-master\src\library\blas\xgemm.cc(394): error C2065: 'gemmSelectKernel': undeclared identifier

After that there are a lot of errors, which I presume are just follow-ups. Can anyone help?

Many thanks,
René

-cp2k flag in make.sh

hey all,
just a few quick questions about the -cp2k flag in make.sh. Is it supposed to deliver a cp2k suitable library?

I see that the MNK options are

MNK="
23,
6,
14 16 29,
14 32 29,
5 32 13 24 26,
9 32 22,
64,
78,
16 29 55,
32 29 55,
12,
4 5 7 9 13 25 26 28 32 45"

But in the cp2k's toolchain installer the MNK options are

MNK="1 4 5 6 8 9 13 16 17 22 23 24 26 32"
which are far fewer, and also different combinations.

Next, it sets SSE=3, which according to the makefile.inc and the documentation doesn't exist. Only SSE=1 and AVX=1|2|3 exist.

And last, I read that it is Intel-optimized, and the cp2k.pdf in the documentation says I should have icc. Is it also supposed to work with the gcc compiler? I did manage to compile it with gcc. Are there any issues with gcc?

Thank you for any answers.

Johannes

KNC code generation

KNC is generated although not requested:

make M="4 8 10 12 16 64 100 144" N="4 8 10 12 16 64 100 144" K="4 8 10 12" BETA=0 OFFLOAD=0 MIC=0 SSE=3

Furthermore, when calling with OFFLOAD=0, the application shouldn't be required to use -no-offload.

This is true for compiling the f90 module or including the libxsmm.h header in C/C++ applications.

Support the prefetch interface for Fortran.

Right now when requesting PREFETCH=1, LIBXSMM does not generate the extended function signatures for taking prefetch locations using the Fortran interface.

TODO

Code optimizations: (1) prefetching memory references, (2) introducing a leading matrix dimension such that aligned Load and/or Store instructions can be used, and (3) AVX-512 testing and tuning.
Incorporate separate routines for matrix transposes, and check performance of a specialized MM kernel which is multiplying with a pretransposed B matrix.
Improved build system retiring the current mechanism (INDICES_M, INDICES_N, and INDICES_K). It also accepts empty list(s), i.e., it does not generate a specialized function in that case.
Publish performance results along with the benchmark driver.

Support specifying static code versions in full detail (LDx, etc.)

There is no support for specifying LDx with static code generation. This is a minor feature which can be supported via our long-planned spec-file. The latter would allow specifying static code versions beyond (M,N,K) triplets.

Provide PREFIX-based installation mechanism and related cleanup

Implement a PREFIX-based installation, and perhaps rename the generator executable to libxsmm_generator (to avoid any name clashes). This task will be able to leverage/complement the existing out-of-tree build mechanism.

Make flow is not compatible with python 3.4

Python 3.4 throws some errors when using the current make-flow:
[aheineck@aheineck-linux libxsmm_github]$ make realclean
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
[aheineck@aheineck-linux libxsmm_github]$ make generator
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax

Implement dynamic code dispatch for ISA extensions

Currently, a single code path (instruction set extension) is supported and determined at build time of the library. This is true for the statically requested kernels but also for the JITted code, both of which can support selecting the architecture at runtime (initialization time of the library). To actually implement this feature, we can check the feature bits ourselves, or rely on certain attributes available for both the Intel and the GNU-based tool chains. The attribute-based solution might be preferred with respect to maintenance. However, the level of ISA dispatch will ultimately be driven by the anticipated performance impact (we do not want a performance impact due to supporting this feature). Fixing this issue means at least enabling JITted code that matches the platform at runtime.

LIBXSMM_GEMM_DESCRIPTOR macro is broken for sparse

The LIBXSMM_GEMM_DESCRIPTOR macro in libxsmm_generator.h doesn't allow the value 0 for LDA, LDB, and LDC. However, these exceptional cases are used by the sparse matrix code generator to determine which matrix is sparse. A workaround was added to generator_driver.c (simply overwriting the generated descriptor).

However, automatically promoting LDA through LDC to m or k seems to be a pretty dangerous thing. If the user requests such code, no DGEMM should be generated, as this is an invalid specification. An error should be issued during code generation instead.
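
The "issue an error instead of promoting" idea can be sketched as a small validation step. This is an illustration only: the struct, the error codes, and the function name are hypothetical, and the column-major bounds (LDA >= m, LDB >= k, LDC >= m) are the usual BLAS convention rather than something taken from the generator's source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shape description of a dense GEMM request. */
typedef struct demo_gemm_shape {
  int m, n, k;       /* matrix extents */
  int lda, ldb, ldc; /* leading dimensions */
} demo_gemm_shape;

enum {
  DEMO_OK = 0,
  DEMO_ERR_DIM = -1,
  DEMO_ERR_LDA = -2,
  DEMO_ERR_LDB = -3,
  DEMO_ERR_LDC = -4
};

/* Column-major dense GEMM: A is m x k (lda >= m), B is k x n (ldb >= k),
 * and C is m x n (ldc >= m). Invalid leading dimensions (including 0)
 * are reported to the caller rather than silently promoted. */
int demo_validate_dense(const demo_gemm_shape* shape)
{
  assert(NULL != shape);
  if (shape->m <= 0 || shape->n <= 0 || shape->k <= 0) return DEMO_ERR_DIM;
  if (shape->lda < shape->m) return DEMO_ERR_LDA;
  if (shape->ldb < shape->k) return DEMO_ERR_LDB;
  if (shape->ldc < shape->m) return DEMO_ERR_LDC;
  return DEMO_OK;
}
```

With a check like this on the dense path only, a sparse request could keep using 0 as its marker without the dense path ever reinterpreting it.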

add travis-ci tests for F90 examples

Currently, the F90-binding examples are not tested for correctness when running the travis-ci tests. This should be fixed.

routines in libxsmm don't obey BETA=0

The inlined routines libxsmm_{s,d}imm and libxsmm_{s,d}blasmm generated in libxsmm.f90 hardcode beta = 1 even when built with BETA=0.

Incorrect results when ldA > K, ldB > N, or ldC > N.

As of commit baad5c1, libxsmm_sgemm returns incorrect results when ldA != K, ldB != N, and ldC != N for a row-major configuration.

Specifically, libxsmm was compiled with

$ make AVX=2 JIT=1 ROW_MAJOR=1

I've attached a reproducer:
xmm-bug.zip

Edit the makefile to find your local libxsmm, then run

$ ./xmm-bug 64 240 64 64 240 240 1

You will see that the reference C code, MKL, and libxsmm roughly agree on the answers.

Run

$ ./xmm-bug 64 239 64 64 240 240 1

and you will see that MKL and C agree, but libxsmm does not.
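
For context, a reference kernel that honors leading dimensions in a row-major layout might look like the sketch below. It is not the code from the attached reproducer (which is not shown here); the function name is hypothetical, and it only illustrates why ldA, ldB, and ldC must be used as row strides instead of K and N.

```c
#include <assert.h>

/* Reference C += A * B for row-major storage.
 * A is m x k with row stride lda >= k,
 * B is k x n with row stride ldb >= n,
 * C is m x n with row stride ldc >= n.
 * Elements in the padding (columns beyond k or n) are never touched. */
void demo_sgemm_rowmajor(int m, int n, int k,
  const float* a, int lda, const float* b, int ldb, float* c, int ldc)
{
  int i, j, p;
  assert(lda >= k && ldb >= n && ldc >= n);
  for (i = 0; i < m; ++i) {
    for (j = 0; j < n; ++j) {
      float acc = 0.f;
      for (p = 0; p < k; ++p) {
        /* index with the leading dimension, not with k or n */
        acc += a[i * lda + p] * b[p * ldb + j];
      }
      c[i * ldc + j] += acc;
    }
  }
}
```

A kernel that indexes with `k` or `n` in place of `lda`, `ldb`, or `ldc` gives the same answer only when the matrices are unpadded, which matches the symptom described above.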

Dispatch for unsupported code generation requests

Detect unsupported JIT code generation requests when building the LIBXSMM_GEMM_DESCRIPTOR. An unsupported code version needs to be dispatched to the fallback code path. Currently, an unsupported code version would fail in the code generator. This error condition is likely generated too slowly to be used for code dispatch.

determine highest available vector instruction set extension based on CPUID when using JIT option

Currently, the target vector instruction set extensions are fixed at compile time. This is an unnecessary limitation, as the CPU running LIBXSMM later on can be different (newer or older). In order to guarantee the best out-of-the-box performance, we should read the target's CPUID and determine the highest supported vector instruction set extension on the fly.
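
One possible way to query the highest available extension at runtime is sketched below, assuming GCC or Clang on x86, where the `__builtin_cpu_init` and `__builtin_cpu_supports` built-ins are available. The enum and function names are hypothetical; on other compilers or architectures the sketch simply reports a generic level.

```c
#include <assert.h>

/* Hypothetical ISA levels, ordered from lowest to highest. */
typedef enum demo_isa {
  DEMO_ISA_GENERIC = 0,
  DEMO_ISA_SSE3,
  DEMO_ISA_AVX,
  DEMO_ISA_AVX2,
  DEMO_ISA_AVX512F
} demo_isa;

/* Returns the highest ISA level reported by the CPU (via CPUID). */
demo_isa demo_detect_isa(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init(); /* populate the CPU feature data */
  if (__builtin_cpu_supports("avx512f")) return DEMO_ISA_AVX512F;
  if (__builtin_cpu_supports("avx2")) return DEMO_ISA_AVX2;
  if (__builtin_cpu_supports("avx")) return DEMO_ISA_AVX;
  if (__builtin_cpu_supports("sse3")) return DEMO_ISA_SSE3;
#endif
  return DEMO_ISA_GENERIC;
}

/* Clamps the detected level to the highest level the code generator
 * is able to emit, so that a newer CPU never requests unsupported code. */
demo_isa demo_select_isa(demo_isa detected, demo_isa generator_max)
{
  assert(detected <= DEMO_ISA_AVX512F && generator_max <= DEMO_ISA_AVX512F);
  return (detected < generator_max) ? detected : generator_max;
}
```

Calling `demo_select_isa(demo_detect_isa(), ...)` once at library initialization would keep the detection cost off the dispatch path.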

mmfunction dispatch not working

The libxsmm_mmfunction interface invariably returns 0.

Having built libxsmm like so:

$ make JIT=1 AVX=2 ROW_MAJOR=1

and the attached code like so:

$ make -f Makefile.big xmm-dispatch-bug

Run the example:

$ ./xmm-dispatch-bug 64 240 64 64 240 240 1

Note the assert that fails. The other call to libxsmm seems to succeed.

xmm-dispatch-bug.zip

loop elimination in generated code

Independent of the matrix kernel size, the generator backend generates loop bodies. For very small sizes (M<16), these loops have only one trip and can therefore be eliminated.

extracting common parts of sample makefiles

Currently, the makefiles in the sample directories (smm, cp2k, nek) do not share a common configuration. Common parts should be carved out to make their maintenance easier.

Timers in sample/smm rely on OpenMP

Remove the OpenMP timers and use gettimeofday (at least under Linux). This allows us to run in serial and to debug the Fortran interface performance.
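
A minimal sketch of such a timer based on POSIX `gettimeofday`, which has no OpenMP dependency. The function name `demo_seconds` is hypothetical and not part of the samples.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Returns the wall-clock time in seconds (microsecond resolution).
 * Intended for measuring durations: take two samples and subtract. */
double demo_seconds(void)
{
  struct timeval now;
  const int result = gettimeofday(&now, NULL);
  assert(0 == result);
  (void)result; /* avoid an unused-variable warning if NDEBUG is defined */
  return (double)now.tv_sec + 1E-6 * (double)now.tv_usec;
}
```

Usage would be `start = demo_seconds();`, then the kernel calls, then `duration = demo_seconds() - start;`. Note that `gettimeofday` follows the system clock and is therefore not guaranteed to be monotonic.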

support for arbitrary values of alpha and beta

Implementing support for arbitrary values of alpha and beta is not impossible (~5% performance hit for very small sizes). Therefore, we should consider adding this to the generator backend.
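
For reference, the semantics that a general alpha/beta kernel has to reproduce can be written as a plain C loop nest. This is only a sketch of the math (C = alpha * A * B + beta * C, column-major, no transposes) with a hypothetical function name. It says nothing about how the generator backend would emit the scaling.

```c
#include <assert.h>

/* C = alpha * A * B + beta * C for column-major storage.
 * A is m x k (lda >= m), B is k x n (ldb >= k), C is m x n (ldc >= m). */
void demo_dgemm(int m, int n, int k, double alpha,
  const double* a, int lda, const double* b, int ldb,
  double beta, double* c, int ldc)
{
  int i, j, p;
  assert(lda >= m && ldb >= k && ldc >= m);
  for (j = 0; j < n; ++j) {
    for (i = 0; i < m; ++i) {
      double acc = 0.0;
      for (p = 0; p < k; ++p) {
        acc += a[p * lda + i] * b[j * ldb + p];
      }
      /* With beta == 0, C is overwritten without being read,
       * so stale or uninitialized values in C cannot leak into the result. */
      c[j * ldc + i] = (0.0 != beta)
        ? (alpha * acc + beta * c[j * ldc + i])
        : (alpha * acc);
    }
  }
}
```

The special cases alpha = 1 or -1 and beta = 0 or 1 need no multiplication in the epilogue, which is presumably why they are cheaper than arbitrary values.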

Remove any calls performing non-private file I/O (incl. console output)

A library is not supposed to perform I/O operations that are visible to the outside (console output and leave-behind files). However, our non-NDEBUG code path may perform this kind of I/O to improve application testing and debugging. This requirement belongs to the code quality category, which is about allowing our code to be adopted where the highest standards apply.

Provide libxsmmf library accompanying the MODule file

Providing a libxsmmf.[a|so|dll] library that accompanies the (already generated) MODule file allows using LIBXSMM without including the header file and the related implications. Including libxsmm.f and linking against the regular libxsmm.[a|so|dll] is just an additional option for users who prefer working with a compiler-dependent module file.

support for transb in the generator backend

Currently, the generator can only generate code for non-trans operations. Support for transB is straightforward and should therefore be added.

fortran module breaks under `-r8` or `-fdefault-real-8`

ifort's -r8 and gfortran's -fdefault-real-8 cause LIBXSMM_SINGLE_PRECISION and LIBXSMM_DOUBLE_PRECISION to be the same, causing duplicate implementations of all the calls that differ only in precision. I can think of a few solutions:

(1) Don't change anything; codes shouldn't be using -r8 anyways.
(2) Define LIBXSMM_SINGLE_PRECISION using selected_real_kind.
(3) Define LIBXSMM_SINGLE_PRECISION as 4.

If BLAS had a true interface, then I'd go with (1), but seeing as LIBXSMM_SINGLE_PRECISION being defined as anything other than 4 would break the underlying SGEMM call, I think (2) and (3) are more flexible for right now. The difference there should be mostly aesthetic. Thoughts?

AVX512 instruction size reduction

AVX512 instructions allow for various memory reference encodings which impact the length of the encoded instruction. In general, shorter instructions should achieve better performance. The generator should be rewritten to use short instructions as often as possible.

integration tests

Could some of the samples, smm seems particularly suitable, be massaged into integration tests? It would boost my confidence in making changes and opening PRs.

I use travis for other projects. If someone else sets up a script that returns 0 on pass and non-zero on fail, I'm willing to set up the rest.

add support for vendor-specific (e.g. CRAY) wrappers to at least LIBXSMM samples

Running on Cray machines is easiest when using the Cray wrappers for the GNU/Intel compilers. They are CC=cc CXX=CC FC=ftn. Currently, the makefiles can be hacked (incl. STATIC=1) to build on Cray.

Often on Cray machines, the login node has a different arch than the compute node, but the wrappers have the best arch flags, so LIBXSMM's Cray support shouldn't specify -xHost.

Finalize the library and free internal resources (libxsmm_finalize)

Finalize the library (as the opposite of "libxsmm_init"), and free internal resources such as memory allocated to hold generated code (hash table).

Remove exit calls and instead propagate errors to the call side

A library is not supposed to exit an application. Instead, an unrecoverable error is propagated to the call side (where exit may be called or not). This gives an application the chance to perform its own cleanup and tear-down (independent of "magic" exit handler code). This requirement belongs to the code safety category, which is about allowing our code to be adopted where the highest standards apply.
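
The pattern described here, returning a status instead of calling exit, could look like the following sketch. The status codes and all function names are hypothetical and only illustrate the division of responsibility between library and application.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical status codes returned by every fallible library call. */
typedef enum demo_status {
  DEMO_SUCCESS = 0,
  DEMO_ERROR_ARGUMENT = 1,
  DEMO_ERROR_MEMORY = 2
} demo_status;

/* Library side: failures are reported via the return value,
 * and the library never terminates the process. */
demo_status demo_buffer_create(size_t size, void** buffer)
{
  if (NULL == buffer || 0 == size) return DEMO_ERROR_ARGUMENT;
  *buffer = malloc(size);
  return (NULL != *buffer) ? DEMO_SUCCESS : DEMO_ERROR_MEMORY;
}

void demo_buffer_destroy(void* buffer)
{
  free(buffer);
}

const char* demo_status_string(demo_status status)
{
  switch (status) {
    case DEMO_SUCCESS: return "success";
    case DEMO_ERROR_ARGUMENT: return "invalid argument";
    case DEMO_ERROR_MEMORY: return "out of memory";
    default: return "unknown error";
  }
}

/* Application side: the caller decides how to react,
 * e.g., reports the error, cleans up, and returns a failure code. */
int demo_application(size_t size)
{
  void* buffer = NULL;
  const demo_status status = demo_buffer_create(size, &buffer);
  if (DEMO_SUCCESS != status) {
    fprintf(stderr, "error: %s\n", demo_status_string(status));
    return EXIT_FAILURE;
  }
  assert(NULL != buffer);
  demo_buffer_destroy(buffer);
  return EXIT_SUCCESS;
}
```

Note that the console output happens in the application, not in the library, which also keeps the library free of visible I/O.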

Rework Makefile's mkdir mechanism to avoid issues in parallel builds

There are still spurious issues when building in parallel (make -j). The problems also appear with newer versions of GNU make (and independent of what is worked around already; make v3.82). Adopt a solution which implicitly creates the necessary directories for any target placed in a particular folder by introducing a "dummy" file representing the directory in question.

As a general cleanup, remove the rule(s) in the NEK sample which are installing into DEPDIR. Really this is an awful solution where the sample code installs into the library's directory structure (and this cannot be preserved). Any NEK-related code can still do it the other way around and simply rely on the sample folder. In another cleanup stage, one could also remove the sample related rules in LIBXSMM's Makefile doing various stuff (testing, script generation, etc.). Really this was never intended to be a solution for dealing with Travis (and there are better ways to do this).

Omit registering SSE code if JIT code can reach a higher ISA level

Omit registering SSE code if JIT code can reach a higher ISA level. This feature allows statically generating and including SSE3 code in the library while still getting the best ISA level (if the JIT backend is enabled). Please note that the JIT backend does not support non-AVX (SSE3).

OFFLOAD mode issue

I am trying to run this on a Phi and am compiling with
> make install OFFLOAD=1 MNK="2,4,6,8,10,12,14,16,18,20,23" AVX=3
but it errors out with

../../include/libxsmm.f90(143): error #6643: This statement is incorrectly positioned.
!DIR$ ATTRIBUTES OFFLOAD:MIC :: libxsmm_smm_2_2_2
----------^
../../include/libxsmm.f90(150): error #6643: This statement is incorrectly positioned.
!DIR$ ATTRIBUTES OFFLOAD:MIC :: libxsmm_dmm_2_2_2
----------^

Is there another flag that needs to be set to compile for the Phi?

samples/smm doesn't build

When I try building this sample, I'm blasted with these errors:

/usr/lib/gcc/x86_64-linux-gnu/4.9/include/xopintrin.h(438): error: identifier "__builtin_ia32_vpcomltud" is undefined
    return (__m128i) __builtin_ia32_vpcomltud ((__v4si)__A, (__v4si)__B);
                     ^

In file included from /usr/lib/gcc/x86_64-linux-gnu/4.9/include/x86intrin.h(52),
                 from /usr/include/x86_64-linux-gnu/c++/4.9/bits/opt_random.h(33),
                 from /usr/include/c++/4.9/random(50),
                 from /usr/include/c++/4.9/bits/stl_algo.h(66),
                 from /usr/include/c++/4.9/algorithm(62),
                 from /home/maxhutch/src/libxsmm/samples/smm/blas.cpp(37):
/usr/lib/gcc/x86_64-linux-gnu/4.9/include/xopintrin.h(444): error: identifier "__builtin_ia32_vpcomleud" is undefined
    return (__m128i) __builtin_ia32_vpcomleud ((__v4si)__A, (__v4si)__B);
                     ^

compilation aborted for /home/maxhutch/src/libxsmm/samples/smm/blas.cpp (code 4)
Makefile:380: recipe for target 'build/blas-cpp.o' failed
make: *** [build/blas-cpp.o] Error 4

These look like compiler issues, but I'm running a vanilla debian system, so I thought they'd be worth pointing out.

list option when pre-building the library

Currently, if a user wants to pre-build a specific set of specializations the MNK="" or M="", N="", K="" interface has to be used. For larger sets there have been reports that bash/make are failing and the entire build fails. It might be useful to have a python script, json, or xml input file which specifies the requested kernels and the LIBXSMM make system builds these kernels afterwards step-by-step.

Dynamically dispatch CRC32 according to CPUID flags

Dynamically dispatch the code path making use of CRC32 instructions. This will allow running on pre-Nehalem/Westmere CPUs (no SSE4.2/CRC32 instructions). The intention is to support Linux distributions (package managers) aiming for a wider range of processors.
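
A sketch of how such a dispatch could be structured, assuming GCC or Clang on x86. It pairs a portable bitwise CRC-32C (the Castagnoli polynomial that the SSE4.2 CRC32 instruction implements) with a hardware path compiled only for that one function via a `target` attribute. All function names are hypothetical; `__builtin_ia32_crc32qi` is the compiler built-in behind the `_mm_crc32_u8` intrinsic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
# define DEMO_X86_GNU 1
#endif

/* Portable fallback: bitwise CRC-32C, reflected polynomial 0x82F63B78. */
uint32_t demo_crc32c_sw(const void* data, size_t size)
{
  const unsigned char* bytes = (const unsigned char*)data;
  uint32_t crc = 0xFFFFFFFFu;
  size_t i;
  int bit;
  for (i = 0; i < size; ++i) {
    crc ^= bytes[i];
    for (bit = 0; bit < 8; ++bit) {
      /* shift out one bit; apply the polynomial if that bit was set */
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

#if defined(DEMO_X86_GNU)
/* Hardware path: only this function is compiled with SSE4.2 enabled,
 * so the rest of the library still runs on CPUs without it. */
__attribute__((target("sse4.2")))
uint32_t demo_crc32c_hw(const void* data, size_t size)
{
  const unsigned char* bytes = (const unsigned char*)data;
  uint32_t crc = 0xFFFFFFFFu;
  size_t i;
  for (i = 0; i < size; ++i) {
    crc = __builtin_ia32_crc32qi(crc, bytes[i]);
  }
  return ~crc;
}
#endif

typedef uint32_t (*demo_crc32c_function)(const void*, size_t);

/* Picks the implementation according to the CPUID feature flags. */
demo_crc32c_function demo_crc32c_select(void)
{
#if defined(DEMO_X86_GNU)
  __builtin_cpu_init();
  if (__builtin_cpu_supports("sse4.2")) return demo_crc32c_hw;
#endif
  return demo_crc32c_sw;
}

/* Public entry point; the selection is made once on first use
 * (not thread-safe in this sketch; a library would do it during init). */
uint32_t demo_crc32c(const void* data, size_t size)
{
  static demo_crc32c_function function = NULL;
  assert(NULL != data || 0 == size);
  if (NULL == function) function = demo_crc32c_select();
  return function(data, size);
}
```

Both paths must produce identical keys, since hash values computed by one path may be looked up through the other if the selection ever changes between runs of a persistent cache.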

link Error (build requirements?)

I'm trying to run libxsmm on a CPU-only (IvyBridge-E) system with a somewhat dated compiler:

maxhutch@edoras:~/src/clean-tests/RTI-LST$ ifort --version
ifort (IFORT) 14.0.1 20131008
Copyright (C) 1985-2013 Intel Corporation.  All rights reserved.

At link, it gives some warnings about some MIC things and then dies with an opaque Error 100, maybe related to missing x86_64-k1om-linux-ld:

/opt/openmpi-intel/bin/mpif90 -g -check all -debug all -traceback  -o nek5000 -ffpe-trap=invalid,zero,overflow -fsignaling-nans -I/opt/fftw3/include/ -I/home/maxhutch/src/libxsmm/include obj/test.o obj/kinds_mod.o obj/mpif.o obj/fftw3.o obj/size_mod.o obj/speclib.o obj/mesh_mod.o obj/input_mod.o obj/parallel_mod.o obj/fft_fftw_mod.o obj/ctimer_mod.o obj/dealias_mod.o obj/domain_mod.o obj/dxyz_mod.o obj/eigen_mod.o obj/esolv_mod.o obj/fdmh1_mod.o obj/geom_mod.o obj/hsmg_mod.o obj/interp_mod.o obj/ixyz_mod.o obj/mvgeom_mod.o obj/nekuse_mod.o obj/opctr_mod.o obj/restart_mod.o obj/scratch_mod.o obj/semhat_mod.o obj/soln_mod.o obj/steady_mod.o obj/string_mod.o obj/topol_mod.o obj/tstep_mod.o obj/turbo_mod.o obj/wz_mod.o obj/wzf_mod.o obj/zper_mod.o obj/io_mod.o obj/poisson_mod.o obj/navier4.o obj/drive.o obj/drive1.o obj/drive2.o obj/plan4.o obj/bdry.o obj/coef.o obj/conduct.o obj/connect1.o obj/connect2.o obj/dssum.o obj/eigsolv.o obj/genxyz.o obj/hsmg.o obj/gmres.o obj/convect.o obj/induct.o obj/navier0.o obj/navier1.o obj/navier5.o obj/navier6.o obj/navier8.o obj/map2.o obj/ic.o obj/ssolv.o obj/math.o obj/mxm_wrapper.o obj/hmholtz.o obj/subs1.o obj/fast3d.o obj/fasts.o obj/byte.o obj/chelpers.o obj/byte_mpi.o obj/prepost.o obj/nek_comm.o obj/setprop.o obj/papi.o obj/gauss.o obj/makeq.o obj/makeq_aux.o obj/mxm_std.o obj/comm_mpi.o obj/singlmesh.o obj/jl_gs.o obj/jl_sort.o obj/jl_sarray_transfer.o obj/jl_sarray_sort.o obj/jl_gs_local.o obj/jl_crystal.o obj/jl_comm.o obj/jl_tensor.o obj/jl_fail.o obj/jl_fcrystal.o obj/jl_findpts.o obj/jl_findpts_local.o obj/jl_obbox.o obj/jl_poly.o obj/jl_lob_bnd.o obj/jl_findpts_el_3.o obj/jl_findpts_el_2.o obj/jl_sparse_cholesky.o obj/jl_xxt.o obj/jl_fcrs.o -lblas -llapack -L/opt/fftw3/lib/ -lfftw3 -L/home/maxhutch/src/libxsmm/lib/intel64 -lxsmm
ifort: command line warning #10006: ignoring unknown option '-ffpe-trap=invalid,zero,overflow'
ifort: command line warning #10006: ignoring unknown option '-fsignaling-nans'
ifort: warning #10182: disabling optimization; runtime debug checks enabled
ifort: command line warning #10006: ignoring unknown option '-ffpe-trap=invalid,zero,overflow'
ifort: command line warning #10006: ignoring unknown option '-fsignaling-nans'
ifort: warning #10362: Environment configuration problem encountered.  Please check for proper MPSS installation and environment setup.
ifort: warning #10182: disabling optimization; runtime debug checks enabled
x86_64-k1om-linux-ld: No such file or directory
makefile:165: recipe for target 'nek5000' failed
make: *** [nek5000] Error 100

LIBXSMM interface/frontend refinement

Promote Alpha and Beta arguments to the simplified interface. Support JIT-building kernels with general xGEMM arguments using the frontend, and adjust the dispatch functions accordingly. This change will break with our currently deployed simplified interface (frontend) which is only accepting M, N, and K parameters. The intention of this issue is to settle our frontend interface.

FORTRAN interface

Generate and implement a FORTRAN interface along with some sample code (driver).

MPSS required, even with MIC=0

On commit 50ed3d0, my system with ICC 16.0.1 and without MPSS cannot build libxsmm, even with MIC=0:

$ make MIC=0

[jsewall libxsmm (master)]$ make OFFLOAD=0
icc -Wall -Wno-unused-function -Wno-attributes -fPIC -O2 -ftree-vectorize -ffast-math -funroll-loops -D__extern_always_inline=inline -DNDEBUG -D__STATIC -D__MKL -Iinclude -Ibuild -I/nfs_home/jsewall/src/libxsmm/src -I/swtools/intel/compilers_and_libraries_2016.1.150/linux/mkl/include -mavx2 -c /nfs_home/jsewall/src/libxsmm/src/libxsmm.c -o build/intel64/libxsmm.o
icc: command line warning #10006: ignoring unknown option '-ffast-math'
icc: command line warning #10353: option '-mavx2' ignored, suggest using '-march=core-avx2'
icc: warning #10193: -vec is default; use -x and -ax to configure vectorization
icc: command line warning #10006: ignoring unknown option '-ffast-math'
icc: warning #10362: Environment configuration problem encountered. Please check for proper MPSS installation and environment setup.
icc: warning #10193: -vec is default; use -x and -ax to configure vectorization
In file included from include/libxsmm_frontend.h(35),
                 from include/libxsmm.h(65),
                 from /nfs_home/jsewall/src/libxsmm/src/libxsmm.c(31):
include/libxsmm_macros.h(279): catastrophic error: MIC cannot open source file "pthread.h"
  # include <pthread.h>

I also get warnings about the flag -mavx2, which ICC ignores (-march=core-avx2 is the preferred flag).

remove ALIGNED_STORES and ALIGNED_LOADS options

Currently, LIBXSMM has two build options which control implicit changes of the LDA and LDC parameters.
As we are moving to a more general interface which includes support for LDx, ALIGNED_STORES and ALIGNED_LOADS are redundant and should be removed.
LIBXSMM should still provide macros or functions which allow for easily deriving "padded" LDx values matching the smallest required value.
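
Such a helper could be as small as the following sketch, which rounds a dimension up to the next multiple of the number of elements that fit into one alignment unit. The function name and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the smallest leading dimension (in elements) that is >= n and
 * makes every column (or row) start at a multiple of `alignment` bytes,
 * given that the matrix itself starts at an aligned address.
 * `alignment` must be a multiple of `typesize` (both in bytes). */
size_t demo_ld_padded(size_t n, size_t typesize, size_t alignment)
{
  size_t elements;
  assert(0 < typesize && 0 < alignment && 0 == (alignment % typesize));
  elements = alignment / typesize; /* elements per alignment unit */
  return ((n + elements - 1) / elements) * elements;
}
```

For example, 23 double-precision values (8 bytes each) with a 64-byte alignment yield a padded leading dimension of 24.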

assumed-size F90 interface

Currently, LIBXSMM's F90 interface requires 2D Fortran arrays as inputs. We have seen applications which need to call LIBXSMM routines for contiguous slices of higher-dimensional arrays. A quick test unveiled that the needed reshape is not replaced by a no-op. Therefore, the only solution is to change LIBXSMM's F90 interface to an assumed-size interface. This will disable row-major support for Fortran.

Remove compiler generated fallback code

Currently, LIBXSMM offers two fallback options: a) compiler-generated and unrolled code, and b) a call into a BLAS library. As it is planned to evolve the LIBXSMM generator to support additional cases such as arbitrary alpha and beta and transpose options, and as LIBXSMM's JIT feature will become a stable release feature, alternative a) is redundant and will most likely never be called. Therefore, it should be considered deprecated and removed in a future release of LIBXSMM.

Handle hash key collisions in the code cache.

This issue is known both in terms of the problem and the solution. It is planned to evict code from the cache in case of a collision in order to avoid the performance overhead of full collision handling. The latter requires an exact comparison of two descriptors on top of the CRC32-based hash key. Moreover, the infrastructure to receive the target/populated descriptor needs to be implemented. Actually evicting the code also requires properly releasing/reusing the memory associated with an entry of the cache.
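
The evict-on-collision idea can be sketched with a small direct-mapped cache: each slot stores the full descriptor next to the code pointer, a lookup compares descriptors exactly, and an insert into an occupied slot hands the previous code back to the caller for release. The types, the cache size, and the stand-in hash function are hypothetical (the issue refers to a CRC32-based key).

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CACHE_SIZE 16

typedef struct demo_descriptor { int m, n, k; } demo_descriptor;
typedef struct demo_entry { demo_descriptor desc; void* code; } demo_entry;
typedef struct demo_cache {
  demo_entry slots[DEMO_CACHE_SIZE];
  int evictions;
} demo_cache;

/* Stand-in hash function used only for this illustration. */
static unsigned int demo_hash(const demo_descriptor* desc)
{
  return ((unsigned int)desc->m * 31u + (unsigned int)desc->n) * 31u
    + (unsigned int)desc->k;
}

static int demo_equal(const demo_descriptor* a, const demo_descriptor* b)
{
  return a->m == b->m && a->n == b->n && a->k == b->k;
}

void demo_cache_init(demo_cache* cache)
{
  int i;
  assert(NULL != cache);
  for (i = 0; i < DEMO_CACHE_SIZE; ++i) cache->slots[i].code = NULL;
  cache->evictions = 0;
}

/* Returns the cached code, or NULL if the slot is empty or holds
 * a different descriptor that merely shares the same hash slot. */
void* demo_cache_find(const demo_cache* cache, const demo_descriptor* desc)
{
  const demo_entry* entry;
  assert(NULL != cache && NULL != desc);
  entry = &cache->slots[demo_hash(desc) % DEMO_CACHE_SIZE];
  return (NULL != entry->code && demo_equal(&entry->desc, desc))
    ? entry->code : NULL;
}

/* Stores the code; if another descriptor occupied the slot, its code is
 * evicted and returned so that the caller can release or reuse the memory. */
void* demo_cache_insert(demo_cache* cache, const demo_descriptor* desc, void* code)
{
  demo_entry* entry;
  void* evicted = NULL;
  assert(NULL != cache && NULL != desc && NULL != code);
  entry = &cache->slots[demo_hash(desc) % DEMO_CACHE_SIZE];
  if (NULL != entry->code && !demo_equal(&entry->desc, desc)) {
    evicted = entry->code;
    ++cache->evictions;
  }
  entry->desc = *desc;
  entry->code = code;
  return evicted;
}
```

The exact descriptor comparison is what prevents a lookup from returning a kernel generated for a different shape. Eviction keeps the structure free of chains or probing, at the cost of regenerating code when two hot descriptors share a slot.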

Implement streaming stores

I want to call libxsmm functions for chunks of large matrices in a memory-bound kernel (similar to the "batched" mode in the examples).
Therefore, it would be great to have the possibility to employ streaming stores for the result matrix.

Is it possible/sensible/realistic that you implement this?
