libxsmm / libxsmm

Library for specialized dense and sparse matrix operations, and deep learning primitives.

Home Page: https://libxsmm.readthedocs.io/

License: BSD 3-Clause "New" or "Revised" License

Shell 1.45% C++ 1.73% C 92.64% Python 1.00% Makefile 1.93% Batchfile 0.04% Fortran 1.14% HTML 0.01% JavaScript 0.01% CSS 0.01% Starlark 0.01% CMake 0.03%
jit simd avx512 machine-learning sparse blas matrix-multiplication transpose bfloat16 avx2

libxsmm's People

Contributors

abhisek-kundu, adelmanm, alheinecke, benoitsteiner, breuera, chenyuzhang16, ddkalamk, deeptiag1, dmudiger, egeor, freddiewitherden, geoffreyqiu, gregmhenry, hfp, ibhati, jeffhammond, jspark1105, kunalbanerjee, kvoronin-intel, mahudu97, maxhutch, mdebski, narendrachaudhary51, nrsatish, nshustrov, rajbarik, rengolin, ska278, xing-liu, zbarukh

libxsmm's Issues

Full xGEMM interface and LD_PRELOADable library

Add procedures with the exact BLAS xGEMM signature, including the appropriate code dispatch. Implement a libxsmm_proxy library (.so, .dll) that is able to intercept existing xGEMM calls. Document how to achieve a similar effect using static linkage (no code changes, only an adjusted link line).

support for alpha -1

A common special case for alpha is -1. Currently, the generator does not generally support this case (alpha = -1).

remove generation for aligned stores and loads from the generator backend

Recent Intel Architecture processors (Sandy Bridge or later) do not suffer a performance penalty when executing an unaligned vector load (vmovups/vmovupd) on aligned data (where, in theory, we could use vmovaps/vmovapd). Therefore, we can take this complexity out of the generator backend.

Side note: this would also mean that the Intel Knights Corner backend needs to be removed, or at least limited to aligned LDx. The reason is that the previous statement does not hold on Intel Knights Corner, as this architecture does not offer simple unaligned vector data move instructions.

the internal jit_generator tester is currently broken

After the latest refactoring, the jit_generator doesn't compile anymore. This seems to be a simple include issue. However, we moved many function definitions between several headers, so we need to check which header needs to be included in jit_validation.c.

clBLAS-master\src\library\blas\xgemm.cc(394): error C2065: 'gemmSelectKernel': undeclared identifier

Hi there,
I have been struggling for days to build a cblas.lib using Visual Studio 2012 (Desktop), and I just can't get it to compile correctly.
The long list of errors starts with

clBLAS-master\src\library\blas\xgemm.cc(394): error C2065: 'gemmSelectKernel': undeclared identifier

After that there are many more errors, which I presume are just follow-ups. Can anyone help?

Many thanks,
René

-cp2k flag in make.sh

hey all,
just a few quick questions about the -cp2k flag in make.sh. Is it supposed to deliver a cp2k suitable library?

I see that the MNK options are

MNK="
23,
6,
14 16 29,
14 32 29,
5 32 13 24 26,
9 32 22,
64,
78,
16 29 55,
32 29 55,
12,
4 5 7 9 13 25 26 28 32 45"

But in the cp2k's toolchain installer the MNK options are

MNK="1 4 5 6 8 9 13 16 17 22 23 24 26 32"
which are far fewer values, and also different combinations.

Next, it sets SSE=3, which according to makefile.inc and the documentation doesn't exist. Only SSE=1 and AVX=1|2|3 exist.

Lastly, the library is described as Intel-optimized, and the cp2k.pdf in the documentation says I should have icc. Is it also supposed to work with the gcc compiler? I did manage to compile it with gcc. Are there any issues with gcc?

Thank you for any answers.

Johannes
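
To compare the two MNK settings quoted above, it helps to expand them into explicit kernel shapes. The following is only a sketch under one assumption: each comma-separated group contributes the Cartesian product of its values for M, N, and K. The helper name expand_mnk is hypothetical and not part of LIBXSMM.

```python
from itertools import product


def expand_mnk(spec):
    """Expand an MNK specification into a sorted list of unique (m, n, k) triplets.

    Assumption (not confirmed by the post above): every comma-separated group
    contributes the Cartesian product of its values for M, N, and K.
    """
    triplets = set()
    for group in spec.split(","):
        values = [int(v) for v in group.split()]
        triplets.update(product(values, repeat=3))
    return sorted(triplets)


# The two specifications quoted in the post.
MAKE_SH = ("23, 6, 14 16 29, 14 32 29, 5 32 13 24 26, 9 32 22, 64, 78, "
           "16 29 55, 32 29 55, 12, 4 5 7 9 13 25 26 28 32 45")
TOOLCHAIN = "1 4 5 6 8 9 13 16 17 22 23 24 26 32"

only_in_make_sh = set(expand_mnk(MAKE_SH)) - set(expand_mnk(TOOLCHAIN))
only_in_toolchain = set(expand_mnk(TOOLCHAIN)) - set(expand_mnk(MAKE_SH))
print(len(only_in_make_sh), len(only_in_toolchain))
```

Under this assumption, a single long group expands to many more shapes than its length suggests, so comparing the raw lists by eye can be misleading.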

KNC code generation

KNC is generated although not requested:

make M="4 8 10 12 16 64 100 144" N="4 8 10 12 16 64 100 144" K="4 8 10 12" BETA=0 OFFLOAD=0 MIC=0 SSE=3

Furthermore, when building with OFFLOAD=0, the application shouldn't be required to use -no-offload.

This applies both to compiling the f90 module and to including the libxsmm.h header in C/C++ applications.

TODO

  • Code optimizations: (1) prefetching memory references, (2) introducing a leading matrix dimension such that aligned Load and/or Store instructions can be used, and (3) AVX-512 testing and tuning.
  • Incorporate separate routines for matrix transposes, and check the performance of a specialized MM kernel that multiplies with a pre-transposed B matrix.
  • Improve the build system, retiring the current mechanism (INDICES_M, INDICES_N, and INDICES_K). The new system should also accept empty lists, i.e., generate no specialized function.
  • Publish performance results along with the benchmark driver.

Make flow is not compatible with python 3.4

Python 3.4 throws some errors when using the current make-flow:
[aheineck@aheineck-linux libxsmm_github]$ make realclean
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
[aheineck@aheineck-linux libxsmm_github]$ make generator
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
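
The syntax error comes from the Python 2 print statement, which Python 3 no longer accepts. A minimal sketch of a form that runs under both versions follows. The function name format_dims is hypothetical, and the "_" separator is an assumption, because the inner string literal is not recoverable from the log above.

```python
def format_dims(dims, sep="_"):
    """Join each (m, n, k) tuple with `sep`, then join the results with spaces.

    The "_" default is an assumption; the log does not show the real separator.
    """
    return " ".join(sep.join(map(str, mnk)) for mnk in dims)


# print with a single parenthesized argument is valid in Python 2 and Python 3.
print(format_dims([(4, 4, 4), (8, 8, 8)]))
```

For print calls with several arguments, adding "from __future__ import print_function" at the top of the script keeps Python 2 behavior consistent with Python 3.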

Implement dynamic code dispatch for ISA extensions

Currently, a single code path (instruction set extension) is supported, and it is determined at build time of the library. This applies to the statically requested kernels as well as to the JITted code, both of which could support selecting the architecture at runtime (at library initialization). To implement this feature, we can check the feature bits ourselves or rely on certain attributes available in both the Intel and the GNU tool chains. The attribute-based solution might be preferred with respect to maintenance. However, the level of ISA dispatch will ultimately be driven by the anticipated performance impact (we do not want a performance penalty for supporting this feature). At a minimum, fixing this issue means enabling JITted code that matches the platform at runtime.
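
The "check the feature bits ourselves" option can be illustrated by a small sketch that picks the highest available ISA level once, at initialization. It is not the library's implementation: it reads the Linux /proc/cpuinfo flag names instead of executing CPUID, and the names ISA_LEVELS, select_isa, and read_cpu_flags are hypothetical.

```python
# Ordered from most to least capable; flag names follow Linux /proc/cpuinfo.
ISA_LEVELS = [
    ("avx512f", "AVX-512"),
    ("avx2", "AVX2"),
    ("avx", "AVX"),
    ("sse4_2", "SSE4.2"),
]


def select_isa(cpu_flags):
    """Return the name of the highest ISA level present in `cpu_flags`."""
    for flag, name in ISA_LEVELS:
        if flag in cpu_flags:
            return name
    return "generic"


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU feature flags as a set (empty if the file is unavailable)."""
    try:
        with open(path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except (IOError, OSError):
        pass
    return set()


# Decide once; later dispatch only consults the cached result.
TARGET_ISA = select_isa(read_cpu_flags())
print(TARGET_ISA)
```

Making the decision once at initialization keeps the per-call dispatch cost independent of the detection cost, which matches the concern about performance impact raised above.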

LIBXSMM_GEMM_DESCRIPTOR macro is broken for sparse

The LIBXSMM_GEMM_DESCRIPTOR macro in libxsmm_generator.h does not allow a value of 0 for LDA, LDB, and LDC. However, these exceptional cases are used by the sparse matrix code generator to determine which matrix is sparse. A workaround was added to generator_driver.c (simply overwriting the generated descriptor).

However, automatically promoting LDA through LDC to m or k seems to be a pretty dangerous thing. If a user requests such code for DGEMM, the specification is invalid and no code should be generated. An error should be issued during code generation instead.

Incorrect results when ldA > K, ldB > N, or ldC > N.

As of commit baad5c1, libxsmm_sgemm returns incorrect results when ldA != K, ldB != N, and ldC != N for a row-major configuration.

Specifically, libxsmm was compiled with

$ make AVX=2 JIT=1 ROW_MAJOR=1

I've attached a reproducer:
xmm-bug.zip

Edit the makefile to find your local libxsmm, then run

$ ./xmm-bug 64 240 64 64 240 240 1

You will see that reference C code, MKL, and libxsmm roughly agree on the answers.

Run

$ ./xmm-bug 64 239 64 64 240 240 1

and you will see that MKL and C agree, but libxsmm does not.
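
For readers unfamiliar with leading dimensions, the following reference sketch shows how ldA, ldB, and ldC enter the index arithmetic in a row-major layout. It is neither the attached reproducer nor LIBXSMM code; the function name is hypothetical, and it assumes C += A * B (alpha = beta = 1).

```python
def gemm_row_major(m, n, k, a, lda, b, ldb, c, ldc):
    """Reference C += A * B on flat, row-major buffers.

    A is m x k with row stride lda, B is k x n with row stride ldb, and
    C is m x n with row stride ldc. Padding elements are never touched.
    """
    if lda < k or ldb < n or ldc < n:
        raise ValueError("leading dimension smaller than the row length")
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i * lda + p] * b[p * ldb + j]
            c[i * ldc + j] += acc
    return c


# 2x2 example with padded rows; -1.0 marks padding that must stay untouched.
a = [1.0, 2.0, -1.0,
     3.0, 4.0, -1.0]              # lda = 3 > k = 2
b = [5.0, 6.0, -1.0, -1.0,
     7.0, 8.0, -1.0, -1.0]        # ldb = 4 > n = 2
c = [0.0, 0.0, -1.0,
     0.0, 0.0, -1.0]              # ldc = 3 > n = 2
print(gemm_row_major(2, 2, 2, a, 3, b, 4, c, 3))
```

A kernel that silently assumes ldA == K, ldB == N, or ldC == N would read or write the padding and disagree with such a reference, which is the kind of mismatch the reproducer reports.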

Dispatch for unsupported code generation requests

Detect unsupported JIT code generation requests when building the LIBXSMM_GEMM_DESCRIPTOR. An unsupported code version needs to be dispatched to the fallback code path. Currently, an unsupported code version would fail in the code generator. This error condition is likely raised too slowly to be used for code dispatch.

mmfunction dispatch not working

The libxsmm_mmfunction interface invariably returns 0.

Having built libxsmm like so:

$ make JIT=1 AVX=2 ROW_MAJOR=1

and the attached code like so:

$ make -f Makefile.big xmm-dispatch-bug

Run the example:

$ ./xmm-dispatch-bug 64 240 64 64 240 240 1 

Note the assert that fails. The other call to libxsmm seems to succeed.

xmm-dispatch-bug.zip

loop elimination in generated code

Independent of the matrix kernel size, the generator backend generates loop bodies. For very small sizes (M<16), these loops have only one trip and can therefore be eliminated.
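
The idea can be sketched with a toy generator that emits C-like text rather than machine code. This is an illustration only: the real backend emits instructions, and the name emit_m_loop, the block size, and the body string are hypothetical.

```python
def emit_m_loop(m, block, body):
    """Return C-like source lines that run `body` once per block of rows.

    When the whole range fits into a single block, the loop has exactly one
    trip, so the loop header, counter increment, and branch are omitted.
    """
    trips = -(-m // block)  # ceiling division
    if trips == 1:
        return ["{", "  const int i = 0;", "  " + body, "}"]
    return ["for (int i = 0; i < %d; i += %d) {" % (m, block),
            "  " + body,
            "}"]


print("\n".join(emit_m_loop(8, 16, "kernel(i);")))   # single trip: no loop
print("\n".join(emit_m_loop(64, 16, "kernel(i);")))  # four trips: loop kept
```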

Timers in sample/smm rely on OpenMP

Remove the OpenMP timers and use gettimeofday instead (at least under Linux). This allows us to run in serial and to debug the performance of the Fortran interface.

support for arbitrary values of alpha and beta

Implementing support for arbitrary values of alpha and beta is feasible (~5% performance hit for very small sizes). Therefore, we should consider adding this to the generator backend.

Remove any calls performing non-private file I/O (incl. console output)

A library is not supposed to perform visible I/O operations (console output and leave-behind files). However, our non-NDEBUG code path may perform this kind of I/O to improve application testing and debugging. This requirement belongs to the code quality category, which is about allowing our code to be adopted where the highest standards apply.

Provide libxsmmf library accompanying the MODule file

Providing a libxsmmf.[a|so|dll] library that accompanies the MODule file (already generated) allows for using LIBXSMM without including the header file and the related implications. Including libxsmm.f and linking against the regular libxsmm.[a|so|dll] is just an additional option for users who prefer working with a compiler-dependent module file.

fortran module breaks under `-r8` or `-fdefault-real-8`

ifort's -r8 and gfortran's -fdefault-real-8 cause LIBXSMM_SINGLE_PRECISION and LIBXSMM_DOUBLE_PRECISION to be the same, causing double implementations of all the calls that differ only in precision. I can think of a few solutions:

  1. Don't change anything; codes shouldn't be using -r8 anyways.
  2. Define LIBXSMM_SINGLE_PRECISION using selected_real_kind
  3. Define LIBXSMM_SINGLE_PRECISION as 4

If BLAS had a true interface, then I'd go with (1), but seeing as LIBXSMM_SINGLE_PRECISION being defined as anything other than 4 would break the underlying SGEMM call, I think (2) and (3) are more flexible for right now. The difference there should be mostly aesthetic. Thoughts?

AVX512 instruction size reduction

AVX512 instructions allow for various memory reference encodings, which affect the encoded instruction length. In general, shorter instructions should achieve better performance. The generator should be rewritten to use short instructions as often as possible.

integration tests

Could some of the samples, smm seems particularly suitable, be massaged into integration tests? It would boost my confidence in making changes and opening PRs.

I use travis for other projects. If someone else sets up a script that returns 0 on pass and non-zero on fail, I'm willing to set up the rest.
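
A pass/fail script of the kind requested can be very small. The sketch below is hypothetical: it is not part of the repository, and the commands to run (for example, sample binaries with fixed arguments) would have to be supplied by someone who knows the samples.

```python
import subprocess
import sys


def run_checks(commands):
    """Run each command (a list of arguments); return 0 if all succeed, else 1."""
    failed = 0
    for cmd in commands:
        status = subprocess.call(cmd)
        if status != 0:
            print("FAILED (%d): %s" % (status, " ".join(cmd)))
            failed += 1
    return 0 if failed == 0 else 1


# Stand-in commands; a real list would invoke the built samples instead.
demo_status = run_checks([
    [sys.executable, "-c", "pass"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
])
print("overall status:", demo_status)
```

A CI wrapper would end with sys.exit(run_checks(...)), so that the job fails whenever any sample returns a non-zero status.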

add support for vendor-specific (e.g. CRAY) wrappers to at least LIBXSMM samples

Running on Cray machines is easiest when using the Cray wrappers for the GNU/Intel compilers. They are CC=cc CXX=CC FC=ftn. Currently, the makefiles can be hacked (incl. STATIC=1) to build on Cray.

Often on Cray machines, the login node has a different architecture than the compute node, but the wrappers have the best architecture flags. Hence, LIBXSMM's Cray support shouldn't specify -xHost.

Remove exit calls and instead propagate errors to the call side

A library is not supposed to exit an application. Instead, an unrecoverable error is propagated to the call side (where exit may or may not be called). This gives an application the chance to perform its own cleanup and tear-down (independent of "magic" exit handler code). This requirement belongs to the code safety category, which is about allowing our code to be adopted where the highest standards apply.

Rework Makefile's mkdir mechanism to avoid issues in parallel builds

There are still spurious issues when building in parallel (make -j). The problems also appear with newer versions of GNU make (and independent of what is already worked around; make v3.82). Adopt a solution that implicitly creates the necessary directories for any target placed in a particular folder by introducing a "dummy" file representing the directory in question.

As a general cleanup, remove the rule(s) in the NEK sample which are installing into DEPDIR. Really this is an awful solution where the sample code installs into the library's directory structure (and this cannot be preserved). Any NEK-related code can still do it the other way around and simply rely on the sample folder. In another cleanup stage, one could also remove the sample related rules in LIBXSMM's Makefile doing various stuff (testing, script generation, etc.). Really this was never intended to be a solution for dealing with Travis (and there are better ways to do this).

Omit registering SSE code if JIT code can reach a higher ISA level

Omit registering SSE code if JIT code can reach a higher ISA level. This feature allows SSE3 code to be statically generated and included in the library while still getting the best ISA level (if the JIT backend is enabled). Please note that the JIT backend does not support non-AVX (SSE3).

OFFLOAD mode issue

I am trying to run this on a Phi and am compiling with
> make install OFFLOAD=1 MNK="2,4,6,8,10,12,14,16,18,20,23" AVX=3
but it errors out with

../../include/libxsmm.f90(143): error #6643: This statement is incorrectly positioned.
!DIR$ ATTRIBUTES OFFLOAD:MIC :: libxsmm_smm_2_2_2
----------^
../../include/libxsmm.f90(150): error #6643: This statement is incorrectly positioned.
!DIR$ ATTRIBUTES OFFLOAD:MIC :: libxsmm_dmm_2_2_2
----------^

Is there another flag that needs to be set to compile for the Phi?

samples/smm doesn't build

When I try building this sample, I'm blasted with these errors:

/usr/lib/gcc/x86_64-linux-gnu/4.9/include/xopintrin.h(438): error: identifier "__builtin_ia32_vpcomltud" is undefined
    return (__m128i) __builtin_ia32_vpcomltud ((__v4si)__A, (__v4si)__B);
                     ^

In file included from /usr/lib/gcc/x86_64-linux-gnu/4.9/include/x86intrin.h(52),
                 from /usr/include/x86_64-linux-gnu/c++/4.9/bits/opt_random.h(33),
                 from /usr/include/c++/4.9/random(50),
                 from /usr/include/c++/4.9/bits/stl_algo.h(66),
                 from /usr/include/c++/4.9/algorithm(62),
                 from /home/maxhutch/src/libxsmm/samples/smm/blas.cpp(37):
/usr/lib/gcc/x86_64-linux-gnu/4.9/include/xopintrin.h(444): error: identifier "__builtin_ia32_vpcomleud" is undefined
    return (__m128i) __builtin_ia32_vpcomleud ((__v4si)__A, (__v4si)__B);
                     ^

compilation aborted for /home/maxhutch/src/libxsmm/samples/smm/blas.cpp (code 4)
Makefile:380: recipe for target 'build/blas-cpp.o' failed
make: *** [build/blas-cpp.o] Error 4

These look like compiler issues, but I'm running a vanilla debian system, so I thought they'd be worth pointing out.

list option when pre-building the library

Currently, if a user wants to pre-build a specific set of specializations, the MNK="" or M="", N="", K="" interface has to be used. For larger sets, there have been reports that bash/make fail and take the entire build down with them. It might be useful to have a Python script, or a JSON or XML input file, that specifies the requested kernels, with the LIBXSMM make system then building these kernels step by step.
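
One possible shape for such an input file is a JSON array of [m, n, k] triplets that a script validates and splits into small batches, so that no single make invocation receives an oversized argument list. The file format and the names load_kernels and batches are hypothetical; how each batch is handed to the build system is deliberately left open.

```python
import json


def load_kernels(text):
    """Parse a JSON array of [m, n, k] triplets into a sorted list without duplicates."""
    kernels = sorted({tuple(int(v) for v in entry) for entry in json.loads(text)})
    for mnk in kernels:
        if len(mnk) != 3 or min(mnk) <= 0:
            raise ValueError("invalid kernel shape: %r" % (mnk,))
    return kernels


def batches(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


SPEC = "[[4, 4, 4], [23, 23, 23], [4, 4, 4], [8, 16, 8]]"
for batch in batches(load_kernels(SPEC), 2):
    print(batch)  # each batch would be built in a separate step
```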

Dynamically dispatch CRC32 according to CPUID flags

Dynamically dispatch the code path making use of CRC32 instructions. This will allow running on pre-Nehalem/Westmere CPUs (no SSE4.2/CRC32 instructions). The intention is to support Linux distributions (package managers) aiming for a wider range of processors.

link Error (build requirements?)

I'm trying to run libxsmm on a CPU-only (IvyBridge-E) system with a somewhat dated compiler:

maxhutch@edoras:~/src/clean-tests/RTI-LST$ ifort --version
ifort (IFORT) 14.0.1 20131008
Copyright (C) 1985-2013 Intel Corporation.  All rights reserved.

At link, it gives some warnings about some MIC things and then dies with an opaque Error 100, maybe related to missing x86_64-k1om-linux-ld:

/opt/openmpi-intel/bin/mpif90 -g -check all -debug all -traceback  -o nek5000 -ffpe-trap=invalid,zero,overflow -fsignaling-nans -I/opt/fftw3/include/ -I/home/maxhutch/src/libxsmm/include obj/test.o obj/kinds_mod.o obj/mpif.o obj/fftw3.o obj/size_mod.o obj/speclib.o obj/mesh_mod.o obj/input_mod.o obj/parallel_mod.o obj/fft_fftw_mod.o obj/ctimer_mod.o obj/dealias_mod.o obj/domain_mod.o obj/dxyz_mod.o obj/eigen_mod.o obj/esolv_mod.o obj/fdmh1_mod.o obj/geom_mod.o obj/hsmg_mod.o obj/interp_mod.o obj/ixyz_mod.o obj/mvgeom_mod.o obj/nekuse_mod.o obj/opctr_mod.o obj/restart_mod.o obj/scratch_mod.o obj/semhat_mod.o obj/soln_mod.o obj/steady_mod.o obj/string_mod.o obj/topol_mod.o obj/tstep_mod.o obj/turbo_mod.o obj/wz_mod.o obj/wzf_mod.o obj/zper_mod.o obj/io_mod.o obj/poisson_mod.o obj/navier4.o obj/drive.o obj/drive1.o obj/drive2.o obj/plan4.o obj/bdry.o obj/coef.o obj/conduct.o obj/connect1.o obj/connect2.o obj/dssum.o obj/eigsolv.o obj/genxyz.o obj/hsmg.o obj/gmres.o obj/convect.o obj/induct.o obj/navier0.o obj/navier1.o obj/navier5.o obj/navier6.o obj/navier8.o obj/map2.o obj/ic.o obj/ssolv.o obj/math.o obj/mxm_wrapper.o obj/hmholtz.o obj/subs1.o obj/fast3d.o obj/fasts.o obj/byte.o obj/chelpers.o obj/byte_mpi.o obj/prepost.o obj/nek_comm.o obj/setprop.o obj/papi.o obj/gauss.o obj/makeq.o obj/makeq_aux.o obj/mxm_std.o obj/comm_mpi.o obj/singlmesh.o obj/jl_gs.o obj/jl_sort.o obj/jl_sarray_transfer.o obj/jl_sarray_sort.o obj/jl_gs_local.o obj/jl_crystal.o obj/jl_comm.o obj/jl_tensor.o obj/jl_fail.o obj/jl_fcrystal.o obj/jl_findpts.o obj/jl_findpts_local.o obj/jl_obbox.o obj/jl_poly.o obj/jl_lob_bnd.o obj/jl_findpts_el_3.o obj/jl_findpts_el_2.o obj/jl_sparse_cholesky.o obj/jl_xxt.o obj/jl_fcrs.o -lblas -llapack -L/opt/fftw3/lib/ -lfftw3 -L/home/maxhutch/src/libxsmm/lib/intel64 -lxsmm
ifort: command line warning #10006: ignoring unknown option '-ffpe-trap=invalid,zero,overflow'
ifort: command line warning #10006: ignoring unknown option '-fsignaling-nans'
ifort: warning #10182: disabling optimization; runtime debug checks enabled
ifort: command line warning #10006: ignoring unknown option '-ffpe-trap=invalid,zero,overflow'
ifort: command line warning #10006: ignoring unknown option '-fsignaling-nans'
ifort: warning #10362: Environment configuration problem encountered.  Please check for proper MPSS installation and environment setup.
ifort: warning #10182: disabling optimization; runtime debug checks enabled
x86_64-k1om-linux-ld: No such file or directory
makefile:165: recipe for target 'nek5000' failed
make: *** [nek5000] Error 100

LIBXSMM interface/frontend refinement

Promote the Alpha and Beta arguments to the simplified interface. Support JIT-building kernels with general xGEMM arguments using the frontend, and adjust the dispatch functions accordingly. This change will break our currently deployed simplified interface (frontend), which only accepts the M, N, and K parameters. The intention of this issue is to settle our frontend interface.

FORTRAN interface

Generate and implement a FORTRAN interface along with some sample code (driver).

MPSS required, even with MIC=0

On commit 50ed3d0, my system with ICC 16.0.1 and without MPSS cannot build libxsmm, even with MIC=0:

$ make MIC=0

[jsewall libxsmm (master)]$ make OFFLOAD=0
icc -Wall -Wno-unused-function -Wno-attributes -fPIC -O2 -ftree-vectorize -ffast-math -funroll-loops -D__extern_always_inline=inline -DNDEBUG -D__STATIC -D__MKL -Iinclude -Ibuild -I/nfs_home/jsewall/src/libxsmm/src -I/swtools/intel/compilers_and_libraries_2016.1.150/linux/mkl/include -mavx2 -c /nfs_home/jsewall/src/libxsmm/src/libxsmm.c -o build/intel64/libxsmm.o
icc: command line warning #10006: ignoring unknown option '-ffast-math'
icc: command line warning #10353: option '-mavx2' ignored, suggest using '-march=core-avx2'
icc: warning #10193: -vec is default; use -x and -ax to configure vectorization
icc: command line warning #10006: ignoring unknown option '-ffast-math'
icc: warning #10362: Environment configuration problem encountered. Please check for proper MPSS installation and environment setup.
icc: warning #10193: -vec is default; use -x and -ax to configure vectorization
In file included from include/libxsmm_frontend.h(35),
from include/libxsmm.h(65),
from /nfs_home/jsewall/src/libxsmm/src/libxsmm.c(31):
include/libxsmm_macros.h(279): catastrophic error: MIC cannot open source file "pthread.h"

# include <pthread.h>

I also get warnings about the flag -mavx2, which ICC ignores (-march=core-avx2 is the preferred flag).

remove ALIGNED_STORES and ALIGNED_LOADS options

Currently, LIBXSMM has two build options that control implicit changes of the LDA and LDC parameters.
As we are moving to a more general interface that includes support for LDx, ALIGNED_STORES and ALIGNED_LOADS are redundant and should be removed.
LIBXSMM should still provide macros or functions that allow for easily deriving "padded" LDx values matching the smallest required value.
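
The padding computation such a helper would perform is a round-up to the next multiple of the vector width. The sketch below only illustrates the arithmetic. The function name padded_ld is hypothetical, and the 64-byte default alignment is an assumption rather than a value taken from the issue.

```python
def padded_ld(n, elem_size, alignment=64):
    """Smallest leading dimension >= n (in elements) whose row size in bytes
    is a multiple of `alignment`.

    elem_size is the element size in bytes (4 for single, 8 for double
    precision); the 64-byte default is an assumption.
    """
    if n <= 0 or elem_size <= 0 or alignment % elem_size != 0:
        raise ValueError("unsupported size/alignment combination")
    per_line = alignment // elem_size
    return ((n + per_line - 1) // per_line) * per_line


print(padded_ld(23, 8))  # doubles, 8 per 64 bytes
print(padded_ld(23, 4))  # floats, 16 per 64 bytes
```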

assumed-size F90 interface

Currently, LIBXSMM's F90 interface requires 2D Fortran arrays as inputs. We have seen applications that need to call LIBXSMM routines for contiguous slices of higher-dimensional arrays. A quick test revealed that the needed reshape is not replaced by a no-op. Therefore, the only solution is to change LIBXSMM's F90 interface to an assumed-size interface. This will disable row-major support for Fortran.

Remove compiler generated fallback code

Currently, LIBXSMM offers two fallback options: a) compiler-generated and unrolled code, and b) a call into a BLAS library. Since the LIBXSMM generator is planned to evolve to support additional cases, such as arbitrary alpha and beta as well as transpose options, and since LIBXSMM's JIT feature will become a stable release feature, alternative a) is redundant and will most likely never be called. It should therefore be considered deprecated and be removed in a future release of LIBXSMM.

Handle hash key collisions in the code cache.

This issue is known both in terms of the problem and the solution. The plan is to evict code from the cache in case of a collision in order to avoid the performance overhead of full collision handling. The latter requires an exact comparison of two descriptors on top of the CRC32-based hash key. Moreover, the infrastructure to receive the target/populated descriptor needs to be implemented. Actually evicting the code also requires properly releasing or reusing the memory associated with a cache entry.
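
The eviction strategy can be sketched as a direct-mapped cache in which a colliding insert simply overwrites the slot. This is an illustration under assumptions, not the library's data structure: zlib.crc32 stands in for the CRC32-based hash key, descriptors are modeled as byte strings, and the class name CodeCache is hypothetical. The sketch stores the descriptor next to the code and compares it exactly in order to detect a collision.

```python
import zlib


class CodeCache(object):
    """Direct-mapped cache: one entry per slot, evicted on hash collision."""

    def __init__(self, nslots=16):
        self.slots = [None] * nslots
        self.evictions = 0

    def _index(self, descriptor):
        # zlib.crc32 is a stand-in for the library's CRC32-based key.
        return (zlib.crc32(descriptor) & 0xFFFFFFFF) % len(self.slots)

    def lookup(self, descriptor):
        """Return the cached code, or None on a miss or a colliding entry."""
        entry = self.slots[self._index(descriptor)]
        if entry is not None and entry[0] == descriptor:
            return entry[1]
        return None

    def insert(self, descriptor, code):
        """Store code; a different descriptor in the same slot is evicted."""
        index = self._index(descriptor)
        entry = self.slots[index]
        if entry is not None and entry[0] != descriptor:
            self.evictions += 1  # the old code's memory would be released here
        self.slots[index] = (descriptor, code)


cache = CodeCache(nslots=1)  # a single slot forces every key to collide
cache.insert(b"m=4,n=4,k=4", "kernel_a")
cache.insert(b"m=8,n=8,k=8", "kernel_b")
print(cache.lookup(b"m=4,n=4,k=4"), cache.lookup(b"m=8,n=8,k=8"), cache.evictions)
```

An evicted kernel has to be generated again on its next request, which trades occasional regeneration for a lookup path without collision chains.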

Implement streaming stores

I want to call libxsmm functions for chunks of large matrices in a memory-bound kernel (similar to the "batched" mode in the examples).
Therefore, it would be great to have the possibility to employ streaming stores for the result matrix.

Is it possible/sensible/realistic that you implement this?
