alpaka-group / llama

A Low-Level Abstraction of Memory Access

Home Page: https://llama-doc.rtfd.io/

License: Mozilla Public License 2.0

CMake 1.28% C++ 98.45% Shell 0.28%

llama's Introduction

LLAMA – Low-Level Abstraction of Memory Access


LLAMA is a cross-platform C++17/C++20 header-only template library for the abstraction of data layout and memory access. It separates the algorithm's view of memory from the actual data layout in the background. This allows for performance portability of applications running on heterogeneous hardware with the very same code.

Documentation

Our extensive user documentation is available on Read the Docs. It includes:

  • Installation instructions
  • Motivation and goals
  • Overview of concepts and ideas
  • Descriptions of LLAMA's constructs

API documentation is generated by Doxygen from the C++ source. Please read the documentation on Read the Docs first!

Supported compilers

LLAMA tries to stay close to recent developments in C++ and so requires fairly up-to-date compilers. The following compilers are supported by LLAMA and tested as part of our CI:

  • Linux: g++ 10 - 13, clang++ 12 - 17, icpx (latest), nvc++ 23.5, nvcc 11.6 - 12.3
  • Windows: Visual Studio 2022 (latest on GitHub actions)
  • MacOS: clang++ (latest from brew)

Single header

We create a single-header version of LLAMA on each commit, which you can find on the single-header branch.

This is also useful if you would like to play with LLAMA on Compiler Explorer:

#include <https://raw.githubusercontent.com/alpaka-group/llama/single-header/llama.hpp>
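
For example, the following minimal sketch could be pasted into Compiler Explorer. The record/field and mapping constructs follow the snippets quoted in the issues below; exact signatures may differ between LLAMA versions, so treat this as illustrative rather than authoritative:

#include <https://raw.githubusercontent.com/alpaka-group/llama/single-header/llama.hpp>
#include <cstddef>

struct X {};
struct Y {};
struct Z {};

using Vec3 = llama::Record<
    llama::Field<X, float>,
    llama::Field<Y, float>,
    llama::Field<Z, float>>;

int main() {
    const auto arrayDims = llama::ArrayDims{1024};           // 1D array of 1024 records
    auto view = llama::allocView(llama::mapping::SoA{arrayDims, Vec3{}});
    for (std::size_t i = 0; i < 1024; i++)
        view(i)(X{}) = static_cast<float>(i);                 // tag-based access to field X
}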

Contributing

We greatly welcome contributions to LLAMA. Rules for contributions can be found in CONTRIBUTING.md.

Scientific publications

We published an article on LLAMA in the journal Software: Practice and Experience. We gave a talk on LLAMA at CERN's Compute Accelerator Forum on 2021-05-12. The video recording (starting at 40:00) and the slides are available on CERN's Indico. Mind that some of the presented LLAMA APIs have been renamed or redesigned in the meantime.

We presented recently added features to LLAMA at the ACAT22 workshop as a poster and a contribution to the proceedings. Additionally, we gave a talk at ACAT22 on LLAMA's instrumentation capabilities during a case study on AdePT, again with a contribution to the proceedings.

Attribution

If you use LLAMA for scientific work, please consider citing this project. We upload all releases to Zenodo, where you can export a citation in your preferred format. We provide a DOI for each release of LLAMA. Additionally, consider citing the LLAMA paper.

License

LLAMA is licensed under the MPL-2.0.

llama's People

Contributors

ax3l, bernhardmgruber, bertwesarg, markusvelten, psychocoderhpc, theziz


llama's Issues

Compressed blobs

Motivated by use cases in databases, LLAMA blobs could utilize lightweight compression algorithms like PFOR (Patched Frame of Reference). While the reduction of blob sizes is an interesting side effect, the main motivation is reducing memory bandwidth when data is read through the memory hierarchy, since decompression should be very local and very close to the compute units.

This could be implemented based on computed fields #170, but probably does not play well with random access.
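
As a detached illustration of the frame-of-reference idea behind PFOR (the "patched" handling of outliers and any LLAMA integration are omitted; all names are illustrative):

#include <algorithm>
#include <cstdint>
#include <vector>

// Frame-of-reference encoding of one block: values are stored as small deltas
// from the block minimum, so most values need far fewer bits than their type.
struct ForBlock {
    std::uint32_t base;               // minimum of the block
    std::vector<std::uint8_t> deltas; // value - base, assumed to fit into 8 bits
};

ForBlock encode(const std::vector<std::uint32_t>& values) {
    const std::uint32_t base = values.empty() ? 0 : *std::min_element(values.begin(), values.end());
    ForBlock block{base, {}};
    for (const auto v : values)
        block.deltas.push_back(static_cast<std::uint8_t>(v - base));
    return block;
}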

Cannot use `.accessor<>` when view's type is passed as template

When changing the vectorAdd example to use element-wise access inside the kernel I get compile errors:

diff --git i/examples/vectoradd/vectoradd.cpp w/examples/vectoradd/vectoradd.cpp
index 88e3137..9c79b53 100644 examples/vectoradd/vectoradd.cpp
--- i/examples/vectoradd/vectoradd.cpp
+++ w/examples/vectoradd/vectoradd.cpp
@@ -75,7 +75,7 @@ struct AddKernel
         );
 
         LLAMA_INDEPENDENT_DATA
-        for ( auto pos = start; pos < end; ++pos )
+        for ( auto pos = start; pos < end; ++pos ) {
 #if VECTORADD_BYPASS_LLAMA == 1
             for ( auto dd = 0; dd < 3; ++dd )
 #if VECTORADD_BYPASS_SOA == 1
@@ -85,8 +85,11 @@ struct AddKernel
                 a[ pos * 3 + dd ] += b[ pos * 3 + dd ];
 #endif // VECTORADD_BYPASS_SOA
 #else
-            a( pos ) += b( pos );
+            a.accessor<dd::X>({pos}) += b.accessor<dd::X>({pos});
+            a.accessor<dd::Y>({pos}) += b.accessor<dd::Y>({pos});
+            a.accessor<dd::Z>({pos}) += b.accessor<dd::Z>({pos});
 #endif // VECTORADD_BYPASS_LLAMA
+        }
     }
 };
 

I get compile errors with G++ 6:

vectoradd.cpp: In member function 'void AddKernel<problemSize, elems>::operator()(const T_Acc&, T_View, T_View) const':
vectoradd.cpp:88:29: error: expected primary-expression before '>' token
             a.accessor<dd::X>({pos}) += b.accessor<dd::X>({pos});
                             ^
vectoradd.cpp:88:35: error: expected ';' before '}' token
             a.accessor<dd::X>({pos}) += b.accessor<dd::X>({pos});
                                   ^
vectoradd.cpp:88:57: error: expected primary-expression before '>' token
             a.accessor<dd::X>({pos}) += b.accessor<dd::X>({pos});
                                                         ^
vectoradd.cpp:88:63: error: expected ';' before '}' token
             a.accessor<dd::X>({pos}) += b.accessor<dd::X>({pos});
                                                               ^
vectoradd.cpp:89:29: error: expected primary-expression before '>' token
             a.accessor<dd::Y>({pos}) += b.accessor<dd::Y>({pos});
                             ^
vectoradd.cpp:89:35: error: expected ';' before '}' token
             a.accessor<dd::Y>({pos}) += b.accessor<dd::Y>({pos});
                                   ^
vectoradd.cpp:89:57: error: expected primary-expression before '>' token
             a.accessor<dd::Y>({pos}) += b.accessor<dd::Y>({pos});
                                                         ^
vectoradd.cpp:89:63: error: expected ';' before '}' token
             a.accessor<dd::Y>({pos}) += b.accessor<dd::Y>({pos});
                                                               ^
vectoradd.cpp:90:29: error: expected primary-expression before '>' token
             a.accessor<dd::Z>({pos}) += b.accessor<dd::Z>({pos});
                             ^
vectoradd.cpp:90:35: error: expected ';' before '}' token
             a.accessor<dd::Z>({pos}) += b.accessor<dd::Z>({pos});
                                   ^
vectoradd.cpp:90:57: error: expected primary-expression before '>' token
             a.accessor<dd::Z>({pos}) += b.accessor<dd::Z>({pos});
                                                         ^
vectoradd.cpp:90:63: error: expected ';' before '}' token
             a.accessor<dd::Z>({pos}) += b.accessor<dd::Z>({pos});
                                                               ^

It works when using the template keyword:

diff --git i/examples/vectoradd/vectoradd.cpp w/examples/vectoradd/vectoradd.cpp
index dfc08a5..ba04b59 100644 examples/vectoradd/vectoradd.cpp
--- i/examples/vectoradd/vectoradd.cpp
+++ w/examples/vectoradd/vectoradd.cpp
@@ -85,9 +85,9 @@ struct AddKernel
                 a[ pos * 3 + dd ] += b[ pos * 3 + dd ];
 #endif // VECTORADD_BYPASS_SOA
 #else
-            a.accessor<dd::X>({pos}) += b.accessor<dd::X>({pos});
-            a.accessor<dd::Y>({pos}) += b.accessor<dd::Y>({pos});
-            a.accessor<dd::Z>({pos}) += b.accessor<dd::Z>({pos});
+            a.template accessor<dd::X>({pos}) += b.template accessor<dd::X>({pos});
+            a.template accessor<dd::Y>({pos}) += b.template accessor<dd::Y>({pos});
+            a.template accessor<dd::Z>({pos}) += b.template accessor<dd::Z>({pos});
 #endif // VECTORADD_BYPASS_LLAMA
         }
     }

Outside of the kernel it works as expected:

diff --git i/examples/vectoradd/vectoradd.cpp w/examples/vectoradd/vectoradd.cpp
index 81ceb54..ba04b59 100644 examples/vectoradd/vectoradd.cpp
--- i/examples/vectoradd/vectoradd.cpp
+++ w/examples/vectoradd/vectoradd.cpp
@@ -319,6 +319,7 @@ int main(int argc,char * * argv)
             mirrorB
 #endif // VECTORADD_BYPASS_LLAMA
         );
+        auto a0_x = mirrorA.accessor<dd::X>({0});
         chrono.printAndReset("Add kernel");
         dummy( static_cast<void*>( mirrorA.blob[0] ) );
         dummy( static_cast<void*>( mirrorB.blob[0] ) );

1D resizeable container

std::vector<T> is probably the most used standard data structure and a great building block for many programs. LLAMA's views, while very powerful and providing almost the same functionality as std::vector<T>, lack the ability to grow in size. However, that is easy to implement by creating a new view when we run out of storage and copying the old content to the new view. The building blocks are there.

Such a facility is already needed by the raycasting example (not yet merged).
We also likely need a std::deque<T> equivalent for the integration of LLAMA into PIConGPU to represent the particle lists.

Create Zenodo DOI

We should have each release uploaded to Zenodo and a DOI created for it.

Forgot to include <sstream> in tests/common.h

tests/common.h needs to #include <sstream> in order for the tests to compile. Otherwise line 37, std::stringstream rawSS(raw);, triggers a compile-time error.

Boost

Hi,

are there plans to remove the Boost dependency to make LLAMA more lightweight / standalone? :)

Allow statically sized array as record dimension

LLAMA allows statically sized arrays inside Records as a shorthand notation. However, it seems using a statically sized array directly as record dimension fails to compile.

using RecordDim = float[3];
using ArrayDims = ...;
using Mapping = llama::mapping::AoS<ArrayDims, RecordDim>;
...

Investigate this and add a unit test.

Consider allowing non-trivial field types

LLAMA could allow non-trivial types inside the record dimension. This has a couple of consequences, especially during:

  • view construction: constructors may need to be run. Is the execution context for the view construction also eligible to run the fields' constructors? E.g. for CUDA it is not: the view object is constructed on the host while the fields' constructors may need to be run on the device.
  • copying: appropriate copy/move constructors need to be run. A pure memcpy of the data would be wrong.
  • destruction: the same concern about the execution context applies to the destructors as to the constructors.

Exception guarantees also have to be considered in all these circumstances. A first step could be to assert that all operations are noexcept.

This idea came from a participant at the CERN Compute Accelerator Forum.

Cannot build against an installed LLAMA

After make install of LLAMA I tried to build examples/nbody against this installation:

Just to be sure, I changed the CMakeLists.txt to the following:

diff --git i/examples/nbody/CMakeLists.txt w/examples/nbody/CMakeLists.txt
index 1b00b5e..b4655c5 100644 examples/nbody/CMakeLists.txt
--- i/examples/nbody/CMakeLists.txt
+++ w/examples/nbody/CMakeLists.txt
@@ -1,9 +1,7 @@
 cmake_minimum_required (VERSION 3.3)
 project(llama-nbody)
 
-if (NOT TARGET llama::llama)
-       find_package(llama REQUIRED)
-endif()
+find_package(llama REQUIRED)
 find_package(alpaka 0.5.0 REQUIRED)
 ALPAKA_ADD_EXECUTABLE(${PROJECT_NAME} nbody.cpp ../common/Dummy.cpp)
 target_link_libraries(${PROJECT_NAME} PRIVATE llama::llama alpaka::alpaka)
$ export CMAKE_PREFIX_PATH=/home/wesarg/opt/llama-dev:$CMAKE_PREFIX_PATH
$ cd examples/nbody
$ cmake .
:
-- Configuring done
CMake Error at /home/wesarg/opt/alpaka-dev/lib/cmake/alpaka/addExecutable.cmake:53 (ADD_EXECUTABLE):
  Target "llama-nbody" links to target "llama::llama" but the target was not
  found.  Perhaps a find_package() call is missing for an IMPORTED target, or
  an ALIAS target is missing?
Call Stack (most recent call first):
  CMakeLists.txt:6 (ALPAKA_ADD_EXECUTABLE)


-- Generating done
CMake Generate step failed.  Build files cannot be regenerated correctly.

This still generated a Makefile, but make fails then with:

$ make
Scanning dependencies of target llama-nbody
[ 33%] Building CXX object CMakeFiles/llama-nbody.dir/nbody.cpp.o
/home/wesarg/Work/Mephisto/llama/examples/nbody/nbody.cpp:44:10: fatal error: llama/llama.hpp: No such file or directory
 #include <llama/llama.hpp>
          ^~~~~~~~~~~~~~~~~
compilation terminated.
CMakeFiles/llama-nbody.dir/build.make:82: recipe for target 'CMakeFiles/llama-nbody.dir/nbody.cpp.o' failed
make[2]: *** [CMakeFiles/llama-nbody.dir/nbody.cpp.o] Error 1
CMakeFiles/Makefile2:95: recipe for target 'CMakeFiles/llama-nbody.dir/all' failed
make[1]: *** [CMakeFiles/llama-nbody.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

While investigating this, I noticed that llama_DIR is set to /home/wesarg/opt/llama-dev/lib/cmake/llama:

$ grep llama_DIR CMakeCache.txt 
llama_DIR:PATH=/home/wesarg/opt/llama-dev/lib/cmake/llama

But llama-config.cmake uses this as the base for the include directory:

set(llama_INCLUDE_DIR ${llama_INCLUDE_DIR} "${llama_DIR}/include")

Which would result in the wrong directory anyway.

Any idea what went wrong here?

Add traits to detect mapping specializations

Quite often it is useful to check whether a given type is a specific mapping instantiation. E.g.

template <typename Mapping>
void f(Mapping m) {
  if constexpr (isSoA<Mapping>) { ... }
}

We should add such traits for all LLAMA mappings.
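
A minimal, self-contained sketch of the underlying technique (the SoA/AoSoA types below are simplified stand-ins; LLAMA's actual mapping templates have more parameters):

#include <cstddef>

// Simplified stand-ins for two mappings; the real llama::mapping types differ.
template<typename ArrayDims, typename RecordDim>
struct SoA {};

template<typename ArrayDims, typename RecordDim, std::size_t Lanes>
struct AoSoA {};

// The trait: a variable template specialized for each mapping.
template<typename Mapping>
inline constexpr bool isSoA = false;

template<typename ArrayDims, typename RecordDim>
inline constexpr bool isSoA<SoA<ArrayDims, RecordDim>> = true;

static_assert(isSoA<SoA<int, float>>);
static_assert(!isSoA<AoSoA<int, float, 8>>);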

Delete master branch

We would like to delete the master branch since all development happens on the develop branch anyway, and develop is also the default branch. Future releases of LLAMA will be done from release branches that branch off develop.

Implement a SIMD for_each

LLAMA is a library for efficient memory layouts and memory access patterns. For CPUs, this is only half the story. The second half is filling vector registers to also optimize the compute throughput. While this should not be directly handled by LLAMA, LLAMA certainly can provide useful data layouts and should allow for some algorithmic primitives on top.

After thinking some time on this topic and rereading some stuff from Matthias Kretz on Vc and std::simd, I think these primitives are going to be the SIMD variants of the STL algorithms, as proposed here: http://wg21.link/P0350.

Modified example from P0350:

template <class T>
T something(T); // T can be float or std::simd<float>

auto f(const std::vector<float>& data) {
  std::vector<float> output(data.size());
  stdx::transform(std::execution::simd,
    data.begin(), data.end(), output.begin(),
    [](auto x) { // the type of x can be float or std::simd<float>
      return something(x + 1);
    });
  return output;
}

I think just having a for_each in this fashion could already be quite useful. It would be even more useful if we allowed the lambda argument to be a mutable reference, enabling write-back to memory. Depending on the data layout and target architecture, we could then run the lambda with SIMD vectors or scalars.

Runtime changeable mapping

Currently, all LLAMA mappings are fixed at compile time and statically compiled into a program. We could, however, create a mapping that forwards to other mappings via a runtime-configurable dispatch, allowing a mapping to be changed or configured at runtime.

In combination with the instrumenting mappings, this could allow a mapping to adapt itself during runtime based on the observed usage.

This idea came from a participant at the CERN Compute Accelerator Forum.

Re-Definable Control Defines

Forged out of #15 (comment):

Some examples/ deploy #define vars for building them in variants.

I usually prefer to put such controls in a

#ifndef VAR
#define VAR default
#endif 

block so one can control them without warnings externally.

Improve compile time

Many LLAMA constructs depend on compile time iteration of type lists. For very big record dimensions, some constructs become painfully slow.

Consider the following clang timetrace from a LLAMA example using the ROOT HEP framework:
[clang -ftime-trace screenshot]
The first part marked with "ROOT" is parsing ROOT's headers, the part marked with "D" is LLAMA's dumping code. The part between the black bars is parsing LLAMA's headers (~30ms). Then comes a very long part "InstantiateFunction", which comes from this code:

llama::forEachLeaf<Event>([&](auto coord) {
    using Name = llama::GetTag<Event, decltype(coord)>;
    using Type = llama::GetType<Event, decltype(coord)>;
    auto column = ntuple->GetView<Type>(llama::structName<Name>());
    for (std::size_t i = 0; i < n; i++)
        view(i)(coord) = column(i);
});

The big issue here is that the Event record dimension is large (~200 fields). The loop created by forEachLeaf is linear in compilation time and not yet the problem, but view(i)(coord) eventually calls into AoS::blobNrAndOffset, which calls offsetOf<RD, RC>, and the implementation of offsetOf is linear again. Thus, the whole code snippet is quadratic in compilation time.

The situation could be improved a lot by making use of template instantiation memoization. That is, if the value of offsetOf<RD, RC> could somehow depend on the value of offsetOf<RD, RC - 1>, the compiler could reuse memoized previous template instantiations and the code snippet should compile in linear time.

The big difficulty is that record coordinates are hierarchical and RC - 1 is a complex operation. A good solution would be to linearize record coordinates before they are passed to mappings, and have mappings deal solely in linear coordinates.

Small documentation bugs

Here is a list of small bugs which I found while reading the documentation.

Allow construction of One from other virtual record

Currently, in order to load a virtual record from memory and store it in a local variable, the following code is needed:

llama::One<Record> r;
r = view(ad);

We cannot merge these two lines into one statement like:

llama::One<Record> r = view(ad); // error

This is because llama::One lacks the necessary constructor. We should add the necessary machinery.

Compiling simpletest with MSVC in VS 2019 crashes

Compiling the simpletest example on the develop branch with VS 2019 crashes the compiler frontend:

1>------ Build started: Project: llama-simpletest, Configuration: Debug x64 ------
1>simpletest.cpp
1>C:\dev\llama\include\llama\DatumCoord.hpp(153,1): fatal error C1001: Internal compiler error.
1>(compiler file 'msc1.cpp', line 1532)
1> To work around this problem, try simplifying or changing the program near the locations listed above.
1>If possible please provide a repro here: https://developercommunity.visualstudio.com
1>Please choose the Technical Support command on the Visual C++
1> Help menu, or open the Technical Support help file for more information
1>C:\dev\llama\include\llama\DatumCoord.hpp(153): message : see reference to alias template instantiation 'llama::DatumCoord<0,0>::Cat<T_Other>' being compiled
1>C:\dev\llama\include\llama\DatumStruct.hpp(57): message : see reference to class template instantiation 'llama::DatumCoord<0,0,0>' being compiled
1>C:\dev\llama\include\llama\DatumStruct.hpp(80): message : see reference to class template instantiation 'llama::internal::LinearBytePosImpl<float,T_DatumCoord,llama::DatumCoord<0,0,0>>' being compiled
1>        with
1>        [
1>            T_DatumCoord=llama::DatumCoord<0>
1>        ]
1>C:\dev\llama\include\llama\DatumStruct.hpp(80): message : see reference to class template instantiation 'llama::internal::LinearBytePosImpl<boost::mp11::mp_list<boost::mp11::mp_list<Z,float>>,T_DatumCoord,llama::DatumCoord<0,0>>' being compiled
1>        with
1>        [
1>            T_DatumCoord=llama::DatumCoord<0>
1>        ]
1>C:\dev\llama\examples\simpletest\simpletest.cpp(9): message : see reference to class template instantiation 'llama::internal::LinearBytePosImpl<Name,llama::DatumCoord<0>,llama::DatumCoord<0>>' being compiled
1>INTERNAL COMPILER ERROR in 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\HostX64\x64\CL.exe'
1>    Please choose the Technical Support command on the Visual C++
1>    Help menu, or open the Technical Support help file for more information
1>Done building project "llama-simpletest.vcxproj" -- FAILED.
========== Build: 0 succeeded, 1 failed, 1 up-to-date, 0 skipped ==========

Here is a minimum example reproducing the crash:

#include <llama/llama.hpp>

struct Pos;
struct Z;
using Name = llama::DS<llama::DE<Pos, llama::DS<llama::DE<Z, float>>>>;
int main(int argc, char** argv) {
	llama::internal::LinearBytePosImpl<Name, llama::DatumCoord<0>, llama::DatumCoord<0>>::value;
	return 0;
}

Funnily enough, it compiles fine when I select the VS2017 v141 toolset. It seems like a regression in VS2019 and I have reported it via the feedback hub to the VS team.

IDK if you want to do something about this, but it's good to know ;)

Integer packing and arbitrary precision integers

LLAMA could allow packing integers into fewer bits than their usual size. Memory-footprint-sensitive data layouts in particular frequently use such types to save memory.
Another approach is supporting arbitrary-precision integer types, which are also common in FPGA code, e.g. ap_int<N>: https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/use_arbitrary_precision_data_type.html

E.g.: 3 12-bit integers forming an RGB value:

using RecordDim = llama::Record<
    llama::Field<R, llama::Int<12>>,
    llama::Field<G, llama::Int<12>>,
    llama::Field<B, llama::Int<12>>
>;

An open design point is how a reference to such an object is formed, since a mapping may not place these objects at a byte-boundary. Thus, locations of such elements might not be addressable. A solution could be a proxy reference like in e.g. std::bitset<N>.
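
A rough sketch of what such a proxy reference could look like for a packed 12-bit unsigned field (purely illustrative: the UInt12Ref name, the offset handling, and the little-endian assumption are not part of LLAMA):

#include <cstdint>
#include <cstring>

// Proxy reference for a 12-bit unsigned field that is not byte-aligned.
// Assumes a little-endian platform and bitOffset < 8, so the field spans at most 3 bytes.
struct UInt12Ref {
    std::uint8_t* base;  // first byte containing the field
    unsigned bitOffset;  // bit offset of the field within *base

    operator std::uint16_t() const {
        std::uint32_t word = 0;
        std::memcpy(&word, base, 3);                 // gather the bytes that may contain the field
        return static_cast<std::uint16_t>((word >> bitOffset) & 0xFFFu);
    }

    UInt12Ref& operator=(std::uint16_t value) {
        std::uint32_t word = 0;
        std::memcpy(&word, base, 3);                 // read-modify-write the surrounding bytes
        word &= ~(0xFFFu << bitOffset);
        word |= (static_cast<std::uint32_t>(value) & 0xFFFu) << bitOffset;
        std::memcpy(base, &word, 3);
        return *this;
    }
};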

Allow view indexing with signed types

When subscripting a llama::View using the variadic operator()(T...) and signed integers, some compilers generate a warning in the implementation because the signed integers are used to construct a llama::ArrayDims, which contains values of type size_t. We should check that the indexing type T is convertible to size_t and add a cast to suppress the warning.
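
A small sketch of the proposed check and cast (makeArrayDims is an illustrative helper, not LLAMA's actual implementation):

#include <array>
#include <cstddef>
#include <type_traits>

// Accept any index types convertible to std::size_t and cast them explicitly,
// so that passing signed integers does not trigger sign-conversion warnings.
template<typename... Ts>
auto makeArrayDims(Ts... indices) -> std::array<std::size_t, sizeof...(Ts)> {
    static_assert((std::is_convertible_v<Ts, std::size_t> && ...),
                  "index types must be convertible to std::size_t");
    return {static_cast<std::size_t>(indices)...};
}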

Real-time memory access visualization

Since LLAMA's mappings have full visibility of memory access, they can be used to instrument an application. The Trace and Heatmap mappings are an example. However, both of these mappings only show aggregated information.

Going one step further, we could visualize the memory accesses in real time using an appropriate GUI. Since CPU memory operations happen far faster than a human can perceive, and the GUI is directly linked with the memory mapping, we could allow for a runtime-configurable slowdown of the application. We could even provide a "Debug now" button to let the user trigger a debug trap.

Such a feature might give incredible insight into realtime memory access patterns and make for really great presentation material.

Put operator for VirtualRecord

In debugging scenarios it is quite useful to just print a VirtualRecord to e.g. std::cout. LLAMA probably has all the information needed to realize this. The corresponding operator<< would need to be implemented.
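
A rough sketch of what the implementation could build on, reusing the forEachLeaf/GetTag/structName calls quoted in the compile-time issue above (the free-function form, the printRecord name, and the RecordDim template parameter are assumptions, not LLAMA's actual operator<<):

#include <llama/llama.hpp>
#include <ostream>
#include <utility>

// Prints all leaf fields of a virtual record as "{tag: value, ...}".
template<typename RecordDim, typename VirtualRecord>
void printRecord(std::ostream& os, const VirtualRecord& vr) {
    os << '{';
    bool first = true;
    llama::forEachLeaf<RecordDim>([&](auto coord) {
        if (!std::exchange(first, false))
            os << ", ";
        using Name = llama::GetTag<RecordDim, decltype(coord)>;
        os << llama::structName<Name>() << ": " << vr(coord);
    });
    os << '}';
}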

Add a graph algorithm example

LLAMA has native support for ND-arrays of structured data. This is an important building block for further, higher level data structures. To show this, we could add an example demonstrating a graph data structure built on LLAMA.

Rethink AoSoA Lanes parameter

The non-type template parameter Lanes of the AoSoA mapping specifies the number of attributes of multiple datums which should be blocked together to form little vectors of Lanes lanes. Although this works in simple cases, it might be the wrong approach.

CPUs typically have a fixed vector register width expressed in bits. Depending on the loaded data type, the number of used lanes differs. E.g. AVX2 has 256-bit registers and will pack 8 floats, but only 4 doubles. What should the Lanes parameter be? Should we ideally pack 8 floats and doubles, 4 floats and doubles, or 8 floats and 4 doubles?

We should think about this and consider a redesign.

Add new AoSoA variant blocking with fixed size

The current AoSoA<L> mapping implementation blocks elements with a fixed factor of L. This may not be optimal for heterogeneous structures. Consider the following struct and AoSoA<8> memory layout:

using ParticleUnaligned = llama::Record<
    llama::Field<tag::Id, std::uint16_t>,
    llama::Field<tag::Pos, llama::Record<
        llama::Field<tag::X, float>,
        llama::Field<tag::Y, float>
    >>,
    llama::Field<tag::Mass, double>,
    llama::Field<tag::Flags, bool[3]>
>;

[AoSoA<8> memory layout diagram]

The floats are arranged for AVX2 vector access. The doubles can still be accessed using two registers. The int16 Ids, however, are already too narrow, and a vector fetch will load non-Id data as well or needs some repacking in the registers. The situation is similar but worse for the bools.

What would be more optimal is a memory layout that blocks e.g. 32/64 bytes of elements into the inner AoSoA array, so each fetch contains 100% useful data. However, such a mapping also increases the distance between heterogeneous values at the same array coordinates, which might be worse in some situations. Nevertheless, such a mapping should be added to LLAMA.
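
For illustration, assuming a 64-byte inner block and the usual type sizes, each field would get the following number of elements per block (numbers only, not a concrete LLAMA mapping):

#include <cstddef>
#include <cstdint>

constexpr std::size_t blockBytes      = 64;
constexpr std::size_t floatsPerBlock  = blockBytes / sizeof(float);         // 16
constexpr std::size_t doublesPerBlock = blockBytes / sizeof(double);        // 8
constexpr std::size_t idsPerBlock     = blockBytes / sizeof(std::uint16_t); // 32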

One with minimum padding

llama::One<Record> is a facility for creating a scalar record value backed by stack memory. With the padding minimization developed in #233, we could add an additional version of One that reorders its members to minimize padding.

Such a feature is not available in C++ and was traditionally done manually, see e.g.: http://www.catb.org/esr/structure-packing/
There is an effort to get such a feature in C++: https://wg21.link/p1112. LLAMA would be a library solution to this.
Rust has this built-in.

Such a One could actually just be used by itself, e.g. std::vector<llama::MinOne<T>>.
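
As a plain-C++ illustration of what member reordering buys (the structs are stand-ins, not LLAMA types; sizes are typical for 64-bit platforms):

#include <cstdint>

struct Unordered { std::uint16_t id; double mass; bool flag; }; // usually 24 bytes due to padding after id and flag
struct Reordered { double mass; std::uint16_t id; bool flag; }; // usually 16 bytes
static_assert(sizeof(Reordered) <= sizeof(Unordered));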

Amalgamated header depends on fmt

The fmt library is used in LLAMA's DumpMapping.h for string formatting. This header is not included by default, to allow users to use LLAMA without installing fmt. However, the amalgamated header includes DumpMapping.h, so it is only usable when fmt is installed.

Add nvc++ to the CI

@ax3l mentioned a couple of relevant HPC compilers here. LLAMA already tests with all publicly available ones, except for the new nvc++ from NVIDIA.

nvc++ recently started to support CUDA as well, so it can be used as a host and device compiler (much like clang). We should add it to the CI.

Test autovectorization in CI

Automatic vectorization is brittle and easily broken. We should verify as part of the CI that a commit does not break vectorization.

Some<Record> a construct for some elements and SIMD

LLAMA has the One<Record> construct to hold and manipulate a single record. To facilitate SIMD programming models, we would need a construct to hold several records, depending on the target hardware architecture and the given record dimension. We should explore adding a Some<Record> construct representing a bunch of records.

Whether Some<Record> should expose a size is an open question: Some<Record, N>.

Some<Record> for CPU targets should create a structure of SIMD vectors and thus should integrate with SIMD libraries like Vc or std::simd. But it should neither require a SIMD library (GPU targets) nor become a SIMD library (mission creep). The interaction is mostly about how to move data between any mapping and SIMD vectors. Whether that is efficient or not depends on the mapping (e.g. vector fetches vs. gather/scatter vs. individual loads).

A design aspect is how to differentiate a scalar access from a vector access. This information also needs to be carried through the VirtualRecords created along the way until the access is terminal.

I believe this construct could solve the inefficiency of the AoSoA mapping, which is slow when individual elements are accessed. But when Some<Record>s are read from an AoSoA with matching lane length, access could be a lot better. Furthermore, this construct is a good building block for designing SIMD-friendly algorithms and a stepping stone towards a vectorizable kernel language.

Add MacOS CI job

@SimeonEhrig pointed out that LLAMA is not tested on MacOS. It should run in principle, so we should add a CI job for this.

Also update the badge on the README.md then.

Allow dynamic nested arrays

The size of a LLAMA record dimension is currently fully static, including statically sized subarrays. Dynamically sized subarrays, even if not resized once created, are a frequently occurring pattern in large data sets such as in HEP (High Energy Physics). Adding direct support for this in LLAMA would allow directly expressing HEP data sets commonly occurring at CERN.

This idea came from a participant at the CERN Compute Accelerator Forum.

Prototype computed fields

There are cases when defining a data structure where some elements are functionally dependent on other elements. A good example is a triangle storing its 3 vertices and its normal vector. In some cases it makes sense to store the normal as part of the triangle object because computing it is expensive (2 vector subtractions, 1 cross product, 1 normalization). There might be cases, though, where we value the memory footprint more than the computational cost. Or there might be architectures where the recomputation is still faster than fetching additional data from memory.

LLAMA should allow mappings where the mapping can decide to not store certain elements, but compute them on access.

This allows further interesting and extreme use cases like fully computed LLAMA views, which have no memory footprint at all.

Move view copy into LLAMA

A generic copy routine between views has been developed as part of the viewcopy example. We should move this implementation into the LLAMA library itself.

Improve single record loading to stack

In order to copy a single record from a view to the stack, the following code is needed:

llama::One<RecordDim> o;
o = view(i);

We should allow the following contractions, which require an additional constructor:

llama::One<RecordDim> o = view(i);
llama::One<RecordDim> o{view(i)};

We could additionally allow a member function on a virtual record:

auto o = view(i).deepCopy(); // o is an independent copy and has type llama::One<RecordDim>

CMake: add a target with all headers?

I think it is convenient (at least for Visual Studio) to have a CMake target that just contains all headers of the library, like alpaka does with alpakaIde. Otherwise it is sometimes problematic to quickly navigate to a library file from VS.
Of course, that is just a matter of convenience and not a bug.

Update to recent Alpaka

I'm confused about the change from stream to queue in alpaka, and I'm not sure if this is related at all. But I needed to switch to the alpaka::kernel API to get it to compile again. I'm also not sure if I'm allowed to fork this to create a PR. Thus just this issue, with the hint.

Wish: Something like `llama::TypeOf<llama::DS<…>>::type`

Just to have something to pass as a C++ type to third-party code, which may be used in a pass-through allocator. Otherwise one also needs to define it oneself and keep it in sync with the llama::DS.

A rough/incomplete example:

struct rgb
{
    double r, g, b;
};

namespace st
{
    struct R {};
    struct G {};
    struct B {};
}

using RGB = llama::DS<
    llama::DE< st::R, double >,
    llama::DE< st::G, double >,
    llama::DE< st::B, double >
>;

std::vector<rgb> v(100);

// a factory with a pass-through allocator
auto view = PassThroughFactory::allocView(mapping, v.data());

Try to improve ND-Iterator codegen

The generated assembly for view iterators of dimensions higher than 1 is rather heavy.

Example: copying the contents of one view to another, changing layout:

const auto arrayDims = llama::ArrayDims{1024, 1024, 16};
const auto aosView = llama::allocView(llama::mapping::AoS{arrayDims, Particle{}});
const auto soaView = llama::allocView(llama::mapping::SoA{arrayDims, Particle{}});
std::copy(aosView.begin(), aosView.end(), soaView.begin());

This version uses llama::Iterator which is produced by View::begin and View::end. Here is one third of what g++ 11 produces for the std::copy call, generated with perf:
[perf screenshot: disassembly of the std::copy call using llama::Iterator]

Here is a functionally equivalent call:

llama::forEachADCoord(srcView.mapping.arrayDims(), [&](auto ad) { dstView(ad) = srcView(ad); });

This version recurses on the array dimensions and produces 3 nested loops. Here is everything that g++ 11 produces:
[perf screenshot: disassembly of the forEachADCoord version]

We can see that for the 3 nested loops the assembly is very clean and most time is actually spent copying values. The version with the LLAMA iterator generates a lot of additional arithmetic code to handle the mapping from the 1D loop iteration to the ND array dimensions value the iterator holds.

We should try to improve the generated code for llama::Iterator.

Related comment: #174 (comment)
