pika-org / pika

pika builds on C++ std::execution with fiber, CUDA, HIP, and MPI support.
Home Page: https://pikacpp.org
License: Boost Software License 1.0
We currently only have a macro translation layer mapping basic CUDA and cuBLAS functionality to HIP equivalents. cuSOLVER is currently CUDA-only.
`dataflow` recursively waits for any containers of futures (of containers). One can, e.g., do this with `dataflow`: `dataflow(unwrapping([](vector<T>){...}), vector<future<T>>{...})`. There is no equivalent sender adaptor. `when_all` is variadic and requires all arguments to be senders themselves. This feature is used in DLA-Future. A `when_all_vector(vector<sender_of<T>>)` would be sufficient as a starting point.
This is potentially very important for debugging. We currently only test with lsan (LeakSanitizer). We should try to add tsan (ThreadSanitizer), msan (MemorySanitizer), asan (AddressSanitizer), and ubsan (UndefinedBehaviorSanitizer). Enabling them with heavy suppression files would be a good start, just to allow consumers of pika to enable sanitizers.
This would require recompiling all dependencies with `-fsanitize=memory`, including the standard library: https://github.com/google/sanitizers/wiki/MemorySanitizer#using-instrumented-libraries. This is something to consider doing with the CSCS CI pipelines and Spack.
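A hedged sketch of what the "heavy suppression files" starting point could look like. The flag and environment variable names are the standard clang/gcc ones; the suppression file names and test binary are hypothetical placeholders:

```shell
# Illustrative only: build with AddressSanitizer enabled.
cmake -S . -B build \
  -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer"
cmake --build build

# Run with suppression files so known issues don't drown out new
# reports; LeakSanitizer reads its suppressions separately.
ASAN_OPTIONS=suppressions=asan.supp \
LSAN_OPTIONS=suppressions=lsan.supp \
./build/bin/some_pika_test
```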
This is the common case and should possibly be the default. The open question is: what should the default be if there is no process mask (i.e. all PUs are in the mask)? We currently default to only one worker thread per core, rather than per PU. We can probably keep this behaviour even if using the process mask becomes the default. The only confusing aspect is that a process mask containing all PUs will not necessarily create as many worker threads as there are bits in the mask (though that is the case at the moment as well).
Imported targets like `Hwloc::hwloc` should instead be namespaced with `pika` (or `pika_internal`) to avoid potential conflicts with other projects.
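A possible shape for the fix, sketched in CMake. The target and variable names here are illustrative, not pika's actual build code:

```cmake
# Sketch: create the imported target under a pika-specific namespace so
# it cannot clash with a consumer's own Hwloc::hwloc target.
if(NOT TARGET pika_internal::hwloc)
  add_library(pika_internal::hwloc INTERFACE IMPORTED)
  target_include_directories(pika_internal::hwloc
    INTERFACE ${Hwloc_INCLUDE_DIRS})
  target_link_libraries(pika_internal::hwloc
    INTERFACE ${Hwloc_LIBRARIES})
endif()
```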
Currently, using `threadmanager` functionality requires the `pika/modules/threadmanager.hpp` header (previously `hpx/include/threadmanager.hpp`). It should either get its own public API header or be baked into the `pika/runtime.hpp` header. My preference leans slightly towards the latter option. The same applies to the resource partitioner.
Including `hpx::async`/`apply`/`dataflow`/`when_all`. This can be done once DLA-Future has been completely ported to use senders. The cleanup itself for this issue is not complicated and only requires removing functionality. What functionality is currently missing on the sender/receiver side?
E.g. `is_action` can be removed and various specializations removed or simplified.
In practice this means replacing the non-Cray builds currently running on Jenkins. Work was started on STEllAR-GROUP/hpx-local#5.
Make use of the GitLab matrix functionality, at least for release/debug builds, if not for different compiler and other configurations.
Also replace the HIP testing done with Jenkins with `container-runner-hohgant-mi200` (as already started in eth-cscs/DLA-Future#982 for DLA-Future, though that is blocked on an MPI issue; we can initially run all non-MPI tests for pika).
https://github.com/wolfpld/tracy
This is not terribly difficult on a basic level, but integrating Tracy into projects and running applications with it is a bit clunky, especially for multi-node runs, since a Tracy-instrumented application needs to send its data to a server.
#252 adds basic support. Next steps would be one or both of the following:
Serialization is not used by DLA-Future.
Primarily cuBLAS/cuSOLVER equivalents. Can we already do this, or does it need to wait for functionality on the HIP side?
Edit: hipBLAS seems to be ok (#37), but hipSOLVER is still an open question.
Not strictly a bug, but taking the lock (especially trying to take the lock) means that shutdown can take longer, or even a very long time.
Try to see if this can be reproduced: https://app.circleci.com/pipelines/github/pika-org/pika/133/workflows/00b0996c-601f-4f6c-8f1c-f3b533738a4a/jobs/1462/steps?invite=true#step-103-164.
We could initially refer to HPX's documentation and only document the things that differ from it. However, pika might diverge too much from HPX in the future to keep it like this.
To try to have something actionable here, I think we would need the following in order of importance:
The public API of pika is small: sender/receiver functionality, runtime initialization, what else? Hidden functionality can then gradually be brought into the public namespace, through `pika::experimental::` or directly into `pika::`.
The only reasonable way to do this is module by module:
This is also a good opportunity to do general cleanup.
Avoid nesting `detail` namespaces inside the `experimental` namespace (#448).
This will be more useful when/if parts of the repository (e.g. algorithms) are in separate repositories. With a single repository this is not very urgent.
The main open question here is whether the `algorithms` project should rely on the `pika` runtime, the other way around, or neither. The default execution policies assume that a global thread pool exists, which would normally be set up by the runtime.
The event polling has been successful and turned out to perform significantly better than using CUDA callbacks. However, that was tested when the CUDA callbacks still required runtime registration on the CUDA thread. We should check:
The use of `async_cuda` and `async_mpi` functionality has so far been through `pika/modules/async_{cuda,mpi}.hpp`. Since we don't consider `pika/modules` headers to be public API headers, we should probably add something more official to access that functionality.
Possible options:
- `pika/execution/{cuda,hip,gpu,mpi,communication}.hpp`
- `pika/{cuda,hip,gpu,mpi,communication}.hpp`
The testing functionality should not be exposed to users of pika.
See #17. The same points apply here.
`pika::detail::small_vector` seems to be significantly slower than `boost::container::small_vector`. It's unclear whether this is "just" a bug in the implementation or something more inherent in its use of standard library features.
We should:
- add a regression test (e.g. exercising `future::then`, since `small_vector` is used for storing continuations)
- check whether `pika::detail::small_vector` is fixable

If we can't find a suitable regression test within pika, the following DLA-Future test shows a clear performance drop: `srun -n4 -c36 miniapp/miniapp_triangular_solver --m 20480 --n 20480 --mb 128 --nb 128 --grid-rows 2 --grid-cols 2 --nruns 5 --pika:use-process-mask` (on the Piz Daint mc partition). The performance is about ~1150 GFlop/s with Boost's `small_vector` and ~800 GFlop/s with pika's.
With the distributed functionality removed, the actual APEX support was removed as well. APEX can be turned into a direct dependency, with no special support required on the APEX side, and this is quite straightforward to implement. I already have some old working code for this from HPX; it needs to be revived.
The sender/receiver CPOs currently use a helper base class to define fallback implementations with `tag_fallback_invoke`. The need for `tag_fallback_invoke` should be revisited, and the CPO types should potentially be placed in nested namespaces to avoid the `tag_fallback_invoke` overloads being in the overload set for unrelated CPOs. This could improve compile times.
The exact requirements of this are not 100% clear. This at least needs:
We currently don't test MPI functionality anywhere. It needs to be added to at least one CI configuration.
The current references seem to be somewhat too strict. Alternatively, can we slightly relax the criteria (without missing performance regressions)?
The locality concept has little meaning with distributed functionality removed.
The following (and possibly a few more) tests fail for various reasons after disabling timed suspensions. They need to be dealt with before the first release.
The internal `format` implementation could perhaps be replaced by `fmt`?
#1 added some debug output to `pika_add_module.cmake` and the Jenkins scripts. This should be removed.
This would reduce duplicate and unnecessary builds. This requires:
All versions of macOS that should be supported already have `<filesystem>`. This also removes one more Boost dependency (Boost `error_code`).
We currently still use https://hub.docker.com/r/stellargroup/build_env. We could do with a much lighter image, without documentation tools and e.g. Python.