rapidsai / rmm: RAPIDS Memory Manager
Home Page: https://docs.rapids.ai/api/rmm/stable/
License: Apache License 2.0
Any publication? I want to cite it.
Builds of custrings, cuDF, etc. should not compile and install rmm themselves. They should find an rmm that is already installed on the system and link against it dynamically.
Is your feature request related to a problem? Please describe.
cudf has RMM_TRY and RMM_TRY_CUDAERROR. Other projects using RMM often need to redefine RMM_TRY and RMM_TRY_CUDAERROR. It would be better if RMM provided these macros.
We may also need a no-throw version (e.g. something like RMM_TRY_NOTHROW), mainly for class destructors and for RMM_FREE. Class destructors are noexcept by default, and if RMM_FREE with an erroneous parameter results in undefined behavior similar to std::free (https://en.cppreference.com/w/cpp/memory/c/free), it is better to print an error and crash the program than to continue execution with undefined behavior.
Is your feature request related to a problem? Please describe.
The libcudf Doxygen documentation HTML page should be accessible without requiring someone to clone the repo and build with make doc.
Describe the solution you'd like
Doxygen HTML should be hosted and accessible via github or dev docs page.
Is your feature request related to a problem? Please describe.
Formatting of RMM source code should be consistent and automatic.
Describe the solution you'd like
clang-format should be added as a git hook to automatically format code before it is committed.
See some repos that have already done some of the pre-work necessary:
https://github.com/barisione/clang-format-hooks
https://github.com/andrewseidl/githook-clang-format
(I'm using the RMM in rapidsai/cudf/branch-0.5.)
It seems the Doxygen comments, such as they are, for alloc(), RMM_ALLOC() and other relevant functions do not indicate whether allocations are aligned, or to what degree.
Report needed documentation
Documentation is needed in the README that describes how to enable and access the logging information that RMM provides.
Is your feature request related to a problem? Please describe.
For some tests, I would like to be able to compile and run kernels and Thrust functions. However, I cannot build any .cu files using RMM's existing cmake configuration.
Describe the solution you'd like
Update RMM's cmake configuration to allow building .cu files.
Describe the bug
test_rmm.py serves as a reference to test code with different RMM configurations.
# Test all combinations of default/managed and pooled/non-pooled allocation
@pytest.mark.parametrize('managed, pool',
                         list(product([False, True], [False, True])))
def test_rmm_modes(managed, pool):
    rmm.finalize()
    rmm_cfg.use_managed_memory = managed
    rmm_cfg.use_pool_allocator = pool
    rmm.initialize()

    assert(rmm.is_initialized())

    array_tester(np.int32, 128)
array_tester creates objects holding GPU memory. Calling rmm.finalize() before these objects are destroyed can corrupt memory and lead to undefined behavior. Calling gc.collect() before rmm.finalize() causes objects that are no longer referenced to be deleted (and their GPU memory released), which avoids the memory corruption.
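The ordering issue can be illustrated without a GPU. The following is only a sketch: FakePool and Buffer are hypothetical stand-ins, not part of the rmm API, and the buffer deliberately holds a reference cycle so that only the garbage collector can reclaim it.

```python
import gc


class FakePool:
    """Hypothetical stand-in for an RMM pool (not the real rmm API)."""

    def __init__(self):
        self.active = True
        self.live = 0                   # buffers currently allocated
        self.freed_after_finalize = 0   # frees that arrived too late

    def alloc(self):
        self.live += 1

    def free(self):
        if not self.active:
            self.freed_after_finalize += 1
        self.live -= 1

    def finalize(self):
        self.active = False


class Buffer:
    """Object that owns one allocation and releases it when destroyed."""

    def __init__(self, pool):
        self.pool = pool
        self.cycle = self   # reference cycle: only the collector reclaims this
        pool.alloc()

    def __del__(self):
        self.pool.free()


def run_mode(collect_first):
    """Return how many frees happened after the pool was finalized."""
    pool = FakePool()
    Buffer(pool)            # dropped right away, but kept alive by its cycle
    if collect_first:
        gc.collect()        # destroy unreachable buffers while the pool is valid
    pool.finalize()
    gc.collect()            # anything still alive is freed too late
    return pool.freed_after_finalize


print(run_mode(collect_first=True))
```

With collect_first=True no free can reach the pool after finalize. Without it, the late free is typically observed, which is the situation the test should avoid.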
Is your feature request related to a problem? Please describe.
As more and more people start using the ecosystem and building workloads with rapids.ai, they will start spawning processes that are triggered by real-time events, by a clock, by user interaction, etc. We don't have a way of estimating the memory usage of all of our algorithms (e.g. group by and join), but we DO know each time that cudf requests an allocation from rmm. Because the execution of these different workloads is unpredictable in terms of both scheduling and memory consumption, we can run into situations where we run out of resources, not because any particular job requires more memory than can be provided, but because the jobs can't all be run at the same time.
Describe the solution you'd like
Describe alternatives you've considered
Tracking allocations within our uses of cudf and adding a wrapper to the cudf python library that keeps track of memory as it comes in and out but I don't think this would really work.
Additional context
The code we currently have works great for demos and workloads that you run once. As people develop their toolsets they will run more and more workloads, and it will not be possible to assume that these workloads are queued to be run, nor should they be. I really think we should start considering possibilities for managing allocations across multiple processes. This could also allow us to be more aggressive with the size of the pool. As a last piece of context, I have not thought this through at length; these are just stream-of-consciousness ideas to help get a discussion going.
I'm developing a feature over cudf branch-0.6; using rmm changeset dfe2c4b . At some point, I'm getting this error:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): rmm_allocator::deallocate(): RMM_FREE: __global__ function call is not configured
Aborted
This is a problematic exception. Regardless of the reason this happened to me, "system error" is very general, and a typical user of rmm cannot understand what the what() message means.
So, please rewrite the code producing this what() message to:
What is your question?
In the code snippet below, the last line, d_pos = pos, transfers data from main memory to GPU memory. I'm having issues with this line when the vectors hold data over a certain size threshold (around 30GB). Is there a hard limit on the amount of data rmm can allocate or move at once?
void initDataset (std::vector<float> *pos, size_t x, size_t y, size_t z)
{
    int i,j,k;
    double Pe;
    std::mt19937 rng(time(NULL));
    std::uniform_real_distribution<float> gen(-4.0, 0.0);
    for (i=-(int)x/2;i<((int)x/2);++i)
    {
        for (j=-(int)y/2;j<((int)y/2);++j)
        {
            for (k=0;k<z;++k)
            {
                Pe = gen(rng);
                pos->push_back(i);
                pos->push_back(j);
                pos->push_back(k);
                pos->push_back(Pe);
            }
        }
    }
}
int main (int argc, char *argv[])
{
    unsigned int i, iter = 30;
    size_t sx = 400, sy = 400, sz = 2000;
    size_t numParticles = 0;
    std::vector<float> pos; // particle positions
    rmm::device_vector<float> d_pos; // particle positions in GPU
    rmm::device_vector<float> d_posOut; // particle positions out in GPU
    // This will be used to generate plane's normals randomly
    // between -1 to 1
    std::mt19937 rng(time(NULL));
    std::uniform_real_distribution<float> gen(-1.0, 1.0);
    numParticles = sx*sy*sz;
    // Types of allocations:
    // CudaDefaultAllocation
    // PoolAllocation
    // CudaManagedMemory
    rmmOptions_t options{rmmAllocationMode_t::PoolAllocation, 0, true};
    rmmInitialize(&options);
    initDataset(&pos, sx, sy, sz);
    // plane defined by normal and D
    float normal[3], d = 0.0f;
    for (i=0;i<iter;i++)
    {
        // Generating plane's normals randomly
        // between -1 to 1
        normal[0] = gen(rng);
        normal[1] = gen(rng);
        normal[2] = gen(rng);
        timer.reset();
        d_pos = pos;
        ....
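For scale, the size of the host vector can be estimated from the loop bounds in the snippet above, since each particle contributes four floats (i, j, k and Pe). A back-of-the-envelope sketch in Python follows. The helper name dataset_bytes is made up for illustration, and the calculation says nothing about any limit inside RMM, which would also depend on the pool size and the free device memory.

```python
FLOAT_BYTES = 4           # sizeof(float) on the platforms in question
VALUES_PER_PARTICLE = 4   # i, j, k and Pe are pushed for every particle


def dataset_bytes(sx, sy, sz):
    """Bytes held by the pos vector after initDataset(&pos, sx, sy, sz)."""
    particles = sx * sy * sz
    return particles * VALUES_PER_PARTICLE * FLOAT_BYTES


size = dataset_bytes(400, 400, 2000)
print(f"{size} bytes = {size / 2**30:.2f} GiB")
```

The same amount is needed again on the device for d_pos, so the dimensions at which the copy starts failing can be compared against the GPU's memory and the configured pool size.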
Describe the bug
The RMM log is slow. It was written quickly to get something working but the overhead of using STL for a log is too high and therefore it is off by default.
Steps/Code to reproduce bug
Turn on logging in a big app with a lot of alloc/free (e.g. RAPIDS E2E workflow) and see how much it slows down.
Expected behavior
Fast.
Is your feature request related to a problem? Please describe.
rmm::device_vector is an alias for a thrust::device_vector that uses RMM as the allocator. By default, thrust::device_vector will invoke the default constructor for each element in the vector. This is often unnecessary overhead, as it requires launching a kernel to initialize the elements of the vector.
Describe the solution you'd like
Provide rmm::uninitialized_device_vector that simply allocates memory of the specified size and sets .size() appropriately.
See https://github.com/thrust/thrust/blob/master/examples/uninitialized_vector.cu for reference.
Environment details (please complete the following information):
docker pull rapidsai/rapidsai-dev:0.9-cuda10.0-devel-ubuntu16.04-py3.7
docker run --runtime=nvidia --rm -it --net=host -p 8888:8888 -p 8787:8787 -p 8786:8786 -v /home/rapids/notebooks-extended/:/rapids/notebooks/extended/ -v /home/rapids/data/:/home/rapids/data/ rapidsai/rapidsai-dev:0.9-cuda10.0-devel-ubuntu16.04-py3.7
Describe the bug
I am using the Jupyter notebook NYCTaxi-E2E.ipynb and have added the RMM functionality; however, the system crashes at the XGBoost training step. See below the error:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: out of memory
Steps/Code to reproduce bug
Added methods:
def initialize_rmm_pool():
    rmm_cfg.use_pool_allocator = True
    return cudf.rmm.initialize()
def initialize_rmm_no_pool():
    rmm_cfg.use_pool_allocator = False
    return cudf.rmm.initialize()
def finalize_rmm():
    return cudf.rmm.finalize()
Describe the bug
Calling rmm.finalize() after rmm has been initialized in pool mode should free the memory pool, and it used to. This no longer happens.
Steps/Code to reproduce bug
import rmm
from rmm import rmm_config as rmm_cfg
rmm_cfg.use_pool_allocator = True
rmm.initialize()
The pool is allocated with 1/2 of the GPU memory:
NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:3B:00.0 Off | 0 |
| N/A 38C P0 27W / 70W | 7669MiB / 15079MiB | 0% Default
rmm.finalize()
GPU memory usage is still 1/2 of the GPU memory:
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:3B:00.0 Off | 0 |
| N/A 43C P0 28W / 70W | 7669MiB / 15079MiB | 0% Default |
Expected behavior
GPU memory should be freed.
We should transition the current Python bindings API to Cython in order
Describe the bug
Realloc should copy the old data into the new allocation.
Steps/Code to reproduce bug
Just realloc an existing array -- data is likely to be gone.
Expected behavior
The data should be there.
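The expected semantics can be sketched on the host as allocate, copy, then release. This is only an illustration that uses a bytearray as a stand-in for a device allocation. The function name realloc_preserving is hypothetical, and on the GPU the copy would be a device-to-device copy rather than a slice assignment.

```python
def realloc_preserving(old, new_size):
    """Return a new buffer of new_size bytes holding the leading bytes of old."""
    new = bytearray(new_size)        # 1. allocate the new block (zero-filled here)
    keep = min(len(old), new_size)   # 2. copy only as much as fits
    new[:keep] = old[:keep]
    return new                       # 3. the caller then releases the old block


grown = realloc_preserving(bytearray(b"abcd"), 8)
shrunk = realloc_preserving(bytearray(b"abcd"), 2)
print(bytes(grown), bytes(shrunk))
```

Growing keeps all of the old contents, and shrinking keeps the prefix that still fits, which matches the behavior of the C realloc.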
Is your feature request related to a problem? Please describe.
Currently device_array_from_ptr in librmm_cffi/wrapper.py assumes 1D array. Some cuML algorithms return higher dimensional arrays, and we need to wrap them as DeviceNDArray.
Describe the solution you'd like
Add shape and order options to wrap multi dimensional device arrays.
Here is an implementation from cuML SVM:
https://github.com/tfeher/cuml/blob/97d2c00d538a2799db7b42b584b8006aee1633ed/python/cuml/utils/numba_utils.py#L145-L185
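Wrapping a raw pointer as an N-dimensional array mostly comes down to deriving byte strides from the requested shape and order. Here is a minimal sketch of that calculation in plain Python. The function name strides_for is an assumption for illustration, not an existing librmm_cffi function.

```python
def strides_for(shape, itemsize, order="C"):
    """Byte strides of a contiguous array with the given shape and element size.

    order="C" is row-major (last axis varies fastest),
    order="F" is column-major (first axis varies fastest).
    """
    if order not in ("C", "F"):
        raise ValueError("order must be 'C' or 'F'")
    # Walk the axes from fastest-varying to slowest-varying.
    dims = list(shape) if order == "F" else list(reversed(shape))
    strides = []
    step = itemsize
    for extent in dims:
        strides.append(step)
        step *= extent
    if order == "C":
        strides.reverse()
    return tuple(strides)


print(strides_for((3, 4), 4, order="C"))
print(strides_for((3, 4), 4, order="F"))
```

A wrapper taking shape and order arguments could pass the resulting strides on to the array constructor, and the current 1D behavior falls out as the special case of a one-element shape.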
Is your feature request related to a problem? Please describe.
RMM should build the HTML Doxygen documentation from its in-line comments.
Describe the solution you'd like
Add a Doxyfile with configuration options necessary to build the RMM Doxygen HTML documentation.
Ideally, the HTML documentation should then be made available on the web without requiring individuals to build it themselves.
Related: rapidsai/cudf#698
Describe the bug
If rmm is used with a library compiled with -Werror, then compilation fails with the following message:
/home/aatish/workspace/cuhornet/hornet/../externals/rmm/include/rmm/rmm.hpp: In constructor ‘rmm::LogIt::LogIt(rmm::Logger::MemEvent_t, void*, size_t, cudaStream_t, const char*, unsigned int, bool)’:
/home/aatish/workspace/cuhornet/hornet/../externals/rmm/include/rmm/rmm.hpp:101:8: error: ‘rmm::LogIt::usageLogging’ will be initialized after [-Werror=reorder]
bool usageLogging;
^~~~~~~~~~~~
/home/aatish/workspace/cuhornet/hornet/../externals/rmm/include/rmm/rmm.hpp:100:16: error: ‘unsigned int rmm::LogIt::line’ [-Werror=reorder]
unsigned int line;
^~~~
/home/aatish/workspace/cuhornet/hornet/../externals/rmm/include/rmm/rmm.hpp:59:3: error: when initialized here [-Werror=reorder]
LogIt(Logger::MemEvent_t event, void* ptr, size_t size, cudaStream_t stream,
^~~~~
cc1plus: all warnings being treated as errors
This can be replicated with branch-0.10
Describe the bug
If a project includes rmm/rmm.h without running cmake, make, make install, compilation fails with
rmm/include/rmm/detail/memory_manager.hpp:37:30: fatal error: rmm/detail/cnmem.h: No such file or directory
This does not happen if cnmem.h is included in memory_manager.hpp:37 via #include "cnmem.h" instead of #include "rmm/detail/cnmem.h". For projects that have a header-only dependency on rmm, the cmake, make, make install step is not necessary, so it would be desirable for this to work.
Steps/Code to reproduce bug
Compiling
#include <rmm/rmm.h>
int main()
{
return 0;
}
with
g++ -I$CUDA_HOME/include -Irmm/include rmm_include_bug.cpp
reproduces the error
In file included from rmm/include/rmm/rmm.hpp:28:0,
from rmm/include/rmm/rmm.h:5,
from rmm_include_bug.cpp:1:
rmm/include/rmm/detail/memory_manager.hpp:37:10: fatal error: rmm/detail/cnmem.h: No such file or directory
#include "rmm/detail/cnmem.h"
^~~~~~~~~~~~~~~~~~~~
compilation terminated.
Expected behavior
Compilation of the above example works.
Environment details:
print_env.sh
attached as rmm_print_env.log
Is your feature request related to a problem? Please describe.
rmm::device_vector is currently a simple alias for a thrust::device_vector with an rmm_allocator<T> as its allocator template argument. This allocator always uses the null stream for memory allocation, and there is no way for users to modify this behavior.
As seen in rapidsai/cudf#2631, this is problematic.
Describe the solution you'd like
RMM should provide an improved device_vector abstraction. It cannot be just a type alias, as it requires constructor arguments that thrust::device_vector does not currently support(*). However, we can avoid fully reinventing the wheel by inheriting from thrust::device_vector and adding the necessary new constructors.
It should also be built to accept a device_memory_resource to support the new memory resource design.
(*)Thrust in CUDA 10.1 added passing allocators as a function argument, however, that does not fully solve this issue. First of all, we cannot assume all users of RMM can use CUDA 10.1. Second of all, this still does not allow simply specifying a stream in a constructor argument.
Is your feature request related to a problem? Please describe.
Calling rmmFinalize should deallocate the memory pool in any of the pool resources. Currently a pool is only freed when the pool resource is destroyed at the end of the application.
Describe the solution you'd like
The pool resources need release methods that free their memory pools. For example, see std::pmr::synchronized_pool_resource::release().
Additional context
When rmmFinalize is invoked, how do we know which resources to call release on? release is not a member of the device_memory_resource base class, so it's not possible to call get_default_resource()->release(). Do we just always call pool_resource()->release() and managed_pool_resource()->release()? But that will end up constructing those resources only to then release them.
Describe the bug
The unit test MemoryTest.GetInfo tests that the memory available on the GPU goes down after a successful allocation.
https://github.com/rapidsai/rmm/blob/branch-0.7/tests/memory_tests.cpp#L190
It uses the rmmGetInfo API, which in non-pool mode calls cudaMemGetInfo, which queries the entire device's memory usage. This isn't resilient to other processes using the GPU: another process may free a large portion of memory during the test, so the device's overall memory usage goes down (and free memory goes up), causing this test to fail:
04:29:21 [ RUN ] MemoryManagerTest/2.GetInfo
04:29:21 /rapids/cudf/cpp/thirdparty/rmm/tests/memory_tests.cpp:207: Failure
04:29:21 Expected: (freeAfter) <= (freeBefore), actual: 20142030848 vs 20114767872
04:29:21 [ FAILED ] MemoryManagerTest/2.GetInfo, where TypeParam = ModeType<(rmmAllocationMode_t)2> (3 ms)
I believe this test could be made more resilient to GPU sharing by using the NVML API nvmlDeviceGetComputeRunningProcesses. This allows you to query the GPU memory usage of each process using the GPU. In this way, the test can be refactored to ensure that the memory used by the calling process grows as a result of the allocation.
Expected behavior
Unit tests should be resilient to multiple processes using the GPU.
Describe the bug
When RMM options are set to use pool allocation and CUDA Managed Memory, the AllocateTB test hangs or runs for a very long time. I believe the cause is that cudaMallocManaged succeeds for a 1TB allocation when there is sufficient virtual system memory, but the subsequent cudaMemPrefetchAsync() runs for a long time.
Steps/Code to reproduce bug
Just run RMM_TEST on a DGX-1.
Expected behavior
It should return quickly, and the test should pass (potentially by correctly detecting an allocation failure, or by not prefetching if the allocation is larger than the gpu memory size).
Environment details (please complete the following information):
Related issue: #66
Is your feature request related to a problem? Please describe.
I wish I could use RMM for a multi-GPU node. However, it may not be possible in the current implementation if I enable pool allocation.
54 // Initialize memory manager state and storage.
55 rmmError_t rmmInitialize(rmmOptions_t *options)
56 {
57 rmm::Manager::getInstance().initialize(options);
58
59 if (rmm::Manager::usePoolAllocator())
60 {
61 cnmemDevice_t dev;
62 RMM_CHECK_CUDA( cudaGetDevice(&(dev.device)) );
63 // Note: cnmem defaults to half GPU memory
64 dev.size = rmm::Manager::getOptions().initial_pool_size;
65 dev.numStreams = 1;
66 cudaStream_t streams[1]; streams[0] = 0;
67 dev.streams = streams;
68 dev.streamSizes = 0;
69 unsigned flags = rmm::Manager::useManagedMemory() ? CNMEM_FLAGS_MANAGED : 0;
70 RMM_CHECK_CNMEM( cnmemInit(1, &dev, flags) );
71 }
72 return RMM_SUCCESS;
73 }
rmmInitialize() calls cnmemInit in line 70 with numDevices set to 1.
1071 cnmemStatus_t cnmemInit(int numDevices, const cnmemDevice_t *devices, unsigned flags) {
1072 // Make sure we have at least one device declared.
1073 CNMEM_CHECK_TRUE(numDevices > 0, CNMEM_STATUS_INVALID_ARGUMENT);
1074
1075 // Find the largest ID of the device.
1076 int maxDevice = 0;
1077 for( int i = 0 ; i < numDevices ; ++i ) {
1078 if( devices[i].device > maxDevice ) {
1079 maxDevice = devices[i].device;
1080 }
1081 }
1082
1083 // Create the global context.
1084 cnmem::Context::create();
...
cnmemInit() calls cnmem::Context::create() in line 1084 and
1024 cnmemStatus_t Context::create() {
1025 sCtx = new Context;
1026 sCtxCheck = CTX_VALID;
1027 return CNMEM_STATUS_SUCCESS;
1028 }
create() resets the Context class's static member variable sCtx to a newly created Context object.
So, if I call rmmInitialize() multiple times (after cudaSetDevice(), once per device), only the last call will have an effect (besides leaking the previously allocated Context objects).
Unlike cnmemInit, rmmInitialize() does not take a device count, so I cannot initialize RMM for multiple devices in a single rmmInitialize() call, either.
Describe the solution you'd like
Need a mechanism to initialize RMM for multiple devices (in cnmem style or by calling rmmInitialize multiple times after cudaSetDevice).
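One possible shape for the second option (one initialize call per device) is to key the manager state by device ID instead of holding a single global context. The following is only a design sketch in Python. PoolRegistry and its methods are hypothetical names, and nothing here reflects actual RMM or cnmem code.

```python
class PoolRegistry:
    """Hypothetical per-device pool table (illustrative only)."""

    def __init__(self):
        self._pools = {}   # device id -> pool settings

    def initialize(self, device_id, pool_size):
        # A second call for another device adds an entry instead of
        # overwriting a single global context.
        if device_id in self._pools:
            raise RuntimeError(f"device {device_id} is already initialized")
        self._pools[device_id] = {"size": pool_size}

    def is_initialized(self, device_id):
        return device_id in self._pools

    def finalize(self, device_id):
        self._pools.pop(device_id)


registry = PoolRegistry()
for dev in (0, 1):
    registry.initialize(dev, pool_size=1 << 30)
print(registry.is_initialized(0), registry.is_initialized(1))
```

With state kept per device, initializing device 1 leaves device 0 untouched, and each device can be finalized independently.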
Is your feature request related to a problem? Please describe.
I'd like to run the following code.
from librmm_cffi import librmm as rmm
import cudf
s = cudf.Series([0, 1, 2])
a = rmm.device_array_like(s)
Currently this fails with the following error.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-917a74c3463e> in <module>
----> 1 rmm.device_array_like(s)
~/miniconda/envs/rapids9/lib/python3.7/site-packages/librmm_cffi/wrapper.py in device_array_like(self, ary, stream)
227 ary = ary.reshape(1)
228
--> 229 return self.device_array(ary.shape, ary.dtype, ary.strides,
230 stream=stream)
231
AttributeError: 'Series' object has no attribute 'strides'
Describe the solution you'd like
It would be great if rmm.device_array_like worked with Series objects. No strong feelings about how that is accomplished.
Describe alternatives you've considered
We could special-case the handling of Series objects, but this shifts the burden to other libraries to solve this problem.
Alternatively, cuDF Series objects could gain a strides attribute. This could be reasonable.
Additional context
This came up when trying to better handle GPU array-like objects in cuML ( rapidsai/cuml#1086 ), which is part of the Grid Search effort.
Edit: More specifically, we tried to use librmm_cffi.librmm.to_device instead of numba.cuda.to_device, but were unable to, as Series are not supported.
Besides, is there any publication for rmm?
When using RMM in pool mode, out-of-bounds memory accesses that would otherwise cause a segfault can go undetected, because the access may still fall within the bounds of the pre-allocated memory pool.
To avoid this, it is highly recommended to develop code with the non-pool version of RMM until correctness has been verified, at which point the pool can be used to improve performance.
Is your feature request related to a problem? Please describe.
There is currently no way to query whether or not RMM has been initialized and if so, what options were used.
Describe the solution you'd like
Provide an API for querying the initialization state of RMM, e.g. bool rmm::is_initialized(rmmOptions_t *options), which would return true or false and, if true, fill out the options struct.
Describe alternatives you've considered
I have also considered separating the Boolean state and the options into separate queries, but I think allowing nullptr as a valid value for options satisfies both use cases.
Additional context
This is necessary for interoperation of multiple modules / libraries that all need to use RMM without re-initializing it.
Is any Java support planned?
Describe the bug
A segmentation fault occurs inside the thrust::sort call inside gdf_order_by in libcudf when RMM pool allocation is used.
Steps/Code to reproduce bug
from librmm_cffi import librmm_config as rmm_cfg
rmm_cfg.use_pool_allocator = True
import cudf
cudf._gdf.rmm_initialize()
df = cudf.DataFrame()
df['a'] = [1,2,3,4,5]
df['b'] = [5,4,3,2,1]
print(df.sort_values(['a']))
Environment details (please complete the following information):
Using branch-0.5 of cuDF 5aa1429f8305cfeb120aaa904d71dabfe785898d
Additional context
As you can see from the stack trace below, the error occurs inside a thrust::sort call that attempts to use RMM to allocate a temporary buffer using a non-null stream.
#1 0x00007fffe06b05a0 in cuEGLApiInit () from /usr/lib/x86_64-linux-gnu/libcuda.so
#2 0x00007fffe05c8555 in cuMemGetAttribute_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so
#3 0x00007fffe070f83f in cuStreamGetFlags () from /usr/lib/x86_64-linux-gnu/libcuda.so
#4 0x00007fffdf6e9ebf in ?? () from /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2
#5 0x00007fffdf71231f in cudaStreamGetFlags () from /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2
#6 0x00007fffdf964df9 in cnmem::Manager::setStream (this=0x21f65cf0, stream=0x7ffb50000600) at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/thirdparty/cnmem/src/cnmem.cpp:392
#7 0x00007fffdf9643fe in cnmemRegisterStream (stream=0x7ffb50000600) at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/thirdparty/cnmem/src/cnmem.cpp:1166
#8 0x00007fffdf95ef8e in rmm::Manager::registerStream (this=0x7fffdfb6e160 <rmm::Manager::getInstance()::instance>, stream=0x7ffb50000600) at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/src/memory_manager.cpp:94
#9 0x00007fffc84791b1 in rmm::alloc<void> (ptr=0x7fffffffc2b0, size=767, stream=0x7ffb50000600, file=0x7fffc8b7c0e0 <_ZN3rmmL17RMM_USAGE_LOGGINGE+3889> "/home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/include/rmm/thrust_rmm_allocator.h", line=48)
at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/include/rmm/rmm.hpp:133
#10 0x00007fffc84e6038 in rmm_allocator<char>::allocate (this=0x7fffffffcae0, n=767) at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/include/rmm/thrust_rmm_allocator.h:48
#11 0x00007fffc84e5bdf in thrust::detail::allocator_traits<rmm_allocator<char> >::allocate(rmm_allocator<char>&, unsigned long)::workaround_warnings::allocate(rmm_allocator<char>&, unsigned long) (a=..., n=767)
at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/allocator/allocator_traits.inl:230
#12 0x00007fffc84e5c05 in thrust::detail::allocator_traits<rmm_allocator<char> >::allocate (a=..., n=767) at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/allocator/allocator_traits.inl:234
#13 0x00007fffc84e4a59 in thrust::detail::get_temporary_buffer<char, rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> (system=..., n=767) at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/execute_with_allocator.h:86
#14 0x00007fffc84e2f76 in thrust::get_temporary_buffer<char, thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> > (exec=..., n=767)
at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/temporary_buffer.h:62
#15 0x00007fffc84e24b3 in thrust::cuda_cub::get_memory_buffer<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> > (policy=..., n=767)
at /usr/local/cuda/targets/x86_64-linux/include/thrust/system/cuda/detail/memory_buffer.h:57
#16 0x00007fffc84e1096 in thrust::cuda_cub::__merge_sort::merge_sort<thrust::detail::integral_constant<bool, false>, thrust::detail::integral_constant<bool, false>, thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> >, int*, int*, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*), &(void multi_col_sort<int>(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*)), 2u>, LesserRTTI<int> > > (compare_op=..., items_first=0x0, keys_last=0x7ffb50000614, keys_first=0x7ffb50000600, policy=...)
at /usr/local/cuda/targets/x86_64-linux/include/thrust/system/cuda/detail/sort.h:1336
#17 thrust::cuda_cub::__smart_sort::smart_sort<thrust::detail::integral_constant<bool, false>, thrust::detail::integral_constant<bool, false>, thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> >, int*, int*, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*), &(void multi_col_sort<int>(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*)), 2u>, LesserRTTI<int> > > (compare_op=..., items_first=0x0, keys_last=0x7ffb50000614, keys_first=0x7ffb50000600, policy=...)
at /usr/local/cuda/targets/x86_64-linux/include/thrust/system/cuda/detail/sort.h:1576
#18 thrust::cuda_cub::sort<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base>, int*, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*), &(void multi_col_sort<int>(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*)), 2u>, LesserRTTI<int> > > (policy=..., first=0x7ffb50000600,
last=0x7ffb50000614, compare_op=...) at /usr/local/cuda/targets/x86_64-linux/include/thrust/system/cuda/detail/sort.h:1653
#19 0x00007fffc84de58b in thrust::sort<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base>, int*, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*), &(void multi_col_sort<int>(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*)), 2u>, LesserRTTI<int> > > (exec=...,
first=0x7ffb50000600, last=0x7ffb50000614, comp=...) at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/sort.inl:56
#20 0x00007fffc84ddb1e in multi_col_sort<int> (d_cols=0x7ffb50000a00, d_valids=0x7ffb50000c00, d_col_types=0x7ffb50000e00, d_asc_desc=0x7ffb50000800 "", ncols=1, nrows=5, have_nulls=false, d_indx=0x7ffb50000600, nulls_are_smallest=false, stream=0x0)
at /home/jhemstad/RAPIDS/repro/cudf/cpp/src/orderby/../sqls/sqls_rtti_comp.h:814
#21 0x00007fffc84daa21 in (anonymous namespace)::multi_col_order_by (cols=0x219e89e0, asc_desc=0x7ffb50000800 "", ncols=1, output_indices=0x21998850, flag_nulls_are_smallest=false) at /home/jhemstad/RAPIDS/repro/cudf/cpp/src/orderby/orderby.cu:57
#22 0x00007fffc84daae9 in gdf_order_by (cols=0x219e89e0, asc_desc=0x7ffb50000800 "", ncols=1, output_indices=0x21998850, flag_nulls_are_smallest=0) at /home/jhemstad/RAPIDS/repro/cudf/cpp/src/orderby/orderby.cu:88
Is your feature request related to a problem? Please describe.
std::unique_ptr and std::shared_ptr support safer programming, but to use them with RMM, I need to define custom deleters that invoke RMM_FREE instead of C++'s default delete. Currently, every project using RMM has to define its own, which duplicates work.
Also, cudf currently has device_buffer and this provides a wrapper for an RMM memory block (similar to thrust::device_vector with RMM allocator but does not incur initialization overhead). Other projects can benefit from this as well, and I hope RMM provides this feature rather than every project reimplementing its own.
Describe the bug
random_allocate.cpp includes the line #define _BSD_SOURCE, which is deprecated in newer versions of GCC and causes -Werror compilation to fail.
Steps/Code to reproduce bug
Fails to compile on Linux Ubuntu 18.04 L4T kernel
g++ (Ubuntu/Linaro 7.3.0-27ubuntu1~18.04) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Additional context
Trying to build on Jetson Xavier
Note that the fix is easy: remove that line as it seems unnecessary.
For Python classes that wrap C++ classes holding memory allocated by RMM, the order in which Python deallocates objects when the process ends may cause an RMM_FREE initialization error when using the pool allocator. This occurs because the RMM instance may have been destroyed before the Python object. This error causes the Python process to terminate with a core dump instead of exiting cleanly.
Simple test case showing the error from a Python command-line interpreter:
>>> from librmm_cffi import librmm as rmm
>>> from librmm_cffi import librmm_config as rmm_cfg
>>> rmm_cfg.use_pool_allocator = True
>>> rmm.initialize()
0
>>> import nvstrings
>>> strs = nvstrings.to_device(["hello"])
>>> exit()
Before the process can exit cleanly, the following exception is thrown and terminates the process:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): rmm_allocator::deallocate(): RMM_FREE: cudaErrorInitializationError: initialization error
Aborted
This particular error is thrown in rmm/include/rmm/thrust_rmm_allocator.h, line 59 in 919fb5b:
inline void deallocate(pointer ptr, size_t)
{
rmmError_t error = RMM_FREE(thrust::raw_pointer_cast(ptr), stream);
if(error != RMM_SUCCESS)
{
throw thrust::system_error(error, thrust::cuda_category(), "rmm_allocator::deallocate(): RMM_FREE");
}
}
The nvstrings instance points to a C++ NVStrings instance that has a member variable allocated with rmm::device_vector, and this vector is freed after RMM has been deinitialized by the Python process.
Throwing the error in the rmm::device_vector destructor causes the process to terminate (core dump).
I propose checking for this condition (free called after deinit) inside RMM_FREE or rmm::free and ignoring the error, since the memory has already been freed and no corruption will occur.
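The proposal can be sketched roughly as follows. Everything here is a hypothetical stand-in (demo_initialize, demo_finalize, demo_free, and the demo_error codes are not RMM names); the point is only that a deallocation path used from destructors can treat "freed after finalize" as a no-op instead of throwing.

```cpp
#include <cassert>

// Hypothetical error codes and manager state (assumption: not the RMM API).
enum demo_error { DEMO_SUCCESS = 0, DEMO_ERROR_NOT_INITIALIZED = 1 };

static bool demo_initialized = false;

inline void demo_initialize() { demo_initialized = true; }

// Finalizing releases the whole pool, so outstanding pointers are already gone.
inline void demo_finalize() { demo_initialized = false; }

// Stand-in for the free call: reports an error if the manager is gone.
inline demo_error demo_free(void* /*ptr*/) {
  if (!demo_initialized) { return DEMO_ERROR_NOT_INITIALIZED; }
  return DEMO_SUCCESS;
}

// Destructor-safe deallocation: never throws, and treats a free that
// arrives after finalize as already handled. Returns false only for
// other (unexpected) errors, which the caller may log.
inline bool deallocate_noexcept(void* ptr) noexcept {
  const demo_error err = demo_free(ptr);
  return err == DEMO_SUCCESS || err == DEMO_ERROR_NOT_INITIALIZED;
}

// Mimics the shutdown order from the report: manager first, object second.
inline void demo_shutdown_order() {
  demo_initialize();
  demo_finalize();
  assert(deallocate_noexcept(nullptr));  // late free is tolerated, not fatal
}
```

Whether the check belongs in the free function itself or in the allocator wrapper is a design choice this sketch does not settle.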
Describe the bug
Now that CI is being added, flake8 is finding minor style problems, for example in rmm_tests.py.
Steps/Code to reproduce bug
Run flake8 python from the root RMM directory.
Expected behavior
No errors
Is your feature request related to a problem? Please describe.
There's currently no memory resource for allocating pinned memory (e.g., cudaHostAlloc).
Describe the solution you'd like
There should be a pinned_memory_resource.
Additional context
Inspired by rapidsai/cudf#2872 (comment)
Is your feature request related to a problem? Please describe.
rmmGetInfo reports the amount of free memory available. However, that number can be misleading when the memory regions are fragmented.
Describe the solution you'd like
rmmGetInfo should also return the size of the largest contiguous memory region available for allocation.
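To make the difference between total free memory and the largest contiguous region concrete, here is a small sketch. The free_summary struct and both functions are hypothetical and not part of RMM or cnmem; the input is simply a list of free block sizes in bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary of a pool's free list (assumption: not an RMM type).
struct free_summary {
  std::size_t total_free;     // sum of all free blocks
  std::size_t largest_block;  // largest single contiguous free block
};

// Walks the free block sizes once and records both quantities.
inline free_summary summarize(const std::vector<std::size_t>& free_block_sizes) {
  free_summary s{0, 0};
  for (std::size_t size : free_block_sizes) {
    s.total_free += size;
    s.largest_block = std::max(s.largest_block, size);
  }
  return s;
}

// A single allocation must fit in one contiguous block, so the total
// free amount alone cannot answer this question.
inline bool can_allocate(const free_summary& s, std::size_t bytes) {
  return bytes <= s.largest_block;
}

// Fragmented pool: 350 bytes free in total, but no block larger than 200.
inline void demo_fragmentation() {
  const free_summary s = summarize({100, 50, 200});
  assert(s.total_free == 350);
  assert(!can_allocate(s, 300));
}
```

In this example a caller sizing a request from total_free alone would ask for up to 350 bytes and fail, which is the situation the request wants to avoid.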
Describe alternatives you've considered
There are no alternatives to this. Our workaround is to expose a 'max-mem' parameter to our users and hope that they pass a value that will not cause an OOM error down the line. This code can be seen here
Additional context
Since RMM wraps cnmem, maybe this change should be made in cnmem itself. But I've filed this issue in RMM, at least to get the conversation started. Tagging cuML folks, JFYI: @JohnZed @dantegd @cjnolet
What is your question?
AresDB integrated with RMM last week and we have been running it in staging for a while.
We use pooled memory management and the default stream for memory allocation.
After 30 minutes, it seems all the memory of one GPU card is exhausted, and a segmentation fault happens on the next memory allocation.
I don't think there are any memory leaks in our code, since it worked previously when we called cudaMalloc/cudaFree directly.
Here is the link to our code
https://github.com/uber/aresdb/blob/master/memutils/memory/rmm_alloc.cu
Thank you so much!
The name of the RMM header memory.h clashes with STL or C standard header names, creating build issues ('extern "C"' causing mangling issues when linking being one of the harmful consequences). Please consider renaming to a non-standard header name (e.g., rmm_memory.h).
Is your feature request related to a problem? Please describe.
Currently, the memory manager is a singleton class, which means all devices share the same pool.
Describe the solution you'd like
Ideally, we could create a memory manager per device or pass a device parameter to the RMMMalloc/RMMFree call.
Hi!
The README.md file says RMM can only be installed from source.
Nevertheless, I have found the following conda packages:
https://anaconda.org/rapidsai/rmm
https://anaconda.org/rapidsai/librmm
I am wondering whether the README.md file is up to date. If not, it would be great to update it to mention the conda installs.
Location of incorrect documentation
README.md in master branch.
https://github.com/rapidsai/rmm#install-rmm
Describe the problems or issues found in the documentation
(detailed above)
I used make install and it does indeed copy files to a location that I specify or to the default location (which was /usr/local/ on my system). The header files are placed there in include/include/. If I then try to do #include <include/memory.h>, it fails because the other files are not on the include path. It would make sense either to create a directory called rmm within include and make sure that all the header files within rmm also look for files within that directory, or simply to place the files into usr/local/include/ rather than usr/local/include/include.
Describe the bug
After an allocation pool of size X is consumed (and freed), new memory allocations (and frees) cause additional memory to be allocated from the GPU in increments of X.
Memory allocations/frees below the initial pool size X work fine until a new allocation goes above the initial size. This causes rmm to allocate a new chunk of memory on top of the initial pool to accommodate the request. The caller then frees all memory and requests new memory, which again goes over the initial pool size. This causes rmm to allocate yet another chunk of memory. The first extra chunk is not reused although it has been entirely freed. There is now 3X of GPU memory allocated though less than 2X has been requested. Continuing this pattern allocates additional chunks of X until the GPU resources are used up.
Steps/Code to reproduce bug
I created a simple test to show this problem here:
https://github.com/davidwendt/rmmtest/blob/master/explode.cu
The program allocates increasing amounts of memory, 2 allocations at a time (each pair followed by 2 frees), and requests no more than 4GB in total at any one time. Again, all memory is freed almost immediately after allocating.
With an initial pool size set to 4GB, this works well. rmm allocates 4GB and never goes above that.
With an initial pool size set to 2GB, rmm ends up allocating 24GB of GPU memory for the same code.
The intermediate new chunks of memory do not seem to be reused.
Expected behavior
Requesting memory beyond the initial pool size should be able to reuse freed memory in the new chunks.
Environment details (please complete the following information):
Location of incorrect documentation
README.md has no explanation of logging or how to use it from C++ or Python
Describe the problems or issues found in the documentation
README.md has no explanation of logging or how to use it from C++ or Python
Suggested fix for documentation
Add explanation and usage examples of logging to README.md
Should wait until after CFFI is migrated to Cython.
Problem: Bitwise OR-ing yields an int, not an enum.
The API documentation implies that the allocation mode enums can be bitwise OR-ed.
Example:
rmmOptions_t rmm_option {
.allocation_mode = PoolAllocation | CudaManagedMemory,
.initial_pool_size = free_memory / 2,
.enable_logging = true };
gives a compiler error: error: a value of type "int" cannot be used to initialize an entity of type "rmmAllocationMode_t"
Suggestions:
a) Implement operator:
inline rmmAllocationMode_t operator|(rmmAllocationMode_t left, rmmAllocationMode_t right) {
return static_cast<rmmAllocationMode_t>(
static_cast<int>(left) | static_cast<int>(right));
}
or
b) add a member PoolAllocationCudaManagedMemory = 3 to the enum, without bitwise OR-ing.
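Suggestion (a) can be tried in isolation with a stand-in enum. The sketch below uses a hypothetical demo_allocation_mode_t whose enumerator values are assumptions chosen to match the "= 3" in suggestion (b); it is not the real rmmAllocationMode_t. The has_mode helper is also hypothetical.

```cpp
#include <cassert>

// Stand-in enum (assumption: names and values are illustrative only).
enum demo_allocation_mode_t {
  DemoCudaDefaultAllocation = 0,
  DemoPoolAllocation        = 1,
  DemoCudaManagedMemory     = 2,
};

// Combines two modes and keeps the enum type, so the result can be used
// to initialize a field of type demo_allocation_mode_t without a cast.
inline demo_allocation_mode_t operator|(demo_allocation_mode_t left,
                                        demo_allocation_mode_t right) {
  return static_cast<demo_allocation_mode_t>(static_cast<int>(left) |
                                             static_cast<int>(right));
}

// Tests whether a combined value contains a given flag.
inline bool has_mode(demo_allocation_mode_t value, demo_allocation_mode_t flag) {
  return (static_cast<int>(value) & static_cast<int>(flag)) != 0;
}

inline void demo_combine() {
  const demo_allocation_mode_t mode = DemoPoolAllocation | DemoCudaManagedMemory;
  assert(has_mode(mode, DemoPoolAllocation));
  assert(has_mode(mode, DemoCudaManagedMemory));
}
```

The operator casts both operands to int before OR-ing, so it does not call itself recursively.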
Is your feature request related to a problem? Please describe.
RMM supports multi-device allocation and is thread safe, but we don't have tests of either of these.
Describe the solution you'd like
Add tests for allocation on multiple devices. Add multi-threaded single-device and multi-device tests.
If I try to RMM_FREE an invalid address (note c_stream is 0):
err = RMM_FREE(reinterpret_cast<void*>(100), c_stream);
It prints this warning:
warning: Cuda API error detected: cudaFree returned (0x11)
but err is RMM_SUCCESS. I expected RMM_ERROR_CUDA_ERROR.
This is a recent issue in branch-0.10, possibly related to #127.
Here is some repro code, in which the if branch is never entered:
cudaStream_t c_stream = reinterpret_cast<cudaStream_t>(0);
rmmError_t err = RMM_FREE(reinterpret_cast<void*>(100), c_stream);
if (err != RMM_SUCCESS) {
std::cout << "not successful free of invalid address" << err << std::endl;
}