rapidsai / rmm

RAPIDS Memory Manager

Home Page: https://docs.rapids.ai/api/rmm/stable/

License: Apache License 2.0

CMake 2.63% C++ 71.31% Shell 2.32% Python 9.70% Cuda 4.41% Makefile 0.07% Cython 9.09% HTML 0.24% Dockerfile 0.07% C 0.16%

rapids cuda memory-management memory-allocation

rmm's Issues

Is there any publication of RMM

Any publication? I want to cite it.

[FEA] Create conda package for librmm (C++) and rmm (Python) and depend on them for downstream packages

Downstream builds (custrings, cuDF, etc.) shouldn't compile and install rmm as part of their own build. They should find rmm already installed on the system and dynamically link against it instead.

[FEA] RMM_TRY and RMM_TRY_CUDAERROR

Is your feature request related to a problem? Please describe.
cudf has RMM_TRY and RMM_TRY_CUDAERROR. Other projects using RMM often need to redefine these macros themselves. It would be better if RMM provided them.

We may also need a no-throw version (e.g. something like RMM_TRY_NOTHROW), mainly for class destructors and for RMM_FREE. Class destructors are noexcept by default, and if calling RMM_FREE with an erroneous parameter results in undefined behavior similar to std::free (https://en.cppreference.com/w/cpp/memory/c/free), it's better to print an error and crash the program than to continue execution with undefined behavior.

[FEA] Host Doxygen HTML

Is your feature request related to a problem? Please describe.
The libcudf Doxygen documentation HTML page should be accessible without requiring someone to clone the repo and build with make doc.

Describe the solution you'd like
Doxygen HTML should be hosted and accessible via github or dev docs page.

[FEA] Add git commit hook to format code with `clang-format`

Is your feature request related to a problem? Please describe.
Formatting of RMM source code should be consistent and automatic.

Describe the solution you'd like
clang-format should be added as a git hook to automatically format code before it is committed.

See some repos that have already done some of the pre-work necessary:
https://github.com/barisione/clang-format-hooks
https://github.com/andrewseidl/githook-clang-format

[DOC] RMM headers don't specify alignment of allocations

(I'm using the RMM in rapidsai/cudf/branch-0.5.)

It seems the Doxygen comments, such as they are, for alloc(), RMM_ALLOC() and other relevant functions do not indicate whether allocations are aligned, or to what degree.

[DOC] Add documentation to README describing how to use/access log info

Report needed documentation
Documentation is needed in the README that describes how to enable and access the logging information that RMM provides.

[FEA] Configure RMM CMake to build CUDA files

Is your feature request related to a problem? Please describe.
For some tests, I would like to be able to compile/run kernels and Thrust functions. However, I cannot build any .cu files using RMM's existing cmake configuration.

Describe the solution you'd like
Update RMM's cmake configuration to allow building .cu files.

[BUG] gc.collect() needs to be called before rmm.finalize() in test_rmm.py

Describe the bug

test_rmm.py serves as a reference to test code with different RMM configurations.

# Test all combinations of default/managed and pooled/non-pooled allocation
@pytest.mark.parametrize('managed, pool',
                         list(product([False, True], [False, True])))
def test_rmm_modes(managed, pool):
    rmm.finalize()
    rmm_cfg.use_managed_memory = managed
    rmm_cfg.use_pool_allocator = pool
    rmm.initialize()

    assert(rmm.is_initialized())

    array_tester(np.int32, 128)

array_tester creates objects holding GPU memory. Calling rmm.finalize() before these objects are destroyed can corrupt memory and lead to undefined behavior. Calling gc.collect() before rmm.finalize() forces unreachable objects that have not yet been collected to be deleted (releasing their GPU memory), which avoids the corruption.
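The ordering problem can be illustrated without a GPU. The sketch below is a pure-Python stand-in and does not use the rmm API: `FakePool` and `Buffer` are hypothetical names, and the reference cycle is contrived so that the buffer can only be reclaimed by the cycle collector.

```python
import gc


class FakePool:
    """Hypothetical stand-in for a memory manager; not the rmm API."""

    def __init__(self):
        self.live = 0          # number of outstanding allocations
        self.finalized = False

    def alloc(self):
        self.live += 1

    def free(self):
        # Freeing into a finalized pool is the situation to avoid.
        if self.finalized:
            raise RuntimeError("free() called after finalize()")
        self.live -= 1

    def finalize(self):
        self.finalized = True


class Buffer:
    """Owns one allocation and returns it when the object is destroyed."""

    def __init__(self, pool):
        self.pool = pool
        pool.alloc()
        self.cycle = self      # reference cycle: refcounting alone cannot free it

    def __del__(self):
        self.pool.free()


def run_test(pool):
    Buffer(pool)               # unreachable after this line, but still alive
    gc.collect()               # destroy unreachable objects first ...
    pool.finalize()            # ... then tear the manager down
    return pool.live


gc.disable()  # keep automatic collection from hiding the ordering issue
leaked = run_test(FakePool())
gc.enable()
print(leaked)
```

If the `gc.collect()` call is removed, the buffer is still alive when `finalize()` runs, which is the situation the issue describes.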

[FEA] Create a process so RMM can live as a global memory manager for multiple processes

Is your feature request related to a problem? Please describe.
As more people start using the ecosystem and building workloads with rapids.ai, they will start spawning processes that are triggered by real-time events, by a clock, by user interaction, etc. We don't have a way of estimating the memory usage of all of our algorithms (e.g. group by and join), but we DO know each time that cudf requests an allocation from rmm. Because the execution of these different workloads is unpredictable in terms of both scheduling and memory consumption, we can run out of resources not because any particular job requires more memory than can be provided, but because the jobs can't all be run at the same time.

Describe the solution you'd like

To make a distinction between allocations which are in a temporary state (e.g. used for calculations in a short-term process, something that lives for seconds, not minutes) and those which are long-lived (e.g. results we hand back to the user, or things we decide to store for longer than the time they are being processed).
To understand whether a job can begin and proceed, based on whether enough memory is available right now or will be soon because freeing the allocations in a temporary state would release enough memory to make the allocation.
To have all allocations originate in a single process that keeps track of how much has been allocated and in what state, so that requests can be fulfilled even when enough memory is not available at the moment of the initial request, so long as it will become available very soon.
To possibly allow allocations to be non-blocking and to belong to a group of allocations which we can synchronize together. This would handle cases where an algorithm is going to make 4 allocations at one point and those allocations all need to exist at the same time. They can succeed or fail as a group, and we can back out any allocations that were made if any in the group did not succeed.
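A minimal sketch of the bookkeeping this list asks for, in plain Python rather than as a separate process. Everything here is hypothetical (the `AllocationTracker` class, its method names, and the byte budget); it only illustrates the temporary/long-lived split and the all-or-nothing group allocation.

```python
class AllocationTracker:
    """Hypothetical single owner of a fixed memory budget (sizes in bytes)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = {"temporary": 0, "long_lived": 0}

    def free_bytes(self):
        return self.capacity - sum(self.used.values())

    def reclaimable_soon(self):
        # Temporary allocations are expected to be released shortly.
        return self.free_bytes() + self.used["temporary"]

    def allocate_group(self, sizes, kind="temporary"):
        """Allocate every size in `sizes`, or nothing at all."""
        granted = []
        for size in sizes:
            if size > self.free_bytes():
                # Back out the partial group so no memory stays claimed.
                for done in granted:
                    self.used[kind] -= done
                return False
            self.used[kind] += size
            granted.append(size)
        return True

    def release(self, size, kind="temporary"):
        self.used[kind] -= size


tracker = AllocationTracker(capacity=100)
first = tracker.allocate_group([30, 30], kind="temporary")    # both fit
second = tracker.allocate_group([20, 30], kind="long_lived")  # 20 fits, 30 does not
could_run_later = tracker.reclaimable_soon() >= 50
```

The second group fails as a whole and leaves no partial allocation behind, while `reclaimable_soon()` shows that the same request could be satisfied once the temporary allocations are released.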

Describe alternatives you've considered
Tracking allocations within our uses of cudf, and adding a wrapper to the cudf python library that keeps track of memory as it comes in and out, but I don't think this would really work.

Additional context
The code we currently have works great for demos and for workloads that you run one time. As people develop their toolsets they will run more and more workloads, and it will not be possible to assume that these workloads are being queued to run, nor should they be. I really think we should start considering possibilities for managing allocations across multiple processes. This could also allow us to be more aggressive with the size of the pool. As a last piece of context, I have not thought this through at length; these are just stream-of-consciousness ideas to help get a discussion going.

[BUG] "global function call is not configured" <- What does this mean?

I'm developing a feature over cudf branch-0.6; using rmm changeset dfe2c4b . At some point, I'm getting this error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  rmm_allocator::deallocate(): RMM_FREE: __global__ function call is not configured
Aborted

This is a problematic exception. Regardless of the reason this happened to me - "system error" is very general, and a typical user of rmm cannot understand what the what() message means.

So, please rewrite the code producing this what() message to:

Target people who don't know anything about rmm internals.
Be more specific in what exactly happened that should not have.
Describe something which should have been done differently, or was skipped etc. - to give the user a hint regarding how to avoid this exception.
Add some specifics in addition to the canned string (e.g. address, size, other parameters, other state, or some of the above).
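As an illustration of the four requests above, here is a sketch of what a more descriptive message could look like. The function name, the wording, and the hint are all hypothetical and are not taken from RMM; the pointer and stream values are placeholders.

```python
def describe_free_failure(ptr, stream, cause):
    """Build a hypothetical, user-facing message for a failed deallocation."""
    return (
        f"RMM failed to free device memory at {ptr:#x} on stream {stream:#x}. "
        f"Underlying CUDA error: {cause}. "
        "Hint: check that the pointer was allocated by RMM and has not "
        "already been freed."
    )


message = describe_free_failure(
    0x7FFB50000600, 0x0, "__global__ function call is not configured"
)
print(message)
```

The canned CUDA string is kept, but it is surrounded by the operation that failed, the parameters involved, and a suggestion for what to check.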

[QST] Hard limit on the amount of data rmm can allocate at once?

What is your question?
In the code snippet below, the last line, d_pos=pos, transfers data from main memory to GPU memory. I'm having issues with this line when the vectors hold more than a certain amount of data (around 30GB). Is there a hard limit on the amount of data rmm can allocate/move at once?

void initDataset (std::vector<float> *pos, size_t x, size_t y, size_t z)
{
    int i,j,k;
    double Pe;
    std::mt19937 rng(time(NULL));
    std::uniform_real_distribution<float> gen(-4.0, 0.0);
    for (i=-(int)x/2;i<((int)x/2);++i)
    {
        for (j=-(int)y/2;j<((int)y/2);++j)
        {
            for (k=0;k<z;++k)
            {
                Pe = gen(rng);
                pos->push_back(i);
                pos->push_back(j);
                pos->push_back(k);
                pos->push_back(Pe);
            }
        }
    }
}

int main (int argc, char *argv[])
{
    unsigned int i, iter = 30;
    size_t sx = 400, sy = 400, sz = 2000;
    size_t numParticles = 0;
    std::vector<float> pos; // particle positions

    rmm::device_vector<float> d_pos; // particle positions in GPU
    rmm::device_vector<float> d_posOut; // particle positions out in GPU

    // This will be used to generate plane's normals randomly
    // between -1 to 1
    std::mt19937 rng(time(NULL));
    std::uniform_real_distribution<float> gen(-1.0, 1.0);
    numParticles = sx*sy*sz;

    // Types of allocations:
    // CudaDefaultAllocation
    // PoolAllocation
    // CudaManagedMemory

    rmmOptions_t options{rmmAllocationMode_t::PoolAllocation, 0, true};
    rmmInitialize(&options);

    initDataset(&pos, sx, sy, sz);

    // plane defined by normal and D
    float normal[3], d = 0.0f;


    for (i=0;i<iter;i++)
    {
        // Generating plane's normals randomly
        // between -1 to 1
        normal[0] = gen(rng);
        normal[1] = gen(rng);
        normal[2] = gen(rng);

        timer.reset();
        d_pos = pos;
   ....
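To put a number on how much data that assignment moves, the arithmetic below works out the size of pos for the dimensions shown in the snippet. It assumes a 4-byte float; the variable names are only for illustration.

```python
# Grid dimensions taken from main() in the snippet.
sx, sy, sz = 400, 400, 2000
floats_per_particle = 4      # i, j, k and Pe are pushed for every particle
bytes_per_float = 4          # assumption: 4-byte float

num_particles = sx * sy * sz
total_bytes = num_particles * floats_per_particle * bytes_per_float

print(num_particles)             # 320000000
print(total_bytes)               # 5120000000
print(total_bytes / 2**30)       # size in GiB
```

With these dimensions each copy is about 5.12 GB, so the roughly 30GB threshold mentioned in the question presumably corresponds to larger dimensions than the ones shown.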

[BUG] RMM logging is slow.

Describe the bug
The RMM log is slow. It was written quickly to get something working but the overhead of using STL for a log is too high and therefore it is off by default.

Steps/Code to reproduce bug
Turn on logging in a big app with a lot of alloc/free (e.g. RAPIDS E2E workflow) and see how much it slows down.

Expected behavior
Fast.

[FEA] Provide a `uninitialized_vector` that doesn't initialize the allocation

Is your feature request related to a problem? Please describe.
rmm::device_vector is an alias for a thrust::device_vector that uses RMM as the allocator. By default, thrust::device_vector will invoke the default constructor for each element in the vector. This is oftentimes unnecessary overhead as it requires invoking a kernel to initialize the elements of the vector.

Describe the solution you'd like
Provide rmm::uninitialized_device_vector that simply allocates the memory of the specified size and sets the .size() appropriately.

See https://github.com/thrust/thrust/blob/master/examples/uninitialized_vector.cu for reference.

[BUG] Error when using RMM: parallel_for failed: out of memory

Environment details (please complete the following information):

Environment location: Docker
Method of RMM install: Docker
- Docker pull: docker pull rapidsai/rapidsai-dev:0.9-cuda10.0-devel-ubuntu16.04-py3.7
- Docker run: docker run --runtime=nvidia --rm -it --net=host -p 8888:8888 -p 8787:8787 -p 8786:8786 -v /home/rapids/notebooks-extended/:/rapids/notebooks/extended/ -v /home/rapids/data/:/home/rapids/data/ rapidsai/rapidsai-dev:0.9-cuda10.0-devel-ubuntu16.04-py3.7

Describe the bug
I am using the Jupyter notebook NYCTaxi-E2E.ipynb and have added the RMM functionality; however, the system crashes at the XGBoost training step. See below the error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: out of memory

Steps/Code to reproduce bug
Added methods:

def initialize_rmm_pool():
    rmm_cfg.use_pool_allocator = True
    return cudf.rmm.initialize()

def initialize_rmm_no_pool():
    rmm_cfg.use_pool_allocator = False
    return cudf.rmm.initialize()

def finalize_rmm():
    return cudf.rmm.finalize()

[BUG] rmm.finalize not releasing memory (pool mode)

Describe the bug
Calling rmm.finalize() after rmm has been initialized in pool mode should/used to free up the memory pool. This no longer happens.

Steps/Code to reproduce bug

import rmm
from rmm import rmm_config as rmm_cfg

rmm_cfg.use_pool_allocator = True
rmm.initialize()

The pool is allocated with half of the GPU memory:

| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   38C    P0    27W /  70W |   7669MiB / 15079MiB |      0%      Default |

rmm.finalize()

GPU memory usage is still half of the GPU memory.

| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   43C    P0    28W /  70W |   7669MiB / 15079MiB |      0%      Default |

Expected behavior
GPU memory should be freed.

Environment details (please complete the following information):

Environment location: Conda
Method of RMM install: Conda install nightly rmm=0.10.*
librmm-0.10.0a191007
rmm-0.10.0a191007
Please run and attach the output of the rmm/print_env.sh script to gather relevant environment details

[FEA] Replace CFFI bindings with Cython

We should transition the current Python bindings API to Cython in order

to support promoting C++ exceptions up to Python exceptions for better error reporting
to match the approach cuDF is taking

[BUG] rmmRealloc deletes the old data and doesn't copy it.

Describe the bug
Realloc should copy the old data into the new allocation.

Steps/Code to reproduce bug
Just realloc an existing array -- data is likely to be gone.

Expected behavior
The data should be there.
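The expected semantics can be sketched on host memory. The function below is a hypothetical illustration using a Python `bytearray`, not rmmRealloc itself: it keeps the first min(old size, new size) bytes, as C `realloc` does. Zero-filling the extra bytes is a choice made here for determinism.

```python
def realloc_preserving(old, new_size):
    """Return a new buffer of `new_size` bytes holding the old contents.

    The first min(len(old), new_size) bytes are preserved; any extra
    bytes are zero in this sketch.
    """
    new = bytearray(new_size)
    keep = min(len(old), new_size)
    new[:keep] = old[:keep]
    return new


data = bytearray(b"abcd")
grown = realloc_preserving(data, 8)
shrunk = realloc_preserving(data, 2)
```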

[FEA] Add array shape and order options for device_array_from_ptr

Is your feature request related to a problem? Please describe.

Currently device_array_from_ptr in librmm_cffi/wrapper.py assumes a 1D array. Some cuML algorithms return higher-dimensional arrays, and we need to wrap them as DeviceNDArray.

Describe the solution you'd like
Add shape and order options to wrap multi dimensional device arrays.

Here is an implementation from cuML SVM:
https://github.com/tfeher/cuml/blob/97d2c00d538a2799db7b42b584b8006aee1633ed/python/cuml/utils/numba_utils.py#L145-L185
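Wrapping a raw pointer as a multi-dimensional array mostly comes down to deriving byte strides from the requested shape and order. The helper below is a standalone sketch of that computation; `strides_for` is a hypothetical name and is not part of librmm_cffi or the linked cuML code.

```python
def strides_for(shape, itemsize, order="C"):
    """Byte strides of a contiguous array with the given shape and order."""
    if order not in ("C", "F"):
        raise ValueError("order must be 'C' or 'F'")
    # Walk the dimensions from fastest-varying to slowest-varying.
    dims = shape if order == "F" else shape[::-1]
    strides = []
    step = itemsize
    for extent in dims:
        strides.append(step)
        step *= extent
    return tuple(strides if order == "F" else strides[::-1])


c_strides = strides_for((3, 4), itemsize=8, order="C")
f_strides = strides_for((3, 4), itemsize=8, order="F")
print(c_strides, f_strides)
```

For a 3x4 array of 8-byte elements, row-major order gives strides of (32, 8) and column-major order gives (8, 24).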

[FEA] Configure and start building Doxygen HTML documentation

Is your feature request related to a problem? Please describe.
RMM should build the HTML Doxygen documentation from its in-line comments.

Describe the solution you'd like
Add a Doxyfile with configuration options necessary to build the RMM Doxygen HTML documentation.

Ideally, the HTML documentation should then be made available on the web without requiring individuals to build it themselves.

Related: rapidsai/cudf#698

[BUG]Wrong order of LogIt class private variables

Describe the bug
If rmm is used with a library that is compiled with -Werror, the compilation fails with the following message:

/home/aatish/workspace/cuhornet/hornet/../externals/rmm/include/rmm/rmm.hpp: In constructor ‘rmm::LogIt::LogIt(rmm::Logger::MemEvent_t, void*, size_t, cudaStream_t, const char*, unsigned int, bool)’:
/home/aatish/workspace/cuhornet/hornet/../externals/rmm/include/rmm/rmm.hpp:101:8: error: ‘rmm::LogIt::usageLogging’ will be initialized after [-Werror=reorder]
   bool usageLogging;
        ^~~~~~~~~~~~
/home/aatish/workspace/cuhornet/hornet/../externals/rmm/include/rmm/rmm.hpp:100:16: error:   ‘unsigned int rmm::LogIt::line’ [-Werror=reorder]
   unsigned int line;
                ^~~~
/home/aatish/workspace/cuhornet/hornet/../externals/rmm/include/rmm/rmm.hpp:59:3: error:   when initialized here [-Werror=reorder]
   LogIt(Logger::MemEvent_t event, void* ptr, size_t size, cudaStream_t stream,
   ^~~~~
cc1plus: all warnings being treated as errors

This can be replicated with branch-0.10

[BUG] Including rmm/rmm.h without cmake, make, make install fails

Describe the bug
If a project includes rmm/rmm.h without doing cmake, make, make install compilation fails with

rmm/include/rmm/detail/memory_manager.hpp:37:30: fatal error: rmm/detail/cnmem.h: No such file or directory

This does not happen if cnmem.h is included in memory_manager.hpp:37 via #include "cnmem.h" instead of #include "rmm/detail/cnmem.h". For projects that have a header-only dependency on rmm, the cmake, make, make install step is not necessary, so it would be desirable for this to work.

Steps/Code to reproduce bug
Compiling

#include <rmm/rmm.h>

int main()
{
        return 0;
}

with

g++ -I$CUDA_HOME/include -Irmm/include rmm_include_bug.cpp

reproduces the error

In file included from rmm/include/rmm/rmm.hpp:28:0,
                 from rmm/include/rmm/rmm.h:5,
                 from rmm_include_bug.cpp:1:
rmm/include/rmm/detail/memory_manager.hpp:37:10: fatal error: rmm/detail/cnmem.h: No such file or directory
 #include "rmm/detail/cnmem.h"
          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.

Expected behavior
Compilation of the above example works.

Environment details:

Environment location: Bare-metal
Method of RMM install: from source
Output of print_env.sh attached as rmm_print_env.log

[FEA] Provide a `device_vector`-like abstraction that can accept streams

Is your feature request related to a problem? Please describe.

rmm::device_vector is currently a simple alias for a thrust::device_vector with rmm_allocator<T> as its allocator template argument. This allocator always uses the null stream for memory allocation, and there is no way for users to modify this behavior.

As seen in rapidsai/cudf#2631, this is problematic.

Describe the solution you'd like

RMM should provide an improved device_vector abstraction. It cannot simply be just a type alias as it requires specifying constructor arguments that thrust::device_vector does not currently support(*). However, we can avoid fully reinventing the wheel by inheriting from a thrust::device_vector and adding the new necessary constructors.

It should be built to also accept a device_memory_resource to support the new memory resource design.

(*)Thrust in CUDA 10.1 added passing allocators as a function argument, however, that does not fully solve this issue. First of all, we cannot assume all users of RMM can use CUDA 10.1. Second of all, this still does not allow simply specifying a stream in a constructor argument.

[FEA] Add release() function to pool resources

Is your feature request related to a problem? Please describe.
Calling rmmFinalize should deallocate the memory pool in any of the pool resources. Currently the only way to free a pool is when the pool resource is destroyed at the end of the application.

Describe the solution you'd like
The pool resources need release methods added to free their memory pools. For example, see std::pmr::synchronized_pool_resource::release().

Additional context
When rmmFinalize is invoked, how do we know what resources to call release on? release is not a member of the device_memory_resource base class, so it's not possible to call get_default_resource()->release(). Do we just always call pool_resource()->release() and managed_pool_resource()->release()? But that will end up constructing those resources only to then release them.

[BUG] MemoryTest assumes exclusive use of GPU

Describe the bug

The unit test MemoryTest.GetInfo tests that the memory available on the GPU goes down after a successful allocation.

https://github.com/rapidsai/rmm/blob/branch-0.7/tests/memory_tests.cpp#L190

It uses the rmmGetInfo API, which in non-pool mode calls cudaMemGetInfo, which queries the entire device's memory usage. This isn't resilient to other processes using the GPU: another process may free a large portion of memory, causing the device's free memory to go up between the two queries and the test to fail:

04:29:21 [ RUN      ] MemoryManagerTest/2.GetInfo
04:29:21 /rapids/cudf/cpp/thirdparty/rmm/tests/memory_tests.cpp:207: Failure
04:29:21 Expected: (freeAfter) <= (freeBefore), actual: 20142030848 vs 20114767872
04:29:21 [  FAILED  ] MemoryManagerTest/2.GetInfo, where TypeParam = ModeType<(rmmAllocationMode_t)2> (3 ms)

I believe this test could be made more resilient to GPU sharing by using the NVML API nvmlDeviceGetComputeRunningProcesses. This allows you to query the GPU memory usage of each process using the GPU. In this way, the test can be refactored to ensure that the memory used by the calling process grows as a result of the allocation.

Expected behavior
Unit tests should be resilient to multiple processes using the GPU.

[BUG] Hang on 1TB allocation test with Managed Pool mode on DGX-1

Describe the bug
When RMM options are set to use pool allocations and use CUDA Managed Memory, the AllocateTB test hangs or runs for a very long time. I believe the cause is that cudaMallocManaged succeeds for a 1TB allocation when there is sufficient virtual system memory, but the subsequent cudaMemPrefetchAsync() runs for a long time.

Steps/Code to reproduce bug
Just run RMM_TEST on a DGX-1.

Expected behavior
It should return quickly, and the test should pass (potentially by correctly detecting an allocation failure, or by not prefetching if the allocation is larger than the GPU memory size).

Environment details (please complete the following information):

Environment location: Bare-metal
Method of RMM install: from source

[FEA] Multi-GPU support (single node)

Related issue: #66

Is your feature request related to a problem? Please describe.

I wish I could use RMM for a multi-GPU node. However, it may not be possible in the current implementation if I enable pool allocation.

 54 // Initialize memory manager state and storage.
 55 rmmError_t rmmInitialize(rmmOptions_t *options)
 56 {
 57     rmm::Manager::getInstance().initialize(options);
 58 
 59     if (rmm::Manager::usePoolAllocator())
 60     {
 61         cnmemDevice_t dev;
 62         RMM_CHECK_CUDA( cudaGetDevice(&(dev.device)) );
 63         // Note: cnmem defaults to half GPU memory
 64         dev.size = rmm::Manager::getOptions().initial_pool_size;
 65         dev.numStreams = 1;
 66         cudaStream_t streams[1]; streams[0] = 0;
 67         dev.streams = streams;
 68         dev.streamSizes = 0;
 69         unsigned flags = rmm::Manager::useManagedMemory() ? CNMEM_FLAGS_MANAGED : 0;
 70         RMM_CHECK_CNMEM( cnmemInit(1, &dev, flags) );
 71     }
 72     return RMM_SUCCESS;
 73 }

rmmInitialize() calls cnmemInit in line 70 with numDevices set to 1.

1071 cnmemStatus_t cnmemInit(int numDevices, const cnmemDevice_t *devices, unsigned flags) {
1072     // Make sure we have at least one device declared.
1073     CNMEM_CHECK_TRUE(numDevices > 0, CNMEM_STATUS_INVALID_ARGUMENT);
1074 
1075     // Find the largest ID of the device.
1076     int maxDevice = 0;
1077     for( int i = 0 ; i < numDevices ; ++i ) {
1078         if( devices[i].device > maxDevice ) {
1079             maxDevice = devices[i].device;
1080         }
1081     }
1082 
1083     // Create the global context.
1084     cnmem::Context::create();
             ...

cnmemInit() calls cnmem::Context::create() in line 1084 and

1024 cnmemStatus_t Context::create() {
1025     sCtx = new Context;
1026     sCtxCheck = CTX_VALID;
1027     return CNMEM_STATUS_SUCCESS;
1028 }

create() resets the Context class's static member variable sCtx to a newly created Context object.

So, if I call rmmInitialize() multiple times (after cudaSetDevice(), once per device), only the last call will have any effect (besides leaking the previously allocated Context objects).

Unlike cnmemInit(), rmmInitialize() does not take a number of devices, so I cannot initialize RMM for multiple devices in a single rmmInitialize() call, either.
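The difference between the current pattern and a per-device alternative can be sketched in a few lines of Python. This is not cnmem or RMM code: `Context`, `init_single` and `init_per_device` are hypothetical names used only to contrast one global slot with a registry keyed by device id.

```python
class Context:
    def __init__(self, device):
        self.device = device


# Pattern described above: one global slot, replaced on every init call.
global_ctx = None


def init_single(device):
    global global_ctx
    global_ctx = Context(device)   # the previous Context is dropped


# Alternative: one context per device, keyed by device id.
contexts = {}


def init_per_device(device):
    if device not in contexts:
        contexts[device] = Context(device)


for dev in (0, 1, 2):
    init_single(dev)
    init_per_device(dev)

print(global_ctx.device, sorted(contexts))
```

After three initialization calls the single slot only knows about the last device, while the registry holds a context for each of them.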

Describe the solution you'd like
Need a mechanism to initialize RMM for multiple devices (in cnmem style or by calling rmmInitialize multiple times after cudaSetDevice).

[FEA] Supporting cuDF Series in device_array_like

Is your feature request related to a problem? Please describe.

I'd like to run the following code.

from librmm_cffi import librmm as rmm
import cudf

s = cudf.Series([0, 1, 2])
a = rmm.device_array_like(s)

Currently this fails with the following error.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-917a74c3463e> in <module>
----> 1 rmm.device_array_like(s)

~/miniconda/envs/rapids9/lib/python3.7/site-packages/librmm_cffi/wrapper.py in device_array_like(self, ary, stream)
    227             ary = ary.reshape(1)
    228 
--> 229         return self.device_array(ary.shape, ary.dtype, ary.strides,
    230                                  stream=stream)
    231 

AttributeError: 'Series' object has no attribute 'strides'

Describe the solution you'd like

It would be great if rmm.device_array_like worked with Series objects. No strong feelings about how that is accomplished.

Describe alternatives you've considered

We could special case handling of Series objects, but this shifts the burden to other libraries to solve this problem.

Alternatively cuDF Series objects could gain a strides attribute. This could be reasonable.

Additional context

This came up when trying to better handle GPU array-like objects in cuML ( rapidsai/cuml#1086 ), which is part of the Grid Search effort.

Edit: More specifically, we tried to use librmm_cffi.librmm.to_device instead of numba.cuda.to_device, but were unable to as Series are not supported.

How much faster is rmm?

Besides, is there any publication for rmm?

[DOC] - RMM will not capture Out-of-Bound segfaults in pool mode

When using RMM in pool mode, out-of-bounds memory accesses can go undetected instead of causing a segfault, because the access may still fall within the bounds of the pre-allocated memory pool.

To avoid this, it is highly recommended to use the non-pool version of RMM while developing code, until correctness has been verified, at which point the pool can be used to improve performance.
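The effect can be imitated on the host with plain Python buffers. This is only an analogy under stated assumptions: a `bytearray` stands in for device memory, and Python's bounds check stands in for the fault that a separate allocation could trigger.

```python
POOL_SIZE = 64
ALLOC_SIZE = 16

# Pool mode: both "allocations" are regions of one pre-allocated buffer.
pool = bytearray(POOL_SIZE)
a_offset, b_offset = 0, ALLOC_SIZE

# Write one element past the end of allocation A. The index is still inside
# the pool, so nothing complains and allocation B is silently corrupted.
pool[a_offset + ALLOC_SIZE] = 0xFF
b_corrupted = pool[b_offset] == 0xFF

# Non-pool mode: each allocation is its own buffer, so the same
# out-of-bounds write is detected.
a = bytearray(ALLOC_SIZE)
try:
    a[ALLOC_SIZE] = 0xFF
    caught = False
except IndexError:
    caught = True

print(b_corrupted, caught)
```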

[FEA] Provide a way to query the initialized state of RMM

Is your feature request related to a problem? Please describe.
There is currently no way to query whether or not RMM has been initialized and if so, what options were used.

Describe the solution you'd like
Provide an API for querying initialization state of RMM, e.g. bool rmm::is_initialized(rmmOptions_t *options), which would return true or false and if true return the options struct filled out.

Describe alternatives you've considered
I have also considered separating the Boolean state and the options in separate queries, but I think allowing nullptr as a valid value for options satisfies both use cases.
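A rough Python rendering of the proposed query, to show how a single call can cover both use cases. `Manager` and its methods are hypothetical stand-ins for the library's global state; the optional `out` argument plays the role of the nullable options pointer.

```python
class Manager:
    """Hypothetical stand-in for the library's global state."""

    def __init__(self):
        self.options = None              # None means "not initialized"

    def initialize(self, options):
        self.options = dict(options)

    def is_initialized(self, out=None):
        """Return the state; copy the options into `out` when one is given."""
        if self.options is None:
            return False
        if out is not None:              # passing None just asks yes/no
            out.update(self.options)
        return True


manager = Manager()
before = manager.is_initialized()
manager.initialize({"use_pool_allocator": True, "initial_pool_size": 0})
seen = {}
after = manager.is_initialized(seen)
```

A module that only needs the Boolean calls `is_initialized()` with no argument, while one that must match the existing configuration passes a container to be filled in.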

Additional context
This is necessary for interoperation of multiple modules / libraries that all need to use RMM without re-initializing it.

Java API support

Is any Java support planned?

[BUG] Segmentation Fault inside `thrust::sort` when using pool allocation

Describe the bug
A segmentation fault occurs inside of the thrust::sort call inside of gdf_order_by of libcudf when RMM pool allocation is used.

Steps/Code to reproduce bug

from librmm_cffi import librmm_config as rmm_cfg
rmm_cfg.use_pool_allocator = True
import cudf
cudf._gdf.rmm_initialize()

df = cudf.DataFrame()
df['a'] = [1,2,3,4,5]
df['b'] = [5,4,3,2,1]
print(df.sort_values(['a']))

Environment details (please complete the following information):
Using branch-0.5 of cuDF 5aa1429f8305cfeb120aaa904d71dabfe785898d

Additional context
As you can see from the stack trace below, the error is occurring inside of a thrust::sort call that is attempting to use RMM to allocate a temporary buffer and using a non-null stream.

#1  0x00007fffe06b05a0 in cuEGLApiInit () from /usr/lib/x86_64-linux-gnu/libcuda.so
#2  0x00007fffe05c8555 in cuMemGetAttribute_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so
#3  0x00007fffe070f83f in cuStreamGetFlags () from /usr/lib/x86_64-linux-gnu/libcuda.so
#4  0x00007fffdf6e9ebf in ?? () from /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2
#5  0x00007fffdf71231f in cudaStreamGetFlags () from /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2
#6  0x00007fffdf964df9 in cnmem::Manager::setStream (this=0x21f65cf0, stream=0x7ffb50000600) at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/thirdparty/cnmem/src/cnmem.cpp:392
#7  0x00007fffdf9643fe in cnmemRegisterStream (stream=0x7ffb50000600) at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/thirdparty/cnmem/src/cnmem.cpp:1166
#8  0x00007fffdf95ef8e in rmm::Manager::registerStream (this=0x7fffdfb6e160 <rmm::Manager::getInstance()::instance>, stream=0x7ffb50000600) at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/src/memory_manager.cpp:94
#9  0x00007fffc84791b1 in rmm::alloc<void> (ptr=0x7fffffffc2b0, size=767, stream=0x7ffb50000600, file=0x7fffc8b7c0e0 <_ZN3rmmL17RMM_USAGE_LOGGINGE+3889> "/home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/include/rmm/thrust_rmm_allocator.h", line=48)
    at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/include/rmm/rmm.hpp:133
#10 0x00007fffc84e6038 in rmm_allocator<char>::allocate (this=0x7fffffffcae0, n=767) at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/include/rmm/thrust_rmm_allocator.h:48
#11 0x00007fffc84e5bdf in thrust::detail::allocator_traits<rmm_allocator<char> >::allocate(rmm_allocator<char>&, unsigned long)::workaround_warnings::allocate(rmm_allocator<char>&, unsigned long) (a=..., n=767)
    at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/allocator/allocator_traits.inl:230
#12 0x00007fffc84e5c05 in thrust::detail::allocator_traits<rmm_allocator<char> >::allocate (a=..., n=767) at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/allocator/allocator_traits.inl:234
#13 0x00007fffc84e4a59 in thrust::detail::get_temporary_buffer<char, rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> (system=..., n=767) at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/execute_with_allocator.h:86
#14 0x00007fffc84e2f76 in thrust::get_temporary_buffer<char, thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> > (exec=..., n=767)
    at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/temporary_buffer.h:62
#15 0x00007fffc84e24b3 in thrust::cuda_cub::get_memory_buffer<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> > (policy=..., n=767)
    at /usr/local/cuda/targets/x86_64-linux/include/thrust/system/cuda/detail/memory_buffer.h:57
#16 0x00007fffc84e1096 in thrust::cuda_cub::__merge_sort::merge_sort<thrust::detail::integral_constant<bool, false>, thrust::detail::integral_constant<bool, false>, thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> >, int*, int*, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*), &(void multi_col_sort<int>(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*)), 2u>, LesserRTTI<int> > > (compare_op=..., items_first=0x0, keys_last=0x7ffb50000614, keys_first=0x7ffb50000600, policy=...)
    at /usr/local/cuda/targets/x86_64-linux/include/thrust/system/cuda/detail/sort.h:1336
#17 thrust::cuda_cub::__smart_sort::smart_sort<thrust::detail::integral_constant<bool, false>, thrust::detail::integral_constant<bool, false>, thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> >, int*, int*, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*), &(void multi_col_sort<int>(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*)), 2u>, LesserRTTI<int> > > (compare_op=..., items_first=0x0, keys_last=0x7ffb50000614, keys_first=0x7ffb50000600, policy=...)
    at /usr/local/cuda/targets/x86_64-linux/include/thrust/system/cuda/detail/sort.h:1576
#18 thrust::cuda_cub::sort<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base>, int*, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*), &(void multi_col_sort<int>(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*)), 2u>, LesserRTTI<int> > > (policy=..., first=0x7ffb50000600, 
    last=0x7ffb50000614, compare_op=...) at /usr/local/cuda/targets/x86_64-linux/include/thrust/system/cuda/detail/sort.h:1653
#19 0x00007fffc84de58b in thrust::sort<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base>, int*, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*), &(void multi_col_sort<int>(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*)), 2u>, LesserRTTI<int> > > (exec=..., 
    first=0x7ffb50000600, last=0x7ffb50000614, comp=...) at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/sort.inl:56
#20 0x00007fffc84ddb1e in multi_col_sort<int> (d_cols=0x7ffb50000a00, d_valids=0x7ffb50000c00, d_col_types=0x7ffb50000e00, d_asc_desc=0x7ffb50000800 "", ncols=1, nrows=5, have_nulls=false, d_indx=0x7ffb50000600, nulls_are_smallest=false, stream=0x0)
    at /home/jhemstad/RAPIDS/repro/cudf/cpp/src/orderby/../sqls/sqls_rtti_comp.h:814
#21 0x00007fffc84daa21 in (anonymous namespace)::multi_col_order_by (cols=0x219e89e0, asc_desc=0x7ffb50000800 "", ncols=1, output_indices=0x21998850, flag_nulls_are_smallest=false) at /home/jhemstad/RAPIDS/repro/cudf/cpp/src/orderby/orderby.cu:57
#22 0x00007fffc84daae9 in gdf_order_by (cols=0x219e89e0, asc_desc=0x7ffb50000800 "", ncols=1, output_indices=0x21998850, flag_nulls_are_smallest=0) at /home/jhemstad/RAPIDS/repro/cudf/cpp/src/orderby/orderby.cu:88

[FEA] smart pointers (unique_ptr and shared_ptr) with custom deleters and device_buffer.

Is your feature request related to a problem? Please describe.
std::unique_ptr and std::shared_ptr support safer programming, but to use them with RMM I need to define custom deleters that invoke RMM_FREE instead of C++'s default delete. Currently, every project using RMM has to define its own, which duplicates work.

Also, cudf currently has device_buffer, which provides a wrapper for an RMM memory block (similar to thrust::device_vector with the RMM allocator, but without the initialization overhead). Other projects could benefit from this as well, and I hope RMM can provide this feature rather than every project reimplementing its own.
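To make the request concrete, here is a minimal host-only sketch of the custom-deleter pattern. The names demo_alloc, demo_free, demo_deleter, and make_demo_buffer are hypothetical and not part of RMM; the stand-in functions use the host heap so the example runs without a GPU. In a real project they would wrap RMM_ALLOC and RMM_FREE.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>

// Number of blocks currently allocated through the stand-in functions.
static int live_blocks = 0;

// Hypothetical stand-in for an RMM allocation call (host heap, not device memory).
void* demo_alloc(std::size_t bytes) {
  void* p = std::malloc(bytes);
  if (p != nullptr) { ++live_blocks; }
  return p;
}

// Hypothetical stand-in for RMM_FREE.
void demo_free(void* p) {
  if (p != nullptr) {
    --live_blocks;
    std::free(p);
  }
}

// Deleter that routes destruction through demo_free instead of delete.
struct demo_deleter {
  void operator()(void* p) const noexcept { demo_free(p); }
};

// Owning pointer to an uninitialized array of T released via demo_deleter.
template <typename T>
using demo_unique_ptr = std::unique_ptr<T[], demo_deleter>;

// Allocate `count` elements of T and hand ownership to a demo_unique_ptr.
template <typename T>
demo_unique_ptr<T> make_demo_buffer(std::size_t count) {
  return demo_unique_ptr<T>(static_cast<T*>(demo_alloc(count * sizeof(T))));
}
```

The same deleter works for shared ownership, for example std::shared_ptr<void>(demo_alloc(64), demo_deleter{}). Making the deleter a stateless struct rather than a function pointer keeps the unique_ptr the same size as a raw pointer.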

[BUG] Remove deprecated #define _BSD_SOURCE from random_allocate benchmark

Describe the bug
random_allocate.cpp includes the line #define _BSD_SOURCE, which is deprecated in newer toolchains (glibc 2.20 and later emit a warning and recommend _DEFAULT_SOURCE instead) and causes compilation with -Werror to fail.

Steps/Code to reproduce bug
Fails to compile on Linux Ubuntu 18.04 L4T kernel

g++ (Ubuntu/Linaro 7.3.0-27ubuntu1~18.04) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Additional context
Trying to build on Jetson Xavier

Note that the fix is easy: remove that line as it seems unnecessary.

[BUG] rmm_allocator::deallocate(): RMM_FREE: initialization error

For Python classes that wrap C++ classes containing memory allocated by RMM, the order in which Python deallocates objects at process exit may cause an RMM_FREE initialization error when using the pool allocator. This occurs because the RMM instance may have been destroyed before the Python object. The error causes the Python process to terminate with a core dump instead of exiting cleanly.

Simple test case to show the error from a Python command-line interpreter:

>>> from librmm_cffi import librmm as rmm
>>> from librmm_cffi import librmm_config as rmm_cfg
>>> rmm_cfg.use_pool_allocator = True 
>>> rmm.initialize()
0
>>> import nvstrings
>>> strs = nvstrings.to_device(["hello"])
>>> exit()

Instead of exiting cleanly, the process terminates with the following exception:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  rmm_allocator::deallocate(): RMM_FREE: cudaErrorInitializationError: initialization error
Aborted

This particular error is thrown in

rmm/include/rmm/thrust_rmm_allocator.h

Line 59 in 919fb5b

    inline void deallocate(pointer ptr, size_t)
    {
      rmmError_t error = RMM_FREE(thrust::raw_pointer_cast(ptr), stream);
  
      if(error != RMM_SUCCESS)
      {
        throw thrust::system_error(error, thrust::cuda_category(), "rmm_allocator::deallocate(): RMM_FREE");
      }
    }

The nvstrings instance points to a C++ NVStrings instance that has a member variable allocated with rmm::device_vector, and this vector is freed after RMM has been deinitialized by the Python process.

Throwing the error in the rmm::device_vector destructor causes the process to terminate (core dump).

I propose checking for this condition (free called after deinit) inside RMM_FREE or rmm::free and ignoring the error, since the memory has already been freed and no corruption will occur.
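The proposed check could look roughly like the following host-only sketch. demo_pool and demo_status are hypothetical names, not RMM types, and a std::vector stands in for the device pool. The only point illustrated is that a release call arriving after finalize reports success instead of an error, because the whole pool has already been returned.

```cpp
#include <cstddef>
#include <vector>

enum class demo_status { success, invalid_pointer };

// Hypothetical pool: one arena, bump allocation, released as a whole on finalize.
class demo_pool {
 public:
  void initialize(std::size_t bytes) {
    arena_.assign(bytes, 0);
    offset_ = 0;
    live_ = 0;
    initialized_ = true;
  }

  // Releases the whole arena, including any blocks that are still outstanding.
  void finalize() {
    arena_.clear();
    live_ = 0;
    initialized_ = false;
  }

  void* allocate(std::size_t bytes) {
    if (!initialized_ || bytes > arena_.size() - offset_) { return nullptr; }
    void* p = arena_.data() + offset_;
    offset_ += bytes;
    ++live_;
    return p;
  }

  // The guard under discussion: a release after finalize is a no-op, not an error.
  demo_status release(void* /*p*/) {
    if (!initialized_) { return demo_status::success; }
    if (live_ == 0) { return demo_status::invalid_pointer; }
    --live_;
    return demo_status::success;
  }

  std::size_t live() const { return live_; }

 private:
  std::vector<unsigned char> arena_;
  std::size_t offset_ = 0;
  std::size_t live_ = 0;
  bool initialized_ = false;
};
```

With a guard like this, a destructor that runs late during interpreter shutdown has nothing to throw, so the process can exit normally.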

[BUG] RMM pytests failing flake8 style checks

Describe the bug
Now that CI is being added, flake8 is finding minor style problems, for example in rmm_tests.py

Steps/Code to reproduce bug
Run flake8 python from the root RMM directory.

Expected behavior
No errors

[FEA] Add a pinned memory resource

Is your feature request related to a problem? Please describe.

There's currently no memory resource for allocating pinned memory (e.g., cudaHostAlloc).

Describe the solution you'd like

There should be a pinned_memory_resource.

Additional context
Inspired by rapidsai/cudf#2872 (comment)

[FEA] Function to get info on what's the largest chunk that can be safely allocated

Is your feature request related to a problem? Please describe.
rmmGetInfo reports the amount of free memory available. However, when the memory regions are fragmented, that number can be misleading: the total may be free without any single block being large enough for a given request.

Describe the solution you'd like
rmmGetInfo should also return a second value that reports the largest contiguous memory region available for allocation.
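As an illustration of the difference between the two numbers, the sketch below computes both from a list of free blocks. The types free_block and demo_mem_info and the function summarize are hypothetical; this is not how RMM or cnmem tracks free memory, only a small model of the quantity being requested.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A free region of the pool, described by its start offset and size in bytes.
struct free_block {
  std::size_t offset;
  std::size_t size;
};

struct demo_mem_info {
  std::size_t total_free;    // sum of all free blocks
  std::size_t largest_free;  // largest contiguous run of free bytes
};

// Sorts the blocks by offset, merges runs of adjacent blocks, and reports
// both the total free bytes and the largest contiguous free run.
demo_mem_info summarize(std::vector<free_block> blocks) {
  std::sort(blocks.begin(), blocks.end(),
            [](const free_block& a, const free_block& b) { return a.offset < b.offset; });

  demo_mem_info info{0, 0};
  std::size_t run_end = 0;   // offset one past the end of the current run
  std::size_t run_size = 0;  // size of the current contiguous run
  for (const auto& b : blocks) {
    info.total_free += b.size;
    if (run_size != 0 && b.offset == run_end) {
      run_size += b.size;  // block touches the previous one: extend the run
    } else {
      run_size = b.size;   // gap before this block: start a new run
    }
    run_end = b.offset + b.size;
    info.largest_free = std::max(info.largest_free, run_size);
  }
  return info;
}
```

For free blocks at offsets 0 (100 bytes), 100 (50 bytes), and 400 (120 bytes), the total is 270 bytes but the largest request that can succeed is 150 bytes.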

Describe alternatives you've considered
There are no real alternatives to this. We have worked around the issue by exposing a 'max-mem' parameter to our users and hoping that they pick a value that will not cause an OOM error down the line. This code can be seen here

Additional context
Since RMM wraps cnmem, maybe this change should be made in cnmem itself. But I've filed this issue in RMM, at least to get the conversation started. Tagging cuML folks, JFYI: @JohnZed @dantegd @cjnolet

RMM Memory Leak after running for a while [QST]

What is your question?
AresDB was integrated with RMM last week, and we have been running it in staging for a while.
We use pooled memory management and the default stream for memory allocation.

After 30 minutes, all the memory of one GPU card appears to be exhausted, and a segmentation fault happens on the next memory allocation.

I don't think there are any memory leaks in our code, since the same code worked previously when we called cudaMalloc/cudaFree directly.

Here is the link to our code
https://github.com/uber/aresdb/blob/master/memutils/memory/rmm_alloc.cu
Thank you so much!

Rename RMM's header memory.h

The name of the RMM header memory.h clashes with STL or C standard header names, creating build issues ('extern "C"' causing mangling issues when linking being one of the harmful consequences). Please consider renaming to a non-standard header name (e.g., rmm_memory.h).

[FEA] Support pooled memory manager for multiple devices

Is your feature request related to a problem? Please describe.
Currently, the memory manager is a singleton class which means all devices share the same pool.

Describe the solution you'd like
Ideally, we could create a memory manager per device or pass a device parameter to the RMMMalloc/RMMFree calls.
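A rough sketch of the first option, one pool per device looked up by device ID, is shown below. demo_device_pool and demo_pool_registry are hypothetical names, and the pool only tracks byte counts rather than real device memory; the sketch illustrates the lookup structure, not RMM's implementation.

```cpp
#include <cstddef>
#include <map>

// Hypothetical per-device pool that only tracks how many bytes are in use.
struct demo_device_pool {
  std::size_t capacity = 0;
  std::size_t used = 0;

  // Reserves `bytes` from this pool; fails if the pool cannot hold them.
  bool allocate(std::size_t bytes) {
    if (bytes > capacity - used) { return false; }
    used += bytes;
    return true;
  }
};

// Hypothetical registry that lazily creates one pool per device ID,
// replacing a single pool shared by all devices.
class demo_pool_registry {
 public:
  explicit demo_pool_registry(std::size_t per_device_capacity)
      : capacity_(per_device_capacity) {}

  demo_device_pool& pool_for(int device) {
    auto it = pools_.find(device);
    if (it == pools_.end()) {
      demo_device_pool pool;
      pool.capacity = capacity_;
      it = pools_.emplace(device, pool).first;
    }
    return it->second;
  }

 private:
  std::size_t capacity_;
  std::map<int, demo_device_pool> pools_;
};
```

A caller would pass the current device ID (for example, the value reported by cudaGetDevice) to pool_for, so that exhausting the pool on one device does not affect allocations on another.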

[DOC] Update README to include conda install

Report incorrect documentation

Hi!

README.md file mentions RMM can only be installed via source code.

Nevertheless, I have found the following conda packages:

https://anaconda.org/rapidsai/rmm

https://anaconda.org/rapidsai/librmm

I am wondering whether the README.md file is up to date. If not, it would be great to update it to mention the conda packages.

Location of incorrect documentation
README.md in master branch.

https://github.com/rapidsai/rmm#install-rmm

Describe the problems or issues found in the documentation
(detailed above)

[BUG] - files placed in the wrong directory as part of RMM installation.

I used make install, and it does copy files to the location I specify or to the default location (which was /usr/local/ on my system). The header files are placed there in include/include/. If I then try to do #include <include/memory.h>, it fails because the other files are not on the include path. It would make sense either to create a directory called rmm within include and make sure that all the header files within rmm also look for files within that directory, or simply to place the files in usr/local/include/ rather than usr/local/include/include.

[BUG] Allocation beyond initial pool size does not reuse freed memory

Describe the bug
After the initial pool of size X has been consumed (and freed), new memory allocations (and frees) cause additional memory to be allocated from the GPU in increments of X.
Memory allocations and frees below the initial pool size X work fine until a new allocation goes above the initial size. This causes RMM to allocate a new chunk of memory on top of the initial pool to accommodate the request. The caller then frees all memory and requests new memory, which again goes over the initial pool size. This causes RMM to allocate yet another chunk of memory. The first extra chunk is not reused, although it has been entirely freed. There is now 3X of GPU memory allocated, even though less than 2X has been requested. Continuing this pattern allocates additional chunks of size X until the GPU's memory is used up.

Steps/Code to reproduce bug
Created simple test to show this problem here:
https://github.com/davidwendt/rmmtest/blob/master/explode.cu
The program allocates increasing amounts of memory two at a time (each pair followed by two frees) and requests no more than 4GB in total at any one time. Again, all memory is freed almost immediately after it is allocated.
With an initial pool size of 4GB, this works well. RMM allocates 4GB and never goes above that.
With an initial pool size set to 2GB, rmm ends up allocating 24GB of GPU memory for the same code.
The intermediate new chunks of memory do not seem to be reused.

Expected behavior
Requesting memory beyond the initial pool size should be able to reuse freed memory in the new chunks.

Environment details (please complete the following information):

Environment location: created on Ubuntu 16.04 desktop
Method of RMM install: built from source -- repo for example above has cmake

rmmenv.txt

[DOC] Document logging

Report incorrect documentation

Location of incorrect documentation
README.md has no explanation of logging or how to use it from C++ or Python

Describe the problems or issues found in the documentation
README.md has no explanation of logging or how to use it from C++ or Python

Suggested fix for documentation
Add explanation and usage examples of logging to README.md

[FEA] Create pip package for rmm

Should wait until after CFFI is migrated to Cython.

[FEA] Provide operator for bit-wise or of allocation modes

Problem: Bit-wise or-ing yields int not enum.

The API documentation implies that the allocation mode enums can be combined with bitwise OR.
Example:

rmmOptions_t rmm_option {
  .allocation_mode = PoolAllocation | CudaManagedMemory,
  .initial_pool_size = free_memory / 2,
  .enable_logging = true };

gives a compiler error: error: a value of type "int" cannot be used to initialize an entity of type "rmmAllocationMode_t"

Suggestions:
a) Implement operator:

inline rmmAllocationMode_t operator|(rmmAllocationMode_t left, rmmAllocationMode_t right) {
   return static_cast<rmmAllocationMode_t>(
     static_cast<int>(left) | static_cast<int>(right));
}

b) Add a member PoolAllocationCudaManagedMemory = 3 to the enum, so that no bitwise OR is needed.

[FEA] Add multi-device and multi-threaded test(s)

Is your feature request related to a problem? Please describe.
RMM supports multi-device allocation and is thread safe, but we don't have tests for either of these.

Describe the solution you'd like
Add tests for allocation on multiple devices. Add multi-threaded single-device and multi-device tests.

[BUG] RMM_FREE of an invalid address is returning RMM_SUCCESS

If I try to rmm free an invalid address (note c_stream is 0):
err = RMM_FREE(reinterpret_cast<void*>(100), c_stream);

It prints this warning:
warning: Cuda API error detected: cudaFree returned (0x11)

but err is RMM_SUCCESS. I expected: RMM_ERROR_CUDA_ERROR.

This is a recent issue in branch-0.10, possibly related to #127.

Here is some repro code, in which the body of the if statement is never entered.

  cudaStream_t c_stream = reinterpret_cast<cudaStream_t>(0);
  rmmError_t err = RMM_FREE(reinterpret_cast<void*>(100), c_stream);
  if (err != RMM_SUCCESS) {
    std::cout <<"not successful free of invalid address" << err<<std::endl;
  }
