owensgroup / mvgpubtree
GPU B-Tree with support for versioning (snapshots).
License: Apache License 2.0
Dear author, thanks for your great work. When I tried to reproduce your experimental results and compile MVGpuBtree, I got the following error after running cmake ..
I tried different gcc/g++ versions (9.3, 7.5, 11.3) and different CMake versions (3.8, 3.16.3), but none of them helped. Checking my CMakeError.log, the cause seems to be /usr/bin/ld: cannot find -lpthread, but I don't know where I should add -lpthread.
My setup is 8 vCPUs, 32 GiB RAM, an NVIDIA V100 GPU, and CUDA 11.4.
Could you confirm why we are facing this issue?
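In case it helps others hitting the same linker error: on most Linux setups, the portable fix is to let CMake resolve the threads library itself rather than passing -lpthread by hand. A minimal sketch of what that looks like in a CMakeLists.txt (the target name my_target is a placeholder, not the project's actual target):

```cmake
# Ask CMake to locate the system threads library; prefer -pthread where available.
set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)

# Link the imported Threads::Threads target instead of a raw -lpthread flag.
# NOTE: replace my_target with the actual executable or library target name.
target_link_libraries(my_target PRIVATE Threads::Threads)
```

Using the imported Threads::Threads target propagates the correct compile and link flags on every platform, which a hard-coded -lpthread does not.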
Hi,
It seems that every key-value pair stored in GpuBTree must have the same memory size for the Key type and the Value type. Why do they have to be the same? And can I use a custom data structure with a comparator as the key type?
Running on GeForce RTX 3080 GPU
CUDA 12.1
C++17
CMake version 3.26.3
Although the g++ compiler is version 9.4.0 on Ubuntu 20.04, CMake sets CXX_STANDARD to 17, so it should be compiling with -std=c++17.
I am getting namespace errors such as the following when running 'make -j' per these instructions:
/MVGpuBTree/include/btree_kernels.hpp(79): error: name followed by "::" must be a class or namespace name
__attribute__((shared)) cg::experimental::block_tile_memory<4, btree::reclaimer_block_size_> block_tile_shemm;
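For context, and this is an assumption about the cause rather than a confirmed fix: newer CUDA toolkits promoted several cooperative-groups features out of the experimental namespace, so cg::experimental may no longer name anything under CUDA 12.1, which would produce exactly this error. A sketch of the spelling that newer cooperative-groups headers expect (the single template parameter and variable name are assumptions to verify against your toolkit's headers):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// In CUDA 12.x, block_tile_memory lives directly in cooperative_groups,
// and newer releases dropped the tile-communication-size parameter.
__shared__ cg::block_tile_memory<btree::reclaimer_block_size_> block_tile_shmem;
```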
Implement a constructor that takes in key-value pairs and bulk-builds the tree.
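One host-side sketch of the bulk-build idea this request implies: given pairs already sorted by key, pack them into fixed-fanout leaves in a single pass, then repeat on the leaves' first keys to form each upper level. The function name and interface here are hypothetical, not part of the actual MVGpuBTree API:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Pack sorted key-value pairs into leaves of at most Fanout entries each.
// Precondition (assumed): sorted_pairs is sorted by key with no duplicates.
template <typename Key, typename Value, int Fanout = 16>
std::vector<std::vector<std::pair<Key, Value>>> pack_leaves(
    const std::vector<std::pair<Key, Value>>& sorted_pairs) {
  std::vector<std::vector<std::pair<Key, Value>>> leaves;
  for (std::size_t i = 0; i < sorted_pairs.size(); i += Fanout) {
    std::size_t end = std::min(sorted_pairs.size(), i + Fanout);
    // Each leaf is one contiguous slice of the sorted input.
    leaves.emplace_back(sorted_pairs.begin() + i, sorted_pairs.begin() + end);
  }
  return leaves;
}
```

Because each level is a single linear pass over sorted input, a real bulk-build constructor could emit every node with one coalesced device write instead of millions of top-down cooperative inserts.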
The benchmarking code now requires 20 GiBs of memory for a complete set of benchmarks. It would be nice to limit the memory requirements for benchmarking on workstations. The code (thrust) will throw an out-of-memory error or some exception when it runs out of memory. To help with limiting the memory requirements, here are the reasons why we need these 20 GiBs:
1. The tree data structure: all memory allocations for the tree are satisfied by either the device_bump_allocator or the SlabAllocator. Both allocators allocate 8 GiBs on construction by default. You may reduce this to 4 GiBs by changing template parameters (SlabAllocator only supports power-of-two allocations), but keep in mind that when inserting keys into the tree, I don't check for out-of-memory errors in device code (the code will either segfault or deadlock). Also keep in mind that device_bump_allocator does not support freeing memory, so benchmarks like the VoB-Tree will not scale.
2. Input:
2.1. Point query benchmarks require only keys and values. For 50 million key-value pairs, the code will need ~0.2 GiBs for each array, so ~0.4 GiBs in total.
2.2. Range query benchmarks require keys and values, plus the range-query lower and upper bounds and an output buffer. For an input size of 50 million pairs, the code will need ~0.8 GiBs for the input arrays plus up to ~9.3 GiBs for the output buffer.
3. Memory reclaimer: allocates ~0.3 GiBs; this can be changed by setting this number.
The maximum will be 8 (tree) + 0.8 (RQ input) + 9.3 (maximum RQ output) + 0.3 (reclaimer) = 18.4 GiBs. Note that I never explicitly free GPU memory, since I use a shared-pointer wrapper around all allocations (see example), which means any allocation is deallocated when it goes out of scope.
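The budget arithmetic above can be sanity-checked in a few lines; 50 million 4-byte elements come to about 0.19 GiB per array, which matches the ~0.2 GiBs quoted:

```cpp
// Per-array footprint in GiB for n 4-byte elements (keys or values).
double array_gib(double n_elements) {
  return n_elements * 4.0 / (1024.0 * 1024.0 * 1024.0);
}

// Worst-case budget from the rounded breakdown above.
double max_budget_gib() {
  return 8.0   // tree allocator pool (default size)
       + 0.8   // range-query input: keys, values, lower and upper bounds
       + 9.3   // maximum range-query output buffer
       + 0.3;  // memory reclaimer
}
```

With the default 8 GiBs tree pool reduced to 4 GiBs via template parameters, the same sum drops to about 14.4 GiBs, which is the main lever for workstation-sized GPUs.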
Occasionally, when calling cooperative_insert from my own kernel, the function never returns.
I am running the code on an RTX 4090 with driver version 525.78.01, and CUDA 11.8.
I was able to reproduce this issue multiple times using the following code:
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <random>
#include <unordered_set>
#include <vector>
#include <cuda_runtime.h>
// plus the MVGpuBTree headers providing gpu_blink_tree

void investigate_tree_deadlock() {
using key_type = uint32_t;
using value_type = uint32_t;
size_t build_size = size_t{1} << 25;
key_type min_usable_key = 1;
key_type max_usable_key = std::numeric_limits<key_type>::max() - 2;
std::mt19937_64 gen(42);
std::uniform_int_distribution<key_type> key_dist(min_usable_key, max_usable_key);
std::vector<key_type> build_keys(build_size);
std::unordered_set<key_type> build_keys_set;
while (build_keys_set.size() < build_size) {
key_type key = key_dist(gen);
build_keys_set.insert(key);
}
std::copy(build_keys_set.begin(), build_keys_set.end(), build_keys.begin());
std::sort(build_keys.begin(), build_keys.end());
key_type* keys_on_gpu;
cudaMalloc(&keys_on_gpu, build_size * sizeof(key_type));
cudaMemcpy(keys_on_gpu, build_keys.data(), build_size * sizeof(key_type), cudaMemcpyHostToDevice);
for (size_t i = 0; i < 10000; ++i) {
std::cout << "round " << i << " starting" << std::endl;
gpu_blink_tree<key_type, value_type, 16> tree;
modified_insert_kernel<<<(build_size + 511) / 512, 512>>>(keys_on_gpu, build_size, tree);
std::cout << "tree uses " << tree.compute_memory_usage() << " GB" << std::endl;
std::cout << "round " << i << " done" << std::endl;
}
cudaFree(keys_on_gpu);
}
I ran the snippet twice and observed the issue in iterations 61 and 1699, respectively. In both cases, I had to terminate the process forcefully using CTRL+C. My modified_insert_kernel is almost identical to the default insertion kernel; it looks like this:
template <typename key_type, typename size_type, typename btree>
__global__ void modified_insert_kernel(
const key_type* keys,
const size_type keys_count,
btree tree
) {
auto thread_id = threadIdx.x + blockIdx.x * blockDim.x;
auto block = cg::this_thread_block();
auto tile = cg::tiled_partition<btree::branching_factor>(block);
if ((thread_id - tile.thread_rank()) >= keys_count) { return; }
auto key = btree::invalid_key;
auto value = btree::invalid_value;
bool to_insert = false;
if (thread_id < keys_count) {
key = keys[thread_id];
value = thread_id;
to_insert = true;
}
using allocator_type = typename btree::device_allocator_context_type;
allocator_type allocator{tree.allocator_, tile};
size_type num_inserted = 1;
auto work_queue = tile.ballot(to_insert);
while (work_queue) {
auto cur_rank = __ffs(work_queue) - 1;
auto cur_key = tile.shfl(key, cur_rank);
auto cur_value = tile.shfl(value, cur_rank);
tree.cooperative_insert(cur_key, cur_value, tile, allocator);
if (tile.thread_rank() == cur_rank) { to_insert = false; }
num_inserted++;
work_queue = tile.ballot(to_insert);
}
}