rpmalloc's Introduction

rpmalloc - General Purpose Memory Allocator

This library provides a public domain cross platform lock free thread caching 16-byte aligned memory allocator implemented in C. The latest source code is always available at https://github.com/mjansson/rpmalloc

Created by Mattias Jansson (@maniccoder) - Discord server for discussions at https://discord.gg/M8BwTQrt6c

Platforms currently supported:

  • Windows
  • Linux
  • MacOS
  • iOS
  • Android

The code should be easily portable to any platform with atomic operations and an mmap-style virtual memory management API. The API used to map/unmap memory pages can be configured at runtime with a custom implementation and mapping granularity/size.

This library is put in the public domain; you can redistribute it and/or modify it without any restrictions. Or, if you choose, you can use it under the MIT license.

Performance

We believe rpmalloc is faster than most popular memory allocators like tcmalloc, hoard, ptmalloc3 and others without causing extra allocated memory overhead in the thread caches compared to these allocators. We also believe the implementation to be easier to read and modify compared to these allocators, as it is a single source file of ~2200 lines of C code. All allocations have a natural 16-byte alignment.

Contained in a parallel repository is a benchmark utility that performs interleaved unaligned allocations and deallocations (both in-thread and cross-thread) in multiple threads. It measures the number of memory operations performed per CPU second, as well as memory overhead, by comparing the virtual memory mapped with the number of bytes requested in allocation calls. The number of threads, cross-thread deallocation rate and allocation size limits are configured by command line arguments.

https://github.com/mjansson/rpmalloc-benchmark

Below is an example performance comparison chart of rpmalloc and other popular allocator implementations, with default configurations used.

Ubuntu 16.10, random [16, 8000] bytes, 8 cores

The benchmark producing these numbers was run on an Ubuntu 16.10 machine with 8 logical cores (4 physical, HT). The actual numbers should not be interpreted as absolute performance figures, but rather as relative comparisons between the different allocators. For additional benchmark results, see the BENCHMARKS file.

Configuration of the thread and global caches can be important depending on your use pattern. See CACHE for a case study and some comments/guidelines.

Required functions

Before calling any other function in the API, you MUST call the initialization function, either rpmalloc_initialize or rpmalloc_initialize_config; otherwise you will get undefined behaviour when calling other rpmalloc entry points.

Before terminating your use of the allocator, you SHOULD call rpmalloc_finalize in order to release caches and unmap virtual memory, as well as prepare the allocator for global scope cleanup at process exit or dynamic library unload depending on your use case.

Using

The easiest way to use the library is simply adding rpmalloc.[h|c] to your project and compiling them along with your sources. The allocator is completely self-contained; you are not required to call the init/fini functions from your own code, but can do so in order to initialize and finalize the allocator in specific places or provide your own hooks and/or configuration:

rpmalloc_initialize : Call at process start to initialize the allocator

rpmalloc_initialize_config : Optional entry point to call at process start to initialize the allocator with a custom memory mapping backend, memory page size and mapping granularity.

rpmalloc_finalize: Call at process exit to finalize the allocator

rpmalloc_thread_initialize: Call at each thread start to initialize the thread local data for the allocator

rpmalloc_thread_finalize: Call at each thread exit to finalize and release thread cache back to global cache

rpmalloc_config: Get the current runtime configuration of the allocator

Then simply use rpmalloc/rpfree and the other malloc style replacement functions. Remember that all allocations are 16-byte aligned, so there is no need to call the explicit rpmemalign/rpaligned_alloc/rpposix_memalign functions unless you need greater alignment; they are simply wrappers to make it easier to replace in existing code.

If you wish to override the standard library malloc family of functions and have automatic initialization/finalization of process and threads, define ENABLE_OVERRIDE to non-zero (default is 1), which will include the malloc.c file in the compilation of rpmalloc.c; then rebuild the library or the project where you added the rpmalloc source. If you compile rpmalloc as a separate library you must make the linker pick up the override symbols from the library by referencing at least one of them. The easiest way is to include rpmalloc.h in at least one source file and call rpmalloc_linker_reference somewhere - it is a dummy empty function. For C++ overrides you have to #include <rpnew.h> in at least one source file. The list of libc entry points replaced may not be complete, so use the libc/stdc++ replacement only as a convenience for testing the library on an existing code base, not as a final solution.

For explicit first class heaps, see the rpmalloc_heap_* API in the first class heaps section, which requires RPMALLOC_FIRST_CLASS_HEAPS to be defined to 1 - default is 0, as it imposes a very slight performance hit in the deallocation path from an extra conditional instruction.

Building

To compile as a static library run the configure python script which generates a Ninja build script, then build using ninja. The ninja build produces both a static and a dynamic library named rpmalloc.

By default the dynamic library can be used with LD_PRELOAD/DYLD_INSERT_LIBRARIES to inject in a preexisting binary, replacing any malloc/free family of function calls (when ENABLE_OVERRIDE is defined to 1). This is only implemented for Linux and macOS targets. The list of libc entry points replaced may not be complete, use preloading as a convenience for testing the library on an existing binary, not a final solution.

The latest stable release is available in the master branch. For latest development code, use the develop branch.

Configuration options

Detailed statistics are available if ENABLE_STATISTICS is defined to 1 (default is 0, or disabled), either on the compile command line or by setting the value in rpmalloc.c. This causes a slight runtime overhead to collect statistics for each memory operation, and adds 4 bytes of overhead per allocation to track sizes.

Integer safety checks on all calls are enabled if ENABLE_VALIDATE_ARGS is defined to 1 (default is 0, or disabled), either on compile command line or by setting the value in rpmalloc.c. If enabled, size arguments to the global entry points are verified not to cause integer overflows in calculations.

Asserts are enabled if ENABLE_ASSERTS is defined to 1 (default is 0, or disabled), either on compile command line or by setting the value in rpmalloc.c.

To include malloc.c in compilation and provide overrides of standard library malloc entry points define ENABLE_OVERRIDE to 1 (this is the default).

To enable support for first class heaps, define RPMALLOC_FIRST_CLASS_HEAPS to 1 (default is 0, as noted above).
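As a build-configuration sketch, the options above can be set on the compiler command line (paths and flags here are illustrative, not taken from the project's build scripts):

```shell
# Illustrative only: enable statistics, argument validation and asserts
# when compiling rpmalloc.c as part of your own build.
cc -c rpmalloc/rpmalloc.c -O2 \
   -DENABLE_STATISTICS=1 \
   -DENABLE_VALIDATE_ARGS=1 \
   -DENABLE_ASSERTS=1 \
   -o rpmalloc.o
```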

Huge pages

The allocator has support for huge/large pages on Windows, Linux and MacOS. To enable it, pass a non-zero value in the config value enable_huge_pages when initializing the allocator with rpmalloc_initialize_config. If the system does not support huge pages it will be automatically disabled. You can query the status by looking at enable_huge_pages in the config returned from a call to rpmalloc_config after initialization is done.
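A configuration sketch for the above (field and function names as described in this README; verify against rpmalloc.h for your version):

```c
#include <rpmalloc.h>
#include <string.h>

int main(void) {
    rpmalloc_config_t config;
    memset(&config, 0, sizeof(config));
    config.enable_huge_pages = 1;   /* request huge/large pages if the OS supports them */
    rpmalloc_initialize_config(&config);

    /* Query whether the request actually took effect; rpmalloc disables it
       automatically when the system lacks huge page support. */
    const rpmalloc_config_t* active = rpmalloc_config();
    int huge_pages_active = active->enable_huge_pages;
    (void)huge_pages_active;

    rpmalloc_finalize();
    return 0;
}
```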

Quick overview

The allocator uses separate heaps for each thread and partitions memory blocks according to a preconfigured set of size classes, up to 8MiB. Huge blocks above this limit are mapped and unmapped directly. Blocks are allocated from a page of multiple blocks, all of the same size class. Each page is one of three page types, small, medium or large. Each page belongs to an even larger span of pages, each of the same page type.

Implementation details

The allocator is based on a fixed page alignment per page type, and 16-byte block alignment within the page. On Windows the page alignment is automatically guaranteed up to 64KiB by the VirtualAlloc granularity, and on mmap systems it is achieved by oversizing the mapping and aligning the returned virtual memory address to the required boundaries. Thanks to the fixed alignment, the free operation can locate the header of the memory page without a table lookup, by simply masking out the low bits of the address (for 64KiB, the low 16 bits).

Memory blocks are divided between the three page types: small pages hold blocks of [0, 4096] bytes, medium pages (4096, 262144] bytes, and large pages (262144, 8388608] bytes. The three page types are further divided into block size classes, where small block sizes have a fixed granularity and interval of 16 bytes, and medium and large blocks have a variable interval to limit overhead to a fixed ratio.

Each span belongs to a single heap, which owns all contained blocks, whether allocated or free. To avoid locks, each span is completely owned by the allocating thread, and all cross-thread deallocations are deferred to the owner thread through a separate free list per span.

Memory mapping

By default the allocator uses OS APIs to map virtual memory pages as needed, either VirtualAlloc on Windows or mmap on POSIX systems. If you want to use your own custom memory mapping provider, use rpmalloc_initialize_config and pass function pointers to map and unmap virtual memory. These functions should reserve and free the requested number of bytes.

The returned memory address from the memory map function MUST be aligned to the system memory page size or the configured page size given during initialization using rpmalloc_initialize_config. Use rpmalloc_config to find the configured system page size for required alignment.

Memory mapping requests are always done in multiples of the memory page size. You can specify a custom page size when initializing rpmalloc with rpmalloc_initialize_config, or pass 0 to let rpmalloc determine the system memory page size using OS APIs. The page size MUST be a power of two.

On macOS and iOS mmap requests are tagged with tag 240 for easy identification with the vmmap tool.

Memory fragmentation

There is no memory fragmentation by the allocator in the sense that it will not leave unallocated and unusable "holes" in the memory pages by calls to allocate and free blocks of different sizes. This is because the memory pages allocated for each size class are split up into perfectly aligned blocks which are not reused for a request of a different size. A block freed by a call to rpfree is always immediately available for an allocation request within the same size class.

However, there is memory fragmentation in the sense that a request for x bytes followed by a request for y bytes, where x and y differ by at least one size class, will return blocks that are at least one memory page apart in virtual address space. Only blocks of the same size class will potentially be within the same memory page span.

rpmalloc keeps an "active span" and free list for each size class, so back-to-back allocations will most likely be served from within the same span of memory pages (unless the span runs out of free blocks). The implementation will also use any "holes" in memory pages in semi-filled spans before using a completely free span.

First class heaps

rpmalloc provides a first class heap type with explicit heap control API. Heaps are maintained with calls to rpmalloc_heap_acquire and rpmalloc_heap_release and allocations/frees are done with rpmalloc_heap_alloc and rpmalloc_heap_free. See the rpmalloc.h documentation for the full list of functions in the heap API. The main use case of explicit heap control is to scope allocations in a heap and release everything with a single call to rpmalloc_heap_free_all without having to maintain ownership of memory blocks. Note that the heap API is not thread-safe, the caller must make sure that each heap is only used in a single thread at any given time.

Producer-consumer scenario

Compared to some other allocators, rpmalloc does not suffer as much from a producer-consumer thread scenario, where one thread allocates memory blocks and another thread frees them. In some allocators the freed blocks need to traverse both the thread cache of the thread doing the free operations and the global cache before being reused in the allocating thread. In rpmalloc the freed blocks are reused as soon as the allocating thread needs to get new spans from the thread cache. This enables faster release of completely freed memory pages, as blocks in a memory page are not aliased between different owning threads.

Best case scenarios

Threads that keep ownership of allocated memory blocks within the thread and free the blocks from the same thread will have optimal performance.

Threads with allocation patterns where the difference between the high and low water marks of memory usage fits within the thread cache thresholds of the allocator will never touch the global cache except during thread init/fini, and will have optimal performance. The cache limits can be tweaked on a per-size-class basis.

Worst case scenarios

Since each thread cache maps spans of memory pages per size class, a thread that allocates just a few blocks of each size class (16, 32, ...) for many size classes will never fill each bucket, and thus map a lot of memory pages while only using a small fraction of the mapped memory. However, the wasted memory will always be less than 4KiB (or the configured memory page size) per size class as each span is initialized one memory page at a time. The cache for free spans will be reused by all size classes.

Threads that perform a lot of allocations and deallocations in a pattern that has a large difference between high and low water marks, where that difference exceeds the thread cache size, will put a lot of contention on the global cache: the thread cache will overflow at each low water mark, causing pages to be released to the global cache, then underflow at the high water mark, causing pages to be re-acquired from the global cache. This can be mitigated by changing the MAX_SPAN_CACHE_DIVISOR define in the source code (at the cost of higher average memory overhead).

Caveats

VirtualAlloc has an internal granularity of 64KiB. However, mmap lacks this granularity control, and the implementation instead oversizes the memory mapping with configured span size to be able to always return a memory area with the required alignment. Since the extra memory pages are never touched this will not result in extra committed physical memory pages, but rather only increase virtual memory address space.

All entry points assume the passed values are valid; for example, passing an invalid pointer to free would most likely result in a segmentation fault. The library does not try to guard against errors!

Other languages

Johan Andersson at Embark has created a Rust wrapper available at rpmalloc-rs

Stas Denisov has created a C# wrapper available at Rpmalloc-CSharp

License

This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or distribute this software, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means.

In jurisdictions that recognize copyright laws, the author or authors of this software dedicate any and all copyright interest in the software to the public domain. We make this dedication for the benefit of the public at large and to the detriment of our heirs and successors. We intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights to this software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to http://unlicense.org

You can also use this software under the MIT license if public domain is not recognized in your country

The MIT License (MIT)

Copyright (c) 2017 Mattias Jansson

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

rpmalloc's People

Contributors

andoma, dcrooks-ak, devnexen, digit-google, domclark, edubart, fujiz, huangyihe, icecream95, inkryption, jneuhauser, jprjr, jspohr, kallisti5, mjansson, msathieu, mxmlnkn, nxrighthere, pponnuvel, rgal, taigacon, timgates42, tru, waddlesplash


rpmalloc's Issues

Improve documentation

Improve documentation

  • General design principles, list similar allocators

  • Document allocation patterns that are handled poorly

  • Discussion about fragmentation

Store subspan remainder counter as atomic

Store the remaining subspan count as an atomic counter in the master span to avoid having to defer unmapping subspans to the owner thread. The subspan remainder is always decreasing, so the atomic counter could be updated from any thread, and the master span thus unmapped by any thread when the counter reaches zero.

Very high memory usage under shbench

As seen here: https://github.com/ezrosent/allocators-rs/blob/master/info/elfmalloc-performance.md#shbench

Also of note is the memory consumption of ptmalloc2, llalloc and rpmalloc: something about the varying lifetimes of objects seems to trip these allocators up here, as they use over an order of magnitude more memory than jemalloc and elfmalloc.

Indeed, it looks like rpmalloc is using over 1GB of memory, vs. < 50MB for jemalloc and friends.

Other than that, these benchmarks seem to show rpmalloc having a very good performance/overhead tradeoff. Nice!

rpmalloc crashes

I created a dll for windows which contains rpmalloc.c and exports the rpmalloc functions.
The global operators new and delete were overridden and contain the following code:

//********************************************************
void* operator new(size_t size)
{
    return rpmalloc(size);
}

//********************************************************
void operator delete(void* const p) noexcept
{
    rpfree(p);
}

//********************************************************
void* operator new[](size_t size)
{
    return rpmalloc(size);
}

//********************************************************
void operator delete[](void* const p)
{
    rpfree(p);
}

When I start our application it runs for about 10s and then crashes.
I attached a picture which shows the crash and the content of the variables in rpmalloc.

rpmalloc_crash

Fails to compile using clang on Windows

While using static_assert in C seems to be fine for CL.EXE, clang complains.
I fixed it by replacing the static_assert with an emulation:

--- a/rpmalloc.c     2018-05-04 06:29:23.000000000 0200
+++ b/rpmalloc.c  2018-05-04 06:29:23.000000000 0200
@@ -85,7 +85,12 @@
 /// Platform and arch specifics
 #ifdef _MSC_VER
 #  define FORCEINLINE __forceinline
-#  define _Static_assert static_assert
+#  define TOKEN_PASTE2(a, b) a##b
+#  define TOKEN_PASTE(a, b) TOKEN_PASTE2(a, b)
+#  define _Static_assert(cond, msg) \
+    typedef struct { \
+        int static_assert_failed : !!(cond); \
+    } TOKEN_PASTE(static_assert, __COUNTER__)
 #  if ENABLE_VALIDATE_ARGS
 #    include <Intsafe.h>
 #  endif

Improve cross thread use cases

Improve cross thread use cases, either by redesigning the deferred deallocation scheme or introducing an optional opportunistic locking scheme for freeing blocks in other heaps.

Random access violations in custom test on windows

Hello,

I've been trying to do some testing with your library and have run into issues while running the following:

std::size_t nAllocations = 1000000;
TEST(rpmalloc_test_suite, cross_thread_bench)
{
    rpmalloc_initialize();
    using namespace stk::thread;
    std::size_t nOSThreads = std::thread::hardware_concurrency();
    work_stealing_thread_pool<moodycamel_concurrent_queue_traits> pool(rpmalloc_thread_initialize, rpmalloc_thread_finalize, nOSThreads);
    using future_t = boost::future<void>;
    std::vector<future_t> futures;
    futures.reserve(nAllocations);
    {
        GEOMETRIX_MEASURE_SCOPE_TIME("rpmalloc_cross_thread_32_bytes");
        for (size_t i = 0; i < nAllocations; ++i)
        {
            auto pAlloc = rpmalloc(32);
            futures.emplace_back(pool.send(i++%pool.number_threads(),
                [pAlloc]()
                {
                    rpfree(pAlloc);
                }));
        }

        boost::for_each(futures, [](const future_t& f) { f.wait(); });
    }
    rpmalloc_finalize();
}

Essentially, allocate the block in one thread and deallocate in another. I get random access violations though. Does my usage seem correct?

Support breaking up large super spans

When freeing large allocations >64KiB, the super span should be broken up into 64KiB spans and stored in cache.

The first span should have a master flag and an atomic counter of number of still used sibling spans.

The sibling should have a sibling flag and an offset in number of spans to master. When a sibling is unmapped, the counter is decremented.

When the count reaches 0 (all spans are "unmapped"), the entire memory range can be unmapped.

segfault in python 2.7 shutdown

I get a segfault after running this script:
https://github.com/pixelb/ps_mem

Program received signal SIGSEGV, Segmentation fault.
rpmalloc_finalize () at rpmalloc/rpmalloc.c:1345
1345                            _memory_deallocate_deferred(heap, 0);
(gdb) bt
#0  rpmalloc_finalize () at rpmalloc/rpmalloc.c:1345
#1  0x00007ffff7de9a3a in ?? () from /lib64/ld-linux-x86-64.so.2
#2  0x00007ffff7657940 in ?? () from /lib64/libc.so.6
#3  0x00007ffff765799a in exit () from /lib64/libc.so.6
#4  0x00007ffff76421e8 in __libc_start_main () from /lib64/libc.so.6
#5  0x000000000040063a in _start ()
(gdb) 

Test cases

Implement a decent set of test cases to complement the benchmark brute force test

rpmalloc.c doesn't compile as C++

Nearly all the errors are related to implicit-casting void* to some other type. Would you accept a pull request fixing these by adding explicit casts? If so, I'll be happy to do the work.

Problem when building rpmalloc in 32bit on linux

Hello,
On Linux, the default build architecture is x86-64 and that works fine. I tried to build for x86 by executing:
./configure.py -a x86
ninja
I'm getting the following error (there are a lot of them):

[7/44] CC rpmalloc/rpmalloc.c
FAILED: build/ninja/linux/debug/x86/rpmalloc-7c2f09b/rpmalloc-65d008a.o
clang -MMD -MT build/ninja/linux/debug/x86/rpmalloc-7c2f09b/rpmalloc-65d008a.o -MF build/ninja/linux/debug/x86/rpmalloc-7c2f09b/rpmalloc-65d008a.o.d -I. -DRPMALLOC_COMPILE=1 -funit-at-a-time -fstrict-aliasing -fno-math-errno -ffinite-math-only -funsafe-math-optimizations -fno-trapping-math -ffast-math -D_GNU_SOURCE=1 -W -Werror -pedantic -Wall -Weverything -Wno-padded -Wno-documentation-unknown-command -std=c11 -m32 -DBUILD_DEBUG=1 -g -c rpmalloc/rpmalloc.c -o build/ninja/linux/debug/x86/rpmalloc-7c2f09b/rpmalloc-65d008a.o
rpmalloc/rpmalloc.c:1278:31: error: implicit conversion changes signedness: 'ptrdiff_t' (aka 'int') to 'unsigned int' [-Werror,-Wsign-conversion]
return size_class->size - (pointer_diff(p, blocks_start) % size_class->size);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~
rpmalloc/rpmalloc.c:201:37: note: expanded from macro 'pointer_diff'
#define pointer_diff(first, second) (ptrdiff_t)((const char*)(first) - (const char*)(second))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

any help please?

ABA problem with orphaned heaps

Hi,
I believe there is a bug in the handling of the orphaned heaps linked list. It suffers from a race when adding/removing heaps.
Imagine the initial state in Thread 1 is A->B and we want to use heap A: we load it, and next_heap will be B. Immediately before the CAS, the thread is interrupted.
Then other threads, adding and removing heaps, reach a state of A->N->B.
Now Thread 1 resumes and the CAS succeeds, because A is still the head, but after the CAS the list's head is B and N is gone.
Unfortunately this way we could also end up reusing a heap that is still in use.
The ABA problem in linked lists is unfortunately not trivial to fix. In your global span cache you use a lock (the SPAN_LIST_LOCK_TOKEN); I guess using a lock for the orphaned heaps would be OK as well - handling them is relatively rare, occurring only when threads are initialized/finalized.
Please correct me if I'm wrong somewhere.

rpMalloc crashing in iOS w/ a 16kb page size

I'm trying this library out in an iOS ARKit-based project where I can assume a 16KB page size, however setting the page size to that value in rpmalloc.c causes a crash after a few allocations:

988         if (*cache)
989 ->          span->data.list_size = (*cache)->data.list_size + 1;
990         else
991             span->data.list_size = 1;
992         *cache = span;

I'm not familiar with this code, yet, but after a little bit of debugging the crash seems to possibly be related to SPAN_CLASS_COUNT being equal to 1 for a 16KB page, and the code not being able to handle that case.

If I increase the value of SPAN_ADDRESS_GRANULARITY proportionally, to (4 * 65536), it no longer crashes. Is that actually an appropriate fix?

Cheers,
James

allow rpaligned_realloc to accept oldsize == 0

_memory_reallocate allows zero to be passed for the parameter 'oldsize', however this convenience is essentially hidden by rpaligned_realloc() when alignment > 16.

I am tentatively using this change locally in order to avoid a hash lookup on the original allocation size. If it looks correct, I'm happy to create a pull request, or feel free to just make the change if that's easier.

		if (!(flags & RPMALLOC_NO_PRESERVE)) {
+			if (oldsize == 0)
+				oldsize = _memory_usable_size(ptr);
			memcpy(block, ptr, oldsize < size ? oldsize : size);
		}

Dynamic loading broken on Mac

After building and invoking the following simple C program:

#include<stdlib.h>
#include<stdio.h>

int main() {
	printf("%p\n", malloc(8));
}

as DYLD_FORCE_FLAT_NAMESPACE=1 DYLD_INSERT_LIBRARIES=./bin/macosx/debug/librpmalloc.dylib ./a.out

The program crashes with Abort trap: 6. Using ulimit -c unlimited and running again, I got a core file. Using lldb --core <core file>, I got the following backtrace:

* thread #1: tid = 0x0000, 0x00007fff9364bf06 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff9364bf06 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff9868f4ec libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff86d856df libsystem_c.dylib`abort + 129
    frame #3: 0x00007fff925a53e0 libdyld.dylib`_tlv_bootstrap + 9
    frame #4: 0x000000010645b991 librpmalloc.dylib`rpmalloc_thread_initialize + 17 at rpmalloc.c:1425
    frame #5: 0x000000010645b7e7 librpmalloc.dylib`rpmalloc_initialize + 263 at rpmalloc.c:1336
    frame #6: 0x000000010645e817 librpmalloc.dylib`initializer + 71 at malloc.c:103
    frame #7: 0x000000010645e981 librpmalloc.dylib`malloc(size=8) + 17 at malloc.c:211
    frame #8: 0x00007fff6a0971be dyld`operator new(unsigned long) + 30
    frame #9: 0x00007fff6a0846c5 dyld`std::__1::vector<char const* (*)(dyld_image_states, unsigned int, dyld_image_info const*), std::__1::allocator<char const* (*)(dyld_image_states, unsigned int, dyld_image_info const*)> >::insert(std::__1::__wrap_iter<char const* (* const*)(dyld_image_states, unsigned int, dyld_image_info const*)>, char const* (* const&)(dyld_image_states, unsigned int, dyld_image_info const*)) + 343
    frame #10: 0x00007fff6a07f507 dyld`dyld::registerImageStateBatchChangeHandler(dyld_image_states, char const* (*)(dyld_image_states, unsigned int, dyld_image_info const*)) + 147
    frame #11: 0x00007fff925a489e libdyld.dylib`dyld_register_image_state_change_handler + 76
    frame #12: 0x00007fff925a465f libdyld.dylib`_dyld_initializer + 47
    frame #13: 0x00007fff9757c9fd libSystem.B.dylib`libSystem_initializer + 116
    frame #14: 0x00007fff6a08e10b dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 265
    frame #15: 0x00007fff6a08e284 dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 40
    frame #16: 0x00007fff6a08a8bd dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 305
    frame #17: 0x00007fff6a08a852 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 198
    frame #18: 0x00007fff6a08a852 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 198
    frame #19: 0x00007fff6a08a852 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 198
    frame #20: 0x00007fff6a08a852 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 198
    frame #21: 0x00007fff6a08a852 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 198
    frame #22: 0x00007fff6a08a852 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 198
    frame #23: 0x00007fff6a08a743 dyld`ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 127
    frame #24: 0x00007fff6a08a9b3 dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) + 75
    frame #25: 0x00007fff6a07d0ab dyld`dyld::initializeMainExecutable() + 138
    frame #26: 0x00007fff6a080d98 dyld`dyld::_main(macho_header const*, unsigned long, int, char const**, char const**, char const**, unsigned long*) + 3596
    frame #27: 0x00007fff6a07c276 dyld`dyldbootstrap::start(macho_header const*, int, char const**, long, macho_header const*, unsigned long*) + 512
    frame #28: 0x00007fff6a07c036 dyld`_dyld_start + 54

The problem seems to come down to the fact that when a .dylib is dynamically loaded on macOS, the _tlv_bootstrap function is implemented as a simple call to abort().

question on thread api

When I include new.cc in my project, I don't have to call rpmalloc_thread_initialize() for new or new[] thanks to the initializer() function. But I still have to call rpmalloc_thread_finalize() myself when I close/destroy a thread, right?

Build fails with Clang 8

I updated from Clang 7 to Clang 8, which causes the build to fail. Using Clang 8 results in additional warnings (see below), and it seems that -Werror is set, causing the warnings to be fatal.

[24/48] CC test/thread.c
FAILED: build/ninja/linux/debug/x86-64/test-57ec084/thread-35aa063.o 
clang -MMD -MT build/ninja/linux/debug/x86-64/test-57ec084/thread-35aa063.o -MF build/ninja/linux/debug/x86-64/test-57ec084/thread-35aa063.o.d -I. -Irpmalloc -Itest -DRPMALLOC_COMPILE=1 -funit-at-a-time -fstrict-aliasing -fno-math-errno -ffinite-math-only -funsafe-math-optimizations -fno-trapping-math -ffast-math -D_GNU_SOURCE=1 -W -Werror -pedantic -Wall -Weverything -Wno-padded -Wno-documentation-unknown-command -std=c11 -m64 -DBUILD_DEBUG=1 -g -DENABLE_ASSERTS=1 -DENABLE_STATISTICS=1 -c test/thread.c -o build/ninja/linux/debug/x86-64/test-57ec084/thread-35aa063.o
test/thread.c:95:2: error: implicit use of sequentially-consistent atomic may incur stronger memory barriers than necessary [-Werror,-Watomic-implicit-seq-cst]
        __sync_synchronize();
        ^~~~~~~~~~~~~~~~~~
test/thread.c:104:2: error: implicit use of sequentially-consistent atomic may incur stronger memory barriers than necessary [-Werror,-Watomic-implicit-seq-cst]
        __sync_synchronize();
        ^~~~~~~~~~~~~~~~~~
2 errors generated.
[25/48] CC test/thread.c
FAILED: build/ninja/linux/release/x86-64/test-57ec084/thread-35aa063.o 
clang -MMD -MT build/ninja/linux/release/x86-64/test-57ec084/thread-35aa063.o -MF build/ninja/linux/release/x86-64/test-57ec084/thread-35aa063.o.d -I. -Irpmalloc -Itest -DRPMALLOC_COMPILE=1 -funit-at-a-time -fstrict-aliasing -fno-math-errno -ffinite-math-only -funsafe-math-optimizations -fno-trapping-math -ffast-math -D_GNU_SOURCE=1 -W -Werror -pedantic -Wall -Weverything -Wno-padded -Wno-documentation-unknown-command -std=c11 -m64 -DBUILD_RELEASE=1 -O3 -g -funroll-loops -DENABLE_ASSERTS=1 -DENABLE_STATISTICS=1 -c test/thread.c -o build/ninja/linux/release/x86-64/test-57ec084/thread-35aa063.o
test/thread.c:95:2: error: implicit use of sequentially-consistent atomic may incur stronger memory barriers than necessary [-Werror,-Watomic-implicit-seq-cst]
        __sync_synchronize();
        ^~~~~~~~~~~~~~~~~~
test/thread.c:104:2: error: implicit use of sequentially-consistent atomic may incur stronger memory barriers than necessary [-Werror,-Watomic-implicit-seq-cst]
        __sync_synchronize();
        ^~~~~~~~~~~~~~~~~~
2 errors generated.
[26/48] CXX rpmalloc/new.cc
[27/48] SO build/ninja/linux/debug/x86-64/rpmalloc-cccf0ca/librpmalloc.so
[28/48] SO build/ninja/linux/debug/x86-64/rpmalloc-5aa0e6/librpmallocwrap.so
[29/48] CC test/main.c
[30/48] CC rpmalloc/rpmalloc.c
[31/48] CC rpmalloc/rpmalloc.c
[32/48] CC rpmalloc/rpmalloc.c
[33/48] CC rpmalloc/rpmalloc.c
[34/48] CC test/main.c
ninja: build stopped: subcommand failed.

Endless loop if rpmalloc is compiled with MSVC for 32-bit targets

I changed rpmalloc from a static library into a shared library so that it can be used in our application.

To do this, I did the following:

  • changed the configuration type from "static library (.lib)" to "dynamic library (.dll)"
  • added the following code in rpmalloc.h
    #ifdef RPMALLOC_EXPORT
      # define RPMALLOC_API __declspec(dllexport)
    #else
      # define RPMALLOC_API __declspec(dllimport)
    #endif
  • replaced "extern " with "extern RPMALLOC_API "
  • added RPMALLOC_EXPORT to the preprocessor definitions in the "rpmalloc" project
  • replaced _LIB with _DLL in the preprocessor definitions

When the program "test" is now executed and a thread calls "rpmalloc_thread_finalize", the program hangs in an endless loop.

I attached the changed source code.

Do you have any idea why this happens?

endless_loop
rpmalloc-1.3.1.zip

Comparison with Lockless Inc's Allocator

Hi,
I'm interested in benchmarks of how this allocator compares to the Lockless allocator. I'd also like to know if you have any comments on how their allocator works compared to yours. Theirs seems to be based on slabs and B-trees, plus some form of caching? The license seems a bit restrictive, but just including it in an open-source benchmark should be fine. Disclaimer: IANAL.

https://github.com/VladX/lockless-allocator (a slightly improved, cross-platform version)
http://locklessinc.com/technical_allocator.shtml
http://locklessinc.com/articles/allocator_tricks/

Thanks for this awesome allocator!

Configurable span alignment requirement

Make it possible to configure the span max size, and thus the alignment requirement, down to memory page size (normally 4KiB) in order to make it easier to use the allocator in a custom environment as a sub-allocator where memory is "mapped" by allocating memory from some other allocator.

This would allow the allocator to work reasonably well (with lower limits for medium and large chunks and lower number of size classes) without requiring allocations in 64KiB increments.

Consider using a license?

IANAL, but it seems that the Unlicense can cause issues.

See http://softwareengineering.stackexchange.com/questions/147111/what-is-wrong-with-the-unlicense

See also the thread at https://www.reddit.com/r/programming/comments/63mizt/rpmalloc_a_faster_malloc_public_domain/

If you don't care about licenses and just want your software to be usable by as many people as possible, it's wiser to choose MIT than the Unlicense.

MIT will serve that purpose better.

It just crashes

It crashed with a desktop "workload" (being /etc/ld.so.preload'ed) and it crashed with a MySQL workload as well. I assume it's not production ready as such.

Add support for external allocators

We would like to use rpmalloc as an allocator on top of a lower-level page-based allocator, where the low-level allocator provides mmap/munmap-style primitives to obtain pages of memory.

We have some constraints on where in the address space the memory can come from and so this would allow us to have a well-performing allocator (rpmalloc) that could also honour these existing constraints (on where pages can be located).

We have a patch that implements this on our feral_allocator branch, but thought it worth discussing the design here before opening a pull request.

The summary is:

  • A configuration structure, rpmalloc_config_t, is provided on initialisation to supply mmap/munmap callbacks.
  • If no configuration is provided, or if no callbacks are provided, the callbacks default to a VirtualAlloc/mmap implementation.
  • Alignment is handled by over-allocating (if necessary) rather than assuming the underlying allocator can honour specific address requests (as, in our case, we can't).

Some discussion of those points:

  • A configuration structure could also have a version number, or size field, for backwards source-compatibility. There may also be other configuration-style parameters you would like to add.
  • A new initialisation method is added (rpmalloc_initialize_config) but configuration could be added to rpmalloc_initialize with a default parameter if you're assuming C11.
  • _memory_map/_memory_unmap now return a pointer and an alignment offset. The alignment offset returned when mapping must be provided when unmapping and so this is stored in each span_t (or the heap_t) along with the other per-map state.
  • Another approach from a return parameter would be to declare a 'slab_t' structure that contains this offset, place a slab_t at the start of each span_t/heap_t structure, then rename these methods to _slab_allocate/_slab_free. That would avoid the need to return separate pointer and offset values, and would make both spans and heaps an instance of a 'slab' (where a 'slab' hides the underlying allocator and the alignment logic but can provide a contiguous page range that starts with a slab_t header).
  • If the underlying allocator doesn't return a 64KiB-aligned block (VirtualAlloc will, our usage will, mmap might not) then we align by releasing that block and over-allocating by 64KiB. This allows us to support allocators that can return any arbitrary address, and avoids the need for mmap-specific code to track a base address then loop to increment. The downside is that over-allocating can waste up to 16 pages of address space. But, since those alignment pages are never written to they will never be dirtied and so will not increase paging pressure.

Multiple integer overflows expose security risks

Read up on integer overflows here http://projects.webappsec.org/w/page/13246946/Integer%20Overflows

There are a few places I noticed them in your code; some of the most obvious/dangerous:

https://github.com/rampantpixels/rpmalloc/blob/master/rpmalloc/rpmalloc.c#L1287
https://github.com/rampantpixels/rpmalloc/blob/master/rpmalloc/rpmalloc.c#L1303
https://github.com/rampantpixels/rpmalloc/blob/master/rpmalloc/rpmalloc.c#L1200
https://github.com/rampantpixels/rpmalloc/blob/master/rpmalloc/rpmalloc.c#L837
...
There are quite a few. These will cause security issues for anybody linking this library.
You have to validate that an overflow cannot happen every time you adjust a size, or else you expose users to exploitation.

Neat library though, fix it up :) 👍

Crash with lots of threads in _memory_cache_extract on Windows

While trying to create a potential workaround for the malloc/free bug on Windows:

https://stackoverflow.com/questions/48906343/malloc-free-in-several-threads-crahes-on-windows

I have perhaps found a race condition that leads to a crash in _memory_cache_extract:

	if (span_ptr) {
		span_t* span = (void*)span_ptr;
		//By accessing the span ptr before it is swapped out of list we assume that a contending thread
		//does not manage to traverse the span to being unmapped before we access it
		void* new_cache = (void*)((uintptr_t)span->prev_span | ((uintptr_t)atomic_incr32(&cache->counter) & ~_memory_span_mask)); // <-- Crash
		if (atomic_cas_ptr(&cache->cache, new_cache, global_span)) {
			atomic_add32(&cache->size, -(int32_t)span->data.list.size);
			return span;
		}
	}

We have an access violation, the memory accessed is not mapped anymore.
Is it possible that the pointer is obtained, the page is then unmapped by another thread, and we then access it, leading to the access violation?

Example reproducing the crash compiled with visual studio 2017

#include "stdafx.h"
#include <thread>
#include <conio.h>

#include "rpmalloc.h"

using namespace std;

#define MAX_THREADS 50

void task(void)
{
	rpmalloc_thread_initialize();

	const int nbAlloc = 1000;
	const int sizeAlloc = 35000;

	void *listPtr[nbAlloc];
	while (true) {
		for (int i = 0; i < nbAlloc; i++) {
			listPtr[i] = (char *)rpmalloc(sizeAlloc);
		}
		for (int i = 0; i < nbAlloc; i++) {
			if (listPtr[i] == NULL)
				continue;
			rpfree(listPtr[i]);
		}
	}
}

int main(int argc, char** argv)
{
	thread some_threads[MAX_THREADS];

	for (int i = 0; i < MAX_THREADS; i++) {
		some_threads[i] = thread(task);
	}
	for (int i = 0; i < MAX_THREADS; i++) {
		some_threads[i].join();
	}

	_getch();
	return 0;
}

Note:
A potential workaround for the Windows bug is to modify the rpmalloc mapping code so that usage of VirtualAlloc/VirtualFree is protected by a lock.
Without the lock, the test crashes almost immediately.
With the lock, the test runs for about five minutes before crashing with an access violation.

Assimilate orphaned heaps

Assimilate orphaned heaps into active heaps according to some scheme to be determined. This avoids free spans being left dangling in heaps that are never reused when the peak number of threads is not reached again.

Each heap must keep a local running counter (non-atomic) of the number of used spans, so the assimilator knows when it is safe to release the heap pages.

rpmalloc-related segfaults with ksplashqml and NetworkManager (glib)

Hi, I tried rpmalloc for my whole Linux system and got some segfaults.
The first one is from a Qt program and the second one is from NetworkManager and is glib related.
I hope this is enough to find the bugs; if not, just comment here.

ksplashqml
#0  0x00007f5ad84b80a7 in _memory_deallocate_large_to_heap (heap=0x7f5ad0dc0000, span=0x7f5acc060000) at rpmalloc/rpmalloc.c:1161
#1  0x00007f5ad5ef8621 in ?? () from /usr/lib64/opengl/nvidia/lib/libGL.so.1
#2  0x00007f5ad418bbe3 in ?? () from /usr/lib64/libnvidia-glcore.so.390.25
#3  0x00007f5ad418ce00 in ?? () from /usr/lib64/libnvidia-glcore.so.390.25
#4  0x00007f5ad414eb89 in ?? () from /usr/lib64/libnvidia-glcore.so.390.25
#5  0x00007f5ad4151921 in ?? () from /usr/lib64/libnvidia-glcore.so.390.25
#6  0x00007f5ad5f432dd in ?? () from /usr/lib64/opengl/nvidia/lib/libGL.so.1
#7  0x00007f5ad84cec53 in ?? () from /lib64/ld-linux-x86-64.so.2
#8  0x00007f5ad71ce030 in ?? () from /lib64/libc.so.6
#9  0x00007f5ad71ce11a in exit () from /lib64/libc.so.6
#10 0x00007f5ad71aba6e in __libc_start_main () from /lib64/libc.so.6
#11 0x00000000004055da in _start ()


#0  _memory_allocate_from_heap (heap=0x7ffbf4120000, size=<optimized out>) at rpmalloc/rpmalloc.c:747
747             if (active_block->free_count) {
[Current thread is 1 (Thread 0x7ffbf492c700 (LWP 153212))]
(gdb) bt
#0  _memory_allocate_from_heap (heap=0x7ffbf4120000, size=<optimized out>) at rpmalloc/rpmalloc.c:747
#1  _memory_allocate (size=<optimized out>) at rpmalloc/rpmalloc.c:1272
#2  0x00007ffbf719457f in g_malloc (n_bytes=16) at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/glib/gmem.c:96
#3  0x00007ffbf719487b in g_malloc_n (n_blocks=2, n_block_bytes=8) at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/glib/gmem.c:339
#4  0x00007ffbf71cec59 in g_variant_new_dict_entry (key=0x7ffbf40a4260, value=0x7ffbf3321600) at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/glib/gvariant.c:925
#5  0x00007ffbf73bd64f in parse_value_from_blob (buf=0x7ffbf492ba40, type=0x7ffbf745aa51, just_align=0, indent=4, error=0x7ffbf492b848)
    at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/gio/gdbusmessage.c:1778
#6  0x00007ffbf73bd4ab in parse_value_from_blob (buf=0x7ffbf492ba40, type=0x7ffbf745aa50, just_align=0, indent=2, error=0x7ffbf492bb00)
    at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/gio/gdbusmessage.c:1721
#7  0x00007ffbf73bde9f in g_dbus_message_new_from_blob (blob=0x7ffbf40e0020 "l\002\001\001", blob_len=72, capabilities=G_DBUS_CAPABILITY_FLAGS_UNIX_FD_PASSING,
    error=0x7ffbf492bb00) at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/gio/gdbusmessage.c:2084
#8  0x00007ffbf73ca329 in _g_dbus_worker_do_read_cb (input_stream=0x7ffbf49f1130, res=0x7ffbf49c4e00, user_data=0x7ffbf49e01e0)
    at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/gio/gdbusprivate.c:721
#9  0x00007ffbf736ff7c in g_task_return_now (task=0x7ffbf49c4e00) at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/gio/gtask.c:1145
#10 0x00007ffbf736ffd9 in complete_in_idle_cb (task=0x7ffbf49c4e00) at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/gio/gtask.c:1159
#11 0x00007ffbf718e977 in g_idle_dispatch (source=0x7ffbf40f1520, callback=0x7ffbf736ffc1 <complete_in_idle_cb>, user_data=0x7ffbf49c4e00)
    at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/glib/gmain.c:5586
#12 0x00007ffbf718bf5d in g_main_dispatch (context=0x7ffbf49d00e0) at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/glib/gmain.c:3234
#13 0x00007ffbf718cdfe in g_main_context_dispatch (context=0x7ffbf49d00e0) at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/glib/gmain.c:3899
#14 0x00007ffbf718cfe2 in g_main_context_iterate (context=0x7ffbf49d00e0, block=1, dispatch=1, self=0x7ffbf6079ad0)
    at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/glib/gmain.c:3972
#15 0x00007ffbf718d408 in g_main_loop_run (loop=0x7ffbf4a81140) at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/glib/gmain.c:4168
#16 0x00007ffbf73c9794 in gdbus_shared_thread_func (user_data=0x7ffbf4a81100) at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/gio/gdbusprivate.c:252
#17 0x00007ffbf71bbfcb in g_thread_proxy (data=0x7ffbf6079ad0) at /var/tmp/portage/dev-libs/glib-2.52.3/work/glib-2.52.3/glib/gthread.c:784
#18 0x00007ffbf6fb05ea in start_thread () from /lib64/libpthread.so.0
#19 0x00007ffbf6ee14bf in clone () from /lib64/libc.so.6

Option to disable unmapping

I'd like to request a feature to disallow unmapping; on wasm only "growing" is supported, so giving memory back doesn't make sense there.

Using HugePages

It would be nice if we could instruct rpmalloc to use huge page allocations.

At least on Linux this is achieved by mmap-ing files from wherever hugetlbfs is mounted, so we should be able to configure rpmalloc with the mount point.

Another possible issue: huge pages on x86 come in sizes of 2MiB; I don't know how that will interact with the 64KiB alignment used by rpmalloc.

Build fails with "depfile has multiple output paths"

I'm trying to build rpmalloc on Ubuntu 18.04 with Ninja 1.8.2 using the following commands:

wget -O - https://github.com/rampantpixels/rpmalloc/archive/1.3.1.tar.gz | tar xz
cd rpmalloc-1.3.1 && python3 configure.py && ninja

Unexpectedly, the Ninja build fails:

[1/48] MKDIR lib/linux/debug
[2/48] MKDIR lib/linux/release
[3/48] MKDIR 'lib/linux/debug/b'\''x86_64'\'''
[4/48] MKDIR 'lib/linux/release/b'\''x86_64'\'''
[5/48] MKDIR bin/linux/debug
[6/48] MKDIR 'bin/linux/debug/b'\''x86_64'\'''
[7/48] CC rpmalloc/malloc.c
FAILED: build/ninja/linux/debug/b'x86_64'/rpmalloc-a8e50b6/malloc-53b207c.o 
clang -MMD -MT 'build/ninja/linux/debug/b'\''x86_64'\''/rpmalloc-a8e50b6/malloc-53b207c.o' -MF 'build/ninja/linux/debug/b'\''x86_64'\''/rpmalloc-a8e50b6/malloc-53b207c.o'.d -I.  -DRPMALLOC_COMPILE=1 -funit-at-a-time -fstrict-aliasing -fno-math-errno -ffinite-math-only -funsafe-math-optimizations -fno-trapping-math -ffast-math -D_GNU_SOURCE=1 -W -Werror -pedantic -Wall -Weverything -Wno-padded -Wno-documentation-unknown-command -std=c11  -DBUILD_DEBUG=1 -g -DENABLE_PRELOAD=1 -c rpmalloc/malloc.c -o 'build/ninja/linux/debug/b'\''x86_64'\''/rpmalloc-a8e50b6/malloc-53b207c.o'
depfile has multiple output paths[8/48] CC rpmalloc/malloc.c

FAILED: build/ninja/linux/release/b'x86_64'/rpmalloc-a8e50b6/malloc-53b207c.o 
clang -MMD -MT 'build/ninja/linux/release/b'\''x86_64'\''/rpmalloc-a8e50b6/malloc-53b207c.o' -MF 'build/ninja/linux/release/b'\''x86_64'\''/rpmalloc-a8e50b6/malloc-53b207c.o'.d -I.  -DRPMALLOC_COMPILE=1 -funit-at-a-time -fstrict-aliasing -fno-math-errno -ffinite-math-only -funsafe-math-optimizations -fno-trapping-math -ffast-math -D_GNU_SOURCE=1 -W -Werror -pedantic -Wall -Weverything -Wno-padded -Wno-documentation-unknown-command -std=c11  -DBUILD_RELEASE=1 -O3 -g -funroll-loops -DENABLE_PRELOAD=1 -c rpmalloc/malloc.c -o 'build/ninja/linux/release/b'\''x86_64'\''/rpmalloc-a8e50b6/malloc-53b207c.o'
depfile has multiple output paths[9/48] CXX rpmalloc/new.cc

FAILED: build/ninja/linux/release/b'x86_64'/rpmalloc-a8e50b6/new-440206f.o 
clang++ -MMD -MT 'build/ninja/linux/release/b'\''x86_64'\''/rpmalloc-a8e50b6/new-440206f.o' -MF 'build/ninja/linux/release/b'\''x86_64'\''/rpmalloc-a8e50b6/new-440206f.o'.d -I.  -DRPMALLOC_COMPILE=1 -funit-at-a-time -fstrict-aliasing -fno-math-errno -ffinite-math-only -funsafe-math-optimizations -fno-trapping-math -ffast-math -D_GNU_SOURCE=1 -W -Werror -pedantic -Wall -Weverything -Wno-padded -Wno-documentation-unknown-command -std=gnu++14  -DBUILD_RELEASE=1 -O3 -g -funroll-loops -DENABLE_PRELOAD=1 -c rpmalloc/new.cc -o 'build/ninja/linux/release/b'\''x86_64'\''/rpmalloc-a8e50b6/new-440206f.o'
depfile has multiple output paths[10/48] CC rpmalloc/rpmalloc.c
...

Do you have any idea what the problem could be?

GCD and rpmalloc_thread_initialize

Hi,

I'm trying to include rpmalloc in my iOS project; however, since we use GCD I'm not sure where to call rpmalloc_thread_initialize.

Without this call the app crashes in all callers that do not run on the main thread, where rpmalloc_initialize was called.

Thanks

WebAssembly page_size

Hi,

On WebAssembly the size of a "page" (the minimal allocation block) is 64KiB. Setting page_size in the memory config to 65536 doesn't work because of

	if (_memory_page_size > (16 * 1024))
		_memory_page_size = (16 * 1024);

Is it safe to just change that check to 64?

When I do that, it fails with a SIGSEGV here:

_memory_deallocate_to_heap
_memory_deallocate
free

This might be a bug in my compiler, though. Is there anything else that needs changing to make it work?

Handle zero sized allocations

The current implementation does not handle zero sized allocations in the 'default' configuration, as it will trigger an out-of-bounds read when looking up the size class id (https://github.com/rampantpixels/rpmalloc/blob/master/rpmalloc/rpmalloc.c#L722).

As this is implementation defined (as far as I know), I'll defer to you on how to handle it.

Personally I would prefer it to return non-null, and I can see the reasoning behind handling it either by forcing a minimum allocation size of one byte or by returning a special non-null pointer (each with its own trade-offs).
