Sorry, my bad.
After the loop, instead of map.reset_inner(i);, use the following, which is equivalent to your reset_inner:
map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
    set = map_str_int_type::EmbeddedSet();
});
from parallel-hashmap.
Added! Looks like you did an amazing job learning C++!
I saw a small performance issue in llil4map.cc, where using the non-mutable with_submap does not allow std::move-ing the string out of the map. This could be an issue if long strings are used (and if not using the fixed string).
Here is my updated version.
That is great performance for the AMD 7950X CPU. The "write stdout" taking 3.2 seconds is likely due to the L3 cache size difference between the two CPUs. I am providing the following output because the 7950X CPU is amazing: my system is unable to reach a 2x faster total time, despite having 64 logical cores. Building with clang++ runs faster for me.
It is impressive that "get properties" for the 7950x CPU completes in 12.3 seconds. Wow!
g++
$ g++ -o llil4map -std=c++20 -fopenmp -Wall -O3 llil4map.cc -I./parallel-hashmap
$ ./llil4map in/big* in/big* in/big* | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties 7.496 secs
map to vector 0.965 secs
vector stable sort 1.740 secs
write stdout 0.657 secs
total time 10.859 secs
count lines 970195200
count unique 200483043
clang++
$ clang++ -o llil4map -std=c++20 -fopenmp -Wall -O3 llil4map.cc -I./parallel-hashmap
$ ./llil4map in/big* in/big* in/big* | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties 7.496 secs
map to vector 0.791 secs
vector stable sort 1.137 secs
write stdout 0.591 secs
total time 10.017 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
Last year, I tried custom memory allocation. I was trying to reduce peak memory consumption to no avail, because it greatly impacted "write stdout". Thanks for the reset suggestion. Memory consumption no longer peaks high.
No, not weird at all, because of the random generator script. It is normal for the unique count to differ per machine or whenever the input files are refreshed. If you run gen_files again, the unique count will likely differ.
Do you want to add the tmp dir to .gitignore?
/examples/llil_utils/tmp
Sure, will do, even though in my case I generate the files in my build directory.
I tried with clang-16 and indeed it is also faster on my system, especially the write stdout.
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties 12.873 secs
map to vector 1.370 secs
vector stable sort 1.862 secs
write stdout 1.177 secs
total time 17.284 secs
count lines 970195200
count unique 200478308
Sure, will do, even though in my case I generate the files in my build directory.
Oh, I assumed wrongly the location.
I tried with clang-16 and indeed it is also faster on my system, esp the write stdout.
That is amazing.
I checked and you have the same defaults. Why are mindfulness for users, and a fixed-length default, important? So the application can run with less memory consumption, with no surprises on machines with 8 to 16 GB. The reset was the missing piece.
#define USE_BOOST_PARALLEL_SORT 1
#define MAX_STR_LEN_L (size_t) 12
Folks may run with a partial list, not all 92 files. For example, just a (i.e. biga* biga* biga*), or [ab] (i.e. big[ab]* big[ab]* big[ab]*).
I have a question about the data files. They contain words followed by a number. Am I right to assume this number is always positive?
I'll see if I can make get_properties a little bit faster, this is fun :-)
Also I added a couple minor simplifications in the code, see this
Am I right to assume this number is always positive?
Yes, always positive is certain.
this is fun :-)
It's mind boggling. We saw that clang++ is faster than g++. Also, -std=c++20 is faster than -std=c++17. Why did I stop at 92 input files? Running big* big* big* handles approximately 1 billion lines and 200 million unique keys. Testing big* once is mostly insertions into the map. Testing big* big* big* is a combination of insertions and updates. Testing smaller input is possible simply with the suffix biga* biga* biga*. Then, there is fixed-length mode and long strings. I'm grateful for reset.
The application runs parallel at each level: (1) input, (2) map to vector, (3) sorting, and (4) output. Crazy fast. The coolness factor is the phmap library managing the many sub-maps internally (1 << 12 = 4096). Thank you for allowing a custom mutex class. The spinlock_mutex class improves parallel insertions/updates.
Randomness is beautiful. With the many sub-maps, it's less likely for threads to be using the same map/mutex in a given time window or fraction of a second. If they do, no problem. The spinlock_mutex class has low overhead.
Also I added a couple minor simplifications in the code
I had no idea about set.size(). I appreciate learning from a master. Ah, the clear/swap lines are not needed, because the map construction is inside a block and the reset works so well. Oh, the sorting simplifications. That's amazing.
Thank you, @greg7mdp.
There is one more problem to resolve. That is for long strings or variable length keys.
Give me a couple of days and I'll provide a solution for that (a fixed-size string which also supports long strings and is as memory efficient as possible).
Hi @marioroy ,
I added a new type string_cnt which stores a short string inline if its length is at most 11 bytes, and otherwise uses a pointer, so the program works for any string length but uses little memory for short strings; even for long strings it uses less space than std::string.
I also used a set instead of a map, storing the string_cnt directly, so that we know it will use 16 bytes per entry for short strings.
Some other small changes and cleanup, I hope you don't mind.
see this
Hi @greg7mdp,
Wow! I gave it a spin on this machine. This peaks at 6.9 GB. The lowest memory consumption thus far for supporting long strings.
$ ./llil4mapg big* big* big* | cksum
llil4map start
use OpenMP
use boost sort
get properties 8.152 secs
map to vector 0.885 secs
vector stable sort 2.764 secs
write stdout 0.696 secs
total time 12.499 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
At your convenience, please make one tiny change to allow the parallel sort to run on all CPU threads, i.e. no limit: replace std::min(nthds, 32) with nthds.
$ ./llil4mapg big* big* big* | cksum
llil4map start
use OpenMP
use boost sort
get properties 8.104 secs
map to vector 0.894 secs
vector stable sort 1.859 secs <--- improvement, okay to utilize all CPU threads
write stdout 0.723 secs
total time 11.581 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
I will study string_cnt. The awesome part is not having to specify fixed-length or long strings. It's automatic.
Hi @marioroy , I just made that update to remove the 32 thread limit.
I reached out to eyepopslikeamosquito to let him know. He triggered my curiosity in another thread into making an attempt at the original Long List is Long challenge. This became a great opportunity for learning OpenMP using C++.
Thank you @greg7mdp, for the string_cnt solution.
Well, the current string_cnt already has the first 12 bytes functioning as a union equivalent to:
union {
    char *str;
    char immediate[12];
};
so the first 12 bytes contain either a pointer to a long string, or the immediate string.
There would be no saving if we made the immediate[12] an immediate[8] to be the same size as the pointer: the whole struct including the count would be 12 bytes, but when stored in a vector or a set the next struct would be aligned to an 8-byte boundary, so the 4 bytes apparently saved would be left unused.
And yes, we need a way to know whether the struct stores a pointer or an immediate string. I use the last byte, extra[3]; when it is 0 (meaning an immediate string), the zero can also serve as the null terminator of the string (since we don't store its size), so we can use the full 11 other bytes for useful characters.
Pointers have to be aligned on an 8-byte boundary. So if you were to put the 4-byte cnt first, the next 4 bytes would be wasted because the union would have to start on an 8-byte boundary.
string_cnt is clever. My apologies for my ramblings.
I see it clearly, right here :)
string_cnt(string_cnt&& o) noexcept {
    if (o.extra[3]) {
        str = o.str;
        o.str = nullptr;
        extra[3] = 1;
    } else {
        std::strcpy((char *)(&str), o.get());
        extra[3] = 0;
    }
    cnt = o.cnt;
}
This is a move constructor, so it is stealing the data from o.
If o has a pointer to a long string, we can just copy the pointer, note in extra[3] that we have a pointer, and set o.str to null so o's destructor will not free the memory.
If o contains a small string, we just copy it and mark that we have a small string as well.
I could add some comments.
I think I have an idea to make get_properties faster. I'll try to check something tomorrow. This is a fun problem!
I understand your string_cnt fully. Thank you. The magic lies in the set method. The strcpy may span the two variables str and extra, up to 3 chars (the 4th char is a flag).
auto len = std::strlen(s);
if (len >= buffsz) {
    str = new char[len+1];
    std::strcpy(str, s);
    extra[3] = 1;
} else {
    std::strcpy((char *)(&str), s);
    extra[3] = 0;
}
My favorite line is std::strcpy((char *)(&str), s): union-like behavior spanning str and extra (3 chars max), thus 11 chars max. There's more cleverness :-) The extra[3] is dual-purpose: a flag (indicating a pointer) or the null terminator for an 11-char small string.
And the importance of o.str = nullptr in the move constructor, after stealing the data.
@greg7mdp Your solution is superb. Thanks many, I learned a lot. The string_cnt struct boggles my mind with its clever design. You helped bring back sanity in my life. :-) Now, I no longer think about the Long List is Long challenge.
Blessings and grace,
If only string_cnt allowed one to specify the fixed-length limit, for MAX_STR_LEN_L-like capability.
It is indeed possible. I previously modified the version of string_cnt to use std::string_view, but here is a version which is templatized to support various sso sizes:
// ---------------------------------------------------------------------------------------------
// Stores a string + a count
// For strings up to N-1 bytes, total space used is N + 4 bytes
// For larger strings, uses N+4 bytes + strlen(string) + 1
//
// invariants
// if extra[mark_idx], str is a valid string pointer
// if !extra[mark_idx], the N bytes starting at (const char *)(&str) store a null_terminated string
// ---------------------------------------------------------------------------------------------
template<size_t N>
struct string_cnt_t {
    using uint_t = uint32_t;

    static_assert(N >= 12);
    static constexpr size_t extra_sz = N - sizeof(char*);
    static constexpr size_t mark_idx = extra_sz - 1;

    char * str;
    char   extra[extra_sz];
    uint_t cnt;

    static constexpr size_t buffsz = sizeof(str) + sizeof(extra);

    string_cnt_t() : str{nullptr}, extra{0}, cnt{0} {}
    string_cnt_t(std::string_view s, uint_t c = 0) : str(nullptr), cnt(c) { set(s); }
    ~string_cnt_t() { free(); }

    string_cnt_t(const string_cnt_t& o) {
        set(o.get());
    }

    string_cnt_t(string_cnt_t&& o) noexcept {
        if (o.extra[mark_idx]) {
            str = o.str;
            o.str = nullptr;
            extra[mark_idx] = 1;
        } else {
            std::strcpy((char *)(&str), o.get());
            extra[mark_idx] = 0;
        }
        cnt = o.cnt;
    }

    string_cnt_t& operator=(const string_cnt_t& o) {
        free();
        set(o.get());
        cnt = o.cnt;
        return *this;
    }

    string_cnt_t& operator=(string_cnt_t&& o) noexcept {
        free();
        new (this) string_cnt_t(std::move(o));
        return *this;
    }

    std::strong_ordering operator<=>(const string_cnt_t& o) const { return std::strcmp(get(), o.get()) <=> 0; }
    bool operator==(const string_cnt_t& o) const { return std::strcmp(get(), o.get()) == 0; }

    std::size_t hash() const {
        auto s = get();
        std::string_view sv {s};
        return std::hash<std::string_view>()(sv);
    }

    const char *get() const { return extra[mark_idx] ? str : (const char *)(&str); }

private:
    void free() { if (extra[mark_idx]) { delete [] str; str = nullptr; } }

    void set(std::string_view s) {
        static_assert(buffsz == 12);
        static_assert(offsetof(string_cnt_t, cnt) == (intptr_t)buffsz);
        static_assert(sizeof(string_cnt_t) == 16);
        assert(!extra[mark_idx] || !str);
        if (s.empty())
            std::memset(&str, 0, buffsz);
        else {
            auto len = s.size();
            if (len >= buffsz) {
                str = new char[len+1];
                std::memcpy(str, s.data(), len);
                str[len] = 0;
                extra[mark_idx] = 1;
            } else {
                std::memcpy((char *)(&str), s.data(), len);
                ((char *)&str)[len] = 0;
                extra[mark_idx] = 0;
            }
        }
    }

    void set(const char *s) {
        set(std::string_view{s});
    }
};

using string_cnt = string_cnt_t<12>;

namespace std {
    template<> struct hash<string_cnt> {
        std::size_t operator()(const string_cnt& v) const noexcept { return v.hash(); };
    };
}
PS: I got get_properties to be faster... it takes 8.8s on my machine with the 6-character strings.
That's great, thanks for all your work. And for fixing the static_assert in string_cnt. I still have some work to do on my new version but I'm not sure when I'll get it done.
I too want to use string_cnt in my phmap demonstrations
You are welcome to use string_cnt for whatever you want.
So much learning. I sped up get_properties by storing the length of the key with the count, i.e. (count << 8) + (klen & 0xff).
- I replaced two set lines from set(o.get()); to set(o.get(), o.cnt & 0xff); and changed the private set method.
void set(const char *s, uint8_t len) {
    set(std::string_view{s, len});
}
- The hash method is called hundreds of millions of times. The 1-second improvement comes from here.
std::size_t hash() const {
    auto s = get();
    std::string_view sv {s, cnt & 0xff};
    return std::hash<std::string_view>()(sv);
}
I ran with 16 CPU cores. Before and after.
$ NUM_THREADS=16 taskset -c 0-15 ./llil4map_greg in/big* in/big* in/big* | cksum
llil4map start
use OpenMP
use boost sort
get properties 23.146 secs 22.106 secs <-- improved
map to vector 0.818 secs 0.846 secs
vector stable sort 4.547 secs 4.496 secs
write stdout 0.538 secs 0.535 secs
total time 29.049 secs 27.985 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
get_properties is now as fast as my llil4map version (without string_cnt).
Hi @marioroy , congrats on all the great work. I've been very busy at work, so I haven't had time to work on this or look at your latest changes, but hopefully I will soon. All the best!
Thanks for the great explanation @marioroy .
Can you do the reset inside the with_submap lambda, as in:
for (size_t i = 0; i < map.subcnt(); ++i) {
    map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
        auto it = I[i];
        for (auto& x : set)
            *it++ = std::make_pair(std::move(x.first), x.second);
        set = map_str_int_type::EmbeddedSet(); // <------------------------------
    });
}
Thanks. Per your suggestion, I removed const and inserted the line.
for (size_t i = 0; i < map.subcnt(); ++i) {
    map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
        auto it = I[i];
        for (auto& x : set)
            *it++ = std::make_pair(std::move(x.first), x.second);
        set = map_str_int_type::EmbeddedSet();
    });
}
The clang++ compiler fails.
In file included from llil4map.cc:51:
./parallel-hashmap/parallel_hashmap/phmap.h:3431:9: error: no matching function for call to object of type '(lambda at llil4map.cc:415:32)'
3431 | fCallback(set);
| ^~~~~~~~~
llil4map.cc:415:17: note: in instantiation of function template specialization 'phmap::priv::parallel_hash_set<12, phmap::priv::raw_hash_set, spinlock_mutex, phmap::priv::FlatHashMapPolicy<std::basic_string<char>, unsigned int>, phmap::priv::StringHashEqT<char>::Hash, phmap::priv::StringHashEqT<char>::Eq, std::allocator<std::pair<const std::basic_string<char>, unsigned int>>>::with_submap<(lambda at llil4map.cc:415:32)>' requested here
415 | map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
| ^
llil4map.cc:415:32: note: candidate function not viable: 1st argument ('const EmbeddedSet' (aka 'const raw_hash_set<phmap::priv::FlatHashMapPolicy<std::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned int>, phmap::priv::StringHashEqT<char>::Hash, phmap::priv::StringHashEqT<char>::Eq, std::allocator<std::pair<const std::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned int>>>')) would lose const qualifier
415 | map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Thank you, @greg7mdp. Your suggestion works.
for (size_t i = 0; i < map.subcnt(); ++i) {
    map.with_submap(i, [&](const map_str_int_type::EmbeddedSet& set) {
        auto it = I[i];
        for (auto& x : set)
            *it++ = std::make_pair(std::move(x.first), x.second);
    });
    // The sub map is no longer needed. Reset the set to reclaim memory.
    map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
        set = map_str_int_type::EmbeddedSet();
    });
}
Notice "map to vector" completing in 2 seconds. This was taking 3 seconds, previously.
$ ./llil4map in/big* in/big* in/big* | cksum
llil4map start
use OpenMP
use boost sort
get properties 8.946 secs
map to vector 2.018 secs <-- Yeah! Thank you!
vector stable sort 2.960 secs
write stdout 0.589 secs
total time 14.514 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
Hey @marioroy, I just tried your llil4map.cc myself. Would you mind if I added it to phmap as an example (with your attribution comment at the top of llil4map.cc, of course)?
Which parameters did you use for gen-llil.pl? I used the ones from the comment and I got:
$ NUM_THREADS=3 ../ex_lill4map big1.txt big2.txt big3.txt >out.txt
llil4map (fixed string length=12) start
use OpenMP
don't use boost sort
get properties 0.695 secs
map to vector 0.093 secs
vector stable sort 1.316 secs
write stdout 0.061 secs
total time 2.166 secs
count lines 10545600
count unique 10368458
Wow! I do not mind; llil4map.cc is final.
Thank you for your help, @greg7mdp: (1) some time ago, you found an off-by-one error in the chunking; (2) at the time, you helped make map to vector run in parallel; (3) and now, you helped ensure minimum memory consumption. Blessings and grace.
That's it, just as written in the comments for generating the input files.
perl gen-llil.pl big1.txt 200 3 1
perl gen-llil.pl big1.txt 200 3 1
perl gen-llil.pl big1.txt 200 3 1
I found my script for making 92 input files + shuffled (requires 2.8 GB). On my machine (where llil4map.cc resides), in is a symbolic link to /data/input. Change the path accordingly.
#!/bin/bash
# Script for generating input files for llil4map.cc.

mkdir -p /data/input
cp gen-llil.pl shuffle.pl /data/input
cd /data/input

# create 26 random files
for n in $(perl -le "print for 'aa'..'az'"); do
    perl gen-llil.pl big$n 200 3 1
    perl shuffle.pl big$n >1; mv 1 big$n
done &

# create 26 random files
for n in $(perl -le "print for 'ba'..'bz'"); do
    perl gen-llil.pl big$n 200 3 1
    perl shuffle.pl big$n >2; mv 2 big$n
done &

# create 26 random files
for n in $(perl -le "print for 'ca'..'cz'"); do
    perl gen-llil.pl big$n 200 3 1
    perl shuffle.pl big$n >3; mv 3 big$n
done &

# create 14 random files (total 92 files)
for n in $(perl -le "print for 'da'..'dn'"); do
    perl gen-llil.pl big$n 200 3 1
    perl shuffle.pl big$n >4; mv 4 big$n
done &

wait
The default mode in llil4map.cc is fixed string length = 12. Commenting out line 119 switches to variable-length keys (using std::basic_string). This requires 16 GB to process the 92 input files when commented out; otherwise, much less for fixed-length keys.
// #define MAX_STR_LEN_L (size_t) 12
My PerlMonks friend eyepopslikeamosquito helped me learn C++.
I enjoy parallel processing. So naturally, I tried: (1) chunk IO in get properties using spinlock_mutex; (2) map to vector, this too is parallel; (3) parallel sort; (4) finally, parallel output, but orderly.
Wow! Do not mind. llil4map.cc is final.
Thanks, I'll add it as an example then, it is a great example!
Thanks. I learned also, reading phmap_fwd_decl.h and your examples.
I tried your updated version and added line to reset the set, especially beneficial for long strings. Without it, the "map to vector" takes 3 seconds for long strings and peaks near 18.4 GB.
for (size_t i = 0; i < map.subcnt(); ++i) {
    map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
        auto it = I[i];
        for (auto& x : set)
            *it++ = std::make_pair(std::move(const_cast<str_type&>(x.first)), x.second);
        // reset the set (no longer needed) to reclaim memory early
        set = map_str_int_type::EmbeddedSet();
    });
}
I respect your decision if you prefer leaving the line commented out, long strings mode (line 119).
// #define MAX_STR_LEN_L (size_t) 12
Or default to fixed-length, uncomment line.
#define MAX_STR_LEN_L (size_t) 12
Here too, I respect your decision disabled or enabled (line 46). I enabled it locally to validate parallel sort. Disabled takes 58.0 seconds for "vector stable sort" (long strings mode), or 21.5 seconds (fixed-length mode).
#define USE_BOOST_PARALLEL_SORT 1
Testing was done on a 32 core (64 logical threads), AMD Ryzen Threadripper 3970X machine, DDR4 memory 3600 MHz.
$ ./llil4map in/big* in/big* in/big* | cksum
llil4map start
use OpenMP
use boost sort without reset with reset
get properties 9.035 secs 9.004 secs
map to vector 3.068 secs 2.010 secs <--- reset, minimum peak memory consumption
vector stable sort 2.972 secs 2.988 secs
write stdout 0.639 secs 0.563 secs
total time 15.716 secs 14.567 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
Thanks @marioroy, I reverted the defaults to the way you had them, and also added the reset of the set.
On my system (AMD Ryzen 9 7950X, a 16-core processor with 32 threads, with DDR5 memory), I get the following, which is not the same as yours:
$ ./run_llil4map
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties 12.296 secs
map to vector 1.300 secs
vector stable sort 2.133 secs
write stdout 3.247 secs
total time 18.978 secs
count lines 970195200
count unique 200485780
3341100189 1811165988
$ cat ./run_llil4map
#!/bin/bash
cd tmp
../../ex_llil4map big* big* big* | cksum
It is weird that I don't have the same result for count unique as you do. Maybe a difference in gen_llil.pl?
I'll play with it some more later. It is an interesting piece of code for sure!
Hey @marioroy, just wondering: is your solution the fastest one that exists for this problem?
is your solution the fastest one that exists for this problem?
Hi, @greg7mdp I am not aware of anything faster.
The llil4emh emhash demonstration is the fastest for fixed-length keys. For long strings or variable-length keys, llil4map and llil4emh perform similarly. The map demonstrations scale to many CPU cores with ease. These are the fastest beyond 12 CPU threads.
There is also the llil4vec implementation. It appends to a vector, with no maps, so it may consume a lot of memory, particularly with repeating keys. It may run faster up to some number of cores, but does not scale well beyond that. If daring to run llil4vec, the max you can do with 32 GB is big* big* big*, and only with fixed-length=12 at most. It peaks near 28 GB, so be careful; it may cause swapping.
llil4map peaks at less than 8 GB for 200 million keys, fixed-length=12.
I created a backup plan in case the number of unique keys were to exceed memory capacity :-) Run with NUM_MAPS=128 maximum, default 32.
- llil4tkh: a tkrzw::ShardDBM demonstration. The hash DBMs are managed by the C++ library.
- llil4tkh2: a tkrzw::HashDBM demonstration. The hash DBMs are managed by the application.
Wow, thanks for all the information! You seem to have spent a lot of time on this. Is this something you are doing for work, or just for fun?
Is this something you are doing for work, or just for fun?
For fun, actually. I wondered if it was possible to solve the challenge. A Perl Monk, eyepopslikeamosquito, tried the vector approach, stating that memory is cheap. My solutions took some time (I was new to C++). I chose the path that preferred low memory consumption.
Your phmap C++ library made this possible (emhash as well). OpenMP made parallelism possible. Chunk IO and a spinlock mutex enable scaling to many CPU cores. The reset capability ensures low memory consumption during map to vector; I had no idea there was a way to do so. Thank you for your tip. Boost's block_indirect_sort allows many threads and is fast. OpenMP's ordered directive works, allowing parallel, yet orderly, output.
There is one more problem to resolve. That is for long strings or variable length keys.
The std::string (std::basic_string) container is not memory efficient, unfortunately. In long-strings mode, llil4map peaks at 17.8 GB for the same input files (previously 18.4 GB, without the reset). That's horrendous, and more than double fixed-length mode. It is much worse for the llil4vec vector demonstration; that one peaks at 58 GB, requiring a VM or machine with 64 GB.
For comparison with llil4map, the llil4emh emhash implementation peaks at 20.0 GB in long-strings mode. It computes the hash_value only once and stores it with the key.
I made a memory efficient version, llil4map2, using a vector of pointers to the phmap key-value pairs. That meant having to construct the phmap outside the block, and no reset. This results in sorting taking longer, possibly due to CPU cache misses. However, memory consumption is significantly less.
200mil unique keys   llil4map   llil4map2
------------------   --------   ---------
fixed-length=12      ~ 8.0 GB     6.0 GB
long-strings         17.8 GB     12.3 GB
I removed the limit for the number of threads for better performance, sorting and output. This is helpful if RAM data is scattered, not cache friendly.
From string_cnt: "For larger strings, uses 16 bytes + strlen(string) + 1"
Recently, I came across a stack overflow thread. The top answer contains a string class involving a union.
class string {
public:
    // all 83 member functions
private:
    size_type m_size;
    union {
        class {
            // This is probably better designed as an array-like class
            std::unique_ptr<char[]> m_data;
            size_type m_capacity;
        } m_large;
        std::array<char, sizeof(m_large)> m_small;
    };
};
I will try a union in string_cnt. Ah, so that means cnt must come first in the structure. Is this not worth the effort? Curiosity wonders how much memory consumption will be reduced :-)
struct string_cnt {
    uint32_t cnt;
    union {
        char * str;
        char extra[4];
    };
    ...
};
Never mind. A union will not work, because extra[3] is a boolean condition for whether there is a valid string pointer. For long strings, only 3 chars is wasted space. Have I misunderstood the clever string_cnt structure?
I don't think you can save more space in the struct itself. But in the case of long strings, if they are all the same size, you can use a custom allocator like boost::pool_allocator; that would not work in the general case where the strings are of different sizes.
Blessings and grace, @greg7mdp. Thank you for the extra clarity. The string_cnt struct is clever.
Pointers have to be aligned on a 8 byte boundary.
Just earlier, I tried (for fun) putting cnt first and wondered about the assertions complaining. Now I understand.
What if cnt contained both the flag (1 bit, indicating a pointer is used) and the count (31 bits), using masking? Would that help remove the unused gap for long strings? Doing so may improve sorting performance.
Note: by long strings, I really mean variable-length strings. Thanks; boost::pool_allocator will not work.
If you want the same struct to work for both long and short strings, as I think we do, it has to contain at least a pointer (8 bytes) + a count.
If the count were never greater than 8, we could store it in the 3 lower bits of the pointer (which are always 0), so the pointer + count would fit in 8 bytes: we could have either a 7-byte immediate string + count, or a pointer + count.
If the count can be greater than 8, then we need more than 8 bytes, so 16 bytes is the minimum our struct will occupy (for alignment), unless we are even more tricky by moving the pointers to an aligned address before dereferencing them.
But honestly, I don't see how you can save more than a few bytes, and then you also reduce the size of immediate strings that are supported.
Please, disregard. So... will "instead of extra[3], store the flag (1 bit) indicating whether a pointer is used in the cnt variable; 31 bits is plenty big for the count" result in no savings?
No problem, it is not obvious to understand if C++ is not your main language for sure.
Your beautification of the include directives put me to work :-) For consistency reasons, I thought to beautify the include directives as well. I created a thread at Perl Monks to let folks know: https://perlmonks.org/?node_id=11158711
I brought back llil4umap (the unordered_map variant). Your tip for releasing memory immediately resolved the long cleanup delay in that one. So, the umap version is nice to have for comparing the standard container vs. the powerful phmap library.
@greg7mdp There is extra global cleanup time when exiting the llil4map + string_cnt program.
It dawned on me to do testing involving mixed lengths. I created long?? input files containing 6 chars or 15 chars: basically, alternating 200 12 1 or 200 3 1 arguments to gen-llil.pl.
# create 26 random files
for n in $(perl -le "print for 'aa'..'az'"); do
    perl gen-llil.pl long$n 200 12 1
    perl shuffle.pl long$n >1; mv 1 long$n
done &

# create 26 random files
for n in $(perl -le "print for 'ba'..'bz'"); do
    perl gen-llil.pl long$n 200 3 1
    perl shuffle.pl long$n >2; mv 2 long$n
done &

# create 26 random files
for n in $(perl -le "print for 'ca'..'cz'"); do
    perl gen-llil.pl long$n 200 12 1
    perl shuffle.pl long$n >3; mv 3 long$n
done &

# create 14 random files (total 92 files)
for n in $(perl -le "print for 'da'..'dn'"); do
    perl gen-llil.pl long$n 200 3 1
    perl shuffle.pl long$n >4; mv 4 long$n
done &

wait
For this round of testing, the UNIX time is captured to detect any extra global/GC cleanup time. That is the case for string_cnt.
1. llil4map_greg - Greg's llil4map version with string_cnt
2. llil4map_long - llil4map with the MAX_STR_LEN_L line commented out
3. llil4map_fix16 - llil4map with MAX_STR_LEN_L 16
llil4map_greg
$ time ./llil4map_greg in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties 12.370 secs
map to vector 1.670 secs
vector stable sort 9.336 secs
write stdout 3.929 secs
total time 27.307 secs
count lines 970195200
count unique 295748977
3333632426 4307239210
real 0m43.433s <--- 16.126 seconds extra
user 23m22.291s
sys 0m42.527s
llil4map_long
WARNING: Do not run llil4map_long (it peaks at 32 GB); process longa* longa* longa* (the a suffix) instead.
$ time ./llil4map_long in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties 10.232 secs
map to vector 3.249 secs
vector stable sort 4.737 secs
write stdout 1.340 secs
total time 19.560 secs
count lines 970195200
count unique 295748977
3333632426 4307239210
real 0m20.269s
user 13m21.889s
sys 0m24.138s
llil4map_fix16
$ time ./llil4map_fix16 in/long* in/long* in/long* | cksum
llil4map (fixed string length=16) start
use OpenMP
use boost sort
get properties 8.672 secs
map to vector 1.553 secs
vector stable sort 1.931 secs
write stdout 1.476 secs
total time 13.633 secs
count lines 970195200
count unique 295748977
3333632426 4307239210
real 0m13.785s
user 10m23.834s
sys 0m13.176s
from parallel-hashmap.
Interesting!
What happens is that when you use std::string, the GNU standard library uses the same trick I did and does not allocate heap memory for strings up to 15 characters long (a 16-byte internal buffer; this is called the short string optimization, or SSO).
So in your test, with the mix of 6-character and 15-character strings, no heap memory was allocated for the strings whether you commented out MAX_STR_LEN_L or not.
However, with string_cnt, that same optimization is limited to 11 bytes.
If you tried a mix of 6/11-character or 6/17-character strings, the difference would probably be smaller.
The aim with string_cnt was to reduce memory usage; indeed, for string sizes from 12 to 15 characters it will be slower.
from parallel-hashmap.
@greg7mdp Thank you for the clarification about std::string SSO. I wanted the testing to be an apples-to-apples comparison.
What happens is when you use std::string ... (it is called the short string optimization or sso).
I will re-create the input with 19-character keys (to exceed the SSO limit) by passing alternating 200 16 1 and 200 3 1 arguments to gen-llil.pl, to see if std::string behaves the same way, i.e. long GC/global cleanup when the program exits. That was the idea of the testing: check overall behavior with dynamic memory allocation.
std::string does the same thing; long exit time due to global cleanup.
llil4map_greg
$ time ./llil4map_greg in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties 12.440 secs
map to vector 1.656 secs
vector stable sort 9.348 secs
write stdout 4.056 secs
total time 27.502 secs
count lines 970195200
count unique 295755152
29612263 5038456270
real 0m43.777s <--- 16.275s of this time is from GC/global cleanup
user 23m27.969s
sys 0m43.282s
llil4map_long
$ time ./llil4map_long in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties 13.111 secs
map to vector 3.153 secs
vector stable sort 13.529 secs
write stdout 3.055 secs
total time 32.849 secs
count lines 970195200
count unique 295755152
29612263 5038456270
real 0m49.815s <--- 16.966s of this time is from GC/global cleanup
user 20m59.538s
sys 0m59.482s
string_cnt shines, consuming less memory.
llil4map_greg peak 17.2 GB
llil4map_long peak > 36 GB
from parallel-hashmap.
std::string and string_cnt, once past the SSO limit, result in significantly longer exit times due to global cleanup. If only string_cnt
allowed one to specify the fixed-length limit, for MAX_STR_LEN_L
-like capability. A key size of 11 chars is quite small. Chuma, the OP in the LLiL PerlMonks thread, did not give us the max key length.
@greg7mdp Akin to the { sha1sum, sha224sum, sha256sum, sha384sum, sha512sum } naming, one could build llil4map in fixed-length modes { llil4map_12, llil4map_16, llil4map_20, llil4map_24, llil4map_30 }, plus llil4map_dyn (using string_cnt).
The Chuma LLiL challenge is solved. :-)
from parallel-hashmap.
The new update is wonderful.
I like the fact that the private method set no longer calls strlen, because the length of the key is already known in get_properties. Previously (version 1), lazy_emplace_l accepted beg_str; now a string_view variable is passed instead. It compiled without errors or warnings.
int_type count = fast_atoll64(found + 1);
std::string_view sv(beg_ptr, found - beg_ptr);
// Use lazy_emplace to modify the set while the mutex is locked.
set_ret.lazy_emplace_l(
sv,
[&](string_cnt_set_t::value_type& p) {
p.cnt += count; // called only when key was already present
},
[&](const string_cnt_set_t::constructor& ctor) {
ctor(std::move(sv), count); // construct value_type in place when key not present
}
);
I got get_properties to be faster...
Nice work, get_properties does run faster (set via strlen(s) + strcpy versus set via s.size() + memcpy). Before and after results.
$ ./llil4mapg big* big* big* | cksum
llil4map start
use OpenMP
use boost sort
get properties 8.104 secs 7.763 secs
map to vector 0.894 secs 0.937 secs
vector stable sort 1.859 secs 1.867 secs
write stdout 0.723 secs 0.674 secs
total time 11.581 secs 11.243 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
I re-compiled llil4map_greg.cc using string_cnt_t<20>. Two assertions failed, so I changed the first and third assert lines.
void set(std::string_view s) {
static_assert(buffsz == N);
static_assert(offsetof(string_cnt_t, cnt) == (intptr_t)buffsz);
// static_assert(sizeof(string_cnt_t) == N + extra_sz); // complained for string_cnt_t<20>
The moment has come :-) Let's try fixed-length 20, i.e. string_cnt_t<20>. Before and after results.
$ time ./llil4map_greg long* long* long* | cksum
llil4map start
use OpenMP
use boost sort
get properties 12.440 secs 9.270 secs improved
map to vector 1.656 secs 2.118 secs hmm, we can bump nthds_move
vector stable sort 9.348 secs 3.294 secs improved
write stdout 4.056 secs 1.668 secs improved
total time 27.502 secs 16.352 secs
count lines 970195200
count unique 295755152
29612263 5038456270
real 0m43.777s 0m16.823s improved, no GC side effects
user 23m27.969s 13m35.370s
sys 0m43.282s 0m15.187s
I bumped nthds_move
to 12. No need to go higher. Running again :-).
int nthds_move = std::min(nthds, 12);
$ time ./llil4map_greg long* long* long* | cksum
llil4map start
use OpenMP
use boost sort
get properties 12.440 secs 9.277 secs improved
map to vector 1.656 secs 1.853 secs ah yes, :-)
vector stable sort 9.348 secs 3.293 secs improved
write stdout 4.056 secs 1.626 secs improved
total time 27.502 secs 16.051 secs
count lines 970195200
count unique 295755152
29612263 5038456270
real 0m43.777s 0m16.517s improved, no GC side effects
user 23m27.969s 13m35.582s
sys 0m43.282s 0m15.290s
Wow! The improved string_cnt provides flexibility, allowing the developer to specify the SSO fixed-length size. Also, there is no problem if the input occasionally has long strings exceeding the SSO limit; it still works, with minimal impact on cleanup time.
At last, the LLiL challenge is solved. Thank you, @greg7mdp. The string_cnt
update is amazing.
P.S. Not fair :-) I too want to use string_cnt in my phmap demonstrations, via #include <parallel_hashmap/string_cnt.h> or #include <examples/string_cnt.h>.
from parallel-hashmap.
Very nice speedup! congrats.
from parallel-hashmap.
I created a gist (two files) containing llil4map_next.cc and string_cnt.h. Modification: cnt now holds both the count and the key length, i.e. cnt = (cnt << 8) + (len & 0xff). That improves the hash method, benefiting get_properties.
std::size_t hash() const {
auto s = getstr();
std::string_view sv {s, cnt & 0xff};
return std::hash<std::string_view>()(sv);
}
https://gist.github.com/marioroy/693d952b578792bf090fe20c2aaccad5
Edit: The gist contains three files, added string_cnt_u.h
.
from parallel-hashmap.
Sorting an array runs faster for fixed-length keys; that is the case for block_indirect_sort. So, I tried a union with std::array in the struct (see string_cnt_u.h).
union {
struct {
char * str;
char extra[extra_sz];
uint_t cnt;
} s_;
std::array<char, sizeof(s_) - sizeof(uint_t)> d_;
};
bool operator<(const string_cnt_t& o) const {
if (s_.extra[mark_idx] || o.s_.extra[mark_idx])
return std::strcmp(getstr(), o.getstr()) < 0;
else
return d_ < o.d_;
}
Before and after results.
$ ./llil4map big* big* big* | cksum
llil4map start
use OpenMP
use boost sort string_cnt.h string_cnt_u.h
get properties 7.763 secs 7.689 secs
map to vector 0.937 secs 0.836 secs
vector stable sort 1.867 secs 1.589 secs <-- improved (64 threads)
write stdout 0.674 secs 0.719 secs
total time 11.243 secs 10.835 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
Temporarily, I decreased the number of threads to 8 for sorting. Look at sorting go. :-)
$ ./llil4map big* big* big* | cksum
llil4map start
use OpenMP
use boost sort string_cnt.h string_cnt_u.h
get properties 7.703 secs 7.682 secs
map to vector 0.900 secs 0.898 secs
vector stable sort 9.264 secs 7.283 secs <-- improved (8 threads)
write stdout 0.763 secs 0.640 secs
total time 18.631 secs 16.506 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
Let's try string_cnt_t<20>
, again 8 threads for sorting, to get an idea.
$ ./llil4map long* long* long* | cksum
llil4map start
use OpenMP
use boost sort string_cnt.h string_cnt_u.h
get properties 9.161 secs 9.208 secs
map to vector 1.729 secs 1.860 secs
vector stable sort 15.586 secs 12.861 secs <-- improved (8 threads)
write stdout 1.625 secs 1.630 secs
total time 28.104 secs 25.560 secs
count lines 970195200
count unique 295755152
29612263 5038456270
Geez! All the union changes, just to improve sorting for fixed-length keys.
from parallel-hashmap.
return d_ < o.d_;
Hmm, this comparison is not necessarily correct if you have garbage after the terminating null character of the contained string, because the std::array operator< will not stop at the null character.
from parallel-hashmap.
If you set() a long string, the first 8 bytes contain a pointer. If you then set a 2-byte string "ab", the first 8 bytes will contain "ab\0?????". I imagine this doesn't happen in llil.
from parallel-hashmap.
Hi @greg7mdp,
You are correct. There could be garbage, fixed by clearing in set. Thank you!
std::memset(f_.data(), 0, N);
std::memcpy(f_.data(), s.data(), std::min(len, N - 1));
Fixing the garbage issue allowed me to improve the ==
operator in string_cnt_u.h
. Now, get_properties
runs as fast as my version without string_cnt
.
bool operator==(const string_cnt_t& o) const {
if (s_.extra[mark_idx] || o.s_.extra[mark_idx])
return std::strcmp(getstr(), o.getstr()) == 0;
else
return std::memcmp(getdata(), o.getdata(), N) == 0;
}
Freeing memory is now handled with multiple threads, in llil4map_next.cc
.
// Free dynamically allocated memory in parallel.
#pragma omp parallel for schedule(dynamic, 300000) num_threads(nthds)
for (size_t i = 0; i < propvec.size(); i++)
propvec[i].free();
I added a fixed-length-only version, string_cnt_f.h, nice to have for comparing with my version.
// #include "string_cnt.h" // original + key length stored with cnt
// #include "string_cnt_f.h" // fix-length only version
#include "string_cnt_u.h" // union std::array
Thank you for string_cnt.h. The program runs fast with string_cnt_u.h, and slightly faster with string_cnt_f.h.
from parallel-hashmap.
Greetings, @greg7mdp
I made a new gist containing my final phmap and vector demonstrations. The gist contains a README and LLiL utilities. There are three string_cnt_t variants, with extra comments.
C++ phmap and vector demonstrations using string_cnt_t
https://gist.github.com/marioroy/693d952b578792bf090fe20c2aaccad5
- llil4map_next.cc - phmap::parallel_flat_hash_map demonstration
- llil4vec_next.cc - similarly, a std::vector demonstration
- string_cnt.h - original + key length stored with count
- string_cnt_f.h - fixed-length-only version
- string_cnt_u.h - union std::array
- gen-files-big.sh - generate 92 big files (total 2.8 GB)
- gen-files-long.sh - generate 92 long files (total 5.0 GB)
- gen-llil.pl - generate LLiL test file
- shuffle.pl - shuffle LLiL test file
I learned more C++ from your string_cnt.h
example, so much appreciated. The LLiL challenge is solved. llil4map_next.cc
runs with low memory consumption and pleasantly fast using string_cnt_u.h
.
from parallel-hashmap.
Hi @greg7mdp, I ran get_properties on Linux with the ECHO CPU scheduler and the realtime patch set. I made a repo for building such a kernel on Clear Linux.
Processing long* long* long* previously took 16.9s; now 14.1s on a realtime kernel versus 11.5s non-RT. That took me by surprise, but a realtime kernel has the greater responsibility of sustaining low latency, and this is normal RT behavior for many threads involving locking or pipes/sockets. The other stages perform the same.
int_type count = fast_atoll64(found + 1);
string_cnt s{(const char *)beg_ptr, (size_t)(found - beg_ptr), count};
// Use lazy_emplace to modify the set while the mutex is locked.
set_ret.lazy_emplace_l(
s,
[&](string_cnt_set_t::value_type& p) {
p.incr(count); // called only when key was already present
},
[&](const string_cnt_set_t::constructor& ctor) {
ctor(std::move(s)); // construct value_type in place when key not present
}
);
from parallel-hashmap.
Hi @marioroy, thanks for the updates. On my side I've been busy and have not worked on llil at all; I still have interesting ideas I'd like to try out when I get a chance.
from parallel-hashmap.
I made an attempt at dynamic buffer allocation. Get properties decreased to 10.6s on a realtime Linux, now similar to running on a non-realtime kernel.
https://gist.github.com/marioroy/d02881b96b20fa1adde4388b3e216163
The dynamic allocation logic is thread-safe: the optional first argument is a thread id. It also rolls back the dynamic memory position if the long string was already present.
int_type count = fast_atoll64(found + 1);
string_cnt s{tid, beg_ptr, (size_t)(found - beg_ptr), count};
// Use lazy_emplace to modify the set while the mutex is locked.
set_ret.lazy_emplace_l(
s,
[&](string_cnt_set_t::value_type& p) {
p.incr(count); // called only when key was already present
s.mem_rollback(tid);
},
[&](const string_cnt_set_t::constructor& ctor) {
// construct value_type in place when key not present
// long strings will be stored in dynamic memory storage
ctor(std::move(s));
}
);
Notice the lower memory consumption. And there too, get properties completes faster than before.
$ time ./llil4map long* long* long* | cksum
llil4map start
use OpenMP ( old ) ( next ) ( buf )
memory consumption 36.6 GB 17.5 GB 16.0 GB
use boost sort
get properties 12.967s 11.527s 10.402s
map to vector 3.244s 1.252s 1.211s
vector stable sort 10.265s 8.708s 8.161s
write stdout 1.410s 1.388s 1.386s
total time 27.887s 22.877s 21.163s
count lines 970195200
count unique 295755152
29612263 5038456270
real time 0m43.973s 0m24.872s 0m21.909s
user 25m08.532s 22m58.120s 20m56.492s
sys 0m55.602s 0m48.783s 0m11.551s
Thank you, @greg7mdp. Your string_cnt_t
struct is how I learned to do this.
from parallel-hashmap.