Process 1 billion rows of text data as fast as possible
https://www.morling.dev/blog/one-billion-row-challenge/
Discussion thread: gunnarmorling/1brc#138
Use sha256sum to check that the output matches the reference output: 016930801788eb421a15cf6def8ea435b4b47fb5f41df09e02ecdd7fbc9ac92b result.txt
I used this file (generated by ./create_measurements.sh 1000000000) to test:
https://drive.google.com/file/d/1HEyNw4M453n0tnuaAm9nwaCiLydQYnpo/view?usp=sharing
To run, download the file above, extract it, then run ./run_cpp.sh
To run with 8 threads for comparison with other submissions, set N_THREADS = 8 in 1brc_final_valid.cpp
- Unsigned int overflow hashing: cheapest hash method possible.
- SIMD hashing
- SIMD for string comparison in hash table probing
- Exploit properties of the actual data:
  - 99% of station names have length <= 16: use a compiler hint and implement SIMD for this specific case. If length > 16, use a fallback, which still meets the requirement of MAX_KEY_LENGTH = 100.
  - -99.9 <= temperature <= 99.9 is guaranteed: use branchless code that relies on this property.
- Use mmap for fast file reading
- Use multithreading for both parsing the file and aggregating the data
- Other random tricks (intentional ordering of variable assignments)
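
A minimal sketch of the unsigned int overflow hashing idea from the list above. This is not the code from 1brc_final_valid.cpp: the names wrap_hash and slot_for, the multiplier 31, and the power-of-two table size are all assumptions chosen for illustration. The point is only that unsigned arithmetic wraps modulo 2^32 by definition in C++, so the hash needs no explicit modulo step.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper (not from the original source).
// Multiply-and-add over the key bytes. uint32_t arithmetic wraps
// modulo 2^32, which is well defined for unsigned types, so the
// "overflow" acts as a free modulo.
inline std::uint32_t wrap_hash(std::string_view key) {
    std::uint32_t h = 0;
    for (unsigned char c : key) {
        h = h * 31u + c;  // multiplier 31 is an arbitrary assumption
    }
    return h;
}

// Hypothetical helper: map the hash to a slot of a table whose size is a
// power of two, using a mask instead of the slower '%' operator.
inline std::size_t slot_for(std::string_view key, std::size_t table_size_pow2) {
    return wrap_hash(key) & (table_size_pow2 - 1);
}
```

For example, slot_for("ab", 1024) hashes the two bytes to 97 * 31 + 98 = 3105 and masks that down to slot 33. A hash this cheap collides more often than a stronger one, so it relies on the probing step to compare the actual station names.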
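
The guaranteed temperature range means every reading has one of four shapes: d.d, dd.d, -d.d, or -dd.d. A sketch of how that can be parsed without if statements follows. It is an illustration under that assumption, not the parser from 1brc_final_valid.cpp; the names Parsed and parse_temperature are hypothetical, and the value is returned in tenths of a degree to stay in integer arithmetic.

```cpp
// Hypothetical result type: value in tenths of a degree, plus the number
// of characters consumed (sign, digits and the decimal point).
struct Parsed {
    int tenths;
    int length;
};

// Hypothetical helper (not from the original source).
// Assumes the input is exactly one of: d.d, dd.d, -d.d, -dd.d.
// Comparisons are turned into 0/1 integers and used in arithmetic, so the
// source contains no if/else; compilers typically emit branch-free code.
inline Parsed parse_temperature(const char* p) {
    const int neg = (p[0] == '-');  // 1 if there is a leading minus sign
    p += neg;                       // skip the sign when present
    const int two = (p[1] != '.');  // 1 if there are two integer digits

    const int hi = p[0] - '0';            // first integer digit
    const int mid = (p[1] - '0') * two;   // second digit, or 0 when absent
    const int frac = p[2 + two] - '0';    // digit after the decimal point

    // One integer digit: hi*10 + frac. Two digits: hi*100 + mid*10 + frac.
    const int magnitude = hi * (10 + 90 * two) + mid * 10 + frac;

    // Conditional negate without a branch: (x ^ -1) + 1 == -x.
    const int tenths = (magnitude ^ -neg) + neg;

    return Parsed{tenths, neg + 3 + two};
}
```

For example, parse_temperature("-99.9") yields tenths = -999 and length = 5, and parse_temperature("0.0") yields tenths = 0 and length = 3. The function only reads characters that belong to the number itself, so it does not touch bytes past the end of the last line.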