GithubHelp home page GithubHelp logo

alugowski / parse-bench Goto Github PK

View Code? Open in Web Editor NEW
3.0 1.0 0.0 1.18 MB

Benchmark different approaches to parsing scientific datafiles.

License: BSD 2-Clause "Simplified" License

CMake 3.64% C++ 32.87% Jupyter Notebook 59.62% Shell 1.34% Python 2.53%

parse-bench's Introduction

Benchmark different approaches to parsing scientific datafiles.

In particular focus on text datafiles dominated by integers and floating-point fields. For example, each line may be two integers and a float:

111 2222 3.333
444 555 6E6

C++ IO Streams or scanf are often the go-to choice. While easy to use and very featureful, their throughput is low. Let's find something faster.

Parse Bench

Benchmarks in this repo measure throughput of several approaches for each of several parsing steps to show the performance impact of each approach:

  • Scan: a simple iteration reading each byte. Shows the roofline performance supported by the memory subsystem.
  • Newline detection: A scan that finds newlines. Show the performance hit of testing characters.
  • Line tokenization: A scan that also splits each line into fields based on whitespace.
  • Parse integers: Test approaches to parsing integers.
  • Parse double: Test approaches to parsing floating point numbers.
  • Parse block: Use lessons learned to write a parser that can parse a simple datafile.

The datafile lines we test contain two integers and a double. Of course other schemas will show different performance, but this provides good ballpark figures.

Parsing methods do not attempt to cheat by skipping error checking (where possible) as a real parser will have to perform error checking.

Results

The core result is bytes per second for each approach. That is the only figure shared by all the different steps, and the most useful figure to estimate how long a parser will take to parse a file.

string to integer conversion string to double conversion parallel parse scaling

Formatting:

double_to_string conversion

We use short and long strings to attempt to not cherry-pick easy or hard inputs. See main.cpp for what the input data looks like.

See plots for a Jupyter notebook that reads the Google Benchmark output and plots relevant results. The provided results were run on an M1 Macbook Pro.

Takeaways

  • Datafile parse performance is CPU bound due to parsing, not IO bound. By a lot.
  • IO Streams are very slow, about 5x slower than a mechanical hard drive, and get slower when used in parallel.
  • A fast approach uses fast methods to identify fields then use fast field conversion methods.
  • C routines like strtoll and strtod are usable (sequentially only).
  • The C++17 std::from_chars methods are fast. The float version has spotty compiler support, use fast_float.
  • Avoid scanf. Slow, gets slower when used in parallel, poor error checking.
  • Parallelism: an efficient parallel parser can reach >1 GB/s speeds on a laptop.
  • Parallelism: standard library methods lock internally (on the locale). This kills parallel performance. Examples: istringstream, strtod, scanf, etc.

Parallelism

CPU-bound operations have a natural path to speedup: parallelism. We include some parallel parsing here, with many simplifying (but realistic) assumptions to focus on speeding up parsing only.

All parallel tests included here are embarassingly parallel. All threads have their own unique inputs and nothing is shared between threads (at least at the application level, see note below).

Some methods that are good options for single-threaded use are either not thread safe or include internal synchronization that kills multithreaded performance. For example, using strtod in parallel is slower than single threaded, at least on some platforms. The problem appears to be locale support which requires locking. Some platforms like BSD and macOS include a strtod_l which can avoid this, but it is not portable.

Run on Your Own System

Building and running the benchmarks is fully automated. Just run:

./run_all_benchmarks.sh

Which will write benchmark output to the console and a JSON version to plots/benchmark_outputs/. This will not overwrite anything.

Then open the plots/plot.ipynb notebook in Jupyter and run all the cells to see the plots with data from your system.

This will install and open Jupyter (in a virtual env) for you:

cd plots/
./start_jupyter.sh

Then select "Cells" -> "Run All".

macOS

The version of clang bundled with macOS does not support OpenMP. Parallel benchmarks are disabled if OpenMP is not found.

Install Homebrew's LLVM and this project's cpp/run.sh will pick it up.

brew install llvm

You can also try to install OpenMP with brew install libomp. That works sometimes.

parse-bench's People

Contributors

alugowski avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar

parse-bench's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.