Comments (9)
Is the tradeoff worth it?
Please grab https://github.com/lemire/simple_fastfloat_benchmark I believe that it is a reasonable benchmark.
Do cmake -B build && cmake --build build && ./run_bench.sh
from fast_float.
The differences in your benchmarks appear to reach ~40 cycles for short inputs. It seems unlikely that checking for the presence of eight digits using a fast path, which might in this case be a simple length comparison ("you can't even access 8 bytes, so don't even do the check"), would cost that many cycles.
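To make the check under discussion concrete, here is a minimal Python sketch of an eight-digit SWAR test guarded by a length comparison. It is an illustration only, not the library's C++ source; the function name and the exact constants are assumptions based on one common formulation of the trick.

```python
def is_made_of_eight_digits(chunk: bytes) -> bool:
    """Return True if the first 8 bytes of `chunk` are all ASCII digits."""
    # Length guard: with fewer than 8 bytes available, skip the SWAR check.
    if len(chunk) < 8:
        return False
    # Load 8 bytes as a single 64-bit little-endian word.
    val = int.from_bytes(chunk[:8], "little")
    # Any byte outside '0'..'9' leaves at least one 0x80 bit set in the
    # combined word: bytes above '9' through the addition (or, for very
    # large bytes, the subtraction), bytes below '0' through the borrowing
    # subtraction. The final mask also truncates the result to 64 bits.
    flags = ((val + 0x4646464646464646) | (val - 0x3030303030303030)) & 0x8080808080808080
    return flags == 0
```

For short inputs such as `b"83.11"`, the function returns at the length guard and never touches the 64-bit arithmetic.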
Please take the same library and change just one thing, one path, and measure the difference in performance.
@lemire My problem is that the tests all focus on numbers with many digits.
Take canada.txt, for instance. If you count the digits in each number, this is the distribution:
{17: 100717, 5: 28, 8: 635, 16: 7811, 2: 36, 3: 28, 15: 350, 9: 1384, 7: 50, 4: 42, 6: 45}
Over 90% of the numbers have 17 digits. I don't know about the other tests, but uniform random doubles will tend to have 16-17 significant figures when written out in a round-trippable form.
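A distribution like the one above can be produced with a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, and it counts every digit character in the significand (leading and trailing zeros included), which may differ from how the numbers above were counted.

```python
from collections import Counter

def digit_count(token: str) -> int:
    """Count the digit characters in the significand of one decimal literal."""
    # Drop any exponent part so that its digits are not counted.
    significand = token.lower().split("e", 1)[0]
    return sum(ch.isdigit() for ch in significand)

def digit_distribution(lines) -> Counter:
    """Map digit count -> number of values, skipping blank lines."""
    return Counter(digit_count(line.strip()) for line in lines if line.strip())
```

Assuming the benchmark's data file is at `data/canada.txt`, `digit_distribution(open("data/canada.txt"))` returns the counts as a `Counter`.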
If I shorten the numbers in canada.txt by writing them out in fixed notation with 2 decimal places, fast_float is much slower than your earlier fast_double_parser on short floats with few significant figures. This is the file I used for the test: canada_short.txt
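For anyone who wants to reproduce a file of this kind, the following sketch rewrites each value in fixed notation with two decimal places. The function name is hypothetical, and this is not necessarily how canada_short.txt was generated.

```python
def shorten(lines, places: int = 2):
    """Yield each non-blank input value rewritten in fixed notation."""
    for line in lines:
        text = line.strip()
        if text:
            # Parse as a double, then format with a fixed number of decimals.
            yield f"{float(text):.{places}f}"
```

Writing `"\n".join(shorten(open("data/canada.txt")))` to a new file gives one short fixed-point number per line.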
You are right: I'm seeing very little difference on low sig figs when removing the SWAR stuff, so it's probably not that.
The benchmark was compiled with gcc and run on an i5-1135G7.
Current impl
[mwalcott@fedora simple_fastfloat_benchmark]$ ./build/benchmarks/benchmark -f data/canada.txt
# read 111126 lines
volume = 1.93374 MB
netlib : 339.16 MB/s (+/- 1.9 %) 19.49 Mfloat/s 31.32 i/B 571.45 i/f (+/- 0.0 %) 0.15 bm/B 2.78 bm/f (+/- 1.3 %) 11.76 c/B 214.66 c/f (+/- 0.9 %) 2.66 i/c 4.18 GHz
doubleconversion : 320.10 MB/s (+/- 3.4 %) 18.40 Mfloat/s 51.16 i/B 933.48 i/f (+/- 0.0 %) 0.05 bm/B 0.83 bm/f (+/- 4.6 %) 12.47 c/B 227.48 c/f (+/- 2.1 %) 4.10 i/c 4.18 GHz
strtod : 195.83 MB/s (+/- 1.2 %) 11.25 Mfloat/s 70.30 i/B 1282.83 i/f (+/- 0.0 %) 0.15 bm/B 2.80 bm/f (+/- 0.8 %) 20.38 c/B 371.82 c/f (+/- 0.9 %) 3.45 i/c 4.18 GHz
abseil : 487.79 MB/s (+/- 1.9 %) 28.03 Mfloat/s 30.17 i/B 550.47 i/f (+/- 0.0 %) 0.03 bm/B 0.60 bm/f (+/- 0.5 %) 8.18 c/B 149.25 c/f (+/- 1.3 %) 3.69 i/c 4.18 GHz
fastfloat : 1053.34 MB/s (+/- 1.9 %) 60.53 Mfloat/s 15.58 i/B 284.27 i/f (+/- 0.0 %) 0.01 bm/B 0.11 bm/f (+/- 7.9 %) 3.79 c/B 69.14 c/f (+/- 1.5 %) 4.11 i/c 4.19 GHz
fast_double_parser : 1139.63 MB/s (+/- 1.4 %) 65.49 Mfloat/s 12.64 i/B 230.56 i/f (+/- 0.0 %) 0.01 bm/B 0.11 bm/f (+/- 0.3 %) 3.50 c/B 63.90 c/f (+/- 0.9 %) 3.61 i/c 4.18 GHz
[mwalcott@fedora simple_fastfloat_benchmark]$ ./build/benchmarks/benchmark -f data/canada_short.txt
# read 111126 lines
volume = 0.598098 MB
netlib : 464.83 MB/s (+/- 8.3 %) 86.37 Mfloat/s 34.90 i/B 196.98 i/f (+/- 0.0 %) 0.03 bm/B 0.15 bm/f (+/- 1.3 %) 8.58 c/B 48.45 c/f (+/- 0.8 %) 4.07 i/c 4.18 GHz
doubleconversion : 284.03 MB/s (+/- 3.3 %) 52.77 Mfloat/s 66.27 i/B 373.98 i/f (+/- 0.0 %) 0.02 bm/B 0.11 bm/f (+/- 1.7 %) 14.04 c/B 79.25 c/f (+/- 2.6 %) 4.72 i/c 4.18 GHz
strtod : 121.46 MB/s (+/- 1.3 %) 22.57 Mfloat/s 129.04 i/B 728.24 i/f (+/- 0.0 %) 0.12 bm/B 0.66 bm/f (+/- 0.8 %) 32.86 c/B 185.46 c/f (+/- 1.0 %) 3.93 i/c 4.19 GHz
abseil : 205.37 MB/s (+/- 1.3 %) 38.16 Mfloat/s 70.69 i/B 398.96 i/f (+/- 0.0 %) 0.08 bm/B 0.46 bm/f (+/- 0.7 %) 19.43 c/B 109.65 c/f (+/- 0.9 %) 3.64 i/c 4.18 GHz
fastfloat : 489.87 MB/s (+/- 2.0 %) 91.02 Mfloat/s 34.52 i/B 194.84 i/f (+/- 0.0 %) 0.01 bm/B 0.04 bm/f (+/- 1.2 %) 8.15 c/B 45.98 c/f (+/- 1.3 %) 4.24 i/c 4.19 GHz
fast_double_parser : 1239.27 MB/s (+/- 5.7 %) 230.26 Mfloat/s 16.06 i/B 90.65 i/f (+/- 0.0 %) 0.00 bm/B 0.00 bm/f (+/- 1.5 %) 3.22 c/B 18.20 c/f (+/- 3.6 %) 4.98 i/c 4.19 GHz
Removing the 8 char SWAR stuff.
[mwalcott@fedora simple_fastfloat_benchmark]$ ./build/benchmarks/benchmark -f data/canada.txt
# read 111126 lines
volume = 1.93374 MB
netlib : 335.22 MB/s (+/- 1.5 %) 19.26 Mfloat/s 31.32 i/B 571.45 i/f (+/- 0.0 %) 0.15 bm/B 2.71 bm/f (+/- 0.7 %) 11.90 c/B 217.19 c/f (+/- 0.6 %) 2.63 i/c 4.18 GHz
doubleconversion : 329.30 MB/s (+/- 1.7 %) 18.92 Mfloat/s 51.16 i/B 933.48 i/f (+/- 0.0 %) 0.03 bm/B 0.62 bm/f (+/- 0.6 %) 12.12 c/B 221.08 c/f (+/- 1.4 %) 4.22 i/c 4.18 GHz
strtod : 194.08 MB/s (+/- 1.5 %) 11.15 Mfloat/s 70.30 i/B 1282.83 i/f (+/- 0.0 %) 0.15 bm/B 2.81 bm/f (+/- 0.8 %) 20.56 c/B 375.14 c/f (+/- 1.3 %) 3.42 i/c 4.18 GHz
abseil : 481.38 MB/s (+/- 2.6 %) 27.66 Mfloat/s 30.17 i/B 550.47 i/f (+/- 0.0 %) 0.04 bm/B 0.67 bm/f (+/- 9.9 %) 8.29 c/B 151.23 c/f (+/- 2.3 %) 3.64 i/c 4.18 GHz
fastfloat : 811.33 MB/s (+/- 1.1 %) 46.62 Mfloat/s 17.70 i/B 322.90 i/f (+/- 0.0 %) 0.01 bm/B 0.10 bm/f (+/- 0.2 %) 4.92 c/B 89.73 c/f (+/- 0.6 %) 3.60 i/c 4.18 GHz
[mwalcott@fedora simple_fastfloat_benchmark]$ ./build/benchmarks/benchmark -f data/canada_short.txt
# read 111126 lines
volume = 0.598098 MB
netlib : 460.80 MB/s (+/- 6.7 %) 85.62 Mfloat/s 34.90 i/B 196.98 i/f (+/- 0.0 %) 0.03 bm/B 0.16 bm/f (+/- 1.3 %) 8.66 c/B 48.87 c/f (+/- 0.7 %) 4.03 i/c 4.18 GHz
doubleconversion : 279.87 MB/s (+/- 3.2 %) 52.00 Mfloat/s 66.27 i/B 373.98 i/f (+/- 0.0 %) 0.02 bm/B 0.11 bm/f (+/- 3.0 %) 14.26 c/B 80.47 c/f (+/- 2.4 %) 4.65 i/c 4.18 GHz
strtod : 121.77 MB/s (+/- 1.6 %) 22.63 Mfloat/s 129.04 i/B 728.24 i/f (+/- 0.0 %) 0.12 bm/B 0.67 bm/f (+/- 0.5 %) 32.76 c/B 184.86 c/f (+/- 1.4 %) 3.94 i/c 4.18 GHz
abseil : 205.98 MB/s (+/- 1.3 %) 38.27 Mfloat/s 70.69 i/B 398.96 i/f (+/- 0.0 %) 0.08 bm/B 0.46 bm/f (+/- 0.5 %) 19.38 c/B 109.35 c/f (+/- 0.9 %) 3.65 i/c 4.19 GHz
fastfloat : 506.06 MB/s (+/- 3.1 %) 94.03 Mfloat/s 32.75 i/B 184.84 i/f (+/- 0.0 %) 0.01 bm/B 0.04 bm/f (+/- 1.6 %) 7.89 c/B 44.51 c/f (+/- 2.8 %) 4.15 i/c 4.18 GHz
Why is Clinger's fast path only being applied to positive exponents? I feel like all the numbers I'm using in canada_short should be able to fast path (Small significand and an exponent of -2).
Edit:
Nvm saw issue #149. That is incredibly annoying. Changing the floating point rounding mode and not resetting it seems almost like the user is asking for problems.
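For readers unfamiliar with the fast path in question, here is a minimal Python sketch of Clinger's idea, covering both signs of the exponent. It assumes the default round-to-nearest mode, which is exactly the assumption the rounding-mode concern calls into question. The names are hypothetical and the code is not taken from fast_float.

```python
# Powers of ten that a double represents exactly: 10**0 through 10**22.
POWERS_OF_TEN = [float(10**i) for i in range(23)]
MAX_SIGNIFICAND = 2**53  # every integer up to 2**53 is exact in a double

def clinger_fast_path(w: int, q: int):
    """Return w * 10**q as a float when one exact operation suffices, else None."""
    if not (0 <= w <= MAX_SIGNIFICAND and -22 <= q <= 22):
        return None  # a slower, general algorithm would be needed here
    # Both operands are exact doubles, so a single IEEE multiply or divide
    # gives the correctly rounded result, provided the mode is round-to-nearest.
    if q < 0:
        return float(w) / POWERS_OF_TEN[-q]
    return float(w) * POWERS_OF_TEN[q]
```

A value such as 83.11 from canada_short.txt corresponds to `clinger_fast_path(8311, -2)`, which takes the division branch.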
Changing the floating point rounding mode and not resetting it seems almost like the user is asking for problems.
Agreed but the C++ specification is what it is.
Thanks for the extra file, I have added it to my benchmark.
Looking at your numbers, we have the following numbers of instructions per float for fast_float: 184.84 i/f and 194.84 i/f (canada_short.txt) and 322.90 i/f and 284.27 i/f (canada.txt). You add 10 instructions on the one hand, and you save 39 instructions on the other hand.
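The same comparison, spelled out as arithmetic on the fastfloat rows of the benchmark output posted earlier. The dictionary layout and names are just for illustration; the figures are copied from the thread.

```python
# fastfloat figures from the benchmark output in this thread.
measurements = {
    "canada_short.txt": {
        "i/f": {"with_swar": 194.84, "without_swar": 184.84},
        "c/f": {"with_swar": 45.98, "without_swar": 44.51},
    },
    "canada.txt": {
        "i/f": {"with_swar": 284.27, "without_swar": 322.90},
        "c/f": {"with_swar": 69.14, "without_swar": 89.73},
    },
}

def swar_delta(metric: str) -> dict:
    """Per-file change caused by the SWAR path (positive means SWAR costs more)."""
    return {
        name: round(rows[metric]["with_swar"] - rows[metric]["without_swar"], 2)
        for name, rows in measurements.items()
    }
```

Calling `swar_delta("i/f")` gives the instruction difference per file, and `swar_delta("c/f")` gives the same comparison in cycles per float.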
It depends what you favour. The canada file is derived from actual data. It is unclear to me which application would match the canada_short file. Is that the sort of data you encounter in your work?
I'd be most interested in real-world reports. Please note that we always try to optimize for the data people do have.
I am tuning the performance based on additional tests which take into account canada_short. Note that I weight canada_short somewhat less because I consider it less likely to be realistic.
See #152
Is that the sort of data you encounter in your work?
Rarely; it's a bit of a contrived example. Some old Fortran codes I interface with output ASCII files with short fixed-point decimal numbers (in that case no one really cares about 1 ulp errors or accuracy in general, so maybe fast_float is overkill). For the most part I expect to deal with numbers closer to what is seen in canada.txt.
I just thought that simple low-digit fixed-point performance should be tested, because it sometimes pops up and there appears to be a performance regression there compared to fast_double_parser.
fastfloat : 489.87 MB/s (+/- 2.0 %) 91.02 Mfloat/s 34.52 i/B 194.84 i/f (+/- 0.0 %) 0.01 bm/B 0.04 bm/f (+/- 1.2 %) 8.15 c/B 45.98 c/f (+/- 1.3 %) 4.24 i/c 4.19 GHz
fast_double_parser : 1239.27 MB/s (+/- 5.7 %) 230.26 Mfloat/s 16.06 i/B 90.65 i/f (+/- 0.0 %) 0.00 bm/B 0.00 bm/f (+/- 1.5 %) 3.22 c/B 18.20 c/f (+/- 3.6 %) 4.98 i/c 4.19 GHz
But that appears to be primarily related to other issues. For instance, re-enabling the fast path for negative exponents (not standard compliant) immediately gave a significant performance boost when parsing the short numbers. I think it was close to 50%. But that's not something that can really be changed.
For instance, re-enabling the fast path for negative exponents
Right. There might be room for more clever approaches. Unfortunately, testing for the rounding mode each time is too expensive.
I will close this issue for now. It would be very helpful to give more thought to ways around the performance regression you allude to. Of course, we could simply make it a compile-time option, or even a runtime option. But it is less useful than it sounds because the default would have to be standard compliance... sadly.
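As an illustration of what a rounding-mode test can look like, here is a sketch of a purely arithmetic probe. It is not something proposed in this thread, the function name is hypothetical, and the Python standard library offers no way to change the rounding mode, so this only demonstrates the default case.

```python
import sys

def rounds_to_nearest() -> bool:
    """Probe the floating-point rounding mode using two additions."""
    tiny = sys.float_info.min  # smallest positive normal double
    # Under round-to-nearest, both sides round to exactly 1.0.
    # Under a directed mode (up, down, toward zero), at least one side
    # rounds to a neighbour of 1.0, so the comparison fails.
    return (1.0 + tiny) == (1.0 - tiny)
```

A parser could call such a probe before taking a division-based fast path, at the price of paying for it on every call, which is the cost concern raised above.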
Thanks for the report.