With regards to is_made_of_eight_digits_fast and parse_eight_digits_unrolled. Processi

Is the tradeoff of processing 8 chars at a time worth it? (fast_float, closed)

mwalcott3 commented on June 2, 2024

Is the tradeoff of processing 8 chars at a time worth it?

from fast_float.

Comments (9)

lemire commented on June 2, 2024

> Is the tradeoff worth it?

Please grab https://github.com/lemire/simple_fastfloat_benchmark, which I believe is a reasonable benchmark.

Do cmake -B build && cmake --build build && ./run_bench.sh

lemire commented on June 2, 2024

The differences in your benchmarks appear to reach ~40 cycles for short inputs. It seems unlikely that checking for the presence of eight digits using a fast path, which might in this case be a simple length comparison ("you can't even access 8 bytes, so don't even do the check"), would cost that many cycles.

Please take the same library, change just one thing, one path, and measure the difference in performance.
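For readers unfamiliar with the technique under discussion, here is a minimal sketch of an eight-digit SWAR check guarded by a length comparison. The function names (`all_eight_digits`, `parse_eight_digits`, `try_parse_eight`) are made up for this illustration and are not the library's API; the sketch assumes ASCII input and a little-endian machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns true when all eight bytes packed in `val` are ASCII digits '0'..'9'.
inline bool all_eight_digits(std::uint64_t val) {
  // For ASCII input: adding 0x46 sets the high bit of a byte above '9',
  // and subtracting 0x30 wraps a byte below '0' so its high bit is set.
  return (((val + 0x4646464646464646ULL) | (val - 0x3030303030303030ULL)) &
          0x8080808080808080ULL) == 0;
}

// Converts eight packed ASCII digits to an integer.
// Assumes the first character sits in the lowest byte (little-endian load).
inline std::uint32_t parse_eight_digits(std::uint64_t val) {
  const std::uint64_t mask = 0x000000FF000000FFULL;
  const std::uint64_t mul1 = 0x000F424000000064ULL;  // 100 + (1000000 << 32)
  const std::uint64_t mul2 = 0x0000271000000001ULL;  // 1 + (10000 << 32)
  val -= 0x3030303030303030ULL;   // ASCII codes -> digit values, one per byte
  val = (val * 10) + (val >> 8);  // even bytes now hold two-digit pairs
  val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;
  return static_cast<std::uint32_t>(val);
}

// Hypothetical wrapper: the length comparison comes first, so inputs shorter
// than eight bytes never reach the SWAR check at all.
inline bool try_parse_eight(const char* p, const char* end, std::uint32_t& out) {
  if (end - p < 8) return false;  // cannot even read 8 bytes
  std::uint64_t val;
  std::memcpy(&val, p, sizeof(val));  // safe unaligned load
  if (!all_eight_digits(val)) return false;
  out = parse_eight_digits(val);
  return true;
}
```

For short inputs, only the `end - p < 8` comparison runs, which is the "simple length comparison" case described above.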

mwalcott3 commented on June 2, 2024

@lemire My problem is the tests are all focused on high numbers of digits.

Take canada.txt, for instance. If you count the digits in each number, this is the distribution:
{17: 100717, 5: 28, 8: 635, 16: 7811, 2: 36, 3: 28, 15: 350, 9: 1384, 7: 50, 4: 42, 6: 45}
Over 90% of the numbers have 17 digits. I don't know about the other tests, but uniform random doubles will tend to have 16-17 significant figures when written out in a round-trippable form.
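A distribution like the one above can be produced with a few lines of code. The sketch below is one possible way to do it, not the script actually used; it assumes "digits" means every decimal digit character in the number's text, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Counts the decimal digit characters in one number's text,
// ignoring sign and decimal point.
inline int count_digits(const std::string& text) {
  int n = 0;
  for (unsigned char c : text) {
    if (std::isdigit(c)) ++n;
  }
  return n;
}

// Maps "number of digits" -> "how many input lines have that many digits".
inline std::map<int, int> digit_histogram(const std::vector<std::string>& lines) {
  std::map<int, int> hist;
  for (const std::string& line : lines) {
    ++hist[count_digits(line)];
  }
  return hist;
}
```

Feeding each line of a data file to `digit_histogram` yields a map in the same shape as the distribution quoted above.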

If I shorten the digits in canada.txt by writing them out in fixed notation with 2 decimal places, fast_float is much slower than your earlier fast_double_parser on short floats with few significant figures. This is the file I used for the test: canada_short.txt
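As an illustration of that shortening step, a value can be rewritten in fixed notation with two decimal places using `std::snprintf`. This is only a sketch of one way to generate such a file; `to_fixed2` is a hypothetical helper and the default "C" locale is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats x in fixed notation with exactly two digits after the decimal point.
inline std::string to_fixed2(double x) {
  char buf[512];  // large enough for any finite double printed with "%.2f"
  int n = std::snprintf(buf, sizeof(buf), "%.2f", x);
  return n > 0 ? std::string(buf, static_cast<std::size_t>(n)) : std::string();
}
```

For example, `to_fixed2(-65.613616999999977)` gives `"-65.61"`, a four-digit number in place of a 17-digit one.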

You are right: I'm seeing very little difference on low sig figs when removing the SWAR stuff, so it's probably not that.

The benchmark was compiled with gcc and run on an i5-1135G7.

Current impl

[mwalcott@fedora simple_fastfloat_benchmark]$ ./build/benchmarks/benchmark -f data/canada.txt 
# read 111126 lines 
volume = 1.93374 MB 
netlib                                  :   339.16 MB/s (+/- 1.9 %)    19.49 Mfloat/s      31.32 i/B   571.45 i/f (+/- 0.0 %)      0.15 bm/B     2.78 bm/f (+/- 1.3 %)     11.76 c/B   214.66 c/f (+/- 0.9 %)      2.66 i/c      4.18 GHz 
doubleconversion                        :   320.10 MB/s (+/- 3.4 %)    18.40 Mfloat/s      51.16 i/B   933.48 i/f (+/- 0.0 %)      0.05 bm/B     0.83 bm/f (+/- 4.6 %)     12.47 c/B   227.48 c/f (+/- 2.1 %)      4.10 i/c      4.18 GHz 
strtod                                  :   195.83 MB/s (+/- 1.2 %)    11.25 Mfloat/s      70.30 i/B  1282.83 i/f (+/- 0.0 %)      0.15 bm/B     2.80 bm/f (+/- 0.8 %)     20.38 c/B   371.82 c/f (+/- 0.9 %)      3.45 i/c      4.18 GHz 
abseil                                  :   487.79 MB/s (+/- 1.9 %)    28.03 Mfloat/s      30.17 i/B   550.47 i/f (+/- 0.0 %)      0.03 bm/B     0.60 bm/f (+/- 0.5 %)      8.18 c/B   149.25 c/f (+/- 1.3 %)      3.69 i/c      4.18 GHz 
fastfloat                               :  1053.34 MB/s (+/- 1.9 %)    60.53 Mfloat/s      15.58 i/B   284.27 i/f (+/- 0.0 %)      0.01 bm/B     0.11 bm/f (+/- 7.9 %)      3.79 c/B    69.14 c/f (+/- 1.5 %)      4.11 i/c      4.19 GHz 
fast_double_parser                      :  1139.63 MB/s (+/- 1.4 %)    65.49 Mfloat/s      12.64 i/B   230.56 i/f (+/- 0.0 %)      0.01 bm/B     0.11 bm/f (+/- 0.3 %)      3.50 c/B    63.90 c/f (+/- 0.9 %)      3.61 i/c      4.18 GHz 
[mwalcott@fedora simple_fastfloat_benchmark]$ ./build/benchmarks/benchmark -f data/canada_short.txt 
# read 111126 lines 
volume = 0.598098 MB 
netlib                                  :   464.83 MB/s (+/- 8.3 %)    86.37 Mfloat/s      34.90 i/B   196.98 i/f (+/- 0.0 %)      0.03 bm/B     0.15 bm/f (+/- 1.3 %)      8.58 c/B    48.45 c/f (+/- 0.8 %)      4.07 i/c      4.18 GHz 
doubleconversion                        :   284.03 MB/s (+/- 3.3 %)    52.77 Mfloat/s      66.27 i/B   373.98 i/f (+/- 0.0 %)      0.02 bm/B     0.11 bm/f (+/- 1.7 %)     14.04 c/B    79.25 c/f (+/- 2.6 %)      4.72 i/c      4.18 GHz 
strtod                                  :   121.46 MB/s (+/- 1.3 %)    22.57 Mfloat/s     129.04 i/B   728.24 i/f (+/- 0.0 %)      0.12 bm/B     0.66 bm/f (+/- 0.8 %)     32.86 c/B   185.46 c/f (+/- 1.0 %)      3.93 i/c      4.19 GHz 
abseil                                  :   205.37 MB/s (+/- 1.3 %)    38.16 Mfloat/s      70.69 i/B   398.96 i/f (+/- 0.0 %)      0.08 bm/B     0.46 bm/f (+/- 0.7 %)     19.43 c/B   109.65 c/f (+/- 0.9 %)      3.64 i/c      4.18 GHz 
fastfloat                               :   489.87 MB/s (+/- 2.0 %)    91.02 Mfloat/s      34.52 i/B   194.84 i/f (+/- 0.0 %)      0.01 bm/B     0.04 bm/f (+/- 1.2 %)      8.15 c/B    45.98 c/f (+/- 1.3 %)      4.24 i/c      4.19 GHz 
fast_double_parser                      :  1239.27 MB/s (+/- 5.7 %)   230.26 Mfloat/s      16.06 i/B    90.65 i/f (+/- 0.0 %)      0.00 bm/B     0.00 bm/f (+/- 1.5 %)      3.22 c/B    18.20 c/f (+/- 3.6 %)      4.98 i/c      4.19 GHz

Removing the 8 char SWAR stuff.

[mwalcott@fedora simple_fastfloat_benchmark]$ ./build/benchmarks/benchmark -f data/canada.txt 
# read 111126 lines 
volume = 1.93374 MB 
netlib                                  :   335.22 MB/s (+/- 1.5 %)    19.26 Mfloat/s      31.32 i/B   571.45 i/f (+/- 0.0 %)      0.15 bm/B     2.71 bm/f (+/- 0.7 %)     11.90 c/B   217.19 c/f (+/- 0.6 %)      2.63 i/c      4.18 GHz 
doubleconversion                        :   329.30 MB/s (+/- 1.7 %)    18.92 Mfloat/s      51.16 i/B   933.48 i/f (+/- 0.0 %)      0.03 bm/B     0.62 bm/f (+/- 0.6 %)     12.12 c/B   221.08 c/f (+/- 1.4 %)      4.22 i/c      4.18 GHz 
strtod                                  :   194.08 MB/s (+/- 1.5 %)    11.15 Mfloat/s      70.30 i/B  1282.83 i/f (+/- 0.0 %)      0.15 bm/B     2.81 bm/f (+/- 0.8 %)     20.56 c/B   375.14 c/f (+/- 1.3 %)      3.42 i/c      4.18 GHz 
abseil                                  :   481.38 MB/s (+/- 2.6 %)    27.66 Mfloat/s      30.17 i/B   550.47 i/f (+/- 0.0 %)      0.04 bm/B     0.67 bm/f (+/- 9.9 %)      8.29 c/B   151.23 c/f (+/- 2.3 %)      3.64 i/c      4.18 GHz 
fastfloat                               :   811.33 MB/s (+/- 1.1 %)    46.62 Mfloat/s      17.70 i/B   322.90 i/f (+/- 0.0 %)      0.01 bm/B     0.10 bm/f (+/- 0.2 %)      4.92 c/B    89.73 c/f (+/- 0.6 %)      3.60 i/c      4.18 GHz
[mwalcott@fedora simple_fastfloat_benchmark]$ ./build/benchmarks/benchmark -f data/canada_short.txt 
# read 111126 lines 
volume = 0.598098 MB 
netlib                                  :   460.80 MB/s (+/- 6.7 %)    85.62 Mfloat/s      34.90 i/B   196.98 i/f (+/- 0.0 %)      0.03 bm/B     0.16 bm/f (+/- 1.3 %)      8.66 c/B    48.87 c/f (+/- 0.7 %)      4.03 i/c      4.18 GHz 
doubleconversion                        :   279.87 MB/s (+/- 3.2 %)    52.00 Mfloat/s      66.27 i/B   373.98 i/f (+/- 0.0 %)      0.02 bm/B     0.11 bm/f (+/- 3.0 %)     14.26 c/B    80.47 c/f (+/- 2.4 %)      4.65 i/c      4.18 GHz 
strtod                                  :   121.77 MB/s (+/- 1.6 %)    22.63 Mfloat/s     129.04 i/B   728.24 i/f (+/- 0.0 %)      0.12 bm/B     0.67 bm/f (+/- 0.5 %)     32.76 c/B   184.86 c/f (+/- 1.4 %)      3.94 i/c      4.18 GHz 
abseil                                  :   205.98 MB/s (+/- 1.3 %)    38.27 Mfloat/s      70.69 i/B   398.96 i/f (+/- 0.0 %)      0.08 bm/B     0.46 bm/f (+/- 0.5 %)     19.38 c/B   109.35 c/f (+/- 0.9 %)      3.65 i/c      4.19 GHz 
fastfloat                               :   506.06 MB/s (+/- 3.1 %)    94.03 Mfloat/s      32.75 i/B   184.84 i/f (+/- 0.0 %)      0.01 bm/B     0.04 bm/f (+/- 1.6 %)      7.89 c/B    44.51 c/f (+/- 2.8 %)      4.15 i/c      4.18 GHz

canada_short.txt

mwalcott3 commented on June 2, 2024

Why is Clinger's fast path only being applied to positive exponents? I feel like all the numbers I'm using in canada_short should be able to fast path (Small significand and an exponent of -2).

Edit:
Nvm saw issue #149. That is incredibly annoying. Changing the floating point rounding mode and not resetting it seems almost like the user is asking for problems.
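For context, the classic form of Clinger's fast path can be sketched as follows. This is an illustration of the general technique under the default round-to-nearest mode, not fast_float's actual code; `clinger_fast_path` and `kPow10` are names invented here.

```cpp
#include <cassert>
#include <cstdint>

// Powers of ten that are exactly representable as doubles (10^0 .. 10^22).
static const double kPow10[] = {
    1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,  1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};

// Hypothetical helper: computes w * 10^q with a single multiply or divide
// when both operands are exact doubles; returns false otherwise.
inline bool clinger_fast_path(std::uint64_t w, int q, double& out) {
  if (w > (std::uint64_t(1) << 53) || q < -22 || q > 22) return false;
  const double d = static_cast<double>(w);  // exact, because w <= 2^53
  // One IEEE operation on exact inputs: the result is rounded once, in
  // whatever rounding mode is currently active.
  out = (q < 0) ? d / kPow10[-q] : d * kPow10[q];
  return true;
}
```

With round-to-nearest, `clinger_fast_path(6561, -2, out)` yields the double nearest to 65.61. Because the single operation rounds in the active mode, a different rounding mode can change the result, which is the concern raised in the rounding-mode discussion.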

lemire commented on June 2, 2024

> Changing the floating point rounding mode and not resetting it seems almost like the user is asking for problems.

Agreed but the C++ specification is what it is.

lemire commented on June 2, 2024

Thanks for the extra file, I have added it to my benchmark.

Looking at your numbers, we have the following numbers of instructions per float for fast_float, without and with the SWAR code respectively: 184.84 i/f and 194.84 i/f (canada_short.txt) and 322.90 i/f and 284.27 i/f (canada.txt). With the SWAR code, you add 10 instructions on the one hand, and you save 39 instructions on the other hand.

It depends what you favour. The canada file is derived from actual data. It is unclear to me which application would match the canada_short file. Is that the sort of data you encounter in your work?

I'd be most interested in real-world reports. Please note that we always try to optimize for the data people do have.

I am tuning the performance based on additional tests that take canada_short into account. Note that I weight canada_short somewhat less because I consider it less likely to be realistic.

See #152

mwalcott3 commented on June 2, 2024

> Is that the sort of data you encounter in your work?

Rarely; it's a bit of a contrived example. Some old Fortran codes I interface with output ASCII files with short fixed-point decimal numbers (in that case no one really cares about 1 ulp errors or accuracy in general, so maybe fast_float is overkill). For the most part I expect to deal with numbers closer to what is seen in canada.txt.

I just thought that simple low-digit fixed-point performance should be tested because it sometimes pops up and there appears to be a performance regression there compared to fast_double_parser.

fastfloat                               :   489.87 MB/s (+/- 2.0 %)    91.02 Mfloat/s      34.52 i/B   194.84 i/f (+/- 0.0 %)      0.01 bm/B     0.04 bm/f (+/- 1.2 %)      8.15 c/B    45.98 c/f (+/- 1.3 %)      4.24 i/c      4.19 GHz 
fast_double_parser                      :  1239.27 MB/s (+/- 5.7 %)   230.26 Mfloat/s      16.06 i/B    90.65 i/f (+/- 0.0 %)      0.00 bm/B     0.00 bm/f (+/- 1.5 %)      3.22 c/B    18.20 c/f (+/- 3.6 %)      4.98 i/c      4.19 GHz

But that appears to be primarily related to other issues. For instance, re-enabling the fast path for negative exponents (not standard compliant) immediately gave a significant performance boost when parsing the short numbers. I think it was close to 50%. But that's not something that can really be changed.

lemire commented on June 2, 2024

> For instance, re-enabling the fast path for negative exponents

Right. There might be room for more clever approaches. Unfortunately, testing for the rounding mode each time is too expensive.

I will close this issue for now. It would be very helpful to give more thought to ways around the performance regression you allude to. Of course, we could simply make it a compile-time option, or even a runtime option. But it is less useful than it sounds because the default would have to be standard compliance... sadly.
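To make the options above concrete, here is a rough sketch of a runtime rounding-mode test combined with a compile-time opt-out. It uses the standard `std::fegetround` from `<cfenv>`; the macro `EXAMPLE_ASSUME_ROUND_TO_NEAREST` and the function name are hypothetical, and nothing here is claimed to be how fast_float does or would implement it.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical gate for a rounding-sensitive fast path.
inline bool negative_exponent_fast_path_allowed() {
#ifdef EXAMPLE_ASSUME_ROUND_TO_NEAREST
  // Hypothetical compile-time opt-in: the user promises never to change the
  // rounding mode, so no runtime check is performed.
  return true;
#else
  // Standard-compliant default: query the current rounding mode on each call.
  return std::fegetround() == FE_TONEAREST;
#endif
}
```

The default branch pays for a query on every parse, which is the cost objected to above; the opt-in branch removes that cost but cannot be the default if standard compliance is required.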

lemire commented on June 2, 2024

Thanks for the report.
