While most of fast_float doesn't use floating point operations and so doesn't care abo

Clinger's fast path and non-default rounding modes about fast_float HOT 13 CLOSED

jakubjelinek commented on June 2, 2024

Clinger's fast path and non-default rounding modes

from fast_float.

Comments (13)

jakubjelinek commented on June 2, 2024

See https://gcc.gnu.org/PR107468


lemire commented on June 2, 2024

That's an interesting issue. Evidently, we would want round-to-nearest in all cases. We cannot very well disable Clinger's fast path generally, as it would cost too much performance.

Disabling it conditionally is possible... e.g.,

  if (std::fegetround() == FE_TONEAREST && binary_format<T>::min_exponent_fast_path() <= pns.exponent && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <=binary_format<T>::max_mantissa_fast_path() && !pns.too_many_digits) {

Unfortunately, doing so appears to cost about 10% in performance.

If I compile and run the following program, which parses the same string with strtof and std::from_chars, I get the same result. That is, the program prints out...

0x1.000014p+25
0x1.000014p+25
0x1.000012p+25
0x1.000012p+25

meaning that both strtof and std::from_chars are affected by the non-default rounding mode.

#include <cfenv>
#include <charconv>
#include <cstdlib>
#include <iostream>
#include <string_view>

int main() {

  char buf[] = "3.355447e+07";

  {
    float f;
    auto [ptr, ec] = std::from_chars(buf, buf + sizeof(buf) - 1, f);
    std::cout << std::hexfloat << f << std::endl;
    char *pr;
    std::cout << std::hexfloat << strtof(buf, &pr) << std::endl;
  }

  std::fesetround(FE_DOWNWARD);
  {
    float f;
    auto [ptr, ec] = std::from_chars(buf, buf + sizeof(buf) - 1, f);
    std::cout << std::hexfloat << f << std::endl;
    char *pr;
    std::cout << std::hexfloat << strtof(buf, &pr) << std::endl;
  }
}

I think that this is covered in the runtime with code such as this...

#if _GLIBCXX_USE_C99_FENV_TR1 && defined(FE_TONEAREST)
	const int rounding = std::fegetround();
	if (rounding != FE_TONEAREST)
	  std::fesetround(FE_TONEAREST);
#endif


jakubjelinek commented on June 2, 2024

Yeah, but strtod etc. are meant to honor the current rounding mode, while from_chars always rounds to nearest.
E.g. C17 has:
"Functions such as strtod that convert character sequences to floating types honor the rounding direction."
https://eel.is/c++draft/charconv.from.chars
says "after rounding according to round_to_nearest"

And yes, libstdc++ uses those std::fegetround() + temporarily std::fesetround(FE_TONEAREST) if needed around strtold etc.
calls, but it isn't used in the fast_float fast path.
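The save-and-restore pattern described above can be sketched as a small RAII guard. This is only an illustration, not libstdc++'s actual code: the names nearest_rounding_scope and parse_nearest are hypothetical, and the sketch assumes a C library whose strtod honors the current rounding mode.

```cpp
#include <cfenv>
#include <cstdlib>

// Hypothetical helper (not libstdc++'s code): forces round-to-nearest for the
// lifetime of the object and restores the caller's rounding mode afterwards.
class nearest_rounding_scope {
  int saved_;

public:
  nearest_rounding_scope() : saved_(std::fegetround()) {
    if (saved_ != FE_TONEAREST) std::fesetround(FE_TONEAREST);
  }
  ~nearest_rounding_scope() {
    if (saved_ != FE_TONEAREST) std::fesetround(saved_);
  }
  nearest_rounding_scope(const nearest_rounding_scope&) = delete;
  nearest_rounding_scope& operator=(const nearest_rounding_scope&) = delete;
};

// Parses with strtod but always rounds to nearest, as from_chars requires.
inline double parse_nearest(const char* s) {
  nearest_rounding_scope guard;
  return std::strtod(s, nullptr);
}
```

The guard costs one fegetround() call per parse, and that per-call cost is what the rest of the thread tries to avoid on the fast path.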


lemire commented on June 2, 2024

@jakubjelinek Yes. Please review my comment, I don't think we disagree. What I am thinking through is whether calling std::fegetround() systematically makes sense.


lemire commented on June 2, 2024

Calling std::fegetround() seems to trigger a function call under GCC. On x64 processors, this might end up compiled to stmxcsr, which can be a somewhat expensive instruction (more expensive than a mere multiplication, at least). I am quite certain that AMD/Intel engineers do not expect stmxcsr to be called repeatedly in a loop. Under aarch64, it might get compiled to mrs, for which I cannot find much performance documentation.

So let us go at it experimentally.
Please refer to https://github.com/lemire/simple_fastfloat_benchmark

I am benchmarking on a file (mesh.txt) that relies a lot on Clinger's fast path (due to the type of numbers involved). I use GCC 9 on an AMD Rome processor. I produce CPU performance counters (bm=branch miss, i = instruction, c=cycle).

mesh.txt data set (heavily reliant on Clinger):

original                                 :   531.13 MB/s (+/- 0.9 %)    72.35 Mfloat/s      25.68 i/B   197.67 i/f (+/- 0.0 %)      0.01 bm/B     0.09 bm/f (+/- 1.1 %)      6.08 c/B    46.83 c/f (+/- 0.8 %)      4.22 i/c      3.39 GHz 

with new guard                           :   410.86 MB/s (+/- 0.3 %)    55.97 Mfloat/s      28.08 i/B   216.12 i/f (+/- 0.0 %)      0.01 bm/B     0.10 bm/f (+/- 1.2 %)      7.87 c/B    60.54 c/f (+/- 0.2 %)      3.57 i/c      3.39 GHz 

with Clinger removed                     :   402.96 MB/s (+/- 0.8 %)    54.89 Mfloat/s      32.15 i/B   247.48 i/f (+/- 0.0 %)      0.01 bm/B     0.09 bm/f (+/- 0.5 %)      8.02 c/B    61.72 c/f (+/- 0.6 %)      4.01 i/c      3.39 GHz

Thus the new guard makes Clinger's fast path much less compelling. It might be best to just prune it out if one cannot ensure that the rounding is to the nearest float. You might lose 30% of the performance in some cases, but you still get competitive performance. Here are my results with Clinger's fast path removed for the mesh data set (a worst-case scenario)...

./buildnew/benchmarks/benchmark -f data/mesh.txt 
# read 73019 lines 
volume = 0.536009 MB 
netlib                                  :   305.82 MB/s (+/- 0.5 %)    41.66 Mfloat/s      34.29 i/B   263.92 i/f (+/- 0.0 %)      0.08 bm/B     0.58 bm/f (+/- 0.2 %)     10.57 c/B    81.38 c/f (+/- 0.3 %)      3.24 i/c      3.39 GHz 
doubleconversion                        :   196.30 MB/s (+/- 1.2 %)    26.74 Mfloat/s      57.19 i/B   440.17 i/f (+/- 0.0 %)      0.03 bm/B     0.20 bm/f (+/- 0.5 %)     16.47 c/B   126.75 c/f (+/- 0.9 %)      3.47 i/c      3.39 GHz 
strtod                                  :   121.74 MB/s (+/- 2.6 %)    16.58 Mfloat/s      86.99 i/B   669.61 i/f (+/- 0.0 %)      0.04 bm/B     0.33 bm/f (+/- 0.5 %)     26.55 c/B   204.36 c/f (+/- 2.1 %)      3.28 i/c      3.39 GHz 
abseil                                  :   211.05 MB/s (+/- 1.1 %)    28.75 Mfloat/s      54.96 i/B   423.08 i/f (+/- 0.0 %)      0.04 bm/B     0.31 bm/f (+/- 1.0 %)     15.31 c/B   117.81 c/f (+/- 0.8 %)      3.59 i/c      3.39 GHz 
fastfloat                               :   403.26 MB/s (+/- 1.6 %)    54.93 Mfloat/s      32.15 i/B   247.48 i/f (+/- 0.0 %)      0.01 bm/B     0.09 bm/f (+/- 0.7 %)      8.02 c/B    61.70 c/f (+/- 0.6 %)      4.01 i/c      3.39 GHz

@jakubjelinek What do you think? Currently, based on my early investigations, it seems that just taking out Clinger's fast path would be good. It solves the issue. It slightly simplifies the code. There is a performance cost, unfortunately, but it is a bearable one.

Guarding Clinger's fast path exposes us to all sorts of bad surprises, performance-wise. It might be 'worth it' but it will almost surely create some bad performance regressions on some machines. I would much rather not rely on std::fegetround() for every call.


jakubjelinek commented on June 2, 2024

CC @jwakely
Dunno. E.g. on x86, AVX512F has the possibility of explicit rounding modes on arithmetic operations, but that is unlikely to be something we could use all the time (and many CPUs don't support it). I don't really have any statistics on how people use from_chars in the real world (what % of cases trigger Clinger's fast path). If a significant share of those involve non-negative powers of 5 and smaller pns.mantissa values, perhaps we could keep the fast path for cases where we can prove there will be no rounding? I mean, for multiplication by 1 it will never happen; 5 fits into 3 bits, so if bit_width(pns.mantissa) + 3 fits into the mantissa bits, there will be no rounding either; pow(5,2) fits into 5 bits, so if bit_width(pns.mantissa) + 5 fits into the mantissa bits, we could do the multiplication; similarly for pow(5,3), which fits into 7 bits, etc.
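To make the bit-counting argument concrete, here is a rough sketch. The helper names (bit_width_u64, pow5, product_fits) are made up for illustration and are not part of fast_float. It relies on the conservative bound that the product of an a-bit and a b-bit integer has at most a + b bits.

```cpp
#include <cstdint>

// Number of bits needed to represent v (0 for v == 0). Same result as C++20
// std::bit_width, written out so the sketch also builds as C++11.
inline int bit_width_u64(std::uint64_t v) {
  int n = 0;
  while (v != 0) {
    ++n;
    v >>= 1;
  }
  return n;
}

// 5^k for 0 <= k <= 27 (5^27 is the largest power of five below 2^64).
inline std::uint64_t pow5(int k) {
  std::uint64_t p = 1;
  for (int i = 0; i < k; ++i) p *= 5;
  return p;
}

// True when mantissa * 5^k is guaranteed to fit in `precision` bits
// (53 for double, 24 for float), so the product is exact in every rounding
// mode. Multiplying by 5^0 == 1 adds no bits, hence the special case.
inline bool product_fits(std::uint64_t mantissa, int k, int precision) {
  const int factor_bits = (k == 0) ? 0 : bit_width_u64(pow5(k));
  return bit_width_u64(mantissa) + factor_bits <= precision;
}
```

For instance, 5^3 = 125 needs 7 bits, so for double any mantissa of at most 46 bits can be multiplied by it without rounding. The remaining factor 2^k in 10^k only changes the binary exponent, so it never causes rounding within the normal range.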


jakubjelinek commented on June 2, 2024

Oh, and the fegetround() guard can be either added as the first condition around the fast path (so it is done always and slows down everything) or as the last && condition so that it only slows down the fast path if it would be previously used.
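The "last && condition" placement can be sketched as follows. The function name fast_path_guard is hypothetical and the bounds are hard-coded for double (exponent within [-22, 22], mantissa at most 2^53) purely for illustration; fast_float reads them from binary_format<T>.

```cpp
#include <cfenv>
#include <cstdint>

// Hypothetical guard for double. Because && short-circuits, the relatively
// expensive std::fegetround() call runs only when every cheaper test has
// already passed, i.e. only for inputs that would have taken the fast path.
inline bool fast_path_guard(std::int64_t exponent, std::uint64_t mantissa,
                            bool too_many_digits) {
  return -22 <= exponent && exponent <= 22 &&
         mantissa <= (std::uint64_t(2) << 52) &&
         !too_many_digits &&
         std::fegetround() == FE_TONEAREST;
}
```

On inputs that mostly qualify for the fast path, the ordering saves little, since the rounding-mode query still runs for almost every number.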


lemire commented on June 2, 2024

If we can assume that we have the appropriate AVX-512 instructions, then it opens up possibilities, not just for this section of the code. But I think we will agree that we are interested in the general case. So let us assume that we do not have AVX-512.

"where we can prove there will be no rounding?"

Let us take a number at random from mesh.txt: 2.35722780228. This is 235722780228/10^11. This works well for Clinger because both numbers (235722780228 and 10^11) have an exact IEEE representation, so a single division is guaranteed to produce the correctly rounded result under round-to-nearest. We can quickly check that the conditions are met. There are basically two of them:

One condition on the exponent: binary_format<T>::min_exponent_fast_path() <= pns.exponent <= binary_format<T>::max_exponent_fast_path().
One condition on the significand: pns.mantissa <=binary_format<T>::max_mantissa_fast_path().
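The two conditions and the single floating-point operation can be sketched for double as follows. This is a simplified illustration, not fast_float's implementation: the function name clinger_fast_path is hypothetical and the limits (exponent within [-22, 22], significand at most 2^53) are written out as literals instead of being read from binary_format<T>.

```cpp
#include <cstdint>

// Sketch of Clinger's fast path for double only. Returns true and stores
// w * 10^q in `out` when both conditions hold; returns false otherwise.
inline bool clinger_fast_path(std::uint64_t w, int q, double& out) {
  // Powers of ten up to 10^22 are exactly representable as double.
  static const double pow10[] = {
      1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,
      1e8,  1e9,  1e10, 1e11, 1e12, 1e13, 1e14, 1e15,
      1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};
  const std::uint64_t max_mantissa = std::uint64_t(1) << 53;  // 2^53
  if (q < -22 || q > 22 || w > max_mantissa) return false;
  const double d = static_cast<double>(w);  // exact because w <= 2^53
  // One IEEE operation on two exact operands. The result is correctly
  // rounded in the *current* rounding mode, which is the crux of this issue.
  out = (q < 0) ? d / pow10[-q] : d * pow10[q];
  return true;
}
```

Calling clinger_fast_path(235722780228, -11, x) performs the division 235722780228.0 / 1e11. Under round-to-nearest this gives the same double as a correctly rounded parse of 2.35722780228; under FE_DOWNWARD or FE_UPWARD it may differ by one unit in the last place.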

It also applies often enough, and the check is cheap. It does not trigger too many mispredicted branches in practice because numbers tend to follow a pattern: if Clinger's fast path applies to one number in a data set, it tends to apply to most of them...

It seems that your proposal would add more expensive constraints, and it would have significantly less coverage. Whether it is worth it is an interesting research question, but what seems likely is that the benefits will be less than Clinger's (so less than 30% in the best case scenario).

Still: it is an empirical issue and we can test it. Can you spell out the code? What do you have in mind? Can we run some benchmarks?

"Oh, and the fegetround() guard can be either added as the first condition around the fast path (so it is done always and slows down everything) or as the last && condition so that it only slows down the fast path if it would be previously used."

Indeed. But in the mesh results above, it must be called all the time anyhow (since mesh.txt is very reliant on it), and you see that it basically wipes away the benefit of the Clinger's fast path (in that test). It could be that I am wrong, but my current verdict is that guarding Clinger's with std::fegetround() is not worth it.

Bear in mind that our own code is not much slower than Clinger's: we are still just doing a multiplication (although we rely on an integer multiplication).


jakubjelinek commented on June 2, 2024

I meant replace

  if (binary_format<T>::min_exponent_fast_path() <= pns.exponent && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <=binary_format<T>::max_mantissa_fast_path() && !pns.too_many_digits) {

with

  static int power_of_five_bits[] = { 0, 3, 5, 7, 10, 12, 14, 17, 19, 21, 24, 26, 28, 31, 33, 35, 38, 40, 42, 45, 47, 49, 52 };
  if (pns.exponent >= 0 && pns.exponent <= binary_format<T>::max_exponent_fast_path() && pns.mantissa <= (uint64_t(2) << (binary_format<T>::mantissa_explicit_bits() - power_of_five_bits[pns.exponent])) && !pns.too_many_digits) {

or so (completely untested), i.e. only non-negative exponents and only if mantissa * pow(5,pns.exponent) is guaranteed to be exactly representable.
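A standalone variant of this guard might look like the sketch below. It is an assumption-laden illustration: exact_fast_path is a hypothetical name, and the format parameters are passed in explicitly (52 explicit mantissa bits and a maximum exponent of 22 for double; 23 and 10 are assumed for float). One detail of the untested proposal shows up for float: at exponent 10 the shift amount becomes 23 - 24 = -1, which is undefined behaviour in C++, so the sketch handles that case separately.

```cpp
#include <cstdint>

// Bit widths of 5^0 .. 5^22, copied from the proposal (entry 0 is 0 because
// multiplying by 1 adds no bits).
static const int power_of_five_bits[] = {0,  3,  5,  7,  10, 12, 14, 17,
                                         19, 21, 24, 26, 28, 31, 33, 35,
                                         38, 40, 42, 45, 47, 49, 52};

// Hypothetical standalone form of the guard: true only when
// mantissa * 5^exponent is guaranteed to be exactly representable.
inline bool exact_fast_path(std::uint64_t mantissa, std::int64_t exponent,
                            int explicit_bits, int max_exponent) {
  if (exponent < 0 || exponent > max_exponent) return false;
  const int shift = explicit_bits - power_of_five_bits[exponent];
  // A negative shift means only mantissa <= 1 can be accepted; shifting by a
  // negative amount would be undefined behaviour, so do not evaluate it.
  const std::uint64_t limit =
      (shift >= 0) ? (std::uint64_t(2) << shift) : std::uint64_t(1);
  return mantissa <= limit;
}
```

The bound is conservative: some products that happen to be exact, such as 2 * 5^10 in float, are rejected and simply fall through to the slower path.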


lemire commented on June 2, 2024

It is promising.

Your proposal:

mesh.txt                               :   469.74 MB/s (+/- 0.3 %)    63.99 Mfloat/s      28.91 i/B   222.50 i/f (+/- 0.0 %)      0.01 bm/B     0.08 bm/f (+/- 0.8 %)      6.88 c/B    52.96 c/f (+/- 0.1 %)      4.20 i/c      3.39 GHz 
canada.txt                               :   721.56 MB/s (+/- 0.2 %)    41.47 Mfloat/s      18.53 i/B   338.08 i/f (+/- 0.0 %)      0.01 bm/B     0.10 bm/f (+/- 0.3 %)      4.48 c/B    81.75 c/f (+/- 0.1 %)      4.14 i/c      3.39 GHz

With no Clinger's path:

mesh.txt                               :   401.91 MB/s (+/- 1.1 %)    54.75 Mfloat/s      32.15 i/B   247.48 i/f (+/- 0.0 %)      0.01 bm/B     0.09 bm/f (+/- 0.8 %)      8.04 c/B    61.91 c/f (+/- 0.5 %)      4.00 i/c      3.39 GHz 
canada.txt                               :   706.30 MB/s (+/- 0.3 %)    40.59 Mfloat/s      18.37 i/B   335.11 i/f (+/- 0.0 %)      0.01 bm/B     0.10 bm/f (+/- 0.2 %)      4.57 c/B    83.46 c/f (+/- 0.2 %)      4.02 i/c      3.39 GHz

I will work on it more in the coming days.


lemire commented on June 2, 2024

Would you give me feedback on #150?


mwalcott3 commented on June 2, 2024

@lemire Would something like this work as a guard? Still expensive but it gets rid of the function call.

#include <limits>
inline bool rounds_nearest() {
    // volatile should prevent the compiler from folding this at compile time;
    // folded under round-to-nearest, the comparison would always be true.
    static volatile float fmin = std::numeric_limits<float>::min();
    return (fmin + 1.0f == 1.0f - fmin);
}

I was seeing ~7x better perf than calling fegetround in quick-bench.

It appears to detect the rounding mode just fine in my limited testing on godbolt.
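One way to check the probe against each rounding mode is sketched below. The probe is repeated so that the example is self-contained; probe_under is a hypothetical test helper, and the sketch assumes a platform that defines all four FE_* rounding macros and evaluates float arithmetic in float precision (as on x86-64 with SSE).

```cpp
#include <cfenv>
#include <limits>

// Same probe as in the comment above.
inline bool rounds_nearest() {
  static volatile float fmin = std::numeric_limits<float>::min();
  return (fmin + 1.0f == 1.0f - fmin);
}

// Hypothetical helper: switches to `mode`, runs the probe, then restores the
// previous rounding mode.
inline bool probe_under(int mode) {
  const int saved = std::fegetround();
  std::fesetround(mode);
  // The volatile store keeps the arithmetic ahead of the restoring call.
  volatile bool result = rounds_nearest();
  std::fesetround(saved);
  return result;
}
```

Under round-to-nearest both 1.0f + fmin and 1.0f - fmin round to 1.0f, so the probe returns true. Under FE_UPWARD the sum rounds up to the next float above 1.0f, and under FE_DOWNWARD or FE_TOWARDZERO the difference rounds down to the float just below 1.0f, so the two sides differ and the probe returns false.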


lemire commented on June 2, 2024

@mwalcott3 That's the kind of approach I went fishing for... Let me check it out.

