GithubHelp home page GithubHelp logo

ulfjack / ryu Goto Github PK

View Code? Open in Web Editor NEW
1.2K 49.0 98.0 8.8 MB

Converts floating point numbers to decimal strings

License: Apache License 2.0

C++ 76.28% C 18.88% Java 4.34% Batchfile 0.09% Shell 0.02% Makefile 0.04% Starlark 0.30% CMake 0.06%

ryu's Introduction

Ryu & Ryu Printf Build Status

This project contains routines to convert IEEE-754 floating-point numbers to decimal strings using shortest, fixed %f, and scientific %e formatting. The primary implementation is in C, and there is a port of the shortest conversion to Java. All algorithms have been published in peer-reviewed publications. At the time of this writing, these are the fastest known float-to-string conversion algorithms. The fixed, and scientific conversion routines are several times faster than the usual implementations of sprintf (we compared against glibc, Apple's libc, MSVC, and others).

Generating scientific and fixed output format for 16 and 32 bit IEEE floating point numbers can be implemented by converting to 64 bit, and then using the 64 bit routines. Note that there is no 128 bit implementation at this time.

When converting to shortest, DO NOT CAST; shortest conversion is based on the precision of the source type, and casting to a different type will not return the expected output. There are highly optimized 32 and 64 bit implementations as well as a generic 128 bit implementation that can handle any IEEE format up to 128 bits.

These are the supported conversion modes for the C implementation:

IEEE Type Supported Output Formats
16 Bit (half) Shortest (via ryu_generic_128.h)
32 Bit (float) Shortest
64 Bit (double) Shortest, Scientific, Fixed
80 Bit (long double) Shortest (via ryu_generic_128.h)
128 Bit (__float128) Shortest (via ryu_generic_128.h)

The code is continuously tested on Ubuntu 18.04, MacOS High Sierra, and Windows Server version 1803.

All code outside of third_party/ is copyrighted by Ulf Adams and contributors, and may be used freely in accordance with the Apache 2.0 license. Alternatively, the files in the ryu/ directory may be used freely in accordance with the Boost 1.0 license.

All contributions are required to maintain these licenses.

Ryu

Ryu generates the shortest decimal representation of a floating point number that maintains round-trip safety. That is, a correct parser can recover the exact original number. For example, consider the binary 32-bit floating point number 00111110100110011001100110011010. The stored value is exactly 0.300000011920928955078125. However, this floating point number is also the closest number to the decimal number 0.3, so that is what Ryu outputs.

This problem of generating the shortest possible representation was originally posed by White and Steele [1], for which they described an algorithm called "Dragon". It was subsequently improved upon with algorithms that also had dragon-themed names. I followed in the same vein using the japanese word for dragon, Ryu. In general, all these algorithms should produce identical output given identical input, and this is checked when running the benchmark program.

The C implementation of Ryu is in the ryu/ directory. The Java implementations are RyuFloat and RyuDouble under src/main/java/. Both cover 32 and 64-bit floating point numbers.

In addition, there is an experimental C implementation that can handle inputs of any size up to 128-bit, albeit with lower performance than the highly optimized 32-bit and 64-bit implementations. Furthermore, there is an experimental low-level C API that returns the decimal floating-point representation as a struct, allowing clients to implement their own formatting. These are still subject to change.

Note: The Java implementation differs from the output of Double.toString [2] in some cases: sometimes the output is shorter (which is arguably more accurate) and sometimes the output may differ in the precise digits output (e.g., see #83).

Note: While the Java specification requires outputting at least 2 digits, other specifications, such as for JavaScript, always require the shortest output. We may change the Java implementation in the future to support both.

My PLDI'18 paper includes a complete correctness proof of the algorithm: https://dl.acm.org/citation.cfm?doid=3296979.3192369

Other implementations of Ryu:

Language Author Link
Scala Andriy Plokhotnyuk https://github.com/plokhotnyuk/jsoniter-scala
Rust David Tolnay https://github.com/dtolnay/ryu
Julia Jacob Quinn https://github.com/JuliaLang/julia/tree/master/base/ryu
Factor Alexander Iljin https://github.com/AlexIljin/ryu
Go Caleb Spare https://github.com/cespare/ryu
C# Dogwei https://github.com/Dogwei/RyuCsharp
C# Shad Storhaug https://github.com/NightOwl888/J2N
D Ilya Yaroshenko https://github.com/libmir/mir-algorithm
Scala Denys Shabalin https://github.com/scala-native/scala-native
Erlang/BEAM Thomas Depierre https://github.com/erlang/otp/tree/master/erts/emulator/ryu
Zig Marc Tiehuis https://github.com/tiehuis/zig-ryu
Haskell Lawrence Wu https://github.com/haskell/bytestring

Ryu Printf

Since Ryu generates the shortest decimal representation, it is not immediately suitable for use in languages that have printf-like facilities. In most implementations, printf provides three floating-point specific formatters, %f, %e, and %g:

  • The %f format prints the full decimal part of the given floating point number, and then appends as many digits of the fractional part as specified using the precision parameter.

  • The %e format prints the decimal number in scientific notation with as many digits after the initial digit as specified using the precision parameter.

  • The %g format prints either %f or %e format, whichever is shorter.

Ryu Printf implements %f and %e formatting in a way that should be drop-in compatible with most implementations of printf, although it currently does not implement any formatting flags other than precision. The benchmark program verifies that the output matches exactly, and outputs a warning if not. Any unexpected output from the benchmark indicates a difference in output.

Note that old versions of MSVC ship with a printf implementation that has a confirmed bug: it does not always round the last digit correctly.

Note that msys cuts off the output after ~17 digits, and therefore generally differs from Ryu Printf output for precision values larger than 17.

Note that the output for NaN values can differ between implementations; we use ifdefs in an attempt to match platform output.

According to our benchmarks, Ryu Printf compares favorably with the following implementations of printf for precision parameters 1, 10, 100, and 1000:

OS Libc Ryu Printf is faster by
Ubuntu 18.04 libc6 2.27-3ubuntu1 15x
Ubuntu 18.04 musl 1.1.19-1 4x
Windows 10 Home 1803 MSVC 19.14.26429.4 9x
Windows 10 Home 1803 msys-runtime-devel 2.10.0-2 between 8x and 20x
macOS Mojave 10.14 Apple Libc 24x

In addition, Ryu Printf has a more predictable performance profile. In theory, an implementation that performs particularly badly for some subset of numbers could be exploited as a denial-of-service attack vector.

My OOPSLA'2019 paper provides a correctness proof: https://dl.acm.org/citation.cfm?doid=3366395.3360595

Building, Testing, Running

We use the Bazel build system (https://bazel.build) 0.14 or later, although we recommend using the latest release. You also need to install Jdk 8 (or later) to build and run the Java code, and/or a C/C++ compiler (gcc or clang on Ubuntu, XCode on MacOS, or MSVC on Windows) to build the C/C++ code.

To build Ryu, run

$ bazel build //ryu

To build Ryu Printf, run

$ bazel build //ryu:ryu_printf

Big-Endian Architectures

The C implementations should work on big-endian architectures provided that the floating point type and the corresponding integer type use the same endianness.

There are no concerns around endianness for the Java implementation.

Building with a Custom Compiler

You can select a custom C++ compiler by setting the CC environment variable, e.g., use these steps to build with clang-4.0 on Ubuntu:

$ export CC=clang-4.0
$ bazel build //ryu

Building Ryu Printf against musl and msys requires installing the corresponding packages. We only tested against the musl Debian package that installs a gcc wrapper and is enabled by setting CC. However, building against msys requires manually adjusting Bazel's compiler configuration files.

Tests

You can run both C and Java tests with

$ bazel test //ryu/... //src/...

Ryu: Additional Notes

Jaffer

The code given by Jaffer in the original paper does not come with a license declaration. Instead, we're using code found on GitHub 3, which contains a license declaration by Jaffer. Compared to the original code, this implementation no longer outputs incorrect values for negative numbers.

We provide a binary to find differences between Ryu and the Jaffer / Jdk implementations:

$ bazel run //src/main/java/info/adams/ryu/analysis:FindDifferences --

Add the -mode=csv option to get all the discovered differences as a CSV. Use -mode=latex instead to get a latex snippet of the first 20. Use -mode=summary to only print the number of discovered differences (this is the default mode).

Computing Required Lookup Table Sizes

You can compute the required lookup table sizes with:

$ bazel run //src/main/java/info/adams/ryu/analysis:ComputeTableSizes --

Add -v to get slightly more verbose output.

Computing Required Bit Sizes

You can compute the required bit sizes with:

$ bazel run //src/main/java/info/adams/ryu/analysis:ComputeRequiredBitSizes --

Add the -128 and -256 flags to also cover 128- and 256-bit numbers. This could take a while - 128-bit takes ~20 seconds on my machine while 256-bit takes a few hours. Add -v to get very verbose output.

Java: Comparing All Possible 32-bit Values Exhaustively

You can check the slow vs. the fast implementation for all 32-bit floating point numbers using:

$ bazel run //src/main/java/info/adams/ryu/analysis:ExhaustiveFloatComparison

This takes ~60 hours to run to completion on an Intel(R) Core(TM) i7-4770K with 3.50GHz.

Java: Comparing All Possible 64-bit Values Exhaustively

You can check the slow vs. the fast implementation for all 64-bit floating point numbers using:

$ bazel run //src/main/java/info/adams/ryu/analysis:ExtensiveDoubleComparison

This takes approximately forever, so you will need to interrupt the program.

Benchmarks

Ryu

We provide both C and Java benchmark programs.

Enable optimization by adding "-c opt" on the command line:

$ bazel run -c opt //ryu/benchmark:ryu_benchmark --
    Average & Stddev Ryu  Average & Stddev Grisu3
32:   22.515    1.578       90.981   41.455
64:   27.545    1.677       98.981   80.797

For the Java benchmark, run:

$ bazel run //src/main/java/info/adams/ryu/benchmark --
    Average & Stddev Ryu  Average & Stddev Jdk  Average & Stddev Jaffer
32:   56.680    9.127       254.903  170.099
64:   89.751   13.442      1085.596  302.371     1089.535  309.245

Additional parameters can be passed to the benchmark after the -- parameter:

  -32           only run the 32-bit benchmark
  -64           only run the 64-bit benchmark
  -samples=n    run n pseudo-randomly selected numbers
  -iterations=n run each number n times
  -ryu          run Ryu only, no comparison
  -v            generate verbose output in CSV format

If you have gnuplot installed, you can generate plots from the benchmark data with:

$ bazel build -c opt --jobs=1 //scripts:shortest-{c,java}-{float,double}.pdf

The resulting files are bazel-genfiles/scripts/shortest-{c,java}-{float,double}.pdf.

Ryu Printf

We provide a C++ benchmark program that runs against the implementation of snprintf bundled with the selected C++ compiler. You need to enable optimization using "-c opt" on the command line:

$ bazel run -c opt //ryu/benchmark:ryu_printf_benchmark --
    Average & Stddev Ryu  Average & Stddev snprintf
%f:  116.359  130.992     3983.251 5331.505
%e:   40.853   10.872      210.648   36.779

Additional parameters can be passed to the benchmark after the -- parameter:

  -f            only run the %f benchmark
  -e            only run the %e benchmark
  -precision=n  run with precision n (default is 6)
  -samples=n    run n pseudo-randomly selected numbers
  -iterations=n run each number n times
  -ryu          run Ryu Printf only, no comparison
  -v            generate verbose output in CSV format

See above for selecting a different compiler. Note that msys C++ compilation does not work out of the box.

We also provide a simplified C benchmark for platforms that do not support C++ compilation, but note that pure C compilation is not natively supported by Bazel:

$ bazel run -c opt //ryu/benchmark:ryu_printf_benchmark_c --

If you have gnuplot installed, you can generate plots from the benchmark data with:

$ bazel build -c opt --jobs=1 //scripts:{f,e}-c-double-{1,10,100,1000}.pdf

The resulting files are bazel-genfiles/scripts/{f,e}-c-double-{1,10,100,1000}.pdf.

ryu's People

Contributors

9il avatar abolz avatar afrb2 avatar alcaro avatar alex-ilin avatar alugowski avatar bebesparkelsparkel avatar cassioneri avatar cespare avatar dianaolympos avatar dtolnay avatar finii avatar gritzko avatar m-ou-se avatar mitchblank avatar ngn avatar nightowl888 avatar plokhotnyuk avatar rlespinet avatar stephantlavavej avatar thechampagne avatar ulfjack avatar vinniefalco avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

ryu's Issues

Java: float/double to byte array (or bytebuffer)

Ulf, thanks for your great work!

Do you have a plan to add more methods to store text representation of numbers except for Java strings?

Lot of serialization libraries and frameworks would appreciate the ability to store immediately into the pre-allocated array or a byte buffer.

Here is my attempt to adopt initial versions of RyuFloat.java and RyuDouble.java in the JSON serializer for Scala: plokhotnyuk/jsoniter-scala@1224719

For the float writing, I have able to avoid / and %operations at all. Also, access to tables was revised to minimize dependent pointer reads, etc. Would you like to try these tricks in your Java version?

Remove LEGACY_MODE

This was only added to match the results from the initial paper submission to support the PLDI artifact review. Since nobody can see the initial paper anymore, the code can be dropped now.

Trying to just extract the mantissa, exponent, and sign

I'm trying to extract

unsigned long long mantissa;
short exponent;
bool negative;

from a double. I already wrote some code before I heard about the Ryลซ so if anyone reading this wants a good laugh here it is:

    if(f < 0)
    {
        f = std::abs(f);
        neg_ = true;
    }
    else
    {
        neg_ = false;
    }
    static exponent_type const bias =
        static_cast<exponent_type>(std::floor(
            std::log10((std::numeric_limits<
                mantissa_type>::max)()))) - 2;
    exp_ = static_cast<exponent_type>(
        std::floor(std::log10(f)));
    auto mant = f / std::pow(10, exp_);
    mant_ = static_cast<mantissa_type>(
        mant * std::pow(10, bias));
    exp_ -= bias;

It is considerably simpler than Ryลซ and correspondingly, less correct.

I'm trying to adapt the source code in this project to achieve my ends, but it seems to be very oriented towards printing rather than simple extraction of the decimal mantissa and exponent, am I missing something? What's the shortest path to getting these numbers?

Incorrect vrIsTrailingZeros?

In d2s.c, should this call

ryu/ryu/d2s.c

Line 315 in 00ecce8

vrIsTrailingZeros = multipleOfPowerOf2(mv, q - 1);
use q instead of q-1?

As I understand it, the code uses q' = max(0, q-1) to make sure that the loop below is executed at least once, which is required to get a correct value for lastRemovedDigit which in turn is required to correctly round the result. Later, the values of vrIsTrailingZeros and lastRemovedDigit are used to determine if the exact number ends in ...5000...0 here:

ryu/ryu/d2s.c

Line 378 in 00ecce8

if (vrIsTrailingZeros && lastRemovedDigit == 5 && vr % 2 == 0) {

Now, might it be possible that vrIsTrailingZeros is missing one digit, i.e.

                missed digit?
                v
            ...5?000...0
               ^ ~~~~~~^
lastRemovedDigit       vrIsTrailingZeros

I haven't found a counter-example... so it's probably me, who is missing something...

NB: the e2 >= 0 branch is using multipleOfPowerOf5(mv, q).


Related: In f2s.c,

ryu/ryu/f2s.c

Line 195 in 00ecce8

vrIsTrailingZeros = multipleOfPowerOf5(mv, q);

Shouldn't this use q-1 instead of q? In this case vrIsTrailingZeros includes the value of the last removed digit, but shouldn't it only include the digits after the last removed digit?

lastRemovedDigit
               v
            ...5000...0
               ~~~~~~~^
                      vrIsTrailingZeros

I have replaced the assignment above with

vrIsTrailingZeros = (q == 0) || multipleOfPowerOf5(mv, q - 1);

and tested all single-precision numbers: It doesn't affect the output. So, again, I'm probably missing something here...

NB: the e2 < 0 branch is using multipleOfPowerOf2(mv, q - 1) here.

Java: got "1.943037616030839E16" for 1.9430376160308388E16 input (even rounding mode)

IN=100001101010001010000011111010010111111001110001100101100101001
   S=+ E=2 M=4857594040077097
E =16
d+=19430376160308390
d =19430376160308388
d-=19430376160308386
e2=0
19430376160308388 * 2^0
V+=19430376160308390
V =19430376160308388
V-=19430376160308386
19430376160308388 2
d+=19430376160308390
d =19430376160308388
d-=19430376160308386
e10=0
d-10=false
LAST_REMOVED_DIGIT=8
VP=1943037616030839
VR=1943037616030838
VM=1943037616030838
O=1943037616030839
OLEN=16
EXP=16
1.943037616030839E16 1.9430376160308388E16

Would a smarter decimalLength() help?

The current decimalLength() code is just a series of if() statements. Since the value is usually near the top of the range this isn't so bad (i.e. it's not common to get far down list of if() statements) However, it's going to compile down to a pretty large number of instructions and at least the first branch is going to predict pretty badly.

On platforms that have a "number-of-leading-zeros" instruction (including x86_64) there is simple branch-free method of computing an integerlog10() (with the caveat that an input of 0 needs to be treated as a special case when using that to computing "number of digits") This isn't my gist, but here's the first one that popped up when I googled: https://gist.github.com/CAFxX/ad150f2403a0604e14cc There are also variants that do a couple more ALU instructions to shrink the thr[] table into something that will fit in fewer cachelines -- may or may not be a win..

The above uses the gcc/clang a = __builtin_clzll(x) intrinsic. The MSVC equivalent of that is:

    unsigned long tmp;
    (void) _BitScanReverse64(&tmp, x);
    clz = 63 - tmp;

... the compiler will reduce that and eliminate tmp at the end, so don't let the &tmp scare you.

Anyway, maybe you've already tried something like this and didn't matter (I just watched the video of your talk, haven't read the paper yet) but maybe a couple nanoseconds could be shaved here.

As is usual this sort of code, the seminal reference material is Henry Warren's "Hacker's Delight" If you have a copy of the Second Edition at hand (everyone should...) he discusses a number of log10 options in the last 6 pages of chapter 11.

Differences from paper

I noticed there are some differences in the constants from what the paper would suggest. For example,

ryu/ryu/d2s.h

Lines 40 to 41 in 5c36146

#define DOUBLE_POW5_INV_BITCOUNT 122
#define DOUBLE_POW5_BITCOUNT 121

whereas the paper suggests these should both be 124 (Figure 4). Why the difference?

Missing carries in non-intrinsic umul128()?

ryu/ryu/d2s_intrinsics.h

Lines 48 to 76 in 2ec9ead

static inline uint64_t umul128(const uint64_t a, const uint64_t b, uint64_t* const productHi) {
// The casts here help MSVC to avoid calls to the __allmul library function.
const uint32_t aLo = (uint32_t)a;
const uint32_t aHi = (uint32_t)(a >> 32);
const uint32_t bLo = (uint32_t)b;
const uint32_t bHi = (uint32_t)(b >> 32);
const uint64_t b00 = (uint64_t)aLo * bLo;
const uint64_t b01 = (uint64_t)aLo * bHi;
const uint64_t b10 = (uint64_t)aHi * bLo;
const uint64_t b11 = (uint64_t)aHi * bHi;
const uint32_t b00Lo = (uint32_t)b00;
const uint32_t b00Hi = (uint32_t)(b00 >> 32);
const uint64_t mid1 = b10 + b00Hi;
const uint32_t mid1Lo = (uint32_t)(mid1);
const uint32_t mid1Hi = (uint32_t)(mid1 >> 32);
const uint64_t mid2 = b01 + mid1Lo;
const uint32_t mid2Lo = (uint32_t)(mid2);
const uint32_t mid2Hi = (uint32_t)(mid2 >> 32);
const uint64_t pHi = b11 + mid1Hi + mid2Hi;
const uint64_t pLo = ((uint64_t)mid2Lo << 32) + b00Lo;
*productHi = pHi;
return pLo;
}
implements a 64 * 64 = 128 wide multiplication. I realized that it implements the grade-school algorithm for multiplying two 2-digit numbers (where each "digit" is uint32_t), except that it's missing carries! This seems deeply concerning for correctness, but I haven't found any failures due to the missing carries yet.

Is there something about how Ryu calls umul128() that prevents this from being a problem? If so, should it be commented and/or asserted? My attempts to add carrying code have degraded x86 perf by at least 10%.

(I noticed that I replicated this behavior in umul256() so it has greater impact.)

Add f2fixed() and f2exp()?

int d2fixed_buffered_n(double d, uint32_t precision, char* result) {
and
int d2exp_buffered_n(double d, uint32_t precision, char* result) {
take 64-bit double but work with 32-bit float (because float can be losslessly widened to double; note that this is untrue for string-to-floating conversions, where rounding a string to a double and rounding again to a float is a subtle mistake).

Would dedicated 32-bit implementations be even faster, possibly by using narrower multiplications? Like how

ryu/ryu/f2s.c

Line 154 in a73883c

static inline floating_decimal_32 f2d(const uint32_t ieeeMantissa, const uint32_t ieeeExponent) {
is faster than

ryu/ryu/d2s.c

Line 238 in a73883c

static inline floating_decimal_64 d2d(const uint64_t ieeeMantissa, const uint32_t ieeeExponent) {

RyuDouble.java tests against Float.POSITIVE_INFINITY, not Double.POSITIVE_INFINITY

Forgive me if there is something subtle going on that I have missed.
RyuDouble.java tests against Float.POSITIVE_INFINITY and its opposite.
I can understand RyuFloat.java doing that, but it seems mildly astonishing
in RyuDouble.java. I would expect to see the test against Double.POSITIVE_INFINITY
and its negation.

 public static String doubleToString(double value, RoundingMode roundingMode) {
    // Step 1: Decode the floating point number, and unify normalized and subnormal cases.
    // First, handle all the trivial cases.
    if (Double.isNaN(value)) return "NaN";
    if (value == Float.POSITIVE_INFINITY) return "Infinity";
    if (value == Float.NEGATIVE_INFINITY) return "-Infinity";

Compability with ES6 / V8

I'm working with something that possibly will become an IETF standard:
https://tools.ietf.org/html/draft-rundgren-json-canonicalization-scheme-02
It depends on number formatting compatible with ES6/V8.

A short test revealed that the Java version of Ryu is not entirely compatible with this scheme.
x0000000000000001 returns 5e-324 in ES6/V8 while 4.9e-324 in Ryu

This is not really a bug but could be worth knowing.

https://github.com/cyberphone/json-canonicalization/tree/master/testdata#es6-numbers

Public identifiers should have a prefix to avoid name clashes

Public identifiers (extern functions and public enumerators) should have a prefix to avoid name collisions with user code and other libraries.

Extern functions should have a prefix such as ryu_.

The ryu_parse Status enumerators should have a prefix such as RYU_ or RYU_STATUS_ to avoid clashing with macros as well as identifiers in the global namespace. The SUCCESS enumerator is particularly susceptible to conflict.

Is there a Bug in Master?

In commit 388280c the comment is "add comprehensive tests, finding a core algorithm bug." That commit is only included in the generate-full-output branch.

Does it mean that there was a bug discovered in the Ryu algorithm, which is currently not fixed in master?

I'm asking, because I'm porting Ryu into a different language, and would like to understand the current state.

Thank you!

The tests leak

Memory allocated by routines such as f2s is never freed. When the tests are run under a sanitizer, leaks are reported.

Some of the results of RyuFloat are different from that of JDK

        Random random = new Random();

        for (int i = 0; i < 1000 * 1000 * 10; ++i) {
            float value = random.nextFloat();

            String str1 = Float.toString(value);
            String str2 = RyuFloat.floatToString(value);

            if (!str1.equals(str2)) {
                System.out.println(str1 + " -> " + str2);
                assertTrue(Float.parseFloat(str1) == Float.parseFloat(str2));
            }
        }

different results:

0.33007812 -> 0.33007813
0.17382812 -> 0.17382813
0.0063476562 -> 0.0063476563
0.43945312 -> 0.43945313
0.33789062 -> 0.33789063
0.0043945312 -> 0.0043945313
0.036132812 -> 0.036132813
0.47851562 -> 0.47851563
0.17382812 -> 0.17382813
0.47851562 -> 0.47851563
0.26757812 -> 0.26757813
0.40039062 -> 0.40039063
0.36914062 -> 0.36914063
0.43945312 -> 0.43945313
0.18945312 -> 0.18945313
0.043945312 -> 0.043945313
0.40039062 -> 0.40039063
0.020507812 -> 0.020507813
0.040039062 -> 0.040039063
0.36132812 -> 0.36132813
0.37695312 -> 0.37695313
0.29882812 -> 0.29882813
0.36132812 -> 0.36132813
0.29882812 -> 0.29882813
0.36914062 -> 0.36914063
0.41601562 -> 0.41601563

How do we get this into Boost?

We were discussing how to get this into Boost and an obvious solution came up, ask @StephanTLavavej ! We could use something header-only with the option of separate compilation. I was working on a port but I stopped because the changes to support this use case would be extensive and this project looks like it is still in development and might see considerably more commits that I would have to port to my C++-mangled header-only solution.

The routines which I ultimately need are

// this is user-facing
struct number
{
    unsigned long long mantissa;
    short exponent;
    bool sign;
};

number double_to_number(double);
char* number_to_chars(char* buf, number);

And of course an implementation of C++17's to_chars (so that users of older C++ versions have access to it).

There are a number of Boost libraries that would benefit from this code such as lexical_cast and a few others.

Remove trailing zeros from output functions

Hi, first of all thanks a lot for the library, it's performance is really awesome.

I've been working in integrating it in Postgis which decides the format and the amount of decimal digits to print depending on multiple parameters (user input and space left in the buffer), so I can't use d2s_buffered_n.

My main issue is that both d2fixed_buffered_n and d2exp_buffered_n include trailing zeros, like 15.0000 or 2.010e+04. Although it is possible to remove them manually by looking for . or e, it isn't very efficient; since ryu already knows about what's trailing and what not, I'd modified it to remove them if required.

I was wondering if it makes sense to contribute this here. If so, I would create a PR and make any necessary changes to have this included.

In case you want to have a look, the current patch looks like this:

diff --git a/ryu/d2fixed.c b/ryu/d2fixed.c
index b008dab..3bcc22f 100644
--- a/ryu/d2fixed.c
+++ b/ryu/d2fixed.c
@@ -420,7 +420,8 @@ int d2fixed_buffered_n(double d, uint32_t precision, char* result) {
   if (!nonzero) {
     result[index++] = '0';
   }
-  if (precision > 0) {
+  const bool printDecimalPoint = precision > 0;
+  if (printDecimalPoint) {
     result[index++] = '.';
   }
 #ifdef RYU_DEBUG
@@ -532,6 +533,18 @@ int d2fixed_buffered_n(double d, uint32_t precision, char* result) {
     memset(result + index, '0', precision);
     index += precision;
   }
+
+#if RYU_NO_TRAILING_ZEROS
+  if (printDecimalPoint) {
+    while (result[index - 1] == '0') {
+      index--;
+    }
+    if (result[index - 1] == '.') {
+      index--;
+    }
+  }
+#endif
+
   return index;
 }
 
@@ -771,6 +784,18 @@ int d2exp_buffered_n(double d, uint32_t precision, char* result) {
       }
     }
   }
+
+#if RYU_NO_TRAILING_ZEROS
+  if (printDecimalPoint) {
+    while (result[index - 1] == '0') {
+      index--;
+    }
+    if (result[index - 1] == '.') {
+      index--;
+    }
+  }
+#endif
+
   result[index++] = 'e';
   if (exp < 0) {
     result[index++] = '-';

It only does the extra checks when RYU_NO_TRAILING_ZEROS is defined, so the performance for the current case shouldn't be affected.

Why is the exponent 32 bits?

I see in the floating_decimal structure the exponent is stored using a 32-bit signed integer. But exponents for doubles only have a range of ~2,000 distinct integers, why isn't this a short instead?

Missing loop in two-digit optimization?

Hey @ulfjack, thanks for the great algorithm!

Quick question: should this be a loop rather than an if, or was that intentional?

ryu/ryu/d2s.c

Line 414 in b2ae7a7

if (vpDiv100 > vmDiv100) { // Optimization: remove two digits at a time (~86.2%).

I haven't tried building your benchmarks yet, but in my own implementation based on your C code changing this to a loop gives a noticeable speedup for formatting numbers like 0.3.

Special case integers

There's no need to do all that work if the floating point number is an exact integer. @cespare mentioned in #86 that this is a significant win for Go, so we should do that too.

Big endian architectures?

The code that extracts the bits from a float, and the code that converts bits to a float probably don't work on big endian machines.

Reduce 64-bit lookup table size

It should be possible (and cheap) to only store every other value, and take the next smaller available entry, but it may increase the number of bits required for the intermediate calculations.

Integrate Manual Modulo from Puff Algorithm.

I am the author of the Fastest Method to Printing Integers and Floating-point numbers, which contains the manual modulo optimization for Ryu. When you do mod 100 div 100, you just did 2 division instructions when you only needed to do one. You're welcome. You are now apart of the Uniprinter (Universal Printer), which I'm aiming to be the world's fastest text processing library that grows from stack to heap with optional dynamic memory, compiles in about 3 seconds, and actually prints Unicode in C++ right.

I'm still looking through the algorithm for optimizations, I just cloned.

I just ran the 128-bit bit shifting tables from Puff and yes, the pattern generalizes. The uint32_t decimalLength(const uint128_t v) function is very slow.

#Bits Max #Bits Max #Bits #Bits Max
1 1.00E+00 33 8.59E+09 65 3.69E+19 97
2 3.00E+00 34 1.72E+10 66 7.38E+19 98
3 7.00E+00 35 3.44E+10 67 1.48E+20 99
4 1.50E+01 36 6.87E+10 68 2.95E+20 100
5 3.10E+01 37 1.37E+11 69 5.90E+20 101
6 6.30E+01 38 2.75E+11 70 1.18E+21 102
7 1.27E+02 39 5.50E+11 71 2.36E+21 103
8 2.55E+02 40 1.10E+12 72 4.72E+21 104
9 5.11E+02 41 2.20E+12 73 9.44E+21 105
10 1.02E+03 42 4.40E+12 74 1.89E+22 106
11 2.05E+03 43 8.80E+12 75 3.78E+22 107
12 4.10E+03 44 1.76E+13 76 7.56E+22 108
13 8.19E+03 45 3.52E+13 77 1.51E+23 109
14 1.64E+04 46 7.04E+13 78 3.02E+23 110
15 3.28E+04 47 1.41E+14 79 6.04E+23 111
16 6.55E+04 48 2.81E+14 80 1.21E+24 112
17 1.31E+05 49 5.63E+14 81 2.42E+24 113
18 2.62E+05 50 1.13E+15 82 4.84E+24 114
19 5.24E+05 51 2.25E+15 83 9.67E+24 115
20 1.05E+06 52 4.50E+15 84 1.93E+25 116
21 2.10E+06 53 9.01E+15 85 3.87E+25 117
22 4.19E+06 54 1.80E+16 86 7.74E+25 118
23 8.39E+06 55 3.60E+16 87 1.55E+26 119
24 1.68E+07 56 7.21E+16 88 3.09E+26 120
25 3.36E+07 57 1.44E+17 89 6.19E+26 121
26 6.71E+07 58 2.88E+17 90 1.24E+27 122
27 1.34E+08 59 5.76E+17 91 2.48E+27 123
28 2.68E+08 60 1.15E+18 92 4.95E+27 124
29 5.37E+08 61 2.31E+18 93 9.90E+27 125
30 1.07E+09 62 4.61E+18 94 1.98E+28 126
31 2.15E+09 63 9.22E+18 95 3.96E+28 127
32 4.29E+09 64 1.84E+19 96 7.92E+28 128

Mod 10 is very slow due to the number of decimals in the result (i.e. the less you have to divide by the faster their newton's method they use for the hardware division) I've benchmarked on 64-bit systems to be faster than mod 10 what is called the MSD Look-Up-Down (MSD-LUD), but I gave up on it when I discovered Puff. The only version I saved prints both ends of the number simultaneously, it is here, but that version of the algorithm has issues and Script2 is down for surgery for at least one more month to update the v1 ASCII Data Spec.

compute_shortest question!

In the paper, in compute_shortest algorithm, can we send to input a, b, c, such that length(b) < length(c)? If yes, can we get bi(output final value) equal to zero, or zero + 1(by rounding)?

=====================================
For example in vacuum: a = 6, b = 8, c = 11, and as output we can get 1e1, which is not closest to input: right answer is just 8e0 ?

Where am i wrong?

Improve control over the output format

Would it be possible to control whether the number is printed in exponent or decimal notation? Or, if more convenient, return the chosen location for the decimal point, if present, the number of digits as well as the exponent. And make it possible to get the chosen exponent as integer.

I'm trying to integrate this into a formatting library and as such would like to transform the output into all of the forms available for printf(). In particular, the decimal formatting with user provided precision makes it necessary to reformat the number heavily. While some are trivial to implement, reparsing the exponent to decide on the automatic formatting and moving the floating point separator feels suboptimal.

The double-conversion interfaces could be a guideline here but the defaults and best options are quite specialized to ECMAScript.

d2s/f2s RYU_DEBUG inconsistency

d2s.c line 253:

#ifdef RYU_DEBUG
  printf("E=%d M=%" PRIu64 "\n", e2 + 2, m2);
#endif

f2s.c line 168:

#ifdef RYU_DEBUG
  printf("E=%d M=%u\n", e2, m2);
#endif

e2 and m2 are computed in a structurally identical manner, so it seems inconsistent that d2s prints e2 + 2 while f2s prints e2, but I don't know which one is desired.

Can POW10_SPLIT_2 be smaller?

POW10_SPLIT_2 is large, 8 * 4445 * 3 = 106,680 bytes:

static const uint64_t POW10_SPLIT_2[4445][3] = {

Interestingly, it contains 1177 rows that are entirely zero. For example, lines 1583 and 1591 have nonzero elements, but lines 1584-1590 are entirely zero:

ryu/ryu/d2fixed_full_table.h

Lines 1583 to 1591 in 9976c57

{ 0u, 0u, 64000000000u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 175162308u, 0u, 0u },

These entirely zero rows are consuming 26.5% of the table, or 8 * 1177 * 3 = 28,248 bytes. Looking at the usage:

ryu/ryu/d2fixed.c

Lines 448 to 459 in 9976c57

const uint32_t p = POW10_OFFSET_2[idx] + i;
if (p >= POW10_OFFSET_2[idx + 1]) {
// If the remaining digits are all 0, then we might as well use memset.
// No rounding required in this case.
const int32_t fill = precision - 9 * i;
memset(result + index, '0', fill);
index += fill;
break;
}
// Temporary: j is usually around 128, and by shifting a bit, we push it to 128 or above, which is
// a slightly faster code path in mulShift_mod1e9. Instead, we can just increase the multipliers.
uint32_t digits = mulShift_mod1e9(m2 << 8, POW10_SPLIT_2[p], j + 8);

ryu/ryu/d2fixed.c

Lines 650 to 653 in 9976c57

const uint32_t p = POW10_OFFSET_2[idx] + i;
// Temporary: j is usually around 128, and by shifting a bit, we push it to 128 or above, which is
// a slightly faster code path in mulShift_mod1e9. Instead, we can just increase the multipliers.
digits = (p >= POW10_OFFSET_2[idx + 1]) ? 0 : mulShift_mod1e9(m2 << 8, POW10_SPLIT_2[p], j + 8);

makes me think that reorganizing POW10_OFFSET_2 and POW10_SPLIT_2 might allow these entirely zero rows to be eliminated for free (i.e. no additional lookups/branches), possibly even improving performance if this results in skipping mulShift_mod1e9 calls, but this is currently at the outer limit of my understanding of the algorithm.

The library is slower than double-conversion on small integers.

About two times slower on numbers in range 0..9, about 1.5 times slower on numbers in range 0..999.

This is important because your library is pretending to be fastest:

At the time of this writing, these are the fastest known float-to-string conversion algorithms.

But to make the statement that something is fast, it's better to test on real applications and on real datasets.

In what applications there is a need to convert massive amounts of floating point numbers to strings? Some of this applications are using double because there is no other choice or because it is the reasonable default choice.

Example: JavaScript, Lua; JSON serialization, CSV and TSV.
In these applications, most of the doubles will be in fact, integers.

Another examples are time series databases. These databases store sensor values for monitoring. Various metrics can be stored in a single table. For example, server CPU usage is usually fractional, but the number of running processes in a server is integer. It's likely that the most of sensor values in some databases are, in fact, integers.

It means, that it's important to keep in mind the usage scenario when most of formatted floats are integers.

Look for more details here: ClickHouse/ClickHouse#8542
I have plugged ryu in ClickHouse and it shows very promising results except for one case.

Good news is that it's easy to fix with small impact on performance of formatting fractional floats!
Example is here: https://github.com/ClickHouse/ClickHouse/pull/8542/files#diff-5b2c0d6b450ea8c072f39c9625094f66R121
Just check that number is integer and do shortcut.

Some numbers on synthetic benchmark
million rows per second: Ryu, double-conversion, Ryu with shortcut for integers:

SELECT count() FROM system.numbers WHERE NOT ignore(toString(toFloat64(number % 2)))
18.81 26.85 123.33
SELECT count() FROM system.numbers WHERE NOT ignore(toString(toFloat64(number % 10)))
11.59 18.86 102.85
SELECT count() FROM system.numbers WHERE NOT ignore(toString(toFloat64(number % 100)))
10.25 16.08 93.77
SELECT count() FROM system.numbers WHERE NOT ignore(toString(toFloat64(number % 1000)))
10.87 14.92 86.98
SELECT count() FROM system.numbers WHERE NOT ignore(toString(toFloat64(number % 10000)))
10.95 13.57 82.56

Double-conversion compatible output?

Hi, after seeing the benchmarks we are looking into using Ryu instead of the double-conversion library for double-to-string conversions in our JSON library.

For that we don't need flexible output formats like in #27, however we would prefer the output to be formatted like in the double-conversion library, i.e. use scientific notation only for large numbers.

After changing Ryu to use 0.0 instead of 0E0, and e instead of E, here are some examples of how Ryu compares to double-conversion:

= 0.0
= -0.0
! 1.0 1e0
! -1.0 -1e0
! 1.5 1.5e0
! -1.5 -1.5e0
! 0.0001 1e-4
! -0.0001 -1e-4
! 3.1416 3.1416e0
! 42.0 4.2e1
! -42.0 -4.2e1
! 0.1 1e-1
! 10000000000.0 1e10
! -10000000000.0 -1e10
! -10000000000.0 -1e10
= -1e-10
! 12340000000.0 1.234e10
= 1.234e-10
= 1.79769e308
= 2.22507e-308
= -1.79769e308
= -2.22507e-308
= 5e-324
= 2.225073858507201e-308
= 2.2250738585072014e-308
= 1.7976931348623157e308
! 18446744073709552000.0 1.8446744073709552e19
! -9223372036854776000.0 -9.223372036854776e18
! 0.9868011474609375 9.868011474609375e-1
= 1.23e36
! 45913141877270640000.0 4.591314187727064e19
= 2.225073858507201e-308
= 1.7976931348623157e308
= 2.2250738585072014e-308
= 2.225073858507201e-308
! 0.9999999999999999 9.999999999999999e-1
! 1.0000000000000002 1.0000000000000002e0
! 72057594037927930.0 7.205759403792793e16
! 72057594037927940.0 7.205759403792794e16
! 9223372036854775000.0 9.223372036854775e18
! 9223372036854776000.0 9.223372036854776e18
= 1.0141204801825834e31
= 1.0141204801825835e31
= 5.708990770823839e45
= 5.70899077082384e45
! 300.0 3e2

A leading = means that the two coincide, a leading ! means that they don't, with the first number being from double-conversion, and the second from Ryu.

Any thoughts on this subject, or hints on how to go about it?

Thank you!

Feature request: Returning number of characters processed in ryu_parse

ryu_parse does not return the number of characters processed, so we are forced to tokenize the input before feeding it to s2d. s2d already knows when to stop consuming the input at the first non-number character, so there ends up being a duplication of effort and parsing a list of numbers is slower than it should be.

Both std::strtod and std::from_chars return a pointer to the first character not matching the number pattern. ryu_parse should do the same.

With this proposal, the INPUT_TOO_LONG status code would become obsolete.

Java: in correct rounding to two digits

The Java implementation sometimes doesn't output the closest two-digit representation of a number, but instead rounds to one digit (correctly), and then appends a '0'.

Known examples:
1.0E-323 should be 9.9E-324
1.0E-322 should be 9.9E-323

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.