Comments (8)

lemire commented on June 3, 2024

For Buffer, the benchmark uses Buffer.allocUnsafe(), which returns uninitialized memory that may contain data from older buffers.

Having non-deterministic inputs when running benchmarks, especially if you want to make comparisons, does not seem ideal.

Note that running benchmarks in tight loops with tiny data inputs is problematic because processors quickly learn to predict exactly thousands of different branches: https://lemire.me/blog/2019/10/16/benchmarking-is-hard-processors-learn-to-predict-branches/

I realize that it is orthogonal to your concerns, but it seems that this benchmark could be made much more realistic and useful. Firstly, use realistic and deterministic (same each time you run the benchmark) inputs. Secondly, make the benchmark 'large enough' so that the processor cannot learn branches (e.g., use large inputs, or use multiple small inputs).
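
As a minimal sketch of the first suggestion, the input could be built from a fixed sample instead of uninitialized memory. The helper name `makeDeterministicInput` and the sample text are hypothetical, chosen only for illustration; they are not part of Node's benchmark suite. Whole copies of the sample are written and the tail is padded with ASCII spaces, so a multi-byte character is never cut in half.

```javascript
'use strict';

// Arbitrary sample text (an assumption for this sketch): a mix of 1-, 2-, 3-
// and 4-byte UTF-8 sequences, encoded once.
const SAMPLE = Buffer.from('Hello, world! Grüße, 世界, 🚀. ', 'utf8');

// Hypothetical helper: returns a container of the requested type holding
// exactly `len` bytes of valid, repeatable UTF-8.
function makeDeterministicInput(type, len) {
  let backing;
  switch (type) {
    case 'SharedArrayBuffer':
      backing = new SharedArrayBuffer(len);
      break;
    case 'ArrayBuffer':
      backing = new ArrayBuffer(len);
      break;
    case 'Buffer':
      // Buffer.alloc() zero-fills, unlike Buffer.allocUnsafe().
      backing = Buffer.alloc(len);
      break;
    default:
      throw new TypeError(`unknown type: ${type}`);
  }

  // A Buffer is already a Uint8Array; the other two need a view.
  const bytes = type === 'Buffer' ? backing : new Uint8Array(backing);

  // Pad everything with spaces first, then overwrite with whole copies of the
  // sample so the data never ends in the middle of a multi-byte sequence.
  bytes.fill(0x20);
  for (let i = 0; i + SAMPLE.length <= len; i += SAMPLE.length) {
    bytes.set(SAMPLE, i);
  }
  return backing;
}
```

A benchmark could then call `makeDeterministicInput(type, len)` in place of the three allocation branches. This addresses determinism only; the input is still one repeated sample, so it does not by itself address the branch prediction concern.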

from simdutf.

anonrig commented on June 3, 2024

I removed the benchmarks with fatal=0 because they are irrelevant for this benchmark.

anonrig commented on June 3, 2024

The benchmark code:

'use strict';

const common = require('../common.js');

const bench = common.createBenchmark(main, {
  encoding: ['utf-8', 'latin1', 'iso-8859-3'],
  ignoreBOM: [0, 1],
  fatal: [0, 1],
  len: [256, 1024 * 16, 1024 * 512],
  n: [1e2],
  type: ['SharedArrayBuffer', 'ArrayBuffer', 'Buffer']
});

function main({ encoding, len, n, ignoreBOM, type, fatal }) {
  const decoder = new TextDecoder(encoding, { ignoreBOM, fatal });
  let buf;

  switch (type) {
    case 'SharedArrayBuffer': {
      buf = new SharedArrayBuffer(len);
      break;
    }
    case 'ArrayBuffer': {
      buf = new ArrayBuffer(len);
      break;
    }
    case 'Buffer': {
      buf = Buffer.allocUnsafe(len);
      break;
    }
  }

  bench.start();
  for (let i = 0; i < n; i++) {
    try {
      decoder.decode(buf);
    } catch {
      // eslint-disable no-empty
    }
  }
  bench.end(n);
}

anonrig commented on June 3, 2024

The benchmarks are too flaky to produce a reliable result. This is with validate_utf8:

                                                                                                         confidence improvement accuracy (*)    (**)   (***)
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                       -0.17 %       ±6.15%  ±8.18% ±10.65%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='latin1'                            0.80 %       ±6.17%  ±8.23% ±10.75%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='utf-8'                            -5.61 %       ±7.85% ±10.50% ±13.80%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                        2.75 %       ±6.49%  ±8.68% ±11.40%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='latin1'                           -0.86 %       ±3.12%  ±4.15%  ±5.40%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='utf-8'                            -3.50 %       ±6.34%  ±8.45% ±11.03%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                        0.86 %       ±6.80%  ±9.04% ±11.77%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='latin1'                           -3.16 %       ±4.55%  ±6.08%  ±7.97%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='utf-8'                    ***    143.73 %      ±17.56% ±23.55% ±31.04%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                       -0.23 %       ±4.18%  ±5.57%  ±7.26%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='latin1'                           -1.76 %       ±4.33%  ±5.78%  ±7.54%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='utf-8'                    ***    132.20 %      ±19.99% ±26.93% ±35.71%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                         -3.96 %       ±5.38%  ±7.20%  ±9.45%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='latin1'                             -0.08 %       ±3.77%  ±5.02%  ±6.54%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='utf-8'                              -1.09 %       ±6.45%  ±8.60% ±11.24%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                         -0.54 %       ±3.89%  ±5.18%  ±6.75%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='latin1'                             -1.88 %       ±3.54%  ±4.71%  ±6.13%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='utf-8'                              -1.70 %       ±3.14%  ±4.18%  ±5.44%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                         -3.00 %       ±4.03%  ±5.38%  ±7.03%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='latin1'                             -0.35 %       ±3.01%  ±4.00%  ±5.21%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='utf-8'                      ***    119.47 %       ±4.93%  ±6.59%  ±8.65%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                         -1.65 %       ±3.06%  ±4.07%  ±5.30%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='latin1'                             -1.37 %       ±2.76%  ±3.67%  ±4.78%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='utf-8'                      ***    114.16 %       ±5.26%  ±7.05%  ±9.26%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                      -1.61 %       ±4.71%  ±6.28%  ±8.20%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='latin1'                          -1.33 %       ±4.64%  ±6.18%  ±8.06%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='utf-8'                           -4.33 %       ±9.40% ±12.53% ±16.38%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                      -1.82 %       ±2.77%  ±3.68%  ±4.80%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='latin1'                           3.16 %       ±7.01%  ±9.40% ±12.39%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='utf-8'                           -3.22 %       ±6.29%  ±8.39% ±10.95%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                       3.78 %       ±4.02%  ±5.39%  ±7.07%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='latin1'                          -2.18 %       ±7.52% ±10.01% ±13.06%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='utf-8'                   ***    284.40 %      ±15.98% ±21.49% ±28.43%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                      -0.92 %       ±3.87%  ±5.16%  ±6.75%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='latin1'                          -1.56 %       ±3.76%  ±5.01%  ±6.53%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='utf-8'                   ***    281.14 %      ±19.04% ±25.64% ±33.99%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                             2.10 %       ±5.20%  ±6.92%  ±9.00%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='latin1'                                 1.27 %       ±4.56%  ±6.07%  ±7.90%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='utf-8'                                 -6.71 %       ±7.55% ±10.12% ±13.33%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                            -0.84 %       ±3.58%  ±4.77%  ±6.20%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='latin1'                                -1.17 %       ±3.74%  ±4.97%  ±6.47%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='utf-8'                                 -2.69 %       ±7.09%  ±9.47% ±12.39%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                            -0.13 %       ±4.33%  ±5.78%  ±7.56%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='latin1'                                -2.45 %       ±6.56%  ±8.78% ±11.53%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='utf-8'                         ***     17.93 %       ±3.34%  ±4.45%  ±5.79%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                            -0.17 %       ±4.26%  ±5.67%  ±7.37%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='latin1'                                -0.90 %       ±6.50%  ±8.65% ±11.25%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='utf-8'                         ***     15.20 %       ±2.81%  ±3.74%  ±4.87%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                               1.03 %       ±3.45%  ±4.59%  ±5.97%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='latin1'                                   0.98 %       ±4.02%  ±5.36%  ±6.99%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='utf-8'                                   -3.37 %       ±4.64%  ±6.17%  ±8.03%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                              -1.04 %       ±3.78%  ±5.04%  ±6.56%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='latin1'                                  -1.18 %       ±4.05%  ±5.40%  ±7.04%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='utf-8'                                   -3.05 %       ±4.97%  ±6.65%  ±8.73%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                              -3.84 %       ±4.30%  ±5.73%  ±7.47%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='latin1'                                  -2.12 %       ±3.53%  ±4.69%  ±6.11%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='utf-8'                           ***    120.68 %       ±7.60% ±10.11% ±13.16%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                              -2.89 %       ±3.63%  ±4.84%  ±6.30%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='latin1'                                  -1.87 %       ±4.86%  ±6.49%  ±8.48%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='utf-8'                           ***    114.04 %       ±7.78% ±10.45% ±13.80%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                           -2.35 %       ±4.53%  ±6.05%  ±7.91%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='latin1'                               -2.31 %       ±6.60%  ±8.78% ±11.43%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='utf-8'                                -3.60 %       ±8.23% ±10.98% ±14.35%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                           -2.53 %       ±3.43%  ±4.59%  ±6.02%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='latin1'                                1.31 %       ±6.31%  ±8.46% ±11.12%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='utf-8'                                -3.14 %       ±8.23% ±10.96% ±14.26%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                           -1.94 %       ±6.26%  ±8.38% ±11.01%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='latin1'                               -2.19 %       ±6.07%  ±8.10% ±10.60%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='utf-8'                        ***    274.92 %      ±23.65% ±31.84% ±42.20%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                            2.06 %       ±7.00%  ±9.32% ±12.14%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='latin1'                               -0.56 %       ±5.34%  ±7.11%  ±9.25%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='utf-8'                        ***    283.88 %      ±13.04% ±17.53% ±23.18%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                  1.29 %       ±4.98%  ±6.62%  ±8.63%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='latin1'                     -4.57 %       ±5.25%  ±7.02%  ±9.19%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='utf-8'                      -1.99 %       ±5.69%  ±7.57%  ±9.86%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                  0.59 %       ±3.89%  ±5.18%  ±6.74%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='latin1'                     -2.58 %       ±3.46%  ±4.61%  ±6.01%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='utf-8'                      -3.05 %       ±6.79%  ±9.05% ±11.81%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                 -2.55 %       ±4.14%  ±5.52%  ±7.18%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='latin1'                     -0.77 %       ±5.98%  ±8.01% ±10.53%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='utf-8'              ***    153.57 %      ±15.72% ±21.00% ±27.50%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                 -2.21 %       ±4.87%  ±6.51%  ±8.54%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='latin1'                     -2.16 %       ±4.86%  ±6.47%  ±8.43%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='utf-8'              ***    139.29 %      ±13.96% ±18.79% ±24.90%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                    0.51 %       ±3.64%  ±4.85%  ±6.33%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='latin1'                        1.46 %       ±4.14%  ±5.53%  ±7.23%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='utf-8'                         0.05 %       ±5.48%  ±7.33%  ±9.60%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                   -0.45 %       ±3.32%  ±4.42%  ±5.75%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='latin1'                       -0.94 %       ±3.18%  ±4.24%  ±5.52%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='utf-8'                  *     -3.30 %       ±2.70%  ±3.59%  ±4.67%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                   -1.91 %       ±3.09%  ±4.11%  ±5.35%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='latin1'                       -3.21 %       ±4.18%  ±5.57%  ±7.26%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='utf-8'                ***    119.43 %       ±6.18%  ±8.27% ±10.83%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                   -0.87 %       ±3.10%  ±4.12%  ±5.36%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='latin1'                       -2.08 %       ±3.37%  ±4.49%  ±5.85%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='utf-8'                ***    114.55 %      ±12.32% ±16.57% ±21.93%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                -1.38 %       ±5.69%  ±7.60%  ±9.95%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='latin1'                    -2.46 %       ±3.75%  ±5.00%  ±6.51%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='utf-8'                     -2.59 %       ±8.94% ±11.90% ±15.52%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                -1.51 %       ±4.18%  ±5.57%  ±7.25%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='latin1'                    -1.77 %       ±4.95%  ±6.62%  ±8.68%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='utf-8'                     -1.80 %       ±5.89%  ±7.84% ±10.21%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                 0.68 %       ±4.32%  ±5.75%  ±7.48%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='latin1'              *     -6.11 %       ±5.60%  ±7.50%  ±9.86%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='utf-8'             ***    269.62 %      ±28.64% ±38.55% ±51.11%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                -3.75 %       ±6.29%  ±8.45% ±11.17%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='latin1'                    -4.48 %       ±5.61%  ±7.48%  ±9.74%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='utf-8'             ***    281.32 %      ±18.88% ±25.42% ±33.70%

Be aware that when doing many comparisons the risk of a false-positive result increases.
In this case, there are 108 comparisons, you can thus expect the following amount of false-positive results:
  5.40 false positives, when considering a   5% risk acceptance (*, **, ***),
  1.08 false positives, when considering a   1% risk acceptance (**, ***),
  0.11 false positives, when considering a 0.1% risk acceptance (***)
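
The false-positive counts in the note above are just the number of comparisons multiplied by the accepted risk. A minimal sketch of that arithmetic follows; the function name `expectedFalsePositives` is hypothetical, and the calculation assumes that none of the comparisons reflects a real difference.

```javascript
'use strict';

// Expected number of false positives when `comparisons` tests are each run at
// significance level `alpha`, assuming no comparison has a true effect.
function expectedFalsePositives(comparisons, alpha) {
  return comparisons * alpha;
}

// 108 comparisons, as reported by the tool output above.
const levels = [
  { label: '5%', alpha: 0.05 },
  { label: '1%', alpha: 0.01 },
  { label: '0.1%', alpha: 0.001 },
];
for (const { label, alpha } of levels) {
  const expected = expectedFalsePositives(108, alpha).toFixed(2);
  console.log(`${expected} false positives at ${label} risk acceptance`);
}
```

Read this way, a handful of single-star rows among 108 comparisons is about what chance alone would produce, whereas the three-star utf-8 rows with fatal=1 are far more numerous than the roughly 0.11 expected by chance.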

anonrig commented on June 3, 2024

The results with validate_utf8_with_errors:

➜  node git:(deps/simdutf) ✗ node-benchmark-compare decoder.csv
                                                                                                         confidence improvement accuracy (*)   (**)   (***)
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                       -0.84 %       ±1.98% ±2.64%  ±3.46%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='latin1'                            3.25 %       ±5.97% ±8.03% ±10.63%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='utf-8'                             0.36 %       ±4.91% ±6.54%  ±8.52%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                       -0.83 %       ±1.41% ±1.87%  ±2.44%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='latin1'                            0.49 %       ±1.52% ±2.03%  ±2.64%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='utf-8'                            -1.71 %       ±6.14% ±8.19% ±10.68%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                       -0.50 %       ±1.87% ±2.50%  ±3.27%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='latin1'                            0.03 %       ±1.68% ±2.24%  ±2.92%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='utf-8'                    ***    148.51 %       ±7.19% ±9.68% ±12.81%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                       -1.32 %       ±1.82% ±2.42%  ±3.16%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='latin1'                           -0.83 %       ±2.16% ±2.89%  ±3.80%
util/text-decoder.js type='ArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='utf-8'                    ***    155.42 %       ±5.15% ±6.86%  ±8.94%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                         -1.08 %       ±2.09% ±2.78%  ±3.62%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='latin1'                             -0.19 %       ±2.42% ±3.22%  ±4.19%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='utf-8'                               0.04 %       ±2.49% ±3.32%  ±4.35%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                         -0.88 %       ±2.13% ±2.83%  ±3.69%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='latin1'                             -0.85 %       ±1.92% ±2.56%  ±3.34%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='utf-8'                              -2.11 %       ±3.06% ±4.09%  ±5.35%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                         -1.65 %       ±4.62% ±6.19%  ±8.16%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='latin1'                             -1.43 %       ±2.14% ±2.85%  ±3.71%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='utf-8'                      ***    115.98 %       ±4.50% ±6.04%  ±7.97%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                         -0.14 %       ±3.32% ±4.43%  ±5.78%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='latin1'                             -1.10 %       ±1.81% ±2.41%  ±3.14%
util/text-decoder.js type='ArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='utf-8'                      ***    119.89 %       ±3.30% ±4.39%  ±5.72%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                      -1.97 %       ±4.48% ±6.03%  ±7.97%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='latin1'                           0.72 %       ±1.51% ±2.01%  ±2.62%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='utf-8'                            0.84 %       ±1.98% ±2.64%  ±3.45%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                      -0.39 %       ±1.05% ±1.41%  ±1.84%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='latin1'                          -2.31 %       ±4.75% ±6.40%  ±8.48%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='utf-8'                            1.24 %       ±2.22% ±2.96%  ±3.87%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                      -0.18 %       ±1.52% ±2.03%  ±2.64%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='latin1'                          -2.90 %       ±6.32% ±8.51% ±11.29%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='utf-8'                   ***    278.79 %       ±6.25% ±8.40% ±11.10%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                      -0.51 %       ±3.84% ±5.11%  ±6.66%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='latin1'                          -0.57 %       ±0.94% ±1.26%  ±1.66%
util/text-decoder.js type='ArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='utf-8'                   ***    280.59 %       ±6.46% ±8.66% ±11.42%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                            -0.27 %       ±1.37% ±1.83%  ±2.40%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='latin1'                                -0.26 %       ±2.07% ±2.75%  ±3.58%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='utf-8'                                  1.67 %       ±5.40% ±7.19%  ±9.37%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                            -0.14 %       ±2.74% ±3.64%  ±4.74%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='latin1'                                -0.49 %       ±2.34% ±3.12%  ±4.08%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='utf-8'                                  0.93 %       ±4.14% ±5.51%  ±7.17%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                             1.86 %       ±3.59% ±4.78%  ±6.22%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='latin1'                                -0.60 %       ±2.71% ±3.61%  ±4.70%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='utf-8'                         ***     94.61 %       ±3.21% ±4.27%  ±5.55%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                            -2.26 %       ±2.79% ±3.71%  ±4.83%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='latin1'                                -0.73 %       ±1.45% ±1.92%  ±2.51%
util/text-decoder.js type='Buffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='utf-8'                         ***    100.43 %       ±5.28% ±7.09%  ±9.35%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                              -0.09 %       ±2.23% ±2.97%  ±3.87%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='latin1'                                  -1.46 %       ±1.78% ±2.37%  ±3.09%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='utf-8'                                    0.40 %       ±3.87% ±5.19%  ±6.84%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                              -0.43 %       ±2.30% ±3.06%  ±3.99%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='latin1'                                  -0.58 %       ±1.83% ±2.44%  ±3.18%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='utf-8'                                   -1.39 %       ±2.17% ±2.89%  ±3.77%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                              -0.13 %       ±2.61% ±3.48%  ±4.53%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='latin1'                            *     -2.61 %       ±2.17% ±2.90%  ±3.78%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='utf-8'                           ***    116.45 %       ±3.45% ±4.62%  ±6.07%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                        *     -2.58 %       ±2.37% ±3.16%  ±4.12%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='latin1'                                  -0.43 %       ±3.25% ±4.33%  ±5.63%
util/text-decoder.js type='Buffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='utf-8'                           ***    119.79 %       ±5.94% ±7.91% ±10.30%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                           -1.07 %       ±2.51% ±3.36%  ±4.42%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='latin1'                               -0.61 %       ±1.11% ±1.48%  ±1.94%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='utf-8'                                 0.52 %       ±1.35% ±1.79%  ±2.33%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                           -0.16 %       ±1.55% ±2.06%  ±2.68%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='latin1'                               -1.83 %       ±3.73% ±5.03%  ±6.66%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='utf-8'                                 2.54 %       ±3.59% ±4.83%  ±6.41%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                           -0.49 %       ±1.72% ±2.31%  ±3.04%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='latin1'                               -1.18 %       ±3.20% ±4.31%  ±5.70%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='utf-8'                        ***    277.10 %       ±6.77% ±9.08% ±11.98%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                           -2.40 %       ±2.76% ±3.71%  ±4.90%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='latin1'                               -2.82 %       ±5.91% ±7.96% ±10.55%
util/text-decoder.js type='Buffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='utf-8'                        ***    284.70 %       ±5.59% ±7.46%  ±9.74%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                  0.15 %       ±1.21% ±1.61%  ±2.09%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='latin1'                     -1.77 %       ±1.80% ±2.39%  ±3.12%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=0 encoding='utf-8'                       1.77 %       ±4.46% ±5.96%  ±7.78%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                  0.40 %       ±1.29% ±1.72%  ±2.24%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='latin1'              **     -2.45 %       ±1.82% ±2.42%  ±3.16%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=0 ignoreBOM=1 encoding='utf-8'                       0.45 %       ±3.42% ±4.55%  ±5.92%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                  0.69 %       ±2.20% ±2.93%  ±3.84%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='latin1'                     -1.47 %       ±2.42% ±3.22%  ±4.19%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=0 encoding='utf-8'              ***    152.66 %       ±6.57% ±8.82% ±11.63%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                 -0.36 %       ±1.42% ±1.89%  ±2.46%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='latin1'                     -1.51 %       ±1.70% ±2.27%  ±2.97%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=16384 fatal=1 ignoreBOM=1 encoding='utf-8'              ***    154.13 %       ±6.60% ±8.79% ±11.44%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                   -2.09 %       ±2.11% ±2.82%  ±3.69%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='latin1'                       -0.62 %       ±2.56% ±3.41%  ±4.44%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=0 encoding='utf-8'                        -2.19 %       ±2.82% ±3.76%  ±4.89%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                   -1.21 %       ±2.51% ±3.35%  ±4.39%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='latin1'                       -1.46 %       ±2.32% ±3.09%  ±4.02%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='utf-8'                 **     -3.41 %       ±2.21% ±2.94%  ±3.83%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                   -0.26 %       ±1.81% ±2.41%  ±3.14%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='latin1'                       -1.42 %       ±1.86% ±2.47%  ±3.22%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=0 encoding='utf-8'                ***    115.57 %       ±5.07% ±6.82%  ±9.00%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                   -0.40 %       ±2.20% ±2.93%  ±3.82%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='latin1'                       -1.80 %       ±2.13% ±2.84%  ±3.70%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=1 ignoreBOM=1 encoding='utf-8'                ***    117.77 %       ±4.62% ±6.19%  ±8.13%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='iso-8859-3'                -0.45 %       ±1.04% ±1.40%  ±1.83%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='latin1'              *     -1.83 %       ±1.49% ±1.98%  ±2.58%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=0 encoding='utf-8'                      0.72 %       ±2.33% ±3.09%  ±4.03%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='iso-8859-3'                -0.55 %       ±0.97% ±1.29%  ±1.70%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='latin1'              *     -3.33 %       ±3.08% ±4.15%  ±5.49%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=0 ignoreBOM=1 encoding='utf-8'                      0.68 %       ±1.85% ±2.47%  ±3.23%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='iso-8859-3'                 0.17 %       ±1.69% ±2.25%  ±2.94%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='latin1'                    -1.59 %       ±1.63% ±2.18%  ±2.84%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=0 encoding='utf-8'             ***    282.58 %       ±6.01% ±8.03% ±10.51%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='iso-8859-3'                 0.22 %       ±1.51% ±2.02%  ±2.63%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='latin1'             **     -2.87 %       ±1.78% ±2.39%  ±3.16%
util/text-decoder.js type='SharedArrayBuffer' n=100 len=524288 fatal=1 ignoreBOM=1 encoding='utf-8'             ***    283.77 %       ±5.30% ±7.05%  ±9.20%

Be aware that when doing many comparisons the risk of a false-positive result increases.
In this case, there are 108 comparisons, you can thus expect the following amount of false-positive results:
  5.40 false positives, when considering a   5% risk acceptance (*, **, ***),
  1.08 false positives, when considering a   1% risk acceptance (**, ***),
  0.11 false positives, when considering a 0.1% risk acceptance (***)
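
The expected counts above are just the number of comparisons multiplied by the accepted risk level. A minimal sketch of that arithmetic follows; the function name `expectedFalsePositives` is made up for illustration and is not part of the Node benchmark tooling.

```javascript
// Expected number of false positives when each of `comparisons` tests
// is evaluated at significance level `alpha`.
function expectedFalsePositives(comparisons, alpha) {
  return comparisons * alpha;
}

// The three risk levels reported by the comparison script.
for (const alpha of [0.05, 0.01, 0.001]) {
  console.log(alpha, expectedFalsePositives(108, alpha).toFixed(2));
}
```

With 108 comparisons, a handful of single-star rows are therefore expected by chance alone, which is why only the `***` rows are a reasonably safe basis for conclusions.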

from simdutf.

lemire avatar lemire commented on June 3, 2024

There is a lot of code involved in this Node benchmark. So let us try to take apart the issue. It is likely that I do not understand everything so please correct me as needed.

Firstly, it seems that the benchmark entails decoding an input initialized with zeros (all zeros?). It seems that the benchmarked code is like this...

  const decoder = new TextDecoder(encoding, { ignoreBOM });
  let buf = new ArrayBuffer(len);

  bench.start();
  for (let i = 0; i < n; i++) {
    decoder.decode(buf);
  }
  bench.end(n);

So the input is always valid, always ASCII, and always zero? If I understand correctly, this benchmark is not very realistic. When do you ever transcode the same array of zeros multiple times?
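
One way to get inputs that are both deterministic and less trivial than a zero buffer is to generate them from a seeded pseudo-random generator. The following is only a sketch: the names `makeRng` and `makeInputs`, the seed, and the small alphabet are assumptions for illustration and do not come from the Node benchmark suite.

```javascript
// Small deterministic PRNG (a linear congruential generator) so that every
// run of the benchmark sees exactly the same bytes.
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state >>> 16; // use the high bits, which are better distributed
  };
}

// Build `count` valid UTF-8 inputs of at least `minLen` bytes each, mixing
// ASCII with two- and three-byte characters.
function makeInputs(count, minLen, seed = 42) {
  const rng = makeRng(seed);
  const alphabet = ['a', 'e', ' ', 'z', 'é', 'ü', '€', '語']
    .map((ch) => Buffer.from(ch, 'utf8'));
  const inputs = [];
  for (let i = 0; i < count; i++) {
    const chunks = [];
    let bytes = 0;
    while (bytes < minLen) {
      const ch = alphabet[rng() % alphabet.length];
      chunks.push(ch);
      bytes += ch.length;
    }
    inputs.push(Buffer.concat(chunks));
  }
  return inputs;
}

// Decode many distinct inputs instead of one buffer over and over.
const inputs = makeInputs(64, 1024);
const decoder = new TextDecoder('utf-8', { fatal: true });
let totalChars = 0;
for (const input of inputs) totalChars += decoder.decode(input).length;
console.log(inputs.length, totalChars);
```

Cycling through several distinct inputs also makes it harder for the processor to memorize the branch pattern of a single tiny buffer.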

Let us put that aside for now.

Is the issue with the simdutf library? That is, do we have a competing validate_utf8 that is significantly faster somewhere in Node or in its dependencies? I am skeptical.

If I look at the PR at nodejs/node#45803, it seems that the PR adds the following...

// Convert the input into an encoded string
void DecodeUTF8(const FunctionCallbackInfo<Value>& args) {
...
  bool has_fatal = args[2]->IsTrue();


   if (has_fatal) {
     bool is_valid = simdutf::validate_utf8(data, length);

     if (!is_valid) {
       return node::THROW_ERR_ENCODING_INVALID_ENCODED_DATA(
           env->isolate(), "The encoded data was not valid for encoding utf-8");
     }
   }

If I understand this code correctly, it should only ever run if fatal=1. Yet we have significant variations irrespective of the value of the variable fatal. That's a bit curious.

In any case, the new code does not replace an existing (slower?) function. Rather, it seems that it adds an extra path. From the look of it, it is followed by a validating transcoding: one checks that the input is valid, and if it is, then we proceed to transcode with validation. If the input is valid, you effectively now have two passes over the data, a validation pass followed by a validating transcoding. Thus you would expect the sum total to become slower. Only when the input is invalid would you expect a performance gain (and if that's your expectation, I would recommend validate_utf8_with_errors).
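
The two-pass shape can be illustrated in JavaScript. This is only an analogue, not the PR's C++ code: it assumes a Node.js release that exports `isUtf8` from `node:buffer`, and `decodeWithPrecheck` is a hypothetical name.

```javascript
const { isUtf8 } = require('node:buffer');

// Pass 1 validates the bytes; pass 2 decodes them, and the decoder
// validates again on its own.
function decodeWithPrecheck(bytes) {
  if (!isUtf8(bytes)) {
    // Invalid input: fail early and skip the decoding pass entirely.
    throw new TypeError('The encoded data was not valid for encoding utf-8');
  }
  // Valid input: the data is scanned a second time here.
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

console.log(decodeWithPrecheck(Buffer.from('héllo', 'utf8')));
```

For valid input, the pre-check is pure extra work on top of the decode; it can only pay off when the input turns out to be invalid.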

If we are going to transcode anyhow, then what we want is to modify the encoding function...

  Local<Value> error;
  MaybeLocal<Value> maybe_ret =
      StringBytes::Encode(env->isolate(), data, length, UTF8, &error);
  Local<Value> ret;
  if (!maybe_ret.ToLocal(&ret)) {
    CHECK(!error.IsEmpty());
    env->isolate()->ThrowException(error);
    return;
  }
  args.GetReturnValue().Set(ret);

In any case, ultimately, in the case where you are decoding UTF-8, it appears that Node calls the v8 function v8::String::NewFromUtf8 and that's where the bulk of the work (including the validation) is done.

If you search through the v8 code, it leads you to a DFA decoder (http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ ), see file utf8-decoder.h, which we already use as a reference.

This being said, TextDecoder should only call DecodeUTF8 when we specify UTF-8 as the input. So I am perplexed by the performance regression regarding other encodings.

So let me sum up what I think I understand:

  1. The benchmark is somewhat naive (transcoding zero buffers) and unrealistic. Nevertheless, it appears to always decode valid (trivial) inputs.
  2. The PR appears to add a validation step which could only help in the case where the input is invalid (in which case we would skip the validating transcoding).
  3. The numbers appear to be affected even when UTF-8 is not involved which is suspicious.

Please correct my understanding.

from simdutf.

anonrig avatar anonrig commented on June 3, 2024

So the input is always valid, always ASCII, and always zero? If I understand correctly, this benchmark is not very realistic. When do you ever transcode the same array of zeros multiple times?

The input is only zero for ArrayBuffer and SharedArrayBuffer types. For Buffer, benchmark uses Buffer.allocUnsafe() which uses uninitialized memory, which may contain information from older buffers.

Since Buffer.allocUnsafe can contain any value, it throws an error due to invalid UTF-8 input. For SharedArrayBuffer and ArrayBuffer, since the input uses .fill(0), the only performance degradation might come from the introduction of simdutf. In that scenario, the second benchmark I added, which uses validate_utf8_with_errors, proves that it's faster in certain scenarios, even though there is mid-confidence (**) that latin1 is 2-3% slower.
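
A small sketch of the difference between the two kinds of input. The invalid byte pair below is hand-picked to stand in for whatever Buffer.allocUnsafe() happens to return, since uninitialized memory may or may not contain valid UTF-8 on any given run.

```javascript
// ArrayBuffer (like SharedArrayBuffer) is zero-initialized, so it decodes
// cleanly as a run of U+0000 characters even with fatal: true.
const zeros = new ArrayBuffer(4);
const strict = new TextDecoder('utf-8', { fatal: true });
const zeroText = strict.decode(zeros);

// 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
const invalid = Uint8Array.of(0xc3, 0x28);

// With fatal: true the decoder throws on invalid input...
let fatalError = null;
try {
  strict.decode(invalid);
} catch (err) {
  fatalError = err;
}

// ...while the default decoder substitutes U+FFFD and carries on.
const lenient = new TextDecoder('utf-8').decode(invalid);

console.log(zeroText.length, fatalError !== null, lenient.includes('\uFFFD'));
```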

If I understand this code correctly, it should only ever run if fatal=1. Yet we have significant variations irrespective of the value of the variable fatal. That's a bit curious.

Prior to this pull request, for certain parameters (encoding=UTF8, fatal != true) we were not initializing ICU and were doing the decoding directly using StringBytes::Encode, which used String::NewFromUtf8. With this pull request, encoding=UTF8, fatal == true would also take this fast path (again without initializing ICU).

The numbers appear to be affected even when UTF-8 is not involved which is suspicious.

In my last benchmark run, you'll see that there is strong confidence, represented by ***, for the UTF8 values. Other non-confident benchmark results shouldn't be used as a reference, and might change due to system/network related noise (I ran the benchmarks on my local machine).

In summary:

The only downside I see is with small strings.

util/text-decoder.js type='SharedArrayBuffer' n=100 len=256 fatal=0 ignoreBOM=1 encoding='utf-8'                 **     -3.41 %       ±2.21% ±2.94%  ±3.83%

from simdutf.

lemire avatar lemire commented on June 3, 2024

@anonrig I am reclosing this issue. If you think that there is an issue, it would be best to propose a specific C++ benchmark where we can compare two equivalent functions side-by-side with accompanying inputs. In this instance, the function that v8 relies upon for converting from UTF-8 is already in our benchmarks. You can see it here... https://github.com/simdutf/simdutf/blob/master/benchmarks/competition/hoehrmann/hoehrmann.h

If we want to discuss simdutf-related design issues, it might be best to open a discussion at https://github.com/simdutf/simdutf/discussions

from simdutf.
