How does the graph differ from Lichess's?,about rooklift/nibbler

Comments (33)

rooklift commented on June 1, 2024 1

I guess this line?

https://github.com/rooklift/nibbler/blob/master/files/src/renderer/55_winrate_graph.js#L96C5-L96C30

e here is some number between 0 and 1

from nibbler.

yuzisee commented on June 1, 2024 1

My understanding is, engines output centipawns, then there's some conversion formula for win %. So where's this calculation happen?

https://github.com/vondele/Stockfish/blob/ad2aa8c06f438de8b8bb7b7c8726430e3f2a5685/src/uci.cpp#L223

win_rate_per_mille = 1000 / (1 + std::exp((a - x) / b))
win_rate = 1.0 / (1 + std::exp((a - x) / b))
win_rate = σ((x - a) / b) = σ(centipawn_scale_value)
logit(win_rate) = centipawn_scale_value

Where σ is the standard logistic function, and logit is the logit function: https://en.wikipedia.org/wiki/Logit

The rounding in Stockfish's win_rate_model means resolution of win_rate is capped at

0.9995 = 99.95% ≅ centipawn_scale_value of +7.6
0.0005 = 0.05% ≅ centipawn_scale_value of −7.6

So replacing

nibbler/files/src/renderer/55_winrate_graph.js

Line 96 in 004e66b

let y = (1 - e) * height;

with something like

let clamped_e = Math.min(Math.max(e, 0.0005), 0.9995);  // values from 0.0005 to 0.9995
let logit_e = 2.0 * Math.atanh(2.0 * clamped_e - 1.0);  // values from −7.6 to +7.6
let logit_e_scaled = (logit_e / 7.6) / 2.0;  // values from −0.5 to +0.5
let logit_e_scaled_01 = logit_e_scaled + 0.5;  // values from 0.0 to 1.0
let y = (1 - logit_e_scaled_01) * height;

should do what you're looking for.
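As a self-contained sketch of the same mapping (the function and constant names here are illustrative and are not Nibbler identifiers), the five lines can be wrapped in a helper. It uses Math.log directly, which gives the same value as the 2.0 * Math.atanh(2.0 * e - 1.0) form, and it derives the 7.6 limit from the 0.0005 clamp instead of hard-coding it:

```javascript
// Sketch only: maps a winrate e in [0, 1] to a canvas y coordinate on a logit scale.
// WINRATE_EPSILON, LOGIT_LIMIT and the function names are hypothetical.
const WINRATE_EPSILON = 0.0005; // resolution of a per-mille winrate
const LOGIT_LIMIT = Math.log((1 - WINRATE_EPSILON) / WINRATE_EPSILON); // about 7.6

// logit(p) = ln(p / (1 - p)), the inverse of the standard logistic function
function logit(p) {
  return Math.log(p / (1 - p));
}

function winrateToLogitY(e, height) {
  // Clamp so that logit() never sees exactly 0 or 1
  const clamped = Math.min(Math.max(e, WINRATE_EPSILON), 1 - WINRATE_EPSILON);
  // Rescale from [-LOGIT_LIMIT, +LOGIT_LIMIT] to [0, 1]
  const scaled01 = logit(clamped) / LOGIT_LIMIT / 2 + 0.5;
  // Canvas y grows downward, so a winning eval is drawn near the top
  return (1 - scaled01) * height;
}
```

A winrate of 0.5 lands on the middle of the graph, while 0 and 1 land on the bottom and top edges.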

Here's what that would look like

from nibbler.

yuzisee commented on June 1, 2024 1

I really don't follow. Why is it something on move X that it's not on other moves? And it changes over time? Well, your linked thread probably answers all that.

Yes. The linked thread answers all that.

One way to think about it is the quote "Centipawns are great for developing chess engines, which is their main use. But not so much for human comprehension." from (2) which you linked above.

What happened is that this same "human comprehension" problem also caused a headache for the Stockfish team because, as you saw in 1 that you linked above, both Lichess & Chess.com (and now Nibbler) look at Win% to classify inaccuracy/mistake/blunder and not centipawns.

So in October 2022 the Stockfish team set out to reduce this headache by getting very detailed about the winrate↔centipawns relationship inside the engine. That's what the thread official-stockfish/Stockfish#4216 is about. By that point the winrate model had been thoroughly tested for 28 months in development and 24 months in production, across three full Stockfish versions (12, 13 & 14).

Comparing the…

latest Stockfish formula

winrate_scale = 0.5  +  0.5 / (1 + std::exp((a - centipawn*MULTIPLIER) / b)) - 0.5 / (1 + std::exp((a + centipawn*MULTIPLIER) / b))

Stockfish's current values for MULTIPLIER, m, a and b are:

MULTIPLIER = 3.28
m = ply / 64.0
a = 0.38036525 m³ − 2.82015070 m² + 23.17882135 m + 307.36768407
b = −2.29434733 m³ + 13.27689788 m² − 14.26828904 m + 63.45318330

with the…

latest Lichess formula

winrate_scale = 1/(1+Math.exp(MULTIPLIER * centipawn_scale))

Lichess's current value for MULTIPLIER is -0.368208

we can get "winrate_scale" evals corresponding to e.g. −2.5, −1.0, +1.0, and +2.5 "centipawn_scale" evals

formula	eval −2.5	eval −1.0	eval +1.0	eval +2.5
SF v16 @ move 0	~100% B	79.0% B : 21.0% W	21.0% B : 79.0% W	~100% W
SF v16 @ move 32	~100% B	75.0% B : 25.0% W	25.0% B : 75.0% W	~100% W
SF v16 @ move ≥120	99.9% B : 0.1% W	67.5% B : 32.5% W	32.5% B : 67.5% W	0.1% B : 99.9% W
Lichess 2023	71.5% B : 28.5% W	59.1% B : 40.9% W	40.9% B : 59.1% W	28.5% B : 71.5% W
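The rows above can be reproduced with a short sketch of both formulas, using the coefficients quoted in this comment. The function names are made up for illustration. Two things are assumptions here: capping ply at 240 is inferred from the "move ≥120" row, and "move N" is taken to mean ply 2N.

```javascript
// Coefficients as quoted in this thread for Stockfish 16, highest power first.
const AS = [0.38036525, -2.82015070, 23.17882135, 307.36768407];
const BS = [-2.29434733, 13.27689788, -14.26828904, 63.45318330];
const SF_MULTIPLIER = 3.28;           // centipawns -> Stockfish internal units
const LICHESS_MULTIPLIER = -0.368208; // applied to an eval measured in pawns

// Evaluate c[0]*m^3 + c[1]*m^2 + c[2]*m + c[3]
function cubic(c, m) {
  return ((c[0] * m + c[1]) * m + c[2]) * m + c[3];
}

// Expected score for White (0..1) from a Stockfish eval in pawns at a given ply.
function stockfishWinrateScale(pawns, ply) {
  const m = Math.min(ply, 240) / 64.0; // assumption: ply is capped at 240
  const a = cubic(AS, m);
  const b = cubic(BS, m);
  const x = pawns * 100 * SF_MULTIPLIER;
  const win = 1 / (1 + Math.exp((a - x) / b));
  const loss = 1 / (1 + Math.exp((a + x) / b));
  return 0.5 + 0.5 * win - 0.5 * loss;
}

// Lichess winrate (0..1) from an eval in pawns.
function lichessWinrateScale(pawns) {
  return 1 / (1 + Math.exp(LICHESS_MULTIPLIER * pawns));
}

// Print White's share for each row of the table.
for (const pawns of [-2.5, -1.0, 1.0, 2.5]) {
  const cells = [0, 64, 240].map((ply) => (100 * stockfishWinrateScale(pawns, ply)).toFixed(1));
  console.log(pawns, cells.join(" "), (100 * lichessWinrateScale(pawns)).toFixed(1));
}
```

Because the win and loss terms are mirror images of each other, an eval of exactly 0.00 comes out at 50% for every ply.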

Does your proposed code insertion handle it?

Yes, the proposed code insertion handles it, because Nibbler reads the direct wdl output of Stockfish which already has the corrections applied.

If you make the proposed code changes, you'll get the same "line graph" as Lichess (as long as you're running at the same depth as Lichess; Lichess seems to run at very low depth).

Regardless, we have 1, then 2 - and from (2) we have "Why not just use Stockfish centipawns?", suggesting it's not using direct Stockfish outputs. I've only skimmed (2). Is (2) outdated, or did I misread?

To verify what is out-of-date and what is up-to-date, we have to trace the following:

2018 and earlier
- engines did not typically output win percentages, everyone had to calculate on their own
- as early as 2016, Lichess has calculated winrate to set the size of arrows when drawing them on the board and for drawing the eval bar… BUT did not yet use winrates for the analysis graph nor inaccuracy/blunder/mistake calculation
Feb 2018: Lc0 calculates a crude centipawn↔winrate conversion formula and releases v0.1 to the world
Nov 2018: Lichess realizes that centipawns suck and switches to win% for inaccuracy/blunder/mistake calculation — this is what the article refers to https://web.archive.org/web/20220715020005/https://lichess.org/page/accuracy
Feb 2019: Lc0 starts to calculate WDL directly in its development branch, which means winrate is now more reliable than centipawns (since centipawn conversion is lossy as it cannot express drawrate information)
March 2019: Lc0 v0.21 ships
June 2019: Nibbler v0.5.9 adds the ability to display centipawns and Nibbler v0.6.7 adds the ability to select centipawn output
July 2019: Stockfish and Lc0 make their formal partnership concrete
Nov 2019: Lc0 allows outputting winrate based on its full WDL calculations
Dec 2019: Lc0 v0.23.0 ships
Jan 2020: Nibbler v1.1.3 adds the ability to display winrate
April 2020: Lc0 realizes that Stockfish's centipawn↔winrate calculation is not stable and publishes a blog post about it https://lczero.org/blog/2020/04/wdl-head/
April 2020: Nibbler v1.2.1 makes winrate the default output
June 2020: Stockfish displays winrate for the first time in its development branch, using a WDL model
Sept 2020: Stockfish 12 ships the winrate feature
July 2022: Lichess refreshes their winrate formula to be based on player performance at rapid time controls
Nov 2022: Stockfish makes winrate the official source of centipawn normalization
Dec 2022: Stockfish 15.1 is released, officially retiring "centipawn" as an official reference point
May 2023: Lc0 stabilizes their centipawn equivalence so that it can be kept in sync with winrate going forward and publishes a blog post about it https://lczero.org/blog/2023/07/the-lc0-v0.30.0-wdl-rescale/contempt-implementation

Summary

Lichess calculates winrate based on actual player performance during rapid time controls, but draws their graph using Stockfish centipawns. Stockfish calculates centipawns using NNUE and uses a formula to keep them in sync with winrates with a process that is stable between releases.

Given that Lc0 now calculates winrates directly, and converts to a centipawn value purely to work with the Stockfish definition and keep them in sync, it seems like graphing winrate is probably the better choice when performing engine analysis.

Note that Nibbler ideally works well with both Stockfish type engines and Leela type engines, so showing winrate helps achieve that goal since it's a stable output reference point for both engine types. For now, the Stockfish centipawn↔winrate equivalence and the Lichess centipawn↔winrate equivalence are both based on the https://en.wikipedia.org/wiki/Logit function, so it's very simple (5 lines of code) to match them up visually as we saw here #242 (comment)

TL;DR:

The (2) you linked was first published in 2018. Stockfish did not provide winrates back then.

Stockfish started showing winrates in June 2020 and redefined centipawns to match winrates in Nov 2022. Meanwhile, Lichess keeps a separate winrate formula because it's based on player performance in rapid time controls whereas Stockfish is giving you the "winrate of the position itself" so to speak.

Nobody is relying on centipawns as a reference point anymore.

from nibbler.

yuzisee commented on June 1, 2024 1

To a perfect player, the graph only ever takes on three values (1, -1, 0)

The same happens in centipawns though. At high depth, the graph will only ever take on three values: (+200 a.k.a. Mate for white, −200 a.k.a. Mate for black, 0.0.)

The problem isn't centipawns vs winrate, it's that the depth of your engine is too high.

It'd be more informative to output "human winrate"

Lc0 specifically added this feature just last month! You give it your Elo and it calculates a "human winrate" at that level for any/each position.

https://lczero.org/blog/2023/07/the-lc0-v0.30.0-wdl-rescale/contempt-implementation/

When I get back home maybe I'll see about a pull request for Nibbler to support this new setting.

Lc0 Discord often discusses interesting stuff

Oh neat, I'll check it out. Thanks!

Your labeling code is unaffected by changing the graph code?

Yes. They are separate and don't rely on each other.

from nibbler.

Naphthalin commented on June 1, 2024 1

Just saw that you referenced LeelaChessZero/lc0#1791 and wanted to add my two cents on eval definition/normalization and win rate, despite the issue already being closed.

I personally don't think that engine WDL (and the derived expected score) is particularly useful for analyzing human games, and requires some translation to be of use, which is mostly what Lc0 PR1791 is about. After working on that quite some time, I came to the conclusion that what is now the WDL_mu score type is the most (or possibly only) natural way of assigning a single number to a position, and the normalized SF centipawn eval does a very good job at approximating that up to +2.5 or so, at which point evals become irrelevant. Within this range, the main problem isn't that centipawn evals are a bad reference, it's that engines are too strong.

I'd personally strongly prefer Nibbler to show eval (SF centipawn or Lc0 WDL_mu) on a [-2.5, +2.5] y-axis with dashed lines marking the +1 and -1 boundaries for 50% W/L over displaying the expected score; maybe I can convince @rooklift why that would be a good idea :)

from nibbler.

Naphthalin commented on June 1, 2024 1

@yuzisee

The trouble with defaulting to WDL_mu in Nibbler is making sure the user gets the correct WDLDrawRateReference since right now Nibbler doesn't care what network you use as long as you provide one.

Not sure where you took that from; if the user is using defaults (and not setting any contempt related parameters), there won't be any WDL adjustment. However, the beauty of the WDL_mu eval is that it is almost completely invariant under the applied WDL transformation (and to some extent, even network selection!), unlike any other score type like Q or centipawn. Feel free to test various values of WDLCalibrationElo with either Contempt: "0" and/or WDLEvalObjectivity: 1.0 to see how it affects WDL_mu and Q evals :)

I can't speak for @rooklift here, but I presume the idea that a new person can download Nibbler and it "works out of the box" is appealing.

Seems like his opinion is similar, given that #244 makes my proposed "display centipawn from -2.5 to +2.5 eval, with dashed lines depicting +1.0 and -1.0" the default.

Blunder/Mistake/Inaccuracy checks by convention still rely on some sort of WDL threshold rather than a centipawn threshold. Given this issue of "engines are too strong, WDL needs translation or it will be too extreme" would you like to suggest a path forward for #237

Due to Nibbler being inherently a GUI for Lc0, this question is very closely related to how the --temp-value-cutoff parameter is applied in Lc0, which currently also is based on deltaQ. However, my WDL adjustment/rescale/contempt implementation from the mentioned PR1791 is based on the central assumption that there is some scale where the numerical size of inaccuracies doesn't depend on the evaluation (unlike centipawn and winrate, which are respectively more and less sensitive in decisive positions), and the evaluation on this scale is exactly what the WDL_mu score estimates. I didn't do any of the actual data science groundwork to check how well that translates to human games, but I presume that judging inaccuracies on the WDL_mu scale should work significantly better than both (native engine) winrate and (old A/B engine) centipawn. Since the introduction of normalizing evals to +1.00 = 50% W, however, the SF centipawn has been doing a very good job of approximating WDL_mu (it only has a minor issue due to game progress dependence) up to +2.5 or so, beyond which things become increasingly meaningless.

To throw around some numbers in the WDL_mu scale: Since +0.25 is roughly the initial white advantage (half a tempo?) and 1.0 the difference between equal and 50% W, it might make sense to start with thresholds of 0.25, 0.5 and 1.0, basically meaning "giving up the initiative" (inaccuracy), "handing over the initiative to the opponent" (mistake) and "enough to change the most likely outcome" (blunder), with treating evals above +2.5 as virtually the same.
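A rough sketch of how those suggested thresholds could be applied is below. Everything in it is hypothetical: the function name, the labels, and the convention that both evals are given from the mover's point of view on the WDL_mu (or normalized SF centipawn) scale. It is not how Nibbler or Lc0 currently classify moves.

```javascript
// Thresholds suggested above, on the WDL_mu scale.
const INACCURACY_DROP = 0.25;
const MISTAKE_DROP = 0.5;
const BLUNDER_DROP = 1.0;
const EVAL_CAP = 2.5; // evals beyond +/-2.5 are treated as virtually the same

function capEval(x) {
  return Math.max(-EVAL_CAP, Math.min(EVAL_CAP, x));
}

// evalBefore and evalAfter are both from the mover's perspective.
function classifyMove(evalBefore, evalAfter) {
  const drop = capEval(evalBefore) - capEval(evalAfter);
  if (drop >= BLUNDER_DROP) return "blunder";
  if (drop >= MISTAKE_DROP) return "mistake";
  if (drop >= INACCURACY_DROP) return "inaccuracy";
  return "ok";
}
```

With the cap in place, going from +5 to +3 is not flagged at all, since both sides of that comparison count as +2.5.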

from nibbler.

rooklift commented on June 1, 2024 1

Although I like the sort of "critical strip" in the new graph, the clamping of it to the [-2.5, 2.5] centipawn range is a bit counterintuitive...

from nibbler.

OverLordGoldDragon commented on June 1, 2024 1

The original topic's been answered, now content's being mixed and isn't great for future readers. I think a separate Issue is best - if this makes sense, I've opened #245, consider continuing there.

It'd also be good if @yuzisee moved their latest comment to there (and deleted it here)

from nibbler.

OverLordGoldDragon commented on June 1, 2024

Well that's fetched from eval_list[n] and eval_list = node.future_eval_history();, so it's really in future_eval_history(), but that's defined as this.get_end().eval_history();?

My understanding is, engines output centipawns, then there's some conversion formula for win %. So where's this calculation happen?

from nibbler.

rooklift commented on June 1, 2024

It's been a while, but as far as I remember, Leela outputs winrates which Nibbler just uses directly. Stockfish doesn't but does output WDL stats, which can be easily converted to winrates - which I think is what Nibbler does.
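As a sketch of what such a conversion might look like (not Nibbler's actual code): when the UCI_ShowWDL option is enabled, Stockfish appends a "wdl W D L" field in per-mille to its info lines. Counting a draw as half a win then gives a single number between 0 and 1. The function names, and the choice to weight draws at one half, are assumptions.

```javascript
// Extract the per-mille win/draw/loss triple from a UCI info line, or null if absent.
function parseWdl(infoLine) {
  const match = /\bwdl (\d+) (\d+) (\d+)\b/.exec(infoLine);
  if (!match) return null;
  const [win, draw, loss] = match.slice(1).map(Number);
  return { win, draw, loss };
}

// Expected score for the side to move: a win counts 1, a draw counts 1/2.
function wdlToWinrate({ win, draw, loss }) {
  const total = win + draw + loss; // normally 1000
  return (win + 0.5 * draw) / total;
}

const wdl = parseWdl("info depth 20 score cp 35 wdl 120 850 30 pv e2e4");
if (wdl) console.log(wdlToWinrate(wdl));
```

The values are reported from the side to move's perspective, so a graph drawn from White's point of view would need to flip them on Black's turns.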

from nibbler.

yuzisee commented on June 1, 2024

As of official-stockfish/Stockfish#4216, Stockfish now normalizes its centipawn values to an exact winrate. (Shipped in Stockfish 15.1)

Since Leela operates in winrates directly, I think the way Nibbler does it is actually where the future is headed. Winrates also have a well defined range (0.0 to 1.0) and the blunder/mistake/inaccuracy detection is in winrates in the first place (so if you want the graph to reflect the detections, we'd want to graph in winrates anyway).

It's likely that in the future Lichess changes their chart to look like Nibbler's, actually.

from nibbler.

yuzisee commented on June 1, 2024

Oh, maybe they already did?

https://github.com/lichess-org/lila/blob/aab0c63b922d8d53cf1a58fabe0cee763c2ece2f/ui/ceval/src/view.ts#L160

from nibbler.

yuzisee commented on June 1, 2024

I find Nibbler's evaluations graph to be less granular and more prone to picking extremes (totally winning or totally losing) than Lichess's

@OverLordGoldDragon do you have a screenshot of Lichess and a screenshot of Nibbler showing the same game? (And if you could attach the PGN it would be easy to reproduce where the biggest differences are)

from nibbler.

OverLordGoldDragon commented on June 1, 2024

Granularity is my main issue. Note how Nibbler is mostly "ez" or "u ded". I don't think it's tied to the compute limit per move, but I could be wrong (tested a few configs).

PGN -- https://lichess.org/VjoUufjl/black

from nibbler.

yuzisee commented on June 1, 2024

Note how Nibbler is mostly "ez" or "u ded".

Observe that the specific move you're referring to 46. Re4 is, in fact, "u ded" at higher Stockfish depth. It's only with a lower depth that you can see the granularity you're looking for.

The longer you run Stockfish, the closer to the Nibbler chart it's going to get. So, technically, Nibbler's current "picking extremes" is actually the more correct behavior.

I don't think it's tied to the compute limit per move

Put another way, if you had a much slower computer your chart would look like the Lichess one. On the other extreme, if you run a supercomputer long enough it eventually finds "Mate in 40" after 46. Re4 meaning the game is completely lost.

from nibbler.

OverLordGoldDragon commented on June 1, 2024

Thanks for the investigation @yuzisee !

46. Re4 is, in fact, "u ded" at higher Stockfish depth

lol. Yep. Indeed I've not tested very compute-limited settings, I expected Lichess isn't being that cheap but it makes sense.

So replacing

So, is it "already winrate" from Stockfish or not?

Here's what that would look like

I presume that's from your local run? Then before/after pics would be helpful - though I can just test myself later with previous point clarified

from nibbler.

yuzisee commented on June 1, 2024

So, is it "already winrate" from Stockfish or not?

According to official-stockfish/Stockfish#4216 it's "already winrate" on move 32 of each game.

There's about a 7% reduction starting at move 0, up to a 15% increase by move 128.

Lichess is drawing centipawns though, even though Stockfish is calibrating to winrate internally.

Discussion:

For most evals, Stockfish's formula is somewhat more extreme with winrates early in the game and more lenient with winrates later in the game. However, Lichess (human data) is far more lenient than anything Stockfish produces.
- It turns out that humans are expected to make more mistakes than engines; who knew?
If the eval is exactly 0.00 it's always considered to be "50% B : 50% W" by every formula.
SF v16 @ move 32 with eval ±1.0 are the calibration points for Stockfish. They will always be 75.0% B : 25.0% W & 25.0% B : 75.0% W respectively (i.e. a 50% advantage) in the Stockfish formula

I presume that's from your local run?

Yes, the screenshot from
#242 (comment)
is drawing centipawns using Stockfish 16 at 10M nodes. You can think of it like

from nibbler.

OverLordGoldDragon commented on June 1, 2024

I really don't follow. Why is it something on move X that it's not on other moves? And it changes over time? Well, your linked thread probably answers all that. Does your proposed code insertion handle it?

Regardless, we have 1, then 2 - and from (2) we have "Why not just use Stockfish centipawns?", suggesting it's not using direct Stockfish outputs. I've only skimmed (2). Is (2) outdated, or did I misread?

from nibbler.

OverLordGoldDragon commented on June 1, 2024

Thanks for the highly detailed response!

So TL;DR: Lichess graphs centipawns, but labels blunders-etc via winrate, and centipawns != winrate. Stockfish outputs winrate, and your code converts it back to centipawns. Correct?

Also, I don't know whether "winrate" is useful at all, if it's based on, as you suggested earlier, what the engine thinks its odds of winning the game are. To a perfect player, the graph only ever takes on three values (1, -1, 0), and the current engines are around 90% of the way there, if not 99% with enough time. This isn't an issue early on since they can't yet quickly see "mate in 64", but with enough pieces off the board that's often enough what my graphs look like.

It'd be more informative to output "human winrate", e.g. if I lose an unimportant pawn in opening, a perfect player's eval is 1, but mine is still around 0. Hence it'd be also tailored to rating and time control. You do say "Lichess calculates winrate based on actual player performance during rapid time controls", so is this actually what's happening? I guess, minus the rating part, instead they calibrate it to ~2800 I suppose.

we have to trace the following

Geez, you searched and typed up all that? I mean, not complaining!

PS, Lc0 Discord often discusses interesting stuff, in case you're not there already.

from nibbler.

OverLordGoldDragon commented on June 1, 2024

Also I take it your labeling code is unaffected by changing the graph code?

from nibbler.

OverLordGoldDragon commented on June 1, 2024

Your and yuzisee's comments are excellent Wiki material and shouldn't just get lost in a closed issue. Do you have a summary somewhere that explains your approach?

@rooklift Consider enabling repository Wiki, where you can say "Not verified by core maintainers" to absolve review responsibility. I still don't know what some of the buttons do, maybe some will volunteer to explain.

from nibbler.

Naphthalin commented on June 1, 2024

A summary for the approach taken in Lc0 can be found in the linked PR1791, with original parametrization choice, explanation of the WDL_mu eval definition and comparison of pawn-odds startpos between SF and Lc0 probably being the most useful comments.

from nibbler.

yuzisee commented on June 1, 2024

The trouble with defaulting to WDL_mu in Nibbler is making sure the user gets the correct WDLDrawRateReference since right now Nibbler doesn't care what network you use as long as you provide one.

I can't speak for @rooklift here, but I presume the idea that a new person can download Nibbler and it "works out of the box" is appealing.

from nibbler.

yuzisee commented on June 1, 2024

I personally don't think that engine WDL (and the derived expected score) is particularly useful for analyzing human games.... the WDL_mu score type is the most (or possibly only) natural way of assigning a single number to a position, and the normalized SF centipawn eval does a very good job at approximating that up to +2.5 or so, at which point evals become irrelevant. Within this range, the main problem isn't that centipawn evals are a bad reference, it's that engines are too strong.

And since we have @Naphthalin here, I'll ask a question:

Blunder/Mistake/Inaccuracy checks by convention still rely on some sort of WDL threshold rather than a centipawn threshold. Given this issue of "engines are too strong, WDL needs translation or it will be too extreme" would you like to suggest a path forward for #237

from nibbler.

Naphthalin commented on June 1, 2024

What about a simple alternative like displaying the [-2.5, +2.5] range as it is now, compressing the [+2.5, +5.0] into the [+2.5, +3.0] range, and clamp above +5.0?
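One way to read that proposal is as a piecewise-linear mapping, sketched below. The function name is hypothetical, and applying the same compression symmetrically to negative evals is an assumption, since the comment only spells out the positive side.

```javascript
const LINEAR_LIMIT = 2.5;  // shown unchanged up to here
const CLAMP_INPUT = 5.0;   // evals beyond this are clamped
const CLAMP_OUTPUT = 3.0;  // displayed value at and beyond CLAMP_INPUT

// Maps an eval in pawns to the value plotted on the graph's y-axis.
function compressEval(x) {
  const sign = x < 0 ? -1 : 1;
  const magnitude = Math.abs(x);
  if (magnitude <= LINEAR_LIMIT) return x;
  if (magnitude >= CLAMP_INPUT) return sign * CLAMP_OUTPUT;
  // Squeeze (2.5, 5.0) into (2.5, 3.0)
  const slope = (CLAMP_OUTPUT - LINEAR_LIMIT) / (CLAMP_INPUT - LINEAR_LIMIT);
  return sign * (LINEAR_LIMIT + (magnitude - LINEAR_LIMIT) * slope);
}
```

The mapping is continuous at both breakpoints, so the plotted line has no jumps at ±2.5 or ±5.0.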

from nibbler.

OverLordGoldDragon commented on June 1, 2024

[-2.5, +2.5] range as it is now, compressing the [+2.5, +5.0] into the [+2.5, +3.0] range

I'd prefer whatever makes the interpretation approximately linear, minus the clamping part (and maybe that's what you also intend).

from nibbler.

Naphthalin commented on June 1, 2024

The [-2.5, +2.5] range is meant that way: approximately linear inside, and basically meaningless outside. Main question for me is how to accommodate the fact that SF/Lc0 show evals like +5 or +10, but a +4 from one of them could be a +20 from the other.

from nibbler.

OverLordGoldDragon commented on June 1, 2024

Then certainly there should be a conversion first (to a common eval)? If the method can't check which engine is running, then the best solution is to make it. If that's a problem, I guess just document it.

from nibbler.

yuzisee commented on June 1, 2024

Main question for me is how to accommodate the fact that SF/Lc0 show evals like +5 or +10, but a +4 from one of them could be a +20 from the other.

It's actually worse than that: a "+4" from low depth/low nodes could be "+20" from high depth/high nodes with the same engine

from nibbler.

OverLordGoldDragon commented on June 1, 2024

Isn't that just "doing evaluation"?

from nibbler.

Naphthalin commented on June 1, 2024

It's actually worse than that: a "+4" from low depth/low nodes could be "+20" from high depth/high nodes with the same engine

Hence the notion of "evals above +2.5 are imaginary numbers anyway". This kind of (self-)disagreement only really seems to happen above that value.

from nibbler.

yuzisee commented on June 1, 2024

(comment deleted, centipawn analysis moved to #242 (comment) and inaccuracy/mistake/blunder discussion moved to #237 (comment))

from nibbler.

OverLordGoldDragon commented on June 1, 2024

This kind of (self-)disagreement only really seems to happen above that value

What do you mean by self-disagreement? Don't more evaluated positions just mean an updated (and presumably more accurate) evaluation?

from nibbler.
