I guess this line?
https://github.com/rooklift/nibbler/blob/master/files/src/renderer/55_winrate_graph.js#L96C5-L96C30
e here is some number between 0 and 1
from nibbler.
My understanding is: engines output centipawns, then there's some conversion formula for win %. So where does this calculation happen?
https://github.com/vondele/Stockfish/blob/ad2aa8c06f438de8b8bb7b7c8726430e3f2a5685/src/uci.cpp#L223
win_rate_per_mille = 1000 / (1 + std::exp((a - x) / b))
win_rate = 1.0 / (1 + std::exp((a - x) / b))
- win_rate = σ((x − a) / b) = σ(centipawn_scale_value)
- logit(win_rate) = centipawn_scale_value

Where σ is the standard logistic function (note that 1 / (1 + exp((a − x) / b)) = σ((x − a) / b), so the sign flips relative to the form above), and logit is its inverse, the logit function: https://en.wikipedia.org/wiki/Logit
The rounding in Stockfish's win_rate_model means the resolution of win_rate is capped at
- 0.9995 = 99.95% ≅ centipawn_scale_value of +7.6
- 0.0005 = 0.05% ≅ centipawn_scale_value of −7.6
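For reference, the ±7.6 figure falls straight out of the logit definition; a quick sketch (writing logit directly rather than via atanh):

```javascript
// logit(p) = ln(p / (1 - p)), the inverse of the logistic function σ.
const logit = p => Math.log(p / (1 - p));

// The clamp values above correspond to roughly +/-7.6:
console.log(logit(0.9995)); // ≈ +7.6
console.log(logit(0.0005)); // ≈ −7.6
```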
So replacing
with something like
let clamped_e = Math.min(Math.max(e, 0.0005), 0.9995); // values from 0.0005 to 0.9995
let logit_e = 2.0 * Math.atanh(2.0 * clamped_e - 1.0); // values from −7.6 to +7.6
let logit_e_scaled = (logit_e / 7.6) / 2.0; // values from −0.5 to +0.5
let logit_e_scaled_01 = logit_e_scaled + 0.5; // values from 0.0 to 1.0
let y = (1 - logit_e_scaled_01) * height;
should do what you're looking for.
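Wrapped as a function for testing (a sketch; evalToY is a made-up name, and height stands for the graph's pixel height), the snippet maps a 50% winrate to mid-graph and the clamp values to the graph edges:

```javascript
// Hypothetical wrapper around the snippet above.
function evalToY(e, height) {
  let clamped_e = Math.min(Math.max(e, 0.0005), 0.9995); // 0.0005 .. 0.9995
  let logit_e = 2.0 * Math.atanh(2.0 * clamped_e - 1.0); // −7.6 .. +7.6
  let logit_e_scaled_01 = (logit_e / 7.6) / 2.0 + 0.5;   // 0.0 .. 1.0
  return (1 - logit_e_scaled_01) * height;
}

console.log(evalToY(0.5, 256)); // 128: a 50% winrate sits exactly mid-graph
```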
Here's what that would look like
I really don't follow. Why is it something on move X that it's not on other moves? And it changes over time? Well, your linked thread probably answers all that.
Yes. The linked thread answers all that.
One way to think about it is the quote "Centipawns are great for developing chess engines, which is their main use. But not so much for human comprehension." from (2) which you linked above.
What happened is that this same "human comprehension" problem also caused a headache for the Stockfish team because, as you saw in 1 that you linked above, both Lichess & Chess.com (and now Nibbler) look at Win%
to classify inaccuracy/mistake/blunder and not centipawns.
So in October 2022 the Stockfish team set out to reduce this headache by getting very detailed about the winrate↔centipawns relationship inside the engine. That's what this thread discusses: official-stockfish/Stockfish#4216. By that point the winrate model had been thoroughly tested: 28 months in development, 24 months in production, and three full Stockfish versions (12, 13 & 14).
Comparing the…
latest Stockfish formula
winrate_scale = 0.5 + 0.5 / (1 + std::exp((a - centipawn*MULTIPLIER) / b)) - 0.5 / (1 + std::exp((a + centipawn*MULTIPLIER) / b))
Stockfish's current values for MULTIPLIER, m, a and b are:
- MULTIPLIER = 3.28
- m = ply / 64.0
- a = 0.38036525m³ − 2.82015070m² + 23.17882135m + 307.36768407
- b = −2.29434733m³ + 13.27689788m² − 14.26828904m + 63.45318330
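As a sketch (not actual Stockfish source), the curve can be evaluated in JavaScript from the constants above. The function name is made up; it assumes ply = 2 × move number, and caps ply at 240 as the engine's model does:

```javascript
const MULTIPLIER = 3.28;

// Sketch of the winrate curve above. Assumes ply = 2 * moveNumber.
function stockfishWinRate(evalPawns, moveNumber) {
  const m = Math.min(240, 2 * moveNumber) / 64.0;
  const a = ((0.38036525 * m - 2.82015070) * m + 23.17882135) * m + 307.36768407;
  const b = ((-2.29434733 * m + 13.27689788) * m - 14.26828904) * m + 63.45318330;
  const x = evalPawns * 100 * MULTIPLIER; // eval in pawns -> internal units
  return 0.5 + 0.5 / (1 + Math.exp((a - x) / b))
             - 0.5 / (1 + Math.exp((a + x) / b));
}

console.log(stockfishWinRate(1.0, 32).toFixed(3)); // "0.750": the +1.00 = 75% calibration point
```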
with the…
latest Lichess formula
winrate_scale = 1/(1+Math.exp(MULTIPLIER * centipawn_scale))
Lichess's current value for MULTIPLIER is -0.368208
we can get "win_scale" evals corresponding to e.g. −2.5, −1.0, +1.0, and +2.5 "centipawn_scale" evals
formula | eval −2.5 | eval −1.0 | eval +1.0 | eval +2.5 |
---|---|---|---|---|
SF v16 @ move 0 | ~100% B | 79.0% B : 21.0% W | 21.0% B : 79.0% W | ~100% W |
SF v16 @ move 32 | ~100% B | 75.0% B : 25.0% W | 25.0% B : 75.0% W | ~100% W |
SF v16 @ move ≥120 | 99.9% B : 0.01% W | 67.5% B : 32.5% W | 32.5% B : 67.5% W | 0.01% B : 99.9% W |
Lichess 2023 | 71.5% B : 28.5% W | 59.1% B : 40.9% W | 40.9% B : 59.1% W | 28.5% B : 71.5% W |
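The "Lichess 2023" row above can be reproduced directly from that formula (lichessWinRate is a made-up name, not Lichess's code):

```javascript
const MULTIPLIER = -0.368208;

// Winrate from an eval in pawns, using the Lichess formula quoted above.
const lichessWinRate = evalPawns => 1 / (1 + Math.exp(MULTIPLIER * evalPawns));

console.log((lichessWinRate(1.0) * 100).toFixed(1));  // "59.1" (% W at eval +1.0)
console.log((lichessWinRate(-2.5) * 100).toFixed(1)); // "28.5" (% W at eval −2.5)
```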
Does your proposed code insertion handle it?
Yes, the proposed code insertion handles it, because Nibbler reads the direct wdl output of Stockfish, which already has the corrections applied.
If you make the proposed code changes, you'll get the same "line graph" as Lichess (as long as you're running at the same depth as Lichess; Lichess seems to run at very low depth).
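For context, Stockfish only prints wdl (per-mille win/draw/loss) when the UCI_ShowWDL option is enabled. A sketch of the kind of parsing involved (illustrative only, not Nibbler's actual code); the expected score is the wins plus half the draws:

```javascript
// Example UCI info line (with UCI_ShowWDL enabled); values are per mille.
const line = "info depth 20 score cp 35 wdl 321 498 181";

// Extract W/D/L and compute the expected score, which is what a
// winrate graph would plot.
function expectedScoreFromWdl(infoLine) {
  const m = infoLine.match(/\bwdl (\d+) (\d+) (\d+)\b/);
  if (!m) return null;
  const [w, d, l] = m.slice(1).map(Number);
  return (w + d / 2) / (w + d + l);
}

console.log(expectedScoreFromWdl(line)); // 0.57 for the sample line
```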
Regardless, we have 1, then 2 - and from (2) we have "Why not just use Stockfish centipawns?", suggesting it's not using direct Stockfish outputs. I've only skimmed (2). Is (2) outdated, or did I misread?
To verify what is out-of-date and what is up-to-date, we have to trace the following:
- 2018 and earlier
- engines did not typically output win percentages, everyone had to calculate on their own
- as early as 2016, Lichess has calculated winrate to set the size of arrows when drawing them on the board and for drawing the eval bar… BUT did not yet use winrates for the analysis graph nor inaccuracy/blunder/mistake calculation
- Feb 2018: Lc0 calculates a crude centipawn↔winrate conversion formula and releases v0.1 to the world
- Nov 2018: Lichess realizes that centipawns suck and switches to win% for inaccuracy/blunder/mistake calculation — this is what the article refers to https://web.archive.org/web/20220715020005/https://lichess.org/page/accuracy
- Feb 2019: Lc0 starts to calculate WDL directly in its development branch, which means winrate is now more reliable than centipawns (since centipawn conversion is lossy as it cannot express drawrate information)
- March 2019: Lc0 v0.21 ships
- June 2019: Nibbler v0.5.9 adds the ability to display centipawns and Nibbler v0.6.7 adds the ability to select centipawn output
- July 2019: Stockfish and Lc0 make their formal partnership concrete
- Nov 2019: Lc0 allows outputting winrate based on its full WDL calculations
- Dec 2019: Lc0 v0.23.0 ships
- Jan 2020: Nibbler v1.1.3 adds the ability to display winrate
- April 2020: Lc0 realizes that Stockfish's centipawn↔winrate calculation is not stable and publishes a blog post about it https://lczero.org/blog/2020/04/wdl-head/
- April 2020: Nibbler v1.2.1 makes winrate the default output
- June 2020: Stockfish displays winrate for the first time in its development branch, using a WDL model
- Sept 2020: Stockfish 12 ships the winrate feature
- July 2022: Lichess refreshes their winrate formula to be based on player performance at rapid time controls
- Nov 2022: Stockfish makes winrate the official source of centipawn normalization
- Dec 2022: Stockfish 15.1 is released, officially retiring "centipawn" as a reference point
- May 2023: Lc0 stabilizes their centipawn equivalence so that it can be kept in sync with winrate going forward and publishes a blog post about it https://lczero.org/blog/2023/07/the-lc0-v0.30.0-wdl-rescale/contempt-implementation
Summary
Lichess calculates winrate based on actual player performance during rapid time controls, but draws their graph using Stockfish centipawns. Stockfish calculates centipawns using NNUE and uses a formula to keep them in sync with winrates, via a process that is stable between releases.
Given that Lc0 now calculates winrates directly, and converts to a centipawn value purely to work with the Stockfish definition and keep them in sync, it seems like graphing winrate is probably the better choice when performing engine analysis.
Note that Nibbler ideally works well with both Stockfish-type engines and Leela-type engines, so showing winrate helps achieve that goal since it's a stable output reference point for both engine types. For now, the Stockfish centipawn↔winrate equivalence and the Lichess centipawn↔winrate equivalence are both based on the https://en.wikipedia.org/wiki/Logit function, so it's very simple (5 lines of code) to match them up visually, as we saw here #242 (comment)
TL;DR:
The (2) you linked was first published in 2018. Stockfish did not provide winrates back then.
In 2020, Stockfish started showing winrates (June 2020) and redefined centipawns to match winrates (Nov 2022). Meanwhile, Lichess keeps a separate winrate formula because it's based on player performance in rapid time controls whereas Stockfish is giving you the "winrate of the position itself" so to speak.
Nobody is relying on centipawns as a reference point anymore.
To a perfect player, the graph only ever takes on three values (1, -1, 0)
The same happens in centipawns though. At high depth, the graph will only ever take on three values: (+200 a.k.a. Mate for white, −200 a.k.a. Mate for black, 0.0.)
The problem isn't centipawns vs winrate, it's that the depth of your engine is too high.
It'd be more informative to output "human winrate"
Lc0 specifically added this feature just last month! You give it your Elo and it calculates a "human winrate" at that level for any/each position.
https://lczero.org/blog/2023/07/the-lc0-v0.30.0-wdl-rescale/contempt-implementation/
When I get back home maybe I'll see about a pull request for Nibbler to support this new setting.
Lc0 Discord often discusses interesting stuff
Oh neat, I'll check it out. Thanks!
Your labeling code is unaffected by changing the graph code?
Yes. They are separate and don't rely on each other.
Just saw that you referenced LeelaChessZero/lc0#1791 and wanted to add my two cents on eval definition/normalization and win rate, despite the issue already being closed.
I personally don't think that engine WDL (and the derived expected score) is particularly useful for analyzing human games, and requires some translation to be of use, which is mostly what Lc0 PR1791 is about. After working on that quite some time, I came to the conclusion that what is now the WDL_mu score type is the most (or possibly only) natural way of assigning a single number to a position, and the normalized SF centipawn eval does a very good job at approximating that up to +2.5 or so, at which point evals become irrelevant. Within this range, the main problem isn't that centipawn evals are a bad reference, it's that engines are too strong.
I'd personally strongly prefer Nibbler to show eval (SF centipawn or Lc0 WDL_mu) on a [-2.5, +2.5] y-axis with dashed lines marking the +1 and -1 boundaries for 50% W/L over displaying the expected score; maybe I can convince @rooklift why that would be a good idea :)
The trouble with defaulting to WDL_mu in Nibbler is making sure the user gets the correct WDLDrawRateReference, since right now Nibbler doesn't care what network you use as long as you provide one.
Not sure where you took that from; if the user is using defaults (and not setting any contempt-related parameters), there won't be any WDL adjustment. However, the beauty of the WDL_mu eval is that it is almost completely invariant under the applied WDL transformation (and to some extent, even network selection!), unlike any other score type like Q or centipawn. Feel free to test various values of WDLCalibrationElo with either Contempt: "0" and/or WDLEvalObjectivity: 1.0 to see how it affects WDL_mu and Q evals :)
I can't speak for @rooklift here, but I presume the idea that a new person can download Nibbler and it "works out of the box" is appealing.
Seems like his opinion is similar, given that #244 makes my proposed "display centipawn from -2.5 to +2.5 eval, with dashed lines depicting +1.0 and -1.0" the default.
Blunder/Mistake/Inaccuracy checks by convention still rely on some sort of WDL threshold rather than a centipawn threshold. Given this issue of "engines are too strong, WDL needs translation or it will be too extreme" would you like to suggest a path forward for #237
Due to Nibbler being inherently a GUI for Lc0, this question is very closely related to how the --temp-value-cutoff parameter is applied in Lc0, which currently also is based on deltaQ. However, my WDL adjustment/rescale/contempt implementation from the mentioned PR1791 is based on the central assumption that there is some scale where the numerical size of inaccuracies doesn't depend on the evaluation (contrary to centipawn or winrate, which are more and less sensitive, respectively, in decisive positions), and the evaluation on this scale is exactly what the WDL_mu score estimates. I didn't do any of the actual data science groundwork and check how well that translates to human games, but I presume that judging inaccuracies on the WDL_mu scale should work significantly better than both (native engine) winrate and (old A/B engine) centipawn. Since the introduction of normalizing evals to +1.00 = 50% W however, the SF centipawn is doing a very good job at approximating WDL_mu (it only has some minor issues due to game progress dependence) up to +2.5 or so, from where things become increasingly meaningless.
To throw around some numbers on the WDL_mu scale: since +0.25 is roughly the initial white advantage (half a tempo?) and 1.0 the difference between equal and 50% W, it might make sense to start with thresholds of 0.25, 0.5 and 1.0, basically meaning "giving up the initiative" (inaccuracy), "handing over the initiative to the opponent" (mistake) and "enough to change the most likely outcome" (blunder), while treating evals above +2.5 as virtually the same.
Although I like the sort of "critical strip" in the new graph, the clamping of it to the [-2.5, 2.5] centipawn range is a bit counterintuitive...
The original topic has been answered, and now content is getting mixed in, which isn't great for future readers. I think a separate Issue is best - if this makes sense, I've opened #245; consider continuing there.
It'd also be good if @yuzisee moved their latest comment there (and deleted it here).
Well, that's fetched from eval_list[n] and eval_list = node.future_eval_history();, so it's really in future_eval_history(), but that's defined as this.get_end().eval_history()?
My understanding is, engines output centipawns, then there's some conversion formula for win %. So where's this calculation happen?
It's been a while, but as far as I remember, Leela outputs winrates, which Nibbler just uses directly. Stockfish doesn't output winrates directly, but does output WDL stats, which can easily be converted to winrates - which I think is what Nibbler does.
As of official-stockfish/Stockfish#4216, Stockfish now normalizes its centipawn values to an exact winrate. (Shipped in Stockfish 15.1.)
Since Leela operates in winrates directly, I think the way Nibbler does it is actually where the future is headed. Winrates also have a well defined range (0.0 to 1.0) and the blunder/mistake/inaccuracy detection is in winrates in the first place (so if you want the graph to reflect the detections, we'd want to graph in winrates anyway).
It's likely that in the future Lichess changes their chart to look like Nibbler's, actually.
Oh, maybe they already did?
I find Nibbler's evaluations graph to be less granular and more prone to picking extremes (totally winning or totally losing) than Lichess's
@OverLordGoldDragon do you have a screenshot of Lichess and a screenshot of Nibbler showing the same game? (And if you could attach the PGN it would be easy to reproduce where the biggest differences are)
Granularity is my main issue. Note how Nibbler is mostly "ez" or "u ded". I don't think it's tied to the compute limit per move, but I could be wrong (tested a few configs).
PGN -- https://lichess.org/VjoUufjl/black
Note how Nibbler is mostly "ez" or "u ded".
Observe that the specific move you're referring to 46. Re4
is, in fact, "u ded" at higher Stockfish depth. It's only with a lower depth that you can see the granularity you're looking for.
The longer you run Stockfish, the closer to the Nibbler chart it's going to get. So, technically, Nibbler's current "picking extremes" is actually the more correct behavior.
I don't think it's tied to the compute limit per move
Put another way, if you had a much slower computer your chart would look like the Lichess one. On the other extreme, if you run a supercomputer long enough it eventually finds "Mate in 40" after 46. Re4
meaning the game is completely lost.
Thanks for the investigation @yuzisee !
46. Re4
is, in fact, "u ded" at higher Stockfish depth
lol. Yep. Indeed I've not tested very compute-limited settings; I assumed Lichess wasn't being that cheap, but it makes sense.
So replacing
So, is it "already winrate" from Stockfish or not?
Here's what that would look like
I presume that's from your local run? Then before/after pics would be helpful - though I can just test it myself later once the previous point is clarified
So, is it "already winrate" from Stockfish or not?
According to official-stockfish/Stockfish#4216 it's "already winrate" on move 32 of each game.
There's about a 7% reduction starting at move 0, up to a 15% increase by move 128.
Lichess is drawing centipawns though, even though Stockfish is calibrating to winrate internally.
Discussion:
- For most evals, Stockfish's formula is somewhat more extreme with winrates early in the game and more lenient with winrates later in the game. However, Lichess (human data) is far more lenient than anything Stockfish produces.
- It turns out that humans are expected to make more mistakes than engines; who knew?
- If the eval is exactly 0.00, it's always considered to be "50% B : 50% W" by every formula.
- SF v16 @ move 32 with eval ±1.0 are the calibration points for Stockfish. They will always be 75.0% B : 25.0% W & 25.0% B : 75.0% W respectively (i.e. a 50% advantage) in the Stockfish formula.
I presume that's from your local run?
Yes, the screenshot from #242 (comment) is drawing centipawns using Stockfish 16 at 10M nodes. You can think of it like
I really don't follow. Why is it something on move X that it's not on other moves? And it changes over time? Well, your linked thread probably answers all that. Does your proposed code insertion handle it?
Regardless, we have 1, then 2 - and from (2) we have "Why not just use Stockfish centipawns?", suggesting it's not using direct Stockfish outputs. I've only skimmed (2). Is (2) outdated, or did I misread?
Thanks for the highly detailed response!
So TL;DR: Lichess graphs centipawns, but labels blunders-etc via winrate, and centipawns != winrate. Stockfish outputs winrate, and your code converts it back to centipawns. Correct?
Also, I don't know whether "winrate" is useful at all, if it's based on, as you suggested earlier, what the engine thinks its odds of winning the game are. To a perfect player, the graph only ever takes on three values (1, -1, 0), and the current engines are around 90% of the way there, if not 99% with enough time. This isn't an issue early on since they can't yet quickly see "mate in 64", but with enough pieces off the board that's often enough what my graphs look like.
It'd be more informative to output "human winrate", e.g. if I lose an unimportant pawn in opening, a perfect player's eval is 1, but mine is still around 0. Hence it'd be also tailored to rating and time control. You do say "Lichess calculates winrate based on actual player performance during rapid time controls", so is this actually what's happening? I guess, minus the rating part, instead they calibrate it to ~2800 I suppose.
we have to trace the following
Geez, you searched and typed up all that? I mean, not complaining!
PS, Lc0 Discord often discusses interesting stuff, in case you're not there already.
Also I take it your labeling code is unaffected by changing the graph code?
Your and yuzisee's comments are excellent Wiki material and shouldn't just get lost in a closed issue. Do you have a summary somewhere that explains your approach?
@rooklift Consider enabling the repository Wiki, where you can say "Not verified by core maintainers" to absolve review responsibility. I still don't know what some of the buttons do; maybe someone will volunteer to explain.
A summary for the approach taken in Lc0 can be found in the linked PR1791, with original parametrization choice, explanation of the WDL_mu eval definition and comparison of pawn-odds startpos between SF and Lc0 probably being the most useful comments.
The trouble with defaulting to WDL_mu in Nibbler is making sure the user gets the correct WDLDrawRateReference since right now Nibbler doesn't care what network you use as long as you provide one.
I can't speak for @rooklift here, but I presume the idea that a new person can download Nibbler and it "works out of the box" is appealing.
I personally don't think that engine WDL (and the derived expected score) is particularly useful for analyzing human games.... the WDL_mu score type is the most (or possibly only) natural way of assigning a single number to a position, and the normalized SF centipawn eval does a very good job at approximating that up to +2.5 or so, at which point evals become irrelevant. Within this range, the main problem isn't that centipawn evals are a bad reference, it's that engines are too strong.
And since we have @Naphthalin here, I'll ask a question:
- Blunder/Mistake/Inaccuracy checks by convention still rely on some sort of WDL threshold rather than a centipawn threshold. Given this issue of "engines are too strong, WDL needs translation or it will be too extreme" would you like to suggest a path forward for #237
What about a simple alternative like displaying the [-2.5, +2.5] range as it is now, compressing the [+2.5, +5.0] range into [+2.5, +3.0], and clamping above +5.0?
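A sketch of that piecewise mapping (compressEval is a made-up name; the 0.2 slope squeezes the 2.5-wide band into 0.5):

```javascript
// Compress an eval for display: identity on [-2.5, +2.5], squeeze
// (2.5, 5.0] into (2.5, 3.0], clamp everything beyond +/-5.0 to +/-3.0.
function compressEval(ev) {
  const sign = ev < 0 ? -1 : 1;
  const mag = Math.abs(ev);
  if (mag <= 2.5) return ev;                               // linear region, unchanged
  if (mag <= 5.0) return sign * (2.5 + (mag - 2.5) * 0.2); // squeezed band
  return sign * 3.0;                                       // clamped
}

console.log(compressEval(5.0));  // 3
console.log(compressEval(-10));  // -3
```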
[-2.5, +2.5] range as it is now, compressing the [+2.5, +5.0] into the [+2.5, +3.0] range
I'd prefer whatever makes the interpretation approximately linear, minus the clamping part (and maybe that's what you also intend).
The [-2.5, +2.5] is meant that way: approximately linear inside, and basically meaningless outside. The main question for me is how to accommodate the fact that SF/Lc0 show evals like +5 or +10, but a +4 from one of them could be a +20 from the other.
Then certainly there should be a conversion first (to a common eval)? If the method can't check which engine is running, then the best solution is to make it. If that's a problem, I guess just document it.
Main question for me is how to accommodate the fact that SF/Lc0 show evals like +5 or +10, but a +4 from one of them could be a +20 from the other.
It's actually worse than that: a "+4" from low depth/low nodes could be "+20" from high depth/high nodes with the same engine
Isn't that just "doing evaluation"?
It's actually worse than that: a "+4" from low depth/low nodes could be "+20" from high depth/high nodes with the same engine
Hence the notion of "evals above +2.5 are imaginary numbers anyway". This kind of (self-)disagreement only really seems to happen above that value.
(comment deleted, centipawn analysis moved to #242 (comment) and inaccuracy/mistake/blunder discussion moved to #237 (comment))
This kind of (self-)disagreement only really seems to happen above that value
What do you mean by self-disagreement? Doesn't more evaluated positions just mean updated evaluation (and presumably more accurate)?