Spacious Mind rating test reloaded

Tibono2
Full Member
Posts: 713
Joined: Mon Jan 16, 2017 7:55 pm
Location: France

Spacious Mind rating test reloaded

Post by Tibono2 »

Hello dear all,

Let's first recall what this is all about:
With some help from forum members, our friend Nick developed and shared on this forum a very interesting rating test, based on historical games by old masters (from the 15th to the 18th century).

First thoughts and early development started in 2013, based on games from Leonard Barden's book. Back then, Nick used Critter 1.6a as the analysis engine, searching 21 to 22 plies for each position from the games.
The prototype test assumed that finding 100% of the best moves would correspond to the maximum score of 3000 Elo.

Then in 2014 the final test was developed and completed, built on five historical master games and featuring 226 test positions. Only good or acceptable moves were rated for each position: the best move(s) scored 30 points, suboptimal ones scored a relevant positive value within the 0-30 range, and bad moves scored zero. The total of points scored by a computer through a master game then translated into a percentage of the maximum score, which is 30 points times the number of evaluated moves. That percentage was then applied to 3400 in order to provide an Elo estimate. In a nutshell, the test measures the ability to compute/play good moves throughout the master games. It works rather well, at least for strong chess computers and chess programs (I studied a subset of 159 tests, most of them reported by Nick; they scored 2224 on average, with a median of 2256 and a standard deviation of 322). That's a rather high range, isn't it?
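
To make the original scheme concrete, here is a minimal sketch (my own illustration, not Nick's actual spreadsheet formula) of how the per-move scores turn into an Elo estimate:

Code: Select all

# Minimal sketch of Nick's original scoring scheme (illustration only).
# Each evaluated move earns 0 to 30 points; the percentage of the maximum
# (30 points per move) is scaled to 3400 to give the Elo estimate.

def original_elo_estimate(move_points):
    """move_points: list of per-move scores, each between 0 and 30."""
    max_points = 30 * len(move_points)
    percentage = sum(move_points) / max_points
    return percentage * 3400

# Example: 200 best moves (30 pts each) plus 26 weaker moves at 10 pts each
print(round(original_elo_estimate([30] * 200 + [10] * 26)))  # about 3139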

Re: Spacious Mind rating test reloaded

Post by Tibono2 »

In my opinion, the best use of the test is not to accurately forecast the Elo of a single program or chess computer (though it is fine for getting an order of magnitude), but rather to make comparisons: for instance, identifying which playing style performs best on the test, or the effect of other setting changes (selectivity, evaluation components, ...).

Of course the rating of each individual move is key; Nick did not tell much about the process he used, apart from this (Jan. 2017), which I quote: 'When I did these tests a couple of years ago, they were done on a fast 8 core desktop. Every move was analyzed between a minimum of 20 minutes and maximum 24 hours, depending on the move's complexity.'

The analysis engine is not mentioned, but as the initial scoring occurred in 2014 and Nick stated he did not own Komodo 8, it could have been Stockfish DD (4.5) or Stockfish 5; at best, some refining with Stockfish 6 could have occurred in 2015. Or maybe Houdini 4, another strong engine from 2013, who knows.

Anyway, this was ten years ago, and a quantum leap has happened in software strength since then, maybe not so much hardware-wise as Nick already used a powerful desktop. Since 2020, NNUE rules (Stockfish 12, Komodo Dragon, ...), and Stockfish has evolved further, up to version 16. I thought it was high time the test got a refresh. It surely deserves the effort!

to be continued...

Re: Spacious Mind rating test reloaded

Post by Tibono2 »

As previously stated, the test proved efficient for rather strong programs; this does not hold for the weakest ones. The reason is quite clear: strong programs won't often play very bad moves, whilst weak ones easily blunder. Bad moves are not differentiated in Nick's test: whether inaccuracy, mistake, error or blunder, the granted score is zero. This also prevents measuring any "fun" level.

Therefore, I decided to use Stockfish 16 as the new, up-to-date reference scorer; and unlike Nick's approach, I decided to score all moves, assigning bad ones a negative value from the symmetrical range (0 to -30). I reverse-engineered Nick's formula for good moves (all move scores can be retrieved from his Excel files, that's a lot of data) and designed a negative score of my own.

This is based on the centipawn loss (cpl) concept: Stockfish 16's best move is zero cpl and translates into 30 scored points. I took more than 100 cpl as the origin for negative scores, and 800 cpl or more as deserving the full penalty (-30 points). This may look lenient, but consider that Stockfish 16 searches very deeply, which amplifies the negative score of bad moves (as it assumes accurate play from the opponent). Of course, blundering a Queen for a Pawn would usually translate into -30 points with my scoring formula. Rather than writing more details, I share the curve graph:

Image
Should the graph not be displayed automatically, please use the link.
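
For readers who prefer code to a graph, here is a rough sketch of such a mapping. The exact shape is the curve shown above; the linear interpolation below is only a placeholder assumption of mine, not the actual formula:

Code: Select all

# Rough sketch of a cpl-to-points mapping (placeholder shapes, not the
# actual formulas): 0 cpl = 30 points, positive scores shrink up to
# 100 cpl, then negative scores grow until -30 at 800 cpl or more.

def move_points(cpl):
    if cpl <= 0:
        return 30.0                       # the engine's best move
    if cpl <= 100:
        return 30.0 * (1 - cpl / 100)     # assumed decay for good/acceptable moves
    if cpl >= 800:
        return -30.0                      # full penalty for outright blunders
    return -30.0 * (cpl - 100) / 700      # assumed ramp for bad moves

for cpl in (0, 50, 100, 300, 800):
    print(cpl, round(move_points(cpl), 1))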

to be continued...

Re: Spacious Mind rating test reloaded

Post by Tibono2 »

The process I used to get a score for each move (the 226 positions yield 7411 of them!): I ran Stockfish 16 in "all lines" analysis mode until at least depth 25 was reached and completed. In practice, I let the analysis reach depth 26 for the best move; if a couple of top moves were very close in score, I let them reach depth 26 as well. This provided me with the cpl for each move of each position.

As a second step, I ran the full test again using Dragon 3.3 (by Komodo) as an alternate reference; this time not to score moves, but to make sure no other best move would appear. I therefore set the analysis mode to a single PV, enabling a more in-depth search, and let it reach at least depth 30 completed. Any alternative move reported as best within the depth range 25-30 was also granted 30 points. This second step proved useful, with a couple of alternate best moves identified per game.

To summarize the process: best moves deserving 30 points were identified through the joint work of Stockfish 16 and Komodo Dragon 3.3, and the scoring of all other moves comes from Stockfish 16's output.
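
For anyone wishing to automate a similar data collection, here is a minimal sketch of the Stockfish step using the python-chess library (my own illustration, not the workflow I actually ran through the GUI; the engine path and depth are assumptions):

Code: Select all

# Sketch: collect the centipawn loss of every legal move of a position
# with Stockfish in MultiPV ("all lines") mode, via python-chess.
# The engine path and depth are illustrative assumptions.
import chess
import chess.engine

def cpl_per_move(fen, depth=26, engine_path="stockfish"):
    board = chess.Board(fen)
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        infos = engine.analyse(board, chess.engine.Limit(depth=depth),
                               multipv=board.legal_moves.count())
        # Scores from the side to move's point of view, in centipawns
        scores = {info["pv"][0]: info["score"].pov(board.turn).score(mate_score=100000)
                  for info in infos}
    best = max(scores.values())
    return {board.san(move): best - score for move, score in scores.items()}

# Example (loss of each move versus the engine's best move):
# print(cpl_per_move(chess.STARTING_FEN, depth=20))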

to be continued...

Re: Spacious Mind rating test reloaded

Post by Tibono2 »

Now, let's normalize the Elo-like score output by each game test. At the completion of the above process, I could substitute the updated move scores for Nick's ones, and re-use the 'computer score - 30 seconds' tab data Nick provided to assess the new percentage those computers achieve. This was a long, boring task, so I limited myself to the 159 machines I previously mentioned as my study scope.

The objective of this study step was to design a formula translating the new percentage values (remember they can be much lower, due to the negative scores for bad moves) into decent Elo estimates. Of course, there is no reason why the initial computation (percentage x 3400) would still fit best.

Well, let's pay tribute to Nick's good work and adopt the Elo score his test provided as the target "decent" Elo forecast. As a refinement, for a subset of 60 (out of 159) programs for which Nick provided the active chess rating from the SC Wiki list, I used that value as the "real" target instead of Nick's score. Then, game by game, I used Excel to derive a normalization formula: the equation of the curve best fitting the dispersion of (%, score) pairs. To normalize both the origin and the maximum of the curve, I targeted 3400 Elo for a 100% score (consistent with Nick's initial design) and 400 Elo for the origin (zero score). 400 Elo is usually accepted as the bottom level for a complete beginner; a zero score means the few good moves are cancelled out by the many bad ones.
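
For illustration, here is a minimal sketch of such a constrained fit, done in Python rather than Excel (the power-law shape and the data points are assumptions of mine, not the equations I retained, which are shown in the graphs of the next post):

Code: Select all

# Sketch of a normalization fit constrained to f(0) = 400 and f(1) = 3400.
# The power-law shape and the sample data are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def elo_curve(pct, k):
    # pct in [0, 1]; the endpoints are fixed by construction: f(0)=400, f(1)=3400
    return 400 + 3000 * np.power(pct, k)

# (percentage, target Elo) pairs from tested computers (dummy values here)
pct = np.array([0.30, 0.45, 0.60, 0.72, 0.85])
elo = np.array([1300, 1750, 2150, 2450, 2850])

(k_opt,), _ = curve_fit(elo_curve, pct, elo)
print(f"fitted exponent k = {k_opt:.3f}")
print("Elo estimate at a 65% score:", round(elo_curve(0.65, k_opt)))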

Re: Spacious Mind rating test reloaded

Post by Tibono2 »

Here are the resulting dispersions, normalization curves and equations:

Image
link

Image
link

Image
link

Image
link

Image
link

to be continued...

Re: Spacious Mind rating test reloaded

Post by Tibono2 »

As an approach to assessing consistency: the average score over the 159-program sample is 2224 points using Nick's initial test, and 2212 using the "reloaded" test. That's consistent. Out of curiosity, I also zoomed into the "Lang issue" Nick wrote about: the surprisingly low scores of R. Lang's programs on test game 4.
I filtered 20 Lang programs from the study and averaged the scores per game:

Nick's test: 1:2499 2:2357 3:2523 4:1792 5:2247 --> overall score 2219, to be compared to the active Wiki averaged score 2212
This is evidence that Nick's test provided an accurate Elo estimate, despite an indeed very low score for game 4.

Reloaded test: 1:2227 2:2252 3:2324 4:2056 5:2332 --> overall score 2219
The resulting Elo estimate is unchanged, with the per-game scores more evenly distributed by the reloaded test. Game 4 is still the weak point, but not as much as Nick's test reported. I like that.

Let me conclude the consistency check with this graph, which displays a nice alignment of both scores above 1700 Elo or so:

Image
link
The increased dispersion below 1700 Elo was to be expected, as the reloaded test is designed to give more weight to bad moves.

to be continued...

Re: Spacious Mind rating test reloaded

Post by Tibono2 »

Well, here are the links to download the reloaded test:

Test 1
Test 2
Test 3
Test 4
Test 5
Overall score

As with Spacious Mind's download links, the connection is http (not https), so your browser will most probably ask for your confirmation before actually downloading.

In the overall-score spreadsheet, only the scoring formula (cell I2) changed. This new formula doesn't change any previous result you may have from Nick's initial test, but it is definitely required to score the reloaded test games. Should you keep using a previously stored rating list fed with your own test results, please make sure you substitute the new formula for the old one before adding new lines.

Nick's user manual is still valid; look for the instructions on this page.

Nevertheless, a few user-experience changes may be worth pointing out:

- the dropdown lists for move selection come from HCE Pro, which I used as the analysis GUI. The most obvious change is that the castling symbol (long or short) uses the letter O, whilst Nick's test used zeroes; castling moves are therefore sorted deeper within the alphabetical list. No worries for any new input, I trust you will find them; on the other hand, should you copy-paste moves from Nick's data (or from your own previous tests), a "zeroed" castling move (if any) will not be found and its score will be #N/A (not available). This blocks the score display, so you will be fully aware of it. Just replace the "zeroed" castling move with the one from the drop-down list.

- a few other move-notation changes can cause the same issue, e.g. in game 1 the final mating move 21.Qd8 is Qd8+ in Nick's test and Qd8# in the reloaded test. I think the HCE output is the correct notation.

- I didn't change anything in the original game cells, neither the move notation nor the scores; they are kept as a sort of reference.

- the rightmost set of move cells should be kept unchanged, as they are the source that lets you safely and quickly reset the input cells located in the middle columns (using copy and paste, as Nick described in the instructions). Unlike Nick, I protected the rightmost cells to prevent corrupting them unintentionally.

All the worksheet editing was done using LibreOffice; I hope no serious glitch will occur with other software such as MS Excel. Should you spot any issue or bug, please let me know.

Hope this will be useful to some,
best regards,
Tibono
paulwise3
Senior Member
Posts: 1508
Joined: Tue Jan 06, 2015 10:56 am
Location: Eindhoven, Netherlands

Re: Spacious Mind rating test reloaded

Post by paulwise3 »

Hi Eric,

Great initiative! I liked Nick's Excel testsheets.
But I hated testing with game 4: a boring closed game, and an especially boring, very long one! So after testing a number of my computers, I had had enough of it.
Maybe I will take it up again; then I should test with at least 2 computers at the same time, which saves a lot of time ;-).
And as you mentioned: it could help with testing computers that have different playing styles and/or programmable evaluation parameters.

Best regards,
Paul
2024 Special thread: viewtopic.php?f=3&t=12741
2024 Special results and standings: https://schaakcomputers.nl/paul_w/Tourn ... 25_06.html
If I am mistaken, it must be caused by a horizon effect...

Re: Spacious Mind rating test reloaded

Post by Tibono2 »

Hi,

Reading the instructions again, I see Nick shared another version, named "5_test_rating_list", of the worksheet used to compute the final, overall score from the output of the 5 games. The difference: this list not only provides the calculation formula (line 2) but also stores the data for many tested computers.

Of course this also needs the updated formula in the very same I2 cell.

You should copy/paste this (simplified!) cell formula from the Overall score spreadsheet I linked in the above post.
Or download this already updated version of Nick's 5_test_rating_list
Cheers,
Tibono

Re: Spacious Mind rating test reloaded

Post by Tibono2 »

Hello Paul,

thanks for your interest, and for the kind words.
I agree test 4 is a pain in the neck to repeat many times, but as Nick stated, including a closed game in the test was important for a balanced enough skills evaluation.

One of my objectives was to better assess weak programs, and I am proud to share this graph, built using both the initial and the reloaded test, for each level of my Vonset L6 v1. The low levels are designed for children and are indeed very weak, offering pieces as gifts:

Image
link

The reloaded test is much more reactive to such gifts, while the evaluations for the highest levels differ only marginally across both tests.
From my over-the-board tests I thought the levels scaled well, with levels 11 to 15 being the truly interesting ones for training (and still instant play). The graph confirms that. What rather surprises me is the flattening of both curves for levels 16 to 20, despite these using more computing time. There is a flaw somewhere, which has obviously been removed in the L6 v2 version.

KR,
Eric

Re: Spacious Mind rating test reloaded

Post by Tibono2 »

Tibono2 wrote: Tue Feb 06, 2024 2:16 pm Well, here are the links to download the reloaded test:

Test 2
Hi all,
I re-uploaded test 2, as a typo had made its way into the score of move 21.Qe2. Now corrected, with no other change.
Link unchanged, just re-download to get the latest version.
Best regards,
Tibono

Re: Spacious Mind rating test reloaded

Post by Tibono2 »

Hi,
another piece of busy-bee work done: running the test on the many intermediate levels of the King Performance (to my understanding, it fits the King Competition and the King Element/Classic Element/Lasker Edition as well), including the "fun" ones:

KP_SMR_test.jpg
Best,
Tibono
DaMaBu
Member
Posts: 129
Joined: Fri Apr 17, 2020 9:16 pm
Location: Severn, MD, USA

Re: Spacious Mind rating test reloaded

Post by DaMaBu »

Very interesting! So the test suggests around 1400 for Easy 0? The machine itself says it's around 1000, which never seemed right to me - it's more of a challenge than that!

Re: Spacious Mind rating test reloaded

Post by Tibono2 »

I agree 1000 Elo for level Easy 0 looks underestimated, IMHO. Even when computing few nodes, the King program is rather smart from a positional standpoint; and whatever the level (even the fun ones!), once a winning path is foreseen (a mate in xx spotted deep down thanks to extensions), it will stick to it (which is fine). In addition, if I am correct, no limitation is applied to the opening book for the so-called "easy" levels.

Zooming into the Easy 0 per-game scores is interesting: games 1 and 3 resulted in scores of 1116 and 906, therefore spot on the machine's own estimate. But game 2 scored 1833, game 4 (a very closed game) 1724, and game 5 1310. The overall score (a weighted average) is 1431.

To my understanding, 1000 Elo can fit for highly tactical games, but the King engine at Easy 0 is significantly stronger than that in closed/positional games.