Final Rating Test attempt.

This forum is for general discussions and questions, including Collectors Corner and anything to do with Computer chess.

Moderators: Harvey Williamson, Steve B, Watchman

Theo
Member
Posts: 132
Joined: Tue Mar 05, 2013 11:34 am

Post by Theo »

spacious_mind wrote:
Also please remember this is a rating test not a clone test. Besides there is a big difference in software that fits in a small 32K ROM and software that can take up your whole computer :)

Regards

Nick
So, if I understood correctly, this test is about measuring the strength of play against humans? Maybe you should include some devices with a known USCF Elo, like the Mach III (2265) or the newer Novags. I don't know about the Jade II, but some of the minor H8 Novags (Turquoise?) got a 2294 rating. Given there are enough computers with known Elo against humans, you could calibrate the test rating to fit them.

Curious Regards,
Theo
spacious_mind
Senior Member
Posts: 4018
Joined: Wed Aug 01, 2007 10:20 pm
Location: Alabama

Post by spacious_mind »

Hi Theo,
When I wrote that I think I added a :P <----------- which for me is acting tongue in cheek or being naughty. Meaning you don't have to read into it with any seriousness.

I really do not know what the end ratings are going to be until after all 16 games are played, since these games cover different situations in chess. There will be some endgame battles and defensive games, even some closed games.

So computers will do well in some games and probably badly in others. After 16 games we will know better whether the rating is OK or not. But even now it is OK as a reference, because (and I am not really that mathematically minded) if you can measure two distances accurately (i.e. two star positions in the sky) and use them as a bearing or point of reference, then you should be able to calculate the distance to a third star in the sky very accurately.

In my test I am using this same logic, in a way. The first accurate reference is doing evaluations at exactly 21, 22 or 23 etc. ply deep with a top engine. The top engine has provided that accurate measurement, not me; I am just trusting its results. The second accurate measurement is the Grandmaster: we know his ability range because it is recorded. So, using Nick's logic :P (tongue in cheek), if the GM rates for example 2640 ELO for that game in my test, and this looks within reason of his highest rating ever recorded and within reason of his opponent, then I have my second point of reference. I also have a third point of reference, because the same engine is also evaluated at 30 seconds per move and shows a rating, i.e. 2800 ELO. So I think it is reasonable to speculate that if the dedicated chess computer performed at, say, 2300 ELO, then the computer in this game was 500 ELO below the engine standard of 2800, 700 below the test maximum of 3000, and 340 ELO below the GM standard played in this game. If I do this 16 times with games that I am not picking, just taking them out of the book, I should have a fairly unbiased end-score average to compare, and then maybe fine-tune with a different rating idea.
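The arithmetic behind these reference points is simple enough to sketch. The published rankings are consistent with each test game having an 80-point maximum and the ELO estimate being 30 × the percentage score; both constants are inferred from the tables, not stated rules, so treat this as a guess at the method:

```python
# Sketch of the rating arithmetic the published tables appear to use.
# Assumptions (inferred from the rankings, not stated anywhere):
#   - each test game has an 80-point maximum (74.2 pts -> 92.75%)
#   - the ELO estimate is 30 x the percentage score, rounded half-up
def game_rating(points, max_points=80.0):
    """Return (percentage score, estimated ELO) for one test game."""
    pct = 100.0 * points / max_points
    return pct, int(30 * pct + 0.5)

# Mephisto Atlanta's 65.4 points in Test Game 5:
pct, elo = game_rating(65.4)
print(f"{pct:.2f}% -> {elo} ELO")  # 81.75% -> 2453 ELO
```

Plugging in the other scores from the tables (74.2 → 2783, 63.2 → 2370) reproduces the listed ratings, which is why the 30× guess looks right.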

Now again for the tongue in cheek bit.

MFR rating for Jade 2 can be argued at either 2294 ELO or 2320
TC2100 speaks for itself
RS2200 speaks for itself
RS2250 speaks for itself
CC9 was 1721
I don't know but I am guessing that MK12 Trainer was probably rated at 1650 or 1600

The rest I don't know without some further digging. So here is a test for you or anyone else: so far, which ratings are closest to the manufacturer claims?!

1) Schachcomputer.info Active List
2) Schachcomputer.info Tournament List
3) Selective Search List
4) SSDF
5) Nick's tongue in cheek Test !! :P

Below is the 5 game average repeated again for the above test purpose!

No. Computer | Points | Score % | ELO
1 Critter 1.6a 64 Bit - AMD Phenom 2 Core 2.8GHz 373.7 93.43% 2803
2 Grandmaster Performance Standard 339.6 84.90% 2547
3 Tasc CM 512K – 15 MHz – KING 2.54 301.1 75.28% 2258
4 Saitek Travel Champion 2100 299.4 74.85% 2246
5 Mephisto TM Vancouver 68030 36 MHz 295.9 73.98% 2219
6 Novag Jade 2 294.2 73.55% 2207
7 Radioshack 2250XL Brute Force 289.6 72.40% 2172
8 Radioshack 2250XL Selective 288.5 72.13% 2164
9 MChess Pro 5 - P75 278.9 69.73% 2092
10 Saitek Corona 277.4 69.35% 2081
11 CXG 3000 249.8 62.45% 1874
12 Fidelity Sensory 9 231.8 57.95% 1739
13 Saitek MK 12 Trainer LV 5 90S/Move 219.5 54.88% 1646
14 Saitek MK 12 Trainer LV 4 15S/Move 204.5 51.13% 1534
15 Novag Constellation JR 203.4 50.85% 1526
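The averaged list above follows the same arithmetic applied to the five-game total (a 400-point maximum, assuming the inferred 80 points per game). A sketch; the per-game scores in the example are made up and merely chosen to sum to Critter's published 373.7:

```python
# Sketch of the 5-game average computation, under the same inferred
# assumptions: 80-point maximum per game, ELO = 30 x percentage.
def average_rating(game_points, max_per_game=80.0):
    """Return (total points, percentage, estimated ELO) over several games."""
    total = sum(game_points)
    pct = 100.0 * total / (max_per_game * len(game_points))
    return total, pct, int(30 * pct + 0.5)

# Hypothetical per-game scores chosen to sum to Critter's published 373.7:
total, pct, elo = average_rating([74.2, 75.0, 74.5, 75.0, 75.0])
print(f"{total:.1f} pts over 5 games -> {pct:.2f}% -> {elo} ELO")
```

With 373.7 points out of 400 this gives 93.43% and 2803 ELO, matching the top row of the table.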


Best regards

Nick
spacious_mind
Senior Member
Posts: 4018
Joined: Wed Aug 01, 2007 10:20 pm
Location: Alabama

Post by spacious_mind »

I managed to get one test game in tonight. Sorry, but as I started playing through the opening and got Atlanta thinking, I realized that I had Test Game 5 running, so rather than waste time I just decided to continue with this game. So these are the Test Game 5 results with Mephisto Atlanta playing Selective and with Ponder on. The Ponder actually made no difference in this game, because even with the few takebacks to correct the position Atlanta did not get a single Ponder hit. Thinking about it, there were extremely few Ponder hits in the other games as well.

Well Atlanta played like a World Champion up to the last four moves. In the last 4 moves Atlanta only scored 2.2 points out of a possible 12. A 50% score in the last 4 moves would have assured 1st place, but this honor remains currently with Novag Jade 2.

Test Game 5 Rankings

Code: Select all

No.	Computer	Points	Score %	ELO
1	Critter 1.6a 64 Bit - AMD Phenom 2 Core 2.8GHz	74.2	92.75%	2783
2	Julio Bolbochán	70.4	88.00%	2640
3	Novag Jade 2	69.0	86.25%	2588
4	Mephisto Atlanta Selective Ponder On	65.4	81.75%	2453
5	Saitek Travel Champion 2100	63.2	79.00%	2370
6	Mephisto TM Vancouver 68030 36 MHz	63.2	79.00%	2370
7	CXG 3000	61.8	77.25%	2318
8	Radioshack 2250XL Brute Force	61.4	76.75%	2303
9	Tasc CM 512K – 15 MHz – KING 2.54	60.7	75.88%	2276
10	Saitek Corona	60.4	75.50%	2265
11	Radioshack 2250XL Selective	60.3	75.38%	2261
12	MChess Pro 5 - P75	56.6	70.75%	2123
13	Saitek MK 12 Trainer LV 5 90S/Move	52.6	65.75%	1973
14	Novag Constellation JR	46.2	57.75%	1733
15	Fidelity Sensory 9	45.2	56.50%	1695
16	Saitek MK 12 Trainer LV 4 15S/Move	40.4	50.50%	1515
What is really funny is that the best dedicated machine so far over the last 4 moves is Kaplan's (or Barnes's?) Saitek MK 12 Trainer, scoring 9.9 out of 12. Just think if Atlanta had scored 9.9 in the last 4 moves: Atlanta would have blown past Julio Bolbochán and been only 1 point behind Critter 1.6a!

@ Steve sorry that is all the time I have tonight. Tomorrow I will play this game with the same selective and ponder on with Magellan.


Here is the updated Spreadsheet that includes Atlanta under Test Game 5:

http://spacious-mind.com/forum_reports/ ... _final.ods


Best regards,
Nick
Steve B
Site Admin
Posts: 10146
Joined: Sun Jul 29, 2007 10:02 am
Location: New York City USofA

Post by Steve B »

spacious_mind wrote:@ Steve sorry that is all the time I have tonight. Tomorrow I will play this game with the same selective and ponder on with Magellan.
Hi Nick
of course there is no need to apologize
I do appreciate your efforts to include the Atlanta and Magellan in your tests in order for me to continue the discussion regarding their separate Wiki listings
I know full well this was not something you initially intended with these games and I appreciate the effort in taking this detour
I am really interested in the final results of all 5 games rather than any one game
please post those final results in the Redux thread I started
please take your time

Best Regards
Steve
spacious_mind
Senior Member
Posts: 4018
Joined: Wed Aug 01, 2007 10:20 pm
Location: Alabama

Post by spacious_mind »

Below is Test Game 5 with Mephisto Magellan added. Same conditions as Atlanta with Selective on and Ponder On. There were 7 variations in 28 moves. Those variations cost my Magellan some points.

Test Game 5 Rankings

Code: Select all

No.	Computer	Points	Score %	ELO
1	Critter 1.6a 64 Bit - AMD Phenom 2 Core 2.8GHz	74.2	92.75%	2783
2	Julio Bolbochán	70.4	88.00%	2640
3	Novag Jade 2	69.0	86.25%	2588
4	Mephisto Atlanta Selective Ponder On	65.4	81.75%	2453
5	Saitek Travel Champion 2100	63.2	79.00%	2370
6	Mephisto TM Vancouver 68030 36 MHz	63.2	79.00%	2370
7	CXG 3000	61.8	77.25%	2318
8	Radioshack 2250XL Brute Force	61.4	76.75%	2303
9	Tasc CM 512K – 15 MHz – KING 2.54	60.7	75.88%	2276
10	Saitek Corona	60.4	75.50%	2265
11	Radioshack 2250XL Selective	60.3	75.38%	2261
12	Mephisto Magellan Selective Ponder On	59.6	74.50%	2235
13	MChess Pro 5 - P75	56.6	70.75%	2123
14	Saitek MK 12 Trainer LV 5 90S/Move	52.6	65.75%	1973
15	Novag Constellation JR	46.2	57.75%	1733
16	Fidelity Sensory 9	45.2	56.50%	1695
17	Saitek MK 12 Trainer LV 4 15S/Move	40.4	50.50%	1515
I have updated the spreadsheet and you can download it from the link in my previous Post.

I am seriously considering adding all the Morsch programs to get to the bottom of this. If anyone wants to volunteer to test any of the following, that would help me a lot, especially Brute Force, because mine is packed away and just to get to it I would have to move a lot of boxes and reorganize my table, which currently, as you can imagine, holds a lot of computers due to these tests. Here are the ones, however, that I would love to do.

@ Reinfeld - would love it if you could continue repeating the 2150L, Explorer Pro and RS2200X with the other Test Games !!!!

These are the ones that I would like to add:

RS2150 (Kaplan)
Mephisto Senator
Mephisto MMV
Mephisto President
Mephisto Brute Force
RS Master
RS 2050
Mephisto Milano Pro
Saitek Master Chess
GK2000
GK2100

Helpers are welcome !!!

Best regards,

Nick
spacious_mind
Senior Member
Posts: 4018
Joined: Wed Aug 01, 2007 10:20 pm
Location: Alabama

Post by spacious_mind »

I have today added Mephisto Senator and Mephisto MM6 to Test Game 5. I am so glad that I did, because these two programs provided some needed backup to the interesting clone debates being held in the forum. Senator played 27 out of 28 moves identically to Mephisto Magellan, which would amount to a clone probability of 96.43%. Mephisto MM6 scored 28 out of 28 identical moves to the TC2100, which amounts to a 100% clone probability between these two machines. Both pairs have always been considered clones, and this test reconfirms it.
This rating test also confirms that it is perfectly suitable as a quick clone check.
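The quick clone check described here is just a move-match percentage over the same sequence of positions. A minimal sketch; the move lists below are illustrative placeholders, not the actual game record:

```python
# Minimal sketch of the quick clone check: given the moves two machines
# chose from the same sequence of positions, report the match rate.
def clone_similarity(moves_a, moves_b):
    """Return (matching moves, match percentage) for two equal-length lists."""
    if len(moves_a) != len(moves_b):
        raise ValueError("move lists must cover the same positions")
    matches = sum(a == b for a, b in zip(moves_a, moves_b))
    return matches, 100.0 * matches / len(moves_a)

# Illustrative 28-move comparison differing only on the final move,
# mirroring the Senator vs Magellan result (27/28):
a = ["m%d" % i for i in range(27)] + ["Nf3"]
b = ["m%d" % i for i in range(27)] + ["Ng5"]
matches, pct = clone_similarity(a, b)
print(f"{matches}/28 identical -> {pct:.2f}%")  # 27/28 identical -> 96.43%
```

The same function gives 100% for the MM6 vs TC2100 case (28/28 identical moves).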

Test Game 5 Rankings

Code: Select all

No.	Computer	Points	Score %	ELO
1	Critter 1.6a 64 Bit - AMD Phenom 2 Core 2.8GHz	74.2	92.75%	2783
2	Julio Bolbochán	70.4	88.00%	2640
3	Novag Jade 2	69.0	86.25%	2588
4	Mephisto Atlanta Selective Ponder On	65.4	81.75%	2453
5	Saitek Travel Champion 2100	63.2	79.00%	2370
6	Mephisto MM6	63.2	79.00%	2370
7	Mephisto TM Vancouver 68030 36 MHz	63.2	79.00%	2370
8	CXG 3000	61.8	77.25%	2318
9	Radioshack 2250XL Brute Force	61.4	76.75%	2303
10	Tasc CM 512K – 15 MHz – KING 2.54	60.7	75.88%	2276
11	Saitek Corona	60.4	75.50%	2265
12	Radioshack 2250XL Selective	60.3	75.38%	2261
13	Mephisto Magellan Selective Ponder On	59.6	74.50%	2235
14	Mephisto Senator Selective Ponder On	59.1	73.88%	2216
15	MChess Pro 5 - P75	56.6	70.75%	2123
16	Saitek MK 12 Trainer LV 5 90S/Move	52.6	65.75%	1973
17	Novag Constellation JR	46.2	57.75%	1733
18	Fidelity Sensory 9	45.2	56.50%	1695
19	Saitek MK 12 Trainer LV 4 15S/Move	40.4	50.50%	1515
I have updated the spreadsheet and you can download it from an earlier post above.

best regards,
Nick
Queegmeister
Member
Posts: 327
Joined: Mon May 20, 2013 7:45 am
Location: Florida USA

citrine and others

Post by Queegmeister »

could you add the citrine and sapphire 2 ?

or at least the Mephisto Chess Challenger.

One computer that has surprised me is the Excalibur Grandmaster - I have heard it's a clone of Igor, and I see similarities - but the GM has put on a spectacular show. Our last four games are GM +1 -1 =2 (g/75), and I'm a former USCF Master. Maybe I'm out of shape (and old), sure, but this thing is not bad.

Igor does not understand the King's Indian Attack at all and I crush him easily, but if I play a gambit and tactical complications fly, he comes up with some amazing moves too (at 24 MHz) and sometimes wins (and I sheepishly......)

Igor at 24mhz -I'll try to test it if I can find the time.

I'm going to post a game score of a Danish Gambit I played vs the GM - hope you like it.


Mike
Reinfeld
Member
Posts: 486
Joined: Thu Feb 17, 2011 3:54 am
Location: Tacoma, WA

Post by Reinfeld »

Hi Nick - I will definitely keep adding test results for the machines you mention as time permits, and others, if possible (Novag!).

This feels frontier-ish, in the sense that it adds value to the eternal rating debates (comp v comp vs comp v human, as well as claims vs ads and what constitutes human style). Also very useful for lower-rated machines.

Observations, in no particular order:

1. The hobby lives! I'm gaining tons of respect for the obscure CXG 3000, as well as the Jade II and the TC 2100. This prompts visions of a flyweight tournament to determine the best hand-held machine, pre-phone era. TC 2100 vs Star Sapphire?

2: Nick said this regarding his methodology -
2nd accurate measurement is the Grandmaster, we know his ability range because it is recorded so using Nick's logic :P (tongue in cheek) If the GM rates for example in my test 2640 ELO for that game and this looks within reason to what his highest rating ever recorded was and within reason of his opponent, then I have my second point of reference.
This is the best part, the thing that elevates the Razz testing to a truly amusing level, because it includes a human component. It's much more fun than the BT tests, which only measure computers. The great Chessmetrics site attempts to compare GMs of different eras: the holy grail of stats. Nick's system touches the same turf.

The Barden rate-yourself book Nick is using was published in 1957, and the games involve grandmasters of that period. It has 35 games. Getting through those games starts a great dataset. The next round should use Daniel King's "How Good is Your Chess," published in 1993. That volume features 20 games by Kramnik, Kasparov, Short, Shirov and Fischer, and even a game by Deep Thought 2 (!). Maybe I'll run a couple of games for contrast and side bets.

It's inevitable that the quality of the games played four decades later will be higher, along with the annotations and scoring. Errors will still appear, of course. Nick's special stat sauce accounts for that. All you have to do is enter the moves.

3. Stats always revert to the mean. The ratings numbers for the machines are dropping as the number of games increases. The Morschies are performing well, but TASC is still kicking ass. Wiki is reliable overall.

4. Some software (chiefly the blessed Chessmaster) includes rate-yourself measurements and quizzes, authored by GMs. These provide more potential data, though the process is a bit elaborate. You can set up any machine at 30s/move (ponder off) and run it through the CM quizzes and rating tests.

5. One more loony software option: A while back, I picked up the TASC Chess Tutor CDs on Ebay (my feeble attempt to touch the hem of a machine I'll never own). The discs run you through a series of chess quizzes that are quite frankly great, and you get a score at the end. I set up a retro Windows emulator so I could screw around with it (the disc also includes Chessica, an early Fritz version).

Geeking out regards,

- R.
"You have, let us say, a promising politician, a rising artist that you wish to destroy. Dagger or bomb are archaic and unreliable - but teach him, inoculate him with chess."
– H.G. Wells
Reinfeld
Member
Posts: 486
Joined: Thu Feb 17, 2011 3:54 am
Location: Tacoma, WA

Post by Reinfeld »

This quote from Nick earlier in the thread (bold emphasis added) says it all - well worth repeating:
Also I really think this is as close as we will be able to get to showing how computers really could and would perform in comparison to humans. I picked 16 games because, as you probably know, you need a minimum of around 16 games to get an ELO rating established, so hopefully after 16 games there will be a preliminary comparison between GM performance and the computers tested. So yes, I do think this might give us a much better way to compare humans and computers.

Computer-against-computer play does one thing consistently well: it pushes the weaker ones down to such a low rating level that in the end you lose sight of where they really stand against human players.
- R.
"You have, let us say, a promising politician, a rising artist that you wish to destroy. Dagger or bomb are archaic and unreliable - but teach him, inoculate him with chess."
– H.G. Wells
spacious_mind
Senior Member
Posts: 4018
Joined: Wed Aug 01, 2007 10:20 pm
Location: Alabama

Re: citrine and others

Post by spacious_mind »

Queegmeister wrote:could you add the citrine and sapphire 2 ?

or at least the Mephisto Chess Challenger.

One computer that has surprised me is the Excalibur Grandmaster - I have heard it's a clone of Igor, and I see similarities - but the GM has put on a spectacular show. Our last four games are GM +1 -1 =2 (g/75), and I'm a former USCF Master. Maybe I'm out of shape (and old), sure, but this thing is not bad.

Igor does not understand the King's Indian Attack at all and I crush him easily, but if I play a gambit and tactical complications fly, he comes up with some amazing moves too (at 24 MHz) and sometimes wins (and I sheepishly......)

Igor at 24mhz -I'll try to test it if I can find the time.

I'm going to post a game score of a Danish Gambit I played vs the GM - hope you like it.


Mike
Hi Mike,
I would love to add more but unfortunately that will have to be later. My first priority is to try to get 16 test games completed. However please feel free to test any of the computers that you have with the current 5 test games. Every computer added to the test brings its own value to the tests.

Best regards,
Nick
spacious_mind
Senior Member
Posts: 4018
Joined: Wed Aug 01, 2007 10:20 pm
Location: Alabama

Post by spacious_mind »

Adding to Reinfeld's very complimentary comments for which I am truly very thankful, there are really so many different ways that your computers can be used.

Another interesting book is John Nunn's 1001 Deadly Checkmates. It has 1001 :P checkmate positions, which all carry different point values. The final score is 2630, which seems to me very close to an ELO rating, probably around the range where John Nunn actually played. So you could, for example, allow yourself and your computer 30 seconds per move or 3 minutes per move, play through the book, and then compare how you did with how your computer did.

There is a cd called CT-Art that has thousands of positions with points which is also quite good.

The advantage with a dedicated is that you can lie back on your sofa with a good travel machine or play on your lounge table while watching TV. You can pull one out and play it on a flight and solicit all these looks from other passengers lol.

I have played my Sapphire on International Flights a lot also the New York Times Deluxe Travel with the great LCD screen.

Who needs an IPhone to play chess, that's boring and common as muck as the Brits would say! Be suave and act sofistikated use a travel chess set!!

Best regards,

Nick
Nick
Reinfeld
Member
Posts: 486
Joined: Thu Feb 17, 2011 3:54 am
Location: Tacoma, WA

Post by Reinfeld »

Results from Radio Shack 2200X in the first three test games. More to come.


2200X Test game 1

Botvinnik - Grob
7. d5
8. Be2
9. Bxf6
10. Qd2
11. Rd1
12. 0-0
13. Rfe1
14. Qc2
15. Rfd1
16. Rfd1
17. g3
18. h3
19. Na4
20. Qxa4
21. Bc4
22. b4
23. Rxc5
24. dxe6
25. Rxd6
26. Rc8+
27. Qc7+
28. Bc4+
29. Bxf5+
30. Qc5+ (announces mate in 7)

Game 2 - Mangini-Kotov
6...Qb6
7...Be7
8...Nxd5
9...Be6
10...Bc5
11...Nxe3
12...f5
13...Qh4
14...Qe7
15...Qxe6
16...e4
17...Qe5 (instant)
18...Kh8
19...Rac8
20...Rfd8
21...Rfd8
22...Bxc3
23...Bxc3
24...exf2+
25...Rxf8
26...fxg3
27...gxh2
28...h1Q+

Game 3

Green-Barden
10...Bg4
11...Ne5
12...Ne5
13...Qc8
14...Nac4
15...Ne3
16...Ng4+
17...Qc4
18...Rac8
19...Qb5
20...Qxd2
21...Bxe4
22...Bc2
23...Nxf3+
24...Rac8
25...Qxb4
26...Nc4
27...Qb6
28...Rc3
29...R8c7
30...Qf2+

- R.
"You have, let us say, a promising politician, a rising artist that you wish to destroy. Dagger or bomb are archaic and unreliable - but teach him, inoculate him with chess."
– H.G. Wells
Reinfeld
Member
Posts: 486
Joined: Thu Feb 17, 2011 3:54 am
Location: Tacoma, WA

Post by Reinfeld »

Observation:

Nick's test suite, still building, includes the TC 2100, fairly well established as a clone of GK 2100 and Cougar.

So far, RS 2200X, a machine with sparse documentation, attributed to Morsch, is showing significant deviation from TC 2100.

Stats:

Total moves in the first 4 test games: 88

RS 2200X vs TC 2100: 57/88 identical moves.

Similarity: 65%

- R.
"You have, let us say, a promising politician, a rising artist that you wish to destroy. Dagger or bomb are archaic and unreliable - but teach him, inoculate him with chess."
– H.G. Wells
IvenGO
Member
Posts: 298
Joined: Tue Oct 18, 2011 5:37 am
Location: Moscow, Russia

Post by IvenGO »

What I wonder about is how the Brute Force algorithm shows higher results than Selective Search with the same software and hardware?!

There's also a rating difference between the RS2250 (Selective) and the TC2100 of up to +100 ELO points in the TC's favor: I played several simuls against the RS2250 + GK2100, and my results and my feeling of the opponents' resistance were always the same...
spacious_mind
Senior Member
Posts: 4018
Joined: Wed Aug 01, 2007 10:20 pm
Location: Alabama

Post by spacious_mind »

IvenGO wrote:What I wonder about is how the Brute Force algorithm shows higher results than Selective Search with the same software and hardware?!

There's also a rating difference between the RS2250 (Selective) and the TC2100 of up to +100 ELO points in the TC's favor: I played several simuls against the RS2250 + GK2100, and my results and my feeling of the opponents' resistance were always the same...
This is why you need at least 16 games to balance out the score. I was never so lucky with the RS2250; I always found that overall the GK2100s came out on top. So far in these tests the RS2250 just happened to play inferior moves, and you cannot be rewarded for that.
With the RS2250XL there is nothing that really proves Selective is much better than Brute Force. If you study the BT-2630 list you will see that Selective performed 5 points better, which is almost nothing, so 30 test positions do not really prove which setting is better. Besides, at 30 seconds per move Brute Force might actually be better; that comparison has never been made before with the RS2250XL.

What makes RS2250 unique is that you have these two play settings that are almost equal in strength.

Best regards,
Nick