Dedicated Chess Computer Test Scores


Post by scandien »

My formula was truncated:


USCF = 180 + 0.94 × FIDE, if FIDE is under 2000
USCF = 20 + 1.02 × FIDE, if FIDE is 2000 or greater


FIDE   USCF
2787   2862.74
2773   2848.46
2718   2792.36
2659   2732.18
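A minimal sketch of that conversion for checking the numbers (the function name here is illustrative, not from the original post):

Code:
def fide_to_uscf(fide):
    """Approximate USCF rating from a FIDE rating, using the fit above."""
    if fide < 2000:
        return 180 + 0.94 * fide
    return 20 + 1.02 * fide

# The four examples from the table above:
for fide in (2787, 2773, 2718, 2659):
    print(fide, round(fide_to_uscf(fide), 2))
# 2787 2862.74 / 2773 2848.46 / 2718 2792.36 / 2659 2732.18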




I have checked it for the first 5 players of your list, and it seems to fit quite well!

But all of this is not really important. This is just to have a reference ;)

br

Nicolas

Post by Brian B »

Another piece of information. At the recent Millionaire Open #2 in Las Vegas, FIDE ratings were used without adjustment in the Open Section where available; otherwise the USCF rating was used. In ALL other sections, any FIDE rating used had 60 points added to it. I'm not sure of the methodology used to arrive at a 60-point difference, but the organizers were trying to eliminate sandbagging and players entering the wrong section.

Rating regards,
Brian B

Post by spacious_mind »

scandien wrote:My formula was truncated:

USCF = 180 + 0.94 × FIDE, if FIDE is under 2000
USCF = 20 + 1.02 × FIDE, if FIDE is 2000 or greater

FIDE   USCF
2787   2862.74
2773   2848.46
2718   2792.36
2659   2732.18

I have checked it for the first 5 players of your list, and it seems to fit quite well!

But all of this is not really important. This is just to have a reference ;)

br

Nicolas
Can you also try it for the under-2000 ratings? :)

Regards
Nick

Post by spacious_mind »

Brian B wrote:Another piece of information. At the recent Millionaire Open #2 in Las Vegas, FIDE ratings were used without adjustment in the Open Section where available; otherwise the USCF rating was used. In ALL other sections, any FIDE rating used had 60 points added to it. I'm not sure of the methodology used to arrive at a 60-point difference, but the organizers were trying to eliminate sandbagging and players entering the wrong section.

Rating regards,
Brian B
Cool, that seems to fit between the 41 and the 87 that I had found :)

So I still believe you can use the chess.com scale up to 1700 and work a progression upwards from there :)

Best regards
Nick

Post by scandien »

Hello
Whatever rating system is used (it may only introduce an offset), it seems that, when tuning the "Computer vs Computer" ratings to the "Man vs Machine" results, we have to adjust the computer ratings for the stronger engines.

After a while and several tests, I consider that for stronger engines (rated over 2000 in my rating list) I have to adjust my rating list with the following formula:
Rating vs Human = 2000 + (Vs Computer Rating - 2000) × 0.6


My own tests seem to show that this "tuning" is not necessary for machines rated under 2000.

The vs-human data come from:
- CRA ratings,
- ratings from games against rated humans (Internet information, mostly from AEGON or "Hombre versus Machina" data),
- Internet games against humans on FICS,
- SelSr data (available for a few machines).

All those data seem to be consistent. I remember that L. Kaufman made the same observation in his old CCR ratings. I never really understood why the ratings of stronger machines have to be tuned, but it seems they do.
Strangely, I found that the Elo Aktiv rating list seems to be OK and doesn't need to be tuned...

I would be happy to get any explanation (or idea) for this behaviour :)

Br

Nicolas

Post by Steve B »

scandien wrote:Hello
Whatever rating system is used (it may only introduce an offset), it seems that, when tuning the "Computer vs Computer" ratings to the "Man vs Machine" results, we have to adjust the computer ratings for the stronger engines.

After a while and several tests, I consider that for stronger engines (rated over 2000 in my rating list) I have to adjust my rating list with the following formula:
Rating vs Human = 2000 + (Vs Computer Rating - 2000) × 0.6

My own tests seem to show that this "tuning" is not necessary for machines rated under 2000.

The vs-human data come from:
- CRA ratings,
- ratings from games against rated humans (Internet information, mostly from AEGON or "Hombre versus Machina" data),
- Internet games against humans on FICS,
- SelSr data (available for a few machines)
Hi Nicolas
Selective Search published a Dedicated vs Human rating list each month for years, and it included a lot more than a few machines.
The last published list I see is from 2005 and includes about 100 different computers.
The number of games played per machine ranged from a small sample to hundreds of games.
I didn't scientifically analyze the human rating list, but there does not seem to be a pattern for computers rated over 2000 or under 2000,
or under 2200 or over 2200, or any other Elo range.
Some machines are rated higher against humans, some lower... some rating differences are as little as 10 points and some as high as 200 points.

An example of differences vs. humans:

Tasc R30 -78
Mephisto Genius 030 +10

Fidelity Eag V5 +25
Mephisto MM5 -119

Extensive List Regards
Steve

Post by spacious_mind »

Steve B wrote: Tasc R30 -78

Mephisto MM5 -119

Extensive List Regards
Steve
Hi Steve,

Now that's where I would consider the Kopec effect :)

R30 ELO 2526 - 78 = 2448

Best regards
Nick

Post by spacious_mind »

scandien wrote:Hello
Whatever rating system is used (it may only introduce an offset), it seems that, when tuning the "Computer vs Computer" ratings to the "Man vs Machine" results, we have to adjust the computer ratings for the stronger engines.

After a while and several tests, I consider that for stronger engines (rated over 2000 in my rating list) I have to adjust my rating list with the following formula:
Rating vs Human = 2000 + (Vs Computer Rating - 2000) × 0.6

My own tests seem to show that this "tuning" is not necessary for machines rated under 2000.

The vs-human data come from:
- CRA ratings,
- ratings from games against rated humans (Internet information, mostly from AEGON or "Hombre versus Machina" data),
- Internet games against humans on FICS,
- SelSr data (available for a few machines).

All those data seem to be consistent. I remember that L. Kaufman made the same observation in his old CCR ratings. I never really understood why the ratings of stronger machines have to be tuned, but it seems they do.
Strangely, I found that the Elo Aktiv rating list seems to be OK and doesn't need to be tuned...

I would be happy to get any explanation (or idea) for this behaviour :)

Br

Nicolas
Hi Nicolas,

In a sense no list needs tuning, because the data as it stands is correct: it reflects the results actually played. But for tuning against humans, your list would only be selectively correct if you only fix the over-2000 range.

Absolutely it needs tuning from top to bottom. As I stated before, the ratings at the bottom, when compared to humans, are far more wrong than even the ones at the top.

You have to start from the bottom. As I have stated previously, there is no such thing as a computer that is weaker than a beginner. If you want to really fix the problem, then you have to consider that at the bottom there would hardly be a computer below 1000 USCF or 1200 FIDE. Currently all those lists, whether dedicated computers or engines, run as low as 700. Now, since you are all telling me these lists are European and FIDE, that would make the 700 computer a 500 USCF one, and neither of those exists, especially since FIDE cannot be lower than USCF.

At the moment you are only adjusting 25% of the problem and not considering the other 75%. It is a 100% problem that must be fixed.

Best regards
Nick

Post by scandien »

Steve B wrote: Selective Search published a Dedicated vs Human rating list each month for years, and it included a lot more than a few machines.
The last published list I see is from 2005 and includes about 100 different computers.
The number of games played per machine ranged from a small sample to hundreds of games.
Yes, I know... I tried unsuccessfully to get this rating list... It should be interesting...
Steve B wrote:
I didn't scientifically analyze the human rating list, but there does not seem to be a pattern for computers rated over 2000 or under 2000,
or under 2200 or over 2200, or any other Elo range.
Some machines are rated higher against humans, some lower... some rating differences are as little as 10 points and some as high as 200 points.

an example of differences Vs.humans

Tasc R30 -78
Mephisto Genius 030 +10

Fidelity Eag V5 +25
Mephisto MM5 -119

Extensive List Regards
Steve
I think it relies mostly on the program style. Some are really effective against humans (such as the Henne, Kaplan or Lang programs) and others are tuned to play against other computers (Rathsman's or Morsch's programs). Schroder's may be better against humans (MM IV, Polgar) or against computers (MM V).

br

Nicolas

Post by spacious_mind »

scandien wrote:
Steve B wrote: Selective Search published a Dedicated vs Human rating list each month for years, and it included a lot more than a few machines.
The last published list I see is from 2005 and includes about 100 different computers.
The number of games played per machine ranged from a small sample to hundreds of games.
Yes, I know... I tried unsuccessfully to get this rating list... It should be interesting...
Steve B wrote:
I didn't scientifically analyze the human rating list, but there does not seem to be a pattern for computers rated over 2000 or under 2000,
or under 2200 or over 2200, or any other Elo range.
Some machines are rated higher against humans, some lower... some rating differences are as little as 10 points and some as high as 200 points.

An example of differences vs. humans:

Tasc R30 -78
Mephisto Genius 030 +10

Fidelity Eag V5 +25
Mephisto MM5 -119

Extensive List Regards
Steve
I think it relies mostly on the program style. Some are really effective against humans (such as the Henne, Kaplan or Lang programs) and others are tuned to play against other computers (Rathsman's or Morsch's programs). Schroder's may be better against humans (MM IV, Polgar) or against computers (MM V).

br

Nicolas
Sure, a player's way of playing can be affected by the computer opponent they play; I am not disagreeing with that. The R30, however, if you read Kopec again, is the ideal candidate for the prestige rating adjustment he listed. It is at the top of the food chain, so it will have that effect as part of its rating for sure.

Certain other computers will also have this, for sure. You have to remember that the machines on these lists don't play all opponents equally and are strongly influenced by the opponents they did play, so there are going to be anomalies throughout a large list. The rating formulas also don't allow for calibration, so the bigger the list, the bigger the deviations from reality at the top and at the bottom.

best regards
Nick

Post by spacious_mind »

I have adjusted the chess.com list to go all the way down to USCF 800 = absolute beginner and matched it to the FIDE equivalent per chess.com's analysis.

[Image: chess.com list extended down to USCF 800 with FIDE equivalents]

Here are the original SSDF ratings compared to today's list.

[Image: original SSDF ratings compared to today's list]

Hopefully you can see, as I clearly can, the pattern between the original SSDF ratings and chess.com's FIDE equivalents.

Best regards
Nick

Post by scandien »

Maybe we are all wrong... the level of computers against humans does not really work like man-vs-man ratings.

I have run several matches on the Internet with several machines. For every machine (except the MEPHISTO MIRAGE and the KRYPTON REGENCY) the result was clear: the machine played at a specific level (the range of well-matched opponents was each time about 100 points wide). Players below this range were crushed, and players above this range won easily (a set of players better than the machine by 200 points won 80-85% of the games).
The machines' results are not consistent with the Elo formula.
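For reference, the standard Elo expectancy for a player rated 200 points higher is only about 76%, so a score of 80-85% is indeed more than the formula predicts. A quick check (illustrative only):

Code:
def expected_score(diff):
    """Standard Elo expected score for a player rated 'diff' points higher."""
    return 1 / (1 + 10 ** (-diff / 400))

print(round(expected_score(200), 3))  # 0.76 -> about 76%, vs the observed 80-85%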

So defeating a machine in a match does not say that much by itself (I just ran a small 4-game match against the Roma II and won 2.5 to 0.5; the 4th game was not needed as I had already qualified for the next step). This is a small match, but I am sure I am not rated 273 points above the Roma.

The results are the same in computer-versus-computer matches (the MEPHISTO NIGEL SHORT outclassed the SAITEK TURBO ADVANCED TRAINER by 2.5 to 0.5, and the NOVAG SUPER FORTE C outclassed the MEPHISTO MODENA by the same score).

This is the reason why I try to run well-matched games or matches with various opponents.


Br

Nicolas

Post by scandien »

Hello again :)

I have some interesting results for machine vs. man under tournament conditions. The reference is the FIDE rating:

Novag Sapphire: 34 games - perf 2087 (2100 on FICS)
Novag Diablo/Scorpio: 19 games - perf 2185
MEPHISTO ATLANTA: 6 games - perf 2382
MEPHISTO MILANO/POLGAR: 6 games - perf 2237
MEPHISTO BERLIN PRO: 7 games - perf 2300
FIDELITY DESIGNER MACH III: 4 games - perf 2160

A lot of the data come from "Hombre vs Machina" tournaments or AEGON results.
I know that 6 games is not a lot, but it may help for a good tuning. And 19 or 34 games are enough to get an official rating.

BR

Nicolas

Post by Steve B »

scandien wrote:Hello again :)

I have some interesting results for machine vs. man under tournament conditions. The reference is the FIDE rating:

Novag Sapphire: 34 games - perf 2087 (2100 on FICS)
Novag Diablo/Scorpio: 19 games - perf 2185
MEPHISTO ATLANTA: 6 games - perf 2382
MEPHISTO MILANO/POLGAR: 6 games - perf 2237
MEPHISTO BERLIN PRO: 7 games - perf 2300
FIDELITY DESIGNER MACH III: 4 games - perf 2160

A lot of the data come from "Hombre vs Machina" tournaments or AEGON results.
I know that 6 games is not a lot, but it may help for a good tuning. And 19 or 34 games are enough to get an official rating.

BR

Nicolas
The Milano and Polgar are not the same exact program, so why are they listed together?

Anyway... let's compare to the Sel Ser human list from 2005:

Novag Sapphire: 83 games - 2139
Novag Diablo/Scorpio: 140 games - 2126
MEPHISTO ATLANTA: 9 games - 2357
MEPHISTO MILANO: 14 games - 2087
Mephisto POLGAR 5 MHz: 17 games - 2076
MEPHISTO BERLIN PRO: 29 games - 2217
FIDELITY DESIGNER MACH III: 245 games - 2107

All close except for the MILANO/POLGAR rating.

Comparative Regards
Steve

Post by spacious_mind »

Steve B wrote:
scandien wrote:Hello again :)

I have some interesting results for machine vs. man under tournament conditions. The reference is the FIDE rating:

Novag Sapphire: 34 games - perf 2087 (2100 on FICS)
Novag Diablo/Scorpio: 19 games - perf 2185
MEPHISTO ATLANTA: 6 games - perf 2382
MEPHISTO MILANO/POLGAR: 6 games - perf 2237
MEPHISTO BERLIN PRO: 7 games - perf 2300
FIDELITY DESIGNER MACH III: 4 games - perf 2160

A lot of the data come from "Hombre vs Machina" tournaments or AEGON results.
I know that 6 games is not a lot, but it may help for a good tuning. And 19 or 34 games are enough to get an official rating.

BR

Nicolas
The Milano and Polgar are not the same exact program, so why are they listed together?

Anyway... let's compare to the Sel Ser human list from 2005:

Novag Sapphire: 83 games - 2139
Novag Diablo/Scorpio: 140 games - 2126
MEPHISTO ATLANTA: 9 games - 2357
MEPHISTO MILANO: 14 games - 2087
Mephisto POLGAR 5 MHz: 17 games - 2076
MEPHISTO BERLIN PRO: 29 games - 2217
FIDELITY DESIGNER MACH III: 245 games - 2107

All close except for the MILANO/POLGAR rating.

Comparative Regards
Steve
Yep, that's all interesting. Kaufman spent a lot of time on the R30 in 1994. He was quite impressed with it and rated King 2.5 at around 2530. The 1994 list showed King 2.2 with a mean of 2526: CCN (Eric Hallsworth) at 2521 and PLY (SSDF) at 2530.

OK, so if you take the mean of 2526 for King 2.2 and deduct, say, the GM average difference of 87, or the 60 from the casinos, then you would have a rating for King 2.2 of somewhere between 2439 and 2466 Elo. The Active list has it at 2367, which is a difference of between 72 and 99 Elo.

Atlanta in the Active list is 2266, which is 101 Elo lower than the Active King 2.2 rating.

Based on your two independent human Atlanta ratings, the average rating over the combined 15 games is 2367.

Therefore you have the following:

1) Kaufman, CCN & PLY would probably have rated Atlanta at around 2435
2) Deduct 87 and you get 2348
3) Deduct 60 and you get 2375

Mean = 2362

It is all very close really.
So let's assume the 60-point deduction becomes a standard for ratings over 2000. Then near the top of the dedicated list you would have:

King 2.2 at 2466 as a list calibration computer
Atlanta at 2367 as a list calibration computer (your human rating)

At the bottom you could have:

MK I at 985 as a list calibration computer
CC7 (has plenty of games) at 1311 as a list calibration computer

And I think you would soon find that most other computers would quite closely fall into place for a reasonably good comparison against humans.

The problem with today's rating software like EloStat or BayesElo is that you can only calibrate to one computer. Which means that when you run your list with 300-500 computers, you very quickly end up so far away from human ratings that it becomes impossible to compare accurately across the whole list. Think about it: you are expecting a good human comparison across 500 computers based on one calibration point. It just doesn't work today.

You almost need one calibration computer per, say, every 50 programs to have a good chance of being accurate across the complete list.
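One way to picture the multi-anchor idea is piecewise-linear interpolation between several calibration computers. A rough sketch (illustrative only, not a feature of EloStat or BayesElo; the top two anchor pairs echo the King 2.2 and Atlanta figures above, the bottom two are hypothetical placeholders):

Code:
# Illustrative anchors: (computer-vs-computer list rating, vs-human rating).
ANCHORS = [
    (985, 985),      # hypothetical bottom calibration computer (MK I level)
    (1311, 1311),    # hypothetical low calibration computer (CC7 level)
    (2266, 2367),    # Atlanta: Active list 2266 vs human performance ~2367
    (2367, 2466),    # King 2.2: Active list 2367 vs adjusted human ~2466
]

def to_human_scale(list_rating):
    """Piecewise-linear interpolation between the two nearest anchors."""
    pts = sorted(ANCHORS)
    if list_rating <= pts[0][0]:
        return float(pts[0][1])
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if list_rating <= x1:
            return y0 + (y1 - y0) * (list_rating - x0) / (x1 - x0)
    return float(pts[-1][1])

print(round(to_human_scale(2300)))  # ~2400 with these illustrative anchors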

So anyway, this brings us full circle to why I posted my questions to Nicolas originally :)

Best regards
Nick