blog.mathieuacher.com/GPTsChessEloRatingLegalMoves

Preview meta tags from the blog.mathieuacher.com website.

Linked Hostnames

11

Search Engine Appearance

Google

https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves

Debunking the Chessboard: Confronting GPTs Against Chess Engines to Estimate Elo Ratings and Assess Legal Move Abilities

Can GPTs like ChatGPT-4 play legal moves and finish chess games? What is the actual Elo rating of GPTs? There has been a lot of hype, (subjective) assessment, and buzz lately, from “GPT is capable of beating 99% of players?” to “GPT plays lots of illegal moves” to “here is a magic prompt with Magnus Carlsen in the headers”. There are more or less solid anecdotes here and there, with counter-examples showing impressive failures or magnified stories on how well GPTs can play chess. I’ve resisted for a long time, but I’ve decided to do it seriously! I have synthesized hundreds of games with different variants of GPT, different prompt strategies, against different chess engines (with various skill levels). This post is here to document the variability space of experiments I have explored so far… and the underlying insights and results. The TL;DR is that gpt-3.5-turbo-instruct operates around 1750 Elo and is capable of playing end-to-end legal moves, even with the black pieces or when the game starts with strange openings. However, though there are “avoidable” errors, the issue of generating illegal moves is still present in 16% of the games. ChatGPT-3.5-turbo and, more surprisingly, ChatGPT-4 are much more brittle. Hence, we provide the first solid evidence that training for chat makes GPT worse on a well-defined problem (chess). Please do not stop at the TL;DR; read the entire blog post: there are subtleties and findings worth discussing!



Bing

Debunking the Chessboard: Confronting GPTs Against Chess Engines to Estimate Elo Ratings and Assess Legal Move Abilities

https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves


DuckDuckGo

https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves

Debunking the Chessboard: Confronting GPTs Against Chess Engines to Estimate Elo Ratings and Assess Legal Move Abilities

  • General Meta Tags

    8
    • title
      Debunking the Chessboard: Confronting GPTs Against Chess Engines to Estimate Elo Ratings and Assess Legal Move Abilities – Mathieu Acher – Professor in Computer Science
    • charset
      utf-8
    • Content-Type
      text/html; charset=utf-8
    • X-UA-Compatible
      IE=edge
    • viewport
      width=device-width, initial-scale=1.0, maximum-scale=1.0
  • Open Graph Meta Tags

    2
    • og:description
      Can GPTs like ChatGPT-4 play legal moves and finish chess games? What is the actual Elo rating of GPTs? There has been a lot of hype, (subjective) assessment, and buzz lately, from “GPT is capable of beating 99% of players?” to “GPT plays lots of illegal moves” to “here is a magic prompt with Magnus Carlsen in the headers”. There are more or less solid anecdotes here and there, with counter-examples showing impressive failures or magnified stories on how well GPTs can play chess. I’ve resisted for a long time, but I’ve decided to do it seriously! I have synthesized hundreds of games with different variants of GPT, different prompt strategies, against different chess engines (with various skill levels). This post is here to document the variability space of experiments I have explored so far… and the underlying insights and results. The TL;DR is that gpt-3.5-turbo-instruct operates around 1750 Elo and is capable of playing end-to-end legal moves, even with the black pieces or when the game starts with strange openings. However, though there are “avoidable” errors, the issue of generating illegal moves is still present in 16% of the games. ChatGPT-3.5-turbo and, more surprisingly, ChatGPT-4 are much more brittle. Hence, we provide the first solid evidence that training for chat makes GPT worse on a well-defined problem (chess). Please do not stop at the TL;DR; read the entire blog post: there are subtleties and findings worth discussing!
    • og:title
      Debunking the Chessboard: Confronting GPTs Against Chess Engines to Estimate Elo Ratings and Assess Legal Move Abilities
  • Link Tags

    2
    • alternate
      /feed.xml
    • stylesheet
      /style.css
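
A listing like the one above (general, Open Graph, and link tags) can be derived from a page's HTML head. The following is a minimal sketch using only Python's standard library. The class MetaTagCollector, the function collect_meta_tags, and the SAMPLE_HTML snippet are hypothetical illustrations; the sample markup is an assumption modeled on the tag names listed here, not the actual source of the blog page.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Hypothetical helper: groups <title>, <meta>, and <link> tags."""

    def __init__(self):
        super().__init__()
        self.general = {}      # title, charset, name/http-equiv metas
        self.open_graph = {}   # metas whose property starts with "og:"
        self.links = {}        # rel -> href
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = attrs.get("content") or ""
            prop = attrs.get("property") or ""
            if "charset" in attrs:
                self.general["charset"] = attrs["charset"]
            elif prop.startswith("og:"):
                self.open_graph[prop] = content
            else:
                key = attrs.get("name") or attrs.get("http-equiv")
                if key:
                    self.general[key] = content
        elif tag == "link" and attrs.get("rel"):
            self.links[attrs["rel"]] = attrs.get("href") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside <title>.
        if self._in_title:
            self.general["title"] = self.general.get("title", "") + data


def collect_meta_tags(html):
    """Parse an HTML string and return the populated collector."""
    parser = MetaTagCollector()
    parser.feed(html)
    parser.close()
    return parser


# Assumed sample markup, for illustration only.
SAMPLE_HTML = """
<head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta property="og:title" content="Debunking the Chessboard">
  <title>Debunking the Chessboard</title>
  <link rel="alternate" href="/feed.xml">
  <link rel="stylesheet" href="/style.css">
</head>
"""

if __name__ == "__main__":
    tags = collect_meta_tags(SAMPLE_HTML)
    print(tags.general)
    print(tags.open_graph)
    print(tags.links)
```

The three dictionaries mirror the three groups in the listing. A real preview tool would likely also need to handle duplicate keys and fetch the page over HTTP, both of which this sketch leaves out.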

Emails

1

Links

22