
Develop way to evaluate performance of our model #110

Open
nashirj opened this issue Jan 20, 2022 · 4 comments

Assignees
nashirj

Labels
AI/ML Team (AI, CV), High Priority (need to be addressed asap)

Comments

nashirj commented Jan 20, 2022

Ideas:

Comparing by win/loss against other agents

  • Compare against an agent that selects actions randomly
  • Compare against our previous best baseline agent
  • Compare against Stockfish (at varying search depths)

Comparing output move distribution

  • Compare the list of moves output by our model to the list of best moves output by Stockfish, using a similarity metric. One option is the Footrule distance, but it treats a difference at the top of the list as being just as relevant as one at the bottom. Since we care more about the "best" moves and lower-ranked moves matter less, we should consider a weighted implementation (see the sketch after this list).
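
As a starting point, here is a minimal Python sketch of the unweighted Footrule distance together with one possible rank-weighted variant. The move lists, the 1 / (1 + rank) discount, and the handling of moves missing from one list are all assumptions, not a settled design.

```python
def footrule_distance(moves_a, moves_b):
    """Spearman footrule distance between two ranked move lists.

    Sums |rank in A - rank in B| over all moves appearing in either list;
    a move missing from a list is treated as ranked just past its end.
    """
    rank_a = {m: i for i, m in enumerate(moves_a)}
    rank_b = {m: i for i, m in enumerate(moves_b)}
    all_moves = set(moves_a) | set(moves_b)
    return sum(abs(rank_a.get(m, len(moves_a)) - rank_b.get(m, len(moves_b)))
               for m in all_moves)


def weighted_footrule_distance(moves_a, moves_b):
    """Like footrule_distance, but each term is discounted by 1 / (1 + best rank),
    so disagreement near the top of the lists costs more than near the bottom."""
    rank_a = {m: i for i, m in enumerate(moves_a)}
    rank_b = {m: i for i, m in enumerate(moves_b)}
    total = 0.0
    for m in set(moves_a) | set(moves_b):
        ra = rank_a.get(m, len(moves_a))
        rb = rank_b.get(m, len(moves_b))
        total += abs(ra - rb) / (1 + min(ra, rb))
    return total


# Same top move, disagreement further down the lists.
print(footrule_distance(["e2e4", "d2d4", "g1f3"], ["e2e4", "g1f3", "c2c4"]))
print(weighted_footrule_distance(["e2e4", "d2d4", "g1f3"], ["e2e4", "g1f3", "c2c4"]))
```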

nashirj commented Jan 21, 2022

Note: we want to normalize the similarity metric so that it is always between 0 and 1 (one way to do this is sketched below).
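
For the unweighted Footrule over two rankings of the same n moves, the maximum possible distance is ⌊n²/2⌋, so dividing by that bound maps the metric into [0, 1]. A small sketch, reusing `footrule_distance` from the earlier comment; the assumption that both lists rank the same set of moves is mine:

```python
def normalized_footrule(moves_a, moves_b):
    """Footrule distance scaled into [0, 1].

    0 means identical rankings; 1 means one ranking is the reverse of the other.
    Assumes both lists rank the same n moves (e.g., all legal moves).
    """
    n = len(moves_a)
    max_dist = (n * n) // 2  # maximum footrule over two permutations of n items
    if max_dist == 0:
        return 0.0
    return footrule_distance(moves_a, moves_b) / max_dist


print(normalized_footrule(["e2e4", "d2d4", "g1f3"], ["e2e4", "d2d4", "g1f3"]))  # 0.0
print(normalized_footrule(["e2e4", "d2d4", "g1f3"], ["g1f3", "d2d4", "e2e4"]))  # 1.0
```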

nashirj commented Mar 18, 2022

In our last software meeting, we discussed using a series of puzzles (https://database.lichess.org/#puzzles) to determine the Elo of the agent. Each puzzle has a distinct best move for the "player" to make, so we can use them to quantify performance objectively. One caveat is that some puzzles may have multiple mate-in-one moves, in which case any move leading to checkmate should be counted as a solution. We can select a subset of puzzles and run it in the training loop every n iterations to quantify model improvement or deterioration (rough sketch below).
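
Here is a minimal sketch of what that could look like with python-chess, restricted to single-reply puzzles. `model.select_move(board)` is a hypothetical interface, and the column handling assumes the Lichess puzzle CSV layout (the FEN is the position before the opponent's move, and the first move in `Moves` is that opponent move).

```python
import csv
import chess


def solve_rate(model, puzzle_csv_path, limit=1000):
    """Fraction of single-reply puzzles the model solves."""
    solved = attempted = 0
    with open(puzzle_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            moves = row["Moves"].split()
            if len(moves) != 2:
                continue  # keep it simple: single-reply puzzles only
            board = chess.Board(row["FEN"])
            board.push(chess.Move.from_uci(moves[0]))  # opponent's move
            expected = chess.Move.from_uci(moves[1])   # the puzzle's solution

            predicted = model.select_move(board)  # hypothetical model hook
            board.push(predicted)
            # Count any mating move as correct, to handle puzzles
            # that have more than one mate-in-one.
            if predicted == expected or board.is_checkmate():
                solved += 1
            attempted += 1
            if attempted >= limit:
                break
    return solved / max(attempted, 1)
```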

nashirj commented Mar 20, 2022

A couple of interesting quotes I read just now:

> Remember that tactics only come about because it's a good position. If you don't know how to play positionally and set up for tactics, they will never show up in your games.

> I agree that it's not all about tactics, but even if it was, there's no reason these two ratings should be in sync with each other. They are completely different systems. One is a result of head-to-head competition, the other is a solo endeavor where the "rating" you get assigned is really quite arbitrary.

So maybe we should use puzzles as a first pass, and if the new AI can solve them, evaluate it by playing games against the previous best model? (A rough two-stage sketch is below.)
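
Something like the following could wire the two stages together. This is a Python sketch reusing `solve_rate` from the puzzle comment above; `play_game`, the thresholds, and the `select_move` interface are placeholder assumptions, not decisions.

```python
import chess


def play_game(white, black, max_moves=300):
    """Play one game between two agents exposing a select_move(board) hook
    (hypothetical interface). Returns 1 for a white win, -1 for a black win,
    0 for a draw or move-limit cutoff."""
    board = chess.Board()
    for _ in range(max_moves):
        if board.is_game_over():
            break
        agent = white if board.turn == chess.WHITE else black
        board.push(agent.select_move(board))
    result = board.result(claim_draw=True)
    return {"1-0": 1, "0-1": -1}.get(result, 0)


def evaluate_candidate(candidate, previous_best, puzzle_csv_path,
                       puzzle_threshold=0.6, num_games=40, win_threshold=0.55):
    """Two-stage evaluation: cheap puzzle gate first, head-to-head match second.
    The thresholds here are arbitrary placeholders, not tuned values."""
    if solve_rate(candidate, puzzle_csv_path) < puzzle_threshold:
        return False  # skip the expensive match entirely

    score = 0.0
    for game in range(num_games):
        # Alternate colors so neither agent always plays white.
        if game % 2 == 0:
            score += (play_game(candidate, previous_best) + 1) / 2
        else:
            score += 1 - (play_game(previous_best, candidate) + 1) / 2
    return score / num_games > win_threshold
```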

nashirj commented Mar 20, 2022

Here is how AlphaZero does evaluation:

[image: alphazero-evaluation]

nashirj added the High Priority (need to be addressed asap) label on Apr 14, 2022
nashirj self-assigned this on Apr 15, 2022