
Develop way to evaluate performance of our model #110

Open
nashirj opened this issue Jan 20, 2022 · 4 comments

Assignees
nashirj

Labels
AI/ML Team (AI, CV), High Priority (need to be addressed asap)

Comments

nashirj commented Jan 20, 2022

Ideas:

Comparing by win/loss against other agents

  • Compare against an agent that selects actions randomly
  • Compare against our previous best baseline agent
  • Compare against Stockfish (at varying search depths)

Comparing output move distribution

  • Compare the list of moves output by our model to the list of best moves output by Stockfish, using a similarity metric. One option is the Footrule distance, but it treats a difference at the top of the list as being just as relevant as one at the bottom. Since we care more about the "best" moves and lower-ranked moves matter less, we should consider a weighted implementation (see the sketch after this list).
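
As a starting point, here is a minimal Python sketch of the unweighted Footrule distance together with one possible rank-weighted variant. The move lists, the 1 / (1 + rank) discount, and the handling of moves missing from one list are all assumptions, not a settled design.

```python
def footrule_distance(moves_a, moves_b):
    """Spearman footrule distance between two ranked move lists.

    Sums |rank in A - rank in B| over all moves appearing in either list;
    a move missing from a list is treated as ranked just past its end.
    """
    rank_a = {m: i for i, m in enumerate(moves_a)}
    rank_b = {m: i for i, m in enumerate(moves_b)}
    all_moves = set(moves_a) | set(moves_b)
    return sum(abs(rank_a.get(m, len(moves_a)) - rank_b.get(m, len(moves_b)))
               for m in all_moves)


def weighted_footrule_distance(moves_a, moves_b):
    """Like footrule_distance, but each term is discounted by 1 / (1 + best rank),
    so disagreement near the top of the lists costs more than near the bottom."""
    rank_a = {m: i for i, m in enumerate(moves_a)}
    rank_b = {m: i for i, m in enumerate(moves_b)}
    total = 0.0
    for m in set(moves_a) | set(moves_b):
        ra = rank_a.get(m, len(moves_a))
        rb = rank_b.get(m, len(moves_b))
        total += abs(ra - rb) / (1 + min(ra, rb))
    return total


# Same top move, disagreement further down the lists.
print(footrule_distance(["e2e4", "d2d4", "g1f3"], ["e2e4", "g1f3", "c2c4"]))
print(weighted_footrule_distance(["e2e4", "d2d4", "g1f3"], ["e2e4", "g1f3", "c2c4"]))
```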

nashirj commented Jan 21, 2022

Note: we want to normalize the similarity metric so that it is always between 0 and 1 (one way to do this is sketched below).
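
For the unweighted Footrule over two rankings of the same n moves, the maximum possible distance is ⌊n²/2⌋, so dividing by that bound maps the metric into [0, 1]. A small sketch, reusing `footrule_distance` from the earlier comment; the assumption that both lists rank the same set of moves is mine:

```python
def normalized_footrule(moves_a, moves_b):
    """Footrule distance scaled into [0, 1].

    0 means identical rankings; 1 means one ranking is the reverse of the other.
    Assumes both lists rank the same n moves (e.g., all legal moves).
    """
    n = len(moves_a)
    max_dist = (n * n) // 2  # maximum footrule over two permutations of n items
    if max_dist == 0:
        return 0.0
    return footrule_distance(moves_a, moves_b) / max_dist


print(normalized_footrule(["e2e4", "d2d4", "g1f3"], ["e2e4", "d2d4", "g1f3"]))  # 0.0
print(normalized_footrule(["e2e4", "d2d4", "g1f3"], ["g1f3", "d2d4", "e2e4"]))  # 1.0
```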

nashirj commented Mar 18, 2022

In our last software meeting, we discussed using a series of puzzles (https://database.lichess.org/#puzzles) to determine the Elo of the agent. Each puzzle has a distinct best move for the "player" to make, so we can use them to quantify performance objectively. One caveat is that some puzzles may have multiple mate-in-one moves, in which case any move leading to checkmate should be counted as a solution. We can select a subset of puzzles and run it in the training loop every n iterations to quantify model improvement or deterioration (rough sketch below).
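
Here is a minimal sketch of what that could look like with python-chess, restricted to single-reply puzzles. `model.select_move(board)` is a hypothetical interface, and the column handling assumes the Lichess puzzle CSV layout (the FEN is the position before the opponent's move, and the first move in `Moves` is that opponent move).

```python
import csv
import chess


def solve_rate(model, puzzle_csv_path, limit=1000):
    """Fraction of single-reply puzzles the model solves."""
    solved = attempted = 0
    with open(puzzle_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            moves = row["Moves"].split()
            if len(moves) != 2:
                continue  # keep it simple: single-reply puzzles only
            board = chess.Board(row["FEN"])
            board.push(chess.Move.from_uci(moves[0]))  # opponent's move
            expected = chess.Move.from_uci(moves[1])   # the puzzle's solution

            predicted = model.select_move(board)  # hypothetical model hook
            board.push(predicted)
            # Count any mating move as correct, to handle puzzles
            # that have more than one mate-in-one.
            if predicted == expected or board.is_checkmate():
                solved += 1
            attempted += 1
            if attempted >= limit:
                break
    return solved / max(attempted, 1)
```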

nashirj commented Mar 20, 2022

A couple of interesting quotes I read just now:

> Remember that tactics only come about because it's a good position. If you don't know how to play positionally and set up for tactics, they will never show up in your games.

> I agree that it's not all about tactics, but even if it was, there's no reason these two ratings should be in sync with each other. They are completely different systems. One is a result of head-to-head competition, the other is a solo endeavor where the "rating" you get assigned is really quite arbitrary.

So maybe we should use puzzles as a first pass, and if the new AI can solve them, evaluate it by playing games against the previous best model? (A rough two-stage sketch is below.)
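
Something like the following could wire the two stages together. This is a Python sketch reusing `solve_rate` from the puzzle comment above; `play_game`, the thresholds, and the `select_move` interface are placeholder assumptions, not decisions.

```python
import chess


def play_game(white, black, max_moves=300):
    """Play one game between two agents exposing a select_move(board) hook
    (hypothetical interface). Returns 1 for a white win, -1 for a black win,
    0 for a draw or move-limit cutoff."""
    board = chess.Board()
    for _ in range(max_moves):
        if board.is_game_over():
            break
        agent = white if board.turn == chess.WHITE else black
        board.push(agent.select_move(board))
    result = board.result(claim_draw=True)
    return {"1-0": 1, "0-1": -1}.get(result, 0)


def evaluate_candidate(candidate, previous_best, puzzle_csv_path,
                       puzzle_threshold=0.6, num_games=40, win_threshold=0.55):
    """Two-stage evaluation: cheap puzzle gate first, head-to-head match second.
    The thresholds here are arbitrary placeholders, not tuned values."""
    if solve_rate(candidate, puzzle_csv_path) < puzzle_threshold:
        return False  # skip the expensive match entirely

    score = 0.0
    for game in range(num_games):
        # Alternate colors so neither agent always plays white.
        if game % 2 == 0:
            score += (play_game(candidate, previous_best) + 1) / 2
        else:
            score += 1 - (play_game(previous_best, candidate) + 1) / 2
    return score / num_games > win_threshold
```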

nashirj commented Mar 20, 2022

Here is how AlphaZero does evaluation:

[image: alphazero-evaluation]

nashirj added the High Priority (need to be addressed asap) label on Apr 14, 2022
nashirj self-assigned this on Apr 15, 2022