Very cool project. I'm currently reading The Alignment Problem by Brian Christian on AI (very well written and easy to read), and also (very slowly) My Great Predecessors, Vol 1 by Garry Kasparov. This is a great combination of the two topics!

I'm particularly interested in the second neural net that generates explanations. Does it only generate the natural language from artifacts of the original engine net, or does it actually inspect the state of the engine net to derive the insights?

It's an interesting application of explainability of AI algorithms that Christian talks about in his chapter on transparency. In particular, he discusses "saliency" of algorithms (knowing what parts of the input were most important in producing a prediction) and "multitask nets" that output multiple predictions (so, maybe here, one output is the best move, and another output is the explanation).
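For anyone unfamiliar with saliency, here's a rough sketch of one simple variant (occlusion: blank out each input and see how much the output moves). The `toy_eval` function, its weights, and the feature vector are all made up for illustration; I have no idea whether this project does anything like it.

```python
def occlusion_saliency(score, features, baseline=0.0):
    """Return, for each input feature, how much the score changes
    when that feature alone is replaced by a baseline value."""
    base = score(features)
    saliency = []
    for i in range(len(features)):
        occluded = list(features)
        occluded[i] = baseline  # blank out one input at a time
        saliency.append(abs(base - score(occluded)))
    return saliency


# Hypothetical stand-in for an evaluation net: a fixed weighted sum.
weights = [1.0, 3.0, 0.5]


def toy_eval(x):
    return sum(w * v for w, v in zip(weights, x))


sal = occlusion_saliency(toy_eval, [1.0, 1.0, 1.0])
most_important = sal.index(max(sal))  # index of the most influential input
```

A real net would need gradients or many more perturbations, but the idea is the same: rank the inputs by how much the prediction depends on them.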

The writeups are fantastic reading. I see almost no sources in common with the bibliography in The Alignment Problem (which runs to 50 pages), which makes them nice complements. The only common citations I could find were Sutton 1988 (Temporal Differences) and Silver et al. 2016 (AlphaGo), 2017 (AlphaGo Zero), and 2018 (AlphaZero).

There's a note (44) in chapter 5, "Shaping," where Christian talks about "meta-reasoning: the right way to think about thinking. When you play a game–for instance, chess–you win because of the moves you chose, but it was the thoughts you had that enabled you to choose those moves...Figuring out how an aspiring chess player–or any kind of agent–should learn about its thought process seemed like a more important but also dramatically harder task than simply learning how to pick good moves." This note might as well be direct inspiration for this project. It goes on to quote Stuart Russell about "a computation that changes your mind about what is a good move to make...reward that computation by how much you changed your mind...so you could change your mind in the sense of discovering that what was the second-best move is actually even better than what was the best move." That's in the context of a cautionary tale: optimizing only for those "changes of mind" doesn't necessarily lead to a correct outcome, so you have to "arrange these internal pseudorewards so that along a path, they add up to the same as true, eventually." This sounds pretty much like the task of a coach.
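To make the "add up to the same" condition concrete, here's a rough sketch of how I read it, using potential-based shaping (my assumption about the formalism being referred to, not something taken from this project). Each step from one state to the next earns the change in a potential function, so the pseudorewards along any path telescope to the difference between the end and start potentials. The state names and evaluation numbers are hypothetical.

```python
def shaped_rewards(path, potential, gamma=1.0):
    """Pseudoreward for each step cur -> nxt is
    gamma * potential(nxt) - potential(cur)."""
    return [
        gamma * potential(nxt) - potential(cur)
        for cur, nxt in zip(path, path[1:])
    ]


# Hypothetical position evaluations acting as the potential function.
evals = {"start": 0.0, "a": 0.3, "b": -0.1, "end": 1.0}

path = ["start", "a", "b", "end"]
rewards = shaped_rewards(path, evals.get)

# With gamma = 1 the intermediate terms cancel, so the total depends
# only on where the path starts and ends.
total = sum(rewards)
expected = evals["end"] - evals["start"]
```

Because the intermediate terms cancel, flip-flopping between evaluations can't rack up extra reward on its own; only where you actually end up counts.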