This model card is eye-opening (I think it might be designed to be). The alignment and model welfare sections are extensive, which is heartening; at least on the surface, Anthropic seems to be living up to its promises re: safety.

That said, has anyone else read section 5.2.3 of the Alignment Risk Update (https://www-cdn.anthropic.com/79c2d46d997783b9d2fb3241de4321...)? It's referenced in section 4.1.3 of the model card. Basically, they accidentally trained the model with an RL reward model that had access to the model's reasoning traces in 8% of cases. The problem is that the model could learn to directly manipulate its reasoning traces to satisfy external observers, rather than using them for genuine reasoning. This seems like a huge deal, and it may have partially poisoned Anthropic's interpretability pipeline going forward.
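
For anyone who hasn't read it, my rough mental model of the bug (this is my own sketch of the failure mode, not code or names from the report):

    import random

    LEAK_PROB = 0.08  # fraction of episodes where the RM saw the CoT

    def compute_reward(reward_model, prompt, reasoning, answer):
        if random.random() < LEAK_PROB:
            # Bug: the reward model also scores the hidden reasoning,
            # so RL now optimizes the CoT itself to look good to an
            # external grader instead of leaving it as a raw scratchpad.
            return reward_model.score(prompt + reasoning + answer)
        # Intended behavior: only the visible answer is rewarded.
        return reward_model.score(prompt + answer)

Even at 8%, that's nonzero optimization pressure on the traces, which is exactly what you don't want if you plan to treat the CoT as a faithful window into the model's cognition.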