> What powers in-context learning is not very well understood, but it doesn't appear to work exactly like a non-teacher-forced version of training.
I mean, yes. One distills information from a training set into a compressed representation; the other generates activations that settle into (more or less) fixed attractors in the network's state space. It's just inducing a bias over that state space, nothing incredibly special. That said, I'd consider the initial stage of training the most important, since it is responsible for ingesting all of the embedded information the network will draw on during autoregressive inference, especially over longer generations.
So I find the notion of 'in-context learning' to be a bit of an illusion: no actual learning is being done, just the induction of biases, which gives rise to what looks like a transiently 'better-trained' network.
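To make the "biases, not learning" point concrete, here's a minimal toy sketch of my own (a bigram counter standing in for a trained network, which is obviously a huge simplification): the "parameters" are frozen counts, and conditioning on a prompt changes the prediction without touching them.

```python
# Toy illustration: a bigram "model" whose trained parameters are frozen.
# Conditioning on context changes the output, but the parameters never move --
# the "in-context" effect is selection over existing state, not learning.
from collections import Counter, defaultdict
from copy import deepcopy

def train(corpus):
    """One-time training pass: bigram counts are the frozen 'weights'."""
    counts = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return counts

def predict_next(counts, context):
    """Read-only inference: only the last context token biases the output."""
    dist = counts.get(context[-1])
    if not dist:
        return None
    return dist.most_common(1)[0][0]

counts = train(["the", "cat", "sat", "the", "dog", "sat", "the", "cat", "ran"])
snapshot = deepcopy(counts)

# Different contexts, different outputs -- yet the parameters are unchanged.
prediction = predict_next(counts, ["a", "totally", "new", "cat"])
assert counts == snapshot  # no weights were updated by "in-context" use
```

Crude, but it captures the distinction: the prompt biases which part of the already-trained state space gets exercised; nothing is distilled back into the parameters.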
You could see this as a bit obvious, perhaps, but I feel it needs to be said.
I guess if we automatically fed every such conversation back into the training set, it would be proper "in-context learning", and it would work much like how humans learn: inferring new facts from recalled facts, and memorizing the inferences that come up repeatedly or otherwise feel important.
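Continuing the toy bigram framing from above (again, just my own sketch of the idea, not anyone's actual training pipeline): feeding generated text back into the training counts changes the parameters themselves, which is learning in the ordinary sense, unlike pure in-context conditioning.

```python
# Hedged sketch of the feedback-loop idea: fold a generated "conversation"
# back into the training counts, and the parameters actually shift.
from collections import Counter, defaultdict

def update(counts, text):
    """Training-style update: mutates the bigram counts in place."""
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1

counts = defaultdict(Counter)
update(counts, ["the", "cat", "sat"])

# Before feedback, the model has never associated "the" with "dog".
assert "dog" not in counts["the"]

# Feed a generated conversation back into the training set.
update(counts, ["the", "dog", "ran"])
assert counts["the"]["dog"] == 1  # parameters changed: the inference is memorized
```

That parameter mutation is the line between "inducing a transient bias" and actually learning something.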