> What powers in-context learning is not very well understood, but it doesn't appear to work exactly like a non-teacher-forced version of training.
I mean, yes. One distills information from a training set into a compressed representation; the other generates activations that settle into (more or less) fixed attractors in the network's state space. It's just inducing a bias over that state space, nothing incredibly special. That said, I'd consider the initial stage of training the most important, since it is responsible for ingesting all of the embedded information the network will draw on during autoregressive inference, especially over longer generations.
So I find the notion of 'in-context learning' to be a bit of an illusion: no actual learning is being done, just the induction of biases, which gives rise to what looks like a transiently 'better-trained' network.
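To make the "biases, not learning" point concrete, here's a minimal toy sketch of my own (a bigram counter standing in for a trained network, which is obviously a huge simplification): the "parameters" are frozen counts, and conditioning on a prompt changes the prediction without touching them.

```python
# Toy illustration: a bigram "model" whose trained parameters are frozen.
# Conditioning on context changes the output, but the parameters never move --
# the "in-context" effect is selection over existing state, not learning.
from collections import Counter, defaultdict
from copy import deepcopy

def train(corpus):
    """One-time training pass: bigram counts are the frozen 'weights'."""
    counts = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return counts

def predict_next(counts, context):
    """Read-only inference: only the last context token biases the output."""
    dist = counts.get(context[-1])
    if not dist:
        return None
    return dist.most_common(1)[0][0]

counts = train(["the", "cat", "sat", "the", "dog", "sat", "the", "cat", "ran"])
snapshot = deepcopy(counts)

# Different contexts, different outputs -- yet the parameters are unchanged.
prediction = predict_next(counts, ["a", "totally", "new", "cat"])
assert counts == snapshot  # no weights were updated by "in-context" use
```

Crude, but it captures the distinction: the prompt biases which part of the already-trained state space gets exercised; nothing is distilled back into the parameters.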
You could see this as a bit obvious, perhaps, but I feel it needs to be said.
I guess if we automatically fed every such conversation back into the training set, it would be proper "in-context learning", and it would work much like how humans learn: inferring new facts from recalled facts, and memorizing the inferences that come up repeatedly or otherwise feel important.
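Continuing the toy bigram framing from above (again, just my own sketch of the idea, not anyone's actual training pipeline): feeding generated text back into the training counts changes the parameters themselves, which is learning in the ordinary sense, unlike pure in-context conditioning.

```python
# Hedged sketch of the feedback-loop idea: fold a generated "conversation"
# back into the training counts, and the parameters actually shift.
from collections import Counter, defaultdict

def update(counts, text):
    """Training-style update: mutates the bigram counts in place."""
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1

counts = defaultdict(Counter)
update(counts, ["the", "cat", "sat"])

# Before feedback, the model has never associated "the" with "dog".
assert "dog" not in counts["the"]

# Feed a generated conversation back into the training set.
update(counts, ["the", "dog", "ran"])
assert counts["the"]["dog"] == 1  # parameters changed: the inference is memorized
```

That parameter mutation is the line between "inducing a transient bias" and actually learning something.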