I think a lot of people unwittingly think of the training process as "smart".
Similar criticisms run: "well, there are many descriptions of game X it would have read, so why doesn't it play X well?" But gradient descent is a dumb optimizer. Training itself is no more like someone reading a text than evolution is like someone thinking about the best way to augment an organism.
A "smart" optimizer would look at a reversal-applicable sentence and know exactly which weights to change so that the fact is stored in a way that can be recalled in either direction later.
Inference may be smart (GPT-4 can reverse in context just fine, and can potentially play games from only a description), but the training is not.
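To make the directionality concrete, here's a toy sketch of my own (a count-based bigram model, not a transformer, and not the actual reversal-curse setup — purely illustrative): next-token training only ever strengthens P(next | previous), so the reverse direction receives no signal at all.

```python
from collections import defaultdict, Counter

# Toy corpus: one "A is B"-style fact.
corpus = ["tom_cruise 's mother is mary_lee_pfeiffer".split()]

# "Training": accumulate next-token counts -- the bigram analogue of
# minimizing next-token log-likelihood. Only forward transitions are stored.
counts = defaultdict(Counter)
for sent in corpus:
    for prev, nxt in zip(sent, sent[1:]):
        counts[prev][nxt] += 1

def predict_next(token):
    c = counts[token]
    return c.most_common(1)[0][0] if c else None

print(predict_next("is"))                 # forward recall works: 'mary_lee_pfeiffer'
print(predict_next("mary_lee_pfeiffer"))  # reverse recall: None -- no signal ever flowed this way
```

Nothing "looked at" the sentence and decided to store it bidirectionally; the update rule is blind to anything beyond the forward objective.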
A prior (now-deleted) comment of yours put this in such a great way, and I'm sad that it got deleted:
> You could potentially describe how a game is played to GPT-4 without examples and get it playing it correctly but passing that same description into the training process of the model just gets you a model that can describe your game correctly.
I'm not really sure I understand the intuition here; it seems rather disconnected from what I understand to be the mathematics of optimization.
It seems like you're referring to an associative Hebbian/Hopfield-like lookup, which the current 'dumb' optimizers already perform. Better yet, the learning rates are normalized by the diagonal of the empirical Fisher, so that learning w.r.t. some estimated expected information is more constant for said associative lookup operation.
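For concreteness, a minimal sketch of that normalization (essentially what RMSProp/Adam-style optimizers do: divide each parameter's step by the square root of a running mean of squared gradients, which for log-likelihood losses approximates the diagonal of the empirical Fisher; the toy quadratic objective here is my own invention):

```python
import numpy as np

def fisher_normalized_step(theta, grad, fisher_diag, lr=0.1, beta=0.999, eps=1e-8):
    """One preconditioned step; fisher_diag is the running estimate."""
    # Running average of squared gradients ~ diagonal of the empirical Fisher.
    fisher_diag = beta * fisher_diag + (1 - beta) * grad ** 2
    # Normalize: directions with high estimated information get smaller steps.
    theta = theta - lr * grad / (np.sqrt(fisher_diag) + eps)
    return theta, fisher_diag

# Toy quadratic loss ||theta - target||^2 / 2, whose gradient is theta - target.
target = np.array([1.0, 2.0, 3.0])
theta, fisher = np.zeros(3), np.zeros(3)
for _ in range(100):
    theta, fisher = fisher_normalized_step(theta, theta - target, fisher)
```

The point is that the per-parameter step size adapts to the local information estimate, with no "smart" inspection of what is being stored.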
Additionally, the training loop (which you call 'dumb') is just...a teacher-forced version of inference, which you call 'smart'?
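To spell out that contrast with a stand-in toy model of my own invention (the point is the loop structure, not the architecture): training conditions every prediction on the ground-truth prefix, while inference conditions on the model's own previous outputs.

```python
import math

class BigramModel:
    """Stand-in 'model': fixed next-token distributions keyed on the last token."""
    def __init__(self, table):
        self.table = table
    def predict(self, context):
        return self.table[context[-1]]

def teacher_forced_loss(model, tokens):
    """Training-style pass: each prediction conditions on the TRUE prefix."""
    loss = 0.0
    for t in range(1, len(tokens)):
        probs = model.predict(tokens[:t])        # ground-truth context
        loss -= math.log(probs[tokens[t]])       # next-token log-likelihood
    return loss

def autoregressive_generate(model, prompt, n_steps):
    """Inference-style pass: each prediction conditions on the model's OWN outputs."""
    out = list(prompt)
    for _ in range(n_steps):
        probs = model.predict(out)               # self-generated context
        out.append(max(probs, key=probs.get))    # greedy decode
    return out

model = BigramModel({"a": {"b": 0.9, "a": 0.1}, "b": {"a": 0.8, "b": 0.2}})
print(autoregressive_generate(model, ["a"], 3))  # ['a', 'b', 'a', 'b']
```

Whether that structural similarity makes training and inference the same *kind* of process is exactly what's in dispute here.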
It's better to simply minimize the log-likelihood in a scalable way. Hand-engineered solutions rarely survive compared to strong-scaling ones.
'Smart' is in quotes here for a reason: it's 'dumb' relative to many people's expectations.
>Additionally, the training loop (which you call 'dumb') is just...a teacher-forced version of inference, which you call 'smart'?
What powers in-context learning is not very well understood, but it doesn't appear to be, or really work exactly like, a non-teacher-forced version of training. There are qualitative differences: the same models have no problem with this 'curse' when the information is provided in context, for example.
>It's better to simply minimize the log-likelihood in a scalable way. Hand-engineered solutions rarely survive compared to strong-scaling ones.
> What powers in-context learning is not very well understood, but it doesn't appear to be, or really work exactly like, a non-teacher-forced version of training.
I mean, yes. One distills information from a training set into a compressed representation, and the other generates a compressed representation that yields (more or less) fixed state-space attractors. It's just inducing a bias over the state space of the network, nothing incredibly special. That said, I'd consider the initial stage of training the most important, as it is responsible for ingesting all of the embedded information the network will use during autoregressive inference (especially over longer generations).
So I find the notion of 'in-context learning' to be a bit of an illusion, of course: no actual learning is being done, just the induction of biases, which appears to give rise to a transiently 'better-trained' network.
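A minimal illustration of that framing (loosely modeled on an 'induction head' pattern; the rule here is fixed by hand, standing in for frozen weights): behavior adapts to what sits in the prompt even though nothing persistent changes.

```python
def induction_predict(context):
    """Fixed rule, no trainable state: predict the token that followed the
    most recent earlier occurrence of the current last token."""
    last = context[-1]
    for i in range(len(context) - 2, -1, -1):
        if context[i] == last:
            return context[i + 1]
    return None

# The mapping x -> 7 is only "learned" for as long as it sits in the prompt.
print(induction_predict(["x", "7", "y", "4", "x"]))  # '7'
print(induction_predict(["x"]))                      # None: drop the context and the 'skill' vanishes
```

No parameters were updated in producing either answer; the apparent learning lives entirely in the conditioning.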
Perhaps you'll see this as a bit straightforward, but I feel it needs to be said.
I guess if we automatically fed every such conversation back into the training set, it would be proper "in-context learning", and it would work very much like humans learn: inferring new facts from recalled facts, and memorizing the inferences that come up repeatedly or otherwise feel important.
I'm LLM-ignorant, but my layman brain wonders if the logical relationships belong in a kind of as-yet-uninvented "pre-processor" that feeds into the LLM.
(Perhaps though it has to be done during training so that "Who was the ninth Chancellor of Germany" becomes knowable in the first place.)