I think a lot of people unwittingly think of the training process as "smart".
Similar criticisms run: "well, there are many descriptions of game X it would have read, so why doesn't it play X well?" But gradient descent is a dumb optimizer. Training itself is no more like someone reading a text than evolution is like someone thinking about the best way to augment an organism.
A "smart" optimizer would look at a reversal-applicable sentence and know exactly which weights to change so that the fact is stored in a way that can be recalled in either direction later.
Inference may be smart (GPT-4 can reverse in context just fine, and can potentially play games from only a description), but the training is not.
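To make the directionality concrete, here's a toy sketch of my own (a count-based bigram model, not a transformer, and not the actual reversal-curse setup — purely illustrative): next-token training only ever strengthens P(next | previous), so the reverse direction receives no signal at all.

```python
from collections import defaultdict, Counter

# Toy corpus: one "A is B"-style fact.
corpus = ["tom_cruise 's mother is mary_lee_pfeiffer".split()]

# "Training": accumulate next-token counts -- the bigram analogue of
# minimizing next-token log-likelihood. Only forward transitions are stored.
counts = defaultdict(Counter)
for sent in corpus:
    for prev, nxt in zip(sent, sent[1:]):
        counts[prev][nxt] += 1

def predict_next(token):
    c = counts[token]
    return c.most_common(1)[0][0] if c else None

print(predict_next("is"))                 # forward recall works: 'mary_lee_pfeiffer'
print(predict_next("mary_lee_pfeiffer"))  # reverse recall: None -- no signal ever flowed this way
```

Nothing "looked at" the sentence and decided to store it bidirectionally; the update rule is blind to anything beyond the forward objective.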
A prior (now-deleted) comment of yours put this in such a great way, and I'm sad that it got deleted:
> You could potentially describe how a game is played to GPT-4 without examples and get it playing it correctly but passing that same description into the training process of the model just gets you a model that can describe your game correctly.
I'm not really sure I understand the intuition here; it seems rather disconnected from what I understand to be the mathematics of optimization.
It seems like you're referring to an associative Hebbian/Hopfield-like lookup, which the current 'dumb' optimizers already perform. Better yet, the learning rates are normalized by the diagonal of the empirical Fisher, so that learning w.r.t. some estimated expected information is more constant for said associative lookup operation.
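For concreteness, a minimal sketch of that normalization (essentially what RMSProp/Adam-style optimizers do: divide each parameter's step by the square root of a running mean of squared gradients, which for log-likelihood losses approximates the diagonal of the empirical Fisher; the toy quadratic objective here is my own invention):

```python
import numpy as np

def fisher_normalized_step(theta, grad, fisher_diag, lr=0.1, beta=0.999, eps=1e-8):
    """One preconditioned step; fisher_diag is the running estimate."""
    # Running average of squared gradients ~ diagonal of the empirical Fisher.
    fisher_diag = beta * fisher_diag + (1 - beta) * grad ** 2
    # Normalize: directions with high estimated information get smaller steps.
    theta = theta - lr * grad / (np.sqrt(fisher_diag) + eps)
    return theta, fisher_diag

# Toy quadratic loss ||theta - target||^2 / 2, whose gradient is theta - target.
target = np.array([1.0, 2.0, 3.0])
theta, fisher = np.zeros(3), np.zeros(3)
for _ in range(100):
    theta, fisher = fisher_normalized_step(theta, theta - target, fisher)
```

The point is that the per-parameter step size adapts to the local information estimate, with no "smart" inspection of what is being stored.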
Additionally, the training loop (which you call 'dumb') is just...a teacher-forced version of inference, which you call 'smart'?
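To spell out that contrast with a stand-in toy model of my own invention (the point is the loop structure, not the architecture): training conditions every prediction on the ground-truth prefix, while inference conditions on the model's own previous outputs.

```python
import math

class BigramModel:
    """Stand-in 'model': fixed next-token distributions keyed on the last token."""
    def __init__(self, table):
        self.table = table
    def predict(self, context):
        return self.table[context[-1]]

def teacher_forced_loss(model, tokens):
    """Training-style pass: each prediction conditions on the TRUE prefix."""
    loss = 0.0
    for t in range(1, len(tokens)):
        probs = model.predict(tokens[:t])        # ground-truth context
        loss -= math.log(probs[tokens[t]])       # next-token log-likelihood
    return loss

def autoregressive_generate(model, prompt, n_steps):
    """Inference-style pass: each prediction conditions on the model's OWN outputs."""
    out = list(prompt)
    for _ in range(n_steps):
        probs = model.predict(out)               # self-generated context
        out.append(max(probs, key=probs.get))    # greedy decode
    return out

model = BigramModel({"a": {"b": 0.9, "a": 0.1}, "b": {"a": 0.8, "b": 0.2}})
print(autoregressive_generate(model, ["a"], 3))  # ['a', 'b', 'a', 'b']
```

Whether that structural similarity makes training and inference the same *kind* of process is exactly what's in dispute here.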
It's better to simply minimize the log-likelihood in a scalable way. Hand-engineered solutions rarely survive compared to strong-scaling ones.
'Smart' is in quotes here for a reason: it's 'dumb' relative to many people's expectations.
>Additionally, the training loop (which you call 'dumb') is just...a teacher-forced version of inference, which you call 'smart'?
What powers in-context learning is not very well understood, but it doesn't appear to be, or really work exactly like, a non-teacher-forced version of training. There are qualitative differences: the same models have no problem with this 'curse' when the information is provided in context, for example.
>It's better to simply minimize the log-likelihood in a scalable way. Hand-engineered solutions rarely survive compared to strong-scaling ones.
> What powers in-context learning is not very well understood, but it doesn't appear to be, or really work exactly like, a non-teacher-forced version of training.
I mean, yes. One distills information from a training set into a compressed representation, and the other generates a compressed representation that yields (more or less) fixed state-space attractors. It's just inducing a bias over the state space of the network, nothing incredibly special. That said, I'd consider the initial stage of training the most important, as it is responsible for ingesting all of the embedded information the network will use during autoregressive inference (especially over longer generations).
So I find the notion of 'in-context learning' to be a bit of an illusion, of course: no actual learning is being done, just the induction of biases, which appears to give rise to a transiently 'better-trained' network.
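A minimal illustration of that framing (loosely modeled on an 'induction head' pattern; the rule here is fixed by hand, standing in for frozen weights): behavior adapts to what sits in the prompt even though nothing persistent changes.

```python
def induction_predict(context):
    """Fixed rule, no trainable state: predict the token that followed the
    most recent earlier occurrence of the current last token."""
    last = context[-1]
    for i in range(len(context) - 2, -1, -1):
        if context[i] == last:
            return context[i + 1]
    return None

# The mapping x -> 7 is only "learned" for as long as it sits in the prompt.
print(induction_predict(["x", "7", "y", "4", "x"]))  # '7'
print(induction_predict(["x"]))                      # None: drop the context and the 'skill' vanishes
```

No parameters were updated in producing either answer; the apparent learning lives entirely in the conditioning.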
Perhaps you'll see this as a bit straightforward, but I feel it needs to be said.
I guess if we automatically fed every such conversation back into the training set, it would be proper "in-context learning", and it would work very much like humans learn: inferring new facts from recalled facts, and memorizing the inferences that come up repeatedly or otherwise feel important.
I'm LLM-ignorant, but my layman brain wonders if the logical relationships belong in a kind of as-yet-uninvented "pre-processor" that feeds into the LLM.
(Perhaps though it has to be done during training so that "Who was the ninth Chancellor of Germany" becomes knowable in the first place.)