One interesting thought I've had around these topics is that it's not just attention: all DL methods suffer similar problems.
I truly believe the last step to AGI is solving continual learning. Efficiency will always inch up, but the "jump" is honestly not in sight.
Maybe attention + (unknown thing) really is all we need.
The thought is interesting because if you extrapolate that all DL models suffer the same class of problems (CL), the solution implies two possibilities.
1. In the future, AGI-level models will be entirely new categories sharing little to nothing with methods like attention. (Every part is different, like the article suggests.)
2. Or (maybe more likely) we will simply build on what we have. If that's true, then next-generation models in the AGI-like realm will be the same models we have now with one unifying change to all of them.
I previously made a unique transformer model in which every single neuron acted like a decision gate. Every neuron would choose a "computation neuron" before going on. Backprop was modified so that only the chosen computation neurons contributed to the backprop of the next layer.
It had some interesting properties, the largest being that every token's pass through the model effectively saw a completely different model. I was/am under the belief that scaling dimensionality == solving CL.
I bring it up because technically this architecture was identical to the transformer. I could drop my special neuron into literally any DL model out there and train.
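A minimal sketch of what such a decision-gate neuron might look like, assuming hard argmax routing over K candidate weight sets. All names (`GatedUnit`, `n_candidates`) and the exact routing rule are illustrative guesses, not the original implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class GatedUnit:
    """A unit that routes its input to one of K candidate 'computation neurons'."""

    def __init__(self, d_in, d_out, n_candidates=4):
        # K candidate weight matrices: the "computation neurons" to choose from
        self.W = rng.standard_normal((n_candidates, d_in, d_out)) * 0.1
        # A gating matrix scores each candidate for a given input
        self.gate = rng.standard_normal((n_candidates, d_in)) * 0.1

    def forward(self, x):
        # Score candidates and pick one (hard routing)
        scores = self.gate @ x          # shape: (K,)
        choice = int(np.argmax(scores))
        # Only the chosen candidate computes; in a modified backprop,
        # only self.W[choice] would receive gradient for this input.
        y = x @ self.W[choice]
        return y, choice

unit = GatedUnit(d_in=8, d_out=8)
x1 = rng.standard_normal(8)
x2 = rng.standard_normal(8)
y1, c1 = unit.forward(x1)
y2, c2 = unit.forward(x2)
# Different inputs may route through different computation neurons,
# so each forward pass can effectively see a different sub-model.
print(c1, c2, y1.shape)
```

Because the unit keeps the same input/output interface as an ordinary layer, something like it could in principle be dropped into any DL architecture, which is the point being made here.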
I believe this kind of advancement is what the next generation's models will be: not a change to the transformer or attention, but to the fundamental building blocks of all DL models.
It honestly does feel like attention solves part of the AGI equation well enough. It seems to have solved, or will soon solve, most short-term hard problems. Again, this is why CL is key: it's the time component no AI method across the board has ever solved.
For the same reason Yann LeCun and everyone else says language won’t lead to AGI, nothing will lead to AGI.
Yann says language models need to be updated with new language to describe new observation.
But that’s not just with language. That’s physics. We cannot solve going to Mars or anything without the process.
But spacetime is endless, and eventually some configuration of it will come along that the continual-learning machine has no ability to adapt to before it's destroyed.
We've lost the information of the past and merely store a simulation. We cannot see all of the future, just reduce it to simulation.
Eventually any autonomous thing hits a snag it cannot solve before its destruction, because in no reference frame can it know all the next best steps and know which past options to eliminate to simplify.
Energy-based models will streamline away nonessential state for generating media and making a robot lift a box, like Linux and software as we know it. But without 100% accurate data of the past and future (generation of which is impossible), whatever autonomous thing we build will eventually encounter a problem it never had time to solve and be smashed by the immutable churn of physics.
I just don’t get why we’re talking about cosmic scales but modern AI tech, and not a hypothetical ASI a thousand years out with an IQ of 2 million that would actually encounter these limits.