Great point! If I recall correctly, this team (well, nearly all the top teams from DAWNBench) took Page's code and wrestled it into the multi-GPU realm. I'm a sucker for simplicity, as much as is reasonable (this codebase currently does not use JIT or any custom kernels!!!), and for making sure that the average practitioner (like me) can do something workable without having to pay tons of money. My computing costs are currently $50 a month, i.e. the cost of Pro Colab. And we were able to break the single-GPU WR, and we're really close to pushing past all of the official multi-GPU submissions (old as they may be!).
I took David's work in a different direction and, I think, kept it true to the original spirit of things. Cycle times for experimentation are king in ML when it comes to the speed of research progress, regardless of what anyone else might tell you. Having tons of hardware may be really flashy and useful for the end product, but it's certainly not needed for much of the lo-fi, day-to-day stuff.
That said, the A100 is definitely a step up. It is under 2x, though, as we are basically only memory-and-slow-backprop-kernel limited now, not as much by the convolutions (which now are among the shorter operations). Running https://github.com/99991/cifar10-fast-simple on my end gave me 17.2 seconds, vs the 24 seconds that Dave reported on the V100 (though the lovely author of that repo, @99991, was able to get faster speeds on their personal A100 setup). So we're definitely in that weird regime where moving everything to massively scaled matrix multiplies when possible is preferred, and sometimes that's...tricky for a few of these operations.
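As a rough sanity check on the "under 2x" remark, here is a minimal sketch that just divides the two timings quoted above for cifar10-fast-simple. The constant and function names are made up for illustration, and it assumes the 24-second and 17.2-second runs are otherwise comparable, which the thread does not confirm.

```python
# Timings quoted in the comment above (seconds per training run).
V100_SECONDS = 24.0   # figure Dave reported on the V100
A100_SECONDS = 17.2   # figure measured on the A100 in this thread


def speedup(old_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new run is than the old one."""
    return old_seconds / new_seconds


ratio = speedup(V100_SECONDS, A100_SECONDS)
print(f"A100 speedup over V100: {ratio:.2f}x")
```

On these two numbers the ratio comes out around 1.4x, which fits the point that the run is no longer dominated by the convolutions.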
> may not be a fair comparison to newer hardware like the A100
In fairness, most of those entries use 4-8 V100s, to OP's single GPU. While the A100 is more powerful, I think just the "on a single GPU" framing is valuable.
I commented in the parent comment addressing this too, sorry that I topic-leaked! I'm cruising in my personal ML sabbatical on savings, so I'm sorta money-incentivized to be as thrifty as possible. Hence as noted before, right now I'm just at $50 a month!
I'm hoping this research is valuable to people in other areas, too. The concepts about order-of-operations, information flow, scaling, information-efficiency-at-high-throughputs, etc I think are applicable anywhere, given the right contexts. Though I have some sneaking huge suspicions that many of these laws (like the traditional scaling laws) only start popping up in importance and becoming more relevant as the ideally efficient architecture families are slowly approached through iterative optimization.
The cheapest entry on DAWNBench a few years ago cost $0.02, but that was on a single V100, and their best time was 45s on 8*V100. No idea how much the 10s run (the top time) cost, but it was also on 8*V100.
I think it's maybe something like 13.8 'credits' an hour on Colab, and you get 500 credits for $50 straight up, or $50 a month (I'm truly a sucker for simple flat pricing schemes with a natural cap on them, it's good for the overzealous network trainer's/developer's wallet! :D). So that's like, I dunno, $1.38 per hour for an A100 basically guaranteed (not bad at all! And the H100 is coming soon, I'd assume! :D)
If training takes ~9.91-9.96 seconds, and we ignore everything else in the process (assuming we have some kind of strange Elvish magical computers that don't require any spinup of any sorts)... then that's (9.91 to 9.96)/60/60 * 1.38 = 0.0037988 - 0.0038180 dollars per run, or .37988 - .38180 cents per run. The full setup including install from clone, data download, and network init, I'd estimate being lower bounded at maybe 1.2-1.3 cents per run or so with a good internet connection (but I'm not entirely sure about that! D:). Upper bound for a reasonably fast machine I think would be no more than 2 cents, clean start to finish for a single training run. Multiples for best-of (maybe not the safest idea), or better yet -- simple ensembling of the EMA-ed models could be upper bounded at likely no more than ~4 cents or so for 5 models, if I'm doing my math correctly.
That said, I think the training-only 'cents' figure in this case is likely 0.37988 - 0.38180 cents.
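To make the training-only arithmetic above easy to re-check, here is a small sketch of it. It assumes the figures quoted in the thread (roughly 13.8 credits an hour, 500 credits for $50, 9.91 to 9.96 seconds of training) and ignores setup, download, and spin-up time, exactly as the estimate above does. The constant and function names are hypothetical, not part of any Colab API.

```python
# All figures are the approximate ones quoted in the thread, not official pricing.
CREDITS_PER_HOUR = 13.8   # rough A100 credit burn rate on Colab
CREDITS_PER_PACK = 500    # credits bought per payment
DOLLARS_PER_PACK = 50.0   # price of that pack


def dollars_per_hour(credits_per_hour: float = CREDITS_PER_HOUR) -> float:
    """Convert a credit burn rate into dollars per hour."""
    return credits_per_hour * DOLLARS_PER_PACK / CREDITS_PER_PACK


def cost_per_run_cents(train_seconds: float) -> float:
    """Training-only cost of one run in cents (no setup or spin-up time)."""
    return train_seconds / 3600 * dollars_per_hour() * 100


for seconds in (9.91, 9.96):
    print(f"{seconds:.2f} s -> {cost_per_run_cents(seconds):.5f} cents")
```

This covers only the seconds spent training. The 1.2 to 2 cent estimates for a clean start-to-finish run depend on install and download time, which this sketch does not model.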
What's weird is that that does seem a bit steep considering it's 8 V100s for 45 seconds, and those were...pretty pricy at that time, I think? So maybe something is horribly wrong with my math! D:
Hope that helps! Great question, and many thanks for asking; happy to answer any follow-up questions you might have. This is a very interesting line of inquiry, and I haven't spent enough time developing it yet! :D
My best guess is that these are baseline runs from the Stanford DAWNBench team themselves in 2017, when (I'm assuming) the competition was first launched.
Oddly enough, this field was not as crowded as I remembered. Nowadays, I'm alone re-running a competition that many have left behind, much as many have left the '90s and its fashion and everything else behind. It's a bit of a lonely field, but these days basically any major change I make is a new world record.
I wanted to thank you for pointing that out, that's really interesting. I think I haven't given enough credit to the lovely entry by Chen Wang -- beating out the previous competitor, who used a P100, with their own small GTX 1080 (?!?!?!) in 35 minutes. Now, the absolutely bonkers thing here is that they seemed to achieve an accuracy of 95.29%, and accuracy at that level is very much nonlinear in difficulty. This, I've at least personally found, is truly a case of dabbing on the haters, as the children, or whatever they are, are saying these days.
Keep in mind that the last entry was from 2020 *(94.39% in 10 seconds)*, so it may not be a fair comparison to newer hardware like the A100.