That article shows that it takes about 50x as long to train GPT-3 with Intel's offering vs NVIDIA's. At least in the current environment, if you are training LLMs, I think almost no amount of cost savings can justify that.
That 50x holds only if you can afford one thousand NVIDIA H100s.
There can be no more than a handful of companies in the entire world that could afford such a price (tens of millions of dollars).
Compared with a still extremely expensive cluster of 64 NVIDIA H100s, the speed difference shrinks to only two to three times, and paying several times less for the entire training run becomes very attractive.
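Quick back-of-the-envelope in Python (the 3x price ratio and the amortize-over-runs model are my assumptions, not numbers from the article):

```python
# Back-of-the-envelope: cost attributable to one training run scales with
# cluster price * wall-clock time, and wall-clock time scales with 1/speed.
# The 3x price ratio below is an assumption, not a figure from the article.

def relative_run_cost(price_ratio: float, speed_ratio: float) -> float:
    """Per-run cost of cluster A relative to cluster B, where A costs
    `price_ratio` times as much and trains `speed_ratio` times faster."""
    return price_ratio / speed_ratio

PRICE_RATIO = 3.0  # assumed: H100 cluster ~3x the price of a same-size Gaudi2 cluster

# At the ~50x speed gap of the 1000-H100 scale, H100 wins decisively:
print(relative_run_cost(PRICE_RATIO, 50.0))  # 0.06 -> H100 ~17x cheaper per run

# At the ~2-3x gap of a 64-accelerator cluster, it flips:
print(relative_run_cost(PRICE_RATIO, 2.5))   # 1.2 -> Gaudi2 ~17% cheaper per run
```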
The problem is not merely having that much money available. Such a big expense only makes sense for a company where spending that amount would bring in hundreds of millions of dollars of additional revenue.
I doubt that any of the companies that have already spent such amounts have recovered even a small part of their expenses. More likely they are betting on future revenues, but it remains to be seen who will succeed in achieving that.
Kinda; companies of that scale regularly spend more than that on (often random) R&D.
Sure, if there is a plausible ROI, they'd have no issue dropping that much money (actually far more). Revenues for Fortune 500s are in the tens of billions anyway, and it wouldn't be hard to argue that a random AI project could increase revenue by a couple percent, or cut costs by a couple percent, which would more than provide that ROI.
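To spell out that arithmetic (all three numbers below are made-up assumptions for illustration, not company data):

```python
# Toy ROI check for the "couple percent of tens of billions" argument.
# Every figure here is an assumption chosen for illustration.

revenue = 30e9        # assumed annual revenue of a Fortune 500
uplift = 0.02         # assumed "couple percent" revenue increase
project_cost = 50e6   # assumed all-in AI project cost (tens of millions)

extra_revenue = revenue * uplift
roi = (extra_revenue - project_cost) / project_cost

print(f"extra revenue: ${extra_revenue / 1e6:.0f}M")  # extra revenue: $600M
print(f"ROI: {roi:.1f}x")                             # ROI: 11.0x
```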
Their biggest issue is usually having anyone in leadership with enough of a clue to even propose something plausible, let alone put together a team to give it a serious go.
The funny thing is that this fact has been shown inadvertently by NVIDIA:
https://www.servethehome.com/nvidia-shows-intel-gaudi2-is-4x...