> However long (programs run) they never seem to become "long-running". Most app...

igouy · 2026-04-28T19:35:50 1777404950

> … whether long running is 3 hours or 3 days…

JavaOne long ago, there would be mixed messages: both "So a benchmark that ends in less than 10 sec probably does not measure anything interesting." and in blog post benchmarks "100000000 hashes in 5.745 secs … 100000000 primes in 1.548 secs"

(Goldilocks would know.)

> … different machine workloads…

I'm happy to accept that you didn't mean no parallel programs.

> … very hard to generalise …

Indeed.

https://www.larcenists.org/Twobit/bmcrock.temp.html

pron · 2026-04-30T14:25:17 1777559117

> JavaOne long ago, there would be mixed messages

I didn't say that short-running benchmarks don't measure anything interesting, only that they don't say much about long running programs, where the same mechanisms can exhibit very different behaviour.

igouy · 2026-04-30T17:39:22 1777570762

Seems like the benchmarks game didn't say that anything interesting about long running programs was measured? And didn't say that "interesting" memory management was measured. And didn't say…

I suppose when you write "because it compares different algorithms" you didn't say that there were no comparisons based on the same algorithm.

We've certainly not attempted to prove that these measurements, of a few tiny programs, are somehow representative of the performance of any real-world applications — not known — and in-any-case Benchmarks are a crock.

pron · 2026-04-30T20:00:01 1777579201

The problem with benchmarks isn't that they themselves are lying. Benchmarks always tell the truth - about themselves. The problem is in the conclusions people draw from them. In the nineties benchmarks were still a little extrapolatable because we could say X is slow and Y is fast, as many operations had an intrinsic cost. These days, almost no benchmark (certainly microbenchmark) is extrapolatable to anything besides itself. Is a branch slow or fast? That depends on what the program did before and what it intends to do later. Is memory access slow or fast? Ditto. Function call? Allocation? They're all so context-dependent now that the only use of benchmarks of some mechanism is for the authors of the mechanism who know exactly how it works, what exactly is being measured, and what can be extrapolated from that.

If I write a malloc benchmark I may think, oh, this measures the cost of malloc/free. In reality, it only measures the cost for a program whose concurrency, allocation/deallocation patterns, and duration match exactly what I wrote, and bear little resemblance to the numbers I'd get if any of those were different.

So I'm not saying that the Benchmark Game is lying. It is telling the truth about how long those programs ran. It's just that what we can generalise from those benchmarks is even less than what we can from more "interesting" ones, but given that even that is close to nothing anyway, maybe it doesn't matter.

igouy · 2026-04-30T20:48:10 1777582090

It is telling the truth about how long those programs ran, period.

There seem to be people who find those brute facts surprising in themselves.

pron · 2026-04-30T23:54:55 1777593295

All benchmarks tell the truth about themselves. That has never been what makes benchmarks good or bad. The worst and best benchmarks ever made are both truthful about their results.

But a good benchmark suite is one that covers a variety of different problems and/or programs similar to a significant portion of production software. The Benchmark Game is neither, plus it's confusing because it often compares things that measure the sophistication of the algorithm while making it seem it measures something about a language (you don't need to be deceitful to confuse). So no, I don't think it's a good benchmark suite at all.

igouy · 2026-05-01T01:29:02 1777598942

> The Benchmark Game is neither.

And makes no claim to be.

Here's something that could reasonably make those claims:

https://dl.acm.org/doi/10.1145/3669940.3707217

Oh! It's only Java.

pron · 2026-05-01T11:08:38 1777633718

> And makes no claim to be.

I know. I don't understand why you think I have a problem with the site's honesty. It's a poor benchmark suite, and it admits it is. We're in agreement.

> Here's something that could reasonably make those claims

I'm not familiar with this paper, but you seem to think I was complaining about false claims, which I wasn't. Benchmarks are problematic these days because results no longer generalise as they did a couple of decades ago, but some benchmarks are of higher quality than others (again, I'm not talking about what they say they are but about what they actually are) by at least covering a wider and possibly more relevant set of use cases, and by offering comparisons that are less confusing.

igouy · 2026-05-01T17:08:44 1777655324

> I'm not familiar with this paper…

It presents "DaCapo Chopin, a major release of the DaCapo benchmark suite for Java". It's a benchmark suite. It says so.

> I'm not talking about what they say they are but about what they actually are

“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean—neither more nor less.”

“The question is,” said Alice, “whether you can make words mean so many different things.”

“The question is,” said Humpty Dumpty, “which is to be master—that's all.”

pron · 2026-05-01T21:28:27 1777670907

I don't understand what you're trying to say. I said that the Benchmark Game is not a good benchmark suite in the sense that it does not measure language speed differences since 1. it compares different algorithms, and 2. it doesn't cover some of the most important use-cases that languages/runtimes optimise for [1]. That's all. I'm not saying it's deceitful, I'm saying it's just not a good comparison of language speeds. Are you agreeing or disagreeing?

[1]: In particular, Java was designed to overcome some of the biggest performance issues of low-level languages that have plagued a large number of applications: memory management when objects are of varying sizes and lifetimes, concurrency (especially lock-free data structures), and dynamic dispatch, which grows in use as applications grow in size and complexity. Not a single one of these is covered in the Benchmark Game, which focuses on small, very regular, batch workloads, the very things that low-level languages have always been good at, and none of the areas where the performance of low-level languages has traditionally (and to this day) suffered and which led to different compiler and memory management designs.

igouy · 2026-05-02T19:06:16 1777748776

> it's just not a good comparison of language speeds

It's not that the benchmarks game is not a good benchmark suite, it isn't a benchmark suite.

It's not that the benchmarks game is not a good comparison of language speeds, it's that comparison of "language speeds" is so under-specified as-to-be wishful thinking.

> Java was designed to…

"… build software for the next generation of consumer electronics – think smart toasters, interactive TVs, and other futuristic gadgets." Things change.

> … the very things that low-level languages have always been good at…

Which is why there are people who find those kind-of Java programs being in-any-way comparable, somewhat surprising.

pron · 2026-05-02T21:54:35 1777758875

> It's not that the benchmarks game is not a good benchmark suite, it isn't a benchmark suite.

OK, but I was responding to someone who did consider it to be a benchmark suite. As long as we agree it's not a good benchmark suite whatever it considers itself to be, we're in agreement.

> It's not that the benchmarks game is not a good comparison of language speeds, it's that comparison of "language speeds" is so under-specified as-to-be wishful thinking.

With that I completely agree. But if you group results by language, that's exactly what you're inviting, and if your suite of benchmarks or whatever you want to call it covered a wider range of problems, that point could be more easily seen. Let's say that the combination of grouping results by language and covering only a very narrow (and niche) set of problems that also happens to be the sweet spot of some languages that have other significant performance failings in other use cases doesn't exactly help people get the right impression.

igouy · 2026-05-02T23:07:24 1777763244

> As long as we agree…

Close enough.

> … help people get the right impression.

The target audience wonder "Which programming language is fastest?"

A table or chart sorted by elapsed time is the answer they expect.

The target audience have various (perhaps un-examined) ideas about the question.

The sources and measurements can be a way to examine and discuss some of those ideas.

pron · 2026-05-03T18:01:27 1777831287

Ok, but if the measurements were wider in scope they could at least offer a more interesting, well-rounded, and perhaps even relevant basis for discussion (even if the other flaws, which are harder to fix, remained).

igouy · 2026-05-03T21:50:02 1777845002

Once upon a time, I might have imagined that would be so. Now it seems more like squeezing a lemon, there's hardly any more after the first squeeze.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

pron · 2026-05-05T15:51:53 1777996313

I think you're referring to the less important point I made. Correcting for apples-to-apples is harder and less valuable. Having more domain coverage is easier and more valuable (especially since the current coverage is so narrow and largely irrelevant to most software).

BTW, what we do is compare our suite of micro-benchmarks to our (much smaller) suite of macro-benchmarks. This way we get at least some sense of how relevant the microbenchmarks are (i.e. we're looking at the correlation of the deltas). Some microbenchmarks are more correlated with the macrobenchmarks than others. If an optimisation helps some microbenchmarks that we think are not representative of many programs and doesn't help with any macrobenchmark - we take it out.
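
As a rough illustration of "looking at the correlation of the deltas", here is a minimal sketch. It assumes each candidate optimisation yields one relative run-time change per benchmark. All names and numbers are made up for illustration, and this is not a description of OpenJDK's actual tooling.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: relative change in run time (negative = faster) observed
# for five candidate optimisations, one value per optimisation.
macro_deltas = [-0.04, -0.01, 0.00, -0.06, 0.02]
micro_deltas = {
    "micro_a": [-0.30, -0.10, 0.01, -0.45, 0.12],   # tends to move with the macro suite
    "micro_b": [-0.50, 0.20, -0.40, 0.10, -0.35],   # only weakly related to it
}

# How closely does each microbenchmark track the macrobenchmarks?
correlations = {
    name: pearson(deltas, macro_deltas) for name, deltas in micro_deltas.items()
}

for name, r in sorted(correlations.items()):
    print(f"{name}: r = {r:+.2f}")
```

Under these made-up numbers `micro_a` tracks the macro suite closely and `micro_b` does not, so a win that shows up only on `micro_b` would deserve more suspicion.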

Just to give an example, we may want to measure some optimisation that helps some allocation pattern. Sometimes it turns out that if that pattern is diluted by other allocation patterns the program does for other tasks, the advantage is completely erased. Some optimisations in free-list allocators are particularly susceptible to this: if your program allocates only in this specific way, it will be super fast. If, in addition, there are some sporadic allocations that follow a different pattern, then after an hour you'll see performance start to drop.
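
The dilution effect can be sketched with a toy model. It assumes a first-fit, address-ordered free list with no coalescing, which is far simpler than any production allocator. The function names, block sizes, and round counts are all hypothetical, and "cost" here is just the number of free-list entries inspected.

```python
from bisect import insort

def first_fit(free_list, size):
    """Carve `size` units out of the first free block that is large enough.

    Returns (offset, entries_inspected). Toy model: blocks are never coalesced.
    """
    for index, (offset, length) in enumerate(free_list):
        if length >= size:
            if length == size:
                del free_list[index]
            else:
                free_list[index] = (offset + size, length - size)
            return offset, index + 1
    raise MemoryError("no free block large enough")

def simulate(rounds, sporadic_every=None, blocks=64):
    """Total free-list entries inspected by the hot 64-unit allocations."""
    free_list = [(i * 64, 64) for i in range(blocks)]
    total_probes = 0
    for r in range(1, rounds + 1):
        offset, probes = first_fit(free_list, 64)   # the hot, regular pattern
        total_probes += probes
        insort(free_list, (offset, 64))             # freed straight away
        if sporadic_every and r % sporadic_every == 0:
            # Sporadic long-lived allocation: it is never freed, and it leaves
            # a 16-unit remnant that no later request in this model can use.
            first_fit(free_list, 48)
    return total_probes

uniform = simulate(500)                    # only the hot pattern
mixed = simulate(500, sporadic_every=10)   # hot pattern plus sporadic requests

print("uniform:", uniform, "mixed:", mixed)
```

In this model the uniform run always finds a fit at the first entry. In the mixed run every sporadic request leaves behind one more unusable remnant that each later hot allocation has to step over, so the cost per allocation keeps growing the longer the program runs.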

igouy · 2026-05-05T18:21:50 1778005310

> apples-to-apples

Hopefully, some of the target audience might try to confirm that programs are what they think of as "comparable".

> Having more domain coverage is easier and more valuable…

So where are the examples of that being done? (It's been decades.)

pron · 2026-05-07T16:53:42 1778172822

> So where are the examples of that being done?

Whenever people want to get valuable information. As I said, we in OpenJDK have a couple hundred benchmarks, some macro, many micro, which are meant to give a decent coverage of the things that affect performance.

If a website wants to group results by languages, it should think about performance from the perspective of how languages work (which includes compilers, linkers, and runtimes).

For example, what compiler/linker optimisations are done can depend a lot on whether the program is in a single compilation unit or multiple (and in the case of C and C++ - it does).

On the runtime front, think about memory management. These mechanisms often have different behaviour depending on whether the objects are of similar size or not, whether they're allocated and freed by multiple threads or a single one, and whether the heap is "young" and unfragmented or old and fragmented.

Another area in runtimes is data structures. Are they single-threaded or concurrent, and if concurrent, how do they behave under low and high contention?

Some mechanisms, in all of these levels, have great performance under some conditions and not so great performance in others, and sometimes where they perform great is actually a condition that is encountered less often in real programs.

If you're asking what multi-lingual benchmark suites offer good coverage - I don't know. But that we don't have good information doesn't mean that it's good to offer bad information. Imagine that in American presidential elections there were no national polls and no polls in most states. Would having a poll only in Alabama or only in California offer good insight into who's likely to win? Probably not, because such a poll offers a very partial view of the situation. Is it better than nothing? Maybe, but not by much, because the outcome in Alabama and California is easy to predict without any polls, so it's only helpful in the most extreme cases.

My point is that bad information is bad information, and if people don't understand how different languages behave under different conditions (e.g. that the optimisations the compiler does can differ depending on whether the program is in a single file or not) then they can get the wrong impression. Imagine that someone has no idea about the regional polarisation in the US, and you tell them, well, there are 50 states, but since we don't have polls for all of them, here's the poll for Alabama. Is that information helpful at all?

In any event, any increase in the coverage makes the information a little better, and because the audience may not know whether multiple benchmarks exercise the same or different behaviour in the language, it's the role of the website to pick problems that trigger the different codepaths in the languages' infrastructure. Otherwise, there's the wrong impression of variety, like saying we don't poll only in Alabama but also in Mississippi. Or it's like testing the structure of a bridge by driving a car across it, and then doing it with ten different car models. Testing a bridge does require variety, but the different car models are not what triggers different conditions for the bridge.

igouy · 2026-05-08T17:09:44 1778260184

> If you're asking what multi-lingual benchmark suites offer good coverage - I don't know.

In which case, given that the target audience wonder "Which programming language is fastest?", there doesn't seem to be support for your claim that: "Having more domain coverage is easier and more valuable…".

> Is it better than nothing? Maybe, but not by much…

The benchmarks game: provisional and modest.

pron · 2026-05-08T20:47:12 1778273232

> There doesn't seem to be support for your claim that: "Having more domain coverage is easier and more valuable…".

Since programming languages specifically optimise differently for the different conditions I listed above, the "support" for my "claim" is that it's obviously true. No one who implements languages or runtimes will dispute it.

But I don't understand the logical implication. The existence or nonexistence of good information says nothing about the value of the information we do have. If all you know is how much cash I have in my wallet, the fact that no one has ever published how much money I have in my bank account doesn't make the information you have more relevant as an estimate of my wealth. That information is irrelevant regardless of whether or not you have access to the relevant information. That information being available is not what's needed to "support" my "claim" that what you know is irrelevant. All you need to know is how people keep their money.

> The benchmarks game: provisional and modest.

I would say it's more like a website comparing US presidential candidates through polls only in Alabama. A more appropriate description than "provisional and modest" would be that it doesn't actually give us valuable information about the candidates' chances.

If people know how US elections work, such information could be put in context, but I don't know how many programmers understand how languages and runtimes optimise performance. Merely saying it's partial/provisional/modest is insufficient to give people the appropriate context.

igouy · 2026-05-08T23:07:30 1778281650

Sorry, I'm just not interested in wading through your analogies.