
I'm a little confused then what the value of this demonstration is.

The expectation marches by half the distance towards the outer edge each time, so it becomes a soap bubble at a rate of 1/2^n

Which is nice and sorta cool but I'm not quite sure what the "whoa" factor here is. It's like one of those zip cinch tops for those bags. Sorta cool how it works on observation, really cool in detail, but also sorta...meh? At the same time?

Maybe I'm a bit jaded here? Admittedly, a continuous animation of this would be an improvement on the cool factor to me, personally.



Well, all I can tell you is that I'm having a good time discussing the nitty gritty of it in the comments here, and it made the gears of my mind turn in a pleasant way. It didn't make me go, "wow," but it did make me go, "hmm...."


Gotcha, yeah, that makes sense to me now.

Maybe the real article was the HN comments section that we made along the way....


> Maybe the real article was the HN comments section that we made along the way....

Nominating this comment for https://news.ycombinator.com/highlights


"Misunderstanding is the foundation of human discourse."—Jacques Lacan


It’s a kind of stone soup situation. With the OP providing the stone.


Speaking personally, a webpage with nothing beyond a cool or interesting sequence of images is already in the 90th percentile of good HN posts.


I think the sequence tends toward the edge like exp(-n), not 2^-n. The distance from the edge is a product of n I.I.D. variables, so the logarithm is a sum of n I.I.D. variables, and the central limit theorem gives us the result.

You can confirm in a python terminal (or anything else) that the product of n `random()`s decreases more rapidly than 2^-n.
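A minimal sketch of that check with only the standard library (variable names here are my own, not from the article):

```python
import math
import random

# The log of a product of n uniforms is a sum of n i.i.d. terms, and
# E[log(random())] = -1, so the log concentrates near -n, far below
# log(2**-n) = -n * log(2) ≈ -0.693 n.
n, trials = 100, 2000
logs = []
for _ in range(trials):
    x = 1.0
    for _ in range(n):
        x *= random.random()  # typical product ~e^-100, well above underflow
    logs.append(math.log(x))

mean_log = sum(logs) / trials
```

For n = 100 the typical product sits near e^-100 ≈ 3.7e-44, many orders of magnitude below 2^-100 ≈ 7.9e-31.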

So maybe there's some value in it after all :)


>The distance from the edge is a product of n I.I.D. variables

It is not a product of random variables; it is an iterated random variable. The output of one influences those higher in the chain. Redo your python code with rand(rand(rand(...))) not rand() * rand() * rand()...

The expectation of composition of functions is not the composition of expectations, so there is some work to do.

For uniform over [0,1] for the innermost item, it becomes an iterated integral, with value (1/2)^n.


I did redo it, and the distributions are identical. It's just a very heavily skewed distribution, so the expected value is not very intuitive. I still think even E[x] decreases faster than 2^-n though. See my other comments.

Edit - Actually, I think E[x] = 2^-n, but showing this numerically for large n is impractical because of just how skewed the distribution is. It is simultaneously true that E[iter_rand(N)] = 2^-N and E[log(iter_rand(N))] = -N. Fun stuff :)
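For small n, where the skew hasn't yet swamped the sampling error, E[iter_rand(n)] = 2^-n is easy to check numerically; a sketch (iter_rand here is the nested sampler under discussion, written out with the standard library):

```python
import random

def iter_rand(n):
    """Nested form: each draw is uniform on [0, previous draw]."""
    x = 1.0
    for _ in range(n):
        x = random.uniform(0.0, x)
    return x

n, trials = 3, 200000
# Sample mean should approach 2**-3 = 0.125; for large n the skew
# makes this estimate impractical, as noted above.
mean = sum(iter_rand(n) for _ in range(trials)) / trials
```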


It's not about intuition. I computed it directly from the integral definition for expected value. I posted it elsewhere in the thread. It proves it for all N.

Also, you keep making the incorrect claim that r() * r() * ... * r() is the same as r(r(...r())). They are not.


I agree that E[r() * r() * ...] = 2^-n but that's not the final word. E[log(r() * r() * ...)] = -n and Pr(r() * r() * ... > 2^-n) is very small. That's the intuition part.

The two distributions are identical, I checked numerically and they have the same summary statistics. Furthermore, r() * n is a uniform distribution from 0 to n so by induction we can prove the two are identical.


For anyone who cares and read this far (and if so, I'm honestly surprised) I made a colab that demonstrates all this: https://colab.research.google.com/drive/1xCZ24cLPNOqyJD0kUTO...

To wrap it up:

- X == r() * r() * ... == r(r(...))

- E[X] = 2^-n

- median[X] ~ 1/2 exp(1 - n) ?!

Anyway, I thought it was interesting - especially how the median never quite "catches up" to the value one might naively compute from the CLT. I think there's some subtlety I'm missing there.


random() * 5 is equivalent to random(5) (the former being the classic way to implement the latter). So why wouldn't random() * random() be equivalent to random(random())?


Multiplying by a constant commutes with expectation. Multiplying by a random variable does not. It may in some cases, but intuition is not the way to check it.


The claim here is not about expectations. random(X) is identically distributed to random() * X, even if X itself is a random variable, because the randomness in random() is independent from the randomness in X. Indeed, the way you would implement random(X), if you only had random() and X available, would be to compute random() * X.

So in particular, random(random()) is identically distributed to random() * random(). (To be clear, this is different from random()^2, as pointed out elsewhere.)
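A sketch of a numerical sanity check, comparing sample quantiles of the two constructions (variable names are mine):

```python
import random

trials = 200000
nested = sorted(random.uniform(0.0, random.random()) for _ in range(trials))
product = sorted(random.random() * random.random() for _ in range(trials))

# Identically distributed variables should have matching quantile functions.
max_diff = max(abs(nested[int(q * trials)] - product[int(q * trials)])
               for q in (0.1, 0.25, 0.5, 0.75, 0.9))
```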


As a quick sanity check,

  Mathematica 13.2.1 Kernel for Linux ARM (64-bit)
  Copyright 1988-2023 Wolfram Research, Inc.

  In[1]:= DI = Distributed;

  In[2]:= EV = Expectation;

  In[3]:= SS = Subscript;

  In[4]:= UD = UniformDistribution;

  In[5]:= f[1] :=
    EV[x, DI[x, UD[{0, 1}]]]

  In[6]:= f[n_Integer] :=
    EV[x, DI[x, UD[{0, f[n - 1]}]]]

  In[7]:= Block[{n = 42},
   EV[Product[SS[x, k], {k, 1, n}],
     Map[DI[SS[x, #], UD[{0, 1}]] &,
      Range[1, n]]]
    == Product[EV[x, DI[x, UD[{0, 1}]]], {k, 1, n}]
    == f[n]]

  Out[7]= True
where n = 42 ∈ ℕ is arbitrary.

Alas, Mathematica doesn't appear to support actual nested distributions of the form UD[{0,UD[{0,...}]}], so, e.g.,

  f[2]
  == EV[x_2,
       DI[x_2,
         UD[{0,
           EV[x_1,
             DI[x_1, UD[{0,1}]]]}]]]  
  == EV[x_2,
       DI[x_2,
          UD[{0,1/2}]]]
  == 1/4
which is trivial.


(As is the expectation of a uniform distribution in the first place. From the PDF for the uniform distribution (1/(b - a) for x ∈ [a,b], zero elsewhere) and the definition of EV as the integral of the product of the identity map and the PDF over the reals,

     EV[x, DI[x, UD[{a,b}]]]
  == Integrate[x * Piecewise[{
         {1/(b - a), a <= x <= b},
         {0, x < a || x > b}}],
       {x, -∞, ∞}]
  == Integrate[x * 1/(b - a), {x, a, b}]
  == ((b^2/2) - (a^2/2)) / (b - a)
  == (a + b)/2
so for a = 0,

  == b/2
no matter how many nested integrals it takes to define the real number b.)


Agreed, in this case they agree, but it's completely a numerical coincidence and not general. So people with a rough feeling that, because rand() * 5 equates to rand(5), rand(rand()) must equal rand() * rand(), should be admonished to check it extremely carefully. And the result is not true at the level of code.

In practice it's not true, unless extreme care is taken to ensure perfect uniformity, which is rare due to nonuniformity of IEEE floats. I've not seen a codebase in the wild that makes all the required pieces work. It's doable, I've written articles on it, but it's costly to do.

Another error in practice is that taking rand() * 5 is not uniform due to numerical representation. So these are not identical in practice.

Thus the best form to analyze, since it is trivially analyzable, is the nested form.

It's also why I wrote "may" in the previous comment.


It is true, both in the ideal mathematical sense, and also in the IEEE sense under the totally obvious implementation that most programmers would write without any special care:

    function rand(endpoint = 1) {
        return endpoint * Math.random();
    }
Here rand(rand()) would evaluate to literally the same number as rand() * rand() given the same state of the underlying generator. Someone who follows this chain of reasoning is perfectly correct to consider it intuitively obvious.

And rand() * rand() is a great form for analysis, since it allows us to take advantage of E[XY] = E[X]E[Y] for independent X, Y to compute the mean very quickly.

Clearly you followed a different chain of reasoning, and that’s fine; maybe you have a much more complicated generator in mind that takes more care to ensure ideal rounding at the limits of floating-point precision, and that’s fine too. But let’s not admonish anyone for making correct statements.


>It is true, both in the ideal mathematical sense, and also in the IEEE sense under the totally obvious implementation .....

That implementation is not uniform under IEEE 754. Here [1] is but one of hundreds of papers on the subject. To quote "Drawing a floating-point number uniformly at random from an interval [a, b) is usually performed by a location-scale transformation of some floating-point number drawn uniformly from [0, 1). Due to the weak properties of floating-point arithmetic, such a transformation cannot ensure respect of the bounds, uniformity or spatial equidistributivity.".

This is trivial to understand. Suppose you truly had an IEEE uniform on [0,1), in the sense that every possible float representation could occur and an interval [a,b) (for representable a,b) is hit with probability b-a. If you then multiply by, say, 0.99, then by the pigeonhole principle some of the original values coalesce and some do not, so you are no longer uniform: some values are more likely to occur than others. This also happens when you multiply by 5, or 7, or any number: as the ranges stretch, due to finite precision, they do not map to ranges of the correct length. The new range [0,5) has a different set of representable floats, and their precision moves in predictable but non-uniform ways, so the result is no longer uniform. So it is not true that rand() * N is uniform for IEEE 754 numbers.

Another simple way to see it: when you multiply by floats > 1, then some of the low bits get lost for some floats, so you cannot begin to hope the results are uniform. You're truncating/rounding, which loses information, so you by necessity only have an approximation.

And note that very few (if any) popular languages even manage to get [0,1) uniform (more details on 15 languages are in [1]). Their conclusion, Table II, to the question of whether the language rng provides uniformity (and spatial equidistributivity, needed to scale as the above), is no for all 15 languages. And this is because most do what you claim works. Even getting the [0,1) uniform as a start is extremely tricky and nuanced. The usual "uniform random int in [0,2^p-1] then divide by 2^p (or 2^p-1)" fails to be uniform most of the time and (as the paper shows) for pretty much every language implementation out there.

This issue is important for places needing super careful accuracy like physics sims, weather sims, many monte carlo algorithms where you don't want bad numerics screwing with your results, and in all of those places, they use custom, properly written methods to create such numbers. The general idea is that for getting a uniform in [a,b) you need to sample enough bits for the range, then carefully bit construct the floating point value - it cannot be transformed (easily) from smaller or larger ranges.

Table V lists various methods to draw floats in [a,b). Your method is listed (as I claimed above) as not being uniform (in fact, it fails all 5 criteria in the table, the only method of the 7 listed to score so poorly).

Here's [2] a paper showing that the common method of using a PRNG then dividing by some number to get a "uniform" in [0,1) is not uniform. So most implementations don't even start with a uniform distribution; this is a well-known problem in random number generation. And the rand(rand()) == rand() * rand() argument relies on uniform distributions.

So please stop claiming things until you have checked them carefully. Fuzzy feelings about what you guess is true is no substitute for actually doing the work to check, and doubling down when someone points you on the right path is bizarre to me.

[1] https://hal.science/hal-03282794v4/document

[2] https://hal.science/hal-02427338/document


I’m fully aware of how IEEE 754 works, and that there is—to quote my own comment—“a much more complicated generator that takes more care to ensure ideal rounding at the limits of floating-point precision”. If you’re looking for somebody clueless to argue at, you’ll need to go find someone else.

But these details aren’t relevant to this conversation:

• We’re not simulating the weather, we’re drawing dots in a circle—and anyone who cares about the dots being shifted from true uniformity by tiny fractions of the radius of an electron is going to have objections to any conceivable computerized representation of the numbers involved.

• It’s quite reasonable to implicitly treat this discussion as a question within the domain of true mathematical reals, and quite unreasonable to chastise someone as incorrect and careless for doing so.

• It’s still false that “rand(rand()) == rand() * rand() relies on uniform distributions”. This identity between distributions holds with the naïve multiplicative implementation of rand() and any underlying random generator, even a blatantly skewed one, in either the mathematical or IEEE reals, exactly. (And with a more precise IEEE generator, it will hold in as close of an approximate sense as one can expect any mathematical identity to hold when translated to IEEE, which from my perspective is just fine in this context. My only real interest in even acknowledging IEEE here is to support the point that commenters above were correct to consider the identity intuitively obvious, even when you try to throw in that well-actually.)


> We’re not simulating the weather, we’re drawing dots in a circle

Ah, so we don't care about mathematical rigor, yet we're discussing a proof of the pure mathematical behavior?

> It’s quite reasonable to implicitly treat this discussion as a question within the domain of true mathematical reals,

So we do care about mathematical rigor?

> It’s still false that “rand(rand()) == rand() * rand() relies on uniform distributions”.

Consider a distribution A(r) uniform on [0,r) except it never returns 1/4. This is a perfectly valid distribution, extremely close to being the same (in the mathematical sense) as the uniform distribution itself.

A(A(1)) will never be 1/4. A(1) * A(1) may be 1/4. Thus the claim needing uniformity (or a proof for whatever distribution you want) requires careful checking. And if you don't like this distribution, you can fiddle with ever larger sets (first in cardinality, then in Lebesgue measure) where you tweak things, and you'll soon discover that without uniformity, you cannot make the simplification. And I demonstrated that the actual PRNGs are not uniform, so much care is needed to analyze. (They're not even continuous in domain or range.)

Yes, if you define the rng circularly, then you can make the assumption, but then you cannot make the claim of 1/2 per term, since they are not uniform. If you define the rng more carefully, then they often don't commute (maybe never?).

For example, defining the rng in another reasonable (yet still naive) fashion to keep more bits:

    float rnd(float max) {
        int i = int_rand(N);  // uniform int in 0..N-1; try weird N like N = 12345678
        float f = i * max;    // rounds once here...
        return f / N;         // ...and again here
    }
this does not satisfy rnd(rnd()) == rnd()*rnd() in general, as you can check easily in code.

Thus, to make the claim you can swap them requires for the mathematical part true uniformity (my claim way up above) and for the algorithm sense it requires knowing details about the underlying algorithm. Simply having a rnd(max) blackbox algorithm is not enough.


Of course we care about mathematical rigor.

> Yes, if you define the rng circularly, then you can make the assumption,

We’re agreed then.

> but then you cannot make the claim of 1/2 per term, since they are not uniform.

As mathematicians, we’re allowed to make multi-step arguments. The first step rand(rand()) = rand() * rand() holds for any distribution that scales multiplicatively with the argument, as we’ve agreed. The second step E[rand() * rand()] = E[rand()] * E[rand()] holds for any two independent distributions with finite expectation. The third step E[rand()] = 1/2 holds for this particular distribution. An ideal mathematical rand() satisfies all three of these properties, so this argument that E[rand(rand())] = 1/4 is valid in the mathematical reals. (And it’s approximately valid in the IEEE reals, which is all one typically asks of the IEEE reals.)


How do you mean?

I'm going based on the propagation of expectations through the system.

The expectation of a uniform distribution is half the bound. Unless there's some math shenanigans going on, I believe the expected value of a draw whose bound is itself drawn from a distribution should be 1/2 * 1/2 * 1 in this case.

Of course it's not a bound but right now I'm having a hard time determining otherwise.

Is there a mathematical motivation for e^-n here? That seems an odd choice since the uniform distribution is 0-1 bounded generally. However I could understand if it emerges in a few conditions.


e shows up because we're doing an iterated product (of random variables, but still). If you look at the central limit theorem, sum(log(rand())) tends to N * E[log(rand())] for large N. Well, what's E[log(rand())]? -1!

Like I said, it's fairly easy to test this in a python terminal. I encourage anyone who doubts me to try it :)
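For the curious, the reason E[log(rand())] = -1: the integral of ln x over [0, 1] is [x ln x - x] evaluated from 0 to 1, which is -1 (equivalently, -log(rand()) is exponentially distributed with mean 1). A quick sketch of the terminal check:

```python
import math
import random

trials = 200000
est = sum(math.log(random.random()) for _ in range(trials)) / trials  # ≈ -1
```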


Alright, I tried it in the terminal and the numbers confirmed my earlier math: a mean of .5 -> .25 -> .125 as N grows. I chained the python uniform function like the code in the originally linked post does and averaged the results.

The snark to me and the other commenters is a bit unnecessary in either case. I'm not really sure where you're getting a natural log or an iterated product of random variables in this instance.

Could you perhaps show where you're transforming uniform(uniform(uniform(uniform(0, 1)))) into the math you're showing above? I'm trying to follow along here but am having difficulty connecting your line of reasoning to the problem at hand.


Not trying to be snarky by any means. I'm sorry if you interpreted it that way.

I don't think the difference will show up for small N. This is an asymptotics thing. Try it for N = 100, that's what I did. For example:

  >>> np.prod(np.random.random(100))
  5.469939152265385e-43

  >>> 1 / np.e**100
  3.7200759760208555e-44

  >>> 1 / 2**100
  7.888609052210118e-31

The underlying thing here is that random(random()) in this case is the same as random() * random(). So random(random(random(...))) is the same as random() * random() * random() and then the analysis goes on. And sure, random() * random() has a mean close to 1/4. But the dynamics change as N becomes large.

Edit - and just in case you doubt whether random() * random() * ... is a valid way of doing this, I also just checked the laborious way of doing it:

  >>> def foo(n):
  ...   result = 1.0
  ...   for _ in range(n):
  ...     result = np.random.uniform(0, result)
  ...   return result
  ...
  >>> foo(100)
  1.4267531652344414e-46
  >>> foo(100)
  7.852496730908136e-49
  >>> foo(100)
  1.3216070221780724e-41


> I'm sorry if you interpreted it that way.

is probably a good marker that it is time for me to take my bow in this conversation; however, for an alternative approach I recommend SideQuark's comment on iterated random variables vs chains of multiplied random variables, which have different characteristics when defined as a sequence.


I checked his comment, and I think he's incorrect about the approaches being different. Using my function `foo`, which does the iterative approach, we can compare the distributions and they're essentially identical.

  >>> np.mean([foo(100) for _ in range(100000)])
  3.258425093913613e-33

  >>> np.mean([np.prod(np.random.random(100)) for _ in range(100000)])
  8.814732867008917e-33
(There's quite a bit of variance on a log scale, so 3 vs 8 is not a huge difference. I re-ran this and got varying numbers with both approaches. But the iterated code is very slow...)

Note that the mean is actually quite a bit closer to 2^-100 even though the vast majority of numbers from either distribution fall below 2^-100. Even so, the mean for both is approximately a factor of 100 less than 2^-100. Suspicious! Although I think we've both burned enough time on this.


Yeah, the problem with your sampling experiment (as your suspicious observation makes it sound like you’ve started to realize) is that this distribution has a very narrow tail to 1 that you won’t hit by accident but nonetheless contributes very significantly to the computation of the mean.

You correctly computed that the mean of log(foo(n)) is exactly −n, and the median of foo(n) indeed shrinks like e^−n (more specifically like e^(1/3 − n), I think), but the mean of foo(n) is exactly 2^−n.
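Both halves of that are checkable at moderate n, where the tail isn't hopeless; a sketch using the same nested foo as upthread (the n - 1/3 median approximation for Gamma(n, 1) is my own addition):

```python
import math
import random

def foo(n):
    x = 1.0
    for _ in range(n):
        x = random.uniform(0.0, x)
    return x

# -log(foo(n)) is a sum of n Exp(1) draws, i.e. Gamma(n, 1), whose median
# is roughly n - 1/3, hence median(foo(n)) ≈ e^(1/3 - n).
n, trials = 20, 20001
samples = sorted(foo(n) for _ in range(trials))
log_median = math.log(samples[trials // 2])  # near 1/3 - n ≈ -19.67
```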


If this were stackoverflow, I'd mark this as an answer.

Also answers my question up above a bit as well, just from a different approach.

Many thanks for clarifying here and elsewhere! :)

HN is a lovely, curious, and interesting place.


Except that your logarithms can have any consistent base, including 2.

So I don't think using a logarithm to introduce a special dependence on e is warranted in this case.


Well, E[log2(rand())] is not the same as E[log(rand())], and therein lies the difference. See my sibling comment.

Like I said, you can check this in a python terminal.
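To spell the base point out: E[log2(rand())] = -1/ln 2 ≈ -1.4427, while E[ln(rand())] = -1, so whichever base you pick, the typical product comes out to 2^(-1.4427 n) = e^(-n). A sketch:

```python
import math
import random

trials = 200000
est2 = sum(math.log2(random.random()) for _ in range(trials)) / trials  # ≈ -1.4427
est_e = sum(math.log(random.random()) for _ in range(trials)) / trials  # ≈ -1
```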



