> It’s similar to computing NaN-mean or NaN-sum for an array of all NaN values (which returns 0).
Why would you want to do this rather than validating understanding of the data before computing on this?
> For purposes of updating the accumulator, the mean of an empty array is perfectly well-defined: it should add nothing to the accumulator (add 0).
That's not a mean, though, that's a quirk of how you decide to (incorrectly) compute a mean. What's the point of such a program? It should refuse to compute when it's given no values--that's clearly a category error.
Otherwise, you're not using a mean, you're using a mean-or-0-when-lacking-a-mean. Might as well not call it a mean at all so other people can read your code.
Yes, this is pedantic, but this kind of subtle changing meaning of terms is exactly what leads to bugs. Name your functions accurately.
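For concreteness, here is a minimal sketch of the accumulator pattern the quoted comment describes (function names and structure are my own, not from the thread): chunks are folded into a running (total, count) pair, so an empty chunk contributes nothing, i.e. "adds 0", while the final mean is still computed from the true count and can still refuse to answer when no data was ever seen.

```python
def fold_chunk(acc, chunk):
    # Fold one chunk into a running (total, count) accumulator.
    # An empty chunk adds 0 to the total and 0 to the count.
    total, count = acc
    return (total + sum(chunk), count + len(chunk))

def running_mean(chunks):
    total, count = (0.0, 0)
    for chunk in chunks:
        total, count = fold_chunk((total, count), chunk)
    if count == 0:
        # Only the *final* mean over no data at all is a category error.
        raise ValueError("mean of no data is undefined")
    return total / count

print(running_mean([[1.0, 2.0], [], [3.0]]))  # empty middle chunk adds nothing -> 2.0
```

Under this framing, both commenters can be right: the per-chunk "empty mean contributes 0" is just the additive identity of the fold, while the overall mean remains undefined on truly empty input.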
From your response I can tell you don’t do much numerical linear algebra work.
Consider needing to vectorize a column-wise mean calculation across the columns of a large data matrix (where NaN values are sparse but appreciable).
The NaNs might be perfectly reasonable, expected pieces of data, but you still want to understand the distribution of the non-NaN data. Adding extra work to filter them out first might be hugely costly, or even outright wrong, depending on what other operations the NaN data will be passed to and how those operations natively handle NaN.
And simple columnar summary stats are just the tip of the iceberg. It gets much more complicated.
By no means is the solution of “diagnose why there are NaNs ahead of time and preprocess them accordingly” even remotely realistic in most use cases. This is why libraries like pandas, numpy and scipy for instance provide specific nan-ignoring functions or function parameters.
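This is exactly what NumPy's `np.nanmean` provides in a single vectorized call (a small sketch with made-up data; in this example none of the columns is entirely NaN):

```python
import numpy as np

# Sketch: column-wise means of a matrix where scattered NaNs are
# expected data points, not errors to be diagnosed and pre-filtered.
X = np.array([[1.0,    2.0,    np.nan],
              [np.nan, 4.0,    6.0],
              [3.0,    6.0,    np.nan]])

col_means = np.nanmean(X, axis=0)  # ignores NaNs independently per column
print(col_means)  # [2. 4. 6.]
```

No preprocessing pass over the data is needed; the NaN handling happens inside the one vectorized reduction.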
That's all fine and dandy, but conceptually it's wrong to say the average of an empty array is 0, and that can and will lead to wrong results in a variety of cases. I'm sure you can think of a lot of these cases yourself. I think in the history of computer science we programmers have found that there are a lot of convenience shortcuts that make sense in many cases but bite us in others. Implicit is fast and fun, but it's nice to have your seatbelt on when the car crashes. Going back to the average case: if you want an average function that returns 0 on empty arrays, fine. But that's not the average function, and you shouldn't call it that. Names matter; you should call it averageOrZero or something like that.
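A sketch of the naming distinction being argued for (function names are illustrative, following the comment's `averageOrZero` suggestion):

```python
def average(xs):
    # The textbook mean: undefined on empty input, so refuse to compute.
    if not xs:
        raise ValueError("average of an empty sequence is undefined")
    return sum(xs) / len(xs)

def average_or_zero(xs):
    # The convenience variant, named so readers see the empty-input behavior.
    return sum(xs) / len(xs) if xs else 0.0

print(average([1.0, 2.0, 3.0]))   # 2.0
print(average_or_zero([]))        # 0.0
```

The point is purely about honest naming: both functions are fine to have, but only one of them is "the" average.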
Why is it conceptually wrong to say the average of an empty array is zero? My undergrad degree is in pure math and my grad degree is in mathematical statistics, and I've never heard it suggested that taking the mean of an empty array to be zero is "conceptually wrong."
You bring up the history of CS, but even there you have debates about what convention to use when defining 1/0, for the sake of function totality in theorem provers.
There's no aspect of the pure-math derivation of number systems, on up through vector spaces, that definitively makes a zero mean for an empty array ill-defined. Whatever choice you make (positive infinity, undefined, 0, or any other finite value, etc.), any such choice is purely down to convention that depends precisely on your use case.
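The 1/0 convention isn't hypothetical: Lean, for example, makes division total by defining `x / 0 = 0`, precisely so that every function application is well-defined inside the prover (a small Lean 4 sketch):

```lean
-- Lean 4's Nat division is total: x / 0 is defined to be 0 by convention,
-- so expressions involving division never need a side condition to typecheck.
#eval (1 : Nat) / 0   -- 0

example : (1 : Nat) / 0 = 0 := by decide
```

Nothing about this convention claims 1/0 "really is" 0; it's a pragmatic choice about totality, the same kind of choice as returning 0 for an empty mean.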
> Why is it conceptually wrong to say the average of an empty array is zero?
It’s not conceptually wrong, it just means the “mean” you’re referring to calculates a different value than the “mean” we’re taught in school. So, underlying assumptions about the differences in “mean” should be communicated where it’s used.
Sure, I agree they should be communicated. Like, in the docs for “standard” mean functions, and not pushed into “specialized” mean functions, since needing this particular convention is not remotely special, and is rudimentary and expected in 99% of linear algebra and data analytics work, which are the largest drivers of these types of statistical functions.