> It’s similar to computing NaN-mean or NaN-sum for an array of all NaN values (which returns 0).
Why would you want to do this rather than validating understanding of the data before computing on this?
> For purposes of updating the accumulator, the mean of an empty array is perfectly well-defined: it should add nothing to the accumulator (add 0).
That's not a mean, though, that's a quirk of how you decide to (incorrectly) compute a mean. What's the point of such a program? It should refuse to compute when it's given no values--that's clearly a category error.
Otherwise, you're not using a mean, you're using a mean-or-0-when-lacking-a-mean. Might as well not call it a mean at all so other people can read your code.
Yes, this is pedantic, but this kind of subtle changing meaning of terms is exactly what leads to bugs. Name your functions accurately.
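For concreteness, here is a minimal sketch of the accumulator pattern the quoted comment describes (function names and structure are my own, not from the thread): chunks are folded into a running (total, count) pair, so an empty chunk contributes nothing, i.e. "adds 0", while the final mean is still computed from the true count and can still refuse to answer when no data was ever seen.

```python
def fold_chunk(acc, chunk):
    # Fold one chunk into a running (total, count) accumulator.
    # An empty chunk adds 0 to the total and 0 to the count.
    total, count = acc
    return (total + sum(chunk), count + len(chunk))

def running_mean(chunks):
    total, count = (0.0, 0)
    for chunk in chunks:
        total, count = fold_chunk((total, count), chunk)
    if count == 0:
        # Only the *final* mean over no data at all is a category error.
        raise ValueError("mean of no data is undefined")
    return total / count

print(running_mean([[1.0, 2.0], [], [3.0]]))  # empty middle chunk adds nothing -> 2.0
```

Under this framing, both commenters can be right: the per-chunk "empty mean contributes 0" is just the additive identity of the fold, while the overall mean remains undefined on truly empty input.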
From your response I can tell you don’t do much numerical linear algebra work.
Consider needing to vectorize a column-wise mean calculation across the columns of a large data matrix (where NaN values are sparse but appreciable).
The NaNs might be perfectly reasonable, expected pieces of data, but you still want to understand the distribution of the non-NaN data. Adding extra work to filter them out first might be hugely costly, or even outright wrong, depending on what other operations the NaN data will be passed to and how those operations natively handle NaN.
And simple columnar summary stats are just the tip of the iceberg. It gets much more complicated.
By no means is the solution of “diagnose why there are NaNs ahead of time and preprocess them accordingly” even remotely realistic in most use cases. This is why libraries like pandas, numpy and scipy for instance provide specific nan-ignoring functions or function parameters.
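This is exactly what NumPy's `np.nanmean` provides in a single vectorized call (a small sketch with made-up data; in this example none of the columns is entirely NaN):

```python
import numpy as np

# Sketch: column-wise means of a matrix where scattered NaNs are
# expected data points, not errors to be diagnosed and pre-filtered.
X = np.array([[1.0,    2.0,    np.nan],
              [np.nan, 4.0,    6.0],
              [3.0,    6.0,    np.nan]])

col_means = np.nanmean(X, axis=0)  # ignores NaNs independently per column
print(col_means)  # [2. 4. 6.]
```

No preprocessing pass over the data is needed; the NaN handling happens inside the one vectorized reduction.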
That's all fine and dandy, but conceptually it's wrong to say the average of an empty array is 0, and that can and will lead to wrong results in a variety of cases. I'm sure you can think of a lot of these cases yourself. I think in the history of computer science we programmers have found that there are a lot of convenience shortcuts that make sense in many cases but bite us in others. Implicit is fast and fun, but it's nice to have your seatbelt on when the car crashes. Going back to the average case: if you want an average function that returns 0 on empty arrays, fine. But that's not the average function, and you shouldn't call it that. Names matter; you should call it averageOrZero or something like that.
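A sketch of the naming distinction being argued for (function names are illustrative, following the comment's `averageOrZero` suggestion):

```python
def average(xs):
    # The textbook mean: undefined on empty input, so refuse to compute.
    if not xs:
        raise ValueError("average of an empty sequence is undefined")
    return sum(xs) / len(xs)

def average_or_zero(xs):
    # The convenience variant, named so readers see the empty-input behavior.
    return sum(xs) / len(xs) if xs else 0.0

print(average([1.0, 2.0, 3.0]))   # 2.0
print(average_or_zero([]))        # 0.0
```

The point is purely about honest naming: both functions are fine to have, but only one of them is "the" average.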
Why is it conceptually wrong to say the average of an empty array is zero? My undergrad degree is in pure math and my grad degree is in mathematical statistics, and I've never heard it suggested that taking the mean of an empty array to be zero is "conceptually wrong."
You bring up the history of CS, but even there you have debates about what convention to use when defining 1/0, for the sake of function totality in theorem provers.
There's no aspect of the pure-math derivation of number systems, on up through vector spaces, that definitively makes a zero mean for an empty array ill-defined. Whatever choice you make (positive infinity, undefined, 0, or any other finite value, etc.), any such choice is purely down to convention that depends precisely on your use case.
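The 1/0 convention isn't hypothetical: Lean, for example, makes division total by defining `x / 0 = 0`, precisely so that every function application is well-defined inside the prover (a small Lean 4 sketch):

```lean
-- Lean 4's Nat division is total: x / 0 is defined to be 0 by convention,
-- so expressions involving division never need a side condition to typecheck.
#eval (1 : Nat) / 0   -- 0

example : (1 : Nat) / 0 = 0 := by decide
```

Nothing about this convention claims 1/0 "really is" 0; it's a pragmatic choice about totality, the same kind of choice as returning 0 for an empty mean.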
> Why is it conceptually wrong to say the average of an empty array is zero?
It’s not conceptually wrong, it just means the “mean” you’re referring to calculates a different value than the “mean” we’re taught in school. So, underlying assumptions about the differences in “mean” should be communicated where it’s used.
Sure, I agree they should be communicated. Like, in the docs for “standard” mean functions, and not pushed into “specialized” mean functions, since needing this particular convention is not remotely special, and is rudimentary and expected in 99% of linear algebra and data analytics work, which are the largest drivers of these types of statistical functions.