Question for the Big Data folks: where do sampling and statistics fit into this, if at all? Unless you're summing to the penny, why would you ever need to aggregate a large volume of data (the population) rather than a small volume of data (a sample)? I'm not saying there isn't a reason. I just don't know what it is. Any thoughts from people who have genuine experience in this realm?
Good question. I am not an expert but here’s my take from my time in this space.
Big data folks typically do sampling and such, but that doesn't eliminate the need for a big data environment in which the sampling can happen. Just as a compiler can't predict every branch at compile time (sorry, VLIW!) and CPUs therefore need dynamic branch predictors, you can't decide what or how to sample in advance of seeing the actual dataset.
First, in a large dataset there are many ways a sample can fail to represent the whole. The real world is complex, and you sample away that complexity at your peril. You will often find you want to go back to the original raw dataset.
Second, in a large organization, sampling alone presumes you only care about org-level outcomes. But in a large org there may be individuals who care about the non-aggregated data relevant to their own small domain, and there can be thousands of such individuals. You do sample the whole, but you also have to equip people at each level of abstraction to do the same. The cardinality of your data will in some way reflect the cardinality of your organization, and you can't just sample that away.
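To make the representativeness point concrete, here's a toy simulation (made-up numbers, plain Python): a 1% uniform sample of a million rows usually catches only a handful of rows from a segment that makes up 0.05% of the data, so anything you compute within that segment is noisy or simply missing.

    import random

    # Toy illustration with made-up numbers: a uniform sample can badly
    # under-represent a rare but important segment.
    random.seed(0)

    N = 1_000_000
    RARE_SHARE = 0.0005      # the rare segment is 0.05% of the population
    SAMPLE_RATE = 0.01       # a "reasonable looking" 1% uniform sample

    # 1 = row belongs to the rare segment, 0 = everything else
    population = [1 if random.random() < RARE_SHARE else 0 for _ in range(N)]
    sample = [row for row in population if random.random() < SAMPLE_RATE]

    true_rare = sum(population)            # roughly 500 rare rows in the population
    sampled_rare = sum(sample)             # often only a handful make it into the sample
    estimate = sampled_rare / SAMPLE_RATE  # scaled-up estimate of the rare count

    print(f"true rare rows: {true_rare}, rare rows in sample: {sampled_rare}, "
          f"estimate: {estimate:.0f}")
    # The scaled-up total is roughly right on average, but any analysis *within*
    # the rare segment now rests on a handful of rows, or on none at all.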
I've done it both ways. Look into Data Sketches also if you want to see applications -- there's a rough sketch example at the end of this comment.
The pros:
-- Samples are small, so queries against them are fast most of the time.
-- Samples can be used opportunistically, e.g. in queries against the full dataset.
-- You can run more complex queries that can't be pre-aggregated (though not always accurately; rough illustration after this list).
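As a rough illustration of the "fast but not always accurate" trade-off (my own sketch, assuming plain uniform row sampling and made-up numbers): you scale the sample aggregate back up by the inverse sampling rate and report a confidence interval alongside it, which is exactly the error you then have to explain to users.

    import math
    import random

    # Rough sketch, assuming simple uniform row sampling and made-up numbers:
    # scale the sample aggregate up to a population estimate and attach a
    # ~95% interval -- the "error" users then have to be taught to read.
    random.seed(1)

    population = [random.lognormvariate(3.0, 1.0) for _ in range(1_000_000)]  # e.g. order values
    SAMPLE_RATE = 0.01
    sample = [x for x in population if random.random() < SAMPLE_RATE]

    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)

    est_total = sum(sample) / SAMPLE_RATE              # scale up by inverse sampling rate
    se_total = math.sqrt(var / n) * (n / SAMPLE_RATE)  # standard error of that total
    print(f"estimated total: {est_total:,.0f} +/- {1.96 * se_total:,.0f}, "
          f"true total: {sum(population):,.0f}")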
The cons:
-- Requires planning up front about what to sample and what types of queries you're answering; sudden requirements changes are difficult.
-- Data skew makes uniform sampling a bad choice; you often need stratified or weighted sampling instead (see the sketch after this list).
-- Requires ETL pipelines to do the sampling as new data comes in, including re-running large backfills if the data or the sampling scheme changes.
-- Requires explaining sampling error to users.
-- Data sketches can be particularly inflexible; they're usually good at one metric but can't adapt to new ones. Queries also have to be mapped into set operations.
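On the skew point, a toy version of the usual fix (illustrative only; keys and sizes made up) is to sample within each stratum instead of uniformly across everything, e.g. a per-key reservoir:

    import random
    from collections import defaultdict

    # Illustrative only (keys and sizes made up): when one key dominates the
    # data, per-stratum sampling keeps the small strata visible instead of
    # letting a uniform sample be swamped by the big one.
    random.seed(2)

    rows = ([("whale_customer", random.random()) for _ in range(990_000)] +
            [("small_customer", random.random()) for _ in range(10_000)])

    def stratified_sample(rows, per_stratum=1_000):
        """Keep up to `per_stratum` rows per key using reservoir sampling."""
        reservoirs = defaultdict(list)
        seen = defaultdict(int)
        for key, value in rows:
            seen[key] += 1
            if len(reservoirs[key]) < per_stratum:
                reservoirs[key].append(value)
            else:
                # Replace an existing element with decreasing probability so
                # every row of the stratum has an equal chance of being kept.
                j = random.randrange(seen[key])
                if j < per_stratum:
                    reservoirs[key][j] = value
        return reservoirs, seen

    reservoirs, seen = stratified_sample(rows)
    for key, sample in reservoirs.items():
        # Scale each stratum back up by its own sampling fraction when estimating.
        est_sum = sum(sample) * (seen[key] / len(sample))
        print(key, f"rows={seen[key]:,}", f"sample={len(sample)}", f"est_sum={est_sum:,.0f}")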
These problems can be mitigated with proper management tools; I have built frameworks for this type of application before -- fixed dashboards with slow-changing requirements are relatively easy to handle.
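And to show what "good at one metric, queries mapped into set operations" means, here is a toy K-Minimum-Values (KMV) distinct-count sketch (illustrative only; real libraries such as Apache DataSketches are far more sophisticated and operate in a streaming fashion):

    import hashlib

    # Toy K-Minimum-Values sketch, illustrative only: it answers exactly one
    # question (approximate distinct count), and combining two datasets is a
    # set-style union of their sketches rather than a normal query.
    K = 256

    def _hash01(item: str) -> float:
        """Hash an item to a pseudo-uniform float in [0, 1)."""
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        return h / 2**64

    def kmv_sketch(items, k=K):
        """Keep only the k smallest hash values seen."""
        return sorted({_hash01(x) for x in items})[:k]

    def kmv_estimate(sketch, k=K):
        """Estimate the distinct count as (k - 1) / (k-th smallest hash)."""
        if len(sketch) < k:
            return len(sketch)  # fewer than k distinct items seen: count is exact
        return (k - 1) / sketch[k - 1]

    def kmv_union(a, b, k=K):
        """The union of two sketches is just 'merge and keep the k smallest'."""
        return sorted(set(a) | set(b))[:k]

    us = kmv_sketch(f"user-{i}" for i in range(50_000))
    eu = kmv_sketch(f"user-{i}" for i in range(40_000, 120_000))
    print("US distinct    ~", round(kmv_estimate(us)))                   # true answer: 50,000
    print("EU distinct    ~", round(kmv_estimate(eu)))                   # true answer: 80,000
    print("US+EU distinct ~", round(kmv_estimate(kmv_union(us, eu))))    # true answer: 120,000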
Sampling is almost always heavily used here, because it's ace. However, if you need to produce row-level predictions then you can't sample, since by definition you need the row-level data.
That said, you can aggregate user-level info into just the features you need, which will get you a looooonnnnggggg way.
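A minimal sketch of what that aggregation can look like (toy data, made-up column names):

    from collections import defaultdict
    from datetime import date

    # Hypothetical raw event rows (user_id, event_date, amount_spent) -- the
    # row-level data you'd need if you were predicting per event.
    events = [
        ("u1", date(2023, 1, 3), 12.0),
        ("u1", date(2023, 1, 9), 30.0),
        ("u2", date(2023, 1, 4),  5.0),
        ("u2", date(2023, 1, 4),  7.5),
        ("u2", date(2023, 1, 20), 2.5),
    ]

    # Collapse the events into a few per-user features; the model only ever
    # sees this much smaller table, not the raw rows.
    per_user = defaultdict(list)
    for user_id, day, amount in events:
        per_user[user_id].append((day, amount))

    features = {
        user_id: {
            "n_events": len(rows),
            "total_spend": sum(a for _, a in rows),
            "avg_spend": sum(a for _, a in rows) / len(rows),
            "days_active": len({d for d, _ in rows}),
            "last_seen": max(d for d, _ in rows).isoformat(),
        }
        for user_id, rows in per_user.items()
    }

    for user_id, feats in features.items():
        print(user_id, feats)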