As someone who deals in diagnostic tests all day, I can tell you nobody is diagnosed on a single test. There's a similar problem I once worked out: the odds of getting into medical school. I'm sure this is elementary statistics and has someone's name on it, but I'll call it the acceptance problem: how many medical schools do you have to apply to in order to have a 90% chance of getting into at least one?
For any school there is an acceptance quotient:
Q = (acceptances sent out)/(number of applications received)
For any given student applying to some schools 1 through n, the goal is at least one acceptance. Applying to more schools, mathematically, can't hurt in the closed case (neglecting social engineering, time spent on applications, etc.), so the chance of acceptance, Ca, approaches 1 with every new application in the following fashion:

Ca = (1 - (1 - Q1)(1 - Q2)(1 - Q3)...(1 - Qn)) × 100%

There's a visual and some worked examples here: http://nielsolson.us/MedSchool/
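The "chance of at least one acceptance" calculation is easy to sketch. The acceptance quotients below are made up purely for illustration:

```python
def chance_of_acceptance(quotients):
    """Chance of at least one acceptance: 1 minus the chance that
    every school says no. Assumes each school decides independently."""
    miss_all = 1.0
    for q in quotients:
        miss_all *= (1.0 - q)
    return 1.0 - miss_all

# Hypothetical example: ten schools, each accepting 20% of applicants
ca = chance_of_acceptance([0.20] * 10)
print(f"{ca * 100:.2f}%")  # → 89.26%
```

Note the diminishing returns: each additional application shrinks the "all rejections" product, but it never reaches zero.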
Similarly, if the sensitivity of a test is 90%, the test identifies 9 of every 10 people who have the disease. If I administer n different tests, each with sensitivity S, then the chance of accurately diagnosing the disease, Cd, goes up with each additional test but never reaches 1:

Cd = (1 - (1 - S1)(1 - S2)(1 - S3)...(1 - Sn)) × 100%
So let's say you are doing genetic testing, and any one gene is 1% sensitive for the disease. If you tested 300 genes, you could be no more than about 95% certain of the diagnosis:

(1 - (1 - 0.01)^300) × 100% = 95.09591...%
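That percentage can be checked directly; it is just the combined-sensitivity formula with 300 identical tests, each 1% sensitive:

```python
# Combined sensitivity of 300 independent tests, each 1% sensitive:
# the disease is missed only if all 300 tests miss it.
combined = 1 - (1 - 0.01) ** 300
print(f"{combined * 100:.5f}%")  # → 95.09591%
```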
Now, if your genetic tests were 5% sensitive, your panel would need 59 tests to reach 95%.
If your test were 50% sensitive, a panel of just 5 tests would put you over 95% (1 - 0.5^5 ≈ 96.9%, while 4 tests give only 93.8%).
Of course, if some of the tests are negative, things get more complicated. One of the problems with these data sets is that we have no idea how predictive they are. You can't even calculate the predictive power of the database; there simply haven't been enough events. Then we get into surrogate measures (how many were positive on tests 1 through n and were found to have razor blades in their homes, etc.).
The claim that these databases can't be effective isn't true. They could be. P might also equal NP. Whether the hypothesis is strictly true or not, the vague but real set of 'practical concerns' suggests that the hypothesis is sufficiently difficult to test as to render the null hypothesis the de facto assumption until proven otherwise.
The assumption of independence underlying the math you're doing is probably outright false (albeit mathematically simple). As one example, imagine the case where all n schools use the exact same admissions criteria. Unless you're applying with a random application to each school, your math is shot; you will either get into every school or none of them. I won't even go into the genetic independence issue.
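The correlated-admissions point can be illustrated with a quick simulation. Everything here is hypothetical; the two extremes are fully independent decisions versus all n schools sharing one decision:

```python
import random

def at_least_one(n, q, correlated, rng):
    """One simulated applicant applying to n schools with acceptance
    quotient q. If correlated, all schools share a single yes/no draw."""
    if correlated:
        return rng.random() < q                       # one decision for all n schools
    return any(rng.random() < q for _ in range(n))    # n independent decisions

rng = random.Random(0)
trials = 100_000
for correlated in (False, True):
    hits = sum(at_least_one(10, 0.2, correlated, rng) for _ in range(trials))
    label = "correlated" if correlated else "independent"
    print(f"{label}: {hits / trials:.3f}")
```

With independent schools the acceptance rate lands near 0.89, matching 1 - 0.8^10; with perfectly correlated schools it stays near 0.20, the single-school rate, no matter how many applications you send.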
Every mathematical model is a false representation of reality. The predictive accuracy must be validated by experiment. And using the link I reference, you'll see there's actually some pretty strong data to start from in this case.