Is there a reason they don't run a child exploitation detection ML engine (PhotoDNA) on the media they index? I'd guess they already run an ML engine to "describe" an image while they index it, so this additional step shouldn't be too expensive?
PhotoDNA isn't ML-based. My understanding is that it's essentially a perceptual-hash lookup -- something like a Bloom filter for images. It'll tell you if your image is (close to) one in the set of known-bad images, but it won't recognize newly created images.
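A minimal sketch of that idea. PhotoDNA's actual algorithm is proprietary, so this uses a toy average-hash instead, but it shows the key property: small edits to a known image still match, while a genuinely new image never will.

```python
# Toy perceptual-hash lookup in the spirit of PhotoDNA. The real
# algorithm is proprietary; this illustrative average-hash just
# demonstrates "known-bad set" matching.

def average_hash(pixels):
    """pixels: 8x8 grid of grayscale values (0-255) -> 64-bit int.
    Each bit records whether that pixel is above the image's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

def is_known_bad(pixels, bad_hashes, threshold=5):
    """Match against a database of hashes of known images.
    Small edits (re-encoding, brightness shifts) change few or no
    bits, so a distance threshold still matches -- but a new image
    won't be anywhere near the database."""
    h = average_hash(pixels)
    return any(hamming(h, bad) <= threshold for bad in bad_hashes)

# Demo: a known image, a lightly altered copy, and a new image.
known = [[10 * i + j for j in range(8)] for i in range(8)]
altered = [[v + 2 for v in row] for row in known]  # brightness shift
new = [[(i * j * 37) % 256 for j in range(8)] for i in range(8)]

db = {average_hash(known)}
print(is_known_bad(known, db))    # True
print(is_known_bad(altered, db))  # True (hash unchanged by uniform shift)
print(is_known_bad(new, db))      # False
```

This is also why it can't catch new material: there's no classification happening, only a (fuzzy) membership test against hashes someone has already added to the database.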
Based on the redactions in the images in the article, those searches are returning ~75% CP. I guess that could still be a small portion of what's out there, but that's horrifying.
Say porn accounts for 1% of images, and child porn accounts for a one-in-a-million fraction of that. That's 10 per 1 billion images.
If you figure the filters exclude 90% of child porn images, that’s 1 per billion which will show up in search results.
I can’t find a good estimate of the total number of images, but YouTube serves 5 billion videos a day and gets 1800 minutes of video uploaded per minute.
So if we estimate a trillion photos, then we’d expect around a thousand child porn images to make it past the filters, and Bing to be able to return a few pages of 75% child porn when we accidentally stumble on a term in that category.
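Spelling out the arithmetic above (every input is a guess from this thread, not a measured figure):

```python
# Fermi estimate from the comments above; all inputs are rough guesses.
total_images = 10**12                 # assume ~1 trillion indexed images
porn         = total_images // 100    # porn is ~1% of images
cp           = porn // 10**6          # CP is ~1 in a million of that
past_filters = cp // 10               # filters catch ~90%, miss ~10%

print(cp)            # 10000 CP images indexed
print(past_filters)  # 1000 slipping past the filters
```

So even with 90% filter recall, a corpus that size would leave on the order of a thousand such images reachable, which is consistent with a few pages of bad results for an unlucky query.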
There's a slightly bizarre situation here where search algorithms are working against themselves, yeah. Presumably anything spiked by PhotoDNA isn't returned at all - which frees up those spots for the next-most-relevant result. And the more effective Bing's indexing is, well, the worse that result will be...
Since PhotoDNA is basically a known-bad tool, it presumably can't win the battle unless turnover is fairly low. Penalizing or hiding sites with many PhotoDNA hits (or perhaps a high percentage of hits) might do better by targeting concentrators, but that would depend on what sort of sites are serving this stuff. I assume they have to be fairly small/scattered to stay operational, which in turn makes it harder to predict what sort of content they have.
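The site-penalty idea might look something like this. Everything here is hypothetical: the function name, thresholds, and scoring scheme are made up to illustrate "demote domains whose crawled images keep hitting the blocklist."

```python
# Hypothetical site-level demotion for a ranker. A site whose crawled
# images frequently match the known-bad hash set gets its rank scaled
# down or zeroed. All names and thresholds are illustrative.

def site_penalty(hits, total, min_sample=20, cutoff=0.05):
    """Return a rank multiplier in [0, 1] for a site.

    hits  -- crawled images matching the known-bad hash set
    total -- total crawled images from the site
    """
    if total < min_sample:
        return 1.0          # too few images to judge the site
    ratio = hits / total
    if ratio >= cutoff:
        return 0.0          # concentrator: effectively delist
    # Otherwise scale down smoothly, up to a 50% rank reduction.
    return 1.0 - (ratio / cutoff) * 0.5
```

Whether this helps depends, as noted, on how concentrated the hosting is: against many small, scattered sites with few hits each, a per-site ratio never crosses the cutoff.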
(And despite the article, it doesn't seem clear that Google has solved this problem, so much as bypassed it with a whitelist approach to nudity in general.)
From the article: “We index everything, as does Google, and we do the best job we can of screening it. We use a combination of PhotoDNA and human moderation but that doesn’t get us to perfect every time. We’re committed to getting better all the time.”