Is there a reason they don't run a child exploitation detection ML engine (PhotoDNA) on the media they index? I'd guess they already run an ML engine to "describe" an image while they index it, so this additional step shouldn't be too expensive?
PhotoDNA isn't ML-based. My understanding is that it's essentially a perceptual-hash lookup -- something like a Bloom filter for images. It'll tell you if your image is (close to) one in the set of known-bad images, but it won't recognize newly created images.
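A minimal sketch of that idea. PhotoDNA's actual algorithm is proprietary, so this uses a toy average-hash instead, but it shows the key property: small edits to a known image still match, while a genuinely new image never will.

```python
# Toy perceptual-hash lookup in the spirit of PhotoDNA. The real
# algorithm is proprietary; this illustrative average-hash just
# demonstrates "known-bad set" matching.

def average_hash(pixels):
    """pixels: 8x8 grid of grayscale values (0-255) -> 64-bit int.
    Each bit records whether that pixel is above the image's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

def is_known_bad(pixels, bad_hashes, threshold=5):
    """Match against a database of hashes of known images.
    Small edits (re-encoding, brightness shifts) change few or no
    bits, so a distance threshold still matches -- but a new image
    won't be anywhere near the database."""
    h = average_hash(pixels)
    return any(hamming(h, bad) <= threshold for bad in bad_hashes)

# Demo: a known image, a lightly altered copy, and a new image.
known = [[10 * i + j for j in range(8)] for i in range(8)]
altered = [[v + 2 for v in row] for row in known]  # brightness shift
new = [[(i * j * 37) % 256 for j in range(8)] for i in range(8)]

db = {average_hash(known)}
print(is_known_bad(known, db))    # True
print(is_known_bad(altered, db))  # True (hash unchanged by uniform shift)
print(is_known_bad(new, db))      # False
```

This is also why it can't catch new material: there's no classification happening, only a (fuzzy) membership test against hashes someone has already added to the database.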
Based on the redactions in the images in the article, those searches are returning ~75% CP. I guess that could still be a small portion of what's out there, but that's horrifying.
Say porn accounts for 1% of images, and child porn accounts for a one-in-a-million fraction of that. That's 10 per 1 billion images.
If you figure the filters exclude 90% of child porn images, that’s 1 per billion which will show up in search results.
I can’t find a good estimate of the total number of images, but YouTube serves 5 billion videos a day and gets 1800 minutes of video uploaded per minute.
So if we estimate a trillion photos, then we’d expect around a thousand child porn images to make it past the filters, and Bing to be able to return a few pages of 75% child porn when we accidentally stumble on a term in that category.
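Spelling out the arithmetic above (every input is a guess from this thread, not a measured figure):

```python
# Fermi estimate from the comments above; all inputs are rough guesses.
total_images = 10**12                 # assume ~1 trillion indexed images
porn         = total_images // 100    # porn is ~1% of images
cp           = porn // 10**6          # CP is ~1 in a million of that
past_filters = cp // 10               # filters catch ~90%, miss ~10%

print(cp)            # 10000 CP images indexed
print(past_filters)  # 1000 slipping past the filters
```

So even with 90% filter recall, a corpus that size would leave on the order of a thousand such images reachable, which is consistent with a few pages of bad results for an unlucky query.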
There's a slightly bizarre situation here where search algorithms are working against themselves, yeah. Presumably anything spiked by PhotoDNA isn't returned at all - which frees up those spots for the next-most-relevant result. And the more effective Bing's indexing is, well, the worse that result will be...
Since PhotoDNA is basically a known-bad tool, it presumably can't win the battle unless turnover is fairly low. Penalizing or hiding sites with many PhotoDNA hits (or perhaps a high percentage of hits) might do better by targeting concentrators, but that would depend on what sort of sites are serving this stuff. I assume they have to be fairly small/scattered to stay operational, which in turn makes it harder to predict what sort of content they have.
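The site-penalty idea might look something like this. Everything here is hypothetical: the function name, thresholds, and scoring scheme are made up to illustrate "demote domains whose crawled images keep hitting the blocklist."

```python
# Hypothetical site-level demotion for a ranker. A site whose crawled
# images frequently match the known-bad hash set gets its rank scaled
# down or zeroed. All names and thresholds are illustrative.

def site_penalty(hits, total, min_sample=20, cutoff=0.05):
    """Return a rank multiplier in [0, 1] for a site.

    hits  -- crawled images matching the known-bad hash set
    total -- total crawled images from the site
    """
    if total < min_sample:
        return 1.0          # too few images to judge the site
    ratio = hits / total
    if ratio >= cutoff:
        return 0.0          # concentrator: effectively delist
    # Otherwise scale down smoothly, up to a 50% rank reduction.
    return 1.0 - (ratio / cutoff) * 0.5
```

Whether this helps depends, as noted, on how concentrated the hosting is: against many small, scattered sites with few hits each, a per-site ratio never crosses the cutoff.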
(And despite the article, it doesn't seem clear that Google has solved this problem, so much as bypassed it with a whitelist approach to nudity in general.)
From the article: “We index everything, as does Google, and we do the best job we can of screening it. We use a combination of PhotoDNA and human moderation but that doesn’t get us to perfect every time. We’re committed to getting better all the time.”