
Is there a reason they don't run a child-exploitation detection engine (PhotoDNA) on the media they index? I'd guess they already run an ML engine to "describe" an image while they index it, so this additional step shouldn't be too expensive?


PhotoDNA isn't ML-based. My understanding is that it's essentially a Bloom filter for images -- it'll tell you if your image is in the set of known-bad images, but it won't recognize newly created images.
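To make the "Bloom filter for images" analogy concrete: PhotoDNA itself is proprietary, but a toy stand-in using a simple difference hash (dHash) shows the same limitation. The hash function and the 8x9 grayscale grid here are illustrative choices, not PhotoDNA's actual algorithm; like the real thing, it only flags images whose fingerprint is already in a known-bad set, and a newly created image will never match.

```python
def dhash(pixels):
    """pixels: 8 rows x 9 columns of grayscale values (0-255).
    Each bit records whether a pixel is brighter than its right neighbor,
    giving a 64-bit fingerprint that survives small edits like resizing."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

# A "known-bad" set seeded with one example image (a brightening gradient).
known_bad = {dhash([[(r * 9 + c) % 256 for c in range(9)] for r in range(8)])}

def is_known_bad(pixels):
    # Set membership only: no model, no ability to recognize novel content.
    return dhash(pixels) in known_bad
```

A never-before-seen image (say, a darkening gradient) hashes to a value that isn't in the set, so it sails through; that's the gap ML-based classifiers try to close.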


What makes you think they don't, and that what's found here isn't just the small portion that makes it past the filters?


Based on the redactions in the images in the article, those searches are returning ~75% CP. I guess that could still be a small portion of what's out there, but it's horrifying.


Hm. Here’s my napkin math:

Porn accounts for 1% of images. Child porn accounts for a one-in-a-million fraction of that. So that's 10 per billion images.

If you figure the filters exclude 90% of child porn images, that's 1 per billion that will show up in search results.

I can’t find a good estimate of total number of images, but YouTube shows 5 billion videos a day and gets 1800 minutes of video per minute.

So if we estimate a trillion photos, then we’d expect around a thousand child porn images to make it past the filters, and Bing to be able to return a few pages of 75% child porn when we accidentally stumble on a term in that category.
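Sanity-checking the arithmetic above; every input here is the parent comment's guess, not a measured figure:

```python
# All inputs are napkin-math guesses from the comment above.
total_images  = 1e12   # guessed total indexed photos
porn_fraction = 0.01   # "porn accounts for 1% of images"
cp_fraction   = 1e-6   # "child porn is a one-in-a-million fraction of that"
filter_miss   = 0.10   # filters catch 90%, so 10% slip through

past_filters = total_images * porn_fraction * cp_fraction * filter_miss
print(past_filters)  # 1000.0 -> "around a thousand" images past the filters
```

The search-term concentration mentioned below is what turns a one-in-a-billion base rate into a 75%-bad results page.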


Oh yeah, didn’t think about the concentrating factor of the search term.


There's a slightly bizarre situation here where search algorithms are working against themselves, yeah. Presumably anything spiked by PhotoDNA isn't returned at all - which frees up those spots for the next-most-relevant result. And the more effective Bing's indexing is, well, the worse that result will be...

Since PhotoDNA is basically a known-bad lookup, it presumably can't win the battle unless turnover of new material is fairly low. Penalizing or hiding sites with many PhotoDNA hits (or perhaps a high percentage of hits) might do better by targeting concentrators, but that would depend on what sort of sites are serving this stuff. I assume they have to be fairly small and scattered to stay operational, which in turn makes it harder to predict what sort of content they host.

(And despite the article, it doesn't seem clear that Google has solved this problem, so much as bypassed it with a whitelist approach to nudity in general.)


No worries, I’ve now spent more effort thinking about this than I really care to, and am going to go watch puppy videos.

Thank you to whoever actually deals with this, so I don’t have to.


From the article: “We index everything, as does Google, and we do the best job we can of screening it. We use a combination of PhotoDNA and human moderation but that doesn’t get us to perfect every time. We’re committed to getting better all the time.”



