
Cooking up an NSFW filter for Marginalia Search.

Pipeline so far has gone like this:

* Use the search engine's API to query a bunch of depravity

* Use qwen3.5 to label the search results and generate training data

* Try to use fasttext to create a fast model

* Get good results in theory but awful results in practice because it picks up weird features

* Yolo implement a small neural net using hand-selected input features instead

* Train using fasttext training data

* Do a pretty good job

* for (;;) Apply the model to a real-world link database and relabel positive findings with qwen to provide more training data
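For illustration only, here is a minimal sketch of what inference with a small neural net over hand-selected features could look like on a JVM. The post does not describe the real architecture, so everything here is an assumption: the class name `TinyNsfwNet`, the single hidden layer with ReLU, the sigmoid output, and the placeholder weights and two-element feature vector in `main`.

```java
import java.util.Locale;

// Hypothetical sketch, not the actual Marginalia model: one hidden layer
// (ReLU) followed by a sigmoid output, evaluated with plain arrays.
final class TinyNsfwNet {
    private final double[][] hiddenWeights; // [hiddenUnit][inputFeature]
    private final double[] hiddenBias;      // one bias per hidden unit
    private final double[] outputWeights;   // one weight per hidden unit
    private final double outputBias;

    TinyNsfwNet(double[][] hiddenWeights, double[] hiddenBias,
                double[] outputWeights, double outputBias) {
        this.hiddenWeights = hiddenWeights;
        this.hiddenBias = hiddenBias;
        this.outputWeights = outputWeights;
        this.outputBias = outputBias;
    }

    // Returns a score in (0, 1); higher means "more likely NSFW".
    double score(double[] features) {
        double out = outputBias;
        for (int h = 0; h < hiddenWeights.length; h++) {
            double z = hiddenBias[h];
            for (int i = 0; i < features.length; i++) {
                z += hiddenWeights[h][i] * features[i];
            }
            out += outputWeights[h] * Math.max(0.0, z); // ReLU activation
        }
        return 1.0 / (1.0 + Math.exp(-out)); // sigmoid
    }

    public static void main(String[] args) {
        // Placeholder weights chosen by hand for the demo, not trained values.
        TinyNsfwNet net = new TinyNsfwNet(
                new double[][] {{1.0, -1.0}}, new double[] {0.0},
                new double[] {2.0}, -1.0);
        // Hypothetical features, e.g. {flagged-term signal, benign-term signal}.
        System.out.printf(Locale.ROOT, "%.4f%n", net.score(new double[] {1.0, 0.0}));
        System.out.printf(Locale.ROOT, "%.4f%n", net.score(new double[] {0.0, 1.0}));
    }
}
```

A forward pass like this needs only a few multiply-adds per document and no external dependencies, which is one reason a small hand-featured net can suit real-time use inside a Java process.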

Currently this is where I'm at

  Accuracy:   90.90%
  True  Positive: 1021
  False Positive: 154
  True  Negative: 2816
  False Negative: 230
  Precision:  0.8689
  Recall:     0.8161
  F1:         0.8417

There's a lot of vague middle ground, and many of the false positives are arguably just mislabeled.
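The summary figures above follow directly from the four confusion-matrix counts. A minimal sketch of that arithmetic, assuming a hypothetical `ConfusionMatrix` helper class (the name and structure are not from the post):

```java
import java.util.Locale;

// Hypothetical helper: derives the usual binary-classification metrics
// from true/false positive/negative counts.
final class ConfusionMatrix {
    final long tp, fp, tn, fn;

    ConfusionMatrix(long tp, long fp, long tn, long fn) {
        this.tp = tp;
        this.fp = fp;
        this.tn = tn;
        this.fn = fn;
    }

    // Fraction of all samples classified correctly.
    double accuracy() { return (double) (tp + tn) / (tp + fp + tn + fn); }

    // Of everything flagged positive, the fraction that really was positive.
    double precision() { return (double) tp / (tp + fp); }

    // Of everything really positive, the fraction that was flagged.
    double recall() { return (double) tp / (tp + fn); }

    // Harmonic mean of precision and recall, written in terms of counts.
    double f1() { return 2.0 * tp / (2.0 * tp + fp + fn); }

    public static void main(String[] args) {
        // Counts taken from the post.
        ConfusionMatrix m = new ConfusionMatrix(1021, 154, 2816, 230);
        System.out.printf(Locale.ROOT, "Accuracy:  %.2f%%%n", 100 * m.accuracy());
        System.out.printf(Locale.ROOT, "Precision: %.4f%n", m.precision());
        System.out.printf(Locale.ROOT, "Recall:    %.4f%n", m.recall());
        System.out.printf(Locale.ROOT, "F1:        %.4f%n", m.f1());
    }
}
```

With the posted counts this reproduces 90.90%, 0.8689, 0.8161, and 0.8417. If some of the 154 false positives are in fact mislabeled true positives, relabeling them would raise precision without changing the model at all.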


Just want to say that I love using your search engine to find obscure blogs, etc. for my TTRPG side projects. Thank you.


Never heard of it. Is it like Google Search? And why does it need an NSFW filter?


It's like Google search in 1998.

It needs an NSFW filter because some people want it, especially certain API consumers.


Nice cover up for ... actually hoarding depravity ;)


It really is for scientific purposes! ;-)


For scientific search experiments, you might consider PyTerrier, which facilitates comparing multiple types of search model: (sparse) vector space model, Boolean model, binary probabilistic model, support-vector learning-to-rank model, Divergence from Randomness model, (dense) embedding-based ranked retrieval models, etc.


The model needs to run in real time on a JVM, so py-anything sadly isn't viable.



