
An interesting ML exercise (possible class project!) would be to try to automate this. A bigram corpus combined with word embeddings and an NSFW text classifier, maybe? A standard word embedding might not work because the whole point is the multiple meanings, so maybe the twist is that you need a polysemous embedding with multiple vectors per word, something beyond an off-the-shelf word2vec...
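To make the polysemy point concrete, here is a minimal sketch of one-vector-per-sense, using WordNet glosses plus a sentence embedder; the libraries and model name are just illustrative assumptions, not the only way to do it:

    # One vector per word *sense*, not per word, so "bank"
    # (riverbank) and "bank" (the institution) stay distinct.
    # Assumes nltk (with the wordnet corpus downloaded) and
    # sentence-transformers; the model name is illustrative.
    from nltk.corpus import wordnet as wn
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def sense_vectors(word):
        # Embed each WordNet gloss, giving polysemous words
        # one vector per meaning instead of a single average.
        return {s.name(): model.encode(s.definition())
                for s in wn.synsets(word)}

    vecs = sense_vectors("bank")  # one entry per sense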


You don’t even need embeddings or ML. A simple search across dictionaries, thesauruses, and the Wikipedia entry list (with disambiguation pages) should be enough.

- Find all two-word phrases and compound words, search across all pairwise combinations of mutual synonyms, and determine whether the compound synonym is itself a word or phrase (rough sketch below, after the link)

https://chatgpt.com/share/6818d11d-f444-800a-96b0-7a932e9213...
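Here is a rough sketch of that, with WordNet standing in for both the thesaurus and the "is this a word?" check; the function names are mine, and a real version would also want Wiktionary and the Wikipedia title list:

    # For a two-word phrase, try every pairwise combination of
    # synonyms and keep the ones whose combination is itself a
    # dictionary entry. (May need: nltk.download('wordnet'))
    from itertools import product
    from nltk.corpus import wordnet as wn

    def synonyms(word):
        return {l.name().lower() for s in wn.synsets(word)
                                 for l in s.lemmas()}

    def compound_synonyms(w1, w2):
        hits = []
        for a, b in product(synonyms(w1), synonyms(w2)):
            # WordNet writes multiword entries with underscores.
            for cand in (f"{a}_{b}", f"{a}{b}"):
                if wn.synsets(cand):
                    hits.append((a, b, cand))
        return hits

    print(compound_synonyms("fire", "fly"))  # e.g. finds "firefly"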


That would be a good baseline. Maybe the ML part would be ranking them? I expect you would drown in matches, and the hard part becomes finding the good ones.


I expect the opposite. An ML/embedding approach would find lots of false positives, because many words have close embeddings without being synonyms; a strict thesaurus lookup should produce far fewer matches. As for ranking the good ones, an embedder might help, but then we need a definition of "good". I would argue the most conceptually unrelated matches are the best ones, so yes, an embedder could rank candidates by vector distance, farthest first.
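For what it's worth, that ranking step is nearly a one-liner once you have an embedder: score each candidate pair by cosine distance and sort, farthest first. The model name is again just an example:

    # Rank candidate pairs by "unrelatedness": the farther apart
    # the two source words sit in embedding space, the better the
    # match, per the argument above.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def unrelatedness(word_a, word_b):
        ea, eb = model.encode([word_a, word_b])
        return 1 - util.cos_sim(ea, eb).item()  # higher = farther apart

    # Hypothetical candidates from the thesaurus search upstream:
    candidates = [("fire", "fly"), ("flame", "wing")]
    ranked = sorted(candidates, key=lambda p: unrelatedness(*p),
                    reverse=True)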



