An interesting ML exercise (possible class project!) would be to try to automate this. A bigram corpus combined with word embeddings and an NSFW text classifier, maybe? Your usual word embedding might not work, because the point is the multiple meanings; maybe the twist is that you'd need a polysemous word embedding with multiple vectors per word, so it's not just an off-the-shelf word2vec...
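For what it's worth, here's a crude sketch of the "multiple vectors per word" idea, assuming NLTK's WordNet and gensim's downloadable GloVe vectors (`sense_vectors` is just an illustrative name, and averaging definition words is a stand-in for a real multi-sense embedding):

```python
# One vector per WordNet sense, built by averaging GloVe vectors over the
# words of each sense's definition: a crude stand-in for a real polysemous
# embedding. Requires: pip install nltk gensim, then nltk.download("wordnet").
import numpy as np
import gensim.downloader
from nltk.corpus import wordnet as wn

glove = gensim.downloader.load("glove-wiki-gigaword-50")

def sense_vectors(word):
    """Map each WordNet sense of `word` to its own embedding."""
    senses = {}
    for syn in wn.synsets(word):
        tokens = [t for t in syn.definition().lower().split() if t in glove]
        if tokens:
            senses[syn.name()] = np.mean([glove[t] for t in tokens], axis=0)
    return senses

# "bank" should get one vector near river words, another near money words
for name in sense_vectors("bank"):
    print(name)
```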
You don’t even need embeddings or ML. A simple search across dictionaries, thesauruses, and the Wikipedia entry list (with disambiguations) should be enough:
- Find all two-word phrases and compound words
- Search across all pairwise combinations of mutual synonyms, and check whether the compound synonym is itself a word or phrase (a minimal sketch follows this list)
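Something like this, as a minimal sketch against NLTK's WordNet, which stores multiword entries as underscore-joined lemmas and so can stand in for both the dictionary and the thesaurus (`synonym_compounds` is a hypothetical helper):

```python
# Thesaurus-only baseline: for a two-word entry, try every pairwise
# combination of synonyms and keep the ones that are themselves entries.
# Requires: pip install nltk, then nltk.download("wordnet").
from itertools import product
from nltk.corpus import wordnet as wn

def synonyms(word):
    """All single-word WordNet synonyms of `word`, across every sense."""
    return {l.name() for s in wn.synsets(word)
            for l in s.lemmas() if "_" not in l.name()}

def synonym_compounds(phrase):
    """For an entry like 'hot_dog', yield synonym-for-synonym
    substitutions that are themselves WordNet entries."""
    a, b = phrase.split("_")
    for x, y in product(synonyms(a), synonyms(b)):
        candidate = f"{x}_{y}"
        if candidate != phrase and wn.synsets(candidate):
            yield candidate.replace("_", " ")

print(list(synonym_compounds("hot_dog")))
```

Running that over every two-word lemma in WordNet is only a few more lines; the Wikipedia entry list would need its own lookup table.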
That would be a good baseline. Maybe the ML part would be for ranking them? I expect you'd drown in matches, and the hard part becomes finding the good ones.
I expect the opposite. An ML/embedding approach would find lots of false positives, because many words have close embeddings without being synonyms; a strict thesaurus lookup should produce fewer matches. As for ranking, an embedder might help there, but then we need a definition of "good". I would argue the most conceptually "unrelated" matches are the "best" ones, so yes, an embedder could rank candidates by vector distance and surface the farthest pairs.
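As a sketch of that ranking step, assuming gensim's downloadable GloVe vectors (`rank_matches` and the example match are hypothetical):

```python
# Rank candidate matches by "unrelatedness": average the cosine similarity
# between each original word and its substituted synonym, lowest first.
# Requires: pip install gensim (the vectors are a one-time ~66 MB download).
import gensim.downloader

glove = gensim.downloader.load("glove-wiki-gigaword-50")

def rank_matches(matches):
    """matches: list of ((orig_a, orig_b), (sub_a, sub_b)) word pairs.
    Returns them sorted most-unrelated (farthest) first."""
    def score(pair):
        (a, b), (x, y) = pair
        return (glove.similarity(a, x) + glove.similarity(b, y)) / 2
    in_vocab = [m for m in matches if all(w in glove for w in m[0] + m[1])]
    return sorted(in_vocab, key=score)

for orig, sub in rank_matches([(("hot", "dog"), ("blistering", "frank"))]):
    print(" ".join(orig), "->", " ".join(sub))
```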