An interesting ML exercise (possible class project!) would be to try to automate this. A bigram corpus combined with word embeddings and an NSFW text classifier, maybe? Your usual word embedding might not work, because the point is the multiple meanings; maybe the twist is that you'd need a polysemous word embedding with multiple vectors per word, so it's not just an off-the-shelf word2vec...
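For what it's worth, here's a crude sketch of the "multiple vectors per word" idea, assuming NLTK's WordNet and gensim's downloadable GloVe vectors (`sense_vectors` is just an illustrative name, and averaging definition words is a stand-in for a real multi-sense embedding):

```python
# One vector per WordNet sense, built by averaging GloVe vectors over the
# words of each sense's definition: a crude stand-in for a real polysemous
# embedding. Requires: pip install nltk gensim, then nltk.download("wordnet").
import numpy as np
import gensim.downloader
from nltk.corpus import wordnet as wn

glove = gensim.downloader.load("glove-wiki-gigaword-50")

def sense_vectors(word):
    """Map each WordNet sense of `word` to its own embedding."""
    senses = {}
    for syn in wn.synsets(word):
        tokens = [t for t in syn.definition().lower().split() if t in glove]
        if tokens:
            senses[syn.name()] = np.mean([glove[t] for t in tokens], axis=0)
    return senses

# "bank" should get one vector near river words, another near money words
for name in sense_vectors("bank"):
    print(name)
```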
You don’t even need embeddings or ML. A simple search across dictionaries, thesauruses, and the Wikipedia entry list (with disambiguations) should be enough:
- Find all two-word phrases and compound words
- Search across all pairwise combinations of mutual synonyms, and check whether the compound synonym is itself a word or phrase (a minimal sketch follows this list)
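Something like this, as a minimal sketch against NLTK's WordNet, which stores multiword entries as underscore-joined lemmas and so can stand in for both the dictionary and the thesaurus (`synonym_compounds` is a hypothetical helper):

```python
# Thesaurus-only baseline: for a two-word entry, try every pairwise
# combination of synonyms and keep the ones that are themselves entries.
# Requires: pip install nltk, then nltk.download("wordnet").
from itertools import product
from nltk.corpus import wordnet as wn

def synonyms(word):
    """All single-word WordNet synonyms of `word`, across every sense."""
    return {l.name() for s in wn.synsets(word)
            for l in s.lemmas() if "_" not in l.name()}

def synonym_compounds(phrase):
    """For an entry like 'hot_dog', yield synonym-for-synonym
    substitutions that are themselves WordNet entries."""
    a, b = phrase.split("_")
    for x, y in product(synonyms(a), synonyms(b)):
        candidate = f"{x}_{y}"
        if candidate != phrase and wn.synsets(candidate):
            yield candidate.replace("_", " ")

print(list(synonym_compounds("hot_dog")))
```

Running that over every two-word lemma in WordNet is only a few more lines; the Wikipedia entry list would need its own lookup table.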
That would be a good baseline. Maybe the ML part would be for ranking them? I expect you'd drown in matches, and the hard part becomes finding the good ones.
I expect the opposite. An ML/embedding approach would find lots of false positives, because many words have close embeddings without being synonyms; a strict thesaurus lookup should produce fewer matches. As for ranking, an embedder might help there, but then we need a definition of "good". I would argue the most conceptually "unrelated" matches are the "best" ones, so yes, an embedder could rank candidates by vector distance and surface the farthest pairs.
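As a sketch of that ranking step, assuming gensim's downloadable GloVe vectors (`rank_matches` and the example match are hypothetical):

```python
# Rank candidate matches by "unrelatedness": average the cosine similarity
# between each original word and its substituted synonym, lowest first.
# Requires: pip install gensim (the vectors are a one-time ~66 MB download).
import gensim.downloader

glove = gensim.downloader.load("glove-wiki-gigaword-50")

def rank_matches(matches):
    """matches: list of ((orig_a, orig_b), (sub_a, sub_b)) word pairs.
    Returns them sorted most-unrelated (farthest) first."""
    def score(pair):
        (a, b), (x, y) = pair
        return (glove.similarity(a, x) + glove.similarity(b, y)) / 2
    in_vocab = [m for m in matches if all(w in glove for w in m[0] + m[1])]
    return sorted(in_vocab, key=score)

for orig, sub in rank_matches([(("hot", "dog"), ("blistering", "frank"))]):
    print(" ".join(orig), "->", " ".join(sub))
```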