"Short version: I tried to load https://maurycyz.com/misc/ipv4/ directly and via search. The server is intentionally serving AI crawlers decoy text ("Garbage for the garbage king!", random chemistry/manual fragments, etc.) instead of the real article. Because of that, I can't actually read the real content of the page."
Seems like this poison-pill strategery is a non-starter if a chatbot can reliably identify the page as nonsense. The most you're going to do is burn bandwidth to trap a spider.
I mean, how does it know that, though? How would you know whether the set of possible texts is garbage without running them? Honestly, it feels like you're saying LLMs solved the halting problem for programs, which seems dishonest; granted, you could probably guess with high efficiency.
Not a clue. But apparently it does. Try a few nonsense texts yourself, see if it rejects them.
I'm saying that if you're spidering the whole web and then training an LLM on that corpus, asking an existing LLM "does this page make sense?" is a comparatively small additional load.
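Roughly what that filter pass could look like, as a minimal sketch (the OpenAI-compatible client, model name, and prompt are all illustrative assumptions, not anyone's actual pipeline):

```python
# Sketch: gate crawled pages on a cheap "is this coherent prose?" LLM check.
# Assumes an OpenAI-compatible API; model choice and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def looks_coherent(page_text: str) -> bool:
    """Return True if the model judges the page to be coherent prose."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any cheap instruct model will do
        messages=[{
            "role": "user",
            "content": (
                "Answer YES or NO only. Is the following text coherent, "
                "human-written prose rather than filler or word salad?\n\n"
                + page_text[:4000]  # a prefix is usually enough to judge
            ),
        }],
        max_tokens=3,
    )
    return resp.choices[0].message.content.strip().upper().startswith("Y")
```

One such call per page is small next to the cost of training on that page, which is the asymmetry I mean.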
> guess with high efficiency
Yes, I think that's basically what's happening. Markov nonsense is cheap to produce but also easy to classify. A more subtle strategy might be more successful (for example, someone downthread mentions using LLM-generated text, and we know that's quite a hard thing to classify).
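To make "cheap to produce, easy to classify" concrete, here's a toy sketch of both halves: a bigram Markov babbler, and GPT-2 perplexity as a stand-in classifier. The actual filters crawlers run aren't public; perplexity is just one classic signal, and babble typically scores far above real prose on it:

```python
# Sketch: generate Markov word salad, then score it with a small LM.
# GPT-2 perplexity is a stand-in detector, not any crawler's real filter.
import random
from collections import defaultdict

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def markov_babble(corpus: str, n_words: int = 80, seed: int = 0) -> str:
    """Emit word salad by sampling bigram transitions from a source text."""
    rng = random.Random(seed)
    words = corpus.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    out = [rng.choice(words)]
    for _ in range(n_words - 1):
        # fall back to the full vocabulary if a word has no recorded successor
        out.append(rng.choice(chain.get(out[-1]) or words))
    return " ".join(out)

def perplexity(text: str) -> float:
    """Per-token perplexity under GPT-2; lower means more prose-like."""
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return float(torch.exp(loss))
```

Threshold the perplexity gap between a page and typical prose and you have a filter. LLM-generated decoys sidestep exactly this signal, since that text is sampled to be low-perplexity in the first place.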