The examples they give all look like valid uses of different non-breaking spaces, with width hints for their use/location. This might be a little overzealous if written by a human, but perhaps not for a machine.
Yeah, it's probably not an intentional watermark, just something the model has been trained to do. Maybe some professionally written news articles already use them for the same purpose?
Still hope HN adds a filter to block any comment with those characters in it :)
It is very easy to filter those out of GPT's output, though, using basic UNIX utilities. In fact, many such marks don't survive reformatting or copy-pasting, so no filtering is needed at all.
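For example, a one-liner like this strips the usual suspects (a sketch: the exact character set an LLM emits is an assumption here — U+00A0 no-break space and U+202F narrow no-break space are just the two most commonly reported):

```shell
# Replace non-breaking space variants with plain ASCII spaces.
# Uses bash $'...' quoting to embed the raw UTF-8 byte sequences:
#   U+00A0 = \xc2\xa0, U+202F = \xe2\x80\xaf
printf 'hello\xc2\xa0world\xe2\x80\xafagain\n' |
  sed -e $'s/\xc2\xa0/ /g' -e $'s/\xe2\x80\xaf/ /g'
```

Zero-width characters (U+200B and friends) would need their own substitutions, deleting rather than replacing.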
It is a very basic watermark technique (text steganography) if it indeed is supposed to be one.
A more advanced one would be a linguistic (grammar-based) one, but I am not going to give any more ideas. :D
It's easy to remove those characters, but that still requires being aware of them, and an intent to deceive. So many people just copy LLM output here because they (wrongly) believe it adds something of value to a discussion.
I mean, sure, these characters could help estimate the likelihood that text was generated (human writers are probably less likely to insert proper non-breaking spaces), but I doubt they are watermarks.