
>What is stopping [...] Java, JS, and C# files in UTF-8?

Files on disk can be UTF-8. UCS-2 (later revised to UTF-16) persists in the runtime because things like the Win32 API, which C# uses, are UCS-2. The internal raw memory layout of strings in Win32 is UCS-2.

*EDIT to add correction



Win32 narrow API calls support UTF-8 natively now.


Code page 65001 has existed for a long time now, but it was discouraged because there were a lot of corner cases that didn't work. Did they finally get all the kinks out of it?


Yes. Applications can switch code page on their own.


Yes, Windows 10 has continually improved UTF-8 support. You can even set applications to use it by default now.


UTF-16*, not UCS-2. Although there are probably many programs that assume UCS-2.
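To make the difference concrete, here is a minimal Python sketch (the helper name `utf16_code_units` is made up for illustration). It counts 16-bit code units: a character outside the Basic Multilingual Plane takes two units in UTF-16, which is exactly the case that code assuming UCS-2 ("one unit per character") gets wrong.

```python
def utf16_code_units(text: str) -> int:
    """Return how many 16-bit code units `text` occupies when stored as UTF-16."""
    # utf-16-le writes no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # U+1F600, outside the BMP

print(utf16_code_units(bmp_char))     # 1 code unit
print(utf16_code_units(astral_char))  # 2 code units (a surrogate pair)

# The two units are the high and low surrogates D83D and DE00.
print(astral_char.encode("utf-16-be").hex())  # d83dde00
```

A program that treats each 16-bit unit as a whole character would see U+1F600 as two "characters" and could split it in the middle when truncating or indexing.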


When Windows adopted Unicode, I think the only encoding available was UCS-2. They converted pretty quickly to UTF-16 though, and I think the same is true of everybody else who started with UCS-2. Unfortunately UTF-16 has its own set of hassles.


Technically, they converted to WTF-16 [0] since many places, including filenames, allow you to use unpaired surrogates.

[0] https://simonsapin.github.io/wtf-8/
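As a rough illustration of what "unpaired surrogates" means, here is a small Python sketch (the function names are hypothetical). Python's strict UTF-16 codec rejects a lone surrogate, while the `surrogatepass` error handler writes it out as a bare 16-bit unit, which approximates the potentially ill-formed UTF-16 the WTF-8 document describes. It does not touch any Windows API.

```python
def is_well_formed_utf16(text: str) -> bool:
    """True if `text` can be encoded as strict UTF-16 (no lone surrogates)."""
    try:
        text.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


def encode_potentially_ill_formed(text: str) -> bytes:
    """Encode to 16-bit units, letting lone surrogates through unchanged."""
    return text.encode("utf-16-le", "surrogatepass")


lone_high = "\ud83d"    # high surrogate with no low surrogate after it
paired = "\U0001F600"   # a proper surrogate pair once encoded as UTF-16

print(is_well_formed_utf16(paired))     # True
print(is_well_formed_utf16(lone_high))  # False

raw = encode_potentially_ill_formed(lone_high)
print(raw.hex())  # 3dd8, the single unit D83D in little-endian order

# The lone surrogate round-trips only when the lenient handler is used.
print(raw.decode("utf-16-le", "surrogatepass") == lone_high)  # True
```

A filename containing such a unit cannot be converted to strict UTF-8 or UTF-16 without loss, which is why encodings like WTF-8 exist.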


Note that the asterisk in `UTF-16*` is a really big asterisk. I fixed a UCS-2 bug last week at my day job.


Yeah, in practice there are often more hacks like WTF-8 and WTF-16 than is healthy on systems that started out as UCS-2 (including Windows and JS): https://simonsapin.github.io/wtf-8/



