Around the World With Unicode (norasandler.com)
19 points by totallymike on Dec 4, 2017 | 5 comments


> This [han unification] significantly reduces the number of code points you need, and simplifies normalizing and collating CJK text, at the expense of undermining the entire point of Unicode.

This is the best description of the pros and cons of Han unification I've ever seen.

Personally I really enjoy the ease with which I can look up characters from a Japanese text in a Chinese dictionary and see how they are understood in subtly different ways across the two languages.


> ... at the expense of undermining the entire point of Unicode.

I'm not a Unicode history expert, but did the author (Nora Sandler) accurately represent the philosophical intentions of Unicode?

Specifically, she quotes: "Unicode provides a unique number for every character,"

Is "every character" underspecified? What's the philosophy of Unicode? Is it to:

a) map a codepoint for every semantic character?

b) map a codepoint for every visual character?

Here's a non-CJK example: the single tick (′) character, U+2032 [1], has 3 different semantics:

  1) foot mark e.g. 3′ to a yard
  2) prime mark e.g. f′(x)
  3) coordinates minutes e.g. 48°51′24″N

In each case, the character's rendering on screen and in print looks identical, so Unicode used the same codepoint for 3 different meanings. Even if Unicode created 3 separate codepoints for 3 separate meanings of (′), semantic fidelity would be lost anyway, since authors would often choose the first glyph that "looked like" the tick mark they wanted. Or they'd probably just use U+0027, the traditional ASCII apostrophe ('), since it's the easiest to type on a keyboard.
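
To make the tick mark point concrete, here is a small Python sketch using only the standard `unicodedata` module. The sample strings and the `samples` dictionary are my own illustration, not something from the article.

```python
import unicodedata

prime = "\u2032"  # the single tick discussed above

# Three different meanings, one and the same code point in each string.
samples = {
    "feet": "3\u2032",
    "derivative": "f\u2032(x)",
    "arcminutes": "48\u00b051\u203224\u2033N",
}

for meaning, text in samples.items():
    # Nothing in the string itself records which meaning was intended.
    print(meaning, text, prime in text)

# Unicode gives the character one name, regardless of how it is used.
print(unicodedata.name(prime))   # PRIME
# The keyboard-friendly fallback is a different character entirely.
print(unicodedata.name("'"))     # APOSTROPHE
```

The intended meaning only exists in the surrounding context, which is the same situation Han unification puts CJK text in.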

Put another way, was Unicode intended to create codepoints at the level of a character's visual look or at the level of each language's character set? It looks like Unicode chose the visual look, which is why 3 uses of tick marks and many Han characters collapse to single codepoints. If the "entire point of Unicode" was to give each language its own set of codepoints, there would be a contiguous array of Han codepoints for Chinese and another set of Han codepoints for Korean -- with many duplicates. Is there evidence that that was the "correct" idea for Unicode and that politics or technical debates somehow de-duplicated the CJK characters?

Yes, there are some duplicate characters, such as math symbols and Greek letters, so maybe Unicode's philosophy has no single consistent idea of what a codepoint maps to.

[1] http://www.fileformat.info/info/unicode/char/2032/index.htm


Isn't the main argument against the Han unification that the characters are actually visually different between the languages?

One could certainly "just use a correct font" for each of these languages, but currently it is actually impossible to write a text in Chinese and Japanese and have all characters be represented correctly without additional metadata.

The Wikipedia article gives many good examples of where this causes issues. https://en.wikipedia.org/wiki/Han_unification

As you have pointed out, this is not consistent across Unicode. The Cyrillic Р could just reuse the Latin P.
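
A rough Python sketch of what "additional metadata" means in practice. I picked 骨 because it is a commonly cited unified ideograph whose preferred glyph differs by region; the variable names and the language tags in `tagged` are hypothetical, just for illustration.

```python
import unicodedata

ch = "骨"

# Pretend one of these came from a Japanese sentence and the other
# from a Chinese one. After Han unification they are the same data.
japanese_text = "骨"
chinese_text = "骨"

print(f"U+{ord(ch):04X}", unicodedata.name(ch))
print(japanese_text == chinese_text)
print(japanese_text.encode("utf-8") == chinese_text.encode("utf-8"))

# The language has to travel out of band, e.g. as a tag next to the
# text, so a renderer can pick the appropriate font.
tagged = [("ja", japanese_text), ("zh-Hans", chinese_text)]
for lang, text in tagged:
    print(lang, text)
```

The strings and their bytes are identical, so a renderer given only the text cannot tell which glyph shape was intended.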

I think that bandwidth and memory nowadays are cheap enough that we could have a few (relatively) more code points to encode all languages properly.


I am also not a Unicode history expert, nor a Unicode expert in any way.

> It looks like Unicode chose the visual look

This might be true for Han characters, but it is not for others. For example, capital Latin A (U+0041), Cyrillic А (U+0410) and Greek Α (U+0391) have the same origin and are visually identical. The same goes for many other symbols.
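
This is easy to check in Python with the standard `unicodedata` module. A minimal sketch; the `lookalikes` list is just my name for the three characters mentioned above.

```python
import unicodedata

# Latin A, Cyrillic A, Greek Alpha: same shape, three code points.
lookalikes = ["\u0041", "\u0410", "\u0391"]

for ch in lookalikes:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# They never compare equal, and even the most aggressive
# normalization form (NFKC) keeps them apart.
nfkc = {unicodedata.normalize("NFKC", ch) for ch in lookalikes}
print(len(set(lookalikes)), len(nfkc))
```

So for these scripts Unicode clearly did not unify by visual appearance.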


For Greek, Cyrillic, and Latin specifically, Unicode has to be round-trip compatible with encoding schemes which distinguish between them.

(There might have been other reasons as well, but the round-trip desideratum forces it in any case.)
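
A quick sketch of the round-trip requirement in Python. ISO-8859-7 and KOI8-R are my own choice of example legacy encodings (nothing above names specific ones), and `round_trips` is a hypothetical helper.

```python
latin_a, greek_alpha, cyrillic_a = "\u0041", "\u0391", "\u0410"

def round_trips(text, codec):
    """True if text survives encode -> decode through the legacy codec."""
    return text.encode(codec).decode(codec) == text

# Each legacy encoding stores the Latin letter and its lookalike
# as two different byte values.
greek_bytes = (latin_a + greek_alpha).encode("iso8859_7")
cyrillic_bytes = (latin_a + cyrillic_a).encode("koi8_r")
print(greek_bytes, cyrillic_bytes)

# Because Unicode keeps separate code points, nothing is lost
# when converting to Unicode and back.
print(round_trips(latin_a + greek_alpha, "iso8859_7"))
print(round_trips(latin_a + cyrillic_a, "koi8_r"))
```

If Unicode had merged the three letters into one code point, both byte values would decode to the same character and the original bytes could not be recovered.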

There are a lot of characters which were explicitly encoded for such compatibility:

https://en.wikipedia.org/wiki/Unicode_compatibility_characte...

This document discusses equivalence among Unicode characters:

http://unicode.org/reports/tr15/tr15-18.html
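
For a feel of what that equivalence looks like, here is a minimal Python sketch using `unicodedata.normalize`. The two characters (the ohm sign and the "fi" ligature) are my own picks as examples of duplicates kept for compatibility.

```python
import unicodedata

# OHM SIGN is canonically equivalent to GREEK CAPITAL LETTER OMEGA.
ohm, omega = "\u2126", "\u03a9"
print(ohm == omega)                                # False
print(unicodedata.normalize("NFC", ohm) == omega)  # True

# The "fi" ligature is only compatibility-equivalent to "f" + "i":
# NFC leaves it alone, NFKC folds it.
ligature = "\ufb01"
print(unicodedata.normalize("NFC", ligature) == ligature)  # True
print(unicodedata.normalize("NFKC", ligature))             # fi
```

Note that this machinery never folds Latin A, Cyrillic А and Greek Α together; they are treated as distinct characters, not as equivalents.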



