On Fri, Mar 27, 2020 at 10:24:52PM +0100, Laslo Hunhold wrote:
> ... This will cover 99.5% of all cases...
What do you mean? Did they manage to put enough weird edge cases into the
grapheme cluster definition to add up to 0.5%??
About string comparison: if I recall correctly, after unicode normalization
(n11n), strings are supposed to be 100% reliable for byte-for-byte comparison.
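
To make the comparison point concrete, here is a minimal sketch in Python,
assuming the standard unicodedata module. The helper name same_text and the
sample strings are made up for illustration. Note that this only covers
canonical equivalence: a compatibility character such as the "fi" ligature
still differs from plain "fi" unless a compatibility form (NFKC) is used.

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Normalize both strings to `form`, then compare their utf-8 bytes."""
    na = unicodedata.normalize(form, a).encode("utf-8")
    nb = unicodedata.normalize(form, b).encode("utf-8")
    return na == nb

precomposed = "caf\u00e9"   # 'é' stored as one codepoint
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Without n11n the raw bytes differ even though the text looks identical.
raw_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")
print(raw_equal)                            # False
print(same_text(precomposed, decomposed))   # True

# Canonical forms leave the ligature alone; the compatibility form folds it.
print(same_text("\ufb01", "fi"))            # False
print(same_text("\ufb01", "fi", "NFKC"))    # True
```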
The more you know: utf-8 n11n made its way into linux filesystem support, and
quite recently. This will become a problem for terminal-based
applications. In near-future gnu/linux distros, filenames will be
normalized using the "right way"(TM) n11n.
This "right way"(TM) n11n (there are 2 canonical n11ns, composed and
decomposed) produces only non-pre-composed sequences of codepoints for each
grapheme cluster (but in the CJK realm, there are exceptions if I recall
properly). AFAIK, all terminal-based applications expect "pre-composed"
codepoints.
For instance the French letter 'è' won't be 1 codepoint anymore, but 'e' + a
combining '`' (I don't recall the n11n order), namely a sequence of 2 codepoints.
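
The decomposition is easy to check. A small sketch in Python, assuming the
standard unicodedata module and taking NFD as the decomposed form (the
variable names are only illustrative). It lists the codepoints in the order
NFD emits them and shows the change in utf-8 length:

```python
import unicodedata

s = "\u00e8"                            # 'è' as a single pre-composed codepoint
nfd = unicodedata.normalize("NFD", s)   # canonical decomposition

# List each codepoint of the decomposed string, in order.
codepoints = [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in nfd]
print(codepoints)
# ['U+0065 LATIN SMALL LETTER E', 'U+0300 COMBINING GRAVE ACCENT']

# The utf-8 encoding grows from 2 bytes to 3 bytes.
print(len(s.encode("utf-8")), len(nfd.encode("utf-8")))   # 2 3
```

So the base letter comes first and the combining accent follows it.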
I am a bit scared because software like ncurses, lynx, links, or vim may use the
abominations of software we discussed earlier to handle all this.
--
Sylvain
Received on Fri Mar 27 2020 - 23:24:22 CET