Re: [dev] [libgrapheme] announcement

From: Mattias Andrée <maandree_AT_kth.se>
Date: Sat, 28 Mar 2020 00:32:24 +0100

On Fri, 27 Mar 2020 22:24:22 +0000
<sylvain.bertrand_AT_gmail.com> wrote:

> On Fri, Mar 27, 2020 at 10:24:52PM +0100, Laslo Hunhold wrote:
> > ... This will cover 99.5% of all cases...
>
> What do you mean? They managed to add in grapheme cluster definition some weird
> edge cases up to 0.5%??
>
> About string comparison: if I recall well, after utf-8 normalization (n11n), strings
> are supposed to be 100% perfect for comparison byte per byte.
>
> The more you know: utf-8 n11n got its way in linux filesystems support, and
> that quite recently. This will become a problem for terminal based
> applications. In near future gnu/linux distros, the filenames will become
> normalized using the "right way"(TM) n11n.
>
> This "right way"(TM) n11n (there are 2 n11ns) produces only non-pre-composed
> grapheme cluster of codepoints (but in the CJK realm, there are exceptions if I
> recall properly). AFAIK, all terminal based applications do expect
> "pre-composed" grapheme codepoint.

This sounds absolutely horrible. Non-pre-composed characters are not widely
well supported and are often rendered terribly; some software (like the Linux
VT) cannot even render them.

Why is the kernel even getting into encoding issues? That should be an
application issue, not a kernel issue. A kernel should only know bytes. Is it
really a security issue?

>
> For instance the french letter 'è' won't be 1 codepoint anymore, but 'e' + '`'
> (I don't recall the n11n order), namely a sequence of 2 codepoints.
>
> I am a bit scared because software like ncurses, lynx, links, vim, may use the
> abominations of software we discussed earlier to handle all this.
>
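
To make the 'è' example above concrete, here is a minimal sketch in Python
using the standard unicodedata module. It only illustrates the two Unicode
normalization forms (NFC, pre-composed, and NFD, decomposed); it makes no
claim about which form any filesystem actually uses. The helper name
codepoints() is made up for this illustration. In NFD the base letter comes
first, followed by the combining grave accent U+0300 (not the ASCII backtick).

```python
import unicodedata

def codepoints(s):
    """Hypothetical helper: list the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]

precomposed = "\u00e8"                                   # 'è' as one code point
decomposed = unicodedata.normalize("NFD", precomposed)   # 'e' + combining grave

print(codepoints(precomposed))   # ['U+00E8']
print(codepoints(decomposed))    # ['U+0065', 'U+0300']

# The two strings render the same but differ as code points and as bytes.
print(precomposed == decomposed)                                   # False
print(precomposed.encode("utf-8") == decomposed.encode("utf-8"))   # False

# After normalizing both sides to the same form, comparison succeeds.
same_nfc = (unicodedata.normalize("NFC", decomposed)
            == unicodedata.normalize("NFC", precomposed))
print(same_nfc)                                                    # True
```

This is why byte-per-byte comparison only works once both strings have been
brought to the same normalization form, and why an application that assumes
one code point per character miscounts the decomposed variant.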
Received on Sat Mar 28 2020 - 00:32:24 CET
