On Fri, Sep 28, 2018 at 11:50:11AM +0200, Laslo Hunhold wrote:
> On Fri, 28 Sep 2018 02:05:20 +0000
> sylvain.bertrand_AT_gmail.com wrote:
> ...
>> Well, there is something about stream safe unicode application.
>> Basically, it is a buffer of 128 bytes (32 unicode points) with a
>> continuation mark if an "extended grapheme cluster" is not finished at
>> the end of the buffer. It seems related only to stream normalization
>> on the fly, though.
>
> At this point we need to just question this insanity. As I like to
> jokingly say, even some African tribe with a very delicate language
> would not have grapheme clusters longer than say 10 code points or so.
> Everything even above 10-20 elements screams unicode exploit (remember
> those accent-trees that used to flood online chats?) and it would
> definitely be enough to just have a fixed size buffer (varied in
> config.h) for grapheme clusters.
That's what the spec says: an "extended grapheme cluster" (EGC) should not go
beyond 10 code points "in theory". This stream-safe thingy seems to apply to
non-normalized Unicode streams, with its 32 code points and continuation
mark.
With that "continuation mark", an EGC can go to "infinity and
beyond"... and the application is in charge of the size of the "infinity and
beyond" (aka, _you better deal with microsoft, apple, google and mozilla
"infinity and beyond"_).
I am in favor of a hard limit of 32 code points, with a nice 128-byte
shifting buffer (32 code points of 4 bytes each; that would be two 64-byte
AVX-512 registers if I recall properly, MMX registers being only 8 bytes). The
"continuation mark" would switch the state machine to "discarding" mode, and
certainly not to "infinity and beyond" memory allocation. The parser would stay
in that discarding state until the end of the oversized EGC, or until some
corruption.
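
To make the idea concrete, here is a rough sketch of such a state machine in C.
It is only an illustration under loud assumptions: the names (egc_buf, egc_feed,
is_extend) are made up, is_extend() is a crude stand-in for the real cluster
boundary rules (it only knows the U+0300..U+036F combining block), and dropping
the whole oversized cluster is just one possible reading of "discarding" mode.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EGC_MAX 32 /* hard limit in code points: 32 * 4 = 128 bytes */

enum egc_state { EGC_COLLECT, EGC_DISCARD };

struct egc_buf {
	uint32_t cp[EGC_MAX];
	size_t len;
	enum egc_state state;
};

/* ASSUMPTION: crude placeholder for the real boundary rules; only the
 * Combining Diacritical Marks block extends the current cluster. */
static int
is_extend(uint32_t cp)
{
	return cp >= 0x0300 && cp <= 0x036F;
}

/* Feed one code point. Returns 1 if the previous cluster was completed
 * and copied to out (its length in *outlen), 0 otherwise. Never
 * allocates: an oversized cluster is dropped as a whole. */
static int
egc_feed(struct egc_buf *b, uint32_t cp, uint32_t out[EGC_MAX],
         size_t *outlen)
{
	int emitted = 0;

	if (!is_extend(cp)) {
		/* boundary: hand out the collected cluster, if we kept one */
		if (b->state == EGC_COLLECT && b->len > 0) {
			memcpy(out, b->cp, b->len * sizeof(b->cp[0]));
			*outlen = b->len;
			emitted = 1;
		}
		/* cp starts a fresh cluster and ends any discarding */
		b->cp[0] = cp;
		b->len = 1;
		b->state = EGC_COLLECT;
		return emitted;
	}
	if (b->state == EGC_DISCARD)
		return 0; /* still inside the oversized cluster */
	if (b->len == EGC_MAX) {
		/* over the limit: forget the cluster, skip the rest of it */
		b->len = 0;
		b->state = EGC_DISCARD;
		return 0;
	}
	b->cp[b->len++] = cp;
	return 0;
}
```

A caller would start from a zeroed struct egc_buf (len 0, EGC_COLLECT) and would
still have to flush the last cluster at end of stream, which this sketch leaves
out.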
> Depends on the level. A safe UTF-8 decoder catches garbage on its
> level, and will replace it with a Unicode "invalid" code point (forgot
> the name).
> On higher levels, everything is within the bounds of the Unicode spec
> and the "invalid" code point is just another code point.
I wonder how this is handled in lynx, ncurses, vim, readline, libedit, etc.
Wild guess: their "atom" is only 1 code point. Probably some work will have
to be done here... (and their maintainers won't be happy...)
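
For the lowest level, the "safe decoder" described in the quote could look
roughly like the sketch below. The function name utf8_decode and the
CP_INVALID macro are made up for illustration; the substituted code point is
U+FFFD REPLACEMENT CHARACTER. Consuming only the bytes seen so far on a broken
sequence is a design choice of this sketch, not the only valid policy.

```c
#include <stddef.h>
#include <stdint.h>

#define CP_INVALID 0xFFFDu /* U+FFFD REPLACEMENT CHARACTER */

/* Decode one code point from s[0..n-1], with n > 0. Stores the result
 * in *cp and returns the number of bytes consumed (at least 1).
 * Malformed input (bad lead byte, missing continuation, overlong form,
 * surrogate, value above U+10FFFF) yields CP_INVALID. */
static size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	/* smallest code point allowed for each sequence length */
	static const uint32_t min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t len, i;
	uint32_t c;

	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2; c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; c = s[0] & 0x07;
	} else {
		/* stray continuation byte or invalid lead byte */
		*cp = CP_INVALID;
		return 1;
	}

	for (i = 1; i < len; i++) {
		if (i >= n || (s[i] & 0xC0) != 0x80) {
			/* truncated or broken sequence: skip what we saw */
			*cp = CP_INVALID;
			return i;
		}
		c = (c << 6) | (s[i] & 0x3F);
	}

	if (c < min[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		c = CP_INVALID;
	*cp = c;
	return len;
}
```

Everything above this function then only ever sees valid code points, and
U+FFFD travels through the upper layers like any other code point.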
> ...
>> - can we normalize on the fly an "extended grapheme cluster"?
>
> Yes, but don't worry about that too much as we don't need normalization
> as much as you probably think.
Agreed. As far as I can tell, with my limited knowledge of Unicode, it
would only really be required by the EGC renderer, in order to help
"rendering correctness".
Additionally, pathological EGCs with tons of combining code points (fewer than
32 though) will likely be "compressed" by this normalization.
> ...
regards,
--
Sylvain
Received on Fri Sep 28 2018 - 15:38:03 CEST