On Fri, 28 Sep 2018 02:05:20 +0000
sylvain.bertrand_AT_gmail.com wrote:
Dear Sylvain,
> Agreed: the "atom" would be this "extended grapheme cluster", and
> from this point of view, a terminal would be a grid of "space" and
> "extended grapheme".
yes exactly.
> Unfortunately, I am still working out some issues before suing
> the French administration for that...
Kudos for that! It takes a lot of strength to rise up against
bureaucratic structures, and sites breaking massively without
JavaScript is a big problem.
> > This is not a bash or anything but really just due to the fact that
> > all this processing on higher layers is a question of efficiency,
> > especially when e.g. the UNIX system tools are used with plain ASCII
> > data 99% of the time, not requiring all the UTF-8 processing.
>
> For pure system tools ofc. But then I would need an i18n terminal for
> mutt, lynx, etc.
It also depends on the application. E.g. cat(1) is relatively
agnostic about the higher levels of the stream, but as soon as a
tool has to count "characters" it gets complicated.
Especially for system tools, I question the need for NFD. I'd
prefer a byte-by-byte comparison over a comparison in the sense of
human-language interpretation.
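To make that concrete, a tiny self-contained example: the
precomposed and the decomposed form of "é" render identically, but
a byte-wise comparison keeps them apart, which is exactly the
behaviour I want from system tools:

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        const char *nfc = "\xC3\xA9";  /* U+00E9 precomposed e-acute */
        const char *nfd = "e\xCC\x81"; /* 'e' + U+0301 combining acute */

        /* byte-wise: different strings, no language interpretation */
        printf("equal: %d\n", strcmp(nfc, nfd) == 0); /* equal: 0 */
        return 0;
    }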
> Well, there is something about stream-safe unicode handling.
> Basically, it is a buffer of 128 bytes (32 unicode code points)
> with a continuation mark if an "extended grapheme cluster" is not
> finished at the end of the buffer. It seems related only to
> stream normalization on the fly, though.
At this point we just need to question this insanity. As I like to
say jokingly, even a language with very delicate diacritics would
not have grapheme clusters longer than, say, 10 code points.
Anything above 10-20 elements screams Unicode exploit (remember
those "Zalgo" accent-trees that used to flood online chats?), and
it would definitely be enough to have a fixed-size buffer
(configurable in config.h) for grapheme clusters, as in the sketch
below.
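A rough sketch of what I mean (CLUSTERSIZ and cluster_append are
made up for illustration): anything beyond the limit is simply
refused, which defuses the flooding case:

    #include <stddef.h>
    #include <stdint.h>

    #define CLUSTERSIZ 16 /* code points per cluster, from config.h */

    typedef struct {
        uint32_t cp[CLUSTERSIZ]; /* code points of one cluster */
        size_t len;              /* number of slots used */
    } Cluster;

    /* Append a code point to a cluster; refuse once the limit is
     * hit, so degenerate accent-trees cannot grow without bound. */
    int
    cluster_append(Cluster *c, uint32_t cp)
    {
        if (c->len >= CLUSTERSIZ)
            return -1; /* caller may drop it or force a break */
        c->cp[c->len++] = cp;
        return 0;
    }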
> I did not go that deep into the "extended grapheme cluster"
> boundary computation, it seems that everything we need is there,
> but it raises many more questions, for instance:
It's simple enough. If I find the time I'll make a repo for it.
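In the meantime, to give an idea of the shape, here is a
deliberately incomplete sketch covering just two of the UAX #29
rules; is_extend() is a crude approximation, a real implementation
generates its property tables from GraphemeBreakProperty.txt:

    #include <stdint.h>

    /* grossly simplified Grapheme_Cluster_Break=Extend test */
    static int
    is_extend(uint32_t cp)
    {
        return (cp >= 0x0300 && cp <= 0x036F) /* combining diacritics */
            || cp == 0x200D;                  /* zero width joiner */
    }

    /* Is there a cluster boundary between code points a and b?
     * Only GB3 (CR x LF) and GB9 (x Extend/ZWJ) are shown; the
     * default rule GB999 breaks everywhere else. */
    int
    is_boundary(uint32_t a, uint32_t b)
    {
        if (a == 0x000D && b == 0x000A)
            return 0; /* GB3 */
        if (is_extend(b))
            return 0; /* GB9 */
        return 1; /* GB999 */
    }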
> - how resilient is this finite state machine to garbage data?
Depends on the level. A safe UTF-8 decoder catches garbage on its
level and replaces it with the Unicode replacement character U+FFFD.
On higher levels, everything is within the bounds of the Unicode
spec and U+FFFD is just another code point.
> - can we locate "extended grapheme cluster" boundaries on
>   non-normalized unicode?
Sure! :) The UAX #29 boundary rules are defined on plain code
point sequences, independent of any normalization form.
> - can we normalize an "extended grapheme cluster" on the fly?
Yes, but don't worry about that too much as we don't need normalization
as much as you probably think.
With best regards
Laslo
--
Laslo Hunhold <dev_AT_frign.de>