Re: [dev] [st] Proposal of changing internal representation

From: Christoph Lohmann <20h_AT_r-36.net>
Date: Sun, 24 Aug 2014 16:56:33 +0200

Greetings.

On Sun, 24 Aug 2014 16:56:33 +0200 "Roberto E. Vargas Caballero" <k0ga_AT_shike2.com> wrote:
> If the character is multibyte, we decode it again! So for
> multibyte characters we:
>
> - decode
> - encode
> - decode

These steps aren't as slow as you might think.
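
For a sense of scale, here is a rough sketch of what decoding one
sequence involves. It is not st's utf8decode: the name decode1 and its
signature are made up for illustration, and it assumes well-formed input
of at most 4 bytes, with no checks for overlong forms or surrogates.

```c
#include <stddef.h>

/* Hypothetical decoder, not st's utf8decode: reads one well-formed
 * UTF-8 sequence of up to 4 bytes from s, stores the code point in *u
 * and returns the number of bytes consumed (0 on a bad lead byte).
 * Continuation bytes are not validated. */
size_t
decode1(const char *s, long *u) {
	const unsigned char *p = (const unsigned char *)s;
	size_t i, len;

	if (p[0] < 0x80) {               /* 0xxxxxxx: ASCII */
		*u = p[0];
		return 1;
	} else if ((p[0] & 0xE0) == 0xC0) { /* 110xxxxx */
		*u = p[0] & 0x1F;
		len = 2;
	} else if ((p[0] & 0xF0) == 0xE0) { /* 1110xxxx */
		*u = p[0] & 0x0F;
		len = 3;
	} else if ((p[0] & 0xF8) == 0xF0) { /* 11110xxx */
		*u = p[0] & 0x07;
		len = 4;
	} else {
		return 0;                /* continuation or invalid lead byte */
	}

	/* Each continuation byte contributes its low 6 bits. */
	for (i = 1; i < len; i++)
		*u = (*u << 6) | (p[i] & 0x3F);
	return len;
}
```

That is one branch on the lead byte plus at most three shift-and-mask
steps per character.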

> It is slow and really ugly. But this problem is not limited to
> tputc. We have a function utf8len:
>
> size_t
> utf8len(char *c) {
> 	return utf8decode(c, &(long){0}, UTF_SIZ);
> }
>
> That decodes the string again, because in some places we need the
> size of the utf8 string.

Look at how »decoding« is done. If you really think this slows down st,
then utf8len can be optimized further. Decoding isn’t as heavy as you
think compared to the 32‐bit burden you would add, which is illogical
as well.
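
One possible shape for such an optimization, as a sketch only: the byte
length of a sequence can be read off its lead byte without decoding the
code point. The name utf8len_lead is hypothetical and not part of st,
and the code assumes the buffer holds well-formed UTF‐8 of at most 4
bytes per character.

```c
#include <stddef.h>

/* Hypothetical helper, not part of st: length in bytes of the UTF-8
 * sequence starting at s, derived from the lead byte only.
 * Returns 0 for a continuation byte or an invalid lead byte. */
size_t
utf8len_lead(const char *s) {
	unsigned char b = (unsigned char)s[0];

	if (b < 0x80)
		return 1;       /* 0xxxxxxx: ASCII */
	if ((b & 0xE0) == 0xC0)
		return 2;       /* 110xxxxx */
	if ((b & 0xF0) == 0xE0)
		return 3;       /* 1110xxxx */
	if ((b & 0xF8) == 0xF0)
		return 4;       /* 11110xxx */
	return 0;
}
```

Unlike a full decode, this never touches the continuation bytes, so it
cannot detect a truncated or malformed sequence.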

> I think we should decode the utf8 character on input, store it
> as raw unicode in 4 bytes, and encode it again on output (usually in
> getsel or in printer functions). The memory usage is going to be the
> same, because we store the utf8 string with 'char c[UTF_SIZ]', where
> UTF_SIZ is 4 (although it should be bigger, because if we accept
> 32-bit unicode then we can receive utf8 strings of 6 bytes).

This is exactly the reason why st keeps this internal representation: to
adapt to future expansions of UTF‐8, no matter what any crippled
standard says. If you adapt to a string with a dynamically growing
number of bytes per character, you end up with a meta format like UTF‐8
anyway.

As I said, if you think utf8len should be optimized, look at [0].
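
In the same spirit, and without reproducing the code from [0], here is a
minimal sketch of counting characters without decoding them. The name
utf8count is made up for illustration, and the input is assumed to be a
NUL‐terminated, well-formed UTF‐8 string.

```c
#include <stddef.h>

/* Hypothetical helper: number of code points in a NUL-terminated,
 * well-formed UTF-8 string. Continuation bytes have the form
 * 10xxxxxx, so every byte that does not match that pattern starts
 * a new character. */
size_t
utf8count(const char *s) {
	size_t n = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

The loop is one mask and one compare per byte, with no code point ever
being assembled.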

Another question arises from st’s UTF‐8 support: who will implement
normalisation? Will it be part of the new internal string
representation? As things stand, this question is easy to answer,
because st keeps the raw representation: helper functions take care of
it when it’s needed.

UTF‐32 (UTF‐16 is a joke) is a disease, fight it.


Sincerely,

Christoph Lohmann

[0] http://canonical.org/~kragen/strlen-utf8.html
Received on Sun Aug 24 2014 - 16:56:33 CEST
