Re: [dev] [st] Proposal of changing internal representation

From: Roberto E. Vargas Caballero <>
Date: Sun, 24 Aug 2014 19:52:16 +0200

> > - decode
> > - encode
> > - decode
> These steps aren't as slow as you might think.

I have already said in another mail that they are not a bottleneck,
and we are not going to improve the performance of st. It's only
about making the code better.

> Look at how "decoding" is done. If you really think this slows down st
> then utf8len can be optimized further. Decoding isn't as heavy as you
> think in comparison to the 32 bit burden you add and its illogic.

Sorry, but I don't understand what you mean here.

> > I think we should decode the utf8 character in the input, store it
> > in raw unicode with 4 bytes, and encode again in output (usually in
> > getsel or in printer functions). The memory usage is going to be the
> > same, because we store the utf8 string with 'char c[UTF_SIZ]', where
> > UTF_SIZ is 4 (although it should be bigger because if we accept
> > unicode of 32 bits then we can receive utf8 strings of 6 bytes).
> This is exactly the reason why st keeps this internal representation: to
> adapt to future expansions of UTF-8, no matter what any crippled standard
> says. If you adapt to a dynamically growing bytes per char string
> you end up with a meta format like UTF-8 too.

If we only decode in one place, and encode in one place, then
adapting to new standards only requires modifying a typedef and the
encode/decode routines (which would have to be modified anyway).
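
A minimal sketch of that layout, to make the point concrete. The names
Rune, RUNE_INVALID, rune_decode and rune_encode are hypothetical and are
not taken from st's code; the sketch also assumes the 4-byte UTF-8 limit
(code points up to U+10FFFF), not the 6-byte form mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cell type: supporting a wider range later would mean
 * changing this typedef and the two routines below, nothing else. */
typedef uint_least32_t Rune;

#define RUNE_INVALID 0xFFFD /* U+FFFD REPLACEMENT CHARACTER */

/* Decode one UTF-8 sequence from s (n bytes available) into *u.
 * Returns the number of bytes consumed, or 0 if more input is needed.
 * A malformed lead or continuation byte consumes 1 byte and yields
 * RUNE_INVALID; an overlong or out-of-range value yields RUNE_INVALID
 * for the whole sequence. */
size_t
rune_decode(const unsigned char *s, size_t n, Rune *u)
{
	static const Rune min[] = {0, 0, 0x80, 0x800, 0x10000};
	size_t len, i;
	Rune cp;

	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		*u = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2; cp = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; cp = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; cp = s[0] & 0x07;
	} else {
		*u = RUNE_INVALID;
		return 1;
	}
	if (n < len)
		return 0; /* incomplete sequence, wait for more bytes */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80) {
			*u = RUNE_INVALID;
			return 1;
		}
		cp = (cp << 6) | (s[i] & 0x3F);
	}
	/* Reject overlong forms, surrogates and values above U+10FFFF. */
	if (cp < min[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		cp = RUNE_INVALID;
	*u = cp;
	return len;
}

/* Encode u into s, which must have room for 4 bytes.
 * Returns the number of bytes written, or 0 if u is above U+10FFFF. */
size_t
rune_encode(Rune u, unsigned char *s)
{
	if (u < 0x80) {
		s[0] = (unsigned char)u;
		return 1;
	}
	if (u < 0x800) {
		s[0] = (unsigned char)(0xC0 | (u >> 6));
		s[1] = (unsigned char)(0x80 | (u & 0x3F));
		return 2;
	}
	if (u < 0x10000) {
		s[0] = (unsigned char)(0xE0 | (u >> 12));
		s[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
		s[2] = (unsigned char)(0x80 | (u & 0x3F));
		return 3;
	}
	if (u <= 0x10FFFF) {
		s[0] = (unsigned char)(0xF0 | (u >> 18));
		s[1] = (unsigned char)(0x80 | ((u >> 12) & 0x3F));
		s[2] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
		s[3] = (unsigned char)(0x80 | (u & 0x3F));
		return 4;
	}
	return 0;
}
```

With this shape the input path would call rune_decode once per character
and store the Rune in the cell, and only the output paths would call
rune_encode. The return value of rune_decode is already the byte length
of the sequence, so no separate length pass is needed.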

> As said, if you think utf8len should be optimized, look at [0].

This is a modification we can apply, of course. However, I think that
if we move the representation to UTF-32 we are not going to need
utf8len, because the length of the utf8 character will be calculated
during the conversion.

> Another question arises from the st UTF-8 support: Who will implement
> the normalisation? Will it be included in the new internal string
> representation? Now this question can be easily answered because st keeps
> the raw representation. Helper functions take care of it, when it's needed.

I was thinking of taking the value that utf8decode generates, which
at the moment is the value we use as the utf8 string, not the original,
because of the decode/encode pair that runs before calling tputc.

> UTF-32 (UTF-16 is a joke) is a disease, fight it.

This was my idea, and I don't see what the problems with UTF-32
are here. Please let me know what they are.


Roberto E. Vargas Caballero
Received on Sun Aug 24 2014 - 19:52:16 CEST
