Re: [dev] [st] Proposal of changing internal representation

From: <>
Date: Sat, 23 Aug 2014 21:34:19 +0200

My 2 cents,

On Sat, Aug 23, 2014 at 05:35:54PM +0200, Roberto E. Vargas Caballero wrote:
> I was trying a patch when I realized the problems we have in
> st about the internal representation. Take a look at this loop:
> ptr = buf;
> while((charsize = utf8decode(ptr, &unicodep, buflen))) {
> 	utf8encode(unicodep, s, UTF_SIZ);
> 	tputc(s, charsize);
> 	ptr += charsize;
> 	buflen -= charsize;
> }

Why not use the already-encoded bytes at ptr? The utf8encode call
seems redundant here, and utf8decode is only there to get the size; I
try to address this further down in my reply.
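For illustration, a self-contained sketch of that loop shape; utf8size()
and feed() are hypothetical stand-ins, not st functions, and the size
here is taken from the leading byte alone:

```c
#include <stddef.h>

/* Byte length of the UTF-8 sequence starting at c, derived from the
 * leading byte alone (0 for an invalid lead byte).  A stand-in for
 * the size that st currently obtains via a full utf8decode(). */
size_t
utf8size(const char *c)
{
	unsigned char b = (unsigned char)*c;

	if (b < 0x80)           return 1;
	if ((b & 0xE0) == 0xC0) return 2;
	if ((b & 0xF0) == 0xE0) return 3;
	if ((b & 0xF8) == 0xF0) return 4;
	return 0;
}

/* The proposed loop shape: hand the consumer the original bytes at
 * ptr instead of re-encoding unicodep.  emit() would be tputc() in
 * st; this version just counts the characters fed through. */
size_t
feed(const char *buf, size_t buflen)
{
	const char *ptr = buf;
	size_t charsize, nchars = 0;

	while (buflen && (charsize = utf8size(ptr)) && charsize <= buflen) {
		/* emit(ptr, charsize);  -- original bytes, no re-encode */
		nchars++;
		ptr += charsize;
		buflen -= charsize;
	}
	return nchars;
}
```

If the code point itself is still needed (for wcwidth etc.), the loop
can keep calling utf8decode, but the bytes passed on stay the ones
already in the buffer.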

> For every unicode character we decode it, because we need to know
> how much input was consumed and the size of the utf8 string, but
> after decoding we encode it again in utf8 because tputc needs the
> utf8 string.
> This part is ugly, but look at the beginning of tputc:
> if(len == 1) {
> 	width = 1;
> 	unicodep = ascii = *c;
> } else {
> 	utf8decode(c, &unicodep, UTF_SIZ);
> 	width = wcwidth(unicodep);
> 	control = ISCONTROLC1(unicodep);
> 	ascii = unicodep;
> }

I can't say much about this; another way to do it would be to make
ISCONTROLC1 act on the utf8 representation directly.
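A sketch of that idea: the C1 controls are U+0080..U+009F, which encode
in UTF-8 as exactly 0xC2 followed by 0x80..0x9F, so the test can look at
the raw bytes. iscontrolc1_utf8 is a hypothetical helper, not st's
actual macro:

```c
#include <stdbool.h>
#include <stddef.h>

/* st's ISCONTROLC1 tests the decoded code point (0x80..0x9f).  The
 * same test on the raw UTF-8 bytes: a C1 control is the two-byte
 * sequence 0xC2 0x80 .. 0xC2 0x9F.  Sketch only, not st code. */
bool
iscontrolc1_utf8(const char *c, size_t len)
{
	return len >= 2 &&
	       (unsigned char)c[0] == 0xC2 &&
	       (unsigned char)c[1] >= 0x80 &&
	       (unsigned char)c[1] <= 0x9F;
}
```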

> If the character is multibyte, we decode it again! So for
> multibyte characters we:
> - decode
> - encode
> - decode
> It is slow and really ugly. And we have this problem not only in
> tputc; we also have the function utf8len:
> size_t
> utf8len(char *c) {
> 	return utf8decode(c, &(long){0}, UTF_SIZ);
> }
> That decodes the string again, because in some places we need
> the size of the utf8 string.

This is not a very efficient implementation. We could use
utf8decodebyte, as is done at the beginning of utf8decode, to make it
faster. The only other thing utf8decode does here is validate the
encoding; is that necessary?
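What that could look like, using tables shaped like st's
utfmask[]/utfbyte[] pair (utf8len_fast is a hypothetical name; it
classifies only the leading byte and does not validate the continuation
bytes, which is exactly the trade-off in question):

```c
#include <stddef.h>

/* Leading-byte classification in the style of st's utf8decodebyte():
 * match the first byte against mask/value pairs and return the
 * sequence length directly, skipping the full decode-and-validate.
 * Index 0 matches a continuation byte, i.e. an invalid start. */
static const unsigned char utfmask[] = {0xC0, 0x80, 0xE0, 0xF0, 0xF8};
static const unsigned char utfbyte[] = {0x80, 0x00, 0xC0, 0xE0, 0xF0};

size_t
utf8len_fast(const char *c)
{
	unsigned char b = (unsigned char)*c;
	size_t i;

	for (i = 0; i < sizeof(utfmask); i++)
		if ((b & utfmask[i]) == utfbyte[i])
			return i; /* 0 = continuation byte, not a start */
	return 0; /* 0xF8.. : invalid leading byte */
}
```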

> I think we should decode the utf8 character on input, store it
> as raw unicode in 4 bytes, and encode it again on output (usually in
> getsel or in the printer functions). The memory usage is going to be
> the same, because we currently store the utf8 string in
> 'char c[UTF_SIZ]', where UTF_SIZ is 4 (although it should be bigger,
> because if we accept 32-bit unicode then we can receive utf8
> sequences of up to 6 bytes).

If you think this will simplify the code and prevent future bugs, it
is a good idea. In terms of performance, you need to profile to be
sure that there is indeed a problem here. Stupid code is stupid, but
sometimes it works nicely! Besides, it's not as if st is doing poorly
on the performance side (cf. recent text editors embedded in
full-blown web browsers), and we don't expect a lot from a terminal
emulator anyway.
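For reference, a minimal sketch of what the proposed representation
could look like; the names (Rune, Glyph fields) are illustrative, not
st's actual structs:

```c
#include <stdint.h>

/* Store each cell as a decoded code point instead of a
 * char c[UTF_SIZ] buffer.  A 32-bit rune occupies the same 4 bytes,
 * so the memory usage is unchanged. */
typedef uint_least32_t Rune;

typedef struct {
	Rune u;              /* decoded code point */
	unsigned short mode; /* attribute flags */
	unsigned long fg;    /* foreground color */
	unsigned long bg;    /* background color */
} Glyph;
```

Encoding back to UTF-8 would then happen only on output paths such as
getsel or the printer functions, as proposed above.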

Have fun if you start this.


-- Q
Received on Sat Aug 23 2014 - 21:34:19 CEST

This archive was generated by hypermail 2.3.0 : Sat Aug 23 2014 - 21:36:07 CEST