Hi,
I was working on a patch when I realized the problems we have in
st with the internal representation. Take a look at this loop:
	ptr = buf;
	while((charsize = utf8decode(ptr, &unicodep, buflen))) {
		utf8encode(unicodep, s, UTF_SIZ);
		tputc(s, charsize);
		ptr += charsize;
		buflen -= charsize;
	}
For every unicode character we decode it, because we need to know how
much of the input was consumed and the size of the utf8 sequence, but
after decoding we encode it again in utf8, because tputc expects the
utf8 string.
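Just to make the waste concrete: if tputc took the decoded codepoint
directly (a signature it does not have today, so take this only as a
sketch of the idea), the encode step would disappear completely:

	ptr = buf;
	while((charsize = utf8decode(ptr, &unicodep, buflen))) {
		tputc(unicodep);	/* no utf8encode round trip */
		ptr += charsize;
		buflen -= charsize;
	}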
The current code is ugly enough, but look at the beginning of tputc:
	if(len == 1) {
		width = 1;
		unicodep = ascii = *c;
	} else {
		utf8decode(c, &unicodep, UTF_SIZ);
		width = wcwidth(unicodep);
		control = ISCONTROLC1(unicodep);
		ascii = unicodep;
	}
If the character is multibyte, we decode it again! So for every
multibyte character we:
- decode
- encode
- decode
It is slow and really ugly. And we have this problem not only in
tputc; we also have the function utf8len:
size_t
utf8len(char *c) {
	return utf8decode(c, &(long){0}, UTF_SIZ);
}
which decodes the string yet again, because in some places we only
need the size of the utf8 sequence.
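For the size a full decode is not even necessary: the leading byte
alone already encodes the length of the sequence. Something like this
would do (my own sketch, not code that exists in st, and it skips the
validation of the continuation bytes that utf8decode performs):

size_t
utf8len(const char *c) {
	unsigned char b = *c;

	if(b < 0x80)
		return 1;		/* plain ASCII byte */
	if((b & 0xE0) == 0xC0)
		return 2;		/* 110xxxxx */
	if((b & 0xF0) == 0xE0)
		return 3;		/* 1110xxxx */
	if((b & 0xF8) == 0xF0)
		return 4;		/* 11110xxx */
	return 0;			/* invalid leading byte */
}

But with the representation I propose below, utf8len would not be
needed at all.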
I think we should decode the utf8 character on input, store it as a
raw unicode value in 4 bytes, and encode it again on output (usually
in getsel or in the printer functions). The memory usage is going to
be the same, because today we store the utf8 string in 'char
c[UTF_SIZ]', where UTF_SIZ is 4 (although it should be bigger,
because if we accept 32-bit unicode values then we can receive utf8
sequences of up to 6 bytes).
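Concretely, I mean something like this (only a sketch: the Rune name
and the exact fields are my suggestion, not existing code):

#include <stdint.h>

typedef uint_least32_t Rune;	/* one decoded codepoint, 4 bytes */

typedef struct {
	Rune u;		/* replaces char c[UTF_SIZ] */
	/* mode, fg, bg as before */
} Glyph;

With this, getsel and the printer functions become the only callers
of utf8encode, and a 4-byte Rune covers the full 31-bit range that
even a 6-byte utf8 sequence can carry.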
I would like to hear the opinions of other suckless developers
before beginning these modifications. What do you think, guys?
Regards,
--
Roberto E. Vargas Caballero