Re: [dev] [st] Proposal of changing internal representation


From: Silvan Jegen <s.jegen_AT_gmail.com>
Date: Sat, 23 Aug 2014 18:15:47 +0200

Hi

On Sat, Aug 23, 2014 at 05:35:54PM +0200, Roberto E. Vargas Caballero wrote:
> [...]
>
> If the character is multibyte, we decode it again! So for
> multibyte characters we:
>
> - decode
> - encode
> - decode
>
> It is slow and really ugly. But we have this problem not only in
> tputc. We have a function utf8len:
>
>
> size_t
> utf8len(char *c) {
> 	return utf8decode(c, &(long){0}, UTF_SIZ);
> }
>
> That decodes the string again, because in some places we need the
> size of the utf8 string.
>
> I think we should decode the utf8 character in the input, store it
> in raw unicode with 4 bytes, and encode again in output (usually in
> getsel or in printer functions). The memory usage is going to be the
> same, because we store the utf8 string with 'char c[UTF_SIZ]', where
> UTF_SIZ is 4 (although it should be bigger because if we accept
> unicode of 32 bits then we can receive utf8 strings of 6 bytes).

Reducing the number of (unnecessary) decode/encode steps is definitely
the way to go.
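
A rough sketch of what "decode once on input, encode once on output"
could look like. All names below (Cell, cell_decode, cell_encode) are
made up for illustration and are not st's actual types or functions;
the only assumption is that a cell holds one 32-bit code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cell type: stores one decoded code point instead of
 * the raw UTF-8 bytes. Still 4 bytes per character. */
typedef struct {
	uint32_t u;
} Cell;

/* Decode one UTF-8 sequence from s (at most n bytes) into *u.
 * Returns the number of bytes consumed, or 0 on invalid or
 * truncated input. */
size_t
cell_decode(const unsigned char *s, size_t n, uint32_t *u) {
	/* smallest code point allowed for each sequence length */
	static const uint32_t min[] = {0, 0, 0x80, 0x800, 0x10000};
	size_t len, i;
	uint32_t cp;

	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		len = 1; cp = s[0];
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2; cp = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; cp = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; cp = s[0] & 0x07;
	} else {
		return 0;
	}
	if (len > n)
		return 0;
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		cp = (cp << 6) | (s[i] & 0x3F);
	}
	/* reject overlong forms, surrogates and values above U+10FFFF */
	if (cp < min[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return 0;
	*u = cp;
	return len;
}

/* Encode u into out (room for at least 4 bytes).
 * Returns the number of bytes written, or 0 if u is not encodable. */
size_t
cell_encode(uint32_t u, unsigned char *out) {
	if (u <= 0x7F) {
		out[0] = (unsigned char)u;
		return 1;
	}
	if (u <= 0x7FF) {
		out[0] = (unsigned char)(0xC0 | (u >> 6));
		out[1] = (unsigned char)(0x80 | (u & 0x3F));
		return 2;
	}
	if (u >= 0xD800 && u <= 0xDFFF)
		return 0;
	if (u <= 0xFFFF) {
		out[0] = (unsigned char)(0xE0 | (u >> 12));
		out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (u & 0x3F));
		return 3;
	}
	if (u <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (u >> 18));
		out[1] = (unsigned char)(0x80 | ((u >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (u & 0x3F));
		return 4;
	}
	return 0;
}
```

With something like this, the input path would call cell_decode once
and keep only Cell.u, and functions that need bytes again (selection,
printing) would call cell_encode at that point.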

I do not think that we need more than 4 bytes for UTF-8, however.
Since RFC 3629 (November 2003), UTF-8 is restricted to the code points
U+0000 through U+10FFFF, the same range that UTF-16 can encode, so an
encoded character is never longer than 4 bytes. See here:

http://en.wikipedia.org/wiki/UTF-8#Description
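
To make the 4-byte bound concrete, here is a minimal sketch of the
length rule. The helper name utf8_encoded_len is hypothetical and not
part of st; it only mirrors the code point ranges of the current UTF-8
definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from st: number of bytes needed to encode
 * code point cp in UTF-8. Returns 0 for values that cannot be encoded
 * (UTF-16 surrogates and anything above U+10FFFF). */
size_t
utf8_encoded_len(uint32_t cp) {
	if (cp <= 0x7F)
		return 1;
	if (cp <= 0x7FF)
		return 2;
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;
	if (cp <= 0xFFFF)
		return 3;
	if (cp <= 0x10FFFF)
		return 4;
	return 0;
}
```

No valid input makes this return more than 4, so a 4-byte buffer is
enough for any single encoded character.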

Cheers,

Silvan
Received on Sat Aug 23 2014 - 18:15:47 CEST
