Re: [dev] [sbase][RFC] Add a simplistic version of tr

From: Silvan Jegen <s.jegen_AT_gmail.com>
Date: Sat, 30 Nov 2013 12:15:13 +0100

On Thu, Nov 28, 2013 at 01:24:40PM -0500, Strake wrote:
> [..]
>
> > UTF-32 is an encoding that is identical to the Unicode code point, as
> > far as I know. So what I am thinking is that one would use either the
> > UTF-8 representation of the code point as an index, or the code point
> > itself. Since using UTF-8 would not require any conversion (on UTF-8
> > locales), I think it would be preferable.
>
> UTF-8 has variable width, so one must find the length of the sequence
> anyhow and shift it bytewise into an integer, so one may as well just
> use fgetwc or the like and work with codepoints.

You are right about the variable width.

According to the standard (RFC 3629), a UTF-8 sequence is at most 4 bytes
long, which would fit into an int on most platforms (C itself only
guarantees 16 bits for int, so uint32_t would be the safer type), so
shifting would not be necessary, I think.

I am not too familiar with C, but wouldn't it theoretically be possible
to figure out the length of a UTF-8 sequence, copy only the bytes of that
sequence into an int, and use it as an index into a sparse array of
wchar_t/uint32_t values?
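
A minimal sketch of what I have in mind, untested against sbase. The
names utf8_seqlen and utf8_key are made up for illustration. It assumes
the input is already valid UTF-8 (no check for overlong forms or for
the continuation bytes), and because it copies raw bytes instead of
shifting, the resulting key depends on the byte order of the host:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length of a UTF-8 sequence judged from its lead byte alone.
 * Returns 0 for a byte that cannot start a sequence
 * (continuation bytes 0x80-0xBF and bytes 0xF8 and above). */
static size_t
utf8_seqlen(unsigned char lead)
{
	if (lead < 0x80)
		return 1;               /* 0xxxxxxx */
	if ((lead & 0xE0) == 0xC0)
		return 2;               /* 110xxxxx */
	if ((lead & 0xF0) == 0xE0)
		return 3;               /* 1110xxxx */
	if ((lead & 0xF8) == 0xF0)
		return 4;               /* 11110xxx */
	return 0;
}

/* Copy the raw bytes of one sequence (len <= 4) into a zeroed
 * uint32_t. No shifting, but the value differs between little-
 * and big-endian hosts, so it is only usable as a local key. */
static uint32_t
utf8_key(const unsigned char *s, size_t len)
{
	uint32_t key = 0;

	memcpy(&key, s, len);
	return key;
}
```

The same sequence always yields the same key on a given machine, which
is all a lookup table needs. A shifting loop would give a portable value
at the cost of a few more instructions.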

Obviously, having a sparse array that is backed by only a fraction of the
nominally requested memory would be crucial, because the lead byte of a
4-byte UTF-8 sequence has its top four bits set (11110xxx), so the
resulting indices can be very large.
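
One way to get such a sparse array without relying on the OS to lazily
back a huge allocation is a two-level table. The following is only a
sketch under my own assumptions: the names sparse_set and sparse_get and
the 16-bit page size are arbitrary, and unmapped keys are taken to map
to themselves, as tr would want. Note that the top-level pointer table
alone takes 512 KiB on a 64-bit machine.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BITS 16
#define PAGE_SIZE (1u << PAGE_BITS)

/* Top level: one pointer per 64 Ki keys; NULL means that no key
 * in that range has been mapped yet. */
static uint32_t *pages[1u << (32 - PAGE_BITS)];

/* Map key to val, allocating the containing page on first use.
 * Returns 0 on success, -1 if the allocation fails. */
static int
sparse_set(uint32_t key, uint32_t val)
{
	uint32_t hi = key >> PAGE_BITS;
	uint32_t lo = key & (PAGE_SIZE - 1);
	uint32_t i;

	if (!pages[hi]) {
		pages[hi] = malloc(PAGE_SIZE * sizeof(uint32_t));
		if (!pages[hi])
			return -1;
		/* a fresh page maps every key in its range to itself */
		for (i = 0; i < PAGE_SIZE; i++)
			pages[hi][i] = (hi << PAGE_BITS) | i;
	}
	pages[hi][lo] = val;
	return 0;
}

/* Look up key; keys on pages that were never allocated map to
 * themselves. */
static uint32_t
sparse_get(uint32_t key)
{
	uint32_t *p = pages[key >> PAGE_BITS];

	return p ? p[key & (PAGE_SIZE - 1)] : key;
}
```

Only the pages that actually contain a mapping cost memory (256 KiB
each here), so a tr invocation that touches a handful of characters
stays small even though the key space is the full 32 bits.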
Received on Sat Nov 30 2013 - 12:15:13 CET
