Re: [dev] [sbase][RFC] Add a simplistic version of tr

From: Silvan Jegen <s.jegen_AT_gmail.com>
Date: Sat, 30 Nov 2013 12:04:27 +0100

On Thu, Nov 28, 2013 at 07:01:17PM +0000, Thorsten Glaser wrote:
> Silvan Jegen dixit:
>
> >If I understand correctly you would use mmap to allocate a sparse
> >memory area into which we could then directly index (either using
> >UTF-8 or UTF-32 indices), right? Since mmap needs a file descriptor
>
> I think that wouldn’t help much.

Intuitively I would say it should help quite a lot: we usually map no
more than a few characters with tr (well, at least I do), so the memory
area would be very sparsely populated.
Implementing both an mmap and a non-mmap version of the code and
comparing their memory usage should not be too hard to do, however.
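
A minimal sketch of what the mmap variant could look like, assuming an
anonymous private mapping (MAP_ANONYMOUS is widely available on Linux
and the BSDs but was not in POSIX at the time) and assuming the kernel
only backs pages that are actually written to. The names sparse_new,
sparse_set, sparse_get and sparse_free are made up for illustration and
are not part of sbase:

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define NCODEPOINTS 0x110000UL /* 1,114,112 code points */

/* Reserve one slot per code point; untouched pages read as zero. */
static uint32_t *
sparse_new(void)
{
	void *p = mmap(NULL, NCODEPOINTS * sizeof(uint32_t),
	               PROT_READ | PROT_WRITE,
	               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Store to+1 so that a zero slot can mean "no mapping". */
static void
sparse_set(uint32_t *t, uint32_t from, uint32_t to)
{
	assert(from < NCODEPOINTS && to < NCODEPOINTS);
	t[from] = to + 1;
}

/* Unmapped code points translate to themselves. */
static uint32_t
sparse_get(const uint32_t *t, uint32_t cp)
{
	assert(cp < NCODEPOINTS);
	return t[cp] ? t[cp] - 1 : cp;
}

static void
sparse_free(uint32_t *t)
{
	munmap(t, NCODEPOINTS * sizeof(uint32_t));
}
```

The mapping reserves about 4.25 MiB of address space; how much of that
ends up resident is exactly what the comparison would have to measure.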


> >Sadly, I do not follow. I recognize that the lengths of those arrays
> >multiplied correspond to the maximum number of Unicode code points
> >(1,114,112) but I am not sure how the mapping (from UTF-8 or UTF-32
> >encoding) should be done. Care to enlighten me?
>
> Eh, &0xFF and >>8?

Bear with me for a moment, I am not used to bit twiddling :-)

So your suggestion is to convert the UTF-8 to the Unicode code point
(i.e. its UTF-32 value) and use that value shifted right by 8 bits as an
index into an array of pointers to 256-member arrays of wchar_t's (or
uint32_t's). The least significant byte of the code point can then be
extracted with a bitwise AND with 0xFF and used as an index into the
uint32_t/wchar_t array itself.
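
To check my understanding, here is a rough sketch of such a two-level
table. It assumes uint32_t code points, second-level arrays that are
allocated on first use and initialised to the identity mapping, and
hypothetical names (struct trmap, trmap_set, trmap_get) that do not
exist in sbase:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NPAGES   0x1100 /* 0x110000 code points / 256 = 4352 pages */
#define PAGESIZE 256    /* slots 0..255, selected by cp & 0xFF */

/* First level: one pointer per page; NULL means "nothing mapped here". */
struct trmap {
	uint32_t *page[NPAGES];
};

/* Record from -> to. Returns 0 on success, -1 on bad input or ENOMEM. */
static int
trmap_set(struct trmap *m, uint32_t from, uint32_t to)
{
	uint32_t hi = from >> 8, lo = from & 0xFF, i;

	assert(m != NULL);
	if (hi >= NPAGES)
		return -1;
	if (!m->page[hi]) {
		if (!(m->page[hi] = malloc(PAGESIZE * sizeof(uint32_t))))
			return -1;
		/* Every other slot of a fresh page maps to itself. */
		for (i = 0; i < PAGESIZE; i++)
			m->page[hi][i] = (hi << 8) | i;
	}
	m->page[hi][lo] = to;
	return 0;
}

/* Translate cp; code points without a page are returned unchanged. */
static uint32_t
trmap_get(const struct trmap *m, uint32_t cp)
{
	uint32_t hi = cp >> 8;

	assert(m != NULL);
	if (hi >= NPAGES || !m->page[hi])
		return cp;
	return m->page[hi][cp & 0xFF];
}
```

With a zero-initialised struct trmap (e.g. one with static storage
duration), mapping a handful of characters only allocates the few
1 KiB pages that those characters fall into.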

That sounds reasonable but requires that we convert UTF-8 to UTF-32,
which should not be strictly necessary when we only map one UTF-8 value
to another. I wonder whether there's an easy solution that would not
necessitate that conversion, but this may just be a premature
optimization...
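
For reference, the conversion itself is small. The following is only a
sketch of a strict decoder for a single UTF-8 sequence; utf8_decode is a
hypothetical helper written for this mail, not the interface of sbase's
own UTF-8 code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from s (n bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 if the input is
 * truncated, overlong, a surrogate, or above U+10FFFF.
 */
static size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	/* Smallest code point that legitimately needs len bytes. */
	static const uint32_t min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t len, i;
	uint32_t c;

	assert(s != NULL && cp != NULL);
	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		len = 1; c = s[0];
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2; c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; c = s[0] & 0x07;
	} else {
		return 0; /* stray continuation byte or invalid lead byte */
	}
	if (len > n)
		return 0;
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		c = (c << 6) | (s[i] & 0x3F); /* 6 payload bits per byte */
	}
	if (c < min[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;
	*cp = c;
	return len;
}
```

The decoded value is what would then be split with >>8 and &0xFF.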