Re: [dev] [sbase][RFC] Add a simplistic version of tr from Silvan Jegen on 2013-11-28 (dev mail list archive)

From: Silvan Jegen <s.jegen_AT_gmail.com>
Date: Thu, 28 Nov 2013 13:54:27 +0100

Thanks for the comments!

On Tue, Nov 26, 2013 at 11:40 PM, Thorsten Glaser <tg_AT_mirbsd.de> wrote:
> Strake dixit:
>>On 26/11/2013, Silvan Jegen <s.jegen_AT_gmail.com> wrote:
>>> If you would rather not take this version, what approach would
>>> you take for the character set mapping when using UTF-8?
>>
>>On Linux, one can easily make a sparse array with 1-page granularity
>>with mmap, and so simply use a (wchar_t []) or (Rune []), but I'm not
>>sure how portable this is.
>
> Pretty portable, and 2²¹ * sizeof(wchar_t)/CHAR_BITS is at best 2²⁵
> or 32 MiB, so this would even work.

If I understand correctly, you would use mmap to allocate a sparse
memory area that we could then index directly (using either UTF-8 or
UTF-32 values as indices), right? Since mmap takes a file descriptor
argument, I would need a "typed memory object" to use with mmap, which
can be obtained with posix_typed_mem_open:
http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_typed_mem_open.html
Those functions are POSIX, so I would assume they are reasonably
portable.

> But common, for Unicode, is to use the planes.
>
> struct {
> wchar_t foo[0x100];
> } *repl[0x1100];
>
> Do note that sizeof(wchar_t) may be 16, and that the OS’ own
> representation of wchar_t may not be Unicode, so the type would
> be semantically wrong.
>
> You might want to use uint32_t there.

Sadly, I do not follow. I recognize that the product of the lengths of
those two arrays corresponds to the total number of Unicode code points
(1,114,112), but I am not sure how the mapping (from the UTF-8 or
UTF-32 encoding) should be done. Care to enlighten me?
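
One possible reading of the struct quoted above, sketched as a
two-level table: the high bits of the code point select one of 0x1100
blocks and the low 8 bits select a slot inside that block. This is an
assumption about what was meant, not a confirmed design. The names
repl_set and repl_get, the uint32_t slots and the "target + 1" encoding
are hypothetical. Input is assumed to be already decoded to code points
(UTF-32 values); UTF-8 would need decoding first.

```c
#include <stdint.h>
#include <stdlib.h>

#define BLOCKSIZE 0x100   /* code points per block */
#define NBLOCKS   0x1100  /* 0x1100 * 0x100 = 0x110000 code points */

struct block {
	uint32_t to[BLOCKSIZE];  /* target + 1; 0 means "no mapping" */
};

/* NULL means nothing in that block has been mapped yet. */
static struct block *repl[NBLOCKS];

/* Map code point `from` to `to`; returns 0 on success, -1 on error. */
static int
repl_set(uint32_t from, uint32_t to)
{
	struct block **b;

	if (from >= (uint32_t)NBLOCKS * BLOCKSIZE)
		return -1;
	b = &repl[from >> 8];                /* high bits pick the block */
	if (*b == NULL && (*b = calloc(1, sizeof(**b))) == NULL)
		return -1;
	(*b)->to[from & 0xFF] = to + 1;      /* low 8 bits pick the slot */
	return 0;
}

/* Unmapped or out-of-range code points translate to themselves. */
static uint32_t
repl_get(uint32_t cp)
{
	const struct block *b;

	if (cp >= (uint32_t)NBLOCKS * BLOCKSIZE)
		return cp;
	b = repl[cp >> 8];
	if (b == NULL || b->to[cp & 0xFF] == 0)
		return cp;
	return b->to[cp & 0xFF] - 1;
}
```

Only blocks that actually contain a mapping are allocated (1 KiB each
with 4-byte slots), so a typical tr invocation touching a few ranges
would stay small. The pointer array itself costs 0x1100 pointers.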