Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold

From: Laslo Hunhold <>
Date: Thu, 16 Dec 2021 19:08:35 +0100

On Thu, 16 Dec 2021 02:45:54 -0800
Michael Forney <> wrote:

Dear Michael,

I know this thread is already long enough, but I have now taken the
time to read more deeply into the topic. Please read below, as we may
now be able to reach a conclusion.

> Both of these observations are true, but just because uint8_t is 8-bit
> and unsigned char is 8-bit doesn't mean that uint8_t == unsigned char.
> A C implementation can have implementation-defined extended integer
> types, so it is possible that it defines uint8_t as an 8-bit extended
> integer type, distinct from unsigned char (similar to how long long
> and long may be distinct 64-bit integer types). As far as I know, this
> would still be POSIX compliant.
> Yes, I believe this is a possibility.
> If you are assuming that unsigned char == uint8_t, I think you should
> just use unsigned char in your API. You could document the API as
> expecting one UTF-8 code unit per byte if you are worried about
> confusion regarding CHAR_BIT.

I found that _a lot_ of code relies on casting to and from (uint8_t *),
but this, as you already explained very well, breaks strict aliasing,
given that uint8_t is not a character type. This is not a problem in
practice, because only gcc enforces strict aliasing and uint8_t is
typedef'd to unsigned char in all (?) implementations, which lets
uint8_t inherit the aliasing exception. However, nothing stops an
implementer from defining a separate integer type for which this would
no longer work.
Many projects I found that cast to and from (uint8_t *) explicitly
disable strict aliasing with the flag -fno-strict-aliasing and thus
technically have no problem in this regard, but this is such a
technical subtlety that most users of the library, if we effectively
forced them to cast to and from (uint8_t *), would simply not know
about it.

Interestingly, there was even an internal discussion on the gcc
bugtracker[0] about this: they were considering adding an attribute
__attribute__((no_alias)) to the uint8_t typedef so it would
explicitly lose the aliasing exception.

There's a nice rant at [1] and a nice discussion at [2] about this
whole thing. And to be honest, at this point I still wasn't 100%
convinced.

What convinced me was how UTF-8 literals were added in C11: there you
can define explicit UTF-8 literals as u8"Hällö Wörld!", and they're of
type char[]. So even though char * is a bit ambiguous, we document
clearly that we expect a UTF-8 string. C11 goes even further and
accommodates us with ways to define such strings portably.

> Ah, okay, I see what you mean. To be honest I'm not really sure how
> something like file encoding and I/O would work on such a system, but
> I was assuming that files would contain one code unit per byte, rather
> than packing multiple code units into a single byte. For instance, on
> a hypothetical system with 9-bit bytes, I wouldn't expect a code unit
> to cross the byte boundary.

To also address this point, here's what we can do to make us all happy:

  1) Change the API to accept char *.
  2) Cast the pointers internally to (unsigned char *) for bitwise
     operations. We may do that, as the character types (char,
     unsigned char, signed char) may alias any object.
  3) Treat any code unit with a bit set above the lowest 8 (i.e. a
     value above 0xFF) as invalid. This is actually already covered
     by the implementation, as we have strict range checks.

Please take a look at the attached diff and let me know what you
think. Is this portable, and am I correct in assuming that we might
even handle chars wider than 8 bits properly?

There's just one open question: Do you know of a better way than to do

   (char *)(unsigned char[]){ 0xff, 0xef, 0xa0 }

to specify a literal char-array with specific bit-patterns?

With best regards and thanks again for your help and this very
interesting discussion!


