Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold

From: Laslo Hunhold <>
Date: Sun, 12 Dec 2021 08:59:04 +0100

On Sat, 11 Dec 2021 15:18:56 -0800
Michael Forney <> wrote:

Dear Michael,

> Just want to mention up front that all of below is what I believe to
> be true from my interpretation of the standard. I'm happy to be
> corrected if I am wrong about any of this.

thanks again for your elaborate response!

> Neither conversion is undefined behavior, but unsigned char values >
> CHAR_MAX converted to char is implementation defined.
> Conversion of a negative char value to unsigned char is defined by C99
> > Otherwise, if the new type is unsigned, the value is converted by
> > repeatedly adding or subtracting one more than the maximum value
> > that can be represented in the new type until the value is in the
> > range of the new type.
> Conversion of unsigned char values outside the range of char is
> implementation defined by C99
> > Otherwise, the new type is signed and the value cannot be
> > represented in it; either the result is implementation-defined or an
> > implementation-defined signal is raised.

On Sat, 11 Dec 2021 15:33:12 -0800
Michael Forney <> wrote:

> On 2021-12-11, Michael Forney <> wrote:
> > Conversion of unsigned char values outside the range of char is
> > implementation defined by C99
> >
> >> Otherwise, the new type is signed and the value cannot be
> >> represented in it; either the result is implementation-defined or
> >> an implementation-defined signal is raised.
> Also worth noting, this clause still remains even in the current C23
> draft, which requires two's complement. So, assuming that CHAR_MAX ==
> 127, (char)0xFD will continue to be implementation defined and might
> raise a signal. This is different from C++, which went a step further
> to define conversion between all integer types to be the unique value
> congruent modulo 2^N (where N is the number of bits of the destination
> type).

In [0] the gcc developers write in this regard: "For conversion to a
type of width N, the value is reduced modulo 2^N to be within range of
the type; no signal is raised."

However, gcc seems to be a bit pedantic when you convert a value that
is more than 2^N "away" from the signed range: it probably assumes you
made a mistake and warns about it.
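
To make the unsigned -> signed direction concrete, here is a minimal
sketch of how one could avoid the implementation-defined conversion
altogether. The helper name octet_to_schar is made up for
illustration; it assumes that int is wider than char and that
SCHAR_MIN == -SCHAR_MAX - 1 (i.e. two's complement), so it is not a
fully portable answer either.

```c
#include <limits.h>

/*
 * Hypothetical helper: map an octet to the signed char whose value
 * conversion back to unsigned char yields the same octet.
 * Assumes int is wider than char and SCHAR_MIN == -SCHAR_MAX - 1.
 */
signed char
octet_to_schar(unsigned char u)
{
	if (u <= SCHAR_MAX)
		return (signed char)u; /* representable, nothing special */

	/* u - (UCHAR_MAX + 1), computed in int, lands in SCHAR_MIN..-1 */
	return (signed char)((int)u - UCHAR_MAX - 1);
}
```

Under these assumptions the final cast only ever sees values that are
representable in signed char, so the clause about
implementation-defined results or signals never applies.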

> >> > - .arr = (uint8_t[]){ 0xFD },
> >> > + .arr = (char[]){
> >> > + (unsigned char)0xFD,
> >> > + },
> >>
> >> This cast doesn't do anything here. Both 0xFD and (unsigned
> >> char)0xFD have the same value (0xFD), which can't necessarily be
> >> represented as char. For example if CHAR_MAX is 127, this
> >> conversion is implementation defined and could raise a signal (C99
> >>

Now we're getting closer: gcc doesn't warn, because char and unsigned
char have the same conversion rank.

> >> I think using hex escapes in a string literal ("\xFD") has the
> >> behavior you want here. You could also create an array of unsigned
> >> char and cast to char *.
> >
> > From how I understood the standard it does make a difference.
> > "0xFD" as is is an int-literal and it prints a warning stating that
> > this cannot be cast to a (signed) char. However, it does not
> > complain with unsigned char, so I assumed that the standard somehow
> > safeguards it.
> I'm not sure why casting to unsigned char makes the warning go away.
> The only difference is the type of the expression (int vs unsigned
> char), but the rules in don't care about the source type, only
> its value.
> I'm not aware of any exception in the standard for unsigned char to
> char conversion (but if there is one, I'd be interested to know).
> > But if I got it correctly, you are saying that this only works
> > because I assume two's complement, right? So what's the portable
> > way to work with chars? :)
> I guess it depends specifically on what you are trying to do. If you
> want a char *, such that when it is cast to unsigned char * and
> dereferenced, you get some value 0xAB, you could write "\xAB", or
> (char *)(unsigned char[]){0xAB}. There isn't really a nice way to get
> a char such that converting to unsigned char results in some value,
> since this isn't usually what you want and can't be done in general
> (with sign-magnitude, there is no char such that converting to
> unsigned char results in 0x80).

Alright, and C99 gives the following guarantee: "The value of an
octal or hexadecimal escape sequence shall be in the range of
representable values for the type __unsigned char__ for an integer
character constant, or the unsigned type corresponding to wchar_t for a
wide character constant."

So at least for the test-cases, using hexadecimal escapes in a string
literal is probably the most elegant. This however doesn't solve the
other way round (char -> unsigned char for bit-fiddling).
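
For reference, a small sketch of the two spellings you suggested,
assuming 8-bit bytes. The names escaped, raw, same_bytes and
first_octet are only placeholders and not anything from libgrapheme.

```c
#include <string.h>

/* The byte 0xFD written as a hex escape in a string literal. */
static const char escaped[] = "\xFD";

/* The same byte (plus terminator) stored as unsigned char. */
static const unsigned char raw[] = { 0xFD, 0x00 };

/* Return 1 if both arrays hold the same bytes, 0 otherwise. */
int
same_bytes(void)
{
	/* memcmp() interprets both buffers as unsigned char */
	return sizeof(escaped) == sizeof(raw) &&
	       memcmp(escaped, raw, sizeof(raw)) == 0;
}

/* Read the first byte of a char string through unsigned char. */
unsigned int
first_octet(const char *s)
{
	return *(const unsigned char *)s;
}
```

Both first_octet(escaped) and first_octet((const char *)raw) should
give 0xFD here, without any integer conversion to char being written
in the source.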

> Regarding two's complement assumption, consider the UTF-8 encoding of
> α: 0xCE 0xB1 or 11001110 10110001. If you interpret that as two's
> complement, you get [-50, -79]. Converting to unsigned char will add
> 256, resulting in [0xCE, 0xB1] like you want. However, with
> sign-magnitude you get [-78, -49], converted to unsigned char is
> [0xB2, 0xCF] (and something else for one's complement). If you instead
> just interpret 11001110 10110001 as unsigned char, you get [0xCE,
> 0xB1] without depending on the signed integer representation. With
> C23, the only possible interpretation of 11001110 10110001 as signed
> char is [-50, -79], so it doesn't matter if you go through char or
> directly to unsigned char, the result is the same.
> Really, I think UTF-8 encoding stored in char * is kind of a lie,
> since it doesn't really make sense to talk about negative code units,
> but it is useful so that you can still use standard string libc
> functions. The string.h functions are even specified to interpret as
> unsigned char (C99 7.21.1p3):
> > For all functions in this subclause, each character shall be
> > interpreted as if it had the type unsigned char (and therefore every
> > possible object representation is valid and has a different value).
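
To check my understanding of the example with α, here is a rough
sketch that reads the bytes through unsigned char and never looks at
them as (possibly negative) char values. The function decode2 and the
constant INVALID_CP are made up for this mail, and the code only
handles two-byte sequences without rejecting overlong encodings.

```c
#include <stdint.h>

/* Hypothetical marker for malformed input (U+FFFD). */
#define INVALID_CP UINT32_C(0xFFFD)

/* Decode a two-byte UTF-8 sequence stored in a char array. */
uint_least32_t
decode2(const char *s)
{
	/* look at the object representation, not at char values */
	const unsigned char *u = (const unsigned char *)s;

	/* expect 110xxxxx followed by 10xxxxxx */
	if ((u[0] & 0xE0) != 0xC0 || (u[1] & 0xC0) != 0x80)
		return INVALID_CP;

	return ((uint_least32_t)(u[0] & 0x1F) << 6) | (u[1] & 0x3F);
}
```

With the input "\xCE\xB1" this yields 0x3B1, which is the code point
of α, independent of how the implementation represents negative chars.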

So would you say that the only good way would be to only accept arrays
of unsigned char in the API? This seems logical to me.

When I read more I found out that C++ introduced static_cast and
reinterpret_cast for this simple reason: Assuming some crazy
signed-int-representation we just make up in our heads (some random
permutation of 0..255 to -128..127), it is impossible to really know the
intent of the user passing us a (signed) char-array. Let's say
"0b01010101" means "0" in our crazy signed type, does the user intend
to convey to us a null-byte (which is simply "encoded" in the signed
type), or does he literally mean "0b01010101"? With static_cast and
reinterpret_cast you can handle both cases separately.
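
In C terms, I think the two cases roughly correspond to the following
sketch (the function names are my own). The first converts the value,
the second reads the object representation. On a two's complement
machine with 8-bit chars both return the same result, which is
probably why the distinction is so easy to overlook.

```c
/* Value conversion, roughly what static_cast does in C++. */
unsigned char
by_value(signed char c)
{
	/* defined by the standard as c modulo (UCHAR_MAX + 1) */
	return (unsigned char)c;
}

/* Reinterpretation, roughly what reinterpret_cast on a pointer does. */
unsigned char
by_representation(signed char c)
{
	/* reads the stored bits; depends on the signed representation */
	return *(unsigned char *)&c;
}
```

For c = -50, by_value gives 0xCE whenever UCHAR_MAX is 255, while
by_representation only gives 0xCE under two's complement.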

One might say: 'Ah well, what does it matter?! You can rely on the
implementation and assume that the user always meant the former!'
However, this can really become a footgun if we're talking about FFIs.
If I wrote an FFI to libgrapheme in some external language, I'd be
happier to see an explicit unsigned char array rather than some
signed-char-footgun due to the above reasons, even if we can make it
work in some way within C.

My initial intent was to handle systems that don't have an 8-bit
integer type. This might sound crazy nowadays, but if you were really
stuck on Mars with such a thing and you really had to work with UTF-8,
you would simply read e.g. a UTF-8 encoded file and store each octet
within the low bits of e.g. a 16-bit integer. The other way around
would work respectively. In stdint-lingo you would want the type
uint_least8_t, but that's what unsigned char is defined to be (an
unsigned integer type of at least 8 bits).
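
A minimal sketch of what I mean by storing octets in the low bits,
with a made-up function name and the assumption that the input
arrives as unsigned int values (for instance from some device-specific
read routine):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: keep only the low 8 bits of each input value. */
void
store_octets(uint_least8_t *out, const unsigned int *in, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		/* mask, so that wider element types never hold > 0xFF */
		out[i] = (uint_least8_t)(in[i] & 0xFF);
	}
}
```

On a machine where uint_least8_t is wider than 8 bits, the mask is
what guarantees that every element still represents exactly one octet.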

Two questions remain:

 1) Would you also go down the route of just demanding an array of
    unsigned integers of at least 8 bits?
 2) Would you define it as "unsigned char *" or "uint_least8_t *"?
    I'd almost favor the latter, given the entire library is already
    using the stdint-types.

With best regards


Received on Sun Dec 12 2021 - 08:59:04 CET
