Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold from Michael Forney on 2021-12-12 (hackers mail list archive)

From: Michael Forney <mforney_AT_mforney.org>
Date: Sat, 11 Dec 2021 15:18:56 -0800

Just want to mention up front that everything below is what I believe
to be true from my interpretation of the standard. I'm happy to be
corrected if I am wrong about any of this.

On 2021-12-11, Laslo Hunhold <dev_AT_frign.de> wrote:
> Okay, maybe I misunderstood something here, but from what I understand
> casting between signed and unsigned char is well-defined, no matter the
> implementation. However, if you want to work bitwise it's only
> well-defined if you do it on an unsigned type (i.e. unsigned char in
> this case), which is why I cast to unsigned char. Where is the
> undefined behaviour here? Is it undefined behaviour to cast between
> signed and unsigned char when the value is larger than 128?

Neither conversion is undefined behavior, but converting an unsigned
char value > CHAR_MAX to char is implementation-defined.

Conversion of a negative char value to unsigned char is defined by C99
6.3.1.3p2:

> Otherwise, if the new type is unsigned, the value is converted by
> repeatedly adding or subtracting one more than the maximum value
> that can be represented in the new type until the value is in the
> range of the new type.

Conversion of unsigned char values outside the range of char is
implementation defined by C99 6.3.1.3p3:

> Otherwise, the new type is signed and the value cannot be represented
> in it; either the result is implementation-defined or an
> implementation-defined signal is raised.
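
A minimal sketch of the two directions, assuming nothing beyond
<limits.h>. The helper names to_uchar and fits_in_char are made up for
illustration and are not part of libgrapheme:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: char -> unsigned char. Always well-defined:
 * in-range values are kept, negative ones wrap per C99 6.3.1.3p2. */
static unsigned char to_uchar(char c)
{
	return (unsigned char)c;
}

/* Hypothetical helper: nonzero if converting u to char preserves the
 * value (6.3.1.3p1). If it returns 0, the conversion falls under
 * 6.3.1.3p3 and is implementation-defined. */
static int fits_in_char(unsigned char u)
{
	return u <= CHAR_MAX;
}
```

With a signed 8-bit char, fits_in_char(0xFD) is 0, so (char)0xFD is in
the implementation-defined case, while to_uchar((char)-1) is UCHAR_MAX
on every implementation.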

>> > - .arr = (uint8_t[]){ 0xFD },
>> > + .arr = (char[]){
>> > + (unsigned char)0xFD,
>> > + },
>>
>> This cast doesn't do anything here. Both 0xFD and (unsigned char)0xFD
>> have the same value (0xFD), which can't necessarily be represented as
>> char. For example if CHAR_MAX is 127, this conversion is
>> implementation defined and could raise a signal (C99 6.3.1.3p3).
>>
>> I think using hex escapes in a string literal ("\xFD") has the
>> behavior you want here. You could also create an array of unsigned
>> char and cast to char *.
>
> From how I understood the standard it does make a difference. "0xFD" as
> is is an int-literal and it prints a warning stating that this cannot
> be cast to a (signed) char. However, it does not complain with unsigned
> char, so I assumed that the standard somehow safeguards it.

I'm not sure why casting to unsigned char makes the warning go away.
The only difference is the type of the expression (int vs unsigned
char), but the rules in 6.3.1.3 don't care about the source type, only
its value.

I'm not aware of any exception in the standard for unsigned char to
char conversion (but if there is one, I'd be interested to know).
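
To illustrate why the cast should not matter, here is a small sketch
(function names are hypothetical) showing that both expressions carry
the same value, so 6.3.1.3 puts them in the same case when they are
converted to char:

```c
#include <assert.h>
#include <limits.h>

/* The literal on its own has type int and value 253. */
static int value_as_int(void)
{
	return 0xFD;
}

/* With the cast the type is unsigned char, but the value is still 253
 * (UCHAR_MAX is at least 255, so nothing is lost). */
static int value_via_uchar(void)
{
	return (unsigned char)0xFD;
}

/* Nonzero if v can be converted to char without leaving its range. */
static int representable_in_char(int v)
{
	return v >= CHAR_MIN && v <= CHAR_MAX;
}
```

Whatever representable_in_char reports for one expression it reports
for the other, so any difference in compiler warnings is a diagnostic
choice rather than something the conversion rules require.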

> But when I got it correctly, you are saying that this only works
> because I assume two's complement, right? So what's the portable way to
> work with chars? :)

I guess it depends specifically on what you are trying to do. If you
want a char *, such that when it is cast to unsigned char * and
dereferenced, you get some value 0xAB, you could write "\xAB", or
(char *)(unsigned char[]){0xAB}. There isn't really a nice way to get
a char such that converting to unsigned char results in some value,
since this isn't usually what you want and can't be done in general
(with sign-magnitude, there is no char such that converting to
unsigned char results in 0x80).
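
A sketch of both options, assuming CHAR_BIT == 8. The names
alpha_literal, alpha_bytes, alpha_array and code_unit are only for
illustration:

```c
#include <assert.h>
#include <string.h>

/* Option 1: hex escapes in a string literal. */
static const char *alpha_literal = "\xCE\xB1";

/* Option 2: an unsigned char array viewed through a char pointer.
 * No value conversion happens here, only a pointer cast. */
static const unsigned char alpha_bytes[] = { 0xCE, 0xB1, 0x00 };
static const char *alpha_array = (const char *)alpha_bytes;

/* Hypothetical helper: read byte i of s as an unsigned code unit by
 * accessing the object as unsigned char rather than converting a
 * char value. */
static unsigned int code_unit(const char *s, size_t i)
{
	return ((const unsigned char *)s)[i];
}
```

Both code_unit(alpha_literal, 0) and code_unit(alpha_array, 0) give
0xCE, and neither depends on how negative char values are represented.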

Regarding two's complement assumption, consider the UTF-8 encoding of
α: 0xCE 0xB1 or 11001110 10110001. If you interpret that as two's
complement, you get [-50, -79]. Converting to unsigned char will add
256, resulting in [0xCE, 0xB1] like you want. However, with
sign-magnitude you get [-78, -49], which converted to unsigned char is
[0xB2, 0xCF] (and [0xCF, 0xB2] for one's complement). If you instead
just interpret 11001110 10110001 as unsigned char, you get [0xCE,
0xB1] without depending on the signed integer representation. With
C23, the only possible interpretation of 11001110 10110001 as signed
char is [-50, -79], so it doesn't matter if you go through char or
directly to unsigned char, the result is the same.
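
The arithmetic above can be checked with a small simulation. It does
not rely on the host's own char representation; it just decodes 8-bit
patterns by hand, assuming 8-bit bytes. All function names are
hypothetical:

```c
#include <assert.h>

/* Value of an 8-bit pattern read as two's complement. */
static int as_twos_complement(unsigned bits)
{
	return (bits & 0x80) ? (int)bits - 256 : (int)bits;
}

/* Value of an 8-bit pattern read as sign-magnitude. */
static int as_sign_magnitude(unsigned bits)
{
	return (bits & 0x80) ? -(int)(bits & 0x7F) : (int)bits;
}

/* Value of an 8-bit pattern read as one's complement. */
static int as_ones_complement(unsigned bits)
{
	return (bits & 0x80) ? -(int)(~bits & 0x7F) : (int)bits;
}

/* C99 6.3.1.3p2 for an 8-bit unsigned char: add 256 until in range.
 * Only meant for inputs in [-128, 255]. */
static unsigned to_uchar8(int v)
{
	while (v < 0)
		v += 256;
	return (unsigned)v;
}
```

For example, to_uchar8(as_sign_magnitude(0xCE)) is 0xB2, whereas
to_uchar8(as_twos_complement(0xCE)) gives back 0xCE.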

Really, I think UTF-8 encoding stored in char * is kind of a lie,
since it doesn't really make sense to talk about negative code units,
but it is useful so that you can still use standard string libc
functions. The string.h functions are even specified to interpret as
unsigned char (C99 7.21.1p3):

> For all functions in this subclause, each character shall be
> interpreted as if it had the type unsigned char (and therefore every
> possible object representation is valid and has a different value).
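
One visible consequence, sketched with strcmp (whose result sign is
specified in terms of unsigned char differences, C99 7.21.4p1). The
helper sign_of_strcmp is hypothetical and only normalises the result:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: reduce strcmp's result to -1, 0 or 1. */
static int sign_of_strcmp(const char *a, const char *b)
{
	int r = strcmp(a, b);

	return (r > 0) - (r < 0);
}
```

Even where char is signed and the first byte of "\xCE\xB1" reads as a
negative char, sign_of_strcmp("\xCE\xB1", "z") is 1, because 0xCE is
compared against 0x7A as unsigned char.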

Received on Sun Dec 12 2021 - 00:18:56 CET
