Subject: Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t)

From: Laslo Hunhold <dev_AT_frign.de>
Date: Sun, 12 Dec 2021 08:59:04 +0100

On Sat, 11 Dec 2021 15:18:56 -0800
Michael Forney <mforney_AT_mforney.org> wrote:

Dear Michael,

> Just want to mention up front that all of below is what I believe to
> be true from my interpretation of the standard. I'm happy to be
> corrected if I am wrong about any of this.

thanks again for your elaborate response!

> Neither conversion is undefined behavior, but unsigned char values >
> CHAR_MAX converted to char is implementation defined.
>
> Conversion of a negative char value to unsigned char is defined by C99
> 6.3.1.3p2:
>
> > Otherwise, if the new type is unsigned, the value is converted by
> > repeatedly adding or subtracting one more than the maximum value
> > that can be represented in the new type until the value is in the
> > range of the new type.
>
> Conversion of unsigned char values outside the range of char is
> implementation defined by C99 6.3.1.3p3:
>
> > Otherwise, the new type is signed and the value cannot be
> > represented in it; either the result is implementation-defined or an
> > implementation-defined signal is raised.

On Sat, 11 Dec 2021 15:33:12 -0800
Michael Forney <mforney_AT_mforney.org> wrote:

> On 2021-12-11, Michael Forney <mforney_AT_mforney.org> wrote:
> > Conversion of unsigned char values outside the range of char is
> > implementation defined by C99 6.3.1.3p3:
> >
> >> Otherwise, the new type is signed and the value cannot be
> >> represented in it; either the result is implementation-defined or
> >> an implementation-defined signal is raised.
>
> Also worth noting, this clause still remains even in the current C23
> draft, which requires two's complement. So, assuming that CHAR_MAX ==
> 127, (char)0xFD will continue to be implementation defined and might
> raise a signal. This is different from C++, which went a step further
> to define conversion between all integer types to be the unique value
> congruent to the source value modulo 2^N (where N is the number of
> bits of the destination type).

In [0] the gcc developers write in this regard: "For conversion to a
type of width N, the value is reduced modulo 2^N to be within range of
the type; no signal is raised."

However, gcc seems to be a bit pedantic when you convert a value that
is more than one 2^N "away" from the signed range: it probably assumes
that you made a mistake and warns about it.
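
To make the gcc behaviour concrete, here is a minimal sketch. The helper
names octet_to_char and char_to_octet are made up for illustration, and
the comments assume CHAR_BIT == 8 and an implementation that, like gcc,
reduces modulo 2^N without raising a signal:

```c
#include <assert.h>
#include <limits.h>

/*
 * Octet -> char. For octets above CHAR_MAX the result is
 * implementation-defined (C99 6.3.1.3p3). gcc documents reduction
 * modulo 2^N, so 0xFD becomes -3 when char is signed and 8 bits wide.
 */
char
octet_to_char(unsigned int octet)
{
	assert(octet <= UCHAR_MAX);
	return (char)octet;
}

/*
 * char -> octet. This direction is fully defined by C99 6.3.1.3p2:
 * UCHAR_MAX + 1 is added to negative values until they are in range.
 */
unsigned char
char_to_octet(char c)
{
	return (unsigned char)c;
}
```

Only the second function is portable as written. The first one merely
documents what a particular implementation promises.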

> >> > - .arr = (uint8_t[]){ 0xFD },
> >> > + .arr = (char[]){
> >> > + (unsigned char)0xFD,
> >> > + },
> >>
> >> This cast doesn't do anything here. Both 0xFD and (unsigned
> >> char)0xFD have the same value (0xFD), which can't necessarily be
> >> represented as char. For example if CHAR_MAX is 127, this
> >> conversion is implementation defined and could raise a signal (C99
> >> 6.3.1.3p3).

Now we're getting closer: gcc doesn't warn, because char and unsigned
char have the same conversion rank.

> >> I think using hex escapes in a string literal ("\xFD") has the
> >> behavior you want here. You could also create an array of unsigned
> >> char and cast to char *.
> >
> > From how I understood the standard it does make a difference.
> > "0xFD" as is is an int-literal and it prints a warning stating that
> > this cannot be cast to a (signed) char. However, it does not
> > complain with unsigned char, so I assumed that the standard somehow
> > safeguards it.
>
> I'm not sure why casting to unsigned char makes the warning go away.
> The only difference is the type of the expression (int vs unsigned
> char), but the rules in 6.3.1.3 don't care about the source type, only
> its value.
>
> I'm not aware of any exception in the standard for unsigned char to
> char conversion (but if there is one, I'd be interested to know).
>
> > But when I got it correctly, you are saying that this only works
> > because I assume two's complement, right? So what's the portable
> > way to work with chars? :)
>
> I guess it depends specifically on what you are trying to do. If you
> want a char *, such that when it is cast to unsigned char * and
> dereferenced, you get some value 0xAB, you could write "\xAB", or
> (char *)(unsigned char[]){0xAB}. There isn't really a nice way to get
> a char such that converting to unsigned char results in some value,
> since this isn't usually what you want and can't be done in general
> (with sign-magnitude, there is no char such that converting to
> unsigned char results in 0x80).

Alright, and C99 gives this guarantee in 6.4.4.4p9: "The value of an
octal or hexadecimal escape sequence shall be in the range of
representable values for the type __unsigned char__ for an integer
character constant, or the unsigned type corresponding to wchar_t for a
wide character constant."

So at least for the test-cases, using hexadecimal escapes in a string
literal is probably the most elegant. This however doesn't solve the
other way round (char -> unsigned char for bit-fiddling).
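
As a rough sketch of how the test data could look with both of the
suggested forms: the struct layout and all names below are hypothetical
and not taken from the libgrapheme sources, and I assume that the byte
stored for "\xFD" reads back as 0xFD through an unsigned char pointer,
which holds on every implementation I know of.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test-case layout, for illustration only. */
struct test_case {
	const char *arr;
	size_t len;
};

/* Variant 1: hexadecimal escape in a string literal. */
const struct test_case from_escape = {
	.arr = "\xFD",
	.len = 1,
};

/* Variant 2: array of unsigned char, with the pointer cast to char *. */
const struct test_case from_uchar_array = {
	.arr = (const char *)(unsigned char[]){ 0xAB },
	.len = 1,
};

/* Read octet i of a test case without going through a char value. */
unsigned char
octet_at(const struct test_case *t, size_t i)
{
	assert(i < t->len);
	return ((const unsigned char *)t->arr)[i];
}
```

Neither variant ever converts an out-of-range value to char, so
6.3.1.3p3 does not come into play.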

> Regarding two's complement assumption, consider the UTF-8 encoding of
> α: 0xCE 0xB1 or 11001110 10110001. If you interpret that as two's
> complement, you get [-50, -79]. Converting to unsigned char will add
> 256, resulting in [0xCE, 0xB1] like you want. However, with
> sign-magnitude you get [-78, -49], converted to unsigned char is
> [0xB2, 0xCF] (and something else for one's complement). If you instead
> just interpret 11001110 10110001 as unsigned char, you get [0xCE,
> 0xB1] without depending on the signed integer representation. With
> C23, the only possible interpretation of 11001110 10110001 as signed
> char is [-50, -79], so it doesn't matter if you go through char or
> directly to unsigned char, the result is the same.
>
> Really, I think UTF-8 encoding stored in char * is kind of a lie,
> since it doesn't really make sense to talk about negative code units,
> but it is useful so that you can still use standard string libc
> functions. The string.h functions are even specified to interpret as
> unsigned char (C99 7.21.1p3):
>
> > For all functions in this subclause, each character shall be
> > interpreted as if it had the type unsigned char (and therefore every
> > possible object representation is valid and has a different value).
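
The arithmetic above can be checked with a small sketch. The function
name to_uchar is made up, and I assume UCHAR_MAX == 255. Since
6.3.1.3p2 works on values, plain ints are enough to play through both
interpretations:

```c
#include <assert.h>
#include <limits.h>

/*
 * Value-based conversion to unsigned char (C99 6.3.1.3p2): negative
 * values have UCHAR_MAX + 1 added until they are in range. The assert
 * only restricts the input to the range discussed in this thread.
 */
unsigned char
to_uchar(int value)
{
	assert(value >= -UCHAR_MAX && value <= UCHAR_MAX);
	return (unsigned char)value;
}
```

With two's complement values, to_uchar(-50) and to_uchar(-79) yield
0xCE and 0xB1. With the sign-magnitude values, to_uchar(-78) and
to_uchar(-49) yield 0xB2 and 0xCF, which is no longer the encoding
of α.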

So would you say that the only good way would be to accept only arrays
of unsigned char in the API? This seems to be the logical conclusion.

Reading further, I found out that C++ introduced static_cast and
reinterpret_cast for this simple reason: Assuming some crazy
signed-int-representation we just make up in our heads (some random
permutation mapping 0..255 to -128..127), it is impossible to really
know the intent of the user passing us a (signed) char-array. Let's say
"0b01010101" means "0" in our crazy signed type: does the user intend
to convey to us a null-byte (which is simply "encoded" in the signed
type), or do they literally mean "0b01010101"? With static_cast and
reinterpret_cast you can handle both cases separately.

One might say: 'Ah well, what does it matter?! You can rely on the
implementation and assume that the user always meant the former!'
However, this can really become a footgun if we're talking about FFIs.
If I wrote a FFI to libgrapheme in some external language, I'd be
happier to see an explicit unsigned char array rather than some
signed-char-footgun due to the above reasons, even if we can make it
work in some way within C.

My initial intent was to handle systems that don't have an 8-bit
integer type. This might sound crazy nowadays, but if you were really
stuck on Mars with such a thing and you really had to work with UTF-8,
you would simply read e.g. a UTF-8 encoded file and store each octet
within the low bits of e.g. a 16-bit integer. The other way around
would work analogously. In stdint-lingo you would want the type
uint_least8_t, but that's effectively what unsigned char already is (an
unsigned integer type of at least 8 bits).
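
A minimal sketch of what consuming such input could look like. The
function count_code_points and the sample array are hypothetical and
not part of the libgrapheme API; the sketch assumes well-formed UTF-8
and only looks at the low 8 bits of each element:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* UTF-8 encoding of α, one octet per array element. */
const uint_least8_t alpha[] = { 0xCE, 0xB1 };

/*
 * Count code points by counting every octet that is not a
 * continuation octet (10xxxxxx). Each element is masked to its low
 * 8 bits, so the element type may be wider than an octet.
 */
size_t
count_code_points(const uint_least8_t *str, size_t len)
{
	size_t i, n = 0;

	assert(str != NULL || len == 0);
	for (i = 0; i < len; i++) {
		unsigned int octet = str[i] & 0xFF;

		if ((octet & 0xC0) != 0x80)
			n++;
	}
	return n;
}
```

The same body would work unchanged with "const unsigned char *" as the
parameter type, so the choice between the two is mostly about how the
API reads.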

Two questions remain:

1) Would you also go down the route of just demanding an array of
    unsigned integers of at least 8 bits?
2) Would you define it as "unsigned char *" or "uint_least8_t *"?
    I'd almost favor the latter, given the entire library is already
    using the stdint-types.

With best regards

Laslo

[0]:http://gcc.gnu.org/onlinedocs/gcc/Integers-implementation.html#Integers-implementation
Received on Sun Dec 12 2021 - 08:59:04 CET