Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold

From: Laslo Hunhold <dev_AT_frign.de>
Date: Sat, 11 Dec 2021 22:43:17 +0100

On Sat, 11 Dec 2021 12:24:10 -0800
Michael Forney <mforney_AT_mforney.org> wrote:

Dear Michael,

thanks for your input. You really know the intricacies much better than
I do.

> It is true that the existence of uint32_t implies that uint_least32_t
> also has exactly 32 bits and no padding bits, but they could still be
> distinct types. For instance, on a 32-bit platform with int and long
> both being exactly 32 bits, you could define uint32_t as one and
> uint_least32_t as the other. In that case, dereferencing an array of
> uint32_t as uint_least32_t would be undefined behavior.
>
> That said, I agree with this change. It also has the benefit of
> matching the definition of C11's char32_t.

That's a nice coincidence. The undefined behaviour would be acceptable
to me, given that it would be a user error. In 99% of cases it will not
be a problem, and in any case it would not be libgrapheme's fault, as
the interfaces are specified clearly enough; still, it's good to know.
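
To make sure I picture the failure mode correctly, here is a small
standalone sketch (hypothetical, not from the patch) of the aliasing
problem you describe, assuming an implementation where the two types
happen to differ:

#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical: assume an implementation where uint32_t and
 * uint_least32_t are distinct 32-bit types (e.g. unsigned int vs.
 * unsigned long). Then reading storage declared as uint32_t through a
 * uint_least32_t lvalue violates the aliasing rules, even though both
 * types have identical size and representation.
 */
static size_t
count_nonzero(const uint_least32_t *cp, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++) {
		count += (cp[i] != 0);
	}

	return count;
}

int
main(void)
{
	uint32_t buf[3] = { 0x61, 0x0, 0x10FFFF };

	/* only well-defined if uint32_t and uint_least32_t are the
	 * same type on this implementation */
	return (int)count_nonzero((const uint_least32_t *)buf, 3);
}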

>
> > diff --git a/src/utf8.c b/src/utf8.c
> > index 4488359..1cb5e17 100644
> > --- a/src/utf8.c
> > +++ b/src/utf8.c
> > @@ -92,7 +101,7 @@ lg_utf8_decode(const uint8_t *s, size_t n, uint32_t *cp)
> > * (i.e. between 0x80 (10000000) and 0xBF (10111111))
> > */
> > for (i = 1; i <= off; i++) {
> > - if(!BETWEEN(s[i], 0x80, 0xBF)) {
> > + if(!BETWEEN((unsigned char)s[i], 0x80, 0xBF)) {
> > /*
> > * byte does not match format; return
> > * number of bytes processed excluding the
> >
>
> Although irrelevant in C23, which will require 2's complement
> representation, I want to note the distinction between (unsigned
> char)s[i] and ((unsigned char *)s)[i]. The former adds 2^CHAR_BIT to
> negative values, while the latter interprets as a CHAR_BIT-bit
> unsigned integer (adds 2^CHAR_BIT if the sign bit is set). For
> example, if char had sign-magnitude representation, we'd have
> (unsigned char)"\x80"[0] == 0, but ((unsigned char *)"\x80")[0] ==
> 0x80.
>
> The latter is probably what you want, but you could ignore this if you
> only care about 2's complement (which is a completely reasonable
> position).

Okay, maybe I misunderstood something here, but from what I understand,
casting between signed and unsigned char is well-defined, no matter the
implementation. However, if you want to work bitwise, it's only
well-defined if you do it on an unsigned type (i.e. unsigned char in
this case), which is why I cast to unsigned char. Where is the
undefined behaviour here? Is it undefined behaviour to cast between
signed and unsigned char when the value is larger than 127?
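
For my own understanding, here is a minimal standalone sketch (not part
of the patch) of the two casts you distinguish:

#include <stdio.h>

int
main(void)
{
	const char *s = "\x80";

	/* value conversion: a negative char value is brought into
	 * range by adding UCHAR_MAX + 1 */
	unsigned char a = (unsigned char)s[0];

	/* reinterpretation: the stored byte is read through an
	 * unsigned char lvalue, so the bit pattern is taken as-is */
	unsigned char b = ((const unsigned char *)s)[0];

	/* both print 0x80 under two's complement; under a hypothetical
	 * sign-magnitude char, a would be 0 while b would be 0x80 */
	printf("%#x %#x\n", a, b);

	return 0;
}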

> > - .arr = (uint8_t[]){ 0xFD },
> > + .arr = (char[]){
> > + (unsigned char)0xFD,
> > + },
>
> This cast doesn't do anything here. Both 0xFD and (unsigned char)0xFD
> have the same value (0xFD), which can't necessarily be represented as
> char. For example if CHAR_MAX is 127, this conversion is
> implementation defined and could raise a signal (C99 6.3.1.3p2).
>
> I think using hex escapes in a string literal ("\xFD") has the
> behavior you want here. You could also create an array of unsigned
> char and cast to char *.

From how I understood the standard, it does make a difference: 0xFD by
itself is an int literal, and the compiler prints a warning stating
that it cannot be converted to a (signed) char. However, it does not
complain with unsigned char, so I assumed that the standard somehow
safeguards it.

But if I got you correctly, you are saying that this only works because
I assume two's complement, right? So what's the portable way to work
with chars? :)
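
For reference, this is how I read your two suggestions, as a small
standalone sketch (hypothetical, not part of the patch):

#include <stdio.h>

int
main(void)
{
	/* hex escape in a string literal: stores the byte 0xFD,
	 * regardless of whether char is signed */
	const char *a = "\xFD";

	/* alternative: build the data as unsigned char, view it as char */
	static const unsigned char raw[] = { 0xFD };
	const char *b = (const char *)raw;

	/* both print 0xfd */
	printf("%#x %#x\n",
	       ((const unsigned char *)a)[0],
	       ((const unsigned char *)b)[0]);

	return 0;
}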

With best regards

Laslo