Subject: Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t)

From: Michael Forney <mforney_AT_mforney.org>
Date: Sun, 12 Dec 2021 12:41:15 -0800

On 2021-12-12, Laslo Hunhold <dev_AT_frign.de> wrote:
> yes, if we were only accessing, that would be fine. However, what about
> the other way around? libgrapheme also writes to arrays with
> lg_utf8_encode(), and that's where we can't just write to char.

You can do that with a cast, too:

*(unsigned char *)s = ...;
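
For example, an interface that takes char * can still do all of its
byte stores through unsigned char internally. A minimal sketch (encode2
is a made-up helper for code points below U+0800, not lg_utf8_encode's
actual implementation):

#include <stddef.h>

/* write the UTF-8 encoding of cp (cp < 0x800 assumed) into s,
 * going through unsigned char so the stores do not depend on
 * the signedness of plain char; returns the number of bytes */
static size_t
encode2(char *s, unsigned long cp)
{
	unsigned char *p = (unsigned char *)s;

	if (cp < 0x80) {
		p[0] = (unsigned char)cp;
		return 1;
	}
	p[0] = (unsigned char)(0xc0 | (cp >> 6));
	p[1] = (unsigned char)(0x80 | (cp & 0x3f));
	return 2;
}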

> But char and unsigned char are of integer type, aren't they?

They are integer types and character types. Character types are a
subset of integer types: char, signed char, and unsigned char.
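
They are still three distinct types, even though char has the same
representation as either signed char or unsigned char. C11's _Generic
makes that easy to check (the thread is about C99, so take this only
as an illustration):

#include <stdio.h>

int
main(void)
{
	char c = 0;

	/* all three association labels are allowed here because
	 * char, signed char and unsigned char are distinct types;
	 * this always prints "char" */
	puts(_Generic(c,
	    char: "char",
	    signed char: "signed char",
	    unsigned char: "unsigned char"));
	return 0;
}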

> So on a
> POSIX-system, which is 99.999% of cases, it makes no difference if we
> cast between (char *) and (unsigned char *) (as you suggested above if
> we went with unsigned char * for the interfaces) and between (char *)
> and (uint_least8_t *), does it? So if the end-user has to cast anyway,
> then he can just cast to a uint* type as well.

The difference is that uint8_t and uint_least8_t are not necessarily
character types. Although the existence of uint8_t implies that
unsigned char has exactly 8 bits, uint8_t could be a separate 8-bit
integer type distinct from the character types. If this were the case,
accessing an array of unsigned char through a pointer to uint8_t would
be undefined behavior (C99 6.5p7).

Here are some examples (assuming <stdint.h> is included):

char a[1] = {0};
// always valid, evaluates to 0
*(unsigned char *)a;
// always valid, sets the bits of a[0] to 11111111
// but the value of a[0] depends on whether char is signed
// and, if so, on the signed representation
*(unsigned char *)a = 0xff;
// undefined behavior if uint8_t is not a character type
*(uint8_t *)a;
*(uint8_t *)a = 0xff;

uint8_t b[1] = {0};
// always valid, evaluates to 0
*(unsigned char *)b;
// always valid, sets the bits of b[0] to 11111111
*(unsigned char *)b = 0xff;
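
The unsigned char accesses above are always valid because of the
character-type exception in C99 6.5p7: the bytes of any object may be
inspected through an lvalue of character type. A sketch (dump_bytes is
a hypothetical helper):

#include <stdio.h>

/* print an object's representation byte by byte; accessing any
 * object's storage through unsigned char * is permitted by
 * C99 6.5p7 */
static void
dump_bytes(const void *p, size_t n)
{
	const unsigned char *b = p;
	size_t i;

	for (i = 0; i < n; i++)
		printf("%02x ", (unsigned)b[i]);
	putchar('\n');
}

int
main(void)
{
	unsigned long x = 0x12345678;

	dump_bytes(&x, sizeof(x)); /* byte order depends on the platform */
	return 0;
}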

> Even more drastically, given UTF-8 is an encoding, I don't really feel
> good about not being strict about the returned arrays in such a way that
> it becomes possible to have an array of e.g. 16-bit integers where only
> the bottom half is used and it becomes the user's job to then hand-craft
> it into a proper array to send over the network, etc. Surely one can
> hack around this as a library user, but at a certain point I think "to
> hell with it" and just be strict about it in the API. C already has a
> weak type system and I don't want to further weaken it by supporting
> decades-old implicit assumptions on types. So in a way, maybe uint8_t
> is the way to go, and then the library user immediately knows it's not
> going to work with his machine because uint8_t is not defined for him.

Not quite sure what you mean here. Are you talking about the case
where CHAR_BIT is 16? In that case, there'd be no uint8_t, so you
couldn't "hand-craft it into a proper array". I'm not sure how
networking APIs would work on such a system, but maybe they'd consider
only the lowest 8 bits of each byte.
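
If code really depends on 8-bit bytes, it can also just refuse to
build anywhere else. A sketch of such a compile-time guard:

#include <limits.h>

/* refuse to compile where bytes are not 8 bits wide; on such a
 * platform uint8_t would not exist either (C99 7.18.1.1) */
#if CHAR_BIT != 8
#error "this code assumes 8-bit bytes"
#endif

On a platform where this fails, a uint8_t-based interface would fail
to compile anyway, which is exactly the strictness you describe.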