Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold from Laslo Hunhold on 2021-12-15 (hackers mail list archive)

From: Laslo Hunhold <dev_AT_frign.de>
Date: Wed, 15 Dec 2021 15:39:14 +0100

On Sun, 12 Dec 2021 12:41:15 -0800
Michael Forney <mforney_AT_mforney.org> wrote:

Dear Michael,

> > But char and unsigned char are of integer type, aren't they?
>
> They are integer types and character types. Character types are a
> subset of integer types: char, signed char, and unsigned char.
>
> > So on a
> > POSIX-system, which is 99.999% of cases, it makes no difference if
> > we cast between (char *) and (unsigned char *) (as you suggested
> > above if we went with unsigned char * for the interfaces) and
> > between (char *) and (uint_least8_t *), does it? So if the end-user
> > has to cast anyway, then he can just cast to an uint* type as well.
> >
>
> The difference is that uint8_t and uint_least8_t are not necessarily
> character types. Although the existence of uint8_t implies that
> unsigned char has exactly 8 bits, uint8_t could be a separate 8-bit
> integer type distinct from the character types. If this were the case,
> accessing an array of unsigned char through a pointer to uint8_t would
> be undefined behavior (C99 6.5p7).
>
> Here are some examples:
>
> char a[1] = {0};
> // always valid, evaluates to 0
> *(unsigned char *)a;
> // always valid, sets the bits of a[0] to 11111111
> // but the value of a[0] depends on the signed-int representation
> *(unsigned char *)a = 0xff;
> // undefined behavior if uint8_t is not a character type
> *(uint8_t *)a;
> *(uint8_t *)a = 0xff;
>
> uint8_t b[1] = {0};
> // always valid, evaluates to 0
> *(unsigned char *)b;
> // always valid, sets the bits of b[0] to 11111111
> *(unsigned char *)b = 0xff;

Thanks for clearing that up! After more thought I made the decision to
go with uint8_t, though. I see the point regarding character types, but
this notion is more of a smelly foot in the C standard. We are moving
towards UTF-8 as _the_ default encoding format, so treating character
strings as UTF-8 byte sequences is justified.
Any other way would have introduced too many implicit assumptions.
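
A minimal sketch of how a caller holding a char string could feed a
uint8_t-based interface without running into the aliasing problem
quoted above. The names UINT8_IS_CHAR_TYPE and bytes_to_u8() are
hypothetical and not part of libgrapheme; the check assumes a C11
compiler (for _Generic and static_assert).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* uint8_t has exactly 8 bits and no padding, so it occupies one byte. */
static_assert(sizeof(uint8_t) == 1, "uint8_t must occupy exactly one byte");

/*
 * Evaluates to 1 if uint8_t is a character type on this implementation
 * and to 0 if it is a distinct 8-bit integer type (the case in which
 * reading a char array through a uint8_t pointer would be undefined).
 */
#define UINT8_IS_CHAR_TYPE \
	_Generic((uint8_t)0, unsigned char: 1, char: 1, default: 0)

/*
 * Hypothetical helper: copy at most dstlen bytes from a char string
 * into a uint8_t buffer and return the number of bytes copied.
 * memcpy() copies the object representation, so no char object is
 * ever accessed through a uint8_t lvalue.
 */
static size_t
bytes_to_u8(uint8_t *dst, size_t dstlen, const char *src, size_t srclen)
{
	size_t n = (srclen < dstlen) ? srclen : dstlen;

	memcpy(dst, src, n);

	return n;
}
```

The copy costs one pass over the data, but it is valid whether or not
uint8_t happens to be a typedef for unsigned char.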

> > Even more drastically, given UTF-8 is an encoding, I don't really
> > feel good about not being strict about the returned arrays in such
> > a way that it becomes possible to have an array of e.g. 16-bit
> > integers where only the bottom half is used and it becomes the
> > user's job to then hand-craft it into a proper array to send over
> > the network, etc. Surely one can hack around this as a library
> > user, but at a certain point I think "to hell with it" and just be
> > strict about it in the API. C already has a weak type system and I
> > don't want to further weaken it by supporting decades-old implicit
> > assumptions on types. So in a way, maybe uint8_t is the way to go,
> > and then the library user immediately knows it's not going to work
> > with his machine because uint8_t is not defined for him.
>
> Not quite sure what you mean here. Are you talking about the case
> where CHAR_BIT is 16? In that case, there'd be no uint8_t, so you
> couldn't "hand-craft it into a proper array". I'm not sure how
> networking APIs would work on such a system, but maybe they'd consider
> only the lowest 8 bits of each byte.

Yes, exactly. Trying to include grapheme.h would immediately show that
the system is incompatible, rather than silently "breaking" in this
respect. Given how smart compilers have become at working with "halves"
of registers, I'd much rather expect the CPU to offer instructions to
work with 8-bit integers as "halves" of 16 bits (accessing lower and
upper).
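
As a sketch of what "immediately show that the system is incompatible"
could look like when spelled out explicitly: the guard below is an
assumption for illustration, not something grapheme.h is known to
contain, and lower_half()/upper_half() are hypothetical helpers.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* UINT8_MAX is defined only if the implementation provides uint8_t. */
#ifndef UINT8_MAX
#error "no exact-width 8-bit type; a uint8_t-based interface is unusable here"
#endif

/* If uint8_t exists, bytes are exactly 8 bits wide. */
static_assert(CHAR_BIT == 8, "uint8_t implies CHAR_BIT == 8");

/* Hypothetical helper: the lower 8 bits of a 16-bit unit. */
static unsigned
lower_half(uint_least16_t v)
{
	return v & 0xffu;
}

/* Hypothetical helper: the upper 8 bits of a 16-bit unit. */
static unsigned
upper_half(uint_least16_t v)
{
	return (v >> 8) & 0xffu;
}
```

Without such a guard the failure is still a compile error, just a less
readable one about uint8_t being an unknown type.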

And even if all fails and there simply is no 8-bit-type, one can always
use the lg_grapheme_isbreak()-function and roll his own de/encoding.
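
A rough sketch of such a hand-rolled decoder, assuming the input is an
array of unsigned char of which only the lowest 8 bits of each element
are considered. decode_utf8() and CP_INVALID are hypothetical names and
not part of libgrapheme; the resulting code points would then be handed
to lg_grapheme_isbreak(), whose exact signature is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* U+FFFD REPLACEMENT CHARACTER, reported for malformed input. */
#define CP_INVALID UINT32_C(0xFFFD)

/*
 * Decode one code point from s (len elements) into *cp and return the
 * number of elements consumed (0 only if len is 0). Only the lowest
 * 8 bits of each element are used, so this also works if CHAR_BIT > 8.
 */
static size_t
decode_utf8(const unsigned char *s, size_t len, uint_least32_t *cp)
{
	/* smallest code point allowed for 0, 1, 2, 3 continuation bytes */
	static const uint_least32_t mincp[] = { 0x0, 0x80, 0x800, 0x10000 };
	uint_least32_t c;
	size_t i, follow;
	unsigned b;

	assert(s != NULL && cp != NULL);

	*cp = CP_INVALID;
	if (len == 0)
		return 0;

	b = s[0] & 0xffu;
	if (b < 0x80) {
		c = b;
		follow = 0;
	} else if ((b & 0xe0) == 0xc0) {
		c = b & 0x1f;
		follow = 1;
	} else if ((b & 0xf0) == 0xe0) {
		c = b & 0x0f;
		follow = 2;
	} else if ((b & 0xf8) == 0xf0) {
		c = b & 0x07;
		follow = 3;
	} else {
		return 1; /* stray continuation byte or invalid lead byte */
	}

	for (i = 1; i <= follow; i++) {
		if (i >= len)
			return i; /* sequence is truncated */
		b = s[i] & 0xffu;
		if ((b & 0xc0) != 0x80)
			return i; /* not a continuation byte, leave it unconsumed */
		c = (c << 6) | (b & 0x3f);
	}

	/* reject overlong encodings, surrogates and values above U+10FFFF */
	if (c >= mincp[follow] && c <= 0x10ffff && (c < 0xd800 || c > 0xdfff))
		*cp = c;

	return follow + 1;
}
```

Calling it in a loop and advancing by the return value walks a buffer
code point by code point, with malformed input showing up as CP_INVALID.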

With best regards

Laslo
Received on Wed Dec 15 2021 - 15:39:14 CET
