Subject: Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t)

From: Laslo Hunhold <dev_AT_frign.de>
Date: Sun, 12 Dec 2021 12:17:14 +0100

On Sun, 12 Dec 2021 01:22:47 -0800
Michael Forney <mforney_AT_mforney.org> wrote:

Dear Michael,

> On 2021-12-11, Laslo Hunhold <dev_AT_frign.de> wrote:
> > So would you say that the only good way would be to only accept
> > arrays of unsigned char in the API? I think this seems to be the
> > logical conclusion.
>
> That's one option, but another is to keep using arrays of char, but
> cast to unsigned char * before accessing. This is perfectly fine in C
> since unsigned char is a character type and you are allowed to access
> the representation of any object through a pointer to character type,
> regardless of the object's actual type.
>
> Accepting unsigned char * is maybe a bit nicer for libgrapheme's
> implementation, but char * is nicer for the users, since that's likely
> the type they already have. It also allows them to continue to use
> string.h functions such as strlen or strcmp on the same buffer (which
> also are defined to interpret characters as unsigned char).

yes, if we were only accessing, that would be fine. However, what about
the other way around? libgrapheme also writes into arrays with
lg_utf8_encode(), and that's where we can't just write to char.
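
Just so we're on the same page, here's a rough sketch (with made-up
names, not the actual libgrapheme code) of how I read your suggestion
for the reading direction; the writing direction is the part I'm
unsure about:

#include <stddef.h>
#include <stdint.h>

/*
 * The interface keeps char *, the implementation reads the bytes
 * through unsigned char *, which is allowed for any object.
 */
size_t
decode_sketch(const char *str, size_t len, uint_least32_t *cp)
{
	const unsigned char *s = (const unsigned char *)str;

	if (len > 0 && s[0] < 0x80) {
		*cp = s[0]; /* ASCII fast path */
		return 1;
	}

	/* multi-byte sequences elided */
	return 0;
}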

> I guess it depends on how that data was obtained in the first place.
> Say you have char buf[1024], and read UTF-8 encoded data from a file
> into it. fread is defined in terms of fgetc, which "obtains that
> character as unsigned char" and stores into an array of unsigned char
> overlaying the object. In this case, accessing as unsigned char is the
> intention.
>
> I can't really think of a case where the intention would be to
> interpret as signed char and convert to unsigned char. With
> sign-magnitude, it'd be impossible to encode Ā (UTF-8 0xC4 0x80) this
> way, since there is no char value that results in 0x80 when converted
> to unsigned char.
>
> I know it's just a thought experiment, but note that there are only
> three signed-int representations valid in C: sign-magnitude, one's
> complement, and two's complement. They only differ by the meaning of
> the sign bit, which is the highest bit of the corresponding unsigned
> integer type, so you couldn't go as crazy as the representation you
> described.

Yeah, it was just a thought-experiment. :)

> > 1) Would you also go down the route of just demanding an array of
> > unsigned integers of at least 8 bits?
>
> I'd suggest sticking with char *, but unsigned char * seems
> reasonable as well.
>
> > 2) Would you define it as "unsigned char *" or "uint_least8_t *"?
> > I'd almost favor the latter, given the entire library is already
> > using the stdint-types.
>
> I don't think uint_least8_t is a good idea, since there is no
> guarantee that it is a character type. The API user is unlikely to
> have the data in a buffer of this type, so they'd potentially have to
> allocate a new one and copy into it. With unsigned char *, they could
> just cast if necessary.

But char and unsigned char are integer types, aren't they? So on a
POSIX system, which covers 99.999% of cases, it makes no difference
whether we cast between (char *) and (unsigned char *) (as you
suggested above, if we went with unsigned char * for the interfaces)
or between (char *) and (uint_least8_t *), does it? So if the end user
has to cast anyway, they can just as well cast to a uint* type.
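
From the caller's side the difference would then just be which cast
appears at the call site. A rough sketch with hypothetical decoder
signatures (not the actual libgrapheme functions):

#include <stdint.h>
#include <stdio.h>

/* hypothetical decoders, only to contrast the two interface variants */
static size_t
decode_uchar(const unsigned char *s, size_t n, uint_least32_t *cp)
{
	if (n > 0 && s[0] < 0x80) {
		*cp = s[0];
		return 1;
	}
	return 0; /* multi-byte handling elided */
}

static size_t
decode_u8(const uint_least8_t *s, size_t n, uint_least32_t *cp)
{
	return decode_uchar((const unsigned char *)s, n, cp);
}

int
main(void)
{
	char buf[1024];
	uint_least32_t cp;
	size_t n = fread(buf, 1, sizeof(buf), stdin);

	/* variant 1: interface takes unsigned char * */
	decode_uchar((const unsigned char *)buf, n, &cp);

	/* variant 2: interface takes uint_least8_t *; on a POSIX system
	 * uint_least8_t is unsigned char, so the cast behaves the same */
	decode_u8((const uint_least8_t *)buf, n, &cp);

	return 0;
}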

Even more drastically, given that UTF-8 is an encoding, I don't really
feel good about being so lax with the returned arrays that it becomes
possible to end up with, e.g., an array of 16-bit integers where only
the bottom half of each element is used, making it the user's job to
hand-craft a proper byte array to send over the network, etc. (see the
sketch below). Surely one can hack around this as a library user, but
at a certain point I think "to hell with it" and would rather just be
strict about it in the API. C already has a weak type system and I
don't want to weaken it further by supporting decades-old implicit
assumptions about types. So in a way, maybe uint8_t is the way to go:
the library user then immediately knows it's not going to work on
their machine, because uint8_t is not defined there. Done.
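
The repacking I have in mind would look roughly like this (purely
hypothetical scenario, where an encoder filled a wider integer array
with one byte per element):

#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-side glue: 'wide' holds one UTF-8 byte per 16-bit
 * element, so it has to be repacked into real bytes before being
 * written to a socket or file.
 */
void
repack(const uint_least16_t *wide, size_t n, unsigned char *bytes)
{
	size_t i;

	for (i = 0; i < n; i++)
		bytes[i] = (unsigned char)(wide[i] & 0xff);
}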
If anything, I find it more plausible that a compiler could "emulate"
8-bit types even on machines with 16-bit chars, but that is such an
extreme case.

The standards committee made a good choice in letting memcpy operate
on void *. They knew chars were a mess, and it might be the best
option to not touch them within the library at all and stick with
well-defined types.
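
Following that model, the interface could take void * and only the
implementation would commit to a character type; a minimal sketch of
the idea, again with a made-up name:

#include <stddef.h>
#include <stdint.h>

/* made-up name; the point is only the void * in the prototype */
size_t
encode_sketch(uint_least32_t cp, void *buf, size_t n)
{
	unsigned char *s = buf; /* commit to a character type internally */

	if (cp < 0x80) {
		if (n >= 1)
			s[0] = (unsigned char)cp;
		return 1;
	}

	/* longer sequences elided */
	return 0;
}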
I'll think about it.

With best regards

Laslo