Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold

From: Michael Forney <mforney_AT_mforney.org>
Date: Sun, 12 Dec 2021 01:22:47 -0800

On 2021-12-11, Laslo Hunhold <dev_AT_frign.de> wrote:
> So would you say that the only good way would be to only accept arrays
> of unsigned char in the API? I think this seems to be the logical
> conclusion.

That's one option, but another is to keep using arrays of char, but
cast to unsigned char * before accessing. This is perfectly fine in C
since unsigned char is a character type and you are allowed to access
the representation of any object through a pointer to character type,
regardless of the object's actual type.

Accepting unsigned char * is maybe a bit nicer for libgrapheme's
implementation, but char * is nicer for the users, since that's likely
the type they already have. It also allows them to continue to use
string.h functions such as strlen or strcmp on the same buffer (which
are also defined to interpret characters as unsigned char).

> When I read more I found out that C++ introduced static_cast and
> reinterpret_cast for this simple reason: Assuming some crazy
> signed-int-representation we just make up in our heads (some random
> permutation of 0..255 to -127..128), it is impossible to really know the
> intent of the user passing us a (signed) char-array. Let's say
> "0b01010101" means "0" in our crazy signed type, does the user intend
> to convey to us a null-byte (which is simply "encoded" in the signed
> type), or does he literally mean "0b01010101"? With static_cast and
> reinterpret_cast you can handle both cases separately.

I guess it depends on how that data was obtained in the first place.
Say you have char buf[1024], and read UTF-8 encoded data from a file
into it. fread is defined in terms of fgetc, which "obtains that
character as unsigned char" and stores into an array of unsigned char
overlaying the object. In this case, accessing as unsigned char is the
intention.

I can't really think of a case where the intention would be to
interpret as signed char and convert to unsigned char. With
sign-magnitude, it'd be impossible to encode Ā (UTF-8 0xC4 0x80) this
way, since there is no char value that results in 0x80 when converted
to unsigned char.

I know it's just a thought experiment, but note that there are only
three signed-int representations valid in C: sign-magnitude, one's
complement, and two's complement. They only differ by the meaning of
the sign bit, which is the highest bit of the corresponding unsigned
integer type, so you couldn't go as crazy as the representation you
described.

> 1) Would you also go down the route of just demanding an array of
> unsigned integers of at least 8 bits?

I'd suggest sticking with char *, but unsigned char * seems reasonable as well.

> 2) Would you define it as "unsigned char *" or "uint_least8_t *"?
> I'd almost favor the latter, given the entire library is already
> using the stdint-types.

I don't think uint_least8_t is a good idea, since there is no
guarantee that it is a character type. The API user is unlikely to
have the data in a buffer of this type, so they'd potentially have to
allocate a new one and copy into it. With unsigned char *, they could
just cast if necessary.