Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold

From: Michael Forney <mforney_AT_mforney.org>
Date: Thu, 16 Dec 2021 02:45:54 -0800

On 2021-12-16, Laslo Hunhold <dev_AT_frign.de> wrote:
> However, the case
> I'm making is that we can assume that
>
> 1) uint8_t exists
> 2) uint8_t == unsigned char

I think assumption 1 is valid, but not necessarily 2.

> This may not be directly specified in the standard, but follows from
> the following observations:
>
> 1) We make use of POSIX-functions in the code, so compiling
> libgrapheme requires a POSIX-compliant compiler and stdlib. POSIX
> requires CHAR_BIT == 8, which means that we can assume that chars
> are 8 bit, and thus uint8_t exists.
> 2) C99 specifies char to be of at least 8 bit size. Given char is meant
> to be the smallest addressable unit and uint8_t exists, char is
> exactly 8 bits.

Both of these observations are true, but just because uint8_t is 8-bit
and unsigned char is 8-bit doesn't mean that uint8_t == unsigned char.
A C implementation can have implementation-defined extended integer
types, so it is possible that it defines uint8_t as an 8-bit extended
integer type, distinct from unsigned char (similar to how long long
and long may be distinct 64-bit integer types). As far as I know, this
would still be POSIX compliant.
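
A minimal sketch of how one could probe this on a given implementation,
assuming a C11 compiler (for _Generic and static_assert). The macro and
function names are hypothetical and only serve as illustration. The
width check always passes wherever uint8_t exists, while the type
identity check is implementation-defined:

```c
#include <assert.h>  /* static_assert (C11) */
#include <limits.h>  /* CHAR_BIT */
#include <stdint.h>  /* uint8_t */

/* Width can be checked at compile time; it holds wherever uint8_t exists. */
static_assert(sizeof(uint8_t) * CHAR_BIT == 8, "uint8_t is 8 bits wide");

/*
 * Type identity is a separate question from width. This evaluates to 1
 * if uint8_t is the very same type as unsigned char, and to 0 if it is
 * some other (e.g. extended) integer type. (Hypothetical macro name.)
 */
#define UINT8_IS_UCHAR _Generic((uint8_t)0, unsigned char: 1, default: 0)

/*
 * The analogous case from the text: long and long long are always
 * distinct types, even on systems where both are 64 bits wide.
 */
#define LLONG_IS_LONG _Generic((long long)0, long: 1, default: 0)

/* Small wrappers so the results can be inspected at run time. */
int
uint8_is_uchar(void)
{
	return UINT8_IS_UCHAR;
}

int
llong_is_long(void)
{
	return LLONG_IS_LONG;
}
```

Many common implementations define uint8_t as a typedef for unsigned
char, so uint8_is_uchar() would return 1 there, but a result of 0 would
be equally conforming.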

> However, here you have a problem when suddenly char is 16 bits (might
> be according to the standard). Because then you read in two
> UTF-8-code-units at once, but lg_utf8_decode silently discards half of
> the data in the high bits.
> But this wouldn't even happen, given POSIX mandates char to be 8 bits,
> and given even C99 mandates char to be of integral type, you only have
> one unique way to specify an unsigned integer of certain bit-length,
> given C99 also mandates that char shouldn't have any padding.

Ah, okay, I see what you mean. To be honest I'm not really sure how
something like file encoding and I/O would work on such a system, but
I was assuming that files would contain one code unit per byte, rather
than packing multiple code units into a single byte. For instance, on
a hypothetical system with 9-bit bytes, I wouldn't expect a code unit
to cross the byte boundary.

> So the case can be made that uint8_t == unsigned char, and casting
> between char and unsigned char is fine, so you just cast any char * to
> uint8_t * which will work as you would otherwise not have been able to
> even compile libgrapheme in the first place.
>
> Or am I missing something here except from the standard semantically
> making a difference? Is there any technical possibility to have a
> system that has CHAR_BIT == 8 where uint8_t != unsigned char?

Yes, I believe this is a possibility.

If you are assuming that unsigned char == uint8_t, I think you should
just use unsigned char in your API. You could document the API as
expecting one UTF-8 code unit per byte if you are worried about
confusion regarding CHAR_BIT.
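
As a rough sketch of what such an interface could look like (this is a
hypothetical helper, not libgrapheme's actual API), the function below
takes a plain char pointer, reads it as unsigned char internally, and
documents the one-code-unit-per-byte expectation in its contract:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper for illustration only.
 *
 * Contract: each element of s holds exactly one UTF-8 code unit,
 * regardless of CHAR_BIT.
 *
 * Returns the sequence length implied by the lead byte at s, or 0 if
 * that byte cannot start a UTF-8 sequence.
 */
size_t
utf8_seq_len(const char *s)
{
	unsigned char c;

	assert(s != NULL);

	/* Accessing any object through unsigned char is always permitted. */
	c = *(const unsigned char *)s;

	if (c <= 0x7F)
		return 1; /* ASCII */
	if (c >= 0xC2 && c <= 0xDF)
		return 2; /* two-byte lead */
	if (c >= 0xE0 && c <= 0xEF)
		return 3; /* three-byte lead */
	if (c >= 0xF0 && c <= 0xF4)
		return 4; /* four-byte lead */

	/* Continuation byte, overlong lead, or value out of range. */
	return 0;
}
```

A caller can pass string literals or char buffers directly, e.g.
utf8_seq_len("a"), without casting to uint8_t *. On a system with
CHAR_BIT > 8, any value above 0xF4 falls through to the final return,
so stray high bits are rejected rather than silently discarded.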

Received on Thu Dec 16 2021 - 11:45:54 CET