Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold

From: Michael Forney <mforney@mforney.org>
Date: Sat, 11 Dec 2021 12:24:10 -0800

On 2021-12-11, git@suckless.org <git@suckless.org> wrote:
> The type uint32_t is not guaranteed by the standard to be present,
> but it guarantees uint_least32_t. If a libgrapheme user passes a
> pointer to a uint32_t (instead of uint_least32_t) there will be no
> problem, as the presence of uint32_t immediately implies
> uint32_t == uint_least32_t.

It is true that the existence of uint32_t implies that uint_least32_t
also has exactly 32 bits and no padding bits, but they could still be
distinct types. For instance, on a 32-bit platform with int and long
both being exactly 32 bits, you could define uint32_t as one and
uint_least32_t as the other. In that case, accessing an array of
uint32_t through an lvalue of type uint_least32_t would be undefined
behavior, since it violates the aliasing rules (C99 6.5p7) even though
the two types have identical representations.
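
To make that concrete, here's a minimal sketch of such a hypothetical
ILP32 platform; the helper name and the codepoint value are just
illustrative, and the access is only undefined when the two typedefs
really do name distinct types:

#include <stdint.h>

/*
 * Hypothetical ILP32 platform where unsigned int and unsigned long
 * are both 32 bits wide, uint32_t is unsigned int and uint_least32_t
 * is unsigned long. The helper below then reads storage whose
 * effective type is uint32_t through an lvalue of type
 * uint_least32_t, which violates C99 6.5p7.
 */
static uint_least32_t
first_codepoint(const uint_least32_t *cp)
{
	return cp[0];
}

int
main(void)
{
	uint32_t buf[1] = { 0x1F600 }; /* arbitrary example codepoint */

	/*
	 * Undefined behavior if uint32_t and uint_least32_t are
	 * distinct types; harmless when they happen to be the same.
	 */
	return (int)(first_codepoint((const uint_least32_t *)buf) & 0xFF);
}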

That said, I agree with this change. It also has the benefit of
matching the definition of C11's char32_t.
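
As an aside, on implementations that provide C11's <uchar.h> you can
even assert that identity at compile time; this is only a sanity-check
sketch, nothing libgrapheme itself needs:

#include <stdint.h>
#include <uchar.h>

/*
 * C11 7.28 defines char32_t as the same type as uint_least32_t, so
 * the generic selection picks the 1 and the assertion holds; on a
 * non-conforming implementation this would fail to compile instead.
 */
_Static_assert(_Generic((char32_t)0, uint_least32_t: 1),
               "char32_t is uint_least32_t");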

> diff --git a/src/utf8.c b/src/utf8.c
> index 4488359..1cb5e17 100644
> --- a/src/utf8.c
> +++ b/src/utf8.c
> @@ -92,7 +101,7 @@ lg_utf8_decode(const uint8_t *s, size_t n, uint32_t *cp)
> * (i.e. between 0x80 (10000000) and 0xBF (10111111))
> */
> for (i = 1; i <= off; i++) {
> - if(!BETWEEN(s[i], 0x80, 0xBF)) {
> + if(!BETWEEN((unsigned char)s[i], 0x80, 0xBF)) {
> /*
> * byte does not match format; return
> * number of bytes processed excluding the

Although irrelevant in C23, which will require two's complement
representation, I want to note the distinction between
(unsigned char)s[i] and ((unsigned char *)s)[i]. The former is a value
conversion that adds 2^CHAR_BIT to negative values, while the latter
reinterprets the byte's object representation as a CHAR_BIT-bit
unsigned integer. For example, if char had sign-magnitude
representation, we'd have (unsigned char)"\x80"[0] == 0, but
((unsigned char *)"\x80")[0] == 0x80.

The latter is probably what you want, but you could ignore this if you
only care about two's complement (which is a completely reasonable
position).
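
To spell out the difference, here's a small sketch (assuming
CHAR_BIT == 8): with a two's complement char both lines print 0x80;
they only diverge under sign-magnitude or ones' complement.

#include <stdio.h>

int
main(void)
{
	const char *s = "\x80";

	/* value conversion: 2^CHAR_BIT is added to negative values */
	printf("0x%x\n", (unsigned)(unsigned char)s[0]);

	/*
	 * reinterpretation: the byte's object representation is read
	 * directly as an unsigned char
	 */
	printf("0x%x\n", (unsigned)((const unsigned char *)s)[0]);

	return 0;
}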

> diff --git a/test/utf8-decode.c b/test/utf8-decode.c
> index 1182fb0..ee71cf9 100644
> --- a/test/utf8-decode.c
> +++ b/test/utf8-decode.c
> @@ -9,7 +9,7 @@
> #define LEN(x) (sizeof(x) / sizeof(*(x)))
>
> static const struct {
> - uint8_t *arr; /* UTF-8 byte sequence */
> + char *arr; /* UTF-8 byte sequence */
> size_t len; /* length of UTF-8 byte sequence */
> size_t exp_len; /* expected length returned */
> uint32_t exp_cp; /* expected code point returned */
> @@ -29,7 +29,9 @@ static const struct {
> * [ 11111101 ] ->
> * INVALID
> */
> - .arr = (uint8_t[]){ 0xFD },
> + .arr = (char[]){
> + (unsigned char)0xFD,
> + },

This cast doesn't do anything here. Both 0xFD and (unsigned char)0xFD
have the same value (0xFD), which can't necessarily be represented as
char. For example, if CHAR_MAX is 127, this conversion is
implementation-defined and could raise a signal (C99 6.3.1.3p3).

I think using hex escapes in a string literal ("\xFD") has the
behavior you want here. You could also create an array of unsigned
char and cast to char *.
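
Here's a hedged sketch of both alternatives, pared down to the fields
that matter here (the struct and field names just mirror the test
table):

#include <stddef.h>

static const struct {
	const char *arr; /* UTF-8 byte sequence */
	size_t len;      /* length of UTF-8 byte sequence */
} examples[] = {
	{
		/*
		 * hex escape in a string literal: the byte 0xFD is
		 * stored regardless of the signedness of char
		 */
		.arr = "\xFD",
		.len = 1,
	},
	{
		/* array of unsigned char viewed through a char * */
		.arr = (const char *)(const unsigned char[]){ 0xFD },
		.len = 1,
	},
};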

> .len = 1,
> .exp_len = 1,
> .exp_cp = LG_CODEPOINT_INVALID,