Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold

From: Michael Forney <mforney_AT_mforney.org>
Date: Thu, 16 Dec 2021 14:01:48 -0800

On 2021-12-16, Laslo Hunhold <dev_AT_frign.de> wrote:
> I know this thread is already long enough, but I took my time now to
> read deeper into the topic. Please read below, as we might come to a
> conclusion there now.

Thanks for sticking with it. I know this topic is quite pedantic and
hypothetical, but I think it's still important to consider and
understand.

> Interestingly, there was even an internal discussion on the
> gcc-bugtracker[0] about this. They were thinking about adding an
> attribute __attribute__((no_alias)) to the uint8_t typedef so it would
> explicitly lose the aliasing-exception.
>
> There's a nice rant on [1] and a nice discussion on [2] about this
> whole thing. And to be honest, at this point I still wasn't 100%
> satisfied.

Thanks for the links. The aliasing discussion in [0] is very
interesting, and I will definitely bookmark [1] to use as a reference
in the future.

> What convinced me was how they added UTF-8-literals in C11. There you
> can define explicit UTF-8 literals as u8"Hällö Wörld!" and they're of
> type char[]. So even though char * is a bit ambiguous, we document well
> that we expect a UTF-8 string. C11 goes further and accommodates us
> with ways to portably define them.

Interestingly, there is a C23 proposal[0] to introduce char8_t as a
typedef for unsigned char and to change the type (!) of UTF-8 string
literals from char[] to char8_t[] (i.e. unsigned char[]). It has not
been discussed in any meeting yet, but it will be interesting to see
what the committee thinks of it. I don't think u8 string literals are
widely used at this point, but it's strange to see a proposal break
backwards compatibility like this.

[0] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm
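
For illustration, here is something that is valid today but, if I read
N2653 correctly, would stop compiling under the proposal:

    #include <stdio.h>

    int
    main(void)
    {
        /* fine in C11/C17, where u8"..." has type char[]; under
         * N2653 the literal would have type char8_t[] (unsigned
         * char[]), so this initialization would need a cast */
        const char *s = u8"Hällö Wörld!";

        puts(s);
        return 0;
    }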

> To also address this point, here's what we can do to make us all happy:
>
> 1) Change the API to accept char*
> 2) Cast the pointers internally to (unsigned char *) for bitwise
> modifications. We may do that as we may alias with char, unsigned
> char and signed char.
> 3) Treat it as an invalid code point when any bit higher than the 8th
> is set. This is actually already in the implementation, as we have
> strict ranges.
>
> Please take a look at the attached diff and let me know what you think.
> Is this portable, and am I correct to assume we might even handle
> chars wider than 8 bits properly?

I agree with all of this. Your patch looks good to me.
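
As a sketch of how I understand the approach (made-up function name,
not the actual libgrapheme code):

    #include <stddef.h>
    #include <stdint.h>

    /* hypothetical skeleton following points 1-3 */
    size_t
    decode_byte(const char *str, size_t len, uint_least32_t *cp)
    {
        /* (2) unsigned char * may alias any object type */
        const unsigned char *s = (const unsigned char *)str;

        if (len == 0 || s[0] > 0xff) {
            /* (3) bits above the 8th can only be set when
             * CHAR_BIT > 8; such a byte can never be part of
             * a UTF-8 sequence */
            *cp = 0xfffd; /* REPLACEMENT CHARACTER */
            return len > 0;
        }
        *cp = s[0]; /* multi-byte sequences elided in this sketch */
        return 1;
    }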

> There's just one open question: Do you know of a better way than to do
>
> (char *)(unsigned char[]){ 0xff, 0xef, 0xa0 }
>
> to specify a literal char-array with specific bit-patterns?

I believe "\xff\xef\xa0" also works, but I am not very confident about
this; the wording of the standard is not clear to me.
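
For comparison, the two spellings side by side:

    /* compound literal: unambiguous bit patterns, but note that no
     * terminating NUL byte is included */
    const char *a = (char *)(unsigned char[]){ 0xff, 0xef, 0xa0 };

    /* hex escapes: shorter and NUL-terminated, but relies on the
     * 6.4.4.4 wording quoted below */
    const char b[] = "\xff\xef\xa0";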

It says (6.4.4.4p6):

> The hexadecimal digits that follow the backslash and the letter x
> in a hexadecimal escape sequence are taken to be part of the
> construction of a single character for an integer character constant
> or of a single wide character for a wide character constant. The
> numerical value of the hexadecimal integer so formed specifies the
> value of the desired character or wide character.

Okay, so '\xff' constructs a single character with value 255. But is
'\xff' considered an integer character constant containing a single
character?

Then (6.4.4.4p10):

> An integer character constant has type int. The value of an integer
> character constant containing a single character that maps to a
> single-byte execution character is the numerical value of the
> representation of the mapped character interpreted as an integer.

Does this one apply? I'm not sure, because later sentences mention
escape sequences explicitly, and it's not clear whether 255 maps to a
single-byte execution character when CHAR_MAX == 127. I'm also not
sure how to parse the last part of the sentence (some grouping
parentheses would be helpful). The representation of 255 is 11111111,
so what does it mean to interpret that as an integer (of what width)?

> The value of an integer character constant containing more than one
> character (e.g., 'ab'), or containing a character or escape sequence
> that does not map to a single-byte execution character, is
> implementation-defined.

If '\xff' is considered to not map to a single-byte execution
character, then this would indicate that it's implementation-defined.

> If an integer character constant contains
> a single character or escape sequence, its value is the one that
> results when an object with type char whose value is that of the
> single character or escape sequence is converted to type int.

What does it mean for a char to have the value of the escape
sequence, since char may not be able to represent 255? And why are
there two sentences that specify the value of an integer character
constant containing a single character? If the first one applies, is
this one ignored?
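
My best guess at the intended reading, which at least matches the
example in p13 below:

    /* 255 does not fit in a signed 8-bit char; the conversion is
     * implementation-defined and typically wraps to -1, which then
     * converts back to int unchanged */
    int v = (int)(char)0xff; /* typically -1 (signed char) or 255 (unsigned) */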

The main thing that indicates to me that it is defined is example 2 in
that section (6.4.4.4p13):

> Consider implementations that use two's complement representation
> for integers and eight bits for objects that have type char. In an
> implementation in which type char has the same range of values as
> signed char, the integer character constant '\xFF' has the value
> -1; if type char has the same range of values as unsigned char, the
> character constant '\xFF' has the value +255.

It mentions two's complement and 8-bit char explicitly, and says
'\xFF' has the value -1 (not "may have"). This makes me think that I
should somehow be able to justify this using the above paragraphs.
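
A quick empirical check (which of course only shows what one
implementation does, not what the standard guarantees):

    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* on a two's-complement, 8-bit-char system this prints -1 if
         * plain char is signed and 255 if it is unsigned, matching
         * the example in 6.4.4.4p13 */
        printf("'\\xff' = %d (CHAR_MIN = %d)\n", '\xff', CHAR_MIN);
        return 0;
    }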

So I can't say for sure, and I haven't had much luck searching the
web for discussion of this, but I think it should be fine to use hex
escapes to construct string literals with specific bit patterns (at
the very worst, it is implementation-defined).