Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold

From: Laslo Hunhold <dev_AT_frign.de>
Date: Fri, 17 Dec 2021 00:08:56 +0100

On Thu, 16 Dec 2021 14:01:48 -0800
Michael Forney <mforney_AT_mforney.org> wrote:

Dear Michael,

> Thanks for sticking with it. I know this topic is quite pedantic and
> hypothetical, but I think it's still important to consider and
> understand.

Yeah, definitely! Most people probably think we're crazy for discussing
this stuff for so long, but it's imperative to have a "stable" API
before releasing version 1.

> Thanks for the links. The aliasing discussion in [0] is very
> interesting, and I will definitely bookmark [1] to use as a reference
> in the future.

I'm glad you can make use of it!

> Interestingly, there is a C23 proposal[0] to introduce char8_t as a
> typedef for unsigned char and change the type (!) of UTF-8 string
> literals from char * to char8_t * (aka unsigned char *). It has not
> been discussed in any meeting yet, but it will be interesting to see
> what the committee thinks of it. I don't think u8 string literals are
> widely used at this point, but it's weird to see a proposal breaking
> backwards compatibility like this.
>
> [0] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm

I stumbled upon that as well.

> I agree with all of this. Your patch looks good to me.

Thanks for checking the patch! Nice to hear that you agree.

> > The hexadecimal digits that follow the backslash and the letter x
> > in a hexadecimal escape sequence are taken to be part of the
> > construction of a single character for an integer character constant
> > or of a single wide character for a wide character constant. The
> > numerical value of the hexadecimal integer so formed specifies the
> > value of the desired character or wide character.
>
> Okay, so '\xff' constructs a single character with value 255. But, is
> '\xff' considered an integer character constant containing a single
> character?
>
> Then (6.4.4.4p10):
>
> > An integer character constant has type int. The value of an integer
> > character constant containing a single character that maps to a
> > single-byte execution character is the numerical value of the
> > representation of the mapped character interpreted as an integer.
>
> Does this one apply? Not sure because later sentences mention escape
> sequences explicitly, and it's not clear if 255 maps to a single-byte
> execution character if CHAR_MAX == 127. Also, I'm not sure how to
> parse the last part of the sentence (some grouping parentheses would
> be helpful). The representation of 255 is 11111111, so what does it
> mean to interpret as an integer (of what width)?
>
> > The value of an integer character constant containing more than one
> > character (e.g., 'ab'), or containing a character or escape sequence
> > that does not map to a single-byte execution character, is
> > implementation-defined.
>
> If '\xff' is considered to not map to a single-byte execution
> character, then this would indicate that it's implementation-defined.
>
> > If an integer character constant contains
> > a single character or escape sequence, its value is the one that
> > results when an object with type char whose value is that of the
> > single character or escape sequence is converted to type int.
>
> What does it mean for a char to have value of the escape sequence,
> since char may not be able to represent 255? Why are there two
> sentences that specify the value of an integer character constant
> containing a single character? If the first one applies, is this one
> ignored?
>
> The main thing that indicates to me that it is defined is example 2 in
> that section (6.4.4.4p13):
>
> > Consider implementations that use two's complement representation
> > for integers and eight bits for objects that have type char. In an
> > implementation in which type char has the same range of values as
> > signed char, the integer character constant '\xFF' has the value
> > -1; if type char has the same range of values as unsigned char, the
> > character constant '\xFF' has the value +255.
>
> It mentions two's complement and 8-bit char explicitly, and says
> '\xFF' has the value -1 (not "may have"). This makes me think that I
> should somehow be able to justify this using the above paragraphs.
>
> So I can't say for sure, and I haven't been very lucky with searching
> the web for discussion about this, but I think it should be fine to
> use hex escapes to construct string literals with specific bit
> patterns (at the very worst it is implementation defined).

Thanks for digging through the standard! I ran into exactly the same
pitfall and, to be honest, I'm not sure either. In the end, I think
just building an unsigned char array and casting it to (char *) is
probably the safest way to go. :)
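
To make the idea concrete, here is a minimal sketch of what I mean. The
names raw_input, as_char_string and byte_at are made up for
illustration and are not part of libgrapheme. The bytes are written as
unsigned char values, so no hex escape in a character or string literal
is involved, and they are only viewed as char * afterwards:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical test input: U+00E4 encoded as C3 A4, followed by a
 * stray 0xFF byte and a terminating NUL. Every initializer fits into
 * unsigned char, so no implementation-defined conversion happens here.
 */
static const unsigned char raw_input[] = { 0xC3, 0xA4, 0xFF, 0x00 };

/* View the byte array as a string for an API taking const char *. */
static const char *
as_char_string(const unsigned char *bytes)
{
	assert(bytes != NULL);

	return (const char *)bytes;
}

/*
 * Read byte i back through an unsigned char pointer, so the result is
 * in the range 0..UCHAR_MAX no matter whether char is signed.
 */
static unsigned int
byte_at(const char *s, size_t i)
{
	return ((const unsigned char *)s)[i];
}
```

With this, strlen(as_char_string(raw_input)) is 3 and
byte_at(as_char_string(raw_input), 2) is 0xFF, independent of the
signedness of char.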

I'll push the commit and add a man page for the UTF-8 functions. At
that point, we should be ready for a first release.

With best regards

Laslo
Received on Fri Dec 17 2021 - 00:08:56 CET