Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold

From: Laslo Hunhold <dev_AT_frign.de>
Date: Thu, 16 Dec 2021 11:06:11 +0100

On Wed, 15 Dec 2021 12:24:21 -0800
Michael Forney <mforney_AT_mforney.org> wrote:

Dear Michael,

> I think this is a mistake. It makes it very difficult to use the API
> correctly if you have data in an array of char or unsigned char, which
> is usually the case.
>
> Here's an example of some real code that has a char * buffer:
> https://git.sr.ht/~exec64/imv/tree/a83304d4d673aae6efed51da1986bd7315a4d642/item/src/console.c#L54-58
>
> How would you suggest that this code be written for the new API? The
> only thing I can think is
>
> if (buffer[position] != 0) {
>         size_t bufferlen = strlen(buffer) + 1 - position;
>         uint8_t *newbuffer = malloc(bufferlen);
>         if (!newbuffer) ...
>         memcpy(newbuffer, buffer + position, bufferlen);
>         position += grapheme_bytelen(newbuffer);
>         free(newbuffer);
> }
> return position;
>
> This sort of thing would turn me off of using the library entirely.

yeah, it would be insane to malloc() a new buffer. However, the case
I'm making is that we can assume that

 1) uint8_t exists
 2) uint8_t == unsigned char

This may not be directly specified in the standard, but it follows
from these two observations:

1) We make use of POSIX functions in the code, so compiling
   libgrapheme requires a POSIX-compliant compiler and stdlib. POSIX
   requires CHAR_BIT == 8, which means that we can assume that chars
   are 8 bits wide, and thus uint8_t exists.
2) C99 specifies char to be at least 8 bits in size. Given that char
   is meant to be the smallest addressable unit and uint8_t exists,
   char is exactly 8 bits.

> > Any other way would have introduced too many implicit assumptions.
>
> Like what?

I was unclear there. What I actually meant was that "char" carries
implicit assumptions in the programming world that are not actually
reflected in the standard. By specifying the UTF-8 array as char *,
you just carry on this tradition instead of being specific about what
you actually want.

> If you really want your code to break when CHAR_BIT != 8, you could
> use a static assert (there are also ways to emulate this in C99). But
> even if CHAR_BIT > 8, unsigned char is perfectly capable to represent
> all the values used in UTF-8 encoding, so I don't see the problem.

Let's take a simple example: Say you have a file in UTF-8 encoding of
known size and want to read it in and simply print its code points.
You would probably do it as follows in C (no error checks, to keep
the point clear), and let's assume here that lg_utf8_* accepts char *:

   FILE *fp;
   size_t size, off, ret, i;
   char *data;
   uint_least32_t cp;

   /* open */
   fp = fopen("file.txt", "r");

   /* get file size and allocate buffer */
   fseek(fp, 0L, SEEK_END);
   size = ftell(fp);
   rewind(fp);
   data = malloc(size);

   /* fill buffer */
   for (off = 0; (ret = fread(data + off, 1, size - off, fp)) > 0; off += ret)
      ;

   /* print code points */
   for (i = 0; i < size; i += ret) {
      ret = lg_utf8_decode(data + i, size - i, &cp);
      printf("code point: %"PRIu32"\n", cp);
   }

However, here you would have a problem if char were suddenly 16 bits
(which the standard permits): each char you read in would then hold
two UTF-8 code units at once, but lg_utf8_decode would silently
discard the half stored in the high bits.
But this cannot even happen here: POSIX mandates char to be 8 bits,
and given that C99 mandates char to be an integer type without any
padding, there is only one way to represent an unsigned integer of
that bit-length.

So the case can be made that uint8_t == unsigned char, and casting
between char and unsigned char is fine. You can thus just cast any
char * to uint8_t *, and that will work, because otherwise you would
not even have been able to compile libgrapheme in the first place.
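
Applied to your console.c snippet from above, the whole thing then
collapses to a plain cast; here is a rough sketch, keeping your
variable names and assuming grapheme_bytelen() takes a uint8_t * as
in your snippet:

   if (buffer[position] != 0) {
      /* sketch: just reinterpret the char buffer as uint8_t */
      position += grapheme_bytelen((uint8_t *)(buffer + position));
   }
   return position;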

Or am I missing something here, apart from the standard semantically
making a distinction? Is there any technical possibility of a system
with CHAR_BIT == 8 where uint8_t != unsigned char?

> > And even if all fails and there simply is no 8-bit-type, one can
> > always use the lg_grapheme_isbreak()-function and roll his own
> > de/encoding.
>
> I'm still confused as to what you mean by rolling your own
> de/encoding. What would that look like?
>
> If there is no 8-bit type, libgrapheme could not be compiled or used
> at all since uint8_t would be missing.

Yeah, it was a bit of a transitive argument, given that you would
have to tailor libgrapheme and remove the UTF-8 encoder/decoder. But
then you could simply use the lg_grapheme_isbreak() function, which
works on code points. How you obtain the code points is up to you;
libgrapheme doesn't care and simply returns a "decision".
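
Roughly like the following sketch, where next_codepoint() stands in
for whatever decoder you roll yourself (see grapheme.h for the exact
state type and prototype, this is just to illustrate the data flow):

   /* the state type name here is sketched; see grapheme.h */
   LG_SEGMENTATION_STATE state = { 0 };
   uint_least32_t prev, cp;

   /* next_codepoint() is a stand-in for your own decoder */
   if (next_codepoint(&prev)) {
      while (next_codepoint(&cp)) {
         if (lg_grapheme_isbreak(prev, cp, &state)) {
            /* a grapheme cluster ends between prev and cp */
         }
         prev = cp;
      }
   }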

tl;dr: I don't see what's wrong with simply casting char * to uint8_t *
given it's reasonable to assume that uint8_t == unsigned char for the
aforementioned reasons.

With best regards

Laslo