On Wed, 15 Dec 2021 12:24:21 -0800
Michael Forney <mforney_AT_mforney.org> wrote:
Dear Michael,
> I think this is a mistake. It makes it very difficult to use the API
> correctly if you have data in an array of char or unsigned char, which
> is usually the case.
> Here's an example of some real code that has a char * buffer:
> https://git.sr.ht/~exec64/imv/tree/a83304d4d673aae6efed51da1986bd7315a4d642/item/src/console.c#L54-58
>
> How would you suggest that this code be written for the new API? The
> only thing I can think is
>
> if (buffer[position] != 0) {
> size_t bufferlen = strlen(buffer) + 1 - position;
> uint8_t *newbuffer = malloc(bufferlen);
> if (!newbuffer) ...
> memcpy(newbuffer, buffer + position, bufferlen);
> position += grapheme_bytelen(newbuffer);
> free(newbuffer);
> }
> return position;
>
> This sort of thing would turn me off of using the library entirely.
Yeah, it would be insane to malloc() a new buffer. However, the case
I'm making is that we can assume that
1) uint8_t exists
2) uint8_t == unsigned char
The standard may not state this directly, but it follows from the
following observations:
1) We make use of POSIX functions in the code, so compiling
libgrapheme requires a POSIX-compliant compiler and stdlib. POSIX
requires CHAR_BIT == 8, which means that chars are exactly 8 bits
wide, and thus uint8_t exists.
2) C99 specifies char to be at least 8 bits wide. Given that char is
meant to be the smallest addressable unit and uint8_t exists, char
is exactly 8 bits wide.
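
To make the two assumptions explicit, here is a minimal sketch of how
they could be checked at compile time. It assumes a C11 compiler
(static_assert and _Generic are not available in C99), and
assumptions_hold() is a hypothetical helper that exists only for
illustration; none of this is part of libgrapheme.

```c
#include <assert.h> /* static_assert (C11) */
#include <limits.h> /* CHAR_BIT, UCHAR_MAX */
#include <stdint.h> /* uint8_t, UINT8_MAX */

/* assumption 1: chars are octets, which POSIX requires */
static_assert(CHAR_BIT == 8, "char must be exactly 8 bits wide");

/* assumption 2: uint8_t is the very same type as unsigned char */
static_assert(_Generic((uint8_t)0, unsigned char: 1, default: 0),
              "uint8_t must be unsigned char");

/*
 * hypothetical helper: if this translation unit compiles at all, both
 * assumptions hold; the runtime comparison merely restates them
 */
static int
assumptions_hold(void)
{
	return sizeof(uint8_t) == 1 && UCHAR_MAX == UINT8_MAX;
}
```

On a system where either assumption fails, the build stops with the
given message instead of silently misbehaving.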
> > Any other way would have introduced too many implicit assumptions.
>
> Like what?
I was unclear there. What I actually meant was that "char" carries
implicit assumptions in the programming world that are not even
reflected in the standard. When you specify the UTF-8 array as char *,
you basically carry on this tradition instead of being specific about
what you actually want.
> If you really want your code to break when CHAR_BIT != 8, you could
> use a static assert (there are also ways to emulate this in C99). But
> even if CHAR_BIT > 8, unsigned char is perfectly capable to represent
> all the values used in UTF-8 encoding, so I don't see the problem.
Let's take a simple example: Say you have a file in UTF-8 encoding of
known size and wanted to read it and simply print the code points. You
would probably do it as follows in C (no checks to get the point
across), and let's assume here that lg_utf8_* accepts char *:

	FILE *fp;
	size_t size, off, ret, i;
	char *data;
	uint_least32_t cp;

	/* open in binary mode so that no bytes get translated */
	fp = fopen("file.txt", "rb");

	/* get file size and allocate buffer */
	fseek(fp, 0L, SEEK_END);
	size = ftell(fp);
	rewind(fp);
	data = malloc(size);

	/* fill buffer, never requesting more than what is left */
	for (off = 0; off < size &&
	     (ret = fread(data + off, 1, size - off, fp)) > 0; off += ret)
		;
	size = off;

	/* print code points; the buffer is bounded by size, not by NUL */
	for (i = 0; i < size; i += ret) {
		ret = lg_utf8_decode(data + i, size - i, &cp);
		printf("code point: %"PRIuLEAST32"\n", cp);
	}

However, you have a problem here as soon as char is 16 bits wide
(which the standard permits), because then you read in two UTF-8 code
units at once, and lg_utf8_decode silently discards the half of the
data that sits in the high bits.
But this wouldn't even happen, given that POSIX mandates char to be 8
bits wide. On top of that, C99 mandates that char is an integer type
and that unsigned char has no padding bits, so there is only one way
to represent an unsigned integer of a given bit length.
So the case can be made that uint8_t == unsigned char. Casting between
char * and unsigned char * is fine, so you can just cast any char * to
uint8_t *. This will work, as you would otherwise not have been able
to compile libgrapheme in the first place.
Or am I missing something here, apart from the standard drawing a
semantic distinction? Is there any technical possibility of a system
that has CHAR_BIT == 8 where uint8_t != unsigned char?
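
To make the cast concrete, here is a minimal sketch of your console.c
case without any malloc(). Note that demo_bytelen() is a hypothetical
stand-in for a uint8_t-based function like grapheme_bytelen(): it only
measures a single UTF-8 sequence, not a grapheme cluster, and does no
real validation. advance() is likewise made up for illustration.

```c
#include <assert.h> /* static_assert (C11) */
#include <limits.h> /* CHAR_BIT */
#include <stddef.h> /* size_t */
#include <stdint.h> /* uint8_t */

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit chars");

/*
 * hypothetical stand-in for a uint8_t-based API: byte length of the
 * UTF-8 sequence starting at s (0 at the terminating NUL byte)
 */
static size_t
demo_bytelen(const uint8_t *s)
{
	size_t want, len;

	if (s[0] == 0x00) {
		return 0;
	} else if ((s[0] & 0xE0) == 0xC0) {
		want = 2;
	} else if ((s[0] & 0xF0) == 0xE0) {
		want = 3;
	} else if ((s[0] & 0xF8) == 0xF0) {
		want = 4;
	} else {
		want = 1; /* ASCII or stray byte: step over one byte */
	}

	/* only count continuation bytes, so we never pass the NUL */
	for (len = 1; len < want && (s[len] & 0xC0) == 0x80; len++)
		;

	return len;
}

/* the caller keeps its char * buffer and casts at the call site */
static size_t
advance(const char *buffer, size_t position)
{
	if (buffer[position] != 0) {
		position += demo_bytelen((const uint8_t *)(buffer + position));
	}

	return position;
}
```

The only change on the caller's side is the single cast in advance();
the buffer itself stays a plain char array.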
> > And even if all fails and there simply is no 8-bit-type, one can
> > always use the lg_grapheme_isbreak()-function and roll his own
> > de/encoding.
>
> I'm still confused as to what you mean by rolling your own
> de/encoding. What would that look like?
>
> If there is no 8-bit type, libgrapheme could not be compiled or used
> at all since uint8_t would be missing.
Yeah, it was a bit of a transitive argument, given that you would have
to tailor libgrapheme and remove the UTF-8 encoder/decoder. But then
you could simply use the lg_grapheme_isbreak() function, which works
on code points. How the code points are obtained is up to the user;
libgrapheme doesn't care and simply returns a "decision".
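
As a rough idea of what "rolling your own decoding" could look like,
here is a minimal sketch of a decoder that only relies on unsigned char
and uint_least32_t and thus needs no exact 8-bit type. The names
demo_decode() and DEMO_INVALID are hypothetical, it assumes one UTF-8
code unit per array element, and it is not libgrapheme's decoder.

```c
#include <assert.h> /* assert */
#include <stddef.h> /* size_t, NULL */
#include <stdint.h> /* uint_least32_t, UINT32_C */

#define DEMO_INVALID UINT32_C(0xFFFD) /* replacement character */

/*
 * decode one code point from the n code units at s into *cp and
 * return the number of code units consumed (0 only if n == 0)
 */
static size_t
demo_decode(const unsigned char *s, size_t n, uint_least32_t *cp)
{
	/* smallest code point per sequence length (rejects overlongs) */
	static const uint_least32_t min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t len, i;

	assert(s != NULL && cp != NULL);

	if (n == 0) {
		*cp = DEMO_INVALID;
		return 0;
	}

	if (s[0] < 0x80) {
		*cp = s[0];
		len = 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		*cp = s[0] & 0x1F;
		len = 2;
	} else if ((s[0] & 0xF0) == 0xE0) {
		*cp = s[0] & 0x0F;
		len = 3;
	} else if ((s[0] & 0xF8) == 0xF0) {
		*cp = s[0] & 0x07;
		len = 4;
	} else {
		*cp = DEMO_INVALID; /* stray continuation or bad lead byte */
		return 1;
	}

	for (i = 1; i < len; i++) {
		if (i >= n || (s[i] & 0xC0) != 0x80) {
			*cp = DEMO_INVALID; /* truncated sequence */
			return i;
		}
		*cp = (*cp << 6) | (s[i] & 0x3F);
	}

	if (*cp < min[len] || *cp > 0x10FFFF ||
	    (*cp >= 0xD800 && *cp <= 0xDFFF)) {
		*cp = DEMO_INVALID; /* overlong, out of range or surrogate */
	}

	return len;
}
```

The code points obtained this way could then be handed pairwise to
lg_grapheme_isbreak(); its exact signature is deliberately left out
here.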
tl;dr: I don't see what's wrong with simply casting char * to uint8_t *
given it's reasonable to assume that uint8_t == unsigned char for the
aforementioned reasons.
With best regards
Laslo
Received on Thu Dec 16 2021 - 11:06:11 CET