Re: [hackers] [libgrapheme][PATCH] Expose cp_decode and boundary (both renamed) to the user from Laslo Hunhold on 2020-05-07 (hackers mail list archive)

From: Laslo Hunhold <dev_AT_frign.de>
Date: Thu, 7 May 2020 15:15:24 +0200

On Sat, 25 Apr 2020 23:03:56 +0200
Mattias Andrée <maandree_AT_kth.se> wrote:

Dear Mattias,

> cp_decode has been renamed to grapheme_decode

I personally don't like that change. It's already difficult enough to
understand the differences between codepoints and grapheme clusters,
and changing the name of the function to that really complicates it
more and might cause misunderstandings.

This also contradicts the below goal of providing a solution for
non-UTF-8-text, which I honestly don't care about. There's no logical
reason to encode text into anything other than UTF-8 and the best
approach is to just re-encode everything into UTF-8.

It's a tough call to decide if we want to turn libgrapheme into a
general purpose UTF-8-library.

> and boundary has been renamed to grapheme_boundary.

I'm a bit conflicted with this change, though I would probably expose
it as grapheme_boundary(uint32_t, uint32_t, int *) if I chose to
include it. Sure, we do use Codepoint interally as a typedef, but for a
"public" API I prefer not to have them if possible.

The reason I'm conflicted with this change is that there's no guarantee
the grapheme-cluster-boundary algorithm gets changed again. It already
has been in Unicode 12.0, which made it suddenly require a state to be
carried with it, but there's no guarantee it will get even crazier,
making it almost infeasible to expose more than a "gclen()"-function to
the user.

The current implementation can store 32 states and uses 2 of them for
the algorithm. In this regard, we still have some headroom.

> The purpose of this is to allow faster text rendering
> where both individual code points and grapheme clusters
> boundaries are of interest, but it also (1) makes it
> easy to do online processing of large document (the user
> does not need to search for spaces, but only know an
> upper limit for how long encoding is needed to encode
> any codepoint) and (2) makes to library easy to use
> with non-UTF-8 text.

As I said above, I don't care about non-UTF-8-text and anything
non-UTF-8 is either an internal representation (e.g. in Java with
UTF-16LE) or ancient.

I see how the stateful function might be useful though for a
byte-per-byte reading of a file, or something else.

> This change also eliminates all unnamespaced, non-static
> functions that are not exposed to the user.

This is a very good point! I'll try to solve that in an own coding
session on the existing code. Thanks for your patch! I will have to
think about this more. What do you think about the following
API-overview integrating above changes? This would also include some
UTF-8 functionality.

size_t grapheme_cp_decode(uint32_t*, char *, size_t)
   Decode the UTF-8 sequence into a codepoint from the array of given
   length, returning the number of bytes consumed or zero if there was
   an error (which also sets the codepoint to UTF_INVALID)

size_t grapheme_cp_encode(uint32_t, char *, size_t)
   Encode the given codepoint into UTF-8 in the given array of given
   length. Return the number of bytes "used" or zero if something failed
   (codepoint out of bounds, array too small, etc.)

size_t grapheme_len(const char *)
   Return the length (in bytes) of the grapheme cluster beginning at
   the provided char-address.

int grapheme_boundary(uint32_t, uint32_t, int *state)
   Based on the current state (which is 0 at the beginning) determine
   if between two codepoints is a grapheme-cluster-boundary. If so,
   return 1, and 0 otherwise.

What do you think?

With best regards

Laslo
Received on Thu May 07 2020 - 15:15:24 CEST

This archive was generated by hypermail 2.3.0 : Thu May 07 2020 - 15:36:36 CEST