Re: [hackers] [libgrapheme][PATCH] Expose cp_decode and boundary (both renamed) to the user

From: Mattias Andrée <maandree_AT_kth.se>
Date: Thu, 7 May 2020 18:32:23 +0200

Hi Laslo,

On Thu, 7 May 2020 15:15:24 +0200
Laslo Hunhold <dev_AT_frign.de> wrote:

> On Sat, 25 Apr 2020 23:03:56 +0200
> Mattias Andrée <maandree_AT_kth.se> wrote:
>
> Dear Mattias,
>
> > cp_decode has been renamed to grapheme_decode
>
> I personally don't like that change. It's already difficult enough to
> understand the differences between codepoints and grapheme clusters,
> and changing the name of the function to that really complicates it
> more and might cause misunderstandings.

Perhaps, but do you I wouldn't expect anyone that don't understand
the difference to even use libgrapheme. But you would also state in
grapheme.h and the man pages that all functions except grapheme_len
are low-level functions.

>
> This also contradicts the below goal of providing a solution for
> non-UTF-8-text, which I honestly don't care about. There's no logical
> reason to encode text into anything other than UTF-8 and the best
> approach is to just re-encode everything into UTF-8.

Not a goal, but a positive side-effect of exposing the boundary
test function.

>
> It's a tough call to decide if we want to turn libgrapheme into a
> general purpose UTF-8-library.

I don't hiding the code function, but since the library already
used it, I thought it was nice to not have to implement your own
when using the library.

>
> > and boundary has been renamed to grapheme_boundary.
>
> I'm a bit conflicted with this change, though I would probably expose
> it as grapheme_boundary(uint32_t, uint32_t, int *) if I chose to
> include it. Sure, we do use Codepoint interally as a typedef, but for a
> "public" API I prefer not to have them if possible.

I would as well, I just didn't want to change too much.

>
> The reason I'm conflicted with this change is that there's no guarantee
> the grapheme-cluster-boundary algorithm gets changed again. It already
> has been in Unicode 12.0, which made it suddenly require a state to be
> carried with it, but there's no guarantee it will get even crazier,
> making it almost infeasible to expose more than a "gclen()"-function to
> the user.

How about

        typedef struct grapheme_state GRAPHEME_STATE;

        /* Hidden from the user { */
        struct grapheme_state {
                uint32_t cp0;
                int state;
        };
        /* } */

        int grapheme_boundary(uint32_t cp1, GRAPHEME_STATE *);

        GRAPHEME_STATE *grapheme_create_state(void);

        /* Just in case the state in the future
         * would require dynamic allocation */
        void grapheme_free_state(GRAPHEME_STATE *);

grapheme_create_state() would reset the state each time
a boundary is found, so no reset function would be needed,
and would be useful to avoid a new allocation if the
grapheme cluster identification process is aborted and a
a started for a new text. Since this would be very rare
there, no reset function is needed.

The only future I can see there this wouldn't be sufficient
if a cluster break (or non-break) could be retroactively
inserted where where the algorithm already stated that there
as no break (or was a break). This would be so bizarre, I
cannot imagine this would ever be the case.

>
> The current implementation can store 32 states and uses 2 of them for
> the algorithm. In this regard, we still have some headroom.
>
> > The purpose of this is to allow faster text rendering
> > where both individual code points and grapheme clusters
> > boundaries are of interest, but it also (1) makes it
> > easy to do online processing of large document (the user
> > does not need to search for spaces, but only know an
> > upper limit for how long encoding is needed to encode
> > any codepoint) and (2) makes to library easy to use
> > with non-UTF-8 text.
>
> As I said above, I don't care about non-UTF-8-text and anything
> non-UTF-8 is either an internal representation (e.g. in Java with
> UTF-16LE) or ancient.
>
> I see how the stateful function might be useful though for a
> byte-per-byte reading of a file, or something else.
>
> > This change also eliminates all unnamespaced, non-static
> > functions that are not exposed to the user.
>
> This is a very good point! I'll try to solve that in an own coding
> session on the existing code. Thanks for your patch! I will have to
> think about this more. What do you think about the following
> API-overview integrating above changes? This would also include some
> UTF-8 functionality.
>
> size_t grapheme_cp_decode(uint32_t*, char *, size_t)
> Decode the UTF-8 sequence into a codepoint from the array of given
> length, returning the number of bytes consumed or zero if there was
> an error (which also sets the codepoint to UTF_INVALID)
>
> size_t grapheme_cp_encode(uint32_t, char *, size_t)
> Encode the given codepoint into UTF-8 in the given array of given
> length. Return the number of bytes "used" or zero if something failed
> (codepoint out of bounds, array too small, etc.)
>
> size_t grapheme_len(const char *)
> Return the length (in bytes) of the grapheme cluster beginning at
> the provided char-address.
>
> int grapheme_boundary(uint32_t, uint32_t, int *state)
> Based on the current state (which is 0 at the beginning) determine
> if between two codepoints is a grapheme-cluster-boundary. If so,
> return 1, and 0 otherwise.
>
> What do you think?

I don't see the point of including grapheme_cp_encode(), however
I'm not opposed to making a larger UTF-8/Unicode library, rather
I think it would be nice to have one place for all my Unicode
needs, especially if I otherwise would have a hand full of libraries
that all their own UTF-8 decoding functions that all have to be
linked.

>
> With best regards
>
> Laslo
>
Received on Thu May 07 2020 - 18:32:23 CEST

This archive was generated by hypermail 2.3.0 : Thu May 07 2020 - 19:36:38 CEST