Re: [hackers] [libgrapheme][PATCH] Expose cp_decode and boundary (both renamed) to the user from Laslo Hunhold on 2020-05-09 (hackers mail list archive)

From: Laslo Hunhold <dev_AT_frign.de>
Date: Sat, 9 May 2020 07:25:41 +0200

On Thu, 7 May 2020 18:32:23 +0200
Mattias Andrée <maandree_AT_kth.se> wrote:

Dear Mattias,

> Perhaps, but I wouldn't expect anyone who doesn't understand
> the difference to even use libgrapheme. But you could also state in
> grapheme.h and the man pages that all functions except grapheme_len
> are low-level functions.

that could work.

> Not a goal, but a positive side-effect of exposing the boundary
> test function.

I agree with that, it has a positive side-effect.

> > The reason I'm conflicted with this change is that there's no
> > guarantee the grapheme-cluster-boundary algorithm won't get changed
> > again. It already was in Unicode 12.0, which made it suddenly
> > require a state to be carried with it, and there's no guarantee it
> > won't get even crazier, making it almost infeasible to expose more
> > than a "gclen()"-function to the user.
>
> How about
>
> typedef struct grapheme_state GRAPHEME_STATE;
>
> /* Hidden from the user { */
> struct grapheme_state {
>     uint32_t cp0;
>     int state;
> };
> /* } */
>
> int grapheme_boundary(uint32_t cp1, GRAPHEME_STATE *);
>
> GRAPHEME_STATE *grapheme_create_state(void);
>
> /* Just in case the state in the future
> * would require dynamic allocation */
> void grapheme_free_state(GRAPHEME_STATE *);
>
> grapheme_boundary() would reset the state each time
> a boundary is found, so no reset function would be needed.
> One would only be useful to avoid a new allocation if the
> grapheme cluster identification process is aborted and
> restarted for a new text. Since this would be very rare,
> no reset function is needed.
>
> The only future I can see where this wouldn't be sufficient
> is if a cluster break (or non-break) could be retroactively
> inserted where the algorithm already stated that there
> was no break (or was a break). This would be so bizarre, I
> cannot imagine this would ever be the case.

I don't like this change, because it destroys reentrancy, which is very
important for multithreaded applications, and complicates things
unnecessarily.
However, I think we should just risk it and assume that future
versions of the Unicode-Grapheme-Boundary-algorithm will only rely on
such a state.
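
To illustrate what "relying on such a state" could look like while
staying reentrant, here is a minimal sketch. The names sketch_state,
sketch_boundary and sketch_count are made up for this mail and are not
libgrapheme's API. The rules are also only a tiny subset (CR LF stays
together, regional indicators pair up, everything else breaks), so this
is not the real Unicode algorithm. The point is only that the caller
owns the state, so there is no allocation and no hidden global.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical caller-owned state; not libgrapheme's actual layout. */
struct sketch_state {
	uint32_t prev;      /* previous code point */
	bool     have_prev; /* false before the first code point */
	unsigned ri_run;    /* regional indicators in the current cluster */
};

static bool
is_ri(uint32_t cp)
{
	/* Regional Indicator Symbols U+1F1E6..U+1F1FF */
	return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

/*
 * Returns true if there is a cluster break before cp.
 * Only two rules are modelled; this is NOT the full algorithm.
 */
static bool
sketch_boundary(struct sketch_state *s, uint32_t cp)
{
	bool brk = true;

	if (!s->have_prev) {
		brk = true; /* always break at the start of text */
	} else if (s->prev == 0x0D && cp == 0x0A) {
		brk = false; /* CR LF stays together */
	} else if (is_ri(s->prev) && is_ri(cp) && s->ri_run % 2 == 1) {
		brk = false; /* second half of a regional-indicator pair */
	}

	/* This counter is the part that cannot be derived from prev alone. */
	s->ri_run = is_ri(cp) ? (brk ? 1 : s->ri_run + 1) : 0;
	s->prev = cp;
	s->have_prev = true;

	return brk;
}

/* Counts clusters; the state lives on the caller's stack. */
static size_t
sketch_count(const uint32_t *cp, size_t n)
{
	struct sketch_state s = { 0 };
	size_t i, clusters = 0;

	for (i = 0; i < n; i++)
		clusters += sketch_boundary(&s, cp[i]);

	return clusters;
}
```

Since every caller zero-initialises its own struct, two threads can run
sketch_count() at the same time without sharing anything.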

> > [...]
> > What do you think?
>
> I don't see the point of including grapheme_cp_encode(); however,
> I'm not opposed to making a larger UTF-8/Unicode library. Rather,
> I think it would be nice to have one place for all my Unicode
> needs, especially if I otherwise would have a handful of libraries
> that all have their own UTF-8 decoding functions that all have to be
> linked.

Yes, I agree with that. There are lots of bad and unsafe
UTF-8 de/encoders out there, and the one in libgrapheme is actually pretty
fast and safe (e.g. no overlong-encoded NUL, proper error handling, etc.).
It would be no bloat to expose it outside, as it runs "in the
background" anyway. It's more of a debate on the "purity" of
libgrapheme, but when including the boundary function, offering a way
to read codepoints from a char-array makes a lot of sense.
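
As a rough sketch of what "safe" means here, the following decoder
rejects overlong forms (including a two-byte NUL), surrogates, values
above U+10FFFF and truncated sequences. It is written for this mail
only; sketch_decode and SKETCH_INVALID are assumed names, and this is
not the code in libgrapheme.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INVALID UINT32_C(0xFFFD) /* replacement character */

/*
 * Decodes one code point from s (n bytes available) into *cp.
 * Returns the number of bytes consumed (0 only if n == 0).
 * On any error *cp is SKETCH_INVALID and at least one byte is consumed.
 */
static size_t
sketch_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	static const struct {
		unsigned char mask, lead; /* lead-byte pattern */
		uint32_t min;             /* smallest value allowed at this length */
	} tab[] = {
		{ 0x80, 0x00, 0x000000 }, /* 0xxxxxxx, 1 byte  */
		{ 0xE0, 0xC0, 0x000080 }, /* 110xxxxx, 2 bytes */
		{ 0xF0, 0xE0, 0x000800 }, /* 1110xxxx, 3 bytes */
		{ 0xF8, 0xF0, 0x010000 }, /* 11110xxx, 4 bytes */
	};
	size_t len, i;
	uint32_t v;

	*cp = SKETCH_INVALID;
	if (n == 0)
		return 0;

	/* len is the number of continuation bytes that must follow */
	for (len = 0; len < 4; len++)
		if ((s[0] & tab[len].mask) == tab[len].lead)
			break;
	if (len == 4)
		return 1; /* stray continuation byte or invalid lead byte */

	v = (uint32_t)(s[0] & ~tab[len].mask);
	for (i = 1; i <= len; i++) {
		if (i >= n || (s[i] & 0xC0) != 0x80)
			return i; /* truncated or malformed sequence */
		v = (v << 6) | (uint32_t)(s[i] & 0x3F);
	}

	if (v < tab[len].min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
		return len + 1; /* overlong, out of range or surrogate */

	*cp = v;
	return len + 1;
}
```

Feeding the result straight into a boundary function is then a simple
loop over the char-array, advancing by the returned length each time.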

With best regards

Laslo
Received on Sat May 09 2020 - 07:25:41 CEST
