Re: [hackers] [libgrapheme][PATCH] Expose cp_decode and boundary (both renamed) to the user

From: Mattias Andrée <maandree_AT_kth.se>
Date: Sat, 9 May 2020 10:10:39 +0200

On Sat, 9 May 2020 07:25:41 +0200
Laslo Hunhold <dev_AT_frign.de> wrote:

> On Thu, 7 May 2020 18:32:23 +0200
> Mattias Andrée <maandree_AT_kth.se> wrote:
>
> Dear Mattias,
>
> > Perhaps, but I wouldn't expect anyone who doesn't understand
> > the difference to even use libgrapheme. But you could also state in
> > grapheme.h and the man pages that all functions except grapheme_len
> > are low-level functions.
>
> that could work.
>
> > Not a goal, but a positive side-effect of exposing the boundary
> > test function.
>
> I agree with that, it has a positive side-effect.
>
> > > The reason I'm conflicted about this change is that there's no
> > > guarantee the grapheme-cluster-boundary algorithm won't get changed
> > > again. It already has been in Unicode 12.0, which made it suddenly
> > > require a state to be carried with it, and there's no guarantee it
> > > won't get even crazier, making it almost infeasible to expose more
> > > than a "gclen()"-function to the user.
> >
> > How about
> >
> > typedef struct grapheme_state GRAPHEME_STATE;
> >
> > /* Hidden from the user { */
> > struct grapheme_state {
> > uint32_t cp0;
> > int state;
> > };
> > /* } */
> >
> > int grapheme_boundary(uint32_t cp1, GRAPHEME_STATE *);
> >
> > GRAPHEME_STATE *grapheme_create_state(void);
> >
> > /* Just in case the state in the future
> > * would require dynamic allocation */
> > void grapheme_free_state(GRAPHEME_STATE *);
> >
> > grapheme_create_state() would reset the state each time
> > a boundary is found, so no reset function would be needed;
> > one would only be useful to avoid a new allocation if the
> > grapheme cluster identification process is aborted and a
> > new one is started for a new text. Since this would be very
> > rare, no reset function is needed.
> >
> > The only future I can see where this wouldn't be sufficient
> > is if a cluster break (or non-break) could be retroactively
> > inserted where the algorithm already stated that there
> > was no break (or was a break). This would be so bizarre, I
> > cannot imagine this would ever be the case.
>
> I don't like this change, because it destroys reentrancy, which is very
> important for multithreaded applications, and complicates things
> unnecessarily.

malloc(3) and free(3) are thread-safe, so there shouldn't be any problem:

        GRAPHEME_STATE *state = grapheme_create_state();
        ... = grapheme_boundary(..., state);
        grapheme_free_state(state);
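
A slightly fuller sketch of the same idea, to show that nothing is shared between callers: each thread (or each text) creates and frees its own state. The three API names are the ones proposed above; everything else is an assumption made for illustration. In particular, the meaning of the `state` field, the helper `count_clusters()`, and the break rule are hypothetical, and the rule is a toy (CR LF and U+0300..U+036F only), not the Unicode algorithm.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct grapheme_state GRAPHEME_STATE;

/* In the proposal this definition would be hidden from the user. */
struct grapheme_state {
	uint32_t cp0; /* previous code point */
	int state;    /* assumed meaning: 0 = nothing seen yet */
};

GRAPHEME_STATE *
grapheme_create_state(void)
{
	/* zero-initialised, so state == 0 ("start of text") */
	return calloc(1, sizeof(GRAPHEME_STATE));
}

void
grapheme_free_state(GRAPHEME_STATE *s)
{
	free(s);
}

/*
 * Returns 1 if there is a boundary before cp1, 0 otherwise.
 * TOY RULE, not the Unicode algorithm: no break between CR and LF,
 * no break before a combining diacritical mark (U+0300..U+036F)
 * unless it follows CR or LF; break everywhere else.
 */
int
grapheme_boundary(uint32_t cp1, GRAPHEME_STATE *s)
{
	int brk;

	if (!s->state)
		brk = 1; /* start of text */
	else if (s->cp0 == 0x0D && cp1 == 0x0A)
		brk = 0;
	else if (cp1 >= 0x0300 && cp1 <= 0x036F &&
	         s->cp0 != 0x0D && s->cp0 != 0x0A)
		brk = 0;
	else
		brk = 1;

	s->cp0 = cp1;
	s->state = 1;

	return brk;
}

/* Hypothetical helper: counts clusters using a state private to this call. */
size_t
count_clusters(const uint32_t *cp, size_t n)
{
	GRAPHEME_STATE *s = grapheme_create_state();
	size_t i, count = 0;

	if (!s)
		return 0; /* allocation failed */
	for (i = 0; i < n; i++)
		count += grapheme_boundary(cp[i], s);
	grapheme_free_state(s);

	return count;
}
```

Since the only data touched lives behind the caller's own pointer, two threads running count_clusters() at the same time never see each other's state.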

> However, I think we should just risk it and assume that further
> versions of the Unicode-Grapheme-Boundary-algorithm will only rely on
> such a state.

I agree.

>
> > > [...]
> > > What do you think?
> >
> > I don't see the point of including grapheme_cp_encode(), however
> > I'm not opposed to making a larger UTF-8/Unicode library, rather
> > I think it would be nice to have one place for all my Unicode
> > needs, especially if I otherwise would have a handful of libraries
> > that all have their own UTF-8 decoding functions that all have to be
> > linked.
>
> Yes, I agree with that. There are lots of bad and unsafe
> UTF-de/encoders out there and the one in libgrapheme is actually pretty
> fast and safe (e.g. no overencoded nul, proper error-handling, etc.).
> It would be no bloat to expose it outside, as it runs "in the
> background" anyway. It's more of a debate on the "purity" of
> libgrapheme, but when including the boundary function, offering a way
> to read codepoints from a char-array makes a lot of sense.
>
> With best regards
>
> Laslo
>
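
To make "no overencoded nul, proper error-handling" concrete, here is a minimal sketch of a strict decoder. It is not libgrapheme's actual code; the name `utf8_decode_strict`, its signature, and the choice of U+FFFD as the error value are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define CP_INVALID UINT32_C(0xFFFD) /* assumed error value */

/*
 * Decodes one code point from s[0..n) into *cp and returns the number
 * of bytes consumed (0 only if n == 0). Malformed input yields
 * CP_INVALID instead of a guessed code point.
 */
size_t
utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *cp)
{
	/* smallest code point that legitimately needs len bytes */
	static const uint32_t min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t len, i;
	uint32_t c;

	if (n == 0) {
		*cp = CP_INVALID;
		return 0;
	}
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2; c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; c = s[0] & 0x07;
	} else {
		*cp = CP_INVALID; /* stray continuation or invalid lead byte */
		return 1;
	}

	for (i = 1; i < len; i++) {
		if (i >= n || (s[i] & 0xC0) != 0x80) {
			*cp = CP_INVALID; /* truncated or broken sequence */
			return i;
		}
		c = (c << 6) | (s[i] & 0x3F);
	}

	/* reject overlong forms (e.g. C0 80 for NUL), surrogates, > U+10FFFF */
	if (c < min[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		c = CP_INVALID;
	*cp = c;

	return len;
}
```

The overlong check is what keeps the two-byte sequence C0 80 from being read as NUL, which matters when the result ends up in a NUL-terminated string.
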
Received on Sat May 09 2020 - 10:10:39 CEST
