Re: [hackers] [sbase][PATCH 4/5] fold: fix handling of multibyte characters

From: Richard Ipsum <richardipsum_AT_vx21.xyz>
Date: Fri, 2 Oct 2020 18:10:21 +0200

On Thu, Oct 01, 2020 at 08:52:34AM +0200, Laslo Hunhold wrote:
> On Wed, 30 Sep 2020 22:41:47 -0700
> Michael Forney <mforney_AT_mforney.org> wrote:
>
> Dear Michael,
>
> > POSIX says we should be counting column positions rather than
> > codepoints, but I think that might be rather difficult to get right
> > and this is probably an improvement already.
> >
> > I know Laslo has studied this area for libgrapheme, so maybe he has
> > suggestions.
>
> if you want to do it 100% right, there's no way around using
> libgrapheme (or another library handling grapheme clusters like icu,
> but I bet there's none nearly as lightweight as libgrapheme). Counting
> codepoints is only halfway there and there are trivial counterexamples
> which prove that this is not the complete solution and there are
> discrepancies.
>
> On the other hand, in the western world, most grapheme clusters are
> emojis and certain cases with more complex writing systems. It's a much
> different matter when you go to Asia or Africa, where you can't really
> properly implement many very popular writing systems (like Hangul)
> without using grapheme clusters.
> Most important in general, though, is the case where you're processing
> denormalized input (i.e. where everything is broken down as much as
> possible; for example, the single codepoint
> (=1-codepoint-grapheme-cluster) "ä" is turned into the codepoint "a"
> with an umlaut modifier, making it a 2-codepoint-grapheme-cluster),
> which leads to a lot of gotchas, inconsistencies and maybe even
> security problems.
>
> All in all though, codepoint-counting is a step in the right direction,
> but definitely not exhaustive, especially as time moves on and more and
> more people are using the higher Unicode planes for data. If you really
> want to do it right, you must handle grapheme clusters, and libgrapheme
> is actually very fast and should even be faster than the Rune-solution
> using libutf.h, because it works on the byte-level rather than the
> codepoint-level.
>
> With best regards
>
> Laslo
>

I'm happy to drop this patch from the series, but libgrapheme isn't in
sbase's tree, and it doesn't seem reasonable to expect users of sbase to
install libgrapheme themselves.

I'm not at all familiar with libgrapheme either, and I don't know what
the trivial counterexamples Laslo refers to are. Maybe it would be
better if he took over this part of the fix?
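
Going only by the quoted text, the decomposed "ä" may be one such
counterexample. Below is a minimal sketch of that case, written in Python
purely for brevity and not taken from the patch or from libgrapheme.
count_codepoints and count_base_characters are hypothetical helpers;
the latter only skips combining marks, which is a rough stand-in for
real grapheme cluster segmentation and would not handle emoji sequences
or Hangul jamo.

```python
import unicodedata


def count_codepoints(s: str) -> int:
    # A Python str is a sequence of codepoints, so len() counts them.
    return len(s)


def count_base_characters(s: str) -> int:
    # Hypothetical approximation of "what the user sees": ignore
    # codepoints with a non-zero canonical combining class.
    # This is NOT full grapheme cluster segmentation.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


precomposed = "\u00e4"  # "ä" as a single codepoint
decomposed = "a\u0308"  # "a" followed by COMBINING DIAERESIS

# Both strings render as one "ä", but the codepoint counts differ.
print(count_codepoints(precomposed), count_codepoints(decomposed))
print(count_base_characters(precomposed), count_base_characters(decomposed))

# The second form is what canonical decomposition (NFD) produces.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

A fold that counts codepoints would treat the second string as two
characters wide even though it occupies the same single column as the
first.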

Thanks,
Richard
Received on Fri Oct 02 2020 - 18:10:21 CEST
