Re: [dev] [libgrapheme] announcement

From: Laslo Hunhold <dev_AT_frign.de>
Date: Fri, 27 Mar 2020 22:24:52 +0100

On Fri, 27 Mar 2020 20:58:16 +0000
sylvain.bertrand_AT_gmail.com wrote:

Dear Sylvain,

> On this very mailing list we already had some exchange of thoughts
> about the unicode grapheme cluster.
> One question which was stuck into my head after this exchange was:
> how many of unicode "scripts" can be rendered, in a reasonably
> readable way, in a terminal grid?

Yeah, that's an interesting matter. "Grapheme clusters" are the
"smallest unit of script", so one can think of one grapheme cluster as
a single character.
It gets a bit more complicated when looking at things like "मनीष".
This is a name consisting of three grapheme clusters. What's
interesting is that the three letters are linked, so when thinking of
text-rendering, it gets really complicated.
I think, though, that it's sufficient for a terminal to be able to
separate into grapheme clusters and then pass each one individually to
the text renderer. This will cover 99.5% of all cases.
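
To make the splitting step concrete, here is a minimal sketch in C. It
is not libgrapheme's API and not the full Unicode segmentation
algorithm: decode(), extends(), cluster_len() and count_clusters() are
hypothetical names, the input is assumed to be valid NUL-terminated
UTF-8, and extends() is a toy stand-in that only knows the combining
diacritical marks (U+0300..U+036F) and the Devanagari dependent vowel
signs (U+093E..U+094C).

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s (assumed valid), store the code point
 * in *cp and return the number of bytes consumed. */
size_t
decode(const unsigned char *s, uint32_t *cp)
{
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	}
	if ((s[0] & 0xE0) == 0xC0) {
		*cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		return 2;
	}
	if ((s[0] & 0xF0) == 0xE0) {
		*cp = ((uint32_t)(s[0] & 0x0F) << 12) |
		      ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
		return 3;
	}
	*cp = ((uint32_t)(s[0] & 0x07) << 18) |
	      ((uint32_t)(s[1] & 0x3F) << 12) |
	      ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
	return 4;
}

/* Toy stand-in for the Unicode property tables (assumption: only two
 * ranges are covered, a real implementation needs the full data). */
int
extends(uint32_t cp)
{
	return (cp >= 0x0300 && cp <= 0x036F) ||
	       (cp >= 0x093E && cp <= 0x094C);
}

/* Byte length of the first (toy) cluster in s, 0 for the empty string. */
size_t
cluster_len(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;
	uint32_t cp;
	size_t off, n;

	if (p[0] == '\0')
		return 0;

	/* the first code point always belongs to the cluster */
	off = decode(p, &cp);

	/* absorb every following code point that extends it */
	while (p[off] != '\0') {
		n = decode(p + off, &cp);
		if (!extends(cp))
			break;
		off += n;
	}

	return off;
}

/* Number of (toy) clusters in s. */
size_t
count_clusters(const char *s)
{
	size_t n = 0, len;

	while ((len = cluster_len(s)) > 0) {
		s += len;
		n++;
	}

	return n;
}
```

A terminal could call cluster_len() in a loop and hand each slice of
bytes to the renderer as one cell's content. For the name above (four
code points, twelve bytes) count_clusters() reports three clusters,
because the vowel sign U+0940 attaches to the preceding letter.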

> That said, it is a brave first step towards "suckless" "i18n" unicode
> software. I got nausea looking at libunistring and the horrible
> gnulib SDK, not to mention the c++ infection of the "official" libicu
> (don't let me start on harfb...): they all deserve a rube goldberg
> award.

Yes, the ecosystem is a huge mess. What's really bad is that the
software exposes the end-user to a lot of unnecessary complexity. After
having thought about it for a few years, since I started working with
this topic, the following is my opinion about suckless unicode
handling:

  1) text comparison: Don't go the Unicode way, but byte-by-byte. There
     are too many edge cases that just make it all suck. The problem is
     that within a grapheme cluster the order of some combining marks
     makes no difference to the final rendering, so the same visible
     text can have several encodings.
     It's a deep, deep rabbit hole if you go down this path to find
     canonical forms of grapheme clusters. Just compare stuff
     byte-for-byte and be done with it.
  2) lower/upper-case: Really probably one of the worst aspects of all
     of this. If you are serious about it, the mappings between lower-
     and upper-case are not reversible, and they can expand or contract
     the byte stream. I personally also don't see the use of it and
     it's probably not worth the hassle.
     The concept of lower/upper-case writing is a very Western concept
     and other scripts don't even reflect it well.
  3) sorting: really, really complicated. Unicode has some "defaults"
     but also one million different locale-dependent rules one can
     choose to apply (hint: you don't want to :P). I'd first go for a
     "naïve" byte-by-byte approach, especially because comparing UTF-8
     byte-by-byte gives the same order as comparing code points, but
     one might look into decoding the code points and working on a
     sorting algorithm.
     An idea for a simple interface would be
     "grapheme_cmp(const char *, const char *)" with the same
     semantics as strcmp. Obviously, given 2), an implementation of
     strcasecmp would not make sense, but one could also have a
     "grapheme_ncmp(const char *, const char *, size_t)". Then there
     needs to be a discussion whether the size_t means the number of
     grapheme clusters or the number of bytes to compare.
     As I said above, Unicode considers permutations of modifiers
     equivalent, so maybe we might have to go a bit off-the-track there
     and skip this equivalency check.
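
The byte-by-byte approach from 1) and 3) can be sketched with nothing
but strcmp. This is an illustration under stated assumptions, not the
proposed grapheme_cmp: bytewise_cmp() and sign() are hypothetical
names, and the inputs are assumed to be valid NUL-terminated UTF-8.

```c
#include <string.h>

/* Reduce strcmp's result to -1, 0 or 1 so results are easy to check. */
int
sign(int x)
{
	return (x > 0) - (x < 0);
}

/*
 * Byte-for-byte comparison of two UTF-8 strings. strcmp compares the
 * bytes as unsigned char, and UTF-8 is designed so that this yields
 * the same order as comparing the decoded code points one by one.
 * No normalization is done: canonically equivalent strings that are
 * encoded differently compare as different.
 */
int
bytewise_cmp(const char *a, const char *b)
{
	return sign(strcmp(a, b));
}
```

For example, bytewise_cmp("z", "\xc3\xa4") is -1, matching U+007A <
U+00E4, and the same holds across encoded lengths, e.g. U+20AC
("\xe2\x82\xac") sorts before U+1F600 ("\xf0\x9f\x98\x80"). The price
is visible with "é": the precomposed form "\xc3\xa9" (U+00E9) and the
decomposed form "e\xcc\x81" (U+0065 U+0301) render the same but
compare as unequal.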

So that's that. I'll read a bit more and might code a bit in this
regard. What I need to note is that Unicode gets more and more
complicated with each version. For instance, I had to implement a small
state machine to even be able to measure the length of grapheme
clusters, which was not necessary before and forced me to adapt the
API. I think it won't get worse than that and the API will work for
future versions of Unicode as well, but it takes more consideration
when thinking about string comparison and other things.
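
To illustrate why looking at two neighbouring code points is not
always enough, here is a sketch of one rule that needs state: regional
indicator symbols (U+1F1E6..U+1F1FF) pair up into flags, so whether
there is a break between two of them depends on how many came before.
That this rule is what forced the state machine is an assumption on my
part for the sake of the example, and is_ri(), is_break() and
count_clusters_cp() are hypothetical names, not libgrapheme's
interface. All other break rules are left out, so any other pair of
code points is treated as a break.

```c
#include <stddef.h>
#include <stdint.h>

/* Regional indicator symbols; two in a row form one flag. */
int
is_ri(uint32_t cp)
{
	return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

/*
 * Returns 1 if there is a cluster break between the adjacent code
 * points a and b, 0 otherwise. *ri_run is the state: the number of
 * consecutive regional indicators seen before a. The caller sets it
 * to 0 once and then passes it along for every adjacent pair in order.
 */
int
is_break(uint32_t a, uint32_t b, unsigned *ri_run)
{
	/* update the run length to include a */
	*ri_run = is_ri(a) ? *ri_run + 1 : 0;

	/* an odd number of indicators before the boundary means that
	 * b completes a pair, so there is no break here */
	if (is_ri(b) && (*ri_run % 2) == 1)
		return 0;

	/* simplification: everything else is a break */
	return 1;
}

/* Count clusters in an array of n code points using only this rule. */
size_t
count_clusters_cp(const uint32_t *cp, size_t n)
{
	unsigned ri_run = 0;
	size_t i, count;

	if (n == 0)
		return 0;

	count = 1;
	for (i = 0; i + 1 < n; i++)
		count += is_break(cp[i], cp[i + 1], &ri_run);

	return count;
}
```

With four regional indicators in a row, the function sees the same
pair of properties (indicator, indicator) at all three boundaries, yet
only the middle one is a break. That difference can only come from the
carried state.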

With best regards

Laslo
Received on Fri Mar 27 2020 - 22:24:52 CET
