Re: [hackers] [st][patch] replace utf8strchr with wcschr from Laslo Hunhold on 2019-03-15 (hackers mail list archive)

From: Laslo Hunhold <dev_AT_frign.de>
Date: Fri, 15 Mar 2019 10:51:23 +0100

On Fri, 15 Mar 2019 08:27:56 +0200
Lauri Tirkkonen <lotheac_AT_iki.fi> wrote:

Dear Lauri,

> I don't understand your logic. The current solution *is* converting
> everything to a Rune.
>
> static char *utf8strchr(char *, Rune);
>
> worddelimiters is char *, but utf8strchr() calls utf8decode() on it to
> obtain Runes (to compare to the second argument). While I don't think
> efficiency actually matters a lot here since this is only called when
> you double-click to select something, Jules' solution is quite similar
> to mine in that the worddelimiters string needs no conversion at
> runtime, and is therefore more efficient than the current one.

Yes, sorry about that. I noticed after sending that my wording was
unclear. Of course utf8strchr() does an in-situ Rune conversion, but
your solution requires passing a Rune-array to utf8strchr(), implying
that, besides converting, you would also have to _store_ the Runes
somewhere.
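
To illustrate what I mean by "in-situ", here is a minimal sketch: the
delimiter string stays a plain UTF-8 char array and each codepoint is
decoded while scanning, so no Rune array has to be stored anywhere.
Note that decode_utf8() and utf8_find() are names made up for this
example; the decoder is a simplified stand-in and not st's
utf8decode() (it does not reject overlong encodings or surrogates).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint_least32_t Rune;

/*
 * Simplified decoder (assumption: a stand-in for utf8decode()).
 * Returns the number of bytes consumed, or 0 on malformed or
 * truncated input.
 */
static size_t
decode_utf8(const char *s, size_t len, Rune *r)
{
	const unsigned char *p = (const unsigned char *)s;
	size_t i, n;

	if (len == 0)
		return 0;
	if (p[0] < 0x80) {
		*r = p[0];
		return 1;
	} else if ((p[0] & 0xE0) == 0xC0) {
		*r = p[0] & 0x1F;
		n = 2;
	} else if ((p[0] & 0xF0) == 0xE0) {
		*r = p[0] & 0x0F;
		n = 3;
	} else if ((p[0] & 0xF8) == 0xF0) {
		*r = p[0] & 0x07;
		n = 4;
	} else {
		return 0;
	}
	if (len < n)
		return 0;
	/* fold in the continuation bytes, 6 bits each */
	for (i = 1; i < n; i++) {
		if ((p[i] & 0xC0) != 0x80)
			return 0;
		*r = (*r << 6) | (p[i] & 0x3F);
	}
	return n;
}

/* Find codepoint u in the UTF-8 string s, decoding while scanning. */
static const char *
utf8_find(const char *s, Rune u)
{
	Rune r;
	size_t i, n, len = strlen(s);

	for (i = 0; i < len; i += n) {
		if ((n = decode_utf8(s + i, len - i, &r)) == 0)
			break;
		if (r == u)
			return s + i;
	}
	return NULL;
}
```

The only state is one Rune on the stack; the trade-off is that the
delimiter string is decoded again on every lookup.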

> > Now, to clear it up: A Rune literally is only a codepoint and just a
> > typedef for an (at least) 32-bit-integer.
>
> Yes, and yet Rune values are still being passed to wcwidth() in the
> current code. You objected to wchar_t on grounds of portability, but
> already the current code is broken on platforms where wchar_t is less
> than 32 bits, or its values do not match Unicode codepoints. I hope
> you will not suggest replacing wcwidth() with an application-local
> character width table.

wcwidth() is fundamentally broken, because it rests on the assumption
that 1 codepoint = 1 character (or grapheme, if you prefer
Unicode-newspeak), which is _wrong_. The discussion on how far we want
to support Unicode has been going on for years, and it is a difficult
call.

Standards move very slowly, and I see no way around doing it ourselves
one way or another. The grapheme-cluster boundary detection I talked
about earlier uses awk(1) to generate the rules automatically from the
machine-readable tables of the Unicode standard, converting them to
LUTs.

Width calculation for grapheme clusters is more difficult, but not
impossible. Usually, grapheme clusters are made up of base characters
(half or full width) with modifiers, so something along the lines of
[0] with automatically generated LUTs would be ideal.
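
As a rough sketch of that idea, assuming the cluster has already been
segmented into codepoints: look up each codepoint in sorted range
tables and take the width of the first one that is not zero-width
(the base character). The tables below are hand-picked for
illustration only and far from complete; real ones would be generated
from the Unicode data files. All names (struct range, in_table(),
rune_width(), cluster_width()) are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint_least32_t Rune;

struct range {
	Rune lo, hi;
};

/* Illustrative, incomplete tables; must be sorted by codepoint. */
static const struct range zero_width[] = {
	{ 0x0300, 0x036F }, /* Combining Diacritical Marks */
	{ 0x200D, 0x200D }, /* ZERO WIDTH JOINER */
	{ 0xFE00, 0xFE0F }, /* Variation Selectors */
};

static const struct range wide[] = {
	{ 0x1100, 0x115F }, /* Hangul Jamo initial consonants */
	{ 0x4E00, 0x9FFF }, /* CJK Unified Ideographs */
	{ 0xAC00, 0xD7A3 }, /* Hangul Syllables */
};

#define LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Binary search for r in a sorted table of inclusive ranges. */
static int
in_table(const struct range *t, size_t n, Rune r)
{
	size_t lo = 0, hi = n, mid;

	while (lo < hi) {
		mid = lo + (hi - lo) / 2;
		if (r < t[mid].lo)
			hi = mid;
		else if (r > t[mid].hi)
			lo = mid + 1;
		else
			return 1;
	}
	return 0;
}

/* Width of a single codepoint in cells: 0, 1 or 2. */
static int
rune_width(Rune r)
{
	if (in_table(zero_width, LEN(zero_width), r))
		return 0;
	if (in_table(wide, LEN(wide), r))
		return 2;
	return 1;
}

/*
 * Estimated width of one grapheme cluster: the width of its first
 * codepoint that is not zero-width, or 0 if there is none.
 */
static int
cluster_width(const Rune *c, size_t n)
{
	size_t i;
	int w;

	for (i = 0; i < n; i++)
		if ((w = rune_width(c[i])) > 0)
			return w;
	return 0;
}
```

This is only an educated estimate: it ignores control characters and
cases where a modifier changes the presentation of the base
character.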

Before the question comes up: ICU should be avoided like the plague,
given it encompasses all locales and is very bloated in nature. There
is a notion of a "common denominator" in Unicode, which is locale
independent, and that's what we should go with.

But please, stop pretending that the standard is even remotely capable
of handling Unicode. It isn't, and it needs an overhaul.
UTF-8 is a sane default. We can compose codepoints on top of that and
then compose grapheme clusters, for which we can make educated
estimations of their drawing width. Everything else is just a hack and
doesn't approach the problem wholeheartedly.

With best regards

Laslo

[0]: https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

-- 
Laslo Hunhold <dev_AT_frign.de>


Received on Fri Mar 15 2019 - 10:51:23 CET
