Re: [hackers] [st][patch] replace utf8strchr with wcschr from Lauri Tirkkonen on 2019-03-15 (hackers mail list archive)

From: Lauri Tirkkonen <lotheac_AT_iki.fi>
Date: Fri, 15 Mar 2019 12:10:21 +0200

On Fri, Mar 15 2019 10:51:23 +0100, Laslo Hunhold wrote:
> yes, sorry for that. I noticed after sending that my wording is
> unclear. Of course utf8strchr() does an in-situ Rune conversion, but
> your solution requires passing a Rune-array to utf8strchr(), implying
> that besides converting you would also have to _store_ the Runes
> somewhere.

Jules also showed the storage: it was the Rune worddelimiters[] in
config.h.
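
For illustration, a minimal sketch of the wcschr() variant, not the
actual patch. It assumes the delimiter list is kept as a wide string;
the names worddelimiters, isdelim and isdelim_demo are only stand-ins
here, and the single-space delimiter set is an assumed default.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Assumed stand-in for the config.h setting: the delimiters are stored
 * as a wide string, so no conversion is needed at lookup time. */
static const wchar_t worddelimiters[] = L" ";

/* Return nonzero if u is a word delimiter. The explicit NUL check is
 * needed because wcschr() also matches the terminating null character. */
static int
isdelim(wchar_t u)
{
	return u != L'\0' && wcschr(worddelimiters, u) != NULL;
}

/* Tiny usage check; call it from main() to exercise the lookup. */
static void
isdelim_demo(void)
{
	assert(isdelim(L' '));
	assert(!isdelim(L'x'));
	assert(!isdelim(L'\0'));
}
```

The storage is then just the string literal, and the lookup is a single
libc call.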

> > Yes, and yet Rune values are still being passed to wcwidth() in the
> > current code. You objected to wchar_t on grounds of portability, but
> > already the current code is broken on platforms where wchar_t is less
> > than 32 bits, or its values do not match Unicode codepoints. I hope
> > you will not suggest replacing wcwidth() with an application-local
> > character width table.
>
> wcwidth() is fundamentally broken, given the assumption that 1
> codepoint = 1 character (or grapheme if you prefer Unicode-newspeak) is
> _wrong_. The discussion on how far we want to support Unicode has been
> going on for years and is a difficult call.

Some wcwidth() *implementations* may be fundamentally broken, and
perhaps some standard does underspecify things. But it doesn't follow
that every application then needs to implement its own character width
table. There are libcs that get this right, and those that don't should
be *fixed*, not catered to.

> Standards move very slowly and I see no way around doing it ourselves
> one way or another. The grapheme-cluster-boundary-detection I talked
> about earlier uses awk(1) to generate the rules automatically from the
> machine-readable unicode-standard-table, converting them to LUTs.
>
> For width-calculation on grapheme clusters, it's more difficult, but
> not impossible. Usually, grapheme clusters are made up of base
> characters (half or full width) with modifiers, so something along the
> lines of [0] with automatically-generated LUTs would be ideal.

> [0]: https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

I completely disagree about that being ideal - I think it's a terrible
idea for every application to have to carry its own character width
table when libc should be providing just that. xterm does that too, by
the way, using a local copy of that same wcwidth() implementation you
linked - which is why xterm in OpenBSD was fixed not to do that; see the
thread at https://marc.info/?l=openbsd-tech&m=155205245721315&w=2

(Ingo's also fixing some similar madness in less(1) from what I can tell
reading tech_AT_)

> Before the question comes up: ICU should be avoided like the plague,
> given it encompasses all locales and is very bloated in nature. There
> is a notion of a "common denominator" in Unicode, which is locale
> independent, and that's what we should go with.
>
> But please, stop pretending that the standard is in any way even
> closely capable of handling Unicode. It isn't and it needs an overhaul.

What is "the standard" you are referring to? I'm approaching this
problem from a practical standpoint where things actually work on
operating systems that do store Unicode codepoints in wchar_t, and
wchar_t is large enough to store them (which, I must point out again, is
an assumption st code *already makes* since it uses libc wcwidth() with
codepoint values).
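
To make that assumption explicit, here is a rough sketch, not code from
st. It assumes Rune is a uint_least32_t typedef; WCHAR_IS_UCS,
rune_fits_wchar, rune_width and rune_width_demo are hypothetical names.
__STDC_ISO_10646__ is the ISO C macro an implementation defines when
wchar_t values are Unicode codepoints, and wcwidth() needs an XSI
feature-test macro to be declared.

```c
#define _XOPEN_SOURCE 700 /* feature-test macro; exposes wcwidth() */
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

typedef uint_least32_t Rune; /* assumed codepoint type */

/* 1 if the implementation promises that wchar_t holds Unicode values. */
#ifdef __STDC_ISO_10646__
enum { WCHAR_IS_UCS = 1 };
#else
enum { WCHAR_IS_UCS = 0 };
#endif

/* Can this codepoint be represented in a wchar_t at all? */
static int
rune_fits_wchar(Rune u)
{
	return u <= 0x10FFFF && (uintmax_t)u <= (uintmax_t)WCHAR_MAX;
}

/* Width as reported by libc, or -1 if u cannot be passed to wcwidth().
 * The result is only meaningful for Unicode if WCHAR_IS_UCS is 1, and
 * it depends on the current LC_CTYPE locale. */
static int
rune_width(Rune u)
{
	if (!rune_fits_wchar(u))
		return -1;
	return wcwidth((wchar_t)u);
}

/* Tiny usage check restricted to values that behave the same in the
 * "C" locale. */
static void
rune_width_demo(void)
{
	assert(rune_width('a') == 1);
	assert(rune_width(0x110000) == -1);
}
```

On a platform with a 16-bit wchar_t, rune_fits_wchar() rejects anything
outside the BMP, which is exactly the case where passing Rune values
straight to wcwidth() is already broken today.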

-- 
Lauri Tirkkonen | lotheac _AT_ IRCnet

Received on Fri Mar 15 2019 - 11:10:21 CET
