Re: [hackers] [st][patch] replace utf8strchr with wcschr

From: Laslo Hunhold <dev_AT_frign.de>
Date: Fri, 15 Mar 2019 11:41:34 +0100

On Fri, 15 Mar 2019 12:10:21 +0200
Lauri Tirkkonen <lotheac_AT_iki.fi> wrote:

Dear Lauri,

> Jules also showed the storage: it was the Rune worddelimiters[] in
> config.h.

I didn't mean that, but rather the Rune array of the text you are
actually checking.

> Some wcwidth() *implementations* may be fundamentally broken, and
> maybe indeed there is some standard that underspecifies things. But
> it doesn't follow that every application then needs to implement
> their own character width tables. There are libc's that get this
> right and those that don't should be *fixed*, not catered to.

_All_ wcwidth() implementations are broken, because of their
common signature:

        int wcwidth(wchar_t wc);

It can take only a single codepoint, but as I already said in my
response, the assumption that 1 codepoint = 1 printed character is
wrong. wcwidth() lacks the crucial context to even make a proper
"decision" in this regard.
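
You can see this by feeding wcwidth() a string one codepoint at a
time. A minimal sketch (the exact widths depend on your libc's tables;
it assumes a UTF-8 locale and a wchar_t wide enough to hold all
codepoints):

        #include <locale.h>
        #include <stdio.h>
        #include <wchar.h>

        int
        main(void)
        {
                /* "e" + combining acute accent, then the single
                 * grapheme cluster U+1F469 U+200D U+1F4BB
                 * ("woman technologist"), three codepoints long */
                const wchar_t *s = L"e\u0301 \U0001F469\u200D\U0001F4BB";
                int sum = 0, w;

                setlocale(LC_CTYPE, "");
                for (; *s; s++) {
                        w = wcwidth(*s);
                        printf("U+%04lX -> %d\n", (unsigned long)*s, w);
                        if (w > 0)
                                sum += w;
                }
                printf("summed width: %d\n", sum);

                return 0;
        }

On a typical implementation the two emoji codepoints report a width of
2 each, so summing yields 4 cells for what a ZWJ-aware terminal renders
as a single double-width glyph. No per-codepoint interface can get
this right.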

> I completely disagree about that being ideal - I think it's a terrible
> idea for every application to have to have their own character width
> table, when libc should be providing just that. xterm does that too,
> by the way, using a local copy of that same wcwidth() implementation
> you linked - and thus xterm was fixed not to do that in OpenBSD; see
> the thread at https://marc.info/?l=openbsd-tech&m=155205245721315&w=2
>
> (Ingo's also fixing some similar madness in less(1) from what I can
> tell reading tech_AT_)

I agree with the changes in OpenBSD, and they make sense, given that
hacks like this are generally a problem.

I would not propose a "plug-in" wcwidth(), given it's broken. I linked
the source file because it is well-documented and gives an impression
of the problems that need to be solved (LUTs, grapheme cluster
context, UTF-8 parsing, ...). In a general sense, this would need to be
solved in a library.

ICU goes too far with its approach. It loads megabytes of locale tables
at runtime initialization, has a horrible API and brings everything but
the kitchen sink.
POSIX is the other extreme, which is understandable given its nature as
a consortium standard tasked with supporting all kinds of encodings and
edge cases. I am positive there won't be a proper Unicode interface in
decades to come, so don't hold your breath in this regard.

> What is "the standard" you are referring to?

POSIX.

> I'm approaching this problem from a practical standpoint where things
> actually work on operating systems that do store Unicode codepoints
> in wchar_t, and wchar_t is large enough to store them (which, I must
> point out again, is an assumption st code *already makes* since it
> uses libc wcwidth() with codepoint values).

It is easily forgivable to misunderstand the problem. The problem is
not storing codepoints, but the fact that Unicode adds one layer on top
of codepoints. You can often ignore this layer, just as you can often
ignore the codepoint layer and simply work with the bytes (e.g. in
cat(1)). For some cases, like character width, you need to enter the
grapheme cluster layer, and at this point the standard library will
fail you big time, because it can only work within the codepoint layer
(Rune/wchar_t is a codepoint, but a character can be made up of
multiple codepoints).

        (byte byte...) --(UTF-8)----> codepoint
        (codepoint codepoint...) --(Unicode)--> grapheme cluster
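
The first arrow is the mechanical part; a rough sketch of how such a
decoder might look (error handling is reduced to emitting U+FFFD, and
the overlong- and surrogate-rejection a real decoder needs is
deliberately left out):

        #include <stddef.h>
        #include <stdint.h>

        #define INVALID 0xFFFDUL /* REPLACEMENT CHARACTER */

        /* decode one codepoint from s (at most len bytes) into cp,
         * returning the number of bytes consumed */
        size_t
        utf8_decode(const uint8_t *s, size_t len, uint32_t *cp)
        {
                size_t i, n;

                if (len == 0)
                        return 0;
                if (s[0] < 0x80) {
                        *cp = s[0];
                        return 1;
                } else if ((s[0] & 0xE0) == 0xC0) {
                        *cp = s[0] & 0x1F;
                        n = 2;
                } else if ((s[0] & 0xF0) == 0xE0) {
                        *cp = s[0] & 0x0F;
                        n = 3;
                } else if ((s[0] & 0xF8) == 0xF0) {
                        *cp = s[0] & 0x07;
                        n = 4;
                } else {
                        *cp = INVALID; /* stray continuation byte */
                        return 1;
                }
                if (n > len) {
                        *cp = INVALID; /* truncated sequence */
                        return 1;
                }
                for (i = 1; i < n; i++) {
                        if ((s[i] & 0xC0) != 0x80) {
                                *cp = INVALID;
                                return i;
                        }
                        *cp = (*cp << 6) | (s[i] & 0x3F);
                }
                return n;
        }

The second arrow is the hard part: grouping codepoints into grapheme
clusters requires the segmentation rules and tables of UAX #29, and
that is precisely what no libc interface gives you.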

If you ask me, I'd drop all this wchar_t and Rune-typedef madness and
just write an API using the stdint.h-type uint32_t (or uint_least32_t,
but who cares?) and be done with it. Unicode has fewer than 2^21
codepoints (RFC 3629 limits UTF-8 to the UTF-16 range 0x0 to 0x10FFFF
~ 2^20.09), so a 32-bit integer is the canonical choice (as in UCS-4).

A codepoint then is a uint32_t, and a grapheme cluster is just an
array of uint32_t's. Introducing nomenclature like "Runes" is just a
source of confusion, and has been over the last few years.
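
An interface built on that could then take whole clusters instead of
single codepoints. A hypothetical sketch of the shape I mean (the
names are invented here, not an existing API):

        #include <stddef.h>
        #include <stdint.h>

        typedef uint32_t Codepoint;

        /* length (in codepoints) of the grapheme cluster
         * starting at cp[0], looking at no more than n */
        size_t grapheme_len(const Codepoint *cp, size_t n);

        /* printed width (in terminal cells) of that whole
         * cluster -- exactly the context wcwidth() lacks */
        int grapheme_width(const Codepoint *cp, size_t n);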

If you look at st.h:60[0], you can see that Rune is defined as
uint_least32_t. So I don't know why you spread the FUD that Rune is
somehow not capable of storing all Unicode codepoints. It is! :)
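
For reference, the definition behind that link is the single line:

        typedef uint_least32_t Rune;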

With best regards

Laslo

[0]:https://git.suckless.org/st/file/st.h.html#l60

-- 
Laslo Hunhold <dev_AT_frign.de>
