Re: [hackers] [st][patch] replace utf8strchr with wcschr from Lauri Tirkkonen on 2019-03-15 (hackers mail list archive)

From: Lauri Tirkkonen <lotheac_AT_iki.fi>
Date: Fri, 15 Mar 2019 13:17:01 +0200

On Fri, Mar 15 2019 11:41:34 +0100, Laslo Hunhold wrote:
> > Jules also showed the storage: it was the Rune worddelimiters[] in
> > config.h.
>
> I didn't mean that, but the Rune-array of the text you are actually
> checking.

Right, this is done elsewhere in st already as part of its operation.
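
For readers following along, the delimiter lookup this thread is about can be sketched roughly as below. This is a minimal illustration and not the patch itself: the contents of worddelimiters[] and the helper name is_delim() are assumptions made up for the example, and it presumes the delimiters are stored as a wide string so that the libc wcschr() can search them.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for the worddelimiters[] setting in config.h,
 * stored as a wide string so that wcschr() can search it. */
static const wchar_t worddelimiters[] = L" \t,;";

/* Return nonzero if c is a word delimiter. The explicit c != L'\0'
 * check matters because wcschr() also matches the terminating null. */
static int
is_delim(wchar_t c)
{
	return c != L'\0' && wcschr(worddelimiters, c) != NULL;
}
```

With these example delimiters, is_delim(L' ') is nonzero while is_delim(L'a') and is_delim(L'\0') are both zero.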

> > Some wcwidth() *implementations* may be fundamentally broken, and
> > maybe indeed there is some standard that underspecifies things. But
> > it doesn't follow that every application then needs to implement
> > their own character width tables. There are libc's that get this
> > right and those that don't should be *fixed*, not catered to.
>
> _All_ wcwidth() implementations are broken, because of their
> common signature:
>
> wcwidth(wchar_t wc);
>
> It can take only a single codepoint, but as I already said in my
> response, the assumption that 1 codepoint = 1 printed character is
> wrong. wcwidth() lacks crucial context to even make a proper "decision"
> in this regard.

> > I'm approaching this problem from a practical standpoint where things
> > actually work on operating systems that do store Unicode codepoints
> > in wchar_t, and wchar_t is large enough to store them (which, I must
> > point out again, is an assumption st code *already makes* since it
> > uses libc wcwidth() with codepoint values).
>
> It is easily forgivable to misunderstand the problem. The problem is
> not storing codepoints, but the fact that Unicode adds one layer on top
> of codepoints. You can often ignore this layer, as you can also often
> ignore the codepoint-layer and simply work with the bytes (e.g. in
> cat(1)). For some cases, like character width, you need to enter the
> grapheme cluster layer and at this point the standard library will fail
> you big time, because it can only work within the codepoint layer
> (Rune/wchar_t is a codepoint, but a character can be made up of
> multiple codepoints).
>
> (byte byte...) --(UTF-8)----> codepoint
> (codepoint codepoint...) --(Unicode)--> grapheme cluster

Thanks for the clarifications. I think we are fundamentally talking
about different problems.

I am proposing a patch to st that improves its current behavior, while
not changing assumptions it already makes. You are pointing out,
correctly, that character width cannot strictly speaking be determined
from the codepoint alone. I am arguing that the current solution with
Rune is not any better in this regard than wchar_t is - which is why I
asked Hiltjo what he didn't like about it.
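
To illustrate why the codepoint type makes no difference here, the following is a rough sketch of what any function with a wcwidth()-like signature is limited to. The width table in cp_width() is a toy assumption invented for this example (it is not a real libc table), and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy per-codepoint width table (an assumption for illustration, not a
 * real libc table): 0 for a few zero-width codepoints, 2 for a rough
 * emoji range, 1 for everything else. */
static int
cp_width(uint32_t cp)
{
	if ((cp >= 0x0300 && cp <= 0x036F) || cp == 0x200D || cp == 0xFE0F)
		return 0;
	if (cp >= 0x1F300 && cp <= 0x1FAFF)
		return 2;
	return 1;
}

/* Sum the widths codepoint by codepoint, which is all that a function
 * seeing one codepoint at a time allows. */
static int
sum_width(const uint32_t *cps, size_t n)
{
	int w = 0;
	size_t i;

	for (i = 0; i < n; i++)
		w += cp_width(cps[i]);
	return w;
}
```

For "e" followed by U+0301 (combining acute accent) the sum is 1, which matches what is drawn. For the emoji sequence U+1F469 U+200D U+1F4BB the sum is 4, although a renderer that supports the sequence draws a single glyph. The result is the same whether the array elements are called Rune, wchar_t or uint32_t.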

> If you ask me, I'd drop all this wchar_t and Rune-typedef madness and
> just write an API using the stdint.h-type uint32_t (or uint_least32_t,
> but who cares?) and be done with it. Unicode has fewer than 2^21
> codepoints (RFC 3629 limits it to the range reachable by UTF-16, 0x0
> to 0x10FFFF ~ 2^20.09), so a 32-bit integer is the canonical choice
> (as in UCS-4).
>
> A codepoint then is a uint32_t, and a grapheme cluster is just an
> array of uint32_t's. Introducing nomenclature like "Runes" is just a
> source of confusion, and it has been over the last few years.

I agree the typedef is superfluous. But for practical reasons I would
replace it with wchar_t instead. It's important for all components in a
system to agree about the width of characters: otherwise you get the
kind of issues I outlined in the OpenBSD tech_AT_ thread. In practice,
this is the primary reason why I propose just using the libc
wide-character functions. You are correct that they are broken with
regard to combining characters, but even if a correct solution such as
the one you outlined were implemented in st, I don't think fixing st
alone would be sufficient: you would still get mismatches in character
widths between st and other programs.

> If you look at st.h:60[0], you can see that Rune is defined as
> uint_least32_t. So I don't know why you spread the FUD that Rune one
> way or another is not capable of storing all Unicode codepoints. It is! :)

It was not my intention to imply that Rune isn't capable of storing
codepoints. Instead I was responding to criticisms about wchar_t being
possibly less than 32 bits wide, or containing values other than
codepoints, on some platforms - which might be permitted by POSIX, but
is in my opinion just broken (and st already assumes that wcwidth()
takes at least 32-bit codepoint values).
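
The assumption described above can also be made explicit at build time. The following is only a sketch of one way to do that, not something st does: the macro name WCHAR_IS_UNICODE is hypothetical, while __STDC_ISO_10646__ and WCHAR_MAX are the standard C99 names. Note that some platforms store codepoints in wchar_t without defining __STDC_ISO_10646__, so the macro can only confirm the property, not rule it out.

```c
#include <wchar.h>

/* __STDC_ISO_10646__ is defined by implementations that promise wchar_t
 * values are ISO 10646 (Unicode) codepoints. WCHAR_IS_UNICODE is a
 * hypothetical name used only in this sketch. */
#ifdef __STDC_ISO_10646__
#define WCHAR_IS_UNICODE 1
#else
#define WCHAR_IS_UNICODE 0
#endif

/* Refuse to build if wchar_t cannot hold the highest codepoint. */
#if WCHAR_MAX < 0x10FFFF
#error "wchar_t is too narrow to hold every Unicode codepoint"
#endif
```

On a platform with a 16-bit wchar_t the #error fires, which turns the silent assumption into a visible build failure.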

-- 
Lauri Tirkkonen | lotheac _AT_ IRCnet

Received on Fri Mar 15 2019 - 12:17:01 CET
