Re: [hackers] [st][patch] replace utf8strchr with wcschr

From: Laslo Hunhold <dev_AT_frign.de>
Date: Fri, 15 Mar 2019 12:46:45 +0100

On Fri, 15 Mar 2019 13:17:01 +0200
Lauri Tirkkonen <lotheac_AT_iki.fi> wrote:

Dear Lauri,

> Thanks for the clarifications. I think we are fundamentally talking
> about different problems.
>
> I am proposing a patch to st that improves its current behavior, while
> not changing assumptions it already makes. You are pointing out,
> correctly, that character width cannot strictly speaking be determined
> from the codepoint alone. I am arguing that the current solution with
> Rune is not any better in this regard than wchar_t is - which is why I
> asked Hiltjo what he didn't like about it.

he was pretty clear about it: We assume UTF-8, and for good reason.

UTF-16 is cancer: it needs BOMs, its memory layout is not canonical
across endianness changes, it is really inefficient compared to UTF-8
(_except_ if you for some reason store massive amounts of Chinese
_plain_ text; HTML and other markup pushes the advantage back toward
UTF-8), it invites programming errors (because surrogate pairs are
rare enough that code mishandling them goes untested) and it disrupts
the byte->codepoint translation (as the units are 16-bit rather than
8-bit).
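
To illustrate the surrogate point with a throwaway sketch (hypothetical
example code, nothing taken from st): U+1F600 is a single codepoint,
but in UTF-16 it takes two code units, so naive "one unit == one
character" code miscounts it, whereas in UTF-8 it is just another
multi-byte sequence.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* U+1F600 as a UTF-16 surrogate pair and as UTF-8 bytes */
	uint16_t utf16[] = { 0xD83D, 0xDE00 };
	uint8_t utf8[] = { 0xF0, 0x9F, 0x98, 0x80 };

	/* per-unit counting reports 2 "characters" for one codepoint */
	printf("UTF-16 code units: %zu\n", sizeof(utf16) / sizeof(utf16[0]));
	/* a UTF-8 decoder sees one 4-byte sequence, i.e. one codepoint */
	printf("UTF-8 bytes: %zu\n", sizeof(utf8));
	return 0;
}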

Only a madman would use anything other than UTF-8; the choice of
UTF-16 in Java and Windows is just legacy cruft and the consequence
of political rather than technical decisions.

I can't blame POSIX for introducing the wide-char interfaces, because
they were developed at a time when it was all a big mess. We should
not cargo-cult such an interface, though, when there are other
technical challenges ahead; it is just a distraction given that the
perfect character encoding already exists (namely UTF-8).

> I agree the typedef is superfluous. But for practical reasons I would
> replace it with wchar_t instead. It's important for all components in
> a system to agree about the width of characters: otherwise you get the
> kind of issues I outlined in the OpenBSD tech_AT_ thread. This,
> practically, is the primary reason why I propose to just use the libc
> wide-character functions. While you are correct about them being
> broken in regards to combining characters, and even if a correct
> solution as you outlined was implemented in st, I don't think it's
> sufficient to implement in st alone (since then you will also get
> mismatches in character widths between different programs).

You are mistaken here. The "interchange" format is always UTF-8, and no
matter how you handle it internally, your program should output UTF-8.
The discussion about Runes, wchar_t, uint32_t is all about how to store
codepoints internally, and these codepoints are all in the range of 0 to
~2^21 by standard (!).
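
To make this concrete, here is a rough sketch (not st's utf8encode,
just illustrative code) of how an internally stored 32-bit codepoint
gets serialized to UTF-8 on output:

#include <stddef.h>
#include <stdint.h>

size_t
cp_to_utf8(uint32_t cp, char out[4])
{
	if (cp < 0x80) {
		out[0] = cp;
		return 1;
	} else if (cp < 0x800) {
		out[0] = 0xC0 | (cp >> 6);
		out[1] = 0x80 | (cp & 0x3F);
		return 2;
	} else if (cp < 0x10000) {
		out[0] = 0xE0 | (cp >> 12);
		out[1] = 0x80 | ((cp >> 6) & 0x3F);
		out[2] = 0x80 | (cp & 0x3F);
		return 3;
	} else if (cp < 0x110000) {
		out[0] = 0xF0 | (cp >> 18);
		out[1] = 0x80 | ((cp >> 12) & 0x3F);
		out[2] = 0x80 | ((cp >> 6) & 0x3F);
		out[3] = 0x80 | (cp & 0x3F);
		return 4;
	}
	return 0; /* not a valid Unicode codepoint */
}

No matter whether the internal type is Rune, uint32_t or something
else, nothing wider than 21 bits of codepoint ever has to cross a
program boundary.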

wchar_t's size is implementation-defined, depends on the system's
supported character sets and can even be 16 bits (as on Windows). So
it is a horrible choice for storing codepoints! It may not be large
enough.
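
If you want to see it for your own system, a quick throwaway check is
enough (this is just a sketch, not something I'd put into st):

#include <stdint.h>
#include <stdio.h>
#include <wchar.h>

int
main(void)
{
	/*
	 * Unicode codepoints go up to 0x10FFFF; a 16-bit wchar_t (as on
	 * Windows, where it holds UTF-16 code units) cannot store them all.
	 */
#if WCHAR_MAX < 0x10FFFF
	puts("wchar_t is too small to hold every Unicode codepoint");
#else
	puts("wchar_t can hold every Unicode codepoint on this system");
#endif
	printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
	return 0;
}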

> It was not my intention to imply that Rune isn't capable of storing
> codepoints. Instead I was responding to criticisms about wchar_t being
> possibly less than 32 bits wide, or containing values other than
> codepoints, on some platforms - which might be permitted by POSIX, but
> is in my opinion just broken (and st already assumes that wcwidth()
> takes at least 32-bit codepoint values).

I see that we just have a nomenclature problem. When talking about
codepoints, I exclusively mean Unicode codepoints. Everything else is
legacy cruft, and I thought you assumed that as well.

In this context, given that POSIX overcomplicates this matter, the
conclusion is simple: the POSIX interfaces are good enough for
handling byte arrays. Everything they offer on top of that is
overengineered and should be avoided.
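
Handling byte arrays really is enough: the byte->codepoint direction
is only a handful of lines. Another rough sketch (illustrative only;
it is not st's utf8decode and skips the full validation a terminal
needs, e.g. overlong forms, surrogates and truncated input):

#include <stddef.h>
#include <stdint.h>

size_t
utf8_to_cp(const unsigned char *s, size_t len, uint32_t *cp)
{
	if (len >= 1 && s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if (len >= 2 && (s[0] & 0xE0) == 0xC0) {
		*cp = (uint32_t)(s[0] & 0x1F) << 6 | (s[1] & 0x3F);
		return 2;
	} else if (len >= 3 && (s[0] & 0xF0) == 0xE0) {
		*cp = (uint32_t)(s[0] & 0x0F) << 12 | (s[1] & 0x3F) << 6 |
		      (s[2] & 0x3F);
		return 3;
	} else if (len >= 4 && (s[0] & 0xF8) == 0xF0) {
		*cp = (uint32_t)(s[0] & 0x07) << 18 |
		      (uint32_t)(s[1] & 0x3F) << 12 |
		      (s[2] & 0x3F) << 6 | (s[3] & 0x3F);
		return 4;
	}
	return 0; /* invalid or incomplete sequence */
}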

If you were really interested in practical solutions, as you proclaimed,
you would not propose wchar_t in this matter.

With best regards

Laslo

-- 
Laslo Hunhold <dev_AT_frign.de>
