Re: [hackers] [st][patch] replace utf8strchr with wcschr

From: Laslo Hunhold <>
Date: Thu, 14 Mar 2019 10:55:44 +0100

On Wed, 13 Mar 2019 20:35:09 +0100
Hiltjo Posthuma <> wrote:

Dear Hiltjo,

> I don't like mixing of the existing functions with wchar_t.
> I think st should (at the very least internally) use utf-8.
> Won't apply.

I totally agree with you! Come to think of it, do we really need to
compare codepoints here? How about preprocessing worddelimiters and
storing the offset at which each codepoint begins? Determining whether
a certain "lookahead" byte sequence is a delimiter then just means
traversing this sequence, which would be highly cache-efficient.
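A minimal sketch of that preprocessing step (not from any patch; the
helper name cpoffsets is my own): every byte that is not a UTF-8
continuation byte (0b10xxxxxx) starts a new codepoint, so collecting
codepoint offsets is a single linear scan.

```c
#include <stddef.h>

/* Hypothetical sketch: record the byte offset at which each UTF-8
 * codepoint in s begins. A byte starts a codepoint unless it is a
 * continuation byte, i.e. its top two bits are 10. Returns the number
 * of codepoints found, writing at most max offsets into offs. */
static size_t
cpoffsets(const char *s, size_t *offs, size_t max)
{
	size_t i, n = 0;

	for (i = 0; s[i] != '\0'; i++)
		if (((unsigned char)s[i] & 0xC0) != 0x80 && n < max)
			offs[n++] = i;
	return n;
}
```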

The only downside I see is adversarial "wasteful" (overlong) encodings
of codepoints into longer UTF-8 sequences, but if we only want to match
the shortest forms, which occur in 99.999% of cases, we can just do a
byte-by-byte comparison, which would also be more efficient.
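The byte-by-byte comparison could then look roughly like this (again a
hypothetical sketch, not st code; isdelim and the offs[] array produced
by the preprocessing step are my own names). It matches the lookahead
bytes against each delimiter codepoint's raw byte sequence, so only the
shortest encodings are recognized.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: return 1 if the bytes at p begin with any
 * delimiter codepoint. delim is the raw delimiter string; offs[0..n-1]
 * are the byte offsets at which its codepoints start. Both sides are
 * compared as raw bytes, assuming shortest-form UTF-8 throughout. */
static int
isdelim(const char *p, const char *delim, const size_t *offs, size_t n)
{
	size_t i, len;

	for (i = 0; i < n; i++) {
		/* length of codepoint i = distance to the next offset */
		len = (i + 1 < n ? offs[i + 1] : strlen(delim)) - offs[i];
		if (strncmp(p, delim + offs[i], len) == 0)
			return 1;
	}
	return 0;
}
```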

The question is always how deep we want to go into the Unicode rabbit
hole. I am currently working on a self-generating LUT-based grapheme
cluster "detector" (it basically says whether or not there is a
grapheme-cluster-break between two codepoints). By doing a sort of
preprocessing on the worddelimiters-string and identifying the offsets
at which each grapheme cluster begins, you could then go about simply
comparing byte sequences.

The downside here is, yet again, ambiguity. There are ways to
"normalize" grapheme clusters, but, e.g., the ordering of codepoints
within a cluster is not always guaranteed.

Anyway, just my 2 cents. The way it is right now works out though and
everything regarding the cancerous wide-char-standard has been said.

With best regards


Laslo Hunhold <>

Received on Thu Mar 14 2019 - 10:55:44 CET
