Re: [hackers] [st][patch] replace utf8strchr with wcschr from Hiltjo Posthuma on 2019-03-15 (hackers mail list archive)

From: Hiltjo Posthuma <hiltjo_AT_codemadness.org>
Date: Fri, 15 Mar 2019 12:52:04 +0100

On Thu, Mar 14, 2019 at 09:57:02AM +0200, Lauri Tirkkonen wrote:
> Hi,
>
> On Wed, Mar 13 2019 20:35:09 +0100, Hiltjo Posthuma wrote:
> > I don't like mixing of the existing functions with wchar_t.
> > I think st should (at the very least internally) use utf-8.
>
> I think I explained my position poorly, so let me try to clarify.
> My apologies if this seems a bit pushy :)
>
> First - I agree with using UTF-8. That's actually how I ended up with
> this diff -- I was trying to configure U+3000 IDEOGRAPHIC SPACE as a
> delimiter, but seeing that worddelimiters was char *, I started
> wondering whether I could actually use unicode characters in it and had
> to go read the code, thus finding utf8strchr().
>
> utf8strchr() is a bit peculiar - on every call to ISDELIM(), it decodes
> the worddelimiters utf-8 string into Runes (so that it can compare to
> the Rune argument). It seems a little strange to me to be doing that --
> the delimiters string cannot change at runtime, so storing the
> codepoints instead of the multibyte string feels like a better fit. And
> that's what wchar_t * is, with the added bonus that we can use libc
> wcschr() instead of rolling our own search function.
>
> I already mentioned that Rune is being passed to wcwidth(wchar_t), so it
> seems like there is a builtin assumption that Rune and wchar_t hold
> equivalent values. I actually don't understand why that typedef exists
> instead of just using wchar_t; maybe I'm missing something.
>
> Could you explain what it is that you don't like about wchar_t?
>

Hi,

I've applied both of the patches and a small change to the default
worddelimiters.

Thanks for the clarifications. The codepoint assumption was indeed wrong.

I do not mind wchar_t, but in practice it is not consistent across platforms.
However, we already use wchar_t in st, so its use should be as correct as
possible and match the POSIX standard.
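
To illustrate the approach discussed above, here is a minimal sketch of a
delimiter check built on libc wcschr(). It is not the applied patch: the Rune
typedef, the isdelim() helper and the delimiter set (space plus U+3000) are
assumptions for illustration, and it relies on wchar_t holding Unicode
codepoints, which is exactly the platform-dependent part.

```c
#include <stdint.h>
#include <wchar.h>

/* Assumption: a 32-bit codepoint type, similar in spirit to st's Rune. */
typedef uint_least32_t Rune;

/* Hypothetical delimiter set: ASCII space and U+3000 IDEOGRAPHIC SPACE.
 * Stored as codepoints once, so nothing is decoded per lookup. */
static const wchar_t *worddelimiters = L" \u3000";

/* Returns 1 if u is a word delimiter, 0 otherwise.
 * The u != 0 guard is needed because wcschr() also matches the
 * terminating null wide character of the string. */
static int
isdelim(Rune u)
{
	return u != 0 && wcschr(worddelimiters, (wchar_t)u) != NULL;
}
```

With this sketch, isdelim(0x3000) and isdelim(L' ') are true, while
isdelim(L'a') and isdelim(0) are false.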

(@Laslo) For simplicity's and sanity's sake, I think assuming 1 codepoint is 1
"character" makes sense.

Thanks,

-- 
Kind regards,
Hiltjo

Received on Fri Mar 15 2019 - 12:52:04 CET
