Re: [dev] [st] Strange behaviour of backspace under csh under st from Steffen Nurpmeso on 2024-11-07 (dev mail list archive)

From: Steffen Nurpmeso <steffen_AT_sdaoden.eu>
Date: Thu, 07 Nov 2024 03:13:49 +0100

Steffen Nurpmeso wrote in
<20241107013734.gC5CYhMl_AT_steffen%sdaoden.eu>:
|Jinsong Zhao wrote in
| <09350f56-59c1-4a2f-b7cc-9063e0c241b2_AT_yeah.net>:
||I was trying to use st on a FreeBSD workstation, and my shell is csh.
||When I use backspace to delete the Chinese character, I observe strange
||behavior.
||
||On the first,
||zjs_AT_freebsd:~ % 中文|

not to mention that possibly only the wcwidth(3) attributes of
these "So" (Symbol, other) Unicode entries are wrong.
That would then be a bug in the locale tables of FreeBSD.

  ...
||This behavior is observed under bash, but not under sh.

Bash also uses wcwidth(3); sh seems to use the BSD editline library
instead, which surely runs many successive mbtowc(3) / wctomb(3)
conversions to get the data back and forth, and likely keeps, like
e.g. ncurses, "index slots" instead of plain "character byte data".
So when you backspace, all bytes making up an "index slot" are
removed, whereas st (and mksh, fwiw) simply "synchronizes back" on
the "character byte data" until it finds a UTF-8 start byte.
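
A minimal sketch of that "synchronize back" idea, not taken from st
or mksh: drop the last byte, then keep dropping while the byte at the
new end is a UTF-8 continuation byte (bit pattern 10xxxxxx). The
function name utf8_backspace is made up for illustration, and the
input is assumed to be well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Return the new length of buf[0..len) after removing the last UTF-8
 * character.  Hypothetical helper; assumes well-formed UTF-8. */
static size_t utf8_backspace(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len == 0)
        return 0;
    --len; /* always drop at least one byte */
    /* Continuation bytes look like 10xxxxxx; walk back to the start byte. */
    while (len > 0 && (buf[len] & 0xC0) == 0x80)
        --len;
    return len;
}
```

For the six bytes that encode U+4E2D U+6587 this returns 3, so one
backspace removes exactly one three-byte character. Nothing here
knows about display width or combining characters.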
That is: with Unicode combining characters etc., multiple adjacent
UTF-8 characters form a single "grapheme" in Unicode terms; many
languages require that. I.e., bash:

  master:lib/readline/rlmbutil.h:# define WCWIDTH(wc) ((_rl_utf8locale && UNICODE_COMBINING_CHAR(wc)) ? 0 : _rl_wcwidth(wc))

With that, backspace in reality has to skip over multiple adjacent
(UTF-8) characters, that is, several multi-byte sequences at once.
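
A rough sketch of such a backspace, under loud assumptions: only
U+0300..U+036F is treated as combining (real grapheme segmentation
covers far more), the input is well-formed UTF-8, and all function
names are hypothetical. It removes characters from the end until it
has removed one that is not combining, i.e. the base character.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: only the "Combining Diacritical Marks" block counts. */
static int is_combining(unsigned long cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Start index of the UTF-8 character ending just before `end` (end > 0). */
static size_t prev_char_start(const unsigned char *buf, size_t end)
{
    size_t i = end - 1;
    while (i > 0 && (buf[i] & 0xC0) == 0x80)
        --i;
    return i;
}

/* Decode one well-formed UTF-8 sequence starting at p. */
static unsigned long decode_at(const unsigned char *p)
{
    if (p[0] < 0x80)
        return p[0];
    if ((p[0] & 0xE0) == 0xC0)
        return ((unsigned long)(p[0] & 0x1F) << 6) | (unsigned long)(p[1] & 0x3F);
    if ((p[0] & 0xF0) == 0xE0)
        return ((unsigned long)(p[0] & 0x0F) << 12) |
               ((unsigned long)(p[1] & 0x3F) << 6) | (unsigned long)(p[2] & 0x3F);
    return ((unsigned long)(p[0] & 0x07) << 18) |
           ((unsigned long)(p[1] & 0x3F) << 12) |
           ((unsigned long)(p[2] & 0x3F) << 6) | (unsigned long)(p[3] & 0x3F);
}

/* New length after removing the last base character plus any
 * combining characters that follow it. */
static size_t cluster_backspace(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    while (len > 0) {
        size_t start = prev_char_start(buf, len);
        unsigned long cp = decode_at(buf + start);
        len = start;
        if (!is_combining(cp))
            break; /* base character removed, cluster is gone */
    }
    return len;
}
```

With "e" followed by U+0301 (bytes 65 CC 81) a single call removes
all three bytes, whereas a plain start-byte scan would remove only
the accent and leave the "e" behind.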
For the simple line editor i have written for my MUA i use

        tc.tc_novis = (iswprint(wc) == 0);
        tc.tc_width = a_tty_wcwidth(wc);

(where it is not wcwidth() because ISO C did not standardize it).
I use cells aka index-slots, too.
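
To illustrate what a cell-based buffer can look like, here is a
hypothetical sketch. It is not the MUA's actual code; only the idea
of storing a width and a "not visible" flag per cell comes from the
two lines above, and every name and size limit is an assumption.
Each cell keeps the bytes of one user-visible character together
with its display data, so backspace pops one cell and reports how
many columns the cursor has to move back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CELL_BYTES_MAX = 8, LINE_CELLS_MAX = 256 };

/* One user-visible character (hypothetical layout). */
struct cell {
    unsigned char bytes[CELL_BYTES_MAX]; /* encoded form */
    unsigned char len;                   /* bytes in use */
    signed char width;                   /* columns, as wcwidth(3) would say */
    unsigned char novis;                 /* non-zero if not printable */
};

struct line {
    struct cell cells[LINE_CELLS_MAX];
    size_t count;
};

/* Append one cell; returns 0 on success, -1 if it does not fit. */
static int line_push(struct line *lp, const void *bytes, size_t len,
                     int width, int novis)
{
    assert(lp != NULL && bytes != NULL);
    if (lp->count == LINE_CELLS_MAX || len == 0 || len > CELL_BYTES_MAX)
        return -1;
    struct cell *cp = &lp->cells[lp->count++];
    memcpy(cp->bytes, bytes, len);
    cp->len = (unsigned char)len;
    cp->width = (signed char)width;
    cp->novis = (unsigned char)(novis != 0);
    return 0;
}

/* Remove the last cell; returns the columns to erase on screen. */
static int line_backspace(struct line *lp)
{
    assert(lp != NULL);
    if (lp->count == 0)
        return 0;
    const struct cell *cp = &lp->cells[--lp->count];
    return cp->width > 0 ? cp->width : 0;
}
```

The byte count and the column count are stored separately, so a
three-byte, two-column character is removed in one step and the
cursor moves back by two columns.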

Having said that, i have now confused myself. What is plain is that
bash on Linux (glibc 2.40) *can* handle these characters. So likely
the character set data of the locale you are actually using on your
specific FreeBSD does not correctly describe the symbols you
mention. Now it *must* be said that in the latest UnicodeData
i have (from 2019, ooops), i see

  3197;IDEOGRAPHIC ANNOTATION MIDDLE MARK;So;0;L;<super> 4E2D;;;;N;KAERITEN TYUU;;;;
  32A5;CIRCLED IDEOGRAPH CENTRE;So;0;L;<circle> 4E2D;;;;N;CIRCLED IDEOGRAPH CENTER;;;;
  1F22D;SQUARED CJK UNIFIED IDEOGRAPH-4E2D;So;0;L;<square> 4E2D;;;;N;;;;;

  2F42;KANGXI RADICAL SCRIPT;So;0;ON;<compat> 6587;;;;N;;;;;
  3246;CIRCLED IDEOGRAPH SCHOOL;So;0;L;<circle> 6587;;;;N;;;;;

but *no* other occurrences of U+4E2D or U+6587, so maybe the
fallback for "unknown" code points is wrong. My thing uses

  # ifdef mx_HAVE_WCWIDTH
              w = (wc == '\t' ? 1 : wcwidth(wc));
  # else
              if(wc == '\t' || iswprint(wc))
                 w = 1 + (wc >= 0x1100u); /* S-CText isfullwidth() */
              else
                 w = -1;
  # endif

which is very shitty, but since both codepoints are above U+1100
we treat them as fullwidth aka of width 2. ...
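
The #else branch can be tried in isolation with the sketch below.
The function name fallback_width is made up, and the result of
iswprint(3) is passed in as a flag (an assumption made here so that
the example does not depend on the active locale).

```c
#include <assert.h>

/* Width guess without wcwidth(3): 1 column, or 2 from U+1100 upwards;
 * -1 for anything that is neither a tab nor printable.
 * `printable` stands in for iswprint(3) and must be 0 or 1. */
static int fallback_width(unsigned long cp, int printable)
{
    assert(printable == 0 || printable == 1);
    if (cp == '\t' || printable)
        return 1 + (cp >= 0x1100u);
    return -1;
}
```

Both U+4E2D and U+6587 come out as width 2 here, provided they are
reported as printable. The threshold is crude in the other direction
too: a narrow character such as U+1E00 (LATIN CAPITAL LETTER A WITH
RING BELOW) is also reported as width 2.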

Hope that helps .. :/

--steffen
|
|Der Kragenbaer, The moon bear,
|der holt sich munter he cheerfully and one by one
|einen nach dem anderen runter wa.ks himself off
|(By Robert Gernhardt)
|
|And in Fall, feel "The Dropbear Bard"s ball(s).
|
|The banded bear
|without a care,
|Banged on himself fore'er and e'er
|
|Farewell, dear collar bear
Received on Thu Nov 07 2024 - 03:13:49 CET
