Re: [dev] [st] wide characters from Random832 on 2013-04-14 (dev mail list archive)

From: Random832 <random832_AT_fastmail.us>
Date: Sun, 14 Apr 2013 10:56:26 -0400

On 04/14/2013 02:10 AM, Christoph Lohmann wrote:
> Greetings.
>
> On Sun, 14 Apr 2013 08:10:22 +0200 Random832 <random832_AT_fastmail.us> wrote:
>> I am forced to ask, though, why character cell values are stored in
>> utf-8 rather than as wchar_t (or as an explicitly unicode int) in the
>> first place, particularly since the simplest way to detect a wide
>> character is to call the function wcwidth. What was the reason for this
>> design decision? It doesn't save any space, since on most systems
>> UTF_SIZ == sizeof(int) == sizeof(wchar_t).
> That design decision can change when I’m actually implementing the
> double-width and double-height support in st. The codebase is small enough
> to change such a type in less than 10 minutes. So no religion was
> introduced here.

I asked about using codepoints instead of UTF-8 because I thought it
might make it easier to support combining diacritics, not wide
characters. The two problems are broadly related because both of them
affect the number of character cells occupied by a string.
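
To illustrate how both cases change the cell count, here is a rough
sketch in Python using the standard unicodedata module. The names
cell_width and string_cells are made up for this example, and the rule
(combining marks take zero cells, East Asian Wide/Fullwidth take two) is
a simplification, not what wcwidth or any particular terminal does.

```python
import unicodedata

def cell_width(ch: str) -> int:
    """Approximate number of terminal cells for one code point.

    Simplified assumption: control characters and other zero-width
    classes are not handled here.
    """
    if unicodedata.combining(ch) != 0:
        return 0  # combining mark: drawn on top of the previous cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # East Asian Wide or Fullwidth
    return 1

def string_cells(s: str) -> int:
    """Sum of the per-code-point cell widths of s."""
    return sum(cell_width(ch) for ch in s)

print(string_cells("abc"))           # three narrow characters
print(string_cells("e\u0301"))       # 'e' plus COMBINING ACUTE ACCENT
print(string_cells("\u6f22\u5b57"))  # two CJK ideographs
```

Working on decoded code points makes this lookup a one-liner per
character; with UTF-8 stored in each cell, the bytes would have to be
decoded first.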

>> And I don't know the st codebase well enough (or at all, really) to tell
>> at a glance what would have to be changed to be able to support a
>> double-width character cell, or to support wrapping to the next line if
>> one is output at the second-to-last column.
> I hadn't yet the time to read all the double-width implementations in other
> terminals so st would do the »right thing« in implementing all questionable
> cases.
>
> Double-width characters are, like BCE, a design decision applications need
> to adapt to.
>
> Some corner cases I haven't yet found a good answer to:
> * Is there any standard for this except for setting the flag in
> terminfo and taking up two cells in the terminal?

I don't know if there's a standard. I can find nothing about character
cell terminals in any UTR, and ECMA-48 is silent on the question of wide
characters.

I don't know what terminfo flag you are referring to. I was talking
about support for East Asian characters, not VT100-style stretching of
ASCII characters. I suspect the widcs/swidm/rwidm capabilities refer to
the latter (though the only actual instance in the terminfo database is
a swidm string on the att730).

Observed behavior in various terminals that do support them is:
* cursor position can be in either half of a double character, though
the whole character is highlighted (all observed terminals)
* outputting one at the end of the line (i.e. where a pair of two narrow
characters would be split across lines) fails entirely (xterm) or wraps
to the next line leaving the last cell alone (vte, tmux, mlterm, kterm).
* outputting a narrow character on top of a wide character erases the
entire wide character (xterm, tmux, mlterm, kterm) or erases only when
in the left half (vte)

* deleting (e.g. with ESC [ P) part of a character has various different
behaviors:
** on xterm and kterm, deleting either half of a character replaces the
remaining half with a single-width blank space.
** tmux's behavior is very buggy: a vertical line drawn across a
different part of the screen _after_ deleting different parts of wide
characters on different lines ended up redrawing incorrectly after
refreshing. As for the wide characters themselves, deleting the left
half deletes the entire character and deleting the right half has no
effect, but there is some hidden state involved - a sequence of two
deletions will delete a single wide character. I suspect the "right
half" is filled with some placeholder value that is not output to the
host terminal, and they are deleted individually. This is consistent
with all of my observations.
** on mlterm, deleting the left half of a character deletes the entire
character; deleting the right half replaces it with two spaces.
** on vte, deleting the right half of a character replaces the _next_
character with a space. Deleting the left half replaces the present
character with a space, but seems to leave some hidden state, since the
cursor on this "space" is still double width.
* the xterm/kterm behavior seems the most rational, since it yields no
visual glitches, always keeps the cursor in the same logical position,
and a deletion always shifts characters right of it by the same amount.
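
As a sketch of the xterm/kterm deletion behavior described above (a
model of what I observed, not code from either terminal), a row can be
held as a list of cells where a wide character occupies its own cell
plus a placeholder cell. WIDE_TAIL, make_row, delete_cell and render are
hypothetical names for this example, and make_row assumes the text fits
in the row.

```python
import unicodedata

WIDE_TAIL = None  # hypothetical placeholder for the right half of a wide char

def make_row(text, cols):
    """Lay out text into a list of cols cells (assumes it fits)."""
    row = []
    for ch in text:
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            row += [ch, WIDE_TAIL]  # wide char takes two cells
        else:
            row.append(ch)
    return row + [" "] * (cols - len(row))

def delete_cell(row, x):
    """Delete the single cell at column x, as ESC [ P would.

    If the cell is half of a wide character, the surviving half is
    replaced by a blank, so everything to the right always shifts left
    by exactly one cell.
    """
    row = list(row)
    if row[x] is WIDE_TAIL:
        row[x - 1] = " "  # orphaned left half becomes a blank
    elif x + 1 < len(row) and row[x + 1] is WIDE_TAIL:
        row[x + 1] = " "  # orphaned right half becomes a blank
    del row[x]
    row.append(" ")       # blank shifts in at the end of the line
    return row

def render(row):
    """Text of the row, skipping placeholder cells."""
    return "".join(c for c in row if c is not WIDE_TAIL)

row = make_row("a\u6f22b", 6)
print(repr(render(delete_cell(row, 1))))  # delete the left half
print(repr(render(delete_cell(row, 2))))  # delete the right half
```

In this model both deletions leave the same row behind: the wide
character is gone, one blank remains in its place, and "b" has moved
left by one cell.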

I haven't made any detailed investigation into the actual set of
characters that are considered wide (or combining) by each terminal and
by various applications (except tmux, which has a list of ranges in
utf8.c). I also haven't investigated whether any of them have
locale-dependent treatment of "ambiguous" characters (e.g. Greek or
Cyrillic), which are wide in historical East Asian fonts (except tmux,
which does not).
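
For reference, the "ambiguous" class can be looked up in Python's
unicodedata module, which reports it as "A". The sketch below shows what
a locale-dependent treatment might look like; the ambiguous_wide
parameter is an assumption made for this example, not an option of any
of the terminals mentioned.

```python
import unicodedata

def cell_width(ch: str, ambiguous_wide: bool = False) -> int:
    """Cell width with an optional wide treatment of ambiguous characters.

    ambiguous_wide is a hypothetical switch standing in for an East
    Asian locale or legacy-font setting.
    """
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A" and ambiguous_wide:
        return 2
    return 1

for ch in ("a", "\u03b1", "\u042f", "\u6f22"):
    # Latin a, Greek alpha, Cyrillic Ya, a CJK ideograph
    print(hex(ord(ch)), unicodedata.east_asian_width(ch),
          cell_width(ch), cell_width(ch, ambiguous_wide=True))
```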

mlterm does have an option that makes it work differently; the above
results are with -Z enabled.

> * If st has double-width default.
> * What happens if the application does naive character
> counting? Will layouts break?

My experience is that layouts break now. I can't think of an application
that would break that doesn't already break because of UTF-8 support
(by counting bytes).
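
A small sketch of how naive counting breaks a layout: padding a label to
a fixed column by byte count or by character count gives the wrong
on-screen width once wide characters are involved. The function names
are made up for this example, and display_width assumes East Asian
Wide/Fullwidth characters take two cells and everything else takes one.

```python
import unicodedata

def display_width(s: str) -> int:
    # Assumption: Wide/Fullwidth code points take two cells, all others one.
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
               for c in s)

def pad_by_bytes(s: str, cols: int) -> str:
    # Naive: treats the UTF-8 byte count as the column count.
    return s + " " * max(0, cols - len(s.encode("utf-8")))

def pad_by_chars(s: str, cols: int) -> str:
    # Naive: treats the number of code points as the column count.
    return s + " " * max(0, cols - len(s))

def pad_by_cells(s: str, cols: int) -> str:
    # Width-aware: pads according to the cells the string occupies.
    return s + " " * max(0, cols - display_width(s))

label = "\u65e5\u672c\u8a9e"  # three CJK ideographs, nine bytes in UTF-8
for pad in (pad_by_bytes, pad_by_chars, pad_by_cells):
    print(pad.__name__, display_width(pad(label, 10)))
```

Only the width-aware version ends up at the requested ten cells; the
byte-counting version comes up short and the character-counting version
overshoots.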

> * Is there some way to tell the application that we have
> double-width support enforced except for the terminfo?

I would argue that an application that doesn't expect wide character
support shouldn't be outputting CJK characters.

> * How do applications implement this? Is there some historical
> cruft that will break?

I can't speak for every application ever, but I did observe that zsh
breaks when characters in the prompt that should be wide are treated as
narrow. I have never heard of anything breaking in ways related to this
(as opposed to e.g. by byte counting on UTF-8) with the behavior
currently implemented in other terminal emulators.

> * With an option to toggle the double-width handling:
> * Is this needed for tmux, screen or other terminal proxies
> that for example miss BCE too?

tmux does support it. I don't know about screen.

mlterm has such an option. With mlterm's implementation, characters are
still visually wide, which leads to some visual glitches and surprising
cursor movement behavior.

> These are the questions I miss an answer to before implementing this.
> The code isn’t a problem.
>
>
> Sincerely,
>
> Christoph Lohmann
>
>
Received on Sun Apr 14 2013 - 16:56:26 CEST
