Re: [dev] [st] Erasing UTF-8 characters in ed from Roberto E. Vargas Caballero on 2014-08-04 (dev mail list archive)

From: Roberto E. Vargas Caballero <k0ga_AT_shike2.com>
Date: Mon, 4 Aug 2014 23:00:55 +0200

> from the source code. xterm has several related options (-lc to adapt the
> encoding to the locale used, -u8 to force UTF-8, and others). I glanced

As far as I know, xterm uses an external program for that: luit(1).

> st does not use termios(3), and does not seem to do anything special
> depending on the locale encoding. I'm certainly missing something, because I
> don't understand how you can have iutf8 enabled in st by default!

I think it depends on the default stty flags of your system.
I know that the linux vt starts in non-unicode mode, and the startup
scripts have to call unicode_start to switch it to unicode.
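
To see which default a given system hands to the terminal, the flag can
be read back with tcgetattr(3). This is only a minimal sketch: the
function name iutf8_state is made up for illustration, and since IUTF8
is not POSIX the code assumes nothing about it and returns -1 where the
macro is not defined.

```c
#include <termios.h>

/*
 * Report whether the tty behind fd has the IUTF8 input flag set.
 * Returns 1 if set, 0 if clear, and -1 if fd is not a tty or the
 * system does not define IUTF8 (it is not POSIX).
 * iutf8_state is a hypothetical helper, not part of st or xterm.
 */
int
iutf8_state(int fd)
{
#ifdef IUTF8
	struct termios t;

	if (tcgetattr(fd, &t) < 0)
		return -1;
	return (t.c_iflag & IUTF8) ? 1 : 0;
#else
	(void)fd;
	return -1;
#endif
}
```

Calling iutf8_state(0) from a program started inside the terminal
reports what the line driver was configured with at that moment.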

> Yes, I saw after I had sent my message that iutf8 is not POSIX. Does the
> erase character work correctly with multi-byte characters in cat or ed on
> your OpenBSD machine?

This is even stranger. If I run st from a dwm shortcut:

static const char *termcmd[] = { "/usr/local/bin/st", "-e", "utmp", NULL };

when I type a non-ASCII letter I get it in latin1 encoding. If I
run /usr/local/bin/st -e /usr/local/bin/utmp from the command line
(in an st terminal opened by dwm), I get it in utf8 encoding. If I
launch xterm from the dmenu shortcut of dwm I get latin1 encoding,
but if I launch it from the command line I get utf8 encoding. If I
execute (in an st or xterm started from the command line):

        $ touch f.txt
        $ ed f.txt <<EOF
        > a
        > á
        > .
        > w
        > q
        > EOF
        0
        3
        $ hexdump f.txt
        0000000 a1c3 000a
        0000003
        $

        That is correct.
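
        The dump looks byte-swapped because hexdump, by default,
        prints 16-bit words in host byte order. A small sketch of
        that grouping, assuming a little-endian machine (hexwords is
        a made-up name, not a real tool):

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Format buf the way the default hexdump output groups it on a
 * little-endian machine: two bytes per column, printed as one 16-bit
 * word, with a trailing odd byte padded with zero.
 * Writes a NUL-terminated string to out and returns it; out must hold
 * at least 5 * ((n + 1) / 2) + 1 bytes.
 * hexwords is a hypothetical helper for illustration only.
 */
char *
hexwords(const unsigned char *buf, size_t n, char *out)
{
	char *p = out;
	size_t i;

	*p = '\0';
	for (i = 0; i < n; i += 2) {
		unsigned lo = buf[i];
		unsigned hi = (i + 1 < n) ? buf[i + 1] : 0;

		/* the second byte of each pair is printed first */
		p += sprintf(p, "%s%02x%02x", i ? " " : "", hi, lo);
	}
	return out;
}
```

        With the three bytes c3 a1 0a (UTF-8 "á" plus a newline)
        this gives "a1c3 000a", which matches the dump.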

If I execute (again in command line execution):

        $ stty erase ^H
        $ touch f.txt
        $ ed f.txt <<EOF
        > a
        > á^H
        > .
        > w
        > q
        > EOF
        0
        4
        $ hexdump f.txt
        0000000 a1c3 0a08
        0000004
        $

        ^H is not interpreted, and that is correct because
        the input of ed doesn't pass through the line driver.

If I execute ed without the here document (again from the command
line):

        $ stty erase ^H
        $ touch f.txt
        $ ed f.txt
        0
        a

        .
        w
        2
        q
        $ hexdump f.txt
        0000000 0ac3
        0000002
        $

        That is incorrect: only the last byte of the two-byte
        character was erased. But I get the same output with st and
        with xterm.
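
        The difference between the two behaviours can be sketched
        like this. It is not the code of any real tty driver, only an
        illustration; erase_byte and erase_utf8 are made-up names, and
        the input is assumed to be well-formed UTF-8.

```c
#include <stddef.h>

/*
 * Byte-wise erase: drop exactly one byte, as a line driver that knows
 * nothing about UTF-8 does.  Returns the new length of the line.
 */
size_t
erase_byte(const unsigned char *line, size_t len)
{
	(void)line;
	return len ? len - 1 : 0;
}

/*
 * Character-wise erase: skip the trailing continuation bytes
 * (bit pattern 10xxxxxx), then drop the lead byte before them.
 * Returns the new length of the line.
 */
size_t
erase_utf8(const unsigned char *line, size_t len)
{
	while (len > 0 && (line[len - 1] & 0xC0) == 0x80)
		len--;
	return len ? len - 1 : 0;
}
```

        For the line c3 a1 ("á"), erase_byte leaves the stray c3
        seen in the dump, while erase_utf8 leaves an empty line.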

If I try the same test on the terminal emulator of the OpenBSD
kernel I get:

        $ hexdump f.txt
        0000000 000a
        0000001
        $

        That is correct, but this terminal emulator runs in latin1
        encoding (and as far as I know, there is no way of changing it).

I am not sure what is happening, but two things are clear to me:

        - dwm is doing something wrong, because terminals launched by it
          get an incorrect encoding for input characters.
        - The OpenBSD tty driver doesn't handle utf8 encoding correctly.

I will repeat these tests tomorrow on linux.

Regards,

-- 
Roberto E. Vargas Caballero

Received on Mon Aug 04 2014 - 23:00:55 CEST
