Re: 8-bit transparency in the C locale vs. UTF-8 support (was Re: [dev] [sbase][RFC] Add a simplistic version of tr) from Silvan Jegen on 2013-12-25 (dev mail list archive)

From: Silvan Jegen <s.jegen_AT_gmail.com>
Date: Wed, 25 Dec 2013 13:53:15 +0100

On Tue, Dec 24, 2013 at 10:31:37PM +0000, Thorsten Glaser wrote:
> Strake dixit:
>
> >Use wchar.h functions and a sane libc, e.g. musl, which has a pure
> >UTF-8 C locale, which ISO C explicitly allows [1].
> >
> >The 8-bit clarity what POSIX wants [1] seems nonsense to me, as one
> >can use byte functions for that, but I may be wrong.
> ^^^^^^^^^^^^^^^^^^^^^^
> Not always, see below.
>
> >[1] http://wiki.musl-libc.org/wiki/Functional_differences_from_glibc
>
> MirBSD has exactly one “locale” (just enough to satisfy POSIX),
> and it’s pure UTF-8 (with a 16-bit wchar_t though) but 8-bit clean.
> This was a requirement from the start.

Wouldn't a 16-bit wchar_t be non-conforming in a UTF-8 locale? The man
page for stddef.h (0P) says:

wchar_t
          Integer type whose range of values can represent distinct
          wide-character codes for all members of the largest character
          set specified among the locales supported by the compilation
          environment: the null character has the code value 0 and each
          member of the portable character set has a code value equal to
          its value when used as the lone character in an integer
          character constant.

Since the Unicode code space has room for more than 1.1 million code
points (U+0000 through U+10FFFF), a 16-bit wchar_t would confine you to
the Unicode BMP, i.e. a subset of Unicode and not "all members of the
largest character set specified."
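
To put numbers on that, here is a minimal sketch in Python (not C, and
not tied to any particular libc). The constant and function names are
made up for illustration; the only assumption is that a 16-bit wchar_t
can hold values up to 0xFFFF.

```python
# Sketch: how much of the Unicode code space fits into a 16-bit wchar_t.
UNICODE_MAX = 0x10FFFF   # highest code point in the Unicode code space
WCHAR16_MAX = 0xFFFF     # largest value an unsigned 16-bit type can hold


def fits_in_16_bits(ch: str) -> bool:
    """Return True if the single code point ch is representable in 16 bits."""
    return ord(ch) <= WCHAR16_MAX


total_code_points = UNICODE_MAX + 1        # size of the whole code space
outside_bmp = UNICODE_MAX - WCHAR16_MAX    # code points above U+FFFF

print(total_code_points, outside_bmp)
print(fits_in_16_bits("\u00e9"))       # U+00E9, inside the BMP
print(fits_in_16_bits("\U0001F600"))   # U+1F600, outside the BMP
```

Anything above U+FFFF simply has no single 16-bit value, so such a
wchar_t cannot give each of those characters a distinct code.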

> Imagine this: txtfile and binfile are, respectively, a plain text
> UTF-8 file and a binary file (say, an ELF object). “with*locale”
> is a placeholder to set the respective LC_* settings or something.
>
> $ withClocale tr x x <txtfile >txtfile2
> $ withUTF8locale tr x x <txtfile >txtfile3
> $ withClocale tr x x <binfile >binfile2
> $ withUTF8locale tr x x <binfile >binfile3
>
> The output of this, when using a character-aware tr(1), will be:
> • txtfile2 and txtfile3 will be identical to txtfile
> • binfile2 will be identical to binfile
> • binfile3 will be 0 bytes long, and the system will
> have thrown EILSEQ, because the binary file contains
> sequences that are not conforming UTF-8; this is actually
> *required* and *correct* and the reason Debian has introduced
> (on my prodding) a “C.UTF-8” locale, which is just the same
> as “C” except with UTF-8 encoding, and _always_ installed.
>
> Now, on a system with multiple locales, you can just set the
> appropriate locale when dealing with files you know are binary
> or UTF-8 text. If you know.
>
> But if your “C” locale is UTF-8, you absolutely lose the ability
> to operate the standard Unix utilities on nōn-UTF-8 files (or,
> for example, files with mixed encoding). Hilarity ensues (such
> as nvi in Debian trashing files *on save*, with no warning before
> and no method to revert) with such files in UTF-8 encodings.
>
> You cannot just “use the byte functions” because, for example,
> you want to use tr(1), or you want to use your favourite editor
> on a file that’s “mostly” UTF-8 but contains some “raw octets”;
> the script I use in MirBSD to convert catmanpages to HTML is
> such an example because these octets (e.g. \xFE and \xFF) are
> used as separators for sed(1) calls, or placeholders.
>
> I hope to have sufficiently shown my case.

So the problem seems to be that binary files contain bytes that are not
valid UTF-8 and that using tools on them that expect UTF-8 will mangle
these files.
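
The four tr invocations quoted above can be mimicked roughly like this.
It is only a sketch in Python, not how sbase tr or any libc works:
tr_bytes and tr_utf8 are hypothetical helpers, the sample data is
invented, and Python's UnicodeDecodeError stands in for EILSEQ.

```python
def tr_bytes(data: bytes, src: bytes, dst: bytes) -> bytes:
    """Byte-oriented translation: every octet passes through, text or not."""
    return data.translate(bytes.maketrans(src, dst))


def tr_utf8(data: bytes, src: str, dst: str) -> bytes:
    """Character-oriented translation: input must decode as strict UTF-8."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    return text.translate(str.maketrans(src, dst)).encode("utf-8")


# Invented sample inputs: valid UTF-8 text and an ELF-like blob with raw octets.
text_file = "na\u00efve x\n".encode("utf-8")
binary_file = b"\x7fELF\xfe\xffx\x00"

# "tr x x" on text is a no-op in both modes.
assert tr_bytes(text_file, b"x", b"x") == text_file
assert tr_utf8(text_file, "x", "x") == text_file

# The byte-oriented mode also leaves the binary data intact.
assert tr_bytes(binary_file, b"x", b"x") == binary_file

# The character-oriented mode refuses the binary data outright.
try:
    tr_utf8(binary_file, "x", "x")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```

The strict decoder gives up at the first byte that cannot start a UTF-8
sequence, which corresponds to the failing binfile3 case.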

IMHO, either not running UTF-8 text tools on binary files or not mixing
UTF-8 with invalid byte sequences in the first place seems to be the
obvious "solution."

> Now, as for the solution, as first appeared in MirBSD:
> [...]

I am not sure I understood the details of your solution but I think I
could follow the general approach. However, it seems to be an awfully
involved solution to a problem that can be avoided by not doing stupid
things known to be stupid (i.e. running UTF-8 text tools on non-text
[or mixed] data).

But maybe I just lack the experience to appreciate the gravity of the
problem...
Received on Wed Dec 25 2013 - 13:53:15 CET
