[dev] Re: 8-bit transparency in the C locale vs. UTF-8 support

From: Thorsten Glaser <tg_AT_mirbsd.de>
Date: Wed, 25 Dec 2013 23:01:13 +0000 (UTC)

Silvan Jegen dixit:

>Wouldn't a 16-bit wchar_t be non-conforming to the standard when
>using a UTF-8 locale?

Nope. UTF-8 is just an encoding for Unicode, and as long as I take
care to #define __STDC_ISO_10646__ 200009L (and no later date), this
is perfectly permissible: as of that date, ISO 10646 had assigned no
characters outside the Basic Multilingual Plane, so every assigned
character fits into 16 bits.
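
To make that concrete, here is a minimal sketch (mine, not from the
original mail) of the compile-time check this amounts to; it only
assumes a C99 <stdint.h> whose WCHAR_MAX is usable in #if:

    #include <stdint.h>   /* WCHAR_MAX, usable in #if per C99 */

    /*
     * Up to __STDC_ISO_10646__ == 200009L, ISO 10646 assigned no
     * characters outside the Basic Multilingual Plane, so a 16-bit
     * wchar_t can represent every assigned character and the macro
     * may legitimately be defined.
     */
    #if defined(__STDC_ISO_10646__) && \
        (__STDC_ISO_10646__ <= 200009L) && (WCHAR_MAX >= 0xFFFF)
    /* conforming even with a 16-bit wchar_t */
    #endif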

(And please do not language-lawyer me, I’ve had enough of those,
and since I can prove that 100% POSIX compliance is probably illegal
in my country, I don’t care, even.)


>So the problem seems to be that binary files contain bytes that are not
>valid UTF-8 and that using tools on them that expect UTF-8 will mangle
>these files.

No. The problem is that “using tools that use the wchar_t API” will
mangle them _iff_ the locale is UTF-8.
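
A tiny sketch (my illustration, assuming a UTF-8 locale is in effect)
of where the mangling starts: mbrtowc() rejects any byte that cannot
occur in valid UTF-8, so a wchar_t-based filter has to drop, replace,
or bail out on it:

    #include <stdio.h>
    #include <locale.h>
    #include <wchar.h>

    int
    main(void)
    {
    	char buf[] = "\377";	/* 0xFF never occurs in valid UTF-8 */
    	wchar_t wc;
    	mbstate_t st = { 0 };

    	setlocale(LC_ALL, "");	/* assume this selects a UTF-8 locale */
    	if (mbrtowc(&wc, buf, 1, &st) == (size_t)-1)
    		fprintf(stderr, "invalid multibyte byte: mangling begins\n");
    	return (0);
    }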

So if your C locale is UTF-8, you *will* break all kinds of things,
since “env LC_ALL=C tr x x <binfile” is supposed to retain the binary
input unchanged.
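
For contrast, a sketch (again mine) of the byte-level behaviour the
C locale is there to guarantee: a filter that never decodes simply
cannot mangle anything, which is in effect what "tr x x" under
LC_ALL=C must boil down to:

    #include <stdio.h>

    /* copy stdin to stdout byte for byte -- 8-bit transparent,
     * regardless of whether the input happens to be valid UTF-8 */
    int
    main(void)
    {
    	int c;

    	while ((c = getchar()) != EOF)
    		putchar(c);
    	return (0);
    }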

This just means that your C locale cannot be strictly UTF-8. All
other locales can be, but the C locale exists precisely to give you
this 8-bit transparency; it is special like that.

bye,
//mirabilos
-- 
13:37⎜«Natureshadow» Deep inside, I hate mirabilos. I mean, he's a good
guy. But he's always right! In every fsckin' situation, he's right. Even
with his deeply perverted taste in software and borked ambition towards
broken OSes - in the end, he's damn right about it :(! […] works in mksh
Received on Thu Dec 26 2013 - 00:01:13 CET