Re: [dev] sed breaks utf8 in [ ]

From: FRIGN <>
Date: Mon, 30 Mar 2015 21:41:20 +0200

On Mon, 30 Mar 2015 21:13:11 +0200
Markus Wichmann <> wrote:

Hey Markus,

> How? I heard that assertion before but never found anyone willing to
> explain that one more.
> (...)
> Unfortunately, your opinion on that will have to contend with all the
> other ones on the topic. And stuff like this is largely decided by
> majority consensus, even if it is insane.

let me explain what I meant in more detail, as this complex topic
didn't receive the wording it deserves.
To say this up front: This is not a matter of opinion, but about making
sane choices about which technologies to employ in a project like
sbase/ubase.

> There is also an ISO standard for formatting dates and times. No-one
> prefers it over their local customary one.

There's no problem with that and it makes sense. I personally prefer the
ISO-dates, because it depends on where you live what 02-01-15 means.
So I use the 2015-02-01 ISO-format as often as possible. There's no real
reason not to use it. It's the same as discussing imperial measurements.
The US-Americans have become so used to their "great" units that they'd
rather die than adopt the sane SI-units.
In the long run, nobody cares. I wouldn't even mind dropping locale-based
date formats from strftime().

Although, given it's no hassle to use it, we implemented it anyway and
everybody's happy.
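
For reference, a minimal sketch of producing the ISO format with strftime().
The helper name format_iso_date is made up for this illustration and is not
taken from sbase; it assumes the caller already has a filled-in struct tm.

```c
#include <stddef.h>
#include <time.h>

/*
 * Format the calendar date in *tm as ISO 8601 (YYYY-MM-DD).
 * Returns the number of bytes written (excluding the terminating NUL),
 * or 0 if buf is too small to hold the result.
 * format_iso_date is a hypothetical helper name.
 */
size_t
format_iso_date(char *buf, size_t len, const struct tm *tm)
{
	/* %Y, %m and %d are plain numeric conversions, so no
	 * locale-specific date layout is involved here. */
	return strftime(buf, len, "%Y-%m-%d", tm);
}
```

With tm_year = 115, tm_mon = 1 and tm_mday = 1 this writes "2015-02-01",
whereas the locale-based %x conversion would give a different layout
depending on the active locale.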

> Localized differences do exist in reality.

This always comes up in these discussions. I reply at this length in the
hope that this question will not come up again in the future and, if it
does, there's a document like this explaining it.
I bet the differences you are talking about are, for instance, the sorting
order of umlauts. In a German locale, you want u and ü grouped together,
but maybe not in another locale.
I gotta tell you: This is the most insane thing ever. I won't even get
into the fact that Unicode _demands_ that sorting functions respect
one-to-many and many-to-one mappings in sorting and case conversion, which
is impossible to achieve with the current sorting functions demanded by the

In the end, the idea of locales is founded in some deeply-resting issue
with self-guilt, assuming there's some African tribe which sorts after
I gotta tell ya: Software should be all about consistency. The Unicode
guys know best about the cultural history of Unicode codepoints and they
have had these things sorted out for years.
POSIX has just been infested by some maniacs and old habits: instead of
adopting the well-designed Unicode algorithms, it relies on some voodoo
definition of locales and collating sequences, going so far that not even
_GNU tar_ knows how to deal with [=e=].
There's literally not one single line in the POSIX spec actually _really_
telling you how to deal with these things.
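
To make the consistency point concrete, here is a minimal sketch of a sort
that gives the same result under every locale. Note the assumption: this is
plain code point order obtained by comparing UTF-8 bytes, not the Unicode
collation algorithms mentioned above, and the names cmp_bytes and
sort_codepoint_order are made up for this example.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: compare two strings byte by byte, ignoring locale. */
static int
cmp_bytes(const void *a, const void *b)
{
	const char *const *sa = a;
	const char *const *sb = b;

	/* strcmp() compares bytes as unsigned char, and in UTF-8 the
	 * byte order of two strings matches their code point order. */
	return strcmp(*sa, *sb);
}

/* Sort an array of n UTF-8 strings in code point order (hypothetical helper). */
void
sort_codepoint_order(const char **strs, size_t n)
{
	qsort(strs, n, sizeof *strs, cmp_bytes);
}
```

Sorting "zebra", "über", "apfel", "uhr" this way always yields apfel, uhr,
zebra, über. Using strcoll() in the comparator instead would make the
position of "über" depend on the LC_COLLATE setting.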

> Hold your horses here. sbase and ubase protest against GNU's
> bloatedness, and GNU manages bloatedness very fine on its own. Locales
> are but a fraction of GNU's insanity. (Because to these guys, dynamic
> linking and loading is AWESOME!)

Locales have been developed by POSIX alongside the GNU madness.

> Why? That is inconsistent! There is an ISO standard on date formats, so
> why not just use that? Isn't it simplicity in all things you strive for?
> Surely you can manage to learn a different date format then, if it makes
> the program easier to write!

You can't question the fact that the ISO standard is a sane date format.
See above for the reasons why.
In Germany, our main date format is "DD.MM.[YY]YY". It confuses the hell
out of me when I see CCTV-recordings with the date like "MM-DD-YY".
Sometimes you have to research where a video comes from to really know
which date is meant.
Enough of that.

> Now obviously, that was sarcasm. If I recall correctly, computers exist
> to serve humans, not the other way around.

The only way for computers to serve humans is to allow them to communicate
with us in unambiguous ways. You can't just ask a computer, as you would a
human, what it means by a date; the date has to be in the right notation in
the first place.

> Yes, we do. And it would be awesome if we would just tell the libc about
> that as well.
> But it is! The issue at hand is that glibc is unwilling to handle UTF-8
> in regexs unless it is set to UTF-8 mode. And the only way to do that is
> to call setlocale().

If all we tell the libc is that we assume UTF-8, that would be fine with
me, but I'd like to hear the others' opinions about it.

> I suppose you could drive your religious hatred of setlocale() so far as
> to write your own UTF-8 aware regex engine.

I don't hate setlocale(), I just feel sorry that we've come so far to have
a broken interface like that. :P

> Look at it from a pragmatic side: It'll only really affect the people
> who use sbase and ubase linked against glibc, anyway. setlocale() in
> musl is pretty much a no-op; I don't know if it exists in dietlibc, but
> a) dietlibc's a toy, and b) it couldn't have a complete locale support
> in that little code; and who uses uclibc these days anyway? Remains
> klibc and libc5, which I can't comment on.

setlocale() is probably a stub in each of them. I know that it is a stub
in musl.

> The point is, I don't understand your reasoning here: You are faced with
> a problem you know the solution of, yet you are unwilling for no good
> reason to implement the solution. Other than "I don't want to!".

Using setlocale() activates "magic" in the libc I am unable to know about.
I am concerned about the security and sanity of the results, and this is
the reason why I have a problem with that.
I propose UTF-8 everywhere, everything else sucks. Until then, I could live
with these preliminary locale-sets.



Received on Mon Mar 30 2015 - 21:41:20 CEST
