Re: [dev] sed breaks utf8 in [ ]

From: Markus Wichmann <>
Date: Mon, 30 Mar 2015 21:13:11 +0200

On Mon, Mar 30, 2015 at 08:33:48PM +0200, FRIGN wrote:
> On Mon, 30 Mar 2015 19:05:19 +0200
> Markus Wichmann <> wrote:
> > How about simply calling setlocale()? Or was that too simple? If the
> > user has set a non-UTF-8 locale and then uses UTF-8, that's on them!
> POSIX locales are an insane concept.

How? I have heard that assertion before, but never found anyone willing
to explain it any further.

The only thing I can think of is that struct lconv has quite a few
insane parts to it (especially haunting was the member that describes
thousands-separation), and that supporting a large repository of
locales would always require dynamic loading, parsing, or a huge lib,
which no one really wants.
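
For reference, the number-formatting data in question is reachable in
standard C through localeconv(), which returns a struct lconv. The
following is only a rough sketch; the helper name
print_numeric_conventions is made up for illustration and is not taken
from any of the projects discussed here.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: dump the number-formatting fields of the
 * currently active locale. */
void print_numeric_conventions(void)
{
    const struct lconv *lc = localeconv();
    const char *g;

    /* localeconv() is specified to return a valid pointer; the assert
     * only documents that expectation. */
    assert(lc != NULL);

    printf("decimal_point: \"%s\"\n", lc->decimal_point);
    printf("thousands_sep: \"%s\"\n", lc->thousands_sep);

    /* grouping is a string of small integers, one per digit group,
     * counted from the right; an empty string means no grouping. */
    printf("grouping:");
    for (g = lc->grouping; *g != '\0'; g++)
        printf(" %d", (int)*g);
    printf("\n");
}
```

In the default "C" locale the separator and the grouping string are
both empty, so the odd cases only show up once a program has opted in
with setlocale().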

> Unicode has already gone a long
> way to define sane international collation and sorting sequences which
> make sense.

Unfortunately, your opinion on that will have to contend with all the
other ones on the topic. And stuff like this is largely decided by
majority consensus, even if it is insane.

There is also an ISO standard for formatting dates and times. No-one
prefers it over their local customary one.
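
To make the contrast concrete, here is a small sketch using strftime()
from standard C. The helper names format_iso8601 and format_locale_date
are invented for this example. The first always produces the ISO 8601
layout; the second uses "%x", which follows whatever LC_TIME locale the
program has selected.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: ISO 8601 date and time, independent of locale.
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t format_iso8601(char *buf, size_t size, const struct tm *tm)
{
    assert(buf != NULL && tm != NULL);
    return strftime(buf, size, "%Y-%m-%dT%H:%M:%S", tm);
}

/* Hypothetical helper: the date in the active locale's customary
 * format. Without a setlocale() call this is the "C" locale. */
size_t format_locale_date(char *buf, size_t size, const struct tm *tm)
{
    assert(buf != NULL && tm != NULL);
    return strftime(buf, size, "%x", tm);
}
```

A program that never calls setlocale() gets the "C" locale's idea of
"%x", which is itself just one particular local custom.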

> The idea of localized differences has its origin in the
> sick minds of the POSIX-authors.

Localized differences do exist in reality.

> sbase and ubase are one part of a protest against all this locale-
> madness.

Hold your horses here. sbase and ubase protest against GNU's bloat, and
GNU manages bloat just fine on its own. Locales are but a fraction of
GNU's insanity. (Because to these guys, dynamic linking and loading is
AWESOME!)

Also, POSIX over-specifies these things. I once read the POSIX manpage
of cp(1), and it was specified so tightly that you could only write it
in Haskell if you were willing to jump through some large hoops.

> I agree there should be localized date-formats, but everything
> beyond that is mostly insane.

Why? That is inconsistent! There is an ISO standard on date formats, so
why not just use that? Isn't it simplicity in all things you strive for?
Surely you can manage to learn a different date format then, if it makes
the program easier to write!

Now obviously, that was sarcasm. If I recall correctly, computers exist
to serve humans, not the other way around.

> We assume a UTF-8-locale and that's it.

Yes, we do. And it would be awesome if we just told the libc about that
as well.

> setlocale is just ugly and imho
> not the solution to this issue.

But it is! The issue at hand is that glibc is unwilling to handle UTF-8
in regexes unless it is set to UTF-8 mode. And the only way to do that
is to call setlocale().
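
A minimal sketch of what that would look like, assuming a POSIX libc
with <regex.h>. The helper names use_environment_ctype and
bracket_matches are made up for illustration and are not sbase code.
Whether a multibyte character inside [ ] is then treated as a single
character still depends on the libc and on the environment naming a
UTF-8 locale.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: adopt the character encoding named by the
 * environment (LC_ALL, LC_CTYPE, LANG). Returns 1 on success, 0 if
 * the requested locale is not available. */
int use_environment_ctype(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Hypothetical helper: returns 1 if the extended regex matches text,
 * 0 if it does not, and -1 if the pattern fails to compile. */
int bracket_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

A tool would call use_environment_ctype() once at startup, before
compiling any patterns, and leave every other locale category alone.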

I suppose you could drive your religious hatred of setlocale() so far as
to write your own UTF-8 aware regex engine.

Look at it from a pragmatic side: It'll only really affect the people
who use sbase and ubase linked against glibc, anyway. setlocale() in
musl is pretty much a no-op; I don't know if it exists in dietlibc, but
a) dietlibc's a toy, and b) it couldn't have complete locale support in
that little code; and who uses uclibc these days anyway? That leaves
klibc and libc5, which I can't comment on.

The point is, I don't understand your reasoning here: You are faced with
a problem you know the solution to, yet you are unwilling to implement
that solution for no good reason other than "I don't want to!".

Received on Mon Mar 30 2015 - 21:13:11 CEST
