Re: [dev] [sbase][RFC] Add a simplistic version of tr

From: Silvan Jegen <s.jegen_AT_gmail.com>
Date: Tue, 24 Dec 2013 17:20:08 +0100

On Thu, Nov 28, 2013 at 12:45:40PM +0200, sin wrote:
> On Tue, Nov 26, 2013 at 12:01:01PM -0800, Silvan Jegen wrote:
> > If you would rather not take this version, what approach would
> > you take for the character set mapping when using UTF-8? A hashmap-,
> > or B-tree-based solution or something else entirely?
>
> I am not knowledgeable enough about UTF-8 so I can't answer this.
> I think a B-tree is overkill for sbase. We do not have a nice
> implementation of a hash table in sbase as we did not need it but
> if we go down that path it makes sense to put this in util/ so other
> programs can benefit. Currently we don't have an implementation of
> a singly linked list that we can reuse, but that is trivial enough and
> we've re-implemented it wherever needed (with the minimum set of
> operations needed for each tool). I can send an implementation of
> a hash table that I've used for my own programs, MIT/X licensed and it is
> simple enough.

I played around with the mmap-based approach suggested in this thread
and as far as I can tell it works beautifully. I will post the code as
soon as I'm finished testing the program using more diverse inputs.


> Regarding UTF-8, some other programs in sbase also lack proper handling
> of UTF-8. Do you think we could embed libutf8 from suckless.org and
> use it?

In my current implementation I use libutf to convert from UTF-8 to the
corresponding Unicode code points.
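
To make the conversion step concrete, here is a rough sketch of what
decoding one UTF-8 sequence into a code point involves. This is not
libutf's code and not the code from my patch; utf8_decode is a made-up
name used only for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of libutf): decode one UTF-8 sequence
 * from s (at most n bytes) into *cp. Returns the number of bytes
 * consumed, or 0 if the sequence is invalid or truncated.
 */
size_t
utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
	size_t len, i;
	unsigned long c, min;

	if (n == 0)
		return 0;
	if (s[0] < 0x80) {                      /* plain ASCII */
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {     /* 110xxxxx */
		len = 2; c = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {     /* 1110xxxx */
		len = 3; c = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {     /* 11110xxx */
		len = 4; c = s[0] & 0x07; min = 0x10000;
	} else {
		return 0;                       /* stray continuation or bad lead byte */
	}
	if (n < len)
		return 0;
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)      /* continuation must be 10xxxxxx */
			return 0;
		c = (c << 6) | (s[i] & 0x3F);
	}
	/* reject overlong forms, surrogates and values past U+10FFFF */
	if (c < min || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
		return 0;
	*cp = c;
	return len;
}
```

Once the input is a sequence of code points, the set mapping in tr can
work on integers instead of bytes, which is the whole point of the
conversion.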

I just realized that I use putwchar to print the converted Unicode code
points, which raises the question of whether we should drop libutf and use
the locale-dependent C library functions like

mbtowc
wctomb

instead (these two are declared in stdlib.h; putwchar and the restartable
mbrtowc/wcrtomb variants are the ones in wchar.h). IIRC their functionality
is equivalent to libutf as far as the conversion is concerned, provided the
program runs in a UTF-8 locale, though, according to [1], the POSIX locale
mechanism seems to suck.
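
For comparison, a minimal sketch of the locale-based route, assuming the
caller has already selected a suitable locale. map_char is a hypothetical
helper invented for this example, not something that exists in sbase; it
replaces a single character, where a real tr would look each character up
in a set.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: copy src to dst (dstlen bytes), replacing every
 * character equal to `from` with `to`. Decoding and encoding follow the
 * current LC_CTYPE locale. Returns 0 on success, -1 on an invalid
 * multibyte sequence or if dst is too small.
 */
int
map_char(const char *src, char *dst, size_t dstlen, wchar_t from, wchar_t to)
{
	char buf[MB_LEN_MAX];
	wchar_t wc;
	size_t left = strlen(src), used = 0;
	int n, m;

	if (dstlen == 0)
		return -1;

	/* reset the internal shift state of both functions */
	mbtowc(NULL, NULL, 0);
	wctomb(NULL, 0);

	while (left > 0) {
		n = mbtowc(&wc, src, left);     /* bytes -> wide character */
		if (n <= 0)
			return -1;
		if (wc == from)
			wc = to;
		m = wctomb(buf, wc);            /* wide character -> bytes */
		if (m < 0 || used + m >= dstlen)
			return -1;
		memcpy(dst + used, buf, m);
		used += m;
		src += n;
		left -= n;
	}
	dst[used] = '\0';
	return 0;
}
```

For non-ASCII input the program would have to call
setlocale(LC_CTYPE, "") first and be run in a UTF-8 locale; in the
default "C" locale only ASCII is guaranteed to convert. That dependency
on the environment is exactly the trade-off in question.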

So I guess the question boils down to whether you would rather use
libutf or the standardized, locale-dependent wide-character functions
for the UTF-8 conversion. I see one advantage of the standard functions:
if we use them, we avoid adding an external dependency to sbase. The
disadvantage is that we would depend on the whole POSIX locale
machinery, which seems unnecessarily complicated in places.

What are your thoughts?


(a happy Christmas eve/Hanukkah/Spaghettimonster day btw!)

[1] http://harmful.cat-v.org/software/
Received on Tue Dec 24 2013 - 17:20:08 CET
