On Sat, Apr 12, 2014 at 07:19:45PM +0200, Hiltjo Posthuma wrote:
> On Sat, Apr 12, 2014 at 6:58 PM, Silvan Jegen <s.jegen_AT_gmail.com> wrote:
> >
> >> I'll also probably rewrite the mmap code to use malloc since it causes
> >> issues on some machines.
> >
> > The reason we used mmap was that it allocates memory only on use. So
> > even if we mmap space for 1'114'112 ints (one for each unicode point)
> > to do the mapping, we do not use
> >
> > (1'114'112 * 4) / (1024 * 1024) = 8.5MB
> >
> > of memory but only the few characters we are actually mapping to set2.
> >
> > Using malloc would result in the memory being allocated regardless
> > of whether we actually map all unicode characters or not, leading to
> > 8.5MB of memory being used in any case (additionally we would need to
> > initialise the memory to zero to make the current algorithm work IIRC,
> > since mmap already does that automatically on access).
> >
> > If there are issues that cannot be worked around though we may not have
> > another choice. What kind of issues are you experiencing?
> >
>
> Yeah I'm aware of how mmap works and in that way it's quite clever :).
> In the future we might want to support character classes ([:lower:],
> [:alnum:] etc). I'm not sure how well it would fit in with the current
> parsemapping code. For now I'll leave it as is and focus on more
> important things or please feel free to help and improve :)
I think it is one of those 80/20 (pareto principle) cases. The current
approach would likely have to be reworked in order to implement the
character classes. In my experience the current solution covers more
than 99% of my use cases while staying stupidly simple and thus "good
enough." To cover the remaining 1% of use cases would be probably be
quite a lot of effort but I encourage anyone to do so :-)
> Offtopic but related to tr: mbtowc() return code isn't checked at the
> moment, but can be -1 if an invalid character is given as an argument.
True. I would suggest just adding checks and bailing out when the return
code is <0. Maybe something like the following?
From: Silvan Jegen <s.jegen_AT_gmail.com>
Date: Sat, 12 Apr 2014 20:50:51 +0200
Subject: [PATCH] Wrap mbtowc to check for errors
---
tr.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/tr.c b/tr.c
index aad2245..7579b6a 100644
--- a/tr.c
+++ b/tr.c
_AT_@ -46,6 +46,17 @@ handleescapes(char *s)
}
}
+static int
+xmbtowc(wchar_t *unicodep, const char *s)
+{
+ int rv;
+
+ rv = mbtowc(unicodep, s, 4);
+ if (rv < 0)
+ eprintf("mbtowc:");
+ return rv;
+}
+
static void
parsemapping(const char *set1, const char *set2, wchar_t *mappings)
{
_AT_@ -64,12 +75,12 @@ parsemapping(const char *set1, const char *set2, wchar_t *mappings)
while(*s1) {
if(*s1 == '\\')
handleescapes(++s1);
- leftbytes = mbtowc(&runeleft, s1, 4);
+ leftbytes = xmbtowc(&runeleft, s1);
s1 += leftbytes;
if(*s2 == '\\')
handleescapes(++s2);
if(*s2 != '\0') {
- rightbytes = mbtowc(&runeright, s2, 4);
+ rightbytes = xmbtowc(&runeright, s2);
s2 += rightbytes;
}
mappings[runeleft] = runeright;
_AT_@ -85,7 +96,7 @@ maptonull(const wchar_t *mappings, char *in)
s = in;
while(*s) {
- leftbytes = mbtowc(&runeleft, s, 4);
+ leftbytes = xmbtowc(&runeleft, s);
if(!mappings[runeleft])
putwchar(runeleft);
s += leftbytes;
_AT_@ -101,7 +112,7 @@ maptoset(const wchar_t *mappings, char *in)
s = in;
while(*s) {
- leftbytes = mbtowc(&runeleft, s, 4);
+ leftbytes = xmbtowc(&runeleft, s);
if(!mappings[runeleft])
putwchar(runeleft);
else
--
1.9.2
Received on Sat Apr 12 2014 - 21:02:05 CEST