Re: [dev][sbase][RFC] "tr" with -d option or without?

From: Silvan Jegen <s.jegen_AT_gmail.com>
Date: Sat, 12 Apr 2014 21:02:05 +0200

On Sat, Apr 12, 2014 at 07:19:45PM +0200, Hiltjo Posthuma wrote:
> On Sat, Apr 12, 2014 at 6:58 PM, Silvan Jegen <s.jegen_AT_gmail.com> wrote:
> >
> >> I'll also probably rewrite the mmap code to use malloc since it causes
> >> issues on some machines.
> >
> > The reason we used mmap was that it allocates memory only on use. So
> > even if we mmap space for 1'114'112 ints (one for each unicode point)
> > to do the mapping, we do not use
> >
> > (1'114'112 * 4) / (1024 * 1024) = 8.5MB
> >
> > of memory but only the few characters we are actually mapping to set2.
> >
> > Using malloc would result in the memory being allocated regardless
> > of whether we actually map all unicode characters or not, leading to
> > 8.5MB of memory being used in any case (additionally we would need to
> > initialise the memory to zero to make the current algorithm work IIRC,
> > since mmap already does that automatically on access).
> >
> > If there are issues that cannot be worked around though we may not have
> > another choice. What kind of issues are you experiencing?
> >
>
> Yeah I'm aware of how mmap works and in that way it's quite clever :).
> In the future we might want to support character classes ([:lower:],
> [:alnum:] etc). I'm not sure how well it would fit in with the current
> parsemapping code. For now I'll leave it as is and focus on more
> important things or please feel free to help and improve :)

I think it is one of those 80/20 (pareto principle) cases. The current
approach would likely have to be reworked in order to implement the
character classes. In my experience the current solution covers more
than 99% of my use cases while staying stupidly simple and thus "good
enough." To cover the remaining 1% of use cases would be probably be
quite a lot of effort but I encourage anyone to do so :-)


> Offtopic but related to tr: mbtowc() return code isn't checked at the
> moment, but can be -1 if an invalid character is given as an argument.

True. I would suggest just adding checks and bailing out when the return
code is <0. Maybe something like the following?


From: Silvan Jegen <s.jegen_AT_gmail.com>
Date: Sat, 12 Apr 2014 20:50:51 +0200
Subject: [PATCH] Wrap mbtowc to check for errors

---
 tr.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/tr.c b/tr.c
index aad2245..7579b6a 100644
--- a/tr.c
+++ b/tr.c
_AT_@ -46,6 +46,17 @@ handleescapes(char *s)
 	}
 }
 
+static int
+xmbtowc(wchar_t *unicodep, const char *s)
+{
+	int rv;
+
+	rv = mbtowc(unicodep, s, 4);
+	if (rv < 0)
+			eprintf("mbtowc:");
+	return rv;
+}
+
 static void
 parsemapping(const char *set1, const char *set2, wchar_t *mappings)
 {
_AT_@ -64,12 +75,12 @@ parsemapping(const char *set1, const char *set2, wchar_t *mappings)
 	while(*s1) {
 		if(*s1 == '\\')
 			handleescapes(++s1);
-		leftbytes = mbtowc(&runeleft, s1, 4);
+		leftbytes = xmbtowc(&runeleft, s1);
 		s1 += leftbytes;
 		if(*s2 == '\\')
 			handleescapes(++s2);
 		if(*s2 != '\0') {
-			rightbytes = mbtowc(&runeright, s2, 4);
+			rightbytes = xmbtowc(&runeright, s2);
 			s2 += rightbytes;
 		}
 		mappings[runeleft] = runeright;
_AT_@ -85,7 +96,7 @@ maptonull(const wchar_t *mappings, char *in)
 
 	s = in;
 	while(*s) {
-		leftbytes = mbtowc(&runeleft, s, 4);
+		leftbytes = xmbtowc(&runeleft, s);
 		if(!mappings[runeleft])
 			putwchar(runeleft);
 		s += leftbytes;
_AT_@ -101,7 +112,7 @@ maptoset(const wchar_t *mappings, char *in)
 
 	s = in;
 	while(*s) {
-		leftbytes = mbtowc(&runeleft, s, 4);
+		leftbytes = xmbtowc(&runeleft, s);
 		if(!mappings[runeleft])
 			putwchar(runeleft);
 		else
-- 
1.9.2
Received on Sat Apr 12 2014 - 21:02:05 CEST

This archive was generated by hypermail 2.3.0 : Sat Apr 12 2014 - 21:12:07 CEST