Re: [dev] [sbase] [PATCH] uuencode base64 encoding/decoding and stdout output for uudecode from Ralph Eastwood on 2015-02-11 (dev mail list archive)

From: Ralph Eastwood <tcmreastwood_AT_gmail.com>
Date: Wed, 11 Feb 2015 13:43:06 +0000

On 10 February 2015 at 19:45, Ralph Eastwood <tcmreastwood_AT_gmail.com> wrote:
> On 10 February 2015 at 19:23, Ralph Eastwood <tcmreastwood_AT_gmail.com> wrote:
>>
>> Hi,
>>
>> Attached patch gives support for uuencode -m, base64 encoding and decoding
>> in uudecode.
>> Flag, -o, added so that uudecode can output to stdout to override the
>> output in encoded files.
>>
>> uudecode -m is accepted but ignored - the patch has an autodetection of
>> the file (begin-base64) in the header.
>>
>> Cheers,
>> Ralph
>
>
> Small fixes to this patch; uudecode didn't flush in a corner case - and
> there's an extra #include <assert.h> lying in uuencode.
>
>

Attached is another patch for uudecode which I think works properly
now *cross fingers*.

***

How base64 uuencode and uudecode work
================================

uuencodeb64
-----------------

The (new) suckless implementation of uuencode uses two buffers: an
input buffer and an output buffer. The algorithm assumes that the
size of the input buffer, Si, is a multiple of 3, and the output
buffer size depends on it. The output buffer size, So, is
4 * (Si / 3) + 1, because base64 takes a group of 3 bytes (24 bits)
and encodes it as 6-bit values, so 4 characters (4 * 6 = 24 bits)
are needed to carry the same information. Since the output comes in
groups of 4 characters and an unsigned int (uint32_t) is 4 bytes, the
implementation stores each group of base64 characters as one entry of
a uint32_t array. There is one additional entry for the newline
character. The implementation writes each full output buffer as one
line, so to give the same output as other implementations the buffer
sizes need to be kept as they are.

The workhorse of the algorithm is the loop below. gcc -O3 seems to
optimise it heavily, so you don't actually have to do any loop
unrolling for a fast version:

> for (pb = buf, po = out; pb < buf + n; pb += 3)
> *po++ = b64e(pb);

It uses b64e, which converts 3 bytes into 4 base64-encoded characters
using a lookup table.
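
As a rough illustration of what a b64e-style helper has to do, here is
a minimal sketch. It is not the code from the patch: the names
b64e_sketch and b64et are made up for this example, and the 4
characters are packed into the uint32_t with memcpy so that they land
in memory order regardless of endianness, which is an assumption about
how the array is later written out.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical lookup table: the standard base64 alphabet. */
static const char b64et[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode the 3 bytes at b into 4 base64 characters packed in a uint32_t. */
static uint32_t
b64e_sketch(const unsigned char *b)
{
	char c[4];
	uint32_t v;

	c[0] = b64et[b[0] >> 2];                          /* top 6 bits of b[0] */
	c[1] = b64et[((b[0] & 0x03) << 4) | (b[1] >> 4)]; /* 2 bits + 4 bits */
	c[2] = b64et[((b[1] & 0x0f) << 2) | (b[2] >> 6)]; /* 4 bits + 2 bits */
	c[3] = b64et[b[2] & 0x3f];                        /* low 6 bits of b[2] */
	memcpy(&v, c, sizeof(v)); /* keep the characters in memory order */
	return v;
}
```

With this layout the uint32_t array can be written out directly as
bytes, which is what makes the one-assignment-per-group loop possible.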

The other parts of the encoder deal with the final case, where fewer
than 3 bytes remain. Firstly, the last incomplete group of 3 input
bytes passed to the workhorse loop may not be filled with '\0', which
would give incorrect output. This part therefore clears the end of the
buffer to make sure b64e gives the correct output. Secondly, although
this gives b64e the correct input, it effectively encodes '\0' at the
end of the stream - the specification [0] dictates that '=' be used as
padding instead. This is implemented using masks: a mask is generated
dynamically for the cases of 1 and 2 bytes left and AND'd with the
result of b64e, the inverse of the mask is AND'd with the string
"====" as an int (0x3d3d3d3d), and the two results are OR'd together -
effectively replacing the '\0' with padding (or, in base64 form, 'A'
with '=').
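
The mask trick can be sketched as follows. This is only an
illustration under assumptions, not the patch itself: b64pad_sketch is
a hypothetical name, and the mask is built bytewise with
memset/memcpy so that the example does not depend on endianness. With
rem input bytes left (1 or 2), the first rem + 1 encoded characters
carry data and the rest become '='.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * enc: the 4 encoded characters of the last group, packed in memory order.
 * rem: number of real input bytes in that group (1 or 2).
 */
static uint32_t
b64pad_sketch(uint32_t enc, size_t rem)
{
	unsigned char m[4] = { 0, 0, 0, 0 };
	uint32_t mask;

	memset(m, 0xff, rem + 1);       /* keep the characters that carry data */
	memcpy(&mask, m, sizeof(mask));
	/* keep data characters, take '=' (0x3d) everywhere else */
	return (enc & mask) | (UINT32_C(0x3d3d3d3d) & ~mask);
}
```

For example, a lone 'M' zero-filled to "M\0\0" encodes to "TQAA", and
the mask turns that into "TQ==".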

uudecodeb64
-----------------

This algorithm uses a 60 byte input buffer and a 45 byte output
buffer, the same 4:3 ratio as in the encoder (without the newline
character this time). However, the implementation doesn't depend on
the actual sizes as long as the ratio is kept the same.
Unlike the encoding algorithm, this has to ignore whitespace in the
input, and hence will be slower no matter what (on top of the fact
that the input is larger!).

The algorithm is state machine based and uses a decoding table, b64dt,
which is generated by the attached program. The decoding table tells
the algorithm which characters are illegal, which are whitespace, and
the 6-bit value of each base64 character. Decoding byte by byte, the
state machine tracks which position of the 3-byte output it is
currently in. Once the output buffer is full, it is flushed. If a
padding character '=' is encountered, the decoder knows it has reached
the end of the input stream and calculates from the current decoding
state how many '=' are expected. Line numbers are tracked for
debugging errors in the stream (though ultimately unnecessary).
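
To make the state machine idea concrete, here is a minimal sketch. It
makes several simplifying assumptions and is not the patch: the names
b64class and b64decode_sketch are hypothetical, a small classifying
function stands in for the generated b64dt table, the output goes to a
caller-supplied buffer instead of being flushed every 45 bytes, and
line numbers are not tracked.

```c
#include <stddef.h>
#include <string.h>

enum { B64_BAD = -1, B64_WS = -2, B64_PAD = -3 };

/* Stand-in for the decoding table: 6-bit value or a class marker. */
static int
b64class(unsigned char c)
{
	static const char tbl[] =
		"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
	const char *p;

	if (c == '=')
		return B64_PAD;
	if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
		return B64_WS;
	if (c != '\0' && (p = strchr(tbl, c)) != NULL)
		return (int)(p - tbl);
	return B64_BAD;
}

/* Returns the number of decoded bytes, or -1 on malformed input. */
static long
b64decode_sketch(const char *in, size_t n, unsigned char *out)
{
	unsigned state = 0;     /* position within the current 4-char group */
	unsigned char acc = 0;  /* bits carried over to the next output byte */
	size_t i, o = 0, want, got = 0;
	int v;

	for (i = 0; i < n; i++) {
		v = b64class((unsigned char)in[i]);
		if (v == B64_WS)
			continue;       /* whitespace is skipped */
		if (v == B64_BAD)
			return -1;      /* illegal character */
		if (v == B64_PAD)
			break;          /* padding marks the end of the stream */
		switch (state) {
		case 0: acc = v << 2; break;
		case 1: out[o++] = acc | (v >> 4); acc = (v & 0x0f) << 4; break;
		case 2: out[o++] = acc | (v >> 2); acc = (v & 0x03) << 6; break;
		case 3: out[o++] = acc | v; break;
		}
		state = (state + 1) & 3;
	}
	if (i == n)             /* no padding: must end on a group boundary */
		return state == 0 ? (long)o : -1;
	if (state < 2)          /* '=' cannot follow 0 or 1 characters */
		return -1;
	want = 4 - state;       /* state 2 expects "==", state 3 expects "=" */
	for (; i < n; i++) {
		v = b64class((unsigned char)in[i]);
		if (v == B64_PAD)
			got++;
		else if (v != B64_WS)
			return -1;
	}
	return got == want ? (long)o : -1;
}
```

The point of the sketch is that the expected number of '=' falls out
of the state alone, so no separate length bookkeeping is needed.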

[0] http://pubs.opengroup.org/onlinepubs/000095399/utilities/uuencode.html

text/x-csrc attachment: genb64tab.c

text/x-patch attachment: 0004-uudecode-fix-flushing-again-through-rewrite.patch

Received on Wed Feb 11 2015 - 14:43:06 CET
