Re: [dev] Best way to serialize data

From: Connor Lane Smith <cls_AT_lubutu.com>
Date: Mon, 6 Jun 2011 19:22:55 +0100

Hey,

On 6 June 2011 18:19, Džen <yvldwt_AT_gmail.com> wrote:
> I was wondering about which way would be the easiest/simplest to
> serialize data, f.e. being read via a file or stdin (data being a
> table of x rows and y columns, each cell a string). I thought of
> using NULL bytes as cell delimiters and newline characters as row
> delimiters. This way it wouldn't be possible to use \0 nor \n
> inside the "cells", but I couldn't think of a simpler solution.

It ultimately depends on the use case. If you don't need \0 or \n in
cells, your format is fine. If you do, there are two approaches:

As Dieter suggested, you can use fixed length fields. This is great if
you have a maximum cell width, especially if this length is small or
most fields use most of the space. This approach is used in, for
example, tarballs' filename fields.
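
A minimal sketch of the fixed-length idea, in Python for brevity. The
width of 8 bytes and the names pack_cell/unpack_row are made up for the
example, and it assumes cells are NUL-padded and never end in \0
themselves (the padding could not be told apart from the data):

```python
WIDTH = 8  # assumed maximum cell width in bytes (hypothetical value)

def pack_cell(cell, width=WIDTH):
    """Pad a cell with NUL bytes to exactly `width` bytes."""
    if len(cell) > width:
        raise ValueError("cell longer than the fixed field width")
    return cell.ljust(width, b"\0")

def unpack_row(row, width=WIDTH):
    """Cut a row into fixed-width fields and strip the NUL padding."""
    return [row[i:i + width].rstrip(b"\0") for i in range(0, len(row), width)]

# Every cell occupies WIDTH bytes, so a row of two cells is 16 bytes
# and cell n always starts at offset n * WIDTH.
row = b"".join(pack_cell(c) for c in [b"hello", b"a\nb"])
cells = unpack_row(row)
```

Newlines inside a cell are no problem here, since nothing is parsed by
content; the cost is the wasted padding when cells are short.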

However, if the cells dramatically vary in length, and the maximum is
rather large, a better alternative is to use length-prefixing, using a
number of bytes according to how large you expect your rows and cells
to be:

0x0011 0x0006 "hello"\0 0x0007 "world!"\0

That is, a 2-byte row length followed by two cells, each with a 2-byte
cell length (and I've null-terminated the strings in the example, so
the terminator counts towards the cell length). The row length here is
the number of bytes following it: 2 + 6 + 2 + 7 = 17 = 0x11. You may
need 4 or 8 bytes if your data is very long. The benefit of this is
that you can check the row length and jump straight to the next row,
or carry on into the row and iterate its cells. It is also completely
independent of content: you can store anything.
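
A rough sketch of such a format in Python, under a few assumptions of
my own: lengths are 2-byte big-endian, the row length counts every
byte after the row-length field (cell prefixes included), and the
function names are only illustrative:

```python
import struct

def encode_row(cells):
    """Prefix each cell with its 2-byte length, then the row with its own."""
    body = b"".join(struct.pack(">H", len(c)) + c for c in cells)
    # struct.pack raises struct.error if a length exceeds 65535
    return struct.pack(">H", len(body)) + body

def row_offsets(data):
    """Yield the start of each row, hopping over row lengths only."""
    pos = 0
    while pos < len(data):
        (row_len,) = struct.unpack_from(">H", data, pos)
        yield pos
        pos += 2 + row_len

def decode_rows(data):
    """Walk every row and every cell inside it."""
    rows = []
    for start in row_offsets(data):
        (row_len,) = struct.unpack_from(">H", data, start)
        pos, end = start + 2, start + 2 + row_len
        cells = []
        while pos < end:
            (cell_len,) = struct.unpack_from(">H", data, pos)
            pos += 2
            cells.append(data[pos:pos + cell_len])
            pos += cell_len
        rows.append(cells)
    return rows

data = encode_row([b"hello\0", b"world!\0"]) + encode_row([b"a\nb"])
rows = decode_rows(data)
```

Note that row_offsets never looks inside a cell, which is the "jump
straight to the next row" part; decode_rows is the "carry on into the
row" part. The sketch does no bounds checking on truncated input.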

The problem with delimiter bytes is that you can't store arbitrary
binary data, and you have to scan each cell's content for the
delimiters (or escape them). It's a hassle; length-prefixing is way
easier.
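
To illustrate the delimiter problem, here is a small Python sketch of
the \0/\n format from the original question (the function names are
hypothetical, and no escaping is attempted):

```python
def encode_delimited(rows):
    """Join cells with NUL and end each row with a newline."""
    return b"".join(b"\0".join(cells) + b"\n" for cells in rows)

def decode_delimited(data):
    """Split on newline for rows, then on NUL for cells."""
    # the final split element is the empty tail after the last "\n"
    return [line.split(b"\0") for line in data.split(b"\n")[:-1]]

clean = decode_delimited(encode_delimited([[b"a", b"b"]]))
# A newline inside a cell is indistinguishable from a row delimiter,
# so one row of two cells comes back as two rows.
broken = decode_delimited(encode_delimited([[b"a", b"b\nc"]]))
```

With a length prefix the reader is told how many bytes to take, so the
same cell would come back unchanged.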

(This approach is very often used in binary protocols, such as 9P and Sam.)

Thanks,
cls
Received on Mon Jun 06 2011 - 20:22:55 CEST
