Re: [dev] Best way to serialize data

From: Džen <yvldwt_AT_gmail.com>
Date: Mon, 06 Jun 2011 21:07:05 +0200

That pretty much answers my question. In my use case it would be easier to
use delimiters like \0 or \n, since the data is not binary. However, now I
wonder which method needs more CPU time. I suppose that with delimiters
there is no easier way than calling fgetc() and reading through the whole
data stream byte by byte. Hard-coded field lengths would be faster if the
fields contain a lot of characters, I guess.
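
For what it's worth, delimiters do not force one fgetc() call per byte: if
the data is already in a buffer (say, from a single fread()), memchr() from
<string.h> can find the next delimiter. Every byte is still scanned, so the
cost stays linear in the size of the data, whereas fixed or prefixed lengths
let you skip a field without looking at its content. A minimal sketch,
assuming the whole input sits in memory; count_fields() and nth_field() are
made-up names for illustration only:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the delimiter-separated fields in
 * buf[0..len). Trailing bytes without a final delimiter count as one
 * more field. */
size_t count_fields(const char *buf, size_t len, char delim)
{
    size_t n = 0;
    const char *p = buf, *end = buf + len;

    while (p < end) {
        /* memchr scans forward for the next delimiter byte */
        const char *q = memchr(p, delim, (size_t)(end - p));
        n++;
        if (q == NULL)
            break;          /* last field, no terminating delimiter */
        p = q + 1;          /* continue after the delimiter */
    }
    return n;
}

/* Hypothetical helper: locate field number idx (0-based). On success
 * returns a pointer to its first byte and stores its length (without
 * the delimiter) in *flen; returns NULL if there is no such field. */
const char *nth_field(const char *buf, size_t len, char delim,
                      size_t idx, size_t *flen)
{
    const char *p = buf, *end = buf + len;

    while (p < end) {
        const char *q = memchr(p, delim, (size_t)(end - p));
        const char *stop = q ? q : end;

        if (idx == 0) {
            *flen = (size_t)(stop - p);
            return p;
        }
        idx--;
        if (q == NULL)
            break;
        p = q + 1;
    }
    return NULL;
}
```

Note that reaching field N this way means scanning all N fields before it,
which is exactly the work that known lengths avoid.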

On 06/06/2011 20:22, Connor Lane Smith wrote:
> It ultimately depends on the use case. If you don't need \0 or \n in
> cells, your format is fine. Otherwise, there are two approaches:
>
> As Dieter suggested, you can use fixed length fields. This is great if
> you have a maximum cell width, especially if this length is small or
> most fields use most of the space. This approach is used in, for
> example, tarballs' filename fields.
>
> However, if the cells dramatically vary in length, and the maximum is
> rather large, a better alternative is to use length-prefixing, using a
> number of bytes according to how large you expect your rows and cells
> to be:
>
> 0x000d 0x0006 "hello"\0 0x0007 "world!"\0
>
> That is, 2-byte row length followed by two cells each with a 2-byte
> cell length (and I've null-terminated the strings in the example). You
> may need 4 or 8 bytes if your data is very long. The benefit of this
> is that you can check the row length and jump straight to the next
> row, or carry on into the row and iterate its cells. It is also
> completely independent of content: you can store anything.
>
> The problem with using ASCII values is you can't store binary data,
> and you have to check each cell's content and everything. It's a
> hassle; using length-prefixing is way easier.
>
> (This approach is very often used in binary protocols, such as 9P and Sam.)
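
To check my understanding of the length-prefixed layout, here is a rough
sketch of how reading it could look. The assumptions are mine and not part
of the description above: lengths are 2 bytes in big-endian order, and the
row length counts every byte after the row length field, cell length
prefixes included. Under that reading the two example cells would give a
row length of 0x0011; the 0x000d in the example matches the cell payloads
alone (6 + 7). The names get_u16(), next_row() and next_cell() are
hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 2-byte length; big-endian byte order is an assumption. */
static uint16_t get_u16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Jump from the start of one row to the start of the next without
 * touching the cells. Returns NULL if the row is truncated. */
const unsigned char *next_row(const unsigned char *row,
                              const unsigned char *end)
{
    size_t rlen;

    if ((size_t)(end - row) < 2)
        return NULL;
    rlen = get_u16(row);
    if ((size_t)(end - row) - 2 < rlen)
        return NULL;
    return row + 2 + rlen;
}

/* Step through the cells of one row. p points at a cell's length
 * prefix, rowend at the end of the row. Stores the cell's data pointer
 * and length and returns the position of the next cell, or NULL when
 * the row is exhausted or truncated. */
const unsigned char *next_cell(const unsigned char *p,
                               const unsigned char *rowend,
                               const unsigned char **cell, size_t *clen)
{
    size_t n;

    if ((size_t)(rowend - p) < 2)
        return NULL;
    n = get_u16(p);
    if ((size_t)(rowend - p) - 2 < n)
        return NULL;
    *cell = p + 2;
    *clen = n;
    return p + 2 + n;
}
```

Skipping a row costs one length read and one pointer addition no matter
how long the row is, and the cell content is never inspected, so it can
hold any bytes.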

-- 
Džen
Received on Mon Jun 06 2011 - 21:07:05 CEST
