Re: [dev] [sbase] should tar preserve hard links info

From: Laslo Hunhold <dev_AT_frign.de>
Date: Tue, 31 Jan 2023 14:45:44 +0100

On Tue, 31 Jan 2023 14:05:05 +0100 (CET)
Andrea Calligaris <recluse_AT_email.it> wrote:

Dear Andrea,

> Do you think it should?
>
> I'm not interested in a short-term implementation, I'm more
> interested if you think that it should, or if you have an opinion
> against it. In my opinion it should, because someone may rely on
> hardlinks to manage their data, and if you want to back up your stuff
> in a tar archive, with the current sbase version you are losing the
> hardlink connection and it just raw-copies the file.
>
> Or should one not rely on hardlinks at all, because X, etc.
>
> Let me know what you think, thanks.

Just as a side-note: We are talking about support during "compression"
here. sbase-tar already supports "decompression" of symbolic and hard
links.

I think sbase-tar should support hardlinks in general, but it adds
complications that would require an elegant and scalable (!) solution.

First off, detecting whether a file has multiple hard links is possible
with plain POSIX interfaces [0]. The stat structure contains st_dev and
st_ino (which together give a file a unique ID that you can use at
compression time to keep tabs on it) and even st_nlink (which gives the
number of hard links to a file, so you only have to keep tabs on files
with st_nlink > 1). It is also POSIX-tar-compliant (UStar, pax) to add
hard links to archives.
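For illustration, a minimal sketch of that detection (the helper name
is mine and purely hypothetical, not a proposal for the actual sbase
code):

        #include <stdio.h>
        #include <sys/stat.h>

        /* Sketch: a file only needs to be tracked as a potential
         * hard link if it is a regular file with st_nlink > 1. */
        static int
        is_hardlink_candidate(const char *path)
        {
                struct stat st;

                if (lstat(path, &st) < 0) {
                        perror(path);
                        return 0;
                }
                return S_ISREG(st.st_mode) && st.st_nlink > 1;
        }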

Still, there is no upper bound to the number of hard links you might
encounter during "compression", and it would be ugly and inconsistent to
impose an arbitrary limit on that (for a static buffer). You could
alternatively just create and fill an array of structs of the form

        struct {
                dev_t dev;   /* st_dev of the file */
                ino_t ino;   /* st_ino of the file */
                char *fname; /* name it was first archived under */
        } possible_hard_link_target;

whenever you encounter a file with st_nlink > 1, and check every
subsequently encountered file with st_nlink > 1 against it. But you
would sacrifice the worst-case memory consumption, which is currently
O(1), for O(n) in the case of all files being hard links, which in
turn makes a naïve linear search on the array relatively inefficient.
Instead, you could use a linked list, hash table or some binary tree
over dev and ino, but this leads to cache misses and lots of pointer
madness. In this case it might be much better to keep the array
manually sorted (sorting by "dev.ino", i.e. dev is the "major" part
and ino the "minor" part) and then run a binary search, giving you
logarithmic checking time.
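
To make the idea concrete, here is a rough sketch of such a sorted
array with binary-search lookup (all names and the growth strategy are
mine, purely illustrative, not a proposal for the actual sbase code):

        #include <stdlib.h>
        #include <string.h>
        #include <sys/stat.h>

        /* Hypothetical table of already-seen (dev, ino) pairs, kept
         * sorted so lookups can use bsearch(). */
        struct linkent {
                dev_t dev;
                ino_t ino;
                char *fname;
        };

        static struct linkent *ents;
        static size_t nents, capents;

        static int
        cmp(const void *a, const void *b)
        {
                const struct linkent *x = a, *y = b;

                if (x->dev != y->dev)
                        return (x->dev < y->dev) ? -1 : 1;
                if (x->ino != y->ino)
                        return (x->ino < y->ino) ? -1 : 1;
                return 0;
        }

        /* Return the stored name if (dev, ino) was seen before,
         * otherwise remember fname and return NULL. */
        static const char *
        lookup_or_insert(dev_t dev, ino_t ino, const char *fname)
        {
                struct linkent key = { .dev = dev, .ino = ino }, *e;
                size_t i;

                e = bsearch(&key, ents, nents, sizeof(*ents), cmp);
                if (e)
                        return e->fname;

                if (nents == capents) {
                        capents = capents ? 2 * capents : 64;
                        ents = realloc(ents, capents * sizeof(*ents));
                        if (!ents)
                                exit(1); /* real code would error out cleanly */
                }

                /* find the insertion point and shift the tail with memmove() */
                for (i = 0; i < nents && cmp(&ents[i], &key) < 0; i++)
                        ;
                memmove(&ents[i + 1], &ents[i], (nents - i) * sizeof(*ents));
                ents[i].dev = dev;
                ents[i].ino = ino;
                ents[i].fname = strdup(fname);
                nents++;
                return NULL;
        }

A caller would emit a hard-link entry whenever lookup_or_insert()
returns a previously archived name, and store the file contents
normally otherwise.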

It might be a close call whether shuffling your array around and saving
time in the search beats a stupid O(1) addition time at the cost of
linear search time, but it shall not be forgotten how good CPUs have
become at memmove().

No matter how you approach it though, you will have to go into the
water and get your feet wet by sacrificing the constant memory
consumption, but I think the gains outweigh the costs.

With best regards

Laslo

[0]: https://pubs.opengroup.org/onlinepubs/009696799/basedefs/sys/stat.h.html