
Why not tar? Limitations of the tar file format

79 points | gnosis | 15 years ago | duplicity.nongnu.org

31 comments

[+] cperciva | 15 years ago
The first two issues -- a lack of index and the fact that you can't seek within a deflated tarball -- are true but are easily handled by smarter compression. Tarsnap, for example, splits off archive headers and stores them separately in order to speed up archive scanning.
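Tarsnap's actual on-disk format is its own design; purely as an illustration of the "split off the headers" idea, here is a sketch using Python's stdlib `tarfile`: one sequential pass over an uncompressed tarball builds a small side index, after which any member is a single seek away, no full archive scan required.

```python
import io
import tarfile

# Build a small uncompressed tarball in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for name, data in [("a.txt", b"alpha"), ("b.txt", b"bravo"), ("c.txt", b"charlie")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

# One sequential pass builds the index: name -> (payload offset, size).
buf.seek(0)
with tarfile.open(fileobj=buf) as tf:
    index = {m.name: (m.offset_data, m.size) for m in tf.getmembers()}

# Later reads are a single seek + read, with no archive scan.
offset, size = index["c.txt"]
buf.seek(offset)
assert buf.read(size) == b"charlie"
```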

The third issue -- lack of support for modern filesystem features -- is just plain wrong. Sure, the tar in 7th edition UNIX didn't support these, but modern tars support modern filesystem features.

The fourth issue -- general cruft -- is correct but irrelevant on modern tars since the problems caused by the cruft are eliminated via pax extension headers.
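A concrete example of pax headers papering over the cruft: the classic ustar header caps a member name at 100 bytes (plus a limited prefix field), while the pax format moves oversized values into extension records. A quick sketch with Python's stdlib `tarfile`, which can emit both formats:

```python
import io
import tarfile

# A member name whose final component exceeds the 100-byte ustar name field.
long_name = "deeply/nested/" + "x" * 150 + ".txt"

buf = io.BytesIO()
# PAX_FORMAT writes oversized values into pax extension header records.
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
    info = tarfile.TarInfo(long_name)
    info.size = 5
    tf.addfile(info, io.BytesIO(b"hello"))

buf.seek(0)
with tarfile.open(fileobj=buf) as tf:
    assert tf.getnames() == [long_name]  # name round-trips intact

# The plain ustar format simply cannot represent the same name.
try:
    with tarfile.open(fileobj=io.BytesIO(), mode="w",
                      format=tarfile.USTAR_FORMAT) as tf:
        tf.addfile(tarfile.TarInfo(long_name), io.BytesIO(b""))
    raise AssertionError("expected ustar to reject the long name")
except ValueError:
    pass
```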

[+] enneff | 15 years ago
"Other archive formats like WinZip..."

The guy immediately loses credibility in my eyes for referring to the most popular archive format as 'WinZip'. It's the ZIP file format, designed by Phil Katz of PKWare Inc.

http://en.wikipedia.org/wiki/ZIP_(file_format)

To add insult to injury, the rest of his proposal is pretty similar to ZIP, which also accomplishes the nice-to-have things he mentions at the end.

[+] bl4k | 15 years ago
What this article describes has already been solved with zip, gzip, 7z, bzip and forks of tar

The problem is that at the moment there is no open standard (there are IETF proposals) since each of these is either patent, copyright or trademark encumbered.

[+] nailer | 15 years ago
It's very difficult to talk about 'tar' per se. Do you mean:

* GNU tar?

* BSD tar?

* Solaris tar?

Or even Schilly's 'star' program?

Each of these has different limits, advantages, and disadvantages.

[+] rarrrrrr | 15 years ago
this detail is wrong: the tar that ships on Mac OS X does indeed support resource forks.
[+] anon_d | 15 years ago
> Because tar does not support encryption/compression on the inside of archives.

Yes it does? Just encrypt/compress all the files before tarring.
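The "compress each file, then tar" approach can be sketched with the stdlib `gzip` and `tarfile` modules (a minimal illustration, not a recommendation either way; the thread below covers the ratio trade-off):

```python
import gzip
import io
import tarfile

files = {"a.log": b"alpha " * 100, "b.log": b"bravo " * 100}

# Compress each file individually, then tar the resulting .gz members.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for name, data in files.items():
        gz = gzip.compress(data)
        info = tarfile.TarInfo(name + ".gz")
        info.size = len(gz)
        tf.addfile(info, io.BytesIO(gz))

# Each member now decompresses independently of the others.
buf.seek(0)
with tarfile.open(fileobj=buf) as tf:
    member = tf.extractfile("a.log.gz")
    assert gzip.decompress(member.read()) == files["a.log"]
```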

> Not indexed

The reason tar doesn't have an index is so that tarballs can be concatenated. Also IIRC, you only have to jump through the headers for all files. Still O(n) where n is the number of files, but you don't have to scan through all of the data.
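Concatenation works because a tar stream is just a sequence of header+data blocks followed by trailing zero blocks, which readers can be told to skip. A stdlib sketch, where `tarfile`'s `ignore_zeros=True` plays the role of GNU tar's `--ignore-zeros`:

```python
import io
import tarfile

def make_tar(name, data):
    """Return the bytes of a one-member uncompressed tarball."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# Plain byte-level concatenation of two independent tarballs.
combined = make_tar("one.txt", b"first") + make_tar("two.txt", b"second")

# ignore_zeros=True skips the end-of-archive blocks of the first tarball.
with tarfile.open(fileobj=io.BytesIO(combined), ignore_zeros=True) as tf:
    assert tf.getnames() == ["one.txt", "two.txt"]
```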

[+] gwern | 15 years ago
> The reason tar doesn't have an index is so that tarballs can be concatenated.

I'm curious, what's the use-case for this? Offhand, the only use for that ability I can think of is if I forgot a file in a tarball and have already deleted the originals; I can tar the missing file and cat the two tarballs.

[+] cybernytrix | 15 years ago
Compressing before tarring is a really dumb idea and you will get terrible compression ratios - you cannot exploit data patterns across files. It could work if you ask gzip to write some sort of a global table...
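The cross-file effect is easy to measure: compressing the whole stream lets the compressor reuse matches from earlier files, while per-file compression starts cold each time and pays a header overhead per file. A rough stdlib comparison (exact numbers depend on the data):

```python
import gzip

# Twenty files with heavily shared content, as in a typical log directory.
files = [("log%02d.txt" % i).encode() + b": " + b"GET /index.html 200\n" * 50
         for i in range(20)]

whole = len(gzip.compress(b"".join(files)))           # compress across files
per_file = sum(len(gzip.compress(f)) for f in files)  # compress each alone

# Patterns shared between files only help the whole-stream case.
assert whole < per_file
```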
[+] micheljansen | 15 years ago
I think raising these concerns is fair in a world where nearly all Unix-related source code and binaries are distributed in (g/bzipped) TAR format. Unfortunately, the author does not really explain why this is and what is wrong with ZIP (e.g. why a new format is needed).

I guess that one of the reasons for TAR's dominance is the lack of a free alternative? Apparently ZIP is not free enough (as I understand from http://en.wikipedia.org/wiki/ZIP_(file_format)#Standardizati...).

TAR is old however, and if ZIP cannot take its place, coming up with something new is not such a bad idea. I think Apple's DMG/UDIF file format deserves to be mentioned as well: it addresses all the concerns mentioned (it is essentially a mountable filesystem). I'm pretty sure there is a lot to be learned from that.

[+] bootload | 15 years ago
"... Because tar does not support encryption/compression on the inside of archives ..."

That can be an advantage. Space isn't always what I want for backups - I want the original data back, and compression gone wrong (tar -zxvf) is just another way to lose data.

[+] fhars | 15 years ago
That is exactly why the lack of in-archive compression is bad: with a compressed tarball you lose the whole rest of the archive on a single bit error, while with in-archive compression you lose just the file the error is located in.
[+] dagw | 15 years ago
The pkzip format allows you to "zip" data uncompressed if you are worried about that. Then you can trivially unpack your files using nothing but seek and read for those cases where you also accidentally misplace your last copy of unzip.
[+] masklinn | 15 years ago
> That can be an advantage. Space isn't always what I want for backups

Most if not all "compression" formats (and software) offer a "store" compressor which stores the data as-is, without applying any compression filter.
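In Python's stdlib, for example, that is `zipfile.ZIP_STORED` (the module's default, in fact): members are written verbatim, so the payload bytes sit uncompressed inside the archive and survive even if you have to carve them out by hand:

```python
import io
import zipfile

payload = b"precious backup bytes"

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("data.bin", payload)

# Stored members are present byte-for-byte in the archive itself...
assert payload in buf.getvalue()

# ...and still round-trip through the normal API.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    assert zf.read("data.bin") == payload
```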

[+] nanairo | 15 years ago
Does anyone know how duplicity compares to XAR? DAR? Or CFS or 7z?
[+] hernan7 | 15 years ago
As long as they don't make me use cpio, I'm fine.
[+] joey_bananas | 15 years ago
We should go back to that embedded shellscript thingy that was common back in the day. Its name escapes me.