Generally good points. Unfortunately, existing file formats rarely follow these rules. In fact, these rules tend to emerge naturally once you have dealt with many different file formats. Specific points follow:
- Agreed that human-readable formats have to be dead simple, otherwise binary formats should be used. Note that textual numbers are surprisingly complex to handle, so any formats with significant number uses should just use binary.
- Chunking is generally good for structuring and incremental parsing, but do not expect it to provide reorderability or back/forward compatibility somehow. Unless explicitly designed, they do not exist. Consider PNG for example; PNG chunks were designed to be quite robust, but nowadays some exceptions [1] do exist. Versioning is much more crucial for that.
[1] https://www.w3.org/TR/png/#animation-information
- Making a new file format from scratch is always difficult. Already mentioned, but you should really consider using existing file formats as a container first. Some formats are even explicitly designed for this purpose, like sBOX [2] or RFC 9277 CBOR-labeled data tags [3].
[2] https://nothings.org/computer/sbox/sbox.html
[3] https://www.rfc-editor.org/rfc/rfc9277.html
> Note that textual numbers are surprisingly complex to handle, so any formats with significant number uses should just use binary.
Especially true of floats!
With binary formats, it's usually enough to only support machines whose floating point representation conforms to IEEE 754, which means you can just memcpy a float variable to or from the file (maybe with some endianness conversion). But writing a floating point parser and serializer which correctly round-trips all floats and where the parser guarantees that it parses to the nearest possible float... That's incredibly tricky.
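For illustration, here is what the binary path looks like in Python (using `struct` in place of a raw memcpy; `"<d"` pins down both the IEEE 754 binary64 layout and little-endian byte order, so the endianness conversion is handled for you):

```python
import struct

def write_f64_le(value: float) -> bytes:
    # "<d" = IEEE 754 binary64, little-endian: a fixed, portable layout.
    return struct.pack("<d", value)

def read_f64_le(data: bytes) -> float:
    return struct.unpack("<d", data)[0]

# Bit-for-bit round-trip, no decimal conversion involved.
x = 0.1 + 0.2  # a value with no short exact decimal form
assert read_f64_le(write_f64_le(x)) == x
```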
What I've sometimes done when I'm writing a parser for textual floats is, I parse the input into separate parts (so the integer part, the floating point part, the exponent part), then serialize those parts into some other format which I already have a parser for. So I may serialize them into a JSON-style number and use a JSON library to parse it if I have that handy, or if I don't, I serialize it into a format that's guaranteed to work with strtod regardless of locale. (The C standard does, surprisingly, quite significantly constrain how locales can affect strtod's number parsing.)
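A rough Python sketch of that delegation trick (the number grammar here is hypothetical; the point is that the hard correctly-rounded decimal-to-binary conversion is handed off to a parser that already solved it, `float()` in this case):

```python
import re

# Hypothetical number grammar: sign, integer part, optional fraction/exponent.
_NUM = re.compile(r"^([+-]?)(\d+)(?:\.(\d+))?(?:[eE]([+-]?\d+))?$")

def parse_number(text: str) -> float:
    m = _NUM.match(text)
    if not m:
        raise ValueError(f"not a number: {text!r}")
    sign, int_part, frac, exp = m.groups()
    # Re-serialize the pieces into one canonical form and delegate the
    # correctly-rounded conversion to the runtime's own parser.
    return float(f"{sign}{int_part}.{frac or '0'}e{exp or '0'}")

assert parse_number("3.14") == 3.14
assert parse_number("-2e10") == -2e10
```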
Spent the weekend with an untagged chunked format, and... I rather hate it.
A friend wanted a newer save viewer/editor for Dragonball Xenoverse 2, because there's about a total of two, and they're slow to update.
I thought it'd be fairly easy to spin up something to read it, because I've spun up a bunch of save editors before, and they're usually trivial.
XV2 save files change between versions. They're also just arrays of structs [0] that don't properly identify themselves, so for some parts of them you're just guessing. Each chunk can also contain chunks, some of which are actually a network request to fetch more chunks from elsewhere in the codebase!
[0] Also encrypted before dumping to disk, but the keys have been known since about the second release, and they've never switched them.
>Most extensions have three characters, which means the search space is pretty crowded. You may want to consider using four letters.
Is there a reason not to use a lot more characters? If your application's name is MustacheMingle, call the file foo.mustachemingle instead of foo.mumi?
This will decrease the probability of collision to almost zero. I am unaware of any operating systems that don't allow it, and it will be 100% clear to the user which application the file belongs to.
It will be less aesthetically pleasing than a shorter extension, but that's probably mainly a matter of habit. We're just not used to longer file name extensions.
Any reason why this is a bad idea?
A 14-character extension might cause UX issues in desktop environments and file managers, where screen real estate per directory entry is usually very limited.
When under pixel pressure, a graphical file manager might choose to prioritize displaying the file extension and truncate only the base filename. This would help the user identify file formats. However, the longer the extension, the less space remains for the base name. So a low-entropy file extension with too many characters can contribute to poor UX.
> it will be 100% clear to the user which application the file belongs to.
The most popular operating system hides it from the user, so clarity would not improve in that case. At least one other (Linux) doesn't really use "extensions" and instead relies on magic headers inside the files to determine the format.
Otherwise I think the decision is largely aesthetic. If you value absolute clarity, then I don't see any reason it won't work, it'll just be a little "ugly".
It's prone to get cut off in UIs with dedicated columns for file extensions.
As you say, it's unconventional and therefore risks not being immediately recognized as a file extension.
On the other hand, Java uses .properties as a file extension, so there is some precedent.
You could go the whole Java way then: foo.com.apache.mustachemingle
> Any reason why this is a bad idea
The focus should be on the name, not on the extension.
For archive formats, or anything that has a table of contents or an index, consider putting the index at the end of the file so that you can append to it without moving a lot of data around. This also allows for easy concatenation.
What probably allows for even easier concatenation would be to store the header of each file immediately preceding that file's data. You can build an index in memory when reading the file if that is helpful for your use.
Why not put it at the beginning, so that it is available at the start of the file stream? That way it is easier to get first, so you know which other ranges of the file you may need.
>This also allows for easy concatenation.
How would it be easier than putting it at the front?
That wouldn't support partial parsing.
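For concreteness, the index-at-end layout under discussion can be sketched with a ZIP-style trailer (a hypothetical layout: blobs, then a JSON index, then a fixed-size footer holding the index offset). A reader seeks to the footer first and jumps to the index; a writer can append new blobs and rewrite only the tail, and two archives can be concatenated by rewriting only the second one's index offsets:

```python
import io, json, struct

def write_archive(entries: dict) -> bytes:
    # Layout (hypothetical): [blob ...][JSON index][8-byte LE index offset]
    buf = io.BytesIO()
    index = {}
    for name, blob in entries.items():
        index[name] = (buf.tell(), len(blob))  # (offset, size)
        buf.write(blob)
    index_off = buf.tell()
    buf.write(json.dumps(index).encode())
    buf.write(struct.pack("<Q", index_off))
    return buf.getvalue()

def read_index(data: bytes) -> dict:
    # Read the fixed-size footer first, then jump straight to the index.
    (index_off,) = struct.unpack("<Q", data[-8:])
    return json.loads(data[index_off:-8])

archive = write_archive({"a.txt": b"hello", "b.txt": b"world"})
assert read_index(archive)["b.txt"] == [5, 5]
```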
Using SQLite as a container format is only beneficial when the file format itself is a composite, like word processor files which will include both the textual data and any attachments. SQLite is just a hindrance otherwise, as with image file formats or archival/compressed file formats [1].
[1] SQLite's own sqlar format is a bad idea for this reason.
At that point, you're asking for a filesystem inside of a file. And you can literally do exactly that with a filesystem library (FAT32, etc).
Consider the DER format. Partial parsing is possible; you can easily ignore any part of the file that you do not care about, since the framing is consistent. Additionally, it works like the "chunked" formats mentioned in the article: one of the bits of the header indicates whether the element contains other chunks or raw data. (Furthermore, I made up a text-based format called TER which is intended to be converted to DER. TER is not intended to be used directly; it is only intended to be converted to DER for use in other programs. I had also made up some additional data types, and one of these (called ASN1_IDENTIFIED_DATA) can be used for identifying the format of a file, which might conform to multiple formats; it allows that too.)
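The consistent framing being praised here is DER's tag-length-value structure. A minimal walker (a sketch: single-byte tags only, short- and long-form lengths) shows why skipping parts you don't care about is easy:

```python
def iter_tlv(data: bytes, pos: int = 0):
    # Minimal DER walker (sketch): one-byte tags, short/long-form lengths.
    while pos < len(data):
        tag = data[pos]
        length = data[pos + 1]
        pos += 2
        if length & 0x80:  # long form: low bits = byte count of the length
            n = length & 0x7F
            length = int.from_bytes(data[pos:pos + n], "big")
            pos += n
        yield tag, data[pos:pos + length]  # skipping = not recursing
        pos += length

# A SEQUENCE (0x30, constructed bit 0x20 set) holding an INTEGER 5
# and an OCTET STRING we could simply ignore.
der = bytes([0x30, 0x07, 0x02, 0x01, 0x05, 0x04, 0x02, 0xAA, 0xBB])
(tag, body), = iter_tlv(der)
assert tag == 0x30
assert [t for t, _ in iter_tlv(body)] == [0x02, 0x04]
```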
I dislike JSON and some other modern formats (even binary formats); they often are just not as good in my opinion. One problem is they tend to insist on using Unicode, and/or on other things (e.g. 32-bit integers where you might need 64-bits). When using a text-based format where binary would do better, it can also be inefficient especially if binary data is included within the text as well, especially if the format does not indicate that it is meant to represent binary data.
However, even if you use an existing format, you should avoid using it badly; using existing formats badly seems to be common. There is also the question of whether the existing format is actually good or not; many formats are not, for various reasons (some of which I mentioned above, but there are others, depending on the application).
About target hardware, not all software is intended for a specific target hardware, although some is.
For compression, another consideration is that you can use a general-purpose compression scheme, or make up a compression scheme that is specific to the kind of data being compressed.
They also mention file names. However, this can also depend on the target system; e.g. for DOS you will be limited to three characters after the dot. Also, some programs would not need to care about file names in some or all cases (many programs I write don't care about file names).
Compression: For anything that ends up large it's probably desired. Though consider both algorithm and 'strength' based on the use case carefully. Even a simple algorithm might make things faster when it comes time to transfer or write to permanent storage. A high cost search to squeeze out yet more redundancy is probably worth it if something will be copied and/or decompressed many times, but might not be worth it for that locally compiled kernel you'll boot at most 10 times before replacing it with another.
Agreed on that one. With a nice file format, streamable is hopefully just a matter of ordering things appropriately once you know the sizes of the individual chunks. You want to write the index last, but you want to read it first. Perhaps you want the most influential values first if you're building something progressive (level-of-detail split.)
Similar is the discussion of delimited fields vs. length prefix. Delimited fields are nicer to write, but length prefixed fields are nicer to read. I think most new formats use length prefixes, so I'd start there. I wrote a blog post about combining the value and length into a VLI that also handles floating point and bit/byte strings: https://tommie.github.io/a/2024/06/small-encoding
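As a baseline for the length-prefix approach, here is a standard LEB128-style varint (the linked post combines value and length more cleverly; this is just the plain version, shown to illustrate why length-prefixed reads need no delimiter scanning or escaping):

```python
def encode_varint(n: int) -> bytes:
    # LEB128-style: 7 payload bits per byte, high bit = "more bytes follow".
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def decode_varint(data: bytes, pos: int = 0):
    n = shift = 0
    while True:
        b = data[pos]
        pos += 1
        n |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            return n, pos  # value and position just past it

def write_field(payload: bytes) -> bytes:
    # Length prefix: the reader learns up front how many bytes to take.
    return encode_varint(len(payload)) + payload

assert encode_varint(300) == b"\xac\x02"
assert decode_varint(b"\xac\x02") == (300, 2)
```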
Designing your file (and data) formats well is important.
“Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.”
— Fred Brooks
If your data format contains multiple streams inside, consider ZIP for the container. Enables standard tools, and libraries available in all languages. The compression support is built-in but optional, can be enabled selectively for different entries.
The approach is widely used in practice. MS Office files, Java binaries (JARs), iOS app store binaries, Android binaries, and EPUB books all use the ZIP container format.
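A minimal sketch of such a container (the entry names and MIME type are made up; the uncompressed `mimetype` entry mirrors EPUB's trick of keeping the identifying entry readable without decompression, while other entries opt in to compression individually):

```python
import io, zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    # Stored (uncompressed) so tools can identify the format cheaply.
    z.writestr("mimetype", "application/x-example",
               compress_type=zipfile.ZIP_STORED)
    # Bulky payload entries get compression selectively.
    z.writestr("doc.xml", "<doc/>" * 1000,
               compress_type=zipfile.ZIP_DEFLATED)

with zipfile.ZipFile(buf) as z:
    assert z.read("mimetype") == b"application/x-example"
```

Any standard unzip tool can then open, inspect, and repair such files, which is much of the appeal.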
I've had to do just that to retrofit features I wasn't allowed to think about up front (we must get the product out the door.... we'll cross that bridge when we get to it)
iNES file format is guilty of badly designed bit packing. Four flags were packed into the lower 4 bits, then Mapper Number was assigned to the high 4 bits. But then they needed more than 16 mappers. They used 4 high bits of the next byte to store the remaining 4 bits, and that was enough... until they needed over 256 mappers.
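The scheme described above is easy to see in code (flags byte 6 carries the mapper's low nibble in its top four bits, and byte 7's top four bits carry the retrofitted high nibble):

```python
def ines_mapper(flags6: int, flags7: int) -> int:
    # iNES 1.0: bits 4-7 of byte 6 = mapper low nibble,
    # bits 4-7 of byte 7 = mapper high nibble (added later).
    return (flags7 & 0xF0) | (flags6 >> 4)

# Mapper 66 = 0x42: high nibble 4 in flags7, low nibble 2 in flags6.
assert ines_mapper(0x20, 0x40) == 66
```

With only 8 bits total, the ceiling of 255 mappers was baked in, which is exactly the overflow the comment describes.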
> However, it's cleaner to have a field in your header that states where the first sub-chunk starts; that way you can expand your header as much as you like in future versions, with old code being able to ignore those fields and jump to the good stuff.
That’s assuming that parsers will honor this, and not just use the fixed offset that worked for the past ten years. This has happened often enough in the past.
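The quoted advice, as a sketch (the header layout here is hypothetical); the crucial habit is that readers jump via the recorded offset rather than a hard-coded `HEADER.size`, so a future version can grow the header without breaking old readers:

```python
import struct

HEADER = struct.Struct("<4sHHI")  # magic, version, reserved, data_offset

def parse(data: bytes):
    magic, version, _reserved, data_off = HEADER.unpack_from(data)
    # Use the offset stored in the file, not the struct's own size.
    return version, data[data_off:]

# A "v2" file with 4 extra header bytes an old reader simply skips.
blob = HEADER.pack(b"DEMO", 2, 0, HEADER.size + 4) + b"\0\0\0\0" + b"payload"
assert parse(blob) == (2, b"payload")
```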
Also you should consider the context in which you are developing. Often there are "standard" tools and methods to deal with the kind of data you want to store.
E.g. if you are interested in storing significant amounts of structured floating point data, choosing something like HDF5 will not only make your life easier, it will also make it easy to communicate what you have done to others.
https://www.sqlite.org/appfileformat.html
If you find yourself building a file format, you should read this page carefully and make sure that you have very good arguments for why it does not apply to you.
The "Chunk your binaries" point is spot on. Creating a huge binary blob that contains everything makes it hard to work with in constrained environments.
Also, +1 for "Document your format". More like "Document everything". Future you will thank you for it for sure.
Thinking about a file format is a good way to clarify your vision. Even if you don’t want to facilitate interop, you’d get some benefits for free—if you can encapsulate the state of a particular thing that the user is working on, you could, for example, easily restore their work when they return, etc.
Some cop-out (not necessarily in a bad way) file formats:
1. Don’t have a file format, just specify a directory layout instead. Example: CinemaDNG. Throw a bunch of particularly named DNGs (a file for each frame of the footage) in a directory, maybe add some metadata file or a marker, and you’re good. Compared to the likes of CRAW or BRAW, you lose in compression, but gain in interop.
2. Just dump runtime data. Example: Mnemosyne’s old format. Do you use Python? Just dump your state as a Python pickle. (Con: dependency on a particular runtime, good luck rewriting it in Rust.)
3. Almost dump runtime data. Example: Anki, newer Mnemosyne with their SQLite dumps. (Something suggests to me that they might be using SQLite at runtime.) A step up from a pickle in terms of interop, somewhat opens yourself (but also others) to alternative implementations, at least in any runtime that has the means to read SQLite. I hope if you use this you don’t think that the presence of SQL schema makes the format self-documenting.
4. One or more of the above, except also zip or tar it up. Example: VCV, Anki.
About 1, directory of files, many formats these days are just a bunch of files in a ZIP. One thing most applications lack unfortunately is a way to instead just read and write the part files from/to a directory. For one thing it makes it much better for version control, but also just easier to access in general when experimenting. I don't understand why this is not more common, since as a developer it is much more fun to debug things when each thing is its own file rather than an entry in an archive. Most times it is also trivial to support both, since any API for accessing directory entries will be close to 1:1 to an API for accessing ZIP entries anyway.
When editing a file locally I would prefer to just have it split up in a directory 99% of the time, only exporting to a ZIP to publish it.
Of course it is trivial to write wrapper scripts to keep zipping and unzipping files, and I have done that, but it does feel a bit hacky and should be an unnecessary extra step.
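Supporting both layouts is indeed near-trivial when all reads go through one helper; a sketch (names are made up):

```python
import pathlib, zipfile

def read_entry(root, name: str) -> bytes:
    # Same logical document, two physical layouts: a directory of part
    # files (nice for version control and debugging) or one ZIP (nice
    # to ship). Callers never know which they got.
    p = pathlib.Path(root)
    if p.is_dir():
        return (p / name).read_bytes()
    with zipfile.ZipFile(p) as z:
        return z.read(name)
```

A quick check that both branches agree:

```python
import tempfile

with tempfile.TemporaryDirectory() as d:
    doc = pathlib.Path(d)
    (doc / "part.txt").write_bytes(b"hi")
    zpath = doc / "doc.zip"
    with zipfile.ZipFile(zpath, "w") as z:
        z.writestr("part.txt", "hi")
    assert read_entry(doc, "part.txt") == read_entry(zpath, "part.txt") == b"hi"
```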
> 2. Just dump runtime data. Example: Mnemosyne’s old format. Do you use Python? Just dump your state as a Python pickle. (Con: dependency on a particular runtime, good luck rewriting it in Rust.)
Be particularly careful with this one as it can potentially vastly expand the attack surface of your program. Not that you shouldn't ever do it, just make sure the deserializer doesn't accept objects/values outside of your spec.
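One standard mitigation in Python is an allow-listing unpickler, following the pattern documented in the `pickle` module itself (the allowed set below is a made-up example; list only the types your format actually needs):

```python
import io, pickle

class SafeUnpickler(pickle.Unpickler):
    # Hypothetical allow-list: only globals your spec actually uses.
    ALLOWED = {("builtins", "dict"), ("builtins", "list"), ("builtins", "str")}

    def find_class(self, module, name):
        if (module, name) not in self.ALLOWED:
            raise pickle.UnpicklingError(f"forbidden global: {module}.{name}")
        return super().find_class(module, name)

data = pickle.dumps({"cards": ["a", "b"]})
assert SafeUnpickler(io.BytesIO(data)).load() == {"cards": ["a", "b"]}
```

Anything that tries to smuggle in an arbitrary class or callable is rejected before it can execute.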