Also, we’re getting things that can seek a lot faster than disks: Flash Storage.
NAND FTLs are by necessity log-structured because of the nature of the medium: pages can only be programmed in sequential order in each block, only entire blocks can be erased at once, and ideally you want to evenly use all blocks even when you're just updating a single (logical) sector repeatedly.
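As a toy illustration of why an FTL ends up log-structured, here's a hypothetical mapping-table sketch (the class and names are invented for illustration, not any real FTL): every logical-sector write lands on the next free page in program order, and the table redirects reads to the newest copy, so repeatedly updating one logical sector consumes fresh pages rather than rewriting in place.

```python
# Toy model of a log-structured FTL: updates always go out of place to
# the next free page, and pages within a block are programmed in order.
# (Sketch only -- real FTLs also do wear leveling, garbage collection,
# and power-fail recovery of the mapping table.)

PAGES_PER_BLOCK = 4

class ToyFTL:
    def __init__(self, num_blocks):
        self.flash = [[None] * PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.mapping = {}      # logical sector -> (block, page) of newest copy
        self.next = (0, 0)     # next free (block, page), advances sequentially

    def write(self, sector, data):
        block, page = self.next
        self.flash[block][page] = (sector, data)  # program next free page
        self.mapping[sector] = (block, page)      # old copy becomes stale
        self.next = (block, page + 1) if page + 1 < PAGES_PER_BLOCK \
                    else (block + 1, 0)

    def read(self, sector):
        block, page = self.mapping[sector]
        return self.flash[block][page][1]

ftl = ToyFTL(num_blocks=8)
for i in range(5):
    ftl.write(0, f"version {i}")   # keep updating the same logical sector
print(ftl.read(0))   # -> "version 4": reads follow the mapping to the newest copy
print(ftl.next)      # -> (1, 1): five distinct pages were consumed
```

The stale copies left behind are exactly what garbage collection later reclaims by erasing whole blocks.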
NILFS2 is upstream in Linux - the log structure means snapshots happen for free on every disk operation, so you can get point-in-time rollback without needing to manually create disk snapshots.
It does require a userspace daemon running to compact the log, though.
Current file systems are impressive - flexible, robust, close to the hardware's performance. But I'm disappointed that we are still using such low-level models for our day-to-day computing. Files = everything is an array of bytes, and every program/library has to interpret and manage those bytes "manually", individually, and slightly differently from other programs!
It's understandable to use "files" when running retro apps, but it's way past time that a high level model rendered the concept of files obsolete.
(I can be hopeful, but I hold no real hope for such better models. Too many backwards-compatible apps, and too much depends on our existing code.)
I think the simplicity and flexibility and lack of overall framework is the benefit. Dead simple bytes that may or may not be arranged in a way that works with the program you’re trying to open them with. Then build the relational model on top of it.
Git’s now out of style and we’re onto ____ but my storage is identical. I used to use flickr but now I dump directly to s3 and my jpgs are indistinguishable.
Especially so that some consortium of tech companies doesn't come up with the next-gen db/fs with bolt-on features no one's asking for, and telemetry to "improve your file recall experience". Or logging into my fs because I need customization. For instance, any modern web app is built with overkill tech that adds complexity because at certain scales that complexity is necessary.
Give me trees of utf-8 encoded flat files any day. Not nested object relational models of stuff that ages faster than milk.
There are roughly three ways you can look at files.
The first is the traditional way: a file is a bag of bytes. Operating systems could do a better job of handling bags of bytes (really, they should default to making sure that the bags of bytes are updated atomically--you either see only the old bag of bytes or the new bag of bytes, never a weird mixture of both), but this is the fundamental view that most APIs tend to expose.
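A common userspace approximation of that atomicity is the write-temp-then-rename pattern: `os.replace` maps to POSIX rename, which swaps the name atomically, so readers see only the old bag of bytes or the new one. A minimal sketch (the file name and payload are just examples):

```python
import os
import tempfile

def atomic_write(path, data: bytes):
    """Replace `path` so readers see either the old or the new bytes,
    never a mixture: write a temp file in the same directory, flush it
    to disk, then atomically rename it over the original."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # make the new bytes durable before the swap
        os.replace(tmp, path)      # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp)             # clean up the temp file on any failure
        raise

atomic_write("config.json", b'{"version": 2}')
```

The temp file must live in the same directory because rename is only atomic within a filesystem.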
The second is that a file is a collection of fixed-size blocks, stored at not-necessarily-contiguous offsets. This is where something like mmap comes into play, or sparse files. A lot of higher-level formats actually tend to be built on this model, and this tends to be how the underlying storage thinks of files.
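You can see the block-oriented view directly with a sparse file: on filesystems that support holes (ext4, XFS, and most others), blocks that were never written consume no storage. A small demonstration, with the caveat that `st_blocks` reporting varies by filesystem:

```python
import os

# A sparse file makes the "collection of fixed-size blocks at
# not-necessarily-contiguous offsets" view visible: seeking past the
# end and writing leaves a hole that occupies no physical blocks.
with open("sparse.bin", "wb") as f:
    f.seek(1 << 20)        # skip 1 MiB without writing anything
    f.write(b"end")        # only the final block is actually allocated

st = os.stat("sparse.bin")
print(st.st_size)            # logical size: 1048579 bytes
print(st.st_blocks * 512)    # physical allocation: typically far smaller
```

Reads from the hole return zeros, synthesized by the filesystem rather than fetched from storage.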
The third is that a file is a collection of data structures. It's tempting to think that the OS should expose this view of files natively in its API, but this turns out to be a really bad idea. If you limit it to well-supportable primitives, it's too simple for many applications, so they need to build their own serialization logic anyways. Cast too wide a net, and now applications have to worry about representing things they can't support. Or you take a third option and have a full serialization/deserialization framework that allows custom pluggable things, which is a ticking time bomb for security.
The "stream of bytes" model is what led to easy data interchange and interoperability. There were plenty of proprietary "structured file" schemes invented in the past, but (fortunately) none of them seems to have become widespread.
I agree that where we are now is bad, but I also think files could be an answer too.
What we saw in 9p was a file orientation as well, but files were much smaller grained structures. We can see various kernel interfaces like /proc and /sys where we have file structures representing bigger objects too.
Rather than use the file system structure, apps have been creating their own structures within files. This obstructs homogenous user access to the data!
If we could start to access finer-grained data, start to have objects as file-system trees, I think a lot of progress could be made in computing, especially vis-à-vis the rifts in human-computer interaction. It would give us leverage to see & work with the data, broadly, rather than facing endless different opaque streams of bytes.
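For a flavor of what "objects as file-system trees" might look like, here's a small sketch (the `write_tree` helper and the sample object are invented for illustration) that exposes a nested structure as a /proc-style directory of UTF-8 files, so generic tools can browse the fields instead of parsing an opaque byte stream:

```python
import os

def write_tree(obj, root):
    """Expose a nested dict as a directory tree of small UTF-8 files,
    /proc-style: dicts become directories, leaves become files."""
    os.makedirs(root, exist_ok=True)
    for key, value in obj.items():
        path = os.path.join(root, str(key))
        if isinstance(value, dict):
            write_tree(value, path)       # nested object -> subdirectory
        else:
            with open(path, "w", encoding="utf-8") as f:
                f.write(str(value))       # leaf field -> tiny flat file

song = {"title": "Example", "tags": {"genre": "jazz", "year": 1959}}
write_tree(song, "song")
# Now `cat song/tags/genre` shows "jazz" -- no format-specific parser needed.
```

The point is homogeneous access: ls, cat, and grep all work on every field without any application-specific code.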
I think the closest thing to what you are looking for is SQLite.
It is basically designed to be an fopen replacement. It is designed to be robust. The relational model is very flexible. It provides great interoperability and backwards compatibility.
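A minimal sketch of that fopen-replacement usage via Python's bundled sqlite3 module (the table layout here is invented for illustration):

```python
import sqlite3

# SQLite as an fopen() replacement: one robust, queryable, portable file
# instead of a hand-rolled binary format.
con = sqlite3.connect("notes.db")
con.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")

with con:   # transaction: commits on success, rolls back on an exception
    con.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))

rows = con.execute("SELECT body FROM notes").fetchall()
print(rows)
con.close()
```

The transactional `with con:` block is where the robustness comes from: a crash mid-update leaves the file in its previous consistent state, which is exactly the atomicity most hand-rolled formats never get right.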
Most technology is able to do useful things by building layers of simpler things.
Files are not sequences of bytes in day-to-day computing. They are videos, or databases, or applications. Actually, a lot of the time while you're doing your day-to-day computing, thousands of files are being accessed and you wouldn't even know it.
This has attracted a lot of flak, but you can see from actual usage that "S3 blob" is a not-quite-filesystem API that people actually use. Given all the latency and mutability tradeoffs, it might be useful to have something that sits on the PCIe bus and speaks Blob.
this is the same thinking that gave us the 'advanced intelligent network'
current ip networks are impressive - flexible, robust, close to line speed. but i'm disappointed that we are still using such low level models for our day to day computing. tcp/ip = everything is a sequence of packets and every computer has to interpret and manage those packets, 'manually', individually, and slightly differently than other computers do!
it's understandable to use 'packets' when running retro apps, but it's way past time that a high-level model rendered the concept of packets obsolete
that's not a quote from a pre-stupid-network bellhead 25 years ago but it could have been
or the intel iapx432
current cpu architectures are impressive - flexible, robust, with impressive performance. but i'm disappointed that we are still using such low level models for our day to day computing. 8086 = everything is a sequence of computations on 16-bit integers and every program/library has to interpret and manage those 16-bit integers, 'manually', individually, and slightly differently than other programs do!
it's understandable to use '16-bit words' when running retro apps, but it's way past time that a high-level model rendered the concept of untyped words obsolete
in fact file storage forms the same sort of nexus as the rest of the posix system call interface, the 8086 instruction set, ip packets, bytes, and dollars: many things can store files fairly efficiently, and many things can use them for many different purposes, and the nexus permits those things to evolve independently with minimal coupling to one another
(there are many ways the posix concept of files could be improved, which is also true of 8086)
if we want to replace files with a better storage interface, it should probably be something dumber rather than something smarter
> (I can be hopeful, but I hold no real hope for such better models. Too many backwards-compatible apps, and too much depends on our existing code.)
I see it that we (you, me, almost all programmers) are so practiced at the "file" way of thinking, that we genuinely struggle to look far beyond that paradigm. We see the advantages of "files" but have no experience with much else, so we struggle to make comparisons.
The article claims garbage collection was invented in the JVM! I wonder what that old DEC-20 was doing when it reported to all terminals that garbage collection was ongoing...
Or what the mark/release garden of eden model of Smalltalk was...
Speaking as a guy who's done enterprise storage for close to 30 years, the main issue here is IO stack integration. There's almost none. There are people like Oracle that try to bypass at least some of these disconnected layers that don't work well together, but why don't the drive vendors do this? Intel makes a compiler for their CPUs. Why isn't there a WDFS that has built-in LVM?
Here's the main issue. You have your application that sits on a filesystem. The filesystem tries to predict what the application is doing. That sits on a volume manager. That's just a dumb table of pointers. That sits on top of a disk drive, talking to the RAM on the drive. Then you have the backend of the disk controller trying to predict what to put in RAM.
Oracle knows best what it's going to need next from the disk, based on some query it's running, if it expects a drop in IO soon where the disk can do background cleanup, if and when it's about to do a lot of reads once it's done with a lot of writes, in 3 minutes. The filesystem has no idea. The disk controller has no idea. Wouldn't it be great, more performant, and less wasteful, if the application could tell the disk drive about its behavior using some sort of standard API, and the disk controller could translate that to what the backend disk should do - whether it's the various types of spinning rust or different flash types?
TRIM is a very basic example of that. What we need is more things like TRIM that let application IO libraries tell their intent to the backend controller; that API belongs in the filesystem, which should just blindly pass it on all the way to the backend.
> why don't the drive vendors do this? Intel makes a compiler for their CPUs. Why isn't there a WDFS that has built-in LVM?
Given the quality of firmware in RAID controllers and disk drives and... er, everything, actually, I would really rather that they do as little as possible, unless they're going to make the firmware open source so we can fix the bugs.
This is a very general issue in computing. You could make many of the same arguments about a web app running on a computer and all the involved modules (graphics, networking, JS VM, app code itself, etc). We have abstractions and interfaces that enforce separation of concerns, which give us many desirable properties, but at the same time there's an attraction, especially to gain performance in exchange for modularity, to commit some "layering violations" that take advantage of knowledge of the unexposed internals of other modules.
I think one way around it, and to have our cake and eat it too, would be to enable some whole-system program transformations, a bit like what unikernels have started nibbling at the edges of.
I did some searching & am a bit shocked: I couldn't find any way to adjust io priority other than by altering the entire process's io level. I would have thought this would be a semi-commonly used routine to make high/regular/low priority QoS for io, but indeed, per your claims, I can't seem to find anything.
Hypothetically one could maybe spawn a bunch of child processes and give them each their own io priority? Maybe io priority is sticky, and one can change io priority just before doing io work, and the io priority for that work would stay when one changes the io priority before the next operation?
I feel like we have a bunch of possible ways we could do better QoS with the controls we have.
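For what it's worth, here's a hedged sketch of the per-process knob that does exist on Linux: the raw ioprio_set syscall (glibc provides no wrapper). The syscall number below is specific to x86_64, other architectures differ, and as noted above, the priority attaches to the calling thread, not to individual operations:

```python
import ctypes

# Best-effort sketch, Linux/x86_64 only: set the io priority of the
# calling thread via the raw ioprio_set syscall. This is the
# whole-thread granularity limitation discussed above -- there is no
# per-operation variant.
libc = ctypes.CDLL(None, use_errno=True)

IOPRIO_WHO_PROCESS = 1     # target a process/thread (pid 0 = calling thread)
IOPRIO_CLASS_BE = 2        # "best effort" class, levels 0 (high) .. 7 (low)
SYS_ioprio_set = 251       # x86_64 syscall number; architecture-specific

def set_io_priority(level: int) -> None:
    ioprio = (IOPRIO_CLASS_BE << 13) | level   # class in the top bits
    if libc.syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, ioprio) != 0:
        raise OSError(ctypes.get_errno(), "ioprio_set failed")

# Deprioritize our own io before a bulk background pass, e.g. a backup.
set_io_priority(7)
```

The "sticky priority, flipped before each operation" idea above would amount to calling this around each io batch, at the cost of a syscall per switch.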
There are also a variety of madvise hints we can provide, telling the kernel what we will need, what to drop, what we won't need, and what will be random access (no benefit from readahead) vs. sequential. These are already some pretty useful knobs, which I'd guess are quite broadly underused.
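Python's stdlib happens to wrap the file-descriptor sibling of these hints, posix_fadvise, directly. A small sketch of the scan-sequentially-then-drop pattern (the file is a stand-in created just for the demo); these are hints only, and the kernel is free to ignore them:

```python
import os

# Create a demo file to scan.
with open("big_input.dat", "wb") as f:
    f.write(os.urandom(1 << 20))

fd = os.open("big_input.dat", os.O_RDONLY)

# "We'll read this front to back" -> kernel may increase readahead.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

data = os.read(fd, 1 << 20)    # ...the actual scan...

# "We're done with these pages" -> kernel may evict them, so a one-shot
# bulk scan doesn't push everyone else's data out of the page cache.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
os.close(fd)
print(len(data))
```

For mapped memory, `mmap.mmap.madvise` exposes the analogous MADV_* hints.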
NVMe devices can support multiple namespaces and each namespace is assigned a specific command set upon creation, normally NVM with LBA. But there's also a key-value command set. I'd expect NVMe KV-enabled devices to directly use their FTL for the mapping.
Zoned namespaces provide a "trimless" future, as zones are allocated explicitly, written sequentially and must be released explicitly by the host.
edit: I've worked on ACID stuff before and another thing that's kinda annoying is how poorly FS APIs line up with both what you want for ACID databases and how the hardware works. FS APIs are "flush/sync" oriented, somewhere between device and byte-range/sector granularity. Log-structured databases, which is most of 'em (page-oriented RDBMS with WAL are effectively log-structured), don't need or care about that, it's just an additional complication. They really only need barriers. Hardware also has barriers, at least on paper. FTLs in SSDs provide barriers for free almost by definition; writes go to fresh NAND, but they're only visible once the log entry in the FTL is persisted. Writes between FTL flushes can be reordered any way, doesn't matter, if power fails all of them are either gone or visible.
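A minimal sketch of that barrier discipline, using fsync as the coarse stand-in the FS API actually offers: everything appended before the barrier is committed, ordering among those writes doesn't matter, and a per-record checksum lets recovery detect and discard a torn tail. The record format here is invented for illustration:

```python
import os
import zlib

# Minimal append-only log. A record is only considered committed once
# the fsync (our stand-in for a barrier) after it completes; if power
# fails earlier, a partial record at the tail fails its CRC check and
# is discarded on recovery.
def append_record(f, payload: bytes):
    rec = (len(payload).to_bytes(4, "little")
           + zlib.crc32(payload).to_bytes(4, "little")
           + payload)
    f.write(rec)

with open("wal.log", "ab") as log:
    append_record(log, b"insert k1=v1")
    append_record(log, b"insert k2=v2")
    log.flush()
    os.fsync(log.fileno())   # barrier: both records above are now durable
```

Note the mismatch the comment describes: the log only needs "nothing after the barrier becomes visible before anything preceding it", but fsync forces an immediate full flush to get that ordering.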
I think that’s for the same reason most OS schedulers don’t have functionality for applications to tell them such things as “this program needs m MB RAM, s seconds of a standard CPU, doesn’t use vector instructions, will do r I/O reads and w writes to disk d and has to finish before 8 PM”: on general-purpose systems, it’s effectively an intractable system.
Also, even if the OS could compute an optimal schedule, that may not be so good that it makes up for time spent computing that schedule.
Isn't this also the same reason 'prosumer' storage hardware / off-the-shelf use of this stuff mostly doesn't exist? If the storage manufacturers dared to provide a low-level interface to the real hardware, without the traditional abstractions that keep things easy for Windows, they'd both get their lunch eaten (by everyone who moves their current excuse for market segmentation into the OS / database daemons) and take a loss at still serving the majority market share of dumb-as-bricks Windows, which lacks a mature VFS API other than NTFS (its de facto VFS API, which MS should just declare all new filesystems must implement, given the crushing weight of legacy).
Smart idea. There are a bunch of different basic strategies and policies that could be implemented in a series of weekend projects: ranged/extent reads and writes, upcoming allocations, locality-sensitive data, short-range vs. long-range data structures, access-frequency estimates, historical file-size estimates, etc.
This pairs well with microkernel architectures too. A separate FS policy manager service that is pluggable. You could write a dozen simple policies in a month and also shore up in terms of open-source defensive patents.
Or, if you're a commercial house and not worrying about day-to-day operations you could fill your patent portfolio.
Your writing made me think of the fact that purestorage is designing its own (flash-based) drives for its storage appliances… I wonder if they're doing what you're saying, in their own stack at least.
'it's done in the os so it's simple' is the same kind of cognitive error as 'it's done in the hardware so it's cheap' https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.ht... (though see https://blog.cr.yp.to/20190430-vectorize.html for some 02019 updates on the relative costs of things like dispatching and floating point)
actual good systems design amounts to more than 'move the problem somewhere where i don't understand what's involved in solving it anymore'