item 36736723

Data-Oriented Design Principles

102 points | ingve | 2 years ago | data-oriented.design

51 comments

[+] jebarker|2 years ago|reply
This might be an unpopular opinion, but I feel like these DOD posts should always mention that the ideas are borne out of experience in game engine development and don't necessarily easily apply in other domains.

In some general sense the principles are applicable everywhere. But trying to actually apply them in some domains is really difficult. For example, I work on developing a large scale ML training framework and I haven't seen an example where avoiding the use of objects and inheritance leads to better code. Having said that, the principles certainly apply lower down the stack when talking about implementing the GPU kernels etc on the hot path.

[+] flohofwoe|2 years ago|reply
> ...where avoiding the use of objects and inheritance leads to better code...

DOD isn't about "better code" but strictly about "faster code": it lets real-world hardware restrictions dictate how code accesses data (the way I wrote that may make it sound like a bad thing, but it really isn't). Writing software that doesn't fight the hardware it runs on can lead to simpler and easier-to-maintain code, but that's not guaranteed.
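A toy sketch of the layout shift this comment is describing: array-of-structs versus struct-of-arrays. All names here are invented, and in Python the cache benefit itself won't materialize, but the shape of the code does:

```python
from array import array

# Array-of-structs: each particle is a separate heap object, so the
# fields of different particles are scattered across memory.
class Particle:
    def __init__(self, x, y, vx, vy):
        self.x, self.y, self.vx, self.vy = x, y, vx, vy

def step_aos(particles, dt):
    for p in particles:
        p.x += p.vx * dt
        p.y += p.vy * dt

# Struct-of-arrays: each field lives in its own contiguous typed
# array -- the layout that lets hardware stream through memory.
class Particles:
    def __init__(self, n):
        self.x = array("d", [0.0] * n)
        self.y = array("d", [0.0] * n)
        self.vx = array("d", [1.0] * n)
        self.vy = array("d", [1.0] * n)

def step_soa(ps, dt):
    for i in range(len(ps.x)):
        ps.x[i] += ps.vx[i] * dt
        ps.y[i] += ps.vy[i] * dt
```

Both versions compute the same thing; the DOD argument is that the second is the one the memory hierarchy actually likes.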

[+] vkazanov|2 years ago|reply
Wouldn't it be nice if people always followed the serious whitepaper structure: abstract, summary, conclusion, measurements...

But this wouldn't work for programming language ideologies. Not for the OOP proponents, anyway.

The DOD crowd can at least demonstrate how CPUs love blasting through data laid out in arrays.

The FP crowd can at least explain how their languages can be formalised.

PS: ML is the most natural application of DOD principles, no?

[+] xiasongh|2 years ago|reply
I feel that DOD is about performance, not understandability and usability, so it makes sense that the frontend of a framework, where usability is the priority, doesn't benefit, while the backend and hot path do.
[+] js6i|2 years ago|reply
That would be fair if OO/FP posts had always mentioned that the ideas were borne out of academia and don't necessarily easily apply in other domains, but that ship has sailed...
[+] nightski|2 years ago|reply
I feel the biggest weakness of PyTorch is that everything is an object with inheritance. It makes no sense that ReLU is a class, for example. Seems like ML could take a few hints.
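The complaint can be made concrete with a plain-Python sketch (this is not actual PyTorch code; the names are illustrative). A stateless activation like ReLU carries no parameters, so a free function is all that's strictly needed; the class exists mainly so it composes with stateful layers behind one interface:

```python
def relu(xs):
    """ReLU as a plain function: max(0, x), element-wise."""
    return [x if x > 0 else 0 for x in xs]

# The object-oriented framing wraps the identical computation in a
# class instance, even though there is no state to hold.
class ReLU:
    def __call__(self, xs):
        return relu(xs)

layer = ReLU()   # an object must be constructed just to call relu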
[+] dkarl|2 years ago|reply
> Different problems require different solutions.

> If you have different data, you have a different problem.

What is unstated but perhaps implicit here is that you might have a different problem even if you have the same data, and this is why it is better not to bind functions and data together as in OOP.

Separating functions from data lets you organize them according to the functionality they provide, instead of grouping them all together with the data structure they operate on. When you bundle data and functions together into a class, it often grows into an oversized assortment of all the various operations you might want to do on that particular bundle of data. The OOP solution for splitting up a bloated kitchen sink class is decorators, which results in using multiple object instances, each with a different interface, to manipulate the same data. Instead of simply importing functions from a different module when you want extra functionality, you have to construct a new object instance to wrap your "basic" object and "enhance" it with extra methods.
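The contrast described above can be sketched with a small invented example: in the plain-data style, extra functionality is just another function imported from a module, while the decorator style requires constructing a new wrapper object with its own interface:

```python
# Plain-data style: the data is a simple structure, and any extra
# operation is a free function that takes it as an argument.
def word_count(doc):
    return len(doc["text"].split())

def shouted(doc):
    return doc["text"].upper()

# OOP decorator style: each extra capability wraps the object in
# another object, and callers must go through the wrapper.
class Document:
    def __init__(self, text):
        self.text = text

class CountingDocument:
    def __init__(self, doc):
        self._doc = doc

    def word_count(self):
        return len(self._doc.text.split())
```

In the first style adding a feature means adding a function; in the second it means adding a class and threading instances of it through the code.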

[+] StackOverlord|2 years ago|reply
As a Clojure programmer, this is not what I have in mind when I think of data-oriented-programming, but this:

Principle #1: Separating code (behavior) from data.

Principle #2: Representing data with generic data structures.

Principle #3: Treating data as immutable.

Principle #4: Separating data schema from data representation.

Source: https://blog.klipse.tech/dop/2022/06/22/principles-of-dop.ht...

I think using C++ gives a different twist to the meaning of data-oriented, mainly because with Lisps, code is data. As I read this "manifesto", it seems more focused on the data the program handles than on handling the program with data: in Clojure I often use data-oriented programming for programs that barely deal with any data at all. I tend to lay out what I call a "plan" that describes the computation that needs to be carried out. In some ways this is similar to a DSL, except that this "plan" won't run without also writing a "compiler" or "interpreter". If requirements suddenly change and you need to run your "plan" in a distributed way (or any other execution flavor you may think of), you just write another compiler.

Code being data, this is an approach you can take on code itself with macros, not just as a way to add behavior but to split different aspects of code: I once wrote a macro specifically for a block of complex code that I wanted to read without the clutter introduced by debug lines, so I moved this code in a macro that would add it back using a highly specific code-walker.

What is gained by introducing interfaces using data rather than an object system must be repaid when writing and maintaining those compilers.
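The "plan"-plus-"compiler" idea might look roughly like this in Python (in Clojure the plan would be plain maps and vectors; all names here are invented):

```python
# The "plan" is pure data describing a computation; it does nothing
# by itself until an interpreter is written for it.
plan = [
    {"op": "scale", "by": 2},
    {"op": "keep-if-over", "threshold": 2},
]

def run_local(plan, data):
    """One interpreter: execute the plan sequentially in-process."""
    for step in plan:
        if step["op"] == "scale":
            data = [x * step["by"] for x in data]
        elif step["op"] == "keep-if-over":
            data = [x for x in data if x > step["threshold"]]
    return data

# A second interpreter could ship the same plan to remote workers;
# here we only fake distribution by splitting the input into chunks.
def run_chunked(plan, data, chunks=2):
    size = max(1, len(data) // chunks)
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    out = []
    for part in parts:
        out.extend(run_local(plan, part))
    return out
```

The plan never changes; only the interpreter does, which is the "just write another compiler" move described above.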

[+] codethief|2 years ago|reply
> As a Clojure programmer, this is not what I have in mind when I think of data-oriented-programming, but this:

Yeah, that's more or less how I would have defined DOP, too.

Could it be that there is a slight difference in meaning when people refer to "data-oriented design" (OP) vs. "data-oriented programming"? At least that's been my (anecdotal) impression so far.

[+] oersted|2 years ago|reply
This is also consistent with how it is thought of in GameDev. The trendy ECS (Entity Component System) architecture is usually implemented with a data-oriented mindset to maximize cache utilization, make allocations/deallocations of many small entities easy and fast, and facilitate concurrency.
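A minimal ECS sketch, with invented names (a real engine would back each component store with contiguous typed arrays for exactly the cache reasons above): entities are just ids, components live in per-type stores, and a system iterates over the entities that have the components it needs:

```python
# Per-component stores, keyed by entity id. An "entity" is only an id.
positions = {}   # entity id -> (x, y)
velocities = {}  # entity id -> (vx, vy)

def spawn(eid, pos=None, vel=None):
    """Attach whichever components this entity has."""
    if pos is not None:
        positions[eid] = pos
    if vel is not None:
        velocities[eid] = vel

def movement_system(dt):
    """A system touches only entities that have both components."""
    for eid in positions.keys() & velocities.keys():
        x, y = positions[eid]
        vx, vy = velocities[eid]
        positions[eid] = (x + vx * dt, y + vy * dt)
```

Adding a capability means adding a store and a system, not growing an entity class; and because each system declares which stores it reads and writes, systems that touch disjoint data can run concurrently.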
[+] xgkickt|2 years ago|reply
AKA “Cache is King”.

(edit) Terje Mathisen's "almost all programming can be viewed as an exercise in caching" is often quoted amongst game developers, as it rings true with how we use hardware and pre-calculation of data.

[+] psfried|2 years ago|reply
In contrast to jebarker's comment, I actually think it's really interesting that a concept coming from game engine development actually seems quite applicable in some very different domains.

We (https://estuary.dev/) ended up arriving at a very similar design for transformations in streaming analytics pipelines: https://docs.estuary.dev/concepts/derivations/

To paraphrase, each derivation produces a collection of data by reading from one or more source collections (DOD calls these "streams"), optionally updating some internal state (sqlite), and emitting 0 or more documents to add to the collection. We've been experimenting with this paradigm for a few years now in various forms, and I've found it surprisingly capable and expressive. One nice property of this system is that every transform becomes testable by just providing an ordered list of inputs and expectations of outputs. Another nice property is that it's relatively easy to apply generic and broadly applicable scale-out strategies. For example, we support horizontal scaling using consistent hashing of some value(s) that's extracted from each input.
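A rough paraphrase of that shape in Python (this is not Estuary's actual API; names are invented, and plain hash partitioning stands in for the consistent hashing mentioned): a derivation reads source documents, updates internal state, and emits zero or more output documents, which makes it testable as inputs-in, outputs-out:

```python
import hashlib

def running_total(source_docs, state=None):
    """Derivation sketch: fold source docs through internal state,
    emitting one output document per input."""
    state = state if state is not None else {}
    out = []
    for doc in source_docs:
        key = doc["user"]
        state[key] = state.get(key, 0) + doc["amount"]
        out.append({"user": key, "total": state[key]})
    return out, state

def partition(key, n_shards):
    """Route each input to a shard by hashing an extracted key,
    the scale-out strategy described above (simplified)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards
```

Because all inputs for a given key land on the same shard, each shard's state stays self-contained, which is what makes the horizontal scaling generic.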

Putting it all together, it's not hard to imagine building real-world web applications using this. Our system is more focused on analytics pipelines, so you probably don't want to build a whole application out of Flow derivations. But it would be really interesting to see a more generic DOD-based web application platform, as I'd bet it could be quite a nice way to build web apps.

[+] vjust|2 years ago|reply
While data is often associated with SQL, there is a world of data separate from SQL. SQL is in many cases a standard way to persist and work on data, but it doesn't span the lifetime/journey of data. Software systems are built on classes, functions and structs, caches, interfaces, buffers. It's common to have hierarchical relationships between data objects. SQL doesn't naturally handle hierarchy, despite the fact that it has syntax to do so.

As a software engineer, data modeler, and data engineer, I find DoD a weird label that gets applied to anything. I've had decades of experience with SQL but don't gravitate towards it. Reality is messy.

[+] ticklemyelmo|2 years ago|reply
Even with SQL we recognize that while aggregate roots and ORMs are great for storing and editing single items, they're terrible for use cases where the data is _used_, and you're better off using different query mechanisms to slice it in a better way. Indexing services, caches, and transformation pipelines continue that trend.

I think CQRS is really just a glimmer of DoD in the enterprise world, the recognition that the system of record is generally a terrible resource for actually using the data, and that you need to rethink everything again if you want a performant system.
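A minimal illustration of that CQRS shape (names invented): the system of record only accepts writes, and separate read models project the same data into shapes optimized for how it will actually be queried:

```python
# System of record: an append-only log of what happened. It is
# authoritative but an awkward shape to query directly.
events = []

def place_order(order_id, customer, total):
    events.append({"order": order_id, "customer": customer, "total": total})

def project_totals_by_customer():
    """Read model: a denormalized projection rebuilt from the record,
    shaped for one specific query."""
    totals = {}
    for e in events:
        totals[e["customer"]] = totals.get(e["customer"], 0) + e["total"]
    return totals
```

The write side never bends to serve reads; when a new query shows up, you add another projection rather than contorting the system of record.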

[+] bob1029|2 years ago|reply
I 2000% agree with everything in here except this line:

> If you don’t understand the hardware, you can’t reason about the cost of solving the problem.

For me, "Data-oriented Design" mostly implies "try to use SQL-shaped things". In this context, understanding of the hardware is like seeing another side of the query planner blackbox. You should have a general sense of how many raw bytes you can move in a serialized manner based upon your machines & networks, but I feel trying to directly understand how exactly every H/W resource will be utilized goes against these principles at a certain point.

Most offerings of SQL provide exceptionally-powerful tools that attempt to answer questions like "how long might this query take to run based upon history and/or projected growth?". Is this not "reasoning about the cost of solving the problem"?
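SQLite, for instance, exposes its side of that black box through the Python standard library; the table and index names below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE INDEX idx_customer ON orders (customer)")

# EXPLAIN QUERY PLAN returns the planner's strategy as rows of data.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = ?", ("alice",)
).fetchall()

# The last column of each row is a human-readable step; with the
# index in place it reports an index search, not a full table scan.
for row in plan:
    print(row[-1])
```

This is "reasoning about the cost" at the level of access strategy rather than at the level of individual cache lines, which is the distinction the comment is drawing.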

For non-trivial domains, focusing only on getting the data model logically clean might be a 100% full-time job. You should always worry about performance after correctness unless you are developing a piece of software where performance is the headline aspect of correctness (game engines, DAWs, etc).

[+] munificent|2 years ago|reply
The context isn't clear from the page, but data-oriented design mostly comes from the game industry where this level of performance often (but not always) does matter.

And in that context, they don't mean "data-oriented" in the sense of "declarative like SQL". They mean it in the sense of "how the bytes are arranged in memory". One of the primary motivations is being able to use the CPU cache well.
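Even from Python you can see the distinction between "values stored as one contiguous run of bytes" and "a list of pointers to separate heap objects", which is the memory-arrangement sense meant here (the cache effect itself only shows up in a lower-level language):

```python
from array import array

# A typed array stores its elements back-to-back in one buffer:
# len(xs) * xs.itemsize contiguous bytes that a CPU can stream.
xs = array("f", [1.0, 2.0, 3.0, 4.0])
print(xs.itemsize, len(xs.tobytes()))

# The same values in a Python list are separate heap objects; only
# the pointers to them are contiguous, so iterating chases pointers.
ys = [1.0, 2.0, 3.0, 4.0]
```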