Now that people are considering NOSQL will more people consider no-DB

[+] dgreensp|14 years ago|reply

After reading the article and all the comments here, and from my own experience, I just don't think it's possible to not have a DB. At best, you write your own basic DB, because you don't need anything fancy.

For example, you write S-expressions to files like Hacker News does. This is clever, because the file system has some of the features of a database system, and files and S-expressions are abstractions that already exist. You do have to manage what data is in memory and what data is on disk at any given time, but the complexity and amount of code are low.

The idea that "event sourcing" somehow keeps you from needing a DB is ridiculous. By the time you've defined the event format, and written the software to replay the logs, etc., which if you're smart will be fairly general and modular, congrats, you've just written a database. At best, you keep complexity low, and it's another example of a small custom DB for a case where you don't need a fancy off-the-shelf DB. Maybe it's the perfect solution for your app, but it's still a database.

"Memory images," as a completely separate matter, are an abstraction that saves you some of the work of making a DB. Just as S-expressions can save you from defining a data model format, and files can save you from a custom key-value store, memory images as in Smalltalk could save you from having to deal with persistence. And if your language has transactions built in, maybe that saves you from writing your own transaction system. In general, though, it's very hard to get the DB to disappear, as there is a constellation of features important to data integrity that you need one way or another. It's usually pretty clear that you're using a DB, writing a DB, or using a system that already has a DB built in. If you think there's no DB, there's a high chance you're writing one. Again, that could be fine if you don't need all the features and properties of a robust general-purpose DB.

Funnily enough, in EtherPad's case, we had full versioned history of documents, and did pretty much everything in RAM and application logic -- a pretty good example of what the article is talking about -- and yet we used MySQL as a "dumb" back-end datastore for persistence. Believe me, we tried not to; we spent weeks trying alternatives, and trying to write alternatives. Perhaps if every last aspect of the data model had been event-based, we could have just logged the events to a text file and avoided SQL. More likely, I think, we would use something like Redis now.

[+] joshu|14 years ago|reply

Of course, filesystems are a kind of database as well.

[+] unknown|14 years ago|reply

[deleted]

[+] kfool|14 years ago|reply

Was there something specific it the data model that made the versioning hard to write? Or was it that, for this to work across the board, the entire model had to be versioned?

It sounds that SQL itself wasn't the problem. Were you looking for versioning alternatives in SQL that weren't up to par?

[+] unknown|14 years ago|reply

[deleted]

[+] keithnoizu|14 years ago|reply

[deleted]

[+] guygurari|14 years ago|reply

In many applications, data outlives code. This is certainly the case in enterprise applications, where data can sometimes migrate across several generations of an application. Data may also be more valuable to the organization than the code that processes it.

While I'm no fan of databases, one obvious advantage is that they provide direct access to the data in a standard way that is decoupled from the specific application code. This makes it easy to perform migrations, backups etc. It also increases one's confidence in the data integrity. Any solution that aims to replace databases altogether must address these concerns. I think that intimately coupling data with the application state, as suggested in the article, does not achieve this.

[+] wpietri|14 years ago|reply

The goal is not to replace databases altogether. The goal is to solve some particular problems very well. Last time I used this approach, for example, we mirrored a bunch of data in a traditional SQL store for reporting and ad-hoc querying, things that databases are great at.

In my view, direct access to data decoupled from application code is a bug, not a feature. With multiple code bases touching the same data, schema improvements become nearly impossible.

I also think data integrity is easier to maintain with a system like this. SQL constraints don't allow me to express nearly as much about data integrity as I can in code. Sure, I could use stored procedures, but if I'm going to write code somewhere, I'd rather it be in my app.

[+] alexro|14 years ago|reply

And not the last thing is security and permissions to access different parts of data. I see no way to have it easily implemented in the event logging system.

[+] jcromartie|14 years ago|reply

When people set out to design a SQL database, they usually end up updating and deleting records. This is bad because it destroys history, and nothing that you can add to your SQL architecture will fix it at a fundamental level.

By basing your system on a journaled event stream, you start with a foundation of complete history retention, and you can build exactly the sort of reporting views you need at any time (say, by creating a SQL database for other applications to query).

[+] zwischenzug|14 years ago|reply

Perhaps I am an old dinosaur, but this article merely annoyed me.

"The key element to a memory image is using event sourcing, which essentially means that every change to the application's state is captured in an event which is logged into a persistent store."

That is a key element of a database. It's called a logical log.

"Furthermore it means that you can rebuild the full application state by replaying these events."

Yup, logical log.

"Using a memory image allows you to get high performance, since everything is being done in-memory with no IO or remote calls to database systems. "

This is _exactly_ what sophisticated old-school databases do. You can have them require to write to the DB on commit, or just to memory, and have a thread take care of IO in the background.

"Databases also provide transactional concurrency as well as persistence, so you have to figure out what you are going to do about concurrency."

Righty-ho.

"Another, rather obvious, limitation is that you have to have more memory than data you need to keep in it. As memory sizes steadily increase, that's becoming much less of a limitation than it used to be."

So why not store your old-school DB in memory?

I can understand the argument that you don't want to lock into a big DB vendor's license path, but the technical arguments here look distinctly weak to me.

Maybe old-fashioned DBs are hipper than people think?

[+] zzzeek|14 years ago|reply

These are all good points but the core of Fowler's article is that the persistence is against the application's object structures directly, with no translation to relational concepts needed (note I am the author of a very popular object-relational library, so I'm not in any way opposed to object-relational mapping...it's just interesting to see this approach that requires none). That it's stored in memory and is reconstructed against an event log are secondary to this.

[+] ynd|14 years ago|reply

When you say old-school DB do you mean something like mysql?

[+] courtewing|14 years ago|reply

No matter how skilled I become as a developer, there is always something lurking around the corner to make me feel more naive than ever. As I was reading this article, I realized that my whole career and knowledge about the way applications work is based around the one core idea that when non-binary data needs to be persisted, you use a database.

The idea that you can reliably use event sourcing in memory to persist your data is as foreign to me as it is impressive. Is anyone familiar with major applications (web apps, ideally) that use this method for their data persistence?

[+] wpietri|14 years ago|reply

You're already familiar with a couple of things that can be built this way: word processors and multiplayer game servers. In both cases SQL databases are too slow and too awkward.

Financial trading is another area where databases are too slow. I know of one place that uses this approach to keep pricing data hot in RAM for their financial models. And Fowler previously documented using this for a financial exchange:

http://martinfowler.com/articles/lmax.html

[+] jkkramer|14 years ago|reply

I don't know if it uses "event sourcing" per se, but doesn't HN use in-memory & serialized Lisp data structures instead of a DB?

[+] dwolfson20|14 years ago|reply

This is the field we choose.

Expand your thinking in the abstract about what a database is. As 71104 mentions, a file system is also a database. What you are thinking of as "a database" is really a specific type of key-value store that is located on disk. But the fact most DBs are on disk has nothing to do with the concept itself.

[+] dpark|14 years ago|reply

It seems like email maps onto this model fairly well. An event (mail) comes in and gets written to the log (mbox). Starting the server (session) entails reading the events from file. It's not a perfect fit, of course.

The big players are backing their services with custom data stores, though.

[+] kragen|14 years ago|reply

I've written some programs like this, even to the point of replaying the entire input history every time my CGI script got invoked. It's surprising what a large set of apps even that naïve approach is applicable to, and there are some much more exciting possibilities under the surface.

To the extent that you could actually write your program as a pure function of its past input history — ideally, one whose only O(N) part (where N was the length of the history) was a fold, so the system could update it incrementally as new events were added — you could get schema upgrade and decentralization "for free". However, to get schema upgrade and decentralization, your program would need to be able to cope with "impossible" input histories — e.g. the same blog post getting deleted twice, or someone commenting on a post they weren't authorized to read — because of changes in the code over the years and because of distribution.

I called this "rumor-oriented programming", because the propagation of past input events among the nodes resembles the propagation of rumors among people: http://lists.canonical.org/pipermail/kragen-tol/2004-January...

I wrote a bit more on a possible way of structuring web sites as lazily-computed functions of sets of REST resources, which might or might not be past input events: http://lists.canonical.org/pipermail/kragen-tol/2005-Novembe...

John McCarthy's 1998 proposal, "Elephant", takes the idea of writing your program as a pure function of its input history to real-time transaction processing applications: http://www-formal.stanford.edu/jmc/elephant/elephant.html

The most advanced work in writing interactive programs as pure functions of their input history is "functional reactive programming", which unfortunately I don't understand properly. The Fran paper http://conal.net/papers/icfp97/ is a particularly influential, and there's a page on HaskellWiki about FRP: http://www.haskell.org/haskellwiki/Functional_Reactive_Progr...

[+] IgorPartola|14 years ago|reply

Wouldn't this system have a bunch of drawbacks:

- Long startup times as the entire image needs to be loaded and prepared.

- It would be hard to distribute the state across multiple nodes

- What happens in case of a crash? How fault tolerant would this be?

- Does this architecture essentially amount to building in a sort-of-kind-of datastore into your already complex application? Without a well-defined well-tested existing code base, is this just re-inventing the wheel for each new project?

- How do you enforce constraints on the data?

- How do transactions work (debit one account, [crash], credit another account?

- How do you allow different components (say web user interface, admin system, reporting system, external data sources) to share this state?

Just curious.

EDIT:

- Isn't this going to lead to you writing code that almost always has side-effects, causing it to be really hard to test? How would you implement this system in Haskell?

[+] wpietri|14 years ago|reply

- The startup times can be a problem if you have a lot of data. Modern disks are pretty fast for streaming reads, though, and you can split the deserialization load across multiple processors.

- Mirroring state is easy; you just pipe the serialized commands to multiple boxes.

- It's very fault tolerant. Because every change is logged before being applied, you just load the last snapshot and replay the log.

- It didn't seem that way to me.

- In code. In the system I built, each mutation was packaged as a command, and the commands enforced integrity.

- Each command is a transaction. As with DB transactions, you do have to be careful about where you draw your transaction boundaries.

- Via API. Which I like better, as it allows you to enforce more integrity than you can with DB constraints.

[+] m0th87|14 years ago|reply

Mostly agree, but to this point:

> - Isn't this going to lead to you writing code that almost always has side-effects, causing it to be really hard to test? How would you implement this system in Haskell?

You'd have to figure out how to isolate the IO monad as much as possible, but this is no different than interacting with a database in Haskell. And Haskell would give you nice features like STM to address other concerns as well.

[+] lcapaldo|14 years ago|reply

Re: Haskell it would probably look a lot like (exactly like?) Happs-State: http://happs.org/

[+] nickyp|14 years ago|reply

For those who want to take this kind of approach (object prevalence) in Common Lisp see http://common-lisp.net/project/cl-prevalence/

Sven Van Caekenberghe (the author of cl-prevalence) and I used this approach to power the back-end/cms of a concert hall back in 2003. A write-up of our experiences can be found at http://homepage.mac.com/svc/RebelWithACause/index.html

The combination of a long-running Lisp image with a remote REPL and the flexibility of the object prevalence made it a very enjoyable software development cycle. It's possibly even more applicable with the current memory prices.

I especially liked the fact that your mind never needs to step out of your object space. No fancy mapping or relationship tables, just query the objects and their relations directly. I guess that's what SmallTalk developers also like about their programming environment.

[+] larve|14 years ago|reply

we started with cl-prevalence and then of course (NIH-syndrome) implemented our own approach to this back in 2003, which you can find at http://bknr.net/ . We used it back then to run eboy.com, and it still is powering http://quickhoney.com http://www.createrainforest.org/ and http://ruinwesen.com/ amongst others. Those transaction logs + images are for some 6+ years old, and have gone through multiple code rewrites and compiler changes and OS changes and what not. It is good fun, has drawbacks, has advantages, definitely widens your horizon.

[+] jkkramer|14 years ago|reply

Using the no-DB approach is particularly tempting with a language like Clojure. Clojure can slice & dice collections easily and efficiently. It has built-in constructs for managing concurrency safely.

I actually have a couple Clojure apps that rely on a hefty amount of in-memory data to do some computations. Even the cost of pulling the data from Redis would be too expensive. The in-memory data grows very slowly, so it's easy to maintain. Moving faster-growing data in-process would be trickier, but this article makes me want to try.

[+] giardini|14 years ago|reply

This has limited use because of

Maintenance: I can easily give a 10% raise to everyone with a single SQL statement. Fowler's method requires that I first create an entire infrastructure (transaction processing, ACID properties) in code for this particular application. And it had better be as reliable as the transaction processing available in modern relational databases (so says my boss) or I'll be looking for a new job.

Support: you get to teach the new guy how "Event Sourcing" works for this application A and also applications B, C, ....

That said, I _have_ done this with great success. But the work involved a single application (a minicomputer-based engineering layout system). The ease with which versioning could be included was a selling point.

And don't get me started on reporting or statistics.

[+] wpietri|14 years ago|reply

The "give everybody a 10% raise" case can be looked upon either as a bug or a feature. Sometimes it's nice that anybody can do anything; sometimes it isn't.

As to creating the infrastructure and worries about reliability, there are a number of frameworks for this. E.g., Prevayler. It gives you all the ACID guarantees, but has about three orders of magnitude less code than a modern database.

Supporting it could definitely be a problem. That's true for anything novel, so I'd only do this where the (major) performance benefits outweigh the support cost.

Some kinds of statistics are easier with this. For example, if you want to keep a bunch of up-to-date stats on stocks (latest price, highs, lows, and moving averages for last hour, day, and week) it is almost trivially easy in a NoDB system, and much, much faster than with a typical SQL system.

For other stats and reporting, though, dumping to an SQL database is great. For many systems you don't want to use your main database for statistics anyhow, so a NoDB approach mainly means you start using some sort of data warehouse a little earlier.

[+] mmatants|14 years ago|reply

I am 100% agreeing with the article, with one caveat.

Database engines are not just for storing - each is basically a "utility knife" of data retrieval - indexing, sorting and filtering are available via (relatively) simple SQL constructs. If your app uses an index right now, ditching the DB will mean re-implementing it manually. It's not hard, but it's extra code.

So basically, the DB engine might still be a necessary "library", at least for data retrieval. A middle-of-the-road take on this is e.g. using an in-memory Sqlite instance to perform indexing, etc - seeding it at run-time to help with data searches, but then still not using it for storing persistent information and discarding the data at the end.

[+] wpietri|14 years ago|reply

Having built a couple of systems like this, I didn't find this to be a big problem. You have to organize your data in memory somehow, and that tended to be along lines that made for fast access. Occasionally I'd have to index something, which meant adding a hashtable here and there. It also allowed me to index and organize in ways that SQL doesn't make available.

The area where I still most needed SQL was for ad-hoc querying and reporting. I dreamed of building a relational-to-object mapper, but settled for a) XPath queries against our snapshot files, and b) dumping to an SQL database for reports.

[+] Duff|14 years ago|reply

The SQL constructs are great, but the biggest advantage to relational databases is that the engine handles your data consistency issues for you. Consistency isn't just about rolling the datastore back to a specific moment in time -- you have to handle locking, concurrent reads/writes, etc.

If you're building a trading platform that handles 6M transactions/second, you have the money to handle this in the application layer and the load to justify the expense. But for many other tasks, you may be wasting money or putting data at risk.

[+] jpitz|14 years ago|reply

When you're 3-4 orders of magnitude ahead of the game because you've ditched network round trips and disk accesses, indices become less ( not completely un- ) necessary.

The impedance mismatch between database and application is a lot of code too.

[+] alexro|14 years ago|reply

I think that the best thing DB-s provide is the separation of skills. I can fully concentrate on the programming side and just be aware of the db side, and the DBA-s will handle setup, replication, migration, analytics, ad-hoc queries, backup, etc.

If, on the other side, I had to do it all myself, I'd most probably have lost my last hair.

[+] wpietri|14 years ago|reply

I'd encourage everybody to try this out; building an app like this really broadened my way of thinking about system design.

Compared with a database-backed system, many operations are thousands of times faster. Some things that I was used to thinking of as impossible became easy, and vice versa. Coming to grips with why was very helpful.

[+] andrewcooke|14 years ago|reply

what implementation did you use? what others exist? thanks.

[+] yesimahuman|14 years ago|reply

I think it's interesting you can distribute more of your "persistent" state to in-memory storage and then distribute snapshots throughout the day. Online game servers often rely on state being in memory rather than being queried on demand. Achieving high performance otherwise is difficult.

However, I wouldn't call this "no-DB." Rather, it's "less-db." Ultimately, historical and statistical data needs to be stored and databases are great for that (and for a stats team).

[+] cachemoney|14 years ago|reply

I spent about a year as a maintainer of FlockDB, Twitter's social graph store. If you don't know, it's basically a sharded MySQL setup. One of the key pain points was optimizing the row lock over the follower count. Whenever a Charlie Sheen joins, or someone tries to follow spam us, one particular row would get blasted with concurrent updates.

Doing this in-memory in java via someAtomicLong.incrementAndGet() sounds appealing.

[+] jcromartie|14 years ago|reply

> Doing this in-memory in java via someAtomicLong.incrementAndGet() sounds appealing.

Just for fun, in Clojure:

    (def current-id (atom (long 0)))
    (defn get-id [] (swap! current-id inc))

[+] emehrkay|14 years ago|reply

I didn't finish the article because I read the one on Event Sourcing (http://martinfowler.com/eaaDev/EventSourcing.html), pretty good pattern. I like that every time he describes a new one (to me), I feel like I have to use it.

[+] discreteevent|14 years ago|reply

"I feel like I have to use it". Sure, as long as its in a prototype and I assume that that is what you mean. The problem is that its this "see a new thing and feel like I have to use it" that seems to be the single biggest source of accidental complexity in production software. So by all means use something new but not because you feel like it, rather because you have thought long and hard about it, tested it and can really justify why you want to use it in your situation.

[+] lucisferre|14 years ago|reply

Event sourcing has a lot of power, but it also offers up some unique challenges. If you are interested in applying it I'd check out CQRS: http://cqrsinfo.com, I've also got a few blog posts on the subject http://lucisferre.net/tag/cqrs/

[+] nivertech|14 years ago|reply

Nothing new here. I remember working with TED editor on PDP-11. The machine crashed some times. After restart TED would restore the text by replaying all my key presses.

Other example vector graphic editors: it replays vector graphics primitive instead of pixel bitmaps.

[+] jeffdavis|14 years ago|reply

What about ROLLBACK? And no, going back in time by replaying logs is no substitute, because you lose other transactions that you want to keep (and perhaps already reported to the user as completed).

What about transaction isolation? How do you keep one transaction from seeing partial results from a concurrent transaction? Sounds like a recipe for a lot of subtle bugs.

And all of the assumptions you need to make for this no-DB approach to be feasible (e.g. fits easily in memory) might hold at the start of the project, but might not remain valid in a few months or years. Then what? You have the wrong architecture and no path to fix it.

And what's the benefit to all of this? It's not like DBMS do disk accesses just for fun. If your database fits easily in memory, a DBMS won't do I/O, either. They do I/O because either the database doesn't fit in memory or there is some better use for the memory (virtual memory doesn't necessarily solve this for you with the no-DB approach; you need to use structures which avoid unnecessary random access, like a DBMS does).

I think it makes more sense to work with the DBMS rather than constantly against it. Try making simple web apps without an ORM. You might be surprised at how simple things become, particularly changing requirements. Schema changes are easy unless you have a lot of data or a lot of varied applications accessing it (and even then, often not as bad as you might think) -- and if either of those things are true, no-DB doesn't look like a solution, either.

[+] smokinn|14 years ago|reply

In an event-based system, especially a large distributed one, ROLLBACK as a single command to revoke all previous attempts at state mutation becomes impossible to support. Instead of supporting distributed transactions you have to change to a tentative model. The paper Life beyond Distributed Transactions: an Apostate's Opinion (Available here: http://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf ) describes this well.

Basically instead of making a transaction between 2 entities, you send a message to the first reserving some data, a message to the second reserving the data and once you get confirmation from both (or however many entities are involved in the transaction) you send a commit to them.

These reservations can be revoked though. Your rollback has to be managed by an "activity".

Ex: Bank transfers. You have the activity called BankTransfer. It manages the communication between entities and the overall workflow. It starts by sending messages to entities Account#1 with 100$ in it and Account#2 also with 100$. To #1 it says debit 500$. To #2 it says credit 500$. #2 responds first and says Done. #1 responds second and says Insufficient Funds. BankTransfer sends another message to #2 saying Cancel event id 100 (the crediting).

Other activities that want to read the state of number 1 will see 100$ in it but the transfer (as yet unconfirmed) had been of 50 rather than 500$ and another debit of 75$ comes in it would respond insufficient funds. At this point it's the activity's job to decide what to do. Wait and try again? Fail entirely and notify any other entities relevant to the workflow? That's up to the business rules. Also, since the credit has not yet been confirmed, reading the balance on #2 would still say 100, not 600$.

Of course, depending on your use case you may want the read to return the balance with unconfirmed transactions. That's entirely up to the application code and business rules but the example should be explanatory as to how rollback is implemented.

Eventual consistency is the only scalable way to go for very large systems.

[+] darylteo|14 years ago|reply

My understanding of Rollback in Event Sourcing is that there IS no rollback.

If an event causes the model to be in an invalid state, another event must be triggered to rectify the model into a valid state. (Simplistically speaking)

[+] Mordor|14 years ago|reply

Why not go for NoDisk solution too - just RAID your memory and back it up with ultracapacitors?

[+] kragen|14 years ago|reply

You don't even need the capacitors or battery backup if your memory is sufficiently distributed.

[+] superuser2|14 years ago|reply

Running Redis in journaling mode is essentially this, too. Mongo can run like that too.

[+] _urga|14 years ago|reply

Re: Redis, only if you never compact the append-only file.

[+] overshard|14 years ago|reply

Any time you have someone who is not a programmer who wants to maintain code, you will have a DB. And this almost all the time.

140 comments