Unofficial Guide to Datomic Internals (2014)

[+] JimmyRuska|5 years ago|reply

We used datomic in production at Time Inc around 2016. The idea of an immutable database where you can track changes over time, or query the state of the universe at any given point, sounded amazing for marketing or compliance use cases. Unfortunately from a dev standpoint it did not feel like mature system, and the performance was not where we needed it to be.

Probably the most advanced database for triple stores these days is RDFox ( https://www.youtube.com/watch?v=-DnmuHtywFs ). While datomic uses datalog for querying, RDFox uses datalog for database reasoning, and sparql, a w3 standard for querying. As you add data to the database you can infer new facts. If you want immutability, simply add data in append mode only with a timestamp. But this idea you can add the business rules/logic to the database, and have it incrementally apply that logic as you add data is a recent advance by oxford AI research.

[+] kendallgclark|5 years ago|reply

Stardog is full of features, performance, and enterprise hardening Rdfox hasn’t even started thinking about yet.

[+] jmiskovic|5 years ago|reply

For uninitiated, Nikita has created in-memory database with mostly compatible API (datalog) for ClojureScript or Javascript. My impression is that his variant actually has more wide-spread use than original Datomic, based on number of open source projects that use each.

I used DataScript for a while to get familiar with graph database querying. I was fascinated how easy it is to construct queries that mine obscure relations between distantly related entities. I hope I get to use similar tech again.

[+] twic|5 years ago|reply

> This simplicity enables Datomic to do more than any relational DB or KV storage can ever afford

> Datomic does not manage persistence itself, instead, it outsources storage problems to databases implemented by other people. Data can be kept, at your expense, in DynamoDB, Riak, Infinispan, Couchbase or SQL database.

These things can't both be true.

[+] adamkl|5 years ago|reply

Why not?

It really depends on the definition of “do more”.

From what I understand, Datomic’s model is far more flexible than many other databases, and has built in time travelling capability due to its accretion of immutable data.

It’s architecture does, in fact, allow you to choose the storage provider, and it’s considered an external concern.

Those are very compelling reasons to use it, but part of the trade off is write-scalability, and potentially, raw performance.

So maybe it’s “do more” within certain limitations (which is up to you to decide if those limitations are a deal breaker).

Here’s a great talk about Datomic’s architecture for more details: https://youtu.be/9TYfcyvSpEQ

[+] reilly3000|5 years ago|reply

I think in this context "do more" refers to querying capabilities than the costs of running the system.

[+] tnisonoff|5 years ago|reply

If the argument is Datomic can't be simpler than a relational DB because it can utilize it for persistence, then you'd have to argue that a relational DB can't be simpler than directly using a hard drive for your storage solution.

[+] jayd16|5 years ago|reply

Yes, use of the word ever is a bit hyperbolic. Change it to "easily afford" and no one would bat an eye.

[+] paulgb|5 years ago|reply

One thing I've never understood is why all the indexes have transaction last. One of the selling points of Datomic is that it supports as-of queries, but using the EAVT or AEVT indexes requires it to scan all historic values of that attribute, right?

In most situations this is probably fine, but if you have data that changes frequently it seems like this could slow queries down compared to an EATV or AETV index.

It's also likely that the people who made Datomic are both smarter about this stuff than me and put more thought into it than I have, so I'd love to know what the reasoning behind the choice of index is.

(PS @dang it would be nice to have (2014) in the title)

[+] refset|5 years ago|reply

I'm not sure EATV/AETV could be used fully instead of EAVT/AEVT as you would then lose the ability to have efficient range seeks across values. I do agree though that scanning all historical values in EAVT/AEVT is unsatisfactory for many use-cases as it makes the performance of ad-hoc as-of queries unpredictable.

By contrast, Crux [0] uses two dedicated temporal indexes: EVtTtC and EZC (Z-curve index) to make bitemporal as-of queries as fast as possible. These are distinct from the various triple indexes, which don't concern themselves with time at all. (Vt = valid time, Tt= transaction time, and C = the document hash for the version of an entity at a given coordinate)

[0] https://opencrux.com (I work on Crux :)

[+] lukashrb|5 years ago|reply

Not necessarily. Take a look at this implementation https://aosabook.org/en/500L/an-archaeology-inspired-databas...

There you can only retrieve the top layer and don't have to scan all historic data, it's only in-memory though.

[+] jasonwatkinspdx|5 years ago|reply

In the article it mentions that while indexes are conceptually monolithic, in practice they're partitioned into 3 spaces: historical, current, and in memory.

New data gets written to the log for durability and updates the in memory portion for queries. Periodically indexes are rebuilt, creating new segments for current, and shifting historical data out of current. This limits how much of the log must be replayed on recovery, and allows garbage collection of data that falls out of the retention window.

It's not that dissimilar to solutions used by traditional mvcc databases.

[+] unknown|5 years ago|reply

[deleted]

[+] nlitened|5 years ago|reply

The page mentions the Log index which is sorted by transaction id. It should be enough to support as-of if I understand correctly.

[+] unknown|5 years ago|reply

[deleted]

[+] dgb23|5 years ago|reply

Additional details about Datoms:

To determine whether a Datom is being rectracted or added there is a fifth element in the tuple [0].

There are many similarities to modelling temporal data in SQL [1]. But Datoms are simpler and more open as you can freely build relations between them (composable), similar to a graph-db.

[0] https://docs.datomic.com/cloud/whatis/data-model.html

[1] https://en.wikipedia.org/wiki/Temporal_database

[+] vlmutolo|5 years ago|reply

this theme is hilarious and infuriating

EDIT: You can comment out the yellow background image in the style editor and it becomes something reasonable

[+] adamkl|5 years ago|reply

I quite like the black-on-yellow theme. It’s different, and distinctive (I’ve read a few things on tonsky’s site, so I know what to expect).

Even more hilarious is switching to “dark mode”.

[+] kgwxd|5 years ago|reply

Too late. Read the whole thing before coming back here to see your suggestion, now everything here, and everywhere else, is yellowish, no matter what I set in the style editor.

[+] ddlutz|5 years ago|reply

It gave me a migraine reading half of the article, couldn't continue :(

[+] archarios|5 years ago|reply

I went to a Clojure meetup one time and they all went on about how using Datomic in production is a nightmare and it's generally an over-engineered product that isn't worth the trouble in the end. Do most people who have dealt with Datomic in production feel this way?

[+] Scarbutt|5 years ago|reply

Yes, and that's exactly why Nubank acquired Cognitect. They are too deep into the tech to migrate to something else, cheaper to just buy the authors.

So you have deep technical debt with serious scaling issues and bugs everywhere(Datomic/Nubank) and a burnout company(Datomic/Cognitect) get together, makes sense.

Burnout because their "Datomic Cloud" product didn't worked out, it was just a horrible complex AWS cloudformation template that force you to click through tens of aws webpages. It was more complex to manage and to dev for than on-premise but you still had all the same issues and bugs.

Nubank got into Datomic not because of Clojure, but the other way around, they got into Clojure because of Datomic. If you watch their videos, the reason they picked Datomic was because they think it had "time travel", which is quite different from having "history" of transactions, use mostly for auditing and troubleshooting, not for real time travel queries.

In the end, I guess things did work out for Cognitect, and Hickey is now laughing all the way to the bank.

I have being following Datomic for a year because of a system I inherited.

[+] dwohnitmok|5 years ago|reply

Anecdotally I know of one company which is also in the same boat and generally regrets their usage of Datomic and is trying to move away from it last I talked with them. However, there's also people on HN like dustingetz who have had a great time with Datomic and use it as a core component of their product.

I just wish Cognitect would allow people to run public benchmarks of Datomic to make it easier to evaluate its tradeoffs.

[+] auganov|5 years ago|reply

Never had any strict trouble with it. Maybe it's just that I've used it for a long time but I enjoy the simplicity of using it.

My biggest complaint is performance for certain use-cases. Say if you're trying to pull a lot of attributes on hundreds of thousands of datoms it's going to be rather slow (even though it's supposed to be in-memory already). But again for these kinds of use-cases I'd probably go with a completely different kind of a database either way.

The story around deletions/excisions isn't that great either. Honestly the whole log/history aspect of Datomic sounds nice but never really used it other than for reverting stupid mistakes.

The #1 thing I love is the freedom of querying you get with Datomic. You insert your data in a way that makes sense for your data, and querying is pretty much a completely separate concern. For the most part you don't need to structure your schema around the querying capabilities of your database which I love. Say back in the day I liked Mongo because you could just insert whatever you wanted [0] but eventually you'd hit problems where you couldn't easily query your data (maybe it has changed over the years, no idea).

And the syntax is just a pleasure to work with. I'd love a version of Datomic that kept the same interface but dropped some of the more esoteric features in favor of performance.

Also I noticed some of the people reporting issues used the cloud version. Never used that so can't speak to that. On-prem is free and has all the features. As long as you don't redistribute it there's no problem.

[0] Yes in datomic you do have to have a schema. But it's pretty much a simple global list of possible attributes. If you need to add something later or make a change it's pretty straightforward.

[+] juskrey|5 years ago|reply

Datomic learning curve is relatively steep, like many higher level and more abstract things in Clojure ecosystem in general, and you should know how to cook it for sure. After figuring out all why's and how's it works like a charm.

However, I indeed find Datomic Cloud version unnecessarily complex for most applications. Probably it is still a good corporate sales product for Cognitect.. Datomic On-premise version is much more friendly for small-medium-somewhat-larger use cases. Cloud version is also an AWS thing, so locks you in there, which is also not good.

[+] MrBuddyCasino|5 years ago|reply

I have heard multiple times that its rather slow, but haven’t seen any benchmarks. Would make sense, as a dynamically typed, garbage collected language Clojure is not the greatest fit to implement a database in. The question is, are the things you gain worth it?

[+] dragonne|5 years ago|reply

I wouldn't call it over-engineered, but it certainly is an operational disaster. It's slow, memory hungry, and full of catastrophic bugs.

We are currently replacing it with PostgreSQL to improve performance and scalability.

[+] nojito|5 years ago|reply

It's ridiculously expensive and many large scale deployments have consulting arrangements because of that very reason you shared.

51 comments