I have come to believe that storing your data as the semantic events that happen, rather than the state at a given point in time, is the way to go. From what I've seen, change data capture is the opposite process: extracting an event stream from the data changes.
Lots of databases are configured to do both. The tables store what we normally think of as "the data", and the log stores the changes. Tables are like HEAD in git, and the transaction log is like the chain of commits.
In principle you could just query the transaction log for every change to your data and compute the final state every time. Obviously this would be onerous, so in normal operation we just use the latest state.
When things go wrong the transaction log is useful for understanding why and also rewinding/replaying the database to the correct state.
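To make that concrete, here's a minimal sketch of folding a transaction log into state. The log format here is invented for illustration; the point is that a "rewind" is just the same replay stopped at an earlier entry.

```python
# Replay a transaction log to rebuild state. Entries are
# (operation, key, value) tuples -- a made-up format for illustration.

def replay(log, upto=None):
    """Fold log entries into a state dict; pass `upto` to stop early,
    i.e. to rewind the database to a past point in time."""
    state = {}
    for i, (op, key, value) in enumerate(log):
        if upto is not None and i >= upto:
            break
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

log = [
    ("set", "balance", 100),
    ("set", "balance", 75),
    ("delete", "balance", None),
    ("set", "balance", 50),
]

print(replay(log))           # latest state: {'balance': 50}
print(replay(log, upto=2))   # rewound to entry 2: {'balance': 75}
```

In normal operation the database keeps the equivalent of `replay(log)` materialised as tables, so nobody actually folds the whole log on every query.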
Some databases ship these transaction logs around between replicas to keep them all in sync.
The work presented here is an interesting application of the same basic mechanism to keep different flavours of datastores in sync.
Recently we very briefly explored the idea of using this mechanism to implement partial replication for partitioned reporting data stores. Unfortunately our current platform, SQL Azure, doesn't grant direct access to the transaction log. (On balance this is a good thing, because it's handling all the replication itself.)
Why do you believe capturing semantic events (update statements, delete statements, alter statements, etc...) is superior to capturing a log of the data changes?
Whilst there is an element of compactness to capturing semantic events, the benefit of a simpler mechanism like row-level logs is that you don't need a full database engine to parse the data, and it may offer better performance (for example, there's no need to calculate what a commit rollback entails on every node; just do it on the master node and let the other nodes read the logs to know what to update).
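A toy illustration of that trade-off, with invented event and log formats: one semantic event is compact but has to be re-executed by every consumer, whereas the row-level log it expands to can be applied blindly, with no logic on the replica.

```python
# Hypothetical comparison: a semantic event vs the row-level log
# entries it expands to. All names and formats are invented.

accounts = {"alice": 100, "bob": 200}

# Semantic event: one compact record, but every node must know how
# to execute "add_bonus" and must agree on the logic.
semantic_event = {"op": "add_bonus", "amount": 10}

# Row-level log: one entry per affected row. Verbose, but a replica
# just overwrites values -- no business logic, no rollback reasoning.
row_level_log = [
    {"key": k, "new_value": v + semantic_event["amount"]}
    for k, v in accounts.items()
]

def apply_row_log(state, log):
    # Blindly apply the new values computed on the master.
    for entry in log:
        state[entry["key"]] = entry["new_value"]
    return state

print(apply_row_log(dict(accounts), row_level_log))
# {'alice': 110, 'bob': 210}
```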
It is very much the opposite. With this pattern, you're going to have lots of copies of your data in different transformations in potentially many different data stores. The idea is that you take the stream of changes from something like Postgres and use that stream to populate caches, indexes, denormalizations/representations, counts, etc.
boothead | 10 years ago
A_Beer_Clinked | 10 years ago
ZenoArrow | 10 years ago
strictfp | 10 years ago
adamtj | 10 years ago
http://engineering.linkedin.com/distributed-systems/log-what...
https://news.ycombinator.com/item?id=6916557
baseballmerpeak | 10 years ago
brianxq3 | 10 years ago
yo-code-sucks | 10 years ago