
Ask HN: How do you update multiple disparate databases?

93 points| todaysAI | 6 years ago | reply

Talked to a CTO of a small bank. He has 4 disparate legacy databases which need to be updated when a mortgage rate changes, which is obviously a problem.

What solutions does the HN community have in dealing with this issue?

51 comments

[+] redis_mlc|6 years ago|reply
Banks use an ancient but powerful architecture called Source of Truth (SoT), or system of record. It's one of the first techniques developed to manage heterogeneous distributed databases.

One master is picked as the final authority, then changes flow to other databases, replicas, etc. either using database tools or applications.

Awareness of deadlines and latency tolerance (i.e. when the updates are needed by each downstream pipeline) is helpful.

A practical example that solves a simple case is the recent release of a tool by Netflix.

Pro tip: when you interview people for a project like this, they should already know the above. :)

Source: DBA.

https://en.wikipedia.org/wiki/Single_source_of_truth

https://en.wikipedia.org/wiki/System_of_record

[+] kjaftaedi|6 years ago|reply
I get the feeling OP is the one being interviewed.

Mortgage rates change frequently.

The bank likely deals with this exact scenario on a regular basis and should have an established procedure in place.

[+] rorykoehler|6 years ago|reply
I was in a client meeting a while back where one of their consultants corrected me with 'system of record' when I said 'source of truth'. Your comment made me wonder where that distinction comes from. Do you know the history behind it?
[+] chrisjc|6 years ago|reply
> A practical example that solves a simple case is the recent release of a tool by Netflix.

Which tool is this?

[+] maps7|6 years ago|reply
What do non-banks use?
[+] Misdicorl|6 years ago|reply
Step 1) Generate a GUID

Step 2) Send a message to all the `downstream` systems (i.e. not the Source of Truth systems) with the change description and GUID

Step 3a) Update source of truth with the new info, write the GUID to a log table in source of truth in the same transaction

Step 3b,c,d) Downstream systems poll the source of truth for existence of the GUID in the log table indicating the change has been committed to source of truth. Load latest value from source of truth and commit the change to the downstream system.

Step 4) If GUID never shows up, transaction was rolled back in source of truth (or an error occurred trying to send to one of the other downstreams, aborting before transaction)- do something relevant for your use case. Alert somebody, carry on, whatever.
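
A minimal sketch of that flow, assuming a DB-API-style connection to the source of truth, plus made-up names (a `change_log` table, a `rates` table, a `bus.publish` call and a `downstream.set_rate` adapter) purely for illustration:

    import time
    import uuid

    def publish_rate_change(sot_conn, bus, new_rate):
        change_id = str(uuid.uuid4())                   # Step 1: generate a GUID
        # Step 2: tell the downstream systems what is about to change
        bus.publish("rate-changes", {"id": change_id, "mortgage_rate_percent": new_rate})
        with sot_conn:                                  # Step 3a: one transaction in the source of truth
            cur = sot_conn.cursor()
            cur.execute("UPDATE rates SET mortgage_rate_percent = %s", (new_rate,))
            cur.execute("INSERT INTO change_log (id) VALUES (%s)", (change_id,))
        return change_id

    def apply_when_committed(sot_conn, downstream, change, timeout_s=300):
        # Steps 3b-d: each downstream polls the log table until the GUID shows up,
        # then reads the latest value from the source of truth and applies it locally.
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            cur = sot_conn.cursor()
            cur.execute("SELECT 1 FROM change_log WHERE id = %s", (change["id"],))
            if cur.fetchone():
                cur.execute("SELECT mortgage_rate_percent FROM rates")
                downstream.set_rate(cur.fetchone()[0])
                return True
            time.sleep(5)
        # Step 4: the GUID never appeared, so the SoT transaction presumably rolled back
        print(f"change {change['id']} never committed; alert somebody")
        return False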

[+] TimSchumann|6 years ago|reply
This seems entirely reasonable, but you forgot the 0'th step.

Create a source of truth. I'd bet a 'source of truth' internal to the bank doesn't exist right now, and they're probably pulling something manually from an external 3rd party.

And on with the regress..

Step -1) What form will this internal source of truth take, how much will it cost, and who builds that system?

Step -2) Who gets to make that decision?

My uncle used to do this kind of work and man, he has some horror stories. The common thread among his clients seemed to be 'We know we need to do something, we know it will cost a lot of time and money, and it absolutely cannot fail.' The hard part always seemed to be finding the person within the org who had both the authority to spend the money and the political clout to make the decision. Nobody wants to own something like that going wrong. The actual technical part of the work they had down pretty well, as it was all they did.

[+] Misdicorl|6 years ago|reply
Note that downstream systems can skip updates with this system, but will always be eventually consistent. Buyer beware
[+] devrandnull|6 years ago|reply
Is this a known pattern in distributed systems?
[+] gwbas1c|6 years ago|reply
Honestly, the solution depends highly on your circumstances. Microsoft SQL has a concept of a distributed transaction, but I'm going to assume that "4 disparate legacy databases" aren't Microsoft SQL.

Here's the high-level approach I would take:

1: Understand which databases serve what, and what the needs for accuracy are. (For example, a database serving their public-facing website can probably be a few seconds behind compared to a database handling real money.)

2: Figure out which database is the source of truth. Update that database first.

3: Update the remaining three databases in a loop: update each database, then check to make sure the update actually happened.

You will probably need to add some state management where you put yourself into an "updating" state, so that if the update breaks in the middle, you can go back and recover.
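
A rough sketch of steps 2-3 plus the "updating" marker, with hypothetical per-database client objects exposing set_rate/get_rate (not any real library):

    import json
    import math

    def update_all(source_of_truth, downstream_dbs, new_rate, state_path="rate_update.state"):
        # Record that an update is in flight, so a crash part-way through can be
        # detected on restart and the loop re-run.
        with open(state_path, "w") as f:
            json.dump({"state": "updating", "rate": new_rate}, f)

        source_of_truth.set_rate(new_rate)            # step 2: source of truth first

        for db in downstream_dbs:                     # step 3: remaining databases in a loop
            db.set_rate(new_rate)
            if not math.isclose(db.get_rate(), new_rate):
                raise RuntimeError(f"verification failed on {db.name}; still marked 'updating'")

        with open(state_path, "w") as f:              # every copy verified: clear the marker
            json.dump({"state": "idle", "rate": new_rate}, f)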

----

Another approach is to refactor the applications that are critical (Those that handle money, not the public facing web site) to always read from the database you decide is the source of truth.

[+] perlgeek|6 years ago|reply
So, the first thing to consider is whether this needs a technical solution at all.

If it happens infrequently enough, you can simply have a procedure, "change mortgage rate", that touches the 4 legacy databases. Done.

If it does need a technical solution, you have to establish the leading system / source of truth / whatever you call it. This can be either an existing system or a new one. And then all the others need to follow the leading system.

A more distributed approach (each of the 4 legacy systems can generate events that the other 3 consume) likely won't work with 4 legacy systems. It's often hard enough to get high-fidelity events/deltas out of a single system.

[+] PaulRobinson|6 years ago|reply
There might be an addendum to this good answer: OP seems to be thinking they all need to show the same rates at the same time, so in essence you're looking for a "distributed transaction", which, as we all know... well, exercise for the reader and all that.

However, one technique I've used, which works with all of your approaches and guarantees consistency if the data model allows it, is to have "future value" entries.

If you just have a single value:

    mortgage_rate_percent: 3.0
And if this is in 4 databases, there will be a period where they don't match. But if you put a future "valid_from" in there:

    mortgage_rate_percent: 3.0
    valid_from: 2020-04-01 00:00:00-0000
Well, you can now input the rates at different times in different databases, and by querying only for the most recent applicable one (or using from/to dates and finding the intersection), they all become active at the exact same moment in time (clock drift aside) as time passes. No magic needed.

This seems really obvious once you've seen it, but it's amazing how many people worry about source-of-truth, synchronisation and distributed-transaction problems when in many cases (like this one) the change can be planned. You just say "this value will apply in the future", which gives you ample time to make sure it gets synced everywhere before that time, and everything "just works".
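
A minimal Python sketch of the query side, with made-up row shapes; in a real database it would be roughly `WHERE valid_from <= now() ORDER BY valid_from DESC LIMIT 1`:

    from datetime import datetime, timezone

    # Every scheduled rate lives in each database, keyed by when it becomes valid.
    rate_rows = [
        {"mortgage_rate_percent": 2.8, "valid_from": datetime(2020, 1, 1, tzinfo=timezone.utc)},
        {"mortgage_rate_percent": 3.0, "valid_from": datetime(2020, 4, 1, tzinfo=timezone.utc)},
    ]

    def current_rate(rows, now=None):
        # Newest rate whose valid_from is already in the past; future rows are
        # simply ignored until the clock reaches them, so all copies flip together.
        now = now or datetime.now(timezone.utc)
        active = [r for r in rows if r["valid_from"] <= now]
        return max(active, key=lambda r: r["valid_from"])["mortgage_rate_percent"]

Because the future row can be loaded into each database hours or days ahead of time, nothing has to happen simultaneously at the moment the rate actually changes.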

[+] todaysAI|6 years ago|reply
There could be a rate-change once a week on average. Of course if there is a central bank change then probably every mortgage product will need to be changed that day.

Rates are the important data item, but some of the legacy DBs will also need to be notified of other changes, like fees.

[+] henriquez|6 years ago|reply
It would be hard to formulate a one-size-fits-all solution even if you were talking about relatively modern systems.

There’s a trade-off between having a manual vs. automated process in the best of circumstances. You might think intuitively that an automated process would be easier and more reliable, but depending on local factors for each of the databases, there might be issues like availability or locking that could make an automated solution more “fragile” than simply having an employee with domain expertise doing an update on a schedule. Reliably automating a process like this would require a lot of careful QA.

So throwing legacy systems into the mix would just aggravate this further. It may be that the cost/benefit of automating this just isn’t there. If it’s cheaper in terms of salary cost to do it by hand for N years than the one-time expense to automate, it may just not make sense to automate. (Obviously over an indefinite time scale it would almost always make sense to automate)

[+] todaysAI|6 years ago|reply
This is such a good answer, because sometimes a system that uses people to solve the issue is the right way.

And I say 'system' because the company needs to know not just that Sally does the updating each day, but also that if Sally is on holiday, sick, or quits, there is a protocol for getting the work done, with checks and balances to ensure that the mortgage rate change actually happens.

[+] cryptonector|6 years ago|reply
There are no standard mechanisms for keeping heterogeneous databases in sync, or even DBs of the same types but with different schemas.

The standard thing to do is to have a source of truth, as described in various comments here like https://news.ycombinator.com/reply?id=22690991&goto=item%3Fi...

Sometimes you'll have multiple sources of truth for (hopefully non-overlapping) subsets of the data in question.

These things are messy. You have to do the hard work of writing sync tools. You want sync tools that work off of incremental / delta feeds, but you also need (slower) tools that can make-it-right based on the totality of your data -- at least the first time you sync two systems, and often also for auditing purposes to make sure that the incremental (fast) sync is working correctly.
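
A toy version of the slow "make it right" pass, assuming both sides can be dumped as {primary_key: row} mappings and that `upsert`/`delete` are callables you supply to write to the downstream system:

    def reconcile(source_rows, target_rows, upsert, delete):
        # Full reconciliation: make the downstream copy match the source of truth exactly.
        for key, row in source_rows.items():
            if target_rows.get(key) != row:
                upsert(key, row)                  # missing or stale downstream
        for key in target_rows.keys() - source_rows.keys():
            delete(key)                           # downstream row no longer exists upstream

The incremental (fast) path would consume a delta feed instead, with this kind of full pass run occasionally to audit it.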

[+] oweiler|6 years ago|reply
CDC via Kafka Connect is one very powerful option.
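
For a flavour of what that looks like, registering a Debezium source connector through the Kafka Connect REST API is roughly the following; hostnames, credentials and table names are placeholders, the exact property names depend on the Debezium version, and a real MySQL connector needs additional settings (schema history, etc.):

    import requests

    connector = {
        "name": "mortgage-rates-cdc",
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "sot-db.internal",
            "database.port": "3306",
            "database.user": "cdc_user",
            "database.password": "********",
            "database.server.id": "5401",
            "topic.prefix": "sot",                     # prefix for the change-event topics
            "table.include.list": "lending.rates",     # only capture the rates table
        },
    }

    # Kafka Connect exposes a REST API for registering connectors.
    requests.post("http://connect:8083/connectors", json=connector).raise_for_status()

Downstream systems then consume the change topics and apply the updates at their own pace.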
[+] tyingq|6 years ago|reply
Goldengate might have a higher billing rate.
[+] edoceo|6 years ago|reply
Oh, I've done this. The 4 DBs are for forward origination, forward servicing, reverse origination, and reverse servicing.

We had a prime app with a fifth DB driving it. It had scripts to pull rates and a UI to adjust them, and it logged all changes.

These rates are pulled by the other systems (eg the AS400 pulled over FTP). Additional internal apps could query over SMB and HTTP.

The prime updated slowly, and each of the services polled it quite frequently (every 4 minutes), plus on demand when needed.

[+] ransom1538|6 years ago|reply
1) Create an internal endpoint for each of the databases to change the mortgage rate. Each one is different, so you will need to do this 4 times, e.g. setMortgageRate(). This is push, not pull. While you're at it, create getMortgageRate() too.

2) Then once this is standardized, write a for loop.
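
Sketched out, with setMortgageRate/getMortgageRate as hypothetical per-database adapters you would have to write yourself:

    def set_everywhere(databases, new_rate):
        # Each legacy DB gets its own thin adapter exposing the same two calls;
        # once the interface is uniform, the update really is just a for loop.
        for db in databases:
            db.setMortgageRate(new_rate)            # push the change out
        for db in databases:
            if db.getMortgageRate() != new_rate:    # read back to confirm it landed
                raise RuntimeError(f"{db.name} did not take the new rate")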

[+] quickthrower2|6 years ago|reply
The simplest way possible: if the rate change can be processed via an endpoint, that endpoint then updates the 4 databases. If you're talking to ACID databases, do a two-phase commit to reduce the chance of them getting out of sync. Send me 10% of your consulting fee please :-)
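A bare-bones coordinator for that idea, assuming each database adapter exposes prepare/commit/rollback; real two-phase commit needs XA support in every participant and a coordinator that survives crashes, and failures during the commit phase still require manual recovery:

    def two_phase_commit(participants, new_rate):
        prepared = []
        try:
            for db in participants:        # phase 1: every database votes by preparing the change
                db.prepare(new_rate)
                prepared.append(db)
            for db in participants:        # phase 2: everyone voted yes, so commit everywhere
                db.commit()
            return True
        except Exception:
            for db in prepared:            # any failure before phase 2: roll back whatever prepared
                db.rollback()
            return False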
[+] mappu|6 years ago|reply
Exactly how small?

A little planned downtime goes a long way.

[+] perl4ever|6 years ago|reply
Oh...you don't want to know that... (in the tone of J Frank Parnell explaining you don't want to look in his Malibu's trunk)