boskone | 15 years ago
Ended up standardizing on clusters of only 3 nodes. I initially wrote a Paxos+Dynamo system, but it was a dead end for our needs. I did not need a massive data store, just a cheap, highly available/scalable enterprise service that consistently gave the same answer, e.g. an eCommerce pricing engine. Three commodity boxes are all you ever need. I've done over 500,000 prices per second with ~9 unique ways of pricing ~475,000 SKUs, based off the equivalent of hundreds of millions of data rows.
Use IP multicast for cluster membership / leader election configuration. Much faster. All a server needs configured to bootstrap into the cluster is a single IP. I opted not to use ZooKeeper, though it came out while I was writing my system.
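To make the bootstrap idea concrete, here's a minimal Python sketch (the original was Scala, and the names, group address, and lowest-IP election rule are my assumptions, not the actual scheme): nodes announce themselves on one multicast group, and leader election is a deterministic pure function of the discovered membership, so every node agrees without extra round trips.

```python
import ipaddress
import socket

MCAST_GRP = "239.1.1.1"   # hypothetical group address: the single IP a node needs
MCAST_PORT = 4446          # hypothetical port

def announce_socket():
    # Each node periodically multicasts a heartbeat with its own IP to the
    # group; listeners learn cluster membership with no per-node config.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    return s

def elect_leader(members):
    # Deterministic rule (assumed here: numerically lowest IP wins): every
    # node sorts the same membership the same way, so all nodes agree on
    # the leader from discovery alone.
    return min(members, key=ipaddress.ip_address)
```

Comparing with `ipaddress.ip_address` rather than as strings matters: "10.0.0.11" sorts before "10.0.0.2" lexically but after it numerically.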
I tried a Dynamo/Cassandra approach for the updates but abandoned it early on as too slow. I ended up using something more like LinkedIn's Kafka for the "log" shuttling. In Kafka parlance: one topic with 3 partitions, one for each node. Each node mirrors all three partitions but owns exactly one. One other difference is that I do push/push while Kafka does push/pull. This makes sense, as each node is both a broker for and a consumer of the other nodes.
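The ownership/mirroring layout can be sketched like this (a toy Python model; the `Node` class and method names are mine, not from the actual system): each node holds all three partition logs, appends incoming updates only to the partition it owns, and pushes those appends directly to its peers.

```python
class Node:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers                  # the other two nodes
        # every node mirrors all three partitions, but owns exactly one:
        # the partition whose number matches its node id
        self.partitions = {0: [], 1: [], 2: []}

    def append_local(self, record):
        # The owner appends to its own partition, then pushes to both peers
        # (push/push: no peer ever has to poll for new records).
        offset = len(self.partitions[self.node_id])
        self.partitions[self.node_id].append(record)
        for peer in self.peers:
            peer.receive(self.node_id, offset, record)
        return offset

    def receive(self, partition, offset, record):
        # Mirror a peer's partition, applying records strictly in order.
        log = self.partitions[partition]
        if offset == len(log):
            log.append(record)
```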
Node 0 writes incoming updates to partition 0 and does a 1-out-of-2 push to partition 0 on nodes 1 and 2 (the third copy drops to async, as we already have 2 of 3 durable stores).
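That quorum-ack pattern looks roughly like this (a Python sketch; `replicate` and its callables are hypothetical names, since the post doesn't describe the wire protocol): acknowledge as soon as the local copy plus one peer push are durable, and let the straggler finish in the background.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def replicate(local_write, peer_pushes, quorum=2):
    # Write locally (copy 1 of 3), push to both peers concurrently, and
    # return as soon as quorum copies exist; the slower peer's push simply
    # completes asynchronously ("drops to async").
    local_write()
    pool = ThreadPoolExecutor(max_workers=len(peer_pushes))
    futures = [pool.submit(push) for push in peer_pushes]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)            # don't block on the straggler
    return 1 + len(done) >= quorum       # 2 of 3 durable: safe to ack the client
```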
For update application, Paxos is then used only for consensus on the high-water mark of the message ID (the Kafka offset). Analogous to Spinnaker's LSN.
Consensus updates to the message high-water mark are "batch" applied by a single thread, and only a single thread, to the in-memory store, which uses a special array-based hash map (a compact hash map). The single thread with a reader/writer lock gives transactions for free, with a "database lock" being faster than row locking since it's all in memory. I have the equivalent of hundreds of millions of rows in memory, all synced between the three nodes.
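A minimal sketch of that single-writer batch apply (Python; `Store` and its fields are illustrative, and a plain mutex stands in for the writer side of the reader/writer lock, since the Python stdlib has no RW lock): the one apply thread takes the lock for a whole batch up to the agreed high-water mark, so readers always see a whole batch or none of it.

```python
import threading

class Store:
    def __init__(self):
        self.data = {}                # stand-in for the compact array-based hash map
        self.lock = threading.Lock()  # plays the exclusive (writer) side of the RW lock
        self.high_water = -1          # highest consensus-committed offset applied

    def apply_batch(self, commits):
        # commits: (offset, key, value) triples up to the agreed high-water mark.
        # One thread, one lock over the whole batch = transactions for free.
        with self.lock:
            for offset, key, value in commits:
                if offset > self.high_water:   # skip anything already applied
                    self.data[key] = value
                    self.high_water = offset

    def read(self, key):
        # Readers would take the shared side of the RW lock; they can never
        # observe a half-applied batch.
        with self.lock:
            return self.data.get(key)
```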
One final optimization comes from the observation that if the overlaying application is REST-based, then updates commute under PUT semantics (versioned with a vector clock). For this type of application, the total ordering of the application of updates across the nodes can be relaxed, as the state is guaranteed to converge (eventual consistency).
Since all updates are batch-applied, in memory, on a single thread behind a reader/writer lock, a client always views only the converged consistent state, though each node's path to convergence may differ.
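A toy Python model of why those PUTs commute (function names and the tiebreak rule are my assumptions, not the actual code): a PUT replaces the row only if its vector clock dominates; concurrent PUTs resolve by a commutative tiebreak plus an element-wise clock merge, so applying the same updates in any order lands on the same row.

```python
def merge_clock(a, b):
    # Element-wise max of two vector clocks (dicts: node_id -> counter).
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def dominates(a, b):
    # a dominates b if a >= b component-wise (b happened-before a, or equal).
    return all(a.get(n, 0) >= c for n, c in b.items())

def apply_put(row, update):
    # row and update are (value, clock) pairs.
    value, clock = row
    u_value, u_clock = update
    if dominates(u_clock, clock) and u_clock != clock:
        return (u_value, u_clock)          # strictly newer PUT wins
    if dominates(clock, u_clock):
        return (value, clock)              # stale or duplicate delivery: no-op
    # Concurrent writes: a commutative tiebreak (assumed here: greater value
    # wins) plus the merged clock makes any apply order converge.
    return (max(value, u_value), merge_clock(clock, u_clock))
```

Since max and the clock merge are both commutative, two nodes that receive the same concurrent PUTs in opposite orders still end up with identical rows.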
I had looked at using a Dynamo/Cassandra approach like Spinnaker's, but after coding it I found it too slow. When I went with push/push log shuttling similar to LinkedIn's Kafka, update performance jumped quite nicely. I note one of the authors works at LinkedIn.
Wrote the thing in Scala, which was also sort of new at the time, but it turned out to be a good choice.