
Streaming Cassandra at WePay

113 points | matchagreentea | 6 years ago | wecode.wepay.com

43 comments

[+] georgewfraser|6 years ago|reply
There are some excellent choices in this design, but using Kafka in the middle makes no sense. The Cassandra change-data segments are written in batches, and BigQuery loads in batches. So the author is splitting up change-data segments into individual rows, feeding those rows into Kafka, and then re-batching them on the BQ side for ingestion.

The use of a streaming system adds a tremendous amount of complexity for no benefit, because the original data source is batch. A simpler solution would be to process the data in 5-minute batches and use a BigQuery MERGE statement to apply that "patch" to the existing table.

You really shouldn't use Kafka unless you have a true streaming data source and a hard latency requirement of < 5 minutes. It's much easier to work with data in small batches, using local disk and blob stores like S3 as your queue.
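To make the batch alternative concrete, here is a minimal sketch (not from the article; all names are illustrative) of the core of a 5-minute micro-batch: collapse a window of change rows to the latest change per key, which is exactly the "patch" a BigQuery MERGE would then apply.

```python
def build_patch(change_rows):
    """Collapse a batch of change-data rows into one upsert/delete per key.

    Each change row is (key, timestamp, op, value), where op is
    'upsert' or 'delete'. Only the latest change per key survives,
    which is what a MERGE statement would apply to the target table.
    (This is a sketch of the idea, not WePay's pipeline.)
    """
    latest = {}
    for key, ts, op, value in change_rows:
        # Keep only the most recent change we have seen for this key.
        if key not in latest or ts >= latest[key][0]:
            latest[key] = (ts, op, value)
    upserts = {k: v for k, (ts, op, v) in latest.items() if op == "upsert"}
    deletes = {k for k, (ts, op, v) in latest.items() if op == "delete"}
    return upserts, deletes
```

Each batch's output can sit on local disk or S3 until the MERGE runs, which is the "blob store as your queue" point above.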

[+] matchagreentea|6 years ago|reply
OP here.

Thanks for the feedback! I totally agree with you that using a streaming approach seems like overkill. Actually, making the pipeline truly real-time is part of our future work for Cassandra CDC (this is mentioned in Part 2 of this blog post, which is expected to be published next week). In Cassandra 4.0, the CDC feature is improved to allow real-time parsing of commit logs. We are running 3.X in production; however, we plan to add 4.0 support in the future.

[+] nightshadetrie|6 years ago|reply
A lot of engineers at WePay come from LinkedIn, so there's a special love for Kafka.
[+] dillonmckay|6 years ago|reply
I recently had a junior dev propose using Kafka as a business ‘event’ system, to trigger or be triggered by webhooks, in order to integrate our API with a third party CRM.
[+] benjaminwootton|6 years ago|reply
Curious if you also looked at Aurora over Cassandra being as you are in the AWS ecosystem?

We are an AWS-aligned consultancy, and we're seeing a lot of uptake of Aurora as it's MySQL API compatible but with better performance and fully managed.

There’s an interesting pattern here using Lambda for CDC between Aurora and Redshift, though not sure how the cost would scale at WePay scale https://aws.amazon.com/blogs/big-data/combine-transactional-...

[+] chucky_z|6 years ago|reply
Aurora and Redshift do not scale well enough for companies at their scale (only 64TB for Aurora last I checked, and Redshift falls over near the 100TB level). They'd be looking at DynamoDB.
[+] gunnarmorling|6 years ago|reply
Out of curiosity, is the binlog accessible on Aurora/MySQL? We've lots of users of the Debezium MySQL connector on AWS, but would be interesting to know whether it'd work with Aurora, too.

(Disclaimer: I'm the lead of Debezium, which is the open-source CDC platform to which WePay are contributing their Cassandra work)

[+] Luker88|6 years ago|reply
I'm not saying nosql dbs don't have their place, but...

Articles like this are exactly why my current company moved to cassandra, and why I regret not having tried harder to stop them (but I was new, didn't know cassandra, and I'm digressing)

Cassandra is (very probably) wrong for your use case.

The only model that makes sense in cassandra is where you insert data, wait for minutes for the propagation, hope it was propagated, then read it, while knowing either the primary/clustering key or the time of the last read data (assuming everything has the timestamp in the key, and has been replicated)

Anything else in cassandra feels like a hack. Deleting stuff is kinda bad: the data will remain there (marked as a tombstone) until the "grace period" has expired (default 10 days). It is really bad unless you put the timestamp in the keys, which you basically must do in cassandra: scan past the default 100k tombstones and your query will be dropped. You want a big grace period for the data, 'cause if a node goes down that is exactly the maximum time you have to bring it back up and hope all data has finished resyncing. Otherwise deleted data will come back, and your node will have less data. If the grace period has passed or you are not confident in the replication speed, absolutely delete that node and add a new one.

A bigger grace period means more tombstones (unless you never, ever delete data). Even if all the data/tombstones have been replicated to all nodes, the tombstones remain. Running manual compaction once prevents the automatic compaction from running ever again. You need to pay attention to never scan tombstones: no queries like a select without a where, always use a LIMIT or a where that ensures you will not scan deleted data (but don't use where on columns that are not keys... unless X, unless Y on the unless X, and so on for everything).
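A toy model of the tombstone mechanics described above (a sketch of the semantics, not cassandra code; the constant matches the default gc_grace_seconds):

```python
GC_GRACE_SECONDS = 10 * 24 * 3600  # Cassandra's default grace period, 10 days

class TombstoneSketch:
    """Toy model of why deletes accumulate: a delete is itself a write."""

    def __init__(self):
        self.cells = {}  # key -> (timestamp, value); value None = tombstone

    def insert(self, key, value, ts):
        self.cells[key] = (ts, value)

    def delete(self, key, ts):
        self.cells[key] = (ts, None)  # writes a marker, removes nothing

    def scan(self):
        """Like a select without a where: touches every tombstone too."""
        live, tombstones = [], 0
        for key, (ts, value) in self.cells.items():
            if value is None:
                tombstones += 1  # past ~100k of these, the query is dropped
            else:
                live.append((key, value))
        return live, tombstones

    def compact(self, now):
        """Tombstones are only purged once the grace period has passed."""
        self.cells = {k: (ts, v) for k, (ts, v) in self.cells.items()
                      if v is not None or now - ts < GC_GRACE_SECONDS}
```

The point: between the delete and the compaction, every scan still pays for the marker, which is the whole tombstone problem.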

They need kafka because they probably need a queue, and there is no way to implement one in semi-realtime in cassandra without the tombstone problem killing your queries.

Features have been added and then deprecated (see materialized views), or are very limited in their use (secondary indexes, which basically work only per-partition). Queries that are not plain select/insert (or update, which is the same as insert) have loads of special cases, and special cases on the special cases. LWTs are much slower, are the only transaction/consensus you will find, and are single-row only. COPY TO/FROM works with CSV, but will not properly escape all kinds of data, giving you all kinds of fun. Hell, manually copy-pasting a timestamp from a select into an insert does not work; the format is a tiny bit different.

Backup/restore procedures seem complex and amateurish, especially when you have to restore into another cluster.

You can select how many nodes to query for an operation, but things like "Quorum" are not a transaction or consensus. BATCH is atomic, not isolated. read-after-write works only on exactly the same node (unless you write to ALL nodes, and I have not checked what happens if one is down). And a sh*tload of other stuff.
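To make the "Quorum is not consensus" point concrete (a sketch, not cassandra code): a read is only guaranteed to overlap the latest write when the read and write replica counts satisfy R + W > N, and even then the guarantee is per-operation visibility, not a transaction.

```python
def reads_see_latest_write(n, r, w):
    """With N replicas, write consistency W and read consistency R,
    a read quorum must overlap the latest successful write quorum
    only when R + W > N. Overlap means the read touches at least one
    replica that has the write; it is not isolation or consensus."""
    return r + w > n

# QUORUM reads and writes on a 3-replica keyspace: overlap guaranteed.
assert reads_see_latest_write(n=3, r=2, w=2)
# ONE/ONE: a read may land on a replica the write never reached.
assert not reads_see_latest_write(n=3, r=1, w=1)
```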

I could go on, but even if this fits your use case, don't use cassandra. Try scylladb first. Same protocol, missing materialized views and LWT (until next year, apparently), but 10x as fast, probably due to being C++ vs Java.

[+] pixelmonkey|6 years ago|reply
I had a somewhat similar experience.

This, especially, is what my team also found to be true: "The only model that makes sense in cassandra is where you insert data, wait for minutes for the propagation, hope it was propagated, then read it, while knowing either the primary/clustering key or the time of the last read data."

We use Cassandra in production at my company, but only after much scope reduction to the use case.

We needed a distributed data store for real-time writes and delayed reads for a 24x7 high-throughput shared-nothing highly-parallel distributed processing plant (implemented atop Apache Storm). Cassandra fit the bill because writes could come from any of 500+ parallel processes on 30+ physical nodes (and growing), potentially across AZs in AWS. And because the writes were a form of "distributed grouping" where we'd later look up data by row key.

At first, we tried to also use Cassandra for counters, but that had loads of problems. So we cut that. We briefly looked at LWTs and determined they were a non-starter.

Then we had some reasonable cost/storage constraints, and discovered that Cassandra's storage format with its "native" map/list types left a lot to be desired. We had some client performance requirements, and found the same thing that hosed storage also hosed client performance. So then we did a lot of performance hacks and whittled down our use case to just the "grouping" one described above, and even still, hit a thorny production issue around "effective wide row limits".

I wrote up a number of these issues in this detailed blog post.

https://blog.parse.ly/post/1928/cass/

The good news about Cassandra is that it has been a rock-solid production cluster for us, given the whittled-down use case and performance enhancements. That is, operationally, it really is nice to have a system that thoughtfully and correctly handles data sharding, node zombification, and node failure. That's really Cassandra's legacy, IMO: that it is possible to build a distributed, masterless cluster system that lets your engineers and ops team sleep at night. But I think the quirky/limited data model may hold the project back long-term, when it comes to multi-decade popularity.

[+] KaiserPro|6 years ago|reply
Interesting article!

One question, how are your backups?

We moved away from datastax for a number of reasons, one of the biggest was the lifecycle manager. For us, it silently failed to make backups, created backups that were unrecoverable, and other generally nasty stuff.

The worst part was that with every update, something broke that should have been caught by QA. It felt like nobody was regularly testing their backups, or they rolled their own (which defeats the point of buying support).

Having said that, if you wanted to stream things like log data, cassandra was brilliant.

However, it collapsed like a sack of poop as soon as you wanted to do any kind of reading. In the end we replaced a 6 node r4.4xlarge 4TB cluster with a single t2.medium Postgres instance.

[+] agacera|6 years ago|reply
Can you please elaborate a little more on how you replaced it with Postgres? It is strange that a single Postgres box on a far less powerful instance type would perform the same as your Cassandra cluster. It seems like the first solution was either over-engineered or built for different requirements.
[+] marcus_holmes|6 years ago|reply
I'm really curious why MySQL was dropped instead of being optimised?

SQL databases can (and do) scale to high volumes, and would have avoided a lot of re-engineering. What was the final blocker for WePay that meant that optimising SQL could go no further?

[+] marcinzm|6 years ago|reply
SQL on its own scales to a point, that point being however large you can make a single machine, which can be quite small depending on the data sizes in question. Beyond that you need something like Vitess (which the article mentions), but that comes with its own overheads and caveats. You no longer really have SQL; you have Vitess. A couple of years ago a company I was at tried to use Vitess but found too many edge cases and too much operational overhead.
[+] Thaxll|6 years ago|reply
They don't really scale, since MySQL/PG don't have built-in features for horizontal scaling/sharding etc. If you want to make them scale, you have to do a lot of things manually or use third-party tools/solutions.
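The "manually" part usually starts with something like this (an illustrative sketch; shard names are made up): hash the row key to pick one of N database instances. Routing is the easy part, though — rebalancing, cross-shard queries, and transactions are then your problem.

```python
import hashlib

def shard_for(key, shards):
    """Application-level sharding: hash the key and route the query
    to one of N database instances. Deterministic, so the same key
    always lands on the same shard (until you resize the list)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# Hypothetical shard list for a four-way split.
shards = ["mysql-0", "mysql-1", "mysql-2", "mysql-3"]
```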
[+] ledneb|6 years ago|reply
Looking at the "double-writing" option, this talk by Udi Dahan ("Reliable Messaging Without Distributed Transactions") could be of interest to you https://vimeo.com/showcase/3784159/video/111998645

It's still a "double write" but you may not need distributed transactions if you're happy with at-least-once writes to Kafka and deduplication on the consuming side.
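A minimal sketch of that consumer-side deduplication (illustrative names, not from the talk): the producer may deliver a message more than once, so the consumer tracks processed message IDs and skips repeats.

```python
def consume(messages, seen_ids, apply):
    """At-least-once delivery means duplicates can arrive. The consumer
    keeps a set of processed message IDs and skips repeats, making the
    double write safe without a distributed transaction. (In a real
    system the seen-ID store must be durable and updated atomically
    with the side effect of apply.)"""
    for msg_id, payload in messages:
        if msg_id in seen_ids:
            continue  # duplicate from a retried producer write
        apply(payload)
        seen_ids.add(msg_id)
```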

[+] vijaybritto|6 years ago|reply
Sorry, Off topic here: Why are there three scrollbars for the window??!
[+] twic|6 years ago|reply
Redundancy. Only one of the scrollbars is actually active at the moment, but if it fails, a leadership election will take place, and one of the others will take over as the active.
[+] fasthandle|6 years ago|reply
This was a remarkably detailed and honestly confusing post from Tencent, until I noticed the Chase logo at the top and realized it was not by Tencent. (The WeChat team at Tencent is very small, close, and integrated.)

Pretty major name clash here, WeChat pay aka WePay, perhaps WeChat Pay, and well, another WePay.

For those not familiar: WeChat, somewhat the WhatsApp/Facebook Messenger of China (probably an earlier mover, riding on QQ's user base), transacts billions of yuan in payments every day, and together with AliPay is promoting a cashless society (which I think is wrong, I like cash).

[+] haecceity|6 years ago|reply
Why do you think cash is preferable to electronic payment? I personally hate carrying change.