Show HN: Capillaries: Distributed data processing with Go and Cassandra

70 points | kleineshertz | 2 years ago | capillaries.io

I started thinking about this approach after working on a large-scale project for a major financial company where our group developed a distributed in-house data processing solution. On a regular basis, it ingested a few gigabytes of financial data and, within a tight SLA time limit, produced a lot of enriched/aggregated/validated data for a number of customers. Sometimes, source data had errors, so operators with domain knowledge had to verify data validity at some checkpoints, immediately make corrections, and re-run parts of the workflow manually. The solution involved complex web service orchestration and a custom database, and was very demanding on infrastructure availability.

Capillaries is a built-from-scratch, open-source Go solution that does just that: it ingests data and applies user-defined transforms - Go one-liner expressions, Python formulas, joins, aggregations, denormalization - using Cassandra for intermediate data storage and RabbitMQ for task scheduling. End users just have to provide:

- source data in CSV files;
- a Capillaries script (JSON file) that defines the workflow and the transforms;
- Python code that performs complex calculations (only if needed).
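Capillaries' actual script format and expression engine aren't shown here, but the core idea of a user-supplied one-liner applied to each CSV row can be sketched in plain Go. The column names and the transform below are invented for illustration, not taken from the project:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// transform stands in for a user-defined one-liner expression,
// e.g. deriving a "total" field as qty * price for each row.
func transform(qty, price float64) float64 { return qty * price }

func main() {
	// Source data would normally come from a CSV file.
	src := "qty,price\n2,10.5\n3,4.0\n"
	rows, err := csv.NewReader(strings.NewReader(src)).ReadAll()
	if err != nil {
		panic(err)
	}
	for _, row := range rows[1:] { // skip the header row
		qty, _ := strconv.ParseFloat(row[0], 64)
		price, _ := strconv.ParseFloat(row[1], 64)
		fmt.Printf("qty=%v price=%v total=%v\n", qty, price, transform(qty, price))
	}
}
```

In Capillaries itself the derived values would be written to Cassandra as intermediate data rather than printed.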

The whole data processing pipeline can be split into separate runs that can be started independently and re-run by the user if needed.

The goal is to build a platform that is tolerant to database and processing node failures, and allows users to focus on data transform logic and data quality control.

“Getting started” Docker-based demo calculates ARK funds performance, using EOD holdings and transactions data acquired from public sources. There are also integration tests that use non-financial data. There is a test deploy tool that uses Openstack API for provisioning in the cloud.

25 comments

dmingod666|2 years ago

Surprised you didn't pick ScyllaDB over Cassandra. ScyllaDB has excellent out-of-the-box support for Golang Change Data Capture and can handle much more load given the same hardware as Cassandra. It has nice integration with stuff like Confluent as well. Given ScyllaDB is mostly a drop-in replacement, I guess it would be quite straightforward to swap it out if someone wanted.

kleineshertz|2 years ago

ScyllaDB is definitely on the radar. The main reason I picked Cassandra at the prototyping stage was that the default Cassandra configuration gave me much better performance than ScyllaDB (I know, it is supposed to be vice versa). Another obvious reason was Cassandra's maturity and community support. If gocqlx is indeed a drop-in replacement for gocql, I can't see any problem with having a separate config/fork using ScyllaDB alongside Cassandra.

MrBuddyCasino|2 years ago

At this point, I wonder if they should simply have picked a more memorable name. Yes, it's clever, but even I regularly forget its name. Never happens with Cassandra.

Xeoncross|2 years ago

https://capillaries.io/ (Cassandra & RabbitMQ) reminds me of https://temporal.io/ (PostgreSQL)

Next up, TemporiarriesLite™

Go + SQLite using https://litestream.io for single-instance, low-power systems or serverless, seldom-used apps that need distributed backups and statefulness.

Jokes aside, these at-least-once operation state managers are really nice and help us avoid SQS / NATS / etc. queues littered all over the place. The focus on data processing by Capillaries is nice. Looking forward to trying it out.
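At-least-once delivery means a handler may see the same message more than once, so the state manager's job is to make redelivery harmless. A minimal in-memory sketch of that idempotency idea (systems like Temporal or Capillaries persist this kind of state in a database instead of a map):

```go
package main

import "fmt"

// Dedup tracks which message IDs have already been handled, so that
// redelivered (at-least-once) messages are skipped instead of reprocessed.
type Dedup struct{ seen map[string]bool }

func NewDedup() *Dedup { return &Dedup{seen: map[string]bool{}} }

// ProcessOnce runs handler only for IDs not seen before.
// It reports whether the handler actually ran.
func (d *Dedup) ProcessOnce(id string, handler func()) bool {
	if d.seen[id] {
		return false // duplicate delivery, skip
	}
	d.seen[id] = true
	handler()
	return true
}

func main() {
	d := NewDedup()
	count := 0
	// "m1" is delivered twice, as an at-least-once broker may do.
	for _, id := range []string{"m1", "m2", "m1"} {
		d.ProcessOnce(id, func() { count++ })
	}
	fmt.Println(count) // handler ran twice despite three deliveries
}
```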

kleineshertz|2 years ago

Temporal is a different ecosystem (and a much more ambitious solution), but one of the principles is the same: users want a platform that solves scalability issues and lets them focus on biz logic and customer value.

lorendsr|2 years ago

Temporal has different database options: Cassandra, Postgres, MySQL, SQLite.

> source data in CSV files; - Capillaries script (JSON file) that defines the workflow and the transforms; - Python code that performs complex calculations (only if needed).

Temporal is more general purpose: source data anywhere, and you write code to define workflows and transforms instead of JSON, and the code can be in Go/Java/Python/JS/TS/.NET

gabereiser|2 years ago

My suggestion… read/write Avro and Parquet files so that big data pipelines could use Capillaries. I was working on something along these lines, not sure if you support it or not. If not, you really should.

kleineshertz|2 years ago

Parquet support is on the radar for sure, and I would like to have it before diving into database connector development.

m00x|2 years ago

Couldn't you just run it on Airflow / Luigi / Keboola / Dagster / Flyte?

kleineshertz|2 years ago

Maybe. The scenarios Capillaries is intended for do not need complex/flexible workflows; we just need some basic dependency rules (easy to implement) and really reliable scheduling (RabbitMQ).
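The "basic dependency rules" part really is easy to implement: given a map from each processing node to its dependencies, a small topological sort yields a valid run order. A hypothetical sketch with invented node names, not actual Capillaries code:

```go
package main

import "fmt"

// runOrder returns node names in an order where every node appears
// after all of its dependencies (a depth-first topological sort;
// assumes the dependency graph is acyclic).
func runOrder(deps map[string][]string) []string {
	var order []string
	done := map[string]bool{}
	var visit func(n string)
	visit = func(n string) {
		if done[n] {
			return
		}
		done[n] = true
		for _, d := range deps[n] {
			visit(d)
		}
		order = append(order, n)
	}
	for n := range deps {
		visit(n)
	}
	return order
}

func main() {
	deps := map[string][]string{
		"read_csv":  {},
		"enrich":    {"read_csv"},
		"aggregate": {"enrich"},
	}
	fmt.Println(runOrder(deps))
}
```

In a real system each node would become a RabbitMQ task dispatched once its dependencies report completion.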

eranation|2 years ago

Maybe I’m asking the wrong question but how does it compare to Apache Spark etc?

kleineshertz|2 years ago

Nothing wrong with this question. I do not have any experience with Spark, but I guess Capillaries belongs to the same or a similar ecosystem. My understanding is that Spark is a way more generic framework that revolves around DAG-defined workflows and map/reduce-style functionality.

Capillaries is about:

- taking a very structured, stage-by-stage approach to batch data processing, with the possibility to control the results of a specific stage (although some kind of workflow DAG is there as well);
- executing SQL-style aggregation and denormalization on data in Cassandra;
- executing workflows without actually writing code (besides one-liner Go expressions and Python math formulas when needed).
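As a rough illustration of the aggregation point, a SQL-style `GROUP BY` / `SUM` over in-memory rows looks like this in plain Go (the row type and fund names are invented for the example; Capillaries would run this kind of aggregation against intermediate data in Cassandra):

```go
package main

import "fmt"

// Row is a toy stand-in for a record in an intermediate table.
type Row struct {
	Fund string
	Qty  float64
}

// groupSum is the equivalent of: SELECT fund, SUM(qty) ... GROUP BY fund
func groupSum(rows []Row) map[string]float64 {
	out := map[string]float64{}
	for _, r := range rows {
		out[r.Fund] += r.Qty
	}
	return out
}

func main() {
	rows := []Row{{"ARKK", 10}, {"ARKW", 5}, {"ARKK", 2.5}}
	totals := groupSum(rows)
	fmt.Println(totals["ARKK"], totals["ARKW"]) // 12.5 5
}
```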

Sorry if I am missing the point with Spark; as I said, I've never worked with it.

serial_dev|2 years ago

I would have had the same question about Apache Storm; it sounds to me like these tools would solve the described problem relatively well (and now that I think about it, Spark even has Python support).