Show HN: Capillaries: Distributed data processing with Go and Cassandra
70 points | kleineshertz | 2 years ago | capillaries.io
Capillaries is a built-from-scratch, open-source Go solution that does just that: it ingests data and applies user-defined transforms - Go one-liner expressions, Python formulas, joins, aggregations, denormalization - using Cassandra for intermediate data storage and RabbitMQ for task scheduling. End users only have to provide:
- source data in CSV files;
- a Capillaries script (JSON file) that defines the workflow and the transforms;
- Python code that performs complex calculations (only if needed).
The whole data processing pipeline can be split into separate runs that can be started independently and re-run by the user if needed.
The goal is to build a platform that is tolerant to database and processing node failures, and allows users to focus on data transform logic and data quality control.
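To make the "user-defined transform" idea concrete, here is a minimal Go sketch of the pattern: CSV rows come in, a small per-row function (standing in for a one-liner expression) computes a derived value, and the results would then be persisted to intermediate storage. The `transformRow` function and the trade-value calculation are illustrative assumptions, not Capillaries' actual API.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// transformRow stands in for a user-defined one-liner expression:
// given a (ticker, qty, price) record, compute the position value.
func transformRow(rec []string) (string, float64, error) {
	qty, err := strconv.ParseFloat(rec[1], 64)
	if err != nil {
		return "", 0, err
	}
	price, err := strconv.ParseFloat(rec[2], 64)
	if err != nil {
		return "", 0, err
	}
	return rec[0], qty * price, nil
}

func main() {
	// In a real pipeline this would be a CSV file; results would go
	// to intermediate storage (Cassandra) rather than stdout.
	src := "AAPL,10,190.5\nMSFT,5,410.0\n"
	recs, err := csv.NewReader(strings.NewReader(src)).ReadAll()
	if err != nil {
		panic(err)
	}
	for _, rec := range recs {
		ticker, value, err := transformRow(rec)
		if err != nil {
			continue // bad rows would be routed to data-quality checks
		}
		fmt.Printf("%s %.1f\n", ticker, value)
	}
}
```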
The “Getting started” Docker-based demo calculates ARK funds' performance using EOD holdings and transactions data acquired from public sources. There are also integration tests that use non-financial data, and a test deploy tool that uses the OpenStack API for provisioning in the cloud.
Xeoncross|2 years ago
Next up, TemporiarriesLite™
Go + SQLite using https://litestream.io for single-instance, low-power systems or serverless, seldom-used apps that need distributed backups and statefulness.
Jokes aside, these at-least-once operation state managers are really nice and help us avoid having SQS / NATS / etc. queues littered all over the place. The focus on data processing by Capillaries is nice. Looking forward to trying it out.
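For readers unfamiliar with the term: "at-least-once" delivery means a broker such as RabbitMQ may redeliver a task, so the consumer must make duplicate deliveries harmless by tracking completed task IDs. A minimal Go sketch of that idempotency pattern (the `Processor` type and in-memory `done` map are illustrative assumptions; a real system would keep this bookkeeping in durable storage):

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a unit of work delivered at-least-once by a message broker.
type Task struct {
	ID      string
	Payload string
}

// Processor makes at-least-once delivery effectively exactly-once by
// recording the IDs of completed tasks and skipping redeliveries.
type Processor struct {
	mu   sync.Mutex
	done map[string]bool
	Runs int // how many tasks actually executed
}

func NewProcessor() *Processor {
	return &Processor{done: make(map[string]bool)}
}

// Handle runs the task unless an earlier delivery already completed it.
func (p *Processor) Handle(t Task) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.done[t.ID] {
		return // duplicate delivery: safe no-op
	}
	p.Runs++ // the real work would happen here
	p.done[t.ID] = true
}

func main() {
	p := NewProcessor()
	// t1 is delivered twice, simulating a broker redelivery.
	for _, t := range []Task{{"t1", "a"}, {"t2", "b"}, {"t1", "a"}} {
		p.Handle(t)
	}
	fmt.Println("tasks executed:", p.Runs) // 2, not 3
}
```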
lorendsr|2 years ago
> - source data in CSV files;
> - Capillaries script (JSON file) that defines the workflow and the transforms;
> - Python code that performs complex calculations (only if needed).
Temporal is more general-purpose: source data can live anywhere, workflows and transforms are defined in code rather than JSON, and that code can be written in Go, Java, Python, JS/TS, or .NET.
kleineshertz|2 years ago
Capillaries is about:
- taking a very structured, stage-by-stage approach to batch data processing, with the ability to inspect the results of a specific stage (although some kind of workflow DAG is there as well);
- executing SQL-style aggregation and denormalization on data in Cassandra;
- executing workflows without actually writing code (besides one-liner Go expressions and Python math formulas when needed).
Sorry if I'm missing the point with Spark - as I said, I've never worked with it.
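To illustrate what "SQL-style aggregation and denormalization" means here, a short Go sketch: a holdings table is joined against a fund lookup (the denormalization step), then position values are summed per fund (the aggregation step). In Capillaries such a stage would be described declaratively in the script and executed against Cassandra; the types and data below are made up for illustration.

```go
package main

import "fmt"

// Holding is one row of the "fact" table: a position in some fund.
type Holding struct {
	FundID string
	Ticker string
	Value  float64
}

// aggregateByFund joins holdings against a fund-name lookup table
// (denormalization), then sums position values per fund (aggregation).
func aggregateByFund(holdings []Holding, funds map[string]string) map[string]float64 {
	totals := make(map[string]float64)
	for _, h := range holdings {
		fundName, ok := funds[h.FundID]
		if !ok {
			fundName = h.FundID // keep unmatched rows, like a LEFT JOIN
		}
		totals[fundName] += h.Value
	}
	return totals
}

func main() {
	funds := map[string]string{"ARKK": "ARK Innovation"}
	holdings := []Holding{
		{"ARKK", "TSLA", 100},
		{"ARKK", "COIN", 50},
		{"ARKW", "SQ", 25}, // no lookup entry: falls back to the raw ID
	}
	for fund, total := range aggregateByFund(holdings, funds) {
		fmt.Println(fund, total)
	}
}
```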