Show HN: S2-lite, an open source Stream Store
77 points| shikhar | 1 month ago |github.com
The idea of streams as a cloud storage primitive resonated with a lot of folks, but not having an open source option was a sticking point for adoption – especially from projects that were themselves open source! So we decided to build it: https://github.com/s2-streamstore/s2
s2-lite is MIT-licensed, written in Rust, and uses SlateDB (https://slatedb.io) as its storage engine. SlateDB is an embedded LSM-style key-value database on top of object storage, which made it a great match for delivering the same durability guarantees as s2.dev.
You can specify a bucket and path to run against an object store like AWS S3 — or skip that to run entirely in-memory. (This also makes it a great emulator for dev/test environments.)
Why not just open up the backend of our cloud service? s2.dev has a decoupled architecture with multiple components running in Kubernetes, including our own K8S operator – we made tradeoffs that optimize for operation of a thoroughly multi-tenant cloud infra SaaS. With s2-lite, our goal was to ship something dead simple to operate. There is a lot of shared code between the two that now lives in the OSS repo.
A few features remain (notably deletion of resources and records), but s2-lite is substantially ready. Try the Quickstart in the README to stream Star Wars using the s2 CLI!
The key difference between S2 and systems like Kafka or Redis Streams: supporting tons of durable streams. I have blogged about the landscape in the context of agent sessions (https://s2.dev/blog/agent-sessions#landscape). Kafka and NATS Jetstream treat streams as provisioned resources, and the protocols/implementations are oriented around such assumptions. Redis Streams and NATS allow for larger numbers of streams, but without proper durability.
The cloud service is completely elastic, but you can also get pretty far with lite despite it being a single-node binary that needs to be scaled vertically. Streams in lite are "just keys" in SlateDB, and cloud object storage is bottomless – although of course there is metadata overhead.
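The "streams are just keys" idea can be pictured with an order-preserving key encoding. This is a purely illustrative sketch of the general technique — not s2-lite's actual key layout:

```rust
// Illustrative only: one way to map (stream, seq) pairs onto a flat
// key-value space so that a stream's records sort contiguously.
// NOT s2-lite's actual key layout.
fn record_key(stream: &str, seq: u64) -> Vec<u8> {
    let mut key = Vec::new();
    key.extend_from_slice(stream.as_bytes());
    key.push(0x00); // separator so stream names cannot collide
    key.extend_from_slice(&seq.to_be_bytes()); // big-endian preserves numeric order
    key
}

fn main() {
    let a = record_key("chat-42", 1);
    let b = record_key("chat-42", 2);
    let c = record_key("chat-43", 0);
    assert!(a < b); // records within a stream stay ordered
    assert!(b < c); // streams stay grouped together
    println!("ok");
}
```

With a scheme like this, creating a stream costs nothing beyond its metadata entry, which is what makes "tons of streams" cheap in an LSM over object storage.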
One thing I am excited to improve in s2-lite is pipelining of writes for performance (already supported behind a knob, but needs upstream interface changes for safety). It's a technique we use extensively in s2.dev. Essentially when you are dealing with high latencies like S3, you want to keep data flowing throughout the pipe between client and storage, rather than go lock-step where you first wait for an acknowledgment and then issue another write. This is why S2 has a session protocol over HTTP/2, in addition to stateless REST.
You can test throughput/latency for lite yourself using the `s2 bench` CLI command. The main factors are: your network quality to the storage bucket region, the latency characteristics of the remote store, SlateDB's flush interval (`SL8_FLUSH_INTERVAL=..ms`), and whether pipelining is enabled (`S2LITE_PIPELINE=true` to taste the future).
I'll be here to get thoughts and feedback, and answer any questions!
csense|1 month ago
Adding a database, multiple components, and Kubernetes to the equation seems like massively overengineering.
What value does S2 provide that simple TCP sockets do not?
Is this for like "making your own Twitch" or something, where streams have to scale to thousands-to-millions of consumers?
shikhar|1 month ago
Edit: no k8s required for s2-lite, it is just a single binary. That was an architectural note about our cloud service.
shikhar|1 month ago
Yes, this can be a good building block for broadcasting data streams.
s2-lite is single node, so to scale to that level, you'd need to add some CDN-ing on top.
s2.dev is the elastic cloud service, and it supports high fanout reads using Cachey (https://www.reddit.com/r/databasedevelopment/comments/1nh1go...)
maxpert|1 month ago
shikhar|1 month ago
And it has the durability of object storage rather than just local. SlateDB actually lets you also use local FS, will experiment with plumbing up the full range of options - right now it's just in-memory or S3-compatible bucket.
> So I'd try to share as much of the frontend code (e.g. the GRPC and REST handlers) as possible between these.
Right on, this is indeed the case. The OpenAPI spec is also now generated off the REST handlers from s2-lite. We are getting rid of gRPC; s2-lite only supports the REST API (+ a gRPC-like session protocol over HTTP/2: https://s2.dev/docs/api/records/overview#s2s-spec)
michaelmior|1 month ago
I'm curious why and what challenges you had with gRPC. s2-lite looks cool!
kwkelly|1 month ago
And am I understanding correctly that if I pointed 2 running instances of s2-lite at the same place in s3 there would be problems since slatedb is single writer?
shikhar|1 month ago
Did not architect explicitly for that, but it should be viable. You could use the `Backend` directly, which is what the REST handlers call: https://docs.rs/s2-lite/latest/s2_lite/backend/struct.Backen...
Happy to accept contributions that make this more ergonomic.
> And am I understanding correctly that if I pointed 2 running instances of s2-lite at the same place in s3 there would be problems since slatedb is single writer?
SL8 will fence the older writer, thanks to S3 conditional writes. I think there would be potential for stale reads until the fencing happens...
Edit: Fresh discussion in https://discord.com/channels/1232385660460204122/12323856609...
The stale read potential can be mitigated, https://github.com/s2-streamstore/s2/issues/91
up2isomorphism|1 month ago
Also, there don't seem to be many use cases nowadays that want this; if there are any, they already use Kafka.
solaris2007|1 month ago
Please elaborate on this.
arpinum|1 month ago
shikhar|1 month ago
Will look into how to enable that option from s2-lite