From reading the docs, this has an IMO surprising design decision: the “journal” is a stream of bytes, where each append (of a byte string) is atomic and occurs in a global order. The bytes are grouped into fragments, and no write spans a fragment boundary.
This seems sort of okay if writes are self-delimiting and never corrupt, and synchronization can always be recovered at a fragment boundary.
I suppose it’s neat that one can write JSONL and get actual JSONL in the blobs. But this seems quite brittle if multiple writers write to one journal and one malfunctions (aside from possibly failing to write a delimiter, there’s no way to tell who wrote a record, and using only a single writer per journal seems to defeat the purpose). And getting, say, Parquet output doesn’t seem like it will happen in any sensible way.
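To make the model above concrete, here is a toy in-memory sketch in Python of a journal whose appends are atomic and never span a fragment boundary. It is not Gazette's code or API: `ToyJournal`, `Fragment`, and `target_fragment_size` are hypothetical names, and the roll-over rule is an assumption made purely for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Fragment:
    begin: int  # journal byte offset at which this fragment starts
    data: bytearray = field(default_factory=bytearray)

    @property
    def end(self) -> int:
        return self.begin + len(self.data)


class ToyJournal:
    """Toy model of the append semantics described above (hypothetical, not Gazette code)."""

    def __init__(self, target_fragment_size: int = 64):
        self.target = target_fragment_size  # assumed roll-over threshold, in bytes
        self.fragments = [Fragment(begin=0)]

    def append(self, payload: bytes) -> int:
        """Append payload as a single unit and return its starting byte offset."""
        current = self.fragments[-1]
        # Roll to a new fragment *before* writing, so no append spans a boundary.
        if current.data and len(current.data) + len(payload) > self.target:
            current = Fragment(begin=current.end)
            self.fragments.append(current)
        offset = current.end
        current.data.extend(payload)
        return offset


journal = ToyJournal(target_fragment_size=32)
offsets = [journal.append(b'{"n": %d}\n' % i) for i in range(5)]

# Each fragment starts on a record boundary, so it can be parsed on its own.
for frag in journal.fragments:
    records = [json.loads(line) for line in bytes(frag.data).splitlines()]
    print(frag.begin, records)
```

Because every fragment in this toy starts on a record boundary, a reader can begin parsing JSONL at any fragment's `begin` offset, but only as long as every writer appends complete, newline-terminated records.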
> But this seems quite brittle if multiple writers write to one journal and one malfunctions (aside from possibly failing to write a delimiter, there’s no way to tell who wrote a record, and using only a single writer per journal seems to defeat the purpose).
Yes, writers are responsible for only ever writing complete delimited blocks of messages, in whatever framing the application wants to use.
Gazette promises to provide a consistent total order over a bunch of raced writes, to roll back broken writes (partial content and then a connection reset, for example), to checksum content, and a host of other things. There's also a low-level "registers" concept which can be used to cooperatively fence off the capability to write to a journal from other writers.
But garbage in => garbage out, and if an application correctly writes bad data, then you'll have bad data in your journal. This is no different from any other file format under the sun.
> there’s no way to tell who wrote a record
To address this comment specifically: while brokers are byte-oriented, applications and consumers are typically message oriented, and the responsibility for carrying metadata like "who wrote this message?" shifts to the application's chosen data representation instead of being a core broker concern.
Gazette has a consumer framework that layers atop the broker, and it uses UUIDs which carry producer and sequencing metadata in order to provide exactly-once message semantics atop an at-least-once byte stream: https://gazette.readthedocs.io/en/latest/architecture-exactl...
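As a rough illustration of how producer and sequencing metadata can turn an at-least-once byte stream into exactly-once message delivery, here is a minimal Python sketch. It is not Gazette's consumer framework and does not reproduce its UUID layout: the explicit `producer` and `seq` fields, and the helpers `frame` and `read_exactly_once`, are assumptions made for illustration. It also assumes each producer's sequence numbers only ever increase.

```python
import json
import uuid


def frame(producer: str, seq: int, body: dict) -> bytes:
    """Encode one message as a JSONL record carrying producer and sequence metadata."""
    return (json.dumps({"producer": producer, "seq": seq, "body": body}) + "\n").encode()


def read_exactly_once(stream: bytes):
    """Yield each message body once, skipping records already seen from the same producer."""
    last_seq = {}  # producer -> highest sequence number delivered so far
    for line in stream.splitlines():
        msg = json.loads(line)
        if msg["seq"] <= last_seq.get(msg["producer"], -1):
            continue  # duplicate caused by a retried append; drop it
        last_seq[msg["producer"]] = msg["seq"]
        yield msg["body"]


p1, p2 = str(uuid.uuid4()), str(uuid.uuid4())
stream = b"".join([
    frame(p1, 0, {"v": "a"}),
    frame(p2, 0, {"v": "x"}),
    frame(p1, 0, {"v": "a"}),  # p1 retried after a timeout: at-least-once delivery
    frame(p1, 1, {"v": "b"}),
])
bodies = list(read_exactly_once(stream))
print(bodies)
```

The broker in this picture stays byte-oriented. Only the reader interprets the metadata, which is the division of responsibility described above.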
I don't think it's correct to say that JSONL is any more vulnerable to invalid data than other message framings. There's literally no system out there that can fully protect you from bugs in your own application. But the client libraries do validate the framing for you automatically, so in practice the risk is low. I've been running decently large Gazette clusters for years now using the JSONL framing, and have never seen a consumer write invalid JSON to a journal.
The choice of message framing is left to the writers/consumers, so there's also nothing preventing you from using a message framing that you like better. Similarly, there's nothing preventing you from adding metadata that identifies the writer. Having this flexibility can be seen as either a benefit or a pain. If you see it as a pain and want something that's more high-level but less flexible, then you can check out Estuary Flow, which builds on Gazette journals to provide higher-level "Collections" that support many more features.
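For a sense of what validating the framing before an append might look like, here is a small Python sketch. It is not the Gazette client library's validation: `validate_jsonl_block` is a hypothetical helper, and the `writer` field is just one assumed way of adding metadata that identifies the writer.

```python
import json


def validate_jsonl_block(block: bytes) -> int:
    """Return the number of records in block, or raise ValueError if the framing is broken."""
    if not block.endswith(b"\n"):
        raise ValueError("block must end with a newline delimiter")
    count = 0
    for line in block[:-1].split(b"\n"):
        try:
            json.loads(line)  # JSONDecodeError is a subclass of ValueError
        except ValueError as exc:
            raise ValueError(f"invalid JSON record: {line[:40]!r}") from exc
        count += 1
    return count


good = b'{"writer": "w1", "n": 1}\n{"writer": "w2", "n": 2}\n'
truncated = b'{"writer": "w1", "n": 3'  # missing closing brace and delimiter

print(validate_jsonl_block(good))
try:
    validate_jsonl_block(truncated)
except ValueError as err:
    print("rejected:", err)
```

A writer that runs a check like this before every append can still write wrong data, but it cannot write a block that breaks the framing for the other writers sharing the journal.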
Gazette is at the core of Estuary Flow (https://estuary.dev), a real-time data platform. Compared to Kafka, Gazette’s architecture is simpler to reason about and operate. It plays well with k8s and is backed by S3 (or any object storage).
I feel a bit paralyzed by Fear Of Missing Io_Uring. There's so much awesome streaming stuff about (RisingWave, Materialize, NATS, DataFusion, Velox, neat upstarts like Iggy, many more), but it all feels built on slower legacy system libraries.
io_uring is a low level abstraction and is generally a wash against epoll. Really won't make a difference for these kinds of applications, especially not for client nodes.
I think it's less about guaranteed 1ms real time transactions and more about, like, it's just fast enough that you most likely don't have to worry about it introducing perceptible lag?
I'm working on a streaming audio thing and keeping latency low is a priority. I actually think I'll try Gazette. I just saw it now, and it was one of those moments where it's like: wait, I go to Hacker News to waste time, but this is quite exactly what I've been wanting in so many ways.
I'll use it for Ogg/Opus media streams, transcription results, chat events, LLM inferences...
I really like the byte-indexed append-only blob paradigm backed by object storage. It feels kind of like Unix as a distributed streaming system.
Other streaming data gadgets like Kafka always feel a bit uncomfortable and annoying to me with their idiosyncratic record formats and topic hierarchies and whatnot... I always wanted something more low level and obvious...
amluto|1 year ago
jgraettinger1|1 year ago
psfried|1 year ago
danthelion|1 year ago
Onavo|1 year ago
https://www.tinybird.co/
jauntywundrkind|1 year ago
It's not heavily used yet, but Rust has a bunch of fairly high visibility efforts. Situation sort of feels similar with http3, where the problem is figuring out what to pick. https://github.com/tokio-rs/tokio-uring https://github.com/bytedance/monoio https://github.com/DataDog/glommio
Alas, libuv (powering Node.js) shipped io_uring but disabled it later. It seems to have significantly worn out the original author on the topic, to boot. https://github.com/libuv/libuv/pull/4421#issuecomment-222586...
immibis|1 year ago
hnav|1 year ago
abrookewood|1 year ago
mrbluecoat|1 year ago
Any plans to support websocket?
https://gazette.readthedocs.io/en/latest/brokers-tutorial-in...
oatmeal_croc|1 year ago
mbrock|1 year ago
freeqaz|1 year ago
kcb|1 year ago
xyst|1 year ago
Groxx|1 year ago
immibis|1 year ago