top | item 38187215

mavam | 2 years ago

We want to achieve something similar with our pipelines [1] by making the beginning and the end of the pipeline symmetric, giving you this flow:

1. Acquire bytes (void → unstructured)

2. Parse bytes to events (unstructured → structured)

3. Transform events (structured → structured)

4. Print events (structured → unstructured)

5. Send bytes (unstructured → void)

The "Publish" part is a combination of (4) and (5). Sometimes the two are fused because not all APIs differentiate those steps. We're currently focusing on building blocks (engine, connectors, formats) as opposed to application-level integrations, so turnkey Reverse ETL is not coming soon. But the main point is that the symmetry reduces cognitive effort for the user: they already worked that muscle on the "E" side and now just need to find the dual in the docs.
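To make the five steps concrete, here is a minimal Python sketch of the flow, assuming newline-delimited JSON as the byte format and a severity filter as the transformation. The function names (`acquire`, `parse`, `transform`, `print_events`, `send`, `run`) are hypothetical and chosen to mirror the list above; this is not the Tenzir API.

```python
import io
import json
from typing import BinaryIO, Iterable, Iterator


def acquire(source: BinaryIO) -> Iterator[bytes]:
    """Step 1 (void -> unstructured): read raw lines from a byte source."""
    for line in source:
        yield line


def parse(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Step 2 (unstructured -> structured): decode each non-empty line as JSON."""
    for chunk in chunks:
        if chunk.strip():
            yield json.loads(chunk)


def transform(events: Iterable[dict]) -> Iterator[dict]:
    """Step 3 (structured -> structured): keep events with severity >= 3.

    The threshold and the "severity" field are assumptions for illustration.
    """
    for event in events:
        if event.get("severity", 0) >= 3:
            yield event


def print_events(events: Iterable[dict]) -> Iterator[bytes]:
    """Step 4 (structured -> unstructured): render events as JSON lines."""
    for event in events:
        yield (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")


def send(chunks: Iterable[bytes], sink: BinaryIO) -> None:
    """Step 5 (unstructured -> void): write the bytes to a sink."""
    for chunk in chunks:
        sink.write(chunk)


def run(source: BinaryIO, sink: BinaryIO) -> None:
    """Compose all five steps; steps 4 and 5 together form the "Publish" part."""
    send(print_events(transform(parse(acquire(source)))), sink)


if __name__ == "__main__":
    src = io.BytesIO(b'{"severity": 5, "msg": "a"}\n{"severity": 1, "msg": "b"}\n')
    out = io.BytesIO()
    run(src, out)
    print(out.getvalue().decode("utf-8"), end="")
```

Note how `parse` and `print_events` are duals, as are `acquire` and `send`. That pairing is the symmetry being described: once you know one side, the other has the same shape with the types reversed.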

[1] https://docs.tenzir.com/blog/five-design-principles-for-buil...

code_biologist|2 years ago

I don't do security, but I have been a data engineer for the better part of a decade and I don't understand what void and unstructured are. Am I the fool? I don't get it.

The primitives of many of these ETL systems are structured tables (Snowflake, Parquet, pandas DataFrames, whatever) and I don't think I'd ever choose bytes over structured tables. The unstructured parts of data systems I've worked on have always chewed up an outsize portion of labor, with difficult-to-diagnose failure modes. The biggest cognitive effort win of reverse ETL solutions has been to make external systems and applications "speak table".

mavam|2 years ago

The extra data type of unstructured/bytes is optional in that you don’t have to use it if you don’t need it. Just start with a table if that’s your use case.

In security, binary artifacts are common, e.g., scanning malware samples with YARA rules to produce a structured report (“table”). Turning packet traces into structured logs is another example. Typically you have to switch between a lot of tools for that, which makes the process complex.

(The “void” type is only there for symmetry, so that every operator has an input and an output type. An operator with void as its input is a source; one with void as its output is a sink. A “closed” pipeline is one that has both a source and a sink, and only closed pipelines can execute in our mental model.)
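The typing idea can be sketched in a few lines of Python. This is an illustration of the mental model only, not how any real engine implements it: `Kind`, `Operator`, `is_well_typed`, and `is_closed` are hypothetical names, and the rule that void may not appear between two operators is an assumption added here to keep the check strict.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Kind(Enum):
    VOID = "void"
    BYTES = "bytes"    # unstructured
    EVENTS = "events"  # structured


@dataclass(frozen=True)
class Operator:
    name: str
    in_kind: Kind
    out_kind: Kind

    @property
    def is_source(self) -> bool:
        # A void input means the operator produces data from nothing.
        return self.in_kind is Kind.VOID

    @property
    def is_sink(self) -> bool:
        # A void output means the operator consumes data and emits nothing.
        return self.out_kind is Kind.VOID


def is_well_typed(ops: Sequence[Operator]) -> bool:
    """Adjacent operators must agree on the type flowing between them.

    Assumption: void never flows between two operators.
    """
    return all(
        a.out_kind is b.in_kind and a.out_kind is not Kind.VOID
        for a, b in zip(ops, ops[1:])
    )


def is_closed(ops: Sequence[Operator]) -> bool:
    """A closed pipeline starts with a source and ends with a sink."""
    return (
        len(ops) > 0
        and is_well_typed(ops)
        and ops[0].is_source
        and ops[-1].is_sink
    )


# Hypothetical operators mirroring the five steps from the parent comment.
acquire = Operator("acquire", Kind.VOID, Kind.BYTES)
parse = Operator("parse", Kind.BYTES, Kind.EVENTS)
transform = Operator("transform", Kind.EVENTS, Kind.EVENTS)
print_op = Operator("print", Kind.EVENTS, Kind.BYTES)
send = Operator("send", Kind.BYTES, Kind.VOID)
```

With these definitions, `[acquire, parse, transform, print_op, send]` is closed, while `[parse, transform, print_op]` is well typed but open on both ends, so it could not execute on its own.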