(no title)
sseppola | 5 years ago
What I'm still trying to grasp is, first, how to assess the big data tools (Spark/Flink/Synapse/BigQuery et al.) for my use cases (mostly ETL). It just seems like Spark wins because it's the most widely used, but I have no idea how to differentiate these tools beyond the general streaming/batch/real-time taglines. Second, assessing the "pipeline orchestrator" for our use cases, where, like Spark, Airflow usually comes out on top because of usage. Would love to read more about this.
Currently I'm reading Designing Data-Intensive Applications by Kleppmann, which is great. I hope it will teach me the fundamentals of this space so it becomes easier to reason about the different tools.