sseppola | 5 years ago

Great resource, thanks for sharing it! I will dig deeper into the resources linked here, as there's a lot I have never seen before. The main topics are more or less exactly what I've found to be key in this space over the last 2 months of trying to wrap my head around data engineering in my new job.

What I'm still trying to grasp is, first, how to assess the big data tools (Spark/Flink/Synapse/BigQuery et al.) for my use cases (mostly ETL). It just seems like Spark wins because it's the most used, but I have no idea how to differentiate these tools beyond the general streaming/batch/real-time taglines. Second, how to assess the "pipeline orchestrator" for our use cases, where, as with Spark, Airflow usually comes out on top because of usage. Would love to read more about this.

Currently I'm reading Designing Data-Intensive Applications by Kleppmann, which is great. I hope this will teach me the fundamentals of this space so it becomes easier to reason about different tools.
