bonecrusher2102 | 3 years ago

This article is about organizations, but I'm curious how this plays out on web properties as well. The larger ones I imagine make use of many different data sources, and that data needs to be federated somehow, or one loses out on the insights from the aggregated data.

Does anyone have experience with how to architect this well?

closeparen|3 years ago

One Hadoop installation for the company. When you provision a messaging topic or storage instance in production, it's automatically replicated to a corresponding table in the "raw" namespace in the warehouse. Teams can check in Airflow jobs to build modeled/derived tables downstream of those as needed. Modeled tables go in team namespaces, and teams can set their own ACLs. Any tables you have access to, you can select/join in the same Hive or Presto query. It works well. It's kind of mystifying to hear about places with many different data warehouses, or about federating data between different parts of the company. Big advantages to centralization here.
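
A rough sketch of that layout, for illustration only. SQLite stands in for the Hive/Presto warehouse, attached databases stand in for namespaces, and a plain function stands in for the scheduled Airflow job. Every table, column, and namespace name other than "raw" is a made-up assumption, and ACLs are not modeled at all.

```python
import sqlite3

# Assumption: SQLite is only a stand-in for the warehouse (Hive/Presto on Hadoop).
conn = sqlite3.connect(":memory:")

# Each attached database acts as one namespace. "team_growth" is hypothetical.
conn.execute("ATTACH DATABASE ':memory:' AS raw")
conn.execute("ATTACH DATABASE ':memory:' AS team_growth")

# "raw" tables: what automatic replication from production would land.
conn.execute(
    "CREATE TABLE raw.orders (order_id INTEGER, user_id INTEGER, amount_cents INTEGER)"
)
conn.execute("CREATE TABLE raw.users (user_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO raw.orders VALUES (?, ?, ?)",
    [(1, 10, 500), (2, 10, 250), (3, 11, 900)],
)
conn.executemany(
    "INSERT INTO raw.users VALUES (?, ?)",
    [(10, "DE"), (11, "US")],
)


def build_spend_per_user(conn: sqlite3.Connection) -> None:
    """Rebuild a modeled table in the team namespace from raw data.

    In the setup described above this step would be a checked-in,
    scheduled job; here it is just a function call.
    """
    conn.execute("DROP TABLE IF EXISTS team_growth.spend_per_user")
    conn.execute(
        """
        CREATE TABLE team_growth.spend_per_user AS
        SELECT user_id, SUM(amount_cents) AS total_cents
        FROM raw.orders
        GROUP BY user_id
        """
    )


build_spend_per_user(conn)

# One query joins a team's modeled table with a raw table,
# because both live in the same warehouse.
rows = conn.execute(
    """
    SELECT u.country, s.total_cents
    FROM team_growth.spend_per_user AS s
    JOIN raw.users AS u ON u.user_id = s.user_id
    ORDER BY u.country
    """
).fetchall()
print(rows)
```

The point of the sketch is the last query: with a single warehouse, crossing from a team's modeled table to someone else's raw table is an ordinary join, with no federation layer in between.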