top | item 29947861

Ask HN: Best Practices for Creating a Golden Record or Single Source of Truth?

11 points| IWantToRelocate | 4 years ago

Hey HN.

I'm facing this challenge at my job where I receive user's data from N sources and need to create a single source of truth (a table with single entities for each user).

I'm looking for best practices, I don't even know how to start it.

When I look for that on the web I find many random "tutorials" where it ends up trying to sell you a tool for that purpose. If anyone could point me in a good direction/book/tutorial on that matter (Golden Record, SSOT, MDM, etc.) I would really appreciate it.

Ps: I use GCP (BigQuery SQL) and Python.

Thanks!

8 comments

order

stadium|4 years ago

I'd start with learning about dimensional modeling, with one of the newer Kimble data warehouse toolkit books. And also document the data flow between all of these systems to make it easier to talk through options with stakeholders.

What type of user data is it? It sounds like you have multiple "input" sources of truth, and are wanting to create a single reporting dataset that has the latest info from each?

Once you can break down the datasets into facts and dimensions, it becomes more clear which update strategies to use for each. Master data like names and email addresses usually goes into a dimension table, and transactional data like purchases, page views into a fact table.

How do you know when each record has changed in each source? Are there edge cases where changes can't be detected? How much latency is acceptable in the reporting system? 1 day? 5 minutes? Those questions are important to explore early on.

You're getting into data engineering. It's a great field to learn and the job market is hot. But if you are wanting to avoid data wrangling completely, check out kloud.io and fivetran. They do a lot of the heavy lifting. If they support most of your systems it could save you and your company a lot of time. (I'm not affiliated but know both products well from firsthand experience)

IWantToRelocate|4 years ago

thank you! We will use the data for customer segmentation purposes. For that, we will need to transform all the user's sources into a single entity for a given user, for example. And link that entity/user to another table, with their purchases so we can do the customer segmentation. I'm not sure about latency but honestly that's not really important at the moment, I just need to have a strategy to make an end-to-end solution for gathering the data, transform it and delivery a concise and coherent "user table" for the machine learning dude.

thatsamonad|4 years ago

Regardless of the tool you pick for actually managing the data, first you’ll need to go through some data classification to determine what your dimensions are and how to reconcile them.

For example, if you have multiple data sets for Contacts, what dimensions should be used to say that a record from system A is the same as a record from system B? Is it email address, full name, some other identifier? From there you have to then decide which source system is more “correct” and how to determine that.

You’ll also need to figure out if reconciliation should be automatic, manual, or both. Does a “master” record get flagged when it’s out of sync for user review, does someone have to intervene to say which values are correct and merge those records, or are you relying on some keys and algorithms to make a best attempt? In some cases it’s helpful to have a set of “staging” tables for dealing with the source data and a “pre-merge” table for resolving conflicts before overwriting the master record.

This is why many MDM platforms can charge as much as they do. They’ve already figured out decent ways to manage this for some known platforms (Salesforce, ERP systems, etc).

IWantToRelocate|4 years ago

yes, each user will have an user id (and also full name, phone and email)

jonahbenton|4 years ago

This is a deep space, lots of challenges and subtleties and lots of ways to make mistakes to cost the business and can cost you your job. Highly recommend getting clearance to engage a consultant to at minimum outline the solution domain for you, which you can either implement or have them implement then you can learn and maintain.

tmaly|4 years ago

Start with 3rd normal form when designing your tables.

stadium|4 years ago

For data warehouse reporting use cases I'd recommend starting with a star schema, not 3rd NF.