top | item 33302139


aschwad | 3 years ago

Interesting initiative! Do I understand correctly that any push mechanism goes through the ODD API, while pull mechanisms inspect the schema of data sources? Do you already have a standard for providing ETL metadata? At which level of detail are you collecting this metadata?

At BMW, the data catalogue is continuously growing and the number of datasets is increasing rapidly. We therefore had a similar problem: finding out how datasets relate to each other and how they are transformed --> we needed coarse- and fine-grained data lineage. We found a way by leveraging the Spline agent (https://github.com/AbsaOSS/spline) to make use of its execution plans, transforming them into a data model suited to our requirements, and we developed a UI to explore these relationships. We also open-sourced our approach in a

- paper: https://link.springer.com/article/10.1007/s13222-021-00387-7

- and blog post: https://medium.com/@alex.schoenenwald/fishing-for-data-linea...
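The core idea of the approach above, deriving coarse-grained lineage from what a job reads and writes, can be sketched roughly like this. Note the field names and plan structure here are simplified illustrations, not the actual Spline execution-plan schema:

```python
# Hypothetical sketch: deriving coarse-grained lineage edges from a
# Spline-style execution plan. Field names are illustrative only.

def lineage_edges(plan):
    """Return (source, target) dataset pairs for one job's execution plan."""
    sources = [op["source"] for op in plan["operations"] if op["type"] == "read"]
    targets = [op["target"] for op in plan["operations"] if op["type"] == "write"]
    # Every read feeds every write of the job at this (coarse) granularity.
    return [(s, t) for s in sources for t in targets]

plan = {
    "operations": [
        {"type": "read", "source": "s3://raw/orders"},
        {"type": "read", "source": "s3://raw/customers"},
        {"type": "write", "target": "s3://curated/orders_enriched"},
    ]
}

print(lineage_edges(plan))
```

Fine-grained (column-level) lineage would additionally walk the plan's projection and expression nodes, which is where the real data-model work lies.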


ndementev|3 years ago

Thank you!

Actually, everything works on a push basis in ODD now. ODD Platform implements the ODD Specification (https://github.com/opendatadiscovery/opendatadiscovery-speci...), and all agents, custom scripts and integrations, Airflow/Spark listeners, etc. push metadata to a specific ODD Platform endpoint (https://github.com/opendatadiscovery/opendatadiscovery-speci...). ODD Collectors (agents) push metadata on a configurable schedule.
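A collector push like the one described is, at its core, an HTTP POST of a metadata payload to the Platform. A minimal sketch, where the endpoint path, the local Platform URL, and the payload fields are all assumptions for illustration rather than the exact ODD Specification:

```python
import json
import urllib.request

PLATFORM_URL = "http://localhost:8080"  # assumed local ODD Platform instance

def build_payload(datasets):
    """Wrap a list of dataset entities in a hypothetical ingestion envelope."""
    return {"data_source_oddrn": "//postgresql/host/demo", "items": datasets}

def push(payload):
    """POST the metadata to an assumed ingestion endpoint on the Platform."""
    req = urllib.request.Request(
        PLATFORM_URL + "/ingestion/entities",  # illustrative path
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

payload = build_payload([
    {"oddrn": "//postgresql/host/demo/schemas/public/tables/users",
     "name": "users", "type": "TABLE"},
])
print(json.dumps(payload))
```

A collector would run `push(payload)` on its configured schedule; it is left uncalled here since it needs a running Platform to talk to.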

The ODD Specification is a standard for collecting and gathering such metadata, ETL included. We currently gather lineage metadata at the entity level, but we plan to expand this to column-level lineage around the end of 2022 or the start of 2023. The specification keeps the system open, and it's really easy to write your own integration by looking at the format in which metadata needs to be ingested into the Platform.
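Entity-level lineage of this kind boils down to job/transformer entities that reference their input and output datasets by identifier. A hypothetical example of what such a record might look like (the field names are illustrative, not the actual ODD Specification schema):

```python
# Hypothetical entity-level lineage record: a transformer entity pointing
# at its input and output datasets. Field names are illustrative only.
transformer = {
    "oddrn": "//airflow/dags/etl_users/tasks/load",
    "name": "load_users",
    "type": "JOB",
    "data_transformer": {
        "inputs": ["//postgresql/host/demo/schemas/public/tables/users_raw"],
        "outputs": ["//postgresql/host/demo/schemas/public/tables/users"],
    },
}

def lineage_pairs(entity):
    """Expand a transformer entity into (input, output) lineage edges."""
    dt = entity["data_transformer"]
    return [(i, o) for i in dt["inputs"] for o in dt["outputs"]]

print(lineage_pairs(transformer))
```

Column-level lineage would refine each edge down to individual fields of the input and output datasets, which is the planned expansion mentioned above.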

ODD Platform also has its own OpenAPI specification (https://github.com/opendatadiscovery/odd-platform/tree/main/...), so the already indexed and layered metadata can be extracted via the Platform's API.

Also, thank you for sharing the links with us! I'm thrilled to take a look at how BMW solved the problem of gathering lineage from Spark; that's something we are improving in our product right now.

wanderingmind|3 years ago

I'm sorry, but if it's open source, where is the code? And if there is no code, please stop calling a blog post open source.

aschwad|3 years ago

> open-sourced our approach

True, the code isn't - yet we figured that sharing the architecture, procedures, and data model could be helpful for others too. IMO this is still a way of open-sourcing an architecture.