I can't wait until the documentation gets filled out enough to see whether I want to spend an afternoon importing all my pipelines, only to then find out if it's useful for me.
Interesting initiative! Do I understand correctly that any push mechanism goes through the ODD API, while pull mechanisms check the schema of the data sources? Do you already have a standard for providing ETL metadata? At which level of detail are you collecting this metadata?
Why do demos need a Google or other login? That's so much friction. I should be able to get access to demos without having to log in.
Also, the pre-sales use case is interesting. Has anyone really had any success during pre-sales with an enterprise customer agreeing to install a collector on their applications?
Hi, we're the team behind this product! We updated our demo to include social logins so that spam doesn't get through. There are no logs being collected, and we're not selling this information either. If you don't want the online version, just head over to https://github.com/opendatadiscovery/odd-platform/tree/main/... and run it locally using docker-compose.
I really like how you describe things in terms of "use cases". What's missing for me is a clear highlight of what is actually part of ODD.
How does this differ from data catalogs like DataHub or Amundsen? I'm looking to set up a data catalog at my work, and currently I'm considering DataHub and Amundsen. I'm leaning towards DataHub simply because it doesn't require Neo4j, which we don't have any other use for.
Well, one of the differences is that we only require PostgreSQL as an external dependency, in contrast with DataHub (MySQL, Kafka, Elasticsearch). Please correct me if I'm wrong about that list of DataHub's external dependencies.
Hi Cilvic,
This is an open-source product; you can use it for free. If you have any questions or need any assistance, we'd be happy to help you, and at the same time we hope you'll help our product with your feedback and real-world use cases.
While Pachyderm (a great product, by the way) helps teams automate transformation tasks, ODD is more of a discovery/observability/monitoring solution for your pipelines. Basically, if Pachyderm helps you build a pipeline, ODD helps you monitor all of your pipelines in the context of your whole data infrastructure.
I see the motivation and skill of your group, and I hope you can retain that and build a useful contribution. I also see the extraordinary effort required to get to this point, and many projects fail to get this far, so congratulations. Something is obviously going well.
Thank you for your kind words and constructive feedback! We appreciate it.
O__________O|3 years ago
- https://opendatadiscovery.org/
Demo (requires GitHub/Google login) + Demo Video
- https://demo.oddp.io/login
- https://youtube.com/watch?v=ZSa2FWAyUic
Use cases:
- https://docs.opendatadiscovery.org/use_cases
Presentation on ODD (by HN user germanosin):
- https://youtube.com/watch?v=Y0aFqHd4h3k
dang|3 years ago
AnEro|3 years ago
germanosin|3 years ago
aschwad|3 years ago
At BMW, the data catalogue is continuously growing and the number of datasets is increasing rapidly. We therefore had a similar problem of finding out how datasets relate to each other and how they are transformed --> we needed coarse- and fine-grained data lineage. We found a way by leveraging the Spline Agent (https://github.com/AbsaOSS/spline) to make use of its Execution Plans, transforming them into a data model suited to our requirements, and we developed a UI to explore these relationships. We also open-sourced our approach in a
- paper: https://link.springer.com/article/10.1007/s13222-021-00387-7
- and blog post: https://medium.com/@alex.schoenenwald/fishing-for-data-linea...
ndementev|3 years ago
Actually, everything works on a push basis in ODD now. The ODD Platform implements the ODD Specification (https://github.com/opendatadiscovery/opendatadiscovery-speci...), and all agents, custom scripts and integrations, Airflow/Spark listeners, etc. push metadata to a dedicated ODD Platform endpoint (https://github.com/opendatadiscovery/opendatadiscovery-speci...). ODD Collectors (agents) push metadata on a configurable schedule.
The ODD Specification is a standard for collecting and gathering such metadata, ETL included. We gather lineage metadata on the entity level now, but we plan to expand this to column-level lineage around the end of 2022 / start of 2023. The Specification allows us to keep the system open, and it's really easy to write your own integration by looking at the format in which metadata needs to be pushed into the Platform.
The ODD Platform also has its own OpenAPI specification (https://github.com/opendatadiscovery/odd-platform/tree/main/...) so that the already indexed and layered metadata can be extracted via the platform's API.
Also, thank you for sharing the links with us! I'm thrilled to take a look at how BMW solved the problem of gathering lineage from Spark; that's something we are improving in our product right now.
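To illustrate what a custom push integration could look like, here is a minimal sketch of a collector-style script. The ODDRN grammar and payload field names below are illustrative assumptions, not the real schema; the actual formats are defined by the ODD Specification linked above.

```python
import json

def build_oddrn(source: str, *parts: str) -> str:
    """Build an ODDRN-style identifier: a deterministic string derived
    from the data source and the entity's path, so that every collector
    produces the same name for the same entity. (Illustrative grammar,
    not the real spec.)"""
    return "//{}/{}".format(source, "/".join(parts))

def build_push_payload(data_source_oddrn: str, entities: list) -> str:
    """Serialize one batch of scraped entities for a single push request."""
    return json.dumps({
        "data_source_oddrn": data_source_oddrn,
        "items": entities,
    })

# A hypothetical Postgres collector describing one table it scraped:
payload = build_push_payload(
    build_oddrn("postgresql", "host/db.internal", "databases/sales"),
    [{
        "oddrn": build_oddrn("postgresql", "host/db.internal",
                             "databases/sales", "tables/orders"),
        "name": "orders",
        "type": "DATASET",
    }],
)
# The collector would then POST this JSON to the platform's ingestion
# endpoint on its configured schedule.
```

The key design point is that the identifier is derived purely from the source, so the collector needs no coordination with the platform to name entities consistently.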
wanderingmind|3 years ago
weekay|3 years ago
germanosin|3 years ago
unknown|3 years ago
[deleted]
Cilvic|3 years ago
For example: all the steps under 3. are not part of ODD, or are they?
Only step 1 is performed in ODD, yes?
Personally, I'm mostly interested in lineage and would love a use case that explains real-world lineage. Say we have Redshift/Postgres and a Tableau dataset: how is the lineage generated, or is it manually maintained?
Anyway, great effort.
ndementev|3 years ago
May I ask what you mean by "all the steps under 3"? Are you referring to https://docs.opendatadiscovery.org/use_cases/dq_visibility?
As for the
> How is the lineage generated or manually maintained
All lineage in the platform is generated; it is not manually maintained by users in the UI. We leverage the ODD Specification (https://github.com/opendatadiscovery/opendatadiscovery-speci...), and all ODD Collectors (agents that scrape metadata from your data sources) send their payloads to the ODD Platform in the specification's format. The ODD Specification introduces something called ODDRNs (OpenDataDiscovery Resource Names). These are basically strings: identifiers of specific data entities. All ODD Collectors generate the same identifier for the same entity, which allows us to automatically build a lineage graph in the ODD Platform.
Not letting users manually change lineage in the UI is, in a way, our solution to one of the common lineage problems. This way users can be sure that the lineage is correct, up to date, and that no one has messed with it, at least in the UI.
Of course, since the API endpoint is documented, there's a way to change the lineage by sending a request on your own (e.g. via curl or a custom script), but I wouldn't call that manual. This approach allows companies and users to write their own integrations, keeping the system open.
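To make the ODDRN idea concrete for the Redshift/Postgres + Tableau scenario asked about above, here is a toy sketch (the identifiers and payload shapes are made up for illustration; the real formats live in the ODD Specification) of how edges reported independently by different collectors line up into one end-to-end graph simply because they name entities the same way:

```python
from collections import defaultdict

def merge_edges(payloads):
    """Union lineage edges (input ODDRN -> output ODDRN) reported by many
    independent collectors into a single adjacency map."""
    graph = defaultdict(set)
    for payload in payloads:
        for src, dst in payload["edges"]:
            graph[src].add(dst)
    return graph

def downstream(graph, start):
    """Every entity reachable from `start`, i.e. its full downstream lineage."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

table = "//postgresql/host/db.internal/databases/sales/tables/orders"
job = "//airflow/host/af.internal/dags/daily_agg/tasks/aggregate"
workbook = "//tableau/host/bi.internal/workbooks/sales_overview"

# Each tool's collector reports only the edges it can see locally:
postgres_payload = {"edges": [(table, job)]}
tableau_payload = {"edges": [(job, workbook)]}

graph = merge_edges([postgres_payload, tableau_payload])
# Because both collectors derived the same ODDRN for the Airflow task,
# the Postgres table now traces all the way to the Tableau workbook:
assert workbook in downstream(graph, table)
```

No user ever draws an edge by hand; the graph falls out of the deterministic naming.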
rolls-reus|3 years ago
ndementev|3 years ago
Cilvic|3 years ago
There is:
- the "schedule a call" link, which sounds like there is some paid version of this
- "Free and open source"
Is this a volunteer project, but you still offer to take a call?
germanosin|3 years ago
skrtskrt|3 years ago
ndementev|3 years ago
abrazensunset|3 years ago
jethkl|3 years ago
However, I am unmoved by your list of key wins (details below). If you indeed built something useful, is there a different way to deliver your message about the functionality that you enable?
Here are my reactions:
1) Shorten data discovery phase. In my experience, analysts and data scientists are always very familiar with what relevant data exists, or else they can find the right people to acquire what data they need. Often, kick-off meetings for new projects cover with stakeholders which data is useful.
2) Have transparency on how and by whom the data is used. For publicly available data, this is not something that a company usually cares about. Internal and proprietary data management is already a very mature space, and every company with such data already has processes in place to manage data access. I grant this is often a mess, but I also don't see any global solution on the horizon.
3) Foster data culture by continuous compliance and data quality monitoring. Data quality monitoring is extremely complex. I have seen many claims over many years of tools that solve this problem broadly, but I have yet to see any solution that matches the claims.
4) Accelerate data insights. This is a very bold claim for a new project, especially given the many (5+) decades of work and experience developing tools and techniques for data insights.
5) Know the sources of your dashboards and ad hoc reports. All dashboards I am aware of surface this sort of information.
6) Deprecate outdated objects responsibly by assessing and mitigating the risks. This is a good idea, but it is challenging in practice, as illustrated by several prominent examples [1,2,3].
Finally, unrelated to the above, your project's name (ODD) is very similar to the name used by the Outlier Detection DataSets (ODDS) project [4].
Good luck.
[1] http://www.lenna.org/editor.html
[2] https://scikit-learn.org/stable/modules/generated/sklearn.da...
[3] https://deepai.org/dataset/fb15k and https://paperswithcode.com/dataset/fb15k-237
[4] http://odds.cs.stonybrook.edu
ndementev|3 years ago
Let me cover some of your reactions from my perspective as a Data Engineer. Please feel free to add your own thoughts on these.
> Shorten data discovery phase. In my experience, analysts and data scientists are always very familiar with what relevant data exists, or else they can find the right people to acquire what data they need. Often, kick-off meetings for new projects cover with stakeholders which data is useful.
You're right, but in my experience that's not always the case. Sometimes finding the key person or team responsible for a dataset can be challenging. You mentioned kick-off meetings, and I agree, but they aren't always a silver bullet. Data becomes outdated or deprecated all the time, and we are trying to solve the problem of telling everyone who may be affected about this as soon and as easily as possible.
> Know the sources of your dashboards and ad hoc reports. All dashboards I am aware of surface this sort of information
Again, you are right. All dashboard services and BI tools can show you which data they are getting from which data source. But in my experience it's sometimes useful to look at the origin of the data a dashboard uses, and this is where end-to-end lineage comes in handy. Also, I consider it useful to have the metadata of all of my dashboards from all of my company's BI tools in one place.
> Deprecate outdated objects responsibly by assessing and mitigating the risks. This is a good idea, however, it is challenging
Couldn't agree more. We are working to improve not only our way of solving this problem, but the solution itself, if that makes sense. We are basically trying to find the right approach to this and offer it to everyone else. I know it's ambitious and quite a bold statement, but I hope we are getting there.
Overall, thank you for your input!
@germanosin, would you like to add something I may have missed?