Launch HN: Grai (YC S22) – Open-Source Data Observability Platform
101 points | ersatz_username | 2 years ago
Ever experienced a production outage due to changes in upstream data sources? That's a problem we encountered regularly, whether deploying machine learning or keeping a data warehouse operational, and it led us to create Grai.
Systematically testing the impact of data changes on the rest of your stack turns out to be quite difficult when the same data is copied and used across many different services and applications. Simple changes like renaming a column in a database can result in broken BI dashboards, incorrect training data for ML models, and data pipeline failures. For example, business users regularly deal with questions like "why does revenue look different in different dashboards?"
These sorts of problems are commonly dealt with by passively monitoring application execution logs for anomalies that might indicate an outage. Our goal was to move that task out of runtime, where an outage has already occurred, back into testing.
At its core, Grai is a graph of the relationships between the data in your organization, from columns in a database to JSON fields in an API. This graph allows Grai to analyze the downstream impact of proposed changes during CI and before they go live.
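The impact analysis itself is conceptually simple once the lineage graph exists. As a minimal sketch (not Grai's actual implementation; the node names and graph shape here are invented for illustration), finding everything affected by a change is a graph traversal:

```python
from collections import deque

def downstream(graph, changed):
    """Return every node reachable from `changed` by following edges downstream.

    `graph` maps each node to the nodes that consume it directly.
    """
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

# Hypothetical lineage: a column feeds an ML feature, which feeds a dashboard.
lineage = {
    "db.users.email": ["models.churn.features"],
    "models.churn.features": ["dashboards.churn"],
}

# Renaming db.users.email would impact both downstream consumers:
downstream(lineage, "db.users.email")
# → {"models.churn.features", "dashboards.churn"}
```

The hard part in practice is not the traversal but keeping the graph itself accurate, which is what the integrations handle.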
It includes a variety of pre-built integrations with common data tools such as PostgreSQL, Snowflake, dbt, and Fivetran, which automatically extract metadata and synchronize the state of your graph. It's built on a flexible data model backed by REST and GraphQL APIs and a Python client library, so users can build directly on top of Grai as they see fit. For example, because every object in Grai serializes to a YAML definition file (sort of like a CRD in Kubernetes), even if a pre-built integration doesn't exist it's fairly easy to create one manually or script a custom solution.
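For illustration, a hand-written definition for a single column node might look roughly like this (the field names here are a guess at the general shape, not Grai's exact schema; check the docs for the real format):

```yaml
# Hypothetical column-node definition file
type: Node
spec:
  name: public.users.email
  namespace: production
  metadata:
    node_type: Column
    data_type: varchar
    is_nullable: false
```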
We made the decision to build open-source from the beginning in part because we believe lineage is underutilized both organizationally and technologically. We hope to provide a foundation for the community to build cool concepts on top and have already had companies come to us with amazing ideas, like optimizing their real-time query pipelines to take advantage of spot price arbitrage between cloud and on-prem.
We try not to be overly opinionated about how organizations work, so whether you maintain a development database or run service containers in GitHub Actions, it doesn't really matter. When your tests are triggered, we evaluate the new state of the environment and check for any impacts before reporting back as a comment on the pull request.
Data observability can have unexpected benefits. One of our customers uses us because we make onboarding new engineers easier. Because we render an infinitely zoomable, Figma-like graph of the entire data stack, it's possible for them to visually explore end-to-end data flows and application dependencies.
You can find a quick demo here: https://vimeo.com/824026569. We've also put together an example getting-started guide if you want to try things out yourself: https://docs.grai.io/examples/enhanced-dbt. Since everything is open source, you can always explore the code (https://github.com/grai-io/grai-core) and docs (https://docs.grai.io), where we have example deployment configurations for docker-compose and Kubernetes.
We would love to hear your feedback. If there's a feature we're missing, we'll build it. If you have a UX or developer experience suggestion, we'll fix it. If it's something else, we want to hear about it too. Thank you in advance!
whytai|2 years ago
Any plans to support Airflow in the future? Would love to have something like this for our company's 500k+ Airflow jobs.
ersatz_username|2 years ago
More generally, we can embed the transformation logic of each stage of your data pipelines into the edge between nodes (like two columns). Like you said, in the case of SQL there are lots of ways to statically analyze that pipeline, but it becomes much more complicated with something like pure Python.
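To make the SQL case concrete, here's a deliberately toy sketch of column-level lineage extraction for simple `SELECT col AS alias FROM table` queries. Real tools use a full SQL parser (e.g. sqlglot), not regexes; this only shows the idea of mapping output columns back to source columns:

```python
import re

def toy_column_lineage(sql):
    """Naive lineage extraction: map `table.source_column` -> output alias.

    Only handles a flat projection over a single table; a real
    implementation needs a proper SQL parser.
    """
    m = re.search(r"select\s+(.*?)\s+from\s+(\w+)", sql, re.I | re.S)
    select_list, table = m.group(1), m.group(2)
    lineage = {}
    for item in select_list.split(","):
        col = re.match(r"\s*(\w+)(?:\s+as\s+(\w+))?\s*$", item, re.I)
        src, alias = col.group(1), col.group(2) or col.group(1)
        lineage[f"{table}.{src}"] = alias
    return lineage

toy_column_lineage("SELECT amount AS revenue, region FROM orders")
# → {"orders.amount": "revenue", "orders.region": "region"}
```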
As an intermediate solution, you can manually curate data contracts or assertions about application behavior in Grai, but these inevitably fall out of sync with the code.
Airflow has a really great API for exposing task-level lineage, but we've held off integrating it because we weren't sure how to convert that into robust column- or field-level lineage as well. How are y'all handling testing / observability at the moment?
pdimitar|2 years ago
If you allow me a remark on the website: it requires JS from 8 separate domains to show content, which is fine, but I know that more technical readers can be sensitive to these aspects. Secondly, the Dark Reader browser add-on doesn't work well with the website, so I had to turn it off and could only browse it in light mode.
Perhaps these could be actionable points for the future.
Good job and keep going!
ersatz_username|2 years ago
slotrans|2 years ago
I have experienced "devs changed something and it broke reporting" more times than I can count. Typically the reason boils down to 1) they don't care, or 2) their management doesn't care. That has always felt like an insurmountable cultural problem, but I do wonder: if a bot posted a PR comment on breaking changes before they got deployed, might that move the needle, just a little?
debarshri|2 years ago
There's Datafold [1], Databand (acquired by IBM), Atlan, and Great Expectations, to name a few, doing very similar things.
Just looking at the video, I could not figure out what differentiates you. I hope you have success in the space.
[1] https://www.datafold.com
ersatz_username|2 years ago
Here are some things we think are really important, though:
1. Data quality testing ideally happens during CI not after merge.
2. Developers come first. Virtually every aspect of the tool can be customized, modified, and extended down to the basic data model without changing any upstream core code. Want to build your own custom application on top of your data lineage? Great! Have at it!
3. Users should be able to own not just their own data but their own metadata. We go to great lengths to maintain feature parity between the cloud and self-hosted application.
ssddanbrown|2 years ago
[1] https://github.com/grai-io/grai-core/blob/master/LICENSE [2] https://opensource.org/osd/
satvikpendem|2 years ago
ersatz_username|2 years ago
We believe a project like this needs financial backing and a dedicated team driving development, but therein lies the tension. The common monetization paths either lock critical self-hosted capabilities like SSO behind a paywall and/or monetize through a cloud-hosted option.
The Elastic License is an attempt to maintain feature parity between the cloud and self-hosted tool while still being protected from something like the big cloud providers ripping off the code altogether.
In all seriousness though, we would love to hear suggestions if you think there's a better path.
swordsmith8|2 years ago
Unlike a data observability platform like Monte Carlo, which proactively monitors data, am I correct in assuming that your solution is less focused on data observability (i.e. monitoring production data and conducting root cause / impact analysis) and more on ensuring reliable CI/CD?
ersatz_username|2 years ago
We actually already do data monitoring as well, although we haven't built the specific alerting features of Monte Carlo. There are quite a few tools that do that really well, so it's not our focus at the moment.
kevinmershon|2 years ago
ersatz_username|2 years ago
If you have a different toolset, onboarding will look exactly the same; there's nothing truly dbt-specific at work here. It's a good idea, though! We really should put together a few other combinations so more people can see their own stack represented.
boredemployee|2 years ago
BlackjackCF|2 years ago
Just FYI, I’m getting a “failed to load search index” error in your docs.
Also I saw GitHub Actions called out in the workflow. Do you have GitLab support?
ersatz_username|2 years ago
We haven't had anyone request GitLab yet but would love to add support! Any chance you'd be willing to beta test for us? If so, shoot me an email at ian@grai.io :).
EDIT: It looks like the index issue is related to our search provider. Were you able to eventually load the page or is it fully blocking you?
pjot|2 years ago
And that it took 2+ weeks to train their models with the table metadata - so time to value for my team was always “in two weeks”.
Glad to see y’all going against that trend!
ersatz_username|2 years ago
MattSWilliamson|2 years ago
NortySpock|2 years ago
As I see it, DataHub already gives you data lineage. Is that not enough?
James_Bowers|2 years ago
mdaniel|2 years ago