Launch HN: Mozart Data (YC S20) – One-stop shop for a modern data pipeline
106 points | pfduke02 | 4 years ago
Ten years ago, we started a hot sauce company, Bacon Hot Sauce, together. But more relevantly, we have spent the last two decades building data pipelines at startups like Clover Health, Eaze, Opendoor, Playdom, and Zenefits. For example, at Yammer, we built a tool called “Avocado,” which was our end-to-end analysis tool chain -- we loaded data from our production database and relevant SaaS tools like Salesforce, we scheduled data transformations (similar to Airflow), and we had a front-end BI tool where we wrote and shared queries and dashboards. Today Avocado lives on as two tools, Mozart Data and Mode Analytics (a collaborative analytics tool). We have basically been building similar data tools for years (though the names and underlying technologies have changed).
Dan & I decided to build a product to bring the same tools and technology to earlier-stage companies (so that you don’t need to make an early hire in data engineering). We’ve built a platform where business users can load data and create & schedule transformations with just SQL, wrapped in an interface anyone can use -- no Python, no Jinja, no custom language. We connect to over 150 SaaS tools and databases; most just need credentials to send data to Mozart. There is no need to define DAGs (we parse your SQL transforms to automatically infer the way data flows through the pipeline). Mozart does the rote and cumbersome data engineering that typically takes a while to set up and maintain, so that you can tackle the problems your company is uniquely suited to solve.
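Automatic DAG inference from SQL can be sketched in a few lines (a toy illustration, not Mozart's actual parser; the table names and the regex-based extraction are assumptions): scan each transform for the tables it reads, then topologically sort.

```python
import re
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical transforms: output table name -> SQL body.
transforms = {
    "clean_orders": "SELECT * FROM raw_orders WHERE status != 'test'",
    "ltv": "SELECT customer_id, SUM(amount) FROM clean_orders GROUP BY customer_id",
}

def referenced_tables(sql):
    # Naive extraction: any identifier following FROM or JOIN.
    return set(re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE))

# Each transform depends on the tables its SQL reads, so the edges of
# the pipeline DAG fall out of the SQL itself with no manual wiring.
graph = {name: referenced_tables(sql) for name, sql in transforms.items()}
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['raw_orders', 'clean_orders', 'ltv']
```

A production parser would need a real SQL grammar (CTEs, quoted identifiers, schema prefixes), but the principle is the same: dependencies are inferred, not declared.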
Most data companies have focused on a single slice of the data pipeline (ETL, warehousing, BI). The maturation of data tools over the last decade has made now the time to combine them into an easy solution accessible to data scientists and business operations alike. We believe that there is immense value in centralizing and cleaning your data, as well as setting up the core tables for downstream analysis in your BI tool. Customers like Rippling, Tempo, & Zeplin use Mozart to automate key metrics dashboards, calculate CAC and LTV, or identify customers at risk of churn. We want to empower the teams -- like revenue and sales ops -- that have a lot of data, know what they want to do with it, but don’t have the engineering bandwidth to execute it.
Try us out and see for yourself: you can sign up (https://app.mozartdata.com/signup) and immediately start loading, querying, cleaning, and analyzing your data in Mozart. We offer a free 14-day trial (no credit card required). After the free trial, we charge metered pricing based on compute time used and data ingested. We’d love to hear about your experiences with data pipelines and any ideas/feedback/questions you might have about what we’re building.
mpeg|4 years ago
You say you charge metered pricing, but this information seems to be missing from your site. I understand it's hard to price a new product, but I personally need to know pricing before I can recommend a product to a client – the more available this information is, the easier it is for me to compare you to others.
I do like the SQL transforms. They don't replace DAG orchestration tools like Airflow, but it's a very nice feature that covers a lot of what companies with basic data needs will want.
pfduke02|4 years ago
In terms of pricing, we charge by monthly active rows (MAR) and compute time. An introductory package with 500k MAR and 500k compute seconds costs $1,000/month, but we try to tailor to individual companies' needs.
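As a back-of-the-envelope illustration, the introductory package above amounts to a simple usage check (thresholds are the numbers quoted; the function itself is hypothetical, not Mozart's billing code):

```python
# Introductory package quoted above: 500k monthly active rows (MAR)
# and 500k compute seconds for $1000/month.
INTRO_MAR = 500_000
INTRO_COMPUTE_SECONDS = 500_000
INTRO_PRICE_USD = 1000

def fits_intro_package(mar, compute_seconds):
    # True if a month's usage stays within the introductory package.
    return mar <= INTRO_MAR and compute_seconds <= INTRO_COMPUTE_SECONDS

print(fits_intro_package(350_000, 420_000))  # True
print(fits_intro_package(800_000, 100_000))  # False: needs a larger plan
```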
pfduke02|4 years ago
Here’s a more thorough writeup from our CTO, Dan… https://www.mozartdata.com/post/mozart-data-cto-and-co-found...
dsil|4 years ago
There are workarounds. For database connectors and some other connectors, we let you specify which schemas/tables/columns to sync, so you can choose not to sync PII columns (or hash them) and still get a ton of value from the other data and/or aggregates.
This doesn't apply to PHI, but some of our customers pull all their data into Mozart, write data transformations within Mozart to redact sensitive fields, then use role-based access control to give the rest of the company full access to the redacted tables, while only certain people have access to the full data.
That said, the security of our customers' data is our top priority regardless of what type of data it is. We're currently being audited for SOC 2 Type 2.
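The redact-then-grant pattern described above can be sketched like this (illustrative only: SQLite stands in for the warehouse and the `users` table is made up; in a real warehouse the view would be paired with role grants):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, plan TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'a@example.com', 'pro')")

# Register a hash function so the view exposes a stable pseudonym
# instead of the raw email address.
conn.create_function(
    "sha256", 1, lambda s: hashlib.sha256(s.encode()).hexdigest()
)
conn.execute("""
    CREATE VIEW users_redacted AS
    SELECT id, sha256(email) AS email_hash, plan FROM users
""")

# The rest of the company queries the view; only the view's columns
# are visible, so the raw email never leaves the base table.
row = conn.execute("SELECT email_hash, plan FROM users_redacted").fetchone()
print(row[1])  # 'pro'
```

The hashed column still supports joins and counts, which is usually all downstream analysis needs.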
Eidamj1|4 years ago
- do you have a wrapper around Snowflake?
- do you support data streaming?
- who are your target customers (size, domain, etc.)?
- have customers identified gaps in their own data coverage/needs to use this pipeline (i.e. 1st party data is limited)? If so, where do you point them to cover any gaps (e.g. external sources or partners)?
- have you received any feedback that says whether customers are not able to make progress with their BI, not due to ETL, but as a result of poor/unmaintainable data modeling?
- how do you handle scenarios where customers prefer to host their own data? Is that common?
- is it possible for customers to run certain components of the ETL process/pipeline on their own systems? Have you found that to be a frequent request so far?
chrisfrantz|4 years ago
Having some experience here, I can say that this is typically not a quick process since it depends so much on third-parties, so it's really cool you have such a large library of connectors.
dvt|4 years ago
I'm not sure this is even doable without a dedicated data scientist, but a potential solution is a two-way marketplace that connects companies with data scientists to help make heads or tails of the data they're storing. Otherwise, it's just sitting in a data lake somewhere. (Not sure if something like this exists already, I'm just thinking out loud.)
pfduke02|4 years ago
I also believe that company context matters a lot. So much of getting started with extracting value from data is climbing the learning curve of understanding what the data means (which columns have the truth). One of the reasons we don’t have a lot of canned reports is that understanding these edge cases within a company often matters a lot, and not accounting for the nuance can easily lead to a wrong inference. With this in mind, the explosion of ETL solutions and products like Mozart Data means that others at the company can specialize in their business context, as opposed to needing someone who can do all aspects of data: engineering, data science, analysis, and communicating/presenting it.
shoo|4 years ago
The consulting "data scientist" is likely able to do a better job if they have experience with the idiosyncrasies of the individual company's operations. If you get a fresh data scientist every time, they need to repeat the ramp-up period before they are in a position to maybe add value.
This suggests a model where the company keeps the same consultant on retainer and brings them on board each time a situation pops up where the consultant may be able to assist.
(This isn't a particularly novel suggestion; the same suggestion is made in a 60s/70s-era thesis investigating how applicable operations research is to small businesses.)
sbr464|4 years ago
Are you partnered with them or would there be additional Fivetran fees if an integration went through them? I noticed when clicking on the Xero integration.
dsil|4 years ago
Components of this certainly already exist; we're trying to put it all together in a single platform and make this functionality easier to use.
zomglings|4 years ago
1. What do multi-source joins look like?
2. How expensive are they as a function of the sizes of the "tables" being joined?
dsil|4 years ago
(Some call this model "ETLT", where the first "ETL" part is just moving data from APIs or other databases into a shared database, and the extra "T" is joining that data across sources or otherwise organizing it in useful ways.)
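A toy version of that extra "T" (hypothetical table names; SQLite stands in for the shared warehouse): once both sources have landed in one database, a cross-source join is ordinary SQL, and its cost is simply the warehouse's usual join cost for the row counts involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Pretend the E/L steps already copied a CRM and a billing system
# into the shared database.
conn.executescript("""
    CREATE TABLE salesforce_accounts (account_id TEXT, name TEXT);
    CREATE TABLE stripe_charges (account_id TEXT, amount_cents INTEGER);
    INSERT INTO salesforce_accounts VALUES ('a1', 'Acme'), ('a2', 'Globex');
    INSERT INTO stripe_charges VALUES ('a1', 5000), ('a1', 2500), ('a2', 900);
""")

# The extra "T": revenue per CRM account, joined across sources.
rows = conn.execute("""
    SELECT a.name, SUM(c.amount_cents) AS revenue_cents
    FROM salesforce_accounts a
    JOIN stripe_charges c USING (account_id)
    GROUP BY a.name
    ORDER BY a.name
""").fetchall()
print(rows)  # [('Acme', 7500), ('Globex', 900)]
```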