Launch HN: Opstrace (YC S19) – open-source Datadog
Seb here, with my co-founder Mat. We are building an open-source observability platform aimed at the end user. We assemble what we consider the best open source APIs and interfaces such as Prometheus and Grafana, but make them as easy to use and featureful as Datadog, with for example TLS and authentication by default. It's scalable (horizontally and vertically) and upgradable without a team of experts. Check it out here: http://opstrace.com/ & https://github.com/opstrace/opstrace
About us: I co-founded dotCloud which became Docker, and was also an early employee at Cloudflare where I built their monitoring system back when there was no Prometheus (I had to use OpenTSDB :-). I have since been told it's all been replaced with modern stuff—thankfully! Mat and I met at Mesosphere where, after building DC/OS, we led the teams that would eventually transition the company to Kubernetes.
In 2019, I was at Red Hat and Mat was still at Mesosphere. A few months after IBM announced it was purchasing Red Hat, Mat and I started brainstorming problems that we could solve in the infrastructure space. We started interviewing a lot of companies, always asking them the same questions: "How do you build and test your code? How do you deploy? What technologies do you use? How do you monitor your system? Logs? Outages?" A clear set of common problems emerged.
Companies that used external vendors (such as CloudWatch, Datadog, SignalFx) grew to a certain size where cost became unpredictable and wildly excessive. As a result (one of many downsides we would come to uncover), they monitored less: just error logs, no real metrics or logs in staging/dev, and metrics turned off in prod to reduce cost.
Companies going the opposite route—choosing to build in-house with open source software—had different problems. Building their stack took time away from their product development, and resulted in poorly maintained, complicated messes. Those companies are usually tempted to go to SaaS but at their scale, the cost is often prohibitive.
It seemed crazy to us that we are still stuck in this world where we have to choose between these two paths. As infrastructure engineers, we take pride in building good software for other engineers. So we started Opstrace to fix it.
Opstrace started with a few core principles: (1) The customer should always own their data; Opstrace runs entirely in your cloud account and your data never leaves your network. (2) We don’t want to be a storage vendor—that is, we won’t bill customers by data volume because this creates the wrong incentives for us. (AWS and GCP are already pretty good at storage.) (3) Transparency and predictability of costs—you pay your cloud provider for the storage/network/compute for running Opstrace and can take advantage of any credits/discounts you negotiate with them. We are incentivized to help you understand exactly where you are spending money because you pay us for the value you get from our product with per-user pricing. (For more about costs, see our recent blog post here: https://opstrace.com/blog/pulling-cost-curtain-back). (4) It should be REAL Open Source with the Apache License, Version 2.0.
To get started, you install Opstrace into your AWS or GCP account with one command: `opstrace create`. This installs Opstrace in your account, creates a domain name and sets up authentication for you for free. Once logged in you can create tenants that each contain APIs for Prometheus, Fluentd/Loki and more. Each tenant has a Grafana instance you can use. A tenant can be used to logically separate domains, for example, things like prod, test, staging or teams. Whatever you prefer.
At the heart of Opstrace runs a Cortex (https://github.com/cortexproject/cortex) cluster to provide the above-mentioned scalable Prometheus API, and a Loki (https://github.com/grafana/loki) cluster for the logs. We front those with authenticated endpoints (all public in our repo). All the data ends up stored only in S3 thanks to the amazing work of the developers on those projects.
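To make "authenticated Prometheus API" concrete, here is a minimal sketch of what a query against such an endpoint could look like. The `/api/v1/query` path and the `query` parameter come from the standard Prometheus HTTP API; the host name, the bearer-token scheme, and the absence of any path prefix are assumptions for illustration, not confirmed details of an Opstrace deployment. The request is only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_instant_query(base_url: str, token: str, promql: str) -> Request:
    """Build (but do not send) a GET request for the Prometheus instant-query API."""
    query_string = urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/api/v1/query?{query_string}"
    # Assumption: the endpoint accepts a bearer token in the Authorization header.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Hypothetical tenant endpoint and token, for illustration only.
req = build_instant_query(
    "https://cortex.prod.example.invalid", "TOKEN", 'up{job="api"}'
)
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`; a deployment may serve the API under a different prefix, so check the tenant's documented endpoint first.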
An "open source Datadog" requires more than just metrics and logs. We are actively working on a new UI for managing, querying and visualizing your data and many more features, like automatic ingestion of logs/metrics from cloud services (CloudWatch/Stackdriver), Datadog compatible API endpoints to ease migrations and side by side comparisons and synthetics (e.g. Pingdom). You can follow along on our public roadmap: https://opstrace.com/docs/references/roadmap.
We will always be open source, and we make money by charging a per-user subscription for our commercial version which will contain fine-grained authz, bring-your-own OIDC and custom domains.
Check out our repo (https://github.com/opstrace/opstrace) and give it a spin (https://opstrace.com/docs/quickstart).
We’d love to hear what your perspective is. What are your experiences related to the problems discussed here? Are you all happy with the tools you’re using today?
[+] [-] brodouevencode|5 years ago|reply
[+] [-] nrmitchi|5 years ago|reply
If you can disable them at the agent level and avoid the data out that would be even better.
At a previous employer, the defaults were quite literally half of the log volume that we were paying for. I was doing a sanity check before renewing our Datadog contract and was very not-pleased to discover that.
[+] [-] cyberpunk|5 years ago|reply
[+] [-] spahl|5 years ago|reply
[+] [-] alexchamberlain|5 years ago|reply
[+] [-] tailspin2019|5 years ago|reply
Datadog has a UI. Does Opstrace? Or is it just a CLI/API based tool?
If you actually have a UI element to your product you’re doing a huge disservice to yourself by not actually showing this anywhere...
EDIT: I don’t mean to sound negative, I’m wondering if positioning this against Datadog is going to create immediate, potentially incorrect, expectations in people’s minds as to what this product might provide.
From first impressions I’d say this is much closer to Prometheus (which does have a UI, but it’s so basic it may as well not have one - but then the UI is not the point of Prometheus).
[+] [-] fat-apple|5 years ago|reply
[+] [-] Denzel|5 years ago|reply
I’m surprised the mods haven’t edited this title.
Source: I’m an engineer that’s used, operated and hacked on a medium-sized prom+grafana; and used Datadog at a large, multi-region, global scale.
[+] [-] sciurus|5 years ago|reply
Why should someone choose Opstrace over purchasing from them directly?
[+] [-] englambert|5 years ago|reply
That being said, in case you couldn’t tell, we love software from Grafana Labs. It’s popular for a reason. However, we want it to be as easy to install and maintain as clicking a button, i.e., as simple as Datadog. So one problem we are trying to solve today is that while, yes, you can stitch together all of their OSS projects yourself (and many, many people do), it’s a non-trivial exercise to set up and then maintain. We’ve done it ourselves, seen friends go through it; we’d like to stop everyone from becoming a subject matter expert and reinventing the wheel. (Especially since when our friends do it themselves they always skimp on important things like, say, security.) Bottom line—we’re inspired by Grafana Labs. We strive to also be good OSS stewards and contribute to the overall ecosystem like they have.
Another way to solve the “stitching-it-together” problem, as you mentioned, is of course to pay Grafana Labs for their SaaS (which I’ve done in the past) or one of their on-prem Enterprise versions. However, these are not open source. The former is hosted in their cloud account and single-tenant; the latter have no free versions. We think Opstrace provides a lot of value, but we understand that it’s not for everyone.
[+] [-] dudeinjapan|5 years ago|reply
In a nutshell, running all these various components (Grafana, etc) is a royal pain in the neck. Even if `opstrace create` spawns them easily, the problem is running/maintaining them. We want someone to run these for us as a SaaS/PaaS and we're happy to pay them.
Re: your principles:
(1) The customer should always own their data --> we agree. However, we are happy for you to be a custodian of that data.
(2) We don’t want to be a storage vendor --> neither do we. We want storage to be someone else's problem. We're happy for you to use a cloud platform like AWS/GCP and charge us a 50% markup.
(3/4) Transparency, predictability of costs, open source --> all excellent.
[+] [-] jgehrcke|5 years ago|reply
> is a royal pain in the neck.
It's fun to see how different people put the same unpleasant experience into words in this thread. Thanks for adding your personal touch. Every time we hear something like that, we're re-assured that we're on the right track.
> Even if `opstrace create` spawns them easily, the problem is running/maintaining them
Yes. You're right. While we can be proud of our setup/installation process already, we know that there's so much more to it. We don't underestimate that. Maybe also see https://news.ycombinator.com/item?id=25998587, where I just commented on the robustness topic.
> However, we are happy for you to be a custodian of that data.
Great.
> We want storage to be someone else's problem.
I share that perspective. We, of course, are happy to let S3/GCS do the actual job.
> We're happy for you to use a cloud platform like AWS/GCP and charge us a 50% markup.
That's great to hear, and I hope you can be enthusiastic about the fact that our markup is _not_ going to be relative to storage volume. It's going to be independent of that.
> Transparency, predictability of costs, open source --> all excellent.
Thanks for sharing. That's incredibly motivating.
Keep an eye on us, and we'd love to hear from you!
[+] [-] xyzzy_plugh|5 years ago|reply
[+] [-] BringerOfChaos|5 years ago|reply
It might not fit your use case...but it might.
[+] [-] jarym|5 years ago|reply
[+] [-] fat-apple|5 years ago|reply
[+] [-] hangonhn|5 years ago|reply
I was the engineer who was heavily involved with monitoring at my last job and a lot of what this is doing aligns with what I would have done myself. At my new job, I work on different stuff but I can see we're going to run into monitoring issues soon too. I'm so, so, so glad this is an option because I do not want to rebuild that stuff all over again. Getting monitoring scalable and robust is HARD!
[+] [-] englambert|5 years ago|reply
[+] [-] boundlessdreamz|5 years ago|reply
2. When Opstrace is set up in AWS/GCP, what is the typical fixed cost?
[+] [-] fat-apple|5 years ago|reply
(1) As it stands today, you can already use https://vector.dev/docs/reference/sinks/prometheus_remote_wr... to write metrics directly to our Prometheus API. You can also use https://vector.dev/docs/reference/sinks/loki/ to send your logs to our Loki API. Vector is very cool in our opinion and we’d love to see if there is more we can do with it. What are your thoughts?
(2) As for cost, our super early experiments (https://opstrace.com/blog/pulling-cost-curtain-back) indicate that ingesting 1M active series with 18-month retention is less than $30 per day. It is a very important topic and we've already spent quite a bit of time on exploring this. Our goal is to be super transparent (something you don’t get with SaaS vendors like Datadog) by adding a system cost tab in the UI. Clearly, the cost depends on the specific configuration and use case, i.e. on parameters such as load profile, redundancy, and retention. A credible general answer would come in the shape of some kind of formula, involving some of these parameters -- and empirically derived from real-world observations (testing, testing, testing!). For now, it's fair to say that we're in the observation phase -- from here, we'll certainly do many optimizations specifically towards reducing cost, and we'll also focus on providing good recommendations (because as we all know cost is just one dimension in a trade-off space). We're definitely excited about the idea of providing users useful, direct insight into the cost (say, daily cost) of their specific, current Opstrace setup (observation is key!). We've talked a lot about "total cost of ownership" (TCO) in the team.
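As a rough illustration of how the quoted figure scales, the arithmetic below converts "less than $30 per day for 1M active series" into a monthly upper bound and a per-1,000-series figure. Only the $30 and 1M numbers come from the comment above; the 30-day month is an assumption, and the result is an upper bound from one early experiment, not a price list.

```python
DAILY_COST_USD = 30.0      # upper bound quoted above ("less than $30 per day")
ACTIVE_SERIES = 1_000_000  # active series in the quoted experiment
DAYS_PER_MONTH = 30        # assumption: a 30-day billing month

# Monthly upper bound for the whole setup.
monthly_cost = DAILY_COST_USD * DAYS_PER_MONTH

# Same bound, normalized to 1,000 active series.
cost_per_thousand_series_month = monthly_cost / ACTIVE_SERIES * 1000

print(f"monthly upper bound: ${monthly_cost:,.0f}")
print(f"per 1,000 active series per month: ${cost_per_thousand_series_month:.2f}")
```

A real formula would also need terms for load profile, redundancy, and retention, as noted above; this sketch deliberately ignores them.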
[+] [-] stevemcghee|5 years ago|reply
[+] [-] tamasnet|5 years ago|reply
Also, please don't forget about people (like me) who don't run on $MAJOR_CLOUD_PROVIDER. I'd be curious to try this e.g. on self-operated Docker w/ Minio.
[+] [-] nickbp|5 years ago|reply
[+] [-] sneak|5 years ago|reply
Seems to me that these are at odds. If you're open source, why does anyone have to pay for these things?
If you're open core, I think it's mighty misleading to say things like "We will always be open source" because then not only is it untrue on its face, but also if someone contributes useful features to the open source project that compete with or supplant your paid proprietary bits, you are incentivized to refuse to merge their work - extremely not in the spirit of open source.
My perspective, which you asked for, is that open core is dishonest, and that you should be honest with yourselves about being a proprietary software vendor if that's indeed your plan, and stop with the open source posturing.
If I've misunderstood you, then I apologize.
[+] [-] davelester|5 years ago|reply
There’s nothing inherently dishonest when a company emphasizes their open source strategy. Open source community building is as much about shipping code as it is leading people, and that requires you to be transparent about your intentions. I’ve interpreted opstrace’s release as just that.
I think the concern about neutral project governance is an important one. It’s early days, but from what I’ve seen it seems clear what is being sold vs what is open today. The fact that the project is released under the Apache v2 license means that folks are able to reuse, distribute, and sell the project as they wish, and even fork it if they dislike the direction. That said, if governance is a priority for your use, I’d definitely look to projects in neutral software foundations like the Apache Software Foundation and CNCF.
[+] [-] fat-apple|5 years ago|reply
Our intention is to be really transparent with how we build and price software, which is why our commercial features will also be public in our repo, but commercially licensed. Transparency is critical in our opinion.
This is the model we’ve seen work for other highly impactful software projects.
We’ve created a ticket to track our addition of commercial code to our repo: https://github.com/opstrace/opstrace/issues/319
[+] [-] kazinator|5 years ago|reply
If they used "free software" language, then we might have a case for posturing.
[+] [-] zaczekadam|5 years ago|reply
My two points - right now docs are clearly targeting users familiar with the competition but for someone like me who does not know similar products, a 'how it works' section with examples would be awesome.
Fingers crossed!
[+] [-] jgehrcke|5 years ago|reply
[+] [-] arianvanp|5 years ago|reply
[+] [-] englambert|5 years ago|reply
[+] [-] mrwnmonm|5 years ago|reply
Wish you all the best, and congratulations!
[+] [-] fat-apple|5 years ago|reply
[+] [-] snissn|5 years ago|reply
[+] [-] nickbp|5 years ago|reply
Opstrace does ship with a "system" tenant designed for monitoring the Opstrace system itself. This tenant has built-in dashboards that we've designed to show you the health of the Opstrace system.
Incidentally, having sharable "dashboards" across people/teams/organizations is something we are also working on, so people don't have to re-invent dashboards all the time.
We also have some guidelines for you to ingest metrics from Kubernetes clusters (https://opstrace.com/docs/guides/user/instrumenting-a-k8s-cl...) and are building native cloud metrics collection. Feel free to follow along in GitHub: https://github.com/opstrace/opstrace/issues/310.
[+] [-] spahl|5 years ago|reply
Thanks for the feedback, we appreciate it!
[+] [-] tmzt|5 years ago|reply
More broadly, how are you contributing to the upstream projects?
[1] https://github.com/grafana/loki/issues?page=7&q=is%3Aissue+i...
[+] [-] jgehrcke|5 years ago|reply
> Are you doing anything to improve log search versus an Elasticsearch cluster, for instance?
No. You're right, Loki is not designed for building up an index for full-text search. The premise here is that you won't typically need that, and that in exchange for not having to build up that index, you get other advantages (such as being able to rely on an object store for both payload and index data!). If, on the other hand, in special situations, you need to "grep-search" your logs, this is absolutely doable with Loki! Loki does not neglect this use case. The opposite is true; people are already excited about the performance characteristics that Loki has today when it comes to ad-hoc full-text processing. For example, see https://twitter.com/Kuqd/status/1336722211604996098 and definitely have a look at https://grafana.com/blog/2020/12/08/how-to-create-fast-queri.... I'm sure Cyril is happy to answer your questions, too!
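For readers who have not seen what such a "grep-search" looks like, here is a minimal sketch. It builds a URL for Loki's `query_range` HTTP endpoint using a LogQL stream selector plus the `|=` line filter (which keeps lines containing the given string). The host name, the `app="web"` label, and the search string are made up for illustration, and authentication is left out.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def loki_grep_url(base_url: str, selector: str, needle: str, limit: int = 100) -> str:
    """Build a Loki query_range URL that keeps only lines containing `needle`."""
    # Escape backslashes and double quotes so the needle is a valid LogQL string.
    escaped = needle.replace("\\", "\\\\").replace('"', '\\"')
    logql = f'{selector} |= "{escaped}"'
    params = urlencode({"query": logql, "limit": limit})
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


# Hypothetical endpoint and label, for illustration only.
url = loki_grep_url("https://loki.prod.example.invalid", '{app="web"}', "connection reset")
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

The selector narrows the search to a set of streams first, and the filter then scans only those streams' contents, which is why no full-text index is needed.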
> how are you contributing to the upstream projects?
We're reporting issues and trying to contribute as much as we can! Of course, this effort has only started. So far, we've contributed to Loki's Fluentd plugin (https://github.com/grafana/loki/pulls?q=is%3Apr+author%3Ajge...), and our testing efforts have helped reveal edge cases; see for example https://github.com/grafana/loki/issues/2124 and https://github.com/grafana/loki/issues/3085.
We're excited to substantially contribute to both Loki and Cortex in the future!
[+] [-] rubiquity|5 years ago|reply
On top of bad UX, I do think the storage layer is where customers are really getting hit by these companies. The big players are using very unoptimized ingestion and querying layers and pretending like tiered storage never happened. Developers share some of the blame too by not being at all pragmatic about how long and how much to keep. It's a tough nut to crack.
What's the plan for commercial? They run it themselves and pay per user? If so, that's refreshing.
[+] [-] nickbp|5 years ago|reply
We will also have a blog post about bad UX in a couple weeks… stay tuned. What are some of your biggest gripes about UX?
[+] [-] GeneralTspoon|5 years ago|reply
We just moved away from Datadog because their log storage pricing is too high for us. We moved to BigQuery instead. But the interface kind of sucks.
Would love to get this up and running. A couple of questions:
1. Is it possible to set up outside of AWS/GCP? I would like to set this up on a dedicated server.
2. If not - then do you have a pricing comparison page where you give some example figures? e.g. to ingest 1 billion log lines from Apache per month it will cost you roughly $X in AWS hosting fees and $Y per seat to use Opstrace
[+] [-] fat-apple|5 years ago|reply
We've done a deep dive into the cost model for metrics and posted more about it here: https://opstrace.com/blog/pulling-cost-curtain-back. We are still working on a full cost analysis for logs - I'd be happy to send it to you once we have it (feel free to email me [email protected] to chat about your use case). Our goal is to be super transparent (see https://news.ycombinator.com/item?id=25992081) with cost and to have a page on our website that helps someone determine what to expect (probably some sort of calculator with live data). Our UI will also show you exactly what your system is currently costing you with some breakdown for teams or services so you know who/what is driving your monitoring cost. We're doing user testing on our to-be-released UI now and would love to have people like yourself give us early feedback (since you mentioned the BigQuery interface).
[+] [-] mleonhard|5 years ago|reply
> opstrace destroy PROVIDER CLUSTER_NAME
> opstrace list PROVIDER
I want to keep cluster config in source control, track deployment changes in code reviews, and automate deployments. Do you have any plans to add an 'apply' command to support this?
$ opstrace apply -c CONFIG_FILE_PATH [--dry-run] PROVIDER CLUSTER_NAME
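To illustrate what a `--dry-run` could report, here is a minimal sketch of the comparison step only: diffing the running config against the one in source control. The `apply` command is a proposal, not an existing Opstrace feature, and the function name and config keys below are hypothetical.

```python
def plan_changes(current: dict, desired: dict) -> dict:
    """Return the keys a hypothetical `apply --dry-run` would add, remove, or change."""
    added = {k: desired[k] for k in desired.keys() - current.keys()}
    removed = {k: current[k] for k in current.keys() - desired.keys()}
    changed = {
        k: (current[k], desired[k])
        for k in current.keys() & desired.keys()
        if current[k] != desired[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical config keys, for illustration only.
running = {"node_count": 3, "tenants": ["prod", "staging"]}
in_git = {"node_count": 5, "tenants": ["prod", "staging"], "log_retention_days": 30}

plan = plan_changes(running, in_git)
print(plan)
```

Computing the diff is the easy part; safely rolling each change out to a live cluster is where the real work lies.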
[+] [-] jgehrcke|5 years ago|reply
An `apply` command might look innocent on the surface. But upgrades (including config changes) are hard. Super hard. If it helps a bit: the entire current Opstrace team has dealt with super challenging platform upgrade scenarios in the most demanding customer environments over the past years. We try not to underestimate this challenge :).
We're moving super fast right now and didn't want to bother with in-place config changes (as you can imagine, we wouldn't really be able to provide solid guarantees around that). We'll work on that and make it nice when the time is right, and when we feel like we can actually provide guarantees.
[+] [-] richardw|5 years ago|reply
[+] [-] fat-apple|5 years ago|reply