
Show HN: HyperDX – open-source dev-friendly Datadog alternative

722 points | mikeshi42 | 2 years ago | github.com

Hi HN, Mike and Warren here! We've been building HyperDX (hyperdx.io). HyperDX allows you to easily search and correlate logs, traces, metrics (alpha), and session replays all in one place. For example, if a user reports a bug "this button doesn't work," an engineer can play back what the user was doing in their browser and trace API calls back to the backend logs for that specific request, all from a single view.

Github Repo: https://github.com/hyperdxio/hyperdx

Coming from an observability nerd background, with Warren being SRE #1 at his last startup and me previously leading dev experience at LogDNA/Mezmo, we knew there were gaps in the existing tools we were used to using. Our previous stack of tools like Bugsnag, LogRocket, and Cloudwatch required us to switch between different tools, correlate timestamps (UTC? local?), and manually cross-check IDs to piece together what was actually happening. This often meant small issues required hours of frustration to root cause.

Other tools like Datadog or New Relic come with high price tags - when estimating costs for Datadog in the past, we found that our Datadog bill would exceed our AWS bill! Other teams have had to adjust their infrastructure just to appease the Datadog pricing model.

To build HyperDX, we've centralized all the telemetry in one place by leveraging OpenTelemetry (a CNCF project for standardizing/collecting telemetry) to pull and correlate logs, metrics, traces, and replays. In-app, we can correlate your logs/traces together in one panel by joining everything automatically via trace ids and session ids, so you can go from log <> trace <> replay in the same panel. To keep costs low, we store everything in Clickhouse (w/ S3 backing) to make it extremely affordable to store large amounts of data (compared to Elasticsearch) while still being able to query it efficiently (compared to services like Cloudwatch or Loki), in large part thanks to Clickhouse's bloom filters + columnar layout.
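The join on trace and session IDs described above can be sketched in a few lines. This is a toy model with illustrative event shapes (dicts carrying `trace_id`/`session_id` keys), not HyperDX's actual schema:

```python
def correlate(logs, spans, replays):
    """Group telemetry by trace_id so one view can show logs, spans,
    and the session replay for the same request. Field names are
    illustrative, not HyperDX's real schema."""
    by_trace = {}
    for log in logs:
        by_trace.setdefault(log["trace_id"],
                            {"logs": [], "spans": [], "replay": None})["logs"].append(log)
    for span in spans:
        by_trace.setdefault(span["trace_id"],
                            {"logs": [], "spans": [], "replay": None})["spans"].append(span)
    # Replays are keyed by session_id; a span carrying both IDs links them.
    session_to_trace = {s["session_id"]: s["trace_id"]
                        for s in spans if "session_id" in s}
    for replay in replays:
        trace_id = session_to_trace.get(replay["session_id"])
        if trace_id in by_trace:
            by_trace[trace_id]["replay"] = replay
    return by_trace
```

The point is that once every emitter stamps the same trace/session IDs (which OpenTelemetry propagation handles), log <> trace <> replay becomes a lookup rather than a manual cross-check.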

On top of that, we've focused on providing a smooth developer experience (the DX in HyperDX!). This includes features like native parsing of JSON logs, full-text search on any log or trace, 2-click alert creation, and SDKs that help you get started with OpenTelemetry faster than the default OpenTelemetry SDKs.

I'm excited to share what we've been working on with you all and would love to hear your feedback and opinions!

Hosted Demo - https://api.hyperdx.io/login/demo

Open Source Repo: https://github.com/hyperdxio/hyperdx

Landing Page: https://hyperdx.io

163 comments

[+] addisonj|2 years ago|reply
Wow, there is a lot here, and what's here is at a pretty impressive level of polish for how far along this is.

Your DX background comes through! I will be looking into this a lot more.

Here are a few comments, notes, and questions:

* I like the focus on DX (especially compared to other OSS solutions) in your messaging here, and I think your hero messaging tells that story, but it isn't reinforced as much through the features/benefits section

* It seems like clickhouse is obviously a big piece of the tech here, which is an obvious choice, but from my experience with high data rate ingest, especially logs, you can run into issues at larger scale. Is that something you expect to give options around in open source? Or is the cloud backend a bit different where you can offer that scale without making open source so complex?

* I saw what is in OSS vs cloud and I think it is a reasonable way to segment, especially multi-tenancy, but do you see the split always being more management/security features? Or are you considering functional things? Especially with recent HashiCorp "fun" I think more and more it is useful to be open about what you think the split will be. Obviously that will evolve, but I think that sort of transparency is useful if you really want to grow the OSS side

* on OSS, I was surprised to see MIT license. This is full featured enough and stand alone enough that AGPL (for server components) seems like a good middle ground. This also gives some options for potentially a license for an "enterprise" edition, as I am certain there is a market for a modern APM that can run all in a customer environment

* On that note, I am curious what your target persona and GTM plan is looking like? This space is a bit tricky IMHO, because small teams have so many options at okay price points, but the enterprise is such a difficult beast in switching costs. This looks pretty PLG focused atm, and I think for a first release it is impressive, but I am curious to know if you have more you are thinking of to differentiate yourself in a pretty crowded space.

Once again, really impressive what you have here and I will be checking it out more. If you have any more questions, happy to answer in thread or my email is in profile.

[+] mikeshi42|2 years ago|reply
Thank you, really appreciate the feedback and encouragement!

> It seems like clickhouse is obviously a big piece of the tech here, which is an obvious choice, but from my experience with high data rate ingest, especially logs, you can run into issues at larger scale. Is that something you expect to give options around in open source?

Scaling any system can be challenging - our experience so far is that Clickhouse requires a fraction of the overhead that systems like Elasticsearch have previously demanded, luckily. That being said, I think there's always going to be a combination of learnings we'd love to open source for operators that are self-hosting/managing Clickhouse, and tooling we use internally that is purpose-built for our specific setup and workloads.

> I saw what is in OSS vs cloud and I think it is a reasonable way to segment, especially multi-tenancy, but do you see the split always being more management/security features?

In our current release we've open sourced the vast majority of our feature set, including (I think) some novel features like event patterns that are typically SaaS-only - and that'll definitely be the way we want to continue to operate. Given the nature of observability, we feel comfortable continuing to push a fully-featured OSS version while having a monetizable SaaS that focuses on being completely managed, rather than needing to gate heavily based on features.

> on OSS, I was surprised to see MIT license

We want to make observability accessible and we think AGPL will accomplish the opposite of that. While we need to make money at the end of the day - we believe that a well-positioned enterprise + cloud offering is better suited to pull in those that are willing to pay, rather than forcing it via a license. I also love the MIT license and use it whenever I can :)

> On that note, I am curious what your target persona and GTM plan is looking like?

I think for small teams, imo the options available are largely untantalizing - they range from narrow tools like Cloudwatch to enterprise-oriented tools like New Relic or Datadog. We're working hard to make it easier for those kinds of teams to adopt good monitoring and observability from day 1, without the traditional requirement of needing an observability expert or dedicated SRE to get it set up. (Admittedly, we still have a ways to improve today!) On the enterprise side, switching costs are definitely high, but most enterprises are highly decentralized in decision making - I routinely hear of F500s running a handful of observability tools in production at a given time! I'll say it's not as locked-in as it seems :)

[+] dangoodmanUT|2 years ago|reply
For clickhouse, just batch insert. They probably have something batching every few seconds before inserting directly to their hosted version
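The batching pattern the parent describes is simple to sketch: buffer rows and flush as one bulk insert when either a size or age threshold trips. Thresholds and the `flush_fn` hook are illustrative, not anyone's actual implementation:

```python
import time

class BatchInserter:
    """Buffer rows and flush them in bulk - the usual pattern for keeping
    ClickHouse's insert count low, since many small INSERTs create many
    small parts that the server must merge later."""

    def __init__(self, flush_fn, max_rows=10_000, max_age_s=2.0):
        self.flush_fn = flush_fn        # e.g. a client's bulk INSERT call
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.buffer.append(row)
        age = time.monotonic() - self.last_flush
        if len(self.buffer) >= self.max_rows or age >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one INSERT for the whole batch
            self.buffer = []
        self.last_flush = time.monotonic()
```

In a real pipeline this usually lives in the collector/ingestor tier (with a background timer for the age-based flush) rather than in every client.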
[+] fnord77|2 years ago|reply
Clickhouse is proprietary, though.

I wonder why not Apache Druid

[+] prabhatsharma|2 years ago|reply
A good one. A lot is being built on top of clickhouse - I can count at least 3 if not more (hyperdx, signoz and highlight) now.

We at OpenObserve are solving the same problem but a bit differently - a much simpler solution that anyone can run using a single binary on their own laptop, or in a cluster of hundreds of nodes backed by s3. Covers logs, metrics, and traces (session replay, RUM, and error tracking are being released by end of the month) - https://github.com/openobserve/openobserve

[+] t1mmen|2 years ago|reply
This looks really cool, congrats on the launch!

I haven’t had time to dig in properly, but this seems like something that would fit perfectly for “local dev” logging as well. I struggled to find a good solution for this, ending up with Winston -> JSON, plus a simpler “dump to terminal” script running.

(The app I’m building does a ton of “in the background” work, and I wanted to present both “user interactions” and “background worker” logs in context)

I don’t see Winston being supported as a transport, but presumably easy to add/contribute.

Good luck!

[+] mikeshi42|2 years ago|reply
Thank you! We do support Winston (docs: https://www.hyperdx.io/docs/install/javascript#winston-trans...) and use it a lot internally. Let me know if you run into any issues with it (or have suggestions on how to make it more clear)

In fact this is actually how we develop locally - because even our local stack is comparatively noisy, we enable self-logging in HyperDX so our local logs/traces go to our own dev instance, and we can quickly trace a 500 that way. (Literally was doing this last night for a PR I'm working on).

[+] silentguy|2 years ago|reply
Have you tried lnav? It has somewhat steeper learning curve but it'd fit the bill. One small binary and some log parsing config, and you are good to go.
[+] corytheboyd|2 years ago|reply
Outside of the intended use-case of _replacing_ Datadog, I think this may actually serve as an excellent local development "Datadog Lite", which I have always wanted, and is something embarrassingly, sorely missing from local development environments.

In local development environments, I want to:

- Verify that tracing and metrics (if you use OpenTelemetry) actually work as intended (through an APM-like UI).

- Have some (rudimentary, even) data aggregation and visualization tools to test metrics with. You often discover missing/incorrect metrics by just exploring aggregations, visualizations, filters. Why do we accept that production (or rather, a remote deployment watched by Datadog etc.) is the correct place to do this? It's true that unknowns are... unknown, but what better time to discover them than before shipping anything at all?

- Build tabular views from structured logs (JSON). It is _mind blowing_ to me that most people seem to just not care about this. Good use of structured logging can help you figure out in seconds what would take someone else days.

I mean, that's it, the bar isn't too high lol. It looks like HyperDX may do... all of this... and very well, it seems?!

Before someone says "Grafana"-- no. Grafana is such a horrible, bloated, poorly documented solution for this (for THIS case. NOT IN GENERAL!). It needs to be simple to add to any local development stack. I want to add a service to my docker compose file, point this thing at some log files (bonus points for some docker.sock discoverability features, if possible), expose a port, open a UI in my browser, and immediately know what to do given my Datadog experience. I'm sure Grafana and friends are great when deployed, but they're terrible to throw into a project and have it just work and be intuitive.
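The structured-log tabular view described above is essentially a flatten-and-project step over JSON lines. A minimal sketch (field names hypothetical, dotted columns reaching into nested objects):

```python
import json

def to_table(lines, columns):
    """Flatten JSON log lines into rows for a tabular view.
    A dotted column name like 'http.status' reaches into nested objects;
    missing keys become empty strings."""
    rows = []
    for line in lines:
        event = json.loads(line)
        row = []
        for col in columns:
            value = event
            for key in col.split("."):
                value = value.get(key, "") if isinstance(value, dict) else ""
            row.append(value)
        rows.append(row)
    return rows
```

Tools that auto-parse JSON logs are doing roughly this under the hood, just with the column list driven by the UI instead of hardcoded.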

[+] mikeshi42|2 years ago|reply
Yes! We definitely do - in fact this is how we develop locally, our local stack is pretty intricate and can fail in different areas, so it's pretty nice for us to be able to debug errors directly in HyperDX when we're developing HyperDX!

Otel tracing works and should be pretty bulletproof - metrics is still early so you might see some weirdness (we'll need to update the remaining work we've identified in GH issues)

You can 100% build tabular views based on JSON logs, we auto-parse JSON logs and you can customize the search table layout to include custom properties in the results table.

Let us know if we fulfill this need - we at least do this ourselves so I feel pretty confident it should work in your use case! If there's anything missing - feel free to ping us on Discord or open an issue, we'd likely benefit from any improvement ideas ourselves while we're building HyperDX :)

Edit: Oh I also talk a bit about this in another comment below https://news.ycombinator.com/item?id=37561358

[+] carlio|2 years ago|reply
I use InfluxDB for this - it comes with a frontend UI, and you can configure Telegraf as a statsd listener, so pretty much the same metric ingestion as datadog. There are docker containers for these, which I have added to my docker-compose for local dev.

I think it does log ingestion too, I haven't ever used that, I mostly use it just for the metrics and graphing.

[+] snowstormsun|2 years ago|reply
[+] Kiro|2 years ago|reply
Not applicable when the base offering is free and open source. The SSO is in the base pricing in this case.
[+] jamesmcintyre|2 years ago|reply
This looks really promising, will definitely look into using this for a project i'm working on! Btw I've used both datadog and newrelic in large-scale production apps and for the costs I still am not very impressed by the dx/ux. If hyperdx can undercut price and deliver parity features/dx (or above) i can easily see this doing well in the market. Good luck!
[+] mikeshi42|2 years ago|reply
Thank you! Absolutely agree on Datadog/New Relic DX, I think the funny thing we learned is that most customers of theirs mention how few developers on their team actually comfortably engage with either New Relic or Datadog, and most of the time end up relying on someone to help get the data they need!

Definitely striving to be the opposite of that - and would love to hear how it goes and any place we can improve!

[+] Hamuko|2 years ago|reply
Datadog feels like they've used a shotgun to shoot functionality all over the place. New Relic felt a bit more focused, but even then I had to go attend a New Relic seminar to properly learn how to use the bloody thing.
[+] pighive|2 years ago|reply
What does dx/ux mean in this context? Data Diagnostics?
[+] Dockson|2 years ago|reply
Just want to heap on with the praise here and say that this was definitely the best experience I've had with any tool trying to add monitoring for a Next.js full-stack application. The Client Sessions tab where I, out of the box, can correlate front-end actions and back-end operations for a particular user is especially nice.

Great job!

[+] wrn14897|2 years ago|reply
Thank you. This means a lot to us.
[+] cheema33|2 years ago|reply
I am new to this space and was considering a self hosted install of Sentry software. Sentry is also opensource and appears to be similar to datadog and HyperDX in some ways. Do you know Sentry and can you tell us how your product is different?

Thanks.

[+] mikeshi42|2 years ago|reply
Very familiar with Sentry! I think we have a bit of overlap in that we both do monitoring and help devs debug though here's where I think we differ:

HyperDX:

- Can collect all server logs (to help debug issues even if an exception isn't thrown)

- We can collect server metrics as well (CPU, memory, etc.)

- We accept OpenTelemetry for all your data (logs, metrics, traces) - meaning you only need to instrument once and choose to switch vendors at any time if you'd like without re-instrumenting.

- We can visualize arbitrary data (what's the response time of endpoint X, how many users did action Y, how many times do users hit endpoint X grouped by user id?) - Sentry is a lot more limited in what it can visualize (mainly because it collects more limited amounts of data).

Sentry:

- Great for exception capture - it tries to capture every exception and match it against sourcemaps properly so you can get to the right line of code where the issue occurred. We don't have proper sourcemap support yet, so our stack traces point to minified file locations currently.

- Gives you an "inbox" view of all your exceptions so you can see which ones are firing currently. You can do something similar in HyperDX (error logs, log patterns, etc.), but theirs is more opinionated toward an email-style inbox, whereas ours is more about searching errors.

- Link your exceptions to your project tracker, so you can create Jira, Linear, etc. tickets directly from exceptions in Sentry.

I don't think it's an either/or kind of situation - we have many users that use both because we cover slightly different areas today. In the future we will be working towards accepting exception instrumentation as well, to cover some of our shortfalls when it comes to Sentry v HyperDX (since one common workflow is trying to correlate your Sentry exception to the HyperDX traces and logs).

Hope that gives you an idea! Happy to chat more on our Discord if you'd like as well.

[+] mdaniel|2 years ago|reply
> Sentry is also opensource

Well, pedantically the 5 year old version of Sentry is open source, sure

[+] vadman97|2 years ago|reply
How do you think about the query syntax? Are you defining your own or are you following an existing specification? I particularly love the trace view you have, connecting a frontend HTTP request to server side function-level tracing.
[+] mikeshi42|2 years ago|reply
This one is a fun one that I've spent too many nights on - we're largely similar to Google-style search syntax (bare terms, "OR" "AND" logical operators, and property:value kind of search).

We include a "query explainer" - which translates the parsed query AST into something more human readable under the search bar, hopefully giving good feedback to the user on whether we understand their query or not. Though there's lots of room to improve here!
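A toy version of that syntax and explainer fits in a few lines - bare terms, quoted phrases, `AND`/`OR`, and `property:value` pairs rendered back as plain English. This is a sketch in the spirit of the comment above, not HyperDX's actual grammar:

```python
import re

# Matches either a quoted phrase or a bare token.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')

def explain(query):
    """Render a Google-style search query as a human-readable sentence,
    like the 'query explainer' described above (toy grammar)."""
    parts = []
    for phrase, word in TOKEN.findall(query):
        if phrase:
            parts.append(f'contains the phrase "{phrase}"')
        elif ":" in word:
            prop, value = word.split(":", 1)
            parts.append(f"property {prop} equals {value}")
        elif word in ("AND", "OR"):
            parts.append(word.lower())
        else:
            parts.append(f'contains "{word}"')
    return " ".join(parts)
```

A real implementation parses into an AST first (so it can handle grouping and negation) and explains from the tree, but the round-trip idea - parse, then echo the interpretation back to the user - is the same.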

[+] nodesocket|2 years ago|reply
Congrats on the launch. Perhaps I missed it, but what are the system requirements to run the self-hosted version? Seems decently heavy (Clickhouse, MongoDB, Redis, HyperDX services)? Is there a Helm chart to install into k8s?

Look forward to the syslog integration which says coming soon. I have a hobby project which uses systemd services for each of my Python apps, and the path with least resistance is just ingesting syslog (aware that I lose stack traces, session replay, etc).

[+] mikeshi42|2 years ago|reply
The absolute bare minimum I'd say is 2GB RAM, though in the README we do say 4GB and 2 cores for testing, obviously more if you're at scale and need performance.

For Syslog - it's actually something we're pretty close on, because we already support Heroku's syslog-based messages (though those come over HTTP); we largely need to test that the otel Syslog receiver + parsing pipeline translates as well as it should (PRs always welcome of course, but it shouldn't be too far out from now :)). I'm curious - are you using TLS/TCP syslog, plain TCP, or UDP?

Here's my docker stats on an x64 linux VM where it's doing some minimal self-logging. I suspect the otel collector memory can be tuned down to bring the memory usage closer to 1GB, but this is the default out-of-the-box stats, and the miner can be turned off if log patterns aren't needed:

CONTAINER ID   NAME                        CPU %   MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O         PIDS
439e3f426ca6   hdx-oss-miner               0.89%   167.2MiB / 7.771GiB   2.10%    3.25MB / 6.06MB   8.85MB / 0B       21
7dae9d72913d   hdx-oss-task-check-alerts   0.03%   83.65MiB / 7.771GiB   1.05%    6.79MB / 9.54MB   147kB / 0B        11
5abd59211cd7   hdx-oss-app                 0.00%   56.32MiB / 7.771GiB   0.71%    467kB / 551kB     6.23MB / 0B       11
90c0ef1634c7   hdx-oss-api                 0.02%   93.71MiB / 7.771GiB   1.18%    13.2MB / 7.87MB   57.3kB / 0B       11
39737209c58f   hdx-oss-hostmetrics         0.03%   72.27MiB / 7.771GiB   0.91%    3.83GB / 173MB    3.84MB / 0B       11
e13c9416c06e   hdx-oss-ingestor            0.04%   23.11MiB / 7.771GiB   0.29%    73.2MB / 89.4MB   77.8kB / 0B       5
36d57eaac8b2   hdx-oss-otel-collector      0.33%   880MiB / 7.771GiB     11.06%   104MB / 68.9MB    1.24MB / 0B       11
78ac89d8e28d   hdx-oss-aggregator          0.07%   88.08MiB / 7.771GiB   1.11%    141MB / 223MB     147kB / 0B        11
8a2de809efed   hdx-oss-redis               0.19%   3.738MiB / 7.771GiB   0.05%    4.36MB / 76.5MB   8.19kB / 4.1kB    5
2f2eac07bedf   hdx-oss-db                  1.34%   75.62MiB / 7.771GiB   0.95%    105MB / 3.79GB    1.32MB / 246MB    56
032ae2b50b2f   hdx-oss-ch-server           0.54%   128.7MiB / 7.771GiB   1.62%    194MB / 45MB      88.4MB / 65.5kB   316

[+] Wulfheart|2 years ago|reply
So do I understand the landing page correctly: It is possible to run Clickhouse using an Object Storage like S3? What are the performance implications?
[+] mikeshi42|2 years ago|reply
You can definitely run Clickhouse directly on S3 [1] - though we don't run _just_ on S3 for performance reasons but instead use a layered disk strategy.

A few of the weaknesses of S3 are:

1. API calls are expensive - while storage in S3 is cheap, writing to and reading from it is not. Using only S3 for storage will incur lots of API calls, since Clickhouse continuously merges parts in the background (which requires downloading files from S3 again and uploading the merged part). And searching recent data on S3 can incur high costs as well if you're constantly doing so (ex. alert rules).

2. Latency and bandwidth of S3 are limited - SSDs are an order of magnitude faster to respond to IO requests, and on-device SSDs typically have higher bandwidth available. This is typically a bottleneck for reads, but not a concern for writes. It can be mitigated by scaling out network-optimized instances, but is just another thing to keep in mind.

3. We've seen some weird behavior on skip indices that can negatively impact performance in S3 specifically, but haven't been able to identify exactly why yet. I don't recall if that's the only weirdness we see happen in S3, but it's one that sticks out right now.

Depending on your scale and latency requirements, writing directly to S3 or a simple layered disk + S3 strategy might work well for your case. Though we've found that scaling S3 to work at the latencies/scales our customers typically ask for requires a bit of work (as with scaling any infra tool for production workloads).

[1] https://clickhouse.com/docs/en/integrations/s3
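For reference, a layered hot-disk/S3 setup like the one described is expressed through ClickHouse's storage configuration, roughly like this (bucket endpoint, disk names, and the move threshold are illustrative, and credentials are omitted):

```xml
<clickhouse>
  <storage_configuration>
    <disks>
      <s3_cold>
        <type>s3</type>
        <endpoint>https://my-bucket.s3.amazonaws.com/clickhouse/</endpoint>
        <!-- credentials omitted -->
      </s3_cold>
    </disks>
    <policies>
      <tiered>
        <volumes>
          <hot>
            <disk>default</disk> <!-- local SSD -->
          </hot>
          <cold>
            <disk>s3_cold</disk>
          </cold>
        </volumes>
        <!-- start moving parts to S3 when the hot volume is ~80% full -->
        <move_factor>0.2</move_factor>
      </tiered>
    </policies>
  </storage_configuration>
</clickhouse>
```

Tables then opt in with `SETTINGS storage_policy = 'tiered'`, and recent parts stay on the local disk while older parts migrate to S3 - which is what sidesteps the API-call and latency issues above for the hot query path.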

[+] mnahkies|2 years ago|reply
One thing I appreciate about sentry compared to datadog is the ability to configure hard caps on ingestion to control cost. AFAIK the mechanism is basically that the server starts rate limiting/rejecting requests and the client SDKs are written to handle this and enter a back off state or start sampling events.

I think this could be a nice point of difference to explore that can help people avoid unexpected bills
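The client-side half of that mechanism - back off by sampling when the server rate-limits - can be sketched like this. A toy model in the spirit of the behavior described above, not any real SDK's implementation (the `transport` hook and rates are assumptions):

```python
import random

class SamplingSender:
    """On a 429 from the server, halve our sample rate instead of
    retrying hard or dropping everything; creep back toward full
    sampling while responses are healthy. Illustrative sketch."""

    def __init__(self, transport, min_rate=0.1):
        self.transport = transport   # callable returning an HTTP-style status
        self.min_rate = min_rate
        self.sample_rate = 1.0

    def send(self, event):
        if random.random() > self.sample_rate:
            return "sampled_out"     # shed load client-side
        status = self.transport(event)
        if status == 429:
            # Server says slow down: back off by sampling harder.
            self.sample_rate = max(self.min_rate, self.sample_rate / 2)
            return "rejected"
        # Healthy response: recover gradually.
        self.sample_rate = min(1.0, self.sample_rate * 1.1)
        return "sent"
```

The hard cap itself lives server-side (reject once the quota is hit); this is just the cooperative client behavior that turns those rejections into graceful degradation instead of retry storms.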

[+] mikeshi42|2 years ago|reply
Agreed on needing better tooling for surprise bills - definitely no stranger to that problem!

For now we're trying to make the base price cheap enough that those kinds of considerations don't need to be top of mind, with a policy that can be forgiving when overages occasionally happen. But as we continue to scale and grow, we'll certainly need to put in proper controls that let users define what should happen when events spike unexpectedly (how to shed events via sampling, what needs to be explicitly preserved for compliance reasons, when to notify, etc.)

I do like Sentry's auto-sampling algorithm which is a really neat way to solve that issue.

[+] mfkp|2 years ago|reply
Looks very interesting, although a lot of the OpenTelemetry libraries are incomplete: https://opentelemetry.io/docs/instrumentation/

Especially Ruby, which is the one that I would be most interested in using.

[+] mikeshi42|2 years ago|reply
The OpenTelemetry ecosystem is definitely still young depending on the language, but we have Ruby users onboard (typically using OpenTelemetry for the tracing portion, and piping logs via Heroku or something else via the regular Ruby logger).

Feel free to pop in on the Discord if you'd like to chat more/share your thoughts!

[+] jacobbank|2 years ago|reply
Just wanted to say congrats on the launch! We recently adopted hyperdx at Relay.app and it's great.
[+] mikeshi42|2 years ago|reply
Thank you - it's been awesome working with you guys! :)
[+] kcsavvy|2 years ago|reply
The session playback looks useful - I find this is missing from many DD alternatives I have seen.
[+] mikeshi42|2 years ago|reply
Absolutely! It's pretty magical to go from a user report -> session replay -> exact API call being made and the backend error logs.

We dogfood a ton internally and (while obviously biased) we're always surprised how much faster we can pinpoint issues and connect alarms with bug reports.

Hope you give us a spin and feel free to hop on our discord or open an issue if you run into anything!

[+] solardev|2 years ago|reply
This is awesome! Datadog's one of my favorite providers, and their pricing is great for small businesses, but probably unaffordable for larger businesses (as pointed out in these threads).

This is slick and fast. Will have to check it out. Thanks for making it!

[+] mikeshi42|2 years ago|reply
Thank you - let me know how it goes when you're trying it out, would love to learn how you feel it compares to Datadog :)
[+] robertlagrant|2 years ago|reply
If you want my two Datadog favourite features, they were: 1) clicking on a field and making it a custom search dimension in another click, and 2) flame graphs. Delicious flame graphs.
[+] mikeshi42|2 years ago|reply
We should have both! If you hover over a property value, a magnify/plus icon comes up to let you search on that property value (no manual facets required) - and our traces all come with delicious flame graphs :) Let me know if you were thinking of something different.

One other thing I think you'd love if you're coming from Datadog is that you're able to full text search on structured logs as well, so even if the value you're looking for lives in a property, it's still full text searchable (this is a huge pain we hear from other Datadog users)

If there's anything you love/hate about Datadog - would love to learn more!

[+] technics256|2 years ago|reply
Is there a guide for integrating this in local dev, either locally or if you want to view it on the hosted?

Ideally hosted, devs can bring up our app locally, and view their logs and traces etc when testing and building

[+] mikeshi42|2 years ago|reply
There shouldn't be any difference in how you set things up for local vs production telemetry (in fact, our users typically test locally before pushing to staging/prod).

Of course, if your local and prod environments run completely differently and require different instrumentation, that might be trickier.

I'm wondering if you had a specific use case in mind? Happy to dive more into how it should be done (feel free to join on Discord too if you'd like to chat there)