Show HN: StatusDude – Uptime monitoring internal services with K8s autodiscovery

1 point | canto | 13 days ago | statusdude.com

Hey HN, I'm Oskar. For the past few months I've been building StatusDude - an uptime monitoring tool with private agents that auto-detects your Kubernetes resources.

I run a bunch of stuff across multiple orgs, different clusters, internal networks, self-hosted, GKE, EKS, etc. Monitoring all of it without Datadog money was getting tough, and most tools don't even support internal networks. So, here we are.

A tiny async agent sits inside your network and phones home over HTTPS. No inbound ports, no VPN, no firewall rules. One container, one helm install, done. A single instance handles 10k+ monitors comfortably.

The agent pulls check definitions from the cloud, runs them locally, uploads raw results. All evaluation is server-side - the agent stays dead simple, and the cloud decides what's actually down vs. a blip.

For Kubernetes, it auto-discovers Ingresses, Services, and HTTPRoutes. Deploy something new, it just gets picked up. Monitors and status pages spin up automatically.

During the development process I found out I don't know how to use Celery properly. Went with ARQ instead - 50k+ jobs/min, no drama. After I modified it a bit, that is ;-)

Not a full observability platform - no incident management, no on-call. Just monitoring, status pages, and notifications.

If you want straightforward uptime monitoring that works behind firewalls, give it a go and please leave feedback in the comments! New signups currently get the Team plan unlocked for free, I want people to test the full thing.

Happy to answer any questions about the architecture.
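For readers who want a picture of the pull / run / upload cycle described in the post, here is a rough Python sketch. It is not StatusDude's actual agent code: `Check`, `RawResult`, `run_check`, `agent_cycle` and the three injected callables (`fetch_checks`, `probe`, `upload`) are hypothetical names made up for illustration. The only part taken from the post is the shape of the loop: the agent does no evaluation and just ships raw observations for the server to judge.

```python
import asyncio
import time
from dataclasses import asdict, dataclass
from typing import Awaitable, Callable, List, Optional


@dataclass
class Check:
    """A check definition as the agent might receive it (hypothetical shape)."""
    id: str
    url: str
    timeout_s: float = 5.0


@dataclass
class RawResult:
    """Raw observation only; no up/down verdict is computed on the agent."""
    check_id: str
    status_code: Optional[int]
    latency_ms: float
    error: Optional[str]


Probe = Callable[[Check], Awaitable[int]]


async def run_check(check: Check, probe: Probe) -> RawResult:
    """Run one probe locally and record what happened, including failures."""
    start = time.monotonic()
    status: Optional[int] = None
    error: Optional[str] = None
    try:
        status = await asyncio.wait_for(probe(check), timeout=check.timeout_s)
    except Exception as exc:  # timeouts, refused connections, DNS errors, ...
        error = type(exc).__name__
    latency_ms = (time.monotonic() - start) * 1000
    return RawResult(check.id, status, latency_ms, error)


async def agent_cycle(
    fetch_checks: Callable[[], Awaitable[List[Check]]],
    probe: Probe,
    upload: Callable[[List[dict]], Awaitable[None]],
) -> int:
    """One cycle: pull definitions, run them concurrently, upload raw results."""
    checks = await fetch_checks()
    results = await asyncio.gather(*(run_check(c, probe) for c in checks))
    await upload([asdict(r) for r in results])
    return len(results)
```

In a real agent, `fetch_checks` and `upload` would presumably be outbound HTTPS calls to the cloud API and `probe` an HTTP client request; they are passed in here so the sketch stays runnable without a network.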

https://statusdude.com

https://artifacthub.io/packages/helm/statusdude-agent/status...

1 comment

jamiemallers|12 days ago

[deleted]

canto|12 days ago

"A plugin/annotation system where users can teach the agent about custom resource types would scale better than hard-coding each one." - this is a fantastic observation and feedback! Many thanks!

"requiring N consecutive failures before marking down" - I do have the code for it, it's just hidden currently. StatusDude supports 2 types of worker/agents - cloud agents - that will re-verify from multiregion the service status and private agents - the ones we're talking about here - that I might just bring this option back as it makes more sense.

Correlating failures is a bit tricky, as it usually requires some sort of manual dependency creation, but I guess for k8s Ingress and similar I should be able to figure this out and at least send alerts with appropriate priorities and order.

As for the status page auto-generation - currently it's based on namespace - I didn't want to bloat the user dashboard too much. Each monitor is tagged with cluster id, namespace and labels. Status Pages pick up monitors based on labels. Users are free to modify these and show exactly what they want :)
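A rough sketch of how label-based selection like this could work, in Python. The `Monitor` fields mirror the tags mentioned above (cluster id, namespace, labels), but the `select_monitors` helper and its equality-only selector semantics are assumptions for illustration, not StatusDude's actual matching rules.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Monitor:
    name: str
    cluster_id: str
    namespace: str
    labels: Dict[str, str] = field(default_factory=dict)


def select_monitors(monitors: List[Monitor], selector: Dict[str, str]) -> List[Monitor]:
    """Return the monitors whose labels contain every key/value pair in `selector`."""
    return [
        m for m in monitors
        if all(m.labels.get(key) == value for key, value in selector.items())
    ]


monitors = [
    Monitor("checkout-api", "prod-eu", "shop", {"team": "payments", "tier": "public"}),
    Monitor("ledger", "prod-eu", "shop", {"team": "payments", "tier": "internal"}),
    Monitor("blog", "prod-us", "web", {"team": "marketing", "tier": "public"}),
]

# A hypothetical public status page would list only the 'public' tier monitors.
public_page = select_monitors(monitors, {"tier": "public"})
```

Editing a page's selector (or a monitor's labels) then changes what the page shows without touching the monitors themselves.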