z64|2 years ago
This has 100% been a learning experience for us, but I can provide some finer context re: observability.
Kagi is a small team. The number of staff we have capable of responding to an event like this is essentially 3 people, spread across 3 timezones. For myself and my right-hand dev, this is actually our very first step in our web careers - which is to say, we are not SV vets who have seen it all already. That we have a lot to learn is a given, but having built Kagi from nothing, I am proud of how far we've come and where we're going.
Observability is something we started taking more seriously in the past 6 months or so. We have tons of dashboards now, and alerts that go straight to our company chat channels and ping the relevant people. And as the primary owner of our DB, GCP's query insights are a godsend. During the incident our monitoring went off, and query insights surfaced the "culprit" query - but we could have all the monitoring in the world and still lack the experience to interpret it, understand the root cause, and choose the most efficient mitigation.
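Alerts like the ones described above don't need much machinery to be useful. A minimal sketch of threshold-based alerting in Python; the metric names and thresholds here are hypothetical, and a real setup would post the messages to a chat webhook rather than return them:

```python
def evaluate_alert(metric_name, value, threshold):
    """Return an alert message if the metric breaches its threshold, else None."""
    if value > threshold:
        return f"ALERT: {metric_name} is {value:.1f} (threshold {threshold:.1f})"
    return None


def check_metrics(samples, thresholds):
    """Compare a batch of metric samples against per-metric thresholds.

    Metrics with no configured threshold are skipped (treated as unbounded).
    Returns the list of alert messages to send to the chat channel.
    """
    alerts = []
    for name, value in samples.items():
        message = evaluate_alert(name, value, thresholds.get(name, float("inf")))
        if message is not None:
            alerts.append(message)
    return alerts
```

The hard part, as the comment notes, isn't firing the alert - it's knowing which thresholds are worth configuring in the first place.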
In other words, we don't have the wisdom yet to not be "gaslit" by our own systems if we're not careful. Only in hindsight can I say that GCP's query insights was 100% on the money, and not some bug in application space.
All said, our growth has enabled us to expand our team quite a bit now. We have had SRE consultations before, and intend to bring on more full or part-time support to help keep things moving forward.
kelnos|2 years ago
Don't worry too much about all the people being harsh in the comments here. There's always a tendency for HN users to pile on with criticism whenever anyone has an outage.
I've always found this bizarre, because I've worked at places with worse issues, and more holes in monitoring or whatever, than a lot of the companies that get skewered here. Perhaps many of us are just insecure about our own infra and project our feelings onto other companies when they have outages.
Y'all are doing fine, and I think it's to your credit that you're able to run Kagi's users table off a single, fairly cheap primary database instance. I've worked at places that haven't given much thought to optimization, and "solve" scaling problems by throwing more and bigger hardware at them, and then wonder later on why they're bleeding cash on infrastructure. Of course, by that point, those inefficiencies are much more difficult to fix.
As for monitoring, unfortunately sometimes you don't know everything you need to monitor until something bad happens because your monitoring was missing something critical that you didn't realize was critical. That's fine; seems like y'all are aware and are plugging those holes. I'm sure there will be more of those holes in the future, that's just life.
At any rate, keep doing what you're doing, and I know the next time you get hit with something bad, things will be a bit better.
pembrook|2 years ago
If everything at Kagi was FAANG-level bulletproof, with extensive processes around outages/redundancy, then the team absolutely would not be making the best use of their time/resources.
If you’re risk averse and aren’t comfortable encountering bugs/issues like this, don’t try any new software product of moderate complexity for about 7-10 years.
Tempest1981|2 years ago
I've read most of the comments here, and don't recall anything negative, just supportive.
tetha|2 years ago
I'll say: Effective observability, monitoring and alerting of complex systems is a really hard problem.
Like, you look at a graph of a metric, and there are spikes. But... are the spikes even abnormal? Are the spikes caused by the layer below, because our storage array is failing? Are the spikes caused by ... well also the storage layer.. because the application is slamming the database with bullshit queries? Or maybe your data is collected incorrectly. Or you select the wrong data, which is then summarized misleadingly.
Been in most of these situations. The monitoring means everything, and nothing, at the same time.
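The "are the spikes even abnormal?" question can at least be mechanized as a first pass. A crude sketch, assuming a rolling baseline and a z-score cutoff (both the window size and the cutoff are arbitrary choices, and this only says a point is unusual, not why):

```python
import statistics


def spike_indices(series, window=5, z_cutoff=3.0):
    """Flag points sitting more than z_cutoff standard deviations above
    the mean of the preceding `window` samples.

    A perfectly flat baseline (zero deviation) is skipped entirely,
    which is itself a judgment call -- on flat data, any movement
    at all might be the interesting signal.
    """
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline)
        if stdev == 0:
            continue
        if (series[i] - mean) / stdev > z_cutoff:
            flagged.append(i)
    return flagged
```

Even this trivial check encodes exactly the judgment calls the comment is pointing at: how long a baseline, how big a deviation, and what to do with flat data.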
And in the application case, little common industry wisdom will help you. Yes, your in-house code is slamming the database with crap, and thus all the layers in between are saturating and people are angry. I guess you'd add monitoring and instrumentation... while production is down.
At that point, I think we're at a similar point of "Safety rules are written in blood" - "the most effective monitoring boards are found while prod is down".
And that's just the road to finding the function in the code that's the problem. All while product is telling you how critical this is to a business-critical customer.
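Retrofitting instrumentation under pressure often starts with something as basic as a timing wrapper around the suspect function. A minimal Python sketch; accumulating stats on the wrapper itself is a stand-in for shipping them to a real metrics backend:

```python
import functools
import time


def timed(fn):
    """Wrap a function so every call records its wall-clock duration.

    Stats accumulate on the wrapper itself; a production version would
    export them to a metrics backend instead of keeping them in-process.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            wrapper.calls += 1
            wrapper.total_seconds += time.perf_counter() - start

    wrapper.calls = 0
    wrapper.total_seconds = 0.0
    return wrapper
```

It's blunt, but when prod is down, a decorator you can drop onto the suspected code path beats waiting on a proper tracing rollout.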
alberth|2 years ago
Don’t host your status page (status.kagi.com) as a subdomain of your main site - DNS issues can take both your main site and your status site offline, so use something like kagistatus.com instead.
And host it with a webhost that doesn’t share any common infra with you.
xwolfi|2 years ago
You'll only get better at guessing what the issue could be: an exploit by a user is something you'll remember forever and will over-protect against from now on, until you hit some completely different problem that your metrics are unprepared for, and you'll fumble around, do another post-mortem, promise to look at that new class of issues, and so on.
You'll marvel at the diversity of potential issues, especially in human-facing services like yours. But you'll probably have another long loss of service again one day, and you're right to insist on the transparency / speed of signaling to your users: they can forgive everything as long as you give them an early signal, a discount and an apology, in my experience.
ayberk|2 years ago
There are only two states of monitoring:
- You don't have enough.
- You have too much.
Usually either something will be missing or some red herring will cost you valuable time. You're already doing much better than most people by taking it seriously :)
JohnMakin|2 years ago
I figured this was the case when you said “our devops engineer” singular and not “one of our devops engineers.”
I’m glad you’re willing at this stage to invest in SRE. It’s a decision a lot of companies only make when they absolutely have to or have their backs against a wall.
siquick|2 years ago
Big ups, you’re smashing it