z64|2 years ago
This has 100% been a learning experience for us, but I can provide some finer context re: observability.
Kagi is a small team. The number of staff we have capable of responding to an event like this is essentially 3 people, spread across 3 timezones. For myself and my right-hand dev, this is actually our very first step in our web careers - which is to say, we are not SV vets who have seen it all already. That we have a lot to learn is a given, but having built Kagi from nothing, I am proud of how far we've come and where we're going.
Observability is something we started taking more seriously in the past 6 months or so. We have tons of dashboards now, and alerts that go straight to our company chat channels and ping the relevant people. And as the primary owner of our DB, GCP's query insights are a godsend. During the incident our monitoring went off, and query insights surfaced the "culprit" query - but we could have all the monitoring in the world and still lack the experience to interpret it, understand the root cause, and choose the most efficient mitigation.
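Alerts like the ones described above don't need much machinery to be useful. A minimal sketch of threshold-based alerting in Python; the metric names and thresholds here are hypothetical, and a real setup would post the messages to a chat webhook rather than return them:

```python
def evaluate_alert(metric_name, value, threshold):
    """Return an alert message if the metric breaches its threshold, else None."""
    if value > threshold:
        return f"ALERT: {metric_name} is {value:.1f} (threshold {threshold:.1f})"
    return None


def check_metrics(samples, thresholds):
    """Compare a batch of metric samples against per-metric thresholds.

    Metrics with no configured threshold are skipped (treated as unbounded).
    Returns the list of alert messages to send to the chat channel.
    """
    alerts = []
    for name, value in samples.items():
        message = evaluate_alert(name, value, thresholds.get(name, float("inf")))
        if message is not None:
            alerts.append(message)
    return alerts
```

The hard part, as the comment notes, isn't firing the alert - it's knowing which thresholds are worth configuring in the first place.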
In other words, we don't have the wisdom yet to not be "gaslit" by our own systems if we're not careful. Only in hindsight can I say that GCP's query insights was 100% on the money, and not some bug in application space.
All said, our growth has enabled us to expand our team quite a bit now. We have had SRE consultations before, and intend to bring on more full or part-time support to help keep things moving forward.
kelnos|2 years ago
Don't worry too much about all the people being harsh in the comments here. There's always a tendency for HN users to pile on with criticism whenever anyone has an outage.
I've always found this bizarre, because I've worked at places with worse issues, and more holes in monitoring or whatever, than a lot of the companies that get skewered here. Perhaps many of us are just insecure about our own infra and project our feelings onto other companies when they have outages.
Y'all are doing fine, and I think it's to your credit that you're able to run Kagi's users table off a single, fairly cheap primary database instance. I've worked at places that haven't given much thought to optimization, and "solve" scaling problems by throwing more and bigger hardware at them, and then wonder later on why they're bleeding cash on infrastructure. Of course, by that point, those inefficiencies are much more difficult to fix.
As for monitoring, unfortunately sometimes you don't know everything you need to monitor until something bad happens because your monitoring was missing something critical that you didn't realize was critical. That's fine; seems like y'all are aware and are plugging those holes. I'm sure there will be more of those holes in the future, that's just life.
At any rate, keep doing what you're doing, and I know the next time you get hit with something bad, things will be a bit better.
pembrook|2 years ago
If everything at Kagi was FAANG-level bulletproof, with extensive processes around outages/redundancy, then the team absolutely would not be making the best use of their time/resources.
If you’re risk averse and aren’t comfortable encountering bugs/issues like this, don’t try any new software product of moderate complexity for about 7-10 years.
Tempest1981|2 years ago
I've read most of the comments here, and don't recall anything negative, just supportive.
tetha|2 years ago
I'll say: Effective observability, monitoring and alerting of complex systems is a really hard problem.
Like, you look at a graph of a metric, and there are spikes. But... are the spikes even abnormal? Are the spikes caused by the layer below, because our storage array is failing? Are the spikes caused by ... well also the storage layer.. because the application is slamming the database with bullshit queries? Or maybe your data is collected incorrectly. Or you select the wrong data, which is then summarized misleadingly.
Been in most of these situations. The monitoring means everything, and nothing, at the same time.
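The "are the spikes even abnormal?" question can at least be mechanized as a first pass. A crude sketch, assuming a rolling baseline and a z-score cutoff (both the window size and the cutoff are arbitrary choices, and this only says a point is unusual, not why):

```python
import statistics


def spike_indices(series, window=5, z_cutoff=3.0):
    """Flag points sitting more than z_cutoff standard deviations above
    the mean of the preceding `window` samples.

    A perfectly flat baseline (zero deviation) is skipped entirely,
    which is itself a judgment call -- on flat data, any movement
    at all might be the interesting signal.
    """
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline)
        if stdev == 0:
            continue
        if (series[i] - mean) / stdev > z_cutoff:
            flagged.append(i)
    return flagged
```

Even this trivial check encodes exactly the judgment calls the comment is pointing at: how long a baseline, how big a deviation, and what to do with flat data.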
And in the application case, little common industry wisdom will help you. Yes, your in-house code is slamming the database with crap, and thus all the layers in between are saturating and people are angry. I guess you'd add monitoring and instrumentation... while production is down.
At that point, I think we're at a similar point of "Safety rules are written in blood" - "the most effective monitoring boards are found while prod is down".
And that's just the road to finding the function in the code that's the problem. All while product is telling you how critical this is to a business-critical customer.
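Retrofitting instrumentation under pressure often starts with something as basic as a timing wrapper around the suspect function. A minimal Python sketch; accumulating stats on the wrapper itself is a stand-in for shipping them to a real metrics backend:

```python
import functools
import time


def timed(fn):
    """Wrap a function so every call records its wall-clock duration.

    Stats accumulate on the wrapper itself; a production version would
    export them to a metrics backend instead of keeping them in-process.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            wrapper.calls += 1
            wrapper.total_seconds += time.perf_counter() - start

    wrapper.calls = 0
    wrapper.total_seconds = 0.0
    return wrapper
```

It's blunt, but when prod is down, a decorator you can drop onto the suspected code path beats waiting on a proper tracing rollout.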
alberth|2 years ago
Don’t host your status page (status.kagi.com) as a subdomain of your main site - DNS issues can take both your main site and your status site offline, so use something like kagistatus.com instead.
And host it with a webhost that doesn’t share any common infra with you.
xwolfi|2 years ago
You'll only get better at guessing what the issue could be: an exploit by a user is something you'll remember forever and will over-protect against from now on, until you hit some completely different problem that your metrics are unprepared for, and you'll fumble around, do another post-mortem, promise to look at that new class of issues, and so on.
You'll marvel at the diversity of potential issues, especially in human-facing services like yours. But you'll probably have another long loss of service again one day, and you're right to insist on the transparency / speed of signaling to your users: they can forgive everything as long as you give them an early signal, a discount and an apology, in my experience.
ayberk|2 years ago
There are only two states of monitoring:
- You don't have enough.
- You have too much.
Usually either something will be missing or some red herring will cost you valuable time. You're already doing much better than most people by taking it seriously :)
JohnMakin|2 years ago
I figured this was the case when you said “our devops engineer” singular and not “one of our devops engineers.”
I’m glad you’re willing at this stage to invest in SRE. It’s a decision a lot of companies only make when they absolutely have to or have their backs against a wall.
siquick|2 years ago
Big ups, you’re smashing it