mtsmith85 | 10 years ago

This line: "I would have sent out an email to the mailing lists earlier; but since at each point I thought I was "one change away" from fixing the problems, I kept on delaying said email until it was clear that the problems were finally fixed" describes such a common situation for most people, but I tend to see it with engineers especially. I find I struggle with it an incredible amount. In some ways, I guess it seems healthy or reassuring that incredibly smart people like Colin Percival suffer from similar challenges around fully understanding the scope of the problem and the solution.

All that being said, I really respect the detailed response from a technical perspective, as well as Colin owning up to a spell of downgraded performance and to the decisions that went into it.

Later edit because I don't want to spam the comments: I'd love some context (maybe from cperciva himself?) around the performance enhancement of integrating new Intel AESNI instructions. This is well beyond my depth, and while Colin mentions that it didn't necessarily increase performance, I'm wondering if the hope is that it would in the long term? Or were there other benefits to such an integration?

cperciva|10 years ago

> I'd love some context (maybe from cperciva himself?) around the performance enhancement of integrating new Intel AESNI instructions.

I was using OpenSSL for that (which was using a software implementation). The code (you can see it in spiped) now detects the CPU feature and selects between AESNI or OpenSSL automatically. Given that the tarsnap server code was spending about 40% of its time running AES, it's a nontrivial CPU time saving.

I should probably have been clearer in my writeup though -- using AESNI was never a "once I roll this out everything will be good" fix. Rather, it was a case of "I have this well-tested code available which will help a bit while I finish testing the real fixes".
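For readers unfamiliar with the "detect the CPU feature, then pick an implementation" pattern described above, here is a minimal sketch in Python. It is not how spiped does it (spiped is C code); it only illustrates the dispatch idea, assuming a Linux-style /proc/cpuinfo where the `flags` line lists `aes` when AESNI is available. The function names and the backend labels "aesni" and "openssl" are hypothetical.

```python
def cpu_has_aesni(cpuinfo_text: str) -> bool:
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists 'aes'."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Only the 'flags' field counts; 'aes' elsewhere (e.g. model name) does not.
        if key.strip() == "flags" and "aes" in value.split():
            return True
    return False


def select_aes_backend(cpuinfo_text: str) -> str:
    """Pick the hardware path when available, else the software fallback."""
    return "aesni" if cpu_has_aesni(cpuinfo_text) else "openssl"


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the CPU description; treat an unreadable file as 'no AESNI'."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError:
        return ""
```

Calling `select_aes_backend(read_cpuinfo())` once at startup and caching the result keeps the check off the hot path; falling back to the software implementation when detection fails is the conservative choice.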

gonzo|10 years ago

One wonders why you aren't using a version of OpenSSL that has the AESNI bits already in it.

cperciva|10 years ago

> I would have sent out an email to the mailing lists earlier; but since at each point I thought I was "one change away" from fixing the problems, I kept on delaying said email until it was clear that the problems were finally fixed

This ties in to the last lesson I mentioned at the bottom:

5. When performance drops, it's not always due to a single problem; sometimes there are multiple interacting bottlenecks.

Every time I identified a problem, I was correct that it was a problem -- my failing was in not realizing that there were several things going on at once.

jcrites|10 years ago

> Every time I identified a problem, I was correct that it was a problem -- my failing was in not realizing that there were several things going on at once.

Very common! One thing that's been helpful for us is establishing predefined system performance thresholds that, if exceeded, initiate the chain of events that will lead to customer communication. "If X% of requests are failing, then we had better advertise that the system is degraded." Discussing and setting these thresholds in advance and the expectation that they'll result in communication helps drive the right outcome. It's not perfect, because one is always tempted to make a judgment call in the circumstance, which is vulnerable to the same effect, but it's a good start.

Thanks for sharing!
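A rough sketch of what such a predefined threshold might look like in code, in case it helps. The names (`DegradationPolicy`, `must_announce`), the 5% figure, and the minimum-sample guard are all assumptions for illustration, not anything from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DegradationPolicy:
    """Agreed in advance, so nobody has to make a judgment call mid-incident."""
    failure_rate_threshold: float  # e.g. 0.05 means 5% of requests failing
    min_requests: int = 100        # hypothetical guard against tiny samples


def must_announce(policy: DegradationPolicy, failed: int, total: int) -> bool:
    """Return True when the observed failure rate exceeds the agreed threshold."""
    if total < policy.min_requests:
        return False
    return failed / total > policy.failure_rate_threshold
```

For example, with `DegradationPolicy(failure_rate_threshold=0.05)`, 10 failures out of 100 requests would trigger the announcement, while 5 out of 100 would not. The point is that the decision is mechanical once the numbers are in.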

spydum|10 years ago

i tend to get to debug problems like this (usually in 3rd party code i don't know the internals of) pretty frequently.. my experience has been it tends to follow a curve.. MOST of the time, the problem is simple and you can quickly dispatch it. the scary (or fun, depending on your perspective) part hits when you pass the first level, and there are still problems.. and you don't know if it's two or ten levels deeper. then you get into that crazy test/optimize cycle and crawl out two weeks later wondering when you last ate..

mtsmith85|10 years ago

That totally jibes with what I found "reassuring" in a sense. That even very smart people sometimes get hit with inadvertent "multiple problems looking like a single issue" situations.

mryan|10 years ago

This "it's almost fixed, I'll email the client soon" pattern is something I have personally struggled with a lot, and I agree it appears to be common with engineers.

My workaround has been to make something else responsible for sending the email. In a team, this could be a manager setting a cut-off point after which communication must be made. When working on my own, I set an alarm for X minutes. When that alarm goes off I ignore the internal voice which says "just try one more thing, then send the email", and send an update to let the relevant people know my current progress, ETA to fix, and when they can expect the next update.

I think this is similar to how GTD encourages us to use systems for storing to-do lists instead of trying to remember them - our fragile human brains are not always to be trusted.
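If you want the "alarm" to live in code rather than on a phone, a minimal sketch could look like this. `UpdateDeadline` and its methods are hypothetical names; the clock is injectable only so the behaviour can be checked without waiting.

```python
import time


class UpdateDeadline:
    """Tracks when the next status update is owed, regardless of fix progress."""

    def __init__(self, minutes: float, clock=time.monotonic):
        self._clock = clock
        self._interval = minutes * 60.0          # interval in seconds
        self._due_at = clock() + self._interval  # first update is due one interval from now

    def is_due(self) -> bool:
        """True once the interval has elapsed since start or the last update."""
        return self._clock() >= self._due_at

    def mark_sent(self) -> None:
        """Call after sending an update; schedules the next one."""
        self._due_at = self._clock() + self._interval
```

With `UpdateDeadline(30)`, `is_due()` flips to true after 30 minutes, and `mark_sent()` restarts the countdown. The code is trivial on purpose: the value is in not letting "one more thing" override it.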

Poiesis|10 years ago

I came here to write this comment essentially.

Much of the time I feel, "If I knew what the problem[s] [was|were] it'd be solved by now!" That's not exactly true, of course, but diagnosis is a large part of the total solution.

The type of answer that Colin gave above does not exactly win friends and influence people in most situations where you're part of a team or hierarchy. Can anyone share what they've done to give better answers in these cases? I understand why people want the answers, but I don't have them to give right away, particularly when it's Someone Else's system.

jballanc|10 years ago

One trick that I've learned (though I still have trouble routinely applying it myself) for these situations is: less is more.

That is, as engineers we tend to want details. All the details. We want to know what happened, why it happened, how it's going to be fixed, and how long that will take. Because we want all that detail for ourselves, we hesitate to contact our customers/boss until we have all the details. Combine that with a desire to fix problems as they come up, and you end up with, "I never told you there was a problem because I was always one fix away from the solution."

But most people are not engineers. They want to be acknowledged. They want to feel informed, even if they have fewer details than you would like to provide for them. Sometimes, something as simple as, "We've noticed that there is an issue and are currently working on a fix," goes a long way. Also don't be afraid to pull out, "Users have been reporting issues with backup performance. We do not currently believe this represents a service failure, but we are working to return performance to normal levels."

Your users trust you (otherwise they wouldn't pay you). If you "believe" something, they will too.

cperciva|10 years ago

Just to be clear, when Tarsnap users wrote to me I told them everything I could. The "I think it will be fixed soon" delay in sending out an email to the lists affected only people who didn't notice the issue, or who noticed but didn't ask about it.