Ask PG: Postmortem of the outage?

[+] pg|12 years ago|reply

I don't know the details. Nick Sivo is in charge of this stuff, and he'll post something about it. I know he thinks the root of the problem was a disk failure. The server got wedged, and when we rebooted, the file system was corrupted. I'm not sure exactly why it took so long to restore. I was out of town the whole time this was happening.

The reason we lost so much data was that we only do nightly backups. That seemed enough when we started. Now that HN is a bigger part of more people's lives, we'll make more of an effort to make it proof against this sort of problem.

[+] saurik|12 years ago|reply

Another point to make is that, even with a lot of data loss, it would be good in the future to at least skip a bunch of post identifiers (potentially even just saying "well, we certainly didn't use 100,000 of them, so I skipped up to 7100000"). Otherwise, data that used to be there, and may have been archived by various sources including Google Cache or the reader clients people use on various devices, suddenly gets swapped out for different posts, leading to widespread cache corruption (and further confusion or loss). I mean, even just for the sake of people who may have posted links to this content in various places: maybe they wrote an article or posted a tweet/comment somewhere referencing how great/horrible a post was, and now suddenly it is saying something entirely different and potentially quite awkward ;P.

As a quick example, for those still not sure what I mean: what used to be an article about Python 2/3...

http://webcache.googleusercontent.com/search?q=cache:9368SwV...

...becomes a comment with numerous references for how to learn Arduino hacking.

https://news.ycombinator.com/item?id=7015438
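
A minimal sketch of the ID-skipping idea above, assuming item IDs are plain sequential integers and that the highest ID in the restored backup is known. The function name, the 100,000 margin, and the sample backup ID are hypothetical illustrations, not anything HN is known to run.

```python
def next_id_after_restore(max_id_in_backup: int, safety_margin: int = 100_000) -> int:
    """Return the first ID to hand out after restoring from a backup.

    Assumption: fewer than `safety_margin` IDs were issued between the
    backup and the crash, so jumping past that range means no lost ID
    is ever reused for a different post.
    """
    if max_id_in_backup < 0 or safety_margin <= 0:
        raise ValueError("IDs must be non-negative and the margin positive")
    return max_id_in_backup + safety_margin + 1


# Hypothetical backup whose newest item has ID 7,000,000.
print(next_id_after_restore(7_000_000))
```

The margin only has to be an upper bound on how many IDs could have been issued in the lost window; overshooting costs nothing but a gap in the numbering.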

[+] kogir|12 years ago|reply

I'll post something more detailed tomorrow, but in terms of data loss, we went down at 2014-01-05 16:10:29 PST and restored a backup from 2014-01-05 01:00:00 PST.
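
For reference, those two timestamps put the window of lost data at a little over 15 hours. A quick check, assuming both are fixed-offset PST (UTC-8); the variable names are just for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST treated as a fixed UTC-8 offset (no DST in January).
PST = timezone(timedelta(hours=-8), "PST")

went_down = datetime(2014, 1, 5, 16, 10, 29, tzinfo=PST)
backup_taken = datetime(2014, 1, 5, 1, 0, 0, tzinfo=PST)

# Everything written between the backup and the crash is gone.
lost_window = went_down - backup_taken
print(lost_window)  # 15:10:29
```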

[+] swalsh|12 years ago|reply

It's amazing to think that HN is still someone's "side project".

[+] bane|12 years ago|reply

Thanks for the update. I don't think all that much was lost, and a day to restore from a disk failure ain't bad. Please thank Nick and whoever else worked overtime to get things back up and running!

[+] justinzollars|12 years ago|reply

Thanks for the update!

[+] theGimp|12 years ago|reply

It seems all activity from the past two days has disappeared -- backup storage is something you never regret paying for.

You've probably all seen it by now, but from @HNStatus: [1]

  Server back up and seemingly stable. Now restoring our latest backup to recover from limited filesystem corruption.

[1] https://twitter.com/HNStatus/status/420179162138021888

[+] aroman|12 years ago|reply

Yeah, I lost about 200 karma (which was about 15% of my total) in the crash.

Good thing they're just silly internet points :)

[+] sehrope|12 years ago|reply

Yes I noticed this as well from (the lack of) my own comment activity. I don't comment that often but I had written something yesterday that has disappeared.

On a more general note, if you have backups and don't regularly test restoring them, then you really don't have backups! As an added bonus, regular restoration tests let you practice for the "real deal", and you learn how long the entire process will take.
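
A rough sketch of what such a restore drill could look like, assuming the backup is a tar archive and that a manifest of SHA-256 checksums was recorded at backup time. Both of those, and the names `sha256_of` and `restore_drill`, are assumptions for illustration only.

```python
import hashlib
import tarfile
import tempfile
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_drill(archive: Path, manifest: dict[str, str]) -> tuple[list[str], float]:
    """Restore `archive` into a scratch directory and verify the result.

    `manifest` maps relative file paths to the SHA-256 recorded at backup time.
    Returns (paths that are missing or differ, seconds the drill took).
    """
    started = time.monotonic()
    bad = []
    with tempfile.TemporaryDirectory() as scratch:
        # Only run this against archives you created yourself.
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        for rel, expected in manifest.items():
            restored = Path(scratch) / rel
            if not restored.is_file() or sha256_of(restored) != expected:
                bad.append(rel)
    return sorted(bad), time.monotonic() - started
```

An empty list means every file in the manifest came back intact, and the elapsed time gives a first estimate of how long a real restore would take.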

[+] mixmastamyk|12 years ago|reply

That's a shame, I was really looking forward to the comments for the article below. Unfortunately I had it loaded, but hit Ctrl+R (like I sometimes do) and lost it forever. :/

The Google cache got a few comments, but very few.

http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/

https://news.ycombinator.com/item?id=7015438

http://webcache.googleusercontent.com/search?q=cache:9368SwV...

[+] hga|12 years ago|reply

"backup storage is something you never regret paying for"

Indeed, but the time and CPU it takes to do backups can be a very nasty trade off.

For that matter, as I write this, due to a mistake I made after an 85-minute power outage yesterday, I'm just now doing my daily incremental backup of my home machines to an LTO-4 tape drive. Keeping that drive fed fast enough to prevent "shoe-shining" took some effort; Bacula spools up to 100G at a time to a partition on a single 15K disk on a separate controller. But if I had an LTO-5 drive, from what I've heard there's no single disk in existence that can keep up with one (not counting SSDs, which are a very poor match for this use case).

[+] t0|12 years ago|reply

Were they not using RAID or performing multiple database writes? A mechanical hard drive failure is pretty common and can be mitigated fairly easily.

[+] 8ig8|12 years ago|reply

What's odd is that if you look at your submission history, you can upvote your own submissions.

Maybe that's nothing new, but I just noticed it. Seems like a bug.

[+] yaddayadda|12 years ago|reply

Several comment threads that I was following when it went down are gone ("No such item."), although their original links are still valid.

[+] sigvef|12 years ago|reply

During the outage, https://twitter.com/HNStatus went from somewhere around 300 to 1163 followers.

[+] lukeqsee|12 years ago|reply

Earlier today I saw it had about 45 followers. I think it was a new account. (Please correct me if I'm wrong.)

[+] lucb1e|12 years ago|reply

I actually looked for a twitter account, but didn't know which would be official. Then jgrahamc retweeted hnstatus and I knew which to follow.

[+] morganherlocker|12 years ago|reply

I was bummed that the conversation around openstreetmaps got killed in the middle of it, and now I do not see it on the front page. Does anyone have a link to that thread or did it disappear?

[+] dbaupp|12 years ago|reply

It's been resubmitted: https://news.ycombinator.com/item?id=7015502

(But I think the original thread was totally lost, I submitted it and it's not listed in my submission history.)

[+] unknown|12 years ago|reply

[deleted]

[+] eevilspock|12 years ago|reply

me too. i guess we can start over: https://news.ycombinator.com/item?id=7015502

[+] tambourine_man|12 years ago|reply

[deleted]

[+] stokedmartin|12 years ago|reply

[deleted]

[+] yaddayadda|12 years ago|reply

I find it interesting that this question is fresher (by a minute), has more points (67 v 42 at snapshot), and has more comments (18 v 10 at snapshot) than "HackerNews down, unwisely returning http 200 for outage message" but is ranked lower (2 v 1 at snapshot).

snapshot - http://oi40.tinypic.com/2mmbv5y.jpg

[+] Kronopath|12 years ago|reply

Self posts are penalized so they don't clog the front page for long.

http://jacquesmattheij.com/The+Unofficial+HN+FAQ#selfposts

[+] unknown|12 years ago|reply

[deleted]

[+] rhizome|12 years ago|reply

Postmortem: it went down last night when people should have been going to sleep before their first day back at the job after holidays. It stayed down until the end of that day, with the last couple of days of vacation insanity erased.

Appreciate the gift of perspective that has been given.

[+] carljoseph|12 years ago|reply

Interesting that your perspective is locked into one side of the globe. ;) HN was down during the day my time, when we had already slept before returning to work. :)

Appreciate the gift a new perspective gives you.

[+] geerlingguy|12 years ago|reply

Would like to read it too. And it looks like right now is a good time to get just about anything on the front page. Front pretty much == new.

[+] ewoodrich|12 years ago|reply

Are you actually able to see 'top'? I'm still getting the error.

EDIT: (never mind, it was just cached)

[+] rcfox|12 years ago|reply

pg: I don't know how much you care to get back the data that was lost, but it seems like it's at least partially available in the hnsearch.com API: http://api.thriftdb.com/api.hnsearch.com/items/_search?prett...

[+] joshuaheard|12 years ago|reply

I'm not an expert in internet architecture, but shouldn't a site this important be running on redundant servers? The irony of a tech site going down due to a technical issue is making me grin, however. Glad to see it back :)

[+] alan_cx|12 years ago|reply

"Important"?

Really?

Obviously I'm a fan of the site, etc, etc, but "important"? On what level?

I'm not even sure I'd call Facebook or Twitter important. Banking, yes. Weather warnings, yes. Things like that, sure. But I'm also pretty sure "important" is slightly over-egging it for dear HN.

(No offence PG xxxx)

[+] dschiptsov|12 years ago|reply

Are there any plans to release a new version of Arc, if one exists, or the server-side code (without the business-critical stuff)? I guess there have been lots of improvements since the last Arc release.

[+] nmc|12 years ago|reply

Despite the website being back online, the root URL still redirects to the error page (at the time of writing this).

So https://news.ycombinator.com/news works, but https://news.ycombinator.com still redirects to "Sorry for the downtime. We hope to be back soon.".

[+] watermel0n|12 years ago|reply

It's your browser cache.

[+] rainmaking|12 years ago|reply

This must have been the most productive time for the tech industry in months.

[+] ithkuil|12 years ago|reply

No, I just kept wasting time reloading HN home page or following notifications on twitter!

[+] cenhyperion|12 years ago|reply

I'm also interested in what the infrastructure of HN looks like. One of the tweets via @HNStatus seemed to imply that the site runs off of one application server.

[+] sigvef|12 years ago|reply

HN is indeed running on a single (10-month-old) server, it seems [1].

[1]: https://news.ycombinator.com/item?id=5229364

[+] xmonkee|12 years ago|reply

Social experiment

[+] noblethrasher|12 years ago|reply

You may jest, but I once suggested something like that: https://news.ycombinator.com/item?id=2403880

[+] jader201|12 years ago|reply

I considered this as well (seriously), until I saw the lost data from the past day or so.

[+] ithkuil|12 years ago|reply

I wonder how much effort would be reasonable to improve the resilience of HN to this kind of issue, given that it's relatively rare and HN doesn't really lose money during a downtime such as this.

[+] DonGateley|12 years ago|reply

If the outage was due to something malicious I don't really expect to see a postmortem.

[+] pearjuice|12 years ago|reply

Do we get the karma we lost refunded somehow? I am certain I am missing around 30 points.

132 comments