Clarification: This is not meant with any ill-will towards PG or any of the other individuals who help run HN. It is a simple request for a postmortem eventually. Perhaps it's an unneeded request, but I think a lot of HNers echo the sentiment.
I don't know the details. Nick Sivo is in charge of this stuff, and he'll post something about it. I know he thinks the root of the problem was a disk failure. The server got wedged, and when we rebooted, the file system was corrupted. I'm not sure exactly why it took so long to restore. I was out of town the whole time this was happening.
The reason we lost so much data was that we only do nightly backups. That seemed enough when we started. Now that HN is a bigger part of more people's lives, we'll make more of an effort to make it proof against this sort of problem.
Another point to make is that, even with a lot of data loss, it would at least be good in the future to skip a bunch of post identifiers (potentially even just saying "well, we certainly didn't use 100,000 of them, so I skipped up to 7100000"). That way, data that used to be there, and may have been archived by various sources including Google Cache or the various reader clients people use on various devices, doesn't suddenly get swapped out for different posts, leading to widespread cache corruption (and further confusion or loss). I mean, even just for the sake of people who may have posted links to this content in various places: maybe they wrote an article or posted a tweet/comment somewhere referencing how great/horrible a post was, and now suddenly it is saying something entirely different and potentially quite awkward ;P.
As a quick example, for those still not sure what I mean: what used to be an article about Python 2/3...
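To make the identifier-skipping suggestion above a bit more concrete, here's a rough sketch in Python. Nothing here reflects how HN actually allocates item IDs; `next_id_after_restore`, `max_restored_id`, and `safety_gap` are hypothetical names, and the numbers are only illustrative.

```python
def next_id_after_restore(max_restored_id: int, safety_gap: int = 100_000) -> int:
    """Return the first ID to hand out after restoring from a backup.

    Hypothetical helper: it jumps past every ID that could have been
    issued between the backup and the crash, so old links and caches
    never end up pointing at unrelated new posts.
    """
    if safety_gap < 0:
        raise ValueError("safety_gap must be non-negative")
    # Skip a block larger than anything plausibly used since the backup.
    return max_restored_id + safety_gap + 1


# Illustrative numbers only: the highest ID found in the restored backup.
print(next_id_after_restore(7_000_000))
```

The gap just has to be bigger than the number of IDs that could have been handed out since the last backup; wasting a block of integers costs nothing.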
I'll post something more detailed tomorrow, but in terms of data loss, we went down at 2014-01-05 16:10:29 PST and restored a backup from 2014-01-05 01:00:00 PST.
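For anyone who wants the size of that window spelled out, the arithmetic on the two timestamps above is straightforward. Both are given in PST, so naive datetimes are enough for this sketch; the variable names are mine.

```python
from datetime import datetime

# Timestamps as quoted above, both in PST.
went_down = datetime(2014, 1, 5, 16, 10, 29)
last_backup = datetime(2014, 1, 5, 1, 0, 0)

# Everything written between the backup and the crash was lost.
lost_window = went_down - last_backup
print(lost_window)  # 15:10:29
```

So a little over 15 hours of posts, comments, and votes fell in the gap.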
Thanks for the update. I don't think all that much was lost, and a day to restore from a disk failure ain't bad. Please thank Nick and whoever else worked overtime to get things back up and running!
Yes, I noticed this as well from (the lack of) my own comment activity. I don't comment that often, but I had written something yesterday that has disappeared.
On a more general note, if you have backups and you aren't regularly testing restores from them, then you really don't have backups! As an added bonus, regular restoration tests let you practice for the "real deal", and you learn how long the entire process will take.
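One simple way to check a test restore is to compare file checksums between the live data and the restored copy. This is only a minimal sketch under my own assumptions (plain directories on disk, files small enough to read into memory); `tree_digest` and `restore_matches` are made-up names, not part of any backup tool.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def restore_matches(original: Path, restored: Path) -> bool:
    """True if both trees contain the same files with the same contents."""
    return tree_digest(original) == tree_digest(restored)
```

Running something like this on a schedule, and timing the restore step that precedes it, covers both halves of the point: you know the backup is usable and you know how long recovery takes.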
That's a shame, I was really looking forward to the comments for the article below. Unfortunately I had it loaded, but hit Ctrl+R (like I sometimes do) and lost it forever. :/
The Google cache got a few comments, but very few.
"backup storage is something you never regret paying for"
Indeed, but the time and CPU it takes to do backups can be a very nasty trade-off.
For that matter, as I write this, due to a mistake I made after an 85-minute power outage yesterday, I'm just now doing my daily incremental backup of my home machines to an LTO-4 tape drive. Keeping that drive fed fast enough to prevent "shoe-shining" took some effort: Bacula spools up to 100G at a time to a partition on a single 15K disk on a separate controller. But if I had an LTO-5 drive, from what I've heard there's no single disk in existence that can keep up with it (not counting SSDs, which are a very poor match for this use case).
I was bummed that the conversation around OpenStreetMap got killed in the middle of it, and now I do not see it on the front page. Does anyone have a link to that thread, or did it disappear?
I find it interesting that this question is fresher (by a minute), has more points (67 v 42 at snapshot), and has more comments (18 v 10 at snapshot) than "HackerNews down, unwisely returning http 200 for outage message" but is ranked lower (2 v 1 at snapshot).
Postmortem: it went down last night when people should have been going to sleep before their first day back at the job after holidays. It stayed down until the end of that day, with the last couple of days of vacation insanity erased.
Appreciate the gift of perspective that has been given.
Interesting that your perspective is locked into one side of the globe. ;) HN was down during the day my time, when we had already slept before returning to work. :)
I'm not an expert in internet architecture, but shouldn't a site this important be running on redundant servers? The irony of a tech site going down due to a technical issue is making me grin, however. Glad to see it back :)
Obviously I'm a fan of the site, etc, etc, but "important"? On what level?
I'm not even sure I'd call Facebook or Twitter important. Banking, yes. Weather warnings, yes. Things like that, sure. But I'm also pretty sure "important" is slightly over-egging it for dear HN.
Are there any plans to release a new version of Arc (if one exists) or the server-side code (without the business-critical stuff)? I'd guess there have been lots of improvements since the last Arc release.
I'm also interested in what the infrastructure of HN looks like. One of the tweets via @HNStatus seemed to imply that the site runs off of one application server.
I wonder how much effort it would be reasonable to spend on making HN more resilient to this kind of issue, given that it's relatively rare and HN doesn't really lose money during downtime such as this.
[+] [-] pg|12 years ago|reply
The reason we lost so much data was that we only do nightly backups. That seemed enough when we started. Now that HN is a bigger part of more people's lives, we'll make more of an effort to make it proof against this sort of problem.
[+] [-] saurik|12 years ago|reply
As a quick example, for those still not sure what I mean: what used to be an article about Python 2/3...
http://webcache.googleusercontent.com/search?q=cache:9368SwV...
...becomes a comment with numerous references for how to learn Arduino hacking.
https://news.ycombinator.com/item?id=7015438
[+] [-] kogir|12 years ago|reply
[+] [-] swalsh|12 years ago|reply
[+] [-] bane|12 years ago|reply
[+] [-] justinzollars|12 years ago|reply
[+] [-] theGimp|12 years ago|reply
You've probably all seen it by now, but from @HNStatus: [1]
[1] https://twitter.com/HNStatus/status/420179162138021888
[+] [-] aroman|12 years ago|reply
Good thing they're just silly internet points :)
[+] [-] sehrope|12 years ago|reply
On a more general note, if you have backups and you aren't regularly testing restores from them, then you really don't have backups! As an added bonus, regular restoration tests let you practice for the "real deal", and you learn how long the entire process will take.
[+] [-] mixmastamyk|12 years ago|reply
The Google cache got a few comments, but very few.
http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/
https://news.ycombinator.com/item?id=7015438
http://webcache.googleusercontent.com/search?q=cache:9368SwV...
[+] [-] hga|12 years ago|reply
Indeed, but the time and CPU it takes to do backups can be a very nasty trade-off.
For that matter, as I write this, due to a mistake I made after an 85-minute power outage yesterday, I'm just now doing my daily incremental backup of my home machines to an LTO-4 tape drive. Keeping that drive fed fast enough to prevent "shoe-shining" took some effort: Bacula spools up to 100G at a time to a partition on a single 15K disk on a separate controller. But if I had an LTO-5 drive, from what I've heard there's no single disk in existence that can keep up with it (not counting SSDs, which are a very poor match for this use case).
[+] [-] t0|12 years ago|reply
[+] [-] 8ig8|12 years ago|reply
Maybe that's nothing new, but I just noticed it. Seems like a bug.
[+] [-] yaddayadda|12 years ago|reply
[+] [-] sigvef|12 years ago|reply
[+] [-] lukeqsee|12 years ago|reply
[+] [-] lucb1e|12 years ago|reply
[+] [-] morganherlocker|12 years ago|reply
[+] [-] dbaupp|12 years ago|reply
(But I think the original thread was totally lost, I submitted it and it's not listed in my submission history.)
[+] [-] unknown|12 years ago|reply
[deleted]
[+] [-] eevilspock|12 years ago|reply
[+] [-] tambourine_man|12 years ago|reply
[deleted]
[+] [-] stokedmartin|12 years ago|reply
[deleted]
[+] [-] yaddayadda|12 years ago|reply
snapshot - http://oi40.tinypic.com/2mmbv5y.jpg
[+] [-] Kronopath|12 years ago|reply
http://jacquesmattheij.com/The+Unofficial+HN+FAQ#selfposts
[+] [-] unknown|12 years ago|reply
[deleted]
[+] [-] rhizome|12 years ago|reply
Appreciate the gift of perspective that has been given.
[+] [-] carljoseph|12 years ago|reply
Appreciate the gift a new perspective gives you.
[+] [-] geerlingguy|12 years ago|reply
[+] [-] ewoodrich|12 years ago|reply
EDIT: (never mind, it was just cached)
[+] [-] rcfox|12 years ago|reply
[+] [-] joshuaheard|12 years ago|reply
[+] [-] alan_cx|12 years ago|reply
Really?
Obviously I'm a fan of the site, etc, etc, but "important"? On what level?
I'm not even sure I'd call Facebook or Twitter important. Banking, yes. Weather warnings, yes. Things like that, sure. But I'm also pretty sure "important" is slightly over-egging it for dear HN.
(No offence PG xxxx)
[+] [-] dschiptsov|12 years ago|reply
[+] [-] nmc|12 years ago|reply
So https://news.ycombinator.com/news works, but https://news.ycombinator.com still redirects to "Sorry for the downtime. We hope to be back soon.".
[+] [-] watermel0n|12 years ago|reply
[+] [-] rainmaking|12 years ago|reply
[+] [-] ithkuil|12 years ago|reply
[+] [-] cenhyperion|12 years ago|reply
[+] [-] sigvef|12 years ago|reply
[1]: https://news.ycombinator.com/item?id=5229364
[+] [-] xmonkee|12 years ago|reply
[+] [-] noblethrasher|12 years ago|reply
[+] [-] jader201|12 years ago|reply
[+] [-] ithkuil|12 years ago|reply
[+] [-] DonGateley|12 years ago|reply
[+] [-] pearjuice|12 years ago|reply
[+] [-] pbhjpbhj|12 years ago|reply
[+] [-] royalghost|12 years ago|reply
[+] [-] stickhandle|12 years ago|reply
[+] [-] Nodex|12 years ago|reply
[deleted]