top | item 5709933

Dribbble offline due to Postgres problem

100 points| ericras | 13 years ago |twitter.com | reply

46 comments

order
[+] rosser|13 years ago|reply
This looks like disk page corruption.

In such a case, the very first thing you do — aside from verifying that your backups, replication, and WAL archiving are all working — before you even try diagnosing your problem any further than "something weird is happening" is make a filesystem-level copy of your PostgreSQL cluster. (If you're running LVM, ZFS, a SAN, or any other thing that lets you take an instantaneous filesystem snapshot, so much the better. Do that, and then copy it.) Then, and only then, should you even contemplate trying to un-fuck your database.

I cannot stress this point enough. That means make a tarball, or cp the directory, or rsync it, or whatever disk file level tool you prefer, and specifically and emphatically not pg_dump. (It's probably not going to make things any worse, but if you do have corrupted disk pages, pg_dump isn't likely work, anyway.)

Flailing around trying to fix things can sometimes make them worse. If you're working on your already broken data, and break it further without the safety net of a fs-level backup, you're ... well, you're worse off than you were five minutes prior, aren't you?

[+] astine|13 years ago|reply
Better, if you do have backups, just drop your database, reconstitute a new one (using your schema definition file which you should have saved,) and load your most recent backup. I've dealt with disk file level corruption and the best way to deal with it IMHO is not to deal with it at all.
[+] Zenst|13 years ago|reply
^^This

Still true today and as I say backup don't fuckup and plan for the worst and let Murphies law work in your favour.

[+] PhantomGremlin|13 years ago|reply
Since "dribbble.com" appears to be "taking a timeout for maintenance" I can't at a glance determine what, exactly, they do. So maybe they're a bunch of hobbyists. But if not ...

I don't understand how a "real" company, even in todays overheated environment of soon-to-fail hipster startups, doesn't do (at least) one of these two things: a) have competent employees on staff that are intimately familiar with their critical infrastructure and how to support it or b) pay, yes, gasp, pay some other company for professional support services. A glance at the Wikipedia entry for Postgres shows some possible companies that do that. Or, as already mentioned, there are mailing lists.

But twitter? Really? That's support? For someone other than a hobbyist running a website in their spare time?

Go ahead, flame me. But my first impression is "amateur hour". My apologies if dribbble.com really is an amateur effort.

[+] Xylakant|13 years ago|reply
Even people intimately familiar with a pieces of infrastructure - hardware or software - can't possibly know all failure modes. Asking on twitter about an obscure error that you've never seen doesn't cost anything and may be very well just one way of finding a fix. It's akin to asking on IRC or writing a mail to a mailing list - something which you seem to condone. I fail to see how it's any less professional.
[+] alexcroox|13 years ago|reply
Plenty of big sites use Twitter to inform users of down time, I prefer that, do you expect me to go hunting down their status page or just have it come up conveniently in my newsfeed...

Just google "reddit status" and check the first result.

[+] BoyWizard|13 years ago|reply
Who's to say they aren't? Spending 10 seconds asking on Twitter, and getting an 'oh yeah, I had that and fixed it with x, y and z' in response sure beats staying up until 2am fixing it.
[+] joevandyk|13 years ago|reply
Everyone should be running pg_basebackup + pg_receivexlog on a separate machine, preferably at a different data center.

These let you go back to any point in time. If you ran 'delete from orders where id=id', you can restore to the transaction before you ran that command.

http://www.packtpub.com/how-to-postgresql-backup-and-restore... contains more information.

Also, postgresql 9.3 (out in a few months) supports disk page checksums which can detect filesystem failures immediately.

[+] sickpig|13 years ago|reply
Did they try to post about their problem on postgresql mailing list (both pgsql bugs and general apply here)?

Postgres community is quite responsive and usually give very effective advice, as long as you use the proper channel to communicate.

[+] pilif|13 years ago|reply
Also, #postgres on Freenode. No matter when or what you ask, one of the regulars will usually respond instantly with very high-quality advice. No single product, no matter whether it was free software or not has ever reached this kind of support quality for me.
[+] bifrost|13 years ago|reply
I feel for you guys, but it sounds like a disk error. If you're not experts, you need to do one of the following A) rollback to a snapshot B) restore from backup. If you have neither of those, take a sql backup, and pray you can recreate the broken data.
[+] taitems|13 years ago|reply
Might as well comment here as tweets to the account seem to do nothing. Every 25th click or so has always resulted in a Chrome "no content" error. It's a frequent problem, and it's been happening for years.
[+] samuelfine|13 years ago|reply
You should probably email [email protected]. Twitter isn't an "official" support channel, so your tweets are far less likely to get noticed.

In the interest of saving time: browser / version / OS? Any browser extensions installed?

[+] unknown|13 years ago|reply

[deleted]

[+] samuelfine|13 years ago|reply
When a problem first rears it's head, we usually don't know what the actual cause is. Some problems take 5 minutes to resolve, and others keep me up until 1 in the morning. So, making any kind of specific statement in the middle of a crisis is often impossible, and always imprudent.
[+] 19_ploT|13 years ago|reply
It's back but I think one day of data is lost!