Dribbble offline due to Postgres problem

[+] rosser|13 years ago|reply

This looks like disk page corruption.

In such a case, the very first thing you do — aside from verifying that your backups, replication, and WAL archiving are all working — before you even try diagnosing your problem any further than "something weird is happening" is make a filesystem-level copy of your PostgreSQL cluster. (If you're running LVM, ZFS, a SAN, or any other thing that lets you take an instantaneous filesystem snapshot, so much the better. Do that, and then copy it.) Then, and only then, should you even contemplate trying to un-fuck your database.

I cannot stress this point enough. That means make a tarball, or cp the directory, or rsync it, or whatever disk file level tool you prefer, and specifically and emphatically not pg_dump. (It's probably not going to make things any worse, but if you do have corrupted disk pages, pg_dump isn't likely work, anyway.)

Flailing around trying to fix things can sometimes make them worse. If you're working on your already broken data, and break it further without the safety net of a fs-level backup, you're ... well, you're worse off than you were five minutes prior, aren't you?

[+] astine|13 years ago|reply

Better, if you do have backups, just drop your database, reconstitute a new one (using your schema definition file which you should have saved,) and load your most recent backup. I've dealt with disk file level corruption and the best way to deal with it IMHO is not to deal with it at all.

[+] Zenst|13 years ago|reply

^^This

Still true today and as I say backup don't fuckup and plan for the worst and let Murphies law work in your favour.

[+] PhantomGremlin|13 years ago|reply

Since "dribbble.com" appears to be "taking a timeout for maintenance" I can't at a glance determine what, exactly, they do. So maybe they're a bunch of hobbyists. But if not ...

I don't understand how a "real" company, even in todays overheated environment of soon-to-fail hipster startups, doesn't do (at least) one of these two things: a) have competent employees on staff that are intimately familiar with their critical infrastructure and how to support it or b) pay, yes, gasp, pay some other company for professional support services. A glance at the Wikipedia entry for Postgres shows some possible companies that do that. Or, as already mentioned, there are mailing lists.

But twitter? Really? That's support? For someone other than a hobbyist running a website in their spare time?

Go ahead, flame me. But my first impression is "amateur hour". My apologies if dribbble.com really is an amateur effort.

[+] Xylakant|13 years ago|reply

Even people intimately familiar with a pieces of infrastructure - hardware or software - can't possibly know all failure modes. Asking on twitter about an obscure error that you've never seen doesn't cost anything and may be very well just one way of finding a fix. It's akin to asking on IRC or writing a mail to a mailing list - something which you seem to condone. I fail to see how it's any less professional.

[+] alexcroox|13 years ago|reply

Plenty of big sites use Twitter to inform users of down time, I prefer that, do you expect me to go hunting down their status page or just have it come up conveniently in my newsfeed...

Just google "reddit status" and check the first result.

[+] BoyWizard|13 years ago|reply

Who's to say they aren't? Spending 10 seconds asking on Twitter, and getting an 'oh yeah, I had that and fixed it with x, y and z' in response sure beats staying up until 2am fixing it.

[+] bsg75|13 years ago|reply

Also, posts error without Postgres version, operating system, or any other relevant details which could help the community diagnose the issue quicker.

https://twitter.com/samuelfine/status/334499315949133825

[+] joevandyk|13 years ago|reply

Everyone should be running pg_basebackup + pg_receivexlog on a separate machine, preferably at a different data center.

These let you go back to any point in time. If you ran 'delete from orders where id=id', you can restore to the transaction before you ran that command.

http://www.packtpub.com/how-to-postgresql-backup-and-restore... contains more information.

Also, postgresql 9.3 (out in a few months) supports disk page checksums which can detect filesystem failures immediately.

[+] jordanthoms|13 years ago|reply

I'm using https://github.com/wal-e/wal-e which ships the logs to S3 with good results

[+] parfe|13 years ago|reply

The error in question is raised at http://doxygen.postgresql.org/sequence_8c_source.html#l01083

My guess is hard drive failure. Hope they have good backups.

[+] mattryanharris|13 years ago|reply

And if they don't?

[+] sickpig|13 years ago|reply

Did they try to post about their problem on postgresql mailing list (both pgsql bugs and general apply here)?

Postgres community is quite responsive and usually give very effective advice, as long as you use the proper channel to communicate.

[+] pilif|13 years ago|reply

Also, #postgres on Freenode. No matter when or what you ask, one of the regulars will usually respond instantly with very high-quality advice. No single product, no matter whether it was free software or not has ever reached this kind of support quality for me.

[+] bifrost|13 years ago|reply

I feel for you guys, but it sounds like a disk error. If you're not experts, you need to do one of the following A) rollback to a snapshot B) restore from backup. If you have neither of those, take a sql backup, and pray you can recreate the broken data.

[+] taitems|13 years ago|reply

Might as well comment here as tweets to the account seem to do nothing. Every 25th click or so has always resulted in a Chrome "no content" error. It's a frequent problem, and it's been happening for years.

[+] samuelfine|13 years ago|reply

You should probably email [email protected]. Twitter isn't an "official" support channel, so your tweets are far less likely to get noticed.

In the interest of saving time: browser / version / OS? Any browser extensions installed?

[+] unknown|13 years ago|reply

[deleted]

[+] unknown|13 years ago|reply

[deleted]

[+] samuelfine|13 years ago|reply

When a problem first rears it's head, we usually don't know what the actual cause is. Some problems take 5 minutes to resolve, and others keep me up until 1 in the morning. So, making any kind of specific statement in the middle of a crisis is often impossible, and always imprudent.

46 comments