How to move from Amazon RDS to a dedicated PostgreSQL server

[+] aftabh|9 years ago|reply

>To my surprise, PostgreSQL v9.5 on Ubuntu 14.04 does not default to unicode schema.

PostgreSQL does (if unicode character set was specified when PostgreSQL database cluster is initialized first time); it's Ubuntu OS which creates PostgreSQL database cluster with ASCII character set (encoding) after PostgreSQL's installation.

> Here is the snippet that I had to use: ...

Instead of resorting to those hacks, follow the PostgreSQL documentation[1] to do it the right way. The simplest way is to initialize your PostgreSQL database cluster as:

initdb --encoding=UTF8 --pgdata=<database-cluster-dir>

If you want to use an existing database cluster, PostgreSQL documentation shows how you can do it.

--

[1] - https://www.postgresql.org/docs/9.5/static/multibyte.html

[+] dboreham|9 years ago|reply

Using the distro-supplied packages for PG (or any database server) is typically not the best choice for a production deployment.

[+] mrmondo|9 years ago|reply

Ubuntu has lots of poor defaults and this is a good example of one.

[+] merb|9 years ago|reply

actually that is not true for ubuntu 16.04 there it will use the locale setting:

newly setup:

                                      List of databases
       Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
    -----------+----------+----------+-------------+-------------+-----------------------
     postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
     template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
               |          |          |             |             | postgres=CTc/postgres
     template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
               |          |          |             |             | postgres=CTc/postgres
    (3 rows)

[+] approachingtraj|9 years ago|reply

> Instead of resorting to those hacks, follow the PostgreSQL documentation[1] to do it the right way. The simplest way is to initialize your PostgreSQL database cluster as:

This is actually poor advice. The right way on Ubuntu is to use the packages provided by PostgreSQL Development Group (https://wiki.postgresql.org/wiki/Apt), and in that setup you use pg_createcluster instead of initdb.

The snippet in the original article was so hacky because it fixed an existing cluster instead of destroying it and creating a new one in UTF-8.

[+] jbrantly|9 years ago|reply

Not that it necessarily should, but the article did not really address why they were moving away from RDS. Was it cost, performance, other issues?

I've been trying to compare RDS for Postgres to other offerings like Compose or Heroku but have come up surprisingly dry on comparisons.

[+] eliribble|9 years ago|reply

I'm the decision maker for the company in question - Authentise. It was a combination of cost and the fact that we need to support an in-datacenter install option for our customers. RDS is really great if you need all of the failover and backup options it provides and you don't have the manpower to invest in it. That's how we got our services started. But, as our expertise and automation grew it ended up being a modest cost savings to handle our own backups and redundancies. Primarily, though, we just can't install RDS in someone else's datacenter on premises.

[+] markonen|9 years ago|reply

In my personal comparison, a huge factor for RDS and against Heroku is something that I have never seen mentioned in these comparisons: you can run multiple databases on a single RDS instance and you can't do that with Heroku.

My company has a bunch of microservices that each need a Postgres database, but none of them are particularly high traffic. We do however need high availability guarantees for all of them, and so Heroku's pricing starts at $200/mo per database.

With RDS, we get adequate performance for a similar total cost (around $300/mo), but get to run about a dozen separate high-availability databases on that same instance, each with their own usernames and passwords. That means significant savings.

(of course you could just share a single Heroku Postgres database and user account between these services, perhaps separating them by scheme, but that wasn't really to my taste.)

[+] jdiscar|9 years ago|reply

It's because it's a pretty situational thing to talk about. IMO, the most sensible reasons to move away from RDS would be customer requirements or you're not using AWS for anything else.

What traits are you comparing on? What kind of a thing are you building? What features (cost, performance, ease of use, time) are the most important to you?

[+] partisan|9 years ago|reply

Agreed, I always look for the reasoning behind these types of decisions. It helps others understand the issues that people face with different platforms.

[+] joshmn|9 years ago|reply

OVH is likely to launch a dedicated instance for Postgres in the near future. It's currently in the beta stage of RunAbove.

https://www.runabove.com/PaaSDBPGSQL.xml

No affiliation, just a happy customer of OVH's hardware.

[+] brightball|9 years ago|reply

I wrote one up for Codeship earlier this year:

https://blog.codeship.com/heroku-postgresql-versus-amazon-rd...

[+] qaq|9 years ago|reply

Cost, performance, size limitations

[+] craigkerstiens|9 years ago|reply

Overall a nice post. It doesn't quite get into all the options for pg_dump as there's a number of other flags that can be useful to reduce downtime. It's also probably worth taking a look at Amazon Migration Service as well (though I haven't personally used it or heard from others that have yet) - https://aws.amazon.com/dms/.

One tool they did miss for continuous archiving is WAL-E, which tends to be the one most used including by us at Citus Cloud and Heroku Postgres - https://github.com/wal-e/wal-e

[+] chrisatumd|9 years ago|reply

We've tested out DMS to migrate from a 3 node Oracle RAC to Oracle RDS, and I was incredibly impressed. They list a bunch of limitations in their documentation that you should look at first. It migrated our roughly 1 TB database in two hours and ten minutes and handled replication flawlessly - not a single record discrepancy between the two database systems over a week long test. We were not expecting it to work that well and doubted that it would work with RAC as the source.

[+] willlll|9 years ago|reply

Yeah, I would* do a `pg_dump | pg_restore` with these settings: https://github.com/heroku/heroku/blob/master/lib/heroku/help...

*disclaimer: I wrote that code while at heroku

[+] lexandstuff|9 years ago|reply

Unfortunately, DMS is quite limited for Postgres. We trialled it but found these two limitations deal breakers:

* "AWS DMS doesn't support change processing on Amazon RDS for PostgreSQL. You can only do a full load."

* "AWS DMS doesn't map some PostgreSQL data types, including the JSON data type. The JSON is converted to CLOB."

http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source....

Hopefully it improves soon, since it's an otherwise awesome tool.

[+] eliribble|9 years ago|reply

Maybe Yani himself can chime in when it's daytime in his time zone, but if memory serves correctly he did look at WAL-E. It had some nice features in its favor. I'm not sure why it didn't make it into the blog post.

[+] seag64|9 years ago|reply

Yeah, came here to say that the migration service was built to solve this exact problem (among others).

[+] room271|9 years ago|reply

It would be interesting to hear the motivation of why they did this. My own experience is moving the opposite direction (from master-slave self-managed to RDS master-replica).

I wrote up my experiences and the reasons for the switch here: https://www.theguardian.com/info/developer-blog/2016/feb/04/... .

[+] samcheng|9 years ago|reply

For those that have done this, I'd be interested in the performance gain (say, IOPS at an equivalent hosting + hardware cost). I'd also be interested in RDS vs. Heroku Postgres.

Do people do this for security / policy purposes, or are they motivated by cost savings? Does removing all that virtualization buy improved performance?

[+] vacri|9 years ago|reply

I'm not sure that the article moved to bare iron. The author could have just set up postgres on a regular ec2 instance.

[+] qaq|9 years ago|reply

30,000 iops is less then single crappy consumer grade SSD 6TB is fairly low limit largest instance has 244Gb RAM So improved performance, running larger instances, lower cost

[+] throwaway2016a|9 years ago|reply

Amazon DMS (database migration service) actually works really well and allows you to do this without writing any code. You may need to change your Postgres config a bit.

DMS isn't just for moving two RDS it can also be used to move off or RDS or even move from one non-RDS server to another non-RDS server that isn't even on AWS.

It can also do Postres to MySQL, MySQL to Postgres, etc.

It also does continuous replication so once the copy is done it will keep up to date (yes, even if it is Postgres -> mySQL).

One gotcha, though, it does not bring over secondary keys so make sure to recreate them before sending traffic over.

But this article does provide some insight on how to configure Postgres to be similar to RDS in terms of functionality.

[+] qaq|9 years ago|reply

OK a bit of topic, but just curious we are considering rolling Postgres hosted service that would support much large instances up 50TB much higher IOPS 1,000,000+ lower latency (infiniband vs ethernet SAN) pricing would be in line with AWS RDS +/- optional ElasticSearch integration optional change notification pub/sub Would you guys be interested/think such service is a good idea? Would appreciate to learn what you guys think?

[+] Rezo|9 years ago|reply

When it comes to data, trust is key.

Think about it this way: OCZ has on paper SSDs that are simultaneously both faster and cheaper than, say, Intel. But I wouldn't put a OCZ drive into the lowest budged ricer gaming PC I could imagine (let's just say that have a bit of a reputation, to put it lightly), and I should be rightfully fired if I suggested putting one in a server.

Yes, your service might be massively faster, but... at 10-15% cheaper? No way! Not until you've been running for years with a great reputation. Now, if you enter the managed database space and pull a Digital Ocean ($5-$40 pricing vs comparable EC2 at $80-$400 AND better raw performance on top), then you can essentially create a new market niche. That's my suggestion to focus on, because businesses who operate 50TB databases aren't looking to save a few dollars. But people who use DO, need a database to go with it that is both faster and cheaper than AWS.

Honestly, performance ranks pretty low on the things I've found people look for when choosing a managed database service. You can serve a million users per day and average a whopping 12 IOPS, which might just about push a Raspberry Pi with sqlite slightly. Got more than a million users? Congrats, you now officially qualify to throw money at problems, the kind of problems RDS and Amazon loves to solve by injecting money.

[+] audleman|9 years ago|reply

My 2 cents: you're promising a lot more for the same price which seems dubious. How are you going to beat the wholesaler?

Even if you did I probably wouldn't use it. I don't want a 3rd party dependency on the most critical part of my infrastructure.

[+] sandGorgon|9 years ago|reply

Dont go for IOPS , but price. Make sure that there is replication and fail over... people at this stage just want to make sure that their data isn't lost. Use cheap hardware like ovh - you can beat AWS on price. Plus, googling for "hosted postgresql" is still something that can be competed upon.

[+] adwf|9 years ago|reply

My priorities are: Reliability > Size ~= Price > Performance

Reliability of a database is always no.1, without it, nothing else matters. So the customer must have enough access to the machine to setup replication and failover themselves (eg. using something like repmgr). Or have damn good guarantees that you're doing it well for them. ie. multiple replicas, verifiable backups, offsite backups, etc.

Secondly is size. This is a major sticking point for me when looking at hosted postgres solutions - they're mostly for relatively small databases <1Tb or so.

Closely in third is price. Even when you do find a service that provides a decent size like when RDS eventually went up to 6Tb (still not quite enough for me), the price they charge for it (and the ram) over buying dedicated iron is exorbitant. Given that 6-8Tb drives are mainstream now, it's a real head scratcher why they'll charge $750-1500/mth for a 6Tb drive that costs $250 outright. And then charge you for the IOPS on top of that... Even factoring in multiple redundant drives/replicas, SSD vs HDD, it's still very unfavourable. I can install 5 machines (at least) in a cluster for the price of 1 on RDS; giving me not just price savings, but my no.1 need - reliability.

Finally performance. The number of high IOPS workloads out there is rather small. You're far more likely to have a high storage need with small IOPS, than high IOPS, smaller storage. Everyone overestimates the amount of traffic they'll get, and generally if you have an IOPS problem, it's better off fixed elsewhere in the app. Usually some bad code or ORM is thrashing the DB.

Take a simple login DB as a thought experiment, just 10,000 IOPS is upto 864 million users logging in per day. Whereas the storage for that number of users, at a very generous 1Kb per user, is 864Gb. Size is way more important than IOPS. And if you've got that number of users, you've also got your own datacentre :)

You might also run into the problem that the kind of organisations that need high storage + high iops are also the kind that wouldn't use the cloud for it in a million years (banks, high freq. traders, etc).

In summary, I think the main thing that would attract me is high capacity storage for a decent price. Reliability is a given, but it must be easy to manage. Ideally I'd like a mixed storage system, with the ability to arrange the tablespaces across the drives as I need. 50Tb is great, but I need the ability to put the high IOPS tables on an SSD, then spread the infrequently accessed tables onto large HDDs.

[+] unknown|9 years ago|reply

[deleted]

[+] koolba|9 years ago|reply

From the article (emphasis mine):

> [pg_dump ...] The main disadvantage of this method is that it will not provide high reliability.

Pretty sure the author means "availability" not "reliabilitiy". pg_dump is completely reliable, arguably more so than ANY other backup mechanism as it creates logical machine independent backups.

71 comments