aftabh | 9 years ago:
> To my surprise, PostgreSQL v9.5 on Ubuntu 14.04 does not default to unicode schema.

PostgreSQL does (if a Unicode encoding was specified when the PostgreSQL database cluster was first initialized); it's the Ubuntu OS that creates the PostgreSQL database cluster with an ASCII encoding after PostgreSQL's installation.
> Here is the snippet that I had to use: ...
Instead of resorting to those hacks, follow the PostgreSQL documentation[1] to do it the right way. The simplest way is to initialize your PostgreSQL database cluster as:
initdb --encoding=UTF8 --pgdata=<database-cluster-dir>
If you want to use an existing database cluster, PostgreSQL documentation shows how you can do it.
--
[1] - https://www.postgresql.org/docs/9.5/static/multibyte.html
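As a quick check, the resulting cluster's encoding can be confirmed once the server is running; this is a generic sketch with no cluster-specific names assumed:

```shell
# Confirm the cluster's default encoding (template1 inherits it at initdb time).
# Prints "UTF8" for a cluster initialized as above.
psql -At -c "SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = 'template1';"
```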
dboreham | 9 years ago:

mrmondo | 9 years ago:

merb | 9 years ago:
newly setup:
approachingtraj | 9 years ago:
> Instead of resorting to those hacks, follow the PostgreSQL documentation[1] to do it the right way.

This is actually poor advice. The right way on Ubuntu is to use the packages provided by the PostgreSQL Development Group (https://wiki.postgresql.org/wiki/Apt), and in that setup you use pg_createcluster instead of initdb.
The snippet in the original article was so hacky because it fixed an existing cluster instead of destroying it and creating a new one in UTF-8.
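On that setup, a minimal sketch of replacing a mis-encoded cluster with a fresh UTF-8 one via the postgresql-common tooling might look like this (the version and the cluster name "main" are examples):

```shell
# Drop the mis-encoded cluster, then create and start a new one with a
# UTF-8 locale using the Debian/Ubuntu wrapper instead of raw initdb.
pg_dropcluster --stop 9.5 main
pg_createcluster --locale en_US.UTF-8 --start 9.5 main
```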
jbrantly | 9 years ago:
I've been trying to compare RDS for Postgres to other offerings like Compose or Heroku but have come up surprisingly dry on comparisons.
eliribble | 9 years ago:
I'm the decision maker for the company in question - Authentise. It was a combination of cost and the fact that we need to support an in-datacenter install option for our customers. RDS is really great if you need all of the failover and backup options it provides and you don't have the manpower to invest in it. That's how we got our services started. But as our expertise and automation grew, it ended up being a modest cost savings to handle our own backups and redundancies. Primarily, though, we just can't install RDS on-premises in someone else's datacenter.
markonen | 9 years ago:
In my personal comparison, a huge factor for RDS and against Heroku is something that I have never seen mentioned in these comparisons: you can run multiple databases on a single RDS instance, and you can't do that with Heroku.
My company has a bunch of microservices that each need a Postgres database, but none of them are particularly high traffic. We do however need high availability guarantees for all of them, and so Heroku's pricing starts at $200/mo per database.
With RDS, we get adequate performance for a similar total cost (around $300/mo), but get to run about a dozen separate high-availability databases on that same instance, each with their own usernames and passwords. That means significant savings.
(Of course, you could just share a single Heroku Postgres database and user account between these services, perhaps separating them by schema, but that wasn't really to my taste.)
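For illustration, carving one shared RDS instance into per-service databases with separate credentials might look like the sketch below (the host, role, and database names are made up):

```shell
# Create an isolated role and database for one microservice;
# repeat for each service on the same shared RDS instance.
psql -h shared-pg.example.rds.amazonaws.com -U master -d postgres <<'SQL'
CREATE ROLE orders_svc LOGIN PASSWORD 'change-me';
CREATE DATABASE orders OWNER orders_svc;
REVOKE CONNECT ON DATABASE orders FROM PUBLIC;
SQL
```

The REVOKE keeps other services' roles from connecting to a database they don't own.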
jdiscar | 9 years ago:
It's because it's a pretty situational thing to talk about. IMO, the most sensible reasons to move away from RDS are customer requirements, or that you're not using AWS for anything else.
What traits are you comparing on? What kind of a thing are you building? What features (cost, performance, ease of use, time) are the most important to you?
partisan | 9 years ago:
Agreed, I always look for the reasoning behind these types of decisions. It helps others understand the issues that people face with different platforms.
joshmn | 9 years ago:
https://www.runabove.com/PaaSDBPGSQL.xml
No affiliation, just a happy customer of OVH's hardware.
brightball | 9 years ago:
https://blog.codeship.com/heroku-postgresql-versus-amazon-rd...
qaq | 9 years ago:
craigkerstiens | 9 years ago:
Overall a nice post. It doesn't quite get into all the options for pg_dump, as there are a number of other flags that can be useful to reduce downtime. It's probably also worth taking a look at AWS Database Migration Service (though I haven't personally used it or heard from others that have yet) - https://aws.amazon.com/dms/.
One tool they did miss for continuous archiving is WAL-E, which tends to be the one most used including by us at Citus Cloud and Heroku Postgres - https://github.com/wal-e/wal-e
chrisatumd | 9 years ago:
We've tested out DMS to migrate from a 3-node Oracle RAC to Oracle RDS, and I was incredibly impressed. They list a bunch of limitations in their documentation that you should look at first. It migrated our roughly 1 TB database in two hours and ten minutes and handled replication flawlessly - not a single record discrepancy between the two database systems over a week-long test. We were not expecting it to work that well and doubted that it would work with RAC as the source.
willlll | 9 years ago:
Maybe Yani himself can chime in when it's daytime in his time zone, but if memory serves correctly, he did look at WAL-E. It had some nice features in its favor. I'm not sure why it didn't make it into the blog post.

*Disclaimer: I wrote that code while at Heroku.
lexandstuff | 9 years ago:
* "AWS DMS doesn't support change processing on Amazon RDS for PostgreSQL. You can only do a full load."
* "AWS DMS doesn't map some PostgreSQL data types, including the JSON data type. The JSON is converted to CLOB."
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source....
Hopefully it improves soon, since it's an otherwise awesome tool.
eliribble | 9 years ago:
seag64 | 9 years ago:
It would be interesting to hear the motivation behind why they did this. My own experience is moving in the opposite direction (from master-slave self-managed to RDS master-replica).
room271 | 9 years ago:
I wrote up my experiences and the reasons for the switch here: https://www.theguardian.com/info/developer-blog/2016/feb/04/...
samcheng | 9 years ago:
Do people do this for security / policy purposes, or are they motivated by cost savings? Does removing all that virtualization buy improved performance?
vacri | 9 years ago:
For those that have done this, I'd be interested in the performance gain (say, IOPS at an equivalent hosting + hardware cost). I'd also be interested in RDS vs. Heroku Postgres.
qaq | 9 years ago:
30,000 IOPS is less than a single crappy consumer-grade SSD.
6TB is a fairly low limit.
The largest instance has 244GB of RAM.
So: improved performance, larger instances, lower cost.
throwaway2016a | 9 years ago:
Amazon DMS (Database Migration Service) actually works really well and allows you to do this without writing any code. You may need to change your Postgres config a bit.
DMS isn't just for moving to RDS; it can also be used to move off of RDS, or even to move from one non-RDS server to another non-RDS server that isn't even on AWS.

It can also do Postgres to MySQL, MySQL to Postgres, etc.

It also does continuous replication, so once the copy is done it will keep up to date (yes, even if it is Postgres -> MySQL).

One gotcha, though: it does not bring over secondary keys, so make sure to recreate them before sending traffic over.

But this article does provide some insight on how to configure Postgres to be similar to RDS in terms of functionality.
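For example, a missing secondary index can be recreated by hand on the target after the full load finishes and before cutover (the connection-string variable, table, and index names here are placeholders):

```shell
# Recreate a secondary index that DMS did not carry over.
# CONCURRENTLY avoids blocking writes if replication is still running.
psql "$TARGET_DATABASE_URL" -c \
  'CREATE INDEX CONCURRENTLY users_email_idx ON users (email);'
```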
qaq | 9 years ago:
OK, a bit off topic, but just curious: we are considering rolling out a hosted Postgres service that would support:

much larger instances, up to 50TB
much higher IOPS, 1,000,000+
lower latency (InfiniBand vs. Ethernet SAN)
pricing in line with AWS RDS, +/-
optional ElasticSearch integration
optional change-notification pub/sub

Would you be interested / think such a service is a good idea? Would appreciate learning what you think.
Rezo | 9 years ago:
Think about it this way: OCZ has, on paper, SSDs that are simultaneously faster and cheaper than, say, Intel's. But I wouldn't put an OCZ drive into the lowest-budget ricer gaming PC I could imagine (let's just say they have a bit of a reputation, to put it lightly), and I should rightfully be fired if I suggested putting one in a server.

Yes, your service might be massively faster, but... at 10-15% cheaper? No way! Not until you've been running for years with a great reputation. Now, if you enter the managed database space and pull a DigitalOcean ($5-$40 pricing vs. comparable EC2 at $80-$400, AND better raw performance on top), then you can essentially create a new market niche. That's what I'd suggest focusing on, because businesses who operate 50TB databases aren't looking to save a few dollars. But people who use DO need a database to go with it that is both faster and cheaper than AWS.

Honestly, performance ranks pretty low among the things I've found people look for when choosing a managed database service. You can serve a million users per day and average a whopping 12 IOPS, which might just about push a Raspberry Pi running SQLite slightly. Got more than a million users? Congrats, you now officially qualify to throw money at problems - the kind of problems RDS and Amazon love to solve by injecting money.
audleman | 9 years ago:
Even if you did, I probably wouldn't use it. I don't want a third-party dependency on the most critical part of my infrastructure.
sandGorgon | 9 years ago:
Don't go for IOPS, but price. Make sure that there is replication and failover... people at this stage just want to make sure that their data isn't lost.

Use cheap hardware like OVH's - you can beat AWS on price. Plus, googling for "hosted postgresql" is still something that can be competed on.
adwf | 9 years ago:
My priorities are: Reliability > Size ~= Price > Performance
Reliability of a database is always no. 1; without it, nothing else matters. So the customer must have enough access to the machine to set up replication and failover themselves (e.g. using something like repmgr), or have damn good guarantees that you're doing it well for them - i.e. multiple replicas, verifiable backups, offsite backups, etc.

Second is size. This is a major sticking point for me when looking at hosted Postgres solutions - they're mostly for relatively small databases, <1TB or so.

Close behind in third is price. Even when you do find a service that provides a decent size, like when RDS eventually went up to 6TB (still not quite enough for me), the price they charge for it (and the RAM) over buying dedicated iron is exorbitant. Given that 6-8TB drives are mainstream now, it's a real head-scratcher why they'll charge $750-1500/month for a 6TB drive that costs $250 outright. And then charge you for the IOPS on top of that... Even factoring in multiple redundant drives/replicas and SSD vs. HDD, it's still very unfavourable. I can install 5 machines (at least) in a cluster for the price of 1 on RDS, giving me not just price savings but my no. 1 need - reliability.

Finally, performance. The number of high-IOPS workloads out there is rather small. You're far more likely to have a high storage need with small IOPS than high IOPS with smaller storage. Everyone overestimates the amount of traffic they'll get, and generally, if you have an IOPS problem, it's better fixed elsewhere in the app. Usually some bad code or an ORM is thrashing the DB.

Take a simple login DB as a thought experiment: just 10,000 IOPS is up to 864 million users logging in per day, whereas the storage for that number of users, at a very generous 1KB per user, is 864GB. Size is way more important than IOPS. And if you've got that number of users, you've also got your own datacentre :)

You might also run into the problem that the kind of organisations that need high storage + high IOPS are also the kind that wouldn't use the cloud for it in a million years (banks, high-freq. traders, etc).

In summary, I think the main thing that would attract me is high-capacity storage for a decent price. Reliability is a given, but it must be easy to manage. Ideally I'd like a mixed storage system, with the ability to arrange the tablespaces across the drives as I need. 50TB is great, but I need the ability to put the high-IOPS tables on an SSD, then spread the infrequently accessed tables onto large HDDs.
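The login-DB arithmetic above checks out, assuming (as the comment does) one I/O per login and 1KB of storage per user:

```shell
# 10,000 IOPS sustained over a full day, one I/O per login.
logins_per_day=$((10000 * 60 * 60 * 24))      # 864000000
# At 1KB per user, convert KB to (decimal) GB.
storage_gb=$((logins_per_day / 1000000))      # 864
echo "${logins_per_day} logins/day, ~${storage_gb}GB of storage"
```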
koolba | 9 years ago:
> [pg_dump ...] The main disadvantage of this method is that it will not provide high reliability.

Pretty sure the author means "availability", not "reliability". pg_dump is completely reliable - arguably more so than any other backup mechanism, as it creates logical, machine-independent backups.
Best intro tip regarding pg_dump: use -Fc (custom format)
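A minimal illustration of that tip (database and file names are placeholders):

```shell
# -Fc writes a compressed, custom-format archive that pg_restore can
# list, filter, reorder, and restore in parallel.
pg_dump -Fc -f mydb.dump mydb
pg_restore --jobs=4 --dbname=mydb_restored mydb.dump
```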
room271 | 9 years ago:
jayisbatman | 9 years ago: