DynamoDB was a constant source of anxiety for me and my team, and we fit the usage pattern perfectly, so the author's points didn't even apply to us.
Our primary problem was constantly having to manage and adjust our provisioned capacity so we could respond to usage spikes and keep our bill from becoming stratospheric. Worse, any time we added new functionality, such as nighttime analytics jobs that scanned the table or daily backups, we had to revisit our auto-scaling parameters or risk getting throttled and having our application randomly start spewing errors.
DynamoDB's billing and provisioning model is awful to deal with. When your team is having serious conversations about the repercussions of your data values growing from 4 KB to 8 KB, you very quickly begin questioning your decision to use DynamoDB in the first place.
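For readers wondering why a 4 KB to 8 KB jump is a serious conversation: DynamoDB meters reads in 4 KB units and writes in 1 KB units, so item size feeds directly into the provisioned-capacity bill. Here is a back-of-envelope sketch of that arithmetic, following AWS's published capacity-unit rules:

```python
import math

def read_capacity_units(item_size_bytes: int, strongly_consistent: bool = True) -> float:
    """RCUs consumed by one read: a single RCU covers a strongly
    consistent read of up to 4 KB; eventually consistent reads cost half."""
    units = math.ceil(item_size_bytes / 4096)
    return units if strongly_consistent else units / 2

def write_capacity_units(item_size_bytes: int) -> int:
    """WCUs consumed by one write: a single WCU covers up to 1 KB."""
    return math.ceil(item_size_bytes / 1024)

# Growing an item from 4 KB to 8 KB doubles the read cost of every
# access, and an 8 KB item already costs 8 WCUs per write.
print(read_capacity_units(4096))   # 1
print(read_capacity_units(8192))   # 2
print(write_capacity_units(8192))  # 8
```

Multiply those per-request units by your peak request rate and that is the capacity you must provision (and pay for) around the clock, which is exactly why item growth triggers budget meetings.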
The highest score I ever saw at a company I worked for was about $85K in DynamoDB spend. That's an eight, a five, and three zeroes after it. It was a feature so otherwise silly and simple that I don't want to flame its developers by describing it, but it might have been the most gratuitous burning of cash I've ever seen at a startup.
Is this really a distinction your "scrappy startup type" can afford to make?
> Although since version 3.4.x Mongo passes all Jepsen tests, when our original database was chosen the current version of Mongo at the time did not and was deemed unreliable by the team
Does not pass all Jepsen tests => deemed unreliable :( Wow. Of all possible reasons, THAT'S the reason MongoDB wasn't chosen? Jepsen, man. It's pretty sad that so many people read Jepsen results this way. Jepsen, really, is about the tests more than the results. It highlights various failure modes that may be difficult or uncommon to replicate in production. These failure modes are important to know about in order to understand the product you are using, but they are not the be-all and end-all when it comes to deciding whether that product is worth using!
Anyway, show me a database that passes Jepsen and I'll show you a database that only accepts writes to a single node.
I think if you're swimming in a sea of options you look for any port in the storm, and Aphyr's blog gives you all sorts of ammo to eliminate choices, even if they're only failing on microscopic edge cases that will never affect your small-scale project.
The really important part of this post is the section articulating why not Dynamo. It uncovers a big question I ask of all of my customers when they're trying to choose data stores at any point during the life of their application.
Do you know your query patterns for your application?
If you know how your application or product needs to pull data from the data store, you have far more information to help you pick the right data store for the problem at hand.
When you don't have any prior query patterns to go off of, picking a relational database to start your application is a good choice. You need the flexibility to allow your schema to change over time. The schema and query patterns will eventually reach a point where changes don't happen all that often and then you can look to other datastores if you need to scale. Plus, are you really going to have THAT much data when you start?
Disclosure: I work on Heroku Postgres. Could be a little biased....
Like the article's author, many people just don't have experience with relational databases and have only heard that they aren't flexible and don't scale.
They don't realize just how performant Postgres can be, often simply by sticking basic indexes where they're needed. Similarly, they may not realize the amount of data and load PG or MySQL can handle just fine.
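The "stick a basic index where it's needed" point is easy to demonstrate. The sketch below uses SQLite's stdlib bindings only so it runs anywhere; the table and index names are made up, the `CREATE INDEX` statement is the same in Postgres, and in Postgres you'd inspect the plan with `EXPLAIN ANALYZE` instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(10_000)])

query = "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?"

# Without an index the planner falls back to a full table scan.
before = conn.execute(query, ("user42@example.com",)).fetchone()[-1]
print(before)  # e.g. "SCAN users" (wording varies by SQLite version)

# One basic index turns the same query into an index lookup.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = conn.execute(query, ("user42@example.com",)).fetchone()[-1]
print(after)   # e.g. "SEARCH users USING INDEX idx_users_email (email=?)"
```

That one-line index is the difference between O(n) scans and O(log n) lookups, which is most of what people mean when they say Postgres "suddenly" got fast for them.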
> Assuming you know how your data needs to be queried when you create your table, then the hash/range key combination could be the answer to your query problems. However we did not.
DynamoDB's purpose is to allow you to easily scale a known query pattern. I definitely don't recommend it as a first solution when you don't know how the data is going to be accessed. Or when everything fits on one PostgreSQL box.
We've had a lot of success starting out with a big PostgreSQL box and migrating use cases to DynamoDB when a table gets too big and has a query pattern that fits (essentially K/V lookup with some simple indexes).
The good news is that Postgres is getting logical replication in Postgres 10, so a lot of the pain points when it comes to scaling writes will hopefully be eased.
Dismissing CouchDB because it only supports pre-sharding is extremely shortsighted. It takes very little time to replicate from one database with 2 shards to another with 3 shards, even across N nodes. Copying the *.couch files is the absolute wrong way to manage this, especially in a production environment where those files are constantly changing.
It is far, far easier to add nodes to a Cloudant/Couch2 cluster, set up realtime replication from db1 -> db2, and then, once the replication has caught up, switch your database connection in your app at your leisure.
I'd also like to remind anyone using DynamoDB: it cannot store empty strings. Let that sink in for a second. If you think you're going to be able to store JSON documents you'll be in for a nasty surprise when you try to insert a valid JSON document and either the request fails or it converts any empty strings in your JSON document into nulls.
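To make the surprise concrete, here is a sketch of the kind of lossy scrubbing layer teams ended up writing to get JSON documents past that validation error (the function name and sample document are made up for illustration):

```python
import json

def scrub_empty_strings(value):
    """Recursively replace empty strings with None, mimicking the lossy
    workaround client layers applied so DynamoDB would accept the item."""
    if value == "":
        return None
    if isinstance(value, dict):
        return {k: scrub_empty_strings(v) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub_empty_strings(v) for v in value]
    return value

doc = json.loads('{"name": "alice", "middle_name": "", "tags": ["", "admin"]}')
scrubbed = scrub_empty_strings(doc)
print(scrubbed)  # {'name': 'alice', 'middle_name': None, 'tags': [None, 'admin']}
```

Note the round trip is not faithful: a reader can no longer tell whether `middle_name` was originally `""` or `null`, which matters the moment any consumer treats those two differently.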
My own experience working with Postgres, discovering nifty features or trying out new ones, and articles like this make it clear for me that I'm unlikely to ever need anything else. By the time any of the applications I work on could grow to the size in which Postgres wouldn't be a good option, I'd hopefully be in the process of selling it anyway.
I'm quite ok with that. It's such an awesome piece of software.
I'm surprised the author didn't consider using DynamoDB Streams to replicate to a PGSQL database for the relational queries while keeping the primary key-value lookups on DDB.
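A minimal sketch of the transform step such a Streams-to-Postgres pipeline needs: decoding DynamoDB's tagged attribute-value format (the `{"S": ...}`, `{"N": ...}` shape the Streams API emits) into a plain row. The sample record below is trimmed and hypothetical; in a real pipeline a Lambda or Kinesis consumer would read the stream and apply each row to Postgres with an `INSERT ... ON CONFLICT DO UPDATE`:

```python
def from_dynamo(av):
    """Decode one DynamoDB attribute value ({'S': ...}, {'N': ...}, ...)
    into a plain Python value. Covers the common scalar/container tags."""
    (tag, val), = av.items()
    if tag == "S":
        return val
    if tag == "N":  # numbers arrive as strings
        return float(val) if "." in val else int(val)
    if tag == "BOOL":
        return val
    if tag == "NULL":
        return None
    if tag == "L":
        return [from_dynamo(v) for v in val]
    if tag == "M":
        return {k: from_dynamo(v) for k, v in val.items()}
    raise ValueError(f"unhandled type tag: {tag}")

def stream_record_to_row(record):
    """Flatten an INSERT/MODIFY stream record's NewImage into a dict
    suitable for an upsert against the relational replica."""
    image = record["dynamodb"]["NewImage"]
    return {k: from_dynamo(v) for k, v in image.items()}

# A trimmed-down stream record, shaped like the Streams API documents.
record = {"dynamodb": {"NewImage": {
    "account_id": {"S": "a-123"},
    "balance": {"N": "42.5"},
    "active": {"BOOL": True},
}}}
print(stream_record_to_row(record))  # {'account_id': 'a-123', 'balance': 42.5, 'active': True}
```

The jungturk reply below is right that this buys you eventual consistency between the two stores, so reads from the Postgres replica can lag the DynamoDB source of truth.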
It doesn't really seem like this guy was qualified to make as pivotal a decision as choosing the database your entire system will run on, given his self-described breadth of experience in the opening paragraphs...
You could say I'm a Postgres fanboy, but even to me he did not sound convincing.
The guy did not seem to know what he was talking about. It almost felt like he picked Postgres by throwing darts at the possible options. Yes, he made the right choice :), but I feel like he will misuse it[1] and a few months later write an article about how bad it was.
[1] Heh, I've seen Postgres tables that had 4 columns:
- id
- date added
- date updated
- a Thrift-encoded blob containing the data (this was before JSONB, but even then the data was highly relational and it was silly)
I've just started using DynamoDB on a project. Before I made the choice, I discussed some architecture issues with a friend who works at a big fast growing unicorn start-up. They use MongoDB for the majority of the app the user sees, they use ElasticSearch for search, postgres for reporting. Data gets exported or updated in the data-sources that best suit the need of that data, rather than trying to find a single data solution that fits every use of the data.
I submitted this article because of a single paragraph:
> When it came to the different types of NoSQL databases, it was pretty clear that a document store best suited our needs. We needed to be able to query based on the data so key/value was out, there was no need to have the ability to make all of the complex connections between data of a graph db, and nobody actually knows what a column db is (jk but our data is not the right use case for it). So document it was. We would have to change our data layout a bit to really work well with most document databases, our current data structure was similar to what you would expect for a relational db with tables etc since dynamo also uses a table structure. I wanted something that was ACID, none of that eventual consistency nonsense. That took a lot of them out of the race right there.
I don't get it. He needed arbitrary querying, relations, ACID, tables. And yet... his first thought was to reach for a document-based NoSQL database, and only later was he hit with the sudden realization that his relational data is best suited to a relational database.
What has gone so wrong here that an otherwise smart and educated engineer makes these kinds of fundamental mistakes? Is it the success of NoSQL marketing departments and the apparent un-sexiness of traditional relational databases? Is it education? The job market? A fear of SQL (that you never really have to write anyway...)? Is it the quick fix non-relational databases give you?
As far as I can see there are very very very few nails that a PostgreSQL hammer cannot hit. I guess that's part of the point of the article, but why does it need an article to explain this?
How can we stop these kinds of articles from needing to be written?
> The change drastically increased the speed of our API, some requests are up to 6x faster than they were with Dynamo. The account details request, our previously slowest call, took around 12 seconds with Dynamo and was reduced to 2 seconds with Postgres!
sigh
Edit: Not sure why this has been moved to the bottom of the thread. I wasn't trying to single this particular developer out, I was merely commenting on this troubling trend. Apologies if this came across in the wrong way.
NodeJS and MongoDB are frequently used together, and I think some fraction of NodeJS developers believe that NoSQL is just the better, newer, faster solution to data storage. I regularly run into customers who hire us to make their data model work, and some non-trivial cases end up with us moving them to PostgreSQL. Data engineering is actually a more complex topic than some people think, and it really bites you once you've been using the solution for a while.
He likely has no formal training in databases and therefore was never exposed to these ideas.
Side topic: AWS made an aggressive move to throw its resources at universities, and it is paying off handsomely. The new blood coming into our profession now arrives with the concept that infrastructure ~= AWS (mileage varies between people, but that's the idea).
It's way too easy for anyone to pick up Node and CRUD a few pieces of data into a NOSQL store and voila, I am an architect :)
Selecting the right framework, platform, model for current needs with reasonable forward looking support takes experience and strong formal fundamentals.
Keep in mind that the Postgres hammer is a non-standard hammer. Sure, it is pretty SQL-compliant (except there are no dirty reads, thanks to MVCC), but if you use JSON, arrays, or the many other wonderful features of Postgres, you are now tied to Postgres. You are also slowly inching your way into document-store territory.
The author will hopefully look back on this post and cringe a little. All the requirements point to a relational database, but that is not realized quickly because "if I need to use a database, I go straight to NoSQL"
Zaheer | 8 years ago:
Additionally they somewhat recently added an optional caching layer: https://aws.amazon.com/dynamodb/dax/
andreimatei1 | 8 years ago:
I know one - CockroachDB :). Relational, strongly-consistent, SQL, elastic, multi-active replication - the dream! (I work on it)
https://www.cockroachlabs.com/blog/cockroachdb-beta-passes-j... http://jepsen.io/analyses/cockroachdb-beta-20160829
koolba | 8 years ago:
'""' vs '' (single quotes to illustrate the wrapper for the total content).
jungturk | 8 years ago:
It will likely introduce some serious consistency issues that need to be attended to, but that pain might be better than the alternatives.
toomuchtodo | 8 years ago:
This school of thought that a SWE can know the entire stack, infra to app, is coming home to roost. You are the technical debt.
Learn your fundamentals, and if a relational database isn't part of your foundational knowledge, learn it now or get out.
jopsen | 8 years ago:
On the bright side, relational is coming back with RDS, Cloud SQL, Spanner, and CockroachDB. With any luck NoSQL will be forgotten :)