One of my favorite idioms lately for doing bulk updates without incurring lock contention is to chain CTEs, like so:
with candidate_rows as (
    select id
    from table
    where conditions
    limit 1000
    for update nowait
), update_rows as (
    update table
    set column = value
    from candidate_rows
    where candidate_rows.id = table.id
    returning table.id
)
select count(1) from update_rows;
...and loop on issuing that query until the "count(1)" returns zero for some number of consecutive iterations (three works pretty well).
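If it helps to see the driver side, here's a rough Python sketch of that retry loop. The `run_batch` callable is a stand-in for issuing the chained-CTE statement above and reading back its count(1); it's hypothetical, not from the comment.

```python
def drain_batches(run_batch, zero_streak_needed=3, max_iters=10_000):
    """Repeatedly run one batched update until it reports zero updated
    rows for `zero_streak_needed` consecutive iterations.

    With `for update nowait`, a contended batch can report 0 even though
    matching rows remain, hence the consecutive-zeros requirement.
    """
    zero_streak = 0
    for _ in range(max_iters):
        updated = run_batch()
        zero_streak = zero_streak + 1 if updated == 0 else 0
        if zero_streak >= zero_streak_needed:
            return
    raise RuntimeError("batched update did not converge")

# Toy stand-in: a workload of 2500 rows drained 1000 at a time.
remaining = [2500]
def fake_batch():
    n = min(1000, remaining[0])
    remaining[0] -= n
    return n

drain_batches(fake_batch)
assert remaining[0] == 0
```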
Want to add a column to your "orders" table and populate it without blocking concurrent writes for as long as it will take to rewrite a multi-million row table? Want to re-hash user passwords using something stronger than MD5, but not prevent users from ... you know, logging in for the duration?
Do you have any resources for learning cool advanced SQL like this? I've used CTEs in the past, just from reading about them in the documentation, but I feel like I'm only scratching the surface of what's possible out there.
> Want to add a column to your "orders" table and populate it without blocking concurrent writes for as long as it will take to rewrite a multi-million row table?
PostgreSQL ALTER TABLE ... ADD COLUMN is an O(1) operation, and requires no data rewrite (as long as you are OK with NULLs).
You can even add an attribute to a composite type, and existing tables using that type will see the extra attribute. Again, O(1), no data rewrite.
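The nullable-column-add behavior is easy to see for yourself. This sketch uses Python's stdlib sqlite3 purely as an illustration of the same principle (the comment above is about PostgreSQL; SQLite's ADD COLUMN is likewise a catalog change for nullable columns, with existing rows simply reading back NULL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (id integer primary key, total real)")
conn.executemany("insert into orders (total) values (?)", [(9.5,), (12.0,)])

# Adding a nullable column is a metadata change: no table rewrite,
# and the pre-existing rows report NULL for the new column.
conn.execute("alter table orders add column note text")

rows = conn.execute("select id, total, note from orders order by id").fetchall()
assert rows == [(1, 9.5, None), (2, 12.0, None)]
```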
> re-hash user passwords using something stronger than MD5
Isn't this only possible if you do it on user login - unless you are also cracking user passwords... Actually has anybody cracked their own MD5 user passwords to upgrade them?
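For what it's worth, there is a well-known way to migrate every row offline without cracking anything: wrap the stored MD5 digest inside the stronger hash, so the stored value becomes strong_hash(md5(password)), and verification composes the same way at login. A rough Python sketch (PBKDF2 standing in for "something stronger"; function names are made up for illustration):

```python
import hashlib
import hmac
import os

def pbkdf2(data: bytes, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", data, salt, 100_000)

# Offline migration: wrap the existing MD5 digest; no password needed.
def wrap_legacy_hash(md5_hex: str, salt: bytes) -> bytes:
    return pbkdf2(bytes.fromhex(md5_hex), salt)

# Login-time verification composes MD5 and the wrapper the same way.
def verify_wrapped(password: str, salt: bytes, stored: bytes) -> bool:
    legacy = hashlib.md5(password.encode()).digest()
    return hmac.compare_digest(pbkdf2(legacy, salt), stored)

salt = os.urandom(16)
old = hashlib.md5(b"hunter2").hexdigest()  # legacy stored value
new = wrap_legacy_hash(old, salt)          # migrated in bulk, offline
assert verify_wrapped("hunter2", salt, new)
assert not verify_wrapped("wrong", salt, new)
```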
Is this really different from using a transaction with separate read (using an update lock) and update statements, and looping over that in the same way from the application until the count is 0?
I was interested in NoSQL. Went to a Mongo presentation. Both examples would have been easier and faster in SQL, even the SQL I used 20 years ago. I asked for an example that would show a performance advantage. I got a tired, vague statement about the supposed performance advantage. Seems like snake oil to me.
Unfortunately you were downvoted, but it wasn't so long ago on HN that every second story was NoSQL this, NoSQL that. There were even "SQL is dead"/"relational DBs are dead" posts, just ridiculous. So it's nice to see stories like this.
I've been writing a lot of recursive queries for Postgresql lately using CTEs. Quite cool though a little mindbending at times.
"NoSQL" is generally a misnomer. It's not SQL that is/was the problem, but that there are a lot of cases where specific common properties of RDBMSs are limiting. The NoSQL moniker is a result of the fact that most of these RDBMSs use SQL as the query language, and most of the "new" database engines do/did not.
Since then, a lot of the RDBMSs have adopted features that have reduced the gap. E.g. Postgres' rapidly improving support for indexed JSON data means that for cases where you have genuine reasons to have data you don't want a schema for, you can just operate on JSON (and best of all, you get to mix and match).
For some of the NoSQL databases that puts them in a pickle because they're not distinguishing themselves enough to have a clear value proposition any more.
But it is not the lack of SQL that has been the real value proposition.
(accidentally posted this to the root of the story)
Postgresql 9.4 with jsonb sends mongo to the dustbin, IMHO. If you have to write it in js close to the data or if plpgsql is too steep of a learning curve, you can play with the experimental plv8.
But you should really pick up plpgsql: it's "python" powerful, with an awesome db (and has a python 2.x-consistent API, sadly, but the doc is very good). There is a great Sublime Text 2 package that makes the writing and debugging of functions in one file just awesome. Write an uncalled dummy function that has a lot of the API in it at the top of your file, and you'll get autocomplete on that part of the API.
Specifically, do not miss getting acquainted with json and hstore, especially using json as a variable-size argument-passing and return mechanism; it's just hilariously effective.
cheers, and keep making this place (not only HN, our blue dot) better, F
> Both examples would have been easier and faster in SQL.
That's easy, so long as we mean the entire project when we say "faster". When I worked at Timeout.com they were importing information about hotels from a large number of sources. For some insane reason, they were storing the data in MySQL. Processing happened in 2 steps:
1.) the initial import was done with PHP
2.) a later phase normalized all the data to the schema that we wanted, and this was written in Scala
The crazy thing was that, during the first phase, we simply pulled in the data and stored it in the form that the 3rd party was using. That meant that we had a separate schema for every 3rd party that we imported data from. I think we pulled data from 8 sources, so we had 8 different schemas. When the 3rd party changed their schema, we had to change ours. If we added a 9th source of information, then we would have to create a 9th schema in MySQL. We also checked the 3rd party schema at this phase, which struck me as silly: it did not mean that step 2 could be innocent of the schema; rather, both step 1 and step 2 had to know the structure of those foreign schemas. But it was necessary because we were writing to a database that had a schema.
The system struck me as verbose and too complicated.
It's important to note that most of the work involved with step 1 could be skipped entirely if we used MongoDB. Simply import documents, and don't care about their schema. Dump all the data we get in MongoDB. Then we can move straight to step 2, which is taking all those foreign schemas and normalizing them to the schema that we wanted to use.
For ETL situations like this, NoSQL document stores offer a huge convenience. Just grab data and dump it somewhere. Simplify the process. Your transformation phase is the only phase that should have to know about schemas; the import phase should be allowed to focus on the details of getting data and saving it.
I'm not about to go to bat for Mongo in an SQL thread (there are plenty of real problems with that platform), but I rather enjoy their query syntax; it's very AST-like, even through the aggregation pipeline.
I don't believe it's much faster even in the best case (and I'm sure an experienced SQL expert wouldn't find it any "easier"), but on a grammatical level I do find it a fresh take on query structure, and writing queries and map-reduce jobs in CoffeeScript was extremely satisfying because of how terse, elegant, and pseudocode-like it turned out.
It seems to be a philosophical difference. A lot of people seem to think that the database should only support primitive operations, and the rest should be left to application code.
I am not one of those people -- I think a good database system (like postgres) can make many things dramatically simpler.
My experience with PostgreSQL and MySQL is that PostgreSQL supports lots of features, but performance needs a very careful eye: often a neat feature isn't as useful as it seems because the optimization isn't good enough. MySQL, on the other hand, tends to be very good at what it can do, and makes it easy to do silly things, like turning off referential integrity in a slow query to boost performance.
It almost feels like a worse-is-better story. As a programmer, PostgreSQL is much better to work with; more tools, better EXPLAIN, more features, more types, more of almost everything. But to use in the heat of battle, it's less clear-cut. PostgreSQL's replication story is complicated. MySQL master-master replication is fairly easy to set up, and if you use a master as a hot failover, it all mostly just works; when the primary site comes back up, it resyncs with the failover. PostgreSQL has a lot of different replication stories - without a strong central narrative, it's hard to gain confidence.
> but do people know why MySQL is lagging behind so much
Because for 99% of the MySQL user base there isn't a need for these features.
Those using MySQL at scale are using it as a dumb key-value store with horizontal sharding. Those who aren't are typically using it with an ORM, and so they aren't dealing with the database at the SQL layer.
Everyone else who is manually writing SQL generally was on or moved to Oracle, Teradata, SQL Server, PostgreSQL etc anyway.
One thing that would be nice is if SQL provided first-class support for sub-records.
So instead of "SELECT name, (SELECT GROUP_CONCAT(CONCAT_WS(',', post_id, post) SEPARATOR ';') FROM posts p WHERE p.user_id = u.user_id) AS 'posts' FROM users u WHERE u.user_id = 1",
you could do "SELECT name, (SELECT post_id, post FROM posts p WHERE p.user_id = u.user_id) AS 'posts' FROM users u WHERE u.user_id = 1".
and the query result would be { name : 'Todd', posts : [ { post_id : 1, post : 'My Comment' } ] }.
Obviously this is a simple example and could have been rewritten as a query on the posts table, inner joined on the user table, and duplicating the user's name in the result. But it becomes much nicer to have as queries get more complex.
A query that supports sub-records would give you the flexibility to structure data like a JSON object and simplify the server end of REST APIs.
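Until the query language supports that directly, the usual workaround is regrouping a flat join in application code. A sketch with Python's stdlib sqlite3 (table and column names match the example above; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table users (user_id integer primary key, name text);
create table posts (post_id integer primary key, user_id integer, post text);
insert into users values (1, 'Todd');
insert into posts values (1, 1, 'My Comment');
insert into posts values (2, 1, 'Another');
""")

# The flat join duplicates the user's name on every post row ...
rows = conn.execute("""
    select u.name, p.post_id, p.post
    from users u join posts p on p.user_id = u.user_id
    where u.user_id = 1
    order by p.post_id
""").fetchall()

# ... and the application regroups it into the nested sub-record shape.
result = {"name": rows[0][0],
          "posts": [{"post_id": pid, "post": post} for _, pid, post in rows]}
assert result == {"name": "Todd",
                  "posts": [{"post_id": 1, "post": "My Comment"},
                            {"post_id": 2, "post": "Another"}]}
```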
If you're just trying to get JSON out (for a simple query to REST api, or in a node.js environment), have you considered converting the record/recordset to json in the subquery?
The JSON functions in 9.3+ are pretty handy for that sort of thing. Andrew (the core developer who wrote most of that functionality) and I decided to keep the API pretty lightweight, as it's easy to also add your own functions to suit your needs.
But you should really pick up plpgsql: it's "python" powerful, with an awesome db (and has a python 2.x-consistent API, sadly, but the doc is very good). There is a great Sublime Text 2 package that makes the writing and debugging of functions in one file just awesome. Write an uncalled dummy function that has a lot of the API in it at the top of your file, and you'll get autocomplete on that part of the API.
Specifically, do not miss getting acquainted with json and hstore, especially using json as a variable-size argument-passing and return mechanism; it's just hilariously effective.
cheers, and keep making this place (not only HN, our blue dot) better,
F
I gave this talk again on Saturday in Moscow, and it was videotaped there. The organizers told me they'll publish the English original as well as the Russian voice-over version. I'm keen for the second one ;)
I work on several Django apps, and also write advanced SQL in Postgres (CTEs, plv8 functions, materialized views, etc.). One thing the ORM allows, though, is writing a custom SQL query, and it's pretty easy to 'cast' the SQL results into a Django model.
With that said, it's still sometimes a struggle to justify if almost all the queries are highly custom and the app doesn't require Django Admin. Certainly there are advantages, but the minimalism of Flask and focusing on the REST interface for integration and building microservices is more appealing in some cases.
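Outside Django, the "cast raw SQL rows into typed objects" move the comment mentions is only a few lines anyway. A framework-free sketch with sqlite3 and a dataclass (table and class names are made up for illustration; in Django itself this role is played by Manager.raw()):

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Order:
    id: int
    total: float

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (id integer primary key, total real)")
conn.executemany("insert into orders (total) values (?)", [(9.5,), (12.0,)])

# Arbitrary custom SQL, then "cast" each row into a typed object --
# the framework-free analogue of mapping raw query results onto a model.
rows = conn.execute("select id, total from orders where total > 10").fetchall()
orders = [Order(*row) for row in rows]
assert orders == [Order(id=2, total=12.0)]
```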
I often find myself wishing that SQL had a more uniform syntax. So much of it resembles COBOL to me, with all the capital letters and "english-like" arbitrary word order. At the same time, relational algebra exists and is very elegant, in an APL-like way. Would be very interested if there was a project to expose such an api to SQL databases.
What's really funny about this comment is that, while all of the features discussed in TFA are fully standards-compliant SQL, MySQL implements none of them, and its "upsert" isn't.
EDIT: And before one slags PostgreSQL too hard for not currently supporting upsert, one might peruse the relevant pg wiki page [1] to better understand where development stands and why we don't yet have it. (Hint: it's actually kinda complicated, assuming you want it done, you know, "right".)
rosser | 11 years ago:
CTEs are all that and the bag of chips.
dtech | 11 years ago:
SQLite seems logical because it needs to be kept lean for embedding purposes, but do people know why MySQL is lagging behind so much?
zapov | 11 years ago:
And Revenj has been using it for years: https://github.com/ngs-doo/revenj
And yes, Revenj now comes with an offline compiler ;)
fatalmind | 11 years ago:
I'll keep you posted.
[1] https://wiki.postgresql.org/wiki/UPSERT
dbenhur | 11 years ago:
hahahaha.
If you think MySQL gets this right, you haven't thought about it much.
http://www.pgcon.org/2014/schedule/events/661.en.html
https://wiki.postgresql.org/wiki/UPSERT