
A Critique of SQL, 40 Years Later

187 points | mariuz | 3 years ago | carlineng.com

253 comments

[+] exabrial|3 years ago|reply
The only thing that affects my daily life with SQL is that FROM should be before the SELECT keyword. This would _greatly_ improve type-ahead support in SQL IDEs.

Nothing is perfect, but that is really the main beef. Another commenter already nailed it: a LIMIT WITH ERROR clause that can be specified on UPDATE and DELETE statements and explicitly throws an error otherwise.

SQL is one of my favorite tools to use and I don't see it getting replaced by anything any time soon.

[+] stickfigure|3 years ago|reply
I started my career in the 90s writing ROLAP engines, and even though I've spent most of my time since doing "web" development, I still seem to end up having to build engines that generate SQL queries that are dozens of lines long.

The complaints about SQL composability are real. The grammar is fundamentally pretty irregular. Acceptable for humans to type ad-hoc, crappy for computers to generate.

You can like what SQL does for you (I do!) but it's also easy to imagine something a bit better. I thought the article was spot-on. I hope some future SQL x.0 will fix these issues, but also be similar enough to present-day SQL that I don't have to learn a whole new language from scratch.

[+] toto444|3 years ago|reply
Every time SQL is mentioned on HN someone comes to complain about FROM coming after SELECT. I use SQL every day and not a single time have I found reason to complain about it. Can you give a bit more detail about what's wrong with it being the way it is?

EDIT: thanks all for your replies. I now understand that it is an IDE-related thing, not something fundamental to the language.

[+] BeefWellington|3 years ago|reply
I think this stems from a misunderstanding of the entire point of SQL. It isn't about looking at data in an individual table, it's about retrieving a Result Set for complex queries.

All the FROM-first examples I've ever seen are almost universally the simplest query in the world where autocomplete is not a large hurdle anyways because you aren't even bothering with table aliasing. As soon as you do anything even moderately complex (multiple joins, subqueries, calculations, etc.) the advantage of putting FROM first vanishes, and if you're adding a table alias or hard reference to the table you can already see what is in the SELECT list AND in many cases get autocomplete.

[+] mmcdermott|3 years ago|reply
That's a fair point. This comment made me realize my own tendency to write "select * from..." so that I could supply the "from" before going back and replacing "*" with specific columns.
[+] vivegi|3 years ago|reply
Several decades ago while using Oracle and SQL*Plus, I used to generate the column list for queries from the data dictionary table (ALL_TAB_COLUMNS). Once I learnt that trick, I never typed the column list ever again. Eventually, I had a library of queries for common tasks. You could use that trick with almost all database engines.
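
A sketch of the same trick, assuming SQLite, where PRAGMA table_info plays the role of Oracle's ALL_TAB_COLUMNS data-dictionary view (the table and column names here are made up for illustration):

```python
import sqlite3

# In-memory database with a sample table; PRAGMA table_info is SQLite's
# rough analogue of Oracle's ALL_TAB_COLUMNS data-dictionary view.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT, salary REAL)"
)

# Pull the column names out of the data dictionary...
cols = [row[1] for row in conn.execute("PRAGMA table_info(employees)")]

# ...and generate the SELECT list instead of typing it by hand.
query = "SELECT " + ", ".join(cols) + " FROM employees"
print(query)  # SELECT id, name, dept, salary FROM employees
```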

IDEs were never a favorite (they were quite limited then).

Today, the situation is a lot better.

[+] d0mine|3 years ago|reply
Configure your IDE to help you write SELECT queries. It should be easy to insert a template with jump points in the desired order, e.g. a generic YASnippet in Emacs should be able to do it.

Thus by the time you are writing the column expression after the SELECT, the FROM table expression would be filled already (enabling scope-dependent column completion).
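
For instance, a minimal YASnippet along these lines (the name and key are placeholders, not an existing snippet) makes the FROM field the first tab stop:

```
# -*- mode: snippet -*-
# name: select-from-first
# key: sel
# --
SELECT ${2:*}
FROM ${1:table}
${3:WHERE 1=1};
```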

[+] Koshkin|3 years ago|reply
FWIW in LINQ, Select comes after From and Where.
[+] crazysim|3 years ago|reply
I don't know why my memory jogged to the old ruby on rails screencasts but I suppose there's nothing stopping IDEs from jumping to the FROM section of a query and then returning you back to the SELECT in some sort of "macro"/snippet. It's a hack, I guess.
[+] v0idzer0|3 years ago|reply
Just a few years ago JavaScript was pretty awful, so people started writing transpilers (most famously Babel) to add features like arrow functions, async/await, decorators, constants, etc. TypeScript is another transpiler that greatly improves the language but outputs regular old, awful JavaScript. Many of these features even ended up being built into the language.

I wonder if it's time for a transpiler renaissance in SQL? It would be trivial to convert "from users select id" into "select id from users". I'm sure there's a lot of other cool features/sugar you could add while still outputting valid SQL.
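
A toy sketch of that conversion, handling only the simplest single-table shape (a real transpiler would need a full grammar, not a regex):

```python
import re

def from_first_to_sql(query: str) -> str:
    """Rewrite a toy FROM-first query into standard SQL.

    Handles only 'from <table> select <columns> [where <cond>]';
    anything else raises rather than guessing.
    """
    m = re.match(
        r"from\s+(\S+)\s+select\s+(.+?)(?:\s+where\s+(.+))?$",
        query.strip(),
        re.IGNORECASE,
    )
    if not m:
        raise ValueError("unsupported query shape")
    table, columns, cond = m.groups()
    sql = f"SELECT {columns} FROM {table}"
    if cond:
        sql += f" WHERE {cond}"
    return sql

print(from_first_to_sql("from users select id"))  # SELECT id FROM users
print(from_first_to_sql("from users select id, name where id > 10"))
```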

[+] mordechai9000|3 years ago|reply
Maybe we should treat SQL like JavaScript, and use it as a compiler target instead of coding in it directly. /s
[+] aspyct|3 years ago|reply
Agreed.

I would also love to be able to specify the WHERE just after the FROM but before the JOINs, and also after them:

- the first one to filter the rows before joining
- the second one to filter after joining

[+] cmrdporcupine|3 years ago|reply
"Despite the improvements listed above, in a 2014 interview, CJ Date said “we didn’t realize how truly awful SQL was or would turn out to be (note that it’s much worse now than it was then, though it was pretty bad right from the outset).” This quote leaves me wondering – if Date himself were to write an updated critique, what would it look like? My best guess is most of his criticism would revolve around further departures of SQL from the relational model, but specific examples escape me."

I am not sure how the author has missed the copious amount of material that Date (& Darwen) wrote on this topic. The obvious one being the Third Manifesto (the whole book is available here [1] since it's out of print now). But even "Database in Depth" [2] has extensive discussion of this despite being a less "opinionated" and "ranty" book (highly recommended, BTW).

Yes, not explicitly updated since 2014, but there's really nothing new to say since then?

Date is definitely concerned about the departures from the relational model in SQL. But it's (EDIT) not just about purism. He's also, in general, pretty peeved at SQL for the same concerns the author presents here about syntax and expressiveness. The Third Manifesto's "Tutorial D" mandates a richer type system, including higher-level types that match on relation constraints, and "operators" (similar to OO methods) on those types; basically a kind of predicate dispatch.

I imagine he's pretty sick of being ignored.

[1] https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf

[2] https://www.oreilly.com/library/view/database-in-depth/05961...

[+] fipar|3 years ago|reply
I agree with everything you're saying, though I'm assuming here:

> Date is definitely concerned about the departures from the relational model in SQL. But it's just about purism.

Perhaps you meant to say "But it's not just about purism"?

[+] ErwinSmout|3 years ago|reply
In fact, Date has recently revised a whole slew of "old" papers of his, got them up-to-date wrt his current thinking, and (re-) published them. His "A critique of SQL/86" paper(s) are now chapter 8, 9 & 10 of "Database Dreaming, Vol I" (there's a Vol II too).
[+] crazygringo|3 years ago|reply
I've written a bajillion queries and have tons of nitpicks, but it's the twin meanings of NULL that really kills me.

NULL can be the value of a field in a record, but it is also used to indicate the lack of a record in a JOIN.

If I run:

  SELECT x.a, y.b FROM x LEFT JOIN y on x.a = y.a
and I get back

  [5, NULL]
I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.

Obviously there are lots of workarounds, but to me this is a truly fundamental design flaw. SQL should have specified another NULL-like value, call it EMPTY, used only for joins that find no data. And analogous to IS NULL, it would be checked using IS EMPTY.
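
The ambiguity is easy to reproduce; a sketch using SQLite with the x/y tables from the example above:

```python
import sqlite3

# A LEFT JOIN returns NULL for y.b both when the matching row exists
# with a NULL value and when no matching row exists at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x (a INTEGER);
    CREATE TABLE y (a INTEGER, b INTEGER);
    INSERT INTO x VALUES (5), (7);
    INSERT INTO y VALUES (5, NULL);  -- row exists, value is NULL
    -- no row in y for a = 7
""")

rows = conn.execute(
    "SELECT x.a, y.b FROM x LEFT JOIN y ON x.a = y.a ORDER BY x.a"
).fetchall()
print(rows)  # [(5, None), (7, None)] -- both cases look identical
```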

[+] dspillett|3 years ago|reply
> but it's the twin meanings of NULL that really kills me

NULL only has one meaning: NULL. This is roughly analogous to unknown.

The one that hits a lot of people is WHERE <value> NOT IN (<set>) where <set> contains a NULL. Because NOT IN unrolls to "<value> <> <s1> AND <value> <> <s2> AND … AND <value> <> <sx>", any NULL in the set makes one predicate NULL, which makes the whole expression NULL (never true) even when <value> differs from every other value in the set, so no rows come back at all.
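
A small SQLite demonstration of that trap (table and values are made up for illustration):

```python
import sqlite3

# A NULL in the NOT IN set makes the whole predicate NULL (not true),
# so the query silently returns no rows at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (v INTEGER);
    INSERT INTO t VALUES (1), (2), (3);
""")

without_null = conn.execute(
    "SELECT v FROM t WHERE v NOT IN (2) ORDER BY v"
).fetchall()
with_null = conn.execute(
    "SELECT v FROM t WHERE v NOT IN (2, NULL) ORDER BY v"
).fetchall()
print(without_null)  # [(1,), (3,)]
print(with_null)     # [] -- every predicate involving NULL is NULL
```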

> I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.

Not directly, but you can infer the difference going by the value of (in your example) y.a - if it is NULL then there was no match, otherwise a NULL for y.b is a NULL from the source not an indication of no match.

> SQL should have specified another NULL-like value

This sort of thing causes problems of its own. Are the unknowns equivalent? Where are they relevant? How do they affect each other? Do you need more to cover other edge cases? I have memories of VB6's four constants of the apocalypse (null, empty, missing, nothing).

This is one of the reasons some purists argue against NULL existing in SQL at all, rather than needing a family of NULL-a-likes.

[+] goto11|3 years ago|reply
A nullable column can always be extracted to a separate table with a non-nullable column and a foreign key. If you left-join this back to the base table, you get the NULLs again for the missing values. So to me it seems nicely symmetrical to use the same kind of NULL value.
[+] ianmcgowan|3 years ago|reply
I'd check to see if y.a IS NULL in that situation. I'm sure there are cases where it matters, but most of the time for me the difference between "there's a row in y, but the value is NULL" and "there's no row in y, the value is NULL" is irrelevant. I can't think of a time when that distinction has been important, and I'm working on a project converting thousands of complex SQL queries.

The thing that really bugs me about NULL is the default assumptions - 99.9% of the time I want NULL in a string context to be the empty string and NULL in a numeric context to be 0, but I have to use ISNULL or COALESCE to get that. I wish it were the other way round where NULL is treated as '' or 0 by default, but I can do something special if I really want to test for NULL'ness.

[+] rawgabbit|3 years ago|reply
As another commenter said, if your tables have required ID columns, you should include them in your SELECT.

     SELECT x.Id,x.a,y.ID,y.b 
     FROM x 
     LEFT JOIN y 
       on x.a = y.a    
The output [Id12345,5,Id6789,NULL] means the column is null. The output [Id12345,5,NULL,NULL] means the row is empty.

On a side note, I use the SQL below all the time to identify missing rows in the "comparetable".

  SELECT #basetable.columnname
  FROM #basetable
  LEFT OUTER JOIN #comparetable
  ON #basetable.columnname1 = #comparetable.columnname1
    and #basetable.columnname2 = #comparetable.columnname2
    and #basetable.columnname3 = #comparetable.columnname3
  WHERE #comparetable.columnname1 is null
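
The same anti-join pattern works with ordinary tables (the #temp-table names above are SQL Server syntax); a minimal single-column sketch in SQLite:

```python
import sqlite3

# Find rows in basetable that have no matching row in comparetable:
# LEFT OUTER JOIN, then keep only the rows where the join found nothing.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE basetable (columnname TEXT);
    CREATE TABLE comparetable (columnname TEXT);
    INSERT INTO basetable VALUES ('a'), ('b'), ('c');
    INSERT INTO comparetable VALUES ('a'), ('c');
""")

missing = conn.execute("""
    SELECT basetable.columnname
    FROM basetable
    LEFT OUTER JOIN comparetable
      ON basetable.columnname = comparetable.columnname
    WHERE comparetable.columnname IS NULL
""").fetchall()
print(missing)  # [('b',)]
```
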
[+] cmrdporcupine|3 years ago|reply
Fabian Pascal made a whole career out of writing about how stupid "NULL" is in SQL. See if you can find his writings some time. He was positively livid about it.

Notably NULL of any kind is completely absent from the relational algebra. There are other ways to express absence. Notably, this is all supposed to be first-order predicate logic <hand wave hand wave> stuff, and... in that world "null" is also not a "thing".

Also SQL insists on allowing duplicate "rows" in its "tables"; whereas the relational model is very clear that the contents of a relation ("table") is a set, not a bag.

These two confusions in SQL actually lead to many many data integrity problems as follow-on effects and also complicate query planning and optimization.

[+] paulclinger|3 years ago|reply
> I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.

You can always add SELECT y.a, which will allow you to disambiguate between the two options (it will be non-NULL in the first case and NULL in the second).

[+] cryptonector|3 years ago|reply
You need one more output column. The specifics will vary by RDBMS because SQL isn't that standard. The most portable thing would be to do this:

  SELECT x.a, y.a IS NOT NULL, y.b
  FROM x
  LEFT JOIN y ON x.a = y.a;
assuming y.a is a NOT NULL column, or a PRIMARY KEY column, since PKs are supposed to be non-nullable.

In PG you could:

  SELECT x.a, y IS DISTINCT FROM NULL, y.b
  FROM x
  LEFT JOIN y ON x.a = y.a;
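
A sketch of the portable form above in SQLite, with y.a declared NOT NULL so it can only come back NULL when the join found no match:

```python
import sqlite3

# Add a "did the join match?" flag column to disambiguate the NULLs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x (a INTEGER);
    CREATE TABLE y (a INTEGER NOT NULL, b INTEGER);
    INSERT INTO x VALUES (5), (7);
    INSERT INTO y VALUES (5, NULL);  -- match exists for 5, none for 7
""")

rows = conn.execute("""
    SELECT x.a, y.a IS NOT NULL AS matched, y.b
    FROM x
    LEFT JOIN y ON x.a = y.a
    ORDER BY x.a
""").fetchall()
print(rows)  # [(5, 1, None), (7, 0, None)] -- the flag tells them apart
```
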
[+] ErwinSmout|3 years ago|reply
The predicate corresponding to an outer join is fundamentally disjunctive in nature (any outer join is equivalent to a UNION of at least 2 SELECTs) meaning that "after the fact", there is no way to determine which particular one of the disjuncts made the row appear in the result.
[+] SPBS|3 years ago|reply
Why does it kill you? A NULL in a row is semantically identical to the row never existing in the first place, unless there is some other non-null column (the primary key) that lets you disambiguate between the two cases. It's hardly a problem in practice because all good tables should have primary keys.
[+] janci|3 years ago|reply
This is why JS has both null and undefined.
[+] onlyrealcuzzo|3 years ago|reply
How do you expect `y.b` to behave on EMPTY rows? Should it return EMPTY or NULL?
[+] gfody|3 years ago|reply
you should not need to distinguish between empty and null because they mean the same thing. if you want to select records from x that aren't in y you should use an anti-join (where not exists, not in, etc.)
[+] janci|3 years ago|reply
Also you should select y.a to know whether the y-record exists.
[+] Shorel|3 years ago|reply
Good point, it would be nice to have both NULL and EMPTY to split these meanings.
[+] thedataslinger|3 years ago|reply
To piggy-back, it bothers me so much that this is valid syntax in many implementations:

UPDATE x SET a = NULL WHERE b IS NULL;

Like... wat?

[+] rrrrrrrrrrrryan|3 years ago|reply
Why are you joining on a nullable column in the first place?

If your database is well designed, joining on a nullable column should be a relatively exotic use case, and can be handled by writing a tiny bit more code to check for NULLs before performing your join.

[+] bearjaws|3 years ago|reply
The only thing I would really blame solely on SQL is that UPDATE and DELETE statements don't require you to specify a limit.

I have seen it many times in my career: a rogue delete just truncates a table. A simple statement of intent (e.g. LIMIT 1) would tell the query planner that if it is about to update/delete more than 1 row, it should error. In fact MySQL actually returns a warning if you do this.

TRUNCATE clearly states your intention to delete everything, but DELETE by default also deletes everything.
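
The delete-everything default is easy to demonstrate; a sketch in SQLite with a made-up table:

```python
import sqlite3

# DELETE with no WHERE clause silently removes every row; no statement
# of intent is required, unlike an explicit TRUNCATE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER, balance REAL);
    INSERT INTO accounts VALUES (1, 10.0), (2, 20.0), (3, 30.0);
""")

conn.execute("DELETE FROM accounts")  # a forgotten WHERE, and it's all gone
remaining = conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
print(remaining)  # 0
```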

There were also definitely some bad paradigms invented as a result of SQL (e.g. all business logic lives in the database as stored procedures or stored functions), but that was bound to happen whenever a shiny new tool comes out and engineers want to use it.

[+] Cockbrand|3 years ago|reply
While all the criticism is probably correct, and SQL definitely shows its age, it's still very much good enough for pretty much all its current applications. Also, SQL's basics are very easy to learn. These aspects make it very hard to imagine a language that might gain enough traction to actually replace SQL in a foreseeable future.
[+] tabtab|3 years ago|reply
Many of those complaints seem theoretical. I like to focus on practical concerns. The biggest problem I see is that the SQL language has grown too complex. It's related to the "Lack of Orthogonality" problem mentioned in the article, but I see different solutions. SQL is not based on combinations of simpler concepts, but on hard-coded keywords. How to orthogonalize (factor) it gets into philosophical differences. My favorite alternative is an experimental language called SMEQL (Structured Meta-Enabled Query Language):

https://wiki.c2.com/?TqlRoadmap

It's more API-like, to reduce the need for specialized keywords. And you can "calculate" column lists via queries instead of having to list columns. For example, if a table has 60 columns and you want to SELECT all but two of them, you can't without listing the other 58. With SMEQL you can use a query to return a list (table) of those 58 and feed it to the equivalent of a SELECT clause.

Things like CREATE TABLE are fed a table if you desire so you can "compute" your schema. You can generate entire schemas from data dictionaries. Anything that can be done with a table is. You can create in-line (virtual) tables if you want it command-driven, but it's still "using tables to do everything". You can use textual code to create such virtual tables or a table editor as it fits the situation. SMEQL is as close to "everything is a table" as I've seen. Even your FROM list can be table-ized. I used to do similar with dBASE/xBASE, and it was really nice, especially for ad-hoc work such as one-off research requests.

And as somebody mentioned here, null handling needs a serious revisit in SQL. I could rant all day about SQL null handling, especially for strings.

[+] cbm-vic-20|3 years ago|reply
> "we didn’t realize how truly awful SQL was or would turn out to be (note that it’s much worse now than it was then, though it was pretty bad right from the outset)."

I wish the "truly awful" stuff I come up with was 0.1% as successful as SQL.

[+] thomas536|3 years ago|reply
I once read an article about SQL and how reordering the sections of a query would make it more ergonomic for users (iirc things like specifying what you want first and then how to present it last, more like a pipeline). I searched many times over the years but have not been able to find it again.
[+] devin-petersohn|3 years ago|reply
I always appreciate blog posts like this; there are obviously cases where SQL shines, and in part I think the dataframe abstraction helps fill in a lot of the missing pieces that SQL doesn't handle so well (composability, debuggability, interactivity, etc.)

Even pandas (with all its faults) is more flexible as a language than SQL[1]. I'm of the opinion that there's a better together story in the end, but I guess we will see.

[1] https://ponder.io/pandas-vs-sql-food-court-michelin-style-re...

[+] RyanHamilton|3 years ago|reply
qSQL, based on the concept of ordered lists, is more appropriate for many queries; examples available here: https://www.timestored.com/b/kdb-qsql-query-vs-sql/

Kdb, the system that qSQL runs within, allows full use of variables and all builtin functions with tables/functions/variables/columns. It really is a case of less is more. What this allows is functional-form queries. Imagine being able to query ?[tableName;lessThan(age;10)] and have a perfect functional-form representation for all queries. No ORM, no string concatenation. It seems some other database creators are at least becoming aware of these things and integrating parts.

[+] dominotw|3 years ago|reply
SQL is having somewhat of a moment in the big-data world, thanks in part to the 'modern data stack' and new-age data warehouses like Snowflake and BigQuery.

However there is a lot of pushback from 'traditional' data engineers who were trained on Spark/Scala. It's a bit of a hard sell to go from a strongly typed language to free-for-all text-based logic.

I think the following is needed for SQL to finally be accepted as a 'serious' contender: create a compiled SQL language (not pandas)

1. that compiles to SQL and addresses some of the issues brought up in the post, like nested aggregations.

2. make code reusable. E.g. apply year-over-year growth to a table that has the requisite columns. The compiler should check this in the IDE.

3. make materializations a first-class concept in the language. No separate dbt layer.

4. create a way to package and distribute libraries that you can import into your project.

5. a unit testing framework that makes it easy to test the logic without having to set up test tables in the database.

[+] bawolff|3 years ago|reply
It's interesting that the author dismisses

>Mismatch with Host Language

But to me, this feels like the biggest problem with SQL. Yes, every language is different, but having different mental models nonetheless causes friction, and leads to things like ORMs, which are often even worse.

[+] iddan|3 years ago|reply
I think we can do much better than SQL without losing its inherent power. Projects like Prisma and EdgeDB make me optimistic regarding the future of relational querying languages
[+] warrenm|3 years ago|reply
the title doesn't match the first statement, which states a math unfact:

>A Critique of SQL, 40 Years Later 08.11.2022

>The SQL language made its first appearance in 1974, as part of IBM’s System R database. It is now over 50 years later, and SQL is the de facto language for

1974 isn't 50 years ago yet :)

[+] tibbydudeza|3 years ago|reply
I wish there was some smart machine learning technology that will take my shitty SQL developer queries and make it more performant - sometimes it is like sitting in front of a bubbling cauldron and invoking magic incantations.
[+] erichocean|3 years ago|reply
All SQL databases implement the actual relational calculus internally; it's required to implement a SQL optimizer.

SQL is just a language for submitting Relational Calculus to the database (+ DDL statements).

If you wanted to, for instance, you could add a language alongside SQL in Postgres, submit the results to the internal RC optimizer, execute the optimized query, and get back the results.

In your new language, you can address all of the issues Carlin Eng/Chris Date identify in the article.

[+] ErwinSmout|3 years ago|reply
In "Stating the obvious", C.J. Date wrote 50 (FIFTY) pages solely on the subject of how SQL screwed up the most fundamental operator of all: testing for equality. I'll repeat: FIFTY pages about everything that SQL got wrong with just the equality operator.

Go figure about the rest ...

[+] jokoon|3 years ago|reply
I wish there were SQL "primitives" functions instead of the SQL language.

For example, if I want to pick a single row by id, with SQL I must send a query string, which has to be parsed, which means higher latency.

If I want to randomly select 100k rows from a database of 1 million entries, I need to build 100k query strings (I think?), which won't be fast to parse. I don't think this happens when using pointers in C or C++, or arrays in other languages.