
Composable SQL

271 points| earnestinger | 1 year ago |borretti.me | reply

143 comments

[+] branko_d|1 year ago|reply
From the article:

> What you hope happens is that Postgres will recursively inline every view, merge them all together into a gigaquery, and shuffle the predicates up and down to maximally optimize the query. That was the conceit. And of course that never happens.

The author seems to be describing a Postgres-specific problem. Oracle, SQL Server (and probably other "enterprise" databases as well) are pretty good at inlining views and then optimizing the whole resulting gigaquery as if the views were not there in the first place.

> The central problem is that views have to be written for the general case, and then you filter on the view’s output. Views can’t take parameters.

SQL Server supports inline table-valued functions, which are exactly that - parametrized and inlinable views.

Unfortunately, they are not as composable as they could be. For example, a function returning a table and a function accepting a table of the same type as input cannot be composed directly - you'd first need to store the first function's result in a temporary table or table variable. But despite the limitations, functions are extremely useful for constructing complex queries which reuse logic.
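
For anyone who hasn't used them, a minimal sketch of an inline TVF (table and column names are made up):

```sql
-- A parametrized, inlinable "view": the optimizer expands the body
-- into the calling query, so outer predicates can be pushed into it.
CREATE FUNCTION dbo.OrdersForCustomer (@CustomerId int)
RETURNS TABLE
AS
RETURN
    SELECT o.OrderId, o.OrderDate, o.Total
    FROM dbo.Orders AS o
    WHERE o.CustomerId = @CustomerId;
GO

-- Composes like a view in the FROM clause:
SELECT f.OrderId, f.Total
FROM dbo.OrdersForCustomer(42) AS f
WHERE f.OrderDate >= '2024-01-01';
```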

[+] bob1029|1 year ago|reply
Oracle, MSSQL, DB2, et al. are essentially like magic when it comes to making hopeless piles of SQL run quickly. This is because these engines have all ostensibly targeted the business people over the decades more than the tech people.

Features like dynamically optimizing queries as they return results [0,1,2] aren't something you chase if your audience is willing to rewrite the universe at the first sign of incongruity.

[0] https://learn.microsoft.com/en-us/sql/relational-databases/p...

[1] https://www.ibm.com/docs/en/i/7.5?topic=overview-adaptive-qu...

[2] https://apex.oracle.com/pls/apex/features/r/dbfeatures/featu...

[+] SigmundA|1 year ago|reply
I mentioned this down thread: MSSQL and Oracle cache query plans, which allows their optimizers to take their time, and I imagine that helps with doing more sophisticated optimization.

PG always re-optimizes on every call unless you manually prepare the statement, and even that only lasts for the session. So PG's optimizer is designed to be very fast, not very thorough.
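
For reference, a sketch of the manual route in Postgres (hypothetical table; the plan lives only as long as the session):

```sql
-- Plan once, reuse within this session only.
PREPARE recent_orders (int) AS
    SELECT order_id, total
    FROM orders
    WHERE customer_id = $1
    ORDER BY order_date DESC
    LIMIT 10;

EXECUTE recent_orders(42);

-- Gone when the session ends (or on explicit deallocation).
DEALLOCATE recent_orders;
```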

[+] willvarfar|1 year ago|reply
The article talks about how CTEs are efficiency minefields and how Postgres doesn't do a good job of pushing down predicates, for example.

TBH, the solution is to improve the Postgres engine in this regard?

For example, in Bigquery I often use CTEs and see predictable speedups, explained by predicate pushdown, when slapping a WHERE clause on the consumer of a CTE etc. Bigquery has plenty of performance gotchas but I don't think CTEs are one of them; they seem to be transparent to the optimiser.

Something else that Bigquery supports is user-defined 'table' functions. These are like the 'functors' the article creates.

However, in practice they aren't very widely used, and I think this hints that they aren't as useful as the article is hoping for.

The thing I _really_ want is scoping. If only functions could be described like CTE WITH blocks, e.g.

   WITH visits AS (
      SELECT ...
      FROM web_logs
      WHERE ...
   ),
   foo(bar INT, baz STRING) AS (
      ... function body here
   )
   SELECT ..., foo(a, b), ...
   FROM visits
   WHERE ...
Apart from the tidiness of this, this allows inlining of functions into the SQL statement body rather than as separate statements. Often various tooling, e.g. DBT, makes it really hard to work with separate function definitions, and being able to inline logic would make things possible.

Bigquery also supports user-defined aggregate functions. Although there is rarely the need to stray from the standard builtin aggregate functions, when you need them they are brilliant.

(Anyone working _on_ Bigquery reading? If so, please ping me, I have loads of feedback on what works well and where the user pain and toil is ;) )

[+] zetalyrae|1 year ago|reply
This suffers from the "sufficiently smart compiler" problem. The query planner that can do what I mean with maximal efficiency is always just over the horizon. There's always yet another query that can't be optimized automatically.
[+] geysersam|1 year ago|reply
Similar functionality already exists in some SQL implementations, notably DuckDB:

  create macro author_books() as table
  (
    select
        a.name as author,
        b.title
    from
        a
    inner join
        b on b.author_id = a.author_id
  );
is the same as the example in the article and it is called like:

  with a as (...),
       b as (...)
       select * from author_books();

https://duckdb.org/docs/sql/statements/create_macro.html#tab...
[+] yujzgzc|1 year ago|reply
I work on a project that does tons of SQL based data transformations (thousands). A lot of it isn't backed by unit tests. Some of our people with more of a C++ background have got the idea that this is crazy, every bit of business logic should be encapsulated into a function just like in every other civilized language, so that it could be unit tested. They built a framework like this for SQL and implemented quite a few jobs with it.

The problem is most of the SQL-first devs hate it and would rather write untested code, or rely on alternative test automation techniques like bringing up a local DB instance with fake data, than have to refit their business logic into pure functions. The truth is that it's just not a way of working that comes naturally to people who write a lot of SQL for a living.

[+] fifilura|1 year ago|reply
I can definitely sympathise with the SQL developers here. (I have worked extensively both with SQL and C++).

There is a trade off between unit tests and having control and overview of what you are writing.

Also very often it is so much more about verifying that the input data is clean and if this becomes more cumbersome, you may have lost the ability to debug the biggest source of errors.

And of course sometimes it hits a nerve when the "real engineers" come and dictate exactly how you should do things. Reference also to Joel Spolsky's architecture astronauts.

[+] thyristan|1 year ago|reply
Pure functions in the context of databases are a rarity. A database is meant to persist data while upholding certain consistency, integrity and isolation guarantees. This basically means that most database-using SQL functions can never be pure, because having side effects is their whole purpose. What's more, integrity and isolation are basically untestable, especially with unit testing.

So imho the DB people are totally right, to test any SQL, you need database tables with data in them.

[+] svilen_dobrev|1 year ago|reply
long time ago i had similar need.. did not want to directly churn SQL for hundreds of small filters. So i did an expression parser and translator, and wrote all things in simple python functions, then translated those into SQL automatically. Both the funcs and the translator were heavily tested. Apparently, that parser also grew a few other "renderers" - like an evaluator, documenter and what not.

https://github.com/svilendobrev/svd_util/blob/master/expr.py

[+] smartmic|1 year ago|reply
A lot of enterprise software tools that I know of are not tested as your C++ professional developers would expect. Especially when this software is not from the original software vendor, but is assembled/configured/enhanced by third party vendors or consultants. But of course even such software is "tested", albeit by the end user over a long period of use, unstructured, haphazardly, and in most cases unsatisfactorily. The problem is particularly acute if the industry is not deeply rooted in software development.
[+] slt2021|1 year ago|reply
SQL relies on RDBMS's type system, just like C++ devs rely on C++ OOP.

the problem lies with lax/weak table design that allows overly lax data (all-string nullable fields, for example), instead of strict types like int, float, varchar, foreign key constraints etc.

you don't need unit tests if your table design is clean, because SQL operators are deterministic and are tested by the RDBMS vendor

you just need integration tests, dry-run tests, and periodic data integrity checks at the system boundaries where data is ingested/egressed

SQL represents operations on sets of data and their relations, so you need not a unit test on a single row, but a test suite on your entire dataset (backup copy or synthetic copy)
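
A sketch of what "clean table design" means here (hypothetical table; the constraints, rather than tests, reject bad rows at write time):

```sql
CREATE TABLE payments (
    payment_id  integer        PRIMARY KEY,
    account_id  integer        NOT NULL REFERENCES accounts (account_id),
    amount      numeric(12, 2) NOT NULL CHECK (amount > 0),
    currency    char(3)        NOT NULL,
    paid_at     timestamp      NOT NULL
);
```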

[+] cbsmith|1 year ago|reply
> The truth is that it's just not a way of working that comes naturally to people who write a lot of SQL for a living.

I don't know... Refactoring Databases is pretty old now. Plenty of DBAs can handle this style and even prefer it. I think this is more of a cultural problem in the small than in the large.

[+] TachyonicBytes|1 year ago|reply
I assume that framework is not open-source or somewhere I can look at it?
[+] marcosdumay|1 year ago|reply
Hum...

You know that SQL is completely pure up to the point where you specify a data alteration command, right?

Your framework is ditching the queries, which are completely pure as enforced by the engine, and insists on using maybe-pure functions that all the engines have problems with.

[+] feoren|1 year ago|reply
SQL cannot be "fixed". It's too broken, too ubiquitous, too customized, too optimized. Will these functors integrate with the thousands of optimizations modern database systems rely on?

This post has the format: "SQL is extremely terribly awful at writing business logic. But if people implemented this one half-baked, unrealistic idea, then it would suck a little bit less!" The first part is spot-on: SQL is a horrible language for writing business logic. So: stop. Stop doing that.

Stop writing business logic in SQL!

Imagine you joined a new team, and they said: our program uses a bunch of XML files as its data format, and that's why most of our business logic is implemented in XPath. Just sit there and ponder that situation for a little bit. This is what you all sound like to me. Stop it!

> Testing the simplest query requires building up a massive object graph, just to test an infinitesimal slice of it.

Yes, correct, because SQL is awful at representing business logic, and foreign key constraints are business logic! The fact that an employee has one supervisor who is also an employee (FK Employee.SupervisorID -> Employee.ID) is a business constraint. (And one that can easily change, by the way.) All these database engineers who can't pull themselves away from their stored procedures will argue with me to their graves about foreign key constraints, but they're generally terrible. They're a business constraint that just happens to be supported by the database engine, but they're still a business constraint. Stop putting your business logic in your persistence layer.

We've been getting better at building programming languages that actually support software engineering every year. Use one of those. SQL is an output of one of your program layers. Just like XPath would be an output of your program. Compose SQL in an actual real programming language that isn't horrible instead of doing any of this.

[+] wvenable|1 year ago|reply
> Yes, correct, because SQL is awful at representing business logic, and foreign key constraints are business logic!

This strikes me as an unnecessary all-or-nothing argument. I would agree that you shouldn't put all your business logic in the database, but that doesn't mean you shouldn't put any business logic in your database. If foreign key constraints are business logic then so are table definitions and column types. It's not awful for those things as it's literally built right in.

My advice is to ensure, as best as possible, that invalid states are impossible to represent. If you can't have an address without being attached to a client, make that impossible to represent in the database. It's one less thing you'll ever have to worry about.
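
For the address example, a minimal sketch (hypothetical schema) where the invalid state simply cannot be stored:

```sql
CREATE TABLE clients (
    client_id integer PRIMARY KEY,
    name      text    NOT NULL
);

-- An address row cannot exist without a client, and goes away with it.
CREATE TABLE addresses (
    address_id integer PRIMARY KEY,
    client_id  integer NOT NULL REFERENCES clients (client_id) ON DELETE CASCADE,
    line1      text    NOT NULL
);
```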

[+] liontwist|1 year ago|reply
What is business logic? What specifically is SQL bad at?

I’m not sold by your foreign key example. Constraints are great. I guess you can do that in python? Good luck getting ACID right.

As soon as your data stops resembling a tree and starts looking like a graph, then all your “business logic” looks a lot like operations on relations and you are back to where you started.

[+] dagss|1 year ago|reply
I agree that SQL has problems as a language.

I disagree that we should be happy about moving data out of the storage layer into the backend in cases where it does not make sense performance wise.

The problem is that doing some processing in SQL in some cases has huge performance advantages over moving data to the backend.

If your business logic requires you to take a million rows from table A, apply some rule, and output a million rows into table B -- then piping all that data through a backend is a bad solution performance wise. I reject the idea that all that extra IO (+extra overhead also for the database) is worth it just to use another programming language than SQL.

What we need is better alternatives that execute at the storage layer. Probably starting with proper languages that compile to SQL, like we saw with JavaScript.

[+] magicalhippo|1 year ago|reply
Agree on not putting your business logic in stored procs, but hard neg on getting rid of all constraints.

I consider it defense in depth. Sure your application should absolutely ensure order lines aren't referencing a non-existing order, but bugs happen and I prefer the database returning an error to the database getting full of inconsistent data.

[+] default-kramer|1 year ago|reply
> Compose SQL in an actual real programming language that isn't horrible instead of doing any of this.

Got any recommendations? That's a serious question, I've tried many (and built my own) and some are okay but none feel remotely good enough when you consider how much progress we've made basically everywhere else in the wider world of software engineering. Relational databases are so good; they deserve better than SQL!

[+] branko_d|1 year ago|reply
I don't understand what you are trying to say about foreign keys. Are you advocating moving FKs from database to application?

If yes, you are opening yourself up to race conditions that will only become visible in concurrent environment. I had a distinct misfortune of working with an ERP system based on Oracle which didn't use database-level FKs. Predictably, their customers would occasionally complain about "dangling" data, which, to my knowledge, has never been properly resolved.

[+] tacone|1 year ago|reply
> Imagine you joined a new team, and they said: our program uses a bunch of XML files as its data format, and that's why most of our business logic is implemented in XPath. Just sit there and ponder that situation for a little bit.

Been there 20 years ago! It sucks at unimaginable levels!

[+] dicytea|1 year ago|reply
I agree with most of this, but I don't understand why you'd treat FKs as business logic. IMO, it's clearly a part of the data structure. Like, say if you have a domain that looks like:

    type User = {
      username: string;
      posts: Post[];
    }

    type Post = {
      content: string;
    }
(for the DDD-brained, assume User is an aggregate and Post is a value object here)

The natural way to represent this in the database is:

    CREATE TABLE users (
        username text PRIMARY KEY
    );

    CREATE TABLE posts (
        author text NOT NULL REFERENCES users (username),
        content text NOT NULL
    );
I just don't see why you'd do it in any other way.
[+] geysersam|1 year ago|reply
I think the fact that these functions/functors/macros already exist in a few different varieties in mainstream database systems presents a challenge to your argument.

SQL can be fixed, has been fixed, and is an underutilized tool in my personal opinion.

[+] FridgeSeal|1 year ago|reply
I think you’re throwing the baby out with the bath water here a bit.

I agree that stored procs are rarely, if ever, the solution.

I do however think that having something that enforces certain constraints across all your data, and forces you to handle schema evolutions reasonably over time is really powerful. Database implementations are (mostly) full of hard-fought correctness decisions, and I don’t think most devs are in a position to “whip out” an implementation that cares for a number of these on-demand.

We had the whole NoSQL thing, and we saw what happened when we threw most of this stuff away.

Is SQL good enough? For many problems, yeah. Is it _really_ good enough? Nah, probably not; we could do a lot better, but I think a better design looks more like a better SQL-like language and less like a full programming language.

[+] hobs|1 year ago|reply
I don't know why you are getting downvoted, and I generally don't like a lot of business logic in the persistence layer, but representing the things you absolutely don't want another part of your app doing via FKs is...pretty normal.

If you are arguing it spreads out your business logic into multiple places, that's fair, but so does... a lot of things. Databases with decent fks and constraints have a tendency to outlive the apps that build atop them, and communicating these constraints in long lived fashions can be ... good actually.

[+] halflife|1 year ago|reply
100% completely agree.

You just can’t scale SQL code. If you have a small app, then sure. But then I don’t know why you’d invest so much in testing. But if you need to scale your code to more than 3 people, SQL should only be your persistent state store. You probably already have a BL layer in another, more manageable programming language, so use that.

In my team I set a rule to remove all SQL functions and views, and move all the logic to our typescript backend. That resulted in code which is much easier to test, and is version controlled.

[+] gregwebs|1 year ago|reply
It seems like you are operating in an environment where a single code library can implement business logic and mediate all DB write access?

Can someone update the db directly with another programming language or do they have to go through an API?

The value of business logic in the DB is that all clients must obey it.

[+] zetalyrae|1 year ago|reply
If I'm at the point of dropping foreign keys, why not just switch to DynamoDB?
[+] throwaway201606|1 year ago|reply
I understand the primary premise about the difficulty with testing SQL and fully agree with it.

I do have a question though - while I understand how functors can help make the problem easier to tackle, I am not sure I fully understand how functors are different from a similar existing tool - stored procedures.

Some DB flavors:

- can take tables as arguments to stored procedures

- can return tables

- also offer the additional benefit of being able to run almost all flavors of SQL commands (DDL, DML, DQL, DCL, TCL) in those stored procedures

Netezza stored procedures, for example, can do what you describe here:

https://www.ibm.com/docs/en/netezza?topic=nsg-return-result-...

As can SQL Server & Oracle (which both return cursors, which are just ordered tables):

https://learn.microsoft.com/en-us/sql/relational-databases/s...

https://docs.oracle.com/cd/B28359_01/appdev.111/b28843/tdddg...

Am I missing something fundamental here? How are functors different from stored procedures? To me, they seem to be just a sub-class / sub-type of stored procedures.

[+] sixdimensional|1 year ago|reply
I found the argument that it’s impossible to test SQL (and some other hard statements in this article) a bit too absolute.

Testing SQL is not impossible – SQL unit testing is a thing. Identify a unit of SQL, such as a query, function, or stored procedure; mock the necessary inputs and any state that is modified; and check the result of the unit just as you would in any unit testing framework (e.g. arrange, act, assert). Perhaps it’s a little difficult, or different to wrap our heads around (and many do not take the time to see how this can work), but it’s not impossible.

Also, as others have already pointed out, depending on the implementation of the SQL engine itself, different engines can have very different behaviors, some of which make composable SQL possible already today (e.g. table valued functions, CTEs that are inlined, stored procedures, etc).

And for what it’s worth, if you look at a lot of the data processing engines we use today (such as Apache Spark), they really solve the problem of being able to compose and modularize code and SQL together, through data frames or things like language integrated query, integrations of languages that compile down to code that can run and be orchestrated on common execution engines, and so forth.

[+] agentultra|1 year ago|reply
Ah, but “functor” is a name from category theory and algebraic topology. I had read along thinking that this would come up in the conclusion! Instead it seems it’s not even mentioned in the sources for the name.

There’s potentially something like this in the “esqueleto” package for Haskell. But it has some limitations. It composes on syntactic fragments.

Although speaking of functors, the category theory ones, maybe categorical databases would fit the bill: https://categoricaldata.net/

[+] RyanHamilton|1 year ago|reply
He got so close to the right answer but went the wrong direction. "Imagine a programming language without functions." Imagine SQL was instead based on an actual programming language, with variables and functions. That would solve all the problems you mention. Kdb+ and Dyalog already knew this 20+ years ago. I wish someone would recreate this in an open source library. Now with the Arrow in-memory format, the best hope may be that all languages will be able to act on shared-memory data frames, bringing tables to every language within the database.
[+] r1b|1 year ago|reply
Nodding along furiously. One area where this comes up is when you want to export some piece of logic that effectively “returns a table”.

For example, I work with a source of data that requires a particularly hairy JOIN strategy. The logic is the same for any table with the same “shape”, but no one can reuse it without copy / paste.

[+] virtualwhys|1 year ago|reply
See Esqueleto in Haskell, Slick and Quill in Scala, probably LINQ or some variant on .NET in C#/F#.

All support building up arbitrarily complex queries based on statically typed query fragments.

String fragment "composition" is like banging rocks together.

[+] yujzgzc|1 year ago|reply
There are a lot more composability problems with SQL.

As an example, can you write a function that you can invoke on a column in a select statement and will compute the cumulative distribution function of that column?

The expression to do this is enormous "SUM(column) OVER (ORDER BY column ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / SUM(column) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)" and there is no choice but to inline it. I think it would be reasonable to expect users to be able to describe this and then write simply "CDF(column)". The only systems that do this I believe are text processing macros on top of SQL...
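
DuckDB's scalar macros get close to this, since they are inlined into the query at bind time. A sketch with a hypothetical payments table (whether window functions are accepted inside a macro may depend on the version, so treat this as illustrative):

```sql
CREATE MACRO cdf(c) AS
    SUM(c) OVER (ORDER BY c ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    / SUM(c) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

SELECT amount, cdf(amount) FROM payments;
```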

[+] naiquevin|1 year ago|reply
I've implemented a CLI tool[1] that tries to address similar problems with SQL as mentioned in the article: testing, business logic and re-usability. The approach is different though - you write SQL and pgTAP tests using jinja templates and the tool generates

1. SQL queries that can be used by the application code using libs such as yesql, hugsql etc. and,

2. pgTAP tests, ensuring that they test the exact same SQL queries that are actually run by your application

I haven't used it in a /real/ production setting though, so I'd still call it an experiment. But I have been using it in a personal project with a handful of users (friends and family). Most of the business logic is implemented in SQL so the queries are fairly intricate (CTEs, sub queries, multiple joins). Occasionally, I push some features/improvements/tweaks and I am able to do so with high confidence, which I believe is mainly thanks to pgTAP tests. I am quite happy with it so far although that's just my experience. Haven't tried it on a large project with many contributors.

[1]: https://github.com/naiquevin/tapestry

[+] iforgot22|1 year ago|reply
I prefer to keep the SQL itself as simple as possible. I've gotten by OK by composing SQL in my outer code as needed. Despite relying crazy heavily on SQL, that's rarely needed. And the outer code has tests.

It'd just be nice if Postgres could loosen the syntax in a few annoying places, like allowing a trailing comma in "WITH q1 AS (...), q2 AS (...)" Or if something could be done to avoid needing the "WHERE true AND" trick.
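
The trick in question, for anyone who hasn't seen it (hypothetical table): starting with a constant predicate means every generated fragment can begin with AND:

```sql
SELECT *
FROM events
WHERE true            -- anchor predicate, always present
  AND kind = 'click'  -- each condition appended independently by outer code
  AND user_id = 42;
```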

[+] default-kramer|1 year ago|reply
Very nice. Adding this one to my list of "things other people wrote that nicely elucidate a shortcoming of SQL"

> Why would this be useful? Because SQL tables are global variables. By vanishing global variables, we automatically make every query fully testable.

And even if you don't care about testability, certainly you can appreciate that global variables are bad for composability.

[+] noisy_boy|1 year ago|reply
Stop trying to fix SQL. Stop trying to make it do things that belong in a full-fledged, unit-testable language. It is mainly a way to view, filter and aggregate data - nothing more. Having dealt with untestable thousand-line stored procedures (each a copy of a similar stored procedure with subtle variations), just don't put yourself in a world of pain by doing overcomplicated things at the database end.

And when you want to query data, use a view so that a) you don't need to deal with the idiosyncrasies of your ORM deciding to do weird/suboptimal joins, b) if tomorrow you have to make database-side changes or optimize your view, all you need to care about is keeping the select clause the same, and c) you can just query/profile the view and be assured that what you see is what your application will see too (in general). But don't keep nesting them like mad and expect the database to do magic.

[+] asqueella|1 year ago|reply
There's a need to do complex data processing (not sure if it falls under your "view, filter and aggregate data - nothing more"), and it's an open question whether "full fledged" languages (presumably with an API, like pandas, ibis or spark), improved SQL, or another language focused on data transformation will be a better fit for this task.
[+] paulddraper|1 year ago|reply
> It is mainly a way to view, filter and aggregate data

It looks like it’s viewing, filtering and aggregating data to me.

[+] froderick|1 year ago|reply
I already have functors, effectively, by writing sql queries in clojure with honeysql. The clojure data structures make behavior composition from reusable sql snippets easy, and the testing of sql snippets is just based on literals as the article describes. Works really well.
[+] maxulite|1 year ago|reply
> Imagine a programming language without functions... This would be tiresome. But that’s what SQL is

SQL limitations encourage creative solutions. I once implemented a simple "functional" framework as a virtual module in SQLite.

Some background. SQLite's virtual tables were always "functional". To make a virtual table behave like a function, you basically require that a particular constraint from the WHERE clause is present; otherwise you report a failure. Having this, you may create for example a table of all primes, but with two columns, the prime value (obviously) and the count, where the count column is mandatory (so the query showing the first 10 primes is 'select value from primes where count=10'). Later SQLite added table-valued function syntax and hidden columns, which allow the syntax 'select value from primes(10)'. But what finally made me implement a functional framework was eponymous virtual tables. Usually a virtual table must be present in the schema, but eponymous tables don't require this and exist just by virtue of the corresponding module being registered.

So the framework uses all this machinery, and a simple table is used for storing "functions". The table has two columns, query name and query body. The body may contain numbered SQLite parameters (?1, ?2, ?3). Each such row automatically creates a virtual table named after the query-name column, with all numbered parameters in the body converted into "input" parameters of this table (no need to parse them manually; SQLite takes care of this thanks to its API). Sure, the bodies can use other queries from this library; SQLite handles this nicely.

[+] brikym|1 year ago|reply
This is what Kusto Query Language looks like:

  StormEvents
  | summarize 
      TotalStormsInState = count(),
      StormsWithCropDamage = countif(DamageCrops > 0)
      by State
  | extend PercentWithCropDamage = 
      round((todouble(StormsWithCropDamage) / 
      TotalStormsInState * 100), 2)
  | sort by StormsWithCropDamage

I find it to be much more readable. With SQL, CTEs are as close as you can get.