Embracing SQL in Postgres

[+] knodi123|11 years ago|reply

My CTO wrote custom PostgreSQL functions to generate all the JSON that our API returns. First time I saw it, I said "Holy crap, that query is two pages long. You could just tell Rails to serialize the object in 3 lines of code." He replied "I know. That's how we did it at first. But generating the JSON directly in postgres is about 700x faster."

Eventually we even moved a very complex scoring algorithm from an overnight Hadoop Java process, into Postgres as a compiled extension in C. We used to spend all night calculating those scoring results and caching them in an enormous table. But with the Postgres+C solution, we could calculate and store it in realtime by having Postgres do the hard part.

[+] inopinatus|11 years ago|reply

You can skip the web tier entirely and have Pg returning JSON documents via an HTTP endpoint to a rich MVC js browser application that synchs to a local cache.

Thus re-inventing Lotus Notes.

[+] pjungwir|11 years ago|reply

Writing Postgres functions in C is not nearly as scary as it sounds. Here are some stats functions I wrote for a Rails app. The histogram function in particular was the driver, since it was parameterized on bucket size/count/start so uncacheable:

https://github.com/pjungwir/aggs_for_arrays/

[+] ibejoeb|11 years ago|reply

Very happy to see all of this talk over the past few months on the power of modern--or anything beyond basic--SQL.

This article touches on one of my favorite things, too, which is built-in date and time arithmetic. That's something that basically just works and comes for free in the database but is often a hodgepodge of messy third-party libraries elsewhere.

The other enormous one, for me at least, is arbitrary precision math. Proper precision, rounding, division, and exponentiation isn't easy to come by. I'm so tired of dealing with libmpdec, BigDecimal (flimsy as it is), GMP, etc. It all works well in the database and it's pretty fast. Perhaps it's not suitable to 1000-digit scale scientific computing, but it sure does make building financial software easy.

[+] p_l|11 years ago|reply

In Ruby World, Sequel (http://sequel.jeremyevans.net/) probably gives a bit of best of both ORM and direct SQL use, because what it really does is an object-oriented ("native" to language) interface to SQL database. And yes, it can build models like ActiveRecord, too!

In Common Lisp, postmodern (http://marijnhaverbeke.nl/postmodern/) serves a similar role, with both DAO and lower-level access methods that compose pretty well. The 'perec' ORM from DWIM project was based on Postmodern, and among some weirder things I did quite easily was embedding custom SQL into query builder, so that the object-based accesses would use the advanced SQL. Really nifty feature :)

[+] pjc50|11 years ago|reply

One thing that drives me absolutely over the cliff is how ORMs try so hard (and fail) to abstract the power and expressiveness of SQL.

I agree with this. It's nice to deserialise a DB row into objects, but actually querying with an ORM is usually terrible and people should not be afraid to write smart queries. Especially like those in sibling comment https://news.ycombinator.com/item?id=9112036 : date/time/arithmetic.

[+] Terr_|11 years ago|reply

> actually querying with an ORM is usually terrible

What do you think of stuff like Hibernate Query Language? I'm currently working on a project that might use DQL (a PHP variation) to get our snarl of tables under control.

[+] ptype|11 years ago|reply

Even though I think it's important for developers to have some understanding of the underlying SQL, I still believe a good ORM can save time and make code more readable. I think SQLAlchemy does a good job here - and it's close enough to SQL such that you are typically not surprised by the SQL it produces.

For more complicated queries, I often hand write them in SQL first and then translate them into SQLAlchemy's methods. It's still worth it, since it's both easier to build dynamic queries (e.g. dynamic WHERE clauses) and it makes the parameterisation trivial.

My problem with stored procedures is that you can end up with a lot of business code in the db.

[+] marcosdumay|11 years ago|reply

Why is it that the DB is the correct place for business data, but not for business code?

Of course, even for practical reasons, it's important to keep the DB lean. But some code does really belong toghether with the data.

[+] wallyhs|11 years ago|reply

If you are primarily using stored procedures, you are probably viewing the database as a "persistence API" with its own encapsulated logic.

[+] zvikara|11 years ago|reply

For context, the author Rob Conery wrote SubSonic - one on the first opensource ORM for .Net which was very popular at the time. http://en.wikipedia.org/wiki/SubSonic_(software)

[+] pjmlp|11 years ago|reply

I never got that much sold into ORMs when doing anything more complex than plain CRUD, I rather use the expressiveness and performance of plain SQL and its associated procedural extensions.

[+] mercurial|11 years ago|reply

One thing you get with a good ORM is composability, though. Trying to build dynamic SQL queries with raw strings is a nightmare.

Another, less useful property, is a considerable reduction of boilerplate when persisting a complex object graph.

[+] angersock|11 years ago|reply

I prefer ORMs for the CRUD stuff--especially for data validations and things. Filtering and querying, though, tends to benefit from some kind of structure query language.

That said, there is a good argument to be made that, if you've factored everything correctly (ha) and done your design work up front properly (ha ha) you've rendered the database as just another sort of persistence engine, so even an ORM is not always necessary.

[+] humbertomn|11 years ago|reply

Really happy to see some SQL appreciation. Of course ORMs can make code easier to read and allow you to change databases quickly... But I always asked myself if these 2 points are enough for the loss in performance and the use (in some cases) of really dodge background implementations of things that already are beautifully implemented by many databases.

[+] dorfsmay|11 years ago|reply

Another issue I have with ORMs is different projects end up with different views of the same database.

Why don't you use a foreign key to that table? We don't have that table... Yes we do, it's not because you haven't defined it in your project that it doesn't exist!

And... every project ends up with their own version of the "versioned schema", so to rebuild the DB, you need to aggregate all the liquidbase/flyway files from all the projects and run them all.

When I wear my sysadmin/pseudo DBA hat, ORMs are one of my more prominent pain.

[+] habitue|11 years ago|reply

I think this pain comes from using swl databases for integration, which is an antipattern in my book, and prevalent in the enterprise. Databases should have a single application talking to them, and have a service in from of them that makes it possible to get the data needed out. Services are easier to maintain backwards compatibility with, whereas databases are a massive pain to version and migrate and having to synch multiple teams to do such a job is a nightmare.

That being said, I realize its often much more convenient at the beginning to have one database for integration. SQL is much more expressive than your service's interface will be, and in the beginning this seems like a killer.

[+] pjungwir|11 years ago|reply

These are some very nice tips! One of the things I like so much about ActiveRecord is that it plays nice with pretty much every example here, e.g. `Product.where("title ~* 'title'")` or `Product.where("title ~* ?", title)`. A few other remarks:

- I'm so glad to see someone talking about `generate_series` for time-series reports. I often see queries that are `GROUP BY date`, which is usually okay, but will skip an output row if you have no data for a certain day. If you want to force one-row-per-day, use `generate_series`.

- His `GROUP BY` query can be rewritten with less repetition like this:

    SELECT  sku,
            SUM(price),
            date_part('month', created_at) m
    FROM invoice_items
    GROUP BY m
    HAVING m = 9

In other words, `GROUP BY` and `HAVING` can reference expressions that are aliased in your `SELECT`.

- Also on this query: In Postgres if `sku` is a primary key, then you can `SELECT` other columns from `invoice_items` without including them in `GROUP BY`, e.g. `SELECT sku, name, supplier_id, etc`. I think even joined stuff is okay: `SELECT sku, suppliers.name`.

- One more nit on this query: I think this is backwards from what he meant?: "having to be sure that whatever you GROUP BY is in your SELECT clause."

- Usually I don't use window functions where a `GROUP BY` will do, but to me the biggest value in window functions is they don't force any particular structure on the overall query, so they don't constrain you the way `GROUP BY` does. He kind of alludes to this when he shows per-sku monthly sales combined with total monthly sales. Good example!

- I'm not sure his CTE example shows much added value from the CTE, but maybe I'm missing something. The three places I really like CTEs are (1) recursion, (2) with INSERT/UPDATE/DELETE and RETURNING, (3) very big queries. But a lot there depends on taste.

- When you use `string_agg` and friends, you can give an ordering for just that function: `string_agg(downloads.title, ',' ORDER BY products.sku)`. That syntax works for every aggregate function (I believe), although it's especially useful for the *_agg family.

Thanks for sharing these tips! I wish more people knew the cool things you can do beyond SELECT x FROM y.

[+] rtpg|11 years ago|reply

I've spent some time recently figuring out that SQL is nice. But my issue has always been the same : how do I coalesce that with best-practices in other things.

For example, how do I apply DRY and use SQL ? How do I apply type safety principles? How do I avoid syntax errors?

It feels like this ends up with building some sort of ORM.

Anyways, I'd love to hear experiences on interfacing SQL with "good" safety principles in larger-scale projects

[+] dragonwriter|11 years ago|reply

> For example, how do I apply DRY and use SQL ?

Lots of ways. Defining appropriate views is one of the key ones, but not the only one.

> How do I apply type safety principles?

That's too vague of a question to present a concrete answer to.

> It feels like this ends up with building some sort of ORM.

Anytime you are using a database from an external language, you are going to have some code to translate back-and-forth between database objects and operations and application language objects and operations; you might call it an ORM, but if you are using SQL more robustly, it may not look much like most things that go under the ORM label, and some popular ORMs are fairly opinionated in ways which fight effective use of SQL (ActiveRecord in particular at least used to be; I haven't done much with it in quite a while and am not sure if that's still the case.)

[+] spacemanmatt|11 years ago|reply

When I can do my ideal best, I write a remote facade to my database that provides functions and views on the data as an API, which can be adapted to an ORM or mapped nearly 1:1 to a DAO.

This API lends itself to TDD, and yes, you must test your database components the same as any other code. Preferably using database code. For PostgreSQL I really liked PGTap. Include this testing in CI and other unit test executions.

My sniff-test for a well-done schema is that developing application(s) around it should be clean.

[+] SixSigma|11 years ago|reply

Stored procedures, parametrization of queries and views will eliminate many errors.

[+] crdoconnor|11 years ago|reply

DRY is the #1 reason I use an ORM. I also think that it's a factor that is rarely if ever considered by people who choose not to use an ORM.

[+] s_kilk|11 years ago|reply

The 'slick' library for scala ticks a lot of these boxes. It allows you to write type-safe, idiomatic scala code to query your database, transform results into proper objects while avoiding the weight of an ORM.

[+] borplk|11 years ago|reply

In the Java world checkout MyBatis it lets you write "named" SQL queries in a config file and makes it available to you in Java as a type-safe interface. Been using it recently and love it.

[+] bayesianhorse|11 years ago|reply

The point of using an ORM is not about avoiding the difficulty of SQL or the ease of switching databases.

ORMs are about code reusability.

First off, validation of input data is painful without ORMs. Converting data between SQL, your languages' type system and the user interface in a reusable way almost necessarily leads to an abstraction which looks like an ORM. Not doing so quickly leads to bugs and vulnerabilities. I can usually tell at a glance if Django ORM code leads to an injection vulnerability or not (Hint: Unless there is custom/raw SQL there is no vulnerability). Harder to do in pure SQL.

Secondly, queries can be modified and introspected in code. Hard to do with SQL query strings, easy to do with ORMs. With the Django ORM you can progressively filter a query using variables from your context, you can modify a base query to get you a count, an aggregate and a list of model instances, and you can introspect a query for the tables it will touch.

Thirdly, an ORM makes queries reusable by different sub systems of your framework. Again in Django, you can pass a Queryset (without evaluating it before it's necessary) to the template system, to the input validation (django forms), to form widgets, the introspected admin, REST frameworks and so on.

Fourthly, ORMs help manage migrations. Migrating to and from different points of migrations can be hard to do in raw SQL or with your own tools. ORMs aren't perfect at this, but SQLAlchemy and the Django ORM both have very good migration tools which save you a lot of time and a lot of frustration, with minimal headache now and then. Migrations are easily testable and easily repeatable.

I also believe that ORMs are more important to the framework than for the individual application. A framework can only survive and grow a strong community if it can satisfy a large base of users. For Django's success it was necessary to support SQLite, Postgres, Mysql and Oracle, whereas single Django projects might thrive without ever changing to and from Postgres.

[+] buckbova|11 years ago|reply

Switching from mssql to postgres, these are the kinds of things I really loved where I didn't need to roll my own. In mssql I rolled my own regex, series generation, date formatting, and others. Once you've done it, it's there though and you forget about the time that went into implementing.

However there's plenty things I prefer in mssql, but I won't go into them all here.

Edit:

I didn't want to take away from the focus or be a jerk about it.

Things I prefer in mssql environment:

> Triggers work on entire dataset instead of row by row.

> Stored procs can return multiple data sets and output variables all from a single call.

> Opinion: t-sql is friendlier than pgpsql to work in.

> Backups/restores are more straightforward.

[+] takeda|11 years ago|reply

Perhaps I'm missing something, but:

> Triggers work on entire dataset instead of row by row.

in pgsql you can have trigger that works per row or per dataset [1]

> Stored procs can return multiple data sets and output variables all from a single call.

you can provide results through arguments by using "OUT" as argmode [2]

> Opinion: t-sql is friendlier than pgpsql to work in.

while it is hard to argue with an opinion, you can also use Tcl, Perl, Python (these three come standard), Java, PHP, R, Ruby, Scheme, sh[3].

[1] http://www.postgresql.org/docs/current/static/sql-createtrig...

[2] http://www.postgresql.org/docs/current/static/sql-createfunc...

[3] http://www.postgresql.org/docs/current/interactive/external-...

[+] angersock|11 years ago|reply

You haven't gone into any, much less all, here--I for one would be curious to know about what you prefer in mssql.

[+] rondon2|11 years ago|reply

I'm sure anyone can come up with specific situations where writing SQL is much better than using ORM. If I had infinite time and dev resources I would never use an ORM. In reality Entity Framework makes abstracting my data-layer very easy and increases my productivity. It also makes me feel much more comfortable about switching databases if necessary. Even if I'm taking a 5% performance hit it is totally worth it to me.

[+] jon-wood|11 years ago|reply

> It also makes me feel much more comfortable about switching databases if necessary.

This is one of the big features touted by ORMs, but I wonder how often it actually comes into play. I've been using Rails with ActiveRecord for most of the last 10 years now, and I've not once changed databases after starting a project. There's one project where I'd like to, but despite the use of ActiveRecord, there's enough MySQL specific stuff in the code that its not going to be trivial to just copy the data over to a Postgres database and flip the switch.

[+] crdoconnor|11 years ago|reply

>I'm sure anyone can come up with specific situations where writing SQL is much better than using ORM.

You are supposed to bypass every ORM sometimes. One measure of a good ORM is whether it makes that easy for you or difficult.

Regex queries are definitely one example of a niche SQL query that I would bypass an ORM altogether in order to use.

95% of my SQL queries are not that, though, and for the very standard ones I use an ORM to heavily restrict the amount of code I have to write and maintain type safety.

[+] joelthelion|11 years ago|reply

The problem is that this 5% hit very often becomes a 1000% hit when you let the ORM in the hands of people who don't understand it and they do a thousand queries when a single one would be enough.

[+] JustSomeNobody|11 years ago|reply

I always hear (one of) the argument for ORMs being "If I want to switch databases...", but in reality, after having questioned many many devs, nobody has ever _actually_ done this.

[+] jdmichal|11 years ago|reply

Agree that ORMs like Entity Framework are very good at taking basic SELECT and INSERT/UPDATE statements and jamming them into objects without a lot of code. However, consider that perhaps your position is a judgement of how few features of a database system you have needed to use, and not necessarily a judgement on the robustness of ORMs.

[+] matwood|11 years ago|reply

I have found that ORMs get you up and running more quickly, but that time saved is eventually given back working through ORM performance issues. As someone who has written complex SQL for many years, it is simply faster and easier to just write optimized queries up front. Something like jOOq makes this nice with typing and some minimal object mapping.

[+] joesmo|11 years ago|reply

I find the opposite to be true. I have to learn a whole bunch of new methods and concepts to use an ORM like Doctrine 2 and even a whole new query language for when the basics fail which is most of the time. I can't imagine a way to make development slower than adding an ORM.

[+] bifel|11 years ago|reply

Using distinct and windowing functions to avoid writing group-by clauses may work fine with Postgres but is a very ugly hack.

[+] adamtj|11 years ago|reply

This ugly hack is not useful in general and only looks interesting here because the original example query is in bad taste. Simplifying it removes the tedium motivating the use of window functions.

For example, this is equivalent to the original query:

  SELECT sku, sum(price)
  FROM invoice_items
  WHERE date_part('month',created_at) = 9
  GROUP BY sku

See how the tedium disappears? In the original, the HAVING clause doesn't use any aggregate functions, which means it can instead be a WHERE. Removing the HAVING means we can further simplify the GROUP BY.

Of course, that only works because this example needs only a single month group. If you want multiple months and per-month price totals, then my rewritten query is no longer equivalent. It would need the more complicated GROUP BY. However, the window function version would also get more complicated in just the same way. The complication is inherent to the problem, not due to standards conformance.

Actually, what's really tedious is the verbose expression required to get the month and the fact that the author is repeating it. The standard provides "WITH" to avoid that duplication, and Postgres implements it.

Here's a reasonable version for multiple months:

  WITH items AS (
    SELECT *, date_part('month',created_at) AS month
    FROM invoice_items
  )
  SELECT sku, month, sum(price)
  FROM items
  WHERE month in (7,8,9)
  GROUP BY sku, month

But suppose we make the example even more complicated. Suppose you want sums over different time periods, like per-month and per-quarter. Then you can't use a GROUP BY and must use window functions and DISTINCT.

  WITH items AS (
    SELECT
        *
        ,date_part('year',created_at)    AS year
        ,date_part('month',created_at)   AS month
        ,date_part('quarter',created_at) AS quarter
    FROM invoice_items
  )
  SELECT DISTINCT
      sku
      ,year
      ,month
      ,quarter
      ,sum(price) OVER (PARTITION BY year)          AS yearly_total
      ,sum(price) OVER (PARTITION BY year, month)   AS monthly_total
      ,sum(price) OVER (PARTITION BY year, quarter) AS quarterly_total
      ,sum(price) OVER (PARTITION BY month)         AS month_number_total
      ,sum(price) OVER (PARTITION BY quarter)       AS quarter_number_total
  FROM items
  ORDER BY 1,2,3,4

[+] wallyhs|11 years ago|reply

This "hack" is also supported by Oracle and SQL Server.

[+] k_bx|11 years ago|reply

SQLAlchemy can express SQL from author's post, so probably he just didn't try good ORMs.

[+] towelguy|11 years ago|reply

SQL is nice but most of those features work best when the user is a finance person doing reports (which was the original purpose for SQL, I think?). Developers don't want the sql server to format their numbers as money, that's a work for the UI. The main reason for me to use an ORM is to standarize on one way to retrieve data whatever the database engine is. Maybe an intermediate solution would be better, like a query builder or a sql-like syntax that transforms to the correct SQL for the database in use like Doctrine's DQL.

[+] bqe|11 years ago|reply

My favorite query builder for SQL is jOOQ, works on the JVM. It has a great, fluent syntax and lets you get real objects back while writing what is essentially SQL.

http://www.jooq.org/

[+] RangerScience|11 years ago|reply

> whatever the database engine is I know, but... How often does that flexibility come up? How often do you write code first against one DB type, and then switch to a different that would require a re-write?

It's admirable and fun to think about, for sure, but how much use does that feature actually get?

120 comments