
Lesser-known Postgres features

702 points | CRConrad | 4 years ago | hakibenita.com

166 comments

[+] gavinray|4 years ago|reply
I want to stress the importance of not using

  id SERIAL
If you are on a somewhat recent version of postgres, please do yourself a favor and use:

  id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
An "identity column", the part here:

https://hakibenita.com/postgresql-unknown-features#prevent-s...

You might think this is trivial -- but SERIAL creates an "owned" (by a certain user) sequence behind the scenes, and so you run into massive headaches if you try to move things around.

Identity columns don't, and avoid the issue altogether.

https://www.2ndquadrant.com/en/blog/postgresql-10-identity-c...

[+] netcraft|4 years ago|reply
I'll stop short of giving a recommendation or using the word "should", but I'll give encouragement to consider using UUIDs for keys. I have used them in several systems, have never had any issues with them, and they solve so many problems. For one, the ability to generate a key on the client, on the server, or in the database is great. And the fact that keys are unique not only within a table but across the system (or many systems) is a huge advantage.
[+] cormacrelf|4 years ago|reply
That syntax seems a bit verbose for a column that will most likely appear on every table you ever create. How about:

    id int SIMILAR TO SERIAL BUT WITHOUT PROBLEMS MOVING THINGS AROUND
And the other variant for when you aren’t sure that worked:

    id int SIMILAR TO SERIAL BUT WITH EVEN FEWER PROBLEMS MOVING THINGS AROUND
[+] uhoh-itsmaciek|4 years ago|reply
Another favor you should do yourself is use bigint keys (unless using uuid as suggested elsewhere, or you're very, very sure you'll never chew through 2 billion ids).
[+] cnity|4 years ago|reply
I often move ID generation into the application layer (this also helps avoid things like enumeration attacks), and quite a lot of cool Postgres features blur that line a little bit. It's interesting to think about sequences and other computational mechanisms living in a DB, and whether they make architecting applications easier or harder. I don't have a strong opinion either way, but I'm interested in HN's opinion.
[+] olddb|4 years ago|reply
>You might think this is trivial -- but SERIAL creates an "owned" (by a certain user) sequence behind the scenes, and so you run into massive headaches if you try to move things around.

Maybe it's because I'm too old, but making ids grow by sequence is how things 'ought' to be done the old-school DBA way. Sequences are great: they let the db maintain two or more sets of incremental ids, which comes in very handy when you're keeping track of invoices that need their own incremental numbering. Exposing that in the table's CREATE statement brings transparency, instead of some magical black-box IDENTITY. However, it's totally understandable from a developer's perspective that getting to know sequences is just an unneeded headache. ;-)

[+] tokumei|4 years ago|reply
Why? Concrete ownership is a good thing. I prefer more rigid structures and I like designing my tables around that. It was the Wild West when I did database design around MySQL 15ish years ago. Of course I didn’t know nearly as much as I do now back then.
[+] mixmastamyk|4 years ago|reply
From TFA:

    Instead of using GENERATED BY DEFAULT, use GENERATED ALWAYS.
[+] brasetvik|4 years ago|reply
Regarding "Match Against Multiple Patterns", the examples are about finding the _suffixes_ of something, email domains in the example.

A suffix search like that will not be able to use an index, whereas creating a functional index on the reversed value and looking up the reversed suffix as a prefix will:

  # create table users (id int primary key, email text);
  CREATE TABLE
  # create unique index on users(lower(email));
  CREATE INDEX
  # set enable_seqscan to false;
  SET
  
  # insert into users values (1, 'alice@gmail.com'), (2, 'bob@yahoo.com'), (3, 'carol@example.com');
  INSERT 0 3
  # explain select * from users where email ~* '@(gmail.com|yahoo.com)$';
                                  QUERY PLAN
  --------------------------------------------------------------------------
   Seq Scan on users  (cost=10000000000.00..10000000025.88 rows=1 width=36)
     Filter: (email ~* '@(gmail.com|yahoo.com)$'::text)

  # create index on users(reverse(lower(email)) collate "C"); -- collate C explicitly to enable prefix lookups
  CREATE INDEX

  # explain select * from users where reverse(lower(email)) ~ '^(moc.liamg|moc.oohay)';
                                               QUERY PLAN
  ----------------------------------------------------------------------------------------------------
   Bitmap Heap Scan on users  (cost=4.21..13.71 rows=1 width=36)
     Filter: (reverse(lower(email)) ~ '^(moc.liamg|moc.oohay)'::text)
     ->  Bitmap Index Scan on users_reverse_idx  (cost=0.00..4.21 rows=6 width=0)
           Index Cond: ((reverse(lower(email)) >= 'm'::text) AND (reverse(lower(email)) < 'n'::text))

(Another approach could of course be to tokenize the email, but the article is specifically about pattern matching.)
[+] netcraft|4 years ago|reply
Great article, always learn a lot from this author.

Here is another way I have used to do pivot tables / crosstab in postgres where you have a variable number of columns in the output:

https://gist.github.com/ryanguill/101a19fb6ae6dfb26a01396c53...

You can try it out here: https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=5dbbf7eadf0ed...

[+] leephillips|4 years ago|reply
Yes, just the per-DB history was worth the price of admission. Too bad there is no RSS feed.
[+] bajsejohannes|4 years ago|reply
Another great hidden feature is the LISTEN/NOTIFY system. You can use it to watch for changes on a table, and then use that to e.g. update a view in the UI.

For example, create a table with a trigger on insert that does "NOTIFY new_data". Then when querying, do

    LISTEN new_data;
    SELECT ...;
Now you'll get the results and any future updates.
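The trigger side of this can be sketched roughly like so (a hedged sketch; the table, function, and channel names are illustrative, and EXECUTE FUNCTION needs Postgres 11+):

    CREATE OR REPLACE FUNCTION notify_new_data() RETURNS trigger AS $$
    BEGIN
      -- channel name plus an optional payload (here, the new row's id)
      PERFORM pg_notify('new_data', NEW.id::text);
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER data_inserted
      AFTER INSERT ON my_table
      FOR EACH ROW EXECUTE FUNCTION notify_new_data();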
[+] cldellow|4 years ago|reply
Expanding on the regex trick: I often use [.] instead of \. to match a period.

They achieve the same thing, but [.] avoids the problem of your host language "helpfully" interpolating \. into . before sending the query to Postgres.

[+] brasetvik|4 years ago|reply
Two things that aren't exactly lesser-known, but that I wish more used continuously as part of development:

- generate_series(): While not the best way to make _realistic_ test data for proper load testing, at least it makes it easy to create a lot of data. If you don't have a few million rows in your tables while developing, you probably don't know how things behave, because a full table/seq scan will be fast anyway - and you won't spot the missing indexes (e.g. on reverse foreign keys, which I see missing often enough)

- `EXPLAIN` and `EXPLAIN ANALYZE`. Don't save minutes of looking at your query plans during development by spending hours fixing performance problems in production. EXPLAIN all the things.
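The two habits combine nicely; a minimal sketch (table name and row count are made up):

    CREATE TABLE events (id bigint PRIMARY KEY, payload text);

    INSERT INTO events
    SELECT i, 'payload ' || i
    FROM generate_series(1, 5000000) AS i;

    -- a seq scan over 5M rows makes the missing index obvious
    EXPLAIN ANALYZE SELECT * FROM events WHERE payload = 'payload 42';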

A significant percentage of production issues I've seen (and caused) are easily mitigated by those two.

By learning how to read and understand execution plans and how and why they change over time, you'll learn a lot more about databases too.

(CTEs/WITH-expressions are life changing too)

[+] dylanz|4 years ago|reply
It's worth noting that earlier versions of PostgreSQL didn't include the "AS NOT MATERIALIZED" option when specifying CTEs. In our setup, this took a huge toll on performance. If we had been on a more recent version of PostgreSQL (12, which introduced NOT MATERIALIZED), or if the query writer had just used a subquery instead of a CTE, we would have been fine.
[+] skunkworker|4 years ago|reply
Also, EXPLAIN (ANALYZE, BUFFERS) is by far my favorite. It shows you the number of pages loaded from disk vs. cache.

Also to note: EXPLAIN just plans the query; EXPLAIN (ANALYZE) plans and runs the query, which can take a while in production.
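For reference, the buffers form looks like this (query is illustrative, reusing the users table from earlier in the thread); the "Buffers: shared hit=... read=..." lines in the output distinguish cache hits from disk reads:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM users WHERE id = 1;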

[+] jrumbut|4 years ago|reply
I've seen similarly titled articles and they are usually full of the features that aren't used for a reason (MONEY column type, etc).

It's nice to see one that's full of good advice.

[+] tbrock|4 years ago|reply
I'm surprised there is no mention of foreign data wrappers, easily one of the best but lesser-known features.

Would I use them in production? No. Are they fun to play around with? Yes!

[+] laurent92|4 years ago|reply
Atlassian gives us sales figures as a large CSV. Nothing better than loading the CSV as a Postgres view with a foreign data wrapper, and performing pivots/joins/anything on it. Replace the file with a more recent one, and you even have versioning!
[+] iaabtpbtpnn|4 years ago|reply
We use them in production, they work fine. You probably want to enable use_remote_estimate for postgres_fdw though.
[+] micw|4 years ago|reply
I may be wrong, but isn't "overlapping" just "meeting_a.starts_at<meeting_b.ends_at && meeting_a.ends_at>meeting_b.starts_at"?

Or in words: There is an overlap if meeting A starts before meeting B ends and meeting A ends after meeting B starts.

So the scenario described looks way more complex than it actually is.

[+] KarlKemp|4 years ago|reply
I believe it's about the common case of two events sharing a single (infinitely small) point of time: meeting A from 3 to 4, followed by meeting B from 4 to 5.

Those two time periods do overlap mathematically, but not physically: they share the common point of "4", even if that point has duration 0.

I know the iCal standard goes into this for a bit.

[+] rvdginste|4 years ago|reply
If you use left-inclusive, right-exclusive intervals [), then the overlap condition is:

meeting_a.starts_at < meeting_b.ends_at && meeting_b.starts_at < meeting_a.ends_at

So, indeed, the scenario as described in the article is more complex than needed.
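Postgres range types can also express the half-open semantics directly, via the && overlap operator (a sketch; table and column names are taken from the query elsewhere in this thread, and the timestamps are assumed to be timestamptz):

    SELECT *
    FROM meetings, new_meetings
    WHERE tstzrange(meetings.starts_at, meetings.ends_at, '[)')
       && tstzrange(new_meetings.starts_at, new_meetings.ends_at, '[)');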

[+] micw|4 years ago|reply
I get the exactly same result with that query as the "overlap" query:

SELECT * FROM meetings, new_meetings WHERE new_meetings.starts_at<meetings.ends_at AND new_meetings.ends_at>meetings.starts_at;

[+] jpitz|4 years ago|reply
Well, for this feature, it means if meeting a and meeting b share any point in time ( except their boundaries )
[+] jayd16|4 years ago|reply
You'd fail on the exactly-the-same-time case.
[+] grangerg|4 years ago|reply
Both meeting A and B start at 2pm and end at 3pm.

After you're done with that corner case, you may find it better to determine if they don't overlap and negate that.

[+] boarnoah|4 years ago|reply
The upsert strategy mentioned (using ON CONFLICT to update instead of insert) has a catch: if your PK is a sequence, it will increment even though the insert didn't happen.

It's not a huge deal in most cases, is a bigger issue if you are running any sort of scheduled ETL to update (or insert new) things.
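A sketch of the behavior (table and column names are made up; the sequence behind the serial column advances even when the INSERT hits a conflict):

    CREATE TABLE t (id serial PRIMARY KEY, k text UNIQUE, v int);
    INSERT INTO t (k, v) VALUES ('a', 1);            -- id = 1
    INSERT INTO t (k, v) VALUES ('a', 2)
      ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;  -- updates, but still consumes id 2
    INSERT INTO t (k, v) VALUES ('b', 3);            -- gets id = 3, not 2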

[+] orangepanda|4 years ago|reply
Does it matter? In case of rollbacks, the sequence would increment too. You have to handle gaps anyway.
[+] freeest-will|4 years ago|reply
Careful with DISTINCT ON, it can be extremely slow, like 100x slower than a verbose and less readable alternative. But it might not matter in your use case.
[+] 1500100900|4 years ago|reply
A hand-written loose index scan sometimes helps in a case like this. I think it's not built-in yet.
[+] mixmastamyk|4 years ago|reply
This example didn't bring anything new to the table, only adds redundant extra chars:

    SELECT *
    FROM users
    WHERE email ~ ANY('{@gmail\.com$|@yahoo\.com$}')
Perhaps they were intending something similar to the following example instead. This one works but has several potential lurking issues:

    with connection.cursor() as cursor:
        cursor.execute('''
            SELECT *
            FROM users
            WHERE email ~ ANY(ARRAY%(patterns)s)
        ''' % {
            'patterns': [
                '@gmail\.com$',
                '@yahoo\.com$',
            ],
        })
The dictionary-style interpolation is unnecessary, the pattern strings should be raw strings (the escape is ignored only due to being a period), and this could be a SQL injection site if any of this is ever changed. I don't recommend this form as given, but it could be improved.
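A safer shape for the same query, assuming a psycopg2-style driver (psycopg2 adapts a Python list to a Postgres array, so the patterns travel as bound parameters rather than being interpolated into the SQL string):

```python
def domain_filter_query(patterns):
    # The driver substitutes %(patterns)s safely at execute() time;
    # the SQL text itself never contains user-controlled content.
    sql = """
        SELECT *
        FROM users
        WHERE email ~ ANY(%(patterns)s)
    """
    return sql, {"patterns": patterns}

# Raw strings keep the backslash literal, so the regex reaches Postgres intact.
sql, params = domain_filter_query([r"@gmail\.com$", r"@yahoo\.com$"])
# cursor.execute(sql, params)  # not run here; requires a live connection
```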
[+] jrmiii|4 years ago|reply
> This example didn't bring anything new to the table, only adds redundant extra chars:

OP indicated as much saying:

> This approach is easier to work with from a host language such as Python

I'm with you on the injection - have to be sure your host language driver properly escapes things.

[+] est|4 years ago|reply
God damn I need to unlearn those because it will spoil my capability to develop non-postgres systems.
[+] NightMKoder|4 years ago|reply
One of the hardest types of queries in a lot of DBs is the simple `min-by` or `max-by` queries - e.g. "find the most recent post for each user." Seems like Postgres has a solution - `DISTINCT ON` - though personally I've always been a fan of how BigQuery does it: `ARRAY_AGG`. e.g.

  SELECT user_id, ARRAY_AGG(STRUCT(post_id, text, timestamp) ORDER BY timestamp DESC LIMIT 1)[SAFE_OFFSET(0)].*
  FROM posts
  GROUP BY user_id
`DISTINCT ON` feels like a hack and doesn't cleanly fit into the SQL execution framework (e.g. you can't run a window function on the result without using subqueries). This feels cleaner, but I'm not actually aware of any other DBMS that supports `ARRAY_AGG` with `LIMIT`.
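For comparison, the Postgres DISTINCT ON version of "most recent post per user" (column names borrowed from the BigQuery example above):

    SELECT DISTINCT ON (user_id) user_id, post_id, text, timestamp
    FROM posts
    ORDER BY user_id, timestamp DESC;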
[+] bbirk|4 years ago|reply
I would use a lateral join. Alternatively you can use a window function instead of a lateral join, but from my experience lateral joins are usually faster for this kind of top-N query in Postgres.

    select user_table.user_id, tmp.post_id, tmp.text, tmp.timestamp
    from user_table
    left outer join lateral (
        select post_id, text, timestamp
        from post
        where post.user_id = user_table.user_id
        order by timestamp desc  -- desc to get the most recent post
        limit 1
    ) tmp on true
    order by user_id
    limit 30;

[+] zX41ZdbW|4 years ago|reply
In ClickHouse we have argMin/argMax aggregate functions. And also LIMIT BY. It looks like this:

    ... ORDER BY date DESC LIMIT 1 BY user_id
[+] zX41ZdbW|4 years ago|reply
And we have groupArray aggregate function with limit.

> I'm not actually aware of any other DBMS that supports `ARRAY_AGG` with `LIMIT`.

So, put ClickHouse in your list :)

[+] mnahkies|4 years ago|reply
I've generally used `PARTITION BY` to achieve the same effect in BigQuery (similar to what was suggested in the article for Postgres).

Does the `ARRAY_AGG` approach offer any advantages?

[+] Aeolun|4 years ago|reply
> ERROR: permission denied for table users

I would hesitate to call this a useful feature. This error message doesn’t tell you about the actual problem at all.

How about something like:

Permission denied, cannot select other than (id, name) from table users

[+] luhn|4 years ago|reply
That COPY trick is really neat. I've always used SELECT INTO.

    SELECT *
    INTO TEMP copy
    FROM foobar;
    \copy "copy" to 'foobar.csv' with csv header
[+] haki|4 years ago|reply
I'm always looking for new copy tricks. Is "copy" a temporary table? Or a variable...?
[+] madmax108|4 years ago|reply
The upsert strategy mentioned (using ON CONFLICT to update instead of insert) is something we use internally quite a bit, especially in multi-insert transactions, which were horrendous to handle without conflict resolution because any conflict would cancel the entire transaction.

Though I wish there were an easy way to figure out how many times a "CONFLICT" actually occurred (i.e. how many times we inserted vs. updated).
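One trick I'm aware of for this (hedging: it relies on an implementation detail, the system column xmax, which is 0 for rows freshly inserted by the statement and nonzero for rows it updated): expose it in RETURNING to count inserts vs. updates. Table and column names here are illustrative:

    INSERT INTO t (k, v) VALUES ('a', 1), ('b', 2)
    ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v
    RETURNING k, (xmax = 0) AS inserted;  -- true = inserted, false = updated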