You might think this is trivial -- but SERIAL creates an "owned" (by a certain user) sequence behind the scenes, and so you run into massive headaches if you try to move things around.
Identity columns don't, and avoid the issue altogether.
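For anyone who hasn't made the switch, a minimal sketch of the two forms (table names invented):

```sql
-- SERIAL: creates a separately-owned sequence behind the scenes
CREATE TABLE orders_old (
    id serial PRIMARY KEY,
    total numeric
);

-- Identity column (PostgreSQL 10+): the sequence belongs to the column,
-- so ownership and moving things around behave sanely
CREATE TABLE orders (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    total numeric
);
```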
I'll stop short of giving a recommendation or using the word "should", but I'll give encouragement to consider using UUIDs for keys. I have used them in several systems, have never had any issues with them, and they solve so many problems. For one, the ability to generate a key on the client, on the server, or in the database is great. And the fact that keys are unique not only within a table but across the system (or many systems) is a huge advantage.
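To illustrate the client-side generation point, a minimal Python sketch (the table and query are invented for illustration):

```python
import uuid

# The key exists before we ever talk to the database, so it can go into
# logs, caches, or messages to other services immediately.
new_user_id = uuid.uuid4()

# Hypothetical parameterized insert; any driver that binds parameters
# works the same way.
query = "INSERT INTO users (id, email) VALUES (%s, %s)"
params = (str(new_user_id), "user@example.com")

print(new_user_id)  # unique across tables and systems, not just one table
```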
Another favor you should do yourself is use bigint keys (unless using uuid as suggested elsewhere, or you're very, very sure you'll never chew through 2 billion ids).
I often move ID generation into the application layer (this also helps avoid things like enumeration attacks), and quite a lot of cool Postgres features actually blur that line a little bit. It's interesting to think about sequences and other computational mechanisms living in a DB, and whether they make architecting applications easier or harder. I don't have a strong opinion either way, but I'm interested in HN's opinion.
>You might think this is trivial -- but SERIAL creates an "owned" (by a certain user) sequence behind the scenes, and so you run into massive headaches if you try to move things around.
Maybe it is because I'm too old, but making ids grow by sequence is how things 'ought' to be done in the old skool db admin ways. Sequences are great: they allow the db to maintain two or more sets of incremental ids, which comes in very handy when you're keeping track of certain invoices that need certain incremental numbers. Exposing that in the table's CREATE statement brings transparency, instead of some magical blackbox IDENTITY.
However, it is totally understandable from a developer's perspective that getting to know sequences is just an unneeded headache. ;-)
Why? Concrete ownership is a good thing. I prefer more rigid structures and I like designing my tables around that. It was the Wild West when I did database design around MySQL 15ish years ago. Of course I didn’t know nearly as much as I do now back then.
Regarding "Match Against Multiple Patterns", the examples are about finding the _suffixes_ of something, email domains in the example.
An attempt to find a suffix like that will not be able to use an index, whereas creating a functional index on the reverse and looking for the reversed suffix as a prefix will be:
# create table users (id int primary key, email text);
CREATE TABLE
# create unique index on users(lower(email));
CREATE INDEX
# set enable_seqscan to false;
SET
# insert into users values (1, '[email protected]'), (2, '[email protected]'), (3, '[email protected]');
INSERT 0 3
# explain select * from users where email ~* '@(gmail.com|yahoo.com)$';
QUERY PLAN
--------------------------------------------------------------------------
Seq Scan on users (cost=10000000000.00..10000000025.88 rows=1 width=36)
Filter: (email ~* '@(gmail.com|yahoo.com)$'::text)
# create index on users(reverse(lower(email)) collate "C"); -- collate C explicitly to enable prefix lookups
CREATE INDEX
# explain select * from users where reverse(lower(email)) ~ '^(moc.liamg|moc.oohay)';
QUERY PLAN
----------------------------------------------------------------------------------------------------
Bitmap Heap Scan on users (cost=4.21..13.71 rows=1 width=36)
Filter: (reverse(lower(email)) ~ '^(moc.liamg|moc.oohay)'::text)
-> Bitmap Index Scan on users_reverse_idx (cost=0.00..4.21 rows=6 width=0)
Index Cond: ((reverse(lower(email)) >= 'm'::text) AND (reverse(lower(email)) < 'n'::text))
(Another approach could of course be to tokenize the email address, but since this section is about pattern matching in particular, I'll stick to that.)
Another great hidden feature is the LISTEN/NOTIFY system. You can use it to watch for changes on a table, and then use that to e.g. update a view in the UI.
For example, create a table with an insert trigger that does "NOTIFY new_data". Then on query do
LISTEN new_data;
SELECT ...;
Now you'll get the results and any future updates.
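A minimal sketch of the trigger side (table and channel names are illustrative; EXECUTE FUNCTION needs PostgreSQL 11+, older versions use EXECUTE PROCEDURE):

```sql
CREATE OR REPLACE FUNCTION notify_new_data() RETURNS trigger AS $$
BEGIN
    -- pg_notify is equivalent to NOTIFY, but allows a dynamic payload
    PERFORM pg_notify('new_data', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER on_insert_notify
    AFTER INSERT ON my_table
    FOR EACH ROW EXECUTE FUNCTION notify_new_data();
```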
Expanding on the regex trick: I often use [.] instead of \. to match a period.
They achieve the same thing, but [.] avoids the problem of your host language "helpfully" interpolating \. into . before sending the query to Postgres.
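A quick Python illustration of the host-language point (it applies to any layer that does its own string processing):

```python
import re

# With backslash escaping you depend on every layer between you and the
# regex engine preserving the backslash. In Python you'd need a raw string
# or a doubled backslash to be safe:
assert re.fullmatch('gmail\\.com', 'gmail.com')

# "[.]" is a character class containing only a literal period: same meaning,
# but no backslash for any layer to mangle.
assert re.fullmatch('gmail[.]com', 'gmail.com')
assert not re.fullmatch('gmail[.]com', 'gmailXcom')  # "." is not a wildcard here
```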
Two things that aren't exactly lesser-known, but that I wish more people used continuously as part of development:
- generate_series(): While not the best way to make _realistic_ test data for proper load testing, at least it makes it easy to generate a lot of data. If you don't have a few million rows in your tables when you're developing, you probably don't know how things behave, because a full table/seq scan will be fast anyway - and you'll not spot the missing indexes (e.g. on reverse foreign keys, which I see missing often enough)
- `EXPLAIN` and `EXPLAIN ANALYZE`. Don't save minutes of looking at your query plans during development by spending hours fixing performance problems in production. EXPLAIN all the things.
A significant percentage of production issues I've seen (and caused) are easily mitigated by those two.
By learning how to read and understand execution plans and how and why they change over time, you'll learn a lot more about databases too.
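A sketch of the generate_series() idea (schema invented):

```sql
-- A million rows of synthetic users in one statement
INSERT INTO users (id, email)
SELECT i, 'user' || i || '@example.com'
FROM generate_series(1, 1000000) AS i;

-- Now query plans actually mean something
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user42@example.com';
```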
It's worth noting that earlier versions of PostgreSQL didn't include the "AS NOT MATERIALIZED" option when specifying CTEs. In our setup, this took a huge hit on performance. If we had been on a more recent version of PostgreSQL (I think we were on 11 in this case), or if the query writer had just used a sub-query instead of a CTE, we would have been fine.
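For reference, the PostgreSQL 12+ syntax looks like this (a sketch; before 12, every CTE was an optimization fence):

```sql
-- NOT MATERIALIZED asks the planner to inline the CTE so that outer
-- predicates can be pushed down into it
WITH recent AS NOT MATERIALIZED (
    SELECT * FROM events WHERE created_at > now() - interval '1 day'
)
SELECT * FROM recent WHERE user_id = 42;
```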
On this first point — 'Get the Number of Updated and Inserted Rows in an Upsert' — unfortunately this `xmax` method is not always reliable and there does not seem to be any reliable alternative.
Atlassian gives us sales figures as a large CSV. Nothing better than loading the CSV as a Postgres view with a foreign data wrapper, and performing pivots/joins/anything on it. Replace the file with a more recent one, and you even have versioning!
I believe it's about the common case of two events sharing a single (infinitely small) point of time: meeting A from 3 to 4, followed by meeting B from 4 to 5.
Those two time periods do overlap mathematically, but not physically: they share the common point of "4", even if that point has duration 0.
I know the iCal standard goes into this for a bit.
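Postgres range types encode exactly this: bounds default to half-open '[)', so back-to-back ranges are adjacent, not overlapping. A sketch:

```sql
-- Meeting A [3pm, 4pm) and meeting B [4pm, 5pm) share the point 4pm,
-- but with half-open bounds the && (overlaps) operator returns false
SELECT tsrange('2021-01-01 15:00', '2021-01-01 16:00', '[)')
    && tsrange('2021-01-01 16:00', '2021-01-01 17:00', '[)');
```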
The upsert strategy mentioned (using ON CONFLICT to update instead of insert) has a catch: if your PK is backed by a sequence, the sequence is incremented even when the insert fails.
It's not a huge deal in most cases, but it's a bigger issue if you are running any sort of scheduled ETL to update (or insert new) things.
Careful with DISTINCT ON, it can be extremely slow, like 100x slower than a verbose and less readable alternative. But it might not matter in your use case.
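For context, the DISTINCT ON pattern being discussed usually looks something like this (invented schema):

```sql
-- Latest post per user: DISTINCT ON keeps the first row per user_id
-- in ORDER BY order
SELECT DISTINCT ON (user_id) user_id, post_id, created_at
FROM posts
ORDER BY user_id, created_at DESC;
```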
This example doesn't bring anything new to the table; it only adds redundant extra chars:
SELECT *
FROM users
WHERE email ~ ANY('{@gmail\.com$|@yahoo\.com$}')
Perhaps they were intending something similar to the following example instead.
This one works but has several potential lurking issues:
    with connection.cursor() as cursor:
        cursor.execute('''
            SELECT *
            FROM users
            WHERE email ~ ANY(ARRAY%(patterns)s)
        ''' % {
            'patterns': [
                '@gmail\.com$',
                '@yahoo\.com$',
            ],
        })
The dictionary-style interpolation is unnecessary, the pattern strings should be raw strings (the escape is ignored only due to being a period), and this could be a SQL injection site if any of this is ever changed. I don't recommend this form as given, but it could be improved.
One of the hardest types of queries in a lot of DBs is the simple `min-by` or `max-by` queries - e.g. "find the most recent post for each user." Seems like Postgres has a solution - `DISTINCT ON` - though personally I've always been a fan of how BigQuery does it: `ARRAY_AGG`. e.g.
SELECT user_id, ARRAY_AGG(STRUCT(post_id, text, timestamp) ORDER BY timestamp DESC LIMIT 1)[SAFE_OFFSET(0)].*
FROM posts
GROUP BY user_id
`DISTINCT ON` feels like a hack and doesn't cleanly fit into the SQL execution framework (e.g. you can't run a window function on the result without using subqueries). This feels cleaner, but I'm not actually aware of any other DBMS that supports `ARRAY_AGG` with `LIMIT`.
I would use a lateral join. Alternatively you can use a window function instead of a lateral join, but from my experience lateral joins are usually faster for this kind of top-N query in Postgres.
The upsert strategy mentioned (using ON CONFLICT to update instead of insert) is something we use internally quite a bit, esp in multi-insert transactions which were horrendous to handle without conflict resolution because any conflict would end up cancelling the entire transaction.
Though I wish there was an easy way to figure out how many times a "CONFLICT" actually occurred (i.e. how many times did we insert vs. update).
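One trick (mentioned in the article and, as noted elsewhere in the thread, not fully reliable) is to check xmax in the RETURNING clause. A sketch:

```sql
-- xmax = 0 for freshly inserted rows; non-zero when the row was updated
-- by the ON CONFLICT branch. Heuristic, not guaranteed.
INSERT INTO users (id, email)
VALUES (1, 'a@example.com'), (2, 'b@example.com')
ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email
RETURNING id, (xmax = 0) AS inserted;
```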
https://hakibenita.com/postgresql-unknown-features#prevent-s...
https://www.2ndquadrant.com/en/blog/postgresql-10-identity-c...
Blogpost discussing this approach: https://about.gitlab.com/blog/2016/03/18/fast-search-using-p...
Here is another way I have used to do pivot tables / crosstab in postgres where you have a variable number of columns in the output:
https://gist.github.com/ryanguill/101a19fb6ae6dfb26a01396c53...
You can try it out here: https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=5dbbf7eadf0ed...
(CTEs/WITH-expressions are life changing too)
Also to note: EXPLAIN just plans the query; EXPLAIN (ANALYZE) plans and runs the query, which can take a while in production.
I posted to pgsql-hackers about this problem here: https://www.postgresql.org/message-id/DE57F14C-DB96-4F17-925...
It's nice to see one that's full of good advice.
Would I use them in production? No. Are they fun to play around with? Yes!
Or in words: there is an overlap if meeting A starts before meeting B ends, and meeting A ends after meeting B starts.
So the scenario described looks way more complex than it actually is.
meeting_a.starts_at <= meeting_b.ends_at && meeting_b.starts_at <= meeting_a.ends_at
So, indeed, the scenario as described in the article is more complex than needed.
SELECT * FROM meetings, new_meetings WHERE new_meetings.starts_at < meetings.ends_at AND new_meetings.ends_at > meetings.starts_at;
After you're done with that corner case, you may find it better to determine if they don't overlap and negate that.
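The negation trick in a few lines of Python, using half-open [start, end) intervals:

```python
def overlaps(a_start, a_end, b_start, b_end):
    # Direct form: A starts before B ends, and A ends after B starts
    return a_start < b_end and a_end > b_start

def overlaps_by_negation(a_start, a_end, b_start, b_end):
    # "They don't overlap" is easier to state: A is entirely before B,
    # or B is entirely before A. Negate that.
    disjoint = a_end <= b_start or b_end <= a_start
    return not disjoint

# Back-to-back meetings (3-4 and 4-5) do not overlap in either form
assert overlaps(3, 4, 4, 5) == overlaps_by_negation(3, 4, 4, 5) == False
assert overlaps(3, 5, 4, 6) == overlaps_by_negation(3, 5, 4, 6) == True
```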
OP indicated as much saying:
> This approach is easier to work with from a host language such as Python
I'm with you on the injection - have to be sure your host language driver properly escapes things.
    select user_table.user_id, tmp.post_id, tmp.text, tmp.timestamp
    from user_table
    left outer join lateral (
        select post_id, text, timestamp
        from post
        where post.user_id = user_table.user_id
        order by timestamp desc  -- desc to get the most recent post
        limit 1
    ) tmp on true
    order by user_id
    limit 30;
> I'm not actually aware of any other DBMS that supports `ARRAY_AGG` with `LIMIT`.
So, put ClickHouse on your list :)
Does the `ARRAY_AGG` approach offer any advantages?
I would hesitate to call this a useful feature. This error message doesn’t tell you about the actual problem at all.
How about something like:
Permission denied: cannot select columns other than (id, name) from table users