
Postgres 15 improves UNIQUE and NULL

428 points | Twisell | 3 years ago | blog.rustprooflabs.com

257 comments

[+] petereisentraut|3 years ago|reply
I am the author of this feature. The background here is that the SQL standard was ambiguous about which of the two ways an implementation should behave. So in the upcoming SQL:202x, this was addressed by making the behavior implementation-defined and adding this NULLS [NOT] DISTINCT option to pick the other behavior.

Personally, I think that the existing PostgreSQL behavior (NULLS DISTINCT) is the "right" one, and the other option was mainly intended for compatibility with other SQL implementations. But I'm glad that people are also finding other uses for it.
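
Concretely, the Postgres 15 syntax looks like this (a sketch; the table and column names are made up for illustration):

    CREATE TABLE users (id bigint PRIMARY KEY, email text);

    -- Default behaviour (NULLS DISTINCT): any number of NULL emails may coexist.
    -- With the new option, at most one NULL email is allowed:
    CREATE UNIQUE INDEX users_email_key ON users (email) NULLS NOT DISTINCT;

    -- The same option exists in constraint syntax:
    -- ALTER TABLE users ADD UNIQUE NULLS NOT DISTINCT (email);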

[+] ryanbrunner|3 years ago|reply
This is great! I have multiple places in our code where I've had to define multiple unique indexes (with conditionals) to get around this. It's definitely confusing behaviour the first time you encounter it.
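
For reference, the usual pre-15 workaround for "at most one NULL" looked something like this (table and column names hypothetical):

    -- uniqueness among non-NULL values:
    CREATE UNIQUE INDEX t_col_key ON t (col) WHERE col IS NOT NULL;
    -- allow at most one NULL row, by indexing an expression that is
    -- the same (true) for every row in the partial set:
    CREATE UNIQUE INDEX t_col_null_key ON t ((col IS NULL)) WHERE col IS NULL;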

Honestly I think SQL already breaks its own standard of "NULL = unknown". One example: if you do a LEFT OUTER JOIN and the joined data is missing, you get NULL in the result set. But in that case, it's not "unknown" whether there are records that can be joined to; the answer is that there are definitively no records there.

NULL is universally treated by most programming languages as meaning "not present" rather than "unknown", and in many cases SQL treats it that way in practice too. The ship has probably sailed on actually changing the default behaviour, but optional flags like these are a good step.

[+] knorker|3 years ago|reply
NULL means "I don't have this data". It doesn't guarantee that the data doesn't exist anywhere.

It's semantically ambiguous, yes.

But it's not ambiguous that the database doesn't have the data. It's very clearly NULL.

If "middle name" is NULL, then the database itself can't tell you if a person doesn't have a middle name, or if it's merely not yet been told the middle name. You could differentiate the two by saying empty string for "has no middle name"… except on Oracle VARCHAR2, where NULL is equal to empty string.

[+] SnowHill9902|3 years ago|reply
NULL means that the value is unknown, not that the data is unknown. You indeed have no data but its value may exist. You just don’t know it yet. NULL is very powerful when you understand that.
[+] irrational|3 years ago|reply
This is amazing. We moved from Oracle to Postgres after being on Oracle for 15+ years. That was a huge move that took 2 years. One of the biggest hurdles was rewriting thousands of SQL queries. One of the biggest things we had to change was accounting for how Oracle handled nulls versus how Postgres handled nulls. However, we missed this nuance about how Postgres treated nulls and unique. It wasn't until about 2 years into using Postgres, and running into so many issues, that we finally stumbled upon this nuance in the documentation. Crap. That required making a bunch more changes to account for this. At this point I'm firmly convinced that null handling is the hardest part of database work ;-)
[+] pbz|3 years ago|reply
Since we have some PG devs here: Can we please have a way to reorder columns?

Coming from MySQL, this is one of the first missing features one hits. It's like moving into a new house and being told you can never rearrange the furniture without moving to another house.

It leaves the unfair impression that this is a "toy" db. "What other basic features is it missing if it doesn't have this?" Please don't think of it as trivial; first impressions are very important.

[+] jl6|3 years ago|reply
Better to channel your displeasure into advocating for a future SQL standard to include this feature. Other DBs that support this do so via proprietary syntax extensions.
[+] urthor|3 years ago|reply
There are like 5 different feature wants I could add like this. A columnar storage engine comes to mind (slightly more work involved), and poaching a few SQLite QOL features.

Ultimately Postgres is a community-driven product, and people want to work on what they want to work on.

I don't get angry that some kind soul hasn't volunteered their Saturday for free for me.

[+] SnowHill9902|3 years ago|reply
What do you gain from that, other than soothing your OCD when executing \d t?
[+] Tostino|3 years ago|reply
Ideally, postgres would play column Tetris behind the scenes and store the columns on disk in the most appropriate way, while allowing the representation to be changed at will.
[+] hans_castorp|3 years ago|reply
> It leaves the unfair impression that this is a "toy" db.

So you consider Oracle, SQL Server and DB2 also to be "toy" databases?

[+] dboreham|3 years ago|reply
> Please don't think of it as trivial

I've worked with relational databases for 20+ years. This is the very first time I heard of this.

[+] anarazel|3 years ago|reply
The problem is that we hear a lot of different features touted as "the crucial missing one"...

Anyway, there's been work on this in the past, which recently has been picked up again. It's not all that trivial to do well.

[+] rwmj|3 years ago|reply
I think the main lesson from this is you should use a NOT NULL constraint on just about every column of your database, since the behaviour of NULLs is weird and likely to indicate a problem with your data.
[+] GuB-42|3 years ago|reply
NULL is totally fine, and happens naturally if you do an outer join. But if you are putting a UNIQUE constraint on a nullable column, then you may have a problem.

The idea with NULL is that it is not a value, it is the absence of value, like a field you didn't fill in a form. For example, if you ask two people their personal information, and neither specified their email address (email=NULL), you can't assume they have the same email address. And if you put a UNIQUE constraint on that email field, you probably don't mean that only one person is allowed to leave it blank: either you make it mandatory (NOT NULL), or you let everyone leave it blank.

The reason nullable and unique are rarely seen together is that unique is typically for keys, and generally, you don't want rows that have no key value. Also, putting a unique constraint on something that is not a key may not be the best idea. For example, if you don't intend to use email to uniquely identify people, what's the problem with two people having the same email?
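
As a sketch of that default behaviour (names made up for illustration):

    CREATE TABLE people (name text NOT NULL, email text UNIQUE);

    INSERT INTO people VALUES ('alice', NULL);
    INSERT INTO people VALUES ('bob', NULL);             -- OK: blank emails don't collide
    INSERT INTO people VALUES ('carol', 'c@example.com');
    INSERT INTO people VALUES ('dave', 'c@example.com'); -- ERROR: duplicate key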

[+] fabian2k|3 years ago|reply
NOT NULL really should be the default. There are of course valid reasons for allowing NULLs, but NOT NULL is much easier to handle for most cases. I don't think nullable columns indicate a problem with the data, but nullable columns that don't have a good justification are certainly trouble.
[+] tgv|3 years ago|reply
That's just plain wrong. You need some value to populate optional fields, and NULL is a pretty decent choice to indicate that a string, date or number has not been filled in. If you try to fix that by having special string or numeric constants, you're just making life harder, especially when you export the data. The poor souls that import it will think "hey, this timestamp says 1-1-1970, so this probably happened at Jan. 1, 1970."
[+] usrbinbash|3 years ago|reply
The problem is: NULL is a valid datatype in many situations. There are non-boolean fields that can just be empty; My goto example:

    -- Table definition for employee
    CREATE TABLE employee (
        name text, surname text, date_entry date, date_exit date);
Everyone has names, and if he's an employee, he probably has an entry date... but until he leaves, what's the exit date?

Other than soothsaying, my choices here are: NULL, some magic value, or an additional field telling me whether the application should evaluate the field.

The latter just makes the code more complicated.

Magic values may not be portable, and lead to a whole range of other problems if applications forget to check for them; I count myself lucky if the result is something as harmless as an auto-email congratulating an employee on his 7000-and-somethingth birthday.

That leaves me with NULL. Yes, it causes problems. Many problems. But at least it causes predictable problems like apps crashing when some python code tries to shove a NULL into datetime.datetime.

[+] p49k|3 years ago|reply
Be careful with this in Postgres: using a zero, blank space, empty array or similar as a default value instead of NULL can lead to enormous bloat in terms of disk space. There are distinct advantages of NULL that have to be considered.
[+] evv|3 years ago|reply
I have struggled to implement a tree structure in PG with nullable unique values.

Consider a "docs" table where each doc has a unique name, under a given parent doc. A null parent would be a top-level doc, and top-level docs just have a unique name. This didn't work before, and would hopefully be addressed by PG15.

I'm not sure if null parents really represent a "problem with my data", or if the tree structure was too exotic for PG to support properly.

How I got around it: hardcode a certain root doc ID, then the parent column can be NOT NULL. But this felt janky because the app has to ensure the root node exists before it can interact with top-level docs. Plus there were edge cases introduced with the root doc.
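
With Postgres 15 this constraint can presumably be written directly, avoiding the hardcoded root doc (a sketch, not tested against the actual schema):

    CREATE TABLE docs (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        parent_id bigint REFERENCES docs (id),
        name      text NOT NULL,
        -- NULL parents now collide with each other, so two
        -- top-level docs can no longer share a name:
        UNIQUE NULLS NOT DISTINCT (parent_id, name)
    );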

[+] knorker|3 years ago|reply
You can't get away from having to reason about NULL in SQL even if you remove all NULLs from the database.

You can still get NULLs at query time, and they'll have semantically valid meanings in views and subqueries.

[+] RedShift1|3 years ago|reply
What do you use instead of null then? For example let's take a purchase table, that keeps track when the purchase has been submitted to the supplier, let's call it sent_to_supplier. What do you use for sent_to_supplier in case the purchase has not been sent to the supplier yet?
[+] Pxtl|3 years ago|reply
A person with alopecia's hair color isn't "unknown", it's non-existent. They do not have hair.

Would you suggest that a database of appearance information about a person should have a separate subtable for "hair" to properly model this feature?

Either way, in the end, for displaying 99% of the time you're going to be using a VIEW that LEFT JOINs all these cute normalized tables back together again, and then you're going to want to filter those views dynamically on a nice grid that the user is viewing, and the fact that X != X is going to bite you in the ass all over again.

Creating more tables is just moving the problem around, not solving it.

[+] nicoburns|3 years ago|reply
Except that this change means that the behaviour of nulls is not weird anymore!
[+] jeff-davis|3 years ago|reply
In any thread involving SQL NULL, I see a lot of not-quite-right explanations of what SQL NULL is, conceptually. I challenge anyone who feels like they understand NULL conceptually to explain the following query:

    -- find orders with a total less than $10000
    select order_id, sum(price)
      from orders o left join order_lineitems l using (order_id)
      group by order_id having sum(price) < 10000;
This query is actually incorrect. Orders with no line items at all are clearly less than $10000, but they will be excluded because: first, the left outer join produces a NULL for the price; second, the group aggregation with SUM over that NULL will result in NULL; and third, the HAVING clause treats that NULL as false-like and excludes the order from the result.

Of course, we can explain procedurally what's happening here, and each individual step makes some sense. But the end result has no conceptual integrity.

Extra challenge: explain why using COUNT instead of SUM in the query does correctly return orders with fewer than 4 items:

    -- find orders with fewer than 4 line items
    select order_id, count(price)
      from orders o left join order_lineitems l using (order_id)
      group by order_id having count(price) < 4;

PS: thank you to the author for a developer-friendly feature that adds flexibility here!
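For what it's worth, a common fix for the first query is to collapse the aggregated NULL before comparing (same tables as above):

    -- treat an order with no line items as a $0 total
    select order_id, coalesce(sum(price), 0) as total
      from orders o left join order_lineitems l using (order_id)
      group by order_id having coalesce(sum(price), 0) < 10000;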
[+] masklinn|3 years ago|reply
I’m not sure what explanation you want, NULL is in essence “not a value”, which works more or less the same way “not a number” does. So yes if you perform operations between a NULL and a value you get a NULL, therefore when you sum or compare NULLs with or to other things you get NULL, which is then treated as false in a boolean context.

If the price of one item is UNKNOWN (which is what NULL represents in SQL) then it stands to reason that the sum is unknown, and it is unknown whether the sum is or is not smaller than 10000.

> But the end result has no conceptual integrity.

> Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?

> Extra challenge: explain why using COUNT instead of SUM in the query does correctly return orders with fewer than 4 items:

You don't need to determine the values in a sequence to count them: COUNT(price) simply skips NULLs rather than propagating them, so an order with no line items gets a count of 0 and is correctly included.

[+] silvestrov|3 years ago|reply
> This fits with my mental model of how I think unique constraints should work with null values.

Agree, this should be the default for unique indexes. Why would you ever want multiple NULLs in a unique index?

[+] masklinn|3 years ago|reply
Because it's very common for attributes to be optional, but unique if specified. E.g. your fleet management might have a 0-1:1 relationship between employees and cars: no car can be associated with 2 employees, and an employee can't have two cars, but any number of employees can be car-less, or cars employee-less (a lending pool, or cars which have not been attributed yet).

This also fits perfectly well with the normal interpretation / treatment of sql null as “not a value”.

[+] larsnystrom|3 years ago|reply
Multiple NULLs in a unique index are useful when an entity has 0 or 1 of a simple scalar value. For example, you might have a column with an external ID, such as a facebook login id in your users table. Not all users have a facebook id, but those who do should have a unique value in that column.
[+] goto11|3 years ago|reply
Let's say you have a Person table with an SSN column. The column has a unique constraint because no two people have the same SSN. But the column is nullable, since not everyone has an SSN.

It would be weird to expect that only one person can lack an SSN.

[+] jhugo|3 years ago|reply
I can't think of a situation where I'd want "only a single NULL"; that seems much weirder. Another comment described a tree data structure where it could make sense, but I can't think of others. Nullable unique indexes usually describe some kind of optional identifier; if it's present, you want it to be unique, but it's perfectly fine for more than one row to omit it.
[+] seanhunter|3 years ago|reply
One way nullable columns are often used is like an option type. It's easy to see that you might want rows to be unique if they have Some(value) but have multiple rows where the column is None
[+] daveoc64|3 years ago|reply
Some kind of secondary identifier, but only for records where it is applicable (e.g. loyalty card number).
[+] mhaberl|3 years ago|reply
> Why would you ever want multiple NULLs in an unique index?

You would maybe want some value to be unique if it exists.

[+] phoe-krk|3 years ago|reply
Consider the rows (1, NULL), (1, NULL), and (1, NULL), where each NULL signifies an absence of a value. These NULLs may later actually be filled in as (1, 1), (1, 2), and (1, 3). That's why you may not know whether any two NULL values are equal to one another.

In some cases the above is actually important, whereas in others you may want to treat NULL as a "value" instead.

[+] knorker|3 years ago|reply
E.g. if you have a "tax ID" column, and two people in your database don't have tax IDs because they're not <your country> tax payers.

Or "payment vendor processing ID". You may have multiple entries not yet sent to the vendor, so they're NULL.

This shows up literally all the time in SQL schemas.

[+] daigoba66|3 years ago|reply
Coming from an MS SQL Server background, this behavior surprised me.

While SQL Server generally does treat null values as not equal in query predicates, for the purpose of unique constraints/indexes it’ll treat them as the same value. I guess this does go against the spec? But it makes sense as a human when looking at a tuple and deciding if it is “unique” or not.

On that note, now I wonder what SQL Server does with nulls in a DISTINCT or GROUP BY. I suspect that it’ll treat all nulls as the same.
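
For completeness, the usual way to get Postgres-style behaviour on SQL Server is a filtered index (T-SQL; table name hypothetical):

    -- unique among non-NULL values only; any number of NULLs allowed
    CREATE UNIQUE NONCLUSTERED INDEX UQ_users_email
        ON users (email)
        WHERE email IS NOT NULL;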

[+] hans_castorp|3 years ago|reply
> Coming from an MS SQL Server background, this behavior surprised me.

Oracle behaves the same as Postgres with regards to NULL values in unique indexes.

> I guess this does go against the spec?

The first comment (by the feature's author) https://news.ycombinator.com/item?id=32054151 explained this:

> The background here is that the SQL standard was ambiguous about which of the two ways an implementation should behave. So in the upcoming SQL:202x, this was addressed by making the behavior implementation-defined and adding this NULLS [NOT] DISTINCT option to pick the other behavior.

[+] ahmed_ds|3 years ago|reply
When something that fundamental changes, what is the cascading effect? Do other functions refer to UNIQUE/NULL handling, or do they have their own versions of UNIQUE/NULL-like code?
[+] hrdwdmrbl|3 years ago|reply
Cool but this was trivially solved previously by excluding NULL from the index using a partial index.

I'm more interested in problems that have no great solution, like taking the first element of a set of partitioned data. Right now you have to partition & rank, and then select the rank 1 elements in a subsequent query.

[+] _flux|3 years ago|reply
I don't think you understood what this does (or else I misunderstood it), but if you do that, how are you going to get an index violation from having two NULL values in the index? Because that is what this allows: normally an index permitted multiple NULL entries, but with this a NULL is comparable to other NULLs, so you can have only one NULL value in the index.

Previously I had worked around this with a functional index on e.g. COALESCE(value_with_null, -1), but you don't always have a good sentinel value easily available. A more complicated index (..CASE WHEN..) would solve that, but it's nice to have this directly expressible; it seems efficient as well.
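
That workaround, for reference (using a hypothetical table t):

    -- all NULLs collapse onto the sentinel -1, so a second NULL
    -- collides; breaks down if -1 is a legitimate value
    CREATE UNIQUE INDEX t_val_key ON t (COALESCE(value_with_null, -1));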

[+] masklinn|3 years ago|reply
> Cool but this was trivially solved previously by excluding NULL from the index using a partial index.

That seems like the opposite solution, treating NULLs as different in database systems which treat them as identical by default (which I think only includes MSSQL).

In postgres, and most DBMS, using a partial index excluding NULL allows any number of rows to have a NULL value… which is the same behaviour as if you’re not excluding NULLs.

The new opt-in behaviour is treating all nulls as the same value, and thus allowing a UNIQUE index to constrain them.

This previously required two different partial indexes, or hacks like COALESCE-ing NULLs to a sentinel for the constrained data subset.

[+] zeroimpl|3 years ago|reply
I think this is a great change. I wish NULLS NOT DISTINCT was the default already - if I wanted the constraint to ignore null values I’d have used a partial index to only index non-null values. Doing the opposite is much more difficult.
[+] layer8|3 years ago|reply
> Because NULL values are of unknown equality to one another, they do not violate UNIQUE constraints.

Well, arguably you don’t know whether they violate uniqueness, so inserting more than one shouldn’t have been allowed.

[+] hagope|3 years ago|reply
This is cool, but wouldn't creating a constraint using a nullable column be considered a poor design decision? In which scenarios would this be a good idea?
[+] cptn_badass|3 years ago|reply
When an entry can belong to 0 or 1 related objects only. Not that I'd put a constraint in such a scenario, but I imagine a User can optionally have a Subscription, so subscription_id is either nil or present, and said subscription cannot be associated with any other User.