top | item 35566804

(no title)

Sonata | 2 years ago

Here a common example:

Given the following tables:

  CREATE TABLE person (
    id INT PRIMARY KEY,
    name VARCHAR NOT NULL
  );

  CREATE TABLE pet (
    id INT PRIMARY KEY,
    name VARCHAR NOT NULL,
    owner_id INT NOT NULL
  );

It is common to want to join them, like so:

  SELECT owner.id, owner.name, COUNT(pet.id) AS numberOfPets 
  FROM person AS owner
  LEFT OUTER JOIN pet ON owner.id = pet.owner_id
  GROUP BY owner.id

This doesn't work in standard SQL, because all columns in the SELECT list have to either be aggregated or included in the GROUP BY. owner.name is neither. That is a bit silly though because we know each result row will only have one unambiguous value for the owner name, since the GROUP BY is on a unique column from the same table as the owner name.

We can solve this with ANY_VALUE:

  SELECT owner.id, ANY_VALUE(owner.name) AS name, COUNT(pet.id) AS numberOfPets 
  FROM person AS owner
  LEFT OUTER JOIN pet ON owner.id = pet.owner_id
  GROUP BY owner.id

discuss

brabel|2 years ago

I run into the equivalent problem in Kotlin sometimes. I want to group things by a key, but that results in type `Map<K, List<V>>` because the compiler can't be sure that there won't be more than one `V` for the field being used for grouping... which is why there's also `associateBy` which does return `Map<K, V>`, but will throw if there happens to be more than one value for each K.

Not relevant, but I thought it may help other readers understand the problem.

mtts|2 years ago

The common solution in this case is simply to also GROUP BY owner.name.

That said, if owner.name has different values and you didn’t actually care which one you’d get ANY_VALUE could maybe be useful.

Though in that case you could also just use MIN or MAX, which already exist.

orthoxerox|2 years ago

Yes, but all these solutions require the optimizer to be smart enough to be smart enough to optimize them away: "oh, owner.id is the primary key, so any value from owner will appear at most once, I don't have to group by them or track the maximum value, I can pick and store any value".

I think all optimizers are smart enough already, as this is a super common pattern. I honestly would have preferred this logic to be made more explicit: if you group by the primary key (or by the table name, as someone suggested), you can use all columns from the table without aggregate functions.

This kinda breaks down when you introduce subqueries and CTEs, which is where myopic optimizers can screw the pooch as well, so any_value can be of use there.