Sorry, but someone who is averse to joins is not a non-expert in SQL, they are a total novice. The answer is the same as with any other programming language: you simply must learn the fundamentals in order to use it.
Not sorry, I’ll stick with SQL non-expert as I’ve only worked with databases for a few decades and sometimes run into people who know more.
halfcat|1 year ago

Working with a database you built or can control is kind of a simplistic example.
In my experience this most often arises when it’s someone else’s database or API you’re interacting with and do not control.
An upstream ERP system doesn’t have unique keys, or it changes the name of a column, or an accounting person adds a custom field, or the accounting system does claim to have unique identifiers but gets upgraded and changes the unique row identifiers that were never supposed to change, or a user deletes a record and recreates the same record with the same name, which now has a different ID so the data in your reporting database no longer has the correct foreign keys, and some of the data has to be cross-referenced from the ERP system with a CRM that only has a flaky API and the only way to get the data is by pulling a CSV report file from an email, which doesn’t have the same field names to reliably correlate the data with the ERP, and worse the CRM makes these user-editable so one of your 200 sales people decides to use their own naming scheme or makes a typo and we have 10 different ways of spelling “New York”, “new york”, “NY”, “newyork2”, “now york”, and yeah…
Turns out you can sometimes end up with extra rows despite your best efforts and that SQL isn’t always the best tool for joining data, and no I’m not interested in helping you troubleshoot your 7-page SQL query that’s stacked on top of multiple layers of nested SQL views that’s giving you too many rows. You might even say I’m averse.
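For what it's worth, the "ten spellings of New York" problem can sometimes be tamed before the join rather than inside it. Here's a minimal sketch using only Python's stdlib difflib; the canonical list and the 0.75 cutoff are made-up values you'd tune for your own data:

```python
import difflib

# Illustrative canonical list -- in practice this comes from your
# reference data. Abbreviations like "NY" won't clear a similarity
# cutoff and still need an explicit alias table on top of this.
CANONICAL = ["New York", "Newark", "Boston"]
_keys = {c.lower().replace(" ", ""): c for c in CANONICAL}

def normalize(city):
    """Map a messy free-text city onto a canonical spelling, or None."""
    cleaned = city.strip().lower().replace(" ", "")
    match = difflib.get_close_matches(cleaned, list(_keys), n=1, cutoff=0.75)
    return _keys[match[0]] if match else None

for messy in ["New York", "new york", "newyork2", "now york", "NY"]:
    print(repr(messy), "->", normalize(messy))
```

Typo-level variants ("newyork2", "now york") land on "New York"; "NY" falls below the cutoff and returns None, which is the honest answer since abbreviations need their own mapping.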
Yes when you have duplicated data and data inconsistencies/integrity issues you might get duplicate data and data inconsistencies/integrity issues in your output.
consteval|1 year ago

This is a problem of form, not method. JOINs are a fantastic, well-defined method to aggregate data. If the form of your data is messed up, then naturally the result may be too.
> no I’m not interested in helping you troubleshoot your 7-page SQL query that’s stacked on top of multiple layers of nested SQL views that’s giving you too many rows
People say this type of thing but SQL is an incredibly high-level language.
Yes, debugging a big SQL query can suck. Debugging the equivalent logic outside SQL is almost always much worse. I'd rather debug a 30-line query than a 400-line Perl script that does the same thing, because that's actually the alternative.
I have manually aggregated messy data in awk, Perl, and Python... it is much, much worse.
rawgabbit|1 year ago

I have run into this scenario a few times, where a multi-hour process produces an explosion of rows that no one cared to troubleshoot further. They only wanted to de-duplicate the final result.
In practice, I ended up creating a skeleton table that not only had the appropriate constraints for de-duplication, but also contained mostly empty rows (empty except for the required key ID columns) matching the exact rows they were expecting. Then I would perform an UPDATE <skeleton> ... FROM <complicated join>. It's a hack, but if there was no desire to rewrite a process that was written years ago by teams of consultants, I can only do what I can do.
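A minimal sketch of that skeleton-table hack, using Python's built-in sqlite3 (UPDATE ... FROM needs SQLite 3.33 or newer; all names here are made up):

```python
import sqlite3

# Skeleton table with the de-dup constraint built in: cust_id is the
# PRIMARY KEY, and we pre-insert exactly the key rows the report
# expects, leaving the other columns empty for now.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE skeleton (cust_id INTEGER PRIMARY KEY, total REAL);
    INSERT INTO skeleton (cust_id) VALUES (1), (2);

    -- Stand-in for the complicated join that over-produces rows.
    CREATE TABLE messy (cust_id INTEGER, total REAL);
    INSERT INTO messy VALUES (1, 10.0), (1, 10.0), (2, 7.5);
""")
# UPDATE <skeleton> ... FROM <complicated join>: the fan-out collapses
# because each skeleton row can only be updated, never multiplied.
con.execute("""
    UPDATE skeleton SET total = messy.total
    FROM messy WHERE messy.cust_id = skeleton.cust_id
""")
final = con.execute(
    "SELECT cust_id, total FROM skeleton ORDER BY cust_id").fetchall()
print(final)  # [(1, 10.0), (2, 7.5)]
```

The duplicate row in `messy` never shows up in the result: the skeleton's primary key caps the output at exactly the rows that were expected.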
Nobody is entitled to any of this being easy. If you don't like working with badly-designed databases, why not simply work with people who know how databases work? In the meantime, I have bad news: nobody is going to do the hard work for you.
yunolearn|1 year ago