
Why SELECT * is bad for SQL performance (2020)

259 points | wheresvic4 | 5 years ago | tanelpoder.com

164 comments

[+] bjarneh|5 years ago|reply
> assuming that your application doesn’t actually need all the columns.

Title is a bit click-bait-ish; I was expecting some deep insight of why this:

     SELECT A,B,C FROM TBL;
was superior to this:

     SELECT * FROM TBL;
when the table had three columns. Instead we get a wildcards are bad argument; especially when that wildcard fetches data that you don't need etc; which I guess everyone agrees with.
[+] whack|5 years ago|reply
#1 comment:

> Instead we get a wildcards are bad argument; especially when that wildcard fetches data that you don't need etc; which I guess everyone agrees with.

#2 comment:

> Nevertheless in reality most of the stated reasons have almost no real practicability... So in most cases fetching to many columns would be no problem

This happens surprisingly frequently in a lot of tech discussions. The most popular comment accuses the author of saying something so obviously correct, that it's a waste of time to even bring it up. And the second most popular comment is someone telling the author that he's wrong.

[+] tw04|5 years ago|reply
In general, in my limited experience, the former is better than the latter because you have NO IDEA what's going to be in that database a decade from now, but your query is probably still getting called by some obscure application that nobody even knows exists.

When someone does something stupid and puts a whole bunch of unexpected data into that database, the latter query will cripple the entire show, whereas in theory the former will still just ask for the original intended data.

[+] Hjfrf|5 years ago|reply
Your first question is fairly common too: select * is to be avoided if your table can change or you care about the order of columns.

Select * is better from a CTE/temp table, because then you only need to make changes in one place.

I'm not sure how much deeper it's possible to go.

[+] actuator|5 years ago|reply
I think the title is apt. It is quite common for people to mistakenly fall into this pattern; I have seen it mostly happen through ORMs.

It might have changed, but in ActiveRecord it was fairly common to do:

  models = Model.where(column_name: column_val)
and then further down the fetch just use two or three columns from the whole result. This used to issue a `SELECT *` to the table if no select clause went along, and most people don't realise the harm in fetching all columns. Even if it is a single row, storage engines like InnoDB store blobs in an indirected way, which can lead to more read IO ops than required.
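
The same trap is easy to reproduce outside Rails. A minimal sketch with Python's sqlite3 (the table and column names are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE models (id INTEGER PRIMARY KEY, name TEXT, payload BLOB)")
con.execute("INSERT INTO models (name, payload) VALUES (?, ?)",
            ("first", b"x" * 1_000_000))  # a large blob the caller rarely needs

# ORM-style SELECT *: drags the 1 MB blob across the wire every time
row_all = con.execute("SELECT * FROM models WHERE name = ?", ("first",)).fetchone()

# Explicit projection: only the columns the caller actually uses
row_some = con.execute("SELECT id, name FROM models WHERE name = ?", ("first",)).fetchone()
```

In ActiveRecord terms, this is the difference between `Model.where(...)` and `Model.where(...).select(:id, :name)` (or `pluck`).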
[+] ComodoHacker|5 years ago|reply
>which I guess everyone agrees with.

You would be surprised how many don't even think for a second about it, let alone realize any of the points made by the author.

[+] dathinab|5 years ago|reply
> [...]
> was superior to this:
> [...]

For API Stability:

- querying only the columns you really need makes it less likely that a column you don't really need somehow breaks your code when changed (e.g. because some ORM mapping stops working). (This, btw, is one of the major drawbacks of many REST APIs, i.e. they default to the REST equivalent of `SELECT *`.)

- deciding explicitly the order in which you want columns to be returned avoids a variety of bugs. Relying on the column order of `SELECT *` is brittle, but some ORMs do so, and other code might do so by accident in some way or another. Explicitly defining the order makes it easier to avoid such bugs.

- SQL queries don't need to be updated when you add columns (to explicitly exclude a new column the query doesn't need)

[+] bcrosby95|5 years ago|reply
Depends upon how you use your database. If tables are a 1-to-1 representation of domain objects, then you should use select *. If they aren't, then you should prefer to be choosy about your columns.
[+] brundolf|5 years ago|reply
It's obvious once you think about it, but it's not something I'd ever thought about before. I think the article still had value
[+] HenryBemis|5 years ago|reply
Hi, IT auditor here.

Sometimes 'we' ask for * instead of "gimme everything that starts with a number" because we may need to check whether the column "ID", which is supposed to be a 10-digit number, has an entry like "A1234567890" that has not raised any alarms. And this means that this human is not paying taxes, because his/her tax-ID is not 'run' when the gov runs the tax calculations: the script pulls 10-digit numbers, and my ID has a letter. So some DBA got paid $50k, changed my field, ran the thing, and changed my field back. If you think this is not happening, it is.

Consider other scenarios with obligations like that.

I once did a KYC audit for a bank. When I asked * , I laughed and cried with the results.

Soooooooooo.. yes, very click bait-y title. Each SELECT should fit the purpose it serves for the report reader. When doing KYC and you have to review 5mil clients, you want *, otherwise you will miss the: Surname: 123Smith!!!, DoB: 30/2/1980, and other duplicates, nasty surprises, people over 150yo, etc.

[+] tpetry|5 years ago|reply
These arguments are correct and have been repeated for many, many years. Nevertheless, in reality most of the stated reasons have almost no practical relevance.

They're correct in an "academic view", but most applications are CRUD and based on an ORM, and the few extra columns fetched won't make any difference. There are some rare cases where these statements matter, especially the lob fetching, but in those cases the queries are most often switched to manually specifying the columns just for these tables.

What is really needed would be something like SELECT * EXCEPT verylargecolumn FROM mytable ... So in most cases fetching too many columns would be no problem, but if there is really a column which is problematic you could easily request to ignore it.

[+] branko_d|5 years ago|reply
It's not just about large columns.

When fetching from views, selecting unneeded columns can trigger unneeded JOINs.

Assuming no JOINed column is mentioned in the select list (either explicitly or through *), the query optimizer can eliminate the OUTER JOIN, or INNER JOIN when it is known that the referenced row exists (because there is a FOREIGN KEY).

This is mentioned in the article as "Oracle’s join elimination transformation", but (most?) other databases can do it too.

[+] EugeneOZ|5 years ago|reply
My experience is quite contrary :) I prefer to not use ORM (even if I wouldn’t mind, clients often ask me to work with the DB directly, without ORM), and every selected column matters - in terms of performance and amount of transferred data. Some fields are TEXT/BLOB and sending thousands of them “just because” is not cool.
[+] pqb|5 years ago|reply
As another commenter mentioned, BigQuery supports the `SELECT * EXCEPT(column names...) FROM mytable` syntax. Also, it is worth noting that `SELECT COUNT(*) FROM mytable` in BigQuery is faster and cheaper than `SELECT COUNT(always_truish_column) FROM mytable`.
[+] teraku|5 years ago|reply
This is supported by some databases and DWHs. I work with BigQuery and they do have that syntax
[+] meepmorp|5 years ago|reply
>most applications are CRUD and based on an ORM

Do you actually believe that most CRUD apps use an ORM?

[+] orangepanda|5 years ago|reply
> I explicitly specify the columns of interest in the select-list (projection), [...] for application reliability reasons. For example, will your application’s data processing code suddenly break when a new column has been added or the column order has changed in a table?

Application reliability for me is the main reason. Better performance is just a plus.

[+] SpicyLemonZest|5 years ago|reply
It's the main reason for me too, but in an increasing number of modern contexts I think making SELECT * work well is part of application reliability. I've encountered quite a few users who add new columns all the time (or consume data from a dozen other groups who add new columns semi-frequently), so if we don't enable them to use SELECT * that just means they have to constantly rebuild and redeploy their data processing code.
[+] Mauricebranagh|5 years ago|reply
Absolutely. This was drummed into me 20+ years ago, when I worked on my first major SQL-based project.
[+] jerzyt|5 years ago|reply
While SELECT * has no place in production code, it's the right thing to do during data exploration with ad hoc queries. Only after I get a sufficient understanding of the data do I restrict the columns. I often see junior developers assume what they need just from the names of the columns, and miss something. Oh wait, I just did that a few days ago. It was embarrassing.
[+] ttz|5 years ago|reply
Using LIMIT N is a great habit to develop to prevent yourself from hogging resources too, during exploration.
[+] InfiniteRand|5 years ago|reply
So I think the case for selecting all of the columns (and for an ORM-like approach of selecting related fields at once) is to reduce the number of queries. Generally I think that selecting two columns at one point in application logic and firing off a separate query for another two columns later in execution of a script is going to be worse than selecting 6 columns. Query performance is not the only consideration here, you also need to consider network performance of transmitting the data from the database server to the client (although that equation might change if you're using an in-memory database or something like that).

Also, selecting all the columns at once has an ergonomics effect of making it simpler to take advantage of caching when writing multiple components which might query the same data within the same execution.

That being said, I do think selecting fewer columns is better than more, and even when using an ORM I find myself writing projections and helper objects to reduce how many columns I need in my queries. But I view that as more of an optimization, a secondary concern that I can take care of after the main logic is implemented, rather than a primary concern that should shape the initial implementation of the logic.

Of course, the performance concerns I have had might be vastly different than the original poster's, so take my comment with a grain of salt.

[+] dutchmartin|5 years ago|reply
Can’t we just argue that the * wildcard is a great feature when you are writing SQL queries on test data? I personally use it a lot when writing joins to eventually get the data I want, or to execute an aggregating function like count(). But I think it would be interesting to know whether writing down all your table column names also delivers faster queries in other DBMSs.
[+] vinger|5 years ago|reply
In theory yes. When you are joining two tables that have the same column names you have to alias them anyways.
[+] mumblemumble|5 years ago|reply
This seems more useful if re-framed as "Why pulling data you won't use is bad for performance."

SELECT * seems like a strawman to me. I don't often see it in the wild anymore. But pulling more columns than you need is extremely common; it happens in every codebase I've ever seen. I habitually do it myself, more or less every time I choose to re-use a single function for retrieving data in several places, because then the function needs to fetch the union of all the columns that all the callers need.

Which may be a reasonable trade-off. There's almost always a need to strike a balance between maintainability and performance. But it's also nice to have occasional reminders to re-assess what you've been doing. This particular practice tends to have a particularly high cost, and one that, depending on how you configure your environments, may be much larger in production than it is in development or CI.

[+] Jestar342|5 years ago|reply
It can be extended upon. I see a lot of ORM-based solutions that pull a full record just to check a single property. E.g.:

    var user = repository.GetUserById(userId);
    return user.IsDisabled;
That's not even a facetious example; I have seen it multiple times. In some cases that query is pulling multiple columns, and a few joins, just to fetch a single bit value.
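
The fix is usually a one-column query (or the ORM's scalar helper). A hedged sketch with sqlite3 and an invented users table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, is_disabled INT, profile BLOB)")
con.execute("INSERT INTO users VALUES (42, 1, zeroblob(100000))")  # bulky row

def is_user_disabled(con, user_id):
    # Ask only for the one bit we need, instead of hydrating the full row
    row = con.execute("SELECT is_disabled FROM users WHERE id = ?",
                      (user_id,)).fetchone()
    return bool(row[0]) if row else None

print(is_user_disabled(con, 42))  # → True
```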
[+] closed|5 years ago|reply
I'm seeing a lot of comments on whether SELECT * is fine or not, but it seems like the bigger issue is that (in general) SQL gives you two very limited ways of selecting columns: explicitly naming each column, or getting all columns. This means that when you are concise, you sometimes end up getting columns you don't need.

For example, R's dbplyr library lets you write queries that select all columns that start_with("something_"). This is revolutionary, because now a person can use the semantics of column names in their selection!

Granted, dbplyr ultimately generates a query that explicitly names the columns, so it's not a within-SQL solution, but I've been surprised at how useful the behavior is!
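The dbplyr trick is easy to emulate by hand: read the column names from the catalog, filter by prefix, and generate the explicit list. A sketch with sqlite3 and made-up names (a real version would also validate identifiers before interpolating them):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    meta_source TEXT, meta_version TEXT, payload TEXT)""")
con.execute("INSERT INTO events VALUES (1, 'api', 'v2', 'big payload')")

def select_starting_with(con, table, prefix):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
    cols = [r[1] for r in con.execute(f"PRAGMA table_info({table})")
            if r[1].startswith(prefix)]
    quoted = ", ".join(f'"{c}"' for c in cols)
    return con.execute(f'SELECT {quoted} FROM "{table}"').fetchall()

print(select_starting_with(con, "events", "meta_"))  # → [('api', 'v2')]
```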

[+] ttraub|5 years ago|reply
If network latency is a significant factor, and server storage and processing resources are sufficient, then the author's example of an 800-column table needs a redesign.

Break that monster into smaller, more manageable tables and architect your database such that you can obtain exactly the data you need with well indexed joins, cached lookups, etc.

[+] jerzyt|5 years ago|reply
800 columns in an OLTP is insane, that was my reaction when I read the OP. But in analytics it's not so unusual, although at that point a SQL DB is probably not the right solution anyway, and the data scientists probably do need all the columns.
[+] giantDinosaur|5 years ago|reply
It selects all columns. Why/when this wouldn't be bad for performance is a more interesting question, no?

(to be clear, I still enjoyed reading this.)

[+] corty|5 years ago|reply
There are CSV-backed "databases" that you can query with SQL. Selecting a row and returning it is basically copying a string once and replacing all CSV-separators by record separators. Doing a projection requires you to copy the string, remove all fields that are not requested, reorder the fields as requested, resulting in a few more copies, and only then transmitting the answer. So in this case it would be slower to not do 'select *'.

However, those things have always been exotic, and with the advent of SQLite have become even less common.

[+] edroche|5 years ago|reply
I've always wondered why SQL didn't have something like an EXCLUDE clause to make this easier for tables with lots of columns where you wanted a large number of them in the result. Something like (table with columns a through z):

SELECT * EXCLUDE g, m, p, x FROM table
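
Lacking such a clause, the same effect can be approximated client-side by expanding * from the catalog and dropping the unwanted names. A sqlite3 sketch with hypothetical columns:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INT, b INT, g INT, m INT)")
con.execute("INSERT INTO t VALUES (1, 2, 3, 4)")

def select_excluding(con, table, excluded):
    # Expand * from the catalog, then drop the excluded names
    cols = [r[1] for r in con.execute(f"PRAGMA table_info({table})")
            if r[1] not in excluded]
    return con.execute(f'SELECT {", ".join(cols)} FROM "{table}"').fetchall()

print(select_excluding(con, "t", {"g", "m"}))  # → [(1, 2)]
```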

[+] nivenkos|5 years ago|reply
And this is even more true when using columnar databases.
[+] damowangcy|5 years ago|reply
Isn't this something obvious?

There's use for the wild card, for instance if you need ad hoc query of data during development or testing.

If your table is not properly indexed, however, you will cause an unnecessary select-all query even if you didn't intend to.

Junior developer is bad for SQL performance but hey, everyone starts there, so there's nothing to be embarrassed about. Just code on!

[+] karmakaze|5 years ago|reply
The title of the post should have been phrased for the audience of ORM users. With ActiveRecord and such, `SELECT *` is the norm rather than the exception. With an ORM it's even worse, as it's even allocating/converting data for the fields. Learn to use `pluck` or your ORM's equivalent.
[+] Semaphor|5 years ago|reply
I feel like this is a weird title. An equivalently bad title would be "SELECTing fewer columns than you need is bad for correctness."

"SELECT *" should be just as performant as "SELECT A, B, C" when your object only has A, B and C as columns.

[+] wyattpeak|5 years ago|reply
It's fine for now, but it makes assumptions about the future state of your database.

The arguments presented in the article may or may not be applicable to your particular application, but I think it's a solid principle that you shouldn't assume your database will retain its schema indefinitely. If you're relying on "SELECT *" to mean "SELECT A, B, C", you're in for a bad time.

[+] jameshart|5 years ago|reply
SELECT * is quicker to transmit as a query, contains fewer tokens to parse, requires no lookups into column metadata to map from the stored data structure to the result structure... I’m pretty sure there must be some circumstances where it outperforms an explicit SELECT a, b, c query.
[+] kaushikt|5 years ago|reply
Perhaps to begin with, having just the A, B, C columns is fine, and your comment is justified.

However, the advice still stands the test of time, when you end up with a lot more columns.

[+] coldtea|5 years ago|reply
That's a non-useful pedantic correction, if I ever saw one...
[+] tinus_hn|5 years ago|reply
And will never have more columns
[+] rubyist5eva|5 years ago|reply
Specifying the columns you want also allows you to create indexes that have the data you need as part of the index.

    SELECT "a" FROM "b" WHERE "c" = 1;

With a composite index on c and a allows you to fetch everything you need without even looking at the table.
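
In SQLite terms the comment above looks like this; EXPLAIN QUERY PLAN reports a covering index, meaning the base table is never touched (table and index names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE b (a INT, c INT, other TEXT)")
con.execute("CREATE INDEX idx_c_a ON b (c, a)")  # composite index covering the query
con.execute("INSERT INTO b VALUES (7, 1, 'unused')")

plan = "\n".join(r[3] for r in con.execute(
    "EXPLAIN QUERY PLAN SELECT a FROM b WHERE c = 1"))
rows = con.execute("SELECT a FROM b WHERE c = 1").fetchall()
print(plan)  # expect something like: SEARCH b USING COVERING INDEX idx_c_a (c=?)
```

Add `other` to the select list and the plan degrades to an index search plus a table lookup.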

[+] jve|5 years ago|reply
And this is SO important. Even if this is the sole reason why * should be considered bad.

In some cases on big databases, adding a covering index is the way to make your application perform well, by eliminating those table lookups.

[+] gandutraveler|5 years ago|reply
Let's use this article so that we all understand that querying more than you need has memory implications and close any future discussions on this as it's already general knowledge.
[+] skeeter2020|5 years ago|reply
There are a lot of reasons NOT to use SELECT * but sending extra data over the wire is not one of them. This is just optimizing for a 2nd (or 3rd or lower) problem when there are much bigger and easier gains to be captured.

If wire-level payload size IS your big concern you should probably be looking at an entire myriad of things beyond your projection, like the actual query, or your protocol or caching or if you should even be using a RDBMS.

Yes, you should be more explicit in your queries, in which case

SELECT A,B,C FROM TBL;

should likely become

SELECT t.[A], t.[B], t.[C] FROM TBL t;

YMMV

[+] markus_zhang|5 years ago|reply
SELECT * FROM table WHERE dt = CURRENT_DATE() LIMIT 100;

Probably the most common code run in my Datagrip instance. I run it so frequently that I created a parameterized version with a shortcut.