The first version of an app I worked on was very SQL heavy. Almost every calculation was done in a stored proc and the app servers just formatted the results.
As the product got popular this became the bottleneck. It's far easier to get more app servers than DB servers.
So we restructured it to do straight index reads and aggregations in the DB, but more complex calculations in the app itself.
It all depends on the circumstances, but I'd still advocate pushing as much into the DB as you can without writing convoluted SQL - your average RDBMS has amazing optimisations for aggregation, sorting and filtering.
But the problem with doing it in the app is that when you are joining huge row sets, all of that unfiltered data gets sent over a relatively slow 100-1000 Mbit link.
The rule of thumb is that DBs are bound on IO, so any calculation that you can get with the same amount of IO as returning the records is likely to be "free", and any calculation that requires more IO is likely to be "expensive" (unless that allows you to avoid IO by restricting returned records, avoiding future queries, etc. - it gets complicated fast).
So things like column-column calculations, simple aggregates, etc. are likely to be good ideas on the DB; for anything else It Depends.
Another fun complication: database stored procedures implemented with non-SQL code. I worked on a project where the client's dedicated SQL ninjas managed to bring the time to process daily extracts from the data warehouse down to 2min from 45min by switching over to CLR (.NET) implementations of the calculations being performed on SQL Server 2005.
IIRC the issue was simply that MSSQL 2005 is just plain slow at doing calculations compared to C# code (the calculations were complex financial models involving large amounts of data and inter-dependencies between that data in calculations, so shipping the raw data to an application was not a viable option).
Isn't that just because of the way systems that make use of DBs are architected? I would think you could trivially make them CPU bound if you start tacking a bushel of FLOPs on to every request.
The big advantages for doing things "on the DB" are centralizing business logic and eliminating serialization/transport/deserialization. Having a view or stored proc with business logic lets it be shared across different components of the same app or even across separate apps that share a database. An extreme (yet common) example of calls best done on the DB itself is a SUM of price X quantity or any other grouping operation. Rather than moving N rows to the client, the server can aggregate it all and just send back the result.
If CPU usage on your DB really is your bottleneck (and seriously it's probably not) then you should look into federating logic out to your app. Otherwise the centralization of app logic alone is worth it.
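To make the price × quantity example concrete, here's a minimal sketch using Python's sqlite3 as a stand-in DB (the schema and numbers are invented): the GROUP BY version ships back one row per group, while the app-side version pulls every row across the wire first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (order_id INTEGER, price REAL, quantity INTEGER)")
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [(1, 9.99, 2), (1, 4.50, 1), (2, 19.99, 3)],
)

# DB-side: the server collapses N rows down to one row per group.
db_totals = dict(conn.execute(
    "SELECT order_id, SUM(price * quantity) FROM order_items GROUP BY order_id"))

# App-side equivalent: every row crosses the wire before we aggregate.
app_totals = {}
for order_id, price, qty in conn.execute("SELECT order_id, price, quantity FROM order_items"):
    app_totals[order_id] = app_totals.get(order_id, 0) + price * qty
```

Same answer either way; the difference is how many rows had to leave the database to get it.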
I'm not saying this is the wrong approach, but you will be moving aspects of the business logic from the application tier to the data tier. If you have written your application correctly, then you should have reusable modular code that does the same thing.
Like I say though, it's not wrong what you are saying. In fact, in terms of latency it's probably for the best.
I've written some pretty awesome SQL queries that give back exactly what the application wants with the application sometimes not even needing to do anything else afterwards.
I've almost always regretted those "awesome" queries.
I've known some very smart developers who've gotten lost in SQL queries; show them the equivalent Ruby/Python/JavaScript/C code that parses through the results and they can understand it in less than a minute.
One approach I've seen that works well for untangling gnarly SQL is to refactor those mega-queries into smaller, component queries that are pieced together by application code. This keeps the processing on the DB server but avoids the pain that comes with programming in SQL. I rather like SQL, but it doesn't handle complexity well at all (e.g. maintaining a "variable" to be used in different parts of the query).
For example, I once worked on a Foursquare clone originally written by a hardcore PostgreSQL nut. This system had a query that, if memory serves, returned a list of places of a certain type within a geographic area along with user activity on those places (votes, comments, etc). This was around a 75 line SQL query that actually wasn't that fast (response times from the DB were roughly 1 second even with every join indexed). We rewrote that query into 4 smaller queries (place IDs within that area, place IDs within that category, hydrating those places from the filtered IDs and then getting the user info), and that cut our DB response by about 70% in addition to making the system easier to work with. This required roughly 10 lines of Java code and a variable - a list of IDs that we got first and passed into each of the other queries. It also freed us up to do other things - for instance, if performance were still a problem, queries 2-4 could have been done asynchronously behind a latch. By lifting the "glue" out of SQL and into a better language, it freed us to do new things, and it freed the database from having to juggle unnecessary complexity while planning and executing its queries.
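A toy reconstruction of that decomposition, with sqlite3 and an invented schema (the real system's tables aren't shown above): component queries produce ID lists, and a few lines of application code act as the glue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE places (id INTEGER PRIMARY KEY, category TEXT, lat REAL, lng REAL, name TEXT);
    CREATE TABLE votes (place_id INTEGER, user_id INTEGER);
""")
conn.executemany("INSERT INTO places VALUES (?,?,?,?,?)", [
    (1, "cafe", 40.7, -74.0, "Joe's"),
    (2, "cafe", 40.8, -74.1, "Beanery"),
    (3, "bar", 40.7, -74.0, "Dive"),
])
conn.executemany("INSERT INTO votes VALUES (?,?)", [(1, 10), (1, 11), (3, 12)])

# Query 1: place IDs inside the bounding box.
in_area = [r[0] for r in conn.execute(
    "SELECT id FROM places WHERE lat BETWEEN 40.6 AND 40.75 AND lng BETWEEN -74.05 AND -73.95")]
# Query 2: place IDs in the category.
in_cat = {r[0] for r in conn.execute("SELECT id FROM places WHERE category = 'cafe'")}

# The "glue" lives in application code: intersect the ID lists.
ids = [i for i in in_area if i in in_cat]

# Query 3: hydrate the filtered places; Query 4: their activity.
ph = ",".join("?" * len(ids))
places = conn.execute(f"SELECT id, name FROM places WHERE id IN ({ph})", ids).fetchall()
vote_counts = dict(conn.execute(
    f"SELECT place_id, COUNT(*) FROM votes WHERE place_id IN ({ph}) GROUP BY place_id", ids))
```

Each component query stays simple enough for the planner (and the next developer) to reason about.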
Odd though it might sound, the opposite is probably true for me (and I assume a bunch of other DB developers) - once you really 'grok' SQL, and as long as it's written in an understandable manner, it can be easier to read than a large block of procedural (and/or functional) code.
One of the big things is that there are no 'for' loops in good SQL code - i.e. it is written using set-based logic and not procedural logic. That can (sometimes) really cut down the amount of reading involved, but takes a while to get used to.
Also depends how well you understand the schema in question (but that compares equally to understanding the code framework/namespace hierarchy).
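The "no for loops" point can be sketched directly, using sqlite3 and a made-up accounts table: the procedural style issues one statement per row, the set-based style expresses the whole change at once, and both land on the same result.

```python
import sqlite3

def setup():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO accounts VALUES (?,?)", [(1, 50.0), (2, 200.0), (3, 300.0)])
    return conn

# Procedural logic: loop over rows, one UPDATE per matching row.
proc = setup()
for acct_id, balance in proc.execute("SELECT id, balance FROM accounts").fetchall():
    if balance > 100:
        proc.execute("UPDATE accounts SET balance = ? WHERE id = ?", (balance * 1.1, acct_id))

# Set-based logic: the same change as a single declarative statement.
setb = setup()
setb.execute("UPDATE accounts SET balance = balance * 1.1 WHERE balance > 100")

proc_rows = proc.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
set_rows = setb.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
```

The set-based version is one line of intent rather than a loop, a condition and a per-row statement.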
Premature optimization is a bigger enemy than performance issues.
That being said, the DB server is what's optimized to do calculations.
The tradeoff is you have a second stack to maintain and performance tune now beyond being a datastore. A positive is you can independently write and run tests.
The question is, can you resist building the perfect empire on day 1? Move stored procs and functions into the DB as they are needed. Whatever you're working on (including who is working on it) isn't that important.
Move things to the DB if the calculation aggregates data (the input is much larger than the output), to avoid materializing data and shipping it around. (Map-reduce is a subtype of this.) Otherwise don't.
If you are using numeric data (not floats) in the database, I would say do your calculations in the database. Different languages and tools have different arithmetic handling which can cause some very long and psychotic debugging sessions. A single point of calculation is a good thing.
[edit] I should explain a bit. If you are in a multi-language environment[1] and are doing financial or weight / volume calculations, be extremely careful if you decide to not do all the calculation on the database. Having results calculate differently in two different places will drive you mad. I have noticed some serious problems with number handling in different languages, and some mistakes in calculation will get you sued.
1) SQL counts as one of the languages
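A quick Python illustration of the kind of divergence meant here: binary floats and decimal types don't agree, so two tiers doing "the same" arithmetic with different numeric types can produce different answers.

```python
from decimal import Decimal

# Binary floats accumulate representation error:
float_total = sum([0.10] * 3)   # 0.30000000000000004

# Decimal arithmetic (what NUMERIC/DECIMAL columns model) stays exact:
dec_total = sum([Decimal("0.10")] * 3, Decimal("0"))
```

If one tier uses the DB's NUMERIC type and another tier uses doubles, penny-level discrepancies like this are exactly the debugging sessions described above.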
If you find yourself returning very large numbers of rows to the client, you might be doing something wrong - SQL databases are at their most efficient when you select only the columns and rows you need.
If you perform aggregation/calculations in the DB, you can potentially save on-the-wire data transfer time (and potentially CPU time on your clients.. though obviously that is shifting the CPU work to the database).
Similarly if you find yourself making multiple trips to the database, and then using loops to combine different data sets, you're probably too far on the 'client-side' and should look at using some joins and combination logic on the DB side to get what you need in a single (and likely more efficient) round-trip.
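For instance (sqlite3, toy schema), the "multiple trips plus a loop" pattern versus a single join returning the combined result:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO authors VALUES (?,?)", [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO posts VALUES (?,?,?)",
                 [(1, 1, "Hello"), (2, 1, "Again"), (3, 2, "Hi")])

# Too far client-side: one trip per post to look up its author (N+1 queries).
n_plus_1 = []
for post_id, author_id, title in conn.execute(
        "SELECT id, author_id, title FROM posts ORDER BY id").fetchall():
    (name,) = conn.execute("SELECT name FROM authors WHERE id = ?", (author_id,)).fetchone()
    n_plus_1.append((title, name))

# One round-trip: let the DB do the combination with a join.
joined = conn.execute("""
    SELECT p.title, a.name FROM posts p JOIN authors a ON a.id = p.author_id ORDER BY p.id
""").fetchall()
```

Same data either way, but the join is one round-trip instead of N+1.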
The top answer on the page says as much but the answer to pretty much any "Which is a faster way to do this, ABC or XYZ?" is "try it both ways, measure the results (adjust for load if necessary), and see". Doesn't matter if you're talking different pure code algorithms to achieve the same result, stored procs vs app code, network caching or not, etc. If doing it the absolute fastest way matters to you, try multiple approaches and measure the results. Even in situations where there are accepted "best practices" there may be variables specific to your own project/tech stack that cause your results to be surprising!
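In that spirit, a throwaway harness like this (sqlite3 and timeit here, purely illustrative) is often all it takes to settle the question for your own stack:

```python
import sqlite3
import timeit

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(float(i),) for i in range(10000)])

def db_side():
    # the DB scans and sums; one row comes back
    return conn.execute("SELECT SUM(x) FROM t").fetchone()[0]

def app_side():
    # every row comes back; the app sums
    return sum(x for (x,) in conn.execute("SELECT x FROM t"))

t_db = timeit.timeit(db_side, number=50)
t_app = timeit.timeit(app_side, number=50)
print(f"db: {t_db:.4f}s  app: {t_app:.4f}s")  # measure, don't guess
```

Run it against your real schema, data volumes and network, not a toy in-memory table, before trusting the numbers.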
... annnnd then the database technology changes, and your benchmark results of three years ago are now actively hosing you. Or you have to change databases but you can't because you embedded a lot of logic in your database layer that can't be easily ported (somewhere, a salesman for your current database product is crying tears of joy and shouting his cash-register shout, "Ka-ching! Customer lock-in! Ka-CHING baby!". Avoid these people).
Unless you have really compelling reasons to get snuggly with a particular vendor's technology, be conservative.
[This still applies if you're using a "free" database engine; you're just not paying MS or Oracle or whomever, and it's "just" your own time]
Basically, "it depends". Having dealt with extremely DB-intensive applications, I have developed a personal motto of "be nice to the DB".
Let the DB be a secure storage of your data, not a calculating part of your application. But like the top answer says, sometimes it is not practical to do a calculation within the application. In my case, we had a few database servers set aside just for reporting, so we could slam them with difficult queries and not worry about affecting data.
I have seen SQL queries that were 30 pages long if printed out. The developer showing me this was looking for some other examples that he said were up to 100 pages long. These queries had multiple levels of correlated subqueries and tons of decode statements (Oracle syntax). There looked to be tons of duplication just in what I saw.
How are you going to write unit tests for that stuff? Refactor? etc. etc.
The examples always start out simple, like summing a bunch of rows that match a predicate, but once you start down that path it is hard to rewrite it as application code once the SQL grows too complex.
Also most databases are 20+ year old technologies and often have weird systems in place for storing the code in the db or something else just as odd. No more grep, no more static code analysis.
As far as I am concerned the db is a pile of facts or observations. I tell the db something and later it tells me what I told it. When I am thinking about what goes in the db I think about using the past perfect verb tense. On this day such and such happened. That's it. Preferably that never changes; you might get new info in the future, so just record that new info along with everything else.
Ideally we should be getting to a point where resources are so cheap that CRUD can become CR - no more updates or deletes, just new facts.
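A sketch of that append-only style with sqlite3 (invented facts table): state is never updated in place, it's just a query over the recorded facts.

```python
import sqlite3

# Append-only "pile of facts": never UPDATE or DELETE, only INSERT new
# observations; current state is derived by a query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deposits (account TEXT, happened_on TEXT, amount REAL)")
facts = [
    ("alice", "2013-01-01", 100.0),
    ("alice", "2013-01-05", -30.0),   # a withdrawal is just another fact
    ("bob",   "2013-01-02", 50.0),
]
conn.executemany("INSERT INTO deposits VALUES (?,?,?)", facts)

# "What is the balance?" is a question about the facts, not a stored cell.
balances = dict(conn.execute(
    "SELECT account, SUM(amount) FROM deposits GROUP BY account"))
```

Nothing is ever overwritten, so the full history stays queryable for free.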
The existence of terribly written queries proves nothing. Because there's also plenty of terribly written code that takes 100 pages to do something simple.
In both cases it comes down to people blindly grasping when they don't understand the fundamentals.
Relational logic can be extremely elegant, composable, and testable. Unfortunately SQL is a pretty awful interface to expose those ideas, and most attempts to wrap SQL in a better interface make the mistake of trying to pretend to be object-oriented, when they should really let their true relational nature shine through.
As usual "it depends" but yes, this is imho the most important aspect (provided reasonable performance and sustainability are carefully taken into consideration).
The older I get, the more tired I get of developers writing in-house apps that are never going to have more than 10 concurrent users, but they architect as if they are going to have 1000+ concurrent users, regardless of the additional cost or complexity....which is how relatively simple projects end up costing $100k+++ and becoming maintenance nightmares, and why simple change requests are often rejected because they would be "too complex".
I default to relying on SQL. I've found that helps ensure calculations have a single source of truth, which helps maintenance and reliable sharing throughout your app, or across multiple apps.
SQL also offers powerful aggregate functions to assist. Much simpler to use something like AVG() or SUM() in a SQL query than having to worry about deriving the same calcs in application code.
It comes down to performance and sometimes what your hardware constraints are e.g. your DB is a faster machine than your web server and you have no say in the matter.
You may also benefit from precalculating stuff in the DB and storing it. I wrote this 6 years ago which illustrates the point: http://markmaunder.com/2007/07/20/how-to-create-a-zip-code-d...
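The precalculate-and-store idea, sketched with sqlite3 and an invented page-view schema: aggregate once in a batch step, then serve reads from the small summary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (page TEXT, viewed_at TEXT);
    CREATE TABLE daily_counts (page TEXT, day TEXT, views INTEGER);
""")
conn.executemany("INSERT INTO page_views VALUES (?,?)", [
    ("/home", "2013-05-01"), ("/home", "2013-05-01"), ("/about", "2013-05-01"),
])

# Batch job: aggregate once and store the result.
conn.execute("""
    INSERT INTO daily_counts
    SELECT page, viewed_at AS day, COUNT(*) FROM page_views GROUP BY page, viewed_at
""")

# Reads now hit the small precomputed table instead of scanning raw rows.
counts = dict(conn.execute("SELECT page, views FROM daily_counts"))
```

The trade-off is staleness: the summary is only as fresh as the last batch run.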
Performance wise? Maybe writing hand-tuned assembly code will be faster. Or using CUDA. Don't optimize performance where you don't have to. Are small calculations the long pole in your app that talks to a db? Almost certainly not. Pick a method that maximizes robustness, ease of understanding, servicing, openness to feature changes, and accountability. In my experience that typically means client side code, not queries or sprocs.
I would always lean on the database to perform simple sums and aggregations unless I was really concerned about turning the database into a performance bottleneck.
If you need to add simple logic above and beyond this, stored procedures aren't sexy but they can be a good compromise that avoids shipping data around and re-implementing SQL in your application server.
It is interesting that nobody has mentioned testability, as in unit tests. Some people (such as Thoughtworks) advocate removing all business logic from the db server into app code so that it can be unit tested.
I buy that argument to some degree but in practice I am a SQL junkie and always implement calculations in SQL.
Isn't there a size past which it's always good to do it on the DB? Or to use buzzwords, "At some points, shouldn't most Big Data calculations be done on the database?"
Benchmarking multiple queries / approaches is generally worthwhile if performance is important.
The one that matters to me the most is developer time.
Write a damn SQL query. If it's too slow or the DB becomes a bottleneck, then reconsider.
http://download.red-gate.com/HelpPDF/DatabaseUnitTestingWith...
P.S. I'm not affiliated with them in any way, I just love their products.