The example shown has 14GB of data, which is absolutely tiny, yet it can still rack up $40k in costs per day if used incorrectly (i.e. suboptimally, not even egregiously wrong). Why would you use this?
Scanned 6.67GB and processed in 2.41s, at a cost of $40,340.16 / 86,400 = $0.4669 per query. That's more than the hourly cost of an AWS on-demand a1.4xlarge with 16 vCPU and 32GB of RAM.
Can’t think of a worse advertisement for their product.
This headline is egregiously misleading clickbait, since it is based on an absurd and self-contradictory hypothetical:
"If my product had 10,000 daily active users generating 100 events per day, I’d easily hit that size in less than a year ... If I built an application to visualize this result and refreshed it every second, I would spend $40,340.16/day on Tinybird’s Pro-plan"
There is no plausible business sense in refreshing the display of a year's worth of data every second, and even scaling back to a likely still-unreasonably-frequent refresh rate of once per hour you're down to just $11/day.
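For what it's worth, the arithmetic checks out; a quick sketch using the figures quoted above (the $40,340.16/day and the per-second refresh rate are from the article, the rest is just division):

```python
# Sanity check of the quoted figures: $40,340.16/day at one query per second.
daily_cost_at_1s = 40_340.16
queries_per_day = 24 * 60 * 60          # 86,400 queries at one per second

cost_per_query = daily_cost_at_1s / queries_per_day
print(f"${cost_per_query:.4f} per query")            # $0.4669, as quoted upthread

daily_cost_hourly_refresh = cost_per_query * 24      # only 24 queries/day
print(f"${daily_cost_hourly_refresh:.2f}/day at an hourly refresh")  # $11.21
```

So the scary number is almost entirely a function of the refresh rate, not the data.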
If you want to stop out-of-control spend, have your analysts learn SQL. Avoid database systems that charge per query (like Tinybird), and if you do use one, make damn certain your people know SQL. Ignore, and preferably fire, people who mention things like "data lakes".
1000% this
Almost every tool I've seen that is supposed to "avoid data analysts having to learn SQL" is far more costly, requires engineering staffing, performs much slower, and sometimes requires the analyst to instead learn some esoteric, less powerful language ...
If your job is to slice & dice data all day and you can't be bothered to learn SQL, I don't know what to say.
I think it's worth mentioning that Tinybird doesn't necessarily charge per query; the billing model seems more geared towards API usage, though queries are the underpinning.
Also, it's not unreasonable to see people spending $10k+ a day in Snowflake because of bad practices just like this.
I heard that at a certain Professional Social Network company, some of their analysts would slam their DB cluster with inefficient queries until it rebooted, and the first one to rerun their query would finally get a result.
Materialized views are damn near magic for solving slow-query issues in tools that don't need real-time results (e.g. daily reporting). They essentially act as a cache of a query's result at a given point in time that you're able to refresh whenever you want.
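A minimal sketch of that cache-of-a-query idea, in Python with sqlite3 for portability. SQLite has no native materialized views, so a plain table stands in for Postgres's CREATE MATERIALIZED VIEW / REFRESH MATERIALIZED VIEW; the table and column names here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

def refresh_report(conn):
    # Recompute the expensive aggregate once; readers hit the cached table.
    conn.execute("DROP TABLE IF EXISTS daily_report")
    conn.execute("""CREATE TABLE daily_report AS
                    SELECT user_id, SUM(amount) AS total
                    FROM events GROUP BY user_id""")

refresh_report(conn)  # run this once per day instead of once per dashboard load
rows = conn.execute(
    "SELECT user_id, total FROM daily_report ORDER BY user_id").fetchall()
print(rows)  # [(1, 15.0), (2, 7.5)]
```

Every dashboard read then costs a trivial scan of the small cached table, not a re-aggregation of the raw events.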
> Its only significant weakness now is in materialized views, with their lack of incremental refresh.
> That work towards incrementally updated views is happening and progressing. For now, it's a separate extension, though: https://github.com/sraoss/pg_ivm.
I'm really curious how people use these types of platforms in practice. It seems really easy to screw yourself over when you're "experimenting". The advice seems really obvious, but I'm sure I'd absent-mindedly run a super naive query at some point and potentially cost my employer a lot of money.
I worked on a system to capture production test data at a semiconductor company. We had trillions of rows and terabytes of data. While we were figuring stuff out, I'm sure I accidentally ran queries that scanned the entire dataset. I imagine one of those queries would have cost at least $1k to run under this pricing model. Our entire setup cost less than $10k a month on AWS regardless of how many queries we ran. I can't imagine spending $40k on 14GB of data no matter what you were doing.
Yeah, I mean, I look at this example and it's obviously pretty "extreme" - nobody is running that query every second (at least I hope not) - but I think the principle is important. Run that query once a day and it's still costing you 1,000x what it would if it were optimized.
> If I built an application to visualize this result and refreshed it every second, I would spend $40,340.16/day on Tinybird’s Pro-plan.
What? Even if I used SQLite on my laptop and queried this thing every second, I'd still spend <$3 a day. Also, does this platform have no concept of caching? I don't understand this post at all: total clickbait based on an inefficiency in your platform that you really shouldn't be advertising.
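The caching wouldn't even need to live in the platform; a short-TTL cache in front of the query gets you a per-second dashboard without a per-second query. A hypothetical sketch (class and parameter names are made up):

```python
import time

class CachedQuery:
    """Serve a cached query result until its TTL expires, then recompute."""

    def __init__(self, run_query, ttl_seconds):
        self.run_query = run_query
        self.ttl = ttl_seconds
        self.value = None
        self.expires_at = 0.0
        self.query_count = 0  # how many times the real query actually ran

    def get(self):
        now = time.monotonic()
        if now >= self.expires_at:
            self.value = self.run_query()   # the expensive part
            self.query_count += 1
            self.expires_at = now + self.ttl
        return self.value

cache = CachedQuery(run_query=lambda: "expensive result", ttl_seconds=3600)
for _ in range(1000):     # a thousand page refreshes...
    cache.get()
print(cache.query_count)  # ...but the underlying query ran only once
```

With an hourly TTL, the per-second dashboard in the article would issue 24 real queries a day instead of 86,400.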
I remember when I studied databases in college, the query planner took care of what to do first: joins, filters, etc. (I remember doing exercises with paper and pen where I was given a query and a few tables and had to build the optimal execution tree.)
So this product is not only expensive, but I have to think about the execution plan myself? Or am I wrong, and modern DBs don't do that?
So I read the full article and it's interesting. Maybe the title is a little misleading, but when you actually read it I think they make it pretty clear that it's an extreme example designed to prove a simple point: bad data practice has a cost. For somebody who isn't super data savvy but wants to get into the space, I actually found it helpful.
I try never to underestimate the potential for someone to do something really stupid, and I'm sure there are some egregious examples out there where a DB was set up and run such that outrageous charges resulted; but has anyone seen a real-life situation anywhere close to this kind of example?
Even if there was an instance where a poorly designed and implemented data set caused a $40k charge for a single day, I wonder how long it would take for the bean counters to notice and take action?
I'm sorry, but if your egress costs are $40k/day I think it's time to consider leasing your own pipe. $1.2 million per month will get you one hell of an internet connection.
I just skimmed through the article tbh, but I saw 14GB and what looked like a predicate-pushdown optimization. I think DuckDB could handle that on my 16GB Mac.
Lost me at refreshing every second. Probably only needs updating a couple of times per day. This is why you drill into requirements rather than blindly accepting them without question.
Isn't this all extremely elementary, common knowledge for someone designing these workflows? Next up: a blog post about how not to waste money by simply not doing something extremely poorly optimized.
rozenmd|3 years ago
Learn SQL ffs.
nodejsthrowaway|3 years ago
Oh my god. I'm currently at $13k/mo in DynamoDB costs because of this, whereas an SQL database meeting the same requirements costs $2k/mo.
kiernanmcgowan|3 years ago
https://www.postgresql.org/docs/current/rules-materializedvi...
leeoniya|3 years ago
https://news.ycombinator.com/item?id=32098603
johnthescott|3 years ago
In my experience, materialized views are critical for most large databases.
blandcoffee|3 years ago
Query 1 -> join Table A with Table B, both have 1M records
Query 2 -> Filter Table A to 10k records, then join Table A with Table B (1M records)
I would expect Query 2 to execute faster; I don't think the exec plan would have optimized Query 1 equivalently.
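Whether the planner pushes that filter below the join is engine-specific, and the honest answer is to ask the engine rather than guess. A toy sqlite3 sketch (made-up tables) showing that the two formulations return identical results, plus EXPLAIN QUERY PLAN to inspect what the planner actually does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("CREATE TABLE b (a_id INTEGER, payload TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?)",
                 [(1, "us"), (2, "eu"), (3, "us")])
conn.executemany("INSERT INTO b VALUES (?, ?)",
                 [(1, "x"), (2, "y"), (3, "z")])

# "Query 1": join first, filter in the outer WHERE
q1 = conn.execute("""SELECT a.id, b.payload FROM a
                     JOIN b ON b.a_id = a.id
                     WHERE a.region = 'us'""").fetchall()

# "Query 2": filter table a in a subquery, then join
q2 = conn.execute("""SELECT f.id, b.payload
                     FROM (SELECT id FROM a WHERE region = 'us') AS f
                     JOIN b ON b.a_id = f.id""").fetchall()

print(sorted(q1) == sorted(q2))  # True: same result set either way

# Ask the planner what it actually does (output varies by engine/version;
# in Postgres you'd use EXPLAIN ANALYZE instead).
for row in conn.execute("""EXPLAIN QUERY PLAN
                           SELECT a.id, b.payload FROM a
                           JOIN b ON b.a_id = a.id
                           WHERE a.region = 'us'"""):
    print(row)
```

Semantically the two are equivalent; the open question is only whether a given planner applies the filter before the join, which the plan output answers directly.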
a_c|3 years ago
Learn from history; specifically, SQL. Or: people who think they're too important to learn SQL aren't that important after all.