The OP message could have better made the point that 50TB databases are common these days, given that a single bare-metal or 24xl I3en or I4* instance on AWS can hold 60TB of raw storage.
it's more than big enough to cause big problems / risk days of downtime to change, yea. 50GB is not big. 50TB is at least touching big - you can do it on one physical machine if needed, but it's the sort of scale that benefits from bigger-system architecture. 50PB would be world-class big, hitting exciting new problems every time they do something.
Around 2005 I took a tour of the [a well known government organization] and they were bragging about several-PB-sized databases at the time. Interestingly, there were a TON of server racks there in a bomb-proof building with tons of security, and they were all IBM servers (a supercomputer maybe?), if I remember correctly. Also, there was one small server rack that was painted differently from the rest (it looked like something made in-house), and we asked what it was, and the tour guide (a PhD computer scientist) said that technically it doesn't exist and he couldn't talk about it, even though it was super cool. Now that I know what they were doing around that time (and probably still today), I am kinda scared at the implications of that tour guide's statement and what that one tiny rack was for. I'm glad I never went to work in their organization, since that tour was meant to recruit some of us a few years down the road.
How are people dealing with databases this large? At work we have a mysql db with a table that has 130M records in it, and a count(*) on the table takes 100 seconds. Anything but a simple lookup by id is almost unworkable. I assumed this was normal because it's too big. But am I missing something here? Are SQL databases capable of actually working fast at 50TB?
count(*) is always going to be slow: the database doesn't store the exact number of live tuples, just an estimate, so it has to do a full table scan. The secret is to use indexes to narrow down to the small slice you care about. If you're filtering on 3 columns, the goal is for each index to eliminate at least half of the rows you don't care about, and so on and so forth.
A 130M record table with no indexes is going to be crazy slow. Although if all you need are primary key updates, then that's the way to go.
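To make the indexing point concrete, here's a minimal sketch using SQLite (the table and column names are made up, and SQLite stands in for mysql/postgres just so the example is runnable); the plan output shows a full scan turning into an index lookup:

```python
import sqlite3

# Hypothetical table, small enough to build in memory for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, kind) VALUES (?, ?)",
    [(i % 1000, "click") for i in range(10_000)],
)

# Without an index on user_id, this filter is a full table scan.
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42"
).fetchone()[-1]

# Adding the index turns the same query into an index lookup.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42"
).fetchone()[-1]

print(before)  # e.g. "SCAN events"
print(after)   # e.g. "SEARCH events USING INDEX idx_events_user (user_id=?)"
```

The same mechanism is why a well-indexed 130M-row table can answer filtered queries in milliseconds while the unindexed version crawls.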
Even at the 130M rows range, you should still be able to take advantage of indexes for fast queries beyond just the primary key. It's been a while since I used mysql, but around 2010 I was working on mysql 5.something and we had several >100M row tables that could still serve indexed queries very quickly (sub ms, or couple ms, iirc). If you are not able to do this, I suggest looking into mysql config and adding/tuning indexes.
But yes, count(*) will be slow. I'm not aware of good workarounds for that other than caching, or using table stats with postgres (if you don't need perfect accuracy) - not sure if mysql supports something similar.
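One common shape of the caching workaround is a counter table kept in sync by triggers. A rough sketch in SQLite (the `row_counts` table and trigger names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT);
CREATE TABLE row_counts (tbl TEXT PRIMARY KEY, n INTEGER NOT NULL);
INSERT INTO row_counts VALUES ('events', 0);

-- Keep the counter in sync on every insert/delete.
CREATE TRIGGER events_ins AFTER INSERT ON events
BEGIN UPDATE row_counts SET n = n + 1 WHERE tbl = 'events'; END;
CREATE TRIGGER events_del AFTER DELETE ON events
BEGIN UPDATE row_counts SET n = n - 1 WHERE tbl = 'events'; END;
""")

conn.executemany("INSERT INTO events (payload) VALUES (?)", [("x",)] * 500)
conn.execute("DELETE FROM events WHERE id <= 100")

# O(1) read instead of a full table scan:
cached = conn.execute("SELECT n FROM row_counts WHERE tbl = 'events'").fetchone()[0]
print(cached)  # 400
```

The trade-off is write amplification: every insert/delete now also touches the counter row, which can become a contention hotspot on busy tables.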
It depends on the queries you run. In postgres we use stuff like materialized views, partial indexes, hyperloglog, and if you are using citusdb (postgres for adults), you can even have columnar tables to accelerate olap stuff.
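A partial index only covers the rows matching its WHERE clause, so it stays small even on a huge table. A sketch of the idea (SQLite also supports partial indexes, so it's used here for a runnable example; the `orders` schema and index name are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (status, total) VALUES (?, ?)",
    [("done", 1.0)] * 9000 + [("pending", 2.0)] * 100,
)

# The index only contains the ~100 'pending' rows, not all 9100.
conn.execute("CREATE INDEX idx_pending ON orders (total) WHERE status = 'pending'")
conn.execute("ANALYZE")

# A query whose WHERE clause implies the index condition can use it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE status = 'pending'"
).fetchone()[-1]
print(plan)
```

This is handy when most queries hit a small hot subset (e.g. unprocessed rows) of an otherwise enormous table.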
Security and incident-response systems ingesting log files from other systems can get big; add in 'must store for $x years' compliance requirements and you might hit some big numbers.
GordonS|4 years ago
If someone here commented they had a 2PB database, I guarantee someone else here would be like "pfft, that's not big"...