top | item 29925820

paozac | 4 years ago

50TB is not so big these days. I read that in 2008 (!) Yahoo had a 2+ PB PG database. What is the largest you know of, 14 years later?

GordonS|4 years ago

50TB is big. Bigger is possible I'm sure, but I'd guess 99.something% of all PG databases are less than 50TB.

If someone here commented they had a 2PB database, I guarantee someone else here would be like "pfft, that's not big"...

mritun|4 years ago

The OP could have put it better: 50TB databases are common these days, when a single bare-metal machine or a 24xlarge i3en or i4* instance on AWS can hold 60TB of raw storage.

Groxx|4 years ago

it's more than big enough to cause big problems / risk days of downtime to change, yea. 50GB is not big. 50TB is at least touching big - you can do it on one physical machine if needed, but it's the sort of scale that benefits from bigger-system architecture. 50PB would be world-class big, hitting exciting new problems every time they do something.

aantix|4 years ago

At 50TB, if you were doing full-text search, wouldn't the entirety of the index have to be held in memory?

ok_dad|4 years ago

Around 2005 I took a tour of [a well known government organization] and they were bragging about several-PB-sized databases at the time. Interestingly, there were a ton of server racks there in a bomb-proof building with tons of security, and they were all IBM servers (a supercomputer, maybe?), if I remember correctly.

There was also one small server rack painted differently from the rest (it looked like something made in-house), and when we asked what it was, the tour guide (a PhD computer scientist) said that technically it didn't exist and he couldn't talk about it, even though it was super cool. Now that I know what they were doing around that time (and probably still today), I'm kinda scared of the implications of that tour guide's statement and what that one tiny rack was for. I'm glad I never went to work for that organization, since the tour was meant to recruit some of us a few years down the road.

barrkel|4 years ago

This comment contains no information other than an ego boost for yourself, AFAICT.

golergka|4 years ago

Was that a three letter US government agency?

Gigachad|4 years ago

How are people dealing with databases this large? At work we have a MySQL DB with a table that has 130M records in it, and a count(*) on that table takes 100 seconds. Anything but a simple lookup by id is almost unworkable. I assumed this was normal because it's too big. But am I missing something here? Are SQL databases capable of actually working fast at 50TB?

Something1234|4 years ago

count(*) is always going to be slow. They don't store the number of live tuples, just an estimate, so it's a full table scan. The secret is to use indexes to narrow things down to the small slice you care about: if you're filtering on 3 columns, the goal is for the index to wipe out at least half the results you don't care about, and so on and so forth.

A 130M record table with no indexes is going to be crazy slow. Although if all you need are primary key updates, then that's the way to go.
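A minimal sketch of the indexing point above, using an in-memory SQLite database (the table and column names here are made up for illustration): an index turns a filtered query from a full table scan into a narrow index search, while a bare count(*) still has to walk everything.

```python
import sqlite3

# Hypothetical table standing in for the 130M-row case (scaled down).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT)"
)
conn.executemany(
    "INSERT INTO events (user_id, kind) VALUES (?, ?)",
    [(i % 1000, "click" if i % 2 else "view") for i in range(10_000)],
)

# No index on user_id yet: the planner has to scan the whole table.
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42"
).fetchall()
print(before[0][3])  # a scan of events

# Add the index and the same query becomes an index search.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42"
).fetchall()
print(after[0][3])  # a search using idx_events_user
```

The same mechanics apply in MySQL and Postgres (via EXPLAIN); only the plan output format differs.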

lukeasrodgers|4 years ago

Even at the 130M rows range, you should still be able to take advantage of indexes for fast queries beyond just the primary key. It's been a while since I used mysql, but around 2010 I was working on mysql 5.something and we had several >100M row tables that could still serve indexed queries very quickly (sub ms, or couple ms, iirc). If you are not able to do this, I suggest looking into mysql config and adding/tuning indexes. But yes count(*) will be slow, I'm not aware of good workarounds for that other than caching or using table stats with postgres (if you don't need perfect accuracy) - not sure if mysql supports similar.

itsthecourier|4 years ago

It depends on the queries you run. In Postgres we use things like materialized views, partial indexes, and HyperLogLog, and if you're using citusdb (Postgres for adults), you can even have columnar tables to accelerate OLAP stuff.
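Of those, partial indexes are the easiest to sketch. Here's a hedged example using SQLite (Postgres uses the same syntax; the table and column names are invented): the index only covers rows matching its WHERE clause, so it stays tiny even on a huge table.

```python
import sqlite3

# Hypothetical orders table where only ~1% of rows are 'open'.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (status, total) VALUES (?, ?)",
    [("open" if i % 100 == 0 else "closed", float(i)) for i in range(10_000)],
)

# Partial index: only the rare 'open' rows are indexed, so the index
# holds ~100 entries instead of 10,000.
conn.execute("CREATE INDEX idx_open ON orders (status) WHERE status = 'open'")
conn.execute("ANALYZE")

# A query whose WHERE clause implies the index predicate can use it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE status = 'open'"
).fetchall()
print(plan[0][3])
```

On a 50TB table, keeping indexes restricted to the hot slice of data like this is one of the ways these sizes stay workable.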

hsbauauvhabzb|4 years ago

Security and incident response systems ingesting log files from other systems can get big; add in "must store for $x years" compliance requirements and you can hit some big numbers.

ddorian43|4 years ago

Was it a single server?