
Why we built a 40TB photo server in-house instead of using S3

94 points | benhoyt | 14 years ago | tech.oyster.com

74 comments

[+] ivan78|14 years ago|reply
"The one most valuable asset at Oyster.com is our photo collection. ... In strict accordance with KISS methodology, we opted against LTO and S3, and decided to build a big BOX."

I can only imagine how scared they will be each time they need to install updates or reboot THE BOX. They will eventually decide to build an identical BOX and mirror their data on a daily basis. Then they will notice that mirroring such big volumes of data wastes too much in system resources, and start evaluating in-house distributed storage solutions such as OpenStack Swift. Then they will notice it is way too overcomplicated and finally decide to migrate their data to Amazon S3.

I'm writing it as a person who walked the same path over the last few years. :-)

[+] mbell|14 years ago|reply
This is honestly pretty scary. There are a lot of single points of failure in this solution.

1) Single Box

2) Single Location

3) Single 40TB RAID 6 Array on single RAID card with 22 Drives (assuming 24 2TB drives, 2 parity, 2 hot spare = 40TB)

4) Single bonded network link means single switch, no redundancy against switch / network device failure
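The 40TB figure in (3) is consistent with that assumed layout; a quick sanity check of the arithmetic (drive count, drive size, and spare count are all this comment's assumptions, not stated in the article):

```shell
# Assumed layout: 24 x 2TB drives, RAID 6 (2 parity) plus 2 hot spares.
drives=24; size_tb=2; parity=2; spares=2
usable=$(( (drives - parity - spares) * size_tb ))
echo "${usable}TB usable"   # prints: 40TB usable
```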

Honestly, this may have been cheap, but you're getting what you're paying for: an unreliable backup solution.

Also: Norco? Really? I wouldn't be trusting anything they produce with your company's critical data. Supermicro isn't much more expensive all things considered.

[+] dspillett|14 years ago|reply
Warning, the sound of a broken record coming up...

> An unreliable backup solution.

Nope. RAID is not a backup solution. It provides redundancy so the array can survive an event like a device failure (and so the data survives as a consequence) without significant downtime for repair (with zero downtime if you have hot-swap hardware) but it does not, and is not intended to, protect the data from the huge list of other things that can affect it.

RAID is redundancy for reliability purposes, not backup purposes.

[+] jodrellblank|14 years ago|reply
> 4) Single bonded network link means single switch, no redundancy against switch / network device failure

Maybe, but not necessarily. Could be a pair of stacked switches (making one logical switch), with the bonded link having one port on each physical switch.
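That setup is a standard Linux bonding arrangement; a rough sketch of what it could look like (interface names and the use of 802.3ad/LACP are illustrative assumptions, and the stack members would need a matching cross-switch LAG configured):

```shell
# Hypothetical bonded link with one NIC cabled to each member of a
# two-switch stack, so either switch can fail without dropping the link.
modprobe bonding
ip link add bond0 type bond mode 802.3ad   # LACP aggregation
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
ip link set bond0 up
```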

[+] PaulHoule|14 years ago|reply
S3 is more expensive than cheap storage, but S3 provides a level of data protection that you'd have a hard time duplicating at any cost.
[+] moe|14 years ago|reply
"The first challenge in putting together the big box was getting internal SAS connectors properly seated into the backplane adaptor sockets"

Excuse me?

As a customer I'd feel slightly uncomfortable about my data by now. You make it sound like stuffing 24 disks into a box is rocket science to you, all the while SuperMicro and others sell plug&play chassis for up to 45 disks[1].

Also, you didn't mention it in the post, but you do have at least two of these in two physically distant racks, right?

[edit: deleted snarky comment about people running windows on a fileserver]

[1] http://www.supermicro.com/storage/

[+] wx77|14 years ago|reply
At least for this website it isn't your data, it's their data.

They are backing up their curated photos (which seems to pretty much be the whole business as they say in the article).

It appears they are using Akamai to serve their images, so hopefully there is some extra redundancy there already.

Other than that, it seems like it would be a steal to use S3 as a backup system at this point, because from this article it looks like they need to hire another employee to tell them why this backup solution seems a bit silly (I mean, having trouble setting up the hardware).

[+] smerritt|14 years ago|reply
We had something similar but smaller (~8 TB) at a place I worked, and it was a nightmare. Migrating from that to S3 was one of the best things to happen to that project.

Being a single big box, it had a bunch of single points of failure, and boy did they fail; we probably had 5-10 hours per month of downtime due to the photo server falling over (flaky RAID controller firmware, mostly).

Also, since the big box was expensive, we only had one in production. There was code for taking a newly-uploaded photo and copying it over to the photo server that only executed in production, which meant the only way to functionally test it was to ship it and hope.

We switched to S3 about a year ago with different buckets for prod, staging, and dev; the production-only code paths went away, and there hasn't been any photo-related downtime since. Definitely worth it.

[+] vilda|14 years ago|reply
This sounds like you had a really bad implementation. A proper file server of this small size would not fail for several hours per month.
[+] rhizome|14 years ago|reply
> flaky RAID controller firmware

Fortunately, this can usually be rectified with a simple application of money. Good hardware is its own reward.

However, it sounds like the problem that S3 cured was caused by a bad architecture.

[+] rdl|14 years ago|reply
For a 24 drive server, I'd just get a heavily discounted dell or HP box. A startup should be able to pay half list, buy two, and be ahead vs. s3.

Supermicro chassis are a big improvement over doing your own wiring. The Areca controllers, especially in RAID 6, are great. For RAID 5 I'd also look at 3ware.

For single GigE, you can get away with eSATA expanders, building something like the Backblaze pod. I've done that kind of thing for personal use, and to have an onsite mirror of something, but I'd want several, in several different colo facilities, to compare with S3. The exception is if you need some kind of scratch storage, but even refilling a 40TB archive with downloadable content takes a really long time over a 1Gbps link.

I'd build a few boxes like this now, but the Thai floods pushed hard drive prices up to the point I have to wait. Hopefully fast 4-5TB drives will be $200 by summer 2012.

[+] jwatte|14 years ago|reply
S3 gives you multi-host, multi-region redundancy. Putting it all in one box is asking for trouble. What if the raid controller grows a bug and corrupts on write? It's happened. What if there's a fire in the building that has both your server and your back-up? Eggs, meet basket!
[+] throwaway64|14 years ago|reply
You could easily achieve double, triple, or even quadruple co-located redundancy for far less than what S3 would charge for 40TB.

Before anyone replies about needing a "24/7 sysadmin", if you are running stuff at this scale:

A) You already have a sysadmin, or somebody competent enough to do the setup/maintenance work. Running off S3 doesn't magically mean you never need to deal with sysadmin issues.

B) Rack providers will swap out hardware if you put in a ticket, so you can provide multi-regional redundancy for much, much cheaper than Amazon would charge you ($60,000 x 2 for two regions gets expensive fast).

[+] JoeAltmaier|14 years ago|reply
S3 fails too, right? Didn't they have an internal network issue this year, and go down for hours?

Anything you haven't tried, doesn't work. That's a truism in computing. I don't think Amazon tries failing-over entire data centers very often (have they ever?), so when it needed to happen, it didn't work.

Anyway, I'm thinking this guy has only to back up his photo store about once a day to (something big) and put it in his bank box, and he's good to go, at least for a photo site.

[+] vl|14 years ago|reply
I believe S3 is multi-zone, but not multi-region.
[+] jackowayed|14 years ago|reply
> In strict accordance with KISS methodology

Buying a ton of parts, carefully assembling them, and having it be your problem when something breaks is simpler than paying Amazon to solve the problem nearly perfectly?

[+] cbs|14 years ago|reply
If you know what you're doing, yeah. Running a fileserver should be pretty damn simple for anyone who styles themselves a "hacker", but you're talking like maintaining file servers is some sort of black magic for which the simple solution is to give up and turn to an outsourced solution where you have little control and no visibility.

It's not. The hard part of storage is understanding the capabilities and limitations of your storage and, more importantly, how those fit in with your computational needs. You have to do that at Amazon or in-house. S3 just makes it easy to ignore evaluating their service because you're never actually forced to. Maintaining servers is butter.

[+] rorrr|14 years ago|reply
Better than going bankrupt.
[+] blrgeek|14 years ago|reply
After reading that I'm afraid they're going to have downtime because of a 'shoddy backplane connection' sometime soon. Or that it's going to fall out of its 'delicate balance' or have a 'driver conflict' soon :(

I wonder if they subconsciously undervalue their photos, or if this is just a naive 'we can do better' moment.

I'm all for building your own, but there's a good reason enterprise server hardware costs more, and there's a good reason to go for enterprise hardware when you say 'The one most valuable asset at Oyster.com is our photo collection'.

For instance a Dell or HP storage server with 24 disks would be around 18K list - and be really engineered for that as opposed to hacked together.

[+] fbuilesv|14 years ago|reply
> For starters, 40TB on S3 costs around $60,000 annually. The components to build the Box — about 1/10th of that

I wonder why no one ever factors the cost of having a knowledgeable person handling the system into their calculations. TBH 40TB doesn't sound like much, but once you start growing you'll want someone familiar enough with the subject to take care of it (especially if it's their most valuable asset).
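The $60,000 figure is roughly reproducible as a back-of-envelope check, assuming a flat ~$0.125/GB-month rate (an approximation of S3's 2011-era tiered storage pricing; request and transfer fees excluded):

```shell
# Rough annual S3 storage cost for 40TB at an assumed $0.125/GB-month.
gb=$(( 40 * 1024 ))
awk -v gb="$gb" 'BEGIN { printf "$%d/year\n", gb * 0.125 * 12 }'
# prints: $61440/year
```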

[+] buff-a|14 years ago|reply
It's just not that hard, people! Assuming he built the box for $6,000 as claimed (unfortunate timing given hard-drive prices), that's $54,000 of someone's time before it becomes a loss. I've got 7TB sitting here, and in the last two years I've had three failures (all Seagate, incidentally) for a total consumption of my time of 45 minutes and zero downtime. Maybe 10 hours (erring on the high side) to set the things up.

If I needed this much space in a start-up I'd totally do it myself - it would be a good use of my time. No, my concern with this setup is that it looks (to me) like a "w00t I got $6000 so let's build a 733t boxen!" It looks cool and fragile, instead of boring and robust.

UPDATE: To be fair, I'm not hammering those drives continuously. How much does that change the equation? Even if it goes from 45 minutes every two years to 45 minutes per day, you'd still be net positive. Basically, you can't prevent drive failures. But if you can keep the failures restricted to the ones that just require popping in a new drive, then your "operator time" is minimal. It's when you lose an array that you're in trouble. Again: do not use one big box!

[+] donw|14 years ago|reply
If your business depends on handling large amounts of data, you want that person on-staff anyway.
[+] aaronjg|14 years ago|reply
Amazon S3 price also takes into account having redundancy at three data centers. So you should multiply the cost by 3, just for that. Of course if you don't need the redundancy, you can build it for cheaper.

Also for some applications it really does make sense to move away from S3, and have a solution in house. For example The Broad Institute has about 6 petabytes of storage [1]. They in particular benefit from local storage, since all of their data is generated on-site. However, even at this scale, they don't build the boxes themselves [2].

[1] http://www.genome.gov/27538886

[2] http://www.isilon.com/press-release/isilon-iq-powers-data-st...

[+] WettowelReactor|14 years ago|reply
Well, first of all, S3 is not the only option. Hell, even going mid-tier with an established SAN installed at a cheap colo is way cheaper than S3 and leaps ahead of their solution.
[+] bretr|14 years ago|reply
the author updated with this comment:

http://tech.oyster.com/how-to-build-a-40tb-file-server/?#com...

"Didn’t mean to give the impression that this is the only backup, it is not. It is the “warmest” one — first line. It does not share the same physical location with the primary storage box either, but is less than 2 miles away. So we can easily have a 2 hour recovery time without throwing away those astronomical monthly service fees. (Although many technologists will always prefer paying big bucks for the comfort of being cushioned from every angle by SLAs and such — nothing wrong with that, just a different approach.)

As far as TCO goes, it cannot get any lower since we already have one or two system guys handling all servers, as well as office workstations, etc.. This backup box takes up such a small fraction of their time that its almost negligible — several thousand annually at most. Same goes for power, etc — it is just one of many servers.

The disks are all Enterprise Class 2TB SATA-II, several different models. We were purchasing them right after the monsoon floods in Thailand constricted supply so our choices were somewhat limited as time was a factor.

Raid6 has come a long way since it’s early inception days, but is still a trade-off between raw storage capacity and processor utilization. HW RAID industry is now old enough to not have to wait for new products to mature as we used to when the technology itself was in its infancy. Old habits certainly die hard, but getting the “latest and greatest” was a conscious choice made for this specific problem, not submission to some immature fascination with “elite” new products, or however that may be.. This card has the best specs for Raid6 currently on the market — bottom line, period.

Big Kudos to all who made suggestions and participate in the discussion, keep it coming!"

[+] wmf|14 years ago|reply
From my experience building storage, you're better off buying an enclosure that has expanders (e.g. Supermicro); it really simplifies cabling.
[+] rhizome|14 years ago|reply
Supermicro is just such a fantastic company. Love their stuff.
[+] nodesocket|14 years ago|reply
Static image storage makes the most sense to do on S3 or similar. Building your own storage does not provide the redundancy and reliability of S3. Additionally, you have the flexibility to enable CloudFront and distribute the images via CDN if you need.
[+] meroliph|14 years ago|reply
Building your own storage can provide the same redundancy and reliability. You can still use a CDN as well.
[+] 16s|14 years ago|reply
A bit off-topic, but I wonder if anyone can comment on using software RAID rather than hardware RAID? I'd like to try it. I've been bitten by buggy hardware RAID controllers far too often (even high-dollar name brand gear). I know that all of the free *nix systems offer software RAID, I'm just curious how they perform.
[+] mbell|14 years ago|reply
I can't comment on industrial usage, but for use at home (work and non-work use) I have the following setup for storage:

My old desktop hardware (Intel Q6600, 8GB RAM, Asus MB, pair of gigabit links bonded)

Supermicro 4u tower case with 8x hot swap bays + the 5.25 bay filled with a 5x hot swap cage, 13 total hot swap bays.

3ware 9550SX RAID controller, 4 x WD RE 320GB drives, RAID 5.

8 x 1.5TB "Green" Drives, mixture of WD and Samsung drives, RAID 6 using mdadm (linux software raid)

1 x WD Raptor (system drive)

Ubuntu with KVM for virtualization

Originally I was in the "must have RAID controller" camp, which is when I bought the 3ware controller and the RE series drives. When that array filled up I did some research and decided to just go the mdadm route, and I have minimal complaints so far. I still use the 3ware array for "critical data" and back it up to S3.

Monitoring: You have to work a bit more to get proper alerting of issues from mdadm, but it's not hard to set up. It doesn't matter much for me; I sit in the same room as this server most of the day working, so if something goes wrong I generally notice before I get the e-mail.
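For anyone curious, the mdadm alerting being described typically boils down to something like this (the e-mail address is a placeholder):

```shell
# /etc/mdadm/mdadm.conf needs a destination for alerts:
#   MAILADDR admin@example.com
# Then run mdadm's monitor, which watches all arrays and mails on
# events like DegradedArray or Fail; --delay is the poll interval.
mdadm --monitor --scan --daemonise --delay=300
```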

Performance: As mentioned, I'm using Green drives; this system wasn't built with speed in mind but rather for large amounts of nearline storage. Nevertheless, with some basic tweaking and making sure the array's partition alignment is correct, I get around 450MB/sec read speed and ~85MB/sec sustained write. I have the system set up to cache writes aggressively, since most of the data on this array isn't critical and it's on a UPS; what this means is that writes under a few gigabytes usually complete at wire speed (~200MB/sec) and then get flushed to disk later. Most of the time I'm limited by network bandwidth to this system, unless I'm writing a very large amount of data all at once.

One negative is rebuild speed; here I'm very limited by the 'Green' drives, I believe. It runs at about 50MB/sec, so rebuilds do take a while.

As far as CPU usage goes, I've never seen it be the limiter, but I haven't watched it that closely; it doesn't peg during rebuild. This machine acts as an SMB/NFS file server and runs a few development VMs 24/7 (database, and a couple other things), and I've never really had an issue with CPU usage.

One really nice bonus is that if something in the system fails, you can just plug the drives of your array into almost any other Linux system:

apt-get install mdadm

mdadm --assemble --scan

Poof, working array.

tl;dr

If you're going after 1GB/sec transfer speeds, get a high-end RAID card.

If you just need some large redundant storage that can saturate a Gbit link or 2, then mdadm software RAID is just fine IMO.

[+] cagenut|14 years ago|reply
Yes S3 can get expensive, but imho this swings the pendulum too far in the other direction. Something like a riak cluster of four 2U/24-drive servers would get you the cost structure of good colo but the features/resilience/operational-flexibility of something more like s3.
[+] dfrankow|14 years ago|reply
How much more expensive is a clustered software solution (e.g., Hadoop FS) than this RAID box?
[+] Jugglernaut|14 years ago|reply
That depends on how much hardware you want to spread it out over. Many companies would have to hire a Hadoop guy; then again, some companies would have to hire a sysadmin to run the Oyster solution. Build your storage to fit your company.
[+] papercruncher|14 years ago|reply
How are you dealing with bit rot? Are you periodically scrubbing the data to give the controller a chance to repair, or are you waiting to get a URE during an array rebuild? Are you running end-to-end checksums against all your data to protect against bad firmware, bad RAM, etc.? What is your mean time to repair in case you lose a drive?

One more question: you saturated the network link with a sequential read/write, but is that how you actually store the data? If not, how long would it take you to be up and running on another CDN in case Akamai goes down in flames?
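On the Linux software-RAID setups discussed elsewhere in this thread, the periodic scrub papercruncher is asking about is a one-liner (the device name /dev/md0 is an assumption; a hardware controller would use its own vendor tool instead):

```shell
# Kick off a scrub: the md driver reads every sector of every member
# and repairs from parity where it finds a mismatch or read error.
echo check > /sys/block/md0/md/sync_action
# Watch progress, then inspect the mismatch count once it finishes.
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt
```

Distributions commonly run exactly this from a monthly cron job.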

[+] zorg|14 years ago|reply
"Having spent some extra time on research, fine-tuning, and optimizing the new server, we were glad to find that the gigabit network had became the bottleneck"

This matches my experience: RAID performance in linear IO is an order of magnitude below what the disks should allow. This guy is relieved to finally get one gigabit of useful bandwidth out of 24 disks, each of which can do roughly a gigabit on its own. So it's no faster than a single disk.

(I know it's linear IO in this case because the screenshot shows a large file copy.)

[+] tomkarlo|14 years ago|reply
This seems like a half-solution - I can understand building a big box locally for day-to-day access to the images (if they indeed need that) but I didn't see any mention of an off-site backup. Even assuming no problems like file corruption, what happens if there's a fire or flood? At the least, there should be two copies of this big box, in different places.
[+] forensic|14 years ago|reply
Off site backup conveniently left out of the write up?
[+] latchkey|14 years ago|reply
Hey everyone, don't worry, be happy! Their VP of engineering used to work at 'a startup' and prior to that, he was a 'rocket scientist' because he worked at Raytheon on missile guidance systems. Oh, and before that, he worked at the Mothership, I mean Microsoft, in the 'user experience team'. As an added bonus, if you want to work at Oyster, you get your choice of such cutting edge technologies as 'Python, PostgreSQL, Nginx, Windows, CentOS, C++ and more'!

Nothing to worry about here, I think we are in good hands.