
My manager spent $1M on a backup server that I never used

293 points | mooreds | 3 years ago | blog.dijit.sh

234 comments

[+] wongarsu|3 years ago|reply
Wait, so Ubisoft spent $1M on a company wide backup system that apparently worked great, it's just that this sysadmin wanted incremental backups and the $1M system wasn't built to support that (blazingly fast ingest, slow online reads). So the sysadmin had to fight a lot to get dedicated hardware to do incremental backups, and eventually got it. I'm not sure Ubisoft is in the wrong at any point here?

Everything else mentioned makes me not want to work at Ubisoft, and roughly matches what I would have expected, just worse. The disregard for developers, the absurd NIH syndrome, etc. But I don't get the headline of the article

[+] PragmaticPulp|3 years ago|reply
Agreed. This reads like the $1 million system did what the company needed it to do (safely archive code to prevent more loss of old games) but it didn’t do exactly what this developer wanted.

There are various good points scattered in the article for the author’s specific use case, but it’s written as if the entire company was mistaken to not make this decision revolve around this one developer.

[+] kubb|3 years ago|reply
My manager doesn't do almost anything. I wonder why they keep him around. He has like two reports and we're both fully self-directed. He can't understand technical issues, and whenever he proposes something, it's completely untenable due to his lack of understanding of what's useful or possible.
[+] NikolaNovak|3 years ago|reply
It is definitely possible they indeed don't do anything.

Fwiw though - a decade ago when I was a sysadmin I had a manager that I was certain never did anything.

And then he was replaced.

And then... we realized how much politics, uncertainty, churn, screaming, changing requirements, ambiguous priorities and other crap he protected us from :-/

Not saying it's the case with your manager. But managers have duties, roles and priorities I for one hadn't always appreciated.

[+] bitlad|3 years ago|reply
At least your manager doesn't do anything. I have seen managers who do stuff and ruin the work that people have done.
[+] nicoburns|3 years ago|reply
If you think your manager doesn't do anything he's probably one of the good ones. Possibly not one of the great ones. But still, often one of the best things those who manage can do is get out of the way and allow their reports to do their jobs properly.
[+] MathMonkeyMan|3 years ago|reply
I'm beginning to think that making more money by becoming a better developer is hard, and that managing, while difficult in a totally different way, is less hard. At least for the money.
[+] lordnacho|3 years ago|reply
Chesterton's Manager?
[+] roncesvalles|3 years ago|reply
He's around so that your skip has to manage 1 person instead of 2.
[+] swyx|3 years ago|reply
well look, I already told you, I deal with the customers so the engineers don't have to. I have people skills. I'm good at dealing with people, can't you understand that? What the hell is wrong with you people?
[+] jmull|3 years ago|reply
If your work is getting done, things aren't falling apart, and he leaves you alone, then he's doing a great job, whatever he does. At the very least his presence will keep other managers from meddling.
[+] toast0|3 years ago|reply
As a fully self-directed engineer, what I'm looking for from a manager is benign neglect.
[+] optimalsolver|3 years ago|reply
"When you do things right, people won't be sure you've done anything at all."

- Futurama

[+] Gigachad|3 years ago|reply
My manager isn’t a programmer but uses ChatGPT to educate himself on a topic and attempts to read the code to see what’s happening before asking one of us if anything is still unclear.
[+] anuragvohraec|3 years ago|reply
Generally managers are hired to act as a buffer between higher-order management and employees. HOM don't want to micromanage employees, but they do want responsibility, which lower-rank employees (the ones who do the actual work) tend to carry the least. Most of the workforce would get defensive and insist they work responsibly, but indeed most don't. So they hire this manager to micromanage employees on HOM's behalf. The minute most employees get a good offer, they'll fly off like it's none of their responsibility, so HOM hire this buffer manager to handle this small but important aspect of successful project delivery.
[+] telotortium|3 years ago|reply
Are they planning on hiring more reports?
[+] drums8787|3 years ago|reply
My manager is blissfully absent.
[+] jiggawatts|3 years ago|reply
This rant starts off pretty badly, with "Windows bad, Unix good, Windows especially bad because it doesn't have Unix tools".

Reverse that statement. Is Unix bad because it doesn't have Windows tools?

> but simultaneously there was nothing to lean back on: no shell, no Unix tools like sed/awk, no SSH.

He joined in 2014, when Windows had a superior shell and remoting system that eliminated the need for low-level string-parsing tools like sed/awk entirely!

PowerShell + WinRM + Desired State Configuration (DSC) was all pretty mature at the time, and I had used these technologies to manage huge fleets of servers solo without issues.

Also, almost none of these heroic database backup efforts would have been required if they had just used a commercial database product, e.g. one designed for Windows such as SQL Server.

In 2014 it supported AlwaysOn Availability groups that allowed multiple synchronous and asynchronous replicas! It also has had true online backups since before 1997.

His second mistake is that backups can be incremental, but a restore in a (true) disaster is a full restore by definition! Differential + Log backups are great to capture data regularly during the day, but the business requirement is usually that a full restore must complete in a reasonable time.
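The full-vs-incremental distinction above can be sketched as a chain-resolution problem: a restore always starts from a full backup, then layers the newest differential and the subsequent log backups on top. A minimal illustration (the `Backup` type and timestamps are invented here, not any vendor's API):

```python
from dataclasses import dataclass

@dataclass
class Backup:
    kind: str   # "full", "diff", or "log"
    time: int   # taken-at timestamp

def restore_chain(backups, target_time):
    """Return the backups needed to restore to target_time.

    A (true) disaster restore starts from a full backup, then applies
    the newest differential after it, then every log backup after that,
    up to the target time.
    """
    fulls = [b for b in backups if b.kind == "full" and b.time <= target_time]
    if not fulls:
        raise ValueError("no full backup available before target time")
    full = max(fulls, key=lambda b: b.time)

    diffs = [b for b in backups
             if b.kind == "diff" and full.time < b.time <= target_time]
    chain = [full]
    start = full.time
    if diffs:
        diff = max(diffs, key=lambda b: b.time)
        chain.append(diff)
        start = diff.time

    logs = [b for b in backups if b.kind == "log" and start < b.time <= target_time]
    chain.extend(sorted(logs, key=lambda b: b.time))
    return chain
```

The point jiggawatts makes falls out of the structure: however cheap the incremental captures are, restore time is dominated by replaying the whole chain from the last full backup.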

Typical commercial databases, plus the product he mentioned, could easily back up and restore tens of terabytes in setups I've seen, directly to or from any proper database engine, without having to go through "NFS" on the way. Typically you'd use a native backup agent.

This whole article sounds like a self-important Unix engineer refusing to touch commercial products, with a lack of understanding of business requirements and an allergy to Windows.

[+] hnlmorg|3 years ago|reply
I respectfully disagree. Having been an admin for both (I started out as a Windows developer and NT4 sysadmin, so you can’t say I wasn’t experienced), I’ve always found UNIX vastly easier for managing servers.

Yes, Bash has a lot of warts (so much so that I wrote my own shell) but Powershell creates lots of new warts of its own.

As for automation, Windows doesn’t even come close to UNIX-like systems for ease of scripting and automation. There’s just no contest.

Bash might be ugly at times but at least its ugliness is consistent. I’ve lost count of the number of times I’ve found parameters parsed differently between applications (because Windows passes parameters as a single string rather than an array of arguments like POSIX systems do), or that so-called headless installers still have GUI prompts and/or spin off a non-blocking process so it’s challenging to figure out when they’re complete. Or PowerShell routines come back with an unexpected type, breaking your entire pipeline. Or don’t even support basic help flags, so you cannot discover how to use the damn routine to begin with. Or the utterly ridiculous numeric error codes that MS returns rather than descriptive error messages. Or its over-engineered approach to logging that makes quick inspections of the system’s state far more involved than it should be. Or even just managing simple things like software updates requires whole other expensive 3rd-party solutions because the application and OS package management situation is fundamentally broken at its foundations …I could go on. But suffice to say there are so many painful edge cases to consider with Windows.

Windows does get some things right though: RDP is a great protocol and its backwards-compatibility story is second to none. But for server administration, Windows feels like a toy OS compared to most UNIX-like systems. As in, it has support for pretty much anything you’d want to do with modern server-management processes, yet everything it does support, it supports in a really awkward and immature (technologically speaking) way.

I’m sure it’ll get there though. But likely not before I retire.

[+] sgt|3 years ago|reply
What are you talking about? Windows was always terrible for an admin. Unix was and is vastly superior when it comes to remote admin, and I've used both (PowerShell is powerful, yes, but it's nowhere near as fluid as a Unix shell).
[+] merpkz|3 years ago|reply
I know people will now argue into oblivion about PowerShell vs Bash, but I actually wonder why this person was hired in the first place to manage Windows-based systems, which clearly seem not to be his domain of expertise?
[+] throw009|3 years ago|reply
>He joined in 2014, when Windows had a superior shell and remoting system that eliminates the requirement for low-level string parsing tools like sed/awk entirely!

One man's feature is another man's bug.

[+] Severian|3 years ago|reply
I can comment directly on this as I work in the backup sector. First hand knowledge, yada yada yada.

Dell EMC DataDomains do have good ingest performance, you can typically throw hundreds of streams at them and they'll greedily gulp it down.

And it is true that they are dog slow at restoring data. The reason? They are deduplicating appliances: you have to rehydrate the data, and this can take a very long time depending on the number of blocks needed that constitute the backup. Some blocks may be shared between hundreds if not thousands of backups. Dell isn't one to talk about their downsides much.

They are best used not as _primary_ storage, but secondary (think 3-2-1 backup rule). You should have some fast nearline storage available for recent backups that require really low RTO (recovery time objective).

Depending on the application used for backup, it may not have had native PostgreSQL WAL processing, and may have had to be image-based, which slows down the process and requires some additional scripting.
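The ingest/restore asymmetry described above can be shown with a toy deduplicating store: writes of already-seen chunks cost only a reference, while a restore must rehydrate every chunk back into a full stream. This is purely illustrative and bears no relation to DataDomain's actual internals:

```python
import hashlib

class DedupStore:
    def __init__(self):
        self.chunks = {}    # fingerprint -> chunk bytes
        self.backups = {}   # backup name -> ordered list of fingerprints

    def ingest(self, name, data, chunk_size=4):
        recipe = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            # Duplicate chunks cost only a reference: hence fast ingest.
            self.chunks.setdefault(fp, chunk)
            recipe.append(fp)
        self.backups[name] = recipe

    def rehydrate(self, name):
        # Every chunk must be fetched and reassembled: hence slow restore,
        # especially when chunks are shared across thousands of backups.
        return b"".join(self.chunks[fp] for fp in self.backups[name])
```

Two backups sharing a prefix store the shared chunks once, which is exactly why these boxes gulp down hundreds of ingest streams but crawl when asked to read a backup back out.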

[+] dijit|3 years ago|reply
That's pretty cool; the DataDomain appliance is actually really cool and I definitely appreciate its existence.

I definitely feel like we were using it wrong and I’m not entirely convinced it was my fault.

In my ideal scenario I would have had a week's worth of point-in-time backups on a machine in the rack, and replicated the contents to the DD after verification.

Sadly I was denied that “in-rack” solution and was sold the datadomain solution (without it being named) as if it was simply a remote disk, not a fancy appliance.

The main point I tried (and failed) to convey in the article is that a solution can be brilliant and expensive, but that doesn't mean it's what you need for the job at hand.

[+] rodgerd|3 years ago|reply
> They are best used not as _primary_ storage, but secondary (think 3-2-1 backup rule).

Yeah, these are for archives, not service restoration/disaster recovery scenarios.

[+] JoeAltmaier|3 years ago|reply
Our remote manager bought us an expensive HP floor-mount server like a dishwasher. It was over spec'd and underpowered. And we didn't need a server, not for work.

My colleague loaded up his entire ripped CD collection and we got Bluetooth headphones and had it serve up tunes all day. Back when that was a thing. Only thing it ever did.

[+] whydoineedthis|3 years ago|reply
In a way, he's a genius. That probably boosted your overall productivity by 5% for a fraction of the cost.
[+] anyfoo|3 years ago|reply
> and begin real investigation you will quickly find that many databases that are popular are totally fine losing data. MongoDB being the most famous example that I can think off of the top of my head.

Always pisses me off. University teaches the principles of ACID and how hard databases work to adhere to these principles, and then so called "NoSQL" comes along and says "lol we have eventual consistency".

[+] e40|3 years ago|reply
It is very frustrating that many of the NoSQL databases aren't ACID. Some even say they are, but they aren't. Frustrating to me because we have one that is, and it's a lot of work; it's also harder to win benchmarks against the ones that fake it or aren't ACID at all.
[+] bombcar|3 years ago|reply
The heat death of the universe is eventually consistent.
[+] lmm|3 years ago|reply
> Always pisses me off. University teaches the principles of ACID and how hard databases work to adhere to these principles, and then so called "NoSQL" comes along and says "lol we have eventual consistency".

99% of the organisations that pride themselves on using "real databases" and ACID aren't actually using those guarantees or gaining anything out of them. Transactions are inherently useless in a web service, for example, because you can't open the transaction on the client, so the part of the system where the vast majority of consistency issues happen (the client <-> server communication) will always be outside the transaction boundary. ACID fanboys love to talk about how all the big name internet companies are built on RDBMSes, and fail to mention that most of them were built on MySQL 3 which never actually had working ACID in the first place.

If MongoDB had come first and SQL/ACID RDBMSes had come after, we'd recognise them for what they are: a grossly overengineered non-solution to the wrong problem.

[+] madrox|3 years ago|reply
In my observation, the game industry and tech industry are like humans and apes: they have the same ancestor, but at some point in our ancient past we diverged. You rarely see crossover between the two these days, and I doubt we'll see any in another decade.

What I think that means is there's probably a huge opportunity selling five year old tech ideas to the gaming industry

[+] daneel_w|3 years ago|reply
The implied notion that the manager really had no idea what they actually wanted, or at the very least were unable to sensibly describe the requirements, is entirely believable, but from a systems development perspective nothing in this article makes any sense.
[+] neilv|3 years ago|reply
This could be one of those big-corporate situations:

1. Someone pays a lot of money for something.

2. It turns out to be the wrong thing (either for new needs, or because they didn't consult or listen to the people who could've told them that in time).

3. Someone wants to avoid the political backlash of admitting they bought the wrong thing.

4. The org incurs much more costs than actually fixing the situation would cost, as people have to use the wrong thing and/or can't use the right thing. (Costs from lost productivity, lower quality/uptime, damaged morale, etc.)

(But what I really want to know is... Why did "The Division 2" not build upon "The Division", but instead make you start a new character, and then be more of a selling-brightly-colored hats game? Also, TD1's gritty survival mode was the most compelling gameplay of the franchise, IMHO.)

[+] binarymax|3 years ago|reply
Not sure if it’s just me but I get an invalid SSL cert warning and can’t load the page (DDG mobile browser)
[+] shoo|3 years ago|reply
This kind of thing seems to happen in very large organisations that have accreted many production IT systems, especially once an org realises it has half-a-dozen different services in production that are performing an apparently equivalent job. What an opportunity for architecture simplification! Why run half a dozen different variants of apparently the same thing when we could consolidate on one standard approach, and reap a bunch of cost savings and reductions in operational complexity.

Enterprise architecture can commission a project to review options and select the one true enterprise backup solution. After sufficient peer review / diligent testing / bribery from vendors, an enterprise backup solution is chosen, wrangled into production, and then inflicted upon the org.

Let's be pragmatic. For existing production systems, it may not make sense to migrate them over to the new enterprise backup solution. Much risk and cost, limited payoff. Perhaps it's simplest to just keep them running their existing legacy backup solution until the entire production system that depends on it is finally decommissioned, which is planned for only 3 years away. We'll definitely decommission it after 3 years.

But for any new teams trying to deliver a new service into production. Are you storing data? Well, you must have a backup strategy before you can go to prod, and further, your backup design must align with the enterprise backup strategy -- you must integrate with the enterprise backup solution for your design to be stamped approved by architecture, unless there is an exception granted by appeal and/or ritual sacrifice to your manager's manager's manager's manager's manager.

[+] tagyro|3 years ago|reply
Reading this, one might think that this happens only at X or Y company, but in my experience, this is a lot more common.

One example: at a previous job, we spent $1M and almost 1 year to build a (new version of a) process (as in, a website) that served less than 50 customers. The project was launched to great fanfare internally (talks, videos, the whole package) but created no value, even those 50 customers didn't transition to it, as the old one worked fine.

The same job where my manager wrote in my review that I'm too negative and only point out problems.

I have so many of these stories, some funny, others more tragic - companies going under and people losing their jobs and going into depression because a manager was stubborn and incompetent - that I think I should write a book, but then I realise I'm not that special and my stories are pretty boring.

[+] znpy|3 years ago|reply
The author makes a good progression in their reasoning but then somehow arrives at the wrong conclusion:

> So when someone says that Amazon has invested a lot of money into security I think about the fact that Ubisoft spent $1M on a backup solution that didn’t work for the game that would have had the best use of it.

The missing point here is that Ubisoft is/was spending money on something outside of its domain of expertise, and managed to make a suboptimal decision. Amazon, however, is spending money on two of its core domains of expertise: building and managing datacenters, and building and operating web services.

Not to mention, AWS is very clear and explicit about what is their responsibility and what's yours (security of the cloud vs security in the cloud).

Sad to see such a nice post ending like that.

[+] ilyt|3 years ago|reply
> The games industry is weird: It simultaneously lags behind the rest of the tech industry by half-a-decade in some areas and yet it can be years ahead in others.

I'd love to see examples of "ahead" coz I literally never saw it. Game dev studios act like automated testing is a new thing...

> PostgresSQL performed much better and had the additional benefit of being able to cleanly split write-ahead logs (which are largely sequential) and data to separate RAID devices. Something that MySQL doesn’t really support and would have to be hacked in using Symlinks on every table create.

It's funny that you can guess this happened a good few years ago purely because the solution to the above wasn't just "slap an NVMe on it"

>I tested this and ordered the storage I would need to have a rolling 90 day window of backups (with older backups being taken off-site).

>The hardware request was rejected.

> When I inquired as to why, I was told that Ubisoft has a standard backup solution which is replicated globally and sent to cold storage in a bank vault somewhere in Paris. I was told this is because we had lost some source code once upon a time and we could no longer build certain games because of that.

My first thought was "why not just ship pg backups + wal logs there" and just a full backup from time to time. No read-back needed.

The whole problem seems to be "the admin over-engineered a solution without knowing the constraints, then tried to headbutt the constraints instead of changing approach".

> Our EMC DataDomain system was optimised primarily for ingesting huge volumes of traffic, but if we want incremental backups then perhaps we needed a something a little more dynamic.

Nope, your incremental approach sucked. The majority of software that I've seen used to do incremental backups could generate incrementals on its own, without any readback from the server.

The minority *was software that wholly managed backups*, so it was prepared for that, and often it was just "read metadata from database" instead of actual data.

> I don’t know what else to take away from this.

That people use tapes for backups? To get your requirements before implementing a solution? To find a better solution once the requirements are known? To not get stuck on your initial idea?

Literally the basic, recommended way of doing PostgreSQL backups with no extra software would work.
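For reference, PostgreSQL's built-in continuous archiving works exactly in this write-only fashion: `archive_command` is invoked once per finished WAL segment and only has to copy it out, never read anything back. A minimal Python sketch of such an archiver, honouring the documented contract that an existing, different file must not be silently overwritten (paths here are illustrative):

```python
import filecmp
import os
import shutil

def archive_wal(segment_path, archive_dir):
    """Copy one finished WAL segment into the archive.

    Mirrors the contract of PostgreSQL's archive_command: report success
    (True) only once the segment is durably archived, be idempotent for
    retries of the same segment, and never overwrite a different file
    that happens to share the name.
    """
    os.makedirs(archive_dir, exist_ok=True)
    dest = os.path.join(archive_dir, os.path.basename(segment_path))
    if os.path.exists(dest):
        # Retry of the same segment is fine; a different file is an error.
        return filecmp.cmp(segment_path, dest, shallow=False)
    tmp = dest + ".part"
    shutil.copy2(segment_path, tmp)
    os.replace(tmp, dest)  # atomic rename: no partial segments in the archive
    return True
```

A base backup from time to time plus a stream of these segments is the whole scheme; the slow-read appliance never needs to be read from until an actual restore.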

[+] jtchang|3 years ago|reply
How does one force exponential backoff of clients without taking in any load on themselves?

Is it some sort of TCP window congestion control futzing? But even that wouldn't work since the client can do whatever the heck it wants.

[+] dijit|3 years ago|reply
Hi, I’m the author of the article and I’m horrified by how many spelling mistakes I made. I must have written this in a rage.

To answer your question: since we had control of the clients (it's a game), we used a proof-of-work challenge on the TLS handshake which increased in complexity the more failed attempts you gave us.

Very cheap on the server, very expensive on the client, which effectively rate limited connection attempts.
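A hashcash-style sketch of that scheme: verifying a solution costs the server one hash, while finding one costs the client exponentially more work as the required difficulty rises with failed attempts. The difficulty schedule and parameters below are invented for illustration, not Ubisoft's actual protocol:

```python
import hashlib
from itertools import count

def difficulty(failed_attempts, base=8, step=2, cap=24):
    # More failed attempts -> more leading zero bits required of the client.
    return min(base + step * failed_attempts, cap)

def verify(challenge, nonce, bits):
    """Cheap for the server: one hash plus a prefix check."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - bits) == 0

def solve(challenge, bits):
    """Expensive for the client: brute-force a nonce (~2^bits hashes)."""
    for nonce in count():
        if verify(challenge, nonce, bits):
            return nonce
```

Each extra bit of difficulty doubles the client's expected work, which is what turns the challenge into a rate limit on connection attempts.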

[+] guidedlight|3 years ago|reply
I worked for a university that spent about a million dollars on data storage for a research project. They estimated that they required about a petabyte of storage, so a big expensive storage array was purchased.

In the end the project used just a few terabytes of storage because they miscalculated their requirements.

It was a bit of a running joke that they spent a million dollars, when a $150 SSD would have been fine. They ended up repurposing the storage array (which is when I got involved).

Luckily everyone has now moved to cloud, and life is better.

[+] walrus01|3 years ago|reply
> Ubisoft had built an organisation optimised for treating developers like fools

I think this whole write up really says more about how video game developers treat entry and mid-level career developers/programmers/graphic artists.

Because there seems to be an inexhaustible supply of young idealistic naive persons who will take any salary offered and work for 80 hours a week, because they've been offered a job at a big name video game company.

Film/VFX industry not much different.

[+] gambiting|3 years ago|reply
>>Because there seems to be an inexhaustible supply of young idealistic naive persons who will take any salary offered and work for 80 hours a week

As someone who also works at ubisoft(and I have actually worked with the author), I want to point out that for all the failings of Ubisoft, our work life culture is top notch and there is an incredibly strong focus on avoiding overtime**. It's drilled into our heads constantly that it's NOT normal to work more than the contracted 7.5h a day, if you as much as send someone an email at 8pm someone will talk to you to make sure you aren't working late(and will tell you to avoid doing that in the future, because it makes it look like reading and sending work emails at 8pm is a normal thing). In my 9 years here I have worked overtime only a very small handful of times, usually around launches of our project - for a week or two. I haven't logged more than my standard 37.5h/week on any project for literally years now. I manage juniors now and I wouldn't let them work more than that even if they wanted to.

** at least in the studio where I work and the studios I have interacted with - Ubisoft has 40(?) studios across the world and I cannot possibly comment on every single one of them. But it does certainly seem to be the company policy worldwide.

[+] TheRealPomax|3 years ago|reply
Good old "we spent $1m on it" fallacies. "Exactly, we spent a million on a backup solution that's objectively bad, and we spent 100 times that on the actual code we're trying to secure. So: do we want to have wasted a trivial $1m by redoing the backup part, or do we want to have wasted $100m because of a terrible pretend-backup solution? Because this should be a business no-brainer".
[+] nitwit005|3 years ago|reply
> Data Consistency as a Requirement

Probably only sort of consistent, I imagine? Game servers tend not to support seamless failover. If the server crashes at the wrong moment, data is going to be lost, regardless of how politely behaved the data backend is.

That is, if you kill a boss and successfully pick up an item, you know it'll be saved to DB. If the game crashes before you can pick it up, it's probably just gone.

[+] ilyt|3 years ago|reply
> Probably only sort of consistent, I imagine? Game servers tend not to support seamless failover. If the server crashes at the wrong moment, data is going to be lost, regardless of how politely behaved the data backend is.

You could just have a let's say "user profile service" where all of the transactions about user profile (items, XP, etc.) go, on top of internal game server data.

That way the important stuff could be sent immediately, like "epic or above item drops", and everything else either in a batched update (XP, achievements etc.), periodically (stuff like a player rearranging inventory), or at the end of the session.
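A minimal sketch of that split, where critical profile events flush to the remote service immediately and routine ones batch until a threshold (event names and thresholds are invented for illustration):

```python
class ProfileSync:
    CRITICAL = {"epic_item_drop", "boss_kill"}

    def __init__(self, send, batch_size=10):
        self.send = send          # callable that ships a list of events remotely
        self.batch_size = batch_size
        self.pending = []

    def record(self, event, payload):
        if event in self.CRITICAL:
            # Important loot must never be lost to a server crash.
            self.send([(event, payload)])
        else:
            self.pending.append((event, payload))
            if len(self.pending) >= self.batch_size:
                self.flush()

    def flush(self):
        # Called on batch-full, session end, or by a crash collector.
        if self.pending:
            self.send(self.pending)
            self.pending = []
```

The transactional "user profile service" only has to guarantee durability for the small critical stream; everything else tolerates being replayed from the last batch.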

[+] dijit|3 years ago|reply
the way we had it working was local storage (sqlite) and remote storage (postgresql with a middleware).

If you pick something up, it flags as an "interesting event" and syncs your profile to local storage. On game server crash, there's a crash collector which syncs the local database with the remote one and removes the lock, meaning you can join a working server.

The crash collector also did things like… well, collect the crash dump and send it for debugging.

But that's roughly how it worked.