
Stack Overflow: How We Do Deployment

428 points | Nick-Craver | 10 years ago | nickcraver.com

140 comments

[+] taurath|10 years ago|reply
How people manage git and source control tells you a lot about a company's culture. They said that most commits go directly onto master, and that this works for them, which indicates:

- Good rapid communication about who is working where. People are generally not touching the same code, or else you'd run into frequent collisions (solvable via rebasing, of course, but I'd suspect they would do more branching if collisions happened very frequently)

- The developers are given autonomy and have assumed some level of mastery over whatever their domain is. Trust in each developer's ability to commit well-formed and considered code.

- They have a comprehensive test stack, which helps verify the above points and keep things sane

[+] adontz|10 years ago|reply
I've always wondered whether there is some way of doing database deployment that doesn't suck. From what I've seen, every professional team ends up with a bunch of 074-add-column-to-table.sql files. Code deployment, by contrast, can be organized much better: you can do graceful restarts, transactional updates, etc. Notably, almost nobody backs up the old code version in case deployment of the new one fails, yet database upgrades are so fragile that backups are a must, not only because the upgrade process may be interrupted and leave the database in an inconsistent state, but because even a properly executed upgrade may result in corrupted state.
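For what it's worth, the numbered-.sql-files pattern persists because the core runner is so simple. A minimal sketch in Python (using SQLite and made-up file names; real tools like Rails migrations or Flyway elaborate on exactly this loop): apply each file once, in sorted order, recording what ran in a tracking table.

```python
import pathlib
import sqlite3

def apply_migrations(db_path, migrations_dir):
    """Apply numbered .sql files in sorted order, once each.

    A tracking table records what has already run, so re-running the
    tool is a no-op. This sketch assumes one statement per file.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    for path in sorted(pathlib.Path(migrations_dir).glob("*.sql")):
        if path.name in applied:
            continue
        with conn:  # apply the file, then record it; the context manager commits
            conn.execute(path.read_text())
            conn.execute(
                "INSERT INTO schema_migrations (name) VALUES (?)", (path.name,)
            )
    return conn
```

The fragility the parent describes lives in the gap this sketch papers over: SQLite DDL here autocommits, so a half-applied multi-statement file leaves you in neither the old nor the new state, which is exactly why backups are non-negotiable.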
[+] rusanu|10 years ago|reply
There is a fundamental difference between code deployment and DB upgrades: DB upgrades must preserve (and upgrade) the existing state, while code deployment only needs to ensure that the new code is in place. It can do so by deploying to a new location and then redirecting the entry points, or by laying the new code over the old. Either way, the ways in which it can go wrong are few (at least compared to DB upgrades). DB upgrades, on the other hand, must take existing state (which can measure GBs or TBs) and transform it into a new state. If you're really unlucky, this may involve a size-of-data transformation (e.g. updating every row in a big table). I've seen 'upgrades' that lasted weeks. Having witnessed DB upgrades at MS Azure scale (see [0]), and having coded several SQL DB upgrade steps myself (internal system upgrades, the kind done when you install a new DB engine, not user upgrades), I can say with confidence that DB upgrades are hard.

What we all end up with is either DB migration steps (a la Rails migrations), of which I approve (see [1]), or schema comparison tools (a la vsdbcmd.exe), of which I'm very skeptical after being burned several times on large DBs (they are only as good as the tool, and some tools are better than others; still, I find explicit migrations much better).

As a side note, my startup DBHistory.com records all DB schema changes, like a DB flight recorder of sorts, and one of my future goals is to add the capability to generate compensating actions for any DB schema change and thus be able to revert the DB schema to any previous state (by simply rolling back every change in reverse order). But I must admit I'm quite far from having such a thing working, and that's without even considering migrations that modify data rather than schema.

    [0] https://medium.com/@rusanu/why-i-have-high-hopes-for-the-quality-of-sql-server-2016-release-6173bc1fbc82#.wp1zt5pn0
    [1] http://rusanu.com/2009/05/15/version-control-and-your-database/
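The compensating-action idea is tractable for a subset of DDL. A hypothetical sketch of the easy half in Python: purely additive changes can be inverted from the statement text alone, while destructive ones would need the recorded history to supply the old definition and data, which is the genuinely hard part the comment above alludes to.

```python
import re

def compensating_action(ddl):
    """Return a statement that undoes `ddl`, for a few reversible shapes.

    Destructive changes return None: inverting them requires the old
    definition (and data), which only a recorded history can supply.
    """
    m = re.match(r"ALTER\s+TABLE\s+(\w+)\s+ADD\s+COLUMN\s+(\w+)", ddl, re.I)
    if m:
        return f"ALTER TABLE {m.group(1)} DROP COLUMN {m.group(2)}"
    m = re.match(r"CREATE\s+TABLE\s+(\w+)", ddl, re.I)
    if m:
        return f"DROP TABLE {m.group(1)}"
    return None  # DROP TABLE, DROP COLUMN, data changes: not invertible here
```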
[+] Nick-Craver|10 years ago|reply
Perhaps I can add some context on top of the post. All updates are in transactions unless explicitly opted out. Also remember our case is a little special: we have hundreds of databases with the same schema; what if #328 fails? The failure/rollback scenario gets more complicated once more than a single database is involved.

As for backups: absolutely. We handle this independently, though. We do full backups every night as well as T-logs every 15 minutes. If we had to restore every database to a very specific point in time, or to just before a migration command was run, we have T-logs to do that going back 4 days at all times.

I'm sure there are good solutions for single-database applications that are far more fully featured than our approach; they just do little to solve any problems we actually run into.
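To illustrate the multi-database wrinkle (a sketch of the problem shape, not Stack Overflow's actual tooling): with hundreds of schema-identical tenant databases, the interesting output of a migration run isn't "did it work" but "which tenants are now ahead of the others" when one fails partway.

```python
import sqlite3

def migrate_all(db_paths, sql):
    """Run one migration statement against every tenant database.

    Each database gets its own transaction; on the first failure we stop
    and report which tenants already migrated, since they are now ahead
    of the rest and a blanket rollback is no longer an option.
    """
    migrated = []
    for path in db_paths:
        conn = sqlite3.connect(path)
        try:
            with conn:
                conn.execute(sql)
            migrated.append(path)
        except sqlite3.Error as exc:
            return migrated, (path, str(exc))  # e.g. database #328 failed
        finally:
            conn.close()
    return migrated, None
```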

[+] rossng|10 years ago|reply
There definitely is some tooling available to help - Redgate, many ORMs - but I agree that it's somewhat lacking.

I think the bigger problem is cultural - many programmers either don't really understand databases/data modelling or they don't care about it. After all, you don't really have to worry about it when you're just starting out - almost any schema will work. That is, right up until you have to modify it. By the time it becomes an issue, the culture has crystallised and changing the database is too risky.

For some reason, a lot of companies are largely unwilling to spend money on good database management/migration tools - even if they're paying a stack of cash for SQL Server.

[+] harveywi|10 years ago|reply
I wonder about this too. It seems like some ideas from category theory and pure functional programming could really help here by providing an abstraction on top of the (pseudo-)relational model.

While doing research on an unrelated topic, I stumbled on some potentially related work [1, 2, 3] by researchers at MIT that could be relevant to database deployment/migration. I haven't had a chance to dig into these references to see whether there is any real relevance or promise there, though it looks like there is some kind of commercialization effort [4].

[1] Patrick Schultz et al. Algebraic Databases. http://arxiv.org/abs/1602.03501

[2] David Spivak. Functorial Data Migration. http://arxiv.org/abs/1009.1166

[3] http://categoricaldata.net/

[4] http://www.catinf.com/

[+] xorcist|10 years ago|reply
I've used Flyway for that. Conceptually similar, but it's a real tool. Very solid.
[+] ericdc1|10 years ago|reply
I've put a bunch of effort into database change management over the years and work on a tool called AliaSQL, a simple command-line database migrator for SQL Server.

Using this tool has led to a dramatic increase in productivity on our team, since we really don't have to worry about database changes anymore. I won't waste space here with the details, but these links will fill you in if you have any interest.

https://github.com/ClearMeasure/AliaSQL

http://sharpcoders.org/post/Introducing-AliaSQL

[+] schrodinger|10 years ago|reply
After using SQL for most of my career, I'm now working on a product using MongoDB. Removing the need for schema migrations has been such a boon for productivity. You essentially push the work of migrating to the app code, where it can be done incrementally, and with a far better programming language than SQL. It's been well worth the trade-offs, in my opinion.
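For context, "migrating in app code" usually takes the shape of a per-document upgrade function applied on read, so old and new documents coexist in the store and convert incrementally. A sketch with plain dicts (the version scheme and field names are hypothetical):

```python
CURRENT_VERSION = 2

def upgrade(doc):
    """Bring a document to the current shape on access.

    Old and new documents coexist; each is upgraded the first time it
    is read (and can be written back at that point).
    """
    version = doc.get("version", 1)
    if version < 2:
        # v2 splits the single "name" field (a hypothetical change)
        first, _, last = doc.pop("name", "").partition(" ")
        doc["first_name"], doc["last_name"] = first, last
        doc["version"] = 2
    return doc
```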
[+] mwhite|10 years ago|reply
It's called NoSQL, which removes the need for schema migrations for things like adding or deleting columns.

This could be solved for relational databases by implementing application-level abstractions that let you store all your data as JSON, but create non-JSON views so you can query it in your application using traditional ORMs, etc.

So, store all data using these tables, which never have to be changed:

- data_type

- data (int type_id, int id, json data)

- foreign_key_type (...)

- foreign_keys (int type_id, int subject_id, int object_id)

(we'll ignore many-to-many for the moment)

And then at deploy time, gather the list of developer-facing tables and their columns from the developer-defined ORM subclasses, and make a request to the application-level schema/view management abstraction to update the views to the latest version of the "schema", along the lines of https://github.com/mwhite/JSONAlchemy.

With the foreign key table, performance would suffer, but probably not enough to matter for most use cases.

For non-trivial migrations where you have to actually move data around, I can't see why these should ever be done at deploy time. You should write your application to be able to work with both the old and new versions of the schema, and have the application do the migration on demand as each piece of data is accessed. If you need to run the migration sooner, then run it all at once using a management application that's not connected to deploy -- with the migration for each row in a single transaction, eliminating downtime for migrating large tables.

I don't have that much experience with serious production database usage, so tell me if there's something I'm missing, but I honestly think this could be really useful.
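The generic-tables-plus-views idea can be sketched with SQLite's JSON1 functions (assuming a SQLite build with JSON1, which modern Python bundles; the table and column names here are made up). A "schema change" then becomes recreating a view, while the storage table never changes shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# the generic storage table never changes shape
conn.execute("CREATE TABLE data (type_id INT, id INTEGER PRIMARY KEY, data TEXT)")

def create_view(conn, name, type_id, columns):
    """Expose a developer-facing 'table' as a view over the JSON blobs.

    A schema change becomes DROP VIEW + CREATE VIEW: cheap, and the
    stored rows are untouched.
    """
    cols = ", ".join(f"json_extract(data, '$.{c}') AS {c}" for c in columns)
    conn.execute(f"DROP VIEW IF EXISTS {name}")
    conn.execute(
        f"CREATE VIEW {name} AS SELECT id, {cols} FROM data WHERE type_id = {type_id}"
    )

create_view(conn, "users", 1, ["email", "age"])
conn.execute(
    "INSERT INTO data (type_id, data) VALUES (1, ?)", ('{"email": "a@b.c", "age": 30}',)
)
conn.commit()
```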

[+] clio|10 years ago|reply
Additional anecdata: At my place of employment, after the required code review, we must write an essay about the change and have at least two coworkers approve it. Then we must email the essay to a mailing list of several hundred people. One of those people is the designated external reviewer, who must sign off. However, lately, this final reviewer has been requesting 10+ modifications to the deployment essay due to a managerial decision to "improve deployments". Moreover, deployments are not allowed 1/4th of the year unless a vice president has signed off.

Any code change requires this process.

[+] ohitsdom|10 years ago|reply
This feels really clunky to me, but maybe I'm just not getting it. I'm trying to implement a more automated build/deploy process at my current place of employment, and am basically modeling it off of GitHub's [0], which seems to have a better feel.

Obviously the quality of the process needs to be high, but when it's effortless and "fun" then everybody wins.

[0] http://githubengineering.com/deploying-branches-to-github-co...

[+] richardwhiuk|10 years ago|reply

   Fun fact: since Linux has no built-in DNS caching, most of the DNS queries are looking for…itself. Oh wait, that’s not a fun fact — it’s actually a pain in the ass.
Surely that should just be a very fast lookup in /etc/hosts?
[+] Nick-Craver|10 years ago|reply
The problem here is that these services move, so if it's in /etc/hosts, our failover mechanisms (to a DR data center which has a replica server) are severely hindered. We're adding some local cache, but there are some nasty gotchas with subnet-local ordering on resolution. By this I mean: New York resolves the local /16 first, and Denver resolves its local /16... but BIND doesn't care (by default) and likes to auth against, let's say, the London office. Good times!
[+] xorcist|10 years ago|reply
Also, Linux has "built in" (whatever that means) DNS caching. It's called nscd. It's just usually not enabled by default (which is sensible, since it's better off shared).
[+] sakopov|10 years ago|reply
I think if you're deploying .NET code you're almost certainly going to follow a similar build architecture, with TeamCity doing most of the grunt work. We have a very similar build structure, but a bit more polished, I think. Our TeamCity build agents build NuGet packages, deploy them to Octopus, and run unit and integration tests. Octopus handles promotion of code from dev to QA, to staging, and to all production environments. We also write migrations for database updates using FluentMigrator, which works with just about any database driver. It's a joy deploying code in an environment like this.
[+] rjbwork|10 years ago|reply
Agreed on the Octopus bit. TeamCity + Octopus is practically magical. Until literally yesterday, I'd yet to find something that didn't work with minimal effort between the two.
[+] Griever|10 years ago|reply
Happy to see someone else using FluentMigrator. It's a fantastic library that doesn't get mentioned enough imo.
[+] radicalbyte|10 years ago|reply
It's funny to see that Stack Overflow came to exactly the same solution for database migrations on the Microsoft stack as my team did, even down to the test procedure.

Simple, safe and very effective :)

[+] daddykotex|10 years ago|reply
> If Gitlab pricing keeps getting crazier (note: I said “crazier”, not “more expensive”), we’ll certainly re-evaluate GitHub Enterprise again.

Shots fired :P

[+] goldbrick|10 years ago|reply
When did everybody decide that chatbots were the new hotness for deployments?
[+] Nick-Craver|10 years ago|reply
If you mean pinbot - that's literally all it does. It takes a message and pins it, knocking the old one off the pins.

The build messages... that's also literally all they do. They simply put handy notices in the chatroom. Why wouldn't you want that integration? Having everyone go look at the build screen and poll it to see what's up is a far less efficient system. A push-style notification, no matter the medium, causes far less overhead.

I doubt we'll ever build from chat directly for anything production at least, simply because those are 2 different user and authentication systems in play. It's too risky, IMO.

[+] richardwhiuk|10 years ago|reply

   A developer is new, and early on we want code reviews
   A developer is working on a big (or risky) feature, and wants a one-off code review
This implies you don't normally do code review??
[+] sklivvz1971|10 years ago|reply
Yes, we don't normally code review stuff we don't need to. We trust each other.
[+] nxzero|10 years ago|reply
SE/SO is, for its size, an amazingly high-performance team. It's rare to hear them say so themselves; in fact, I don't recall ever hearing them say it.

The cohesion and trust among their team is critical to their deployments. In fact, I would say it's vital to how they're able to get away with, for example, minimal amounts of code review.

It's dangerous to believe this is easy or reproducible. New teams need extensive controls in place to make sure the quality of their deployments will not negatively impact the group.

[+] NKCSS|10 years ago|reply
A few things strike me as odd/sub-optimal:

    - migration IDs are freeform in chat -> why not a central DB with an auto-increment column?
    - using chat to prevent 'migration collisions' -> the same central DB/MSMQ/whatever could report the start/stop of migrations and lock based on that...
[+] GordonS|10 years ago|reply
I guess this is because using team chat is 'good enough' for them without adding another layer of tooling.
[+] ngrilly|10 years ago|reply
The article says:

> there is a slight chance that someone will get new static content with an old hash (if they hit a CDN miss for a piece of content that actually changed this build)

Does anyone have a solution to this problem?

[+] dpark|10 years ago|reply
Without poking at their site to see how it behaves, I assume they just append a cache-busting param onto the URL for a static entity. If that's correct, the fix is to make the hash meaningful to the process serving the static entities.

This adds complexity, as now your static site needs to serve requests based on the hash. This isn't conceptually complex, but it means you must deploy at least two versions of your static resource (one for each hash in the wild). And you still have to do a two-phase deployment in this model (static resources first, then the "real" site). Or you can build redirect or proxying support from v2 to v1 and vice versa, which is a much uglier problem to solve but eliminates the need for two-phase deployment.

Since they have short deployments, their solution is pretty elegant. If you have long deployments, the hash-aware server becomes sensible. If you're a masochist, the two-way proxying becomes attractive.
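One common way to make the hash "meaningful" without a hash-aware server is to put the content hash in the file name itself, so every version of an asset stays addressable and old pages keep working during the deploy window. A sketch (the helper and names are hypothetical):

```python
import hashlib

def fingerprint(name, content):
    """Return a content-addressed file name, e.g. app.js -> app.<hash>.js.

    Deploy assets under these names before the pages that reference
    them: a client holding an old page still resolves the old asset,
    and a CDN miss can never pair new content with an old hash, because
    a changed asset gets a brand-new URL.
    """
    digest = hashlib.sha256(content).hexdigest()[:12]
    stem, dot, ext = name.rpartition(".")
    return f"{stem}.{digest}.{ext}" if dot else f"{name}.{digest}"
```

This is essentially the "push static content first, then bust the cache" ordering mentioned elsewhere in the thread, made safe by construction.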

[+] windowsworkstoo|10 years ago|reply
Nope, we solve it the same way, and it's a pain. I once did a presentation for devs where I unraveled the up to 9 different layers of caching between an end user and our website (when you take into account the browser and any tiered CDN caching).

It's a pest of a problem, but pre-deploying static assets is the best answer.

[+] mfontani|10 years ago|reply
Push the new static content to all servers/CDN first, _then_ bust the CDN cache / push a cache bust.
[+] infocollector|10 years ago|reply
Stack Overflow uses Windows? Any particular reason to do this?
[+] streeter|10 years ago|reply
I was surprised they weren't using Kiln [https://www.fogcreek.com/kiln/], a Fog Creek product. I know SO is independent from Fog Creek now, but I'm still a bit surprised. I wonder if there was a migration off at some point.
[+] Nick-Craver|10 years ago|reply
Yep... for the on-premises reasons listed in the article. Once upon a time a lot of projects were on Mercurial, hosted by Kiln. The Stack Overflow repo specifically has always been on an internal Mercurial and then Git server. Originally this was for speed; now it's for speed and reliability/dependency reduction.
[+] gnahckire|10 years ago|reply
Did you choose polling over webhooks for a reason? Or were webhooks recently added as a feature to GitLab?
[+] msane|10 years ago|reply
TeamCity is a great CI system.
[+] totally|10 years ago|reply
What's with all the backslashes?