Uber migrates microservices to multi-cloud platform running Kubernetes and Mesos

[+] scarface_74|2 years ago|reply

> The team used existing tooling to move services between zones in order to ensure they were portable. Firstly, they allowed services to be moved back to the original zone to resolve any portability issues, but once resolved, services would be moved periodically to validate portability and prevent regressions.

This is something that most companies don’t do when they say they want to do $x to “prevent lock in”.

Uber actually is testing for portability along the way.
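To make the quoted process concrete, here is a minimal sketch of a periodic portability check, assuming a simple mapping of service to zone. The names `next_zone`, `rotate`, and the `healthy` callback are hypothetical and are not taken from Uber's tooling. It only illustrates the idea of moving a service and falling back to the original zone when the move exposes a portability issue.

```python
import random
from typing import Callable, Dict, List


def next_zone(current: str, zones: List[str], rng: random.Random) -> str:
    """Pick a zone other than the one the service currently runs in."""
    candidates = [z for z in zones if z != current]
    return rng.choice(candidates)


def rotate(
    placement: Dict[str, str],
    zones: List[str],
    healthy: Callable[[str, str], bool],
    rng: random.Random,
) -> Dict[str, str]:
    """Try to move every service to a new zone (hypothetical helper).

    If the service is not healthy in the target zone, keep it in its
    original zone so the portability issue can be fixed first.
    """
    result = {}
    for service, zone in placement.items():
        target = next_zone(zone, zones, rng)
        result[service] = target if healthy(service, target) else zone
    return result


# Example: service "b" is assumed to be non-portable, so it stays put.
placement = {"a": "zone-1", "b": "zone-2"}
zones = ["zone-1", "zone-2", "zone-3"]
moved = rotate(placement, zones, lambda svc, z: svc != "b", random.Random(0))
```

Running something like `rotate` on a schedule is what turns portability from a one-time claim into a regression test.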

[+] dehrmann|2 years ago|reply

It's probably more cost effective to negotiate a long-term max price with your cloud provider with a force majeure clause.

[+] eatonphil|2 years ago|reply

This article is a recap of the original engineering article by the quoted developer and manager at Uber.

https://www.uber.com/en-GB/blog/up-portable-microservices-re...

[+] voz_|2 years ago|reply

Uber Microservices were such an inefficient PITA. There was buzzword soup of a bunch of half baked infra pieces and they were always migrating. Every part of the stack was rotten. Udeploy, xterra, tchannel, schemaless, etc etc.

My peak “wtf” moment was when we had a SEV because two services that should communicate actually used different versions of thrift, both hard forked by Uber, with different implementations for sets. Passing a set from one service to another caused everything to break.

[+] activescott|2 years ago|reply

> In preparation for the move to the cloud, the company spent two years working towards making all stateless microservices portable so that their placement in zones and regions can be managed centrally without any involvement from the service engineers

I'd like to hear more about how Uber organized the engineering teams over two years to make "stateless microservices portable".

How many teams? What were the requirements for each team? What was the timeline? How did they know it was completed? How was it prioritized alongside the teams' other business priorities? How long did they think it would take originally? Was it worth it?

[+] s3p|2 years ago|reply

Maybe direct these questions to a C-level employee at Uber who could potentially answer them for you?

[+] jbotdev|2 years ago|reply

It seems like they’ve gotten to the “holy grail” of deployment where developers don’t have to worry about infrastructure at all in theory.

I’ve seen many teams go for simple/leaky abstractions on top of Kubernetes to provide a similar solution, which is tempting because it’s easy and flexible. The problem is then all your devs need to be trained in all the complexities of Kubernetes deployments anyway. Hopefully Uber abstracted away Kubernetes and Mesos enough to be worthwhile, and they have a great infra team to support the devs.

[+] x86x87|2 years ago|reply

Does Uber really need 4000 microservices?

[+] danpalmer|2 years ago|reply

A different (better?) question is, does Uber need 4000 API contracts?

The answer to that is probably yes. APIs let us split work across systems/people/teams/regions, and provide a way for both sides of a split to work together. Uber has a lot of teams, a lot of engineers, and so it makes sense that there are a lot of API boundaries to allow them to work together more efficiently. Sometimes those APIs make sense to package as microservices.

[+] jpalomaki|2 years ago|reply

There's an interesting HN comment[1] from 2020 by a former Uber engineer, which discusses the complexity a bit. It's more about UI, but the thread discusses the backend as well. In brief, something that may look super simple to the user (like handling payments) is actually quite complicated when you cover all the markets, different payment types, etc. And all this carries over to the backend as well.

[1] https://news.ycombinator.com/item?id=25376346

[+] threeseed|2 years ago|reply

Uber is a global company (70+ countries) operating Uber and Uber Eats.

So almost certainly they are duplicating their entire stack per-country if only to get around the vastly different regulatory environments.

[+] bastawhiz|2 years ago|reply

Uber has a really liberal definition of a micro service. Every web UI or dashboard is a service (of which there are many hundreds). Every application anyone builds across their many thousands of engineers is a service. It's rare, I think, for services to have fewer than a few thousand lines of code. In my experience, most companies would have a monolith that serves multiple UIs from the same service. Uber instead ships that monolith as a library which is a framework for building individual UIs. It has its pros and cons but I quite liked how they did it.

[+] ninja3925|2 years ago|reply

(Worked at Lyft) Our number of active micro services was small in comparison. 4,000 is likely an overblown number to highlight the accomplishment, possibly counting inactive ones.

[+] justapassenger|2 years ago|reply

From experience working at big tech I’m willing to take a guess.

Maybe a couple of dozen will actually be more complex and meaningful services. Then a few dozen more services that are somewhat more unique.

And then the majority of the long tail will be mostly cookie-cutter services, doing X, but for lots of different use cases, where each use case is a separate deployment counting as a service (for example, systems to process streams of logs related to business logic).

[+] talent_deprived|2 years ago|reply

I've seen at least one place with many more than that in recent years. If you have one microservice "listener" per queue and another for the database processing and persistence (business logic) and another providing an API for one or more frontend UI's related to it then the microservice tally goes up very fast. It's kind of surprising to read so many comments indicating HN readers weren't aware of this.
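As a rough illustration of how quickly that tally grows, here is a back-of-the-envelope sketch. It assumes the pattern described above maps to three services per queue (listener, processing/persistence, API); the function name and the queue count are made up for the example.

```python
def service_count(queues: int, services_per_queue: int = 3) -> int:
    """Estimate the number of microservices under the assumed pattern.

    services_per_queue defaults to 3: one listener, one service for
    database processing and persistence, and one API for the frontend.
    """
    return queues * services_per_queue


# Hypothetical figure: 1,500 queues would already mean 4,500 services.
estimate = service_count(1500)
```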

[+] whynotmaybe|2 years ago|reply

There's quite a sizing range between monolith and microservice.

If all their IT needs are behind micro "micro" services, that figure is understandable.

Outside of the map, taxi, food, payments, onboarding, they also have monitoring, deployment, HR, billing, legal, taxes, internationalized stuff, and the usual "..." for what I'm missing.

If you just take a standard ERP, you could easily split it into dozens, even hundreds, of microservices.

[+] belter|2 years ago|reply

Apparently they started at 1000 and went from there...

"What I Wish I Had Known Before Scaling Uber to 1000 Services" - https://youtu.be/kb-m2fasdDY

[+] speedgoose|2 years ago|reply

It reminds me of this thread about Netflix, with insane amounts of events and logs compared to active users.

https://news.ycombinator.com/item?id=30635369

[+] 0xblinq|2 years ago|reply

What would the engineers be doing otherwise? You get bored if you don’t.

[+] tight-ship|2 years ago|reply

Does it matter how they organize their services? Your experience and environment will be different in so many ways that I doubt it's comparable.

[+] 0xDEF|2 years ago|reply

Yes, there are specific business rules for each nation, region/state, and city.

[+] barbazoo|2 years ago|reply

Maybe they meant instances.

[+] nine_zeros|2 years ago|reply

How else would engineers demonstrate "impact" for promotions?

/s

[+] deathanatos|2 years ago|reply

There's no way that number isn't fiction; Occam's razor says it's out of the range of believable. That's ~2 per eng according to Google. That's absurd. (That eng headcount is also a bit … high.)

This sounds like a figure from someone who sees a single microservice running across 100 pods/instances, and counted that as 100 "microservices".
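The distinction is easy to show with a small sketch. The pod list below is invented for illustration and does not reflect any real Uber data; the point is only that counting instances and counting distinct services give very different numbers.

```python
from typing import List, Tuple


def count_services(pods: List[Tuple[str, str]]) -> int:
    """Count distinct service names in a list of (service, pod_id) pairs."""
    return len({service for service, _pod_id in pods})


# Hypothetical inventory: one service deployed as 100 pods.
pods = [("payments", f"pod-{i}") for i in range(100)]

instances = len(pods)            # 100 running instances
services = count_services(pods)  # 1 distinct microservice
```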

[+] locustmostest|2 years ago|reply

I couldn't find any explanation of where the data would be found. Are they splitting data across clouds, and constantly "porting" that data from cloud to cloud as part of their portability?

Orchestrating the application layer across clouds is interesting, but how does their data layer work?

[+] fbnbr|2 years ago|reply

The title is misleading. I don’t see Mesos mentioned once in the article.

I got so excited to read about Mesos helping in the multi-cloud world, potentially as the hypervisor for running k8s.

[+] xyst|2 years ago|reply

I dislike the Uber business itself (horrible treatment of drivers, poor customer service, poor safety controls, bullying of small businesses with Uber Eats, shitty executive level team with questionable ethics).

But the underlying technology which carried them to this point is a fascinating read.

[+] abbadadda|2 years ago|reply

“Microservices” https://m.youtube.com/watch?v=y8OnoxKotPQ

[+] opportune|2 years ago|reply

I believe the dollar amount savings figures, they’re big and worthy of a congratulations to the engineers involved!

IMO, engineering man hour savings are a lot less trustable. This may eliminate or simplify some engineering processes but IME massive migrations like this simply replace them with a different set of processes; because they’re different and theoretically addressable they’re not counted against the hours saved as they can be bucketed into bugs/to be addressed by the roadmap/legacy behavior migrated from the old system (which is now dangerously-fragile-legacy and not ol-reliable-legacy). Eventually someone will come along and decide this too is an inherently flawed platform that needs to be entirely replaced at great expense, and the circle of life continues.

This is still a massive undertaking not just from an engineering perspective but from an organizational/process one though. Whoever pulled this off essentially had to coordinate (or figure out how to simplify/explain things well enough to skip coordination) with almost every engineer and likely almost every production service in a company with thousands of engineers. Those in startups may balk about this kind of thing taking two years, but having done my own two year projects (at a smaller but comparable scale) in a big company I can say two years is what I’d consider a highly optimistic and unlikely outcome for a project of this magnitude.

[+] jvans|2 years ago|reply

> This may eliminate or simplify some engineering processes but IME massive migrations like this simply replace them with a different set of processes

Yes

> because they’re different

Now I have to learn an entire new set of tools/processes etc that are more useful to someone else but not helpful for me. The old one had its quirks but I knew it inside out and now the whole org has to re-learn how to do everything we did before.

[+] lowbloodsugar|2 years ago|reply

Look forward to the future write-up of how a Zookeeper issue nuked their entire Mesos stack.

[+] this_user|2 years ago|reply

For a company that is basically a taxi service, they seem to invest an awful lot in constant rebuilds of their extremely complex infrastructure, which raises the question of whether that is even remotely necessary or just an exercise in pretending that they are a tech company.

[+] ttul|2 years ago|reply

“Basically a taxi service,” except that Uber spans hundreds of cities, coordinates millions of drivers - none of whom work on a fixed schedule - and its only interface with customers is an app that has to be fast, accurate, and reliable at all times.

[+] bogota|2 years ago|reply

This is just such a bad take that it makes everything else you say after it null.

And google is just a search engine they only need like 20 engineers……………

[+] mardifoufs|2 years ago|reply

They do food delivery, parcel couriers, regular Ubers, plan-ahead Uber, grocery shopping, and a lot of other stuff. If anything, this is simpler than most silo-driven architectures you'd usually get with such a massively diversified business.

[+] gedy|2 years ago|reply

> "Basically a taxi service"

Not defending their tech stack, but I mean that is a lot of realtime data that needs to be accurate - this is not your typical SaaS crud app.

[+] zht|2 years ago|reply

I love these r/iamverysmart takes on HN.

Is this generally a sign of youthful wishful thinking or just plain hubris?

[+] intunderflow|2 years ago|reply

Oh hey, this is the thing I work on.

We're giving a talk about this at KCD Denmark on the 14th of November "Keynote: Uber - Migrating 2 million CPU cores to Kubernetes" if anyone is in the area and has any particular interest in this.

[+] kosolam|2 years ago|reply

Congrats to the UP team. The platform sounds good. I especially liked the Balancer component.

[+] jiveturkey|2 years ago|reply

To save you the deep deep dive: on OCI and GCP.

[+] mbrumlow|2 years ago|reply

In 3 years… “Uber saved cost by migrating their micro service to their own colo.” followed by “Uber simplified operations by migrating their micro service platform to a monolith”.

[+] corney91|2 years ago|reply

Might be a good guess, there's precedent of them changing fundamental technology in a similar timeframe...

2013: "Migrating Uber from MySQL to PostgreSQL"[1]

2016: "Why Uber Engineering Switched from Postgres to MySQL"[2]

[1] https://www.yumpu.com/en/document/view/53683323/migrating-ub...

[2] https://www.uber.com/en-GB/blog/postgres-to-mysql-migration/

[+] politelemon|2 years ago|reply

In 5 years... "We've discovered a new paradigm for efficiently carving up and distributing computational units for our application. We call it, nanofunctions."

[+] matwood|2 years ago|reply

I'm not sure why this is an issue with a long running system. Business requirements change, knowledge changes, cost structures change, etc... Unfortunately the world isn't static. I'm not sure about you, but when the facts change I also try to change.
