top | item 37612483

mabbo | 2 years ago

Many years ago when I was a junior dev at Amazon, there was a massive project internally to split up every internal system into regional versions with limited gateways allowing calls between regions. The reason? We had run out of internal IPv4 addresses.

The Principal PM in charge of the "regionalization" effort was asked in a Q&A "why didn't we just switch to IPv6?".

Her answer was something along the lines of "The number of internal networking devices we currently have that cannot support IPv6 is so large that to replace them we would have needed to buy nearly the entire world's yearly output of those devices, and then install them all."[0]

It's easy to presume malicious intent on the IPv4 front from Amazon, but with so many AWS systems at the scale they are, I find it easy to believe that replacing all of the old network hardware may just be a project too large to do on a short timescale.

[0] - At least, that's my memory of it. I'm sure that's not an entirely accurate quotation.

aranchelk|2 years ago

Can you remember what year it was?

I’ve got a slight suspicion you were given some bullshit or at least a creative treatment of facts e.g. everything had IPv6 support but FUD-filled network engineers didn’t want to turn it on.

Most network devices I’ve encountered were dual-stack way before anyone I knew seemed to care about actually using IPv6 — I always assumed it was added for US government/military requirements.

discodave|2 years ago

From memory, the regionalization project ran from approx 2014 to 2015 or 2016.

There were also other reasons given, like the amount of internal software that used e.g. IPv4 addresses. Also, AWS likes to have 'lots of small things' instead of one big thing (regions, AZs, cells, two pizza teams, no (official) monorepo) so regionalization was part of that.

Another big reason for regionalization, other than IPv4 exhaustion, was that AWS promises customers that AWS regions are completely separate, but with one big giant network, it turns out there were all sorts of services making calls between regions that nobody had realized. I have a couple of funny examples, but that might make me too identifiable :)

jjoonathan|2 years ago

Sure, everything supports IPv6 -- until you turn it on and rediscover the tickets that have been sitting at the bottom of the JIRA for the last decade.

jvolkman|2 years ago

I believe the issue wasn't IPv6 support generally, but TCAM space and the increase in routing table size moving from v4 to v6. Overflowing TCAM would cause routing to hit the CPU, which would immediately lead to outages.

Tables were relatively large internally because AWS was all-in on Clos networks at that point. And the devices used to build those Clos networks were running Broadcom ASICs, not Cisco or other likely vendors.
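As a rough illustration of the TCAM pressure (all numbers below are hypothetical, not AWS's actual figures), the arithmetic works against you: on many switch ASICs a 128-bit IPv6 key consumes several TCAM slots that would each have held one IPv4 route:

```python
# Back-of-envelope sketch of why IPv6 routes strain TCAM.
# All figures are illustrative, not real AWS numbers.

tcam_entries = 128_000        # hypothetical TCAM capacity, in IPv4-sized entries
v4_routes = 100_000           # hypothetical internal v4 routing table
entries_per_v6_route = 4      # e.g. a 128-bit key spanning four 32-bit-wide slots

v4_used = v4_routes * 1       # each v4 route fits in one entry
v6_capacity = (tcam_entries - v4_used) // entries_per_v6_route
print(f"Room left for only {v6_capacity} IPv6 routes alongside the v4 table")
# Once the table overflows, lookups spill to software forwarding on the
# CPU, which is the outage scenario described above.
```

With those made-up numbers, a table that comfortably holds 100k v4 routes has headroom for only 7,000 v6 routes, which is why fixed-memory devices would need wholesale replacement.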

thatsBs369|2 years ago

Right: if you worked at Amazon and weren't incentivized to do something, you didn't do it. It was part of your job not to do things you weren't incentivized to do.

paulddraper|2 years ago

Right?? How old a device would you have to get to NOT have IPv6 support?

EDIT: But maybe bugs, IDK.

ketralnis|2 years ago

> FUD-filled network engineers

FUD sounds like a mean way to say unproven in production

Twirrim|2 years ago

I remember the regionalisation, that was "fun" to be on the sidelines for (I was in a newer service that was regionalised from the get-go). I don't remember who the PM was for that one, but I remember that being when I truly came to respect the value that a TPM can add.

You're right about the cost and the need to replace network equipment being one of the strong reasons why they didn't. Amazon used its own in-house designed and built network gear for a variety of reasons (IIRC there's a re:Invent talk about it), which I'm sure is probably still the case. Every single one of those machines had fixed memory capacity and would need to be replaced to bump the memory up enough to handle IPv6 routing table needs etc. What they had wouldn't even have been enough if they'd chosen to go IPv6-only (which you couldn't get to except via dual-stack IPv4/IPv6 anyway).

NBJack|2 years ago

Were they also by chance considered accelerators for encrypted traffic?

I'm not privy to details, but I recall once when a mandate was issued to a Java platform to remove an outdated encryption protocol (mandated by Amazon Infosec). The change was made and rolled out with little fanfare.

A few weeks later, a large outage of Amazon Video (which used said platform) occurred on a Friday evening. Root cause? The network hardware accelerators were only set up to use that outdated protocol, which in turn meant that encryption was happening in software instead. Under load, the video hosting eventually caved.

Might be specific to the hardware used for Amazon retail, but it reinforces the point of their home grown (and now aging) stack.

grogenaut|2 years ago

I believe the PM was Laura Grit, who was actually a TPM. Laura is a Distinguished Engineer now. She seems to constantly do massive-scale projects, IPv4 being a smaller one now. Sadly I can't share some of the big projects she's doing now. I've gotten some sage advice from her on a few occasions when she had time, and I appreciate it.

justrealist|2 years ago

> the PM was Laura Grit

Talk about nominative determinism...

virtuallynathan|2 years ago

Yep, she was behind regionalization and IPv6 and such. I recall reading the same thing the parent comment talks about.

irrational|2 years ago

> replacing all of the old network hardware may just be a project too large to do on a short timescale.

If that is the case, then Amazon should hold off on charging for IPv4 on a short timescale until they have replaced all the old hardware and can support IPv6 internally everywhere.

JoBrad|2 years ago

True. But if they are having a problem getting that done, adding a surcharge is a good way to get bottom-up pressure on AWS teams to finish the job.

tinix|2 years ago

this doesn't obviate the v6 phase-in though; can't kick that can down the road forever.

surely they started the process...

right? i cannot imagine AWS just sticking head in the ground and ignoring this...

Twirrim|2 years ago

No one is ignoring it, and the US Government has done everyone another favour on this score. Years ago, in the late Bush / early Obama administration, NIST required that all federal government agencies have IPv6 at the border. Federal government money is not to be sniffed at, and that had the effect of forcing a number of vendors to add IPv6 support. A few years after that, the requirement became that federal agencies needed to run dual-stack IPv4/IPv6.

About 18 months ago came the requirement that federal agencies go IPv6-only, dropping the dual stack. IIRC they have until 2025 to do that. This has the neat effect of forcing all vendors to make IPv6 a first-class citizen. The extra little fun from this is that it applies to the military JWCC contract that all the major clouds have been trying to land. The timescales of JWCC meant that initial offerings are pretty bare, but that won't be allowed to last.

mtnGoat|2 years ago

Yes they are working on it. A number of services already support v6, more to come.

KaiserPro|2 years ago

I can believe that, but also, places like Google and Facebook saw the problem of having >1 million devices and the lack of IP addresses and moved to IPv6.

housemusicfan|2 years ago

Hanlon's razor applies here.

There is no reason any company of any size should run out of IPv4 addresses internally, IF they are doing proper IP management. If I were to wager a guess I'd say there was a lot of waste going on, issuing /24s or larger to teams when all they need are /29s etc. It adds up over time. Once they exhaust private IP space they can always buy more at auction. They are Amazon after all, there's no shortage of money. This is just mismanagement of resources.
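For a sense of the arithmetic behind that claim (the team size here is made up), Python's `ipaddress` module makes the waste easy to quantify:

```python
import ipaddress

# Compare how much RFC 1918 space a team consumes at different
# allocation sizes, for a hypothetical team needing only 5 hosts.
need = 5
for prefix in (24, 29):
    net = ipaddress.ip_network(f"10.0.0.0/{prefix}")
    usable = net.num_addresses - 2   # minus network + broadcast addresses
    print(f"/{prefix}: {usable} usable addresses for {need} hosts "
          f"({usable - need} wasted)")

# And the ceiling: all of 10.0.0.0/8 holds only 65,536 /24s,
# which is easy to exhaust if every team gets one by default.
print(2 ** (24 - 8), "/24s fit in a /8")
```

A /24 gives 254 usable addresses versus 6 for a /29, so handing out /24s by default burns roughly 40x the space actually needed in this scenario.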

ben0x539|2 years ago

Can you elaborate on proper IP management? Isn't that sort of what the parent post is talking about with splitting the network into regional chunks?

I'd imagine few service teams at Amazon would get very far with a /29, let alone a /24, if they have to put all their stuff on that.

master_crab|2 years ago

My one issue with this is: if it's such a large lift, why burn the effort just to kick the can down the road? IPv6 has to happen at some point (and for AWS that point is sooner than most).

The better reason is the regionalization was probably a way to decrease blast radius in case of a service failure.

Also, AWS definitely did not regionalize all their services in 2016. IAM wasn't regionalized, and certainly not DNS/Route 53 (part of the reason why they had their massive failure in us-east-1 2-3 years ago).

jongjong|2 years ago

I upgraded a P2P networking library recently to add support for IPv6. That was a pure software solution and it required a lot of work. When you have to upgrade hardware as well, I can imagine it would present a massive challenge (especially logistically). You'd have to upgrade ALL the hardware before you even start thinking about the software side of the equation.
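A small sketch of the kind of software change that work typically involves (this is not the library in question, just an illustrative pattern): code that hard-codes IPv4 parsing or `AF_INET` has to be rewritten around family-agnostic lookups, which in Python usually means leaning on `getaddrinfo`:

```python
import socket

def open_listener(host: str, port: int) -> socket.socket:
    # getaddrinfo hides the v4/v6 difference: it returns the right
    # family, socket type, and sockaddr tuple for whatever `host` is,
    # so the caller never parses "a.b.c.d" strings itself.
    family, type_, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM, flags=socket.AI_PASSIVE)[0]
    sock = socket.socket(family, type_, proto)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(addr)
    sock.listen()
    return sock

# The same code path serves both families:
srv = open_listener("127.0.0.1", 0)   # resolves to AF_INET
print(srv.family)
srv.close()
# For "::1", getaddrinfo hands back AF_INET6 instead, with no
# address-family logic in the calling code.
```

Multiply that kind of refactor across every service that ever stored, logged, or compared an address, and the software side alone becomes a large effort even before any hardware is touched.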

pmarreck|2 years ago

So basically, their IPv4 infrastructure investment is so entrenched that they're trapped.

Sounds like a perfect opportunity for a market upstart to start out v6-only...

Dwedit|2 years ago

Out of IP addresses? Just use NAT.

kibwen|2 years ago

32-bit IPv4 addresses are wasteful. By leveraging NAT, we can get away with a 1-bit addressing scheme and save 31 bits per packet!

dharmab|2 years ago

Out of NAT sockets? Just use more IP addresses.

jvolkman|2 years ago

Hah, I worked on the hardware loadbalancer team during that period. Fun times.

devwastaken|2 years ago

Even cheap consumer hardware supports IPv6. There are significant financial incentives to continue the capitalism of IPv4 addresses. Like NFTs: an artificially limited capital. To create more addresses means more competition and loss of capital. Therefore they will spend billions on continually reworking internal IPv4 rather than going for the proper solution.

mannyv|2 years ago

You obviously have never been on the backend of a big enterprise deployment.

The world is bigger than your apartment.