
How Southwest Airlines melted down

283 points| wallflower | 3 years ago |wsj.com | reply

317 comments

[+] burlesona|3 years ago|reply
It’s fascinating that the same hopscotch travel pattern that allows SWA to offer better service to more places is also what caused the network to suffer cascading failure. Once a critical mass of pieces (planes/crew) were out of position the whole network fell apart, and it’s large enough that it seems like neither the humans nor software can easily reason about how to resume operations. Hence the need for a “full system reboot” over many days.

Anecdotally, I flew Southwest just before Christmas. The network was already buckling and we had major delays, but we were lucky and made it through. Despite the stress, the SWA crews were helpful, empathetic, and polite. They handled it better than I would have if I had been in their shoes.

[+] TheCondor|3 years ago|reply
Is resumption itself difficult, or is it resumption plus making whole the tens of thousands of customers who were supposed to be moved a week prior?

No idea what SkySolver actually does in totality, I'm sure it's complicated but I would think a flight crew could indicate where they are right now and then it could maybe pick up the next possible course they could perform. Not sure why the phone lines "jam up" exactly, don't you have a hierarchical management structure for this sort of thing? Or do 1000 pilots all report to one person?

They've got like 1000 planes and like 100-150 destinations, it's not the traveling salesman problem, an optimal plan isn't needed now so much as a functional one.

Of course, it's easy to bitch about it not being hard when I've never seen the code. Maybe it also tracks hours and does payroll and a dozen other functions.
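To make the "functional, not optimal" idea concrete, here is a minimal sketch of a greedy crew assignment. It has nothing to do with how SkySolver actually works; the function name, the data shapes, and the airport codes are all made up for illustration, and it ignores duty limits, aircraft, and timing entirely.

```python
from collections import defaultdict


def greedy_assign(crew_locations, flights):
    """Assign each flight the first crew already waiting at its origin.

    crew_locations: dict of crew_id -> airport code (hypothetical data).
    flights: list of (flight_id, origin, destination) in departure order.
    Returns a dict of flight_id -> crew_id, or None if nobody is there.
    """
    available = defaultdict(list)
    for crew, airport in sorted(crew_locations.items()):
        available[airport].append(crew)

    assignment = {}
    for flight_id, origin, destination in flights:
        if available[origin]:
            crew = available[origin].pop(0)
            assignment[flight_id] = crew
            # The crew is now positioned at the destination for later legs.
            available[destination].append(crew)
        else:
            assignment[flight_id] = None  # uncovered: needs a human decision
    return assignment


# Illustrative inputs only.
crews = {"crew_a": "DEN", "crew_b": "DAL"}
legs = [
    ("F1", "DEN", "DAL"),
    ("F2", "DAL", "HOU"),
    ("F3", "DAL", "DEN"),
    ("F4", "MDW", "DEN"),
]
plan = greedy_assign(crews, legs)
```

In this toy input F4 ends up uncovered because no crew is at MDW, which is the kind of gap a greedy pass leaves behind for someone to fix by hand.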

[+] DrBazza|3 years ago|reply
> Once a critical mass of pieces (planes/crew) were out of position the whole network fell apart, and it’s large enough that it seems like neither the humans nor software can easily reason about how to resume operations. Hence the need for a “full system reboot” over many days.

We had days like that in the UK but with the rail system. And when it happens it’s also due to snow. We’ve seen it on a global scale recently due to covid putting ships in the wrong places so that optimised shipping routes become a mess.

[+] amluto|3 years ago|reply
I don’t see how a full system reboot should take days. If you don’t care about serving customers (which they don’t right now), then the problem could be simplified to getting every plane and every crew home to where they would spend the night under a normal schedule. With few or no paying customers on each plane, there should be plenty of capacity to move misplaced crew members around. None of this needs to approximate normal routing, and fewer segments than normal are needed, since passengers can be ignored.

That being said, the software sucks. Southwest may have lost track of where their employees are. The ground crews are quitting. I wouldn’t be utterly shocked if management doesn’t even have a good overview of where the planes are.

(Obviously anyone halfway competent could hack up a script to find all the planes based on ADS-B data in a few hours. And it wouldn’t be terribly hard to text a link to all crew asking them to fill out a simple form with their location, nearest airport, and when they can get there. But this requires competence and agility.)

[+] MR4D|3 years ago|reply
I wonder if it’s that or simply a lack of slack in their system.

It seems to me that just like pre-staged inventory helps in logistics management, extra planes and crews in the rotation could improve operations under these circumstances.

[+] cratermoon|3 years ago|reply
> Hence the need for a “full system reboot” over many days.

My understanding is that the full system reboot wouldn't have taken all that long, it's just that the company was trying to do a major fix while keeping whatever was still sort-of-working running. As any sysop will tell you, patching a running system is all kinds of crazy risky.

[+] bogomipz|3 years ago|reply
>"Anecdotally, I flew Southwest just before Christmas. The network was already buckling and we had major delays, but we were lucky and made it through."

Interesting. You don't say how far before Christmas you were traveling. Had this crazy weather system already started moving from West to East at that point? Or was the system buckling just from passenger volume at that point, i.e. similar to the Summer meltdown that Southwest had?

[+] firstSpeaker|3 years ago|reply
I imagine the system cannot account for where all the planes, staff, passengers are and where they want to go economically.
[+] icambron|3 years ago|reply
I've told this story a few times, but maybe 10 years ago I had a cross-country JetBlue flight that was delayed perhaps 6 hours. It was a few days after a major storm. Like Southwest here, JetBlue didn't have much flex capacity and relied on the daisy chain to keep on chaining. Our plane had gotten stuck somewhere, so they had to find a different one at some far-away airport and fly it in, which took hours. But the kicker was that when the plane finally landed, the crew already onboard couldn't man the flight because that would exceed their duty limits. The airline didn't realize this ahead of time, so they had to gather a new crew (like literally call them in), which added a couple of hours to the delay.

Naively, I'd assumed these kinds of things were handled in some sort of mission-control center with warnings from rule engines blinking on some big screen and a team of crack operators mapping out what needed done. But clearly that wasn't so: they were just making things up as they went along. Sounds like Southwest is in a similar spot, but this time on a much bigger scale.
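For what it's worth, the kind of rule-engine warning imagined here can be sketched in a few lines. This is an assumption-heavy illustration: the 14-hour cap is a placeholder rather than an actual regulatory figure, and `can_operate` is a hypothetical helper, not any airline's real check.

```python
from datetime import datetime, timedelta

# Placeholder cap for illustration; real duty limits depend on regulation,
# start time, number of legs, and more.
MAX_DUTY = timedelta(hours=14)


def can_operate(duty_start, departure, block_time, max_duty=MAX_DUTY):
    """Return True if the crew would still be within max_duty on arrival."""
    duty_end = departure + block_time
    return duty_end - duty_start <= max_duty


# A crew that reported at 06:00 and is handed a delayed 16:00 departure.
reported = datetime(2022, 12, 22, 6, 0)
delayed_departure = datetime(2022, 12, 22, 16, 0)

long_leg_ok = can_operate(reported, delayed_departure, timedelta(hours=5))
short_leg_ok = can_operate(reported, delayed_departure, timedelta(hours=3))
```

Running a check like this when the delay is first posted, not when the plane lands, is what would let an airline start calling in a replacement crew hours earlier.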

[+] seandoe|3 years ago|reply
> clearly that wasn't so: they were just making things up as they went along.

Where did you get your information? I have experience in the industry and scheduling logistics is clearly not how you describe it. The issue is that to optimize for profit you sacrifice the ability to maintain service through catastrophic events and can end up in a bit of a dominoes situation.

[+] bombcar|3 years ago|reply
I’ve seen it happen and it’s always strange that they seem to not realize the crew will be unusable until they arrive … I assume they had been trying to get another crew at the same time.
[+] walrus01|3 years ago|reply
Once you realize how many people in critical industries (power grid, telecom, global cargo/logistics) are in fact making things up as they go along, and there aren't really any highly organized and responsible people running the show, you start to worry.
[+] nostromo|3 years ago|reply
The actual answer is buried at the end of a long article.

> Unlike many rival airlines, Southwest’s planes generally hop from one city to another, rather than orbiting a major hub. That approach lets Southwest maximize use of its planes and crew, but the daisy chain structure also makes its network more delicate—problems in one corner of the country can be difficult to contain

[+] phpisthebest|3 years ago|reply
That is only part of the issue. Not every airline is a 100% hub system; most are a pretty big mix.

Further, a lot of major hubs were impacted by the storm, yet those airlines were able to transition. Why?

Well, most airlines have mobile apps, web portals and other technology so their crews can be reassigned in almost real time (just like I would get auto-booked on a new flight via the mobile app before I even knew my flight was canceled).

Instead, Southwest has systems from the '80s that require crews and customers to call and talk to an actual live human...

[+] masklinn|3 years ago|reply
I mean... it's not entirely untrue but at the same time airlines have been winding down their "hub and spoke" model for a point to point one for a while.

That's in part what doomed the A380, which was popular with airlines still going strong with hub-and-spoke (Emirates being by far the most prominent one) but is worthless in a point-to-point model.

[+] ComputerGuru|3 years ago|reply
Southwest is statistically the worst airline in terms of delays and cancellations but has deluded its customers into thinking it's the best (according to surveys asking people to rate airlines on their reliability).

https://www.insidehook.com/daily_brief/travel/airlines-fewes...

[+] skellington|3 years ago|reply
Thanks for not understanding statistics and linking to an article that also doesn't understand basic math.

SW has a high number of delays and cancellations BECAUSE THEY FLY A HIGH VOLUME OF PEOPLE. By percentage, they are in the middle of the pack for both delays and cancellations, which isn't great, but they are not the worst by any means.

How are HN people so consistently bad with basic information?

[+] vl|3 years ago|reply
But also SW attracts a very specific kind of customer. If you fly for business, or just can afford another airline, why would you fly Southwest?
[+] deathanatos|3 years ago|reply
The statistics in that article are of the "damned lies" variety: none of the values are normalized (they compared simple number of cancellations and delays without taking that as a per flight value, or perhaps better, per passenger) and they treat all delays (and cancellations) as equal; I'll take an airline often delayed by 5 minutes over an airline sometimes delayed by 3 hours.

Perhaps it's true nonetheless, but the numbers there won't tell you.

(And IME, it's perhaps true that SWA is often delayed … but by tolerable amounts. Compared to delays I've endured with Delta, where, e.g., a flight was delayed longer than the time it would take the plane to drive at highway speeds, from where it was coming from. Or … also Delta … where I was cancelled on twice in the same flight. They wanted to go 0-3 but I gave up and bought a ticket on … SWA.)
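The normalization point is easy to show with a toy calculation. The airlines and numbers below are invented purely for illustration and say nothing about any real carrier; the sketch only shows how a raw count and a per-flight rate can rank the same two airlines in opposite order.

```python
# Invented figures for two hypothetical airlines.
stats = {
    "Airline A": {"flights": 4000, "cancellations": 80},
    "Airline B": {"flights": 1000, "cancellations": 40},
}


def cancellation_rate(name):
    """Cancellations as a fraction of flights operated."""
    row = stats[name]
    return row["cancellations"] / row["flights"]


# Worst first, by raw number of cancellations.
by_count = sorted(stats, key=lambda n: stats[n]["cancellations"], reverse=True)

# Worst first, by cancellations per flight.
by_rate = sorted(stats, key=cancellation_rate, reverse=True)
```

Here `by_count` puts Airline A first while `by_rate` puts Airline B first, since A cancels 2% of its flights and B cancels 4%. A fair comparison would also weight delays by length, which this sketch does not attempt.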

[+] tyingq|3 years ago|reply
Perhaps they are thinking frequency for a trip they take often. It matters less that your Dallas->Houston flight is late when there's another one in 30 minutes during peak times.
[+] variant|3 years ago|reply
Deluded? Or could it be that customers value economy over predictability?
[+] kube-system|3 years ago|reply
I believe this has changed in recent years due to similar hiccups, and their reliability in prior years was previously good.
[+] tmpburning|3 years ago|reply
You are lucky if your flights are on time 75% of the time on average, with any airline.
[+] marze|3 years ago|reply
I find it especially ironic that SWA system failed them, and this large failure was preceded by worse and worse "near failures", since SWA is in the aviation business.

In the aviation arena, high reliability is maintained in part by careful analysis of "near failures": lessons are extracted and improvements are made to aircraft designs, procedures, etc.

By contrast, the "near failures" of the SWA system as a whole don't appear to have been utilized to motivate system improvements.

[+] thepasswordis|3 years ago|reply
I'm surprised they haven't tried to blame a cyberattack yet.

That said, I feel like these sorts of catastrophic ultra-fragile McKinsey-consulted-to-death failures we keep seeing in various industries are basically a giant signal to any adversaries that says "Hi! Check out how easy it would be to grind this entire industry to a halt!"

Resiliency is literally the opposite of efficiency. These systems need to have slack, aka inefficiency built into them. Unfortunately the business culture has moved towards ultra fragile, ultra efficient thinking.

[+] ghaff|3 years ago|reply
In part because, in this case, customers will buy the ticket that is $10 cheaper.
[+] factsarelolz|3 years ago|reply
That could potentially increase their cyber security insurance premiums if not cause the insurer to drop them immediately. Not to mention the broader impact on the market and industry as a whole.
[+] twobitshifter|3 years ago|reply
https://blog.geaerospace.com/technology/big-wins-in-flight-e...

Skysolver is a GE Flight Services trademark - there’s a video here showing how it works and SW planes. Contrary to the reddit claim, it does appear to use a predictive algorithm.

Highlight quote from the video:

“It is humanly impossible when there’s a major disruption for somebody to figure out what the optimal approach is to get them back on schedule”

[+] nlstitch|3 years ago|reply
I would be very interested in a post mortem of the software used, called SkySolver. It's supposed to be a Java application which is said to be developed by Accenture? Anyone have actual technical insights into why it failed?
[+] ProAm|3 years ago|reply
Still my go-to airline because the rest are so difficult, unfriendly or just greedy. I'd rather deal with Southwest every time to feel like I am a human being.
[+] crosen99|3 years ago|reply
It's easy to ask, "How could this happen?", but it's also a wonder this sort of thing doesn't happen more often with airlines and other businesses that rely on solutions to complex logistical challenges at their core. Overall, despite the horrors of war, perils of a pandemic, etc., sometimes I pause and ponder how remarkably well the world works.
[+] calbear81|3 years ago|reply
I’ve been lucky to have caught a flight back to SF after cancellations and can wait at home while figuring out how to get to my original destination.

What I don’t understand is how come SW couldn’t enlist help to get customers rebooked on other airlines - their phone lines were slammed (I waited 3 hours) just to get a refund since their app wouldn’t allow me to choose to rebook/cancel.

If I was as customer focused as they say they are - I would’ve contacted AMEX global travel and gotten their entire network of booking agents to backfill and rebook customers on other flights.

[+] realityking|3 years ago|reply
Southwest doesn’t have interline agreements with other airlines nor, AFAIK, the integration into reservation systems that allow rebooking onto another airline.
[+] jrochkind1|3 years ago|reply
From what we know, this to me sounds like a story about technical debt.

"Sure, it's held together with rubber bands and is a mess, but it would cost hundreds of millions to fix, and it's working, isn't it? So the programmers complain a bit, that's their job."

Which works until conditions change in some way and it catastrophically does not.

I think a lot of our society is now run on unreliable fragile software. I expect to see a lot more of this. "Automation" is especially cost-saving when you don't mind it being a fragile unreliable time-bomb.

[+] w10-1|3 years ago|reply
So much chatter!

I would expect any interview candidate to spot the issue within a minute.

For hub systems, ready crews are either at the hub, or at a spoke, ready to come back. That gives the hub a queue of ready crews, and each spoke can return a crew-plane combination to the hub when available. So with natural queues, there's no delay cascade: it's all a function of weather and crew/plane readiness.

For point-to-point systems, crew-planes are scattered, and the next flight opportunity might not be the next flight need. There is no buffer anywhere. Furthermore, any greedy/opportunistic strategy at one point can block a superior global solution.

That's the point-to-point trade-off taken by SWA. In the common case of good weather, you avoid the extra miles from going via hubs. But in the rare case of global weather shutdowns, there is no good recovery.

The only real question is whether SWA had any obligation to communicate this to investors and passengers. So far, Apple stock has gone down more than SouthWest's in this period, and passengers are remaining loyal, so no damage done.
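The buffer argument above can be sketched as a toy model. Everything here is an assumption for illustration: one rotation of legs, a fixed slack between legs, and a count of spare ready crew-plane units standing in for the hub's queue. It is not a model of any real airline's operation.

```python
def propagate(disruptions, slack, spares=0):
    """Return the departure delay in minutes for each leg of one rotation.

    disruptions[i]: new delay injected at leg i (weather, de-icing, etc.).
    slack: scheduled buffer between legs that absorbs inherited delay.
    spares: ready crew-plane units that can replace a late inbound unit.
    """
    delays = []
    carried = 0
    for injected in disruptions:
        inherited = max(0, carried - slack)
        if inherited > 0 and spares > 0:
            # A queued ready unit takes the flight, so the lateness stops here.
            spares -= 1
            inherited = 0
        carried = inherited + injected
        delays.append(carried)
    return delays


# One three-hour disruption on the second leg, 20 minutes of slack per turn.
events = [0, 180, 0, 0, 0]
point_to_point = propagate(events, slack=20, spares=0)
with_hub_queue = propagate(events, slack=20, spares=1)
```

With no spare unit the single disruption is still costing two hours on the last leg; with one spare it is contained to the leg where it happened. That containment is the trade-off being described, paid for by keeping idle units around.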

[+] francisofascii|3 years ago|reply
So they are blaming SkySolver software. The article says it is off-the-shelf software? But in other news reports, they make it sound like it was developed in-house.
[+] phpisthebest|3 years ago|reply
A lot of enterprise software is both; it is more akin to a "super framework", where you start with the base system that provides common functions but is then extendable.

Most ERP platforms are like this. Over the decades it is not uncommon for little of the original platform to be used.

My org's ERP is like that: we use probably 10% of the commercial code, and 90% of functions are custom code written in-house.

[+] dragonwriter|3 years ago|reply
> The article says it is off-the-shelf software? But in other news reports, they make it sound like it was developed in-house.

Working in enterprise, there is a lot of “modifiable off-the-shelf” software; ideally, this provides basic functionality out of the box, but also the alignment to custom business needs of in-house software.

(Often, it seems like it combines the up-front and ongoing external licensing cost of COTS software, with the internal development and maintenance costs of in-house software, and the combined problems of both.)

[+] crisdux|3 years ago|reply
I don't buy the narrative that inadequate technology is the main reason for the Southwest debacle. We must ask, why did this happen now and not before? Southwest has previously been able to better deal with disruptions like this. While the weather event did happen in the middle of their network, it wasn't unprecedented.

I think a more obvious reason is staffing issues brought on by covid, layoffs, and the vaccine mandates. They lost experienced employees who were able to wrangle the bad scheduling software. Throughout 2022, Southwest was having hiring issues because they were still mandating the vaccine through at least the summer for new employees. Their pilots association warned about this causing disruptions after a bunch of summer cancellations. Do people forget how flaky Southwest was during summer 2022? Southwest just recently reached staffing levels that matched their 2019 high. This "inadequate technology" narrative just seems like a convenient scapegoat.

[+] Supermancho|3 years ago|reply
> This "inadequate technology" narrative just seems like a convenient scapegoat.

My brother has worked in the white and blue collar unions (he prefers his ramp job). It's not like there's some impermeable cover of secrecy. These are just regular people who you can talk to. It's a combination of computer problems and regulatory controls (sleep blocks) leaving insufficient staff (and mechanical dangers) due to weather. The ramp teams were sitting at almost quad pay with no planes to service out of Minneapolis for a significant part of the weekend. This same situation has occurred, to some degree, every year.

Due to the inevitable Gulf Stream collapse, this will be a routine problem until SWA triages it.

[+] MaxHoppersGhost|3 years ago|reply
This is probably a contributing factor but will get buried by the media. However, I doubt it was the sole or even primary cause. There is definitely a staffing component here in addition to their bad software.
[+] tenebrisalietum|3 years ago|reply
> They lost experienced employees who were able to wrangle the bad scheduling software.

Employees could quit because they don't want to get vaccinated, but they also could have like just died from COVID too, or won the lottery, etc.

So to me this still points to the technology as something of a root cause. Your tech is as brittle as the number of people who know how to use it. Losing the people who make your sunk-cost old tech actually work, and not planning for the "bus factor", still makes it your fault for not addressing it.

[+] variant|3 years ago|reply
As with any event, multiple factors were involved. I have no doubt tech and process could be at the center, but our self-imposed response to COVID undoubtedly had major impacts.
[+] kube-system|3 years ago|reply
They also had mass cancellations in Oct 2021
[+] paulpauper|3 years ago|reply
For a meltdown the stock is back to where it was in October, tracking other airlines and the overall market, which keeps falling. I think people have become so accustomed to this sort of stuff that it does not affect business long term. After Covid, people are accustomed to major inconvenience when traveling.
[+] mise_en_place|3 years ago|reply
You only get bitten in the ass by tech debt after it’s too late. I’m sure management justified not paying it down because, truthfully, the consequences are never really felt until it’s too late. It’s better to pay down tech debt incrementally, instead of grand projects promising full rewrites.
[+] zx8080|3 years ago|reply
Aren't cases like this where automated solvers are expected to shine?

If, on the other hand, it's not at all about software failures as many comments here suggest ("company management lost track of crews" notion), then does it have something to do with software at all?