hoytech | 3 years ago

In 2015 I was working at a "fintech" company and a leap second was announced. It was scheduled for a Wednesday, unlike all others before which had happened on the weekend, when markets were closed.

When the previous leap second was applied, a bunch of our Linux servers had kernel panics for some reason, so needless to say everyone was really concerned about a leap second happening during trading hours.

So I was assigned to make sure nothing bad would happen. I spent a month in the lab, simulating the leap second by fast forwarding clocks for all our different applications, testing different NTP implementations (I like chrony, for what it's worth). I had heaps of meetings with our partners trying to figure out what their plans were (they had none), and test what would happen if their clocks went backwards. I had to learn about how to install the leap seconds file into a bunch of software I never even knew existed, write various recovery scripts, and at one point was knee-deep in ntpd and Solaris kernel code.

After all that, the day before it was scheduled, the whole trading world agreed to halt the markets for 15 minutes before/after the leap second, so all my work was for nothing. I'm not sure what the moral is here, if there is one.

hcrisp|3 years ago

Reminds me of the story of the computer engineer at Data General in Tracy Kidder's nonfiction book, "The Soul of a New Machine" [0], who quit after spending weeks toiling away on sub-second timing concerns:

> He went away from the basement and left this note on his terminal: "I'm going to a commune in Vermont and will deal with no unit of time shorter than a season."

[0] https://en.m.wikipedia.org/wiki/The_Soul_of_a_New_Machine

xwolfi|3 years ago

What should I say, trying to go sub millisecond, one profiling run at a time, sigh...

Konohamaru|3 years ago

Maybe he should get Stephen Colbert's second-by-second day planner.

divbzero|3 years ago

This is a good story regardless, but if you do want to derive some morals from the experience:

– Seemingly simple tasks can be more complex than you expect (“add a leap second on this Wednesday”)

– Real world systems can be more complex than you expect (“bunch of software I never even knew existed”)

– Planning and testing can make a big difference vs. just winging it (“a bunch of our Linux servers had kernel panics for some reason”)

– Success can be a non-event that goes unnoticed (“everything worked and no money went missing”)

– Sometimes the best solution is not a technical solution (“halt the markets for 15 minutes before/after”)

yreg|3 years ago

>Sometimes the best solution is not a technical solution (“halt the markets for 15 minutes before/after”)

We've had an election recently, right on the day when DST changed. On the night of counting of the votes, the clock went 2:59 AM -> 2:00 AM.

To save themselves trouble the Statistics Office instructed all vote counters that under no circumstances are they to enter or update anything in any system during the repeating hour until it's 3:00 AM the second time…
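The repeated hour is ambiguous because one wall-clock label maps to two distinct instants. A minimal Python sketch of that, assuming a Central European zone and the date 2022-10-30 (both are illustrative guesses, the comment names neither), with fixed offsets standing in for a real time zone database:

```python
from datetime import datetime, timedelta, timezone

# Assumed offsets: summer time (UTC+2) before the change, standard time (UTC+1) after.
CEST = timezone(timedelta(hours=2), "CEST")
CET = timezone(timedelta(hours=1), "CET")

# 02:30 happens twice that night: once before the clocks go back, once after.
first_pass = datetime(2022, 10, 30, 2, 30, tzinfo=CEST)
second_pass = datetime(2022, 10, 30, 2, 30, tzinfo=CET)


def wall_clock(dt: datetime) -> str:
    """What a vote counter would read off the wall, with no offset attached."""
    return dt.strftime("%Y-%m-%d %H:%M")


same_label = wall_clock(first_pass) == wall_clock(second_pass)
# Subtracting aware datetimes compares the underlying instants.
gap = second_pass - first_pass

print(same_label, gap)
```

Both entries carry the same label but sit an hour apart, so a system that stores only local wall time cannot order them. Storing UTC (or the offset alongside the local time) avoids that; telling people not to type anything for an hour avoids it too.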

a9h74j|3 years ago

> Sometimes the best solution is not a technical solution

I once came across an early 1950s Scientific American article by Bertrand Russell, IIRC. It included a cartoon.

Frame one: Computer beats man at chess.

Frame two: Man unplugs computer.

zinekeller|3 years ago

> – Success can be a non-event that goes unnoticed (“everything worked and no money went missing”)

And yet, there are still Y2K deniers (to be fair some people have exaggerated it to the point that they're promoting it as the end of the world).

the_black_hand|3 years ago

> Sometimes the best solution is not a technical solution (“halt the markets for 15 minutes before/after”)

I'm a little confused. How does this solve the problem? If you don't code for the second, you'll still be off if you wait. Am I missing something?

ycombobreaker|3 years ago

Sometimes it pays to be the most-prepared among your cohort. In this case, it would have paid so well that your cohort decided to work around it.

It always pays to not be the least-prepared among your cohort. You'll get no sympathy if you're at the back of the pack, you'll just die.

xmprt|3 years ago

Another moral of the story could be that sometimes it's best to have a people solution to a technical problem.

bahmboo|3 years ago

You got paid to dig extremely deeply into a very complex and important problem spanning multiple systems and domains. You developed a plan, tested it and were ready to act.

This is a hugely valuable learning experience few people even get a chance at, let alone solve. The only downside is that it doesn’t show up on your resume!

oblio|3 years ago

Résumé, no.

Interview discussion? If you're any good at interviewing, it should.

Shared404|3 years ago

The moral is we get to hear your cool war story. Thanks for sharing!

...okay yeah that's not a moral, but still.

imglorp|3 years ago

Great, here's another.

$work had thousands of full-custom, DSP-heavy location measurement hardware devices widely deployed in the field for UTDOA location of cell phones. They used GPS as a time reference -- if you know your location, you can get GPS time accurate to within tens of nanoseconds. GPS also broadcasts a periodic almanac which includes leap second offsets: apply the offset to GPS time and you can derive UTC. Anyway, there were three models of these units, each with an off-the-shelf GPS chip from one of three location vendors you've probably heard of. The chip firmware was responsible for handling leaps.

One day, a leap second arrived from the heavens. We learned the three vendors all exhibited different behaviors! Some chips handled the leap fine. Some ignored it. Some just crashed, chip offline, no bueno, adios. And some went into a state that gave wildly wrong answers. After a flurry of log pulling, debugging, console cabling, and truck rolls, we had a procedure to identify units in bad states and reset them without too many getting bricked.

It seems the less likely an event is to occur, the less likely your vendor put work into handling it.
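For readers unfamiliar with the GPS-to-UTC step mentioned above: GPS time has no leap seconds, so a receiver derives UTC by subtracting the broadcast offset. A rough Python sketch, where `gps_to_utc` is a hypothetical helper and not any vendor's firmware logic; 16 s and 17 s are the GPS-UTC offsets before and after the 30 June 2015 leap second, and the sample timestamp is made up for illustration:

```python
from datetime import datetime, timedelta, timezone

# GPS time counts seconds from this epoch and never inserts leap seconds.
GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)


def gps_to_utc(gps_seconds: float, leap_offset: int) -> datetime:
    """Derive UTC from GPS seconds by removing the accumulated leap seconds."""
    return GPS_EPOCH + timedelta(seconds=gps_seconds - leap_offset)


# Illustrative GPS reading taken one minute after the leap second (assumed value).
true_utc = datetime(2015, 7, 1, 0, 1, 0, tzinfo=timezone.utc)
gps_now = (true_utc - GPS_EPOCH).total_seconds() + 17

correct = gps_to_utc(gps_now, leap_offset=17)  # firmware that picked up the new offset
stale = gps_to_utc(gps_now, leap_offset=16)    # firmware that ignored the leap
error = stale - correct

print(correct.isoformat(), stale.isoformat(), error)
```

A chip that keeps the old offset reports UTC one second fast. That is the "ignored it" case; the crashes and wildly wrong answers are different failures that this sketch does not model.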

svara|3 years ago

The moral of the story is that laziness is a virtue. Think of all the time that could have been saved, had you had no plans like your partners ;)

jimmaswell|3 years ago

Think of all the time that could be better appropriated than on fintech in general. Seems like such a waste of resources siccing a bunch of computers against each other in a zero-sum game of stock arbitrage. I will admit some of the tech that comes out of it is cool on its own at least.

akira2501|3 years ago

I think it's the other way around. They had a problem which previously only impacted weekends, so it was written off entirely without consideration of whether this was happening by rule or by convention. They knew it would be a concern on any other day and yet did nothing until the day it was announced.

Idle curiosities can lead to their own waste, but the kernel panic was probably worth digging into earlier.

dylan604|3 years ago

>I'm not sure what the moral is here, if there is one.

Apparently, it's about as useful as the leap second itself ;)

I feel your pain though, as I've spent weeks on something only for it to be tossed away like it was nothing at the last second. I guess that's how Google devs feel when their projects are deprecated. At least theirs saw the light of day and provided some validation

rkagerer|3 years ago

Here, have a bright shiny imaginary internet point. It doesn't nearly do justice but thanks in any case for sharing your story.

RunSet|3 years ago

A bit of a tangent but I have observed that whenever a networking record for bandwidth is broken it is typically by a nonprofit such as a university, but whenever a networking record for latency is broken it is more often than not by someone in the "fintech" industry developing a faster bag-passing mechanism.

It is clear to me that the disparity of latency creates islands of privilege. I mentioned this to someone in the industry once and they replied that what the layman perceives as parasitic middlemen actually provide valuable liquidity. When I asked whether they considered ticket-scalpers to likewise provide liquidity they claimed that was not at all the same thing.

TOGoS|3 years ago

> I'm not sure what the moral is here

I think the moral is that it'd be a lot easier if we could just stop messing with the clocks, or at least push more technical things towards only caring about a closest-to-a-global-high-precision-monotonic-clock-as-relativity-allows rather than worrying about what the clocks on the walls say, which is more a personal matter of how much you care or don't where the sun is in the sky at 12:00:00.000.
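In practice, the "only care about a monotonic clock" idea looks roughly like this for anything that measures durations. A small Python sketch (the `timed` helper is a made-up name): `time.monotonic()` cannot go backwards and is unaffected by system clock updates, whereas `time.time()` follows the wall clock and can be stepped.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds).

    Uses the monotonic clock, so a wall-clock step during the call
    (leap second, NTP correction, manual change) cannot produce a
    negative or inflated duration.
    """
    start = time.monotonic()
    result = fn(*args)
    elapsed = time.monotonic() - start
    return result, elapsed


result, elapsed = timed(sum, [1, 2, 3])
print(result, elapsed >= 0.0)
```

This only covers intervals within one machine. Timestamps that must be compared across machines or shown to humans still need wall-clock time, which is where the leap second trouble comes back in.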

kqr|3 years ago

I was gonna say, why not just close all positions and turn off the computers around the leap second? How much are you realistically gonna lose by missing a few minutes of trading, compared to the alternative risk?

Edit: I guess the other way to look at it is how much you can make on a few minutes of trading, seeing that it was worth putting at least one software engineer on it for a long time despite the risks...

alfalfasprout|3 years ago

It can be extremely costly to close positions (from a tax perspective this is often a big no-no too).

telotortium|3 years ago

> I'm not sure what the moral is here, if there is one.

As the CIA director in Burn After Reading says, "I guess we learned not to do it again."

gnu8|3 years ago

It is a uniquely crummy feeling to have your work go unused like that, but you shouldn’t let it discourage you. You reached a level of mastery on this particular thing that few people have, which is evidenced by the fact that no one else in the trading community was able to reach your company’s level of confidence and they decided to wait out the leap second instead.

a9h74j|3 years ago

> no one else in the trading community was able to reach your company’s level of confidence

So his work contributed to community wisdom, and that influential community has probably had some say in cancelling leap seconds. I wouldn't call his work wasted. I would call that notably few degrees-of-separation in making an observable difference.

justinpombrio|3 years ago

> I'm not sure what the moral is here, if there is one.

Always procrastinate :-)

programmarchy|3 years ago

Moral of the story is that sometimes social engineering is much cheaper and more effective than software engineering!

magpi3|3 years ago

Yes, I agree, and I do think this is another example of worse is better. The complex but correct solution is to do the hard work the OP did. But the simple but better solution is to simply halt the markets.

pmontra|3 years ago

Contingency plans have their own contingency plans. Maybe trading companies started talks to stop the market months before your company assigned that task to you, in case of no agreement or a negative one.

prottog|3 years ago

It's the software engineering equivalent of the crypto nerd with the super-strong encryption, beat by a five-dollar wrench attack[0]. ;-)

Sometimes the best (for some definition of best) solution to a problem is to side-step it entirely.

[0]: https://xkcd.com/538/

emeraldd|3 years ago

The moral here is that you and people in similar positions convinced everyone else that there was too much risk to go forward, either by direct or indirect action and implication. Sometimes, just seeing what your own team needs to feel safe and seeing what everyone else is or is not doing on the same front is enough to make the call one way or the other.

ilyt|3 years ago

We just enabled leap second smearing on chrony.

mikepurvis|3 years ago

I think that's the only reasonable way to handle this kind of thing, though I bet that accurate time matters enough in fintech that you'd still have some cases where you'd need access to the "true" wall time in order to stamp logs for auditing or whatever.

ComputerGuru|3 years ago

The interesting thing is that for the less careful, the 15-min before/after halt may have been not enough. You knew enough not to use a time smearing NTP server but others that didn’t obsess as you did might have been off by a fraction of a second for the entire 24 hour period leading up to it.
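To put a number on that: a smearing server spreads the extra second over a window, so clients following it disagree with unsmeared UTC by a fraction of a second for the whole window. A sketch of a linear smear in Python, assuming a 24-hour window that ends at the leap second (real deployments choose their own window and shape, and `smear_offset` is a hypothetical name):

```python
def smear_offset(seconds_until_leap: float, window: float = 86_400.0) -> float:
    """Seconds of the leap second already absorbed by a linearly smeared clock.

    Returns 0.0 before the window opens, 1.0 once the leap second has
    passed, and a proportional fraction in between.
    """
    if seconds_until_leap >= window:
        return 0.0
    if seconds_until_leap <= 0:
        return 1.0
    return (window - seconds_until_leap) / window


halfway = smear_offset(43_200)      # 12 hours before the leap second
at_halt_start = smear_offset(900)   # 15 minutes before, when the halt began
print(halfway, round(at_halt_start, 2))
```

Under these assumptions a smeared clock is half a second away from an unsmeared one twelve hours out, and roughly 0.99 s away when the halt starts, so two parties on different schemes would disagree on timestamps well outside the 15-minute halt.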

dpkirchner|3 years ago

The moral is that sometimes we humans can choose not to let the perfect be the enemy of the good (enough).

ozim|3 years ago

What if the whole trading world had not agreed? No one could have known that in advance.

quickthrower2|3 years ago

A bit like buying car insurance and not claiming. Still possibly worth it.

msla|3 years ago

Don't conclude it's that hard for everyone until you've spoken to a good subset of different people.

You're conscientious and willing to dig in to the details to fix a problem. Plenty of people aren't, and plenty of those are doing the same job as you. Look up from your own little world and try to figure out what other people are doing, how they're doing it, and why. This applies generally: If you fixate on a specific language or toolkit, you'll miss others which fix or obviate bugs you were resigned to living with. Same with OSes and environments. It even applies to relationships, which is why a big hallmark of abuse is isolating the victim.

dudeinjapan|3 years ago

I worked in algo trading at a major bank in Japan. Japan's time zone is UTC+9. Markets open at 9am. A leap second brought down our trading right at the open.
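The arithmetic behind that timing: leap seconds are inserted at 23:59:60 UTC, which is 08:59:60 in Tokyo, the second immediately before a 9:00 open. A short Python check, using the 30 June 2015 leap second as an assumed example (the comment does not say which one it was):

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9), "JST")

# Last ordinary second before the leap second, expressed in UTC.
last_normal_second = datetime(2015, 6, 30, 23, 59, 59, tzinfo=timezone.utc)
in_tokyo = last_normal_second.astimezone(JST)
print(in_tokyo.isoformat())

# The leap second itself would be 08:59:60 JST, which Python's datetime
# cannot represent: the seconds field only accepts 0..59.
try:
    datetime(2015, 7, 1, 8, 59, 60, tzinfo=JST)
    leap_representable = True
except ValueError:
    leap_representable = False
print(leap_representable)
```

The fact that a mainstream datetime type rejects the leap second outright is a small hint at why so much software handles it badly.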

dclowd9901|3 years ago

Moral: “if there is a lot of work involved solving a technical problem, there’s probably a lot less work involved solving it non-technically”

tsol|3 years ago

Moral of the story is that insurance is expensive

terminal_d|3 years ago

Wow. You should write a post about it that goes deep into what you had to do to make it work.

modernpink|3 years ago

>I spent a month in the lab

Do you mean at your desk? What is a lab in a fintech context?

martyvis|3 years ago

For most of us that are doing implementation engineering, a lab is simply a collection of the gear that can be put together in a simulation of the production environment without being constrained by formalities. For me it would be a bunch of network and server kit and cables in a rack.

mgsouth|3 years ago

Not OP, but at several jobs our labs were small server rooms stuffed with network gear, servers, and client PCs. They were used for end-to-end simulations and tests. It wasn't uncommon to actually do work in the lab, keeping an eye on the blinky lights or somesuch.

pfarrell|3 years ago

I think the moral is, “those who fail to prepare are preparing to fail”.

qudat|3 years ago

The moral of the story is if everyone is slacking then you can as well

alexfromapex|3 years ago

The moral is it’s a waste of time either way

karmakaze|3 years ago

Couldn't your "fintech" company decide to halt trading 15-min before/after on its own without the agreement of the trading world?