item 23041263

The Safety Boat: Kubernetes and Rust

219 points | DeathArrow | 5 years ago | msrc-blog.microsoft.com | reply

96 comments

[+] zozbot234|5 years ago|reply
It's a bit weird to see "several weeks" of effort being described as a problematic learning curve. At least the blog post makes it clear that the effort pays off hugely, but still, "several weeks" is not rocket surgery. It's not learning Haskell or category theory! ISTM that they're just running with an assumption that most devs wouldn't be professional and committed enough for this, which strikes me as an unwitting gatekeeping attitude.
[+] steveklabnik|5 years ago|reply
Go has a shorter time, and that’s the measuring stick in this area.
[+] ashtonkem|5 years ago|reply
Learning a new framework in a familiar language might take a few weeks. A few weeks for a new language is really fast!
[+] gameswithgo|5 years ago|reply
I think learning Rust will be harder than Haskell for a lot of people.
[+] ernado|5 years ago|reply
> we caught a significant race condition

It is a data race, not a race condition.

> and which passed the race checker for Go

No, it is not. https://github.com/helm/helm/pull/7820#issuecomment-60436062...

There is a comment by the issue author which is literally a Go data race detector warning, i.e. "WARNING: DATA RACE".

[+] Rusky|5 years ago|reply
Data races are a kind of race condition, no?
[+] gtkspert|5 years ago|reply
Also, to be clear, by “we” they really mean “a contributor”
[+] oconnor663|5 years ago|reply
IIUC, the point is that the code has been in prod for a year, but the race detector only just now found the bug? But I could be wrong.
[+] jrockway|5 years ago|reply
It is a data race. I'm guessing the race detector (go test -race) didn't detect it because they are layering multiple synchronization primitives (mutexes, channel i/o, and a WaitGroup) and their tests hit the "good" code path but production workloads didn't.

Here's what happens. Delete takes a ResourceList. It delegates to "perform" and then "batchPerform". perform calls batchPerform in a separate goroutine, which calls a helper function in another goroutine for every resource in the ResourceList. The helper function is defined in Delete and updates a data structure defined in Delete. This is a classic case where some synchronization is necessary. The function runs multiple times in multiple goroutines, and updates a single shared structure. (Perhaps not obvious because it delegates to two helper functions, and the list that the function is executed on is a "ResourceList" not a []Resource, so it isn't clear that there is a "for { go func() }" loop anywhere; the programmers did their best to make it non-obvious that a loop is occurring.)

The confounding factor here is that batchPerform tries to synchronize with a WaitGroup, but it's faulty and not enough to protect the data integrity. batchPerform creates a WaitGroup, but only calls Wait() on the WaitGroup when the "kind" of an individual resource is not equal to the "kind" passed to batchPerform. I am guessing that it's very natural to craft some test data where this condition is met, and the for loop in batchPerform only runs the function once at a time (perhaps a ResourceList of length 1). In that case, there is no race condition for the race detector to detect.

All in all, if I were reviewing this code, it would not be checked in its current form. Splitting perform and batchPerform doesn't make sense to me, and they both implement faulty synchronization logic in a slightly different way. (batchPerform uses "for { wg.Add(); go f() }; wg.Wait", perform does "for range x { go func() { ch <- f() }() }; for range x { <- ch }". I consider these pretty much exactly equivalent, but neither prevents f() from running concurrently with itself. The only reason this passed the race checker is because batchPerform doesn't actually use the WaitGroup in the normal way, instead degrading to "for range x { wg.Add(); go f(); wg.Wait() }", which DOES prevent f() from running concurrently with itself, with certain inputs.)

The root cause is that the caller of Delete isn't really sure about the semantics of "perform". Does it protect the body of the callback function? There is no documentation, and the author thought "yes". But the answer was "no". In general, the convention in Go is to consider something thread-unsafe unless it's marked as thread-safe. When you see something like "var foo Foo; f(list, func(bar){ foo = bar })" your spidey sense should be concerned about synchronization. But in this case, the code went out of its way to hide the existence of a loop and the existence of parallel processing, and so the programmer made a mistake. A bug, or at least a VERY confusing use of WaitGroup, in batchPerform allowed the tests to pass. Should the compiler detect this? It would be nice. But a code reviewer should have been super concerned about this implementation.
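To make the contrast concrete, here is a minimal Rust sketch of the same shape of code: one task per resource, all of them mutating one shared list. The `Resource`/`batch_perform` names are invented stand-ins for Helm's types, not the real API. The point is that a plain mutable capture from several threads is rejected at compile time, so the `Arc<Mutex<..>>` (and the locking) is forced on you rather than being a convention a reviewer must spot.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for Helm's resources; names are invented for illustration.
struct Resource {
    name: String,
}

// The Go code hid a "for { go func() }" loop whose closure mutated state owned
// by Delete. In Rust, capturing a plain `&mut Vec<String>` in several threads
// does not compile, so shared mutation must go through a lock.
fn batch_perform(resources: Vec<Resource>) -> Vec<String> {
    let deleted = Arc::new(Mutex::new(Vec::new()));
    let mut handles = Vec::new();
    for res in resources {
        let deleted = Arc::clone(&deleted);
        // One task per resource, as in Helm's batchPerform.
        handles.push(thread::spawn(move || {
            // The lock is mandatory here, not optional hygiene.
            deleted.lock().unwrap().push(res.name);
        }));
    }
    // Wait for every task: the equivalent of a correctly used WaitGroup.
    for h in handles {
        h.join().unwrap();
    }
    Arc::try_unwrap(deleted).unwrap().into_inner().unwrap()
}

fn main() {
    let resources = vec![
        Resource { name: "pod-a".to_string() },
        Resource { name: "pod-b".to_string() },
    ];
    let deleted = batch_perform(resources);
    assert_eq!(deleted.len(), 2);
}
```

Delete the `Mutex` (or the `Arc`) from this sketch and the compiler refuses it, which is exactly the class of review mistake the parent comment describes.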

[+] boulos|5 years ago|reply
The bug they caught [1] is one of the reasons some languages require you to explicitly name your captured variables. You still could have typed that code in, especially if you started with a for loop and then made it parallel (fwiw, perform should have been named something clearly suggesting it was parallel), but you'd at least be confronted with "oh, you went from serial, local state to a capture. Still think it's okay to explicitly borrow that state from this scope?". Then again, that's the point of Rust here :).
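The "serial loop made parallel" transition can be sketched in Rust, where the confrontation with captures is built in: `thread::spawn` demands a `move` closure, so every captured variable becomes an explicit ownership decision. (The function names here are invented for illustration.)

```rust
use std::thread;

// Serial version: mutating local state inside the loop is fine.
fn serial_sum(xs: &[i32]) -> i32 {
    let mut total = 0;
    for x in xs {
        total += x;
    }
    total
}

// Parallel version: silently capturing `total` by mutable reference from each
// task, as the Go closure did, is rejected at compile time. Instead each task
// is given ownership of its value and returns a result to be combined after
// joining.
fn parallel_sum(xs: Vec<i32>) -> i32 {
    let handles: Vec<_> = xs
        .into_iter()
        .map(|x| thread::spawn(move || x)) // each value is moved into its task
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    assert_eq!(serial_sum(&[1, 2, 3]), 6);
    assert_eq!(parallel_sum(vec![1, 2, 3]), 6);
}
```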

Fwiw, it's too bad the commit message didn't say something like "Since we're doing delete on many resources in parallel, we need to hold a lock while updating errs/res.Deleted". The reviewer was also obviously confused at first.

[1] https://github.com/helm/helm/pull/7820/commits/edb2b7511bcb9...

[+] melling|5 years ago|reply
“For comparison, last week we caught a significant race condition in another Kubernetes-related project we maintain called Helm (written in Go) that has been there for a year or more, and which passed the race checker for Go. That error would never have escaped the Rust compiler, preventing the bug from ever existing in the first place.”

I’ve heard people brag that Haskell is a great language because it’s supposedly easier to write correct code.

Rust has this same reputation?

[+] tybit|5 years ago|reply
Yes, though I believe Rust has already proven this more in practice than Haskell has.

Rust has many advocates now at places like Mozilla, Amazon and Microsoft that have delivered critical software in Rust that they believe has made it safer.

[+] Falell|5 years ago|reply
Yes. The common blurb is: In safe rust the borrow checker encourages 'fearless concurrency' by statically preventing all data races.
[+] exdsq|5 years ago|reply
It’s harder to formally prove Rust code than Haskell; the company I work at prototyped in Rust and then used the domain knowledge gained to improve the parallel implementation in Haskell.
[+] lmm|5 years ago|reply
Yes, for much the same reasons. Pretty much any ML-family language has the same effect; just having proper sum types, polymorphism and first class functions (and not having null) goes a long way to preventing huge classes of bugs.
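A tiny sketch of those ML-family features in Rust terms: a sum type with exhaustive matching, and `Option` in place of null. The `PodPhase` enum and function names here are invented for illustration.

```rust
// A sum type: the set of cases is closed, and variants can carry data.
#[derive(Debug, PartialEq)]
enum PodPhase {
    Pending,
    Running,
    Failed(String), // e.g. a failure reason
}

// The compiler rejects this match if a variant is forgotten, so "forgot the
// error case" bugs become compile errors.
fn describe(phase: &PodPhase) -> String {
    match phase {
        PodPhase::Pending => "waiting to be scheduled".to_string(),
        PodPhase::Running => "up".to_string(),
        PodPhase::Failed(reason) => format!("failed: {}", reason),
    }
}

// Option<T> replaces null: absence is a value the caller must handle.
fn find_phase(name: &str) -> Option<PodPhase> {
    if name == "web" {
        Some(PodPhase::Running)
    } else {
        None
    }
}

fn main() {
    assert_eq!(describe(&PodPhase::Running), "up");
    assert!(find_phase("db").is_none());
}
```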
[+] yongjik|5 years ago|reply
I think the common saying is that once your Haskell code compiles, it's usually correct.

The fine print is that nobody claimed it's easy to write Haskell code that compiles.

[+] shock|5 years ago|reply
> One of the biggest ones to point out is that async runtimes are still a bit unclear. There are currently two different options to choose from, each of them with their own tradeoffs and problems. Also, many of the implementation details are tied to specific runtimes, meaning that if you have a dependency that uses one runtime over another, you’ll often be locked into that runtime choice.

My understanding of how async/await works in Rust is that you can have multiple async runtimes in one Rust program. Is that not the case?

[+] roblabla|5 years ago|reply
That is the case, but it's super awkward to use. Basically, you cannot await a tokio future on an async-std runtime, or an async-std future on a tokio runtime. You can, however, have both runtimes running at the same time, and use some form of message-passing to bridge them.

It's definitely easier to only deal with one runtime. Ideally, we should have some kind of abstraction to allow crates to support both runtimes (e.g. a trait that'd allow creating an async TcpSocket of the right "kind" for your runtime), but AFAIK this is not currently done.
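The message-passing bridge can be sketched with std only, using plain threads as stand-ins for two async runtimes (no tokio or async-std here, so this is an analogy, not the real setup). Each "runtime" runs its own loop; they exchange plain values over channels and never await each other's futures.

```rust
use std::sync::mpsc;
use std::thread;

// Two independent "runtimes" bridged by channels carrying plain data.
fn bridge() -> u32 {
    let (to_b, from_a) = mpsc::channel::<u32>();
    let (to_a, from_b) = mpsc::channel::<u32>();

    // "Runtime A": produces work and ships it across the bridge.
    let a = thread::spawn(move || {
        to_b.send(21).unwrap();
        from_b.recv().unwrap() // block until B answers
    });

    // "Runtime B": consumes a message, processes it, replies.
    let b = thread::spawn(move || {
        let n = from_a.recv().unwrap();
        to_a.send(n * 2).unwrap();
    });

    b.join().unwrap();
    a.join().unwrap()
}

fn main() {
    assert_eq!(bridge(), 42);
}
```

With real runtimes you'd use each runtime's own channel type on its side of the bridge, but the shape is the same: data crosses the boundary, futures don't.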

[+] Rusky|5 years ago|reply
It's possible, but the downsides are a bit silly when you look at how the Future trait was designed to allow tasks to be runtime-agnostic.

There is ongoing work to standardize more runtime interfaces so that more libraries can be runtime-agnostic.

[+] gas9S9zw3P9c|5 years ago|reply
In theory you can, but in practice it would make your code very messy. If your dependency is using runtime A and you are using runtime B - how would they interact? Runtimes like tokio also provide convenience macros for your main method, kind of locking you into them (I think), at least for that code path.

If your application has two parts or binaries that are completely separate you could potentially use two different runtimes, but otherwise I don't think it would make sense. And even then, it would just be a mess.

Right now, your runtime is essentially picked for you by your dependencies.

[+] mappu|5 years ago|reply
Going off on a tangent, but this exact problem would be a worst-case scenario for Go getting user-defined generic types instead of only the current blessed ones.

t. C++ developer with a mixed std::string/QString/BSTR codebase.

[+] Thaxll|5 years ago|reply
I don't believe for one second that it takes just a couple of weeks for an average SE to become proficient in Rust.
[+] steveklabnik|5 years ago|reply
It really depends on so many factors it’s extremely hard to tell. We’ve brought folks at Cloudflare up to speed roughly that fast.

“average” and “proficient” are both very variable in that statement, imho.

[+] acdha|5 years ago|reply
Depends on your definition of average: I found that to be the case for someone with significant experience in traditional languages (notably not something like Haskell), so I think it's plausible, since the compiler, editor, and documentation are rather above average for newcomers. In particular, Cargo's easy tooling and the compiler's really helpful error messages seemed to shorten the time to write a first real program that does something useful.

Edit: one other big factor - presumably in their environment you have coworkers to get advice from. That’s huge when you’re first starting.

[+] pas|5 years ago|reply
We became pretty comfy with it in less than a month in Aug 2017. (Let's say average guy had a few years of Python and this-and-that before that, and a ~5 year CS degree before that.) Sure, there was no async/await anywhere yet, but no crossbeam-channels either. And there were a lot less friendly tutorials and there were a bit more rough edges. (Especially that we did "IoT" so cross-compiling was ... an experience.)
[+] empath75|5 years ago|reply
IME it takes 2-3 months for a talented senior developer to get comfortable with it.
[+] aganame|5 years ago|reply
”Several weeks of hard effort”, they said. I can buy it, if they actually work hard and are basically competent. Rust is a difficult language in total, that’s for sure, but you can get a lot done without knowing it all.
[+] xrd|5 years ago|reply
After reading this article, I'm excited about finding a reason to write a component in Rust and WASM. Can anyone recommend the best getting-started guide for dipping your toes in the water? This article didn't have a link to anything that seemed appropriate for that goal.
[+] ronlobo|5 years ago|reply
It is exciting to see Microsoft putting so much effort into Rust and WASM.

The Rust onboarding experience is incredibly explicit and once things start to click and code compiles, you're on the train.

[+] conroy|5 years ago|reply
I looked into WASM / WASI last week but couldn't find an answer to this anywhere: can I write a network service in Rust and compile it to WASM / WASI?

I know that wasmtime can execute a WASM module and give it access to a file system. Can that filesystem contain a socket that the WASM module can interact with?

[+] jononor|5 years ago|reply
Very curious as to why you would want to do that? If you want a network service, WASM does not seem to help with much, only complicate things?
[+] whb07|5 years ago|reply
There’s a version of NGINX that’s compiled to WASM.

Conceivably you could compile all of the CPython runtime into WASM, just that you’d be left with a big binary that gets passed around all the time over the wire.

[+] mappu|5 years ago|reply
You could speak FastCGI (or plain HTTP) over stdin/stdout, although that won't get you accept(2) semantics without some other kind of layering.
[+] Klasiaster|5 years ago|reply
Good article, but it somehow suggests that because there is no garbage collection you need to fight the borrow checker. This isn't entirely true, because you can put your data in a Box (so that it is stored on the heap instead of the function's stack) and you can wrap it in a mutex with reference counting (Arc+Mutex, or Rc+RefCell for single-threaded code), which roughly gives you what garbage collection does.

Also, cloning can avoid solving the borrow-check puzzle if you don't need shared state. Of course you would not want to pack your code with Arc+Mutex or data copying if performance matters, but it's fine for a beginner to start with when writing Rust, and then learn to do the optimized borrow version a bit later when needed.
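Both escape hatches the comment mentions can be shown in a few lines; the function names below are invented for illustration. `Rc<RefCell<T>>` moves the borrow check to runtime for single-threaded shared mutation (use `Arc<Mutex<T>>` across threads), and cloning sidesteps sharing entirely.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Shared mutable state without fighting the borrow checker: two owners of the
// same heap value, with borrows checked at runtime instead of compile time.
fn shared_counter() -> i32 {
    let counter = Rc::new(RefCell::new(0));
    let alias = Rc::clone(&counter); // second owner of the same value
    *alias.borrow_mut() += 1;
    *counter.borrow_mut() += 1;
    let total = *counter.borrow();
    total
}

// No shared state at all: a clone gives each side its own copy, so there is
// no borrow puzzle to solve (at the cost of copying the data).
fn cloned_copy() -> (Vec<i32>, Vec<i32>) {
    let original = vec![1, 2, 3];
    let copy = original.clone();
    (original, copy)
}

fn main() {
    assert_eq!(shared_counter(), 2);
    let (a, b) = cloned_copy();
    assert_eq!(a, b);
}
```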
[+] pjmlp|5 years ago|reply
It doesn't give you the productivity that GC enables for writing GUI code and UI designers.

Imagine having Jetpack Compose, SwiftUI, Qt Designer, or WPF/UWP Blend in Rust.

[+] klitze|5 years ago|reply
It leaves a weird taste that Microsoft is preferring Rust over Golang, considering that Golang is a Google thing.

Don’t get me wrong, all the technical arguments are correct and Rust does have advantages for cloud software. But this also comes in quite handy for MS. :)

[+] pjmlp|5 years ago|reply
VSCode support for Go, and some Delve improvements, were actually developed by Microsoft.
[+] sittingnut|5 years ago|reply
rust and kubernetes - post with mostly useless hype monsters united.
[+] rvz|5 years ago|reply
> For comparison, last week we caught a significant race condition in another Kubernetes-related project we maintain called Helm (written in Go) that has been there for a year or more, and which passed the race checker for Go. That error would never have escaped the Rust compiler, preventing the bug from ever existing in the first place

While the possible security benefits of Rust are interesting in software like Kubernetes, it seems like this whole blog post is an implicit RIIR proposal for the Kubernetes ecosystem from a Microsoft software engineer, which isn’t going to happen anytime soon.

> Rust has made great progress in the past year with its async story, but there are still some issues that are being worked out.

On top of that, there are still many crates that aren’t using async-await yet and most are not even 1.0, thus are not stable. I would not touch such crates if they are still immature or even unsafe.

Realistically, a Rust Kubernetes is possible, but practically the effort for a production-ready version is measured in years.

[+] steveklabnik|5 years ago|reply
Kubernetes is an ecosystem. It doesn’t need to be written in Rust for Rust components to play a part. Helm is not Kubernetes, for example, though your comment seems to blur the two. There are folks writing stuff to interact with the broader ecosystem in Rust. That’s one of the interesting bits of networked systems! You can be heterogeneous with languages more easily when the network/api is the boundary.
[+] pjmlp|5 years ago|reply
Plenty of Kubernetes stuff is actually written in Java, .NET and other languages, not necessarily Go. Thankfully.