top | item 38300425

NilAway: Practical nil panic detection for Go

241 points| advanderveer | 2 years ago |uber.com

242 comments

order

anonacct37|2 years ago

I do like the approach of static code analysis.

I found it a little funny that their big "win" for the nilness checker was some code logging nil panics thousands of time a day. Literally an example where their checker wasn't needed because it was being logged at runtime.

It's a good idea but they need some examples where their product beats running "grep panic".

lazaroclapp|2 years ago

Actually, if we were running into cases where we aren't logging a panic which is actually happening in production, then the first thing to note is that we need to improve our observability. The issue might or might not be recoverable, but it should be logged. If nothing else, it should show up as a service crash somewhere within those logs, which is also something that service owners monitor and get alerts on.

The advantage of NilAway is not just detecting nil panic crashes after the fact (as you note, we should always be able to detect those eventually, once they happen!), but detecting them early enough that they don't make it to users. If the tool had been online when that panic was first introduced, it would have been fixed before ever showing up in the logs (Presumably, at least! The tool is not currently blocking, and developers can mistake a real warning for a false positive, which also exist due to a number of reasons both fundamental and just related to features still being added)

But, on the big picture, this is the same general argument as: "Why do you want a statically typed language if a dynamically typed one will also inform you of the mismatch at runtime and crash?" "Well, because you want to know about the issue before it crashes."

Beyond not making it all the way to prod, there is also a big benefit of detecting issues early on the development lifecycle, simply in terms of the effort required to address them: 'while typing the code' beats 'while compiling and testing locally' beats 'at code review time' beats 'during the deployment flow or in staging' beats 'after the fact, from logs/alerts in production', which itself beats 'after the fact, from user complains after a major outage'. NilAway currently works on the code review stage for most internal users, but it is also fast enough to run during local builds (currently that requires all pre-existing warnings in the code to either be resolved or marked for suppression, though, which is why this mode is less common).

carbocation|2 years ago

Just tried this out on some of my own code and it nails the warts that I had flagged as TODOs (and a few more...). The tool gives helpful info about the source of the nil, too. This is great.

Too|2 years ago

Building a type checker on global inference is the kind of thing that sounds romantic in academia - "no type definitions and yet get type checking!" - but ends up being a nightmare to use in practice.

Nilability of return values should be part of functions public interface. It shouldn't come as a surprise under certain circumstances of using the code. The problem of global inference is that it targets both the producer and the consumer of the interface at the same time, without a mediating interface definition deciding who is correct. If a producer starts returning nil and a consumer five levels downstream the call-stack happens to be using it, both the producer and caller is called out, even if that was documented public api before, just never executed. Or vice versa.

For anyone who had the great pleasure of deciphering error messages from C++ templates, you know what I'm talking about.

I understand the compromises they had to take due to language constraints and I'm sure this will be plenty useful anyway. Just sad to see that a language, often called modern and safe, having these idiosyncrasies and need such workarounds.

mrkeen|2 years ago

> Building a type checker on global inference is the kind of thing that sounds romantic in academia - "no type definitions and yet get type checking!" - but ends up being a nightmare to use in practice.

Hi! I use global type inference and I love it.

ludiludi|2 years ago

I got a nil pointer deref panic trying to use this tool:

$ nilaway ./...

panic: runtime error: invalid memory address or nil pointer dereference [recovered]

panic: runtime error: invalid memory address or nil pointer dereference

[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x100c16a58]

technics256|2 years ago

Is there any movement in the language spec to address this in the future with Nil types or something?

baby|2 years ago

I really love Golang and how it focused on making the job of the reader easy. But with today’s modern programming language the existence of null pointer dereference bugs doesn’t really make sense anymore. I don’t think I would recommend anyone to start a project in Golang today.

Maybe we’ll get a Golang 3 with sum types…

aatd86|2 years ago

If you squint really hard, the work on generics is a step toward the future.

If you don't squint, then I don't think so.

tw061023|2 years ago

Without intention to offend. It's Golang, the language that famously ignored over 30 years of progress in language development for the sake of simplicity.

What answer do you expect?

iot_devs|2 years ago

Wonderful job.

I am toying around with a similar project, with the same goal, and it is DIFFICULT.

I'll definitely get to learn from their implementation.

aatd86|2 years ago

Very interesting work. I wonder what were the difficulties encountered. Aliasing? Variable reassignment wrt short declaration shadowing?

Hopefully with time, when exploring union types and perhaps a limited form of generalized subtyping (currently it's only interface types) we'll be able to deal with nil for good.

Nil is useful, as long as correctly reined in.

mrkeen|2 years ago

> Nil is useful, as long as correctly reined in.

A good way to rein in behaviour is with types. If you need Nil in your domain, great! Give it type 'Nil'.

candiddevmike|2 years ago

It's really easy to check a field of a pointer struct without first checking the struct is non nil. Would be interesting if go vet or test checked this somehow.

luxurytent|2 years ago

Plenty of Go commentary in this thread but can I just say I'm glad to have learned about nilness? Suffered through a few nil pointer dereferences after deploying and having this analyser enabled in gopls (off by default for me at least) is a nice change.

Tested via vim and looks good!

earthboundkid|2 years ago

I tried it but got too many false positives to be useful.

tptacek|2 years ago

I tried it and got a lot of false positives, but there wasn't so much output that I couldn't quickly pick out the interesting cases. This is very cool.

lazaroclapp|2 years ago

We'd be interested in the general characteristics of the most common ones you are seeing. If you have a chance to file a couple issues (and haven't done so yet): https://github.com/uber-go/nilaway/issues

We definitely have gotten some useful reports there already since the blog post!

We are aware of a number of sources of false positives and actively trying to drive them down (prioritizing the patterns that are common in our codebase, but very much interested in making the tool useful to others too!).

Some sources of false positives are fundamental (any non-trivial type system will forbid some programs which are otherwise safe in ways that can't be proven statically), others need complex in-development features for the tool to understand (e.g. contracts, such as "foo(...) returns nil iff its third argument is nil"), and some are just a matter of adding a library model or similar small change and we just haven't run into it ourselves.

mrkeen|2 years ago

Does a false positive mean:

- You're confident that a flagged value is actually non-Nil?

- A value was Nil but you prefer it that way?

pluto_modadic|2 years ago

cool... what does this mean the best linter / correctness checking is at the moment?

I have some code that eventually core dumps and honestly I don't know what I'm doing wrong, and neither do any golang tools I've tried :(

maaaaaybe there's something that'll check that your code never closes a channel or always blocks after a specific order of events happens...

mseepgood|2 years ago

I don't think a pure Go program can core dump, unless you use Cgo (wrongly) or unsafe. It can only panic.

adtac|2 years ago

I'm not sure if that was the best example to showcase NilAway. I understand there's a lot of context omitted to focus on NilAway's impact, but why is foo returning a channel to bar if bar is just going to block on it anyway? Why not just return a *U? If foo's function signature was func foo() (*U, error) {}, this wouldn't be a problem to begin with.

thiht|2 years ago

Not the point.

Seb-C|2 years ago

I have been thinking about this problem for a long time as well.

But I think that focusing on nils is a wrong analysis. The problem is the default zero-values dogma, and that is not going to change anytime soon.

Sometimes you also need a legitimate empty string or 0 integer, but the language cannot distinguish it from the absence of value.

In my codebase, I was able to improve the readability of those cases a lot by using mo.Option, but that has a readability cost and does not offer the same guarantees than a compiler would. The positive side is that I get a panic and clear stack trace whenever I try to read an absent value, which is better than nothing, but still not as good as having those cases at compile time.

No amount of lint checkers (however smart) will workaround the fact that the language cannot currently express those constraints. And I don't see it evolving past it's current dogmas unfortunately, unless someone forks it or create something like typescript for go.

xyzzy_plugh|2 years ago

It's not a dogma it's a breaking change to the language. Removing default zero-values is effectively a different language. Literally none of my code over the past several years (which otherwise all still works perfectly as-is) would work.

The Go team is very careful to avoid breaking changes (cue all the usual Well Actually comments regarding breaking changes that affected exactly zero code bases) and rightfully so. Their reputation as a stable foundation to build large projects upon has been key to the success and growth of the language and its ecosystem.

I have about a million and one other issues I'd like to see resolved first that don't involve breaking changes. It's a known pain point, the core maintainers acknowledge it, but suggestions to fundamentally derail the entire project are ludicrous.

Focusing on nils is fine. NilAway is fine. It's a perfectly reasonable approach and adds a lot of value. This solves a real problem in real code bases today. There is no universe wherein forking to create a new language creates remotely equivalent value.

insanitybit|2 years ago

> Nil panics are found to be an especially pervasive form of runtime errors in Go programs. Uber’s Go monorepo is no exception to this, and has witnessed several runtime errors in production because of nil panics, with effects ranging from incorrect program behavior to app outages, affecting Uber customers.

Insane that Go had decades of programming mistakes to learn from but it chose this path.

Anyway, at least Uber is out there putting out solid bandaids. Their equivalent for Java is definitely a must-have for any project.

tuetuopay|2 years ago

> Insane that Go had decades of programming mistakes to learn from but it chose this path.

Yup, every time I write some Go I feel like it's been made in a vaccum, ignoring decades of programming language. null/nil is a solved problem by languages with sum types like haskell and rust, or with quasi-sums like zig. It always feels like a regression when switching from rust to go.

Kudos to Uber for the tool, it looks amazing!

marsavar|2 years ago

While this is a great piece of engineering, and will certainly deliver a huge amount of value to any project, the fact that a whole new tool had to be built (and will have to be maintained) to address serious, fundamental shortcomings in the language is really quite sad.

MichaelNolan|2 years ago

The go version NilAway isn’t as good as the java version NullAway yet. But the team working on it is very responsive and eager to improve.

For java projects I think NullAway has gotten so good that it really takes the steam out of the Kotlin proponents. Hopefully NilAway will get there too.

skybrian|2 years ago

I don’t see this as a band-aid. It’s doing proper type checking (static analysis) and that seems quite promising?

Getting good type errors without requiring type annotations seems like a win over languages that are annotation-heavy. Normally I’d be skeptical about relying on type inference too much over explicit type declarations, but maybe it’s okay for this problem?

This is speculative, but I could see this becoming another win for the Go approach of putting off problems that aren’t urgent. Sort of like having third-party module systems for so many years, and then a really good one. Or like generics.

bb88|2 years ago

Agree, this is dumb. This should be a part of the compiler if the language def can't simplify this for the user.

npalli|2 years ago

"The Go monorepo is the largest codebase at Uber, comprising 90 million lines of code (and growing)"

Is this just a symptom of having a lot of engineers and they keep churning code, Golang being verbose or something else. Hard time wrapping my head around Uber needing 90+ million lines of code(!). What would be some large components of this codebase look like?

adtac|2 years ago

Imagine a multidimensional matrix with various payment methods, local regulations, cloud providers, third party dependencies, web/mobile platforms, etc. Then also add more dimensions for internal things like accounting, hiring, payroll, promotions, compliance, security, monitoring, etc. Then double it for Uber Eats or whatever.

There's a lot of overlap and some invalid combinations, but you're still left with a huge number of combinations where Uber must simply work. And every time you add a new thing to this list, the total number of combinations grows polynomially.

(Also, Go is slightly more verbose than most languages. I think that's a feature and not a bug, but it's one more reason.)

vrosas|2 years ago

Uber is famous for NIH syndrome. You can tell by their open source projects they've basically built every part of their infra from scratch. So it's not just the application code but everything else that helps run it.

lopkeny12ko|2 years ago

When you have thousands and thousands of engineers, and they are evaluated by how much code they produce, and they need to justify their job and continued employmwment, you end up with a 90M line codebase.

vitiral|2 years ago

It is the nature of large systems to grow. As software engineers we build libraries to build libraries, we build tools on top of tools to check our tools.

gavinray|2 years ago

From what I've heard from ex-FAANG, I'd wager that a significant portion of the Go is code-generated for things like RPC definitions or service skeletons.

nirga|2 years ago

It amazes me that in 2023 this is not a solved problem by design of the language. Why go doesn’t adapt the “optional” notion of other languages so that if you have a variable you either know it is not null or know that you must check for nullness. The technology exists

andreyvit|2 years ago

I write a lot of Go and used to write a lot of Swift. Swift is what you’ll consider a modern language (optionals, generics, strong focus on value types), while Go is Go.

I appreciate both languages, and of course Swift feels like what you’d pick any day.

But, after using both nearly side by side and comparing the experience directly, I’ve got to say, I’m so much more productive in Go, there’s SO much less mental burden when writing the code, — and it does not result in more bugs or other sorts of problems.

Thing is, I, of course, am always thinking about types, nullability and the like. The mental type model is pretty rich. But the more intricacies of it that I have to explain to the compiler, the more drag I feel on getting things shipped.

And because Go is so simple, idiomatic, and basically things are generally as I expect them to be, maintenance is not an issue either. Yes, occasionally you are left wondering if a particular field can or cannot be nil / invalid-zero-value, but those cases are few enough to not become a problem.

fizx|2 years ago

There's much I don't love about Rust, but I feel golang could steal the ? operator and keep the spirit of go.

Effectively,

instead of

    result, err := doSomething()
    if err != nil {
        return nil, err
    } 
you'd get the same control flow with

    result := doSomething()?

ikari_pl|2 years ago

The same reason you can't get map keys without a library or looping yourself. "Simplicity" (for go maintainers).

monksy|2 years ago

It's because it's golang and that's how they oninonatedly want it. (Seriously, the community is incredibly stubborn and controlling)

pjmlp|2 years ago

There are several language design problems solved in the 20th century that Go designers decided to ignore, because they require PhD level skills to master, apparently.

Hence why the language is full of gotchas like these.

Had it not been for Docker and Kubernetes success, and most likely it wouldn't have gotten thus far.

adtac|2 years ago

That's what the `func foo() (*T, error)` pattern is for. It's actually better than syntactic sugar for optional values because now you also have a descriptive reason for why the value is nil.

But if you really cannot afford to return more than one bit of information, do `func foo() (*T, bool)`.

wg0|2 years ago

90 million lines of code to .. call a cab?

Genuinely curious what's so much of business logic is for.

yankput|2 years ago

I was working on a Grab competitor, you would be surprised about the number of subsystems running there.

There are entire teams that are working on just internal services that connect some internal tools together.

There was also very little effectivity and efficiency in the era of cheap capital so there were tons of talent wasted on nonsense. Uber built their own slack for a while!! (before just going to mattermost)

People always ask who actually makes money on Uber... I think it's not the cab drivers, not the investors, who makes money is the programmers. It's a transfer of money from Saudis to programmers.

Well it was, anyway.

vore|2 years ago

And billing, and reporting, and regulatory compliance, and inventory management, and abuse detection, and routing, and operations, and...

robryan|2 years ago

A lot of it will be location based. It has come up before here in the discussion of why there is so much in the app. They have to cater for all the different rules in every jurisdiction.

geitir|2 years ago

Well it is in Go, so has to be verbose

techn00|2 years ago

Uber does way more than calling a cab, however, I was also surprised by the number of lines of code

jheriko|2 years ago

how in the hell does uber need engineering problems solved?

mad

neonsunset|2 years ago

[deleted]

tptacek|2 years ago

Please don't attempt to start language wars on threads here. They're a curse; they grow like kudzu and take over the whole thread. This is interesting computer science, and in the ecosystem of Hacker News, superficial bickering is its top predator.

__turbobrew__|2 years ago

I don’t really buy the usefulness of trying to statically detect possible nil panics. In their example of a service panicing 3000+ times a day why didn’t they just check the logs to get the stack trace of the panic and fix it there? I don’t see why static analysis was needed to fix that panic in runtime.

What I would really like golang to have is way to send a “last gasp” packet to notify some other system that the runtime is panicing. Ideally at large scales it would be really nice to see what is panicing where and at what time with also stack traces and maybe core dumps. I think that would be much more useful for fixing panics in production.

There was a proposal to add this to the runtime, but it got turned down: https://github.com/golang/go/issues/32333 Most of the arguments against the proposal seem to be that it is hard to determine what is safe to run in a global panic handler. I think the more reasonable option is to tell the go runtime that you want it to send a UDP packet to some address when it panics. That allows the runtime to not support calling arbitrary functions during panicing as it only has to send a UDP packet and then crash.

I could see the static analyzer being useful for helping prevent the introduction of new panics, but I would much rather have better runtime detection.

styluss|2 years ago

Because they want to find the code paths before deploying the code. Surely they have error logging or tracing and can see why it panics.

I tried this with a medium sized project and some unexpected code that could panic 3 functions away from the nil.

jeffrallen|2 years ago

Can we not link to scammy engineering blog articles with ads for scammy restaurant apps on top please?

Link to the source, or better yet, never link at all to anything related to Uber.

demi56|2 years ago

As a language that’s focused on backward compatibility than features oriented this is the best and optimal way to reduce some of Go’s loopholes. The problem of using developer tooling to solve the innate problems is that they lack awareness

I do recommend the Go team to find a way to these tools to run before it complies, just doing go build while going through these tools first goes a long way than just using scripts