Discovering a JDK Race Condition, and Debugging It in 30 Minutes with Fray

157 points | aoli-al | 8 months ago | aoli.al

52 comments

exabrial|8 months ago

> Fray is a concurrency testing tool for Java that can help you find and debug tricky race conditions that manifest as assertion violations, run-time exceptions, or deadlocks. It performs controlled concurrency testing using state-of-the-art techniques such as probabilistic concurrency testing or partial order sampling.

> Fray also provides deterministic replay capabilities for debugging specific thread interleavings. Fray is designed to be easy to use and can be integrated into existing testing frameworks.

I wish I had this 20 years ago.

MaxBarraclough|8 months ago

Neat to see sleep calls artificially introduced to reliably recreate the deadlock. [0]

Looks like fixing the underlying bug is still in progress. [1] I wonder how many lines of code it will take.

[0] https://github.com/aoli-al/jdk/commit/625420ba82d2b0ebac24d9...

[1] https://bugs.openjdk.org/browse/JDK-8358601
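To illustrate the sleep trick mentioned above, here is a minimal sketch of how an artificial pause can widen a race window until the bad interleaving shows up almost every time. The class `SleepWidensRace` and its methods are hypothetical names made up for this example; this is a generic check-then-act race, not the code from the linked commit or the JDK.

```java
public class SleepWidensRace {
    // Shared state. volatile gives visibility, but read-then-write is still not atomic.
    private static volatile int counter;

    /** Runs two unsynchronized increments with a pause between the read and the write. */
    public static int runWithPause(long pauseMillis) {
        counter = 0;
        Runnable increment = () -> {
            int seen = counter;      // check: read the current value
            pause(pauseMillis);      // artificially widened race window
            counter = seen + 1;      // act: write back a possibly stale value
        };
        Thread a = new Thread(increment);
        Thread b = new Thread(increment);
        a.start();
        b.start();
        join(a);
        join(b);
        return counter;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // With a long pause, both threads almost always read 0 before either writes,
        // so one update is lost. Without the pause, the race is rarely observed.
        System.out.println("final counter with pause = " + runWithPause(200));
    }
}
```

The pause does not make the outcome strictly guaranteed, since it still depends on both threads starting within the pause window, which is one reason a controlled scheduler is attractive for this kind of reproduction.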

trhway|8 months ago

Without reworking the code, all these checks of the executor and queue state, along with the queue manipulations, have to be under a mutex, and that is just a few lines.
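A rough sketch of what "state checks and queue manipulations under one mutex" can look like in general. `GuardedTaskQueue` is a hypothetical toy class invented for illustration; it is not the JDK's `ScheduledThreadPoolExecutor` code and not the proposed fix.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical toy queue with a shutdown flag; both are guarded by the same lock. */
public class GuardedTaskQueue {
    private final ReentrantLock lock = new ReentrantLock();
    private final Queue<Runnable> queue = new ArrayDeque<>();
    private boolean shutdown = false;

    /** Checks the state and enqueues as one atomic step. */
    public boolean submit(Runnable task) {
        lock.lock();
        try {
            if (shutdown) {
                return false;        // rejected: no task can slip in after shutdown
            }
            queue.add(task);
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Flips the state and drains the queue under the same lock; returns the number dropped. */
    public int shutdownAndDrain() {
        lock.lock();
        try {
            shutdown = true;
            int dropped = queue.size();
            queue.clear();
            return dropped;
        } finally {
            lock.unlock();
        }
    }

    /** Number of tasks currently waiting. */
    public int pending() {
        lock.lock();
        try {
            return queue.size();
        } finally {
            lock.unlock();
        }
    }
}
```

Because `submit` and `shutdownAndDrain` take the same lock, a task is either enqueued before the drain (and dropped by it) or rejected after it; it cannot be added to a queue that nobody will ever service. The trade-off is that every submission now contends on that lock.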

brabel|8 months ago

Bugs like these are pervasive in languages like Java that give no protection against even the most basic causes of race conditions. It's nearly impossible to write reliable concurrent code. Fray only helps if you actually use it to test everything, which is not realistic. I am convinced, after my last year-long struggle to get a highly concurrent Java module working at work (actually Kotlin, but Kotlin does not add much to help), that where concurrency is required we should only use languages that provide safe concurrency models, like Erlang/Elixir and Rust, or actor-like ones such as Dart and JavaScript.

anorwell|8 months ago

This actually intersects with two of my current interests. We have, in production, rarely been seeing ThreadPoolExecutor hangs (JDK17) during shutdown. After a lot of debugging, I've been suspecting more and more that it may be an actual JDK issue. But, this type of issue is extremely hard to reason about in production, and I've never successfully reproduced it locally. (It's not clear to me that it's the same issue as in the post, since it's not a scheduled executor.)

Separately, we're looking at using fray for concurrency property testing, as a way to reliably catch concurrency issues in a distributed system by simulating it within a single JVM.
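For shutdown hangs like the ones described above, one common way to at least surface them is to bound the wait and dump thread stacks when the pool fails to terminate. This is a minimal sketch using only standard `java.util.concurrent` calls; `BoundedShutdown` and `shutdownWithin` are hypothetical names, and this does nothing to fix an underlying race.

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedShutdown {
    /** Requests an orderly shutdown; returns false and dumps stacks if it does not finish in time. */
    public static boolean shutdownWithin(ExecutorService pool, long timeoutMillis) {
        pool.shutdown(); // stop accepting new tasks, let queued ones finish
        try {
            if (pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        // Pool did not terminate in time: print every live thread's state and stack.
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            System.err.println(t.getName() + " [" + t.getState() + "]");
            for (StackTraceElement frame : entry.getValue()) {
                System.err.println("    at " + frame);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.execute(() -> { });
        System.out.println("terminated cleanly: " + shutdownWithin(pool, 1000));
    }
}
```

The dump only shows where threads are parked at the moment of the timeout, so it helps confirm a hang and narrow down the waiting site rather than explain the interleaving that led there.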

latchkey|8 months ago

Maybe it is just me, but I can't read the text in the code because the font is nearly white on white.

masklinn|8 months ago

The light mode is fine, but you're right the dark mode is truly awful, the code blocks are unreadable.

edit: for some reason the author overrode the background color on code blocks via an inline style of

    background-color:#f0f0f0
from

    var(--code-background-color) = #f2f2f2
to make the background nigh imperceptibly darker, but then while the stylesheet properly switches it to #01242e in dark mode, the inline override stays and blows it to bits.

Not that it's amazing if you remove the inline style, on account of operators and method names being styled pretty dark (#666 and #4070a0).

9d|8 months ago

[deleted]

AugustoCAS|8 months ago

[posted this in another thread, but maybe the author can clarify this]

I wonder how this works when one runs tests in parallel (something I always enable in any project). By this I mean configuring JUnit to run as many tests at once as there are cores available, to speed up the run of the whole test suite.

I took a peek at the code and I have the impression it doesn't work that well as it hooks into when a thread is started. Also, I'm not sure if this works with fibers.

delusional|8 months ago

You appear to be one of the authors, so forgive me asking a technical question.

In Section 5.4 of the technical paper, you mention that Kotlin has non-determinism in the scheduler. Where does this non-determinism come from?

It seems unclear to me why Kotlin would inject randomness here, and I suspect that you may actually have identified a false positive in the Lincheck DSL.

aoli-al|8 months ago

The "randomness" comes from Kotlin coroutines and user-space scheduling. For example, Kotlin runs multiple user-space threads on the same physical thread. Fray only reschedules physical threads. So when the application under test uses coroutines or virtual threads, Fray cannot generate certain thread interleavings. Also, it cannot deterministically replay, because the thread execution is no longer controlled by Fray.

In our paper, we found that Fray suffers from false negatives because of this missing feature. Lincheck supports Kotlin coroutines so it finds one more bug than Fray in LC-Bench.

We didn't make any claims about false positives in Lincheck.

herrDerb|8 months ago

In the bug report you state that this bug starts with JDK 23. Could it be that it also affects earlier versions? I am asking because we see similar behavior with 17 and 21 that we can't really explain.

TYMorningCoffee|8 months ago

Impressive! Can't wait to try Fray out at work.