top | item 36007110

(no title)

tych0 | 2 years ago

Author here (hi Sargun), it's not really about rediscovering killable vs. unkillable waits, and any confusion is probably a result of my poor writing.

The crux of it is that once you've called exit_signals() from do_exit(), signals will not get delivered. So if you subsequently use the kernel's completions or other wait code, you will not get the signal from zap_pid_ns_processes(), so you don't know to wake up and exit.

There's a test case here if people want to play around: https://github.com/tych0/kernel-utils/tree/master/fuse2

discuss

sargun|2 years ago

Hi Tycho!

I'm glad you inherited this :).

Oh, I wasn't suggesting that it was about killable vs. unkillable.

Couple of things: 1. Should prepare_to_wait_event check if the task is in PF_EXITING, and if so, refuse to wait unless a specific flag is provided? I'd be curious if you just add a kprobe to prepare_to_wait_event that checks for PF_EXITING, how many cases are valid?

2. Following this:

  zap_pid_ns_processes ->
     __fatal_signal_pending(task)
     group_send_sig_info
       do_send_sig_info
         send_signal_locked
           __send_signal_locked -> (jump to out_set)
             sigaddset // It has the pending signal here
             ....
             complete_signal

Shouldn't it wake up, even if in its in PF_EXITING, that would trigger as reassessment of the condition, and then the `__fatal_signal_pending` check would make it return -ERESTARTSYS.

One note, in the post:

  # grep Pnd /proc/1544574/status
  SigPnd: 0000000000000000
  ShdPnd: 0000000000000100

> Viewing process status this way, you can see 0x100 (i.e. the 9th bit is set) under SigPnd, which is the signal number corresponding to SIGKILL.

Shouldn't it be "ShdPnd"?

tych0|2 years ago

> Couple of things: 1. Should prepare_to_wait_event check if the task is in PF_EXITING, and if so, refuse to wait unless a specific flag is provided? I'd be curious if you just add a kprobe to prepare_to_wait_event that checks for PF_EXITING, how many cases are valid?

I would argue they're all invalid if PF_EXITING is present. Maybe I should send a patch to WARN() and see how much I get yelled at.

> Shouldn't it wake up, even if in its in PF_EXITING, that would trigger as reassessment of the condition, and then the `__fatal_signal_pending` check would make it return -ERESTARTSYS.

No, because the signal doesn't get delivered by complete_signal(). wants_signal() returns false if PF_EXITING is set. (Another maybe-interesting thing would be to just delete that check.) Or am I misunderstanding you?

> Shouldn't it be "ShdPnd"

derp, fixed, thanks.

steelframe|2 years ago

Hi Tycho. I was The Guy at LSS who tested positive for COVID about 12 hours after we sat next to each other at that Japanese restaurant in Vancouver the week before last. I really hope you didn't catch it. So far, to my knowledge, my "blast radius" is just me.

As somebody who has written a non-trivial amount of upstream Linux filesystem code and who is leading the containers team at my current employer, I've found your writing more interesting than perhaps most people on the face the planet might. I'm also a bit surprised at how often companies write their own custom FUSE filesystems. A lot of them I only hear about as former employees from those companies join mine and then clue me in about their existence. It seems like every large-ish company these days has at least one now.

It looks like you were able to figure things out through some combination of /proc poking, code inspection, and LKML querying. Out of curiosity, would it be feasible for you to have tried enabling some of the kernel hacking options such as WQ_WATCHDOG or DETECT_HUNG_TASK? Do you think that would have sped up your investigation?

Also, my whole career I've been doing ps aux, but TIL about ps awwfux. Which I guess goes to show there's always some gap in one's basic knowledge of Linux foo!

tych0|2 years ago

> Hi Tycho. I was The Guy at LSS who tested positive for COVID about 12 hours after we sat next to each other at that Japanese restaurant in Vancouver the week before last. I really hope you didn't catch it. So far, to my knowledge, my "blast radius" is just me.

Hi Mike. So far so good for me.

> It looks like you were able to figure things out through some combination of /proc poking, code inspection, and LKML querying. Out of curiosity, would it be feasible for you to have tried enabling some of the kernel hacking options such as WQ_WATCHDOG or DETECT_HUNG_TASK? Do you think that would have sped up your investigation?

We do have these both enabled, and have alerts to log them in the fleet. I have found it very useful for saying "there's a bug", but not generally applicable in debugging it. However, we wouldn't catch these things without user reports if we didn't have those tools.

Something that might (?) be useful is something like lockdep when there's hung tasks. It wouldn't have helped in this case, since it was a bug in signals wakeup, but I e.g. in the xfs case I cited at the bottom maybe it would.

loeg|2 years ago

Should processes not be able to wait after exit_signals? That seems like a plausible invariant.

tych0|2 years ago

I think they definitely should not. I've considered sending a patch that adds a WARN() or some syzkaller test for it or something, especially now that I've seen it in other filesystems.

avianlyric|2 years ago

I think that’s the point. Currently doing that will potentially result in a deadlock.

mjevans|2 years ago

It sounds like exit_signals() is being called too early, and based on the test case linked this might be a library issue rather than a code or kernel issue?

Edit: Reading the article it's more clear this happens in kernel's:

  do_exit() {
    ...
    exit_signals(tsk); /* sets PF_EXITING */
    ...
    exit_files(tsk);

Would a better solution not be to exit_signals(tsk); later in do_exit() after all possible signal sources are exhausted?

cryptonector|2 years ago

It doesn't matter. Filesystem waits are historically non-interruptible. The correct fix is indeed to allow the flushes to fail fast rather than wait forever.

loeg|2 years ago

> It sounds like exit_signals() is being called too early

Or zap_pid_ns too late, yeah.