What happened:
If the command run by "utils/exec".New().Command()
ignores SIGTERM and does not exit, or takes longer than 10 seconds to exit, using cmd.Stop()
will result in a panic after 10 seconds.
What you expected to happen:
After SIGTERM fails to stop the process, Stop should send SIGKILL.
How to reproduce it:
I wrote a test that demonstrates the issue. It can be added to exec_test.go
func TestStopWhenSigtermIgnored(t *testing.T) {
// Create a process that will ignore SIGTERM
pr, pw, err := os.Pipe()
if err != nil {
t.Errorf("pipe: %v", err)
}
cmd := New().Command("/bin/sh", "-c", `trap "" TERM && echo "trap" && exec sleep 15`)
cmd.SetStdin(nil)
cmd.SetStdout(pw)
cmd.SetStderr(pw)
if err := cmd.Start(); err != nil {
t.Errorf("expected Start() not to error, got: %v", err)
}
// Wait for SIGTERM's disposition to be set
r := bufio.NewReader(pr)
var line string
if line, err = r.ReadString('\n'); err != nil {
t.Errorf("ReadString: %v", err)
}
if line != "trap\n" {
t.Errorf("unexpected write from shell script: %v", line)
}
// no race on process state after waiting 10 seconds to send SIGKILL
cmd.Stop()
if err := cmd.Wait(); err == nil || err.Error() != "signal: killed" {
t.Errorf("expected Wait() to receive an error about a signal, got: %v", err)
}
}
→ go test -v ./exec/ -test.run ^TestStopWhenSigtermIgnored$
=== RUN TestStopWhenSigtermIgnored
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1109a40]
goroutine 19 [running]:
os.(*ProcessState).exited(...)
/usr/local/Cellar/go/1.20.6/libexec/src/os/exec_posix.go:83
os.(*ProcessState).Exited(...)
/usr/local/Cellar/go/1.20.6/libexec/src/os/exec.go:155
k8s.io/utils/exec.(*cmdWrapper).Stop.func1()
/Users/dsharp/workspace/kubernetes-utils/exec/exec.go:189 +0x20
created by time.goFunc
/usr/local/Cellar/go/1.20.6/libexec/src/time/sleep.go:176 +0x32
FAIL k8s.io/utils/exec 10.188s
FAIL
Anything else we need to know?:
The bug seems to come from a misconception about "os/exec".Cmd.ProcessState.Exited()
// Exited reports whether the program has exited.
// On Unix systems this reports true if the program exited due to calling exit,
// but false if the program terminated due to a signal.
However:
// ProcessState contains information about an exited process.
// If the process was started successfully, Wait or Run will
// populate its ProcessState when the command completes.
That is, Cmd.ProcessState
is nil until Cmd.Wait()
successfully returns. Exited()
really only is useful to distinguish if the process called exit()
/ returned normally, or was killed by a signal.
I also have a patch to address the panic:
→ git diff
diff --git a/exec/exec.go b/exec/exec.go
index d9c91e3..fd9ead2 100644
--- a/exec/exec.go
+++ b/exec/exec.go
@@ -186,7 +186,7 @@ func (cmd *cmdWrapper) Stop() {
c.Process.Signal(syscall.SIGTERM)
time.AfterFunc(10*time.Second, func() {
- if !c.ProcessState.Exited() {
+ if c.ProcessState == nil {
c.Process.Signal(syscall.SIGKILL)
}
})
HOWEVER I'm concerned about a possible data race on the osexec.Cmd object between the time.AfterFunc goroutine and whichever routine calls cmd.Wait(). Running under the go race detector after the patch above has been applied does not show a race. But, if patched with the wrong condition (if c.ProcessState != nil {
), then the race detector is triggered (It's not clear to me why the detection is sensitive to inverting the condition):
→ go test -race -count=1 -v ./exec/ -test.run ^TestStopWhenSigtermIgnored$
=== RUN TestStopWhenSigtermIgnored
==================
WARNING: DATA RACE
Write at 0x00c0001b20a8 by goroutine 6:
os/exec.(*Cmd).Wait()
/usr/local/Cellar/go/1.20.6/libexec/src/os/exec/exec.go:894 +0x1b7
k8s.io/utils/exec.(*cmdWrapper).Wait()
/Users/dsharp/workspace/kubernetes-utils/exec/exec.go:157 +0x30
k8s.io/utils/exec.TestStopWhenSigtermIgnored()
/Users/dsharp/workspace/kubernetes-utils/exec/exec_test.go:160 +0x4ee
testing.tRunner()
/usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1576 +0x216
testing.(*T).Run.func1()
/usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1629 +0x47
Previous read at 0x00c0001b20a8 by goroutine 8:
k8s.io/utils/exec.(*cmdWrapper).Stop.func1()
/Users/dsharp/workspace/kubernetes-utils/exec/exec.go:189 +0x44
Goroutine 6 (running) created at:
testing.(*T).Run()
/usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1629 +0x805
testing.runTests.func1()
/usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:2036 +0x8d
testing.tRunner()
/usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1576 +0x216
testing.runTests()
/usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:2034 +0x87c
testing.(*M).Run()
/usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1906 +0xb44
main.main()
_testmain.go:71 +0x2e9
Goroutine 8 (finished) created at:
time.goFunc()
/usr/local/Cellar/go/1.20.6/libexec/src/time/sleep.go:176 +0x47
==================
exec_test.go:161: expected Wait() to receive an error about a signal, got: <nil>
testing.go:1446: race detected during execution of test
--- FAIL: TestStopWhenSigtermIgnored (15.02s)
=== NAME
testing.go:1446: race detected during execution of test
FAIL
FAIL k8s.io/utils/exec 15.218s
FAIL
If sending SIGKILL is to depend on the termination state of cmd (and it seems like it should), then there needs to be some kind of synchronization point between the timer.AfterFunc goroutine and the goroutine calling Wait().
It's possible that Stop() should be removed. I could only find one user of "utils/exec".Cmd.Stop()
in public repositories by searching with sourcegraph, and that is in kubernetes/kubernetes/pkg/volume/flexvolume/driver-call.go
. That code itself is lacking sufficient synchronization, and is in a deprecated component.
https://github.com/kubernetes/kubernetes/blob/99190634ab252604a4496882912ac328542d649d/pkg/volume/flexvolume/driver-call.go#L131
Environment:
- Kubernetes version (use
kubectl version
): n/a
- OS (e.g. from /etc/os-release): MacOS (dev environment)
- Kernel (e.g.
uname -a
): Darwin
- Install tools:
- Others: