Describe the bug We seem to have a flake in the perf reader wakeup


Test TestPerfReaderWakeupEvents gets stuck on some runs (ebpf, closed)

dylandreimerink commented on August 24, 2024

Test TestPerfReaderWakeupEvents gets stuck on some runs

from ebpf.

Comments (9)

dylandreimerink commented on August 24, 2024

I have been playing around with this a bit. The flaky behavior seems to originate in the kernel's WakeupEvents logic. I have not looked into the kernel code yet, but the current test fails from time to time unless I always send WakeupEvents + 1 events, in which case it consistently passes.

	// send followup events
	for i := 1; i < numEvents+1; i++ {
		_, _, err = prog.Test(internal.EmptyBPFContext)
		if err != nil {
			t.Fatal(err)
		}
	}

So perhaps this has to do with memory alignment of the map or something like that. I have tried varying the numEvents and sampleSize but changes there don't seem to change anything.


dylandreimerink commented on August 24, 2024

I think I found the cause. The WakeupEvents limit applies per ring, and there is one ring per CPU. When we execute BPF_PROG_RUN multiple times, the 2 messages are sometimes written to different rings. If I log the CPU ID of the first and the followup events I see:

=== RUN   TestPerfReaderWakeupEvents
ret 7
ret 7
--- PASS: TestPerfReaderWakeupEvents (0.01s)
=== RUN   TestPerfReaderWakeupEvents
ret 7
ret 7
--- PASS: TestPerfReaderWakeupEvents (0.01s)
=== RUN   TestPerfReaderWakeupEvents
ret 7
ret 7
--- PASS: TestPerfReaderWakeupEvents (0.01s)
=== RUN   TestPerfReaderWakeupEvents
ret 7
ret 7
--- PASS: TestPerfReaderWakeupEvents (0.01s)
=== RUN   TestPerfReaderWakeupEvents
ret 7
ret 0
panic: test timed out after 1s

The numbers change from run to run, and it seems to be pure luck that the +1 I mentioned earlier happens to land on the same CPU as one of the ones before.
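To make the per-ring behavior concrete, here is a minimal sketch of a simplified model, not the kernel's actual implementation. It assumes a ring signals a wakeup once it has accumulated `threshold` events; the names `ringModel` and `wakes` are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// ringModel is a simplified, hypothetical model of per-CPU perf rings.
// Assumption: a ring wakes the reader once it has accumulated
// `threshold` events since its last wakeup.
type ringModel struct {
	threshold int
	pending   map[int]int // CPU ID -> events written since the last wakeup
}

func newRingModel(threshold int) *ringModel {
	return &ringModel{threshold: threshold, pending: make(map[int]int)}
}

// write records one event on the given CPU's ring and reports whether
// that ring would wake the reader under the assumed model.
func (m *ringModel) write(cpu int) bool {
	m.pending[cpu]++
	if m.pending[cpu] >= m.threshold {
		m.pending[cpu] = 0
		return true
	}
	return false
}

// wakes reports whether any write in the sequence triggers a wakeup.
func wakes(threshold int, cpus []int) bool {
	m := newRingModel(threshold)
	woke := false
	for _, cpu := range cpus {
		if m.write(cpu) {
			woke = true
		}
	}
	return woke
}

func main() {
	fmt.Println(wakes(2, []int{7, 7})) // both events land on CPU 7
	fmt.Println(wakes(2, []int{7, 0})) // events split across CPU 7 and CPU 0
}
```

In this model the first sequence corresponds to the passing runs in the log (`ret 7`, `ret 7`), while the second corresponds to the run that timed out (`ret 7`, `ret 0`): no single ring reaches the threshold, so the reader is never woken.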

A potential fix would be to add the following to the start of the test:

import extUnix "golang.org/x/sys/unix"

...

func TestPerfReaderWakeupEvents(t *testing.T) {
	// Lock goroutine to thread
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Save CPU affinity
	var set extUnix.CPUSet
	err := extUnix.SchedGetaffinity(0, &set)
	qt.Assert(t, qt.IsNil(err))
	// Pin the test to CPU 0 only
	var pinned extUnix.CPUSet
	pinned.Set(0)
	err = extUnix.SchedSetaffinity(0, &pinned)
	qt.Assert(t, qt.IsNil(err))
	// Restore CPU affinity
	defer extUnix.SchedSetaffinity(0, &set)

Perhaps there are other alternatives (this doesn't win any beauty awards).


brycekahle commented on August 24, 2024

Could we send numCPUs * WakeupEvents events to ensure that at least one CPU gets woken up?


dylandreimerink commented on August 24, 2024

Yeah, that should also work, but I don't know if that defeats the purpose of the test. In my case you would be enqueueing 16 events to test a 2-event limit.


brycekahle commented on August 24, 2024

The test was more for making sure it didn't wake up after 1 event.


brycekahle commented on August 24, 2024

I'm not sure we can control the CPU the eBPF program actually runs on by controlling the affinity of the userspace program.


dylandreimerink commented on August 24, 2024

I'm not sure we can control the CPU the eBPF program actually runs on by controlling the affinity of the userspace program.

I tested the code I showed and it seems to work, at least locally. By default the BPF program executes on the CPU making the syscall, although that isn't official, so it's not guaranteed.

Program.Run also has a parameter to pick a CPU to run on, but looking at the kernel, it only works for raw tracepoint programs. If we can change the program type for our sample prog, then that might be an option. (torvalds/linux@1b4d60e)


brycekahle commented on August 24, 2024

it only works for raw tracepoint programs

That would constrain what kernel versions we can test on though.


lmb commented on August 24, 2024

I'd be fine with both solutions. I remember that we have the same problem (samples submitted on the "wrong" CPU) in other places as well. Maybe we could reuse the user space code.

I think it's also fine to constrain this to a smaller number of kernel versions: we're testing that the plumbing we have ~ works. We don't need to / want to assert that the kernel isn't doing dodgy things (as we'd never see the end of it 😆 ).

