
Comments (9)

roberthbailey commented on June 5, 2024

How have you configured healthchecking? I've tried to reproduce this scenario, but if my game server container exits after being allocated (or ready), it moves to the Unhealthy state as per the third item in https://agones.dev/site/docs/guides/health-checking/#health-failure-strategy. Even when I disable healthchecking, I still see my gameserver become Unhealthy after exiting.

from agones.

roberthbailey commented on June 5, 2024

One other thing I found while looking into this is that #2781 implies that to change this requires a code change in Agones - so I'm not sure how I can even reproduce what you are seeing.

Do you have a minimal game server example that you could share?


markmandel commented on June 5, 2024

So I was looking into the panic on the stacktrace path, and branch, and I'm not sure it's related to the reconnect, although maybe?:

agones.dev/agones/pkg/sdkserver.(*SDKServer).sendGameServerUpdate(0xc0003641e0, 0xc00017ec00)
	/go/src/agones.dev/agones/pkg/sdkserver/sdkserver.go:1248 +0x7da

The specific piece of code that panics is:

s.connectedStreams = append(s.connectedStreams[:i], s.connectedStreams[i+1:]...)

As part of the function sendGameServerUpdate(...).

This was done to drop disconnected streams from the SDK.WatchGameServer(...) operations -- but it really shouldn't panic like this when and if a Watch disconnection happens. At first glance, I'm not sure if this is a race condition, or a bug in the code.

@qhyun2 a couple of quick questions to help us narrow things down:

  1. Does this happen all the time with your code above, or only sometimes?
  2. My node is super super bad -- it looks like you are disconnecting from the Watch operation once the promise is over? Is that correct? (not sure I understand the code 100%).

If I'm correct on No. 2 and you are disconnecting, a workaround while we investigate is to keep the Watch operation open throughout the lifetime of your GameServer, rather than do it as a "one and done" operation (although we should also fix the bug).


markmandel commented on June 5, 2024

Chatting offline with another user - they ran into the same issue. For them, it only happened sporadically. Makes me wonder if there is a race condition where access/manipulation of the watch channels isn't locked appropriately.


qhyun2 commented on June 5, 2024

I can give quite a bit more context now.

> How have you configured healthchecking? I've tried to reproduce this scenario, but if my game server container exits after being allocated (or ready), it moves to the Unhealthy state as per the third item in https://agones.dev/site/docs/guides/health-checking/#health-failure-strategy. Even when I disable healthchecking, I still see my gameserver become Unhealthy after exiting.

I do not have health checking on. My game server is wrapped in a script so the agones server never sees it exit, so it never becomes unhealthy.

#!/bin/bash

until ! node ./server.js; do
    echo -e "\\n\\nServer exited with code 0, will be restarted...\\n\\n"
    sleep 1
done

> One other thing I found while looking into this is that #2781 implies that to change this requires a code change in Agones - so I'm not sure how I can even reproduce what you are seeing.
>
> Do you have a minimal game server example that you could share?

I seem to be doing option 1 of what #2781 is describing. The code snippet I originally posted basically ensures that old state doesn't get reused.

Here is a reproduction of the bug:
qhyun2@0bf7d22

> 1. Does this happen all the time with your code above, or only sometimes?

In the reproduction above it happens every time. Outside of the reproduction, it wasn't happening every time, possibly because process.exit wasn't always being called. It seems like a race condition. My guess is that the sidecar tries to send an update but the watch gRPC stream is closed.

> 2. My node is super super bad -- it looks like you are disconnecting from the Watch operation once the promise is over? Is that correct? (not sure I understand the code 100%).

The watch operation is being left open. New events are seen and ignored (additional calls to resolve have no effect). There doesn't seem to be a way to clean up with the provided API.

This issue is happening in production for me.

{"error":"context canceled","gsKey":"servers/medium-hthkb-rx6qd","message":"stream closed with error","severity":"error","source":"*sdkserver.SDKServer","time":"2024-02-13T23:35:31.434247456Z"}
{"error":"context canceled","gsKey":"servers/medium-hthkb-rx6qd","message":"stream closed with error","severity":"error","source":"*sdkserver.SDKServer","time":"2024-02-13T23:56:18.462592944Z"}
E0213 23:56:18.462700       1 runtime.go:79] Observed a panic: runtime.boundsError{x:2, y:1, signed:true, code:0x3} (runtime error: slice bounds out of range [2:1])
goroutine 69 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1d56400?, 0xc00097e120})
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0006662c0?})
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1d56400, 0xc00097e120})
        /usr/local/go/src/runtime/panic.go:884 +0x213
agones.dev/agones/pkg/sdkserver.(*SDKServer).sendGameServerUpdate(0xc0006661e0, 0xc000b8eb00)
        /go/src/agones.dev/agones/pkg/sdkserver/sdkserver.go:1248 +0x7da
agones.dev/agones/pkg/sdkserver.NewSDKServer.func2({0x406ad8?, 0xc0000ad980?}, {0x1e75f80?, 0xc000b8eb00?})
        /go/src/agones.dev/agones/pkg/sdkserver/sdkserver.go:184 +0x36
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/controller.go:250
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/shared_informer.go:971 +0xfc
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00094df38?, {0x22df9c0, 0xc0008cd3b0}, 0x1, 0xc0004faea0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc00068abd0)
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/shared_informer.go:967 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x85
panic: runtime error: slice bounds out of range [2:1] [recovered]
        panic: runtime error: slice bounds out of range [2:1]

goroutine 69 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0006662c0?})
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xd7
panic({0x1d56400, 0xc00097e120})
        /usr/local/go/src/runtime/panic.go:884 +0x213
agones.dev/agones/pkg/sdkserver.(*SDKServer).sendGameServerUpdate(0xc0006661e0, 0xc000b8eb00)
        /go/src/agones.dev/agones/pkg/sdkserver/sdkserver.go:1248 +0x7da
agones.dev/agones/pkg/sdkserver.NewSDKServer.func2({0x406ad8?, 0xc0000ad980?}, {0x1e75f80?, 0xc000b8eb00?})
        /go/src/agones.dev/agones/pkg/sdkserver/sdkserver.go:184 +0x36
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/controller.go:250
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/shared_informer.go:971 +0xfc
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00094df38?, {0x22df9c0, 0xc0008cd3b0}, 0x1, 0xc0004faea0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc00068abd0)
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/shared_informer.go:967 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x85


qhyun2 commented on June 5, 2024

> If I'm correct on No. 2 and you are disconnecting, a workaround while we investigate is to keep the Watch operation open throughout the lifetime of your GameServer, rather than do it as a "one and done" operation (although we should also fix the bug).

WORKAROUND

For those coming here from googling the issue:

My current workaround is to avoid using multiple watch calls. Everything is hooked into a single watch and the crash no longer happens.
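
A rough Go sketch of the "single watch" idea, for readers who want to see its shape. The affected game server in this thread is written in Node, so this is only an illustration: `GameServer` and `WatchHub` are hypothetical names, not Agones types, and `Dispatch` stands in for whatever single callback you register with the SDK's watch call.

```go
package main

import (
	"fmt"
	"sync"
)

// GameServer is a hypothetical stand-in for the SDK's game server type.
type GameServer struct {
	State string
}

// WatchHub (hypothetical) fans a single watch callback out to many handlers,
// so the process only ever opens one watch stream.
type WatchHub struct {
	mu       sync.Mutex
	handlers []func(GameServer)
}

// Subscribe registers a handler that will receive every update.
func (h *WatchHub) Subscribe(fn func(GameServer)) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.handlers = append(h.handlers, fn)
}

// Dispatch is the one callback you would hand to the watch call.
// It copies the handler list under the lock, then calls each handler.
func (h *WatchHub) Dispatch(gs GameServer) {
	h.mu.Lock()
	handlers := append([]func(GameServer){}, h.handlers...)
	h.mu.Unlock()
	for _, fn := range handlers {
		fn(gs)
	}
}

func main() {
	hub := &WatchHub{}
	hub.Subscribe(func(gs GameServer) { fmt.Println("lifecycle handler saw:", gs.State) })
	hub.Subscribe(func(gs GameServer) { fmt.Println("metrics handler saw:", gs.State) })

	// In a real server this call would come from the single watch stream.
	hub.Dispatch(GameServer{State: "Allocated"})
}
```

Code that previously opened its own watch subscribes to the hub instead, so the sidecar only has one stream to track.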


markmandel commented on June 5, 2024

> My current workaround is to avoid using multiple watch calls. Everything is hooked into a single watch and the crash no longer happens.

Thanks for continuing to dig in - all the information helps us try and work out what's going on.


roberthbailey commented on June 5, 2024

Thank you for the reproduction. I was unable to reproduce using a Go game server (based on the simple game server), even after wrapping the crashing game server in a bash script to restart it without triggering Agones' healthchecking and also creating multiple watches. But using your Node code I can now see a panic in the sdkserver.


roberthbailey commented on June 5, 2024

I've found the bug. In the code Mark linked to above, the function sendGameServerUpdate modifies the slice s.connectedStreams while ranging over it by removing elements. When the slice has multiple elements this can cause a panic.

Here is a simple example from the Go playground that illustrates this: https://go.dev/play/p/Qb8JCvgnhWc

The first loop uses the strategy from https://go.dev/wiki/SliceTricks#filtering-without-allocating to filter the slice (efficiently without allocating a new slice). The second loop modifies the slice similarly to sendGameServerUpdate which results in some elements being skipped and eventually causes a panic.
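
For readers who cannot open the playground link, here is a minimal standalone sketch of the same two strategies. It is not the linked playground code or the Agones source. The function names and the string-based "streams" are assumptions made for illustration. With two streams that are both closed, the second removal evaluates `streams[2:]` on a slice of length 1, which appears to be the same `[2:1]` bounds error shown in the trace above.

```go
package main

import "fmt"

// removeWhileRanging (hypothetical) mimics the buggy pattern: it deletes
// elements from the slice it is ranging over. The panic is converted into
// an error so the caller can keep running.
func removeWhileRanging(streams []string, closed map[string]bool) (out []string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	// range evaluates the slice once, so i keeps counting up to the
	// original length even after the slice has shrunk.
	for i, s := range streams {
		if closed[s] {
			streams = append(streams[:i], streams[i+1:]...)
		}
	}
	return streams, nil
}

// filterInPlace (hypothetical) uses the SliceTricks "filtering without
// allocating" approach: kept shares the backing array with streams and
// only grows when an element survives.
func filterInPlace(streams []string, closed map[string]bool) []string {
	kept := streams[:0]
	for _, s := range streams {
		if !closed[s] {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	closed := map[string]bool{"a": true, "b": true}

	_, err := removeWhileRanging([]string{"a", "b"}, closed)
	fmt.Println("remove while ranging:", err)

	fmt.Println("filter in place:", filterInPlace([]string{"a", "b", "c"}, closed))
}
```

The in-place filter never indexes past the current length, because it only reads through the range variable and writes through append.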

