Comments (11)

andrewsykim commented on July 27, 2024

Related: kubernetes/kubernetes#107631 (review)

from cloud-provider.

mdbooth commented on July 27, 2024

Related (but separate), I think we should remove the serviceCache. The cache was added in kubernetes/kubernetes@fc08a0a in 2015, and I suspect that the reasons it was originally added no longer apply. It is used to update Service objects in response to changes to Nodes.

I suspect the cache was added to avoid fetching Service objects again from the API server. However, we already have these objects cached in the serviceInformer. I believe we can entirely replace this cache with calls to the serviceInformer: we can track failed reconciles by key and fetch the objects from the informer.

It adds unnecessary complexity to the code and could easily become another source of subtle bugs.

andrewsykim commented on July 27, 2024

It's possible that bug fix would address your problem though? But there's a follow-up to be had where we remove the cache as you suggested

mdbooth commented on July 27, 2024

> Related: kubernetes/kubernetes#107631 (review)

That's the incidental problem, not the main one, though!

mdbooth commented on July 27, 2024

> It's possible that bug fix would address your problem though? But there's a follow-up to be had where we remove the cache as you suggested

No, I've confused this issue by discussing two related but different things. The serviceCache is simply confusing now; it's not the source of the bug.

mdbooth commented on July 27, 2024

You can see it here: https://github.com/kubernetes/kubernetes/blob/349900472a38a29fd6d85f7e4880d4f3d72ad6ee/staging/src/k8s.io/cloud-provider/controllers/service/controller.go#L346-L347

Note that we update the cache with the Service object we were passed from the informer, and then pass that same informer object to syncLoadBalancerIfNeeded(). The dirty object came from the informer, not the serviceCache. In this code path we never actually read from the serviceCache; we only write to it.

mdbooth commented on July 27, 2024

I've thrown up a placeholder PR which hopefully demonstrates what I think the problem is. It's completely untested! I'll try to work on it properly tomorrow.

mdbooth commented on July 27, 2024

I have manually verified that kubernetes/kubernetes#109601 fixes the issue: I'm now more confident that the issue is that we're modifying the object returned by the informer when we should not. I'll explain how I tested it in the PR, and look for a practical way to add an automated test.
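For readers unfamiliar with this class of bug, the following is a minimal sketch of why writing to an object returned by an informer is a problem and how a deep copy avoids it. It does not reproduce the change in the PR: `Service`, `informerCache`, `markSynced`, and the `example.com/synced` annotation are hypothetical, and the hand-written `DeepCopy` only imitates the generated method that Kubernetes API types carry.

```go
package main

import "fmt"

// Service is a simplified stand-in for the Kubernetes Service type.
type Service struct {
	Name        string
	Annotations map[string]string
}

// DeepCopy imitates the generated DeepCopy methods on Kubernetes API types.
func (s *Service) DeepCopy() *Service {
	if s == nil {
		return nil
	}
	out := &Service{Name: s.Name}
	if s.Annotations != nil {
		out.Annotations = make(map[string]string, len(s.Annotations))
		for k, v := range s.Annotations {
			out.Annotations[k] = v
		}
	}
	return out
}

// informerCache stands in for the informer's shared store. Get returns the
// stored pointer itself, so every caller sees the same object.
type informerCache map[string]*Service

func (c informerCache) Get(key string) *Service { return c[key] }

// markSynced is a hypothetical reconcile step that writes to its argument.
func markSynced(svc *Service) {
	if svc.Annotations == nil {
		svc.Annotations = map[string]string{}
	}
	svc.Annotations["example.com/synced"] = "true"
}

func main() {
	// Unsafe: the write lands in the shared object that other readers use.
	cache := informerCache{"default/web": {Name: "web"}}
	markSynced(cache.Get("default/web"))
	fmt.Println("shared object after in-place write:", cache.Get("default/web").Annotations)

	// Safer: copy first, then modify only the private copy.
	cache = informerCache{"default/web": {Name: "web"}}
	local := cache.Get("default/web").DeepCopy()
	markSynced(local)
	fmt.Println("annotations on shared object after copy-then-write:", len(cache.Get("default/web").Annotations))
}
```

In the first case the cached object no longer matches what the API server sent, so later comparisons against it can miss or invent changes. In the second case the cache is untouched.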

mdbooth commented on July 27, 2024

I decided to quickly audit the other cloud-provider controllers looking for similar problematic uses of objects returned from an informer:

  • Node controller looks ok. Looks like it always copies or re-fetches before modifying.
  • Node lifecycle controller passes shallow-copied Node from informer to:
  • Route controller looks ok. Looks like it always copies or re-fetches before modifying.

This is quite hard to audit manually. It feels ripe for a 'safe' wrapper used consistently across the codebase. Without anything like Rust's ability to separate mutable from immutable references automatically, it might be safer to simply always copy these objects before use.
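One possible shape for such a wrapper, sketched under assumptions: `deepCopier`, `safeLister`, and the simplified `Node` type are all hypothetical, and the wrapped `get` function stands in for whatever lookup the informer provides. It relies on Go generics (1.18 or later).

```go
package main

import "fmt"

// deepCopier is satisfied by any type with a generated-style DeepCopy method.
type deepCopier[T any] interface {
	DeepCopy() T
}

// safeLister wraps a lookup that returns shared objects and hands out a deep
// copy on every read, so callers cannot modify the shared object by accident.
type safeLister[T deepCopier[T]] struct {
	get func(key string) (T, bool)
}

func (l safeLister[T]) Get(key string) (T, bool) {
	obj, ok := l.get(key)
	if !ok {
		var zero T
		return zero, false
	}
	return obj.DeepCopy(), true
}

// Node is a simplified stand-in for the Kubernetes Node type.
type Node struct {
	Name   string
	Labels map[string]string
}

// DeepCopy imitates the generated DeepCopy methods on Kubernetes API types.
func (n *Node) DeepCopy() *Node {
	if n == nil {
		return nil
	}
	out := &Node{Name: n.Name}
	if n.Labels != nil {
		out.Labels = make(map[string]string, len(n.Labels))
		for k, v := range n.Labels {
			out.Labels[k] = v
		}
	}
	return out
}

func main() {
	// shared plays the role of the informer's store.
	shared := map[string]*Node{
		"node-a": {Name: "node-a", Labels: map[string]string{"zone": "a"}},
	}
	lister := safeLister[*Node]{get: func(key string) (*Node, bool) {
		n, ok := shared[key]
		return n, ok
	}}

	n, _ := lister.Get("node-a")
	n.Labels["zone"] = "b" // modifies only the caller's copy

	fmt.Println("shared zone:", shared["node-a"].Labels["zone"])
	fmt.Println("local zone:", n.Labels["zone"])
}
```

The trade-off is an allocation and copy on every read, including reads that never modify the object, in exchange for not having to audit each call site.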

andrewsykim commented on July 27, 2024

I think up to this point it was implied that the Service object passed from the controller should be read-only, but agreed that passing a deep copy is probably the safer thing to do, since the read-only expectation is not always obvious.

k8s-triage-robot commented on July 27, 2024

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
