Comments (23)

sanposhiho commented on August 17, 2024

OK, so from my point of view, it looks like:

  • Standardizing the API and making the vanilla scheduler compatible with various subprojects is the first, core motivation.
    • Without this, a custom scheduler with the coscheduling plugin would be a requirement for various subprojects, which would be a troublesome hurdle.
  • Other technical reasons, such as a better gang plugin implementation enabled by changes in the scheduler core, come second.
    • This couldn't be the first motivation since, as I said, technically we can implement sophisticated gang scheduling as out-of-tree plugins (with some tricks, though ;)).

kerthcet commented on August 17, 2024

/sig scheduling
/assign

kerthcet commented on August 17, 2024

Some designs are based on #3370.

alculquicondor commented on August 17, 2024

cc @cs20

utam0k commented on August 17, 2024

Recently, @sanposhiho and I open-sourced our gang scheduler plugin, which we actually use in our cluster. Since I believe our approach is different from co-scheduling, it is worth taking into consideration. This plugin doesn't require custom resources such as PodGroup. Perhaps it has tips to improve this proposal.
https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang
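
For a rough illustration of that annotation-based approach, here is a minimal sketch in Go. The annotation keys below are hypothetical placeholders, not the plugin's real ones (see the repository for the actual keys):

```go
package gang

import (
	"strconv"

	v1 "k8s.io/api/core/v1"
)

// Hypothetical annotation keys, for illustration only; the real keys
// are defined in the repository linked above.
const (
	annGangName = "example.com/gang-name"
	annGangSize = "example.com/gang-size"
)

// gangOf extracts the gang name and size declared on a Pod via
// annotations; ok is false when the Pod doesn't belong to a gang.
func gangOf(pod *v1.Pod) (name string, size int, ok bool) {
	name, ok = pod.Annotations[annGangName]
	if !ok {
		return "", 0, false
	}
	size, err := strconv.Atoi(pod.Annotations[annGangSize])
	if err != nil {
		return "", 0, false
	}
	return name, size, true
}
```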

kerthcet commented on August 17, 2024

Thanks @utam0k I'll take a look but the link is invalid.

utam0k commented on August 17, 2024

Thanks @utam0k I'll take a look but the link is invalid.

Sorry, I have fixed it

sanposhiho commented on August 17, 2024

@kerthcet
Our company used a custom plugin similar to co-scheduling (design-wise), and hit some challenges.
(I actually haven't followed this topic closely, though.) If you're planning to introduce the current co-scheduling plugin implementation almost as-is, we'd definitely hit the same ones.

These are the major challenges from our experience (see the sketch after this list for the underlying Permit pattern):

  • Inefficient scheduling: waiting Pods reserve too much space in the cluster.
    • Let's say a group has 5 Pods; 4 Pods go through the scheduling cycle and end up in waitOnPermit, while 1 Pod is rejected. The 4 Pods reserve their places until the last Pod goes through the scheduling cycle too (or the timeout is reached).
    • As the number of groups in the cluster grows, the situation likely worsens into something like a deadlock: many Pods would be unschedulable while the cluster has plenty of space, because a few Pods in each group reserve space and wait for the rest of the Pods, while the rest of the Pods cannot get through the scheduling cycle because so much space is reserved.
  • Difficulty in requeueing: for efficient requeueing, we should requeue a group of Pods only when all Pods are ready to be scheduled.
    • Let's say a group has 5 Pods; one Pod is rejected by resource fit, another Pod is rejected by NodeAffinity, and three Pods are schedulable (but have to wait for the other 2 Pods). In this case, we have to requeue all 5 Pods once both the resource fit plugin's failure for Pod-1 and the NodeAffinity plugin's failure for Pod-2 are resolved.
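
To make the first challenge concrete, here is a minimal, simplified sketch of the Permit-based pattern in Go (not the actual coscheduling source; gangOf is the annotation helper sketched earlier in the thread): each gang member that passes the scheduling cycle parks in waitOnPermit, holding its assumed node's resources the whole time.

```go
package gang

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Plugin is a bare-bones gang Permit plugin for illustration.
type Plugin struct {
	handle framework.Handle
}

func (pl *Plugin) Name() string { return "GangSketch" }

// countWaiting counts this gang's Pods already parked in waitOnPermit.
func (pl *Plugin) countWaiting(gang string) int {
	n := 0
	pl.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if g, _, ok := gangOf(wp.GetPod()); ok && g == gang {
			n++
		}
	})
	return n
}

// Permit runs after a Pod has been assumed on a node. Returning
// framework.Wait parks the Pod in waitOnPermit while it keeps holding
// its assumed node's resources -- which is exactly how a waiting gang
// reserves cluster space that no other pending Pod can use.
func (pl *Plugin) Permit(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
	gang, size, ok := gangOf(pod)
	if !ok {
		return framework.NewStatus(framework.Success), 0
	}
	if pl.countWaiting(gang)+1 < size {
		// Not all members have arrived yet: hold this Pod's
		// reservation and wait (up to a timeout) for the stragglers.
		return framework.NewStatus(framework.Wait), 30 * time.Second
	}
	// The whole gang has arrived: release every waiting member.
	pl.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if g, _, ok := gangOf(wp.GetPod()); ok && g == gang {
			wp.Allow(pl.Name())
		}
	})
	return framework.NewStatus(framework.Success), 0
}
```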

Our gang plugin overcomes those challenges, so it's hopefully worth a look for you :)
On the other hand, I'm not saying we should follow our plugin's design. At its design phase, I had to solve these problems using only what is allowed within plugins; we didn't want to fork the scheduler.
So, it would surely be much easier/simpler if we could introduce changes in the scheduling framework itself to properly support scheduling a group of Pods.

kerthcet commented on August 17, 2024

Thanks for the feedback @sanposhiho

I took a brief look at https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang. As the README.md highlights, the gang scheduler offers enhancements in:

  • Simple configuration - simplicity is what we're chasing, but annotations are arbitrary, not good for validation, and not that official. I may not follow this approach, but I get your point.
  • Enhanced requeueing - this is something we should consider when we're dedicated to solving the performance problem.

Let's say a group has 5 Pods; 4 Pods go through the scheduling cycle and end up in waitOnPermit, while 1 Pod is rejected. The 4 Pods reserve their places until the last Pod goes through the scheduling cycle too (or the timeout is reached).

If 4 Pods passed the scheduling cycle, shouldn't we wait for the last Pod's scheduling result? Sorry, I didn't get your point here, but regarding co-scheduling, it has similar logic; see https://github.com/kubernetes-sigs/scheduler-plugins/blob/531a1831bdda0bdad6057f7d00f3cafacdb93d86/pkg/coscheduling/coscheduling.go#L152.

After all, I do agree we should reject the podGroup's scheduling ASAP if we're 100% sure that the podGroup will not succeed in the end.
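
For what it's worth, the "reject ASAP" part is expressible with the existing framework API. A minimal sketch, continuing the Plugin/gangOf sketches from earlier in the thread (a real plugin would typically call this from PostFilter when a member fails):

```go
// rejectGang rejects every waiting member of a gang so their
// reservations in waitOnPermit are released immediately, instead of
// sitting there until the timeout expires.
func (pl *Plugin) rejectGang(gang string) {
	pl.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if g, _, ok := gangOf(wp.GetPod()); ok && g == gang {
			// Reject sends the Pod back through the scheduling queue
			// and frees the resources it was holding.
			wp.Reject(pl.Name(), "a member of the gang is unschedulable")
		}
	})
}
```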

As the number of groups in the cluster grows, the situation likely worsens into something like a deadlock:

Yes, we should make sure the group's Pods are queued up together. I'll take a deeper look at your implementation.

Difficulty in requeueing: for efficient requeueing, we should requeue a group of Pods only when all Pods are ready to be scheduled.

This is a more fine-grained approach focused on performance.

Our gang plugin overcomes those challenges, so it's hopefully worth a look for you :)

Definitely I will, thanks for sharing.

sftim commented on August 17, 2024

Kubernetes already supports pluggable schedulers. Should this feature be delivered in-tree?

kerthcet commented on August 17, 2024

If you mean moving co-scheduling in-tree, yes, it's something similar, but more than that. Co-scheduling currently has several problems:

  • It's stateful; however, a plugin should ideally be stateless (per scheduling cycle) and lightweight
  • Queueing problems for PodGroups
  • It requires additional maintenance for some components, like another backoffQ
  • ... and maybe some other problems, like the performance issues mentioned above

This is not a design defect of co-scheduling, but a consequence of the kube-scheduler being unaware of a group of Pods as a unit. So what we hope to do is make the scheduler aware of PodGroup, so users can build more plugins on top of this concept.
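
For reference, this is roughly the PodGroup shape the out-of-tree coscheduling plugin defines today (a trimmed sketch based on sigs.k8s.io/scheduler-plugins, with the field set abbreviated); the idea above is to make the scheduler natively aware of a unit like this:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodGroup is a trimmed sketch of the out-of-tree coscheduling CRD;
// fields abbreviated for illustration.
type PodGroup struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec PodGroupSpec `json:"spec,omitempty"`
}

type PodGroupSpec struct {
	// MinMember is the minimum number of member Pods that must be
	// schedulable together for any of them to run (the gang size).
	MinMember int32 `json:"minMember,omitempty"`
	// ScheduleTimeoutSeconds bounds how long the group waits for all
	// members before the scheduling attempt is abandoned.
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}
```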

sanposhiho commented on August 17, 2024

Simple configuration - simplicity is what we're chasing, but annotations are arbitrary, not good for validation, and not that official. I may not follow this approach, but I get your point.

Yeah, regarding configuration via annotations, IMO we should avoid annotations if we intend to support this upstream. A native API would offer a much more robust solution.

If 4 Pods passed the scheduling cycle, shouldn't we wait for the last Pod's scheduling result?

We do need to wait, but the gang should give up being scheduled once one Pod is rejected and preemption can't help that Pod's failure.
Basically, the time Pods are allowed to wait in waitOnPermit should be very minimal; as I mentioned earlier, Pods in waitOnPermit end up reserving nodes, which could easily lead to a situation where many nodes appear to have space (from outside the scheduler's pov), yet are essentially reserved, making other pending Pods unschedulable.

For our gang scheduling plugin, consider a scenario with a gang size of five. If four Pods are waiting for the fifth Pod in waitOnPermit, but the fifth Pod becomes unschedulable, we should immediately move all Pods back to the unschedulable queue. Up to this point, coscheduling does the same, as you pointed out.

But coscheduling is too simple and leaves many scenarios out of consideration; let's think about it further.
If we're using coscheduling, Pod1-Pod4 get FailedPlugin: coscheduling and Pod5 gets FailedPlugin: hoge-plugin. Then Pod1-Pod4 would be requeued based on coscheduling's registered events, and Pod5 would be requeued based on hoge-plugin's registered events.

So, each Pod would be requeued individually, which is problematic.
Let's say only Pod1-Pod4 are requeued:

  • they again wouldn't get beyond waitOnPermit until Pod5 comes
  • they wouldn't get back to the unschedulable queue until Pod5 is requeued (or times out), because no one from the same gang goes through PostFilter.

Let's say only Pod5 is requeued:

  • when Pod5 is requeued and reaches waitOnPermit, Pod1-Pod4 could be schedulable.
  • but in fact, Pod1-Pod4 won't be requeued until one of coscheduling's registered events happens.

So -

Yes, we should make sure the group's Pods are queued up together.

we concluded the same ;)
We should not regard Pod1-Pod4 as rejected by coscheduling, but as rejected because of Pod5's failure. Our plugin requeues all of the Pods at once when some cluster event happens that might change the result for Pod5 according to hoge-plugin (= in the latest scheduler, that's when hoge-plugin's QueueingHint returns Queue for Pod5).
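
To sketch that mechanism (the QueueingHintFn signature and the Queue/QueueSkip values are the real scheduling framework API; gangReadyToRetry is a hypothetical bookkeeping helper, and Plugin/gangOf continue the earlier sketches):

```go
import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// isSchedulableAfterEvent is a QueueingHint that requeues a gang as a
// unit: a member is only worth retrying when the event could fix the
// sibling that actually failed, so all members are released from the
// unschedulable queue together instead of one by one.
func (pl *Plugin) isSchedulableAfterEvent(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (framework.QueueingHint, error) {
	gang, _, ok := gangOf(pod)
	if !ok {
		return framework.QueueSkip, nil
	}
	// gangReadyToRetry (hypothetical) answers: did this event plausibly
	// resolve the failure of the member that was rejected (e.g.
	// hoge-plugin's failure for Pod5)?
	if pl.gangReadyToRetry(gang, oldObj, newObj) {
		return framework.Queue, nil
	}
	return framework.QueueSkip, nil
}
```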

sanposhiho commented on August 17, 2024

Kubernetes already supports pluggable schedulers. Should this feature be delivered in-tree?

I've actually got the same feeling.
We've kept coscheduling in sigs/scheduler-plugins until now; what's the motivation for moving it to an in-tree plugin now?

(As I said, of course, a plugin implementation would be simpler if it were supported as an in-tree plugin and we changed the scheduler implementation for it.) Technically, our plugin shows that we can implement all the sophisticated tricks described in my comments (plus more) as a custom plugin, without requiring any implementation change on the scheduler side.

kerthcet commented on August 17, 2024

As mentioned in several points above: given that there's yet another gang scheduling implementation out there, we should support this upstream to avoid reimplementations again and again and again.

sanposhiho commented on August 17, 2024

I just commented on that point in your doc: what's the actual pain point for us in that situation? Why do we want everyone to use the same gang scheduling solution?
Can't we just say that we officially maintain coscheduling, and we don't care about the others maintained by other communities?

kerthcet commented on August 17, 2024

Why do we want everyone to use the same gang scheduling solution?

That's a good question. I think what we want is not to force everyone to use the same gang scheduling solution, but to make it extendable, since people may want different queueing or preemption logic. At the same time, we should provide a standard gang scheduling primitive for users; that doesn't mean it's the best one.

OTOH, I think all solutions share the goal of making podGroup scheduling efficient; that's what we can plumb into the schedulingQueue, as we do with activeQ/backoffQ/unschedulablePods.

I'll list co-scheduling with the native scheduler as an Alternative and append it to the proposal later.

Hope to hear other advice about how to make podGroup extendable. Do you have any advice from your gang scheduler plugin, @sanposhiho?

alculquicondor commented on August 17, 2024

Why do we want everyone to use the same gang scheduling solution?

We need to standardize the API, at the very least.

Furthermore, we should have an in-tree plugin that is compatible with other sig-scheduling projects, primarily Kueue.

tenzen-y commented on August 17, 2024

Why do we want everyone to use the same gang scheduling solution?

We need to standardize the API, at the very least.

Furthermore, we should have an in-tree plugin that is compatible with other sig-scheduling projects, primarily Kueue.

Also, an in-tree plugin would be worthwhile when we support gang scheduling in JobSet and LeaderWorkerSet (sig-apps sub-projects).

kerthcet commented on August 17, 2024

Compatibility with other subprojects should be another goal. Will add it to the proposal as well.

tenzen-y commented on August 17, 2024

Compatibility with other subprojects should be another goal. Will add it to the proposal as well.

I didn't mean that we should work on integrations with subprojects. I just raised use cases.

kerthcet commented on August 17, 2024

Integration is out of scope; what I'm referring to is compatibility.

tenzen-y commented on August 17, 2024

Integration is out of scope; what I'm referring to is compatibility.

Yes, that's what I meant.

I DIDN'T mean that we should work on integrations with subprojects. I just raised use cases.

alculquicondor commented on August 17, 2024

To me, those 2 motivations are equally important.

Hopefully, in the future, people don't need to implement custom schedulers to get all-or-nothing scheduling.
