Comments (23)

sanposhiho commented on August 17, 2024

OK, so from my point of view, it looks like:

  • Standardizing the API and making the vanilla scheduler compatible with various subprojects is the first, core motivation.
    • Without this, a custom scheduler with the coscheduling plugin would be a requirement for various subprojects, which would be a troublesome hurdle.
  • Other technical reasons, such as a better gang plugin implementation enabled by changes in the scheduler core, come second.
    • This couldn't be the first motivation since, as I said, technically we can implement sophisticated gang scheduling as out-of-tree plugins (with some tricks, though ;)).

kerthcet commented on August 17, 2024

/sig scheduling
/assign

kerthcet commented on August 17, 2024

Some designs are based on #3370.

alculquicondor commented on August 17, 2024

cc @cs20

utam0k commented on August 17, 2024

Recently, @sanposhiho and I open-sourced our gang scheduler plugin, which we actually use in our cluster. Since I believe our approach is different from co-scheduling, it is worth taking into consideration. This plugin doesn't require custom resources such as PodGroup. Perhaps it has tips to improve this proposal.
https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang
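
For a rough illustration of that annotation-based approach, here is a minimal sketch in Go. The annotation keys below are hypothetical placeholders, not the plugin's real ones (see the repository for the actual keys):

```go
package gang

import (
	"strconv"

	v1 "k8s.io/api/core/v1"
)

// Hypothetical annotation keys, for illustration only; the real keys
// are defined in the repository linked above.
const (
	annGangName = "example.com/gang-name"
	annGangSize = "example.com/gang-size"
)

// gangOf extracts the gang name and size declared on a Pod via
// annotations; ok is false when the Pod doesn't belong to a gang.
func gangOf(pod *v1.Pod) (name string, size int, ok bool) {
	name, ok = pod.Annotations[annGangName]
	if !ok {
		return "", 0, false
	}
	size, err := strconv.Atoi(pod.Annotations[annGangSize])
	if err != nil {
		return "", 0, false
	}
	return name, size, true
}
```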

kerthcet commented on August 17, 2024

Thanks @utam0k I'll take a look but the link is invalid.

utam0k commented on August 17, 2024

Thanks @utam0k I'll take a look but the link is invalid.

Sorry, I have fixed it

sanposhiho commented on August 17, 2024

@kerthcet
Our company used a custom plugin similar to co-scheduling (design-wise), and hit some challenges.
(I actually haven't followed this topic closely, though.) If you're planning to introduce the current co-scheduling plugin implementation almost as-is, we'd definitely hit the same ones.

These are the major challenges from our experience (see the sketch after this list for the underlying Permit pattern):

  • Inefficient scheduling: waiting Pods reserve too much space in the cluster.
    • Let's say a group has 5 Pods; 4 Pods go through the scheduling cycle and end up in waitOnPermit, while 1 Pod is rejected. The 4 Pods reserve their places until the last Pod goes through the scheduling cycle too (or the timeout is reached).
    • As the number of groups in the cluster grows, the situation likely worsens into something like a deadlock: many Pods would be unschedulable while the cluster has plenty of space, because a few Pods in each group reserve space and wait for the rest of the Pods, while the rest of the Pods cannot get through the scheduling cycle because so much space is reserved.
  • Difficulty in requeueing: for efficient requeueing, we should requeue a group of Pods only when all Pods are ready to be scheduled.
    • Let's say a group has 5 Pods; one Pod is rejected by resource fit, another Pod is rejected by NodeAffinity, and three Pods are schedulable (but have to wait for the other 2 Pods). In this case, we have to requeue all 5 Pods once both the resource fit plugin's failure for Pod-1 and the NodeAffinity plugin's failure for Pod-2 are resolved.
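
To make the first challenge concrete, here is a minimal, simplified sketch of the Permit-based pattern in Go (not the actual coscheduling source; gangOf is the annotation helper sketched earlier in the thread): each gang member that passes the scheduling cycle parks in waitOnPermit, holding its assumed node's resources the whole time.

```go
package gang

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Plugin is a bare-bones gang Permit plugin for illustration.
type Plugin struct {
	handle framework.Handle
}

func (pl *Plugin) Name() string { return "GangSketch" }

// countWaiting counts this gang's Pods already parked in waitOnPermit.
func (pl *Plugin) countWaiting(gang string) int {
	n := 0
	pl.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if g, _, ok := gangOf(wp.GetPod()); ok && g == gang {
			n++
		}
	})
	return n
}

// Permit runs after a Pod has been assumed on a node. Returning
// framework.Wait parks the Pod in waitOnPermit while it keeps holding
// its assumed node's resources -- which is exactly how a waiting gang
// reserves cluster space that no other pending Pod can use.
func (pl *Plugin) Permit(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
	gang, size, ok := gangOf(pod)
	if !ok {
		return framework.NewStatus(framework.Success), 0
	}
	if pl.countWaiting(gang)+1 < size {
		// Not all members have arrived yet: hold this Pod's
		// reservation and wait (up to a timeout) for the stragglers.
		return framework.NewStatus(framework.Wait), 30 * time.Second
	}
	// The whole gang has arrived: release every waiting member.
	pl.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if g, _, ok := gangOf(wp.GetPod()); ok && g == gang {
			wp.Allow(pl.Name())
		}
	})
	return framework.NewStatus(framework.Success), 0
}
```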

Our gang plugin overcomes those challenges, so it's hopefully worth a look for you :)
On the other hand, I'm not saying we should follow our plugin's design. At its design phase, I had to solve these problems using only what is allowed within plugins; we didn't want to fork the scheduler.
So, it would surely be much easier/simpler if we could introduce changes in the scheduling framework itself to properly support scheduling a group of Pods.

kerthcet commented on August 17, 2024

Thanks for the feedback @sanposhiho

I took a brief look at https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang. As the README.md highlights, the gang scheduler offers enhancements in:

  • Simple configuration - simplicity is what we're chasing, but annotations are arbitrary, not good for validation, and not that official. I may not follow this approach, but I get your point.
  • Enhanced requeueing - this is something we should consider when we're dedicated to solving the performance problem.

Let's say a group has 5 Pods; 4 Pods go through the scheduling cycle and end up in waitOnPermit, while 1 Pod is rejected. The 4 Pods reserve their places until the last Pod goes through the scheduling cycle too (or the timeout is reached).

If 4 Pods passed the scheduling cycle, shouldn't we wait for the last Pod's scheduling result? Sorry, I didn't get your point here, but regarding co-scheduling, it has similar logic; see https://github.com/kubernetes-sigs/scheduler-plugins/blob/531a1831bdda0bdad6057f7d00f3cafacdb93d86/pkg/coscheduling/coscheduling.go#L152.

After all, I do agree we should reject the podGroup's scheduling ASAP if we're 100% sure that the podGroup will not succeed in the end.
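
For what it's worth, the "reject ASAP" part is expressible with the existing framework API. A minimal sketch, continuing the Plugin/gangOf sketches from earlier in the thread (a real plugin would typically call this from PostFilter when a member fails):

```go
// rejectGang rejects every waiting member of a gang so their
// reservations in waitOnPermit are released immediately, instead of
// sitting there until the timeout expires.
func (pl *Plugin) rejectGang(gang string) {
	pl.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if g, _, ok := gangOf(wp.GetPod()); ok && g == gang {
			// Reject sends the Pod back through the scheduling queue
			// and frees the resources it was holding.
			wp.Reject(pl.Name(), "a member of the gang is unschedulable")
		}
	})
}
```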

As the number of groups in the cluster grows, the situation likely worsens into something like a deadlock:

Yes, we should make sure the group's Pods are queued up together. I'll take a deeper look at your implementation.

Difficulty in requeueing: for efficient requeueing, we should requeue a group of Pods only when all Pods are ready to be scheduled.

This is a more fine-grained approach focused on performance.

Our gang plugin overcomes those challenges, so it's hopefully worth a look for you :)

Definitely I will, thanks for sharing.

sftim commented on August 17, 2024

Kubernetes already supports pluggable schedulers. Should this feature be delivered in-tree?

kerthcet commented on August 17, 2024

If you mean moving co-scheduling in-tree, yes, it's something similar, but more than that. Co-scheduling currently has several problems:

  • It's stateful; however, a plugin should ideally be stateless (per scheduling cycle) and lightweight
  • Queueing problems for PodGroups
  • It requires additional maintenance for some components, like another backoffQ
  • ... and maybe some other problems, like the performance issues mentioned above

This is not a design defect of co-scheduling, but a consequence of the kube-scheduler being unaware of a group of Pods as a unit. So what we hope to do is make the scheduler aware of PodGroup, so users can build more plugins on top of this concept.
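
For reference, this is roughly the PodGroup shape the out-of-tree coscheduling plugin defines today (a trimmed sketch based on sigs.k8s.io/scheduler-plugins, with the field set abbreviated); the idea above is to make the scheduler natively aware of a unit like this:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodGroup is a trimmed sketch of the out-of-tree coscheduling CRD;
// fields abbreviated for illustration.
type PodGroup struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec PodGroupSpec `json:"spec,omitempty"`
}

type PodGroupSpec struct {
	// MinMember is the minimum number of member Pods that must be
	// schedulable together for any of them to run (the gang size).
	MinMember int32 `json:"minMember,omitempty"`
	// ScheduleTimeoutSeconds bounds how long the group waits for all
	// members before the scheduling attempt is abandoned.
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}
```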

sanposhiho commented on August 17, 2024

Simple configuration - simplicity is what we're chasing, but annotations are arbitrary, not good for validation, and not that official. I may not follow this approach, but I get your point.

Yeah, regarding configuration via annotations, IMO we should avoid annotations if we intend to support this upstream. A native API would offer a much more robust solution.

If 4 Pods passed the scheduling cycle, shouldn't we wait for the last Pod's scheduling result?

We do need to wait, but the gang should give up being scheduled once one Pod is rejected and preemption can't help that Pod's failure.
Basically, the time Pods are allowed to wait in waitOnPermit should be very minimal; as I mentioned earlier, Pods in waitOnPermit end up reserving nodes, which could easily lead to a situation where many nodes appear to have space (from outside the scheduler's pov), yet are essentially reserved, making other pending Pods unschedulable.

For our gang scheduling plugin, consider a scenario with a gang size of five. If four Pods are waiting for the fifth Pod in waitOnPermit, but the fifth Pod becomes unschedulable, we should immediately move all Pods back to the unschedulable queue. Up to this point, coscheduling does the same, as you pointed out.

But coscheduling is too simple and leaves many scenarios out of consideration; let's think about it further.
If we're using coscheduling, Pod1-Pod4 get FailedPlugin: coscheduling and Pod5 gets FailedPlugin: hoge-plugin. Then Pod1-Pod4 would be requeued based on coscheduling's registered events, and Pod5 would be requeued based on hoge-plugin's registered events.

So, each Pod would be requeued individually, which is problematic.
Let's say only Pod1-Pod4 are requeued:

  • they again wouldn't get beyond waitOnPermit until Pod5 comes
  • they wouldn't get back to the unschedulable queue until Pod5 is requeued (or times out), because no one from the same gang goes through PostFilter.

Let's say only Pod5 is requeued:

  • when Pod5 is requeued and reaches waitOnPermit, Pod1-Pod4 could be schedulable.
  • but in fact, Pod1-Pod4 won't be requeued until one of coscheduling's registered events happens.

So -

Yes, we should make sure the group's Pods are queued up together.

we concluded the same ;)
We should not regard Pod1-Pod4 as rejected by coscheduling, but as rejected because of Pod5's failure. Our plugin requeues all of the Pods at once when some cluster event happens that might change the result for Pod5 according to hoge-plugin (= in the latest scheduler, that's when hoge-plugin's QueueingHint returns Queue for Pod5).
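
To sketch that mechanism (the QueueingHintFn signature and the Queue/QueueSkip values are the real scheduling framework API; gangReadyToRetry is a hypothetical bookkeeping helper, and Plugin/gangOf continue the earlier sketches):

```go
import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// isSchedulableAfterEvent is a QueueingHint that requeues a gang as a
// unit: a member is only worth retrying when the event could fix the
// sibling that actually failed, so all members are released from the
// unschedulable queue together instead of one by one.
func (pl *Plugin) isSchedulableAfterEvent(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (framework.QueueingHint, error) {
	gang, _, ok := gangOf(pod)
	if !ok {
		return framework.QueueSkip, nil
	}
	// gangReadyToRetry (hypothetical) answers: did this event plausibly
	// resolve the failure of the member that was rejected (e.g.
	// hoge-plugin's failure for Pod5)?
	if pl.gangReadyToRetry(gang, oldObj, newObj) {
		return framework.Queue, nil
	}
	return framework.QueueSkip, nil
}
```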

sanposhiho commented on August 17, 2024

Kubernetes already supports pluggable schedulers. Should this feature be delivered in-tree?

I've actually got the same feeling.
We've kept coscheduling in sigs/scheduler-plugins until now; what's the motivation for moving it to an in-tree plugin now?

(As I said, of course, a plugin implementation would be simpler if it were supported as an in-tree plugin and we changed the scheduler implementation for it.) Technically, our plugin shows that we can implement all the sophisticated tricks described in my comments (plus more) as a custom plugin, without requiring any implementation change on the scheduler side.

kerthcet commented on August 17, 2024

As mentioned in several points above: given that there's yet another gang scheduling implementation out there, we should support this upstream to avoid reimplementations again and again and again.

sanposhiho commented on August 17, 2024

I just commented on that point in your doc: what's the actual pain point for us in that situation? Why do we want everyone to use the same gang scheduling solution?
Can't we just say that we officially maintain coscheduling, and we don't care about the others maintained by other communities?

kerthcet commented on August 17, 2024

Why do we want everyone to use the same gang scheduling solution?

That's a good question. I think what we want is not to force everyone to use the same gang scheduling solution, but to make it extendable, since people may want different queueing or preemption logic. At the same time, we should provide a standard gang scheduling primitive for users; that doesn't mean it's the best one.

OTOH, I think all solutions share the goal of making podGroup scheduling efficient; that's what we can plumb into the schedulingQueue, as we do with activeQ/backoffQ/unschedulablePods.

I'll list co-scheduling with the native scheduler as an Alternative and append it to the proposal later.

Hope to hear other advice about how to make podGroup extendable. Do you have any advice from your gang scheduler plugin, @sanposhiho?

alculquicondor commented on August 17, 2024

Why do we want everyone to use the same gang scheduling solution?

We need to standardize the API, at the very least.

Furthermore, we should have an in-tree plugin that is compatible with other sig-scheduling projects, primarily Kueue.

tenzen-y commented on August 17, 2024

Why do we want everyone to use the same gang scheduling solution?

We need to standardize the API, at the very least.

Furthermore, we should have an in-tree plugin that is compatible with other sig-scheduling projects, primarily Kueue.

Also, an in-tree plugin would be worthwhile when we support gang scheduling in JobSet and LeaderWorkerSet (sig-apps sub-projects).

kerthcet commented on August 17, 2024

Compatibility with other subprojects should be another goal. Will add it to the proposal as well.

tenzen-y commented on August 17, 2024

Compatibility with other subprojects should be another goal. Will add it to the proposal as well.

I didn't mean that we should work on integrations with subprojects. I just raised use cases.

kerthcet commented on August 17, 2024

Integration is out of scope; what I'm referring to is compatibility.

tenzen-y commented on August 17, 2024

Integration is out of scope; what I'm referring to is compatibility.

Yes, that's what I meant.

I DIDN'T mean that we should work on integrations with subprojects. I just raised use cases.

alculquicondor commented on August 17, 2024

To me, those 2 motivations are equally important.

Hopefully, in the future, people don't need to implement custom schedulers to get all-or-nothing scheduling.
