volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)

Home Page: https://volcano.sh

License: Apache License 2.0

Makefile 0.30% Shell 3.49% Go 95.96% Dockerfile 0.20% Python 0.02% Smarty 0.02%
batch-systems kubernetes golang hpc bigdata machine-learning gene

volcano's Introduction


Volcano is a batch system built on Kubernetes. It provides a suite of mechanisms that are commonly required by many classes of batch and elastic workloads, including machine learning/deep learning, bioinformatics/genomics and other "big data" applications. These types of applications typically run on generalized domain frameworks like TensorFlow, Spark, Ray, PyTorch, MPI, etc., which Volcano integrates with.

Volcano builds upon a decade and a half of experience running a wide variety of high performance workloads at scale using several systems and platforms, combined with best-of-breed ideas and practices from the open source community.

As of June 2021, Volcano has been widely used around the world in a variety of industries such as Internet, Cloud, Finance, Manufacturing and Medical. More than 20 companies or institutions are not only end users but also active contributors. Hundreds of contributors take an active part in code commits, PR reviews, issue discussions, docs updates and design proposals. We look forward to your participation.

NOTE: the scheduler is built based on kube-batch; refer to #241 and #288 for more detail.

Volcano is an incubating project of the Cloud Native Computing Foundation (CNCF). Please consider joining the CNCF if you are an organization that wants to take an active role in supporting the growth and evolution of the cloud native ecosystem.

Overall Architecture

[Figure: Volcano overall architecture diagram]

Talks

Ecosystem

Quick Start Guide

Prerequisites

  • Kubernetes 1.12+ with CRD support

You can try Volcano in one of the following ways.

Note:

  • For Kubernetes v1.17+ use CRDs under config/crd/bases (recommended)
  • For Kubernetes versions < v1.16 use CRDs under config/crd/v1beta1 (deprecated)

Install with YAML files

Install Volcano on an existing Kubernetes cluster. This method works on both x86_64 and arm64 architectures.

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

Enjoy! Volcano will create the following resources in the volcano-system namespace.

NAME                                       READY   STATUS      RESTARTS   AGE
pod/volcano-admission-5bd5756f79-dnr4l     1/1     Running     0          96s
pod/volcano-admission-init-4hjpx           0/1     Completed   0          96s
pod/volcano-controllers-687948d9c8-nw4b4   1/1     Running     0          96s
pod/volcano-scheduler-94998fc64-4z8kh      1/1     Running     0          96s

NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/volcano-admission-service   ClusterIP   10.98.152.108   <none>        443/TCP   96s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/volcano-admission     1/1     1            1           96s
deployment.apps/volcano-controllers   1/1     1            1           96s
deployment.apps/volcano-scheduler     1/1     1            1           96s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/volcano-admission-5bd5756f79     1         1         1       96s
replicaset.apps/volcano-controllers-687948d9c8   1         1         1       96s
replicaset.apps/volcano-scheduler-94998fc64      1         1         1       96s

NAME                               COMPLETIONS   DURATION   AGE
job.batch/volcano-admission-init   1/1           48s        96s

Install via helm

To install an official release, see helm-charts for details.

helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace

Install from source code for developers:

helm install volcano installer/helm/chart/volcano --namespace volcano-system --create-namespace

# list helm release
helm list -n volcano-system

Install from code

If you don't have a Kubernetes cluster, try the one-click install from the code base:

./hack/local-up-volcano.sh

This method is currently available only for x86_64.

Install monitoring system

If you want the Prometheus and Grafana dashboards for Volcano after Volcano is installed, try the following commands:

make TAG=latest generate-yaml
kubectl create -f _output/release/volcano-monitoring-latest.yaml

Kubernetes compatibility

Kubernetes 1.17 Kubernetes 1.18 Kubernetes 1.19 Kubernetes 1.20 Kubernetes 1.21 Kubernetes 1.22 Kubernetes 1.23 Kubernetes 1.24 Kubernetes 1.25 Kubernetes 1.26 Kubernetes 1.27 Kubernetes 1.28 Kubernetes 1.29 Kubernetes 1.30
Volcano v1.6 - - - - - - -
Volcano v1.7 - - - _
Volcano v1.8 - - - -
Volcano v1.9 - - - - -
Volcano HEAD (master) - - - -

Key:

  • ✓ Volcano and the Kubernetes version are exactly compatible.
  • + Volcano has features or API objects that may not be present in the Kubernetes version.
  • - The Kubernetes version has features or API objects that Volcano can't use.

Meeting

Community weekly meeting for Asia: 15:00 - 16:00 (UTC+8) Friday. (Convert to your timezone.)

Community biweekly meeting for America: 08:30 - 09:30 (UTC-8) Thursday. (Convert to your timezone.)

The community meeting for Europe is held on demand. If you have ideas or topics to discuss, please leave a message in Slack. Maintainers will contact you and book an open meeting.

Resources:

Contact

If you have any questions, feel free to reach out to us in the following ways:

Volcano Slack Channel | Join

Mailing List

Wechat: Add WeChat account k8s2222 (华为云小助手2号) to let her pull you into the group.

volcano's People

Contributors

alcorj-mizar, asifdxtreme, daixiang0, dmatch01, haozi23, hex108, huone1, hwdef, hzxuzhonghu, k82cn, k8s-ci-robot, kerthcet, lminzhw, lowang-bh, mikechengwei, monokaix, qiankunli, shinytang6, sivanzcw, thandayuthapani, thor-wl, tommylike, volcano-sh-bot, wackxu, waiterq, wangyang0616, wangyuqing4, william-wang, wpeng102, zen-xu

volcano's Issues

Fix state machine issue

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

A new enqueue state has been introduced, but it's unfinished; we need to keep working on it and fix the related test case issues.

NOTES: Some test cases expect the job status to go pending->running/xxxxx, which is incorrect with the new enqueue status; please update them all as well.

Support ScheduledJob/CronJob

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Scheduled jobs are a common requirement for high performance workloads.

make mutating and validating admission controllers consistent

func mutateSpec(tasks []v1alpha1.TaskSpec, basePath string) (patch []patchOperation) {
	for index := range tasks {
		// add default task name
		taskName := tasks[index].Name
		if len(taskName) == 0 {
			tasks[index].Name = v1alpha1.DefaultTaskSpec
		}
	}
	patch = append(patch, patchOperation{
		Op:    "replace",
		Path:  basePath,
		Value: tasks,
	})

	return patch
}

If the user does not specify the task names of a job, the default name is used in the mutating stage, but the validating admission controller then rejects the Job creation because of duplicate task names.
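One possible way to keep the two stages consistent is for the mutating stage to derive each default name from the task index, so that unnamed tasks never share a name. The following is only a minimal sketch: `TaskSpec` is a simplified stand-in for `v1alpha1.TaskSpec`, and `defaultTaskPrefix`, `setDefaultTaskNames` and `hasDuplicateNames` are hypothetical names, not the actual Volcano code.

```go
package main

import (
	"fmt"
	"strconv"
)

// TaskSpec is a simplified stand-in for v1alpha1.TaskSpec (assumption).
type TaskSpec struct {
	Name string
}

// defaultTaskPrefix is a hypothetical prefix for generated task names.
const defaultTaskPrefix = "default"

// setDefaultTaskNames gives every unnamed task a name derived from its index,
// so that two unnamed tasks never receive the same default name.
func setDefaultTaskNames(tasks []TaskSpec) {
	for i := range tasks {
		if tasks[i].Name == "" {
			tasks[i].Name = defaultTaskPrefix + strconv.Itoa(i)
		}
	}
}

// hasDuplicateNames mirrors the kind of check a validating webhook might run.
func hasDuplicateNames(tasks []TaskSpec) bool {
	seen := make(map[string]struct{}, len(tasks))
	for _, t := range tasks {
		if _, ok := seen[t.Name]; ok {
			return true
		}
		seen[t.Name] = struct{}{}
	}
	return false
}

func main() {
	tasks := []TaskSpec{{}, {Name: "worker"}, {}}
	setDefaultTaskNames(tasks)
	fmt.Println(tasks, hasDuplicateNames(tasks))
}
```

Note that a user-supplied name such as "default0" could still collide with a generated one, so the validating check remains necessary.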

Speed up E2E tests

/kind bug

Currently, Travis spends almost 26 minutes to finish the e2e tests; we need to figure out how to speed up these tests.

Ran for 26 min 14 sec
Ran 33 of 33 Specs in 773.302 seconds
SUCCESS! -- 33 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestE2E (773.30s)
PASS
ok  	volcano.sh/volcano/test/e2e	773.323s
release "integration" deleted
Running kind: [kind delete cluster --name integration]
Deleting cluster "integration" ...
$KUBECONFIG is still set to use /home/travis/.kube/kind-config-integration even though that file has been deleted, remember to unset it
Volcano logs are currently not supported.
The command "make e2e-test-kind" exited with 0.

Add PodGroupController to create shadow PodGroup

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Currently, kube-batch creates a shadow PodGroup from the pod's OwnerReference for upstream objects, e.g. Deployment. This makes Queue-related features, e.g. Queue's status, harder to implement; it's better to have a dedicated controller to create PodGroups for upstream objects.

Makefile cleanup

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

Description:

It's better to support the following targets in the Makefile:

  1. make: only make related binaries, e.g. controller, scheduler
  2. make images: build related docker images
  3. make e2e-test-kind: run e2e test with kind
  4. make unit-test: run unit test
  5. make integration-test: run integration test

Reclaim CI failed

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

Description:

The Queue E2E test failed as follows; it seems there are not enough resources for the reclaim e2e test.

• Failure [19.615 seconds]
Queue E2E Test
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/queue.go:26
  Reclaim [It]
  /home/travis/gopath/src/volcano.sh/volcano/test/e2e/queue.go:27
  Expected error:
      <*errors.errorString | 0xc00028f4f0>: {
          s: "expected replica <1> is too small",
      }
      expected replica <1> is too small
  not to have occurred
  /home/travis/gopath/src/volcano.sh/volcano/test/e2e/queue.go:57

refer to https://travis-ci.com/volcano-sh/volcano/jobs/188302297 for more detail :)

Allow multiple sync job workers to run in parallel

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened:

Currently, there is only one goroutine worker syncing jobs. For large-scale jobs, this will be a bottleneck.
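One common way to parallelize this kind of work is to shard requests by job key, so that several workers run concurrently while requests for the same job are still handled in order by a single goroutine. The sketch below only illustrates that idea under stated assumptions: `shardFor`, `runWorkers` and the "namespace/name" key format are hypothetical and not taken from the Volcano controller.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shardFor maps a job key (e.g. "namespace/name") to one of n workers, so
// that requests for the same job always go to the same goroutine.
func shardFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

// runWorkers starts n workers (n must be > 0), dispatches every key to its
// shard, waits for all workers to finish and returns the keys each handled.
func runWorkers(keys []string, n int, syncJob func(key string)) [][]string {
	queues := make([]chan string, n)
	handled := make([][]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		queues[i] = make(chan string, len(keys))
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			for key := range queues[i] {
				syncJob(key)
				// Each worker writes only to its own slot, so no lock is needed.
				handled[i] = append(handled[i], key)
			}
		}(i)
	}
	for _, key := range keys {
		queues[shardFor(key, n)] <- key
	}
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
	return handled
}

func main() {
	keys := []string{"ns1/job-a", "ns1/job-b", "ns2/job-a", "ns1/job-a"}
	handled := runWorkers(keys, 3, func(key string) {
		// A real controller would reconcile the job here.
	})
	for i, h := range handled {
		fmt.Printf("worker %d handled %v\n", i, h)
	}
}
```

Sharding by key trades perfectly even load for per-job ordering; a plain shared queue would balance better but could process two requests for the same job at once.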

Set default value of PodGroup in admission controller

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Currently, the default values of PodGroup are set by the operator/customized controller, which is inconvenient for developers. It's better to set those default values on PodGroup for all users/developers.

Add e2e test for admission service

Is this a BUG REPORT or FEATURE REQUEST?:

/kind test

What happened:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • master

Resource Reservation to avoid starvation

Description:

When batch jobs have to compete with each other or with elastic jobs for resources, the resources that become available are likely to be taken immediately by elastic jobs. Batch jobs need multiple resources to be available before they can be dispatched. If the cluster is always busy, a large batch job could be pending indefinitely. The more processors a parallel job requires, the worse the problem is. Resource reservation solves this problem by reserving resources as they become available, until there are enough reserved resources to run the batch job.
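The accumulation described above can be sketched in a few lines. This is an illustrative model only, assuming a single resource dimension (CPUs) and a single pending batch job; the `reservation` type and its `offer` method are hypothetical and not part of the Volcano scheduler.

```go
package main

import "fmt"

// reservation accumulates freed resources for one pending batch job.
type reservation struct {
	needed   int // CPUs the batch job needs before it can start
	reserved int // CPUs held back so far
}

// offer gives freed CPUs to the reservation first. It returns the CPUs left
// over for other (e.g. elastic) jobs and whether the batch job can now run.
func (r *reservation) offer(freed int) (leftover int, ready bool) {
	take := r.needed - r.reserved
	if take > freed {
		take = freed
	}
	r.reserved += take
	return freed - take, r.reserved == r.needed
}

func main() {
	r := &reservation{needed: 8}
	for _, freed := range []int{2, 3, 4} {
		leftover, ready := r.offer(freed)
		fmt.Printf("freed=%d reserved=%d leftover=%d ready=%v\n",
			freed, r.reserved, leftover, ready)
	}
}
```

In this run the job becomes ready only after the third release, and one CPU is left over for other jobs; without the reservation, each release could have been consumed by elastic jobs before the batch job ever fit.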

Pass conformance test

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Cherry pick related PR in kube-batch to volcano-sh/kube-batch for conformance test.

/cc @asifdxtreme

Deleting helm chart exits with error

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

While deleting the helm chart, we get an error and all CRDs still exist.

# helm delete sid
Error: deletion completed with 1 error(s): mutatingwebhookconfigurations.admissionregistration.k8s.io "sid-mutate-job" already exists

Because of this, we need to delete all CRDs before deploying it again.

What you expected to happen:
Deleting the helm chart should exit properly.

Setup travis as CI env

For now, we found the following two issues:

  • no hack/verify-gofmt.sh for make verify
  • both e2e-test and e2e-test-kind miss scripts

Add event on actions

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Currently, we only record events for Commands; it's better to also record an event for each action of a job.

How do we support Job.Spec update

I noticed we want to support Job.Spec update in Controller.updateJob.

However, the generated request is

	req := apis.Request{
		Namespace: newJob.Namespace,
		JobName:   newJob.Name,

		Event: vkbatchv1.OutOfSyncEvent,
	}

But in syncJob, if no pods are provided in the request, it will create new pods for the Job, so it will fail, and the resulting status is unknown.

By the way, I am not very familiar with the entire state machine, so I may be missing something.

Job GC

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Make sure "completed" & "terminated" jobs will be removed later.

Support Task/Job retry

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Support task/job retry; if it still fails after the retry count is reached, mark it as Failed.

Unable to get CSR when building test cluster

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:

Error log:

certificatesigningrequest.certificates.k8s.io/integration-admission-service.kube-system created
NAME                                        AGE   REQUESTOR          CONDITION
integration-admission-service.kube-system   0s    kubernetes-admin   Pending
certificatesigningrequest.certificates.k8s.io/integration-admission-service.kube-system approved
ERROR: After approving csr integration-admission-service.kube-system, the signed certificate did not appear on the resource. Giving up after 10 attempts.
Error: plugin "gen-admission-secret" exited with error
Install volcano chart
NAME:   integration
LAST DEPLOYED: Mon Apr  1 03:20:04 2019
NAMESPACE: kube-system
STATUS: DEPLOYED

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Update Imports

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
I see imports in files using the format "volcano.sh/volcano/XXX/XXX/XXX"

What you expected to happen:
They should be of the format "github.com/volcano-sh/volcano/XXX/XXX/XXX"

Enable robot for Volcano

Currently, we still merge code manually; it's better to have a robot for it. We can leverage a robot from another community, e.g. Kubernetes.

Queue controller and related cli

Currently, a user can only create a Queue for scheduling, but it's hard to get more info about it, e.g. how many jobs are in the queue or which plugins are used by the queue; and if the Queue is deleted, its jobs are still there :( It's better to have a QueueController to manage the Queue's lifecycle and update its status, and to have a related command line for users to get its info.

The docker image name should align with binaries'

Is this a BUG REPORT or FEATURE REQUEST?:

/kind cleanup

Description:

In Makefile, our binaries are vk-controller, vk-scheduler and so on; but the docker image is volcanosh/volcano-scheduler. It's better to make them align with each other to avoid confusion.

/cc @asifdxtreme

11 tests failed in CI

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

Description:
11 tests failed in CI; we need to get them fixed ASAP before the release.

[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodFailed; Action: TerminateJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:102

[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodFailed; Action: AbortJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:139

[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodEvicted; Action: RestartJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:174

[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodEvicted; Action: TerminateJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:218

[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodEvicted; Action: AbortJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:262

[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: Any; Action: RestartJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:306

[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: TaskCompleted; Action: CompletedJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:468

[Fail] Job Error Handling [It] job level LifecyclePolicy, error code: 3; Action: RestartJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:507

[Fail] Job E2E Test [It] Gang scheduling
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_scheduling.go:109

[Fail] MPI E2E Test [It] will run and complete finally
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/mpi.go:74

[Fail] Job E2E Test: Test Job Command [It] Suspend pending job
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/command.go:142

xref https://travis-ci.com/volcano-sh/volcano/jobs/197649052

Support TaskSpec level error handling

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Currently, we only support Job level and Task instance level error handling; TaskSpec level error handling is also necessary, e.g. the MPI job should be completed when the mpirun Pod completes successfully.

Makefile cleanup

Is this a BUG REPORT or FEATURE REQUEST?:

/kind cleanup

Description:

  • release is almost equal to all
  • docker target should be images
  • build info is necessary
  • cannot build a release from macOS or other platforms

Refactor Delay Pod Creation by admission controller

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Currently, we delay pod creation in the job controller, which makes two scenarios hard:

  1. vk.job can not work with other scheduler
  2. enqueue can not support other operators

To resolve the above issues, we prefer to add an admission controller to check PodGroup's status for them. If they do not use PodGroup, PodGroupController will help them by creating a shadow one.

Add error handling for exit code

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

We got this requirement from a user; it's better to support error handling based on exit code.
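As a rough illustration of what exit-code-based handling might involve, the sketch below maps a container exit code to an action. Everything here is an assumption for discussion: the `exitCodePolicy` type and `actionFor` function are hypothetical, and only the action names `RestartJob` and `AbortJob` are borrowed from the existing lifecycle policy tests.

```go
package main

import "fmt"

// action names what the controller should do with the job (illustrative).
type action string

const (
	restartJob action = "RestartJob"
	abortJob   action = "AbortJob"
)

// exitCodePolicy pairs a container exit code with an action (hypothetical).
type exitCodePolicy struct {
	exitCode int32
	action   action
}

// actionFor returns the action of the first policy matching code, or
// fallback when no policy matches.
func actionFor(policies []exitCodePolicy, code int32, fallback action) action {
	for _, p := range policies {
		if p.exitCode == code {
			return p.action
		}
	}
	return fallback
}

func main() {
	policies := []exitCodePolicy{{exitCode: 3, action: restartJob}}
	fmt.Println(actionFor(policies, 3, abortJob)) // prints "RestartJob"
	fmt.Println(actionFor(policies, 1, abortJob)) // prints "AbortJob"
}
```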

Support Job plugins

Both MPI and TensorFlow need a hostfile for their workers, and an MPI job needs more, e.g. ssh authentication. It's better to provide related plugins for different workloads.

The YAML file may be similar to the following:

spec:
  plugins:
    ssh: ["seed"]
    env: [""]

For example, if ssh is enabled, the job controller should create the related RSA public/private keys and mount them for ssh.

Add example on MPI Job

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Add an example on how to run MPI job :)

Resolve the golint issues

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
Resolve all the golint issues ignored in the file of .golint_failures

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
