cyclecloud-pbspro's Issues

Add HBv3 support

HBv3 resources are not recognized by scalelib.
The workaround is to add them at the beginning of the default_resources list in autoscale.json:

"default_resources": [
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120rs_v3"
},
"value": 120
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120rs_v3"
},
"value": "memory::448g"
},
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120-96rs_v3"
},
"value": 96
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120-96rs_v3"
},
"value": "memory::448g"
},
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120-64rs_v3"
},
"value": 64
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120-64rs_v3"
},
"value": "memory::448g"
},
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120-32rs_v3"
},
"value": 32
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120-32rs_v3"
},
"value": "memory::448g"
}

Parameterize Azure.MaxScalesetSize

Other schedulers allow users to edit the maximum scaleset size via a parameter. Currently, users must add Azure.MaxScalesetSize=X manually if they want something other than the default of 100.
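For reference, a minimal sketch of the manual workaround, placed on a nodearray in the cluster template (the nodearray name and the value 300 are arbitrary examples):

    [[nodearray execute]]
        # Raise the per-scaleset VM cap from the default of 100
        Azure.MaxScalesetSize = 300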

Modifying the stack soft limit

Hi,
I've noticed CycleCloud recently changed the behavior for stack size limits. It now adds this:

$ cat /etc/security/limits.conf |grep stack
#        - stack - max stack size (KB)
*               hard    stack           unlimited
*               soft    stack           unlimited

However, I am not sure where this comes from: I can't find it in this repo, and as far as I can tell it is not from the CentOS HPC image (https://github.com/openlogic/AzureBuildCentOS).

In any case, if someone else stumbles over this: Abaqus, at least, does not accept unlimited as a soft stack limit.
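A possible workaround, assuming your image reads drop-ins from /etc/security/limits.d (which are parsed after limits.conf and so take precedence), is to pin the soft limit back to a finite value; the 8192 KB here is just an example:

    # /etc/security/limits.d/99-stack.conf
    # keep the hard limit unlimited but give applications like Abaqus a finite soft limit
    *               hard    stack           unlimited
    *               soft    stack           8192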

Greetings
Klaas

/etc/profile.d/azpbs_autocomplete.sh is breaking PBS Dataservice restart

Context

  • Standalone Scheduler VM
  • OpenPBS 22.05.11
  • Ubuntu 20.04
  • cyclecloud-pbspro 2.0.21

After executing this:

    ./initialize_pbs.sh 
    ./initialize_default_queues.sh
    ./install.sh  --install-python3 --venv /opt/cycle/pbspro/venv --cron-method pbs_hook

PBS fails to restart:

root@scheduler:/etc/profile.d# /opt/pbs/libexec/pbs_init.d stop
Stopping PBS
Shutting server down with qterm.
PBS server - was pid: 103121
PBS sched - was pid: 103025
PBS comm - was pid: 103010
Waiting for shutdown to complete
root@scheduler:/etc/profile.d# /opt/pbs/libexec/pbs_init.d start
Starting PBS
/opt/pbs/sbin/pbs_comm ready (pid=104252), Proxy Name:scheduler.internal.cloudapp.net:17001, Threads:4
PBS comm
PBS sched
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
Failed to start PBS dataservice. 
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
Failed to start PBS dataservice. 
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
Failed to start PBS dataservice. 
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
Failed to start PBS dataservice. 
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
Failed to start PBS dataservice. 
^C
root@scheduler:/etc/profile.d# 

Workaround

After removing /etc/profile.d/azpbs_autocomplete.sh, it works again (a less destructive guard is sketched after the log below):

root@scheduler:/etc/profile.d# rm azpbs_autocomplete.sh 
root@scheduler:/etc/profile.d# /opt/pbs/libexec/pbs_init.d start
Starting PBS
PBS comm already running.
PBS scheduler already running.
/opt/pbs/sbin/pbs_ds_systemd: 43: [: xdegraded: unexpected operator
Connecting to PBS dataservice....connected to PBS [email protected]
PBS server
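The "-sh: 32: eval" errors come from /bin/sh (dash on Ubuntu) sourcing a profile script that uses bash-only syntax. A less destructive workaround than deleting the file might be to guard its contents so that only bash evaluates them; a sketch, where the placeholder stands for whatever the generated script currently contains:

    # /etc/profile.d/azpbs_autocomplete.sh
    # dash (the default /bin/sh on Ubuntu) cannot parse the bash-only
    # completion code, so skip it unless we are actually running under bash
    if [ -n "$BASH_VERSION" ]; then
        # ... original azpbs completion code here ...
        :
    fi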

hwloc-libs RPM is no longer provided in the EPEL repo for CentOS 8

The OpenPBS installers fail with a missing hwloc-libs package. This impacts CentOS 8.

The workaround is to download it from AlmaLinux:
- wget -O /tmp/hwloc-libs-1.11.9-3.el8.x86_64.rpm https://repo.almalinux.org/almalinux/8.3-beta/BaseOS/x86_64/os/Packages/hwloc-libs-1.11.9-3.el8.x86_64.rpm
- yum install -y /tmp/hwloc-libs-1.11.9-3.el8.x86_64.rpm
- rm -f /tmp/hwloc-libs-1.11.9-3.el8.x86_64.rpm

Can we redistribute it in the release artifacts?

Jetpack error while deploying pbspro cluster

Hello,
I'm trying to deploy a pbspro cluster with customized images. However, I keep running into the following error:

Check /opt/cycle/jetpack/logs/installation.log for more information
Get more help on this issue
Detail:

Traceback (most recent call last):
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/admin/validate.py", line 27, in execute
    jetpack.util.test_cyclecloud_connection(connection)
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/util.py", line 439, in test_cyclecloud_connection
    r = _connect_to_cyclecloud(connection, path)
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/util.py", line 482, in _connect_to_cyclecloud
    conn, headers = jetpack.util.get_cyclecloud_connection(config)
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/util.py", line 380, in get_cyclecloud_connection
    if jetpack.config.get('cyclecloud.skip_ssl_validation', default_skip_ssl_validation):
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/config.py", line 29, in get
    raise ConfigError(UNKNOWN_ERROR)
ConfigError: An unknown error occurred while processing the configuration data file.

I've boiled down the customized image to an updated CentOS configuration with 'cmake' installed. I'm unfamiliar with Jetpack, so I cannot figure out the origin of the problem or how to fix it. Thanks.

autoscaler does not handle badly formatted JSON qstat output

cyclecloud-pbspro version 2.0.10

With OpenPBS 19.1.1, the JSON output of qstat can be badly formatted when jobs carry complex environment variables.
For example, qstat -f <job_id> -F json | jq '.' returns an error, meaning the JSON is malformed.
As a result, the autoscaler stops working, so a single bad job can hang the whole system and prevent any new nodes from being added.

Here is the output of azpbs autoscale (a shell-level check is sketched after the traceback below):

File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/environment.py", line 58, in from_driver
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/init.py", line 996, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 47623-47628: ordinal not in range(128)
Call stack:
  File "/root/bin/azpbs", line 4, in
    main()
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/cli.py", line 284, in main
    clilib.main(argv or sys.argv[1:], "pbspro", PBSCLI())
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/clilib.py", line 1739, in main
    args.func(**kwargs)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/clilib.py", line 1315, in analyze
    dcalc = self._demand(config, ctx_handler=ctx_handler)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/clilib.py", line 360, in _demand
    dcalc, jobs = self._demand_calc(config, driver)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/cli.py", line 113, in _demand_calc
    pbs_env = self._pbs_env(pbs_driver)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/cli.py", line 106, in _pbs_env
    self.__pbs_env = environment.from_driver(pbs_driver.config, pbs_driver)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/environment.py", line 58, in from_driver
    jobs = pbs_driver.parse_jobs(queues, default_scheduler.resources_for_scheduling)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/driver.py", line 414, in parse_jobs
    self.pbscmd, self.resource_definitions, queues, resources_for_scheduling
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/driver.py", line 530, in parse_jobs
    response: Dict = pbscmd.qstat_json("-f", "-t")
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/pbscmd.py", line 31, in qstat_json
    response = self.qstat(*args)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/pbscmd.py", line 25, in qstat
    return self._check_output(cmd)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/pbscmd.py", line 76, in _check_output
    logger.info("Response: %s", ret)
  File "/usr/lib64/python3.6/logging/init.py", line 1308, in info
    self._log(INFO, msg, args, **kwargs)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/hpclogging.py", line 45, in _log
    **stacklevelkw
Message: 'Response: %s'
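Until the parsing is hardened, a quick way to reproduce and sanitize the problem from the shell (a sketch; the tr filter deletes the ASCII control characters, other than tab, newline, and carriage return, that typically break the JSON):

    # reproduce what the autoscaler runs and validate the JSON
    qstat -f -t -F json | jq '.' > /dev/null || echo "malformed JSON"

    # stripping raw control characters often makes the output parseable again
    qstat -f -t -F json | tr -d '\000-\010\013\014\016-\037' | jq '.' > /dev/null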

autoscaler is not adding nodes

Running a non-MPI job using "-l select=1:slot_type=execute:ungrouped=true" as the select statement.
The execute node array is not spot, but the autoscaler is not adding a new node.

[xpillons@ondemand ~]$ qstat -fx 1651
Job Id: 1651.scheduler
Job_Name = sys-dashboard-sys-codeserver
Job_Owner = [email protected]
job_state = Q
queue = workq
server = scheduler
Checkpoint = u
ctime = Fri Nov 19 09:56:25 2021
Error_Path = ondemand.internal.cloudapp.net:/anfhome/xpillons/ondemand/data
/sys/dashboard/batch_connect/sys/codeserver/output/c1144623-b9b5-44a2-8
5b1-93fa66a0dc14/sys-dashboard-sys-codeserver.e1651
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Fri Nov 19 09:56:25 2021
Output_Path = ondemand.internal.cloudapp.net:/anfhome/xpillons/ondemand/dat
a/sys/dashboard/batch_connect/sys/codeserver/output/c1144623-b9b5-44a2-
85b1-93fa66a0dc14/output.log
Priority = 0
qtime = Fri Nov 19 09:56:25 2021
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = scatter:excl
Resource_List.select = 1:slot_type=execute:ungrouped=true
Resource_List.slot_type = execute
Resource_List.ungrouped = false
Resource_List.walltime = 03:00:00
Shell_Path_List = /bin/bash
substate = 10
Variable_List = PBS_O_HOME=/anfhome/xpillons,PBS_O_LANG=C,
PBS_O_LOGNAME=xpillons,
PBS_O_PATH=/var/www/ood/apps/sys/dashboard/tmp/node_modules/yarn/bin:/
opt/ood/ondemand/root/usr/share/gems/2.7/bin:/opt/rh/rh-nodejs12/root/u
sr/bin:/opt/rh/rh-ruby27/root/usr/local/bin:/opt/rh/rh-ruby27/root/usr/
bin:/opt/rh/httpd24/root/usr/bin:/opt/rh/httpd24/root/usr/sbin:/opt/ood
/ondemand/root/usr/bin:/opt/ood/ondemand/root/usr/sbin:/sbin:/bin:/usr/
sbin:/usr/bin,PBS_O_MAIL=/var/mail/root,PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/anfhome/xpillons/ondemand/data/sys/dashboard/batch_conn
ect/sys/codeserver/output/c1144623-b9b5-44a2-85b1-93fa66a0dc14,
PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,
PBS_O_HOST=ondemand.internal.cloudapp.net
etime = Fri Nov 19 09:56:25 2021
Submit_arguments = -N sys-dashboard-sys-codeserver -S /bin/bash -o /anfhome
/xpillons/ondemand/data/sys/dashboard/batch_connect/sys/codeserver/outp
ut/c1144623-b9b5-44a2-85b1-93fa66a0dc14/output.log -j oe -l select=1:sl
ot_type=execute:ungrouped=true -l walltime=03:00:00
project = _pbs_project_default

[root@scheduler ~]# azpbs analyze --job-id 1651
NotInAPlacementGroup : Bucket[array=execute vm_size=Standard_F2s_v2 id=ac4fc82f-3d82-4f20-bd6e-67299dcdd388] is not in a placement group
NotInAPlacementGroup : Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 id=4167f5f8-3a30-46df-92ba-8f77e0b636cd] is not in a placement group
NotInAPlacementGroup : Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 id=aed72457-3b3c-44a0-ab8f-9ebb1a954fee] is not in a placement group
NotInAPlacementGroup : Bucket[array=hb60rs vm_size=Standard_HB60rs id=53bb8fd0-e62e-468f-b278-93da9e6fef3e] is not in a placement group
NotInAPlacementGroup : Bucket[array=hc44rs vm_size=Standard_HC44rs id=be3124ac-3894-496a-8580-ef06aba68273] is not in a placement group
NotInAPlacementGroup : Bucket[array=viz vm_size=Standard_D8s_v3 id=03e27828-b4bd-4678-a8e5-5886599bc6e5] is not in a placement group
NotInAPlacementGroup : Bucket[array=viz3d vm_size=Standard_NV6 id=ab76e8b1-0632-49a9-be16-93d880341855] is not in a placement group
InvalidOption : Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=Standard_F2s_v2 attr=ungrouped]
InvalidOption : Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=Standard_F2s_v2 attr=ungrouped]
InvalidOption : Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=Standard_HB60rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=Standard_HB60rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=Standard_HC44rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=Standard_HC44rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_D8s_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_D8s_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_NV6 attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_NV6 attr=slot_type]

NoCandidatesFound : SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=execute, vm_size=Standard_F2s_v2, pg=None),reason=Bucket[array=execute vm_size=Standard_F2s_v2 id=ac4fc82f-3d82-4f20-bd6e-67299dcdd388] is not
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hb120rs_v2, vm_size=Standard_HB120rs_v2, pg=None),reason=Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 id=4167f5f8-3a30-46df-92ba-8f77e0
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hb120rs_v3, vm_size=Standard_HB120rs_v3, pg=None),reason=Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 id=aed72457-3b3c-44a0-ab8f-9ebb1a
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hb60rs, vm_size=Standard_HB60rs, pg=None),reason=Bucket[array=hb60rs vm_size=Standard_HB60rs id=53bb8fd0-e62e-468f-b278-93da9e6fef3e] is not i
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hc44rs, vm_size=Standard_HC44rs, pg=None),reason=Bucket[array=hc44rs vm_size=Standard_HC44rs id=be3124ac-3894-496a-8580-ef06aba68273] is not i
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=viz, vm_size=Standard_D8s_v3, pg=None),reason=Bucket[array=viz vm_size=Standard_D8s_v3 id=03e27828-b4bd-4678-a8e5-5886599bc6e5] is not in a pl
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=viz3d, vm_size=Standard_NV6, pg=None),reason=Bucket[array=viz3d vm_size=Standard_NV6 id=ab76e8b1-0632-49a9-be16-93d880341855] is not in a plac
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=execute, vm_size=Standard_F2s_v2, pg=Standard_F2s_v2_pg0),reason=Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=St
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=execute, vm_size=Standard_F2s_v2, pg=Standard_F2s_v2_pg1),reason=Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=St
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v2, vm_size=Standard_HB120rs_v2, pg=Standard_HB120rs_v2_pg0),reason=Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v2, vm_size=Standard_HB120rs_v2, pg=Standard_HB120rs_v2_pg1),reason=Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v3, vm_size=Standard_HB120rs_v3, pg=Standard_HB120rs_v3_pg0),reason=Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v3, vm_size=Standard_HB120rs_v3, pg=Standard_HB120rs_v3_pg1),reason=Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb60rs, vm_size=Standard_HB60rs, pg=Standard_HB60rs_pg0),reason=Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb60rs, vm_size=Standard_HB60rs, pg=Standard_HB60rs_pg1),reason=Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hc44rs, vm_size=Standard_HC44rs, pg=Standard_HC44rs_pg0),reason=Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hc44rs, vm_size=Standard_HC44rs, pg=Standard_HC44rs_pg1),reason=Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz, vm_size=Standard_D8s_v3, pg=Standard_D8s_v3_pg0),reason=Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz, vm_size=Standard_D8s_v3, pg=Standard_D8s_v3_pg1),reason=Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz3d, vm_size=Standard_NV6, pg=Standard_NV6_pg0),reason=Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz3d, vm_size=Standard_NV6, pg=Standard_NV6_pg1),reason=Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_

[root@scheduler ~]# azpbs buckets
NODEARRAY PLACEMENT_GROUP VM_SIZE VCPU_COUNT PCPU_COUNT MEMORY AVAILABLE_COUNT NCPUS NGPUS DISK HOST SLOT_TYPE GROUP_ID MEM CCNODEID UNGROUPED
execute Standard_F2s_v2 2 1 4.00g 512 1 0 20.00g execute none 4.00g true
execute Standard_F2s_v2_pg0 Standard_F2s_v2 2 1 4.00g 100 1 0 20.00g execute Standard_F2s_v2_pg0 4.00g false
execute Standard_F2s_v2_pg1 Standard_F2s_v2 2 1 4.00g 100 1 0 20.00g execute Standard_F2s_v2_pg1 4.00g false
hb120rs_v2 Standard_HB120rs_v2 120 120 456.00g 72 120 0 20.00g hb120rs_v2 none 456.00g true
hb120rs_v2 Standard_HB120rs_v2_pg0 Standard_HB120rs_v2 120 120 456.00g 72 120 0 20.00g hb120rs_v2 Standard_HB120rs_v2_pg0 456.00g false
hb120rs_v2 Standard_HB120rs_v2_pg1 Standard_HB120rs_v2 120 120 456.00g 72 120 0 20.00g hb120rs_v2 Standard_HB120rs_v2_pg1 456.00g false
hb120rs_v3 Standard_HB120rs_v3 120 120 448.00g 10 120 0 20.00g hb120rs_v3 none 448.00g true
hb120rs_v3 Standard_HB120rs_v3_pg0 Standard_HB120rs_v3 120 120 448.00g 10 120 0 20.00g hb120rs_v3 Standard_HB120rs_v3_pg0 448.00g false
hb120rs_v3 Standard_HB120rs_v3_pg1 Standard_HB120rs_v3 120 120 448.00g 10 120 0 20.00g hb120rs_v3 Standard_HB120rs_v3_pg1 448.00g false
hb60rs Standard_HB60rs 60 60 228.00g 40 60 0 20.00g hb60rs none 228.00g true
hb60rs Standard_HB60rs_pg0 Standard_HB60rs 60 60 228.00g 40 60 0 20.00g hb60rs Standard_HB60rs_pg0 228.00g false
hb60rs Standard_HB60rs_pg1 Standard_HB60rs 60 60 228.00g 40 60 0 20.00g hb60rs Standard_HB60rs_pg1 228.00g false
hc44rs Standard_HC44rs 44 44 352.00g 40 44 0 20.00g hc44rs none 352.00g true
hc44rs Standard_HC44rs_pg0 Standard_HC44rs 44 44 352.00g 40 44 0 20.00g hc44rs Standard_HC44rs_pg0 352.00g false
hc44rs Standard_HC44rs_pg1 Standard_HC44rs 44 44 352.00g 40 44 0 20.00g hc44rs Standard_HC44rs_pg1 352.00g false
viz Standard_D8s_v3 8 4 32.00g 50 4 0 20.00g viz none 32.00g true
viz Standard_D8s_v3_pg0 Standard_D8s_v3 8 4 32.00g 50 4 0 20.00g viz Standard_D8s_v3_pg0 32.00g false
viz Standard_D8s_v3_pg1 Standard_D8s_v3 8 4 32.00g 50 4 0 20.00g viz Standard_D8s_v3_pg1 32.00g false
viz3d Standard_NV6 6 6 56.00g 10 6 1 20.00g viz3d none 56.00g true
viz3d Standard_NV6_pg0 Standard_NV6 6 6 56.00g 10 6 1 20.00g viz3d Standard_NV6_pg0 56.00g false
viz3d Standard_NV6_pg1 Standard_NV6 6 6 56.00g 10 6 1 20.00g viz3d Standard_NV6_pg1 56.00g false

Slot_type seems to be ignored when provisioning nodes

I have a cyclecloud cluster generated from a modified version of the PBSpro template. In it I have added different types of nodes for execution, such as memory optimized nodes and HPC nodes for heavy duty numerical simulations.

When scheduling a job that uses #PBS -l slot_type=name_slot, I have found that sometimes resources are allocated that do not match the slot's configuration. The number of nodes and processes per node is respected, but the actual node type is not, which can result in reduced performance for certain applications.
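When this happens, it is worth comparing what the job requested with what each node actually advertises; a sketch using standard PBS commands (<job_id> is a placeholder):

    # what the job asked for
    qstat -f <job_id> | grep Resource_List.slot_type

    # what each vnode actually offers
    pbsnodes -a | grep -E 'Mom|resources_available.slot_type'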

slot_type is case sensitive

If a slot type is defined as hbv3, then the following job will never start:

qsub -l select=1:slot_type=HBv3 -I
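Until the comparison is made case-insensitive, the request has to match the defined value exactly; assuming the slot type was defined as hbv3, this form does start:

    qsub -l select=1:slot_type=hbv3 -I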

Request for PBS Pro 19.1.2

I would like to use PBS Pro 19.1.2 to match the version used in other production environments.
Any point release of CentOS 7 is fine as long as it is a stable one.

Issues with output files and working directory

Hello,
I'm having some trouble with the working directory and job outputs. The jobs run fine, but the output files are nowhere to be found. The details of a job follow:

Job Id: 0.ip-0A000204
    Job_Name = HPCG31_4
    Job_Owner = afernandez@ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloud
        app.net
    resources_used.cpupercent = 326
    resources_used.cput = 00:03:19
    resources_used.mem = 1889692kb
    resources_used.ncpus = 4
    resources_used.vmem = 3353476kb
    resources_used.walltime = 00:00:54
    job_state = E
    queue = workq
    server = ip-0A000204
    Checkpoint = u
    ctime = Thu Dec 19 19:56:48 2019
    Error_Path = ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp.ne
        t:/home/afernandez/HPCG31_4.e0
    exec_host = ip-0A000205/0*2+ip-0A000206/0*2
    exec_vnode = (ip-0A000205:ncpus=2)+(ip-0A000206:ncpus=2)
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Thu Dec 19 20:02:58 2019
    Output_Path = ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp.n
        et:/home/afernandez/hpcg31_4.out
    Priority = 0
    qtime = Thu Dec 19 19:56:48 2019
    Rerunable = True
    Resource_List.mpiprocs = 4
    Resource_List.ncpus = 4
    Resource_List.nodect = 2
    Resource_List.nodes = 2:ppn=2
    Resource_List.place = scatter:group=group_id
    Resource_List.select = 2:ncpus=2:mpiprocs=2
    Resource_List.slot_type = execute
    Resource_List.ungrouped = false
    Resource_List.walltime = 100:30:00
    stime = Thu Dec 19 20:02:04 2019
    session_id = 6412
    jobdir = /home/afernandez
    substate = 51
    Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,
        PBS_O_HOME=/home/afernandez,PBS_O_LOGNAME=afernandez,
        PBS_O_WORKDIR=/home/afernandez,PBS_O_LANG=en_US.UTF-8,
        PBS_O_PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/cycl
        e/jetpack/bin:/opt/pbs/bin:/opt/openmpi/bin:/home/afernandez/.local/bin:/ho
        me/afernandez/bin,PBS_O_MAIL=/var/spool/mail/afernandez,PBS_O_QUEUE=workq,
        PBS_O_HOST=ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp
        .net
    comment = Job run at Thu Dec 19 at 20:02 on (ip-0A000205:ncpus=2)+(ip-0A000
        206:ncpus=2)
    etime = Thu Dec 19 19:56:49 2019
    run_count = 1
    Exit_status = 0
    Submit_arguments = HPCG31_4.sh
    pset = group_id=single
    project = _pbs_project_default

I don't understand why the output, error, and working paths show the IP plus iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp.net, or whether this is what prevents the files from showing up in the home directory.
Thanks.

Use queue's available_resources when autoscaling

The integration currently only considers the MaxCoreCount set on the nodearray to limit autoscaling. PBS has limits at the queue level that we are not considering. We already parse and store these in PBSQueue.resource_state.available_resources; we just need to propagate them as a limit to the jobs (a queue-level example is sketched below).

Excellent internal write up - see Ava ticket 2406060060001613
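For reference, queue-level limits are the standard resources_available settings on a PBS queue, e.g. (a sketch with an arbitrary value):

    # cap the total ncpus consumable by jobs in workq; the autoscaler should
    # treat this as an upper bound in addition to the nodearray MaxCoreCount
    qmgr -c "set queue workq resources_available.ncpus = 120"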

Nodes are often not unregistered from PBS

If I run azpbs autoscale, here is the output; these nodes have been deprovisioned by CycleCloud and no longer exist.
[root@scheduler hpcadmin]# azpbs autoscale
2021-04-01 09:11:18,334 ERROR: Could not convert private_ip(None) to hostname using gethostbyaddr() for SchedulerNode(254rq000000, 254rq000000, unknown, None): gethostbyaddr() argument 1 must be str, bytes or bytearray, not None
NAME HOSTNAME PBS_STATE JOB_IDS STATE VM_SIZE DISK NGPUS GROUP_ID MACHINETYPE MEM NCPUS NODEARRAY SLOT_TYPE UNGROUPED INSTANCE_ID CTR ITR
254rq000000 254rq000000 down running unknown 20.00gb/20.00gb 0/0 s_v2_pg0 456.00gb 120 unknown hb120rs_v2 false 0.0 0.0
[root@scheduler hpcadmin]#

azpbs remove_nodes doesn't remove them from pbsnodes

To reproduce, manually add nodes to the PBS cluster with "Add Nodes" in the UI. They get joined to the cluster. Then remove them with:

azpbs remove_nodes -H ip-0A010907 -H ip-0A010908 --force

Note that --force is required. They seem to go through the "down" state temporarily but then recover to free. Does this command work? A manual qmgr cleanup is sketched after the pbsnodes output below.

[root@ip-0A010906 ~]# pbsnodes -aS
vnode           state           OS       hardware host            queue        mem     ncpus   nmics   ngpus  comment
--------------- --------------- -------- -------- --------------- ---------- -------- ------- ------- ------- ---------
ip-0A010907     down            --       --       ip-0a010907     --              4gb       1       0       0 --
ip-0A010908     state-unknown   --       --       ip-0a010908     --              4gb       1       0       0 --
[root@ip-0A010906 ~]# pbsnodes -aS
vnode           state           OS       hardware host            queue        mem     ncpus   nmics   ngpus  comment
--------------- --------------- -------- -------- --------------- ---------- -------- ------- ------- ------- ---------
ip-0A010907     free            --       --       ip-0a010907     --              4gb       1       0       0 --
ip-0A010908     initializing    --       --       ip-0a010908     --              4gb       1       0       0 --
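As a manual cleanup in the meantime (a sketch, using the node names from above), the stale vnodes can be deleted from the PBS server directly:

    qmgr -c "delete node ip-0A010907"
    qmgr -c "delete node ip-0A010908"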
