azure / cyclecloud-pbspro
Example Azure CycleCloud PBSpro cluster type
License: MIT License
HBv3 resources are not recognized by scalelib.
The workaround is to add them at the beginning of the default_resources list in autoscale.json:
"default_resources": [
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120rs_v3"
},
"value": 120
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120rs_v3"
},
"value": "memory::448g"
},
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120-96rs_v3"
},
"value": 96
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120-96rs_v3"
},
"value": "memory::448g"
},
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120-64rs_v3"
},
"value": 64
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120-64rs_v3"
},
"value": "memory::448g"
},
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120-32rs_v3"
},
"value": 32
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120-32rs_v3"
},
"value": "memory::448g"
}
]
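After hand-editing autoscale.json, a quick syntax check helps avoid breaking the autoscaler with a stray comma or bracket. This is a sketch run against a stand-in file in /tmp; on a real scheduler you would point it at your installed autoscale.json.

```shell
# Write a minimal default_resources fragment to a stand-in file, then
# validate it with Python's bundled JSON checker.
cat > /tmp/default_resources_check.json <<'EOF'
{"default_resources": [
  {"name": "ncpus", "select": {"node.vm_size": "Standard_HB120rs_v3"}, "value": 120}
]}
EOF
python3 -m json.tool /tmp/default_resources_check.json >/dev/null \
    && echo "JSON is valid"
```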
Other schedulers allow users to edit the max scaleset size via a parameter. Currently, users must add Azure.MaxScalesetSize=X manually if they want something other than the default of 100.
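In a CycleCloud cluster template, the manual workaround looks like this (a sketch; the nodearray name and the value 300 are illustrative):

```
[[nodearray execute]]
    Azure.MaxScalesetSize = 300
```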
OpenPBS 20 releases include install packages for Ubuntu 18.04. It would be good if we supported this as well.
Hi,
I've noticed CycleCloud recently changed the behavior for stack size limits.
Now it adds this:
$ cat /etc/security/limits.conf |grep stack
# - stack - max stack size (KB)
* hard stack unlimited
* soft stack unlimited
However, I am not sure where this comes from; I can't find it in this repo, and as far as I can tell it is not from the CentOS HPC image (https://github.com/openlogic/AzureBuildCentOS).
In any case, if someone else runs into this: Abaqus, at least, does not accept "unlimited" as a soft limit.
Greetings
Klaas
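A possible workaround for the Abaqus problem (a sketch, not an official fix; the 8192 KB value is an arbitrary example) is to rewrite the soft limit to a finite value. Shown here against a stand-in copy of the file; on a real node you would apply the sed to /etc/security/limits.conf as root.

```shell
# Reproduce the lines CycleCloud adds, then rewrite the soft limit to a
# finite value, since Abaqus rejects "unlimited" as a soft stack size.
cat > /tmp/limits.conf <<'EOF'
* hard stack unlimited
* soft stack unlimited
EOF
sed -i 's/^\* soft stack unlimited$/* soft stack 8192/' /tmp/limits.conf
cat /tmp/limits.conf
```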
There's an error in the documentation for the PBS Pro configuration reference: pbspro.version is currently set to 18.1.3-0 by default, not 14.2.1-0.
After executing:
./initialize_pbs.sh
./initialize_default_queues.sh
./install.sh --install-python3 --venv /opt/cycle/pbspro/venv --cron-method pbs_hook
PBS is failing to restart:
root@scheduler:/etc/profile.d# /opt/pbs/libexec/pbs_init.d stop
Stopping PBS
Shutting server down with qterm.
PBS server - was pid: 103121
PBS sched - was pid: 103025
PBS comm - was pid: 103010
Waiting for shutdown to complete
root@scheduler:/etc/profile.d# /opt/pbs/libexec/pbs_init.d start
Starting PBS
/opt/pbs/sbin/pbs_comm ready (pid=104252), Proxy Name:scheduler.internal.cloudapp.net:17001, Threads:4
PBS comm
PBS sched
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
Failed to start PBS dataservice.
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
Failed to start PBS dataservice.
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
Failed to start PBS dataservice.
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
Failed to start PBS dataservice.
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
-sh: 32: eval: Syntax error: "(" unexpected (expecting "}")
Failed to start PBS dataservice.
^C
root@scheduler:/etc/profile.d#
After removing /etc/profile.d/azpbs_autocomplete.sh, it works again:
root@scheduler:/etc/profile.d# rm azpbs_autocomplete.sh
root@scheduler:/etc/profile.d# /opt/pbs/libexec/pbs_init.d start
Starting PBS
PBS comm already running.
PBS scheduler already running.
/opt/pbs/sbin/pbs_ds_systemd: 43: [: xdegraded: unexpected operator
Connecting to PBS dataservice....connected to PBS [email protected]
PBS server
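The "-sh: 32: eval: Syntax error" lines suggest the autocomplete script is being sourced by a POSIX shell (dash) that cannot parse bash-only syntax. A common guard for this class of problem (a sketch, not the repo's actual fix) is to make the profile.d script a no-op outside bash:

```shell
# Only run bash-specific completion code when the sourcing shell is bash;
# dash/sh leave $BASH_VERSION unset and skip the block entirely.
if [ -n "$BASH_VERSION" ]; then
    : # bash-only completion setup (e.g. complete/eval calls) would go here
fi
echo "profile script sourced safely"
```

Because the guard uses only POSIX syntax, the same file can be sourced by sh, dash, or bash without a parse error.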
openpbs installers are failing with a missing package, hwloc-libs. This is impacting CentOS 8.
The workaround is to download it from AlmaLinux:
- wget -O /tmp/hwloc-libs-1.11.9-3.el8.x86_64.rpm https://repo.almalinux.org/almalinux/8.3-beta/BaseOS/x86_64/os/Packages/hwloc-libs-1.11.9-3.el8.x86_64.rpm
- yum install -y /tmp/hwloc-libs-1.11.9-3.el8.x86_64.rpm
- rm -f /tmp/hwloc-libs-1.11.9-3.el8.x86_64.rpm
Can we redistribute it in the release artifacts?
There is no pkg file for 2.0.9. Please fix it.
Hello,
I'm trying to deploy a PBSpro cluster with customized images. However, I keep running into the following error:
Check /opt/cycle/jetpack/logs/installation.log for more information
Get more help on this issue
Detail:
Traceback (most recent call last):
File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/admin/validate.py", line 27, in execute
jetpack.util.test_cyclecloud_connection(connection)
File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/util.py", line 439, in test_cyclecloud_connection
r = _connect_to_cyclecloud(connection, path)
File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/util.py", line 482, in _connect_to_cyclecloud
conn, headers = jetpack.util.get_cyclecloud_connection(config)
File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/util.py", line 380, in get_cyclecloud_connection
if jetpack.config.get('cyclecloud.skip_ssl_validation', default_skip_ssl_validation):
File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/config.py", line 29, in get
raise ConfigError(UNKNOWN_ERROR)
ConfigError: An unknown error occurred while processing the configuration data file.
I've narrowed the customized image down to an updated CentOS configuration with 'cmake' installed. I'm unfamiliar with Jetpack, so I cannot figure out the origin of the problem or how to fix it. Thanks.
cyclecloud-pbspro version 2.0.10
With OpenPBS 19.1.1, the JSON output of qstat can be badly formatted when jobs have complex environment variables.
For example: qstat -f <job_id> -F json | jq '.'
will return an error, meaning the JSON is malformed.
As a result, the autoscaler stops working, so a single bad job can hang the whole system and no new nodes can be added.
Here is the output of azpbs autoscale:
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/environment.py", line 58, in from_driver
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib64/python3.6/logging/__init__.py", line 996, in emit
stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 47623-47628: ordinal not in range(128)
Call stack:
File "/root/bin/azpbs", line 4, in <module>
main()
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/cli.py", line 284, in main
clilib.main(argv or sys.argv[1:], "pbspro", PBSCLI())
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/clilib.py", line 1739, in main
args.func(**kwargs)
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/clilib.py", line 1315, in analyze
dcalc = self._demand(config, ctx_handler=ctx_handler)
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/clilib.py", line 360, in _demand
dcalc, jobs = self._demand_calc(config, driver)
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/cli.py", line 113, in _demand_calc
pbs_env = self._pbs_env(pbs_driver)
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/cli.py", line 106, in _pbs_env
self.__pbs_env = environment.from_driver(pbs_driver.config, pbs_driver)
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/environment.py", line 58, in from_driver
jobs = pbs_driver.parse_jobs(queues, default_scheduler.resources_for_scheduling)
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/driver.py", line 414, in parse_jobs
self.pbscmd, self.resource_definitions, queues, resources_for_scheduling
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/driver.py", line 530, in parse_jobs
response: Dict = pbscmd.qstat_json("-f", "-t")
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/pbscmd.py", line 31, in qstat_json
response = self.qstat(*args)
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/pbscmd.py", line 25, in qstat
return self._check_output(cmd)
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/pbscmd.py", line 76, in _check_output
logger.info("Response: %s", ret)
File "/usr/lib64/python3.6/logging/__init__.py", line 1308, in info
self._log(INFO, msg, args, **kwargs)
File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/hpclogging.py", line 45, in _log
**stacklevelkw
Message: 'Response: %s'
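A defensive mitigation (a sketch, not the repo's actual fix) is to validate qstat's JSON before acting on it, so a malformed job is detected and skipped instead of crashing the whole autoscale cycle. The qstat output is a stand-in string here, since the check itself is just a pipe into a JSON parser:

```shell
# Stand-in string for the real output of: qstat -f -t -F json
qstat_out='{"Jobs": {}}'
if printf '%s' "$qstat_out" | python3 -m json.tool >/dev/null 2>&1; then
    echo "qstat JSON is valid"
else
    echo "qstat JSON is malformed; skipping this autoscale cycle" >&2
fi
```

The UnicodeEncodeError in the log is a separate symptom: the logging stream is falling back to the ascii codec, so exporting a UTF-8 locale (e.g. LANG=C.UTF-8) before running azpbs may also help.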
There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.
Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.
Running a non-MPI job using "-l select=1:slot_type=execute:ungrouped=true" as the select statement.
The execute node array is not spot,
but the autoscaler is not adding a new node.
[xpillons@ondemand ~]$ qstat -fx 1651
Job Id: 1651.scheduler
Job_Name = sys-dashboard-sys-codeserver
Job_Owner = [email protected]
job_state = Q
queue = workq
server = scheduler
Checkpoint = u
ctime = Fri Nov 19 09:56:25 2021
Error_Path = ondemand.internal.cloudapp.net:/anfhome/xpillons/ondemand/data
/sys/dashboard/batch_connect/sys/codeserver/output/c1144623-b9b5-44a2-8
5b1-93fa66a0dc14/sys-dashboard-sys-codeserver.e1651
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Fri Nov 19 09:56:25 2021
Output_Path = ondemand.internal.cloudapp.net:/anfhome/xpillons/ondemand/dat
a/sys/dashboard/batch_connect/sys/codeserver/output/c1144623-b9b5-44a2-
85b1-93fa66a0dc14/output.log
Priority = 0
qtime = Fri Nov 19 09:56:25 2021
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = scatter:excl
Resource_List.select = 1:slot_type=execute:ungrouped=true
Resource_List.slot_type = execute
Resource_List.ungrouped = false
Resource_List.walltime = 03:00:00
Shell_Path_List = /bin/bash
substate = 10
Variable_List = PBS_O_HOME=/anfhome/xpillons,PBS_O_LANG=C,
PBS_O_LOGNAME=xpillons,
PBS_O_PATH=/var/www/ood/apps/sys/dashboard/tmp/node_modules/yarn/bin:/
opt/ood/ondemand/root/usr/share/gems/2.7/bin:/opt/rh/rh-nodejs12/root/u
sr/bin:/opt/rh/rh-ruby27/root/usr/local/bin:/opt/rh/rh-ruby27/root/usr/
bin:/opt/rh/httpd24/root/usr/bin:/opt/rh/httpd24/root/usr/sbin:/opt/ood
/ondemand/root/usr/bin:/opt/ood/ondemand/root/usr/sbin:/sbin:/bin:/usr/
sbin:/usr/bin,PBS_O_MAIL=/var/mail/root,PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/anfhome/xpillons/ondemand/data/sys/dashboard/batch_conn
ect/sys/codeserver/output/c1144623-b9b5-44a2-85b1-93fa66a0dc14,
PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,
PBS_O_HOST=ondemand.internal.cloudapp.net
etime = Fri Nov 19 09:56:25 2021
Submit_arguments = -N sys-dashboard-sys-codeserver -S /bin/bash -o /anfhome
/xpillons/ondemand/data/sys/dashboard/batch_connect/sys/codeserver/outp
ut/c1144623-b9b5-44a2-85b1-93fa66a0dc14/output.log -j oe -l select=1:sl
ot_type=execute:ungrouped=true -l walltime=03:00:00
project = _pbs_project_default
[root@scheduler ~]# azpbs analyze --job-id 1651
NotInAPlacementGroup : Bucket[array=execute vm_size=Standard_F2s_v2 id=ac4fc82f-3d82-4f20-bd6e-67299dcdd388] is not in a placement group
NotInAPlacementGroup : Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 id=4167f5f8-3a30-46df-92ba-8f77e0b636cd] is not in a placement group
NotInAPlacementGroup : Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 id=aed72457-3b3c-44a0-ab8f-9ebb1a954fee] is not in a placement group
NotInAPlacementGroup : Bucket[array=hb60rs vm_size=Standard_HB60rs id=53bb8fd0-e62e-468f-b278-93da9e6fef3e] is not in a placement group
NotInAPlacementGroup : Bucket[array=hc44rs vm_size=Standard_HC44rs id=be3124ac-3894-496a-8580-ef06aba68273] is not in a placement group
NotInAPlacementGroup : Bucket[array=viz vm_size=Standard_D8s_v3 id=03e27828-b4bd-4678-a8e5-5886599bc6e5] is not in a placement group
NotInAPlacementGroup : Bucket[array=viz3d vm_size=Standard_NV6 id=ab76e8b1-0632-49a9-be16-93d880341855] is not in a placement group
InvalidOption : Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=Standard_F2s_v2 attr=ungrouped]
InvalidOption : Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=Standard_F2s_v2 attr=ungrouped]
InvalidOption : Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=Standard_HB60rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=Standard_HB60rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=Standard_HC44rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=Standard_HC44rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_D8s_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_D8s_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_NV6 attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_NV6 attr=slot_type]
NoCandidatesFound : SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=execute, vm_size=Standard_F2s_v2, pg=None),reason=Bucket[array=execute vm_size=Standard_F2s_v2 id=ac4fc82f-3d82-4f20-bd6e-67299dcdd388] is not
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hb120rs_v2, vm_size=Standard_HB120rs_v2, pg=None),reason=Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 id=4167f5f8-3a30-46df-92ba-8f77e0
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hb120rs_v3, vm_size=Standard_HB120rs_v3, pg=None),reason=Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 id=aed72457-3b3c-44a0-ab8f-9ebb1a
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hb60rs, vm_size=Standard_HB60rs, pg=None),reason=Bucket[array=hb60rs vm_size=Standard_HB60rs id=53bb8fd0-e62e-468f-b278-93da9e6fef3e] is not i
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hc44rs, vm_size=Standard_HC44rs, pg=None),reason=Bucket[array=hc44rs vm_size=Standard_HC44rs id=be3124ac-3894-496a-8580-ef06aba68273] is not i
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=viz, vm_size=Standard_D8s_v3, pg=None),reason=Bucket[array=viz vm_size=Standard_D8s_v3 id=03e27828-b4bd-4678-a8e5-5886599bc6e5] is not in a pl
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=viz3d, vm_size=Standard_NV6, pg=None),reason=Bucket[array=viz3d vm_size=Standard_NV6 id=ab76e8b1-0632-49a9-be16-93d880341855] is not in a plac
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=execute, vm_size=Standard_F2s_v2, pg=Standard_F2s_v2_pg0),reason=Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=St
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=execute, vm_size=Standard_F2s_v2, pg=Standard_F2s_v2_pg1),reason=Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=St
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v2, vm_size=Standard_HB120rs_v2, pg=Standard_HB120rs_v2_pg0),reason=Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v2, vm_size=Standard_HB120rs_v2, pg=Standard_HB120rs_v2_pg1),reason=Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v3, vm_size=Standard_HB120rs_v3, pg=Standard_HB120rs_v3_pg0),reason=Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v3, vm_size=Standard_HB120rs_v3, pg=Standard_HB120rs_v3_pg1),reason=Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb60rs, vm_size=Standard_HB60rs, pg=Standard_HB60rs_pg0),reason=Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb60rs, vm_size=Standard_HB60rs, pg=Standard_HB60rs_pg1),reason=Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hc44rs, vm_size=Standard_HC44rs, pg=Standard_HC44rs_pg0),reason=Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hc44rs, vm_size=Standard_HC44rs, pg=Standard_HC44rs_pg1),reason=Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz, vm_size=Standard_D8s_v3, pg=Standard_D8s_v3_pg0),reason=Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz, vm_size=Standard_D8s_v3, pg=Standard_D8s_v3_pg1),reason=Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz3d, vm_size=Standard_NV6, pg=Standard_NV6_pg0),reason=Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz3d, vm_size=Standard_NV6, pg=Standard_NV6_pg1),reason=Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_
[root@scheduler ~]# azpbs buckets
NODEARRAY PLACEMENT_GROUP VM_SIZE VCPU_COUNT PCPU_COUNT MEMORY AVAILABLE_COUNT NCPUS NGPUS DISK HOST SLOT_TYPE GROUP_ID MEM CCNODEID UNGROUPED
execute Standard_F2s_v2 2 1 4.00g 512 1 0 20.00g execute none 4.00g true
execute Standard_F2s_v2_pg0 Standard_F2s_v2 2 1 4.00g 100 1 0 20.00g execute Standard_F2s_v2_pg0 4.00g false
execute Standard_F2s_v2_pg1 Standard_F2s_v2 2 1 4.00g 100 1 0 20.00g execute Standard_F2s_v2_pg1 4.00g false
hb120rs_v2 Standard_HB120rs_v2 120 120 456.00g 72 120 0 20.00g hb120rs_v2 none 456.00g true
hb120rs_v2 Standard_HB120rs_v2_pg0 Standard_HB120rs_v2 120 120 456.00g 72 120 0 20.00g hb120rs_v2 Standard_HB120rs_v2_pg0 456.00g false
hb120rs_v2 Standard_HB120rs_v2_pg1 Standard_HB120rs_v2 120 120 456.00g 72 120 0 20.00g hb120rs_v2 Standard_HB120rs_v2_pg1 456.00g false
hb120rs_v3 Standard_HB120rs_v3 120 120 448.00g 10 120 0 20.00g hb120rs_v3 none 448.00g true
hb120rs_v3 Standard_HB120rs_v3_pg0 Standard_HB120rs_v3 120 120 448.00g 10 120 0 20.00g hb120rs_v3 Standard_HB120rs_v3_pg0 448.00g false
hb120rs_v3 Standard_HB120rs_v3_pg1 Standard_HB120rs_v3 120 120 448.00g 10 120 0 20.00g hb120rs_v3 Standard_HB120rs_v3_pg1 448.00g false
hb60rs Standard_HB60rs 60 60 228.00g 40 60 0 20.00g hb60rs none 228.00g true
hb60rs Standard_HB60rs_pg0 Standard_HB60rs 60 60 228.00g 40 60 0 20.00g hb60rs Standard_HB60rs_pg0 228.00g false
hb60rs Standard_HB60rs_pg1 Standard_HB60rs 60 60 228.00g 40 60 0 20.00g hb60rs Standard_HB60rs_pg1 228.00g false
hc44rs Standard_HC44rs 44 44 352.00g 40 44 0 20.00g hc44rs none 352.00g true
hc44rs Standard_HC44rs_pg0 Standard_HC44rs 44 44 352.00g 40 44 0 20.00g hc44rs Standard_HC44rs_pg0 352.00g false
hc44rs Standard_HC44rs_pg1 Standard_HC44rs 44 44 352.00g 40 44 0 20.00g hc44rs Standard_HC44rs_pg1 352.00g false
viz Standard_D8s_v3 8 4 32.00g 50 4 0 20.00g viz none 32.00g true
viz Standard_D8s_v3_pg0 Standard_D8s_v3 8 4 32.00g 50 4 0 20.00g viz Standard_D8s_v3_pg0 32.00g false
viz Standard_D8s_v3_pg1 Standard_D8s_v3 8 4 32.00g 50 4 0 20.00g viz Standard_D8s_v3_pg1 32.00g false
viz3d Standard_NV6 6 6 56.00g 10 6 1 20.00g viz3d none 56.00g true
viz3d Standard_NV6_pg0 Standard_NV6 6 6 56.00g 10 6 1 20.00g viz3d Standard_NV6_pg0 56.00g false
viz3d Standard_NV6_pg1 Standard_NV6 6 6 56.00g 10 6 1 20.00g viz3d Standard_NV6_pg1 56.00g false
Note the mismatch: the select statement requests ungrouped=true, yet Resource_List.ungrouped resolves to false. Pretty sure this was not intended.
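One thing worth checking (a guess, not a confirmed diagnosis) is whether a queue- or server-level resources_default is overriding the select statement's ungrouped=true. On a real scheduler you would inspect the output of qmgr -c "print server"; a stand-in string is used here so the grep is self-contained:

```shell
# Stand-in for one line of real output from: qmgr -c "print server"
qmgr_out='set queue workq resources_default.ungrouped = false'
printf '%s\n' "$qmgr_out" | grep -i ungrouped
```

A default of false at the queue or server level would explain the resolved Resource_List.ungrouped value.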
I have a cyclecloud cluster generated from a modified version of the PBSpro template. In it I have added different types of nodes for execution, such as memory optimized nodes and HPC nodes for heavy duty numerical simulations.
When scheduling a job that uses #PBS -l slot_type=name_slot,
I have found that sometimes the allocated resources do not match the slot's configuration. The number of nodes and processes per node is respected, but the actual node type is not, which can result in reduced performance for certain applications.
It seems that this version doesn't contain the fix, despite what is claimed on the release page.
Repro:
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.2/cyclecloud-pbspro-pkg-2.0.2.tar.gz
tar xvf cyclecloud-pbspro-pkg-2.0.2.tar.gz
initialize_pbs.sh doesn't contain the job history parameter.
If a slot type is defined as hbv3, then the following job will never start (note the case mismatch between hbv3 and HBv3):
qsub -l select=1:slot_type=HBv3 -I
Please add Standard_ND96asr_v4 to the supported resources.
I would like to use PBS Pro 19.1.2 to match the version used in other production environments.
Any CentOS 7 point release is fine as long as it is a stable version.
Hello,
I'm having some trouble with the working directory and job outputs. The jobs run fine but any output file is nowhere to be found. The details of a job follow:
Job Id: 0.ip-0A000204
Job_Name = HPCG31_4
Job_Owner = afernandez@ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloud
app.net
resources_used.cpupercent = 326
resources_used.cput = 00:03:19
resources_used.mem = 1889692kb
resources_used.ncpus = 4
resources_used.vmem = 3353476kb
resources_used.walltime = 00:00:54
job_state = E
queue = workq
server = ip-0A000204
Checkpoint = u
ctime = Thu Dec 19 19:56:48 2019
Error_Path = ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp.ne
t:/home/afernandez/HPCG31_4.e0
exec_host = ip-0A000205/0*2+ip-0A000206/0*2
exec_vnode = (ip-0A000205:ncpus=2)+(ip-0A000206:ncpus=2)
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Thu Dec 19 20:02:58 2019
Output_Path = ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp.n
et:/home/afernandez/hpcg31_4.out
Priority = 0
qtime = Thu Dec 19 19:56:48 2019
Rerunable = True
Resource_List.mpiprocs = 4
Resource_List.ncpus = 4
Resource_List.nodect = 2
Resource_List.nodes = 2:ppn=2
Resource_List.place = scatter:group=group_id
Resource_List.select = 2:ncpus=2:mpiprocs=2
Resource_List.slot_type = execute
Resource_List.ungrouped = false
Resource_List.walltime = 100:30:00
stime = Thu Dec 19 20:02:04 2019
session_id = 6412
jobdir = /home/afernandez
substate = 51
Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,
PBS_O_HOME=/home/afernandez,PBS_O_LOGNAME=afernandez,
PBS_O_WORKDIR=/home/afernandez,PBS_O_LANG=en_US.UTF-8,
PBS_O_PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/cycl
e/jetpack/bin:/opt/pbs/bin:/opt/openmpi/bin:/home/afernandez/.local/bin:/ho
me/afernandez/bin,PBS_O_MAIL=/var/spool/mail/afernandez,PBS_O_QUEUE=workq,
PBS_O_HOST=ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp
.net
comment = Job run at Thu Dec 19 at 20:02 on (ip-0A000205:ncpus=2)+(ip-0A000
206:ncpus=2)
etime = Thu Dec 19 19:56:49 2019
run_count = 1
Exit_status = 0
Submit_arguments = HPCG31_4.sh
pset = group_id=single
project = _pbs_project_default
I don't understand why the output, error, and working paths show as the IP plus iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp.net, or whether this is what prevents the files from showing up in the home directory.
Thanks.
The integration currently only considers the MaxCoreCount set on the nodearray to limit autoscaling. PBS has queue-level limits that we are not considering. We are already parsing and storing these in PBSQueue.resource_state.available_resources; we just need to propagate them as a limit on the jobs.
Excellent internal write-up - see Ava ticket 2406060060001613.
If I run azpbs autoscale, here is the output; these nodes have been deprovisioned by Cycle and no longer exist.
[root@scheduler hpcadmin]# azpbs autoscale
2021-04-01 09:11:18,334 ERROR: Could not convert private_ip(None) to hostname using gethostbyaddr() for SchedulerNode(254rq000000, 254rq000000, unknown, None): gethostbyaddr() argument 1 must be str, bytes or bytearray, not None
NAME HOSTNAME PBS_STATE JOB_IDS STATE VM_SIZE DISK NGPUS GROUP_ID MACHINETYPE MEM NCPUS NODEARRAY SLOT_TYPE UNGROUPED INSTANCE_ID CTR ITR
254rq000000 254rq000000 down running unknown 20.00gb/20.00gb 0/0 s_v2_pg0 456.00gb 120 unknown hb120rs_v2 false 0.0 0.0
[root@scheduler hpcadmin]#
To repro, manually add nodes to the PBS cluster with Add Nodes in the UI. They get joined to the cluster. Then remove them with:
azpbs remove_nodes -H ip-0A010907 -H ip-0A010908 --force
Note that --force is required. They seem to temporarily go through the "down" state but then recover to "free".
Does this command work?
[root@ip-0A010906 ~]# pbsnodes -aS
vnode state OS hardware host queue mem ncpus nmics ngpus comment
--------------- --------------- -------- -------- --------------- ---------- -------- ------- ------- ------- ---------
ip-0A010907 down -- -- ip-0a010907 -- 4gb 1 0 0 --
ip-0A010908 state-unknown -- -- ip-0a010908 -- 4gb 1 0 0 --
[root@ip-0A010906 ~]# pbsnodes -aS
vnode state OS hardware host queue mem ncpus nmics ngpus comment
--------------- --------------- -------- -------- --------------- ---------- -------- ------- ------- ------- ---------
ip-0A010907 free -- -- ip-0a010907 -- 4gb 1 0 0 --
ip-0A010908 initializing -- -- ip-0a010908 -- 4gb 1 0 0 --
The autoscaler incorrectly treats resources with flag=hnq as static rather than consumable. This is only an issue with versions of PBS before 19.
In the feature/2.0.0 branch, please enable job history by default, as it will help with accounting and troubleshooting.