clusterinthecloud / terraform

20 stars, 5 watchers, 23 forks, 224 KB

Terraform config for Cluster in the Cloud

Home Page: https://cluster-in-the-cloud.readthedocs.io

License: MIT License

HCL 64.08% Shell 10.72% Makefile 3.23% Smarty 1.90% Python 20.07%
Topics: cluster-in-the-cloud

terraform's People

Contributors

ab-philburke, alisonrclarke, christopheredsall, chryswoods, david-young-appsbroker, geoffnewell, gmw99, mikeoconnor0308, milliams

terraform's Issues

Jobs infinitely pending (Resources)

Hi,

I've been happily running a cluster for a couple of months now. In the last few days, however, jobs are stuck pending, apparently waiting for resources:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               233   compute test.slm     mike PD       0:00      1 (Resources)

I've confirmed there are no compute nodes up:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      0    n/a

Is there any way to debug what's going on here? sosreport attached:

sosreport-mgmt-mikeoconnor0308-2019-11-20-wsfwtat.tar.zip

I appreciate that the version of CitC I'm running is quite old (commit id: 91f5b55), but I'd rather not spin up the cluster again right now!
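
For anyone hitting the same symptom, a minimal debugging sketch on the mgmt node (standard Slurm commands; the log paths are assumptions and may differ on your CitC install):

# Ask Slurm why the nodes are unavailable
sinfo -R
scontrol show nodes | grep -E "NodeName|State|Reason"

# Look for errors around node power-up in the controller and node-creation logs
# (paths are assumptions; adjust to wherever your install writes them)
sudo tail -n 100 /var/log/slurm/slurmctld.log
sudo tail -n 100 /var/log/slurm/elastic.log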

Problems creating new clusters after the first one

I've been iterating through various clusters and I see a problem after the first deployment.

If I try to re-use the existing service account, that fails, so I try deleting the service account and starting from scratch. However, at the apply stage there are permission errors:

google_compute_network.vpc_network: Creating...
google_service_account.mgmt-sa: Creating...
Error: Error creating Network: googleapi: Error 403: Required 'compute.networks.create' permission for 'projects/<project ID>/global/networks/citc-net', forbidden
  on google-cloud-platform/networking.tf line 2, in resource "google_compute_network" "vpc_network":
   2: resource "google_compute_network" "vpc_network" {
Error: Error creating service account: googleapi: Error 403: Permission iam.serviceAccounts.create is required to perform this operation on project projects/<project ID>., forbidden
  on google-cloud-platform/service-account.tf line 2, in resource "google_service_account" "mgmt-sa":
   2: resource "google_service_account" "mgmt-sa" {

Update to Terraform 0.12

Terraform 0.12 gives us some nice new features and also simplifies a lot of the syntax, as it now uses HCL2 as its configuration syntax.

Only create one mount target

I believe that instances in all ADs can communicate with a mount target in any AD. Therefore we don't need to create an array of mount targets.

We should edit ClusterFSMountTarget in mount_target.tf to create a single entry, not a list.

SLURM not automatically spawning instances

I have set up a management node using the newest version. Everything is set up nicely and I have updated my limits.yaml file. However, when I then start submitting jobs, it seems that compute instances are not initialised.

My limits.yaml:

VM.Standard2.1:
  1: 1
  2: 1
  3: 1
VM.Standard2.2:
  1: 2
  2: 2
  3: 2

My slurm script:

#! /bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=10:00
srun -l hostname

sinfo gives this (after a few attempts):

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 1 alloc# vm-standard2-1-ad2-0001
compute* up infinite 1 idle# vm-standard2-2-ad3-0002
compute* up infinite 1 down# vm-standard2-2-ad1-0001

And nothing shows up in the OCI dashboard under instances - only the management node.
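
A rough checklist for this situation, run on the mgmt node (log locations are assumptions and may differ between CitC versions):

# Why does Slurm think each node is alloc#/idle#/down#?
scontrol show node vm-standard2-2-ad1-0001 | grep -i -E "state|reason"

# Look for OCI API errors from the node-launch step
# (the log path is an assumption; adjust to your installation)
sudo tail -n 200 /var/log/slurm/elastic.log
sudo journalctl -u slurmctld --since "1 hour ago"   # assumes slurmctld runs under systemd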

OpenStack: Create application credential in Terraform and inject `clouds.yaml` via `user_data` script

Proposal for avoiding the hack for injecting clouds.yaml via local-exec provisioner

command = "for i in {1..60}; do echo Attempt $i; scp -A -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null clouds.yaml cloud-user@${openstack_compute_floatingip_v2.mgmt.address}:. && break || sleep 1; done"

  • Generate an application credential in Terraform using openstack_identity_application_credential_v3
  • Use the secret and id attributes of the application credential resource to create a string representing a clouds.yaml file for authentication to API by CitC
  • Interpolate the clouds.yaml string into a heredoc in bootstrap_custom.sh.tpl and have this written to the right location on the deployed mgmt instance on execution of the user_data script

The clouds.yaml might look something like this:

clouds:
  openstack:
    auth:
      auth_url: <URL for identity service API>
      application_credential_id: <ID of application credential>
      application_credential_secret: <application credential secret>
    auth_type: v3applicationcredential
    region_name: "RegionOne"
    interface: "public"
    identity_api_version: 3

application_credential_id and application_credential_secret should come from the created openstack_identity_application_credential_v3 resource. The user will need to provide auth_url, also possibly region_name and interface (though I suspect that these will not change in most cases).

This avoids the need for the Terraform user to pre-generate an application credential. It also means they do not need to manage the credential separately from the CitC instance: it should be destroyed at the same time as the cluster.

Creating the application credential in Terraform gives greater control over the amount of access granted to CitC, which in the longer term could be used to improve security, e.g. by applying access rules that restrict access via the application credential to only the API endpoints needed by CitC.

PR #79 lays some of the groundwork for this, by separating the application credential/clouds.yaml used by the CitC instance to communicate with the OpenStack API from the OpenStack API authentication details used by Terraform.
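
Once the credential is generated and injected, a quick sanity check that the written clouds.yaml actually authenticates (assuming python-openstackclient is available on the mgmt node and the cloud entry is named "openstack" as above):

# Run on the mgmt node; reads clouds.yaml from ./, ~/.config/openstack/ or /etc/openstack/
openstack --os-cloud openstack token issue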

Dictionary in network.tf

Hi,
I have terraform v0.12.5 and terraform init threw an error. The fix (suggested by terraform validate) is super simple: add an equals sign on line 53 of network.tf (to give tcp_options = {). However, I haven't tested whether that's compatible with previous versions of terraform.
thanks very much for your work,
Ben

Validate VCN Security List Rules for File Storage

Oracle notifies us:

The Oracle Cloud Infrastructure team has identified an issue where customers without stateful VCN security list egress rules may experience a temporary disruption during Mount Target failover.
To function correctly, File Storage requires stateful ingress to TCP ports 111, 2048, 2049, and 2050 and stateful ingress to UDP ports 111 and 2048. File storage also requires stateful egress from TCP ports 111, 2048, 2049, and 2050 and stateful egress from UDP port 111.
If you do not have the correctly configured stateful egress rules your Mount Target(s) can become temporarily unavailable during a planned or unplanned Mount Target failover. Please validate your security list rules as soon as possible. Instructions can be found in the additional information section of this message.
Review the Configuring VCN Security List Rules for File Storage documentation found here: https://docs.cloud.oracle.com/iaas/Content/File/Tasks/securitylistsfilestorage.htm

Slurm not working after stop and start of mgmt node

This is probably my own fault, but it used to work with the old CitC setup I had. After finishing running some scaling studies yesterday, I stopped (not terminated) the management node in the OCI dashboard to save credits. Today I started it again, it boots fine and I logged in. However, when I try to submit jobs, it tells me SLURM is not running. So then I started the SLURM daemon:

[opc@mgmt ~]$ sudo slurmctld

Then when I submitted my job, it is listed as configuring, but in the OCI dashboard there are no compute instances being provisioned.

Is there a way to restart things now, so I don't have to destroy and rebuild the cluster setup? And should I avoid doing this in the future? Like I wrote at the start, this used to work fine with the old setup (where compute nodes were just stopped rather than terminated).
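
Rather than launching slurmctld by hand, restarting it through systemd and then clearing any stale node state is usually enough; a sketch, assuming CitC manages slurmctld as a systemd unit on the mgmt node:

# Restart the controller the way the system expects
sudo systemctl restart slurmctld
systemctl status slurmctld

# Show any nodes stuck in a stale state from before the stop, then clear them
sinfo -R
sudo scontrol update NodeName=<nodename> State=RESUME   # substitute the affected node names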

Add a cleanup script

There are bound to be some situations where Terraform is not able to orchestrate the shutting down of all our dangling compute nodes. For this we should add a bash script to this repo which terminates all objects with the appropriate tag.
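
A minimal sketch of what such a script could look like on Oracle, using the OCI CLI and jq; the compartment OCID and the freeform tag key/value are assumptions and would need to match whatever tags Terraform actually applies:

#!/bin/bash
# Terminate every running compute instance carrying the cluster's freeform tag (sketch only)
COMPARTMENT_OCID="ocid1.compartment.oc1..example"   # hypothetical
CLUSTER_TAG="mycluster"                             # hypothetical tag value

oci compute instance list \
    --compartment-id "$COMPARTMENT_OCID" \
    --lifecycle-state RUNNING \
    --all \
  | jq -r --arg tag "$CLUSTER_TAG" \
      '.data[] | select(."freeform-tags".cluster == $tag) | .id' \
  | while read -r instance_id; do
      oci compute instance terminate --instance-id "$instance_id" --force
    done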

Available modules differ between management and compute nodes

I am trying to run some MPI jobs on my CitC. However, the available MPI modules are different on the management and compute nodes. I have compiled my code on the shared filesystem logged in on the management node, using the module mpi/openmpi-x86_64. However, when I then tried to load it on the compute nodes (as part of my job script), it told me that it did not exist.

Below are the listed available modules on the management node:

[joealex@mgmt run_test]$ module avail

--------------------------------------------------------- /usr/share/Modules/modulefiles ---------------------------------------------------------
dot module-git module-info modules null use.own

---------------------------------------------------------------- /etc/modulefiles ----------------------------------------------------------------
mpi/mpich-3.2-x86_64 mpi/openmpi3-x86_64 mpi/openmpi-x86_64

And the compute node:

[opc@vm-standard2-2-ad1-0001 ~]$ module avail

--------------------------------------------------------- /usr/share/Modules/modulefiles ---------------------------------------------------------
dot module-git module-info modules null use.own

---------------------------------------------------------------- /etc/modulefiles ----------------------------------------------------------------
mpi/mpich-3.0-x86_64 mpi/mpich-x86_64 mpi/openmpi3-x86_64

There is only one module (mpi/openmpi3-x86_64) that overlaps between the two...
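
A short sketch of one way to close the gap, assuming the compute image is Oracle Linux/CentOS and the stock yum packages are acceptable; for a durable fix the package belongs in the Ansible playbook so freshly spawned nodes get it too:

# On a compute node: install the Open MPI build that provides mpi/openmpi-x86_64
sudo yum install -y openmpi openmpi-devel
module avail                     # mpi/openmpi-x86_64 should now be listed
module load mpi/openmpi-x86_64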

Out of Host Capacity error on node activation

When submitting jobs through slurm, the AMD nodes we've specified in limits.yml are not automatically activating. We then follow the instructions on the elastic scaling page to manually call up a node and receive the error:

2019-06-10 11:29:37,108 startnode  ERROR    bm-standard-e2-64-ad1-0003:  problem launching instance: {'opc-request-id': 'E3D3A2D1DEB14B9C84CBB7FD6F2CA7B3/90862EA821FF46290B355B89CAE3A926/B4D05FEBD75749F988B1C201434A2A1C', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}

After trying to launch three node instances, we also get this error:

2019-06-10 11:32:07,976 startnode  ERROR    bm-standard-e2-64-ad1-0001:  problem launching instance: {'opc-request-id': '07BF5FF7021E4BD5B7580DF99C44D23F/F122163E1641F3594B94D09F1EB83A9E/5077118583A2A8872A52AF2492160373', 'code': 'TooManyRequests', 'message': 'Too many requests for the user', 'status': 429}

We've actually had one success in activating a node following this approach, but can't figure out why it worked in that particular case but not in others. Otherwise we are well below the node limit on our given AD. Any ideas?

AWS c7g instances unknown to CitC

Error: Could not find shape information for 'c7g.2xlarge'.

It seems like my existing CitC cluster doesn't know about the c7g (Graviton3) instances yet, see also https://aws.amazon.com/ec2/instance-types/c7g/ .

@milliams Any idea how to fix this (for an existing cluster)?
I'm happy to open a PR to add the necessary info so c7g instances are known for new CitC clusters, if you can give me some pointers...
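
If it helps with such a PR, the raw numbers the shape data would need can be pulled from the AWS CLI (these are standard describe-instance-types fields):

aws ec2 describe-instance-types --instance-types c7g.2xlarge \
  --query 'InstanceTypes[0].{vcpus:VCpuInfo.DefaultVCpus,memory_mib:MemoryInfo.SizeInMiB,arch:ProcessorInfo.SupportedArchitectures}' \
  --output table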

`terraform init` step fails on macOS with arm64 binary

I am using the openstack development branch, but I suspect that this issue is general across other cloud providers.

% terraform -chdir=openstack init                                                                                                                                                                                
                                                                                                                                                                                                                                             
Initializing the backend...                                                                                           
                                                                                                                                                                                                                                             
Initializing provider plugins...                                                                                                                                                                                                             
- Finding terraform-provider-openstack/openstack versions matching "~> 1.48"...                                                                                                                                                              
- Finding latest version of hashicorp/random...                                                                                                                                                                                              
- Finding latest version of hashicorp/template...                                                                                                                                                                                            
- Installing terraform-provider-openstack/openstack v1.54.1...                                                        
- Installed terraform-provider-openstack/openstack v1.54.1 (self-signed, key ID 4F80527A391BEFD2)                     
- Installing hashicorp/random v3.6.0...                    
- Installed hashicorp/random v3.6.0 (signed by HashiCorp)  

Partner and community providers are signed by their developers.                                                       
If you'd like to know more about provider signing, you can read about it here:                                        
https://www.terraform.io/docs/cli/plugins/signing.html     

│ Error: Incompatible provider version                     

│ Provider registry.terraform.io/hashicorp/template v2.2.0 does not have a package available for your current         
│ platform, darwin_arm64.                                  

│ Provider releases are separate from Terraform CLI releases, so not all providers are available for all platforms.   
│ Other versions of this provider may have different platforms supported.                                             

A quick web search suggests that this might be due to the use of the template_file data source, which is part of the deprecated template provider.

It seems the template provider was archived prior to arm64 macOS releases of terraform and has not been updated to add support. See for example: https://discuss.hashicorp.com/t/template-v2-2-0-does-not-have-a-package-available-mac-m1/35099

I will try the workaround of using the darwin_amd64 binary, but thought it worth flagging since it seems that this provider should be replaced with alternative non-deprecated functionality, as described here: https://registry.terraform.io/providers/hashicorp/template/latest/docs#deprecation

Terraform version information:

% terraform version
Terraform v1.7.5
on darwin_arm64
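
Until the template provider usage is replaced, one workaround sketch is to run the darwin_amd64 build of Terraform under Rosetta 2 so the amd64 provider packages can be installed (version chosen to match the output above; adjust paths as you prefer):

# Download and use the amd64 build of Terraform; Rosetta 2 runs it on Apple Silicon
curl -LO https://releases.hashicorp.com/terraform/1.7.5/terraform_1.7.5_darwin_amd64.zip
unzip terraform_1.7.5_darwin_amd64.zip -d ~/bin-amd64
~/bin-amd64/terraform -chdir=openstack init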

race condition when running terraform destroy if filesystem destroyed before compute nodes

Occasionally, when running terraform destroy on a cluster, it takes a long time and eventually fails. This seems to be because the filesystem containing /mnt/shared/etc/slurm/slurm.conf is destroyed before it can be read to get the compute node info.

oci_file_storage_export.ClusterFSExport: Destroying... (ID: ocid1.export.oc1.iad.aaaaaa4np2soonqpnfqwillqojxwiotjmfsc2ylefuzaaaaa)
oci_core_instance.ClusterManagement: Destroying... (ID: ocid1.instance.oc1.iad.abuwcljtvbv6codx...barl7h54uuypnd7e6golvv7gqy3xgx6vsd2ntq)
oci_core_instance.ClusterManagement: Provisioning with 'remote-exec'...
oci_core_instance.ClusterManagement (remote-exec): Connecting to remote host via SSH...
oci_core_instance.ClusterManagement (remote-exec):   Host: 132.145.208.237
oci_core_instance.ClusterManagement (remote-exec):   User: opc
oci_core_instance.ClusterManagement (remote-exec):   Password: false
oci_core_instance.ClusterManagement (remote-exec):   Private key: true
oci_core_instance.ClusterManagement (remote-exec):   SSH Agent: false
oci_core_instance.ClusterManagement (remote-exec):   Checking Host Key: false
oci_core_instance.ClusterManagement (remote-exec): Connected!
oci_file_storage_export.ClusterFSExport: Destruction complete after 0s
oci_file_storage_file_system.ClusterFS: Destroying... (ID: ocid1.filesystem.oc1.iad.aaaaaaaaaaaalsqhnfqwillqojxwiotjmfsc2ylefuzaaaaa)
oci_file_storage_mount_target.ClusterFSMountTarget: Destroying... (ID: ocid1.mounttarget.oc1.iad.aaaaaby27ve23nwonfqwillqojxwiotjmfsc2ylefuzaaaaa)
oci_core_instance.ClusterManagement (remote-exec): Terminating any remaining compute nodes
oci_core_instance.ClusterManagement (remote-exec): sinfo: error: s_p_parse_file: unable to read "/mnt/shared/etc/slurm/slurm.conf": Unknown error 521
oci_core_instance.ClusterManagement (remote-exec): sinfo: error: "Include" failed in file /etc/slurm/slurm.conf line 34
oci_core_instance.ClusterManagement (remote-exec): sinfo: fatal: Unable to process configuration file
oci_core_instance.ClusterManagement (remote-exec): scontrol: error: s_p_parse_file: unable to read "/mnt/shared/etc/slurm/slurm.conf": Unknown error 521
oci_core_instance.ClusterManagement (remote-exec): scontrol: error: "Include" failed in file /etc/slurm/slurm.conf line 34
oci_core_instance.ClusterManagement (remote-exec): scontrol: fatal: Unable to process configuration file
oci_file_storage_file_system.ClusterFS: Destruction complete after 2s
oci_file_storage_mount_target.ClusterFSMountTarget: Destruction complete after 3s
oci_core_instance.ClusterManagement (remote-exec): Node termination request completed
oci_core_instance.ClusterManagement: Still destroying... (ID: ocid1.instance.oc1.iad.abuwcljtvbv6codx...barl7h54uuypnd7e6golvv7gqy3xgx6vsd2ntq, 10s elapsed)
oci_core_instance.ClusterManagement: Still destroying... (ID: ocid1.instance.oc1.iad.abuwcljtvbv6codx...barl7h54uuypnd7e6golvv7gqy3xgx6vsd2ntq, 20s elapsed)
[ ... ]
Error: Error applying plan:

2 error(s) occurred:

* oci_core_subnet.ClusterSubnet[0] (destroy): 1 error(s) occurred:

* oci_core_subnet.ClusterSubnet.0: Service error:Conflict. The Subnet ocid1.subnet.oc1.iad.aaaaaaaa3ow3n2tn3gx34nvlmrbnwgprq5fznksy5dsjhaquyxkxts3e5ala references the VNIC ocid1.vnic.oc1.iad.abuwcljtei434zygcrm6fa7pcwq44fgxnjq2wufkyt56noickqx6cs4obora. You must remove the reference to proceed with this operation.. http status code: 409. Opc request id: d5f0804c3e00a886e9d89398c48b3825/64ED088EEA7C23F64C7F35855CACDB98/12AB2FC3E72046E7B981E999DB8AF089
* oci_core_subnet.ClusterSubnet[1] (destroy): 1 error(s) occurred:

* oci_core_subnet.ClusterSubnet.1: Service error:Conflict. The Subnet ocid1.subnet.oc1.iad.aaaaaaaad3264gcho6pjcjqb4p32b5nac4i7kq7qo2vtjdynehgxbquavmeq references the VNIC ocid1.vnic.oc1.iad.abuwcljs2a7e2rn5jioe2adaq3eflwzq6t32bvw77lnp5gvxweyoivfzsdvq. You must remove the reference to proceed with this operation.. http status code: 409. Opc request id: 316eb18e0b83fd440c6d5e23c3255e22/F22821F37412A0643FD9589ACF03DD17/2E8E0525A37E40FB97B885FA417B016C
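
Until the dependency is expressed in the config, a workaround sketch is to destroy the management instance (which runs the node-termination provisioner) before the shared filesystem, using a targeted destroy:

# Destroy the mgmt instance first, while slurm.conf on the shared filesystem is still readable
terraform destroy -target=oci_core_instance.ClusterManagement

# Then destroy everything that remains
terraform destroy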

Deprecation warning for Terraform 0.12 "External references from destroy provisioners are deprecated"

When validating a new cluster I get the output

[ce16990@ce16990 citc-terraform]$ terraform validate oracle

Warning: External references from destroy provisioners are deprecated

  on oracle/compute.tf line 142, in resource "oci_core_instance" "ClusterManagement":
 142:     host        = oci_core_instance.ClusterManagement.public_ip

Destroy-time provisioners and their connection configurations may only
reference attributes of the related resource, via 'self', 'count.index', or
'each.key'.

References to other resources during the destroy phase can cause dependency
cycles and interact poorly with create_before_destroy.

(and one more similar warning elsewhere)

Success! The configuration is valid, but there were some validation warnings as shown above.

srun: error: Unable to resolve "mgmt": Unknown host

srun: error: Unable to resolve "mgmt": Unknown host
srun: error: Unable to establish control machine address
srun: error: Unable to confirm allocation for job 90: No error
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 90
slurmstepd: error: Unable to resolve "mgmt": Unknown host

When I run my job, it shows the errors above. How can I fix this issue?
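
A few hedged checks to run on the compute node where the job lands (the FQDN below is the one that appears elsewhere in this tracker; your cluster domain may differ):

# Can the node resolve the management host at all?
getent hosts mgmt
getent hosts mgmt.subnet.clustervcn.oraclevcn.com

# Check the search domain the resolver is using and any static entries
cat /etc/resolv.conf
grep mgmt /etc/hosts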

GCP installation issues

I am following this manual: https://cluster-in-the-cloud.readthedocs.io/en/latest/google-infrastructure.html
Running command in GCP cloud shell.

docker run -it -e CLOUDSDK_CONFIG=/config/gcloud \
                 -v $CLOUDSDK_CONFIG:/config/gcloud \
                 clusterinthecloud/google-install
...
[EXECUTE] ssh-keygen -t rsa -f /root/.ssh/citc-google -C provisioner -N ""
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/citc-google.
Your public key has been saved in /root/.ssh/citc-google.pub.
The key fingerprint is:
SHA256:MRZr/antwrE7d470C0GfAXIrkVqbxJe5GFgAkdnt3Oo provisioner
The key's randomart image is:
+---[RSA 3072]----+
|      o*o*+.oo   |
|      o o+B++o   |
|        **oBo..  |
|       o.o*+oo o |
|        S  .+ o  |
|          oo .   |
|         o.o+    |
|          Eo.+.  |
|          .=oo+. |
+----[SHA256]-----+
[EXECUTE] terraform init google
Initializing modules...
- budget_filer_shared_storage in google/storage/nfs-storage-budget

Error: Unsupported Terraform Core version

This configuration does not support Terraform version 0.12.21. To proceed,
either choose another supported Terraform version or update the root module's
version constraint. Version constraints are normally set for good reason, so
updating the constraint may lead to other errors or unexpected behavior.

[ERROR] Command '['terraform', 'init', 'google']' returned non-zero exit status 1.
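
The error means the terraform binary inside the install container (0.12.21) does not satisfy the required_version constraint in the config. A hedged sketch for checking the constraint and dropping a matching binary into the container (0.13.7 below is only an example; use whatever version the constraint asks for, and adjust the install path to wherever terraform lives in the image):

# Inside the container: what version range does the config demand?
grep -R required_version google/

# Fetch a matching Terraform release and put it on the PATH
curl -LO https://releases.hashicorp.com/terraform/0.13.7/terraform_0.13.7_linux_amd64.zip
unzip -o terraform_0.13.7_linux_amd64.zip -d /usr/local/bin
terraform version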

Need to add shape info for BM Oracle GPU shapes?

Could not find shape information for 'BM.GPU3.8'
Could not find shape information for 'BM.GPU4.8'

BM.GPU3.8 = 52 cores, 768GB RAM, 8 GPUs (V100)
BM.GPU4.8 = 64 cores, 2048GB RAM, 8 GPUs (A100)

terraform error - unable to validate

I am getting this error when running terraform validate...
Error: oci_core_instance.ClusterManagement: expected source_details.0.source_type to be one of [bootVolume image], got map

What are the possible causes of this error?

Thanks for your help!

Rename repo to vendor neutral

With work being done to port CitC to GCP, this repo should be renamed to reflect the cross-vendor nature of the project. Perhaps citc-terraform?

Make a common admin account on all clouds

Currently, OCI has the opc user and we make a provisioner user on Google. This makes the documentation harder to follow. We should create a citc user on all nodes which the human admin uses to log in and manage the system and keep opc/provisioner as the account that only Terraform uses.

my slurm job is always in the "PD" state for many hours and most of the compute nodes are idle#

[root@mgmt FederatedLearning]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 17 idle# vm-standard2-24-ad1-[0001-0010],vm-standard2-24-ad2-[0001-0004,0007,0009-0010]
compute* up infinite 6 idle vm-standard2-24-ad2-[0005-0006,0008],vm-standard2-24-ad3-[0001-0003]
[root@mgmt FederatedLearning]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
50 compute fl root PD 0:00 9 (Resources)

[opc@mgmt FederatedLearning]$ ping 10.1.0.5
PING 10.1.0.5 (10.1.0.5) 56(84) bytes of data.
^C
--- 10.1.0.5 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3068ms

on-destroy remote-exec provisioner can not be executed within an aws_cloudformation_stack resource

I tried to upgrade Terraform from 0.12.12 to a newer version; it is currently 0.13.1.
Previously this worked in a null_resource used to clean up some additional infrastructure at destroy time that was not created by Terraform.
After the upgrade, Terraform told me that a destroy provisioner is not allowed to reference other resources and must stay within the particular resource, so I moved it into the resource that provisions the bastion I need for running the cleanup script.
But it failed with the errors below:

I don't know whether an on-destroy provisioner can be used in a cloudformation stack resource or not.

Error: Invalid reference from destroy provisioner
in resource "aws_cloudformation_stack" "bastion_stack":
24: private_key = file(var.ssh_private_key_file)
Destroy-time provisioners and their connection configurations may only
reference attributes of the related resource, via 'self', 'count.index', or
'each.key'.
References to other resources during the destroy phase can cause dependency
cycles and interact poorly with create_before_destroy.

resource "aws_cloudformation_stack" "bastion_stack" {
...
  provisioner "remote-exec" {
    when = destroy
    connection {
      type        = "ssh"
      user        = "centos"
      private_key = file(var.ssh_private_key_file)
      host        = self.outputs["BastionHostIP"]
      agent       = false
    }
    inline = [
      "./cleanup.sh",
    ]
  }
}

Running "finish" fails

I'm using an older version of CitC (initially forked ~6 months ago, but kept up to date until 15th March 2019) and a forked ansible script (last updated 15th March 2019).

Using this configuration, I get the following error when I run 'finish'

[opc@mgmt ~]$ ./finish  
ERROR! the playbook: finalise.yml could not be found 

############################################# 
Error: Ansible run did not complete correctly 
############################################# 

I tried running a fresh pull (so using the default playbook too) and now get the error

[opc@mgmt ~]$ ./finish  
Error: Could not find limits.yaml 
Please create the file and rerun this script. 

Specify custom images for compute nodes?

Looking at the latest version on master, it's no longer clear how to specify images for the compute nodes.

I'm looking to use GPU nodes. Since the management node and GPU nodes will need to use different images, I need to be able to specify a custom image for the GPU nodes.

Cluster is not deploying correctly with ssh error.

I've run through the instructions on how to deploy to GCP, but I'm hitting an error.

At the end of the terraform run, I get the following error:

Error: timeout - last error: SSH authentication failed (provisioner@ip-redacted22): ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

The ssh private and public key have been created and paths are correct. When I check the mgmt node in GCP, it has a public key, but no provisioner user has been created.
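
To narrow down which side is rejecting the key, a verbose manual SSH attempt with the same key Terraform is configured to use usually shows what is offered and why it is refused (the key file name is the one the installer generated above, citc-google; substitute wherever it lives on your machine):

# -v prints which keys are offered and the server's response
ssh -v -i /path/to/citc-google provisioner@<mgmt-public-ip>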

Issues with configuration step

Hi Matt,
Thank you for developing and posting this project. I was able to create a very small cluster of 1 mgmt + 1 compute node (chapter 1) but couldn't complete the configuration (chapter 2). I figured out that the issue was trying to use shapes for the AMD arch (e.g. VM.Standard.E2.2).
Thank you.
AF

Incomplete setup

Hi, I hope I'm not missing anything, but I can't seem to complete setup for new clusters. To make a reproducible example I've used Terraform v0.11.15-oci to initiate a cluster on the Oracle cloud. Unfortunately the finish script reports that mgmt has not completed setup (even after ~24 hours). Reading ansible-pull.log from my latest iteration it reports an error as follows:

Starting Ansible Pull at 2019-08-27 20:44:24
/usr/bin/ansible-pull --url=https://github.com/ACRC/slurm-ansible-playbook.git --checkout=3 --inventory=/root/hosts management.yml
 [WARNING]: Could not match supplied host pattern, ignoring: mgmt
 [WARNING]: Your git version is too old to fully support the depth argument.
Falling back to full checkouts.
mgmt.subnet.clustervcn.oraclevcn.com | CHANGED => {
    "after": "4aa832749d3e79961587713b4a1950baead67127",
    "before": null,
    "changed": true
}
 [WARNING]: Could not match supplied host pattern, ignoring: mgmt

I can't quite relate this to other issues and I cannot quite see the error in my terraform.tfvars file which ends:

ManagementShape = "VM.Standard2.1"
ManagementAD = "2"
FilesystemAD = "2"

Any advice would be very much appreciated and I'll be happy to supply any more information. many thanks

job code and data storage location

The file system on /mnt/shared is quite slow. Is it possible to place the code and data on each node's local storage, to improve data / file retrieval by each node? If yes, what are the steps to do so?
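
One common pattern is to stage input to node-local scratch inside the job script and copy results back to /mnt/shared at the end. A minimal single-node sketch (the scratch location, data paths and program name are placeholders for your system):

#!/bin/bash
#SBATCH --job-name=stage-local
#SBATCH --nodes=1

# Stage input from the shared filesystem to node-local storage
SCRATCH=/tmp/$SLURM_JOB_ID               # assumed local scratch location
mkdir -p "$SCRATCH"
cp -r /mnt/shared/mydata "$SCRATCH/"

# Run against the local copy (program still lives on the shared filesystem)
cd "$SCRATCH"
srun /mnt/shared/bin/my_program mydata   # hypothetical program path

# Copy results back to shared storage and clean up
cp -r results /mnt/shared/results-$SLURM_JOB_ID
rm -rf "$SCRATCH"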

Nodes in drain state

Upon creation of a cluster, the nodes have gone into the drain state, and jobs are forever pending:

[opc@mgmt ~]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
3                  test    compute                     2    PENDING      0:0 
[opc@mgmt ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      3  drain compute[001-003]

What causes the nodes to enter this state, and how can it be remedied?
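
To see why Slurm drained the nodes and to return them to service once the cause is fixed (standard Slurm admin commands; node names taken from the sinfo output above):

# Show the reason Slurm recorded for draining each node
sinfo -R
scontrol show node compute001 | grep -i reason

# Once the underlying problem is fixed, put the nodes back in service
sudo scontrol update NodeName=compute[001-003] State=RESUME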
