hashicorp / terraform-aws-nomad

A Terraform module for running Nomad on AWS using Terraform and Packer

License: Apache License 2.0

Shell 25.86% HCL 50.14% Go 24.00%

terraform-aws-nomad's Introduction

DISCLAIMER: This is no longer supported.

This repository is no longer supported and will eventually be deprecated. Please use the latest versions of our products going forward, or fork the repository to continue using and developing it for your personal or business needs.


Nomad AWS Module


This repo contains a set of modules for deploying a Nomad cluster on AWS using Terraform. Nomad is a distributed, highly available, datacenter-aware scheduler. A Nomad cluster typically includes a small number of server nodes, which participate in the consensus protocol, and a larger number of client nodes, which are used for running jobs.

Nomad architecture

Features

  • Deploy server nodes for managing jobs and client nodes for running jobs
  • Supports colocated clusters and separate clusters
  • Least-privilege security group rules for servers
  • Auto-scaling and auto-healing

Learn

This repo was created by Gruntwork, and follows the same patterns as the Gruntwork Infrastructure as Code Library, a collection of reusable, battle-tested, production ready infrastructure code. You can read How to use the Gruntwork Infrastructure as Code Library for an overview of how to use modules maintained by Gruntwork!

Core concepts

  • Nomad Use Cases: overview of various use cases that Nomad is optimized for.
  • Nomad Guides: official guides on how to configure and set up Nomad clusters, as well as how to use Nomad to schedule services onto the workers.
  • Nomad Security: overview of how to secure your Nomad clusters.

Repo organization

  • modules: the main implementation code for this repo, broken down into multiple standalone, orthogonal submodules.
  • examples: This folder contains working examples of how to use the submodules.
  • test: Automated tests for the modules and examples.
  • root: The root folder is an example of how to use the nomad-cluster module to deploy a Nomad cluster in AWS. The Terraform Registry requires the root of every repo to contain Terraform code, so we've put one of the examples there. This example is great for learning and experimenting, but for production use, please use the underlying modules in the modules folder directly.

Deploy

Non-production deployment (quick start for learning)

If you just want to try this repo out for experimenting and learning, check out the following resources:

  • examples folder: The examples folder contains sample code optimized for learning, experimenting, and testing (but not production usage).

Production deployment

If you want to deploy this repo in production, check out the following resources:

Manage

Day-to-day operations

Major changes

Who created this Module?

These modules were created by Gruntwork, in partnership with HashiCorp, in 2017 and maintained through 2021. They were deprecated in 2022; see the top of the README for details.

License

Please see LICENSE for details on how the code in this repo is licensed.

Copyright © 2019 Gruntwork, Inc.

terraform-aws-nomad's People

Contributors

autero1, blankenshipz, brikis98, bwhaley, draoncc, eak12913, etiene, flapp, gruntwork-ci, hartzell, hngerebara, josh-padnick, kshahar, lawliet89, marcosnils, matthiasscholz, matthiasscholztw, mattreduce, mawa-jnd, mcalhoun, nepeat, nubbthedestroyer, phil-r, pidelport, pp23, robmorgan, so-development, stvnjacobs, thomasobenaus, yorinasub17


terraform-aws-nomad's Issues

Multiple conflict errors received when deploying using version > 3 of AWS provider

When deploying the example using version 3.3.0 of the AWS provider, multiple conflict errors are received, similar to the one below.

Error: ConflictsWith

  on modules/nomad-cluster/main.tf line 18, in resource "aws_autoscaling_group" "autoscaling_group":
  18:   vpc_zone_identifier = var.subnet_ids

"vpc_zone_identifier": conflicts with availability_zones

This looks to be the same issue reported on this issue in the terraform-aws-consul repository.
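
One possible patch, sketched here under the assumption that the module targets Terraform 0.12+ and exposes both var.availability_zones and var.subnet_ids (this is illustrative, not the maintainers' actual fix): only set one of the two conflicting arguments on the ASG.

resource "aws_autoscaling_group" "autoscaling_group" {
  # ... other arguments unchanged ...

  # With AWS provider v3+, availability_zones and vpc_zone_identifier may not
  # both be set, so populate exactly one of them.
  availability_zones  = length(var.subnet_ids) > 0 ? null : var.availability_zones
  vpc_zone_identifier = length(var.subnet_ids) > 0 ? var.subnet_ids : null
}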

Making CIDR Block for SSH Configuration Optional and Outbound Traffic Configurable

Currently, the CIDR blocks for incoming SSH access are mandatory.

Especially with the introduction of AWS Systems Manager Session Manager, an alternative form of administrative access is available. This would make it possible to reduce SSH access and increase security.

Furthermore, module users might have a use case for limiting outbound traffic as well.

Source Code References:

Convert aws_launch_configuration to aws_launch_template

Hello!

Quick question:

Launch Templates are preferred by AWS over Launch Configurations.

I was wondering if converting the aws_launch_configuration to aws_launch_template would be a welcome PR or if there was a reason to continue with aws_launch_configuration that I'm not aware of?

Some benefits:

  1. Templates support additional options ("latest features") such as metadata_options
  2. We may not require the create_before_destroy = true calls to prevent cyclical dependency errors
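
For illustration, a rough sketch of what the switch might look like; the resource and variable names below are hypothetical and not taken from this module's code:

resource "aws_launch_template" "nomad" {
  name_prefix   = "${var.cluster_name}-"
  image_id      = var.ami_id
  instance_type = var.instance_type
  user_data     = base64encode(var.user_data)

  # metadata_options is one of the newer options that launch configurations lack.
  metadata_options {
    http_endpoint = "enabled"
    http_tokens   = "required"
  }
}

resource "aws_autoscaling_group" "nomad" {
  # ... other ASG arguments ...
  launch_template {
    id      = aws_launch_template.nomad.id
    version = "$Latest"
  }
}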

Switches for Install & Run should match Consul & Vault

Could we please make the install switches for Nomad the same as they are in Consul & Vault, i.e.:

terraform-aws-nomad/.../install-nomad

Nomad install does not have the same install switches (download & update option) as Consul & Vault do here:

terraform-aws-consul/.../install-consul
terraform-aws-vault/.../install-vault

All three allow for username, path & version; however, only the Consul & Vault modules allow for a download URL & skipping the package update, i.e.:

  echo -e "  --download-url\t\tUrl to exact Vault package to be installed. Optional if version is provided."
  echo -e "  --skip-package-update\t\tSkip yum/apt updates. Optional. Only recommended if you already ran yum update or apt-get update yourself. Default: $DEFAULT_SKIP_PACKAGE_UPDATE."

Having the install experience be the same for each product improves the user experience. Although the current version does download from releases, it also forces a package update. After all, is this not a great opportunity to create a universal install script which other teams might be able to leverage for emerging products?

Missing ports in nomad security group

I have an issue when using the "security group" module when incoming_cidr is adapted to a custom IP address (something other than 0.0.0.0/0).

My ASG is created with the help of the terraform-aws-modules/terraform-aws-autoscaling module using custom userdata and Ubuntu 20.04. The userdata adds the HashiCorp repos and performs a default installation of Nomad and Consul:

userdata script:

#!/bin/sh

apt update
apt install -y \
software-properties-common \
curl \
vim-tiny \
netcat \
file \
bash-completion

curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
apt update

apt install -y consul
apt install -y nomad

/etc/nomad.d/nomad.hcl:

datacenter = "us-east-1"
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"

# Enable the server
server {
  enabled = true
  bootstrap_expect = 3
}

consul {
  address = "127.0.0.1:8500"
  token   = "***************************"
}

/etc/consul.d/consul.hcl:

cat /etc/consul.d/consul.hcl 
datacenter = "us-east-1"
server              = true
bootstrap_expect    = 3
data_dir            = "/opt/consul/data"
client_addr         = "0.0.0.0"
log_level           = "INFO"
ui                  = true

# AWS cloud join
retry_join          = ["provider=aws tag_key=Nomad-Cluster tag_value=dev-nomad"]

# Max connections for the HTTP API
limits {
  http_max_conns_per_client = 128
}
performance {
    raft_multiplier = 1
}

acl {
  enabled        = true
  default_policy = "allow"
  enable_token_persistence = true
  tokens {
    master = "***************************************"
  }
}

encrypt = "************************"

When opening the browser I see the following message:

No Cluster Leader

The cluster has no leader. Read about Outage Recovery.

In the nomad logs it shows:

sudo journalctl -t nomad:

Oct 02 11:43:51 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:43:51.616Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Oct 02 11:43:57 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:43:57.320Z [ERROR] http: request failed: method=GET path=/v1/agent/health?type=server error="{"server":{"ok":false,"message":"No cluster leader"}}" code=500

It seems that the communication for port 4647 is currently not allowed within the security group.

Trying to access the port of a server node from another server node times out:

nc -zv -w 5 10.10.10.48 4647
nc: connect to 10.10.10.48 port 4647 (tcp) timed out: Operation now in progress

After allowing port 4647 communication within the security group, the cluster's server nodes start replicating with each other:

Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:06.257Z [INFO]  nomad: serf: EventMemberJoin: ip-10-10-10-48.global 10.10.10.48
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:06.257Z [INFO]  nomad: serf: EventMemberJoin: ip-10-10-10-12.global 10.10.10.12
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:06.257Z [INFO]  nomad: adding server: server="ip-10-10-10-48.global (Addr: 10.10.10.48:4647) (DC: us-east-1)"
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:06.265Z [INFO]  nomad: found expected number of peers, attempting to bootstrap cluster...: peers=10.10.10.93:4647,10.10.10.48:4647,10.10.10.12:4647
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:06.270Z [INFO]  nomad: adding server: server="ip-10-10-10-12.global (Addr: 10.10.10.12:4647) (DC: us-east-1)"
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:06.725Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.151Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.151Z [INFO]  nomad.raft: entering candidate state: node="Node at 10.10.10.93:4647 [Candidate]" term=2
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.162Z [INFO]  nomad.raft: election won: tally=2
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.162Z [INFO]  nomad.raft: entering leader state: leader="Node at 10.10.10.93:4647 [Leader]"
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.163Z [INFO]  nomad.raft: added peer, starting replication: peer=10.10.10.48:4647
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.163Z [INFO]  nomad.raft: added peer, starting replication: peer=10.10.10.12:4647
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.163Z [INFO]  nomad: cluster leadership acquired
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.165Z [INFO]  nomad.raft: pipelining replication: peer="{Voter 10.10.10.12:4647 10.10.10.12:4647}"
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.166Z [WARN]  nomad.raft: appendEntries rejected, sending older logs: peer="{Voter 10.10.10.48:4647 10.10.10.48:4647}" next=1
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.168Z [INFO]  nomad.raft: pipelining replication: peer="{Voter 10.10.10.48:4647 10.10.10.48:4647}"
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.186Z [INFO]  nomad.core: established cluster id: cluster_id=c40704d5-7b77-0ea7-9da2-eef39a58b4bb create_time=1633175047177920656

My question is whether port 4647 is new or simply missing from the security group rules module.

The config from an installation using the root module differs slightly, but I can't see any pinning to another port:

/opt/nomad/config/default.hcl:

datacenter = "us-east-1c"
name       = "i-06382f65cc9495792"
region     = "us-east-1"
bind_addr  = "0.0.0.0"

advertise {
  http = "172.31.84.5"
  rpc  = "172.31.84.5"
  serf = "172.31.84.5"
}


server {
  enabled = true
  bootstrap_expect = 3
}

consul {
  address = "127.0.0.1:8500"
}
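
For reference, a minimal sketch of the rule that resolved the symptom above, allowing Nomad's RPC port (4647) between members of the same security group; the resource and variable names are illustrative, not the module's actual code:

resource "aws_security_group_rule" "allow_rpc_inbound_from_self" {
  type              = "ingress"
  from_port         = 4647
  to_port           = 4647
  protocol          = "tcp"
  self              = true
  security_group_id = var.security_group_id
}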

Include volume_type for launch configuration ebs_block_device

Describe the solution you'd like
Currently, the launch configuration does not include the "volume_type" variable for "ebs_block_device". In this case, any ebs_block_devices passed are created with the default "gp2" type. Adding this variable will allow users to pass in their desired volume type, such as the more cost-efficient "gp3".

Additional code should just be the extra line:
volume_type = lookup(ebs_block_device.value, "volume_type", null)
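
For context, a sketch of where that line would sit, assuming the module uses a Terraform 0.12+ dynamic ebs_block_device block; the attribute names follow the aws_launch_configuration schema and the variable name is illustrative:

dynamic "ebs_block_device" {
  for_each = var.ebs_block_device
  content {
    device_name = ebs_block_device.value["device_name"]
    volume_size = lookup(ebs_block_device.value, "volume_size", null)
    # New: lets callers request e.g. the cheaper "gp3" type instead of the default "gp2".
    volume_type = lookup(ebs_block_device.value, "volume_type", null)
  }
}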

Describe alternatives you've considered
We considered simply downloading your module and making the desired changes, but we would rather not run the module locally and instead keep using yours as it gets updated. Not sure if there are other options; we would like to hear your opinions.

Additional context
Open to other suggestions.

Allow additional security groups for ASG Instances

I am referring to the line in the nomad-cluster module.

It would be desirable to allow users to define additional security groups for instances launched by the ASG, which would be concatenated with aws_security_group.lc_security_group.id. For example, if I have an ELB launched, I would like to be able to add the ELB security group as ingress.

I have a patch for this, although I am not sure how this can be tested.
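
A sketch of one way to expose this; the variable name is hypothetical and the launch configuration is abbreviated:

variable "additional_security_group_ids" {
  description = "Extra security group IDs to attach to instances launched by the ASG"
  type        = list(string)
  default     = []
}

resource "aws_launch_configuration" "launch_configuration" {
  # ... other arguments unchanged ...
  security_groups = concat(
    [aws_security_group.lc_security_group.id],
    var.additional_security_group_ids,
  )
}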

Bad keyword "ConditionalFileNotEmpty" in generated systemd config

Describe the bug

The script modules/run-nomad/run-nomad generates a systemd config with a typo.
The generated keyword "ConditionalFileNotEmpty" in the [Unit] section should be "ConditionFileNotEmpty".

To Reproduce

  • Create a Nomad cluster using Terraform that uses this script to run Nomad on a server node.
  • Log in to a server node, run journalctl -u nomad.service
  • The output shows /etc/systemd/system/nomad.service:6: Unknown lvalue 'ConditionalFileNotEmpty' in section 'Unit'

Expected behavior

With "ConditionFileNotEmpty", I expect this error gone.

Cgroups not mounting

Using your example, I was able to launch a Consul cluster (working fine) and a Nomad cluster which successfully connects to Consul.

However, two of the drivers, java and exec, are failing to load due to error "Cgroup mount point unavailable."

Nomad client log file:

==> Loaded configuration from /opt/nomad/config/default.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 172.31.21.117:4646
            Bind Addrs: HTTP: 0.0.0.0:4646
                Client: true
             Log Level: DEBUG
                Region: us-west-2 (DC: us-west-2b)
                Server: false
               Version: 0.9.4

==> Nomad agent started! Log data will stream in below:

    2019-08-07T17:51:10.239Z [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/data/plugins
    2019-08-07T17:51:10.305Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/data/plugins
    2019-08-07T17:51:10.305Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/data/plugins
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
    2019-08-07T17:51:10.307Z [INFO ] client: using state directory: state_dir=/opt/nomad/data/client
    2019-08-07T17:51:10.327Z [INFO ] client: using alloc directory: alloc_dir=/opt/nomad/data/alloc
    2019-08-07T17:51:10.331Z [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters="[arch cgroup consul cpu host memory network nomad signal storage vault env_gce env_aws]"
    2019-08-07T17:51:10.333Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup period=15s
    2019-08-07T17:51:10.335Z [DEBUG] client.fingerprint_mgr.cpu: detected cpu frequency: MHz=2400
    2019-08-07T17:51:10.335Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=1
    2019-08-07T17:51:10.337Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul period=15s
    2019-08-07T17:51:10.348Z [WARN ] client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=eth0
    2019-08-07T17:51:10.348Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/eth0/speed
    2019-08-07T17:51:10.348Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected and no speed specified by user, falling back to default speed: mbits=1000
    2019-08-07T17:51:10.348Z [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=eth0 IP=172.31.21.117
    2019-08-07T17:51:10.355Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault period=15s
    2019-08-07T17:51:10.373Z [DEBUG] client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type resp_code=404
    2019-08-07T17:51:10.373Z [DEBUG] client.fingerprint_mgr: detected fingerprints: node_attrs="[arch cpu host network nomad signal storage env_aws]"
    2019-08-07T17:51:10.373Z [INFO ] client.plugin: starting plugin manager: plugin-type=driver
    2019-08-07T17:51:10.373Z [INFO ] client.plugin: starting plugin manager: plugin-type=device
    2019-08-07T17:51:10.400Z [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get http://127.0.0.1:8500/v1/catalog/datacenters: dial tcp 127.0.0.1:8500: connect: connection refused"
    2019-08-07T17:51:10.400Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
    2019-08-07T17:51:10.400Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
    2019-08-07T17:51:10.400Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
    2019-08-07T17:51:10.400Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=raw_exec health=healthy description=Healthy
    2019-08-07T17:51:10.407Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=exec health=unhealthy description="Cgroup mount point unavailable"
    2019-08-07T17:51:10.407Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=
    2019-08-07T17:51:10.407Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=java health=unhealthy description="Cgroup mount point unavailable"
    2019-08-07T17:51:10.411Z [DEBUG] client.driver_mgr.docker: could not connect to docker daemon: driver=docker endpoint=unix:///var/run/docker.sock error="Get http://unix.sock/version: dial unix /var/run/docker.sock: connect: no such file or directory"
    2019-08-07T17:51:10.411Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=docker health=undetected description="Failed to connect to docker daemon"
    2019-08-07T17:51:10.411Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=rkt health=undetected description="Failed to execute rkt version: exec: "rkt": executable file not found in $PATH"
    2019-08-07T17:51:10.411Z [DEBUG] client.driver_mgr: detected drivers: drivers="map[undetected:[qemu docker rkt] healthy:[raw_exec] unhealthy:[exec java]]"
    2019-08-07T17:51:10.411Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
    2019-08-07T17:51:10.411Z [INFO ] client: started client: node_id=7b3d2591-71fa-9d92-d949-2a748099420b
    2019-08-07T17:51:10.414Z [WARN ] client.server_mgr: no servers available
    2019-08-07T17:51:10.414Z [DEBUG] client: registration waiting on servers
    2019-08-07T17:51:10.414Z [WARN ] client.server_mgr: no servers available
    2019-08-07T17:51:10.415Z [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get http://127.0.0.1:8500/v1/catalog/datacenters: dial tcp 127.0.0.1:8500: connect: connection refused"
    2019-08-07T17:51:13.468Z [ERROR] http: request failed: method=GET path=/v1/agent/health?type=client error="{"client":{"ok":false,"message":"no known servers"}}" code=500
    2019-08-07T17:51:13.468Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=568.093µs
    2019-08-07T17:51:23.755Z [ERROR] http: request failed: method=GET path=/v1/agent/health?type=client error="{"client":{"ok":false,"message":"no known servers"}}" code=500
    2019-08-07T17:51:23.755Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=140.767µs
    2019-08-07T17:51:25.625Z [INFO ] client.fingerprint_mgr.consul: consul agent is available
    2019-08-07T17:51:30.745Z [WARN ] client.server_mgr: no servers available
    2019-08-07T17:51:30.745Z [DEBUG] client: registration waiting on servers
    2019-08-07T17:51:30.747Z [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=[us-west-2]
    2019-08-07T17:51:30.765Z [INFO ] client.consul: discovered following servers: servers=172.31.13.97:4647
    2019-08-07T17:51:30.765Z [DEBUG] client.server_mgr: new server list: new_servers=172.31.13.97:4647 old_servers=
    2019-08-07T17:51:30.777Z [DEBUG] client: updated allocations: index=1 total=0 pulled=0 filtered=0
    2019-08-07T17:51:30.778Z [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=0
    2019-08-07T17:51:30.778Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=0 errors=0
    2019-08-07T17:51:30.781Z [INFO ] client: node registration complete
    2019-08-07T17:51:33.756Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=270.264µs
    2019-08-07T17:51:35.753Z [DEBUG] client: state updated: node_status=ready
    2019-08-07T17:51:38.116Z [DEBUG] client: state changed, updating node and re-registering
    2019-08-07T17:51:38.121Z [INFO ] client: node registration complete
    2019-08-07T17:51:43.757Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=159.506µs

The configuration is directly from your example code except I set the number of servers and clients to one.

Please advise.

HCL2'ify the examples?

Hi there!
Using packer hcl2_upgrade results in a few oddities, such as the following for nomad-consul[.pkr].json:

# 1 error occurred upgrading the following block:
# unhandled "clean_resource_name" call:
# there is no way to automatically upgrade the "clean_resource_name" call.
# Please manually upgrade to use custom validation rules, `replace(string, substring, replacement)` or `regex_replace(string, substring, replacement)`
# Visit https://packer.io/docs/templates/hcl_templates/variables#custom-validation-rules , https://www.packer.io/docs/templates/hcl_templates/functions/string/replace or https://www.packer.io/docs/templates/hcl_templates/functions/string/regex_replace for more infos.

This is easily solved, but given that HCL2 is now recommended and the default, why not upgrade the examples?
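
For example, one way to replace the unsupported clean_resource_name call after running hcl2_upgrade is the regex_replace function that the upgrade note points to; the character set below is an approximation, not the exact one clean_resource_name uses:

locals {
  # Strip characters that are not allowed in AMI names.
  ami_name = regex_replace("nomad-consul-${formatdate("YYYYMMDDhhmmss", timestamp())}", "[^a-zA-Z0-9()./_-]", "-")
}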

Feature Request: Support for ARM installations

Nomad and Consul both have ARM64 distributions, which seem like a good fit for AWS's Graviton-based t/m/c/r6g instances.

Unfortunately, it looks like the install-nomad module in this repo (at the least) is only designed for x86-64.

I could certainly take a stab at it, but would it be possible to add support for ARM64 to this repo, so that the modules herein can be used for deployments on those newer instance types?

Couldn't apply a plan output

Hi, if I run terraform apply with the output of a previous plan, I get an error message like the one below:

The argument "region" is required, but was not set.

It seems that the value of the region parameter entered at the interactive prompt is not saved in the plan output.

Here is the detail.

  • First of all, I ran terraform plan, entered the region value, and saved the output to a local file called reduce_client_number.plan.
ubuntu@ip-172-31-44-132:~/nomad/terraform-aws-nomad$ terraform plan -out=reduce_client_number.plan
provider.aws.region
  The region where AWS operations will take place. Examples
  are us-east-1, us-west-2, etc.

  Enter a value: us-east-1

Refreshing Terraform state in-memory prior to plan...
...
...
Plan: 0 to add, 1 to change, 0 to destroy.
------------------------------------------------------------------------
This plan was saved to: reduce_client_number.plan

To perform exactly these actions, run the following command to apply:
    terraform apply "reduce_client_number.plan"
  • Then I tried to apply the plan but encountered an error saying that the region argument was missing.
ubuntu@ip-172-31-44-132:~/nomad/terraform-aws-nomad$ terraform apply "reduce_client_number.plan"

Error: Missing required argument

The argument "region" is required, but was not set.
  • If I applied without using the previously generated plan and entered the region name interactively, it worked without issues.
ubuntu@ip-172-31-44-132:~/nomad/terraform-aws-nomad$ terraform apply
provider.aws.region
  The region where AWS operations will take place. Examples
  are us-east-1, us-west-2, etc.

  Enter a value: us-east-1

data.template_file.user_data_client: Refreshing state...
...
module.servers.module.security_group_rules.module.client_security_group_rules.aws_security_group_rule.allow_serf_lan_tcp_inbound_from_self: Refreshing state... [id=sgrule-910686762]

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place
...
...

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

Outputs:

asg_name_clients = tf-asg-20190716110806454000000009
asg_name_servers = nomad-example-server2019071611080760310000000a
aws_region = us-east-1
iam_role_arn_clients = arn:aws:iam::094240567632:role/nomad-example-client20190716110746333400000004
iam_role_arn_servers = arn:aws:iam::094240567632:role/nomad-example-server20190716110746332700000002
iam_role_id_clients = nomad-example-client20190716110746333400000004
iam_role_id_servers = nomad-example-server20190716110746332700000002
launch_config_name_clients = nomad-example-client-20190716110749662000000007
launch_config_name_servers = nomad-example-server-20190716110749870600000008
nomad_servers_cluster_tag_key = nomad-servers
nomad_servers_cluster_tag_value = auto-join
num_clients = 2
num_nomad_servers = 3
security_group_id_clients = sg-0018d90e77a2f8baf
security_group_id_servers = sg-0c5c2726c63054b3a
ubuntu@ip-172-31-44-132:~/nomad/terraform-aws-nomad$
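
A minimal workaround sketch, assuming you pin the region in the root configuration (or export AWS_DEFAULT_REGION) so the saved plan no longer depends on the interactive prompt; the variable name and default are illustrative:

variable "aws_region" {
  description = "The AWS region to deploy into"
  default     = "us-east-1"
}

provider "aws" {
  region = var.aws_region
}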

Roll out updates

My team recently automated the rolling-update process. Essentially it was accomplished with two pieces:

  1. A lambda function which is triggered by a Lifecycle Hook when an auto-scaling group wants to terminate a node. This lambda function gets the instance which is being terminated and uses SSM to drain the nomad client of jobs. When it's completely drained, it completes the lifecycle hook to fully terminate the instance.

  2. A script which orchestrates the deployment. It takes advantage of the autoscaling API (instead of EC2) so that it triggers the lifecycle hook to safely drain the nomad client before termination. In our case we chose to scale-out first so that we always have N clients available, rather than N-1 if you scale-in.

I just wanted to open this issue to see if anyone else has feedback on this approach and whether or not you'd like this to be added to that module!
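
For reference, a minimal sketch of the lifecycle-hook piece of that approach; the resource name, module output, and timeout are illustrative, not code from this repo:

resource "aws_autoscaling_lifecycle_hook" "drain_nomad_client" {
  name                   = "drain-nomad-client"
  autoscaling_group_name = module.nomad_clients.asg_name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  heartbeat_timeout      = 600
  default_result         = "CONTINUE"

  # A Lambda function subscribed to this hook drains the Nomad client via SSM,
  # then completes the lifecycle action so the instance can terminate.
}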

[IMPROVEMENT] Upgrade Amazon Linux Support to Version 2

AWS updated Amazon Linux to a new version, Amazon Linux 2: https://aws.amazon.com/de/amazon-linux-2/

Interesting Features:

  • systemd
  • on-premises support (Docker, ...)

Outlook:
systemd support would facilitate the setup and configuration of Nomad and Consul as well.
Supervisor would no longer be needed, which streamlines the setup, since Debian and Ubuntu use systemd too.

Question:
Should the Amazon Linux 1 support be dropped?

  1. Pro: Legacy support
  2. Contra: Duplication of many configuration files, especially for the AMI creation.
  3. Contra: Increased complexity in the install-nomad and run-nomad script.

Suggestion:
Drop support since the provided setup can only be seen as an example.

Note:
Once this is decided, I can provide a PR.

Misleading Cluster Join Information

description = "Add a tag with key var.cluster_tag_key and this value to each Instance in the ASG. This can be used to automatically find other Consul nodes and form a cluster."

The examples and variable description for the ASG tag lead one to believe that the nomad-cluster module uses the cloud auto-join strategy to form a Nomad cluster, but it really uses the Consul strategy to form clusters.

I think the misleading information was copied from the consul-cluster module.

The cluster_tag_key and cluster_tag_value variables are not special - they're just a tag as referenced in #8

Add Windows nodes to the cluster

I have extended the Terraform script to create an ASG for Windows nodes and attach it to the client LB.
I'm trying to install Nomad and Consul correctly with auto-join, but the PowerShell scripts are very hard to write (I'm trying to mimic what the Ubuntu shell scripts do, but some things are above my head).
Is there any chance you can provide those scripts, please? Also, something in the HCL files must be different, I guess.
Documentation on the internet for running Nomad/Consul on Windows is very poor. I've also tried to install them using Chocolatey, but the config is wrong and not trivial to change.

Thanks

Issue while nomad startup in the EC2 instance

When this script runs on the EC2 instance: https://github.com/hashicorp/terraform-aws-nomad/blob/master/examples/root-example/user-data-client.sh, the logs in /var/log/user_data.log show this:

2018-09-13 11:43:29 [INFO] [run-consul] Creating default Consul configuration
2018-09-13 11:43:29 [INFO] [run-consul] Installing Consul config file in /opt/consul/config/default.json
2018-09-13 11:43:29 [INFO] [run-consul] Creating Supervisor config file to run Consul in /etc/supervisor/conf.d/run-consul.conf
2018-09-13 11:43:29 [INFO] [run-consul] Reloading Supervisor config and starting Consul
consul: available
consul: added process group
2018-09-13 11:43:30 [INFO] [run-nomad] Looking up current Compute Instance name
2018-09-13 11:43:30 [INFO] [run-nomad] Looking up Metadata value at http://metadata.google.internal/computeMetadata/v1/instance/name
curl: (6) Could not resolve host: metadata.google.internal

This means Consul started successfully but the Nomad startup failed (note that metadata.google.internal is the GCP metadata endpoint, which is not reachable from an EC2 instance).

When I run nomad status on the instance, this is the output: Error querying jobs: Get http://127.0.0.1:4646/v1/jobs: dial tcp 127.0.0.1:4646: connect: connection refused.

Toggle detailed monitoring in launch configuration

I have been using the Nomad cluster module and noticed a large increase in cost from the additional Nomad clusters we recently added, and I was curious what caused it. CloudWatch was an outlier in the increase, and I noticed that, by default, detailed monitoring is enabled through Terraform's aws_launch_configuration resource.

There currently isn't a variable to toggle this off, which would benefit my non-production clusters, where I'm not really concerned about scaling and more concerned about the running cost :)

Happy to PR.
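
A sketch of the requested toggle; the variable name is hypothetical and the launch configuration is abbreviated:

variable "enable_detailed_monitoring" {
  description = "Whether to enable detailed (1-minute) CloudWatch monitoring for launched instances"
  type        = bool
  default     = true
}

resource "aws_launch_configuration" "launch_configuration" {
  # ... other arguments unchanged ...
  enable_monitoring = var.enable_detailed_monitoring
}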

Module nomad-cluster : aws_autoscaling_attachment broken

Hi,

I think the same issue that was reported for terraform-aws-consul (hashicorp/terraform-aws-consul#183) also exists for nomad.

When I try to create an "aws_autoscaling_attachment" using the autoscaling group "autoscaling_group" of the "nomad-cluster" module, I experience the following behavior:

First apply: Target groups are added
Second apply: Target groups are removed
Third apply: Target groups are added
...

According to https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_attachment, I guess the following snippet should also be added for the autoscaling group here:

  lifecycle {
    ignore_changes = [load_balancers, target_group_arns]
  }

The same change was made for terraform-aws consul here: hashicorp/terraform-aws-consul#188

Terraform init fails on this module

Module version

terraform-aws-nomad 0.3.1

Config file

module "nomad" {
  source  = "hashicorp/nomad/aws"
  version = "0.3.1"
}

Steps to reproduce

configure above .tf file
terraform init

error log

terraform init
Initializing modules...
- module.nomad
  Found version 0.3.1 of hashicorp/nomad/aws on registry.terraform.io
  Getting source "hashicorp/nomad/aws"
- module.nomad.servers
  Getting source "git::[email protected]:hashicorp/terraform-aws-consul.git//modules/consul-cluster?ref=v0.1.0"
- module.nomad.nomad_security_group_rules
  Getting source "./modules/nomad-security-group-rules"
- module.nomad.clients
  Getting source "./modules/nomad-cluster"
- module.nomad.consul_iam_policies
  Getting source "git::[email protected]:hashicorp/terraform-aws-consul.git//modules/consul-iam-policies?ref=v0.1.0"
- module.nomad.servers.security_group_rules
  Getting source "../consul-security-group-rules"
- module.nomad.servers.iam_policies
  Getting source "../consul-iam-policies"
- module.nomad.clients.security_group_rules
  Getting source "../nomad-security-group-rules"

Error: module "servers": "tags" is not a valid argument

[nomad-security-group-rules] doesn't allow empty inbound CIDR

I noticed that the Nomad SG rules module lacks the ability to accept an empty inbound CIDR list. This is useful when you don't want to explicitly assign CIDRs and instead have an existing SG that you set via var.security_groups.

The Vault and Consul equivalents have this ability, and it seems reasonable to add it to the Nomad version as well.
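
A sketch of the pattern used for this elsewhere: only create the CIDR-based ingress rule when the list is non-empty. Variable and resource names are illustrative, not the module's actual code:

resource "aws_security_group_rule" "allow_http_inbound" {
  count             = length(var.allowed_inbound_cidr_blocks) > 0 ? 1 : 0
  type              = "ingress"
  from_port         = var.http_port
  to_port           = var.http_port
  protocol          = "tcp"
  cidr_blocks       = var.allowed_inbound_cidr_blocks
  security_group_id = var.security_group_id
}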

Open files limit is restricted by supervisord

Nomad version: Nomad v0.9.6 (1f8eddf2211d064b150f141c86e30d9fceabec89)
AWS nomad module version: v0.5.0
OS: 16.04.6 LTS (Xenial Xerus) 

We started running into the "too many open files" issue on our client nodes due to increased IO usage by some of our services. The erroring process (a JVM) has been holding a little over 4000 file descriptors. Given the default soft limit of 1024, we tried increasing it to 10000 by updating /etc/security/limits.conf. Even though the new config seemed to be picked up by the system, we continued to experience the issue.

Then we saw that cat /proc/<pid_id>/limits was showing

Limit                     Soft Limit           Hard Limit           Units     
Max open files            4096                 4096                 files

which made us learn about the whole chain involved in starting an allocation with Nomad, and realize that the process inherits the limits from its parent process. Given that the allocation is started by Nomad and Nomad is started by supervisord, we found that the actual limit is set by supervisord.

The solution for us was to add another config file in /etc/supervisor/conf.d that included

[supervisord]
minfds=8192

See the docs for minfds here http://supervisord.org/configuration.html#supervisord-section-settings

To prevent surprises for other users, I'd suggest that either the default supervisord config be updated in this module (perhaps with an optional argument passed to the install script), or at least that this fact be mentioned in the documentation so users know how to act.

enable encryption for run_consul

It would be nice to include options for enabling encryption and gossip encryption in the run-nomad script.

This way this module would be on par with the rest of the published terraform-aws modules and could be used in a production environment.
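
For illustration, the Nomad server stanza that gossip encryption would add to the generated config; the key below is a placeholder, not a real value:

server {
  enabled          = true
  bootstrap_expect = 3

  # Gossip encryption key, e.g. generated with `nomad operator keygen` (placeholder below).
  encrypt = "REPLACE_WITH_BASE64_GOSSIP_KEY"
}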

Fix examples/nomad-consul-ami/nomad-consul-docker.json

Hi! I have no idea what I'm doing, but I had issues using examples/nomad-consul-ami/nomad-consul-docker.json,
and the following made my problems go away:

Find and replace all occurrences of clean_ami_name with clean_resource_name in the file.

(I had a PR ready for this, but I don't feel like reading your CLA at the moment.)

Support vault integration

run-nomad does not currently support the vault stanza for Vault integration on either the server or the client.

This would add

vault {
  enabled
  address
  create_from_role
  token
}

at a minimum.

Using `aws_autoscaling_attachment`

I would like to be able to attach an ELB to the auto scaling group after the initial creation of the ASG using the aws_autoscaling_attachment resource.

However, the current implementation sets target_group_arns and load_balancers in the aws_autoscaling_group resource. These two usages would overwrite each other.

I would like to propose that we stop setting these two arguments in the aws_autoscaling_group and let users use the aws_autoscaling_attachment resource instead. (Sadly, there is no "unset" functionality in Terraform, AFAIK.) I would argue that using aws_autoscaling_attachment is more flexible.

This would, unfortunately, be a breaking change.
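
A sketch of the proposed usage, attaching a target group outside the module instead of setting target_group_arns on the ASG; the resource and output names are illustrative:

resource "aws_autoscaling_attachment" "clients" {
  autoscaling_group_name = module.nomad_clients.asg_name
  alb_target_group_arn   = aws_lb_target_group.nomad_clients.arn
}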

ECR login

I am a Nomad and Consul newbie and used these modules to create a new cluster. Can you please help me understand how to specify/create the client config so that I can use an ECR repository when moving ECS tasks to Nomad jobs?
I tried following the Nomad documentation but found it lacking.

Why pin version?

In file:

https://github.com/hashicorp/terraform-aws-nomad/blob/master/examples/nomad-consul-ami/nomad-consul-docker.json

{
  "min_packer_version": "0.12.0",
  "variables": {
    "aws_region": "us-east-1",
    "nomad_version": "0.7.1",
    "consul_module_version": "v0.3.1", <-------
    "consul_version": "1.0.3"
},

The version of "consul_module_version" is being pinned to a branch named "v0.3.1."

Shouldn't this be the latest version of the module? Why pin to a very old version?

If I remove the pin and use master instead, I get this error (Amazon Linux 2):

Starting cloud-init: Cloud-init v. 0.7.6 running 'modules:final' at Tue, 20 Aug 2019 20:00:05 +0000. Up 33.87 seconds.

user-data: 2019-08-20 20:00:05 [ERROR] [run-consul] The binary 'systemctl' is required by this script but is not installed or in the system's PATH.

What is the correct version to use and what should I use for consul_version?

Intermittent test failure

We sometimes get a test failure like this:

TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:10Z retry.go:72: Check Nomad cluster has expected number of servers and clients
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:10Z nomad_helpers.go:214: Making an HTTP GET to URL http://54.200.89.51:4646/v1/nodes
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:15Z nomad_helpers.go:227: Response from Nomad for URL http://54.200.89.51:4646/v1/nodes: No cluster leader
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:15Z retry.go:84: Check Nomad cluster has expected number of servers and clients returned an error: invalid character 'N' looking for beginning of value. Sleeping for 10s and will try again.
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:25Z retry.go:72: Check Nomad cluster has expected number of servers and clients
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:25Z nomad_helpers.go:214: Making an HTTP GET to URL http://54.200.89.51:4646/v1/nodes
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:30Z nomad_helpers.go:227: Response from Nomad for URL http://54.200.89.51:4646/v1/nodes: No cluster leader
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:30Z retry.go:84: Check Nomad cluster has expected number of servers and clients returned an error: invalid character 'N' looking for beginning of value. Sleeping for 10s and will try again.
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:40Z retry.go:72: Check Nomad cluster has expected number of servers and clients
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:40Z nomad_helpers.go:214: Making an HTTP GET to URL http://54.200.89.51:4646/v1/nodes
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:46Z nomad_helpers.go:227: Response from Nomad for URL http://54.200.89.51:4646/v1/nodes: No cluster leader
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:46Z retry.go:84: Check Nomad cluster has expected number of servers and clients returned an error: invalid character 'N' looking for beginning of value. Sleeping for 10s and will try again.
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:56Z retry.go:72: Check Nomad cluster has expected number of servers and clients
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:17:56Z nomad_helpers.go:214: Making an HTTP GET to URL http://54.200.89.51:4646/v1/nodes
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:18:01Z nomad_helpers.go:227: Response from Nomad for URL http://54.200.89.51:4646/v1/nodes: No cluster leader
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:18:01Z retry.go:84: Check Nomad cluster has expected number of servers and clients returned an error: invalid character 'N' looking for beginning of value. Sleeping for 10s and will try again.
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:18:11Z retry.go:72: Check Nomad cluster has expected number of servers and clients
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:18:11Z nomad_helpers.go:214: Making an HTTP GET to URL http://54.200.89.51:4646/v1/nodes
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:18:16Z nomad_helpers.go:227: Response from Nomad for URL http://54.200.89.51:4646/v1/nodes: No cluster leader
TestNomadConsulClusterColocatedAmazonLinux2Ami 2021-01-26T12:18:16Z retry.go:84: Check Nomad cluster has expected number of servers and clients returned an error: invalid character 'N' looking for beginning of value. Sleeping for 10s and will try again.

After a bunch more retries, it times out and fails the test. I wonder what response we're getting there?
