elastic / ansible-elastic-cloud-enterprise

Ansible playbooks for Elastic Cloud Enterprise (ECE)

Home Page: https://www.elastic.co/products/ece

License: Other

Shell 97.61% Jinja 2.39%

ansible-elastic-cloud-enterprise's People

Contributors

3kt, alexsuppanz, alpar-t, alsayasneh, andrewpatroni, aran-k, artemnikitin, chuckmilam, forzamehlano, jakommo, jerkfacecodemonkey, jguay, kcm, m-a-leclercq, madhava-sridhar, marclop, mieciu, mnewswanger, navyau09, nordbergm, obierlaire, philippkahr, redcinelli, rhass, sachin-frayne, shasts, ugosan, vaubarth, verticaleap, vogxn


ansible-elastic-cloud-enterprise's Issues

Provide sensible defaults for memory settings

The ECE default memory settings are pretty much useless.
We set up a test installation and used the settings documented for a 'Small baseline installation' on the primary node, but, accidentally, not on the secondary and tertiary nodes.
The end result was that the secondary and tertiary nodes started but could not be administered via the Cloud UI. Worse, they could not be /removed/, and we had to ditch the whole installation and start all over again.
On the plus side: installation is really fast!

We're currently using a dirty hack to solve this problem.
In 'bootstrap/main.yml' the task

- name: Memory settings for ECE
  set_fact:
    ece_memory_settings: ' {"runner":{"xms":"1G","xmx":"1G"}[...]}'

defines a variable, and we add

--memory-settings '{{ ece_memory_settings }}'

to the install script call in the respective 'install_stack.yml' (the whitespace before '{' is necessary; otherwise the line is not treated simply as text).

This is definitely the wrong way to do this - but it works for us so far!
A better approach might be to define variables for each (ECE) role and read the values from the 'inventory.yml' file?
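As a sketch of that idea, the role could expose a structured default that each host can override in 'inventory.yml'. The variable name is hypothetical; the per-role values mirror the defaults visible in install command output elsewhere in this tracker.

```yaml
# Hypothetical variable layout; not part of the role today.
# Override any entry per host or group in inventory.yml.
ece_memory_settings:
  runner:          { xms: "1G", xmx: "1G" }
  proxy:           { xms: "8G", xmx: "8G" }
  zookeeper:       { xms: "4G", xmx: "4G" }
  director:        { xms: "1G", xmx: "1G" }
  constructor:     { xms: "4G", xmx: "4G" }
  "admin-console": { xms: "4G", xmx: "4G" }
```

The install call could then pass --memory-settings ' {{ ece_memory_settings | to_json }}', keeping the leading-space workaround described above.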

ECE Upgrade command from Ansible is not working

I'm getting an error while executing below play.

{{ ece_version }}: 2.6.0

Play:

- name: Execute upgrade
  shell: /mnt/data/elastic-cloud-enterprise.sh upgrade --cloud-enterprise-version {{ ece_version }}
  become: yes
  become_method: sudo
  become_user: elastic

Error:
fatal: [ece-host]: FAILED! => {
"ansible_job_id": "75300605991.103745",
"changed": true,
"cmd": "/mnt/data/elastic-cloud-enterprise.sh upgrade --cloud-enterprise-version 2.6.0",
"delta": "0:00:00.665977",
"end": "2020-09-04 13:30:33.803465",
"finished": 1,
"invocation": {
"module_args": {
"_raw_params": "/mnt/data/elastic-cloud-enterprise.sh upgrade --cloud-enterprise-version 2.6.0",
"_uses_shell": true,
"argv": null,
"chdir": null,
"creates": null,
"executable": null,
"removes": null,
"stdin": null,
"stdin_add_newline": true,
"strip_empty_ends": true,
"warn": false
}
},
"msg": "non-zero return code",
"rc": 1,
"start": "2020-09-04 13:30:33.137488",
"stderr": "",
"stderr_lines": [],
"stdout": "\u001b[0;31mContainer frc-runners-runner was not found -- is the environment running?\u001b[0m",
"stdout_lines": [
"\u001b[0;31mContainer frc-runners-runner was not found -- is the environment running?\u001b[0m"
]
}

Default to Docker 1.13 when installing on RHEL 7.x

ECE is supported on RHEL 7 when using Docker 1.13 [1], however the playbook uses a default of docker_version: "18.09" [2], which means users can end up with an unsupported installation if they don't adjust the docker_version variable. It'd be great if we could default to 1.13 (maybe even prohibit 18.09 [3]?) if installing on RHEL 7.

[1] https://www.elastic.co/guide/en/cloud-enterprise/current/ece-prereqs-software-linux.html
[2] https://github.com/elastic/ansible-elastic-cloud-enterprise/blob/master/defaults/main.yml#L11
[3] https://github.com/elastic/ansible-elastic-cloud-enterprise/blob/master/tasks/base/RedHat-7/install_docker.yml#L24-L36
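One way to sketch this (task and condition are illustrative only; the real role keeps docker_version in defaults/main.yml as linked above) is a distribution-aware override before the install tasks run:

```yaml
# Sketch only: force the supported Docker 1.13 on RHEL 7 regardless of the
# global default of "18.09", per the ECE prerequisites linked in [1].
- name: Default to Docker 1.13 on RHEL 7
  set_fact:
    docker_version: "1.13"
  when:
    - ansible_distribution == "RedHat"
    - ansible_distribution_major_version == "7"
```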

Ability to set swap size per host as a variable

We'd like to allow customers to configure the amount of swap defined on each ECE host, instead of always using a predetermined formula as seen here

The formula should act as a default, but there may be some cases where 7% of disk space is too little or too much from a customers' perspective.
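For reference, the formula the playbook computes (the smaller of half the RAM and 7% of the volume-group size, in MB) can be sketched in plain shell. The numbers here are illustrative, not read from a real host:

```shell
# Illustrative inputs: 64 GB of RAM, ~1 TB volume group (values in MB).
ram_mb=64000
disk_mb=1000000
max_mb=$(( disk_mb * 7 / 100 ))   # 7% of disk cap -> 70000
swap_mb=$(( ram_mb / 2 ))         # half of RAM   -> 32000
if [ "$swap_mb" -gt "$max_mb" ]; then swap_mb=$max_mb; fi
echo "$swap_mb"                   # smaller of the two -> 32000
```

A per-host variable would simply replace this computed value when defined.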

Option to remove existing swap volumes

Hi,

this issue is regarding the process by which the required swap partition is created by the playbook.

The size of the swap partition is defined as 7% of the whole disk size, which is in my opinion way too much (7% should be the maximum).

- name: Define size of swap
  shell: SWAP_MAX_SIZE=$(sudo vgdisplay --units M lxc | grep "VG Size" | awk '{ print mem=int(0.07*$3) }'); grep MemTotal /proc/meminfo | awk -v MAXMEM=${SWAP_MAX_SIZE} '{ mem=int($2/(2*1024)); if(mem>MAXMEM) mem=MAXMEM; print mem; }'
  register: swap_size

Using the calculated size the logical volume for swap is created and initialized

- name: Setup swap
  command: mkswap /dev/lxc/swap

Then the last step is to enable swapping to the newly created partition

- name: Enable all swap devices
  command: swapon -a

I see the following issues:

  • The swap partition seems to be too big, since a sensible swap size is more closely related to the actual RAM size and the number of CPUs (see the Red Hat Storage Administration Guide, for example). Maybe we should find a way to make the swap size depend on RAM and CPU? Or is it a recommendation from Elastic for ECE/ELK?
  • The playbook does not check whether there are existing swap partitions. Either we remove the existing swap partition, or we allow choosing not to create a swap partition during the installation and let the administrator self-manage it

What's your opinion on that?

Regards,
Alex

Allow customizing data path

This is now hardcoded to /mnt/data. We would like to take this role into use, but our convention is to use /data. Right now I have to modify the role, which introduces merge conflicts when we update it from upstream. Can you make it a variable in the official repo?

UPD: Made PR for this since I already performed the necessary changes
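A minimal sketch of the requested change (the role already uses a data_dir variable internally for Docker storage options, so exposing it as a default is the natural fit):

```yaml
# defaults/main.yml (sketch): make the data path overridable
data_dir: /mnt/data

# inventory or group_vars override for the convention described above:
# data_dir: /data
```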

Ansible installer fails on GCP centos7 image

Issue Description

The ansible-playbook fails with the following error when run on centos7 using this image (standard centos in gcp)
image - projects/centos-cloud/global/images/centos-7-v20191210

Error message -

Ece version: 2.4.3

TASK [ansible-elastic-cloud-enterprise : include_tasks] ************************
included: /Users/omerkushmaro/.ansible/roles/ansible-elastic-cloud-enterprise/tasks/bootstrap/primary/install_stack.yml for 35.184.83.18
TASK [ansible-elastic-cloud-enterprise : Execute the primary installation] *****
fatal: [35.184.83.18]: FAILED! => {"changed": true, "cmd": "/home/elastic/elastic-cloud-enterprise.sh install --availability-zone us-central1-a --cloud-enterprise-version 2.4.3 --docker-registry docker.elastic.co --ece-docker-repository cloud-enterprise --memory-settings ' {\"runner\":{\"xms\":\"1G\",\"xmx\":\"1G\"},\"proxy\":{\"xms\":\"8G\",\"xmx\":\"8G\"},\"zookeeper\":{\"xms\":\"4G\",\"xmx\":\"4G\"},\"director\":{\"xms\":\"1G\",\"xmx\":\"1G\"},\"constructor\":{\"xms\":\"4G\",\"xmx\":\"4G\"},\"admin-console\":{\"xms\":\"4G\",\"xmx\":\"4G\"}}'", "delta": "0:00:00.520743", "end": "2020-01-30 17:16:30.355823", "msg": "non-zero return code", "rc": 1, "start": "2020-01-30 17:16:29.835080", "stderr": "", "stderr_lines": [], "stdout": "\u001b[0;31mCan't determine a default HOST_IP ('ip' tool can't be found). Please supply '--host-ip' with the appropriate ip address.\u001b[0m", "stdout_lines": ["\u001b[0;31mCan't determine a default HOST_IP ('ip' tool can't be found). Please supply '--host-ip' with the appropriate ip address.\u001b[0m"]}

To Recreate

use this terraform example for installing 3 node ece instance in gcp

  • needs terraform 0.12 installed
  • need to change the image id in servers.tf from ubuntu, to the image mentioned in the description above, and run the thing :)
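The error above comes from the installer not finding the 'ip' utility, which the GCP CentOS 7 image ships without. A hedged sketch of a pre-install task that would avoid it (task name is illustrative):

```yaml
# Sketch: make sure the 'ip' tool exists so elastic-cloud-enterprise.sh
# can determine a default HOST_IP. On CentOS/RHEL it lives in the
# iproute package.
- name: Install iproute so the installer can determine HOST_IP
  yum:
    name: iproute
    state: present
```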

On multiple runs "Create swap volume" failures

When running the ansible job multiple times, there is a failure on the "create swap volume" task. The following output is shown

TASK [elastic-cloud-enterprise : Create swap volume] 
******************************************************************************************************
fatal: [192.168.1.10]: FAILED! => {"changed": false, "msg": "Sorry, no shrinking of swap without force=yes."}

The server hardware has not been changed during runs.

Also if the proposed swap size is smaller than the current size, should this be run ?

Thanks

Centos fails to install because of "Failed to download metadata for repo 'appstream'"

[2022-04-21T13:23:56.118Z] TASK [. : Install common base dependencies] ************************************
[2022-04-21T13:24:05.248Z] failed: [ec2-3-82-203-154.compute-1.amazonaws.com] (item=cloud-init) => {"ansible_loop_var": "item", "changed": false, "item": "cloud-init", "msg": "Failed to download metadata for repo 'appstream': Cannot prepare internal mirrorlist: No URLs in mirrorlist", "rc": 1, "results": []}

Installing with ansible results in an invalid certificate

Problem

When using the ansible-playbook to install ece, results in an invalid (revoked) certificate, that modern browsers block, preventing you from using the UI

Error is -
NET::ERR_CERT_REVOKED
(thus, the browser doesn't allow you to 'skip' and temporarily trust the certificate)

image

Tested on

2.4.0 / 2.4.3

To recreate

Install ece on a remote host, by running the ansible-playbook from your local machine (might be a timezone issue? just a thought)

Skip Tags setup_filesystem no longer works

We have been using --skip-tags setup_filesystem when running this playbook, but after pulling in a more recent version, it seems the play is trying to create a volume for xvdb. Before, it would skip this section due to the tag being set to skip, but now it seems to try and create the volume anyway.

It appears that in #55 the included tasks file general/setup_xfs.yml was moved to a new place and the tag setup_filesystem was not moved with it.
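The fix would be to re-attach the tag at the new include site, matching the tagging pattern the role already uses for its other includes:

```yaml
# Sketch: restore the tag so --skip-tags setup_filesystem works again.
- include_tasks: general/setup_xfs.yml
  tags: [setup_filesystem]
```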

Update README regarding upgrade of ECE

Hi,
i tried an upgrade of a newly installed test environment and noticed that only providing '--skip-tags base' has a severe drawback:
It still triggers the tasks in 'direct-install', which result in a reboot of the node. In my case the Docker containers weren't up and running fast enough afterwards, and the upgrade terminated with an error along the lines of

Container frc-directors-director was not found -- does the current host have a role 'director'?

I haven't yet tried but i think using '--tags bootstrap' instead should be the correct way to do this.
Regards,

Steffen Elste

p.s.: Using '--tags bootstrap' did indeed work without any problems.

Playbook fails on Ubuntu when docker not installed

In my efforts to reproduce a different issue, I identified a problem with the playbook installing Docker. It appears we assert that Docker is installed before the playbook has had a chance to install the desired version.

I believe the issue is that we attempt to validate the Docker version BEFORE we attempt to install Docker.
https://github.com/elastic/ansible-elastic-cloud-enterprise/blob/master/tasks/base/main.yml#L8-L11

However, ansible throws an error before we can do anything about it.
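One possible shape for the fix, sketched here with illustrative task names (docker_version_map is the variable the role's existing assert already references):

```yaml
# Sketch: probe for Docker without failing, and only assert the version
# when Docker is actually present; otherwise fall through to the
# install tasks.
- name: Check whether Docker is already installed
  command: docker --version
  register: docker_check
  failed_when: false
  changed_when: false

- name: Assert docker version is supported
  assert:
    that: "docker_version in docker_version_map.keys()"
    msg: "Docker version must be one of {{ docker_version_map.keys() }}"
  when: docker_check.rc == 0
```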

Ansible install when a proxy is being used

When installing ECE inside a more sandboxed environment we needed to add in some proxy settings.

I took a copy of the settings that were added, but I did not test this to know exactly which settings were definitely needed and which are duplicates etc.

---
- hosts: all
  vars:
    http_proxy: http://<proxy_ip>:8080
    https_proxy: https://<proxy_ip>:8080
    proxy_host: <proxy_ip>
    proxy_port: 8080
    proxy_env:
      http_proxy: "{{ http_proxy }}"
      https_proxy: "{{ https_proxy }}"
    no_proxy: <primary_hostname>

When I get a chance I will test them, but putting this here as a placeholder and in case someone already knows the answer.
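For what it's worth, the proxy_env dictionary above can be applied with Ansible's play-level environment keyword, so every task in the role inherits it. A sketch, assuming the vars block from this issue:

```yaml
# Sketch: export the proxy settings to all tasks in the play.
- hosts: all
  environment: "{{ proxy_env }}"
  roles:
    - ansible-elastic-cloud-enterprise
```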

Add support for Ubuntu 18.04 LTS

Hi

We are looking to deploy this, but we are currently standardising on Ubuntu 18.04, any chance of getting this supported in the ansible scripts ?

Thanks

Add support for ece-support-diagnostics

Description

It would be pretty useful to have ece-support-diagnostics (https://github.com/elastic/ece-support-diagnostics) available in the default installation as well.

Solution

Include the support diagnostics script at preparation time in the same way as the ECE installation/management script.

- name: Download ece support diagnostics
  get_url:
    url: "{{ ece_supportdiagnostics_url }}"
    dest: /home/elastic/ece-support-diagnostics.sh
    mode: 0755

Always assumes that network interface is called eth0

The playbook always assumes that the network interface is eth0.

in main.yml

- name: ensure dhcp dns is set
  lineinfile:
    path: /etc/sysconfig/network-scripts/ifcfg-eth0
    line: "{{ item }}"
  with_items:
    - 'PeerDNS=yes'
    - 'NM_CONTROLLED=yes'

My suggestion is to allow this to be passed via variable.
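A hedged sketch of that suggestion (network_interface is a hypothetical variable name; ansible_default_ipv4.interface is a standard gathered fact used as the fallback):

```yaml
# Sketch: default to the discovered primary interface, allow override.
- name: ensure dhcp dns is set
  lineinfile:
    path: "/etc/sysconfig/network-scripts/ifcfg-{{ network_interface | default(ansible_default_ipv4.interface) }}"
    line: "{{ item }}"
  with_items:
    - 'PeerDNS=yes'
    - 'NM_CONTROLLED=yes'
```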

Persist ip_conntrack after reboot

ip_conntrack is currently loaded via ansible modprobe which does not persist it anywhere.

We need to create a /etc/modules-load.d/ file in a seperate task (e.g. ip_conntrack.conf)
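A minimal sketch of that task (file name as suggested above):

```yaml
# Sketch: persist the module load across reboots, alongside the
# existing modprobe task.
- name: Persist ip_conntrack via modules-load.d
  copy:
    dest: /etc/modules-load.d/ip_conntrack.conf
    content: "ip_conntrack\n"
    mode: 0644
```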

Issue when "default user" has no authorized_keys set

Hi

This an issue in the "Copy keys from default user to elastic user" task from the "tasks/system/general/make_user.yml" file.

This issue occurs when the "standard" user (user running ansible) does not have any authorized_keys.

Scenario is

  1. We clone the machines from a standardised template, which has no ssh keys defined
  2. We run the Elastic ansible role
  3. We then run an ansible job to provision all the keys etc and delete the "standard" account, and a number of other internal base "build" configs

Would it not be better to allow the "authorized_keys" content to be defined in the ansible role ? Or at least not fail if no keys are defined ?

At the moment I am working around this with a separate task, that just essentially populates the "standard" users authorized_keys file prior to the make_user.yml running

Thanks

ansible playbook fails to run base tasks for ubuntu 18.04 VMs

This is the error that I am getting from running ansible-playbook -i inventory.yml site.yml. Ubuntu image - ubuntu-1804-bionic-v20210119a on GCP

TASK [ansible-elastic-cloud-enterprise : sysctl_scripts.yml || load ip_conntrack if needed] ******************************************
fatal: [35.209.97.34]: FAILED! => {"changed": false, "msg": "modprobe: FATAL: Module ip_conntrack not found in directory /lib/modules/5.4.0-1034-gcp\n", "name": "ip_conntrack", "params": "", "rc": 1, "state": "present", "stderr": "modprobe: FATAL: Module ip_conntrack not found in directory /lib/modules/5.4.0-1034-gcp\n", "stderr_lines": ["modprobe: FATAL: Module ip_conntrack not found in directory /lib/modules/5.4.0-1034-gcp"], "stdout": "", "stdout_lines": []}
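The likely cause is that newer kernels (such as the 5.4 GCP kernel above) no longer ship a module named ip_conntrack; the connection-tracking module is nf_conntrack. A hedged sketch of a fallback, assuming the role's existing modprobe task:

```yaml
# Sketch: try the legacy name, fall back to nf_conntrack on newer kernels.
- name: load ip_conntrack if available
  modprobe:
    name: ip_conntrack
    state: present
  register: conntrack_mod
  ignore_errors: yes

- name: fall back to nf_conntrack on newer kernels
  modprobe:
    name: nf_conntrack
    state: present
  when: conntrack_mod is failed
```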

gap between software requirements and documentation for ECE

Let me describe with a table because that should illustrate better what I mean.
Basically, the Ansible files sometimes pin a different concrete software version than the actual documentation specifies.

OS | Documentation | Ansible file | Docker/CLI (Ansible vs docs) | containerd (Ansible vs docs) | Storage driver (Ansible vs docs)
CentOS 7 | https://www.elastic.co/guide/en/cloud-enterprise/current/ece-configure-hosts-rhel-centos-cloud.html | https://github.com/elastic/ansible-elastic-cloud-enterprise/blob/master/vars/os_CentOS_7.yml | 20.10.8 vs 20.10.* (but later than 20.10.7) | 1.4.3 vs 1.4.* | overlay2 vs overlay2
CentOS 8 | https://www.elastic.co/guide/en/cloud-enterprise/current/ece-configure-hosts-rhel-centos-cloud.html | https://github.com/elastic/ansible-elastic-cloud-enterprise/blob/master/vars/os_CentOS_8.yml | 20.10.8 vs 20.10.* (but later than 20.10.7) | 1.4.3 vs 1.4.* | overlay2 vs overlay2
RHEL 7 | https://www.elastic.co/guide/en/cloud-enterprise/current/ece-configure-hosts-rhel-centos-cloud.html | https://github.com/elastic/ansible-elastic-cloud-enterprise/blob/master/vars/os_RedHat_7.yml | 20.10.8 vs 20.10.* (but later than 20.10.7) | 1.4.3 vs 1.4.* | overlay2 vs overlay2
RHEL 8 | https://www.elastic.co/guide/en/cloud-enterprise/current/ece-configure-hosts-rhel-centos-cloud.html | https://github.com/elastic/ansible-elastic-cloud-enterprise/blob/master/vars/os_RedHat_8.yml | 20.10.8 vs 20.10.* (but later than 20.10.7) | 1.4.3 vs 1.4.* | overlay2 vs overlay2
Ubuntu 16 | https://www.elastic.co/guide/en/cloud-enterprise/current/ece-configure-hosts-ubuntu-cloud.html | https://github.com/elastic/ansible-elastic-cloud-enterprise/blob/master/vars/os_Ubuntu_16.yml | 19.03.15* vs 19.03.* | 1.4.3-1* vs 1.2.* | overlay2 vs overlay2
Ubuntu 18 | https://www.elastic.co/guide/en/cloud-enterprise/current/ece-configure-hosts-ubuntu-cloud.html | https://github.com/elastic/ansible-elastic-cloud-enterprise/blob/master/vars/os_Ubuntu_18.yml | 19.03.15* vs 19.03.* | 1.4.3-1* vs 1.2.* | overlay2 vs overlay2
SLES 12 | https://www.elastic.co/guide/en/cloud-enterprise/current/ece-configure-hosts-sles12-cloud.html | https://github.com/elastic/ansible-elastic-cloud-enterprise/blob/master/vars/os_SLES_12.yml | docker-19.03.14_ce vs docker-19.03.14_ce | - | overlay vs overlay2

So to summarize, it seems that no playbook completely matches the requirements in the documentation at the moment.

Update CentOS/RHEL containerd version to 1.5 to avoid docker "hang" issues

TL;DR: We should update the containerd version for RHEL/CentOS (i.e. here) to 1.5.* to fix the runc issue described below, and also to match our own install docs, which use sudo yum install -y docker-ce-20.10* docker-ce-cli-20.10* containerd.io-1.5.* today (ECE 3.4).

This is a follow up to #125 and #121.

We worked with a user yesterday who had issues with ECE (or rather Docker), where newly created containers would be stuck in the "Created" state, similar to the issue above.

In #125 we pinned the containerd version to 1.4.3 for RHEL/CentOS and back then this was containerd.io-1.4.3-3.1.el8.x86_64 and using runc 1.0.0-rc92.
Today if one installs 1.4.3 it translates to containerd.io-1.4.3-3.2.el8.x86_64 and that ships with runc 1.0.0-rc93 and is affected by opencontainers/runc#2871 .

Various Issues With Current Ansible Playbook for ECE

I am pasting in a series of comments provided by a user, which they encountered when "testing the latest version of the ansible playbook/install script". The user requested I post these issues because they do not have access to GitHub:

Install process:

  • Ubuntu 18.04/docker 19.03: docker_version 18.09 => assertion failure when running the playbook on Ubuntu 18.04 that requires docker 19.03.

  • Ubuntu 18.04/docker 19.03: docker19.03.conf is present in the "template" folder but there is no task to copy it on the remote system. I only see tasks related to docker 18.09 or docker 1.13:

~/ansible-elastic-cloud-enterprise/templates$ ls
docker1.13.conf  docker18.09.conf  docker19.03.conf  elastic.cfg.j2  format-drives.j2
~/ansible-elastic-cloud-enterprise$ vi tasks/base/general/configure_docker.yml

- name: Ensures /etc/systemd/system/docker.service.d dir exists
  file:
    path: /etc/systemd/system/docker.service.d
    state: directory
  when: docker_version == '18.09'

- name: Create service.d docker.conf
  template:
    src: docker{{ docker_version }}.conf
    dest: /etc/systemd/system/docker.service.d/docker.conf
  when: docker_version == '18.09'

- name: set docker storage options
  lineinfile:
    path: /etc/sysconfig/docker
    regexp: "^OPTIONS='(.*)'"
    line: "OPTIONS='-g {{ data_dir }}/docker \\1'"
    backrefs: yes
    create: yes
  when: docker_version == '1.13'

- name: set docker network options
  lineinfile:
    path: /etc/sysconfig/docker-network
    regexp: '^DOCKER_NETWORK_OPTIONS='
    line: 'DOCKER_NETWORK_OPTIONS="--bip={{ docker_bridge_ip }}"'
    create: yes
  when: docker_version == '1.13'

- name: set docker storage driver
  lineinfile:
    path: /etc/sysconfig/docker-storage-setup
    regexp: '^DOCKER_NETWORK_OPTIONS='
    line: 'STORAGE_DRIVER={{ docker_storage_driver }}'
    create: yes
  when: docker_version == '1.13'

Upgrade process:

  1. The install script elastic-cloud-enterprise.sh fails when it tries to retrieve the HOST_STORAGE_PATH via Ansible.
    Removing the -it parameters from the docker exec command seems to fix the issue.
    It also works properly without -it when running the docker exec command in an SSH session.

Problematic lines of code (removed the /dev/null to be able to see the root cause of this error).

  SOURCE_CONTAINER_NAME="frc-runners-runner"
  HOST_STORAGE_PATH=$(docker -H "unix://${HOST_DOCKER_HOST}" exec -it $SOURCE_CONTAINER_NAME bash -c 'echo -n $HOST_STORAGE_PATH' | cut -d: -f 2)
  if [[ -z "${HOST_STORAGE_PATH}" ]]; then
      echo -e "${RED}Container $SOURCE_CONTAINER_NAME was not found -- is the environment running?${NC}"
      exit $GENERAL_ERROR_EXIT_CODE
  fi
  SOURCE_CONTAINER_NAME="frc-directors-director"
  ZK_ROOT_PASSWORD=$(docker -H "unix://${HOST_DOCKER_HOST}" exec -it $SOURCE_CONTAINER_NAME bash -c 'echo -n $FOUND_ZK_READWRITE' | cut -d: -f 2)
  if [[ -z "${ZK_ROOT_PASSWORD}" ]]; then
      echo -e "${RED}Container $SOURCE_CONTAINER_NAME was not found -- does the current host have a role 'director'?${NC}"
      exit $GENERAL_ERROR_EXIT_CODE
  fi

Error

"the input device is not a TTY"
 
TASK [elastic-cloud-enterprise : include_tasks] 

included: /home/user/ansible/roles/elastic-cloud-enterprise/tasks/ece-bootstrap/upgrade.yml for <REDACTED>
TASK [elastic-cloud-enterprise : Execute upgrade] 

fatal: [<REDACTED>]: FAILED! => {"changed": true, "cmd": "/home/elastic/elastic-cloud-enterprise.sh upgrade --cloud-enterprise-version 2.6.2 --docker-registry docker.elastic.co --ece-docker-repository cloud-enterprise", "delta": "0:00:00.434120", "end": "2020-10-12 10:29:11.156153", "msg": "non-zero return code", "rc": 1, "start": "2020-10-12 10:29:10.722033", "stderr": "+ SOURCE_CONTAINER_NAME=frc-runners-runner\n++ docker -H unix:///var/run/docker.sock exec -it frc-runners-runner bash -c 'echo -n $HOST_STORAGE_PATH'\n++ cut -d: -f 2\nthe input device is not a TTY\n+ HOST_STORAGE_PATH=\n+ [[ -z '' ]]\n+ echo -e '\\033[0;31mContainer frc-runners-runner was not found -- is the environment running?\\033[0m'\n+ exit 1", "stderr_lines": ["+ SOURCE_CONTAINER_NAME=frc-runners-runner", "++ docker -H unix:///var/run/docker.sock exec -it frc-runners-runner bash -c 'echo -n $HOST_STORAGE_PATH'", "++ cut -d: -f 2", "the input device is not a TTY", "+ HOST_STORAGE_PATH=", "+ [[ -z '' ]]", "+ echo -e '\\033[0;31mContainer frc-runners-runner was not found -- is the environment running?\\033[0m'", "+ exit 1"], "stdout": "\u001b[0;31mContainer frc-runners-runner was not found -- is the environment running?\u001b[0m", "stdout_lines": ["\u001b[0;31mContainer frc-runners-runner was not found -- is the environment running?\u001b[0m"]}

The task below in ~/elastic-cloud-enterprise/tasks/ece-bootstrap/main.yml fails (permission issue) when I run the playbook with my own admin user. This command has to run as root or elastic user. So a sudo instruction has to be added to that task.

- name: Check if an installation or upgrade should be performed
  shell: docker ps -a -f name=frc-runners-runner --format {%raw%}"{{.Image}}"{%endraw%}
  register: existing_runner
  tags: [dbg]
  become: yes
  become_method: sudo
  become_user: elastic

Error:

TASK [elastic-cloud-enterprise : Check if an installation or upgrade should be performed] 

fatal: [<REDACTED>]: FAILED! => {"changed": true, "cmd": "docker ps -a -f name=frc-runners-runner --format \"{{.Image}}\"", "delta": "0:00:00.419693", "end": "2020-10-12 10:52:42.118514", "msg": "non-zero return code", "rc": 1, "start": "2020-10-12 10:52:41.698821", "stderr": "Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/json?all=1&filters=%7B%22name%22%3A%7B%22frc-runners-runner%22%3Atrue%7D%7D: dial unix /var/run/docker.sock: connect: permission denied", "stderr_lines": ["Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/json?all=1&filters=%7B%22name%22%3A%7B%22frc-runners-runner%22%3Atrue%7D%7D: dial unix /var/run/docker.sock: connect: permission denied"], "stdout": "", "stdout_lines": []}

Documentation update (README.md)

https://github.com/elastic/ansible-elastic-cloud-enterprise#performing-an-upgrade

The upgrade section of the documentation indicates to use the following command:

ansible-playbook -i inventory.yml site.yml --skip-tags base

By just skipping the base tag, the playbook still performs some destructive (volume/FS creation) or unwanted (system reboot) tasks from direct-install.

To perform an upgrade of ece only, I had to use the following command:

ansible-playbook -i inventory.yml site.yml --tags bootstrap

Add method to separate tasks in system to allow image builds

I would like to use an image building tool like Packer with your Ansible role. However, there are a few issues with the role today that make it difficult to do so. Specifically the reboot in system/main.yml causes timeout issues and certain tasks in this file cannot be done at image build time. A possible solution would be to add two more tags for system (i.e reboot and setup) so that these tasks can be included during image builds but excluded during deployments.

---
- name: Include OS specific vars
  include_vars: "{{ item }}"
  with_first_found:
  - os_{{ ansible_distribution }}_{{ ansible_distribution_major_version }}.yml
  - unsupported.yml

- name: Check that OS is supported
  fail:
    msg: "ERROR: OS {{ ansible_distribution }} {{ ansible_distribution_major_version}} is not supported!"
  when: unsupported_version is defined and unsupported_version

- name: Assert docker version is supported
  assert:
    that: "docker_version in docker_version_map.keys()"
    msg: "Docker version must be one of {{ docker_version_map.keys() }}"

- name: execute os specific tasks
  include_tasks: "{{ ansible_distribution }}-{{ ansible_distribution_major_version}}/main.yml"
  tags: [setup]

- include_tasks: general/make_user.yml
  tags: [setup]
- include_tasks: general/set_limits.yml
  tags: [setup]
- include_tasks: general/setup_xfs.yml
  tags: [setup_filesystem, destructive]
  when: ansible_lvm['vgs']['lxc'] is not defined or force_xfc == true
- include_tasks: general/update_grub_docker.yml
  tags: [setup_filesystem, destructive]
- include_tasks: general/configure_docker.yml
  tags: [install_docker, destructive]
- include_tasks: general/sysctl_scripts.yml
  tags: [setup]
- include_tasks: general/kernel_modules.yml
  tags: [setup]

- name: Reboot the machine with all defaults
  shell: sleep 2 && shutdown -r now "Reboot for changes to take effect"
  async: 1
  poll: 0
  ignore_errors: true
  tags: [reboot]

- name: Wait for the reboot to complete
  wait_for_connection:
    connect_timeout: 20
    sleep: 5
    delay: 5
    timeout: 600
  tags: [reboot]

- include_tasks: general/setup_mount_permissions.yml
  tags: [setup_filesystem]

ECE 2.13+ and 3.x does not bootstrap on SLES

Starting 2.13 and above (including 3.0 and above), ECE does not bootstrap on SLES 12 and 15, with docker 19 or 20:

Details

bootstrap logs:

- Starting local runner {}
- Started local runner {}
- Waiting for runner container node {}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Errors have caused Elastic Cloud Enterprise installation to fail - Please check logs 
  Node type - initial
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

in docker logs of runner:

ok: run: docker-socket-proxy: (pid 30) 2s
Traceback (most recent call last):
  File "/elastic_cloud_apps/runner/write_config.py", line 10, in <module>
    with open('runner.conf', 'w') as dest:
PermissionError: [Errno 13] Permission denied: 'runner.conf'

What I noticed is that the ece user is present in passwd and group, and elastic does belong to the ece group, so this failure should not happen.

elastic:x:1000:1000::/home/elastic:/bin/false
ece:x:199:199::/home/ece:/bin/bash

ece:x:199:elastic
elastic:x:1000:

Indeed, path to runner.conf :

$ ls -lah /elastic_cloud_apps/runner
total 16K
drwxrwxr-x 1 199     199       65 Apr 28 14:36 .

On Ubuntu, the ece user is correctly set as the owner of /elastic_cloud_apps/runner, but on SLES the listing shows the raw uid 199.
Inside the bootstrapper Docker container, the owner is correctly displayed as ece, not the uid.

Also, the following command does not work:

$ setuser ece whoami
setuser: user ece not found

This does not make sense, as the ece user is clearly defined in /etc/passwd.
Again, it's all fine on Ubuntu, and on SLES from inside the bootstrapper container.

My guess is that Docker has issues mapping uid/gid between the host and the container. Indeed, the user/group ece does not exist on the host, and so elastic does not belong to group ece on the host.

Workaround

On the host, create a user and group named ece, with uid and gid both 199, and add the user elastic to the ece group.
Then run the ECE installer, and it should work!
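The workaround can be sketched as Ansible tasks (task names are illustrative; uid/gid 199 is taken from the passwd entries quoted above):

```yaml
# Sketch of the SLES workaround: mirror the container's ece user/group
# on the host so uid/gid 199 resolves consistently.
- name: Create ece group with gid 199
  group:
    name: ece
    gid: 199

- name: Create ece user with uid 199
  user:
    name: ece
    uid: 199
    group: ece

- name: Add elastic to the ece group
  user:
    name: elastic
    groups: ece
    append: yes
```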

Allocator capacity conditional breaks on Ansible 2.9

Related to #147

This syntax breaks on ansible 2.9. It works on ansible 2.10 (confirmed) and it seems it passed your tests on 2.8.

Minimal example to reproduce behavior :

---
- name: "test"
  hosts: localhost
  connection: local
  tasks:
    - name: Set empty capacity if not defined
      set_fact:
        capacity: "{{ capacity | default('') }}"
    - name: "debug"
      debug:
        msg: /home/elastic/elastic-cloud-enterprise.sh {{ '--capacity ' + capacity if capacity }}

Results on Fedora 35 with Ansible 2.10.15 (python3)

ansible-playbook test.yml 
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that
the implicit localhost does not match 'all'

PLAY [add admin] ***************************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Set empty capacity if not defined] ***************************************
ok: [localhost]

TASK [debug] *****************************************************************
ok: [localhost] => 
  msg: '/home/elastic/elastic-cloud-enterprise.sh '

CentOS 7.9 with ansible 2.9.27 (python 2.7.5)

ansible-playbook  test.yml
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'

PLAY [add admin] *************************************************************************************************************

TASK [Gathering Facts] *******************************************************************************************************
ok: [localhost]

TASK [Set empty capacity if not defined] *************************************************************************************
ok: [localhost]

TASK [debug] ***************************************************************************************************************
fatal: [localhost]: FAILED! =>
  msg: |-
    The task includes an option with an undefined variable. The error was: the inline if-expression on line 1 evaluated to false and no else section was defined.

    The error appears to be in '<my_secret_filepath>/test.yml': line 9, column 7, but may
    be elsewhere in the file depending on the exact syntax problem.

    The offending line appears to be:

            capacity: "{{ capacity | default('') }}"
        - name: "debug"
          ^ here

I would suggest using Jinja2's if/else syntax to avoid any problems with Ansible 2.9, which I believe is the most widely used version at the moment.
I will create a PR once I have checked the correct Jinja2 syntax.
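For reference, the inline if expression needs an else branch to be defined on 2.9; a minimal sketch of the fixed debug task from the example above:

```yaml
# The else branch keeps the expression defined even when capacity is empty.
- name: "debug"
  debug:
    msg: "/home/elastic/elastic-cloud-enterprise.sh {{ ('--capacity ' + capacity) if capacity else '' }}"
```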

Add task(s) for checking prerequisites before installing

With larger users of Elastic Cloud Enterprise (ECE) it is not uncommon that the set-up of a deployment is split up by a number of different teams who have different responsibilities. There are also a number of configuration items that are easy to miss with the branching nature of the documentation between the different Linux distribution requirements as well.

It would be helpful to have a "script" that can check that all of the pre-steps are complete before continuing. This will improve user experience as there are a number of items that require a re-install of the runners if they are missed.

My proposal would be to add two "checking" tasks. It would be helpful if we could run the checking tasks separately, in addition to the larger tasks/main.yml file. Some users prefer to manually install the software, so it would be helpful to have something to run that verifies items are correct before continuing.

  1. Task to check any set-up steps that need to be done before the tasks/system/main.yml is executed have been completed
  2. Within the tasks/bootstrap/main.yml, add another sub-task that would run checks to verify all the prerequisites are complete before doing the install. There is an option to skip the system set-up task but we don't verify that the system configuration was done successfully.

The checks could output with info / warn / error messages depending on if it is a hard requirement or a performance issue (e.g. memory / disk ratio).
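As a rough illustration only (task names, variables, and thresholds here are hypothetical, not part of the role or the ECE documentation), such checks could be built on Ansible's assert and debug modules:

```yaml
# Hypothetical prerequisite checks; min_memory_mb and the thresholds
# are illustrative values, not taken from the ECE requirements.
- name: Verify hard prerequisites before bootstrap
  assert:
    that:
      - ansible_memtotal_mb >= min_memory_mb | default(8192)
      - ansible_kernel is version('3.10', '>=')
    fail_msg: "Host does not meet the ECE minimum requirements"

- name: Warn about soft requirements
  debug:
    msg: "WARNING: low memory; allocator capacity will be limited"
  when: ansible_memtotal_mb < 32768
```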

Add support for RHEL 7.x

This project is so great I'm already submitting a request to add support for RHEL :)
As ECE also supports RHEL 7.x with Red Hat Docker 1.13, it would be great to add that as another OS in the vars folder.

ECE Ansible does not use consistent groups across tasks

ECE 2.7

We've identified an issue where the default users and groups that Ansible provisions do not align with the user & group then used in other tasks.

The task in tasks/base/general/make_user.yml defines the groups "elastic" and "docker"; elastic then ends up as the primary group, from what we can see.

But in the download step under tasks/ece-bootstrap/main.yml, it then tries to set the group ownership of the file to elastic_user_group, which does not exist and was never created.

RHEL 7 ('ip' tool not found) error

Installed a fresh RHEL 7.9 DVD ISO using VirtualBox locally. The script throws the error seen in #51, resolved by (a8df0d5).

The ip tool is installed. I discovered that the main.yml within tasks/base/RedHat-7 does not include the same configurations that allow the server to discover the IP address:

- name: ensure dhcp dns is set
  lineinfile:
    path: /etc/sysconfig/network-scripts/ifcfg-eth0
    line: "{{ item }}"
  with_items:
    - 'PeerDNS=yes'
    - 'NM_CONTROLLED=yes'

- name: set locale
  lineinfile:
    path: /etc/environment
    line: "{{ item }}"
  with_items:
    - 'LANG=en_US.utf8'
    - 'LC_CTYPE=en_US.utf8'

- name: set path
  lineinfile:
    path: /etc/profile.d/path.sh
    line: "export PATH=$PATH:/usr/sbin"
    create: yes

Also, by default, VirtualBox renames eth0 to enp0s3, and the ansible-elastic role does not appear to check the name of the interface. I followed this guide to rename enp0s3 back to eth0, and after updating the main.yml above, the installation completes successfully:

https://www.linuxtopic.com/2017/02/how-to-change-default-interface-name.html

Config Issue with diagnostics/main.yml

The version ece-support-diagnostics-1.1 is hard-coded in the script, so it will never run with later versions.
The latest version is 2.0.2, and the unpacked tar file is ece-support-diagnostics-v2.0.2/diagnostics.sh, not ece-support-diagnostics-1.1/diagnostics.sh.

- name: Run ece support diagnostics
  script: /tmp/elastic/ece-support-diagnostics-1.1/diagnostics.sh -s -d
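One way to avoid the hard-coded path would be a version variable; diagnostics_version below is a hypothetical name, not an existing role variable:

```yaml
# Sketch: parameterize the diagnostics version (note the "v" prefix in
# the unpacked directory name for 2.x releases).
- name: Run ece support diagnostics
  script: "/tmp/elastic/ece-support-diagnostics-v{{ diagnostics_version | default('2.0.2') }}/diagnostics.sh -s -d"
```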

Enhancement: reduce permissions on install script elastic-cloud-enterprise.sh

At the moment the installation script in the elastic home directory is owned by root, with execute permission for all users:

$ ls -l /home/elastic/
-rwxr-xr-x 1 root root 54962 Feb  7 22:55 elastic-cloud-enterprise.sh

This is not causing any issue, especially as the home folder of user elastic is not accessible to users without sudo, but for the sake of it, it might be good to set the owner and group to elastic and possibly use something more restrictive than mode: 0755 in ece-bootstrap/main.yml.
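A sketch of what tightening this could look like with the file module (mode 0750 is a suggestion, not the role's current behavior):

```yaml
# Restrict the installer so only the elastic user (and its group) can run it.
- name: Restrict permissions on the ECE installer script
  file:
    path: /home/elastic/elastic-cloud-enterprise.sh
    owner: elastic
    group: elastic
    mode: '0750'
```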

RHEL 8 fs.may_detach_mounts error

RHEL 8.3 kernel 4.18.0-240.22.1.el8_3.x86_64

The Ansible role adds fs.may_detach_mounts=1 to /etc/sysctl.conf.

The following error is seen when executing sysctl -p:

sysctl: cannot stat /proc/sys/fs/may_detach_mounts: No such file or directory

Various Google searches indicate that the version of runc shipped with the installed version of containerd may be related.
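A possible guard, sketched here rather than taken from the role, is to set the key only when the running kernel actually exposes it:

```yaml
# Only apply fs.may_detach_mounts on kernels that still have the knob.
- name: Check whether fs.may_detach_mounts exists on this kernel
  stat:
    path: /proc/sys/fs/may_detach_mounts
  register: may_detach_mounts

- name: Set fs.may_detach_mounts only when supported
  sysctl:
    name: fs.may_detach_mounts
    value: '1'
    sysctl_set: yes
  when: may_detach_mounts.stat.exists
```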

Provide an option to skip the authorized_keys

There are legitimate use cases where a user might not want to rely on authorized_keys files.
During #15 a change was implemented that allows looking at two different locations for a file, as well as providing a custom path. However, the Ansible task fails if neither of those values is available.

Would it be possible to make those parameters optional and not fail if they are unavailable, as requested in the above issue? Or at least not fail if no keys are defined?
The current workaround is to provide mock key files, which is not ideal.
Thank you.
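One possible shape for an optional key task (ssh_public_key_file is an illustrative variable name, not the role's actual one):

```yaml
# Sketch: skip the task entirely when no key file was provided.
- name: Install authorized key only when a key file is defined
  authorized_key:
    user: elastic
    key: "{{ lookup('file', ssh_public_key_file) }}"
  when: ssh_public_key_file is defined and ssh_public_key_file | length > 0
```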

Not setting allocator memory

This is more of a question: why are you not setting allocator memory here?

memory_settings: ' {"runner":{"xms":"{{memory.runner}}","xmx":"{{memory.runner}}"},"proxy":{"xms":"{{memory.proxy}}","xmx":"{{memory.proxy}}"},"zookeeper":{"xms":"{{memory.zookeeper}}","xmx":"{{memory.zookeeper}}"},"director":{"xms":"{{memory.director}}","xmx":"{{memory.director}}"},"constructor":{"xms":"{{memory.constructor}}","xmx":"{{memory.constructor}}"},"admin-console":{"xms":"{{memory.adminconsole}}","xmx":"{{memory.adminconsole}}"}}'

You set a variable here:

Error Message at task "Creating physical volume '/dev/sda2' failed"

Hi team,

The ansible script is failing at the following task with the message

fatal: [jon-ece-test1-1]: FAILED! => {
    "changed": false,
    "err": "  Can't open /dev/sda2 exclusively.  Mounted filesystem?\n  Can't open /dev/sda2 exclusively.  Mounted filesystem?\n",
    "invocation": {
        "module_args": {
            "force": true,
            "pesize": "4",
            "pv_options": "",
            "pvs": [
                "/dev/sda2"
            ],
            "state": "present",
            "vg": "lxc",
            "vg_options": ""
        }
    },
    "msg": "Creating physical volume '/dev/sda2' failed",
    "rc": 5
}

[root@jon-ece-test1-1 dev]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         15G     0   15G   0% /dev
tmpfs            15G     0   15G   0% /dev/shm
tmpfs            15G  8.5M   15G   1% /run
tmpfs            15G     0   15G   0% /sys/fs/cgroup
/dev/sda2       100G  2.4G   98G   3% /
/dev/sda1       200M   12M  189M   6% /boot/efi
tmpfs           3.0G     0  3.0G   0% /run/user/1019
tmpfs           3.0G     0  3.0G   0% /run/user/0

# Ansible playbook
- hosts: primary
  gather_facts: true
  roles:
    - ansible-elastic-cloud-enterprise
  vars:
    ece_primary: true
    device_name: sda2
    availability_zone: asia-southeast1-c

As discussed with @vaubarth, this is an issue because the Ansible script requires a separate mount path besides the root partition for the installation. I think the error message isn't clear about the steps to take next.

Best Regards,
Jonathan Lim

update.yml doesn't offer to add additional variables for the upgrade script

As of today the upgrade script has a limited, hard-coded set of parameters:
https://github.com/elastic/ansible-elastic-cloud-enterprise/blob/master/tasks/ece-bootstrap/upgrade.yml#L3
It would be nice if the parameters could be customized in order to pass the additional parameters documented here:
https://www.elastic.co/guide/en/cloud-enterprise/current/ece-installation-script-upgrade.html

Possibly a parameter could be introduced that can later be used for customization.
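One possible shape, with ece_upgrade_extra_args as a hypothetical variable name defaulting to empty so existing behavior is unchanged:

```yaml
# Sketch only: ece_upgrade_extra_args is not currently defined by the role.
- name: Run ECE upgrade script with optional extra parameters
  command: >-
    /home/elastic/elastic-cloud-enterprise.sh upgrade
    {{ ece_upgrade_extra_args | default('') }}
```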

Prepare Diagnostics Bundle not working on ECE 2.9.0

ECE Version Info
2.9.0
Bug description:
The Prepare Diagnostics Bundle action immediately returns an Internal Server Error.

Steps to reproduce:
On ECE 2.9.0, go to the admin-console-elasticsearch deployment and try the Prepare Diagnostics Bundle button under Operations.


Related to elastic/sdh-cloud#17745

Support allocator tags

We need to use allocator tags in our setup.

I have created a PR that adds that: #140

It's a very simplistic solution (adding a variable for allocator tags and using the existence of the variable to control the addition of a parameter for the install script) but it works for us.

Pin containerd version to <=1.4.3-1 to avoid containers getting stuck in "Created" state

We've recently seen several occurrences of ECE users that had issues with containers getting stuck in the "Created" state, and traced it down to this upstream issue: opencontainers/runc#2871.

For now we should pin the containerd version to <=1.4.3-1 to avoid the affected runc being used.
As far as I can see, this would affect at least:




Not sure about the others.
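On yum-based distributions the pin could be sketched roughly like this (the exact package release string may differ per distro and repository):

```yaml
# Sketch: install a fixed containerd.io version instead of the latest.
- name: Install a pinned containerd.io to avoid the affected runc
  yum:
    name: "containerd.io-1.4.3-1*"
    state: present
    allow_downgrade: yes
```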

uid 1001 for the elastic user might already be in use

- name: Add user elastic
  user:
    name: elastic
    uid: 1001
    group: elastic
    groups: docker
    append: yes
    state: present
    generate_ssh_key: true
  when: getent_passwd["elastic"] == none

Should uid be auto-generated if the number 1001 is already in use?
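One option, sketched with Ansible's omit placeholder, is to only pin the uid when an override is provided and otherwise let the system pick a free one (elastic_uid is a hypothetical variable):

```yaml
# Sketch: omit() drops the uid parameter so `user` auto-assigns a free uid.
- name: Add user elastic
  user:
    name: elastic
    uid: "{{ elastic_uid | default(omit) }}"
    group: elastic
    groups: docker
    append: yes
    state: present
    generate_ssh_key: true
  when: getent_passwd["elastic"] == none
```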

Error /etc/cloud/cloud.cfg.d/ no file or directory

Files affected:

tasks/base/general/make_user.yml
tasks/base/main.yml

Potential solution: add a task to ensure the /etc/cloud/cloud.cfg.d/ directory is present. I'm not certain which file to add it to, but my best guess is tasks/base/main.yml.

- name: ansible create directory
  file:
    path: /etc/cloud/cloud.cfg.d/
    state: directory

My shorter temporary workaround: change /etc/cloud/cloud.cfg.d/ to /tmp/ in both the files mentioned above.
