scylladb / scylla-ansible-roles

Ansible roles for deploying and managing Scylla, Scylla-Manager and Scylla-Monitoring

Python 86.48% Shell 4.76% Jinja 8.75%
scylla-ansible-roles scylla-monitoring scylla-cluster scylla playbooks ansible ansible-roles scylladb automation hacktoberfest

scylla-ansible-roles's Introduction

Scylla Ansible Roles

This repo contains the Ansible roles and example playbooks used for deploying and maintaining Scylla clusters. Each role produces outputs that can be consumed by the other roles; running all three in tandem is recommended, but not required.

For detailed documentation of each role and some of the example playbooks, please see the Wiki: https://github.com/scylladb/scylla-ansible-roles/wiki

Discussion on Slack: https://scylladb-users.slack.com/archives/C01KV03RTEV

Roles:

ansible-scylla-node role

This role will deploy Scylla on the provided set of hosts. Please see the role's README and defaults/main.yml for variable settings. The inventory also has to be configured properly for this role: specifically, the [scylla] group members must have dc and rack properties if using the GossipingPropertyFileSnitch (GPFS), and if using one of the public-cloud snitches they need to have dc_suffix set the same way.
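For illustration, a minimal YAML inventory sketch along these lines (host names, addresses, and the exact variable placement are assumptions; check the role's README for the authoritative format):

scylla:
  hosts:
    scylla-node-1:
      ansible_host: 10.0.0.11
      dc: dc1        # datacenter used by the GPFS snitch
      rack: rack1    # rack used by the GPFS snitch
    scylla-node-2:
      ansible_host: 10.0.0.12
      dc: dc1
      rack: rack2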

Manual

ansible-scylla-manager role

This role will deploy Scylla Manager on the given host(s). If ansible-scylla-node was previously run with the scylla_manager_enabled var set to true, a pre-generated auth token will already have been prepared and applied to the nodes. Manager will be installed and connected to the Scylla cluster.

Manual

ansible-scylla-monitoring role

This role will install Scylla Monitoring (a Prometheus/Grafana-based, containerized monitoring stack). If the ansible-scylla-node role was previously run with generate_monitoring_config set to true, a scylla-servers.yaml file will already have been prepared for the stack to use to connect to the existing cluster.

Manual

ansible-scylla-loader role

This role will prepare a host to run a stress load against a Scylla cluster. The following components get installed:

  • Scylla Java driver
  • Scylla Python driver
  • cassandra-stress (in $PATH)
  • tlp-stress (in $PATH)
  • YCSB (in /home/ANSIBLE_USER/ycsb/VERSION)

example-playbooks

Some basic playbooks showing how the roles can be utilized, as well as some playbooks used for standard day-2 operations with Scylla:

  • Rolling restart automation
  • Major upgrade automation

scylla-ansible-roles's People

Contributors

aurelilys, dyasny, ebenzecri, elcomtik, fee-mendes, igorribeiroduarte, mmatczuk, r4fek, rumbles, sharonovd, simonfrey, sitano, stevedrip, tarzanek, v0112358, vladzcloudius


scylla-ansible-roles's Issues

Can't set rules for Scylla Monitoring 3.8.0 (ansible-scylla-monitoring role)

16:56:42 TASK [ansible-scylla-monitoring : set prometheus rules file from preset file] ***
16:56:42 task path: /home/[..HIDDEN...]/roles/ansible-scylla-monitoring/tasks/common.yml:62
16:56:42 skipping: [..HIDDEN...] => {"changed": false, "skip_reason": "Conditional result was False"}
16:56:42 
16:56:42 TASK [ansible-scylla-monitoring : set prometheus rules file from the default file] ***
16:56:42 task path: /home/[..HIDDEN...]/roles/ansible-scylla-monitoring/tasks/common.yml:69
16:56:42 fatal: [..HIDDEN...]: FAILED! => {"changed": false, "msg": "Source /opt/scylla-monitoring/scylla-monitoring-scylla-monitoring-3.8.0//prometheus/prometheus.rules.yml not found"}
16:56:44 
16:56:44 PLAY RECAP *********************************************************************
16:56:44 escylla-1pmonitor-prod-us  : ok=21   changed=15   unreachable=0    failed=1    skipped=1    rescued=0    ignored=0   
16:56:44 
16:56:44 Build step 'Execute shell' marked build as failure
16:56:45 Finished: FAILURE

It seems like:

  • the path changed,
  • the file name is different, or
  • the file doesn't exist.

Artifacts generated by roles should be stored outside of playbook hierarchy

I used the role ansible-scylla-node to set up my cluster and found it quite annoying that it produced artifacts placed in the playbook directory.

The artifacts in question:

        plays/scylla/scylla_deploy/cqlshrc
        plays/scylla/scylla_deploy/monitoring.yml
        plays/scylla/scylla_deploy/scylla_servers.yml
        plays/scylla/scylla_deploy/scyllamgr_auth_token.txt
        plays/scylla/scylla_deploy/ssl/

I suppose users usually keep their playbook in a git repository, so at a minimum these files need to be git-ignored. Another problem arises when the user wants to deploy multiple clusters with the same playbook: the artifacts will be overwritten, which may not be intended.

I think it would be much better if they were placed in a folder in the user's home directory, e.g. ~/.scylla or ~/.ansible_scylla.

A final observation related to these artifacts: it would be great if they could be loaded from vars. That would allow storing them encrypted with ansible-vault or HashiCorp Vault and sharing them between team members collaborating on the same project. If the user wants to keep using the files as currently generated, loading them into the corresponding variables with an appropriate Ansible lookup plugin would still be possible. Default values for these variables could reference the unencrypted artifacts.
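As a rough sketch of the idea, a hypothetical scylla_manager_auth_token variable could be populated from a user-chosen file (possibly vault-encrypted) with a lookup; the variable name and path here are illustrative only:

scylla_manager_auth_token: "{{ lookup('ansible.builtin.file', 'secrets/scyllamgr_auth_token.txt') }}"  # hypothetical variable and path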

aws i3 support

Since the AWS i3 instances already have an optimized AMI, do these Ansible roles support deployment and rolling restart on i3 instances?

With the AWS i3 AMI, we are seeing issues when we stop and start a node due to "Failed to mount RAID volume". Does this Ansible script support a fix for this?

more scylla.yaml vars

Can the whole scylla.yaml be configured? Would it make sense to add other options such as read/write timeouts or warn/fail thresholds for batch sizes?
I remember the cloud folks also add custom fields, so a variable that simply inserts whatever you give it as "key: value" pairs into scylla.yaml would be great
(e.g. enabling the mc format option in 2019.1); see the sketch after the note below.

  • I know the cloud folks have such custom settings for some customer clusters and change them manually, so the role should at least be aware of them and keep track of them.
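A sketch of what such a pass-through could look like, assuming the role's scylla_yaml_params variable (see defaults/main.yml) accepts a mapping of raw scylla.yaml options; the option names and values below are only examples:

scylla_yaml_params:
  read_request_timeout_in_ms: 10000
  write_request_timeout_in_ms: 5000
  batch_size_warn_threshold_in_kb: 64
  batch_size_fail_threshold_in_kb: 1024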

rolling_restart: restart seeders first

HEAD: 6cbd4ce

Description
We need to restart the seed nodes first in order to make sure they are healthy before the non-seed nodes restart.
Otherwise, if a seed is not visible to a non-seed node due to the very issue that prompted the restart, the non-seed node will refuse to boot.

perftune setup in scylla-node role

We need to run perftune.py, choosing the mode by core count:

Up to 4 cores: mode=mq
5-8 cores: mode=sq
9+ cores: mode=sq_split

For reloc:
sudo PATH=$PATH:/opt/scylladb/bin /opt/scylladb/scripts/perftune.py --tune net --get-cpu-mask --mode sq_split | /opt/scylladb/scripts/hex2list.py
For non-reloc: ...
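A hedged sketch of deriving the mode from the core count inside the role; the fact and variable names are illustrative:

- name: pick the perftune mode from the CPU count
  set_fact:
    perftune_mode: "{{ 'mq' if ansible_processor_vcpus <= 4 else ('sq' if ansible_processor_vcpus <= 8 else 'sq_split') }}"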

cleanup should skip non existent tables

00:01:41.070         "cleanup of stressexample rawtest",
00:01:41.070         "Using /etc/scylla/scylla.yaml as the config file",
00:01:41.070         "nodetool: Keyspace [stressexample] does not exist.",
00:01:41.070         "See 'nodetool help' or 'nodetool help <command>'."
00:01:41.070     ],
00:01:41.070     "warning": "The job has failed and was not cleaned up properly. Please, delete job alias manually to restart."

The cleanup.sh script can encounter empty directories left over from removed keyspaces/tables; it should skip them rather than fail completely.

Yum commands often time out if you're using a smaller server

During my PoC I was running the Scylla playbook on smaller nodes, as I wanted to test the deployment process more than Scylla itself. I often hit issues where playbook steps failed because the yum lock was held and Ansible timed out. This timeout can be increased fairly easily; PR to follow.
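One possible approach, sketched here with the yum module's lock_timeout parameter (the package variable and timeout value are illustrative):

- name: install Scylla packages
  yum:
    name: "{{ scylla_packages }}"   # placeholder variable for this sketch
    state: present
    lock_timeout: 300               # wait up to 5 minutes for the yum lock instead of failing fast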

If the user provides their own CA key as described in the docs, the key is ignored and a self-signed keypair is generated

Further to #11, the wiki describes being able to provide your own CA key for use in generating SSL certificates. Currently, even if you provide a ca.pem in the folder described in the wiki, the key is not used and a self-signed TLS keypair is generated regardless.

This playbook should look for the existence of a ca.pem as well as a ca.crt and use those to generate the CSR and host keypairs. Otherwise, the mention of this process should be removed from the wiki, as it is misleading.
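A minimal sketch of the missing check, assuming the path convention from the wiki; the actual guard would go on the CA-generation tasks in ssl.yml:

- name: check for a user-provided CA key
  stat:
    path: "ssl/ca/{{ scylla_cluster_name }}-ca.pem"
  delegate_to: localhost
  run_once: true
  register: user_ca_key

# the self-signed CA generation tasks would then be skipped with:
#   when: not user_ca_key.stat.exists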

Ansible nodes role fails to generate SSL certificates

Hi, as requested by Guy Carmin, I'm raising this as an issue. I'm doing a PoC of Scylla Enterprise and hitting an issue where I cannot deploy from the node playbook onto instances running Amazon Linux 2.

In my setup I use Terraform to pull in dependencies; for example, I check out this repo from GitHub, as well as the one recommended for setting up swap, and put them in the working directory. I also grab a CA certificate from a local Vault and put it in an ssl/ca folder in the working directory.

I run the following command (via Terraform):

ansible-playbook -i inventory.ini nodes.yaml -e '@node-variables.yaml'

In my working directory I have nodes.yaml:

---
- name: Scylla node
  hosts: scylla
  become: true
  vars:

    #variables for the swap role
    swap_file_size_mb: '1024'

    scylla_dependencies:
      - curl
      - wget
    scylla_io_probe: True
    enable_mc_format: true
    scylla_api_address: '127.0.0.1'
    scylla_api_port: '10000'
    generate_monitoring_config: True

  roles:
    - ansible-role-swap
    - ansible-scylla-node

Then my node-variables.yaml (redacted my repo UUID):

# URL of an RPM .repo file or a DEB .list file
scylla_repos:
  - http://repositories.scylladb.com/scylla/repo/UUID/centos/scylladb-2020.1.repo

# Options are oss|enterprise
scylla_edition: enterprise
scylla_version: latest

scylla_cluster_name: "devel-scylla-ee"

scylla_snitch: Ec2Snitch

# Manager Agent Backup configuration - this is a free text which will be inserted into the configuration file
# Please ensure it is a valid and working configuration. Please refer to the example configuration file and the
# documentation for detailed instructions
scylla_manager_agent_config: |
  s3:
    access_key_id: 12345678
    secret_access_key: QWerty123456789
    provider: AWS
    region: us-east-1c
    endpoint: https://foo.bar.baz
    server_side_encryption:
    sse_kms_key_id:
    upload_concurrency: 2
    chunk_size: 50M
    use_accelerate_endpoint: false

# Configure raid disks via scylla_setup. This requires a list of disks to add to the raid
scylla_raid_setup:
  - /dev/nvme1n1

scylla_seeds:
  - "10.72.7.6"

scylla_manager_enabled: true
scylla_manager_repo_url: "http://downloads.scylladb.com/rpm/centos/scylladb-manager-2.2.repo"

I appreciate that I haven't set my scylla_manager_agent_config variables yet, but I don't think that is causing my issue...

My inventory.ini:

In my working directory I have the following tree:

.
├── ansible-role-swap
│   ├── LICENSE
│   ├── README.md
│   ├── defaults
│   │   └── main.yml
│   ├── meta
│   │   └── main.yml
│   ├── molecule
│   │   └── default
│   │       ├── converge.yml
│   │       └── molecule.yml
│   └── tasks
│       ├── disable.yml
│       ├── enable.yml
│       └── main.yml
├── ansible-scylla-loader
│   ├── README.md
│   ├── defaults
│   │   └── main.yml
│   └── tasks
│       ├── Debian.yml
│       ├── RedHat.yml
│       └── main.yml
├── ansible-scylla-manager
│   ├── README.md
│   ├── defaults
│   │   └── main.yml
│   ├── files
│   │   └── perftune.py
│   ├── handlers
│   │   └── main.yml
│   ├── meta
│   │   └── main.yml
│   ├── tasks
│   │   ├── Debian.yml
│   │   ├── RedHat.yml
│   │   ├── add-clusters.yml
│   │   └── main.yml
│   ├── tests
│   │   ├── inventory
│   │   └── test.yml
│   └── vars
│       └── main.yml
├── ansible-scylla-monitoring
│   ├── README.md
│   ├── defaults
│   │   └── main.yml
│   ├── handlers
│   │   └── main.yml
│   ├── meta
│   │   └── main.yml
│   ├── tasks
│   │   ├── Debian.yml
│   │   ├── RedHat.yml
│   │   ├── common.yml
│   │   ├── docker.yml
│   │   ├── main.yml
│   │   └── non-docker.yml
│   ├── tests
│   │   ├── inventory
│   │   └── test.yml
│   └── vars
│       └── main.yml
├── ansible-scylla-node
│   ├── README.md
│   ├── defaults
│   │   └── main.yml
│   ├── files
│   │   └── minigenconfig.py
│   ├── handlers
│   │   └── main.yml
│   ├── meta
│   │   └── main.yml
│   ├── tasks
│   │   ├── Debian.yml
│   │   ├── RedHat.yml
│   │   ├── common.yml
│   │   ├── main.yml
│   │   ├── manager_agents.yml
│   │   ├── monitoring_config.yml
│   │   └── ssl.yml
│   ├── templates
│   │   ├── cassandra-rackdc.properties.j2
│   │   ├── cqlshrc.j2
│   │   ├── io_properties.yaml.j2
│   │   ├── scylla-manager-agent.yaml.j2
│   │   └── scylla.yaml.j2
│   ├── tests
│   │   ├── inventory
│   │   └── test.yml
│   └── vars
│       └── main.yml
├── inventory.ini
├── manager-variables.yaml
├── manager.yaml
├── monitoring.yaml
├── node-variables.yaml
├── nodes.yaml
└── ssl
    └── ca
        └── devel-scylla-ee-ca.pem

35 directories, 66 files

I'm in your Slack if you need to get in touch (James Stocker)

Thanks!

Most of the ansible play seems to work fine, but when I get to the SSL steps toward the end I get this error:

null_resource.execute_ansible_playbook_nodes (local-exec): TASK [ansible-scylla-node : enable ssl options] ********************************
null_resource.execute_ansible_playbook_nodes (local-exec): included: /Users/stocker/git/infrastructure/resources/scylla/build/assets/ansible/ansible-scylla-node/tasks/ssl.yml for [email protected], [email protected], [email protected]

null_resource.execute_ansible_playbook_nodes (local-exec): TASK [ansible-scylla-node : Create dir for the CA] *****************************
null_resource.execute_ansible_playbook_nodes (local-exec): fatal: [[email protected]]: FAILED! => {"changed": false, "module_stderr": "sudo: a password is required\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
"rc": 1}

I haven't set up passwords on the target host (it has just been deployed and this playbook ran on it; I am able to SSH onto the machine and sudo as ec2-user without being asked for a password at the same time this error occurs).

When you add your own SSL certificates, the nodes playbook ignores them and generates self-signed ones

The documentation states that:

To use your own CA, place it in ./ssl/ca/SCYLLA_CLUSTER_NAME-ca.pem and derived per-node certificates will be generated from it, if they are not present.
To use your own node certificates, create a directory per hostname (same as in the inventory) in ./ssl/HOSTNAME and place the .pem file as HOSTNAME.pem and the .crt file as HOSTNAME.crt

When I tried this, my own certs were ignored and new ones were generated. There currently isn't a check for whether certs were provided as the documentation describes (as far as I could see in ssl.yml).

I have a fix I am currently testing; I will update if it turns out to be suitable.

scylla_io_setup fails with Scylla 4.4 on Ubuntu 20.04

The following task fails on a new system when creating a cluster with Scylla 4.4 on Ubuntu 20.04:

- name: Measure IO settings on one node
  shell: |
    scylla_io_setup
  when: io_prop_stat.stat.exists|bool == False

Error it produces

TASK [scylla-node : Measure IO settings on one node] ************************************************************************************************************************************************************                                                                                                          
fatal: [redacted_ip_of_host]: FAILED! => {"changed": true, "cmd": "scylla_io_setup\n", "delta": "0:00:00.899964", "end": "2021-05-27 13:23:54.277646", "msg": "non-zero return code", "rc": 1, "start": "2021-05-27 13:23:53.377682", "stderr": "/usr/sbin/scylla_io_setup: line 3: warning: setlocale: LC_ALL: cannot change loca
le (en_US.UTF-8): No such file or directory\n/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)\nERROR:root:This is not a recommended Google Cloud instance setup for auto tuning, running manual iotune.\n/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)\nERROR 2021-05-
27 13:23:54,176 [shard 5] seastar - Could not setup Async I/O: Resource temporarily unavailable. The most common cause is not enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that number or reducing the amount of logical CPUs available for your application\nERROR:root:Command '['/usr/bin/iotune',
 '--format', 'envfile', '--options-file', '/etc/scylla.d/io.conf', '--properties-file', '/etc/scylla.d/io_properties.yaml', '--evaluation-directory', '/var/lib/scylla/data', '--evaluation-directory', '/var/lib/scylla/commitlog', '--evaluation-directory', '/var/lib/scylla/hints', '--evaluation-directory', '/var/lib
/scylla/view_hints']' returned non-zero exit status 1.\nERROR:root:['/var/lib/scylla/data', '/var/lib/scylla/commitlog', '/var/lib/scylla/hints', '/var/lib/scylla/view_hints'] did not pass validation tests, it may not be on XFS and/or has limited disk space.\nThis is a non-supported setup, and performance is expec
ted to be very bad.\nFor better performance, placing your data on XFS-formatted directories is required.\nTo override this error, enable developer mode as follow:\nsudo /opt/scylladb/scripts/scylla_dev_mode_setup --developer-mode 1", "stderr_lines": ["/usr/sbin/scylla_io_setup: line 3: warning: setlocale: LC_ALL: 
cannot change locale (en_US.UTF-8): No such file or directory", "/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)", "ERROR:root:This is not a recommended Google Cloud instance setup for auto tuning, running manual iotune.", "/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_U
S.UTF-8)", "ERROR 2021-05-27 13:23:54,176 [shard 5] seastar - Could not setup Async I/O: Resource temporarily unavailable. The most common cause is not enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that number or reducing the amount of logical CPUs available for your application", "ERROR:root:
Command '['/usr/bin/iotune', '--format', 'envfile', '--options-file', '/etc/scylla.d/io.conf', '--properties-file', '/etc/scylla.d/io_properties.yaml', '--evaluation-directory', '/var/lib/scylla/data', '--evaluation-directory', '/var/lib/scylla/commitlog', '--evaluation-directory', '/var/lib/scylla/hints', '--eval
uation-directory', '/var/lib/scylla/view_hints']' returned non-zero exit status 1.", "ERROR:root:['/var/lib/scylla/data', '/var/lib/scylla/commitlog', '/var/lib/scylla/hints', '/var/lib/scylla/view_hints'] did not pass validation tests, it may not be on XFS and/or has limited disk space.", "This is a non-supported
 setup, and performance is expected to be very bad.", "For better performance, placing your data on XFS-formatted directories is required.", "To override this error, enable developer mode as follow:", "sudo /opt/scylladb/scripts/scylla_dev_mode_setup --developer-mode 1"], "stdout": "tuning /sys/devices/virtual/blo
ck/dm-0\ntuning: /sys/devices/virtual/block/dm-0/queue/nomerges 2\ntuning /sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:2/0:0:2:0/block/sdb\ntuning: /sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:2/0:0:2:0/block/sdb/queue/nomerges 2\ntuning /sys/devices/virtual/block/dm-0\ntuning /sys/
devices/virtual/block/dm-0\ntuning /sys/devices/virtual/block/dm-0", "stdout_lines": ["tuning /sys/devices/virtual/block/dm-0", "tuning: /sys/devices/virtual/block/dm-0/queue/nomerges 2", "tuning /sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:2/0:0:2:0/block/sdb", "tuning: /sys/devices/pci0000:00/000
0:00:03.0/virtio0/host0/target0:0:2/0:0:2:0/block/sdb/queue/nomerges 2", "tuning /sys/devices/virtual/block/dm-0", "tuning /sys/devices/virtual/block/dm-0", "tuning /sys/devices/virtual/block/dm-0"]}  

With the help of @tarzanek, I manually fixed it by executing the following command on the first node in the cluster:

echo 1048576 > /proc/sys/fs/aio-max-nr
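A persistent equivalent of that manual command, sketched as an Ansible task (the value is taken from the workaround above):

- name: raise fs.aio-max-nr so iotune can set up Async I/O
  sysctl:
    name: fs.aio-max-nr
    value: "1048576"
    sysctl_set: yes
    state: present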

Versioning

We have a working set of files; I suggest we start considering branching off into v0.1 and keeping the rolling changes in master.

gnupg2 on Ubuntu 20 is missing

I have created a few EC2 machines using Terraform with Ubuntu 20.04. Ansible can't provision the machines because gnupg2 is missing. Once I do a manual installation of gnupg2 and then give Ansible another go, it installs fine.
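A sketch of a workaround until the role handles this, e.g. as a pre_task in the playbook (the placement is an assumption):

- name: ensure gnupg2 is present before repositories are added
  apt:
    name: gnupg2
    state: present
    update_cache: yes
  become: true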

support encryption at rest explicitly (not only using scylla_yaml_params)

Encryption at rest (https://docs.scylladb.com/operating-scylla/security/encryption-at-rest/) is not part of the role per se, but you can dynamically add a directory with keys for system_key_directory or for system_info_encryption (https://docs.scylladb.com/operating-scylla/security/encryption-at-rest/#encrypt-system-resources) by passing extra scylla_yaml_params:
https://github.com/scylladb/scylla-ansible-roles/blob/master/ansible-scylla-node/defaults/main.yml#L187
Of course, generating the key and setting it up for user tables is not part of the role, so you need to create them manually as per
https://docs.scylladb.com/operating-scylla/security/encryption-at-rest/#create-encryption-keys

We should add support for system tables, since they require changes to scylla.yaml; user tables can stay outside of the role (they are configured via CQL, but the keys are expected on all nodes).
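A rough sketch of what the system-table part could look like through scylla_yaml_params today; the option names come from the encryption-at-rest documentation linked above and should be verified there, as should whether the role passes nested mappings through cleanly:

scylla_yaml_params:
  system_key_directory: /etc/scylla/encryption_keys/system
  system_info_encryption:
    enabled: true
    key_provider: LocalFileSystemKeyProviderFactory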

When installing nodes with epel repo, system should reboot before io checks

When installing the epel repo, grub currently doesn't get updated and the machine isn't rebooted before the IO checks, so the new kernel is installed but not used. I feel this could reduce performance test results, so I have a fix that updates grub, reboots the machine, and then waits until the server becomes available again.
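The reboot part of such a fix might look roughly like this (the timeout is illustrative):

- name: reboot into the newly installed kernel before the IO checks
  reboot:
    reboot_timeout: 600
  become: true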

Missing restart of the agent after upgrade

The upgrade task in the Ansible role is missing a notify for the agent restart handler, so the agent only gets restarted when the config is changed.

- name: install the latest manager agent
  apt:
    name: scylla-manager-agent
    state: present
  when: scylla_manager_agent_upgrade is defined and scylla_manager_agent_upgrade|bool

- name: start and enable the Manager agent service
  service:
    name: scylla-manager-agent
    state: restarted
    enabled: yes
  become: true
  when: manager_agent_config_change.changed
  ignore_errors: true

This will not happen when upgrading.

This causes some serious issues when there are incompatible changes between agent versions, e.g. backups will break if the agent is 2.3 and the server is 2.4.

The fix is pretty simple: there is already a handler for the agent restart; it just needs to be notified from the upgrade task, as sketched below.
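A sketch of that fix; the handler name is assumed here and should match the one already defined in the role:

- name: install the latest manager agent
  apt:
    name: scylla-manager-agent
    state: present
  notify:
    - restart scylla-manager-agent   # assumed handler name; use the role's existing handler
  when: scylla_manager_agent_upgrade is defined and scylla_manager_agent_upgrade|bool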

Ansible-scylla-manager default vars inconsistency

Tasks and docs use the variable scylla_manager_db_vars; however, the role defaults define a different one (scylla_db_vars, below). This may need refactoring.

---
- name: deploy local Scylla on the Manager node
  import_role:
    name: "{{ role_path }}/../ansible-scylla-node"
  vars:
    install_only: True
    scylla_manager_enabled: false
    scylla_version: 'latest'
    scylla_edition: "{{ scylla_manager_db_vars.scylla_edition|default('oss') }}"
    scylla_repos: "{{ scylla_manager_db_vars.scylla_repos }}"
    elrepo_kernel: false
    scylla_repo_keyserver: "{{ scylla_manager_db_vars.scylla_repo_keyserver|default('') }}"
    scylla_repo_keys: "{{ scylla_manager_db_vars.scylla_repo_keys|default([]) }}"
    scylla_dependencies: "{{ scylla_manager_db_vars.scylla_dependencies|default([]) }}"
    scylla_ssl:
      internode:
        enabled: false
      client:
        enabled: false

scylla_db_vars:
  # Repo URLs for the ScyllaDB datastore installation
  # More information on the variables can be found in the documentation for the scylla-node role and its heavily
  # commented `defaults/main.yml` file
  # scylla_repos:
  #   - 'http://repositories.scylladb.com/scylla/repo/../scylladb-4.1.repo'
  # # Set when relevant (Debian/Ubuntu)
  # scylla_repo_keyserver: 'hkp://keyserver.ubuntu.com:80'
  # scylla_repo_keys:
  #   - 5e08fbd8b5d6ec9c
  # # Configure when additional dependency packages are required (only for some distributions)
  # scylla_dependencies:
  #   - curl
  #   - wget
  #   - software-properties-common
  #   - apt-transport-https
  #   - gnupg2
  #   - dirmngr

Setting cache_valid_time prevents task execution in an unexpected way.

When installing scylla-manager, the first task includes the role ansible-scylla-node. That role uses apt to install its packages and also runs an apt cache update. When the ansible-scylla-manager role then adds a new repository, it needs to execute the following task before installing the scylla-manager packages. However, because the cache validity is set to 600 seconds, the apt cache update is skipped. This causes the installation of the scylla-manager packages to fail.

- name: refresh apt cache
  apt:
    update_cache: yes
    cache_valid_time: 600
    name: "*"
    state: latest
    force_apt_get: yes
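A sketch of one possible fix: refresh the cache unconditionally right after the Manager repository is added, without a cache_valid_time window:

- name: refresh apt cache after adding the Scylla Manager repository
  apt:
    update_cache: yes
  become: true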

Package list not updated

I have created 2 Ubuntu 20.04 instances on EC2, and when I install Scylla using the Ansible roles, it runs into various problems because no 'sudo apt-get update' was executed and some packages like Scylla and gnupg2 are not found. I need to manually log into the nodes and run the update, and then the installation continues.

So before any software is installed, the package list should be updated.
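For example, a sketch of an explicit refresh before any package installation (shown as a standalone task; where exactly it belongs in the role is an open question):

- name: update the apt package index before installing anything
  apt:
    update_cache: yes
  become: true
  when: ansible_os_family == 'Debian'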

Use ansible.builtin.password plugin instead of shell

You should use the password lookup plugin to generate the password instead of executing a shell command and copying the result to localhost. It can be handled in one task instead of the current two.

- name: generate a new token file
  block:
    - name: generate the agent key
      shell: |
        LC_CTYPE=C tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w 128 | head -n 1
      register: scyllamgr_auth_token
      delegate_to: localhost
      run_once: true
    - name: store the auth token in a local file for later use
      copy:
        content: |
          {{ scyllamgr_auth_token.stdout }}
        dest: scyllamgr_auth_token.txt
      delegate_to: localhost
      run_once: true
  when: token_file_stat.stat.islnk is not defined
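A sketch of the single-task alternative; the destination file name matches the one used above, and the character set/length mirror the shell pipeline:

- name: generate (or reuse) the agent auth token
  set_fact:
    scyllamgr_auth_token: "{{ lookup('ansible.builtin.password', 'scyllamgr_auth_token.txt chars=ascii_letters,digits length=128') }}"
  run_once: true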

Unsatisfiable conditional

This condition can never be satisfied because its two parts contradict each other.

- name: install the manager agent
  apt:
    name:
      - scylla-manager-server
      - scylla-manager-client
    state: present
  when:
    - enable_upgrade is not defined
    - enable_upgrade is defined and enable_upgrade|bool == False

By the way, the second part (line) of the condition is redundant, because items in the list are combined with the logical AND operator (https://docs.ansible.com/ansible/latest/user_guide/playbooks_conditionals.html), so it may be omitted entirely. A sketch of a working condition follows.
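A sketch of a condition that does what was presumably intended (install unless an upgrade was explicitly requested):

- name: install the manager agent
  apt:
    name:
      - scylla-manager-server
      - scylla-manager-client
    state: present
  when: enable_upgrade is not defined or not enable_upgrade|bool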

Role re-design

@vladzcloudius @tarzanek I've reviewed the whole ansible-scylla-node role and maybe it's a good time to re-design it entirely, offering a "modular" approach (customizable and well-prepared for future changes, lowering the maintenance effort later). I'll post some observations regarding the current version, which are actually limitations:

  • There is no validation of whether version [version here] can run on [operating system and version here].
  • All supported Ubuntu versions are treated the same way as Debian; the same goes for CentOS/RHEL.
  • If prerequisites differ between distribution versions (for example, Ubuntu 20.04 vs. Ubuntu 16.04), there's no easy way to handle that; the same applies to potentially incompatible packages.
  • Firewall configuration is not provided, which is actually important.

New version role should provide:

  • A better user experience.
  • Better task splitting.
  • Compatibility checks between versions and OSs.
  • Standard tasks for all supported distros (Scylla configuration, etc.).
  • Per-distro-family standard tasks (pre-installation, installation, post-installation, etc.).
  • Per-distro profiles giving full control for each distro and version (install a specific package, remove incompatible ones, set a specific configuration only applicable to that distro version, etc.).

I hope you like the idea. I've started a "role draft" ("coding", notes, etc) to put some ideas in action.

Cheers!

Vars should use unique prefix for each role

I encountered some issues with overlapping variable names between roles ansible-scylla-node and ansible-scylla-monitoring.

I installed scylla-monitoring on the same node as scylla-manager and used the same group name for both. I don't think the roles should restrict which inventory layout users can choose.

There may be more conflicting variables, and the best way to limit such issues beforehand is to use a unique prefix for the variables of each role, e.g. scylla_node_*, scylla_manager_*, scylla_monitoring_*.

on slow machines ansible systemd check for service state errors out

n1-highmem-1:

TASK [ansible-scylla-node : start scylla seeds] ******************************************************************************************************************************************************
task path: /var/lib/jenkins/workspace/Ops/Create_cluster_Impact/impact-scylla-data-us/roles/ansible-scylla-node/tasks/common.yml:214
fatal: [scylla-test1]: FAILED! => {"changed": false, "msg": "Service is in unknown state", "status": {}}
...ignoring

TASK [ansible-scylla-node : start scylla non-seeds] **************************************************************************************************************************************************
task path: /var/lib/jenkins/workspace/Ops/Create_cluster_Impact/impact-scylla-data-us/roles/ansible-scylla-node/tasks/common.yml:223
skipping: [scylla-test1] => {"changed": false, "skip_reason": "Conditional result was False"}

TASK [ansible-scylla-node : wait for the API port to come up on all nodes] ***************************************************************************************************************************
task path: /var/lib/jenkins/workspace/Ops/Create_cluster_Impact/impact-scylla-data-us/roles/ansible-scylla-node/tasks/common.yml:234
fatal: [scylla-test1]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 127.0.0.1:10000"}

after manual start and rerun, again same cause:

RUNNING HANDLER [ansible-scylla-node : node_exporter start] ******************************************************************************************************************************************
task path: /var/lib/jenkins/workspace/Ops/Create_cluster_Impact/impact-scylla-data-us/roles/ansible-scylla-node/handlers/main.yml:20
fatal: [scylla-test1]: FAILED! => {"changed": false, "msg": "Service is in unknown state", "status": {}}
META: ran handlers

The same happened for scylla, so installing and then immediately checking the service state can lead to the above.
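One possible mitigation, sketched with retries around the service start (the service name and retry counts are assumptions):

- name: start scylla seeds
  systemd:
    name: scylla-server
    state: started
  register: scylla_start
  retries: 5
  delay: 30
  until: scylla_start is succeeded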

Manager PW gen fails on MacOS

This command:

tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w 128 | head -n 1

is not safe on macOS:

tr: Illegal byte sequence

[ansible-scylla-node] Retry for downloading external files

Use these lines as a reference:

10:00:50 TASK [ansible-scylla-node : Install ELRepo repository] *************************
10:00:50 task path: [...]/ansible-scylla-node/tasks/RedHat.yml:24
10:00:50 changed: [...] => {"changed": true, "msg": "", "rc": 0, "results": ["Installed [...]/elrepo-release-8.el8.elrepo.noarch5r06ki42.rpm", "Installed: elrepo-release-8.2-1.el8.elrepo.noarch"]}
10:01:02 changed: [....] => {"changed": true, "msg": "", "rc": 0, "results": ["Installed [...]/elrepo-release-8.el8.elrepo.noarch9_a1icyb.rpm", "Installed: elrepo-release-8.2-1.el8.elrepo.noarch"]}
10:01:03 fatal: [....]: FAILED! => {"changed": false, "msg": "Failure downloading https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm, Request failed: <urlopen error timed out>"}

The entire playbook failed because there is no retry mechanism to handle this kind of scenario.
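A sketch of adding a retry loop around the ELRepo installation task (the retry counts are illustrative; the URL is taken from the log above):

- name: Install ELRepo repository
  yum:
    name: https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm
    state: present
  register: elrepo_install
  retries: 3
  delay: 10
  until: elrepo_install is succeeded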
