scylladb / scylla-ansible-roles

Ansible roles for deploying and managing Scylla, Scylla-Manager and Scylla-Monitoring

Python 86.48% Shell 4.76% Jinja 8.75%
scylla-ansible-roles scylla-monitoring scylla-cluster scylla playbooks ansible ansible-roles scylladb automation hacktoberfest

scylla-ansible-roles's Introduction

Scylla Ansible Roles

This repo contains the Ansible roles and example playbooks used for deploying and maintaining Scylla clusters. Each role produces outputs that can be consumed by the other roles; running all three in tandem is recommended, but not required.

For detailed documentation of each role and some of the example playbooks, please see the Wiki: https://github.com/scylladb/scylla-ansible-roles/wiki

Discussion on Slack: https://scylladb-users.slack.com/archives/C01KV03RTEV

Roles:

ansible-scylla-node role

This role will deploy Scylla on the provided set of hosts. Please see the role's README and defaults/main.yml for variable settings. The inventory also has to be configured properly for this role: specifically, the [scylla] group members must have dc and rack properties if using the GossipingPropertyFileSnitch (GPFS), and if using one of the public-cloud snitches they need to have dc_suffix set the same way.
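For illustration, a minimal YAML inventory sketch along these lines (host names, addresses, and the exact variable placement are assumptions; check the role's README for the authoritative format):

scylla:
  hosts:
    scylla-node-1:
      ansible_host: 10.0.0.11
      dc: dc1        # datacenter used by the GPFS snitch
      rack: rack1    # rack used by the GPFS snitch
    scylla-node-2:
      ansible_host: 10.0.0.12
      dc: dc1
      rack: rack2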

Manual

ansible-scylla-manager role

This role will deploy Scylla Manager on the given host(s). If ansible-scylla-node was previously run with the scylla_manager_enabled var set to true, a pre-generated auth token will already have been prepared and applied to the nodes. Manager will be installed and connected to the Scylla cluster.

Manual

ansible-scylla-monitoring role

This role will install Scylla Monitoring (a Prometheus/Grafana-based, containerized monitoring stack). If the ansible-scylla-node role was previously run with generate_monitoring_config set to true, a scylla-servers.yaml file will already have been prepared for the stack to use to connect to the existing cluster.

Manual

ansible-scylla-loader role

This role will prepare a host to run a stress load against a Scylla cluster. The following components get installed:

  • Scylla Java driver
  • Scylla Python driver
  • cassandra-stress (in $PATH)
  • tlp-stress (in $PATH)
  • YCSB (in /home/ANSIBLE_USER/ycsb/VERSION)

example-playbooks

Some basic playbooks showing how the roles can be utilized, as well as some playbooks used for standard day-2 operations with Scylla:

  • Rolling restart automation
  • Major upgrade automation

scylla-ansible-roles's People

Contributors

aurelilys, dyasny, ebenzecri, elcomtik, fee-mendes, igorribeiroduarte, mmatczuk, r4fek, rumbles, sharonovd, simonfrey, sitano, stevedrip, tarzanek, v0112358, vladzcloudius


scylla-ansible-roles's Issues

Can't set rules for Scylla Monitoring 3.8.0 (ansible-scylla-monitoring role)

16:56:42 TASK [ansible-scylla-monitoring : set prometheus rules file from preset file] ***
16:56:42 task path: /home/[..HIDDEN...]/roles/ansible-scylla-monitoring/tasks/common.yml:62
16:56:42 skipping: [..HIDDEN...] => {"changed": false, "skip_reason": "Conditional result was False"}
16:56:42 
16:56:42 TASK [ansible-scylla-monitoring : set prometheus rules file from the default file] ***
16:56:42 task path: /home/[..HIDDEN...]/roles/ansible-scylla-monitoring/tasks/common.yml:69
16:56:42 fatal: [..HIDDEN...]: FAILED! => {"changed": false, "msg": "Source /opt/scylla-monitoring/scylla-monitoring-scylla-monitoring-3.8.0//prometheus/prometheus.rules.yml not found"}
16:56:44 
16:56:44 PLAY RECAP *********************************************************************
16:56:44 escylla-1pmonitor-prod-us  : ok=21   changed=15   unreachable=0    failed=1    skipped=1    rescued=0    ignored=0   
16:56:44 
16:56:44 Build step 'Execute shell' marked build as failure
16:56:45 Finished: FAILURE

It seems like:

  • the path changed,
  • the file name is different, or
  • the file doesn't exist.

Artifacts generated by roles should be stored outside of playbook hierarchy

I used the role ansible-scylla-node to set up my cluster and found it quite annoying that it produced artifacts placed in the playbook directory.

The artifacts in question:

        plays/scylla/scylla_deploy/cqlshrc
        plays/scylla/scylla_deploy/monitoring.yml
        plays/scylla/scylla_deploy/scylla_servers.yml
        plays/scylla/scylla_deploy/scyllamgr_auth_token.txt
        plays/scylla/scylla_deploy/ssl/

I suppose users usually keep their playbook in a git repository, so at a minimum these files need to be git-ignored. Another problem arises when the user wants to deploy multiple clusters with the same playbook: the artifacts will be overwritten, which may not be intended.

I think it would be much better if they were placed in a folder in the user's home directory, e.g. ~/.scylla or ~/.ansible_scylla.

A final observation related to these artifacts: it would be great if they could be loaded from vars. That would allow storing them encrypted with ansible-vault or HashiCorp Vault and sharing them between team members collaborating on the same project. If the user wants to keep using the files as currently generated, loading them into the corresponding variables with an appropriate Ansible lookup plugin would still be possible. Default values for these variables could reference the unencrypted artifacts.
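As a rough sketch of the idea, a hypothetical scylla_manager_auth_token variable could be populated from a user-chosen file (possibly vault-encrypted) with a lookup; the variable name and path here are illustrative only:

scylla_manager_auth_token: "{{ lookup('ansible.builtin.file', 'secrets/scyllamgr_auth_token.txt') }}"  # hypothetical variable and path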

aws i3 support

Since the AWS i3 instances already have an optimized AMI, do these Ansible roles support deployment and rolling restart on i3 instances?

With the AWS i3 AMI, we are seeing issues when we stop and start a node due to "Failed to mount RAID volume". Does this Ansible script support a fix for this?

more scylla.yaml vars

Can the whole scylla.yaml be configured? Would it make sense to add other options such as read/write timeouts or warn/fail thresholds for batch sizes?
I remember the cloud folks also add custom fields, so a variable that simply inserts whatever you give it as "key: value" pairs into scylla.yaml would be great
(e.g. enabling the mc format option in 2019.1); see the sketch after the note below.

  • I know the cloud folks have such custom settings for some customer clusters and change them manually, so the role should at least be aware of them and keep track of them.
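A sketch of what such a pass-through could look like, assuming the role's scylla_yaml_params variable (see defaults/main.yml) accepts a mapping of raw scylla.yaml options; the option names and values below are only examples:

scylla_yaml_params:
  read_request_timeout_in_ms: 10000
  write_request_timeout_in_ms: 5000
  batch_size_warn_threshold_in_kb: 64
  batch_size_fail_threshold_in_kb: 1024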

rolling_restart: restart seeders first

HEAD: 6cbd4ce

Description
We need to restart the seed nodes first in order to make sure they are healthy before the non-seed nodes restart.
Otherwise, if a seed is not visible to a non-seed node due to the very issue that prompted the restart, the non-seed node will refuse to boot.

perftune setup in scylla-node role

We need to run perftune.py, choosing the mode by core count:

Up to 4 cores: mode=mq
5-8 cores: mode=sq
9+ cores: mode=sq_split

For reloc:
sudo PATH=$PATH:/opt/scylladb/bin /opt/scylladb/scripts/perftune.py --tune net --get-cpu-mask --mode sq_split | /opt/scylladb/scripts/hex2list.py
For non-reloc: ...
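A hedged sketch of deriving the mode from the core count inside the role; the fact and variable names are illustrative:

- name: pick the perftune mode from the CPU count
  set_fact:
    perftune_mode: "{{ 'mq' if ansible_processor_vcpus <= 4 else ('sq' if ansible_processor_vcpus <= 8 else 'sq_split') }}"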

cleanup should skip non existent tables

00:01:41.070         "cleanup of stressexample rawtest",
00:01:41.070         "Using /etc/scylla/scylla.yaml as the config file",
00:01:41.070         "nodetool: Keyspace [stressexample] does not exist.",
00:01:41.070         "See 'nodetool help' or 'nodetool help <command>'."
00:01:41.070     ],
00:01:41.070     "warning": "The job has failed and was not cleaned up properly. Please, delete job alias manually to restart."

The cleanup.sh script can encounter empty directories left over from removed keyspaces/tables; it should skip them rather than fail completely.

Yum commands often time out if you're using a smaller server

During my PoC I was running the Scylla playbook on smaller nodes, as I wanted to test the deployment process more than Scylla itself. I often hit issues where playbook steps failed because the yum lock was held and Ansible timed out. This timeout can be increased fairly easily; PR to follow.
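One possible approach, sketched here with the yum module's lock_timeout parameter (the package variable and timeout value are illustrative):

- name: install Scylla packages
  yum:
    name: "{{ scylla_packages }}"   # placeholder variable for this sketch
    state: present
    lock_timeout: 300               # wait up to 5 minutes for the yum lock instead of failing fast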

If the user provides their own CA key as described in the docs, the key is ignored and a self-signed keypair is generated

Further to #11, the wiki describes being able to provide your own CA key for use in generating SSL certificates. Currently, even if you provide a ca.pem in the folder described in the wiki, the key is not used and a self-signed TLS keypair is generated regardless.

This playbook should look for the existence of a ca.pem as well as a ca.crt and use those to generate the CSR and host keypairs. Otherwise, the mention of this process should be removed from the wiki, as it is misleading.
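A minimal sketch of the missing check, assuming the path convention from the wiki; the actual guard would go on the CA-generation tasks in ssl.yml:

- name: check for a user-provided CA key
  stat:
    path: "ssl/ca/{{ scylla_cluster_name }}-ca.pem"
  delegate_to: localhost
  run_once: true
  register: user_ca_key

# the self-signed CA generation tasks would then be skipped with:
#   when: not user_ca_key.stat.exists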

Ansible nodes role fails to generate SSL certificates

Hi, as requested by Guy Carmin, I'm raising this as an issue. I'm doing a PoC of Scylla Enterprise and hitting an issue where I cannot deploy from the node playbook onto instances running Amazon Linux 2.

In my setup I use Terraform to pull in dependencies; for example, I check out this repo from GitHub, as well as the one recommended for setting up swap, and put them in the working directory. I also grab a CA certificate from a local Vault and put it in an ssl/ca folder in the working directory.

I run the following command (via Terraform):

ansible-playbook -i inventory.ini nodes.yaml -e '@node-variables.yaml'

In my working directory I have nodes.yaml:

---
- name: Scylla node
  hosts: scylla
  become: true
  vars:

    #variables for the swap role
    swap_file_size_mb: '1024'

    scylla_dependencies:
      - curl
      - wget
    scylla_io_probe: True
    enable_mc_format: true
    scylla_api_address: '127.0.0.1'
    scylla_api_port: '10000'
    generate_monitoring_config: True

  roles:
    - ansible-role-swap
    - ansible-scylla-node

Then my node-variables.yaml (redacted my repo UUID):

# URL of an RPM .repo file or a DEB .list file
scylla_repos:
  - http://repositories.scylladb.com/scylla/repo/UUID/centos/scylladb-2020.1.repo

# Options are oss|enterprise
scylla_edition: enterprise
scylla_version: latest

scylla_cluster_name: "devel-scylla-ee"

scylla_snitch: Ec2Snitch

# Manager Agent Backup configuration - this is a free text which will be inserted into the configuration file
# Please ensure it is a valid and working configuration. Please refer to the example configuration file and the
# documentation for detailed instructions
scylla_manager_agent_config: |
  s3:
    access_key_id: 12345678
    secret_access_key: QWerty123456789
    provider: AWS
    region: us-east-1c
    endpoint: https://foo.bar.baz
    server_side_encryption:
    sse_kms_key_id:
    upload_concurrency: 2
    chunk_size: 50M
    use_accelerate_endpoint: false

# Configure raid disks via scylla_setup. This requires a list of disks to add to the raid
scylla_raid_setup:
  - /dev/nvme1n1

scylla_seeds:
  - "10.72.7.6"

scylla_manager_enabled: true
scylla_manager_repo_url: "http://downloads.scylladb.com/rpm/centos/scylladb-manager-2.2.repo"

I appreciate that I haven't set my scylla_manager_agent_config variables yet, but I don't think that is causing my issue...

My inventory.ini:

In my working directory I have the following tree:

.
├── ansible-role-swap
│   ├── LICENSE
│   ├── README.md
│   ├── defaults
│   │   └── main.yml
│   ├── meta
│   │   └── main.yml
│   ├── molecule
│   │   └── default
│   │       ├── converge.yml
│   │       └── molecule.yml
│   └── tasks
│       ├── disable.yml
│       ├── enable.yml
│       └── main.yml
├── ansible-scylla-loader
│   ├── README.md
│   ├── defaults
│   │   └── main.yml
│   └── tasks
│       ├── Debian.yml
│       ├── RedHat.yml
│       └── main.yml
├── ansible-scylla-manager
│   ├── README.md
│   ├── defaults
│   │   └── main.yml
│   ├── files
│   │   └── perftune.py
│   ├── handlers
│   │   └── main.yml
│   ├── meta
│   │   └── main.yml
│   ├── tasks
│   │   ├── Debian.yml
│   │   ├── RedHat.yml
│   │   ├── add-clusters.yml
│   │   └── main.yml
│   ├── tests
│   │   ├── inventory
│   │   └── test.yml
│   └── vars
│       └── main.yml
├── ansible-scylla-monitoring
│   ├── README.md
│   ├── defaults
│   │   └── main.yml
│   ├── handlers
│   │   └── main.yml
│   ├── meta
│   │   └── main.yml
│   ├── tasks
│   │   ├── Debian.yml
│   │   ├── RedHat.yml
│   │   ├── common.yml
│   │   ├── docker.yml
│   │   ├── main.yml
│   │   └── non-docker.yml
│   ├── tests
│   │   ├── inventory
│   │   └── test.yml
│   └── vars
│       └── main.yml
├── ansible-scylla-node
│   ├── README.md
│   ├── defaults
│   │   └── main.yml
│   ├── files
│   │   └── minigenconfig.py
│   ├── handlers
│   │   └── main.yml
│   ├── meta
│   │   └── main.yml
│   ├── tasks
│   │   ├── Debian.yml
│   │   ├── RedHat.yml
│   │   ├── common.yml
│   │   ├── main.yml
│   │   ├── manager_agents.yml
│   │   ├── monitoring_config.yml
│   │   └── ssl.yml
│   ├── templates
│   │   ├── cassandra-rackdc.properties.j2
│   │   ├── cqlshrc.j2
│   │   ├── io_properties.yaml.j2
│   │   ├── scylla-manager-agent.yaml.j2
│   │   └── scylla.yaml.j2
│   ├── tests
│   │   ├── inventory
│   │   └── test.yml
│   └── vars
│       └── main.yml
├── inventory.ini
├── manager-variables.yaml
├── manager.yaml
├── monitoring.yaml
├── node-variables.yaml
├── nodes.yaml
└── ssl
    └── ca
        └── devel-scylla-ee-ca.pem

35 directories, 66 files

I'm in your Slack if you need to get in touch (James Stocker)

Thanks!

Most of the ansible play seems to work fine, but when I get to the SSL steps toward the end I get this error:

null_resource.execute_ansible_playbook_nodes (local-exec): TASK [ansible-scylla-node : enable ssl options] ********************************
null_resource.execute_ansible_playbook_nodes (local-exec): included: /Users/stocker/git/infrastructure/resources/scylla/build/assets/ansible/ansible-scylla-node/tasks/ssl.yml for [email protected], [email protected], [email protected]

null_resource.execute_ansible_playbook_nodes (local-exec): TASK [ansible-scylla-node : Create dir for the CA] *****************************
null_resource.execute_ansible_playbook_nodes (local-exec): fatal: [[email protected]]: FAILED! => {"changed": false, "module_stderr": "sudo: a password is required\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
"rc": 1}

I haven't set up passwords on the target host (it has just been deployed and this playbook ran on it; I am able to SSH onto the machine and sudo as ec2-user without being asked for a password at the same time this error occurs).

When you add your own SSL certificates, the nodes playbook ignores them and generates self-signed ones

The documentation states that:

To use your own CA, place it in ./ssl/ca/SCYLLA_CLUSTER_NAME-ca.pem and derived per-node certificates will be generated from it, if they are not present.
To use your own node certificates, create a directory per hostname (same as in the inventory) in ./ssl/HOSTNAME and place the .pem file as HOSTNAME.pem and the .crt file as HOSTNAME.crt

When I tried this, my own certs were ignored and new ones were generated. There currently isn't a check for whether certs were provided as the documentation describes (as far as I could see in ssl.yml).

I have a fix I am currently testing; I will update if it turns out to be suitable.

scylla_io_setup fails with Scylla 4.4 on Ubuntu 20.04

The following task fails on a new system when creating a cluster with Scylla 4.4 on Ubuntu 20.04:

- name: Measure IO settings on one node
  shell: |
    scylla_io_setup
  when: io_prop_stat.stat.exists|bool == False

Error it produces

TASK [scylla-node : Measure IO settings on one node] ************************************************************************************************************************************************************                                                                                                          
fatal: [redacted_ip_of_host]: FAILED! => {"changed": true, "cmd": "scylla_io_setup\n", "delta": "0:00:00.899964", "end": "2021-05-27 13:23:54.277646", "msg": "non-zero return code", "rc": 1, "start": "2021-05-27 13:23:53.377682", "stderr": "/usr/sbin/scylla_io_setup: line 3: warning: setlocale: LC_ALL: cannot change loca
le (en_US.UTF-8): No such file or directory\n/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)\nERROR:root:This is not a recommended Google Cloud instance setup for auto tuning, running manual iotune.\n/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)\nERROR 2021-05-
27 13:23:54,176 [shard 5] seastar - Could not setup Async I/O: Resource temporarily unavailable. The most common cause is not enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that number or reducing the amount of logical CPUs available for your application\nERROR:root:Command '['/usr/bin/iotune',
 '--format', 'envfile', '--options-file', '/etc/scylla.d/io.conf', '--properties-file', '/etc/scylla.d/io_properties.yaml', '--evaluation-directory', '/var/lib/scylla/data', '--evaluation-directory', '/var/lib/scylla/commitlog', '--evaluation-directory', '/var/lib/scylla/hints', '--evaluation-directory', '/var/lib
/scylla/view_hints']' returned non-zero exit status 1.\nERROR:root:['/var/lib/scylla/data', '/var/lib/scylla/commitlog', '/var/lib/scylla/hints', '/var/lib/scylla/view_hints'] did not pass validation tests, it may not be on XFS and/or has limited disk space.\nThis is a non-supported setup, and performance is expec
ted to be very bad.\nFor better performance, placing your data on XFS-formatted directories is required.\nTo override this error, enable developer mode as follow:\nsudo /opt/scylladb/scripts/scylla_dev_mode_setup --developer-mode 1", "stderr_lines": ["/usr/sbin/scylla_io_setup: line 3: warning: setlocale: LC_ALL: 
cannot change locale (en_US.UTF-8): No such file or directory", "/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)", "ERROR:root:This is not a recommended Google Cloud instance setup for auto tuning, running manual iotune.", "/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_U
S.UTF-8)", "ERROR 2021-05-27 13:23:54,176 [shard 5] seastar - Could not setup Async I/O: Resource temporarily unavailable. The most common cause is not enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that number or reducing the amount of logical CPUs available for your application", "ERROR:root:
Command '['/usr/bin/iotune', '--format', 'envfile', '--options-file', '/etc/scylla.d/io.conf', '--properties-file', '/etc/scylla.d/io_properties.yaml', '--evaluation-directory', '/var/lib/scylla/data', '--evaluation-directory', '/var/lib/scylla/commitlog', '--evaluation-directory', '/var/lib/scylla/hints', '--eval
uation-directory', '/var/lib/scylla/view_hints']' returned non-zero exit status 1.", "ERROR:root:['/var/lib/scylla/data', '/var/lib/scylla/commitlog', '/var/lib/scylla/hints', '/var/lib/scylla/view_hints'] did not pass validation tests, it may not be on XFS and/or has limited disk space.", "This is a non-supported
 setup, and performance is expected to be very bad.", "For better performance, placing your data on XFS-formatted directories is required.", "To override this error, enable developer mode as follow:", "sudo /opt/scylladb/scripts/scylla_dev_mode_setup --developer-mode 1"], "stdout": "tuning /sys/devices/virtual/blo
ck/dm-0\ntuning: /sys/devices/virtual/block/dm-0/queue/nomerges 2\ntuning /sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:2/0:0:2:0/block/sdb\ntuning: /sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:2/0:0:2:0/block/sdb/queue/nomerges 2\ntuning /sys/devices/virtual/block/dm-0\ntuning /sys/
devices/virtual/block/dm-0\ntuning /sys/devices/virtual/block/dm-0", "stdout_lines": ["tuning /sys/devices/virtual/block/dm-0", "tuning: /sys/devices/virtual/block/dm-0/queue/nomerges 2", "tuning /sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:2/0:0:2:0/block/sdb", "tuning: /sys/devices/pci0000:00/000
0:00:03.0/virtio0/host0/target0:0:2/0:0:2:0/block/sdb/queue/nomerges 2", "tuning /sys/devices/virtual/block/dm-0", "tuning /sys/devices/virtual/block/dm-0", "tuning /sys/devices/virtual/block/dm-0"]}  

With the help of @tarzanek, I manually fixed it by executing the following command on the first node in the cluster:

echo 1048576 > /proc/sys/fs/aio-max-nr
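A persistent equivalent of that manual command, sketched as an Ansible task (the value is taken from the workaround above):

- name: raise fs.aio-max-nr so iotune can set up Async I/O
  sysctl:
    name: fs.aio-max-nr
    value: "1048576"
    sysctl_set: yes
    state: present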

Versioning

We have a working set of files; I suggest we start considering branching off into v0.1 and keeping the rolling changes in master.

gnupg2 on Ubuntu 20 is missing

I have created a few EC2 machines using Terraform with Ubuntu 20.04. Ansible can't provision the machines because gnupg2 is missing. Once I do a manual installation of gnupg2 and then give Ansible another go, it installs fine.
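A sketch of a workaround until the role handles this, e.g. as a pre_task in the playbook (the placement is an assumption):

- name: ensure gnupg2 is present before repositories are added
  apt:
    name: gnupg2
    state: present
    update_cache: yes
  become: true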

support encryption at rest explicitly (not only using scylla_yaml_params)

Encryption at rest (https://docs.scylladb.com/operating-scylla/security/encryption-at-rest/) is not part of the role per se, but you can dynamically add a directory with keys for system_key_directory or for system_info_encryption (https://docs.scylladb.com/operating-scylla/security/encryption-at-rest/#encrypt-system-resources) by passing extra scylla_yaml_params:
https://github.com/scylladb/scylla-ansible-roles/blob/master/ansible-scylla-node/defaults/main.yml#L187
Of course, generating the key and setting it up for user tables is not part of the role, so you need to create them manually as per
https://docs.scylladb.com/operating-scylla/security/encryption-at-rest/#create-encryption-keys

We should add support for system tables, since they require changes to scylla.yaml; user tables can stay outside of the role (they are configured via CQL, but the keys are expected on all nodes).
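A rough sketch of what the system-table part could look like through scylla_yaml_params today; the option names come from the encryption-at-rest documentation linked above and should be verified there, as should whether the role passes nested mappings through cleanly:

scylla_yaml_params:
  system_key_directory: /etc/scylla/encryption_keys/system
  system_info_encryption:
    enabled: true
    key_provider: LocalFileSystemKeyProviderFactory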

When installing nodes with epel repo, system should reboot before io checks

When installing the epel repo, grub currently doesn't get updated and the machine isn't rebooted before the IO checks, so the new kernel is installed but not used. I feel this could reduce performance test results, so I have a fix that updates grub, reboots the machine, and then waits until the server becomes available again.
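The reboot part of such a fix might look roughly like this (the timeout is illustrative):

- name: reboot into the newly installed kernel before the IO checks
  reboot:
    reboot_timeout: 600
  become: true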

Missing restart of the agent after upgrade

The upgrade task in the Ansible role is missing a notify for the agent restart handler, so the agent only gets restarted when the config is changed.

- name: install the latest manager agent
  apt:
    name: scylla-manager-agent
    state: present
  when: scylla_manager_agent_upgrade is defined and scylla_manager_agent_upgrade|bool

- name: start and enable the Manager agent service
  service:
    name: scylla-manager-agent
    state: restarted
    enabled: yes
  become: true
  when: manager_agent_config_change.changed
  ignore_errors: true

This will not happen when upgrading.

This causes some serious issues when there are incompatible changes between agent versions, e.g. backups will break if the agent is 2.3 and the server is 2.4.

The fix is pretty simple: there is already a handler for the agent restart; it just needs to be notified from the upgrade task, as sketched below.
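A sketch of that fix; the handler name is assumed here and should match the one already defined in the role:

- name: install the latest manager agent
  apt:
    name: scylla-manager-agent
    state: present
  notify:
    - restart scylla-manager-agent   # assumed handler name; use the role's existing handler
  when: scylla_manager_agent_upgrade is defined and scylla_manager_agent_upgrade|bool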

Ansible-scylla-manager default vars inconsistency

Tasks and docs use the variable scylla_manager_db_vars; however, the role defaults define a different one (scylla_db_vars, below). This may need refactoring.

---
- name: deploy local Scylla on the Manager node
  import_role:
    name: "{{ role_path }}/../ansible-scylla-node"
  vars:
    install_only: True
    scylla_manager_enabled: false
    scylla_version: 'latest'
    scylla_edition: "{{ scylla_manager_db_vars.scylla_edition|default('oss') }}"
    scylla_repos: "{{ scylla_manager_db_vars.scylla_repos }}"
    elrepo_kernel: false
    scylla_repo_keyserver: "{{ scylla_manager_db_vars.scylla_repo_keyserver|default('') }}"
    scylla_repo_keys: "{{ scylla_manager_db_vars.scylla_repo_keys|default([]) }}"
    scylla_dependencies: "{{ scylla_manager_db_vars.scylla_dependencies|default([]) }}"
    scylla_ssl:
      internode:
        enabled: false
      client:
        enabled: false

scylla_db_vars:
  # Repo URLs for the ScyllaDB datastore installation
  # More information on the variables can be found in the documentation for the scylla-node role and its heavily
  # commented `defaults/main.yml` file
  # scylla_repos:
  #   - 'http://repositories.scylladb.com/scylla/repo/../scylladb-4.1.repo'
  # # Set when relevant (Debian/Ubuntu)
  # scylla_repo_keyserver: 'hkp://keyserver.ubuntu.com:80'
  # scylla_repo_keys:
  #   - 5e08fbd8b5d6ec9c
  # # Configure when additional dependency packages are required (only for some distributions)
  # scylla_dependencies:
  #   - curl
  #   - wget
  #   - software-properties-common
  #   - apt-transport-https
  #   - gnupg2
  #   - dirmngr

Setting cache_valid_time prevents task execution in an unexpected way.

When installing scylla-manager, the first task includes the role ansible-scylla-node. That role uses apt to install its packages and also runs an apt cache update. When the ansible-scylla-manager role then adds a new repository, it needs to execute the following task before installing the scylla-manager packages. However, because the cache validity is set to 600 seconds, the apt cache update is skipped. This causes the installation of the scylla-manager packages to fail.

- name: refresh apt cache
  apt:
    update_cache: yes
    cache_valid_time: 600
    name: "*"
    state: latest
    force_apt_get: yes
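A sketch of one possible fix: refresh the cache unconditionally right after the Manager repository is added, without a cache_valid_time window:

- name: refresh apt cache after adding the Scylla Manager repository
  apt:
    update_cache: yes
  become: true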

Package list not updated

I have created 2 Ubuntu 20.04 instances on EC2, and when I install Scylla using the Ansible roles, it runs into various problems because no 'sudo apt-get update' was executed and some packages like Scylla and gnupg2 are not found. I need to manually log into the nodes and run the update, and then the installation continues.

So before any software is installed, the package list should be updated.
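For example, a sketch of an explicit refresh before any package installation (shown as a standalone task; where exactly it belongs in the role is an open question):

- name: update the apt package index before installing anything
  apt:
    update_cache: yes
  become: true
  when: ansible_os_family == 'Debian'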

Use ansible.builtin.password plugin instead of shell

You should use the password lookup plugin to generate the password instead of executing a shell command and copying the result to localhost. It can be handled in one task instead of the current two.

- name: generate a new token file
  block:
    - name: generate the agent key
      shell: |
        LC_CTYPE=C tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w 128 | head -n 1
      register: scyllamgr_auth_token
      delegate_to: localhost
      run_once: true
    - name: store the auth token in a local file for later use
      copy:
        content: |
          {{ scyllamgr_auth_token.stdout }}
        dest: scyllamgr_auth_token.txt
      delegate_to: localhost
      run_once: true
  when: token_file_stat.stat.islnk is not defined
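A sketch of the single-task alternative; the destination file name matches the one used above, and the character set/length mirror the shell pipeline:

- name: generate (or reuse) the agent auth token
  set_fact:
    scyllamgr_auth_token: "{{ lookup('ansible.builtin.password', 'scyllamgr_auth_token.txt chars=ascii_letters,digits length=128') }}"
  run_once: true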

Unsatisfiable conditional

This condition can never be satisfied because its two parts contradict each other.

- name: install the manager agent
  apt:
    name:
      - scylla-manager-server
      - scylla-manager-client
    state: present
  when:
    - enable_upgrade is not defined
    - enable_upgrade is defined and enable_upgrade|bool == False

By the way, the second part (line) of the condition is redundant, because items in the list are combined with the logical AND operator (https://docs.ansible.com/ansible/latest/user_guide/playbooks_conditionals.html), so it may be omitted entirely. A sketch of a working condition follows.
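A sketch of a condition that does what was presumably intended (install unless an upgrade was explicitly requested):

- name: install the manager agent
  apt:
    name:
      - scylla-manager-server
      - scylla-manager-client
    state: present
  when: enable_upgrade is not defined or not enable_upgrade|bool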

Role re-design

@vladzcloudius @tarzanek I've reviewed the whole ansible-scylla-node role and maybe it's a good time to re-design it entirely, offering a "modular" approach (customizable and well-prepared for future changes, lowering the maintenance effort later). I'll post some observations regarding the current version, which are actually limitations:

  • There is no validation of whether version [version here] can run on [operating system and version here].
  • All supported Ubuntu versions are treated the same way as Debian; the same goes for CentOS/RHEL.
  • If prerequisites differ between distribution versions (for example, Ubuntu 20.04 vs. Ubuntu 16.04), there's no easy way to handle that; the same applies to potentially incompatible packages.
  • Firewall configuration is not provided, which is actually important.

New version role should provide:

  • A better user experience.
  • Better task splitting.
  • Compatibility checks between versions and OSs.
  • Standard tasks for all supported distros (Scylla configuration, etc.).
  • Per-distro-family standard tasks (pre-installation, installation, post-installation, etc.).
  • Per-distro profiles giving full control for each distro and version (install a specific package, remove incompatible ones, set a specific configuration only applicable to that distro version, etc.).

I hope you like the idea. I've started a "role draft" ("coding", notes, etc) to put some ideas in action.

Cheers!

Vars should use unique prefix for each role

I encountered some issues with overlapping variable names between roles ansible-scylla-node and ansible-scylla-monitoring.

I installed scylla-monitoring on the same node as scylla-manager and used the same group name for both. I don't think the roles should restrict which inventory layout users can choose.

There may be more conflicting variables, and the best way to limit such issues beforehand is to use a unique prefix for the variables of each role, e.g. scylla_node_*, scylla_manager_*, scylla_monitoring_*.

on slow machines ansible systemd check for service state errors out

n1-highmem-1:

TASK [ansible-scylla-node : start scylla seeds] ******************************************************************************************************************************************************
task path: /var/lib/jenkins/workspace/Ops/Create_cluster_Impact/impact-scylla-data-us/roles/ansible-scylla-node/tasks/common.yml:214
fatal: [scylla-test1]: FAILED! => {"changed": false, "msg": "Service is in unknown state", "status": {}}
...ignoring

TASK [ansible-scylla-node : start scylla non-seeds] **************************************************************************************************************************************************
task path: /var/lib/jenkins/workspace/Ops/Create_cluster_Impact/impact-scylla-data-us/roles/ansible-scylla-node/tasks/common.yml:223
skipping: [scylla-test1] => {"changed": false, "skip_reason": "Conditional result was False"}

TASK [ansible-scylla-node : wait for the API port to come up on all nodes] ***************************************************************************************************************************
task path: /var/lib/jenkins/workspace/Ops/Create_cluster_Impact/impact-scylla-data-us/roles/ansible-scylla-node/tasks/common.yml:234
fatal: [scylla-test1]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 127.0.0.1:10000"}

after manual start and rerun, again same cause:

RUNNING HANDLER [ansible-scylla-node : node_exporter start] ******************************************************************************************************************************************
task path: /var/lib/jenkins/workspace/Ops/Create_cluster_Impact/impact-scylla-data-us/roles/ansible-scylla-node/handlers/main.yml:20
fatal: [scylla-test1]: FAILED! => {"changed": false, "msg": "Service is in unknown state", "status": {}}
META: ran handlers

The same happened for scylla, so installing and then immediately checking the service state can lead to the above.
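One possible mitigation, sketched with retries around the service start (the service name and retry counts are assumptions):

- name: start scylla seeds
  systemd:
    name: scylla-server
    state: started
  register: scylla_start
  retries: 5
  delay: 30
  until: scylla_start is succeeded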

Manager PW gen fails on MacOS

This command:

tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w 128 | head -n 1

is not safe on macOS:

tr: Illegal byte sequence

[ansible-scylla-node] Retry for downloading external files

Use these lines as a reference:

10:00:50 TASK [ansible-scylla-node : Install ELRepo repository] *************************
10:00:50 task path: [...]/ansible-scylla-node/tasks/RedHat.yml:24
10:00:50 changed: [...] => {"changed": true, "msg": "", "rc": 0, "results": ["Installed [...]/elrepo-release-8.el8.elrepo.noarch5r06ki42.rpm", "Installed: elrepo-release-8.2-1.el8.elrepo.noarch"]}
10:01:02 changed: [....] => {"changed": true, "msg": "", "rc": 0, "results": ["Installed [...]/elrepo-release-8.el8.elrepo.noarch9_a1icyb.rpm", "Installed: elrepo-release-8.2-1.el8.elrepo.noarch"]}
10:01:03 fatal: [....]: FAILED! => {"changed": false, "msg": "Failure downloading https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm, Request failed: <urlopen error timed out>"}

The entire playbook failed because there is no retry mechanism to handle this kind of scenario.
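A sketch of adding a retry loop around the ELRepo installation task (the retry counts are illustrative; the URL is taken from the log above):

- name: Install ELRepo repository
  yum:
    name: https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm
    state: present
  register: elrepo_install
  retries: 3
  delay: 10
  until: elrepo_install is succeeded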
