scylladb / scylla-cluster-tests

Tests for Scylla Clusters

License: GNU Affero General Public License v3.0



SCT - Scylla Cluster Tests

SCT tests are designed to test the Scylla database on physical/virtual servers under high read/write load. Currently the tests run using the built-in Python unittest framework. These tests automatically create:

  • Scylla clusters - run the Scylla database
  • Loader machines - used to run load generators like cassandra-stress
  • Monitoring server - uses the official Scylla Monitoring repo to monitor the Scylla clusters and loaders

Quickstart

# install aws cli
sudo apt install awscli # Debian/Ubuntu
sudo dnf install awscli # Red Hat/Fedora
# or follow amazon instructions to get it: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

# Ask your AWS account admin to create a user and access key for AWS, then configure the AWS CLI

> aws configure
AWS Access Key ID [****************7S5A]:
AWS Secret Access Key [****************5NcH]:
Default region name [us-east-1]:
Default output format [None]:

# if using Okta, use any of the tools to create the AWS profile, and export it as shown below
# anywhere you are going to use the hydra command (replace DeveloperAccessRole with the name of your profile):
export AWS_PROFILE=DeveloperAccessRole

# Install hydra (a Docker image holding all the requirements for running SCT)
sudo ./install-hydra.sh

# if using podman, we need to disable enforcement of short image names,
# otherwise the monitoring stack won't run from within hydra
mkdir -p ~/.config/containers
echo 'unqualified-search-registries = ["registry.fedoraproject.org", "registry.access.redhat.com", "docker.io", "quay.io"]
short-name-mode="permissive"
' > ~/.config/containers/registries.conf

Run a test

Example of running a test with hydra, using the test-cases/PR-provision-test.yaml configuration file.

Run test locally with AWS backend:

export SCT_SCYLLA_VERSION=5.2.1
# The test fails to report to Argus when run locally, so we need to disable it
export SCT_ENABLE_ARGUS=false
# this configuration is needed when running from a local development machine (by default communication is via private addresses)
hydra run-test longevity_test.LongevityTest.test_custom_time --backend aws --config test-cases/PR-provision-test.yaml --config configurations/network_config/test_communication_public.yaml

# Run with IPv6 configuration
hydra run-test longevity_test.LongevityTest.test_custom_time --backend aws --config test-cases/PR-provision-test.yaml --config configurations/network_config/all_addresses_ipv6_public.yaml

Run test using SCT Runner with AWS backend:

hydra create-runner-instance --cloud-provider <cloud_name> -r <region_name> -z <az> -t <test-id> -d <run_duration>

export SCT_SCYLLA_VERSION=5.2.1
# To choose the correct network configuration, check the test's Jenkins pipeline.
# All predefined configurations are located under `configurations/network_config`
hydra --execute-on-runner <runner-ip|$(cat sct_runner_ip)> "run-test longevity_test.LongevityTest.test_custom_time --backend aws --config test-cases/PR-provision-test.yaml"

Run test locally with GCE backend:

export SCT_SCYLLA_VERSION=5.2.1
hydra run-test longevity_test.LongevityTest.test_custom_time --backend gce --config test-cases/PR-provision-test.yaml

Run test locally with Azure backend:

export SCT_SCYLLA_VERSION=5.2.1
hydra run-test longevity_test.LongevityTest.test_custom_time --backend azure --config test-cases/PR-provision-test.yaml

Run test locally with docker backend:

# **NOTE:** the user should be part of the sudo group and set up with passwordless access,
# see https://unix.stackexchange.com/a/468417 for an example of how to set this up

# example of running a specific Scylla version on the docker backend
export SCT_SCYLLA_VERSION=5.2.1
hydra run-test longevity_test.LongevityTest.test_custom_time --backend docker --config test-cases/PR-provision-test-docker.yaml

You can also enter the containerized SCT environment using:

hydra bash

List resources being used by user:

hydra list-resources --user `whoami`

Clear resources being used by the last test run:

SCT_CLUSTER_BACKEND= hydra list-resources --test-id `cat ~/sct-results/latest/test_id`

Supported backends

  • aws - the most commonly used backend; most longevity tests run on top of it

  • gce - most of the artifact and rolling-upgrade tests run on top of this backend

  • azure -

  • docker - should be used for local development

  • baremetal - can be used to run against an already set-up cluster

  • k8s-eks -

  • k8s-gke -

  • k8s-local-kind - used to run k8s functional tests locally

  • k8s-local-kind-gce - used to run k8s functional tests locally on GCE

  • k8s-local-kind-aws - used to run k8s functional tests locally on AWS

Configuring test run configuration YAML

Take a look at the test-cases/PR-provision-test.yaml file. It contains a number of configurable test parameters, such as DB cluster instance types and AMI IDs. In this example, we're assuming that you have copied test-cases/PR-provision-test.yaml to test-cases/your_config.yaml.

All the test run configurations are stored in the test-cases directory.

Important: Some tests use custom hardcoded operations due to their nature, so those tests won't honor what is set in test-cases/your_config.yaml.

The available configuration options are listed in configuration_options.
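For orientation, a minimal test-cases/your_config.yaml override might look like the sketch below. The parameter names here are illustrative examples only; check configuration_options for the authoritative names and defaults.

```yaml
# test-cases/your_config.yaml -- illustrative sketch only
test_duration: 60              # minutes
n_db_nodes: 3                  # size of the Scylla DB cluster
n_loaders: 1                   # loader machines running cassandra-stress
instance_type_db: 'i3.large'   # DB cluster instance type
```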

Types of Tests

Longevity Tests (TODO: write explanation for them)

Upgrade Tests (TODO: write explanation for them)

Performance Tests (TODO: write explanation for them)

Features Tests (TODO: write explanation for them)

Manager Tests (TODO: write explanation for them)

scylla-cluster-tests's People

Contributors

abvgedeika, aleksbykov, amoskong, asias, bentsi, dependabot[bot], dimakr, dkropachev, enaydanov, fgelcer, fruch, ilya-rarov, juliayakovlev, k0machi, kbr-scylla, knifeymoloko, larisau, lmr, mikliapko, roydahan, shlomibalalis, shoshan, sitano, slivne, soyacz, tchaikov, temichus, vponomaryov, yarongilor, yaronkaikov


scylla-cluster-tests's Issues

Connect the cluster into a prometheus backend that will collect the metrics and download the data at the end

We are collecting the cassandra-stress results.
We are not collecting the metrics of Scylla itself.
It would be helpful if we could download the metrics from Scylla, to help initial analysis.

The direction is to setup https://github.com/scylladb/scylla-grafana-monitoring and have it point to the scylla nodes

We are working toward having Prometheus on the AMI / scylla_setup, so this part should be done in the near future.

The real work is to start Prometheus/Grafana up,
download the DB,
and check that we can later reuse the downloaded DB in another instance run locally.

How large are the artifacts?

We may later want to save everything in git or some other DB.

cannot add a node with the Cassandra AMI

Lucas, we have an issue adding the 4th node with the Cassandra AMI; it hangs there
when I log into it. I saw this twice.

Cluster started with these options:
--clustername asias-test-long-cassandra-db-cluster-a455508d
--bootstrap true --totalnodes 1 --seeds 172.30.0.19 --version
community --release 2.1.8

Waiting for nodetool...
The cluster is now in it's finalization phase. This should only take a moment...

Note: You can also use CTRL+C to view the logs if desired:
AMI log: ~/datastax_ami/ami.log
Cassandra log: /var/log/cassandra/system.log

The logs says:


INFO  [SharedPool-Worker-3] 2016-06-03 06:14:51,705 Gossiper.java:954
- InetAddress /172.30.0.233 is now UP
INFO  [SharedPool-Worker-14] 2016-06-03 06:14:51,707 Gossiper.java:954
- InetAddress /172.30.0.18 is now UP
ERROR [main] 2016-06-03 06:14:52,487 CassandraDaemon.java:541 -
Exception encountered during startup
java.lang.UnsupportedOperationException: Other
bootstrapping/leaving/moving nodes detected, cannot bootstrap while
cassandra.consistent.rangemovement is true
        at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:552)
~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:777)
~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:714)
~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:605)
~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378)
[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:524)
[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:613)
[apache-cassandra-2.1.8.jar:2.1.8]
WARN  [StorageServiceShutdownHook] 2016-06-03 06:14:52,553
Gossiper.java:1418 - No local state or state is in silent shutdown,
not announcing shutdown
INFO  [StorageServiceShutdownHook] 2016-06-03 06:14:52,554
MessagingService.java:708 - Waiting for messaging service to quiesce
INFO  [ACCEPT-/172.30.0.233] 2016-06-03 06:14:52,558
MessagingService.java:958 - MessagingService has terminated the
accept() thread

If I restart cassandra

sudo /etc/init.d/cassandra restart

it can join the cluster with no problem.

grafana serving on ipv6 on the monitor

Upgraded fedora23
latest pip
SCT git commit: 92ff3ad
avocado lts.

command: avocado run grow_cluster_test.py:GrowClusterTest.test_grow_3_to_5 --multiplex data_dir/scylla.yaml --filter-only /run/backends/aws/us_east_1 /run/databases/scylla --filter-out /run/backends/libvirt

scylla.yaml.txt
job.log.txt
rpm.list.mine.txt
pip.list.mine.txt
netstat.txt

Ports 3000, 9000 and 9090 are served over IPv6 (tcp6) on the resulting monitor, which prevents curl
from accessing them, since IPv6 does not exist on EC2.

force GCE/AWS not to reuse IP

In issue scylladb/scylladb#2267, one node was decommissioned, then destroyed, but the newly added GCE node got the same IP (both private and public) as the destroyed node.

Then the first yum command after the new node started failed. Let's avoid this 'bad luck' situation by forcing GCE/AWS not to reuse an IP (at least not one that was just released).

create a report compatible with jenkins that includes cassandra-stress results

code snippet we have previously used

https://github.com/cloudius-systems/tests/blob/master/scripts/statjenkins.py

#!/usr/bin/python3
import xml.etree.ElementTree as ET
import datetime
import json
import argparse
import sys
import collections
import traceback

def add_time(parent, name):
    e = ET.SubElement(parent, name)
    t = datetime.datetime.utcnow()
    ET.SubElement(e, 'date', val=t.date().isoformat(), format='ISO8601')
    ET.SubElement(e, 'time', val=t.time().isoformat(), format='ISO8601')

def jenkinsreport(val, units, category, test, description):
    report = ET.Element('report', categ=category)
    add_time(report, 'start')
    test = ET.SubElement(report, 'test', name=test, executed='yes')
    ET.SubElement(test, 'description').text = description
    res = ET.SubElement(test, 'result')
    ET.SubElement(res, 'success', passed='yes', state='1', hasTimeOut='no')
    ET.SubElement(res, 'performance', unit=units, mesure=val, isRelevant='true')
    add_time(report, 'end')
    sys.stdout.write(str(ET.tostring(report), 'UTF8'))

def json_from_file(name):
    try:
        with open(name) as json_file:
            return json.load(json_file)
    except Exception:
        t, value, tb = sys.exc_info()
        traceback.print_tb(tb)
        print("Badly formatted JSON file '" + name + "' error ", value)
        sys.exit(-1)

def run(file, path, units, category, test, description):
    val = json_from_file(file)[path]
    jenkinsreport(str(val), units, category, test, description)

if __name__ == "__main__":
    parser = argparse.ArgumentParser('statjenkins')
    parser.add_argument('file', help='file holding test json stat results')
    parser.add_argument('path', help='use xpath to search in json object')
    parser.add_argument('units', help='units of object')
    parser.add_argument('category', help='category to be used in report file')
    parser.add_argument('test', help='test name to be used in report')
    parser.add_argument('description', help='test description to be used in report')
    args = parser.parse_args()

    run(args.file, args.path, args.units, args.category, args.test, args.description)

Code to create the JSON file:

  echo '{ "op_rate" : ' `cat cassandra.perf | grep "op rate" | cut -f2 -d':'` ' }' > cassandra.perf.json

Code to create the XML file:

  tests/scripts/statjenkins.py cassandra.perf.json  op_rate op_rate cassandra-stress-write op_rate op_rate > jenkins_perf_cassandra_stress_write.xml
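The grep/cut one-liner above is brittle (extra colons or thousands separators in the cassandra-stress summary break it). A Python sketch of the same extraction; the function name and the sample line format are assumptions for illustration, not SCT code:

```python
import json
import re

def op_rate_to_json(perf_text: str) -> str:
    """Extract the 'op rate' value from cassandra-stress output and
    return it as a JSON document, mirroring the shell one-liner."""
    for line in perf_text.splitlines():
        if "op rate" in line.lower():
            # take the first number after the colon, ignore units like "op/s"
            match = re.search(r":\s*([\d,.]+)", line)
            if match:
                value = float(match.group(1).replace(",", ""))
                return json.dumps({"op_rate": value})
    raise ValueError("no 'op rate' line found")

# Example cassandra-stress summary line (format assumed):
sample = "op rate                   : 52144 op/s  [WRITE: 52144 op/s]"
print(op_rate_to_json(sample))  # {"op_rate": 52144.0}
```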

GCE: network isn't ready when setting up collectd

2017-02-16 18:10:00,012 remote           L0194 DEBUG| [35.185.60.228] [stdout] Feb 16 18:10:00 info   |  [shard 2] compaction - Compacted 4 sstables to [/var/lib/scylla/data/keyspace1/standard1-698f2eb0f46e11e68fcf000000000001/keyspace1-standard1-ka-58-Data.db:level=0, ]. 596289868 bytes to 178508341 (~29% of original) in 27279ms = 6.24065MB/s. ~2122624 total partitions merged to 635261.
2017-02-16 18:10:00,856 cluster          L2370 INFO | GCE Cluster longevity-1-6-gce-scylla-db-cluster-f9485b00 | Image: centos-7 | Root Disk: pd-ssd 50 GB | Local SSD: 1 | Type: n1-highmem-8: added nodes: [<sdcm.cluster.GCENode object at 0x4ef9110>]
2017-02-16 18:10:00,857 remote           L0862 DEBUG| Remote [[email protected]]: Running 'sudo yum install -y epel-release'
2017-02-16 18:10:00,858 remote           L0094 INFO | [35.185.37.0] Running '/bin/ssh -tt -a -x  -o ControlPath=/var/tmp/ssh-master9aSJLc/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmppt5hya -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l scylla-test -p 22 -i ~/.ssh/scylla-test 35.185.37.0 "sudo yum install -y epel-release"'
2017-02-16 18:10:07,896 remote           L0194 DEBUG| [35.185.37.0] [stderr] ssh: connect to host 35.185.37.0 port 22: Connection refused
2017-02-16 18:10:07,897 nemesis          L0266 ERROR| sdcm.nemesis.ChaosMonkey: Unhandled exception in method <function disrupt at 0x4bde668>
Traceback (most recent call last):
  File "/jenkins/slave/workspace/scylla-1.6-longevity-gce/label/gce/sdcm/nemesis.py", line 263, in wrapper
    result = method(*args, **kwargs)
  File "/jenkins/slave/workspace/scylla-1.6-longevity-gce/label/gce/sdcm/nemesis.py", line 332, in disrupt
    self.call_random_disrupt_method()
  File "/jenkins/slave/workspace/scylla-1.6-longevity-gce/label/gce/sdcm/nemesis.py", line 240, in call_random_disrupt_method
    disrupt_method()
  File "/jenkins/slave/workspace/scylla-1.6-longevity-gce/label/gce/sdcm/nemesis.py", line 226, in disrupt_nodetool_decommission
    self.cluster.wait_for_init(node_list=new_nodes)
  File "/jenkins/slave/workspace/scylla-1.6-longevity-gce/label/gce/sdcm/cluster.py", line 2804, in wait_for_init
    self.collectd_setup.install(node)
  File "/jenkins/slave/workspace/scylla-1.6-longevity-gce/label/gce/sdcm/collectd.py", line 429, in install
    self.node.remoter.run('sudo yum install -y epel-release')
  File "/jenkins/slave/workspace/scylla-1.6-longevity-gce/label/gce/sdcm/remote.py", line 873, in run
    watch_stdout_pattern=watch_stdout_pattern)
  File "/jenkins/slave/workspace/scylla-1.6-longevity-gce/label/gce/sdcm/remote.py", line 817, in _run
    watch_stdout_pattern=watch_stdout_pattern)
  File "/jenkins/slave/workspace/scylla-1.6-longevity-gce/label/gce/sdcm/remote.py", line 215, in ssh_run
    raise process.CmdError(cmd, sp.result)
CmdError: Command '/bin/ssh -tt -a -x  -o ControlPath=/var/tmp/ssh-master9aSJLc/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmppt5hya -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l scylla-test -p 22 -i ~/.ssh/scylla-test 35.185.37.0 "sudo yum install -y epel-release"' failed (rc=255)

We should make sure the node's SSH is up before running collectd_setup:

2659 class ScyllaGCECluster(GCECluster, BaseScyllaCluster):
.....
2775     def wait_for_init(self, node_list=None, verbose=False):
......
2802         # avoid using node.remoter in thread
2803         for node in node_list:
2804             self.collectd_setup.install(node)
2805 
2806         seed = node_list[0].private_ip_address
2807         for node in node_list:
2808             setup_thread = threading.Thread(target=node_setup,
2809                                             args=(node, seed))
2810             setup_thread.daemon = True
2811             setup_thread.start()
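A minimal sketch of such a wait, assuming we only need the TCP port to accept connections before proceeding (an illustrative helper, not the actual SCT code):

```python
import socket
import time

def wait_for_port(host: str, port: int = 22, timeout: float = 300.0) -> bool:
    """Poll until a TCP connection to host:port succeeds, or give up
    after `timeout` seconds. Returns True when the port is reachable."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(2)
    return False
```

The cluster code could then call something like wait_for_port(node_address) before collectd_setup.install(node), so the `yum install` over SSH never hits a refused connection.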

sdcm/cluster.py: Fail to parse downloaded scylla.yaml

scylla.yaml was downloaded to localhost, but avocado failed to parse it to get the seed IP addresses.
The content shown in the error output is blank.

Avocado Code: sdcm/cluster.py

class BaseScyllaCluster(object):

    def get_seed_nodes_private_ips(self):
        if self.seed_nodes_private_ips is None:
            node = self.nodes[0]
            yaml_dst_path = os.path.join(tempfile.mkdtemp(
                prefix='scylla-longevity'), 'scylla.yaml')
            node.remoter.receive_files(src='/etc/scylla/scylla.yaml',
                                       dst=yaml_dst_path)
            with open(yaml_dst_path, 'r') as yaml_stream:
                conf_dict = yaml.load(yaml_stream)
                try:
                    self.seed_nodes_private_ips = conf_dict['seed_provider'][
                        0]['parameters'][0]['seeds'].split(',')
                except:
                    raise ValueError('Unexpected scylla.yaml '
                                     'contents:\n%s' % yaml_stream.read())
        return self.seed_nodes_private_ips

The downloaded scylla.yaml is fine:

$ grep seed  /var/tmp/scylla-longevityLPTTxz/scylla.yaml
# seed_provider class_name is saved for future use.
# seeds address are mandatory!
seed_provider:
          # seeds is actually a comma-delimited list of addresses.
          - seeds: "172.31.31.137"
#    connectivity.  (Thus, you should set seed addresses to the public

Avocado logs:

Cluster amostation-jks-scylla-loader-set-2c40a9c2 (AMI: ami-e2246382 Type: c3.large): Setup duration -> 115 s
Cluster amostation-jks-scylla-db-cluster-5e5389cb (AMI: ami-e2246382 Type: c3.large): (1/3) DB nodes ready. Time elapsed: 95 s
Cluster amostation-jks-scylla-db-cluster-5e5389cb (AMI: ami-e2246382 Type: c3.large): (2/3) DB nodes ready. Time elapsed: 95 s
Cluster amostation-jks-scylla-db-cluster-5e5389cb (AMI: ami-e2246382 Type: c3.large): (3/3) DB nodes ready. Time elapsed: 96 s
PARAMS (key=update_db_binary, path=*, default=None) => ''
Remote [[email protected]]: Receive files (src) /etc/scylla/scylla.yaml -> (dst) /var/tmp/scylla-longevityLPTTxz/scylla.yaml
Cleaning up resources used in the test
Cluster amostation-jks-scylla-db-cluster-5e5389cb (AMI: ami-e2246382 Type: c3.large): Stop nemesis begin
Cluster amostation-jks-scylla-db-cluster-5e5389cb (AMI: ami-e2246382 Type: c3.large): Stop nemesis end
Cluster amostation-jks-scylla-db-cluster-5e5389cb (AMI: ami-e2246382 Type: c3.large): Destroy nodes
Node amostation-jks-scylla-db-node-5e5389cb-1 [54.183.174.176 | 172.31.31.138] (seed: None): Destroyed
Node amostation-jks-scylla-db-node-5e5389cb-2 [52.53.173.157 | 172.31.31.139] (seed: None): Destroyed
[54.183.179.46] [stdout] Jul 20 00:42:10 ip-172-31-31-137 scylla[2158]:  [shard 0] gossip - InetAddress 172.31.31.138 is now DOWN
[52.53.173.157] [stderr] Connection to 52.53.173.157 closed by remote host.
Node amostation-jks-scylla-db-node-5e5389cb-3 [54.183.179.46 | 172.31.31.137] (seed: None): Destroyed
Cluster amostation-jks-scylla-loader-set-2c40a9c2 (AMI: ami-e2246382 Type: c3.large): Destroy nodes
Node amostation-jks-scylla-loader-node-2c40a9c2-1 [54.67.57.26 | 172.31.30.7] (seed: None): Destroyed
[54.67.57.26] [stderr] Connection to 54.67.57.26 closed by remote host.
Key Pair amostation-jks-longevity-test-3b494cb6 -> /var/tmp/amostation-jks-longevity-test-3b494cb6.pem: Destroyed

Reproduced traceback from: /usr/lib/python2.7/site-packages/avocado-37.0-py2.7.egg/avocado/core/test.py:435
Traceback (most recent call last):
  File "/home/amos/scylladb.com/scylla-cluster-tests/sdcm/tester.py", line 110, in wrapper
    return method(*args, **kwargs)
  File "/home/amos/scylladb.com/scylla-cluster-tests/sdcm/tester.py", line 141, in setUp
    self.db_cluster.wait_for_init()
  File "/home/amos/scylladb.com/scylla-cluster-tests/sdcm/cluster.py", line 1344, in wait_for_init
    self.get_seed_nodes()
  File "/home/amos/scylladb.com/scylla-cluster-tests/sdcm/cluster.py", line 673, in get_seed_nodes
    seed_nodes_private_ips = self.get_seed_nodes_private_ips()
  File "/home/amos/scylladb.com/scylla-cluster-tests/sdcm/cluster.py", line 669, in get_seed_nodes_private_ips
    'contents:\n%s' % yaml_stream.read())
ValueError: Unexpected scylla.yaml contents:


ERROR 1-simple_regression_test.py:SimpleRegressionTest.test_simple_regression -> TestSetupFail: Unexpected scylla.yaml contents:
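A side note on why the message above is blank: yaml.load() consumes the file stream, so the yaml_stream.read() call inside the except block runs at EOF and returns an empty string. A sketch of a fix, reading the file once and reusing the text (illustrative code, not the actual sdcm/cluster.py; requires PyYAML):

```python
import io
import yaml  # PyYAML

def read_seeds(yaml_stream):
    """Parse the seeds list from a scylla.yaml stream; on failure,
    report the actual file contents instead of an empty string."""
    content = yaml_stream.read()          # read once, keep the text around
    conf_dict = yaml.safe_load(content)
    try:
        return conf_dict['seed_provider'][0]['parameters'][0]['seeds'].split(',')
    except (KeyError, IndexError, TypeError):
        raise ValueError('Unexpected scylla.yaml contents:\n%s' % content)

seeds = read_seeds(io.StringIO(
    'seed_provider:\n'
    '  - parameters:\n'
    '      - seeds: "172.31.31.137"\n'
))
print(seeds)  # ['172.31.31.137']
```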


Not logging /proc/slabinfo (lack of permissions)

libcloud failed: 'NoneType' object has no attribute 'makefile'

The call to list_nodes() before resource cleanup fails. If it's a Python or libcloud issue, we should work around it inside SCT.

Node longevity-1-6-scylla-db-node-04f7ec6a-2 [104.196.185.46 | 10.240.0.27] (seed: None):
Call to method <bound method GCENodeDriver.list_nodes of
<libcloud.compute.drivers.gce.GCENodeDriver object at 0x7f3b8bc91dd0>>
failed: 'NoneType' object has no attribute 'makefile'
Cleaning up resources used in the test

(private) job/scylla-1.6-longevity-gce/label=master/22/console
Python Issue (no update): http://bugs.python.org/issue8728

Longevity GCE hang at the end of DecommissionMonkey

We have 6 nodes in the longevity test. At the end of DecommissionMonkey, 5 nodes are ready, but the last one never becomes ready, and the job hangs there.

Problem 1: the node wasn't added correctly because of a node-name conflict:
Created instances: [<GCEFailedNode name="longevity-1-6-gce-scylla-db-node-ad585c86-000" error_code="alreadyExists">]

Problem 2: added nodes aren't returned correctly by self.cluster.add_nodes(count=1)

I will send a PR to fix those two problems.

Repeated delete key pair (pem file) in cleanup stage

Priority: low

Description: At the end of job.log, there is an error "[Errno 2] No such file or directory *.pem"

Event needs-retry.ec2.DeleteKeyPair: calling handler <botocore.retryhandler.RetryHandler object at 0x7f80d6fdc710>
No retry needed.
Response: {'ResponseMetadata': {'HTTPStatusCode': 200, 'RequestId': '291fb7e5-7a0c-4e9d-bce8-d33a203d73fe', 'HTTPHeaders': {'transfer-encoding': 'chunked', 'vary': 'Accept-Encoding', 'server': 'AmazonEC2', 'content-type': 'text/xml;charset=UTF-8', 'date': 'Thu, 22 Dec 2016 15:19:01 GMT'}}}
[Errno 2] No such file or directory: '/var/tmp/amostation-jks-sct-d360f1a8.pem'
Test results available in /home/amos/scylladb.com/sct-jenkins-perf/job-2016-12-22T23.06-0ae4d55

AttributeError: 'GCENode' object has no attribute 'prometheus_data_dir'

The Prometheus data download is attempted before the Prometheus setup has run, so it fails.

2017-01-05 10:12:17,343 stacktrace       L0039 ERROR| Reproduced traceback from: /usr/lib/python2.7/site-packages/avocado/core/test.py:422
2017-01-05 10:12:17,343 stacktrace       L0042 ERROR| Traceback (most recent call last):
2017-01-05 10:12:17,343 stacktrace       L0042 ERROR|   File "sdcm/tester.py", line 131, in wrapper
2017-01-05 10:12:17,343 stacktrace       L0042 ERROR|     args[0].clean_resources()
2017-01-05 10:12:17,343 stacktrace       L0042 ERROR|   File "sdcm/tester.py", line 760, in clean_resources
2017-01-05 10:12:17,343 stacktrace       L0042 ERROR|     self.monitors.destroy()
2017-01-05 10:12:17,343 stacktrace       L0042 ERROR|   File "sdcm/cluster.py", line 3138, in destroy
2017-01-05 10:12:17,343 stacktrace       L0042 ERROR|     self.download_monitor_data()
2017-01-05 10:12:17,343 stacktrace       L0042 ERROR|   File "sdcm/cluster.py", line 1778, in download_monitor_data
2017-01-05 10:12:17,343 stacktrace       L0042 ERROR|     node.download_prometheus_data_dir()
2017-01-05 10:12:17,343 stacktrace       L0042 ERROR|   File "sdcm/cluster.py", line 534, in download_prometheus_data_dir
2017-01-05 10:12:17,343 stacktrace       L0042 ERROR|     (self.remoter.user, self.remoter.user, self.prometheus_data_dir))
2017-01-05 10:12:17,343 stacktrace       L0042 ERROR| AttributeError: 'GCENode' object has no attribute 'prometheus_data_dir'

support aggregating the results of multiple c-s (also in the plot)

we support running multiple loaders

we need to support aggregation of the results at the end, and also aggregation during execution, to plot the aggregated load on the system.

how should we plot the latency? we should not aggregate it. Maybe we should have a bar at each point showing the difference between the loaders: if the bar is small the difference is small, if the bar is large the difference is large.
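The idea above can be sketched as follows (illustrative code, not SCT's implementation): op rates are additive across loaders and can simply be summed, while latencies are kept as a min/max range whose width is the "bar" showing loader-to-loader difference.

```python
def aggregate_point(loader_samples):
    """Aggregate one time point across loaders.

    loader_samples: list of (op_rate, latency_ms) tuples, one per loader.
    Op rates are additive, so they are summed; latencies are not, so we
    report their range -- the 'bar' showing loader-to-loader spread.
    """
    op_rates = [s[0] for s in loader_samples]
    latencies = [s[1] for s in loader_samples]
    return {
        "op_rate_total": sum(op_rates),
        "latency_min": min(latencies),
        "latency_max": max(latencies),
        "latency_spread": max(latencies) - min(latencies),
    }

point = aggregate_point([(25000, 4.2), (24000, 5.1), (26000, 4.4)])
print(point["op_rate_total"], round(point["latency_spread"], 3))  # 75000 0.9
```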

longevity job blocked at sending prometheus config file from jenkins slave to monitor

One longevity job blocked while sending the Prometheus config file from
the Jenkins slave to the monitor. It's caused by the huge timeout used for rsync.

The default timeout of rsync is 0s; the current code uses 100000s
(about 28 hours). This timeout doesn't bound the whole rsync command:
rsync exits only if no data is transferred for the specified time.

The SSH connect timeout is 300s, so let's also set the I/O timeout of rsync to 300s.

rsync cmdline to copy file:

jenkins  13553  0.0  0.0 114644   988 ?        S    Mar08   0:00 rsync -L --timeout=100000 --rsh=/bin/ssh -tt -a -x -o ControlPath=/var/tmp/ssh-masterizcx19/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmpBz8w6J -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l scylla-test -p 22 -i ~/.ssh/scylla-test -az /var/tmp/scm-prometheusJrGLwZ/prometheus-scylla.yml [email protected]:/var/tmp/prometheus-1.0.2.linux-amd64/prometheus-scylla.yml

jenkins  13556  0.0  0.0  73808  3148 ?        S    Mar08   0:00 /bin/ssh -tt -a -x -o ControlPath=/var/tmp/ssh-masterizcx19/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmpBz8w6J -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l scylla-test -p 22 -i ~/.ssh/scylla-test -l scylla-test 10.240.0.18 rsync --server -logDtprze.iLsf --timeout=100000 . /var/tmp/prometheus-1.0.2.linux-amd64/prometheus-scylla.yml

rsync error: protocol incompatibility (code 2)

Description

rsync raised a protocol-incompatibility error when sending a file from the Jenkins slave to the monitor; plain ssh/scp is fine.

$ rsync -L  --timeout=300 --rsh='/bin/ssh -tt -a -x  -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmpSijbQU -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l scylla-test -p 22 -i /jenkins/.ssh/scylla-test' -az /tmp/a.out [email protected]:"/tmp/b.out"
protocol version mismatch -- is your shell clean?
(see the rsync man page for an explanation)
rsync error: protocol incompatibility (code 2) at compat.c(174) [sender=3.0.9]

rsync version on the monitor machine:

[scylla-test@longevity-1-7-gce-amosbranch-scylla-monitor-node-b57152ef-000 ~]$ rsync --version
rsync  version 3.0.9  protocol version 30

it's fine to log in to the monitor by SSH key:

[jenkins@public-jenkins-builder1 scylla-test]$ ssh 10.240.0.14 -lscylla-test -i ~/.ssh/scylla-test
The authenticity of host '10.240.0.14 (10.240.0.14)' can't be established.
ECDSA key fingerprint is 65:7b:f1:d0:82:2d:67:51:f8:c7:78:ab:e7:50:99:31.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.240.0.14' (ECDSA) to the list of known hosts.
Last login: Sat Mar 11 13:54:36 2017 from public-jenkins-builder1.c.skilled-adapter-452.internal
[scylla-test@longevity-1-7-gce-amosbranch-scylla-monitor-node-b57152ef-000 ~]$ 

disrupt:decommission: some threads aren't cleaned up when destroying a node

  • Error description:

I saw some "Error 24: Too many open files" errors in the longevity-gce-1.7 job; they occurred after the job had run for 3 days.

I worked around this issue by adjusting the 'open files' limit for the avocado process.

# the original limit is 4096; I increased it to 10000, which should be enough for a job that finishes after 3 days
prlimit -n10000 -p pid_of_process
  • Problem in nemesis:

There are many database.log files left open; I found the journal_thread isn't stopped when we destroy the target node in the disrupt-decommission test.

sdcm/remoter.py: send_files/receive_files: scp cmdline doesn't use the right SSH key

  • Description:

If copying files with rsync fails, scp is tried as a fallback. But currently the scp cmdline doesn't use the right SSH key, so it fails with a permission error.

rsync cmdline to copy file:

jenkins  13553  0.0  0.0 114644   988 ?        S    Mar08   0:00 rsync -L --timeout=100000 --rsh=/bin/ssh -tt -a -x -o ControlPath=/var/tmp/ssh-masterizcx19/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmpBz8w6J -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l scylla-test -p 22 -i ~/.ssh/scylla-test -az /var/tmp/scm-prometheusJrGLwZ/prometheus-scylla.yml [email protected]:/var/tmp/prometheus-1.0.2.linux-amd64/prometheus-scylla.yml
jenkins  13556  0.0  0.0  73808  3148 ?        S    Mar08   0:00 /bin/ssh -tt -a -x -o ControlPath=/var/tmp/ssh-masterizcx19/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmpBz8w6J -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l scylla-test -p 22 -i ~/.ssh/scylla-test -l scylla-test 10.240.0.18 rsync --server -logDtprze.iLsf --timeout=100000 . /var/tmp/prometheus-1.0.2.linux-amd64/prometheus-scylla.yml

scp cmdline to copy file:

jenkins  32626  0.0  0.0  52692  1936 ?        S    09:42   0:00 scp -rq -o ControlPath=/var/tmp/ssh-masterizcx19/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmpBz8w6J -P 22 /var/tmp/scm-prometheusJrGLwZ/prometheus-scylla.yml [email protected]:"/var/tmp/prometheus-1.0.2.linux-amd64/prometheus-scylla.yml"

CassandraAWSCluster inherits ScyllaAWSCluster, not AWSCluster + BaseScyllaCluster

@lmr any hint?

diff --git a/sdcm/cluster.py b/sdcm/cluster.py
index 70bbec3..96a29fb 100644
--- a/sdcm/cluster.py
+++ b/sdcm/cluster.py
@@ -2859,7 +2859,7 @@ class ScyllaAWSCluster(AWSCluster, BaseScyllaCluster):
super(ScyllaAWSCluster, self).destroy()

-class CassandraAWSCluster(ScyllaAWSCluster):
+class CassandraAWSCluster(AWSCluster, BaseScyllaCluster):

 def __init__(self, ec2_ami_id, ec2_subnet_id, ec2_security_group_ids,
              service, credentials, ec2_instance_type='c4.xlarge',
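A toy illustration of why the base classes matter (the class names mimic the real ones, but the bodies are made up for this sketch): inheriting ScyllaAWSCluster drags in its Scylla-specific overrides, while inheriting AWSCluster + BaseScyllaCluster composes only the generic pieces.

```python
class AWSCluster:            # stands in for the real AWSCluster
    def destroy(self):
        return "aws destroy"

class BaseScyllaCluster:     # mixin with Scylla-specific helpers
    def get_seed_nodes(self):
        return ["seed-1"]

class ScyllaAWSCluster(AWSCluster, BaseScyllaCluster):
    def destroy(self):
        # Scylla-specific teardown, then the generic AWS teardown
        return "scylla " + super().destroy()

# Before the patch: Cassandra inherits ScyllaAWSCluster and therefore
# also inherits Scylla-specific overrides such as destroy():
class CassandraBefore(ScyllaAWSCluster):
    pass

# After the patch: Cassandra composes the base pieces directly and
# skips the Scylla-specific overrides:
class CassandraAfter(AWSCluster, BaseScyllaCluster):
    pass

print(CassandraBefore().destroy())  # scylla aws destroy
print(CassandraAfter().destroy())   # aws destroy
```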

Grafana not installed correctly

Hi,

2 out of 4 tests I ran were not able to start Grafana correctly.
Prometheus was installed correctly and worked.

Here is the relevant log file:

2016-08-08 10:38:21,450 remote           L0832 DEBUG| Remote [[email protected]]: Running 'sudo grafana-cli plugins install grafana-piechart-panel'
2016-08-08 10:38:21,450 remote           L0094 INFO | [54.175.131.251] Running '/bin/ssh -a -x  -o ControlPath=/var/tmp/ssh-masterfWVukH/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmp_3gDSy -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l centos -p 22 -i /var/tmp/vagrant-longevity-test-8db5c1d9.pem 54.175.131.251 "sudo grafana-cli plugins install grafana-piechart-panel"'
2016-08-08 10:38:25,400 remote           L0194 DEBUG| [54.175.131.251] [stdout] installing grafana-piechart-panel @ 1.1.1
2016-08-08 10:38:25,401 remote           L0194 DEBUG| [54.175.131.251] [stdout] from url: https://grafana.net/api/plugins/grafana-piechart-panel/versions/1.1.1/download
2016-08-08 10:38:25,401 remote           L0194 DEBUG| [54.175.131.251] [stdout] into: /var/lib/grafana/plugins
2016-08-08 10:38:25,401 remote           L0194 DEBUG| [54.175.131.251] [stdout]
2016-08-08 10:38:28,968 remote           L0194 DEBUG| [54.175.131.251] [stdout]
2016-08-08 10:38:28,968 remote           L0194 DEBUG| [54.175.131.251] [stdout] Error: zip: not a valid zip file
2016-08-08 10:38:28,969 remote           L0194 DEBUG| [54.175.131.251] [stdout]
2016-08-08 10:38:28,969 remote           L0194 DEBUG| [54.175.131.251] [stdout] NAME:
2016-08-08 10:38:28,969 remote           L0194 DEBUG| [54.175.131.251] [stdout]    Grafana cli plugins install - install <plugin id>
2016-08-08 10:38:28,969 remote           L0194 DEBUG| [54.175.131.251] [stdout]
2016-08-08 10:38:28,969 remote           L0194 DEBUG| [54.175.131.251] [stdout] USAGE:
2016-08-08 10:38:28,970 remote           L0194 DEBUG| [54.175.131.251] [stdout]    Grafana cli plugins install [arguments...]
2016-08-08 10:38:29,205 output           L0632 DEBUG| Exception in thread Thread-27:
2016-08-08 10:38:29,205 output           L0632 DEBUG| Traceback (most recent call last):
2016-08-08 10:38:29,205 output           L0632 DEBUG|   File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
2016-08-08 10:38:29,206 output           L0632 DEBUG|     self.run()
2016-08-08 10:38:29,206 output           L0632 DEBUG|   File "/usr/lib64/python2.7/threading.py", line 764, in run
2016-08-08 10:38:29,206 output           L0632 DEBUG|     self.__target(*self.__args, **self.__kwargs)
2016-08-08 10:38:29,206 output           L0632 DEBUG|   File "/home/vagrant/scylla-cluster-tests/sdcm/cluster.py", line 1303, in node_setup
2016-08-08 10:38:29,206 output           L0632 DEBUG|     node.install_grafana()
2016-08-08 10:38:29,207 output           L0632 DEBUG|   File "/home/vagrant/scylla-cluster-tests/sdcm/cluster.py", line 277, in install_grafana
2016-08-08 10:38:29,208 output           L0632 DEBUG|     self.remoter.run('sudo grafana-cli plugins install grafana-piechart-panel')
2016-08-08 10:38:29,209 output           L0632 DEBUG|   File "/home/vagrant/scylla-cluster-tests/sdcm/remote.py", line 843, in run
2016-08-08 10:38:29,209 output           L0632 DEBUG|     watch_stdout_pattern=watch_stdout_pattern)
2016-08-08 10:38:29,210 output           L0632 DEBUG|   File "/home/vagrant/scylla-cluster-tests/sdcm/remote.py", line 808, in _run
2016-08-08 10:38:29,210 output           L0632 DEBUG|     watch_stdout_pattern=watch_stdout_pattern)
2016-08-08 10:38:29,211 output           L0632 DEBUG|   File "/home/vagrant/scylla-cluster-tests/sdcm/remote.py", line 215, in ssh_run
2016-08-08 10:38:29,211 output           L0632 DEBUG|     raise process.CmdError(cmd, sp.result)
2016-08-08 10:38:29,212 output           L0632 DEBUG| CmdError: Command '/bin/ssh -a -x  -o ControlPath=/var/tmp/ssh-masterfWVukH/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmp_3gDSy -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l centos -p 22 -i /var/tmp/vagrant-longevity-test-8db5c1d9.pem 54.175.131.251 "sudo grafana-cli plugins install grafana-piechart-panel"' failed (rc=1)
2016-08-08 10:38:29,245 output           L0632 DEBUG|
2016-08-08 10:42:49,957 remote           L0194 DEBUG| [54.175.131.251] [stderr] time="2016-08-08T10:42:48Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:539"
2016-08-08 10:42:51,900 remote           L0194 DEBUG| [54.175.131.251] [stderr] time="2016-08-08T10:42:50Z" level=info msg="Done checkpointing in-memory metrics and chunks in 1.924280976s." source="persistence.go:563"
2016-08-08 10:47:51,857 remote           L0194 DEBUG| [54.175.131.251] [stderr] time="2016-08-08T10:47:50Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:539"
2016-08-08 10:47:53,285 remote           L0194 DEBUG| [54.175.131.251] [stderr] time="2016-08-08T10:47:51Z" level=info msg="Done checkpointing in-memory metrics and chunks in 1.221837148s." source="persistence.go:563"

I used the following commit:

commit ceb34c99a0550a9854fdf6005db680a94f6f4e1a
Merge: 0afe46e eafb16f
Author: Lucas Meneghel Rodrigues <[email protected]>
Date:   Thu Aug 4 06:34:25 2016 -0300

    Merge pull request #163 from kongove/libvirt_uri

    sdcm: replace fixed Libvirt uri

RFC: Additional new nemesis

Hi,

We need additional nemeses for testing:

  • snapshot
  • flush
  • remove node
  • add node
  • cross DC repair
  • major compaction
  • cleanup
  • refresh

Thanks,
Noam.
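Sketching what the nodetool-shaped items above might look like, picking one random target node per run (the command strings and names are assumptions for illustration, not confirmed SCT code; remove node, add node, and cross-DC repair need more orchestration than a single nodetool call):

```python
# Hedged sketch: map each nodetool-shaped nemesis to the command it would
# most likely wrap, and pick a single random target node.
import random

NEW_NEMESIS_COMMANDS = {
    'snapshot':         'nodetool -h localhost snapshot',
    'flush':            'nodetool -h localhost flush',
    'major_compaction': 'nodetool -h localhost compact',
    'cleanup':          'nodetool -h localhost cleanup',
    'refresh':          'nodetool -h localhost refresh',
}

def plan_nemesis(name, nodes):
    """Return the (command, target node) a nemesis run would use."""
    return NEW_NEMESIS_COMMANDS[name], random.choice(nodes)
```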

Longevity GCE: Job hung at disrupt_stop_wait_start_scylla_server()->wait_db_up(); c-s client exited after 30 mins

Jenkins job 68

Description:

  1. The longevity test ran the disrupt_stop_wait_start_scylla_server() nemesis and hung while waiting for the DB to come up.
    Detail: it hangs between line 884 and line 885.

Actually, the DB is already up.

 875     def _report_housekeeping_uuid(self, verbose=True):
 876         """
 877         report uuid of test db nodes to ScyllaDB
 878         """
 879         uuid_path = '/var/lib/scylla-housekeeping/housekeeping.uuid'
 880         mark_path = '/var/lib/scylla-housekeeping/housekeeping.uuid.marked'
 881         cmd = 'curl "https://i6a5h9l1kl.execute-api.us-east-1.amazonaws.com/prod/check_version?uu=%s&mark=scylla"'
 882 
 883         uuid_result = self.remoter.run('test -e %s' % uuid_path,
 884                                        ignore_status=True, verbose=verbose)
 885         mark_result = self.remoter.run('test -e %s' % mark_path,
 886                                        ignore_status=True, verbose=verbose)
 887         if uuid_result.exit_status == 0 and mark_result.exit_status != 0:
 888             result = self.remoter.run('cat %s' % uuid_path, verbose=verbose)
 889             self.remoter.run(cmd % result.stdout.strip())
 890             self.remoter.run('sudo -u scylla touch %s' % mark_path,
 891                              verbose=verbose)
 892 
 893     def wait_db_up(self, verbose=True):
 894         text = None
 895         if verbose:
 896             text = '%s: Waiting for DB services to be up' % self
 897         wait.wait_for(func=self.db_up, step=60,
 898                       text=text)
 899         self._report_housekeeping_uuid()
  2. The SSH process that launched the c-s client on the Jenkins slave didn't exit; it is in 'S' status.
    The c-s client exited unexpectedly. I didn't find it on the loader, but there is no error in the job log.
 1406  8050  1240  1240 ?           -1 S     1010   0:00  |                           \_ /bin/ssh -tt -a -x -o ControlPath=
/var/tmp/ssh-mastertpljsc/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmpG59aNn -o BatchMode=yes -o C
onnectTimeout=300 -o ServerAliveInterval=300 -l scylla-test -p 22 -i /jenkins/.ssh/scylla-test 35.185.60.228 echo TAG: load
er_idx:0-cpu_idx:0-keyspace_idx:1; /home/scylla-test/run_cassandra_stress.sh 0 1
  3. Checking in Grafana, there is no data written to the cluster. I guess the c-s client exited at that point.

  4. I manually started a c-s client on the loader, so you can see some writes at the end.

  5. I manually stopped/started db-node-004 twice, so you can see some RPC errors at the end of the job.

  6. I aborted the job, and an error was output (no details):

2017-02-18 00:12:56,434 stacktrace       L0036 ERROR| 
2017-02-18 00:12:56,434 stacktrace       L0039 ERROR| Reproduced traceback from: /usr/lib/python2.7/site-packages/avocado/core/test.py:435
No details were printed.

Timeline (see the Grafana snapshot: Loader per server, Served Requests):

start: around 22:00
hang at wait_db_up(): around 2:30
c-s exited: around 3:00
started a c-s: around 7:30

(Screenshot: FireShot capture 104, Grafana "Scylla per server metrics" dashboard)

longevity is not able to progress

commit 979a8df

extract from http://jenkins.cloudius-systems.com:8080/job/scylla-1.6-longevity/label=master/3/consoleFull

Cluster longevity-1-6-scylla-loader-set-d376668f (AMI: ami-63504874 Type: c3.large): Stress script content:
mkfifo /tmp/cs_pipe_$1_$2; cat /tmp/cs_pipe_$1_$2|python /usr/bin/cassandra_stress_exporter & cassandra-stress write cl=QUORUM duration=1800m -schema keyspace=keyspace$1 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=1000 -pop seq=1..10000000 -node 172.30.0.124|tee /tmp/cs_pipe_$1_$2; pkill -P $$ -f cassandra_stress_exporter; rm -f /tmp/cs_pipe_$1_$2
.
.
.
.
0
Remote [[email protected]]: Running 'nodetool cfstats keyspace1'
[54.166.115.61] Running '/bin/ssh  -a -x  -o ControlPath=/var/tmp/ssh-master0WNg0b/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmphsiQsR -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l centos -p 22 -i /var/tmp/longevity-1-6-sct-e0ed9db3.pem 54.166.115.61 "nodetool cfstats keyspace1"'
[54.144.210.214] [stdout] total,        145854,   23953,   23953,   23953,    41.8,    40.3,    73.8,    97.3,   122.7,   134.4,    7.2,  0.07374,      0,      0,       0,       0,       0,       0
[54.144.210.214] [stdout] total,        176286,   25032,   25032,   25032,    39.9,    36.3,    77.2,    92.0,   108.5,   121.7,    8.4,  0.06864,      0,      0,       0,       0,       0,       0
[54.166.115.61] [stdout] nodetool: Unknown keyspace: keyspace1
[54.166.115.61] [stdout] See 'nodetool help' or 'nodetool help <command>'.

It seems that the keyspace name used by the keyspace-size check (keyspace1) does not align with the keyspace cassandra-stress actually created (keyspace$1, expanded with the stress index).

How about always compressing the coredump before uploading?

Right now, if the coredump is less than 5G, we upload it directly without compressing.

How about always compressing the coredump before uploading, to save space? It would only affect core files that are less than 5G.

In one longevity job (job 45), a 92G coredump was generated; it was compressed to 2.6G and uploaded without splitting.

diff --git a/sdcm/cluster.py b/sdcm/cluster.py
index 536c6d5..a2377d2 100644
--- a/sdcm/cluster.py
+++ b/sdcm/cluster.py
@@ -689,7 +689,7 @@ WantedBy=multi-user.target
             self.log.error('Failed getting coredump file size: %s', ex)
         return None
 
-    def _split_coredump(self, coredump):
+    def _try_split_coredump(self, coredump):
         core_files = []
         try:
             self.remoter.run('sudo yum install -y pigz')
@@ -723,10 +723,7 @@ WantedBy=multi-user.target
                 coredump = line.split()[-1]
                 self.log.debug('Found coredump file: {}'.format(coredump))
                 file_size = self._get_coredump_size(coredump)
-                if file_size and file_size > COREDUMP_MAX_SIZE:
-                    coredump_files = self._split_coredump(coredump)
-                else:
-                    coredump_files = [coredump]
+                coredump_files = self._try_split_coredump(coredump)
                 for f in coredump_files:
                     self._upload_coredump(f)
                 if len(coredump_files) > 1 or coredump_files[0].endswith('.gz'):
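The flow the diff above aims at can be sketched like this (the pigz/split flags and the 5G chunking are assumptions for illustration, not the exact SCT commands):

```python
# Hedged sketch: always compress the coredump with pigz, and split it into
# chunks only if the compressed file is still over the upload limit.
COREDUMP_MAX_SIZE = 5 * 1024 ** 3  # 5G, in bytes

def coredump_upload_cmds(coredump, compressed_size):
    """Return the shell commands a node would run before uploading."""
    cmds = ['sudo pigz --fast --keep %s' % coredump]
    if compressed_size > COREDUMP_MAX_SIZE:
        # split the .gz into 5G chunks so each upload stays under the limit
        cmds.append('sudo split -d -b 5G %s.gz %s.gz.' % (coredump, coredump))
    return cmds
```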

nemesis repair_nodetool_rebuild should not run on all nodes

Hi,

Currently, the nemesis repair_nodetool_rebuild runs on all nodes.

It should run only on a single node or a couple at most.

    def repair_nodetool_rebuild(self):
        rebuild_cmd = 'nodetool -h localhost rebuild'
        queue = Queue.Queue()

        def run_nodetool(local_node):
            self._run_nodetool(rebuild_cmd, local_node)
            queue.put(local_node)
            queue.task_done()

        for node in self.cluster.nodes:
            setup_thread = threading.Thread(target=run_nodetool,
                                            args=(node,))
            setup_thread.daemon = True
            setup_thread.start()

        results = []
        while len(results) != len(self.cluster.nodes):
            try:
                results.append(queue.get(block=True, timeout=5))
            except Queue.Empty:
                pass

Thanks,
Noam.
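A minimal sketch of the suggested change, running rebuild on a small random subset instead of every node (helper names are hypothetical):

```python
# Hedged sketch: pick at most `count` random nodes to rebuild instead of
# looping over cluster.nodes.
import random

def pick_rebuild_targets(nodes, count=1):
    """Pick at most `count` random nodes to run 'nodetool rebuild' on."""
    count = min(count, len(nodes))
    return random.sample(nodes, count)

def repair_nodetool_rebuild(cluster, count=1):
    for node in pick_rebuild_targets(cluster.nodes, count):
        # in SCT this would call something like:
        #   self._run_nodetool('nodetool -h localhost rebuild', node)
        yield node
```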

new regression: cs_pipe causes SCT to get stuck reading data from the Queue

sdcm/cluster.py:
    def run_stress_thread(self, stress_cmd, timeout, output_dir, stress_num=1):
        stress_cmd = "mkfifo /tmp/cs_pipe; cat /tmp/cs_pipe|python /usr/bin/cassandra_stress_exporter & " + stress_cmd + "|tee /tmp/cs_pipe"

The bug was introduced by the following commit (cc @benoit-canet):

Author: Benoît Canet <[email protected]>
Date:   Tue Nov 29 13:18:39 2016 +0000

    loader: Add cassandra-stress exporter
    
    Also wire up collectd and collectd_exporter
    on the loader node.
    
    Finally take care of making prometheus polling
    the loaders.
    
    Signed-off-by: Benoît Canet <[email protected]>

I'm trying to kill cassandra_stress_exporter when c-s finishes and remove the cs_pipe.
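One possible shape of the fix, assuming a per-run pipe name plus explicit cleanup of the exporter and the pipe (paths and names are illustrative, not the final SCT code):

```python
# Hedged sketch: build the stress command with a unique pipe per
# (loader, cpu) pair, kill the exporter we spawned when c-s exits,
# and remove the pipe so reruns don't block on a stale fifo.
def build_stress_cmd(stress_cmd, loader_idx=0, cpu_idx=0):
    pipe = '/tmp/cs_pipe_%s_%s' % (loader_idx, cpu_idx)
    return ('mkfifo {pipe}; '
            'cat {pipe}|python /usr/bin/cassandra_stress_exporter & '
            '{cmd}|tee {pipe}; '
            'pkill -P $$ -f cassandra_stress_exporter; '  # kill only our child exporter
            'rm -f {pipe}'.format(pipe=pipe, cmd=stress_cmd))
```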

move to use the predefined dashboards of scylla-graphite-monitoring and not use the hardcoded ones

SCT is using a copy of the dashboards.

We are working on changing the metrics in Scylla, so these copied dashboards will no longer be aligned with the Scylla metrics.

We need to move to the dashboards (and probably the Docker container) from scylla-graphite-monitoring, to make sure the correct dashboards are used for each version.

We need a way to associate each test with a Scylla version, to know which branch/tag should be pulled and used.
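One way the version-to-branch association could look, assuming a `branch-X.Y` naming scheme on the monitoring repo (the scheme is an assumption, not confirmed):

```python
# Hedged sketch: derive the monitoring branch/tag to pull from the Scylla
# version the test runs against; fall back to master for unknown versions.
def monitoring_branch_for(scylla_version):
    """'1.6.1' -> 'branch-1.6'; non-numeric versions fall back to 'master'."""
    parts = scylla_version.split('.')
    if len(parts) >= 2 and all(p.isdigit() for p in parts[:2]):
        return 'branch-%s.%s' % (parts[0], parts[1])
    return 'master'
```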

node_setup: checking setup thread status

Node setup is done in threads, while the main process keeps polling the DB status; it will get stuck there if the DB never comes up. If node setup raises an exception in a thread (as seen in the log), the rest of the setup is not done, so the DB will never be up.

We should check the thread status (finished, exception, running) whenever we check whether the DB is up. That will save time, or at least get us out of the stuck state.

solution:

  • retry node setup when the thread raises an exception
  • or ignore the error (in longevity tests)
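The check described above could look roughly like this (a sketch, not SCT code): wrap the setup in a thread that records its exception, and consult thread state while polling the DB:

```python
# Hedged sketch: a setup thread that remembers its exception, plus a checker
# the main loop can call alongside db_up() to fail fast instead of hanging.
import threading

class SetupThread(threading.Thread):
    def __init__(self, target):
        super(SetupThread, self).__init__()
        self.daemon = True
        self._target_fn = target
        self.exc = None

    def run(self):
        try:
            self._target_fn()
        except Exception as e:  # remember the failure for the main process
            self.exc = e

def check_setup_threads(threads):
    """Return 'running' or 'finished', or re-raise the first setup exception."""
    for t in threads:
        if t.exc is not None:
            raise t.exc  # fail fast instead of waiting on wait_db_up() forever
    if any(t.is_alive() for t in threads):
        return 'running'
    return 'finished'
```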

Longevity test of GCE isn't stable

The job always hangs.

Failed to execute remote commands on the instances:

"sudo sync"' failed (rc=-15). Note: a negative rc means the process was killed by a signal, so -15 is SIGTERM, not errno 15 (ENOTBLK).
Other setup commands failed similarly.

systemd-tmpfiles: Failed to open directory /var/lib/systemd/coredump: Too many levels of symbolic links

[scylla-test@longevity-1-6-gce-scylla-db-node-cf272ad9-001 ~]$ ls -l /var/lib/systemd/coredump
lrwxrwxrwx. 1 root root 24 Feb 15 07:01 /var/lib/systemd/coredump -> /var/lib/scylla/coredump

Feb 15 07:03:53 longevity-1-6-gce-scylla-db-node-cf272ad9-001 sudo[12978]: scylla-test : TTY=pts/1 ; PWD=/home/scylla-test ; USER=root ; COMMAND=/bin/sync
Feb 15 07:03:54 longevity-1-6-gce-scylla-db-node-cf272ad9-001 sudo[12995]: scylla-test : TTY=pts/1 ; PWD=/home/scylla-test ; USER=root ; COMMAND=/bin/cat /etc/scylla.d/io.co
Feb 15 07:03:54 longevity-1-6-gce-scylla-db-node-cf272ad9-001 sudo[13012]: scylla-test : TTY=pts/1 ; PWD=/home/scylla-test ; USER=root ; COMMAND=/bin/systemctl enable scylla
Feb 15 07:03:54 longevity-1-6-gce-scylla-db-node-cf272ad9-001 polkitd[374]: Registered Authentication Agent for unix-process:13028:45299 (system bus name :1.41 [/usr/bin/pkt
Feb 15 07:03:54 longevity-1-6-gce-scylla-db-node-cf272ad9-001 systemd[1]: Reloading.
Feb 15 07:03:54 longevity-1-6-gce-scylla-db-node-cf272ad9-001 polkitd[374]: Unregistered Authentication Agent for unix-process:13028:45299 (system bus name :1.41, object pat
Feb 15 07:07:18 longevity-1-6-gce-scylla-db-node-cf272ad9-001 run-parts(/etc/cron.hourly)[13181]: finished 0yum-hourly.cron
Feb 15 07:07:34 longevity-1-6-gce-scylla-db-node-cf272ad9-001 sshd[13182]: Did not receive identification string from 113.108.21.16
Feb 15 07:12:12 longevity-1-6-gce-scylla-db-node-cf272ad9-001 systemd[1]: Starting Cleanup of Temporary Directories...
Feb 15 07:12:12 longevity-1-6-gce-scylla-db-node-cf272ad9-001 systemd-tmpfiles[13289]: Failed to open directory /var/lib/systemd/coredump: Too many levels of symbolic links
Feb 15 07:12:12 longevity-1-6-gce-scylla-db-node-cf272ad9-001 systemd[1]: Started Cleanup of Temporary Directories.
Feb 15 07:18:07 longevity-1-6-gce-scylla-db-node-cf272ad9-001 ntpd[11061]: 0.0.0.0 0612 02 freq_set kernel -0.232 PPM
Feb 15 07:18:07 longevity-1-6-gce-scylla-db-node-cf272ad9-001 ntpd[11061]: 0.0.0.0 0615 05 clock_sync

Failed to download the repo file:

[scylla-test@longevity-1-6-gce-scylla-db-node-cf272ad9-000 yum.repos.d]$ cat scylla.repo 
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>AD6BBA758C638EDF</RequestId><HostId>5tOi453otMHdUgN7yLnyYUk0pznuwj4nPnHTZJY4vy3MOcOXyc8PXD0xNQB0vq+mDGWNKqckJBM=</HostId></Error>[scylla-test@longevity-1-6-gce-scylla-db-node-cf272ad9-000 yum.repos.d]$ top

Longevity GCE: seed node is removed in DecommissionMonkey

Problem:
The seed node is removed in DecommissionMonkey: we set the 2nd node as the seed (it's the fastest one to come up), but some code assumes that all db nodes except the first are non-seeds, so the seed node is removed unexpectedly.

Result: there are no Served Requests after the seed is removed.

Solution:
Before we supported creating GCE instances in parallel, there were performance issues setting up db nodes, so we added code to choose the fastest db node as the seed, and it worked. But some SCT code assumes that the first node is the seed, so let's remove the code that adjusts the seed.

SSH cmd error: Pseudo-terminal will not be allocated because stdin is not a terminal.

SCT version:

commit 838cff4805df0eee47268b74780d1918c7277c63
Merge: b19d48b 6137254
Author: Lucas Meneghel Rodrigues <[email protected]>
Date:   Fri Dec 23 04:14:48 2016 -0200

    Merge pull request #214 from scylladb/random_fixes
    
    Random fixes

I logged into a remote GCE instance, ran screen, and ran SCT inside screen. Then I hit a problem: [stderr] sudo: sorry, you must have a tty to run sudo

2016-12-23 14:41:23,613 remote           L0094 INFO | [104.196.129.135] Running '/bin/ssh -t -a -x  -o ControlPath=/var/tmp/ssh-master7XVbeA/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmpwAN6sx -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l amos -p 22 -i /home/amos/.ssh/amos-openstack 104.196.129.135 "sudo yum install -y epel-release"'
2016-12-23 14:41:23,672 remote           L0194 DEBUG| [104.196.129.135] [stderr] Pseudo-terminal will not be allocated because stdin is not a terminal.
2016-12-23 14:41:24,055 remote           L0194 DEBUG| [104.196.129.135] [stderr] sudo: sorry, you must have a tty to run sudo

2016-12-23 14:20:10,967 remote           L0094 INFO | [104.196.154.12] Running '/bin/ssh -t -a -x  -o ControlPath=/var/tmp/ssh-masterZyN0WH/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmpAzNFZs -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l amos -p 22 -i /home/amos/.ssh/amos-openstack 104.196.154.12 "journalctl -f --no-tail --no-pager -u scylla-ami-setup.service -u scylla-io-setup.service -u scylla-server.service -u scylla-jmx.service -o json | /var/tmp/sct_log_formatter"'
2016-12-23 14:20:10,983 remote           L0194 DEBUG| [104.196.154.12] [stderr] Pseudo-terminal will not be allocated because stdin is not a terminal.
2016-12-23 14:20:11,378 remote           L0194 DEBUG| [104.196.154.12] [stderr] Failed to get realtime timestamp: Cannot assign requested address
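For background: `ssh -t` only allocates a pseudo-terminal when the local stdin is itself a terminal, which is why running SCT from a non-tty context (CI, a pipe, a detached screen) produces this warning and then `sudo: sorry, you must have a tty to run sudo`. Passing `-tt` forces allocation regardless. A tiny sketch (the helper name is hypothetical):

```python
# Hedged sketch of choosing the ssh tty flag: -tt forces pseudo-terminal
# allocation even when local stdin is not a tty; plain -t does not.
import sys

def ssh_tty_flag(force=True, stdin=sys.stdin):
    if force:
        return '-tt'  # force tty allocation even without a local tty
    return '-t' if stdin.isatty() else ''
```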

Cannot retrieve metalink for repository: epel/x86_64

In this job, the instance installed epel-release successfully, but then failed to install collectd from EPEL due to a yum issue.
It's fine in other jobs. Let's record this issue here; we can close it if it was just a network issue.

private job/scylla-ami-perf-regression-dev/75/slave=muninn,sub_test=test_write/artifact/latest/job.log

2017-01-05 10:08:36,174 remote           L0862 DEBUG| Remote [[email protected]]: Running 'sudo yum install -y epel-release'
2017-01-05 10:08:36,174 remote           L0094 INFO | [104.196.30.192] Running '/bin/ssh -t -a -x  -o ControlPath=/var/tmp/ssh-mastera5xFnU/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmpkCL1kv -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l scylla-test -p 22 -i ~/.ssh/scylla-test 104.196.30.192 "sudo yum install -y epel-release"'
2017-01-05 10:08:36,182 remote           L0194 DEBUG| [104.196.30.192] [stderr] Pseudo-terminal will not be allocated because stdin is not a terminal.
2017-01-05 10:08:36,853 remote           L0194 DEBUG| [104.196.30.192] [stdout] Loaded plugins: fastestmirror
2017-01-05 10:08:41,750 remote           L0194 DEBUG| [104.196.30.192] [stdout] Determining fastest mirrors
2017-01-05 10:08:41,988 remote           L0194 DEBUG| [104.196.30.192] [stdout]  * base: reflector.westga.edu
2017-01-05 10:08:41,990 remote           L0194 DEBUG| [104.196.30.192] [stdout]  * extras: centos.mirror.nac.net
2017-01-05 10:08:41,990 remote           L0194 DEBUG| [104.196.30.192] [stdout]  * updates: mirror.umd.edu
2017-01-05 10:08:43,420 remote           L0194 DEBUG| [104.196.30.192] [stdout] Resolving Dependencies
2017-01-05 10:08:43,420 remote           L0194 DEBUG| [104.196.30.192] [stdout] --> Running transaction check
2017-01-05 10:08:43,420 remote           L0194 DEBUG| [104.196.30.192] [stdout] ---> Package epel-release.noarch 0:7-6 will be installed
2017-01-05 10:08:44,566 remote           L0194 DEBUG| [104.196.30.192] [stdout] --> Finished Dependency Resolution
2017-01-05 10:08:44,605 remote           L0194 DEBUG| [104.196.30.192] [stdout] 
2017-01-05 10:08:44,605 remote           L0194 DEBUG| [104.196.30.192] [stdout] Dependencies Resolved
2017-01-05 10:08:44,606 remote           L0194 DEBUG| [104.196.30.192] [stdout] 
2017-01-05 10:08:44,606 remote           L0194 DEBUG| [104.196.30.192] [stdout] ================================================================================
2017-01-05 10:08:44,606 remote           L0194 DEBUG| [104.196.30.192] [stdout]  Package                Arch             Version         Repository        Size
2017-01-05 10:08:44,606 remote           L0194 DEBUG| [104.196.30.192] [stdout] ================================================================================
2017-01-05 10:08:44,606 remote           L0194 DEBUG| [104.196.30.192] [stdout] Installing:
2017-01-05 10:08:44,606 remote           L0194 DEBUG| [104.196.30.192] [stdout]  epel-release           noarch           7-6             extras            14 k
2017-01-05 10:08:44,606 remote           L0194 DEBUG| [104.196.30.192] [stdout] 
2017-01-05 10:08:44,606 remote           L0194 DEBUG| [104.196.30.192] [stdout] Transaction Summary
2017-01-05 10:08:44,607 remote           L0194 DEBUG| [104.196.30.192] [stdout] ================================================================================
2017-01-05 10:08:44,607 remote           L0194 DEBUG| [104.196.30.192] [stdout] Install  1 Package
2017-01-05 10:08:44,607 remote           L0194 DEBUG| [104.196.30.192] [stdout] 
2017-01-05 10:08:44,607 remote           L0194 DEBUG| [104.196.30.192] [stdout] Total download size: 14 k
2017-01-05 10:08:44,607 remote           L0194 DEBUG| [104.196.30.192] [stdout] Installed size: 24 k
2017-01-05 10:08:44,607 remote           L0194 DEBUG| [104.196.30.192] [stdout] Downloading packages:
2017-01-05 10:08:44,749 remote           L0194 DEBUG| [104.196.30.192] [stdout] Running transaction check
2017-01-05 10:08:44,750 remote           L0194 DEBUG| [104.196.30.192] [stdout] Running transaction test
2017-01-05 10:08:44,761 remote           L0194 DEBUG| [104.196.30.192] [stdout] Transaction test succeeded
2017-01-05 10:08:44,761 remote           L0194 DEBUG| [104.196.30.192] [stdout] Running transaction
2017-01-05 10:08:44,899 remote           L0194 DEBUG| [104.196.30.192] [stdout]   Installing : epel-release-7-6.noarch                                      1/1 
2017-01-05 10:08:44,899 remote           L0194 DEBUG| [104.196.30.192] [stdout]   Verifying  : epel-release-7-6.noarch                                      1/1 
2017-01-05 10:08:44,899 remote           L0194 DEBUG| [104.196.30.192] [stdout] 
2017-01-05 10:08:44,899 remote           L0194 DEBUG| [104.196.30.192] [stdout] Installed:
2017-01-05 10:08:44,899 remote           L0194 DEBUG| [104.196.30.192] [stdout]   epel-release.noarch 0:7-6                                                     
2017-01-05 10:08:44,899 remote           L0194 DEBUG| [104.196.30.192] [stdout] 
2017-01-05 10:08:44,899 remote           L0194 DEBUG| [104.196.30.192] [stdout] Complete!
2017-01-05 10:08:44,919 remote           L0862 DEBUG| Remote [[email protected]]: Running 'sudo yum install -y collectd'
2017-01-05 10:08:44,919 remote           L0094 INFO | [104.196.30.192] Running '/bin/ssh -t -a -x  -o ControlPath=/var/tmp/ssh-mastera5xFnU/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmpkCL1kv -o BatchMode=yes -o ConnectTimeout=300 -o ServerAliveInterval=300 -l scylla-test -p 22 -i ~/.ssh/scylla-test 104.196.30.192 "sudo yum install -y collectd"'
2017-01-05 10:08:44,927 remote           L0194 DEBUG| [104.196.30.192] [stderr] Pseudo-terminal will not be allocated because stdin is not a terminal.
2017-01-05 10:08:45,540 remote           L0194 DEBUG| [104.196.30.192] [stdout] Loaded plugins: fastestmirror
2017-01-05 10:09:15,586 remote           L0194 DEBUG| [104.196.30.192] [stderr] 
2017-01-05 10:09:15,586 remote           L0194 DEBUG| [104.196.30.192] [stderr] 
2017-01-05 10:09:15,586 remote           L0194 DEBUG| [104.196.30.192] [stderr]  One of the configured repositories failed (Unknown),
2017-01-05 10:09:15,586 remote           L0194 DEBUG| [104.196.30.192] [stderr]  and yum doesn't have enough cached data to continue. At this point the only
2017-01-05 10:09:15,587 remote           L0194 DEBUG| [104.196.30.192] [stderr]  safe thing yum can do is fail. There are a few ways to work "fix" this:
2017-01-05 10:09:15,587 remote           L0194 DEBUG| [104.196.30.192] [stderr] 
2017-01-05 10:09:15,587 remote           L0194 DEBUG| [104.196.30.192] [stderr]      1. Contact the upstream for the repository and get them to fix the problem.
2017-01-05 10:09:15,587 remote           L0194 DEBUG| [104.196.30.192] [stderr] 
2017-01-05 10:09:15,587 remote           L0194 DEBUG| [104.196.30.192] [stderr]      2. Reconfigure the baseurl/etc. for the repository, to point to a working
2017-01-05 10:09:15,587 remote           L0194 DEBUG| [104.196.30.192] [stderr]         upstream. This is most often useful if you are using a newer
2017-01-05 10:09:15,587 remote           L0194 DEBUG| [104.196.30.192] [stderr]         distribution release than is supported by the repository (and the
2017-01-05 10:09:15,587 remote           L0194 DEBUG| [104.196.30.192] [stderr]         packages for the previous distribution release still work).
2017-01-05 10:09:15,587 remote           L0194 DEBUG| [104.196.30.192] [stderr] 
2017-01-05 10:09:15,587 remote           L0194 DEBUG| [104.196.30.192] [stderr]      3. Run the command with the repository temporarily disabled
2017-01-05 10:09:15,587 remote           L0194 DEBUG| [104.196.30.192] [stderr]             yum --disablerepo=<repoid> ...
2017-01-05 10:09:15,588 remote           L0194 DEBUG| [104.196.30.192] [stderr] 
2017-01-05 10:09:15,588 remote           L0194 DEBUG| [104.196.30.192] [stderr]      4. Disable the repository permanently, so yum won't use it by default. Yum
2017-01-05 10:09:15,588 remote           L0194 DEBUG| [104.196.30.192] [stderr]         will then just ignore the repository until you permanently enable it
2017-01-05 10:09:15,588 remote           L0194 DEBUG| [104.196.30.192] [stderr]         again or use --enablerepo for temporary usage:
2017-01-05 10:09:15,588 remote           L0194 DEBUG| [104.196.30.192] [stderr] 
2017-01-05 10:09:15,588 remote           L0194 DEBUG| [104.196.30.192] [stderr]             yum-config-manager --disable <repoid>
2017-01-05 10:09:15,588 remote           L0194 DEBUG| [104.196.30.192] [stderr]         or
2017-01-05 10:09:15,588 remote           L0194 DEBUG| [104.196.30.192] [stderr]             subscription-manager repos --disable=<repoid>
2017-01-05 10:09:15,588 remote           L0194 DEBUG| [104.196.30.192] [stderr] 
2017-01-05 10:09:15,588 remote           L0194 DEBUG| [104.196.30.192] [stderr]      5. Configure the failing repository to be skipped, if it is unavailable.
2017-01-05 10:09:15,588 remote           L0194 DEBUG| [104.196.30.192] [stderr]         Note that yum will try to contact the repo. when it runs most commands,
2017-01-05 10:09:15,589 remote           L0194 DEBUG| [104.196.30.192] [stderr]         so will have to try and fail each time (and thus. yum will be be much
2017-01-05 10:09:15,589 remote           L0194 DEBUG| [104.196.30.192] [stderr]         slower). If it is a very temporary problem though, this is often a nice
2017-01-05 10:09:15,589 remote           L0194 DEBUG| [104.196.30.192] [stderr]         compromise:
2017-01-05 10:09:15,589 remote           L0194 DEBUG| [104.196.30.192] [stderr] 
2017-01-05 10:09:15,589 remote           L0194 DEBUG| [104.196.30.192] [stderr]             yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
2017-01-05 10:09:15,589 remote           L0194 DEBUG| [104.196.30.192] [stderr] 
2017-01-05 10:09:15,589 remote           L0194 DEBUG| [104.196.30.192] [stderr] Cannot retrieve metalink for repository: epel/x86_64. Please verify its path and try again
2017-01-05 10:09:15,601 tester           L0732 DEBUG| Cleaning up resources used in the test

Updated dashboard files don't work with old Scylla

There is no data displayed in the Grafana dashboard when I run SCT with the 1.4.0 AMI.

Last SCT commit:

commit fae132aff643c9f598769aa05d107bb07dfadf04
Merge: 3a195fc a7db4e6
Author: Lucas Meneghel Rodrigues <[email protected]>
Date:   Mon Dec 19 14:45:17 2016 -0200

    Merge pull request #198 from scylladb/openstack-support
    
    [WiP] OpenStack support

We need to add a parameter to assign the dashboard files version.

Master ssh broken after longevity test run 6+ hours

We use master ssh to share connection.

When longevity-gce runs for more than about 6 hours, the nemesis fails to send prometheus-scylla.yaml to the monitor.

It seems the master ssh isn't available.

  1. improvement: restart the master SSH (if the old master SSH isn't available, it will be reset) before retrying to send files by scp
  2. problem: resetting the master SSH will break existing connections; at the least, the reset can't rescue them.
  3. suggestion: maybe we should disable master SSH for longevity tests whose run time is longer than 3 hours.
  4. improvement: set a process timeout for calling ssh_cmd() in send/receive_files(). 300s is good for most cases; we can set a bigger ssh_timeout for especially big files.

I'm trying the improvements and suggestion above, and will make a decision later.
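For improvement 1, OpenSSH's ControlMaster can be probed and torn down with `ssh -O check` / `ssh -O exit` against the control socket; a sketch of building those commands (the wrapper is hypothetical, the `-O` flags are standard OpenSSH):

```python
# Hedged sketch: build the 'ssh -O' control commands used to test whether
# the master connection is alive ('check') or to tear it down ('exit').
def master_ssh_cmd(op, control_path, user, host):
    """op is 'check' (is the master alive?) or 'exit' (tear it down)."""
    assert op in ('check', 'exit')
    return '/bin/ssh -O %s -o ControlPath=%s -l %s %s' % (
        op, control_path, user, host)
```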

[SSL: ENCRYPTED_LENGTH_TOO_LONG] encrypted length too long (_ssl.c:1750)

I didn't change the SCT code or the environment; this SSL error just occurred. Maybe it's a problem with the GCE service.

2017-03-10 02:55:20,231 stacktrace       L0039 ERROR| Reproduced traceback from: /usr/lib/python2.7/site-packages/avocado-36.3-py2.7.egg/avocado/core/test.py:436
2017-03-10 02:55:20,234 stacktrace       L0042 ERROR| Traceback (most recent call last):
2017-03-10 02:55:20,234 stacktrace       L0042 ERROR|   File "/home/amos/scylla-cluster-tests/sdcm/tester.py", line 129, in wrapper
2017-03-10 02:55:20,234 stacktrace       L0042 ERROR|     return method(*args, **kwargs)
2017-03-10 02:55:20,234 stacktrace       L0042 ERROR|   File "/home/amos/scylla-cluster-tests/sdcm/tester.py", line 161, in setUp
2017-03-10 02:55:20,234 stacktrace       L0042 ERROR|     self.init_resources()
2017-03-10 02:55:20,234 stacktrace       L0042 ERROR|   File "/home/amos/scylla-cluster-tests/sdcm/tester.py", line 129, in wrapper
2017-03-10 02:55:20,234 stacktrace       L0042 ERROR|     return method(*args, **kwargs)
2017-03-10 02:55:20,235 stacktrace       L0042 ERROR|   File "/home/amos/scylla-cluster-tests/sdcm/tester.py", line 524, in init_resources
2017-03-10 02:55:20,235 stacktrace       L0042 ERROR|     monitor_info=monitor_info)
2017-03-10 02:55:20,235 stacktrace       L0042 ERROR|   File "/home/amos/scylla-cluster-tests/sdcm/tester.py", line 305, in get_cluster_gce
2017-03-10 02:55:20,235 stacktrace       L0042 ERROR|     params=self.params)
2017-03-10 02:55:20,235 stacktrace       L0042 ERROR|   File "/home/amos/scylla-cluster-tests/sdcm/cluster.py", line 2744, in __init__
2017-03-10 02:55:20,235 stacktrace       L0042 ERROR|     params=params)
2017-03-10 02:55:20,235 stacktrace       L0042 ERROR|   File "/home/amos/scylla-cluster-tests/sdcm/cluster.py", line 2357, in __init__
2017-03-10 02:55:20,235 stacktrace       L0042 ERROR|     params=params)
2017-03-10 02:55:20,235 stacktrace       L0042 ERROR|   File "/home/amos/scylla-cluster-tests/sdcm/cluster.py", line 1322, in __init__
2017-03-10 02:55:20,235 stacktrace       L0042 ERROR|     self.add_nodes(n_nodes)
2017-03-10 02:55:20,235 stacktrace       L0042 ERROR|   File "/home/amos/scylla-cluster-tests/sdcm/cluster.py", line 2755, in add_nodes
2017-03-10 02:55:20,235 stacktrace       L0042 ERROR|     ec2_user_data=ec2_user_data)
2017-03-10 02:55:20,236 stacktrace       L0042 ERROR|   File "/home/amos/scylla-cluster-tests/sdcm/cluster.py", line 2429, in add_nodes
2017-03-10 02:55:20,236 stacktrace       L0042 ERROR|     base_logdir=self.logdir))
2017-03-10 02:55:20,236 stacktrace       L0042 ERROR|   File "/home/amos/scylla-cluster-tests/sdcm/cluster.py", line 1047, in __init__
2017-03-10 02:55:20,236 stacktrace       L0042 ERROR|     self._gce_service.ex_set_node_tags(self._instance, ['keep-alive'])
2017-03-10 02:55:20,236 stacktrace       L0042 ERROR|   File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 5679, in ex_set_node_tags
2017-03-10 02:55:20,236 stacktrace       L0042 ERROR|     self.connection.async_request(request, method='POST', data=tags_data)
2017-03-10 02:55:20,236 stacktrace       L0042 ERROR|   File "/usr/lib/python2.7/site-packages/libcloud/common/base.py", line 1003, in async_request
2017-03-10 02:55:20,236 stacktrace       L0042 ERROR|     response = request(**kwargs)
2017-03-10 02:55:20,236 stacktrace       L0042 ERROR|   File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 120, in request
2017-03-10 02:55:20,236 stacktrace       L0042 ERROR|     response = super(GCEConnection, self).request(*args, **kwargs)
2017-03-10 02:55:20,236 stacktrace       L0042 ERROR|   File "/usr/lib/python2.7/site-packages/libcloud/common/google.py", line 812, in request
2017-03-10 02:55:20,236 stacktrace       L0042 ERROR|     raise e
2017-03-10 02:55:20,237 stacktrace       L0042 ERROR| SSLError: [SSL: ENCRYPTED_LENGTH_TOO_LONG] encrypted length too long (_ssl.c:1750)
2017-03-10 02:55:20,237 stacktrace       L0043 ERROR| 
2017-03-10 02:55:20,237 test             L0581 ERROR| ERROR 1-longevity_test.py:LongevityTest.test_custom_time -> TestSetupFail: [SSL: ENCRYPTED_LENGTH_TOO_LONG] encrypted length too long (_ssl.c:1750)