Comments (10)

mykaul commented on September 22, 2024

@vincentwenatsa - can you clarify this a bit? Did you run the setup and this is the result you've got?
@roydahan - do we have this instance type in our matrix?

from scylladb.

vincentwenatsa commented on September 22, 2024

No, I leave it to the AMI to do all the setup, and the resulting cpuset config is not correct. Based on this, it seems is4gen is on the supported list?
https://github.com/scylladb/scylla-machine-image/blob/next-5.4/lib/scylla_cloud.py#L740

roydahan commented on September 22, 2024

We have such a test and it passed and cpuset seems correct:

  t:2024-04-19 13:28:04,838 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "cat /etc/scylla.d/cpuset.conf"...
< t:2024-04-19 13:28:04,883 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > # DO NO EDIT
< t:2024-04-19 13:28:04,883 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > # This file should be automatically configure by scylla_cpuset_setup
< t:2024-04-19 13:28:04,883 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > #
< t:2024-04-19 13:28:04,883 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > # CPUSET="--cpuset 0 --smp 1"
< t:2024-04-19 13:28:04,883 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > CPUSET="--cpuset 0-3 "
< t:2024-04-19 13:28:04,884 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "cat /etc/scylla.d/cpuset.conf" finished with status 0
< t:2024-04-19 13:28:04,884 f:cluster.py      l:596  c:sdcm.cluster_aws     p:DEBUG > Node artifacts-ami-jenkins-db-node-f5fb855e-1 [50.19.11.172 | 10.12.9.72] (seed: True): CPUSET on node artifacts-ami-jenkins-db-node-f5fb855e-1: CPUSET="--cpuset 0-3 "
< t:2024-04-19 13:28:05,251 f:cluster.py      l:3713 c:sdcm.cluster         p:INFO  > Cluster artifacts-ami-jenkins-db-cluster-f5fb855e (AMI: ['ami-0de7aecd4094a3969'] Type: is4gen.xlarge): (1/1) nodes ready, node Node artifacts-ami-jenkins-db-node-f5fb855e-1 [50.19.11.172 | 10.12.9.72] (seed: True). Time elapsed: 213 s
< t:2024-04-19 13:28:05,251 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool  status "...
< t:2024-04-19 13:28:06,569 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Datacenter: us-east
< t:2024-04-19 13:28:06,569 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > ===================
< t:2024-04-19 13:28:06,570 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Status=Up/Down
< t:2024-04-19 13:28:06,570 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
< t:2024-04-19 13:28:06,573 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > --  Address     Load       Tokens       Owns    Host ID                               Rack
< t:2024-04-19 13:28:06,576 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > UN  10.12.9.72  181.34 KB  256          ?       b3f6d2b9-6124-4db8-affa-263706df007e  1c
< t:2024-04-19 13:28:06,576 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >
< t:2024-04-19 13:28:06,576 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
< t:2024-04-19 13:28:07,077 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/usr/bin/nodetool  status " finished with status 0

What do you see in the log that tells you that it failed to bootstrap?

vincentwenatsa commented on September 22, 2024

> What do you see in the log that tells you that it failed to bootstrap?

-- Boot 5c1ca9a11cce4189bdc5a5364ebe7f61 --
May 22 14:05:24 ip-172-19-38-128 systemd[1]: Starting Scylla Server...
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: /var/lib/scylla/schema_commitlog doesn't exist - skipping
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: irqbalance is not running
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: No non-NVMe disks to tune
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting NVMe disks: nvme0n1p1...
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting mask 00000001 in /proc/irq/52/smp_affinity
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Writing 'none' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n1/queue/scheduler
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Writing '2' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n1/queue/nomerges
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting a physical interface eth0...
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Executing: ethtool -L eth0 rx 1
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Executing: ethtool -L eth0 combined 1
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Distributing IRQs handling Rx and Tx for first 1 channels:
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting mask 00000001 in /proc/irq/61/smp_affinity
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Distributing the rest of IRQs
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting mask 0000000f in /sys/class/net/eth0/queues/rx-0/rps_cpus
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting net.core.rps_sock_flow_entries to 32768
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting limit 32768 in /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Trying to enable ntuple filtering HW offload for eth0...not supported
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting mask 0000000f in /sys/class/net/eth0/queues/tx-0/xps_cpus
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Writing '4096' to /proc/sys/net/core/somaxconn
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Writing '4096' to /proc/sys/net/ipv4/tcp_max_syn_backlog
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Clocksource setting not available or not needed for this architecture. Not tuning
May 22 14:05:30 ip-172-19-38-128 scylla[694]: Scylla version 5.4.6-0.20240418.10f137e367e3 with build-id 7331638fd804e4efb02c4ae7a375591e1be374a1 starting ...
May 22 14:05:30 ip-172-19-38-128 scylla[694]: command used: "/usr/bin/scylla --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --cpuset 1-15 --lock-memory=1"
May 22 14:05:30 ip-172-19-38-128 scylla[694]: parsed command line options: [log-to-syslog, (positional) 1, log-to-stdout, (positional) 0, default-log-level, (positional) info, network-stack, (positional) posix, cpuset, (positional) 1-15, lock-memory: 1]
May 22 14:05:30 ip-172-19-38-128 scylla[694]: seastar - Bad value for --cpuset: 4 5 6 7 8 9 10 11 12 13 14 15 not allowed. Shutting down.
May 22 14:05:30 ip-172-19-38-128 systemd[1]: scylla-server.service: Main process exited, code=exited, status=1/FAILURE
May 22 14:05:30 ip-172-19-38-128 systemd[1]: scylla-server.service: Failed with result 'exit-code'.
May 22 14:05:30 ip-172-19-38-128 systemd[1]: Failed to start Scylla Server.
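
The failure in the log above comes down to a cpuset that names CPUs the node does not have: the command line asks for 1-15, while the masks written by scylla_prepare (0000000f) point to a 4-CPU machine. A minimal Python sketch of that comparison follows. The helper names `parse_cpuset` and `disallowed_cpus` are hypothetical, written only for illustration, and are not part of Scylla or Seastar.

```python
from __future__ import annotations


def parse_cpuset(spec: str) -> set[int]:
    """Expand a cpuset string such as "0-3" or "0,2,4-6" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def disallowed_cpus(spec: str, available: set[int]) -> list[int]:
    """Return the CPU ids requested by `spec` that are not in `available`."""
    return sorted(parse_cpuset(spec) - available)


if __name__ == "__main__":
    # Assumption: a 4-vCPU node (CPUs 0-3), as the 0000000f masks in the log suggest.
    available = set(range(4))
    print(disallowed_cpus("1-15", available))
    # prints [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
```

On a live Linux node, `os.sched_getaffinity(0)` could supply the `available` set instead of the hard-coded range. A non-empty result for the value in /etc/scylla.d/cpuset.conf would suggest the file was generated on a different, larger machine.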

vincentwenatsa commented on September 22, 2024

I am gonna try from a fresh is4gen.xlarge node later

vincentwenatsa commented on September 22, 2024

Failed again on an is4gen.xlarge node in the us-east-1a AWS AZ, with the same symptom.

It seems to fail only on this instance type; I tried an i4i.xlarge node and it worked there.

mykaul commented on September 22, 2024

@vincentwenatsa - can you share how you set up Scylla and the logs?

vincentwenatsa commented on September 22, 2024

> @vincentwenatsa - can you share how you set up Scylla and the logs?

I tried on a couple of x86 nodes and a few ARM nodes. The problem always happens on the ARM nodes. We use Packer to do a thin repack of your AMI (adding/revising our own SSH setup and scylla.yaml config). Because we need to change the scylla.yaml file, we run sudo rm -f /etc/scylla/machine_image_configured. On the ARM nodes that file seems not to have been deleted. I am guessing that is the problem?

mykaul commented on September 22, 2024

I'd try manually first, to narrow down the issue.

vincentwenatsa commented on September 22, 2024

The leftover /etc/scylla/machine_image_configured file was indeed the problem.
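
For anyone repacking the AMI in a similar way, a guard in the image build could catch a leftover marker before the image is published. The Python sketch below makes one assumption the thread does not spell out: that the marker's presence makes the image skip its first-boot setup, so a stale cpuset.conf survives into new instances. The function names and the `root` parameter are hypothetical, for illustration only.

```python
from pathlib import Path

# Marker path taken from the thread, stored relative so it can be joined to any root.
MARKER = Path("etc/scylla/machine_image_configured")


def marker_present(root: str = "/") -> bool:
    """Return True if the marker file exists under the given filesystem root."""
    return (Path(root) / MARKER).exists()


def assert_marker_removed(root: str = "/") -> None:
    """Raise if the marker is still present, so an image build step can fail early."""
    if marker_present(root):
        raise RuntimeError(
            f"{Path(root) / MARKER} still exists; "
            "the repacked image may skip its first-boot setup"
        )


if __name__ == "__main__":
    # Checks the running system; point root at a mounted image to check that instead.
    assert_marker_removed("/")
    print("marker not present")
```

Running a check like this as the last step of the repack, on both x86 and ARM builds, would surface the case described above where the rm step silently left the file behind.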
