Comments (10)
@vincentwenatsa - can you clarify this a bit? Did you run the setup and this is the result you've got?
@roydahan - do we have this instance type in our matrix?
from scylladb.
No, I leave it to the AMI to do all the setup, and the resulting cpuset config is not correct. Based on this, it seems is4gen is on the supported list?
https://github.com/scylladb/scylla-machine-image/blob/next-5.4/lib/scylla_cloud.py#L740
from scylladb.
We have such a test; it passed, and the cpuset seems correct:
t:2024-04-19 13:28:04,838 f:remote_base.py l:521 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "cat /etc/scylla.d/cpuset.conf"...
< t:2024-04-19 13:28:04,883 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > # DO NO EDIT
< t:2024-04-19 13:28:04,883 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > # This file should be automatically configure by scylla_cpuset_setup
< t:2024-04-19 13:28:04,883 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > #
< t:2024-04-19 13:28:04,883 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > # CPUSET="--cpuset 0 --smp 1"
< t:2024-04-19 13:28:04,883 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > CPUSET="--cpuset 0-3 "
< t:2024-04-19 13:28:04,884 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "cat /etc/scylla.d/cpuset.conf" finished with status 0
< t:2024-04-19 13:28:04,884 f:cluster.py l:596 c:sdcm.cluster_aws p:DEBUG > Node artifacts-ami-jenkins-db-node-f5fb855e-1 [50.19.11.172 | 10.12.9.72] (seed: True): CPUSET on node artifacts-ami-jenkins-db-node-f5fb855e-1: CPUSET="--cpuset 0-3 "
< t:2024-04-19 13:28:05,251 f:cluster.py l:3713 c:sdcm.cluster p:INFO > Cluster artifacts-ami-jenkins-db-cluster-f5fb855e (AMI: ['ami-0de7aecd4094a3969'] Type: is4gen.xlarge): (1/1) nodes ready, node Node artifacts-ami-jenkins-db-node-f5fb855e-1 [50.19.11.172 | 10.12.9.72] (seed: True). Time elapsed: 213 s
< t:2024-04-19 13:28:05,251 f:remote_base.py l:521 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool status "...
< t:2024-04-19 13:28:06,569 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Datacenter: us-east
< t:2024-04-19 13:28:06,569 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > ===================
< t:2024-04-19 13:28:06,570 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Status=Up/Down
< t:2024-04-19 13:28:06,570 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > |/ State=Normal/Leaving/Joining/Moving
< t:2024-04-19 13:28:06,573 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > -- Address Load Tokens Owns Host ID Rack
< t:2024-04-19 13:28:06,576 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > UN 10.12.9.72 181.34 KB 256 ? b3f6d2b9-6124-4db8-affa-263706df007e 1c
< t:2024-04-19 13:28:06,576 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG >
< t:2024-04-19 13:28:06,576 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
< t:2024-04-19 13:28:07,077 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/usr/bin/nodetool status " finished with status 0
What do you see in the log that tells you that it failed to bootstrap?
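For comparing nodes, the active value can be pulled out of the `cpuset.conf` contents shown in the log above. This is only a minimal sketch: `read_cpuset` is a hypothetical helper, not part of any Scylla tooling, and it assumes the file keeps the `CPUSET="--cpuset ..."` shape seen here.

```python
import re


def read_cpuset(conf_text):
    """Return the --cpuset value from the first uncommented CPUSET= line, or None.

    Hypothetical helper; assumes the CPUSET="--cpuset <spec> ..." layout
    shown in the log above.
    """
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip comments (including the commented-out example) and unrelated lines.
        if line.startswith("#") or not line.startswith("CPUSET="):
            continue
        match = re.search(r"--cpuset\s+(\S+)", line)
        if match:
            # Drop a trailing quote if the spec is the last token on the line.
            return match.group(1).strip('"')
    return None


# File contents copied from the log output above.
sample = '''# DO NO EDIT
# This file should be automatically configure by scylla_cpuset_setup
#
# CPUSET="--cpuset 0 --smp 1"
CPUSET="--cpuset 0-3 "
'''

print(read_cpuset(sample))  # prints 0-3
```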
from scylladb.
-- Boot 5c1ca9a11cce4189bdc5a5364ebe7f61 --
May 22 14:05:24 ip-172-19-38-128 systemd[1]: Starting Scylla Server...
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: /var/lib/scylla/schema_commitlog doesn't exist - skipping
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: irqbalance is not running
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: No non-NVMe disks to tune
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting NVMe disks: nvme0n1p1...
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting mask 00000001 in /proc/irq/52/smp_affinity
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Writing 'none' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n1/queue/scheduler
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Writing '2' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n1/queue/nomerges
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting a physical interface eth0...
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Executing: ethtool -L eth0 rx 1
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Executing: ethtool -L eth0 combined 1
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Distributing IRQs handling Rx and Tx for first 1 channels:
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting mask 00000001 in /proc/irq/61/smp_affinity
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Distributing the rest of IRQs
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting mask 0000000f in /sys/class/net/eth0/queues/rx-0/rps_cpus
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting net.core.rps_sock_flow_entries to 32768
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting limit 32768 in /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Trying to enable ntuple filtering HW offload for eth0...not supported
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Setting mask 0000000f in /sys/class/net/eth0/queues/tx-0/xps_cpus
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Writing '4096' to /proc/sys/net/core/somaxconn
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Writing '4096' to /proc/sys/net/ipv4/tcp_max_syn_backlog
May 22 14:05:26 ip-172-19-38-128 scylla_prepare[647]: Clocksource setting not available or not needed for this architecture. Not tuning
May 22 14:05:30 ip-172-19-38-128 scylla[694]: Scylla version 5.4.6-0.20240418.10f137e367e3 with build-id 7331638fd804e4efb02c4ae7a375591e1be374a1 starting ...
May 22 14:05:30 ip-172-19-38-128 scylla[694]: command used: "/usr/bin/scylla --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --cpuset 1-15 --lock-memory=1"
May 22 14:05:30 ip-172-19-38-128 scylla[694]: parsed command line options: [log-to-syslog, (positional) 1, log-to-stdout, (positional) 0, default-log-level, (positional) info, network-stack, (positional) posix, cpuset, (positional) 1-15, lock-memory: 1]
May 22 14:05:30 ip-172-19-38-128 scylla[694]: seastar - Bad value for --cpuset: 4 5 6 7 8 9 10 11 12 13 14 15 not allowed. Shutting down.
May 22 14:05:30 ip-172-19-38-128 systemd[1]: scylla-server.service: Main process exited, code=exited, status=1/FAILURE
May 22 14:05:30 ip-172-19-38-128 systemd[1]: scylla-server.service: Failed with result 'exit-code'.
May 22 14:05:30 ip-172-19-38-128 systemd[1]: Failed to start Scylla Server.
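The failure above is consistent with a configured cpuset (`1-15`) that is wider than what the host exposes. A rough way to see the same mismatch outside of Scylla is sketched below. `parse_cpuset` and `disallowed_cpus` are hypothetical helpers, and the parser only assumes the simple comma-and-range form (such as `0-3` or `0,2,4-7`); it does not claim to mirror how seastar parses the option.

```python
import os


def parse_cpuset(spec):
    """Expand a spec such as '1-15' or '0,2,4-7' into a set of CPU ids.

    Hypothetical helper; handles only plain comma lists and ranges.
    """
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def disallowed_cpus(spec, available):
    """Return the sorted CPU ids requested by spec that are not in available."""
    return sorted(parse_cpuset(spec) - set(available))


# The log's cpuset checked against a host that exposes CPUs 0-3.
print(disallowed_cpus("1-15", range(4)))
# prints [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

# On a live Linux host, compare against the CPUs this process may run on.
if hasattr(os, "sched_getaffinity"):
    print(disallowed_cpus("1-15", os.sched_getaffinity(0)))
```

The first result matches the CPU ids that the `Bad value for --cpuset` message lists as not allowed.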
from scylladb.
I am going to try on a fresh is4gen.xlarge node later.
from scylladb.
It failed again on an is4gen.xlarge node, in the us-east-1a AWS AZ, with the same symptom.
It seems to fail only on this instance type; I tried an i4i.xlarge node and it worked there.
from scylladb.
@vincentwenatsa - can you share how you set up Scylla and the logs?
from scylladb.
I tried on a couple of x86 nodes and a few ARM nodes. The problem always happens on the ARM nodes. We use Packer to do a thin repack of your AMI (we add or revise our own SSH setup and scylla.yaml config). Because we need to change the scylla.yaml file, we run `sudo rm -f /etc/scylla/machine_image_configured`. On the ARM nodes, that file does not seem to get deleted. I am guessing that is the problem?
from scylladb.
I'd try manually first, to narrow down the issue.
from scylladb.
/etc/scylla/machine_image_configured was indeed the problem.
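For anyone repacking the AMI the same way, one possible guard is a final build step that checks whether the marker file is still present before the image is captured. The sketch below is only an illustration: `leftover_marker` is a hypothetical helper, and the only thing taken from this thread is the marker path itself.

```python
from pathlib import Path

# Marker path from this thread, written relative to the image root.
MARKER = "etc/scylla/machine_image_configured"


def leftover_marker(image_root):
    """Return the marker's path if it still exists under image_root, else None.

    Hypothetical helper for a post-provisioning check.
    """
    path = Path(image_root) / MARKER
    return path if path.exists() else None


if __name__ == "__main__":
    found = leftover_marker("/")
    if found is not None:
        # A non-zero exit lets the calling build step fail loudly.
        raise SystemExit(f"marker still present: {found}")
    print("marker not present")
```

Running this as the last provisioning step on both the x86 and ARM builds would show whether the `rm -f` actually took effect on each architecture.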
from scylladb.