flashcatcloud / n9e-helm
Helm chart of Nightingale
License: Apache License 2.0
Whether or not others use it, including it would be convenient for those who need it. It can be disabled by default, that is, not deployed.
Error: template: nightingale/templates/redis/statefulset.yaml:22:20: executing "nightingale/templates/redis/statefulset.yaml" at <{{template "nightingale.redis" .}}>: template "nightingale.redis" not defined
Nightingale version:
5.6.3
Problem and steps to reproduce:
helm install nightingale ./n9e-helm -n monitoring --create-namespace
# nightingale-nserver serves on port 80
kubectl get service -n monitoring nightingale-nserver
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nightingale-nserver ClusterIP 10.43.137.45 <none> 80/TCP 17d
# The telegraf config uses port 91000, so telegraf cannot connect to nightingale-nserver and never starts.
kubectl get configmap -n monitoring telegraf-config -o yaml
apiVersion: v1
data:
  telegraf.conf: |-
    ...
    [[outputs.opentsdb]]
      host = "http://nightingale-nserver"
      port = 91000
    ...
Change port = 91000 to port = 80 and restart the telegraf DaemonSet.
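The fix can be sketched as the corrected part of the telegraf-config ConfigMap. Only the port value changes; everything elided in the report stays elided here, and the metadata block is filled in from the kubectl command above. Note that 91000 is above 65535, the highest valid TCP port, so it could not have worked against any Service.

```yaml
# Corrected fragment of the telegraf-config ConfigMap.
# Only the opentsdb output port changes (91000 -> 80) so that it matches
# the port exposed by the nightingale-nserver Service.
apiVersion: v1
kind: ConfigMap
metadata:
  name: telegraf-config
  namespace: monitoring
data:
  telegraf.conf: |-
    [[outputs.opentsdb]]
      host = "http://nightingale-nserver"
      port = 80
```

The telegraf pods only read the file at startup, so the DaemonSet still has to be restarted after the ConfigMap is updated.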
{{- if eq .Values.nserver.type "internal" -}}
apiVersion: v1
kind: ConfigMap
metadata:
  name: nserver-script
data:
{{ (.Files.Glob "scripts/*.py").AsConfig | indent 2 }}
{{- end -}}
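Kubernetes 1.24 and later have completely removed Docker support, and many users had already switched the default Kubernetes runtime to containerd before that. The categraf configuration in the current Helm chart enables Docker as a default collection item, which caused the following errors during my deployment: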
Warning FailedMount 8m40s (x14 over 20m) kubelet MountVolume.SetUp failed for volume "docker-socket" : hostPath type check failed: /var/run/docker.sock is not a socket file
Warning FailedMount 46s (x7 over 9m50s) kubelet (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[docker-socket], unattached volumes=[docker-socket input-kubernetes input-mem input-kubelet-metrics input-processes input-cpu categraf-config hostrofs kube-api-access-h2nmj input-sysctl-fs input-diskio input-kernel-vmstat input-docker input-disk input-netstat input-kernel input-net input-system hostroutmp]: timed out waiting for the condition
A similar issue: flashcatcloud/categraf#46
Since Docker scenarios still need to be supported, I suggest adjusting the chart as follows:
categraf:
  type: internal
  internal:
    ## Parm: categraf.internal.docker_socket Desc: the path of docker socket on kubelet node.
    ## "unix:///var/run/docker.sock" is default, if your kubernetes runtime is containerd or others, empty this variable.
    ## docker_socket: ""
    docker_socket: unix:///var/run/docker.sock
I have submitted a PR to help resolve this issue and look forward to your reply.
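One way such a switch could be wired into the chart is sketched below. This is a hypothetical template fragment, not the content of the PR: only the value key categraf.internal.docker_socket and the volume name docker-socket (from the FailedMount events) come from the report; the placement in the DaemonSet template is an assumption. trimPrefix is a Sprig function available in Helm templates.

```yaml
# Hypothetical fragment of the categraf DaemonSet template.
# The docker-socket volume is rendered only when docker_socket is non-empty,
# so containerd-only nodes no longer fail the hostPath type check.
volumes:
{{- if .Values.categraf.internal.docker_socket }}
  - name: docker-socket
    hostPath:
      # Strip the "unix://" scheme to get the filesystem path of the socket.
      path: {{ trimPrefix "unix://" .Values.categraf.internal.docker_socket }}
      type: Socket
{{- end }}
```

The matching volumeMounts entry and the input-docker config would need the same guard; in a real template the volumes list would also contain the other categraf volumes.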
Helm configuration:
persistence:
  enabled: true
  resourcePolicy: "keep"
  persistentVolumeClaim:
    database:
      existingClaim: ""
      storageClass: "longhorn"
      subPath: ""
      accessMode: ReadWriteOnce
      size: 4Gi
    redis:
      existingClaim: ""
      storageClass: "longhorn"
      subPath: ""
      accessMode: ReadWriteOnce
      size: 1Gi
    prometheus:
      existingClaim: ""
      storageClass: "longhorn"
      subPath: ""
      accessMode: ReadWriteOnce
      size: 4Gi
Error message:
ts=2023-04-10T04:05:48.273Z caller=main.go:169 level=warn msg="Remote write receiver enabled via feature flag remote-write-receiver. This is DEPRECATED. Use --web.enable-remote-write-receiver."
ts=2023-04-10T04:05:48.274Z caller=main.go:520 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2023-04-10T04:05:48.274Z caller=main.go:564 level=info msg="Starting Prometheus Server" mode=server version="(version=2.43.0, branch=HEAD, revision=edfc3bcd025dd6fe296c167a14a216cab1e552ee)"
ts=2023-04-10T04:05:48.274Z caller=main.go:569 level=info build_context="(go=go1.19.7, platform=linux/amd64, user=root@8a0ee342e522, date=20230321-12:56:07, tags=netgo,builtinassets)"
ts=2023-04-10T04:05:48.275Z caller=main.go:570 level=info host_details="(Linux 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 nightingale-prometheus-0 (none))"
ts=2023-04-10T04:05:48.275Z caller=main.go:571 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2023-04-10T04:05:48.275Z caller=main.go:572 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2023-04-10T04:05:48.275Z caller=query_logger.go:91 level=error component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"
panic: Unable to create mmap-ed active query log
goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker({0x7ffd0172b7cd, 0xb}, 0x14, {0x3dec280, 0xc0007f79f0})
/app/promql/query_logger.go:121 +0x3cd
main.main()
/app/cmd/prometheus/main.go:626 +0x6cf3
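The permission denied error on /prometheus/queries.active suggests that the Prometheus process cannot write to the mounted volume. A possible workaround is a pod-level securityContext, sketched here under two assumptions: the container runs as UID 65534 (nobody, as in the official prom/prometheus image), and the storage driver honors fsGroup. Whether the chart exposes a value for this is not stated in the report, so the fragment is written as a hypothetical patch to the nightingale-prometheus StatefulSet.

```yaml
# Hypothetical patch for the nightingale-prometheus StatefulSet pod template.
spec:
  template:
    spec:
      securityContext:
        runAsUser: 65534    # assumed: "nobody", the user of the official image
        runAsGroup: 65534
        fsGroup: 65534      # asks the kubelet to make the volume group-writable for this GID
```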
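I would like to do custom development on Nightingale. How is the binary in the working directory of the nightingale-center-xxxx pod built and packaged?
The chart currently supports Prometheus as the default storage backend. To use an external VictoriaMetrics (vm) instead, the corresponding server config file has to be edited by hand, which makes deployment and maintenance through Helm inconvenient. I suggest adding support for an external vm.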
nserver config-cm.yaml
[Reader]
Url = "http://{{ template "nightingale.prometheus.host" . }}:{{ template "nightingale.prometheus.servicePort" . }}"
[[Writers]]
Url = "http://{{ template "nightingale.prometheus.host" . }}:{{ template "nightingale.prometheus.servicePort" . }}/api/v1/write"
Do the internal addresses 192.168.0.1-5 in value.yml need to be changed?
The internal IPs of the worker nodes are different in every k8s cluster.
2023-03-02 10:09:01+08:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 5.7.41-1.el7 started.
2023-03-02 10:09:01+08:00 [Note] [Entrypoint]: Initializing database files
2023-03-02T02:09:01.763856Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2023-03-02T02:09:01.765550Z 0 [ERROR] --initialize specified but the data directory has files in it. Aborting.
2023-03-02T02:09:01.765580Z 0 [ERROR] Aborting
Nightingale version:
5.6.3
Problem and steps to reproduce:
helm install nightingale ./n9e-helm -n monitoring --create-namespace
dial unix /var/run/docker.sock: connect: permission denied
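According to the discussion in the official issues, the fix is to look up the docker group ID on the k8s host and set it as telegraf's run-as group through k8s.
In this cluster the docker group ID is 990: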
stat -c '%g' /var/run/docker.sock
990
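The official image's Dockerfile does not set USER, so the container starts as root by default and then runs the program as the telegraf user. Setting securityContext.runAsGroup therefore has no effect.
The bitnami image's Dockerfile does set USER: the container starts as user 1001 and runs the program as user 1001, so securityContext.runAsGroup can be used.
One possible workaround:
Replace the image with: bitnami/telegraf:1.22.4
Set: spec.template.spec.containers[0].command: ["telegraf"]
spec.template.spec.containers[0].securityContext.runAsGroup: 990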
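Put together, the three settings above would look roughly like this in the telegraf DaemonSet. The container name is an assumption; the image, command and group ID are the ones given in the workaround, and 990 must be replaced with the GID reported by stat on your own nodes.

```yaml
# Sketch of the DaemonSet changes described in the workaround.
spec:
  template:
    spec:
      containers:
        - name: telegraf                  # assumed container name
          image: bitnami/telegraf:1.22.4  # image whose Dockerfile sets USER 1001
          command: ["telegraf"]
          securityContext:
            runAsGroup: 990               # GID from: stat -c '%g' /var/run/docker.sock
```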
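After deploying the services with Helm, I found that the monitoring target n9e failed to become ready, with the following error:
File to modify: templates/prometheus/configmap.yaml
categraf log errors: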
2022/09/16 08:02:13 main.go:121: I! runner.vm_limits: (soft=unlimited, hard=unlimited)
2022/09/16 08:02:13 main.go:120: I! runner.fd_limits: (soft=1048576, hard=1048576)
2022/09/16 08:02:13 main.go:119: I! runner.hostname: cce-0002
2022/09/16 08:02:13 main.go:118: I! runner.binarydir: /usr/bin
Failed to set additional capabilities on /usr/bin/categraf
/entrypoint.sh: 5: setcap: not found
2022/09/16 08:02:09 main.go:65: F! failed to init config: failed to load configs of dir: /etc/categraf/conf err:toml: line 45 (last key "writers"): expected a top-level item to end with a newline, comment, or EOF, but got '[' instead
I checked the file with tomlv and the format is valid. I don't know what the cause is.
helm install nightingale ./n9e-helm -n n9e --create-namespace
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: unknown
How should I handle this?
[root@m1 helm]# helm install nightingale ./n9e-helm -n n9e --create-namespace
Error: INSTALLATION FAILED: template: nightingale/templates/_helpers.tpl:244:9: executing "nightingale.redis.mode" at <eq .Values.redis.type "internal">: error calling eq: incompatible types for comparison
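What is this error? So far I have only changed the expose method and the storage settings in values.yaml; nothing else was modified.
categraf mounts docker.sock, so it cannot run properly when the container runtime is containerd.
We could consider consolidating the nightingale image repository under flashcatcloud/.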
repository: docker.io/ulric2019/nightingale
tag: 5.9.4
Error: failed to start container "mysql": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "/var/lib/kubelet/pods/c5fcbc47-810e-4e1b-8423-35c41840151c/volumes/kubernetes.io~configmap/database-config" to rootfs at "/etc/my.cnf" caused: mount through procfd: not a directory: unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type
database:
  type: internal
  internal:
    serviceAccountName: ""
    automountServiceAccountToken: false
    image:
      repository: docker.io/library/mysql
      tag: 5.7
    username: "root"
    password: "1234"
    shmSizeLimit: 512Mi
    nodeSelector: {}
    resources: {}
    tolerations: []
    affinity: {}
    priorityClassName:
    initContainer:
      migrator: {}
      permissions: {}
  maxIdleConns: 100
  maxOpenConns: 900
  podAnnotations: {}
The nserver service fails to start because the recording_rule table cannot be found.
2022/07/11 10:53:55 /home/runner/work/nightingale/nightingale/src/models/recording_rule.go:197 Error 1146: Table 'n9e_v5.recording_rule' doesn't exist
[0.253ms] [rows:0] SELECT count(*) as total,max(update_at) as last_updated FROM recording_rule
WHERE cluster = 'Default'
failed to sync recording rules: failed to exec RecordingRuleStatistics: Error 1146: Table 'n9e_v5.recording_rule' doesn't exist
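Besides Nightingale, I also proxy other services through the ingress, so I would like to access Nightingale at a URI of the form http://[domain]:[port]/n9e
Installed n9e version: 6.0.0.ga.6
My current ingress configuration: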
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    meta.helm.sh/release-name: nightingale
    meta.helm.sh/release-namespace: n9e
  creationTimestamp: "2023-05-29T06:36:30Z"
  generation: 1
  labels:
    app: n9e
    app.kubernetes.io/managed-by: Helm
    chart: nightingale
    heritage: Helm
    release: nightingale
  name: nightingale-ingress
  namespace: n9e
  resourceVersion: "3553356"
  uid: 597eed71-34a5-4519-a528-e6b21b25951d
spec:
  ingressClassName: nginx
  rules:
  - host: www.xxx.pub
    http:
      paths:
      - backend:
          service:
            name: nightingale-center
            port:
              number: 80
        path: /n9e
        pathType: Prefix
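Besides modifying the ingress, do other n9e config files also need to be changed for this to take effect?
In ingress.yaml I see that the path is set to {{ .root_path }}, but I could not find any configuration for .root_path in values.yaml.
Installed with Helm using an external MySQL database. No table-creation statements are run against the database, so the main n9e instance crashes.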
runner.cwd: /app
runner.hostname: n9e-nightingale-center-6bb5866f77-pxsh5
runner.fd_limits: (soft=1048576, hard=1048576)
runner.vm_limits: (soft=unlimited, hard=unlimited)
2023/05/25 11:13:34 /home/runner/work/nightingale/nightingale/models/user.go:179 Error 1146: Table 'n9e_v6.users' doesn't exist
[0.845ms] [rows:0] SELECT * FROM users
WHERE username='root'
ts=2023-12-22T03:51:58.343Z caller=main.go:172 level=warn msg="Remote write receiver enabled via feature flag remote-write-receiver. This is DEPRECATED. Use --web.enable-remote-write-receiver."
ts=2023-12-22T03:51:58.344Z caller=main.go:539 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2023-12-22T03:51:58.344Z caller=main.go:583 level=info msg="Starting Prometheus Server" mode=server version="(version=2.48.1, branch=HEAD, revision=63894216648f0d6be310c9d16fb48293c45c9310)"
ts=2023-12-22T03:51:58.344Z caller=main.go:588 level=info build_context="(go=go1.21.5, platform=linux/amd64, user=root@71f108ff5632, date=20231208-23:33:22, tags=netgo,builtinassets,stringlabels)"
ts=2023-12-22T03:51:58.344Z caller=main.go:589 level=info host_details="(Linux 3.10.0-1160.99.1.el7.x86_64 #1 SMP Wed Sep 13 14:19:20 UTC 2023 x86_64 nightingale-prometheus-0 (none))"
ts=2023-12-22T03:51:58.344Z caller=main.go:590 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2023-12-22T03:51:58.344Z caller=main.go:591 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2023-12-22T03:51:58.345Z caller=query_logger.go:93 level=error component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"
panic: Unable to create mmap-ed active query log
goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker({0x7ffd220916a5, 0xb}, 0x14, {0x3e94bc0, 0xc000a400f0})
/app/promql/query_logger.go:123 +0x411
main.main()
/app/cmd/prometheus/main.go:645 +0x7812
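The resources block under the deployment has an indentation problem, so helm install fails. The error message points to the indentation, and after fixing it the install succeeds.
The startup status is as follows: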
kubectl get pod -n n9e
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nightingale-categraf-v6-6t2rh 1/1 Running 0 9m3s 10.30.60.105 kylin105
nightingale-categraf-v6-n7hqg 1/1 Running 0 9m3s 10.30.60.103 kylin103
nightingale-categraf-v6-xwczl 1/1 Running 0 9m3s 10.30.60.109 kylin109
nightingale-center-6f4f6ccb66-s26j4 0/1 CrashLoopBackOff 6 (2m40s ago) 9m3s 10.0.1.240 kylin105
nightingale-database-0 1/1 Running 1 (7m4s ago) 9m3s 10.0.0.106 kylin103
nightingale-nginx-858cc757d7-9bsdz 0/1 Running 1 (3m41s ago) 9m3s 10.0.2.59 kylin109
nightingale-prometheus-0 1/1 Running 0 9m3s 10.0.2.164 kylin109
nightingale-redis-0 1/1 Running 0 9m3s 10.0.0.107 kylin103
The log of nightingale-center-6f4f6ccb66-s26j4 shows the following error:
kubectl logs nightingale-center-6f4f6ccb66-s26j4 -n n9e
runner.cwd: /app
runner.hostname: nightingale-center-6f4f6ccb66-s26j4
runner.fd_limits: (soft=1073741816, hard=1073741816)
runner.vm_limits: (soft=unlimited, hard=unlimited)
2024/03/07 18:34:23 main.go:39: failed to initialize: Error 1130: Host '10.0.1.240' is not allowed to connect to this MySQL server
2024/03/07 18:34:23 /home/runner/work/nightingale/nightingale/pkg/ormx/ormx.go:45
[error] failed to initialize database, got error Error 1130: Host '10.0.1.240' is not allowed to connect to this MySQL server
The database appears to have started successfully. Do permissions need to be configured? Which file should be modified?
nightingale-database-0
kubectl logs nightingale-database-0 -n n9e
2024-03-07 18:32:46+08:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 5.7.44-1.el7 started.
'/var/lib/mysql/mysql.sock' -> '/var/run/mysqld/mysqld.sock'
2024-03-07T10:33:08.928594Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2024-03-07T10:33:08.929675Z 0 [Note] mysqld (mysqld 5.7.44) starting as process 1 ...
2024-03-07T10:33:08.932188Z 0 [Note] InnoDB: PUNCH HOLE support available
2024-03-07T10:33:08.932210Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2024-03-07T10:33:08.932232Z 0 [Note] InnoDB: Uses event mutexes
2024-03-07T10:33:08.932235Z 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2024-03-07T10:33:08.932237Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.13
2024-03-07T10:33:08.932239Z 0 [Note] InnoDB: Using Linux native AIO
2024-03-07T10:33:08.932416Z 0 [Note] InnoDB: Number of pools: 1
2024-03-07T10:33:08.932531Z 0 [Note] InnoDB: Using CPU crc32 instructions
2024-03-07T10:33:08.933583Z 0 [Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
2024-03-07T10:33:08.939303Z 0 [Note] InnoDB: Completed initialization of buffer pool
2024-03-07T10:33:08.940746Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2024-03-07T10:33:08.963996Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2024-03-07T10:33:08.966787Z 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 2768310
2024-03-07T10:33:08.966808Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 2768319
2024-03-07T10:33:08.966814Z 0 [Note] InnoDB: Database was not shutdown normally!
2024-03-07T10:33:08.966817Z 0 [Note] InnoDB: Starting crash recovery.
2024-03-07T10:33:09.117149Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2024-03-07T10:33:09.117180Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2024-03-07T10:33:09.117211Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2024-03-07T10:33:09.190116Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2024-03-07T10:33:09.190711Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2024-03-07T10:33:09.190725Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2024-03-07T10:33:09.191356Z 0 [Note] InnoDB: 5.7.44 started; log sequence number 2768319
2024-03-07T10:33:09.191490Z 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2024-03-07T10:33:09.191768Z 0 [Note] Plugin 'FEDERATED' is disabled.
2024-03-07T10:33:09.199948Z 0 [Note] Found ca.pem, server-cert.pem and server-key.pem in data directory. Trying to enable SSL support using them.
2024-03-07T10:33:09.199964Z 0 [Note] Skipping generation of SSL certificates as certificate files are present in data directory.
2024-03-07T10:33:09.199968Z 0 [Warning] A deprecated TLS version TLSv1 is enabled. Please use TLSv1.2 or higher.
2024-03-07T10:33:09.199970Z 0 [Warning] A deprecated TLS version TLSv1.1 is enabled. Please use TLSv1.2 or higher.
2024-03-07T10:33:09.206507Z 0 [Note] InnoDB: Buffer pool(s) load completed at 240307 18:33:09
2024-03-07T10:33:09.207632Z 0 [Warning] CA certificate ca.pem is self signed.
2024-03-07T10:33:09.207670Z 0 [Note] Skipping generation of RSA key pair as key files are present in data directory.
2024-03-07T10:33:09.209148Z 0 [Note] Server hostname (bind-address): '0.0.0.0'; port: 3306
2024-03-07T10:33:09.209179Z 0 [Note] - '0.0.0.0' resolves to '0.0.0.0';
2024-03-07T10:33:09.209198Z 0 [Note] Server socket created on IP: '0.0.0.0'.
2024-03-07T10:33:09.210601Z 0 [Warning] Insecure configuration for --pid-file: Location '/var/run/mysqld' in the path is accessible to all OS users. Consider choosing a different directory.
2024-03-07T10:33:09.294142Z 0 [Note] Event Scheduler: Loaded 0 events
2024-03-07T10:33:09.294366Z 0 [Note] mysqld: ready for connections.
Version: '5.7.44' socket: '/var/run/mysqld/mysqld.sock' port: 3306 MySQL Community Server (GPL)
2024-03-07T10:34:23.388913Z 2 [Warning] IP address '10.0.1.240' could not be resolved: Name or service not known
Could a GitHub Action be used to automatically build a Helm chart repository hosted on GitHub Pages?
A template follows:
name: Release Charts
on:
  push:
    branches:
      - main
jobs:
  release:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout
        uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Configure Git
        run: |
          git config user.name "$GITHUB_ACTOR"
          git config user.email "[email protected]"
      - name: Install Helm
        uses: azure/setup-helm@v3
      - name: Add Helm repo
        run: |
          helm repo add bitnami https://charts.bitnami.com/bitnami
      - name: Run chart-releaser
        uses: helm/[email protected]
        env:
          CR_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          CR_SKIP_EXISTING: true
2023/03/20 16:52:06 /home/runner/work/nightingale/nightingale/src/models/target.go:82 Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'cluster = ?' at line 1
[0.998ms] [rows:0] SELECT count(*) as total,max(update_at) as last_updated FROM target
WHERE cluster = 'Default'
failed to sync targets: failed to exec TargetStatistics: Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'cluster = ?' at line 1
Mysql version: 5.7
The following needs to be added to config.toml:
[ibex]
enable = false
interval = "1000ms"
servers = ["127.0.0.1:20090"]
meta_dir = "./meta"