openshift / cluster-logging-operator

Operator to support logging subsystem of OpenShift

License: Apache License 2.0

Go 86.22% Shell 12.42% Dockerfile 0.19% Makefile 0.86% Awk 0.04% Python 0.27%
logging fluentd vector logcollection

cluster-logging-operator's Introduction

Cluster Logging Operator

An operator to support OKD aggregated cluster logging. Cluster logging configuration information is found in the configuration documentation.

Overview

The CLO (Cluster Logging Operator) provides a set of APIs to control collection and forwarding of logs from all pods and nodes in a cluster. This includes application logs (from regular pods), infrastructure logs (from system pods and node logs), and audit logs (special node logs with legal/security implications).

The CLO does not collect or forward logs itself: it starts, configures, monitors and manages the components that do the work.

CLO currently uses:

  • Vector as collector/forwarder

  • Loki as store

  • OpenShift console for visualization.

(Fluentd, Elasticsearch, and Kibana are still supported for compatibility.)

The goal is to encapsulate those technologies behind APIs so that:

  1. The user has less to learn, and has a simpler experience to control logging.

  2. These technologies can be replaced in the future without affecting the user experience.

The CLO can also forward logs over multiple protocols, to multiple types of log stores, on- or off-cluster.

The CLO owns the following APIs:

  • ClusterLogging: Top level control of cluster-wide logging resources

  • ClusterLogForwarder: Configure forwarding of logs to external sources
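To make the ClusterLogForwarder API concrete, the sketch below writes out a minimal, hypothetical forwarder manifest. The output name, type, and URL are illustrative assumptions, not values from this repository; consult the configuration documentation for the authoritative schema.

```shell
# Write a minimal, hypothetical ClusterLogForwarder manifest to a file.
# The output name, type, and URL are illustrative assumptions only.
cat > /tmp/clf-example.yaml <<'EOF'
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: remote-elasticsearch
      type: elasticsearch
      url: https://es.example.com:9200
  pipelines:
    - name: forward-app-logs
      inputRefs:
        - application
      outputRefs:
        - remote-elasticsearch
EOF
# On a live cluster this would then be applied with:
#   oc apply -f /tmp/clf-example.yaml
```

The pipeline selects one of the log categories described above (application, infrastructure, or audit) via inputRefs and routes it to a named output.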

To install a released version of cluster logging, see the OpenShift documentation (e.g., OCP v4.5).

To experiment with or contribute to the development of cluster logging, see the hacking and review documentation.

To debug the cluster logging stack, see README.md

To find currently known Cluster Logging Operator issues with work-arounds, see the Troubleshooting guide.

cluster-logging-operator's People

Contributors

abrennan89, ahadas, ajaygupta978, alanconway, andreaskaris, blockloop, bparees, btaani, cahartma, clee2691, eranra, ewolinetz, jaormx, jcantrill, jlarriba, k-keiichi-rh, lukas-vlcek, nhosoi, openshift-ci[bot], openshift-merge-bot[bot], openshift-merge-robot, periklis, pmoogi-redhat, red-gv, richm, syedriko, vimalk78, vladmasarik, vparfonov, xperimental


cluster-logging-operator's Issues

Restructure 'all-in-one' as it's currently defined before 4.0 release

Overview

While working through:

  • standing up cluster-logging
  • configuration options
  • documentation
  • reviewing code

I fundamentally believe the approach we are taking to configure split clusters repeats the same problem we had with the deployer, Ansible, and now the operator. Prior to feature freeze for 4.0, we must re-evaluate the current CR, as it will become an API we will need to maintain for a long time going forward.

Issue

We currently treat the split scenario (apps to one cluster, infra to another) as a special case. The implementation depends on an annotation for which we introduce 'if' checks (i.e., the elasticsearch case) in multiple places. This is contrary to the advice we received several releases ago to consider how we might treat these cases as the same but different instances (e.g., the class and object metaphor). With regard to the application and operations Elasticsearch stacks (i.e., ES, Kibana, Curator), there is no difference between the two besides the name. By subtly altering how we represent these use cases in the CR, we can remove the special-case nature of the current design. This should simplify the code.

Proposal

This proposal is a variant of one of the alternates listed below. It would introduce an additional hierarchy to group stacks accordingly (allowing additional ones in the future if that makes sense) and configure message routing in the collector. This change would also allow us to treat clusters uniformly:

Clusters

apiVersion: "logging.openshift.io/v1alpha1"
kind: "ClusterLogging"
metadata:
  name: "cluster-logging"
spec:
  managementState: "Managed"
  stacks:
    - name: app
      type: elastic
      elastic:
        logStore:
          type: "elasticsearch"
          elasticsearch:
            dataReplication: "NoReplication"
        visualization:
          type: "kibana"
          kibana:
            replicas: 1
        curation:
          type: "curator"
          curator:
            schedule: "30 3 * * *"
    - name: infra
      type: elastic
      elastic:
        logStore:
          type: "elasticsearch"
          elasticsearch:
            dataReplication: "NoReplication"
        visualization:
          type: "kibana"
          kibana:
            replicas: 1
        curation:
          type: "curator"
          curator:
            schedule: "30 3 * * *"
...

One could further suggest an additional optimization: since we know the stacks[].type, we no longer need component types; we will ALWAYS have the same components in a given cluster type (e.g., Elasticsearch, Kibana, Curator).

apiVersion: "logging.openshift.io/v1alpha1"
kind: "ClusterLogging"
metadata:
  name: "cluster-logging"
spec:
  managementState: "Managed"
  stacks:
    - name: app
      type: elastic
      elastic:
        logStore:
          resources:
            request:
            limits:
          dataReplication: "NoReplication"
        visualization:
          resources:
            request:
            limits:
          replicas: 1
        curation:
          resources:
            request:
            limits:
          schedule: "30 3 * * *"
    - name: infra
      type: elastic
      elastic:
        logStore:
        visualization:
        curation:
          type: "curator"
          curator:
            schedule: "30 3 * * *"

What's in a name

Ideally, we would use the name either as the name for all dependent resources or as a suffix to the resources the operator creates (e.g., elasticsearch-infra). Alternatively, we might consider applying the suffix (as we do now) only when there are multiple cluster definitions. Additionally, we should consider supporting only the names apps and infra, since they have special meaning.

Collectors

Initially, message routing would require us to make some opinionated assumptions based on the deployed clusters:

  • Single cluster: all messages route here
  • Multiple clusters: app logs -> app, infra -> infra

In the future we could introduce a way to define where messages are routed, but that is intentionally absent here.
apiVersion: "logging.openshift.io/v1alpha1"
kind: "ClusterLogging"
metadata:
  name: "cluster-logging"
spec:
  collection:
    logCollection:
      type: "fluentd"
      fluentd:
        nodeSelector:
          logging-infra-fluentd: "true"

Alternates

Multiple CRs, one for each cluster

Ref: https://gist.github.com/jcantrill/4a9365170f32f72ed57c83f6bb566b4f#file-gistfile1-txt-L27

Cons

  • Requires the cluster admin to 'wire' logs from the collector to various destinations.
  • No inherent relation between multiple CRs/clusters in a single 'cluster logging' setup

Single CR with named sources

https://gist.github.com/chancez/6f326e68412dbe760aeffd2be7ea5adf

Cons

  • Introduces named clusters in a way that is limiting (e.g., infraLog, appLog)

cluster logging operator csv stuck on pending

Hi! I am trying to deploy the cluster logging operator with the manifests/4.2/cluster-logging.v4.2.0.clusterserviceversion.yaml file. First I create the OperatorGroups that it needs and the RBAC for the Elasticsearch operator, then deploy the Elasticsearch operator, also from a YAML file. In both cases the CSV is stuck on Pending and it reports "Service account does not exist".
When I deploy it via OperatorHub it deploys successfully, but I eventually want to deploy it in a disconnected environment.

Thanks

cluster logging operator pod can't use clusterlogging

hey, to make things clear this time -
I have an issue deploying the operator in my environment. First I'm testing it in a normal cluster, but in the end I'll need to deploy it on a disconnected one.

I followed the instructions from these walkthroughs:
https://github.com/operator-framework/getting-started
https://docs.openshift.com/container-platform/4.2/logging/cluster-logging-deploying.html

and everything goes well until the point where I deploy the cluster logging operator CSV:
https://github.com/openshift/cluster-logging-operator/blob/release-4.2/manifests/4.2/cluster-logging.v4.2.0.clusterserviceversion.yaml

where I receive the following message when I try it:
Failed to list *v1.ClusterLogging: clusterloggings.logging.openshift.io is forbidden: User "system:serviceaccount:openshift-logging:cluster-logging-operator" cannot list resource "clusterloggings" in API group "logging.openshift.io" at the cluster scope

make deploy-example fails on macOS

Issue

The command REMOTE_CLUSTER=true make deploy-example fails on macOS because the mktemp option passed as a parameter is illegal:

++ mktemp --tmpdir -d cluster-logging-operator-build-XXXXXXXXXX
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
       mktemp [-d] [-q] [-u] -t prefix
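The failure comes from the GNU-only --tmpdir long option, which BSD/macOS mktemp rejects. A portable workaround (a sketch, not necessarily how the Makefile was eventually fixed) is to pass a full template path instead:

```shell
# Portable temp-dir creation: avoid the GNU-only --tmpdir long option and
# build the template path explicitly, honoring TMPDIR when set.
# Works with both GNU coreutils mktemp and BSD/macOS mktemp.
WORKDIR=$(mktemp -d "${TMPDIR:-/tmp}/cluster-logging-operator-build-XXXXXXXXXX")
echo "created $WORKDIR"
```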

ElasticSearch or Cluster-Logging Operator should extend the vm.max_map_count

The kernel parameter vm.max_map_count should be modified by the Cluster Logging Operator or the Elasticsearch Operator in order to allow an Elasticsearch instance to deploy correctly into the target project.

# Should be vm.max_map_count=262144
sysctl -n vm.max_map_count                          
65530
  • MC Patch
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-sysctl-elastic
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          # vm.max_map_count=262144
          source: data:text/plain;charset=utf-8;base64,dm0ubWF4X21hcF9jb3VudD0yNjIxNDQ=
        filesystem: root
        mode: 0644
        path: /etc/sysctl.d/99-elasticsearch.conf
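The base64 payload in the MachineConfig's data: URL can be decoded to confirm the sysctl line it writes:

```shell
# Decode the MachineConfig file contents to verify the sysctl setting.
printf '%s' 'dm0ubWF4X21hcF9jb3VudD0yNjIxNDQ=' | base64 -d
# prints: vm.max_map_count=262144
```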

Use existing ES cluster

Hi,

I'm using an Elasticsearch cluster for a search project, managed by the Elastic operator from elastic.co. Now I want to turn on cluster logging, but it seems it will try to create another cluster altogether.

Given the heavy resource requirements of ES clusters, how can I leverage my existing one instead of letting this operator create a new one? I can't seem to find anything in the ClusterLogging CR to do that, at least in the 4.2 version.

Thanks

CLO should not allow specification of the number of ES deployment replicas

We should not allow users to specify the number of ES pod replicas, since we know our model is one deployment per ES node. If this value represents 'ES nodes', then we should change the name accordingly. Furthermore, if this represents 'nodes', we should consider removing this field (or defaulting it if it doesn't exist), since it is possible to go into the unmanaged state, and we can default the number of ES nodes to '3'. Recall our target is 99% of installations on an AWS OpenShift installation, which should have enough infra and worker nodes to support logging.

Make /usr/share/logging/ location customizable

As of now, the location of the folder /usr/share/logging/ seems to be hardcoded in the Dockerfile, see:

RUN mkdir -p /usr/share/logging/

When running the CLO locally (e.g., during development), this location is expected to contain a couple of files; some examples are:

ERRO[0030] Unable to read file to get contents: open /usr/share/logging/curator/curator-actions.yaml: no such file or directory 
ERRO[0030] Unable to read file to get contents: open /usr/share/logging/curator/curator5-config.yaml: no such file or directory 
ERRO[0030] Unable to read file to get contents: open /usr/share/logging/curator/curator-config.yaml: no such file or directory

This can be a challenge for Apple users, as this location cannot be modified on macOS. See here or here.

Would it make sense to make this location customizable? Can we think of any downsides?
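One low-risk way to make the location customizable is an environment variable with a fallback to the current default, sketched below in shell. The variable name LOGGING_SHARE_DIR is an assumption for illustration, not an existing CLO option:

```shell
# Hypothetical override: use the hardcoded default unless the (assumed)
# LOGGING_SHARE_DIR environment variable is set by the developer.
LOGGING_SHARE_DIR="${LOGGING_SHARE_DIR:-/usr/share/logging}"
echo "curator config: ${LOGGING_SHARE_DIR}/curator/curator-config.yaml"
```

During development on macOS one could then point the variable at a writable checkout directory instead of /usr/share/logging/.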

Errors using the deploy-example goal

I'm trying to get cluster logging working on a local instance of minishift (or cluster up) and am facing some issues.

First issue

The first issue is that ELASTICSEARCH_OP_REPO seems to need to be set explicitly; otherwise I see this error:

+ popd
~/tmp/wildfly-efk/cluster-logging-operator
+ CREATE_ES_SECRET=false
+ NAMESPACE=openshift-logging
+ make -C hack/../../elasticsearch-operator deploy-setup
make[1]: *** hack/../../elasticsearch-operator: No such file or directory.  Stop.
make: *** [Makefile:91: deploy-setup] Error 2

Second Issue

The second issue is that the vendor/github.com/openshift/elasticsearch-operator/hack/deploy-setup.sh script has a line where it points to an invalid directory:

pushd vendor/github.com/coreos/prometheus-operator/example/prometheus-operator-crd
  for file in prometheusrule.crd.yaml servicemonitor.crd.yaml; do 
    oc create -n ${NAMESPACE} -f ${file} ||:
  done
popd

The directory vendor/github.com/coreos/prometheus-operator/example/prometheus-operator-crd does not exist.

Third Issue

This one could actually just be the environment I'm attempting to use: I'm using OpenShift 3, and it looks like this targets OpenShift 4. Anyway, the error is:

--> FROM registry.svc.ci.openshift.org/openshift/origin-v4.0:base as 1
--> RUN INSTALL_PKGS="       openssl       " &&     yum install -y $INSTALL_PKGS &&     rpm -V $INSTALL_PKGS &&     yum clean all &&     mkdir /tmp/_working_dir &&     chmod og+w /tmp/_working_dir
Loaded plugins: ovl, product-id, search-disabled-repos, subscription-manager
This system is not receiving updates. You can use subscription-manager on the host to register and assign subscriptions.
http://base-4-0.ocp.svc/rhel-fast-datapath/repodata/repomd.xml: [Errno 14] curl#6 - "Could not resolve host: base-4-0.ocp.svc; Unknown error"
Trying other mirror.


 One of the configured repositories failed (rhel-fast-datapath),
 and yum doesn't have enough cached data to continue. At this point the only
 safe thing yum can do is fail. There are a few ways to work "fix" this:

     1. Contact the upstream for the repository and get them to fix the problem.

     2. Reconfigure the baseurl/etc. for the repository, to point to a working
        upstream. This is most often useful if you are using a newer
        distribution release than is supported by the repository (and the
        packages for the previous distribution release still work).

     3. Run the command with the repository temporarily disabled
            yum --disablerepo=rhel-fast-datapath ...

     4. Disable the repository permanently, so yum won't use it by default. Yum
        will then just ignore the repository until you permanently enable it
        again or use --enablerepo for temporary usage:

            yum-config-manager --disable rhel-fast-datapath
        or
            subscription-manager repos --disable=rhel-fast-datapath

     5. Configure the failing repository to be skipped, if it is unavailable.
        Note that yum will try to contact the repo. when it runs most commands,
        so will have to try and fail each time (and thus. yum will be be much
        slower). If it is a very temporary problem though, this is often a nice
        compromise:

            yum-config-manager --save --setopt=rhel-fast-datapath.skip_if_unavailable=true

failure: repodata/repomd.xml from rhel-fast-datapath: [Errno 256] No more mirrors to try.
http://base-4-0.ocp.svc/rhel-fast-datapath/repodata/repomd.xml: [Errno 14] curl#6 - "Could not resolve host: base-4-0.ocp.svc; Unknown error"
running 'INSTALL_PKGS="       openssl       " &&     yum install -y $INSTALL_PKGS &&     rpm -V $INSTALL_PKGS &&     yum clean all &&     mkdir /tmp/_working_dir &&     chmod og+w /tmp/_working_dir' failed with exit code 1
make: *** [Makefile:75: image] Error 1

Permission problem when allow to run as anyuid

Quick install of Openshift 4.5 on AWS
Use gp2 as default storage class
Install cluster-logging and elasticsearch operators
Add scc to group:

 oc adm policy add-scc-to-group anyuid system:authenticated

Deploy a default instance as described on Openshift 4.5 documentation.

Then the Elasticsearch pods throw an error while copying files:

[2020-07-23 06:59:28,823][INFO ][container.run            ] Begin Elasticsearch startup script
[2020-07-23 06:59:28,826][INFO ][container.run            ] Comparing the specified RAM to the maximum recommended for Elasticsearch...
[2020-07-23 06:59:28,827][INFO ][container.run            ] Inspecting the maximum RAM available...
[2020-07-23 06:59:28,828][INFO ][container.run            ] ES_JAVA_OPTS: ' -Xms8192m -Xmx8192m'
[2020-07-23 06:59:28,829][INFO ][container.run            ] Copying certs from /etc/openshift/elasticsearch/secret to /etc/elasticsearch//secret
[2020-07-23 06:59:28,834][INFO ][container.run            ] Building required jks files and truststore
Importing keystore /etc/elasticsearch//secret/admin.p12 to /etc/elasticsearch//secret/admin.jks...
Entry for alias 1 successfully imported.
Import command completed:  1 entries successfully imported, 0 entries failed or cancelled

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /etc/elasticsearch//secret/admin.jks -destkeystore /etc/elasticsearch//secret/admin.jks -deststoretype pkcs12".

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /etc/elasticsearch//secret/admin.jks -destkeystore /etc/elasticsearch//secret/admin.jks -deststoretype pkcs12".
Certificate was added to keystore

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /etc/elasticsearch//secret/admin.jks -destkeystore /etc/elasticsearch//secret/admin.jks -deststoretype pkcs12".
Importing keystore /etc/elasticsearch//secret/elasticsearch.p12 to /etc/elasticsearch//secret/elasticsearch.jks...
Entry for alias 1 successfully imported.
Import command completed:  1 entries successfully imported, 0 entries failed or cancelled

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /etc/elasticsearch//secret/elasticsearch.jks -destkeystore /etc/elasticsearch//secret/elasticsearch.jks -deststoretype pkcs12".

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /etc/elasticsearch//secret/elasticsearch.jks -destkeystore /etc/elasticsearch//secret/elasticsearch.jks -deststoretype pkcs12".
Certificate was added to keystore

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /etc/elasticsearch//secret/elasticsearch.jks -destkeystore /etc/elasticsearch//secret/elasticsearch.jks -deststoretype pkcs12".
Importing keystore /etc/elasticsearch//secret/logging-es.p12 to /etc/elasticsearch//secret/logging-es.jks...
Entry for alias 1 successfully imported.
Import command completed:  1 entries successfully imported, 0 entries failed or cancelled

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /etc/elasticsearch//secret/logging-es.jks -destkeystore /etc/elasticsearch//secret/logging-es.jks -deststoretype pkcs12".

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /etc/elasticsearch//secret/logging-es.jks -destkeystore /etc/elasticsearch//secret/logging-es.jks -deststoretype pkcs12".
Certificate was added to keystore

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /etc/elasticsearch//secret/logging-es.jks -destkeystore /etc/elasticsearch//secret/logging-es.jks -deststoretype pkcs12".
Certificate was added to keystore
Certificate was added to keystore
cp: cannot create regular file '/etc/elasticsearch/elasticsearch.yml': Permission denied
cp: cannot create regular file '/etc/elasticsearch/log4j2.properties': Permission denied

Proxyconfig-controller fails to delete logcollector service account

This may not be the core issue; however, I see the following error in the cluster-logging-operator logs repeat several times during startup. It does eventually stop appearing, but maybe this is something we can handle more gracefully with a get check first?

{"level":"error","ts":1582132106.5039325,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"proxyconfig-controller","request":"/cluster","error":"Unable to create or update collection for \"\": Failure deleting logcollector service account: an empty namespace may not be set when a resource name is provided","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/cluster-logging-operator/_output/src/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/cluster-logging-operator/_output/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/cluster-logging-operator/_output/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/cluster-logging-operator/_output/src/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/cluster-logging-operator/_output/src/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/cluster-logging-operator/_output/src/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

Does CLO managed state control EO managed state?

Working through config options and the spec:

  • What happens when I set the CLO to Unmanaged?
  • What happens when I set the EO to Unmanaged but the CLO is Managed?

We need to sort this out and describe it in our documentation.

fluentd pods scheduled only on worker nodes

I deployed the logging stack following the README.md file, and the operators and pods were up and running.

I no longer see a nodeSelector in the DaemonSet configuration, but the fluentd pods are up and running only on worker nodes, not on master nodes.

[root@localhost ocp1]# oc get pods -owide -l component=fluentd
NAME            READY     STATUS    RESTARTS   AGE       IP            NODE
fluentd-4xxgl   1/1       Running   0          36m       10.131.x.xx   ip-10-0-xxx-xx.us-east-2.compute.internal
fluentd-r56fp   1/1       Running   0          36m       10.128.x.xx   ip-10-0-143-93.us-east-2.compute.internal
fluentd-swrqv   1/1       Running   0          36m       10.129.x.xx   ip-10-0-175-200.us-east-2.compute.internal
[root@localhost ocp1]#

The cluster comprises 6 nodes: 3 masters and 3 workers.

Could someone check whether this is expected behavior, or am I missing something?

Create custom elasticsearch index in fluentd configuration

Can you detail the steps to create a custom Elasticsearch index in the fluentd configuration?

I tried the following, but it says it cannot create any new index other than project_full, operations_full..

<elasticsearch_index_name>
     enabled "true"
     tag "myapp*"
     name_type custom_index
</elasticsearch_index_name>

Thanks for any help in advance

Set max_map_count when using minishift

Follow-up from #41: when deploying the example and the OpenShift cluster is set up via minishift, the script should run the following:

minishift ssh -- sudo sysctl -w vm.max_map_count=262144

500 Internal Error Additional Trusted CA Bundle missing

Hello,

Internally signed TLS trusted CA bundles are not being copied to the kibana and kibana-proxy pods.

I had to set the operator to Unmanaged and create the ConfigMap with the additional trusted CA bundle, named "trusted-ca-bundle".

Then modified the deployment.

$ diff -U5 deployment_kibana.yaml.old deployment_kibana.yaml
--- deployment_kibana.yaml.old  2019-10-25 09:01:02.446738600 -0400
+++ deployment_kibana.yaml      2019-10-25 09:00:10.815299500 -0400
@@ -87,10 +87,13 @@
         terminationMessagePolicy: File
         volumeMounts:
         - mountPath: /etc/kibana/keys
           name: kibana
           readOnly: true
+        - mountPath: /etc/pki/ca-trust/extracted/pem
+          name: trusted-ca-bundle
+          readOnly: true
       - args:
         - --upstream-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
         - --https-address=:3000
         - -provider=openshift
         - -client-id=system:serviceaccount:openshift-logging:kibana
@@ -128,10 +131,13 @@
         terminationMessagePolicy: File
         volumeMounts:
         - mountPath: /secret
           name: kibana-proxy
           readOnly: true
+        - mountPath: /etc/pki/ca-trust/extracted/pem
+          name: trusted-ca-bundle
+          readOnly: true
       dnsPolicy: ClusterFirst
       nodeSelector:
         kubernetes.io/os: linux
         node-role.kubernetes.io/infra: ""
       restartPolicy: Always
@@ -147,10 +153,17 @@
           secretName: kibana
       - name: kibana-proxy
         secret:
           defaultMode: 420
           secretName: kibana-proxy
+      - configMap:
+          defaultMode: 420
+          items:
+          - key: ca-bundle.crt
+            path: tls-ca-bundle.pem
+          name: trusted-ca-bundle
+        name: trusted-ca-bundle
 status:
   availableReplicas: 1
   conditions:
   - lastTransitionTime: "2019-10-24T18:21:02Z"
     lastUpdateTime: "2019-10-24T18:21:02Z"

Failure creating Elasticsearch CR

Issue

Cluster logging can't be deployed on OCP 4. The Cluster Logging Operator reports this error:

time="2019-04-09T16:14:18Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update logstore for \"instance\": Failure creating Elasticsearch CR: failed to get resource client: failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(logging.openshift.io/v1, Kind=Elasticsearch): no matches for kind \"Elasticsearch\" in version \"logging.openshift.io/v1\""
time="2019-04-09T16:14:23Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update logstore for \"instance\": Failure creating Elasticsearch CR: failed to get resource client: failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(logging.openshift.io/v1, Kind=Elasticsearch): no matches for kind \"Elasticsearch\" in version \"logging.openshift.io/v1\""

after we applied the following CR in the namespace openshift-logging:

oc create -n openshift-logging -f hack/cr.yaml
where cr.yaml is:

apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
spec:
  managementState: "Managed"
  logStore:
    type: "elasticsearch"
    elasticsearch:
      nodeCount: 1
      storage: {}
      redundancyPolicy: "ZeroRedundancy"
  visualization:
    type: "kibana"
    kibana:
      replicas: 1
  curation:
    type: "curator"
    curator:
      schedule: "30 3,9,15,21 * * *"
  collection:
    logs:
      type: "fluentd"
      fluentd: {}
-->
clusterlogging.logging.openshift.io/instance created

Info

Red Hat OpenShift Container Platform
OpenShift is Red Hat's container application platform that allows developers to quickly develop, host, and scale applications in a cloud environment.

Cluster ID
a3acddcb-6eff-41d5-b10c-eafb2b905d11
Kubernetes Master Version
v1.12.4+0ba401e

Cluster Logging Operator deployed : 4.1.0 (preview)

ImageStreams are not propagated to the deployment

There looks to be a discrepancy between what's in the manifest:
https://github.com/openshift/cluster-logging-operator/blob/master/manifests/05-deployment.yaml#L29-L42

and what gets rolled out.

    spec:
      containers:
      - command:
        - cluster-logging-operator
        env:
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: OPERATOR_NAME
          value: cluster-logging-operator
        - name: ELASTICSEARCH_IMAGE
          value: quay.io/openshift-release-dev/ocp-v4.0-art-dev:v4.0.0-0.93.0.0-ose-elasticsearch-operator
        image: quay.io/openshift/cluster-logging-operator:latest
        imagePullPolicy: IfNotPresent

Note: I manually added ELASTICSEARCH_IMAGE to try to update the ES image; it didn't exist before that.
This will prevent us from deploying the correct images during release.

Also, it doesn't appear the value was added to the ES operator deployment:

  - command:
    - elasticsearch-operator
    env:
    - name: WATCH_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: OPERATOR_NAME
      value: elasticsearch-operator
    image: quay.io/openshift/elasticsearch-operator:latest

Use RetryOnConflict for updates to existing objects #28

First take a look at https://github.com/kubernetes/client-go/blob/master/examples/create-update-delete-deployment/main.go#L102 which describes why RetryOnConflict is needed.

There are several patterns in our code like this:

  client.Get(object)
  object.somefield = "new value"
  client.Update(object)

The problem is that the object can be updated by another client between the Get and the Update, in which case the Update will return a Conflict error. Instead, we need to wrap all such places in our code with RetryOnConflict.

I've already seen cases running e2e tests where we get errors from conflicts.

Change repository permissions

I had a badly set up .git/config and by mistake pushed to upstream/master.

Can we change the repo settings so that direct pushes are always declined?

how to set MERGE_JSON_LOG for fluentd

Hi,
I'm trying to set MERGE_JSON_LOG=true while preserving ManagementState == Managed (unlike the solution from the docs).

Target (DaemonSet autogenerated by the cluster-logging-operator)

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: fluentd
spec:
  template:
    spec:
      containers:
        - name: fluentd
          resources: {}
          env:
            - name: MERGE_JSON_LOG
              value: 'true'

Current State
I've got a clusterlogging 4.3.1 operator up and running, and every manual manipulation of the DaemonSet is instantly reverted.
generators/forwarding/fluentd/templates.go#L167 looks like what I want to do, but I can't figure out how to pass the json_fields/ENV correctly.

My CL-instance.yaml looks like this:

apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  managementState: Managed
  collection:
    logs:
      type: fluentd
      fluentd:
        merge_json_log: true          # <----  that's what I'd like to do

Any hint appreciated!
Best, Nick

Switching the logCollection type from fluentd to rsyslog does not delete fluentd pods.

How to reproduce the issue.

Original pods:

NAME                                                  READY     STATUS    RESTARTS   AGE
cluster-logging-operator-5b8f47b598-lh7zp             1/1       Running   0          45m
elasticsearch-clientdatamaster-0-1-84d764899d-qqkqv   1/1       Running   0          44m
elasticsearch-operator-649f9b69b5-6wkkj               1/1       Running   0          45m
fluentd-82pp7                                         1/1       Running   0          44m
fluentd-sdlth                                         1/1       Running   0          44m
kibana-675b587dfd-l5s5j                               2/2       Running   0          44m

oc edit clusterlogging example - change the spec.collection.logCollection.type to "rsyslog"

Rsyslog pods are created, but still the fluentd pods are running.

NAME                                                  READY     STATUS    RESTARTS   AGE
cluster-logging-operator-5b8f47b598-lh7zp             1/1       Running   0          46m
elasticsearch-clientdatamaster-0-1-84d764899d-qqkqv   1/1       Running   0          45m
elasticsearch-operator-649f9b69b5-6wkkj               1/1       Running   0          45m
fluentd-82pp7                                         1/1       Running   0          45m
fluentd-sdlth                                         1/1       Running   0          45m
kibana-675b587dfd-l5s5j                               2/2       Running   0          45m
rsyslog-92hz4                                         1/1       Running   0          42s
rsyslog-hcpcc                                         1/1       Running   0          42s

Assign a priority class to pods

Priority classes docs:
https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/priority_preemption.html#admin-guide-priority-preemption-priority-class

Example: https://github.com/openshift/cluster-monitoring-operator/search?q=priority&unscoped_q=priority

Notes: The pre-configured system priority classes (system-node-critical and system-cluster-critical) can only be assigned to pods in kube-system or openshift-* namespaces. Most likely, core operators and their pods should be assigned system-cluster-critical. Please do not assign system-node-critical (the highest priority) unless you are really sure about it.
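
Assigning the class is a one-line addition to the pod template. An illustrative snippet for the collector daemonset, following the note above (class choice is a suggestion, not a decision):

```yaml
spec:
  template:
    spec:
      # Allowed here because the pods run in an openshift-* namespace.
      priorityClassName: system-cluster-critical
      containers:
      - name: fluentd
        image: docker.io/openshift/origin-logging-fluentd:latest
```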

Wrong indentation for the field - metadata within hack/cr-aws.yaml

Issue

The METADATA field is not well positioned within the file hack/cr-aws.yaml

apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
  metadata:
  name: "instance"

Should be

apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"

Feature Request: Allow list of ElasticSearch endpoints in Log Forwarding API

Hello,

In OpenShift 4.5 I would like to send the logs collected by fluentd to multiple Elasticsearch endpoints. My ELK administrator sent me a list of three ES nodes to which fluentd is supposed to send the logs. There is no load balancer in front of ES.

In the LogForwarding CR you can only specify one FQDN for each endpoint:

oc describe crd logforwardings.logging.openshift.io
[..]
Schema:
  openAPIV3Schema:
    Properties:
      Spec:
        Description: Specification for logforwarding of messages
        Properties:
          Outputs:
            Description: Destinations for log messages
            Items:
              Description: An individual entry for a specific output
              Properties:
                Endpoint:
                  Description: the url to the service defined by this output
               => Type:        string
                Name:
                  Description: The name of the output
                  Type:        string
[..]

I would like to specify a list of hosts and have fluentd choose one of them.
Fluentd accepts a list of hosts (https://docs.fluentd.org/output/elasticsearch#hosts-optional)

Thanks, Thomas
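
For reference, with the plugin's hosts option the output config would look roughly like this (host names are placeholders; other required settings omitted):

```
<match **>
  @type elasticsearch
  # Comma-separated list; the plugin distributes requests across live hosts.
  hosts es-node1.example.com:9200,es-node2.example.com:9200,es-node3.example.com:9200
  scheme https
</match>
```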

Reconsider how collector nodeSelector is defined

The recent submission of a PR to support rsyslog has made me question why we have not taken the opportunity in 4.0 to possibly change our node selector for the collector. Consider the following;
Ref: https://github.com/openshift/cluster-logging-operator/blob/master/hack/cr.yaml

...
  collection:
    logCollection:
      type: "fluentd"
      fluentd:
        nodeSelector:
          logging-infra-fluentd: "true"
      rsyslog:
        nodeSelector:
          logging-infra-rsyslog: "true"

Wouldn't it be more reasonable to have a single label which takes a collector:

logging-infra-collector: "fluentd | rsyslog"

Additionally, moving to a 'well-known-label'='known value' pattern would allow us to remove the nodeSelector block altogether. I do not see a need for a customer to have to modify the selector.

Either way allows us to split collectors across nodes, but the latter would allow us to add additional collectors without requiring an additional label.

  • Why did we decide to continue to use a boolean to identify which nodes will receive the collector?
  • Additionally, node labeling is now outside the responsibility of this operator correct? What is the workflow to get alternate collectors landed on nodes?
  • One could additionally argue that eventrouter is a specialized collector. Is there a need to support a variant of landing multiple collectors on the same node?

Invalid fields in Logging Cluster Components status

The Logging Cluster is installed properly and I am able to see the logs. But when I try to see the statuses of the different components, I see the messages shown in the screen capture below. The screen capture is from OpenShift 4.3, but I saw the same in 4.5 as well.

After digging into it, I saw that the paths being accessed in the CSV to check the status are incorrect in this file.
The below paths are arrays:

- visualization.kibanaStatus
- logStore.elasticsearchStatus

But they are being referred to as if they were single objects:

- visualization.kibanaStatus.pods
- logStore.elasticsearchStatus.pods.client
- logStore.elasticsearchStatus.pods.data
- logStore.elasticsearchStatus.pods.master

Rather, they should have been as below:

- visualization.kibanaStatus[0].pods
- logStore.elasticsearchStatus[0].pods.client
- logStore.elasticsearchStatus[0].pods.data
- logStore.elasticsearchStatus[0].pods.master
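
In the CSV, that would translate to statusDescriptors along these lines (descriptions and surrounding fields are illustrative, not the file's actual contents):

```yaml
statusDescriptors:
- description: The status for each of the Kibana pods
  path: visualization.kibanaStatus[0].pods
- description: The status for each of the Elasticsearch client pods
  path: logStore.elasticsearchStatus[0].pods.client
- description: The status for each of the Elasticsearch data pods
  path: logStore.elasticsearchStatus[0].pods.data
- description: The status for each of the Elasticsearch master pods
  path: logStore.elasticsearchStatus[0].pods.master
```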

I saw this issue in 4.2, 4.3, 4.4 and 4.5.
It's just a small change that needs to be done.
I plan to create a PR for 4.5.

The screen capture mentioned above: Invalid_field_clusterlogging

PersistentVolumeClaim does not seem like a viable storage choice

Based on the description in [1], repeated here, this does not seem like a viable choice to represent the ES node storage. The description indicates you can only specify a single, existing PVC, which is not usable for anything but a single-node ES cluster.

 78     // PersistentVolumeClaim will NOT try to regenerate PVC, it will be used
 79     // as-is. You may want to use it instead of VolumeClaimTemplate in case
 80     // you already have bounded PersistentVolumeClaims you want to use, and the names
 81     // of these PersistentVolumeClaims doesn't follow the naming convention.
 82     PersistentVolumeClaim *v1.PersistentVolumeClaimVolumeSource `json:"persistentVolumeClaim,omitempty"`

Additionally, we provide no mechanism to specify or default a storage class. This seems like an issue, given that storageClass is the convenient way to define storage in a single representation of kind, size, etc.

[1] https://github.com/openshift/elasticsearch-operator/blob/master/pkg/apis/elasticsearch/v1alpha1/types.go#L78-L82
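
For contrast, a volumeClaimTemplate-style field would let the operator stamp out one PVC per ES node and carry a storage class. A hypothetical shape only — the actual CRD field names may differ:

```yaml
  nodeSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2   # defaultable; 'gp2' is just an example
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 200Gi
```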

reconciliation resulting in certificate issue and causing elasticsearch unstable.

While using OpenShift 4.3.5 with Cluster Logging Operator 4.3.9-202003230345, after around 24 hours we start getting the below error related to certificates, and the Elasticsearch pods go down. It seems to be a bug in the operator, which is not able to reconcile or rotate the certificates for Elasticsearch and Kibana?

{"level":"error","ts":1586146160.3927138,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"logforwarding-controller","request":"openshift-logging/instance","error":"Unable to create or update certificates for "instance": Error running script: exit status 1","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/cluster-logging-operator/_output/src/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/cluster-logging-operator/_output/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/cluster-logging-operator/_output/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/cluster-logging-operator/_output/src/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/cluster-logging-operator/_output/src/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/cluster-logging-operator/_output/src/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

Error deploying an instance of Cluster-Logging regarding ES SearchGuard

The error I've faced is with Elasticsearch and SearchGuard: SearchGuard cannot be initialized, the cluster stays RED, and it does not self-recover:

  • Context:
OCP 4.4
OCS 4.3
ELO 4.4
CLO 4.4
  • Fluentd
2020-01-28 09:12:02 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.166.250:9200 (Errno::ECONNREFUSED)
2020-01-28 09:12:02 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 14. Retry to communicate after 2 second(s).
2020-01-28 09:12:07 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.166.250:9200 (Errno::ECONNREFUSED)
2020-01-28 09:12:07 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 13. Retry to communicate after 4 second(s).
2020-01-28 09:12:16 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.166.250:9200 (Errno::ECONNREFUSED)
2020-01-28 09:12:16 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 12. Retry to communicate after 8 second(s).
2020-01-28 09:12:33 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.166.250:9200 (Errno::ECONNREFUSED)
2020-01-28 09:12:33 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 11. Retry to communicate after 16 second(s).
2020-01-28 09:13:05 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.166.250:9200 (Errno::ECONNREFUSED)
2020-01-28 09:13:05 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 10. Retry to communicate after 32 second(s).
2020-01-28 09:14:10 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.166.250:9200 (Errno::ECONNREFUSED)
2020-01-28 09:14:10 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 9. Retry to communicate after 64 second(s).
2020-01-28 09:14:15 +0000 [error]: unexpected error error_class=Elasticsearch::Transport::Transport::Errors::ServiceUnavailable error="[503] Search Guard not initialized (SG11). See https://github.com/floragunncom/search-guard-docs/blob/master/sgadmin.md"
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/elasticsearch-transport-7.4.0/lib/elasticsearch/transport/transport/base.rb:205:in `__raise_transport_error'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/elasticsearch-transport-7.4.0/lib/elasticsearch/transport/transport/base.rb:333:in `perform_request'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/elasticsearch-transport-7.4.0/lib/elasticsearch/transport/transport/http/faraday.rb:24:in `perform_request'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/elasticsearch-transport-7.4.0/lib/elasticsearch/transport/client.rb:152:in `perform_request'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/elasticsearch-api-7.4.0/lib/elasticsearch/api/actions/info.rb:19:in `info'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluent-plugin-elasticsearch-3.7.1/lib/fluent/plugin/out_elasticsearch.rb:394:in `detect_es_major_version'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluent-plugin-elasticsearch-3.7.1/lib/fluent/plugin/out_elasticsearch.rb:264:in `block in configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluent-plugin-elasticsearch-3.7.1/lib/fluent/plugin/elasticsearch_index_template.rb:35:in `retry_operate'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluent-plugin-elasticsearch-3.7.1/lib/fluent/plugin/out_elasticsearch.rb:263:in `configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin.rb:164:in `configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/multi_output.rb:74:in `block in configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/multi_output.rb:63:in `each'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/multi_output.rb:63:in `configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_copy.rb:36:in `configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin.rb:164:in `configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/agent.rb:130:in `add_match'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/agent.rb:72:in `block in configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/agent.rb:64:in `each'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/agent.rb:64:in `configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/label.rb:31:in `configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/root_agent.rb:147:in `block in configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/root_agent.rb:147:in `each'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/root_agent.rb:147:in `configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/engine.rb:131:in `configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/engine.rb:96:in `run_configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/supervisor.rb:812:in `run_configure'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/supervisor.rb:558:in `block in run_worker'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/supervisor.rb:741:in `main_process'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/supervisor.rb:554:in `run_worker'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/command/fluentd.rb:330:in `<top (required)>'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/bin/fluentd:8:in `<top (required)>'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/bin/fluentd:23:in `load'
  2020-01-28 09:14:15 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/bin/fluentd:23:in `<main>'

On the Elasticsearch side I see that the cluster is in RED state, but it appears as Running and Ready on OpenShift:

NAME                                            READY   STATUS             RESTARTS   AGE
cluster-logging-operator-667799d786-z4nkh       1/1     Running            0          40m
elasticsearch-cdm-nwjeo1ix-1-68855bd9b-qr6d4    2/2     Running            0          36m
elasticsearch-cdm-nwjeo1ix-2-5f5866dd47-hjwfg   2/2     Running            0          36m
elasticsearch-cdm-nwjeo1ix-3-7c75876bd4-7zpc6   2/2     Running            0          36m
fluentd-58g69                                   0/1     CrashLoopBackOff   7          13m
fluentd-c77vw                                   0/1     CrashLoopBackOff   7          13m
fluentd-ljwxz                                   0/1     CrashLoopBackOff   7          13m
fluentd-v7tgt                                   0/1     CrashLoopBackOff   7          13m
fluentd-v7tgt-debug                             1/1     Running            0          10m
kibana-69c75d9cd9-lcljp                         2/2     Running            0          36m

This happens the first time you deploy an instance; if you kill the fluentd pods, they end up in the same status. The real error is in Elasticsearch, which for some reason cannot initialize SearchGuard.

To work around the error, you just need to delete the ES pods; with time the cluster will reach the Green state:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[2020-01-28 10:33:50,205][INFO ][container.run            ] Elasticsearch is ready and listening
/usr/share/elasticsearch/init ~
[2020-01-28 10:33:50,211][INFO ][container.run            ] Starting init script: 0001-jaeger
[2020-01-28 10:33:50,213][INFO ][container.run            ] Completed init script: 0001-jaeger
[2020-01-28 10:33:50,251][INFO ][container.run            ] Forcing the seeding of ACL documents
[2020-01-28 10:33:50,254][INFO ][container.run            ] Seeding the searchguard ACL index.  Will wait up to 604800 seconds.
[2020-01-28 10:33:50,286][INFO ][container.run            ] Seeding the searchguard ACL index.  Will wait up to 604800 seconds.
/etc/elasticsearch /usr/share/elasticsearch/init
Search Guard Admin v5
Will connect to localhost:9300 ... done
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
Elasticsearch Version: 5.6.16
Search Guard Version: <unknown>
Contacting elasticsearch cluster 'elasticsearch' ...
Clustername: elasticsearch
Clusterstate: RED
Number of nodes: 3
Number of data nodes: 3
.searchguard index already exists, so we do not need to create one.
ERR: .searchguard index state is RED.
Populate config from /opt/app-root/src/sgconfig/
Will update 'config' with /opt/app-root/src/sgconfig/sg_config.yml
   SUCC: Configuration for 'config' created or updated
Will update 'roles' with /opt/app-root/src/sgconfig/sg_roles.yml
   SUCC: Configuration for 'roles' created or updated
Will update 'rolesmapping' with /opt/app-root/src/sgconfig/sg_roles_mapping.yml
   SUCC: Configuration for 'rolesmapping' created or updated
Will update 'internalusers' with /opt/app-root/src/sgconfig/sg_internal_users.yml
   SUCC: Configuration for 'internalusers' created or updated
Will update 'actiongroups' with /opt/app-root/src/sgconfig/sg_action_groups.yml
   SUCC: Configuration for 'actiongroups' created or updated
Done with success
/usr/share/elasticsearch/init
[2020-01-28 10:34:25,709][INFO ][container.run            ] Seeded the searchguard ACL index
[2020-01-28 10:34:25,710][INFO ][container.run            ] Disabling auto replication
/etc/elasticsearch /usr/share/elasticsearch/init
Search Guard Admin v5
Will connect to localhost:9300 ... done
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
Elasticsearch Version: 5.6.16
Search Guard Version: <unknown>
Reload config on all nodes
Auto-expand replicas disabled
/usr/share/elasticsearch/init
[2020-01-28 10:34:39,568][INFO ][container.run            ] Updating replica count to 1
/etc/elasticsearch /usr/share/elasticsearch/init
Search Guard Admin v5
Will connect to localhost:9300 ... done
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
...
...
  • After killing the ES pods and waiting a bit, you can see this entry in the ES logs, which suggests that the recovery happens:
Elasticsearch Version: 5.6.16
Search Guard Version: <unknown>
Reload config on all nodes
Update number of replicas to 1 with result: true
/usr/share/elasticsearch/init
[2020-01-28 10:34:45,420][INFO ][container.run            ] Adding index templates
[2020-01-28 10:34:45,608][INFO ][container.run            ] Index template 'com.redhat.viaq-openshift-operations.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-01-28 10:35:15,288][INFO ][container.run            ] Index template 'com.redhat.viaq-openshift-orphaned.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-01-28 10:35:45,330][INFO ][container.run            ] Index template 'com.redhat.viaq-openshift-project.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-01-28 10:35:50,283][INFO ][container.run            ] Index template 'common.settings.kibana.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-01-28 10:36:00,286][INFO ][container.run            ] Index template 'common.settings.operations.orphaned.json' found in the cluster, overriding it
{"acknowledged":true}[2020-01-28 10:36:00,585][INFO ][container.run            ] Index template 'common.settings.operations.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-01-28 10:36:00,868][INFO ][container.run            ] Index template 'common.settings.project.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-01-28 10:36:01,151][INFO ][container.run            ] Index template 'jaeger-service.json' found in the cluster, overriding it
{"acknowledged":true}[2020-01-28 10:36:01,435][INFO ][container.run            ] Index template 'jaeger-span.json' found in the cluster, overriding it
{"acknowledged":true}[2020-01-28 10:36:01,729][INFO ][container.run            ] Index template 'org.ovirt.viaq-collectd.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-01-28 10:36:01,935][INFO ][container.run            ] Finished adding index templates
[2020-01-28 10:36:01,940][INFO ][container.run            ] Starting init script: 0500-remove-index-patterns-without-uid
[2020-01-28 10:36:02,090][INFO ][container.run            ] Found 0 index-patterns to evaluate for removal
[2020-01-28 10:36:02,090][INFO ][container.run            ] Completed init script: 0500-remove-index-patterns-without-uid with 0 successful and 0 failed bulk requests
[2020-01-28 10:36:02,094][INFO ][container.run            ] Starting init script: 0510-bz1656086-remove-index-patterns-with-bad-title
[2020-01-28 10:36:02,255][INFO ][container.run            ] Found 0 index-patterns to remove
[2020-01-28 10:36:02,441][INFO ][container.run            ] Completed init script: 0510-bz1656086-remove-index-patterns-with-bad-title
[2020-01-28 10:36:02,446][INFO ][container.run            ] Starting init script: 0520-bz1658632-remove-old-sg-indices
[2020-01-28 10:36:02,740][WARN ][container.run            ] Found .searchguard setting 'index.routing.allocation.include._name' to be null
[2020-01-28 10:36:02,741][INFO ][container.run            ] Updating .searchguard setting 'index.routing.allocation.include._name' to be null
[2020-01-28 10:36:02,899][INFO ][container.run            ] Completed init script: 0520-bz1658632-remove-old-sg-indices
[2020-01-28 10:36:02,903][INFO ][container.run            ] Starting init script: 0530-bz1667801-fix-kibana-replica-shards
[2020-01-28 10:36:03,042][INFO ][container.run            ] Found 0 Kibana indices with replica count not equal to 1
[2020-01-28 10:36:03,042][INFO ][container.run            ] Completed init script: 0530-bz1667801-fix-kibana-replica-shards

Missing PriorityClass

I'm trying to get a simple CR deployed, but I'm facing the following error:

ERRO[0031] error syncing key (openshift-logging/example): Unable to create or update collection: Failure creating Collection priority class: failed to get resource client: failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(scheduling.k8s.io/v1beta1, Kind=PriorityClass): no matches for kind "PriorityClass" in version "scheduling.k8s.io/v1beta1" 

Here are the steps I'm following, based on the readme and trial and error.

$ minishift version
v1.25.0+90fb23e
$ minishift start --cpus 2 --memory 8192
$ oc login -u system:admin
$ oc create -f manifests/01-namespace.yaml
$ oc project openshift-logging
$ make deploy-setup
$ REPO_PREFIX=openshift/ \
    IMAGE_PREFIX=origin- \
    OPERATOR_NAME=cluster-logging-operator \
    WATCH_NAMESPACE=openshift-logging \
    KUBERNETES_CONFIG=~/.kube/config \
    ELASTICSEARCH_IMAGE=docker.io/openshift/origin-logging-elasticsearch5:latest \
    OAUTH_PROXY_IMAGE=docker.io/openshift/oauth-proxy:latest \
    KIBANA_IMAGE=docker.io/openshift/origin-logging-kibana5:latest \
    CURATOR_IMAGE=docker.io/openshift/origin-logging-curator5:latest \
    FLUENTD_IMAGE=docker.io/openshift/origin-logging-fluentd:latest \
    go run cmd/cluster-logging-operator/main.go
$ oc apply -f hack/cr.yaml

What am I doing wrong?

Schema change introduced parsing error

Doesn't tell where the error is in the code, unfortunately.

ERRO[0000] error syncing key (openshift-logging/example): failed to decode json data with gvk(logging.openshift.io/v1alpha1, Kind=ClusterLogging): v1alpha1.ClusterLogging.Spec: v1alpha1.ClusterLoggingSpec.Collection: v1alpha1.CollectionSpec.LogCollection: v1alpha1.LogCollectionSpec.FluentdSpec: v1alpha1.FluentdSpec.NodeSelector: ReadMapCB: expect { or n, but found ", error found in #10 byte of ...|elector":"logging-in|..., bigger context ...|ion":{"logCollection":{"fluentd":{"nodeSelector":"logging-infra-fluentd=true"},"type":"fluentd"}},"c|... 
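
Reading the decode error's context, though, the cause is that nodeSelector was passed as a single string where the schema expects a map — i.e. the failing form versus the accepted one:

```yaml
# rejected: nodeSelector as a flat string
fluentd:
  nodeSelector: "logging-infra-fluentd=true"

# accepted: nodeSelector as a map
fluentd:
  nodeSelector:
    logging-infra-fluentd: "true"
```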

Error deploying ES: Regarding SearchGuard

The error I've faced is with Elasticsearch: it cannot be initialized, and the cluster stays RED and does not self-recover. OCP version is 4.4.

  • oc get pods
NAME                                            READY   STATUS             RESTARTS   AGE
cluster-logging-operator-598b875dfc-mmtp4       1/1     Running            2          3d12h
elasticsearch-cdm-85u334ts-1-5dd99bb9-p6lz6     2/2     Running            0          13m
elasticsearch-cdm-85u334ts-2-dbdc7d9d5-z7chm    2/2     Running            0          12m
elasticsearch-cdm-85u334ts-3-5744fbfd4b-4zxw6   2/2     Running            0          12m
fluentd-825zp                                   0/1     CrashLoopBackOff   7          14m
fluentd-8djwz                                   0/1     CrashLoopBackOff   7          14m
fluentd-crrqz                                   0/1     CrashLoopBackOff   7          14m
fluentd-dzqm6                                   0/1     CrashLoopBackOff   7          14m
fluentd-kmwn7                                   0/1     CrashLoopBackOff   7          14m
fluentd-ph2rh                                   0/1     CrashLoopBackOff   7          14m
fluentd-px7kz                                   0/1     CrashLoopBackOff   7          14m
kibana-6c4b5d7c8d-nqqzc                         2/2     Running            0          45m
  • fluentd
2020-06-15 08:46:27 +0000 [error]: unexpected error error_class=Elasticsearch::Transport::Transport::Errors::ServiceUnavailable error="[503] Search Guard not initialized (SG11). See https://github.com/floragunncom/search-guard-docs/blob/master/sgadmin.md"
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/elasticsearch-transport-7.4.0/lib/elasticsearch/transport/transport/base.rb:205:in `__raise_transport_error'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/elasticsearch-transport-7.4.0/lib/elasticsearch/transport/transport/base.rb:333:in `perform_request'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/elasticsearch-transport-7.4.0/lib/elasticsearch/transport/transport/http/faraday.rb:24:in `perform_request'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/elasticsearch-transport-7.4.0/lib/elasticsearch/transport/client.rb:152:in `perform_request'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/elasticsearch-api-7.4.0/lib/elasticsearch/api/actions/info.rb:19:in `info'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluent-plugin-elasticsearch-3.7.1/lib/fluent/plugin/out_elasticsearch.rb:394:in `detect_es_major_version'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluent-plugin-elasticsearch-3.7.1/lib/fluent/plugin/out_elasticsearch.rb:264:in `block in configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluent-plugin-elasticsearch-3.7.1/lib/fluent/plugin/elasticsearch_index_template.rb:35:in `retry_operate'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluent-plugin-elasticsearch-3.7.1/lib/fluent/plugin/out_elasticsearch.rb:263:in `configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin.rb:164:in `configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/multi_output.rb:74:in `block in configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/multi_output.rb:63:in `each'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/multi_output.rb:63:in `configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_copy.rb:36:in `configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin.rb:164:in `configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/agent.rb:130:in `add_match'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/agent.rb:72:in `block in configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/agent.rb:64:in `each'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/agent.rb:64:in `configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/label.rb:31:in `configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/root_agent.rb:147:in `block in configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/root_agent.rb:147:in `each'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/root_agent.rb:147:in `configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/engine.rb:131:in `configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/engine.rb:96:in `run_configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/supervisor.rb:812:in `run_configure'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/supervisor.rb:558:in `block in run_worker'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/supervisor.rb:741:in `main_process'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/supervisor.rb:554:in `run_worker'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/command/fluentd.rb:330:in `<top (required)>'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.7.4/bin/fluentd:8:in `<top (required)>'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/bin/fluentd:23:in `load'
  2020-06-15 08:46:27 +0000 [error]: /opt/rh/rh-ruby25/root/usr/local/bin/fluentd:23:in `<main>'
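The `[503] Search Guard not initialized (SG11)` error above means fluentd reached Elasticsearch before its Search Guard ACL index was seeded, so the next step is to check the Elasticsearch side. A hedged sketch, assuming the `component=elasticsearch` pod label and the `es_util` helper shipped in the OpenShift Elasticsearch image:

```shell
# Pick one Elasticsearch pod (label is an assumption; verify with
# 'oc get pods -n openshift-logging --show-labels').
ESPod=$(oc get pods -n openshift-logging -l component=elasticsearch \
        -o jsonpath='{.items[0].metadata.name}')

# Query cluster health through the es_util helper; a RED status here
# matches the Search Guard seeding failures seen in the fluentd trace.
oc exec -n openshift-logging -c elasticsearch "$ESPod" -- \
   es_util --query=_cluster/health?pretty
```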
  • oc logs -f $ESPod -c elasticsearch:
[2020-06-15 08:31:52,097][INFO ][container.run            ] Elasticsearch is ready and listening
/usr/share/elasticsearch/init ~
[2020-06-15 08:31:52,114][INFO ][container.run            ] Starting init script: 0001-jaeger
[2020-06-15 08:31:52,116][INFO ][container.run            ] Completed init script: 0001-jaeger
[2020-06-15 08:31:52,160][INFO ][container.run            ] Forcing the seeding of ACL documents
[2020-06-15 08:31:52,164][INFO ][container.run            ] Seeding the searchguard ACL index.  Will wait up to 604800 seconds.
[2020-06-15 08:31:52,204][INFO ][container.run            ] Seeding the searchguard ACL index.  Will wait up to 604800 seconds.
/etc/elasticsearch /usr/share/elasticsearch/init
Search Guard Admin v5
Will connect to localhost:9300 ... done
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
Elasticsearch Version: 5.6.16
Search Guard Version: <unknown>
Contacting elasticsearch cluster 'elasticsearch' ...
Clustername: elasticsearch
Clusterstate: RED
Number of nodes: 3
Number of data nodes: 3
.searchguard index already exists, so we do not need to create one.
ERR: .searchguard index state is RED.
Populate config from /opt/app-root/src/sgconfig/
Will update 'config' with /opt/app-root/src/sgconfig/sg_config.yml
   FAIL: Configuration for 'config' failed because of UnavailableShardsException[[.searchguard][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.searchguard][0]] containing [index {[.searchguard][config][0], source[{"config":"....................eXBlIjoibm9vcCJ9fX19fX0="}]}] and a refresh]]
Will update 'roles' with /opt/app-root/src/sgconfig/sg_roles.yml
   FAIL: Configuration for 'roles' failed because of UnavailableShardsException[[.searchguard][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.searchguard][0]] containing [index {[.searchguard][roles][0], source[{"roles":"..........kaWNlczphZG1pbi9nZXQqIl19fX19"}]}] and a refresh]]
Will update 'rolesmapping' with /opt/app-root/src/sgconfig/sg_roles_mapping.yml
   FAIL: Configuration for 'rolesmapping' failed because of UnavailableShardsException[[.searchguard][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.searchguard][0]] containing [index {[.searchguard][rolesmapping][0], source[{"rolesmapping":"..........sImJhY2tlbmRyb2xlcyI6WyJqYWVnZXIiXX19"}]}] and a refresh]]
Will update 'internalusers' with /opt/app-root/src/sgconfig/sg_internal_users.yml
   FAIL: Configuration for 'internalusers' failed because of UnavailableShardsException[[.searchguard][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.searchguard][0]] containing [index {[.searchguard][internalusers][0], source[{"internalusers":"eyJETFdaUmhRTSI6eyJoYXNoIjoiT2tEcnBIdnVwS0x0d1Q3aDAwdWsifX0="}]}] and a refresh]]
Will update 'actiongroups' with /opt/app-root/src/sgconfig/sg_action_groups.yml
   FAIL: Configuration for 'actiongroups' failed because of UnavailableShardsException[[.searchguard][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.searchguard][0]] containing [index {[.searchguard][actiongroups][0], source[n/a, actual length: [2.8kb], max length: 2kb]}] and a refresh]]
null
null
null
Done with failures
/usr/share/elasticsearch/init
[2020-06-15 08:37:55,055][INFO ][container.run            ] Seeded the searchguard ACL index
[2020-06-15 08:37:55,055][INFO ][container.run            ] Disabling auto replication
/etc/elasticsearch /usr/share/elasticsearch/init
Search Guard Admin v5
Will connect to localhost:9300 ... done
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
Elasticsearch Version: 5.6.16
Search Guard Version: <unknown>
Reload config on all nodes
Auto-expand replicas disabled
/usr/share/elasticsearch/init
[2020-06-15 08:38:57,990][INFO ][container.run            ] Updating replica count to 0
/etc/elasticsearch /usr/share/elasticsearch/init
Search Guard Admin v5
Will connect to localhost:9300 ... done
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
Elasticsearch Version: 5.6.16
Search Guard Version: <unknown>
Reload config on all nodes
Update number of replicas to 0 with result: true
/usr/share/elasticsearch/init
[2020-06-15 08:40:00,688][INFO ][container.run            ] Adding index templates
[2020-06-15 08:40:00,769][INFO ][container.run            ] Index template 'com.redhat.viaq-openshift-operations.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-06-15 08:40:01,195][INFO ][container.run            ] Index template 'com.redhat.viaq-openshift-orphaned.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-06-15 08:40:01,424][INFO ][container.run            ] Index template 'com.redhat.viaq-openshift-project.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-06-15 08:40:01,665][INFO ][container.run            ] Index template 'common.settings.kibana.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-06-15 08:40:01,837][INFO ][container.run            ] Index template 'common.settings.operations.orphaned.json' found in the cluster, overriding it
{"acknowledged":true}[2020-06-15 08:40:02,018][INFO ][container.run            ] Index template 'common.settings.operations.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-06-15 08:40:02,187][INFO ][container.run            ] Index template 'common.settings.project.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-06-15 08:40:02,351][INFO ][container.run            ] Index template 'jaeger-service.json' found in the cluster, overriding it
{"acknowledged":true}[2020-06-15 08:40:02,520][INFO ][container.run            ] Index template 'jaeger-span.json' found in the cluster, overriding it
{"acknowledged":true}[2020-06-15 08:40:02,693][INFO ][container.run            ] Index template 'org.ovirt.viaq-collectd.template.json' found in the cluster, overriding it
{"acknowledged":true}[2020-06-15 08:40:02,841][INFO ][container.run            ] Finished adding index templates
[2020-06-15 08:40:02,846][INFO ][container.run            ] Starting init script: 0500-remove-index-patterns-without-uid
[2020-06-15 08:40:02,940][INFO ][container.run            ] Found 0 index-patterns to evaluate for removal
[2020-06-15 08:40:02,941][INFO ][container.run            ] Completed init script: 0500-remove-index-patterns-without-uid with 0 successful and 0 failed bulk requests
[2020-06-15 08:40:02,945][INFO ][container.run            ] Starting init script: 0510-bz1656086-remove-index-patterns-with-bad-title
[2020-06-15 08:40:03,025][INFO ][container.run            ] Found 0 index-patterns to remove
[2020-06-15 08:40:03,126][INFO ][container.run            ] Completed init script: 0510-bz1656086-remove-index-patterns-with-bad-title
[2020-06-15 08:40:03,131][INFO ][container.run            ] Starting init script: 0520-bz1658632-remove-old-sg-indices
[2020-06-15 08:40:03,303][WARN ][container.run            ] Found .searchguard setting 'index.routing.allocation.include._name' to be null
[2020-06-15 08:40:03,305][INFO ][container.run            ] Updating .searchguard setting 'index.routing.allocation.include._name' to be null
[2020-06-15 08:40:03,419][INFO ][container.run            ] Completed init script: 0520-bz1658632-remove-old-sg-indices
[2020-06-15 08:40:03,423][INFO ][container.run            ] Starting init script: 0530-bz1667801-fix-kibana-replica-shards
[2020-06-15 08:40:03,493][INFO ][container.run            ] Found 0 Kibana indices with replica count not equal to 0
[2020-06-15 08:40:03,494][INFO ][container.run            ] Completed init script: 0530-bz1667801-fix-kibana-replica-shards
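The log above shows the recovery path: the `.searchguard` index started RED with its primary shard unassigned, the seeding initially failed, and it succeeded once the shard became active. If the index stays RED, the unassigned-shard reason can be inspected directly. A sketch under the same assumptions as before (`es_util` helper, `$ESPod` set to an Elasticsearch pod name):

```shell
# Show per-index health; look for the .searchguard index status column.
oc exec -n openshift-logging -c elasticsearch "$ESPod" -- \
   es_util --query=_cat/indices?v

# Ask Elasticsearch why shards remain unassigned (disk pressure,
# allocation filtering, missing nodes, etc.).
oc exec -n openshift-logging -c elasticsearch "$ESPod" -- \
   es_util --query=_cluster/allocation/explain?pretty
```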
  • CLO instance yaml
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance" 
  namespace: "openshift-logging"
spec:
  managementState: "Managed"  
  logStore:
    type: "elasticsearch"  
    elasticsearch:
      nodeCount: 3
      resources:
        limits:
          memory: "4Gi"
        requests:
          cpu: "1"
          memory: "4Gi"
      storage:
        storageClassName: nfs-storage-provisioner
        size: 40Gi      
  visualization:
    type: "kibana"
    kibana:
      replicas: 1
  curation:
    type: "curator"  
    curator:
      schedule: "30 3 * * *"
  collection:
    logs:
      type: "fluentd"  
      fluentd: {}
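To reproduce this setup, the CR above can be applied and the rollout watched. A minimal sketch, assuming the CR is saved as `clo-instance.yaml` (a hypothetical filename) and the namespace/name from the example:

```shell
# Apply the ClusterLogging instance and confirm the operator accepts it.
oc apply -f clo-instance.yaml
oc get clusterlogging instance -n openshift-logging -o yaml

# Watch the operator roll out elasticsearch, kibana, and fluentd pods.
oc get pods -n openshift-logging -w
```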
