Comments (3)

rwynn commented on May 28, 2024

Hi, I've pushed a new release that backs off when indexing errors happen, to mitigate the log flooding.
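
For the curious, this kind of backoff can be expressed through the bulk processor options of the olivere/elastic client. A minimal sketch, not the actual release change (client setup, worker count, and timing values are illustrative):

package main

import (
	"context"
	"log"
	"time"

	elastic "github.com/olivere/elastic/v7"
)

func main() {
	client, err := elastic.NewClient(elastic.SetURL("http://localhost:9200"))
	if err != nil {
		log.Fatal(err)
	}
	// Back off exponentially between retries of failed bulk commits
	// (100ms initial, capped at 60s), and retry items rejected with
	// HTTP 429 instead of reporting each one to the error handler.
	p, err := client.BulkProcessor().
		Name("indexer").
		Workers(2).
		Backoff(elastic.NewExponentialBackoff(100*time.Millisecond, 60*time.Second)).
		RetryItemStatusCodes(429).
		Do(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer p.Close()
}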

mologie commented on May 28, 2024

Hi, colleague of Manuel here. The specific error message we got was:

ERROR 2023/11/24 15:43:43 Bulk response item: {"_index":"main.<col>","_id":"<id>","status":429,"error":{"type":"cluster_block_exception","reason":"index [main.<col>] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];"}}

It was repeated 24.5 million times over 10 minutes, totaling roughly 4 GiB of logs.
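
For anyone hitting the same 429: once disk space has been freed, the block can be removed by resetting index.blocks.read_only_allow_delete (newer Elasticsearch versions lift it automatically once usage drops back below the high watermark). A rough sketch, with placeholder URL and index name:

package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Reset the read_only_allow_delete block that Elasticsearch applies
	// once the flood-stage watermark is exceeded. "main.mycol" and the
	// URL are placeholders; free disk space first or the block returns.
	body := bytes.NewBufferString(`{"index.blocks.read_only_allow_delete": null}`)
	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:9200/main.mycol/_settings", body)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}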

The steps to reproduce are as follows (though we have not yet investigated whether they can be minimized):

  1. Deny access to the monstache user, so that some data is queued up
  2. Let Elasticsearch run almost full
  3. Stop monstache
  4. Restore access for monstache
  5. Restart monstache
  6. Let Elasticsearch run completely full (up to the flood-stage watermark; see the settings sketch after this list)
  7. Observe that monstache begins to rapidly generate log events (2+ million log entries per minute)
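
To trigger step 6 without actually filling a disk, the watermarks can be lowered temporarily so the flood-stage block kicks in early. A sketch using transient cluster settings (the percentages are illustrative, not what we used):

package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Temporarily lower the disk watermarks so the flood-stage block
	// triggers on a moderately full disk. low < high < flood_stage
	// must hold; values here are for reproduction only.
	settings := `{"transient": {
		"cluster.routing.allocation.disk.watermark.low": "50%",
		"cluster.routing.allocation.disk.watermark.high": "60%",
		"cluster.routing.allocation.disk.watermark.flood_stage": "70%"
	}}`
	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:9200/_cluster/settings",
		bytes.NewBufferString(settings))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}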

mologie commented on May 28, 2024

Additionally, here is a redacted copy of the config file with which we observed the issue:

mongo-url = "mongodb://monstache:<snip:url>"
elasticsearch-urls = ["http://<snip>:9200"]
direct-read-namespaces = ["main.<snip:col>"]
change-stream-namespaces = ["main.<snip:col>"]
workers = ["worker-0", "worker-1"]
gzip = false
stats = true
index-stats = true
elasticsearch-user = "monstache"
elasticsearch-password = "<snip>"
elasticsearch-max-conns = 4
elasticsearch-validate-pem-file = false
elasticsearch-healthcheck-timeout-startup = 200
elasticsearch-healthcheck-timeout = 200
dropped-collections = true
dropped-databases = true
replay = true
resume = true
resume-write-unsafe = false
resume-name = "default"
resume-strategy = 1
index-files = true
file-highlighting = true
file-namespaces = ["users.fs.files"]
verbose = false
cluster-name = 'elasticsearch'
exit-after-direct-reads = false

I'm curious and am investigating possible causes in the source code right now. A brief look suggests that the Elasticsearch client library indiscriminately calls the error handler for every item handed to it via Add(), so as long as the ingress side works and keeps providing data, we end up with one error per ingested item. It is unclear to me, however, at which point throttling would best take place.
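
One possibility would be to throttle in the bulk After callback itself: count failed items and emit a summary line at most once per interval instead of logging each one. A rough sketch against olivere/elastic (the names are mine, not monstache's; it would be registered via client.BulkProcessor().After(afterBulk)):

package indexer

import (
	"log"
	"sync/atomic"
	"time"

	elastic "github.com/olivere/elastic/v7"
)

var (
	suppressed atomic.Int64 // failed items not logged individually
	lastLog    atomic.Int64 // unix nanos of the last emitted summary
)

// afterBulk logs at most one summary per 10-second window and
// counts everything else as suppressed.
func afterBulk(id int64, reqs []elastic.BulkableRequest, resp *elastic.BulkResponse, err error) {
	if resp == nil {
		return
	}
	failed := resp.Failed()
	if len(failed) == 0 {
		return
	}
	now := time.Now().UnixNano()
	last := lastLog.Load()
	if now-last < int64(10*time.Second) || !lastLog.CompareAndSwap(last, now) {
		suppressed.Add(int64(len(failed)))
		return
	}
	n := suppressed.Swap(0)
	log.Printf("bulk: %d items failed (first: %+v); %d similar errors suppressed",
		len(failed), failed[0].Error, n)
}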
