
Comments (5)

mbudge commented on June 13, 2024

I've added this to the elastic-agent pipeline so metrics data older than 1 day gets dropped.

if [data_stream][type] == "metrics" {
  ruby {
    # Cancel (drop) any metrics event whose @timestamp is more than
    # one day (86400 seconds) in the past.
    code => 'if LogStash::Timestamp.new(event.get("@timestamp") + 86400) < LogStash::Timestamp.now
               event.cancel
             end'
  }
}

This is fine until Logstash handles TSDB 403 Forbidden responses properly, i.e. stops retrying them endlessly until it topples over.


mbudge commented on June 13, 2024

We want to avoid using dead letter queues as we still want to see other indexing errors in the Logstash logs.

It's just the metrics data/TSDB indices that are a problem: when the index is in read-only mode, Elasticsearch returns a 403 Forbidden error.


mbudge commented on June 13, 2024

Another option would be to add some code to the metrics ingest pipelines which drops events older than X days.


jsvd commented on June 13, 2024

@mbudge thanks for the report.
The challenge of deciding when to retry or not has been a back and forth for ages in Logstash, with arguments on both sides.
The struggle with this particular case is that the error is about authorization, and it may happen because:

  1. a user or api key was given insufficient roles to be able to write to the desired index
  2. the dynamic index feature allows data to land in places the user doesn't want it to

In case 1, we're talking about up to 100% of all incoming data being blocked from ingestion; here it's better for the pipeline to stop ingesting altogether, since otherwise it would drop 100% of the data received, causing major data loss.
In case 2 we're talking about a small portion of data that is likely unwanted anyway, so it could be discarded.

We don't have a good way of knowing which scenario we're facing by looking at the error from ES, especially without further domain knowledge being injected by the user, like your solution here where you tell Logstash "I want anything older than 24h to be dropped, since I know otherwise it may land in indices that aren't writable anymore".

> We need max retries so Logstash doesn't get stuck and lose all our log data. We work in a regulated environment where we are audited on data loss.

I don't see how retries would help here, since nothing would change between the time the first error happens and the last retry, which would likely be only a few seconds later (e.g. with 3 retries: 0.1 seconds before the first retry, 0.5 s before the second, 1 second before the third). Typically, for a 403, action needs to be taken either immediately (e.g. drop/DLQ) or by retrying forever so an operator/administrator can take action and fix the permissions.

> This is fine until Logstash handles TSDB 403 Forbidden responses properly, i.e. stops retrying them endlessly until it topples over.

Given that in your domain you expect data to land in places where writing is forbidden, maybe just setting dlq_custom_codes for 403 would be enough, thus informing Logstash that a 403 at the document level is fine to either drop or send to the DLQ:

output {
  elasticsearch {
    # treat 403 bulk responses as document-level errors (drop or DLQ) instead of retrying them
    dlq_custom_codes => [ 403 ]
  }
}

> We want to avoid using dead letter queues as we still want to see other indexing errors in the Logstash logs.

With the setting above only 403 would be DLQ'ed, everything else would still behave as before.
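
If the DLQ route does become acceptable, the 403'd documents don't have to be invisible: with dead_letter_queue.enable: true in logstash.yml they can be read back and inspected using the dead_letter_queue input. A rough sketch; the path and pipeline id below are placeholders for your install:

input {
  dead_letter_queue {
    # Default DLQ location under path.data for a pipeline named "main"; adjust to your install.
    path => "/usr/share/logstash/data/dead_letter_queue"
    pipeline_id => "main"
    commit_offsets => true
  }
}
output {
  # Inspect the 403'd documents (or re-route them once the target index is writable again).
  stdout { codec => rubydebug }
}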

I don't see a way to make this be the default as it would easily make other users upset that a permission error led to all their data being happily dropped on the floor until it was noticed by someone.

Maybe a path forward is to have TSDS raise a specific kind of error of data not being allowed in downsampled indices, so clients like Logstash can differentiate between this scenario and any other kind of forbidden/403 error.


mbudge commented on June 13, 2024

dlq_custom_codes are in addition to 400 and 404, which is why I want to stay away from the DLQ, so we can monitor for 400 and 404 errors in the Logstash logs on the host and in Elastic.

"List single-action error codes from Elasticsearch’s Bulk API that are considered valid to move the events into the dead letter queue. This list is an addition to the ordinary error codes considered for this feature, 400 and 404." https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-dlq_custom_codes

We can have two elasticsearch outputs configured in Logstash: one for logs and one for metrics data. On the elasticsearch output for metrics data we would use max retries to stop Logstash retrying late metrics data indefinitely, preventing an outage, since eventually all the late metrics events would be dropped (and customers decide when that happens). Adding a drop processor to the metrics ingest pipelines does sound like a good way to protect against this too.
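
Roughly what I have in mind for the two outputs, routed on [data_stream][type] (connection details below are placeholders; since a per-output max retries setting doesn't exist today, the comments note the closest current option):

output {
  if [data_stream][type] == "metrics" {
    elasticsearch {
      # placeholder connection details
      hosts => ["https://elasticsearch.example.com:9200"]
      data_stream => "true"
      # The per-output retry limit discussed above doesn't exist yet; today the
      # closest knob is treating 403 as a document-level error (drop or DLQ):
      # dlq_custom_codes => [ 403 ]
    }
  } else {
    elasticsearch {
      # placeholder connection details
      hosts => ["https://elasticsearch.example.com:9200"]
      data_stream => "true"
      # Logs keep the default behaviour, so 400/404 indexing errors stay visible
      # in the Logstash logs.
    }
  }
}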

I think the Fleet team are adding a global component template to all the Fleet-managed logs data streams. If you get them to add the global@custom ingest pipeline to the metrics data streams as well, it becomes easier for customers to add logic that drops late metrics data across all the metrics ingest pipelines.

elastic/kibana#149484

