
Comments (5)

mbudge commented on June 13, 2024

I've added this to the elastic-agent pipeline so metrics data older than 1 day gets dropped.

if [data_stream][type] == "metrics" {
  ruby {
    # Cancel (drop) any metrics event whose @timestamp is more than
    # one day (86400 seconds) in the past.
    code => 'if LogStash::Timestamp.new(event.get("@timestamp") + 86400) < LogStash::Timestamp.now
               event.cancel
             end'
  }
}

This is fine until Logstash handles TSDB 403 Forbidden responses properly, i.e. stops retrying them endlessly until it topples over.


mbudge commented on June 13, 2024

We want to avoid using dead letter queues as we still want to see other indexing errors in the Logstash logs.

It's just the metrics data/TSDB indices that are a problem: when the index is in read-only mode, Elasticsearch returns a 403 Forbidden error.


mbudge commented on June 13, 2024

Another option would be to add some code to the metrics ingest pipelines which drops events older than X days.


jsvd commented on June 13, 2024

@mbudge thanks for the report.
The challenge of deciding when to retry or not has been a back and forth for ages in Logstash, with arguments on both sides.
The struggle with this particular case is that the error is about authorization, and it may happen because:

  1. a user or api key was given insufficient roles to be able to write to the desired index
  2. the dynamic index feature allows data to land in places the user doesn't want it to

In case 1, we're talking about up to 100% of all incoming data being blocked from ingestion; here it's better for the pipeline to stop ingesting altogether, since otherwise it would drop 100% of the data received, causing major data loss.
In case 2 we're talking about a small portion of data that is likely unwanted anyway, so it could be discarded.

We don't have a good way of knowing which scenario we're facing by looking at the error from ES, especially without further domain knowledge being injected by the user, like your solution here where you tell Logstash "I want anything older than 24h to be dropped, since I know otherwise it may land in indices that aren't writable anymore".

> We need max retries so Logstash doesn't get stuck and lose all our log data. We work in a regulated environment where we are audited on data loss.

I don't see how retries would help here, since nothing would change between the time the first error happens and the last retry, which would likely be only a few seconds later (e.g. with 3 retries: 0.1 seconds before the first retry, 0.5 s before the second, 1 second before the third). Typically, for a 403, action needs to be taken either immediately (e.g. drop/DLQ) or by retrying forever so an operator/administrator can take action and fix the permissions.

> This is fine until Logstash handles TSDB 403 Forbidden responses properly, i.e. stops retrying them endlessly until it topples over.

Given that in your domain you expect data to land in places where writing is forbidden, maybe just setting dlq_custom_codes for 403 would be enough, thus informing Logstash that a 403 at the document level is fine to either drop or send to the DLQ:

output {
  elasticsearch {
    # treat 403 bulk responses as document-level errors (drop or DLQ) instead of retrying them
    dlq_custom_codes => [ 403 ]
  }
}

> We want to avoid using dead letter queues as we still want to see other indexing errors in the Logstash logs.

With the setting above only 403 would be DLQ'ed, everything else would still behave as before.
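
If the DLQ route does become acceptable, the 403'd documents don't have to be invisible: with dead_letter_queue.enable: true in logstash.yml they can be read back and inspected using the dead_letter_queue input. A rough sketch; the path and pipeline id below are placeholders for your install:

input {
  dead_letter_queue {
    # Default DLQ location under path.data for a pipeline named "main"; adjust to your install.
    path => "/usr/share/logstash/data/dead_letter_queue"
    pipeline_id => "main"
    commit_offsets => true
  }
}
output {
  # Inspect the 403'd documents (or re-route them once the target index is writable again).
  stdout { codec => rubydebug }
}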

I don't see a way to make this be the default as it would easily make other users upset that a permission error led to all their data being happily dropped on the floor until it was noticed by someone.

Maybe a path forward is to have TSDS raise a specific kind of error of data not being allowed in downsampled indices, so clients like Logstash can differentiate between this scenario and any other kind of forbidden/403 error.


mbudge commented on June 13, 2024

dlq_custom_codes are in addition to 400 and 404, which is why I want to stay away from the DLQ, so we can monitor for 400 and 404 errors in the Logstash logs on the host and in Elastic.

"List single-action error codes from Elasticsearch’s Bulk API that are considered valid to move the events into the dead letter queue. This list is an addition to the ordinary error codes considered for this feature, 400 and 404." https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-dlq_custom_codes

We can have two elasticsearch outputs configured in Logstash: one for logs and one for metrics data. On the elasticsearch output for metrics data we would use max retries to stop Logstash retrying late metrics data indefinitely, preventing an outage, since eventually all the late metrics events would be dropped (and customers decide when that happens). Adding a drop processor to the metrics ingest pipelines does sound like a good way to protect against this too.
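
Roughly what I have in mind for the two outputs, routed on [data_stream][type] (connection details below are placeholders; since a per-output max retries setting doesn't exist today, the comments note the closest current option):

output {
  if [data_stream][type] == "metrics" {
    elasticsearch {
      # placeholder connection details
      hosts => ["https://elasticsearch.example.com:9200"]
      data_stream => "true"
      # The per-output retry limit discussed above doesn't exist yet; today the
      # closest knob is treating 403 as a document-level error (drop or DLQ):
      # dlq_custom_codes => [ 403 ]
    }
  } else {
    elasticsearch {
      # placeholder connection details
      hosts => ["https://elasticsearch.example.com:9200"]
      data_stream => "true"
      # Logs keep the default behaviour, so 400/404 indexing errors stay visible
      # in the Logstash logs.
    }
  }
}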

I think the Fleet team are adding a global component template to all the Fleet-managed logs data streams. If you get them to add the global@custom ingest pipeline to the metrics data streams as well, it becomes easier for customers to add logic that drops late metrics data across all the metrics ingest pipelines.

elastic/kibana#149484

