
opendistro-for-elasticsearch / data-prepper


This repository is archived. Please migrate to the active project: https://github.com/opensearch-project/data-prepper

License: Apache License 2.0

Java 99.06% Dockerfile 0.05% Shell 0.88% Scilab 0.01%

data-prepper's Introduction

Data Prepper

This project has moved to OpenSearch Data Prepper.

You are currently viewing the inactive and archived OpenDistro Data Prepper. All work is now happening in the OpenSearch Data Prepper project, which can send events to OpenSearch, OpenDistro, and Elasticsearch 7.x. The OpenSearch Data Prepper already has new features and improvements, with many more planned.

The last version of OpenDistro Data Prepper was 1.0.3, released in December 2021 with Log4j security patches.

To help you migrate to OpenSearch Data Prepper, we have a short migration guide below.

Migrating to OpenSearch Data Prepper

This section provides instructions for migrating from the OpenDistro Data Prepper to OpenSearch Data Prepper.

Change your Pipeline Configuration

The elasticsearch sink has changed to opensearch. You will need to change your existing pipeline to use the opensearch plugin instead of elasticsearch.

Please note that while the plugin is named opensearch, it remains compatible with OpenDistro and Elasticsearch 7.x.
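
For illustration, here is a minimal before/after sketch of the sink section, following the configuration style shown later in this README; the host value is a placeholder and your other sink settings carry over unchanged:

# Before: OpenDistro Data Prepper pipeline
sink:
  elasticsearch:
    hosts: ["https://localhost:9200"]

# After: OpenSearch Data Prepper pipeline
sink:
  opensearch:
    hosts: ["https://localhost:9200"]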

Update Docker Image

The OpenDistro Data Prepper Docker image was located at amazon/opendistro-for-elasticsearch-data-prepper. You will need to change this value to opensearchproject/opensearch-data-prepper.
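
For example, in a hypothetical docker-compose file the only required change is the image reference (the service name is illustrative; other settings stay the same):

services:
  data-prepper:
    # old image: amazon/opendistro-for-elasticsearch-data-prepper
    image: opensearchproject/opensearch-data-prepper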

Old README

The remainder of this document is the old README file.


Contribute

We invite developers from the larger Open Distro community to contribute, help improve test coverage, and give us feedback on where improvements can be made in design, code, and documentation. See the contribution guide for more information on how to contribute.

Code of Conduct

This project has adopted an Open Source Code of Conduct.

Security Issue Notifications

If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our vulnerability reporting page. Please do not create a public GitHub issue.

License

This library is licensed under the Apache 2.0 License. Refer to the LICENSE file for the full license text.

data-prepper's People

Contributors

amazon-auto, austintag, chenqi0805, dependabot[bot], dinujoh, dlvenable, erosas, kowshikn, nclaveeoh-amzn, sshivanii, wrijeff, yadavcbala


data-prepper's Issues

Errors when running multiple instances of data-prepper in the sample app

Encountering ES client errors:

Example:

Caused by: java.lang.RuntimeException: method [PUT], host [https://node-0.example.com:9200], URI [/_opendistro/_ism/policies/raw-span-policy], status line [HTTP/1.1 409 Conflict]
{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[raw-span-policy]: version conflict, document already exists (current version [1])","index_uuid":"IdnxdC94Qaa7CYJwSPESxg","shard":"0","index":".opendistro-ism-config"}],"type":"version_conflict_engine_exception","reason":"[raw-span-policy]: version conflict, document already exists (current version [1])","index_uuid":"IdnxdC94Qaa7CYJwSPESxg","shard":"0","index":".opendistro-ism-config"},"status":409}
        at com.amazon.dataprepper.plugins.sink.elasticsearch.ElasticsearchSink.<init>(ElasticsearchSink.java:62)
        ... 19 more
Caused by: org.elasticsearch.client.ResponseException: method [PUT], host [https://node-0.example.com:9200], URI [/_opendistro/_ism/policies/raw-span-policy], status line [HTTP/1.1 409 Conflict]
{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[raw-span-policy]: version conflict, document already exists (current version [1])","index_uuid":"IdnxdC94Qaa7CYJwSPESxg","shard":"0","index":".opendistro-ism-config"}],"type":"version_conflict_engine_exception","reason":"[raw-span-policy]: version conflict, document already exists (current version [1])","index_uuid":"IdnxdC94Qaa7CYJwSPESxg","shard":"0","index":".opendistro-ism-config"},"status":409}
....
2523 [main] ERROR com.amazon.dataprepper.parser.PipelineParser – Construction of pipeline components failed, skipping building of pipeline [raw-pipeline]

So far I've hit issues in ElasticsearchSink.java and IndexStateManagement.java, both around the lowLevelClient performing requests. Need to handle these errors gracefully instead of failing the pipeline initialization completely.
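
One possible approach (a sketch, not the project's actual fix): treat an HTTP 409 from the ISM policy PUT as "policy already exists" and continue, instead of letting the ResponseException abort sink construction. This assumes the Elasticsearch low-level REST client; the endpoint is the one from the log above.

import java.io.IOException;

import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.client.RestClient;

final class IsmPolicySetup {
    /**
     * Creates the ISM policy if it does not already exist. A 409 Conflict from
     * another Data Prepper instance racing to create the same policy is ignored.
     */
    static void createPolicyIfAbsent(RestClient lowLevelClient, String policyJson) throws IOException {
        Request request = new Request("PUT", "/_opendistro/_ism/policies/raw-span-policy");
        request.setJsonEntity(policyJson);
        try {
            lowLevelClient.performRequest(request);
        } catch (ResponseException e) {
            Response response = e.getResponse();
            if (response.getStatusLine().getStatusCode() == 409) {
                // Policy already exists (created by another instance) -- safe to continue.
                return;
            }
            throw e;
        }
    }
}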

Peer Forwarder Plugin

The trace analytics feature is the first pipeline that Data Prepper will offer. This feature requires spans belonging to the same trace to be routed to the same host so we can process and make decisions on the entire trace workflow.

The end state is that we have a processor plugin:

processor:
    - peer_forwarder:
       discovery.mode: [] 
       peer.host: []
       time_out: 300
       span_agg_count: 48 

The plugin should have an option to auto-discover peers or take a static list of peers, and should support the routing.

Items:

  • Consistent hashing algorithm (see the sketch after this list)
  • Discovery approach
  • Integrate with metrics (Austin already has the abstract classes)
    • Inherit common metrics abstract classes
    • Add custom metrics by introducing a timer, counter, etc. in the constructor
  • Integration tests that ensure we are routing spans correctly and detecting service-map relationships.
    • Update the e2e test to include at least 2 Data Prepper instances receiving traces.
    • Integration test for discovery mode #224
  • SSL/TLS support
  • Documentation on how to configure the various options
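
Below is a minimal sketch of consistent-hash routing by trace ID. It assumes a hypothetical list of peer addresses; the real plugin's discovery mechanism and hashing details may differ.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Routes spans to peers so that all spans of a trace land on the same host. Assumes at least one peer. */
final class TraceIdHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    TraceIdHashRing(List<String> peerHosts, int virtualNodesPerPeer) {
        for (String peer : peerHosts) {
            for (int i = 0; i < virtualNodesPerPeer; i++) {
                ring.put(hash(peer + "#" + i), peer);
            }
        }
    }

    /** Returns the peer responsible for the given trace ID. */
    String peerFor(String traceId) {
        SortedMap<Long, String> tail = ring.tailMap(hash(traceId));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xffL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

Usage would look like new TraceIdHashRing(peers, 128).peerFor(traceId), where peers is the configured or discovered peer list; virtual nodes keep the distribution even when peers join or leave.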

Update the raw index template

Below are the fields we should manage in the index template; a sketch of a corresponding mapping follows the note below.

Properties

  • traceId - keyword - 256
  • spanId - keyword - 256
  • parentSpanId - keyword - 256
  • name - keyword - 1024
  • kind - keyword - 128
  • startTime - date_nanos
  • endTime - date_nanos
  • status.code - int
  • status.message - keyword
  • serviceName - keyword
  • durationInNanos - long

Dynamic Template (path_match)

  • resource.attributes.* - keyword
  • attributes.* - keyword

Other Fields (if possible)

  • events - should be an array of nested objects
  • links - should be an array of nested objects.

Note:

  • With the current template, status.message is not friendly for Kibana. (discussed this on a call)
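
A sketch of what the resulting mapping fragment might look like, based only on the field lists above; the ignore_above values mirror the listed keyword lengths, and the exact template shape the project ships may differ:

{
  "mappings": {
    "dynamic_templates": [
      {
        "resource_attributes": {
          "path_match": "resource.attributes.*",
          "mapping": { "type": "keyword" }
        }
      },
      {
        "span_attributes": {
          "path_match": "attributes.*",
          "mapping": { "type": "keyword" }
        }
      }
    ],
    "properties": {
      "traceId":         { "type": "keyword", "ignore_above": 256 },
      "spanId":          { "type": "keyword", "ignore_above": 256 },
      "parentSpanId":    { "type": "keyword", "ignore_above": 256 },
      "name":            { "type": "keyword", "ignore_above": 1024 },
      "kind":            { "type": "keyword", "ignore_above": 128 },
      "startTime":       { "type": "date_nanos" },
      "endTime":         { "type": "date_nanos" },
      "status":          { "properties": { "code": { "type": "integer" }, "message": { "type": "keyword" } } },
      "serviceName":     { "type": "keyword" },
      "durationInNanos": { "type": "long" }
    }
  }
}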

[Processor] Service-map processor

One of the key features of Trace Analytics is to offer a service-map view derived from customers' trace data. To support this view, we will build a stateful processor which derives relationships between services from traces.

Service-Map

To achieve this, we will build a service-map processor which detects relationships using the traces. The service-map processor will cache all traces for a WINDOW_TIME (default window time is 3 minutes) and then detect relationships with the structure below:

{
 _id: //Hash of the service-group object, this is done for uniqueness and reduce duplicates.   
 serviceName:
 kind:
 destination: 
      {
       domain:
       resource:
      }
 target: 
      {
       domain:
       resource:
      }
 traceGroupName:
}

Optimize release builds - multiple tracking issues

  1. Currently, we have a similar-looking distributions block in all subprojects of archives; we can move this to a subprojects closure in the archives build.gradle. It is a little tricky, as the custom distribution name depends on the subproject name [and Gradle does not let us use a string as the custom distribution name].
  2. The upload to S3 uploads tars individually, which lets us use an individual task per tar, but we could also consider adding a batch upload task which generates a fileTree of all subproject tars and zips.
  3. Bucket policy and IAM roles - We can automate adding roles and updating bucket policies, which would completely remove the manual work of adding IAM policies for ODFE infra S3 access.
  4. Move common tasks to the parent build - currently this is not straightforward, as the task names depend on the subproject name. Check if there is a way to override the name.

Share Pipeline Configuration

There are two pipeline-level configuration values that need to be shared with plugins (a sketch of one way to expose them follows the list):

  1. Number of Processors - This is required for stateful processors. (Already a TODO)
  2. Name of the pipeline - This is required for logging purposes; given that logs are the only way to track things in SITUP, having the pipeline name in the logs will help us debug. (check on this with @yadavcbala)
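
One way this could be surfaced is a small description object handed to plugins at construction time. This is a hypothetical sketch, not the project's actual API; the names are illustrative.

/**
 * Hypothetical pipeline-level settings handed to plugins at construction time.
 * Names are illustrative; the actual Data Prepper API may differ.
 */
public interface PipelineDescription {
    /** Name of the owning pipeline, for log messages. */
    String getPipelineName();

    /** Number of processor (worker) threads, needed by stateful processors. */
    int getNumberOfProcessWorkers();
}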

[ES sink] Add retry utilities in ES sink plugin

  • Ensure the ES high-level client does exponential retries.
  • Handle 4xx errors (including partial bulk failures).
  • Retry 5xx forever with exponential backoff (see the sketch below).
  • Expose a file config to which 4xx errors will be dumped.
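
A minimal sketch of the retry decision for bulk responses, assuming the Elasticsearch 7.x high-level REST client and Java 11+; the dump-file handling and backoff parameters are illustrative only, not the plugin's actual implementation.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

final class BulkRetryStrategy {
    private final RestHighLevelClient client;
    private final Path failedDocsDump;  // illustrative: file where 4xx failures are recorded

    BulkRetryStrategy(RestHighLevelClient client, Path failedDocsDump) {
        this.client = client;
        this.failedDocsDump = failedDocsDump;
    }

    void execute(BulkRequest request) throws IOException, InterruptedException {
        long backoffMillis = 50;
        while (true) {
            BulkResponse response = client.bulk(request, RequestOptions.DEFAULT);
            if (!response.hasFailures()) {
                return;
            }
            BulkRequest retry = new BulkRequest();
            int i = 0;
            for (BulkItemResponse item : response.getItems()) {
                if (item.isFailed()) {
                    int status = item.getFailure().getStatus().getStatus();
                    if (status >= 500) {
                        // Transient server-side error: retry this document forever.
                        retry.add(request.requests().get(i));
                    } else {
                        // 4xx: record the failure and drop the document -- retrying will not help.
                        Files.writeString(failedDocsDump,
                                item.getFailureMessage() + System.lineSeparator(),
                                StandardCharsets.UTF_8,
                                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    }
                }
                i++;
            }
            if (retry.numberOfActions() == 0) {
                return;
            }
            request = retry;
            Thread.sleep(backoffMillis);
            backoffMillis = Math.min(backoffMillis * 2, 30_000);  // exponential backoff, capped
        }
    }
}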

[Situp Core]Parallelize Sinks

As of this writing, the Pipeline outputs records to sinks synchronously. Having multiple sinks is a requirement, and different sinks have different latencies (and behavior); we wouldn't want one slow-performing sink impacting other sinks. This issue is to track parallelizing writes to sinks (a sketch follows the gotchas below).

Gotchas

  1. The decision of marking records as processed will not be addressed in this issue, as it entails a bigger discussion involving the pipeline-to-pipeline communication pattern, e.g. halting the processing of records if one of the sinks (which is a pipeline of its own) is down
  2. As per the discussion, we will use the processor thread-pool for writing to sinks, i.e. no dedicated thread-pool for sinks. This was decided purely based on the idle time processor threads would have if they were waiting for sink threads to finish writing
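
A minimal sketch of the idea, assuming a hypothetical, simplified Sink interface and reusing a shared processor executor as described in the gotchas above:

import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

/** Writes the same batch of records to every sink in parallel on a shared executor. */
final class ParallelSinkWriter<T> {
    // Hypothetical Sink interface, standing in for the plugin's output contract.
    interface Sink<T> {
        void output(Collection<T> records);
    }

    private final List<Sink<T>> sinks;
    private final ExecutorService processorExecutor;  // shared processor thread-pool, per the gotchas

    ParallelSinkWriter(List<Sink<T>> sinks, ExecutorService processorExecutor) {
        this.sinks = sinks;
        this.processorExecutor = processorExecutor;
    }

    /** Blocks until every sink has consumed the batch, so a slow sink delays only this worker. */
    void writeToSinks(Collection<T> records) {
        CompletableFuture<?>[] writes = sinks.stream()
                .map(sink -> CompletableFuture.runAsync(() -> sink.output(records), processorExecutor))
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(writes).join();
    }
}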

Upgrade to latest OpenTelemetry Span Status spec.

  1. Upgrade SITUP to use the latest version of OpenTelemetry Proto which has the new Span Spec.

  2. Upgrade the sample apps to OpenTelemetry v0.14.0, which supports the new Span spec.

  3. Also upgrade the samples to the latest OpenTelemetry Python versions.

  4. After all this is done, perform an integration sanity check and ensure all sample apps provide only status codes 0, 1, and 2.

Note: We will do this in a branch because this will break the Kibana dashboard, as our dashboard uses ">0" as the error code.

[Situp-core] Exception Control Flow

Below is the outlined exception control flow for SITUP:

  • Exception during Pipeline Construction

    • Single Pipeline - Exits via System.exit(1) with a non-zero code and logs an appropriate message
    • Multiple Pipelines (not chained) - The failing pipeline is ignored and an appropriate message is logged. Other pipelines will continue normally
    • Chained Pipelines - The failing pipeline is ignored; the rest of the chain is built, but nothing gets executed and eventually everything shuts down
  • Exception while starting the source

    • Single Pipeline - Exits with an appropriate exception message in the log
    • Multiple Pipelines (not chained) - The failing pipeline is ignored and an appropriate message is logged. Other pipelines will continue normally
    • Chained Pipelines
      • Root Source - The root source pipeline never gets executed (i.e. it is ignored); child pipelines will start, but since the root source pipeline is never started, the child pipelines never receive any data - effectively nothing happens
      • Child Source - This case is not possible, as a child source is nothing but a PipelineConnector which solely assigns a buffer reference.
  • Exception in process worker

    • Single Pipeline - The pipeline gets shut down with an appropriate log message
    • Multiple Pipelines (not chained) - The failing pipeline gets shut down with an appropriate log message; other pipelines will continue normally
    • Chained Pipelines
      • Root Pipeline - The root pipeline gets shut down and no data flows; child pipelines may still exist and the user is expected to halt them. [any changes will be taken up in follow-up PRs]
      • Child Pipeline - The failing pipeline gets shut down; this will eventually block/halt every other chained pipeline

Add monitoring support for Data Prepper

Data Prepper should provide a way for users to monitor the various components that are running. We will expose a GET API endpoint which allows the user to get the current state of metrics for the running Data Prepper.

While certain components have specific metrics that apply to them, there are also certain metrics which apply more generically to processors, sinks, and buffers. The following metrics will be provided by default to all components of that type:

  • Prepper: Records in, records out, time elapsed
  • Sink: Records out, time elapsed
  • Buffer: Records written, records read, write time elapsed, read time elapsed, and timeouts

For implementation, we will use Micrometer.io for metrics instrumentation, and Prometheus for the metrics backend. Micrometer allows us to decouple metrics instrumentation from backend implementation, and swap metrics backends if needed. Prometheus is widely used and supported, and provides customers with many options for consumption.
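
A minimal sketch of what the Micrometer/Prometheus wiring described above could look like; the metric names, port, and embedded HTTP server are illustrative, not the project's actual implementation.

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

import com.sun.net.httpserver.HttpServer;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public final class MetricsEndpointSketch {
    public static void main(String[] args) throws IOException {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Illustrative per-component metrics; real metric names may differ.
        Counter recordsIn = registry.counter("prepper.records.in");
        Counter recordsOut = registry.counter("prepper.records.out");
        Timer timeElapsed = registry.timer("prepper.time.elapsed");

        recordsIn.increment();
        timeElapsed.record(() -> recordsOut.increment());  // time the handling of a record

        // Expose the scraped metrics on a GET endpoint (port and path are illustrative).
        HttpServer server = HttpServer.create(new InetSocketAddress(4900), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = registry.scrape().getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}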

[Pipeline Core] Support for Pipeline Connectors

Background

As part of its definition, a Sink defines one or more destinations to which a SITUP pipeline will publish records. A sink destination could be a service like Elasticsearch or S3, or another SITUP pipeline. By using another SITUP pipeline as a sink, we enable customers to chain multiple TI pipelines.

To allow defining a pipeline as a sink, the source for the connecting pipeline also becomes a pipeline, which demands a special type of plugin limited to this special use-case, i.e. pipeline chaining. This leads to the requirement for a PipelineConnector.

Current Pipeline Configuration

pipeline:
  name: otel-span
  source:
    apm_trace_source:
      server: "localhost"
      port: 9400
  buffer:
    bounded_blocking:
      buffer_size: 512
  processor:
    geoip:
      database: "/geo-ip.db"
    filter:
      fieldKey: "span_kind"
      fieldValue: "Client"
  sink:
    elasticsearch:
      hosts: ["https://search-sample-app-test.us-west-2.es.amazonaws.com"]
    s3:
      bucket: "global-bucket"

Multiple/Updated Pipelines Configuration

pipeline-1:
  source:
    apm_trace_source:
      server: "localhost"
      port: 9400
  buffer:
    bounded_blocking:
      buffer_size: 512
  processor:
    transform:
      to: elasticsearch
  sink:
    pipeline:
      name: "pipeline-2"
  workers: 4
  delay: 500 #ms
      
pipeline-2:
  source:
    pipeline:
      name: "pipeline-1"
  buffer:
    bounded-blocking:
      buffer_size: 1024
  processor:
    geoip:
      database: "/geo-ip.db"
  sink:
    elasticsearch:
      hosts: ["https://search-sample-app-test.us-west-2.es.amazonaws.com"]

The above sample is solely an example; it does not cover all the nuances that multiple pipelines bring to the table. Feel free to raise a question or comment.

Pipeline Connector

PipelineConnector will be a special type of plugin which is both a Source and a Sink, and we will limit the possibilities of extending this connector, i.e. we do not need more custom plugins for this special connector at the time of this writing. This will introduce a good number of ambiguous edge cases; below is my best attempt to cover them. Please feel free to comment if I missed any.

  1. Orphan Pipelines

What happens if there exists a pipeline which is not connected to any other pipeline? Will it be marked as invalid?

pipeline-1:
  source:
    file:
  sink:
    pipeline:
      name: "pipeline-2"
      
pipeline-2:
  source:
    pipeline:
      name: "pipeline-1"
  buffer:
     bounded-blocking:
  sink:
    elasticsearch:
      hosts: ["localhost:9200"]

pipeline-3:
  buffer:
     sqs:
  sink:
    elasticsearch:
      hosts: ["localhost:9200"]

At the time of this writing, we will allow such a configuration, as pipelines are executed independently.

  2. Other Validations
    Below are some invalid configuration examples:
pipeline-1:
  source:
    file:
  sink:
    pipeline:
      name: "pipeline-2"
      
pipeline-2:
  source:
    pipeline:
      name: "pipeline-3" # invalid - expected pipeline-1
  buffer:
     bounded-blocking:
  sink:
    elasticsearch:
      hosts: ["localhost:9200"]
pipeline-1:
  source:
    file:
  sink:
    pipeline:
      name: "pipeline-2"
      
pipeline-2:
  source:
    apm-otel-trace-source: # invalid - expected pipeline type
      name: "pipeline-3" 
  buffer:
     bounded-blocking:
  sink:
    elasticsearch:
      hosts: ["localhost:9200"]
pipeline-1:
  source:
    file:
  sink:
    pipeline:
      name: "pipeline-3" #invalid because there is no pipeline-3
     

Trace Analytics

This issue will be used to share details and get feedback about the upcoming Trace Analytics feature. Trace Analytics is the first step towards bringing Application Performance Management capabilities to ODFE customers.

trace-analytics-rfc

Automate tests for release artifacts

Currently we have no test framework for testing artifacts and Docker images. For every release, we have engineers run the sample and test locally. This lowers test quality and is slow. We should automate this with the following goals:

  1. Build the release artifacts/Docker image locally; as part of the release step, tests should run.
  2. Upload the release artifacts to our bucket for the odfe-release team to consume. (This is done.)
  3. Once the ODFE team lets us know that the artifacts are in staging and production, we should run the tests in step 1 against them.

[ES Sink] ODFE ISM

For trace analytics, we need to create a new ISM policy for the trace-analytics raw index. We will create a new policy called trace-analytics-raw with thresholds of 50 GB, 24 hours, or 50k docs (a sketch of such a policy follows the note below).

Note:

  • Regarding generic ISM, we need to discuss later whether generic ISM should be fixed or user-driven. This can be done after trace_analytics.
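
A sketch of what such a policy might look like, assuming the thresholds above are rollover conditions for the ODFE ISM rollover action; the state name and description are illustrative and the project's actual policy may differ:

{
  "policy": {
    "description": "Roll over the raw trace-analytics index at 50 GB, 24 hours, or 50k documents.",
    "default_state": "current_write_index",
    "states": [
      {
        "name": "current_write_index",
        "actions": [
          {
            "rollover": {
              "min_size": "50gb",
              "min_index_age": "24h",
              "min_doc_count": 50000
            }
          }
        ],
        "transitions": []
      }
    ]
  }
}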

[Processor] OtelTraceRawProcessor Plugin

Create a new Processor plugin that takes ResourceSpan as input and converts it to ES friendly docs.

  • Convert ResourceSpan to JSON. Reuse existing ApmSpanProcessor for JSON conversion.
  • Use ApmSpanProcessor on the JSON.

[SITUP Plugin] Add blocking queue

The current default buffer (UnboundedInMemoryBuffer) was added as a stopgap option; we need to add a BlockingBuffer and make it the default option.
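
A minimal sketch of a bounded blocking buffer backed by LinkedBlockingQueue; the simplified write/read methods here are hypothetical and the actual plugin API may differ.

import java.util.ArrayList;
import java.util.Collection;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Simplified bounded blocking buffer: writers block when full, readers wait up to a timeout. */
final class BlockingBuffer<T> {
    private final LinkedBlockingQueue<T> queue;

    BlockingBuffer(int bufferSize) {
        this.queue = new LinkedBlockingQueue<>(bufferSize);
    }

    /** Blocks until capacity is available, applying back-pressure to the source. */
    void write(T record) throws InterruptedException {
        queue.put(record);
    }

    /** Drains up to batchSize records, waiting at most timeoutMillis for the first one. */
    Collection<T> read(int batchSize, long timeoutMillis) throws InterruptedException {
        Collection<T> records = new ArrayList<>(batchSize);
        T first = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (first != null) {
            records.add(first);
            queue.drainTo(records, batchSize - 1);
        }
        return records;
    }
}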

Clean up plugins

  • Remove unsupported plugins:
    • stateless-trace-processor
    • apm_trace_source
  • Split apmtracesource into two separate plugin sub-projects:
    • otel_trace_raw_processor
    • otel_trace_source

    Add a README for both of them.

  • Rename the upper-case plugin to string_converter, with a boolean flag to denote whether it uses toUpperCase or toLowerCase.

Handle Zipkin B3 Propagation

Zipkin users use B3 propagation. When using B3, it uses the same span ID for the client and server side of an RPC.
Check this FAQ.

The OpenTelemetry spec doesn't officially support this behavior, as it uses W3C trace context, but there is an ongoing discussion about supporting it. Regardless of the OpenTelemetry decision, we should consider supporting this behavior in our trace analytics feature to stay backwards compatible.

In order to support this, we need to make the changes below (a sketch of the first follows the list):

  1. The raw trace processor should create the _id from unique span identifiers, which should contain the spanId and serviceName.
  2. Modify the service-map-processor to support a List of spans for a spanId lookup.
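
A minimal sketch of deriving such a composite document _id; the hashing and encoding scheme is illustrative only.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

final class SpanDocumentIds {
    /** Builds a document _id that stays unique even when B3 reuses a spanId across client and server spans. */
    static String documentId(String spanId, String serviceName) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest((spanId + "|" + serviceName).getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}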

[Situp-Core] Pipeline to have start/stop functionality

This issue is to track the progress, discussion, and decisions of adding start/stop functionality to Pipeline. Currently, start and stop do not completely perform those actions. This issue will attempt to close the gap.

Pipeline is integral to the working of SITUP; it has four key components: source, buffer, processor, and sink. A pipeline definition contains the required components source and sink, and the optional buffer and processors. A default buffer will be used if no buffer is specified in the definition.

execute()
On initiating execution of the pipeline, control triggers the start() operation on the defined source with either the defined or the default buffer. Control also initiates the processing, which includes executing the processors (if there are any) on records from the buffer and publishing the resulting records to all the configured sinks.

start()
TODO

stop()
Currently, we notify the defined source to stop publishing new records to the buffer. The pipeline will exhaust the existing records from the buffer before stopping the processing. TODO

Rename Processor Plugins to Prepper

A product decision is that Data Prepper contains Pipelines, and each pipeline will have a Source, one or more Sinks, and zero or more Preppers.

So we will rename Processor to Prepper.

Create a simple UI for Trace App

Create a UI for the odfe-pipes-trace-app comprising components such as load_main_screen, client_create_order, client_cancel_order, client_checkout, client_pay_order, and client_delivery_status.
