
opendistro-for-elasticsearch / data-prepper


This repository is archived. Please migrate to the active project: https://github.com/opensearch-project/data-prepper

License: Apache License 2.0

Java 99.06% Dockerfile 0.05% Shell 0.88% Scilab 0.01%

data-prepper's Introduction

Data Prepper

This project has moved to OpenSearch Data Prepper.

You are currently viewing the inactive and archived OpenDistro Data Prepper. All work is now happening in the OpenSearch Data Prepper project, which can send events to OpenSearch, OpenDistro, and Elasticsearch 7.x. The OpenSearch Data Prepper already has new features and improvements, with many more planned.

The last version of OpenDistro Data Prepper was 1.0.3, released in December 2021 with Log4j security patches.

To help you migrate to OpenSearch Data Prepper, we have a short migration guide below.

Migrating to OpenSearch Data Prepper

This section provides instructions for migrating from the OpenDistro Data Prepper to OpenSearch Data Prepper.

Change your Pipeline Configuration

The elasticsearch sink has changed to opensearch. You will need to change your existing pipeline to use the opensearch plugin instead of elasticsearch.

Please note that while the plugin is named opensearch, it remains compatible with OpenDistro and Elasticsearch 7.x.
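
For illustration, here is a minimal before/after sketch of the sink section, following the configuration style shown later in this README; the host value is a placeholder and your other sink settings carry over unchanged:

# Before: OpenDistro Data Prepper pipeline
sink:
  elasticsearch:
    hosts: ["https://localhost:9200"]

# After: OpenSearch Data Prepper pipeline
sink:
  opensearch:
    hosts: ["https://localhost:9200"]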

Update Docker Image

The OpenDistro Data Prepper Docker image was located at amazon/opendistro-for-elasticsearch-data-prepper. You will need to change this value to opensearchproject/opensearch-data-prepper.
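
For example, in a hypothetical docker-compose file the only required change is the image reference (the service name is illustrative; other settings stay the same):

services:
  data-prepper:
    # old image: amazon/opendistro-for-elasticsearch-data-prepper
    image: opensearchproject/opensearch-data-prepper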

Old README

The remainder of this document is the old README file.


Contribute

We invite developers from the larger Open Distro community to contribute, help improve test coverage, and give us feedback on where improvements can be made in design, code, and documentation. See the contribution guide for more information on how to contribute.

Code of Conduct

This project has adopted an Open Source Code of Conduct.

Security Issue Notifications

If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our vulnerability reporting page. Please do not create a public GitHub issue.

License

This library is licensed under the Apache 2.0 License. Refer to the LICENSE file for the full license text.

data-prepper's People

Contributors

amazon-auto, austintag, chenqi0805, dependabot[bot], dinujoh, dlvenable, erosas, kowshikn, nclaveeoh-amzn, sshivanii, wrijeff, yadavcbala


data-prepper's Issues

Errors when running multiple instances of data-prepper in the sample app

Encountering ES client errors:

Example:

Caused by: java.lang.RuntimeException: method [PUT], host [https://node-0.example.com:9200], URI [/_opendistro/_ism/policies/raw-span-policy], status line [HTTP/1.1 409 Conflict]
{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[raw-span-policy]: version conflict, document already exists (current version [1])","index_uuid":"IdnxdC94Qaa7CYJwSPESxg","shard":"0","index":".opendistro-ism-config"}],"type":"version_conflict_engine_exception","reason":"[raw-span-policy]: version conflict, document already exists (current version [1])","index_uuid":"IdnxdC94Qaa7CYJwSPESxg","shard":"0","index":".opendistro-ism-config"},"status":409}
        at com.amazon.dataprepper.plugins.sink.elasticsearch.ElasticsearchSink.<init>(ElasticsearchSink.java:62)
        ... 19 more
Caused by: org.elasticsearch.client.ResponseException: method [PUT], host [https://node-0.example.com:9200], URI [/_opendistro/_ism/policies/raw-span-policy], status line [HTTP/1.1 409 Conflict]
{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[raw-span-policy]: version conflict, document already exists (current version [1])","index_uuid":"IdnxdC94Qaa7CYJwSPESxg","shard":"0","index":".opendistro-ism-config"}],"type":"version_conflict_engine_exception","reason":"[raw-span-policy]: version conflict, document already exists (current version [1])","index_uuid":"IdnxdC94Qaa7CYJwSPESxg","shard":"0","index":".opendistro-ism-config"},"status":409}
....
2523 [main] ERROR com.amazon.dataprepper.parser.PipelineParser – Construction of pipeline components failed, skipping building of pipeline [raw-pipeline]

So far I've hit issues in ElasticsearchSink.java and IndexStateManagement.java, both around the lowLevelClient performing requests. Need to handle these errors gracefully instead of failing the pipeline initialization completely.
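
One possible approach (a sketch, not the project's actual fix): treat an HTTP 409 from the ISM policy PUT as "policy already exists" and continue, instead of letting the ResponseException abort sink construction. This assumes the Elasticsearch low-level REST client; the endpoint is the one from the log above.

import java.io.IOException;

import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.client.RestClient;

final class IsmPolicySetup {
    /**
     * Creates the ISM policy if it does not already exist. A 409 Conflict from
     * another Data Prepper instance racing to create the same policy is ignored.
     */
    static void createPolicyIfAbsent(RestClient lowLevelClient, String policyJson) throws IOException {
        Request request = new Request("PUT", "/_opendistro/_ism/policies/raw-span-policy");
        request.setJsonEntity(policyJson);
        try {
            lowLevelClient.performRequest(request);
        } catch (ResponseException e) {
            Response response = e.getResponse();
            if (response.getStatusLine().getStatusCode() == 409) {
                // Policy already exists (created by another instance) -- safe to continue.
                return;
            }
            throw e;
        }
    }
}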

Peer Forwarder Plugin

The trace analytics feature is the first pipeline that Data Prepper will offer. This feature requires spans belonging to the same trace to be routed to the same host so we can process and make decisions on the entire trace workflow.

The end state is that we have a processor plugin:

processor:
    - peer_forwarder:
       discovery.mode: [] 
       peer.host: []
       time_out: 300
       span_agg_count: 48 

The plugin should have an option to auto-discover peers or take a static list of peers, and should support the routing.

Items:

  • Consistent hashing algorithm (see the sketch after this list)
  • Discovery approach
  • Integrate with metrics (Austin already has the abstract classes)
    • Inherit common metrics abstract classes
    • Add custom metrics by introducing a timer, counter, etc. in the constructor
  • Integration tests that ensure we are routing spans correctly and detecting service-map relationships.
    • Update the e2e test to include at least 2 Data Prepper instances receiving traces.
    • Integration test for discovery mode #224
  • SSL/TLS support
  • Documentation on how to configure the various options
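
Below is a minimal sketch of consistent-hash routing by trace ID. It assumes a hypothetical list of peer addresses; the real plugin's discovery mechanism and hashing details may differ.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Routes spans to peers so that all spans of a trace land on the same host. Assumes at least one peer. */
final class TraceIdHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    TraceIdHashRing(List<String> peerHosts, int virtualNodesPerPeer) {
        for (String peer : peerHosts) {
            for (int i = 0; i < virtualNodesPerPeer; i++) {
                ring.put(hash(peer + "#" + i), peer);
            }
        }
    }

    /** Returns the peer responsible for the given trace ID. */
    String peerFor(String traceId) {
        SortedMap<Long, String> tail = ring.tailMap(hash(traceId));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xffL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

Usage would look like new TraceIdHashRing(peers, 128).peerFor(traceId), where peers is the configured or discovered peer list; virtual nodes keep the distribution even when peers join or leave.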

Update the raw index template

Below are the fields we should manage in the index template; a sketch of a corresponding mapping follows the note below.

Properties

  • traceId - keyword - 256
  • spanId - keyword - 256
  • parentSpanId - keyword - 256
  • name - keyword - 1024
  • kind - keyword - 128
  • startTime - date_nanos
  • endTime - date_nanos
  • status.code - int
  • status.message - keyword
  • serviceName - keyword
  • durationInNanos - long

Dynamic Template (path_match)

  • resource.attributes.* - keyword
  • attributes.* - keyword

Other Fields (if possible)

  • events - should be an array of nested objects
  • links - should be an array of nested objects.

Note:

  • With the current template, status.message is not friendly for Kibana. (discussed this on a call)
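
A sketch of what the resulting mapping fragment might look like, based only on the field lists above; the ignore_above values mirror the listed keyword lengths, and the exact template shape the project ships may differ:

{
  "mappings": {
    "dynamic_templates": [
      {
        "resource_attributes": {
          "path_match": "resource.attributes.*",
          "mapping": { "type": "keyword" }
        }
      },
      {
        "span_attributes": {
          "path_match": "attributes.*",
          "mapping": { "type": "keyword" }
        }
      }
    ],
    "properties": {
      "traceId":         { "type": "keyword", "ignore_above": 256 },
      "spanId":          { "type": "keyword", "ignore_above": 256 },
      "parentSpanId":    { "type": "keyword", "ignore_above": 256 },
      "name":            { "type": "keyword", "ignore_above": 1024 },
      "kind":            { "type": "keyword", "ignore_above": 128 },
      "startTime":       { "type": "date_nanos" },
      "endTime":         { "type": "date_nanos" },
      "status":          { "properties": { "code": { "type": "integer" }, "message": { "type": "keyword" } } },
      "serviceName":     { "type": "keyword" },
      "durationInNanos": { "type": "long" }
    }
  }
}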

[Processor] Service-map processor

One of the key features of Trace Analytics is to offer a service-map view derived from customers' trace data. To support this view, we will build a stateful processor which derives relationships between services from traces.

Service-Map

To achieve this, we will build a service-map processor which detects relationships using the traces. The service-map processor will cache all traces for a WINDOW_TIME (default window time is 3 minutes) and then detect relationships with the structure below:

{
 _id: //Hash of the service-group object, this is done for uniqueness and reduce duplicates.   
 serviceName:
 kind:
 destination: 
      {
       domain:
       resource:
      }
 target: 
      {
       domain:
       resource:
      }
 traceGroupName:
}

Optimize release builds - multiple tracking issues

  1. Currently, we have a similar-looking distributions block in all subprojects of archives; we can move this to a subprojects closure in the archives build.gradle. It is a little tricky, as the custom distribution name depends on the subproject name [and Gradle does not let us use a string as the custom distribution name].
  2. The upload to S3 uploads tars individually, which lets us use an individual task per tar, but we could also consider adding a batch upload task which generates a fileTree of all subproject tars and zips.
  3. Bucket policy and IAM roles - We can automate adding roles and updating bucket policies, which would completely remove the manual work of adding IAM policies for ODFE infra S3 access.
  4. Move common tasks to the parent build - currently this is not straightforward, as the task names depend on the subproject name. Check if there is a way to override the name.

Share Pipeline Configuration

There are two pipeline-level configuration values that need to be shared with plugins (a sketch of one way to expose them follows the list):

  1. Number of Processors - This is required for stateful processors. (Already a TODO)
  2. Name of the pipeline - This is required for logging purposes; given that logs are the only way to track things in SITUP, having the pipeline name in the logs will help us debug. (check on this with @yadavcbala)
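
One way this could be surfaced is a small description object handed to plugins at construction time. This is a hypothetical sketch, not the project's actual API; the names are illustrative.

/**
 * Hypothetical pipeline-level settings handed to plugins at construction time.
 * Names are illustrative; the actual Data Prepper API may differ.
 */
public interface PipelineDescription {
    /** Name of the owning pipeline, for log messages. */
    String getPipelineName();

    /** Number of processor (worker) threads, needed by stateful processors. */
    int getNumberOfProcessWorkers();
}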

[ES sink] Add retry utilities in ES sink plugin

  • Ensure the ES high-level client does exponential retries.
  • Handle 4xx errors (including partial bulk failures).
  • Retry 5xx forever with exponential backoff (see the sketch below).
  • Expose a file config to which 4xx errors will be dumped.
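
A minimal sketch of the retry decision for bulk responses, assuming the Elasticsearch 7.x high-level REST client and Java 11+; the dump-file handling and backoff parameters are illustrative only, not the plugin's actual implementation.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

final class BulkRetryStrategy {
    private final RestHighLevelClient client;
    private final Path failedDocsDump;  // illustrative: file where 4xx failures are recorded

    BulkRetryStrategy(RestHighLevelClient client, Path failedDocsDump) {
        this.client = client;
        this.failedDocsDump = failedDocsDump;
    }

    void execute(BulkRequest request) throws IOException, InterruptedException {
        long backoffMillis = 50;
        while (true) {
            BulkResponse response = client.bulk(request, RequestOptions.DEFAULT);
            if (!response.hasFailures()) {
                return;
            }
            BulkRequest retry = new BulkRequest();
            int i = 0;
            for (BulkItemResponse item : response.getItems()) {
                if (item.isFailed()) {
                    int status = item.getFailure().getStatus().getStatus();
                    if (status >= 500) {
                        // Transient server-side error: retry this document forever.
                        retry.add(request.requests().get(i));
                    } else {
                        // 4xx: record the failure and drop the document -- retrying will not help.
                        Files.writeString(failedDocsDump,
                                item.getFailureMessage() + System.lineSeparator(),
                                StandardCharsets.UTF_8,
                                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    }
                }
                i++;
            }
            if (retry.numberOfActions() == 0) {
                return;
            }
            request = retry;
            Thread.sleep(backoffMillis);
            backoffMillis = Math.min(backoffMillis * 2, 30_000);  // exponential backoff, capped
        }
    }
}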

[Situp Core]Parallelize Sinks

As of this writing, the Pipeline outputs records to sinks synchronously. Having multiple sinks is a requirement, and different sinks have different latencies (and behavior); we wouldn't want one slow-performing sink impacting other sinks. This issue is to track parallelizing writes to sinks (a sketch follows the gotchas below).

Gotchas

  1. The decision of marking records as processed will not be addressed in this issue, as it entails a bigger discussion involving the pipeline-to-pipeline communication pattern, e.g. halting the processing of records if one of the sinks (which is a pipeline of its own) is down
  2. As per the discussion, we will use the processor thread-pool for writing to sinks, i.e. no dedicated thread-pool for sinks. This was decided purely based on the idle time processor threads would have if they were waiting for sink threads to finish writing
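
A minimal sketch of the idea, assuming a hypothetical, simplified Sink interface and reusing a shared processor executor as described in the gotchas above:

import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

/** Writes the same batch of records to every sink in parallel on a shared executor. */
final class ParallelSinkWriter<T> {
    // Hypothetical Sink interface, standing in for the plugin's output contract.
    interface Sink<T> {
        void output(Collection<T> records);
    }

    private final List<Sink<T>> sinks;
    private final ExecutorService processorExecutor;  // shared processor thread-pool, per the gotchas

    ParallelSinkWriter(List<Sink<T>> sinks, ExecutorService processorExecutor) {
        this.sinks = sinks;
        this.processorExecutor = processorExecutor;
    }

    /** Blocks until every sink has consumed the batch, so a slow sink delays only this worker. */
    void writeToSinks(Collection<T> records) {
        CompletableFuture<?>[] writes = sinks.stream()
                .map(sink -> CompletableFuture.runAsync(() -> sink.output(records), processorExecutor))
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(writes).join();
    }
}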

Upgrade to latest OpenTelemetry Span Status spec.

  1. Upgrade SITUP to use the latest version of OpenTelemetry Proto which has the new Span Spec.

  2. Upgrade the sample apps to OpenTelemetry v0.14.0, which supports the new Span spec.

  3. Also upgrade the samples to the latest OpenTelemetry Python versions.

  4. After all this is done, perform an integration sanity check and ensure all sample apps provide only status codes 0, 1, and 2.

Note: We will do this in a branch because this will break the Kibana dashboard, as our dashboard uses ">0" as the error code.

[Situp-core] Exception Control Flow

Below is the outlined exception control flow for SITUP:

  • Exception during Pipeline Construction

    • Single Pipeline - Exits via System.exit(1) with a non-zero code and logs an appropriate message
    • Multiple Pipelines (not chained) - The failing pipeline is ignored and an appropriate message is logged. Other pipelines will continue normally
    • Chained Pipelines - The failing pipeline is ignored; the rest of the chain is built, but nothing gets executed and eventually everything shuts down
  • Exception while starting the source

    • Single Pipeline - Exits with an appropriate exception message in the log
    • Multiple Pipelines (not chained) - The failing pipeline is ignored and an appropriate message is logged. Other pipelines will continue normally
    • Chained Pipelines
      • Root Source - The root source pipeline never gets executed (i.e. it is ignored); child pipelines will start, but since the root source pipeline is never started, the child pipelines never receive any data - effectively nothing happens
      • Child Source - This case is not possible, as a child source is nothing but a PipelineConnector which solely assigns a buffer reference.
  • Exception in process worker

    • Single Pipeline - The pipeline gets shut down with an appropriate log message
    • Multiple Pipelines (not chained) - The failing pipeline gets shut down with an appropriate log message; other pipelines will continue normally
    • Chained Pipelines
      • Root Pipeline - The root pipeline gets shut down and no data flows; child pipelines may still exist and the user is expected to halt them. [any changes will be taken up in follow-up PRs]
      • Child Pipeline - The failing pipeline gets shut down; this will eventually block/halt every other chained pipeline

Add monitoring support for Data Prepper

Data Prepper should provide a way for users to monitor the various components that are running. We will expose a GET API endpoint which allows the user to get the current state of metrics for the running Data Prepper.

While certain components have specific metrics that apply to them, there are also certain metrics which apply more generically to processors, sinks, and buffers. The following metrics will be provided by default to all components of that type:

  • Prepper: Records in, records out, time elapsed
  • Sink: Records out, time elapsed
  • Buffer: Records written, records read, write time elapsed, read time elapsed, and timeouts

For implementation, we will use Micrometer.io for metrics instrumentation, and Prometheus for the metrics backend. Micrometer allows us to decouple metrics instrumentation from backend implementation, and swap metrics backends if needed. Prometheus is widely used and supported, and provides customers with many options for consumption.
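
A minimal sketch of what the Micrometer/Prometheus wiring described above could look like; the metric names, port, and embedded HTTP server are illustrative, not the project's actual implementation.

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

import com.sun.net.httpserver.HttpServer;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public final class MetricsEndpointSketch {
    public static void main(String[] args) throws IOException {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Illustrative per-component metrics; real metric names may differ.
        Counter recordsIn = registry.counter("prepper.records.in");
        Counter recordsOut = registry.counter("prepper.records.out");
        Timer timeElapsed = registry.timer("prepper.time.elapsed");

        recordsIn.increment();
        timeElapsed.record(() -> recordsOut.increment());  // time the handling of a record

        // Expose the scraped metrics on a GET endpoint (port and path are illustrative).
        HttpServer server = HttpServer.create(new InetSocketAddress(4900), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = registry.scrape().getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}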

[Pipeline Core] Support for Pipeline Connectors

Background

As part of its definition, a Sink defines one or more destinations to which a SITUP pipeline will publish records. A sink destination could be a service like Elasticsearch or S3, or another SITUP pipeline. By using another SITUP pipeline as a sink, we enable customers to chain multiple TI pipelines.

To allow defining a pipeline as a sink, the source for the connecting pipeline also becomes a pipeline, which demands a special type of plugin limited to this special use-case, i.e. pipeline chaining. This leads to the requirement for a PipelineConnector.

Current Pipeline Configuration

pipeline:
  name: otel-span
  source:
    apm_trace_source:
      server: "localhost"
      port: 9400
  buffer:
    bounded_blocking:
      buffer_size: 512
  processor:
    geoip:
      database: "/geo-ip.db"
    filter:
      fieldKey: "span_kind"
      fieldValue: "Client"
  sink:
    elasticsearch:
      hosts: ["https://search-sample-app-test.us-west-2.es.amazonaws.com"]
    s3:
      bucket: "global-bucket"

Multiple/Updated Pipelines Configuration

pipeline-1:
  source:
    apm_trace_source:
      server: "localhost"
      port: 9400
  buffer:
    bounded_blocking:
      buffer_size: 512
  processor:
    transform:
      to: elasticsearch
  sink:
    pipeline:
      name: "pipeline-2"
  workers: 4
  delay: 500 #ms
      
pipeline-2:
  source:
    pipeline:
      name: "pipeline-1"
  buffer:
    bounded-blocking:
      buffer_size: 1024
  processor:
    geoip:
      database: "/geo-ip.db"
  sink:
    elasticsearch:
      hosts: ["https://search-sample-app-test.us-west-2.es.amazonaws.com"]

The above sample is solely an example; it does not cover all the nuances that multiple pipelines bring to the table. Feel free to raise a question or comment.

Pipeline Connector

PipelineConnector will be a special type of plugin which is both a Source and a Sink, and we will limit the possibilities of extending this connector, i.e. we do not need more custom plugins for this special connector at the time of this writing. This will introduce a good number of ambiguous edge cases; below is my best attempt to cover them. Please feel free to comment if I missed any.

  1. Orphan Pipelines

What happens if there exists a pipeline which is not connected to any other pipeline? Will it be marked as invalid?

pipeline-1:
  source:
    file:
  sink:
    pipeline:
      name: "pipeline-2"
      
pipeline-2:
  source:
    pipeline:
      name: "pipeline-1"
  buffer:
     bounded-blocking:
  sink:
    elasticsearch:
      hosts: ["localhost:9200"]

pipeline-3:
  buffer:
     sqs:
  sink:
    elasticsearch:
      hosts: ["localhost:9200"]

At the time of this writing, we will allow such a configuration, as pipelines are executed independently.

  2. Other Validations
    Below are some invalid configuration examples:
pipeline-1:
  source:
    file:
  sink:
    pipeline:
      name: "pipeline-2"
      
pipeline-2:
  source:
    pipeline:
      name: "pipeline-3" # invalid - expected pipeline-1
  buffer:
     bounded-blocking:
  sink:
    elasticsearch:
      hosts: ["localhost:9200"]
pipeline-1:
  source:
    file:
  sink:
    pipeline:
      name: "pipeline-2"
      
pipeline-2:
  source:
    apm-otel-trace-source: # invalid - expected pipeline type
      name: "pipeline-3" 
  buffer:
     bounded-blocking:
  sink:
    elasticsearch:
      hosts: ["localhost:9200"]
pipeline-1:
  source:
    file:
  sink:
    pipeline:
      name: "pipeline-3" #invalid because there is no pipeline-3
     

Trace Analytics

This issue will be used to share details and get feedback about the upcoming Trace Analytics feature. Trace Analytics is the first step towards bringing Application Performance Management capabilities to ODFE customers.

trace-analytics-rfc

Automate tests for release artifacts

Currently we have no test framework for testing artifacts and Docker images. For every release, we have engineers run the sample and test locally. This lowers test quality and is slow. We should automate this with the following goals:

  1. Build the release artifacts/Docker image locally; as part of the release step, tests should run.
  2. Upload the release artifacts to our bucket for the odfe-release team to consume. (This is done.)
  3. Once the ODFE team lets us know that the artifacts are in staging and production, we should run the tests in step 1 against them.

[ES Sink] ODFE ISM

For trace analytics, we need to create a new ISM policy for the trace-analytics raw index. We will create a new policy called trace-analytics-raw with thresholds of 50 GB, 24 hours, or 50k docs (a sketch of such a policy follows the note below).

Note:

  • Regarding generic ISM, we need to discuss later whether generic ISM should be fixed or user-driven. This can be done after trace_analytics.
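
A sketch of what such a policy might look like, assuming the thresholds above are rollover conditions for the ODFE ISM rollover action; the state name and description are illustrative and the project's actual policy may differ:

{
  "policy": {
    "description": "Roll over the raw trace-analytics index at 50 GB, 24 hours, or 50k documents.",
    "default_state": "current_write_index",
    "states": [
      {
        "name": "current_write_index",
        "actions": [
          {
            "rollover": {
              "min_size": "50gb",
              "min_index_age": "24h",
              "min_doc_count": 50000
            }
          }
        ],
        "transitions": []
      }
    ]
  }
}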

[Processor] OtelTraceRawProcessor Plugin

Create a new Processor plugin that takes ResourceSpan as input and converts it to ES friendly docs.

  • Convert ResourceSpan to JSON. Reuse existing ApmSpanProcessor for JSON conversion.
  • Use ApmSpanProcessor on the JSON.

[SITUP Plugin] Add blocking queue

The current default buffer (UnboundedInMemoryBuffer) was added as a stopgap option; we need to add a BlockingBuffer and make it the default option.
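
A minimal sketch of a bounded blocking buffer backed by LinkedBlockingQueue; the simplified write/read methods here are hypothetical and the actual plugin API may differ.

import java.util.ArrayList;
import java.util.Collection;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Simplified bounded blocking buffer: writers block when full, readers wait up to a timeout. */
final class BlockingBuffer<T> {
    private final LinkedBlockingQueue<T> queue;

    BlockingBuffer(int bufferSize) {
        this.queue = new LinkedBlockingQueue<>(bufferSize);
    }

    /** Blocks until capacity is available, applying back-pressure to the source. */
    void write(T record) throws InterruptedException {
        queue.put(record);
    }

    /** Drains up to batchSize records, waiting at most timeoutMillis for the first one. */
    Collection<T> read(int batchSize, long timeoutMillis) throws InterruptedException {
        Collection<T> records = new ArrayList<>(batchSize);
        T first = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (first != null) {
            records.add(first);
            queue.drainTo(records, batchSize - 1);
        }
        return records;
    }
}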

Clean up plugins

  • Remove unsupported plugins:
    • stateless-trace-processor
    • apm_trace_source
  • Split apmtracesource into two separate plugin sub-projects:
    • otel_trace_raw_processor
    • otel_trace_source

    Add a README for both of them.

  • Rename the upper-case plugin to string_converter, with a boolean flag to denote whether it uses toUpperCase or toLowerCase.

Handle Zipkin B3 Propagation

Zipkin users use B3 propagation. When using B3, it uses the same span ID for the client and server side of an RPC.
Check this FAQ.

The OpenTelemetry spec doesn't officially support this behavior, as it uses W3C trace context, but there is an ongoing discussion about supporting it. Regardless of the OpenTelemetry decision, we should consider supporting this behavior in our trace analytics feature to stay backwards compatible.

In order to support this, we need to make the changes below (a sketch of the first follows the list):

  1. The raw trace processor should create the _id from unique span identifiers, which should contain the spanId and serviceName.
  2. Modify the service-map-processor to support a List of spans for a spanId lookup.
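
A minimal sketch of deriving such a composite document _id; the hashing and encoding scheme is illustrative only.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

final class SpanDocumentIds {
    /** Builds a document _id that stays unique even when B3 reuses a spanId across client and server spans. */
    static String documentId(String spanId, String serviceName) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest((spanId + "|" + serviceName).getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}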

[Situp-Core] Pipeline to have start/stop functionality

This issue is to track the progress, discussion, and decisions of adding start/stop functionality to Pipeline. Currently, start and stop do not completely perform those actions. This issue will attempt to close the gap.

Pipeline is integral to the working of SITUP; it has four key components: source, buffer, processor, and sink. A pipeline definition contains the required components source and sink, and the optional buffer and processors. A default buffer will be used if no buffer is specified in the definition.

execute()
On initiating execution of the pipeline, control triggers the start() operation on the defined source with either the defined or the default buffer. Control also initiates the processing, which includes executing the processors (if there are any) on records from the buffer and publishing the resulting records to all the configured sinks.

start()
TODO

stop()
Currently, we notify the defined source to stop publishing new records to the buffer. The pipeline will exhaust the existing records from the buffer before stopping the processing. TODO

Rename Processor Plugins to Prepper

A product decision is that Data Prepper contains Pipelines, and each pipeline will have a Source, one or more Sinks, and zero or more Preppers.

So we will rename Processor to Prepper.

Create a simple UI for Trace App

Create a UI for the odfe-pipes-trace-app comprising components such as load_main_screen, client_create_order, client_cancel_order, client_checkout, client_pay_order, and client_delivery_status.
