elastic / apm-data

apm-data holds definitions and code for manipulating Elastic APM data

License: Apache License 2.0

Makefile 0.09% Go 99.73% Shell 0.17%


apm-data's Issues

Remove model-modelpb compatibility layers

During the migration to protobuf we added some compatibility layers.

Once we are done, we should remove them:

model.ProtoBatchProcessor
modelprocessor.Chained
modelprocessor.PbChained -> modelprocessor.Chained

invalid input for HTTPHeader: nil, numbers and maps

Part of the ops KPI review, from the ecs logs:

decode error: data read error: v2.transactionRoot.Transaction: v2.transaction.Context: v2.context.Response: v2.contextResponse.Headers: invalid input for HTTPHeader: [<nil>]
decode error: data read error: v2.transactionRoot.Transaction: v2.transaction.Context: v2.context.Tags: Response: v2.contextResponse.Headers: invalid input for HTTPHeader: 301
decode error: data read error: v2.transactionRoot.Transaction: v2.transaction.Context: v2.context.Response: v2.contextResponse.Headers: invalid input for HTTPHeader: map[httponly:true path:/ samesite:Lax secure:true]
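The three failure modes above (null entries, bare numbers, and cookie-style maps) could be absorbed by a tolerant normalizer before strict HTTPHeader validation. A minimal sketch; `normalizeHeaderValue` is a hypothetical helper, not apm-data's decoder:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// normalizeHeaderValue coerces the loosely typed header values seen in the
// errors above (nil, bare numbers, cookie-style maps) into the []string
// form an HTTP header expects. Hypothetical helper, not part of apm-data.
func normalizeHeaderValue(v any) []string {
	switch val := v.(type) {
	case nil:
		return nil // drop null entries instead of rejecting the event
	case string:
		return []string{val}
	case float64: // JSON numbers decode as float64 (e.g. a 301 status code)
		return []string{fmt.Sprint(val)}
	case []any:
		var out []string
		for _, item := range val {
			out = append(out, normalizeHeaderValue(item)...)
		}
		return out
	case map[string]any: // e.g. map[httponly:true path:/ samesite:Lax secure:true]
		keys := make([]string, 0, len(val))
		for k := range val {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		pairs := make([]string, 0, len(keys))
		for _, k := range keys {
			pairs = append(pairs, fmt.Sprintf("%s=%v", k, val[k]))
		}
		return []string{strings.Join(pairs, "; ")}
	}
	return nil
}

func main() {
	fmt.Println(normalizeHeaderValue(float64(301)))            // [301]
	fmt.Println(normalizeHeaderValue([]any{nil, "text/html"})) // [text/html]
}
```

Whether dropping or coercing is the right behaviour is exactly what the issue should decide; the sketch only shows the mechanics.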

Add `span.id` field to transactions

We want to converge the span and transaction data models. For that, all transaction documents should also have a span.id field (where the field value is a copy of transaction.id).
This will also allow better correlation between logs and arbitrary spans (that are either spans or transactions) for OTel use cases.

OTel log records know about the span.id they belong to. However, if they are not tied to spans as SpanEvents, there's no way to tell whether the span.id in a log record refers to a span document or a transaction document. So, in the case of a transaction, the correlation breaks: we query for span.id = XYZ while transactions do not have a span.id field at all.
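The proposed change boils down to copying transaction.id into span.id on every transaction document. A sketch over a simplified flat document map; the helper name is hypothetical, only the field names come from the issue:

```go
package main

import "fmt"

// addSpanID copies transaction.id into span.id on transaction documents,
// so that logs correlated via span.id also match transactions.
// Hypothetical helper over a simplified flat document representation.
func addSpanID(doc map[string]string) {
	if txID, ok := doc["transaction.id"]; ok {
		if _, exists := doc["span.id"]; !exists {
			doc["span.id"] = txID
		}
	}
}

func main() {
	doc := map[string]string{"transaction.id": "abc123"}
	addSpanID(doc)
	fmt.Println(doc["span.id"]) // abc123
}
```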

Introduce fuzz testing

Follow-up from #63

We should introduce fuzz testing to make sure we are not missing anything

As this is a library, we might also evaluate adding fuzz testing to APM Server and fuzzing the intake endpoint directly.

OTel logs correlation breaks for transactions-mapped spans

When does the problem occur?

When receiving an OTel span S that is being mapped to a transaction (e.g. root span, or SpanKind = SERVER) and in addition an OTel log event L that is correlated to that span S (i.e. the log event has the OTLP field SpanID pointing to that span).

Problem

Correlation on the span / transaction breaks.

Reason

In the above situation we map the OTel span S to a transaction document. Thus, the OTLP field SpanID is being mapped to the transaction.id field in the internal model.

When receiving the corresponding log event L, the log event points to S through an OTLP SpanID field. However, since the log event L does not carry the characteristics of the span S (but only the SpanID) we cannot decide whether the OTLP field SpanID on the log event needs to be mapped to a span.id or a transaction.id field. As a result the SpanID OTLP field is always being mapped to the span.id field (even for associated transaction documents).

Remove processor fields

The "processor" fields are a bit of a relic, and we should aim to remove them in the long term. With that in mind, I wonder if we should change the model a little so we can remove them from the apm-data codebase, and instead set the `processor.*` fields in our ingest pipelines, or set a value on `constant_keyword` fields where it makes sense.

e.g. for metrics, processor.name and processor.event are both always "metric", so we can update their field definitions to set the value in the mapping: https://github.com/elastic/apm-server/blob/23fb1577909836ebf45e65705df3fd560de5adb1/apmpackage/apm/data_stream/app_metrics/fields/fields.yml#L30-L35
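A sketch of what such field definitions could look like, with the constant value baked into the `constant_keyword` mapping (the exact layout is assumed from the linked apm-server fields.yml):

```yaml
- name: processor.event
  type: constant_keyword
  description: Processor event.
  value: metric
- name: processor.name
  type: constant_keyword
  description: Processor name.
  value: metric
```

With a `value` set in the mapping, documents no longer need to carry the field at all; Elasticsearch fills it in at query time.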

IIANM the only exception to this is the apm.traces and apm.rum data streams, where spans and transactions end up. These both have processor.name: transaction, but they differ in processor.event (one is "span", one is "transaction"). Eventually spans and transactions should converge, but for now I think we could set the value in an ingest pipeline.

Maybe we could:

  • set event.kind to either "span" or "transaction" in the apm-data code
  • update the traces ingest pipeline to use this to populate processor.event, and then remove event.kind since those values are not valid for event.kind

WDYT?

Originally posted by @axw in #58 (comment)

Related: #47.

OTel instrumentation of OTLP consumer rejected metrics

For OTLP input, there is currently "monitoring" for UnsupportedMetricsDropped in the OTLP consumer. #156 added partial-success support to it, but it returns rejected data points instead of dropped metrics.

We would like to move away from monitoring and have OTel instrumentation of rejected metrics so that all apm-data library users will have access to it.

automatically update license when 'model_generated.go' is generated

This is a follow-up of #17 , in particular this comment.

When generating model_generated.go through the make generate command, the generated file does not contain the required license headers.

As a consequence, we have to also run make update-licenses in order to fix that.

Making make generate also update the license in model_generated.go would remove the need to execute a separate command.

input/otlp: record map-type attributes

We currently do not record map-type attributes when translating OTLP events to Elastic APM events. For now we may want to flatten the map, adding dots as needed. Hopefully in the future we will be using the Elasticsearch flattened field type, and this will be unnecessary.
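The flattening described above can be sketched as a small recursive helper; `flattenAttr` is illustrative, not the apm-data implementation:

```go
package main

import "fmt"

// flattenAttr recursively flattens a map-type attribute into dotted keys,
// e.g. {"http": {"method": "GET"}} -> {"http.method": "GET"}, which is the
// interim approach suggested above until the flattened field type is used.
func flattenAttr(prefix string, v any, out map[string]any) {
	m, ok := v.(map[string]any)
	if !ok {
		out[prefix] = v // scalar leaf: record it under the dotted key
		return
	}
	for k, val := range m {
		key := k
		if prefix != "" {
			key = prefix + "." + k
		}
		flattenAttr(key, val, out)
	}
}

func main() {
	out := map[string]any{}
	flattenAttr("labels", map[string]any{"http": map[string]any{"method": "GET"}, "retries": 3}, out)
	fmt.Println(out["labels.http.method"]) // GET
}
```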

Investigate proto.Clone performance impact

Local benchmarks show proto.Clone taking a noticeable amount of CPU time (~8% of total). It's probably not enough to cause a regression, but it's not great that something we introduced consumes this much CPU, as it decreases the impact of other performance improvements.

Clone uses reflection under the hood; we should try to minimize its usage.

Avoid map allocations when mapping modelpb to modeljson

The new protobuf logic allocates maps and copies from structpb.Struct to map[string]any.

We don't really need to do this: we could investigate passing the structpb.Struct type directly, encoding it to JSON with a custom marshaling method or something similar. This would improve performance and decrease memory allocations.

Improve validation strategy on empty elements

Something along the lines of []foo{validFoo, null, validFoo1} shouldn't parse successfully.

IMO we have two options here:

  • ensure that each element of a slice is set (not empty/null) as part of validation process
  • add required fields to each slice element type so that the error is caught when validating the slice elements
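Both options above amount to checking each slice element during validation. A sketch of what that could look like (the `foo` type, helper name, and error texts are illustrative, not apm-data's validator):

```go
package main

import "fmt"

// foo is a stand-in decoded type.
type foo struct{ Name string }

// validateFoos rejects slices containing null or empty elements instead of
// silently accepting them (option 1); the required-field check on Name also
// catches empty-but-non-nil objects (option 2).
func validateFoos(foos []*foo) error {
	for i, f := range foos {
		if f == nil {
			return fmt.Errorf("foos[%d]: null element", i)
		}
		if f.Name == "" {
			return fmt.Errorf("foos[%d]: missing required field 'name'", i)
		}
	}
	return nil
}

func main() {
	// []foo{validFoo, null, validFoo1} from the issue: should fail validation.
	fmt.Println(validateFoos([]*foo{{Name: "validFoo"}, nil, {Name: "validFoo1"}}))
}
```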

OpenTelemetry JVM metrics are not properly mapped

The JVM metrics reported by OpenTelemetry Java agents are not properly mapped. In elastic/apm-server#8777 we changed the mapping to comply with the change in the metrics semantic convention, however, this mapping logic seems to be ignored.

The metric documents do appear in Discover, however with wrong field names.

Example

This is how the metric document looks right now:
image

And this is how a valid JVM metrics document would look like:
image

So it seems that this mapping logic is not being applied.

Example data:

Here is some OTLP example data for the JVM metrics: https://gist.github.com/AlexanderWert/bf3b8a6cbbd02a345038bd8e8cac520f

Hypothesis:

Let's take a concrete metric: process.runtime.jvm.memory.usage

In this mapping logic it is assumed that this metric is reported as a Gauge metric type; however, in fact this metric (process.runtime.jvm.memory.usage) has the type sum (as we can see in the example data).
So very likely the root cause is the wrong metric type in the mapping logic.

Here is the OTel spec for the metrics: https://opentelemetry.io/docs/reference/specification/metrics/semantic_conventions/runtime-environment-metrics/#jvm-metrics

All types of counters (Counter, UpDownCounter) are mapped to the sum metric type in the OTLP protocol!
So we need to have the mapping in the MetricTypeSum if-branch.
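The shape of the fix can be sketched with simplified stand-ins for the OTLP data-point kinds (these are not the pmetric API; the helper is hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// metricType is a simplified stand-in for OTLP metric data-point kinds.
type metricType int

const (
	typeGauge metricType = iota
	typeSum
	typeHistogram
)

// shouldRemapJVM sketches the fix: the JVM remapping must accept the Sum
// type as well as Gauge, because OTLP encodes both Counter and
// UpDownCounter as Sum.
func shouldRemapJVM(name string, t metricType) bool {
	if t != typeGauge && t != typeSum {
		return false
	}
	return strings.HasPrefix(name, "process.runtime.jvm.")
}

func main() {
	fmt.Println(shouldRemapJVM("process.runtime.jvm.memory.usage", typeSum)) // true
}
```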

Setup semver

We should start versioning this library with semver, with a changelog.

Provide enablement material

We want to onboard apm-agent developers to this repository, to be able to work on open-telemetry mappings and to add new fields to the Intake API and processing.

For enablement we need to

  • add a high level overview over data flow and processing
  • provide a recorded video with a code-walkthrough
  • document which code changes, make commands, approval test adoptions, etc. are required for adding a new field; link to a reference PR
  • document which code changes are required for adding or changing open telemetry mappings
  • show which steps are required in the APM Server code base to pick up updates

Protobuf benchmarks

We do have some Go benchmarks, but as we're trying to optimize our protobuf setup, it would be nice to have an automated/reproducible way of benchmarking our protobuf setup as well.

So the idea here is to set up a suite of benchmarks which would build structs from the generated protobuf definitions, and analyze the generated size of the objects and the time to encode/decode.

Integrate protobuf definitions for model types in apm-server

Follow up on #36

Phase 2

Use object notation for data_stream fields

Currently, we're setting the data_stream.* fields in dotted notation:

DataStreamType string `json:"data_stream.type,omitempty"`
DataStreamDataset string `json:"data_stream.dataset,omitempty"`
DataStreamNamespace string `json:"data_stream.namespace,omitempty"`

This causes issues when using the reroute processor:

While I think that the reroute processor, and all processors for that matter, should support both dotted and nested field notations, we should use nested fields to work around that issue for now.

It seems unlikely that users have relied on the dotted field notation in their ingest pipeline as the set processor doesn't even work with dotted field names. The only processor for which it's possible to access dotted field names is the script processor.

The primary way to set the data_stream.* fields in an ingest pipeline is the reroute processor, but it can't be used for APM due to the dotted field notation.

[docs] create data mapping dictionary for otel -> ecs mappings

Make it easier for anyone to understand which otel semantic conventions are mapped to ECS fields when processing with apm-data logic. The challenge will be keeping this up to date if done manually.
The main audience for this documentation is UI devs, users, PMs & support engineers.

docs: document release and tag process

This repo is generally supposed to be stack-version independent, but some changes need to be pulled into minor or patch releases of the stack. New features and bug fixes need to be released in minor and patch versions that can be matched with stack versions. We need to document this.

Introduce protobuf definitions for model types

For several reasons we would like to define an efficient, stable, binary encoding for the model types, e.g. for storing events in Badger for tail-based sampling. These would be much faster to encode/decode and, more importantly, will have strong stability guarantees.

To achieve the above, we will define our intermediate, in-memory/on-disk, model types in protobuf -- this will be the source of truth. We'll take a phased approach to this, given that the existing types are used all across apm-server, and making a big-bang change would carry a significant amount of risk of introducing bugs.

With #35 merged, we have created a cleaner separation between the model types and the way they are encoded to JSON. The model types no longer have to directly map to the final document structure, though for our sanity we should probably keep them close. The model types do not need to be ECS-compliant, and we can instead evolve the JSON encoding over time without changing the model types.

Phase 1 (iteration-05)

  • introduce protobuf definitions (probably worth basing off @marclop's work in https://github.com/elastic/apm-ingest-queueless)
  • generate Go types from protobuf (into model/modelpb or something like that)
  • introduce code for JSON encoding protobuf types by transforming to internal/modeljson types, like we're doing with model.APMEvent now
  • introduce code for mapping model events to the protobuf-generated types; remove the code for translating from model types to modeljson, and instead translate model types to protoc-generated types, then protoc-generated types to modeljson

Phase 2 (iteration-06)
#52

Review protobuf int size and signed/unsigned usage

I am going to use this issue to verify the size and signed/unsigned for integers in protobuf.
Below is the list of all ints in the proto definitions. For each of them, I will validate the definition with the ingest pipeline and the json decoder and leave comments on the issue.

Note: this only looks at the integers. The floats/double aren't in here.

Once each int is validated and possibly fixed, this issue will be closed.
See #47

client.proto

  • port

uint32 port = 3;

The maximum port number is 65,535, which is well below what a uint32 can hold.
Protobuf doesn't allow setting int16, so this type, and every other port using uint32, is valid.
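Since protobuf can't enforce the 16-bit range in the schema, it has to be enforced at validation time. An illustrative helper (not apm-data's validator):

```go
package main

import "fmt"

// validatePort checks that a uint32-carried port fits the real port range.
// Protobuf has no 16-bit integer type, so uint32 is used on the wire and
// the range must be enforced by validation. Illustrative helper.
func validatePort(port uint32) error {
	if port > 65535 {
		return fmt.Errorf("port %d out of range (max 65535)", port)
	}
	return nil
}

func main() {
	fmt.Println(validatePort(443), validatePort(70000))
}
```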

destination.proto

  • port

uint32 port = 2;

See port comment in client.proto above.

event.proto

  • severity

int64 severity = 9;

Defined as int64 in modeljson, same as we have here. So there doesn't seem to be any reason to downgrade this to a lower size.

Severity int64 `json:"severity,omitempty"`

See #123

experience.proto

  • count

int64 count = 1;

This field is defined as int in the JSON decoder.

Count nullable.Int `json:"count" validate:"required,min=0"`

See #122

http.proto

  • transfer_size

optional int64 transfer_size = 4;

Defined as int64 in modeljson.

TransferSize *int64 `json:"transfer_size,omitempty"` // Non-ECS field.

  • encoded_body_size

optional int64 encoded_body_size = 5;

Defined as int64 in modeljson.

EncodedBodySize *int64 `json:"encoded_body_size,omitempty"` // Non-ECS field.

  • decoded_body_size

optional int64 decoded_body_size = 6;

Defined as int64 in modeljson.

DecodedBodySize *int64 `json:"decoded_body_size,omitempty"` // Non-ECS field.

  • status_code

int32 status_code = 7;

Defined as int in modeljson.

StatusCode int `json:"status_code,omitempty"`

See #123

log.proto

  • line

int32 line = 2;

Defined as int in modeljson.

Line int `json:"line,omitempty"`

See #123

message.proto

  • age_millis

optional int64 age_millis = 3;

Defined as int64 in modeljson.

Millis *int64 `json:"ms,omitempty"`

metricset.proto

  • Metricset/doc_count

int64 doc_count = 4;

Defined as an int64 in modeljson.
https://github.com/elastic/apm-data/blob/main/model/internal/modeljson/document.go#L76

  • Histogram/counts

repeated int64 counts = 2;

Defined as int64 in modeljson.

Counts []int64 `json:"counts"`

  • SummaryMetric/count

int64 count = 1;

Defined as int64 in modeljson.
https://github.com/elastic/apm-data/blob/e5765b8f8d8992d4360231ce86d5f57a8d637366/model/internal/modeljson/metricset.go#L51C1-L51C1

See #123

  • AggregationDuration/count

int64 count = 1;

Defined as int in modeljson.

See #122
See #123

process.proto

  • Process/ppid

uint32 ppid = 1;

This is defined as an int32 in modeljson, which seems valid.
https://github.com/elastic/apm-data/blob/main/input/elasticapm/internal/modeldecoder/v2/model.go#L509

Note: Hasn't this been deprecated in ECS?
https://github.com/elastic/ecs/blob/2fb814f063746a1fac3ff1390d2e9387bdd47a2f/docs/release-notes/8.0.asciidoc?plain=1#L16

  • Process/pid

uint32 pid = 7;

Defined as an int in modeljson, which seems valid.

Pid int `json:"pid,omitempty"`

  • ProcessThread/id

int32 id = 2;

Defined as an int in modeljson, which seems valid.

ID int `json:"id,omitempty"`

session.proto

  • sequence

int64 sequence = 2;

Defined as an int in modeljson.

Sequence int `json:"sequence,omitempty"`

See #122

source.proto

  • port

uint32 port = 4;

See port comment in client.proto above.

span.proto

  • DB/rows_affected

optional uint32 rows_affected = 1;

This is a uint32 in modeljson.

RowsAffected *uint32 `json:"rows_affected,omitempty"`

  • Composite/count

uint32 count = 2;

This is an int in modeljson.

Count int `json:"count"`

This can't be negative, so switching to uint seems valid.

stacktrace.proto

  • StacktraceFrame/lineno

optional uint32 lineno = 2;

This is a uint32 in modeljson, which seems valid.

Number *uint32 `json:"number,omitempty"`

  • StacktraceFrame/colno

optional uint32 colno = 3;

This is a uint32 in modeljson, which seems valid.

Column *uint32 `json:"column,omitempty"`

  • Original/lineno

optional uint32 lineno = 3;

This is a uint32 in modeljson, which seems valid.

Lineno *uint32 `json:"lineno,omitempty"`

  • Original/colno

optional uint32 colno = 5;

This is a uint32 in modeljson, which seems valid.

Colno *uint32 `json:"colno,omitempty"`

transaction.proto

  • SpanCount/dropped

optional uint32 dropped = 1;

This is a uint32 in modeljson, which seems valid.

Dropped *uint32 `json:"dropped,omitempty"`

  • SpanCount/started

optional uint32 started = 2;

This is a uint32 in modeljson, which seems valid.

Started *uint32 `json:"started,omitempty"`

url.proto

  • port

uint32 port = 8;

See port comment in client.proto above.

ci: Run micro-benchmarks on every commit and PR

Description

APM Data is a crucial module. Rather than relying solely on the APM Server benchmarks to detect potential performance regressions, run the Go micro-benchmarks for every commit and in every PR, so that any regressions (and performance improvements) can be caught early.

Use `modelpb.<Type>FromVTPool` wherever possible

Description

Since we added back pooling in #128, we should start using the pooled modelpb.<Type>FromVTPool wherever possible and indicate that clients should use ReturnToVTPool after they're done processing an event.
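The FromVTPool/ReturnToVTPool pair generated by vtprotobuf behaves like a typed object pool. A self-contained sketch of the intended usage pattern, using sync.Pool and a stand-in `event` type as assumptions in place of the generated code:

```go
package main

import (
	"fmt"
	"sync"
)

// event is a stand-in for a modelpb message; eventPool mimics what the
// generated <Type>FromVTPool / ReturnToVTPool pair does internally.
type event struct{ TraceID string }

var eventPool = sync.Pool{New: func() any { return &event{} }}

func eventFromPool() *event { return eventPool.Get().(*event) }

// returnToPool resets the event before handing it back, as ReturnToVTPool
// does, so the next caller never sees stale data.
func (e *event) returnToPool() {
	*e = event{}
	eventPool.Put(e)
}

func main() {
	e := eventFromPool()
	e.TraceID = "abc"
	fmt.Println(e.TraceID)
	e.returnToPool() // clients must return the event once processing is done
}
```

The key contract for library users is the last line: every pooled event must be returned exactly once, after processing completes.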

Fuzz testing

We should fuzz test the inputs, and ensure for example that decoding cannot cause panics during translation to model types.

We should also provide a test package for producing randomised/fuzzed model.Batches, to feed into a BatchProcessor. This could then be used to ensure processors do not panic or otherwise behave badly when encountering arbitrary data that passes decoding.
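A panic-safety check like the one described can be wrapped so a fuzz target only has to assert "no panics". A sketch, with json.Unmarshal standing in for the intake decoder (the real target would exercise the elasticapm/otlp input packages):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeSafely wraps a decoder so that any panic during decoding is
// surfaced as an error instead of crashing; json.Unmarshal is a stand-in
// for the intake decoder here.
func decodeSafely(data []byte) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("decoder panicked: %v", r)
		}
	}()
	var v map[string]any
	return json.Unmarshal(data, &v)
}

func main() {
	fmt.Println(decodeSafely([]byte(`{"transaction":{}}`))) // <nil>
}
```

A native Go fuzz target in a _test.go file would then feed it arbitrary bytes: `func FuzzIntake(f *testing.F) { f.Fuzz(func(t *testing.T, data []byte) { _ = decodeSafely(data) }) }`.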

Define a process for dealing with OTel SemConv changes

The mapping for OTel data currently supports a certain (old) version of the semantic conventions.

With the OTel Semantic Conventions being merged with ECS and being stabilized, we have to expect SemConv versions soon that will introduce many breaking changes in field names.

We need to define a process for dealing with different versions of Semantic Conventions so that we can support newer versions of SemConv while keeping backwards compatibility with older versions. The implication is that, with a given version of apm-data we should support a range of SemConv versions.

Related Info

  • Semantic conventions define schema files that enumerate all the changes between versions
  • the semantic conventions versions theoretically may even vary per signal (within a single connection / agent / SDK)

Derivation of the `span.type` from OTel data is non-deterministic

In OTel a single span can have a mix of attributes from different namespaces. For example, a span could have db.* attributes and at the same time http.* attributes.

We use the logic of this switch statement to determine the foundSpanType variable. This is done while iterating over all the attributes on a span, so effectively the last attribute visited defines foundSpanType and the corresponding mapping logic below. This is non-deterministic because we don't know the order of the span attributes.

We need a more explicit logic to derive the span.type.
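One explicit alternative is a fixed priority ranking over the attribute namespaces, so the result no longer depends on map iteration order. A sketch; the priority order and the type names are illustrative assumptions, not the apm-data mapping:

```go
package main

import "fmt"

// deriveSpanType picks span.type by a fixed priority order instead of
// "last attribute wins", making the result independent of attribute
// iteration order. attrs flags which namespaces are present, e.g.
// attrs["db"] == true if any db.* attribute is set. The ranking and the
// returned type names are illustrative.
func deriveSpanType(attrs map[string]bool) string {
	for _, ns := range []string{"db", "messaging", "rpc", "http"} {
		if attrs[ns] {
			switch ns {
			case "db":
				return "db"
			case "messaging":
				return "messaging"
			case "rpc", "http":
				return "external"
			}
		}
	}
	return "app"
}

func main() {
	// A span with both db.* and http.* attributes now always maps to "db".
	fmt.Println(deriveSpanType(map[string]bool{"db": true, "http": true}))
}
```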

Map child-ids in OTel attributes

To support inferred spans in OTel-based agents, we need to map the child-ids field from OTel attributes to the child.id field.

TBD: the OTel attribute name to map from.

review truncating otel strings that are indexed as keywords

We currently truncate otel attributes that are indexed as keywords to 1024 chars

stringval := truncate(v.Str())

The mappings are generally created with ignore_above: 1024, which would lead to not indexing this field if the value exceeds 1024 chars.

We should review whether truncating the values of otel strings is the best choice. With truncation, the field will always be indexed, but anything above 1024 chars is completely lost. Without truncation, fields exceeding the limit would not be indexed or searchable, but would remain available in _source. When moving to synthetic source, the time to retrieve the non-indexed values might increase.
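The behaviour under review is essentially a cap at the mapping's ignore_above limit. A self-contained, rune-safe sketch of such a truncate function (illustrative, not the apm-data implementation):

```go
package main

import "fmt"

const keywordLimit = 1024 // matches ignore_above: 1024 in the mappings

// truncate caps a keyword-indexed string at the mapping's ignore_above
// limit, trading the tail of the value for guaranteed indexability.
func truncate(s string) string {
	r := []rune(s)
	if len(r) <= keywordLimit {
		return s
	}
	return string(r[:keywordLimit])
}

func main() {
	fmt.Println(truncate("short value stays intact"))
}
```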
