elastic / apm-data
apm-data holds definitions and code for manipulating Elastic APM data
License: Apache License 2.0
During the migration to protobuf we added some compatibility layers.
Once we are done, we should remove them:
- `model.ProtoBatchProcessor`
- `modelprocessor.Chained`
- `modelprocessor.PbChained` -> `modelprocessor.Chained`
Part of the ops KPI review, from the ECS logs:
decode error: data read error: v2.transactionRoot.Transaction: v2.transaction.Context: v2.context.Response: v2.contextResponse.Headers: invalid input for HTTPHeader: [<nil>]
decode error: data read error: v2.transactionRoot.Transaction: v2.transaction.Context: v2.context.Response: v2.contextResponse.Headers: invalid input for HTTPHeader: 301
decode error: data read error: v2.transactionRoot.Transaction: v2.transaction.Context: v2.context.Response: v2.contextResponse.Headers: invalid input for HTTPHeader: map[httponly:true path:/ samesite:Lax secure:true]
The code is checking for both attributes, even though the comment (and spec) says that only one is required:
Lines 183 to 187 in 39e2a3c
We want to converge the span and transaction data models. For that, all transaction documents should also have a `span.id` field (where the field value is a copy of `transaction.id`).
This will also allow better correlation between logs and arbitrary spans (that are either spans or transactions) for OTel use cases.
OTel log records know the `span.id` they belong to. However, if they are not tied to spans as SpanEvents, there is no way to tell whether the `span.id` in a log record belongs to a span document or a transaction. So, in the case of a transaction, the correlation breaks, because we query for `span.id = XYZ` while transactions do not have a `span.id` field at all.
Investigate adding the `-race` flag to `go test`.
Followup from #63
We should introduce fuzz testing to make sure we are not missing anything.
As this is a library, we might also evaluate adding fuzz testing to APM Server and fuzzing the intake endpoint directly.
When receiving an OTel span `S` that is being mapped to a transaction (e.g. a root span, or `SpanKind = SERVER`), and in addition an OTel log event `L` that is correlated to that span `S` (i.e. the log event has the OTLP field `SpanID` pointing to that span), correlation on the span / transaction breaks.
In the above situation we map the OTel span `S` to a transaction document. Thus, the OTLP field `SpanID` is mapped to the `transaction.id` field in the internal model.
When receiving the corresponding log event `L`, the log event points to `S` through an OTLP `SpanID` field. However, since the log event `L` does not carry the characteristics of the span `S` (but only the `SpanID`), we cannot decide whether the OTLP `SpanID` field on the log event needs to be mapped to a `span.id` or a `transaction.id` field. As a result, the OTLP `SpanID` field is always mapped to the `span.id` field (even for associated transaction documents).
The "processor" fields are a bit of a relic, and we should aim to remove them in the long term. With that in mind, I wonder if we should change the model a little bit so we can remove them from the apm-data codebase, and set the `processor.*` fields in our ingest pipelines, or by setting a value on `constant_keyword` fields where it makes sense.
e.g. for metrics, `processor.name` and `processor.event` are both always "metric", so we can update their field definitions to set the value in the mapping: https://github.com/elastic/apm-server/blob/23fb1577909836ebf45e65705df3fd560de5adb1/apmpackage/apm/data_stream/app_metrics/fields/fields.yml#L30-L35
IIANM the only exception to this is the `apm.traces` and `apm.rum` data streams, where spans and transactions end up. These both have `processor.name: transaction`, but they differ in `processor.event` (one is "span", one is "transaction"). Eventually spans and transactions should converge, but for now I think we could set the value in an ingest pipeline.
Maybe we could:
- set `event.kind` to either "span" or "transaction" in the apm-data code
- update the `traces` ingest pipeline to use this to populate `processor.event`, and then remove `event.kind`, since those values are not valid for `event.kind`

WDYT?
Originally posted by @axw in #58 (comment)
Related: #47.
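The idea above could look like the following hypothetical ingest pipeline fragment (the processor layout is illustrative, not the actual APM traces pipeline):

```json
{
  "processors": [
    {
      "set": {
        "field": "processor.event",
        "copy_from": "event.kind"
      }
    },
    {
      "remove": {
        "field": "event.kind"
      }
    }
  ]
}
```

`set` with `copy_from` and `remove` are standard Elasticsearch ingest processors, so no scripting would be needed.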
With elastic/kibana#151826 OTel system metrics and JVM metrics are being displayed by Kibana in their raw format.
We don't need the following mapping logic in the APM intake anymore:
apm-data/input/otlp/metrics.go
Line 116 in 3ad1a5c
This issue is about cleaning up and removing the logic in the OTel mapping.
For OTLP input, there is currently "monitoring" for `UnsupportedMetricsDropped` in the OTLP consumer. #156 added partial success support to it, but it returns rejected data points instead of dropped metrics.
We would like to move away from monitoring and have OTel instrumentation of rejected metrics, so that all apm-data library users will have access to it.
This is a follow-up of #17 , in particular this comment.
When generating `model_generated.go` through the `make generate` command, the generated file does not contain the required license headers.
As a consequence, we also have to run `make update-licenses` to fix that.
Making `make generate` also update the license in `model_generated.go` would remove the need to execute a separate command.
We currently do not record map-type attributes when translating OTLP events to Elastic APM events. For now we may want to flatten the map, adding dots as needed. Hopefully in the future we will be using the Elasticsearch `flattened` field type, and this will be unnecessary.
Local benchmarks show `proto.Clone` taking a small amount of CPU time (~8% of total time). It's probably not enough to cause a regression, but it's not great that something we introduced is taking noticeable CPU time, as it would decrease the impact of other performance improvements.
Clone uses reflection under the hood; we should try to minimize its usage.
The new protobuf logic is allocating maps and copying from `structpb.Struct` to `map[string]any`.
We don't really need to do this, and could investigate passing the `structpb.Struct` type directly, which is then encoded to JSON with a custom marshaling method or something similar. This would improve performance and decrease memory allocations.
Something along the lines of `[]foo{validFoo, null, validFoo1}` shouldn't be parsed successfully.
IMO we have two options here:
From @axw: with the protobuf enums, is it possible to use options to control the string representation? https://protobuf.dev/programming-guides/proto3/#enum-value-options
Then maybe we can avoid the manually maintained maps from names to enum values.
Dev docs should be updated to account for the protobuf definitions and the new `modelpb` package.
The old `model` package will be removed.
See https://github.com/elastic/apm-data/blob/main/dev_docs/HOW_TO.md
The JVM metrics reported by OpenTelemetry Java agents are not properly mapped. In elastic/apm-server#8777 we changed the mapping to comply with the change in the metrics semantic conventions; however, this mapping logic seems to be ignored.
The metric documents do appear in Discover, however with wrong field names.
This is how the metric document looks right now:
And this is how a valid JVM metrics document would look:
So it seems that this mapping logic is not being applied.
Here is some OTLP example data for the JVM metrics: https://gist.github.com/AlexanderWert/bf3b8a6cbbd02a345038bd8e8cac520f
Let's take a concrete metric: `process.runtime.jvm.memory.usage`.
In this mapping logic it is assumed that this metric is reported with the `Gauge` metric type; however, in fact this metric (`process.runtime.jvm.memory.usage`) has the type `Sum` (as we can see in the example data).
So very likely the root cause is the wrong metric type in the mapping logic.
Here is the OTel spec for the metrics: https://opentelemetry.io/docs/reference/specification/metrics/semantic_conventions/runtime-environment-metrics/#jvm-metrics
All types of counters (`Counter`, `UpDownCounter`) are mapped to the `Sum` metric type in the OTLP protocol!
So we need to have the mapping in the `MetricTypeSum` if-branch.
Now that we removed `model` usage completely from APM Server, we should remove the unused code from apm-data.
We should start versioning this library with semver, with a changelog.
The OTLP input uses collector structs, as that's what we get from gRPC (example, metrics).
With elastic/apm-server#11470, we duplicate the same logic with the OTel SDK structs.
Investigate refactoring the structs so we can remove that logic duplication (with no performance loss), maybe by converting the SDK structs into collector ones.
We want to onboard apm-agent developers to this repository, to be able to work on open-telemetry mappings and to add new fields to the Intake API and processing.
For enablement we need to
We should investigate whether we could use the Elasticsearch `uri_parts` ingest processor to parse URLs, and include only the full URL in the model types.
Originally posted by @axw in #47 (comment)
If `APMEvents` is backed by vtproto's pool, then `timestamppb` comes out as one of the most allocation-heavy objects. This was fixed in apm-aggregation by using `uint64` to encode timestamps. If `uint64` suits all our needs, we should consider promoting that package to apm-data: https://github.com/elastic/apm-aggregation/tree/main/aggregators/internal/timestamppb
Our code in traces.go doesn't seem to map db attributes onto `span.db`; this should be reviewed and fixed.
We do have some Go benchmarks, but as we're trying to optimize our protobuf setup, it would be nice to have an automated/reproducible way of benchmarking the protobuf setup as well.
So the idea here is to set up a suite of benchmarks which would construct structs from the generated protobuf definitions, and analyze the generated size of the objects and the time to encode/decode.
Currently, all logs derived from span events strip all `span.*` and `transaction.*` fields (including `span.id` and `transaction.id`):
Lines 880 to 881 in 2adc910
Follow up on #36
We should review protobuf fields to ensure we are using proper types.
Currently, we're setting the `data_stream.*` fields in dotted notation:
apm-data/model/internal/modeljson/document.go
Lines 73 to 75 in 6ef8c81
This causes issues when using the `reroute` processor:
While I think that the `reroute` processor, and all processors for that matter, should support both dotted and nested field notations, we should use nested fields to work around that issue for now.
It seems unlikely that users have relied on the dotted field notation in their ingest pipelines, as the set processor doesn't even work with dotted field names. The only processor that can access dotted field names is the script processor.
The primary way to set the `data_stream.*` fields in an ingest pipeline is the `reroute` processor, but it can't be used for APM due to the dotted field notation.
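For illustration, the two notations produce differently shaped documents (example values are hypothetical). The dotted form is a single top-level key containing literal dots:

```json
{"data_stream.type": "metrics", "data_stream.dataset": "apm.app", "data_stream.namespace": "default"}
```

whereas the nested form, which ingest processors such as `reroute` can address, is:

```json
{"data_stream": {"type": "metrics", "dataset": "apm.app", "namespace": "default"}}
```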
Update the modeljson generator to remove the redundant `minValue` check from the generated Go code for unsigned integers, while keeping the validation in the generated JSON schema.
This would let us remove the nolint directive in https://github.com/elastic/apm-data/blob/main/input/elasticapm/internal/modeldecoder/generator/slice.go
See #123 (comment)
Make it easier for anyone to understand which OTel semantic conventions are mapped to ECS fields when processing with apm-data logic. The challenge will be keeping this up to date when done manually.
The main audience for this documentation is UI devs, users, PMs & support engineers.
This repo is generally supposed to be stack-version independent, but some changes need to be pulled into minor or patch releases of the stack. New features and bug fixes need to be released in minor and patch versions that can be matched with stack versions. We need to document this.
For several reasons we would like to define an efficient, stable, binary encoding for the model types, e.g. for storing events in Badger for tail-based sampling. These would be much faster to encode/decode, and more importantly will have strong stability guarantees.
To achieve the above, we will define our intermediate, in-memory/on-disk, model types in protobuf -- this will be the source of truth. We'll take a phased approach to this, given that the existing types are used all across apm-server, and making a big-bang change would carry a significant amount of risk of introducing bugs.
With #35 merged, we have created a cleaner separation between the model types and the way they are encoded to JSON. The model types no longer have to directly map to the final document structure, though for our sanity we should probably keep them close. The model types do not need to be ECS-compliant, and we can instead evolve the JSON encoding over time without changing the model types.
Phase 1 (iteration-05)
- define the protobuf model types (in `model/modelpb` or something like that)
- encode to JSON via `internal/modeljson` types, like we're doing with `model.APMEvent` now

Phase 2 (iteration-06)
- #52
I am going to use this issue to verify the size and signedness of the integers in the protobuf definitions.
Below is the list of all ints in the proto definitions. For each of them, I will validate the definition against the ingest pipeline and the JSON decoder, and leave comments on the issue.
Note: this only looks at the integers; the floats/doubles aren't in here.
Once each int is validated and possibly fixed, this issue will be closed.
See #47
apm-data/model/proto/client.proto
Line 29 in e5765b8
The maximum port number is 65535, which is way lower than what a uint32 can carry.
Protobuf doesn't have a 16-bit integer type, so this type, and every other port using uint32, is valid.
apm-data/model/proto/destination.proto
Line 26 in e5765b8
See port comment in `client.proto` above.
apm-data/model/proto/event.proto
Line 39 in e5765b8
Defined as int64 in modeljson, same as we have here. So there doesn't seem to be any reason to downgrade this to a lower size.
See #123
apm-data/model/proto/experience.proto
Line 32 in e5765b8
This field is defined as `int` in the JSON decoder.
See #122
apm-data/model/proto/http.proto
Line 47 in e5765b8
Defined as int64 in modeljson.
apm-data/model/internal/modeljson/http.go
Line 45 in e5765b8
apm-data/model/proto/http.proto
Line 48 in e5765b8
Defined as int64 in modeljson.
apm-data/model/internal/modeljson/http.go
Line 46 in e5765b8
apm-data/model/proto/http.proto
Line 49 in e5765b8
Defined as int64 in modeljson.
apm-data/model/internal/modeljson/http.go
Line 47 in e5765b8
apm-data/model/proto/http.proto
Line 50 in e5765b8
Defined as int in modeljson.
apm-data/model/internal/modeljson/http.go
Line 49 in e5765b8
See #123
apm-data/model/proto/log.proto
Line 37 in e5765b8
Defined as int in modeljson.
apm-data/model/internal/modeljson/log.go
Line 37 in e5765b8
See #123
apm-data/model/proto/message.proto
Line 29 in e5765b8
Defined as int64 in modeljson.
apm-data/model/proto/metricset.proto
Line 30 in e5765b8
Defined as an int64 in modeljson.
https://github.com/elastic/apm-data/blob/main/model/internal/modeljson/document.go#L76
apm-data/model/proto/metricset.proto
Line 52 in e5765b8
Defined as int64 in modeljson.
apm-data/model/proto/metricset.proto
Line 56 in e5765b8
Defined as int64 in modeljson.
https://github.com/elastic/apm-data/blob/e5765b8f8d8992d4360231ce86d5f57a8d637366/model/internal/modeljson/metricset.go#L51C1-L51C1
See #123
apm-data/model/proto/metricset.proto
Line 61 in e5765b8
Defined as int in modeljson.
apm-data/model/proto/process.proto
Line 25 in e5765b8
This is defined as an int32 in modeljson, which seems valid.
https://github.com/elastic/apm-data/blob/main/input/elasticapm/internal/modeldecoder/v2/model.go#L509
Note: Hasn't this been deprecated in ECS?
https://github.com/elastic/ecs/blob/2fb814f063746a1fac3ff1390d2e9387bdd47a2f/docs/release-notes/8.0.asciidoc?plain=1#L16
apm-data/model/proto/process.proto
Line 31 in e5765b8
Defined as an int in modeljson, which seems valid.
apm-data/model/proto/process.proto
Line 36 in e5765b8
Defined as an int in modeljson, which seems valid.
apm-data/model/proto/session.proto
Line 26 in e5765b8
Defined as an int in modeljson.
See #122
apm-data/model/proto/source.proto
Line 30 in e5765b8
See port comment in `client.proto` above.
apm-data/model/proto/span.proto
Line 47 in e5765b8
This is a uint32 in modeljson.
apm-data/model/internal/modeljson/span.go
Line 60 in e5765b8
apm-data/model/proto/span.proto
Line 70 in e5765b8
This is an int32 in modeljson.
apm-data/model/internal/modeljson/span.go
Line 51 in e5765b8
This can't be negative, so switching to uint seems valid.
apm-data/model/proto/stacktrace.proto
Line 28 in e5765b8
This is a uint32 in modeljson, which seems valid.
apm-data/model/proto/stacktrace.proto
Line 29 in e5765b8
This is a uint32 in modeljson, which seems valid.
apm-data/model/proto/stacktrace.proto
Line 50 in e5765b8
This is a uint32 in modeljson, which seems valid.
apm-data/model/proto/transaction.proto
Line 48 in e5765b8
This is a uint in modeljson, which seems valid.
apm-data/model/proto/transaction.proto
Line 49 in e5765b8
This is a uint in modeljson, which seems valid.
apm-data/model/proto/url.proto
Line 32 in e5765b8
See port comment in `client.proto` above.
APM Data is a crucial module. Rather than relying only on the APM Server benchmarks to detect potential performance regressions, we should run the Go micro-benchmarks for every commit and in every PR, so any regressions (and performance improvements) can be caught early.
Since we added back pooling in #128, we should start using the pooled `modelpb.<Type>FromVTPool` wherever possible, and indicate that clients should use `ReturnToVTPool` after they're done processing an event.
We currently generate multiple JSON Schema documents. These are synchronised to APM Agent repos for testing, which requires having to list out each file. We should look at generating a single compound (bundled) JSON Schema document to simplify this: elastic/apm-agent-python#1745 (comment)
We should fuzz test the inputs, and ensure for example that decoding cannot cause panics during translation to model types.
We should also provide a test package for producing randomised/fuzzed model.Batches, to feed into a BatchProcessor. This could then be used to ensure processors do not panic or otherwise behave badly when encountering arbitrary data that passes decoding.
The mapping for OTel data currently supports a certain (old) version of the semantic conventions.
With the OTel Semantic Conventions being merged with ECS and being stabilized, we have to expect new SemConv versions soon that will introduce many breaking changes in the field names.
We need to define a process for dealing with different versions of the Semantic Conventions, so that we can support newer versions of SemConv while keeping backwards compatibility with older versions. The implication is that a given version of apm-data should support a range of SemConv versions.
In OTel, a single span can have a mix of attributes from different namespaces. For example, a span could have `db.*` attributes and at the same time `http.*` attributes.
We use the logic of this switch statement to determine the `foundSpanType` variable. This is done while iterating over all the attributes on a span, so effectively the last attribute defines the actual `foundSpanType` and the corresponding mapping logic below. This is nondeterministic because we don't know the order of the span attributes.
We need more explicit logic to derive the `span.type`.
To support inferred spans in OTel-based agents, we need to map the child-ids field from OTel attributes to the `child.id` field.
TBD: the OTel attribute name to map from.
Once open-telemetry/semantic-conventions#435 lands, we need to map the OTel `code.stacktrace` attribute to a top-level field (instead of into labels) so we can display stacktraces on spans for OTel-based agents.
We currently truncate OTel attributes that are indexed as keywords to 1024 chars:
Line 279 in 88a3977
The alternative would be `ignore_above: 1024`, which would lead to not indexing the field if the value exceeds 1024 chars.
We should review whether truncating the values of OTel strings is the best choice (the field will always be indexed, but anything above 1024 chars will be completely lost) versus not truncating the values (leading to certain fields not being indexed and searchable, but still available in `_source`, if they exceed the limit). When moving to synthetic source, the time to retrieve the non-indexed values might increase.
We should also look at the encoding/decoding performance of maps, and compare against the array-of-structs approach: https://protobuf.dev/programming-guides/proto3/#backwards. We may want to do that after integrating into apm-server, in case it makes operating on the model types too painful.
Originally posted by @axw in #47 (comment)
Similar to #129
Followup from #42 (comment)
We should expect most keys to not require sanitisation, and optimise for that.
We should investigate switching from MD5 to something faster, like xxhash, for calculating error grouping keys. MD5 is cryptographic, which is not necessary for our purposes. xxHash is non-cryptographic, and considerably faster while maintaining high quality hashes.