
opencensus-specs's Introduction

Warning

OpenCensus and OpenTracing have merged to form OpenTelemetry, which serves as the next major version of OpenCensus and OpenTracing.

OpenTelemetry has now reached feature parity with OpenCensus, with tracing and metrics SDKs available in .NET, Golang, Java, NodeJS, and Python. All OpenCensus Github repositories, except census-instrumentation/opencensus-python, will be archived on July 31st, 2023. We encourage users to migrate to OpenTelemetry by this date.

To help you gradually migrate your instrumentation to OpenTelemetry, bridges are available in Java, Go, Python, and JS. Read the full blog post to learn more.

OpenCensus Specs

This is a high-level design of the OpenCensus library. Some of the API examples may be written (or linked) in C++, Java, or Go, but every language should translate/modify them based on language-specific patterns/idioms. Our goal is that all libraries have a consistent "look and feel".

This repository uses terminology (MUST, SHOULD, etc) from RFC 2119.

Overview

Today, distributed tracing systems and stats collection tend to use unique protocols and specifications for propagating context and sending diagnostic data to backend processing systems. This is true amongst all the large vendors, and we aim to provide a reliable implementation in service of frameworks and agents. We do that by standardizing APIs and data models.

OpenCensus provides a tested set of application performance management (APM) libraries, such as Metrics and Tracing, under a friendly OSS license. We acknowledge the polyglot nature of modern applications, and provide implementations in all major programming languages, including C/C++, Java, Go, Ruby, PHP, Python, C#, Node.js, Objective-C and Erlang.

Ecosystem Design

Ecosystem layers

(Diagram: ecosystem layers)

Service exporters

Each backend service SHOULD implement this API to export data to its service.

OpenCensus library

This is what the following sections of this document define and explain.

Manually instrumented frameworks

We are going to instrument some of the most popular frameworks for each language using the OpenCensus library to allow users to get traces/stats when they use these frameworks.

Tools for automatic instrumentation

Some of the languages may support libraries for automatic instrumentation. For example, Java applications can use byte-code manipulation (monkey patching) to provide an agent that automatically instruments an application. Note: not all the languages support this.

Application

This is the customer's application/binary.

Library Design

Namespace and Package

Components

This section focuses on the important components that each OpenCensus library must have to support all required functionalities.

Here is a layering structure of the proposed OpenCensus library:

(Diagram: library components)

Context

Some of the features for distributed tracing and tagging need a way to propagate a specific context (trace, tags) in-process (possibly between threads) and between function calls.

The key elements of the context support are:

  • Every implementation MUST offer an explicit or implicit generic Context propagation mechanism that allows different sub-contexts to be propagated.
  • Languages that already have this support, like Go (context.Context) or C# (Activity), MUST use the language supported generic context instead of building their own.
  • For an explicit generic context implementation you can look at the Java io.grpc.Context.
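
As a concrete illustration, here is a minimal sketch of implicit in-process propagation using Go's context.Context; the key type and the "span-1234" value are illustrative only and not part of any OpenCensus API:

package main

import (
	"context"
	"fmt"
)

// traceCtxKey is an illustrative private key type; real implementations
// (e.g. go.opencensus.io/trace) define their own unexported keys.
type traceCtxKey struct{}

func withTraceData(ctx context.Context, data string) context.Context {
	return context.WithValue(ctx, traceCtxKey{}, data)
}

func traceDataFrom(ctx context.Context) (string, bool) {
	v, ok := ctx.Value(traceCtxKey{}).(string)
	return v, ok
}

func main() {
	ctx := withTraceData(context.Background(), "span-1234")
	// The context (and the sub-contexts it carries) flows through function calls.
	if data, ok := traceDataFrom(ctx); ok {
		fmt.Println("propagated:", data)
	}
}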

Trace

The Trace component is designed to support distributed tracing (see the Dapper paper). OpenCensus provides functionality beyond data collection and export; for example, it allows tracking of active spans and keeping local samples for interesting requests.

The key elements of the API can be broken down as:

  • A Span represents a single operation within a trace. Spans can be nested to form a trace tree.
  • Libraries must allow users to record tracing events for a span (attributes, annotations, links, etc.).
  • Spans are carried in the Context. Libraries MUST provide a way of getting, manipulating, and replacing the Span in the current context.
  • Libraries SHOULD provide a means of dynamically controlling the trace global configuration at runtime (e.g. trace sampling rate/probability).
  • Libraries SHOULD keep track of active spans and in memory samples based on latency/errors and offer ways to access the data.
  • Because context must also be propagated across processes, libraries MUST offer functionality that allows any transport (e.g. RPC, HTTP, etc.) system to encode/decode the “trace context” for placement on the wire (see the sketch after this list).
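
A minimal sketch of these elements using the Go library (go.opencensus.io/trace); span names and the attribute key are illustrative:

package main

import (
	"context"
	"log"

	"go.opencensus.io/trace"
	"go.opencensus.io/trace/propagation"
)

func main() {
	ctx := context.Background()

	// Start a span; it is carried in the returned context.
	ctx, span := trace.StartSpan(ctx, "example.org/ParentOp")
	defer span.End()

	// Record tracing events on the span.
	span.AddAttributes(trace.StringAttribute("example.org/key", "value"))
	span.Annotate(nil, "processing started")

	// A nested span forms a trace tree.
	_, child := trace.StartSpan(ctx, "example.org/ChildOp")
	child.End()

	// Encode the span context for placement on the wire (binary format).
	wire := propagation.Binary(span.SpanContext())
	if sc, ok := propagation.FromBinary(wire); ok {
		log.Printf("propagated trace id: %s", sc.TraceID)
	}
}
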
Links

Tags

Tags are values propagated through the Context subsystem inside a process and among processes by any transport (e.g. RPC, HTTP, etc.). For example, tags are used by the Stats component to break down measurements by arbitrary metadata set in the current process or propagated from a remote caller.

The key elements of the Tags component are:

  • A tag: this is a key-value pair, where the keys and values are strings. The API allows for creating, modifying and querying objects representing a tag value.
  • A set of tags (with unique keys) carried in the Context. Libraries MUST provide a means of manipulating the tags in the context, including adding new tags, replacing tag values, deleting tags, and querying the current value for a given tag key.
  • Because tags must also be propagated across processes, libraries MUST offer functionality that allows RPC systems to encode/decode the set of tags for placement on the wire.
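
For illustration, a minimal sketch of creating and manipulating tags in the Go library (go.opencensus.io/tag); the key name and value are illustrative:

package main

import (
	"context"
	"fmt"
	"log"

	"go.opencensus.io/tag"
)

func main() {
	ctx := context.Background()

	// A tag key; keys and values are strings.
	keyMethod, err := tag.NewKey("example.org/keys/method")
	if err != nil {
		log.Fatal(err)
	}

	// Add/replace tags in the set carried by the context.
	ctx, err = tag.New(ctx,
		tag.Upsert(keyMethod, "GetFeed"), // insert or replace the value
	)
	if err != nil {
		log.Fatal(err)
	}

	// Query the current value for a given tag key.
	if v, ok := tag.FromContext(ctx).Value(keyMethod); ok {
		fmt.Println("method tag:", v)
	}
}
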
Links
  • Details about Tags package can be found here.

Stats

The Stats component is designed to record measurements, dynamically break them down by application-defined tags, and aggregate those measurements in user-defined ways. It is designed to offer multiple types of aggregation (e.g. distributions) and be efficient (all measurement processing is done as a background activity); aggregating data enables reducing the overhead of uploading data, while also allowing applications direct access to stats.

The key elements the API MUST provide are:

  • Defining what is to be measured (the types of data collected, and their meaning), and how data will be aggregated (e.g. into a distribution, cumulative aggregation vs. deltas, etc.). Libraries must offer ways for customers to define metrics that make sense for their application, and support a canonical set for RPC/HTTP systems.
  • Recording data - APIs for recording measured values. The recorded data is then broken down by tags carried in the context (e.g. a tag can have a value that describes the current RPC service/method name; when RPC latency is recorded, this can be made in a generic call, without having to specify the exact method), and aggregated as needed (e.g. a histogram of all latency values).
  • Accessing the aggregated data. This can be filtered by data type, resource name, etc. This allows applications to easily get access to their own data in-process.
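
A minimal sketch of these three elements using the Go library (go.opencensus.io/stats and stats/view); measure and view names are illustrative:

package main

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// Define what is to be measured.
var latencyMs = stats.Float64("example.org/measures/request_latency",
	"request latency", stats.UnitMilliseconds)

func main() {
	// Define how the data is aggregated (a latency distribution, cumulative).
	v := &view.View{
		Name:        "example.org/views/request_latency_distribution",
		Description: "distribution of request latencies",
		Measure:     latencyMs,
		Aggregation: view.Distribution(0, 25, 50, 100, 200, 400, 800, 1600),
	}
	if err := view.Register(v); err != nil {
		log.Fatal(err)
	}

	// Record data; tags carried in the context (none here) break it down.
	stats.Record(context.Background(), latencyMs.M(37.5))

	// Aggregated data can be read back in-process.
	rows, err := view.RetrieveData(v.Name)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("rows: %v", rows)
}
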
Links

Supported propagation formats


opencensus-specs's Issues

(re-)make span.kind a thing

As far as I know, one of the goals of Census is to be used in place of instrumentation like Zipkin's. A couple of years back, we learned from Stackdriver that span.kind is extremely helpful to know. It lets you lightly map communication semantics, i.e. the intent of the library: for example, we know whether the caller intended to be a client or a message producer, or whether a receiver intended to be a server or a message consumer. Many libraries are built on this slightly higher-level information than just the direction of traffic.

Though I can't find anything on GitHub, I think span.kind was intentionally taken off the table. I'd like to have that discussion here, as it impacts the viability of third-party instrumentation. Ideally folks are on board with the following stable span kinds, representing remote communication intent (for example, CLIENT means I am acting as a client and sending a remote message); a sketch of how this could look in the Go library follows the list.

  • CLIENT
  • SERVER
  • PRODUCER
  • CONSUMER
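
For reference, the OpenCensus Go library already exposes client/server span kinds (producer/consumer would be an addition); a minimal sketch, assuming go.opencensus.io/trace, with the span name illustrative:

package main

import (
	"context"

	"go.opencensus.io/trace"
)

func callRemoteService(ctx context.Context) {
	// The span records that this side intended to act as a client.
	_, span := trace.StartSpan(ctx, "example.org/CallRemote",
		trace.WithSpanKind(trace.SpanKindClient))
	defer span.End()

	// PRODUCER/CONSUMER kinds (for messaging) are what this issue proposes
	// adding; they are not constants in the current Go library.
}

func main() {
	callRemoteService(context.Background())
}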

If curious, here are the definitions from zipkin
https://github.com/openzipkin/zipkin-api/blob/master/zipkin2-api.yaml#L320
Here are the ones from opentracing
https://github.com/opentracing/specification/blob/master/semantic_conventions.md#modelling-special-circumstances

Even if Census clarifies things differently, it would be helpful. I'd highly recommend not relegating this to an attribute (tag in other tracing systems), because it is a very important part of how you instrument code.

The impact is at least clarifying what is currently murky, and also making this a viable library for what I believe its goal is (routine third-party instrumentation).

See https://github.com/census-instrumentation/opencensus-erlang/pull/53/files#r171096697

Decide the scope of data collection

Currently, we don't recommend any policy about the scope of data collection. For example, the Go stats package only provides data collection via the global mechanism. Once turned on, it will collect data from all OC-instrumented code in the process. Users cannot specifically target and collect data from a single component or library.

The spec should address this problem by either suggesting that OpenCensus implementations provide a non-global mechanism or documenting that global collection is the expected behavior.

/cc @dinooliva @sebright @bogdandrutu

Specify how to handle invalid serialized Tag and Trace Contexts.

The binary encoding specification specifies a general format for serializing fields, but some of the fields may need additional validation. For example, opencensus-java requires tag keys to contain only printable ASCII characters and have length less than 256. opencensus-specs should specify how to apply these restrictions when deserializing contexts, since they can affect the interactions between processes using different implementations or versions of OpenCensus.

/cc @dinooliva @bogdandrutu

Selective Tag propagation

When instrumenting a library, you want to be as generous as possible with tags that you provide so that the Operator (whoever ends up using your library in an application) has maximum flexibility in how they construct views to break down stats.

However, currently all tags are propagated (for example in gRPC) by default. This means that there's a cost added for every new tag, a cost that is only visible to the Instrumentor (person instrumenting the library) but really paid by the Operator (application developer).

To solve this, we need a way for the Operator to limit tag propagation.

Sanitization rules on view names/tag keys for exporters

In OpenCensus we have restrictions on view names and tag keys: they should be ASCII strings with fewer than 256 characters. However, when we export views and stats to other monitoring backends, those backends may have different restrictions, and we may need to sanitize view names and/or tag keys.

For example, Prometheus only accepts metric and label names with alphanumeric characters and underscores, and in the Go and Java exporters we need to sanitize view names and tag keys by replacing all other characters with underscores. However, as @jcd2 described in census-instrumentation/opencensus-go#173, the sanitization may lead to some unexpected cases. For example, suppose we have "grpc/io/some/view" and "grpc/io/some-view"; they will both be sanitized to "grpc_io_some_view", and Prometheus will reject the second one.

We need some cross-language specs on how to deal with such cases. Currently the Go and Java exporters just do the sanitization and let the monitoring client libraries deal with the corner cases.

/cc @bogdandrutu @dinooliva @jcd2 @odeke-em @rakyll @Ramonza

Stats: Consider adding support for relative measures.

(For the purposes of the following examples, I will be referring to the Go implementation, as that is the one I am most familiar with.)

A use case can be found for a type of Measure that receives a value (whether an "Int64RelativeMeasure" or a float version) and, instead of making a measurement with the given value, updates the previously known value relative to it.

An example use case would be counting how many entries of a particular object are in a DB. You can measure relatively whenever something is inserted to the DB:

stats.Record(ctx, myRelativeInt64.M(1))

This would add 1 to whatever the previous value was and would remove the need for the storage layer to store explicit state.

Whether or not this is currently supported is not specified in the OpenCensus spec, and I feel like it could expand the covered, common cases.

As an extra example, looking at statsd, and specifically its gauge metrics, it uses a specific set of symbols to indicate whether a metric is absolute or relative. With OpenCensus, it would only require making new Measures that support retaining a previous value.
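
To make the idea concrete, here is a hedged sketch of what a delta-based recording helper could look like on top of today's Go API; the relativeInt64 type, its Add method, and the measure name are hypothetical and not part of OpenCensus:

package main

import (
	"context"
	"sync/atomic"

	"go.opencensus.io/stats"
)

// relativeInt64 is a hypothetical helper: it keeps the running total
// in-process so callers only supply deltas, as the issue proposes.
type relativeInt64 struct {
	measure *stats.Int64Measure
	current int64
}

// Add applies a relative change and records the new absolute value.
func (r *relativeInt64) Add(ctx context.Context, delta int64) {
	v := atomic.AddInt64(&r.current, delta)
	stats.Record(ctx, r.measure.M(v))
}

func main() {
	entries := &relativeInt64{
		measure: stats.Int64("example.org/measures/db_entries", "entries in the DB", stats.UnitDimensionless),
	}
	// On insert into the DB: +1 relative to the previous value.
	entries.Add(context.Background(), 1)
	// On delete: -1.
	entries.Add(context.Background(), -1)
}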

Questions about the binary encoding

I would like to make a pull request to clarify some parts of the binary encoding specification.

  1. Do fields always need to be serialized in order? I.e., in version 1, do all fields with id 0 need to precede all fields with id 1, even if both fields were known when version 1 was first released?
  2. Can a field disallow multiple values? For example, when a serialized trace context contains multiple trace ids, could that be considered a parse error?
  3. Is the version mentioned in Serialization Rules a different version from the version described at the start of General Format? If I understand correctly, the version field describes something like a major version while the version mentioned in Serialization Rules is like a minor version.

/cc @bogdandrutu

What are the protobuf models for? Where is the aggregation daemon?

It's confusing to me that OpenCensus integrates so much logic into the client libraries. I'd rather have expected well-tuned, performant transit of the app's observability data to an OpenCensus aggregation daemon. That would greatly reduce the implementation burden on each platform, and allow polyglot nodes to still effectively aggregate data.

We have a robust model of ProtoBufs, which really out of the gate made me think all this was coming: I thought that was going to be the basis for having client instrumentations talk to the local OpenCensus node-aggregator daemon. I thought the ProtoBufs would be part of how the client instrumentation spoke to the local aggregator daemon, and how the local aggregator daemon talked to other higher-up daemons. But I haven't seen anything like that, and I'm really wondering why these protobuf definitions were created in the first place. If each platform is going to be completely standalone, it seems like serialization to and from protobuf would be a major hit vs. directly targeting the language, and there's no use that I can see for protobuf if the client instrumentation only talks via an "exporter" that speaks some other already-defined spec.

I also was expecting the aggregation daemon was going to evolve a runtime remote API (http or grpc) to control the aggregation system. Use cases like #46 - telling OpenCensus to make sure it exports all the samples it has on a specific request, retrospectively - seem like the kind of place where I would expect to have an HTTP or grpc service that I can use to talk to and control my local-aggregations.

A lot of these ideas about capabilities stem from the post "Google's Approach to Observability," especially at the end, talking about "Aggregation of data".

There's a huge disconnect in my mind. Why is there no aggregation daemon, and why is each client starting from scratch? What purpose do these protocol buffers serve, if the client instrumentation only has a native API and exports via some other remote API?

gRPC integration guidance

Language-independent mapping of gRPC concepts onto the stats and trace APIs. This should specify at least:

  • Metadata names for propagation (trace, baggage?)
  • Metrics
  • Default views

Decouple producers and consumers

Apologies if filed in the wrong repo; not clear where cross-cutting concerns should be represented.

I believe an implicit commitment from OpenCensus ought to be that OpenCensus-annotated code should be loosely-bound to consuming services (Prometheus, Stackdriver). Rather, once code is annotated with OpenCensus libraries, users of such code should be permitted to arbitrarily choose these Services at runtime (through config).

This commitment appears to not be inherent|true today.

In my limited experience using OpenCensus (Golang, Java), code is required to explicitly reference the intended|desired consuming service:

exporter, err := stackdriver.NewExporter(stackdriver.Options{})

or:

StackdriverTraceExporter.createAndRegister(StackdriverTraceConfiguration.builder().build());

along with associated imports.

I anticipate OpenCensus users will be willing to revise code to annotate it to reference OpenCensus but that they may balk at the prospect of then also specifying the consuming service.

I'd expect|prefer:

exporter, err := opencensus.NewExporter()

and then runtime config specifying a specific implementation of an OpenCensus Exporter with relevant configuration.

/cc: @mtwo @shahprit

Guidance for how to measure time/latency

We need guidance for how to measure latencies and time things, since this will be very common. In particular:

  • Guidance on what time unit to use
  • What measure type to use (float64 or int64)
  • How to handle overflows
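
As one possible convention (not yet specified), latency could be recorded as float64 milliseconds derived from a monotonic clock; a minimal Go sketch, with the measure name illustrative:

package main

import (
	"context"
	"log"
	"time"

	"go.opencensus.io/stats"
)

// float64 milliseconds keeps sub-millisecond precision and does not overflow
// for realistic latencies (int64 milliseconds would truncate sub-ms values).
var latencyMs = stats.Float64("example.org/measures/latency",
	"operation latency", stats.UnitMilliseconds)

func timedOperation(ctx context.Context) {
	start := time.Now()
	defer func() {
		elapsed := float64(time.Since(start)) / float64(time.Millisecond)
		stats.Record(ctx, latencyMs.M(elapsed))
	}()

	time.Sleep(10 * time.Millisecond) // the work being measured
}

func main() {
	timedOperation(context.Background())
	log.Println("recorded one latency sample")
}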

Specify sanitization rules for exporters

OpenCensus tag and stats names are not as restrictive as those of some of the backends we are uploading the data to. In order to comply with the restrictions the backends enforce, we need to sanitize the names. Sanitization rules need to be consistent across the different language implementations.

We need to write a spec that contains the sanitization rules.

  • Replace the unaccepted characters with the character of choice for the backend, e.g. with _.
  • If the name starts with an invalid character, add a prefix (e.g. key_ for keys).
  • Return error if the total length exceeds the allowed name limit.
  • Return error if the sanitized name is identical to the sanitized version of a different name.
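
A minimal Go sketch of these rules (the 100-character limit, the key_ prefix, and the accepted character set are illustrative choices; a real exporter would plug in its backend's limits):

package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode"
)

const maxNameLen = 100 // illustrative backend limit

func isValid(r rune) bool {
	return r == '_' || unicode.IsLetter(r) || unicode.IsDigit(r)
}

// sanitize applies the rules above; seen tracks already-produced names so
// that collisions between different inputs are reported as errors.
func sanitize(name string, seen map[string]bool) (string, error) {
	var b strings.Builder
	for _, r := range name {
		if isValid(r) {
			b.WriteRune(r)
		} else {
			b.WriteRune('_') // replace unaccepted characters
		}
	}
	out := b.String()
	if out != "" && unicode.IsDigit(rune(out[0])) {
		out = "key_" + out // prefix names with an invalid leading character
	}
	if len(out) > maxNameLen {
		return "", errors.New("sanitized name exceeds length limit: " + out)
	}
	if seen[out] {
		return "", errors.New("sanitized name collides with an existing name: " + out)
	}
	seen[out] = true
	return out, nil
}

func main() {
	seen := map[string]bool{}
	for _, n := range []string{"grpc/io/some/view", "grpc/io/some-view"} {
		s, err := sanitize(n, seen)
		fmt.Println(n, "->", s, err)
	}
}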

/cc @bogdandrutu @acetechnologist @dinooliva @g-easy @adriancole @odeke-em

consider renaming "stats" packages to "metrics"

"stats" seems much too general, "metrics" seems to me to be a more accurate term of art.

I get the feeling I might be wading into a controversial topic here, but in all of the ad hoc conversations I've had, people seem to unanimously prefer "metrics". Putting this here to allow those people to air their grievances and hopefully also to get the other side of the argument (if there is one).

Remove Mean aggregation.

@jkschneider commented:

I worry about this one. Mean isn't aggregable across dimensions, so doesn't seem particularly useful in general. For distribution summary type metrics, shipping a total amount and count is generally sufficient to be able to derive a truly aggregable mean. Some systems are so aware of this problem that they provide a standard query function OOTB (see dist-avg). Some monitoring systems expect mean to be pre-computed as a matter of convention, but could shipping mean be an exporter concern for those systems?

Disable merge commits

Can we disable merge commits on this repo? We prefer squash or rebase for a cleaner history on other repos.

Should tag values be less restrictive?

Tag values as specified in tags/TagContext.md are as restrictive as keys:

TagKey: a string or string wrapper, with some restrictions:

  • Must contain only printable ASCII (codes between 32 and 126, inclusive).
  • Must have length greater than zero and less than 256.

TagValue: a string or string wrapper with the same restrictions as TagKey, except that it is allowed to be empty.

This seems too restrictive; common metrics backends seem to support arbitrary values.

Also, I would guess that it would be very inconvenient for non-Latin languages, where you could imagine things like city names being in tag values (for example). I would be very frustrated if the backend I'm using supports my native language's characters but I'm forced to encode everything just because OpenCensus creates an artificial restriction.

Add specification on how to add tags to current TagContext

We have user questions about how to get the current TagContext and add additional tags to it. For example https://github.com/census-instrumentation/opencensus-java/blob/master/examples/src/main/java/io/opencensus/examples/stats/StatsRunner.java#L60.

It's useful to have a specification in https://github.com/census-instrumentation/opencensus-specs/blob/master/tags/TagContext.md.

Moreover, we should also specify that the withTagContext() method replaces the current TagContext rather than adding more tags to current TagContext. The method name has caused confusion to users.

/cc @bogdandrutu @sebright

Java and Go are incompatible for SD monitoring

If a user uses Java and Go and adds an IntMeasure, then defines a view with a Sum aggregation, the data sent to SD monitoring will be incompatible, because Java defines the MetricDescriptor with a value type of INT64 while Go defines the MetricDescriptor with a value type of DOUBLE.

This will cause errors. We need to decide whether or not we support int64 measures; if we do, we need to be compatible.

Allow spans to be annotated by a service name

Moving census-instrumentation/opencensus-go#604 to opencensus-specs.

Issue originally suggests:

Using the jaeger exporter it seems like the service name is fixed on creation of the Exporter struct:

exporter, err := jaeger.NewExporter(jaeger.Options{
	Endpoint:    "http://localhost:14268",
	ServiceName: "trace-demo",
})

In case we have a separate component constructing the spans on behalf of different services, it would be nice to change the service name for each span individually.

An example use case is in the context of istio. The mixer could generate spans on behalf of various services.

An obvious problem is the bundling that's happening in the exporter. A list of spans will be uploaded using whatever service name was set at that moment.

I am available to work on this.

/cc @nov1n

Suggestion to add fixed rate sampling / rate-limiting

The proposal describing sampling suggests adding support for sampling a random percentage of traffic. While simple to implement and predictable for purposes of extrapolation, this has the potential to make it more difficult to tune for the load on the tracing system. Random sampling also has problems when dealing with significant swings in traffic.

I'd like to suggest we include a description for a fixed rate sampler, where instead of sampling a percentage of the traffic, the sampler captures at most X traces per some time period. One potential java implementation could utilize something like com.google.common.util.concurrent.RateLimiter. This has the benefit of capturing a higher percentage of traces for a low throughput system, while maintaining a consistent flow of traces to the tracing system regardless of the traffic changes.
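
A hedged Go sketch of such a sampler built on golang.org/x/time/rate and the OpenCensus Go Sampler type; the limit of 5 traces per second and the span name are illustrative:

package main

import (
	"context"

	"go.opencensus.io/trace"
	"golang.org/x/time/rate"
)

// rateLimitedSampler samples at most perSecond new traces per second,
// regardless of traffic volume (illustrative, not part of the spec).
func rateLimitedSampler(perSecond float64) trace.Sampler {
	limiter := rate.NewLimiter(rate.Limit(perSecond), 1)
	return func(p trace.SamplingParameters) trace.SamplingDecision {
		// Respect an upstream "yes" decision so traces have no gaps.
		if p.ParentContext.TraceOptions.IsSampled() {
			return trace.SamplingDecision{Sample: true}
		}
		return trace.SamplingDecision{Sample: limiter.Allow()}
	}
}

func main() {
	trace.ApplyConfig(trace.Config{DefaultSampler: rateLimitedSampler(5)})

	_, span := trace.StartSpan(context.Background(), "example.org/Op")
	span.End()
}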

Design for retroactive tracing & exemplars

Through trace sampling, we might miss important traces that don't occur very frequently, for example traces leading to error conditions or high latency.

We should provide a facility for starting tracing later during request processing when we detect an error or other interesting condition. We should rate limit this at the source to avoid cascading failure.

How should we support exporting to local aggregation agents?

Some metrics backends mandate their own local agent/sidecar for aggregation. Examples of this are DataDog (dogstatsd) and CloudWatch (others?)

In these cases, we still want to allow users to take advantage of OpenCensus instrumentation in libraries (e.g. gRPC, HTTP), but we don't want OpenCensus to do in-process aggregation. Instead ideally, OpenCensus should directly send raw metrics events to the vendor agent.

Tags: we probably still want to configure which tags get sent to the vendor sidecar, and how.

We also might want to configure in-process downsampling to avoid overwhelming the agent (this is supported by the statsd protocol).

We could provide some sort of hook that stats.Record would call but just a Measurement is probably not enough to send to the vendor agent. For example, statsd needs to know what metric type you want to record and Measure does not really specify this.

Specify an exporter configuration format

Most non-library OpenCensus users will have to load some configuration to initialize the exporters. Specify a common configuration format for the exporters so that the OpenCensus community can reuse the same configuration file with different tools.

An initial proposal for the configuration is below. We can also provide parsers in each language to setup the known exporters from the configuration file.

exporters:
  prometheus: 
    addr: "localhost:9999"
  stackdriver: 
    project-id: bamboo-cloud-100
  openzipkin: 
    endpoint: "http://localhost:9411/api/v2/spans"
    hostport: "server:5454"
    name: server
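
For illustration, a hedged sketch of how such a file could be parsed in Go with gopkg.in/yaml.v2; the struct shape simply mirrors the proposal above and is not a defined API:

package main

import (
	"fmt"
	"log"

	yaml "gopkg.in/yaml.v2"
)

// exporterConfig mirrors the proposed file: a map from exporter name to its
// string-valued options (illustrative only).
type exporterConfig struct {
	Exporters map[string]map[string]string `yaml:"exporters"`
}

func main() {
	data := []byte(`
exporters:
  prometheus:
    addr: "localhost:9999"
  stackdriver:
    project-id: bamboo-cloud-100
`)
	var cfg exporterConfig
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		log.Fatal(err)
	}
	for name, opts := range cfg.Exporters {
		fmt.Println(name, opts)
	}
}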

/cc @bogdandrutu @acetechnologist @songy23 @adriancole @dinooliva @odeke-em @g-easy

tags: implement spec for HTTP propagation

While working on the interop tests across Go and Java, one of the requirements was that tag propagation be tested for both gRPC and HTTP. In Go, we only have tag propagation for gRPC, which sends tags as an encoded blob in the metadata headers (I don't see a spec here though, so perhaps also file an issue for it?).

This issue is a request for us to work out the spec for it, and I left reminders for that such as
https://github.com/census-instrumentation/opencensus-experiments/blob/c6a2b2d00eb184da894de2ae14ef5dce4ab45951/integration/go-http-client/client_test.go#L122
and I also filed an issue on the Go package census-instrumentation/opencensus-go#537

Sampling expectations and propagation when not sampled

There is a need for defined points at which a library or instrumentation should invoke the configured sampler logic, and for how/what should be propagated to any children of a disabled span.

When the sampler should be called:

  • entering a service (the initial creation of a span or context from a propagated trace context)
  • root span creation
  • user set a specific sampler for that Span (see SpanBuilder in Java)

In the event the sampler returns false and the trace context is propagated to another service, the trace id and parent span id of the span would be propagated. For example if we have services A, B and C with calls resulting in the propagation A->B->C and B is disabled by its sampler logic then the trace context received by C will have the trace id and span id that was originally sent from A to B.

Note that whether a parent service that propagated context to a service was or wasn't sampled can be used in the sampling decision of the service, but its having been sampled or not does not create a strict requirement one way or the other (unless, of course, the configured sampler logic on the service is simply to return the decoded boolean sent with the trace context).

Additionally, this defines that once a trace is enabled within a service through its sampler logic, the children of that span within the service do not check with the sampler logic, meaning that within A, B, or C there cannot be gaps in the trace.

Remove Window from View

Objective

Simplify view definitions by removing the Window, effectively making everything Cumulative. In particular, remove Interval windows.

Background

No backends we are aware of support doing anything meaningful with Interval counters. Instead, generally, you submit cumulative metrics from each task and the backend derives values for time intervals by taking the difference between the values at the start and end of the interval of interest.
Some backends support delta submissions, but these are not the same as Interval. This is really an implementation detail of the Exporter but the logical view is the same.

The only use of Interval views is in-process for z-Pages. Internally at Google, they are also used for load balancing.

The proposal is therefore to:

  • Eliminate Window as a concept from View definitions
  • Move the windowing logic into the z-Pages

This achieves a simplification of the view definition and avoids confusing users by presenting them with the Window option that almost never makes sense for them to configure themselves.

Requirements and Scale

This removes API surface area from all languages and from protobuf view definitions. We still need this functionality in z-Pages. But similar to backends, z-Pages can compute the values over intervals from the cumulative values. This is possible for sum, count and distribution aggregations.

Design Ideas

Remove Window from view definition proto and data model docs. This entails removing both Cumulative and Interval. The semantics are that Cumulative is assumed.

In stats.proto, remove Interval*. Propagate this change to code wherever intervals are used.
From all view names, remove the words "cumulative", "window", and "interval". Delete all "hour" and "minute" views (gRPC). z-Pages still need to display metrics over the last hour/minute (should we add something in between, like 15 min?). This can be implemented by z-Pages using an Exporter that keeps an hour's worth of samples and computes the change/rate by subtracting starting values from ending values. For example, consider the gRPC view of the client request latency distribution (currently called "grpc.io/client/roundtrip_latency/cumulative", renamed to "grpc.io/client/roundtrip_latency" if this design is accepted).
This is a Distribution. A distribution is just a fixed set of Counts, one for each bucket. To determine the correct bucket count for the last minute, take the value recorded a minute ago and subtract it from the current value. You might not have a recording from exactly one minute ago or exactly the current value, but you can use linear interpolation to get an approximation. A similar approach works for any count or sum (anything where a difference operation is defined). Mean can be derived from count and sum.

So in this simple scheme, we record a new sample every time our Exporter is called and remove samples older than an hour.
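
A minimal Go sketch of deriving a last-minute value from cumulative samples via linear interpolation; the sample type, function names, and values are illustrative:

package main

import (
	"fmt"
	"time"
)

// sample is one cumulative value captured by the z-Pages exporter.
type sample struct {
	at time.Time
	v  float64
}

// valueAt linearly interpolates the cumulative value at time t from
// samples sorted by time (clamping to the endpoints).
func valueAt(samples []sample, t time.Time) float64 {
	if len(samples) == 0 {
		return 0
	}
	if !t.After(samples[0].at) {
		return samples[0].v
	}
	for i := 1; i < len(samples); i++ {
		if !t.After(samples[i].at) {
			prev, next := samples[i-1], samples[i]
			frac := float64(t.Sub(prev.at)) / float64(next.at.Sub(prev.at))
			return prev.v + frac*(next.v-prev.v)
		}
	}
	return samples[len(samples)-1].v
}

func main() {
	now := time.Now()
	samples := []sample{
		{now.Add(-90 * time.Second), 1000},
		{now.Add(-30 * time.Second), 1060},
		{now, 1100},
	}
	// Value for the last minute = cumulative(now) - cumulative(now - 1m).
	lastMinute := valueAt(samples, now) - valueAt(samples, now.Add(-time.Minute))
	fmt.Printf("last-minute count: %.1f\n", lastMinute)
}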

Windows on aggregate functions

Some aggregate functions require windows. For example, something like Prometheus' Summary metric, or sketch-based approximate data structures. These cannot be subtracted so it's not possible to derive values for time intervals from cumulative values. This doesn't apply to any of the aggregation functions we currently support (all count based).

In future, if we want to add aggregate functions that require a time window, we should add the time window on the aggregation function itself. This is not the same thing as having a time window on the view since in these cases, Cumulative is usually not a valid time window (number of distinct clients seen since application start is not useful, but number of distinct clients in the last hour is). So even if we add such functions, the current Cumulative/Interval representation would probably not make sense. This design does not preclude such future aggregate functions.

Alternatives Considered

  1. Do nothing. Leave intervals and windows in place. This keeps the API and implementation complexity (interval is more complicated than cumulative) and provides very little benefit since it's only used in z-Pages.
  2. Remove Interval but keep Cumulative. Since cumulative is the only window type supported, it should just be assumed. If we need to add a window type in future, we can always restore the window type (defaulting to cumulative).

Add support for automatically detecting the monitored "resource"

Inspired by the Stackdriver monitoring monitored resources, OpenCensus should provide a way to automatically detect the resource and record this information as metric labels (or monitored resources in the case of SD) and as tracing attributes (or a monitored resource if the backend supports that).

Initially we can start with a few supported resources and add more later:

  • GCP_GCE_INSTANCE
  • GCP_GKE_CONTAINER
  • AWS_EC2_INSTANCE

Action items:

  • Write specs for all supported resources and how to detect them (environment variables, config services, etc).

Implement, as a util package/artifact, support to automatically determine these resources and use them in the exporters:

  • C++
  • Erlang
  • Go
  • Java
  • PHP
  • Python
  • Ruby

Establish pattern for before-the-fact, trace-scoped sampling

In B3 (usually Zipkin), sample-once, before-the-fact tracing is the status quo. It includes a few things:

  • a yes decision: ensures you get the full trace always
  • a no decision: often used for capacity, but sometimes for policy like "don't trace /health"
  • a deferred (null) decision: often used when IDs are pre-provisioned, implies the caller didn't export data yet

There are cases where trace-tier decisions aren't great and are being explored:

  • a proxy by accident or no other option traces everything and you want to re-evaluate
  • a component like SQL has a bug and creates 1000 spans, of which you'd like to drop 900
  • a service just doesn't want to be traced (never asked this one personally)

Span-scoped decisions can cause problems, as they can create possibly unresolvable gaps in a trace if a yes-no later turns back into a yes.

In other words, I can see cases for both trace-tier and span-tier decisions. However, brownfield deployments really rely on trace-tier decisions (trusting a decision made downstream), so it will be nice to figure out a way libraries can facilitate this generically, and safely.
