Comments (42)
Trying to understand logs-<service.name>-*. How will Filebeat know which dataset to send it to?
If I understand this correctly, let's assume Kibana contains the APM agent which writes the service logs. These are picked up and then shipped to logs-kibana-default?
That's the idea. This would depend on something like https://github.com/elastic/integrations-dev/issues/368, with APM Server informing Filebeat (by way of Elastic Agent) of the dataset.
However, this part is not strictly necessary. As long as we have a keyword-type service.name field in the log docs, all will be well. That should work with (for example) a generic log data stream, where an ECS logger was used to inject the service.name field.
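To illustrate (a sketch only; the field values are hypothetical), an ECS-formatted log document carries service.name as a keyword field, so the service can be identified no matter which data stream the document lands in:

```python
import json

# Sketch of a minimal ECS-style log document; the values here
# (service "kibana", namespace "default") are illustrative only.
doc = {
    "@timestamp": "2020-10-01T12:00:00.000Z",
    "log.level": "info",
    "message": "request handled",
    "service.name": "kibana",  # keyword-type field injected by an ECS logger
    "data_stream": {"type": "logs", "dataset": "kibana", "namespace": "default"},
}
print(json.dumps(doc, indent=2))
```
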
from package-spec.
Today on the data streams page in Fleet we show an icon indicating which integration a data stream belongs to. If I remember correctly, this is based on our knowledge about data streams. The above causes another issue around Kibana index patterns: with this it would be possible that the type defined is metrics but the index pattern contains logs-*-* or literally anything, so the generated index pattern would be wrong.
I think we agree that we need to change the pattern that matches. There are at least two ways of doing it: let the user use any pattern, or tell Kibana through a config to build a pattern that matches. If we hand over full control to the creator of the package to define it, at one stage a user will break it. If we do it in Kibana, we can enforce that it is done correctly.
My more general thinking around the packages is:
- a: Everything under data_stream/* MUST follow the indexing strategy. This means it must always contain, in some way, the three properties type, dataset, and namespace, and support them.
- b: Everything that is free-form is supported by a package, but should land on the top level.
If we define a flag in the manifest saying that it should define a pattern for the data_stream instead of a fixed name, I would argue it still falls under "a". The reasons are as follows:
- Fleet still enforces and knows the type
- Fleet creates the index pattern for the template and knows how to deal with it for Kibana index patterns
- Fleet makes sure the namespace part exists
Having a config option in the manifest to tell Fleet to create a pattern is a bit more complex at first than your proposal of defining the index pattern directly, but I think long term it makes sure everything that is put into data_stream/* follows our indexing strategy.
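A sketch of the difference, assuming a hypothetical build_index_pattern helper on the Fleet side: with a boolean flag, Fleet still owns the type and namespace parts and only widens the dataset part.

```python
def build_index_pattern(type_: str, dataset: str, dataset_is_prefix: bool = False) -> str:
    # Hypothetical sketch of the Fleet-side logic under discussion:
    # Fleet always contributes the type and namespace parts of the
    # pattern; the manifest flag only turns the dataset into a prefix.
    dataset_part = f"{dataset}.*" if dataset_is_prefix else dataset
    return f"{type_}-{dataset_part}-*"

print(build_index_pattern("traces", "apm"))                          # traces-apm-*
print(build_index_pattern("traces", "apm", dataset_is_prefix=True))  # traces-apm.*-*
```
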
from package-spec.
@jalvz boolean is fine, as described by @ruflin. I was just trying to clear up what I meant in my previous suggestion.
from package-spec.
Two more options:
- dataset_is_index_prefix: true
- dataset_prefix: true
i.e. highlighting not just that it's dynamic, but the specific way (prefix) in which it's dynamic.
from package-spec.
Additionally throwing in dataset_is_prefix: it is shorter than dataset_is_index_prefix, and the is indicates it is a boolean and not actually a prefix.
from package-spec.
I like all the names that start with dataset_*, as you are correct that it only applies to the dataset part. I would leave out the index part, as this is about data streams, so we end up with dataset_is_prefix and dataset_prefix. The last point from @simitt, that is indicates a boolean, would make us land on dataset_is_prefix. SGTM.
from package-spec.
cc @elastic/apm-server
from package-spec.
As above, the current proposal is to have service.name as a prefix in the dataset, under the assumption that this would be ideal for RBAC or other places where we might want to identify data streams for a given APM service; we can alternatively make it a suffix if that is more conventional.
CC @ruflin
from package-spec.
In the above, we have *service.profiles and just *service. These are two different datasets? But metrics-backend_service-production and metrics-frontend_service-staging are the same?
Taking this, when you ingest data, the data stream names are created out of metrics-{service.name}_service-{data_stream.namespace}? And the dataset template would apply to metrics-*_service-*?
What are the limitations on the service name? It should not contain - for sure. It would be nice if we could potentially restrict it further to avoid accidental matches. We should test how we can make the patterns match best, be it with a prefix or postfix. The reason I prefer a postfix is that it goes from generic to specific: metrics is the most generic part, service is in the middle, and the name of the service is the most specific one concerning the mapping. But I think it will work both ways around.
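As a sketch of such a restriction (the exact character set here is an assumption, not a spec), the service name would need to be normalized before being embedded in the dataset, since - is the delimiter between type, dataset, and namespace:

```python
import re

def normalize_service_name(name: str) -> str:
    # Hypothetical normalization: lowercase, then replace "-" (the
    # indexing-strategy delimiter) and other unsafe characters with "_".
    return re.sub(r'[-\\/*?"<>|, #:]', "_", name.lower())

print(normalize_service_name("Front-End Service"))  # front_end_service
```
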
from package-spec.
Sorry, I didn't explain myself well.
staging and production in the example above are meant to be namespaces; frontend_service and backend_service are meant to be service names. There is no such thing as a hardcoded service part in the middle (in my example, that is; not that we can't introduce it if we want).
from package-spec.
In case we don't have a common pre or postfix, how will we make the index template match only these indices and not all indices?
from package-spec.
@ruflin I'm not sure I follow your last question above. I'll try to restate what we intend to do.
- APM Server will produce multiple data streams, with multiple data types: logs, metrics, and traces.
- Each data stream will be service-specific, and this will be guaranteed by including the instrumented service name in the dataset.
- We will put the service name in the same position (prefix or postfix) in the dataset for every data stream.
The last point means that we will have a "common pre or postfix", enabling index security etc.
from package-spec.
@axw What you describe is my expectation. But @jalvz gave the following example: metrics-backend_service-production, and later stated "frontend_service and backend_service are meant to be service names". Where is the pre or postfix?
from package-spec.
Sorry, this fell through the cracks. @ruflin that sounds good to me; I don't have a strong opinion on whether we should use a prefix or a postfix, or on its value.
Maybe we can go with a _service postfix for the sake of slightly better readability?
from package-spec.
service works for me. I personally prefer prefix; see arguments above:
The reason I prefer postfix is because it goes from generic to specific. So metrics is the most generic part, service in the middle and the name of the service is the most specific one concerning the mapping. But I think it will work both ways around.
What this will mean is that metrics-service_*-* basically becomes reserved by APM; anyone shipping data to these indices will have to follow your structure. This is a feature, not a bug!
@axw @simitt @graphaelli Service prefix it is?
from package-spec.
@ruflin I'm rather confused. Why do we need a service_ prefix (or _service postfix)? I don't think we do, but perhaps I'm missing something.
In #64 (comment), where I said "The last point means that we will have a 'common pre or postfix'", I was referring to the set of data streams belonging to a single service: all logs, metrics, and traces data streams for a given service will have a common prefix or postfix in the dataset name.
Going back to @jalvz's example in the description (modified slightly to remove the "_service" from service names), let's say we have two services: frontend and backend.
For backend, there might be data streams called:
- metrics-backend-production
- metrics-backend.profiles-production
- metrics-backend.apm.internal-production
- logs-backend-production
- traces-backend-production
Then for frontend:
- metrics-frontend-staging
- metrics-frontend.apm.internal-staging
- logs-frontend-staging
- traces-frontend-staging
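The point about a common position can be checked with shell-style globbing, used here as a rough stand-in for Elasticsearch index-pattern matching (the stream names are taken from the example above):

```python
from fnmatch import fnmatchcase

backend_streams = [
    "metrics-backend-production",
    "metrics-backend.profiles-production",
    "metrics-backend.apm.internal-production",
    "logs-backend-production",
    "traces-backend-production",
]

# One per-service pattern covers every data stream for "backend",
# because the service name always starts the dataset part...
assert all(fnmatchcase(s, "*-backend*-*") for s in backend_streams)
# ...and it does not leak into another service's streams:
assert not fnmatchcase("metrics-frontend-staging", "*-backend*-*")
```
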
from package-spec.
I had a quick meeting with @axw and @jalvz yesterday to clarify on why we need a pre/postfix or alternative paths. The team will follow up.
from package-spec.
Expanding slightly: @ruflin clarified that a pre/postfix is required in order to have the integration package install an index template for the documents APM produces. The alternative would be to rely on the minimal templates installed by Elasticsearch. We can't at the moment, as we have both keyword and text fields, and the minimal template would index strings as keywords.
My current thinking is that we should install our own templates anyway, so that we're not tying APM Server too tightly to Elasticsearch versions and they can be upgraded independently. I'll update the proposal doc with something concrete.
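For illustration, the kind of composable index template this implies might look like the following sketch (the index pattern and dynamic-template name are assumptions, not the actual package contents):

```python
# Sketch of a composable index template body that maps strings as
# keyword instead of relying on Elasticsearch defaults;
# "traces-apm.*-*" is an assumed pattern.
template = {
    "index_patterns": ["traces-apm.*-*"],
    "data_stream": {},
    "template": {
        "mappings": {
            "dynamic_templates": [
                {
                    "strings_as_keyword": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "keyword", "ignore_above": 1024},
                    }
                }
            ]
        }
    },
}
print(template["index_patterns"])
```
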
from package-spec.
My current thinking is that we should install our own templates anyway
You mean minimal templates for APM metrics and logs? If so, there isn't really much difference between a dataset prefix and a whole set of different types for APM data altogether?
A dataset prefix is a harmless safeguard in any case.
from package-spec.
ACTUALLY: maybe we do need our own minimal templates for all the types?
Thinking of fields like client.ip, exception.log.message, transaction.duration.histogram... they will all have wrong mappings, and things will break badly in the case of install-after-ingest?
from package-spec.
My current thinking is that we should install our own templates anyway
You mean minimal templates for apm metrics and logs? If so, there isn't really much difference between a dataset prefix and a whole set of different types for apm data altogether?
What I meant was that we should install our own dataset templates, and not rely on base templates in Elasticsearch.
... will all have wrong mappings and things will break badly in the case of install-after-ingest
I think we should just require ingest-after-install for APM. If we require running in Fleet mode, then this is implied.
from package-spec.
I'll update the proposal doc with something concrete.
I've updated the proposal. In summary, we'll have the following data streams:
- logs-<service.name>-* (application logs)
- logs-apm.error.<service.name>-* (APM error/exception logging)
- metrics-apm.<service.name>-* (application-defined metrics)
- metrics-apm.internal.<service.name>-* (APM-internal metrics)
- metrics-apm.profiling.<service.name>-* (APM profiling metrics)
- traces-apm.<service.name>-* (APM transactions and spans)
You'll notice that the first data stream lacks a prefix. That's because the logs streams will be produced by Filebeat rather than APM, and won't rely on any APM-specific mappings.
For everything else there will be a common apm. prefix on the dataset, which we can use for matching templates and in search queries.
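Checked with shell-style globbing as a rough stand-in for index patterns (the service name myapp and namespace default are placeholders):

```python
from fnmatch import fnmatchcase

streams = [
    "logs-myapp-default",                  # application logs, no apm. prefix
    "logs-apm.error.myapp-default",
    "metrics-apm.myapp-default",
    "metrics-apm.internal.myapp-default",
    "metrics-apm.profiling.myapp-default",
    "traces-apm.myapp-default",
]

# The common "apm." dataset prefix selects everything APM-owned...
apm_owned = [s for s in streams if fnmatchcase(s, "*-apm.*-*")]
# ...while the plain application-log stream is left out:
assert "logs-myapp-default" not in apm_owned
print(apm_owned)
```
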
from package-spec.
Looks like we definitely need to add support for the prefix template feature.
Trying to understand logs-<service.name>-*. How will Filebeat know which dataset to send it to?
If I understand this correctly, let's assume Kibana contains the APM agent which writes the service logs. These are picked up and then shipped to logs-kibana-default?
from package-spec.
Sounds like we are on the same page. We should now figure out what the config for this looks like in the package spec and how it is installed in Kibana correctly.
@jalvz Could you make a proposal for the package spec?
@skh @neptunian This will affect our Elasticsearch index template installation.
from package-spec.
I took a stab at this, and unless I am too dense, I don't think we need dynamic dataset names.
What we need is for the integration to install the templates with the right index patterns, so data streams are created when data is ingested.
Right now this doesn't work because the index pattern will be like type-dataset-*, and we have the service name stuck to the dataset, so e.g. apm-traces-* won't match apm-traces.frontend.
But apm-traces* would match, and then the right data streams would be created.
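The matching claim above can be sanity-checked with shell-style globbing, a rough approximation of Elasticsearch index-pattern matching (the concrete stream name is hypothetical):

```python
from fnmatch import fnmatchcase

stream = "apm-traces.frontend-production"

# The fixed pattern derived from the manifest today does not match,
# because the service name is glued onto the dataset:
assert not fnmatchcase(stream, "apm-traces-*")
# A customized pattern does:
assert fnmatchcase(stream, "apm-traces.*-*")
```
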
To this effect, I opened #102 to allow us to customize the index pattern for each dataset; then we would need Kibana to pick it up for template creation.
Let me know how that looks or if I am missing something.
from package-spec.
Interesting idea. So far the names of indices, pipelines, and everything related to data streams were predictable, based on the convention that all the names are always the same. With this, we would allow certain packages to diverge from that. I'm wondering whether this might have unexpected side effects through the additional flexibility. For example, at the moment we can easily link integrations and data streams together, as it is not a pattern but a fixed name. Not saying we should not do it, but we should think through whether this causes other issues.
For the pattern, I think it is important that we keep the naming scheme with the 3 parts, so it should be apm-traces.*-*
from package-spec.
names of indices, pipelines and everything related to data_streams was predictable based on the convention that all the names are always the same
As a matter of fact, this way pipeline names stay the same. Kibana automatically renames pipelines to type-dataset-version-name, so with dynamic dataset names, pipeline names wouldn't be known in advance and we couldn't compose them, etc. By changing just the index pattern, they are unaffected.
Regarding index names, the point of this issue is that they won't have a predictable name (one way or another), so I'm not sure I understand the concern. Can you elaborate on:
we can easily link integrations and data streams together
?
Where does that happen?
from package-spec.
Thanks @ruflin .
I am having a hard time understanding your argument; let's see:
If we hand over full control to the creator of the package to define it, at one stage a user will break it
An APM user, you mean? How does that happen if they can't edit the integration manifest files?
Everything under data_stream/* MUST follow the indexing strategy. This means, it must always contain in some way the three properties type, dataset, namespace and support it
Isn't that the case already? What is in my proposal that makes this not be the case?
Having a config option in the manifest to tell Fleet to create a pattern is a bit more complex at first instead of using your proposal around defining the index pattern
My proposal is a field in the manifest to tell Fleet to create a template with a given pattern. I guess you can call that a "config option"; is that what you mean by it? What exactly are you proposing, and how is it different from this?
Sorry, I'm really confused
from package-spec.
I think we are talking about two different users. The user I'm referring to is the one creating the package (in the APM case, @jalvz). There is a future where not only Elastic but anyone can build packages.
Let's take apm/data_stream/traces/manifest.yml as an example. What you propose would look similar to:
type: apm
elasticsearch.index_pattern: apm-metrics-*
What I propose would be:
type: apm
match_all: true  # much better name needed
In your case, Kibana would only know there is a custom index_pattern that it should add, but how will this affect the Kibana index patterns? With your option, it is also possible to set this to elasticsearch.index_pattern: foo-*. Now the template that Kibana thinks applies to apm-metrics-* actually applies to foo-*, which makes the indexing strategy fall apart.
With my example, Kibana will build the index pattern based on the above flag, which is probably something like apm-metrics.*-*. If multiple package developers use this feature, Kibana enforces that it is always implemented the same way. If there are other cases where the index pattern is important, it can use this knowledge.
Does this help?
from package-spec.
There is a future, where not only Elastic but anyone can build packages
Ah, I didn't know that. That clarifies things. In that case:
If you put foo-* in your APM metrics data stream, you are shooting yourself in the foot. There are a lot more ways to shoot yourself in the foot besides this, so I'm having a hard time seeing the value in hardcoding the index pattern.
What does match_all do then? Change the index pattern from type-dataset-* to type-dataset.*-*? In that case, users can still shoot themselves in the foot: by setting match_all: false nothing would work.
That also seems like a very APM-specific thing that would live in Kibana Fleet code. Why is match_all needed at all? Can't Kibana just change the pattern for the APM integration templates then?
from package-spec.
If you put foo-* in your apm metrics data stream you are shooting yourself on the foot. There are really a lot of more ways to shoot yourself in the foot besides this, so I'm having a hard time to see the value in hardcoding the index pattern.
IIUC, @ruflin is not suggesting that we hard-code the index pattern but that we allow package developers to define a pattern for the data set only. This would guarantee the index pattern always adheres to the indexing strategy. (Correct me if I misinterpreted you, Nicolas.)
You may still be able to shoot yourself in the foot in some other way, but I think it would be more ergonomic to define things in terms of the indexing strategy rather than a complete index pattern that has to match the indexing strategy anyway.
Is there a reason why we can't allow the top-level "dataset" property in the data stream manifest to be a pattern? e.g. in the apm package's "traces" data stream, change the dataset from apm to apm.*?
from package-spec.
#102 allows setting the index pattern for the dataset only, doesn't it?
from package-spec.
@axw Correct.
The dataset value is used for quite a few other things, like pipeline names and template names, so I'd rather not allow a pattern here.
@jalvz There is "shooting yourself in the foot" and there is "breaking the system". If you build the APM package and set match_all to false, APM will not work. But if you set a random index pattern, for example metrics-mysql-*, in a package, you break other packages, and that is the part I do not want to allow.
What match_all would do is exactly what you described above. I agree that for now this is APM-specific, but I'm pretty sure others will use this too. I'd rather not have special code for some packages in Fleet.
Is there an issue with match_all (besides that it is a really bad name)?
from package-spec.
elastic/obs-dc-team#102 allows to set the index pattern for the data set only, isn't it?
No, it matches the entire index name. What I meant was to define a pattern which matches only the part after the type and before the namespace. Fleet would take care of joining them all together to form a complete index pattern.
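In other words (a sketch, with a hypothetical helper name): the manifest would carry only the dataset-part pattern as a string, and Fleet would wrap it:

```python
def pattern_from_dataset(type_: str, dataset_pattern: str) -> str:
    # Hypothetical Fleet-side join: the manifest supplies only the
    # dataset part (which may contain a wildcard); Fleet adds the
    # type and namespace parts around it.
    return f"{type_}-{dataset_pattern}-*"

print(pattern_from_dataset("traces", "apm.*"))  # traces-apm.*-*
```
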
The dataset value is used for quite a few other things like pipeline names and template names so I rather not allow a pattern here.
@ruflin fair enough. I'd be fine with match_all (sic) if we come up with a more informative name. I haven't had any good ideas yet.
from package-spec.
I rather not have special code for some packages in Fleet.
That is what I meant: it is very APM-specific. But I think I got it this time: the index pattern will be hardcoded to one thing or the other depending on that boolean value. I'll update the related PR and file a Kibana ticket.
I'll go with match_all and wait for suggestions; I don't have better ideas myself (it is a hard concept to describe anyway).
from package-spec.
Oh, just read this after sending my reply:
What I meant is to define a pattern which matches only the part after the type
Maybe we need a Zoom after all? 😅 What I get from that is that you suggest a string that Kibana will insert in the right place, while @ruflin suggested a boolean. So string or boolean?
from package-spec.
Two additional suggestions for names, taken from the descriptions used above:
- dynamic: true: because it dynamically matches
- pattern: true: we add an additional "pattern" to the index pattern 🤔
There must be a better name out there.
from package-spec.
The default pattern also matches "dynamically", doesn't it? 😅
tbh, for me this is more an alternative_pattern, or partial_dataset_matching_pattern, or something like that. I don't think something as specific and nuanced as this can be summarized in a single word.
But I'm fine with any name, because I intend to document what it actually means anyway :)
from package-spec.
@jalvz Could you write the proposed doc for it here? The reason I'm asking is that often the "correct" name can be deduced from how we document and describe it. BTW, I don't see an issue with a pretty long key as long as it is descriptive of what it does.
from package-spec.
Sure, it is in the description of the field: https://github.com/elastic/package-spec/pull/102/files#diff-8096817edd596c8787f8d5efb39c5c1a1555c016b5fccc30f30eced78864d494R115
I intend to add the same as a comment when I add it to the apm package.
from package-spec.
dynamic: true: Because it dynamically matches
The default pattern also matches "dynamically", isn't it?
The default pattern matches a dynamic index but not a dynamic dataset. I suggested something similar in the open PR, calling it dataset_dynamic (which I believe would be aligned with the comments from both of you, @ruflin and @jalvz).
from package-spec.
Done, closing.
from package-spec.