Comments (42)
Trying to understand logs-<service.name>-*. How will Filebeat know which dataset to send it to?
If I understand this correctly, let's assume Kibana contains the APM agent which writes the service logs. These are picked up and then shipped to logs-kibana-default?
That's the idea. This would depend on something like https://github.com/elastic/integrations-dev/issues/368, with APM Server informing Filebeat (by way of Elastic Agent) of the dataset.
However, this part is not strictly necessary. As long as we have a keyword-type service.name field in the log docs, all will be well. That should work with (for example) a generic log data stream, where an ECS logger was used to inject the service.name field.
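To illustrate (a sketch only; the field values are hypothetical), an ECS-formatted log document carries service.name as a keyword field, so the service can be identified no matter which data stream the document lands in:

```python
import json

# Sketch of a minimal ECS-style log document; the values here
# (service "kibana", namespace "default") are illustrative only.
doc = {
    "@timestamp": "2020-10-01T12:00:00.000Z",
    "log.level": "info",
    "message": "request handled",
    "service.name": "kibana",  # keyword-type field injected by an ECS logger
    "data_stream": {"type": "logs", "dataset": "kibana", "namespace": "default"},
}
print(json.dumps(doc, indent=2))
```
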
from package-spec.
Today on the data streams page in Fleet we show an icon indicating which integration a data stream belongs to. If I remember correctly, this is based on our knowledge about data streams. The above causes another issue around Kibana index patterns: with this it would be possible that the type defined is metrics but the index pattern contains logs-*-* or literally anything, so the generated index pattern would be wrong.
I think we agree that we need to change the pattern that matches. There are at least two ways of doing it: let the user use any pattern, or tell Kibana through a config to build a pattern that matches. If we hand over full control to the creator of the package to define it, at one stage a user will break it. If we do it in Kibana, we can enforce that it is done correctly.
My more general thinking around the packages is:
- a: Everything under data_stream/* MUST follow the indexing strategy. This means it must always contain, in some way, the three properties type, dataset, and namespace, and support them.
- b: Everything that is free-form is supported by a package, but should land on the top level.
If we define a flag in the manifest saying that it should define a pattern for the data_stream instead of a fixed name, I would argue it still falls under "a". The reasons are as follows:
- Fleet still enforces and knows the type
- Fleet creates the index pattern for the template and knows how to deal with it for Kibana index patterns
- Fleet makes sure the namespace part exists
Having a config option in the manifest to tell Fleet to create a pattern is a bit more complex at first than your proposal of defining the index pattern directly, but I think long term it makes sure everything that is put into data_stream/* follows our indexing strategy.
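A sketch of the difference, assuming a hypothetical build_index_pattern helper on the Fleet side: with a boolean flag, Fleet still owns the type and namespace parts and only widens the dataset part.

```python
def build_index_pattern(type_: str, dataset: str, dataset_is_prefix: bool = False) -> str:
    # Hypothetical sketch of the Fleet-side logic under discussion:
    # Fleet always contributes the type and namespace parts of the
    # pattern; the manifest flag only turns the dataset into a prefix.
    dataset_part = f"{dataset}.*" if dataset_is_prefix else dataset
    return f"{type_}-{dataset_part}-*"

print(build_index_pattern("traces", "apm"))                          # traces-apm-*
print(build_index_pattern("traces", "apm", dataset_is_prefix=True))  # traces-apm.*-*
```
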
from package-spec.
@jalvz boolean is fine, as described by @ruflin. I was just trying to clear up what I meant in my previous suggestion.
from package-spec.
Two more options:
- dataset_is_index_prefix: true
- dataset_prefix: true
i.e. highlighting not just that it's dynamic, but the specific way (prefix) in which it's dynamic.
from package-spec.
Additionally throwing in dataset_is_prefix: it is shorter than dataset_is_index_prefix, and the is indicates it is a boolean and not actually a prefix.
from package-spec.
I like all the names that start with dataset_*, as you are correct that it only applies to the dataset part. I would leave out the index part, as this is about data streams, so we end up with dataset_is_prefix and dataset_prefix. The last point from @simitt, that is indicates a boolean, would make us land on dataset_is_prefix. SGTM.
from package-spec.
cc @elastic/apm-server
from package-spec.
As above, the current proposal is to have service.name as a prefix in the dataset, under the assumption that this would be ideal for RBAC or other places where we might want to identify data streams for a given APM service; we can alternatively make it a suffix if that is more conventional.
CC @ruflin
from package-spec.
In the above, we have *service.profiles and just *service. These are two different datasets? But metrics-backend_service-production and metrics-frontend_service-staging are the same?
Taking this, when you ingest data, the data stream names are created out of metrics-{service.name}_service-{data_stream.namespace}? And the dataset template would apply to metrics-*_service-*?
What are the limitations on the service name? It should not contain - for sure. It would be nice if we could potentially restrict it further to avoid accidental matches. We should test how we can make the patterns match best, be it with a prefix or postfix. The reason I prefer a postfix is that it goes from generic to specific: metrics is the most generic part, service is in the middle, and the name of the service is the most specific one concerning the mapping. But I think it will work both ways around.
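As a sketch of such a restriction (the exact character set here is an assumption, not a spec), the service name would need to be normalized before being embedded in the dataset, since - is the delimiter between type, dataset, and namespace:

```python
import re

def normalize_service_name(name: str) -> str:
    # Hypothetical normalization: lowercase, then replace "-" (the
    # indexing-strategy delimiter) and other unsafe characters with "_".
    return re.sub(r'[-\\/*?"<>|, #:]', "_", name.lower())

print(normalize_service_name("Front-End Service"))  # front_end_service
```
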
from package-spec.
Sorry, I didn't explain myself well.
staging and production in the example above are meant to be namespaces; frontend_service and backend_service are meant to be service names. There is no such thing as a hardcoded service part in the middle (in my example, that is; not that we can't introduce it if we want).
from package-spec.
In case we don't have a common pre or postfix, how will we make the index template match only these indices and not all indices?
from package-spec.
@ruflin I'm not sure I follow your last question above. I'll try to restate what we intend to do.
- APM Server will produce multiple data streams, with multiple data types: logs, metrics, and traces.
- Each data stream will be service-specific, and this will be guaranteed by including the instrumented service name in the dataset.
- We will put the service name in the same position (prefix or postfix) in the dataset for every data stream.
The last point means that we will have a "common pre or postfix", enabling index security etc.
from package-spec.
@axw What you describe is my expectation. But @jalvz gave the following example: metrics-backend_service-production, and later stated "frontend_service and backend_service are meant to be service names". Where is the pre or postfix?
from package-spec.
Sorry, this fell through the cracks. @ruflin that sounds good to me; I don't have a strong opinion on whether we should use a prefix or a postfix, or on its value.
Maybe we can go with a _service postfix for the sake of slightly better readability?
from package-spec.
service works for me. I personally prefer prefix; see arguments above:
The reason I prefer postfix is because it goes from generic to specific. So metrics is the most generic part, service in the middle and the name of the service is the most specific one concerning the mapping. But I think it will work both ways around.
What this will mean is that metrics-service_*-* basically becomes reserved by APM; anyone shipping data to these indices will have to follow your structure. This is a feature, not a bug!
@axw @simitt @graphaelli Service prefix it is?
from package-spec.
@ruflin I'm rather confused. Why do we need a service_ prefix (or _service postfix)? I don't think we do, but perhaps I'm missing something.
In #64 (comment), where I said "The last point means that we will have a 'common pre or postfix'", I was referring to the set of data streams belonging to a single service: all logs, metrics, and traces data streams for a given service will have a common prefix or postfix in the dataset name.
Going back to @jalvz's example in the description (modified slightly to remove the "_service" from service names), let's say we have two services: frontend and backend.
For backend, there might be data streams called:
- metrics-backend-production
- metrics-backend.profiles-production
- metrics-backend.apm.internal-production
- logs-backend-production
- traces-backend-production
Then for frontend:
- metrics-frontend-staging
- metrics-frontend.apm.internal-staging
- logs-frontend-staging
- traces-frontend-staging
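The point about a common position can be checked with shell-style globbing, used here as a rough stand-in for Elasticsearch index-pattern matching (the stream names are taken from the example above):

```python
from fnmatch import fnmatchcase

backend_streams = [
    "metrics-backend-production",
    "metrics-backend.profiles-production",
    "metrics-backend.apm.internal-production",
    "logs-backend-production",
    "traces-backend-production",
]

# One per-service pattern covers every data stream for "backend",
# because the service name always starts the dataset part...
assert all(fnmatchcase(s, "*-backend*-*") for s in backend_streams)
# ...and it does not leak into another service's streams:
assert not fnmatchcase("metrics-frontend-staging", "*-backend*-*")
```
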
from package-spec.
I had a quick meeting with @axw and @jalvz yesterday to clarify on why we need a pre/postfix or alternative paths. The team will follow up.
from package-spec.
Expanding slightly: @ruflin clarified that a pre/postfix is required in order to have the integration package install an index template for the documents APM produces. The alternative would be to rely on the minimal templates installed by Elasticsearch. We can't at the moment, as we have both keyword and text fields, and the minimal template would index strings as keywords.
My current thinking is that we should install our own templates anyway, so that we're not tying APM Server too tightly to Elasticsearch versions and they can be upgraded independently. I'll update the proposal doc with something concrete.
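For illustration, the kind of composable index template this implies might look like the following sketch (the index pattern and dynamic-template name are assumptions, not the actual package contents):

```python
# Sketch of a composable index template body that maps strings as
# keyword instead of relying on Elasticsearch defaults;
# "traces-apm.*-*" is an assumed pattern.
template = {
    "index_patterns": ["traces-apm.*-*"],
    "data_stream": {},
    "template": {
        "mappings": {
            "dynamic_templates": [
                {
                    "strings_as_keyword": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "keyword", "ignore_above": 1024},
                    }
                }
            ]
        }
    },
}
print(template["index_patterns"])
```
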
from package-spec.
My current thinking is that we should install our own templates anyway
You mean minimal templates for APM metrics and logs? If so, there isn't really much difference between a dataset prefix and a whole set of different types for APM data altogether?
A dataset prefix is a harmless safeguard in any case.
from package-spec.
ACTUALLY: maybe we do need our own minimal templates for all the types?
Thinking of fields like client.ip, exception.log.message, transaction.duration.histogram... they will all have wrong mappings, and things will break badly in the case of install-after-ingest?
from package-spec.
My current thinking is that we should install our own templates anyway
You mean minimal templates for apm metrics and logs? If so, there isn't really much difference between a dataset prefix and a whole set of different types for apm data altogether?
What I meant was that we should install our own dataset templates, and not rely on base templates in Elasticsearch.
... will all have wrong mappings and things will break badly in the case of install-after-ingest
I think we should just require ingest-after-install for APM. If we require running in Fleet mode, then this is implied.
from package-spec.
I'll update the proposal doc with something concrete.
I've updated the proposal. In summary, we'll have the following data streams:
- logs-<service.name>-* (application logs)
- logs-apm.error.<service.name>-* (APM error/exception logging)
- metrics-apm.<service.name>-* (application-defined metrics)
- metrics-apm.internal.<service.name>-* (APM-internal metrics)
- metrics-apm.profiling.<service.name>-* (APM profiling metrics)
- traces-apm.<service.name>-* (APM transactions and spans)
You'll notice that the first data stream lacks a prefix. That's because the logs streams will be produced by Filebeat rather than APM, and won't rely on any APM-specific mappings.
For everything else there will be a common apm. prefix on the dataset, which we can use for matching templates and in search queries.
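Checked with shell-style globbing as a rough stand-in for index patterns (the service name myapp and namespace default are placeholders):

```python
from fnmatch import fnmatchcase

streams = [
    "logs-myapp-default",                  # application logs, no apm. prefix
    "logs-apm.error.myapp-default",
    "metrics-apm.myapp-default",
    "metrics-apm.internal.myapp-default",
    "metrics-apm.profiling.myapp-default",
    "traces-apm.myapp-default",
]

# The common "apm." dataset prefix selects everything APM-owned...
apm_owned = [s for s in streams if fnmatchcase(s, "*-apm.*-*")]
# ...while the plain application-log stream is left out:
assert "logs-myapp-default" not in apm_owned
print(apm_owned)
```
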
from package-spec.
Looks like we definitely need to add support for the prefix template feature.
Trying to understand logs-<service.name>-*. How will Filebeat know which dataset to send it to?
If I understand this correctly, let's assume Kibana contains the APM agent which writes the service logs. These are picked up and then shipped to logs-kibana-default?
from package-spec.
Sounds like we are on the same page. We should now figure out what the config for this looks like in the package spec and how it is installed in Kibana correctly.
@jalvz Could you make a proposal for the package spec?
@skh @neptunian This will affect our Elasticsearch index template installation.
from package-spec.
I took a stab at this, and unless I am too dense, I don't think we need dynamic dataset names.
What we need is for the integration to install the templates with the right index patterns, so data streams are created when data is ingested.
Right now this doesn't work because the index pattern will be like type-dataset-*, and we have the service name stuck to the dataset, so e.g. apm-traces-* won't match apm-traces.frontend.
But apm-traces* would match, and then the right data streams would be created.
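The matching claim above can be sanity-checked with shell-style globbing, a rough approximation of Elasticsearch index-pattern matching (the concrete stream name is hypothetical):

```python
from fnmatch import fnmatchcase

stream = "apm-traces.frontend-production"

# The fixed pattern derived from the manifest today does not match,
# because the service name is glued onto the dataset:
assert not fnmatchcase(stream, "apm-traces-*")
# A customized pattern does:
assert fnmatchcase(stream, "apm-traces.*-*")
```
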
To this effect, I opened #102 to allow us to customize the index pattern for each dataset; then we would need Kibana to pick it up for template creation.
Let me know how that looks or if I am missing something.
from package-spec.
Interesting idea. So far the names of indices, pipelines, and everything related to data streams were predictable, based on the convention that all the names are always the same. With this, we would allow certain packages to diverge from that. I'm wondering whether this might have unexpected side effects through the additional flexibility. For example, at the moment we can easily link integrations and data streams together, as it is not a pattern but a fixed name. Not saying we should not do it, but we should think through whether this causes other issues.
For the pattern, I think it is important that we keep the naming scheme with the 3 parts, so it should be apm-traces.*-*
from package-spec.
names of indices, pipelines and everything related to data_streams was predictable based on the convention that all the names are always the same
As a matter of fact, this way pipeline names stay the same. Kibana automatically renames pipelines to type-dataset-version-name, so with dynamic dataset names, pipeline names wouldn't be known in advance and we couldn't compose them, etc. By changing just the index pattern, they are unaffected.
Regarding index names, the point of this issue is that they won't have a predictable name (one way or another), so I'm not sure I understand the concern. Can you elaborate on:
we can easily link integrations and data streams together
?
Where does that happen?
from package-spec.
Thanks @ruflin .
I am having a hard time understanding your argument; let's see:
If we hand over full control to the creator of the package to define it, at one stage a user will break it
An APM user, you mean? How does that happen if they can't edit the integration manifest files?
Everything under data_stream/* MUST follow the indexing strategy. This means, it must always contain in some way the three properties type, dataset, namespace and support it
Isn't that the case already? What is in my proposal that makes this not be the case?
Having a config option in the manifest to tell Fleet to create a pattern is a bit more complex at first instead of using your proposal around defining the index pattern
My proposal is a field in the manifest to tell Fleet to create a template with a given pattern. I guess you can call that a "config option"; is that what you mean by it? What exactly are you proposing, and how is it different from this?
Sorry, I'm really confused
from package-spec.
I think we are talking about two different users. The user I'm referring to is the one creating the package (in the APM case, @jalvz). There is a future where not only Elastic but anyone can build packages.
Let's take apm/data_stream/traces/manifest.yml as an example. What you propose would look similar to:
type: apm
elasticsearch.index_pattern: apm-metrics-*
What I propose would be:
type: apm
match_all: true  # much better name needed
In your case, Kibana would only know there is a custom index_pattern that it should add, but how will this affect the Kibana index patterns? With your option, it is also possible to set this to elasticsearch.index_pattern: foo-*. Now the template that Kibana thinks applies to apm-metrics-* actually applies to foo-*, which makes the indexing strategy fall apart.
With my example, Kibana will build the index pattern based on the above flag, which is probably something like apm-metrics.*-*. If multiple package developers use this feature, Kibana enforces that it is always implemented the same way. If there are other cases where the index pattern is important, it can use this knowledge.
Does this help?
from package-spec.
There is a future, where not only Elastic but anyone can build packages
Ah, I didn't know that. That clarifies things. In that case:
If you put foo-* in your APM metrics data stream, you are shooting yourself in the foot. There are a lot more ways to shoot yourself in the foot besides this, so I'm having a hard time seeing the value in hardcoding the index pattern.
What does match_all do then? Change the index pattern from type-dataset-* to type-dataset.*-*? In that case, users can still shoot themselves in the foot: by setting match_all: false nothing would work.
That also seems like a very APM-specific thing that would live in Kibana Fleet code. Why is match_all needed at all? Can't Kibana just change the pattern for the APM integration templates then?
from package-spec.
If you put foo-* in your apm metrics data stream you are shooting yourself on the foot. There are really a lot of more ways to shoot yourself in the foot besides this, so I'm having a hard time to see the value in hardcoding the index pattern.
IIUC, @ruflin is not suggesting that we hard-code the index pattern but that we allow package developers to define a pattern for the data set only. This would guarantee the index pattern always adheres to the indexing strategy. (Correct me if I misinterpreted you, Nicolas.)
You may still be able to shoot yourself in the foot in some other way, but I think it would be more ergonomic to define things in terms of the indexing strategy rather than a complete index pattern that has to match the indexing strategy anyway.
Is there a reason why we can't allow the top-level "dataset" property in the data stream manifest to be a pattern? e.g. in the apm package's "traces" data stream, change the dataset from apm to apm.*?
from package-spec.
#102 allows setting the index pattern for the dataset only, doesn't it?
from package-spec.
@axw Correct.
The dataset value is used for quite a few other things, like pipeline names and template names, so I'd rather not allow a pattern here.
@jalvz There is "shooting yourself in the foot" and there is "breaking the system". If you build the APM package and set match_all to false, APM will not work. But if you set a random index pattern, for example metrics-mysql-*, in a package, you break other packages, and that is the part I do not want to allow.
What match_all would do is exactly what you described above. I agree that for now this is APM-specific, but I'm pretty sure others will use this too. I'd rather not have special code for some packages in Fleet.
Is there an issue with match_all (besides that it is a really bad name)?
from package-spec.
elastic/obs-dc-team#102 allows to set the index pattern for the data set only, isn't it?
No, it matches the entire index name. What I meant was to define a pattern which matches only the part after the type and before the namespace. Fleet would take care of joining them all together to form a complete index pattern.
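In other words (a sketch, with a hypothetical helper name): the manifest would carry only the dataset-part pattern as a string, and Fleet would wrap it:

```python
def pattern_from_dataset(type_: str, dataset_pattern: str) -> str:
    # Hypothetical Fleet-side join: the manifest supplies only the
    # dataset part (which may contain a wildcard); Fleet adds the
    # type and namespace parts around it.
    return f"{type_}-{dataset_pattern}-*"

print(pattern_from_dataset("traces", "apm.*"))  # traces-apm.*-*
```
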
The dataset value is used for quite a few other things like pipeline names and template names so I rather not allow a pattern here.
@ruflin fair enough. I'd be fine with match_all (sic) if we come up with a more informative name. I haven't had any good ideas yet.
from package-spec.
I rather not have special code for some packages in Fleet.
That is what I meant: it is very APM-specific. But I think I got it this time: the index pattern will be hardcoded to one thing or the other depending on that boolean value. I'll update the related PR and file a Kibana ticket.
I'll go with match_all and wait for suggestions; I don't have better ideas myself (it is a hard concept to describe anyway).
from package-spec.
Oh, just read this after sending my reply:
What I meant is to define a pattern which matches only the part after the type
Maybe we need a Zoom after all? 😅 What I get from that is that you suggest a string that Kibana will insert in the right place, while @ruflin suggested a boolean. So string or boolean?
from package-spec.
Two additional suggestions for names, taken from the descriptions used above:
- dynamic: true: because it dynamically matches
- pattern: true: we add an additional "pattern" to the index pattern 🤔
There must be a better name out there.
from package-spec.
The default pattern also matches "dynamically", doesn't it? 😅
tbh, for me this is more an alternative_pattern, or partial_dataset_matching_pattern, or something like that. I don't think something as specific and nuanced as this can be summarized in a single word.
But I'm fine with any name, because I intend to document what it actually means anyway :)
from package-spec.
@jalvz Could you write the proposed doc for it here? The reason I'm asking is that often the "correct" name can be deduced from how we document and describe it. BTW, I don't see an issue with a pretty long key as long as it is descriptive of what it does.
from package-spec.
Sure, it is in the description of the field: https://github.com/elastic/package-spec/pull/102/files#diff-8096817edd596c8787f8d5efb39c5c1a1555c016b5fccc30f30eced78864d494R115
I intend to add the same as a comment when I add it to the apm package.
from package-spec.
dynamic: true: Because it dynamically matches
The default pattern also matches "dynamically", isn't it?
The default pattern matches a dynamic index but not a dynamic dataset. I suggested something similar in the open PR, calling it dataset_dynamic (which I believe would be aligned with the comments from both of you, @ruflin and @jalvz).
from package-spec.
Done, closing.
from package-spec.