Comments (31)

pandaadb avatar pandaadb commented on June 2, 2024 1

Cool, I created #37 :)

Thanks!

Artur

fbaligand avatar fbaligand commented on June 2, 2024

Hi @pandaadb

I have just released version 2.2.0, with a new feature: push_previous_map_as_event.
I know this is not exactly what you describe in this issue, but you might find it interesting.

You can find a good example here:
https://github.com/logstash-plugins/logstash-filter-aggregate#example-3
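To give the idea, the config in that example looks roughly like this (simplified):

filter {
  aggregate {
    task_id => "%{country_name}"
    code => "
      map['country_name'] = event['country_name']
      map['towns'] ||= []
      map['towns'] << event['town_name']
    "
    push_previous_map_as_event => true
    timeout => 5
  }
}

As soon as a new country_name is seen, the previous aggregation map is pushed as a new event.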

fbaligand avatar fbaligand commented on June 2, 2024

Hi @pandaadb,

Several people have the same need as you.
So I think it is time to make your issue the next aggregate plugin enhancement!
I'm ready to actively help you with this enhancement (design orientation, code review, ...).

First of all, are you still interested in implementing this enhancement?

pandaadb avatar pandaadb commented on June 2, 2024

Hi @fbaligand

Sorry, I missed the update last time. I am definitely still interested in this. I have since continued working on the plugin and am using it in production (so it does seem to work).

My branch/fork is here: https://github.com/pandaadb/logstash-filter-aggregate

I have added a few extra options (which I am not sure are all needed), including:

  • Track the timeout based on a defined timestamp field rather than platform time (this is important when re-parsing old data, since otherwise all data is read within the timeout window and aggregated wrongly)
  • Track times based on an external key. This is important when e.g. using the file input: since the input can deliver different files at different times, the timestamp needs to be key-specific (so that a later file does not time out an earlier file)
  • flush_on_all_events: important because reparsing old data is too fast for events to time out (if expiry only happens on each flush call, there is a 5-second gap during which events might advance the timeout even though they shouldn't)

I'm happy to discuss what could be useful/merged, if not all of it. I have learned a bit more about Ruby since starting on this, so I am hoping I didn't produce too much of a mess :)
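To illustrate, a filter config using these options could look something like this (apart from flush_on_all_events, the option names here are only placeholders; my branch may name them differently):

filter {
  aggregate {
    task_id => "%{request_id}"
    code => "map['count'] ||= 0; map['count'] += 1"
    timeout => 900
    timeout_timestamp_field => "@timestamp"  # hypothetical: expire based on event time, not platform time
    timeout_key_field => "path"              # hypothetical: track times per source file
    flush_on_all_events => true              # check expiry on every incoming event
  }
}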

fbaligand avatar fbaligand commented on June 2, 2024

Wow, that's a lot of options :)
I will study all of them. I'm not sure all these new options are useful, but I like the initial options you suggest.

Concerning timeout_id, if I understand correctly, it is the field name that is added to the generated "timeout" event, is that it? Given your explanation, I'm not sure what you put inside it. Is it the task_id? Is it the task creation time?

Anyway, could you rebase your branch, so that all your commits come after the last master commit?

pandaadb avatar pandaadb commented on June 2, 2024

Hi,

Yeah, timeout_id is a bit useless I think, and also confusing. It forces the task_id value to be present in the timed-out event, so that the timeout event can be matched back to the id that created it.

For example:

timeout_id => "hello"
task_id => "x"

Now if an event comes in, the filter looks at the field "x", which in this example has the value "World".
When the timeout occurs, the filter will create an event:

event[timeout_id] = map[task_id]

which will end up looking like:

event["Hello"] = "World"

So now the timeout event can be matched back to the start event, since we know that the field "hello" in the timeout event represents the task_id used for the start event.

So in short: timeout_id is the field name used for the task_id that is used for aggregation.

I am not sure what rebasing means? Do you mean creating a new fork from your master and merging my branch in?

fbaligand avatar fbaligand commented on June 2, 2024

OK. I think an optional option called "timeout_task_id_field" is relevant. It would be set on the timeout event only if set in the configuration.

Then I wonder: do you have an option to say that when a task times out, the aggregation map is pushed as a new event?

Rebase is a git command. It does closely what you describe.
In short, it rebases your local master onto the remote master and then applies all the commits you made since your fork.
Later, this will allow your branch to be merged into the "logstash-plugins" master straightforwardly.

First of all, I invite you to tag your current branch.
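For example (the tag name is just an example):

git tag backup-before-rebase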

Then, you can run these commands to add the upstream remote:

git remote add upstream https://github.com/logstash-plugins/logstash-filter-aggregate.git
git remote -v
git fetch --all

Then, either rebase your code on your master or, if you prefer, on a separate branch that you create from your master.
To rebase, run:
git rebase remotes/upstream/master

Fair warning: you will certainly have conflicts to resolve :)
Each time you resolve a conflict, you can run: git rebase --continue

That said, if you're not comfortable with rebase, you could do what you suggested: clone the "logstash-plugins" master and then apply your commits onto it.

Anyway, when you merge/rebase, I think you will have some conflicts to resolve around timeout event creation, because there is already some code for that, associated with the option "push_previous_map_as_event".
Don't hesitate to ask questions if you have any.

pandaadb avatar pandaadb commented on June 2, 2024

Hi,

I will attempt the rebase later today and get back to you. About your question:

I don't have an option that says the aggregation map is pushed. Instead I have an option, timeout_code, which does the same as the regular code option, except that the code is only executed on timeout.

The timeout_code gets the aggregation map and a new event, so the user has full control over what to do with the aggregation map in a timeout situation (e.g. in my case I add a few fields that indicate the timeout and aggregate some other values from within the map, so just pushing the map as an event wasn't enough for me). I could imagine the timeout_code defaulting to simply mirroring the aggregation map, so that people who have no desire to do their own operations would simply get the map pushed as a new event.

fbaligand avatar fbaligand commented on June 2, 2024

Hi,

I like your 'timeout_code' idea, but I see it a little differently:

  • I would add a new explicit option called "push_map_as_event_on_timeout". It is a boolean option, false by default. The name is deliberately close to the brand-new option "push_previous_map_as_event".
  • if this option is enabled, when a timeout occurs, the aggregate plugin creates a new event with the full aggregation map inside.
  • and if timeout_code is defined, the aggregate plugin calls the code, passing only the event as a parameter (to allow more custom stuff).
  • this would also allow an interesting thing: making the 'timeout_code' option available to users who enable the 'push_previous_map_as_event' option.
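Put together, a config using this design could look like this (just a sketch of the proposal, nothing final):

filter {
  aggregate {
    task_id => "%{task_id}"
    code => "map['sql_duration'] ||= 0; map['sql_duration'] += event['duration']"
    push_map_as_event_on_timeout => true
    timeout => 600
    timeout_task_id_field => "task_id"  # from our earlier discussion
    timeout_code => "event['long_task'] = event['sql_duration'] > 100"
  }
}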

fbaligand avatar fbaligand commented on June 2, 2024

Hi @pandaadb,

What do you think about my previous comment?

Concerning the last 3 options you mention, I think they could be the subject of another issue/pull request.
Are you OK with that?

pandaadb avatar pandaadb commented on June 2, 2024

@fbaligand hi

If I understand you correctly, we'd want to split this into 2 phases:

  1. Add an option to push the timeout event, with 2 extra configurations: first to enable it, and second to provide (optional) code that can modify the previously created event.
  2. Timeout tracking as I described above (with timestamps being grouped by a key etc.).

I agree with you. I think the first option is the most useful. The second is quite close to my use case and maybe not as useful to others.

One other thing: what do you think about having an option that checks timeouts on each event? By default, events can only time out after 5 seconds. My use case (though triggered by reparsing old data) needed much more aggressive timeout behaviour.

Lastly, sorry I haven't gotten around to rebasing or coding anything yet :/ I hope I will get to it this week.

Thanks!

pandaadb avatar pandaadb commented on June 2, 2024

Hi,

So I have started work on this.

I did the following:

I tagged my branch as it was. Then I created an actual branch and committed it (so I don't lose my changes).

Then I rolled back to the point of event generation (without all the extra time-tracking stuff based on file keys etc.). So now I am doing the rebase and committing. Once I have merged successfully, I will make the changes so they match the properties we discussed.

Once I rebase, I assume I get your changes as well? (push map as event)

Thanks,

Artur

fbaligand avatar fbaligand commented on June 2, 2024
  • So yes, you understand correctly :)
  • Don't forget to add the new option "timeout_task_id_field" and the "last_modified" property
  • OK to add the option "flush_on_all_events" (with false as the default value)
  • Yes, rebase will get the latest changes from the "logstash-plugins/master" branch. That is precisely the goal: get the latest changes since your fork, and then apply your commits.
  • After reading it one more time, I still don't understand your new option "Track times based on an external key". Can you give a concrete example to explain?

pandaadb avatar pandaadb commented on June 2, 2024

Hi,

One question from my side: I don't know what this code of yours does:

if (@push_previous_map_as_event and !@@aggregate_maps.empty?)
  previous_map = @@aggregate_maps.shift[1].map
  event_to_yield = LogStash::Event.new(previous_map)

If you could elaborate on that (it almost looks like the flush-on-all-events).

The 'track times based on an external key' option works like this:

Scenario: you are reparsing 1 month's worth of data. The input plugin guarantees that (if single-threaded) one file's events come in in read order, so:

T1 XYZ
T2 XYZ
...

where T1 < T2 and so on. So this works OK, since T1 is always < T2 and the timeout does not occur.

However this has 2 problems:

  1. Reparsing a lot of events means that the events come in really fast (1 second of processing can cover 24 hours of log time). Say you want to expire after 15 minutes. Reparsing 1 file will never expire the events because they come in so fast (and the timer currently used is simply Time.now). For this use case we need to track the time in the event, rather than real time. Accordingly, element.creation_timestamp must be the timestamp in the event, not the time the event was seen.

So when E1 comes in, the event time is T1 (which might be 1 month in the past).

So the variable @@last_eviction_timestamp = Time.now becomes:

@@last_eviction_timestamp = event['my-timestamp-field']

  2. Input can come in out of order. The file input plugin guarantees order within one file, but it can pick files at random (e.g. parse file A from today, then file B from 1 month ago, then file C from 1 week ago, ...).

With this in mind, if element.creation_timestamp is the timestamp of the event, this logic will no longer work:

(element.creation_timestamp < min_timestamp)

because the creation_timestamp could be 1 month ago, while min_timestamp is calculated based on Time.now.

Also, it can't work to just track 1 timestamp, since files coming in (as well as multithreading, say files A and C being read at the same time) would override each other's timestamp.
So instead, I tracked the times in a map:

@@timestamp['my-unique-file-path'] = event['timestamp']

That way, the times are tracked:

  • per file path, which is unique and guaranteed to be in order
  • based on the event's timestamp rather than when the event was seen

With that approach (in addition to expiring on every event), I can reparse all the logs and expire the events.

Imagine events E1 to EX come in within 1 second, but together they cover 1 day (24 hours) of log time. I want to expire events that are 15 minutes apart. With the above approach the events can come in as fast as they want, because I am tracking the event's timestamp, which jumps 24 hours in 1 second and correctly detects expiry based on the event's timestamp.
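A rough Ruby sketch of the idea (simplified, and not the actual code from my branch):

# Track the latest event timestamp per external key (e.g. file path),
# and expire maps relative to that event time instead of Time.now.
class EventTimeTracker
  def initialize(timeout_seconds)
    @timeout = timeout_seconds
    @latest = {}  # key => latest event timestamp seen so far
  end

  # Record the event's own timestamp for its key, then decide whether a map
  # created at map_creation_time has expired relative to that event time.
  def expired?(key, event_time, map_creation_time)
    @latest[key] = event_time if @latest[key].nil? || event_time > @latest[key]
    map_creation_time < @latest[key] - @timeout
  end
end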

OK, I hope this makes sense :) If not, let me know and I will write up a more concrete example.

fbaligand avatar fbaligand commented on June 2, 2024

The code with "push_previous_map_as_event" is specifically designed for the jdbc input use case.
You can see Example 3 for an example use case: https://github.com/logstash-plugins/logstash-filter-aggregate/#example-3

It is a very specific use case, where tasks come one after the other: first all task1 events, then all task2 events, etc.
So there are no interleaved task events.
In this very specific use case, as soon as we detect a new task id, we know that the previous aggregation map can be pushed as a new event.
So in this case: no timeout, no flush, and at most one map in @@aggregate_maps.

Is this clearer for you?

pandaadb avatar pandaadb commented on June 2, 2024

Hi!

Oh that makes sense, thanks. Now I understand what the code does as well.

My implementation will then:

  1. Leave the filter alone for push_previous_map_as_event. This preserves your use case.
  2. Add "flush_on_all_events" (default false) to enable expiry on every event.
  3. Change removeExpiredMaps to react to push_previous_map_as_event as well as push_map_as_event_on_timeout (so that'll be an OR) to enable that.
  4. Add timeout_code for both cases (in 1 and 3).

So my config would set:

push_previous_map_as_event => false
push_map_as_event_on_timeout => true

because I would have multiple task_ids (so my map is always of size > 1).

Sounds like a plan :) Thanks for the clarifications!

pandaadb avatar pandaadb commented on June 2, 2024

Hi,

I think I am having git troubles.

I am now at the point where I have merged the changes I wanted to merge, so the version I have locally has:

  • timeout_task_id_field option to map the task_id into the event
  • timeout_code for both push_previous_map_as_event and push_map_as_event_on_timeout
  • tests for push_map_as_event_on_timeout

I think this is all we wanted to merge for the first version.

However, rebase wants to continue applying the other changes I made as well (tracking the timestamp on the event). I don't want to merge those in, since I would have to remove them afterwards. I believe this is because my commit tree on master looks something like:

commit V1
commit V2
rollback to V1
commit V1 again

So obviously I can skip steps 2, 3 and 4 because they cancel each other out. But I don't know how to stop rebasing after V1?

fbaligand avatar fbaligand commented on June 2, 2024

About your 4 implementation points, I fully agree with what you say.

fbaligand avatar fbaligand commented on June 2, 2024

Concerning your git problem:

  • finally, rebase is probably not the right solution. Sorry about that.
  • to keep it simple, I suggest:
    -- create a new branch from upstream/master
    -- on this branch, add each commit you want using cherry-pick: "git cherry-pick commit1 commit2 ..."
    -- commit1 and the others are the commit hashes.
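For example (the branch name is just an example):

git checkout -b clean-branch upstream/master
git cherry-pick <hash1> <hash2>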

If it's not clear or you need more help, tell me.

pandaadb avatar pandaadb commented on June 2, 2024

Hi Fabian!

Thanks for your help :) I struggled a bit because I tried to push a branch to logstash-plugins (to which I obviously have no access). I managed to fix it now, and I think the first version stands:

https://github.com/pandaadb/logstash-filter-aggregate/

What I did:

  • I copied my rebase (to the point where I wanted it) and then aborted it
  • I reverted all changes on my master and committed the rebase (which automatically merged it with your master branch)
  • Then I did a "git merge upstream/master" and it told me everything is up to date.

Would you like a pull request, or would you rather look it over first?

(Edit: the docs don't match anymore; I will update them once the code is OK.)

Thanks!

Artur

fbaligand avatar fbaligand commented on June 2, 2024

You can create a PR!
That way, I can review your changes and make comments, and if you then push further changes, they will automatically become part of the PR.

Fabien

fbaligand avatar fbaligand commented on June 2, 2024

@pandaadb release 2.3.0 is out, with timeout event generation!

pandaadb avatar pandaadb commented on June 2, 2024

Yay :) I hope it all works as expected and people enjoy the new feature! Pleasure working with you!

fbaligand avatar fbaligand commented on June 2, 2024

It was a pleasure for me too!
I have had some feedback by mail saying it works fine!

Nice news :)

fbaligand avatar fbaligand commented on June 2, 2024

Breaking news: the official Logstash documentation has just been updated with your new options and your sample!
It concerns the "master" and "5.0 beta" branches:
https://www.elastic.co/guide/en/logstash/master/index.html

pandaadb avatar pandaadb commented on June 2, 2024

That's great :)

I wonder if the documentation needs to be updated? I believe there is a breaking change in Logstash 5+ where you can no longer access the event like a hash; you need to use set or get?

For example, instead of code like:

event['bla'] = 'xyz'

it should be

event.set("bla", "xyz")

For example, in Example 2 you have:

code => "event['sql_duration'] = map['sql_duration']"

fbaligand avatar fbaligand commented on June 2, 2024

Yes, there is a breaking change in 5.0, where the Logstash event becomes a Java object and is no longer a Ruby object.
Moreover, at present, the aggregate plugin itself is not compatible with Logstash 5.0 because of that.

Anyway, it is important that the official Logstash documentation has been updated, because a lot of people only look at that documentation and never look at the GitHub site or the plugin code.

Fabien

fbaligand avatar fbaligand commented on June 2, 2024

For your information, I have just released version 2.3.1, with a new option:
timeout_tags, which lets you define tags to add to the generated event when a timeout occurs.
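For example (a minimal sketch):

filter {
  aggregate {
    task_id => "%{task_id}"
    code => "map['count'] ||= 0; map['count'] += 1"
    push_map_as_event_on_timeout => true
    timeout => 120
    timeout_tags => ['_aggregatetimeout']
  }
}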

SolomonShorser-OICR avatar SolomonShorser-OICR commented on June 2, 2024

Track timeout time based on a defined timestamp field rather than platform time (this is important when re-parsing old data since otherwise all data will be read within the timeout and aggregated wrongly)

Was this ever implemented, or is there a way to do this? I am processing some historical logs, and this feature would be extremely useful.

fbaligand avatar fbaligand commented on June 2, 2024

No, there is no such feature.
Currently, the timeout and inactivity_timeout options are based on platform time.

If you need such a feature, I invite you to open a specific issue.

SolomonShorser-OICR avatar SolomonShorser-OICR commented on June 2, 2024

Ok, I'll open an issue.
