[QUERY] Azure Event Hub Trigger reading duplicate records after increasing the partition count on Event Hub (azure-sdk-for-net, CLOSED)

ch798543 commented on July 22, 2024
[QUERY] Azure Event Hub Trigger reading duplicate records after increasing the partition count on Event Hub

from azure-sdk-for-net.

Comments (17)

github-actions commented on July 22, 2024

Thank you for your feedback. Tagging and routing to the team member best able to assist.


jsquire commented on July 22, 2024

Hi @ch798543. Thanks for reaching out and we regret that you're experiencing difficulties. There's not enough information available to comment on why you're seeing the behavior. A few questions:

  • When you say that you "changed the partition count", can you clarify what you mean? Did you delete and recreate the hub or dynamically increase the partitions against the existing hub?
  • Did you restart your Function after changing partitions or just leave it running?
  • Did you make any changes to your Function configuration?

To answer your questions:

  1. Unless you deleted/recreated the Event Hub, physically removed data from the Azure Blob storage container, changed configuration to use a new Blob container, or changed configuration to use a new consumer group, then your existing checkpoints would continue to exist and to be valid. Any one of those conditions, however, would invalidate your checkpoints.

  2. No. You will see 1-2 batches of events potentially duplicated between instances each time your Function scales up/down as partition ownership changes. You will also see duplication due to rewinds during scaling, as there is no coordinated hand-off when ownership changes and you cannot assume the old owner wrote a checkpoint that the new owner sees. The new owner will rewind to the last checkpoint written. (more information)

  3. By default, Azure SDK logs are emitted by your Function into your AppInsights instance. (see: docs) Sharing a 5-minute slice of logs from the "Azure-Messaging-EventHubs" source at the time you observed the behavior would help to understand what the client saw and was reacting to. Logs should be filtered to these events.


github-actions commented on July 22, 2024

Hi @ch798543. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.


ch798543 commented on July 22, 2024

Thanks @jsquire for the response.

Find below my response inline
When you say that you "changed the partition count", can you clarify what you mean? Did you delete and recreate the hub or dynamically increase the partitions against the existing hub?
-- We updated the partition count from the Azure Portal by going to the Configuration settings. We did not delete or recreate the hub.

Did you restart your Function after changing partitions or just leave it running?
-- We restarted the Function app after the partition count changes

Did you make any changes to your Function configuration?
-- Yes, we updated the below settings in host.json
maxEventBatchSize : 2048 (existing value - 60)
batchCheckpointFrequency : 1
prefetchCount : 4096 (existing value - 60)


ch798543 commented on July 22, 2024

@jsquire I came across another setting in my Function's host.json file, i.e., initialOffsetOptions.
We did not have the initialOffsetOptions property added, and the default value is fromStart.
Does that mean that when I increased the partition count from 16 to 64, all of the new partitions started reading messages from the start of the stream?


jsquire commented on July 22, 2024

> -- We restarted the Function app after the partition count changes

That would have caused listeners to stop and each partition would restart from the last checkpoint written.

> batchCheckpointFrequency : 1

What was this previously? This value would indicate that a checkpoint was written after each Function invocation. That doesn't sound like the behavior that you're seeing.

> Does that mean that when I increased the partition count from 16 to 64, all of the new partitions started reading messages from the start of the stream?

That means that if there was no checkpoint found, processing for a partition would start at the beginning. The question that we need to answer is "why was there no checkpoint found?"

ch798543 commented on July 22, 2024

@jsquire Thanks for your reply

Please find my response inline

batchCheckpointFrequency : 1
What was this previously? This value would indicate that a checkpoint was written after each Function invocation. That doesn't sound like the behavior that you're seeing.
-- Earlier it was set to 4; we changed it to 1 when we started seeing the duplicate event processing issue.
What is the recommended value to be used for batchCheckpointFrequency?


jsquire commented on July 22, 2024

> Earlier it was set to 4; we changed it to 1 when we started seeing the duplicate event processing issue.

Given your batch size setting (60), this would mean that you were checkpointing at most every 240 events. Depending on how many events each read was actually able to pull in, this may have been fewer. Under normal circumstances, I'd expect to see a rewind of between 0 and 300 events every time there was a scaling operation or a host migration in Functions. (The exact number is non-deterministic and depends on state at the time of the change.)
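The arithmetic above can be sketched as follows. This is a rough illustration only: the function names are hypothetical, and treating the upper bound as "the events since the last checkpoint plus one further batch in flight" is an assumption made here to reproduce the 240 and 300 figures, not a statement about how the extension computes anything.

```python
def max_events_between_checkpoints(max_batch_size: int, checkpoint_frequency: int) -> int:
    """Upper bound on events handed to the Function between two checkpoint writes."""
    return max_batch_size * checkpoint_frequency


def worst_case_rewind(max_batch_size: int, checkpoint_frequency: int) -> int:
    """Assumed worst case: a full checkpoint interval plus one extra batch
    that was in flight when partition ownership changed."""
    return max_events_between_checkpoints(max_batch_size, checkpoint_frequency) + max_batch_size


# Values taken from the thread: batch size 60, batchCheckpointFrequency 4.
interval = max_events_between_checkpoints(60, 4)
rewind = worst_case_rewind(60, 4)
print(interval, rewind)
```

With a batch size of 2048 and the same frequency, the same formula gives an interval of 8192 events, which is why larger batches make each rewind more visible.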

> What is the recommended value to be used for batchCheckpointFrequency?

There is none. It's a question that each application needs to determine. If you checkpoint more frequently, processing will be slower, but you'll see smaller rewinds and fewer duplicates. If you checkpoint less frequently, you'll see higher throughput, but bigger rewinds and more duplicates when scaling/migrations happen. That's your trade-off.

In either case, it's important to keep the Event Hubs at-least-once guarantee in mind. There will be some number of duplicate events possible, no matter what you do. Your application must be tolerant of duplicates and should be idempotent when processing. The question that I'd ask myself is how expensive it is for your application to process events and whether you want to be able to process more events quickly or guard against having to deal with duplicates.
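One common way to tolerate duplicates is to skip any event whose per-partition sequence number has already been handled. The sketch below is a minimal, in-memory illustration of that idea; `Event` and `IdempotentProcessor` are hypothetical types invented for this example and are not part of the Azure SDK. It assumes events within a partition arrive in sequence-number order, and a real application would need to keep this state somewhere durable so that it survives restarts.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for an Event Hubs event."""
    partition_id: str
    sequence_number: int
    body: str


@dataclass
class IdempotentProcessor:
    """Skips events that were already handled for a given partition."""
    highest_seen: Dict[str, int] = field(default_factory=dict)
    processed: List[str] = field(default_factory=list)

    def handle(self, event: Event) -> bool:
        """Process the event once; return False if it was a duplicate."""
        last = self.highest_seen.get(event.partition_id, -1)
        if event.sequence_number <= last:
            return False  # replayed after a rewind, so ignore it
        self.processed.append(event.body)  # the real work would happen here
        self.highest_seen[event.partition_id] = event.sequence_number
        return True


# Simulate a rewind: events 1 and 2 are delivered a second time.
processor = IdempotentProcessor()
for n in (0, 1, 2, 1, 2, 3):
    processor.handle(Event("0", n, f"e{n}"))
print(processor.processed)
```

Tracking only the highest sequence number per partition keeps the state small, at the cost of relying on in-order delivery within a partition.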

ch798543 commented on July 22, 2024

@jsquire Thanks for your reply

I've updated my host.json settings as follows, resulting in fewer duplicates:

maxEventBatchSize: 2048
batchCheckpointFrequency: 1
prefetchCount: 4096

incoming events - approx 2000 per second

Previously, the following settings were causing numerous duplicate failures:

maxEventBatchSize: 2048
batchCheckpointFrequency: 4
prefetchCount: 4096
incoming events - approx 2000 per second

Regarding the batchCheckpointFrequency, does a higher count mean we'll encounter more duplicates even without any scaling operations? Additionally, you mentioned host migration in Functions. Does this occur automatically?


jsquire commented on July 22, 2024

We generally advise at least a 3:1 ratio between prefetch count and batch size, though that will vary by application. If you're seeing your Function invoked consistently with your requested batch size, all is well. If you're consistently seeing fewer than that, you'll want to bump up prefetch.
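As a quick check of that guidance against the settings posted in this thread, the ratio can be computed directly. The helper names below are hypothetical, and the 3:1 threshold is only the general advice quoted above, not a hard rule.

```python
def prefetch_ratio(prefetch_count: int, max_batch_size: int) -> float:
    """Ratio of prefetched events to the requested batch size."""
    return prefetch_count / max_batch_size


def meets_suggested_ratio(prefetch_count: int, max_batch_size: int,
                          minimum_ratio: float = 3.0) -> bool:
    """True when prefetch is at least `minimum_ratio` times the batch size."""
    return prefetch_ratio(prefetch_count, max_batch_size) >= minimum_ratio


# Settings from the thread: prefetchCount 4096, maxEventBatchSize 2048.
print(prefetch_ratio(4096, 2048))         # 2.0
print(meets_suggested_ratio(4096, 2048))  # False
```

By this measure the posted settings sit at 2:1, so whether the Function is actually being invoked with full batches is the thing to watch before raising prefetch.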

> Regarding the batchCheckpointFrequency, does a higher count mean we'll encounter more duplicates even without any scaling operations?

I can't answer this with any accuracy, as it depends on knowledge of how the Functions infrastructure manages where a Function lives and how often it moves around. This is outside of the insight and influence of the Event Hubs extension package.

The best that I can do is "maybe". The setting translates to "after how many batches are sent to your function should we write a checkpoint?" Your previous value, 4, meant "write a checkpoint after you call my Function 4 times." Your current value will checkpoint after each invocation of the Function.
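To make the effect of that setting concrete, here is a small simulation of how many events a new partition owner would replay if ownership changed after a given number of invocations. It is a deliberately simplified model with a hypothetical function name: it assumes every batch is full, that checkpoints are written immediately after every Nth invocation, and that the new owner resumes exactly from the last checkpoint.

```python
def events_replayed_after_handoff(batch_size: int,
                                  checkpoint_frequency: int,
                                  batches_before_handoff: int) -> int:
    """Count events processed by the old owner but not yet checkpointed."""
    processed = 0
    checkpointed = 0
    for invocation in range(1, batches_before_handoff + 1):
        processed += batch_size
        # A checkpoint is written after every `checkpoint_frequency` invocations.
        if invocation % checkpoint_frequency == 0:
            checkpointed = processed
    # The new owner rewinds to the last checkpoint and sees these again.
    return processed - checkpointed


# Batch size 2048, ownership changes after 7 invocations.
print(events_replayed_after_handoff(2048, 4, 7))  # 6144
print(events_replayed_after_handoff(2048, 1, 7))  # 0
```

In this model a frequency of 1 replays nothing, while a frequency of 4 can replay up to three full batches, which is consistent with fewer duplicates being reported after the change from 4 to 1.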

> Additionally, you mentioned host migration in Functions. Does this occur automatically?

Same answer - it depends and relies on platform knowledge that is outside the scope of the Azure SDK package.

That said, I would expect so. As with any orchestrator, Functions may rebalance work and move apps/instances around to accommodate load, outages, rolling updates/patches, recovery from crashes/errors, or just because it feels like it. I would expect that they try to limit this, as it would cause slowdowns for most trigger types, but I don't have that insight.

ch798543 commented on July 22, 2024

@jsquire What is the typical percentage of duplicates that are considered acceptable?


jsquire commented on July 22, 2024

> @jsquire What is the typical percentage of duplicates that are considered acceptable?

  • The Event Hubs service should only return a tiny set of duplicates in rare circumstances, like crash recovery for a partition node. It's non-zero but enough of an edge case that it's not worth accounting for.

  • Host orchestration aside, duplication in a healthy application is generally related to partition ownership changes. These are generally related to scaling: the more you scale up/down, the more frequently you'll see rollbacks of 1-2 batches for a partition.

  • Host orchestration moves in a healthy system are entirely dependent on the host platform. No way that I can accurately comment on this case for "normal".

  • Duplication in a non-healthy system is unpredictable and tied to crashes, error recovery, network failures, and so on.

github-actions commented on July 22, 2024

Hi @ch798543, we're sending this friendly reminder because we haven't heard back from you in 7 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!
