Comments (8)
Updated my GitHub repo https://github.com/ericleigh007/DurableFunctionBenchmark-Ne.git in case anything there helps with testing. I've implemented and tested the compressed object to reduce the amount of data the backend has to handle.
from durabletask-netherite.
Ping!
Any thoughts on this?
The expected behavior when using large messages is that the step where Netherite calls the EH client to send packets to Event Hubs becomes the throughput bottleneck of the application. Based on what I can see in the logs, that appears to be the case here: they show that the Event Hubs sender is taking a long time to send the packets. Specifically, the log measures how long it takes one worker to send a batch of packets to one partition and prints something like `EventHubsSender partitions/13 sent batch of 1 packets (102796 bytes) in 684.28ms, throughput=0.14MB/s`.
Because sending large packets is slow, this also affects the latency of all other operations. All messages go through the same event hub partitions, so a small message (such as a query for the orchestration status) can get stuck behind a large one.
Several factors can contribute to this "message sending bottleneck", and each can be addressed to some extent:
- The EH namespace limits total ingress (this includes everything sent to any partition in the namespace, from any worker) to 1 MB/s per throughput unit. This can be changed by purchasing additional throughput units.
- On each worker, each partition sender has some throughput limit that depends on how the EH client is implemented. The sending throughput of an individual partition sender (one worker sending packets to one partition) is what is printed in the log, e.g. `EventHubsSender partitions/13 sent batch of 1 packets (102796 bytes) in 684.28ms, throughput=0.14MB/s`. Scaling out the number of workers and the number of partitions can help here, because (a) the sending work is spread across more workers, and (b) the sending work is spread across more partitions.
I don't at the moment have useful estimates on what throughput to expect from a partition sender. Also, it is possible that there are other factors that lead to low throughput on those senders. I have some suspicions but no concrete leads yet.
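To make the numbers from the log line concrete, here is a back-of-the-envelope estimate in Python. This is purely illustrative arithmetic based on the figures quoted above (102796 bytes, 684.28 ms, 0.14 MB/s), not a measurement of Netherite itself, and the "scaling" function assumes senders scale independently, which is an idealization:

```python
# Rough estimate of per-partition send time from the observed sender throughput.
# All numbers are illustrative, taken from the log line quoted in the thread.

def send_time_seconds(total_bytes: int, throughput_mb_s: float) -> float:
    """Time for one partition sender to push `total_bytes` at a given throughput."""
    return total_bytes / (throughput_mb_s * 1024 * 1024)

# The log reported 102796 bytes at ~0.14 MB/s:
observed = send_time_seconds(102_796, 0.14)
print(f"{observed:.2f}s")  # ~0.70s, consistent with the logged 684.28ms

# Spreading the same bytes over more workers/partitions divides the work,
# assuming each sender keeps the same individual throughput (an idealization):
def scaled_time(total_bytes: int, throughput_mb_s: float, senders: int) -> float:
    return send_time_seconds(total_bytes / senders, throughput_mb_s)

print(f"{scaled_time(102_796, 0.14, 4):.2f}s")  # ~0.18s with 4 idealized senders
```

This is consistent with the suggestion above that scaling out workers and partitions helps, at least until the namespace-wide ingress cap (1 MB/s per throughput unit) becomes the binding limit.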
Thanks for the analysis. Is this:
- a limitation of the design that you'd like to remove by doing a, b, c in some future version?
- a limitation that simply means Netherite is going to be slow at handling medium/large message traffic?
- a limitation that can be mitigated by purchasing more throughput units (3 TU, 4 TU), which will definitely help, and just needs to be documented?
- a limitation that suggests Netherite should be written off for these larger-message cases?
As a somewhat related question: is there some sort of sweet-spot message size that gives great throughput, beyond which throughput falls off quickly?
Our real system processes sets of banking data, with debits and credits grouped into a balancing unit. Because operating on one of these units at a time creates contention, handling them one per orchestrator is currently problematic both for our main durable function system and for downstream systems built on more dated technology like SQL Server.
My hope was to bundle more messages into a single SQL update, but given this behavior, doing so might completely kill our excellent Netherite throughput.
On another peripheral note, my data is quite compressible, so I have built an object that can be sent to orchestrators and cuts the payload down to between 50% and even 15% of its original size. I see some sort of Compression helper in the tests of the durable framework, but I don't see it implemented in the actual runtime. My main question is whether Netherite and/or the durable function backend itself compresses the data. If so, of course, my compression scheme won't help and could even hurt.
If there is no built-in compression functionality, is there a plan to add any? Interestingly, GPT-4 thinks there is an IDurableSerialization interface that lets the user override the default serializer; alas, that's another invention of the AI, but it could be cool.
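For what it's worth, the kind of client-side compression wrapper described above is straightforward to sketch. The following Python sketch (hypothetical helper names, not Netherite's or Durable Functions' API) shows why repetitive business records like grouped debits/credits compress so well:

```python
import gzip
import json

def compress_payload(obj) -> bytes:
    """Serialize to JSON and gzip-compress before handing off to the backend."""
    return gzip.compress(json.dumps(obj).encode("utf-8"))

def decompress_payload(blob: bytes):
    """Inverse of compress_payload: decompress and deserialize."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

# A repetitive payload (many near-identical records) compresses heavily:
records = [{"account": f"ACCT-{i % 100:04d}", "type": "debit", "amount": 10.0}
           for i in range(1000)]
raw = json.dumps(records).encode("utf-8")
packed = compress_payload(records)
print(len(packed), "of", len(raw), "bytes")  # compressed size is a small fraction

assert decompress_payload(packed) == records  # lossless round trip
```

The caveat raised above is real: if the backend ever compresses payloads itself, stacking a second compression layer on top yields little benefit and adds CPU cost.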
Happy to continue the conversation, and realize this was a little scatter-brained. We can split these questions into separate places, but would also still like to understand how netherite is "intended to" (now) and "hoped to" (in the future) react to large messages.
Thanks.
Thanks, @ericleigh007, for pointing this out and giving us a repro.
The handling of large messages is definitely something we can optimize, and should, based on your experience. There is a relatively simple mechanism that I think should solve this problem (store large messages in blobs and then just send the blob address through EH). I will give this a try soon.
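The "store large messages in blobs and send only the blob address" idea is the classic claim-check pattern. Here is a minimal Python sketch of the pattern itself, with an in-memory dict standing in for blob storage; all names and the size threshold are hypothetical, and this is not the actual implementation that later shipped:

```python
import uuid

BLOB_THRESHOLD = 64 * 1024  # hypothetical cutoff; a real backend picks its own

class ClaimCheckChannel:
    """Send small payloads inline; offload large ones and send just a reference."""

    def __init__(self):
        self.blob_store = {}  # stand-in for an Azure Blob container

    def send(self, payload: bytes) -> dict:
        if len(payload) <= BLOB_THRESHOLD:
            return {"inline": payload}          # small: goes straight through EH
        blob_id = str(uuid.uuid4())
        self.blob_store[blob_id] = payload      # large: "upload" to blob storage
        return {"blob_ref": blob_id}            # only a tiny reference is sent

    def receive(self, message: dict) -> bytes:
        if "inline" in message:
            return message["inline"]
        return self.blob_store[message["blob_ref"]]  # "download" on the receiver

channel = ClaimCheckChannel()
small = channel.send(b"x" * 100)
large = channel.send(b"y" * 200_000)
assert "inline" in small and "blob_ref" in large
```

The payoff is that Event Hubs only ever carries small messages, so large payloads no longer block small ones in the same partition, at the cost of an extra blob round trip for large messages.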
The latest release 1.4.0 now addresses this issue. It contains a blob-batching optimization #275 that improves performance for cases where partitions or clients transmit medium to large amounts of data (in terms of either total size or number of messages).
Apologies that I have not been able to test this one. As usual, life and other development blocked any progress.
I still plan to do a side-by-side and report back, but it could be a week or more.
Hi @ericleigh007, I've seen several issues you created regarding Netherite - thanks for your interest in this new feature! We'd love to improve it further and were wondering if it's possible to schedule a quick chat with you to learn about your experience, usage scenario, and feedback. If yes, please share your email via a LinkedIn message. (If you don't have LinkedIn, we can figure out another way.) Thanks again!