microsoft / durabletask-netherite

A new engine for Durable Functions. https://microsoft.github.io/durabletask-netherite

License: Other

C# 96.17% PowerShell 3.82% Dockerfile 0.01%

durabletask-netherite's Introduction

Netherite

Netherite is a distributed workflow execution engine for Durable Functions (DF) and the Durable Task Framework (DTFx).

It is of potential interest to anyone developing applications on those platforms who has an appetite for performance, scalability, and reliability.

As Netherite is intended to be a drop-in backend replacement, it does not modify the application API. Existing DF and DTFx applications can switch to this backend with little effort. However, we do not support migrating existing task hub contents between different backends.

Getting Started

To get started, you can either try out the sample, or take an existing DF app and switch it to the Netherite backend. You can also read our documentation.

The hello sample.

For a comprehensive quick start on using Netherite with Durable Functions, take a look at the hello sample walkthrough and the associated video content. We include several scripts that make it easy to build, run, and deploy this application, both locally and in the cloud. This sample is also a great starting point for creating your own projects.

Configure an existing Durable Functions app for Netherite.

If you have a .NET Durable Functions application already, and want to configure it to use Netherite as the backend, do the following:

  • Add the NuGet package Microsoft.Azure.DurableTask.Netherite.AzureFunctions to your functions project (if using .NET) or your extensions project (if using TypeScript or Python).
  • Add "type" : "Netherite" to the storageProvider section of your host.json. See recommended host.json settings.
  • Configure your function app to run on 64 bit, if not already the case. You can do this in the Azure portal, or using the Azure CLI. Netherite does not run on 32 bit.
  • Create an EventHubs namespace. You can do this in the Azure portal, or using the Azure CLI.
  • Configure EventHubsConnection with the connection string for the Event Hubs namespace. You can do this using an environment variable, or with a function app configuration setting.
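Putting these steps together, the relevant host.json section could look like the following sketch (the hub name is illustrative; the connection setting names shown here are the ones referenced throughout this page):

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "myTaskHub",
      "storageProvider": {
        "type": "Netherite",
        "StorageConnectionName": "AzureWebJobsStorage",
        "EventHubsConnectionName": "EventHubsConnection"
      }
    }
  }
}
```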

For more information, see the .NET sample, the Python sample, or the TypeScript sample.

Configure an existing Durable Task Application for Netherite.

If you have an application that uses the Durable Task Framework already, and want to configure it to use Netherite as the backend, do the following:

  • Create an EventHubs namespace. You can do this in the Azure portal, or using the Azure CLI.
  • Add the NuGet package Microsoft.Azure.DurableTask.Netherite to your project.
  • Update the server startup code to construct a NetheriteOrchestrationService object with the required settings, and then pass it as an argument to the constructors of TaskHubClient and TaskHubWorker.
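A minimal sketch of that startup code is shown below. This is an illustration only: the exact NetheriteOrchestrationServiceSettings properties and constructor signature should be checked against the released package, and the hub name is a placeholder.

```csharp
using DurableTask.Core;
using DurableTask.Netherite;
using Microsoft.Extensions.Logging;

// Sketch only: the settings properties shown are assumptions, not a verified API surface.
var settings = new NetheriteOrchestrationServiceSettings
{
    HubName = "myTaskHub",
    // Azure Storage and Event Hubs connection information is configured here as well.
};

var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());
var service = new NetheriteOrchestrationService(settings, loggerFactory);

// The same service instance is passed to both the worker and the client.
var worker = new TaskHubWorker(service);
var client = new TaskHubClient(service);

await worker.StartAsync();
```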

For more information, see the DTFx sample.

Why a new engine?

The default Azure Storage engine stores messages in Azure Storage queues and instance states in Azure Storage tables. It executes large numbers of small storage accesses. For example, executing a single orchestration with three activities may require a total of 4 dequeue operations, 3 enqueue operations, 4 table reads, and 4 table writes. Thus, the overall throughput quickly becomes limited by how many I/O operations Azure Storage allows per second.
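As a back-of-envelope illustration of that limit (the 2,000 operations/second figure below is an assumed Azure Storage throttling limit, used only to make the point concrete):

```csharp
using System;

// Per the example above: 4 dequeues + 3 enqueues + 4 table reads + 4 table writes.
const int opsPerOrchestration = 4 + 3 + 4 + 4;   // 15 storage operations in total
const int assumedStorageOpsPerSec = 2000;        // illustrative throttling limit (assumption)

// At that rate, storage I/O caps throughput at roughly 133 orchestrations/second.
Console.WriteLine(assumedStorageOpsPerSec / opsPerOrchestration);
```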

To achieve better performance, Netherite represents queues and partition states differently, to improve batching:

  • Partitions communicate via ordered streams, using EventHubs.
  • The state of a partition is stored using a combination of an immutable log and checkpoints, in Azure PageBlobs.

To learn more about the Netherite architecture, our VLDB 2022 paper is the best reference. There is also an earlier preprint paper on arXiv.

For some other considerations about how to choose the engine, see the documentation.

Status

The current version of Netherite is 1.5.0. Netherite supports almost all of the DT and DF APIs.

Some notable differences to the default Azure Table storage provider include:

  • Instance queries and purge requests are not issued directly against Azure Storage, but are processed by the function app. Thus, the performance (latency and throughput) of queries heavily depends on the current scale status and load of the function app. In particular, queries do not work if the function app is stopped, and may experience cold start symptoms on consumption plans.
  • Scale out of activities (not just orchestrations) is limited by the partition count configuration setting, which defaults to 12. If you need to scale out beyond 12 workers, you should increase it prior to starting your application (it cannot be changed after the task hub has been created).
  • The rewind feature is not available on Netherite.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Security

Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, including Microsoft, Azure, DotNet, AspNet, and Xamarin.

If you believe you have found a security vulnerability in any Microsoft-owned repository that meets Microsoft's definition of a security vulnerability, please report it to us at the Microsoft Security Response Center (MSRC) at https://msrc.microsoft.com/create-report. Do not report security vulnerabilities through GitHub issues.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

durabletask-netherite's People

Contributors

bachuv, davidmrdavid, dependabot[bot], jviau, lechnerc77, microsoft-github-policy-service[bot], romero027, sebastianburckhardt, surgupta-msft


durabletask-netherite's Issues

Custom orchestration status support

It doesn't seem like context.df.setCustomStatus works - at least the HTTP orchestration API doesn't return anything after attempting to set the status.

This is quite an important feature for the interactive scenarios - is it planned to be supported?

Could be a platform support issue also, I tried using it with Node.js, maybe it actually works with C#?

Handling Partial TaskHub Deletion

A TaskHub can go into a "partially deleted" state if its EventHubs are deleted, but its TaskHub data is left in storage.

This is never supposed to happen; it is a "resource management error" to delete the EventHubs without also deleting the TaskHub. That said, it is not an unlikely error, so we need to handle it in some graceful way.

In principle, I see several options:

  1. Create an error message saying that the taskhub is broken and needs to be deleted.
  2. Create a warning, delete the taskhub, and recreate it.
  3. Try to recover and recreate the EventHubs.

While 3. sounds the nicest, it is not possible to recover in all cases, since the whole architecture fundamentally relies on messages not being lost. So only the first two are real options. I suppose I prefer 1. because it is the most likely to make the user understand what went wrong and actually fix the root cause.

@davidmrdavid, do you have any suggestions on this?

Netherite with TypeScript Function fails after first yield (Emulating + Deployed)

Issue

The execution of a Durable Function in emulation mode seems to fail after the first step (yield) is executed; the state remains in the Running status.

Description

The sample project is the same setup as the one in https://github.com/microsoft/durabletask-netherite/tree/main/samples/Hello but using TypeScript.

The intent is to run the Durable Function locally, emulating the EventHub and the Storage. The corresponding settings are stored in the local.settings.json.

 "AzureWebJobsStorage": "UseDevelopmentStorage=true",
 "EventHubsConnection": "MemoryF",

The emulation of the storage is done via Azurite.

The extension bundle mechanism is removed and the settings for the extensions are placed in the host.json

"extensions": {
    "durableTask": {
      "hubName": "testnetherite",
      "UseGracefulShutdown": true,
      "storageProvider": {
        "type": "Netherite",
        "StorageConnectionName": "AzureWebJobsStorage",
        "EventHubsConnectionName": "EventHubsConnection"
      }
    }
  } 

The startup of the function executes without any errors. When the Durable orchestrator is triggered, the first call of the activity executes successfully; however, the second call is never executed. It seems that the orchestrator function is not called again.

The exact same setup works for the .NET sample.

One difference observed is that no task hub gets created in the emulated blob storage in the TypeScript case. But even when starting the function with --verbose, no error occurs at runtime.

Expected behavior

The local emulation of Netherite should work for TypeScript as it does for .NET Core.

Versions

The following versions are used:

  • Azure Functions Core Tools: 3.0.3734 (Runtime: 3.1.4.0)
  • node: 14.16.1
  • TypeScript: 4.4.3
  • Azurite: 3.14.2

Sample code

You find the project to reproduce the error here: https://github.com/lechnerc77/durablefunctionsnetherite

Netherite does not have setting to disable extended sessions

Netherite doesn't explicitly use extended sessions API, but it does provide similar behavior via cached cursors.

Durable Functions customers have a setting to disable extended sessions for performance reasons, and it would be nice if we could enforce this (i.e., in OrchestrationMessageBatch.OnReadComplete(), check whether extended sessions are disabled before using the cached value).

This is likely a blocker for out-of-process Function apps, as they need to explicitly disable cursor based approaches like extended sessions to accommodate the out-of-proc algorithm.
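A purely hypothetical sketch of the proposed check (the method name comes from the issue text; the member and helper names are invented for illustration and do not reflect the actual Netherite source):

```csharp
// Hypothetical: illustrates where the proposed guard could go, not real Netherite code.
void OnReadComplete(OrchestrationSession session)
{
    if (!this.settings.ExtendedSessionsEnabled)
    {
        // Extended sessions disabled: drop the cached cursor and replay history instead.
        session.DiscardCachedCursor();
    }
    else
    {
        // Extended sessions enabled: resume from the cached cursor as before.
        session.ResumeFromCachedCursor();
    }
}
```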

ACTION REQUIRED: Microsoft needs this private repository to complete compliance info

There are open compliance tasks that need to be reviewed for your durabletask-netherite repo.

Action required: 3 compliance tasks

To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your Microsoft GitHub organization.

Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/microsoft/repos/durabletask-netherite/compliance

  • No Service Tree mapping has been set for this repo. If this team does not use Service Tree, they can also opt-out of providing Service Tree data in the Compliance tab.
  • No repository maintainers are set. The Open Source Maintainers are the decision-makers and actionable owners of the repository, irrespective of administrator permission grants on GitHub.
  • Classification of the repository as production/non-production is missing in the Compliance tab.

You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.

If you no longer need this repository, it might be quickest to delete the repo, too.

GitHub inside Microsoft program information

More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]

FYI: current admins at Microsoft include @TedHartMS, @sebastianburckhardt, @badrishc

Dropping of corrupted message leads to stuck orchestrations

Observed issues with corrupted message today:

System.Runtime.Serialization.SerializationException: There was an error deserializing the object of type DurableTask.Netherite.Event. Start element 'c:Name' does not match end element 'c:N'. Line 1, position 488.
 ---> System.Xml.XmlException: Start element 'c:Name' does not match end element 'c:N'. Line 1, position 488.
   at System.Xml.XmlExceptionHelper.ThrowXmlException(XmlDictionaryReader reader, String res, String arg1, String arg2, String arg3)
   at System.Xml.XmlUTF8TextReader.ReadEndElement()
   at System.Xml.XmlUTF8TextReader.Read()
   at System.Xml.XmlBaseReader.ReadContentAsString()
   at System.Xml.XmlBaseReader.ReadElementContentAsString()
   at ReadTaskScheduledEventFromXml(XmlReaderDelegator , XmlObjectSerializerReadContext , XmlDictionaryString[] , XmlDictionaryString[] )
   at System.Runtime.Serialization.ClassDataContract.ReadXmlValue(XmlReaderDelegator xmlReader, XmlObjectSerializerReadContext context)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.ReadDataContractValue(DataContract dataContract, XmlReaderDelegator reader)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.InternalDeserialize(XmlReaderDelegator reader, String name, String ns, Type declaredType, DataContract& dataContract)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.InternalDeserialize(XmlReaderDelegator xmlReader, Int32 id, RuntimeTypeHandle declaredTypeHandle, String name, String ns)
   at ReadTaskMessageFromXml(XmlReaderDelegator , XmlObjectSerializerReadContext , XmlDictionaryString[] , XmlDictionaryString[] )
   at System.Runtime.Serialization.ClassDataContract.ReadXmlValue(XmlReaderDelegator xmlReader, XmlObjectSerializerReadContext context)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.ReadDataContractValue(DataContract dataContract, XmlReaderDelegator reader)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.InternalDeserialize(XmlReaderDelegator reader, String name, String ns, Type declaredType, DataContract& dataContract)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.InternalDeserialize(XmlReaderDelegator xmlReader, Int32 id, RuntimeTypeHandle declaredTypeHandle, String name, String ns)
   at ReadValueTupleOfTaskMessagestringKPngWuvwFromXml(XmlReaderDelegator , XmlObjectSerializerReadContext , XmlDictionaryString[] , XmlDictionaryString[] )
   at System.Runtime.Serialization.ClassDataContract.ReadXmlValue(XmlReaderDelegator xmlReader, XmlObjectSerializerReadContext context)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.ReadDataContractValue(DataContract dataContract, XmlReaderDelegator reader)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.InternalDeserialize(XmlReaderDelegator reader, String name, String ns, Type declaredType, DataContract& dataContract)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.InternalDeserialize(XmlReaderDelegator xmlReader, Int32 id, RuntimeTypeHandle declaredTypeHandle, String name, String ns)
   at ReadArrayOfValueTupleOfTaskMessagestringKPngWuvwFromXml(XmlReaderDelegator , XmlObjectSerializerReadContext , XmlDictionaryString , XmlDictionaryString , CollectionDataContract )
   at System.Runtime.Serialization.CollectionDataContract.ReadXmlValue(XmlReaderDelegator xmlReader, XmlObjectSerializerReadContext context)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.ReadDataContractValue(DataContract dataContract, XmlReaderDelegator reader)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.InternalDeserialize(XmlReaderDelegator reader, String name, String ns, Type declaredType, DataContract& dataContract)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.InternalDeserialize(XmlReaderDelegator xmlReader, Int32 id, RuntimeTypeHandle declaredTypeHandle, String name, String ns)
   at ReadActivityTransferReceivedFromXml(XmlReaderDelegator , XmlObjectSerializerReadContext , XmlDictionaryString[] , XmlDictionaryString[] )
   at System.Runtime.Serialization.ClassDataContract.ReadXmlValue(XmlReaderDelegator xmlReader, XmlObjectSerializerReadContext context)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.ReadDataContractValue(DataContract dataContract, XmlReaderDelegator reader)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.InternalDeserialize(XmlReaderDelegator reader, String name, String ns, Type declaredType, DataContract& dataContract)
   at System.Runtime.Serialization.XmlObjectSerializerReadContext.InternalDeserialize(XmlReaderDelegator xmlReader, Type declaredType, DataContract dataContract, String name, String ns)
   at System.Runtime.Serialization.DataContractSerializer.InternalReadObject(XmlReaderDelegator xmlReader, Boolean verifyObjectName, DataContractResolver dataContractResolver)
   at System.Runtime.Serialization.XmlObjectSerializer.ReadObjectHandleExceptions(XmlReaderDelegator reader, Boolean verifyObjectName, DataContractResolver dataContractResolver)
   --- End of inner exception stack trace ---
   at System.Runtime.Serialization.XmlObjectSerializer.ReadObjectHandleExceptions(XmlReaderDelegator reader, Boolean verifyObjectName, DataContractResolver dataContractResolver)
   at System.Runtime.Serialization.XmlObjectSerializer.ReadObject(XmlDictionaryReader reader)
   at System.Runtime.Serialization.XmlObjectSerializer.ReadObject(Stream stream)
   at DurableTask.Netherite.Serializer.DeserializeEvent(Stream stream) in C:\home\git\alt\src\DurableTask.Netherite\Util\Serializer.cs:line 58
   at DurableTask.Netherite.Packet.Deserialize[TEvent](Stream stream, TEvent& evt, Byte[] taskHubGuid) in C:\home\git\alt\src\DurableTask.Netherite\Events\Packet.cs:line 50
   at DurableTask.Netherite.FragmentationAndReassembly.Reassemble[TEvent](IEnumerable`1 earlierFragments, IEventFragment lastFragment) in C:\home\git\alt\src\DurableTask.Netherite\Util\FragmentationAndReassembly.cs:line 86
   at DurableTask.Netherite.ReassemblyState.Process(PartitionEventFragment evt, EffectTracker effects) in C:\home\git\alt\src\DurableTask.Netherite\PartitionState\ReassemblyState.cs:line 27
   at DurableTask.Netherite.EffectTracker.ProcessEffectOn(TrackedObject trackedObject) in C:\home\git\alt\src\DurableTask.Netherite\Abstractions\PartitionState\EffectTracker.cs:line 80

Implement smarter partition balancer

Currently we use EventHubsProcessor (which is part of the EH client library) to balance partitions across nodes. However, this mechanism is unaware of load inside the partitions, and simply balances the number of partitions. For example, on three nodes and 12 partitions, each node will have four random partitions. This is an issue; for example, if only three partitions are busy, we can be unlucky and all three busy partitions are placed on the same node, wasting the other two nodes.

At some point we should perhaps implement a better partition balancer. This is a non-issue for the planned K8s implementations, since they do not use EventHubs at all.

Clarity around event hub partition config

The docs state you need

  • An event hub called partitions with 1-32 partitions. We recommend 12 as a default.
  • Four event hubs called clients0, clients1, clients2 and clients3 with 32 partitions each.

Should this be

  • Four event hubs called clients0, clients1, clients2 and clients3 with always 32 partitions each.

or

  • Four event hubs called clients0, clients1, clients2 and clients3 with 32 partitions each, or whatever you set for the partitions event hub i.e. 12 as per the recommendation

Also, should these partition counts be linked in any way to the partitionCount configuration in host.json, or is that completely independent of this event hub configuration?

Publish Performance Info

Implement and run performance tests.
Include results and repro-instructions in the documentation.

Query fails under concurrent compaction

FASTER throws this exception when the begin address moves while an iteration is in progress:

FASTER.core.FasterException: Iterator address is less than log BeginAddress 81091280

Rather than failing, the iteration should tolerate the moving begin address in some way.

Support stand-alone client

Currently, clients can only be constructed as part of a full-featured OrchestrationService. However, there are situations in which it is desirable to run a client-only version, without executing orchestration steps or activities on that node, and without taking a dependency on the functions runtime.

For example, this can allow external clients to create orchestrations and retrieve their results without having to create an HttpTrigger just as a relay.

integration unit tests broken

For some reason the AF unit tests for Netherite no longer work on the latest package versions, due to mysterious TypeLoadExceptions.

Hanging on EP1

Sometimes the EventHubs partition management gets into bad states where a partition remains active on multiple nodes. This can lead to hanging, if one node holds on to the lease, while EH is trying to deliver events to another node.

So far I have only seen this on EP1; it may have something to do with TPL-induced deadlocks that are hit only when running on a single core.

Workarounds

Run on EP2 or EP3, or restart the function app.

Investigate Azure Storage performance issues on EP1 plans

It appears that when running an EP1 premium plan under moderate to high load, accesses to Azure Storage (including page and block blob reads and writes) exhibit severe performance anomalies ranging from slowness to timeouts, ultimately bringing progress to an apparent halt. The reasons are unclear; this should be investigated. Interestingly, the same problems have so far not surfaced on EP2 and EP3 plans.

Optimize latency tails caused by partition movements

It appears that rebalancing partitions (such as in response to autoscaling) can be very slow, e.g. take more than 10 seconds.

This means that clients can experience bad latency tails in such situations.

It should be possible to significantly improve this by optimizing the partition handoff process.

Taskhub versioning

Add a versioning mechanism to catch breaking changes in the taskhub data formats, so users who upgrade code and no longer read the taskhub state correctly can be warned (instead of creating havoc on the state).

Automated long haul test

Need to design, implement, and run a test that can run Netherite for a long time (> 24h) with error injection enabled.
This will demonstrate that task hubs are resilient enough for a real-world environment.

Include EventHubs tests in CI

The continuous integration Azure DevOps Pipeline currently tests only a single, limited transport scenario (MemoryF:4). We need automated tests that cover the EventHubs transport provider.

tweak for the docs

I can't find the GH Pages or else I would have put this there (or made a PR), sorry :)

In the Configuration section (Minimal), I think it would be worth adding the PremiumStorageConnectionName setting (commented out and described?) for discoverability.

Cheers!

Support query pagination

IOrchestrationService doesn't provide a paginated query condition on the interface, so DurabilityProvider provides this additional method to allow backends that do support pagination to provide that parameter.

Right now, Netherite is silently ignoring pagination.

Revealed by running the emulated Netherite provider on the test DurableEntity_ListEntitiesAsync_Paging.

Use cache-aware work item scheduling

Currently, orchestration work items are scheduled without taking into account what is currently in the cache.

To reduce thrashing when memory is an issue, we could implement a better scheduling mechanism that handles orchestrations that are in the cache first. We would still use some fairness mechanism to prevent starvation.

Remove purged keys from database

At the moment, purged instances are reset but not actually removed from the Faster KV.

We need to implement proper removal so that the database does not get bloated over time.

Implement infinite scaling for activities

Currently, the number of nodes cannot exceed the number of partitions.

For activities, we should support scaling beyond that limit.

This could be done by adding an extra EventHubs queue, or by using Http Triggers.

Update to latest FASTER

C:\home\git\durabletask-netherite\src\DurableTask.Netherite\StorageProviders\Faster\FasterKV.cs(545,35): warning CS0618: 'FasterKV<FasterKV.Key, FasterKV.Value>.Iterate(long)' is obsolete: 'Invoke Iterate() on a client session (ClientSession), or use store.Iterate overload with Functions provided as parameter' [C:\home\git\durabletask-netherite\src\DurableTask.Netherite\DurableTask.Netherite.csproj]

JSON Deserialization Issue

The Netherite backend is consistently throwing an InvalidOperationException during startup. Tracing this, I've found that the method CheckStorageFormat in the BlobManager expects to receive JSON with Pascal-case property names (e.g. UseAlternateObjectStore). However, the taskhubparameters.json it is reading contains lower camel case instead. This is what causes Newtonsoft to throw the exception.

I'm not currently changing any of the default Newtonsoft.Json serialization settings. Are there Newtonsoft.Json defaults that are expected? Should Netherite prepare for this by providing its expected serialization settings when performing serialization / deserialization?

In order to get around this issue, I've manually updated the JSON in taskhubparameters.json to be in the format expected in code. But, I am unsure if this will have any other downstream effects.
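The casing mismatch described above can be reproduced with Newtonsoft.Json directly: default settings emit Pascal-case property names, while a camel-case contract resolver (for example, one configured globally elsewhere in an app) lowercases them. This is only an illustration of how such a mismatch can arise, not a diagnosis of the actual root cause:

```csharp
using System;
using Newtonsoft.Json;
using Newtonsoft.Json.Serialization;

class TaskHubParameters
{
    public bool UseAlternateObjectStore { get; set; }
}

class Demo
{
    static void Main()
    {
        var p = new TaskHubParameters();

        // Default settings keep the C# property name as-is:
        Console.WriteLine(JsonConvert.SerializeObject(p));
        // {"UseAlternateObjectStore":false}

        // A camel-case resolver produces the lowercased form seen in the blob:
        var camel = new JsonSerializerSettings
        {
            ContractResolver = new CamelCasePropertyNamesContractResolver()
        };
        Console.WriteLine(JsonConvert.SerializeObject(p, camel));
        // {"useAlternateObjectStore":false}
    }
}
```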

Test Consumption Plan Support

The current implementation already does automatic load balancing over the connected nodes, using the EventProcessorHost mechanism provided by EventHubs. Also, there is some preliminary logic drafted in ScalingMonitor.cs for making scale recommendations. However, this is not currently hooked up to any infrastructure that performs the scaling.

Eventually I think we would like to be able to support all of the following:

  1. Provide auto-scaling for premium plans
  2. Provide auto-scaling for consumption plans
  3. Provide auto-scaling for KEDA deployments

Proposal for Enhancement of Sample Documentation - recorded walk through

Proposal: The documentation of the sample can be enhanced by a recorded walk-through showing how to move from Azure Storage to Netherite, following the steps of the "Hello" sample description.

I have recorded one that you find here: Migrate your Durable Function to Netherite.

In case you think that this adds value when referenced in the sample documentation https://github.com/microsoft/durabletask-netherite/blob/gh-pages/docs/hello-sample.md, I can prepare a pull request for this proposed enhancement.

ScaleMonitor.GetMetricsAsync() failed: Microsoft.Azure.Cosmos.Table.StorageException: Not Found

We're trying to change an existing Durable Functions App to use Netherite but seeing a lot of errors in the logs when this is deployed to Azure. None of the functions appear to be running due to the errors.

Traces & Exceptions

After the Function App starts, the NetheriteOrchestrationService is created successfully:

NetheriteOrchestrationService created, workerId=pd1sdwk00021W, processorCount=1, transport=EventHubs, storage=Faster

Then we see these for every Orchestrator, Activity, and Entity function in the app:

The listener for function '<FunctionName>' was unable to start. (Severity 3)

Then we see dozens of these:

Netherite autoscaler recommends: None from: 2 because: missing metrics (Severity 1)

Then retries of starting the listeners (Attempt 1 and 2):

Retrying to start listener for function '<FunctionName> (Attempt 1) (Severity 1)

Then

IScaleMonitor.GetMetricsAsync() failed: Microsoft.Azure.Cosmos.Table.StorageException: Not Found
   at Microsoft.Azure.Cosmos.Table.RestExecutor.TableCommand.Executor.ExecuteAsync[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken callerCancellationToken)
   at DurableTask.Netherite.Scaling.AzureTableLoadMonitor.QueryAsync(CancellationToken cancellationToken) in C:\source\durabletask-netherite\src\DurableTask.Netherite\Scaling\AzureTableLoadMonitor.cs:line 84
   at DurableTask.Netherite.Scaling.ScalingMonitor.CollectMetrics() in C:\source\durabletask-netherite\src\DurableTask.Netherite\Scaling\ScalingMonitor.cs:line 110
   at DurableTask.Netherite.AzureFunctions.NetheriteProvider.ScaleMonitor.GetMetricsAsync() in C:\source\durabletask-netherite\src\DurableTask.Netherite.AzureFunctions\NetheriteProvider.cs:line 170
Request Information
RequestID:0471c8ac-e002-0043-5d38-b58660000000
RequestDate:Wed, 29 Sep 2021 13:47:25 GMT
StatusMessage:Not Found
ErrorCode:
ErrorMessage:The table specified does not exist.
RequestId:0471c8ac-e002-0043-5d38-b58660000000
Time:2021-09-29T13:47:25.4358594Z

Followed by hundreds of these:

IScaleMonitor.GetScaleStatus() failed: System.ArgumentNullException: Buffer cannot be null. (Parameter 'buffer')
   at System.IO.MemoryStream..ctor(Byte[] buffer, Boolean writable)
   at DurableTask.Netherite.AzureFunctions.NetheriteProvider.ScaleMonitor.GetScaleStatusCore(Int32 workerCount, NetheriteScaleMetrics[] metrics) in C:\source\durabletask-netherite\src\DurableTask.Netherite.AzureFunctions\NetheriteProvider.cs:line 211 

Infra & configuration

We're running a .NET Core 3.1 Function App on an EP1 SKU with Runtime Scaling Monitoring enabled. We're also using VNETs and private endpoints.

This is the host.json:

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
        "storageProvider": {
          "type" : "Netherite",
          "partitionCount": 12,
          "StorageConnectionName": "AzureWebJobsStorage",
          "EventHubsConnectionName": "EventHubsConnection"
        },
        "hubName": "%TaskHubName%",
        "useGracefulShutdown": true,
        "LogLevelLimit": "Debug",
        "StorageLogLevelLimit": "Debug",
        "TransportLogLevelLimit": "Debug",
        "EventLogLevelLimit": "Debug",
        "WorkItemLogLevelLimit": "Debug",
        "TraceToConsole": false,
        "TraceToBlob": false
    }
  },
  "functionTimeout": "00:30:00",
  "logging": {
    "logLevel": {
      "Host.Results": "Warning",
      "Host.Executor": "Warning",
      "default": "Information",
      "Metrics": "Information",
      "Host.Triggers.DurableTask": "Warning",
      "Host.Triggers.DurableTask.AzureStorage": "Warning",
      "Function": "Warning",
      "Microsoft": "Warning",
      "Worker": "Warning",
      "DurableTask": "Warning",
      "DurableTask.AzureStorage": "Warning",
      "DurableTask.Core": "Warning",
      "System.Net.Http.HttpClient": "Information",
      "DurableTask.Netherite": "Information",
      "DurableTask.Netherite.FasterStorage": "Warning",
      "DurableTask.Netherite.EventHubsTransport": "Warning",
      "DurableTask.Netherite.Events": "Warning",
      "DurableTask.Netherite.WorkItems": "Warning"
    },
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes" : "Event"
      }
    }
  }
}

This is the csproj:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>netcoreapp3.1</TargetFramework>
    <AzureFunctionsVersion>v3</AzureFunctionsVersion>
    <StartDevelopmentStorage>false</StartDevelopmentStorage>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Azure.Security.KeyVault.Certificates" Version="4.2.0"/>
    <PackageReference Include="Azure.Security.KeyVault.Secrets" Version="4.2.0"/>
    <PackageReference Include="Azure.Storage.Queues" Version="12.8.0"/>
    <PackageReference Include="Bogus" Version="33.1.1"/>
    <PackageReference Include="IdentityModel.AspNetCore" Version="3.0.0"/>
    <PackageReference Include="Microsoft.Azure.Functions.Extensions" Version="1.1.0"/>
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.Storage" Version="4.0.5"/>
    <PackageReference Include="Microsoft.Extensions.Diagnostics.HealthChecks" Version="3.1.19"/>
    <PackageReference Include="Microsoft.Extensions.Http.Polly" Version="3.1.19"/>
    <PackageReference Include="Microsoft.NET.Sdk.Functions" Version="3.0.13"/>
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="2.5.1"/>
    <PackageReference Include="Refit" Version="5.2.4"/>
    <PackageReference Include="Refit.HttpClientFactory" Version="5.2.4"/>
    <PackageReference Include="FluentValidation" Version="10.3.3"/>
    <PackageReference Include="Microsoft.ApplicationInsights.AspNetCore" Version="2.18.0"/>
    <PackageReference Include="Dapper" Version="2.0.90"/>
    <PackageReference Include="System.Data.SqlClient" Version="4.8.2"/>
    <PackageReference Include="Dapper.SimpleCRUD" Version="2.3.0"/>
    <PackageReference Include="Azure.Identity" Version="1.4.1"/>
    <PackageReference Include="Microsoft.Azure.WebJobs.Logging.ApplicationInsights" Version="3.0.30"/>
    <PackageReference Include="Microsoft.Azure.DurableTask.Netherite.AzureFunctions" Version="0.5.0-alpha"/>
  </ItemGroup>
  <ItemGroup>
    <None Update="host.json">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
    </None>
    <None Update="local.settings.json">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
      <CopyToPublishDirectory>Never</CopyToPublishDirectory>
    </None>
  </ItemGroup>
</Project>

I don't see a table named DurableTaskPartitions, which should exist according to the docs. Also, no event hubs are created in the namespace. Any idea what the root cause is and how to solve this?

starting a new orchestration via IDurableClient times out

I have a very simple HttpTrigger that attempts to start an orchestration. However, the orchestration never starts, and the HTTP request eventually times out. The same code works as expected with the default Azure Storage backend.

public class HttpTrigger
{
    private readonly ILogger<HttpTrigger> _logger;

    public HttpTrigger(
        ILogger<HttpTrigger> logger)
    {
        _logger = logger;
    }

    [FunctionName("HttpTrigger")]
    public async Task<IActionResult> RunAsync(
        [HttpTrigger(AuthorizationLevel.Anonymous, "post", Route = "test")] HttpRequest req,
        [DurableClient] IDurableClient durableClient)
    {
        await durableClient.StartNewAsync(
            nameof(ExecutionTestOrchestration));

        return new OkObjectResult("Enqueued");
    }
}

Investigate slow EventProcessor shutdown

I have observed that, after event processor shutdown is initiated, it can take up to one minute before the first partition shuts down. The cause is not clear. We should investigate and find a way to speed it up; perhaps it is related to checkpointing within Event Hubs.

Netherite breaks due to dll load error

Observed Exception

When recovering a partition after an ungraceful shutdown, Netherite can hit this exception:

System.IO.FileNotFoundException: Could not load file or assembly 'System.Threading.Channels, Version=5.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'. The system cannot find the file specified.
File name: 'System.Threading.Channels, Version=5.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'
at DurableTask.Netherite.Faster.LogWorker.ReplayCommitLog(Int64 from, StoreWorker worker)
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at DurableTask.Netherite.Faster.LogWorker.ReplayCommitLog(Int64 from, StoreWorker worker)
at DurableTask.Netherite.Faster.StoreWorker.ReplayCommitLog(LogWorker logWorker) in C:\source\durabletask-netherite\src\DurableTask.Netherite\StorageProviders\Faster\StoreWorker.cs:line 418
at DurableTask.Netherite.Faster.FasterStorage.CreateOrRestoreAsync(Partition partition, IPartitionErrorHandler errorHandler, Int64 firstInputQueuePosition) in C:\source\durabletask-netherite\src\DurableTask.Netherite\StorageProviders\Faster\FasterStorage.cs:line 159

This effectively prevents Netherite from making any progress; the partition goes into an infinite recycle loop.

Repro

.NET on Windows, EP3 plan. The Durable Functions template from Visual Studio with zero changes, other than setting the extension settings to enable Netherite/SQL and the app settings to wire them up.

Proposed solution

Two things to consider:

  1. We need to solve the DLL load problem. There is a chance that moving to the latest dependencies could help here.
  2. We need to make the problem more observable. I am thinking of (a) forcing an early load of this DLL, not just at recovery time, and (b) making the load error more visible in the portal, for example as a function app startup failure that produces a visible message.
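
Point (a) above could be sketched roughly as follows; the class and method names are illustrative assumptions, not part of the current codebase:

```csharp
using System;
using System.Reflection;

static class NetheritePreflight
{
    // Sketch of point (a): touch a type from System.Threading.Channels during
    // startup so that a missing assembly fails fast and visibly, instead of
    // surfacing only later during commit-log replay.
    public static void EnsureDependenciesLoaded()
    {
        Assembly asm = typeof(System.Threading.Channels.Channel).Assembly;
        Console.WriteLine($"Loaded {asm.GetName().Name} {asm.GetName().Version}");
    }
}
```

Calling this from the provider's initialization path would convert the FileNotFoundException above into an immediate startup failure.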

Support recovery after events are lost

Currently, task hubs are irrecoverably compromised if Event Hubs discards the events or the event hub is deleted. This is problematic because it can happen quite easily (e.g., pausing a busy event hub for more than 24 hours).

The proposed fix is to support recovery even when the event hub, or some of its events, are missing.

Support composite connection string

Currently, the configuration requires two separate connection strings, one for storage and one for Event Hubs. This creates issues when trying to handle all DT/DF backends in a uniform way, because the other backends need only a single connection string.

To fix this, as discussed with @ConnorMcMahon, the idea is to use a composite connection string containing key-value pairs (a standard format; see https://docs.microsoft.com/en-us/dotnet/framework/data/adonet/connection-strings).

Specifically, we introduce a new configuration setting called "ConnectionString", which can be set to, for example:

  • storage=AzureWebJobsStorage;transport=EventHubsConnection
    use connection names AzureWebJobsStorage and EventHubsConnection for connecting to the respective services
    this is also the default setting
  • storage=AzureWebJobsStorage;transport=emulated
    uses the supplied storage, but emulates the event hubs
  • storage=emulated;transport=emulated
    emulates both the storage and eventhubs
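
For illustration, a host.json storageProvider section using the proposed composite setting might look like the following. This is a sketch of the proposal above, not a shipped configuration option:

```json
{
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "type": "Netherite",
        "ConnectionString": "storage=AzureWebJobsStorage;transport=EventHubsConnection"
      }
    }
  }
}
```

This would replace the current pair of StorageConnectionName and EventHubsConnectionName settings with a single value, matching how the other backends are configured.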

NetheriteProviderFactory populates ConnectionName with resolved connection string

The ConnectionName property of DurabilityProvider is used when generating the HTTP management API payloads.

By populating it with the resolved connection string instead of the connection name, we leak the connection string to clients, which should only need the connection name.

Bug identified by running Activity_Gets_HttpManagementPayload with Netherite in emulator mode.
