
Unexplainable "Non-Deterministic workflow detected" errors after migrating in-process functions to isolated mode (durabletask-dotnet, closed, 30 comments)

Schmaga avatar Schmaga commented on June 4, 2024 1
Unexplainable "Non-Deterministic workflow detected" errors after migrating in-process functions to isolated-mode


Comments (30)

Schmaga avatar Schmaga commented on June 4, 2024 2

I will observe it a little more, but I think we had a breakthrough here! I deployed the workaround to our testing system, and after some promising initial tests, I felt brave and lucky and deployed the fix to production as well, because it was low-risk. I can confirm that since then there has not been a single further occurrence of the error. I guess that was it.

Here is the workaround code (taken and modified from your link, thanks for that):

using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Middleware;
using Microsoft.Extensions.Configuration.AzureAppConfiguration;

public class AppConfigurationRefreshMiddleware : IFunctionsWorkerMiddleware
{
    private readonly IConfigurationRefresherProvider _configurationRefresherProvider;

    public AppConfigurationRefreshMiddleware(IConfigurationRefresherProvider configurationRefresherProvider)
    {
        _configurationRefresherProvider = configurationRefresherProvider;
    }

    public async Task Invoke(FunctionContext context, FunctionExecutionDelegate next)
    {
        // Skip the configuration refresh for orchestrator invocations, so that replayed
        // orchestrator code never awaits a non-durable task before its first DF API call.
        if (IsOrchestrationTrigger(context))
        {
            await next(context);

            return;
        }

        var refresher = _configurationRefresherProvider.Refreshers.FirstOrDefault();

        if (refresher != null)
        {
            await refresher.RefreshAsync(context.CancellationToken);
        }

        await next(context);
    }

    private static bool IsOrchestrationTrigger(FunctionContext context)
    {
        return context.FunctionDefinition.InputBindings.Values.Any(
            binding => string.Equals(binding.Type, "orchestrationTrigger"));
    }
}
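
For reference, a minimal registration sketch for this middleware, assuming a standard isolated-worker Program.cs (the actual startup code is not shown in this thread, and the Azure App Configuration wiring is only hinted at in the comments):

using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.Hosting;

var host = new HostBuilder()
    .ConfigureFunctionsWorkerDefaults(worker =>
    {
        // Runs for every invocation; the middleware itself skips orchestrator triggers.
        worker.UseMiddleware<AppConfigurationRefreshMiddleware>();
    })
    // The Azure App Configuration provider and refresher must also be registered
    // (omitted here) so that IConfigurationRefresherProvider can be resolved.
    .Build();

await host.RunAsync();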

I guess this workaround works fine for us until you can build a proper fix. Out of curiosity, do you have tests that contain middleware as part of the pipeline? If yes, what makes our little middleware different from that?

I would also suggest updating the docs with a note that the current DF version might break with custom middleware in the pipeline. Having a usable middleware pipeline was one of the major advantages of the isolated-mode functions imho, and this will probably break a lot of people's integrations in the future if they don't know about it.

Again, thanks for being so responsive and supportive on this issue. This was a tough one.


Schmaga avatar Schmaga commented on June 4, 2024 2

Update: It has been a few days without any incident. I can only conclude that the workaround in the middleware helps; the error is gone. I can work with that. Thanks for your support, and hopefully fixing this in the long run on your side won't be too much of a hassle.


Schmaga avatar Schmaga commented on June 4, 2024 1

@davidmrdavid


jviau avatar jviau commented on June 4, 2024 1

Yeah, nothing concerning with that versioning strategy.

Looking at the history and code, we are failing on the very first action taken. There is no parallelism, no branching, nothing tricky about that part of the code, it is a plain ol' activity call. Why are we failing there though? To me, this sounds like history is becoming corrupted somehow. Where or how I do not know. Is it happening in the backend persistence layer? Is it happening in some layer in memory? How exactly is it becoming corrupted?

@davidmrdavid if your logging PR doesn't already include it, it would be helpful to have a mode to essentially print out a history diff at time of failure. What history do we have loaded, and what action(s) has the orchestrator taken so far (including the one that caused the non-determinism failure)?


davidmrdavid avatar davidmrdavid commented on June 4, 2024 1

@nilsmehlhorn, @Schmaga:

I'm pretty close to having a package to share. If all goes well, I might have something to share tomorrow (Friday) or Monday at the latest.


davidmrdavid avatar davidmrdavid commented on June 4, 2024 1

Hi @Schmaga:

I've created a general tracking issue for this bug here: #158
Since we investigated and found a workaround for your case, I'll be closing this thread to centralize communications on that new tracking issue. Please subscribe to it for updates, and thanks for your help again :)


jviau avatar jviau commented on June 4, 2024

@Schmaga what is the cancellation token here? Is it ever non-default? Also, what else is your orchestration doing if anything? You shared a single function. Is the orchestration performing other work? In parallel at all?

On a different point, how long does this orchestration live for? The above pattern looks unbounded to me, so I suggest employing the continue-as-new pattern here (a minimal sketch follows the list below):

  1. Call your activity. Return if non-null
  2. If null, wait for 10 minutes
  3. Continue-as-new
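
A minimal sketch of that suggestion with the isolated Durable SDK (the activity name and return type here are placeholders, not the reporter's actual code):

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class PollForResultOrchestrator
{
    [Function(nameof(PollForResultOrchestrator))]
    public static async Task<string?> RunAsync([OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // 1. Call the activity; return as soon as it yields a result.
        string? result = await context.CallActivityAsync<string?>("CheckForResultActivity");
        if (result is not null)
        {
            return result;
        }

        // 2. Nothing yet: wait 10 minutes on a durable timer.
        await context.CreateTimer(TimeSpan.FromMinutes(10), CancellationToken.None);

        // 3. Restart with a clean history instead of looping in place.
        context.ContinueAsNew();
        return null;
    }
}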


davidmrdavid avatar davidmrdavid commented on June 4, 2024

Another question: do you have a monitoring extension called Dynatrace enabled for your app in Azure? In the past, I've seen Dynatrace users experience non-determinism errors at random intervals (probably due to thread instrumentation?)


Schmaga avatar Schmaga commented on June 4, 2024

@Schmaga what is the cancellation token here? Is it ever non-default? Also, what else is your orchestration doing if anything? You shared a single function. Is the orchestration performing other work? In parallel at all?

On a different point, how long does this orchestration live for? The above pattern looks unbounded to me, I suggest employing ContinueAsNew pattern here.

  1. Call your activity. Return if non-null
  2. If null, wait for 10 minutes
  3. Continue-as-new

The cancellation token is default in one case, but an actual token in another case. In our case, the whole orchestrator handles the lifetime of a shipment. At first, it creates a shipment, then waits for pickup using the code I already posted. In this case, it does get a cancellation token, because with an outside trigger, the shipment can still be canceled at this point. After the pickup event has happened, it updates some state in our database and raises an event to some other orchestrator. It then waits using the posted code until the shipment has finally arrived at its destination. In this case there is no cancellation token, because at that point the process no longer supports "regular" cancellation.

As to the question of how long it may live: depending on shipment destination, delays, etc., it can live anywhere from a day to several weeks in the worst case, but the average is 1-3 days.

I did not really know about the continue-as-new pattern; I had only seen it once in your docs about "eternal orchestrators". But because shipments are part of a business process that has a defined beginning and end, I did not really consider that pattern for this workflow. Was that assessment wrong?


Schmaga avatar Schmaga commented on June 4, 2024

Another question: do you have a monitoring extension called Dynatrace enabled for your app in Azure? In the past, I've seen Dynatrace users experience non-determinism errors at random intervals (probably due to thread instrumentation?)

We do not have any external monitoring extension except the application insights extension, which we register like this:

builder.AddApplicationInsights().AddApplicationInsightsLogger();

and we also added some of the workarounds for the "below Warning" log level problems with the App Insights integration from this issue: Azure/azure-functions-dotnet-worker#1182

I noticed that most logs are logged twice, maybe due to a misconfiguration in App Insights, but I doubt that has anything to do with the non-deterministic error.


Schmaga avatar Schmaga commented on June 4, 2024

what is the cancellation token here?

One small addition. The cancellation token in the case where it is non-default is based on the pattern we built for the workaround for #147 (comment)


jviau avatar jviau commented on June 4, 2024

I did not really know about the continue-as-new pattern; I had only seen it once in your docs about "eternal orchestrators"

Eternal orchestrations are one use case for continue-as-new, but it can really be generalized as a way to cull message history that is no longer needed and optimize replays. It is very useful for loops, as you can consolidate your state into the next iteration. The primary benefit here is performance - you are letting Durable drop message history which no longer has any impact (you don't care about all those previous timers and not-found results from previous loops, right? You only care about the next loop). For your case this may have a considerable positive impact: even living for 1-3 days, that loop is going to make the message history grow considerably, and continue-as-new will avoid that growth. It is easiest to apply a continue-as-new pattern by compartmentalizing your orchestration logic into multiple sub-orchestrations. For instance, you can have a sub-orchestration dedicated to just this check-delay loop (see the sketch below).
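
For illustration, a sketch of that compartmentalization (the orchestrator and sub-orchestrator names are made up for this example):

using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class ShipmentLifecycleOrchestrator
{
    [Function(nameof(ShipmentLifecycleOrchestrator))]
    public static async Task RunAsync([OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // The check-delay loop lives in its own sub-orchestration, which can
        // ContinueAsNew freely; the parent's history stays small and linear.
        await context.CallSubOrchestratorAsync("WaitForPickupOrchestrator");

        // ...update state, raise events, then wait for delivery the same way...
        await context.CallSubOrchestratorAsync("WaitForDeliveryOrchestrator");
    }
}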

it does get a cancellation token, because with an outside trigger, the shipment can still be canceled at this point.

Where does this cancellation token come from exactly? It is critical that cancellation tokens come only from within the orchestration itself. You must not access any external cancellation tokens. Really - I don't think it is a good idea to use cancellation tokens in orchestrations at all, at least not right now; we don't have a very good story for them yet. I would suggest the pattern I supplied in #147 (which you just linked): use external events and a Task representing the cancellation, with a Task.WhenAny to see which comes first - the timer or the cancellation event (roughly as sketched below).
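
A rough sketch of the shape of that pattern (the event name and helper are placeholders, not code from this thread):

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.DurableTask;

public static class CancellableDelay
{
    // Returns true if the delay elapsed, false if a "Cancel" event arrived first.
    public static async Task<bool> WaitOrCancelAsync(TaskOrchestrationContext context, TimeSpan delay)
    {
        // The token source is created inside the orchestration, never passed in from outside.
        using var timerCancellation = new CancellationTokenSource();

        Task timer = context.CreateTimer(delay, timerCancellation.Token);
        Task cancelEvent = context.WaitForExternalEvent<object>("Cancel");

        Task winner = await Task.WhenAny(timer, cancelEvent);
        if (winner == cancelEvent)
        {
            // Release the pending durable timer so the instance does not keep waiting on it.
            timerCancellation.Cancel();
            return false;
        }

        return true;
    }
}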


Schmaga avatar Schmaga commented on June 4, 2024

Where does this cancellation token come from exactly?

The cancellation token does come from within the orchestration itself. It is created using a CancellationTokenSource that the orchestration creates itself, and the cancellation is triggered by WaitForExternalEvent (in this case a "Cancel" event sent to the Orchestrator from the outside). In other words, exactly the pattern from #147


davidmrdavid avatar davidmrdavid commented on June 4, 2024

@Schmaga: to get us all on the same page, I think it would help if you could explicitly post the code for the cancellation pattern you're using. Can you please post as much code about it as possible? After reading through #147, I'm starting to suspect that, if we're not very careful, one could implement this cancellation pattern in a non-deterministic way.

Also, an experiment on your end - I understand you're able to repro this on your staging/testing environment. Any chance you could modify the orchestrator in your testing environment to not use a CancellationTokenSource at all and instead work directly with the external event API (sending repeated events as needed, as explained in the first post of #147)? If you're unable to reproduce this non-determinism error when you use the external event API directly, then that would be a strong signal for us to narrow our focus onto debugging the cancellation mechanism, or to rule it out.

Thanks!


nilsmehlhorn avatar nilsmehlhorn commented on June 4, 2024

@davidmrdavid @jviau here's the code for the cancellation pattern we're using. Note that it inadvertently depends on a custom idempotency layer described here. Ultimately, the cancellation token source comes from within the orchestrator.

using System.Reactive;

using Microsoft.DurableTask;

internal sealed class OrchestratorCancellation : IDisposable
{
    private readonly CancellationTokenSource _cancelEventCancellation = new();
    private readonly TaskOrchestrationContext _context;
    private Task<object>? _cancelEvent;

    public OrchestratorCancellation(TaskOrchestrationContext context)
    {
        _context = context;
    }

    public void Dispose()
    {
        _cancelEventCancellation.Cancel();
        _cancelEventCancellation.Dispose();
    }

    /// <summary>
    ///     Races the result of a task factory function against an externally sent
    ///     <see cref="OrchestratorCancellationEvents.Cancel" /> event.
    ///     When the orchestrator cancellation occurs before the task from the factory function completes,
    ///     this method throws a <see cref="OrchestratorCanceledException" /> allowing an orchestrator to break
    ///     out of its regular control flow and optionally enter a cancellation routine.
    /// </summary>
    /// <param name="checkpointStatus">
    ///     Identifier for remembering the point of cancellation idempotently across orchestration
    ///     restarts.
    /// </param>
    /// <param name="taskFactory">
    ///     Factory function for creating a task to be raced against orchestrator cancellation.
    ///     May accept a cancellation token for cancelling underlying task upon orchestrator cancellation.
    /// </param>
    /// <returns>Result of <paramref name="taskFactory" /></returns>
    public async Task<T> Wrap<T>(string checkpointStatus, Func<CancellationToken, Task<T>> taskFactory)
    {
        using var taskCancellation = new CancellationTokenSource();
        var task = taskFactory(taskCancellation.Token);

        _cancelEvent ??= _context.WaitForExternalEvent<object>(
            OrchestratorCancellationEvents.Cancel,
            _cancelEventCancellation.Token);

        var cancelAtEventName = OrchestratorCancellationEvents.CancelAt(checkpointStatus);

        var cancelAtEvent = _context.WaitForExternalEventIdempotent(
            cancelAtEventName,
            cancellationToken: _cancelEventCancellation.Token);

        _context.SetCustomStatus(checkpointStatus);

        var completedTask = await Task.WhenAny(_cancelEvent, cancelAtEvent, task);

        if (completedTask == task)
        {
            return await task;
        }

        if (completedTask == _cancelEvent)
        {
            _context.SendEvent(_context.InstanceId, cancelAtEventName, new object());
            await cancelAtEvent;
        }

        taskCancellation.Cancel();

        throw new OrchestratorCanceledException();
    }

    public async Task Wrap(string checkpointStatus, Func<CancellationToken, Task> taskFactory)
    {
        await Wrap(
            checkpointStatus,
            async cancellationToken =>
            {
                await taskFactory(cancellationToken);

                return Unit.Default;
            });
    }
}
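
For context, a hypothetical call site for Wrap (the function, event name, payload type, and checkpoint label below are invented for illustration):

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class ShipmentPickupStep
{
    public sealed record PickupEvent(DateTimeOffset PickedUpAt);

    [Function(nameof(ShipmentPickupStep))]
    public static async Task<PickupEvent> RunAsync([OrchestrationTrigger] TaskOrchestrationContext context)
    {
        using var cancellation = new OrchestratorCancellation(context);

        // Races the "pickup happened" event against orchestrator-level cancellation;
        // Wrap throws OrchestratorCanceledException if a Cancel event wins.
        return await cancellation.Wrap(
            "WaitingForPickup",
            token => context.WaitForExternalEvent<PickupEvent>("PickupHappened", token));
    }
}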

Here are the accompanying unit tests, which may help in understanding what the code is supposed to do. Note that the DurableOrchestrationContextFixture is a helper for setting up a mock orchestration context.

using System.Reactive;

using FluentAssertions;

using NSubstitute;

using Xunit;

public sealed class OrchestratorCancellationTests
{
    [Fact]
    public async Task It_resolves_task_from_factory_without_cancellation()
    {
        var fixture = DurableOrchestrationContextFixture.WithDefaultConfiguration();
        var context = fixture.ContextSubstitute;
        var cancellation = new OrchestratorCancellation(context);
        var result = await cancellation.Wrap("checkpoint", _ => Task.FromResult("abc"));

        result.Should().Be("abc");
    }

    [Fact]
    public async Task It_cancels_based_on_external_event()
    {
        var fixture = DurableOrchestrationContextFixture.WithDefaultConfiguration();
        var context = fixture.ContextSubstitute;
        var cancellation = new OrchestratorCancellation(context);

        fixture.GetCancellationSource().SetResult(new object());

        CancellationToken cancellationToken = default;

        var resultAction = () => cancellation.Wrap(
            "checkpoint",
            token =>
            {
                cancellationToken = token;

                return TaskNever<Unit>();
            });

        await resultAction.Should().ThrowAsync<OrchestratorCanceledException>();
        cancellationToken.IsCancellationRequested.Should().BeTrue();
    }

    [Fact]
    public async Task It_cancels_based_on_idempotency_checkpoint_event()
    {
        var fixture = DurableOrchestrationContextFixture.WithDefaultConfiguration();
        var context = fixture.ContextSubstitute;
        var cancellation = new OrchestratorCancellation(context);

        context.WaitForExternalEvent<object>(
                OrchestratorCancellationEvents.CancelAt("checkpoint"),
                Arg.Any<CancellationToken>())
            .Returns(Task.FromResult(new object()));

        CancellationToken cancellationToken = default;

        var resultAction = () => cancellation.Wrap(
            "checkpoint",
            token =>
            {
                cancellationToken = token;

                return TaskNever<Unit>();
            });

        await resultAction.Should().ThrowAsync<OrchestratorCanceledException>();
        cancellationToken.IsCancellationRequested.Should().BeTrue();
    }
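
    // Assumed implementation of the TaskNever<T>() helper referenced above (not shown
    // in the original snippet): a task that never completes.
    private static Task<TResult> TaskNever<TResult>() => new TaskCompletionSource<TResult>().Task;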
}

The thing is, we're also seeing the error occur in orchestrators which aren't using this cancellation pattern/code and thus have no cancellation token source. So this probably isn't the root cause here.

Regarding loop vs. continue-as-new: while we may get performance improvements from moving to continue-as-new, the current solution should still not result in those errors, right? We suspect that this orchestrator produces the error most often simply because of the looping, which technically increases the chance of it occurring.

If we're not mistaken, neither aspect can be the final culprit - especially since everything worked fine before we migrated to isolated mode. Therefore we're hesitant to adapt both the cancellation and the polling mechanism, but we could still try, since we're kind of desperate here.


davidmrdavid avatar davidmrdavid commented on June 4, 2024

@nilsmehlhorn, @Schmaga:

I believe @jviau's continue-as-new recommendation was mostly a performance suggestion. In that sense, I'm personally almost certain that it would have no impact on improving your orchestrator's determinism. Still, in the long run, I think it's worth incorporating.

@jviau and I discussed your cancellation wrapper: it seems valid at this time, though I personally cannot yet confirm that it's fully safe (I'm keeping it as a suspect in my mind since it's a non-standard wrapper).

On our end, I'm looking to prepare a private, instrumented, release of the isolated SDK to log more contextual information when the non-determinism error is encountered. Hopefully that can give us a few more clues. I'll need a few days to prepare that.

In the meantime - it would help us if you could provide us with the following.
For the orchestrator you shared in your first post (WaitForEvents) can you please share a successful and failed-with-non-determinism-error history for it? Obviously, we'll need you to censor any data from it as you see fit. It would also help us if you can provide us with the exception that the failed history produced, as well as the instanceID + execution time window in UTC for it.

Therefore we're hesitant to adapt both the cancellation and polling mechanism but we could still try since we're kind of desperate here.

If you're willing, and for me to fully rule out the cancellation wrapper as a suspect: would you be able to try removing the cancellation logic in a staging/testing environment? I understand that this error can be repro'd on your testing deployment, and if so I would be curious to see if the non-determinism error still triggers without it.


Schmaga avatar Schmaga commented on June 4, 2024

In the meantime - it would help us if you could provide us with the following.
For the orchestrator you shared in your first post (WaitForEvents) can you please share a successful and failed-with-non-determinism-error history for it? Obviously, we'll need you to censor any data from it as you see fit.

At the moment, I do not have a failed history for the orchestrator whose code I partially shared, because I purged and restarted most of the failed orchestrators to see whether the process continues without errors. I can, however, give you the history of a different orchestrator (one without a loop) that suffers from the same problem. I attached the CSV file here:
18UFKD-001_payment.csv

It would also help us if you can provide us with the exception that the failed history produced, as well as the instanceID + execution time window in UTC for it.

The instance ID is the filename, and the timestamps are also contained in the CSV. You also asked for an exception, but that brings me to an interesting observation: whenever an orchestrator fails with this strange "non-deterministic" error, no exception is propagated or logged - neither to App Insights nor to any other logs - and the parent orchestrator does not notice either. It just fails silently. The only place where the error message appears is in the functionInstanceHistory table, so the only way to notice that an orchestrator has failed this way is to constantly monitor the history. That leads me to assume that the error happens at a very low level, even before any orchestrator code is called.

If you're willing, and for me to fully rule out the cancellation wrapper as a suspect: would you be able to try removing the cancellation logic in a staging/testing environment? I understand that this error can be repro'd on your testing deployment, and if so I would be curious to see if the non-determinism error still triggers without it.

Because other orchestrators without any cancellation or loop code fail with the non-deterministic error as well, I am still convinced that the cancellation is not the problem. I would prefer waiting for and testing with the instrumented lib before going down this road.


jviau avatar jviau commented on June 4, 2024

@Schmaga do you have the source code for the failed orchestration you just shared? Censor what you need from it.

I also noticed that you appear to be including some versioning in the names of the activities: GetPaymentDataActivity_4_0_0. Are you changing those values while an orchestration is running?


Schmaga avatar Schmaga commented on June 4, 2024

@Schmaga do you have the source code for the failed orchestration you just shared? Censor what you need from it.

I will try to put together a censored version, although there's a lot of business-related code that I would need to remove. I'm not sure how much value there is in such a cut-down version.

I also noticed that you appear to be including some versioning in the names of the activities: GetPaymentDataActivity_4_0_0. Are you changing those values while an orchestration is running?

We never change those values while an orchestration is running. Whenever we have to introduce a breaking change into our orchestrators, we create a complete copy of all the relevant namespaces, leading to classes with the new number. But these new classes are then used exclusively for new instances, and existing orchestrations keep running with the older code.


davidmrdavid avatar davidmrdavid commented on June 4, 2024

Btw @jviau I'm familiar with their versioning strategy and it seemed safe replay-wise: it's as @Schmaga said - essentially a complete copy of the files, old versions of activities and orchestrators are preserved in their deployment payload.


davidmrdavid avatar davidmrdavid commented on June 4, 2024

I'll be developing the new logs here: https://github.com/Azure/durabletask/pull/912/files
At the time of writing this comment, I have not added all the logs I want to see. Hopefully I can wrap that up soon, do a quick sanity test that it works, and then share the package.


davidmrdavid avatar davidmrdavid commented on June 4, 2024

@nilsmehlhorn, @Schmaga:

I have a first draft of this private package already released.
To access it, please add the following source to your nuget.config:

    <add key="durabletask" value="https://durabletaskframework.pkgs.visualstudio.com/734e7913-2fab-4624-a174-bc57fe96f95d/_packaging/durabletask/nuget/v3/index.json" />

For example, my nuget.config looks like this (I placed this at the root of my Azure Functions app):

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <add key="durabletask" value="https://durabletaskframework.pkgs.visualstudio.com/734e7913-2fab-4624-a174-bc57fe96f95d/_packaging/durabletask/nuget/v3/index.json" />
  </packageSources>
</configuration>

Then, please update your .csproj file to reference the following two dependencies:

    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.0.3-non-determinism-instrumentation" />
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="2.9.6-non-determinism-instrumentation" />

Finally, run dotnet restore followed by dotnet build. If that succeeds, then you have the private package. You can further validate this by running func host start --verbose to start the Azure Functions host with verbose logs and searching for a log like this one:

"""
[2023-06-09T23:17:57.780Z] Task hub worker started. Latency: 00:00:01.2814796. Extension GUID f975a985-f70e-4b1f-9c22-99eca04d6aa0. InstanceId: . Function: . HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.9.6. SequenceNumber: 2.
"""

If you see "ExtensionVersion: 2.9.6" then the right bits were picked up. This is because we have not yet released a 2.9.6 on Nuget, so this implies you're correctly downloading our private release.

From there, a reproduction of the error should yield an expanded error message with information that will be helpful in debugging this further. Please share that error message with us when you get a new non-determinism error for the orchestrator implementation you've shared with us.

Finally, there's a good chance we'll need to iterate on this package a few times to be able to root-cause this; we just don't have enough information yet to know confidently what runtime data we need to debug this. I appreciate your patience with us; we'll be working quickly on this.


Schmaga avatar Schmaga commented on June 4, 2024

I published a version of our app on our testing system with your instrumented NuGet package. It took us some time, but after some testing the error re-appeared, this time in a different orchestrator (one with a loop again, because they seem to have a higher probability of failing). I attached the log here:
FAPV3F-001_shipment.csv

Hope you can make some sense of the error message, as I cannot :) I also looked through the App Service and App Insights logs, and again this error can only be found in the history table, nowhere else.


davidmrdavid avatar davidmrdavid commented on June 4, 2024

@Schmaga: thank you for the quick update.

The new error log essentially asserts that, when your orchestrator code experienced this error during replay, it had not yet called any Durable Functions APIs from the context object.

As you know, our execution model is one where we interleave between executing your code and our framework's state management logic, with await statements on DF APIs transferring control from your code back to the DF framework. For this to work deterministically, all awaited tasks ultimately need to be derived from the DF context object.
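
To make that concrete, here is a generic illustration of the distinction (not code from this app):

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.DurableTask;

public static class AwaitExamples
{
    public static async Task DeterministicOrchestrator(TaskOrchestrationContext context)
    {
        // Fine: both awaited tasks come from the DF context, so replay can reproduce them.
        await context.CallActivityAsync("DoWork");
        await context.CreateTimer(TimeSpan.FromMinutes(5), CancellationToken.None);
    }

    public static async Task NonDeterministicOrchestrator(TaskOrchestrationContext context)
    {
        // Not fine: this yields control with a task the framework knows nothing about,
        // which surfaces as a non-determinism error during replay.
        await Task.Delay(TimeSpan.FromMinutes(5));
    }
}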

The instrumented error suggests to me that control is being transferred back to the framework very early in your code's execution, before it even gets a chance to await a DF API from the context object. This rules out the possibility that this orchestrator's cancellation helper introduced the non-determinism, as it has not even had a chance to start executing and none of the DF APIs within it have been called.

This also narrows down the literal lines of code where control could be yielded back to the DF framework. I think our next step is probably to add logs to your code, before the first CallActivity, and use them to see how far into your orchestrator code we get before this error occurs. But let's hold off on that; I first want to determine if there's more we can do on the package side. I'll get back to you on this asap.

As an aside, I noticed the way I formatted the error does not play well with csv exporting. I'll look to fix that as well.


Schmaga avatar Schmaga commented on June 4, 2024

The plot thickens... Will wait for your reply then.

In the meantime: one other thing that might be worth talking about, which you reminded me of when you said it could be important to see what runs "before" the orchestrator code. I wrote a little custom middleware, registered in ConfigureFunctionsWorkerDefaults, that is based on the ASP.NET Core middleware for Azure App Configuration:

public class AppConfigurationRefreshMiddleware : IFunctionsWorkerMiddleware
{
    private readonly IConfigurationRefresherProvider _configurationRefresherProvider;

    public AppConfigurationRefreshMiddleware(IConfigurationRefresherProvider configurationRefresherProvider)
    {
        _configurationRefresherProvider = configurationRefresherProvider;
    }

    public async Task Invoke(FunctionContext context, FunctionExecutionDelegate next)
    {
        var refresher = _configurationRefresherProvider.Refreshers.FirstOrDefault();

        if (refresher != null)
        {
            await refresher.RefreshAsync(context.CancellationToken);
        }

        await next(context);
    }
}

I reckon it's a long shot, but could this refresh during every function call somehow interfere with your DF framework flow? I don't know what RefreshAsync actually does under the hood, but it is AppConfiguration library code, so I guessed it would be fine.


davidmrdavid avatar davidmrdavid commented on June 4, 2024

Interesting. This is a bit outside my expertise, so I'll defer to @jviau to comment on this middleware.


davidmrdavid avatar davidmrdavid commented on June 4, 2024

@Schmaga:

@jviau and I discussed this offline and we do think this middleware is a source of risk and a potential root cause. In particular, we think the issue could be in the line await refresher.RefreshAsync, which isn't guaranteed to trigger every time (refresher might be null), and even when it triggers, it is not guaranteed to complete asynchronously (as with any .NET Task). This could explain why the error only occurs some of the time.

However, when that line does trigger and the task does execute asynchronously, the await may be interpreted as yielding control back to the DF framework. In that case, we'll effectively interpret the middleware as being "new code" in your orchestrator, which creates non-determinism. Note that we're not able to confirm in our telemetry that this is exactly what is happening, but this is consistent with the behavior we're seeing for your app as well as with the instrumented error message you obtained for us.
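
As a tiny, generic illustration of why this would be intermittent (nothing here is specific to the app in question):

using System;
using System.Threading.Tasks;

public static class AwaitYieldDemo
{
    public static async Task RunAsync()
    {
        // Awaiting an already-completed task continues synchronously: no yield occurs.
        await Task.CompletedTask;
        Console.WriteLine("still on the original call path");

        // Awaiting a pending task yields control back to the caller/framework here,
        // which is the case that can be misread as extra orchestrator actions.
        await Task.Delay(10);
        Console.WriteLine("resumed after the asynchronous completion");
    }
}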

Assuming this theory is correct, then this problem is entirely on us and it would be something we're looking to fix. We're starting a few threads internally to try and remove this risk, but I'll spare the details in this conversation.

I think we can validate this theory, and work around this suspected bug, in two main ways:

  1. Ideally, you would skip this middleware when an orchestrator function is being invoked; that would allow you to continue using the App Configuration service. After all, determinism is only checked for orchestrators. I'm not sure if you have enough info on your context object to make that determination.
  2. Remove this middleware entirely, which I realize defeats the purpose of the App Configuration service.

Would you be able to work with these workarounds? I realize they're not ideal, but they're our best bet given the circumstances. It would also help us to know whether the error goes away after you implement one of them. Please let us know, thank you.


Schmaga avatar Schmaga commented on June 4, 2024

That is indeed interesting. And it's strange that no one else has reported similar errors yet; maybe we are the only people currently using the new DF framework with middleware.

I guess both ways you propose would be fine for me to implement until you have found a way to update your framework to work seamlessly with middleware.

Ideally, you would skip this middleware when an orchestrator function is being invoked; that would allow you to continue using the App Configuration service. After all, determinism is only checked for orchestrators. I'm not sure if you have enough info on your context object to make that determination.

I would like to try number 1) first, because then we would at least not completely lose the refresh function of AppConfiguration. Is there a reliable way to check inside the middleware if the current execution is calling into an orchestration? I could probably debug my way around and check if there is anything in the FunctionContext I can use or if I need to work with the FunctionExecutionDelegate Target/Method Properties, but would appreciate some pointers. Afterwards I will try releasing such a version and test it ASAP.


jviau avatar jviau commented on June 4, 2024

@Schmaga here is how we detect that it is an orchestration trigger in the middleware: https://github.com/Azure/azure-functions-durable-extension/blob/2960744a186b768c23ddb487674bcdde2958b0b2/src/Worker.Extensions.DurableTask/DurableTaskFunctionsMiddleware.cs#LL40C15-L40C15

Unfortunately, the proper fix for this requires new APIs from the functions dotnet worker. Instead of wrapping the whole middleware pipeline as our orchestration invocation, we need to wrap only the orchestration function's implementation.


davidmrdavid avatar davidmrdavid commented on June 4, 2024

Very encouraging to hear that it looks like we found the root cause; let's observe it for a little longer though, as you mentioned. This is definitely one of those cases where the contextual information you provided was critical, so I'm thankful for your engagement as well.

As for tests: I'm not aware of any custom middleware tests today, but @jviau would know for certain. Irrespective of the answer, in any large enough organization like Azure Functions there are sometimes too many interactions with features from other teams for us to test them all (or even know about them all). It appears to me that this is one of those situations.

Agreed on the docs update and so on. I'm looking to create a ticket on this repo reporting this risk as well; I just haven't gotten around to it yet. It's in the pipeline :)

Please keep us posted on whether the incident reoccurs, or if it does not.

