microsoft / service-fabric-healer

Service Fabric Auto-Repair Service with Declarative Logic for Repair Policy Specification. Targets both Windows and Linux SF clusters.

License: MIT License

PowerShell 2.18% C# 97.82%
service-fabric logic-programming auto-mitigation windows linux guan csharp net6

service-fabric-healer's Introduction

FabricHealer 1.2.13

Service Fabric Auto-Repair Service with Declarative Logic for Repair Policy Specification.

FabricHealer (FH) is a .NET 6 Service Fabric application that attempts to automatically fix a set of reliably solvable problems that can take place in Service Fabric applications (including containers), host virtual machines, and logical disks (scoped to space usage problems only). Most repairs employ a set of Service Fabric API calls, but some, such as Disk repair, are fully customizable. All repairs are safely orchestrated through the Service Fabric RepairManager system service. Repair workflow configuration is written as Prolog-like logic with supporting external predicates written in C#.

FabricHealer's Configuration-as-Logic feature requires Guan, a Prolog-like logic programming library for .NET. A repair workflow starts when FabricHealer detects supported error or warning health events, such as those reported by FabricObserver or FabricHealerProxy.

Note that you can use FabricHealer even if you don't employ FabricObserver or FabricHealerProxy. For machine-level repairs you need neither if you want to automatically schedule machine repair jobs based on node health states alone (specifically, the Error state). For all other repairs, if you do not deploy FabricObserver, you must install FabricHealerProxy into a .NET Service Fabric project to leverage the power of FabricHealer.

FabricObserver and FabricHealer work great together. *Note: This version supports FabricObserver 3.2.3 and higher.*

FabricHealer is implemented as a stateless singleton service that runs on one or all nodes in a Linux or Windows Service Fabric cluster. For Disk and Fabric system service repairs, you must run FabricHealer on all nodes. FabricHealer is built as a .NET 6.0 application and has been tested on multiple versions of Windows Server and Ubuntu.

To learn more about FabricHealer's configuration-as-logic model, see the logic workflow documentation in the repository.

FabricHealer requires SF Runtime versions 9 and higher.
FabricHealer requires the Service Fabric RepairManager (RM) service. 
For machine repairs, Service Fabric InfrastructureService (IS) must be deployed for each node type.
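
The RepairManager and InfrastructureService prerequisites can be checked from PowerShell before deploying. This is a minimal sketch, not an official procedure. It assumes the Service Fabric PowerShell module is installed, that the session is already connected with Connect-ServiceFabricCluster, and that both system services appear under fabric:/System with names containing RepairManagerService and InfrastructureService, as they typically do on Azure-hosted clusters.

```powershell
# Assumes Connect-ServiceFabricCluster has already succeeded in this session.
# List the cluster's system services and keep only the ones FabricHealer relies on.
# The name patterns below are assumptions about how these services are named.
Get-ServiceFabricService -ApplicationName fabric:/System |
    Where-Object { "$($_.ServiceName)" -match 'RepairManagerService|InfrastructureService' } |
    Select-Object ServiceName, ServiceStatus, HealthState
```

If no RepairManagerService entry is returned, the RM requirement above is not met. For machine repairs, expect one InfrastructureService entry per node type.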

Build and run

  1. Clone the repo.
  2. Install the .NET 6 SDK.
  3. Build.
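
The three steps above can be sketched in PowerShell as follows. The clone URL is inferred from the repository name (microsoft/service-fabric-healer), and the build script name is the one used in the PowerShell Deployment section. Treat both as assumptions to verify against your checkout.

```powershell
# 1. Clone the repo (URL assumed from the repo name microsoft/service-fabric-healer).
git clone https://github.com/microsoft/service-fabric-healer.git
cd service-fabric-healer

# 2. Confirm that a .NET 6 SDK is installed; the list should include a 6.x entry.
dotnet --list-sdks

# 3. Build FabricHealer (Release) with the repo's build script.
./Build-FabricHealer
```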

Deploy FabricHealer

You can deploy FabricHealer using Visual Studio (if you build the sources yourself), PowerShell or ARM. Please note that this version of FabricHealer no longer supports the DefaultServices node in ApplicationManifest.xml. This means that should you deploy using PowerShell, you must create an instance of the service as the last command in your script. This was done to support ARM deployment, specifically. The StartupServices.xml file you see in the FabricHealerApp project now contains the service information once held in ApplicationManifest's DefaultServices node. Note that this information is primarily useful for deploying from Visual Studio. Your ARM template or PowerShell script will contain all the information necessary for deploying FabricHealer.

ARM Deployment

For ARM deployment, please see the ARM documentation.

PowerShell Deployment

#cd to the top level repo directory where you cloned FH sources.

cd C:\Users\me\source\repos\service-fabric-healer

#Build FH (Release)

./Build-FabricHealer

#create a $path variable that points to the build output:
#E.g., for Windows deployments:

$path = "C:\Users\me\source\repos\service-fabric-healer\bin\release\FabricHealer\win-x64\self-contained\FabricHealerType"

#For Linux deployments:

#$path = "C:\Users\me\source\repos\service-fabric-healer\bin\release\FabricHealer\linux-x64\self-contained\FabricHealerType"

#Connect to target cluster, for example:

Connect-ServiceFabricCluster -ConnectionEndpoint @('sf-win-cluster.westus2.cloudapp.azure.com:19000') -X509Credential -FindType FindByThumbprint -FindValue '[thumbprint]' -StoreLocation LocalMachine -StoreName 'My'

#Copy $path contents (FH app package) to server:

Copy-ServiceFabricApplicationPackage -ApplicationPackagePath $path -CompressPackage -ApplicationPackagePathInImageStore FH127 -TimeoutSec 1800

#Register FH ApplicationType:

Register-ServiceFabricApplicationType -ApplicationPathInImageStore FH127

#Create FH application (if not already deployed at lesser version):

New-ServiceFabricApplication -ApplicationName fabric:/FabricHealer -ApplicationTypeName FabricHealerType -ApplicationTypeVersion 1.2.13   

#Create the Service instance:  

# FH can be deployed with a single instance or run on all nodes. Note that for certain repairs, it must be deployed to all nodes (InstanceCount = -1). If you employ Disk repair and/or System service process restarts, deploy with InstanceCount set to -1.
New-ServiceFabricService -Stateless -PartitionSchemeSingleton -ApplicationName fabric:/FabricHealer -ServiceName fabric:/FabricHealer/FabricHealerService -ServiceTypeName FabricHealerType -InstanceCount -1

#OR if updating existing version:  

Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/FabricHealer -ApplicationTypeVersion 1.2.13 -Monitored -FailureAction rollback

Using FabricHealer

Let's say you have a service that is using too much memory or too many ephemeral ports, as defined by the thresholds you configured in FabricObserver (which generates the Warnings) and, optionally, in your related logic rule. Testing the metric value in the rule is optional: you can decide that if FabricObserver warns, then FabricHealer should mitigate without re-checking the metric value that led to the Warning. It's up to you. You would use FabricHealer to keep the problem in check while your developers figure out the root cause and fix the bugs that lead to resource over-consumption. FabricHealer is really just a temporary solution to problems, not a fix. This is how you should think about auto-mitigation, generally. FabricHealer aims to keep your cluster green while you fix your bugs. With its configuration-as-logic support, you can easily specify that some repair for some service should only be attempted for n weeks or months, while your dev team fixes the underlying issues with the problematic service. FabricHealer should be thought of as a "disappearing task force" in that it can provide stability during times of instability, then "go away" when bugs are fixed.

FabricHealer comes with a number of already-implemented/tested target-specific logic rules. You will only need to modify existing rules to get going quickly. FabricHealer is a rule-based repair service and the rules are defined in logic. These rules also form FabricHealer's repair workflow configuration. This is what is meant by Configuration-as-Logic. The only use of XML-based configuration with respect to repair workflow is enabling automitigation (big on/off switch), enabling repair policies, and specifying rule file names. The rest is just the typical Service Fabric application configuration that you know and love. Most of the settings in Settings.xml are overridable parameters and you set the values in ApplicationManifest.xml. This enables versionless parameter-only application upgrades, which means you can change Settings.xml-based settings without redeploying FabricHealer.
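
A parameter-only upgrade of this kind might look like the sketch below. It assumes an existing Connect-ServiceFabricCluster session and an already deployed fabric:/FabricHealer application. EnableDiskRepair is used only as an example parameter name, so check your ApplicationManifest.xml for the parameters your version actually defines. Application parameters are not preserved across an upgrade, so the sketch first copies the current overrides and then changes one value.

```powershell
# Assumes an existing session from Connect-ServiceFabricCluster.
$appName = 'fabric:/FabricHealer'
$app = Get-ServiceFabricApplication -ApplicationName $appName

# Carry over the currently overridden parameters; any parameter omitted from
# the upgrade falls back to its default value in ApplicationManifest.xml.
$appParams = @{}
foreach ($p in $app.ApplicationParameters) { $appParams[$p.Name] = $p.Value }

# Change one setting. The parameter name is an example, not a guaranteed name.
$appParams['EnableDiskRepair'] = 'true'

# Same application type version, new parameter values: a parameter-only upgrade.
Start-ServiceFabricApplicationUpgrade -ApplicationName $appName `
    -ApplicationTypeVersion $app.ApplicationTypeVersion `
    -ApplicationParameter $appParams `
    -Monitored -FailureAction Rollback
```

Because the type version is unchanged, no new application package has to be copied or registered; only the parameter values differ.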

Repair ephemeral port usage issue for application service process

## Ephemeral Ports - Number of ports in use for any SF service process belonging to the specified SF Application. 
## Attempt the restart code package mitigation for the offending service if the number of ephemeral ports it has opened is greater than 5000.
## Maximum of 5 repairs within a 5 hour window.
Mitigate(AppName="fabric:/IlikePorts", MetricName="EphemeralPorts", MetricValue=?MetricValue) :- ?MetricValue > 5000, 
    TimeScopedRestartCodePackage(5, 05:00:00).

Repair memory usage issue for application service process

## Memory - Percent In Use for any SF service process belonging to the specified SF Application. 
## Attempt the restart code package mitigation for the offending service if the percentage (of total) physical memory it is consuming is at or exceeding 70.
## Maximum of 3 repairs within a 30 minute window.
Mitigate(AppName="fabric:/ILikeMemory", MetricName="MemoryPercent", MetricValue=?MetricValue) :- ?MetricValue >= 70, 
    TimeScopedRestartCodePackage(3, 00:30:00).

Quickstart

To quickly learn how to use FabricHealer, please see the simple scenario-based examples.

Operational Telemetry

Please see FabricHealer Operational Telemetry for detailed information on the user agnostic (Non-PII) data FabricHealer sends to Microsoft (opt out with a simple configuration parameter change). Please consider leaving this enabled so your friendly neighborhood Service Fabric devs can understand how FabricHealer is doing in the real world. We would really appreciate it!

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

service-fabric-healer's People

Contributors

dependabot[bot], gittorre, kumarnareshh74, markwragg, sidhant012

service-fabric-healer's Issues

BUG: Repair Targets with multiple FO Warnings Fail in Execution

If a repair target, say application fabric:/Foo, has more than one FO Warning, then FH will assume the repairs are different, since it looks at the repair id only instead of the target entity (in this case the app name). The result is predictable: the second code package restart fails because the replica that used to exist on the related partition is gone. The successful repair for one of the warnings restarts the offending code package, so the ID of the related replica is no longer valid. This is easy to fix and will be quickly addressed. In the meantime, ignore the Executor Failed Informational events in SFX if you are experimenting with this type of scenario.

[BUG] Machine repair rules appear to be active even when machine repair is not enabled

Describe the bug

I am seeing log entries from Fabric Healer as follows:

Detected Fabric node _node5_2 is in Warning. Machine repair target specified. 11 logic rules found for Machine repair.

Executing repair FH/6354b42d-7235-4c0e-a61e-fee6f8287f13/RestartProcess/_node5_2.

However EnableMachineRepair has the default value of false and is not overridden.

It seems like the Machine Repair rules are executing even when the repair is not enabled.

Steps To Reproduce

Deploy Fabric Healer with its default setup / config.

Expected behavior

The Machine Repair rules should not be executed when EnableMachineRepair is false. I would expect it to not even load the rules.

OS:

  • Name: Windows
  • Version: 2019 Datacenter

Additional information:

Fabric Healer version 1.2.1

[BUG] Write more (and better) unit tests

Bugs - for basic features/capabilities - are being found after release. This means the unit tests are insufficient and end-to-end testing is not picking up that slack.

This is a bug, not a feature request.

[BUG] FH stops processing health events due to benign exceptions

Describe the bug

StartAsync catches handleable exceptions in the outer try-catch only. This means the StartAsync while loop breaks when such an exception is thrown, and no more health events are processed until the FH service process restarts.

Expected behavior

StartAsync's while loop continues when any handleable exception is caught. This is the main processing loop and should only break when critical exceptions are encountered or the SF RunAsync CancellationToken is cancelled.

OS:

  • Name: Windows and Linux

Additional information:

This will be fixed in the next release, 1.2.10.

Error deploying 1.2.0

Trying to deploy version 1.2.0 to our cluster. Both the self-contained and framework-dependent packages throw the error below.

Happens during upgrade as well as clean install with previous version removed from the cluster.

Cluster is at version 9.1.1436.9590

{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","details":[{"code":"ClusterChildResourceOperationFailed","message":"Resource operation failed. Operation: CreateOrUpdate. Error details: {\r\n "Details": {\r\n "ClassName": "System.Fabric.FabricException",\r\n "Message": "Parameter 'MonitorLoopSleepSeconds' is not defined in the ApplicationManifest file.\r\nFileName: applicationParameters",\r\n "Data": null,\r\n "InnerException": {\r\n "ClassName": "System.Runtime.InteropServices.COMException",\r\n "Message": "Exception from HRESULT: 0x80071BE6",\r\n "Data": null,\r\n "InnerException": null,\r\n "HelpURL": null,\r\n "StackTraceString": " at System.Fabric.Interop.NativeClient.IFabricApplicationManagementClient10.EndCreateApplication(IFabricAsyncOperationContext context)\r\n at System.Fabric.Interop.Utility.<>c__DisplayClass22_0.b__0(IFabricAsyncOperationContext context)\r\n at System.Fabric.Interop.AsyncCallOutAdapter2`1.Finish(IFabricAsyncOperationContext context, Boolean expectedCompletedSynchronously)",\r\n "RemoteStackTraceString": null,\r\n "RemoteStackIndex": 0,\r\n "ExceptionMethod": "8\nEndCreateApplication\nSystem.Fabric, Version=9.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35\nSystem.Fabric.Interop.NativeClient+IFabricApplicationManagementClient10\nVoid EndCreateApplication(IFabricAsyncOperationContext)",\r\n "HResult": -2147017754,\r\n "Source": "System.Fabric",\r\n "WatsonBuckets": null\r\n },\r\n "HelpURL": null,\r\n "StackTraceString": " at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n at System.Fabric.UpgradeService.ApplicationClient.d__8.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n at 
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n at System.Fabric.UpgradeService.ApplicationCommandProcessor.d__9.MoveNext()",\r\n "RemoteStackTraceString": null,\r\n "RemoteStackIndex": 0,\r\n "ExceptionMethod": "8\nThrow\nmscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089\nSystem.Runtime.ExceptionServices.ExceptionDispatchInfo\nVoid Throw()",\r\n "HResult": -2147017754,\r\n "Source": "mscorlib",\r\n "WatsonBuckets": null\r\n }\r\n}"}]}

[FEATURE REQUEST] Add support for specifying max execution duration.

Folks should be able to specify, as part of a logic rule, the maximum amount of time a repair can execute. A default maximum execution time will also prevent orphaned/stuck repairs.

This will be added as an optional argument for use in multiple Guan repair predicates. This is only supported for repairs that FabricHealer executes (so, not machine repairs where FH schedules repairs, but they are executed by SF's InfrastructureService). FabricHealer can't reason about how long an external executor should take to complete a repair, so it doesn't try.

[BUG] Guan files are processed even when the specified repair type is not enabled

Describe the bug

I can see the following log entry from FabricHealer:

TraceCurrentlyExecutingRule failure => Unable to read SystemServiceRules.guan: Index was outside the bounds of the array.

I have the default version of the SystemServiceRules.guan file in FabricHealer, which seems to suggest that in its default form it has a bug (which is perhaps because one of the examples it contains is not intended for actual use).

However, I currently have EnableSystemServiceRepair set to the default value of false, so I would have expected the SystemServiceRules.guan file to not even need to be loaded by Fabric Healer.

Steps To Reproduce

Deploy FabricHealer with its default setup. The error can be observed in its logs, or in Log Analytics if telemetry is enabled.

Expected behavior

The tool should not attempt to load guan files for repair type that it does not have enabled.

OS:

  • Name: Windows
  • Version: 2019 Datacenter

Additional information:

Fabric Healer version 1.2.1

Add ARM template for FH deployment (using Releases SFPKG)

Add an ARM template to the repo that can be used to deploy FH using the appropriate SFPKG located in Releases. Given that most of FH's settings are overridable Application Parameters, this would be helpful for quick and easy deployment. There is more work to do to make this possible, but we're close.

BUG: Disk Mitigation does not work.

Hi,

I'm sure I'm probably doing something wrong here but I can't seem to get a Disk mitigation to fire. I've set EnableDiskRepair to true in my ApplicationManifest.xml and my DiskRules.guan contains the following:

Mitigate :- CheckInsideRunInterval(02:00:00), !.

## Added 2023-04-05 | Mark Wragg: Remove files from Service Fabric Observer and Fabric Healer log directories if directories > 1GB

Mitigate(MetricName=?MetricName) :- LogRule(52), match(?MetricName, "DiskSpace"), GetRepairHistory(?repairCount, 08:00:00), 
	?repairCount < 4,
	member(config(?X,?Y), [config("C:\cluster_observer_logs", 1), config("C:\fabric_healer_logs", 1), config("C:\fabric_observer_logs", 1)]), 
	CheckFolderSize(?X, MaxFolderSizeGB=?Y),
	DeleteFiles(?X, SortOrder=Ascending, MaxFilesToDelete=10, RecurseSubdirectories=true).

## Added 2023-04-05 | Mark Wragg: Remove ETL files from Service Fabric Log Traces directory if directory > 20GB

Mitigate(MetricName=?MetricName) :- LogRule(60), match(?MetricName, "DiskSpace"), GetRepairHistory(?repairCount, 08:00:00),
	?repairCount < 4,
	CheckFolderSize("D:\SvcFab\Log\Traces", MaxFolderSizeGB=20),
	DeleteFiles("D:\SvcFab\Log\Traces", SortOrder=Ascending, MaxFilesToDelete=10, RecurseSubdirectories=true, SearchPattern="*.etl").

I've then created 20 x 1GB test files in the C:\fabric_observer_logs directory (named for example 1gb.1.test) these were created with fsutil and are 1GB in size each.

I was expecting (perhaps after 2 hours) Fabric Healer to delete the oldest 10 of my files in this directory, and then 2 hours later delete another 10 of them, but it's been about 6 hours and I'm not seeing anything occurring.

fh_operations_telemetry.log:

{"EventName":"OperationalEvent","TaskName":"FabricHealer","EventRunInterval":"1.00:00:00","SFRuntimeVersion":"9.1.1583.9590","ClusterId":"9bb1353a-355d-41e3-98c9-e990adf9c018","ClusterType":"SFRP","NodeNameHash":"e0f9d96061c0cdd36a0659711ed0768625f395a60a295c42242bbeb694f415eb","FHVersion":"1.2.0","UpTime":"00:00:00.1368888","Timestamp":"2023-04-06T15:27:02.0266516Z","OS":"Windows","EnabledRepairCount":2,"TotalRepairAttempts":0,"SuccessfulRepairs":0,"FailedRepairs":0}

RepairData.log

2023-04-06 09:55:44.9428--INFO--Detected Fabric node _SFnode0_0 is in Warning.
Machine repair target specified. 11 logic rules found for Machine repair.
2023-04-06 14:10:37.6579--INFO--Detected Fabric node _node5_1 is in Warning.
Machine repair target specified. 11 logic rules found for Machine repair.

I am seeing Forbidden errors in TelemetryLogger.log, which I assume means it cannot write to Log Analytics, even though I have checked that the workspace Id and Shared Key values are correct. I assume, though, that this is unrelated to why my Disk mitigation is not firing.

Grateful if you can advise of anything I am doing wrong.

Thanks,
Mark
