microsoft / service-fabric-observer

Highly configurable, extensible and performant Service Fabric watchdog service that, out of the box, monitors a broad range of physical machine resources that tend to be very important to Service Fabric services and maps these metrics to SF entities. It targets both Windows and Linux SF clusters.

License: MIT License

C# 96.16% PowerShell 3.09% Shell 0.20% C 0.49% Batchfile 0.05%

service-fabric windows csharp linux watchdog-service cluster fabric-observer net6

service-fabric-observer's Introduction

FabricObserver 3.2.15

FabricObserver (FO) is a production-ready watchdog service with an easy-to-use extensibility model, written as a stateless, singleton Service Fabric .NET 6 application that, by default:

Monitors a broad range of physical machine resources that tend to be very important to all Service Fabric services and maps these metrics to the related Service Fabric entities.
Runs on multiple versions of Windows Server and Ubuntu.
Provides an easy-to-use extensibility model for creating custom Observers out of band (so, you don't need to clone the repo to build an Observer). In this way, FabricObserver is also an "Observer" platform.
Supports Configuration Setting Application Updates for any observer for any supported setting.
Is actively developed in the open.

FabricObserver targets SF runtime versions 9 and higher.

FO is a Stateless Service Fabric Application composed of a single service that runs on every node in your cluster, so it can be deployed and run alongside your applications without any changes to them. Each FO service instance knows nothing about other FO instances in the cluster, by design.

Running side-by-side with existing monitoring services, FabricObserver provides useful and timely health information for the nodes (VMs), apps, and services that make up your Service Fabric deployment.

FabricObserver is one member of a growing family of open source Service Fabric observability services. The latest member of the family is FabricHealer, which works in conjunction with FabricObserver to auto-mitigate service, node and VM level issues reported by FO.

If you run your apps on Service Fabric, then you should definitely consider deploying FabricObserver to all of your clusters (Test, Staging, Production).

Using FabricObserver

To quickly learn how to use FO, please see the simple scenario-based examples.
You can clone the repo, build, and deploy, or simply grab the latest tested SFPKG with Microsoft-signed binaries from the Releases section, modify the configs, and deploy.

How it works

Application and Service Level Warnings:

Node Level Warnings:

Node Level Machine Info:

When FabricObserver gracefully exits or updates, it will clear all of the health events it created.

FabricObserver comes with a number of Observers that run out-of-the-box. Observers are specialized objects that monitor, point in time, specific resources in use by user service processes, SF system service processes, containers, virtual/physical machines. They emit Service Fabric health reports, diagnostic telemetry and ETW events, then go away until the next round of monitoring. The resource metric thresholds supplied in the configurations of the built-in observers must be set to match your specific monitoring and alerting needs. These settings are housed in Settings.xml and ApplicationManifest.xml. The default settings are useful without any modifications, but you should design your resource usage thresholds according to your specific needs.

When a Warning threshold is reached or exceeded, an observer will send a Health Report to Service Fabric's Health management system (either as a Node or App Health Report, depending on the observer). This Warning state and related reports are viewable in SFX, the Service Fabric EventStore, and Azure's Application Insights/LogAnalytics/ETW, if enabled.

Most observers will remove the Warning state in cases where the issue is transient, but others will maintain a long-running Warning for application, service, node, or security problems observed in the cluster. For example, CPU usage above the user-assigned threshold for a VM or App/Service will put a Node into Warning state (NodeObserver) or an Application into Warning state (AppObserver), but the entity will soon go back to Healthy if the spike is transient or after you mitigate the specific problem. An expiring certificate Warning from CertificateObserver, however, will remain until you update your application's certificates. (Cluster certificates are already monitored by the SF runtime. This is not the case for Application certificates, so use CertificateObserver for this, if necessary.)
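
As a rough illustration of the transient Warning behavior described above, here is a minimal Python sketch. It assumes an observer averages the readings from one monitoring round and compares the average to a warning threshold; the function name and sample values are hypothetical and are not FO's actual implementation.

```python
from statistics import mean


def evaluate_health(readings, warning_threshold):
    """Return "Warning" when the average reading reaches or exceeds the threshold, else "Ok"."""
    if not readings:
        # No data collected in this round, so there is nothing to warn about.
        return "Ok"
    return "Warning" if mean(readings) >= warning_threshold else "Ok"


# Round 1: a CPU spike pushes the average over an 80% threshold.
print(evaluate_health([95.0, 90.0, 88.0], 80.0))
# Round 2: the spike has passed, so the next round reports healthy again.
print(evaluate_health([35.0, 40.0, 30.0], 80.0))
```

Because each round is evaluated on its own readings, a transient spike clears itself on a later round, while a persistent condition keeps producing Warning.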

FO ships with both an Azure ApplicationInsights and Azure LogAnalytics telemetry implementation. Other providers can be used by implementing the ITelemetryProvider interface.

For more information about the design of FabricObserver, please see the Design readme.

Build and run

It is highly recommended that you only deploy code built from the main branch into your production clusters.

Clone the repo.
Install .NET 6
Build.

Note: By default, FO runs as NetworkUser on Windows and sfappsuser on Linux. If you want to monitor SF service processes that run as elevated (System) on Windows, then you must also run FO as System on Windows. There is no reason to run as root on Linux under any circumstances (see the Capabilities binaries implementations, which allow for FO to run as sfappsuser and successfully execute specific commands that require elevated privilege).

For Linux deployments, we have ensured that FO will work as expected as a normal (non-root) user. To do this, we implemented a setup script that sets Capabilities on three proxy binaries, each of which can only run specific commands as root. If you deploy from VS, then you will need to use FabricObserver/PackageRoot/ServiceManifest.linux.xml (just copy its contents into ServiceManifest.xml or add the new piece, which is simply a SetupEntryPoint section).

If you use the FO build script, then it will take care of any configuration modifications automatically for Linux build output.

The build scripts include code build, sfpkg generation, and nupkg generation. They are all located in the top level directory of this repo.

FabricObserver can be run and deployed through Visual Studio or PowerShell, like any SF app. If you want to add this to your Azure Pipelines CI, see FOAzurePipeline.yaml for msazure devops build tasks. Please keep in mind that if your target servers do not already have .NET 6 installed (VM images deployed from the Azure gallery will not have it), then you must deploy the SelfContained package.

Deploy FabricObserver

Note: You must deploy this version (3.2.15) to clusters that are running SF 9.0 and above. This version also requires .NET 6. You can deploy FabricObserver (and ClusterObserver) using Visual Studio (if you build the sources yourself), PowerShell, or ARM. Please note that this version of FabricObserver no longer supports the DefaultServices node in ApplicationManifest.xml. This means that if you deploy using PowerShell, you must create an instance of the service as the last command in your script. This was done specifically to support ARM deployment. The StartupServices.xml file you see in the FabricObserverApp project now contains the service information once held in ApplicationManifest's DefaultServices node. Note that this information is primarily useful for deploying from Visual Studio. Your ARM template or PowerShell script will contain all the information necessary for deploying FabricObserver.

Deploy FabricObserver using ARM

Learn how to deploy FabricObserver using ARM

Deploy FabricObserver using Client (PowerShell)

First, adjust the configuration settings to meet your needs. This means changing settings in Settings.xml for ObserverManager (ObserverManagerConfiguration section) and in ApplicationManifest.xml for observers. Then run the following commands.

NOTE: In version 3.2.0 and higher, you must create a service instance after you create the application.

#cd to the top level repo directory where you cloned FO sources.

cd C:\Users\me\source\repos\service-fabric-observer

#Build FO (Release)

./Build-FabricObserver

#create a $path variable that points to the build output:
#E.g., for Windows deployments:

$path = "C:\Users\me\source\repos\service-fabric-observer\bin\release\FabricObserver\win-x64\self-contained\FabricObserverType"

#For Linux deployments:

#$path = "C:\Users\me\source\repos\service-fabric-observer\bin\release\FabricObserver\linux-x64\self-contained\FabricObserverType"

#Connect to target cluster, for example:

Connect-ServiceFabricCluster -ConnectionEndpoint @('sf-win-cluster.westus2.cloudapp.azure.com:19000') -X509Credential -FindType FindByThumbprint -FindValue '[thumbprint]' -StoreLocation LocalMachine -StoreName 'My'

#Copy $path contents (FO app package) to server:

Copy-ServiceFabricApplicationPackage -ApplicationPackagePath $path -CompressPackage -ApplicationPackagePathInImageStore FO3215 -TimeoutSec 1800

#Register FO ApplicationType:

Register-ServiceFabricApplicationType -ApplicationPathInImageStore FO3215

#Create FO application (if not already deployed at lesser version):

New-ServiceFabricApplication -ApplicationName fabric:/FabricObserver -ApplicationTypeName FabricObserverType -ApplicationTypeVersion 3.2.15   

#Create the Service instances (-1 means all nodes, which is what is required for FO):  

New-ServiceFabricService -Stateless -PartitionSchemeSingleton -ApplicationName fabric:/FabricObserver -ServiceName fabric:/FabricObserver/FabricObserverService -ServiceTypeName FabricObserverType -InstanceCount -1

#OR if updating existing version:  

Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/FabricObserver -ApplicationTypeVersion 3.2.15 -Monitored -FailureAction rollback

Observer Model

FO is composed of Observer objects (instance types) that are designed to observe, record, and report on several machine-level environmental conditions inside a Windows or Linux (Ubuntu) VM hosting a Service Fabric node.

Here are the current observers and what they monitor:

Resource	Observer
Application (services) resource usage health monitoring across CPU, File Handles, Memory, Ports (TCP), Threads	AppObserver
Looks for dmp and zip files in AppObserver's MemoryDumps folder, compresses (if necessary) and uploads them to your specified Azure storage account (blob only, AppObserver only, and still Windows only in this version of FO)	AzureStorageUploadObserver
Application (user) and cluster certificate health monitoring	CertificateObserver
Container resource usage health monitoring across CPU and Memory	ContainerObserver
Disk (local storage disk health/availability, space usage, IO, Folder size monitoring)	DiskObserver
SF System Services resource usage health monitoring across CPU, File Handles, Memory, Ports (TCP), Threads	FabricSystemObserver
Networking - general health and monitoring of availability of user-specified, per-app endpoints	NetworkObserver
CPU/Memory/File Handles(Linux)/Firewalls(Windows)/TCP Ports usage at machine level	NodeObserver
OS/Hardware - OS install date, OS health status, list of hot fixes, hardware configuration, AutoUpdate configuration, Ephemeral TCP port range, TCP ports in use, memory and disk space usage	OSObserver
Service Fabric Configuration information	SFConfigurationObserver
Another resource you find important	Observer that you implement

To learn more about the current Observers and their configuration, please see the Observers readme.

Just observe it.

Operational Telemetry

Please see FabricObserver Operational Telemetry for detailed information on the user agnostic (Non-PII) data FabricObserver sends to Microsoft (opt out with a simple configuration parameter change). Please consider leaving this enabled so your friendly neighborhood Service Fabric devs can understand how FabricObserver is doing in the real world. We would really appreciate it!

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

Please see CONTRIBUTING.md for development process information.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

service-fabric-observer's Issues

[FEATURE REQUEST] Add Folder Size Monitoring to DiskObserver

Enable specifying folder paths to monitor for size in DiskObserver (with support for specifying folder/file search patterns). Configure FO to warn (in SFX and via telemetry) when a specified folder exceeds a given max size. This will put the related node into Warning (which is what DiskObserver does for total disk space usage today) with all related details in the health event and ETW/telemetry events. This will require adding new configuration for DiskObserver, to be supplied as overridable application parameters.

This feature can be enabled or disabled via setting.
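
To make the request concrete, here is a minimal Python sketch of a folder size check with a file search pattern. The function names, the megabyte unit, and the "Warning"/"Ok" return values are assumptions chosen for illustration; they do not describe how DiskObserver implements (or will implement) this feature.

```python
import fnmatch
import os


def folder_size_mb(path, pattern="*"):
    """Sum the sizes (in MB) of files under path whose names match pattern."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in fnmatch.filter(files, pattern):
            try:
                total_bytes += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file may have been deleted or locked between listing and stat.
                continue
    return total_bytes / (1024 * 1024)


def check_folder(path, max_size_mb, pattern="*"):
    """Return (state, size_mb); state is "Warning" when the size exceeds the maximum."""
    size_mb = folder_size_mb(path, pattern)
    return ("Warning" if size_mb > max_size_mb else "Ok", size_mb)
```

A caller would run `check_folder` once per configured folder and attach the returned size to the health event details.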

Add ARM template for FO deployment (using Releases SFPKG)

Add an ARM template to the repo that can be used to deploy FO using the appropriate SFPKG located in Releases. Given that most of FO's settings are overridable Application Parameters, this would be helpful for quick and easy deployment. There is more work to do to make this possible, but we're close.

DiskObserver Events do not indicate a drive letter or ID

Hi,

I'm looking at the DiskObserver events in AppInsights and it seems like it is giving me events for the individual disks, but doesn't include any way to determine which drive ID or Letter they are for. Could the output of DiskObserver be amended to include that?

I can see individual drive letters under the DriveInfo property of the OSObserver events but obviously this isn't particularly easy to graph from.

I am quite new to AppInsights, so if I'm just missing something in the existing output of DiskObserver, please let me know.

Thanks,
Mark

Windows: Replace Memory Perf Counters (process, VM) with Win32 API calls.

Windows: Replace Memory perf counters with a different approach (Win32 APIs) to enable sane detection of same-named processes for use by AppObserver and FabricSystemObserver. Also, replace the VM perf counter for Memory use with a Win32 API call.

FabricObserverWebApi: Check if deployed, else don't run code related to it in FO...

Enable user setting (bool) ObserverWebApiEnabled and also have ObserverManager check if ObsWeb is deployed in cluster (can only check for the default app name, thus provide a user setting in Settings.xml as well...). For the code that runs to support ObsWeb from FO, don't run it if ObsWeb is not deployed...

[BUG] AppObserver: missing Concurrency setting in CPU FRUD.

There is a bug in 3.1.25 where AppObserver is missing the (optional, defaults to false) concurrency setting in the CPU FRUD (FabricResourceUsageData) constructor. This means that random concurrency issues can take place while processing CPU metric data, since the underlying data type is IList and not IProducerConsumerCollection. Most folks will not encounter any issues, but there is a possibility of random InvalidOperationExceptions (IOEs) due to concurrent write attempts on a List.

This is already fixed in 3.1.26 develop.

If you are hitting this, then disable concurrent monitoring for AppObserver in ApplicationManifest.xml :

<Parameter Name="AppObserverEnableConcurrentMonitoring" DefaultValue="false" />

You can do this via versionless, parameter-only application upgrade:

Connect-ServiceFabricCluster ...
$appParams = @{ "AppObserverEnableConcurrentMonitoring" = "false";  }
Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/FabricObserver -ApplicationParameter $appParams -ApplicationTypeVersion 3.1.25 -UnMonitoredAuto

[BUG] ContainerObserver: Incorrect values for MonitoredAppCount.

Describe the bug
ContainerObserver is reporting incorrect value for MonitoredAppCount. The value being reported is the number of deployed applications on the node.

To Reproduce
Enable ContainerObserver and ObserverManagerEnableOperationalFOTelemetry. Run FO.

Expected behavior
The value for MonitoredAppCount for ContainerObserver should only include containerized (docker) applications.

FabricObserver.dll not signed in the released sfpkg

Please close this if I've made a mistake but it appears the FabricObserver.dll is not signed.

Cluster observer seems OK.

Thanks

ObserverHealthReporter does not support all available health report types needed for fabric observer plugins

I'd like to have some more flexibility with health reporting options whilst utilising the existing health reporter class.

PR will add the types from here except cluster health reporting as there is a separate Cluster Observer app that handles that level of information: https://docs.microsoft.com/en-us/dotnet/api/system.fabric.health.healthreport?view=azure-dotnet

Thanks, Adam

App insights cluster observer event name not set as expected

Hi Charles

For cluster observer telemetry to app insights I am not seeing the event name set as expected. I think this is because in the App insights ReportHealthAsync function it does the following:

$"{telemetryData.ObserverName ?? "ClusterObserver"}DataEvent" but the calling code will do a null check before this and send in an empty string.

Looks like the following:

Thanks!
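
The mismatch described in this issue can be illustrated with a small Python sketch. It assumes, as the report states, that the caller passes an empty string rather than null; C#'s `??` operator only falls back on null, so the empty string passes through. Both function names are hypothetical.

```python
def event_name_null_only(observer_name):
    """Mirror C#'s `??`: only None (null) triggers the fallback name."""
    name = observer_name if observer_name is not None else "ClusterObserver"
    return f"{name}DataEvent"


def event_name_null_or_empty(observer_name):
    """Treat None and "" alike, similar to checking string.IsNullOrEmpty first."""
    name = observer_name or "ClusterObserver"
    return f"{name}DataEvent"


# An empty observer name slips past the null-only check.
print(event_name_null_only(""))      # DataEvent
print(event_name_null_or_empty(""))  # ClusterObserverDataEvent
```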

sfpkg version not matching release version

Hi Charles

First, just want to say thank you for answering me so quickly on the issues this week. Please let me know if you aren't able to get to all of these and I'll try and hold off. Just trying to get this up and running in production!

I'm automating downloading the sfpkg files, then applying our plugins + deploying. It seems something odd is happening for this though. The released source zip looks correct, but the sfpkg has 3.0.10 as the application version, but 3.0.9 as the service manifest version.

I also see that when using this sfpkg, I get runtime errors when trying to use the partition Id on the health report so I think something funky has gone on with the build/release there.

Thank you

OSObserver reporting hot fixes unformatted to app insights

Description
App insights is now reporting hotfixes in a hyper link format, rather than a simple list as in the previous release.

To Reproduce
Report OSObserver to app insights and observe in the log search, shown in attached screenshot.

Expected behavior
Hot fixes to report in a simple list as before, KX111, KX222 etc

Screenshots
Attached.

Desktop (please complete the following information):

OS: Windows
Version: Windows 10

Additional context
Happening since upgrading to the most recent release, 3.1.11

Move some ObserverManager parameters into AppManifest (Overrides)

There are settings in the ObserverManager section of Settings.xml that would be useful to pull into ApplicationManifest.xml as overridable application parameters. They could then be set via versionless application upgrades, instead of having to redeploy FO in order to change them.
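
The idea behind overridable application parameters can be sketched in Python: the manifest declares defaults, and a parameter-only upgrade supplies replacement values for some of them. The fragment below is a simplified, namespace-free stand-in for ApplicationManifest.xml with assumed default values, and `effective_parameters` is a hypothetical helper, not part of FO or Service Fabric.

```python
import xml.etree.ElementTree as ET

# Simplified fragment; a real ApplicationManifest.xml uses an XML namespace
# and contains many more elements. The default values here are assumptions.
MANIFEST_FRAGMENT = """
<Parameters>
  <Parameter Name="AppObserverEnableVerboseLogging" DefaultValue="true" />
  <Parameter Name="AppObserverEnableConcurrentMonitoring" DefaultValue="true" />
</Parameters>
"""


def effective_parameters(manifest_xml, overrides):
    """Merge upgrade-supplied overrides over the defaults declared in the manifest."""
    root = ET.fromstring(manifest_xml)
    params = {p.get("Name"): p.get("DefaultValue") for p in root.iter("Parameter")}
    params.update(overrides)
    return params


merged = effective_parameters(
    MANIFEST_FRAGMENT, {"AppObserverEnableVerboseLogging": "false"}
)
print(merged["AppObserverEnableVerboseLogging"])
```

Settings that live only in Settings.xml have no such override path, which is why moving them into the manifest makes them changeable without a redeploy.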

[BUG] CpuUsage - First result is always 0

There is a minor bug in the CpuUsage utility where the first result stored in the list of readings is 0. This can understandably skew the average of the readings, which is the value (a double) used to determine whether a supplied threshold has been breached. This is fixed in 3.1.19 (see develop branch). In the interim, just increase (reasonably) the monitor duration for AppObserver.

Note: If you are running on capable hardware (logical processors >= 4) and have enabled concurrent monitoring in AppObserver, then you should set MonitorDuration to be at least the number of cores on the target (virtual) machines. You can be more liberal with this setting when running on capable CPU configurations.

As always, measure and make sure the duration supplied is necessary and results in a useful number of readings. Be careful here when running sequentially as the math is pretty simple: number of app service processes * monitor duration will account for most of the time AppObserver runs.
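
A short Python sketch shows both effects discussed above: how a leading 0 drags the average down, and the simple arithmetic for sequential run time. The readings, process count, and duration are made-up values, and the helper names are hypothetical.

```python
from statistics import mean


def average_cpu(readings, skip_first=False):
    """Average CPU readings, optionally dropping the first (spurious) sample."""
    data = readings[1:] if skip_first and len(readings) > 1 else readings
    return mean(data) if data else 0.0


def sequential_run_seconds(process_count, monitor_duration_seconds):
    """Approximate AppObserver run time when processes are monitored one at a time."""
    return process_count * monitor_duration_seconds


readings = [0.0, 90.0, 92.0, 94.0]
print(average_cpu(readings))                   # 69.0
print(average_cpu(readings, skip_first=True))  # 92.0
print(sequential_run_seconds(50, 3))           # 150
```

With a threshold of 80, the first average would miss a breach that the second one catches. A longer monitor duration dilutes the spurious 0, at the cost of a longer run.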

Nuget packages in custom observers

Is it possible to support the use of nuget packages in observer plugins?

Add ContainerObserver to FO project

ContainerObserver was initially released as an FO Plugin project to demonstrate how to write a plugin that does something useful (unlike SampleNewObserver project). It is now time to add this observer to FO.

NetworkObserver: Filter duplicate endpoints before testing

As discussed in #61, sometimes multiple applications connect to the same endpoint. As such in the NetworkObserver.config.json file the same endpoint may be declared for multiple applications.

It's desirable for all of those applications to show a warning state when that endpoint is unavailable, but wasteful for NetworkObserver to check those endpoints more than once on each invocation. It would be a good enhancement to have NetworkObserver filter out those duplicates before it performs testing.
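
The enhancement can be sketched in Python as follows. The configuration shape (a dict of application name to endpoint tuples) and the `probe` callable are assumptions for illustration and do not reflect the real NetworkObserver.config.json schema or code.

```python
def find_unreachable(app_endpoints, probe):
    """Probe each distinct endpoint once and report failures per application.

    app_endpoints: dict mapping app name -> list of (host, port) tuples.
    probe: callable taking (host, port) and returning True when reachable.
    """
    # Collapse duplicates across applications so each endpoint is tested once.
    unique_endpoints = {ep for endpoints in app_endpoints.values() for ep in endpoints}
    reachable = {ep: probe(*ep) for ep in unique_endpoints}

    # Fan the shared results back out so every affected app is still flagged.
    failures = {}
    for app, endpoints in app_endpoints.items():
        down = [ep for ep in endpoints if not reachable[ep]]
        if down:
            failures[app] = down
    return failures
```

Each application that declares a failing endpoint still ends up in the result, but the probe itself runs only once per distinct endpoint.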

[FEATURE REQUEST] Monitor and alert on KVS database metrics

Is your feature request related to a problem? Please describe.
Provide an observer to monitor KVS database for various metrics and alert on them

Describe the solution you'd like
Monitor at least the following metrics

LVID usage
Database size

Describe alternatives you've considered
These can be done today by manually checking perf counters or examining size of database on nodes

Automated deployment approach

Hi,

I'm setting up an Azure DevOps pipeline to deploy Service Fabric Observer in our environment. I wanted to validate (at a high level) the approach I'm considering in case you can see any flaws in it. I am doing the following:

Placing the SignedCO.sfpkg and SignedFO.sfpkg files in my repo (which is then built as an artifact) under different subdirectories. I have then renamed these to .zip and copied from them the /config directories (and the .config files from the code directory of FabricObserver). I'm storing these files in a directory alongside the zip.

I then rename the sfpkg.zip files back to sfpkg. I now plan to customise my copies of the configuration files to my requirements.

In my pipeline I do the following:

Rename the .sfpkg files to sfpkg.zip
Extract the sfpkg.zip files
Copy my versions of the config files over the default equivalents in the extracted packages
Deploy the applications

Does this seem like a reasonable approach? Am I overcomplicating things? I'm trying to make it so that when new versions are released I'll just have to drop the new sfpkg files into my repo, but I appreciate I'd also need to look for breaking changes to the structure/content of the default config files.

Thanks in advance.
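
For reference, the extract-and-overlay steps described in this issue could be scripted roughly as below. This is only a sketch: it assumes the sfpkg is a plain zip archive (as the renaming approach implies) and that the custom config directory mirrors the package's folder layout. The function name and paths are hypothetical.

```python
import shutil
import zipfile
from pathlib import Path


def overlay_configs(sfpkg_path, config_dir, out_dir):
    """Extract an sfpkg and copy customized config files over the defaults."""
    out_dir = Path(out_dir)
    config_dir = Path(config_dir)

    # zipfile inspects the file content, so no rename to .zip is needed.
    with zipfile.ZipFile(sfpkg_path) as package:
        package.extractall(out_dir)

    # Copy each customized file to the same relative path in the extracted package.
    for src in config_dir.rglob("*"):
        if src.is_file():
            dest = out_dir / src.relative_to(config_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
    return out_dir
```

The extracted folder can then be handed to the deployment step. Diffing the default configs between releases remains a manual check.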

Updating the configuration in runtime

It would be nice to have logic to update configuration parameters at runtime without redeploying the application.

[BUG] Internal Telemetry: ConcurrencyEnabled measurement not working.

FO Diagnostic Telemetry: Missing data

EnableConcurrentMonitoring bool is not being set for internal diagnostic telemetry. There is no code that sets the field... Add the code.

Move LocalLogPath to AppManifest (make it Overridable)

Enable versionless application parameter upgrades of the local log directory.

Make the LocalLogPath setting MustOverride and add the parameter to ApplicationManifest.xml.

[BUG] AppObserverEnableVerboseLogging is set to true in 3.1.21 Release builds.

In 3.1.21 release, verbose logging is enabled for AppObserver (this is not correct for Release builds, sorry about that). Please disable this in ApplicationManifest.xml before deploying to production by setting the AppObserverEnableVerboseLogging parameter to false:

<!-- Verbose Logging -->
    <Parameter Name="AppObserverEnableVerboseLogging" DefaultValue="false" />

Or, you can run a versionless, parameter-only application upgrade and turn it off if you have already deployed to production:

E.g.,

Connect-ServiceFabricCluster -ConnectionEndpoint @('foo-bar-42.westus.cloudapp.azure.com:19000') -X509Credential -FindType FindByThumbprint -FindValue '[thumbprint]' -StoreLocation LocalMachine -StoreName 'My' -ServerCommonName @('[serverCommonName]')
$appParams = @{ "AppObserverEnableVerboseLogging" = "false";  }
Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/FabricObserver -ApplicationParameter $appParams -ApplicationTypeVersion 3.1.21 -UnMonitoredAuto

This will be corrected in 3.1.22.

[WORK ITEM] Docs

Update docs (and images) to reflect reality presented in 3.2+.

Troubleshooting NetworkObserver

Hi,

I've just configured NetworkObserver to monitor an Azure SQL connection endpoint (port 1433). Immediately after adding the config it went into warning state as follows:

'FO022' reported Warning for property 'NetworkHealth'.
NetworkObserver detected Warning threshold breach. Outbound Internet connection failure detected for endpoint <redacted>.database.windows.net

I've logged on to the nodes where the app I configured this check for is running and can do a Test-NetConnection on each of them verifying the endpoint/port can be reached. I've turned on verbose logging on one node and it just shows the connection test occurring but not any errors.

What can I do to further troubleshoot why it's alerting? Does the test run from all the nodes where the app runs? Does it go into warning if just 1 of them doesn't connect?

The documentation makes reference that it can do ICMP or port checks. ICMP does fail to this address but I can't see any example of how you select whether the test is ICMP or port based.

Thanks,
Mark

Fabric Observer crashes 0xc0000005 Access violation while searching for Windows Fabric Database counter category

Describe the bug
FabricObserver crashes repeatedly on one of five nodes with the below pair of errors in Windows event viewer. Reproduced in versions 3.1.23 and 3.1.25 on a standalone Windows Service Fabric cluster at version 7.2.457.9590.

Application: FabricObserver.exe
CoreCLR Version: 4.700.22.11601
.NET Core Version: 3.1.23
Description: The process was terminated due to an internal error in the .NET Runtime at IP 00007FFB96651EE7 (00007FFB964C0000) with exit code c0000005.

and

Faulting application name: FabricObserver.exe, version: 3.1.25.0, time stamp: 0x619ae151
Faulting module name: coreclr.dll, version: 4.700.22.11601, time stamp: 0x620d7666
Exception code: 0xc0000005
Fault offset: 0x0000000000191ee7
Faulting process id: 0xa78
Faulting application start time: 0x01d840697976f54c
Faulting application path: C:\ProgramData\SF\aws-bastian05\Fabric\work\Applications\FabricObserverType_App61\FabricObserverPkg.Code.3.1.25\FabricObserver.exe
Faulting module path: C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.1.23\coreclr.dll
Report Id: 83a711cb-8864-44f6-bd1d-bec902d28ffc
Faulting package full name: 
Faulting package-relative application ID:

To Reproduce
Deploy FabricObserver 3.1.23 or 3.1.25 to a Service Fabric cluster at version 7.2.457.9590

Expected behavior
Fabric Observer runs without crashing.

Desktop (please complete the following information):

OS: Windows
Version: Windows Server 2016 Datacenter

Additional context
I captured a crash dump which led me to this call stack:

 	System.Diagnostics.PerformanceCounter.dll!System.Diagnostics.PerformanceDataRegistryKey.GetValue(string name, bool usePool) Line 65	C#
 	System.Diagnostics.PerformanceCounter.dll!System.Diagnostics.PerformanceMonitor.GetData(string item, bool usePool) Line 1333	C#
 	System.Diagnostics.PerformanceCounter.dll!System.Diagnostics.PerformanceCounterLib.GetPerformanceData(string item, bool usePool) Line 1027	C#
 	System.Diagnostics.PerformanceCounter.dll!System.Diagnostics.PerformanceCounterLib.CategoryTable.get() Line 128	C#
 	System.Diagnostics.PerformanceCounter.dll!System.Diagnostics.PerformanceCounterLib.CategoryExists(string machine, string category) Line 283	C#
 	System.Diagnostics.PerformanceCounter.dll!System.Diagnostics.PerformanceCounterCategory.Exists(string categoryName, string machineName) Line 436	C#
 	System.Diagnostics.PerformanceCounter.dll!System.Diagnostics.PerformanceCounterCategory.Exists(string categoryName) Line 416	C#
>	FabricObserver.dll!FabricObserver.Observers.ObserverManager.IsLVIDPerfCounterEnabled() Line 1247	C#
 	FabricObserver.dll!FabricObserver.Observers.ObserverManager.StartObserversAsync() Line 213	C#
 	FabricObserver.dll!FabricObserver.FabricObserver.RunAsync(System.Threading.CancellationToken cancellationToken) Line 49	C#
 	Microsoft.ServiceFabric.Services.dll!Microsoft.ServiceFabric.Services.Runtime.StatelessService.Microsoft.ServiceFabric.Services.Runtime.IStatelessUserServiceInstance.RunAsync(System.Threading.CancellationToken cancellationToken)	Unknown
 	Microsoft.ServiceFabric.Services.dll!Microsoft.ServiceFabric.Services.Runtime.StatelessServiceInstanceAdapter.ExecuteRunAsync(System.Threading.CancellationToken runAsyncCancellationToken)	Unknown
 	Microsoft.ServiceFabric.Services.dll!Microsoft.ServiceFabric.Services.Runtime.StatelessServiceInstanceAdapter.ScheduleRunAsync.AnonymousMethod__0()	Unknown
 	System.Private.CoreLib.dll!System.Threading.Tasks.Task<System.Threading.Tasks.Task>.InnerInvoke() Line 518	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.Task..cctor.AnonymousMethod__274_0(object obj) Line 2428	C#
 	System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread threadPoolThread, System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state) Line 289	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(ref System.Threading.Tasks.Task currentTaskSlot, System.Threading.Thread threadPoolThread) Line 2389	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread threadPoolThread) Line 2327	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteFromThreadPool(System.Threading.Thread threadPoolThread) Line 2312	C#
 	System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch() Line 663	C#
 	System.Private.CoreLib.dll!System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() Line 29	C#

The locals show me we are trying to see if the Windows Fabric Database counter category exists. I can confirm it does exist on the affected node, and so does the Long-Value Maximum LVID counter. I am not sure what environmental difference causes the observer to fail on just this one node.

I can provide the crash dump in full on request, just need somewhere to upload it.

With support for .NET Core 3.1 ending in 6 months, can it be migrated to .NET 6 (LTS)?

.NET Core 3.1 (LTS) reaches end of life on the 13th of December 2022.

To maintain support, can this project please be upgraded to .NET 6 (LTS)?

Telemetry Filtering

If both AppObserver and OSObserver are reporting telemetry to Application Insights, there does not appear to be a property that identifies which observer generated the telemetry. Also, AppObserver telemetry provides no mechanism to discern which application the telemetry is for. The only thing that can be done is to narrow down the results to a particular node.

Check for Windows Updates being disabled should pass if Automatic Updates are set to “never check for updates”

Per comments I added to #59: Currently the code does the following check for if Windows Updates are enabled (which is if the service is enabled then the setting must be "notify before download"):

   this.isWindowsUpdateAutoDownloadEnabled =
                    wuLibAutoUpdates.ServiceEnabled &&
                    wuLibAutoUpdates.Settings.NotificationLevel != AutomaticUpdatesNotificationLevel.aunlNotifyBeforeDownload;

In my environment, the service is enabled but Windows Update is set to "never check for updates". I think the above check should be amended to include that as an additional acceptable sign that Windows updates are disabled (the enum to check I believe would be AutomaticUpdatesNotificationLevel.aunlDisabled).
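A minimal sketch of what the amended check could look like. To keep it compilable on its own, it uses a local stand-in enum instead of the WUApiLib COM type; `NotificationLevel`, `WindowsUpdateCheck`, and `IsAutoDownloadEnabled` are illustrative names, not FabricObserver code, and the numeric values are assumed to mirror the WUApiLib enumeration.

```csharp
using System;

// Stand-in for WUApiLib's AutomaticUpdatesNotificationLevel (assumption: a local
// copy so the sketch compiles without the COM interop assembly).
public enum NotificationLevel
{
    aunlNotConfigured = 0,
    aunlDisabled = 1,
    aunlNotifyBeforeDownload = 2,
    aunlNotifyBeforeInstallation = 3,
    aunlScheduledInstallation = 4
}

public static class WindowsUpdateCheck
{
    // Treats both "notify before download" and "never check for updates"
    // (aunlDisabled) as states in which automatic download is not enabled.
    public static bool IsAutoDownloadEnabled(bool serviceEnabled, NotificationLevel level)
    {
        return serviceEnabled
            && level != NotificationLevel.aunlNotifyBeforeDownload
            && level != NotificationLevel.aunlDisabled;
    }
}
```

With this shape, a node whose service is enabled but whose level is `aunlDisabled` would no longer be flagged.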

Note also that I set virtualMachineProfile.osProfile.windowsConfiguration.enableAutomaticUpdates to false in the ARM template, as per the guide on VMSS automatic OS image upgrades, and this still didn't put Windows Update into a disabled state in my environment (even for new nodes created after the change). It's worth noting, though, that we have a boot script that sets auto updates to "never check for updates" in the registry, so maybe this has overridden the default setting that this ARM property sets (which would have been the one you currently check for). I was surprised the ARM setting didn't just disable the Windows Update service, but it seems it does not.

[BUG] Remove unnecessary docker calls (process creates) in ContainerObserver

Describe the bug
There is no need to call docker stats for each replica or instance. Call it once, then loop over results for each replica or instance. This is fixed in task_parallel branch. Will merge into next release. For now, there is no behavioral bug in the current release, but creating a process for each replica or instance is a waste of resources, particularly for deployments with large numbers of containerized apps.
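A rough sketch of the "call once, then loop" idea, not the code in the task_parallel branch. It assumes the output of a single `docker stats --no-stream --format "{{.Container}},{{.CPUPerc}},{{.MemUsage}}"` call has already been captured as a string; `DockerStatsParser` and `Parse` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class DockerStatsParser
{
    // Parses the output of ONE docker stats call (assumed comma-separated format:
    // container,cpu,mem) into a lookup keyed by container id or name.
    public static Dictionary<string, (string Cpu, string Mem)> Parse(string output)
    {
        var stats = new Dictionary<string, (string Cpu, string Mem)>(StringComparer.OrdinalIgnoreCase);

        foreach (string line in output.Split('\n', StringSplitOptions.RemoveEmptyEntries))
        {
            string[] parts = line.Trim().Split(',');

            // Skip blank or malformed lines rather than failing the whole run.
            if (parts.Length < 3)
            {
                continue;
            }

            stats[parts[0]] = (parts[1], parts[2]);
        }

        return stats;
    }
}
```

Each replica or instance can then look up its container in the returned dictionary instead of starting its own process.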

Fix spelling errors and typos in ErrorCodes.md

Fix spelling and typos in ErrorCodes.md

Increase fractional digits for resource usage values.

The current fractional digits setting (hard-coded in FabricResourceUsageData.cs) for data values is too small (1). Increase size to 2.

[BUG] Regression - AppObserver's targetAppType config setting is ignored.

A regression was discovered in AppObserver configuration processing that impacts the targetAppType setting (only). This is a relatively advanced feature and most users will not be impacted. For the small number of users who do supply targetAppType settings in AppObserver.config.json, you will need to modify your settings by breaking out the apps that share a single Application Type into individual targetApp setting objects. This is due to a code bug in a relatively new feature where AppObserver automatically fixes malformed targetApp values. The bug is non-crashing, but the unfortunate side effect is settings related to targetAppType will effectively be ignored (so, the related services will not be monitored). This will be fixed in 3.1.26.

Config fix in the interim:

Change

{
   "targetAppType": "SomeAppType",
   "memoryWarningLimitPercent": 40,
   "networkWarningEphemeralPorts": 7000
}

to

{
   "targetApp": "SomeApp1",
   "memoryWarningLimitPercent": 40,
   "networkWarningEphemeralPorts": 7000
},
{
   "targetApp": "SomeApp2",
   "memoryWarningLimitPercent": 40,
   "networkWarningEphemeralPorts": 7000
}

etc.

This repo is missing important files

There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.

Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

Merge this pull request

MachineTelemetryData inconsistent naming with TelemetryData

I am putting together some generic Kusto queries for observer data and have one property name tripping me up slightly.

It seems the general convention is a 'Node' property on ETW data but 'NodeName' on Telemetry Data (which ends up in App Insights).

Is it expected, or is it possible for the 'Node' property on the MachineTelemetryData class to match the TelemetryData class
'NodeName'?

Add Start - End DateTime range Observer configuration setting

Add the ability to specify Start - End DateTime range configuration setting to enable only running an observer when the current time falls within the specified range...
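One way the range check could be sketched, assuming an inclusive start and end; `ObserverSchedule` and `IsWithinRange` are hypothetical names, and how the range would be expressed in configuration is not specified in this request.

```csharp
using System;

public static class ObserverSchedule
{
    // Returns true when 'now' falls inside the inclusive [start, end] range.
    public static bool IsWithinRange(DateTime now, DateTime start, DateTime end)
    {
        if (end < start)
        {
            throw new ArgumentException("End must not be earlier than start.", nameof(end));
        }

        return now >= start && now <= end;
    }
}
```

An observer run loop could call this with `DateTime.UtcNow` and skip the observer when it returns false.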

Add support for Thread monitoring (process, AppObserver)

Add Threads metric to AppObserver.

[BUG] AppObserver Concurrency bug: Collection was modified; enumeration operation may not execute

Describe the bug
FO is throwing exceptions when reporting health.
The below is a stack trace found in the log files at: C:\observer_logs\ObserverManager

2021-11-08 00:24:40.3433--WARN--Handled AggregateException from AppObserver:
System.AggregateException: One or more errors occurred. (One or more errors occurred. (Collection was modified; enumeration operation may not execute.))
 ---> System.AggregateException: One or more errors occurred. (Collection was modified; enumeration operation may not execute.)
 ---> System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
   at System.Collections.Generic.List`1.Enumerator.MoveNextRare()
   at System.Linq.Enumerable.Average(IEnumerable`1 source)
   at FabricObserver.Observers.Utilities.FabricResourceUsageData`1.get_AverageDataValue() in D:\a\1\s\FabricObserver.Extensibility\Utilities\FabricResourceUsageData.cs:line 138
   at FabricObserver.Observers.ObserverBase.ProcessResourceDataReportHealth[T](FabricResourceUsageData`1 data, T thresholdError, T thresholdWarning, TimeSpan healthReportTtl, HealthReportType healthReportType, ReplicaOrInstanceMonitoringInfo replicaOrInstance, Boolean dumpOnError) in D:\a\1\s\FabricObserver.Extensibility\ObserverBase.cs:line 1180
   at FabricObserver.Observers.AppObserver.<>c__DisplayClass43_0.<ReportAsync>b__0(ReplicaOrInstanceMonitoringInfo repOrInst, ParallelLoopState state) in D:\a\1\s\FabricObserver\Observers\AppObserver.cs:line 369
   at System.Threading.Tasks.Parallel.<>c__DisplayClass19_0`1.<ForWorker>b__1(RangeWorker& currentWorker, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.Tasks.Parallel.<>c__DisplayClass19_0`1.<ForWorker>b__1(RangeWorker& currentWorker, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
   at System.Threading.Tasks.TaskReplicator.Replica.Execute()
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.TaskReplicator.Run[TState](ReplicatableUserAction`1 action, ParallelOptions options, Boolean stopOnFirstFailure)
   at System.Threading.Tasks.Parallel.ForWorker[TLocal](Int32 fromInclusive, Int32 toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.Tasks.Parallel.ThrowSingleCancellationExceptionOrOtherException(ICollection exceptions, CancellationToken cancelToken, Exception otherException)
   at System.Threading.Tasks.Parallel.ForWorker[TLocal](Int32 fromInclusive, Int32 toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
   at System.Threading.Tasks.Parallel.ForEachWorker[TSource,TLocal](IEnumerable`1 source, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
   at System.Threading.Tasks.Parallel.ForEach[TSource](IEnumerable`1 source, ParallelOptions parallelOptions, Action`2 body)
   at FabricObserver.Observers.AppObserver.ReportAsync(CancellationToken token) in D:\a\1\s\FabricObserver\Observers\AppObserver.cs:line 142
   at FabricObserver.Observers.AppObserver.ObserveAsync(CancellationToken token) in D:\a\1\s\FabricObserver\Observers\AppObserver.cs:line 133
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait(TimeSpan timeout)
   at FabricObserver.Observers.ObserverManager.RunObserversAsync()

To Reproduce
Deploy the latest Service Fabric Observer using default config to a fresh 1-node 'OneBox' Development Cluster using the latest version of Service Fabric (v8.2.1235.9590).

Expected behavior
Exceptions to not be thrown.
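The trace shows a `List` enumerator failing inside `Average` while running under `Parallel.ForEach`, which is what happens when a list is modified during enumeration. A common way to avoid this class of failure is to guard the list with a lock and average over a snapshot. The sketch below illustrates that pattern only; `SafeUsageData` is a hypothetical type, not the actual `FabricResourceUsageData` fix.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class SafeUsageData
{
    private readonly object gate = new object();
    private readonly List<double> data = new List<double>();

    // Writers take the lock so the list never changes mid-copy.
    public void Add(double value)
    {
        lock (gate)
        {
            data.Add(value);
        }
    }

    // Readers copy under the lock, then enumerate the private copy.
    public double Average
    {
        get
        {
            double[] snapshot;

            lock (gate)
            {
                snapshot = data.ToArray();
            }

            return snapshot.Length == 0 ? 0 : snapshot.Average();
        }
    }
}
```

A concurrent collection would also work; the lock-and-snapshot form is shown because it keeps the `List`-based shape seen in the trace.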

Desktop (please complete the following information):

OS: Windows 10 PRO
Version: 21H1 (OS Build 19043.1288)

Additional context
Add any other context about the problem here.

Increase Code Coverage in unit tests.

Increase code coverage in unit tests to be greater than 80%.
AB#13891277

What is the best way to debug plugins?

What is the best process for debugging plugins that have been dropped in the plugin folder and loaded dynamically?

Thanks, Adam

[FEATURE REQUEST] Cluster upgrade monitor notifications support

Is your feature request related to a problem? Please describe.
As an operator I would like to know when cluster upgrades are occurring and their status. Today SFRP only supports notification if there is a failure. See https://docs.microsoft.com/en-gb/azure/service-fabric/service-fabric-cluster-upgrade-version-azure#register-for-notifications

Describe the solution you'd like
A clear and concise description of what you want to happen.

Able to easily configure an email notification to stay informed
https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-diagnostics-event-generation-operational#cluster-events

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Alternatives are described in https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-diagnostics-event-generation-operational#cluster-events , but not prescriptive. It's up to the customer to figure out and may incur more charges.

Additional context
Add any other context or screenshots about the feature request here.

Rewrite Internal Diagnostic Telemetry Impl

Replace current FO operational telemetry impl with one that includes more useful (no PII, as always) information for use in understanding how FO is working and how it is helping in the real world. This is an opt-out feature with a simple on-off switch as a setting in Settings.xml. This must also include local log entry (single file with overwrites) that contains the exact data sent to Microsoft each time the telemetry is sent. This is so customers can inspect what data was sent and use that to inform the on-off setting. Our hope is that customers will leave this enabled so we can gain useful insights that we do not have today. This will also fire off a FabricObserverEtwProvider ETW event (FabricObserverOperationalEvent) that can be sent to your ETW upstream store.

This is the current shape of the data to send (open to feedback and suggestions, of course):

{
"EventName": "OperationalEvent",
"TaskName": "FabricObserver",
"EventRunInterval": "04:00:00",
"ClusterId": "",
"ClusterType": "",
"TenantId": "",
"NodeNameHash": "4182308305",
"FOVersion": "3.1.17",
"UpTime": "00:00:27.6376359",
"Timestamp": "2021-08-11T21:28:19.1448948Z",
"OS": "Windows",
"EnabledObserversCount": 9,
"AppObserverTotalMonitoredApps": 2,
"AppObserverTotalMonitoredServiceProcesses": 1,
"AppObserverErrorDetections": 0,
"AppObserverWarningDetections": 0,
"AzureStorageUploadObserverEnabled": 1,
"CertificateObserverErrorDetections": 0,
"CertificateObserverWarningDetections": 0,
"ContainerObserverTotalMonitoredApps": 2,
"ContainerObserverTotalMonitoredContainers": 2,
"ContainerObserverErrorDetections": 0,
"ContainerObserverWarningDetections": 2,
"DiskObserverErrorDetections": 0,
"DiskObserverWarningDetections": 0,
"FabricSystemObserverTotalMonitoredApps": 1,
"FabricSystemObserverTotalMonitoredServiceProcesses": 9,
"FabricSystemObserverErrorDetections": 0,
"FabricSystemObserverWarningDetections": 0,
"NetworkObserverTotalMonitoredApps": 0,
"NetworkObserverTotalMonitoredServiceProcesses": 0,
"NetworkObserverErrorDetections": 0,
"NetworkObserverWarningDetections": 0,
"NodeObserverErrorDetections": 0,
"NodeObserverWarningDetections": 0,
"OSObserverErrorDetections": 0,
"OSObserverWarningDetections": 0
}

If FO crashes with an unhandled exception that can be caught, it will send telemetry containing the exception details (FO stack only, no PII data, as always).

[BUG] CPU Usage increased 30% coming from build 3.0.8 to 3.1.15

Overview
Hi Team, we recently updated our FabricObserver version from 3.0.8 to 3.1.15 and this correlates with a large 25-30% increase in CPU usage across all our clusters.

We tried increasing ObserverLoopSleepTimeSeconds from 30 seconds to 120 seconds, which decreased usage by 10%.

Looking for advice on how we can reduce this further and get performance similar to 3.0.8. Presumably, more components have been added since then but it's hard as a user to understand where the perf hits may be coming from.

Thanks for your help.

To Reproduce
Upgrade from 3.0.8 to 3.1.15.

Expected behavior
A significantly smaller increase in CPU usage.

Screenshots
CPU Charts showing the increase after update and subsequent decrease after increasing ObserverLoopSleepTimeSeconds to 120 seconds

Desktop (please complete the following information):

OS: Windows
Version: Windows Server 2019 (Azure)

Additional context

Settings we have changed compared to the default Settings.xml

<Parameter Name="ObserverLoopSleepTimeSeconds" Value="120" />
<Parameter Name="ObserverExecutionTimeout" Value="1800" />
<Parameter Name="EnableVerboseLogging" Value="true" />

Settings we have changed compared to the default ApplicationManifest.xml

<Parameter Name="AppObserverEnableEtw" DefaultValue="true" />
<Parameter Name="CertificateObserverEnableEtw" DefaultValue="true" />
<Parameter Name="DiskObserverEnableEtw" DefaultValue="true" />
<Parameter Name="FabricSystemObserverEnableEtw" DefaultValue="true" />
<Parameter Name="NetworkObserverEnableEtw" DefaultValue="true" />
<Parameter Name="NodeObserverEnableEtw" DefaultValue="true" />
<Parameter Name="OSObserverEnableEtw" DefaultValue="true" />
<Parameter Name="SFConfigurationObserverEnableEtw" DefaultValue="true" />

Request: Create Release of the Fabric Observer Web App

Digging through the readme/documentation there are some mentions of the Fabric Observer Web App.
I can download and compile it, which is fine, but the other two apps, FabricObserver and ClusterObserver, are both packaged as releases here on GitHub. It would be nice if the same were true for the web app.

Add support for concurrent process monitoring

For nodes with large numbers of monitored processes, AppObserver should be able to parallelize work to decrease the amount of time it would take to run through, say, 100s of processes sequentially.

So, let's say we have 150 apps that have in total 200 service processes plus their descendant processes, say 100 children. So, now we have 300 processes that AppObserver will monitor and report on, sequentially. If you have MonitorDuration set to 1s, then it will take AppObserver at least 5 minutes to monitor all of these processes. Now, imagine 1000 processes: 1000 seconds = ~17 minutes, etc.

For capable hardware (logical processors >= 4) AppObserver, ContainerObserver, and FabricSystemObserver will attempt to run process monitoring code in parallel, depending upon the state of the CPU and thread availability at the time the monitoring code runs. If the logical processor count is below 4, then the monitoring behavior will remain sequential (for loops), as before. This will ship in 3.1.18 release and is available in the develop branch today. Note, this may churn, so do not take a dependency on develop, as always.
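The gating described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the develop branch: `ProcessMonitorLoop`, `Run`, and the `monitor` callback are hypothetical, and the logical processor count is passed in (for example from `Environment.ProcessorCount`) so both paths are easy to exercise.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ProcessMonitorLoop
{
    public static void Run(
        IReadOnlyList<int> processIds,
        Action<int> monitor,
        int logicalProcessors,
        CancellationToken token)
    {
        if (logicalProcessors >= 4)
        {
            // Capable hardware: let the TPL schedule work across available threads.
            var options = new ParallelOptions
            {
                CancellationToken = token,
                MaxDegreeOfParallelism = logicalProcessors
            };

            Parallel.ForEach(processIds, options, monitor);
        }
        else
        {
            // Fewer than 4 logical processors: keep the sequential behavior.
            foreach (int id in processIds)
            {
                token.ThrowIfCancellationRequested();
                monitor(id);
            }
        }
    }
}
```

Because the callback may run on several threads at once in the parallel path, anything it writes to must be thread-safe.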

Info on "Windows Update" setting

It would be great if we could add a banner, warning, or link for clusters where Windows Update is still enabled and potentially causing VMs to reboot unexpectedly. The link in the banner should point to the recommended SF configuration for VMSS OS Updates or POA.

Update Documentation in various places.

There are places where the documentation is no longer accurate or reflective of reality (minor changes, but they need to be addressed regardless).

NetworkUsage.cs TupleGetDynamicPortRange: unhandled exception in GetSystemCpuMemoryValuesAsync with a 4-digit 'Start Port'

TupleGetDynamicPortRange in NetworkUsage.cs throws an unhandled exception, surfaced in GetSystemCpuMemoryValuesAsync, when the dynamic port range has a 4-digit start port:

C:\r>netsh int ipv4 show dynamicportrange tcp

Protocol tcp Dynamic Port Range

Start Port : 1024 <---- 4 digit number causing substring offset to be wrong
Number of Ports : 64511

2020-01-16 12:16:52.8537--WARN--FabricObserver service health warning: fabric:/FabricObserver/FabricObserver | NodeObserver | Unhandled exception in GetSystemCpuMemoryValuesAsync: Input string was not in a correct format.:
    at System.Number.StringToNumber(String str, NumberStyles options, NumberBuffer& number, NumberFormatInfo info, Boolean parseDecimal)
   at System.Number.ParseInt32(String s, NumberStyles style, NumberFormatInfo info)
   at FabricObserver.Utilities.NetworkUsage.TupleGetDynamicPortRange(Protocol protocol) in C:\github\Microsoft\service-fabric-observer\FabricObserver\Observers\Utilities\NetworkUsage.cs:line 242
   at FabricObserver.Utilities.NetworkUsage.GetActiveEphemeralPortCount(Int32 procId, Protocol protocol) in C:\github\Microsoft\service-fabric-observer\FabricObserver\Observers\Utilities\NetworkUsage.cs:line 318
   at FabricObserver.NodeObserver.<>c__DisplayClass61_0.b__0() in C:\github\Microsoft\service-fabric-observer\FabricObserver\Observers\NodeObserver.cs:line 268

proposed fix:
Match match = Regex.Match(output,
@"Start Port\s+:\s+(?<startPort>\d+).+?Number of Ports\s+:\s+(?<numberOfPorts>\d+)",
RegexOptions.Singleline | RegexOptions.IgnoreCase);

                string startPort = match.Groups["startPort"].Value;
                string portCount = match.Groups["numberOfPorts"].Value;
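For reference, a self-contained sketch of the same regex-based approach run against sample netsh output. `DynamicPortRange` and `Parse` are illustrative names, and the sample text in the usage note is an assumption about the netsh layout based on the output quoted above.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DynamicPortRange
{
    // Extracts the start port and port count by name, so the number of digits
    // in either value does not matter.
    public static (int StartPort, int NumberOfPorts) Parse(string netshOutput)
    {
        Match match = Regex.Match(
            netshOutput,
            @"Start Port\s+:\s+(?<startPort>\d+).+?Number of Ports\s+:\s+(?<numberOfPorts>\d+)",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        if (!match.Success)
        {
            throw new FormatException("Dynamic port range not found in netsh output.");
        }

        return (
            int.Parse(match.Groups["startPort"].Value, CultureInfo.InvariantCulture),
            int.Parse(match.Groups["numberOfPorts"].Value, CultureInfo.InvariantCulture));
    }
}
```

Calling `DynamicPortRange.Parse("Start Port      : 1024\nNumber of Ports : 64511\n")` yields a start port of 1024 and a count of 64511, and a 5-digit start port such as 49152 parses the same way.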

verification:
OSObserver MachineInformation Thu, 16 Jan 2020 14:11:48 GMT 0.00:01:27.0 132236575080657285 true false
OS Information:

Name: Microsoft Windows 10 Enterprise
Version: 10.0.18363
InstallDate: 2019-07-16T13:51:22.0000000Z
LastBootUpTime*: 2020-01-14T23:06:34.5000000Z
OSLanguage: 1033
OSHealthStatus*: OK
NumberOfProcesses*: 461
WindowsEphemeralTCPPortRange: 1024 - 65535 (Active*: 148)
FabricApplicationTCPPortRange: 30001 - 31000
ActiveFirewallRules*: 501
TotalActiveTCPPorts*: 155

Hardware Information:

LogicalProcessorCount: 8
TotalVirtualMemorySize: 37 GB
TotalVisibleMemorySize: 31 GB
FreePhysicalMemory*: 15.47 GB
FreeVirtualMemory*: 10.4 GB
LogicalDriveCount: 1
Drive C (System) Size: 952.77 GB
Drive C (System) Consumed*: 70%

Windows Patches/Hot Fixes*:
...

Creating a PR if you want it, but no problem if you don't.

Docs: Update graphics to match reality of text output FO produces today.

The images were taken from older versions of FO and are not accurate for the 3.x version (or, really, any time after TelemetryData type was introduced...).

Upload current images that depict what the customer will actually see in SFX.

This is unrelated to FO functionality. This is documentation-only updating.

Cluster observer releases

Hi Charles

Just wondering if you'll be adding the cluster observer as a github release (and maybe the FO web api?)

Thanks again!
