Azure Resource Health & Issue Diagnosis

This workflow analyzes a specific Azure resource to assess its health status, diagnose potential issues using logs and telemetry data, and develop a comprehensive remediation plan for any problems discovered.

Prerequisites

Azure MCP server configured and authenticated

Target Azure resource identified (name and optionally resource group/subscription)

Resource must be deployed and running to generate logs/telemetry

Prefer Azure MCP tools (

azmcp-*

) over direct Azure CLI when available

Workflow Steps

Step 1: Get Azure Best Practices

Action

Retrieve diagnostic and troubleshooting best practices

Tools

Azure MCP best practices tool

Process

:

Load Best Practices

:

Execute Azure best practices tool to get diagnostic guidelines

Focus on health monitoring, log analysis, and issue resolution patterns

Use these practices to inform diagnostic approach and remediation recommendations

Step 2: Resource Discovery & Identification

Action

Locate and identify the target Azure resource

Tools

Azure MCP tools + Azure CLI fallback

Process

:

Resource Lookup

:

If only resource name provided: Search across subscriptions using

azmcp-subscription-list

Use

az resource list --name

to find matching resources

If multiple matches found, prompt user to specify subscription/resource group

Gather detailed resource information:

Resource type and current status

Location, tags, and configuration

Associated services and dependencies

Resource Type Detection

:

Identify resource type to determine appropriate diagnostic approach:

Web Apps/Function Apps

Application logs, performance metrics, dependency tracking

Virtual Machines

System logs, performance counters, boot diagnostics

Cosmos DB

Request metrics, throttling, partition statistics

Storage Accounts

Access logs, performance metrics, availability

SQL Database

Query performance, connection logs, resource utilization

Application Insights

Application telemetry, exceptions, dependencies

Key Vault

Access logs, certificate status, secret usage

Service Bus

Message metrics, dead letter queues, throughput

Step 3: Health Status Assessment

Action

Evaluate current resource health and availability

Tools

Azure MCP monitoring tools + Azure CLI

Process

:

Basic Health Check

:

Check resource provisioning state and operational status

Verify service availability and responsiveness

Review recent deployment or configuration changes

Assess current resource utilization (CPU, memory, storage, etc.)

Service-Specific Health Indicators

:

Web Apps

HTTP response codes, response times, uptime

Databases

Connection success rate, query performance, deadlocks

Storage

Availability percentage, request success rate, latency

VMs

Boot diagnostics, guest OS metrics, network connectivity

Functions

Execution success rate, duration, error frequency

Step 4: Log & Telemetry Analysis

Action

Analyze logs and telemetry to identify issues and patterns

Tools

Azure MCP monitoring tools for Log Analytics queries

Process

:

Find Monitoring Sources

:

Use

azmcp-monitor-workspace-list

to identify Log Analytics workspaces

Locate Application Insights instances associated with the resource

Identify relevant log tables using

azmcp-monitor-table-list

Execute Diagnostic Queries

:

Use

azmcp-monitor-log-query

with targeted KQL queries based on resource type:

General Error Analysis

:

// Recent errors and exceptions

union isfuzzy=true

AzureDiagnostics,

AppServiceHTTPLogs,

AppServiceAppLogs,

AzureActivity

| where TimeGenerated > ago(24h)

| where Level == "Error" or ResultType != "Success"

| summarize ErrorCount=count() by Resource, ResultType, bin(TimeGenerated, 1h)

| order by TimeGenerated desc

Performance Analysis

:

// Performance degradation patterns

Perf

| where TimeGenerated > ago(7d)

| where ObjectName == "Processor" and CounterName == "% Processor Time"

| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)

| where avg_CounterValue > 80

Application-Specific Queries

:

// Application Insights - Failed requests

requests

| where timestamp > ago(24h)

| where success == false

| summarize FailureCount=count() by resultCode, bin(timestamp, 1h)

| order by timestamp desc

// Database - Connection failures

AzureDiagnostics

| where ResourceProvider == "MICROSOFT.SQL"

| where Category == "SQLSecurityAuditEvents"

| where action_name_s == "CONNECTION_FAILED"

| summarize ConnectionFailures=count() by bin(TimeGenerated, 1h)

Pattern Recognition

:

Identify recurring error patterns or anomalies

Correlate errors with deployment times or configuration changes

Analyze performance trends and degradation patterns

Look for dependency failures or external service issues

Step 5: Issue Classification & Root Cause Analysis

Action

Categorize identified issues and determine root causes

Process

:

Issue Classification

:

Critical

Service unavailable, data loss, security breaches

High

Performance degradation, intermittent failures, high error rates

Medium

Warnings, suboptimal configuration, minor performance issues

Low

Informational alerts, optimization opportunities

Root Cause Analysis

:

Configuration Issues

Incorrect settings, missing dependencies

Resource Constraints

CPU/memory/disk limitations, throttling

Network Issues

Connectivity problems, DNS resolution, firewall rules

Application Issues

Code bugs, memory leaks, inefficient queries

External Dependencies

Third-party service failures, API limits

Security Issues

Authentication failures, certificate expiration

Impact Assessment

:

Determine business impact and affected users/systems

Evaluate data integrity and security implications

Assess recovery time objectives and priorities

Step 6: Generate Remediation Plan

Action

Create a comprehensive plan to address identified issues
Process
:
Immediate Actions
(Critical issues):
Emergency fixes to restore service availability
Temporary workarounds to mitigate impact
Escalation procedures for complex issues
Short-term Fixes
(High/Medium issues):
Configuration adjustments and resource scaling
Application updates and patches
Monitoring and alerting improvements
Long-term Improvements
(All issues):
Architectural changes for better resilience
Preventive measures and monitoring enhancements
Documentation and process improvements
Implementation Steps
:
Prioritized action items with specific Azure CLI commands
Testing and validation procedures
Rollback plans for each change
Monitoring to verify issue resolution
Step 7: User Confirmation & Report Generation
Action: Present findings and get approval for remediation actions Process : Display Health Assessment Summary : 🏥 Azure Resource Health Assessment 📊 Resource Overview: • Resource: [Name] ([Type]) • Status: [Healthy/Warning/Critical] • Location: [Region] • Last Analyzed: [Timestamp] 🚨 Issues Identified: • Critical: X issues requiring immediate attention • High: Y issues affecting performance/reliability • Medium: Z issues for optimization • Low: N informational items 🔍 Top Issues: 1. [Issue Type]: [Description] - Impact: [High/Medium/Low] 2. [Issue Type]: [Description] - Impact: [High/Medium/Low] 3. [Issue Type]: [Description] - Impact: [High/Medium/Low] 🛠️ Remediation Plan: • Immediate Actions: X items • Short-term Fixes: Y items • Long-term Improvements: Z items • Estimated Resolution Time: [Timeline] ❓ Proceed with detailed remediation plan? (y/n) Generate Detailed Report :

Azure Resource Health Report: [Resource Name]

Generated

[Timestamp]

Resource

[Full Resource ID]
**
Overall Health
**: [Status with color indicator]

🔍 Executive Summary [Brief overview of health status and key findings]

📊 Health Metrics

Availability

X% over last 24h

Performance

[Average response time/throughput]

Error Rate

X% over last 24h

**
Resource Utilization
**: [CPU/Memory/Storage percentages]

🚨 Issues Identified

Critical Issues

[Issue 1]

[Description]

Root Cause

[Analysis]

Impact

[Business impact]

**
Immediate Action
**: [Required steps]

High Priority Issues

[Issue 2]

[Description]

Root Cause

[Analysis]

Impact

[Performance/reliability impact]

**
Recommended Fix
**: [Solution steps]

🛠️ Remediation Plan

Phase 1: Immediate Actions (0-2 hours) ```bash

Critical fixes to restore service [Azure CLI commands with explanations] Phase 2: Short-term Fixes (2-24 hours)

Performance and reliability improvements

[ Azure CLI commands with explanations ] Phase 3: Long-term Improvements (1-4 weeks)

Architectural and preventive measures

[

Azure CLI commands and configuration changes

]

📈 Monitoring Recommendations

Alerts to Configure

[List of recommended alerts]

Dashboards to Create

[Monitoring dashboard suggestions]

Regular Health Checks

[Recommended frequency and scope]

✅ Validation Steps

Verify issue resolution through logs

Confirm performance improvements

Test application functionality

Update monitoring and alerting

Document lessons learned

📝 Prevention Measures

[Recommendations to prevent similar issues]

[Process improvements]

[Monitoring enhancements]

Error Handling

Resource Not Found

Provide guidance on resource name/location specification

Authentication Issues

Guide user through Azure authentication setup

Insufficient Permissions

List required RBAC roles for resource access

No Logs Available

Suggest enabling diagnostic settings and waiting for data

Query Timeouts

Break down analysis into smaller time windows
Service-Specific Issues: Provide generic health assessment with limitations noted Success Criteria ✅ Resource health status accurately assessed ✅ All significant issues identified and categorized ✅ Root cause analysis completed for major problems ✅ Actionable remediation plan with specific steps provided ✅ Monitoring and prevention recommendations included ✅ Clear prioritization of issues by business impact ✅ Implementation steps include validation and rollback procedures

azure-resource-health-diagnose

安装

📊 Health Metrics

X% over last 24h

[Average response time/throughput]

X% over last 24h

Critical Issues

[Description]

[Analysis]

[Business impact]

High Priority Issues

[Description]

[Analysis]

[Performance/reliability impact]

Performance and reliability improvements

Architectural and preventive measures