azure-resource-health-diagnose

安装量: 7.1K
排名: #507

安装

npx skills add https://github.com/github/awesome-copilot --skill azure-resource-health-diagnose
Azure Resource Health & Issue Diagnosis
This workflow analyzes a specific Azure resource to assess its health status, diagnose potential issues using logs and telemetry data, and develop a comprehensive remediation plan for any problems discovered.
Prerequisites
Azure MCP server configured and authenticated
Target Azure resource identified (name and optionally resource group/subscription)
Resource must be deployed and running to generate logs/telemetry
Prefer Azure MCP tools (
azmcp-*
) over direct Azure CLI when available
Workflow Steps
Step 1: Get Azure Best Practices
Action
Retrieve diagnostic and troubleshooting best practices
Tools
Azure MCP best practices tool
Process
:
Load Best Practices
:
Execute Azure best practices tool to get diagnostic guidelines
Focus on health monitoring, log analysis, and issue resolution patterns
Use these practices to inform diagnostic approach and remediation recommendations
Step 2: Resource Discovery & Identification
Action
Locate and identify the target Azure resource
Tools
Azure MCP tools + Azure CLI fallback
Process
:
Resource Lookup
:
If only resource name provided: Search across subscriptions using
azmcp-subscription-list
Use
az resource list --name
to find matching resources
If multiple matches found, prompt user to specify subscription/resource group
Gather detailed resource information:
Resource type and current status
Location, tags, and configuration
Associated services and dependencies
Resource Type Detection
:
Identify resource type to determine appropriate diagnostic approach:
Web Apps/Function Apps
Application logs, performance metrics, dependency tracking
Virtual Machines
System logs, performance counters, boot diagnostics
Cosmos DB
Request metrics, throttling, partition statistics
Storage Accounts
Access logs, performance metrics, availability
SQL Database
Query performance, connection logs, resource utilization
Application Insights
Application telemetry, exceptions, dependencies
Key Vault
Access logs, certificate status, secret usage
Service Bus
Message metrics, dead letter queues, throughput
Step 3: Health Status Assessment
Action
Evaluate current resource health and availability
Tools
Azure MCP monitoring tools + Azure CLI
Process
:
Basic Health Check
:
Check resource provisioning state and operational status
Verify service availability and responsiveness
Review recent deployment or configuration changes
Assess current resource utilization (CPU, memory, storage, etc.)
Service-Specific Health Indicators
:
Web Apps
HTTP response codes, response times, uptime
Databases
Connection success rate, query performance, deadlocks
Storage
Availability percentage, request success rate, latency
VMs
Boot diagnostics, guest OS metrics, network connectivity
Functions
Execution success rate, duration, error frequency
Step 4: Log & Telemetry Analysis
Action
Analyze logs and telemetry to identify issues and patterns
Tools
Azure MCP monitoring tools for Log Analytics queries
Process
:
Find Monitoring Sources
:
Use
azmcp-monitor-workspace-list
to identify Log Analytics workspaces
Locate Application Insights instances associated with the resource
Identify relevant log tables using
azmcp-monitor-table-list
Execute Diagnostic Queries
:
Use
azmcp-monitor-log-query
with targeted KQL queries based on resource type:
General Error Analysis
:
// Recent errors and exceptions
union isfuzzy=true
AzureDiagnostics,
AppServiceHTTPLogs,
AppServiceAppLogs,
AzureActivity
| where TimeGenerated > ago(24h)
| where Level == "Error" or ResultType != "Success"
| summarize ErrorCount=count() by Resource, ResultType, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
Performance Analysis
:
// Performance degradation patterns
Perf
| where TimeGenerated > ago(7d)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)
| where avg_CounterValue > 80
Application-Specific Queries
:
// Application Insights - Failed requests
requests
| where timestamp > ago(24h)
| where success == false
| summarize FailureCount=count() by resultCode, bin(timestamp, 1h)
| order by timestamp desc
// Database - Connection failures
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.SQL"
| where Category == "SQLSecurityAuditEvents"
| where action_name_s == "CONNECTION_FAILED"
| summarize ConnectionFailures=count() by bin(TimeGenerated, 1h)
Pattern Recognition
:
Identify recurring error patterns or anomalies
Correlate errors with deployment times or configuration changes
Analyze performance trends and degradation patterns
Look for dependency failures or external service issues
Step 5: Issue Classification & Root Cause Analysis
Action
Categorize identified issues and determine root causes
Process
:
Issue Classification
:
Critical
Service unavailable, data loss, security breaches
High
Performance degradation, intermittent failures, high error rates
Medium
Warnings, suboptimal configuration, minor performance issues
Low
Informational alerts, optimization opportunities
Root Cause Analysis
:
Configuration Issues
Incorrect settings, missing dependencies
Resource Constraints
CPU/memory/disk limitations, throttling
Network Issues
Connectivity problems, DNS resolution, firewall rules
Application Issues
Code bugs, memory leaks, inefficient queries
External Dependencies
Third-party service failures, API limits
Security Issues
Authentication failures, certificate expiration
Impact Assessment
:
Determine business impact and affected users/systems
Evaluate data integrity and security implications
Assess recovery time objectives and priorities
Step 6: Generate Remediation Plan
Action
Create a comprehensive plan to address identified issues
Process
:
Immediate Actions
(Critical issues):
Emergency fixes to restore service availability
Temporary workarounds to mitigate impact
Escalation procedures for complex issues
Short-term Fixes
(High/Medium issues):
Configuration adjustments and resource scaling
Application updates and patches
Monitoring and alerting improvements
Long-term Improvements
(All issues):
Architectural changes for better resilience
Preventive measures and monitoring enhancements
Documentation and process improvements
Implementation Steps
:
Prioritized action items with specific Azure CLI commands
Testing and validation procedures
Rollback plans for each change
Monitoring to verify issue resolution
Step 7: User Confirmation & Report Generation
Action
Present findings and get approval for remediation actions Process : Display Health Assessment Summary : 🏥 Azure Resource Health Assessment 📊 Resource Overview: • Resource: [Name] ([Type]) • Status: [Healthy/Warning/Critical] • Location: [Region] • Last Analyzed: [Timestamp] 🚨 Issues Identified: • Critical: X issues requiring immediate attention • High: Y issues affecting performance/reliability • Medium: Z issues for optimization • Low: N informational items 🔍 Top Issues: 1. [Issue Type]: [Description] - Impact: [High/Medium/Low] 2. [Issue Type]: [Description] - Impact: [High/Medium/Low] 3. [Issue Type]: [Description] - Impact: [High/Medium/Low] 🛠️ Remediation Plan: • Immediate Actions: X items • Short-term Fixes: Y items • Long-term Improvements: Z items • Estimated Resolution Time: [Timeline] ❓ Proceed with detailed remediation plan? (y/n) Generate Detailed Report :

Azure Resource Health Report: [Resource Name]
**
Generated
**
[Timestamp]
**
Resource
**
[Full Resource ID]
**
Overall Health
**
[Status with color indicator]

🔍 Executive Summary [Brief overview of health status and key findings]

📊 Health Metrics

**
Availability
**

X% over last 24h

**
Performance
**

[Average response time/throughput]

**
Error Rate
**

X% over last 24h

**
Resource Utilization
**
[CPU/Memory/Storage percentages]

🚨 Issues Identified

Critical Issues

**
[Issue 1]
**

[Description]

**
Root Cause
**

[Analysis]

**
Impact
**

[Business impact]

**
Immediate Action
**
[Required steps]

High Priority Issues

**
[Issue 2]
**

[Description]

**
Root Cause
**

[Analysis]

**
Impact
**

[Performance/reliability impact]

**
Recommended Fix
**
[Solution steps]

🛠️ Remediation Plan

Phase 1: Immediate Actions (0-2 hours) ```bash

Critical fixes to restore service [Azure CLI commands with explanations] Phase 2: Short-term Fixes (2-24 hours)

Performance and reliability improvements

[ Azure CLI commands with explanations ] Phase 3: Long-term Improvements (1-4 weeks)

Architectural and preventive measures

[
Azure CLI commands and configuration changes
]
📈 Monitoring Recommendations
Alerts to Configure
[List of recommended alerts]
Dashboards to Create
[Monitoring dashboard suggestions]
Regular Health Checks
[Recommended frequency and scope]
✅ Validation Steps
Verify issue resolution through logs
Confirm performance improvements
Test application functionality
Update monitoring and alerting
Document lessons learned
📝 Prevention Measures
[Recommendations to prevent similar issues]
[Process improvements]
[Monitoring enhancements]
Error Handling
Resource Not Found
Provide guidance on resource name/location specification
Authentication Issues
Guide user through Azure authentication setup
Insufficient Permissions
List required RBAC roles for resource access
No Logs Available
Suggest enabling diagnostic settings and waiting for data
Query Timeouts
Break down analysis into smaller time windows
Service-Specific Issues
Provide generic health assessment with limitations noted Success Criteria ✅ Resource health status accurately assessed ✅ All significant issues identified and categorized ✅ Root cause analysis completed for major problems ✅ Actionable remediation plan with specific steps provided ✅ Monitoring and prevention recommendations included ✅ Clear prioritization of issues by business impact ✅ Implementation steps include validation and rollback procedures
返回排行榜