# System Health and Routing Overview
Quick health check and routing entry point for ClickHouse diagnostics. Run this first, then chain to other skills based on findings.
## Timeframe Handling
Use the timeframe specified by the user. If none is provided, default to the last 24 hours. Use it consistently for `system.errors` and for scanning all `system.*_log` tables.
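One way to keep the timeframe consistent across queries is to define the window once and reuse it. A minimal sketch, assuming the 24-hour default (substitute the user's window):

```sql
-- Hypothetical pattern: define the window start once, reuse it in each query.
with now() - interval 24 hour as window_start
select count() as events_in_window
from system.query_log
where event_time >= window_start
```

The same `with ... as window_start` prefix can be prepended to each of the log-scanning queries below.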
## Routing Rules (Chain to Other Skills)

- High memory usage or OOMs → `altinity-expert-clickhouse-memory`
- Disk usage > 80% or poor compression → `altinity-expert-clickhouse-storage`
- Many parts, merge backlog, or TOO_MANY_PARTS → `altinity-expert-clickhouse-merges`
- Replication lag / readonly replicas / Keeper issues → `altinity-expert-clickhouse-replication`
- Slow SELECTs / heavy reads in query_log → `altinity-expert-clickhouse-reporting`
- Slow INSERTs / high part creation rate → `altinity-expert-clickhouse-ingestion`
- Low cache hit ratios / cache pressure → `altinity-expert-clickhouse-caches`
- Dictionary load failures or high dictionary memory → `altinity-expert-clickhouse-dictionaries`
- Frequent exceptions or error spikes → include the `system.errors` and `system.*_log` summaries below
- System log TTL issues or log growth → `altinity-expert-clickhouse-logs`
- Schema anti-patterns (partitioning / ORDER BY / MV issues) → `altinity-expert-clickhouse-schema`
- High load / connection saturation / queue buildup → `altinity-expert-clickhouse-metrics`
- Suspicious server log entries → `altinity-expert-clickhouse-logs`

## System Identification
If a connection recommendation is not provided explicitly, ask the user to choose the MCP server from the list, or ask for connection details for `clickhouse-client`.
```sql
select
    hostName() as hostname,
    version() as version,
    formatReadableTimeDelta(uptime()) as uptime_human,
    getSetting('max_memory_usage') as max_memory_usage,
    (select value from system.asynchronous_metrics where metric = 'OSMemoryTotal') as os_memory_total
```
## Quick Health Score
Run all checks, aggregate by severity:
### Object Counts Audit

```sql
-- In ClickHouse, an ORDER BY after UNION ALL applies only to the last SELECT,
-- so the checks are wrapped in a subquery to sort the combined result.
select *
from
(
    select
        'Replicated Tables' as check_name,
        (select count() from system.tables where engine like 'Replicated%') as value,
        multiIf(value > 2000, 'Critical', value > 900, 'Major', value > 200, 'Moderate', 'OK') as severity,
        'Recommend: <200, tune background_schedule_pool_size if higher' as note

    union all

    select
        'MergeTree Tables' as check_name,
        (select count() from system.tables where engine like '%MergeTree%') as value,
        multiIf(value > 10000, 'Critical', value > 3000, 'Major', value > 1000, 'Moderate', 'OK') as severity,
        'High count increases metadata overhead' as note

    union all

    select
        'Databases' as check_name,
        (select count() from system.databases) as value,
        multiIf(value > 1000, 'Critical', value > 300, 'Major', value > 100, 'Moderate', 'OK') as severity,
        'Consider consolidating if >100' as note

    union all

    select
        'Active Parts' as check_name,
        (select count() from system.parts where active) as value,
        multiIf(value > 120000, 'Critical', value > 90000, 'Major', value > 60000, 'Moderate', 'OK') as severity,
        'High count slows restarts and metadata ops' as note

    union all

    select
        'Current Queries' as check_name,
        (select count() from system.processes where is_cancelled = 0) as value,
        multiIf(value > 100, 'Major', value > 50, 'Moderate', 'OK') as severity,
        'Check max_concurrent_queries setting' as note
)
order by multiIf(severity = 'Critical', 1, severity = 'Major', 2, severity = 'Moderate', 3, 4), check_name
```
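The "Current Queries" check references `max_concurrent_queries`. On recent ClickHouse versions the server-level value can be read directly; a sketch, assuming `system.server_settings` is available (introduced around 23.x — on older versions the limit lives only in the server config):

```sql
-- Read the server-wide concurrency limit to compare against the current
-- query count from system.processes.
select name, value
from system.server_settings
where name = 'max_concurrent_queries'
```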
### Resource Utilization

```sql
-- WITH aliases propagate into the subquery when enable_global_with_statement = 1
-- (the default); the subquery wrapper makes ORDER BY apply to the combined result.
with
    (select value from system.asynchronous_metrics where metric = 'OSMemoryTotal') as total_ram,
    (select value from system.asynchronous_metrics where metric = 'MemoryResident') as used_ram,
    (select sum(primary_key_bytes_in_memory) from system.parts) as pk_memory,
    (select sum(bytes_allocated) from system.dictionaries) as dict_memory,
    (select assumeNotNull(sum(total_bytes)) from system.tables where engine in ('Memory', 'Set', 'Join')) as mem_tables
select *
from
(
    select
        'Memory Usage' as resource,
        formatReadableSize(used_ram) as used,
        formatReadableSize(total_ram) as total,
        round(100.0 * used_ram / total_ram, 1) as pct,
        multiIf(pct > 90, 'Critical', pct > 80, 'Major', pct > 70, 'Moderate', 'OK') as severity

    union all

    select
        'Primary Keys in RAM' as resource,
        formatReadableSize(pk_memory) as used,
        formatReadableSize(total_ram) as total,
        round(100.0 * pk_memory / total_ram, 1) as pct,
        multiIf(pct > 30, 'Critical', pct > 25, 'Major', pct > 20, 'Moderate', 'OK') as severity

    union all

    select
        'Dictionaries + MemTables' as resource,
        formatReadableSize(dict_memory + mem_tables) as used,
        formatReadableSize(total_ram) as total,
        round(100.0 * (dict_memory + mem_tables) / total_ram, 1) as pct,
        multiIf(pct > 30, 'Critical', pct > 25, 'Major', pct > 20, 'Moderate', 'OK') as severity
)
order by pct desc
```
### Disk Health

```sql
select
    name as disk,
    path,
    formatReadableSize(total_space) as total,
    formatReadableSize(free_space) as free,
    round(100.0 * (total_space - free_space) / total_space, 1) as used_pct,
    multiIf(used_pct > 90, 'Critical', used_pct > 85, 'Major', used_pct > 80, 'Moderate', 'OK') as severity
from system.disks
where type = 'Local'
order by used_pct desc
```
### Replication Health

```sql
select
    'Readonly Replicas' as check_name,
    toFloat64((select value from system.metrics where metric = 'ReadonlyReplica')) as value,
    if(value > 0, 'Critical', 'OK') as severity

union all

select
    'Max Replica Delay' as check_name,
    toFloat64((select max(value) from system.asynchronous_metrics where metric in ('ReplicasMaxAbsoluteDelay', 'ReplicasMaxRelativeDelay'))) as value,
    multiIf(value > 86400, 'Critical', value > 10800, 'Major', value > 1800, 'Moderate', 'OK') as severity

union all

select
    'Replication Queue Size' as check_name,
    toFloat64((select value from system.asynchronous_metrics where metric = 'ReplicasSumQueueSize')) as value,
    multiIf(value > 500, 'Major', value > 200, 'Moderate', 'OK') as severity
```
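If any replication check is not OK, a per-table drill-down into `system.replicas` shows which replicas are affected before chaining to `altinity-expert-clickhouse-replication`. A minimal sketch using standard columns:

```sql
-- Replicas that are readonly, lagging, or have queued entries, worst first.
select
    database,
    table,
    is_readonly,
    absolute_delay,
    queue_size
from system.replicas
where is_readonly or absolute_delay > 0 or queue_size > 0
order by absolute_delay desc, queue_size desc
limit 20
```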
### Background Pool Status

```sql
-- The regex must capture the full pool name, e.g. 'MergesAndMutations'
-- from 'BackgroundMergesAndMutationsPoolTask', hence '(.+)' not '(.)'.
with
    transform(
        extract(metric, '^Background(.+)PoolTask'),
        ['MergesAndMutations', 'Fetches', 'Move', 'Common', 'Schedule', 'BufferFlushSchedule', 'MessageBrokerSchedule', 'DistributedSchedule'],
        ['pool', 'fetches_pool', 'move_pool', 'common_pool', 'schedule_pool', 'buffer_flush_schedule_pool', 'message_broker_schedule_pool', 'distributed_schedule_pool'],
        ''
    ) as pool_key,
    concat('background_', lower(pool_key), '_size') as setting_name
select
    extract(m.metric, '^Background(.+)Task') as pool_name,
    m.value as active_tasks,
    toFloat64OrZero(s.value) as pool_size,
    round(100.0 * m.value / pool_size, 1) as utilization_pct,
    multiIf(utilization_pct > 99, 'Major', utilization_pct > 90, 'Moderate', 'OK') as severity
from system.metrics m
left join system.settings s on s.name = setting_name
where m.metric like 'Background%PoolTask' and pool_size > 0
order by utilization_pct desc
```
### Version Check

```sql
with
    (select value from system.build_options where name = 'VERSION_DESCRIBE') as current_version,
    nullIf((select value from system.build_options where name = 'BUILD_DATE'), '') as build_date_str,
    parseDateTimeBestEffortOrNull(build_date_str) as build_dt,
    if(build_dt is null, NULL, dateDiff('day', toDate(build_dt), today())) as age_days
select
    current_version as version,
    build_date_str as build_date,
    age_days,
    multiIf(age_days is null, 'Moderate', age_days > 365, 'Major', age_days > 180, 'Moderate', 'OK') as severity,
    multiIf(
        age_days is null, 'Build date not available; check packaging / release notes',
        age_days > 180, 'Consider upgrading - security and performance fixes available',
        'Version is reasonably current'
    ) as recommendation
```
### System Log Health

```sql
select
    format('system.{}', name) as log_table,
    engine_full like '% TTL %' as has_ttl,
    if(not has_ttl, 'Major', 'OK') as severity,
    if(not has_ttl, 'System log should have TTL to prevent disk fill', 'TTL configured') as note
from system.tables
where database = 'system' and name like '%_log' and engine like '%MergeTree%'
order by has_ttl, name
```
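If a log table lacks a TTL, one way to add one is sketched below. The 30-day window and the `system.query_log` target are illustrative only; the durable fix is the `<ttl>` element in the matching server-config section (e.g. `<query_log>`), since ClickHouse can recreate log tables on schema changes and drop a manually applied TTL.

```sql
-- Illustrative sketch: add a 30-day TTL to system.query_log.
-- Prefer configuring <ttl> in the server config's <query_log> section.
alter table system.query_log
    modify ttl event_date + interval 30 day
```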
```sql
-- Log disk usage
select
    table,
    formatReadableSize(sum(bytes_on_disk)) as size,
    count() as parts
from system.parts
where database = 'system' and table like '%_log' and active
group by table
order by sum(bytes_on_disk) desc
```
### Recent Errors Summary (Timeframe-Based)

```sql
select
    toStartOfHour(event_time) as hour,
    countIf(type like 'Exception%') as failed_queries,
    count() as total_queries,
    round(100.0 * countIf(type like 'Exception%') / count(), 2) as error_rate_pct
from system.query_log
where event_time >= now() - interval 24 hour
group by hour
order by hour desc
limit 12
```
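When the hourly error rate spikes, a drill-down by exception code helps decide which skill to chain to next. A minimal sketch using stock `system.query_log` columns:

```sql
-- Top exception codes in the window, with one sample message each.
select
    exception_code,
    count() as occurrences,
    any(substring(exception, 1, 160)) as sample_message
from system.query_log
where event_time >= now() - interval 24 hour
  and type like 'Exception%'
group by exception_code
order by occurrences desc
limit 10
```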
### system.errors Summary (Timeframe-Based)

```sql
select
    code,
    name,
    value as count,
    last_error_time,
    substring(last_error_message, 1, 160) as last_error_message
from system.errors
where last_error_time >= now() - interval 24 hour
order by last_error_time desc
limit 20
```
### system.*_log Activity Summary (Timeframe-Based)

Identify which log tables have timestamp columns:

```sql
select
    table,
    groupArray(name) as time_cols
from system.columns
where database = 'system' and table like '%_log' and name in ('event_time', 'event_date')
group by table
order by table
```
For each log table, run a short activity summary using the appropriate column:

```sql
-- Example: event_time-based tables
select
    count() as rows_24h,
    min(event_time) as min_time,
    max(event_time) as max_time
from system.query_log
where event_time >= now() - interval 24 hour
```
```sql
-- Example: event_date-based tables
select
    count() as rows_24h,
    min(event_date) as min_date,
    max(event_date) as max_date
from system.part_log
where event_date >= today() - 1
```
Use the user-specified timeframe if provided; otherwise use the last 24 hours.
### Warnings from ClickHouse

```sql
select message as warning from system.warnings
```
## Module Routing
Based on findings, load specific modules:
| Finding | Load Module |
|---|---|
| High memory usage | `altinity-expert-clickhouse-memory` |
| Disk > 80% | `altinity-expert-clickhouse-storage` |
| Many parts | `altinity-expert-clickhouse-merges` |
| Replica delay | `altinity-expert-clickhouse-replication` |
| High error rate | Include `system.errors` + log table summaries (see above) |
| Pool saturation | `altinity-expert-clickhouse-ingestion` or `altinity-expert-clickhouse-merges` |
| Old version | Check ClickHouse release notes |
| Log issues | `altinity-expert-clickhouse-logs` |
| Schema concerns | `altinity-expert-clickhouse-schema` |

## Full Audit Script
For comprehensive audit, run modules in order:
1. `altinity-expert-clickhouse-overview` (this module) - system identification
2. `altinity-expert-clickhouse-schema` - table design issues
3. `altinity-expert-clickhouse-merges` - part management
4. `altinity-expert-clickhouse-memory` - RAM analysis
5. `altinity-expert-clickhouse-storage` - disk analysis
6. `altinity-expert-clickhouse-replication` - if replicated tables exist
7. `altinity-expert-clickhouse-reporting` - if query performance issues