Data Profile
Generate a comprehensive profile of a table that a new team member could use to understand the data.
Step 1: Basic Metadata
Query column metadata:
SELECT COLUMN_NAME, DATA_TYPE, COMMENT
FROM <database>.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = '<schema>' AND TABLE_NAME = '<table>'
ORDER BY ORDINAL_POSITION
If the table name isn't fully qualified, search INFORMATION_SCHEMA.TABLES to locate it first.
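For example, a lookup along these lines can locate the table first; `<database>` and `<table>` are placeholders, and ILIKE assumes a warehouse that supports case-insensitive matching (e.g. Snowflake):

```sql
-- Find candidate tables matching a bare name across schemas
SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, ROW_COUNT
FROM <database>.INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME ILIKE '%<table>%'
ORDER BY TABLE_SCHEMA, TABLE_NAME;
```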
Step 2: Size and Shape
Run via run_sql:
SELECT
COUNT(*) as total_rows,
COUNT(*) / 1000000.0 as millions_of_rows
FROM <table>
Step 3: Column-Level Statistics
For each column, gather appropriate statistics based on data type:
Numeric Columns
SELECT
MIN(column_name) as min_val,
MAX(column_name) as max_val,
AVG(column_name) as avg_val,
STDDEV(column_name) as std_dev,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) as median,
SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count,
COUNT(DISTINCT column_name) as distinct_count
FROM <table>
String Columns
SELECT
MIN(LEN(column_name)) as min_length,
MAX(LEN(column_name)) as max_length,
AVG(LEN(column_name)) as avg_length,
SUM(CASE WHEN column_name IS NULL OR column_name = '' THEN 1 ELSE 0 END) as empty_count,
COUNT(DISTINCT column_name) as distinct_count
FROM <table>
Date/Timestamp Columns
SELECT
MIN(column_name) as earliest,
MAX(column_name) as latest,
DATEDIFF('day', MIN(column_name), MAX(column_name)) as date_range_days,
SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count
FROM <table>
Step 4: Cardinality Analysis
For columns that look like categorical/dimension keys:
SELECT
column_name,
COUNT(*) as frequency,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
FROM <table>
GROUP BY column_name
ORDER BY frequency DESC
LIMIT 20
This reveals:
High-cardinality columns (likely IDs or unique values)
Low-cardinality columns (likely categories or status fields)
Skewed distributions (one value dominates)
Step 5: Sample Data
Get representative rows:
SELECT *
FROM <table>
LIMIT 10
If the table is large and you want variety, sample from different time periods or categories.
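One way to sketch that, assuming a warehouse with SAMPLE and QUALIFY support (e.g. Snowflake); `<table>` and `<timestamp_column>` are placeholders:

```sql
-- Quick random sample; SAMPLE syntax varies by warehouse
SELECT *
FROM <table> SAMPLE (10 ROWS);

-- Or pull a couple of rows per month to see older and newer data
SELECT *
FROM <table>
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY DATE_TRUNC('month', <timestamp_column>)
  ORDER BY RANDOM()
) <= 2;
```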
Step 6: Data Quality Assessment
Summarize quality across dimensions:
Completeness
Which columns have NULLs? What percentage?
Are NULLs expected or problematic?
Uniqueness
Does the apparent primary key have duplicates?
Are there unexpected duplicate rows?
Freshness
When was data last updated? (MAX of timestamp columns)
Is the update frequency as expected?
Validity
Are there values outside expected ranges?
Are there invalid formats (dates, emails, etc.)?
Are there orphaned foreign keys?
Consistency
Do related columns make sense together?
Are there logical contradictions?
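The completeness and uniqueness checks above can be sketched as queries; `<table>`, `<key_column>`, and `<column>` are placeholders for the table, its apparent primary key, and a column of interest:

```sql
-- Uniqueness: does the apparent primary key have duplicates?
SELECT <key_column>, COUNT(*) AS dup_count
FROM <table>
GROUP BY <key_column>
HAVING COUNT(*) > 1
ORDER BY dup_count DESC
LIMIT 20;

-- Completeness: NULL percentage for a given column
SELECT ROUND(100.0 * SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) / COUNT(*), 2) AS null_pct
FROM <table>;
```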
Step 7: Output Summary
Provide a structured profile:
Overview
2-3 sentences describing what this table contains, who uses it, and how fresh it is.
Schema
| Column | Type | Nulls% | Distinct | Description |
|--------|------|--------|----------|-------------|
| ...    | ...  | ...    | ...      | ...         |
Key Statistics
Row count: X
Date range: Y to Z
Last updated: timestamp
Data Quality Score
Completeness: X/10
Uniqueness: X/10
Freshness: X/10
Overall: X/10
Potential Issues
List any data quality concerns discovered.
Recommended Queries
3-5 useful queries for common questions about this data.