Query Data Lake
Execute SQL queries on Amazon Athena across default and federated catalogs (Glue, S3 Tables, Redshift) with workgroup selection, statement classification, and error recovery.
Overview
Executes and manages Athena SQL queries across default and federated catalogs. Selects a workgroup, resolves target assets (delegating fuzzy references to
finding-data-lake-assets
), classifies statements for safety, and reports cost and data scanned. Use the AWS MCP server for sandboxed execution and audit logging; the same AWS CLI commands work directly when the MCP server is not available.
Constraints for parameter acquisition:
You MUST accept a single optional argument: SQL text, a named-query name, a workgroup name, a catalog name, or
profile TABLE_NAME
You MUST accept the argument as direct text or a pointer to a file containing SQL
You MUST ask the user for the target AWS region if not already set
You MUST confirm the output S3 location before executing any non-trivial query
You MUST respect the user's decision to abort at any step
Common Tasks
1. Verify Dependencies
Check for required tools and AWS access before running queries.
Constraints:
You MUST verify AWS MCP server tools are available (
awscallaws
) and run queries through them when present; fall back to AWS CLI only if the MCP server is unavailable
You MUST NOT fall back to shell or Bash for query execution — results must be captured via the MCP tool or
aws athena
CLI so output location and cost are tracked
You MUST confirm credentials with
aws sts get-caller-identity
and inform the user about any missing tools
2. Resolve Workgroup
Check caller identity, list workgroups, auto-select the best one (see
workgroup-selection.md
).
Constraints:
You MUST select a workgroup before submitting any query (prevents output-location errors)
You MUST present the selected workgroup and its output location to the user
You MUST NOT auto-escalate to a different workgroup on failure without user confirmation
3. Resolve the Target Asset
If the user refers to a table by name, by business concept ("our quarterly report", "the sales data"), by S3 path, or by catalog without specifying the table, delegate to
finding-data-lake-assets
to return the concrete
database.table
(and catalog if non-default).
Constraints:
You MUST NOT attempt to resolve fuzzy asset references with
athena list-data-catalogs
or by iterating
get-tables
— those miss federated catalogs and waste tokens
You SHOULD skip this step only when the user provides a fully-qualified reference (exact
database.table
) or raw SQL they want executed as-is
You MUST state the resolved asset explicitly before building the query: "Found [table] in [catalog]. Using this for the query."
You SHOULD default to the default Glue catalog unless the user mentions "federated", "Redshift", "S3 Tables", or
finding-data-lake-assets
returns a different catalog
4. Discover Schema
For analytical queries, You SHOULD profile the target table before building the final query. You MUST show sample rows (
SELECT ... LIMIT 5
) as part of profiling.
5. Build Query
Table addressing depends on catalog type:
Default Glue catalog:
database.table
(omit the catalog prefix for single-catalog queries). In cross-catalog queries, qualify default-catalog tables with
"awsdatacatalog".database.table
.
Registered data source:
datasource.database.table
Unregistered Glue catalog:
"catalog/subcatalog".database.table
6. Classify and Execute
Classify the SQL statement before executing:
Statement
Behavior
SELECT
,
SHOW
,
DESCRIBE
,
EXPLAIN
Safe — execute
INSERT
,
UPDATE
,
DELETE
,
DROP
,
ALTER
,
CREATE
,
TRUNCATE
,
MERGE
Destructive — warn the user and require explicit confirmation
Unsure
Treat as destructive; confirm
Example tool call (via AWS MCP server):
aws_call_aws(command="aws athena start-query-execution --work-group
querying-data-lake
安装
npx skills add https://github.com/aws/agent-toolkit-for-aws --skill querying-data-lake