Move data from a source into a queryable table in the data lake. This skill assumes the source connection (if one is needed) already exists. For Glue connection setup or troubleshooting, delegate to
connecting-to-data-source
.
Philosophy
Default to S3 Tables unless the environment says otherwise.
S3 Tables is the recommended target for new data lake work. If the user's catalog inventory shows they haven't adopted S3 Tables, recommend standard Iceberg on their existing general-purpose bucket instead of forcing them to change posture.
Common Tasks
You MUST execute commands using AWS MCP server tools when connected -- they provide validation, sandboxed execution, and audit logging. Fall back to AWS CLI only if MCP is unavailable. You MUST explain each step before executing.
Workflow
1. Verify Dependencies and Context
You MUST check whether AWS MCP tools or AWS CLI are available and inform the user if missing
You MUST confirm target AWS region and verify credentials with
aws sts get-caller-identity
For SageMaker Unified Studio project roles, note that target tables and connections may be scoped to the project. See the caller ARN detection pattern in
querying-data-lake
.
2. Classify the Source
User says...
Source type
Reference
"upload my file", "local CSV", "move to S3"
Local file
local-upload.md
"load from S3", "import CSV/JSON/Parquet from s3://"
S3 files
s3-files.md
"import from Oracle/Postgres/MySQL/SQL Server/Redshift/RDS/Aurora"
JDBC
jdbc-ingest.md
"pull from Snowflake", "Snowflake table to S3"
Snowflake
snowflake-ingest.md
"import from BigQuery", "GCP analytics to S3"
BigQuery
bigquery-ingest.md
"export DynamoDB", "DynamoDB to data lake"
DynamoDB
dynamodb-ingest.md
"migrate Glue table", "convert Hive to Iceberg"
Catalog migration
catalog-migration.md
If the user names Salesforce, ServiceNow, SAP, MongoDB, Kafka, or another SaaS/streaming source, decline -- these are not supported in this release.
If the source table is referenced by a fuzzy or business name ("migrate our orders table", "pull from the sales warehouse"), delegate to
finding-data-lake-assets
to resolve before proceeding.
3. Confirm Connection Exists (if applicable)
For JDBC, Snowflake, and BigQuery sources, a Glue connection is required. Check:
aws glue get-connection
--name
<
CONNECTION_NAME
>
--region
<
REGION
>
If the connection does not exist, stop and delegate to
connecting-to-data-source
to create and test it. Do not proceed with ingest until the connection is verified.
Local files, S3 files, DynamoDB, and catalog migration do not need a Glue connection.
4. Clarify the Target
You MUST ask the user (or suggest based on catalog inventory) before creating or writing to any table:
Database/namespace
Does a specific target database exist? Or should one be created?
Table
Existing table (append/merge) or new table (delegate to
creating-data-lake-table
)?
Format
S3 Tables (default), standard Iceberg, or raw Parquet?
Inventory-aware defaults:
If you have already run
exploring-data-catalog
or can quickly check, use what exists:
Account has an
s3tablescatalog
federated catalog and active table buckets: recommend S3 Tables
Account has general-purpose buckets with Iceberg tables and no S3 Tables usage: recommend standard Iceberg on their existing bucket
Account uses Parquet/ORC on S3 without Iceberg metadata: ask whether to adopt Iceberg now (recommend yes) or continue with raw files
Do not force S3 Tables on customers who haven't adopted it. See
iceberg-catalog-config-and-usage.md
.
Delegations from this step:
Target table doesn't exist ->
creating-data-lake-table
Target database named by fuzzy term ->
finding-data-lake-assets
User doesn't know what exists ->
exploring-data-catalog
5. Execute Source Workflow
Read the source-specific reference and follow its phases. Each is self-contained with job templates, gotchas, and troubleshooting:
Local / S3 / JDBC / Snowflake / BigQuery / DynamoDB / catalog migration -- one reference per source
Common Glue 5.1 or higher job configuration and PySpark templates are shared in
glue-job-config.md
and
glue-job-scripts.md
.
6. Validate
Run all three, do not skip:
Row count matches expected (source vs target)
Null check on critical columns
Spot-check 3-5 sample rows
See
data-quality-validation.md
.
7. Schedule (if recurring)
For recurring pipelines, create a Glue Trigger with a cron schedule. See
testing-and-scheduling.md
. Simple single-step pipelines use Glue Triggers; multi-step with branching uses MWAA.
Argument Routing
S3 path only: Infer one-time load, start Step 2 with S3 files
Connection name: Start Step 3 with the named connection
Table name: Start Step 4, ask whether this is source or target
--target
flag: Pre-fill the target format in Step 4
No args: Walk through interactively
Gotchas
S3 Tables requires Glue 5.1 or higher and
--datalake-formats iceberg
job argument
All
spark.sql.catalog.
config MUST go in
--conf
job arguments, never in
spark.conf.set()
. Glue 5.x throws
AnalysisException: Cannot modify the value of a static config
otherwise. See
iceberg-catalog-config-and-usage.md
for correct catalog configs.
The
warehouse
parameter is required in S3 Tables catalog config. Without it Spark fails with "Cannot derive default warehouse location".
Table and column names in S3 Tables MUST be all lowercase
overwritePartitions()
only replaces partitions present in the DataFrame -- for full refresh with deletes, use
createOrReplace()
Standard Iceberg targets MUST include a LOCATION clause; S3 Tables MUST NOT
DynamoDB does not need a Glue connection -- do not attempt to create one
Connection failures during ingest delegate back to
connecting-to-data-source
; do not debug network/credentials in this skill
For target tables in SageMaker Unified Studio projects, ensure the project role has write access to the target namespace before the Glue job runs
Troubleshooting
Error
Likely cause
Action
Access Denied on S3
Missing IAM permissions
Check Glue role has s3:GetObject, s3:PutObject
Access Denied on S3 Tables
Missing s3tables: permissions
Add S3 Tables inline policy to Glue role
CTAS timeout
Dataset too large for Athena
Switch to Glue ETL or batch with WHERE filters
JDBC connection timeout/auth failure
Connection-level issue
Delegate to
connecting-to-data-source
Throughput exceeded (DynamoDB)
Read percent too high
Lower
read.percent
or use native export
See
error-handling.md
for the full catalog.
References
Source-specific
local-upload.md
-- Local files
s3-files.md
-- S3 files (CSV, JSON, Parquet, Avro, ORC)
jdbc-ingest.md
-- Oracle, SQL Server, PostgreSQL, MySQL, RDS, Aurora, Redshift
snowflake-ingest.md
-- Snowflake
bigquery-ingest.md
-- BigQuery
dynamodb-ingest.md
-- DynamoDB (export and Glue direct read)
catalog-migration.md
-- Existing Glue catalog tables (Hive, self-managed Iceberg)
Cross-cutting
iceberg-catalog-config-and-usage.md
-- S3 Tables, standard Iceberg, raw files: catalog config, engine access patterns
glue-job-config.md
-- Job sizing, monitoring, retry
glue-job-scripts.md
-- PySpark templates (append, upsert, custom SQL, full refresh)
incremental-loading.md
-- Watermark strategies
testing-and-scheduling.md
-- Glue Triggers, MWAA
data-quality-validation.md
-- Row counts, null checks, Glue Data Quality
schema-evolution.md
-- ALTER TABLE ADD COLUMNS, nested JSON
type-transformations.md
-- Type conflict resolution
format-specific-loading.md
-- CSV/JSON/Parquet/Avro/ORC specifics
athena-loading.md
-- Athena INSERT INTO as simple-load fallback
error-handling.md
-- Ingest errors (connection errors delegate to connecting-to-data-source)
upload-options.md
-- aws s3 cp vs sync, multipart
Migration-specific
ctas-patterns.md
-- Athena CTAS syntax and partition transforms
glue-etl-migration.md
-- Large-table migration via Glue 5.1 or higher PySpark
migration-validation.md
-- Full validation checklist
migration-troubleshooting.md
-- CTAS failures, visibility, partitions
JDBC-specific
jdbc-schema-discovery.md
-- Crawler, direct inspection, custom SQL
jdbc-performance.md
-- Parallel reads, partitioning