AWS Glue ETL & Data Catalog
Serverless ETL service and central metadata repository
Glue Components
| Component | Purpose | Key Details |
|---|---|---|
| Crawlers | Discover and catalog data | Auto-detects schema, writes to Data Catalog |
| Data Catalog | Central metadata store | Hive metastore-compatible; queried by Athena, EMR, and Redshift Spectrum |
| ETL Jobs | Transform data | Spark jobs (PySpark or Scala), Python shell jobs, or Ray jobs |
| Triggers | Start jobs and crawlers | Scheduled, conditional (job or crawler event), or on-demand |
| Workflows | Orchestrate ETL | Visual editor for multi-step pipelines |
| Glue Studio | Visual ETL design | Drag-and-drop job creation |
Glue ETL — Key Concepts
- DynamicFrame: Glue alternative to Spark DataFrame, handles inconsistent schemas
- Job bookmarks: Persist state about previously processed data so that reruns skip it
- FindMatches: ML-based deduplication transform
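- Connections: Reusable connection definitions for sources and targets such as JDBC, S3, Kafka, Kinesis, and MongoDB
- Workers: Standard (legacy), G.1X (1 DPU; memory-intensive jobs), G.2X (2 DPU; memory-intensive jobs and ML transforms), G.025X (0.25 DPU; low-volume streaming)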
- Glue Streaming: Continuous ETL from Kinesis or Kafka
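The job bookmark idea from the list above can be sketched in plain Python. This is only an illustration of the concept, not Glue's implementation: the `Bookmark` class, `run_job` function, object keys, and timestamps are all hypothetical. In a real Glue job, bookmarks are enabled with the `--job-bookmark-option job-bookmark-enable` job parameter, and state is saved when the script calls `job.commit()`.

```python
from dataclasses import dataclass


@dataclass
class Bookmark:
    """Hypothetical stand-in for Glue's persisted bookmark state."""
    last_modified: float = 0.0  # newest object timestamp committed so far


def select_new_objects(objects, bookmark):
    """Keep only objects modified after the bookmark's high-water mark."""
    return [o for o in objects if o["modified"] > bookmark.last_modified]


def run_job(objects, bookmark):
    """Process unseen objects, then advance the bookmark (the 'commit' step)."""
    new = select_new_objects(objects, bookmark)
    processed = [o["key"] for o in new]  # placeholder for the real transform
    if new:
        bookmark.last_modified = max(o["modified"] for o in new)
    return processed


# Made-up S3 listing: two files exist before the first run.
listing = [
    {"key": "sales/2024-01-01.json", "modified": 100.0},
    {"key": "sales/2024-01-02.json", "modified": 200.0},
]
bookmark = Bookmark()
first_run = run_job(listing, bookmark)

# A third file arrives; the second run should pick up only that file.
listing.append({"key": "sales/2024-01-03.json", "modified": 300.0})
second_run = run_job(listing, bookmark)

print(first_run)
print(second_run)
```

The second run skips the two files already covered by the bookmark, which is the behavior that keeps incremental loads from reprocessing old data.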
Glue Data Quality (DQDL)
- DQDL: Data Quality Definition Language for defining rules
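- Rules: Cover completeness, uniqueness, accuracy, and referential integrity (DQDL rule types include Completeness, Uniqueness, ColumnValues, and ReferentialIntegrity)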
- Rulesets: Collection of rules applied to a dataset
- Actions: Halt job, quarantine records, or log warnings on failure
- DQ Scorecards: Track data quality metrics over time
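To make the rule, ruleset, and action ideas above concrete, here is a minimal plain-Python sketch that mimics what a ruleset such as `Rules = [ Completeness "email" > 0.7, Uniqueness "order_id" > 0.9 ]` checks. It does not call Glue; the helper names, sample rows, and thresholds are assumptions made for illustration, and uniqueness is taken here to mean the fraction of non-null values that occur exactly once.

```python
from collections import Counter


def completeness(rows, column):
    """Fraction of rows where the column is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(column) is not None)
    return filled / len(rows)


def uniqueness(rows, column):
    """Fraction of non-null values that occur exactly once (assumed definition)."""
    values = [r[column] for r in rows if r.get(column) is not None]
    if not values:
        return 0.0
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)


def evaluate(rows, ruleset):
    """Return {rule name: passed} for (name, metric, column, threshold) rules."""
    return {name: fn(rows, col) > threshold for name, fn, col, threshold in ruleset}


def act(results, on_failure="halt"):
    """Halt (raise) or warn (return the failed rule names) when rules fail."""
    failed = [name for name, ok in results.items() if not ok]
    if failed and on_failure == "halt":
        raise RuntimeError(f"Data quality rules failed: {failed}")
    return failed


# Made-up sample data: one missing email, one duplicated order_id.
orders = [
    {"order_id": 1, "email": "a@example.com"},
    {"order_id": 2, "email": "b@example.com"},
    {"order_id": 2, "email": None},
    {"order_id": 3, "email": "c@example.com"},
]

# A ruleset is simply a collection of rules applied to one dataset.
ruleset = [
    ("Completeness email > 0.7", completeness, "email", 0.7),
    ("Uniqueness order_id > 0.9", uniqueness, "order_id", 0.9),
]

results = evaluate(orders, ruleset)
failed = act(results, on_failure="warn")
print(results)
print(failed)
```

Switching `on_failure` to `"halt"` raises instead of returning, which mirrors the choice between stopping the job and only logging a warning.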
Exam Focus Areas
- Glue vs EMR: Glue for serverless ETL, EMR for custom Spark/Hadoop clusters
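- Glue vs Lambda: Glue for long-running ETL, Lambda for lightweight transforms that finish within its 15-minute timeout
- Partition projection: Speeds up Athena queries on heavily partitioned tables by computing partition values from table properties instead of looking them up in the Data Catalog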
- Glue DataBrew: Visual data preparation tool, no-code PII detection and masking
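The partition projection point above can be illustrated with a short sketch of the idea: partition locations are computed from table properties rather than fetched as stored partition metadata. The property keys follow the style Athena uses (`projection.enabled`, `projection.<column>.range`, `storage.location.template`), but the bucket name, date range, and the `project_partitions` helper are hypothetical, and only the `yyyy-MM-dd` date format is handled.

```python
from datetime import date, timedelta

# Table properties in the style used for partition projection.
# The bucket name and date range are made-up example values.
table_properties = {
    "projection.enabled": "true",
    "projection.dt.type": "date",
    "projection.dt.range": "2024-01-01,2024-01-05",
    "projection.dt.format": "yyyy-MM-dd",
    "storage.location.template": "s3://example-bucket/logs/dt=${dt}/",
}


def project_partitions(props, column, lower=None, upper=None):
    """Compute partition locations from properties, with no metadata lookup.

    lower/upper model a query predicate such as dt BETWEEN lower AND upper.
    """
    start_s, end_s = props[f"projection.{column}.range"].split(",")
    start, end = date.fromisoformat(start_s), date.fromisoformat(end_s)
    first = max(start, lower) if lower else start
    last = min(end, upper) if upper else end

    template = props["storage.location.template"]
    placeholder = "${" + column + "}"
    locations = []
    day = first
    while day <= last:
        locations.append(template.replace(placeholder, day.isoformat()))
        day += timedelta(days=1)
    return locations


# Without a predicate, every day in the configured range is projected.
all_locations = project_partitions(table_properties, "dt")

# With a predicate, only the matching days are generated.
pruned = project_partitions(
    table_properties, "dt", lower=date(2024, 1, 2), upper=date(2024, 1, 3)
)
print(len(all_locations))
print(pruned)
```

Because the locations are derived arithmetically, there is no per-partition metadata to fetch and no need to register new partitions as data arrives.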