Building scalable data pipelines for ML workloads
Exam Domain
Data Engineering represents 20% of the exam. Focus on choosing the right service for ingestion, transformation, and storage.
Key AWS Data Services
| Service | Purpose | Use Case | Key Features |
|---|---|---|---|
| Kinesis Data Streams | Real-time streaming ingestion | IoT sensor data, clickstream | Provisioned/on-demand capacity, 1-365 day retention |
| Kinesis Data Firehose | Near real-time delivery to targets | S3, Redshift, OpenSearch loading | Auto-scaling, built-in transformations, no management |
| Kinesis Data Analytics | Real-time stream processing | Anomaly detection, aggregation | SQL or Apache Flink, sliding windows |
| AWS Glue | ETL and data cataloging | Schema discovery, data prep | Serverless Spark, crawlers, Data Catalog |
| AWS Data Pipeline | Orchestrate data workflows | Scheduled batch processing | EC2/EMR-based, retry logic |
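To make the Kinesis Data Streams row concrete, here is a minimal sketch of batch-writing clickstream events with boto3. The stream name, record fields, and partition key choice are assumptions for illustration, not exam-specified values.

```python
import json
import boto3

# Assumed stream name; the stream must already exist (provisioned or on-demand capacity).
STREAM_NAME = "clickstream-events"

kinesis = boto3.client("kinesis")

def put_click_events(events):
    """Batch-write events; the partition key determines shard distribution."""
    records = [
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": event["user_id"],  # assumed field used to spread load across shards
        }
        for event in events
    ]
    response = kinesis.put_records(StreamName=STREAM_NAME, Records=records)
    # FailedRecordCount > 0 means some records were throttled and should be retried.
    return response["FailedRecordCount"]

if __name__ == "__main__":
    failed = put_click_events([
        {"user_id": "u-123", "page": "/home", "ts": 1700000000},
        {"user_id": "u-456", "page": "/cart", "ts": 1700000001},
    ])
    print(f"Failed records: {failed}")
```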
S3 Data Lake Patterns
- Raw Zone: Original data in native format (JSON, CSV, Parquet)
- Processed Zone: Cleaned and transformed data, typically in a columnar format such as Parquet (see the sketch after this list)
- Curated Zone: Business-ready datasets optimized for analytics and ML
- S3 Storage Classes: Use S3 Intelligent-Tiering for ML datasets with varying access patterns
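A hedged sketch of moving data between zones, assuming a single bucket named `ml-data-lake` with `raw/`, `processed/`, and `curated/` prefixes: it reads a newline-delimited JSON object from the raw zone and writes a Parquet copy to the processed zone (pandas with pyarrow handles the conversion). Bucket name, keys, and the `user_id` column are illustrative assumptions.

```python
import io
import boto3
import pandas as pd  # requires pyarrow for Parquet support

BUCKET = "ml-data-lake"                                        # assumed bucket name
RAW_KEY = "raw/clicks/2024/06/01/events.json"                  # assumed raw-zone key
PROCESSED_KEY = "processed/clicks/2024/06/01/events.parquet"   # assumed processed-zone key

s3 = boto3.client("s3")

# Read newline-delimited JSON from the raw zone in its native format.
raw_obj = s3.get_object(Bucket=BUCKET, Key=RAW_KEY)
df = pd.read_json(io.BytesIO(raw_obj["Body"].read()), lines=True)

# Example cleanup step before promoting data to the processed zone.
df = df.dropna(subset=["user_id"])

# Write columnar Parquet back to the processed zone.
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)
s3.put_object(Bucket=BUCKET, Key=PROCESSED_KEY, Body=buffer.getvalue())
```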
Exam Focus Areas
- Kinesis Data Streams requires shard/capacity planning (unless on-demand mode is used); Firehose is fully managed and scales automatically
- AWS Glue ETL jobs can convert data to Parquet/ORC for optimal SageMaker performance (see the sketch after this list)
- Use VPC endpoints for S3 to keep data transfer within the AWS network
- Glue Data Catalog integrates with Athena, Redshift Spectrum, and EMR
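Following the Glue bullet above, this is a minimal sketch of a Glue ETL job script that reads a crawler-registered table from the Data Catalog and writes it back to S3 as Parquet. The database name, table name, and output path are assumptions for illustration.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from a table discovered by a crawler (assumed database/table names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="ml_data_lake",
    table_name="raw_clicks",
)

# Write Parquet to the processed zone so SageMaker and Athena scan less data.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://ml-data-lake/processed/clicks/"},
    format="parquet",
)

job.commit()
```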