AWS Data Engineer Associate (DEA-C01) Study Notes

Amazon Kinesis Services

Real-time data streaming and delivery on AWS

Exam Domain

Data Ingestion and Transformation is the largest DEA-C01 domain at 34% of scored content. Kinesis services feature heavily within it.

Kinesis Data Streams vs Firehose

Feature	Kinesis Data Streams	Kinesis Firehose
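Latency	~200 ms standard, ~70 ms with enhanced fan-out (real-time)	Buffered by size and interval, up to 900 s (near real-time)
Retention	24h–365 days	No retention
Consumers	Custom (KCL, Lambda, SDK), Firehose, Managed Flink	Managed destinations only: S3, Redshift, OpenSearch, Splunk, HTTP endpoint
Scaling	Provisioned shards (manual resharding) or on-demand mode	Fully managed, auto-scales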
Data Transform	Custom via KCL/Lambda	Built-in Lambda transform
Use Case	Real-time analytics, custom apps	Simple delivery pipeline

Kinesis Data Streams — Key Concepts

Shard: 1 MB/s or 1,000 records/s write, 2 MB/s read. The MD5 hash of the partition key determines the shard.
Sequence number: Unique per partition key within a shard, assigned on write
Enhanced fan-out: 2 MB/s per consumer per shard (push model)
Standard vs Enhanced: Standard polls (shared 2 MB/s), Enhanced pushes (dedicated 2 MB/s)
Iterator types: TRIM_HORIZON, LATEST, AT_SEQUENCE_NUMBER, AFTER_SEQUENCE_NUMBER, AT_TIMESTAMP
On-demand capacity: Auto-scales based on throughput
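
The per-shard limits above turn into a simple sizing rule for provisioned streams. The following is a minimal sketch, assuming the standard limits of 1 MB/s and 1,000 records/s for writes and 2 MB/s of shared read throughput per shard. The function name and its arguments are illustrative, not an AWS API.

```python
import math

# Per-shard limits for a provisioned Kinesis data stream.
WRITE_MB_PER_SHARD = 1.0        # MB/s ingested per shard
WRITE_RECORDS_PER_SHARD = 1000  # records/s ingested per shard
READ_MB_PER_SHARD = 2.0         # MB/s shared by all standard consumers


def estimate_shards(write_mb_s, records_s, standard_consumers=1):
    """Return the smallest shard count that satisfies every limit."""
    by_write_bytes = write_mb_s / WRITE_MB_PER_SHARD
    by_write_records = records_s / WRITE_RECORDS_PER_SHARD
    # Each standard consumer reads the full stream, and all of them
    # share the 2 MB/s read budget of a shard.
    by_read = (write_mb_s * standard_consumers) / READ_MB_PER_SHARD
    return max(1, math.ceil(max(by_write_bytes, by_write_records, by_read)))


# 5 MB/s in, 2,000 records/s, three standard consumers: reads dominate.
print(estimate_shards(write_mb_s=5, records_s=2000, standard_consumers=3))  # 8
```

With enhanced fan-out each consumer gets its own 2 MB/s per shard, so the read term would no longer grow with the number of consumers.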

Kinesis Firehose — Key Concepts

Destinations: S3, Redshift (via S3), OpenSearch, Splunk, HTTP endpoint
Lambda transformation: Apply custom transformation before delivery
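Buffer size (1–128 MB) and buffer interval (0–900 s; formerly 60–900 s) control when data is delivered; whichever threshold is reached first triggers delivery
Automatic format conversion: JSON to Parquet/ORC for S3, using a schema from a Glue Data Catalog table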
Dynamic partitioning: Route data to S3 prefixes based on record fields
Error output: Failed records are written to a separate S3 error prefix
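
The Lambda transformation mentioned above follows a fixed contract: Firehose passes a batch of base64-encoded records, and the function returns every recordId with a result of Ok, Dropped, or ProcessingFailed. The following is a minimal sketch of such a handler. The DEBUG filter and the processed field are hypothetical rules chosen for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and report a per-record result."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Unparseable input: Firehose sends these to its error output.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue

        if payload.get("level") == "DEBUG":  # hypothetical filter rule
            output.append({
                "recordId": record["recordId"],
                "result": "Dropped",
                "data": record["data"],
            })
            continue

        payload["processed"] = True  # hypothetical enrichment
        # Newline-delimit records so they stay separable in the S3 object.
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": output}
```

Every incoming recordId has to appear in the response, which is why dropped and failed records are still returned rather than skipped.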

Exam Focus Areas

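Hot shard (ProvisionedThroughputExceededException): Use a higher-cardinality or salted partition key to spread writes; add shards if the whole stream is under-provisioned. Random keys give up per-key ordering.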
Firehose can read from Kinesis Data Streams directly (source = KDS)
Kinesis Data Analytics (now Amazon Managed Service for Apache Flink): Real-time stream processing with Flink; the older SQL applications are legacy
MSK Connect: Managed Kafka connectors for source/sink integrations
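
The hot-shard point is easier to see with the key-to-shard mapping written out. Kinesis hashes the partition key with MD5 into a 128-bit hash key, and each shard owns a range of that space. The following is a minimal sketch, assuming the shards split the hash space into equal ranges. The helper names, the tenant-42 key, and the salt suffix scheme are illustrative.

```python
import hashlib
import random
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index, assuming equal hash-key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


def distribution(keys, shard_count):
    """Count how many records land on each shard."""
    return Counter(shard_for(key, shard_count) for key in keys)


rng = random.Random(0)  # seeded so repeated runs give the same keys
hot = ["tenant-42"] * 10_000
salted = [f"tenant-42#{rng.randrange(16)}" for _ in range(10_000)]

print(distribution(hot, 4))     # one key always lands on one shard
print(distribution(salted, 4))  # salted keys spread across shards
```

Adding shards does not help the first case, because a single key still maps to a single shard. Salting spreads the load but gives up ordering for that key, so consumers that need order have to regroup records themselves.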

AWS Glue ETL & Data Catalog

Serverless ETL service and central metadata repository

Glue Components

Component	Purpose	Key Details
Crawlers	Discover and catalog data	Auto-detects schema, writes to Data Catalog
Data Catalog	Central metadata store	Compatible with Hive metastore, used by Athena/EMR/Redshift Spectrum
ETL Jobs	Transform data	Spark (PySpark or Scala), Spark Streaming, Python shell, or Ray
Triggers	Schedule or event-based	Schedule, job event, on-demand
Workflows	Orchestrate ETL	Visual editor for multi-step pipelines
Glue Studio	Visual ETL design	Drag-and-drop job creation

Glue ETL — Key Concepts

DynamicFrame: Glue's alternative to the Spark DataFrame; records are self-describing, so inconsistent schemas are handled without a fixed schema up front
Job bookmark: Tracks previously processed data to avoid reprocessing on the next run
FindMatches: ML-based deduplication transform
Connections: JDBC, S3, Kafka, Kinesis, MongoDB
Workers: G.1X (1 DPU: 4 vCPU, 16 GB), G.2X (2 DPU: 8 vCPU, 32 GB; larger memory-heavy and ML transform jobs), G.025X (0.25 DPU; low-volume streaming), Standard (legacy)
Glue Streaming: Continuous ETL from Kinesis or Kafka
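
Glue jobs are billed by DPU-hour, so worker type and worker count set the cost of a run. The following is a minimal sketch, assuming G.025X, G.1X and G.2X map to 0.25, 1 and 2 DPUs per worker and ignoring per-run billing minimums. The price is left as an input because it varies by Region, and the function names are illustrative.

```python
# DPUs per worker for the Glue worker types listed above.
DPU_PER_WORKER = {"G.025X": 0.25, "G.1X": 1.0, "G.2X": 2.0}


def dpu_hours(worker_type, workers, runtime_minutes):
    """Return the DPU-hours consumed by one job run."""
    return DPU_PER_WORKER[worker_type] * workers * runtime_minutes / 60


def run_cost(worker_type, workers, runtime_minutes, price_per_dpu_hour):
    """Cost of one run for a caller-supplied price per DPU-hour."""
    return dpu_hours(worker_type, workers, runtime_minutes) * price_per_dpu_hour


# Ten G.2X workers for 30 minutes: 2 DPU x 10 workers x 0.5 h.
print(dpu_hours("G.2X", workers=10, runtime_minutes=30))  # 10.0
```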

Glue Data Quality (DQDL)

DQDL: Data Quality Definition Language for defining rules
Rules: Rule types include Completeness, Uniqueness, ColumnValues, ReferentialIntegrity, and RowCount
Rulesets: Collection of rules applied to a dataset
Actions: Halt job, quarantine records, or log warnings on failure
DQ Scorecards: Track data quality metrics over time
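
A DQDL ruleset is plain text of the form Rules = [ ... ], with one rule per entry. The following is a minimal sketch that assembles one in Python. The orders-style column names and the thresholds are hypothetical, and build_ruleset is an illustrative helper, not part of any AWS SDK.

```python
def build_ruleset(rules):
    """Join DQDL rule strings into a ruleset definition."""
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


# Hypothetical columns for an orders table.
rules = [
    'IsComplete "order_id"',
    'IsUnique "order_id"',
    'Completeness "email" > 0.95',
    'ColumnValues "status" in ["NEW", "SHIPPED", "CANCELLED"]',
    'RowCount > 0',
]

print(build_ruleset(rules))
```

The resulting string is what you would paste into the Glue console or supply as the ruleset definition when creating a ruleset programmatically.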

Exam Focus Areas

Glue vs EMR: Glue for serverless ETL, EMR for custom Spark/Hadoop clusters
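Glue vs Lambda: Glue for long-running ETL, Lambda for lightweight transforms that finish within its 15-minute timeout
Partition projection: Speed up Athena queries on heavily partitioned tables by computing partition values and locations from table properties instead of looking them up in the Data Catalog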
Glue DataBrew: Visual data preparation tool, no-code PII detection and masking
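
Partition projection is configured entirely through table properties. The following is a minimal sketch that builds those properties for a date partition and renders them as an ALTER TABLE statement for Athena. The table name logs, the column dt, and the bucket are hypothetical, and both helper functions are illustrative.

```python
def projection_properties(column, start_date, location_template):
    """Build Athena table properties for a date-typed projected partition."""
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        f"projection.{column}.format": "yyyy/MM/dd",
        f"projection.{column}.range": f"{start_date},NOW",
        f"projection.{column}.interval": "1",
        f"projection.{column}.interval.unit": "DAYS",
        "storage.location.template": location_template,
    }


def alter_table_ddl(table, properties):
    """Render the properties as an ALTER TABLE statement for Athena."""
    pairs = ",\n".join(f"  '{k}' = '{v}'" for k, v in properties.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n{pairs}\n)"


props = projection_properties(
    column="dt",
    start_date="2024/01/01",
    location_template="s3://example-bucket/logs/${dt}/",  # hypothetical bucket
)
print(alter_table_ddl("logs", props))
```

Because Athena derives each partition location from the template, new daily prefixes become queryable without running a crawler or MSCK REPAIR TABLE.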
