Building scalable data pipelines for ML workloads
Exam Domain
Data Engineering represents 20% of the exam. Focus on choosing the right service for ingestion, transformation, and storage.
Key AWS Data Services
| Service | Purpose | Use Case | Key Features |
|---|---|---|---|
| Kinesis Data Streams | Real-time streaming ingestion | IoT sensor data, clickstream | Provisioned/on-demand capacity, 1-365 day retention |
| Kinesis Data Firehose | Near real-time delivery to targets | S3, Redshift, OpenSearch loading | Auto-scaling, built-in transformations, no management |
| Kinesis Data Analytics | Real-time stream processing | Anomaly detection, aggregation | SQL or Apache Flink, sliding windows |
| AWS Glue | ETL and data cataloging | Schema discovery, data prep | Serverless Spark, crawlers, Data Catalog |
| AWS Data Pipeline | Orchestrate data workflows | Scheduled batch processing | EC2/EMR-based, retry logic |
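To make the Kinesis Data Streams row concrete, here is a minimal sketch of batch-writing clickstream events with boto3. The stream name, record fields, and partition key choice are assumptions for illustration, not exam-specified values.

```python
import json
import boto3

# Assumed stream name; the stream must already exist (provisioned or on-demand capacity).
STREAM_NAME = "clickstream-events"

kinesis = boto3.client("kinesis")

def put_click_events(events):
    """Batch-write events; the partition key determines shard distribution."""
    records = [
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": event["user_id"],  # assumed field used to spread load across shards
        }
        for event in events
    ]
    response = kinesis.put_records(StreamName=STREAM_NAME, Records=records)
    # FailedRecordCount > 0 means some records were throttled and should be retried.
    return response["FailedRecordCount"]

if __name__ == "__main__":
    failed = put_click_events([
        {"user_id": "u-123", "page": "/home", "ts": 1700000000},
        {"user_id": "u-456", "page": "/cart", "ts": 1700000001},
    ])
    print(f"Failed records: {failed}")
```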
S3 Data Lake Patterns
- Raw Zone: Original data in native format (JSON, CSV, Parquet)
- Processed Zone: Cleaned and transformed data, typically in a columnar format such as Parquet (see the sketch after this list)
- Curated Zone: Business-ready datasets optimized for analytics and ML
- S3 Storage Classes: Use S3 Intelligent-Tiering for ML datasets with varying access patterns
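A hedged sketch of moving data between zones, assuming a single bucket named `ml-data-lake` with `raw/`, `processed/`, and `curated/` prefixes: it reads a newline-delimited JSON object from the raw zone and writes a Parquet copy to the processed zone (pandas with pyarrow handles the conversion). Bucket name, keys, and the `user_id` column are illustrative assumptions.

```python
import io
import boto3
import pandas as pd  # requires pyarrow for Parquet support

BUCKET = "ml-data-lake"                                        # assumed bucket name
RAW_KEY = "raw/clicks/2024/06/01/events.json"                  # assumed raw-zone key
PROCESSED_KEY = "processed/clicks/2024/06/01/events.parquet"   # assumed processed-zone key

s3 = boto3.client("s3")

# Read newline-delimited JSON from the raw zone in its native format.
raw_obj = s3.get_object(Bucket=BUCKET, Key=RAW_KEY)
df = pd.read_json(io.BytesIO(raw_obj["Body"].read()), lines=True)

# Example cleanup step before promoting data to the processed zone.
df = df.dropna(subset=["user_id"])

# Write columnar Parquet back to the processed zone.
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)
s3.put_object(Bucket=BUCKET, Key=PROCESSED_KEY, Body=buffer.getvalue())
```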
Exam Focus Areas
- Kinesis Data Streams requires shard/capacity planning (unless on-demand mode is used); Firehose is fully managed and scales automatically
- AWS Glue ETL jobs can convert data to Parquet/ORC for optimal SageMaker performance (see the sketch after this list)
- Use VPC endpoints for S3 to keep data transfer within the AWS network
- Glue Data Catalog integrates with Athena, Redshift Spectrum, and EMR
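Following the Glue bullet above, this is a minimal sketch of a Glue ETL job script that reads a crawler-registered table from the Data Catalog and writes it back to S3 as Parquet. The database name, table name, and output path are assumptions for illustration.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from a table discovered by a crawler (assumed database/table names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="ml_data_lake",
    table_name="raw_clicks",
)

# Write Parquet to the processed zone so SageMaker and Athena scan less data.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://ml-data-lake/processed/clicks/"},
    format="parquet",
)

job.commit()
```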