
Prepare for the AWS ML Engineer Associate (MLA-C01) exam with practical questions and detailed explanations.
This mock exam is an independent educational resource. It is not affiliated with, endorsed by, or sponsored by Amazon Web Services (AWS).
Choose the mode that fits your current goal.
Feedback after every answer, free navigation, no timer.
Realistic simulation of the actual exam with a countdown.
See all questions with the correct answers and detailed explanations to reinforce your learning.
Questions: 65
Topics: 65
Which Amazon SageMaker feature provides a visual interface for ML data preparation and transformation, letting you import from S3, Redshift, Athena, and Snowflake without writing code?
Correct answer: A. SageMaker Data Wrangler
SageMaker Data Wrangler provides a low-code UI to connect to multiple sources (S3, Redshift, Athena, Snowflake), explore data, apply 300+ built-in transformations, and export as a pipeline. Pipelines orchestrates workflows, Clarify measures bias, and Model Monitor detects production model drift.
Which AWS service is a serverless ETL (Extract, Transform, Load) platform that discovers, catalogs, and prepares data at scale for analytics and ML?
Correct answer: A. AWS Glue
AWS Glue is the AWS serverless ETL platform, with Glue Data Catalog (metadata), Glue Crawlers (automatic discovery), and Glue Jobs (Spark/Python for transformations). EMR is managed Hadoop/Spark clusters (more flexible but requires operations), Kinesis is real-time data streaming, and Lambda is general-purpose serverless compute.
For a numeric feature with a strongly skewed (long-tail) distribution, which transformation helps normalize the distribution before training an ML model?
Correct answer: A. Apply a logarithmic (log) transformation
The logarithmic transformation is a classic technique to reduce skewness in long-tail distributions (price, salary, counts), bringing them closer to a normal distribution. This improves training stability for scale-sensitive models (linear regression, neural networks). Multiplying or adding constants does not change the distribution's shape, and removing features loses information.
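For illustration, a minimal sketch with NumPy and pandas (the column and its values are made up); log1p handles zeros safely:

```python
import numpy as np
import pandas as pd

# A long-tail feature such as price (values are illustrative)
df = pd.DataFrame({"price": [80_000, 95_000, 120_000, 340_000, 2_500_000]})

# log1p = log(1 + x): defined at zero and compresses the long tail
df["price_log"] = np.log1p(df["price"])
print(df)
```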
A numeric column in the dataset has 5% missing values. Which common strategy preserves the dataset size and reduces bias introduced by missingness?
Correct answer: B. Impute (fill in) missing values using mean, median, or a predictive model
Imputation preserves rows and uses statistics (mean/median) or models to estimate missing values — standard practice in data preparation. Dropping rows shrinks the dataset and may introduce bias if missingness is not random. Replacing with zero distorts the distribution, and dropping a predictive column discards useful signal.
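A minimal scikit-learn sketch of median imputation (values are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column with missing entries (values are illustrative)
X = np.array([[25.0], [32.0], [np.nan], [41.0], [np.nan], [29.0]])

# Median imputation keeps every row and is robust to outliers
X_filled = SimpleImputer(strategy="median").fit_transform(X)
print(X_filled.ravel())
```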
Which AWS feature lets you store, share, and reuse ML features across multiple models and teams, maintaining consistency between training and inference?
Correct answer: A. Amazon SageMaker Feature Store
SageMaker Feature Store is the AWS central feature repository with a low-latency online store (for real-time inference) and an offline store (S3, for training), ensuring consistency between the two worlds. Plain S3 and DynamoDB lack versioning, lineage, and online/offline sync. Glue Data Catalog is table metadata (schemas) — it does not store features.
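A hedged boto3 sketch of an online-store read at inference time (the feature group name and record identifier are hypothetical):

```python
import boto3

# Low-latency lookup against the Feature Store online store
runtime = boto3.client("sagemaker-featurestore-runtime")
record = runtime.get_record(
    FeatureGroupName="customer-features",          # hypothetical name
    RecordIdentifierValueAsString="customer-123",  # hypothetical ID
)
print(record["Record"])  # the same features the offline store serves for training
```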
A team needs to predict house prices (a continuous numeric variable) from features such as area, location, and age. Which type of algorithm is most appropriate?
Correct answer: A. A regression algorithm (e.g., XGBoost regressor, Linear Regression)
When the target variable is continuous numeric (price, temperature, age), the problem is regression. Classification predicts discrete categories, clustering groups without labels, and anomaly detection identifies outliers — none of these directly fit predicting a continuous numeric value.
Which SageMaker feature lets you run ML training jobs on managed instances with support for built-in algorithms, frameworks (TensorFlow, PyTorch), and BYOC (Bring Your Own Container)?
Correct answer: A. SageMaker Training Jobs
SageMaker Training Jobs run training on managed instances (auto-provisioned and torn down) with support for built-in algorithms, major frameworks, and custom containers. Endpoints are for inference, Ground Truth is for labeling, and Studio Lab is a free educational environment.
Which SageMaker feature automates the search for the best hyperparameters (learning rate, batch size, etc.) by running multiple training jobs in parallel?
Correct answer: A. SageMaker Automatic Model Tuning (Hyperparameter Tuning)
SageMaker Automatic Model Tuning (also called Hyperparameter Tuning) runs multiple training jobs with different hyperparameter combinations, using strategies such as Bayesian, Random, Grid Search, or Hyperband, to find the best configuration. Pipelines orchestrates workflows, Inference Recommender tests deploy instances, and Model Cards document models.
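A sketch with the SageMaker Python SDK (the role ARN, S3 paths, and XGBoost image version are placeholders):

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, "1.7-1"),
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    hyperparameters={"objective": "binary:logistic", "num_round": 200},
)

# Bayesian search over the learning rate and tree depth
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=4,
    strategy="Bayesian",
)
tuner.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/validation"})
```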
Which SageMaker feature provides AutoML — automatically generating multiple candidate models (with feature engineering and hyperparameter tuning) from a tabular dataset?
Correct answer: A. SageMaker Autopilot
SageMaker Autopilot is the AWS AutoML service — it accepts a CSV/Parquet dataset, identifies the problem type (regression/classification), does feature engineering, and trains dozens of candidate models with automatic hyperparameter tuning. Edge Manager manages models on edge devices, Neo optimizes models for specific hardware, and Debugger monitors training jobs in real time.
An application needs to run batch predictions on 10 million records once a day, without low-latency requirements. Which SageMaker endpoint type is most cost-effective?
Correct answer: B. Batch Transform
SageMaker Batch Transform is ideal for batch predictions on large datasets — it provisions temporary instances, processes the entire dataset, and shuts down, billing only for usage time. A real-time endpoint keeps instances on 24/7 (expensive for sporadic use), Async Inference handles requests with large payloads but still serves individual responses, and Multi-Model is for hosting multiple models on a single endpoint.
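A sketch with the SageMaker Python SDK (the model name and S3 paths are placeholders):

```python
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="my-registered-model",           # placeholder
    instance_count=4,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/predictions/",  # placeholder
)

# Provisions instances, processes the full dataset, then shuts everything down
transformer.transform(
    data="s3://my-bucket/daily-batch/records.csv",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```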
A startup has a model with intermittent traffic (sporadic spikes followed by long idle periods). Which SageMaker deployment option automatically scales to zero when idle, charging only for usage?
Correct answer: B. SageMaker Serverless Inference
SageMaker Serverless Inference auto-scales (including to zero) and bills by inference duration + memory — ideal for intermittent traffic. Real-time keeps dedicated instances 24/7, edge targets non-AWS devices, and persistent Multi-Model also keeps resources active.
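A sketch with the SageMaker Python SDK (the image URI, model artifact, and role are placeholders):

```python
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

model = Model(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/inference:latest",  # placeholder
    model_data="s3://my-bucket/model/model.tar.gz",                             # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",               # placeholder
)

# Scales to zero when idle; billed per inference duration and memory size
predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=5,
    )
)
```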
Which AWS service is the native CI/CD tool for ML, letting you orchestrate steps such as data preparation, training, evaluation, model registration, and deployment?
Correct answer: A. SageMaker Pipelines
SageMaker Pipelines is the native CI/CD tool specifically for ML flows, with direct integration to Training Jobs, Processing, Model Registry, and Endpoints. CodePipeline is general-purpose CI/CD (not ML-specialized), Step Functions orchestrates generic workflows (usable but requires more glue code), and Glue Workflows is only for Glue ETL.
Which SageMaker feature centralizes the model lifecycle — versioning, manual approval before deployment, and lineage between training and production?
Correct answer: A. SageMaker Model Registry
SageMaker Model Registry lets you version models in "Model Groups", manually approve versions before deployment (governance), and track lineage. Feature Store stores features (not models), CodeArtifact manages code packages, and ECR stores Docker images (model containers, but without ML-specific governance).
A production model starts producing worse predictions after a few weeks, even without code changes. Which SageMaker feature automatically detects data drift (deviation in input data distribution)?
Correct answer: A. SageMaker Model Monitor
SageMaker Model Monitor automatically detects data drift, model quality drift, bias drift, and feature attribution drift by comparing production distributions with a baseline recorded during training. Pipelines orchestrates flows, CloudTrail audits API calls (not statistical distribution), and Trusted Advisor performs account checks (not ML-specific).
Which AWS practice should be used to grant a SageMaker Training Job access to a specific S3 bucket without exposing permanent credentials?
Correct answer: B. Attach an IAM execution Role to the job with a policy specific to the bucket
SageMaker assumes an IAM execution Role (passed via the RoleArn parameter) that receives rotating temporary credentials — the principle of least privilege. Sharing root credentials, hardcoding passwords, or making a bucket public directly violate AWS security best practices.
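A sketch of the least-privilege policy such an execution role might carry (the bucket name is a placeholder):

```python
import json

# Scoped to a single bucket: the job can read/write only what it needs
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-training-bucket",    # placeholder bucket
            "arn:aws:s3:::my-training-bucket/*",
        ],
    }],
}
print(json.dumps(policy, indent=2))
# The role's ARN is passed to the job via RoleArn; SageMaker assumes it
# and receives rotating temporary credentials.
```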
How can a company ensure that traffic between its EC2 instances and SageMaker stays within AWS, without going through the public internet?
Correct answer: B. Configure VPC Endpoints (PrivateLink) for SageMaker inside the VPC
VPC Endpoints (powered by AWS PrivateLink) allow access to SageMaker APIs (training, inference, runtime) via a private interface inside the VPC, keeping traffic within the AWS network. IPv6 does not isolate traffic, SSH tunneling does not apply to AWS HTTP APIs, and TLS only encrypts the traffic, which still traverses the public internet.
A team is repeatedly training models on SageMaker on-demand instances, with long-running jobs (4-6 hours). Which option can significantly reduce training costs while tolerating interruptions?
Correct answer: A. Use SageMaker Managed Spot Training
SageMaker Managed Spot Training uses spare AWS capacity (Spot) for training, saving up to 90% vs on-demand. Combined with checkpointing to S3, interrupted jobs resume from the last checkpoint. Larger instances may reduce time but increase total cost; serverless inference is for prediction (not training); and Multi-Model is also for inference.
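A sketch with the SageMaker Python SDK (the training image, role, and S3 paths are placeholders):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",                              # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,  # train on spare (Spot) capacity
    max_run=6 * 3600,         # max training time in seconds
    max_wait=8 * 3600,        # max_run plus time to wait for Spot capacity
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # resume after interruption
)
estimator.fit({"train": "s3://my-bucket/train/"})
```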
A team stores training datasets accessed frequently in the first 30 days and rarely afterwards. Which strategy is most cost-effective?
Correct answer: B. Configure an S3 Lifecycle policy: Standard → Standard-IA after 30 days
S3 Lifecycle policies automatically move objects between classes — Standard (high frequency) → Standard-IA (infrequent access, ~40% cheaper in storage) → Glacier (archival). Keeping everything in Standard is expensive, immediate Glacier Deep Archive prevents fast access during the first 30 days, and EBS is block storage for EC2 (not for shared ML datasets).
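A boto3 sketch of such a lifecycle rule (the bucket name and prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-training-datasets",  # placeholder
    LifecycleConfiguration={
        "Rules": [{
            "ID": "standard-to-ia-after-30-days",
            "Status": "Enabled",
            "Filter": {"Prefix": "datasets/"},  # placeholder prefix
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
        }],
    },
)
```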
Which AWS service lets you query data in S3 using standard SQL without provisioning infrastructure?
Correct answer: A. Amazon Athena
Amazon Athena is serverless and bills only for data scanned, ideal for ad-hoc queries on S3 data lakes (CSV, JSON, Parquet, ORC formats). Redshift is a provisioned data warehouse, Glue is ETL, and RDS is a managed relational database.
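A boto3 sketch of an ad-hoc query (the database, table, and output location are placeholders):

```python
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT label, COUNT(*) FROM events GROUP BY label",  # placeholder table
    QueryExecutionContext={"Database": "ml_datalake"},                # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for status
```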
A company needs to process 100 TB of raw data with Apache Spark, running custom jobs in Python and Scala. Which AWS service is most suitable?
Correct answer: B. Amazon EMR (managed Hadoop/Spark/Hive)
Amazon EMR (Elastic MapReduce) is the AWS managed platform for big data frameworks such as Spark, Hadoop, Hive, and Presto, ideal for large-scale processing with code flexibility. Lambda has time (15 min) and memory limits, Athena is SQL-only, and Glue is ETL with less granular control than EMR.
In a fraud detection dataset, only 1% of cases are fraud (positive class). Which technique is appropriate to mitigate imbalance during training?
Correct answer: A. Apply SMOTE (Synthetic Minority Over-sampling) or adjust class_weight in the algorithm
SMOTE generates synthetic samples of the minority class to balance, and adjusting class_weight (or pos_weight in XGBoost) penalizes errors on the rare class during training. Training only on negative cases eliminates the fraud signal, dropping features discards predictive information, and simple accuracy is misleading on imbalanced data (precision/recall/F1/AUC are preferable).
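A minimal scikit-learn sketch of the class_weight approach on a synthetic 1%-positive dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic dataset with ~1% positives, mirroring the fraud scenario
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=42)

# "balanced" up-weights errors on the rare class during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
# The analogous XGBoost knob is scale_pos_weight ≈ n_negative / n_positive
```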
A team wants to train a tabular classification model without writing algorithm code from scratch. Which SageMaker built-in algorithm is widely used for tabular tasks with strong performance?
Correct answer: A. XGBoost
XGBoost is the SageMaker built-in algorithm for gradient boosting on tabular data, dominant in Kaggle competitions and production. BlazingText is for text classification and Word2Vec; Object Detection and Semantic Segmentation are for computer vision — not tabular.
To train a large deep learning model (e.g., BERT) on a 500 GB dataset, how can training be accelerated using multiple GPUs or multiple nodes?
Correct answer: A. Enable distributed training with SageMaker (Data Parallelism, Model Parallelism, or both)
SageMaker supports data parallelism (each GPU processes different batches of the dataset, syncing gradients) and model parallelism (the model is split across GPUs/nodes, useful when the model does not fit on one GPU). Reducing the batch size to 1 hurts throughput, CPU is too slow for large deep learning, and local training is infeasible for datasets of hundreds of GB.
In a binary classification problem for medical diagnosis, which metric is most critical to maximize to avoid false negatives (missing a disease)?
Correct answer: A. Recall (sensitivity / true positive rate)
Recall measures the fraction of actual positives the model detected — TP / (TP + FN). High recall is essential in medical diagnosis to minimize false negatives (missing a serious disease). Precision focuses on avoiding false positives; latency and size are operational metrics (not directly about clinical quality).
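A minimal scikit-learn sketch (labels are illustrative):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # 1 = disease present
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

# Recall = TP / (TP + FN): the share of real cases the model catches
print("recall:", recall_score(y_true, y_pred))  # 4 of 5 positives found
print("precision:", precision_score(y_true, y_pred))
```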
To evaluate an ML model robustly on a small dataset, which technique is recommended to reduce the variance of the performance estimate?
Correct answer: A. K-fold cross-validation (e.g., 5 folds)
K-fold cross-validation splits the dataset into K parts and trains K times (each time using a different part as validation), reducing the variance of the performance estimate. Training and testing on the same set yields an overly optimistic evaluation, batch size is a training parameter, and dropout is a regularization technique (applied during training, never at inference).
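A minimal scikit-learn sketch of 5-fold cross-validation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Five train/validation splits yield five estimates instead of one
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```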
Which SageMaker feature lets you organize, compare, and reproduce ML experiments with different hyperparameters, datasets, and algorithms?
Correct answer: A. SageMaker Experiments
SageMaker Experiments automatically tracks each training job as a "trial", grouped in "experiments", letting you compare metrics and hyperparameters across runs. Model Cards documents models for governance, Studio Lab is a free educational environment, and Edge Manager manages models on edge devices.
How can a SageMaker endpoint be configured to automatically scale the number of instances based on traffic volume?
Correct answer: A. Enable Application Auto Scaling on the endpoint with a policy based on metrics such as SageMakerVariantInvocationsPerInstance
SageMaker endpoints integrate with Application Auto Scaling — you define a policy based on metrics (invocations per instance, latency, CPU) and the instance count adjusts automatically between min/max. Manual restart does not scale, an oversized instance creates idle cost, and Lambda has limits incompatible with large models.
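A boto3 sketch of target tracking on that metric (the endpoint and variant names are placeholders):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # placeholder

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=8,
)
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # target invocations per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```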
How can you run an A/B test with 2 different model versions on the same SageMaker endpoint, splitting traffic between them?
Correct answer: A. Configure Production Variants on the endpoint with traffic weights (e.g., 50/50, 80/20)
SageMaker supports multiple Production Variants on the same endpoint, each with a traffic distribution weight. Useful for A/B testing, canary deployments, and blue-green. Separate endpoints with DNS routing work but require extra infrastructure, training the models together does not isolate them for comparison, and no workaround is needed since SageMaker supports this scenario natively.
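A boto3 sketch of an 80/20 split (the model and config names are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="ab-test-config",  # placeholder
    ProductionVariants=[
        {
            "VariantName": "model-a",
            "ModelName": "model-a",  # placeholder model
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.8,  # 80% of traffic
        },
        {
            "VariantName": "model-b",
            "ModelName": "model-b",  # placeholder model
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.2,  # 20% of traffic
        },
    ],
)
```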
Which SageMaker feature helps you choose the most suitable instance for deploying a model by running latency and cost benchmarks across different instance types?
Correct answer: A. SageMaker Inference Recommender
SageMaker Inference Recommender runs the model on different instance types (CPU, GPU, Inferentia) and provides latency, throughput, and cost comparisons, recommending the best option. Neo optimizes the model for specific hardware, Edge Manager manages edge devices, and Pipelines orchestrates ML workflows.
Which SageMaker feature compiles and optimizes ML models to run faster and with less memory on specific hardware (CPU, GPU, ARM, Inferentia)?
Correct answer: A. SageMaker Neo
SageMaker Neo compiles trained models (TensorFlow, PyTorch, MXNet, etc.) into code optimized for specific hardware, reducing footprint and latency — particularly useful on edge devices or Inferentia. Studio is the ML IDE, Pipelines orchestrates workflows, and Feature Store stores features.
How can you monitor the latency, throughput, and error rate of a SageMaker endpoint in real time?
Correct answer: A. Use Amazon CloudWatch — metrics such as Invocations, ModelLatency, and Invocation4XXErrors are published automatically
SageMaker endpoints automatically publish metrics to CloudWatch (Invocations, ModelLatency, OverheadLatency, Invocation4XXErrors, Invocation5XXErrors, etc.), enabling real-time dashboards and alarms. Manual polling and local logs are primitive approaches that do not scale, and endpoint monitoring is natively supported.
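A boto3 sketch of a latency alarm (the endpoint and variant names are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="endpoint-high-latency",
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},  # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=500_000,  # ModelLatency is reported in microseconds
    ComparisonOperator="GreaterThanThreshold",
)
```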
Which SageMaker feature is used to detect bias in datasets and models and generate explainability reports (feature importance)?
Correct answer: A. SageMaker Clarify
SageMaker Clarify is the dedicated AWS service for bias analysis (during and after training) and explainability (SHAP values, feature importance). Pipelines orchestrates workflows, Endpoints host models for inference, and Studio is the IDE.
How can you encrypt sensitive data stored in S3 buckets used for SageMaker training jobs?
Correct answer: A. Enable Server-Side Encryption on S3 (SSE-S3 or SSE-KMS) — SageMaker decrypts automatically when accessing
S3 provides Server-Side Encryption (SSE-S3 with AWS keys, SSE-KMS with KMS-managed keys, or SSE-C with customer keys). SageMaker decrypts transparently as long as it has IAM permissions for the bucket and KMS key. Not encrypting is a security flaw, manual encryption is unnecessary work, and renaming the bucket does not encrypt anything.
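A boto3 sketch of an SSE-KMS upload (the bucket, key, and KMS ARN are placeholders):

```python
import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-training-data",        # placeholder
    Key="train/part-0001.csv",        # placeholder
    Body=b"...",                      # illustrative bytes
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="arn:aws:kms:us-east-1:123456789012:key/<key-id>",  # placeholder
)
# SageMaker reads the object transparently if its role has s3:GetObject
# plus kms:Decrypt on the key.
```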
Which SageMaker feature lets you document models with metadata such as purpose, training data, performance metrics, and ethical considerations, for governance and compliance purposes?
Correct answer: A. SageMaker Model Cards
SageMaker Model Cards is the model documentation feature for governance/compliance, recording intended use, training data, evaluation metrics, ethical considerations, and known biases. Pipelines orchestrates workflows, Inference Recommender benchmarks instances, and Studio Lab is a free educational environment.
To ingest real-time event streams (clicks, IoT sensors) and make them available to multiple ML consumers, which AWS service is most suitable?
Correct answer: A. Amazon Kinesis Data Streams
Kinesis Data Streams is the AWS real-time streaming service, ideal for multiple consumers (Lambda, Kinesis Data Analytics, Firehose, custom apps) with configurable retention up to 365 days. Glue is batch ETL, S3 has no native streaming (only events), and Lambda processes events but does not store/replay them.
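A boto3 sketch of a producer write (the stream name and payload are illustrative):

```python
import json

import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="clickstream",  # placeholder
    Data=json.dumps({"user": 42, "event": "click"}).encode(),
    PartitionKey="42",  # determines the shard; same key, same ordering
)
```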
Which SageMaker feature lets you run custom data preprocessing, validation, and postprocessing scripts in managed containers, separate from the training job?
Correct answer: A. SageMaker Processing Jobs
SageMaker Processing Jobs run scripts (Python, Spark) in managed containers for preprocessing, validation, and postprocessing — separate from training, with native S3 integration and built-in containers (sklearn, PySpark). Endpoints serve inference, Studio Lab is a free educational environment, and Edge Manager handles edge devices.
Which AWS service provides a visual no-code interface for data cleaning and normalization with 250+ built-in transformations?
Correct answer: A. AWS Glue DataBrew
AWS Glue DataBrew is a visual no-code tool for data discovery, cleaning, normalization, and validation, with 250+ built-in transformations (formatting, aggregations, joins). It differs from Glue Studio, the visual ETL authoring tool that generates Spark code. Lambda is compute, EMR is Spark/Hadoop, and Step Functions is orchestration.
To prepare a nominal categorical feature (e.g., "country") for an XGBoost algorithm, which encoding technique is appropriate?
Correct answer: A. One-Hot Encoding (create binary columns for each category)
One-Hot Encoding creates a binary column for each category, avoiding spurious ordering among nominal categories (no hierarchy). It is standard for trees and linear models. MD5 hashing breaks interpretability, string concatenation does not produce numeric input, and dropping the column discards predictive signal. For high cardinality there are alternatives (target encoding) — but one-hot is the standard answer.
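A minimal pandas sketch (the country codes are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"country": ["BR", "US", "FR", "US", "BR"]})

# One binary column per category; no spurious order is introduced
encoded = pd.get_dummies(df, columns=["country"], prefix="country")
print(encoded)
```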
Before training a neural network or linear regression with features of very different scales (e.g., age 18-90 and income 1000-1000000), which transformation is recommended?
Correct answer: A. Apply standardization (StandardScaler: mean 0, std 1) or normalization (MinMaxScaler: 0-1)
Algorithms based on distance (KNN, K-Means) and gradient (linear regression, neural networks) are scale-sensitive — larger-scale features dominate the gradient/distance and hurt learning. StandardScaler or MinMaxScaler equalize the scales. Tree-based models (XGBoost, Random Forest) do NOT need scaling. Keeping originals distorts learning, multiplying does not change the ratio between features, and categorizing loses numeric information.
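A minimal scikit-learn sketch (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age (18-90) and income (1,000-1,000,000)
X = np.array([[18, 1_000], [45, 250_000], [90, 1_000_000]], dtype=float)

# Each column ends up with mean 0 and std 1, so neither dominates
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.round(2))
```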
Which technique combines predictions from multiple trees trained on random subsets of data and features, generally reducing overfitting compared to a single tree?
Correct answer: A. Random Forest (tree bagging)
Random Forest is an ensemble of decision trees trained with bootstrap (random samples) + feature randomness, with a final vote/average — it reduces variance vs a single tree and improves generalization. Linear Regression is a simple linear model (no ensemble), K-Means is unsupervised, and Naive Bayes is a simple probabilistic model.
To reduce overfitting in a linear regression model with many correlated features, which technique adds a penalty to the model weights during training?
Correct answer: A. L1 (Lasso) or L2 (Ridge) regularization
L1 (Lasso) adds an |w| penalty (it zeroes out weights of irrelevant features — automatic feature selection). L2 (Ridge) adds a w² penalty (it shrinks weights without zeroing — useful for multicollinearity). Adding more features worsens overfitting, training longer without early stopping does the same, and eliminating the test set breaks evaluation.
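A minimal scikit-learn sketch contrasting the two penalties (the alpha value is illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1: drives irrelevant weights to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all weights, helps collinearity
print("weights zeroed by L1:", (lasso.coef_ == 0).sum())
```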
To prevent overfitting during iterative training of a model (e.g., gradient boosting, neural networks), which technique monitors a metric on a validation set and stops training when it stops improving?
Correct answer: A. Early Stopping
Early Stopping interrupts training when the validation metric does not improve for N consecutive iterations (patience), preventing the model from continuing to memorize the training set. Increasing learning rate causes instability, more features may increase overfitting, and eliminating the validation set makes it impossible to detect the optimal stopping point.
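A sketch with recent XGBoost versions, which take early_stopping_rounds in the constructor (older versions pass it to fit instead):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Stop once the validation metric fails to improve for 10 rounds
model = XGBClassifier(n_estimators=1000, early_stopping_rounds=10)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("best iteration:", model.best_iteration)
```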
Which SageMaker feature monitors training jobs in real time, detecting issues such as vanishing gradients, dead ReLUs, overfitting, or class imbalance?
Correct answer: A. SageMaker Debugger
SageMaker Debugger captures tensors during training and applies built-in rules (vanishing gradient, exploding tensor, overfit, class imbalance, etc.), generating real-time alerts. Pipelines orchestrates workflows, Inference Recommender benchmarks deploy instances, and Edge Manager manages edge devices.
A company needs to host 100 different models (one per client) with low, sporadic traffic. Which SageMaker approach minimizes costs by sharing resources?
Correct answer: A. Multi-Model Endpoints (MME) — multiple models on the same endpoint, loaded on demand
Multi-Model Endpoints load models on demand on the same endpoint, sharing resources (CPU/GPU/memory) — ideal for many models with sporadic traffic. 100 separate endpoints generate huge cost (idle instances), Lambda has size/cold-start limits incompatible with large models, and multi-region increases complexity without real benefit.
How can you automatically trigger a SageMaker retraining pipeline when new data arrives in an S3 bucket?
Correct answer: A. Configure Amazon EventBridge (or S3 event notification) to invoke a Lambda or pipeline when new objects are detected
Amazon EventBridge (and S3 event notifications) lets you react to events such as S3 object creation, triggering Lambda, Step Functions, or SageMaker Pipelines automatically. Polling is inefficient (consumes API quota and adds delay), manual triggering does not scale, and this automation is fully supported on AWS.
To orchestrate an ML workflow that involves services beyond SageMaker (e.g., Glue → Lambda → SageMaker → SNS), which AWS service is most flexible?
Correct answer: A. AWS Step Functions
AWS Step Functions is a general-purpose orchestrator that integrates with 200+ AWS services (Glue, Lambda, SageMaker, SNS, EventBridge, etc.) via declarative tasks — ideal for multi-service ML workflows with retry/error handling. SageMaker Pipelines focuses on SageMaker-only flows. CloudFormation provisions infrastructure (does not orchestrate workflows). EMR Workflows is specific to Hadoop/Spark.
To ensure that a SageMaker training job has no internet access (only VPC resources), which configuration should be applied?
Correct answer: A. Enable Network Isolation on the training job + configure VPC endpoints for required services (S3, ECR, etc.)
Network Isolation prevents the training job container from accessing the internet or other networks. Combined with VPC endpoints (S3, ECR, CloudWatch, etc.), the job only accesses private resources — useful for compliance (sensitive data). "Deny all" IAM rules do not block network traffic, the choice of GPU or CPU instance does not change the network configuration, and SSE-KMS is encryption at rest (not network isolation).
For auditing purposes, how can you track all API calls made to ML services (CreateTrainingJob, InvokeEndpoint, etc.) in the AWS account?
Correct answer: A. Enable AWS CloudTrail (automatically records API calls in auditable logs)
AWS CloudTrail automatically records AWS API calls (including SageMaker, Bedrock, Comprehend, etc.) in auditable logs on S3 or CloudWatch Logs, with caller identity and timestamp. Custom logging in every call is unnecessary work, local logs do not capture all APIs, and auditing is fully supported natively.
How can you ensure that EBS volumes attached to SageMaker training job instances are encrypted with a customer-managed key (CMK)?
Correct answer: A. Specify the KMS key in the VolumeKmsKeyId parameter of the training job
SageMaker lets you specify a customer-managed KMS key (CMK) to encrypt EBS volumes for training jobs and endpoints, via parameters like VolumeKmsKeyId/KmsKeyId — meeting compliance requirements that demand control over encryption keys. SSE-S3 only protects S3 objects (not EBS), and renaming a volume does not encrypt anything.
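A boto3 sketch showing where the parameter lives (job name, image, role, paths, and key ARN are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_training_job(
    TrainingJobName="encrypted-volume-job",  # placeholder
    AlgorithmSpecification={"TrainingImage": "<image-uri>", "TrainingInputMode": "File"},
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
        # Customer-managed key encrypting the attached EBS volume
        "VolumeKmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/<key-id>",
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```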
Which SageMaker feature simplifies creating IAM roles for ML personas (data scientist, ML engineer, MLOps), applying pre-built least-privilege policies?
Correct answer: A. SageMaker Role Manager
SageMaker Role Manager provides a wizard to create IAM roles for common ML personas (data scientist, ML engineer, etc.) with pre-approved policies following the principle of least privilege. Config monitors resource configurations, Trusted Advisor performs general best-practice checks, and WAF protects web apps — none manage IAM roles for ML.
Which SageMaker feature provides a centralized view of all production models in the account, with their health indicators (Model Monitor alarms, drift, endpoint status)?
Correct answer: A. SageMaker Model Dashboard
SageMaker Model Dashboard centralizes monitoring of all production models, showing Model Monitor alarms, drift detection, endpoint status, and lineage. Studio Lab is a free educational environment, JumpStart is a catalog of pre-trained/foundation models, and Feature Store stores features.
To reduce Athena query cost and time on large datasets, which file format is recommended for storing data in S3?
Correct answer: A. Apache Parquet (columnar compressed format)
Parquet is a columnar format with efficient compression — Athena/Spark/Redshift Spectrum reads only the columns needed, reducing data scanned (and Athena cost, which bills per TB scanned) by up to 90%. Uncompressed CSV forces scanning all bytes, TXT/XML are row-based and verbose.
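A minimal pandas sketch of the conversion (needs pyarrow installed; the data and file name are illustrative):

```python
import pandas as pd

# Illustrative data; in practice this would be the full dataset
df = pd.DataFrame({"user_id": [1, 2, 3], "spend": [10.5, 3.2, 7.8]})

# Columnar + compressed: Athena then scans only the columns a query touches
df.to_parquet("events.parquet", compression="snappy")
```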
Which AWS service centralizes data lake governance on S3, providing fine-grained access control (row/column) for multiple engines (Athena, EMR, SageMaker)?
Correct answer: A. AWS Lake Formation
AWS Lake Formation centralizes data lake governance, providing fine-grained permissions (table, column, row, cell) shared across Athena, EMR, SageMaker, and Redshift Spectrum. Plain S3 has basic ACLs/policies (not fine-grained). Glue is ETL and Macie classifies sensitive data — none offer full governance.
To ensure reproducibility of ML experiments, which approach is recommended for versioning training datasets?
Correct answer: A. Enable S3 Versioning + use SageMaker Lineage Tracking to track dataset → training job → model
S3 Versioning automatically preserves previous object versions, and SageMaker Lineage Tracking connects dataset → training job → model automatically for reproducibility. Overwriting objects wipes history, keeping datasets on laptops does not scale, and versioning is a prerequisite for ML audit/compliance.
Which combination of AWS features analyzes data quality in ML pipelines, identifying anomalies, missing values, and bias before training?
Correct answer: A. SageMaker Data Wrangler (data quality reports) + SageMaker Clarify (pre-training bias)
SageMaker Data Wrangler automatically generates statistical profiles and data quality reports (missing values, distributions, anomalies), and SageMaker Clarify analyzes pre-training bias across demographic groups. QuickSight is BI, Trusted Advisor is general checks, and CloudFront is a CDN — none ML-data-quality specific.
Which metric evaluates a binary classification model’s ability to discriminate between classes across all possible thresholds, being robust on imbalanced datasets?
Correct answer: A. AUC-ROC (Area Under the ROC Curve)
AUC-ROC measures the area under the ROC curve (TPR vs FPR), ranging from 0 (worst) to 1 (perfect), with 0.5 = random. It is threshold-independent and robust on imbalanced datasets, ideal for comparing models. Simple accuracy is misleading on imbalanced data, and time/size are operational metrics (not predictive quality).
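A minimal scikit-learn sketch using predicted probabilities (scores are illustrative):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
y_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]  # predicted probabilities

# Threshold-independent: integrates TPR vs FPR over every cutoff
print("AUC-ROC:", roc_auc_score(y_true, y_scores))
```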
In a multi-class classification problem, which tool visualizes where the model gets right and wrong, showing predictions vs actuals for each class pair?
Correct answer: A. Confusion Matrix
A Confusion Matrix is a square N×N table (N classes) crossing predicted and actual classes — the diagonal shows correct predictions, off-diagonal cells show the specific errors for each class pair. It identifies which classes the model confuses most. Learning curve shows training evolution, token histogram is text EDA, and t-SNE is dimensionality reduction for visualization (not per-class evaluation).
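A minimal scikit-learn sketch (labels are illustrative; scikit-learn puts actual classes on rows and predicted classes on columns):

```python
from sklearn.metrics import confusion_matrix

y_true = ["cat", "dog", "bird", "cat", "dog", "bird", "cat"]
y_pred = ["cat", "dog", "cat", "cat", "bird", "bird", "dog"]

# Diagonal = correct; off-diagonal cells show which classes get confused
print(confusion_matrix(y_true, y_pred, labels=["bird", "cat", "dog"]))
```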
To train an image classifier with a small dataset (5,000 images), which technique allows reusing weights from a model already trained on ImageNet (millions of images) and fine-tuning only the last layers?
Correct answer: A. Transfer Learning (with fine-tuning of the last layers)
Transfer Learning leverages knowledge from pre-trained models (ResNet, VGG, BERT, etc.) — it freezes early layers and retrains only the last layers for the specific task. Efficient with small datasets and low computational cost. SageMaker JumpStart facilitates this pattern. Training from scratch requires lots of data, K-Means is unsupervised, and inverting the dataset is nonsensical.
Which ML paradigm is appropriate for training an agent that makes sequential decisions, learning by trial and error with rewards (e.g., game playing, robotics, recommendation systems with continuous feedback)?
Correct answer: A. Reinforcement Learning (RL)
RL learns a policy mapping states → actions through rewards/penalties during interactions with an environment. AWS DeepRacer and SageMaker RL Containers facilitate RL. Supervised Learning requires static labeled data (not sequential decisions), unsupervised clustering only groups without feedback, and linear regression is static (not interactive).
An application processes large images (up to 1 GB each) with an ML model and tolerates minute-level latency. Which SageMaker inference mode is most suitable?
Correct answer: A. SageMaker Asynchronous Inference (supports payloads up to 1 GB and long processing, with queues and SNS callbacks)
SageMaker Async Inference accepts payloads up to 1 GB and long processing (up to 1 hour), with internal queues and SNS callbacks — ideal for large images or slow models with tolerable latency. Real-time has a ~6 MB limit and sub-second focus, Lambda also limits requests to 6 MB, and Transfer Acceleration is only for S3 upload.
How can you update a SageMaker endpoint to a new model version while minimizing downtime risk and allowing fast rollback?
Correct answer: A. Use SageMaker Deployment Guardrails with Blue/Green or Canary deployment (with auto-rollback)
SageMaker Deployment Guardrails support Blue/Green (provisions a new parallel environment and switches traffic if validations pass), All-At-Once, Canary (small initial percentage), and Linear (incremental). They enable auto-rollback based on metrics. Delete+recreate causes downtime, in-place is risky, and pausing traffic degrades UX.
To serve a small ML model (up to 10 GB) at very low cost with sporadic traffic and a containerized model, which AWS option is viable?
Correct answer: A. AWS Lambda with container image (supports up to 10 GB, billed per ms of execution)
AWS Lambda supports container images up to 10 GB and bills only for milliseconds of execution — ideal for small models with sporadic traffic (zero idle cost). SageMaker Serverless Inference is the native AWS ML alternative with the same advantages, but Lambda is also valid. EC2 24/7 generates idle cost, endpoint without scaling is not serverless, and Batch is for long batch jobs (not on-demand inference).
To store and automatically rotate credentials (database passwords, API keys) used by production ML applications, which service is recommended?
Correct answer: A. AWS Secrets Manager (automatic rotation + KMS encryption)
AWS Secrets Manager stores credentials encrypted with KMS and rotates them automatically (custom Lambda or native integrations with RDS, Redshift, etc.). Hardcoding and plain-text env vars violate security. Config monitors resource configurations (not secrets).
How can you automatically detect when a SageMaker endpoint is created without EBS volume encryption, against the company’s compliance policy?
Correct answer: A. Configure AWS Config rules (or Conformance Packs) to continuously audit resource configurations
AWS Config continuously monitors AWS resource configurations and detects compliance deviations via custom or managed rules (e.g., SageMaker endpoint without encryption-at-rest). Conformance Packs group rules for frameworks (HIPAA, PCI, etc.). Manual inspection does not scale, and the automation is fully supported.
Which SageMaker Model Monitor feature compares production model predictions with ground truth labels (real labels collected after inference) to detect model quality degradation over time?
Correct answer: A. Model Quality Monitor (requires ground truth labels)
Model Quality Monitor computes metrics (accuracy, F1, MSE, etc.) comparing predictions vs ground truth labels collected in production, detecting real model degradation over time. Data Quality monitors input distribution (no ground truth), Bias Drift monitors bias across groups, and Feature Attribution monitors feature-importance drift — none of them compare against real labels.
Practice for other AWS certifications with the same study and exam modes.
The mock exam includes 65 practical MLA-C01-style questions covering the 4 official domains, each with a detailed explanation to support your learning.
Yes. The content is organized to reflect the 4 official domains of the AWS Certified Machine Learning Engineer - Associate (MLA-C01): data preparation, model development, deployment & orchestration, and monitoring & security.
Yes. After selecting an option, the mock exam shows the explanation behind the correct answer and why the other options are wrong.
MLA-C01 uses a compensatory scoring model from 100 to 1000, with a minimum passing score of 720 (~72%). The mock exam adopts a 70% threshold.
Yes. The full mock exam is free and available online without sign-up.