
Prepare for the AWS ML Engineer Associate (MLA-C01) exam with practical questions and detailed explanations.
This mock exam is an independent educational resource. It is not affiliated with, endorsed by, or sponsored by Amazon Web Services (AWS).
Choose the mode that fits your current goal.
Feedback after every answer, free navigation, no timer.
Realistic simulation of the actual exam with a countdown.
See all questions with the correct answers and detailed explanations to reinforce your learning.
Questions: 65
Topics: 65
Which Amazon SageMaker feature provides a visual interface for ML data preparation and transformation, letting you import from S3, Redshift, Athena, and Snowflake without writing code?
Correct answer: A. SageMaker Data Wrangler
SageMaker Data Wrangler provides a low-code UI to connect to multiple sources (S3, Redshift, Athena, Snowflake), explore data, apply 300+ built-in transformations, and export as a pipeline. Pipelines orchestrates workflows, Clarify measures bias, and Model Monitor detects production model drift.
Which AWS service is a serverless ETL (Extract, Transform, Load) platform that discovers, catalogs, and prepares data at scale for analytics and ML?
Correct answer: A. AWS Glue
AWS Glue is the AWS serverless ETL platform, with Glue Data Catalog (metadata), Glue Crawlers (automatic discovery), and Glue Jobs (Spark/Python for transformations). EMR is managed Hadoop/Spark clusters (more flexible but requires operations), Kinesis is real-time data streaming, and Lambda is general-purpose serverless compute.
For a numeric feature with a strongly skewed (long-tail) distribution, which transformation helps normalize the distribution before training an ML model?
Correct answer: A. Apply a logarithmic (log) transformation
The logarithmic transformation is a classic technique to reduce skewness in long-tail distributions (price, salary, counts), bringing them closer to a normal distribution. This improves training stability for scale-sensitive models (linear regression, neural networks). Multiplying or adding constants does not change the distribution's shape, and removing features loses information.
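For illustration, a minimal sketch with NumPy and pandas (the column and its values are made up); log1p handles zeros safely:

```python
import numpy as np
import pandas as pd

# A long-tail feature such as price (values are illustrative)
df = pd.DataFrame({"price": [80_000, 95_000, 120_000, 340_000, 2_500_000]})

# log1p = log(1 + x): defined at zero and compresses the long tail
df["price_log"] = np.log1p(df["price"])
print(df)
```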
A numeric column in the dataset has 5% missing values. Which common strategy preserves the dataset size and reduces bias introduced by missingness?
Correct answer: B. Impute (fill in) missing values using mean, median, or a predictive model
Imputation preserves rows and uses statistics (mean/median) or models to estimate missing values — standard practice in data preparation. Dropping rows shrinks the dataset and may introduce bias if missingness is not random. Replacing with zero distorts the distribution, and dropping a predictive column discards useful signal.
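A minimal scikit-learn sketch of median imputation (values are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column with missing entries (values are illustrative)
X = np.array([[25.0], [32.0], [np.nan], [41.0], [np.nan], [29.0]])

# Median imputation keeps every row and is robust to outliers
X_filled = SimpleImputer(strategy="median").fit_transform(X)
print(X_filled.ravel())
```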
Which AWS feature lets you store, share, and reuse ML features across multiple models and teams, maintaining consistency between training and inference?
Correct answer: A. Amazon SageMaker Feature Store
SageMaker Feature Store is the AWS central feature repository with a low-latency online store (for real-time inference) and an offline store (S3, for training), ensuring consistency between the two worlds. Plain S3 and DynamoDB lack versioning, lineage, and online/offline sync. Glue Data Catalog is table metadata (schemas) — it does not store features.
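A hedged boto3 sketch of an online-store read at inference time (the feature group name and record identifier are hypothetical):

```python
import boto3

# Low-latency lookup against the Feature Store online store
runtime = boto3.client("sagemaker-featurestore-runtime")
record = runtime.get_record(
    FeatureGroupName="customer-features",          # hypothetical name
    RecordIdentifierValueAsString="customer-123",  # hypothetical ID
)
print(record["Record"])  # the same features the offline store serves for training
```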
A team needs to predict house prices (a continuous numeric variable) from features such as area, location, and age. Which type of algorithm is most appropriate?
Correct answer: A. A regression algorithm (e.g., XGBoost regressor, Linear Regression)
When the target variable is continuous numeric (price, temperature, age), the problem is regression. Classification predicts discrete categories, clustering groups without labels, and anomaly detection identifies outliers — none of these directly fit predicting a continuous numeric value.
Which SageMaker feature lets you run ML training jobs on managed instances with support for built-in algorithms, frameworks (TensorFlow, PyTorch), and BYOC (Bring Your Own Container)?
Correct answer: A. SageMaker Training Jobs
SageMaker Training Jobs run training on managed instances (auto-provisioned and torn down) with support for built-in algorithms, major frameworks, and custom containers. Endpoints are for inference, Ground Truth is for labeling, and Studio Lab is a free educational environment.
Which SageMaker feature automates the search for the best hyperparameters (learning rate, batch size, etc.) by running multiple training jobs in parallel?
Correct answer: A. SageMaker Automatic Model Tuning (Hyperparameter Tuning)
SageMaker Automatic Model Tuning (also called Hyperparameter Tuning) runs multiple training jobs with different hyperparameter combinations, using strategies such as Bayesian, Random, Grid Search, or Hyperband, to find the best configuration. Pipelines orchestrates workflows, Inference Recommender tests deploy instances, and Model Cards document models.
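A sketch with the SageMaker Python SDK (the role ARN, S3 paths, and XGBoost image version are placeholders):

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, "1.7-1"),
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    hyperparameters={"objective": "binary:logistic", "num_round": 200},
)

# Bayesian search over the learning rate and tree depth
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=4,
    strategy="Bayesian",
)
tuner.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/validation"})
```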
Which SageMaker feature provides AutoML — automatically generating multiple candidate models (with feature engineering and hyperparameter tuning) from a tabular dataset?
Correct answer: A. SageMaker Autopilot
SageMaker Autopilot is the AWS AutoML service — it accepts a CSV/Parquet dataset, identifies the problem type (regression/classification), does feature engineering, and trains dozens of candidate models with automatic hyperparameter tuning. Edge Manager manages models on edge devices, Neo optimizes models for specific hardware, and Debugger monitors training jobs in real time.
An application needs to run batch predictions on 10 million records once a day, without low-latency requirements. Which SageMaker endpoint type is most cost-effective?
Correct answer: B. Batch Transform
SageMaker Batch Transform is ideal for batch predictions on large datasets — it provisions temporary instances, processes the entire dataset, and shuts down, billing only for usage time. A real-time endpoint keeps instances on 24/7 (expensive for sporadic use), Async Inference handles requests with large payloads but still serves individual responses, and Multi-Model is for hosting multiple models on a single endpoint.
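A sketch with the SageMaker Python SDK (the model name and S3 paths are placeholders):

```python
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="my-registered-model",           # placeholder
    instance_count=4,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/predictions/",  # placeholder
)

# Provisions instances, processes the full dataset, then shuts everything down
transformer.transform(
    data="s3://my-bucket/daily-batch/records.csv",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```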
A startup has a model with intermittent traffic (sporadic spikes followed by long idle periods). Which SageMaker deployment option automatically scales to zero when idle, charging only for usage?
Correct answer: B. SageMaker Serverless Inference
SageMaker Serverless Inference auto-scales (including to zero) and bills by inference duration + memory — ideal for intermittent traffic. Real-time keeps dedicated instances 24/7, edge targets non-AWS devices, and persistent Multi-Model also keeps resources active.
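A sketch with the SageMaker Python SDK (the image URI, model artifact, and role are placeholders):

```python
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

model = Model(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/inference:latest",  # placeholder
    model_data="s3://my-bucket/model/model.tar.gz",                             # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",               # placeholder
)

# Scales to zero when idle; billed per inference duration and memory size
predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=5,
    )
)
```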
Which AWS service is the native CI/CD tool for ML, letting you orchestrate steps such as data preparation, training, evaluation, model registration, and deployment?
Correct answer: A. SageMaker Pipelines
SageMaker Pipelines is the native CI/CD tool specifically for ML flows, with direct integration to Training Jobs, Processing, Model Registry, and Endpoints. CodePipeline is general-purpose CI/CD (not ML-specialized), Step Functions orchestrates generic workflows (usable but requires more glue code), and Glue Workflows is only for Glue ETL.
Which SageMaker feature centralizes the model lifecycle — versioning, manual approval before deployment, and lineage between training and production?
Correct answer: A. SageMaker Model Registry
SageMaker Model Registry lets you version models in "Model Groups", manually approve versions before deployment (governance), and track lineage. Feature Store stores features (not models), CodeArtifact manages code packages, and ECR stores Docker images (model containers, but without ML-specific governance).
A production model starts producing worse predictions after a few weeks, even without code changes. Which SageMaker feature automatically detects data drift (deviation in input data distribution)?
Correct answer: A. SageMaker Model Monitor
SageMaker Model Monitor automatically detects data drift, model quality drift, bias drift, and feature attribution drift by comparing production distributions with a baseline recorded during training. Pipelines orchestrates flows, CloudTrail audits API calls (not statistical distribution), and Trusted Advisor performs account checks (not ML-specific).
Which AWS practice should be used to grant a SageMaker Training Job access to a specific S3 bucket without exposing permanent credentials?
Correct answer: B. Attach an IAM execution Role to the job with a policy specific to the bucket
SageMaker assumes an IAM execution Role (passed via the RoleArn parameter) that receives rotating temporary credentials — the principle of least privilege. Sharing root credentials, hardcoding passwords, or making a bucket public directly violate AWS security best practices.
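A sketch of the least-privilege policy such an execution role might carry (the bucket name is a placeholder):

```python
import json

# Scoped to a single bucket: the job can read/write only what it needs
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-training-bucket",    # placeholder bucket
            "arn:aws:s3:::my-training-bucket/*",
        ],
    }],
}
print(json.dumps(policy, indent=2))
# The role's ARN is passed to the job via RoleArn; SageMaker assumes it
# and receives rotating temporary credentials.
```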
How can a company ensure that traffic between its EC2 instances and SageMaker stays within AWS, without going through the public internet?
Correct answer: B. Configure VPC Endpoints (PrivateLink) for SageMaker inside the VPC
VPC Endpoints (powered by AWS PrivateLink) allow access to SageMaker APIs (training, inference, runtime) via a private interface inside the VPC, keeping traffic within the AWS network. IPv6 does not isolate traffic, SSH tunneling does not apply to AWS HTTP APIs, and TLS only encrypts the traffic, which still traverses the public internet.
A team is repeatedly training models on SageMaker on-demand instances, with long-running jobs (4-6 hours). Which option can significantly reduce training costs while tolerating interruptions?
Correct answer: A. Use SageMaker Managed Spot Training
SageMaker Managed Spot Training uses spare AWS capacity (Spot) for training, saving up to 90% vs on-demand. Combined with checkpointing to S3, interrupted jobs resume from the last checkpoint. Larger instances may reduce time but increase total cost; serverless inference is for prediction (not training); and Multi-Model is also for inference.
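A sketch with the SageMaker Python SDK (the training image, role, and S3 paths are placeholders):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",                              # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,  # train on spare (Spot) capacity
    max_run=6 * 3600,         # max training time in seconds
    max_wait=8 * 3600,        # max_run plus time to wait for Spot capacity
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # resume after interruption
)
estimator.fit({"train": "s3://my-bucket/train/"})
```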
A team stores training datasets accessed frequently in the first 30 days and rarely afterwards. Which strategy is most cost-effective?
Correct answer: B. Configure an S3 Lifecycle policy: Standard → Standard-IA after 30 days
S3 Lifecycle policies automatically move objects between classes — Standard (high frequency) → Standard-IA (infrequent access, ~40% cheaper in storage) → Glacier (archival). Keeping everything in Standard is expensive, immediate Glacier Deep Archive prevents fast access during the first 30 days, and EBS is block storage for EC2 (not for shared ML datasets).
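A boto3 sketch of such a lifecycle rule (the bucket name and prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-training-datasets",  # placeholder
    LifecycleConfiguration={
        "Rules": [{
            "ID": "standard-to-ia-after-30-days",
            "Status": "Enabled",
            "Filter": {"Prefix": "datasets/"},  # placeholder prefix
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
        }],
    },
)
```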
Which AWS service lets you query data in S3 using standard SQL without provisioning infrastructure?
Correct answer: A. Amazon Athena
Amazon Athena is serverless and bills only for data scanned, ideal for ad-hoc queries on S3 data lakes (CSV, JSON, Parquet, ORC formats). Redshift is a provisioned data warehouse, Glue is ETL, and RDS is a managed relational database.
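A boto3 sketch of an ad-hoc query (the database, table, and output location are placeholders):

```python
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT label, COUNT(*) FROM events GROUP BY label",  # placeholder table
    QueryExecutionContext={"Database": "ml_datalake"},                # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for status
```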
A company needs to process 100 TB of raw data with Apache Spark, running custom jobs in Python and Scala. Which AWS service is most suitable?
Correct answer: B. Amazon EMR (managed Hadoop/Spark/Hive)
Amazon EMR (Elastic MapReduce) is the AWS managed platform for big data frameworks such as Spark, Hadoop, Hive, and Presto, ideal for large-scale processing with code flexibility. Lambda has time (15 min) and memory limits, Athena is SQL-only, and Glue is ETL with less granular control than EMR.
In a fraud detection dataset, only 1% of cases are fraud (positive class). Which technique is appropriate to mitigate imbalance during training?
Correct answer: A. Apply SMOTE (Synthetic Minority Over-sampling) or adjust class_weight in the algorithm
SMOTE generates synthetic samples of the minority class to balance, and adjusting class_weight (or pos_weight in XGBoost) penalizes errors on the rare class during training. Training only on negative cases eliminates the fraud signal, dropping features discards predictive information, and simple accuracy is misleading on imbalanced data (precision/recall/F1/AUC are preferable).
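A minimal scikit-learn sketch of the class_weight approach on a synthetic 1%-positive dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic dataset with ~1% positives, mirroring the fraud scenario
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=42)

# "balanced" up-weights errors on the rare class during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
# The analogous XGBoost knob is scale_pos_weight ≈ n_negative / n_positive
```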
A team wants to train a tabular classification model without writing algorithm code from scratch. Which SageMaker built-in algorithm is widely used for tabular tasks with strong performance?
Correct answer: A. XGBoost
XGBoost is the SageMaker built-in algorithm for gradient boosting on tabular data, dominant in Kaggle competitions and production. BlazingText is for text classification and Word2Vec; Object Detection and Semantic Segmentation are for computer vision — not tabular.
To train a large deep learning model (e.g., BERT) on a 500 GB dataset, how can training be accelerated using multiple GPUs or multiple nodes?
Correct answer: A. Enable distributed training with SageMaker (Data Parallelism, Model Parallelism, or both)
SageMaker supports data parallelism (each GPU processes different batches of the dataset, syncing gradients) and model parallelism (the model is split across GPUs/nodes, useful when the model does not fit on one GPU). Reducing the batch size to 1 hurts throughput, CPU is too slow for large deep learning, and local training is infeasible for datasets of hundreds of GB.
In a binary classification problem for medical diagnosis, which metric is most critical to maximize to avoid false negatives (missing a disease)?
Correct answer: A. Recall (sensitivity / true positive rate)
Recall measures the fraction of actual positives the model detected — TP / (TP + FN). High recall is essential in medical diagnosis to minimize false negatives (missing a serious disease). Precision focuses on avoiding false positives; latency and size are operational metrics (not directly about clinical quality).
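A minimal scikit-learn sketch (labels are illustrative):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # 1 = disease present
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

# Recall = TP / (TP + FN): the share of real cases the model catches
print("recall:", recall_score(y_true, y_pred))  # 4 of 5 positives found
print("precision:", precision_score(y_true, y_pred))
```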
To evaluate an ML model robustly on a small dataset, which technique is recommended to reduce the variance of the performance estimate?
Correct answer: A. K-fold cross-validation (e.g., 5 folds)
K-fold cross-validation splits the dataset into K parts and trains K times (each time using a different part as validation), reducing the variance of the performance estimate. Training and testing on the same set yields an overly optimistic evaluation, batch size is a training parameter, and dropout is a regularization technique (applied during training, never at inference).
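A minimal scikit-learn sketch of 5-fold cross-validation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Five train/validation splits yield five estimates instead of one
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```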
Which SageMaker feature lets you organize, compare, and reproduce ML experiments with different hyperparameters, datasets, and algorithms?
Correct answer: A. SageMaker Experiments
SageMaker Experiments automatically tracks each training job as a "trial", grouped in "experiments", letting you compare metrics and hyperparameters across runs. Model Cards documents models for governance, Studio Lab is a free educational environment, and Edge Manager manages models on edge devices.
How can a SageMaker endpoint be configured to automatically scale the number of instances based on traffic volume?
Correct answer: A. Enable Application Auto Scaling on the endpoint with a policy based on metrics such as SageMakerVariantInvocationsPerInstance
SageMaker endpoints integrate with Application Auto Scaling — you define a policy based on metrics (invocations per instance, latency, CPU) and the instance count adjusts automatically between min/max. Manual restart does not scale, an oversized instance creates idle cost, and Lambda has limits incompatible with large models.
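A boto3 sketch of target tracking on that metric (the endpoint and variant names are placeholders):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # placeholder

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=8,
)
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # target invocations per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```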
How can you run an A/B test with 2 different model versions on the same SageMaker endpoint, splitting traffic between them?
Correct answer: A. Configure Production Variants on the endpoint with traffic weights (e.g., 50/50, 80/20)
SageMaker supports multiple Production Variants on the same endpoint, each with a traffic distribution weight. Useful for A/B testing, canary deployments, and blue-green. Separate endpoints with DNS routing work but require extra infrastructure, training the models together does not isolate them for comparison, and no workaround is needed since SageMaker supports this scenario natively.
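A boto3 sketch of an 80/20 split (the model and config names are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="ab-test-config",  # placeholder
    ProductionVariants=[
        {
            "VariantName": "model-a",
            "ModelName": "model-a",  # placeholder model
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.8,  # 80% of traffic
        },
        {
            "VariantName": "model-b",
            "ModelName": "model-b",  # placeholder model
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.2,  # 20% of traffic
        },
    ],
)
```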
Which SageMaker feature helps you choose the most suitable instance for deploying a model by running latency and cost benchmarks across different instance types?
Correct answer: A. SageMaker Inference Recommender
SageMaker Inference Recommender runs the model on different instance types (CPU, GPU, Inferentia) and provides latency, throughput, and cost comparisons, recommending the best option. Neo optimizes the model for specific hardware, Edge Manager manages edge devices, and Pipelines orchestrates ML workflows.
Which SageMaker feature compiles and optimizes ML models to run faster and with less memory on specific hardware (CPU, GPU, ARM, Inferentia)?
Correct answer: A. SageMaker Neo
SageMaker Neo compiles trained models (TensorFlow, PyTorch, MXNet, etc.) into code optimized for specific hardware, reducing footprint and latency — particularly useful on edge devices or Inferentia. Studio is the ML IDE, Pipelines orchestrates workflows, and Feature Store stores features.
How can you monitor the latency, throughput, and error rate of a SageMaker endpoint in real time?
Correct answer: A. Use Amazon CloudWatch — metrics such as Invocations, ModelLatency, and Invocation4XXErrors are published automatically
SageMaker endpoints automatically publish metrics to CloudWatch (Invocations, ModelLatency, OverheadLatency, Invocation4XXErrors, Invocation5XXErrors, etc.), enabling real-time dashboards and alarms. Manual polling and local logs are primitive approaches that do not scale, and endpoint monitoring is natively supported.
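A boto3 sketch of a latency alarm (the endpoint and variant names are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="endpoint-high-latency",
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},  # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=500_000,  # ModelLatency is reported in microseconds
    ComparisonOperator="GreaterThanThreshold",
)
```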
Which SageMaker feature is used to detect bias in datasets and models and generate explainability reports (feature importance)?
Correct answer: A. SageMaker Clarify
SageMaker Clarify is the dedicated AWS service for bias analysis (during and after training) and explainability (SHAP values, feature importance). Pipelines orchestrates workflows, Endpoints host models for inference, and Studio is the IDE.
How can you encrypt sensitive data stored in S3 buckets used for SageMaker training jobs?
Correct answer: A. Enable Server-Side Encryption on S3 (SSE-S3 or SSE-KMS) — SageMaker decrypts automatically when accessing
S3 provides Server-Side Encryption (SSE-S3 with AWS keys, SSE-KMS with KMS-managed keys, or SSE-C with customer keys). SageMaker decrypts transparently as long as it has IAM permissions for the bucket and KMS key. Not encrypting is a security flaw, manual encryption is unnecessary work, and renaming the bucket does not encrypt anything.
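A boto3 sketch of an SSE-KMS upload (the bucket, key, and KMS ARN are placeholders):

```python
import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-training-data",        # placeholder
    Key="train/part-0001.csv",        # placeholder
    Body=b"...",                      # illustrative bytes
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="arn:aws:kms:us-east-1:123456789012:key/<key-id>",  # placeholder
)
# SageMaker reads the object transparently if its role has s3:GetObject
# plus kms:Decrypt on the key.
```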
Which SageMaker feature lets you document models with metadata such as purpose, training data, performance metrics, and ethical considerations, for governance and compliance purposes?
Correct answer: A. SageMaker Model Cards
SageMaker Model Cards is the model documentation feature for governance/compliance, recording intended use, training data, evaluation metrics, ethical considerations, and known biases. Pipelines orchestrates workflows, Inference Recommender benchmarks instances, and Studio Lab is a free educational environment.
To ingest real-time event streams (clicks, IoT sensors) and make them available to multiple ML consumers, which AWS service is most suitable?
Correct answer: A. Amazon Kinesis Data Streams
Kinesis Data Streams is the AWS real-time streaming service, ideal for multiple consumers (Lambda, Kinesis Data Analytics, Firehose, custom apps) with configurable retention up to 365 days. Glue is batch ETL, S3 has no native streaming (only events), and Lambda processes events but does not store/replay them.
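A boto3 sketch of a producer write (the stream name and payload are illustrative):

```python
import json

import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="clickstream",  # placeholder
    Data=json.dumps({"user": 42, "event": "click"}).encode(),
    PartitionKey="42",  # determines the shard; same key, same ordering
)
```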
Which SageMaker feature lets you run custom data preprocessing, validation, and postprocessing scripts in managed containers, separate from the training job?
Correct answer: A. SageMaker Processing Jobs
SageMaker Processing Jobs run scripts (Python, Spark) in managed containers for preprocessing, validation, and postprocessing — separate from training, with native S3 integration and built-in containers (sklearn, PySpark). Endpoints serve inference, Studio Lab is a free educational environment, and Edge Manager handles edge devices.
Which AWS service provides a visual no-code interface for data cleaning and normalization with 250+ built-in transformations?
Correct answer: A. AWS Glue DataBrew
AWS Glue DataBrew is a visual no-code tool for data discovery, cleaning, normalization, and validation, with 250+ built-in transformations (formatting, aggregations, joins). It differs from Glue Studio, the visual ETL authoring tool that generates Spark code. Lambda is compute, EMR is Spark/Hadoop, and Step Functions is orchestration.
To prepare a nominal categorical feature (e.g., "country") for an XGBoost algorithm, which encoding technique is appropriate?
Correct answer: A. One-Hot Encoding (create binary columns for each category)
One-Hot Encoding creates a binary column for each category, avoiding spurious ordering among nominal categories (no hierarchy). It is standard for trees and linear models. MD5 hashing breaks interpretability, string concatenation does not produce numeric input, and dropping the column discards predictive signal. For high cardinality there are alternatives (target encoding) — but one-hot is the standard answer.
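A minimal pandas sketch (the country codes are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"country": ["BR", "US", "FR", "US", "BR"]})

# One binary column per category; no spurious order is introduced
encoded = pd.get_dummies(df, columns=["country"], prefix="country")
print(encoded)
```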
Before training a neural network or linear regression with features of very different scales (e.g., age 18-90 and income 1000-1000000), which transformation is recommended?
Correct answer: A. Apply standardization (StandardScaler: mean 0, std 1) or normalization (MinMaxScaler: 0-1)
Algorithms based on distance (KNN, K-Means) and gradient (linear regression, neural networks) are scale-sensitive — larger-scale features dominate the gradient/distance and hurt learning. StandardScaler or MinMaxScaler equalize the scales. Tree-based models (XGBoost, Random Forest) do NOT need scaling. Keeping originals distorts learning, multiplying does not change the ratio between features, and categorizing loses numeric information.
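A minimal scikit-learn sketch (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age (18-90) and income (1,000-1,000,000)
X = np.array([[18, 1_000], [45, 250_000], [90, 1_000_000]], dtype=float)

# Each column ends up with mean 0 and std 1, so neither dominates
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.round(2))
```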
Which technique combines predictions from multiple trees trained on random subsets of data and features, generally reducing overfitting compared to a single tree?
Correct answer: A. Random Forest (tree bagging)
Random Forest is an ensemble of decision trees trained with bootstrap (random samples) + feature randomness, with a final vote/average — it reduces variance vs a single tree and improves generalization. Linear Regression is a simple linear model (no ensemble), K-Means is unsupervised, and Naive Bayes is a simple probabilistic model.
To reduce overfitting in a linear regression model with many correlated features, which technique adds a penalty to the model weights during training?
Correct answer: A. L1 (Lasso) or L2 (Ridge) regularization
L1 (Lasso) adds an |w| penalty (it zeroes out weights of irrelevant features — automatic feature selection). L2 (Ridge) adds a w² penalty (it shrinks weights without zeroing — useful for multicollinearity). Adding more features worsens overfitting, training longer without early stopping does the same, and eliminating the test set breaks evaluation.
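A minimal scikit-learn sketch contrasting the two penalties (the alpha value is illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1: drives irrelevant weights to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all weights, helps collinearity
print("weights zeroed by L1:", (lasso.coef_ == 0).sum())
```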
To prevent overfitting during iterative training of a model (e.g., gradient boosting, neural networks), which technique monitors a metric on a validation set and stops training when it stops improving?
Correct answer: A. Early Stopping
Early Stopping interrupts training when the validation metric does not improve for N consecutive iterations (patience), preventing the model from continuing to memorize the training set. Increasing learning rate causes instability, more features may increase overfitting, and eliminating the validation set makes it impossible to detect the optimal stopping point.
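A sketch with recent XGBoost versions, which take early_stopping_rounds in the constructor (older versions pass it to fit instead):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Stop once the validation metric fails to improve for 10 rounds
model = XGBClassifier(n_estimators=1000, early_stopping_rounds=10)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("best iteration:", model.best_iteration)
```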
Which SageMaker feature monitors training jobs in real time, detecting issues such as vanishing gradients, dead ReLUs, overfitting, or class imbalance?
Correct answer: A. SageMaker Debugger
SageMaker Debugger captures tensors during training and applies built-in rules (vanishing gradient, exploding tensor, overfit, class imbalance, etc.), generating real-time alerts. Pipelines orchestrates workflows, Inference Recommender benchmarks deploy instances, and Edge Manager manages edge devices.
A company needs to host 100 different models (one per client) with low, sporadic traffic. Which SageMaker approach minimizes costs by sharing resources?
Correct answer: A. Multi-Model Endpoints (MME) — multiple models on the same endpoint, loaded on demand
Multi-Model Endpoints load models on demand on the same endpoint, sharing resources (CPU/GPU/memory) — ideal for many models with sporadic traffic. 100 separate endpoints generate huge cost (idle instances), Lambda has size/cold-start limits incompatible with large models, and multi-region increases complexity without real benefit.
How can you automatically trigger a SageMaker retraining pipeline when new data arrives in an S3 bucket?
Correct answer: A. Configure Amazon EventBridge (or S3 event notification) to invoke a Lambda or pipeline when new objects are detected
Amazon EventBridge (and S3 event notifications) lets you react to events such as S3 object creation, triggering Lambda, Step Functions, or SageMaker Pipelines automatically. Polling is inefficient (consumes API quota and adds delay), manual triggering does not scale, and this automation is fully supported on AWS.
To orchestrate an ML workflow that involves services beyond SageMaker (e.g., Glue → Lambda → SageMaker → SNS), which AWS service is most flexible?
Correct answer: A. AWS Step Functions
AWS Step Functions is a general-purpose orchestrator that integrates with 200+ AWS services (Glue, Lambda, SageMaker, SNS, EventBridge, etc.) via declarative tasks — ideal for multi-service ML workflows with retry/error handling. SageMaker Pipelines focuses on SageMaker-only flows. CloudFormation provisions infrastructure (does not orchestrate workflows). EMR Workflows is specific to Hadoop/Spark.
To ensure that a SageMaker training job has no internet access (only VPC resources), which configuration should be applied?
Correct answer: A. Enable Network Isolation on the training job + configure VPC endpoints for required services (S3, ECR, etc.)
Network Isolation prevents the training job container from accessing the internet or other networks. Combined with VPC endpoints (S3, ECR, CloudWatch, etc.), the job only accesses private resources — useful for compliance (sensitive data). "Deny all" IAM rules do not block network traffic, the choice of GPU or CPU instance does not change the network configuration, and SSE-KMS is encryption at rest (not network isolation).
For auditing purposes, how can you track all API calls made to ML services (CreateTrainingJob, InvokeEndpoint, etc.) in the AWS account?
Correct answer: A. Enable AWS CloudTrail (automatically records API calls in auditable logs)
AWS CloudTrail automatically records AWS API calls (including SageMaker, Bedrock, Comprehend, etc.) in auditable logs on S3 or CloudWatch Logs, with caller identity and timestamp. Custom logging in every call is unnecessary work, local logs do not capture all APIs, and auditing is fully supported natively.
How can you ensure that EBS volumes attached to SageMaker training job instances are encrypted with a customer-managed key (CMK)?
Correct answer: A. Specify the KMS key in the VolumeKmsKeyId parameter of the training job
SageMaker lets you specify a customer-managed KMS key (CMK) to encrypt EBS volumes for training jobs and endpoints, via parameters like VolumeKmsKeyId/KmsKeyId — meeting compliance requirements that demand control over encryption keys. SSE-S3 only protects S3 objects (not EBS), and renaming a volume does not encrypt anything.
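A boto3 sketch showing where the parameter lives (job name, image, role, paths, and key ARN are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_training_job(
    TrainingJobName="encrypted-volume-job",  # placeholder
    AlgorithmSpecification={"TrainingImage": "<image-uri>", "TrainingInputMode": "File"},
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
        # Customer-managed key encrypting the attached EBS volume
        "VolumeKmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/<key-id>",
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```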
Which SageMaker feature simplifies creating IAM roles for ML personas (data scientist, ML engineer, MLOps), applying pre-built least-privilege policies?
Correct answer: A. SageMaker Role Manager
SageMaker Role Manager provides a wizard to create IAM roles for common ML personas (data scientist, ML engineer, etc.) with pre-approved policies following the principle of least privilege. Config monitors resource configurations, Trusted Advisor performs general best-practice checks, and WAF protects web apps — none manage IAM roles for ML.
Which SageMaker feature provides a centralized view of all production models in the account, with their health indicators (Model Monitor alarms, drift, endpoint status)?
Correct answer: A. SageMaker Model Dashboard
SageMaker Model Dashboard centralizes monitoring of all production models, showing Model Monitor alarms, drift detection, endpoint status, and lineage. Studio Lab is a free educational environment, JumpStart is a catalog of pre-trained/foundation models, and Feature Store stores features.
To reduce Athena query cost and time on large datasets, which file format is recommended for storing data in S3?
Correct answer: A. Apache Parquet (columnar compressed format)
Parquet is a columnar format with efficient compression — Athena/Spark/Redshift Spectrum reads only the columns needed, reducing data scanned (and Athena cost, which bills per TB scanned) by up to 90%. Uncompressed CSV forces scanning all bytes, TXT/XML are row-based and verbose.
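A minimal pandas sketch of the conversion (needs pyarrow installed; the data and file name are illustrative):

```python
import pandas as pd

# Illustrative data; in practice this would be the full dataset
df = pd.DataFrame({"user_id": [1, 2, 3], "spend": [10.5, 3.2, 7.8]})

# Columnar + compressed: Athena then scans only the columns a query touches
df.to_parquet("events.parquet", compression="snappy")
```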
Which AWS service centralizes data lake governance on S3, providing fine-grained access control (row/column) for multiple engines (Athena, EMR, SageMaker)?
Correct answer: A. AWS Lake Formation
AWS Lake Formation centralizes data lake governance, providing fine-grained permissions (table, column, row, cell) shared across Athena, EMR, SageMaker, and Redshift Spectrum. Plain S3 has basic ACLs/policies (not fine-grained). Glue is ETL and Macie classifies sensitive data — none offer full governance.
To ensure reproducibility of ML experiments, which approach is recommended for versioning training datasets?
Correct answer: A. Enable S3 Versioning + use SageMaker Lineage Tracking to track dataset → training job → model
S3 Versioning automatically preserves previous object versions, and SageMaker Lineage Tracking connects dataset → training job → model automatically for reproducibility. Overwriting objects wipes history, keeping datasets on laptops does not scale, and versioning is a prerequisite for ML audit/compliance.
Which combination of AWS features analyzes data quality in ML pipelines, identifying anomalies, missing values, and bias before training?
Correct answer: A. SageMaker Data Wrangler (data quality reports) + SageMaker Clarify (pre-training bias)
SageMaker Data Wrangler automatically generates statistical profiles and data quality reports (missing values, distributions, anomalies), and SageMaker Clarify analyzes pre-training bias across demographic groups. QuickSight is BI, Trusted Advisor is general checks, and CloudFront is a CDN — none ML-data-quality specific.
Which metric evaluates a binary classification model’s ability to discriminate between classes across all possible thresholds, being robust on imbalanced datasets?
Correct answer: A. AUC-ROC (Area Under the ROC Curve)
AUC-ROC measures the area under the ROC curve (TPR vs FPR), ranging from 0 (worst) to 1 (perfect), with 0.5 = random. It is threshold-independent and robust on imbalanced datasets, ideal for comparing models. Simple accuracy is misleading on imbalanced data, and time/size are operational metrics (not predictive quality).
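A minimal scikit-learn sketch using predicted probabilities (scores are illustrative):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
y_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]  # predicted probabilities

# Threshold-independent: integrates TPR vs FPR over every cutoff
print("AUC-ROC:", roc_auc_score(y_true, y_scores))
```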
In a multi-class classification problem, which tool visualizes where the model gets right and wrong, showing predictions vs actuals for each class pair?
Correct answer: A. Confusion Matrix
A Confusion Matrix is a square N×N table (N classes) crossing predicted and actual classes — the diagonal shows correct predictions, off-diagonal cells show the specific errors for each class pair. It identifies which classes the model confuses most. Learning curve shows training evolution, token histogram is text EDA, and t-SNE is dimensionality reduction for visualization (not per-class evaluation).
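A minimal scikit-learn sketch (labels are illustrative; scikit-learn puts actual classes on rows and predicted classes on columns):

```python
from sklearn.metrics import confusion_matrix

y_true = ["cat", "dog", "bird", "cat", "dog", "bird", "cat"]
y_pred = ["cat", "dog", "cat", "cat", "bird", "bird", "dog"]

# Diagonal = correct; off-diagonal cells show which classes get confused
print(confusion_matrix(y_true, y_pred, labels=["bird", "cat", "dog"]))
```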
To train an image classifier with a small dataset (5,000 images), which technique allows reusing weights from a model already trained on ImageNet (millions of images) and fine-tuning only the last layers?
Correct answer: A. Transfer Learning (with fine-tuning of the last layers)
Transfer Learning leverages knowledge from pre-trained models (ResNet, VGG, BERT, etc.) — it freezes early layers and retrains only the last layers for the specific task. Efficient with small datasets and low computational cost. SageMaker JumpStart facilitates this pattern. Training from scratch requires lots of data, K-Means is unsupervised, and inverting the dataset is nonsensical.
Which ML paradigm is appropriate for training an agent that makes sequential decisions, learning by trial and error with rewards (e.g., game playing, robotics, recommendation systems with continuous feedback)?
Correct answer: A. Reinforcement Learning (RL)
RL learns a policy mapping states → actions through rewards/penalties during interactions with an environment. AWS DeepRacer and SageMaker RL Containers facilitate RL. Supervised Learning requires static labeled data (not sequential decisions), unsupervised clustering only groups without feedback, and linear regression is static (not interactive).
An application processes large images (up to 1 GB each) with an ML model and tolerates minute-level latency. Which SageMaker inference mode is most suitable?
Correct answer: A. SageMaker Asynchronous Inference (supports payloads up to 1 GB and long processing, with queues and SNS callbacks)
SageMaker Async Inference accepts payloads up to 1 GB and long processing (up to 1 hour), with internal queues and SNS callbacks — ideal for large images or slow models with tolerable latency. Real-time has a ~6 MB limit and sub-second focus, Lambda also limits requests to 6 MB, and Transfer Acceleration is only for S3 upload.
How can you update a SageMaker endpoint to a new model version while minimizing downtime risk and allowing fast rollback?
Correct answer: A. Use SageMaker Deployment Guardrails with Blue/Green or Canary deployment (with auto-rollback)
SageMaker Deployment Guardrails support Blue/Green (provisions a new parallel environment and switches traffic if validations pass), All-At-Once, Canary (small initial percentage), and Linear (incremental). They enable auto-rollback based on metrics. Delete+recreate causes downtime, in-place is risky, and pausing traffic degrades UX.
To serve a small ML model (up to 10 GB) at very low cost with sporadic traffic and a containerized model, which AWS option is viable?
Correct answer: A. AWS Lambda with container image (supports up to 10 GB, billed per ms of execution)
AWS Lambda supports container images up to 10 GB and bills only for milliseconds of execution — ideal for small models with sporadic traffic (zero idle cost). SageMaker Serverless Inference is the native AWS ML alternative with the same advantages, but Lambda is also valid. EC2 24/7 generates idle cost, endpoint without scaling is not serverless, and Batch is for long batch jobs (not on-demand inference).
To store and automatically rotate credentials (database passwords, API keys) used by production ML applications, which service is recommended?
Correct answer: A. AWS Secrets Manager (automatic rotation + KMS encryption)
AWS Secrets Manager stores credentials encrypted with KMS and rotates them automatically (custom Lambda or native integrations with RDS, Redshift, etc.). Hardcoding and plain-text env vars violate security. Config monitors resource configurations (not secrets).
How can you automatically detect when a SageMaker endpoint is created without EBS volume encryption, against the company’s compliance policy?
Correct answer: A. Configure AWS Config rules (or Conformance Packs) to continuously audit resource configurations
AWS Config continuously monitors AWS resource configurations and detects compliance deviations via custom or managed rules (e.g., SageMaker endpoint without encryption-at-rest). Conformance Packs group rules for frameworks (HIPAA, PCI, etc.). Manual inspection does not scale, and the automation is fully supported.
Which SageMaker Model Monitor feature compares production model predictions with ground truth labels (real labels collected after inference) to detect model quality degradation over time?
Correct answer: A. Model Quality Monitor (requires ground truth labels)
Model Quality Monitor computes metrics (accuracy, F1, MSE, etc.) comparing predictions vs ground truth labels collected in production, detecting real model degradation over time. Data Quality monitors input distribution (no ground truth), Bias Drift monitors bias across groups, and Feature Attribution monitors feature-importance drift — none of them compare against real labels.
Practice for other AWS certifications with the same study and exam modes.
The mock exam includes 65 practical MLA-C01-style questions covering the 4 official domains, each with a detailed explanation to support your learning.
Yes. The content is organized to reflect the 4 official domains of the AWS Certified Machine Learning Engineer - Associate (MLA-C01): data preparation, model development, deployment & orchestration, and monitoring & security.
Yes. After selecting an option, the mock exam shows the explanation behind the correct answer and why the other options are wrong.
MLA-C01 uses a compensatory scoring model from 100 to 1000, with a minimum passing score of 720 (~72%). The mock exam adopts a 70% threshold.
Yes. The full mock exam is free and available online without sign-up.