Data Engineering Questions

Practice questions for the Data Engineering topic of the AWS Certified Machine Learning - Specialty exam. 40 questions cover this domain.

40 questions: 12 easy, 20 medium, 8 hard

Q1

medium

A company needs to create a large labeled image dataset for training an object detection model. They want to use a combination of human labelers and a...

Q2

hard

A machine learning team is training a large image classification model on Amazon SageMaker using Pipe input mode. Which data format is most efficient ...

Q3

medium

A team has raw data files in various formats landing in an Amazon S3 bucket daily. They want an automated way to populate the AWS Glue Data Catalog wi...

Q4

medium

A financial services company needs to ingest real-time transaction events, calculate fraud features within milliseconds, and make those features immed...

Q5

medium

A team needs to run distributed Apache Spark jobs to preprocess petabyte-scale datasets stored in Amazon S3 before ML training. They need the ability ...

Q6

hard

An ML team stores historical training data in Amazon S3 in Parquet format. Analysts also need to run complex SQL joins between this S3 data and operat...

Q7

medium

A data engineering team needs to transform raw CSV files stored in Amazon S3 into Parquet format and apply schema mapping before loading into a data w...

Q8

easy

A data engineering team needs to ingest real-time clickstream data and deliver it directly to Amazon S3 for batch ML training without writing custom c...

Q9

easy

A data science team needs a centralized, searchable metadata repository that stores table definitions, schemas, and data source locations so that Amaz...

Q10

easy

A machine learning team stores training datasets in Amazon S3. They want to reduce storage costs and improve query performance when accessing the data...

Q11

medium

A data engineering team runs nightly AWS Glue ETL jobs that transform S3 data and load it into Amazon Redshift for ML training. They want to ensure th...

Q12

easy

A data engineering team wants to profile raw data to find missing values, duplicate rows, and statistical distributions using a visual interface with ...

Q13

medium

A company needs to enforce column-level and row-level access controls on their ML data lake stored in Amazon S3, ensuring that different data science ...

Q14

hard

A financial institution stores transaction records in Amazon S3 and needs to run SQL analytics with column-level access control, join data across mult...

Q15

easy

A data science team needs to stream sensor readings to AWS in real time and retain each record for custom consumer applications to process at differen...

Q16

medium

A SageMaker training job needs to read a 2 TB training dataset from Amazon S3. The team wants to avoid downloading the full dataset to the training in...

Q17

hard

An ML team processes daily S3 data with an AWS Glue job. Sometimes upstream data arrives late, causing the Glue job to fail with missing source data. ...

Q18

medium

A team is building an ML data pipeline where raw sensor data from Amazon Kinesis Data Streams needs to be joined with reference data from Amazon RDS a...

Q19

medium

A data engineering team runs AWS Glue ETL jobs that take 4 hours to complete daily. The jobs are not time-sensitive, and the team wants to minimize co...

Q20

easy

Which Amazon S3 storage class is most cost-effective for storing ML training datasets that are accessed infrequently but must be retrievable within mi...
