
Developing Code for Data Processing using Python and SQL Questions

Practice questions for the Developing Code for Data Processing using Python and SQL topic of the Databricks Certified Data Engineer Professional exam. 44 questions cover this domain.

44 questions: 11 easy, 21 medium, 12 hard
Q1
hard

When executing a `MERGE INTO` statement in Delta Lake, what happens if the source dataset contains multiple rows that match the same target row based ...

Q2
medium

What is the primary performance advantage of using a pandas UDF (vectorized UDF) over a standard Python UDF in PySpark?

Q3
medium

What is the primary purpose of the `MERGE INTO` statement in Delta Lake?

Q4
easy

Which SQL syntax correctly queries a Delta table named `sales` at version 5?

Q5
hard

A data engineer writes the following PySpark code to find the top 3 products by total revenue per category. Which statement best describes the result?...

Q6
medium

A data engineer runs the following SQL query on an orders table. What does it return?

```sql
SELECT customer_id, order_date, amount, ROW_NUMBER...
```

Q7
hard

A data engineer writes the following Structured Streaming query. What does the 10-minute watermark guarantee?

```python
events \
    .withWaterma...
```

Q8
medium

A data engineer runs the following SQL query. What does the `FILTER` function return?

```sql
SELECT FILTER(tags, t -> t LIKE 'priority_%') AS prio...
```

Q9
medium

Which Structured Streaming trigger type processes all available data in multiple micro-batches and then stops, and is the recommended replacement for ...

Q10
easy

What does the `filter()` transformation in PySpark return?

Q11
easy

Which Structured Streaming output mode emits only new rows that were appended to the result table since the last trigger?

Q12
easy

Which PySpark action triggers the execution of a lazy transformation chain and returns all rows of a DataFrame to the driver as a Python list?

Q13
hard

A data engineer is debugging a failing Structured Streaming query. The stream processes events with `event_time`, uses a 10-minute tumbling window and...

Q14
medium

A data engineer wants to convert a structured DataFrame column `address` of type `STRUCT<street STRING, city STRING, zip STRING>` into a JSON string c...

Q15
easy

What does the `TRANSFORM` higher-order function do in Databricks SQL when applied to an array column?

Q16
medium

A data engineer writes the following PySpark code. What does the `AGGREGATE` higher-order function compute?

```python
spark.sql("""
    SELECT AGGR...
```

Q17
hard

A data engineer notices that a PySpark job performing a `groupBy` on a low-cardinality column `status` (only 3 distinct values) followed by a `sum()` ...

Q18
medium

A Structured Streaming pipeline reads from a Delta table source and must process late-arriving events. The engineer adds a 15-minute watermark on `eve...

Q19
easy

A data engineer creates a Structured Streaming query to write aggregated results to a Delta table. Which output mode must be used when the query conta...

Q20
hard

A data engineer writes the following Structured Streaming query using `foreachBatch`:

```python
def process_batch(batch_df, batch_id):
    batch_...
```
