Developing Code for Data Processing using Python and SQL Questions
Practice questions for Developing Code for Data Processing using Python and SQL topic in Databricks Certified Data Engineer Professional. 44 questions covering this domain.
When executing a `MERGE INTO` statement in Delta Lake, what happens if the source dataset contains multiple rows that match the same target row based ...
What is the primary performance advantage of using a pandas UDF (vectorized UDF) over a standard Python UDF in PySpark?
What is the primary purpose of the `MERGE INTO` statement in Delta Lake?
Which SQL syntax correctly queries a Delta table named `sales` at version 5?
A data engineer writes the following PySpark code to find the top 3 products by total revenue per category. Which statement best describes the result?...
A data engineer runs the following SQL query on an orders table. What does it return? ```sql SELECT customer_id, order_date, amount, ROW_NUMBER...
A data engineer writes the following Structured Streaming query. What does the 10-minute watermark guarantee? ```python events \ .withWaterma...
A data engineer runs the following SQL query. What does the `FILTER` function return? ```sql SELECT FILTER(tags, t -> t LIKE 'priority_%') AS prio...
Which Structured Streaming trigger type processes all available data in multiple micro-batches and then stops, and is the recommended replacement for ...
What does the `filter()` transformation in PySpark return?
Which Structured Streaming output mode emits only new rows that were appended to the result table since the last trigger?
Which PySpark action triggers the execution of a lazy transformation chain and returns all rows of a DataFrame to the driver as a Python list?
A data engineer is debugging a failing Structured Streaming query. The stream processes events with `event_time`, uses a 10-minute tumbling window and...
A data engineer wants to convert a structured DataFrame column `address` of type `STRUCT<street STRING, city STRING, zip STRING>` into a JSON string c...
What does the `TRANSFORM` higher-order function do in Databricks SQL when applied to an array column?
A data engineer writes the following PySpark code. What does the `AGGREGATE` higher-order function compute?\n\n```python\nspark.sql("""\n SELECT AGGR...
A data engineer notices that a PySpark job performing a `groupBy` on a low-cardinality column `status` (only 3 distinct values) followed by a `sum()` ...
A Structured Streaming pipeline reads from a Delta table source and must process late-arriving events. The engineer adds a 15-minute watermark on `eve...
A data engineer creates a Structured Streaming query to write aggregated results to a Delta table. Which output mode must be used when the query conta...
A data engineer writes the following Structured Streaming query using `foreachBatch`:\n\n```python\ndef process_batch(batch_df, batch_id):\n batch_...
Sign in to see all 44 questions
Create a free account to browse all questions — completely free during our launch phase.