Data Processing & Transformations Questions
Practice questions for the Data Processing & Transformations topic of the Databricks Certified Data Engineer Associate exam. 62 questions cover this domain.
A data engineer is building a transformation that processes a `products` table containing a `tags` column of type `ARRAY<STRING>`. They need to create...
What does the `MERGE INTO` SQL statement do in Delta Lake?
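As a rough illustration of the upsert pattern usually associated with `MERGE INTO`, the sketch below models it in plain Python on lists of dictionaries. It is not Delta Lake code: the `merge_into` helper and the sample rows are hypothetical, and it covers only the common "update when matched, insert when not matched" case, leaving out delete clauses and extra match conditions.

```python
def merge_into(target, source, key):
    """Model a basic upsert: update rows whose key matches, insert the rest.

    Hypothetical helper for illustration only; it does not call Delta Lake.
    """
    # Index the target rows by key (copying so the inputs are not mutated).
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in merged:
            # Comparable to: WHEN MATCHED THEN UPDATE SET *
            merged[row[key]].update(row)
        else:
            # Comparable to: WHEN NOT MATCHED THEN INSERT *
            merged[row[key]] = dict(row)
    return list(merged.values())


# Made-up sample data.
target_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
source_rows = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}]
merged_rows = merge_into(target_rows, source_rows, key="id")
```

Here the row with `id` 2 is updated in place and the row with `id` 3 is appended, while the row with `id` 1 is left unchanged.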
A data engineer writes the following Spark SQL to query a Delta table:

```sql
SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE status ...
```
A data engineer wants to revert a Delta table to the state it was in at version 5 after a bad MERGE operation introduced incorrect data. Which SQL com...
A data engineer runs the following PySpark code but observes that the resulting Delta table has only 1 file:

```python
df.coalesce(1).write.format...
```
In Apache Spark, what is the difference between a transformation and an action?
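The distinction can be pictured with a plain-Python analogy. In Spark, transformations such as `filter` or `select` only describe a new dataset and are evaluated lazily, while actions such as `count` or `collect` trigger the computation. The sketch below does not use Spark at all: it relies on Python's lazy `map` and `filter` iterators as stand-ins, and the `record` helper and sample numbers are made up for illustration.

```python
evaluated = []  # records every element that actually gets processed


def record(value):
    """Double a value and note that work was really done (hypothetical helper)."""
    evaluated.append(value)
    return value * 2


numbers = [1, 2, 3, 4]

# "Transformations": these lines only build a lazy pipeline; nothing runs yet.
doubled = map(record, numbers)
large = filter(lambda v: v > 4, doubled)
calls_before_action = len(evaluated)

# "Action": materializing the result forces the whole pipeline to execute.
result = list(large)
calls_after_action = len(evaluated)
```

No element is processed until `list(...)` is called, which is the role an action plays in a Spark job.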
A data engineer creates a Delta table and wants to enforce that the `order_status` column never contains a value outside the set (`'pending'`, `'compl...
A data engineer is implementing a CDC pipeline that receives a source `changes` DataFrame with columns `id`, `operation` (values: `INSERT`, `UPDATE`, ...
A data engineer joins a large `orders` DataFrame (50 million rows) with a small `stores` DataFrame (200 rows) on `store_id`. The query plan shows Spar...
A data engineer wants to add a new column `discounted_price` to a PySpark DataFrame `df` that equals `price * 0.9`. Which PySpark method should they u...
A data engineer writes the following PySpark code to drop all rows where any column is null, but the resulting DataFrame unexpectedly still contains r...
A data engineer needs to create a reusable function that calculates a custom business metric in PySpark that is not available in the built-in `pyspark...
A data engineer writes the following PySpark code:

```python
result = (df
    .filter(col("status") == "active")
    .groupBy("region")
    .agg(coun...
```
A data engineer has a PySpark DataFrame `orders` and wants to calculate the total order amount grouped by `customer_id`. Which PySpark code correctly ...
A data engineer needs to query a Delta table's full modification history to understand which operations were performed and when. Which SQL command re...
A data engineer needs to join a large `transactions` DataFrame with a small `lookup` DataFrame that contains 500 rows. The join is causing a shuffle t...
A data engineer needs to rename the column `cust_id` to `customer_id` in a PySpark DataFrame `df` without changing any other columns. Which method sho...
A data engineer writes the following PySpark job to process 500 million rows, but it is running very slowly. After profiling, the Spark UI shows a sin...
A data engineer needs to compute a 7-day rolling sum of `daily_sales` per `store_id` ordered by `sale_date`. Which SQL construct achieves this?
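A rolling sum of this kind is typically written as a window function with a frame clause. The sketch below is illustrative only: it runs the query through Python's built-in `sqlite3` module (which needs SQLite 3.25 or later for window functions) rather than Spark, the `sales` table and its rows are made up, and the `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` frame equals a 7-day window only under the assumption that each store has exactly one row per day.

```python
import sqlite3

# Window function: per store, ordered by date, summing the current row
# and the six rows before it.
ROLLING_SUM_SQL = """
SELECT store_id,
       sale_date,
       SUM(daily_sales) OVER (
           PARTITION BY store_id
           ORDER BY sale_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_7d_sales
FROM sales
ORDER BY store_id, sale_date
"""


def rolling_sales(rows):
    """Load (store_id, sale_date, daily_sales) rows and return the rolling sums."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (store_id INTEGER, sale_date TEXT, daily_sales INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    result = conn.execute(ROLLING_SUM_SQL).fetchall()
    conn.close()
    return result


# Made-up data: store 1 sells 10 units a day for nine days, store 2 has one day.
sample_rows = [(1, f"2024-01-{day:02d}", 10) for day in range(1, 10)]
sample_rows.append((2, "2024-01-01", 5))
rolling = rolling_sales(sample_rows)
```

With this sample, the sum for store 1 grows by 10 per day until it covers seven rows and then stays flat, and store 2 starts its own window because of `PARTITION BY store_id`.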
A data engineer writes a PySpark streaming transformation and wants to ensure no duplicate events are written to the Delta table if the stream restart...