Question

A data engineer needs to extract data from a source table only where the `updated_at` timestamp is after the last pipeline run timestamp. Which approach best implements this incremental extraction pattern in PySpark?

Accepted Answer

B. Filter the source DataFrame using a watermark value, stored in a Delta table or passed as a job parameter, that represents the last run timestamp.

Incremental extraction is a standard ETL pattern in which each run processes only the records created or updated since the previous run. The pipeline stores the last processed watermark (e.g., the max `updated_at`) in a configuration table or receives it as a job parameter, then filters the source with `df.filter(col("updated_at") > last_run_timestamp)`.

The other approaches do not fit. Loading all data and de-duplicating re-reads the full table on every run, which is inefficient. Auto Loader incrementally ingests files from cloud storage, not rows from a SQL table. VACUUM removes old data files that a Delta table no longer references, which is unrelated to incremental extraction.
