DEA
Development and Ingestion
medium
Question 2 of 60

A data engineer needs to extract data from a source table only where the updated_at timestamp is after the last pipeline run timestamp. Which approach best implements this incremental extraction pattern in PySpark?

A. Load the entire source table into a DataFrame each run and use .distinct() to remove duplicates
B. Filter the source DataFrame using a watermark value stored in a Delta table or job parameter representing the last run timestamp
C. Use Auto Loader's cloudFiles format with trigger(once=True)
D. Run VACUUM on the source table before each extraction
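
For reference, the watermark filter described in option B might look roughly like the sketch below. The function names `extract_incremental` and `next_watermark` are hypothetical, and how the watermark is stored (Delta table, job parameter) is left out; the only thing assumed from the question is an `updated_at` timestamp column on the source.

```python
from __future__ import annotations

from datetime import datetime
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:  # used for type hints only
    from pyspark.sql import DataFrame


def extract_incremental(source_df: "DataFrame", last_run_ts: datetime) -> "DataFrame":
    """Keep only rows whose updated_at is strictly after the stored watermark."""
    # Imported lazily so next_watermark can be unit-tested without Spark installed.
    from pyspark.sql import functions as F

    return source_df.filter(F.col("updated_at") > F.lit(last_run_ts))


def next_watermark(previous: datetime, batch_max: Optional[datetime]) -> datetime:
    """Advance the watermark to the newest updated_at seen in the batch.

    batch_max is None when the batch was empty; the watermark then stays put.
    """
    if batch_max is None:
        return previous
    return max(previous, batch_max)
```

In a job, `batch_max` could be taken from the extracted batch's maximum `updated_at`, and the new watermark would typically be persisted only after the batch has been written successfully, so that a failed run is retried from the old value.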

