A data engineer runs the following PySpark code but observes that the resulting Delta table has only 1 file:

    df.coalesce(1).write.format("delta").mode("overwrite").saveAsTable("catalog.schema.output")

After several weeks, queries against this table become slower than expected as data volume grows. What is the most likely cause and the recommended fix?
- … `ANALYZE TABLE` to refresh them
- `coalesce(1)` causes all data to be written to a single file, creating a large file that is slow to read; remove `coalesce(1)` and let Delta Lake manage the file layout, or run `OPTIMIZE` afterward
- … `spark.sql.shuffle.partitions` to improve parallelism
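The fix described in the `coalesce(1)` option can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names `write_table` and `compact_and_count_files` are hypothetical, `df` is assumed to be a PySpark DataFrame, and `spark` is assumed to be an active SparkSession with Delta Lake enabled (as on Databricks). The `numFiles` field read from `DESCRIBE DETAIL` is used only to check how many files the table currently has.

```python
def write_table(df, table_name):
    """Overwrite a Delta table without forcing a single output file.

    Assumption: `df` is a PySpark DataFrame. Leaving out coalesce(1) lets
    the write use the DataFrame's existing partitioning instead of
    funnelling every row through one task and one file.
    """
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)


def compact_and_count_files(spark, table_name):
    """Run OPTIMIZE on the table, then return its current file count.

    Assumption: `spark` is an active SparkSession with Delta Lake enabled.
    """
    # OPTIMIZE asks Delta Lake to rewrite the table's data files toward
    # its own target file size.
    spark.sql(f"OPTIMIZE {table_name}")
    # DESCRIBE DETAIL returns one row of table metadata, including numFiles.
    detail = spark.sql(f"DESCRIBE DETAIL {table_name}").collect()[0]
    return detail["numFiles"]
```

In a notebook this might be used as `write_table(df, "catalog.schema.output")` followed by `compact_and_count_files(spark, "catalog.schema.output")`. How many files result depends on the data volume and the table's configuration, so no particular count should be expected.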
Educational content: CertQnA practice questions are written against official exam objectives, covering the same domains tested on the real exam. All content is original and independent: not actual exam questions, and not affiliated with any certification vendor.