Question

A data engineer runs the following PySpark code but observes that the resulting Delta table has only 1 file:

```python
df.coalesce(1).write.format("delta").mode("overwrite").saveAsTable("catalog.schema.output")
```

After several weeks, queries against this table become slower than expected as data volume grows. What is the most likely cause and the recommended fix?

Accepted Answer

B. `coalesce(1)` causes all data to be written to a single file, creating a large file that is slow to read; remove `coalesce(1)` and let Delta Lake manage the file layout, or run `OPTIMIZE` afterward.

`coalesce(1)` reduces the DataFrame to a single partition, so one task writes the entire table as a single file. As the data grows, that file keeps getting larger, and every overwrite is funneled through that one task. Reads suffer as well: a single large file limits how much of the scan can be spread across executors, and file-level data skipping offers little benefit when there is only one file to consider. Databricks Delta Lake best practices recommend letting the engine manage file sizes (via `OPTIMIZE` and liquid clustering) rather than forcing a single file, especially for tables that grow over time. The other options address unrelated issues.
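The fix described in the answer can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper names `write_output` and `compact_table` are hypothetical, and the sketch assumes `df` is an existing Spark DataFrame and `spark` is an active `SparkSession` on a runtime where the Delta `OPTIMIZE` command is available.

```python
def write_output(df, table_name="catalog.schema.output"):
    """Overwrite the Delta table without forcing a single output file.

    Hypothetical helper: `df` is assumed to be a Spark DataFrame.
    """
    # No coalesce(1) here, so the write is not squeezed through one task
    # and the file layout is left to the engine.
    (
        df.write
        .format("delta")
        .mode("overwrite")
        .saveAsTable(table_name)
    )


def compact_table(spark, table_name="catalog.schema.output"):
    """Run OPTIMIZE on the table after the write.

    Hypothetical helper: `spark` is assumed to be an active SparkSession.
    """
    return spark.sql(f"OPTIMIZE {table_name}")
```

A typical call sequence would be `write_output(df)` followed by `compact_table(spark)`; whether the second step is needed depends on how the table is configured and how it is written.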
