Data Preparation Questions

Practice questions for the Data Preparation topic of the Databricks Certified Generative AI Engineer Associate exam. 28 questions cover this domain.

28 questions: 8 easy, 12 medium, 8 hard

Q1

medium

An engineering team has already pre-computed text embeddings using a proprietary internal model and stored the embedding vectors in a Delta table. The...

Q2

easy

In a RAG pipeline, what is the role of an embedding model?

Q3

hard

A team indexing long legal contracts finds that the RAG system retrieves chunks containing the right section topic but misses the specific clause rele...

Q4

medium

A data engineer wants to create a Mosaic AI Vector Search Delta Sync Index so the index automatically stays in sync as the source Delta table is updat...

Q5

medium

A team builds a real-time recommendation system that computes product embeddings on-the-fly and must upsert individual embeddings into a vector index ...

Q6

hard

A team is building a RAG system for retrieving code snippets and their natural language documentation from a multilingual codebase. The system must ma...

Q7

easy

In the context of building a RAG data pipeline, what is the primary purpose of chunking source documents?

Q8

easy

In Mosaic AI Vector Search, what algorithm does the platform use for approximate nearest neighbor (ANN) similarity searches?

Q9

hard

A team is ingesting a multilingual corpus of 800 million product descriptions in English, Spanish, and French into a Mosaic AI Vector Search index. Th...

Q10

hard

A data engineer building a RAG pipeline for a medical research system needs to split documents that have a clear hierarchical structure: document → se...

Q11

easy

What is the purpose of a text splitter strategy in a RAG document pipeline?

Q12

medium

A team notices that when they use cosine similarity to compare query embeddings against their indexed document embeddings, the results are suboptimal ...

Q13

medium

A team builds a Delta Sync Index on a standard Mosaic AI Vector Search endpoint. They want the embedding vectors computed by Databricks and need the i...

Q14

medium

A team creates a Mosaic AI Vector Search index to support a legal document search system. After initially creating the index on a small dataset of 10,...

Q15

medium

A document ingestion pipeline includes many source-table columns that are not useful at retrieval time. The team wants to reduce index size while pres...

Q16

easy

A source Delta table for a Vector Search index already contains a column named `_id`. What should the engineer do before creating the index?

Q17

easy

A team needs exact keyword search over identifiers and does not want to generate embeddings at all. Which Vector Search option best matches that requi...

Q18

hard

A team wants a dedicated full-text search index on a storage-optimized endpoint and assumes it will continuously sync from the source table. What chan...

Q19

medium

A team created a Delta Sync index with self-managed embeddings. Later they decide they want Databricks to compute embeddings instead. What can they do...

Q20

medium

Users search for product issues using both natural language and exact incident codes such as `INC-84721`. Which retrieval approach best handles both n...
