Data Preparation Questions
Practice questions for the Data Preparation topic of the Databricks Certified Generative AI Engineer Associate exam. This domain is covered by 28 questions.
An engineering team has already pre-computed text embeddings using a proprietary internal model and stored the embedding vectors in a Delta table. The...
In a RAG pipeline, what is the role of an embedding model?
A team indexing long legal contracts finds that the RAG system retrieves chunks containing the right section topic but misses the specific clause rele...
A data engineer wants to create a Mosaic AI Vector Search Delta Sync Index so the index automatically stays in sync as the source Delta table is updat...
A team builds a real-time recommendation system that computes product embeddings on-the-fly and must upsert individual embeddings into a vector index ...
A team is building a RAG system for retrieving code snippets and their natural language documentation from a multilingual codebase. The system must ma...
In the context of building a RAG data pipeline, what is the primary purpose of chunking source documents?
In Mosaic AI Vector Search, what algorithm does the platform use for approximate nearest neighbor (ANN) similarity searches?
A team is ingesting a multilingual corpus of 800 million product descriptions in English, Spanish, and French into a Mosaic AI Vector Search index. Th...
A data engineer building a RAG pipeline for a medical research system needs to split documents that have a clear hierarchical structure: document → se...
What is the purpose of a text splitter strategy in a RAG document pipeline?
A team notices that when they use cosine similarity to compare query embeddings against their indexed document embeddings, the results are suboptimal ...
A team builds a Delta Sync Index on a standard Mosaic AI Vector Search endpoint. They want the embedding vectors computed by Databricks and need the i...
A team creates a Mosaic AI Vector Search index to support a legal document search system. After initially creating the index on a small dataset of 10,...
A document ingestion pipeline includes many source-table columns that are not useful at retrieval time. The team wants to reduce index size while pres...
A source Delta table for a Vector Search index already contains a column named `_id`. What should the engineer do before creating the index?
A team needs exact keyword search over identifiers and does not want to generate embeddings at all. Which Vector Search option best matches that requi...
A team wants a dedicated full-text search index on a storage-optimized endpoint and assumes it will continuously sync from the source table. What chan...
A team created a Delta Sync index with self-managed embeddings. Later they decide they want Databricks to compute embeddings instead. What can they do...
Users search for product issues using both natural language and exact incident codes such as `INC-84721`. Which retrieval approach best handles both n...