AI-Ready Data — Overview

This page walks through a minimal, runnable RAG example that takes you through the full "ingest documents → vectorize → retrieve → generate" pipeline in about 5 minutes. For a fuller guide to choosing an approach, see AI Data Preparation.

Prerequisites

  • You have a Singdata Lakehouse workspace.
  • You have configured AI Gateway (for calling the embedding model and the LLM).

5-Minute Minimal RAG Example

Step 1: Create the table

CREATE TABLE knowledge_base (
  id BIGINT,
  title STRING,
  content STRING,
  embedding VECTOR(1536)  -- dimension must match the embedding model you use
);
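The dimension note above is easy to overlook when switching embedding models. As a rough illustration, the Python sketch below checks a vector's length against the declared column dimension before it is written. `EXPECTED_DIM` and `check_dimension` are hypothetical names introduced here, not part of Lakehouse, and the sketch assumes you have the embedding available as a plain list of floats on the client side.

```python
EXPECTED_DIM = 1536  # mirrors VECTOR(1536) in the table definition


def check_dimension(vector, expected_dim=EXPECTED_DIM):
    """Return the vector unchanged, or raise ValueError on a length mismatch."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, expected {expected_dim}"
        )
    return vector


check_dimension([0.0] * 1536)  # same length as the column: passes

try:
    check_dimension([0.0] * 768)  # output of a hypothetical 768-dimensional model
except ValueError as err:
    print(err)
```

If the check fails, change either the VECTOR dimension in the table or the embedding endpoint so that the two agree.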

Step 2: Insert data and vectorize

INSERT INTO knowledge_base (id, title, content, embedding) VALUES
  (1, 'Introduction to Vector Index',
   'A vector index stores and retrieves high-dimensional vectors, supporting approximate nearest neighbor search.',
   AI_EMBEDDING('endpoint:my_embedding', 'A vector index stores and retrieves high-dimensional vectors, supporting approximate nearest neighbor search.')),
  (2, 'Introduction to Full-Text Search',
   'Full-text search is based on an inverted index, supporting Chinese and English tokenization and BM25 relevance ranking.',
   AI_EMBEDDING('endpoint:my_embedding', 'Full-text search is based on an inverted index, supporting Chinese and English tokenization and BM25 relevance ranking.')),
  (3, 'RAG Architecture Overview',
   'RAG combines retrieval and generation: it first retrieves relevant documents, then uses an LLM to generate an answer.',
   AI_EMBEDDING('endpoint:my_embedding', 'RAG combines retrieval and generation: it first retrieves relevant documents, then uses an LLM to generate an answer.'));

Step 3: Create the vector index

CREATE VECTOR INDEX idx_embedding
ON TABLE knowledge_base (embedding)
PROPERTIES (
  "scalar.type" = "f32",
  "distance.function" = "cosine_distance"
);

Step 4: Semantic retrieval

-- Find the 3 documents most relevant to the user's question
SELECT
  id,
  title,
  content,
  cosine_distance(embedding, AI_EMBEDDING('endpoint:my_embedding', 'What is vector search?')) AS distance
FROM knowledge_base
ORDER BY distance ASC
LIMIT 3;
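To make the ranking in this query concrete, here is a minimal Python sketch of what ordering by cosine distance does. It assumes cosine distance is defined as 1 minus cosine similarity, which is the common convention but is not confirmed on this page for `cosine_distance`. The 3-dimensional vectors and the `top_k` helper are made up for illustration and stand in for real 1536-dimensional embeddings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity (assumed convention); smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=3):
    """Mimic ORDER BY distance ASC LIMIT k over (id, embedding) rows."""
    scored = [(row_id, cosine_distance(vec, query_vec)) for row_id, vec in rows]
    scored.sort(key=lambda pair: pair[1])  # ascending distance
    return scored[:k]


# Toy embeddings (hypothetical values, not real model output).
rows = [
    (1, [0.9, 0.1, 0.0]),
    (2, [0.0, 1.0, 0.0]),
    (3, [0.5, 0.5, 0.7]),
]
query = [1.0, 0.0, 0.0]

ranked = top_k(query, rows, k=2)
print([row_id for row_id, _ in ranked])  # prints [1, 3]
```

Row 1 points in almost the same direction as the query, so its distance is close to 0 and it ranks first. Row 2 is orthogonal to the query (distance 1) and falls outside the top 2.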

Step 5: Generate an answer (RAG)

-- Concatenate the retrieved documents as context and call the LLM to generate an answer
-- Replace endpoint:my_embedding / endpoint:my_llm with the endpoint names you configured in AI Gateway
WITH context AS (
  SELECT CONCAT_WS('\n', COLLECT_LIST(content)) AS ctx
  FROM (
    SELECT content
    FROM knowledge_base
    ORDER BY cosine_distance(embedding, AI_EMBEDDING('endpoint:my_embedding', 'What is vector search?')) ASC
    LIMIT 3
  )
)
SELECT AI_COMPLETE(
  'endpoint:my_llm',
  CONCAT('Answer the question based on the following material:\n', ctx, '\n\nQuestion: What is vector search?')
) AS answer
FROM context;
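The prompt passed to AI_COMPLETE can be hard to picture from the nested CONCAT calls. The Python sketch below assembles the same string from the three sample documents. It assumes the retrieval step returns them in the order shown and that '\n' is interpreted as a newline. `build_prompt` is a hypothetical helper used only for illustration, not a Lakehouse function.

```python
def build_prompt(contents, question):
    """Join retrieved passages with newlines and wrap them in the instruction."""
    ctx = "\n".join(contents)  # plays the role of CONCAT_WS('\n', COLLECT_LIST(content))
    return (
        "Answer the question based on the following material:\n"
        + ctx
        + "\n\nQuestion: "
        + question
    )


# The three sample rows inserted in Step 2, in an assumed retrieval order.
retrieved = [
    "A vector index stores and retrieves high-dimensional vectors, "
    "supporting approximate nearest neighbor search.",
    "RAG combines retrieval and generation: it first retrieves relevant "
    "documents, then uses an LLM to generate an answer.",
    "Full-text search is based on an inverted index, supporting Chinese "
    "and English tokenization and BM25 relevance ranking.",
]

prompt = build_prompt(retrieved, "What is vector search?")
print(prompt)
```

Printing the assembled prompt like this is a quick way to confirm that the context and the question end up where the model expects them before you spend tokens on the LLM call.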

Next Steps

| Goal | Documentation |
| --- | --- |
| Learn the full configuration options for vector indexes | Vector Search |
| Add full-text search to improve recall quality | Hybrid Search Best Practices |
| Combine vector search and structured filtering on the same table | Multi-modal Data Retrieval |
| Learn the full parameters for AI_EMBEDDING and AI_COMPLETE | AI Functions Overview |