Singdata Lakehouse Image Analysis Best Practices

Introduction

In the data-driven era, image data has become an important component of enterprise data assets. Singdata Lakehouse, by integrating advanced image recognition and vector retrieval technologies, provides enterprises with a complete solution for image data management and analysis. This article will use a food image recognition practical case to introduce in detail how to build an end-to-end image analysis system in Singdata Lakehouse.

1. Key Product Features

Multi-Modal Storage: Supports unified management of image files, vector data, and structured metadata
External Functions (EXTERNAL FUNCTION): Seamlessly invoke AI models for image recognition and vectorization
Vector Retrieval: Natively supports 1024-dimensional vector storage and similarity computation

2. Practical Case: Food Image Recognition System

2.1 Creating Data Table Structure

First, we need to create a table containing a vector field to store image information:

-- Create food image recognition data table CREATE TABLE IF NOT EXISTS dish_images ( id BIGINT NOT NULL PRIMARY KEY IDENTITY (1), url STRING NOT NULL COMMENT 'Original URL address of the image', file_name STRING NOT NULL COMMENT 'Image file name', image_content STRING COMMENT 'Dish information extracted using fc_image_to_text', image_vector VECTOR (FLOAT, 1024) COMMENT 'Image vector generated using fc_gen_emmbeding', created_at TIMESTAMP DEFAULT current_timestamp() COMMENT 'Creation time' ) COMMENT 'Food image recognition data table';

2.2 Batch Upload Images to VOLUME

Singdata Lakehouse's VOLUME feature provides file management capabilities. The following example shows how to batch download and upload images from URLs. You can download images from the URLs below to your local machine, and then use the Lakehouse JDBC Client to PUT files into the USER VOLUME:

Image URL (You can download more images by changing the number in the URL)

-- Note: PUT command needs to be executed via the JDBC client, cannot be run directly in the SQL editor -- Batch download and upload images to USER VOLUME from URLs PUT '/User/Downloads/RecognizeFood1.jpg', '/User/Downloads/RecognizeFood2.jpg', '/User/Downloads/RecognizeFood3.jpg' TO USER VOLUME SUBDIRECTORY 'dish_images'; -- View uploaded files LIST USER VOLUME SUBDIRECTORY 'dish_images';

2.3 Using Functions for Image Recognition

Use cloud functions to achieve automatic image recognition and vectorization:

-- Insert a single record, calling cloud functions for image recognition and vector generation INSERT INTO dish_images (url, file_name, image_content, image_vector) VALUES ( 'http://viapi-test.oss-cn-shanghai.aliyuncs.com/viapi-3.0domepic/imagerecog/RecognizeFood/RecognizeFood1.jpg', 'RecognizeFood1.jpg', -- Call image recognition function public.fc_image_to_text('dish_recognition', 'http://viapi-test.oss-cn-shanghai.aliyuncs.com/viapi-3.0domepic/imagerecog/RecognizeFood/RecognizeFood1.jpg'), -- Generate image vector (note: vector dimensions must match) CAST(public.fc_gen_emmbeding('multimodal', '', 'http://viapi-test.oss-cn-shanghai.aliyuncs.com/viapi-3.0domepic/imagerecog/RecognizeFood/RecognizeFood1.jpg') AS VECTOR(FLOAT, 1024)) ); -- Batch insert multiple records INSERT INTO dish_images (url, file_name, image_content, image_vector) VALUES ('http://viapi-test.oss-cn-shanghai.aliyuncs.com/viapi-3.0domepic/imagerecog/RecognizeFood/RecognizeFood2.jpg', 'RecognizeFood2.jpg', public.fc_image_to_text('dish_recognition', 'http://viapi-test.oss-cn-shanghai.aliyuncs.com/viapi-3.0domepic/imagerecog/RecognizeFood/RecognizeFood2.jpg'), CAST(public.fc_gen_emmbeding('multimodal', '', 'http://viapi-test.oss-cn-shanghai.aliyuncs.com/viapi-3.0domepic/imagerecog/RecognizeFood/RecognizeFood2.jpg') AS VECTOR(FLOAT, 1024)) ), ('http://viapi-test.oss-cn-shanghai.aliyuncs.com/viapi-3.0domepic/imagerecog/RecognizeFood/RecognizeFood3.jpg', 'RecognizeFood3.jpg', public.fc_image_to_text('dish_recognition', 'http://viapi-test.oss-cn-shanghai.aliyuncs.com/viapi-3.0domepic/imagerecog/RecognizeFood/RecognizeFood3.jpg'), CAST(public.fc_gen_emmbeding('multimodal', '', 'http://viapi-test.oss-cn-shanghai.aliyuncs.com/viapi-3.0domepic/imagerecog/RecognizeFood/RecognizeFood3.jpg') AS VECTOR(FLOAT, 1024)) );

2.4 Vector Similarity Search

Implement image content-based similarity search:

-- Find the dishes most similar to the target image WITH target_vector AS ( SELECT CAST(public.fc_gen_emmbeding('multimodal', '', 'http://viapi-test.oss-cn-shanghai.aliyuncs.com/viapi-3.0domepic/imagerecog/RecognizeFood/RecognizeFood5.jpg') AS VECTOR(FLOAT, 1024)) as vec ) SELECT d.id, d.file_name, d.url, d.image_content, cosine_distance(d.image_vector, t.vec) as similarity_score FROM dish_images d, target_vector t ORDER BY similarity_score ASC LIMIT 5; -- Find similar images based on existing images SELECT d1.file_name as source_image, d2.file_name as similar_image, d2.image_content, cosine_distance(d1.image_vector, d2.image_vector) as similarity_score FROM dish_images d1, dish_images d2 WHERE d1.file_name = 'RecognizeFood1.jpg' AND d1.id != d2.id ORDER BY similarity_score ASC LIMIT 5;

3. Advanced Application Scenarios

3.1 Multi-Dimensional Image Analysis

Combine structured queries and vector search to achieve complex analysis requirements:

-- Find images of specific categories SELECT id, file_name, image_content FROM dish_images WHERE image_content LIKE '%seafood%' OR image_content LIKE '%fish%' OR image_content LIKE '%shrimp%' OR image_content LIKE '%crab%' ORDER BY id; -- Analyze the confidence distribution of dish recognition results SELECT file_name, image_content, -- Extract confidence (adjust based on actual JSON format) CAST(SUBSTRING(image_content, POSITION('probability' IN image_content) + 15, 8) AS DOUBLE) as confidence FROM dish_images ORDER BY confidence DESC; -- Count dishes by category SELECT CASE WHEN image_content LIKE '%beef%' THEN 'Beef' WHEN image_content LIKE '%fish%' THEN 'Fish' WHEN image_content LIKE '%shrimp%' OR image_content LIKE '%crab%' THEN 'Seafood' WHEN image_content LIKE '%vegetable%' AND image_content NOT LIKE '%non-vegetable%' THEN 'Vegetables' ELSE 'Other' END as category, COUNT(*) as count FROM dish_images WHERE image_content NOT LIKE '%non-vegetable%' GROUP BY 1 ORDER BY count DESC;

3.2 Creating Image Processing Pipeline

Use dynamic tables to implement automated data processing:

-- Create a dynamic table to automatically extract and aggregate dish information CREATE OR REPLACE DYNAMIC TABLE dish_summary REFRESH_INTERVAL = '1 HOUR' AS SELECT file_name, url, -- Extract dish name (simplified string processing) SUBSTRING(image_content, POSITION('name' IN image_content) + 9, POSITION('}' IN SUBSTRING(image_content, POSITION('name' IN image_content))) - 10 ) as dish_name, -- Extract calorie information CASE WHEN image_content LIKE '%calorie%' THEN SUBSTRING(image_content, POSITION('calorie' IN image_content) + 12, 3) ELSE NULL END as calorie, created_at FROM dish_images WHERE image_content IS NOT NULL; -- View dynamic table data SELECT * FROM dish_summary;

Conclusion

Through the practical cases and executable code in this article, we have demonstrated the powerful capabilities of Singdata Lakehouse in image data management and analysis. From simple image recognition to complex vector search, from batch processing to real-time analysis, Singdata Lakehouse provides enterprises with a one-stop solution.

Just as embodied by the technical pursuit of MCP Server, the Singdata team's refinement of every functional detail makes this platform not just a data warehouse, but a bridge connecting AI capabilities with business needs. As technology continues to advance, we believe that Singdata Lakehouse will play an increasingly important role in the field of multi-modal data analysis.

This article is written based on the latest version of Singdata Lakehouse, and all SQL code has been tested in actual environments. Specific features may change with version updates. For more technical details, please refer to the official documentation: https://singdata.com/documents