JSON Data Processing Guide for Complex Business Scenarios

Guide Overview

This guide demonstrates the complete technical implementation of Singdata Lakehouse JSON functionality, starting from real-world business scenarios and providing professional solutions for JSON data processing. All SQL examples have been tested and verified.

Guide Structure

Chapter 1: Business Scenario and Challenge Analysis
Chapter 2: Singdata Lakehouse Solutions
Chapter 3: Solutions for Core Technical Challenges
Chapter 4: Deep Technical Implementation and Advanced Optimization
Chapter 5: Chinese and Special Character Handling Guide
Chapter 6: Technical Limitations and Best Practices
Chapter 7: Environment Management and Implementation Recommendations

Chapter 1: Business Scenario and Challenge Analysis

1.1 JSON Data Processing Needs of Modern Enterprises

In the digital era, JSON has become the standard format for enterprise data exchange:

API Economy: Under microservice architecture, a vast amount of data exchange adopts the JSON format
Real-time Data Streams: IoT devices, user behaviors, and system logs generate massive amounts of JSON data
Diverse Data Sources: Social media, e-commerce platforms, and financial transactions produce complex nested JSON
Internationalization Challenges: JSON data contains multilingual content including Chinese, emoji, and more

1.2 Core Business Scenario Analysis

1.2.1 E-Commerce Scenario: Real-time Analysis of Complex Order Data

Business Scenario Description: An e-commerce platform processes tens of thousands of orders daily, with each order containing complex nested structures such as customer information, product details, payment information, and logistics data:

{ "order_id": "ORD-2024-001", "customer": { "id": 10001, "name": "Alice", "profile": { "level": "VIP", "preferences": ["electronics", "gadgets", "books"], "purchase_history": [ {"category": "electronics", "frequency": 15}, {"category": "books", "frequency": 8} ] }, "address": {"city": "Beijing", "district": "Chaoyang District"} }, "items": [ { "sku": "PHONE-001", "name": "iPhone 15", "price": 6999.00, "qty": 1, "attributes": {"color": "Blue", "storage": "256GB"}, "promotion": {"type": "vip_discount", "discount": 500} } ], "payment": { "method": "alipay", "amount": 7197.00, "promotion_applied": {"code": "VIP500", "discount": 500} }, "logistics": { "shipping_method": "express", "estimated_delivery": "2024-12-03", "tracking_events": [ {"status": "ordered", "timestamp": "2024-12-01T10:00:00"}, {"status": "paid", "timestamp": "2024-12-01T10:05:00"} ] }, "user_feedback": { "rating": "5 Stars", "comment": "Great product! Fast delivery!" } }

Core Technical Challenges:

Personalized Recommendation Challenge: Need real-time analysis of customer preferences arrays and purchase_history nested structures
Dynamic Promotion Strategy: Each item's promotion information has an unfixed structure; different promotion types have completely different JSON structures
Real-time Inventory Alerts: Need to extract all SKUs from the order's items array for real-time sales statistics
Customer Behavior Analysis: Analyze the complete path from browsing to order placement
Multilingual Content Handling: Customer names, addresses, and reviews contain multiple languages and emoji symbols

1.2.2 IoT Scenario: Real-time Processing of Complex Industrial Equipment Monitoring Data

Business Scenario Description: A manufacturing enterprise has thousands of production devices, each reporting complex monitoring data every minute:

{ "device_id": "MACHINE-001", "timestamp": "2024-12-01T10:30:00Z", "location": { "building": "Factory-A", "floor": 2, "production_line": "Production-Line-1" }, "sensors": { "temperature": {"value": 85.2, "unit": "celsius", "status": "normal"}, "vibration": { "x_axis": {"value": 0.02, "threshold": 0.05, "status": "normal"}, "y_axis": {"value": 0.03, "threshold": 0.05, "status": "normal"}, "z_axis": {"value": 0.08, "threshold": 0.05, "status": "warning"} } }, "operational_status": { "machine_state": "running", "production_rate": 85.5, "efficiency": 92.3, "maintenance": { "last_service": "2024-11-15T08:00:00Z", "next_scheduled": "2025-02-15T08:00:00Z", "parts_status": [ {"part": "Motor Bearing", "condition": "Good", "life_remaining": 75}, {"part": "Conveyor Belt", "condition": "Fair", "life_remaining": 45} ] } }, "alerts": { "message": "Device running normally", "level": "info" } }

Core Technical Challenges:

Heterogeneous Device Data Fusion: Different device models have completely different sensors structures
Real-time Anomaly Detection: Need to monitor multiple nested metrics simultaneously
Complex Maintenance Planning: Each part in the maintenance.parts_status array has a different lifespan
Multi-dimensional Production Analysis: Need to analyze by production line, floor, device type, and other dimensions

1.2.3 Financial Scenario: Real-time Decision Making for Complex Transaction Risk Control

Business Scenario Description: An internet bank needs to perform real-time risk assessment on every transaction:

{ "transaction_id": "TXN-20241201-001", "timestamp": "2024-12-01T14:35:22Z", "user": { "id": 30001, "name": "Bob", "profile": { "age": 28, "credit_score": 720, "account_level": "Gold" }, "behavior_pattern": { "typical_amount_range": {"min": 500, "max": 3000}, "location_history": [ {"city": "Beijing", "frequency": 85}, {"city": "Shanghai", "frequency": 10} ] } }, "transaction": { "amount": 8500.00, "currency": "CNY", "type": "transfer", "merchant": {"name": "Luxury Store", "category": "retail"} }, "device": { "type": "mobile", "fingerprint": "ABC123XYZ789", "location": {"city": "Shanghai", "ip": "121.xxx.xxx.xxx"}, "is_new_device": true, "trusted_score": 0.3 }, "risk_analysis": { "automated_factors": [ {"factor": "amount_anomaly", "score": 0.8, "reason": "Amount exceeds normal range"}, {"factor": "location_mismatch", "score": 0.6, "reason": "Transaction in different city"} ], "final_decision": { "action": "manual_review", "confidence": 0.85, "reason": "Multiple risk factors detected" } } }

Core Technical Challenges:

Complex Risk Factor Real-time Calculation: Need to simultaneously analyze risk factors from multiple nested objects
Fast Decision Requirements: Risk control decisions must be completed within a short time frame
Dynamic Rule Engine: Risk control rules change frequently, and data structures evolve continuously
Compliance Audit Trail: Need to retain complete decision processes, supporting complex audit queries

1.3 Limitations of Traditional Approaches

Technical Challenges:

Performance Bottleneck: Traditional databases have limited JSON query performance
High Storage Cost: JSON data is redundant with low storage efficiency
Development Complexity: Complex ETL transformations are required
Insufficient Real-time Capability: Batch processing mode cannot meet real-time requirements
Scalability Limitations: Single-machine architecture struggles to handle large-scale data
Multilingual Support: Chinese, emoji, and other special characters are difficult to handle

Chapter 2: Singdata Lakehouse Solutions

2.1 Core Technical Advantages

2.1.1 Native JSON Support

Solution Features:

Native Data Type Support: JSON is a first-class citizen, no pre-defined schema required
Dynamic Structure Adaptation: New fields can be added without downtime or table structure modification
Zero Transformation Cost: JSON data is stored and queried directly, no ETL preprocessing needed
Multilingual Compatibility: Full support for Chinese and emoji content

-- Create sample table CREATE TABLE IF NOT EXISTS orders ( id INT, order_data JSON ); -- Directly store JSON data with different structures (including Chinese) INSERT INTO orders VALUES (1, parse_json('{"customer": {"name": "Alice"}, "promotion": {"type": "vip_discount", "amount": 500}}')), (2, parse_json('{"customer": {"name": "Bob"}, "promotion": {"type": "flash_sale", "percentage": 0.2}}')), (3, parse_json('{"customer": {"name": "Charlie"}, "promotion": {"type": "bundle_deal", "required_items": 3}}')); -- Query promotion data with different structures SELECT id, json_extract_string(order_data, "$.customer.name") as customer_name, json_extract_string(order_data, "$.promotion.type") as promotion_type FROM orders;

2.1.2 Intelligent Columnar Storage

Solution Features:

Automatic Field Optimization: Frequently accessed JSON fields are automatically converted to columnar storage
Path Index Optimization: Efficient indexes are built for commonly used JSON paths
Column Pruning Technology: Only reads the JSON fields needed by the query, reducing I/O overhead

-- Use the previously created orders table for high-frequency field queries SELECT json_extract_string(order_data, "$.customer.name") as customer_name, json_extract_double(order_data, "$.promotion.amount") as discount_amount FROM orders WHERE json_extract_string(order_data, "$.promotion.type") = 'vip_discount';

2.1.3 High-Performance Query Engine

Solution Features:

Parallel Processing Framework: Complex JSON queries are automatically parallelized
In-Memory Computing Optimization: Hot JSON data resides in memory
Intelligent Caching Mechanism: Results for frequently accessed JSON paths are cached

2.1.4 Real-time Stream Processing Integration

Solution Features:

TABLE STREAM Technology: Captures JSON data changes for incremental processing
Dynamic Table Auto-Refresh: Real-time aggregation of complex JSON data
Unified Stream-Batch Processing: Unified handling of real-time streams and historical batch data

-- Create IoT sensor data table CREATE TABLE IF NOT EXISTS iot_sensors ( device_id STRING, device_data JSON, timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ); -- Insert sample IoT data INSERT INTO iot_sensors VALUES ('DEVICE-001', parse_json('{"device_id": "DEVICE-001", "sensors": {"temperature": {"value": 86.5}}, "location": "Factory A"}'), CURRENT_TIMESTAMP()); -- Create alert events table CREATE TABLE IF NOT EXISTS alert_events ( device_id STRING, temp_value DOUBLE, alert_time TIMESTAMP ); -- Real-time anomaly detection (manual simulation) INSERT INTO alert_events SELECT json_extract_string(device_data, "$.device_id") as device_id, json_extract_double(device_data, "$.sensors.temperature.value") as temp_value, CURRENT_TIMESTAMP() as alert_time FROM iot_sensors WHERE json_extract_double(device_data, "$.sensors.temperature.value") > 85;

Chapter 3: Solutions for Core Technical Challenges

This chapter demonstrates in detail how Singdata Lakehouse specifically addresses each core technical challenge raised in Chapter 1. Every solution directly corresponds to a specific business challenge.

3.1 E-Commerce Scenario: Solutions to Technical Challenges

3.1.1 Challenge 1: Nested Array Query for Personalized Recommendations

Technical Challenge Description: Real-time analysis of customer preferences arrays and purchase_history nested structures is required. Traditional relational databases need complex multi-table JOINs or ETL preprocessing and cannot meet the response time requirements for real-time personalized recommendations.

Singdata Lakehouse Solution:

-- Create e-commerce order analysis table with native JSON support CREATE TABLE IF NOT EXISTS ecommerce_orders_json ( order_id STRING, order_data JSON, created_date DATE DEFAULT CURRENT_DATE() ); -- Directly store complex nested structures, no preprocessing needed INSERT INTO ecommerce_orders_json VALUES ('ORD-2024-001', parse_json('{ "customer": { "id": 10001, "name": "Alice", "level": "diamond", "profile": { "preferences": ["electronics", "gadgets", "books"], "purchase_history": [ {"category": "electronics", "frequency": 15}, {"category": "books", "frequency": 8} ] } }, "items": [{"sku": "PHONE-001", "category": "electronics", "price": 6999.00}], "payment": {"amount": 7098.00} }'), CURRENT_DATE()), ('ORD-2024-002', parse_json('{ "customer": { "id": 10002, "name": "Bob", "level": "diamond", "profile": { "preferences": ["books", "sports", "electronics"], "purchase_history": [ {"category": "books", "frequency": 20}, {"category": "sports", "frequency": 12} ] } }, "items": [{"sku": "BOOK-001", "category": "books", "price": 299.00}], "payment": {"amount": 299.00} }'), CURRENT_DATE()), ('ORD-2024-003', parse_json('{ "customer": { "id": 10003, "name": "Charlie", "level": "diamond", "profile": { "preferences": ["gadgets", "electronics", "fashion"], "purchase_history": [ {"category": "gadgets", "frequency": 25}, {"category": "electronics", "frequency": 18} ] } }, "items": [{"sku": "GADGET-001", "category": "gadgets", "price": 1599.00}], "payment": {"amount": 1599.00} }'), CURRENT_DATE()); -- Direct access to nested arrays, no JOIN needed SELECT json_extract_string(order_data, "$.customer.id") as customer_id, json_extract_string(order_data, "$.customer.name") as customer_name, -- Directly extract the first element of the preferences array (primary preference) json_extract_string(order_data, "$.customer.profile.preferences[0]") as primary_preference, -- Directly extract the most frequent category from purchase history json_extract_int(order_data, "$.customer.profile.purchase_history[0].frequency") as top_frequency, -- Real-time customer value calculation json_extract_double(order_data, "$.payment.amount") as order_value FROM ecommerce_orders_json WHERE json_extract_string(order_data, "$.customer.level") = 'diamond' ORDER BY json_extract_double(order_data, "$.payment.amount") DESC;

Technical Value:

Zero ETL Processing: Complex nested data is stored and queried directly
Native Array Access: [0] syntax directly accesses array elements without table splitting
Real-time Aggregation: Multi-dimensional customer profile analysis completed in a single query
Automatic Optimization: High-frequency fields are automatically stored in columnar format, improving query efficiency

3.1.2 Challenge 2: Flexible Querying of Dynamic Promotion Structures

Technical Challenge Description: Different promotion types (VIP discount, flash sale, bundle deal) have completely different JSON structures. Traditional fixed schemas cannot adapt and require pre-defining all possible promotion fields, resulting in high maintenance costs.

Singdata Lakehouse Dynamic Schema Solution:

-- Create promotion data table CREATE TABLE IF NOT EXISTS promotion_orders_json ( order_id STRING, order_data JSON, created_date DATE DEFAULT CURRENT_DATE() ); -- Store promotion data with different structures in the same table INSERT INTO promotion_orders_json VALUES -- VIP discount: fixed amount structure ('ORD-VIP-001', parse_json('{"customer": {"name": "Alice"}, "promotion": {"type": "vip_discount", "amount": 500}}'), CURRENT_DATE()), -- Flash sale: percentage structure ('ORD-FLASH-001', parse_json('{"customer": {"name": "Bob"}, "promotion": {"type": "flash_sale", "percentage": 0.2}}'), CURRENT_DATE()), -- Bundle deal: combined item structure ('ORD-BUNDLE-001', parse_json('{"customer": {"name": "Charlie"}, "promotion": {"type": "bundle_deal", "required_items": 3, "discount_per_item": 50}}'), CURRENT_DATE()), -- Coupon: code structure ('ORD-COUPON-001', parse_json('{"customer": {"name": "Diana"}, "promotion": {"type": "coupon", "code": "SAVE20", "discount": 200}}'), CURRENT_DATE()); -- Adaptive query: one SQL handles all promotion types SELECT order_id, json_extract_string(order_data, "$.customer.name") as customer_name, json_extract_string(order_data, "$.promotion.type") as promotion_type, -- Automatically extract corresponding fields based on promotion type CASE WHEN json_extract_string(order_data, "$.promotion.type") = 'vip_discount' THEN json_extract_double(order_data, "$.promotion.amount") WHEN json_extract_string(order_data, "$.promotion.type") = 'coupon' THEN json_extract_double(order_data, "$.promotion.discount") ELSE 0 END as discount_amount, CASE WHEN json_extract_string(order_data, "$.promotion.type") = 'flash_sale' THEN json_extract_double(order_data, "$.promotion.percentage") ELSE 0 END as discount_percentage, CASE WHEN json_extract_string(order_data, "$.promotion.type") = 'bundle_deal' THEN json_extract_int(order_data, "$.promotion.required_items") ELSE 0 END as bundle_items, CASE WHEN json_extract_string(order_data, "$.promotion.type") = 'coupon' THEN json_extract_string(order_data, "$.promotion.code") ELSE NULL END as coupon_code FROM promotion_orders_json WHERE json_extract_string(order_data, "$.promotion.type") IS NOT NULL;

Technical Value:

Dynamic Schema Adaptation: New promotion types are automatically supported with zero downtime
Structure-Agnostic Queries: A single SQL handles multiple data structures
Business Agility: Promotion rule changes do not require database schema modifications

3.1.3 Challenge 3: Array Aggregation for Real-time Inventory Alerts

Technical Challenge Description: All SKUs need to be extracted from the order's items array and sales statistics calculated in real time. Traditional approaches require complex triggers or scheduled ETL jobs, with high latency and resource consumption.

Singdata Lakehouse Real-time Stream Processing Solution:

-- Create inventory order data table CREATE TABLE IF NOT EXISTS inventory_orders_json ( order_id STRING, order_data JSON, created_date DATE DEFAULT CURRENT_DATE() ); -- Insert order data containing SKUs and quantities INSERT INTO inventory_orders_json VALUES ('ORD-INV-001', parse_json('{"items": [{"sku": "PHONE-001", "qty": 2}], "payment": {"amount": 13998.00}}'), CURRENT_DATE()), ('ORD-INV-002', parse_json('{"items": [{"sku": "BOOK-001", "qty": 5}], "payment": {"amount": 1495.00}}'), CURRENT_DATE()), ('ORD-INV-003', parse_json('{"items": [{"sku": "PHONE-001", "qty": 1}], "payment": {"amount": 6999.00}}'), CURRENT_DATE()), ('ORD-INV-004', parse_json('{"items": [{"sku": "GADGET-001", "qty": 3}], "payment": {"amount": 4797.00}}'), CURRENT_DATE()), ('ORD-INV-005', parse_json('{"items": [{"sku": "BOOK-001", "qty": 2}], "payment": {"amount": 598.00}}'), CURRENT_DATE()); -- Dynamic table for real-time inventory statistics with automatic refresh CREATE OR REPLACE DYNAMIC TABLE IF NOT EXISTS real_time_inventory REFRESH INTERVAL 1 MINUTES AS SELECT json_extract_string(order_data, "$.items[0].sku") as sku, SUM(json_extract_int(order_data, "$.items[0].qty")) as total_sold, COUNT(*) as order_count, MAX(created_date) as last_order_date, -- Real-time inventory alert threshold calculation CASE WHEN SUM(json_extract_int(order_data, "$.items[0].qty")) > 5 THEN 'high_demand' WHEN SUM(json_extract_int(order_data, "$.items[0].qty")) > 2 THEN 'medium_demand' ELSE 'low_demand' END as demand_level FROM inventory_orders_json WHERE json_extract_string(order_data, "$.items[0].sku") IS NOT NULL GROUP BY json_extract_string(order_data, "$.items[0].sku"); -- Real-time query for inventory alerts SELECT * FROM real_time_inventory WHERE demand_level = 'high_demand' ORDER BY total_sold DESC;

Technical Value:

Unified Stream-Batch: Dynamic tables automatically aggregate data changes
Real-time Aggregation: No need to manually maintain triggers and aggregation logic
Zero Maintenance Cost: The system automatically handles complex aggregation computations

3.2 IoT Scenario: Unified Processing of Complex Device Data

3.2.1 Challenge 1: Unified Querying of Heterogeneous Device Data

Technical Challenge Description: Different device models have completely different sensors structures: old devices only have temperature, while new devices have dozens of sensors. Traditional approaches require creating different table structures for each device type, resulting in severe data silos.

Singdata Lakehouse Heterogeneous Data Unified Solution:

-- Create IoT device monitoring table supporting any device structure CREATE TABLE IF NOT EXISTS iot_devices_unified ( device_id STRING, sensor_data JSON, collected_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ); -- Store device data with different structures in the same table INSERT INTO iot_devices_unified VALUES -- Basic sensor: temperature only ('BASIC-001', parse_json('{"device_type": "basic", "temp": 22.1, "battery": 85}'), CURRENT_TIMESTAMP()), -- Standard sensor: temperature + humidity ('STANDARD-001', parse_json('{"device_type": "standard", "readings": {"temperature": 24.5, "humidity": 58.2}}'), CURRENT_TIMESTAMP()), -- Advanced sensor: complete multi-dimensional data ('ADVANCED-001', parse_json('{"device_type": "advanced", "sensors": {"temperature": {"value": 85.2}, "vibration": {"x_axis": {"value": 0.02}}}}'), CURRENT_TIMESTAMP()), -- Industrial sensor: most complex structure ('INDUSTRIAL-001', parse_json('{"device_type": "industrial", "sensors": {"temperature": {"value": 78.5}, "vibration": {"x_axis": {"value": 0.01}, "y_axis": {"value": 0.03}, "z_axis": {"value": 0.02}}, "pressure": {"value": 1013.25}}}'), CURRENT_TIMESTAMP()); -- Unified query interface, automatically adapting to different device structures SELECT device_id, json_extract_string(sensor_data, "$.device_type") as device_type, -- Compatible with multiple temperature field structures COALESCE( json_extract_double(sensor_data, "$.sensors.temperature.value"), -- Advanced device json_extract_double(sensor_data, "$.readings.temperature"), -- Standard device json_extract_double(sensor_data, "$.temp") -- Basic device ) as temperature, -- Optional advanced sensor data json_extract_double(sensor_data, "$.sensors.vibration.x_axis.value") as vibration_x, json_extract_double(sensor_data, "$.sensors.vibration.y_axis.value") as vibration_y, json_extract_double(sensor_data, "$.sensors.vibration.z_axis.value") as vibration_z, json_extract_double(sensor_data, "$.sensors.pressure.value") as pressure, -- Automatic device type identification CASE WHEN json_extract_string(sensor_data, "$.sensors.vibration") IS NOT NULL THEN 'high_end_device' WHEN json_extract_string(sensor_data, "$.readings.humidity") IS NOT NULL THEN 'standard_device' ELSE 'basic_device' END as inferred_device_category FROM iot_devices_unified ORDER BY collected_time DESC;

Technical Value:

Device Plug-and-Play: New device types require no schema changes
Unified Query Interface: A single SQL supports all device types
Intelligent Field Mapping: COALESCE automatically adapts to different field structures

3.3 Financial Scenario: Real-time Decision Making for Complex Risk Control

3.3.1 Challenge 1: Complex Risk Factor Real-time Calculation

Technical Challenge Description: Multiple nested objects such as user behavior patterns, device trust scores, and geographic locations must be analyzed simultaneously as risk factors. Traditional approaches require multiple database queries and complex business logic calculations, unable to meet fast decision-making requirements.

Singdata Lakehouse Multi-dimensional Risk Factor Parallel Calculation Solution:

-- Create financial risk control data table CREATE TABLE IF NOT EXISTS financial_risk_analysis ( transaction_id STRING, risk_data JSON, analysis_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ); -- Insert complex risk control test data INSERT INTO financial_risk_analysis VALUES ('TXN-001', parse_json('{ "user": { "id": "30001", "behavior_pattern": { "typical_amount_range": {"min": 500, "max": 3000}, "location_history": [{"city": "Beijing", "frequency": 85}] } }, "transaction": {"amount": 8500.00}, "device": {"is_new_device": true, "trusted_score": 0.3, "location": {"city": "Shanghai"}} }'), CURRENT_TIMESTAMP()), ('TXN-002', parse_json('{ "user": { "id": "30002", "behavior_pattern": { "typical_amount_range": {"min": 1000, "max": 5000}, "location_history": [{"city": "Shanghai", "frequency": 90}] } }, "transaction": {"amount": 2500.00}, "device": {"is_new_device": false, "trusted_score": 0.9, "location": {"city": "Shanghai"}} }'), CURRENT_TIMESTAMP()), ('TXN-003', parse_json('{ "user": { "id": "30003", "behavior_pattern": { "typical_amount_range": {"min": 200, "max": 1500}, "location_history": [{"city": "Shenzhen", "frequency": 95}] } }, "transaction": {"amount": 4500.00}, "device": {"is_new_device": true, "trusted_score": 0.2, "location": {"city": "Beijing"}} }'), CURRENT_TIMESTAMP()); -- Complete complex multi-dimensional risk assessment in a single query SELECT transaction_id, json_extract_string(risk_data, "$.user.id") as user_id, -- Real-time amount anomaly risk calculation CASE WHEN json_extract_double(risk_data, "$.transaction.amount") > json_extract_double(risk_data, "$.user.behavior_pattern.typical_amount_range.max") * 2 THEN 0.8 -- Exceeds normal amount by 2x: high risk WHEN json_extract_double(risk_data, "$.transaction.amount") > json_extract_double(risk_data, "$.user.behavior_pattern.typical_amount_range.max") * 1.5 THEN 0.5 -- Exceeds normal amount by 1.5x: medium risk ELSE 0.2 -- Within normal range: low risk END as amount_risk_score, -- Real-time device trust risk calculation CASE WHEN json_extract_string(risk_data, "$.device.is_new_device") = 'true' THEN 1.0 - json_extract_double(risk_data, "$.device.trusted_score") ELSE 0.1 END as device_risk_score, -- Real-time geographic location risk calculation CASE WHEN json_extract_string(risk_data, "$.device.location.city") != json_extract_string(risk_data, "$.user.behavior_pattern.location_history[0].city") THEN 0.6 -- Transaction in different city: medium risk ELSE 0.1 -- Usual location: low risk END as location_risk_score, -- Comprehensive risk score: weighted algorithm calculated in one pass ((CASE WHEN json_extract_double(risk_data, "$.transaction.amount") > json_extract_double(risk_data, "$.user.behavior_pattern.typical_amount_range.max") * 2 THEN 0.8 ELSE 0.2 END) * 0.4 + -- Amount risk weight 40% (CASE WHEN json_extract_string(risk_data, "$.device.is_new_device") = 'true' THEN 1.0 - json_extract_double(risk_data, "$.device.trusted_score") ELSE 0.1 END) * 0.3 + -- Device risk weight 30% (CASE WHEN json_extract_string(risk_data, "$.device.location.city") != json_extract_string(risk_data, "$.user.behavior_pattern.location_history[0].city") THEN 0.6 ELSE 0.1 END) * 0.3 -- Location risk weight 30% ) as final_risk_score, -- Real-time decision recommendation CASE WHEN ((CASE WHEN json_extract_double(risk_data, "$.transaction.amount") > json_extract_double(risk_data, "$.user.behavior_pattern.typical_amount_range.max") * 2 THEN 0.8 ELSE 0.2 END) * 0.4 + (CASE WHEN json_extract_string(risk_data, "$.device.is_new_device") = 'true' THEN 1.0 - json_extract_double(risk_data, "$.device.trusted_score") ELSE 0.1 END) * 0.3 + (CASE WHEN json_extract_string(risk_data, "$.device.location.city") != json_extract_string(risk_data, "$.user.behavior_pattern.location_history[0].city") THEN 0.6 ELSE 0.1 END) * 0.3) > 0.7 THEN 'REJECT' WHEN ((CASE WHEN json_extract_double(risk_data, "$.transaction.amount") > json_extract_double(risk_data, "$.user.behavior_pattern.typical_amount_range.max") * 2 THEN 0.8 ELSE 0.2 END) * 0.4 + (CASE WHEN json_extract_string(risk_data, "$.device.is_new_device") = 'true' THEN 1.0 - json_extract_double(risk_data, "$.device.trusted_score") ELSE 0.1 END) * 0.3 + (CASE WHEN json_extract_string(risk_data, "$.device.location.city") != json_extract_string(risk_data, "$.user.behavior_pattern.location_history[0].city") THEN 0.6 ELSE 0.1 END) * 0.3) > 0.4 THEN 'MANUAL_REVIEW' ELSE 'APPROVE' END as recommended_action FROM financial_risk_analysis ORDER BY final_risk_score DESC;

Technical Value:

Multi-dimensional Analysis in a Single Query: All risk factor calculations completed in one SQL
Parallel Access to Nested Objects: Simultaneously extract multiple deeply nested fields
Fast Decision Support: Significantly reduces risk control decision processing time

Chapter 4: Deep Technical Implementation and Advanced Optimization

This chapter demonstrates the advanced technical features of Singdata Lakehouse JSON processing in depth. Each technical implementation provides a deep solution targeting the specific technical challenges raised in Chapter 1.

4.1 Time-series Array Processing for the "Customer Behavior Analysis" Challenge

4.1.1 Technical Challenge Review

Original Challenge: Analyzing the complete path from browsing to order placement, with data stored in the logistics.tracking_events time-series array. Traditional approaches require splitting array data into multiple rows for storage, making queries complex and performance-limited.

4.1.2 Singdata Lakehouse Time-series Array Processing Implementation

Core Technical Implementation:

-- Create customer behavior analysis table CREATE TABLE IF NOT EXISTS customer_behavior_analysis ( order_id STRING, behavior_data JSON, created_date DATE DEFAULT CURRENT_DATE() ); -- Insert time-series behavior data INSERT INTO customer_behavior_analysis VALUES ('ORD-BEHAVIOR-001', parse_json('{ "customer": {"id": 10001, "name": "Alice"}, "items": [{"sku": "PHONE-001", "price": 6999.00}], "behavior_tracking": { "source": "search_engine", "conversion_funnel": [ {"event": "product_view", "timestamp": "2024-12-01T09:00:00", "duration": 120}, {"event": "add_to_cart", "timestamp": "2024-12-01T09:02:00", "duration": 30}, {"event": "checkout_start", "timestamp": "2024-12-01T09:02:30", "duration": 180}, {"event": "payment_complete", "timestamp": "2024-12-01T09:05:30", "duration": 45} ] }, "logistics": { "tracking_events": [ {"status": "ordered", "timestamp": "2024-12-01T09:05:30", "location": "Beijing"}, {"status": "paid", "timestamp": "2024-12-01T09:06:00", "location": "Beijing"}, {"status": "shipped", "timestamp": "2024-12-01T14:00:00", "location": "Beijing Warehouse"}, {"status": "delivered", "timestamp": "2024-12-02T10:00:00", "location": "Chaoyang District"} ] } }'), CURRENT_DATE()), ('ORD-BEHAVIOR-002', parse_json('{ "customer": {"id": 10002, "name": "Bob"}, "items": [{"sku": "BOOK-001", "price": 299.00}], "behavior_tracking": { "source": "direct_visit", "conversion_funnel": [ {"event": "product_view", "timestamp": "2024-12-01T10:00:00", "duration": 200}, {"event": "add_to_cart", "timestamp": "2024-12-01T10:03:20", "duration": 45}, {"event": "checkout_start", "timestamp": "2024-12-01T10:04:05", "duration": 90}, {"event": "payment_complete", "timestamp": "2024-12-01T10:05:35", "duration": 30} ] }, "logistics": { "tracking_events": [ {"status": "ordered", "timestamp": "2024-12-01T10:05:35", "location": "Shanghai"}, {"status": "paid", "timestamp": "2024-12-01T10:06:00", "location": "Shanghai"}, {"status": "shipped", "timestamp": "2024-12-01T16:00:00", "location": "Shanghai Warehouse"}, {"status": "delivered", "timestamp": "2024-12-02T14:00:00", "location": "Pudong New Area"} ] } }'), CURRENT_DATE()); -- Complex time-series analysis: complete behavior path analysis in a single query SELECT order_id, json_extract_string(behavior_data, "$.customer.id") as customer_id, json_extract_string(behavior_data, "$.customer.name") as customer_name, -- Extract the complete conversion funnel path json_extract_string(behavior_data, "$.behavior_tracking.conversion_funnel[0].event") as funnel_step1, json_extract_string(behavior_data, "$.behavior_tracking.conversion_funnel[1].event") as funnel_step2, json_extract_string(behavior_data, "$.behavior_tracking.conversion_funnel[2].event") as funnel_step3, json_extract_string(behavior_data, "$.behavior_tracking.conversion_funnel[3].event") as funnel_step4, -- Calculate duration at each stage json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[0].duration") as view_duration, json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[1].duration") as cart_duration, json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[2].duration") as checkout_duration, -- Logistics status time-series analysis json_extract_string(behavior_data, "$.logistics.tracking_events[0].status") as initial_status, json_extract_string(behavior_data, "$.logistics.tracking_events[3].status") as final_status, -- Real-time conversion efficiency calculation CASE WHEN json_extract_string(behavior_data, "$.logistics.tracking_events[1].status") = 'paid' THEN 'successful_conversion' ELSE 'incomplete_conversion' END as conversion_result, -- Geographic location change tracking json_extract_string(behavior_data, "$.logistics.tracking_events[0].location") as order_location, json_extract_string(behavior_data, "$.logistics.tracking_events[3].location") as delivery_location, -- Conversion efficiency analysis (json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[0].duration") + json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[1].duration") + json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[2].duration") + json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[3].duration")) as total_conversion_time FROM customer_behavior_analysis WHERE json_extract_string(behavior_data, "$.behavior_tracking.conversion_funnel") IS NOT NULL ORDER BY total_conversion_time ASC;

Technical Value:

Zero Data Restructuring: Time-series arrays are analyzed directly without splitting storage
Single Query Full Chain: The complete path from browsing to delivery is obtained in one go
Real-time Conversion Analysis: Quickly calculates conversion rates and user behavior patterns

4.2 Dynamic Array Computation for the "Complex Maintenance Planning" Challenge

4.2.1 Technical Challenge Review

Original Challenge: Each part in the maintenance.parts_status array has a different lifespan, requiring dynamic calculation of maintenance priorities. Traditional approaches need complex stored procedures and scheduled calculations.

4.2.2 Singdata Lakehouse Dynamic Array Calculation Implementation

Core Technical Implementation:

-- Create device maintenance analysis table CREATE TABLE IF NOT EXISTS device_maintenance_analysis ( device_id STRING, maintenance_data JSON, collected_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ); -- Store complex part status data INSERT INTO device_maintenance_analysis VALUES ('MACHINE-COMPLEX-001', parse_json('{ "device_info": {"type": "production_machine", "location": "Factory A - Production Line 1"}, "operational_status": { "machine_state": "running", "production_rate": 85.5, "maintenance": { "last_service": "2024-11-15T08:00:00Z", "next_scheduled": "2025-02-15T08:00:00Z", "parts_status": [ {"part": "Main Motor", "condition": "Good", "life_remaining": 75, "cost": 50000, "priority": 1}, {"part": "Conveyor Belt", "condition": "Fair", "life_remaining": 45, "cost": 8000, "priority": 2}, {"part": "Hydraulic System", "condition": "Good", "life_remaining": 30, "cost": 25000, "priority": 1}, {"part": "Control Panel", "condition": "Excellent", "life_remaining": 90, "cost": 15000, "priority": 3}, {"part": "Cooling System", "condition": "Needs Attention", "life_remaining": 20, "cost": 12000, "priority": 1} ] } } }'), CURRENT_TIMESTAMP()), ('MACHINE-COMPLEX-002', parse_json('{ "device_info": {"type": "assembly_machine", "location": "Factory B - Production Line 2"}, "operational_status": { "machine_state": "running", "production_rate": 92.3, "maintenance": { "last_service": "2024-10-20T08:00:00Z", "next_scheduled": "2025-01-20T08:00:00Z", "parts_status": [ {"part": "Assembly Arm", "condition": "Good", "life_remaining": 85, "cost": 35000, "priority": 1}, {"part": "Sensor", "condition": "Fair", "life_remaining": 55, "cost": 5000, "priority": 2}, {"part": "Circuit Board", "condition": "Excellent", "life_remaining": 95, "cost": 8000, "priority": 2}, {"part": "Motor", "condition": "Needs Attention", "life_remaining": 25, "cost": 18000, "priority": 1} ] } } }'), CURRENT_TIMESTAMP()); -- Intelligent maintenance priority dynamic calculation SELECT device_id, json_extract_string(maintenance_data, "$.device_info.location") as location, json_extract_string(maintenance_data, "$.device_info.type") as device_type, -- Comprehensive analysis of part status array json_extract_string(maintenance_data, "$.operational_status.maintenance.parts_status[0].part") as part1_name, json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].life_remaining") as part1_life, json_extract_string(maintenance_data, "$.operational_status.maintenance.parts_status[1].part") as part2_name, json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[1].life_remaining") as part2_life, json_extract_string(maintenance_data, "$.operational_status.maintenance.parts_status[2].part") as part3_name, json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].life_remaining") as part3_life, -- Intelligent maintenance decision algorithm: comprehensive consideration of lifespan, cost, and priority CASE -- High-priority parts with less than 30% life remaining: urgent maintenance WHEN (json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].life_remaining") < 30 AND json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].priority") = 1) OR (json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].life_remaining") < 30 AND json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].priority") = 1) OR (COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].life_remaining"), 100) < 30 AND COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].priority"), 2) = 1) THEN 'urgent_maintenance' -- Any part with less than 50% life remaining: scheduled maintenance WHEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].life_remaining") < 50 OR json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[1].life_remaining") < 50 OR json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].life_remaining") < 50 OR COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[3].life_remaining"), 100) < 50 OR COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].life_remaining"), 100) < 50 THEN 'scheduled_maintenance' ELSE 'normal_operation' END as maintenance_priority, -- Identify the most critical part (high-priority part with shortest life remaining) LEAST( CASE WHEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].priority") = 1 THEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].life_remaining") ELSE 100 END, CASE WHEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].priority"), 2) = 1 THEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].life_remaining") ELSE 100 END, CASE WHEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].priority"), 2) = 1 THEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].life_remaining"), 100) ELSE 100 END ) as critical_part_min_life, -- Maintenance cost estimation (dynamically calculated based on part status) (CASE WHEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].life_remaining") < 30 THEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].cost") ELSE 0 END + CASE WHEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[1].life_remaining") < 30 THEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[1].cost") ELSE 0 END + CASE WHEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].life_remaining") < 30 THEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].cost") ELSE 0 END + CASE WHEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[3].life_remaining"), 100) < 30 THEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[3].cost"), 0) ELSE 0 END + CASE WHEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].life_remaining"), 100) < 30 THEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].cost"), 0) ELSE 0 END ) as estimated_maintenance_cost FROM device_maintenance_analysis WHERE json_extract_string(maintenance_data, "$.operational_status.maintenance.parts_status") IS NOT NULL ORDER BY critical_part_min_life ASC;

Technical Value:

Intelligent Decision Algorithm: Complex maintenance strategy integrating lifespan, cost, and priority
Native Array Computation: Complex mathematical operations on arrays without splitting
Real-time Cost Estimation: Dynamically calculates maintenance budget based on real-time part status

4.3 Performance Optimization Techniques

4.3.1 High-Frequency Field Access Pattern Optimization

-- Create performance optimization demo table CREATE TABLE IF NOT EXISTS performance_orders ( id INT, order_data JSON, created_date DATE DEFAULT CURRENT_DATE() ); -- Insert test data INSERT INTO performance_orders VALUES (1, parse_json('{"customer": {"id": "C001", "level": "VIP"}, "payment": {"amount": 1500.00}, "items": [{"detail": {"description": "Premium Electronics"}}]}'), CURRENT_DATE()), (2, parse_json('{"customer": {"id": "C002", "level": "regular"}, "payment": {"amount": 2300.00}, "items": [{"detail": {"description": "Home Goods"}}]}'), CURRENT_DATE()), (3, parse_json('{"customer": {"id": "C003", "level": "VIP"}, "payment": {"amount": 800.00}, "items": [{"detail": {"description": "Daily Necessities"}}]}'), CURRENT_DATE()), (4, parse_json('{"customer": {"id": "C004", "level": "diamond"}, "payment": {"amount": 3500.00}, "items": [{"detail": {"description": "Luxury Goods"}}]}'), CURRENT_DATE()); -- High-frequency field access pattern optimization WITH high_performance_query AS ( SELECT -- Top-level field access: automatic columnar storage optimization json_extract_string(order_data, "$.customer.id") as customer_id, json_extract_double(order_data, "$.payment.amount") as amount, json_extract_string(order_data, "$.customer.level") as level, -- Batch field extraction: reduce repeated JSON parsing overhead order_data as full_order_data, created_date FROM performance_orders WHERE json_extract_string(order_data, "$.customer.id") IS NOT NULL -- Efficient top-level field filtering AND json_extract_double(order_data, "$.payment.amount") > 1000 -- Efficient numeric field filtering ) -- Secondary computation optimization: complex analysis based on pre-extracted fields SELECT customer_id, level, amount, -- Complex nested fields extracted on demand (avoid repeated full-table parsing) json_extract_string(full_order_data, "$.items[0].detail.description") as product_description, -- Aggregation optimization: pre-group to reduce computation CASE WHEN amount > 3000 THEN 'high_value' WHEN amount > 1500 THEN 'medium_value' ELSE 'low_value' END as value_segment FROM high_performance_query ORDER BY amount DESC;

4.3.2 Alternative Solutions for JSON Data Updates

Important Technical Limitation: After practical verification, Singdata Lakehouse does not support json_set, json_insert, json_remove, or other JSON modification functions. Below are verified alternative solutions:

Approach 1: Complete JSON Reconstruction (suitable for simple updates)

-- Create transaction demo table CREATE TABLE IF NOT EXISTS transaction_demo ( customer_id STRING, customer_data JSON, updated_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ); -- Insert initial data INSERT INTO transaction_demo VALUES ('10001', parse_json('{"name": "Alice", "level": "regular", "profile": {"upgrade_date": null}, "payment": {"vip_discount": 0}}'), CURRENT_TIMESTAMP()); -- Use parse_json + CONCAT to reconstruct JSON object for update UPDATE transaction_demo SET customer_data = parse_json(CONCAT( '{"name": "', json_extract_string(customer_data, "$.name"), '", ', '"level": "VIP", ', '"profile": {"upgrade_date": "2024-12-01"}, ', '"payment": {"vip_discount": 500.0}}' )), updated_time = CURRENT_TIMESTAMP() WHERE customer_id = '10001' AND json_extract_string(customer_data, "$.level") != 'VIP'; -- Verify update result SELECT customer_id, json_extract_string(customer_data, "$.name") as name, json_extract_string(customer_data, "$.level") as level, json_extract_string(customer_data, "$.profile.upgrade_date") as upgrade_date, json_extract_double(customer_data, "$.payment.vip_discount") as vip_discount FROM transaction_demo WHERE customer_id = '10001';

Approach 2: Hybrid Storage Strategy (recommended for frequent updates)

-- Separate frequently updated fields; use JSON for stable data CREATE TABLE IF NOT EXISTS hybrid_storage_demo ( id INT, user_level STRING, -- Frequently updated fields stored separately update_count INT, -- Frequently updated fields stored separately user_profile JSON, -- Stable complex data stored as JSON last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ); -- Insert hybrid storage data INSERT INTO hybrid_storage_demo VALUES (1, 'VIP', 5, parse_json('{"preferences": ["electronics", "books"], "address": {"city": "Beijing", "district": "Chaoyang District"}}'), CURRENT_TIMESTAMP()); -- Query hybrid storage data SELECT id, user_level, -- Relational field, high update efficiency update_count, -- Relational field, high update efficiency json_extract_string(user_profile, "$.preferences[0]") as primary_preference, -- JSON field, flexible querying json_extract_string(user_profile, "$.address.city") as city -- JSON field, structured storage FROM hybrid_storage_demo;

Chapter 5: Chinese and Special Character Handling Guide

5.1 Chinese and Emoji Support Verification

Based on comprehensive testing, Singdata Lakehouse provides complete support for Chinese and emoji symbols:

5.1.1 Fully Supported Scenarios

Chinese Content Values:

-- Create user data table CREATE TABLE IF NOT EXISTS user_data_chinese ( id STRING, data JSON ); -- Insert Chinese content INSERT INTO user_data_chinese VALUES ('user001', parse_json('{"name": "Alice", "city": "Beijing", "message": "Hello World!"}')), ('user002', parse_json('{"name": "Bob", "city": "Shanghai", "message": "Data processing is simple"}')), ('user003', parse_json('{"name": "Charlie", "city": "Shenzhen", "message": "JSON features are powerful"}')); -- Query Chinese content SELECT json_extract_string(data, "$.name") as name, json_extract_string(data, "$.city") as city, json_extract_string(data, "$.message") as message FROM user_data_chinese;

Emoji Symbols:

-- Create social data table CREATE TABLE IF NOT EXISTS social_data_emoji ( id STRING, data JSON ); -- Insert emoji data INSERT INTO social_data_emoji VALUES ('post001', parse_json('{"content": "Feeling great today😊", "reactions": "👍❤️🔥", "rating": "⭐⭐⭐⭐⭐"}')), ('post002', parse_json('{"content": "Product quality is really good👌", "reactions": "👏🎉💯", "rating": "⭐⭐⭐⭐"}')), ('post003', parse_json('{"content": "Very satisfied with the service🙂", "reactions": "😀👍✨", "rating": "⭐⭐⭐⭐⭐"}')); -- Query emoji data SELECT json_extract_string(data, "$.content") as content, json_extract_string(data, "$.reactions") as reactions, json_extract_string(data, "$.rating") as rating FROM social_data_emoji;

Complex Chinese Content:

-- Create feedback data table CREATE TABLE IF NOT EXISTS feedback_data_complex ( id STRING, data JSON, created_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ); -- Insert complex Chinese and emoji content INSERT INTO feedback_data_complex VALUES ('fb001', parse_json('{"comment": "Product quality is great, fast delivery, good customer service👍", "sentiment": "positive", "tags": ["quality", "logistics", "service"]}'), CURRENT_TIMESTAMP()), ('fb002', parse_json('{"comment": "Beautiful packaging, powerful features, very satisfied😊", "sentiment": "positive", "tags": ["packaging", "features", "satisfaction"]}'), CURRENT_TIMESTAMP()), ('fb003', parse_json('{"comment": "Great value, recommended purchase🛒", "sentiment": "positive", "tags": ["value", "recommendation"]}'), CURRENT_TIMESTAMP()); -- Query complex content SELECT id, json_extract_string(data, "$.comment") as comment, json_extract_string(data, "$.sentiment") as sentiment, json_extract_string(data, "$.tags[0]") as first_tag, json_extract_string(data, "$.tags[1]") as second_tag FROM feedback_data_complex ORDER BY created_time DESC;

5.1.2 Querying Chinese and Emoji Data

-- Create user feedback data table CREATE TABLE IF NOT EXISTS user_feedback_full ( id STRING, data JSON, created_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ); -- Insert test data containing Chinese and emoji INSERT INTO user_feedback_full VALUES ('user001', parse_json('{"name": "Alice", "city": "Beijing", "comment": "Product is great😊", "score": 5}'), CURRENT_TIMESTAMP()), ('user002', parse_json('{"name": "Bob", "city": "Shanghai", "comment": "Good service👍", "score": 4}'), CURRENT_TIMESTAMP()), ('user003', parse_json('{"name": "Charlie", "city": "Beijing", "comment": "Fast delivery, beautiful packaging🎉", "score": 5}'), CURRENT_TIMESTAMP()), ('user004', parse_json('{"name": "David", "city": "Shenzhen", "comment": "Powerful features, worth recommending💯", "score": 5}'), CURRENT_TIMESTAMP()); -- Query data containing Chinese and emoji SELECT json_extract_string(data, "$.name") as user_name, json_extract_string(data, "$.city") as city, json_extract_string(data, "$.comment") as feedback, json_extract_int(data, "$.score") as score FROM user_feedback_full WHERE json_extract_string(data, "$.city") = 'Beijing' AND json_extract_string(data, "$.comment") LIKE '%😊%'; -- Group statistics by Chinese field SELECT json_extract_string(data, "$.city") as city, COUNT(*) as user_count, AVG(json_extract_int(data, "$.score")) as avg_score FROM user_feedback_full GROUP BY json_extract_string(data, "$.city") ORDER BY user_count DESC; -- Emoji usage analysis for high-score users SELECT json_extract_string(data, "$.name") as user_name, json_extract_string(data, "$.comment") as comment, json_extract_int(data, "$.score") as score, CASE WHEN json_extract_string(data, "$.comment") LIKE '%😊%' THEN 'happy' WHEN json_extract_string(data, "$.comment") LIKE '%👍%' THEN 'thumbs_up' WHEN json_extract_string(data, "$.comment") LIKE '%🎉%' THEN 'celebration' WHEN json_extract_string(data, "$.comment") LIKE '%💯%' THEN 'perfect' ELSE 'no_emoji' END as emoji_sentiment FROM user_feedback_full WHERE json_extract_int(data, "$.score") >= 4 ORDER BY json_extract_int(data, "$.score") DESC;

5.2 Best Practice Recommendations

5.2.1 Field Naming Conventions

Recommended: Use English field names

-- Create best practice demo table CREATE TABLE IF NOT EXISTS best_practice_demo ( id STRING, data JSON ); -- Recommended field naming approach INSERT INTO best_practice_demo VALUES ('demo001', parse_json('{"user_name": "Alice", "user_city": "Beijing", "user_comment": "Product is great😊", "rating_score": 5}')); -- Query demonstration SELECT json_extract_string(data, "$.user_name") as name, json_extract_string(data, "$.user_city") as city, json_extract_string(data, "$.user_comment") as comment, json_extract_int(data, "$.rating_score") as score FROM best_practice_demo;

Use with Caution: Chinese field names

-- Create Chinese field name demo table (not recommended) CREATE TABLE IF NOT EXISTS chinese_field_demo ( id STRING, data JSON ); -- Chinese field name example (functional but not recommended) INSERT INTO chinese_field_demo VALUES ('demo001', parse_json('{"user_name_cn": "Alice", "user_city_cn": "Beijing", "user_review_cn": "Product is great😊"}')); -- Query Chinese field names (syntactically correct but not recommended) SELECT json_extract_string(data, "$.user_name_cn") as name, json_extract_string(data, "$.user_city_cn") as city, json_extract_string(data, "$.user_review_cn") as comment FROM chinese_field_demo;

5.2.2 Special Character Handling

Safe Content Formats:

-- Create special character handling demo table CREATE TABLE IF NOT EXISTS special_char_demo ( id STRING, data JSON ); -- Insert data containing various special characters INSERT INTO special_char_demo VALUES ('char001', parse_json('{"message": "Normal text content and emoji😊", "symbols": "!@#$%&*()_+-=[]{}|;:,.<>?/", "punctuation": "Punctuation marks: ，。；：""''（）【】、？！"}')); -- Query special character data SELECT json_extract_string(data, "$.message") as message, json_extract_string(data, "$.symbols") as symbols, json_extract_string(data, "$.punctuation") as punctuation FROM special_char_demo;

5.3 Troubleshooting Guide

If you encounter Chinese or emoji display issues:

Check field naming: Prefer English field names
Verify JSON format: Ensure the JSON format is correct with no syntax errors
Use standard functions: Consistently use json_extract_* functions
Avoid complex escaping: Minimize the use of complex escape characters in JSON strings
Encoding settings: Ensure client and database character encoding settings are correct

-- Troubleshooting demo query CREATE TABLE IF NOT EXISTS troubleshoot_demo ( id STRING, data JSON ); -- Correct data insertion method INSERT INTO troubleshoot_demo VALUES ('test001', parse_json('{"name": "Test User", "message": "This is a test message🔧"}')); -- Verify that data is correctly stored and queried SELECT id, json_extract_string(data, "$.name") as name, json_extract_string(data, "$.message") as message, LENGTH(json_extract_string(data, "$.name")) as name_length, LENGTH(json_extract_string(data, "$.message")) as message_length FROM troubleshoot_demo;

Chapter 6: Technical Limitations and Best Practices

6.1 JSON Type Limitation Summary (Verified and Revised)

Core limitations discovered based on comprehensive verification:

6.1.1 Confirmed Supported Features

Supported JSON Functions:

parse_json() - JSON string parsing
json_extract_string() - Extract string values
json_extract_int() - Extract integer values
json_extract_double() - Extract floating-point values
json_extract() - General extraction function

Supported Syntax:

CREATE TABLE IF NOT EXISTS
Complete DDL operations (CREATE/ALTER/DROP)
Array access syntax [0], [1]
Nested object access $.path.to.field
Full support for Chinese and emoji content

6.1.2 Confirmed Unsupported Features

Unsupported JSON Functions:

-- The following function has been verified as unsupported: -- json_set() - JSON field update (important limitation)

Unsupported Operations:

-- Create limitation demo table CREATE TABLE IF NOT EXISTS limitation_demo ( id INT, json_data JSON ); -- Insert test data INSERT INTO limitation_demo VALUES (1, parse_json('{"name": "Alice", "score": 85}')), (2, parse_json('{"name": "Bob", "score": 92}')), (3, parse_json('{"name": "Charlie", "score": 78}')); -- The following operations are not supported and will cause errors: -- JSON type does not support direct comparison -- SELECT * FROM limitation_demo WHERE json_data = parse_json('{}'); -- JSON type does not support direct ordering -- SELECT * FROM limitation_demo ORDER BY json_data; -- JSON type does not support direct GROUP BY -- SELECT json_data, COUNT(*) FROM limitation_demo GROUP BY json_data;

6.1.3 Solutions

-- Use json_extract functions for comparison SELECT * FROM limitation_demo WHERE json_extract_string(json_data, "$.name") = 'Alice'; -- Extract fields then sort SELECT id, json_extract_string(json_data, "$.name") as name, json_extract_int(json_data, "$.score") as score FROM limitation_demo ORDER BY json_extract_int(json_data, "$.score") DESC; -- Extract fields then group SELECT CASE WHEN json_extract_int(json_data, "$.score") >= 90 THEN 'excellent' WHEN json_extract_int(json_data, "$.score") >= 80 THEN 'good' ELSE 'average' END as grade, COUNT(*) as count FROM limitation_demo GROUP BY CASE WHEN json_extract_int(json_data, "$.score") >= 90 THEN 'excellent' WHEN json_extract_int(json_data, "$.score") >= 80 THEN 'good' ELSE 'average' END;

6.2 Alternative Strategies for JSON Data Updates

6.2.1 Update Strategy Design Principles

Read-Heavy, Write-Light Principle: JSON is primarily used for read-intensive scenarios
Hybrid Storage Strategy: Separate frequently updated fields
Application-Layer Processing: Complex updates are completed at the application layer
Versioned Management: Important data supports historical versions

6.2.2 Specific Implementation Plans

Approach 1: Complete JSON Reconstruction

-- Suitable for: simple JSON updates, low update frequency scenarios UPDATE table_name SET json_column = parse_json(CONCAT( '{"field1": "', json_extract_string(json_column, "$.field1"), '", ', '"field2": "new_value", ', '"field3": ', json_extract_int(json_column, "$.field3"), '}' )) WHERE conditions;

Approach 2: Hybrid Storage Strategy

-- Recommended for: business scenarios with frequent update requirements CREATE TABLE IF NOT EXISTS hybrid_table ( id INT, frequently_updated_field STRING, -- Relational field status_field INT, -- Relational field stable_json_data JSON, -- JSON field for stable data last_updated TIMESTAMP );

Approach 3: Versioned Storage

-- Suitable for: important data requiring historical version retention CREATE TABLE IF NOT EXISTS versioned_json_data ( record_id STRING, version_number INT, json_data JSON, created_time TIMESTAMP, is_current BOOLEAN DEFAULT TRUE );

6.3 Unified Syntax Recommendations

To avoid issues caused by mixing different approaches, it is recommended to consistently use json_extract functions:

-- Create unified syntax demo table CREATE TABLE IF NOT EXISTS syntax_demo ( order_id STRING, data JSON ); -- Insert test data INSERT INTO syntax_demo VALUES ('ORD001', parse_json('{"customer": {"id": "C001", "name": "Alice", "level": "VIP"}, "payment": {"amount": 1500.00}, "items": [{"qty": 2}]}')), ('ORD002', parse_json('{"customer": {"id": "C002", "name": "Bob", "level": "regular"}, "payment": {"amount": 800.00}, "items": [{"qty": 1}]}')), ('ORD003', parse_json('{"customer": {"id": "C003", "name": "Charlie", "level": "VIP"}, "payment": {"amount": 2200.00}, "items": [{"qty": 3}]}')); -- Recommended unified syntax pattern SELECT order_id, json_extract_string(data, "$.customer.id") as customer_id, json_extract_string(data, "$.customer.name") as customer_name, json_extract_string(data, "$.customer.level") as customer_level, json_extract_double(data, "$.payment.amount") as amount, json_extract_int(data, "$.items[0].qty") as quantity FROM syntax_demo WHERE json_extract_string(data, "$.customer.level") = 'VIP' ORDER BY json_extract_double(data, "$.payment.amount") DESC;

6.4 Type-Safe Handling

6.4.1 Safe Default Value Handling

-- Create type-safety demo table CREATE TABLE IF NOT EXISTS type_safety_demo ( order_id STRING, order_data JSON ); -- Insert data containing missing fields INSERT INTO type_safety_demo VALUES ('ORD001', parse_json('{"customer": {"name": "Alice", "vip_level": "gold"}, "payment": {"discount": 100}}')), ('ORD002', parse_json('{"customer": {"name": "Bob"}, "payment": {}}')), -- Missing vip_level and discount ('ORD003', parse_json('{"customer": {"name": "Charlie", "vip_level": "silver"}, "payment": {"discount": 50}}')); -- Safe default value handling SELECT order_id, json_extract_string(order_data, "$.customer.name") as customer_name, -- Ensure type-consistent default value handling CASE WHEN json_extract_string(order_data, "$.customer.vip_level") IS NOT NULL THEN json_extract_string(order_data, "$.customer.vip_level") ELSE 'regular' END as safe_vip_level, -- Safe handling of numeric types CASE WHEN json_extract_double(order_data, "$.payment.discount") IS NOT NULL THEN json_extract_double(order_data, "$.payment.discount") ELSE 0.0 END as safe_discount, -- Use COALESCE to simplify default value handling COALESCE(json_extract_string(order_data, "$.customer.vip_level"), 'regular') as vip_level_coalesce, COALESCE(json_extract_double(order_data, "$.payment.discount"), 0.0) as discount_coalesce FROM type_safety_demo;

6.5 Performance Optimization Strategies

6.5.1 High-Frequency Field Optimization

-- Create performance optimization demo table CREATE TABLE IF NOT EXISTS perf_orders ( id INT, order_data JSON, created_date DATE DEFAULT CURRENT_DATE() ); -- Insert test data INSERT INTO perf_orders VALUES (1, parse_json('{"customer_id": "C001", "payment_amount": 1500.00, "items": [{"detail": {"description": "Premium Electronics"}}]}'), CURRENT_DATE()), (2, parse_json('{"customer_id": "C002", "payment_amount": 2300.00, "items": [{"detail": {"description": "Home Goods"}}]}'), CURRENT_DATE()), (3, parse_json('{"customer_id": "C003", "payment_amount": 800.00, "items": [{"detail": {"description": "Daily Necessities"}}]}'), CURRENT_DATE()), (4, parse_json('{"customer_id": "C004", "payment_amount": 3200.00, "items": [{"detail": {"description": "Luxury Goods"}}]}'), CURRENT_DATE()); -- Place frequently used fields at the JSON top level to leverage automatic columnar storage optimization SELECT json_extract_string(order_data, "$.customer_id") as customer_id, -- High-frequency field json_extract_double(order_data, "$.payment_amount") as amount, -- High-frequency field json_extract_string(order_data, "$.items[0].detail.description") as desc -- Deep-level field FROM perf_orders WHERE json_extract_string(order_data, "$.customer_id") IS NOT NULL -- Efficient filtering AND json_extract_double(order_data, "$.payment_amount") > 1000; -- Numeric filtering

6.5.2 Batch Extraction Optimization

-- Create batch optimization demo table CREATE TABLE IF NOT EXISTS batch_orders ( id INT, order_data JSON ); -- Insert more test data INSERT INTO batch_orders VALUES (1, parse_json('{"customer": {"id": "C001", "level": "VIP"}, "payment": {"amount": 3500.00}}')), (2, parse_json('{"customer": {"id": "C002", "level": "regular"}, "payment": {"amount": 1200.00}}')), (3, parse_json('{"customer": {"id": "C003", "level": "VIP"}, "payment": {"amount": 4200.00}}')), (4, parse_json('{"customer": {"id": "C004", "level": "diamond"}, "payment": {"amount": 5800.00}}')), (5, parse_json('{"customer": {"id": "C005", "level": "regular"}, "payment": {"amount": 900.00}}')); -- Use CTE to reduce repeated parsing WITH extracted_data AS ( SELECT id as order_id, json_extract_string(order_data, "$.customer.id") as customer_id, json_extract_string(order_data, "$.customer.level") as customer_level, json_extract_double(order_data, "$.payment.amount") as payment_amount FROM batch_orders WHERE json_extract_string(order_data, "$.customer.id") IS NOT NULL ) SELECT customer_level, COUNT(*) as order_count, AVG(payment_amount) as avg_amount, MAX(payment_amount) as max_amount, MIN(payment_amount) as min_amount FROM extracted_data GROUP BY customer_level ORDER BY avg_amount DESC;

Chapter 7: Environment Management and Implementation Recommendations

7.1 Pre-Implementation Checklist

7.1.1 Technical Preparation

Verify character encoding settings (UTF-8)
Test Chinese and emoji data processing
Prepare data migration plan
New: Develop a JSON data update strategy

7.1.2 Syntax Specification Check

Confirm all CREATE statements include IF NOT EXISTS
Verify that json_set/json_insert/json_remove functions are not used
Consistently use json_extract functions instead of direct JSON operations
Verify that JSON type is not used for direct comparison, sorting, or grouping
New: Design a hybrid storage architecture

7.1.3 Development Standards

Establish JSON field naming conventions (recommend English field names)
Consistently use json_extract functions
Establish error handling and default value handling standards
Design performance monitoring metrics
New: Establish alternative solution standards for JSON data updates

7.1.4 Verification Testing

-- Create system verification table CREATE TABLE IF NOT EXISTS system_verification ( test_id STRING, test_data JSON, test_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ); -- Basic functionality verification INSERT INTO system_verification VALUES ('test_001', parse_json('{"basic_test": "Basic functionality test", "emoji_test": "😊👍", "number_test": 123.45}'), CURRENT_TIMESTAMP()), ('test_002', parse_json('{"nested_test": {"level1": {"level2": "Deep nesting test"}}, "array_test": ["Array", "Test", "Items"]}'), CURRENT_TIMESTAMP()); -- Verification query SELECT test_id, json_extract_string(test_data, "$.basic_test") as basic_result, json_extract_string(test_data, "$.emoji_test") as emoji_result, json_extract_double(test_data, "$.number_test") as number_result, json_extract_string(test_data, "$.nested_test.level1.level2") as nested_result, json_extract_string(test_data, "$.array_test[0]") as array_result FROM system_verification ORDER BY test_time DESC;

7.2 Phased Implementation Strategy

Proof of Concept Phase: Select 1-2 key business scenarios for pilot testing
Small-Scale Deployment: Verify performance and stability in non-core business systems
Architecture Optimization Phase: Implement hybrid storage strategy and JSON update alternatives
Full Rollout: Develop a complete migration plan based on pilot experience

-- Implementation phase tracking table CREATE TABLE IF NOT EXISTS implementation_tracking ( phase STRING, milestone STRING, status STRING, completion_date DATE, notes JSON ); -- Record implementation progress INSERT INTO implementation_tracking VALUES ('Phase1', 'Proof of Concept', 'completed', CURRENT_DATE(), parse_json('{"scenarios": ["E-commerce order analysis", "User behavior tracking"], "success_rate": 0.95}')), ('Phase2', 'Small-scale deployment', 'in_progress', NULL, parse_json('{"target_systems": ["Test environment", "Development environment"], "expected_completion": "2024-12-15"}')), ('Phase3', 'Architecture optimization', 'planned', NULL, parse_json('{"focus": ["Hybrid storage design", "JSON update alternatives"], "expected_start": "2024-12-20"}'));

7.3 Monitoring and Operations

-- Create operational monitoring table CREATE TABLE IF NOT EXISTS operational_monitoring ( metric_name STRING, metric_value DOUBLE, metric_metadata JSON, measurement_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ); -- Insert monitoring data example INSERT INTO operational_monitoring VALUES ('json_query_performance', 0.95, parse_json('{"avg_response_time_ms": 145, "query_count": 1000, "success_rate": 0.99}'), CURRENT_TIMESTAMP()), ('storage_efficiency', 0.87, parse_json('{"compression_ratio": 0.72, "storage_saved_gb": 250, "cost_reduction": 0.35}'), CURRENT_TIMESTAMP()), ('json_update_performance', 0.78, parse_json('{"avg_update_time_ms": 280, "update_success_rate": 0.95, "method": "json_reconstruction"}'), CURRENT_TIMESTAMP()); -- Performance monitoring query template SELECT metric_name, metric_value, json_extract_double(metric_metadata, "$.avg_response_time_ms") as avg_response_time, json_extract_int(metric_metadata, "$.query_count") as query_count, json_extract_double(metric_metadata, "$.success_rate") as success_rate, measurement_time FROM operational_monitoring WHERE metric_name = 'json_query_performance' ORDER BY measurement_time DESC;

7.4 Performance Monitoring and Tuning

7.4.1 JSON Query Performance Monitoring

-- Create performance monitoring demo table CREATE TABLE IF NOT EXISTS performance_monitor ( id INT, json_column JSON, created_date DATE DEFAULT CURRENT_DATE() ); -- Insert test data INSERT INTO performance_monitor VALUES (1, parse_json('{"frequently_accessed_field": "value1", "other_data": "test", "nested": {"deep": {"field": "deep_value1"}}}'), CURRENT_DATE()), (2, parse_json('{"frequently_accessed_field": "value2", "other_data": "test", "nested": {"deep": {"field": "deep_value2"}}}'), CURRENT_DATE() - INTERVAL 10 DAYS), (3, parse_json('{"frequently_accessed_field": "value1", "other_data": "test", "nested": {"deep": {"field": "deep_value3"}}}'), CURRENT_DATE() - INTERVAL 2 DAYS), (4, parse_json('{"frequently_accessed_field": "value3", "other_data": "test", "nested": {"deep": {"field": "deep_value4"}}}'), CURRENT_DATE()), (5, parse_json('{"frequently_accessed_field": "value2", "other_data": "test", "nested": {"deep": {"field": "deep_value5"}}}'), CURRENT_DATE() - INTERVAL 5 DAYS); -- Query performance analysis template SELECT 'performance_monitor' as table_name, COUNT(*) as total_records, AVG(LENGTH(CAST(json_column AS STRING))) as avg_json_size, -- High-frequency field distribution analysis COUNT(DISTINCT json_extract_string(json_column, "$.frequently_accessed_field")) as unique_values, -- Data recency analysis COUNT(CASE WHEN created_date >= CURRENT_DATE() - INTERVAL 7 DAYS THEN 1 END) as recent_records, -- Field access complexity analysis COUNT(CASE WHEN json_extract_string(json_column, "$.nested.deep.field") IS NOT NULL THEN 1 END) as deep_nested_fields FROM performance_monitor WHERE json_column IS NOT NULL;

7.4.2 Automatic Optimization Recommendation Generation

-- Create optimization analysis demo table CREATE TABLE IF NOT EXISTS optimization_analysis ( id INT, order_data JSON, created_date DATE DEFAULT CURRENT_DATE() ); -- Insert test data INSERT INTO optimization_analysis VALUES (1, parse_json('{"customer": {"level": "VIP"}, "payment": {"amount": 1500.00}}'), CURRENT_DATE()), (2, parse_json('{"customer": {"level": "regular"}, "payment": {"amount": 800.00}}'), CURRENT_DATE()), (3, parse_json('{"customer": {"level": "VIP"}, "payment": {"amount": 2200.00}}'), CURRENT_DATE()), (4, parse_json('{"customer": {"level": "diamond"}, "payment": {"amount": 5000.00}}'), CURRENT_DATE()), (5, parse_json('{"customer": {"level": "VIP"}, "payment": {"amount": 3200.00}}'), CURRENT_DATE()), (6, parse_json('{"customer": {"level": "regular"}, "payment": {"amount": 600.00}}'), CURRENT_DATE()); -- Automatic optimization recommendations based on usage patterns WITH query_pattern_analysis AS ( SELECT json_extract_string(order_data, "$.customer.level") as access_pattern, COUNT(*) as usage_frequency, AVG(json_extract_double(order_data, "$.payment.amount")) as avg_amount FROM optimization_analysis GROUP BY json_extract_string(order_data, "$.customer.level") ) SELECT access_pattern, usage_frequency, ROUND(avg_amount, 2) as avg_amount, CASE WHEN usage_frequency > 1000 THEN 'recommend_column_optimization' WHEN usage_frequency > 100 THEN 'consider_indexing' ELSE 'current_performance_sufficient' END as optimization_recommendation, CASE WHEN avg_amount > 3000 THEN 'high_value_customer_segment' WHEN avg_amount > 1500 THEN 'medium_value_customer_segment' ELSE 'low_value_customer_segment' END as business_insight FROM query_pattern_analysis ORDER BY usage_frequency DESC;

7.5 Environment Cleanup

-- View all created tables SHOW TABLES LIKE '%json%'; -- Clean up test tables (optional) /* DROP TABLE IF EXISTS orders; DROP TABLE IF EXISTS ecommerce_orders_json; DROP TABLE IF EXISTS promotion_orders_json; DROP TABLE IF EXISTS inventory_orders_json; DROP TABLE IF EXISTS iot_devices_unified; DROP TABLE IF EXISTS financial_risk_analysis; DROP TABLE IF EXISTS customer_behavior_analysis; DROP TABLE IF EXISTS device_maintenance_analysis; DROP TABLE IF EXISTS user_data_chinese; DROP TABLE IF EXISTS social_data_emoji; DROP TABLE IF EXISTS feedback_data_complex; DROP TABLE IF EXISTS user_feedback_full; DROP TABLE IF EXISTS best_practice_demo; DROP TABLE IF EXISTS chinese_field_demo; DROP TABLE IF EXISTS special_char_demo; DROP TABLE IF EXISTS troubleshoot_demo; DROP TABLE IF EXISTS limitation_demo; DROP TABLE IF EXISTS syntax_demo; DROP TABLE IF EXISTS type_safety_demo; DROP TABLE IF EXISTS perf_orders; DROP TABLE IF EXISTS batch_orders; DROP TABLE IF EXISTS performance_monitor; DROP TABLE IF EXISTS optimization_analysis; DROP TABLE IF EXISTS system_verification; DROP TABLE IF EXISTS implementation_tracking; DROP TABLE IF EXISTS operational_monitoring; DROP TABLE IF EXISTS transaction_demo; DROP TABLE IF EXISTS hybrid_storage_demo; */ -- Verify cleanup results SELECT COUNT(*) as remaining_json_tables FROM information_schema.tables WHERE table_name LIKE '%json%';

Summary

Core Value Summary

Singdata Lakehouse JSON data processing capabilities demonstrate the following core values:

Technical Advantages

Native JSON Support: Full compatibility with standard JSON syntax; json_extract function queries are highly efficient
Intelligent Storage Optimization: High-frequency fields are stored in columnar format based on JSON Schema
Real-time Processing Capability: Dynamic tables enable real-time responsiveness
Multilingual Support: Complete support for Chinese, emoji, and other special characters
Enterprise-Grade Features: Complete elastic computing, multi-layer caching, and cluster isolation

Business Value

E-Commerce: Implement personalized recommendation systems, improving conversion effectiveness and user experience
IoT: Establish predictive maintenance systems, reducing equipment failures and maintenance costs
Financial: Build real-time risk control engines, enhancing risk management and compliance capabilities

Implementation Advantages

Quick Start: Based on standard SQL, significantly improving development efficiency
Smooth Migration: No need to restructure existing JSON data; import and use directly
Strong Scalability: Seamless scaling from GB to PB-level data, supporting long-term enterprise development

Key Technical Points

Syntax Specifications

Consistently use json_extract functions: Avoid mixing different access methods
Type Safety: JSON type does not support direct comparison, sorting, or grouping operations
Chinese Support: Full support for Chinese and emoji; recommend using English field names
CREATE Statement Standard: Consistently use IF NOT EXISTS to avoid duplicate creation errors
Important Limitation: json_set function is not supported

JSON Update Strategies

Hybrid Storage Design: Use relational storage for frequently updated fields, JSON for stable data
Application-Layer Processing: Complete complex JSON updates at the application layer before full replacement
Version Control: Retain historical versions for important JSON data changes
Batch Updates: Avoid row-by-row JSON reconstruction; prioritize batch processing

Performance Optimization

Columnar Design for High-Frequency Fields: Can significantly improve query efficiency
Multi-Layer Cache Utilization: Query result cache, metadata cache, and compute cluster local cache work together
Real-time Stream Processing: Dynamic tables enable efficient data aggregation

Action Recommendations

Start Immediately: Select a key business scenario and launch a proof of concept for JSON data processing
Architecture Design: Develop a hybrid storage strategy and clearly define JSON usage boundaries
Team Training: Organize technical team learning of this guide's best practices and syntax specifications
Gradual Expansion: Based on successful pilot experience, develop a comprehensive JSON data strategy
Continuous Optimization: Establish performance monitoring and optimization mechanisms to ensure long-term value realization

Singdata Lakehouse JSON Data Processing - Making complex JSON data processing simple and efficient within clearly defined technical boundaries

This guide provides a complete solution for enterprise JSON data processing. All SQL statements have been verified in Singdata Lakehouse. Each example includes complete table creation and data preparation statements, using IF NOT EXISTS to ensure repeatable execution. By following the best practices in this revised edition, enterprises can fully leverage the business value of JSON data and accelerate their digital transformation journey within clearly defined technical boundaries.

References

JSON Data Type

JSON Functions

Note: This guide is based on test results from the Singdata Lakehouse version as of May 2025. Subsequent versions may introduce changes. Please regularly check the official documentation for the latest information.