JSON Data Processing Guide for Complex Business Scenarios
Guide Overview
This guide demonstrates the complete technical implementation of Singdata Lakehouse JSON functionality, starting from real-world business scenarios and providing professional solutions for JSON data processing.
Important Update Notice: This revised edition is based on comprehensive practical verification, confirming the actual capability boundaries of Singdata Lakehouse JSON functionality. All SQL examples have been tested and verified.
Guide Structure
- Chapter 1: Business Scenario and Challenge Analysis
- Chapter 2: Singdata Lakehouse Solutions
- Chapter 3: Solutions for Core Technical Challenges
- Chapter 4: Deep Technical Implementation and Advanced Optimization
- Chapter 5: Chinese and Special Character Handling Guide
- Chapter 6: Technical Limitations and Best Practices
- Chapter 7: Environment Management and Implementation Recommendations
Chapter 1: Business Scenario and Challenge Analysis
1.1 JSON Data Processing Needs of Modern Enterprises
In the digital era, JSON has become the standard format for enterprise data exchange:
- API Economy: Under microservice architecture, a vast amount of data exchange adopts the JSON format
- Real-time Data Streams: IoT devices, user behaviors, and system logs generate massive amounts of JSON data
- Diverse Data Sources: Social media, e-commerce platforms, and financial transactions produce complex nested JSON
- Internationalization Challenges: JSON data contains multilingual content including Chinese, emoji, and more
1.2 Core Business Scenario Analysis
1.2.1 E-Commerce Scenario: Real-time Analysis of Complex Order Data
Business Scenario Description:
An e-commerce platform processes tens of thousands of orders daily, with each order containing complex nested structures such as customer information, product details, payment information, and logistics data:
{
"order_id": "ORD-2024-001",
"customer": {
"id": 10001,
"name": "Alice",
"profile": {
"level": "VIP",
"preferences": ["electronics", "gadgets", "books"],
"purchase_history": [
{"category": "electronics", "frequency": 15},
{"category": "books", "frequency": 8}
]
},
"address": {"city": "Beijing", "district": "Chaoyang District"}
},
"items": [
{
"sku": "PHONE-001",
"name": "iPhone 15",
"price": 6999.00,
"qty": 1,
"attributes": {"color": "Blue", "storage": "256GB"},
"promotion": {"type": "vip_discount", "discount": 500}
}
],
"payment": {
"method": "alipay",
"amount": 7197.00,
"promotion_applied": {"code": "VIP500", "discount": 500}
},
"logistics": {
"shipping_method": "express",
"estimated_delivery": "2024-12-03",
"tracking_events": [
{"status": "ordered", "timestamp": "2024-12-01T10:00:00"},
{"status": "paid", "timestamp": "2024-12-01T10:05:00"}
]
},
"user_feedback": {
"rating": "5 Stars",
"comment": "Great product! Fast delivery!"
}
}
Core Technical Challenges:
- Personalized Recommendation Challenge: Need real-time analysis of customer
preferences arrays and purchase_history nested structures
- Dynamic Promotion Strategy: Each item's
promotion information has an unfixed structure; different promotion types have completely different JSON structures
- Real-time Inventory Alerts: Need to extract all SKUs from the order's
items array for real-time sales statistics
- Customer Behavior Analysis: Analyze the complete path from browsing to order placement
- Multilingual Content Handling: Customer names, addresses, and reviews contain multiple languages and emoji symbols
1.2.2 IoT Scenario: Real-time Processing of Complex Industrial Equipment Monitoring Data
Business Scenario Description:
A manufacturing enterprise has thousands of production devices, each reporting complex monitoring data every minute:
{
"device_id": "MACHINE-001",
"timestamp": "2024-12-01T10:30:00Z",
"location": {
"building": "Factory-A",
"floor": 2,
"production_line": "Production-Line-1"
},
"sensors": {
"temperature": {"value": 85.2, "unit": "celsius", "status": "normal"},
"vibration": {
"x_axis": {"value": 0.02, "threshold": 0.05, "status": "normal"},
"y_axis": {"value": 0.03, "threshold": 0.05, "status": "normal"},
"z_axis": {"value": 0.08, "threshold": 0.05, "status": "warning"}
}
},
"operational_status": {
"machine_state": "running",
"production_rate": 85.5,
"efficiency": 92.3,
"maintenance": {
"last_service": "2024-11-15T08:00:00Z",
"next_scheduled": "2025-02-15T08:00:00Z",
"parts_status": [
{"part": "Motor Bearing", "condition": "Good", "life_remaining": 75},
{"part": "Conveyor Belt", "condition": "Fair", "life_remaining": 45}
]
}
},
"alerts": {
"message": "Device running normally",
"level": "info"
}
}
Core Technical Challenges:
- Heterogeneous Device Data Fusion: Different device models have completely different
sensors structures
- Real-time Anomaly Detection: Need to monitor multiple nested metrics simultaneously
- Complex Maintenance Planning: Each part in the
maintenance.parts_status array has a different lifespan
- Multi-dimensional Production Analysis: Need to analyze by production line, floor, device type, and other dimensions
1.2.3 Financial Scenario: Real-time Decision Making for Complex Transaction Risk Control
Business Scenario Description:
An internet bank needs to perform real-time risk assessment on every transaction:
{
"transaction_id": "TXN-20241201-001",
"timestamp": "2024-12-01T14:35:22Z",
"user": {
"id": 30001,
"name": "Bob",
"profile": {
"age": 28,
"credit_score": 720,
"account_level": "Gold"
},
"behavior_pattern": {
"typical_amount_range": {"min": 500, "max": 3000},
"location_history": [
{"city": "Beijing", "frequency": 85},
{"city": "Shanghai", "frequency": 10}
]
}
},
"transaction": {
"amount": 8500.00,
"currency": "CNY",
"type": "transfer",
"merchant": {"name": "Luxury Store", "category": "retail"}
},
"device": {
"type": "mobile",
"fingerprint": "ABC123XYZ789",
"location": {"city": "Shanghai", "ip": "121.xxx.xxx.xxx"},
"is_new_device": true,
"trusted_score": 0.3
},
"risk_analysis": {
"automated_factors": [
{"factor": "amount_anomaly", "score": 0.8, "reason": "Amount exceeds normal range"},
{"factor": "location_mismatch", "score": 0.6, "reason": "Transaction in different city"}
],
"final_decision": {
"action": "manual_review",
"confidence": 0.85,
"reason": "Multiple risk factors detected"
}
}
}
Core Technical Challenges:
- Complex Risk Factor Real-time Calculation: Need to simultaneously analyze risk factors from multiple nested objects
- Fast Decision Requirements: Risk control decisions must be completed within a short time frame
- Dynamic Rule Engine: Risk control rules change frequently, and data structures evolve continuously
- Compliance Audit Trail: Need to retain complete decision processes, supporting complex audit queries
1.3 Limitations of Traditional Approaches
Technical Challenges:
- Performance Bottleneck: Traditional databases have limited JSON query performance
- High Storage Cost: JSON data is redundant with low storage efficiency
- Development Complexity: Complex ETL transformations are required
- Insufficient Real-time Capability: Batch processing mode cannot meet real-time requirements
- Scalability Limitations: Single-machine architecture struggles to handle large-scale data
- Multilingual Support: Chinese, emoji, and other special characters are difficult to handle
Chapter 2: Singdata Lakehouse Solutions
2.1 Core Technical Advantages
2.1.1 Native JSON Support
Solution Features:
- Native Data Type Support: JSON is a first-class citizen, no pre-defined schema required
- Dynamic Structure Adaptation: New fields can be added without downtime or table structure modification
- Zero Transformation Cost: JSON data is stored and queried directly, no ETL preprocessing needed
- Multilingual Compatibility: Full support for Chinese and emoji content
-- Create sample table
CREATE TABLE IF NOT EXISTS orders (
id INT,
order_data JSON
);
-- Directly store JSON data with different structures (including Chinese)
INSERT INTO orders VALUES
(1, parse_json('{"customer": {"name": "Alice"}, "promotion": {"type": "vip_discount", "amount": 500}}')),
(2, parse_json('{"customer": {"name": "Bob"}, "promotion": {"type": "flash_sale", "percentage": 0.2}}')),
(3, parse_json('{"customer": {"name": "Charlie"}, "promotion": {"type": "bundle_deal", "required_items": 3}}'));
-- Query promotion data with different structures
SELECT
id,
json_extract_string(order_data, "$.customer.name") as customer_name,
json_extract_string(order_data, "$.promotion.type") as promotion_type
FROM orders;
2.1.2 Intelligent Columnar Storage
Solution Features:
- Automatic Field Optimization: Frequently accessed JSON fields are automatically converted to columnar storage
- Path Index Optimization: Efficient indexes are built for commonly used JSON paths
- Column Pruning Technology: Only reads the JSON fields needed by the query, reducing I/O overhead
-- Use the previously created orders table for high-frequency field queries
SELECT
json_extract_string(order_data, "$.customer.name") as customer_name,
json_extract_double(order_data, "$.promotion.amount") as discount_amount
FROM orders
WHERE json_extract_string(order_data, "$.promotion.type") = 'vip_discount';
Solution Features:
- Parallel Processing Framework: Complex JSON queries are automatically parallelized
- In-Memory Computing Optimization: Hot JSON data resides in memory
- Intelligent Caching Mechanism: Results for frequently accessed JSON paths are cached
2.1.4 Real-time Stream Processing Integration
Solution Features:
- TABLE STREAM Technology: Captures JSON data changes for incremental processing
- Dynamic Table Auto-Refresh: Real-time aggregation of complex JSON data
- Unified Stream-Batch Processing: Unified handling of real-time streams and historical batch data
-- Create IoT sensor data table
CREATE TABLE IF NOT EXISTS iot_sensors (
device_id STRING,
device_data JSON,
timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
-- Insert sample IoT data
INSERT INTO iot_sensors VALUES
('DEVICE-001', parse_json('{"device_id": "DEVICE-001", "sensors": {"temperature": {"value": 86.5}}, "location": "Factory A"}'), CURRENT_TIMESTAMP());
-- Create alert events table
CREATE TABLE IF NOT EXISTS alert_events (
device_id STRING,
temp_value DOUBLE,
alert_time TIMESTAMP
);
-- Real-time anomaly detection (manual simulation)
INSERT INTO alert_events
SELECT
json_extract_string(device_data, "$.device_id") as device_id,
json_extract_double(device_data, "$.sensors.temperature.value") as temp_value,
CURRENT_TIMESTAMP() as alert_time
FROM iot_sensors
WHERE json_extract_double(device_data, "$.sensors.temperature.value") > 85;
Chapter 3: Solutions for Core Technical Challenges
This chapter demonstrates in detail how Singdata Lakehouse specifically addresses each core technical challenge raised in Chapter 1. Every solution directly corresponds to a specific business challenge.
3.1 E-Commerce Scenario: Solutions to Technical Challenges
3.1.1 Challenge 1: Nested Array Query for Personalized Recommendations
Technical Challenge Description:
Real-time analysis of customer preferences arrays and purchase_history nested structures is required. Traditional relational databases need complex multi-table JOINs or ETL preprocessing and cannot meet the response time requirements for real-time personalized recommendations.
Singdata Lakehouse Solution:
-- Create e-commerce order analysis table with native JSON support
CREATE TABLE IF NOT EXISTS ecommerce_orders_json (
order_id STRING,
order_data JSON,
created_date DATE DEFAULT CURRENT_DATE()
);
-- Directly store complex nested structures, no preprocessing needed
INSERT INTO ecommerce_orders_json VALUES
('ORD-2024-001', parse_json('{
"customer": {
"id": 10001,
"name": "Alice",
"level": "diamond",
"profile": {
"preferences": ["electronics", "gadgets", "books"],
"purchase_history": [
{"category": "electronics", "frequency": 15},
{"category": "books", "frequency": 8}
]
}
},
"items": [{"sku": "PHONE-001", "category": "electronics", "price": 6999.00}],
"payment": {"amount": 7098.00}
}'), CURRENT_DATE()),
('ORD-2024-002', parse_json('{
"customer": {
"id": 10002,
"name": "Bob",
"level": "diamond",
"profile": {
"preferences": ["books", "sports", "electronics"],
"purchase_history": [
{"category": "books", "frequency": 20},
{"category": "sports", "frequency": 12}
]
}
},
"items": [{"sku": "BOOK-001", "category": "books", "price": 299.00}],
"payment": {"amount": 299.00}
}'), CURRENT_DATE()),
('ORD-2024-003', parse_json('{
"customer": {
"id": 10003,
"name": "Charlie",
"level": "diamond",
"profile": {
"preferences": ["gadgets", "electronics", "fashion"],
"purchase_history": [
{"category": "gadgets", "frequency": 25},
{"category": "electronics", "frequency": 18}
]
}
},
"items": [{"sku": "GADGET-001", "category": "gadgets", "price": 1599.00}],
"payment": {"amount": 1599.00}
}'), CURRENT_DATE());
-- Direct access to nested arrays, no JOIN needed
SELECT
json_extract_string(order_data, "$.customer.id") as customer_id,
json_extract_string(order_data, "$.customer.name") as customer_name,
-- Directly extract the first element of the preferences array (primary preference)
json_extract_string(order_data, "$.customer.profile.preferences[0]") as primary_preference,
-- Directly extract the most frequent category from purchase history
json_extract_int(order_data, "$.customer.profile.purchase_history[0].frequency") as top_frequency,
-- Real-time customer value calculation
json_extract_double(order_data, "$.payment.amount") as order_value
FROM ecommerce_orders_json
WHERE json_extract_string(order_data, "$.customer.level") = 'diamond'
ORDER BY json_extract_double(order_data, "$.payment.amount") DESC;
Technical Value:
- Zero ETL Processing: Complex nested data is stored and queried directly
- Native Array Access:
[0] syntax directly accesses array elements without table splitting
- Real-time Aggregation: Multi-dimensional customer profile analysis completed in a single query
- Automatic Optimization: High-frequency fields are automatically stored in columnar format, improving query efficiency
Technical Challenge Description:
Different promotion types (VIP discount, flash sale, bundle deal) have completely different JSON structures. Traditional fixed schemas cannot adapt and require pre-defining all possible promotion fields, resulting in high maintenance costs.
Singdata Lakehouse Dynamic Schema Solution:
-- Create promotion data table
CREATE TABLE IF NOT EXISTS promotion_orders_json (
order_id STRING,
order_data JSON,
created_date DATE DEFAULT CURRENT_DATE()
);
-- Store promotion data with different structures in the same table
INSERT INTO promotion_orders_json VALUES
-- VIP discount: fixed amount structure
('ORD-VIP-001', parse_json('{"customer": {"name": "Alice"}, "promotion": {"type": "vip_discount", "amount": 500}}'), CURRENT_DATE()),
-- Flash sale: percentage structure
('ORD-FLASH-001', parse_json('{"customer": {"name": "Bob"}, "promotion": {"type": "flash_sale", "percentage": 0.2}}'), CURRENT_DATE()),
-- Bundle deal: combined item structure
('ORD-BUNDLE-001', parse_json('{"customer": {"name": "Charlie"}, "promotion": {"type": "bundle_deal", "required_items": 3, "discount_per_item": 50}}'), CURRENT_DATE()),
-- Coupon: code structure
('ORD-COUPON-001', parse_json('{"customer": {"name": "Diana"}, "promotion": {"type": "coupon", "code": "SAVE20", "discount": 200}}'), CURRENT_DATE());
-- Adaptive query: one SQL handles all promotion types
SELECT
order_id,
json_extract_string(order_data, "$.customer.name") as customer_name,
json_extract_string(order_data, "$.promotion.type") as promotion_type,
-- Automatically extract corresponding fields based on promotion type
CASE WHEN json_extract_string(order_data, "$.promotion.type") = 'vip_discount'
THEN json_extract_double(order_data, "$.promotion.amount")
WHEN json_extract_string(order_data, "$.promotion.type") = 'coupon'
THEN json_extract_double(order_data, "$.promotion.discount")
ELSE 0
END as discount_amount,
CASE WHEN json_extract_string(order_data, "$.promotion.type") = 'flash_sale'
THEN json_extract_double(order_data, "$.promotion.percentage")
ELSE 0
END as discount_percentage,
CASE WHEN json_extract_string(order_data, "$.promotion.type") = 'bundle_deal'
THEN json_extract_int(order_data, "$.promotion.required_items")
ELSE 0
END as bundle_items,
CASE WHEN json_extract_string(order_data, "$.promotion.type") = 'coupon'
THEN json_extract_string(order_data, "$.promotion.code")
ELSE NULL
END as coupon_code
FROM promotion_orders_json
WHERE json_extract_string(order_data, "$.promotion.type") IS NOT NULL;
Technical Value:
- Dynamic Schema Adaptation: New promotion types are automatically supported with zero downtime
- Structure-Agnostic Queries: A single SQL handles multiple data structures
- Business Agility: Promotion rule changes do not require database schema modifications
3.1.3 Challenge 3: Array Aggregation for Real-time Inventory Alerts
Technical Challenge Description:
All SKUs need to be extracted from the order's items array and sales statistics calculated in real time. Traditional approaches require complex triggers or scheduled ETL jobs, with high latency and resource consumption.
Singdata Lakehouse Real-time Stream Processing Solution:
-- Create inventory order data table
CREATE TABLE IF NOT EXISTS inventory_orders_json (
order_id STRING,
order_data JSON,
created_date DATE DEFAULT CURRENT_DATE()
);
-- Insert order data containing SKUs and quantities
INSERT INTO inventory_orders_json VALUES
('ORD-INV-001', parse_json('{"items": [{"sku": "PHONE-001", "qty": 2}], "payment": {"amount": 13998.00}}'), CURRENT_DATE()),
('ORD-INV-002', parse_json('{"items": [{"sku": "BOOK-001", "qty": 5}], "payment": {"amount": 1495.00}}'), CURRENT_DATE()),
('ORD-INV-003', parse_json('{"items": [{"sku": "PHONE-001", "qty": 1}], "payment": {"amount": 6999.00}}'), CURRENT_DATE()),
('ORD-INV-004', parse_json('{"items": [{"sku": "GADGET-001", "qty": 3}], "payment": {"amount": 4797.00}}'), CURRENT_DATE()),
('ORD-INV-005', parse_json('{"items": [{"sku": "BOOK-001", "qty": 2}], "payment": {"amount": 598.00}}'), CURRENT_DATE());
-- Dynamic table for real-time inventory statistics with automatic refresh
CREATE OR REPLACE DYNAMIC TABLE IF NOT EXISTS real_time_inventory
REFRESH INTERVAL 1 MINUTES
AS SELECT
json_extract_string(order_data, "$.items[0].sku") as sku,
SUM(json_extract_int(order_data, "$.items[0].qty")) as total_sold,
COUNT(*) as order_count,
MAX(created_date) as last_order_date,
-- Real-time inventory alert threshold calculation
CASE
WHEN SUM(json_extract_int(order_data, "$.items[0].qty")) > 5 THEN 'high_demand'
WHEN SUM(json_extract_int(order_data, "$.items[0].qty")) > 2 THEN 'medium_demand'
ELSE 'low_demand'
END as demand_level
FROM inventory_orders_json
WHERE json_extract_string(order_data, "$.items[0].sku") IS NOT NULL
GROUP BY json_extract_string(order_data, "$.items[0].sku");
-- Real-time query for inventory alerts
SELECT * FROM real_time_inventory
WHERE demand_level = 'high_demand'
ORDER BY total_sold DESC;
Technical Value:
- Unified Stream-Batch: Dynamic tables automatically aggregate data changes
- Real-time Aggregation: No need to manually maintain triggers and aggregation logic
- Zero Maintenance Cost: The system automatically handles complex aggregation computations
3.2 IoT Scenario: Unified Processing of Complex Device Data
3.2.1 Challenge 1: Unified Querying of Heterogeneous Device Data
Technical Challenge Description:
Different device models have completely different sensors structures: old devices only have temperature, while new devices have dozens of sensors. Traditional approaches require creating different table structures for each device type, resulting in severe data silos.
Singdata Lakehouse Heterogeneous Data Unified Solution:
-- Create IoT device monitoring table supporting any device structure
CREATE TABLE IF NOT EXISTS iot_devices_unified (
device_id STRING,
sensor_data JSON,
collected_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
-- Store device data with different structures in the same table
INSERT INTO iot_devices_unified VALUES
-- Basic sensor: temperature only
('BASIC-001', parse_json('{"device_type": "basic", "temp": 22.1, "battery": 85}'), CURRENT_TIMESTAMP()),
-- Standard sensor: temperature + humidity
('STANDARD-001', parse_json('{"device_type": "standard", "readings": {"temperature": 24.5, "humidity": 58.2}}'), CURRENT_TIMESTAMP()),
-- Advanced sensor: complete multi-dimensional data
('ADVANCED-001', parse_json('{"device_type": "advanced", "sensors": {"temperature": {"value": 85.2}, "vibration": {"x_axis": {"value": 0.02}}}}'), CURRENT_TIMESTAMP()),
-- Industrial sensor: most complex structure
('INDUSTRIAL-001', parse_json('{"device_type": "industrial", "sensors": {"temperature": {"value": 78.5}, "vibration": {"x_axis": {"value": 0.01}, "y_axis": {"value": 0.03}, "z_axis": {"value": 0.02}}, "pressure": {"value": 1013.25}}}'), CURRENT_TIMESTAMP());
-- Unified query interface, automatically adapting to different device structures
SELECT
device_id,
json_extract_string(sensor_data, "$.device_type") as device_type,
-- Compatible with multiple temperature field structures
COALESCE(
json_extract_double(sensor_data, "$.sensors.temperature.value"), -- Advanced device
json_extract_double(sensor_data, "$.readings.temperature"), -- Standard device
json_extract_double(sensor_data, "$.temp") -- Basic device
) as temperature,
-- Optional advanced sensor data
json_extract_double(sensor_data, "$.sensors.vibration.x_axis.value") as vibration_x,
json_extract_double(sensor_data, "$.sensors.vibration.y_axis.value") as vibration_y,
json_extract_double(sensor_data, "$.sensors.vibration.z_axis.value") as vibration_z,
json_extract_double(sensor_data, "$.sensors.pressure.value") as pressure,
-- Automatic device type identification
CASE
WHEN json_extract_string(sensor_data, "$.sensors.vibration") IS NOT NULL THEN 'high_end_device'
WHEN json_extract_string(sensor_data, "$.readings.humidity") IS NOT NULL THEN 'standard_device'
ELSE 'basic_device'
END as inferred_device_category
FROM iot_devices_unified
ORDER BY collected_time DESC;
Technical Value:
- Device Plug-and-Play: New device types require no schema changes
- Unified Query Interface: A single SQL supports all device types
- Intelligent Field Mapping: COALESCE automatically adapts to different field structures
3.3 Financial Scenario: Real-time Decision Making for Complex Risk Control
3.3.1 Challenge 1: Complex Risk Factor Real-time Calculation
Technical Challenge Description:
Multiple nested objects such as user behavior patterns, device trust scores, and geographic locations must be analyzed simultaneously as risk factors. Traditional approaches require multiple database queries and complex business logic calculations, unable to meet fast decision-making requirements.
Singdata Lakehouse Multi-dimensional Risk Factor Parallel Calculation Solution:
-- Create financial risk control data table
CREATE TABLE IF NOT EXISTS financial_risk_analysis (
transaction_id STRING,
risk_data JSON,
analysis_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
-- Insert complex risk control test data
INSERT INTO financial_risk_analysis VALUES
('TXN-001', parse_json('{
"user": {
"id": "30001",
"behavior_pattern": {
"typical_amount_range": {"min": 500, "max": 3000},
"location_history": [{"city": "Beijing", "frequency": 85}]
}
},
"transaction": {"amount": 8500.00},
"device": {"is_new_device": true, "trusted_score": 0.3, "location": {"city": "Shanghai"}}
}'), CURRENT_TIMESTAMP()),
('TXN-002', parse_json('{
"user": {
"id": "30002",
"behavior_pattern": {
"typical_amount_range": {"min": 1000, "max": 5000},
"location_history": [{"city": "Shanghai", "frequency": 90}]
}
},
"transaction": {"amount": 2500.00},
"device": {"is_new_device": false, "trusted_score": 0.9, "location": {"city": "Shanghai"}}
}'), CURRENT_TIMESTAMP()),
('TXN-003', parse_json('{
"user": {
"id": "30003",
"behavior_pattern": {
"typical_amount_range": {"min": 200, "max": 1500},
"location_history": [{"city": "Shenzhen", "frequency": 95}]
}
},
"transaction": {"amount": 4500.00},
"device": {"is_new_device": true, "trusted_score": 0.2, "location": {"city": "Beijing"}}
}'), CURRENT_TIMESTAMP());
-- Complete complex multi-dimensional risk assessment in a single query
SELECT
transaction_id,
json_extract_string(risk_data, "$.user.id") as user_id,
-- Real-time amount anomaly risk calculation
CASE
WHEN json_extract_double(risk_data, "$.transaction.amount") >
json_extract_double(risk_data, "$.user.behavior_pattern.typical_amount_range.max") * 2
THEN 0.8 -- Exceeds normal amount by 2x: high risk
WHEN json_extract_double(risk_data, "$.transaction.amount") >
json_extract_double(risk_data, "$.user.behavior_pattern.typical_amount_range.max") * 1.5
THEN 0.5 -- Exceeds normal amount by 1.5x: medium risk
ELSE 0.2 -- Within normal range: low risk
END as amount_risk_score,
-- Real-time device trust risk calculation
CASE
WHEN json_extract_string(risk_data, "$.device.is_new_device") = 'true'
THEN 1.0 - json_extract_double(risk_data, "$.device.trusted_score")
ELSE 0.1
END as device_risk_score,
-- Real-time geographic location risk calculation
CASE
WHEN json_extract_string(risk_data, "$.device.location.city") !=
json_extract_string(risk_data, "$.user.behavior_pattern.location_history[0].city")
THEN 0.6 -- Transaction in different city: medium risk
ELSE 0.1 -- Usual location: low risk
END as location_risk_score,
-- Comprehensive risk score: weighted algorithm calculated in one pass
((CASE
WHEN json_extract_double(risk_data, "$.transaction.amount") >
json_extract_double(risk_data, "$.user.behavior_pattern.typical_amount_range.max") * 2
THEN 0.8 ELSE 0.2 END) * 0.4 + -- Amount risk weight 40%
(CASE
WHEN json_extract_string(risk_data, "$.device.is_new_device") = 'true'
THEN 1.0 - json_extract_double(risk_data, "$.device.trusted_score")
ELSE 0.1 END) * 0.3 + -- Device risk weight 30%
(CASE
WHEN json_extract_string(risk_data, "$.device.location.city") !=
json_extract_string(risk_data, "$.user.behavior_pattern.location_history[0].city")
THEN 0.6 ELSE 0.1 END) * 0.3 -- Location risk weight 30%
) as final_risk_score,
-- Real-time decision recommendation
CASE
WHEN ((CASE
WHEN json_extract_double(risk_data, "$.transaction.amount") >
json_extract_double(risk_data, "$.user.behavior_pattern.typical_amount_range.max") * 2
THEN 0.8 ELSE 0.2 END) * 0.4 +
(CASE
WHEN json_extract_string(risk_data, "$.device.is_new_device") = 'true'
THEN 1.0 - json_extract_double(risk_data, "$.device.trusted_score")
ELSE 0.1 END) * 0.3 +
(CASE
WHEN json_extract_string(risk_data, "$.device.location.city") !=
json_extract_string(risk_data, "$.user.behavior_pattern.location_history[0].city")
THEN 0.6 ELSE 0.1 END) * 0.3) > 0.7 THEN 'REJECT'
WHEN ((CASE
WHEN json_extract_double(risk_data, "$.transaction.amount") >
json_extract_double(risk_data, "$.user.behavior_pattern.typical_amount_range.max") * 2
THEN 0.8 ELSE 0.2 END) * 0.4 +
(CASE
WHEN json_extract_string(risk_data, "$.device.is_new_device") = 'true'
THEN 1.0 - json_extract_double(risk_data, "$.device.trusted_score")
ELSE 0.1 END) * 0.3 +
(CASE
WHEN json_extract_string(risk_data, "$.device.location.city") !=
json_extract_string(risk_data, "$.user.behavior_pattern.location_history[0].city")
THEN 0.6 ELSE 0.1 END) * 0.3) > 0.4 THEN 'MANUAL_REVIEW'
ELSE 'APPROVE'
END as recommended_action
FROM financial_risk_analysis
ORDER BY final_risk_score DESC;
Technical Value:
- Multi-dimensional Analysis in a Single Query: All risk factor calculations completed in one SQL
- Parallel Access to Nested Objects: Simultaneously extract multiple deeply nested fields
- Fast Decision Support: Significantly reduces risk control decision processing time
Chapter 4: Deep Technical Implementation and Advanced Optimization
This chapter demonstrates the advanced technical features of Singdata Lakehouse JSON processing in depth. Each technical implementation provides a deep solution targeting the specific technical challenges raised in Chapter 1.
4.1 Time-series Array Processing for the "Customer Behavior Analysis" Challenge
4.1.1 Technical Challenge Review
Original Challenge: Analyzing the complete path from browsing to order placement, with data stored in the logistics.tracking_events time-series array. Traditional approaches require splitting array data into multiple rows for storage, making queries complex and performance-limited.
4.1.2 Singdata Lakehouse Time-series Array Processing Implementation
Core Technical Implementation:
-- Create customer behavior analysis table
CREATE TABLE IF NOT EXISTS customer_behavior_analysis (
order_id STRING,
behavior_data JSON,
created_date DATE DEFAULT CURRENT_DATE()
);
-- Insert time-series behavior data
INSERT INTO customer_behavior_analysis VALUES
('ORD-BEHAVIOR-001', parse_json('{
"customer": {"id": 10001, "name": "Alice"},
"items": [{"sku": "PHONE-001", "price": 6999.00}],
"behavior_tracking": {
"source": "search_engine",
"conversion_funnel": [
{"event": "product_view", "timestamp": "2024-12-01T09:00:00", "duration": 120},
{"event": "add_to_cart", "timestamp": "2024-12-01T09:02:00", "duration": 30},
{"event": "checkout_start", "timestamp": "2024-12-01T09:02:30", "duration": 180},
{"event": "payment_complete", "timestamp": "2024-12-01T09:05:30", "duration": 45}
]
},
"logistics": {
"tracking_events": [
{"status": "ordered", "timestamp": "2024-12-01T09:05:30", "location": "Beijing"},
{"status": "paid", "timestamp": "2024-12-01T09:06:00", "location": "Beijing"},
{"status": "shipped", "timestamp": "2024-12-01T14:00:00", "location": "Beijing Warehouse"},
{"status": "delivered", "timestamp": "2024-12-02T10:00:00", "location": "Chaoyang District"}
]
}
}'), CURRENT_DATE()),
('ORD-BEHAVIOR-002', parse_json('{
"customer": {"id": 10002, "name": "Bob"},
"items": [{"sku": "BOOK-001", "price": 299.00}],
"behavior_tracking": {
"source": "direct_visit",
"conversion_funnel": [
{"event": "product_view", "timestamp": "2024-12-01T10:00:00", "duration": 200},
{"event": "add_to_cart", "timestamp": "2024-12-01T10:03:20", "duration": 45},
{"event": "checkout_start", "timestamp": "2024-12-01T10:04:05", "duration": 90},
{"event": "payment_complete", "timestamp": "2024-12-01T10:05:35", "duration": 30}
]
},
"logistics": {
"tracking_events": [
{"status": "ordered", "timestamp": "2024-12-01T10:05:35", "location": "Shanghai"},
{"status": "paid", "timestamp": "2024-12-01T10:06:00", "location": "Shanghai"},
{"status": "shipped", "timestamp": "2024-12-01T16:00:00", "location": "Shanghai Warehouse"},
{"status": "delivered", "timestamp": "2024-12-02T14:00:00", "location": "Pudong New Area"}
]
}
}'), CURRENT_DATE());
-- Complex time-series analysis: complete behavior path analysis in a single query
SELECT
order_id,
json_extract_string(behavior_data, "$.customer.id") as customer_id,
json_extract_string(behavior_data, "$.customer.name") as customer_name,
-- Extract the complete conversion funnel path
json_extract_string(behavior_data, "$.behavior_tracking.conversion_funnel[0].event") as funnel_step1,
json_extract_string(behavior_data, "$.behavior_tracking.conversion_funnel[1].event") as funnel_step2,
json_extract_string(behavior_data, "$.behavior_tracking.conversion_funnel[2].event") as funnel_step3,
json_extract_string(behavior_data, "$.behavior_tracking.conversion_funnel[3].event") as funnel_step4,
-- Calculate duration at each stage
json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[0].duration") as view_duration,
json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[1].duration") as cart_duration,
json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[2].duration") as checkout_duration,
-- Logistics status time-series analysis
json_extract_string(behavior_data, "$.logistics.tracking_events[0].status") as initial_status,
json_extract_string(behavior_data, "$.logistics.tracking_events[3].status") as final_status,
-- Real-time conversion efficiency calculation
CASE
WHEN json_extract_string(behavior_data, "$.logistics.tracking_events[1].status") = 'paid'
THEN 'successful_conversion'
ELSE 'incomplete_conversion'
END as conversion_result,
-- Geographic location change tracking
json_extract_string(behavior_data, "$.logistics.tracking_events[0].location") as order_location,
json_extract_string(behavior_data, "$.logistics.tracking_events[3].location") as delivery_location,
-- Conversion efficiency analysis
(json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[0].duration") +
json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[1].duration") +
json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[2].duration") +
json_extract_int(behavior_data, "$.behavior_tracking.conversion_funnel[3].duration")) as total_conversion_time
FROM customer_behavior_analysis
WHERE json_extract_string(behavior_data, "$.behavior_tracking.conversion_funnel") IS NOT NULL
ORDER BY total_conversion_time ASC;
Technical Value:
- Zero Data Restructuring: Time-series arrays are analyzed directly without splitting storage
- Single Query Full Chain: The complete path from browsing to delivery is obtained in one go
- Real-time Conversion Analysis: Quickly calculates conversion rates and user behavior patterns
4.2 Dynamic Array Computation for the "Complex Maintenance Planning" Challenge
4.2.1 Technical Challenge Review
Original Challenge: Each part in the maintenance.parts_status array has a different lifespan, requiring dynamic calculation of maintenance priorities. Traditional approaches need complex stored procedures and scheduled calculations.
4.2.2 Singdata Lakehouse Dynamic Array Calculation Implementation
Core Technical Implementation:
-- Create device maintenance analysis table
CREATE TABLE IF NOT EXISTS device_maintenance_analysis (
device_id STRING,
maintenance_data JSON,
collected_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
-- Store complex part status data
INSERT INTO device_maintenance_analysis VALUES
('MACHINE-COMPLEX-001', parse_json('{
"device_info": {"type": "production_machine", "location": "Factory A - Production Line 1"},
"operational_status": {
"machine_state": "running",
"production_rate": 85.5,
"maintenance": {
"last_service": "2024-11-15T08:00:00Z",
"next_scheduled": "2025-02-15T08:00:00Z",
"parts_status": [
{"part": "Main Motor", "condition": "Good", "life_remaining": 75, "cost": 50000, "priority": 1},
{"part": "Conveyor Belt", "condition": "Fair", "life_remaining": 45, "cost": 8000, "priority": 2},
{"part": "Hydraulic System", "condition": "Good", "life_remaining": 30, "cost": 25000, "priority": 1},
{"part": "Control Panel", "condition": "Excellent", "life_remaining": 90, "cost": 15000, "priority": 3},
{"part": "Cooling System", "condition": "Needs Attention", "life_remaining": 20, "cost": 12000, "priority": 1}
]
}
}
}'), CURRENT_TIMESTAMP()),
('MACHINE-COMPLEX-002', parse_json('{
"device_info": {"type": "assembly_machine", "location": "Factory B - Production Line 2"},
"operational_status": {
"machine_state": "running",
"production_rate": 92.3,
"maintenance": {
"last_service": "2024-10-20T08:00:00Z",
"next_scheduled": "2025-01-20T08:00:00Z",
"parts_status": [
{"part": "Assembly Arm", "condition": "Good", "life_remaining": 85, "cost": 35000, "priority": 1},
{"part": "Sensor", "condition": "Fair", "life_remaining": 55, "cost": 5000, "priority": 2},
{"part": "Circuit Board", "condition": "Excellent", "life_remaining": 95, "cost": 8000, "priority": 2},
{"part": "Motor", "condition": "Needs Attention", "life_remaining": 25, "cost": 18000, "priority": 1}
]
}
}
}'), CURRENT_TIMESTAMP());
-- Intelligent maintenance priority dynamic calculation
SELECT
device_id,
json_extract_string(maintenance_data, "$.device_info.location") as location,
json_extract_string(maintenance_data, "$.device_info.type") as device_type,
-- Comprehensive analysis of part status array
json_extract_string(maintenance_data, "$.operational_status.maintenance.parts_status[0].part") as part1_name,
json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].life_remaining") as part1_life,
json_extract_string(maintenance_data, "$.operational_status.maintenance.parts_status[1].part") as part2_name,
json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[1].life_remaining") as part2_life,
json_extract_string(maintenance_data, "$.operational_status.maintenance.parts_status[2].part") as part3_name,
json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].life_remaining") as part3_life,
-- Intelligent maintenance decision algorithm: comprehensive consideration of lifespan, cost, and priority
CASE
-- High-priority parts with less than 30% life remaining: urgent maintenance
WHEN (json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].life_remaining") < 30 AND
json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].priority") = 1) OR
(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].life_remaining") < 30 AND
json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].priority") = 1) OR
(COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].life_remaining"), 100) < 30 AND
COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].priority"), 2) = 1)
THEN 'urgent_maintenance'
-- Any part with less than 50% life remaining: scheduled maintenance
WHEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].life_remaining") < 50 OR
json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[1].life_remaining") < 50 OR
json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].life_remaining") < 50 OR
COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[3].life_remaining"), 100) < 50 OR
COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].life_remaining"), 100) < 50
THEN 'scheduled_maintenance'
ELSE 'normal_operation'
END as maintenance_priority,
-- Identify the most critical part (high-priority part with shortest life remaining)
LEAST(
CASE WHEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].priority") = 1
THEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].life_remaining")
ELSE 100 END,
CASE WHEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].priority"), 2) = 1
THEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].life_remaining")
ELSE 100 END,
CASE WHEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].priority"), 2) = 1
THEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].life_remaining"), 100)
ELSE 100 END
) as critical_part_min_life,
-- Maintenance cost estimation (dynamically calculated based on part status)
(CASE WHEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].life_remaining") < 30
THEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[0].cost") ELSE 0 END +
CASE WHEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[1].life_remaining") < 30
THEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[1].cost") ELSE 0 END +
CASE WHEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].life_remaining") < 30
THEN json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[2].cost") ELSE 0 END +
CASE WHEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[3].life_remaining"), 100) < 30
THEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[3].cost"), 0) ELSE 0 END +
CASE WHEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].life_remaining"), 100) < 30
THEN COALESCE(json_extract_int(maintenance_data, "$.operational_status.maintenance.parts_status[4].cost"), 0) ELSE 0 END
) as estimated_maintenance_cost
FROM device_maintenance_analysis
WHERE json_extract_string(maintenance_data, "$.operational_status.maintenance.parts_status") IS NOT NULL
ORDER BY critical_part_min_life ASC;
Technical Value:
- Intelligent Decision Algorithm: Complex maintenance strategy integrating lifespan, cost, and priority
- Native Array Computation: Complex mathematical operations on arrays without splitting
- Real-time Cost Estimation: Dynamically calculates maintenance budget based on real-time part status
4.3.1 High-Frequency Field Access Pattern Optimization
-- Create performance optimization demo table
CREATE TABLE IF NOT EXISTS performance_orders (
id INT,
order_data JSON,
created_date DATE DEFAULT CURRENT_DATE()
);
-- Insert test data
INSERT INTO performance_orders VALUES
(1, parse_json('{"customer": {"id": "C001", "level": "VIP"}, "payment": {"amount": 1500.00}, "items": [{"detail": {"description": "Premium Electronics"}}]}'), CURRENT_DATE()),
(2, parse_json('{"customer": {"id": "C002", "level": "regular"}, "payment": {"amount": 2300.00}, "items": [{"detail": {"description": "Home Goods"}}]}'), CURRENT_DATE()),
(3, parse_json('{"customer": {"id": "C003", "level": "VIP"}, "payment": {"amount": 800.00}, "items": [{"detail": {"description": "Daily Necessities"}}]}'), CURRENT_DATE()),
(4, parse_json('{"customer": {"id": "C004", "level": "diamond"}, "payment": {"amount": 3500.00}, "items": [{"detail": {"description": "Luxury Goods"}}]}'), CURRENT_DATE());
-- High-frequency field access pattern optimization
WITH high_performance_query AS (
SELECT
-- Top-level field access: automatic columnar storage optimization
json_extract_string(order_data, "$.customer.id") as customer_id,
json_extract_double(order_data, "$.payment.amount") as amount,
json_extract_string(order_data, "$.customer.level") as level,
-- Batch field extraction: reduce repeated JSON parsing overhead
order_data as full_order_data,
created_date
FROM performance_orders
WHERE json_extract_string(order_data, "$.customer.id") IS NOT NULL -- Efficient top-level field filtering
AND json_extract_double(order_data, "$.payment.amount") > 1000 -- Efficient numeric field filtering
)
-- Secondary computation optimization: complex analysis based on pre-extracted fields
SELECT
customer_id,
level,
amount,
-- Complex nested fields extracted on demand (avoid repeated full-table parsing)
json_extract_string(full_order_data, "$.items[0].detail.description") as product_description,
-- Aggregation optimization: pre-group to reduce computation
CASE
WHEN amount > 3000 THEN 'high_value'
WHEN amount > 1500 THEN 'medium_value'
ELSE 'low_value'
END as value_segment
FROM high_performance_query
ORDER BY amount DESC;
4.3.2 Alternative Solutions for JSON Data Updates
Important Technical Limitation: After practical verification, Singdata Lakehouse does not support json_set, json_insert, json_remove, or other JSON modification functions. Below are verified alternative solutions:
Approach 1: Complete JSON Reconstruction (suitable for simple updates)
-- Create transaction demo table
CREATE TABLE IF NOT EXISTS transaction_demo (
customer_id STRING,
customer_data JSON,
updated_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
-- Insert initial data
INSERT INTO transaction_demo VALUES
('10001', parse_json('{"name": "Alice", "level": "regular", "profile": {"upgrade_date": null}, "payment": {"vip_discount": 0}}'), CURRENT_TIMESTAMP());
-- Use parse_json + CONCAT to reconstruct JSON object for update
UPDATE transaction_demo
SET customer_data = parse_json(CONCAT(
'{"name": "', json_extract_string(customer_data, "$.name"), '", ',
'"level": "VIP", ',
'"profile": {"upgrade_date": "2024-12-01"}, ',
'"payment": {"vip_discount": 500.0}}'
)),
updated_time = CURRENT_TIMESTAMP()
WHERE customer_id = '10001'
AND json_extract_string(customer_data, "$.level") != 'VIP';
-- Verify update result
SELECT
customer_id,
json_extract_string(customer_data, "$.name") as name,
json_extract_string(customer_data, "$.level") as level,
json_extract_string(customer_data, "$.profile.upgrade_date") as upgrade_date,
json_extract_double(customer_data, "$.payment.vip_discount") as vip_discount
FROM transaction_demo
WHERE customer_id = '10001';
Approach 2: Hybrid Storage Strategy (recommended for frequent updates)
-- Separate frequently updated fields; use JSON for stable data
CREATE TABLE IF NOT EXISTS hybrid_storage_demo (
id INT,
user_level STRING, -- Frequently updated fields stored separately
update_count INT, -- Frequently updated fields stored separately
user_profile JSON, -- Stable complex data stored as JSON
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
-- Insert hybrid storage data
INSERT INTO hybrid_storage_demo VALUES
(1, 'VIP', 5,
parse_json('{"preferences": ["electronics", "books"], "address": {"city": "Beijing", "district": "Chaoyang District"}}'),
CURRENT_TIMESTAMP());
-- Query hybrid storage data
SELECT
id,
user_level, -- Relational field, high update efficiency
update_count, -- Relational field, high update efficiency
json_extract_string(user_profile, "$.preferences[0]") as primary_preference, -- JSON field, flexible querying
json_extract_string(user_profile, "$.address.city") as city -- JSON field, structured storage
FROM hybrid_storage_demo;
Chapter 5: Chinese and Special Character Handling Guide
5.1 Chinese and Emoji Support Verification
Based on comprehensive testing, Singdata Lakehouse provides complete support for Chinese and emoji symbols:
5.1.1 Fully Supported Scenarios
Chinese Content Values:
-- Create user data table
CREATE TABLE IF NOT EXISTS user_data_chinese (
id STRING,
data JSON
);
-- Insert Chinese content
INSERT INTO user_data_chinese VALUES
('user001', parse_json('{"name": "Alice", "city": "Beijing", "message": "Hello World!"}')),
('user002', parse_json('{"name": "Bob", "city": "Shanghai", "message": "Data processing is simple"}')),
('user003', parse_json('{"name": "Charlie", "city": "Shenzhen", "message": "JSON features are powerful"}'));
-- Query Chinese content
SELECT
json_extract_string(data, "$.name") as name,
json_extract_string(data, "$.city") as city,
json_extract_string(data, "$.message") as message
FROM user_data_chinese;
Emoji Symbols:
-- Create social data table
CREATE TABLE IF NOT EXISTS social_data_emoji (
id STRING,
data JSON
);
-- Insert emoji data
INSERT INTO social_data_emoji VALUES
('post001', parse_json('{"content": "Feeling great today😊", "reactions": "👍❤️🔥", "rating": "⭐⭐⭐⭐⭐"}')),
('post002', parse_json('{"content": "Product quality is really good👌", "reactions": "👏🎉💯", "rating": "⭐⭐⭐⭐"}')),
('post003', parse_json('{"content": "Very satisfied with the service🙂", "reactions": "😀👍✨", "rating": "⭐⭐⭐⭐⭐"}'));
-- Query emoji data
SELECT
json_extract_string(data, "$.content") as content,
json_extract_string(data, "$.reactions") as reactions,
json_extract_string(data, "$.rating") as rating
FROM social_data_emoji;
Complex Chinese Content:
-- Create feedback data table
CREATE TABLE IF NOT EXISTS feedback_data_complex (
id STRING,
data JSON,
created_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
-- Insert complex Chinese and emoji content
INSERT INTO feedback_data_complex VALUES
('fb001', parse_json('{"comment": "Product quality is great, fast delivery, good customer service👍", "sentiment": "positive", "tags": ["quality", "logistics", "service"]}'), CURRENT_TIMESTAMP()),
('fb002', parse_json('{"comment": "Beautiful packaging, powerful features, very satisfied😊", "sentiment": "positive", "tags": ["packaging", "features", "satisfaction"]}'), CURRENT_TIMESTAMP()),
('fb003', parse_json('{"comment": "Great value, recommended purchase🛒", "sentiment": "positive", "tags": ["value", "recommendation"]}'), CURRENT_TIMESTAMP());
-- Query complex content
SELECT
id,
json_extract_string(data, "$.comment") as comment,
json_extract_string(data, "$.sentiment") as sentiment,
json_extract_string(data, "$.tags[0]") as first_tag,
json_extract_string(data, "$.tags[1]") as second_tag
FROM feedback_data_complex
ORDER BY created_time DESC;
5.1.2 Querying Chinese and Emoji Data
-- Create user feedback data table
CREATE TABLE IF NOT EXISTS user_feedback_full (
id STRING,
data JSON,
created_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
-- Insert test data containing Chinese and emoji
INSERT INTO user_feedback_full VALUES
('user001', parse_json('{"name": "Alice", "city": "Beijing", "comment": "Product is great😊", "score": 5}'), CURRENT_TIMESTAMP()),
('user002', parse_json('{"name": "Bob", "city": "Shanghai", "comment": "Good service👍", "score": 4}'), CURRENT_TIMESTAMP()),
('user003', parse_json('{"name": "Charlie", "city": "Beijing", "comment": "Fast delivery, beautiful packaging🎉", "score": 5}'), CURRENT_TIMESTAMP()),
('user004', parse_json('{"name": "David", "city": "Shenzhen", "comment": "Powerful features, worth recommending💯", "score": 5}'), CURRENT_TIMESTAMP());
-- Query data containing Chinese and emoji
SELECT
json_extract_string(data, "$.name") as user_name,
json_extract_string(data, "$.city") as city,
json_extract_string(data, "$.comment") as feedback,
json_extract_int(data, "$.score") as score
FROM user_feedback_full
WHERE json_extract_string(data, "$.city") = 'Beijing'
AND json_extract_string(data, "$.comment") LIKE '%😊%';
-- Group statistics by Chinese field
SELECT
json_extract_string(data, "$.city") as city,
COUNT(*) as user_count,
AVG(json_extract_int(data, "$.score")) as avg_score
FROM user_feedback_full
GROUP BY json_extract_string(data, "$.city")
ORDER BY user_count DESC;
-- Emoji usage analysis for high-score users
SELECT
json_extract_string(data, "$.name") as user_name,
json_extract_string(data, "$.comment") as comment,
json_extract_int(data, "$.score") as score,
CASE
WHEN json_extract_string(data, "$.comment") LIKE '%😊%' THEN 'happy'
WHEN json_extract_string(data, "$.comment") LIKE '%👍%' THEN 'thumbs_up'
WHEN json_extract_string(data, "$.comment") LIKE '%🎉%' THEN 'celebration'
WHEN json_extract_string(data, "$.comment") LIKE '%💯%' THEN 'perfect'
ELSE 'no_emoji'
END as emoji_sentiment
FROM user_feedback_full
WHERE json_extract_int(data, "$.score") >= 4
ORDER BY json_extract_int(data, "$.score") DESC;
5.2 Best Practice Recommendations
5.2.1 Field Naming Conventions
Recommended: Use English field names
-- Create best practice demo table
CREATE TABLE IF NOT EXISTS best_practice_demo (
id STRING,
data JSON
);
-- Recommended field naming approach
INSERT INTO best_practice_demo VALUES
('demo001', parse_json('{"user_name": "Alice", "user_city": "Beijing", "user_comment": "Product is great😊", "rating_score": 5}'));
-- Query demonstration
SELECT
json_extract_string(data, "$.user_name") as name,
json_extract_string(data, "$.user_city") as city,
json_extract_string(data, "$.user_comment") as comment,
json_extract_int(data, "$.rating_score") as score
FROM best_practice_demo;
Use with Caution: Chinese field names
-- Create Chinese field name demo table (not recommended)
CREATE TABLE IF NOT EXISTS chinese_field_demo (
id STRING,
data JSON
);
-- Chinese field name example (functional but not recommended)
INSERT INTO chinese_field_demo VALUES
('demo001', parse_json('{"user_name_cn": "Alice", "user_city_cn": "Beijing", "user_review_cn": "Product is great😊"}'));
-- Query Chinese field names (syntactically correct but not recommended)
SELECT
json_extract_string(data, "$.user_name_cn") as name,
json_extract_string(data, "$.user_city_cn") as city,
json_extract_string(data, "$.user_review_cn") as comment
FROM chinese_field_demo;
5.2.2 Special Character Handling
Safe Content Formats:
-- Create special character handling demo table
CREATE TABLE IF NOT EXISTS special_char_demo (
id STRING,
data JSON
);
-- Insert data containing various special characters
INSERT INTO special_char_demo VALUES
('char001', parse_json('{"message": "Normal text content and emoji😊", "symbols": "!@#$%&*()_+-=[]{}|;:,.<>?/", "punctuation": "Punctuation marks: ,。;:""''()【】、?!"}'));
-- Query special character data
SELECT
json_extract_string(data, "$.message") as message,
json_extract_string(data, "$.symbols") as symbols,
json_extract_string(data, "$.punctuation") as punctuation
FROM special_char_demo;
5.3 Troubleshooting Guide
If you encounter Chinese or emoji display issues:
- Check field naming: Prefer English field names
- Verify JSON format: Ensure the JSON format is correct with no syntax errors
- Use standard functions: Consistently use json_extract_* functions
- Avoid complex escaping: Minimize the use of complex escape characters in JSON strings
- Encoding settings: Ensure client and database character encoding settings are correct
-- Troubleshooting demo query
CREATE TABLE IF NOT EXISTS troubleshoot_demo (
id STRING,
data JSON
);
-- Correct data insertion method
INSERT INTO troubleshoot_demo VALUES
('test001', parse_json('{"name": "Test User", "message": "This is a test message🔧"}'));
-- Verify that data is correctly stored and queried
SELECT
id,
json_extract_string(data, "$.name") as name,
json_extract_string(data, "$.message") as message,
LENGTH(json_extract_string(data, "$.name")) as name_length,
LENGTH(json_extract_string(data, "$.message")) as message_length
FROM troubleshoot_demo;
Chapter 6: Technical Limitations and Best Practices
6.1 JSON Type Limitation Summary (Verified and Revised)
Core limitations discovered based on comprehensive verification:
6.1.1 Confirmed Supported Features
Supported JSON Functions:
parse_json() - JSON string parsing
json_extract_string() - Extract string values
json_extract_int() - Extract integer values
json_extract_double() - Extract floating-point values
json_extract() - General extraction function
Supported Syntax:
CREATE TABLE IF NOT EXISTS
- Complete DDL operations (CREATE/ALTER/DROP)
- Array access syntax
[0], [1]
- Nested object access
$.path.to.field
- Full support for Chinese and emoji content
6.1.2 Confirmed Unsupported Features
Unsupported JSON Functions:
-- The following function has been verified as unsupported:
-- json_set() - JSON field update (important limitation)
Unsupported Operations:
-- Create limitation demo table
CREATE TABLE IF NOT EXISTS limitation_demo (
id INT,
json_data JSON
);
-- Insert test data
INSERT INTO limitation_demo VALUES
(1, parse_json('{"name": "Alice", "score": 85}')),
(2, parse_json('{"name": "Bob", "score": 92}')),
(3, parse_json('{"name": "Charlie", "score": 78}'));
-- The following operations are not supported and will cause errors:
-- JSON type does not support direct comparison
-- SELECT * FROM limitation_demo WHERE json_data = parse_json('{}');
-- JSON type does not support direct ordering
-- SELECT * FROM limitation_demo ORDER BY json_data;
-- JSON type does not support direct GROUP BY
-- SELECT json_data, COUNT(*) FROM limitation_demo GROUP BY json_data;
6.1.3 Solutions
-- Use json_extract functions for comparison
SELECT * FROM limitation_demo
WHERE json_extract_string(json_data, "$.name") = 'Alice';
-- Extract fields then sort
SELECT
id,
json_extract_string(json_data, "$.name") as name,
json_extract_int(json_data, "$.score") as score
FROM limitation_demo
ORDER BY json_extract_int(json_data, "$.score") DESC;
-- Extract fields then group
SELECT
CASE
WHEN json_extract_int(json_data, "$.score") >= 90 THEN 'excellent'
WHEN json_extract_int(json_data, "$.score") >= 80 THEN 'good'
ELSE 'average'
END as grade,
COUNT(*) as count
FROM limitation_demo
GROUP BY CASE
WHEN json_extract_int(json_data, "$.score") >= 90 THEN 'excellent'
WHEN json_extract_int(json_data, "$.score") >= 80 THEN 'good'
ELSE 'average'
END;
6.2 Alternative Strategies for JSON Data Updates
6.2.1 Update Strategy Design Principles
- Read-Heavy, Write-Light Principle: JSON is primarily used for read-intensive scenarios
- Hybrid Storage Strategy: Separate frequently updated fields
- Application-Layer Processing: Complex updates are completed at the application layer
- Versioned Management: Important data supports historical versions
6.2.2 Specific Implementation Plans
Approach 1: Complete JSON Reconstruction
-- Suitable for: simple JSON updates, low update frequency scenarios
UPDATE table_name
SET json_column = parse_json(CONCAT(
'{"field1": "', json_extract_string(json_column, "$.field1"), '", ',
'"field2": "new_value", ',
'"field3": ', json_extract_int(json_column, "$.field3"), '}'
))
WHERE conditions;
Approach 2: Hybrid Storage Strategy
-- Recommended for: business scenarios with frequent update requirements
CREATE TABLE IF NOT EXISTS hybrid_table (
id INT,
frequently_updated_field STRING, -- Relational field
status_field INT, -- Relational field
stable_json_data JSON, -- JSON field for stable data
last_updated TIMESTAMP
);
Approach 3: Versioned Storage
-- Suitable for: important data requiring historical version retention
CREATE TABLE IF NOT EXISTS versioned_json_data (
record_id STRING,
version_number INT,
json_data JSON,
created_time TIMESTAMP,
is_current BOOLEAN DEFAULT TRUE
);
6.3 Unified Syntax Recommendations
To avoid issues caused by mixing different approaches, it is recommended to consistently use json_extract functions:
-- Create unified syntax demo table
CREATE TABLE IF NOT EXISTS syntax_demo (
order_id STRING,
data JSON
);
-- Insert test data
INSERT INTO syntax_demo VALUES
('ORD001', parse_json('{"customer": {"id": "C001", "name": "Alice", "level": "VIP"}, "payment": {"amount": 1500.00}, "items": [{"qty": 2}]}')),
('ORD002', parse_json('{"customer": {"id": "C002", "name": "Bob", "level": "regular"}, "payment": {"amount": 800.00}, "items": [{"qty": 1}]}')),
('ORD003', parse_json('{"customer": {"id": "C003", "name": "Charlie", "level": "VIP"}, "payment": {"amount": 2200.00}, "items": [{"qty": 3}]}'));
-- Recommended unified syntax pattern
SELECT
order_id,
json_extract_string(data, "$.customer.id") as customer_id,
json_extract_string(data, "$.customer.name") as customer_name,
json_extract_string(data, "$.customer.level") as customer_level,
json_extract_double(data, "$.payment.amount") as amount,
json_extract_int(data, "$.items[0].qty") as quantity
FROM syntax_demo
WHERE json_extract_string(data, "$.customer.level") = 'VIP'
ORDER BY json_extract_double(data, "$.payment.amount") DESC;
6.4 Type-Safe Handling
6.4.1 Safe Default Value Handling
-- Create type-safety demo table
CREATE TABLE IF NOT EXISTS type_safety_demo (
order_id STRING,
order_data JSON
);
-- Insert data containing missing fields
INSERT INTO type_safety_demo VALUES
('ORD001', parse_json('{"customer": {"name": "Alice", "vip_level": "gold"}, "payment": {"discount": 100}}')),
('ORD002', parse_json('{"customer": {"name": "Bob"}, "payment": {}}')), -- Missing vip_level and discount
('ORD003', parse_json('{"customer": {"name": "Charlie", "vip_level": "silver"}, "payment": {"discount": 50}}'));
-- Safe default value handling
SELECT
order_id,
json_extract_string(order_data, "$.customer.name") as customer_name,
-- Ensure type-consistent default value handling
CASE WHEN json_extract_string(order_data, "$.customer.vip_level") IS NOT NULL
THEN json_extract_string(order_data, "$.customer.vip_level")
ELSE 'regular'
END as safe_vip_level,
-- Safe handling of numeric types
CASE WHEN json_extract_double(order_data, "$.payment.discount") IS NOT NULL
THEN json_extract_double(order_data, "$.payment.discount")
ELSE 0.0
END as safe_discount,
-- Use COALESCE to simplify default value handling
COALESCE(json_extract_string(order_data, "$.customer.vip_level"), 'regular') as vip_level_coalesce,
COALESCE(json_extract_double(order_data, "$.payment.discount"), 0.0) as discount_coalesce
FROM type_safety_demo;
6.5.1 High-Frequency Field Optimization
-- Create performance optimization demo table
CREATE TABLE IF NOT EXISTS perf_orders (
id INT,
order_data JSON,
created_date DATE DEFAULT CURRENT_DATE()
);
-- Insert test data
INSERT INTO perf_orders VALUES
(1, parse_json('{"customer_id": "C001", "payment_amount": 1500.00, "items": [{"detail": {"description": "Premium Electronics"}}]}'), CURRENT_DATE()),
(2, parse_json('{"customer_id": "C002", "payment_amount": 2300.00, "items": [{"detail": {"description": "Home Goods"}}]}'), CURRENT_DATE()),
(3, parse_json('{"customer_id": "C003", "payment_amount": 800.00, "items": [{"detail": {"description": "Daily Necessities"}}]}'), CURRENT_DATE()),
(4, parse_json('{"customer_id": "C004", "payment_amount": 3200.00, "items": [{"detail": {"description": "Luxury Goods"}}]}'), CURRENT_DATE());
-- Place frequently used fields at the JSON top level to leverage automatic columnar storage optimization
SELECT
json_extract_string(order_data, "$.customer_id") as customer_id, -- High-frequency field
json_extract_double(order_data, "$.payment_amount") as amount, -- High-frequency field
json_extract_string(order_data, "$.items[0].detail.description") as desc -- Deep-level field
FROM perf_orders
WHERE json_extract_string(order_data, "$.customer_id") IS NOT NULL -- Efficient filtering
AND json_extract_double(order_data, "$.payment_amount") > 1000; -- Numeric filtering
-- Create batch optimization demo table
CREATE TABLE IF NOT EXISTS batch_orders (
id INT,
order_data JSON
);
-- Insert more test data
INSERT INTO batch_orders VALUES
(1, parse_json('{"customer": {"id": "C001", "level": "VIP"}, "payment": {"amount": 3500.00}}')),
(2, parse_json('{"customer": {"id": "C002", "level": "regular"}, "payment": {"amount": 1200.00}}')),
(3, parse_json('{"customer": {"id": "C003", "level": "VIP"}, "payment": {"amount": 4200.00}}')),
(4, parse_json('{"customer": {"id": "C004", "level": "diamond"}, "payment": {"amount": 5800.00}}')),
(5, parse_json('{"customer": {"id": "C005", "level": "regular"}, "payment": {"amount": 900.00}}'));
-- Use CTE to reduce repeated parsing
WITH extracted_data AS (
SELECT
id as order_id,
json_extract_string(order_data, "$.customer.id") as customer_id,
json_extract_string(order_data, "$.customer.level") as customer_level,
json_extract_double(order_data, "$.payment.amount") as payment_amount
FROM batch_orders
WHERE json_extract_string(order_data, "$.customer.id") IS NOT NULL
)
SELECT
customer_level,
COUNT(*) as order_count,
AVG(payment_amount) as avg_amount,
MAX(payment_amount) as max_amount,
MIN(payment_amount) as min_amount
FROM extracted_data
GROUP BY customer_level
ORDER BY avg_amount DESC;
Chapter 7: Environment Management and Implementation Recommendations
7.1 Pre-Implementation Checklist
7.1.1 Technical Preparation
7.1.2 Syntax Specification Check
7.1.3 Development Standards
7.1.4 Verification Testing
-- Create system verification table
CREATE TABLE IF NOT EXISTS system_verification (
test_id STRING,
test_data JSON,
test_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
-- Basic functionality verification
INSERT INTO system_verification VALUES
('test_001', parse_json('{"basic_test": "Basic functionality test", "emoji_test": "😊👍", "number_test": 123.45}'), CURRENT_TIMESTAMP()),
('test_002', parse_json('{"nested_test": {"level1": {"level2": "Deep nesting test"}}, "array_test": ["Array", "Test", "Items"]}'), CURRENT_TIMESTAMP());
-- Verification query
SELECT
test_id,
json_extract_string(test_data, "$.basic_test") as basic_result,
json_extract_string(test_data, "$.emoji_test") as emoji_result,
json_extract_double(test_data, "$.number_test") as number_result,
json_extract_string(test_data, "$.nested_test.level1.level2") as nested_result,
json_extract_string(test_data, "$.array_test[0]") as array_result
FROM system_verification
ORDER BY test_time DESC;
7.2 Phased Implementation Strategy
- Proof of Concept Phase: Select 1-2 key business scenarios for pilot testing
- Small-Scale Deployment: Verify performance and stability in non-core business systems
- Architecture Optimization Phase: Implement hybrid storage strategy and JSON update alternatives
- Full Rollout: Develop a complete migration plan based on pilot experience
-- Implementation phase tracking table
CREATE TABLE IF NOT EXISTS implementation_tracking (
phase STRING,
milestone STRING,
status STRING,
completion_date DATE,
notes JSON
);
-- Record implementation progress
INSERT INTO implementation_tracking VALUES
('Phase1', 'Proof of Concept', 'completed', CURRENT_DATE(), parse_json('{"scenarios": ["E-commerce order analysis", "User behavior tracking"], "success_rate": 0.95}')),
('Phase2', 'Small-scale deployment', 'in_progress', NULL, parse_json('{"target_systems": ["Test environment", "Development environment"], "expected_completion": "2024-12-15"}')),
('Phase3', 'Architecture optimization', 'planned', NULL, parse_json('{"focus": ["Hybrid storage design", "JSON update alternatives"], "expected_start": "2024-12-20"}'));
7.3 Monitoring and Operations
-- Create operational monitoring table
CREATE TABLE IF NOT EXISTS operational_monitoring (
metric_name STRING,
metric_value DOUBLE,
metric_metadata JSON,
measurement_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
-- Insert monitoring data example
INSERT INTO operational_monitoring VALUES
('json_query_performance', 0.95, parse_json('{"avg_response_time_ms": 145, "query_count": 1000, "success_rate": 0.99}'), CURRENT_TIMESTAMP()),
('storage_efficiency', 0.87, parse_json('{"compression_ratio": 0.72, "storage_saved_gb": 250, "cost_reduction": 0.35}'), CURRENT_TIMESTAMP()),
('json_update_performance', 0.78, parse_json('{"avg_update_time_ms": 280, "update_success_rate": 0.95, "method": "json_reconstruction"}'), CURRENT_TIMESTAMP());
-- Performance monitoring query template
SELECT
metric_name,
metric_value,
json_extract_double(metric_metadata, "$.avg_response_time_ms") as avg_response_time,
json_extract_int(metric_metadata, "$.query_count") as query_count,
json_extract_double(metric_metadata, "$.success_rate") as success_rate,
measurement_time
FROM operational_monitoring
WHERE metric_name = 'json_query_performance'
ORDER BY measurement_time DESC;
-- Create performance monitoring demo table
CREATE TABLE IF NOT EXISTS performance_monitor (
id INT,
json_column JSON,
created_date DATE DEFAULT CURRENT_DATE()
);
-- Insert test data
INSERT INTO performance_monitor VALUES
(1, parse_json('{"frequently_accessed_field": "value1", "other_data": "test", "nested": {"deep": {"field": "deep_value1"}}}'), CURRENT_DATE()),
(2, parse_json('{"frequently_accessed_field": "value2", "other_data": "test", "nested": {"deep": {"field": "deep_value2"}}}'), CURRENT_DATE() - INTERVAL 10 DAYS),
(3, parse_json('{"frequently_accessed_field": "value1", "other_data": "test", "nested": {"deep": {"field": "deep_value3"}}}'), CURRENT_DATE() - INTERVAL 2 DAYS),
(4, parse_json('{"frequently_accessed_field": "value3", "other_data": "test", "nested": {"deep": {"field": "deep_value4"}}}'), CURRENT_DATE()),
(5, parse_json('{"frequently_accessed_field": "value2", "other_data": "test", "nested": {"deep": {"field": "deep_value5"}}}'), CURRENT_DATE() - INTERVAL 5 DAYS);
-- Query performance analysis template
SELECT
'performance_monitor' as table_name,
COUNT(*) as total_records,
AVG(LENGTH(CAST(json_column AS STRING))) as avg_json_size,
-- High-frequency field distribution analysis
COUNT(DISTINCT json_extract_string(json_column, "$.frequently_accessed_field")) as unique_values,
-- Data recency analysis
COUNT(CASE WHEN created_date >= CURRENT_DATE() - INTERVAL 7 DAYS THEN 1 END) as recent_records,
-- Field access complexity analysis
COUNT(CASE WHEN json_extract_string(json_column, "$.nested.deep.field") IS NOT NULL THEN 1 END) as deep_nested_fields
FROM performance_monitor
WHERE json_column IS NOT NULL;
7.4.2 Automatic Optimization Recommendation Generation
-- Create optimization analysis demo table
CREATE TABLE IF NOT EXISTS optimization_analysis (
id INT,
order_data JSON,
created_date DATE DEFAULT CURRENT_DATE()
);
-- Insert test data
INSERT INTO optimization_analysis VALUES
(1, parse_json('{"customer": {"level": "VIP"}, "payment": {"amount": 1500.00}}'), CURRENT_DATE()),
(2, parse_json('{"customer": {"level": "regular"}, "payment": {"amount": 800.00}}'), CURRENT_DATE()),
(3, parse_json('{"customer": {"level": "VIP"}, "payment": {"amount": 2200.00}}'), CURRENT_DATE()),
(4, parse_json('{"customer": {"level": "diamond"}, "payment": {"amount": 5000.00}}'), CURRENT_DATE()),
(5, parse_json('{"customer": {"level": "VIP"}, "payment": {"amount": 3200.00}}'), CURRENT_DATE()),
(6, parse_json('{"customer": {"level": "regular"}, "payment": {"amount": 600.00}}'), CURRENT_DATE());
-- Automatic optimization recommendations based on usage patterns
WITH query_pattern_analysis AS (
SELECT
json_extract_string(order_data, "$.customer.level") as access_pattern,
COUNT(*) as usage_frequency,
AVG(json_extract_double(order_data, "$.payment.amount")) as avg_amount
FROM optimization_analysis
GROUP BY json_extract_string(order_data, "$.customer.level")
)
SELECT
access_pattern,
usage_frequency,
ROUND(avg_amount, 2) as avg_amount,
CASE
WHEN usage_frequency > 1000 THEN 'recommend_column_optimization'
WHEN usage_frequency > 100 THEN 'consider_indexing'
ELSE 'current_performance_sufficient'
END as optimization_recommendation,
CASE
WHEN avg_amount > 3000 THEN 'high_value_customer_segment'
WHEN avg_amount > 1500 THEN 'medium_value_customer_segment'
ELSE 'low_value_customer_segment'
END as business_insight
FROM query_pattern_analysis
ORDER BY usage_frequency DESC;
7.5 Environment Cleanup
-- View all created tables
SHOW TABLES LIKE '%json%';
-- Clean up test tables (optional)
/*
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS ecommerce_orders_json;
DROP TABLE IF EXISTS promotion_orders_json;
DROP TABLE IF EXISTS inventory_orders_json;
DROP TABLE IF EXISTS iot_devices_unified;
DROP TABLE IF EXISTS financial_risk_analysis;
DROP TABLE IF EXISTS customer_behavior_analysis;
DROP TABLE IF EXISTS device_maintenance_analysis;
DROP TABLE IF EXISTS user_data_chinese;
DROP TABLE IF EXISTS social_data_emoji;
DROP TABLE IF EXISTS feedback_data_complex;
DROP TABLE IF EXISTS user_feedback_full;
DROP TABLE IF EXISTS best_practice_demo;
DROP TABLE IF EXISTS chinese_field_demo;
DROP TABLE IF EXISTS special_char_demo;
DROP TABLE IF EXISTS troubleshoot_demo;
DROP TABLE IF EXISTS limitation_demo;
DROP TABLE IF EXISTS syntax_demo;
DROP TABLE IF EXISTS type_safety_demo;
DROP TABLE IF EXISTS perf_orders;
DROP TABLE IF EXISTS batch_orders;
DROP TABLE IF EXISTS performance_monitor;
DROP TABLE IF EXISTS optimization_analysis;
DROP TABLE IF EXISTS system_verification;
DROP TABLE IF EXISTS implementation_tracking;
DROP TABLE IF EXISTS operational_monitoring;
DROP TABLE IF EXISTS transaction_demo;
DROP TABLE IF EXISTS hybrid_storage_demo;
*/
-- Verify cleanup results
SELECT COUNT(*) as remaining_json_tables
FROM information_schema.tables
WHERE table_name LIKE '%json%';
Summary
Core Value Summary
Singdata Lakehouse JSON data processing capabilities demonstrate the following core values:
Technical Advantages
- Native JSON Support: Full compatibility with standard JSON syntax; json_extract function queries are highly efficient
- Intelligent Storage Optimization: High-frequency fields are stored in columnar format based on JSON Schema
- Real-time Processing Capability: Dynamic tables enable real-time responsiveness
- Multilingual Support: Complete support for Chinese, emoji, and other special characters
- Enterprise-Grade Features: Complete elastic computing, multi-layer caching, and cluster isolation
Business Value
- E-Commerce: Implement personalized recommendation systems, improving conversion effectiveness and user experience
- IoT: Establish predictive maintenance systems, reducing equipment failures and maintenance costs
- Financial: Build real-time risk control engines, enhancing risk management and compliance capabilities
Implementation Advantages
- Quick Start: Based on standard SQL, significantly improving development efficiency
- Smooth Migration: No need to restructure existing JSON data; import and use directly
- Strong Scalability: Seamless scaling from GB to PB-level data, supporting long-term enterprise development
Key Technical Points
Syntax Specifications
- Consistently use json_extract functions: Avoid mixing different access methods
- Type Safety: JSON type does not support direct comparison, sorting, or grouping operations
- Chinese Support: Full support for Chinese and emoji; recommend using English field names
- CREATE Statement Standard: Consistently use IF NOT EXISTS to avoid duplicate creation errors
- Important Limitation: json_set function is not supported
JSON Update Strategies
- Hybrid Storage Design: Use relational storage for frequently updated fields, JSON for stable data
- Application-Layer Processing: Complete complex JSON updates at the application layer before full replacement
- Version Control: Retain historical versions for important JSON data changes
- Batch Updates: Avoid row-by-row JSON reconstruction; prioritize batch processing
- Columnar Design for High-Frequency Fields: Can significantly improve query efficiency
- Multi-Layer Cache Utilization: Query result cache, metadata cache, and compute cluster local cache work together
- Real-time Stream Processing: Dynamic tables enable efficient data aggregation
Action Recommendations
- Start Immediately: Select a key business scenario and launch a proof of concept for JSON data processing
- Architecture Design: Develop a hybrid storage strategy and clearly define JSON usage boundaries
- Team Training: Organize technical team learning of this guide's best practices and syntax specifications
- Gradual Expansion: Based on successful pilot experience, develop a comprehensive JSON data strategy
- Continuous Optimization: Establish performance monitoring and optimization mechanisms to ensure long-term value realization
Singdata Lakehouse JSON Data Processing - Making complex JSON data processing simple and efficient within clearly defined technical boundaries
This guide provides a complete solution for enterprise JSON data processing. All SQL statements have been verified in Singdata Lakehouse. Each example includes complete table creation and data preparation statements, using IF NOT EXISTS to ensure repeatable execution. By following the best practices in this revised edition, enterprises can fully leverage the business value of JSON data and accelerate their digital transformation journey within clearly defined technical boundaries.
References
JSON Data Type
JSON Functions
Note: This guide is based on test results from the Singdata Lakehouse version as of May 2025. Subsequent versions may introduce changes. Please regularly check the official documentation for the latest information.