Autonomous Driving Full-Loop Data Platform Solution


1. Solution Background

Data Challenges in the Autonomous Driving Industry

The core competitive advantage in autonomous driving is data — whoever can more rapidly convert massive driving data into high-quality training data and iterate models will lead in the competition. In reality, however, the data challenges facing autonomous driving companies are extremely complex:

Diverse data sources, massive scale

  • Each road-test vehicle generates several GB of data per hour (camera annotations + LiDAR point clouds + CAN bus)
  • Mass production fleets reach millions of vehicles; high-frequency telemetry QPS peaks at 100K–1M msg/s
  • Heterogeneous data formats: structured time-series (CAN signals), semi-structured (JSON events), large files (Parquet annotations)

Long data loop path, broken links

From road-test collection to model updates deployed on-vehicle, the process spans annotation, simulation augmentation, training set construction, offline evaluation, OTA gradual rollout, and telemetry feedback. Each step is often completed by a different team using different tools, and data passes between silos inefficiently.

Long-tail scenarios are hard to cover

Extreme weather, rare obstacles, and special traffic scenarios (NEAR_MISS, takeovers) occur at extremely low probability in real-world collection, yet are critical for safety validation. Manually collecting long-tail data is prohibitively expensive and time-consuming.

Strict compliance and safety requirements

Regulators have explicit requirements for safety assessment, OTA upgrades, and fault tracing in autonomous driving, which call for a complete data evidence chain and risk event records.


2. Pain Points of Traditional Approaches

Fragmented architecture, bloated toolchain

Stage | Traditional Tools | Pain Points
Batch data lake ingestion | Spark / Flink + HDFS | Requires standalone cluster, high ops cost, severe small-file problem
Real-time event ingestion | Flink + Kafka | Disconnected from offline pipeline, data consistency hard to guarantee
Annotation data management | Custom database + Python | Version control chaotic, hard to trace
AI inference integration | Python microservices calling LLM | Long engineering chain, difficult to couple with data processing
Metric aggregation | Hive / ClickHouse | Requires independent deployment, high query latency
Data quality monitoring | Custom scripts + cron jobs | Lacks real-time awareness, problems discovered late

For a typical autonomous driving data team, maintaining the above toolchain requires Spark cluster + Flink cluster + message queue + multiple databases + LLM services — enormous infrastructure cost and operational burden, with data engineers spending most of their energy on pipeline maintenance rather than business value.

Long data loop cycle

Under traditional architecture, the time from a production vehicle uploading a takeover event to that event entering the next training set typically takes 2–4 weeks:

Takeover event reported
  ↓ (T+1 batch processing)
Data warehouse ingestion
  ↓ (manual scheduling)
Annotation task assignment
  ↓ (3–5 days)
Annotation completed and reviewed
  ↓ (manual build)
Training set version packaged
  ↓ (submitted to cluster)
Model training and evaluation
  ↓ (deployment approval)
OTA release

Leading domestic automakers have compressed the data loop cycle to within a few weeks. Traditional architecture has become the bottleneck.

Weak long-tail scenario supplementation capability

The simulation system and data platform operate independently. Edge cases from real-world collection cannot be automatically injected into the simulation scenario library, resulting in low long-tail scenario coverage and poor model performance in extreme situations.


3. Singdata Lakehouse Solution

Solution Architecture

This solution builds an integrated autonomous driving data platform based on Singdata Lakehouse, covering R&D, testing, and mass production operation stages. Through the Data Flywheel, mass production driving data is continuously converted into training data, driving model iteration to form a positive flywheel.

[Figure: Singdata Lakehouse Autonomous Driving Solution Architecture]

Ten Functional Modules

Module | Core Capability | Business Value
M1 Multimodal Data Collection | COPY INTO batch lake ingestion, automatic deduplication | Road-test Parquet automated ingestion, no manual intervention
M2 Data Annotation | AI_COMPLETE pre-annotation + HITL human review | 100% pre-annotation coverage, 60%+ reduction in manual work
M3 Simulation and Synthetic Data | INVERTED INDEX scenario retrieval + AI scenario classification | Long-tail scenarios automatically supplemented, breaking real-data bottleneck
M4 Training Data Preparation | Versioned training sets + Window Function feature engineering | Controllable real/synthetic data ratio, fully traceable
M5 Road-Test Replay and Shadow Mode | divergence_score evaluation, 30-min auto-refresh | New algorithm evaluation without intervention, quantitative basis for launch decisions
M6 Mass Production Telemetry and Fault Diagnosis | AI_COMPLETE DTC diagnosis + Flywheel injection | DTC faults auto-described, edge scenarios continuously accumulated
M7 OTA Gradual Rollout Tracking | 15-min refresh, dual-metric driven expansion decisions | OTA health fully visible, anomalies auto-alerted
M8 Safety and Compliance | Multi-module risk aggregation, L1-L4 severity grading | Complete evidence chain, meeting regulatory compliance requirements
M9 End-to-End Demo | Full-chain one-click execution, Flywheel closed-loop validation | Fast POC, lower solution validation cost
M10 Vehicle Kafka Real-Time Reporting | 4 PIPEs, minute-level lake ingestion, second-level alerts | Mass production vehicle data real-time ingestion, fills road-test blind spots

Data Flywheel: Two Parallel Paths

Batch Path (M6): Daily road-test data from production vehicles enters the lake in bulk via COPY INTO, is cleaned with MERGE INTO, then 06_flywheel_scene_extract.sql automatically injects HARD_BRAKE / TAKEOVER / ANOMALY events into the simulation scenario library.

Real-time Path (M10): Real-time safety events from production vehicles enter the lake at second-level latency via Kafka PIPE. After 1-minute Dynamic Table refresh, 04_flywheel_bridge.sql injects events into the scenario library in real time, simultaneously triggering safety alerts written to the M8 compliance pipeline.

Both paths feed into the unified sim_ods_scenarios. Scenario types are automatically classified (LONG_TAIL / CORNER_CASE), INVERTED INDEX supports semantic retrieval, and the simulation platform can extract scenarios on demand.
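As a rough illustration of how the two paths can converge on one scenario library, the sketch below models the merge in plain Python. It is not the logic of 06_flywheel_scene_extract.sql or 04_flywheel_bridge.sql, which is not reproduced in this document. The field names (event_id, event_type, source_path), the function name, the rule that the real-time copy wins when both paths deliver the same event, and the use of the batch path's event-type filter on both paths are all assumptions made for the example.

```python
# Hypothetical model of the two flywheel paths feeding one scenario library.
# Field names and the de-duplication rule are illustrative assumptions.
FLYWHEEL_EVENT_TYPES = {"HARD_BRAKE", "TAKEOVER", "ANOMALY"}  # types named for the batch path


def merge_into_scenarios(batch_events, realtime_events):
    """Return one scenario row per event_id; the real-time copy wins on duplicates."""
    scenarios = {}
    tagged = [("REALTIME", e) for e in realtime_events] + [("BATCH", e) for e in batch_events]
    for source_path, event in tagged:
        if event["event_type"] not in FLYWHEEL_EVENT_TYPES:
            continue  # only flywheel-relevant events become scenarios
        if event["event_id"] in scenarios:
            continue  # the same event already arrived via the other path
        scenarios[event["event_id"]] = {
            "source_event_id": event["event_id"],  # link back to the production event
            "event_type": event["event_type"],
            "source_path": source_path,
        }
    return list(scenarios.values())


batch = [
    {"event_id": "e1", "event_type": "HARD_BRAKE"},
    {"event_id": "e2", "event_type": "LANE_CHANGE"},
]
realtime = [
    {"event_id": "e1", "event_type": "HARD_BRAKE"},
    {"event_id": "e3", "event_type": "TAKEOVER"},
]
library = merge_into_scenarios(batch, realtime)
print(library)
```

Keeping source_event_id on every scenario row is what makes a scenario traceable back to the production event it came from.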


4. Singdata Lakehouse Technical Advantages

1. SQL-first, zero additional dependencies

All processing logic is based on standard SQL — no Python / Spark / Flink environment required. AI inference (DTC diagnosis, scenario classification, annotation quality scoring) is embedded in SQL via AI_COMPLETE(), eliminating the engineering complexity of microservice calls:

-- Call LLM directly in SQL for row-by-row DTC fault diagnosis
UPDATE fleet_dwd_dtc_records
SET ai_diagnosis = AI_COMPLETE(
    'conn_dashscope:deepseek-v3',
    'Diagnose the following fault code, return root cause and recommendations: ' || dtc_code
)
WHERE ai_diagnosis IS NULL;

2. Dynamic Table replaces Spark Streaming

Traditional solutions require maintaining standalone Flink/Spark Streaming clusters for incremental data processing. Lakehouse Dynamic Table implements automatic incremental refresh with a single SQL declaration:

CREATE DYNAMIC TABLE veh_dws_realtime_alert
REFRESH INTERVAL 1 MINUTE  -- 1-minute real-time alerts
AS
SELECT vehicle_id,
       COUNT(*) AS event_count,
       MAX(risk_level) AS max_risk
FROM veh_dwd_safety_events_clean
WHERE msg_timestamp >= CURRENT_TIMESTAMP() - INTERVAL 5 MINUTE
GROUP BY vehicle_id
HAVING MAX(risk_level) IN ('L3','L4');

No Streaming cluster to deploy, no checkpoints to manage. Refresh intervals are flexibly configurable from 1 minute to 6 hours.
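To make the semantics of the alert query concrete, the following plain-Python sketch reproduces what each refresh computes: per-vehicle event count and maximum risk level over the last five minutes, keeping only vehicles whose maximum is L3 or L4. It is a model of the query's result only, not of how the Lakehouse performs incremental refresh; the function name and sample events are made up for the example.

```python
from datetime import datetime, timedelta

ALERT_LEVELS = {"L3", "L4"}


def realtime_alerts(events, now, window=timedelta(minutes=5)):
    """Mimic the query: group recent events by vehicle, keep groups whose max risk is L3/L4."""
    grouped = {}
    for e in events:
        if e["msg_timestamp"] >= now - window:  # WHERE clause
            grouped.setdefault(e["vehicle_id"], []).append(e["risk_level"])  # GROUP BY
    alerts = {}
    for vehicle_id, levels in grouped.items():
        # 'L1' < 'L2' < 'L3' < 'L4' under plain string comparison, which MAX() relies on
        max_risk = max(levels)
        if max_risk in ALERT_LEVELS:  # HAVING clause
            alerts[vehicle_id] = {"event_count": len(levels), "max_risk": max_risk}
    return alerts


now = datetime(2026, 6, 5, 12, 0, 0)
events = [
    {"vehicle_id": "v1", "risk_level": "L1", "msg_timestamp": now - timedelta(minutes=1)},
    {"vehicle_id": "v1", "risk_level": "L3", "msg_timestamp": now - timedelta(minutes=2)},
    {"vehicle_id": "v2", "risk_level": "L2", "msg_timestamp": now - timedelta(minutes=1)},
    {"vehicle_id": "v3", "risk_level": "L4", "msg_timestamp": now - timedelta(minutes=10)},
]
alerts = realtime_alerts(events, now)
print(alerts)
```

Here v2 is dropped by the HAVING condition and v3 by the five-minute window, so only v1 raises an alert.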

3. Kafka PIPE for near-real-time ingestion

Native support for continuous Kafka consumption. 4 topics correspond to 4 independent PIPEs; the safety event PIPE batches every 10 seconds to ensure <30 seconds end-to-end alert latency:

CREATE PIPE pipe_vehicle_safety_events
BATCH_INTERVAL_IN_SECONDS = '10'  -- low latency for safety events
AS
COPY INTO veh_ods_safety_events
FROM read_kafka(...);

4. INVERTED INDEX for scenario semantic retrieval

At millions-of-records scale in the simulation scenario library, full-text inverted index enables millisecond semantic retrieval without a standalone search engine (Elasticsearch, etc.):

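-- One index per column: multi-column indexes are not supported (see §6.1).
-- The second index name, idx_scenario_tags, is illustrative.
CREATE INVERTED INDEX idx_scenario_desc ON TABLE sim_ods_scenarios (description) PROPERTIES ('analyzer' = 'keyword');
CREATE INVERTED INDEX idx_scenario_tags ON TABLE sim_ods_scenarios (tags) PROPERTIES ('analyzer' = 'keyword');

-- Semantic scenario retrieval
SELECT *
FROM sim_ods_scenarios
WHERE description LIKE '%rainy%'
  AND tags LIKE '%NEAR_MISS%';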
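For readers unfamiliar with the data structure, a minimal Python sketch of an inverted index shows why such a lookup avoids scanning every row: each token maps directly to the set of rows that contain it. This is a toy whitespace tokenizer written for illustration, not the Lakehouse analyzer or its storage format, and the sample scenarios are invented.

```python
from collections import defaultdict


def build_inverted_index(rows, column):
    """Map each lower-cased token in `column` to the set of row ids containing it."""
    index = defaultdict(set)
    for row_id, row in rows.items():
        for token in row[column].lower().split():
            index[token].add(row_id)
    return index


scenarios = {
    1: {"description": "rainy night highway cut-in", "tags": "NEAR_MISS"},
    2: {"description": "sunny urban intersection", "tags": "TAKEOVER"},
    3: {"description": "rainy urban pedestrian crossing", "tags": "NEAR_MISS HARD_BRAKE"},
}

# One index per column, then intersect the posting sets to combine conditions.
desc_index = build_inverted_index(scenarios, "description")
tags_index = build_inverted_index(scenarios, "tags")
hits = desc_index["rainy"] & tags_index["near_miss"]
print(sorted(hits))  # [1, 3]
```

Note that this is keyword matching on tokens; whether a given deployment needs richer text analysis depends on the analyzer configured for the index.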

5. Storage-compute separation, on-demand scaling

Lakehouse storage-compute separation architecture allows elastic scaling of compute resources during road-test batch processing, with automatic scale-down during idle periods — saving 40%–70% infrastructure cost compared to fixed-cluster solutions.

6. Unified governance, eliminate data silos

ODS / DWD / DWS three layers are all managed within a single Lakehouse instance. Annotation data, telemetry data, training data, and evaluation data share a unified permission system, data lineage, and Time Travel (historical version recovery) capability.


5. Customer Value

Efficiency Improvement

Metric | Traditional Solution | Lakehouse Solution | Improvement
Data loop cycle | 2–4 weeks | Days (batch) / minutes (real-time) | 80%+ reduction
Annotation manual effort | 100% manual | AI pre-annotation + human review | 60%+ reduction
Long-tail scenario supplementation | Manual curation, long cycle | Production events auto-injected into scenario library | 100% automated
OTA decision basis | Experience-based judgment | success_rate + divergence dual metrics | Quantified and traceable

Cost Reduction

  • Infrastructure: Eliminates standalone deployment of Spark cluster + Flink cluster + multiple databases, reducing infrastructure cost by 40%–60%
  • Operations staffing: SQL-first requires no Spark/Flink engineers; data engineers maintain directly
  • Model iteration: Accelerated data loop shortens each model iteration cycle, indirectly reducing GPU training costs

Safety and Compliance

  • L1-L4 risk events recorded end-to-end with complete evidence chain, meeting autonomous driving regulations such as GB/T 40429
  • OTA gradual rollout decisions maintain complete logs, traceable evaluation basis for each upgrade
  • Safety events and simulation scenarios are bi-directionally linked — traceable to "which production event this scenario originated from"

Competitive Advantage

Reducing the data loop cycle from weeks to days means more model iterations can be completed in the same time — establishing a Data Flywheel advantage in fast-evolving tracks like NOA urban features.


6. Solution Notes

6.1 Lakehouse Table Creation Compatibility

The following issues were all encountered and resolved during real testing. Strictly follow these rules before deployment:

Rule | Description | Consequence of Violation
ODS layer tables with JSON columns: no partition, no PK | Combining a JSON column with a partition and a PK causes an error | CZLH-67000
Partitioned table PRIMARY KEY must include partition column | e.g., PRIMARY KEY(event_id, trigger_ts) | Table creation fails
INVERTED INDEX created separately per column | Multi-column joint index (col1, col2) not supported | Syntax error
change_tracking set separately via ALTER after table creation | Cannot be declared inline in CREATE TABLE | Property has no effect
MERGE INTO source side must deduplicate first | Duplicate PKs in the same batch cause an error | CZLH-71001
Nested partition functions require redundant column substitution | DAYS(TIMESTAMP_MILLIS(ts_ms)) not supported | Table creation fails
JSON field access uses col['key'] syntax | Colon syntax col:key not supported | CZLH-42000
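The "deduplicate the MERGE INTO source first" rule can be sketched with the common ROW_NUMBER() pattern. The example below runs it against an in-memory SQLite database purely so the query can be executed standalone; the staging table, its columns (event_id, ingest_ts, payload), and the "latest row wins" policy are assumptions for illustration, and the Lakehouse dialect may differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (event_id TEXT, ingest_ts INTEGER, payload TEXT);
    INSERT INTO staging VALUES
        ('e1', 100, 'first'),
        ('e1', 200, 'latest'),
        ('e2', 150, 'only');
""")

# Keep exactly one row per merge key: the most recent ingest_ts wins.
DEDUP_SQL = """
    SELECT event_id, ingest_ts, payload
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY event_id ORDER BY ingest_ts DESC
        ) AS rn
        FROM staging
    ) AS ranked
    WHERE rn = 1
    ORDER BY event_id
"""
deduped = conn.execute(DEDUP_SQL).fetchall()
print(deduped)  # [('e1', 200, 'latest'), ('e2', 150, 'only')]
```

Using the de-duplicated subquery as the MERGE source guarantees at most one source row per key, which is the condition the rule above requires.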

6.2 Kafka PIPE Network Configuration

Most critical prerequisite: read_kafka only supports SASL_PLAINTEXT, not SASL_SSL.

Confirm before deployment:

✅ Lakehouse Virtual Cluster and Kafka are in the same VPC, or VPC peering is established
✅ Kafka security group allows Lakehouse node IP range → TCP 9092 inbound
✅ Kafka provides a SASL_PLAINTEXT endpoint (Alibaba Cloud Serverless requires manually adding a VPC endpoint)
✅ Connectivity validation SQL using read_kafka SELECT passes (returns data within 10 seconds)

If the network cannot be connected temporarily, use the OSS relay solution:

Kafka → Alibaba Cloud Function Compute FC (same VPC) → OSS → Lakehouse COPY INTO

6.3 Data Scale and Performance Planning

Data Type | Scale Reference | Recommended Strategy
Road-test annotation Parquet | Several GB per vehicle per hour, TB-level daily growth | COPY INTO with daily partitions, MERGE INTO with partition pruning
Mass production vehicle high-frequency telemetry | ~86,400 records/day/vehicle, fleet of millions with 1M peak QPS | Downsample on Kafka side (10s aggregation) before lake ingestion
Simulation scenario library | Millions of records scale | INVERTED INDEX for retrieval, periodically clean low-value scenarios
Dynamic Table | DWS layer minimum 1-min refresh | High-frequency refresh tables hold only aggregation results, avoid full table scans
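The telemetry figures in the table can be checked with a few lines of arithmetic. The sketch below assumes a fleet of exactly one million vehicles (the lower bound of "millions"), that all vehicles report at the same time at peak, and that 10-second aggregation emits one record per vehicle per window; these are simplifications for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60           # 86,400
records_per_vehicle_per_day = 86_400      # from the scale reference above
fleet_size = 1_000_000                    # assumed: lower bound of "fleet of millions"
downsample_window_s = 10                  # Kafka-side aggregation window

per_vehicle_rate = records_per_vehicle_per_day / SECONDS_PER_DAY  # records/s per vehicle
raw_rate = per_vehicle_rate * fleet_size                           # msg/s before downsampling
downsampled_rate = raw_rate / downsample_window_s                  # msg/s after 10 s aggregation

print(per_vehicle_rate, raw_rate, downsampled_rate)
```

Under these assumptions each vehicle reports once per second, the fleet peaks at 1M msg/s, and 10-second aggregation brings that down to 100K msg/s, which matches the 100K–1M msg/s range quoted in the background section.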

6.4 AI_COMPLETE Cost Control

AI_COMPLETE is billed per token. Pay attention in high-throughput scenarios:

  • On-demand triggering: Only call AI_COMPLETE for new records (WHERE ai_diagnosis IS NULL), so no record is inferred twice
  • Tiered invocation: Rule layer (Window Function) filters anomalies first; AI is only triggered for records matching rules — e.g., anomaly detection only calls AI for L2+ risk
  • Concise prompts: Closed-form output (returning only label names) consumes 5–10x fewer tokens than open-ended generation
  • Batch processing: Dynamic Table batch refresh naturally implements batch inference rather than real-time per-record calls
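The first two controls, on-demand triggering and tiered invocation, amount to a gate in front of the model call. The sketch below models that gate in plain Python with a stub in place of AI_COMPLETE; the record layout, the presence of a risk_level field on each record, the function names, and the stub's return value are assumptions for illustration.

```python
def needs_ai(record, min_level="L2"):
    """Gate: skip rows already diagnosed or below the risk threshold (L2+ only)."""
    return record["ai_diagnosis"] is None and record["risk_level"] >= min_level


def diagnose_pending(records, model_call):
    """Call `model_call` only for gated records; return how many calls were made."""
    calls = 0
    for record in records:
        if needs_ai(record):
            record["ai_diagnosis"] = model_call(record["dtc_code"])
            calls += 1
    return calls


def fake_model(dtc_code):
    # Stand-in for AI_COMPLETE; returns a short closed-form label.
    return "CHECK_SENSOR"


records = [
    {"dtc_code": "P0301", "risk_level": "L1", "ai_diagnosis": None},
    {"dtc_code": "U0100", "risk_level": "L3", "ai_diagnosis": None},
    {"dtc_code": "C0035", "risk_level": "L4", "ai_diagnosis": "already diagnosed"},
    {"dtc_code": "B1000", "risk_level": "L2", "ai_diagnosis": None},
]
first_pass = diagnose_pending(records, fake_model)
second_pass = diagnose_pending(records, fake_model)
print(first_pass, second_pass)  # 2 0
```

Only two of the four records reach the model on the first pass, and a second pass makes no calls at all, which is the behavior the cost controls aim for.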

6.5 Studio Task Scheduling Recommendations

  • M1 batch lake ingestion should set AUTO_SUSPEND_IN_SECOND = 600 to avoid CZLH-60011 from frequent Virtual Cluster suspensions
  • Set Session Flags inside the Studio Task itself (e.g., set cz.sql.cast.string.to.json.as.parse=true): flags apply per connection and must be set again on every new connection
  • RESET_KAFKA_GROUP_OFFSETS only takes effect when a PIPE is first created; record the current offset before rebuilding a PIPE

7. Validation Status (Tested 2026-06-05)

Validation Item | Status | Notes
Create 27 tables (M1-M8, ODS+DWD+DWS) | Passed | 00_create_schema.sql executed successfully, all compatibility issues resolved
M10 real-time pipeline table creation (9 tables) | Passed | Three files under 10-vehicle-kafka-ingest/ executed successfully, total 36 tables
Data Flywheel closed loop | Passed | fleet events auto-injected into sim_ods_scenarios (FLEET_EVENT×4)
OTA gradual rollout decision logic | Passed | success_rate=66.7% → RECOMMEND_PAUSE
Shadow mode evaluation | Passed | avg_divergence=0.267 < 0.3 → PASS
Annotation quality dashboard | Passed | avg_confidence=0.855, ai_prelabel_rate=100%
Kafka producer (Python) | Passed | 54 messages successfully sent to Alibaba Cloud Kafka (SASL_SSL/PLAIN)
Kafka PIPE lake ingestion | Passed | Kafka PIPE data pipeline validated
Offline Pipeline M2→M6→Flywheel | Passed | End-to-end full pipeline validated, row counts in 10 tables match expectations
Studio Task deployment | Passed | 8 Tasks deployed to ads_full_loop folder, including DDL + real-time pipeline + offline pipeline
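The two decision rows in the table can be read as a simple dual-metric gate. The sketch below is one possible shape of such a gate, not the validated SQL: only the 0.3 divergence threshold and the RECOMMEND_PAUSE and PASS labels come from the table, while the 0.95 success-rate threshold, the FAIL and RECOMMEND_EXPAND labels, and the function names are assumptions.

```python
def shadow_mode_verdict(avg_divergence, max_divergence=0.3):
    """PASS when average divergence is below the threshold quoted in the table."""
    return "PASS" if avg_divergence < max_divergence else "FAIL"  # FAIL label is assumed


def ota_recommendation(success_rate, avg_divergence, min_success_rate=0.95):
    """Dual-metric gate; min_success_rate and RECOMMEND_EXPAND are assumed values."""
    if success_rate < min_success_rate:
        return "RECOMMEND_PAUSE"
    if shadow_mode_verdict(avg_divergence) != "PASS":
        return "RECOMMEND_PAUSE"
    return "RECOMMEND_EXPAND"


# Values from the validation table: 2 of 3 upgrades succeeded, avg_divergence = 0.267
verdict = shadow_mode_verdict(0.267)
recommendation = ota_recommendation(2 / 3, 0.267)
print(verdict, recommendation)  # PASS RECOMMEND_PAUSE
```

With the tested values, shadow mode passes but the low success rate alone is enough to recommend pausing the rollout, consistent with the two results reported above.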

8. Quick Deployment Steps

Prerequisites

  • Singdata Lakehouse workspace is ready
  • If using AI_COMPLETE, pre-create CREATE CONNECTION (conn_dashscope, etc.)
  • If using Kafka PIPE, ensure Lakehouse Virtual Cluster and Kafka are in the same VPC (see §6.2)

Step 1: Create Tables (choose A or B)

Option A — Execute SQL directly

# All 27 tables for M1-M8 (including 8 DWS Dynamic Tables)
run 00-setup/00_create_schema.sql

# M10 real-time pipeline (9 tables + 4 Kafka PIPE definitions)
run 10-vehicle-kafka-ingest/01_ods_tables.sql
run 10-vehicle-kafka-ingest/02_dwd_tables.sql
run 10-vehicle-kafka-ingest/03_dws_tables.sql

Option B — Execute via Studio Task

In the ads_full_loop folder in Lakehouse Studio, execute in order:

  1. ads_01_ddl_ods — ODS layer 10 tables
  2. ads_02_ddl_dwd — DWD layer 15 tables
  3. ads_03_ddl_dws — DWS layer 8 Dynamic Tables
  4. ads_04_rt_ods_pipe — M10 real-time pipeline ODS + PIPE (requires Kafka VPC connectivity)
  5. ads_05_rt_dwd — M10 DWD Dynamic Table
  6. ads_06_rt_dws — M10 DWS Dynamic Table

Step 2: Write Demo Data

run 09-full-loop-demo/01_sample_data_gen.sql

Step 3: Run Full-Chain Pipeline

run 09-full-loop-demo/02_run_full_loop.sql

Step 4: Validate Data Flywheel

run 09-full-loop-demo/03_validate_flywheel.sql

Step 5: Deploy Scheduled Tasks

Execute the ads_offline_pipeline task in Studio (configured with cron 0 2 * * *), or manually trigger a test run.


9. Studio Task List

Task Name | Folder | Content | Schedule
ads_01_ddl_ods | ads_full_loop | ODS layer table creation (10 tables) | Manual
ads_02_ddl_dwd | ads_full_loop | DWD layer table creation (15 tables) | Manual
ads_03_ddl_dws | ads_full_loop | DWS Dynamic Tables (8 tables) | Manual
ads_04_rt_ods_pipe | ads_full_loop | M10 ODS + Kafka PIPE | Manual
ads_05_rt_dwd | ads_full_loop | M10 DWD Dynamic Table | Manual
ads_06_rt_dws | ads_full_loop | M10 DWS Dynamic Table | Manual
ads_07_rt_flywheel | ads_full_loop | Flywheel bridge SQL | Manual
ads_offline_pipeline | ads_full_loop | Offline Pipeline (M2→M6→DWS) | Daily at 02:00

Core Data Processing

Technology | Document
Batch lake ingestion | Lakehouse File Batch Import/Export Guide (COPY INTO)
Upsert / incremental writes | Lakehouse Upsert Operations Guide (MERGE INTO)
Incremental auto-refresh | Lakehouse Dynamic Table Development Guide
Continuous Kafka consumption | Lakehouse Continuous Data Ingestion Guide (Pipe)
Change data capture | Lakehouse CDC Change Data Capture Guide (Table Stream)
Historical version recovery | Lakehouse Historical Data Recovery Guide (Time Travel)
Window functions / feature engineering | Window Functions

Query Acceleration and Retrieval

Technology | Document
Inverted index / scenario retrieval | Lakehouse Query Acceleration Index Guide
Full-text search | Full-Text Search and Text Analysis Practical Guide

AI and Python

Technology | Document
In-SQL AI inference | AI_COMPLETE Function Reference
AI Functions overview | Lakehouse AI Functions Overview
Python DataFrame API | ZettaPark Python SDK

Security and Governance

Technology | Document
Row-level access control | Row-Level Security (Row Filter)
Column-level dynamic masking | Lakehouse Column-Level Security (Dynamic Masking)
Data quality checks | Data Quality Check (DQC): SQL-Driven Automated Validation

Task Scheduling and Operations

Technology | Document
Studio task scheduling | Studio Task Development and Operations