Data Ingestion

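Singdata Lakehouse supports three main categories of data ingestion: database sync (real-time or offline), file import, and message queue ingestion. Programmatic ingestion through SDKs and connectors, and migration from other data warehouses, are also covered below. Choose the section that matches your data source.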


I have a relational database (MySQL / PostgreSQL / SQL Server, etc.)

Recommended: Studio data sync tasks. They are configured visually, support a full load followed by real-time incremental sync, and require no coding.

| Scenario | Approach | Reference |
| --- | --- | --- |
| Single table or a few tables, real-time sync | Studio real-time sync task (CDC) | Real-time Sync Task |
| Full database sync, mirroring a source DB into Lakehouse | Studio multi-table real-time sync | Multi-table Real-time Sync Guide |
| Offline periodic sync (T+1 or H+1) | Studio offline sync task | Offline Sync Task · FAQ |
| Oracle database real-time sync | Bluepipe integration | Oracle Real-time Sync |
| Sync over private network (VPC / Private Link) | Studio + Private Link | RDS Sync over VPC |

I have files (CSV / Parquet / JSON, etc.)

| Scenario | Approach | Reference |
| --- | --- | --- |
| Files are local, quick import | Studio upload or PUT + COPY INTO | Import Local Data · Quick Upload |
| Files are on OSS / S3 / COS, one-time import | COPY INTO + Volume | Bulk Import from Object Storage |
| Files are continuously uploaded to OSS / S3, auto-ingest | Pipe (object storage mode) | Pipe Continuous Ingestion · Object Storage Pipe |
| Feishu spreadsheet / online spreadsheet import | Feishu data import | How to Import Feishu Spreadsheets |

I have a Kafka message stream

| Scenario | Approach | Reference |
| --- | --- | --- |
| Continuously consume a Kafka topic and write to a table | Pipe (Kafka mode) | Kafka Pipe |
| Configure Kafka sync visually via Studio | Studio real-time sync task | Kafka Real-time Sync |
| Complex message processing before ingestion | Kafka external table + Table Stream | Kafka External Table + Table Stream |

I have a custom data source or need programmatic ingestion

| Scenario | Approach | Reference |
| --- | --- | --- |
| Java application bulk write | Java SDK BulkLoad | Java SDK Bulk Upload |
| Java application real-time write (millisecond latency) | Java SDK RealtimeStream | Java SDK Real-time Upload |
| Python application bulk write | Python SDK | Python SDK Upload |
| Python data processing tasks | Studio Python task | Python Task Development |
| Write from Flink | Flink Connector | Flink Write to Lakehouse |
| Use open-source ETL tools | Airbyte / DataX | Ecosystem Integrations |

I'm migrating from another data warehouse

| Source | Reference |
| --- | --- |
| Migrating from Snowflake | Snowflake ETL Pipeline Migration Guide |
| Migrating from Spark data engineering | Spark Best Practices Migration Guide |
| Migrating from Alibaba Cloud Data Lake | Alibaba Cloud Data Lake Migration Guide |

Not sure which approach to use?

Use this decision tree:

What is your data source?
├── Relational database (MySQL / PG / SQL Server)
│   ├── Need real-time sync → Studio multi-table real-time sync
│   └── Offline periodic sync → Studio offline sync task
├── Files (CSV / Parquet / JSON)
│   ├── One-time import → COPY INTO
│   └── Continuous auto-ingest → Pipe (object storage mode)
├── Kafka message stream
│   ├── Simple consume and ingest → Pipe (Kafka mode)
│   └── Complex processing → Kafka external table + Table Stream
└── Custom / programmatic ingestion → SDK or Python task
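
If it helps to embed this choice in tooling (for example, an onboarding script), the decision tree above can be sketched as a small lookup. This is only an illustration: the function name `recommend_approach` and the short keys such as `"relational"` or `"one-time"` are hypothetical labels invented here, not part of any Singdata API. Only the returned approach names come from this page.

```python
from typing import Optional

# Hypothetical keys (left) map to the approach names used on this page (right).
_DECISION_TREE = {
    "relational": {
        "real-time": "Studio multi-table real-time sync",
        "offline": "Studio offline sync task",
    },
    "files": {
        "one-time": "COPY INTO",
        "continuous": "Pipe (object storage mode)",
    },
    "kafka": {
        "simple": "Pipe (Kafka mode)",
        "complex": "Kafka external table + Table Stream",
    },
}


def recommend_approach(source: str, need: Optional[str] = None) -> str:
    """Return the approach the decision tree suggests for a source and need."""
    # Custom / programmatic ingestion is a leaf: no second question is asked.
    if source == "custom":
        return "SDK or Python task"
    if source not in _DECISION_TREE:
        raise ValueError(f"unknown data source: {source!r}")
    branch = _DECISION_TREE[source]
    if need not in branch:
        options = ", ".join(sorted(branch))
        raise ValueError(f"for {source!r}, need must be one of: {options}")
    return branch[need]


if __name__ == "__main__":
    print(recommend_approach("files", "continuous"))
```

The lookup covers only the branches of the tree. Cases that appear solely in the tables above, such as Oracle sync or private network sync, would need extra entries.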

For a full comparison of all approaches, see: A Comprehensive Guide to Ingesting Data into Singdata Lakehouse