Data Ingestion

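Singdata Lakehouse supports three categories of data ingestion: real-time database sync, file import, and message queue ingestion. Programmatic ingestion through SDKs and connectors, and migration from other platforms, are covered further down this page. Choose the section that matches your data source.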

I have a relational database (MySQL / PostgreSQL / SQL Server, etc.)

Recommended: Studio data sync tasks. They are configured visually, support a full load followed by real-time incremental sync, and require no coding.

Scenario	Approach	Reference
Single table or a few tables, real-time sync	Studio real-time sync task (CDC)	Real-time Sync Task
Full database sync, mirroring a source DB into Lakehouse	Studio multi-table real-time sync	Multi-table Real-time Sync Guide
Offline periodic sync (T+1 or H+1)	Studio offline sync task	Offline Sync Task · FAQ
Oracle database real-time sync	Bluepipe integration	Oracle Real-time Sync
Sync over private network (VPC / Private Link)	Studio + Private Link	RDS Sync over VPC

End-to-end example: Complete workflow from MySQL to BI reports

I have files (CSV / Parquet / JSON, etc.)

Scenario	Approach	Reference
Files are local, quick import	Studio upload or PUT + COPY INTO	Import Local Data · Quick Upload
Files are on OSS / S3 / COS, one-time import	COPY INTO + Volume	Bulk Import from Object Storage
Files are continuously uploaded to OSS / S3, auto-ingest	Pipe (object storage mode)	Pipe Continuous Ingestion · Object Storage Pipe
Feishu spreadsheet / online spreadsheet import	Feishu data import	How to Import Feishu Spreadsheets

Choosing between the two object storage Pipe modes: use LIST_PURGE if you don't need to keep the source files after upload. Use EVENT_NOTIFICATION if you need to retain the source files or require near-real-time triggering. See Pipe Continuous Ingestion for details.
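The rule above can be written down as a tiny helper, which may be handy when reviewing many pipelines at once. This is only an illustrative sketch: the function name and its two parameters are hypothetical and are not part of any Singdata API. Only the two mode names come from this page.

```python
def choose_pipe_mode(keep_source_files: bool, need_near_real_time: bool) -> str:
    """Return the Pipe mode suggested by the guidance on this page.

    Hypothetical helper for illustration; it encodes the selection rule only
    and does not call Singdata Lakehouse.
    """
    # EVENT_NOTIFICATION is suggested when files must be retained
    # or when near-real-time triggering is required.
    if keep_source_files or need_near_real_time:
        return "EVENT_NOTIFICATION"
    # Otherwise the source files do not need to be kept, so LIST_PURGE fits.
    return "LIST_PURGE"


if __name__ == "__main__":
    print(choose_pipe_mode(keep_source_files=False, need_near_real_time=False))
    print(choose_pipe_mode(keep_source_files=True, need_near_real_time=False))
```

The helper only encodes the selection rule; the actual Pipe definition syntax is described in Pipe Continuous Ingestion.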

I have a Kafka message stream

Scenario	Approach	Reference
Continuously consume a Kafka topic and write to a table	Pipe (Kafka mode)	Kafka Pipe
Configure Kafka sync visually via Studio	Studio real-time sync task	Kafka Real-time Sync
Complex message processing before ingestion	Kafka external table + Table Stream	Kafka External Table + Table Stream

I have a custom data source or need programmatic ingestion

Scenario	Approach	Reference
Java application bulk write	Java SDK BulkLoad	Java SDK Bulk Upload
Java application real-time write (millisecond latency)	Java SDK RealtimeStream	Java SDK Real-time Upload
Python application bulk write	Python SDK	Python SDK Upload
Python data processing tasks	Studio Python task	Python Task Development
Write from Flink	Flink Connector	Flink Write to Lakehouse
Use open-source ETL tools	Airbyte / DataX	Ecosystem Integrations

I'm migrating from another data warehouse

Source	Reference
Migrating from Snowflake	Snowflake ETL Pipeline Migration Guide
Migrating from Spark data engineering	Spark Best Practices Migration Guide
Migrating from Alibaba Cloud Data Lake	Alibaba Cloud Data Lake Migration Guide

Not sure which approach to use?

Use this decision tree:

What is your data source?
├── Relational database (MySQL / PG / SQL Server)
│   ├── Need real-time sync → Studio multi-table real-time sync
│   └── Offline periodic sync → Studio offline sync task
├── Files (CSV / Parquet / JSON)
│   ├── One-time import → COPY INTO
│   └── Continuous auto-ingest → Pipe (object storage mode)
├── Kafka message stream
│   ├── Simple consume and ingest → Pipe (Kafka mode)
│   └── Complex processing → Kafka external table + Table Stream
└── Custom / programmatic ingestion → SDK or Python task
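For teams that want to apply the decision tree consistently, for example in an onboarding checklist or an internal tool, it can be sketched as a small function. The function name, the source keys ("relational", "files", "kafka", "custom"), and the keyword flags are assumptions made for this illustration; the returned strings are the recommendations from the tree above.

```python
def recommend_ingestion(
    source: str,
    *,
    real_time: bool = False,
    continuous: bool = False,
    complex_processing: bool = False,
) -> str:
    """Map a data source and its requirements to the approach in the decision tree.

    Hypothetical helper for illustration; it does not call Singdata Lakehouse.
    """
    if source == "relational":
        # Real-time needs go to multi-table real-time sync; otherwise offline sync.
        return (
            "Studio multi-table real-time sync"
            if real_time
            else "Studio offline sync task"
        )
    if source == "files":
        # Continuous auto-ingest uses a Pipe; a one-time import uses COPY INTO.
        return "Pipe (object storage mode)" if continuous else "COPY INTO"
    if source == "kafka":
        # Complex processing goes through an external table plus Table Stream.
        return (
            "Kafka external table + Table Stream"
            if complex_processing
            else "Pipe (Kafka mode)"
        )
    if source == "custom":
        return "SDK or Python task"
    raise ValueError(f"Unknown data source: {source!r}")


if __name__ == "__main__":
    print(recommend_ingestion("relational", real_time=True))
    print(recommend_ingestion("files", continuous=True))
    print(recommend_ingestion("kafka"))
```

Flags that do not apply to a given source are ignored, which mirrors the tree: each branch asks only one question.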

For a full comparison of all approaches, see: A Comprehensive Guide to Ingesting Data into Singdata Lakehouse