Lakehouse Bulkload Quick Start
Introduction
Bulkload is a high-throughput batch data writing interface provided by Lakehouse, particularly suited to large-scale batch import scenarios. Data written with Bulkload becomes visible immediately after the import task is committed.
Principle of Bulk Import
The bulk upload SDK provides an efficient data import mechanism for Singdata Lakehouse. A simplified description of its working principle, followed by a flowchart, is given below:
- Data Upload: Through the SDK, your data is first uploaded to the object storage service. The performance of this step is affected by local network speed and the number of concurrent connections.
- Trigger Import: After the data upload completes, calling the bulkloadStream.close() method triggers an SQL command that imports the data from object storage into the Lakehouse table. Do not call bulkloadStream.close() repeatedly within a single task; call it only once, at the end.
- Computing Resources: A General Purpose Virtual Cluster is recommended for uploading data, because general-purpose compute resources are better suited to batch and data-loading jobs. How quickly data is imported from object storage into the Lakehouse table depends on the size of the computing resources you configure.
- Shard Upload Optimization: When processing compressed data larger than 1 GB, assign a unique shard ID to each concurrent thread or process in the createRow method. This fully exploits the parallelism of multithreading or multiprocessing and significantly improves import efficiency. Best practice is to derive the number of shard IDs from the number of concurrent threads, so that each thread writes under its own shard ID. If multiple threads are assigned the same shard ID, data written later may overwrite earlier data and cause data loss. To ensure that all shard data is imported into the table correctly, call bulkloadStream.close() once, after all concurrent writes have finished, to submit the entire import task (see the sketch after the flowchart below).
[Flowchart: principle of bulk import — data is uploaded to object storage, then imported into the Lakehouse table when bulkloadStream.close() is called.]
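To make the shard-ID and single-close() rules concrete, here is a minimal Java sketch. Only createRow, setValue, apply, close, and RowStream.BulkLoadOperate come from this guide; the client and stream construction (ClickZettaClient.newBuilder(), newBulkloadStreamBuilder(), url(), schema(), table()) and the column names are illustrative assumptions, so verify them against the SDK before use.

```java
import java.util.ArrayList;
import java.util.List;

public class ShardedBulkloadSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical client/stream setup; replace with the real SDK entry points.
        ClickZettaClient client = ClickZettaClient.newBuilder()
                .url("<lakehouse-jdbc-url>")                   // see "Usage Example" below
                .build();
        BulkloadStream bulkloadStream = client.newBulkloadStreamBuilder()
                .schema("my_schema")
                .table("my_table")
                .operate(RowStream.BulkLoadOperate.APPEND)
                .build();

        int threadCount = 4;                                   // one shard ID per thread
        List<Thread> threads = new ArrayList<>();
        for (int shardId = 0; shardId < threadCount; shardId++) {
            final int sid = shardId;                           // unique shard ID for this thread
            Thread t = new Thread(() -> {
                for (int i = 0; i < 1_000; i++) {
                    Row row = bulkloadStream.createRow(sid);   // bind the row to this shard
                    row.setValue("id", (long) i);              // placeholder columns
                    row.setValue("name", "row-" + i);
                    bulkloadStream.apply(row, sid);            // write under the same shard ID
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();                                          // wait for all shards to finish
        }
        bulkloadStream.close();                                // call exactly once; triggers the SQL import
        client.close();
    }
}
```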
Applicable Scenarios
The SDK for bulk file uploads is particularly suitable for the following situations:
- One-time large data import: When you need to import a large amount of data, whether as a one-off bulk task or a periodic job with long intervals.
- Low import frequency: If your import frequency is low (intervals greater than five minutes), the bulk import SDK is still a good fit even when each individual import is small.
Inapplicable Scenarios
The SDK for bulk file uploads is not suitable for the following situations:
- Real-time data import: If you need to import data at short intervals (less than about five minutes), use the real-time ingestion interfaces instead to meet latency requirements.
Write Restrictions
Please note that BulkloadStream does not support writing to primary key (pk) tables.
Create BulkloadStream
To create a bulk write stream through the ClickZetta client, refer to the following sample code:
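The original sample code is not reproduced on this page; the snippet below is a hedged reconstruction. The builder names (ClickZettaClient.newBuilder(), newBulkloadStreamBuilder(), url(), schema(), table()) are assumptions, while operate() and RowStream.BulkLoadOperate are described in the next section.

```java
// Hypothetical construction of a bulk write stream; adjust names to the actual SDK.
ClickZettaClient client = ClickZettaClient.newBuilder()
        .url("<lakehouse-jdbc-url>")                 // JDBC connection string from Lakehouse Studio
        .build();

BulkloadStream bulkloadStream = client.newBulkloadStreamBuilder()
        .schema("my_schema")                         // target schema
        .table("my_table")                           // target (non-primary-key) table
        .operate(RowStream.BulkLoadOperate.APPEND)   // see "Operation Types" below
        .build();
```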
Operation Types
When creating a Bulkload, you can specify the operation type with the operate method:
- RowStream.BulkLoadOperate.APPEND: Append mode, adds data to the table.
- RowStream.BulkLoadOperate.OVERWRITE: Overwrite mode, deletes existing data in the table before writing the new data.
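For example, a periodic full reload could switch the same (assumed) builder call to overwrite mode:

```java
// Overwrite mode: existing rows in the table are deleted before the new data is written.
BulkloadStream overwriteStream = client.newBulkloadStreamBuilder()
        .schema("my_schema")
        .table("my_table")
        .operate(RowStream.BulkLoadOperate.OVERWRITE)
        .build();
```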
Writing Data
Use a Row object to represent the data to be written; populate it by calling the row.setValue method.
- The createRow method takes an integer shard ID when creating a Row object. Combined with multithreading or multiprocessing, writing under multiple distinct shard IDs effectively increases write throughput.
- The first parameter of setValue is the field name and the second is the value; the value's type must match the column type in the table.
- The apply method writes the row and requires both the Row object and the corresponding shard ID (see the sketch below).
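Putting the three calls together, a minimal sketch follows. The stream setup from the previous section is assumed, and the column names id, name, and price are placeholders.

```java
int shardId = 0;                                  // one shard ID per concurrent writer

// createRow binds the new Row to a shard ID.
Row row = bulkloadStream.createRow(shardId);

// setValue(fieldName, value): the value's Java type must match the column type.
row.setValue("id", 1L);                           // BIGINT column
row.setValue("name", "example");                  // STRING column
row.setValue("price", 9.99);                      // DOUBLE column

// apply writes the row, again under the same shard ID.
bulkloadStream.apply(row, shardId);
```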
Writing Complex Type Data
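The original content of this section is not shown here. As a hypothetical sketch, complex columns (for example ARRAY and MAP) might be populated with ordinary Java collections through the same setValue call; whether the SDK accepts List and Map values this way is an assumption to verify against the SDK documentation.

```java
// Assumption: setValue accepts ordinary Java collections for complex-typed columns.
Row row = bulkloadStream.createRow(shardId);

java.util.List<String> tags = java.util.Arrays.asList("new", "sale");   // ARRAY<STRING> column
java.util.Map<String, Long> counters = new java.util.HashMap<>();       // MAP<STRING, BIGINT> column
counters.put("clicks", 10L);
counters.put("views", 250L);

row.setValue("tags", tags);          // pass the List for the ARRAY column
row.setValue("counters", counters);  // pass the Map for the MAP column
bulkloadStream.apply(row, shardId);
```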
Submit Data
Data written in batches becomes visible only after the import task is submitted, so the submission step is important. Calling bulkloadStream.close() submits the task.
- Use bulkloadStream.getState() to get the state of the BulkloadStream.
- If the submission fails, retrieve the error message with bulkloadStream.getErrorMessage().
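A hedged sketch of the commit-and-check sequence follows. close(), getState(), and getErrorMessage() are the calls described above; the state enum name and its SUCCESS value are assumptions.

```java
// Commit the import task; call close() exactly once after all rows are applied.
bulkloadStream.close();

// Assumption: getState() exposes a terminal SUCCESS/FAILED state after close().
if (bulkloadStream.getState() == BulkloadState.SUCCESS) {
    System.out.println("Bulkload committed; data is now visible in the table.");
} else {
    // On failure, getErrorMessage() returns the reason reported by the import job.
    System.err.println("Bulkload failed: " + bulkloadStream.getErrorMessage());
}
```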
Usage Example
The following is an example of using Bulkload to write complex type data:
- The Lakehouse URL is the JDBC connection string, which can be found in Lakehouse Studio under Management -> Workspace.
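The complete example is not reproduced on this page; below is a hedged end-to-end sketch that ties the previous sections together (connection URL from Lakehouse Studio, APPEND stream, one ARRAY-typed column, a single close(), and a state check). The class and builder names are assumptions, except for the calls documented above.

```java
public class BulkloadComplexTypeExample {
    public static void main(String[] args) throws Exception {
        // URL comes from Lakehouse Studio: Management -> Workspace (JDBC connection string).
        String url = "<lakehouse-jdbc-url>";

        ClickZettaClient client = ClickZettaClient.newBuilder().url(url).build();
        BulkloadStream stream = client.newBulkloadStreamBuilder()
                .schema("my_schema")
                .table("events")                                          // non-primary-key table
                .operate(RowStream.BulkLoadOperate.APPEND)
                .build();

        int shardId = 0;
        for (int i = 0; i < 100; i++) {
            Row row = stream.createRow(shardId);
            row.setValue("event_id", (long) i);
            row.setValue("tags", java.util.Arrays.asList("a", "b"));     // ARRAY<STRING> column (assumed)
            stream.apply(row, shardId);
        }

        stream.close();                                                   // commit the import task once
        if (stream.getState() != BulkloadState.SUCCESS) {                 // state enum name is an assumption
            System.err.println("Import failed: " + stream.getErrorMessage());
        }
        client.close();
    }
}
```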