Task Billing
Question: How are data sync tasks billed?
Answer: The cost of data sync tasks is generally composed of two categories: hardware resource usage fees and data transfer network fees.
- Hardware resource usage fees are billed on a pay-as-you-go basis according to the specification and usage duration of the sync-type compute cluster. When the cluster is not in use or is stopped, no further fees are charged.
- Data transfer network fees are charged based on actual usage. When features such as Data Integration are used to batch download or export data from Lakehouse over the public Internet, Internet transfer fees are incurred, measured and billed by the actual amount of data transferred. Traffic generated by uploading data from other data sources to Lakehouse over the Internet is not charged. If dedicated lines, Private Link, or other network products are used for cross-cloud, cross-region, or cross-VPC connectivity, the connectivity itself incurs fees. Depending on the connectivity method, costs incurred on the Singdata Lakehouse side are collected by Singdata, while costs incurred in the cloud platform account are collected directly by the cloud platform.
- See: Billing Instructions
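As a rough illustration of how the two fee categories combine, the sketch below estimates the cost of a task. The function name and every unit price are hypothetical placeholders, not actual Singdata rates, and fees for dedicated lines or Private Link are left out; refer to Billing Instructions for real pricing.

```python
def estimate_sync_cost(cluster_hours, cluster_rate_per_hour,
                       egress_gb, egress_rate_per_gb, ingress_gb=0.0):
    """Estimate sync cost as compute fee plus Internet egress fee.

    All rates are hypothetical placeholders. Ingress (uploads into
    Lakehouse over the Internet) is accepted but never charged.
    """
    if min(cluster_hours, egress_gb, ingress_gb) < 0:
        raise ValueError("usage values must be non-negative")
    compute_fee = cluster_hours * cluster_rate_per_hour  # pay-as-you-go cluster time
    egress_fee = egress_gb * egress_rate_per_gb          # data exported over the Internet
    ingress_fee = 0.0                                    # uploads are not billed
    return compute_fee + egress_fee + ingress_fee


# Example with made-up prices: 2 cluster hours and 10 GB exported.
print(estimate_sync_cost(2, 1.5, 10, 0.1, ingress_gb=500))
```

A stopped cluster contributes zero hours, which matches the statement that no fees accrue while the cluster is not in use.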
Data Sources
Question: What data sources are currently supported for offline sync?
Answer: On the task configuration page, when selecting the source and target, all supported data source types are listed in full. If no data source is available, click the + button to create a new one first, then use it. Offline sync data sources can be freely combined in pairs to build a rich variety of sync links. See: Data Source Management
Question: How to verify the accessibility of a data source in offline sync?
Answer: The Test Connectivity feature in the data source list (also available on the data source configuration page) can be used to test the access connectivity between the data sync task environment and the data source. If the test result shows a connectivity failure, check the accuracy of the configuration (such as the connection string address, username, and password) as well as the network status. If there is a network connectivity issue, refer to the Network Connectivity section below, which provides corresponding solutions.
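When Test Connectivity fails, a basic TCP reachability check can help separate network problems from credential problems. The following is a minimal sketch using only the Python standard library; the host and port are placeholders, and the check reflects the network of the machine it runs on, which may differ from the data sync task environment.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder endpoint; replace with the address from the connection string.
    print(can_connect("db.example.com", 3306))
```

A True result only shows that the port is reachable; username, password, and database permissions still need to be verified through Test Connectivity.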
Network Connectivity
Question: If the data source is in a VPC environment, how can network connectivity for data sync be established?
Answer: If the data source is in a VPC environment, the network environment where the data sync task runs cannot access it by default. Connectivity can be established through the following methods:
- Enable public network access on the source side and configure the public network access address in the data source.
- Use SSH Tunnel: Data Source Management
- Use Private Link:
- Combine Private Link and SSH: Synchronizing RDS Data in VPC via Private Link and SSH
Question: If the source database has a whitelist restriction, how can you ensure the data sync task can connect to the data source?
Answer: You need to add the data sync service egress IP address to the whitelist of the source. The IP addresses differ by service region. For specific addresses and configuration methods, see: Data Source IP Whitelist Configuration Guide
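Before running a task, it can be useful to confirm that every egress IP is covered by the source whitelist. The sketch below uses Python's standard ipaddress module; the addresses are placeholders from the RFC 5737 documentation ranges, not real service egress IPs, which are listed in the Data Source IP Whitelist Configuration Guide.

```python
import ipaddress


def missing_from_whitelist(egress_ips, whitelist):
    """Return the egress IPs not covered by any whitelist entry.

    Whitelist entries may be single addresses or CIDR ranges.
    """
    networks = [ipaddress.ip_network(entry, strict=False) for entry in whitelist]
    return [
        ip for ip in egress_ips
        if not any(ipaddress.ip_address(ip) in net for net in networks)
    ]


# Placeholder values for illustration only.
egress_ips = ["203.0.113.10", "203.0.113.11", "198.51.100.7"]
whitelist = ["203.0.113.0/24", "10.0.0.5"]
print(missing_from_whitelist(egress_ips, whitelist))  # ['198.51.100.7']
```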
Task Configuration
Question: What is the main process for configuring an offline sync task?
Answer: The main steps are: creating data sources, selecting source and target, configuring field mapping, and configuring sync rules such as concurrency and dirty data management (optional). After configuration, you can manually trigger a test run, and then submit periodic scheduled execution. See: Batch Sync
Question: How to configure the task to control the pressure on the source during data sync?
Answer: In the Sync Rule Configuration area of the offline sync task, two configuration items control the load on the source: maximum concurrency limits the number of connections to the source, and sync rate limits the read pressure. The higher the concurrency and sync rate, the greater the pressure on the source.
Question: How to handle "dirty data" in the source data?
Answer: "Dirty data" primarily refers to data that cannot be properly written to the target. The most common case is a field type mismatch, such as writing a String-type field value into a target INT field. Dirty data is not written to the target. In the Sync Rule Configuration area of the offline sync task, two configuration items, Auto-end Task on Dirty Data and Collect Dirty Data, control how the sync task handles it:
- If Auto-end Task on Dirty Data is set to Yes, the task will fail and exit when the number of dirty data rows reaches the specified threshold.
- If Auto-end Task on Dirty Data is set to No, the task will continue executing even if there is dirty data and will not exit automatically.
- The Collect Dirty Data option controls whether to collect dirty data for review after the task run completes. A maximum of 1000 rows of dirty data can be collected, and it is retained for up to 7 days.
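The interaction of the two options can be sketched as follows. This is an illustrative model of the behavior described above, not the actual implementation of the sync engine; the function and parameter names are hypothetical, and the 7-day retention is not modeled.

```python
MAX_COLLECTED_ROWS = 1000  # collection cap stated in the text


def sync_rows(rows, convert, auto_end, threshold, collect):
    """Write convertible rows and treat conversion failures as dirty data.

    Returns (written, dirty_count, collected). Raises RuntimeError when
    auto_end is True and the dirty row count reaches threshold.
    """
    written, collected, dirty_count = [], [], 0
    for row in rows:
        try:
            written.append(convert(row))
        except (ValueError, TypeError):
            # The row cannot be written to the target, so it is dirty data.
            dirty_count += 1
            if collect and len(collected) < MAX_COLLECTED_ROWS:
                collected.append(row)
            if auto_end and dirty_count >= threshold:
                raise RuntimeError(f"dirty data threshold reached: {dirty_count} rows")
    return written, dirty_count, collected


# String values written into an INT field: "x" is a type mismatch.
print(sync_rows(["1", "x", "3"], int, auto_end=False, threshold=1, collect=True))
# ([1, 3], 1, ['x'])
```

With auto_end=True and threshold=1, the same input would stop at the first dirty row instead of completing.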
Error Troubleshooting
Question: How to resolve the error java.lang.OutOfMemoryError: Java heap space when running a task?
Cause: This usually happens when the data being read contains large fields or rows, or, for data sources that support batch reads, when the amount of data synchronized in a single batch exceeds the heap memory of the sync task's compute process.
Solution: Increase the memory of the sync task's compute process. Add the parameter taskmanager.memory.process.size in Task Development -> Advanced Parameters. Valid units are m or g, and the default value is 1600m.
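A value for this parameter can be checked before it is submitted. The helper below is a hypothetical sketch, not part of the product; it assumes lowercase units only and that 1g equals 1024m, neither of which is confirmed by the source.

```python
import re

_SIZE_PATTERN = re.compile(r"(\d+)([mg])")


def parse_memory_size(value):
    """Convert a size such as '1600m' or '2g' to a number of megabytes.

    Assumption: g is treated as 1024 m.
    """
    match = _SIZE_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"expected a number followed by m or g, got {value!r}")
    number, unit = int(match.group(1)), match.group(2)
    return number * 1024 if unit == "g" else number


default_mb = parse_memory_size("1600m")
candidate_mb = parse_memory_size("4g")
# Only a value larger than the default can relieve the heap space error.
print(candidate_mb > default_mb)  # True
```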
Question: How to resolve the error java.lang.OutOfMemoryError: Direct buffer memory when running a task?
Cause: The sync task's compute process has insufficient off-heap memory, which may be due to the following:
- The source data contains large fields or rows, or, for data sources that support batch reads, a single batch synchronizes a large amount of data, and the source uses off-heap memory as a data cache (for example, Elasticsearch).
- The target uses off-heap memory to batch and send data (for example, Lakehouse), and a batch exceeds the batch buffer size because of large rows.
Solution: Typically, the following solutions are available:
- Refer to the heap memory overflow solution and adjust the taskmanager.memory.process.size parameter.
- Separately adjust the task off-heap memory size by adjusting the taskmanager.memory.task.off-heap.size parameter, e.g., 256m or 512m.
- If the data source supports setting batch size, reduce the configured value appropriately. However, note that this may lead to reduced sync efficiency.
Question: How to resolve the error CZLH-67000:Out of Memory undefined: could not allocate block of size 262KB (1.0GB/1.0GB used)?
Cause: The underlying storage allows a default maximum of 1 GB of memory for index data. If the table is too large, primary key (PK) data operations may exceed this limit.
Solution:
- For single-table offline sync, it is recommended to recreate the table and add the cz.storage.art.max.memory.size.bytes parameter when creating the table. The specific size should be set based on the actual table size.
- For multi-table full sync errors, add lh.table.cz.storage.art.max.memory.size.bytes in the task's advanced parameters, then resubmit and start the task.
Question: How to resolve the error entity content is too long [320177567] for the configured buffer limit [104857600]?
Cause: Data is read from Elasticsearch (ES) in batches via HTTP requests. The ES HTTP buffer has a size limit, with a default value of 100MB. If a batch exceeds this limit, the error occurs.
Solution:
- Reduce the batchSize for batch reading.
- Add an advanced parameter to increase the HTTP buffer size limit: studio.connector.es.buffer_limit.
Note: Reducing batchSize will slightly reduce sync efficiency, while increasing studio.connector.es.buffer_limit will increase task memory usage.
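The two bracketed numbers in the error message are the size of the batch and the configured limit, both in bytes, so they indicate how far the limit must be raised or the batch reduced. The following sketch extracts them; the helper name is hypothetical, and the value format expected by studio.connector.es.buffer_limit is not stated in the source, so only byte counts are computed.

```python
import re

_ERROR_PATTERN = re.compile(
    r"entity content is too long \[(\d+)\] for the configured buffer limit \[(\d+)\]"
)


def parse_buffer_error(message):
    """Return (content_bytes, limit_bytes) from the error text, or None."""
    match = _ERROR_PATTERN.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


message = ("entity content is too long [320177567] "
           "for the configured buffer limit [104857600]")
content, limit = parse_buffer_error(message)
# Ratio of batch size to limit: how much smaller the batch must become
# if the limit is left unchanged.
print(content, limit, round(content / limit, 2))
```

In this example the batch is roughly three times the 100MB limit, so either the limit needs to exceed about 320,000,000 bytes or batchSize needs to drop to about a third of its current value.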
Question: How to resolve the error The maximum buffer size of 16777216 is insufficient to read the data of a single field. This issue typically arises when a quotation begins but does not conclude within the confines of this buffer's maximum limit.?
Cause: The third-party CSV SDK used by Data Integration to read and parse CSV data reads data in batches and has a limit on data size. The default buffer size is 16MB.
Solution:
Add the advanced parameter: studio.connector.csv_reader.max_buffer_size. Values with MB/GB suffix are supported, e.g., 32MB.
Note: This parameter currently only takes effect for OSS/S3.
Question: How to resolve the error Field at index 0 in record starting at line 1 exceeds the max field size of 16777216 characters?
Cause: The third-party CSV SDK used by Data Integration to read and parse CSV data has a limit on the size of a single field. The default is 16MB.
Solution:
Add the advanced parameter: studio.connector.csv_reader.max_field_size. Values with MB/GB suffix are supported, e.g., 32MB.
Note: This parameter currently only takes effect for OSS/S3.
Question: How to resolve the error Field at index 5 in record starting at line 1 exceeds the max record size of 67108864 characters?
Cause: The third-party CSV SDK used by Data Integration to read and parse CSV data has a limit on the size of a single record. The default is 64MB.
Solution:
Add the advanced parameter: studio.connector.csv_reader.max_record_size. Values with MB/GB suffix are supported, e.g., 128MB.
Note: This parameter currently only takes effect for OSS/S3.
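The numeric limits in the three CSV error messages correspond to the documented defaults: 16777216 is 16 × 1024 × 1024 and 67108864 is 64 × 1024 × 1024. The sketch below converts a suffixed value to that form, which helps when choosing a new value larger than the one reported in the error. The helper is hypothetical, and it assumes the MB/GB suffixes are 1024-based, which is consistent with the defaults above but not stated explicitly.

```python
_UNITS = {"MB": 1024 ** 2, "GB": 1024 ** 3}


def size_to_bytes(value):
    """Convert a value such as '32MB' or '1GB' using 1024-based units."""
    text = value.strip().upper()
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    raise ValueError(f"expected a value ending in MB or GB, got {value!r}")


print(size_to_bytes("16MB"))   # 16777216, the default buffer and field limit
print(size_to_bytes("64MB"))   # 67108864, the default record limit
print(size_to_bytes("32MB") > 16777216)  # True: 32MB raises the 16MB default
```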
