Multi-Table Batch Sync Task

Overview

Multi-Table Batch Sync Task is a bulk data synchronization capability provided by Yunqi Lakehouse Studio, supporting scheduled synchronization of an entire database or multiple tables from source to Lakehouse. Unlike Multi-Table Real-time Sync Tasks, Multi-Table Batch Sync adopts periodic full synchronization, making it suitable for scenarios where data latency requirements are relatively relaxed and data updates follow periodic patterns.

Use Cases

Multi-Table Batch Sync Tasks are suitable for the following scenarios:

  • Full Database Migration: Batch synchronization of all tables from source database to Lakehouse, reducing table-by-table configuration effort
  • Periodic Data Updates: Source data updates in fixed cycles (e.g., daily, hourly) without requiring real-time synchronization
  • Sharded Database and Table Consolidation: Consolidating data from multiple sharded databases and tables into unified target tables
  • Periodic Data Reconciliation: Ensuring target data consistency with source through periodic full synchronization
  • Resource Optimization: Reducing resource consumption through batch sync when data latency requirements are not stringent

Features

Supported Data Sources

Source

  • MySQL
  • PostgreSQL
  • SQL Server
  • Aurora MySQL
  • Aurora PostgreSQL
  • PolarDB MySQL
  • PolarDB PostgreSQL

Target

  • Lakehouse

Synchronization Modes

  1. Full Database Mirroring: Synchronizes all tables from the source database to the target.
  2. Multi-Table Mirroring: Selects multiple source tables for synchronization to the target, maintaining independent table structures.
  3. Multi-Table Merge: Consolidates multiple source tables (e.g., sharded databases and tables) into one or more target tables.
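
As a concrete illustration of Multi-Table Merge, the sketch below (all database and table names are hypothetical) shows how sharded order tables could be routed into one target table. This is only a conceptual model; the actual merge rules are configured in the task UI (see Step 2).

```python
# Illustrative sketch only: how Multi-Table Merge consolidates sharded
# source tables into a single target table. Names are hypothetical.
import re

# Sharded sources: orders_00 ... orders_15 spread across two databases.
source_tables = [
    ("shop_db_0", f"orders_{i:02d}") for i in range(8)
] + [
    ("shop_db_1", f"orders_{i:02d}") for i in range(8, 16)
]

# A merge rule matches shards by pattern and routes them to one target.
shard_pattern = re.compile(r"^orders_\d{2}$")
target_table = "orders"  # single consolidated table in Lakehouse

mapping = {
    (db, tbl): target_table
    for db, tbl in source_tables
    if shard_pattern.match(tbl)
}
print(f"{len(mapping)} shards merge into '{target_table}'")
```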

Core Capabilities

  • Automatic Table Creation: Automatically creates tables on target when they don't exist, supporting both primary key and non-primary key tables. Supports flexible use of parameters to customize target table naming.
  • Schema Evolution: Supports automatic synchronization of source schema changes to target (optional).
  • Flexible Write Modes: Supports multiple write modes including overwrite, upsert, and more.
  • Concurrency Control: Supports configuration of grouping strategies and concurrency levels to optimize sync performance and flexibly control pressure on source.
  • Scheduling Management: Flexible scheduling configuration based on Cron expressions, can be orchestrated with other task nodes.

Operation Steps

Step 1: Create Task

  1. In the Lakehouse Studio Development module, click the "New" button
  2. Select "Multi-Table Batch Sync" from task types (located under "Data Synchronization" category)
  3. Enter task name and select folder

Step 2: Configure Source Data

Select Data Source Type

In the "Data Source Type" section, select source and target data source types:

  • Source: Select the database type to be synchronized (e.g., PostgreSQL, MySQL)
  • Target: Select Lakehouse

Select Source Data Source

  1. Select a configured data source connection from the "Source Data Source" dropdown
  2. The system will automatically load the database list under that data source

Configure Sync Objects

Select sync objects based on your requirements:

Full Database Mirroring Mode

  • Select an entire database for synchronization
  • The system will automatically synchronize all tables under that database

Multi-Table Mirroring Mode

  • Expand database and schema structures
  • Check the tables to be synchronized
  • Supports batch selection at three levels: database, schema, and table

Multi-Table Merge Mode

  • Additionally configure virtual tables to express the merge relationship from multiple source tables to target tables
  • Supports batch rules at the database, schema, and table levels to define the scope of source tables that merge into the same target table

Sync Object Filtering

  • Use search box for quick table location
  • Supports filtering by selection status, schema, or table
  • Upper right corner displays statistics of selected objects (e.g., "Selected: 3 Databases 45 Tables")

Step 3: Configure Target Settings

Select Target Data Source

Select the target Lakehouse data source in the "Data Source Type" section.

Configure Target Data Source

  1. Select target workspace
  2. Configure target namespace (Schema)

Namespace Rules

The system provides three naming rules:

  • Mirror Source: Maintains consistency with source schema name
  • Specify Selection: Manually select an existing target schema
  • Custom: Use rule expressions to customize target schema name

Rule Expression Instructions

  • Supports variables: {SOURCE_DATABASE} represents the source database name
  • Supports custom task parameters, such as ${bizdate}
  • Example: {SOURCE_DATABASE}_${bizdate} (see the expansion sketch below)

Target Table Naming Rules

Configure target table naming rules:

  • Mirror Source: Maintains consistency with source table name
  • Custom: Use rule expressions to customize target table name

Rule Expression Instructions

  • Supports variables: {SOURCE_DATABASE} represents the source database name, {SOURCE_SCHEMA} the source schema name, and {SOURCE_TABLE} the source table name
  • Supports custom task parameters, such as ${bizdate}
  • Example: {SOURCE_DATABASE}_{SOURCE_TABLE}_${bizdate}
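
To make the rule expressions concrete, here is a minimal sketch (not the product's implementation) of how such a rule could expand. The values shop_db, public, orders, and 20240115 are hypothetical.

```python
# Minimal sketch of naming-rule expansion. {VAR} placeholders come from
# the source object; ${param} placeholders come from custom task parameters.
def expand(rule: str, variables: dict, params: dict) -> str:
    for name, value in variables.items():
        rule = rule.replace("{" + name + "}", value)
    for name, value in params.items():
        rule = rule.replace("${" + name + "}", value)
    return rule

variables = {"SOURCE_DATABASE": "shop_db", "SOURCE_SCHEMA": "public",
             "SOURCE_TABLE": "orders"}
params = {"bizdate": "20240115"}  # resolved per scheduled run

print(expand("{SOURCE_DATABASE}_${bizdate}", variables, params))
# -> shop_db_20240115
print(expand("{SOURCE_DATABASE}_{SOURCE_TABLE}_${bizdate}", variables, params))
# -> shop_db_orders_20240115
```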

Partition Configuration (Optional)

If creating partitioned tables, configure:

  • Whether to create as partitioned table
  • Partition field selection
  • Partition value expression

Step 4: Configure Mapping Relationships

View Sync Objects

The system displays all configured sync objects and their mapping relationships:

  • Left side shows source table list
  • Middle shows field-level mapping details
  • Right side shows complete target table path

Field Mapping

  • System automatically identifies primary keys (marked as PK)
  • Displays field type mapping relationships
  • Supports viewing detailed field mappings for each table

Statistics display:

  • Number of successfully mapped objects
  • Number of failed mappings (if any)

Step 5: Configure Sync Rules

Source Data Rules

Source Data Field Deletion

  • Continue syncing; write NULL for the deleted fields

Source Data Field Addition

  • Auto-adapt; add the new fields to the target table

Source Data Table Deletion

  • Auto-adapt; continue syncing and ignore the deleted tables
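
Conceptually, these rules amount to aligning each source row against the current target schema. The sketch below is a conceptual model only, not the actual sync engine; the column and row values are hypothetical.

```python
# Conceptual sketch of the rules above (not the actual sync engine).
def adapt_columns(target_columns: list, source_columns: list) -> list:
    # Field added at source: auto-adapt by appending it to the target.
    return target_columns + [c for c in source_columns if c not in target_columns]

def align_row(row: dict, target_columns: list) -> dict:
    # Field deleted at source: keep the target column and write NULL (None).
    return {col: row.get(col) for col in target_columns}

# Table deleted at source: the table is simply skipped in later runs.
target = ["id", "name", "price"]
target = adapt_columns(target, ["id", "name", "discount"])  # new field
print(align_row({"id": 1, "name": "widget", "discount": 0.1}, target))
# -> {'id': 1, 'name': 'widget', 'price': None, 'discount': 0.1}
```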

Grouping Strategy

Controls grouping and concurrent execution of sync tasks:

Smart Grouping

  • System automatically groups based on table size and other characteristics

Static Grouping

  • Groups based on fixed number of tables
  • Configurable number of tables per group

Concurrency Control

Number of Tables per Group

  • Manually specified in static grouping mode
  • Controls the number of tables in a single group
  • Default: 4

Maximum Source Connections per Group

  • Limits maximum concurrent connections to source per group
  • Default: 4

Number of Concurrent Groups

  • Number of groups that can execute simultaneously
  • Default: 2
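
Taken together, these settings bound the load on the source. A back-of-envelope sketch with the defaults (the total table count is hypothetical):

```python
# How the concurrency knobs interact, using the documented defaults.
import math

tables_total = 45            # hypothetical selection size
tables_per_group = 4         # static grouping: tables in a single group
max_conns_per_group = 4      # cap on concurrent source connections per group
concurrent_groups = 2        # groups executing simultaneously

groups = math.ceil(tables_total / tables_per_group)
peak_source_connections = concurrent_groups * max_conns_per_group
waves = math.ceil(groups / concurrent_groups)

print(f"{groups} groups, peak {peak_source_connections} source connections, "
      f"{waves} sequential waves of groups")
# -> 12 groups, peak 8 source connections, 6 sequential waves of groups
```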

Data Write Mode

Configure write strategies for different table types:

Non-Primary Key Table Write Mode

  • Overwrite: Clears target table and writes new data with each sync

Primary Key Table Write Mode

  • Overwrite: Clears target table and writes new data with each sync
  • Upsert: Performs insert or update operations based on primary key
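
The difference between the two modes can be sketched with an in-memory dict standing in for the target table (the real engine writes to Lakehouse; the primary key name "id" is hypothetical):

```python
# Conceptual sketch of the two write modes.
def overwrite(target: dict, batch: list) -> None:
    target.clear()                 # clear the target table on every sync
    for row in batch:
        target[row["id"]] = row    # then write the new full snapshot

def upsert(target: dict, batch: list) -> None:
    for row in batch:
        target[row["id"]] = row    # insert or update, keyed by primary key

table = {1: {"id": 1, "qty": 5}}
upsert(table, [{"id": 1, "qty": 7}, {"id": 2, "qty": 3}])
print(table)  # row 1 updated in place, row 2 inserted

overwrite(table, [{"id": 9, "qty": 1}])
print(table)  # previous rows gone; only the new snapshot remains
```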

Step 6: Task Debugging (Optional)

After completing and saving the task configuration, you will land on the task overview page. The "Run" button in the upper right corner triggers a debug run to verify that the data sources and task configuration are correct. After triggering a run, you can view its details under "Run History" in the lower right corner of the page.

Step 7: Configure Scheduling Properties

Click the "Schedule" button to configure periodic scheduling rules for the task:

  1. Scheduling Frequency: Select the scheduling frequency as needed, such as daily or hourly

  2. Scheduling Cycle:

    • Supports multiple cycles, including hourly, daily, weekly, and monthly
    • Supports visual configuration (the system automatically converts it to a Cron expression)
  3. Effective Date and Expiration Date: Set the start and end dates for task scheduling

  4. Dependency Configuration: Set upstream and downstream dependencies for the task (optional)

  5. Other Configuration: Instance information and other settings, the same as the periodic scheduling configuration for regular tasks
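
For reference, a visual configuration of "daily at 02:00" typically corresponds to a Cron expression like the one below. This sketch previews fire times with the third-party croniter library; the exact Cron dialect the platform generates may differ.

```python
# Illustrative only: previewing when a daily-02:00 Cron expression fires.
# Requires 'croniter' (pip install croniter).
from datetime import datetime
from croniter import croniter

daily_2am = "0 2 * * *"   # minute hour day-of-month month weekday
it = croniter(daily_2am, datetime(2024, 1, 15))
for _ in range(3):
    print(it.get_next(datetime))
# 2024-01-15 02:00:00
# 2024-01-16 02:00:00
# 2024-01-17 02:00:00
```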

Step 8: Submit Task

  1. Click "Submit" button after completing configuration
  2. System performs configuration validation
  3. After successful submission, task enters scheduling system and executes automatically according to set cycle

Task Operations

In "Task Operations", Multi-Table Batch Sync Tasks are categorized under "Scheduled Tasks". Click task name to drill down into detailed task information.

Task Details

In the task details page, you can view:

Task Details Tab

  • Displays DAG diagram of task upstream and downstream dependencies
  • Task configuration information
  • Scheduling configuration
  • Owner information

Task Instances Tab

  • Displays complete instance list of the task, including manually triggered temporary runs and scheduled periodic instances
  • Click on specific instance for details

Node Code Tab

  • Code representation of task configuration information

Sync Objects Tab

  • View all source and target tables configured in the task
  • View field mapping relationships
  • Reflects current configuration status in real-time

Operation Log Tab

  • Displays audit information for operations such as pause/start, publish/update, etc.

Task Operations

Edit:

  • Quick jump to task development interface to modify task configuration

Pause/Resume: Pause or resume scheduled execution

Offline:

  • Stops the task and removes it from the scheduling system; the task reverts to the unsubmitted state
  • Offline (including downstream): takes the current task and its downstream tasks offline together. A task that has downstream tasks cannot be taken offline individually

Backfill:

  • Performs data backfill for historical cycles

Instance Management

Instance List

Each task execution generates an instance. Under "Task Operations" -> "Instance Operations" tab, you can view task instances of Multi-Table Batch Sync Tasks.

Instance Details

Click instance ID to view detailed instance information.

Instance Details Tab

  • Upstream and downstream DAG task lineage relationships for this instance
  • Instance status, start time, end time
  • Runtime statistics

Sync Objects Tab

  • View all tables synchronized in this execution
  • Display read rows, written rows, sync rate for each table
  • Show execution status for each table

Script Content Tab

  • Code representation of task configuration information for this instance

Execution Log Tab

  • View detailed execution logs
  • Includes logs for initialization, table synchronization, completion, and other stages

Operation Log Tab

  • Displays audit logs for operations such as rerun, set success/set failure, etc.

Instance Operations

Rerun

  • Re-execute this instance

  • Supports selecting rerun scope:

    • All Objects: Re-synchronize all tables
    • Failed Objects Only: Rerun only failed synchronizations

Set Success/Set Failure

  • Manually set final status of instance

Cancel Run

  • Forcibly terminate running instance

Single Table Operations (in Sync Objects Tab)

  • View Sync Details: View synchronization details for a single table
  • Re-sync: Re-synchronize only this table
  • Force Stop: Terminate synchronization of this table

Monitoring and Alerting

Configure Alert Rules

A Multi-Table Batch Sync Task as a whole supports the same monitoring alerts as other scheduled tasks. In addition, alerts can be configured for individual tables within a sync task:

Alert Item: Task Instance Failure

  • Alert Type: Event alert
  • Trigger Condition: Entire task instance execution fails
  • Supports configuring alert filter conditions

Alert Item: Multi-Table Batch Sync Single Table Sync Failure

  • Alert Type: Event alert
  • Trigger Condition: Any target table synchronization failure in the task
  • Alert Granularity: per target table; each failed table triggers its own alert
  • Supports configuring alert recipients and notification methods

Important Notes

Permission Requirements

  • Source Permissions: The account configured in the data source needs to have permissions to read source database metadata and table data (SELECT privilege)
  • Target Permissions: The task owner needs to have permissions to create tables and insert data in target Lakehouse (CREATE, INSERT privileges)

Performance Considerations

  1. Source Pressure: Configure concurrency appropriately to avoid putting excessive pressure on the source database.
  2. Initialization Time: The first execution initializes all sync objects, which may take considerable time.
  3. Table Quantity Limit: Keep the number of tables synchronized in a single task within a reasonable range; too many tables may reduce execution efficiency.

Data Consistency

  1. Overwrite Mode: Each execution completely refreshes the target table data.
  2. Upsert Mode: Incremental updates are applied based on the primary key.
  3. Source Data Changes: Batch sync performs periodic full synchronization; source changes made between two runs are reflected in the next run.

Schema Evolution Limitations

When enabling Schema Evolution functionality, note:

  • Does not support modifying primary key fields (Lakehouse primary key table limitation)
  • Field type modifications only support same-type extensions (e.g., int8→int16→int32→int64)
  • Cross-type conversions (e.g., int→double) are not supported
  • Recommended to use this feature when source schema is relatively stable
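
A sketch of the same-type extension rule, using the integer type names from the examples above:

```python
# Sketch of the "same-type extension" rule: integer types may only widen
# along int8 -> int16 -> int32 -> int64; cross-type conversions and
# narrowing are rejected. Type names follow the doc's examples.
INT_WIDENING = ["int8", "int16", "int32", "int64"]

def is_allowed_change(old: str, new: str) -> bool:
    if old == new:
        return True
    if old in INT_WIDENING and new in INT_WIDENING:
        return INT_WIDENING.index(new) > INT_WIDENING.index(old)
    return False  # e.g. int -> double is a cross-type conversion

assert is_allowed_change("int16", "int64")       # widening: allowed
assert not is_allowed_change("int64", "int32")   # narrowing: rejected
assert not is_allowed_change("int32", "double")  # cross-type: rejected
```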

Best Practices

  1. Task Granularity: Reasonably divide task granularity based on business relevance and synchronization requirements
  2. Scheduling Time: Choose time windows with lower source database pressure for execution
  3. Testing and Validation: Use "Run" function for small-batch data testing before formal scheduling
  4. Monitoring Configuration: Configure alert rules promptly to ensure quick awareness of synchronization anomalies
  5. Regular Checks: Regularly review "Table Auto-change Records" to understand schema changes

For other questions, please contact technical support or consult the complete product documentation.