Multi-Table Batch Sync Task
Overview
Multi-Table Batch Sync Task is a bulk data synchronization capability provided by Yunqi Lakehouse Studio, supporting scheduled synchronization of an entire database or multiple tables from source to Lakehouse. Unlike Multi-Table Real-time Sync Tasks, Multi-Table Batch Sync adopts periodic full synchronization, making it suitable for scenarios where data latency requirements are relatively relaxed and data updates follow periodic patterns.
Use Cases
Multi-Table Batch Sync Tasks are suitable for the following scenarios:
- Full Database Migration: Batch synchronization of all tables from source database to Lakehouse, reducing table-by-table configuration effort
- Periodic Data Updates: Source data updates in fixed cycles (e.g., daily, hourly) without requiring real-time synchronization
- Sharded Database and Table Consolidation: Consolidating data from multiple sharded databases and tables into unified target tables
- Periodic Data Reconciliation: Ensuring target data consistency with source through periodic full synchronization
- Resource Optimization: Reducing resource consumption through batch sync when data latency requirements are not stringent
Features
Supported Data Sources
Source
- MySQL
- PostgreSQL
- SQL Server
- Aurora MySQL
- Aurora PostgreSQL
- PolarDB MySQL
- PolarDB PostgreSQL
Target
- Lakehouse
Synchronization Modes
- Full Database Mirroring: Synchronizes all tables from the source database to the target.
- Multi-Table Mirroring: Selects multiple source tables for synchronization to the target, maintaining independent table structures.
- Multi-Table Merge: Consolidates multiple source tables (e.g., sharded databases and tables) into one or more target tables.
Core Capabilities
- Automatic Table Creation: Automatically creates tables on target when they don't exist, supporting both primary key and non-primary key tables. Supports flexible use of parameters to customize target table naming.
- Schema Evolution: Supports automatic synchronization of source schema changes to target (optional).
- Flexible Write Modes: Supports multiple write modes, including overwrite and upsert.
- Concurrency Control: Supports configuration of grouping strategies and concurrency levels to optimize sync performance and flexibly control pressure on source.
- Scheduling Management: Flexible scheduling configuration based on Cron expressions, can be orchestrated with other task nodes.
Operation Steps
Step 1: Create Task
- In the Lakehouse Studio Development module, click the "New" button
- Select "Multi-Table Batch Sync" from task types (located under "Data Synchronization" category)
- Enter task name and select folder
Step 2: Configure Source Data
Select Data Source Type
In the "Data Source Type" section, select source and target data source types:
- Source: Select the database type to be synchronized (e.g., PostgreSQL, MySQL, etc.)
- Target: Select Lakehouse
Select Source Data Source
- Select a configured data source connection from the "Source Data Source" dropdown
- The system will automatically load the database list under that data source
Configure Sync Objects
Select sync objects based on your requirements:
Full Database Mirroring Mode
- Select an entire database for synchronization
- The system will automatically synchronize all tables under that database
Multi-Table Mirroring Mode
- Expand database and schema structures
- Check the tables to be synchronized
- Supports batch selection at three levels: database, schema, and table
Multi-Table Merge Mode
- Additionally, configure virtual tables to express the merge relationships from multiple source tables to target tables
- Supports batch rule configuration at the database, schema, and table levels to define the source table scope and merge all matching tables into the same target table (see the sketch below)
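Merge rules are typically used to consolidate sharded sources. As a minimal sketch of what such a rule expresses, assume a hypothetical shard layout of databases `db_00`…`db_03`, each holding tables `order_00`…`order_31`; the names and matching logic below are illustrative only, since actual rules are configured in the UI:

```python
import re

# Hypothetical shard-merge rule: any table named order_<NN> in any db_<NN>
# source database maps to the single merged target table "orders".
SHARD_TABLE = re.compile(r"^order_\d{2}$")

def merged_target(source_db: str, source_table: str):
    """Return the merged target table name, or None if the rule does not match."""
    if source_db.startswith("db_") and SHARD_TABLE.match(source_table):
        return "orders"
    return None

print(merged_target("db_00", "order_07"))   # -> orders
print(merged_target("db_01", "customers"))  # -> None (not part of the merge)
```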
Sync Object Filtering
- Use search box for quick table location
- Supports filtering by selection status, schema, or table
- Upper right corner displays statistics of selected objects (e.g., "Selected: 3 Databases 45 Tables")
Step 3: Configure Target Settings
Select Target Data Source
Select target Lakehouse data source in the "Data Source Type" section
Configure Target Data Source
- Select target workspace
- Configure target namespace (Schema)
Namespace Rules
The system provides three naming rules:
- Mirror Source: Maintains consistency with source schema name
- Specify Selection: Manually select an existing target schema
- Custom: Use rule expressions to customize target schema name
Rule Expression Instructions
- Supports built-in variables: {SOURCE_DATABASE} represents the source database name
- Supports custom task parameters, such as ${bizdate}. For task parameter usage, see: Task Parameters
- Example: {SOURCE_DATABASE}_${bizdate}
Target Table Naming Rules
Configure target table naming rules:
- Mirror Source: Maintains consistency with source table name
- Custom: Use rule expressions to customize target table name
Rule Expression Instructions
- Supports built-in variables: {SOURCE_DATABASE} represents the source database name, {SOURCE_SCHEMA} represents the source schema name, {SOURCE_TABLE} represents the source table name
- Supports custom task parameters, such as ${bizdate}. For task parameter usage, see: Task Parameters
- Example: {SOURCE_DATABASE}_{SOURCE_TABLE}_${bizdate}
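To make the expansion concrete, the sketch below resolves both example expressions for a source table `sales.public.orders` with `bizdate=20240601`. The substitution function is an illustration of the behavior described above, not the product's implementation:

```python
# Illustrative resolution of the naming-rule expressions above.
# {VAR} placeholders come from the sync context; ${param} placeholders
# come from task parameters such as bizdate.
def resolve(rule: str, context: dict, params: dict) -> str:
    for key, value in context.items():
        rule = rule.replace("{" + key + "}", value)
    for key, value in params.items():
        rule = rule.replace("${" + key + "}", value)
    return rule

context = {"SOURCE_DATABASE": "sales", "SOURCE_SCHEMA": "public", "SOURCE_TABLE": "orders"}
params = {"bizdate": "20240601"}

print(resolve("{SOURCE_DATABASE}_${bizdate}", context, params))
# -> sales_20240601  (target schema)
print(resolve("{SOURCE_DATABASE}_{SOURCE_TABLE}_${bizdate}", context, params))
# -> sales_orders_20240601  (target table)
```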
Partition Configuration (Optional)
If creating partitioned tables, configure:
- Whether to create as partitioned table
- Partition field selection
- Partition value expression
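As a rough illustration of the three options above, a common pattern is to partition by business date so that each scheduled run writes into its own partition. The key names in this sketch are invented for readability and are not the product's actual configuration schema:

```python
# Hypothetical partition configuration; key names are illustrative only.
partition_config = {
    "create_as_partitioned": True,   # whether to create the target as a partitioned table
    "partition_field": "pt",         # partition column on the target table
    "partition_value": "${bizdate}", # expression: each run writes to that day's partition
}
print(partition_config)
```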
Step 4: Configure Mapping Relationships
View Sync Objects
The system displays all configured sync objects and their mapping relationships:
- Left side shows source table list
- Middle shows field-level mapping details
- Right side shows complete target table path
Field Mapping
- System automatically identifies primary keys (marked as PK)
- Displays field type mapping relationships
- Supports viewing detailed field mappings for each table
Statistics display:
- Number of successfully mapped objects
- Number of failed mappings (if any)
Step 5: Configure Sync Rules
Source Data Rules
Source Data Field Deletion
- Continues syncing and writes Null values for the deleted fields
Source Data Field Addition
- Auto-adapts and adds the new fields to the target table
Source Data Table Deletion
- Auto-adapts: the sync continues and the deleted tables are ignored
Grouping Strategy
Controls grouping and concurrent execution of sync tasks:
Smart Grouping
- System automatically groups based on table size and other characteristics
Static Grouping
- Groups based on fixed number of tables
- Configurable number of tables per group
Concurrency Control
Number of Tables per Group
- Manually specified in static grouping mode
- Controls the number of tables in a single group
- Default: 4
Maximum Source Connections per Group
- Limits maximum concurrent connections to source per group
- Default: 4
Number of Concurrent Groups
- Number of groups that can execute simultaneously
- Default: 2
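As a worked example of how these settings interact, take the defaults above together with the 45 tables from the earlier selection example (plain arithmetic, not product code):

```python
import math

tables = 45                 # tables selected for sync
tables_per_group = 4        # static grouping: tables per group (default 4)
max_conns_per_group = 4     # max source connections per group (default 4)
concurrent_groups = 2       # groups running at the same time (default 2)

groups = math.ceil(tables / tables_per_group)
peak_source_connections = concurrent_groups * max_conns_per_group

print(groups)                   # 12 groups in total
print(peak_source_connections)  # at most 8 concurrent source connections
```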
Data Write Mode
Configure write strategies for different table types:
Non-Primary Key Table Write Mode
- Overwrite: Clears target table and writes new data with each sync
Primary Key Table Write Mode
- Overwrite: Clears target table and writes new data with each sync
- Upsert: Performs insert or update operations based on primary key
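The semantic difference between the two primary-key write modes can be shown with a toy key-value model; this is a sketch of the semantics only, not the actual write path:

```python
def overwrite(target: dict, batch: list) -> dict:
    """Overwrite: the target table is cleared, then the new batch is written."""
    return {pk: row for pk, row in batch}

def upsert(target: dict, batch: list) -> dict:
    """Upsert: existing primary keys are updated, new ones are inserted;
    rows absent from the batch are left untouched."""
    merged = dict(target)
    merged.update(batch)
    return merged

target = {1: "alice", 2: "bob"}
batch = [(2, "bob-updated"), (3, "carol")]
print(overwrite(target, batch))  # {2: 'bob-updated', 3: 'carol'}
print(upsert(target, batch))     # {1: 'alice', 2: 'bob-updated', 3: 'carol'}
```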
Step 6: Task Debugging (Optional)
After completing and saving the task configuration, you will land on the task overview page. The "Run" button in the upper right corner lets you trigger a trial run to verify that the data sources and task configuration are correct. After triggering the run, you can view run details under "Run History" in the lower right corner of the page.
Step 7: Configure Scheduling Properties
Click the "Schedule" button to configure periodic scheduling rules for the task:
- Scheduling Frequency: Select the scheduling frequency as needed, such as daily or hourly
- Scheduling Cycle: Supports multiple cycles including hourly, daily, weekly, and monthly; can be configured visually (the system automatically converts the selection into a Cron expression)
- Effective Date and Expiration Date: Set the start and end dates for task scheduling
- Dependency Configuration: Set upstream and downstream dependencies for the task (optional)
- Other Configuration: Instance information and other settings, the same as the periodic scheduling configuration for regular tasks
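For reference, common visual schedule choices correspond to Cron expressions roughly as follows. Standard 5-field Cron is shown for illustration; the exact Cron dialect the scheduler generates may differ:

```python
# Standard 5-field cron expressions (minute hour day-of-month month day-of-week).
# Shown for illustration; the scheduler's exact cron dialect may differ.
examples = {
    "daily at 02:00":           "0 2 * * *",
    "hourly, on the hour":      "0 * * * *",
    "every Monday at 06:30":    "30 6 * * 1",
    "1st of each month, 01:00": "0 1 1 * *",
}
for description, cron in examples.items():
    print(f"{description:28s} -> {cron}")
```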
Step 8: Submit Task
- Click "Submit" button after completing configuration
- System performs configuration validation
- After successful submission, task enters scheduling system and executes automatically according to set cycle
Task Operations
In "Task Operations", Multi-Table Batch Sync Tasks are categorized under "Scheduled Tasks". Click task name to drill down into detailed task information.
Task Details
In the task details page, you can view:
Task Details Tab
- Displays DAG diagram of task upstream and downstream dependencies
- Task configuration information
- Scheduling configuration
- Owner information
Task Instances Tab
- Displays complete instance list of the task, including manually triggered temporary runs and scheduled periodic instances
- Click on specific instance for details
Node Code Tab
- Code representation of task configuration information
Sync Objects Tab
- View all source and target tables configured in the task
- View field mapping relationships
- Reflects current configuration status in real-time
Operation Log Tab
- Displays audit information for operations such as pause/start, publish/update, etc.
Task Operations
- Edit: Jump to the task development interface to modify the task configuration
- Pause/Resume: Pause or resume scheduled execution
- Offline: Stops the task and removes it from the scheduling system; the task reverts to an unsubmitted scheduling state
- Offline (including downstream): Takes the current task offline together with its downstream tasks; if the task has downstream tasks, it cannot be taken offline individually
- Backfill: Performs data backfill for historical scheduling cycles
Instance Management
Instance List
Each task execution generates an instance. Under "Task Operations" -> "Instance Operations" tab, you can view task instances of Multi-Table Batch Sync Tasks.
Instance Details
Click instance ID to view detailed instance information.
Instance Details Tab
- Upstream and downstream DAG task lineage relationships for this instance
- Instance status, start time, end time
- Runtime statistics
Sync Objects Tab
- View all tables synchronized in this execution
- Display read rows, written rows, sync rate for each table
- Show execution status for each table
Script Content Tab
- Code representation of task configuration information for this instance
Execution Log Tab
- View detailed execution logs
- Includes logs for initialization, table synchronization, completion, and other stages
Operation Log Tab
- Displays audit logs for operations such as rerun, set success/set failure, etc.
Instance Operations
Rerun
- Re-executes this instance
- Supports selecting the rerun scope: All Objects (re-synchronize all tables) or Failed Objects Only (rerun only the failed synchronizations)
Set Success/Set Failure
- Manually set final status of instance
Cancel Run
- Forcibly terminate running instance
Single Table Operations (in Sync Objects Tab)
- View Sync Details: View synchronization details for a single table
- Re-sync: Re-synchronize only this table
- Force Stop: Terminate synchronization of this table
Monitoring and Alerting
Configure Alert Rules
A Multi-Table Batch Sync Task as a whole supports the same monitoring alerts as other scheduled tasks. In addition, alerts can be configured for individual tables within the sync task:
Alert Item: Task Instance Failure
- Alert Type: Event alert
- Trigger Condition: Entire task instance execution fails
- Supports configuring alert filter conditions
Alert Item: Multi-Table Batch Sync Single Table Sync Failure
- Alert Type: Event alert
- Trigger Condition: Any target table synchronization failure in the task
- Alert Granularity: Based on target table unit, each failed table independently triggers alert
- Supports configuring alert recipients and notification methods
Important Notes
Permission Requirements
- Source Permissions: The account configured in the data source needs to have permissions to read source database metadata and table data (SELECT privilege)
- Target Permissions: The task owner needs to have permissions to create tables and insert data in target Lakehouse (CREATE, INSERT privileges)
Performance Considerations
- Source Pressure: Configure concurrency appropriately to avoid excessive pressure on source database.
- Initialization Time: First execution requires initialization of all sync objects, which may take considerable time.
- Table Quantity Limit: It's recommended to control the number of tables synchronized in a single task within a reasonable range; too many tables may affect task execution efficiency.
Data Consistency
- Overwrite Mode: Each execution completely refreshes the target table data.
- Upsert Mode: Incremental updates are applied based on the primary key.
- Source Data Changes: Batch sync performs periodic full synchronization; changes made at the source between two runs are reflected in the next synchronization.
Schema Evolution Limitations
When enabling Schema Evolution functionality, note:
- Does not support modifying primary key fields (Lakehouse primary key table limitation)
- Field type modifications only support same-type extensions (e.g., int8→int16→int32→int64)
- Cross-type conversions (e.g., int→double) are not supported
- Recommended to use this feature when source schema is relatively stable
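A sketch of the widening rule described above; the type chains listed here are illustrative, and the product's type mapping is authoritative:

```python
# Same-type widening chains: a change is allowed only if it moves rightward
# within a single chain (e.g. int8 -> int32), never across chains
# (e.g. int32 -> float64). Illustrative, not the product's exact type table.
WIDENING_CHAINS = [
    ["int8", "int16", "int32", "int64"],
    ["float32", "float64"],
]

def widening_allowed(old: str, new: str) -> bool:
    for chain in WIDENING_CHAINS:
        if old in chain and new in chain:
            return chain.index(old) <= chain.index(new)
    return False

print(widening_allowed("int8", "int64"))     # True  (same-type widening)
print(widening_allowed("int32", "int16"))    # False (narrowing)
print(widening_allowed("int32", "float64"))  # False (cross-type conversion)
```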
Best Practices
- Task Granularity: Reasonably divide task granularity based on business relevance and synchronization requirements
- Scheduling Time: Choose time windows with lower source database pressure for execution
- Testing and Validation: Use "Run" function for small-batch data testing before formal scheduling
- Monitoring Configuration: Configure alert rules promptly to ensure quick awareness of synchronization anomalies
- Regular Checks: Regularly review "Table Auto-change Records" to understand schema changes
Related Documentation
- Task Development
- Data Source Connection Configuration
- Scheduling Management
- Task and Instance Operations
For other questions, please contact technical support or consult the complete product documentation.
