Task Scheduling and Instance Execution

Overview

Concepts

In task development, offline tasks run periodically (for example, several times a day or once a day). Tasks also often depend on one another as upstream and downstream, and they execute in dependency order. Periodic execution combined with upstream-downstream dependencies is the core concept of scheduling.

Tasks and Instances

When a user submits a development script to the production environment, a periodic task is created. Each run of a periodic task generates a periodic instance, and each periodic instance may contain multiple execution records, such as automatic reruns.

Therefore, periodic tasks and periodic instances typically have a one-to-many relationship.
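
The one-to-many relationships described above can be sketched as a small data model. This is only an illustration: the class and field names (PeriodicTask, PeriodicInstance, ExecutionRecord) are hypothetical and are not taken from the product.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class ExecutionRecord:
    """One attempt at running an instance (the first run or a rerun)."""
    attempt: int
    succeeded: bool


@dataclass
class PeriodicInstance:
    """One scheduled run of a periodic task; it may hold several records."""
    scheduled_time: datetime
    records: List[ExecutionRecord] = field(default_factory=list)


@dataclass
class PeriodicTask:
    """A script submitted to production; it owns many instances."""
    name: str
    instances: List[PeriodicInstance] = field(default_factory=list)


# Hypothetical example: one instance whose first run failed and whose
# automatic rerun succeeded, giving two execution records.
task = PeriodicTask(name="daily_report")
instance = PeriodicInstance(scheduled_time=datetime(2024, 1, 1, 2, 0))
instance.records.append(ExecutionRecord(attempt=1, succeeded=False))
instance.records.append(ExecutionRecord(attempt=2, succeeded=True))
task.instances.append(instance)
```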

How the Scheduling System Works

The scheduling system has two main responsibilities: generating instances and running instances on schedule. The specific concepts are as follows:

Instance Generation

After a task is submitted to the scheduling system, the system generates instances based on the scheduling configuration. The system currently provides two instance generation modes:

Next-Day Effective: All instances that need to run on the next day are generated in a single batch at 22:00 of the current day.
Effective Immediately After Publishing: After the user clicks submit, the system immediately generates the instances that need to run on the current day. Instances for the next day are still generated in a single batch at 22:00, following the system default.

The specific generation rules are detailed in the "Instance Generation" section under Scheduling Properties Configuration.

Running Instances on Schedule

Each day, before running an instance, the scheduling system checks whether the instance can run. Generally, two conditions must be met:

The task has reached its scheduled time.
All upstream tasks have succeeded.

If both conditions are met, the scheduling system submits the instance for execution. Once the instance obtains the resources it needs, it runs as planned (Instance Running).

If either condition is not met, the task instance remains in the Not Started state. To prevent large numbers of instances from staying in Not Started for a long time because of configuration issues or paused upstream tasks, which wastes resources, the system kills an instance once the user-configured "Scheduling Wait Duration" is exceeded. The instance is then marked as failed.

Note: If a periodic task in the production environment has no scheduling wait duration configured, a default of 3 days applies. In both cases the wait is measured from the instance's scheduled time: once the configured duration (or the 3-day default) has elapsed, the instance changes from Not Started to Failed, regardless of whether all upstream tasks have succeeded.
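
The run check and the wait-duration timeout can be sketched as a single decision function. This is a minimal sketch, not the scheduler's actual code: the function name, the state strings, and the order in which the checks are applied are assumptions; only the two run conditions and the 3-day default come from the text above.

```python
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_WAIT = timedelta(days=3)  # default stated in the note above


def evaluate_instance(
    now: datetime,
    scheduled_time: datetime,
    upstream_succeeded: bool,
    wait_duration: Optional[timedelta] = None,
) -> str:
    """Decide what happens to an instance that is currently Not Started.

    Returns "NOT_STARTED", "SUBMIT", or "FAILED" (hypothetical labels).
    """
    if now < scheduled_time:
        return "NOT_STARTED"  # scheduled time not reached yet
    limit = wait_duration if wait_duration is not None else DEFAULT_WAIT
    if now - scheduled_time > limit:
        return "FAILED"  # wait exceeded, regardless of upstream state
    if upstream_succeeded:
        return "SUBMIT"  # both conditions met
    return "NOT_STARTED"  # still waiting on upstream tasks


t0 = datetime(2024, 1, 1, 8, 0)
print(evaluate_instance(t0 + timedelta(hours=1), t0, upstream_succeeded=False))
```

With a one-hour wait duration configured, the same instance checked two hours after its scheduled time would be reported as FAILED instead.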

Scheduling Properties Configuration

In the Development module, open any task and click Schedule Configuration to configure the task's scheduling properties: basic information, scheduling time, instance information, scheduling dependencies, and task outputs.

This document focuses on scheduling time, instance information, and scheduling dependencies. For other content, please refer to the help documentation.

Scheduling Time

Scheduling Cycle: Configure the scheduling cycle, scheduling frequency, scheduling start time, and scheduling end time. From these settings the system automatically generates a standard cron expression. The scheduling engine parses this expression and uses a time-wheel algorithm to derive every execution instance that falls within upcoming cycles.

The system currently supports three scheduling frequencies: Minute, Hour, and Day. If a task's cron expression sets a single value or multiple values (rather than *) in a position above the day level, such as week, month, or year, and those values are valid and do not conflict, the task's time type does not change: it does not become a weekly, monthly, or yearly task. The values only restrict which days the task runs on; within those days it still runs as a minute, hour, or day task. In other words, there is no concept of weekly, monthly, or yearly tasks.

For example:

A task that runs once at a specified time on the 1st and 3rd of each month is a daily task, not a monthly task.
A task that runs every two hours on Monday and Wednesday of each week is an hourly task, not a weekly or daily task.
The same applies to minute tasks: a task whose run interval is at the minute level is a minute task.
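
The classification rule in these examples can be sketched as follows. This is a heuristic under stated assumptions: it assumes a five-field cron layout (minute, hour, day of month, month, day of week), which may differ from the expression format the system actually generates, and the function name is hypothetical.

```python
def classify_frequency(cron: str) -> str:
    """Classify a cron expression as a "minute", "hour", or "day" task.

    Assumes five fields: minute hour day-of-month month day-of-week.
    Fields above the hour level are ignored, because they only restrict
    which days the task runs on.
    """
    minute, hour, _day_of_month, _month, _day_of_week = cron.split()

    def repeats(cron_field: str) -> bool:
        # A field matches more than one value if it uses *, a list,
        # a range, or a step.
        return any(ch in cron_field for ch in "*,-/")

    if repeats(minute):
        return "minute"
    if repeats(hour):
        return "hour"
    return "day"


print(classify_frequency("0 8 1,3 * *"))    # once at 08:00 on the 1st and 3rd
print(classify_frequency("0 */2 * * 1,3"))  # every two hours on Mon and Wed
print(classify_frequency("*/10 * * * *"))   # every ten minutes
```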

Effective Date: The date range during which the scheduling task is in effect. Users can choose Never End or specify a date.

Never End means the task stays in effect permanently. The current system supports dates up to the year 2099.

If the user specifies a date, no further instances of the scheduling task are generated once the effective date range has passed.

Instance Information

Instance Generation

Instance Generation: Instances are generated from the user's scheduling configuration and dependency relationships, according to the instance generation rule the user selects.

Effective After Publishing: Takes effect immediately after submitting the task. The specific instance change/generation scope is:

If the submission time is earlier than the scheduling start time, instances are updated/generated starting from the scheduling start time.
If the submission time is later than the scheduling start time, instances are updated/generated starting from the submission/publishing time.
Historically generated instances are not affected; only subsequent instances are changed starting from the start time.

Next-Day Effective: All instances that need to run on the next day are generated in a single batch at 22:00 of the current day.
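
For Effective After Publishing, the start of the affected range is simply the later of the submission time and the scheduling start time. A minimal sketch, assuming hypothetical function names and assuming that an instance planned exactly at the start time is included:

```python
from datetime import datetime
from typing import List


def generation_start(submit_time: datetime, schedule_start: datetime) -> datetime:
    """Return the point from which instances are updated or generated."""
    return max(submit_time, schedule_start)


def instances_to_generate(
    planned_times: List[datetime],
    submit_time: datetime,
    schedule_start: datetime,
) -> List[datetime]:
    """Keep only the planned run times at or after the start point.

    Earlier instances are left untouched, matching the rule that
    historically generated instances are not affected.
    """
    start = generation_start(submit_time, schedule_start)
    return [t for t in planned_times if t >= start]


# Hypothetical example: a task planned every 4 hours, scheduling starts
# at 06:00, and the task is submitted at 09:30.
day = [datetime(2024, 1, 1, h, 0) for h in (0, 4, 8, 12, 16, 20)]
affected = instances_to_generate(
    day,
    submit_time=datetime(2024, 1, 1, 9, 30),
    schedule_start=datetime(2024, 1, 1, 6, 0),
)
print([t.hour for t in affected])
```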

Instance Rerun Methods

The product provides three rerun methods. They mainly affect behavior in two scenarios:

Whether the system automatically reruns an instance after it fails.
Whether the user can click Rerun in the instance maintenance list of the Operations Center after an instance succeeds or fails.

The options and their behavior are as follows:

Rerun is allowed whether the run succeeds or fails: With this method, you must also set the "Instance Auto Rerun Count" and "Instance Rerun Interval". When an instance fails, the system automatically reruns it according to the configured count and interval. Automatic reruns stop once a run succeeds or the configured rerun count is reached. The instance also appears in the Operations Center, where the user can manually click Rerun regardless of whether the instance succeeded or failed.
Rerun is allowed on failure, not allowed on success: The behavior is the same as above, with one difference: in the Operations Center, Rerun cannot be clicked if the instance state is success.
Rerun is not allowed whether the run succeeds or fails: With this method, automatic reruns are never triggered, and users cannot click Rerun in the Operations Center.
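
The three rerun methods can be summarized as two checks, one for automatic reruns and one for manual reruns. The enum and function names below are hypothetical and only illustrate the rules listed above.

```python
from enum import Enum


class RerunPolicy(Enum):
    ALWAYS = "allowed on success or failure"
    ON_FAILURE = "allowed on failure, not on success"
    NEVER = "not allowed"


def auto_rerun_allowed(
    policy: RerunPolicy, failed: bool, reruns_done: int, max_reruns: int
) -> bool:
    """Automatic reruns happen only after a failure, only while the
    configured rerun count has not been reached, and never under NEVER."""
    return policy is not RerunPolicy.NEVER and failed and reruns_done < max_reruns


def manual_rerun_allowed(policy: RerunPolicy, succeeded: bool) -> bool:
    """Whether Rerun can be clicked for an instance that has finished."""
    if policy is RerunPolicy.ALWAYS:
        return True
    if policy is RerunPolicy.ON_FAILURE:
        return not succeeded
    return False


print(manual_rerun_allowed(RerunPolicy.ON_FAILURE, succeeded=True))
```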

Run Timeout Duration

When a run timeout duration is configured and a task instance runs longer than the set time, the instance is forcibly killed and its run state changes to failed. A task that stays in the "Running" state for an extended period keeps occupying resources, so setting a timeout duration lets those resources be released promptly.

Scheduling Wait Duration

When a scheduling wait duration is configured, a task instance that has reached its scheduled run time and then stays unstarted for longer than that duration is forcibly killed, regardless of whether its upstream tasks have succeeded. This setting mainly prevents the resource waste that occurs when large numbers of downstream instances cannot run because upstream tasks are paused. Configure it with caution.

Scheduling Dependencies

Scheduling dependencies are the upstream-downstream relationships between periodically scheduled nodes. These relationships determine the order in which nodes are scheduled and executed: a downstream node starts running only after its upstream nodes have run successfully, which ensures that valid business data is produced on time.

For more details on instance execution behavior related to scheduling dependencies, see: Task Scheduling Dependencies
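
Dependency-ordered execution can be illustrated with a topological sort. This sketch uses Python's standard graphlib module (Python 3.9+); the node names and the can_start helper are hypothetical and do not reflect how the scheduling engine is implemented.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set

# Each node maps to the set of upstream nodes it depends on.
upstream: Dict[str, Set[str]] = {
    "extract": set(),
    "load": {"extract"},
    "report": {"load"},
}

# A valid execution order places every upstream node before its downstream nodes.
order = list(TopologicalSorter(upstream).static_order())


def can_start(node: str, deps: Dict[str, Set[str]], succeeded: Set[str]) -> bool:
    """A node may start only when all of its upstream nodes have succeeded."""
    return deps[node] <= succeeded


print(order)
print(can_start("report", upstream, succeeded={"extract"}))
```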

Operations Management

Periodic Tasks

Pause: Periodic tasks with the status "Scheduling" can be paused. After Pause is clicked, all instances from the current trigger onward are set to "Paused". This covers instances already generated for the current day that have not yet run, as well as the next-day instances generated at 22:00 each day. In other words, pausing only changes instance state; it does not stop the task from generating instances on schedule. Downstream instances are typically blocked by the paused task and cannot run. However, once a downstream instance exceeds its scheduling wait duration, measured from its planned start time (3 days by default if none is configured), it is forcibly killed.
Start: Periodic tasks with the status "Scheduling Paused" can be started. After Start is clicked, all paused instances of the periodic task are triggered for execution.
Offline: Once a periodic task is taken offline, it is no longer managed under periodic tasks and no further instances are generated. A task cannot be taken offline directly while it has downstream tasks running. Either take the downstream tasks offline first, or click Offline (Including Downstream) to take the current task and all its downstream tasks offline.
Backfill: Users can click Backfill regardless of the periodic task's status. Triggering a backfill generates a corresponding backfill task in the backfill task list.

Instance Operations

Rerun: Available only for periodic instances that have finished executing. Whether rerun is allowed in a given state depends on the rerun method chosen in the scheduling configuration; if the user selected "Rerun is not allowed whether the run succeeds or fails", Rerun cannot be clicked. When rerunning, users can choose to run only the current node or to run downstream nodes as well. If the code has changed and the user wants the node to rerun with the latest version, they can check the option to use the latest code and configuration before executing the rerun.
Pause: Available only for task instances that are not paused. The instance's scheduling property is set to "Scheduling Paused", meaning the instance will no longer be scheduled to run.
Resume: Available only for task instances that are paused. The instance's scheduling property is restored to "Normal Scheduling", meaning the instance will run at its set scheduling time.
Terminate: Available only for running task instances. The run is terminated and the instance state is set to failed.
Set to Success: Available only for task instances that have failed. The instance is forcibly set to success. This is intended for emergency operations, when it has been determined that the failure does not affect the data correctness of downstream tasks, so that downstream instances can be unblocked.
Set to Failure: Available only for task instances that have run successfully. The instance is forcibly set to failed.
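
The preconditions for these instance operations can be collected into one lookup. This is an illustrative sketch: the state names, the paused flag, and the can_apply function are assumptions that model the rules above, not the product's API.

```python
from dataclasses import dataclass


@dataclass
class InstanceView:
    state: str            # "NOT_STARTED", "RUNNING", "SUCCESS", or "FAILED"
    paused: bool = False  # scheduling property, separate from the run state


def can_apply(op: str, inst: InstanceView, rerun_allowed: bool = True) -> bool:
    """Check whether an operation is available for the given instance.

    rerun_allowed stands in for the rerun method chosen in the
    scheduling configuration.
    """
    if op == "rerun":
        return inst.state in {"SUCCESS", "FAILED"} and rerun_allowed
    if op == "pause":
        return not inst.paused
    if op == "resume":
        return inst.paused
    if op == "terminate":
        return inst.state == "RUNNING"
    if op == "set_success":
        return inst.state == "FAILED"
    if op == "set_failure":
        return inst.state == "SUCCESS"
    raise ValueError(f"unknown operation: {op}")


failed = InstanceView(state="FAILED")
print(can_apply("set_success", failed), can_apply("terminate", failed))
```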