Monitoring and Alerting

Overview

The monitoring and alerting system lets users monitor key indicators, such as task run status, in real time using built-in or custom rules. When an anomaly is detected, the system sends alert notifications so that you can keep data flows stable and reliable.

Core Concepts

Concept	Description
Monitoring Rule	A configuration that specifies key attributes, such as the monitored objects and message types, telling the system which messages to watch for. When the conditions of a rule are met for its objects, the system generates an alert event.
Alert Event	A record generated from a monitoring rule when its conditions are satisfied, representing an event that may need to trigger an alert notification.
Notification History	The actual notification records pushed to users after alert events are generated. Whether and how a notification is delivered is governed by the notification strategy.
Notification Strategy	Defines which notification channels to use, sending frequency, etc., when pushing alerts to recipients.

Monitoring Rules

The monitoring rules list displays all currently configured rules. Users can filter the list to find a rule quickly.

For individual rules, the following operations are available:

Operation Name	Behavior Definition	Authorized Personnel
View Details	Open the monitoring alert rule details page to view complete information.	Open to all instance members
Enable/Disable	Enable or disable the alert rule.	Instance administrator, instance operations role
Copy	Copy the configuration attributes of the current rule to create a new rule.	Instance administrator, instance operations role
Edit	Modify the attributes of the monitoring rule.	Instance administrator, instance operations role
Subscribe/Unsubscribe	Add or remove the operator from the alert recipient list.	Open to all instance members

System Built-in Rules

The system provides some preset global monitoring rules that users can enable as needed.

Rule Name	Rule Function	Default Status
General Rule - Monitor Task Failure	Default rule for monitoring task instance failures.	Disabled by default

Create a New Monitoring Rule

Click the "Create Rule" button to create custom monitoring rules based on your needs.

Category	Parameter	Description
Basic Information	Name	Enter the name of the new custom rule.
	Description	Optional. You can add a description of the current rule or note the handling procedure to follow after receiving an alert.
Trigger Condition	Monitoring Items	The specific monitoring objects. The system currently supports two modes: "Event Monitoring" and "Metric Monitoring".
	Filter Condition	Filter conditions for messages. Multiple conditions are combined with "AND".
Alert Level	Alert Level	The universal template defines the alert levels below; users can also customize the notification methods for each level in the notification strategy. Critical: all alert channels, including phone. Severe: all alert channels, including phone. Warning: in-system, email, SMS, and Webhook (no phone). Info: in-system, email, and Webhook (no phone or SMS). Clicking an alert level links to the corresponding information in the notification strategy list below.
Monitoring Notification	Notification Strategy	Click the dropdown to select a notification strategy managed in "Notification Strategy", or click the "+" button to create a new one. For configuration details, see Notification Strategy.
	Alert Subscription	Dropdown to select the specific recipients to be notified for this rule.
	Webhook Notification	Select notification method. Currently supported types: DingTalk, Feishu.
	Notification Start Time	The start time for sending notifications after the monitoring rule is triggered.
	Notification End Time	The end time for sending notifications after the monitoring rule is triggered.
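
The default level-to-channel mapping described under Alert Level can be written out as a small lookup. This is only an illustrative sketch: the channel identifiers and the channels_for helper are hypothetical names, and it assumes that "all alert channels" means the five channels named in the table.

```python
# Hypothetical channel identifiers; the product may name them differently.
ALL_CHANNELS = {"in_system", "email", "sms", "webhook", "phone"}

# Defaults from the universal template, as described in the Alert Level row.
DEFAULT_CHANNELS = {
    "critical": set(ALL_CHANNELS),
    "severe": set(ALL_CHANNELS),
    "warning": ALL_CHANNELS - {"phone"},
    "info": ALL_CHANNELS - {"phone", "sms"},
}


def channels_for(level: str) -> set[str]:
    """Return the default channels for an alert level (case-insensitive)."""
    return DEFAULT_CHANNELS[level.lower()]


print(sorted(channels_for("Warning")))
```

A notification strategy can override these defaults per level, so the lookup only reflects the starting point.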

Trigger Condition Rules

Trigger conditions consist of monitoring items, metric calculation methods, thresholds, and trigger methods. The system currently supports two types: "Event Monitoring" and "Metric Monitoring".

Metric-Based Monitoring Configuration

For metric monitoring, after defining the calculation method and threshold, two trigger methods are supported.

Continuous: A monitoring alert is triggered once the metric reaches the threshold N consecutive times.

Check Interval: A monitoring alert is triggered once the metric reaches the threshold a cumulative N times within the user-defined check interval.

Metric monitoring configuration: Users can define calculation methods and thresholds, and select trigger methods. For example, configure the delay metric of a full-incremental integrated sync task as follows: delay time >= 50 seconds, continuous for 3 data points, alert frequency limited to once every 30 minutes.

At 00:40, when the threshold has been reached for the third consecutive time, the first alert is triggered. From then until 01:50, it remains in the first alert stage.

During the alert stage, subsequent time points are evaluated based on the alert frequency, determining whether to continue sending alerts. With the limit set to once every 30 minutes, alerts are sent at 01:10 and 01:40.

Starting from 02:00, the next three metric values are all below the threshold, so the first alert recovers and no further alert notifications are triggered.
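
The timeline above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the system's actual evaluator: it assumes one data point every 10 minutes, sends a repeat notification only on a point that still reaches the threshold, and treats the alert as recovered after N consecutive points below the threshold. The function name and the sample values are hypothetical.

```python
from datetime import datetime, timedelta


def continuous_notifications(points, threshold, n, min_gap):
    """Return the times at which notifications would be sent.

    points: list of (timestamp, value) pairs in time order.
    """
    sent = []
    breach_streak = 0   # consecutive points at or above the threshold
    ok_streak = 0       # consecutive points below the threshold
    alerting = False
    last_sent = None
    for ts, value in points:
        if value >= threshold:
            breach_streak += 1
            ok_streak = 0
        else:
            ok_streak += 1
            breach_streak = 0

        if not alerting and breach_streak >= n:
            alerting = True              # first alert
            sent.append(ts)
            last_sent = ts
        elif alerting and value >= threshold and ts - last_sent >= min_gap:
            sent.append(ts)              # repeat, limited by alert frequency
            last_sent = ts
        elif alerting and ok_streak >= n:
            alerting = False             # assumed recovery rule
            last_sent = None
    return sent


# Assumed sample data: delay in seconds, one point every 10 minutes from 00:00.
start = datetime(2024, 1, 1, 0, 0)
values = [10, 20, 60, 70, 80] + [90] * 7 + [5, 5, 5]
points = [(start + timedelta(minutes=10 * i), v) for i, v in enumerate(values)]

sent = continuous_notifications(points, threshold=50, n=3,
                                min_gap=timedelta(minutes=30))
print([t.strftime("%H:%M") for t in sent])
```

With these sample values the sketch reports notifications at 00:40, 01:10, and 01:40, matching the example.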

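If the trigger method is instead based on a cumulative count within the check interval, the alert is triggered once the threshold has been reached N times in total within that interval, whether or not the occurrences are consecutive.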
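
A cumulative-count trigger can be sketched as a sliding window over the data points. The function name, the window boundary handling (a point exactly one check interval old is dropped), and the sample values are assumptions made for illustration.

```python
from collections import deque
from datetime import datetime, timedelta


def first_cumulative_trigger(points, threshold, n, check_interval):
    """Return the first time the threshold has been reached n times
    within the trailing check interval, or None if that never happens."""
    breaches = deque()  # timestamps of recent points at or above threshold
    for ts, value in points:
        if value >= threshold:
            breaches.append(ts)
        # Drop breaches that have fallen out of the check interval.
        while breaches and ts - breaches[0] >= check_interval:
            breaches.popleft()
        if len(breaches) >= n:
            return ts
    return None


# Assumed sample data: breaches are never consecutive.
start = datetime(2024, 1, 1, 0, 0)
values = [60, 10, 70, 10, 55, 10]
points = [(start + timedelta(minutes=10 * i), v) for i, v in enumerate(values)]

hit = first_cumulative_trigger(points, threshold=50, n=3,
                               check_interval=timedelta(minutes=60))
print(hit.strftime("%H:%M") if hit else None)
```

With this data a "Continuous" trigger with N = 3 would never fire, while the cumulative trigger fires at the third breach inside a 60-minute interval.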

Event-Based Monitoring Configuration

Event monitoring generates alerts when specific events or conditions occur. Users can monitor based on operational instances or data quality validation rules within the product. Based on currently supported behaviors, event monitoring is mainly divided into two categories:

Task Operations: Various periodic scheduling tasks configured in development scenarios, or real-time running task instances.

Data Quality: Various table quality monitoring tasks configured in data quality.

Refer to Monitoring Item Specification for detailed metric definitions.

Alert Events

The alert event list displays all alert events generated after monitoring rules are triggered. Users can act on alert events in the list, for example by suppressing or closing them.

Suppress: Set the current alert event to not send messages for a specified number of minutes.

Close: Close the current alert event and stop receiving such messages.

Alert Event Handling Operations

Operation	Definition	Applicable Scenario	Scope
Suppress	Stop sending messages for the alert event within a specified time period.	A known issue is being addressed and repeated alerts are not needed temporarily.	Current alert event
Close	Close the alert event and stop receiving such messages.	Issue has been resolved or confirmed as a false alarm.	Current alert event
Auto Close (not triggered by clicking on the alert event)	The system automatically closes the alert after detecting that the issue has recovered.	A task instance rerun succeeds, or the instance is manually set to success.	Related alert events
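
The effect of Suppress and Close on notification delivery can be modeled with a small state object. This is a hypothetical sketch; the class and method names are illustrative and do not correspond to a documented API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertEvent:
    closed: bool = False
    suppressed_until: Optional[datetime] = None

    def suppress(self, now: datetime, minutes: int) -> None:
        """Stop sending messages for the given number of minutes."""
        self.suppressed_until = now + timedelta(minutes=minutes)

    def close(self) -> None:
        """Close the event; no further messages are sent for it."""
        self.closed = True

    def can_notify(self, now: datetime) -> bool:
        if self.closed:
            return False
        return self.suppressed_until is None or now >= self.suppressed_until


event = AlertEvent()
t0 = datetime(2024, 1, 1, 9, 0)
event.suppress(t0, minutes=30)
print(event.can_notify(t0 + timedelta(minutes=10)))  # still suppressed
print(event.can_notify(t0 + timedelta(minutes=30)))  # suppression expired
```

Auto Close would call the same close step, triggered by the system rather than by a user.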

Notification History

Notification history records all notification messages actually delivered based on notification strategies.

Notification Strategy

The notification strategy list displays all defined notification strategies. Users can search and filter.

Create a New Notification Strategy

Users can click the "Create Strategy" button to create a new notification strategy based on their needs.

Category	Parameter	Description
Basic Information	Name	The name of the notification strategy.
	Description	Optional. You can add a description of the current strategy.
Notification Method	Critical Alert	Set the specific notification method for different alert levels. Supported methods: Webhook, SMS, Phone.
	Severe Alert
	Warning Alert
	Info Alert
Notification Time	Send Interval (minutes)	The time interval between two alerts.
	Max Send Count	The maximum number of alerts. After exceeding this count, no more alerts will be generated.
	Do Not Disturb Start Time	During the do-not-disturb period, the system will not send alerts. For example, if a task failure alert is configured and the do-not-disturb time is set from 00:00 to 08:00, no alert will be sent during that period. If at 08:00 the task is still in an abnormal state, an alert will be sent.
	Do Not Disturb End Time
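
The Notification Time parameters combine into a single send decision, which can be sketched as follows. The Strategy class, its field names, and the should_send helper are hypothetical; the sketch assumes the do-not-disturb end time is exclusive, so that an alert can go out at exactly 08:00 as in the example above.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass
class Strategy:
    send_interval_min: int   # Send Interval (minutes)
    max_send_count: int      # Max Send Count
    dnd_start: time          # Do Not Disturb Start Time
    dnd_end: time            # Do Not Disturb End Time


def in_dnd(s: Strategy, now: datetime) -> bool:
    t = now.time()
    if s.dnd_start <= s.dnd_end:
        return s.dnd_start <= t < s.dnd_end
    # Window crosses midnight, e.g. 22:00 to 06:00.
    return t >= s.dnd_start or t < s.dnd_end


def should_send(s: Strategy, now: datetime,
                last_sent: Optional[datetime], sent_count: int) -> bool:
    if sent_count >= s.max_send_count:
        return False
    if in_dnd(s, now):
        return False
    if last_sent is not None:
        if now - last_sent < timedelta(minutes=s.send_interval_min):
            return False
    return True


strategy = Strategy(send_interval_min=30, max_send_count=3,
                    dnd_start=time(0, 0), dnd_end=time(8, 0))
print(should_send(strategy, datetime(2024, 1, 1, 7, 59), None, 0))
print(should_send(strategy, datetime(2024, 1, 1, 8, 0), None, 0))
```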

Configuration Management

Configuration management allows users to configure personal information and Webhooks.

Personal Configuration

Users can modify their phone number and email address used for receiving alerts, and set do-not-disturb periods in personal configuration.

Webhook Configuration

Webhook configuration defines the Webhook channels used for alert delivery. Feishu and DingTalk are currently supported.

Create a New Webhook Configuration

Users can click the "Create Configuration" button and fill in the required parameters to create a new Webhook configuration. It is recommended to test before saving to ensure the configuration is correct.

Other

Automatic Closure of Monitoring Alerts

For monitoring alerts on task instance run failures, after the Operations Center handles the instance and it recovers successfully (such as manually setting success or rerunning the instance successfully), the corresponding alert event will be automatically set to closed. No manual closure is needed.

Webhook Alert Security Settings

IM platforms such as DingTalk apply security settings to incoming Webhook messages. Add "Singdata" as a custom keyword in the security settings so that alert messages are accepted.
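
To check a DingTalk Webhook and its keyword setting by hand, a test message can be built and posted with a short script. This is a sketch that assumes the DingTalk custom-robot text message format (msgtype "text" with a "content" field); the helper names are hypothetical, and the Webhook URL must be the one generated for your own robot.

```python
import json
import urllib.request

KEYWORD = "Singdata"  # custom keyword configured in the security settings


def build_dingtalk_text(content: str, keyword: str = KEYWORD) -> dict:
    """Build a text payload, making sure it contains the keyword."""
    if keyword not in content:
        content = f"[{keyword}] {content}"
    return {"msgtype": "text", "text": {"content": content}}


def send(webhook_url: str, payload: dict) -> bytes:
    """POST the payload as JSON to the Webhook URL and return the raw reply."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()


payload = build_dingtalk_text("Test alert: task failure")
print(json.dumps(payload))
# send("<your webhook URL>", payload)  # uncomment with a real URL to test
```

A message that does not contain the configured keyword is expected to be rejected by the platform, which is why the helper prepends it when missing.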