Monitoring and Alerting

Overview

The monitoring and alerting system lets users monitor key indicators, such as task run status, in real time using built-in or custom rules. When an anomaly is detected, the system sends alert notifications promptly to help you keep data flows stable and reliable.

Core Concepts

| Concept | Description |
| --- | --- |
| Monitoring Rule | A set of configuration attributes, such as the monitored objects and message types, that tells the system which messages to watch for. A rule configured for specific objects and conditions generates an alert event when those conditions are met. |
| Alert Event | A record generated when the conditions of a monitoring rule are satisfied, indicating that an alert notification may need to be sent. |
| Notification History | The notification records actually pushed to users after alert events are generated. Alert notification delivery is influenced by the following three strategies. |
| Notification Strategy | Defines how alerts are pushed to recipients, including which notification channels to use and how often to send. |

Monitoring Rules

The monitoring rules list displays all currently configured rules. Users can filter the list to find a rule quickly.

Figure: Monitoring Rules List

For individual rules, the following operations are available:

| Operation Name | Behavior Definition | Authorized Personnel |
| --- | --- | --- |
| View Details | Open the monitoring alert rule details page to view complete information. | All instance members |
| Enable/Disable | Enable or disable the alert rule. | Instance administrator, instance operations role |
| Copy | Copy the configuration attributes of the current rule to create a new rule. | Instance administrator, instance operations role |
| Edit | Modify the attributes of the monitoring rule. | Instance administrator, instance operations role |
| Subscribe/Unsubscribe | Add the operator to, or remove the operator from, the alert recipient list. | All instance members |

System Built-in Rules

The system provides some preset global monitoring rules that users can enable as needed.

| Rule Name | Rule Function | Default Status |
| --- | --- | --- |
| General Rule - Monitor Task Failure | Default rule for monitoring task instance failures. | Disabled by default |

Create a New Monitoring Rule

Click the "Create Rule" button to create custom monitoring rules based on your needs.

Figure: Create Monitoring Rule Interface

| Category | Parameter | Description |
| --- | --- | --- |
| Basic Information | Name | Enter the name of the new custom rule. |
| | Description | Optional. You can describe the current rule or note the handling procedure to follow after receiving an alert. |
| Trigger Condition | Monitoring Items | The specific monitoring objects. The system currently supports two modes: "Event Monitoring" and "Metric Monitoring". |
| | Filter Condition | Filter conditions for messages. Multiple conditions are combined with "AND". |
| Alert Level | Alert Level | The universal template configures alert levels as follows; users can also customize the notification methods for each level in the notification strategy. Critical: all alert channels, including phone. Severe: all alert channels, including phone. Warning: in-system, email, SMS, and Webhook, excluding phone. Info: in-system, email, and Webhook, excluding phone and SMS. Clicking the alert level links to the information displayed in the notification strategy list below. |
| Monitoring Notification | Notification Strategy | Use the dropdown to select a notification strategy managed in "Notification Strategy", or click the "+" button to create a new one. For configuration details, see Notification Strategy. |
| | Alert Subscription | Use the dropdown to select the recipients to be notified for this rule. |
| | Webhook Notification | Select the notification method. Currently supported types: DingTalk, Feishu. |
| | Notification Start Time | The start time for sending notifications after the monitoring rule is triggered. |
| | Notification End Time | The end time for sending notifications after the monitoring rule is triggered. |
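The "AND" combination of filter conditions and the default level-to-channel mapping from the universal template can be sketched as follows. This is only an illustration: the channel identifiers, the function names, and the use of simple equality for each filter condition are assumptions, not the product's actual configuration format.

```python
# Channel identifiers are hypothetical labels for the channels named in the
# universal template (in-system, email, SMS, Webhook, phone).
ALL_CHANNELS = {"in_system", "email", "sms", "webhook", "phone"}

DEFAULT_CHANNELS = {
    "critical": ALL_CHANNELS,
    "severe": ALL_CHANNELS,
    "warning": ALL_CHANNELS - {"phone"},
    "info": ALL_CHANNELS - {"phone", "sms"},
}


def matches(message: dict, filters: dict) -> bool:
    """Return True only if every filter condition holds (conditions are ANDed).

    Each condition is modeled as an equality check, which is an assumption.
    """
    return all(message.get(key) == expected for key, expected in filters.items())


def channels_for(level: str) -> set:
    """Return the default channels for an alert level from the template above."""
    return set(DEFAULT_CHANNELS[level.lower()])
```

For example, a message with status "failed" matches a single filter on status, but stops matching as soon as a second, unmet condition is added.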

Trigger Condition Rules

Trigger conditions consist of monitoring items, metric calculation methods, thresholds, and trigger methods. The system currently supports two types: "Event Monitoring" and "Metric Monitoring".

Metric-Based Monitoring Configuration

For metric monitoring, after defining the calculation method and threshold, two trigger methods are supported.

Continuous: A monitoring alert is triggered once the metric has touched the threshold N consecutive times.

Check Interval: A monitoring alert is triggered once the metric has touched the threshold a cumulative N times within a user-defined check interval.
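The difference between the two trigger methods can be sketched in a few lines. This is a minimal illustration, assuming that "touching the threshold" means the value is greater than or equal to it; the function names are hypothetical and the product's actual evaluation logic is not documented here.

```python
def continuous_trigger(values, threshold, n):
    """Continuous: True if the most recent n data points all touch the threshold."""
    if n <= 0:
        raise ValueError("n must be positive")
    recent = values[-n:]
    return len(recent) == n and all(v >= threshold for v in recent)


def cumulative_trigger(values_in_interval, threshold, n):
    """Check Interval: True if at least n points in the interval touch the threshold.

    The caller is assumed to pass only the points inside the check interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return sum(v >= threshold for v in values_in_interval) >= n
```

With the series 60, 10, 55, 70, a threshold of 50, and N = 3, the cumulative method triggers (three hits in total) while the continuous method does not (the hits are interrupted by 10).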

Metric monitoring configuration: Users define the calculation method and threshold, then select a trigger method. For example, configure the delay metric of a full-incremental integrated sync task as follows: delay time >= 50 seconds, continuous for 3 data points, alert frequency limited to once every 30 minutes.

At 00:40, the threshold has been exceeded 3 consecutive times, so the first alert is triggered. The alert remains in this first alert stage until 01:50.

During the alert stage, each subsequent time point is evaluated against the alert frequency to decide whether another alert is sent. With the limit set to once every 30 minutes, alerts are sent at 01:10 and 01:40.

Starting from 02:00, the next three metric values are all below the threshold, so the first alert recovers and no further alert notifications are triggered.
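The timeline above can be reproduced with a small simulation. It is a sketch under several assumptions that the text does not state: data points arrive every 10 minutes, a repeat alert is sent only on a point that still exceeds the threshold, and the alert recovers after the same number of consecutive normal points as it took to trigger. All names are hypothetical.

```python
def simulate(points, threshold=50, n=3, min_gap_minutes=30):
    """Replay (minute_of_day, value) points; return (alert minutes, recovery minute)."""
    sent = []
    recovered_at = None
    alerting = False
    over = under = 0  # lengths of the current runs above / below the threshold
    for minute, value in points:
        if value >= threshold:
            over, under = over + 1, 0
        else:
            over, under = 0, under + 1
        if not alerting and over >= n:
            alerting = True            # first alert of this stage
            sent.append(minute)
        elif alerting and under >= n:
            alerting = False           # assumed recovery rule: n normal points
            recovered_at = minute
        elif alerting and value >= threshold and minute - sent[-1] >= min_gap_minutes:
            sent.append(minute)        # repeat alert, limited by the frequency
    return sent, recovered_at


def hhmm(minute):
    """Format minutes since midnight as HH:MM."""
    return f"{minute // 60:02d}:{minute % 60:02d}"


# Delay is 60 s from 00:20 through 01:50 and 10 s otherwise, sampled every 10 minutes.
points = [(m, 60 if 20 <= m <= 110 else 10) for m in range(0, 150, 10)]
sent, recovered_at = simulate(points)
print([hhmm(m) for m in sent], hhmm(recovered_at))
```

Under these assumptions the alerts fall at 00:40, 01:10, and 01:40, matching the example, and the alert recovers at the third normal point, 02:20.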

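If the trigger method is instead based on the cumulative count within the check interval, the alert is triggered once the threshold has been touched a cumulative N times within that interval.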

Event-Based Monitoring Configuration

Event monitoring generates alerts when specific events or conditions occur. Users can monitor based on operational instances or data quality validation rules within the product. Based on currently supported behaviors, event monitoring is mainly divided into two categories:

Task Operations: Various periodic scheduling tasks configured in development scenarios, or real-time running task instances.

Data Quality: Various table quality monitoring tasks configured in data quality.

Alert Events

The alert event list displays all alert information generated after monitoring rules are triggered. Users can perform operations on alert events in the list, such as suppressing or closing them.

Suppress: Set the current alert event to not send messages for a specified number of minutes.

Close: Close the current alert event and stop receiving such messages.

Figure: Alert Event List

Alert Event Handling Operations

| Operation | Definition | Applicable Scenario | Scope |
| --- | --- | --- | --- |
| Suppress | Stop sending messages for the alert event within a specified time period. | A known issue is being addressed and repeated alerts are temporarily not needed. | Current alert event |
| Close | Close the alert event and stop receiving such messages. | The issue has been resolved or confirmed as a false alarm. | Current alert event |
| Auto Close (not triggered by clicking on the alert event) | The system automatically closes the alert after detecting that the issue has recovered. | The task instance is rerun successfully or manually set to success. | Related alert events |
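The effect of the suppress and close operations on message delivery can be modeled as a small state object. This is a conceptual sketch only; the class and its fields are hypothetical and do not reflect how the product stores alert events.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertEvent:
    rule_name: str
    closed: bool = False
    suppressed_until: Optional[datetime] = None

    def suppress(self, minutes: int, now: datetime) -> None:
        """Stop messages for this event for the given number of minutes."""
        self.suppressed_until = now + timedelta(minutes=minutes)

    def close(self) -> None:
        """Close the event; no further messages are sent for it."""
        self.closed = True

    def may_notify(self, now: datetime) -> bool:
        """True if the event is open and not inside a suppression window."""
        if self.closed:
            return False
        return self.suppressed_until is None or now >= self.suppressed_until
```

A suppressed event becomes eligible for messages again once the window ends, whereas a closed event never does. Auto close would call the same `close` step when the system detects recovery.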

Notification History

Notification history records all notification messages actually delivered based on notification strategies.

Figure: Notification History List

Notification Strategy

The notification strategy list displays all defined notification strategies. Users can search and filter.

Figure: Notification Strategy List

Create a New Notification Strategy

Users can click the "Create Strategy" button to create a new notification strategy based on their needs.

Figure: Create Notification Strategy Interface

| Category | Parameter | Description |
| --- | --- | --- |
| Basic Information | Name | The name of the notification strategy. |
| | Description | Optional. You can add a description of the current strategy. |
| Notification Method | Critical Alert, Severe Alert, Warning Alert, Info Alert | Set the notification method for each alert level. Supported methods: Webhook, SMS, Phone. |
| Notification Time | Send Interval (minutes) | The time interval between two alerts. |
| | Max Send Count | The maximum number of alerts. Once this count is reached, no more alerts are sent. |
| | Do Not Disturb Start Time, Do Not Disturb End Time | During the do-not-disturb period, the system does not send alerts. For example, if a task failure alert is configured and the do-not-disturb period is set from 00:00 to 08:00, no alert is sent during that period. If the task is still in an abnormal state at 08:00, an alert is sent. |
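How the send interval, maximum send count, and do-not-disturb window might interact can be sketched as a single decision function. The class, field names, and evaluation order are assumptions made for illustration; the end of the do-not-disturb window is treated as exclusive so that an alert can go out at 08:00, as in the example above.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass
class NotificationStrategy:
    send_interval_minutes: int
    max_send_count: int
    dnd_start: time
    dnd_end: time

    def in_do_not_disturb(self, now: datetime) -> bool:
        t = now.time()
        if self.dnd_start <= self.dnd_end:
            return self.dnd_start <= t < self.dnd_end
        # Window wraps past midnight, for example 22:00 to 06:00.
        return t >= self.dnd_start or t < self.dnd_end

    def should_send(self, now: datetime, last_sent: Optional[datetime],
                    sent_count: int) -> bool:
        if sent_count >= self.max_send_count:
            return False               # maximum send count reached
        if self.in_do_not_disturb(now):
            return False               # inside the do-not-disturb period
        if last_sent is None:
            return True                # nothing sent yet for this alert
        gap = timedelta(minutes=self.send_interval_minutes)
        return now - last_sent >= gap  # respect the send interval
```

With a window of 00:00 to 08:00, a failure detected at 03:00 produces no message until 08:00, after which the send interval and maximum count apply.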

Configuration Management

Configuration management allows users to configure personal information and Webhooks.

Personal Configuration

Users can modify their phone number and email address used for receiving alerts, and set do-not-disturb periods in personal configuration.

Figure: Personal Configuration Interface

Webhook Configuration

Webhook configuration defines the Webhook channels used for alert delivery. Feishu and DingTalk are currently supported.

Figure: Webhook Configuration Interface

Create a New Webhook Configuration

Users can click the "Create Configuration" button and fill in the required parameters to create a new Webhook configuration. It is recommended to test before saving to ensure the configuration is correct.

Other

Automatic Closure of Monitoring Alerts

For monitoring alerts on task instance run failures, after the Operations Center handles the instance and it recovers successfully (such as manually setting success or rerunning the instance successfully), the corresponding alert event will be automatically set to closed. No manual closure is needed.

Webhook Alert Security Settings

IM platforms such as DingTalk apply security settings to Webhook alert delivery. Add "Singdata" as a custom keyword in those security settings.
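To see why the keyword matters, the sketch below builds a text message body in the shape that DingTalk custom robots accept and applies a local approximation of a keyword check. The exact message the alerting system sends is not documented here, so the sample content and both function names are assumptions.

```python
import json

KEYWORD = "Singdata"  # the custom keyword added in the security settings


def build_text_payload(content: str) -> str:
    """Build a DingTalk-style text message body as a JSON string."""
    return json.dumps({"msgtype": "text", "text": {"content": content}})


def passes_keyword_check(payload: str, keyword: str = KEYWORD) -> bool:
    """Approximate a keyword filter: the message text must contain the keyword."""
    body = json.loads(payload)
    return keyword in body["text"]["content"]
```

A message whose text includes "Singdata" passes this check, while one without the keyword would be rejected by a keyword-based security setting.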