Monitoring and Alerting

Overview

The monitoring feature uses built-in or custom rules to watch for abnormal conditions, such as task run status, and sends alert notifications when they occur.

Core Concepts

Concept	Explanation
Monitoring Rules	A set of configuration attributes, including the target objects and message types, that tells the system which messages to watch. When the conditions defined for the target objects are met, the rule generates an alert event
Alert Events	A record, generated from a monitoring rule when its conditions are met, of an event that may require an alert notification
Notification History	A record of the notifications actually pushed to users after an alert event is generated. Whether and how a notification is pushed depends on the notification strategy
Notification Strategy	Defines how alerts are pushed to recipients, such as the notification channels and the sending frequency

Monitoring Rules

The monitoring rules list displays all currently configured rules. You can search and filter the list.

For a single rule, you can perform the following operations:

Operation Name	Behavior Definition	Operable Personnel
View Details	Open the details page of the monitoring rule to view its complete configuration	All instance members
Enable/Disable	Enable or disable the alert rule	Instance administrator, instance operation and maintenance role
Copy	Create a new rule by copying the configuration attributes of the current rule	Instance administrator, instance operation and maintenance role
Edit	Modify the attributes of the monitoring rule	Instance administrator, instance operation and maintenance role
Subscribe/Unsubscribe	Add yourself to, or remove yourself from, the alert recipients of the rule	All instance members

Built-in System Rules

The system comes with built-in global monitoring rules that can be enabled as needed.

Rule Name	Rule Function	Default Status
General Rule Monitoring Task Failure	The default global rule for task instance failures. It triggers a monitoring alert when a task instance fails	Off by default

New Monitoring Rules

Click the "New Rule" button to customize and create monitoring rules as needed.

Category	Parameter	Description
Basic Information	Name	Enter the name of the new custom rule.
	Description	Optional. Add a description of the rule, or note how to handle the alert once it is received.
Trigger Condition	Monitoring Items	The object to monitor. The system currently supports two methods: "Event Monitoring" and "Metric Monitoring".
	Filter Condition	The filter conditions for messages. Multiple conditions are combined with "and".
Alert Level	Alert Level	The universal template defines the following levels. You can also customize the notification methods for each level in the notification strategy. High risk: all alert channels, including phone; Serious: all alert channels, including phone; Warning: system internal, email, SMS, and Webhook, excluding phone; Reminder: system internal, email, and Webhook, excluding phone and SMS. Clicking an alert level highlights the corresponding information in the notification strategy list below.
Monitoring Notification	Notification Strategy	Select an existing notification strategy from the drop-down box, or click the + sign to create a new one. For configuration details, see the Notification Strategy section.
	Alert Subscription	Select from the drop-down box the people to notify for this rule.
	Webhook Notification	Choose the notification method. DingTalk and Feishu are currently supported.
	Notification Start Time	The start time for sending notifications after the monitoring rule is triggered.
	Notification End Time	The end time for sending notifications after the monitoring rule is triggered.
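
The default channels per alert level and the "and" relationship between filter conditions can be sketched in a few lines of Python. This is an illustration only: `LEVEL_CHANNELS`, `matches_filters`, the channel identifiers, and the message fields are hypothetical names chosen for the example, not part of the product's API.

```python
# Hypothetical channel identifiers; the product's own names may differ.
ALL_CHANNELS = {"system", "email", "sms", "webhook", "phone"}

# Default channels per alert level, following the universal template above.
LEVEL_CHANNELS = {
    "high_risk": ALL_CHANNELS,
    "serious": ALL_CHANNELS,
    "warning": ALL_CHANNELS - {"phone"},
    "reminder": ALL_CHANNELS - {"phone", "sms"},
}

def matches_filters(message, filters):
    """Return True only if the message satisfies every filter condition ("and")."""
    return all(message.get(key) == value for key, value in filters.items())

# Made-up message fields, for illustration only.
message = {"task_type": "sync", "status": "failed"}
both_match = matches_filters(message, {"task_type": "sync", "status": "failed"})
one_fails = matches_filters(message, {"task_type": "sync", "status": "success"})
print(both_match, one_fails, sorted(LEVEL_CHANNELS["reminder"]))
```

Because the conditions are combined with "and", a message that fails any single condition does not trigger the rule.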

Trigger Condition Rules

A trigger condition combines the metric calculation method, threshold, and trigger method for the monitoring item the user has selected. The product currently supports custom configuration of "Event Monitoring" and "Metric Monitoring".

"Metric" monitoring configuration

For the metric monitoring type, after defining the calculation method and threshold, two trigger methods are supported.

Continuous: A monitoring alert is triggered once the metric reaches the threshold N consecutive times.

Check Interval: A monitoring alert is triggered once the metric reaches the threshold a cumulative N times within the check interval that the user defines.
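
The difference between the two trigger methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's implementation: the function names are hypothetical, "reaching the threshold" is taken to mean `value >= threshold`, and the check interval is expressed as a number of most recent data points rather than a time span.

```python
from collections import deque

def continuous_trigger(values, threshold, n):
    """True once n consecutive values have reached the threshold."""
    run = 0
    for v in values:
        run = run + 1 if v >= threshold else 0
        if run >= n:
            return True
    return False

def check_interval_trigger(values, threshold, n, window):
    """True once at least n of the last `window` values have reached the threshold."""
    recent = deque(maxlen=window)  # sliding window of breach flags
    for v in values:
        recent.append(v >= threshold)
        if sum(recent) >= n:
            return True
    return False

# Breaches alternate with normal points, so they are never consecutive.
samples = [60, 10, 60, 10, 60]
fires_continuous = continuous_trigger(samples, threshold=50, n=3)
fires_cumulative = check_interval_trigger(samples, threshold=50, n=3, window=5)
print(fires_continuous, fires_cumulative)
```

With these samples the Continuous method never fires, while the Check Interval method fires on the fifth point, which shows why the cumulative method suits metrics that flap around the threshold.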

For example, suppose the user configures the delay metric of a full-incremental integrated synchronization task with the condition delay >= 50s for 3 consecutive data points, and limits the alert frequency to one notification every 30 minutes.

At 00:40, the metric has reached the threshold 3 consecutive times, so the first alert is triggered. The first alert stage lasts from then until 01:50.

During the alert stage, each subsequent data point is checked against the alert frequency limit, and notifications continue to be sent accordingly. Because the limit is one notification every 30 minutes, further notifications are sent at 01:10 and 01:40.

Starting from 02:00, the next three data points are all below the threshold, so the first alert recovers and no further notifications are sent.
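
The timeline above can be reproduced with a short simulation. The sketch below rests on several assumptions that the text does not state: data points arrive every 10 minutes, the metric first reaches the threshold at 00:20, a notification is only sent on a data point that reaches the threshold, and the alert recovers after 3 consecutive points below the threshold. The date and all names are hypothetical.

```python
from datetime import datetime, timedelta

THRESHOLD_S = 50                   # delay >= 50 s counts as a breach
CONSECUTIVE_N = 3                  # 3 consecutive data points
MIN_GAP = timedelta(minutes=30)    # alert frequency limit

def simulate(points):
    """points: list of (timestamp, delay_seconds). Returns notification times."""
    sent, breaches, clears = [], 0, 0
    alerting, last_sent = False, None
    for ts, delay in points:
        breached = delay >= THRESHOLD_S
        breaches = breaches + 1 if breached else 0
        clears = 0 if breached else clears + 1
        if not alerting and breaches >= CONSECUTIVE_N:
            alerting = True                      # alert stage starts
        if alerting and clears >= CONSECUTIVE_N:
            alerting, last_sent = False, None    # alert recovers
        # Notify only on breaching points, respecting the frequency limit.
        if alerting and breached and (last_sent is None or ts - last_sent >= MIN_GAP):
            sent.append(ts)
            last_sent = ts
    return sent

start = datetime(2024, 1, 1, 0, 0)  # arbitrary date
# Assumed samples: normal at 00:00 and 00:10, breaching 00:20 to 01:50, normal from 02:00.
delays = [10, 10] + [60] * 10 + [10] * 3
points = [(start + timedelta(minutes=10 * i), d) for i, d in enumerate(delays)]
times = [t.strftime("%H:%M") for t in simulate(points)]
print(times)  # ['00:40', '01:10', '01:40']
```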

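If the user selects the Check Interval trigger method instead, the alert is triggered by the cumulative number of times the metric reaches the threshold within the check interval, rather than by consecutive data points.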

"Event" monitoring configuration

Event monitoring generates an alert when a specific event or condition occurs. Users can monitor the operation and maintenance instances and the data quality check rules that the product currently supports. Event monitoring currently falls into two categories:

Task Operation and Maintenance: Covers the periodically scheduled tasks that users define in development and configure for scheduling, as well as real-time running task instances.

Data Quality: Covers the table quality monitoring that users configure in data quality.

Alert Events

The alert event list displays all alert events generated by monitoring rules under the current instance. You can perform the following operations on an alert event in the list:

Suppress: Stop sending messages for the current alert event for a set number of minutes.

Close: Close the current alert event so that it no longer sends messages.
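
The effect of the two operations on notification delivery can be modeled roughly as below. This is a hypothetical sketch: the `AlertEvent` class and its methods are invented for illustration and do not correspond to a documented API.

```python
from datetime import datetime, timedelta

class AlertEvent:
    """Toy model of an alert event that can be suppressed or closed."""

    def __init__(self):
        self.closed = False
        self.suppressed_until = None

    def suppress(self, now, minutes):
        # Mute notifications until the suppression period ends.
        self.suppressed_until = now + timedelta(minutes=minutes)

    def close(self):
        # A closed event never notifies again.
        self.closed = True

    def can_notify(self, now):
        if self.closed:
            return False
        if self.suppressed_until is not None and now < self.suppressed_until:
            return False
        return True

now = datetime(2024, 1, 1, 9, 0)  # arbitrary timestamp
event = AlertEvent()
event.suppress(now, minutes=15)
during = event.can_notify(now + timedelta(minutes=5))
after = event.can_notify(now + timedelta(minutes=15))
event.close()
closed = event.can_notify(now + timedelta(hours=1))
print(during, after, closed)
```

Suppression is temporary and expires on its own, whereas closing is permanent for that event.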

Notification History

Notification History lists every notification that was actually delivered, according to the notification strategy, after an alert was triggered.

Notification Strategy

The notification strategy list displays all defined notification strategies. You can search and filter the list.

New Notification Strategy

Click the "New Strategy" button to create a new notification strategy as needed.

Category	Parameter	Description
Basic Information	Name	The name of the notification strategy
	Description	Optional. Add a description of the current strategy
Notification Method	High Risk Alert	Set the notification methods for each alert level. Supported methods: Webhook, SMS, Phone
	Serious Alert
	Warning Alert
	Reminder Alert
Notification Time	Send Interval (Minutes)	The time interval between two alerts.
	Maximum Send Times	The maximum number of alerts to send. Once this number is reached, no more alerts are sent.
	Do Not Disturb Start Time	The system does not send alerts during the do-not-disturb period. For example, if a task is set to trigger an alert on failure and its do-not-disturb period is 00:00 to 08:00, no alerts are sent during that period. If the task is still in the failed state at 08:00, an alert is sent then.
	Do Not Disturb End Time
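
How the three Notification Time settings interact can be sketched as a single check. The code below is illustrative only: the function and parameter names are hypothetical, and the handling of a do-not-disturb period that crosses midnight is an assumption rather than documented behavior.

```python
from datetime import datetime, time, timedelta

def in_quiet_hours(now, start, end):
    """True if `now` falls inside the do-not-disturb period [start, end)."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end  # assumed: period crosses midnight

def should_send(now, last_sent, sent_count, interval_min, max_times, dnd_start, dnd_end):
    if sent_count >= max_times:                 # Maximum Send Times reached
        return False
    if in_quiet_hours(now, dnd_start, dnd_end): # Do Not Disturb period
        return False
    if last_sent is not None and now - last_sent < timedelta(minutes=interval_min):
        return False                            # Send Interval not yet elapsed
    return True

dnd = (time(0, 0), time(8, 0))
day = datetime(2024, 1, 1)  # arbitrary date
at_three = should_send(day.replace(hour=3), None, 0, 30, 5, *dnd)
at_eight = should_send(day.replace(hour=8), None, 0, 30, 5, *dnd)
too_soon = should_send(day.replace(hour=9), day.replace(hour=8, minute=45), 1, 30, 5, *dnd)
capped = should_send(day.replace(hour=12), day.replace(hour=9), 5, 30, 5, *dnd)
print(at_three, at_eight, too_soon, capped)
```

The 03:00 check is blocked by the do-not-disturb period, while the 08:00 check passes, matching the example in the table.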

Configuration Management

In configuration management, you can configure personal information and Webhooks.

Personal Configuration

In personal configuration, you can modify the phone number and email address that the currently logged-in user uses to receive alerts. You can also set a do-not-disturb period, during which you will not receive system alert messages.

Webhook Configuration

Webhook configuration defines the Webhook channels used to push alerts. Feishu and DingTalk are currently the main supported platforms.

Creating a new webhook configuration

Click "Create Configuration", then fill in the required parameters on the page to create a new Webhook configuration. It is recommended to run a test after entering the Webhook address, and to make sure the test passes before clicking "Confirm" to save.

Others

Automatic closure of monitoring alerts

For monitoring alerts on task instance failures, once the instance is handled in the operation center and recovers successfully, for example by manually setting it to success or by rerunning it successfully, the corresponding alert event is closed automatically and does not need to be closed manually.

Webhook Alert Security Configuration

For IM platforms such as DingTalk, if you enable Security Settings on the webhook used for alert delivery, you must add "Singdata" as a keyword in the allowlist settings.
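
To check a keyword-protected webhook by hand, you can post a test message that contains the keyword. The sketch below assumes the text message format of DingTalk custom robots (`msgtype` and `text.content`); verify it against the platform's own documentation. The helper names are hypothetical, and the webhook URL is deliberately left as a placeholder for you to fill in.

```python
import json
import urllib.request

KEYWORD = "Singdata"  # keyword added to the webhook's security settings

def build_text_payload(content):
    """Build a text message, refusing content that lacks the keyword."""
    if KEYWORD not in content:
        raise ValueError(f"message must contain the keyword {KEYWORD!r}")
    # Assumed DingTalk custom-robot text format.
    return {"msgtype": "text", "text": {"content": content}}

def post_json(url, payload):
    """POST the payload as JSON and return (HTTP status, response body)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status, response.read().decode("utf-8")

payload = build_text_payload("[Singdata] test alert")
print(json.dumps(payload))
# To send for real, pass your own webhook address:
# post_json("<your webhook URL>", payload)
```

A message without the keyword would be rejected by the platform, so the helper fails early instead of sending it.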