Monitoring and Alerting

Overview

The monitoring function lets you watch for abnormal situations, such as a change in task running status, through built-in rules or custom rules, and sends alert notifications when they occur.

Core Concepts

| Concept | Explanation |
| --- | --- |
| Monitoring Rules | A set of configuration information, including the target object, message type, and other key attributes, that tells the system which messages to watch for. A rule for a specific object and specific conditions generates an alert event when the conditions are met. |
| Alert Events | Records of potential events that need an alert notification, created from monitoring rules when the monitoring conditions are met. |
| Notification History | Records of the information actually pushed to users after an alert event is generated. The push of alert notifications is affected by the notification strategy. |
| Notification Strategy | Defines the notification channels, sending frequency, and so on used when pushing alerts to alert recipients. |

Monitoring Rules

The monitoring rules list displays all currently configured rules. You can search and filter the list.

For a single rule, you can perform the following operations:

| Operation Name | Behavior Definition | Operable Personnel |
| --- | --- | --- |
| View Details | Open the details page of the monitoring rule to view its complete information. | All instance members |
| Enable/Disable | Enable or disable the alert rule. | Instance administrator, instance operation and maintenance role |
| Copy | Create a new rule by copying the configuration attributes of the current rule. | Instance administrator, instance operation and maintenance role |
| Edit | Modify the attributes of the monitoring rule. | Instance administrator, instance operation and maintenance role |
| Subscribe/Unsubscribe | Add the current operator to, or remove the current operator from, the alert recipients. | All instance members |
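The permission table above can be read as a simple lookup from operation to allowed roles. The following is a minimal sketch of that lookup. The role and operation identifiers (`member`, `operator`, `admin`, `view_details`, and so on) are hypothetical names chosen for illustration, not identifiers from the product.

```python
# Hypothetical role identifiers; the documentation names the roles only in prose.
ALL_MEMBERS = {"member", "operator", "admin"}
ADMIN_AND_OPS = {"operator", "admin"}

# Operation -> roles allowed to perform it, following the table above.
RULE_PERMISSIONS = {
    "view_details": ALL_MEMBERS,
    "enable_disable": ADMIN_AND_OPS,
    "copy": ADMIN_AND_OPS,
    "edit": ADMIN_AND_OPS,
    "subscribe_unsubscribe": ALL_MEMBERS,
}


def can_perform(role: str, operation: str) -> bool:
    """Return True if the role may perform the operation on a monitoring rule."""
    return role in RULE_PERMISSIONS.get(operation, set())


print(can_perform("member", "edit"))  # prints False
```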

Built-in System Rules

The system comes with built-in global monitoring rules that can be enabled as needed.

| Rule Name | Rule Function | Default Start/Stop Status |
| --- | --- | --- |
| General Rule Monitoring Task Failure | The default rule for global monitoring of task instance failures. It triggers a monitoring alert when an instance fails. | Off by default |

New Monitoring Rules

Click the "New Rule" button to customize and create monitoring rules as needed.

| Category | Parameter | Description |
| --- | --- | --- |
| Basic Information | Name | Enter the name of the new custom rule. |
| Basic Information | Description | Optional. You can add a description of the current rule, or describe how to handle the alert after it is received. |
| Trigger Condition | Monitoring Items | The specific monitoring object. The system currently supports two methods: "Event Monitoring" and "Metric Monitoring". |
| Trigger Condition | Filter Condition | The filter condition for messages. Multiple conditions are combined with "and". |
| Alert Level | Alert Level | The alert levels in the universal template are configured as follows. Users can also customize the notification methods for each level in the notification strategy. High risk: all alert channels, including phone. Serious: all alert channels, including phone. Warning: in-system message, email, SMS, and Webhook, excluding phone. Reminder: in-system message, email, and Webhook, excluding phone and SMS. Clicking an alert level links to the corresponding information in the notification strategy list below. |
| Monitoring Notification | Notification Strategy | Use the drop-down box to select a strategy managed under "Notification Strategy", or click the + sign to create a new notification strategy. For the specific configuration, see the Notification Strategy section. |
| Monitoring Notification | Alert Subscription | Use the drop-down box to select the people who need to be notified for this rule. |
| Monitoring Notification | Webhook Notification | Choose the notification method. The currently supported notification types are DingTalk and Feishu. |
| Monitoring Notification | Notification Start Time | The start time for sending notifications after the monitoring rule is triggered. |
| Monitoring Notification | Notification End Time | The end time for sending notifications after the monitoring rule is triggered. |
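Two rows of the table above describe logic that is easy to show in code: filter conditions are combined with "and", and each alert level in the universal template maps to a set of channels. The following is a minimal sketch under stated assumptions: the channel identifiers, the message field names, and the equality-only filter are all hypothetical, and "all alert channels" is assumed to mean the five channels named in the table.

```python
# Default channels per alert level, following the universal template described above.
# Channel identifiers are hypothetical labels for the channels named in the table.
DEFAULT_CHANNELS = {
    "high_risk": {"internal", "email", "sms", "webhook", "phone"},
    "serious": {"internal", "email", "sms", "webhook", "phone"},
    "warning": {"internal", "email", "sms", "webhook"},  # no phone
    "reminder": {"internal", "email", "webhook"},        # no phone, no SMS
}


def matches_all(message: dict, conditions: dict) -> bool:
    """Return True only if every filter condition holds (conditions are ANDed).

    Assumption: each condition is a simple equality test on a message field.
    """
    return all(message.get(field) == value for field, value in conditions.items())


# Hypothetical message and filter, for illustration only.
message = {"task_type": "sync", "status": "failed", "owner": "alice"}
print(matches_all(message, {"task_type": "sync", "status": "failed"}))  # prints True
print(matches_all(message, {"task_type": "sync", "owner": "bob"}))      # prints False
```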

Trigger Condition Rules

A trigger condition is the combination of metric calculation method, threshold, and trigger method that applies once the user has selected a specific monitoring item. The product currently lets users configure both "Event Monitoring" and "Metric Monitoring".

"Metric" monitoring configuration

For the metric monitoring type, after you define the calculation method and threshold, two trigger methods are supported.

Continuous: A monitoring alert is triggered once the metric meets the threshold N consecutive times.

Check Interval: A monitoring alert is triggered once the metric has met the threshold a cumulative N times within the check interval that the user defines.
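The difference between the two trigger methods can be sketched as follows. This is an illustration only, not the product's implementation. It assumes that "meeting the threshold" means a value greater than or equal to it, and that the check interval is expressed as a number of most recent data points. Both function names are hypothetical.

```python
from typing import Sequence


def continuous_trigger(values: Sequence[float], threshold: float, n: int) -> bool:
    """True if the most recent n data points all meet the threshold."""
    if n <= 0 or len(values) < n:
        return False
    return all(v >= threshold for v in values[-n:])


def check_interval_trigger(
    values: Sequence[float], threshold: float, n: int, window: int
) -> bool:
    """True if at least n of the most recent `window` data points meet the threshold.

    The breaching points do not need to be consecutive.
    """
    if n <= 0 or window <= 0:
        return False
    recent = values[-window:]
    return sum(1 for v in recent if v >= threshold) >= n


delays = [60, 10, 70, 20, 55]  # hypothetical delay samples in seconds
print(continuous_trigger(delays, 50, 3))         # prints False
print(check_interval_trigger(delays, 50, 3, 5))  # prints True
```

The same series fails the continuous check, because the breaches are interrupted, but passes the cumulative check, because three of the five points in the window meet the threshold.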

For example, suppose the user configures the delay metric of a full-incremental integrated synchronization task with the condition delay >= 50s for 3 consecutive data points, and limits the alert frequency to one notification every 30 minutes.

At 00:40, the metric has met the threshold 3 consecutive times, so the first alert is triggered. From then until 01:50, the task is in the first alert stage.

During the alert stage, each subsequent time point is checked against the alert frequency, and notifications continue to be sent according to that configuration. Because the limit is one notification every 30 minutes, alert notifications are sent again at 01:10 and 01:40.

Starting from 02:00, the next three data points are all below the threshold, so the first alert recovers and no more alert notifications are triggered.
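The notification times in this example follow from simple arithmetic on the alert frequency. The sketch below reproduces them, assuming that the first alert at 00:40 itself sends a notification and that repeats stop when the alert stage ends. The function name and the calendar date are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def notification_times(
    first_alert: datetime, stage_end: datetime, interval_minutes: int
) -> List[datetime]:
    """List the send times from the first alert until the alert stage ends."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    step = timedelta(minutes=interval_minutes)
    times = []
    current = first_alert
    while current <= stage_end:
        times.append(current)
        current += step
    return times


day = datetime(2024, 1, 1)  # arbitrary date; only the time of day matters here
sent = notification_times(
    day.replace(hour=0, minute=40), day.replace(hour=1, minute=50), 30
)
print([t.strftime("%H:%M") for t in sent])  # prints ['00:40', '01:10', '01:40']
```

The next candidate time, 02:10, falls after the alert stage ends at 01:50, so it is not sent.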

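If the user's trigger method is Check Interval instead, the alert is triggered based on the cumulative number of times the threshold is met within the check interval, rather than on consecutive data points.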

"Event" monitoring configuration

Event monitoring generates an alert when a specific event or condition occurs. Users can monitor the operation and maintenance instances and the data quality check rules that the product currently supports. Based on this supported behavior, event monitoring currently falls into two categories:

Task Operation and Maintenance: Covers the periodically scheduled tasks that users define in the development scenario and configure through the scheduling scenario, as well as real-time running task instances.

Data Quality: Covers the table quality monitoring that users configure in data quality.

Alert Events

The alert event list displays all alert events generated by triggered monitoring rules under the current instance. You can perform the following operations on the alert events in the list:

Suppress: Stop the current alert event from sending messages again for a set number of minutes.

Close: Close the current alert event so that it no longer sends such messages.
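The effect of the two operations on notification delivery can be sketched as a small state model. This is an illustration under assumptions, not the product's data model: the class name, attribute names, and the idea that suppression is stored as an expiry time are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


class AlertEvent:
    """Hypothetical model of an alert event that can be suppressed or closed."""

    def __init__(self) -> None:
        self.closed = False
        self.suppressed_until: Optional[datetime] = None

    def suppress(self, now: datetime, minutes: int) -> None:
        """Stop notifications for the given number of minutes starting at `now`."""
        self.suppressed_until = now + timedelta(minutes=minutes)

    def close(self) -> None:
        """Close the event; it sends no further notifications."""
        self.closed = True

    def may_notify(self, now: datetime) -> bool:
        """Return True if a notification may be sent at `now`."""
        if self.closed:
            return False
        if self.suppressed_until is not None and now < self.suppressed_until:
            return False
        return True


event = AlertEvent()
noon = datetime(2024, 1, 1, 12, 0)  # arbitrary timestamp for illustration
event.suppress(noon, 10)
print(event.may_notify(noon + timedelta(minutes=5)))   # prints False
print(event.may_notify(noon + timedelta(minutes=10)))  # prints True
```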

Notification History

The notification history lists all message notifications that were actually delivered, according to the notification strategy, after an alert was triggered.

Notification Strategy

The notification strategy list displays all defined notification strategies. You can search and filter the list.

New Notification Strategy

Click the "New Strategy" button to create a new notification strategy as needed.

| Category | Parameter | Description |
| --- | --- | --- |
| Basic Information | Name | The name of the notification strategy. |
| Basic Information | Description | Optional. You can add a description of the current strategy. |
| Notification Method | High Risk Alert | Set the notification method for each alert level. The supported methods are Webhook, SMS, and Phone. |
| Notification Method | Serious Alert | Same as above. |
| Notification Method | Warning Alert | Same as above. |
| Notification Method | Reminder Alert | Same as above. |
| Notification Time | Send Interval (Minutes) | The time interval between two alerts. |
| Notification Time | Maximum Send Times | The maximum number of alerts. After the set number is exceeded, no more alerts are generated. |
| Notification Time | Do Not Disturb Start Time | After a do-not-disturb period is set, the system does not send alerts during that period. For example, if an alert is set to trigger when a task fails and the do-not-disturb period for this task is 00:00 to 08:00, no alert information is sent during that period. If the task is still in the abnormal state at 08:00, an alert is sent. |
| Notification Time | Do Not Disturb End Time | See Do Not Disturb Start Time. |
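The three Notification Time settings combine into a single send decision. The following is a minimal sketch of that decision, assuming that the do-not-disturb end time is exclusive (so an alert can go out at exactly 08:00, as in the example above) and that a period may wrap past midnight. All function and parameter names are hypothetical.

```python
from datetime import datetime, time, timedelta
from typing import Optional


def in_do_not_disturb(now: time, start: time, end: time) -> bool:
    """True if `now` falls in [start, end); supports periods that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_send(
    now: datetime,
    last_sent: Optional[datetime],
    sent_count: int,
    interval_minutes: int,
    max_sends: int,
    dnd_start: time,
    dnd_end: time,
) -> bool:
    """Apply maximum send times, do-not-disturb, and send interval, in that order."""
    if sent_count >= max_sends:
        return False
    if in_do_not_disturb(now.time(), dnd_start, dnd_end):
        return False
    if last_sent is not None and now - last_sent < timedelta(minutes=interval_minutes):
        return False
    return True


dnd = (time(0, 0), time(8, 0))  # the 00:00 to 08:00 period from the example
print(should_send(datetime(2024, 1, 1, 3, 0), None, 0, 30, 5, *dnd))  # prints False
print(should_send(datetime(2024, 1, 1, 8, 0), None, 0, 30, 5, *dnd))  # prints True
```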

Configuration Management

In configuration management, you can configure personal information and Webhooks.

Personal Configuration

In personal configuration, you can modify the phone number and email address that the currently logged-in user uses to receive alerts. You can also set a do-not-disturb period, during which you will not receive system alert messages.

Webhook Configuration

Webhook configuration defines the Webhook channels used to push alerts. Feishu and DingTalk are currently the main supported platforms.

Creating a new webhook configuration

Click "Create Configuration", then fill in the required parameters on the page to create a new Webhook configuration. It is recommended to run a test after entering the Webhook address, and to make sure the test passes before clicking "Confirm" to save.

Others

Automatic closure of monitoring alerts

For monitoring alerts about task instance failures, once the instance has been handled in the operation center and recovers successfully (for example, it is manually set to success or is rerun successfully), the corresponding alert event is closed automatically. No manual closure is needed.

Webhook Alert Security Configuration

For IM platforms such as DingTalk, if you configure Security Settings for the webhook that delivers alerts, you must add "Singdata" as a keyword in the allowlist settings.
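The reason for this requirement can be shown with a simplified model of a keyword allowlist: the platform accepts a message only if its text contains at least one configured keyword. The sketch below is an assumption-based illustration of that check, not DingTalk's actual implementation, and the sample alert text is hypothetical.

```python
from typing import Iterable

REQUIRED_KEYWORD = "Singdata"  # the keyword named in the documentation


def passes_keyword_check(content: str, keywords: Iterable[str]) -> bool:
    """Simplified model: accept the message if it contains any allowlisted keyword."""
    return any(keyword in content for keyword in keywords)


# Hypothetical alert text, for illustration only.
alert_text = "[Singdata] Task instance failed: daily_sync"

print(passes_keyword_check(alert_text, [REQUIRED_KEYWORD]))  # prints True
print(passes_keyword_check(alert_text, ["Monitoring"]))      # prints False
```

If the allowlist omits "Singdata", a check like the second call rejects the message, which is why the keyword must be present in the webhook's security settings.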