SLA Configuration
BizMetry includes a continuous agent monitoring system that tracks key infrastructure metrics of each agent in real time. When a metric exceeds a defined threshold for a sustained period, BizMetry triggers an SLA breach alert, making it immediately visible across the platform so operators can take action before the agent degrades or goes offline.
SLA monitoring can be independently enabled or disabled per metric, giving you fine-grained control over which conditions are actively watched.
To access this configuration, open the Agent Configuration dialog and select the SLAs tab.
How SLA Monitoring Works: Hysteresis
BizMetry's SLA monitoring uses a hysteresis-based model to determine when to trigger and when to clear a breach alert. This approach prevents alert flapping — a situation where an alert is repeatedly triggered and cleared in quick succession due to a metric oscillating around a threshold.
Rather than using a single threshold value, each metric is configured with two separate thresholds and two time windows:
- SET threshold (upper) — the value above which the metric is considered elevated.
- CLEAR threshold (lower) — the value below which the metric is considered recovered.
- SET time — the number of consecutive minutes the metric must remain above the SET threshold before the breach is triggered.
- RESET time — the number of consecutive minutes the metric must remain below the CLEAR threshold before the breach is cleared.
This creates a deliberate gap between the trigger and recovery conditions. A metric must sustain an elevated state for the full SET time window before an alert fires, and must sustain a recovered state for the full RESET time window before the alert clears. Transient spikes that resolve quickly never trigger an alert, and brief dips below the CLEAR threshold do not prematurely clear an active breach.
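The rules above can be sketched as a small state machine. The following Python is a minimal illustration, not BizMetry's implementation: the class name `HysteresisMonitor`, its field names, and the idea that the monitor receives one averaged sample per minute are all assumptions made for the example. It also assumes that any sample that does not satisfy the current condition, including one that falls between the two thresholds, resets the consecutive-minute count.

```python
from dataclasses import dataclass


@dataclass
class HysteresisMonitor:
    """Hypothetical model of one metric with SET/CLEAR thresholds and time windows."""
    set_threshold: float    # upper threshold
    clear_threshold: float  # lower threshold
    set_minutes: int        # consecutive minutes above SET before triggering
    reset_minutes: int      # consecutive minutes below CLEAR before clearing
    breached: bool = False
    _streak: int = 0        # consecutive minutes counted toward the next state change

    def observe(self, value: float) -> bool:
        """Feed one per-minute average and return the breach state afterwards."""
        if self.breached:
            qualifies, needed = value < self.clear_threshold, self.reset_minutes
        else:
            qualifies, needed = value > self.set_threshold, self.set_minutes
        # A non-qualifying minute restarts the count (assumption, see lead-in).
        self._streak = self._streak + 1 if qualifies else 0
        if self._streak >= needed:
            self.breached, self._streak = not self.breached, 0
        return self.breached


monitor = HysteresisMonitor(set_threshold=80, clear_threshold=60,
                            set_minutes=3, reset_minutes=3)
samples = [50, 90, 90, 50, 90, 90, 90, 70, 50, 50, 90, 50, 50, 50]
states = [monitor.observe(v) for v in samples]
print(states)
```

In this run the two-minute spike at the start never triggers a breach, the third consecutive minute at 90 does, the sample at 70 (between the thresholds) and the interrupted two-minute dip leave the breach active, and only the final three consecutive minutes at 50 clear it.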
The diagram below illustrates this behavior:
CPU Usage
The CPU tab monitors the average CPU consumption of the agent process. High or sustained CPU usage may indicate that the agent is under heavy load, misconfigured, or experiencing resource contention with other workloads on the same host.
To activate CPU monitoring, set the SLA Monitoring Enabled toggle to ON.
SLA Breach Threshold
The dual-handle slider defines the two threshold values used by the hysteresis model:
- SET threshold (upper handle, shown in red) — the maximum average CPU usage percentage that, if exceeded for the configured SET time window, triggers the breach alert.
- CLEAR threshold (lower handle, shown in green) — the CPU usage percentage below which, if sustained for the configured RESET time window, the breach alert is cleared.
Drag each handle independently to set the desired values. The current threshold values are always shown to the right of the slider.
Time Windows
The time windows define how long BizMetry observes the CPU condition before it changes the alert state:
| Window | Description |
|---|---|
| Set Time (min) | Number of consecutive minutes CPU average must remain above the SET threshold before the breach alert is triggered. |
| Reset Time (min) | Number of consecutive minutes CPU average must remain below the CLEAR threshold before the breach alert is cleared. |
Recommendations:
- A Set Time of 3–5 minutes avoids false positives from transient CPU spikes during normal workload bursts.
- A Reset Time of 3–5 minutes prevents premature alert clearance if CPU briefly dips during an ongoing overload condition.
- For production agents, consider a SET threshold of 80–90% and a CLEAR threshold of 60–70%, leaving a meaningful gap between the two.
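As a quick way to sanity-check a planned CPU configuration against the recommendations above, a small helper along these lines could be used before entering the values in the dialog. This is only a sketch: `CpuSla`, its field names, and `review` are hypothetical and not part of BizMetry, and the ranges are simply the suggested values from this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSla:
    """Hypothetical container for the four CPU SLA settings."""
    set_pct: float
    clear_pct: float
    set_minutes: int
    reset_minutes: int


def review(cfg: CpuSla) -> list:
    """Return notes for values that fall outside the suggested ranges."""
    notes = []
    if not 0 <= cfg.clear_pct < cfg.set_pct <= 100:
        notes.append("CLEAR must be below SET, and both must be within 0-100%")
    if not 80 <= cfg.set_pct <= 90:
        notes.append("SET threshold is outside the suggested 80-90% range")
    if not 60 <= cfg.clear_pct <= 70:
        notes.append("CLEAR threshold is outside the suggested 60-70% range")
    if not 3 <= cfg.set_minutes <= 5:
        notes.append("Set Time is outside the suggested 3-5 minutes")
    if not 3 <= cfg.reset_minutes <= 5:
        notes.append("Reset Time is outside the suggested 3-5 minutes")
    return notes


for note in review(CpuSla(set_pct=85, clear_pct=84, set_minutes=1, reset_minutes=3)):
    print(note)
```

The example configuration is flagged twice: the CLEAR threshold sits almost on top of the SET threshold, and a one-minute Set Time would let short bursts trigger the alert.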
Memory Usage
The Memory tab monitors the average memory consumption of the agent process. Sustained high memory usage may indicate a memory leak, oversized metric buffers, or inadequate Pod resource limits in the Kubernetes deployment.
To activate memory monitoring, set the SLA Monitoring Enabled toggle to ON.
SLA Breach Threshold
The dual-handle slider defines the two threshold values:
- SET threshold (upper handle, shown in red) — the maximum average memory usage percentage that, if exceeded for the configured SET time window, triggers the breach alert.
- CLEAR threshold (lower handle, shown in green) — the memory usage percentage below which, if sustained for the configured RESET time window, the breach alert is cleared.
Drag each handle independently to set the desired values. The current threshold values are always shown to the right of the slider.
Time Windows
| Window | Description |
|---|---|
| Set Time (min) | Number of consecutive minutes memory average must remain above the SET threshold before the breach alert is triggered. |
| Reset Time (min) | Number of consecutive minutes memory average must remain below the CLEAR threshold before the breach alert is cleared. |
Recommendations:
- Memory usage tends to grow gradually rather than spike abruptly. A Set Time of 5–10 minutes is appropriate for most deployments.
- If the agent is configured with large metric or log buffers (see General Configuration), expect baseline memory usage to be higher and adjust thresholds accordingly to avoid false positives.
- A CLEAR threshold significantly lower than the SET threshold (e.g., SET at 85%, CLEAR at 60%) gives the agent room to recover without prematurely clearing the alert during garbage collection cycles.
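To see why a wide gap matters for memory, the sketch below counts how often the alert changes state on an invented sawtooth pattern (memory at 90% for five minutes, then a three-minute drop to 75% after garbage collection, repeated three times). The function `count_transitions`, the sample values, and the per-minute evaluation are assumptions made for illustration only.

```python
def count_transitions(samples, set_threshold, clear_threshold,
                      set_minutes, reset_minutes):
    """Count breach state changes for a series of per-minute averages."""
    breached, streak, transitions = False, 0, 0
    for value in samples:
        if breached:
            qualifies, needed = value < clear_threshold, reset_minutes
        else:
            qualifies, needed = value > set_threshold, set_minutes
        # Assumption: a non-qualifying minute restarts the consecutive count.
        streak = streak + 1 if qualifies else 0
        if streak >= needed:
            breached, streak = not breached, 0
            transitions += 1
    return transitions


# Invented memory pattern: sustained pressure with periodic garbage collection dips.
sawtooth = ([90] * 5 + [75] * 3) * 3

narrow_gap = count_transitions(sawtooth, 85, 80, set_minutes=5, reset_minutes=3)
wide_gap = count_transitions(sawtooth, 85, 60, set_minutes=5, reset_minutes=3)
print(narrow_gap, wide_gap)
```

With CLEAR at 80% every garbage collection dip clears the alert and the next climb triggers it again, giving six state changes. With CLEAR at 60% the alert triggers once and stays active for the whole overload.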
Network Latency
The Network Latency tab monitors the average round-trip latency between the agent and the BizMetry platform. Elevated latency may indicate network congestion, routing issues, or degraded connectivity between the agent's host and the platform.
To activate latency monitoring, set the SLA Monitoring Enabled toggle to ON.
SLA Breach Threshold
The dual-handle slider defines the two threshold values, expressed in milliseconds (maximum 15,000 ms):
- SET threshold (upper handle, shown in red) — the maximum average latency in milliseconds that, if exceeded for the configured SET time window, triggers the breach alert.
- CLEAR threshold (lower handle, shown in green) — the latency in milliseconds below which, if sustained for the configured RESET time window, the breach alert is cleared.
Drag each handle independently to set the desired values. The current threshold values are always shown to the right of the slider.
Time Windows
| Window | Description |
|---|---|
| Set Time (min) | Number of consecutive minutes average latency must remain above the SET threshold before the breach alert is triggered. |
| Reset Time (min) | Number of consecutive minutes average latency must remain below the CLEAR threshold before the breach alert is cleared. |
Recommendations:
- Acceptable latency thresholds depend heavily on the network environment. For cloud deployments, SET thresholds of 500–1000 ms are typical. For on-premises deployments on a fast LAN, even 200–300 ms may indicate a problem.
- Network latency can fluctuate briefly due to transient congestion. A Set Time of 3–5 minutes filters out short-lived spikes without masking genuine degradation.
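When choosing a latency threshold it can help to look at your own round-trip measurements as per-minute averages, since the time windows are expressed in minutes. The sketch below shows one way to do that. How BizMetry itself aggregates latency is not described here, so the one-minute bucketing, the `(unix_seconds, rtt_ms)` sample format, and the function name `minute_averages` are all assumptions.

```python
from collections import defaultdict
from statistics import fmean


def minute_averages(samples):
    """Group (unix_seconds, rtt_ms) samples by minute and average each group."""
    buckets = defaultdict(list)
    for timestamp, rtt_ms in samples:
        buckets[int(timestamp // 60)].append(rtt_ms)
    return {minute: fmean(values) for minute, values in sorted(buckets.items())}


# Invented measurements: a calm first minute and a congested second minute.
samples = [(0, 100), (30, 300), (60, 900), (119, 1100)]
averages = minute_averages(samples)

SET_THRESHOLD_MS = 500  # example value taken from the cloud range above
elevated_minutes = [m for m, avg in averages.items() if avg > SET_THRESHOLD_MS]
print(averages, elevated_minutes)
```

Here the first minute averages 200 ms and the second 1000 ms, so only the second minute would count toward the Set Time window at a 500 ms SET threshold.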
SLA Breach Alerts
When a breach condition is confirmed — after the metric has been above the SET threshold for the full SET time window — BizMetry triggers an SLA breach alert that is surfaced in two places across the platform:
On the Agent Card
The agent card in the Agents tab of the Profile view displays a visual breach indicator. The card enters a pulsing fade-in/fade-out animation to draw immediate attention to the affected agent.
Do not ignore a pulsing agent card
The pulsing animation is BizMetry's strongest visual signal that an agent requires immediate attention. A sustained breach may indicate that the agent is approaching the limits of its operational capacity and could degrade or go offline if left unaddressed. Investigate and take corrective action as soon as possible.
On the Agent Panel
The agent's row in the global Agents Panel also reflects the active breach, allowing operators monitoring the platform-wide view to detect affected agents without navigating to individual profiles.
Notifications
Every SLA state change — both when a breach is triggered and when it is cleared — automatically generates an entry in BizMetry's notification system. These notifications are accessible at any time from the main menu under Notifications, and can be filtered by agent to review the full breach history for a specific agent.
This audit trail is useful for understanding recurring patterns, correlating breaches with deployments or configuration changes, and demonstrating SLA compliance over time.
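If the notification history for an agent is copied out for analysis, pairing each trigger with its matching clear gives the duration of every breach. The sketch below assumes a purely hypothetical record shape (dictionaries with `agent`, `metric`, `state`, and `time` keys, and `TRIGGERED` / `CLEARED` state names). BizMetry's actual notification format is not documented in this section, so adapt the field names to whatever you have.

```python
from datetime import datetime


def breach_durations(notifications, agent):
    """Pair TRIGGERED and CLEARED events for one agent; return (metric, minutes)."""
    open_since = {}  # metric -> time the currently active breach started
    durations = []
    for note in sorted(notifications, key=lambda n: n["time"]):
        if note["agent"] != agent:
            continue  # same idea as filtering the notification list by agent
        metric = note["metric"]
        if note["state"] == "TRIGGERED":
            open_since.setdefault(metric, note["time"])
        elif note["state"] == "CLEARED" and metric in open_since:
            started = open_since.pop(metric)
            minutes = (note["time"] - started).total_seconds() / 60
            durations.append((metric, minutes))
    return durations


# Invented records for two agents.
notes = [
    {"agent": "agent-a", "metric": "cpu", "state": "TRIGGERED",
     "time": datetime(2024, 1, 1, 10, 0)},
    {"agent": "agent-b", "metric": "cpu", "state": "TRIGGERED",
     "time": datetime(2024, 1, 1, 10, 5)},
    {"agent": "agent-a", "metric": "cpu", "state": "CLEARED",
     "time": datetime(2024, 1, 1, 10, 30)},
]
print(breach_durations(notes, "agent-a"))
```

For `agent-a` this yields one CPU breach lasting 30 minutes. The breach on `agent-b` has no matching clear yet, so it is not reported as a completed breach.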






