About auto-healing and monitoring

This document explains Segura’s auto-healing feature, which automates the detection, isolation, and remediation of failures in critical cluster nodes and services. It also details native monitoring capabilities, integration with external systems, and practical examples to ensure operational continuity and stability.

Definition of auto-healing

Auto-healing consists of automated routines that identify, isolate, and attempt to fix failures, which reduces both downtime and the need for manual intervention. The goal is rapid recovery from unexpected events.

How auto-healing works

Automated health checks: Continuous monitoring of nodes and essential services, checking connectivity, performance, availability, and data integrity.
Failure detection: Automatic identification of hardware, software, system resource (CPU, RAM, disk), or communication issues.
Isolation of problematic nodes: Nodes marked as “unhealthy” are removed from the active pool to avoid impacting critical sessions.
Remediation attempts: Automatic procedures such as service restarts, cache clearing, or container restarts. Recovered nodes rejoin the cluster automatically.
Notification and escalation: Unresolved failures are immediately reported via dashboards, alerts (email, webhook, SIEM/SOAR), and detailed logs.
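
The steps above can be sketched as a small control loop. This is a minimal illustration, not Segura's actual implementation: the Node and Healer classes, the check and remediate callables, and the max_attempts limit are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    name: str
    healthy: bool = True   # result of the most recent health check
    in_pool: bool = True   # whether the node currently serves sessions


@dataclass
class Healer:
    # Both callables are assumptions standing in for real probes and fixes,
    # for example a connectivity test and a service restart.
    check: Callable[[Node], bool]
    remediate: Callable[[Node], None]
    max_attempts: int = 3
    escalated: List[str] = field(default_factory=list)

    def run_once(self, node: Node) -> str:
        """Run one detect / isolate / remediate / escalate cycle for a node."""
        node.healthy = self.check(node)
        if node.healthy:
            return "healthy"

        # Isolation: take the node out of the active pool before touching it.
        node.in_pool = False

        # Remediation: retry a bounded number of times, re-checking after each.
        for _ in range(self.max_attempts):
            self.remediate(node)
            if self.check(node):
                node.healthy = True
                node.in_pool = True  # recovered nodes rejoin the pool
                return "recovered"

        # Escalation: the failure persists, so record it for notification.
        self.escalated.append(node.name)
        return "escalated"
```

In this sketch a node stays out of the pool until a health check passes again, so a failed remediation never returns a faulty node to service. The escalated list stands in for whatever alerting channel is configured.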

Native monitoring

Integrated dashboards: Real-time panels showing node status, health metrics (CPU, RAM, latency), active sessions, alarms, and critical events.
Customizable alerts: Thresholds can be set for resource usage, failures, and latency, triggering automatic alerts for administrators.
Event logging: Auditable logs of all auto-healing events, failures, and recoveries.
External monitoring: Integration with Prometheus, Zabbix, Grafana, and SIEM/SOAR platforms for centralized observability and incident response.
Health/status visualization: Clear indication of each node’s state (healthy, degraded, unhealthy, maintenance).
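
To illustrate how configurable thresholds could map onto the four node states, the sketch below classifies a node from its latest metrics. The metric names, the warning and critical limits, and the classify function are assumptions made for this example; they are not Segura defaults.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Thresholds:
    warn: float  # at or above this value the node is considered degraded
    crit: float  # at or above this value the node is considered unhealthy


# Hypothetical limits; a real deployment would set these per environment.
LIMITS: Dict[str, Thresholds] = {
    "cpu_pct": Thresholds(warn=80.0, crit=95.0),
    "ram_pct": Thresholds(warn=85.0, crit=95.0),
    "latency_ms": Thresholds(warn=200.0, crit=1000.0),
}


def classify(metrics: Dict[str, float], maintenance: bool = False) -> str:
    """Return healthy, degraded, unhealthy, or maintenance for one node."""
    if maintenance:
        # Maintenance is set by an operator, so it overrides metric readings.
        return "maintenance"

    state = "healthy"
    for name, limit in LIMITS.items():
        value: Optional[float] = metrics.get(name)
        if value is None:
            continue  # metric not reported; skip it in this sketch
        if value >= limit.crit:
            return "unhealthy"  # any critical breach decides the state
        if value >= limit.warn:
            state = "degraded"
    return state
```

For example, classify({"cpu_pct": 85.0}) yields "degraded", while adding "ram_pct": 99.0 to the same reading yields "unhealthy".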

Practical examples and event logs

Event: Communication failure on node X.
Action: Node isolated and automatic service restart attempted.

Event: Memory consumption above threshold on node Y.
Action: Alert generated and sessions redistributed to other nodes.

Event: Persistent failure on node Z.
Action: Incident notification sent and logs exported to SIEM.

Integration with SIEM/SOAR

All critical events and auto-healing logs can be automatically forwarded to external systems via syslog, webhook, or API, enabling SOC/NOC teams to centralize Segura monitoring alongside other critical organizational systems.
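
As a rough sketch of the webhook path, the code below shows how a receiving-side integration or test harness might build a JSON event and the HTTP request that carries it. The field names, the "segura-auto-healing" source label, and the URL are hypothetical; Segura's real payload schema is not described in this document and may differ.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Any, Dict


def build_event(node: str, event: str, action: str,
                severity: str = "critical") -> Dict[str, Any]:
    """Assemble one auto-healing event as a plain dictionary."""
    return {
        "source": "segura-auto-healing",  # assumed label, not an official one
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "node": node,
        "event": event,
        "action": action,
        "severity": severity,
    }


def build_webhook_request(url: str,
                          payload: Dict[str, Any]) -> urllib.request.Request:
    """Wrap the payload in a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example using a placeholder endpoint on a reserved example domain.
request = build_webhook_request(
    "https://siem.example.com/hooks/segura",
    build_event("node-z", "Persistent failure",
                "Incident notification sent"),
)
```

Passing the request to urllib.request.urlopen with a timeout would deliver it. Building and sending are kept separate here so the payload can be inspected or logged before it leaves the host.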