About failover and service continuity

This document describes Segura's failover and service continuity mechanisms, detailing how the platform ensures high operational availability despite failures in cluster nodes or components. It covers automatic processes, session recovery, procedures for partial or total failures, and recommendations for testing and maintenance.

Automatic failover mechanisms

Continuous monitoring: All cluster nodes undergo frequent health checks to verify that they are available.
Failure detection: Nodes with hardware, software, communication, or performance issues are automatically removed from the active pool.
Automatic session redistribution: Affected sessions are transferred to healthy nodes without manual intervention.
Transparent failover: The end-user experience is preserved through automatic reconnection, and session context and data are retained whenever possible.
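
The detection and redistribution steps above can be pictured with a minimal Python sketch. It is not Segura's implementation: the Node class, the healthy flag, and the fail_over function are hypothetical names, and the "fewest sessions" placement rule is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True  # outcome of the latest health check (hypothetical flag)
    sessions: list[str] = field(default_factory=list)


def fail_over(nodes: list[Node]) -> dict[str, str]:
    """Drop unhealthy nodes from the active pool and move their sessions.

    Returns a mapping of session id to the name of the node now hosting it.
    """
    active = [n for n in nodes if n.healthy]
    if not active:
        raise RuntimeError("no healthy node available")

    moved: dict[str, str] = {}
    for node in nodes:
        if node.healthy:
            continue
        for session in node.sessions:
            # Assumed rule: place each session on the least-loaded healthy node.
            target = min(active, key=lambda n: len(n.sessions))
            target.sessions.append(session)
            moved[session] = target.name
        node.sessions = []  # the failed node no longer hosts anything
    return moved


nodes = [
    Node("node-a", sessions=["s1", "s2"]),
    Node("node-b", sessions=["s3"]),
    Node("node-c", healthy=False, sessions=["s4", "s5"]),
]
moved = fail_over(nodes)
```

In this example the two sessions on node-c are spread across node-a and node-b, and no operator action is involved.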

Session recovery

Session persistence: The cluster maintains session context and state to enable quick resumption after partial failures.
Load rebalancing: After failover, the load balancer redistributes new sessions among remaining nodes to prevent overload.
Log and event synchronization: Session activities, commands, and user actions are logged and synchronized across nodes, ensuring audit record integrity.
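
As a rough illustration of log and event synchronization, the sketch below merges per-node event lists into a single ordered audit trail without duplicates. The Event fields and the merge_audit_logs function are hypothetical, and the sketch assumes each event carries a unique identifier; the platform's actual record format is not described here.

```python
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: float  # seconds since the epoch
    event_id: str     # assumed to be unique per event
    session_id: str
    action: str


def merge_audit_logs(*node_logs: list[Event]) -> list[Event]:
    """Combine logs from several nodes, dropping events seen more than once."""
    seen: dict[str, Event] = {}
    for log in node_logs:
        for event in log:
            # Keep the first copy of each event id; later copies are replicas.
            seen.setdefault(event.event_id, event)
    # Order by time, using the event id to break ties deterministically.
    return sorted(seen.values(), key=lambda e: (e.timestamp, e.event_id))


node_a_log = [
    Event(1.0, "e1", "s1", "login"),
    Event(3.0, "e3", "s1", "command"),
]
node_b_log = [
    Event(2.0, "e2", "s2", "login"),
    Event(3.0, "e3", "s1", "command"),  # replica of the event held by node-a
]
merged = merge_audit_logs(node_a_log, node_b_log)
```

Because the replicated event appears once in the merged list, the audit record stays consistent whichever node survives.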

Procedures for partial or total failure

Isolated node failure

The failed node is automatically isolated from the active pool.
Active sessions are migrated or terminated according to the defined policies.
Notifications and logs are generated for the operations team.
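
The sequence above might look like the following Python sketch. The FailurePolicy values, the handle_node_failure function, and the round-robin choice of target node are assumptions for illustration only; they do not represent Segura's policy settings or API.

```python
import logging
from enum import Enum

logger = logging.getLogger("failover")


class FailurePolicy(Enum):
    MIGRATE = "migrate"
    TERMINATE = "terminate"


def handle_node_failure(
    failed_node: str,
    sessions: list[str],
    policy: FailurePolicy,
    healthy_nodes: list[str],
) -> dict[str, str]:
    """Apply the configured policy to every session on the failed node."""
    outcome: dict[str, str] = {}
    for index, session in enumerate(sessions):
        if policy is FailurePolicy.MIGRATE and healthy_nodes:
            # Assumed placement rule: rotate through the healthy nodes.
            target = healthy_nodes[index % len(healthy_nodes)]
            outcome[session] = f"migrated to {target}"
        else:
            outcome[session] = "terminated"
    # Stand-in for the notification sent to the operations team.
    logger.warning("node %s isolated, %d session(s) handled", failed_node, len(outcome))
    return outcome


result = handle_node_failure(
    "node-c", ["s1", "s2", "s3"], FailurePolicy.MIGRATE, ["node-a", "node-b"]
)
```

If the policy is MIGRATE but no healthy node remains, this sketch falls back to terminating the sessions, which is one possible design choice among several.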

Multiple node or regional failure

The cluster remains operational as long as at least one healthy node is available.
In multi-region setups, sessions can be automatically routed to other regions.
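
A minimal sketch of cross-region routing, assuming an ordered list of preferred regions and a map of healthy nodes per region. The route_session function and the region and node names are hypothetical.

```python
def route_session(
    preferred_regions: list[str],
    healthy_nodes: dict[str, list[str]],
) -> tuple[str, str]:
    """Return the first (region, node) pair that can accept the session."""
    for region in preferred_regions:
        nodes = healthy_nodes.get(region, [])
        if nodes:
            # One healthy node is enough for the region to stay usable.
            return region, nodes[0]
    raise RuntimeError("no healthy node in any region")


# The primary region has lost all of its nodes; the secondary still has one.
cluster_state = {"region-1": [], "region-2": ["node-d"]}
region, node = route_session(["region-1", "region-2"], cluster_state)
```

Here the session lands on node-d in region-2, and an error is raised only when every region is empty.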

Recovery

Recovered nodes can be reintegrated automatically or after administrative approval.
A full audit trail of the event remains accessible for compliance and investigation.
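
The two reintegration paths can be sketched as a small decision function. The NodeState values, the auto_reintegrate setting, and the next_state function are hypothetical names chosen for this example, not platform options.

```python
from enum import Enum


class NodeState(Enum):
    ISOLATED = "isolated"
    PENDING_APPROVAL = "pending_approval"
    ACTIVE = "active"


def next_state(
    passed_health_check: bool,
    auto_reintegrate: bool,
    admin_approved: bool = False,
) -> NodeState:
    """Decide where a recovered node goes after its health check."""
    if not passed_health_check:
        return NodeState.ISOLATED  # still failing, keep it out of the pool
    if auto_reintegrate or admin_approved:
        return NodeState.ACTIVE
    return NodeState.PENDING_APPROVAL  # healthy, waiting for an administrator


state_auto = next_state(passed_health_check=True, auto_reintegrate=True)
state_manual = next_state(passed_health_check=True, auto_reintegrate=False)
```

With automatic reintegration disabled, a healthy node waits in PENDING_APPROVAL until an administrator approves it.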

Tests and recommendations

Conduct periodic failover tests to ensure mechanisms operate correctly.
Maintain frequent backups of the cluster configuration.
Document the contacts responsible for responding to availability incidents, and keep that list up to date.
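
For the backup recommendation, one possible approach is to write timestamped copies of the configuration that can be checked for integrity later. The sketch below assumes the configuration can be exported as a JSON-serializable dictionary; the file naming scheme and the backup_config and verify_backup functions are illustrative, not a Segura feature.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def backup_config(config: dict, backup_dir: Path) -> Path:
    """Write a timestamped copy of the configuration and return its path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(config, sort_keys=True, indent=2).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # The first 12 hex characters of the SHA-256 digest go into the file name.
    path = backup_dir / f"cluster-config-{stamp}-{digest[:12]}.json"
    path.write_bytes(payload)
    return path


def verify_backup(path: Path) -> bool:
    """Check that the file content still matches the digest in its name."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return path.stem.endswith(digest[:12])


# Example run in a temporary directory, with a made-up configuration.
with tempfile.TemporaryDirectory() as tmp:
    saved = backup_config({"nodes": ["node-a", "node-b"]}, Path(tmp) / "backups")
    backup_ok = verify_backup(saved)
```

Running the verification as part of each periodic failover test helps confirm that a backup is usable before it is needed.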