Icinga 2 – Distributed Monitoring

High availability and load balancing out of the box

Clustered Icinga 2 instances independently load-balance monitoring tasks such as checks, notifications and database updates among themselves. They also autonomously replicate configuration and program state in real time, preserving data integrity in the event of a failover. Additionally, all network communication between clustered instances is secured with SSL x509 certificates, creating secure cluster zones.
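
A minimal sketch of what such a setup can look like in an Icinga 2 zones.conf, assuming two hypothetical master endpoints (hostnames and addresses are made up for illustration):

    // Hypothetical master endpoints; each instance holds its own SSL x509
    // certificate, signed by the cluster CA, for encrypted communication.
    object Endpoint "master1.example.org" {
      host = "10.0.0.11"
    }

    object Endpoint "master2.example.org" {
      host = "10.0.0.12"
    }

    // Both endpoints share one zone, so checks, notifications and database
    // updates are load-balanced between them and state is replicated.
    object Zone "master" {
      endpoints = [ "master1.example.org", "master2.example.org" ]
    }

The api feature (ApiListener) must be enabled on every instance for this cluster communication to take place.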

High availability clusters

In Icinga 2, clustered instances elect an ‘active zone master’. This master writes to the IDO database and manages configuration, notifications and check distribution for all nodes. Should the active zone master fail, another instance automatically takes over the role. Furthermore, each instance carries a unique ID to prevent conflicting database entries and split-brain behaviour.
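
For the database writer this behaviour is controlled by the enable_ha attribute; a sketch with made-up credentials:

    // Defined on both master endpoints; with enable_ha set (the default),
    // only the current active zone master writes to the database.
    object IdoMysqlConnection "ido-mysql" {
      host = "db.example.org"
      user = "icinga"
      password = "secret"
      database = "icinga"
      enable_ha = true
    }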

With continuous synchronisation of program state and check results, this design gives Icinga 2 the edge over the active-passive Pacemaker clusters used with Icinga 1 and Nagios. It also makes fail-safe monitoring much easier to scale.

Distributed monitoring

Where operations are dispersed across multiple sites, Icinga 2 also enables distributed monitoring. Thanks to Icinga 2’s cluster zoning, satellite instances can be demarcated into their own secure zones that report to a central NOC. Satellites can be simple checkers or fully equipped with a local IDO database, user interface and other extensions.
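
In zones.conf terms, a satellite is simply a child zone of the central master zone; the site name and address below are hypothetical:

    // A satellite site reporting to the central NOC ("master") zone.
    object Endpoint "berlin1.example.org" {
      host = "192.0.2.21"
    }

    object Zone "berlin" {
      endpoints = [ "berlin1.example.org" ]
      parent = "master"   // replication happens only along this parent link
    }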

Replication can be isolated to occur only between the master zone and each individual satellite, keeping satellite sites blind to one another. If a satellite is cut off, check results are saved for retroactive replication once the connection is restored.
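
How long unsynchronised results are kept for replay can be tuned per endpoint via log_duration; the value below is just an example:

    // The replay log buffers check results while the connection is down
    // and is retained for the configured duration.
    object Endpoint "berlin1.example.org" {
      host = "192.0.2.21"
      log_duration = 2d
    }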

Distributed, high availability monitoring

Combine high availability clusters with a distributed setup, and you have a best practice scenario for large and complex environments.

Satellite nodes can be scaled to form high availability clusters and cordoned off into secure zones. Within each zone, load balancing and replication are coordinated by an active zone instance, reflecting the different levels of the hierarchy. An instance in the satellite zone can be a simple checker, sending results to the active satellite node for replication and display on a local interface. In turn, the active satellite node can relay results to the NOC master zone for global reports.
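
Put together, such a hierarchy might be sketched like this (all names and addresses are hypothetical):

    // NOC master zone as a high availability pair.
    object Endpoint "master1.example.org" { host = "10.0.0.11" }
    object Endpoint "master2.example.org" { host = "10.0.0.12" }

    object Zone "master" {
      endpoints = [ "master1.example.org", "master2.example.org" ]
    }

    // Satellite site as its own high availability pair, reporting upwards.
    object Endpoint "berlin1.example.org" { host = "192.0.2.21" }
    object Endpoint "berlin2.example.org" { host = "192.0.2.22" }

    object Zone "berlin" {
      endpoints = [ "berlin1.example.org", "berlin2.example.org" ]
      parent = "master"
    }

    // Simple checker below the satellite; it connects outwards, so no
    // host attribute is needed on its Endpoint here.
    object Endpoint "berlin-checker1.example.org" { }

    object Zone "berlin-checker" {
      endpoints = [ "berlin-checker1.example.org" ]
      parent = "berlin"
    }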