Network Monitoring
Network monitoring is the automated process of actively and regularly testing network systems for availability and functionality. Monitoring is an important practice in larger IS&T departments, as it provides the Network Operations Center (NOC) with continuous feedback about the current state of the network. Active monitoring gives system administrators the opportunity to fix network problems quickly - sometimes even before network users notice them.
Ultimately, comprehensive network monitoring is a scalability problem: the larger the network infrastructure, the more services there are that require monitoring. This scalability problem can only reasonably be solved with some scalable form of clustering.
Nagios is an open source software solution for network monitoring that is hosted by SourceForge.net. Nagios has a large community of users, and is well-understood by that community.
Nagios installations - even distributed installations - suffer from these same scalability issues, resulting in maintenance headaches and inherent performance problems.
What is DNX?
DNX is a modular extension of Nagios that offloads a significant portion of the work normally done by Nagios to a distributed network of remote hosts. The DNX module ensures that work is distributed fairly and evenly among the registered DNX client hosts.
Design Considerations
The Problem: The currently suggested method of scaling Nagios across multiple servers has several practical disadvantages. Each check is configured to execute on a particular distributed server, which passively sends its results back up to the central box, where a matching passive check must be configured. This means an administrator must install Nagios on every box, maintain the configuration of each check in two places (on the central server and on one of the distributed servers), and keep track of which check executes on which box. This becomes tedious for larger installations with many boxes.
More critically, if a particular distributed server fails, none of the checks configured on that server will be executed (and all of them will alarm on the central server, if freshness checking is configured).
The Approach: Ideally, from an administration standpoint, the Nagios host itself would distribute its checks automatically and dynamically to a group of “worker nodes” in a cluster. This ideal would also include:
- Minimal configuration changes to the central Nagios node (one or two new lines, no changes to the checks themselves, no wrapper scripts, etc.).
- No use of the FIFO pipe, due to its scalability issues.
- Worker nodes that can be added and removed without configuration changes (with the possible exception of security entries to prevent rogue nodes from stealing checks or inserting bogus results).
- If a worker node fails in some way, only its then-in-flight checks are lost, each resulting in a “(Service Check Timeout)”; when Nagios retries those checks, they execute on any of the remaining cluster nodes.
- Checks should not have affinity for any particular node (for the reason stated in the previous point).
Our Response: DNX is a Nagios Event Broker (NEB) module (a NEB module is conceptually similar to a Linux kernel module) that intercepts check commands just before the fork-fork-exec stage. Worker nodes request jobs, and the NEB module matches each check command with a job request and sends it to the requesting node for execution. The worker node executes the check command and passes the results back to the NEB module, which inserts them directly into the results queue data structure (bypassing the FIFO pipe).
High-Level Design
The typical Nagios installation is limited to a single machine. When that machine's CPU and I/O limits are reached, a second full installation of Nagios is required, which must be configured to send the results of its checks to the first. This is done "passively" via another Nagios add-on called NSCA. The problem is that this requires the administrator to maintain full Nagios configurations on two (and eventually more) machines. The failure of any one of these distributed machines requires manually re-distributing the checks, as well as their configurations.
Even this solution only scales so far, because the central Nagios machine that receives all these passive check results - the one running the web front-end CGIs - does so through a small FIFO pipe (the FIFO size varies by architecture, but is almost always less than 64K), so blocking on writes to the FIFO becomes a serious performance problem.
DNX allows the execution of the Nagios check plug-ins to be distributed across a number of "worker nodes", without any modification to the existing Nagios configurations, except for a one-line entry in the main Nagios configuration file that loads the DNX NEB (Nagios Event Broker) module. The NEB module has its own configuration file, but it consists mostly of performance-tuning and timeout variables.
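For illustration, that entry is a standard broker_module directive in nagios.cfg (the module and configuration file paths shown here are examples, not necessarily the actual install locations):

    # nagios.cfg
    # Make sure the event broker is enabled (-1 = all broker options),
    # then load the DNX server module.  Paths are illustrative.
    event_broker_options=-1
    broker_module=/usr/local/nagios/lib/dnxServer.so /usr/local/nagios/etc/dnxServer.cfg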
The existing Nagios installation (the "head node") loads the NEB module, which starts three new threads: the dispatcher, the collector, and the timer. Because these threads run in the same process space as the main Nagios scheduler, each has direct access to all of the Nagios internal data structures.
When this module is loaded, it registers for the Nagios NEBCALLBACK_SERVICE_CHECK_DATA callback, which is invoked just prior to the execution of each check, and watches for the NEBTYPE_SERVICECHECK_INITIATE event, which indicates that Nagios is about to execute the plug-in.
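The sketch below shows roughly what that registration and event filtering look like, using the standard NEB module API; the dnx_dispatch_check() helper is hypothetical and stands in for the real dispatch logic:

    /* Minimal NEB module skeleton (illustrative only). */
    #include "nebmodules.h"
    #include "nebcallbacks.h"
    #include "nebstructs.h"
    #include "neberrors.h"

    NEB_API_VERSION(CURRENT_NEB_API_VERSION);

    static void *dnx_module_handle = NULL;

    /* Hypothetical: match this check with a queued worker job request.
     * Returns 0 if a worker accepted the job, non-zero otherwise. */
    extern int dnx_dispatch_check(nebstruct_service_check_data *check);

    /* Called by Nagios for every service-check event. */
    static int dnx_service_check_handler(int event_type, void *data)
    {
        nebstruct_service_check_data *check = (nebstruct_service_check_data *)data;

        if (event_type != NEBCALLBACK_SERVICE_CHECK_DATA)
            return 0;
        /* Only act when Nagios is about to execute the plug-in. */
        if (check->type != NEBTYPE_SERVICECHECK_INITIATE)
            return 0;

        /* No pending worker request?  Let Nagios run the check locally. */
        if (dnx_dispatch_check(check) != 0)
            return 0;

        /* A worker took the job: tell Nagios NOT to execute it itself. */
        return NEBERROR_CALLBACKOVERRIDE;
    }

    int nebmodule_init(int flags, char *args, nebmodule *handle)
    {
        dnx_module_handle = handle;
        return neb_register_callback(NEBCALLBACK_SERVICE_CHECK_DATA,
                                     dnx_module_handle, 0,
                                     dnx_service_check_handler);
    }

    int nebmodule_deinit(int flags, int reason)
    {
        return neb_deregister_callback(NEBCALLBACK_SERVICE_CHECK_DATA,
                                       dnx_service_check_handler);
    }

The NEBERROR_CALLBACKOVERRIDE return value is what signals Nagios that the module has taken responsibility for the check, so the normal fork-fork-exec path is skipped.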
The worker node runs as a multi-threaded daemon with a pool-manager thread called the "Work Load Manager" and the worker threads themselves, which actually execute the checks. The Work Load Manager is modeled after the Apache threading model, with minimum, incremental, and maximum worker-thread counts.
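The pool-sizing policy might look something like the following sketch; the thread counts, growth step, and helper functions are assumptions used for illustration, not the actual DNX implementation:

    /* Work Load Manager sketch: grow and shrink the worker pool between
     * configured bounds, roughly the way Apache manages request threads. */
    #include <unistd.h>

    #define MIN_WORKERS   10   /* always keep at least this many threads */
    #define MAX_WORKERS  100   /* hard upper bound on the pool           */
    #define GROW_BY        5   /* incremental growth step                */

    extern int  active_workers;        /* threads currently executing checks      */
    extern int  total_workers;         /* threads currently in the pool           */
    extern void spawn_worker(void);    /* hypothetical: pthread_create() a worker */
    extern void retire_worker(void);   /* hypothetical: ask one worker to exit    */

    static void *work_load_manager(void *arg)
    {
        (void)arg;
        for (;;) {
            if (active_workers == total_workers && total_workers < MAX_WORKERS) {
                /* Pool is saturated: add a batch of worker threads. */
                int n = GROW_BY;
                if (total_workers + n > MAX_WORKERS)
                    n = MAX_WORKERS - total_workers;
                while (n-- > 0)
                    spawn_worker();
            } else if (active_workers < total_workers / 2 &&
                       total_workers > MIN_WORKERS) {
                /* Pool is mostly idle: trim one thread per pass. */
                retire_worker();
            }
            sleep(1);   /* re-evaluate once per second */
        }
        return NULL;
    }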
Each worker thread on a worker node sends a job request to the dispatcher thread on the main Nagios machine - the "head" node. The dispatcher adds the request to its list of job requests. If there are no job requests in the queue, the NEB callback event handler simply returns a 0 result code, and the head node executes the plug-in as it normally would. A new worker thread on any node can request a job at any time, without affecting the execution of the head node (no reloads or configuration changes are required, as long as the worker node's IP address is in the access control list). Job requests themselves eventually (and configurably) time out, so if all worker nodes die, local (head-node) execution of plug-ins resumes - as if the DNX NEB module had never been loaded.
Assuming there is a job request in the queue, the NEB callback matches it with the service check job, and the dispatcher thread sends the check to the worker thread that made the request. The worker thread executes the check and sends the results to the head node's collector thread. If a worker node dies for any reason, it simply stops requesting checks, so no further checks are dropped (other than those already in flight to that node when it died). Ultimately, even those checks are not really lost: Nagios reschedules any check for which no results are returned.
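Put together, each worker thread amounts to a request-execute-report loop. This sketch runs plug-ins with popen() and uses hypothetical dnx_* helpers for the dispatcher/collector messaging, since the actual wire protocol is internal to DNX:

    /* Worker thread sketch: request a job, run the plug-in, report the result. */
    #include <stdio.h>
    #include <sys/wait.h>

    /* Hypothetical messaging helpers for the dispatcher/collector protocol. */
    extern int  dnx_request_job(char *cmdline, size_t len);  /* blocks; may time out */
    extern void dnx_send_result(const char *output, int code);

    static void *worker_thread(void *arg)
    {
        char cmdline[2048];
        char output[4096];

        (void)arg;
        for (;;) {
            /* Ask the head node's dispatcher for work; retry on timeout. */
            if (dnx_request_job(cmdline, sizeof(cmdline)) != 0)
                continue;

            /* Execute the check plug-in and capture its first line of output. */
            FILE *fp = popen(cmdline, "r");
            if (fp == NULL) {
                dnx_send_result("(DNX: failed to execute plug-in)", 3); /* 3 = UNKNOWN */
                continue;
            }
            if (fgets(output, sizeof(output), fp) == NULL)
                output[0] = '\0';
            int status = pclose(fp);
            int code = WIFEXITED(status) ? WEXITSTATUS(status) : 3;

            /* Hand the result to the head node's collector thread. */
            dnx_send_result(output, code);
        }
        return NULL;
    }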
The worker node daemon can also load "lib-ified" check commands to speed up the most common checks. So far we have lib-ified the check_nrpe program (distributed with the Nagios NRPE add-on). This is a fairly simple process, and can be done for any plug-in with a little effort. It saves DNX the cost of forking and loading the plug-in executable from disk each time that check is required.
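Conceptually, a lib-ified plug-in is the plug-in's main() compiled into a shared object and resolved once with dlsym(). The entry-point name, signature, and paths below are assumptions for illustration (build with -ldl):

    /* Load a lib-ified plug-in once at startup, then call it per check
     * with no fork(), exec(), or per-check disk I/O. */
    #include <dlfcn.h>
    #include <stdio.h>

    /* Assumed entry point exported by the lib-ified plug-in. */
    typedef int (*dnx_plugin_fn)(int argc, char **argv,
                                 char *result_buf, int buf_len);

    int main(void)
    {
        /* One-time load at daemon startup. */
        void *lib = dlopen("/usr/local/dnx/lib/check_nrpe.so", RTLD_NOW);
        if (lib == NULL) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }
        dnx_plugin_fn check = (dnx_plugin_fn)dlsym(lib, "dnx_plugin_main");
        if (check == NULL) {
            fprintf(stderr, "dlsym: %s\n", dlerror());
            return 1;
        }

        /* Per-check call: equivalent to running the plug-in binary. */
        char *argv[] = { "check_nrpe", "-H", "host.example.com", NULL };
        char result[4096];
        int code = check(3, argv, result, sizeof(result));
        printf("rc=%d output=%s\n", code, result);

        dlclose(lib);
        return 0;
    }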
The collector thread on the head node posts the results directly to the Nagios results ring buffer data structure (bypassing the FIFO pipe) for the Nagios reaper to handle.
The timer thread on the head node expires overdue checks and handles late or lost check results.
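In outline, the timer thread is a periodic sweep over the list of in-flight jobs; the bookkeeping structure and helper here are assumptions, not DNX internals:

    /* Timer thread sketch: expire checks whose results never arrived. */
    #include <time.h>
    #include <unistd.h>

    struct inflight_job {                  /* hypothetical bookkeeping entry */
        struct inflight_job *next;
        time_t deadline;                   /* dispatch time + service check timeout */
    };

    extern struct inflight_job *inflight_list;
    extern void dnx_post_timeout_result(struct inflight_job *job);

    static void *timer_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            time_t now = time(NULL);
            struct inflight_job *job;
            for (job = inflight_list; job != NULL; job = job->next) {
                if (now > job->deadline)
                    /* Post "(Service Check Timeout)" so Nagios will
                     * reschedule the check on a surviving node. */
                    dnx_post_timeout_result(job);
            }
            sleep(1);
        }
        return NULL;
    }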