Icinga 2 is a feature “monster”. You can do so much more with it than just “check” and “notify”: forward your performance data into metric systems such as Graphite or InfluxDB, add the IDO database backend for beautiful dashboards in Icinga Web 2, or connect to the REST API and have Dashing present the latest stats in your office.
After all, Icinga 2 runs as an application on your server and can suffer from outages, full disks, load and memory issues and what not. It shouldn’t happen, but what if it does?
Find the root cause
Sometimes problems are trivial: a disk running full, I/O performance issues, and what not. Analysing them costs time and requires knowledge of the right tools. We’ve seen large-scale installations of Icinga 2 and decided to share our knowledge in the new Analyze your Environment chapter inside the troubleshooting docs. This not only applies to Icinga but can be used for any server troubleshooting in a sysadmin’s life ;)
One of the swiss-army-knife tools is htop, in my humble opinion a better visualization of running processes than plain top. At first glance you would normally spot a process spending lots of CPU cycles, or a tremendous amount of memory used by a single process. This first peek into the system helps a lot with further analysing a problem. Sometimes you’ve already found the root cause – it is not Icinga 2, but the database server going crazy, for example.
You’ve maybe already used vmstat, iostat and sar too – after years of sysadmin experience they are really helpful to gain insights into a system. If you are on Windows, the Sysinternals Suite might be helpful too.
Grepping syslog, the mail log, mysql.log and more comes next: enabling more verbose logging for applications, and slowly starting to correlate specific event timestamps.
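Correlating timestamps by hand across multiple log files is tedious. A minimal sketch of merging two logs into one sorted timeline (the log lines and source names below are made up for illustration) could look like this:

```python
from datetime import datetime

def parse_events(lines, source):
    """Parse '[YYYY-mm-dd HH:MM:SS] message' lines into (timestamp, source, message) tuples."""
    events = []
    for line in lines:
        stamp, _, message = line.partition("] ")
        ts = datetime.strptime(stamp.lstrip("["), "%Y-%m-%d %H:%M:%S")
        events.append((ts, source, message))
    return events

# Hypothetical excerpts from two different logs.
icinga_log = ["[2017-05-24 15:00:15] information/IdoMysqlConnection: Query queue items: 17735"]
mysql_log = ["[2017-05-24 14:59:58] Aborted connection (Got timeout reading communication packets)"]

# Merge both logs into one timeline, sorted by timestamp.
timeline = sorted(parse_events(icinga_log, "icinga2") + parse_events(mysql_log, "mysql"))
for ts, source, message in timeline:
    print(ts, source, message)
```

Seeing the MySQL timeout a few seconds before the Icinga 2 queue warning is exactly the kind of correlation you are looking for.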
Alert soon enough
Manual analysis is hard, especially if you are not the sysadmin or are doing support via a remote connection (lags in typing are just awesome … not). In order to avoid time-consuming analysis, you should follow best practices for local service monitoring.
You might remember a table inside the old wiki archive which tells you about certain things to monitor on your Icinga system – be it disk, load, database health, NTP and much more. We’ve enhanced the current Icinga 2 documentation with a dedicated chapter based on those details: Monitoring Icinga 2. Bonus: our template library (ITL) already provides CheckCommand definitions for most of those proposed service checks.
This also includes checks for cluster connectivity, clients and the REST API (just treat it as a webserver with check_http for example).
Keep in mind to forward your syslog messages to the Elastic Stack or Graylog and have them available for data analysis and alerts too. Check out the Icinga output for Logstash, which has recently been released.
You should also monitor your metric and log backends. If you have Graphite or InfluxDB with Grafana running, monitor their growth, size and overall performance. If you have icingabeat connected to the Elastic Stack, add service checks for Elasticsearch monitoring.
If you are writing your own check plugin, keep in mind what Mattis Haase told us at Icinga Camp Berlin:
Advanced logging and statistics
A short technical deep dive: Icinga 2 uses so-called “work queues” for asynchronous message processing. Basically that’s a list to which tasks are added, and dedicated threads execute those tasks. Such a task could be a cluster message received via JSON-RPC, or a new MySQL query updating the current service state inside the IDO database.
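As an illustration of the concept (a generic sketch, not Icinga 2’s actual C++ implementation), a work queue is simply a thread-safe list with a dedicated consumer thread:

```python
import queue
import threading

tasks = queue.Queue()      # the "work queue": producers add items here
results = []

def consumer():
    """A dedicated thread takes tasks off the queue and executes them."""
    while True:
        item = tasks.get()
        if item is None:   # sentinel value: shut down the worker
            break
        results.append(item * 2)   # stand-in for "run the MySQL query" etc.
        tasks.task_done()

worker = threading.Thread(target=consumer)
worker.start()

for i in range(5):         # e.g. cluster messages or IDO queries arriving
    tasks.put(i)

tasks.put(None)            # signal shutdown
worker.join()
print(results)
```

If the producers outpace the consumer, items simply pile up in the queue – which is exactly the failure mode described next.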
These work queues may grow over time if the consumer is not able to keep up. You may have seen that already with the IDO MySQL/PostgreSQL feature in recent releases.
[2017-05-24 15:00:15 +0200] information/IdoMysqlConnection: Query queue items: 17735, query rate: 675.483/s (40529/min 51582/5min 51582/15min); empty in infinite time, your database isn't able to keep up
What happens when a list grows and grows? Right: memory gets allocated and the Icinga 2 process consumes more memory over time. You can see that with htop and in your Grafana graphs.
We’ve learned that these logs help to analyse problems, although that logging was rather spammy and needs adjustments (zero items shouldn’t be logged so often, for example).
In addition to logs, the REST API provides plenty of possibilities to gather stats. The main URL endpoint /v1/status allows you to fetch general Icinga statistics including all features.
You can also use the debug console to calculate stats for late check results. We’ve recently added those insights to the troubleshooting docs too.
Analyse trends and correlate events
Any plugin or check you use should provide “good” performance data metrics. That allows you to pass them to Graphite, InfluxDB, PNP, etc. and visualise trends over time.
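“Good” here means following the standard plugin perfdata format, `'label'=value[UOM];warn;crit;min;max`, appended after a `|` in the plugin output. A minimal sketch of emitting it (the labels and thresholds below are made up):

```python
def perfdata(label, value, uom="", warn="", crit="", minimum="", maximum=""):
    """Render one metric in the plugin perfdata format:
    'label'=value[UOM];[warn];[crit];[min];[max]"""
    return f"'{label}'={value}{uom};{warn};{crit};{minimum};{maximum}"

# A check plugin prints human-readable output, then '|', then its metrics:
output = "DISK OK - 42% used | " + " ".join([
    perfdata("used", 42, "%", 80, 90, 0, 100),
    perfdata("inodes", 5, "%", 80, 90, 0, 100),
])
print(output)
```

The Graphite and InfluxDB writers pick up metrics formatted this way automatically.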
It also allows you to correlate graphs and events. Lately we analysed a customer environment where memory consumption was growing rapidly. At first glance it looked like a memory leak inside the InfluxDB feature. After a deep dive into the graphs we saw that a blocking InfluxDB HTTP API connection would block Icinga 2 in its cluster message processing. We had two graphs for that – the “icinga” CheckCommand with an active check execution rate going to zero, and a peak in memory consumption.
That really was a tough one, and we would have been faster with more insight into Icinga 2 itself. Had we seen the aforementioned work queue size growing over time, the “memory leak” would have been identified as “unprocessed check results in memory”, which also explains why check results were marked as late in Icinga Web 2 (IDO database).
You really should watch Avishai’s talk at OSMC 2016 for a good introduction to data analysis:
The next Icinga 2 v2.7 release will add more metrics and possibilities to analyse Icinga 2’s performance. We’ll focus on three methods:
- Logging of work queue sizes, rates and the estimated time until the queue is empty. If it would be empty in infinite time, you really have a problem. (#5280)
- Internal statistics available via the REST API /v1/status URL endpoint (#5266)
- Feature statistics including cluster JSON-RPC queue and API client metrics available as performance data for the “icinga” check (#5284)
Specifically, you can then look into the logs:
tail -f /var/log/icinga2/icinga2.log | grep WorkQueue
[2017-05-26 11:05:55 +0200] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate: 7.85/s (471/min 471/5min 471/15min);
[2017-05-26 11:05:55 +0200] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2017-05-26 11:05:55 +0200] information/WorkQueue: #7 (GraphiteWriter, graphite) items: 0, rate: 0.0166667/s (1/min 1/5min 1/15min);
[2017-05-26 11:05:55 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 0, rate: 1.36667/s (82/min 82/5min 82/15min);
[2017-05-26 11:06:05 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 3, rate: 2.93333/s (176/min 176/5min 176/15min);
These log entries are written either when there are pending items or at a five-minute interval.
If you are interested in general statistics for your API queries, you can now fetch even more metrics from Icinga 2 itself:
curl -k -s -u root:icinga https://localhost:5665/v1/status | python -m json.tool
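If you want to post-process the JSON instead of just pretty-printing it, a small sketch can pull out individual stats. The response excerpt below is trimmed and partly hypothetical; the real /v1/status response lists every component under “results”:

```python
import json

# Trimmed, hypothetical excerpt of a /v1/status response: one entry per
# component, each with "name", "status" and "perfdata" keys.
response = json.loads("""
{"results": [
  {"name": "CIB", "status": {"active_host_checks_1min": 120.0}, "perfdata": []},
  {"name": "IdoMysqlConnection", "status": {}, "perfdata": []}
]}
""")

# Index the stats by component name for convenient lookups.
stats = {r["name"]: r["status"] for r in response["results"]}
print(stats["CIB"]["active_host_checks_1min"])
```

In practice you would feed the output of the curl command above into such a script instead of a hardcoded string.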
The icinga CheckCommand performance data can be visualised in Grafana dashboards. Note: 2.7 adds feature stats as additional performance data, as shown in the screenshots. Check out a draft below – I am working on a Vagrant box update for my integrations talk at Icinga Camp Amsterdam. Join me there to see the latest developments :)
We will also enhance features to use work queues by default. First candidates after the IDO database are InfluxDB (#5219), Graphite (#5287), and Graylog (#5329). That way you’ll also get logs, stats and metrics “for free” in the future.
You can test-drive these changes inside the Icinga Vagrant boxes with the latest snapshot packages. Open up Grafana and build your own awesome dashboards.
These insights and hints should allow you to analyse your (monitoring) systems even better. If you are doing Icinga support (hi Icinga partners & community members) this should make your life easier too :)
If your problem turns out to be a possible bug you now have even more logs, stats and graphs. Please share them with us in your GitHub issue!
Join us at Icinga Camp Amsterdam and discuss your experience with performance analysis. Feedback much appreciated! :)