While surfing around for ideas to improve business monitoring in Icinga, we stumbled upon Bischeck and its creator Anders Haal. So we thought we’d share what we learned about teaming Icinga up with Bischeck for dynamic and adaptive thresholds – straight from the maker’s mouth:
What is Bischeck?
Bischeck is an open source project with the goal to provide dynamic and adaptive threshold logic for Nagios based monitoring solutions and forks such as Icinga.
Until now, Nagios based monitoring has only supported static thresholds. With static thresholds we are limited to defining one maximum or one minimum value as the threshold, and that value has to be valid in every situation for the monitored service. It is not very likely that one single value is correct for each day of the week and for every hour of the day. The risk is that we will get too many or too few alarms, and there are even some service metrics for which we cannot set a threshold at all because of their dynamic behavior. This is especially true when monitoring application and business related services that follow the dynamics of business load.
What can you do with Bischeck?
With Bischeck you have a solution that adds dynamic and adaptive thresholds to complement the traditional static threshold solution. Dynamic and adaptive thresholds give you the ability to:
- Define different threshold profiles depending on the time of the day and day of the week or month: We can set thresholds for any service where we expect some increase and/or decrease in the metric during the day.
- Define thresholds based on historical data: This enables us to express different kinds of threshold baselines. For example, we can specify that the value measured at 12:00 should not be more than 5% higher or lower than the calculated average of the metrics measured at the same time on the previous 5 days. Bischeck supports several mathematical functions to calculate thresholds at run-time.
- Set multiple threshold rules for the same service: E.g. for a file system utilization service we can combine the classic 90% file system utilization threshold with a threshold that checks how quickly the utilization changes, using historical data to calculate a utilization delta over some time period.
- Use data collected for one or more services as input to the threshold calculation for a different service: This adaptiveness is excellent when you have some service metrics that drive the business process. For instance, the number of visits to your web shop is likely to have some effect on the number of expected orders, CPU utilization, application threads, etc. This means we can set the thresholds in relation to data that matters and not just a single value.
- Create virtual services: A virtual service is a metric that cannot be measured at a single source, but can only be calculated from other metrics. Typical examples are ratios, aggregations, etc.
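To make the historical baseline idea from the list above concrete, here is a minimal Python sketch. It is not Bischeck code or Bischeck configuration syntax; the function names, the sample order counts and the OK/CRITICAL mapping are all assumptions chosen for illustration. It computes the average of the values measured at the same time on previous days and checks whether a new measurement stays within 5% of that average.

```python
from statistics import mean


def baseline_bounds(samples, tolerance=0.05):
    """Return (lower, upper) bounds around the mean of historical samples."""
    if not samples:
        raise ValueError("need at least one historical sample")
    baseline = mean(samples)
    margin = abs(baseline) * tolerance
    return baseline - margin, baseline + margin


def check_against_baseline(measured, samples, tolerance=0.05):
    """Return 'OK' if measured lies within the tolerance band, else 'CRITICAL'."""
    lower, upper = baseline_bounds(samples, tolerance)
    return "OK" if lower <= measured <= upper else "CRITICAL"


# Hypothetical order counts measured at 12:00 on the previous 5 days.
orders_at_noon = [980, 1010, 1000, 995, 1015]

print(check_against_baseline(1030, orders_at_noon))  # inside the 5% band
print(check_against_baseline(1100, orders_at_noon))  # outside the 5% band
```

The same pattern extends to the other items in the list: a time-of-day profile would select a different set of samples or a different tolerance per period, and a virtual service would feed a calculated value, such as a ratio of two metrics, into the same check.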
How does it work?
Bischeck can collect metrics in several ways, e.g. by executing SQL queries, by querying Icinga/Nagios data over Livestatus, or by executing normal Icinga/Nagios check commands while bypassing the state and retrieving only the performance data. Both collection and threshold classes are simple to extend and customize.
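As an illustration of what "retrieving only the performance data" can mean, the following Python sketch extracts the metrics from the part of a plugin's output that follows the `|` separator, using the common Nagios plugin convention `label=value[UOM];warn;crit;min;max`. This is an assumption-laden sketch and not how Bischeck itself is implemented; the function name `parse_perfdata` and the sample output line are made up for the example, and the parser only handles plain decimal values.

```python
import re

# label (quoted or bare) = numeric value, optionally followed by a unit
_PERF_ITEM = re.compile(r"(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)([^\s;]*)")


def parse_perfdata(plugin_output):
    """Return {label: (value, unit)} from the text after '|' in plugin output."""
    _, separator, perf = plugin_output.partition("|")
    if not separator:
        return {}  # the plugin reported no performance data
    metrics = {}
    for quoted, bare, value, unit in _PERF_ITEM.findall(perf):
        metrics[quoted or bare] = (float(value), unit)
    return metrics


# Hypothetical plugin output; the state text before '|' is ignored.
line = "DISK OK - free space: / 3326 MB | /=2643MB;5948;5958;0;5968 '/var log'=12.5%;80;90"
print(parse_perfdata(line))
```

The state ("DISK OK") and the static warning and critical values in the output are deliberately discarded, since the point is to apply a different threshold logic to the raw measurement.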
Bischeck integrates with Icinga and Nagios by sending passive checks. Passive checks are supported over NSCA, NRDP and Livestatus. Bischeck data can also be sent to Graphite and OpenTSDB for graphing and visualization.
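For readers unfamiliar with passive checks, the Python sketch below formats the two kinds of messages involved: a `PROCESS_SERVICE_CHECK_RESULT` external command as Icinga/Nagios understands it, and a line in Graphite's plaintext protocol (`path value timestamp`). It only builds the strings and does not send anything. NSCA and NRDP use their own transport formats, which are not shown, and the host, service and metric path names are hypothetical. None of this is taken from the Bischeck source.

```python
def passive_check_command(host, service, return_code, output, timestamp):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line."""
    # 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}"
    )


def graphite_line(path, value, timestamp):
    """Format one metric for Graphite's plaintext protocol."""
    return f"{path} {value} {timestamp}\n"


# Hypothetical host, service and metric path; fixed timestamp for readability.
now = 1700000000
print(passive_check_command("webshop01", "orders", 2, "CRITICAL - 1100 orders", now))
print(graphite_line("bischeck.webshop01.orders", 1100, now), end="")
```

Because the result arrives as a passive check, Icinga does not need to know how the threshold was calculated; it only sees the resulting state and output for the service.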
Bischeck is written in Java and runs as a standalone daemon. It is “supported” on all major Linux distributions. It has also been tested on Windows, but installation scripts are currently not supplied for Windows. For more on how Bischeck works and its architecture, see our documentation.
Where have you seen Bischeck in production environments?
DHL Freight in Sweden was our first “user”. Bischeck has been in production at DHL for over 2 years. They use monitored data such as shipment orders to calculate the threshold for the next step of the process, e.g. monitoring how many of these shipment orders are geographically coded for delivery and truck loading.
DHL has been a great sponsor of the project and you can read more about how they use it on our testimony page. We are now starting to see more companies testing it, and hopefully we can disclose some more interesting production cases in the near future.
Why did you decide to create Bischeck?
Like so many developers, especially in the open source space, I develop solutions because I need some functionality and cannot find it. The pleasure, of course, is seeing that other people with the same need can benefit from what you have done.
Any future development plans?
Absolutely. We will soon release 0.4.3 with just some minor fixes and improvements. At the same time we are working on the next major release, which we think will be our 1.0.0. The major feature we are currently targeting is threshold baselining. With threshold baselining you use the historical data that Bischeck collects and apply mathematical filters to it to get a comparative threshold baseline. This should minimize configuration and hopefully produce a threshold that adapts closely to the production environment. The benefit is of course less configuration but, more importantly, better threshold management that only triggers relevant alarms. This feature will demand some changes to our historical cache storage, and currently we are leaning towards Redis, which seems to work well for our time series data model. Feedback and ideas are of course appreciated.
What’s the coolest thing about Bischeck for you?
I think the coolest thing is that it solves the problem that it was meant to solve. Hopefully the rest of the world will find dynamic and adaptive thresholds as cool and useful as we do.
Version compatibility: All Bischeck versions (0.4.2 at the time of writing) with all Icinga versions
More information: www.bischeck.org