Contributing¶
Linuxfabrik Standards¶
The following standards apply to all Linuxfabrik repositories.
Code of Conduct¶
Please read and follow our Code of Conduct.
Issue Tracking¶
Open issues are tracked on GitHub Issues in the respective repository.
Pre-commit¶
Some repositories use pre-commit for automated linting and formatting checks. If the repository contains a .pre-commit-config.yaml, install pre-commit and configure the hooks after cloning:
pre-commit install
Commit Messages¶
Commit messages follow the Conventional Commits specification:
<type>(<scope>): <subject>
If there is a related issue, append (fix #N):
<type>(<scope>): <subject> (fix #N)
<type> must be one of:
- chore: Changes to the build process or auxiliary tools and libraries
- docs: Documentation only changes
- feat: A new feature
- fix: A bug fix
- perf: A code change that improves performance
- refactor: A code change that neither fixes a bug nor adds a feature
- style: Changes that do not affect the meaning of the code (whitespace, formatting, etc.)
- test: Adding missing tests
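As a rough illustration of the format, the following sketch checks the first line of a commit message. The regular expression and the function name `parse_commit_message` are assumptions made for this example, not part of any Linuxfabrik tooling; it also assumes that the scope is always present.

```python
import re

# Types taken from the list above; the pattern itself is an illustrative assumption.
COMMIT_TYPES = ('chore', 'docs', 'feat', 'fix', 'perf', 'refactor', 'style', 'test')
COMMIT_RE = re.compile(
    r'^(?P<type>' + '|'.join(COMMIT_TYPES) + r')'
    r'\((?P<scope>[^()]+)\): '
    r'(?P<subject>.+?)'
    r'(?: \(fix #(?P<issue>\d+)\))?$'
)


def parse_commit_message(first_line):
    """Return the parts of a commit subject line, or None if it does not match."""
    match = COMMIT_RE.match(first_line)
    return match.groupdict() if match else None
```

Calling `parse_commit_message('fix(about-me): cryptography deprecation warning (fix #341)')` yields the type, scope, subject and issue number as a dictionary; a line such as `update stuff` yields `None`.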
Changelog¶
Document all changes in CHANGELOG.md following Keep a Changelog. Sort entries within sections alphabetically.
Language¶
Code, comments, commit messages, and documentation must be written in English.
Coding Conventions¶
- Sort variables, parameters, lists, and similar items alphabetically where possible.
- Always use long parameters when using shell commands.
- Use RFC 5737, 3849, 7042, and 2606 in examples and documentation:
    - IPv4: 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24
    - IPv6: 2001:DB8::/32
    - MAC: 00-00-5E-00-53-00 through 00-00-5E-00-53-FF (unicast), 01-00-5E-90-10-00 through 01-00-5E-90-10-FF (multicast)
    - Domains: *.example, example.com
Check Plugin Developer Guidelines¶
Monitoring of an Application¶
Monitoring an application can be complex and produce a wide variety of data. Each topic should therefore be dealt with in a separate check (following the Linux mantra: "one tool, one task"). This standardizes the handling of threshold values on the command line, reduces the number of command line parameters and their interdependencies, and allows the Grafana panels to be designed independently and extended later.
Avoid an extensive check that covers a wide variety of aspects:
- myapp --action threading --warning 1500 --critical 2000
- myapp --action memory-usage --warning 80 --critical 90
- myapp --action deployment-status (warning and critical command line options not supported)
Better write three separate checks:
- myapp-threading --warning 1500 --critical 2000
- myapp-memory-usage --warning 80 --critical 90
- myapp-deployment-status
All plugins are written in Python and will be licensed under the UNLICENSE, which is a license with no conditions whatsoever that dedicates works to the public domain.
Setting up your Development Environment¶
All plugins are coded using Python 3.9. Simply clone the libraries and monitoring plugins and start working:
git clone git@github.com:Linuxfabrik/lib.git
git clone git@github.com:Linuxfabrik/monitoring-plugins.git
Deliverables¶
Checklist:
- The plugin itself, tested on RHEL and Debian.
- README file explaining "How?" and "Why?"
- A free, monochrome, transparent SVG icon from https://simpleicons.org or https://fontawesome.com/search?ic=free, placed in the icon directory.
- Optional: unit-test/run - the unittest file (see Unit Tests)
- Optional: requirements.txt
- If providing performance data: Grafana dashboard (see GRAFANA) and .ini file for the Icinga Web 2 Grafana Module
- Icinga Director Basket Config for the check plugin (check2basket)
- Icinga Service Set in all-the-rest.json
- Optional: sudoers file (see sudoers File)
- Optional: A screenshot of the plugin's output from within Icinga, resized to 423x106, using background-color #f5f9fa, hosted on download.linuxfabrik.ch, and listed alphabetically in the project's README.
- CHANGELOG
Rules of Thumb¶
- Be brief by default. Report what needs to be reported to fix a problem. If there is more information that might help the admin, support a --lengthy parameter.
- The plugin should be "self configuring" and/or use best-practice defaults, so that it runs without parameters wherever possible.
- Develop with a minimal Linux in mind.
- Develop with Icinga2 in mind.
- Avoid complicated or fancy (and therefore unreadable) Python statements.
- If possible avoid libraries that have to be installed.
- Validate user input.
- It is ok to use temp files if needed.
- Much better: use a local SQLite database if you want to use a temp file.
- Keep in mind: Plugins have a limited runtime - typically 10 seconds max. Therefore it is ideal if the plugin executes fast and uses minimal resources (CPU time, memory etc.).
- Timeout gracefully on errors (for example df on a failed network drive) and return WARN.
- Return UNKNOWN on missing dependencies or wrong parameters.
- Mainly return WARN. Only return CRIT if the operators want to or have to wake up at night. CRIT means "react immediately".
- EAFP: Easier to ask for forgiveness than permission. This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements.
Return Codes¶
Plugins must return one of the following POSIX-compliant exit codes. Use the constants from lib.base:
| Exit Code | Status | Constant | Meaning |
|---|---|---|---|
| 0 | OK | STATE_OK | Service functioning properly |
| 1 | Warning | STATE_WARN | Service above warning threshold or not working properly |
| 2 | Critical | STATE_CRIT | Service not running or above critical threshold |
| 3 | Unknown | STATE_UNKNOWN | Invalid arguments, missing dependencies, or internal plugin failures |
Guidelines:
- Return STATE_UNKNOWN on missing dependencies, wrong parameters, or when --help/--version is requested.
- Return STATE_WARN for most alert conditions. Only return STATE_CRIT if the situation requires immediate human intervention ("wake up at night").
- Never return any exit code other than 0, 1, 2, or 3.
- Use lib.base.oao() (output and out) to print the result and exit with the appropriate state in a single call.
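The "print and exit in one call" idea can be sketched as follows. This is a minimal stand-in, not the code of lib.base.oao(): the function name `finish` and the fallback to UNKNOWN for out-of-range states are assumptions made for this example.

```python
import sys

# Values from the table above; a real plugin imports them from lib.base.
STATE_OK, STATE_WARN, STATE_CRIT, STATE_UNKNOWN = 0, 1, 2, 3


def finish(message, state=STATE_OK):
    """Hypothetical stand-in for an "output and out" helper.

    Prints the message to STDOUT and exits with a valid plugin exit code.
    """
    if state not in (STATE_OK, STATE_WARN, STATE_CRIT, STATE_UNKNOWN):
        # never leak any other exit code to the monitoring system
        state = STATE_UNKNOWN
    print(message.strip())
    sys.exit(state)
```

A plugin would end with something like `finish('Everything is ok.', STATE_OK)`, so that output and exit code cannot drift apart.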
Bytes vs. Unicode¶
Short:
- Use txt.to_text() and txt.to_bytes().
The theory:
- Data coming into your plugins must be bytes, encoded with UTF-8.
- Decode incoming bytes as soon as possible (best by using the txt library), producing unicode.
- Use unicode throughout your plugin.
- When outputting data, use library functions, they should do output conversions for you. Library functions like base.oao or url.fetch_json will take care of the conversion to and from bytes.
See https://nedbatchelder.com/text/unipain.html for details.
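The "decode early, encode late" rule can be illustrated with plain Python. The two helpers below are simplified stand-ins written for this example; the real txt.to_text() and txt.to_bytes() may behave differently in detail (for instance in their error handling).

```python
def to_text(data, encoding='utf-8'):
    """Stand-in for a to_text() helper: decode bytes, pass text through."""
    if isinstance(data, bytes):
        # undecodable bytes become U+FFFD instead of raising
        return data.decode(encoding, errors='replace')
    return str(data)


def to_bytes(data, encoding='utf-8'):
    """Stand-in for a to_bytes() helper: encode text, pass bytes through."""
    if isinstance(data, bytes):
        return data
    return str(data).encode(encoding)


raw = b'Z\xc3\xbcrich'      # bytes as they arrive from a command or a socket
name = to_text(raw)         # decode once at the boundary ...
message = 'Site: ' + name   # ... and work with text everywhere else
```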
Names, Naming Conventions¶
The plugin name should match the following regex: ^[a-zA-Z0-9\-\_]*$. This allows the plugin name to be used as the Grafana dashboard uid.
Parameters, Option Processing¶
There are a few Nagios-compatible reserved options that should not be used for other purposes:
-a, --authentication authentication password
-C, --community SNMP community
-c, --critical critical threshold
-h, --help help
-H, --hostname hostname
-l, --logname login name
-p, --password password
-p, --port network port
-t, --timeout timeout
-u, --url URL
-u, --username username
-V, --version version
-v, --verbose verbose
-w, --warning warning threshold
Every plugin must support at least --help and --version:
- --help (-h): Print a short usage statement followed by a detailed description of all options with their defaults. Keep the output within 80 characters width. Exit with STATE_UNKNOWN (3).
- --version (-V): Print the plugin name and version (__version__). Exit with STATE_UNKNOWN (3).
Positional arguments are not allowed. All parameters must be named options.
For all other options, use long parameters only. Separate words using a -. We recommend choosing from the following names:
--activestate
--alarm-duration
--always-ok
--argument
--authtype
--cache-expire
--command
--community
--config
--count
--critical
--critical-count
--critical-cpu
--critical-maxchildren
--critical-mem
--critical-pattern
--critical-regex
--critical-slowreq
--database
--datasource
--date
--device
--donor
--filename
--filter
--full
--hide-ok
--hostname
--icinga-callback
--icinga-password
--icinga-service-name
--icinga-url
--icinga-username
--idsite
--ignore
--ignore-pattern
--ignore-regex
--input
--insecure
--instance
--interface
--interval
--ipv6
--key
--latest
--lengthy
--loadstate
--message
--message-key
--metric
--mib
--mibdir
--mode
--module
--mount
--no-kthreads
--no-proxy
--no-summary
--node
--only-dirs
--only-files
--password
--path
--pattern
--perfdata
--perfdata-key
--period
--port
--portname
--prefix
--privlevel
--response
--service
--severity
--snmp-version
--starttype
--state
--state-key
--status
--substate
--suppress-lines
--task
--team
--test
--timeout
--timerange
--token
--trigger
--type
--unit
--unitfilestate
--url
--username
--version
--virtualenv
--warning
--warning-count
--warning-cpu
--warning-maxchildren
--warning-mem
--warning-pattern
--warning-regex
--warning-slowreq
Parameter types are usually:
- type=float
- type=int
- type=lib.args.csv
- type=lib.args.float_or_none
- type=lib.args.int_or_none
- type=str (the default)
- choices=['udp', 'udp6', 'tcp', 'tcp6']
- action='store_true', action='store_false' for switches
Hints:
- For complex parameter tuples, use the csv type. --input='Name, Value, Warn, Crit' results in [ 'Name', 'Value', 'Warn', 'Crit' ]
- For repeating parameters, use the append action. A default variable has to be a list then. --input=a --input=b results in [ 'a', 'b' ]
- If you combine csv type and append action, you get a two-dimensional list: --repeating-csv='1, 2, 3' --repeating-csv='a, b, c' results in [['1', '2', '3'], ['a', 'b', 'c']]
- If you want to provide default values together with append, in parser.add_argument(), leave the default as None. If after main:parse_args() the value is still None, put the desired default list (or any other object) there. The primary purpose of the parser is to parse the commandline - to figure out what the user wants to tell you. There's nothing wrong with tweaking (and checking) the args Namespace after parsing. (According to https://bugs.python.org/issue16399)
- When it comes to parameters, stay backwards compatible. If you have to rename or drop parameters, keep the old ones, but silently ignore them. This helps admins deploy the monitoring plugins to thousands of servers, while the monitoring server is updated later for various reasons. To be as tolerant as possible, replace the parameter's help text with help=argparse.SUPPRESS:
def parse_args():
    """Parse command line arguments using argparse.
    """
    parser = argparse.ArgumentParser(description=DESCRIPTION)
    parser.add_argument(
        '--my-old-and-deprecated-parameter',
        help=argparse.SUPPRESS,
        dest='MY_OLD_VAR',
    )
- A plugin should tolerate unknown parameters. Imagine a monitoring system that checks a thousand hosts. You want to update a plugin offering a new parameter that is essential for you, so you adjust the service definition, add the new parameter and update the plugin on one host. The non-updated plugin on the other 999 hosts will throw an 'UNKNOWN' error when argparse is used with parser.parse_args(). This would significantly disrupt operations and cause stress. Therefore, it makes more sense to be tolerant and use parser.parse_known_args().
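The hints above can be combined into one small parser. This is a sketch under stated assumptions: the `csv` function is a simplified stand-in for lib.args.csv, and the parameter names and the default list are made up for illustration.

```python
import argparse


def csv(arg):
    """Simplified stand-in for lib.args.csv: split on commas, strip blanks."""
    return [item.strip() for item in arg.split(',')]


def parse_args(argv=None):
    """Parse the command line while tolerating unknown parameters."""
    parser = argparse.ArgumentParser(description='Example plugin')
    parser.add_argument(
        '--input',
        action='append',   # repeatable parameter
        default=None,      # the real default is applied after parsing
        dest='INPUT',
        type=csv,          # each occurrence becomes a list
    )
    parser.add_argument(
        '--my-old-and-deprecated-parameter',
        dest='MY_OLD_VAR',
        help=argparse.SUPPRESS,  # still accepted, but hidden from --help
    )
    # parse_known_args() returns (namespace, list_of_unknown_arguments)
    args, _unknown = parser.parse_known_args(argv)
    if args.INPUT is None:
        args.INPUT = [['Name', 'Value', 'Warn', 'Crit']]
    return args
```

With `--input='1, 2, 3' --input='a, b, c'` the result is the two-dimensional list described above, while a parameter that this version of the plugin does not know yet is ignored instead of ending the run with an error.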
Commit Scopes¶
Use the plugin name as commit scope:
fix(about-me): cryptography deprecation warning (fix #341)
For the first commit, use the message Add <plugin-name>.
Threshold and Ranges¶
If a threshold has to be handled as a range parameter, this is how to interpret them. Compatible with the Monitoring Plugins Development Guidelines and the Nagios Plugin Development Guidelines.
The generalized range format is [@]start:end:
- start must be less than or equal to end.
- start and : are not required if start is 0.
- simple value: a range from 0 up to and including the value
- empty value after :: positive infinity
- ~: negative infinity
- @: if range starts with "@", then alert if inside this range (including endpoints)
- An alert is raised if the metric is outside the range; the endpoints themselves belong to the range. The @ prefix inverts this logic.
Examples:
| -w, -c | OK if result is | WARN/CRIT if |
|---|---|---|
| 10 | in (0..10) | not in (0..10) |
| -10:0 | in (-10..0) | not in (-10..0) |
| 10: | in (10..inf) | not in (10..inf) |
| : | in (0..inf) | not in (0..inf) |
| ~:10 | in (-inf..10) | not in (-inf..10) |
| 10:20 | in (10..20) | not in (10..20) |
| @10:20 | not in (10..20) | in (10..20) |
| @~:20 | not in (-inf..20) | in (-inf..20) |
| @ | not in (0..inf) | in (0..inf) |
So, a definition like --warning 2:100 --critical 1:150 should return the states:
val    0    1    2   ..  100  101   ..  150  151
-w    WA   WA   OK        OK   WA        WA   WA
-c    CR   OK   OK        OK   OK        OK   CR
=>    CR   WA   OK        OK   WA        WA   CR
Another example: --warning 190: --critical 200:
val  189  190  191   ..  199  200  201
-w    WA   OK   OK        OK   OK   OK
-c    CR   CR   CR        CR   OK   OK
=>    CR   CR   CR        CR   OK   OK
Another example: --warning ~:0 --critical 10
val   -2   -1    0    1   ..    9   10   11
-w    OK   OK   OK   WA        WA   WA   WA
-c    CR   CR   OK   OK        OK   OK   CR
=>    CR   CR   OK   WA        WA   WA   CR
Have a look at procs on how to implement this.
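For readers who want to see the rules in executable form, here is a minimal sketch of a range parser. It is an independent illustration and not the implementation used in procs or in the Linuxfabrik libraries; the function names `parse_range`, `alerts` and `get_state` are chosen for this example only.

```python
import math


def parse_range(spec):
    """Parse '[@]start:end' into (start, end, invert)."""
    invert = spec.startswith('@')
    if invert:
        spec = spec[1:]
    if ':' in spec:
        start_s, end_s = spec.split(':', 1)
    else:
        start_s, end_s = '', spec          # simple value: 0 up to the value
    if start_s == '~':
        start = -math.inf                  # "~" means negative infinity
    elif start_s == '':
        start = 0.0                        # start defaults to 0
    else:
        start = float(start_s)
    end = math.inf if end_s == '' else float(end_s)
    if start > end:
        raise ValueError('start must be less than or equal to end')
    return start, end, invert


def alerts(value, spec):
    """Return True if value raises an alert for the given range."""
    start, end, invert = parse_range(spec)
    inside = start <= value <= end         # endpoints belong to the range
    return inside if invert else not inside


def get_state(value, warning, critical):
    """Combine both thresholds; critical wins over warning."""
    if alerts(value, critical):
        return 'CR'
    if alerts(value, warning):
        return 'WA'
    return 'OK'
```

Evaluating `get_state(value, '2:100', '1:150')` for the values 0, 1, 2, 100, 101, 150 and 151 reproduces the last row of the first example above.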
Caching temporary data, SQLite database¶
Use cache if you need a simple key-value store, for example as used in nextcloud-version. Otherwise, use db_sqlite as used in cpu-usage.
Error Handling¶
- Catch exceptions using try/except, especially in functions.
- In functions, if you have to catch exceptions, on such an exception always return (False, errormessage). Otherwise return (True, result) if the function succeeds in any way. For example, returning (True, False) means that the function has not raised an exception and its result is simply False.
- A function calling a function with such an extended error handling has to return a (retc, result) tuple itself.
- In main() you can use lib.base.coe() to simplify error handling.
- Have a look at nextcloud-version for details.
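The tuple convention might look like this in practice. The functions `read_json` and `get_version` are hypothetical examples written for this sketch; they only illustrate how a (success, result) tuple is passed up the call chain.

```python
import json


def read_json(path):
    """Return (True, data) on success or (False, errormessage) on failure."""
    try:
        with open(path, encoding='utf-8') as handle:
            return (True, json.load(handle))
    except (OSError, ValueError) as e:
        # ValueError also covers json.JSONDecodeError
        return (False, 'Unable to read {}: {}'.format(path, e))


def get_version(path):
    """A caller of read_json() returns a (success, result) tuple itself."""
    success, result = read_json(path)
    if not success:
        return (False, result)   # pass the error message upwards
    try:
        return (True, result['version'])
    except (KeyError, TypeError):
        return (False, 'No "version" key in {}'.format(path))
```

In main(), the caller only has to look at the first element of the tuple to decide whether to continue or to report the error message.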
By the way, when running the compiled variants, this gives the nice and intended error if the module is missing:
try:
    import psutil # pylint: disable=C0413
except ImportError:
    print('Python module "psutil" is not installed.')
    sys.exit(STATE_UNKNOWN)
while this leads to an ugly multi-exception stacktrace:
try:
    import psutil # pylint: disable=C0413
except ImportError:
    lib.base.cu('Python module "psutil" is not installed.')
Timeout Handling¶
Plugins have a limited runtime - typically 10 seconds max. Every plugin must handle timeouts gracefully to prevent hanging processes (e.g. df on a failed network drive, unresponsive API endpoints, stuck database connections).
- Always support a --timeout parameter (default: 8 seconds, leaving headroom for Icinga's own 10s timeout).
- Use lib.base.coe(lib.url.fetch(..., timeout=args.TIMEOUT)) for HTTP requests - the library handles timeouts.
- For shell commands, pass a timeout to lib.shell.shell_exec().
- If a timeout occurs, return STATE_WARN with a meaningful message (e.g. "Timeout after 8s while connecting to ...").
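To show the principle without relying on the library, the following sketch uses the standard library's subprocess module directly. It is not how lib.shell.shell_exec() is implemented; the function name `run_with_timeout` and the returned (state, message) tuple are assumptions for this example.

```python
import subprocess

STATE_OK, STATE_WARN, STATE_UNKNOWN = 0, 1, 3


def run_with_timeout(cmd, timeout=8):
    """Run cmd (a list, executed without a shell) and return (state, text)."""
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            check=False,
            text=True,
            timeout=timeout,   # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return (STATE_WARN, 'Timeout after {}s while running "{}"'.format(timeout, cmd[0]))
    except OSError as e:
        # for example: the command is not installed (missing dependency)
        return (STATE_UNKNOWN, 'Unable to run "{}": {}'.format(cmd[0], e))
    return (STATE_OK, result.stdout)
```

A call like `run_with_timeout(['df', '--portability'], timeout=8)` then returns within the time budget even if a network drive hangs.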
Security¶
- External commands: When executing system commands, use lib.shell.shell_exec(). Avoid os.system() or subprocess with shell=True, as these are vulnerable to shell injection. The official Monitoring Plugins guidelines require full paths for all external commands to prevent PATH-based trojan hijacking. Our lib.shell.shell_exec() uses subprocess with shell=False, which eliminates shell injection. We accept PATH-based command resolution for cross-platform compatibility (paths differ across distributions), but be aware that a compromised PATH could still redirect commands.
- Input validation: Validate all user-supplied input. Use argparse type converters (type=int, type=float, type=lib.args.csv) to enforce expected types.
- Temporary files: Avoid temporary files where possible. Prefer a local SQLite database via lib.db_sqlite or lib.cache. If temp files are unavoidable, fail cleanly if the file cannot be created, and delete it when done.
- Symlinks: If a plugin opens or reads files, ensure it does not follow symlinks to unintended locations.
- Credentials: Never log or print passwords, tokens, or other secrets in plugin output - not even in verbose mode.
- Network communication: Use HTTPS by default. Support --insecure to allow self-signed certificates where needed, but never make insecure the default.
Plugin Output¶
Plugins must only print to STDOUT. Never print to STDERR, as Icinga/Nagios does not capture it.
The output structure follows the Monitoring Plugins standard:
STATUS_TEXT - summary message | perfdata
detailed line 1
detailed line 2 | more_perfdata
The first line is the most important - Icinga/Nagios uses it for notifications, web interface display, and SMS alerts. Everything after the first newline is considered "long output" and only shown in detail views.
Rules:
- Print a short concise message in the first line within the first 80 chars if possible.
- Use multi-line output for details (msg_body), with the most important output in the first line (msg_header).
- Performance data is separated from text output by a pipe (|) character. Additional perfdata can follow on subsequent lines after a pipe.
- Do not use the pipe character (|) in the text output itself, as Icinga/Nagios uses it as a delimiter to separate text from performance data. lib.base.oao() automatically replaces stray pipes in the message.
- Don't print "OK".
- Print "[WARNING]" or "[CRITICAL]" for clarification next to a specific item using lib.base.state2str().
- If possible give a help text to solve the problem.
- Multiple items checked, and ...
    - ... everything ok? Print "Everything is ok." or the most important output in the first line, and optionally the items and their data attached in multiple lines.
    - ... there are warnings or errors? Print "There are warnings." or "There are errors." or the most important output in the first line, and optionally the items and their data attached in multiple lines.
- Based on parameters etc. nothing is checked at the end? Print "Nothing checked."
- Wrong username or password? Print "Failed to authenticate."
- Use short "Units of Measurements" without white spaces, including these terms:
    - Bits: use human.bits2human()
    - Bytes: use human.bytes2human()
    - I/O and Throughput: human.bytes2human() + '/s' (Byte per Second)
    - Network: "Rx/s", "Tx/s", use human.bps2human()
    - Numbers: use human.number2human()
    - Percentage: 93.2%
    - Read/Write: "R/s", "W/s", "IO/s"
    - Seconds, Minutes etc.: use human.seconds2human()
    - Temperatures: 7.3C, 45F.
- Use ISO format for date or datetime ("yyyy-mm-dd", "yyyy-mm-dd hh:mm:ss")
- Print human readable datetimes and time periods ("Up 3d 4h", "2019-12-31 23:59:59", "1.5s")
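The output structure can be assembled with a few lines of Python. This sketch is not lib.base.oao(): the function name `build_output` is hypothetical, and replacing stray pipes with a blank is an assumption about what a sensible sanitizer could do.

```python
def build_output(header, body_lines=(), perfdata=''):
    """Assemble plugin output: summary line, perfdata, then detail lines."""

    def clean(text):
        # the pipe is reserved as the delimiter between text and perfdata
        return text.replace('|', ' ')

    first_line = clean(header)
    if perfdata:
        first_line += ' | ' + perfdata
    return '\n'.join([first_line] + [clean(line) for line in body_lines])


output = build_output(
    'There are warnings.',
    ['/var: 93.2% used [WARNING]', '/home: 41.0% used'],
    "'/var'=93.2%;90;95 '/home'=41.0%;90;95",
)
```

Printing `output` gives a first line that is usable on its own in notifications, followed by the details as "long output".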
Verbose Output¶
If a plugin supports -v/--verbose, it should implement up to three verbosity levels (stackable -v -v -v or --verbose --verbose --verbose):
| Level | Output |
|---|---|
| 0 (default) | Single-line summary, minimal output |
| 1 (-v) | Single-line with additional detail (e.g. list of affected items) |
| 2 (-v -v) | Multi-line with configuration debug info (e.g. commands executed, API endpoints queried) |
| 3 (-v -v -v) | Extensive diagnostic detail for troubleshooting |
Note: Most of our plugins use --lengthy instead of -v for extended output. The verbosity levels above apply if the plugin explicitly supports --verbose.
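Stackable verbosity maps naturally onto argparse's count action. The following is a small sketch; the destination name VERBOSE and the cap at level 3 are choices made for this example.

```python
import argparse


def parse_args(argv=None):
    """Count how often -v/--verbose was given (0 to 3)."""
    parser = argparse.ArgumentParser(description='Example plugin')
    parser.add_argument(
        '-v', '--verbose',
        action='count',   # every occurrence increments the value
        default=0,
        dest='VERBOSE',
    )
    args, _unknown = parser.parse_known_args(argv)
    args.VERBOSE = min(args.VERBOSE, 3)   # more than three levels make no difference
    return args
```

Both `-v -v` and `-vv` result in level 2, and `--verbose --verbose --verbose` in level 3.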
Plugin Performance Data, Perfdata¶
"UOM" means "Unit of Measurement".
Format (space-separated label/value pairs):
'label'=value[UOM];[warn];[crit];[min];[max]
Rules:
- Labels may contain any characters except = (equals) and ' (single quote).
- Single quotes around the label are optional but required if the label contains spaces.
- The first 19 characters of a label should be unique (RRD data source limitation).
- value, min, and max must match the character class [-0-9.] and share the same UOM.
- warn and crit use the range format (see Threshold and Ranges).
- min and max are not required for percentage (%) UOM.
- Trailing unfilled semicolons may be dropped.
- label doesn't need to be machine friendly, so 'Pages scanned'=100;;;; is as valuable as pages-scanned=100;;;;.
UOM suffixes:
no unit specified - assume a number (int or float) of things (eg, users, processes, load averages)
s - seconds (also us, ms etc.)
% - percentage
B - bytes (also KB, MB, TB etc.). Bytes preferred, they are exact.
c - a continuous counter (such as bytes transmitted on an interface [so instead of 'B'])
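A single perfdata item can be built with a helper along these lines. The function `perfdata_item` is a hypothetical example and not the helper from the Linuxfabrik libraries; it always quotes the label and drops trailing empty fields.

```python
def perfdata_item(label, value, uom='', warn='', crit='', minimum='', maximum=''):
    """Format one perfdata item as 'label'=value[UOM];[warn];[crit];[min];[max]."""
    if '=' in label or "'" in label:
        raise ValueError('label must not contain "=" or a single quote')
    fields = [
        '{}{}'.format(value, uom),
        str(warn),
        str(crit),
        str(minimum),
        str(maximum),
    ]
    # trailing unfilled semicolons may be dropped
    while len(fields) > 1 and fields[-1] == '':
        fields.pop()
    return "'{}'={}".format(label, ';'.join(fields))


perfdata = ' '.join([
    perfdata_item('usage', 93.2, '%', 80, 90),
    perfdata_item('mem', 1024, 'B', minimum=0, maximum=2048),
])
```

Multiple items are joined with a space, as described in the format line above.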
Wherever possible, prefer percentages over absolute values to assist users in comparing different systems with different absolute sizes.
Be aware of already-aggregated values returned by systems and applications. Apache for example returns a value "137.5 kB/request". Sounds good, but this is not a value at the current time of measurement. Instead, it is the average of all requests during the lifetime of the Apache worker process. If you use this in some sort of Grafana panel, you just get a boring line which converges towards a constant value very fast. Not useful at all.
A monitoring plugin has to calculate such values always on its own. If this is not possible because of missing data, discard them.
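Calculating a current value from two measurements of a lifetime counter could look like this. The function name and the (timestamp, counter) tuples are assumptions for this sketch; in a real plugin the previous measurement would come from the local cache or SQLite database.

```python
def rate_per_second(previous, current):
    """Return the rate between two (timestamp, counter) samples.

    Returns None if no meaningful value can be calculated, so that the
    caller can discard the metric.
    """
    (t0, c0), (t1, c1) = previous, current
    elapsed = t1 - t0
    delta = c1 - c0
    if elapsed <= 0 or delta < 0:
        # clock jump, or the counter was reset (e.g. service restart)
        return None
    return delta / elapsed


# 6000 additional requests within 60 seconds
requests_per_second = rate_per_second((100.0, 5000), (160.0, 11000))
```

Unlike a lifetime average, such a value reflects only what happened between the two check runs.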
PEP 8¶
We use PEP 8 -- Style Guide for Python Code where it makes sense.
Docstrings¶
We document our Libraries using numpydoc docstrings, so that calling pydoc lib/base.py works, for example.
PyLint¶
To further improve code quality, we use PyLint like so:
- Libs: pylint mylib.py
- Monitoring Plugins: pylint --disable='invalid-name, missing-function-docstring, missing-module-docstring' plugin-name
Have a look at PyLint's message codes.
isort¶
To help sort the import-statements we use isort:
# to sort all imports
isort --recursive .
# sort in a single plugin
isort plugin-name
Unit Tests¶
Unit tests are implemented using the unittest framework (https://docs.python.org/3/library/unittest.html). Have a look at the fs-ro plugin on how to implement unit tests. Rules of thumb:
- Within your unit-test/run file, call the plugin as a bash command, capture stdout, stderr and its return code (retc), and run your assertions against stdout, stderr and retc.
- To test a plugin that needs to run some tools that aren't on your machine or that can't provide special output, provide stdout/stderr files in unit-test/stdout, unit-test/stderr and/or unit-test/retc and a --test parameter to feed stdout/stdout-file, stderr/stderr-file, expected-retc into your plugin. If you get the --test parameter, skip the execution of your bash/psutil/whatever function.
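A test following the first rule might be structured like this. It is a generic sketch, not a copy of the fs-ro test: the command in PLUGIN_CMD is a stand-in that merely imitates a plugin, and the helper `run_plugin` is a hypothetical name.

```python
import subprocess
import sys
import unittest

# Stand-in for a real plugin call such as ['../my-plugin', '--test=...'].
PLUGIN_CMD = [sys.executable, '-c', "print('Everything is ok.')"]


def run_plugin(cmd):
    """Run the plugin and return (stdout, stderr, retc)."""
    result = subprocess.run(cmd, capture_output=True, check=False, text=True)
    return result.stdout, result.stderr, result.returncode


class TestCheck(unittest.TestCase):

    def test_if_check_runs_ok(self):
        stdout, stderr, retc = run_plugin(PLUGIN_CMD)
        self.assertIn('Everything is ok.', stdout)
        self.assertEqual(stderr, '')
        self.assertEqual(retc, 0)
```

Saved to a file, the test can be started with `python3 -m unittest <file>`, or by adding the usual `unittest.main()` call at the end of the file.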
If you want to implement unit tests based on containers, the following rules apply:
- Each container file does everything necessary to set up a running environment for the check plugin (e.g. install Python if you want to run the plugin inside the container).
- The ./run unit test simply calls podman and, for each containerfile found, builds the container, injects the libs and the check plugin, and runs the tests - but does not modify the container in any other way.
- See the keycloak-version plugin for how to do this.
Running a unit test:
# cd into the plugin directory, then:
cd unit-test
# run the Python based test:
./run
sudoers File¶
If the plugin requires sudo-permissions to run, please add the plugin to the sudoers-files for all supported operating systems in assets/sudoers/. The OS name should match the ansible variables ansible_facts['distribution'] + ansible_facts['distribution_major_version'] (eg CentOS7). Use symbolic links to prevent duplicate files.
Attention: The newline at the end is required!
Icinga Director Basket Config¶
Each plugin should provide its required Director config in form of a Director basket. The basket usually contains at least one Command, one Service Template and some associated Datafields. The rest of the Icinga Director configuration (Host Templates, Service Sets, Notification Templates, Tag Lists, etc) can be placed in the assets/icingaweb2-module-director/all-the-rest.json file.
The Icinga Director Basket for one or all plugins can be created using the check2basket tool.
Always review the basket before committing.
Create a Basket File from Scratch¶
After writing a new check called new-check, generate a basket file using:
./tools/check2basket --plugin-file check-plugins/new-check/new-check
The basket will be saved as check-plugins/new-check/icingaweb2-module-director/new-check.json. Inspect the basket, paying special attention to:
- Command: timeout
- ServiceTemplate: check_interval
- ServiceTemplate: criticality
- ServiceTemplate: enable_perfdata
- ServiceTemplate: max_check_attempts
- ServiceTemplate: retry_interval
Fine-tune a Basket File¶
Never directly edit a basket JSON file. If adjustments must be made to the basket, create a YML/YAML config file for check2basket.
For example, to set the timeout to 30s, to enable notifications and some other options, the config in check-plugins/new-check/icingaweb2-module-director/new-check.yml should look as follows:
---
variants:
- linux
- windows
overwrites:
    '["Command"]["cmd-check-new-check"]["command"]': '/usr/bin/sudo /usr/lib64/nagios/plugins/new-check'
    '["Command"]["cmd-check-new-check"]["timeout"]': 30
    '["ServiceTemplate"]["tpl-service-new-check"]["check_command"]': 'cmd-check-new-check-sudo'
    '["ServiceTemplate"]["tpl-service-new-check"]["check_interval"]': 3600
    '["ServiceTemplate"]["tpl-service-new-check"]["enable_perfdata"]': true
    '["ServiceTemplate"]["tpl-service-new-check"]["max_check_attempts"]': 5
    '["ServiceTemplate"]["tpl-service-new-check"]["retry_interval"]': 30
    '["ServiceTemplate"]["tpl-service-new-check"]["use_agent"]': false
    '["ServiceTemplate"]["tpl-service-new-check"]["vars"]["criticality"]': 'C'
Then, re-run check2basket to apply the overwrites:
./tools/check2basket --plugin-file check-plugins/new-check/new-check
If a parameter was added, changed or deleted in the plugin, simply re-run the check2basket to update the basket file.
Basket File for different OS¶
The check2basket tool also offers to generate so-called variants of the checks (different flavours of the check command call to run on different operating systems):
- linux: This is the default, and will be used if no other variant is defined. It generates a cmd-check-..., tpl-service-... and the associated datafields.
- windows: Generates a cmd-check-...-windows, cmd-check-...-windows-python, tpl-service-...-windows and the associated datafields.
- sudo: Generates a cmd-check-...-sudo importing the cmd-check-..., but with /usr/bin/sudo prepended to the command, and a tpl-service...-sudo importing the tpl-service..., but with the cmd-check-...-sudo as the check command.
- no-agent: Generates a tpl-service...-no-agent importing the tpl-service..., but with command endpoint set to the Icinga2 master.
Specify them in the check-plugins/new-check/icingaweb2-module-director/new-check.yml configuration as follows:
---
variants:
- linux
- sudo
- windows
- no-agent
Create Basket Files for all Check Plugins¶
To run check2basket against all checks, for example due to a change in the check2basket script itself, use:
./tools/check2basket --auto
Service Sets¶
If you want to create a Service Set, edit assets/icingaweb2-module-director/all-the-rest.json and append the definition using JSON. Provide new unique UUIDs. Do a syntax check using cat assets/icingaweb2-module-director/all-the-rest.json | jq afterwards.
If you want to move a service from one Service Set to another, you have to create a new UUID for the new service (this isn't even possible in the Icinga Director GUI).
Grafana Dashboards¶
The title of the dashboard should be capitalized, and the name has to match the folder/plugin name: spaces are replaced with - and / is ignored, so Network I/O becomes network-io. Each Grafana panel should be meaningful, especially when comparing it to other related panels (e.g. memory usage and CPU usage).
Plugins and Capabilities¶
Incomplete list of special features in some check-plugins.
README explains Python regular expression negative lookaheads to exclude matches:
Lists "Top X" values (search for --top parameter):
Alerts only after a certain amount of calls (search for --count parameter):
Cuts (truncates) its SQLite database table:
Pure/raw network communication using byte-structs and sockets:
Checks for a minimum required 3rd party library version:
"Learns" thresholds on its own (implementing some kind of "threshold warm-up"):
Ports of applications:
- disk-smart: port of GSmartControl to Python.
- All mysql-* plugins: Port of MySQLTuner to Python.
Makes use of FREE and USED wording in parameters:
--perfdata-regex parameter lets you filter for a subset of performance data:
Is aware of its acknowledgement status in Icinga, and will suppress further warnings if it has been ACKed:
Calculates mean and median perfdata over a set of individual items:
Supports human-readable Nagios ranges for bytes:
Sanitizes complex data before querying MySQL/MariaDB:
Reads a file line-by-line, but backwards:
Makes heavy use of patterns versus compiled regexes, matching any() of them:
Using application's config file for authentication:
- All mysql-* plugins
Optionally uses an asset:
- php-status: relies on
monitoring.phpthat can provide more PHP insight in the context of the web server
Provides useful feedback from Redis' Memory Doctor:
Work without the jolokia.war plugin and use the native API:
- All wildfly-* checks
Supports human-readable Nagios ranges for durations:
Differentiates between Windows and Linux (search for lib.base.LINUX or lib.base.WINDOWS):
Unit tests use Docker/Podman to test against a range of versions or a range of operating systems:
- cpu-usage
- keycloak-version (checking the filesystem in the container as well as the API)
Read ini files (example use case: password file parsing):