
net-mgmt/nagios-check_smartmon: adjust it to work with more drives
Needs Review, Public

Authored by dvl on Jun 9 2021, 7:28 PM.
Tags
None
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

The current release does not work with some drives. Case in point: it fails outright on this one.

[dan@r720-01:~] $ /usr/local/bin/sudo /usr/local/libexec/nagios/check_smartmon -d /dev/da2
Traceback (most recent call last):
  File "/usr/local/libexec/nagios/check_smartmon", line 318, in <module>
    (healthStatus, temperature) = parseOutput(healthStatusOutput, temperatureOutput, devtype)
  File "/usr/local/libexec/nagios/check_smartmon", line 219, in parseOutput
    temperature = int(parts[-3])
ValueError: invalid literal for int() with base 10: 'Temperature:'
[dan@r720-01:~] $

Diff Detail

Repository
rP FreeBSD ports repository
Lint
No Lint Coverage
Unit
No Test Coverage
Build Status
Buildable 39826
Build 36715: arc lint + arc unit

Event Timeline

dvl requested review of this revision. Jun 9 2021, 7:28 PM

Without the patch:

[dan@r720-01:~] $ sudo /usr/local/libexec/nagios/check_smartmon -d /dev/da2
Traceback (most recent call last):
  File "/usr/local/libexec/nagios/check_smartmon", line 318, in <module>
    (healthStatus, temperature) = parseOutput(healthStatusOutput, temperatureOutput, devtype)
  File "/usr/local/libexec/nagios/check_smartmon", line 219, in parseOutput
    temperature = int(parts[-3])
ValueError: invalid literal for int() with base 10: 'Temperature:'

With the patch:

[dan@r720-01:~] $ sudo ~/check_smartmon -d /dev/da2
OK: device (/dev/da2) is functional and stable (temperature: 0)|TEMP=0;55;60;

The device in question is:

[dan@r720-01:~] $ sudo smartctl -a /dev/da2
smartctl 7.2 2020-12-30 r5155 [FreeBSD 13.0-RELEASE-p1 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               Pliant
Product:              LB406M
Revision:             D323
Compliance:           SPC-4
User Capacity:        400,088,457,216 bytes [400 GB]
Logical block size:   512 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5001e8200275bf00
Serial number:        [redacted]
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Jun  9 19:27:20 2021 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current Drive Temperature:     41 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 32256:40
Manufactured in week 13 of year 2014
Specified cycle count over device lifetime:  0
Accumulated start-stop cycles:  99
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      42204.716           0
write:         0        0         0         0          0      68043.995           0

Non-medium error count:   113372

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                  64   18064                 - [-   -    -]
# 2  Background short  Completed                  64   18064                 - [-   -    -]
# 3  Background short  Completed                  64   18045                 - [-   -    -]
# 4  Background short  Completed                  64       3                 - [-   -    -]
# 5  Background long   Completed                  64       2                 - [-   -    -]
# 6  Background short  Completed                  64       2                 - [-   -    -]

Long (extended) Self-test duration: 1800 seconds [30.0 minutes]
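
The traceback is consistent with the smartctl output for this drive: splitting the line `Current Drive Temperature:     41 C` on whitespace gives five fields, and index -3 is `Temperature:` rather than the number. Below is a minimal sketch of a more tolerant parse. The helper parse_scsi_temperature and its regular expression are hypothetical; they are not the code in check_smartmon or in this patch.

```python
import re

# Hypothetical pattern: matches the SCSI-style temperature line shown in the
# smartctl output above and captures the integer value in Celsius.
_SCSI_TEMP_RE = re.compile(r"^Current Drive Temperature:\s+(\d+)\s+C\b", re.MULTILINE)


def parse_scsi_temperature(smartctl_output):
    """Return the drive temperature in Celsius, or None if no line matches."""
    match = _SCSI_TEMP_RE.search(smartctl_output)
    return int(match.group(1)) if match else None


sample = (
    "Current Drive Temperature:     41 C\n"
    "Drive Trip Temperature:        60 C\n"
)

# Why the installed code fails: positional indexing lands on a label, not a number.
parts = sample.splitlines()[0].split()
print(parts[-3])                       # the field that int() rejects
print(parse_scsi_temperature(sample))  # the value captured by the pattern
```

Returning None when nothing matches leaves it to the caller to decide how to report an unknown temperature, instead of raising or silently reporting a number.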

Because a diff of a patch is hard to read, here is a diff of the installed code against my working copy. It was copied and pasted, so tabs and other whitespace may be wrong.

[dan@r720-01:~] $ diff -ruN /usr/local/libexec/nagios/check_smartmon ~/check_smartmon
--- /usr/local/libexec/nagios/check_smartmon	2021-05-22 22:29:46.000000000 +0000
+++ /usr/home/dan/check_smartmon	2021-06-09 20:22:50.700010000 +0000
@@ -59,7 +59,7 @@
                         metavar="LEVEL", help="set verbosity level to LEVEL; defaults to 0 (quiet), \
                                         possible values go up to 3")
         parser.add_option("-t", "--type", action="store", dest="devtype", default="ata", metavar="DEVTYPE",
-                        help="type of device (ATA|SCSI)")
+                        help="type of device (ata|scsi)")
         parser.add_option("-w", "--warning-threshold", metavar="TEMP", action="store",
                         type="int", dest="warningThreshold", default=55,
                         help="set temperature warning threshold to given temperature (defaults to 55)")
@@ -231,22 +231,25 @@
         return (healthStatus, temperature)
 # end
 
-def createReturnInfo(healthStatus, temperature, warningThreshold,
+def createReturnInfo(device, healthStatus, temperature, warningThreshold,
                 criticalThreshold):
         """Create return information according to given thresholds."""
 
         # this is absolutely critical!
         if healthStatus not in [ "PASSED", "OK" ]:
                 vprint(2, "Health status: %s" % healthStatus)
-                return (2, "CRITICAL: device does not pass health status")
+                return (2, "CRITICAL: device (%s) does not pass health status" %device)
         # fi
 
         if temperature > criticalThreshold:
-                return (2, "CRITICAL: device temperature (%d) exceeds critical temperature threshold (%s)" % (temperature, criticalThreshold))
+                return (2, "CRITICAL: device (%s) temperature (%d) exceeds critical temperature threshold (%s)|TEMP=%d;%d;%d;" 
+			% (device, temperature, criticalThreshold, temperature, warningThreshold, criticalThreshold))
         elif temperature > warningThreshold:
-                return (1, "WARNING: device temperature (%d) exceeds warning temperature threshold (%s)" % (temperature, warningThreshold))
+                return (1, "WARNING: device (%s) temperature (%d) exceeds warning temperature threshold (%s)|TEMP=%d;%d;%d;" 
+			% (device, temperature, warningThreshold, temperature, warningThreshold, criticalThreshold))
         else:
-                return (0, "OK: device is functional and stable (temperature: %d)" % temperature)
+                return (0, "OK: device (%s) is functional and stable (temperature: %d)|TEMP=%d;%d;%d;" 
+			% (device, temperature, temperature, warningThreshold, criticalThreshold))
         # fi
 # end
 
@@ -302,11 +305,11 @@
         devtype = options.devtype
         vprint(2, "command line supplied device type is: %s" % devtype)
         if not devtype:
-                devtype = "ATA"
+                if device_re.search( device ):
+                        devtype = "scsi"
+                else:
+                        devtype= "ata"
 
-        if device_re.search( device ):
-                devtype = "scsi"
-
         vprint(1, "Device type: %s" % devtype)
 
         # call smartctl and parse output
@@ -317,7 +320,7 @@
         vprint(2, "Parse smartctl output")
         (healthStatus, temperature) = parseOutput(healthStatusOutput, temperatureOutput, devtype)
         vprint(2, "Generate return information")
-        (value, message) = createReturnInfo(healthStatus, temperature,
+        (value, message) = createReturnInfo(device, healthStatus, temperature,
                         options.warningThreshold, options.criticalThreshold)
 
         # exit program
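
The new return strings append a `|TEMP=...` suffix, which follows the Nagios plugin performance data layout `label=value;warn;crit;`. As a rough illustration of how the OK message is assembled, here is a small sketch. The helper format_ok_status is hypothetical and only mirrors the format strings in the diff above; it is not a function in check_smartmon.

```python
def format_ok_status(device, temperature, warning, critical):
    """Hypothetical helper: build the OK message plus a TEMP perfdata field."""
    text = "OK: device (%s) is functional and stable (temperature: %d)" % (
        device,
        temperature,
    )
    # Perfdata goes after the pipe: label=value;warn;crit;
    perfdata = "TEMP=%d;%d;%d;" % (temperature, warning, critical)
    return "%s|%s" % (text, perfdata)


# Same values as the "With the patch" run: temperature 0, thresholds 55 and 60.
print(format_ok_status("/dev/da2", 0, 55, 60))
```

Building the human-readable text and the perfdata separately keeps the three return paths from repeating the same long format string.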
dvl retitled this revision from "Adjust check_smartmon so it works with more drives" to "net-mgmt/nagios-check_smartmon: adjust it to work with more drives". Jun 9 2021, 7:36 PM
dvl edited the summary of this revision.

Fix the patch. I don't know why it was failing.

net-mgmt/nagios-check_smartmon/files/patch-check_smartmon, line 16

This is a help message. I'm fixing the text to match the code.
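
Related to the device type values, it may be worth checking how the option default interacts with the new fallback. The diff context shows `-t` declared with `default="ata"`, so `options.devtype` would already be non-empty when `-t` is omitted, and the `if not devtype` branch would not run. A minimal sketch with optparse follows. The device_re pattern and the helper pick_devtype are assumptions made for illustration; the real device_re is not shown in the diff.

```python
import re
from optparse import OptionParser

# Assumption: a stand-in pattern for SCSI-style FreeBSD device names such as
# /dev/da2. The actual device_re in check_smartmon may differ.
device_re = re.compile(r"^/dev/da\d+$")


def pick_devtype(argv, default):
    """Hypothetical helper: return the device type chosen for argv."""
    parser = OptionParser()
    parser.add_option("-d", "--device", action="store", dest="device", default="")
    parser.add_option("-t", "--type", action="store", dest="devtype", default=default)
    options, _args = parser.parse_args(argv)

    devtype = options.devtype
    if not devtype:
        # Fallback only runs when the option has no value at all.
        devtype = "scsi" if device_re.search(options.device) else "ata"
    return devtype


# With a non-empty default, the device name is never consulted.
print(pick_devtype(["-d", "/dev/da2"], default="ata"))
# With no default, the fallback picks the type from the device name.
print(pick_devtype(["-d", "/dev/da2"], default=None))
```

If detection from the device name is wanted when `-t` is omitted, one option under these assumptions is to declare the option with no default and let the fallback choose.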