Procedure to check the LoadLeveler status when a CRITICAL alert is generated by Nagios.
If the service "loadl" is having a problem, Nagios will display a CRITICAL alert, e.g.:
[07-02-2005 08:52:27] SERVICE ALERT: esmf.ess.uci.edu;loadl;CRITICAL;HARD;3;Service cluster problem: 7 ok, 0 warning, 0 unknown, 1 critical
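The fields after "SERVICE ALERT:" are separated by semicolons: host, service, state, state type, attempt number, and the plugin output. As a minimal sketch, the fields can be pulled apart with sed and awk; parse_service_alert is a hypothetical helper name and not part of Nagios.

```shell
#!/bin/sh
# Hypothetical helper: split a Nagios "SERVICE ALERT" log line into fields.
parse_service_alert() {
    # Strip everything up to and including "SERVICE ALERT: ", then split on ';'.
    printf '%s\n' "$1" | sed 's/^.*SERVICE ALERT: //' |
        awk -F';' '{ printf "host=%s service=%s state=%s type=%s attempt=%s\n", $1, $2, $3, $4, $5 }'
}

# Example with the alert shown above:
parse_service_alert '[07-02-2005 08:52:27] SERVICE ALERT: esmf.ess.uci.edu;loadl;CRITICAL;HARD;3;Service cluster problem: 7 ok, 0 warning, 0 unknown, 1 critical'
```

For this procedure the fields that matter are the service name (loadl) and the state (CRITICAL).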
Log on to esmf.ess.uci.edu, become root, and read the loadl.status file:
% cat /usr/local/nagios/DCS/loadl.status
esmf01m 0 TCP OK - 0 second response time on port 9605
esmf02m 0 TCP OK - 0 second response time on port 9605
esmf03m 0 TCP OK - 0 second response time on port 9605
esmf04m 1 No response on port 9605
esmf05m 0 TCP OK - 0 second response time on port 9605
esmf06m 0 TCP OK - 0 second response time on port 9605
esmf07m 0 TCP OK - 0 second response time on port 9605
esmf08m 0 TCP OK - 0 second response time on port 9605
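With more nodes it can be easier to list only the failing ones. The following is a sketch that assumes every line of the status file has the form "node code message", with code 0 meaning OK, as in the listing above; failed_nodes is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the names of nodes whose status code is not 0.
# Assumes lines of the form "<node> <code> <message>".
failed_nodes() {
    awk 'NF >= 2 && $2 != 0 { print $1 }' "$1"
}

# Usage (path taken from the step above):
#   failed_nodes /usr/local/nagios/DCS/loadl.status
```

For the listing above this would print esmf04m only.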
This check probes the LoadL_schedd process, which runs on each node and accepts job submissions to that node. In the output above, esmf04m is not answering on the LoadL_schedd port (9605).
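The same kind of probe can be run by hand from any machine that can reach the node. This is a rough sketch, not the plugin Nagios itself uses: port_open is a hypothetical helper, it relies on bash's /dev/tcp pseudo-device (a bash feature that some builds lack), and it assumes a timeout command such as the one in GNU coreutils, which may not be installed on every system.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a TCP connection to host:port can be opened
# within the given number of seconds (default 5).
# Assumes bash with /dev/tcp support and a "timeout" command on the PATH.
port_open() {
    # Inside the inner bash, $0 is the host and $1 is the port.
    timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage:
#   if port_open esmf04m 9605; then echo "schedd port answers"; else echo "no response"; fi
```

The timeout matters because a hung daemon may leave the connection attempt waiting indefinitely instead of refusing it.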
LoadLeveler takes the port number for each daemon from its configuration file. If the value is not set there, it looks in the /etc/services file. If the value is not found in that file either, the default is used.
The first field on each line in the example that follows represents the name of a "service". In most cases, these services are also the names of daemons, because few daemons need more than one UDP and one TCP connection. There are two exceptions: LoadL_negotiator_collector is the service name for a second stream port that is used by the LoadL_negotiator daemon; LoadL_schedd_status is the service name for a second stream port used by the LoadL_schedd daemon.
LoadL_master                9616/tcp   # Master port number for stream port
LoadL_negotiator            9614/tcp   # Negotiator port number
LoadL_negotiator_collector  9612/tcp   # Second negotiator stream port
LoadL_schedd                9605/tcp   # Schedd port number for stream port
LoadL_schedd_status         9606/tcp   # Schedd stream port for job status data
LoadL_startd                9611/tcp   # Startd port number for stream port
LoadL_master                9617/udp   # Master port number for dgram port
LoadL_startd                9615/udp   # Startd port number for dgram port
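To see which port a node would take from /etc/services, the lookup can be sketched as below. It covers only the last two steps of the search order (the services file, then a default); it does not read the LoadLeveler configuration file. lookup_port is a hypothetical helper name, and the default passed in is whatever the caller supplies.

```shell
#!/bin/sh
# Hypothetical helper: print the port for a service/protocol pair from a
# services-format file, or the caller's default if there is no entry.
# Arguments: service, protocol, default port, file (default /etc/services).
lookup_port() {
    service=$1 proto=$2 default=$3 file=${4:-/etc/services}
    port=$(awk -v s="$service" -v p="$proto" '
        $1 == s { split($2, a, "/"); if (a[2] == p) { print a[1]; exit } }
    ' "$file" 2>/dev/null)
    echo "${port:-$default}"
}

# Usage:
#   lookup_port LoadL_schedd tcp 9605
```

Both the service name and the protocol must match, since LoadL_master and LoadL_startd each appear twice in the list above, once for TCP and once for UDP.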
As root, log on to each node with a non-responsive LoadL_schedd process and kill that process; the LoadL_master process should restart it automatically. In the session below, a plain kill had no effect, so kill -9 was used.
$ ps -ef | grep Load
root 213130 606222 0 Jun 25 - 32:49 LoadL_startd -f -c /tmp
root 303170 606222 0 Jun 25 - 1:15 LoadL_kbdd -f -c /tmp
root 606222 1 0 Jun 25 - 0:58 /usr/lpp/LoadL/full/bin/LoadL_master
root 663718 606222 0 Jun 25 - 2:57 LoadL_negotiator -f -c /tmp
root 1491022 606222 0 Jun 27 - 0:22 LoadL_schedd -f
loadl 2580502 565412 1 11:50:39 pts/0 0:00 grep Load
$ telnet localhost 9605
Trying...
^C
bash-3.00# kill 1491022
bash-3.00# ps -ef | grep LoadL
root 213130 606222 0 Jun 25 - 32:47 LoadL_startd -f -c /tmp
root 303170 606222 0 Jun 25 - 1:15 LoadL_kbdd -f -c /tmp
root 606222 1 0 Jun 25 - 0:58 /usr/lpp/LoadL/full/bin/LoadL_master
root 663718 606222 0 Jun 25 - 2:57 LoadL_negotiator -f -c /tmp
root 1491022 606222 0 Jun 27 - 0:22 LoadL_schedd -f
root 2269336 2015286 0 11:45:43 pts/0 0:00 grep LoadL
bash-3.00# kill -9 1491022
bash-3.00# ps -ef | grep Load
root 213130 606222 0 Jun 25 - 32:49 LoadL_startd -f -c /tmp
root 303170 606222 0 Jun 25 - 1:15 LoadL_kbdd -f -c /tmp
root 565414 606222 4 11:51:37 - 0:00 LoadL_schedd -f
root 606222 1 0 Jun 25 - 0:58 /usr/lpp/LoadL/full/bin/LoadL_master
root 663718 606222 0 Jun 25 - 2:57 LoadL_negotiator -f -c /tmp
root 2506996 2015286 1 11:51:42 pts/0 0:00 grep Load
bash-3.00# telnet localhost 9605
Trying...
Connected to loopback.
Escape character is '^]'.
^]
telnet> close
Connection closed.
bash-3.00# cat loadl.status
esmf01m 0 TCP OK - 0 second response time on port 9605
esmf02m 0 TCP OK - 0 second response time on port 9605
esmf03m 0 TCP OK - 0 second response time on port 9605
esmf04m 0 TCP OK - 0 second response time on port 9605
esmf05m 0 TCP OK - 0 second response time on port 9605
esmf06m 0 TCP OK - 0 second response time on port 9605
esmf07m 0 TCP OK - 0 second response time on port 9605
esmf08m 0 TCP OK - 0 second response time on port 9605
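In the session above, a plain kill left PID 1491022 running, and only kill -9 removed it, after which LoadL_master started a new LoadL_schedd (PID 565414). That escalation can be sketched as a small function; stop_process is a hypothetical helper name, and the 10-second default wait is an arbitrary assumption, not a LoadLeveler recommendation.

```shell
#!/bin/sh
# Hypothetical helper: send SIGTERM, wait up to N seconds (default 10),
# then fall back to SIGKILL if the process is still present.
stop_process() {
    pid=$1
    wait_secs=${2:-10}
    # If the first signal cannot be delivered, the process is already gone
    # (or we lack permission); there is nothing more to do here.
    kill "$pid" 2>/dev/null || return 0
    while [ "$wait_secs" -gt 0 ]; do
        # kill -0 sends no signal; it only tests whether the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        wait_secs=$((wait_secs - 1))
    done
    kill -9 "$pid" 2>/dev/null
    return 0
}

# Usage, with the PID found by "ps -ef | grep LoadL_schedd":
#   stop_process 1491022
```

After it returns, run ps again to confirm that LoadL_master has started a new LoadL_schedd, then repeat the port test.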
Finally, verify that the Nagios loadl service check returns to OK (green).
Last modified: Sun Jul 3 10:16:55 PDT 2005