LoadLeveler Troubleshooting

Procedure to check the LoadLeveler status when a CRITICAL alert is generated by Nagios.

Check Nagios message

If the service "loadl" is having a problem, Nagios will display a CRITICAL alert, e.g.:

[07-02-2005 08:52:27] SERVICE ALERT: esmf.ess.uci.edu;loadl;CRITICAL;HARD;3;Service cluster problem: 7 ok, 0 warning, 0 unknown, 1 critical

Check the ESMF

Log on to esmf.ess.uci.edu, become root, and read the loadl.status file.

% cat /usr/local/nagios/DCS/loadl.status
esmf01m 0 TCP OK - 0 second response time on port 9605
esmf02m 0 TCP OK - 0 second response time on port 9605
esmf03m 0 TCP OK - 0 second response time on port 9605
esmf04m 1 No response on port 9605
esmf05m 0 TCP OK - 0 second response time on port 9605
esmf06m 0 TCP OK - 0 second response time on port 9605
esmf07m 0 TCP OK - 0 second response time on port 9605
esmf08m 0 TCP OK - 0 second response time on port 9605

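This is a check for the LoadL_schedd process, which runs on each node and accepts job submissions to that node. A status of 1 in the second column (esmf04m in this example) means the node did not answer on port 9605.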
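
With more nodes, the failing entries can be picked out of loadl.status automatically. The following is a minimal sketch, assuming the file keeps the layout shown above (node name, numeric status, message) and that a non-zero status marks a failure. failed_nodes is a hypothetical helper name, not part of Nagios or LoadLeveler.

```shell
# Print the name of every node whose status field (column 2) is non-zero.
# The default path is the one used in this procedure; pass another file
# as the first argument to override it.
failed_nodes() {
    awk 'NF >= 2 && $2 != 0 { print $1 }' "${1:-/usr/local/nagios/DCS/loadl.status}"
}
```

Run against the listing above, failed_nodes would print esmf04m.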

LoadLeveler Port Numbers

If LoadLeveler does not find a port number in its configuration file, it looks in the /etc/services file. If the port number is not found there either, the default is used.

The first field on each line in the example that follows represents the name of a "service". In most cases, these services are also the names of daemons, because few daemons need more than one UDP and one TCP connection. There are two exceptions: LoadL_negotiator_collector is the service name for a second stream port that is used by the LoadL_negotiator daemon; LoadL_schedd_status is the service name for a second stream port used by the LoadL_schedd daemon.

LoadL_master               9616/tcp   # Master port number for stream port
LoadL_negotiator           9614/tcp   # Negotiator port number
LoadL_negotiator_collector 9612/tcp   # Second negotiator stream port
LoadL_schedd               9605/tcp   # Schedd port number for stream port
LoadL_schedd_status        9606/tcp   # Schedd stream port for job status data
LoadL_startd               9611/tcp   # Startd port number for stream port
LoadL_master               9617/udp   # Master port number for dgram port
LoadL_startd               9615/udp   # Startd port number for dgram port
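
To see which port a node would pick up from /etc/services, the lookup can be approximated in the shell. This is only a sketch of the last two steps of the search order described above (services file, then default). It does not read the LoadLeveler configuration file, and lookup_port is a hypothetical helper, not a LoadLeveler command.

```shell
# Print the port for a service/protocol pair from a services-style file,
# or the supplied default when the file has no matching entry.
# Arguments: service name, protocol (tcp or udp), default port, [file]
lookup_port() {
    service=$1 proto=$2 default=$3 file=${4:-/etc/services}
    port=$(awk -v s="$service" -v p="$proto" '
        $1 == s { split($2, a, "/"); if (a[2] == p) { print a[1]; exit } }
    ' "$file" 2>/dev/null)
    echo "${port:-$default}"
}
```

For example, lookup_port LoadL_schedd tcp 9605 prints the /etc/services entry if one exists and 9605 otherwise.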

Restarting the LoadL_schedd process

As root, log in to each node with a non-responsive LoadL_schedd process and kill that process; the LoadL_master process should restart it automatically. If a plain kill has no effect, as in the session below, use kill -9.

$ ps -ef | grep Load
    root  213130  606222   0   Jun 25      - 32:49 LoadL_startd -f -c /tmp 
    root  303170  606222   0   Jun 25      -  1:15 LoadL_kbdd -f -c /tmp 
    root  606222       1   0   Jun 25      -  0:58 /usr/lpp/LoadL/full/bin/LoadL_master 
    root  663718  606222   0   Jun 25      -  2:57 LoadL_negotiator -f -c /tmp 
    root 1491022  606222   0   Jun 27      -  0:22 LoadL_schedd -f 
   loadl 2580502  565412   1 11:50:39  pts/0  0:00 grep Load 
$ telnet localhost 9605
Trying...
^C
bash-3.00# kill 1491022
bash-3.00# ps -ef | grep LoadL
    root  213130  606222   0   Jun 25      - 32:47 LoadL_startd -f -c /tmp 
    root  303170  606222   0   Jun 25      -  1:15 LoadL_kbdd -f -c /tmp 
    root  606222       1   0   Jun 25      -  0:58 /usr/lpp/LoadL/full/bin/LoadL_master 
    root  663718  606222   0   Jun 25      -  2:57 LoadL_negotiator -f -c /tmp 
    root 1491022  606222   0   Jun 27      -  0:22 LoadL_schedd -f 
    root 2269336 2015286   0 11:45:43  pts/0  0:00 grep LoadL
bash-3.00# kill -9 1491022
bash-3.00# ps -ef | grep Load
    root  213130  606222   0   Jun 25      - 32:49 LoadL_startd -f -c /tmp 
    root  303170  606222   0   Jun 25      -  1:15 LoadL_kbdd -f -c /tmp 
    root  565414  606222   4 11:51:37      -  0:00 LoadL_schedd -f 
    root  606222       1   0   Jun 25      -  0:58 /usr/lpp/LoadL/full/bin/LoadL_master 
    root  663718  606222   0   Jun 25      -  2:57 LoadL_negotiator -f -c /tmp 
    root 2506996 2015286   1 11:51:42  pts/0  0:00 grep Load 
bash-3.00# telnet localhost 9605
Trying...
Connected to loopback.
Escape character is '^]'.
^]

telnet> close
Connection closed.
bash-3.00# cat loadl.status
esmf01m 0 TCP OK - 0 second response time on port 9605
esmf02m 0 TCP OK - 0 second response time on port 9605
esmf03m 0 TCP OK - 0 second response time on port 9605
esmf04m 0 TCP OK - 0 second response time on port 9605
esmf05m 0 TCP OK - 0 second response time on port 9605
esmf06m 0 TCP OK - 0 second response time on port 9605
esmf07m 0 TCP OK - 0 second response time on port 9605
esmf08m 0 TCP OK - 0 second response time on port 9605
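
The PID lookup in the session above can be scripted rather than read off the ps listing by eye. A minimal sketch, assuming the ps -ef column layout shown above (PID in the second column); schedd_pids is a hypothetical helper name.

```shell
# Read `ps -ef` output on stdin and print the PID (column 2) of every
# LoadL_schedd process, skipping any grep command that mentions it.
schedd_pids() {
    awk '/LoadL_schedd/ && !/grep/ { print $2 }'
}
```

On the affected node, ps -ef | schedd_pids prints the PID to pass to kill. The helper deliberately stops short of killing anything, so the operator can still confirm the process before acting.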

Now verify that the Nagios loadl service check goes GREEN.

Last modified: Sun Jul 3 10:16:55 PDT 2005