Rahul Nabar
2010-08-11 21:43:39 UTC
I have a node where pbsnodes reports the following:
eu044
state = busy
np = 8
properties = INTEL,10GigE
ntype = cluster
status = opsys=linux,uname=Linux eu044 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64,sessions=25252,nsessions=1,nusers=1,idletime=4160964,totmem=24815792kb,availmem=103236kb,physmem=16429872kb,ncpus=8,loadave=9.00,netload=174910266926482,state=busy,jobs=,varattr=,rectime=1281562538
Since it doesn't show "job-exclusive", I assumed that means there is no
user job on it. But if I log in to eu044 and run top, I see:
######################
top - 16:38:27 up 48 days, 3:53, 1 user, load average: 9.00, 9.00, 9.00
Tasks: 155 total, 7 running, 148 sleeping, 0 stopped, 0 zombie
Cpu(s): 6.0%us, 93.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16429872k total, 16350560k used, 79312k free, 7336k buffers
Swap: 8385920k total, 8385920k used, 0k free, 14416k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25254 gwpeng 25 0 2224m 817m 176 S 100.2 5.1 8879:07 vasp_gamma
25253 gwpeng 25 0 2307m 861m 176 R 99.9 5.4 8879:10 vasp_gamma
25255 gwpeng 25 0 2334m 1.4g 180 S 99.9 8.9 8879:20 vasp_gamma
25256 gwpeng 25 0 2334m 1.4g 176 S 99.9 8.7 8879:19 vasp_gamma
25257 gwpeng 25 0 2292m 919m 176 R 99.9 5.7 8879:15 vasp_gamma
25258 gwpeng 25 0 2333m 730m 176 R 99.9 4.6 8879:40 vasp_gamma
25259 gwpeng 25 0 2326m 942m 176 R 99.9 5.9 8879:13 vasp_gamma
25260 gwpeng 25 0 2204m 843m 176 R 99.9 5.3 8879:18 vasp_gamma
#############################
These are 8-core machines, so I can understand that PBS reports busy
because the load average is 9 (>8).
But why does pbsnodes not list the node as job-exclusive as well? It
doesn't even seem to report a job number for that node.
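
For anyone who wants to scan a cluster for nodes in this same condition, here is a minimal Python sketch. It only parses the comma-separated status string in the form pbsnodes printed above and flags a node whose load average is at or above its core count while its jobs field is empty. The function names parse_status and has_untracked_load are hypothetical, and comparing the load against np follows my reasoning above rather than any documented Torque threshold.

```python
# Status string copied from the pbsnodes output for eu044 above.
STATUS = (
    "opsys=linux,uname=Linux eu044 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 "
    "EDT 2009 x86_64,sessions=25252,nsessions=1,nusers=1,idletime=4160964,"
    "totmem=24815792kb,availmem=103236kb,physmem=16429872kb,ncpus=8,"
    "loadave=9.00,netload=174910266926482,state=busy,jobs=,varattr=,"
    "rectime=1281562538"
)


def parse_status(status):
    """Split a 'key=value,key=value' status string into a dict of strings."""
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:  # skip fragments that carry no '='
            fields[key.strip()] = value.strip()
    return fields


def has_untracked_load(status, cores):
    """Return True when the node is loaded but reports no jobs.

    Assumption: 'loaded' means loadave >= cores, as reasoned in this post.
    """
    fields = parse_status(status)
    load = float(fields.get("loadave", "0"))
    return load >= cores and fields.get("jobs", "") == ""


if __name__ == "__main__":
    print(has_untracked_load(STATUS, cores=8))
```

The check is deliberately crude: it says nothing about why the processes are there, only that the load and the job list disagree.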
The mom seems to be running on the node:
[root@eu044 ~]# service pbs status
pbs_mom is pid 3810
But a momctl reveals that the mom doesn't think there is a local job:
##############################
[root@eu044 ~]# /opt/torque/sbin/momctl -d 3
Host: eu044/eu044 Version: 2.4.5 PID: 3810
Server[0]: euadmin (10.0.3.2:1023)
Init Msgs Received: 5 hellos/2 cluster-addrs
Init Msgs Sent: 11 hellos
Last Msg From Server: 529523 seconds (DeleteJob)
Last Msg To Server: 8 seconds
HomeDirectory: /var/spool/torque/mom_priv
stdout/stderr spool directory: '/var/spool/torque/spool/' (1834324 blocks available)
NOTE: syslog enabled
MOM active: 4161213 seconds
Check Poll Time: 45 seconds
Server Update Interval: 45 seconds
LogLevel: 4 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: TCP
MemLocked: TRUE (mlock)
Prolog: /var/spool/torque/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List:
10.0.0.43,10.0.0.42,10.0.0.41,10.0.0.40,10.0.0.39,10.0.0.38,10.0.0.37,10.0.0.36,10.0.0.35,10.0.0.34,10.0.0.33,10.0.0.32,10.0.0.31,10.0.0.30,10.0.0.29,10.0.0.28,10.0.0.27,10.0.0.26,10.0.0.25,10.0.0.24,10.0.0.23,10.0.0.22,10.0.0.21,10.0.0.20,10.0.0.19,10.0.0.18,10.0.0.17,10.0.0.16,10.0.0.15,10.0.0.14,10.0.0.13,10.0.0.12,10.0.0.11,10.0.0.10,10.0.0.9,10.0.0.8,10.0.0.7,10.0.0.6,10.0.0.5,10.0.0.4,10.0.0.3,10.0.0.2,10.0.0.1,10.0.2.61,10.0.2.60,10.0.2.59,10.0.2.58,10.0.2.57,10.0.2.56,10.0.2.55,10.0.2.54,10.0.2.53,10.0.2.52,10.0.2.51,10.0.2.50,10.0.2.49,10.0.2.48,10.0.2.47,10.0.2.46,10.0.2.45,127.0.0.1
Copy Command: /usr/bin/scp -rpB
NOTE: no local jobs detected
diagnostics complete
#############################
I tried restarting the mom, but it still doesn't detect a job!
--
Rahul