Discussion:
[torqueusers] Nodes that pbs reports are busy which are actually running a job
Rahul Nabar
2010-08-11 21:43:39 UTC
Permalink
I have a node where pbsnodes reports the following:

eu044
state = busy
np = 8
properties = INTEL,10GigE
ntype = cluster
status = opsys=linux,uname=Linux eu044 2.6.18-164.el5 #1 SMP Thu
Sep 3 03:28:30 EDT 2009
x86_64,sessions=25252,nsessions=1,nusers=1,idletime=4160964,totmem=24815792kb,availmem=103236kb,physmem=16429872kb,ncpus=8,loadave=9.00,netload=174910266926482,state=busy,jobs=,varattr=,rectime=1281562538

Since it doesn't show "job-exclusive" I assumed it means it doesn't
have a user job on it. But if I login to eu044 and do a top I see:

######################
top - 16:38:27 up 48 days, 3:53, 1 user, load average: 9.00, 9.00, 9.00
Tasks: 155 total, 7 running, 148 sleeping, 0 stopped, 0 zombie
Cpu(s): 6.0%us, 93.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16429872k total, 16350560k used, 79312k free, 7336k buffers
Swap: 8385920k total, 8385920k used, 0k free, 14416k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25254 gwpeng 25 0 2224m 817m 176 S 100.2 5.1 8879:07
vasp_gamma
25253 gwpeng 25 0 2307m 861m 176 R 99.9 5.4 8879:10
vasp_gamma
25255 gwpeng 25 0 2334m 1.4g 180 S 99.9 8.9 8879:20
vasp_gamma
25256 gwpeng 25 0 2334m 1.4g 176 S 99.9 8.7 8879:19
vasp_gamma
25257 gwpeng 25 0 2292m 919m 176 R 99.9 5.7 8879:15
vasp_gamma
25258 gwpeng 25 0 2333m 730m 176 R 99.9 4.6 8879:40
vasp_gamma
25259 gwpeng 25 0 2326m 942m 176 R 99.9 5.9 8879:13
vasp_gamma
25260 gwpeng 25 0 2204m 843m 176 R 99.9 5.3 8879:18
vasp_gamma
#############################

These are 8-core machines, so I can understand why PBS reports busy:
the load average is 9 (>8).

But why does pbsnodes not list the node as job-exclusive as well? It
doesn't even seem to report a job number for that node.

The mom seems to be running on the node:

[root at eu044 ~]# service pbs status
pbs_mom is pid 3810

But a momctl reveals that the mom doesn't think there is a local job:

##############################
[root at eu044 ~]# /opt/torque/sbin/momctl -d 3

Host: eu044/eu044 Version: 2.4.5 PID: 3810
Server[0]: euadmin (10.0.3.2:1023)
Init Msgs Received: 5 hellos/2 cluster-addrs
Init Msgs Sent: 11 hellos
Last Msg From Server: 529523 seconds (DeleteJob)
Last Msg To Server: 8 seconds
HomeDirectory: /var/spool/torque/mom_priv
stdout/stderr spool directory: '/var/spool/torque/spool/' (1834324
blocks available)
NOTE: syslog enabled
MOM active: 4161213 seconds
Check Poll Time: 45 seconds
Server Update Interval: 45 seconds
LogLevel: 4 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: TCP
MemLocked: TRUE (mlock)
Prolog: /var/spool/torque/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List:
10.0.0.43,10.0.0.42,10.0.0.41,10.0.0.40,10.0.0.39,10.0.0.38,10.0.0.37,10.0.0.36,10.0.0.35,10.0.0.34,10.0.0.33,10.0.0.32,10.0.0.31,10.0.0.30,10.0.0.29,10.0.0.28,10.0.0.27,10.0.0.26,10.0.0.25,10.0.0.24,10.0.0.23,10.0.0.22,10.0.0.21,10.0.0.20,10.0.0.19,10.0.0.18,10.0.0.17,10.0.0.16,10.0.0.15,10.0.0.14,10.0.0.13,10.0.0.12,10.0.0.11,10.0.0.10,10.0.0.9,10.0.0.8,10.0.0.7,10.0.0.6,10.0.0.5,10.0.0.4,10.0.0.3,10.0.0.2,10.0.0.1,10.0.2.61,10.0.2.60,10.0.2.59,10.0.2.58,10.0.2.57,10.0.2.56,10.0.2.55,10.0.2.54,10.0.2.53,10.0.2.52,10.0.2.51,10.0.2.50,10.0.2.49,10.0.2.48,10.0.2.47,10.0.2.46,10.0.2.45,127.0.0.1
Copy Command: /usr/bin/scp -rpB
NOTE: no local jobs detected

diagnostics complete
#############################


I tried restarting the mom but it still doesn't detect a job!


--
Rahul
Garrick Staples
2010-08-11 21:53:20 UTC
Permalink
On Wed, Aug 11, 2010 at 04:43:39PM -0500, Rahul Nabar alleged:
> I have a node where pbsnodes reports the following:
>
> eu044
> state = busy
> np = 8
> properties = INTEL,10GigE
> ntype = cluster
> status = opsys=linux,uname=Linux eu044 2.6.18-164.el5 #1 SMP Thu
> Sep 3 03:28:30 EDT 2009
> x86_64,sessions=25252,nsessions=1,nusers=1,idletime=4160964,totmem=24815792kb,availmem=103236kb,physmem=16429872kb,ncpus=8,loadave=9.00,netload=174910266926482,state=busy,jobs=,varattr=,rectime=1281562538
>
> Since it doesn't show "job-exclusive" I assumed it means it doesn't
> have a user job on it. But if I login to eu044 and do a top I see:

Nope, it doesn't have a job. What you have are stale processes from an old job.
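
A quick way to see them for what they are (gwpeng being the owner from your
top output) is to look at the parent PIDs; processes orphaned by a dead job
usually get reparented to init:

ps -u gwpeng -o pid,ppid,etime,args
# stale processes typically show PPID 1 and a long elapsed time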

--
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

Life is Good!
Rahul Nabar
2010-08-11 21:59:07 UTC
Permalink
On Wed, Aug 11, 2010 at 4:53 PM, Garrick Staples <garrick at usc.edu> wrote:
>
> Nope, it doesn't have a job. What you have are stale processes from an old job.

Thanks! I killed them. Does PBS clean up processes automatically after
a job ends? Or is there a suitable flag? These are non-shared nodes, so
there is no risk of stepping on another job's processes; all 8 cores
are always assigned to the same user.

If not, is it OK to put a pkill for all normal usernames in the
epilogue? Any caveats? Or better ideas?

--
Rahul
Garrick Staples
2010-08-11 22:13:09 UTC
Permalink
On Wed, Aug 11, 2010 at 04:59:07PM -0500, Rahul Nabar alleged:
> On Wed, Aug 11, 2010 at 4:53 PM, Garrick Staples <garrick at usc.edu> wrote:
> >
> > Nope, it doesn't have a job. What you have are stale processes from an old job.
>
> Thanks! I killed them. Does PBS clean up processes automatically after
> a job ends? Or is there a suitable flag? These are non-shared nodes, so
> there is no risk of stepping on another job's processes; all 8 cores
> are always assigned to the same user.
>
> If not, is it OK to put a pkill for all normal usernames in the
> epilogue? Any caveats? Or better ideas?

It will kill processes that it knows about. This includes any children of the
batch script and any processes launched through the TM interface. Any remote
processes started through a remote shell are unknown to PBS and can't be
killed. It is up to your epilogue to figure out what else needs to be killed.
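
If your nodes really are never shared, the epilogue can be as blunt as
something like this (an untested sketch; TORQUE hands the epilogue the job id
as $1 and the job owner as $2, and the UID cutoff for "normal" users is an
assumption you'd adjust for your site):

##############################
#!/bin/sh
# mom_priv/epilogue (sketch): after TORQUE has killed what it knows
# about, sweep up anything else the job owner left on this node.
JOBID="$1"
OWNER="$2"

# leave system accounts alone; regular users assumed to have UID >= 500
if [ "$(id -u "$OWNER" 2>/dev/null || echo 0)" -ge 500 ]; then
    logger -t epilogue "job $JOBID: killing leftover processes of $OWNER"
    pkill -9 -u "$OWNER"
fi
exit 0
##############################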

--
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

Life is Good!
Rushton Martin
2010-08-12 09:56:18 UTC
Permalink
Be careful about assuming that one user = one job. When our new cluster
was delivered someone had configured the epilogue to kill off all
processes belonging to the user, but with 8 or 16 cores per node we were
caught when one user had several jobs running. The first job to finish
killed off the user's other jobs.
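
If people do rely on a per-user pkill, it is worth at least guarding it with
something like this in the epilogue (a rough sketch, not what we actually run;
it assumes the job id and user arrive as $1 and $2, that qstat works from the
compute nodes, and the qstat -rn1 column layout and node-name matching are
assumptions to verify on your installation):

##############################
# count the user's *other* running jobs that list this node
NODE=$(hostname -s)
JOBNUM=${1%%.*}
OTHERS=$(qstat -rn1 2>/dev/null | \
         awk -v jn="$JOBNUM" -v user="$2" -v node="$NODE" \
             'index($1, jn ".") != 1 && $2 == user && index($0, node) > 0 {n++}
              END {print n+0}')
if [ "$OTHERS" -eq 0 ]; then
    pkill -u "$2"    # this was the user's last job on this node
fi
##############################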

Martin Rushton
Weapons Technologies
Tel: 01959 514777, Mobile: 07939 219057
email: jmrushton at QinetiQ.com
www.QinetiQ.com
QinetiQ - Delivering customer-focused solutions

Please consider the environment before printing this email.
John S. Urban
2010-08-12 11:22:58 UTC
Permalink
We use a variety of schedulers and find that cleaning up parallel processes
is an issue with all of them, but:

1) Utilities that control the remote processes are usually available to help
   prevent the problem in the first place, such as blauncher with LSF and some
   versions of mpiexec for PBS/TORQUE when using MPI; -kill options on
   mpirun(1) commands sometimes help too.
2) If you want to make sure everything is killed on the initial node when a
   job wraps up, it is usually best to start your job in a process group and
   then kill the process group in the epilogue, rather than going after
   individual processes.
3) Orphaned processes on remote nodes are a little trickier, but you can
   write a cron(1) entry that:
   A) CAREFULLY makes sure the scheduler is running on the node and that you
      have gotten a good list of what jobs should be on the node. From that
      list, determine what usernames should be on the node. Then look through
      the process table on the node and build a list of all usernames actually
      on the node, skipping any "system" userid names. If a regular username
      has processes on the node but no job on the node, kill those processes.
   B) This assumes the user can only use the node via a scheduled process.
   C) Only kill processes that have several minutes of wall clock accumulated,
      to avoid killing something that started while you were building your
      lists.
   D) These steps will not clean up bad processes from job X if the same user
      is legitimately using the node with a job Y, which is increasingly
      possible as newer nodes have more and more cores.
   E) It is assumed that you have only given users IDs with UIDs in a unique
      range, so that you can easily distinguish "regular" usernames from the
      "system" names often used for running daemons in the background, and
      that users have no other legitimate way of accessing these nodes (cron
      jobs and the like) than via scheduled tasks, be they interactive or
      batch.

All of "3" has been running here as a utility called "shouldnotbehere" via
cron(1) for years, and I spend no time manually cleaning up errant processes;
I know of other sites that do.

Be very careful that your "shouldnotbehere" utility knows when the scheduler
is not giving it a good list of user jobs, or you might kill all your jobs
because you turned off your scheduler. Make sure all kills done by the utility
are recorded to a central log so you always know what is being killed.
Prevention is still the best medicine, so if possible use methods that keep
the job clean in the first place, like mpiexec(1). If you are using commercial
codes that don't allow you to modify the launch mechanism to a robust one that
cleans up after itself, you can get much the same effect with a little effort
via two other common methods:

1) Change the remote process startup command (typically ssh, remsh, or rsh)
   to your own script; several things then become possible to help you clean
   up jobs.
2) Make the command executed on the remote machines a script instead of the
   actual executable.

The details of why these are useful things to do are a bit long for this
discussion, but I can elaborate if any of this sounds useful to you. If you
have a small number of users on a small number of nodes with long-running
jobs, creating a utility like "shouldnotbehere" is probably overkill; but
having just one overloaded node can cause many types of large parallel jobs
to run very poorly ("one bad apple spoils the bunch"). You can easily end up
where the nodes of your job act like racehorses that stop after each lap and
wait for the slowest one to catch up before starting the next lap. Depending
on system and MPI/PVM settings, those waiting nodes can look deceptively busy
and "productive". So at least for me, making a "shouldnotbehere" utility has
been well worth it. As always, use the concept at your own risk.

Sorry, I can't publish "shouldnotbehere" here (it's really not all that
complicated, just a ksh(1) script). But depending on your OS, something as
simple as "ps -e -ouser=|sort|uniq|xargs" can give you a list of the usernames
on a node, and something like pbsnodes -a `hostname` (or LSF's
bjobs -u all -m `hostname`) is one of many ways to list the jobs on a node.
A bare sketch of how those pieces fit together is at the end of this message.
And "runaway" processes often (but not always) have a parent process of 1 if
things have gone badly.
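
To give the flavour of it, the core comparison boils down to something like
this (a bare-bones sh sketch, nothing like the real script; it assumes TORQUE,
that regular users have UIDs of 500 and up, and that pbsnodes and qstat answer
from the compute node):

##############################
#!/bin/sh
# shouldnotbehere (sketch): kill processes of regular users who have no
# job on this node.  If the scheduler is not answering, do nothing.
NODE=$(hostname -s)

if ! PBSOUT=$(pbsnodes "$NODE" 2>/dev/null); then
    exit 1    # no trustworthy job list -- never kill on a guess
fi

# owners of the jobs the server says are on this node
JOBS=$(printf '%s\n' "$PBSOUT" | sed -n 's/^ *jobs = //p' | tr ',' ' ')
ALLOWED=$(for j in $JOBS; do
              qstat -f "${j#*/}" 2>/dev/null | \
                  sed -n 's/.*Job_Owner = \([^@]*\)@.*/\1/p'
          done | sort -u)

# regular users (UID >= 500 assumed here) with processes but no job
for u in $(ps -e -o user= | sort -u); do
    uid=$(id -u "$u" 2>/dev/null) || continue
    [ "$uid" -ge 500 ] || continue
    echo "$ALLOWED" | grep -qx "$u" && continue
    logger -t shouldnotbehere "killing stray processes of $u on $NODE"
    pkill -u "$u"    # the real thing also checks accumulated time first
done
##############################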

Rahul Nabar
2010-08-12 19:14:29 UTC
Permalink
On Thu, Aug 12, 2010 at 4:56 AM, Rushton Martin <JMRUSHTON at qinetiq.com> wrote:
> Be careful about assuming that one user = one job. When our new cluster
> was delivered someone had configured the epilogue to kill off all
> processes belonging to the user, but with 8 or 16 cores per node we were
> caught when one user had several jobs running. The first job to finish
> killed off the user's other jobs.

We only allot full nodes, so a user always gets a full 8-core node and
never part of one. One node, one user, one job. Thus that danger is not
present in my case.

--
Rahul
Gus Correa
2010-08-12 15:43:33 UTC
Permalink
Rahul Nabar wrote:
> On Wed, Aug 11, 2010 at 4:53 PM, Garrick Staples <garrick at usc.edu> wrote:
>> Nope, it doesn't have a job. What you have are stale processes from an old job.
>
> Thanks! I killed them. Does PBS clean up processes automatically after
> a job ends? Or is there a suitable flag? These are non-shared nodes, so
> there is no risk of stepping on another job's processes; all 8 cores
> are always assigned to the same user.
>
> If not, is it OK to put a pkill for all normal usernames in the
> epilogue? Any caveats? Or better ideas?
>

Hi Rahul

If the user is running a new job on the same node,
or if you share nodes across different jobs and users,
this will kill legitimate processes.

$0.02
Gus
Gus Correa
2010-08-12 16:25:44 UTC
Permalink
Gus Correa wrote:
> Rahul Nabar wrote:
>> If not, is it OK to put a pkill for all normal usernames in the
>> epilogue? Any caveats? Or better ideas?

Hi Rahul

Check this link:

http://www.sysadmin.hep.ac.uk/wiki/ProcessesOnBatchNodes

Gus
Rahul Nabar
2010-08-12 19:16:50 UTC
Permalink
On Thu, Aug 12, 2010 at 10:43 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> If the user is running a new job on the same node,

How so? Won't the epilogue run before the new job gets assigned? Thus
the pkill should be safe, right?

> or if you share nodes across different jobs and users,
> this will kill legitimate processes.

Not a problem. Our nodes are exclusive. A user gets only a full node at a time.
Coyle, James J [ITACD]
2010-08-12 19:32:21 UTC
Permalink
Rahul,

I'd encourage you to check whether the node is dedicated to a single batch job
before doing the kills. Even though the current policy makes this unnecessary,
at some point you may change the policy or re-use the code, and you'll never
remember the condition that made it safe to assume the node was dedicated, or
why that assumption was necessary.

I implemented a node_cleanup script that the epilogue calls.

The check to see whether the node is dedicated is simply a count of the number
of times the node is contained in $PBS_NODEFILE. If that count is the same as
np for that node, the node is dedicated to the batch job, and in that case it
is OK to kill runaway processes (a rough sketch follows below). I also call
node_cleanup from the prologue, in case errant processes were left over from a
previous non-dedicated job.
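
Roughly like this (just a sketch, not my actual node_cleanup; it assumes the
script can read the job's nodefile -- if $PBS_NODEFILE is not set in the
epilogue environment, the mom keeps a copy under its aux directory -- that the
names in the nodefile match `hostname -s`, and that pbsnodes can be queried
from the node):

##############################
#!/bin/sh
# node_cleanup (sketch) -- call from the epilogue as: node_cleanup <jobid> <user>
# (TORQUE passes the job id and the job owner to the epilogue as $1 and $2).
JOBID="$1"
OWNER="$2"
NODE=$(hostname -s)

# the nodefile: fall back to the mom's copy if PBS_NODEFILE is not exported
NODEFILE=${PBS_NODEFILE:-/var/spool/torque/aux/$JOBID}

# slots this job holds on this node vs. the node's total np
MINE=$(grep -cx "$NODE" "$NODEFILE" 2>/dev/null)
NP=$(pbsnodes "$NODE" 2>/dev/null | sed -n 's/^ *np = //p')

if [ -n "$NP" ] && [ "$MINE" = "$NP" ]; then
    # node is dedicated to this job: safe to sweep up the owner's leftovers
    pkill -u "$OWNER"
fi
##############################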


Jim Coyle
Research Computing Group
115 Durham Center http://jjc.public.iastate.edu
Iowa State Univ.
Ames Iowa 50011
Rahul Nabar
2010-08-12 19:37:15 UTC
Permalink
On Thu, Aug 12, 2010 at 2:32 PM, Coyle, James J [ITACD] <jjc at iastate.edu> wrote:
> The check to see whether the node is dedicated is simply a count of the number
> of times the node is contained in $PBS_NODEFILE. If that count is the same as
> np for that node, the node is dedicated to the batch job, and in that case it
> is OK to kill runaway processes. I also call node_cleanup from the prologue,
> in case errant processes were left over from a previous non-dedicated job.

Thanks! A very prudent check indeed! I'll make sure I check for this
before I issue my pkill.

One of the main reasons I decided to use dedicated nodes on this latest
cluster was the ease of killing rogue processes. I was just lucky that I
haven't run into many so far.

--
Rahul
David Beer
2010-08-11 21:55:15 UTC
Permalink
----- Original Message -----
> But why does pbsnodes not list the node as job-exclusive as well? It
> doesn't even seem to report a job number for that node.
>
> I tried restarting the mom but it still doesn't detect a job!
Rahul,

It doesn't report a job because TORQUE doesn't know of any job present there. pbsnodes will show a job if the server believes one is present, and momctl on the node can tell you for sure whether the mom is running one. At this point you should probably run top and see what is using all of the resources.
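
For example, run from the head node (with eu044 as the node in question):

pbsnodes eu044 | grep -i jobs          # what the server thinks is on the node
momctl -h eu044 -d 3 | grep -i job     # what the mom itself reports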

--
David Beer | Senior Software Engineer
Adaptive Computing