[torqueusers] qdel will not delete

Discussion:

Rahul Nabar

2008-12-11 17:47:25 UTC

Every so often I have jobs that won't respond to qdel. Their
"REMAINING-time" on MAUI then goes negative, which was initially
confusing since I thought it was a MAUI bug.

But the root cause seems to be that PBS will not obey the qdel on this
job, irrespective of whether I issue it as root or MAUI issues it.

I had one such job today and debugged it further. All the sub-nodes
seemed to be up, and the mom daemon on each of these nodes seemed to
be up and running.

The mom_log on the master node was interesting, though. It had this snippet:

12/11/2008 11:47:38;0002; pbs_mom;Svr;im_request;connect from 11.0.1.79:1023
12/11/2008 11:47:38;0008; pbs_mom;Job;233139.supernova.che.wisc.edu;received request 'KILL_JOB' from 11.0.1.79:1023
12/11/2008 11:47:38;0008; pbs_mom;Job;233139.supernova.che.wisc.edu;ERROR: received request 'KILL_JOB' from 11.0.1.79:1023 for job '233139.supernova.che.wisc.edu' (job does not exist locally)

The only way I could get this job to delete was to restart the pbs_mom
on that node.

Has anyone else encountered these symptoms? For me the first clue
was a negative "REMAINING-time" on MAUI and users complaining that
they could not qdel a job. In the past I've achieved the same effect
by removing the relevant foo.supe.JB and foo.supe.SC files from
/var/spool/torque/server_priv/jobs on the master node,
but I don't think that is the best way out. I'd appreciate any other
debugging suggestions as well.
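
A minimal sketch for spotting this condition in a mom log, assuming each
log entry sits on a single line as pbs_mom writes it. The function name
stale_kill_jobs is made up for illustration, and the log path in the usage
comment is an assumption based on the spool directory mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print the IDs of jobs for which pbs_mom logged a
# KILL_JOB request it could not honor ("job does not exist locally").
# Reads log lines on stdin; assumes one log entry per line.
stale_kill_jobs() {
    grep "job does not exist locally" |
        sed -n "s/.*'KILL_JOB'.* for job '\([^']*\)'.*/\1/p" |
        sort -u
}

# Example usage (path is an assumption; adjust to your spool directory):
#   stale_kill_jobs < /var/spool/torque/mom_logs/20081211
```

Each job ID is printed once, however many times the error repeats, so the
output can be compared directly against the list of jobs that refuse to die.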

--
Rahul

Steve Young

2008-12-11 18:03:03 UTC

Usually when this happens, qdel -p <job id> will remove the job from
the queue if a normal qdel won't. From the qdel man page:

-p  Forcibly purge the job from the server. This should only be used
    if a running job will not exit because its allocated nodes are
    unreachable. The admin should make every attempt at resolving the
    problem on the nodes. If a job's mother superior recovers after
    purging the job, any epilogue scripts may still run. This option
    is only available to a batch operator or the batch administrator.
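
As a rough sketch of that advice, the wrapper below tries a normal qdel
first and only falls back to qdel -p if the server still knows the job
after a grace period. The function name purge_if_stuck and the 60-second
default are made up for illustration, and the check assumes qstat exits
non-zero for a job the server no longer knows.

```shell
#!/bin/sh
# Hypothetical wrapper: normal qdel first, "qdel -p" only as a last resort.
# Usage: purge_if_stuck <job id> [seconds to wait, default 60]
purge_if_stuck() {
    job="$1"
    wait_secs="${2:-60}"

    qdel "$job"            # normal delete request
    sleep "$wait_secs"     # give the moms time to clean up

    # Assumption: qstat exits non-zero once the server has dropped the job.
    if qstat "$job" >/dev/null 2>&1; then
        echo "job $job still present, purging" >&2
        qdel -p "$job"     # requires operator or administrator rights
    fi
}

# Example: purge_if_stuck 233139.supernova.che.wisc.edu 120
```

If the server is configured to keep completed jobs visible for a while,
qstat may still list a job that has in fact exited, so it is worth checking
the job state by hand before relying on the purge.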

Hope this helps,

-Steve

_______________________________________________
torqueusers mailing list
http://www.supercluster.org/mailman/listinfo/torqueusers

Greenseid, Joseph M.

2008-12-11 18:22:04 UTC

I've only seen this problem when some of the nodes allocated to the job are unresponsive, either because they've crashed or because they're so overloaded that they're effectively crippled. Once the mom can communicate with the unresponsive node again, the job will be able to exit (unless you force it with qdel -p, as Steve suggests).

--Joe

Garrick Staples

2008-12-11 18:08:02 UTC

Sounds like an old bug. Are you running the latest version in your branch?

--
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

See the Dishonor Roll at http://www.californiansagainsthate.com/

Rahul Nabar

2008-12-11 19:02:06 UTC

Post by Garrick Staples
Sounds like an old bug. Are you running the latest version in your branch?

I'm not sure. I tried to find out, but is there a "pbs version" command?
I can't see a -version option on any of the usual culprits.

I did come across a "pbs_version" variable in the admin guide but
can't figure out how to access it. I also grepped for "version" and
variants in every log I can think of. Where exactly does PBS
write its version info?

--
Rahul

Jeremy Mann

2008-12-11 19:06:33 UTC

All of the pbs binaries take a version argument:

pbs_mom --version
version: 2.3.0

--
Jeremy Mann
***@biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672

Rahul Nabar

2008-12-11 19:14:43 UTC

Post by Jeremy Mann
Any of the pbs binaries have a version argument
pbs_mom --version
version: 2.3.0

That did not seem to work for me, and "man pbs_mom" does not mention --version.

But this seems to work:

momctl -q version
localhost: version = 'version=2.2.

I guess I am a bit behind on versions.

--
Rahul

Garrick Staples

2008-12-11 19:41:39 UTC

Post by Rahul Nabar

I'm not sure. I tried to find out but what's a "pbs version" command?
I can't see a -version option on any of the usual culprits.
I did come across a "pbs_version" variable in the admin guide but
can't figure out how to access this. I did also grep on version and
variants on all possible logs I can think of. Where exactly does PBS
write its version info!

echo 'p s pbs_version' | qmgr
pbs_server --version
pbs_server --about

If these don't work, then it is quite old and you should be upgrading.
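
A small sketch that tries the client-side version queries mentioned in this
thread and stops at the first one that succeeds. The function name
torque_version is made up for illustration. It deliberately skips the
daemon binaries, on the assumption that passing an unrecognized flag to
pbs_server or pbs_mom on an old install is best avoided. No output format
is assumed; whatever the command prints is passed through unchanged.

```shell
#!/bin/sh
# Hypothetical helper: report the TORQUE version using client commands only.
torque_version() {
    # Ask the server through qmgr first.
    echo 'p s pbs_version' | qmgr 2>/dev/null && return 0
    # Fall back to asking the local mom.
    momctl -q version 2>/dev/null && return 0
    echo "no version query succeeded; this install may be quite old" >&2
    return 1
}

# Example: torque_version || echo "check the install by hand"
```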

Rahul Nabar

2008-12-11 19:46:07 UTC

Post by Garrick Staples
echo 'p s pbs_version' | qmgr
pbs_server --version
pbs_server --about
If these don't work, then it is quite old and you should be upgrading.

I'm glad. These all work. They all consistently return 2.2.1 as my
version. That shouldn't be a cause of concern, should it?

--
Rahul

Garrick Staples

2008-12-11 19:51:58 UTC

Post by Rahul Nabar

I'm glad. These all work. They all consistently return 2.2.1 as my
version. That shouldn't be a cause of concern, should it?

The 2.2.x line was mostly a dead end. It was never really supported, and
development moved on quickly to 2.3. The 2.1 line is still the most stable.

Yang Wang

2008-12-12 19:27:08 UTC

Dear friends,

Is it possible to run two pbs_server daemons for the same cluster for failover purposes? Has anyone done this? Is there a brief doc showing how to set up such a system?

Thanks and happy holidays!

Yang

Josh Butikofer

2008-12-12 20:11:20 UTC

This website may be helpful for you:

http://www.clusterresources.com/torquedocs21/4.3high-availability.shtml

It explains how to set up high availability and will probably do what
you want.

Josh Butikofer
Cluster Resources, Inc.

Prakash Velayutham

2008-12-12 20:37:37 UTC

Hello All,

Has anyone here tested Torque with "--ha" in a VM (VMware based)
environment?

I tried the following:

2 VM Torque nodes running OpenSUSE 10.3, Torque-2.3.5

PBS Mom systems (physical hosts, not VMs) running Torque-2.3.5.

In this case everything seems to run OK until I submit a large batch of
jobs, and then I start getting errors like:

pbs_iff: cannot read reply from pbs_server
Cannot connect to specified server host 'bmiclustersvc2-int'.
qsub: cannot connect to server bmiclustersvc2-int (errno=111) Connection refused

Anyone seen this before? Any ideas what could be going wrong?

Thanks,
Prakash

James J Coyle

2008-12-12 19:33:16 UTC

Yang,

The developers can answer your original question, but I'm
guessing you cannot, because trying to start a second server gives
you the message:

pbs_server: another server running

What I do instead is have cron run the following script
once each hour on my head node (I used to run it every 15 minutes):

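#!/bin/ksh
# Restart pbs_sched and pbs_server if they are not running.

# Look for a running scheduler, excluding the grep process itself.
PS_PBS_SCHED=`/bin/ps -aef | grep pbs_sched | grep -v grep`
if [ -z "$PS_PBS_SCHED" ] ; then
    /usr/local/sbin/pbs_sched
fi

# Same check for the server.
PS_PBS_SERVER=`/bin/ps -aef | grep pbs_server | grep -v grep`
if [ -z "$PS_PBS_SERVER" ] ; then
    /usr/local/sbin/pbs_server
fi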

I have a similar script for pbs_mom on the compute nodes.
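
The compute-node variant is not shown in the thread. One way it might look
is sketched below as a generic helper that works for any of the daemons.
The function name ensure_running is made up for illustration, the pbs_mom
path is an assumption copied from the head-node script, and the check
assumes a ps whose "comm" column prints the bare command name.

```shell
#!/bin/sh
# Hypothetical generalization of the watchdog above: start a daemon only if
# no process with that exact command name is listed by ps.
# Usage: ensure_running <process name> <command to start it> [args...]
ensure_running() {
    name="$1"
    shift
    # "-o comm=" prints only the command name, with no header line.
    if ps -e -o comm= | grep -qx "$name"; then
        return 0            # already running, nothing to do
    fi
    "$@"                    # not found: run the start command
}

# On a compute node (install path is an assumption):
#   ensure_running pbs_mom /usr/local/sbin/pbs_mom
```

Matching the whole command name with grep -x avoids the need for the
"grep -v grep" step, since the grep process itself is never named pbs_mom.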

Stewart.Samuels at sanofi-aventis.com

2008-12-15 12:18:46 UTC

I have not yet had the opportunity to test the 2.3.5 release of TORQUE
with --ha. However, I have tested the 2.3.4 release fairly extensively
using VMware-based hosts (2 masters and 1 compute node), configured
according to the TORQUE reference manual. To date, I have
NOT been able to get HA working robustly enough to use it on a
production system. The issue I have had
is that when I force the VM containing the primary master to fail, jobs
that were executing when the failure occurred sometimes complete under the
takeover master and sometimes just hang. In addition, jobs
that were in the queues when the failover occurred simply stay queued
until the master is brought up again. In other words, jobs that were in
the queues do not go into execution when the secondary master becomes
the primary. Something gets lost here. I can, however, submit new jobs
from the new master, and they do go into execution. So some things are
definitely working, but others definitely need further work.

BTW, I'm running RHEL 4.6 in the VMs.

Stewart
