Discussion:
[torqueusers] getting torque/ pbs to reboot a node periodically.
Rahul Nabar
2008-12-09 16:46:08 UTC
Permalink
Is there any way to get pbs/torque to reboot a node periodically? Our
compute-nodes keep running forever and we suspect that over time they
accumulate zombie processes, memory leaks, etc. Making each node reboot,
say, on average once every 10 days or so is not a heavy overhead for us.
After all, a reboot is done in less than 5 minutes. I could also use these
reboots for some periodic logfile cleanup, etc. {We have shared nodes, 8
cores/node, so I cannot really wipe out my scratch etc. through an epilogue
since another job might be running on the other CPUs, and under normal
circumstances it is unusual to have a completely free node.}

What's the best way to auto-schedule this? Ideally I do not want the whole
cluster to reboot at once. In fact, I don't want to over-specify things at all.
Maybe the scheduler can choose nodes to reboot based on its scheduling
strategy, just so long as it reboots each node "on average" once every
10 days.

Any suggestions on implementation?
--
Rahul
Garrick Staples
2008-12-09 18:17:13 UTC
Permalink
Post by Rahul Nabar
Is there any way to get pbs/torque to reboot a node periodically? Our
compute-nodes keep running forever and we suspect that over time they
accumulate zombie processes, memory leaks, etc. Making each node reboot,
say, on average once every 10 days or so is not a heavy overhead for us.
After all, a reboot is done in less than 5 minutes. I could also use these
reboots for some periodic logfile cleanup, etc. {We have shared nodes, 8
cores/node, so I cannot really wipe out my scratch etc. through an epilogue
since another job might be running on the other CPUs, and under normal
circumstances it is unusual to have a completely free node.}
What's the best way to auto-schedule this? Ideally I do not want the whole
cluster to reboot at once. In fact, I don't want to over-specify things at all.
Maybe the scheduler can choose nodes to reboot based on its scheduling
strategy, just so long as it reboots each node "on average" once every
10 days.
Any suggestions on implementation?
It is actually difficult to do while avoiding possible race conditions.

First, you need to drain the nodes by marking them offline. Then you need to
mark them for reboot using the node note. Then a script can reboot nodes when
it finds them offline, without a job, and marked for reboot.
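
Untested, but roughly something like this (assuming the usual pbsnodes
options -o/-N/-c, that the note shows up as "note = ..." in pbsnodes output,
that a "jobs = " line only appears when the node actually has jobs, and
passwordless ssh to the nodes; the "reboot" note text is just an arbitrary
tag):

  # drain and tag the nodes you want rebooted
  for n in node01 node02; do
      pbsnodes -o "$n"            # offline: no new jobs get scheduled there
      pbsnodes -N reboot "$n"     # node note: remember why it is offline
  done

  # later, from cron: reboot anything that is offline, noted, and idle
  for n in $(pbsnodes -l | awk '{print $1}'); do
      out=$(pbsnodes "$n")
      echo "$out" | grep -q 'note = reboot' || continue  # not tagged for reboot
      echo "$out" | grep -q 'jobs = ' && continue        # still has a job
      ssh "$n" /sbin/reboot
      # once the node is back up, clear the note and 'pbsnodes -c' it
  done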
--
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

See the Dishonor Roll at http://www.californiansagainsthate.com/

Brock Palen
2008-12-09 19:07:42 UTC
Permalink
Could this be done with a moab/maui hack?

From cron, every 10 days, submit jobs, one per node, with:

#PBS -l host=$host,naccesspolicy=SINGLEJOB

Those jobs would be submitted by a user who has 'sudo reboot' rights.
You can also use a Moab QOS with QFLAGS=NTR
so that the job is the next to run on the node.

This way the scheduler says:

This job is the next job on node X because it can only run on node X
(host=$host, QFLAGS=NTR).
SINGLEJOB forces that job to be the only job running on that node
when reboot is run by the user with sudoers rights to reboot.

This is 100% a hack, and I do not endorse it. Though it might just work.
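Something like this from cron, for example (the node-name harvesting, the
rebooting user's sudo setup, and the exact -l syntax are left as an
exercise; completely untested):

  #!/bin/sh
  # run from cron every 10 days: submit one short reboot job per node
  for host in $(pbsnodes -a | grep -v '^ ' | grep -v '^$'); do
      echo 'sudo /sbin/reboot' | \
          qsub -N reboot-job -l walltime=00:10:00 \
               -l host=$host,naccesspolicy=SINGLEJOB
  done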

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
***@umich.edu
(734)936-1985
Post by Garrick Staples
Post by Rahul Nabar
Is there any way to get pbs/torque to reboot a node periodically? Our
compute-nodes keep running forever and we suspect that over time they
accumulate zombie processes, memory leaks, etc. Making each node reboot,
say, on average once every 10 days or so is not a heavy overhead for us.
After all, a reboot is done in less than 5 minutes. I could also use these
reboots for some periodic logfile cleanup, etc. {We have shared nodes, 8
cores/node, so I cannot really wipe out my scratch etc. through an epilogue
since another job might be running on the other CPUs, and under normal
circumstances it is unusual to have a completely free node.}
What's the best way to auto-schedule this? Ideally I do not want the whole
cluster to reboot at once. In fact, I don't want to over-specify things at all.
Maybe the scheduler can choose nodes to reboot based on its scheduling
strategy, just so long as it reboots each node "on average" once every
10 days.
Any suggestions on implementation?
It is actually difficult to do while avoiding possible race conditions.
First, you need to drain the nodes by marking them offline. Then you need to
mark them for reboot using the node note. Then a script can reboot nodes when
it finds them offline, without a job, and marked for reboot.
--
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California
See the Dishonor Roll at http://www.californiansagainsthate.com/
Bogdan Costescu
2008-12-09 19:20:08 UTC
Permalink
Post by Garrick Staples
First, you need to drain the nodes by marking them offline. Then
you need to mark them for reboot using the node note. Then a script
can reboot nodes when it finds them offline, without a job, and
marked for reboot.
I've recently done something similar (reboot a node after whatever jobs
run on it finish) using pbs_python in only a few lines of (Python)
code. There is no extra script looking for the node note; the Python
script polls the state of the node until it's only "offline", proceeds
to do whatever it needs to reboot the node, and as soon as the node
goes into state "down" it clears the "offline" state.
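
In shell terms the logic is roughly the following (the node name and the
ssh reboot are placeholders; the real thing does it through pbs_python
calls, and this sketch only looks at the node state, not at individual
jobs):

  node=node01
  pbsnodes -o "$node"                                  # start draining
  # wait until the state is only "offline" (no job-related states left)
  while ! pbsnodes "$node" | grep -q 'state = offline$'; do
      sleep 60
  done
  ssh "$node" /sbin/reboot
  # once the node shows up as "down", clear the "offline" flag again
  while ! pbsnodes "$node" | grep 'state = ' | grep -q down; do
      sleep 10
  done
  pbsnodes -c "$node"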
--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: ***@iwr.uni-heidelberg.de
Garrick Staples
2008-12-09 19:31:34 UTC
Permalink
Post by Bogdan Costescu
Post by Garrick Staples
First, you need to drain the nodes by marking them offline. Then
you need to mark them for reboot using the node note. Then a script
can reboot nodes when it finds them offline, without a job, and
marked for reboot.
I've recently done something similar (reboot node after whatever jobs
run on it finish) using pbs_python in only a few lines of (Python)
code. There is no extra script looking for the node note, the Python
script polls the state of the node until it's only "offline", proceeds
to do whatever it needs to reboot the node and as soon as the node
goes into state "down" it clears the "offline" state.
Without marking the node for reboot in some fashion, how do you know which
nodes to reboot? Perhaps a node was marked offline for some other reason?

And your script doesn't check to see if it has a running job?
--
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

See the Dishonor Roll at http://www.californiansagainsthate.com/

Bogdan Costescu
2008-12-09 19:43:00 UTC
Permalink
Post by Garrick Staples
Post by Bogdan Costescu
There is no extra script looking for the node note, the Python
script polls the state of the node until it's only "offline",
proceeds to do whatever it needs to reboot the node and as soon as
the node goes into state "down" it clears the "offline" state.
Without marking the node for reboot in some fashion, how do you know
which nodes to reboot?
The script knows which nodes it needs to reboot; it ignores other
nodes which are in "offline" state. If a node is marked "offline"
manually but the script is still asked to reboot it, what difference
could it make whether the "offline" state was acquired from an admin or
from the script itself, as long as the final result is the same:
draining of the node?
Post by Garrick Staples
And your script doesn't check to see if it has a running job?
You missed the 'polls the state of the node until it's only "offline"',
or maybe I missed making it more verbose and saying 'and doesn't
contain other states related to running jobs, like "job-exclusive"'.
--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: ***@iwr.uni-heidelberg.de
Garrick Staples
2008-12-09 20:00:31 UTC
Permalink
Post by Bogdan Costescu
Post by Garrick Staples
Post by Bogdan Costescu
There is no extra script looking for the node note, the Python
script polls the state of the node until it's only "offline",
proceeds to do whatever it needs to reboot the node and as soon as
the node goes into state "down" it clears the "offline" state.
Without marking the node for reboot in some fashion, how do you know
which nodes to reboot?
The script knows which nodes it needs to reboot; it ignores other
nodes which are in "offline" state. If a node is marked "offline"
manually but the script is still asked to reboot it, what difference
could it make whether the "offline" state was acquired from an admin or
from the script itself, as long as the final result is the same:
draining of the node?
Oh, so you are just using a different mechanism to tag the nodes to be
rebooted.
Post by Bogdan Costescu
Post by Garrick Staples
And your script doesn't check to see if it has a running job?
You missed the 'polls the state of the node until it's only "offline"',
or maybe I missed making it more verbose and saying 'and doesn't
contain other states related to running jobs, like "job-exclusive"'.
You missed that nodes routinely have running jobs while having only the "free",
"busy", or "offline" states. "job-exclusive" is only a special case where a
job has all of the node's resources. This might be the norm on your cluster,
but it isn't the norm everywhere.
--
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

See the Dishonor Roll at http://www.californiansagainsthate.com/

Billy Crook
2008-12-10 14:44:55 UTC
Permalink
Why not submit, as a job, "/sbin/reboot"? Or if permissions would be
an issue, something suid. You'd request all resources on the node,
and a job time of ten minutes. The point being to occupy a node
legitimately, and when your time comes as regulated by torque, reboot
the node. The job would probably fail, but when the node comes back
online it should rejoin the queue and be available again, right?
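
For example, something like this (node name, core count and the sudo/suid
setup are made up, and as noted elsewhere in the thread a plain /sbin/reboot
run as an ordinary user won't actually work):

  echo 'sudo /sbin/reboot' | qsub -l nodes=node01:ppn=8,walltime=00:10:00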

P.S. Credit also to Brock who beat me to it.
Post by Garrick Staples
Post by Rahul Nabar
Is there any way to get pbs/torque to reboot a node periodically? Our
compute-nodes keep running forever and we suspect that over time they
accumulate zombie processes, memory leaks, etc. Making each node reboot,
say, on average once every 10 days or so is not a heavy overhead for us.
After all, a reboot is done in less than 5 minutes. I could also use these
reboots for some periodic logfile cleanup, etc. {We have shared nodes, 8
cores/node, so I cannot really wipe out my scratch etc. through an epilogue
since another job might be running on the other CPUs, and under normal
circumstances it is unusual to have a completely free node.}
What's the best way to auto-schedule this? Ideally I do not want the whole
cluster to reboot at once. In fact, I don't want to over-specify things at all.
Maybe the scheduler can choose nodes to reboot based on its scheduling
strategy, just so long as it reboots each node "on average" once every
10 days.
Any suggestions on implementation?
It is actually difficult to do while avoiding possible race conditions.
First, you need to drain the nodes by marking them offline. Then you need to
mark them for reboot using the node note. Then a script can reboot nodes when
it finds them offline, without a job, and marked for reboot.
--
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California
See the Dishonor Roll at http://www.californiansagainsthate.com/
Yang Wang
2008-12-09 19:39:28 UTC
Permalink
I just installed torque 2.3.5 and maui-3.2.6p21. After submitting jobs to the queue, the jobs stay in the queue and do NOT run automatically.
showq does not show any active jobs.

Any thoughts/suggestions?

Thanks,

Yang


[***@bio202 maui]# showq

ACTIVE JOBS--------------------

JOBNAME USERNAME STATE PROC REMAINING STARTTIME

0 Active Jobs 0 of 0 Processors Active (0.00%)

IDLE JOBS----------------------

JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME

0 Idle Jobs
BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME

Total Jobs: 0 Active Jobs: 0 Idle Jobs: 0 Blocked Jobs: 0


[***@bio202 maui]# qstat -a

bio202.agencourt.com: Req'd Req'd Elap

Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time

-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----

61.bio202.agenco ywang pipeline output_ALIGN_1_1 -- 1 -- -- -- Q --

62.bio202.agenco ywang pipeline output_ALIGN_2_1 -- 1 -- -- -- Q --

63.bio202.agenco ywang pipeline output_ALIGN_3_1 -- 1 -- -- -- Q --

64.bio202.agenco ywang pipeline output_ALIGN_4_1 -- 1 -- -- -- Q --

65.bio202.agenco ywang pipeline output_ALIGN_5_1 -- 1 -- -- -- Q --

66.bio202.agenco ywang pipeline output_ALIGN_6_1 -- 1 -- -- -- Q --

67.bio202.agenco ywang pipeline output_ALIGN_7_1 -- 1 -- -- -- Q --

68.bio202.agenco ywang tracking T_13239_1 -- -- -- -- -- H --
Sarah Mulholland
2008-12-09 19:48:49 UTC
Permalink
I had similar problems when I first ran maui. It turned out that I was missing some definitions in my maui.cfg file. I am using a QOS scheduling policy. The key for me was to make sure that I had granted all users, classes, and groups access to the qualities of service. The diagnostic that helped me figure it out was checkjob.

If you're using some other kind of scheduling policy, there may be an analogous problem. Try checkjob on the job number.
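
For example, for one of the job IDs from your qstat listing:

  checkjob 61

The end of its output should tell you why the scheduler isn't starting the
job (in my case it was the missing QOS access).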

# from my maui.cfg
# define some quality of services
QOSCFG[low] PRIORITY=10 QWEIGHT=1
QOSCFG[med] PRIORITY=130 QWEIGHT=1
# give all users, queues, and groups access to all QOS
USRCFG[DEFAULT] QDEF=low QLIST=low,med
CLASSCFG[DEFAULT] QDEF=low QLIST=low,med
GROUPCFG[DEFAULT] QDEF=low QLIST=low,med

I hope this helps.

-----Original Message-----
From: torqueusers-***@supercluster.org [mailto:torqueusers-***@supercluster.org] On Behalf Of Yang Wang
Sent: Tuesday, December 09, 2008 1:39 PM
To: Garrick Staples; ***@supercluster.org
Subject: [torqueusers] How can I check maui is actually talking with torque

Yang Wang
2008-12-09 20:27:43 UTC
Permalink
Hi Sarah,

Thanks for the suggestions. When I checked the maui.cfg and compared it with a working version, it turned out the file lacked a few lines.

After adding these lines:

QUEUETIMEWEIGHT 1
JOBPRIOACCRUALPOLICY ALWAYS
CLASSWEIGHT 10


JOBNODEMATCHPOLICY EXACTNODE
NODEACCESSPOLICY SINGLEUSER
NODEALLOCATIONPOLICY PRIORITY

NODECFG[DEFAULT] PRIORITYF='-JOBCOUNT'

ENABLEMULTINODEJOBS TRUE
ENABLEMULTIREQJOBS TRUE


It now works fine. So the maui installation does not provide a working configuration by default; I hope this changes at some point.

Thank you again.

Yang


Rahul Nabar
2008-12-10 21:36:50 UTC
Permalink
Post by Billy Crook
Why not submit, as a job, "/sbin/reboot"? Or if permissions would be
an issue, something suid. You'd request all resources on the node,
and a job time of ten minutes. The point being to occupy a node
legitimately, and when your time comes as regulated by torque, reboot
the node. The job would probably fail, but when the node comes back
online it should rejoin the queue and be available again right?
Thanks guys. I need to figure out which option I should use. Too many
alternatives! :)

OTOH, I never thought we'd need so many hacks to do something like a
planned reboot. I had expected to find a built-in torque/maui option.
Is it so uncommon to request a reboot of a compute node?
Post by Billy Crook
Agreed with everything above. Fix the problems. Don't reboot unnecessarily.
That would be the neater approach, agreed. But I take a pragmatic
approach: 5 minutes lost to a reboot every 2 weeks is way cheaper than a
week of digging into badly documented user code to track down a
zombie, a memory leak, etc.
--
Rahul
Gabe Turner
2008-12-10 21:40:49 UTC
Permalink
Post by Rahul Nabar
Post by Billy Crook
Why not submit, as a job, "/sbin/reboot"? Or if permissions would be
an issue, something suid. You'd request all resources on the node,
and a job time of ten minutes. The point being to occupy a node
legitimately, and when your time comes as regulated by torque, reboot
the node. The job would probably fail, but when the node comes back
online it should rejoin the queue and be available again right?
Thanks guys. I need to figure out which option I should use. Too many
alternatives! :)
OTOH, I never thought we'd need so many hacks to do something like a
planned reboot. I had expected to find a built-in torque/maui option.
Is it so uncommon to request a reboot of a compute node?
I just don't know that it's all that common in practice. One of the issues
we've had when even attempting such a thing is that there is no guarantee that
the node will come up in a sane state every single boot. Even if it works
great 99 out of 100 boots, something going awry could potentially drain
your queue. I've found users kind of hate that ;)

Gabe
--
Gabe Turner ***@msi.umn.edu
UNIX System Administrator,
University of Minnesota
Supercomputing Institute http://www.msi.umn.edu
Chris Samuel
2008-12-11 04:08:27 UTC
Permalink
Post by Yang Wang
The showq does not show any active jobs.
Any thoughs/suggestion?
The Maui command "diagnose -R" will tell you what
Maui thinks of its connection to its resource managers
(e.g. Torque).

cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
Arnau Bria
2008-12-11 08:34:12 UTC
Permalink
On Thu, 11 Dec 2008 16:08:18 +1100 (EST)
Chris Samuel wrote:

Hi Chris,
Post by Chris Samuel
The Maui command "diagnose -R" will tell you what
Maui thinks of its connection to its resource managers
(e.g. Torque).
do you know how I can get more info about the "cannot start job" failures?

RM[base] type: 'PBS' state: 'Active'
Event Management: EPORT=15004
SSS protocol enabled
RM Performance: Avg Time: 0.02s Max Time: 3.08s (14719 samples)

RM[base] Failures:
Wed Dec 10 19:33:33 jobstart 'cannot start job'
Wed Dec 10 19:33:33 jobstart 'cannot start job'
Thu Dec 11 05:00:52 jobstart 'cannot start job'
Thu Dec 11 09:35:10 jobstart 'cannot start job'
Thu Dec 11 10:07:32 jobstart 'cannot start job'
Thu Dec 11 10:07:32 jobstart 'cannot start job'
Post by Chris Samuel
cheers,
Chris
Cheers,
Arnau

Chris Samuel
2008-12-11 04:28:48 UTC
Permalink
Post by Billy Crook
Why not submit, as a job, "/sbin/reboot"?
Torque, by default, doesn't permit that and I
think that's quite a sane thing.
Post by Billy Crook
Or if permissions would be an issue, something suid.
As an ex IT-security person, that sort of thing scares me...

Sudo is your friend!
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
Chris Samuel
2008-12-11 04:32:04 UTC
Permalink
Post by Gabe Turner
I just don't know that it's all that common in practice.
One of the issues we've had when even attempting such a thing
is that there is no guarantee that the node will come up in
a sane state every single boot. Even if it works great
99 out of 100 boots, something going awry could potentially
drain your queue. I've found users kind of hate that ;)
That's precisely why we don't start pbs_mom on boot by default.

Our current version will only do so if a file has been
touch'd in /etc (and it then removes that file), but we don't
tend to be brave enough to do that (yet).
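
Roughly, in the pbs_mom init script (the flag file name and the start
command are just examples, not our actual script):

  # only start pbs_mom if someone explicitly asked for it before the reboot
  if [ -f /etc/start_pbs_mom_on_boot ]; then
      rm -f /etc/start_pbs_mom_on_boot    # one-shot flag
      /usr/local/sbin/pbs_mom             # start the MOM
  fi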

cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency