Rahul Nabar
2008-12-09 16:46:08 UTC
Is there any way to get PBS/Torque to reboot a node periodically? Our
compute nodes keep running forever, and we suspect that over time they
accumulate zombie processes, memory leaks, etc. Making each node reboot, say,
on average once every 10 days or so is not a heavy overhead for us. After all,
a reboot is done in less than 5 minutes. These reboots could also be used by
me to do some periodic logfile cleanup etc. {We have shared nodes, 8
cores/node, so I cannot really wipe out the scratch space etc. through an
epilogue, since another job might be running on the other CPUs; and under
normal circumstances it is unusual to have a completely free node.}
What's the best way to auto-schedule this? Ideally I do not want the whole
cluster to reboot at once. In fact, I don't want to over-specify things at all.
Maybe the scheduler can choose which nodes to reboot based on its scheduling
strategy, just so long as it reboots each node "on average" once every
10 days.
Any suggestions on implementation?
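
To make the idea concrete, here is a rough sketch of the kind of selection logic I have in mind. It is only an illustration under assumptions: the `Node` record, the `plan_reboots` function, and the `max_drain` limit are hypothetical names, and how uptime and job counts would actually be collected from the cluster is left open.

```python
from dataclasses import dataclass

REBOOT_INTERVAL_DAYS = 10  # target interval mentioned above


@dataclass
class Node:
    # Hypothetical record; in practice these values would have to be
    # gathered from the nodes and the batch system by some other means.
    name: str
    uptime_days: float
    running_jobs: int


def plan_reboots(nodes, interval_days=REBOOT_INTERVAL_DAYS, max_drain=1):
    """Split overdue nodes into those that can reboot now and those to drain.

    A node is overdue once its uptime reaches interval_days. Idle overdue
    nodes can be rebooted immediately. Busy overdue nodes are only drained
    (no new jobs accepted), and at most max_drain of them at a time, so the
    whole cluster is never taken out of service together.
    """
    overdue = sorted(
        (n for n in nodes if n.uptime_days >= interval_days),
        key=lambda n: n.uptime_days,
        reverse=True,  # longest-running nodes first
    )
    to_reboot = [n.name for n in overdue if n.running_jobs == 0]
    to_drain = [n.name for n in overdue if n.running_jobs > 0][:max_drain]
    return to_reboot, to_drain
```

Run periodically (from cron, say), the drain step could presumably map to marking the node offline with `pbsnodes -o <node>` and, after the reboot, clearing that state with `pbsnodes -c <node>`. Whether the scheduler itself can do this more elegantly is exactly what I am asking.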
--
Rahul