Rahul Nabar
2008-12-11 17:47:25 UTC
Every so often I get a job that won't respond to qdel. Its
"REMAINING-time" in MAUI then goes negative, which was initially
confusing since I thought it was a MAUI bug.
But the root cause seems to be that PBS will not obey the qdel on this
job, irrespective of whether I issue it as root or MAUI issues it.
I had one such job today and debugged it further: all the sub-nodes
seemed to be up, and the mom daemon on each of these nodes seemed to
be up and running.
The mom_log on the master node was interesting, though. It had this snippet:
12/11/2008 11:47:38;0002; pbs_mom;Svr;im_request;connect from 11.0.1.79:1023
12/11/2008 11:47:38;0008; pbs_mom;Job;233139.supernova.che.wisc.edu;received request 'KILL_JOB' from 11.0.1.79:1023
12/11/2008 11:47:38;0008; pbs_mom;Job;233139.supernova.che.wisc.edu;ERROR: received request 'KILL_JOB' from 11.0.1.79:1023 for job '233139.supernova.che.wisc.edu' (job does not exist locally)
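To spot this condition across a larger mom_log, something like the
following rough Python sketch might help. It assumes each entry sits on
one line in the on-disk log (the wrapping above comes from mail), and
that the fields are separated by semicolons in the order seen in the
snippet. The function name find_orphan_kill_requests and the field
labels are my own, not anything TORQUE defines.

```python
import re

# Message pattern copied from the ERROR entry quoted above.
MISSING_JOB = re.compile(
    r"received request '(?P<request>\w+)' from (?P<peer>[\d.]+):\d+ "
    r"for job '(?P<job>[^']+)' \(job does not exist locally\)"
)


def find_orphan_kill_requests(lines):
    """Return (timestamp, job id, peer address) for every entry in which
    pbs_mom rejected a request because it does not know the job."""
    hits = []
    for line in lines:
        # Assumed layout: timestamp;code;daemon;class;object;message
        fields = [f.strip() for f in line.split(";", 5)]
        if len(fields) != 6:
            continue  # not a complete entry, skip it
        timestamp, _code, daemon, _cls, _obj, message = fields
        if daemon != "pbs_mom":
            continue
        match = MISSING_JOB.search(message)
        if match:
            hits.append((timestamp, match.group("job"), match.group("peer")))
    return hits


sample = [
    "12/11/2008 11:47:38;0002; pbs_mom;Svr;im_request;"
    "connect from 11.0.1.79:1023",
    "12/11/2008 11:47:38;0008; pbs_mom;Job;233139.supernova.che.wisc.edu;"
    "received request 'KILL_JOB' from 11.0.1.79:1023",
    "12/11/2008 11:47:38;0008; pbs_mom;Job;233139.supernova.che.wisc.edu;"
    "ERROR: received request 'KILL_JOB' from 11.0.1.79:1023 for job "
    "'233139.supernova.che.wisc.edu' (job does not exist locally)",
]

for timestamp, job, peer in find_orphan_kill_requests(sample):
    print(timestamp, job, peer)
```

Run against the three entries above, only the ERROR line is reported.
To scan a real log, pass an open file object instead of the sample list.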
The only way I could get this job to delete was to restart the pbs_mom
on that node.
Has anyone else encountered these symptoms? For me the first clue
was a negative "REMAINING-time" in MAUI and users complaining that
they could not qdel a job. In the past I've achieved the same effect
by removing the relevant foo.supe.JB and foo.supe.SC files from
/var/spool/torque/server_priv/jobs on the master node.
But I don't think that is the best way out. I'd appreciate any other
debug suggestions as well.
--
Rahul