Ben Turner
2010-04-05 23:27:37 UTC
Hi,
I have a cluster with five nodes, all configured identically. The queueing
system works fine on four of them, but on the fifth, jobs get queued, start
to run, and then immediately jump back to the queued state.
I have looked through all the logs, and the only clue I can find is this
entry in the mom_log on the failing compute node:
Pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in do_rpp, cannot get protocol
End of File
On the server, the log shows:
PBS_Server; Svr;WARNING;ALERT:unable to contact node merlion05.qgeo.com
PBS_Server;Job;14754.merlion00.qgeo.com;unable to run job, MOM rejected/rc=2
PBS_Server;Req;req_reject;Reject reply code 15041(Execution server rejected
request MSG=cannot send job to mom, state=PRERUN), aux=0, type-RunJob from
Scheduler at merlion00.qgeo.com
pbsnodes -a reports that the dodgy node merlion05 is OK.
Does anybody have any insight into this problem? I have been bashing my head
against a wall and have no idea where to go.
Cheers
Ben