Discussion:
[torqueusers] Jobs Jumping From Queued to Run to Queued???
Ben Turner
2010-04-05 23:27:37 UTC
Hi,

I have a cluster with five nodes all configured the same. The queueing
system is working fine on four of them, but on one of them jobs get queued,
then start to run and then jump back to queued state immediately.

I have looked through all the logs, and the only clue I can find is in the
mom_log on the failing compute node:

Pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in do_rpp, cannot get protocol End of File


On the server, the log shows:
PBS_Server; Svr;WARNING;ALERT:unable to contact node merlion05.qgeo.com
PBS_Server;Job;14754.merlion00.qgeo.com;unable to run job, MOM rejected/rc=2
PBS_Server;Req;req_reject;Reject reply code 15041(Execution server rejected
request MSG=cannot send job to mom, state=PRERUN), aux=0, type-RunJob from
Scheduler at merlion00.qgeo.com

pbsnodes -a reports that the dodgy node, merlion05, is OK.
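
Since pbsnodes reports the node as OK while the server log says it cannot contact it, one way to narrow things down is to probe the MOM's TCP ports directly from the server host. The following is only a minimal sketch: the port numbers are the stock TORQUE defaults (15002 for pbs_mom, 15003 for its resource monitor) and are an assumption here, and the names `can_connect` and `probe` are made up for illustration. It only tests TCP reachability, so it says nothing about UDP-based RPP traffic.

```python
import socket
import sys

# Assumption: stock TORQUE ports. Check /etc/services or your build
# options if your site uses different ones.
DEFAULT_PORTS = {"pbs_mom": 15002, "pbs_resmon": 15003}


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def probe(host, ports=DEFAULT_PORTS, timeout=3.0):
    """Return a dict mapping each service name to its reachability."""
    return {name: can_connect(host, port, timeout) for name, port in ports.items()}


if __name__ == "__main__":
    # Hosts to probe are taken from the command line; with none given,
    # nothing is contacted.
    for host in sys.argv[1:]:
        for name, ok in probe(host).items():
            print(f"{host} {name}: {'reachable' if ok else 'NOT reachable'}")
```

Run from the server host with the working and failing nodes as arguments (for example merlion05.qgeo.com alongside one of the healthy nodes) and compare the results. A difference would point at a firewall, name resolution, or a pbs_mom that is not listening.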

Does anybody have any insight into this problem? I have been bashing my head
against a wall and have no idea where to go next.

Cheers
Ben
Dr. Stephan Raub
2010-04-06 08:44:30 UTC
Hi,

I have occasionally seen this behavior when

-) the globally mounted /home directory of the user who owns the job is
not mounted on the node

-) the user is not known on the compute node (e.g. because the LDAP service
is down or nscd has hung)

-) the prologue script failed (for whatever reason)

-) the SSH keys are not set up correctly, so the user cannot hop from
compute node to compute node without a password

Perhaps one of these is the cause of your problem.
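
The four checks above can be roughly scripted and run on the suspect node. This is a sketch under assumptions, not a definitive test: the prologue path assumes a default TORQUE home of /var/spool/torque, all function names are made up for illustration, and the prologue check only looks at presence and the execute bit rather than running the script. The home-directory and SSH checks are only meaningful when run as the job owner.

```python
import os
import pwd
import subprocess

# Assumption: default TORQUE layout; adjust if your PBS home differs.
PROLOGUE_PATH = "/var/spool/torque/mom_priv/prologue"


def user_known(username):
    """True if the name service (files, LDAP, nscd, ...) resolves the user."""
    try:
        pwd.getpwnam(username)
        return True
    except KeyError:
        return False


def home_available(username):
    """True if the user's home directory exists and can be entered."""
    try:
        home = pwd.getpwnam(username).pw_dir
    except KeyError:
        return False
    return os.path.isdir(home) and os.access(home, os.X_OK)


def prologue_ok(path=PROLOGUE_PATH):
    """True if no prologue is installed, or if it is an executable file."""
    if not os.path.exists(path):
        return True
    return os.path.isfile(path) and os.access(path, os.X_OK)


def ssh_passwordless(host, timeout=5):
    """True if `ssh host true` succeeds without any password prompt."""
    cmd = ["ssh", "-o", "BatchMode=yes",
           "-o", f"ConnectTimeout={timeout}", host, "true"]
    try:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    except FileNotFoundError:
        # No ssh client installed on this machine.
        return False


def preflight(username, peer_host=None, prologue=PROLOGUE_PATH):
    """Collect the checks into one report; SSH is skipped without a peer."""
    report = {
        "user_known": user_known(username),
        "home_available": home_available(username),
        "prologue_ok": prologue_ok(prologue),
    }
    if peer_host is not None:
        report["ssh_passwordless"] = ssh_passwordless(peer_host)
    return report
```

Calling `preflight("someuser", peer_host="merlion04")` on merlion05 (both names are placeholders) and comparing the result with the same call on a healthy node should show which of the four conditions, if any, differs.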

Stephan


--
---------------------------------------------------------
| | Dr. rer. nat. Stephan Raub
| | Dipl. Chem.
| | IT-Management / ZIM
| | Heinrich-Heine-Universität Düsseldorf Universitätsstr. 1 /
| | 25.41.O2.25-2
| | 40225 Düsseldorf / Germany
---------------------------------------------------------

Important Note: This e-mail may contain trade secrets or privileged,
undisclosed or otherwise confidential information. If you have received this
e-mail in error, you are hereby notified that any review, copying or
distribution of it is strictly prohibited. Please inform us immediately and
destroy the original transmittal. Thank you for your cooperation.
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Ben Turner
2010-04-06 10:16:51 UTC
Hi Stephan,

Thanks for your help. I'll try these things and let you know if they worked.

Cheers
Ben

