Oswin,
If I'm following that correctly, it is running the local tasks but the
remote tasks never execute. Correct? To debug this, I'd run an interactive
job and execute the remote pbsdsh. While doing that, I'd tail both the
mother superior's mom log and the log of the mom where it should execute to
see if something obvious comes up.
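
A rough sketch of that check, not something tested on your cluster. The helper names (mom_log_path, missing_hosts) are made up for illustration. It assumes the default TORQUE_HOME of /var/spool/torque, with pbs_mom writing one log file per day named YYYYMMDD, so adjust the path if your install differs.

```shell
#!/bin/sh
# Hypothetical helpers for the debugging steps above; names are illustrative.

# Print the path of today's pbs_mom log.
# Assumption: logs live in $TORQUE_HOME/mom_logs/YYYYMMDD and TORQUE_HOME
# defaults to /var/spool/torque. Pass another directory as $1 if needed.
mom_log_path() {
    torque_home="${1:-${TORQUE_HOME:-/var/spool/torque}}"
    printf '%s/mom_logs/%s\n' "$torque_home" "$(date +%Y%m%d)"
}

# Print hosts that are listed in the node file ($1) but never show up in
# the saved output of `pbsdsh hostname` ($2).
missing_hosts() {
    assigned=$(mktemp) || return 1
    seen=$(mktemp) || { rm -f "$assigned"; return 1; }
    sort -u "$1" > "$assigned"
    sort -u "$2" > "$seen"
    comm -23 "$assigned" "$seen"   # lines only in the node file
    rm -f "$assigned" "$seen"
}

# Usage, from inside the interactive job:
#   pbsdsh hostname > /tmp/pbsdsh.out
#   missing_hosts "$PBS_NODEFILE" /tmp/pbsdsh.out
# and, in a second terminal on each of the two nodes:
#   tail -f "$(mom_log_path)"
```

If missing_hosts prints the remote node, the remote tasks really are landing locally, and whatever appears in the two mom logs at that moment is probably the first thing worth reading.
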
If you don't want to debug, another option would be to install a newer
version (4.2.10 is very old and *many* pbsdsh bugs have been fixed since)
and see if the problem just goes away by itself.
David
On Wed, Sep 28, 2016 at 5:01 AM, Oswin Krause <***@di.ku.dk> wrote:
> Hi,
>
> Sorry for dropping the ball a little on this. I am still struggling with
> it. In the meantime I set logevents to 511 for pbs_mom and loglevel to 7 to
> get some more output. Still, there is not much of a hint. See the attachment
> for the pbs_mom log after starting an interactive job with
>
> qsub -I -l nodes=4:ppn=20
>
> which gets scheduled on the node whose log is attached, and then running
> (after verifying which nodes got assigned) from within the session:
>
> [***@a00562 ~]$ pbsdsh -v -h a00561.science.domain hostname
> pbsdsh(): spawned task 20
> pbsdsh(): spawn event returned: 20 (1 spawns and 0 obits outstanding)
> pbsdsh(): sending obit for task 20
> a00562.science.domain
> pbsdsh(): obit event returned: 20 (0 spawns and 1 obits outstanding)
> pbsdsh(): task 20 exit status 0
>
>
> which, going by this output, should start a task on the other node but in
> fact does not: hostname reports a00562 rather than a00561. I also see no
> pbs_mom log output at all on a00561 regarding this. There is no obvious
> hint in the server logs either (also log_level 7).
>
> Best,
> Oswin
>
>
>
>
> From: Oswin Krause
>
> Sent: Thursday, September 08, 2016 11:11 PM
>
> To: Torque Users Mailing List
>
> Subject: RE: [torqueusers] pbsdsh starts all process on same node
>
>
>
>
>
>
> Hi,
>
>
>
> I have analysed this for quite a while (I stumbled over it because Open MPI
> reports errors: its internal processes that are meant to be spawned on the
> other node are spawned together with the main process, leading to problems).
>
>
>
> When I run the line from earlier I do not see any error on the server. On
> the node where I have the interactive session I see in the logs that all
> processes are started there (consistent with the hostname output), and on
> the other node I see nothing.
>
>
>
> Best,
>
> Oswin
>
>
>
>
>
> From: torqueusers-***@supercluster.org [torqueusers-bounces@supercluster.org]
> on behalf of David Beer [***@adaptivecomputing.com]
>
> Sent: Thursday, September 08, 2016 10:33 PM
>
> To: Torque Users Mailing List
>
> Subject: Re: [torqueusers] pbsdsh starts all process on same node
>
>
>
>
>
>
> Ok, are there any errors in the error file for the job? Is hostname run 40
> times on one node?
>
>
> On Thu, Sep 8, 2016 at 2:22 PM, Oswin Krause
> <***@di.ku.dk> wrote:
>
>
>
> Hi David,
>
>
>
> Thanks for the fast reply! exec_host looks fine as nodes a00551 and a00553
> are shown. I shortened the output for readability.
>
>
>
> exec_host = a00551.science.domain-0/0+a00551.science.domain-0/1+a00551.science.domain-0/2[...]+a00551.science.domain-0/19+a00553.science.domain-0/0+[...]+a00553.science.domain-0/19
>
>
>
> This is also reflected in the PBS_NODEFILE.
>
>
>
> Best,
>
> Oswin
>
>
>
>
>
> From: torqueusers-***@supercluster.org [torqueusers-***@supercluster.org]
> on behalf of David Beer [***@adaptivecomputing.com]
>
> Sent: Thursday, September 08, 2016 9:45 PM
>
> To: Torque Users Mailing List
>
> Subject: Re: [torqueusers] pbsdsh starts all process on same node
>
>
>
>
>
>
>
>
> Oswin,
>
>
>
> The first thing I'd check is qstat -f for the value of exec_host.
> Depending on what policies are set, Maui can assign a job - even though it
> says nodes=2 - all to a single host.
>
>
>
> David
>
>
>
> On Thu, Sep 8, 2016 at 1:20 PM, Oswin Krause
> <***@di.ku.dk> wrote:
>
>
> Hi,
>
>
>
> I am trying to debug the following situation after having set up Torque
> 4.2.10 with Maui 3.3.1 (1 master, 11 slaves, 20 cores each).
>
>
>
> I start an interactive session:
>
> [***@a00552 ~]$ qsub -l nodes=2:ppn=20
>
> [***@a00562 ~]$ cat $PBS_NODEFILE
>
> a00562.science.domain
>
> [x20]
>
> a00561.science.domain
>
> [x20]
>
>
>
> this is exactly as expected. However:
>
>
>
> [***@a00562 ~]$ pbsdsh hostname
>
> a00562.science.domain
>
> [x40]
>
>
>
> Apparently, for some reason I cannot connect from a00562 to a00561. Is
> there any way to debug this? The logs of the server and MoM are not very
> helpful.
>
>
>
> Note: I assume this is an authentication problem. Our IT department uses
> Kerberos for everything and I have read that Torque is not really capable
> of handling this. However, I would like to rule out everything else before I
> annoy people in the IT department :-)
>
>
>
> Best,
>
> Oswin
>
>
>
>
>
>
>
> _______________________________________________
>
> torqueusers mailing list
>
> ***@supercluster.org
>
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
>
>
>
>
>
> --
>
>
>
>
> David Beer | Torque Architect
> Adaptive Computing
>
>
--
David Beer | Torque Architect
Adaptive Computing