Discussion:
[torqueusers] pbsdsh starts all processes on same node
Oswin Krause
2016-09-08 19:20:17 UTC
Hi,

I am trying to debug the following situation after having set up Torque 4.2.10 with Maui 3.3.1 (1 master, 11 slaves, 20 cores each).

I start an interactive session:
[***@a00552 ~]$ qsub -l nodes=2:ppn=20
[***@a00562 ~]$ cat $PBS_NODEFILE
a00562.science.domain
[x20]
a00561.science.domain
[x20]

This is exactly as expected. However:

[***@a00562 ~]$ pbsdsh hostname
a00562.science.domain
[x40]

Apparently, for some reason, I cannot connect from a00562 to a00561. Is there any way to debug this? The logs of the server and of the MOM are not very helpful.

Note: I assume this is an authentication problem. Our IT department uses Kerberos for everything, and I have read that Torque cannot really handle that. However, I would like to rule out everything else before I annoy people in the IT department :-)

Best,
Oswin
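
A quick way to rule out basic server/MOM connectivity before blaming Kerberos (a minimal sketch using standard TORQUE tools; momctl usually has to be run as root, e.g. on the server):

pbsnodes a00561.science.domain        # node state as pbs_server sees it
momctl -d 3 -h a00561.science.domain  # MOM diagnostics for the remote node

If pbsnodes reports the node as free and momctl can reach it, the basic server/MOM communication is working and the problem is more likely in the MOM-to-MOM task spawning.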
David Beer
2016-09-08 19:45:30 UTC
Oswin,

The first thing I'd check is qstat -f for the value of exec_host. Depending on what policies are set, Maui can assign a job entirely to a single host, even though the request says nodes=2.

David

--
David Beer | Torque Architect
Adaptive Computing
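
A minimal way to run that check from inside the interactive session (assuming qstat is available on the compute node and that $PBS_JOBID is set, which TORQUE normally does for jobs):

qstat -f $PBS_JOBID | grep -A 4 exec_host

If exec_host lists only one node, the scheduler packed the job onto a single host and pbsdsh is behaving correctly; if it lists both nodes, the problem is on the pbsdsh/MOM side.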
Oswin Krause
2016-09-08 20:22:28 UTC
Hi David,

Thanks for the fast reply! exec_host looks fine: nodes a00551 and a00553 are both shown. I shortened the output for readability.

exec_host = a00551.science.domain-0/0+a00551.science.domain-0/1+a00551.science.domain-0/2[...]+a00551.science.domain-0/19+a00553.science.domain-0/0+[...]+a00553.science.domain-0/19

This is also reflected in the PBS_NODEFILE.

Best,
Oswin
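
For reference, a compact way to compare what the nodefile promises with where the tasks actually land (a sketch using standard shell tools from inside the job):

sort $PBS_NODEFILE | uniq -c      # slots per host according to the nodefile
pbsdsh hostname | sort | uniq -c  # hosts the spawned tasks actually report

In the situation described here, the first command shows 20 slots on each of two hosts, while the second shows all 40 tasks on the mother superior.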

David Beer
2016-09-08 20:33:40 UTC
OK, are there any errors in the error file for the job? Is hostname run 40 times on one node?
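
One way to capture both answers at once is to submit the same command as a small batch job and inspect its output and error files afterwards (a sketch; STDIN is the default job name TORQUE assigns to a job submitted from standard input):

echo 'pbsdsh hostname' | qsub -l nodes=2:ppn=20
# after the job finishes, in the submission directory:
cat STDIN.e*              # any errors from pbsdsh or the MOMs
sort STDIN.o* | uniq -c   # how many times each hostname appears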

Oswin Krause
2016-09-08 21:11:05 UTC
Hi,

I have been analysing this for quite a while (I stumbled over it because OpenMPI produces errors: OpenMPI-internal processes intended to be spawned on the other node are spawned alongside the main process instead, which leads to problems).

When I run the command from earlier, I do not see any error on the server. On the node where I have the interactive session, the logs show that all processes are started there (which matches the hostname output), and on the other node I see nothing.

Best,
Oswin
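
To confirm that the sister MOM really never hears about the job, one can grep its log for the job id (a sketch, assuming the default spool directory /var/spool/torque and daily log files named by date; replace <jobid> with the actual id from qstat):

ssh a00561 grep '<jobid>' /var/spool/torque/mom_logs/$(date +%Y%m%d)

No matches there, while the mother superior's log on a00562 is full of entries for the same job, would back up the observation above.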

Oswin Krause
2016-09-28 11:01:09 UTC
Hi,

Sorry for dropping the ball a little on this. I am still struggling with it. In the meantime I set logevents to 511 and the log level to 7 for pbs_mom to get more output, but there is still not much of a hint. See the attachment for the pbs_mom log after starting an interactive job with

qsub -I -l nodes=4:ppn=20

which gets scheduled onto the logged node; then, from within the session (after verifying which nodes were assigned), I run:

[***@a00562 ~]$ pbsdsh -v -h a00561.science.domain hostname
pbsdsh(): spawned task 20
pbsdsh(): spawn event returned: 20 (1 spawns and 0 obits outstanding)
pbsdsh(): sending obit for task 20
a00562.science.domain
pbsdsh(): obit event returned: 20 (0 spawns and 1 obits outstanding)
pbsdsh(): task 20 exit status 0


which, according to the log, should start task 20 on a00561 but in fact does not: the hostname printed is a00562 again. I see exactly zero log output from pbs_mom on a00561 regarding this, and there is no obvious hint in the server logs either (also at log_level 7).

Best,
Oswin
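
For anyone trying to reproduce this, the MOM-side logging settings mentioned above normally live in the MOM's config file (a sketch assuming the default spool path; the exact service name for restarting pbs_mom depends on how TORQUE was packaged):

# /var/spool/torque/mom_priv/config
$loglevel 7
$logevent 511

# then restart the MOM on that node, e.g.
systemctl restart pbs_mom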




David Beer
2016-10-25 22:40:19 UTC
Oswin,

If I'm following correctly, the local tasks run but the remote tasks never execute. Correct? To debug this, I'd run an interactive job and execute the remote pbsdsh. While doing that, I'd tail both the mother superior's MOM log and the log of the MOM where the task should execute, to see if anything obvious comes up.

If you don't want to debug, another option is to install a newer version (4.2.10 is very old and *many* pbsdsh bugs have been fixed since) and see if the problem simply goes away.

David
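
Concretely, that could look like the following (a sketch assuming the default log location; run one tail on each of the two nodes, then trigger pbsdsh from the interactive job):

# on a00562 (mother superior) and on a00561 (sister), as root:
tail -f /var/spool/torque/mom_logs/$(date +%Y%m%d)

# meanwhile, inside the interactive job:
pbsdsh -v -h a00561.science.domain hostname

Anything the sister MOM does (or refuses to do) with the spawn request should show up in its log while the command runs.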

Oswin Krause
2016-10-26 06:27:14 UTC
Hi David,

Thanks for trying to help me!

The jobs are all executed, but on the wrong node. When I start an interactive session, all tasks get started on the first node, while the other nodes are not touched. The log I sent previously was from exactly that first node, i.e. the one where I started the interactive session (a00562). When I check the log of node a00561, on which I wanted to start the process via pbsdsh, it seems no communication has taken place.

I could try a newer version; however, 4.2.10 is the one supported by my distro (RHEL 7.0), and I wanted to avoid having to build packages myself... maybe I have to.

Thanks a lot!

Best,
Oswin


Oswin Krause
2016-11-16 15:16:00 UTC
Hi David,

Thank you for your help. I followed your advice and installed Torque 6.1.0, and it works!

Thanks a lot!


Oswin Krause
2016-11-16 15:58:36 UTC
Sorry,

I was a bit too quick: it does still end up on the same node. This is so frustrating... if only I would at least get an error message. Is there any way I can figure out what is going wrong, e.g. a way to see where processes are started? The log is not really helpful either; I would expect to see, at some log level, something like "contacting <ip address>", but no log level I tried gave me that information.

Are the developers reading this mailing list?

Best,
Oswin
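
One way to make the placement visible, and to check whether the mother superior even tries to contact the other node, is to have each task report where it ran and to watch the MOM ports while pbsdsh runs (a sketch; 15002/15003 are TORQUE's default pbs_mom ports and may be configured differently):

# inside the job: every task prints its host and PID
pbsdsh -v sh -c 'echo "task on $(hostname) pid $$"'

# on a00562, in another shell, as root: any attempt to reach the sister MOM
tcpdump -ni any 'host a00561.science.domain and (port 15002 or port 15003)'

If tcpdump stays silent while pbsdsh runs, the mother superior never even opens a connection to a00561, which points at the local MOM/TM layer rather than at Kerberos or the network.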

