Discussion:
[torqueusers] qsub and mpiexec -f machinefile
Tiago Silva (Cefas)
2014-02-19 12:40:09 UTC
Hi,

My MPI code is normally executed across a set of nodes with something like:

mpiexec -f machinefile -np 6 ./bin

where the machinefile has 6 entries with node names, for instance:
n01
n01
n02
n02
n02
n02


Now the issue here is that this list has been optimised to balance the load between nodes and to reduce internode communication. So, for instance, model domain tiles 0 and 1 will run on n01 while tiles 2 to 5 will run on n02.

Is there a way to integrate this into qsub, given that I don't know which nodes will be assigned before submission? Or, in other words, can I control how processes are grouped on a node?

In my example I used 6 processes for simplicity but normally I parallelise across 4-16 nodes and >100 processes.

Thanks,
tiago
This email and any attachments are intended for the named recipient only. Its unauthorised use, distribution, disclosure, storage or copying is not permitted.
If you have received it in error, please destroy all copies and notify the sender. In messages of a non-business nature, the views and opinions expressed are the author's own
and do not necessarily reflect those of Cefas.
Communications on Cefas' computer systems may be monitored and/or recorded to secure the effective operation of the system and for other lawful purposes.
Gus Correa
2014-02-19 15:11:28 UTC
Hi Tiago

The Torque/PBS node file is available to your job script
through the environment variable $PBS_NODEFILE.
This file has one line listing the node name for each processor/core
that you requested.
Just do a "cat $PBS_NODEFILE" inside your job script to see how it looks.
Inside your job script, and before the mpiexec command, you can
run a brief auxiliary script to create the machinefile
you need from the $PBS_NODEFILE.
You will need to create this auxiliary script,
tailored to your application.
Note, however, that this method won't bind the MPI processes
to the appropriate hardware components (cores, sockets, etc.),
in case that is also part of your goal.
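
For concreteness, here is a minimal sketch of such a job-script fragment. It is only an illustration: the resource request, the 2+4 rank split and the file names (nodes.$PBS_JOBID, machinefile.$PBS_JOBID) are assumptions for this example, not something Torque or your application requires.

#!/bin/bash
#PBS -l nodes=2:ppn=4

cd $PBS_O_WORKDIR

# One line per requested core, grouped by node:
cat $PBS_NODEFILE

# Build an application-specific machinefile from the allocated nodes,
# here putting 2 ranks on the first node and 4 on the second
# (the layout from the original post).
uniq $PBS_NODEFILE > nodes.$PBS_JOBID
first=$(sed -n 1p nodes.$PBS_JOBID)
second=$(sed -n 2p nodes.$PBS_JOBID)
{
  printf '%s\n' "$first" "$first"
  printf '%s\n' "$second" "$second" "$second" "$second"
} > machinefile.$PBS_JOBID

mpiexec -f machinefile.$PBS_JOBID -np 6 ./bin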

Having said that, if you are using OpenMPI, it can be built with
Torque support (with the --with-tm=/torque/location configuration option).
This would give you a range of options for assigning different
cores, sockets, etc. to different MPI ranks/processes, directly
on the mpiexec command line or in the OpenMPI runtime configuration files.
This method wouldn't require creating the machinefile
from the PBS_NODEFILE, and it has the advantage of allowing
you to bind the processes to cores, sockets, etc.
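
As a rough illustration of this second approach (the option names below are from OpenMPI 1.6-era mpiexec and may differ in other versions; the rankfile contents are just an example for the 2+4 layout above):

# With a Torque-aware OpenMPI build, mpiexec takes the host list from the job:
mpiexec -np 6 ./bin

# Mapping and binding can then be controlled on the command line, e.g.:
mpiexec -np 6 --bysocket --bind-to-socket --report-bindings ./bin

# or per rank with a rankfile:
#   rank 0=n01 slot=0
#   rank 1=n01 slot=1
#   rank 2=n02 slot=0
#   rank 3=n02 slot=1
#   rank 4=n02 slot=2
#   rank 5=n02 slot=3
mpiexec -np 6 --rankfile my_rankfile ./bin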

I hope this helps,
Gus Correa
Tiago Silva (Cefas)
2014-02-20 09:51:24 UTC
Thanks, this seems promising. Before I try building OpenMPI with Torque support, if I parse $PBS_NODEFILE to produce my own machinefile for mpiexec, for instance following my previous example:

n100
n100
n101
n101
n101
n101

won't mpiexec start MPI processes with ranks 0-1 on n100 and ranks 2-5 on n101? That is what I think it does when I don't use qsub.

Tiago
Michel Béland
2014-02-20 13:09:50 UTC
Yes, but you should not change the nodes inside the $PBS_NODEFILE. You
can change the order, but do not delete machines or add new ones;
otherwise your MPI code will try to run on nodes belonging to other jobs.

If you want exactly the nodes above, you can ask for
-l nodes=n100:ppn=2+n101:ppn=4. If you only want two cores on the first
node and four on the second, and the specific nodes are irrelevant, you
can ask for -l nodes=1:ppn=2+1:ppn=4 instead.
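
For example (job.sh is just a placeholder for the submission script), the two requests would be submitted as:

# Exactly those two nodes: 2 cores on n100 and 4 on n101.
qsub -l nodes=n100:ppn=2+n101:ppn=4 job.sh

# Same shape (2 cores on one node, 4 on another), on whichever nodes are free.
qsub -l nodes=1:ppn=2+1:ppn=4 job.sh

In both cases $PBS_NODEFILE will list the first node twice and the second node four times.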

Michel Béland
Calcul Québec
Tiago Silva (Cefas)
2014-02-20 13:32:48 UTC
Sure, I will want to stick to the exact same nodes. In my case I don't need to worry about free slots on the nodes: I specify exclusive usage with -W x="NACCESSPOLICY:SINGLEJOB".
I actually oversubscribe the cores, as some processes have very little to do; that is part of the performance optimisation I want to retain.
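
A sketch of how that might be submitted, assuming the node counts from the earlier example (the script name job.sh and the rank count of 12 are illustrative, not from the thread):

# Exclusive access to the allocated nodes, so oversubscription only affects this job:
qsub -W x="NACCESSPOLICY:SINGLEJOB" -l nodes=2:ppn=4 job.sh

# Inside job.sh, deliberately start more ranks than requested cores:
mpiexec -f machinefile -np 12 ./bin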

Thanks again
tiago
Gus Correa
2014-02-20 19:12:10 UTC
Hi Tiago

Which MPI and which mpiexec are you using?
I am not familiar with all of them, but the behavior
depends primarily on which one you are using.
Most likely, by default you will get the sequential
rank-to-node mapping that you mentioned.
Have you tried it?
What result did you get?

You can call the MPI function MPI_Get_processor_name
early in your code, say, right after MPI_Init, MPI_Comm_size, and
MPI_Comm_rank,
and then print out each rank together with its processor name
(which will probably be the node name).

https://www.open-mpi.org/doc/v1.4/man3/MPI_Get_processor_name.3.php
http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Get_processor_name.html

With OpenMPI there are easier ways (through mpiexec) to report this
information.
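
For instance (again assuming OpenMPI 1.6-era option names), the launcher can print the rank-to-node map by itself, without touching the code:

mpiexec -np 6 --display-map ./bin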

However, there are ways to change the sequential
rank-to-node mapping, if this is your goal,
again, depending on which mpiexec you are using.

Anyway, this is more of an MPI question than a Torque question.

I hope this helps,
Gus Correa
Tiago Silva (Cefas)
2014-02-21 12:07:18 UTC
I am using MPICH2 1.5 with Hydra. Now that I think of it, the behaviour is as I expect: the machinefile binds ranks to nodes in the order listed. I have also compiled the code with OpenMPI 1.6.5, but I don't remember the behaviour there, and our OpenMPI is not integrated with Torque.

Thanks for the suggestion, I will try to use PBS_NODEFILE to generate a machinefile on the fly.

Tiago

[hyde@deepgreen PP]$ which mpiexec
/apps/mpich2/1.5/ifort/bin/mpiexec
[hyde@deepgreen PP]$ mpirun -info
HYDRA build details:
Version: 1.5
Release Date: Mon Oct 8 14:00:48 CDT 2012
(...)
