[torqueusers] Ansys/fluent MPI on RHEL7 slow

Discussion:

Martin Vogt

2016-10-27 08:07:03 UTC

Hello,

We use Ansys/Fluent on RHEL7 with torque-client. When we start a parallel
job through qsub (torque-client), the CPU pinning goes wrong and
performance is about 20 times slower than on a SLE11 system (see the
strace output below).

I found that the sched_setaffinity call does not succeed in the
Fluent MPI binaries in the qsub environment.
If I adjust the CPU pinning manually afterwards with "taskset", the
performance is fine.
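
The manual taskset fix described above could be scripted roughly as follows. This is only a sketch under assumptions: the names plan_pinning and apply_pinning are made up for illustration, the worker PIDs are assumed to be known already, and spreading them round-robin over the CPUs the shell is allowed to use is assumed to be an acceptable layout. It relies on Python's os.sched_setaffinity, which wraps the same system call that appears in the strace output.

```python
import os


def plan_pinning(pids, allowed_cpus):
    """Assign each PID one CPU, cycling through the allowed CPUs in order."""
    cpus = sorted(allowed_cpus)
    if not cpus:
        raise ValueError("no CPUs available to pin to")
    return {pid: cpus[i % len(cpus)] for i, pid in enumerate(pids)}


def apply_pinning(plan):
    """Pin each PID to its planned CPU (roughly `taskset -cp CPU PID`)."""
    for pid, cpu in plan.items():
        os.sched_setaffinity(pid, {cpu})


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    # Demo on the current process only; real use would pass the worker PIDs.
    allowed = os.sched_getaffinity(0)  # CPUs this process may run on
    print(plan_pinning([os.getpid()], allowed))
```

Because the plan only ever uses CPUs taken from the current affinity mask, it should not ask for a CPU that the job is not allowed to use.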

When I ssh into the node, everything is fine too.

Thus, the issue only occurs when the node is allocated with qsub
(interactive).
As far as I can tell, the only difference is that the shell started through
qsub is in a different cgroup container than the shell in the ssh session.
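
One way to compare the two shells is to read the cgroup membership and the allowed CPU list straight from /proc. The sketch below is illustrative only: the function names are hypothetical, and it assumes a cgroup v1 layout where each line of /proc/<pid>/cgroup has the form ID:controllers:path, which is the same shape that ps -O cgroup prints.

```python
import os


def cpuset_cgroup(cgroup_text):
    """Return the cpuset cgroup path from /proc/<pid>/cgroup content, or None."""
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)  # hierarchy-ID:controllers:path
        if len(parts) == 3 and "cpuset" in parts[1].split(","):
            return parts[2]
    return None


def allowed_cpu_list(status_text):
    """Return the Cpus_allowed_list value from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return line.split(":", 1)[1].strip()
    return None


if __name__ == "__main__" and os.path.exists("/proc/self/cgroup"):
    with open("/proc/self/cgroup") as f:
        print("cpuset cgroup:", cpuset_cgroup(f.read()))
    with open("/proc/self/status") as f:
        print("allowed CPUs:", allowed_cpu_list(f.read()))
```

Running this once in the qsub shell and once in the ssh shell should show whether the two sessions are really restricted to different CPU sets.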

best regards,

Martin

In the torque-client cgroup:

ps -O cgroup

PID CGROUP S TTY TIME COMMAND
48526 6:cpuset:/torque/276.login4 S pts/0 00:00:00 -bash
48662 6:cpuset:/torque/276.login4 R pts/0 00:00:00 ps -O cgroup

fluent MPI strace

/tmp/1.dbg:38757 sched_setaffinity(0, 512, {1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0}) = 0
/tmp/2.dbg:38758 sched_setaffinity(0, 512, {100, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0}) = 0
/tmp/3.dbg:38759 sched_setaffinity(0, 512, {2, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0}) = -1 EINVAL (Invalid argument)
/tmp/3.dbg:38759 write(2, "sched_setaffinity() call failed: Invalid argument\n", 50) = 50
/tmp/4.dbg:38760 sched_setaffinity(0, 512, {200, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0}) = -1 EINVAL (Invalid argument)
/tmp/4.dbg:38760 write(2, "sched_setaffinity() call failed: Invalid argument\n", 50) = 50
/tmp/a.dbg:37731 sched_setaffinity(0, 512, {1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0}) = 0
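
To read the masks in the strace output above, it helps to turn them into CPU numbers. The sketch below assumes that strace printed each mask word in hexadecimal (older strace versions did this, but it is an assumption about this particular trace) and that each word is 64 bits wide, which matches 64 words for a 512-byte mask. The function name is hypothetical.

```python
def cpus_from_strace_mask(words, bits_per_word=64):
    """Turn strace's list of mask words into the CPU numbers that are set."""
    cpus = []
    for index, word in enumerate(words):
        value = int(word, 16)  # assumption: words are printed in hex
        for bit in range(bits_per_word):
            if value & (1 << bit):
                cpus.append(index * bits_per_word + bit)
    return cpus


if __name__ == "__main__":
    for first_word in ["1", "100", "2", "200"]:
        mask = [first_word] + ["0"] * 63  # 64 words of 8 bytes = 512 bytes
        print(first_word, "->", cpus_from_strace_mask(mask))
```

Under that reading, the two calls that succeed inside the Torque cpuset ask for CPUs 0 and 8, and the two that fail with EINVAL ask for CPUs 1 and 9.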

In the ssh session:

ps -O cgroup

PID CGROUP S TTY TIME COMMAND
48668 1:name=systemd:/user.slice/ S pts/1 00:00:00 -bash
48760 1:name=systemd:/user.slice/ R pts/1 00:00:00 ps -O cgroup

fluent MPI strace

/tmp/1.dbg:42337 sched_setaffinity(0, 512, {1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0}) = 0
/tmp/2.dbg:42338 sched_setaffinity(0, 512, {100, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0}) = 0
/tmp/3.dbg:42339 sched_setaffinity(0, 512, {2, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0}) = 0
/tmp/4.dbg:42340 sched_setaffinity(0, 512, {200, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0}) = 0
/tmp/a.dbg:37731 sched_setaffinity(0, 512, {1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0}) = 0

David Beer

2016-10-27 16:19:07 UTC

Martin,

It sounds like the MPI is trying to set affinity for cpus that aren't in
the cpuset for the job. I'm not sure exactly how to resolve this - can you
specify a set of cpus for fluent MPI? If not, it seems like it'd be best to
pick whether you want fluent MPI or Torque to pick the job's cgroup.
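
This diagnosis can be checked on the node itself by comparing the CPUs the MPI asks for with the CPUs the job shell is allowed to use. The following is a minimal sketch, not anything Torque or Fluent provides: outside_cpuset and try_pin are hypothetical helper names, and the CPU numbers in the demo are purely illustrative. It assumes that os.sched_getaffinity(0) in a freshly started job shell reflects the job's cpuset, which should normally be the case.

```python
import errno
import os


def outside_cpuset(requested_cpus, allowed_cpus):
    """Return the requested CPUs that the process is not allowed to use."""
    return sorted(set(requested_cpus) - set(allowed_cpus))


def try_pin(cpu):
    """Try to pin the calling process to one CPU; return False on EINVAL."""
    try:
        os.sched_setaffinity(0, {cpu})
        return True
    except OSError as exc:
        if exc.errno == errno.EINVAL:  # same error as in the strace output
            return False
        raise


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)
    wanted = [0, 1, 8, 9]  # illustrative CPU numbers only
    print("allowed:", sorted(allowed))
    print("would fail:", outside_cpuset(wanted, allowed))
```

Any CPU listed under "would fail" is one for which a sched_setaffinity call from inside the job would be expected to return EINVAL.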

_______________________________________________
torqueusers mailing list
http://www.supercluster.org/mailman/listinfo/torqueusers

--
David Beer | Torque Architect
Adaptive Computing

Bidwell, Matt

2016-10-27 16:41:39 UTC

Also, which MPI? Some seem to be better at ordering than others. I don't work with Ansys a ton, but we've recommended our users use the Intel MPI, which is the flag "-mpi intelmpi" for Ansys. I will admit I have not checked out the ordering when cgroups are enabled in Torque, so this is just a general thing we've learned.
Matt

Kevin Van Workum

2016-10-27 16:35:25 UTC

The simplest thing to do is require that Fluent jobs request all the cores
on the nodes so that Fluent's MPI (Platform?) can set the affinity however
it decides. Ansys products do not support running under cgroups for this
reason.
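
If Fluent jobs are required to take whole nodes, a job script could verify that before launching the solver. This is a rough sketch under assumptions: owns_whole_node is a hypothetical helper, and it assumes the node's CPUs are numbered contiguously from 0, which may not hold if some CPUs are offline.

```python
import os


def owns_whole_node(allowed_cpus, total_cpus):
    """True if the allowed CPU set covers CPUs 0..total_cpus-1."""
    return set(range(total_cpus)) <= set(allowed_cpus)


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)  # CPUs this job may run on
    total = os.cpu_count() or 0        # logical CPUs on the node
    if owns_whole_node(allowed, total):
        print("job has all", total, "CPUs")
    else:
        print("job is restricted to", sorted(allowed), "of", total, "CPUs")
```

When the check reports a restricted set, the MPI's own pinning may again request CPUs outside the job's cpuset.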

--
Kevin Van Workum, PhD
Sabalcore Computing Inc.
"Where Data Becomes Discovery"
http://www.sabalcore.com
877-492-8027 ext. 1011
