Discussion:
[torqueusers] Weird re-running jobs on a new installation of torque with maui.
Mike Diehn
2016-11-01 14:51:54 UTC
Permalink
Just got a shiny new cluster with Bright 7 and CentOS 7.2. It came with
SLURM selected. I chose Torque v6.0.0 instead and installed Maui v3.3.2.

I submitted a test job and it queued and ran just fine. It seemed to be
running an awfully long time, though - all I had in the submit script was "uname -a".

After watching carefully, I noticed the job was running over and over and
over and over! Every time it finished, it would start up again. I had to
qdel it to get it to stop.

It had the RERUNABLE flag set, but I don't take that to mean it *should*
re-run. And when I submitted another test with "qsub -r n ...", that job
didn't start at all until I used qrun.
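
For anyone reproducing this, the sort of thing I mean looks roughly like
the following - the job ID and script name are just placeholders:

  qsub -r n test.sh                    # submit with the rerunnable flag off
  qstat -f <jobid> | grep -i rerun     # should show Rerunable = False
  qalter -r n <jobid>                  # or flip it off on an already-queued job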

So, I'm wondering if there's something I need to do with Maui or Torque to
make them play nice with each other. I followed the "integration guide" up
at adaptivecomputing.com, but it made no difference.
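
In case it matters, the resource-manager bits of my maui.cfg are more or
less what the guide shows - something along these lines (the path is the
stock Maui install location, so yours may differ):

  $ grep -E '^(SERVERHOST|ADMIN1|RMCFG|RMPOLLINTERVAL)' /usr/local/maui/maui.cfg
  SERVERHOST        bruce.cm.cluster
  ADMIN1            root
  RMCFG[base]       TYPE=PBS
  RMPOLLINTERVAL    00:00:30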

Best,
Mike
--
Mike Diehn
Enfield, NH
***@diehn.net
David Beer
2016-11-01 16:34:02 UTC
Permalink
Mike,

I've never heard of this issue that you're reporting, but I know that we
have patched Maui to be compatible with the way Torque now reports usage
information. If you check out Maui from source and build that, then you'll
get that change. I haven't heard of anyone else reporting that jobs run
multiple times.

As far as the rerunnable flag goes, you are correct; jobs shouldn't run
multiple times. Have you looked at the logs (perhaps via tracejob) to see
why the job is being run again?
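
Something like the following usually shows the whole lifecycle in one place
(the job ID is a placeholder; -n tells tracejob how many days of logs to
search, and the server_logs path assumes the default spool layout, with
$TORQUE_HOME standing in for wherever your spool directory lives):

  tracejob -n 3 <jobid>
  grep <jobid> $TORQUE_HOME/server_logs/*    # raw server log entries, if you want more detail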
--
David Beer | Torque Architect
Adaptive Computing
Mike Diehn
2016-11-02 13:01:22 UTC
Permalink
Post by David Beer
Mike,
I've never heard of this issue that you're reporting, but I know that we
have patched Maui to be compatible with the way Torque now reports usage
information. If you check out Maui from source and build that, then you'll
get that change. I haven't heard of anyone else reporting that jobs run
multiple times.
I haven't checked out the current Maui source and built that yet. But
please see below - would you tell me whether you think those changes will
fix what I found here?
Post by David Beer
As far as the rerunnable flag, you are correct; jobs shouldn't run
multiple times. Have you looked into the logs (perhaps via tracejob) to see
why the job is being run again?
David - in short, it *appears* everything is working as designed and I
simply need to figure out why my jobs are all returning -3. That's
JOB_EXEC_RETRY, which means "job execution failed, do retry."

The output from tracejob looks like this:

[***@bruce ~]# tracejob 25
/cm/shared/apps/torque/var/spool/mom_logs/20161102: No such file or
directory
/cm/shared/apps/torque/var/spool/sched_logs/20161102: No such file or
directory

Job: 25.bruce.cm.cluster

11/02/2016 08:50:11 A queue=batch
11/02/2016 08:50:12.115 S Job Run at request of ***@bruce.cm.cluster
11/02/2016 08:50:12.133 S child reported success for job after 0 seconds
(dest=???), rc=0
11/02/2016 08:50:12.135 S preparing to send 'b' mail for job
25.bruce.cm.cluster to ***@brucehead (---)
11/02/2016 08:50:12.135 S Not sending email: User does not want mail of
this type.
11/02/2016 08:50:12 A user=mdiehn group=mdiehn
jobname=mikeEpilogueTests queue=batch ctime=1478091011 qtime=1478091011
etime=1478091011 start=1478091012
owner=***@brucehead

exec_host=brucehead/0-7+bruce16/0-7+bruce15/0-7+bruce14/0-7
Resource_List.neednodes=4:ppn=8
Resource_List.nodect=4
Resource_List.nodes=4:ppn=8 Resource_List.walltime=00:30:00
11/02/2016 08:50:21.284 S obit received - updating final job usage info
*11/02/2016 08:50:21.285 S job exit status -3 handled*


#--- here is the first re-run...


11/02/2016 08:50:23.057 S Job Run at request of ***@bruce.cm.cluster
11/02/2016 08:50:23.073 S child reported success for job after 0 seconds
(dest=???), rc=0
11/02/2016 08:50:23.074 S preparing to send 'b' mail for job
25.bruce.cm.cluster to ***@brucehead (---)
11/02/2016 08:50:23.074 S Not sending email: User does not want mail of
this type.
11/02/2016 08:50:23 A user=mdiehn group=mdiehn
jobname=mikeEpilogueTests queue=batch ctime=1478091011 qtime=1478091011
etime=1478091011 start=1478091023
owner=***@brucehead

exec_host=brucehead/0-7+bruce16/0-7+bruce15/0-7+bruce14/0-7
Resource_List.neednodes=4:ppn=8
Resource_List.nodect=4
Resource_List.nodes=4:ppn=8 Resource_List.walltime=00:30:00
11/02/2016 08:50:29.447 S obit received - updating final job usage info
11/02/2016 08:50:29.448 S job exit status -3 handled

#--- end of the first re-run, it just repeats that same output from here on
David Beer
2016-11-02 15:40:48 UTC
Permalink
It seems unlikely that this is going to be caused by the difference in
reporting. I'd look at the node where the job executed and check the mom
logs, as well as the mother superior's logs. This exit status means that
the job couldn't run on that node for a reason that has nothing to do with
the job itself. Perhaps the user doesn't exist on that node, it was a
multi-node job that couldn't bring in the sister moms, a system call failed
(such as not being able to set up the cgroup or cpuset), the prologue
failed, the limits couldn't be set for the job, the environment couldn't be
set up, or some other transitory failure occurred.
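
If it helps, grepping the mother superior's mom log for the job ID usually
shows the specific failure right before the abort - something like this,
where the log path is Torque's default mom_logs location and may well be
different on your Bright install:

  ssh brucehead 'grep -B2 -A2 25.bruce /var/spool/torque/mom_logs/20161102'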
--
David Beer | Torque Architect
Adaptive Computing
Mike Diehn
2016-11-02 15:49:11 UTC
Permalink
a system call failed (such as not being able to setup the cgroup or cpuset)
There are entries in the pbs_mom logs out on the nodes about not being able
to remove files under /sys/...../cgroup ... I'll work from there.
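
For the record, here's roughly how I'm poking at it on one of the nodes.
I'm assuming Torque 6 keeps its per-job cgroup directories under each
controller's torque/ hierarchy, and that the mom log path below is right
for this install - both guesses on my part:

  ssh bruce16 'ls -l /sys/fs/cgroup/cpuset/torque/ /sys/fs/cgroup/memory/torque/'
  ssh bruce16 'grep -i cgroup /var/spool/torque/mom_logs/20161102 | tail -20'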

Thanks again!
--
Mike Diehn
Enfield, NH
***@diehn.net
Mike Diehn
2016-11-02 19:20:36 UTC
Permalink
David,

I may have been overstating the problem. It seems the jobs re-run simply
because they have the rerunnable flag set. So, as soon as they abort, they
re-queue and run again.

The real problem seems to be that pbs_mom on the nodes simply can't get the
jobs started:

11/02/2016 14:59:20.529;01; pbs_mom.5947;Job;TMomFinalizeJob3;Job
39.bruce.cm.cluster read start return code=-3 session=41041

See that -3?

After that it's all abort and clean up.

I've found this commit:

https://github.com/adaptivecomputing/torque/commit/41b6cd605bb9bc55f18eeaa1daaa63ef41e0d356

Do you suppose it's worth downloading, building, and installing a newer
release - one that includes that fix?
--
Mike Diehn
Enfield, NH
***@diehn.net
David Beer
2016-11-02 20:05:44 UTC
Permalink
Mike,

Ah, that might be the source of your issue. That fix is in 6.0.1 and 6.0.2,
and it'd definitely be worth just going to 6.0.2, especially if you are
planning to use cgroups (which it looks like you are).

David
--
David Beer | Torque Architect
Adaptive Computing
Mike Diehn
2016-11-02 20:20:32 UTC
Permalink
I've downloaded torque-6.0.2-1469811694_d9a3483.tar.gz from Adaptive's
website.

I see five RCs for 6.1 on GitHub. I'd go for one of those if I thought it
was ready for production... ;-)
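
Assuming the usual autotools dance works on this tarball, my plan is
roughly the following. The prefix is a guess based on where Bright put the
current install, and --enable-cgroups is there because we're using cgroups:

  tar xzf torque-6.0.2-1469811694_d9a3483.tar.gz
  cd torque-6.0.2-1469811694_d9a3483
  ./configure --prefix=/cm/shared/apps/torque --enable-cgroups
  make -j8 && sudo make install
  make packages            # self-extracting mom/client packages for the compute nodes
  pbs_server --about       # afterwards, sanity-check which version is actually on PATH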
--
Mike Diehn
Enfield, NH
***@diehn.net
David Beer
2016-11-02 20:37:25 UTC
Permalink
I can tell you that Torque is going to have only one minor change from RC 5
to what will be released. If you are adventurous, you are welcome to grab
one of the RCs. The change deals with GPU and CUDA 8 support, so there's a
good chance it isn't relevant to you.
--
David Beer | Torque Architect
Adaptive Computing
Mike Diehn
2016-11-11 02:23:04 UTC
Permalink
I compiled and installed torque-6.0.2-1469811694_d9a3483. Once I stopped
the old daemons and started the new pbs_server on the head node and pbs_mom
on the compute nodes, my jobs queue, start, run, stop, *and stay stopped!*

Yayyyy!!

Thanks for the advice and support, David!
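
For the archives, the swap-over was basically the sequence below. The
daemon paths and the momctl/pdsh usage are from memory, so treat it as a
sketch rather than a transcript:

  qterm -t quick                            # shut down the old pbs_server, leaving jobs alone
  /cm/shared/apps/torque/sbin/pbs_server    # start the new server on the head node
  pdsh -w brucehead,bruce[14-16] 'momctl -s; sleep 2; /cm/shared/apps/torque/sbin/pbs_mom'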

Sincerely,
Mike
--
Mike Diehn
Enfield, NH
***@diehn.net
David Beer
2016-11-11 04:51:39 UTC
Permalink
I'm really glad we figured that one out.

Cheers
--
David Beer | Torque Architect
Adaptive Computing