Discussion:
[torqueusers] RM failure with node_spec failed
Stijn De Weirdt
2016-08-29 14:41:55 UTC
hi all,

we are running torque 6.0.2 with moab 9.0.2.
we are seeing a whole bunch of single-core jobs getting placed in
batchhold because moab tries to start them on a node, but torque
fails/refuses to actually start them:
Could not locate requested resources 'nodeXYZ' (node_spec failed)
moab checkjob shows a similarly cryptic message:
RM failure, rc: 15046, msg: 'Resource temporarily unavailable'
the jobs request nodes=1:ppn=1, and the node itself has 1 core unused
(23 of its 24 cores are in use).
the mom logs do not have any entry about an attempt to start the job, so I
guess it's pbs_server that decides the job can't be started when moab
asks it to.

the only thing that looks suspicious is that the load on the node is
higher than 23 (for 23 cores used), but we do not have any load
settings in use (none that we know of, and none that are defaults we are
unaware of).

any hints/tips on how to debug this?

stijn
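
For anyone hitting the same node_spec failure, a few things worth checking
before digging deeper are what pbs_server believes about the node versus what
the MOM is configured to do. A minimal sketch, assuming the default
TORQUE_HOME of /var/spool/torque and using nodeXYZ as a placeholder node name:

  # what does pbs_server currently think is allocated on the node?
  # (compare np and the jobs= list with what is really running there)
  pbsnodes nodeXYZ

  # which running jobs does the server map to that node?
  qstat -rn -1 | grep nodeXYZ

  # does the MOM have any load-based throttles ($max_load/$ideal_load) configured?
  grep -iE 'max_load|ideal_load' /var/spool/torque/mom_priv/config

  # was the rejection logged on the server side? server logs are per-day files (YYYYMMDD)
  grep 'node_spec failed' /var/spool/torque/server_logs/20160829

If pbsnodes lists jobs= entries for jobs that have already exited, the node is
effectively leaking execution slots inside pbs_server, which points at the
stale-allocation issue discussed below rather than a MOM or load problem.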
David Beer
2016-08-29 16:18:46 UTC
Stijn,

Another customer reported a similar issue, and in debugging it we found an
issue with how tasks are freed. I would recommend that you try out these
two changesets to see if they fix your issue:

commit 59ba48ef436f6f7242d4115a994a02f3e4724866
Author: David Beer <***@adaptivecomputing.com>
Date: Fri Aug 19 09:33:00 2016 -0600

TRQ-3727. Fix an issue with the way we were freeing tasks.

commit afd68cb20de167fcaca6858c01cc434b22512abe
Author: David Beer <***@adaptivecomputing.com>
Date: Thu Aug 18 14:25:53 2016 -0600

TRQ-3727. Add protections against erroneous jobs in node_usage files.

Cheers,

David
--
David Beer | Torque Architect
Adaptive Computing
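
For reference, a rough way to test the two changesets on top of an existing
6.0.2 source tree instead of moving to the 6.0-dev tip; a sketch only,
assuming a git clone of the torque sources in which these commits are
reachable (e.g. the adaptivecomputing/torque repository on GitHub):

  git fetch origin
  # apply the older commit first, then the follow-up fix
  git cherry-pick afd68cb20de167fcaca6858c01cc434b22512abe  # TRQ-3727: protect against erroneous jobs in node_usage files
  git cherry-pick 59ba48ef436f6f7242d4115a994a02f3e4724866  # TRQ-3727: fix the way tasks are freed
  # rebuild and reinstall, then restart pbs_server
  # (exact build steps depend on how the site packages torque)

Once pbs_server has been restarted, releasing a few of the batchheld jobs and
watching whether the node_spec failures reappear is a quick way to confirm
the fix.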
stdweird
2016-08-30 14:54:08 UTC
hi david,

thanks for the suggestion and fixes! we have been running the latest 6.0-dev
for 10+ hours now and things are looking much better: no new occurrences yet
(before the update, we would certainly have run into a few already).


stijn
David Beer
2016-08-30 15:32:03 UTC
Stijn,

I'm glad to hear that this appears to have resolved the issue. I wouldn't
usually recommend running from the tip, but in this case there are a few
important fixes.
--
David Beer | Torque Architect
Adaptive Computing
John Griffin-Wiesner
2017-02-10 19:22:37 UTC
Looks like this crept back into torque 6.1.0? We have
several nodes with 24 cores and 2 gpus defined, and this
same config worked under torque 5.



***@mesabim3 [/var/spool/torque] # tracejob 3199013
/var/spool/torque/mom_logs/20170210: No such file or directory

Job: 3199013.mesabim3.msi.umn.edu

02/10/2017 12:32:49.861 S enqueuing into k40, state 1 hop 1
02/10/2017 12:32:49 A queue=k40
02/10/2017 12:32:59 L NAME=3199013 REQUESTEDTC=1 UNAME=johngw GNAME=tech WCLIMIT=600 STATE=Idle RCLASS=[k40] SUBMITTIME=1486751569
RMEMCMP=>= RDISKCMP=>= SYSTEMQUEUETIME=1486751569 QOS=smallqos FLAGS=INTERACTIVE ACCOUNT=tech ENDDATE=2140000000
GRES=GPUS=2 SRM=Mesabi
MESSAGE="\STARTLabel\20\20\20CreateTime\20ExpireTime\20\20\20\20Owner\20Prio\20Num\20Message\0a,\STARTcheckpoint\20record\20not\20found"
DRMJID=3199013.mesabim3.msi.umn.edu
02/10/2017 12:33:00.308 S Could not locate requested resources 'cn3008:ppn=24:gpus=2' (node_spec failed)
02/10/2017 12:33:03.460 S Could not locate requested resources 'cn3008:ppn=24:gpus=2' (node_spec failed)
02/10/2017 12:33:29.065 S Could not locate requested resources 'cn3008:ppn=24:gpus=2' (node_spec failed)
02/10/2017 12:33:30.515 S Could not locate requested resources 'cn3008:ppn=24:gpus=2' (node_spec failed)
02/10/2017 12:33:56.048 S Could not locate requested resources 'cn3008:ppn=24:gpus=2' (node_spec failed)
02/10/2017 12:33:57.418 S Could not locate requested resources 'cn3008:ppn=24:gpus=2' (node_spec failed)
02/10/2017 12:35:04.102 S Could not locate requested resources 'cn3008:ppn=24:gpus=2' (node_spec failed)
02/10/2017 12:35:06.284 S Could not locate requested resources 'cn3008:ppn=24:gpus=2' (node_spec failed)
02/10/2017 12:35:30.751 S Job deleted at request of ***@mesabim3.msi.umn.edu
02/10/2017 12:35:30 L NAME=3199013 REQUESTEDTC=1 UNAME=johngw GNAME=tech WCLIMIT=600 STATE=Idle RCLASS=[k40] SUBMITTIME=1486751569
DISPATCHTIME=1486751702 RMEMCMP=>= RDISKCMP=>= SYSTEMQUEUETIME=1486751569 QOS=smallqos
FLAGS=INTERACTIVE,FSVIOLATION ACCOUNT=tech BYPASSCOUNT=6 PARTITION=mesabipar DPROCS=24 ENDDATE=2140000000
GRES=GPUS=2 SRM=Mesabi
MESSAGE="\START0\20\20\20\20\20\20\20\20\20-00:00:24\20\20\20\20364days\20\20\20\20\20\20N/A\20\20\20\200\20\2015\20cannot\20start\20job\203199013\20-\20RM\20failure,\20rc:\2015046,\20msg:\20'Resource\20temporarily\20unavailable'\0a1\20\20\20\20\20\20\20\20\20\2000:00:00\201:00:00:00\20\20\20\20\20\20N/A\20\20\20\200\20\20\201\20job\20cancelled\20-\20job\20was\20rejected\0a,\STARTjob\20was\20rejected"
VARIABLE=UsageRecord=4610911 EFFECTIVEQUEUEDURATION=161 DRMJID=3199013.mesabim3.msi.umn.edu
02/10/2017 12:35:30 A requestor=***@mesabim3.msi.umn.edu
02/10/2017 12:35:54 L NAME=3199013 REQUESTEDTC=1 UNAME=johngw GNAME=tech WCLIMIT=600 STATE=Removed RCLASS=[k40] SUBMITTIME=1486751569
DISPATCHTIME=1486751702 RMEMCMP=>= RDISKCMP=>= SYSTEMQUEUETIME=1486751569 QOS=smallqos
FLAGS=INTERACTIVE,FSVIOLATION ACCOUNT=tech BYPASSCOUNT=6 PARTITION=mesabipar DPROCS=24 ENDDATE=2140000000
GRES=GPUS=2 SRM=Mesabi
MESSAGE="\START0\20\20\20\20\20\20\20\20\20-00:00:48\20\20\20\20364days\20\20\20\20\20\20N/A\20\20\20\200\20\2015\20cannot\20start\20job\203199013\20-\20RM\20failure,\20rc:\2015046,\20msg:\20'Resource\20temporarily\20unavailable'\0a1\20\20\20\20\20\20\20\20\20-00:00:24\20\20\2023:59:36\20\20\20\20\20\20N/A\20\20\20\200\20\20\201\20job\20cancelled\20-\20job\20was\20rejected\0a"
EXITCODE=0 VARIABLE=UsageRecord=4610911 EFFECTIVEQUEUEDURATION=161 DRMJID=3199013.mesabim3.msi.umn.edu
02/10/2017 12:40:30.249 S on_job_exit valid pjob: 3199013.mesabim3.msi.umn.edu (substate=59)
02/10/2017 12:42:02.734 S Request invalid for state of job COMPLETE
02/10/2017 12:45:33.019 S dequeuing from k40, state COMPLETE
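
In a case like this, the first thing worth confirming is whether pbs_server
itself still reports 24 free cores and 2 free gpus on the node named in the
failing node_spec. A minimal sketch, using the node name from the trace above:

  # compare np/gpus and the jobs= list (and gpu_status, if reported) with reality
  pbsnodes cn3008

  # cross-check which running jobs the server maps to that node
  qstat -rn -1 | grep cn3008

If the jobs= list (or, on builds that keep per-node usage records under
server_priv, the node_usage entries referenced by TRQ-3727) still contains
jobs that have already finished, the node is leaking slots and the 6.0.x
task-freeing fixes above are the natural place to look for a 6.1.0 regression.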
--
John Griffin-Wiesner
HPC Systems Administrator
Minnesota Supercomputing Institute
http://www.msi.umn.edu
***@msi.umn.edu