Eva Hocks
2017-01-20 18:18:27 UTC
CentOS 6, Torque 4.2.10, Maui 3.3.1
When a user changes the slot limit of an existing array job, Torque does
not put the subjobs beyond the new limit on hold, so Maui keeps trying to
start those jobs:
qalter -t %5 703[]
703[17].hpcdev-005 hotel-17 hocks 00:04:09 R hotel
703[18].hpcdev-005 hotel-18 hocks 00:04:09 R hotel
703[19].hpcdev-005 hotel-19 hocks 00:04:09 R hotel
703[20].hpcdev-005 hotel-20 hocks 00:04:09 R hotel
703[21].hpcdev-005 hotel-21 hocks 00:04:09 R hotel
703[22].hpcdev-005 hotel-22 hocks 0 Q hotel
703[23].hpcdev-005 hotel-23 hocks 0 Q hotel
703[24].hpcdev-005 hotel-24 hocks 0 Q hotel
703[25].hpcdev-005 hotel-25 hocks 0 Q hotel
703[26].hpcdev-005 hotel-26 hocks 0 Q hotel
703[27].hpcdev-005 hotel-27 hocks 0 Q hotel
703[28].hpcdev-005 hotel-28 hocks 0 Q hotel
703[29].hpcdev-005 hotel-29 hocks 0 Q hotel
703[30].hpcdev-005 hotel-30 hocks 0 Q hotel
01/20 09:56:17 INFO: checking job 703[22](1) state: Idle (ex: Idle)
01/20 09:56:17 MJobSelectMNL(703[22],DEFAULT,NULL,MNodeList,NodeMap,MaxSpeed,2)
01/20 09:56:17 MReqGetFNL(703[22],0,DEFAULT,NULL,DstNL,NC,TC,2140000000,0)
...
01/20 09:56:17 INFO: idle resources (8 tasks/1 nodes) found with feasible list specified
01/20 09:56:17 INFO: tasks located for job 703[22]: 1 of 1 required (8 feasible)
01/20 09:56:17 INFO: allocated MNode[000]x1 'hpc-0-5' to 703[22]:0
01/20 09:56:17 MJobStart(703[22])
01/20 09:56:17 MAMAllocJReserve(703[22],RIndex,ErrMsg)
01/20 09:56:17 MRMJobStart(703[22],Msg,SC)
01/20 09:56:17 MPBSJobStart(703[22],base,Msg,SC)
01/20 09:56:17 ERROR: job '703[22]' cannot be started: (rc: 15004 errmsg: 'Invalid request MSG=Cannot run job. Array slot limit is 5 and there are already 5 jobs running' hostlist: 'hpc-0-5')
01/20 09:56:17 ALERT: cannot start job 703[22] (RM 'base' failed in function 'jobstart')
01/20 09:56:17 WARNING: cannot start job '703[22]' through resource manager
01/20 09:56:17 ALERT: job '703[22]' deferred after 1 failed start attempts (API failure on last attempt)
01/20 09:56:17 MJobSetHold(703[22],16,00:15:00,RMFailure,cannot start job - RM failure, rc: 15004,
msg: 'Invalid request MSG=Cannot run job. Array slot limit is 5 and there are already 5 jobs running')
01/20 09:56:17 ALERT: job '703[22]' cannot run (deferring job for 900 seconds)
01/20 09:56:17 MSysRegEvent(JOBDEFER: defer hold placed on job '703[22]'. reason: 'RMFailure',0,0,1)
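To see how many start attempts are lost this way, the slot-limit rejections can be pulled out of the Maui log. The following is only a rough sketch: the regular expression is modeled on the single ERROR line shown above, and the function name `slot_limit_rejections` is made up for illustration. Other Torque or Maui versions may word the message differently.

```python
import re

# Pattern modeled on the ERROR line in the log above (assumption: other
# Torque/Maui versions may phrase this message differently).
SLOT_ERROR = re.compile(
    r"ERROR:\s+job '(?P<job>[^']+)' cannot be started: \(rc: (?P<rc>\d+) .*?"
    r"Array slot limit is (?P<limit>\d+) "
    r"and there are already (?P<running>\d+) jobs running"
)


def slot_limit_rejections(log_lines):
    """Return (job_id, rc, slot_limit, running) for each slot-limit start failure."""
    hits = []
    for line in log_lines:
        m = SLOT_ERROR.search(line)
        if m:
            hits.append(
                (m["job"], int(m["rc"]), int(m["limit"]), int(m["running"]))
            )
    return hits


# Two lines copied from the log excerpt above.
sample = [
    "01/20 09:56:17 MPBSJobStart(703[22],base,Msg,SC)",
    "01/20 09:56:17 ERROR: job '703[22]' cannot be started: (rc: 15004 "
    "errmsg: 'Invalid request MSG=Cannot run job. Array slot limit is 5 "
    "and there are already 5 jobs running' hostlist: 'hpc-0-5')",
]

print(slot_limit_rejections(sample))  # [('703[22]', 15004, 5, 5)]
```

Each tuple corresponds to one start attempt that Torque refused and that Maui answered with a 900-second defer hold.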
When an array job is submitted with the slot limit in the submit
command, the subjobs beyond the slot limit are placed on hold and
released as slots become available:
#PBS -t 1-100%5
706[1].hpcdev-005 hotel-1 hocks 0 Q hotel
706[2].hpcdev-005 hotel-2 hocks 0 Q hotel
706[3].hpcdev-005 hotel-3 hocks 0 Q hotel
706[4].hpcdev-005 hotel-4 hocks 0 Q hotel
706[5].hpcdev-005 hotel-5 hocks 0 Q hotel
706[6].hpcdev-005 hotel-6 hocks 0 H hotel
706[7].hpcdev-005 hotel-7 hocks 0 H hotel
706[8].hpcdev-005 hotel-8 hocks 0 H hotel
706[9].hpcdev-005 hotel-9 hocks 0 H hotel
706[10].hpcdev-005 hotel-10 hocks 0 H hotel
706[11].hpcdev-005 hotel-11 hocks 0 H hotel
706[12].hpcdev-005 hotel-12 hocks 0 H hotel
706[13].hpcdev-005 hotel-13 hocks 0 H hotel
Is there any way to get Torque to treat the slots of a qaltered array job
the same as when the job is submitted with a slot limit, to avoid the
"Invalid request" error? The failed starts cause Maui to first work through
the list of jobs it cannot run instead of scheduling other jobs that
could run.
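To make the expected behaviour concrete, here is a small sketch that works out which subjobs would sit in state H if the qaltered array were treated like one submitted with the limit. It assumes the six-column listing format shown above (job id, name, user, time, state, queue). The function name `subjobs_beyond_limit` is hypothetical. It only identifies the subjobs and does not touch Torque; whether holding them by hand would also reproduce the automatic release is not something I have verified.

```python
def subjobs_beyond_limit(listing, slot_limit):
    """Return queued subjob ids that exceed the array slot limit.

    Assumes whitespace-separated columns as in the listings above:
    job id, name, user, time used, state, queue.
    """
    running = 0
    queued = []
    for line in listing:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        job_id, state = fields[0], fields[4]
        if state == "R":
            running += 1
        elif state == "Q":
            queued.append(job_id)
    # Slots still free under the limit may be used by the first queued subjobs.
    free = max(slot_limit - running, 0)
    return queued[free:]


# Shortened copy of the listing for job 703 above.
listing = [
    "703[17].hpcdev-005 hotel-17 hocks 00:04:09 R hotel",
    "703[18].hpcdev-005 hotel-18 hocks 00:04:09 R hotel",
    "703[19].hpcdev-005 hotel-19 hocks 00:04:09 R hotel",
    "703[20].hpcdev-005 hotel-20 hocks 00:04:09 R hotel",
    "703[21].hpcdev-005 hotel-21 hocks 00:04:09 R hotel",
    "703[22].hpcdev-005 hotel-22 hocks 0 Q hotel",
    "703[23].hpcdev-005 hotel-23 hocks 0 Q hotel",
]

for job_id in subjobs_beyond_limit(listing, slot_limit=5):
    print(job_id)
```

With five subjobs already running and a limit of 5, every queued subjob is returned, which matches the H state seen for 706[6] onward in the submit-time case.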
Thanks
Eva