Discussion:
[torqueusers] Possible incompatibility between Torque 6.0.2 and Maui 3.3.1
Guangping Zhang
2016-08-20 09:59:22 UTC
Dear all,

I have found that Torque 6.0.2 intermittently fails to work properly with Maui 3.3.1.

The Maui log file shows the following:

08/20 17:14:12 INFO: PBS node node04 set to state Idle (free)
08/20 17:14:12 INFO: node 'node04' changed states from Running to Idle
08/20 17:14:12 MPBSNodeUpdate(node04,node04,Idle,NODE00)
08/20 17:14:12 INFO: node node04 has joblist '0-9/248.node00'
08/20 17:14:12 ALERT: cannot locate PBS job '0-9' (running on node node04)

Here, 0-9 is not a job ID but the range of processors allocated to job 248.node00. Will
this prevent Torque from working correctly with Maui?
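
To make the format concrete, here is a minimal Python sketch of how a joblist entry such as '0-9/248.node00' could be split into slot indices and a job ID. It only illustrates the syntax described above; it is not Maui's actual parsing code, and the function name parse_joblist_entry is hypothetical.

```python
def parse_joblist_entry(entry):
    """Split a joblist entry like '0-9/248.node00' into (slots, job_id).

    Assumption: the part before the first '/' is a comma-separated list of
    slot indices or inclusive ranges, and the part after it is the job ID.
    """
    slots_part, sep, job_id = entry.partition("/")
    if not sep:
        raise ValueError("expected '<slots>/<jobid>', got %r" % entry)

    slots = []
    for piece in slots_part.split(","):
        if "-" in piece:
            # An inclusive range such as '0-9'
            low, high = piece.split("-", 1)
            slots.extend(range(int(low), int(high) + 1))
        else:
            # A single slot index such as '6'
            slots.append(int(piece))
    return slots, job_id


slots, job_id = parse_joblist_entry("0-9/248.node00")
print(job_id, len(slots))  # 248.node00 10
```

A parser that expects one slot index per entry would instead treat '0-9' as a name, which matches the ALERT line in the log.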

Thanks for your discussion.

/Guangping
David Beer
2016-08-22 16:21:56 UTC
This incompatibility exists for all versions of Torque > 5. It has been
fixed in the Maui source, but no official release has been made. You can
grab the new source from svn:

svn co svn://opensvn.adaptivecomputing.com/maui

After that you can build it as you would a normal tarball.
_______________________________________________
torqueusers mailing list
http://www.supercluster.org/mailman/listinfo/torqueusers
--
David Beer | Torque Architect
Adaptive Computing
Dr. Henrik Schulz
2017-03-21 13:41:56 UTC
Dear David,

I'm sorry to bother you again with this issue, but the problem still exists.

Please have a look at this example:

- I submitted a job like this:

qsub -q fwd -l nodes=1:ppn=4 -I -l walltime=12:00:00

- maui.log tells me that the job cannot be started:

03/21 14:30:11 MRMJobStart(784238,Msg,SC)
03/21 14:30:11 MPBSJobStart(784238,base,Msg,SC)
03/21 14:30:11 ERROR: job '784238' cannot be started: (rc: 15046 errmsg: 'Resource temporarily unavailable MSG=job allocation request exceeds currently available cluster nodes, 1 requested, 0 available' hostlist: 'fluid001:ppn=4')
03/21 14:30:11 ERROR: cannot start job '784238' in partition DEFAULT
03/21 14:30:11 MJobPReserve(784238,DEFAULT,ResCount,ResCountRej)
03/21 14:30:30 job '784238' State: Idle EState: Idle QueueTime: Tue Mar 21 14:29:50

- checknode shows that this node has 16 CPU cores and reports 9 of them as in use:

checking node fluid001

State: Running (in current state for 00:00:00)
Expected State: Idle SyncDeadline: Sat Oct 24 14:26:40
Configured Resources: PROCS: 16 MEM: 62G SWAP: 62G DISK: 1M
Utilized Resources: SWAP: 10G
Dedicated Resources: PROCS: 9
Opsys: ubuntu Arch: x64
Speed: 1.00 Load: 15.030
Network: [DEFAULT]
Features: [NONE]
Attributes: [Batch]
Classes: [default 16:16][fwd 7:16][fwi 16:16][short 16:16][long 16:16][benchmark 16:16][fwo 16:16]

Total Time: INFINITY Up: INFINITY (98.92%) Active: INFINITY (93.87%)

Reservations:
Job '772551'(x1) -6:05:29:39 -> 2:02:30:21 (8:08:00:00)
Job '772553'(x1) -6:05:29:39 -> 2:02:30:21 (8:08:00:00)
Job '772555'(x1) -6:05:29:39 -> 2:02:30:21 (8:08:00:00)
Job '772557'(x1) -6:05:29:39 -> 2:02:30:21 (8:08:00:00)
Job '779684'(x1) -2:20:22:38 -> 5:11:37:22 (8:08:00:00)
Job '779685'(x1) -2:20:22:38 -> 5:11:37:22 (8:08:00:00)
Job '781758'(x1) -1:19:54:49 -> 6:12:05:11 (8:08:00:00)
Job '783132'(x1) -1:00:19:39 -> 7:07:40:21 (8:08:00:00)
Job '783909'(x1) -6:19:42 -> 8:01:40:18 (8:08:00:00)
User 'fluid.0.0'(x1) -00:03:52 -> INFINITY ( INFINITY)
Blocked ***@00:00:00 Procs: 7/16 (43.75%)
Blocked ***@2:02:30:21 Procs: 11/16 (68.75%)
Blocked ***@5:11:37:22 Procs: 13/16 (81.25%)
Blocked ***@6:12:05:11 Procs: 14/16 (87.50%)
Blocked ***@7:07:40:21 Procs: 15/16 (93.75%)
Blocked ***@8:01:40:18 Procs: 16/16 (100.00%)
JobList: 772551,772553,772555,772557,779684,779685,781758,783132,783909

- with qstat I can see that there is only one free slot on the node and 15 are used by the jobs:

qstat -ae -n | grep fluid001
fluid001/0
fluid001/9
fluid001/11
fluid001/13
fluid001/5,7
fluid001/14-15
fluid001/1
fluid001/2-4,6
fluid001/8,10

- The node has 9 running jobs, but Maui still misunderstands the syntax of the allocations.
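
The mismatch can be checked with a short Python sketch that counts the slots in the qstat allocations listed above. The helper count_slots is hypothetical and only mirrors the 'host/slots' syntax shown in the output; it is not taken from Torque or Maui.

```python
def count_slots(allocation):
    """Return (host, number of slots) for an entry like 'fluid001/2-4,6'.

    Assumption: the slot part is a comma-separated list of single indices
    and inclusive ranges, as in the qstat output above.
    """
    host, _, slot_spec = allocation.partition("/")
    count = 0
    for piece in slot_spec.split(","):
        if "-" in piece:
            low, high = piece.split("-", 1)
            count += int(high) - int(low) + 1
        else:
            count += 1
    return host, count


# Allocations copied from the qstat output above, one per running job
allocations = [
    "fluid001/0", "fluid001/9", "fluid001/11", "fluid001/13",
    "fluid001/5,7", "fluid001/14-15", "fluid001/1",
    "fluid001/2-4,6", "fluid001/8,10",
]

used = sum(count_slots(a)[1] for a in allocations)
print(len(allocations), used)  # 9 15
```

Nine jobs occupy 15 slots, so the 'Dedicated Resources: PROCS: 9' value would be consistent with Maui counting one processor per job rather than expanding the ranges.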

Do I have to switch to a newer version of Torque? Currently I am using version 5.1.1.

Thanks in advance,
Henrik