Discussion:
[torqueusers] Cannot send jobs to specific nodes
Josep Guerrero
2011-04-22 07:04:14 UTC
Hello,

I'm using torque on a small 8-node cluster. Up to now I was using version
2.4.3. Some users need to run their jobs on specific nodes, so they used to
submit their jobs like this:

qsub -q l0 -l nodes=hidra2 tmp/prova.sh

(where l0 is the name of the queue, and the nodes are named hidra0,
hidra1, ..., hidra7), and it worked. Recently I upgraded to torque 2.5.5 and now
this doesn't work anymore. If I try to run on a specific node, I always get the
same error:

qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max nodes
requirement

but I can run jobs normally if I don't specify the node, even if the
destination node ends up being the same.

I discovered I get the same error if I write some random gibberish after
"nodes="

hidra0:~> qsub -q l0 -l nodes=adfafda tmp/prova.sh
qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max nodes
requirement

and that, after going back to 2.4.3, the "nodes=" option worked again. I ran
qsub under strace, but that didn't help much (the only output that may be
related is that qsub opened the /etc/hosts file at several points; the nodes
are listed there, too).

I've searched the torque manual and the list, but I've found no reference to
this problem. I did find this warning, though I'm not sure whether it has any
relationship to the error:

========
Versions of TORQUE earlier than 2.4.5 attempted to apply queue and server
defaults to a job that didn't have defaults specified. If a setting still did
not have a value after that, TORQUE applied the queue and server maximum
values to a job (meaning, the maximum values for an applicable setting were
applied to jobs that had no specified or default value).
In TORQUE 2.4.5 and later, the queue and server maximum values are no longer
used as a value for missing settings.
========
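
(If that change is what matters here, maybe setting an explicit default for
the node count, so that no setting is left without a value, would help; this is
just a guess on my part, sketched as:

qmgr -c "set queue l0 resources_default.nodect = 1"

but I have no idea whether that is actually related.)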

Does anyone know what the problem may be, and whether there is any way to
solve it?

Thanks!

Josep Guerrero
Coyle, James J [ITACD]
2011-04-22 18:13:27 UTC
Josep,

I'm not sure why what you describe ever worked.

Edit your /var/spool/torque/server_priv/nodes
file to contain

hidra0 np=1 hidra0
hidra1 np=1 hidra1
hidra2 np=1 hidra2
hidra3 np=1 hidra3
hidra4 np=1 hidra4
hidra5 np=1 hidra5
hidra6 np=1 hidra6
hidra7 np=1 hidra7


(or np=4 if you have 4 processors per node)

then you should be able to issue:

qsub -l nodes=1:hidra2 tmp/prova.sh
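
If you edit the nodes file, remember that pbs_server only reads it at startup;
I believe the property can also be added on the fly with qmgr, and pbsnodes
will show whether it took effect. Roughly:

qmgr -c "set node hidra2 properties += hidra2"
pbsnodes hidra2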

I have a node with commercial UPC software licensed for it, so only that
node can run the software.

When I issue:

qsub -I -l nodes=:upc

I get:

qsub: Illegal attribute or resource value for Resource_List.nodes

but with:

qsub -I -l nodes=1:upc

I get:

qsub: waiting for job 222116.hpc4 to start
qsub: job 222116.hpc4 ready


James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://www.public.iastate.edu/~jjc
Coyle, James J [ITACD]
2011-04-22 18:22:55 UTC
Josep,

Sorry, it looks like what you describe does
work on my system, running 2.5.4.

I had used nodes=:upc, which did not work, but
nodes=upc or nodes=node122 did work for me, and
node122 was not listed as a property
for node122, as I had suggested was needed.


James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://www.public.iastate.edu/~jjc
Josep Guerrero
2011-04-22 19:24:50 UTC
Hello James,
Post by Coyle, James J [ITACD]
Sorry, it looks like what you describe does
work on my system, running 2.5.4
That's curious. I tried the same with 2.5.3 a few months ago, and it didn't
work either (that time I had to downgrade to 2.4.3). I think I must have
something misconfigured. At least it works using the hostname as a property, so
I don't need to downgrade this time.

Thanks again for your help,

Josep Guerrero
Josep Guerrero
2011-04-22 19:03:03 UTC
Hello James,

Thanks for your message!
Post by Coyle, James J [ITACD]
I'm not sure why what you describe ever worked.
As I understood the online manual, it looked like it should work (maybe I
misunderstood, or the manual is outdated, or it's a typo?):

http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml

nodes {<node_count> | <hostname>} [:ppn=<ppn>][:gpus=<gpu>]
[:<property>[:<property>]...] [+ ...]

well, that is, if the pipe before <hostname> is to be interpreted as an
"or"...
Post by Coyle, James J [ITACD]
hidra0 np=1 hidra0
hidra1 np=1 hidra1
hidra2 np=1 hidra2
hidra3 np=1 hidra3
hidra4 np=1 hidra4
hidra5 np=1 hidra5
hidra6 np=1 hidra6
hidra7 np=1 hidra7
(or np=4 if you have 4 processors per node)
qsub -l nodes=1:hidra2 tmp/prova.sh
Thanks! It works! By the way, I didn't have the third column in "nodes", and
I've just found that your options work even without it.

If I remember right (I'm new to torque and may have it wrong; please correct
me if I say something silly), the third column in the "nodes" file defines the
properties (or groups) that each node has, and the ":" in the -l option to qsub
means that the next argument is to be read as a required property for the
selected nodes. Since I didn't write it explicitly in "nodes", I guess every
node is given its own hostname as a property by default, so the <hostname>
syntax I cited above is no longer needed or even useful...

Anyway, I'm going to add the hostnames explicitly to the properties column,
just in case the defaults change.
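
For example, a request combining the hostname property with a per-node
processor count would look something like this (the ppn value is only
illustrative):

qsub -q l0 -l nodes=1:hidra2:ppn=4 tmp/prova.sh
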
Post by Coyle, James J [ITACD]
qsub -I -l nodes=:upc
Just to point out that the equivalent of what I was unsuccessfully trying would be:

qsub -I -l nodes=upc

(without the ":")

Thanks again for your help! One user needs to be able to select his nodes
manually, and I feared I would have to downgrade torque to 2.4.3.

Josep Guerrero
Lloyd Brown
2011-04-22 19:32:23 UTC
I don't know if this will help much, but I was able to successfully
request nodes=nodename with Torque 2.5.4 and Moab 6.0.2, without
resorting to the node-feature list approach described in another part of
this thread. So it definitely is supposed to work, and does in some
cases. Possibly a bug? Anyone from Adaptive want to chime in here?

One difference was that I'm not using queues, but rather letting it just use
the default:
$ qsub -I -l nodes=m6-1-1:ppn=1,walltime=10:00,qos=test,pmem=100mb
qsub: waiting for job 3799067.fslsched.fsl.byu.edu to start
qsub: job 3799067.fslsched.fsl.byu.edu ready
$ cat $PBS_NODEFILE
m6-1-1
$
It also worked without the ":ppn=1". The other parameters are important
with our scheduler for various reasons, so I can't currently eliminate them.

--
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu
Josep Guerrero
2011-04-22 19:58:54 UTC
Post by Lloyd Brown
I don't know if this will help much, but I was able to successfully
request nodes=nodename with Torque 2.5.4 and Moab 6.0.2, without
resorting to the node-feature list approach described in another part of
this thread.
Could it be that the problem is related to name resolution for the client
nodes? The fact that using a random name produces the same error as using an
existing node name is curious. The secondary nodes (and the interface of the
server node that connects to them) are on a private network, and though all of
them are listed in each other's /etc/hosts files, none of the secondary nodes
are listed in the DNS server (since the only way to connect to them is through
the server node, and that one has them in its /etc/hosts file, I didn't see the
point).

Maybe, when resolving the hostname, qsub only trusts DNS (strace showed that it
reads /etc/hosts, but perhaps it doesn't use the data?), so it finds no nodes.
But then I don't understand why the error message doesn't complain about not
satisfying the minimum number of nodes (1, I suppose) instead of the maximum
(8).
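
A quick way to check what the resolver actually returns on the submit host
(just a sanity-check idea; getent follows the same lookup order that
/etc/nsswitch.conf gives to libc) would be something like:

getent hosts hidra2
grep '^hosts' /etc/nsswitch.conf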

Josep Guerrero
Ken Nielson
2011-04-22 19:37:03 UTC
Josep,

What does your queue configuration look like?
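
The output of something like the following (assuming the queue is still named
l0) would show it all:

qmgr -c 'print queue l0'
qmgr -c 'print server'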

Ken
Josep Guerrero
2011-04-22 20:05:33 UTC
Post by Ken Nielson
What does your queue configuration look like?
I used these commands to create it:

create queue l0
set queue l0 queue_type = Execution
set queue l0 Priority = 20
set queue l0 max_user_queuable = 3000
set queue l0 max_running = 192
set queue l0 resources_max.nodect = 192
set queue l0 resources_max.nodes = 8
set queue l0 resources_default.nodes = 1
set queue l0 resources_default.walltime = 100000:00:00
set queue l0 max_user_run = 150
set queue l0 enabled = True
set queue l0 started = True

There are 8 physical nodes, each one with 4 processors, and each processor
with 6 cores (so there are 192 cores). There are only two queues defined, this
l0 and the default batch queue, which is never used.
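
In case it is relevant: with 24 cores per node, the server_priv/nodes entries
here would presumably look like this (writing them from memory, so roughly):

hidra0 np=24 hidra0
...
hidra7 np=24 hidra7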

Josep Guerrero
