Discussion:
[torqueusers] Job remains in queue forever
Praveen C
2010-05-04 09:28:17 UTC
Hello

Please excuse me if you received multiple copies of this email. I was having
trouble with my mailing list management.

I am unable to run jobs through pbs. The job gets stuck in queue forever. I
have given some info below. Please tell me if I can give any more info to
debug this problem. Hope somebody can point me in the right direction.

Thanks
praveen

[root@master log]# qmgr -c 'print server'
#
# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Execution
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = master.tifrbng.res.in
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 16



Here is the reason for queuing:

[praveen@master alpha_sweep]$ checkjob 15


checking job 15

State: Idle EState: Deferred
Creds: user:praveen group:[DEFAULT] class:default qos:DEFAULT
WallTime: 00:00:00 of 00:30:00
SubmitTime: Tue May 4 09:45:39
(Time Queued Total: 00:13:13 Eligible: 00:00:01)

StartDate: -00:13:11 Tue May 4 09:45:41
Total Tasks: 10

Req[0] TaskCount: 10 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [nash]


IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE

job is deferred. Reason: RMFailure (job cannot be started - cannot set
hostlist)
Holds: Defer (hold reason: RMFailure)
PE: 10.00 StartPriority: 1
cannot select job 15 for partition DEFAULT (job hold active)
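
The line that matters in a checkjob report like this one is the deferral reason. Below is a minimal sketch, assuming the report has been saved as text (for example, copied out of a terminal or a mail), of pulling that reason out with Python. The function name deferral_reason is made up for this illustration and is not part of Maui.

```python
import re


def deferral_reason(checkjob_output):
    """Return (reason, detail) from a checkjob report, or None if no deferral is reported."""
    # Mail clients wrap long lines, so collapse all whitespace before matching.
    flat = " ".join(checkjob_output.split())
    match = re.search(r"job is deferred\.\s+Reason:\s+(\w+)\s+\(([^)]*)\)", flat)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Excerpt of the report above, including the wrapped line.
sample = """
State: Idle EState: Deferred
job is deferred. Reason: RMFailure (job cannot be started - cannot set
hostlist)
Holds: Defer (hold reason: RMFailure)
"""

print(deferral_reason(sample))
```

For the report above this yields the reason RMFailure together with the detail text in parentheses, which is the part worth searching the server log for.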
Ken Nielson
2010-05-04 14:42:42 UTC
What are you using for your scheduler?

Ken Nielson
Adaptive Computing
Praveen C
2010-05-04 14:49:12 UTC
Maui.

Here is my maui.cfg

I had a problem with the hostname being in lower case in some places and in
upper case in others, as "master" and "Master". I searched all the files I
could find and changed it to all lower case. But the jobs are still getting
queued for no reason :-(

Thanks
praveen

RMPOLLINTERVAL 00:00:30

SERVERHOST master.tifrbng.res.in
SERVERPORT 42559
SERVERMODE NORMAL

RMCFG[base] TYPE=PBS

ADMIN1 maui root

LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

QUEUETIMEWEIGHT 1

BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST

NODEALLOCATIONPOLICY MINRESOURCE


Praveen C
2010-05-04 15:26:03 UTC
Here is something I see in the /opt/torque/server_logs

05/04/2010 09:45:39;0008;PBS_Server;Job;15.master.tifrbng.res.in;Job Queued at request of praveen@master.tifrbng.res.in, owner = praveen@master.tifrbng.res.in, job name = rae2822, queue = default
05/04/2010 09:45:39;0040;PBS_Server;Svr;master.tifrbng.res.in;Scheduler sent command new
05/04/2010 09:45:40;0020;PBS_Server;Job;15.master.tifrbng.res.in;Unauthorized Request, request type: 11, Object: Job, Name: 15.master.tifrbng.res.in, request from: maui@master.tifrbng.res.in
05/04/2010 09:45:40;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request MSG=operation not permitted), aux=0, type=ModifyJob, from maui@master.tifrbng.res.in

Does this shed any light !!!

praveen
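
Entries like these can be picked out of a long server log mechanically. The sketch below assumes each entry is one line of six semicolon-separated fields, as in the excerpt above. The field names (timestamp, event, daemon, obj_type, obj_name, message) are labels chosen for this illustration, not official TORQUE terminology.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: str
    event: str
    daemon: str
    obj_type: str
    obj_name: str
    message: str


def parse_line(line):
    """Split one log line into six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogEntry(*parts)


def rejected_requests(lines):
    """Return the entries whose message mentions an unauthorized request."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and "Unauthorized Request" in entry.message:
            hits.append(entry)
    return hits


sample_log = [
    "05/04/2010 09:45:39;0040;PBS_Server;Svr;master.tifrbng.res.in;Scheduler sent command new",
    "05/04/2010 09:45:40;0020;PBS_Server;Job;15.master.tifrbng.res.in;Unauthorized Request, "
    "request type: 11, Object: Job, Name: 15.master.tifrbng.res.in, "
    "request from: maui@master.tifrbng.res.in",
    "05/04/2010 09:45:40;0080;PBS_Server;Req;req_reject;Reject reply "
    "code=15007(Unauthorized Request MSG=operation not permitted), aux=0, "
    "type=ModifyJob, from maui@master.tifrbng.res.in",
]

for entry in rejected_requests(sample_log):
    print(entry.timestamp, entry.obj_name, entry.message)
```

The requesting user at the end of each rejected entry (maui here) is the account whose privileges need checking.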
Glen Beane
2010-05-04 16:09:01 UTC
It looks like the user maui is running as is not a TORQUE operator. Try
running qmgr -c "s s operators+=maui@master.tifrbng.res.in" as a TORQUE
manager.
Praveen C
2010-05-04 16:30:14 UTC
Post by Glen Beane
It looks like the user maui is running as is not a TORQUE operator. try
running qmgr -c "s s operators+=maui@master.tifrbng.res.in" as a torque
manager
I did this, but the job is still in the queue. In the log I see

05/04/2010 21:53:20;0100;PBS_Server;Job;16.master.tifrbng.res.in;enqueuing into default, state 1 hop 1
05/04/2010 21:53:20;0008;PBS_Server;Job;16.master.tifrbng.res.in;Job Queued at request of praveen@master.tifrbng.res.in, owner = praveen@master.tifrbng.res.in, job name = rae2822, queue = default

The "Unauthorized request" error I got before is no longer there, but the job
still does not run.

praveen
Glen Beane
2010-05-04 16:52:56 UTC
This sequence of events looks normal. Now you'll have to find out why Maui
is not running your job; this snippet of log is not enough information.
What does the checkjob command say for the job?
Praveen C
2010-05-04 16:57:57 UTC
Here is some output

[praveen@master rae2822-C243x43]$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
19.master                 rae2822          praveen                0 Q default


$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

[root@master server_logs]# showq
ACTIVE JOBS--------------------
JOBNAME   USERNAME   STATE      PROC   REMAINING   STARTTIME


0 Active Jobs   0 of 100 Processors Active (0.00%)
                0 of 15 Nodes Active (0.00%)

IDLE JOBS----------------------
JOBNAME   USERNAME   STATE      PROC   WCLIMIT     QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME   USERNAME   STATE      PROC   WCLIMIT     QUEUETIME

19        praveen    Deferred   1      00:30:00    Tue May 4 22:24:51

Total Jobs: 1 Active Jobs: 0 Idle Jobs: 0 Blocked Jobs: 1



$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

[praveen@master rae2822-C243x43]$ checkjob 19


checking job 19

State: Idle EState: Deferred
Creds: user:praveen group:[DEFAULT] class:default qos:DEFAULT
WallTime: 00:00:00 of 00:30:00
SubmitTime: Tue May 4 22:24:51
(Time Queued Total: 00:01:34 Eligible: 00:00:01)

StartDate: -00:01:32 Tue May 4 22:24:53
Total Tasks: 1

Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [nash]


IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE

job is deferred. Reason: RMFailure (job cannot be started - cannot set
hostlist)
Holds: Defer (hold reason: RMFailure)
PE: 1.00 StartPriority: 1
cannot select job 19 for partition DEFAULT (job hold active)

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

[praveen@master rae2822-C243x43]$ qstat -f
Job Id: 19.master.tifrbng.res.in
Job_Name = rae2822
Job_Owner = praveen@master.tifrbng.res.in
job_state = Q
queue = default
server = master.tifrbng.res.in
Checkpoint = u
ctime = Tue May 4 22:24:51 2010
Error_Path = master.tifrbng.res.in:/home/praveen/test/tmp/rae2822-C243x43/rae2822.e19
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = e
mtime = Tue May 4 22:24:51 2010
Output_Path = master.tifrbng.res.in:/home/praveen/test/tmp/rae2822-C243x43/nuwtun.log
Priority = 0
qtime = Tue May 4 22:24:51 2010
Rerunable = True
Resource_List.nodect = 1
Resource_List.nodes = 1:nash:ppn=1
Resource_List.walltime = 00:30:00
Variable_List = PBS_O_HOME=/home/praveen,PBS_O_LANG=en_US.iso885915,
PBS_O_LOGNAME=praveen,
PBS_O_PATH=/opt/pgi/linux86-64/2010/mpi/openmpi/bin:/usr/kerberos/bin
:/usr/java/latest/bin:/usr/local/bin:/bin:/usr/bin:/opt/ganglia/bin:/o
pt/ganglia/sbin:/opt/openmpi/bin/:/opt/maui/bin:/opt/torque/bin:/opt/t
orque/sbin:/opt/rocks/bin:/opt/rocks/sbin:/opt/sun-ct/bin:/opt/pgi/lin
ux86-64/2010/bin:/home/praveen/usr/local/bin:/home/praveen/usr/local/d
akota/bin:/home/praveen/usr/local/paraview-3.4.0-Linux-x86_64/bin:/opt
/cmake-2.8.1/bin:/home/praveen/src/nuwtun/src-flo:/home/praveen/src/nu
wtun/src-adj:/home/praveen/src/nuwtun/src-grd:/home/praveen/src/nuwtun
/src-opt:/home/praveen/src/nuwtun/src-utl:/home/praveen/src/famosa/bui
ld/bin:/home/praveen/src/Num3sis/build/bin:.,
PBS_O_MAIL=/var/spool/mail/praveen,PBS_O_SHELL=/bin/bash,
PBS_SERVER=master.tifrbng.res.in,PBS_O_HOST=master.tifrbng.res.in,
PBS_O_WORKDIR=/home/praveen/test/tmp/rae2822-C243x43,
PBS_O_QUEUE=default
etime = Tue May 4 22:24:51 2010
submit_args = nuwtun.pbs
Glen Beane
2010-05-04 17:01:26 UTC
Post by Praveen C
job is deferred. Reason: RMFailure (job cannot be started - cannot set
hostlist)
Oh, maybe setting the hostlist requires manager privileges; I don't
remember.

Try qmgr -c "s s managers+=maui@your_hostname"


Also, what does your maui.cfg file look like?
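
The two privilege changes suggested in this thread can be written out in long form. This is only a sketch that builds and prints the commands; it does not run them. It assumes the scheduler runs as user maui on master.tifrbng.res.in, and the helper name qmgr_acl_commands is made up for this illustration. "set server" is the long form of "s s", as seen in the "print server" output earlier in the thread.

```python
import shlex


def qmgr_acl_commands(user, host):
    """Build the qmgr invocations that add user@host as server operator and manager."""
    principal = f"{user}@{host}"
    return [
        ["qmgr", "-c", f"set server operators += {principal}"],
        ["qmgr", "-c", f"set server managers += {principal}"],
    ]


# Print the commands so they can be reviewed before being run by hand.
for argv in qmgr_acl_commands("maui", "master.tifrbng.res.in"):
    print(shlex.join(argv))
```

As noted above, the commands themselves have to be run by someone who is already a TORQUE manager.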
Ken Nielson
2010-05-04 17:06:26 UTC
Glen Beane pointed out that you need to add maui as a TORQUE operator and
manager. That aside, the job is now in a deferred state in Maui and will
not be run again until the deferred interval has expired. You can change the
state of the job and make it available to be run by using the command
"releasehold <jobid>"

Ken Nielson
Praveen C
2010-05-04 17:34:49 UTC
Thanks to Ken and Glen, I am making some progress.

I am able to run a job as long as I ask for only one node

nodes=1:nash:ppn=4

is ok. But if I request more than one node, the job goes in state "E"
according to qstat and then aborts.

The email says

PBS Job Id: 34.master.tifrbng.res.in
Job Name: rae2822
Exec host: c-04/1+c-04/0+c-03/1+c-03/0
Aborted by PBS Server
Job cannot be executed
See Administrator for help

My maui.cfg is

RMPOLLINTERVAL 00:00:30

SERVERHOST master.tifrbng.res.in
SERVERPORT 42559
SERVERMODE NORMAL

RMCFG[base] TYPE=PBS

ADMIN1 maui root

LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

QUEUETIMEWEIGHT 1

BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST

NODEALLOCATIONPOLICY MINRESOURCE
Christopher Samuel
2010-05-04 22:09:39 UTC
Post by Praveen C
But if I request more than one node, the job goes in state
"E" according to qstat and then aborts.
Is there anything in the logs for Torque on the compute
nodes for the job ? Or the server log ?

cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computational Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/
Praveen C
2010-05-05 06:57:30 UTC
Thanks for that hint, Christopher. Looking at the logs I found that the
/etc/hosts file on the compute nodes was not complete. After I copied the
/etc/hosts from master to all the compute nodes, it's working correctly.

Thanks to everyone who helped me on this. Hope this is finally resolved now.

praveen
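
An incomplete /etc/hosts on a compute node can be spotted before any job is submitted. The sketch below assumes the contents of the two files have already been read into strings; it lists the names known on the master but missing on a node. The node names c-03 and c-04 come from the exec host line earlier in the thread, while the IP addresses and the function names are invented for this illustration.

```python
def parse_hosts(text):
    """Map every hostname and alias in /etc/hosts-style text to its address."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Hostnames are case-insensitive, so compare them in lower case.
            table.setdefault(name.lower(), address)
    return table


def missing_on_node(master_text, node_text):
    """Return the sorted names present in the master's file but absent from the node's."""
    master = parse_hosts(master_text)
    node = parse_hosts(node_text)
    return sorted(name for name in master if name not in node)


# Hypothetical file contents; the addresses are placeholders.
master_hosts = """
127.0.0.1   localhost
10.1.1.1    master.tifrbng.res.in master
10.1.1.3    c-03
10.1.1.4    c-04
"""
node_hosts = """
127.0.0.1   localhost
10.1.1.1    master.tifrbng.res.in master
10.1.1.3    c-03
"""

print(missing_on_node(master_hosts, node_hosts))
```

A node that cannot resolve its peers would show up here with a non-empty list, which matches the symptom of single-node jobs running while multi-node jobs abort.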
Simon Kiertscher
2010-05-05 08:57:44 UTC
Hello everyone,

I am trying to figure out the difference between the "select"
and the "nodes" statements when submitting a job. In the torque
documentation only the nodes statement appears, but you can also use the
select statement. Does anyone have a guide or something where this is explained?

greetings Simon
Ken Nielson
2010-05-04 16:52:32 UTC
What is the current state of the job in both Maui and TORQUE?

Ken

----- Original Message -----
From: "Praveen C" <cpraveen at gmail.com>
To: "Torque Users Mailing List" <torqueusers at supercluster.org>
Sent: Tuesday, May 4, 2010 10:30:14 AM
Subject: Re: [torqueusers] Job remains in queue forever

On Tue, May 4, 2010 at 9:39 PM, Glen Beane < glen.beane at gmail.com >
wrote:




It looks like the user maui is running as is not a TORQUE operator. try
running qmgr -c "s s operators+= maui at master.tifrbng.res.id " as a
torque manager

I did this. But still job is in queue. In log I see



05/04/2010 21:53:20;0100;PBS_Server;Job;16.master.tifrbng.res.in;enqueuing into default, state 1 hop 1
05/04/2010 21:53:20;0008;PBS_Server;Job;16.master.tifrbng.res.in;Job Queued at request of praveen at master.tifrbng.res.in, owner = praveen at master.tifrbng.res.in, job name = rae2822, queue = default


The "Unauthorized request" error I got before is gone, but the job still
does not run.
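
The server log lines quoted above are semicolon-separated, so they are easy to scan with a script when looking for a specific error. Below is a rough Python sketch that splits one line into six fields and checks whether any message mentions "Unauthorized request". The field names (timestamp, event_code, and so on) are my own labels for illustration, not official TORQUE terminology, and the two sample lines are the ones from this message with the wrapped lines rejoined.

```python
from typing import Iterable, NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: str
    event_code: str
    daemon: str
    object_type: str
    object_id: str
    message: str


def parse_log_line(line: str) -> Optional[LogRecord]:
    """Split a log line into six fields, or return None if it does not fit."""
    # Split on the first five semicolons only, so that any semicolon
    # inside the message text is preserved.
    parts = line.split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*(p.strip() for p in parts))


def mentions_unauthorized(lines: Iterable[str]) -> bool:
    """Return True if any parsed message contains 'unauthorized request'."""
    records = (parse_log_line(line) for line in lines)
    return any(
        r is not None and "unauthorized request" in r.message.lower()
        for r in records
    )


SAMPLE = [
    "05/04/2010 21:53:20;0100;PBS_Server;Job; 16.master.tifrbng.res.in ;"
    "enqueuing into default, state 1 hop 1",
    "05/04/2010 21:53:20;0008;PBS_Server;Job; 16.master.tifrbng.res.in ;"
    "Job Queued at request of praveen at master.tifrbng.res.in , owner = "
    "praveen at master.tifrbng.res.in , job name = rae2822, queue = default",
]

for record in map(parse_log_line, SAMPLE):
    print(record.object_id, "|", record.message)
print(mentions_unauthorized(SAMPLE))
```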


praveen
_______________________________________________ torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Simon Kiertscher
2010-05-05 10:29:28 UTC
Permalink
Hello everyone,

I am trying to figure out what is the difference between the "select"
and the "nodes" statement while submitting a job. In the torque
documentation only the nodes statement appears but you can also use the
select statement. Does anyone have a guide or something where this is explained?

greetings Simon
Coyle, James J [ITACD]
2010-05-05 14:44:34 UTC
Permalink
Simon,

I found this reference to select vs. nodes in a Yale Univ. ITS Wiki:

http://hpc.research.yale.edu/wiki/index.php/Requesting_resources_using_select_and_place
From there it looks like select is a new PBSPro option. My best guess is
that it would be the same as nodes= when both apply, but you could also
use select=10
to get 10 cpus anywhere, which is what nprocs= (or ncpus=) does.

Or use
select=10:ib

to get 10 nodes with the ib property.


For other examples, see also:

http://www.mathworks.cn/access/helpdesk/help/toolbox/distcomp/resourcetemplate.html

and the Univ. of Nebraska at Lincoln's:

http://hcc.unl.edu/firefly/FFfaq.php

My torque 2.3.6 does reject the select= syntax, but
does not honor it.

The Yale Wiki says not to use it with Torque.

http://hpc.research.yale.edu/wiki/index.php/Torque_Userguide

Perhaps this is just product differentiation in the user interface,
so that some modification is needed between PBSPro scripts and
Torque scripts.

If you are getting jobs from a PBSPro site, you could probably put
something in a submit filter to check for select=
and warn the user, or try to translate.
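
A rough Python sketch of that idea follows. It assumes the usual submit-filter arrangement in which the job script arrives on standard input and the (possibly modified) script is written to standard output; check the documentation for your Torque version before relying on that. The translation rule is only my guess from the examples above: ncpus=K becomes ppn=K, a bare word such as ib is treated as a node property, and anything else is left alone with a warning. A bare select=N is not translated, since it appears to mean N cpus anywhere rather than N nodes. The names translate_select, filter_script, and run_filter are made up for this example, and resources given on the qsub command line are not handled.

```python
import io
import re

# Matches directives such as "#PBS -l select=10:ib".
SELECT_DIRECTIVE = re.compile(r"^(\s*#PBS\s+-l\s+)select=(\S+)\s*$")


def translate_select(spec):
    """Naively map 'N:item:...' to a nodes= request, or return None.

    Assumption (not confirmed by any documentation quoted here):
    ncpus=K maps to ppn=K and bare words are node properties.
    """
    count, *items = spec.split(":")
    if not count.isdigit() or not items:
        return None  # bare counts and odd forms are left to the user
    out = [count]
    for item in items:
        if item.startswith("ncpus=") and item[len("ncpus="):].isdigit():
            out.append("ppn=" + item[len("ncpus="):])
        elif item and "=" not in item:
            out.append(item)
        else:
            return None  # unknown resource, multi-chunk request, etc.
    return "nodes=" + ":".join(out)


def filter_script(lines):
    """Return (filtered_lines, warnings) for the lines of a job script."""
    filtered, warnings = [], []
    for line in lines:
        match = SELECT_DIRECTIVE.match(line.rstrip("\n"))
        if match:
            translated = translate_select(match.group(2))
            if translated is None:
                warnings.append("select= may not be honored: " + line.strip())
            else:
                warnings.append(
                    "rewrote select=%s as %s" % (match.group(2), translated)
                )
                line = match.group(1) + translated + "\n"
        filtered.append(line)
    return filtered, warnings


def run_filter(stdin, stdout, stderr):
    """Read a script from stdin, write the filtered script and warnings."""
    filtered, warnings = filter_script(stdin.readlines())
    for warning in warnings:
        stderr.write("submit filter: " + warning + "\n")
    stdout.writelines(filtered)


# Demonstration on an in-memory script instead of the real streams.
script = io.StringIO("#!/bin/sh\n#PBS -l select=10:ib\necho hello\n")
out, err = io.StringIO(), io.StringIO()
run_filter(script, out, err)
print(out.getvalue(), end="")
print(err.getvalue(), end="")
```

Used as an actual filter, the last lines would be replaced by a call to run_filter(sys.stdin, sys.stdout, sys.stderr).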

- Jim C.
-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
bounces at supercluster.org] On Behalf Of Simon Kiertscher
Sent: Wednesday, May 05, 2010 5:29 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] nodes vs select statement
Hello everyone,
I am trying to figure out what is the difference between the
"select"
and the "nodes" statement while submitting a job. In the torque
documentation only the nodes statement appears but you can also use the
select statement. Have someone a guide or something where this is explained?
greetings Simon
Coyle, James J [ITACD]
2010-05-05 14:59:03 UTC
Permalink
Correction:

My Torque 2.3.6 does NOT reject select=,
but it does not honor it either.
Christopher Samuel
2010-05-05 15:17:29 UTC
Permalink
Post by Coyle, James J [ITACD]
My Torque 2.3.6 does NOT reject select=
but does not honor it either.
Support for that in qsub was added on the 19th September
2006, but I'm guessing you need support in the scheduler
for that too?

cheers,
Chris
Simon Kiertscher
2010-05-05 18:49:39 UTC
Permalink
Thanks, Jim, for all that information.

greetings Simon