Stu Whitman
2018-02-27 03:48:26 UTC
Hello,
I'm glad to see some activity on this list. Thanks for reviving it.
I have set up a CentOS 6 Torque cluster using torque-6.1.1.1, with pbs_sched as the scheduler. There are 6 nodes with 8gb of RAM and 1 node with 64gb of RAM. I am having a problem when I combine job dependencies with a memory request. I submit a job (job 1) without requesting memory. I then submit a second job (job 2) that depends (afterany) on job 1 and requests 16gb of memory. Job 2 waits for job 1, but does not start when job 1 finishes.
Does the line req_information.hostlist.0 = sge01-c6.fairfax.groupw.com:ppn=1 in the following output from "qstat -f 51" mean that the job has to run on node sge01? That is the node that ran job 1, and it only has 8gb of RAM.
Job Id: 51.sge07-c6.fairfax.groupw.com
Job_Name = LoadWarehouse
Job_Owner = ***@sge07-c6
job_state = Q
queue = batch
server = sge07-c6.fairfax.groupw.com
Checkpoint = u
ctime = Wed Nov 15 15:38:08 2017
depend = beforeany:53.sge07-c6.fairfax.groupw.com
Error_Path = sge07-c6.fairfax.groupw.com:/home/swhitman/casa/storm/head/data/WONA_HEAD/loadwarehouse.log.$PBS_ARRAYID
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Wed Nov 15 15:49:48 2017
Output_Path = sge07-c6.fairfax.groupw.com:/home/swhitman/casa/storm/head/data/WONA_HEAD/loadwarehouse.log.$PBS_ARRAYID
Priority = 0
qtime = Wed Nov 15 15:38:08 2017
Rerunable = True
Resource_List.walltime = 00:30:00
Resource_List.mem = 16gb
Resource_List.nodes = 1
Resource_List.nodect = 1
Variable_List = ...
euser = swhitman
egroup = staff
queue_type = E
comment = Not Running - PBS Error: Resource temporarily unavailable
etime = Wed Nov 15 15:44:32 2017
submit_args = -l walltime=00:30:00 -l mem=16gb -F /home/swhitman/casa/storm/head/data/WONA_HEAD -W depend=afterany:48.sge07-c6.fairfax.groupw.com -v PBS_ARRAYID=2 /home/swhitman/casa/storm/head/bin/bqs_load_rep_into_data_warehouse.sh
fault_tolerant = False
job_radix = 0
submit_host = sge07-c6.fairfax.groupw.com
init_work_dir = /home/swhitman/casa/storm/head/data/WONA_HEAD
job_arguments = /home/swhitman/casa/storm/head/data/WONA_HEAD
request_version = 1
req_information.task_count.0 = 1
req_information.lprocs.0 = 1
req_information.memory.0 = 16777216kb
req_information.thread_usage_policy.0 = allowthreads
req_information.hostlist.0 = sge01-c6.fairfax.groupw.com:ppn=1
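To make the mismatch concrete, here is a minimal Python sketch that reads the relevant attributes from the output above and compares the memory request with the RAM of the host named in the hostlist. The helper names (parse_qstat_full, kb_to_gb) and the node_ram_gb table are hypothetical and only for illustration; the 8gb figure is the node size stated above. It does not show whether the hostlist entry actually pins the job to that node, only that the request cannot fit there if it does.

```python
import re


def parse_qstat_full(text):
    """Parse 'key = value' lines of qstat -f style output into a dict.

    Lines without ' = ' (for example wrapped continuation lines) are skipped.
    """
    attrs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            attrs[key.strip()] = value.strip()
    return attrs


def kb_to_gb(size):
    """Convert a string such as '16777216kb' to gb (1 gb = 1024 * 1024 kb)."""
    match = re.fullmatch(r"(\d+)kb", size)
    if match is None:
        raise ValueError(f"unexpected size format: {size!r}")
    return int(match.group(1)) / (1024 * 1024)


# Three attributes copied from the qstat -f output for job 51.
sample = """\
Resource_List.mem = 16gb
req_information.memory.0 = 16777216kb
req_information.hostlist.0 = sge01-c6.fairfax.groupw.com:ppn=1
"""

# Assumption for illustration: RAM per node, taken from the description above.
node_ram_gb = {"sge01-c6.fairfax.groupw.com": 8}

attrs = parse_qstat_full(sample)
requested_gb = kb_to_gb(attrs["req_information.memory.0"])
host = attrs["req_information.hostlist.0"].split(":")[0]
fits = requested_gb <= node_ram_gb[host]

print(f"{host}: requested {requested_gb:g}gb, fits on node: {fits}")
```

The check confirms that 16777216kb is the same 16gb that was requested with -l mem=16gb, and that this is more than the listed host has.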
I expect job 2 to use node sge07 because it has 64gb of RAM. What is going on? Is there a configuration setting I missed?
Thanks,
-Stu