sudo
2016-12-02 01:56:05 UTC
Hello,
I have a small two-node cluster, just for testing purposes.
The Torque server (hostname=frontgw) has two NIC interfaces, and
I managed to install, configure, and set up Torque on this node to run batch jobs.
'pbsnodes' reports both nodes as "state = free" and "power_state = Running".
The Torque server accepts and runs batch jobs on the server node itself (frontgw),
but fails to run them on the compute node (nfs1).
The error message from tracejob was "send of job to nfs1 failed error = 15010":
12/01/2016 19:10:53.671 S enqueuing into batch, state 1 hop 1
12/01/2016 19:10:53.874 S Job Modified at request of ***@frontgw
12/01/2016 19:10:53.875 S Job Run at request of ***@frontgw
12/01/2016 19:10:53 A queue=batch
12/01/2016 19:10:53.880 S send of job to nfs1 failed error = 15010
12/01/2016 19:10:53.880 S unable to run job, MOM rejected/rc=-1
12/01/2016 19:10:53.881 S unable to run job, send to MOM '192.168.1.101' failed
server_logs:
12/02/2016 10:14:19.711;08;PBS_Server.16835;Job;9.frontgw;unable to run job, MOM rejected/rc=-1
12/02/2016 10:14:19.712;128;PBS_Server.16835;Req;req_reject;Reject reply code=15043(Execution server rejected request MSG=cannot send job to mom, state=TRNOUT), aux=0, type=RunJob, from ***@frontgw
12/02/2016 10:14:19.713;08;PBS_Server.16835;Job;9.frontgw;unable to run job, send to MOM '192.168.1.101' failed
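(For reference, the trace at the top came from tracejob; the invocation, assuming the job ID 9.frontgw seen in the server log, was something like the following.)
# Search the last two days of logs for job 9.frontgw
tracejob -n 2 9.frontgw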
---
Could someone suggest what I am missing (or doing wrong)?
I know Torque works fine if I configure the server with the hostname of eth0
(in this case, 'nfs0').
But in my setup the hostname is bound to eth1 (frontgw),
which means two subnets are involved in this batch environment.
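For what it's worth, here is how I would double-check which address the server resolves for nfs1, since the server log shows it sending to '192.168.1.101' (a minimal sketch; the port numbers are the mom_service_port/mom_manager_port values reported by pbsnodes below):
# On frontgw: what does 'nfs1' resolve to (via /etc/hosts or DNS)?
getent hosts nfs1
# Can frontgw reach the MOM service and manager ports on nfs1?
nc -zv nfs1 15002
nc -zv nfs1 15003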
Best Regards,
--Sudo
---
What I did:
- On the Torque server (frontgw)
(1) Run pbs_mom with aliasing (-A nfs0)
(2) Some qmgr commands (see the check sketched after this list);
set server acl_hosts += nfs0
set server acl_hosts += nfs1
set server submit_hosts += nfs0
(3) /var/spool/torque/server_name . . . 'frontgw'
(4) /var/spool/torque/mom_priv/config . . '$pbsserver nfs0'
(5) /var/spool/torque/server_priv/nodes
nfs0 np=2
nfs1 np=4
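To confirm that the qmgr settings from step (2) took effect, they can be printed back from the server; a minimal check:
# Print the server configuration and pick out the ACL/submit-host attributes
qmgr -c 'print server' | grep -E 'acl_hosts|submit_hosts'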
---
- On the compute node (nfs1)
(1) Run pbs_mom with aliasing (-A nfs1)
(2) /var/spool/torque/server_name . . . 'frontgw'
(3) /var/spool/torque/mom_priv/config . . '$pbsserver frontgw'
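With both MOMs running, the server-to-MOM path can be probed directly with momctl from frontgw; a failure or timeout here would point at the same communication problem shown in the server log:
# Query MOM diagnostics on nfs1 (verbosity level 3)
momctl -d 3 -h nfs1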
-------------------------------------------------------
torque server :
- hostname : frontgw (eth1: 192.168.21.100)
nfs0 (eth0: 192.168.1.100)
compute node :
- hostname : nfs1 (eth0: 192.168.21.101)
/etc/hosts has entries for these hosts (a sketch of the expected entries follows this block).
Authentication :
- root . . . ssh works without password
- users . . . SSH HostbasedAuthentication yes
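For reference, the /etc/hosts entries implied by the addresses above would look roughly like this (my reconstruction; the actual file is not pasted here):
192.168.1.100    nfs0
192.168.21.100   frontgw
192.168.21.101   nfs1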
-------------------------------------------------------
'pbsnodes'
nfs0
state = free
power_state = Running
np = 2
ntype = cluster
status = rectime=1480641566,macaddr=80:ee:73:b2:fd:47,cpuclock=OnDemand:800MHz,varattr=,jobs=,state=free,netload=2979647,gres=,loadave=0.02,ncpus=2,physmem=16214056kb,availmem=24054920kb,totmem=24508452kb,idletime=0,nusers=2,nsessions=6,sessions=3257 3262 3263 3289 3291 5223,uname=Linux frontgw 2.6.32-573.el6.x86_64 #1 SMP Wed Jul 1 18:23:37 EDT 2015 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
total_sockets = 1
total_numa_nodes = 1
total_cores = 2
total_threads = 2
dedicated_sockets = 0
dedicated_numa_nodes = 0
dedicated_cores = 0
dedicated_threads = 0
nfs1
state = free
power_state = Running
np = 4
ntype = cluster
status = rectime=1480641575,macaddr=38:2c:4a:c9:25:e2,cpuclock=OnDemand:3101MHz,varattr=,jobs=,state=free,netload=1932987,gres=,loadave=0.00,ncpus=4,physmem=16293080kb,availmem=49342204kb,totmem=49847508kb,idletime=0,nusers=1,nsessions=5,sessions=3329 3333 3334 3360 3362,uname=Linux nfs1 2.6.32-573.el6.x86_64 #1 SMP Thu Jul 23 15:44:03 UTC 2015 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
total_sockets = 1
total_numa_nodes = 1
total_cores = 4
total_threads = 4
dedicated_sockets = 0
dedicated_numa_nodes = 0
dedicated_cores = 0
dedicated_threads = 0
----
Test batch job script.
#!/bin/bash -x
#PBS -N torqtest
#PBS -q batch
#PBS -l nodes=1:ppn=4
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
echo $'\n'"-----------------"
env
echo $'\n'"-----------------"
sleep 60
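The script is submitted in the usual way (assuming it is saved as torqtest.sh):
# Submit to the batch queue and check which node the job lands on
qsub torqtest.sh
qstat -n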
---
TORQUE information.
Torque Server Version = 6.0.2
hwloc-ls 1.9
configure --enable-cgroups --disable-gui
---