sudo
2016-12-02 01:56:05 UTC
Hello,
I have a small two-node cluster, just for testing purposes.
The Torque server (hostname=frontgw) has two NIC interfaces, and
I managed to install, configure, and set up Torque on this node to run batch jobs.
'pbsnodes' reports both nodes as "state = free" and "power_state = Running".
The Torque server accepts and runs batch jobs on the server node itself (frontgw),
but fails to run them on the compute node (nfs1).
The error message from tracejob was "send of job to nfs1 failed error = 15010":
12/01/2016 19:10:53.671 S enqueuing into batch, state 1 hop 1
12/01/2016 19:10:53.874 S Job Modified at request of ***@frontgw
12/01/2016 19:10:53.875 S Job Run at request of ***@frontgw
12/01/2016 19:10:53 A queue=batch
12/01/2016 19:10:53.880 S send of job to nfs1 failed error = 15010
12/01/2016 19:10:53.880 S unable to run job, MOM rejected/rc=-1
12/01/2016 19:10:53.881 S unable to run job, send to MOM '192.168.1.101' failed
server_logs:
12/02/2016 10:14:19.711;08;PBS_Server.16835;Job;9.frontgw;unable to run job, MOM rejected/rc=-1
12/02/2016 10:14:19.712;128;PBS_Server.16835;Req;req_reject;Reject reply code=15043(Execution server rejected request MSG=cannot send job to mom, state=TRNOUT), aux=0, type=RunJob, from ***@frontgw
12/02/2016 10:14:19.713;08;PBS_Server.16835;Job;9.frontgw;unable to run job, send to MOM '192.168.1.101' failed
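(For reference, the trace at the top came from tracejob; the invocation, assuming the job ID 9.frontgw seen in the server log, was something like the following.)
# Search the last two days of logs for job 9.frontgw
tracejob -n 2 9.frontgw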
---
Could someone suggest what I am missing (or doing wrong)?
I know Torque works fine if I configure the server with the hostname of eth0
(in this case, 'nfs0').
But in my setup the hostname is bound to eth1 (frontgw),
which means two subnets are involved in this batch environment.
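For what it's worth, here is how I would double-check which address the server resolves for nfs1, since the server log shows it sending to '192.168.1.101' (a minimal sketch; the port numbers are the mom_service_port/mom_manager_port values reported by pbsnodes below):
# On frontgw: what does 'nfs1' resolve to (via /etc/hosts or DNS)?
getent hosts nfs1
# Can frontgw reach the MOM service and manager ports on nfs1?
nc -zv nfs1 15002
nc -zv nfs1 15003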
Best Regards,
--Sudo
---
What I did:
- On the Torque server (frontgw)
(1) Run pbs_mom with aliasing (-A nfs0)
(2) Some qmgr commands (see the check sketched after this list);
set server acl_hosts += nfs0
set server acl_hosts += nfs1
set server submit_hosts += nfs0
(3) /var/spool/torque/server_name . . . 'frontgw'
(4) /var/spool/torque/mom_priv/config . . '$pbsserver nfs0'
(5) /var/spool/torque/server_priv/nodes
nfs0 np=2
nfs1 np=4
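To confirm that the qmgr settings from step (2) took effect, they can be printed back from the server; a minimal check:
# Print the server configuration and pick out the ACL/submit-host attributes
qmgr -c 'print server' | grep -E 'acl_hosts|submit_hosts'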
---
- On the compute node (nfs1)
(1) Run pbs_mom with aliasing (-A nfs1)
(2) /var/spool/torque/server_name . . . 'frontgw'
(3) /var/spool/torque/mom_priv/config . . '$pbsserver frontgw'
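With both MOMs running, the server-to-MOM path can be probed directly with momctl from frontgw; a failure or timeout here would point at the same communication problem shown in the server log:
# Query MOM diagnostics on nfs1 (verbosity level 3)
momctl -d 3 -h nfs1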
-------------------------------------------------------
torque server :
- hostname : frontgw (eth1: 192.168.21.100)
nfs0 (eth0: 192.168.1.100)
compute node :
- hostname : nfs1 (eth0: 192.168.21.101)
/etc/hosts has entries for these hosts (a sketch of the expected entries follows this block).
Authentication :
- root . . . ssh works without password
- users . . . SSH HostbasedAuthentication yes
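For reference, the /etc/hosts entries implied by the addresses above would look roughly like this (my reconstruction; the actual file is not pasted here):
192.168.1.100    nfs0
192.168.21.100   frontgw
192.168.21.101   nfs1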
-------------------------------------------------------
'pbsnodes'
nfs0
state = free
power_state = Running
np = 2
ntype = cluster
status = rectime=1480641566,macaddr=80:ee:73:b2:fd:47,cpuclock=OnDemand:800MHz,varattr=,jobs=,state=free,netload=2979647,gres=,loadave=0.02,ncpus=2,physmem=16214056kb,availmem=24054920kb,totmem=24508452kb,idletime=0,nusers=2,nsessions=6,sessions=3257 3262 3263 3289 3291 5223,uname=Linux frontgw 2.6.32-573.el6.x86_64 #1 SMP Wed Jul 1 18:23:37 EDT 2015 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
total_sockets = 1
total_numa_nodes = 1
total_cores = 2
total_threads = 2
dedicated_sockets = 0
dedicated_numa_nodes = 0
dedicated_cores = 0
dedicated_threads = 0
nfs1
state = free
power_state = Running
np = 4
ntype = cluster
status = rectime=1480641575,macaddr=38:2c:4a:c9:25:e2,cpuclock=OnDemand:3101MHz,varattr=,jobs=,state=free,netload=1932987,gres=,loadave=0.00,ncpus=4,physmem=16293080kb,availmem=49342204kb,totmem=49847508kb,idletime=0,nusers=1,nsessions=5,sessions=3329 3333 3334 3360 3362,uname=Linux nfs1 2.6.32-573.el6.x86_64 #1 SMP Thu Jul 23 15:44:03 UTC 2015 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
total_sockets = 1
total_numa_nodes = 1
total_cores = 4
total_threads = 4
dedicated_sockets = 0
dedicated_numa_nodes = 0
dedicated_cores = 0
dedicated_threads = 0
----
Test batch job script.
#!/bin/bash -x
#PBS -N torqtest
#PBS -q batch
#PBS -l nodes=1:ppn=4
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
echo $'\n'"-----------------"
env
echo $'\n'"-----------------"
sleep 60
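The script is submitted in the usual way (assuming it is saved as torqtest.sh):
# Submit to the batch queue and check which node the job lands on
qsub torqtest.sh
qstat -n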
---
TORQUE information.
Torque Server Version = 6.0.2
hwloc-ls 1.9
configure --enable-cgroups --disable-gui
---