Discussion:
[torqueusers] Post job file processing error
tracy_luofengji
2009-03-12 13:09:05 UTC
Dear all,
Hello, I did a fresh installation of TORQUE 2.3.0 on my cluster and ran into a strange post job file processing problem. I used the same installation procedure on all 5 compute nodes (node1, node2, node3, node4, node5), with node0 acting as the master. On the compute nodes, I installed only these packages:

/usr/local/torque-package-mom-linux-i686.sh --install
/usr/local/torque-package-clients-linux-i686.sh --install

and then ran pbs_mom on each compute node.

The problem is that when I submit test jobs, only node1 sends its output file back to the master node; the other 4 compute nodes cannot send the output file back. I ran qstat -f and saw the following in its output:
.....
sched_hint:Post job file processing error;job32.ciarlab11.cluster.net on host ciarlab14.cluster.net/0
Unable to copy file /var/spool/torque/spool/32.ciarlab11.cluster.net.OU to ciarlab11.cluster.net:/usr/local/out
Unable to copy file /var/spool/torque/spool/32.ciarlab11.cluster.net.ER to ciarlab11.cluster.net:/usr/local/err
comment=Job started on Thu Mar 12 at 21:09
etime=Thu Mar 12 21:09:18 2009
exit_status = -1
submit_args=pbsjob
start_time=Thu Mar 12 21:09:18 2007
start_count=1

And my job script is:
#!/bin/sh
#PBS -N exampleJob
#PBS -o /usr/local/out
#PBS -e /usr/local/err
#PBS -V
echo 'helloworld'
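
For reference, I submit and check the test jobs roughly like this (the script above is saved as pbsjob, which is what the submit_args line shows):

qsub pbsjob     # submit the job
qstat -f 32     # full status for the job; this is where I saw the sched_hint and errors above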

I have spent 2 days on this issue, and I hope I can get some support from this mailing list.
Any help will be appreciated.

Thanks!
Regards,
Tracy




Steve Young
2009-03-13 11:27:39 UTC
Hi,
You need to make sure that you can ssh/scp without a password between
the server and the nodes. Depending on how you have things configured,
you may need to make sure it works with the short name and/or the FQDN.
You should be able to try a copy by hand using the error you posted:
go to ciarlab14.cluster.net and try to copy the file
/var/spool/torque/spool/32.ciarlab11.cluster.net.OU to
ciarlab11.cluster.net:/usr/local/out (there is a quick sketch of that
test at the end of this message). I'm also guessing that you're using
NFS partitions, so be sure that you can write to those partitions on
the nodes. You might also need to use the mom directive usecp. From
the TORQUE admin guide:

$usecp <HOST>:<SRCDIR> <DSTDIR>
    Specifies which directories should be staged (see TORQUE Data Management).
    Example: $usecp *.fte.com:/data /usr/local/data
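
In practice that directive goes in mom_priv/config on each compute node.
A rough sketch, assuming /usr/local is a shared path that looks the same
on every node (adjust the host pattern and the paths to your site):

# /var/spool/torque/mom_priv/config on each compute node
$usecp *.cluster.net:/usr/local  /usr/local

Restart pbs_mom afterwards so it picks up the new directive.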

Also, you state your nodes are node0-node5, but the error message says
ciarlab11.cluster.net and ciarlab14.cluster.net, so that is a little
confusing. This has been covered on the list before, so searching the
archives might give you some more answers to this type of problem.
I hope this helps.
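
In the meantime, here is roughly what that manual copy test looks like,
using the paths from your error message (run it on ciarlab14.cluster.net
as the user who submitted the job; if that particular .OU file is
already gone, any small test file will do):

ssh ciarlab11.cluster.net true    # should return without asking for a password
scp /var/spool/torque/spool/32.ciarlab11.cluster.net.OU ciarlab11.cluster.net:/usr/local/out

If either command prompts for a password or fails, pbs_mom is hitting
the same thing; setting up ssh keys for that user (ssh-keygen, then
adding the public key to ~/.ssh/authorized_keys on the destination
host) is the usual fix.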

-Steve