tracy_luofengji
2009-03-12 13:09:05 UTC
Dear all,
Hello, I did a fresh installation of torque 2.3.0 on my cluster and ran into a strange post-job file processing problem. I followed the same installation procedure on all 5 compute nodes (node1, node2, node3, node4, node5); node0 acts as the master. On the compute nodes, I installed only these packages:
/usr/local/torque-package-mom-linux-i686.sh --install
/usr/local/torque-package-clients-linux-i686.sh --install
and then, on the compute nodes, I ran: pbs_mom
The problem is that when I submit test jobs, only node1 can send the output file back to the master node. The other 4 compute nodes cannot. I ran qstat -f and saw the following:
.....
sched_hint:Post job file processing error;job32.ciarlab11.cluster.net on host ciarlab14.cluster.net/0
Unable to copy file /var/spool/torque/spool/32.ciarlab11.cluster.net.OU to ciarlab11.cluster.net:/usr/local/out
Unable to copy file /var/spool/torque/spool/32.ciarlab11.cluster.net.ER to ciarlab11.cluster.net:/usr/local/err
comment=Job started on Thu Mar 12 at 21:09
etime=Thu Mar 12 21:09:18 2009
exit_status = -1
submit_args=pbsjob
start_time=Thu Mar 12 21:09:18 2007
start_count=1
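To pull just the failing copies out of output like the above, a small filter can help. This is only a minimal sketch: failed_copies is a hypothetical helper name, and it assumes each "Unable to copy file" message sits on a single line, as in the excerpt above (real qstat -f output may wrap long lines).

```shell
#!/bin/sh
# Hypothetical helper (not part of torque): read `qstat -f` output on stdin
# and print "source -> destination" for every failed post-job copy.
# Assumes each "Unable to copy file ... to ..." message is on one line.
failed_copies() {
    sed -n 's/^[[:space:]]*Unable to copy file \(.*\) to \([^ ]*\)$/\1 -> \2/p'
}

# Example use, assuming qstat is on PATH:
#   qstat -f 32 | failed_copies
```

The destination is printed as host:path, which is the target the compute node tried to copy to.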
And my job script is:
#!/bin/sh
#PBS -N exampleJob
#PBS -o /usr/local/out
#PBS -e /usr/local/err
#PBS -V
echo 'helloworld'
I have spent 2 days on this issue, and I hope I can get some support from this mailing list.
Any help will be appreciated.
Thanks!
Regards,
Tracy