Stevens, Philip
2014-04-09 15:44:24 UTC
Hi there,
I have a problem. Everything worked fine, except that some jobs stayed stuck in the queue even though they had finished correctly (a former colleague said this can happen once in a while). After googling how to get rid of them, I tried all of the solutions I found, but nothing helped. So I decided to restart the server using:
"qterm -t quick"
I thought everything would work fine afterwards (since it is the quick restart option?), but several minutes later this error message keeps appearing when I run qstat -a to check the queue:
Error communicating with node1(IP-Address)
Cannot connect to default server host 'node1' - check pbs_server daemon and/or trqauthd.
qstat: cannot connect to server node1 (errno=111) Connection refused.
I then started pbs_server (no message) as well as trqauthd as root
(hostname:node1
pbs_server port is: 15001
trqauthd daemonized - port 15005)
I still get the same error message.
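As a side check: errno=111 is ECONNREFUSED on Linux, which means nothing accepted the TCP connection on the port that was contacted. A minimal Python sketch to test whether a given port accepts connections might look like the following. The function name probe_port and the 2-second timeout are made up for illustration and are not part of TORQUE.

```python
import socket


def probe_port(host, port, timeout=2.0):
    """Attempt a plain TCP connect and report what happened.

    Returns "open" if something accepted the connection,
    "refused" if the connection was actively refused (ECONNREFUSED),
    or "error: ..." for anything else (DNS failure, timeout, ...).
    """
    try:
        # create_connection resolves the name and connects; the
        # context manager closes the socket again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        return f"error: {exc}"
```

Calling probe_port("node1", 15001) and probe_port("node1", 15005) would then show whether anything is listening on the two ports reported above under the name the client uses. A "refused" result for 15001 would suggest pbs_server is not actually running or not bound to that address.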
The hostname of the server and the name given in /var/spool/torque/server_name are exactly the same.
So although I thought it would just be a quick restart, the server now does not work at all.
Any suggestions on how to fix this would be highly appreciated!
Thank you.
Phil