Discussion:
[torqueusers] Server not responsive
Stevens, Philip
2014-04-09 15:44:24 UTC
Permalink
Hi there,

I have a problem. Everything worked fine, except that some jobs remained stuck in the queue although they had finished correctly (a former colleague said that this might happen once in a while). After googling how to get rid of them I tried every solution I found, but nothing helped. So I decided to restart the server using:

"qterm -t quick"

I thought everything would be working fine after that (since it is the quick restart option?), but several minutes later this error message keeps popping up when I use qstat -a to check the queue:

Error communicating with node1(IP address)
Cannot connect to default server host 'node1' - check pbs_server daemon and/or trqauthd.
qstat: cannot connect to server node1 (errno=111) Connection refused.

I then started pbs_server as root (it printed no message) as well as trqauthd:
(hostname:node1
pbs_server port is: 15001
trqauthd daemonized - port 15005)

I still get the same error message.
The hostname of the server and the name given in /var/spool/torque/server_name are exactly the same.

So although I thought it was just a quick restart, the server now does not work at all.

Any suggestions on how to fix this are highly appreciated!

Thank you.

Phil
Andrus, Brian Contractor
2014-04-10 19:41:28 UTC
Permalink
Phil,

More info would be useful.
"does not work at all" is a little broad :)

What are the symptoms? What operating system? How was it installed?


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



Stevens, Philip
2014-04-11 07:22:08 UTC
Permalink
Apologies for the sparse information in my first mail.

Using pbs_server -v I get the following message:
version: 4.1.2

It has been running on a CentOS 6.4 system for approximately 2 years without any flaws. As I described, I can't reach the server. When I start the server with 'pbs_server', there is no message saying that it is running or anything like that. After starting 'pbs_server', 'trqauthd' and 'pbs_sched', I try to get the list of jobs (which should be empty) with 'qstat', but instead the following message pops up:

Error communicating with node1(10.11.22.10)
Cannot connect to default server host 'node1' - check pbs_server daemon and/or trqauthd.
qstat: cannot connect to server node1 (errno=111) Connection refused

Firewall and SELinux are disabled. The problem came up after I tried to restart the server using 'qterm -t quick' because some jobs were still listed in the queue although they had finished. The hostname in /etc/hosts and in /var/spool/torque/server_name is the same, and it matches the one printed by 'hostname' on the command line.
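
The name comparison described above (the same name in /etc/hosts, in /var/spool/torque/server_name and from 'hostname') could be scripted roughly as follows. This is only a sketch and not something posted in the thread: server_name_matches and hosts_file_has are hypothetical helper names, and the file paths are passed in as arguments instead of being hard-coded.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Torque.

# Succeed if the first line of FILE equals HOSTNAME.
# Usage: server_name_matches FILE HOSTNAME
server_name_matches() {
    # Take the first line and drop blanks, tabs and carriage returns.
    stored=$(head -n 1 "$1" 2>/dev/null | tr -d ' \t\r')
    [ "$stored" = "$2" ]
}

# Succeed if HOSTNAME appears as a name (not as the address field)
# on a non-comment line of a hosts-style file.
# Usage: hosts_file_has HOSTS_FILE HOSTNAME
hosts_file_has() {
    awk -v h="$2" '
        /^[[:space:]]*#/ { next }                      # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
        END { exit !found }
    ' "$1"
}
```

It might be called as: server_name_matches /var/spool/torque/server_name "$(hostname)" && hosts_file_has /etc/hosts "$(hostname)" && echo "names are consistent". A match only rules out a naming mismatch; it says nothing about whether pbs_server is actually running.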

My assumption is that the server starts up but shuts down immediately afterwards: when I start it with 'pbs_server &' to get the PID directly and then check for it right away, there is no process running with that ID.

Any ideas on how to fix this are highly appreciated, and sorry again for the crappy first mail! ;)
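
The PID check described above could look roughly like the sketch below, together with a helper for finding the newest file in a log directory, which is usually the next place to look when a daemon exits right after starting. Both function names are hypothetical. One caveat, stated as an assumption rather than a fact about this installation: if pbs_server detaches from the shell the way daemons commonly do, the PID obtained from '&' can disappear even after a successful start, so a vanished PID alone may not prove that the server died. The log directory used in the usage note (/var/spool/torque/server_logs) is likewise an assumption; the thread itself only mentions /var/spool/torque/server_name.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Torque.

# Succeed if a process with the given PID still exists after a delay.
# Usage: still_running PID [SECONDS]
still_running() {
    sleep "${2:-2}"
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    kill -0 "$1" 2>/dev/null
}

# Print the path of the most recently modified file in DIR.
# Fails (prints nothing) if DIR is empty or unreadable.
# Usage: newest_log DIR
newest_log() {
    # ls -t sorts by modification time, newest first.
    newest=$(ls -t "$1" 2>/dev/null | head -n 1)
    [ -n "$newest" ] && printf '%s/%s\n' "$1" "$newest"
}
```

Assuming the log directory named above exists, something like: log=$(newest_log /var/spool/torque/server_logs) && tail -n 20 "$log" would show the last lines the server wrote before it went away.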


cheers,

Phil
Stevens, Philip
2014-04-11 11:46:29 UTC
Permalink
Hi there,

I kind of fixed it by cold restarting the server. Some jobs were lost, but that is better than no server at all.

Thanks for your thoughts!

Best,

Phil


Michel Béland
2014-04-11 16:43:23 UTC
Permalink
Hello,

The command « qterm -t quick » does not restart the server, it just
kills the server while leaving the jobs running. You have to restart the
server after that. If I were you, I would have restarted pbs_server with
the default option (warm), which only requeues rerunnable jobs according
to the man page.
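
The two-step sequence described above (stop the server with qterm, then start pbs_server again yourself) can be sketched as a small wrapper. This is an illustration under assumptions, not a tested procedure: restart_pbs_server and run are hypothetical names, the wrapper only prints the commands unless DRY_RUN=0 is set, and passing a start type through 'pbs_server -t' reflects my reading of the pbs_server man page rather than anything confirmed in this thread. With no argument it starts pbs_server with its default (warm) behaviour.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of Torque.

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Stop the server, leaving jobs running, then start it again.
# Usage: restart_pbs_server [START_TYPE]
restart_pbs_server() {
    start_type=${1:-}
    run qterm -t quick                 # only shuts the server down
    if [ -n "$start_type" ]; then
        run pbs_server -t "$start_type"
    else
        run pbs_server                 # default start type
    fi
}
```

Calling restart_pbs_server with no argument prints the two commands in order, which makes it easy to review the sequence before running it for real with DRY_RUN=0. A cold start discards job information, as Phil found, so it is best kept as a last resort.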

I find it surprising that there is no pbs_server option to leave *all*
the jobs running. Is this manual page (version 2.5.3) accurate? I just
tested it on a lightly loaded cluster with only a test job that was
rerunnable and a restart of the server leaves the job running (which is
good in my opinion).



--
Michel Béland, scientific computing analyst
michel.beland at calculquebec.ca
office S-250, pavillon Roger-Gaudry (main), Université de Montréal
phone: 514 343-6111 ext. 3892 fax: 514 343-2155
Calcul Québec (www.calculquebec.ca)
Calcul Canada (calculcanada.ca)
