Discussion:
[torqueusers] Crash of pbs_server in initial setup script (torque.setup) in Torque (4.2.10 and 6.0.2) on Ubuntu 16.04 LTS with dual Xeon E5 v4 chips
Kazuhiro Fujita
2016-10-24 08:09:20 UTC
Permalink
Dear All,

I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS with dual E5-2630 v3
chips.
I recently got servers with dual Xeon E5 v4 chips, and installed Ubuntu
16.04 LTS on them.
I tried to set up Torque on them, but I got stuck at the initial setup
script.
It seems that qmgr may trigger a crash of pbs_server in the initial setup
script (torque.setup); see below.
A similar error is also observed in Torque 6.0.2.
Have you ever observed this kind of error?
If you know of possible solutions, please tell me.
Any comments would be highly appreciated.
Would it be better to change the OS to another distribution, such as
Scientific Linux?

Thank you in advance,
Kazu


Errors in Torque 4.2.10 setup

> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$ sudo
> ./torque.setup $USER
> Currently no servers active. Default server will be listed as active
> server. Error 15133
> Active server name: torque-server pbs_server port is: 15001
> trqauthd daemonized - port /tmp/trqauthd-unix
> trqauthd successfully started
> initializing TORQUE (admin: torque-server-***@torque-server)
> You have selected to start pbs_server in create mode.
> If the server database exists it will be overwritten.
> do you wish to continue y/(n)?y
> root 27941 1942 1 12:22 ? 00:00:00 pbs_server -t create
> Max open servers: 9
> set server operators += torque-server-***@torque-server
> Max open servers: 9
> set server managers += torque-server-***@torque-server
> qmgr obj=batch svr=default: End of File
> Unable to communicate with torque-server(10.x.x.x)
> Cannot connect to specified server host 'torque-server'.
> qmgr: cannot connect to server (errno=111) Connection refused
> Unable to communicate with torque-server(10.x.x.x)
> Cannot connect to specified server host 'torque-server'.
> qmgr: cannot connect to server (errno=111) Connection refused
> Unable to communicate with torque-server(10.x.x.x)
> Cannot connect to specified server host 'torque-server'.
> qmgr: cannot connect to server (errno=111) Connection refused
> Unable to communicate with torque-server(10.x.x.x)
> Cannot connect to specified server host 'torque-server'.
> qmgr: cannot connect to server (errno=111) Connection refused
> Unable to communicate with torque-server(10.x.x.x)
> Cannot connect to specified server host 'torque-server'.
> qmgr: cannot connect to server (errno=111) Connection refused
> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$ ps
> aux | grep pbs
> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+ 12:22 0:00 grep
> --color=auto pbs

The pbs_server -t create process was no longer running.

Errors in Torque 6.0.2 setup

> torque-server-***@torque-server:~/Downloads/torque/6.0.2$ sudo
> ./torque.setup $USER
> Currently no servers active. Default server will be listed as active
> server. Error 15133
> Active server name: torque-server pbs_server port is: 15001
> trqauthd daemonized - port /tmp/trqauthd-unix
> trqauthd successfully started
> initializing TORQUE (admin: torque-server-***@torque-server)
> You have selected to start pbs_server in create mode.
> If the server database exists it will be overwritten.
> do you wish to continue y/(n)?y
> root 39521 1 1 16:10 ? 00:00:00 pbs_server -t create
> Max open servers: 9
> Max open servers: 9
> qmgr obj=batch svr=default: End of File
> Unable to communicate with torque-server(10.x.x.x)
> Cannot connect to specified server host 'torque-server'.
> qmgr: cannot connect to server (errno=111) Connection refused
> Unable to communicate with torque-server(10.x.x.x)
> Cannot connect to specified server host 'torque-server'.
> qmgr: cannot connect to server (errno=111) Connection refused
> Unable to communicate with torque-server(10.x.x.x)
> Cannot connect to specified server host 'torque-server'.
> qmgr: cannot connect to server (errno=111) Connection refused
> Unable to communicate with torque-server(10.x.x.x)
> Cannot connect to specified server host 'torque-server'.
> qmgr: cannot connect to server (errno=111) Connection refused
> Unable to communicate with torque-server(10.x.x.x)
> Cannot connect to specified server host 'torque-server'.
> qmgr: cannot connect to server (errno=111) Connection refused
> torque-server-***@torque-server:~/Downloads/torque/6.0.2$ ps aux | grep
> pbs
> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+ 16:11 0:00 grep
> --color=auto pbs

The pbs_server -t create process was no longer running.

Commands used for installation before the setup script

> # build and install torque
> ./configure
> make
> sudo make install
>
> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
> sudo ldconfig
>
> # set up as services
> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
> sudo update-rc.d trqauthd defaults
> sudo update-rc.d pbs_server defaults
> sudo update-rc.d pbs_sched defaults
> sudo update-rc.d pbs_mom defaults
>
> sudo ./torque.setup $USER
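After a run of torque.setup like the one above, a quick sanity check can confirm whether the daemons survived (a sketch; /var/spool/torque is the configure default spool path and an assumption here):

```shell
# Check whether the daemons are still up after torque.setup.
pgrep -a trqauthd   || echo "trqauthd is not running"
pgrep -a pbs_server || echo "pbs_server is not running"

# Torque names server logs by date (YYYYMMDD), so the newest log
# usually shows why pbs_server exited:
# sudo tail -n 50 /var/spool/torque/server_logs/$(date +%Y%m%d)
```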
David Beer
2016-10-24 16:00:19 UTC
Permalink
Kazu,

Can you give us a backtrace for this crash? We have fixed some issues on
startup (around mutex management for newer pthread implementations) and a
backtrace would allow me to confirm if what you're seeing is fixed.



--
David Beer | Torque Architect
Adaptive Computing
Kazuhiro Fujita
2016-10-24 17:46:05 UTC
Permalink
David,

Thank you for the quick response.
I actually tried installing Torque 6.1-dev on the new servers. It passed the
torque.setup script, but pbs_server was still unstable after applying the
qmgr settings. Could you tell me how to get a backtrace?

Best,
Kazu

David Beer
2016-10-24 20:01:40 UTC
Permalink
You can either:

1. start pbs_server using gdb (gdb pbs_server <enter> r -D <enter> when the
prompt appears) and then type bt <enter> after the crash
OR
2. Enable core dumping, run pbs_server, then open the core using gdb (gdb
pbs_server <path to core file>) and then type bt when the prompt appears.
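Spelled out as a shell session, the two options look roughly like this (a sketch; the install path /usr/local/sbin/pbs_server is the configure default and is an assumption here):

```shell
# Option 1: run pbs_server in the foreground under gdb, then print the
# backtrace once it crashes.
#   sudo gdb /usr/local/sbin/pbs_server
#   (gdb) run -D        # -D keeps pbs_server in the foreground
#   ... crash ...
#   (gdb) bt            # print the backtrace

# Option 2: enable core dumps first, reproduce the crash, then open
# the core file in gdb.
ulimit -c unlimited     # allow core files of any size in this shell
ulimit -c               # verify the new limit
#   sudo /usr/local/sbin/pbs_server -D
#   sudo gdb /usr/local/sbin/pbs_server core
#   (gdb) bt
```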



--
David Beer | Torque Architect
Adaptive Computing
Kazuhiro Fujita
2016-10-25 05:36:26 UTC
Permalink
David,

I attached the backtrace of pbs_server (Torque 6.0.2) obtained with gdb
(based on https://wiki.ubuntu.com/Backtrace).

I started pbs_server with gdb,
and executed qmgr from another terminal (see below):

> sudo qmgr -c 'p s'
> Unable to communicate with torque-server(10.x.x.x)
> Cannot connect to specified server host 'torque-server'.
> qmgr: cannot connect to server (errno=111) Connection refused

After the qmgr execution, I pressed Ctrl+C in gdb.

Best,
Kaz


Kazuhiro Fujita
2016-10-25 06:30:45 UTC
Permalink
Thank you, David, for the comment on the backtrace.
I hadn't noticed it until writing this mail.
So, I generated the backtrace as described in the Ubuntu wiki.

I also attached the backtrace of pbs_server (Torque 6.1-dev) obtained with
gdb.
As I mentioned before, the torque.setup script executed successfully, but
pbs_server was unstable.

Before using gdb, I used the following commands.

> git clone https://github.com/adaptivecomputing/torque.git -b 6.1-dev
> 6.1-dev
> cd 6.1-dev
> ./autogen.sh
> # build and install torque
> ./configure
> make
> sudo make install
> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
> sudo ldconfig
> # set as services
> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
> sudo update-rc.d trqauthd defaults
> sudo update-rc.d pbs_server defaults
> sudo update-rc.d pbs_sched defaults
> sudo update-rc.d pbs_mom defaults
>
> sudo ./torque.setup $USER
> sudo qmgr -c 'p s'
> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
> tee /var/spool/torque/server_priv/nodes
> sudo nano /var/spool/torque/server_priv/nodes # (changed np)
> sudo qterm -t quick
> sudo /etc/init.d/trqauthd stop
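The nodes-file step above can also be done with `nproc` instead of grepping
/proc/cpuinfo; a minimal sketch (the `nodes_entry` helper is hypothetical,
and the /var/spool/torque path assumes the default install layout):

```shell
# Hypothetical helper: print the one-line TORQUE nodes entry for this host.
# nproc reports the logical CPU count, replacing the grep/wc pipeline.
nodes_entry() {
  printf '%s np=%s\n' "$(uname -n)" "$(nproc)"
}

# On a real server you would then install it as root, e.g.:
#   nodes_entry | sudo tee /var/spool/torque/server_priv/nodes
nodes_entry
```

The np value can still be edited down afterwards, as in the nano step above.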


trqauthd was not stopped by the last command, so I stopped it by killing the
trqauthd process.
Then I restarted the Torque processes under gdb.

sudo /etc/init.d/trqauthd start

sudo gdb /etc/init.d/pbs_server 2>&1 | tee ~/gdb-torquesetup-6.1-dev.txt


In another terminal, I executed the following commands before pbs_server
crashed.

sudo /etc/init.d/pbs_mom start
> sudo /etc/init.d/pbs_sched start
> ps aux | grep pbs
> pbsnodes -a
> echo "sleep 30" | qsub


The last command printed "0.torque-server", and it was this submission that
crashed pbs_server under gdb.
I then generated the backtrace.
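As an aside, gdb is normally pointed at the pbs_server binary itself rather
than the init script; a sketch of assembling a non-interactive batch
invocation (the binary path is an assumption based on the default
--prefix=/usr/local install):

```shell
# Assemble a gdb batch invocation that runs pbs_server and, on a crash,
# dumps a full backtrace of every thread. -ex commands run in order.
PBS_SERVER_BIN=${PBS_SERVER_BIN:-/usr/local/sbin/pbs_server}
GDB_CMD="gdb -batch -ex run -ex 'thread apply all bt full' $PBS_SERVER_BIN"
echo "$GDB_CMD"
# To capture the crash log to a file:
#   sudo sh -c "$GDB_CMD" 2>&1 | tee ~/gdb-pbs_server.txt
```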

Best,
Kazu


On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <***@gmail.com>
wrote:

> David,
>
> I attached the backtrace of pbs_server (Torque 6.0.2) by gdb.
> (based on https://wiki.ubuntu.com/Backtrace)
>
> I started pbs_server with gdb,
> and execute qmgr from another terminal. (see below)
>
> sudo qmgr -c 'p s'
>> Unable to communicate with torque-server(10.x.x.x)
>> Cannot connect to specified server host 'torque-server'.
>> qmgr: cannot connect to server (errno=111) Connection refused
>>
>
> After the qmgr execution, I pressed ctrl +c in gdb.
>
> Best,
> Kaz
>
>
> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <***@adaptivecomputing.com>
> wrote:
>
>> Kazu,
>>
>> Can you give us a backtrace for this crash? We have fixed some issues on
>> startup (around mutex management for newer pthread implementations) and a
>> backtrace would allow me to confirm if what you're seeing is fixed.
>>
David Beer
2016-10-25 20:06:08 UTC
Permalink
I can confirm that this bug is fixed in 6.0-dev, and we've made a hotfix
for it, 6.0.2.h3. The crash was caused by a change in the pthread library
implementation, so most users will not see it, but if you have a newer
version of that library you will hit it.
Rick is going to send instructions for how to get 6.0.2.h3.
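For anyone hitting the repeated "errno=111 Connection refused" messages
above, the first thing to check is whether pbs_server survived at all; a
small diagnostic sketch (process name assumed to be pbs_server):

```shell
# "Connection refused" usually means nothing is listening on the server
# port, i.e. pbs_server exited or crashed. Check the process table first;
# the [p] bracket trick keeps grep from matching its own command line.
if ps -e 2>/dev/null | grep -q '[p]bs_server'; then
  echo "pbs_server process is running"
else
  echo "pbs_server is not running -- it likely crashed during setup"
fi
```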

David



--
David Beer | Torque Architect
Adaptive Computing
David Beer
2016-10-25 20:06:44 UTC
Permalink
Actually, Rick just sent me the link. You can download it from here:
http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
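The hotfix can be applied with the same build steps used earlier in the
thread; a sketch, assuming (unverified) that the tarball unpacks into a
directory named after its basename:

```shell
# Derive the source directory name from the hotfix URL, then (commented
# out here, since they need network access and root) download, unpack,
# and rebuild exactly as for a release tarball.
url=http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
tarball=${url##*/}          # strip everything through the last slash
srcdir=${tarball%.tar.gz}   # strip the archive suffix
echo "$srcdir"
# wget "$url"
# tar xzf "$tarball" && cd "$srcdir"
# ./configure && make && sudo make install
```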




--
David Beer | Torque Architect
Adaptive Computing
Kazuhiro Fujita
2016-10-26 02:52:26 UTC
Permalink
David and Rick,

Thank you for the quick response. I will try it later.

Best,
Kazu

On Wed, Oct 26, 2016 at 5:06 AM, David Beer <***@adaptivecomputing.com>
wrote:

> Actually, Rick just sent me the link. You can download it from here:
> http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>
> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <***@adaptivecomputing.com>
> wrote:
>
>> I can confirm that this bug is fixed in 6.0-dev, and we've made a hotfix
>> for it, 6.0.2.h3. This was caused because of a change in the implementation
>> for the pthread library, so most will not see this crash, but it appears
>> that if you have a newer version of that library, then you will get it.
>> Rick is going to send instructions for how to grab 6.0.2.h3.
>>
>> David
>>
>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>> ***@gmail.com> wrote:
>>
>>> Thank you David for the comment on the backtrace.
>>> I haven't noticed that until writing this mail.
>>> So, I used backtrace as written in the Ubuntu wiki.
>>>
>>> I also attached the backtrace of pbs_server (Torque 6.1-dev) by gdb.
>>> As I mentioned before torque.setup script was successfully executed, but
>>> unstable.
>>>
>>> Before using gdb, I used following commands.
>>>
>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.1-dev
>>>> 6.1-dev
>>>> cd 6.1-dev
>>>> ./autogen.sh
>>>> # build and install torque
>>>> ./configure
>>>> make
>>>> sudo make install
>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>> echo /usr/local/lib > sudo tee /etc/ld.so.conf.d/torque.conf
>>>> sudo ldconfig
>>>> # set as services
>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>> sudo update-rc.d trqauthd defaults
>>>> sudo update-rc.d pbs_server defaults
>>>> sudo update-rc.d pbs_sched defaults
>>>> sudo update-rc.d pbs_mom defaults
>>>>
>>>> sudo ./torque.setup $USER
>>>> sudo qmgr -c 'p s'
>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
>>>> tee /var/spool/torque/server_priv/nodes
>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed np)
>>>> sudo qterm -t quick
>>>> sudo /etc/init.d/trqauthd stop
>>>
>>>
>>> trqauthd was not stop by the last command. So, I stopped it by killing
>>> the trqauthd process.
>>> Then I restarted the torque processes with gdb.
>>>
>>> sudo /etc/init.d/trqauthd start
>>>
>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee ~/gdb-torquesetup-6.1-dev.txt
>>>
>>>
>>> In another terminal, I executed the following commands before pbs_server
>>> was crashed.
>>>
>>> sudo /etc/init.d/pbs_mom start
>>>> sudo /etc/init.d/pbs_sched start
>>>> ps aux | grep pbs
>>>> pbsnodes -a
>>>> echo "sleep 30" | qsub
>>>
>>>
>>> The output of the last command is "0.torque-server".
>>> And this command crashed the pbs_server in gdb.
>>> Then, I made the backtrace.
>>>
>>> Best,
>>> Kazu
>>>
>>>
>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>> ***@gmail.com> wrote:
>>>
>>>> David,
>>>>
>>>> I attached the backtrace of pbs_server (Torque 6.0.2) by gdb.
>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>
>>>> I started pbs_server with gdb,
>>>> and execute qmgr from another terminal. (see below)
>>>>
>>>> sudo qmgr -c 'p s'
>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>> Cannot connect to specified server host 'torque-server'.
>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>
>>>>
>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
>>>>
>>>> Best,
>>>> Kaz
>>>>
>>>>
>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>> ***@adaptivecomputing.com> wrote:
>>>>
>>>>> Kazu,
>>>>>
>>>>> Can you give us a backtrace for this crash? We have fixed some issues
>>>>> on startup (around mutex management for newer pthread implementations) and
>>>>> a backtrace would allow me to confirm if what you're seeing is fixed.
>>>>>
>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>> ***@gmail.com> wrote:
>>>>>
>>>>>> Dear All,
>>>>>>
>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS with dual
>>>>>> E5-2630 v3 chips.
>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>> And I tried to set up Torque on them, but I got stuck with the initial
>>>>>> setup script.
>>>>>> It seems that qmgr may trigger a crash of pbs_server in the initial
>>>>>> setup script (torque.setup). (see below)
>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>> Have you ever observed this kind of error?
>>>>>> If you know of possible solutions, please tell me.
>>>>>> Any comments will be highly appreciated.
>>>>>> Would it be better to change the OS to another distribution, such as
>>>>>> Scientific Linux?
>>>>>>
>>>>>> Thank you in advance,
>>>>>> Kazu
>>>>>>
>>>>>>
>>>>>> Errors in torque 4.2.10 setup
>>>>>>
>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>> sudo ./torque.setup $USER
>>>>>>> Currently no servers active. Default server will be listed as active
>>>>>>> server. Error 15133
>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>> trqauthd successfully started
>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>> If the server database exists it will be overwritten.
>>>>>>> do you wish to continue y/(n)?y
>>>>>>> root 27941 1942 1 12:22 ? 00:00:00 pbs_server -t create
>>>>>>> Max open servers: 9
>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>> Max open servers: 9
>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>> ps aux | grep pbs
>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+ 12:22 0:00
>>>>>>> grep --color=auto pbs
>>>>>>
>>>>>> pbs_server -t create was not found.
>>>>>>
>>>>>> Errors in torque 6.0.2 setup
>>>>>>
>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$ sudo
>>>>>>> ./torque.setup $USER
>>>>>>> Currently no servers active. Default server will be listed as active
>>>>>>> server. Error 15133
>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>> trqauthd successfully started
>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>> If the server database exists it will be overwritten.
>>>>>>> do you wish to continue y/(n)?y
>>>>>>> root 39521 1 1 16:10 ? 00:00:00 pbs_server -t create
>>>>>>> Max open servers: 9
>>>>>>> Max open servers: 9
>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$ ps aux
>>>>>>> | grep pbs
>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+ 16:11 0:00
>>>>>>> grep --color=auto pbs
>>>>>>
>>>>>> pbs_server -t create was not found.
>>>>>>
>>>>>> Commands used for installation before the setup script
>>>>>>
>>>>>>> # build and install torque
>>>>>>> ./configure
>>>>>>> make
>>>>>>> sudo make install
>>>>>>
>>>>>>
>>>>>>
>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>> sudo ldconfig
>>>>>>
>>>>>>
>>>>>>
>>>>>> # set up as services
>>>>>>
>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>
>>>>>>
>>>>>>
>>>>>> sudo ./torque.setup $USER
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> torqueusers mailing list
>>>>>> ***@supercluster.org
>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> David Beer | Torque Architect
>>>>> Adaptive Computing
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> David Beer | Torque Architect
>> Adaptive Computing
>>
>
>
>
> --
> David Beer | Torque Architect
> Adaptive Computing
>
>
>
Kazuhiro Fujita
2016-10-26 06:46:49 UTC
David,

I tried 6.0.2.h3, but it seems that another issue still remains.
After I initialized serverdb with "sudo pbs_server -t create", pbs_server
crashed.
Then I ran pbs_server under gdb.

Best,
Kazu

sudo gdb /usr/local/sbin/pbs_server
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html
>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/sbin/pbs_server...done.
(gdb) r -D
Starting program: /usr/local/sbin/pbs_server -D
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
pbs_server is up (version - 6.0.2.h3, port - 15001)
[New Thread 0x7ffff39c1700 (LWP 25591)]
[New Thread 0x7ffff31c0700 (LWP 25592)]
[New Thread 0x7ffff29bf700 (LWP 25593)]
[New Thread 0x7ffff21be700 (LWP 25594)]
[New Thread 0x7ffff19bd700 (LWP 25595)]
[New Thread 0x7ffff11bc700 (LWP 25596)]

Thread 7 "pbs_server" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff11bc700 (LWP 25596)]
__lll_unlock_elision (lock=0x57276c0, private=0) at
../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
directory.
(gdb) bt
#0 __lll_unlock_elision (lock=0x57276c0, private=0) at
../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1 0x00000000004ac076 in dispatch_timed_task (ptask=0x5727660) at
svr_task.c:318
#2 0x0000000000460247 in check_tasks (notUsed=0x0) at pbsd_main.c:921
#3 0x00000000004fc171 in work_thread (a=0x510f650) at u_threadpool.c:318
#4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700) at
pthread_create.c:333
#5 0x00007ffff6165b5d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109




On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <***@gmail.com
> wrote:

> David and Rick,
>
> Thank you for the quick response. I will try it later.
>
> Best,
> Kazu
>
> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <***@adaptivecomputing.com>
> wrote:
>
>> Actually, Rick just sent me the link. You can download it from here:
>> http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>>
>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <***@adaptivecomputing.com>
>> wrote:
>>
>>> I can confirm that this bug is fixed in 6.0-dev, and we've made a hotfix
>>> for it, 6.0.2.h3. This was caused by a change in the pthread library
>>> implementation, so most users will not see this crash, but it appears
>>> that if you have a newer version of that library, you will.
>>> Rick is going to send instructions for how to grab 6.0.2.h3.
>>>
>>> David
>>>
>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>> ***@gmail.com> wrote:
>>>
>>>> Thank you David for the comment on the backtrace.
>>>> I hadn't noticed that until writing this mail,
>>>> so I made the backtrace as described in the Ubuntu wiki.
>>>>
>>>> I also attached the backtrace of pbs_server (Torque 6.1-dev) by gdb.
>>>> As I mentioned before, the torque.setup script executed successfully,
>>>> but the server was unstable.
>>>>
>>>> Before using gdb, I used the following commands.
>>>>
>>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.1-dev
>>>>> 6.1-dev
>>>>> cd 6.1-dev
>>>>> ./autogen.sh
>>>>> # build and install torque
>>>>> ./configure
>>>>> make
>>>>> sudo make install
>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>> sudo ldconfig
>>>>> # set as services
>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>> sudo update-rc.d trqauthd defaults
>>>>> sudo update-rc.d pbs_server defaults
>>>>> sudo update-rc.d pbs_sched defaults
>>>>> sudo update-rc.d pbs_mom defaults
>>>>>
>>>>> sudo ./torque.setup $USER
>>>>> sudo qmgr -c 'p s'
>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed np)
>>>>> sudo qterm -t quick
>>>>> sudo /etc/init.d/trqauthd stop
>>>>
>>>>
>>>> trqauthd was not stopped by the last command, so I stopped it by killing
>>>> the trqauthd process.
>>>> Then I restarted the torque processes with gdb.
>>>>
>>>> sudo /etc/init.d/trqauthd start
>>>>
>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee ~/gdb-torquesetup-6.1-dev.txt
>>>>
>>>>
>>>> In another terminal, I executed the following commands before
>>>> pbs_server crashed.
>>>>
>>>> sudo /etc/init.d/pbs_mom start
>>>>> sudo /etc/init.d/pbs_sched start
>>>>> ps aux | grep pbs
>>>>> pbsnodes -a
>>>>> echo "sleep 30" | qsub
>>>>
>>>>
>>>> The output of the last command was "0.torque-server".
>>>> This command crashed pbs_server in gdb.
>>>> Then I made the backtrace.
>>>>
>>>> Best,
>>>> Kazu
>>>>
>>>>
>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>> ***@gmail.com> wrote:
>>>>
>>>>> David,
>>>>>
>>>>> I attached the backtrace of pbs_server (Torque 6.0.2) by gdb.
>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>
>>>>> I started pbs_server with gdb,
>>>>> and executed qmgr from another terminal (see below).
>>>>>
>>>>> sudo qmgr -c 'p s'
>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>
>>>>>
>>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
>>>>>
>>>>> Best,
>>>>> Kaz
>>>>>
>>>>>
>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>> ***@adaptivecomputing.com> wrote:
>>>>>
>>>>>> Kazu,
>>>>>>
>>>>>> Can you give us a backtrace for this crash? We have fixed some issues
>>>>>> on startup (around mutex management for newer pthread implementations) and
>>>>>> a backtrace would allow me to confirm if what you're seeing is fixed.
>>>>>>
>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>> ***@gmail.com> wrote:
>>>>>>
>>>>>>> Dear All,
>>>>>>>
>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS with dual
>>>>>>> E5-2630 v3 chips.
>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>> And I tried to set up Torque on them, but I got stuck with the
>>>>>>> initial setup script.
>>>>>>> It seems that qmgr may trigger a crash of pbs_server in the initial
>>>>>>> setup script (torque.setup). (see below)
>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>> Have you ever observed this kind of error?
>>>>>>> If you know of possible solutions, please tell me.
>>>>>>> Any comments will be highly appreciated.
>>>>>>> Would it be better to change the OS to another distribution, such as
>>>>>>> Scientific Linux?
>>>>>>>
>>>>>>> Thank you in advance,
>>>>>>> Kazu
>>>>>>>
>>>>>>>
>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>
>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>> sudo ./torque.setup $USER
>>>>>>>> Currently no servers active. Default server will be listed as
>>>>>>>> active server. Error 15133
>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>> trqauthd successfully started
>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00 pbs_server -t create
>>>>>>>> Max open servers: 9
>>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>>> Max open servers: 9
>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>> ps aux | grep pbs
>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+ 12:22 0:00
>>>>>>>> grep --color=auto pbs
>>>>>>>
>>>>>>> pbs_server -t create was not found.
>>>>>>>
>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>
>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$ sudo
>>>>>>>> ./torque.setup $USER
>>>>>>>> Currently no servers active. Default server will be listed as
>>>>>>>> active server. Error 15133
>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>> trqauthd successfully started
>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>> root 39521 1 1 16:10 ? 00:00:00 pbs_server -t create
>>>>>>>> Max open servers: 9
>>>>>>>> Max open servers: 9
>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$ ps aux
>>>>>>>> | grep pbs
>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+ 16:11 0:00
>>>>>>>> grep --color=auto pbs
>>>>>>>
>>>>>>> pbs_server -t create was not found.
>>>>>>>
>>>>>>> Commands used for installation before the setup script
>>>>>>>
>>>>>>>> # build and install torque
>>>>>>>> ./configure
>>>>>>>> make
>>>>>>>> sudo make install
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>> sudo ldconfig
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> # set up as services
>>>>>>>
>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> sudo ./torque.setup $USER
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> David Beer | Torque Architect
>>>>>> Adaptive Computing
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> David Beer | Torque Architect
>>> Adaptive Computing
>>>
>>
>>
>>
>> --
>> David Beer | Torque Architect
>> Adaptive Computing
>>
>>
>>
>
David Beer
2016-10-27 20:34:41 UTC
I wonder if that fix wasn't included in the hotfix. Is there any chance you
can try installing 6.0-dev on your system (via GitHub) to see if it's
resolved? For the record, my Ubuntu 16 system doesn't give me this error,
or I'd try it myself. For whatever reason, none of our test cluster
machines (CentOS & Red Hat 6-7, SLES 11-12) experience this either. We did
have another user that experienced it on a test cluster, but not being able
to reproduce it has made it harder to track down.

On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <***@gmail.com
> wrote:

> David,
>
> I tried 6.0.2.h3, but it seems that another issue still remains.
> After I initialized serverdb with "sudo pbs_server -t create", pbs_server
> crashed.
> Then I ran pbs_server under gdb.
>
> Best,
> Kazu
>
> sudo gdb /usr/local/sbin/pbs_server
> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
> Copyright (C) 2016 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.
> html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/local/sbin/pbs_server...done.
> (gdb) r -D
> Starting program: /usr/local/sbin/pbs_server -D
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> pbs_server is up (version - 6.0.2.h3, port - 15001)
> [New Thread 0x7ffff39c1700 (LWP 25591)]
> [New Thread 0x7ffff31c0700 (LWP 25592)]
> [New Thread 0x7ffff29bf700 (LWP 25593)]
> [New Thread 0x7ffff21be700 (LWP 25594)]
> [New Thread 0x7ffff19bd700 (LWP 25595)]
> [New Thread 0x7ffff11bc700 (LWP 25596)]
>
> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
> __lll_unlock_elision (lock=0x57276c0, private=0) at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
> directory.
> (gdb) bt
> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> #1 0x00000000004ac076 in dispatch_timed_task (ptask=0x5727660) at
> svr_task.c:318
> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at pbsd_main.c:921
> #3 0x00000000004fc171 in work_thread (a=0x510f650) at u_threadpool.c:318
> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700) at
> pthread_create.c:333
> #5 0x00007ffff6165b5d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
>
>
>
> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
> ***@gmail.com> wrote:
>
>> David and Rick,
>>
>> Thank you for the quick response. I will try it later.
>>
>> Best,
>> Kazu
>>
>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <***@adaptivecomputing.com>
>> wrote:
>>
>>> Actually, Rick just sent me the link. You can download it from here:
>>> http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>>>
>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <***@adaptivecomputing.com
>>> > wrote:
>>>
>>>> I can confirm that this bug is fixed in 6.0-dev, and we've made a
>>>> hotfix for it, 6.0.2.h3. This was caused by a change in the pthread
>>>> library implementation, so most users will not see this crash, but it
>>>> appears that if you have a newer version of that library, you will.
>>>> Rick is going to send instructions for how to grab 6.0.2.h3.
>>>>
>>>> David
>>>>
>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>> ***@gmail.com> wrote:
>>>>
>>>>> Thank you David for the comment on the backtrace.
>>>>> I hadn't noticed that until writing this mail,
>>>>> so I made the backtrace as described in the Ubuntu wiki.
>>>>>
>>>>> I also attached the backtrace of pbs_server (Torque 6.1-dev) by gdb.
>>>>> As I mentioned before, the torque.setup script executed successfully,
>>>>> but the server was unstable.
>>>>>
>>>>> Before using gdb, I used the following commands.
>>>>>
>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.1-dev
>>>>>> 6.1-dev
>>>>>> cd 6.1-dev
>>>>>> ./autogen.sh
>>>>>> # build and install torque
>>>>>> ./configure
>>>>>> make
>>>>>> sudo make install
>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>> sudo ldconfig
>>>>>> # set as services
>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>> sudo update-rc.d trqauthd defaults
>>>>>> sudo update-rc.d pbs_server defaults
>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>
>>>>>> sudo ./torque.setup $USER
>>>>>> sudo qmgr -c 'p s'
>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed np)
>>>>>> sudo qterm -t quick
>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>
>>>>>
>>>>> trqauthd was not stopped by the last command, so I stopped it by killing
>>>>> the trqauthd process.
>>>>> Then I restarted the torque processes with gdb.
>>>>>
>>>>> sudo /etc/init.d/trqauthd start
>>>>>
>>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee
>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>
>>>>>
>>>>> In another terminal, I executed the following commands before
>>>>> pbs_server crashed.
>>>>>
>>>>> sudo /etc/init.d/pbs_mom start
>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>> ps aux | grep pbs
>>>>>> pbsnodes -a
>>>>>> echo "sleep 30" | qsub
>>>>>
>>>>>
>>>>> The output of the last command was "0.torque-server".
>>>>> This command crashed pbs_server in gdb.
>>>>> Then I made the backtrace.
>>>>>
>>>>> Best,
>>>>> Kazu
>>>>>
>>>>>
>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>> ***@gmail.com> wrote:
>>>>>
>>>>>> David,
>>>>>>
>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2) by gdb.
>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>
>>>>>> I started pbs_server with gdb,
>>>>>> and executed qmgr from another terminal (see below).
>>>>>>
>>>>>> sudo qmgr -c 'p s'
>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>
>>>>>>
>>>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
>>>>>>
>>>>>> Best,
>>>>>> Kaz
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>
>>>>>>> Kazu,
>>>>>>>
>>>>>>> Can you give us a backtrace for this crash? We have fixed some
>>>>>>> issues on startup (around mutex management for newer pthread
>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>> seeing is fixed.
>>>>>>>
>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>> ***@gmail.com> wrote:
>>>>>>>
>>>>>>>> Dear All,
>>>>>>>>
>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS with dual
>>>>>>>> E5-2630 v3 chips.
>>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>>> And I tried to set up Torque on them, but I got stuck with the
>>>>>>>> initial setup script.
>>>>>>>> It seems that qmgr may trigger a crash of pbs_server in the initial
>>>>>>>> setup script (torque.setup). (see below)
>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>> Have you ever observed this kind of error?
>>>>>>>> If you know of possible solutions, please tell me.
>>>>>>>> Any comments will be highly appreciated.
>>>>>>>> Would it be better to change the OS to another distribution, such as
>>>>>>>> Scientific Linux?
>>>>>>>>
>>>>>>>> Thank you in advance,
>>>>>>>> Kazu
>>>>>>>>
>>>>>>>>
>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>
>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>> Currently no servers active. Default server will be listed as
>>>>>>>>> active server. Error 15133
>>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>> trqauthd successfully started
Kazuhiro Fujita
2016-10-28 03:43:43 UTC
Permalink
Thank you for your comments.
I will try the 6.0-dev next week.

Best,
Kazu

On Fri, Oct 28, 2016 at 5:34 AM, David Beer <***@adaptivecomputing.com>
wrote:

> I wonder if that fix wasn't included in the hotfix. Is there any chance you
> can try installing 6.0-dev on your system (via GitHub) to see if it's
> resolved? For the record, my Ubuntu 16 system doesn't give me this error,
> or I'd try it myself. For whatever reason, none of our test cluster
> machines (CentOS & Red Hat 6-7, SLES 11-12) experience this either. We did
> have another user who experienced it on a test cluster, but not being able
> to reproduce it has made it harder to track down.
>
> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
> ***@gmail.com> wrote:
>
>> David,
>>
>> I tried 6.0.2.h3, but it seems that another issue still remains.
>> After I initialized serverdb with "sudo pbs_server -t create", pbs_server
>> crashed, so I ran pbs_server under gdb.
>>
>> Best,
>> Kazu
>>
>> sudo gdb /usr/local/sbin/pbs_server
>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>> Copyright (C) 2016 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.h
>> tml>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-linux-gnu".
>> Type "show configuration" for configuration details.
>> For bug reporting instructions, please see:
>> <http://www.gnu.org/software/gdb/bugs/>.
>> Find the GDB manual and other documentation resources online at:
>> <http://www.gnu.org/software/gdb/documentation/>.
>> For help, type "help".
>> Type "apropos word" to search for commands related to "word"...
>> Reading symbols from /usr/local/sbin/pbs_server...done.
>> (gdb) r -D
>> Starting program: /usr/local/sbin/pbs_server -D
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>> ad_db.so.1".
>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>> [New Thread 0x7ffff21be700 (LWP 25594)]
>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>
>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation fault.
>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>> __lll_unlock_elision (lock=0x57276c0, private=0) at
>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
>> directory.
>> (gdb) bt
>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>> #1 0x00000000004ac076 in dispatch_timed_task (ptask=0x5727660) at
>> svr_task.c:318
>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at pbsd_main.c:921
>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at u_threadpool.c:318
>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700) at
>> pthread_create.c:333
>> #5 0x00007ffff6165b5d in clone () at ../sysdeps/unix/sysv/linux/x86
>> _64/clone.S:109
>>
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>> ***@gmail.com> wrote:
>>
>>> David and Rick,
>>>
>>> Thank you for the quick response. I will try it later.
>>>
>>> Best,
>>> Kazu
>>>
>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <***@adaptivecomputing.com
>>> > wrote:
>>>
>>>> Actually, Rick just sent me the link. You can download it from here:
>>>> http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>>>>
>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>> ***@adaptivecomputing.com> wrote:
>>>>
>>>>> I can confirm that this bug is fixed in 6.0-dev, and we've made a
>>>>> hotfix for it, 6.0.2.h3. This was caused by a change in the pthread
>>>>> library implementation, so most users will not see this crash, but it
>>>>> appears that those with a newer version of that library will. Rick is
>>>>> going to send instructions for how to grab 6.0.2.h3.
>>>>>
>>>>> David
>>>>>
>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>> ***@gmail.com> wrote:
>>>>>
>>>>>> Thank you, David, for the comment on the backtrace.
>>>>>> I hadn't noticed that until writing this mail,
>>>>>> so I generated the backtrace as described in the Ubuntu wiki.
>>>>>>
>>>>>> I have also attached a gdb backtrace of pbs_server (Torque 6.1-dev).
>>>>>> As I mentioned before, the torque.setup script completed successfully,
>>>>>> but the daemons are unstable.
>>>>>>
>>>>>> Before using gdb, I used following commands.
>>>>>>
>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>> 6.1-dev 6.1-dev
>>>>>>> cd 6.1-dev
>>>>>>> ./autogen.sh
>>>>>>> # build and install torque
>>>>>>> ./configure
>>>>>>> make
>>>>>>> sudo make install
>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>> sudo ldconfig
>>>>>>> # set as services
>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>
>>>>>>> sudo ./torque.setup $USER
>>>>>>> sudo qmgr -c 'p s'
>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed np)
>>>>>>> sudo qterm -t quick
>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>
>>>>>>
>>>>>> trqauthd was not stopped by the last command, so I stopped it by
>>>>>> killing the trqauthd process.
>>>>>> Then I restarted the torque processes under gdb.
>>>>>>
>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>
>>>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee
>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>
>>>>>>
>>>>>> In another terminal, I executed the following commands before
>>>>>> pbs_server crashed.
>>>>>>
>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>> ps aux | grep pbs
>>>>>>> pbsnodes -a
>>>>>>> echo "sleep 30" | qsub
>>>>>>
>>>>>>
>>>>>> The output of the last command was "0.torque-server",
>>>>>> and this command crashed pbs_server in gdb.
>>>>>> I then generated the backtrace.
>>>>>>
>>>>>> Best,
>>>>>> Kazu
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>> ***@gmail.com> wrote:
>>>>>>
>>>>>>> David,
>>>>>>>
>>>>>>> I have attached a gdb backtrace of pbs_server (Torque 6.0.2).
>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>
>>>>>>> I started pbs_server with gdb,
>>>>>>> and executed qmgr from another terminal (see below):
>>>>>>>
>>>>>>> sudo qmgr -c 'p s'
>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>
>>>>>>>
>>>>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kaz
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>
>>>>>>>> Kazu,
>>>>>>>>
>>>>>>>> Can you give us a backtrace for this crash? We have fixed some
>>>>>>>> issues on startup (around mutex management for newer pthread
>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>> seeing is fixed.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> David Beer | Torque Architect
>>>>>>>> Adaptive Computing
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> torqueusers mailing list
>>>>>>>> ***@supercluster.org
>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
Kazuhiro Fujita
2016-11-01 07:19:05 UTC
Permalink
David,

I tested 6.0-dev. It passed the "sudo ./torque.setup $USER" script,
but pbs_server and pbs_sched are unstable, as with 6.1-dev.

Best,
Kazu

Commands executed before running gdb

git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev 6.0-dev
> cd 6.0-dev
> ./autogen.sh
> # build and install torque
> ./configure
> make
> sudo make install
> # Set the correct name of the server
> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
> # configure and start trqauthd
> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
> sudo update-rc.d trqauthd defaults
> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
> sudo ldconfig
> sudo service trqauthd start
> # Initialize serverdb by executing the torque.setup script
> sudo ./torque.setup $USER
>
> sudo qmgr -c 'p s'
> sudo qterm
> sudo /etc/init.d/trqauthd stop
> # set nodes
> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
> tee /var/spool/torque/server_priv/nodes
> sudo nano /var/spool/torque/server_priv/nodes
> # set the head node
> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
> # configure the other daemons
> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
> sudo update-rc.d pbs_server defaults
> sudo update-rc.d pbs_sched defaults
> sudo update-rc.d pbs_mom defaults
> # start torque daemons
> sudo service trqauthd start
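
[Editor's note] For reference, the two single-line configuration files written by the commands above should end up looking roughly like this (the hostname "torque-server" and np=40 are example values, not prescriptions):

```
# /var/spool/torque/server_priv/nodes -- one line per compute host
torque-server np=40

# /var/spool/torque/mom_priv/config -- points each MOM at the head node
$pbsserver torque-server
```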


Execution of gdb

> sudo gdb /usr/local/sbin/pbs_server


Commands executed in another terminal

> sudo /etc/init.d/pbs_mom start
> sudo /etc/init.d/pbs_sched start
> pbsnodes -a
> echo "sleep 30" | qsub


The last command did not cause a crash of pbs_server. The backtrace is
described below.
$ sudo gdb /usr/local/sbin/pbs_server
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html
>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/sbin/pbs_server...done.
(gdb) r -D
Starting program: /usr/local/sbin/pbs_server -D
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff39c1700 (LWP 5024)]
pbs_server is up (version - 6.0, port - 15001)
[New Thread 0x7ffff31c0700 (LWP 5025)]
PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp
connection - connect() failed [rc = -2] [addr = 10.0.0.249:15003]
PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom hierarchy to
host Dual-E52630v4:15003
[New Thread 0x7ffff29bf700 (LWP 5026)]
[New Thread 0x7ffff21be700 (LWP 5027)]
[New Thread 0x7ffff19bd700 (LWP 5028)]
[New Thread 0x7ffff11bc700 (LWP 5029)]
[New Thread 0x7ffff09bb700 (LWP 5030)]
[Thread 0x7ffff09bb700 (LWP 5030) exited]
[New Thread 0x7ffff09bb700 (LWP 5031)]
[New Thread 0x7fffe3fff700 (LWP 5109)]
[New Thread 0x7fffe37fe700 (LWP 5113)]
[New Thread 0x7fffe29cf700 (LWP 5121)]
[Thread 0x7fffe29cf700 (LWP 5121) exited]
^C
Thread 1 "pbs_server" received signal SIGINT, Interrupt.
0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
84 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) backtrace full
#0 0x00007ffff612a75d in nanosleep () at
../sysdeps/unix/syscall-template.S:84
No locals.
#1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
../sysdeps/posix/usleep.c:32
ts = {tv_sec = 0, tv_nsec = 250000000}
#2 0x000000000046123a in main_loop () at pbsd_main.c:1454
state = 3
waittime = 5
pjob = 0x313a74
iter = 0x0
when = 1477984074
log = 0
scheduling = 1
sched_iteration = 600
time_now = 1477984190
update_loglevel = 1477984198
log_buf = "Server Ready, pid = 5020, loglevel=0", '\000' <repeats
140 times>,
"c\000\000\000\000\000\000\000\000\020\000\000\000\000\000\000\240\265\377\377\377\177",
'\000' <repeats 26 times>...
sem_val = 5228929
__func__ = "main_loop"
#3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
pbsd_main.c:1935
i = 2
rc = 0
local_errno = 0
lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
<repeats 983 times>
EMsg = '\000' <repeats 1023 times>
tmpLine = "Using ports Server:15001 Scheduler:15004 MOM:15002
(server: 'Dual-E52630v4')", '\000' <repeats 945 times>
log_buf = "Using ports Server:15001 Scheduler:15004 MOM:15002
(server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
server_name_file_port = 15001
fp = 0x51095f0
(gdb) info registers
rax 0xfffffffffffffdfc -516
rbx 0x5 5
rcx 0x7ffff612a75d 140737321805661
rdx 0x0 0
rsi 0x0 0
rdi 0x7fffffffb3f0 140737488335856
rbp 0x7fffffffe4b0 0x7fffffffe4b0
rsp 0x7fffffffc870 0x7fffffffc870
r8 0x0 0
r9 0x4000001 67108865
r10 0x1 1
r11 0x293 659
r12 0x4260b0 4350128
r13 0x7fffffffe590 140737488348560
r14 0x0 0
r15 0x0 0
rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
eflags 0x293 [ CF AF SF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb) x/16i $pc
=> 0x461fb6 <main(int, char**)+2388>: callq 0x494762 <shutdown_ack()>
0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
0x461fc0 <main(int, char**)+2398>: callq 0x4250b0 <***@plt>
0x461fc5 <main(int, char**)+2403>: mov 0x70f55c(%rip),%rdx #
0xb71528 <msg_svrdown>
0x461fcc <main(int, char**)+2410>: mov 0x70eeed(%rip),%rax #
0xb70ec0 <msg_daemonname>
0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
0x461fe3 <main(int, char**)+2433>: callq 0x425840
<***@plt>
0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
0x461fed <main(int, char**)+2443>: callq 0x4269c9 <acct_close(bool)>
0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
0x461ff7 <main(int, char**)+2453>: callq 0x425a00
<***@plt>
0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
0x462001 <main(int, char**)+2463>: callq 0x424db0 <***@plt>
(gdb) thread apply all backtrace

Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00000000004fc19c in work_thread (a=0x5110710) at u_threadpool.c:272
#2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
pthread_create.c:333
#3 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00000000004fc19c in work_thread (a=0x5110710) at u_threadpool.c:272
#2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
pthread_create.c:333
#3 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00000000004fc19c in work_thread (a=0x5110810) at u_threadpool.c:272
#2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
pthread_create.c:333
#3 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
#0 0x00007ffff612a75d in nanosleep () at
../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff612a6aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at
req_jobobit.c:3759
#3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
pthread_create.c:333
#4 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
#0 0x00007ffff612a75d in nanosleep () at
../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff612a6aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at
job_recycler.c:216
#3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
pthread_create.c:333
#4 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
#0 0x00007ffff612a75d in nanosleep () at
../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff612a6aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at
exiting_jobs.c:319
#3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
pthread_create.c:333
#4 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
#0 0x00007ffff612a75d in nanosleep () at
../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff612a6aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2 0x000000000046078d in handle_queue_routing_retries (vp=0x0) at
pbsd_main.c:1079
#3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
pthread_create.c:333
#4 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
#0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff750a276 in start_listener_addrinfo (host_name=0x7ffff31bfaf0
"Dual-E52630v4", server_port=15001, process_meth=0x4c4935
<start_process_pbs_server_port(void*)>)
at ../Libnet/server_core.c:398
#2 0x00000000004608f3 in start_accept_listener (vp=0x0) at pbsd_main.c:1141
#3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
pthread_create.c:333
#4 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00000000004fc19c in work_thread (a=0x5110810) at u_threadpool.c:272
#2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
pthread_create.c:333
---Type <return> to continue, or q <return> to quit---
#3 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
#0 0x00007ffff612a75d in nanosleep () at
../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
../sysdeps/posix/usleep.c:32
#2 0x000000000046123a in main_loop () at pbsd_main.c:1454
#3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
pbsd_main.c:1935
(gdb) quit





>>> #5 0x00007ffff6165b5d in clone () at ../sysdeps/unix/sysv/linux/x86
>>> _64/clone.S:109
>>>
>>>
>>>
>>>
>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>> ***@gmail.com> wrote:
>>>
>>>> David and Rick,
>>>>
>>>> Thank you for the quick response. I will try it later.
>>>>
>>>> Best,
>>>> Kazu
>>>>
>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>> ***@adaptivecomputing.com> wrote:
>>>>
>>>>> Actually, Rick just sent me the link. You can download it from here:
>>>>> http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>>>>>
>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>> ***@adaptivecomputing.com> wrote:
>>>>>
>>>>>> I can confirm that this bug is fixed in 6.0-dev, and we've made a
>>>>>> hotfix for it, 6.0.2.h3. This was caused by a change in the
>>>>>> implementation of the pthread library, so most users will not see this
>>>>>> crash, but if you have a newer version of that library, you will.
>>>>>> Rick is going to send instructions for how to grab 6.0.2.h3.
>>>>>>
>>>>>> David
>>>>>>
>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>> ***@gmail.com> wrote:
>>>>>>
>>>>>>> Thank you, David, for the comment on the backtrace.
>>>>>>> I hadn't noticed that until writing this mail.
>>>>>>> So I captured the backtrace as described in the Ubuntu wiki.
>>>>>>>
>>>>>>> I also attached the backtrace of pbs_server (Torque 6.1-dev) taken
>>>>>>> with gdb.
>>>>>>> As I mentioned before, the torque.setup script executed successfully,
>>>>>>> but pbs_server is unstable.
>>>>>>>
>>>>>>> Before using gdb, I ran the following commands.
>>>>>>>
>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>> 6.1-dev 6.1-dev
>>>>>>>> cd 6.1-dev
>>>>>>>> ./autogen.sh
>>>>>>>> # build and install torque
>>>>>>>> ./configure
>>>>>>>> make
>>>>>>>> sudo make install
>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>> sudo ldconfig
>>>>>>>> # set as services
>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>
>>>>>>>> sudo ./torque.setup $USER
>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed np)
>>>>>>>> sudo qterm -t quick
>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>
>>>>>>>
>>>>>>> trqauthd was not stopped by the last command, so I stopped it by
>>>>>>> killing the trqauthd process.
>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>
>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>
>>>>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee
>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>
>>>>>>>
>>>>>>> In another terminal, I executed the following commands before
>>>>>>> pbs_server crashed.
>>>>>>>
>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>> ps aux | grep pbs
>>>>>>>> pbsnodes -a
>>>>>>>> echo "sleep 30" | qsub
>>>>>>>
>>>>>>>
>>>>>>> The output of the last command was "0.torque-server".
>>>>>>> This command crashed pbs_server under gdb.
>>>>>>> Then I captured the backtrace.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kazu
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>> ***@gmail.com> wrote:
>>>>>>>
>>>>>>>> David,
>>>>>>>>
>>>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2) by gdb.
>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>>
>>>>>>>> I started pbs_server with gdb,
>>>>>>>> and execute qmgr from another terminal. (see below)
>>>>>>>>
>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>
>>>>>>>>
>>>>>>>> After executing qmgr, I pressed Ctrl+C in gdb.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kaz
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>
>>>>>>>>> Kazu,
>>>>>>>>>
>>>>>>>>> Can you give us a backtrace for this crash? We have fixed some
>>>>>>>>> issues on startup (around mutex management for newer pthread
>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>> seeing is fixed.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Dear All,
>>>>>>>>>>
>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS with dual
>>>>>>>>>> E5-2630 v3 chips.
>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>>>>> And I tried to set up Torque on them, but I got stuck at the
>>>>>>>>>> initial setup script.
>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server in the
>>>>>>>>>> initial setup script (torque.setup). (see below)
>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>> Have you ever observed this kind of error?
>>>>>>>>>> If you know of possible solutions, please tell me.
>>>>>>>>>> Any comments would be highly appreciated.
>>>>>>>>>> Would it be better to change the OS to another distribution,
>>>>>>>>>> such as Scientific Linux?
>>>>>>>>>>
>>>>>>>>>> Thank you in advance,
>>>>>>>>>> Kazu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>
>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>> Currently no servers active. Default server will be listed as
>>>>>>>>>>> active server. Error 15133
>>>>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00 pbs_server -t
>>>>>>>>>>> create
>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+ 12:22 0:00
>>>>>>>>>>> grep --color=auto pbs
>>>>>>>>>>
>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>
>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>
>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>> Currently no servers active. Default server will be listed as
>>>>>>>>>>> active server. Error 15133
>>>>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00 pbs_server -t
>>>>>>>>>>> create
>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$ ps
>>>>>>>>>>> aux | grep pbs
>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+ 16:11 0:00
>>>>>>>>>>> grep --color=auto pbs
>>>>>>>>>>
>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>
>>>>>>>>>> Commands used for installation before the setup script
>>>>>>>>>>
>>>>>>>>>>> # build and install torque
>>>>>>>>>>> ./configure
>>>>>>>>>>> make
>>>>>>>>>>> sudo make install
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> # set up as services
>>>>>>>>>>
>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> torqueusers mailing list
>>>>>>>>>> ***@supercluster.org
>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> David Beer | Torque Architect
>>>>>>>>> Adaptive Computing
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> David Beer | Torque Architect
>>>>>> Adaptive Computing
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> David Beer | Torque Architect
>>>>> Adaptive Computing
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> David Beer | Torque Architect
>> Adaptive Computing
>>
>>
>>
>
David Beer
2016-11-01 16:43:16 UTC
Kazu,

Thanks for sticking with us on this. You mentioned that pbs_server did not
crash when you submitted the job, but you said that it and pbs_sched are
"unstable." What do you mean by unstable? Will jobs run? Your gdb output
looks like a pbs_server that isn't busy, but other than that it looks
normal.

David
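To answer "will jobs run" concretely, a small polling helper can watch for a job to enter the run state. A hedged sketch: the qsub/qstat usage in the comment is an assumption, not something verified in this thread; only the generic helper is demonstrated here.

```shell
# Poll a command once per second until it succeeds or the timeout expires.
wait_until() {
    t=$1; shift
    i=0
    while [ "$i" -lt "$t" ]; do
        "$@" >/dev/null 2>&1 && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Against TORQUE one might run (assumption, untested here):
#   jobid=$(echo "sleep 30" | qsub)
#   wait_until 30 sh -c "qstat '$jobid' | grep -q ' R '" || echo "job never ran"

wait_until 5 true && echo "command succeeded"
wait_until 2 false || echo "command timed out"
```

The helper returns 0 as soon as the probed command succeeds, so a hung pbs_server shows up as a timeout rather than an indefinite wait.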

On Tue, Nov 1, 2016 at 1:19 AM, Kazuhiro Fujita <***@gmail.com>
wrote:

> David,
>
> I tested the 6.0-dev. It passed the "sudo ./torque.setup $USER" script,
> but pbs_server and pbs_sched are unstable, as with 6.1-dev.
>
> Best,
> Kazu
>
> Before execution of gdb
>
> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>> 6.0-dev
>> cd 6.0-dev
>> ./autogen.sh
>> # build and install torque
>> ./configure
>> make
>> sudo make install
>> # Set the correct name of the server
>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>> # configure and start trqauthd
>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>> sudo update-rc.d trqauthd defaults
>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>> sudo ldconfig
>> sudo service trqauthd start
>> # Initialize serverdb by executing the torque.setup script
>> sudo ./torque.setup $USER
>>
>> sudo qmgr -c 'p s'
>> sudo qterm
>> sudo /etc/init.d/trqauthd stop
>> # set nodes
>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
>> tee /var/spool/torque/server_priv/nodes
>> sudo nano /var/spool/torque/server_priv/nodes
>> # set the head node
>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>> # configure other deamons
>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>> sudo update-rc.d pbs_server defaults
>> sudo update-rc.d pbs_sched defaults
>> sudo update-rc.d pbs_mom defaults
>> # start torque daemons
>> sudo service trqauthd start
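As a cross-check for the np value written to server_priv/nodes above, `nproc` typically agrees with the `/proc/cpuinfo` grep on Linux. A small sketch (the nodes-file line in the comment is an assumption for a single-node setup):

```shell
# Count logical CPUs two ways; on Linux these normally agree
# (nproc honors CPU affinity, so it can be lower in constrained setups).
np_grep=$(grep -c '^processor' /proc/cpuinfo)
np_nproc=$(nproc)
echo "cpuinfo: $np_grep  nproc: $np_nproc"

# The single-node nodes file would then be written as (assumption):
#   echo "$HOSTNAME np=$np_nproc" | sudo tee /var/spool/torque/server_priv/nodes
```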
>
>
> Execution of gdb
>
>> sudo gdb /usr/local/sbin/pbs_server
>
>
> Commands executed by another terminal
>
>> sudo /etc/init.d/pbs_mom start
>> sudo /etc/init.d/pbs_sched start
>> pbsnodes -a
>> echo "sleep 30" | qsub
>
>
> The last command did not cause a crash of pbs_server. The backtrace is
> described below.
> $ sudo gdb /usr/local/sbin/pbs_server
> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
> Copyright (C) 2016 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.
> html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/local/sbin/pbs_server...done.
> (gdb) r -D
> Starting program: /usr/local/sbin/pbs_server -D
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> [New Thread 0x7ffff39c1700 (LWP 5024)]
> pbs_server is up (version - 6.0, port - 15001)
> [New Thread 0x7ffff31c0700 (LWP 5025)]
> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open
> tcp connection - connect() failed [rc = -2] [addr = 10.0.0.249:15003]
> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom hierarchy
> to host Dual-E52630v4:15003
> [New Thread 0x7ffff29bf700 (LWP 5026)]
> [New Thread 0x7ffff21be700 (LWP 5027)]
> [New Thread 0x7ffff19bd700 (LWP 5028)]
> [New Thread 0x7ffff11bc700 (LWP 5029)]
> [New Thread 0x7ffff09bb700 (LWP 5030)]
> [Thread 0x7ffff09bb700 (LWP 5030) exited]
> [New Thread 0x7ffff09bb700 (LWP 5031)]
> [New Thread 0x7fffe3fff700 (LWP 5109)]
> [New Thread 0x7fffe37fe700 (LWP 5113)]
> [New Thread 0x7fffe29cf700 (LWP 5121)]
> [Thread 0x7fffe29cf700 (LWP 5121) exited]
> ^C
> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
> (gdb) backtrace full
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> No locals.
> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
> ../sysdeps/posix/usleep.c:32
> ts = {tv_sec = 0, tv_nsec = 250000000}
> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
> state = 3
> waittime = 5
> pjob = 0x313a74
> iter = 0x0
> when = 1477984074
> log = 0
> scheduling = 1
> sched_iteration = 600
> time_now = 1477984190
> update_loglevel = 1477984198
> log_buf = "Server Ready, pid = 5020, loglevel=0", '\000' <repeats
> 140 times>, "c\000\000\000\000\000\000\000\000\020\000\000\000\000\
> 000\000\240\265\377\377\377\177", '\000' <repeats 26 times>...
> sem_val = 5228929
> __func__ = "main_loop"
> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
> pbsd_main.c:1935
> i = 2
> rc = 0
> local_errno = 0
> lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
> <repeats 983 times>
> EMsg = '\000' <repeats 1023 times>
> tmpLine = "Using ports Server:15001 Scheduler:15004 MOM:15002
> (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
> log_buf = "Using ports Server:15001 Scheduler:15004 MOM:15002
> (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
> server_name_file_port = 15001
> fp = 0x51095f0
> (gdb) info registers
> rax 0xfffffffffffffdfc -516
> rbx 0x5 5
> rcx 0x7ffff612a75d 140737321805661
> rdx 0x0 0
> rsi 0x0 0
> rdi 0x7fffffffb3f0 140737488335856
> rbp 0x7fffffffe4b0 0x7fffffffe4b0
> rsp 0x7fffffffc870 0x7fffffffc870
> r8 0x0 0
> r9 0x4000001 67108865
> r10 0x1 1
> r11 0x293 659
> r12 0x4260b0 4350128
> r13 0x7fffffffe590 140737488348560
> r14 0x0 0
> r15 0x0 0
> rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
> eflags 0x293 [ CF AF SF IF ]
> cs 0x33 51
> ss 0x2b 43
> ds 0x0 0
> es 0x0 0
> fs 0x0 0
> gs 0x0 0
> (gdb) x/16i $pc
> => 0x461fb6 <main(int, char**)+2388>: callq 0x494762 <shutdown_ack()>
> 0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
> 0x461fc0 <main(int, char**)+2398>: callq 0x4250b0 <***@plt>
> 0x461fc5 <main(int, char**)+2403>: mov 0x70f55c(%rip),%rdx #
> 0xb71528 <msg_svrdown>
> 0x461fcc <main(int, char**)+2410>: mov 0x70eeed(%rip),%rax #
> 0xb70ec0 <msg_daemonname>
> 0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
> 0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
> 0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
> 0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
> 0x461fe3 <main(int, char**)+2433>: callq 0x425840
> <***@plt>
> 0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
> 0x461fed <main(int, char**)+2443>: callq 0x4269c9 <acct_close(bool)>
> 0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
> 0x461ff7 <main(int, char**)+2453>: callq 0x425a00
> <***@plt>
> 0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
> 0x462001 <main(int, char**)+2463>: callq 0x424db0 <***@plt>
> (gdb) thread apply all backtrace
>
> Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
> x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc19c in work_thread (a=0x5110710) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
> x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc19c in work_thread (a=0x5110710) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
> x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc19c in work_thread (a=0x5110810) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at
> req_jobobit.c:3759
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at
> job_recycler.c:216
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at
> exiting_jobs.c:319
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x000000000046078d in handle_queue_routing_retries (vp=0x0) at
> pbsd_main.c:1079
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff750a276 in start_listener_addrinfo
> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
> process_meth=0x4c4935 <start_process_pbs_server_port(void*)>)
> at ../Libnet/server_core.c:398
> #2 0x00000000004608f3 in start_accept_listener (vp=0x0) at
> pbsd_main.c:1141
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
> x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc19c in work_thread (a=0x5110810) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
> pthread_create.c:333
> ---Type <return> to continue, or q <return> to quit---
> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
> ../sysdeps/posix/usleep.c:32
> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
> pbsd_main.c:1935
> (gdb) quit
>
>
>
>
>
> On Fri, Oct 28, 2016 at 12:43 PM, Kazuhiro Fujita <
> ***@gmail.com> wrote:
>
>> Thank you for your comments.
>> I will try the 6.0-dev next week.
>>
>> Best,
>> Kazu
>>
>> On Fri, Oct 28, 2016 at 5:34 AM, David Beer <***@adaptivecomputing.com>
>> wrote:
>>
>>> I wonder if that fix wasn't placed in the hotfix. Is there any chance
>>> you can try installing 6.0-dev on your system (via github) to see if it's
>>> resolved. For the record, my Ubuntu 16 system doesn't give me this error,
>>> or I'd try it myself. For whatever reason, none of our test cluster
>>> machines (Cent & Redhat 6-7, SLES 11-12) experience this either. We did
>>> have another user that experiences it on a test cluster, but not being able
>>> to reproduce it has made it harder to track down.
>>>
>>> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
>>> ***@gmail.com> wrote:
>>>
>>>> David,
>>>>
>>>> I tried the 6.0.2.h3. But, it seems that the other issue is still
>>>> remained.
>>>> After I initialized serverdb by "sudo pbs_server -t create", pbs_server
>>>> crashed.
>>>> Then, I used gdb with pbs_server.
>>>>
>>>> Best,
>>>> Kazu
>>>>
>>>> sudo gdb /usr/local/sbin/pbs_server
>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>> License GPLv3+: GNU GPL version 3 or later <
>>>> http://gnu.org/licenses/gpl.html>
>>>> This is free software: you are free to change and redistribute it.
>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>> copying"
>>>> and "show warranty" for details.
>>>> This GDB was configured as "x86_64-linux-gnu".
>>>> Type "show configuration" for configuration details.
>>>> For bug reporting instructions, please see:
>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>> Find the GDB manual and other documentation resources online at:
>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>> For help, type "help".
>>>> Type "apropos word" to search for commands related to "word"...
>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>> (gdb) r -D
>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>> [Thread debugging using libthread_db enabled]
>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>> ad_db.so.1".
>>>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>>>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>>>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>>>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>>>> [New Thread 0x7ffff21be700 (LWP 25594)]
>>>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>>>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>>>
>>>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation fault.
>>>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>>>> __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
>>>> directory.
>>>> (gdb) bt
>>>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>> #1 0x00000000004ac076 in dispatch_timed_task (ptask=0x5727660) at
>>>> svr_task.c:318
>>>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at pbsd_main.c:921
>>>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at
>>>> u_threadpool.c:318
>>>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700) at
>>>> pthread_create.c:333
>>>> #5 0x00007ffff6165b5d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>>> ***@gmail.com> wrote:
>>>>
>>>>> David and Rick,
>>>>>
>>>>> Thank you for the quick response. I will try it later.
>>>>>
>>>>> Best,
>>>>> Kazu
>>>>>
>>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>>> ***@adaptivecomputing.com> wrote:
>>>>>
>>>>>> Actually, Rick just sent me the link. You can download it from here:
>>>>>> http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>>>>>>
>>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>
>>>>>>> I can confirm that this bug is fixed in 6.0-dev, and we've made a
>>>>>>> hotfix for it, 6.0.2.h3. This was caused because of a change in the
>>>>>>> implementation for the pthread library, so most will not see this crash,
>>>>>>> but it appears that if you have a newer version of that library, then you
>>>>>>> will get it. Rick is going to send instructions for how to grab 6.0.2.h3.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>>> ***@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you David for the comment on the backtrace.
>>>>>>>> I haven't noticed that until writing this mail.
>>>>>>>> So, I used backtrace as written in the Ubuntu wiki.
>>>>>>>>
>>>>>>>> I also attached the backtrace of pbs_server (Torque 6.1-dev) by gdb.
>>>>>>>> As I mentioned before torque.setup script was successfully
>>>>>>>> executed, but unstable.
>>>>>>>>
>>>>>>>> Before using gdb, I used following commands.
>>>>>>>>
>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>> 6.1-dev 6.1-dev
>>>>>>>>> cd 6.1-dev
>>>>>>>>> ./autogen.sh
>>>>>>>>> # build and install torque
>>>>>>>>> ./configure
>>>>>>>>> make
>>>>>>>>> sudo make install
>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>> sudo ldconfig
>>>>>>>>> # set as services
>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>
>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed np)
>>>>>>>>> sudo qterm -t quick
>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>
>>>>>>>>
>>>>>>>> trqauthd was not stopped by the last command, so I stopped it by
>>>>>>>> killing the trqauthd process.
>>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>>
>>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>>
>>>>>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee
>>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>>
>>>>>>>>
>>>>>>>> In another terminal, I executed the following commands before
>>>>>>>> pbs_server crashed.
>>>>>>>>
>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>> ps aux | grep pbs
>>>>>>>>> pbsnodes -a
>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>
>>>>>>>>
>>>>>>>> The output of the last command is "0.torque-server".
>>>>>>>> And this command crashed pbs_server in gdb.
>>>>>>>> Then, I made the backtrace.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kazu
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> David,
>>>>>>>>>
>>>>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2) by gdb.
>>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>>>
>>>>>>>>> I started pbs_server with gdb,
>>>>>>>>> and executed qmgr from another terminal (see below).
>>>>>>>>>
>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Kaz
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>
>>>>>>>>>> Kazu,
>>>>>>>>>>
>>>>>>>>>> Can you give us a backtrace for this crash? We have fixed some
>>>>>>>>>> issues on startup (around mutex management for newer pthread
>>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>>> seeing is fixed.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear All,
>>>>>>>>>>>
>>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS with dual
>>>>>>>>>>> E5-2630 v3 chips.
>>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>>>>>> And I tried to set up Torque on them, but I got stuck in the
>>>>>>>>>>> initial setup script.
>>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server in the
>>>>>>>>>>> initial setup script (torque.setup). (see below)
>>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>>> Have you ever observed this kind of error?
>>>>>>>>>>> And if you know possible solutions, please tell me.
>>>>>>>>>>> Any comments will be highly appreciated.
>>>>>>>>>>> Would it be better to change the OS to another distribution, such
>>>>>>>>>>> as Scientific Linux?
>>>>>>>>>>>
>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>> Kazu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>>
>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>> Currently no servers active. Default server will be listed as
>>>>>>>>>>>> active server. Error 15133
>>>>>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00 pbs_server -t
>>>>>>>>>>>> create
>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+ 12:22
>>>>>>>>>>>> 0:00 grep --color=auto pbs
>>>>>>>>>>>
>>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>>
>>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>>
>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>> Currently no servers active. Default server will be listed as
>>>>>>>>>>>> active server. Error 15133
>>>>>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00 pbs_server -t
>>>>>>>>>>>> create
>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$ ps
>>>>>>>>>>>> aux | grep pbs
>>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+ 16:11
>>>>>>>>>>>> 0:00 grep --color=auto pbs
>>>>>>>>>>>
>>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>>
>>>>>>>>>>> Commands used for installation before the setup script
>>>>>>>>>>>
>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>> ./configure
>>>>>>>>>>>> make
>>>>>>>>>>>> sudo make install
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> # set up as services
>>>>>>>>>>>
>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> torqueusers mailing list
>>>>>>>>>>> ***@supercluster.org
>>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>> Adaptive Computing
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>
>
>


--
David Beer | Torque Architect
Adaptive Computing
Kazuhiro Fujita
2016-11-08 09:06:26 UTC
Permalink
Hi David,

I reinstalled 6.0-dev today from GitHub, and observed slightly different
behavior, I think.
I used the "service" command to start daemons this time.

Best,
Kazu

Before the crash

> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
> 6.0-dev
> cd 6.0-dev
> ./autogen.sh
> # build and install torque
> ./configure
> make
> sudo make install
> # Set the correct name of the server
> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
> # configure and start trqauthd
> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
> sudo update-rc.d trqauthd defaults
> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
> sudo ldconfig
> sudo service trqauthd start
> # Initialize serverdb by executing the torque.setup script
> sudo ./torque.setup $USER
> sudo qmgr -c 'p s'
> sudo qterm
> sudo service trqauthd stop
> ps aux | grep pbs
> ps aux | grep trq
> # set nodes
> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
> tee /var/spool/torque/server_priv/nodes
> sudo nano /var/spool/torque/server_priv/nodes
> # set the head node
> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
> # configure other daemons
> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
> sudo update-rc.d pbs_server defaults
> sudo update-rc.d pbs_sched defaults
> sudo update-rc.d pbs_mom defaults
> # start torque daemons
> sudo service trqauthd start
> sudo service pbs_server start
> sudo service pbs_sched start
> sudo service pbs_mom start
> # check configuration of computation nodes
> pbsnodes -a


I checked torque processes by "ps aux | grep pbs" and "ps aux | grep trq"
several times.
After "pbsnodes -a", everything seemed OK.
But the next qsub command seems to trigger a crash of "pbs_server" and
"pbs_sched".

$ ps aux | grep trq
> root 9682 0.0 0.0 109112 3632 ? S 17:39 0:00
> /usr/local/sbin/trqauthd
> comp_ad+ 9842 0.0 0.0 15236 936 pts/8 S+ 17:40 0:00 grep
> --color=auto trq
> $ ps aux | grep pbs
> root 9720 0.0 0.0 695140 25760 ? Sl 17:39 0:00
> /usr/local/sbin/pbs_server
> root 9771 0.0 0.0 37996 4940 ? Ss 17:39 0:00
> /usr/local/sbin/pbs_sched
> root 9814 0.2 0.2 173776 136692 ? SLsl 17:40 0:00
> /usr/local/sbin/pbs_mom
> comp_ad+ 9844 0.0 0.0 15236 1012 pts/8 S+ 17:40 0:00 grep
> --color=auto pbs
> $ echo "sleep 30" | qsub
> 0.Dual-E52630v4
> $ ps aux | grep pbs
> root 9814 0.1 0.2 173776 136692 ? SLsl 17:40 0:00
> /usr/local/sbin/pbs_mom
> comp_ad+ 9855 0.0 0.0 15236 928 pts/8 S+ 17:41 0:00 grep
> --color=auto pbs
> $ ps aux | grep trq
> root 9682 0.0 0.0 109112 4144 ? S 17:39 0:00
> /usr/local/sbin/trqauthd
> comp_ad+ 9860 0.0 0.0 15236 1092 pts/8 S+ 17:41 0:00 grep
> --color=auto trq


Then, I stopped the remaining processes,

sudo service pbs_mom stop
> sudo service trqauthd stop


and started "trqauthd" again and ran "pbs_server" under gdb. "pbs_server"
crashed in gdb without any other commands.

sudo service trqauthd start
> sudo gdb /usr/local/sbin/pbs_server


sudo gdb /usr/local/sbin/pbs_server
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/sbin/pbs_server...done.
(gdb) r -D
Starting program: /usr/local/sbin/pbs_server -D
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
__lll_unlock_elision (lock=0x512f1b0, private=0) at
../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
directory.
(gdb) bt
#0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
<svr_enquejob(job*, int, char const*, bool, bool)::__func__>
"svr_enquejob", msg=0x524554 "1", logging=0)
at svr_jobfunc.c:4011
#2 0x000000000049db0c in svr_enquejob (pjob=0x512d880, has_sv_qs_mutex=1,
prev_job_id=0x0, have_reservation=false, being_recovered=true) at
svr_jobfunc.c:421
#3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880, change_state=1)
at pbsd_init.c:2824
#4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
pbsd_init.c:2558
#5 0x0000000000459483 in handle_job_recovery (type=1) at pbsd_init.c:1803
#6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
pbsd_init.c:2100
#7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
#8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
pbsd_main.c:1898
(gdb) backtrace full
#0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
No locals.
#1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
<svr_enquejob(job*, int, char const*, bool, bool)::__func__>
"svr_enquejob", msg=0x524554 "1", logging=0)
at svr_jobfunc.c:4011
rc = 0
err_msg = 0x0
stub_msg = "no pos"
__func__ = "unlock_ji_mutex"
#2 0x000000000049db0c in svr_enquejob (pjob=0x512d880, has_sv_qs_mutex=1,
prev_job_id=0x0, have_reservation=false, being_recovered=true) at
svr_jobfunc.c:421
pattrjb = 0x7fffffff4a10
pdef = 0x4
pque = 0x0
rc = 0
log_buf = '\000' <repeats 24 times>,
"\030\000\000\000\060\000\000\000PU\377\377\377\177\000\000\220T\377\377\377\177",
'\000' <repeats 50 times>,
"\003\000\000\000\000\000\000\000#\000\000\000\000\000\000\000pO\377\377\377\177",
'\000' <repeats 26 times>,
"\221\260\000\000\000\200\377\377oO\377\377\377\177\000\000H+B\366\377\177\000\000p+B\366\377\177\000\000\200O\377\377\377\177\000\000\201\260\000\000\000\200\377\377\177O\377\377\377\177",
'\000' <repeats 18 times>...
time_now = 1478594788
job_id =
"0.Dual-E52630v4\000\000\000\000\000\000\000\000\000\362\377\377\377\377\377\377\377\340J\377\377\377\177\000\000\060L\377\377\377\177\000\000\001\000\000\000\000\000\000\000\244\201\000\000\001\000\000\000\030\354\377\367\377\177\000\***@L\377\377\377\177\000\000\000\000\000\000\005\000\000\220\r\000\000\000\000\000\000\000k\022j\365\377\177\000\000\031J\377\377\377\177\000\000\201n\376\017\000\000\000\000\\\216!X\000\000\000\000_#\343+\000\000\000\000\\\216!X\000\000\000\000\207\065],",
'\000' <repeats 36 times>,
"k\022j\365\377\177\000\000\300K\377\377\377\177\000\000\000\000\000\000\000\000\000\000"...
queue_name = "batch\000\377\377\240\340\377\367\377\177\000"
total_jobs = 0
user_jobs = 0
array_jobs = 0
__func__ = "svr_enquejob"
que_mgr = {unlock_on_exit = 160, locked = 75, mutex_valid = 255,
managed_mutex = 0x7ffff7ddccda <open_path+474>}
#3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880, change_state=1)
at pbsd_init.c:2824
newstate = 0
newsubstate = 0
rc = 0
log_buf = "pbsd_init_reque:1", '\000' <repeats 1063 times>...
__func__ = "pbsd_init_reque"
#4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
pbsd_init.c:2558
d = 0
rc = 0
time_now = 1478594788
log_buf = '\000' <repeats 2112 times>...
local_errno = 0
job_id = '\000' <repeats 1016 times>...
job_atr_hold = 0
job_exit_status = 0
__func__ = "pbsd_init_job"
#5 0x0000000000459483 in handle_job_recovery (type=1) at pbsd_init.c:1803
pjob = 0x512d880
Index = 0
JobArray_iter = {first = "0.Dual-E52630v4", second = }
log_buf = "14 total files read from
disk\000\000\000\000\000\000\000\001\000\000\000\320\316\022\005\000\000\000\000\220N\022\005",
'\000' <repeats 12 times>, "Expected 1, recovered 1 queues", '\000'
<repeats 1330 times>...
rc = 0
job_rc = 0
logtype = 0
pdirent = 0x0
pdirent_sub = 0x0
dir = 0x5124e90
dir_sub = 0x0
had = 0
pjob = 0x0
time_now = 1478594788
---Type <return> to continue, or q <return> to quit---
basen = '\000' <repeats 1088 times>...
use_jobs_subdirs = 0
__func__ = "handle_job_recovery"
#6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
pbsd_init.c:2100
rc = 0
tmp_rc = 1974134615
#7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
ret = 0
gid = 0
log_buf = "pbsd_init:1", '\000' <repeats 997 times>...
__func__ = "pbsd_init"
#8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
pbsd_main.c:1898
i = 2
rc = 0
local_errno = 0
lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
<repeats 983 times>
EMsg = '\000' <repeats 1023 times>
tmpLine = "Server Dual-E52630v4 started, initialization type = 1",
'\000' <repeats 970 times>
log_buf = "Server Dual-E52630v4 started, initialization type = 1",
'\000' <repeats 1139 times>...
server_name_file_port = 15001
fp = 0x51095f0
(gdb) info registers
rax 0x0 0
rbx 0x6 6
rcx 0x0 0
rdx 0x512f1b0 85127600
rsi 0x0 0
rdi 0x512f1b0 85127600
rbp 0x7fffffffe4b0 0x7fffffffe4b0
rsp 0x7fffffffc870 0x7fffffffc870
r8 0x0 0
r9 0x7fffffff57a2 140737488312226
r10 0x513c800 85182464
r11 0x7ffff61e6128 140737322574120
r12 0x4260b0 4350128
r13 0x7fffffffe590 140737488348560
r14 0x0 0
r15 0x0 0
rip 0x461f29 0x461f29 <main(int, char**)+2183>
eflags 0x10246 [ PF ZF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb) x/16i $pc
=> 0x461f29 <main(int, char**)+2183>: test %eax,%eax
0x461f2b <main(int, char**)+2185>: setne %al
0x461f2e <main(int, char**)+2188>: test %al,%al
0x461f30 <main(int, char**)+2190>: je 0x461f55 <main(int,
char**)+2227>
0x461f32 <main(int, char**)+2192>: mov 0x70efc7(%rip),%rax #
0xb70f00 <msg_daemonname>
0x461f39 <main(int, char**)+2199>: mov $0x51bab2,%edx
0x461f3e <main(int, char**)+2204>: mov %rax,%rsi
0x461f41 <main(int, char**)+2207>: mov $0xffffffff,%edi
0x461f46 <main(int, char**)+2212>: callq 0x425420 <***@plt>
0x461f4b <main(int, char**)+2217>: mov $0x3,%edi
0x461f50 <main(int, char**)+2222>: callq 0x425680 <***@plt>
0x461f55 <main(int, char**)+2227>: mov 0x71021d(%rip),%esi #
0xb72178 <pbs_mom_port>
0x461f5b <main(int, char**)+2233>: mov 0x710227(%rip),%ecx #
0xb72188 <pbs_scheduler_port>
0x461f61 <main(int, char**)+2239>: mov 0x710225(%rip),%edx #
0xb7218c <pbs_server_port_dis>
0x461f67 <main(int, char**)+2245>: lea -0x1400(%rbp),%rax
0x461f6e <main(int, char**)+2252>: mov $0xb739c0,%r9d
(gdb) thread apply all backtrace

Thread 1 (Thread 0x7ffff7fd5740 (LWP 10004)):
#0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
<svr_enquejob(job*, int, char const*, bool, bool)::__func__>
"svr_enquejob", msg=0x524554 "1", logging=0)
at svr_jobfunc.c:4011
#2 0x000000000049db0c in svr_enquejob (pjob=0x512d880, has_sv_qs_mutex=1,
prev_job_id=0x0, have_reservation=false, being_recovered=true) at
svr_jobfunc.c:421
#3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880, change_state=1)
at pbsd_init.c:2824
#4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
pbsd_init.c:2558
#5 0x0000000000459483 in handle_job_recovery (type=1) at pbsd_init.c:1803
#6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
pbsd_init.c:2100
#7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
#8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
pbsd_main.c:1898
(gdb) quit
A debugging session is active.

Inferior 1 [process 10004] will be killed.

Quit anyway? (y or n) y







On Wed, Nov 2, 2016 at 1:43 AM, David Beer <***@adaptivecomputing.com>
wrote:

> Kazu,
>
> Thanks for sticking with us on this. You mentioned that pbs_server did not
> crash when you submitted the job, but you said that it and pbs_sched are
> "unstable." What do you mean by unstable? Will jobs run? Your gdb output
> looks like a pbs_server that isn't busy, but other than that it looks
> normal.
>
> David
>
> On Tue, Nov 1, 2016 at 1:19 AM, Kazuhiro Fujita <***@gmail.com
> > wrote:
>
>> David,
>>
>> I tested the 6.0-dev. It passed the "sudo ./torque.setup $USER" script,
>> but pbs_server and pbs_sched are unstable like 6.1-dev.
>>
>> Best,
>> Kazu
>>
>> Before execution of gdb
>>
>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>>> 6.0-dev
>>> cd 6.0-dev
>>> ./autogen.sh
>>> # build and install torque
>>> ./configure
>>> make
>>> sudo make install
>>> # Set the correct name of the server
>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>> # configure and start trqauthd
>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>> sudo update-rc.d trqauthd defaults
>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>> sudo ldconfig
>>> sudo service trqauthd start
>>> # Initialize serverdb by executing the torque.setup script
>>> sudo ./torque.setup $USER
>>>
>>> sudo qmgr -c 'p s'
>>> sudo qterm
>>> sudo /etc/init.d/trqauthd stop
>>> # set nodes
>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
>>> tee /var/spool/torque/server_priv/nodes
>>> sudo nano /var/spool/torque/server_priv/nodes
>>> # set the head node
>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>>> # configure other daemons
>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>> sudo update-rc.d pbs_server defaults
>>> sudo update-rc.d pbs_sched defaults
>>> sudo update-rc.d pbs_mom defaults
>>> # start torque daemons
>>> sudo service trqauthd start
>>
>>
>> Execution of gdb
>>
>>> sudo gdb /usr/local/sbin/pbs_server
>>
>>
>> Commands executed by another terminal
>>
>>> sudo /etc/init.d/pbs_mom start
>>> sudo /etc/init.d/pbs_sched start
>>> pbsnodes -a
>>> echo "sleep 30" | qsub
>>
>>
>> The last command did not cause a crash of pbs_server. The backtrace is
>> described below.
>> $ sudo gdb /usr/local/sbin/pbs_server
>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>> Copyright (C) 2016 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-linux-gnu".
>> Type "show configuration" for configuration details.
>> For bug reporting instructions, please see:
>> <http://www.gnu.org/software/gdb/bugs/>.
>> Find the GDB manual and other documentation resources online at:
>> <http://www.gnu.org/software/gdb/documentation/>.
>> For help, type "help".
>> Type "apropos word" to search for commands related to "word"...
>> Reading symbols from /usr/local/sbin/pbs_server...done.
>> (gdb) r -D
>> Starting program: /usr/local/sbin/pbs_server -D
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>> [New Thread 0x7ffff39c1700 (LWP 5024)]
>> pbs_server is up (version - 6.0, port - 15001)
>> [New Thread 0x7ffff31c0700 (LWP 5025)]
>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open
>> tcp connection - connect() failed [rc = -2] [addr = 10.0.0.249:15003]
>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom hierarchy
>> to host Dual-E52630v4:15003
>> [New Thread 0x7ffff29bf700 (LWP 5026)]
>> [New Thread 0x7ffff21be700 (LWP 5027)]
>> [New Thread 0x7ffff19bd700 (LWP 5028)]
>> [New Thread 0x7ffff11bc700 (LWP 5029)]
>> [New Thread 0x7ffff09bb700 (LWP 5030)]
>> [Thread 0x7ffff09bb700 (LWP 5030) exited]
>> [New Thread 0x7ffff09bb700 (LWP 5031)]
>> [New Thread 0x7fffe3fff700 (LWP 5109)]
>> [New Thread 0x7fffe37fe700 (LWP 5113)]
>> [New Thread 0x7fffe29cf700 (LWP 5121)]
>> [Thread 0x7fffe29cf700 (LWP 5121) exited]
>> ^C
>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>> (gdb) backtrace full
>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
>> No locals.
>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>> ../sysdeps/posix/usleep.c:32
>> ts = {tv_sec = 0, tv_nsec = 250000000}
>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>> state = 3
>> waittime = 5
>> pjob = 0x313a74
>> iter = 0x0
>> when = 1477984074
>> log = 0
>> scheduling = 1
>> sched_iteration = 600
>> time_now = 1477984190
>> update_loglevel = 1477984198
>> log_buf = "Server Ready, pid = 5020, loglevel=0", '\000' <repeats
>> 140 times>, "c\000\000\000\000\000\000\000\000\020\000\000\000\000\000\000\240\265\377\377\377\177",
>> '\000' <repeats 26 times>...
>> sem_val = 5228929
>> __func__ = "main_loop"
>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>> pbsd_main.c:1935
>> i = 2
>> rc = 0
>> local_errno = 0
>> lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
>> <repeats 983 times>
>> EMsg = '\000' <repeats 1023 times>
>> tmpLine = "Using ports Server:15001 Scheduler:15004 MOM:15002
>> (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>> log_buf = "Using ports Server:15001 Scheduler:15004 MOM:15002
>> (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>> server_name_file_port = 15001
>> fp = 0x51095f0
>> (gdb) info registers
>> rax 0xfffffffffffffdfc -516
>> rbx 0x5 5
>> rcx 0x7ffff612a75d 140737321805661
>> rdx 0x0 0
>> rsi 0x0 0
>> rdi 0x7fffffffb3f0 140737488335856
>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>> rsp 0x7fffffffc870 0x7fffffffc870
>> r8 0x0 0
>> r9 0x4000001 67108865
>> r10 0x1 1
>> r11 0x293 659
>> r12 0x4260b0 4350128
>> r13 0x7fffffffe590 140737488348560
>> r14 0x0 0
>> r15 0x0 0
>> rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
>> eflags 0x293 [ CF AF SF IF ]
>> cs 0x33 51
>> ss 0x2b 43
>> ds 0x0 0
>> es 0x0 0
>> fs 0x0 0
>> gs 0x0 0
>> (gdb) x/16i $pc
>> => 0x461fb6 <main(int, char**)+2388>: callq 0x494762 <shutdown_ack()>
>> 0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
>> 0x461fc0 <main(int, char**)+2398>: callq 0x4250b0 <***@plt>
>> 0x461fc5 <main(int, char**)+2403>: mov 0x70f55c(%rip),%rdx
>> # 0xb71528 <msg_svrdown>
>> 0x461fcc <main(int, char**)+2410>: mov 0x70eeed(%rip),%rax
>> # 0xb70ec0 <msg_daemonname>
>> 0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
>> 0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
>> 0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
>> 0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
>> 0x461fe3 <main(int, char**)+2433>: callq 0x425840
>> <***@plt>
>> 0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
>> 0x461fed <main(int, char**)+2443>: callq 0x4269c9 <acct_close(bool)>
>> 0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
>> 0x461ff7 <main(int, char**)+2453>: callq 0x425a00
>> <***@plt>
>> 0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
>> 0x462001 <main(int, char**)+2463>: callq 0x424db0 <***@plt>
>> (gdb) thread apply all backtrace
>>
>> Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at u_threadpool.c:272
>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>> pthread_create.c:333
>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>
>> Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at u_threadpool.c:272
>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>> pthread_create.c:333
>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>
>> Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at u_threadpool.c:272
>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>> pthread_create.c:333
>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>> _64/clone.S:109
>>
>> Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>> te.S:84
>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>> ../sysdeps/posix/sleep.c:55
>> #2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at
>> req_jobobit.c:3759
>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>> pthread_create.c:333
>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>> _64/clone.S:109
>>
>> Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>> te.S:84
>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>> ../sysdeps/posix/sleep.c:55
>> #2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at
>> job_recycler.c:216
>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>> pthread_create.c:333
>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>> _64/clone.S:109
>>
>> Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>> te.S:84
>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>> ../sysdeps/posix/sleep.c:55
>> #2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at
>> exiting_jobs.c:319
>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>> pthread_create.c:333
>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>> _64/clone.S:109
>>
>> Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>> te.S:84
>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>> ../sysdeps/posix/sleep.c:55
>> #2 0x000000000046078d in handle_queue_routing_retries (vp=0x0) at
>> pbsd_main.c:1079
>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>> pthread_create.c:333
>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>> _64/clone.S:109
>>
>> Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
>> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-templa
>> te.S:84
>> #1 0x00007ffff750a276 in start_listener_addrinfo
>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>> process_meth=0x4c4935 <start_process_pbs_server_port(void*)>)
>> at ../Libnet/server_core.c:398
>> #2 0x00000000004608f3 in start_accept_listener (vp=0x0) at
>> pbsd_main.c:1141
>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>> pthread_create.c:333
>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>> _64/clone.S:109
>>
>> Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>> _64/pthread_cond_wait.S:185
>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at u_threadpool.c:272
>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>> pthread_create.c:333
>> ---Type <return> to continue, or q <return> to quit---
>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>> _64/clone.S:109
>>
>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>> te.S:84
>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>> ../sysdeps/posix/usleep.c:32
>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>> pbsd_main.c:1935
>> (gdb) quit
>>
>>
>>
>>
>>
>> On Fri, Oct 28, 2016 at 12:43 PM, Kazuhiro Fujita <
>> ***@gmail.com> wrote:
>>
>>> Thank you for your comments.
>>> I will try the 6.0-dev next week.
>>>
>>> Best,
>>> Kazu
>>>
>>> On Fri, Oct 28, 2016 at 5:34 AM, David Beer <***@adaptivecomputing.com
>>> > wrote:
>>>
>>>> I wonder if that fix wasn't included in the hotfix. Is there any chance
>>>> you can try installing 6.0-dev on your system (via github) to see if it's
>>>> resolved? For the record, my Ubuntu 16 system doesn't give me this error,
>>>> or I'd try it myself. For whatever reason, none of our test cluster
>>>> machines (CentOS & RedHat 6-7, SLES 11-12) experience this either. We did
>>>> have another user who experienced it on a test cluster, but not being able
>>>> to reproduce it has made it harder to track down.
>>>>
>>>> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
>>>> ***@gmail.com> wrote:
>>>>
>>>>> David,
>>>>>
>>>>> I tried 6.0.2.h3, but it seems that another issue still
>>>>> remains.
>>>>> After I initialized the serverdb with "sudo pbs_server -t create",
>>>>> pbs_server crashed.
>>>>> Then, I used gdb with pbs_server.
>>>>>
>>>>> Best,
>>>>> Kazu
>>>>>
>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>> http://gnu.org/licenses/gpl.html>
>>>>> This is free software: you are free to change and redistribute it.
>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>> copying"
>>>>> and "show warranty" for details.
>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>> Type "show configuration" for configuration details.
>>>>> For bug reporting instructions, please see:
>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>> Find the GDB manual and other documentation resources online at:
>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>> For help, type "help".
>>>>> Type "apropos word" to search for commands related to "word"...
>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>> (gdb) r -D
>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>> [Thread debugging using libthread_db enabled]
>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>> ad_db.so.1".
>>>>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>>>>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>>>>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>>>>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>>>>> [New Thread 0x7ffff21be700 (LWP 25594)]
>>>>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>>>>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>
>>>>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation fault.
>>>>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>>>>> __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
>>>>> directory.
>>>>> (gdb) bt
>>>>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>> #1 0x00000000004ac076 in dispatch_timed_task (ptask=0x5727660) at
>>>>> svr_task.c:318
>>>>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at pbsd_main.c:921
>>>>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at
>>>>> u_threadpool.c:318
>>>>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700) at
>>>>> pthread_create.c:333
>>>>> #5 0x00007ffff6165b5d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>>>> ***@gmail.com> wrote:
>>>>>
>>>>>> David and Rick,
>>>>>>
>>>>>> Thank you for the quick response. I will try it later.
>>>>>>
>>>>>> Best,
>>>>>> Kazu
>>>>>>
>>>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>
>>>>>>> Actually, Rick just sent me the link. You can download it from here:
>>>>>>> http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>>>>>>>
>>>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>
>>>>>>>> I can confirm that this bug is fixed in 6.0-dev, and we've made a
>>>>>>>> hotfix for it, 6.0.2.h3. This was caused by a change in the
>>>>>>>> implementation of the pthread library, so most users will not see this
>>>>>>>> crash, but if you have a newer version of that library, you will.
>>>>>>>> Rick is going to send instructions for how to grab 6.0.2.h3.
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thank you, David, for the comment on the backtrace.
>>>>>>>>> I hadn't noticed that until writing this mail,
>>>>>>>>> so I generated the backtrace as described in the Ubuntu wiki.
>>>>>>>>>
>>>>>>>>> I also attached the backtrace of pbs_server (Torque 6.1-dev) taken
>>>>>>>>> with gdb.
>>>>>>>>> As I mentioned before, the torque.setup script executed
>>>>>>>>> successfully, but the server was unstable.
>>>>>>>>>
>>>>>>>>> Before using gdb, I used following commands.
>>>>>>>>>
>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>>> 6.1-dev 6.1-dev
>>>>>>>>>> cd 6.1-dev
>>>>>>>>>> ./autogen.sh
>>>>>>>>>> # build and install torque
>>>>>>>>>> ./configure
>>>>>>>>>> make
>>>>>>>>>> sudo make install
>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>> sudo ldconfig
>>>>>>>>>> # set as services
>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>
>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`"
>>>>>>>>>> | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed np)
>>>>>>>>>> sudo qterm -t quick
>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> trqauthd was not stopped by the last command, so I stopped it by
>>>>>>>>> killing the trqauthd process.
>>>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>>>
>>>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>>>
>>>>>>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee
>>>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In another terminal, I executed the following commands before
>>>>>>>>> pbs_server crashed.
>>>>>>>>>
>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>> pbsnodes -a
>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The output of the last command was "0.torque-server",
>>>>>>>>> and this command crashed pbs_server in gdb.
>>>>>>>>> Then, I captured the backtrace.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Kazu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> David,
>>>>>>>>>>
>>>>>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2) taken with gdb.
>>>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>>>>
>>>>>>>>>> I started pbs_server with gdb,
>>>>>>>>>> and executed qmgr from another terminal (see below).
>>>>>>>>>>
>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> After executing qmgr, I pressed Ctrl+C in gdb.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Kaz
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Kazu,
>>>>>>>>>>>
>>>>>>>>>>> Can you give us a backtrace for this crash? We have fixed some
>>>>>>>>>>> issues on startup (around mutex management for newer pthread
>>>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>>>> seeing is fixed.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dear All,
>>>>>>>>>>>>
>>>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS with dual
>>>>>>>>>>>> E5-2630 v3 chips.
>>>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>>>>>>> I tried to set up Torque on them, but I got stuck at the
>>>>>>>>>>>> initial setup script.
>>>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server in the
>>>>>>>>>>>> initial setup script (torque.setup). (see below)
>>>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>>>> Have you ever observed this kind of error?
>>>>>>>>>>>> If you know possible solutions, please tell me.
>>>>>>>>>>>> Any comments will be highly appreciated.
>>>>>>>>>>>> Would it be better to change the OS to another distribution,
>>>>>>>>>>>> such as Scientific Linux?
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>> Kazu
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>>>
>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>> Currently no servers active. Default server will be listed as
>>>>>>>>>>>>> active server. Error 15133
>>>>>>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00 pbs_server -t
>>>>>>>>>>>>> create
>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+ 12:22
>>>>>>>>>>>>> 0:00 grep --color=auto pbs
>>>>>>>>>>>>
>>>>>>>>>>>> The pbs_server -t create process was no longer running.
>>>>>>>>>>>>
>>>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>>>
>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>> Currently no servers active. Default server will be listed as
>>>>>>>>>>>>> active server. Error 15133
>>>>>>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00 pbs_server -t
>>>>>>>>>>>>> create
>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+ 16:11
>>>>>>>>>>>>> 0:00 grep --color=auto pbs
>>>>>>>>>>>>
>>>>>>>>>>>> The pbs_server -t create process was no longer running.
>>>>>>>>>>>>
>>>>>>>>>>>> Commands used for installation before the setup script
>>>>>>>>>>>>
>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>> make
>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> # set up as services
>>>>>>>>>>>>
>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> torqueusers mailing list
>>>>>>>>>>>> ***@supercluster.org
>>>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>> Adaptive Computing
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> David Beer | Torque Architect
>>>>>>>> Adaptive Computing
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> David Beer | Torque Architect
>>>>>>> Adaptive Computing
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> David Beer | Torque Architect
>>>> Adaptive Computing
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
>
> --
> David Beer | Torque Architect
> Adaptive Computing
>
>
>
David Beer
2016-11-09 18:07:17 UTC
Kazu,

I was able to get a system to reproduce this error. I have now checked in
another fix, and I can no longer reproduce this. Can you pull the latest
and let me know if it fixes it for you?

On Tue, Nov 8, 2016 at 2:06 AM, Kazuhiro Fujita <***@gmail.com>
wrote:

> Hi David,
>
> I reinstalled 6.0-dev today from github, and I think I observed slightly
> different behavior.
> I used the "service" command to start the daemons this time.
>
> Best,
> Kazu
>
> Before the crash
>
>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>> 6.0-dev
>> cd 6.0-dev
>> ./autogen.sh
>> # build and install torque
>> ./configure
>> make
>> sudo make install
>> # Set the correct name of the server
>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>> # configure and start trqauthd
>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>> sudo update-rc.d trqauthd defaults
>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>> sudo ldconfig
>> sudo service trqauthd start
>> # Initialize serverdb by executing the torque.setup script
>> sudo ./torque.setup $USER
>> sudo qmgr -c 'p s'
>> sudo qterm
>> sudo service trqauthd stop
>> ps aux | grep pbs
>> ps aux | grep trq
>> # set nodes
>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
>> tee /var/spool/torque/server_priv/nodes
>> sudo nano /var/spool/torque/server_priv/nodes
>> # set the head node
>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>> # configure the other daemons
>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>> sudo update-rc.d pbs_server defaults
>> sudo update-rc.d pbs_sched defaults
>> sudo update-rc.d pbs_mom defaults
>> # start torque daemons
>> sudo service trqauthd start
>> sudo service pbs_server start
>> sudo service pbs_sched start
>> sudo service pbs_mom start
>> # check the configuration of the computation nodes
>> pbsnodes -a
>
>
> I checked the torque processes with "ps aux | grep pbs" and "ps aux | grep
> trq" several times.
> After "pbsnodes -a", everything seemed OK.
> But the next qsub command seems to trigger a crash of "pbs_server" and
> "pbs_sched".
>
> $ ps aux | grep trq
>> root 9682 0.0 0.0 109112 3632 ? S 17:39 0:00
>> /usr/local/sbin/trqauthd
>> comp_ad+ 9842 0.0 0.0 15236 936 pts/8 S+ 17:40 0:00 grep
>> --color=auto trq
>> $ ps aux | grep pbs
>> root 9720 0.0 0.0 695140 25760 ? Sl 17:39 0:00
>> /usr/local/sbin/pbs_server
>> root 9771 0.0 0.0 37996 4940 ? Ss 17:39 0:00
>> /usr/local/sbin/pbs_sched
>> root 9814 0.2 0.2 173776 136692 ? SLsl 17:40 0:00
>> /usr/local/sbin/pbs_mom
>> comp_ad+ 9844 0.0 0.0 15236 1012 pts/8 S+ 17:40 0:00 grep
>> --color=auto pbs
>> $ echo "sleep 30" | qsub
>> 0.Dual-E52630v4
>> $ ps aux | grep pbs
>> root 9814 0.1 0.2 173776 136692 ? SLsl 17:40 0:00
>> /usr/local/sbin/pbs_mom
>> comp_ad+ 9855 0.0 0.0 15236 928 pts/8 S+ 17:41 0:00 grep
>> --color=auto pbs
>> $ ps aux | grep trq
>> root 9682 0.0 0.0 109112 4144 ? S 17:39 0:00
>> /usr/local/sbin/trqauthd
>> comp_ad+ 9860 0.0 0.0 15236 1092 pts/8 S+ 17:41 0:00 grep
>> --color=auto trq
>
>
> Then, I stopped the remaining processes,
>
> sudo service pbs_mom stop
>> sudo service trqauthd stop
>
>
> and started "trqauthd" again and ran "pbs_server" under gdb. "pbs_server"
> crashed in gdb without any further commands.
>
> sudo service trqauthd start
>> sudo gdb /usr/local/sbin/pbs_server
>
>
> sudo gdb /usr/local/sbin/pbs_server
> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
> Copyright (C) 2016 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.
> html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/local/sbin/pbs_server...done.
> (gdb) r -D
> Starting program: /usr/local/sbin/pbs_server -D
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>
> Program received signal SIGSEGV, Segmentation fault.
> __lll_unlock_elision (lock=0x512f1b0, private=0) at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
> directory.
> (gdb) bt
> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
> "svr_enquejob", msg=0x524554 "1", logging=0)
> at svr_jobfunc.c:4011
> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880, has_sv_qs_mutex=1,
> prev_job_id=0x0, have_reservation=false, being_recovered=true) at
> svr_jobfunc.c:421
> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880, change_state=1)
> at pbsd_init.c:2824
> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
> pbsd_init.c:2558
> #5 0x0000000000459483 in handle_job_recovery (type=1) at pbsd_init.c:1803
> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
> pbsd_init.c:2100
> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
> pbsd_main.c:1898
> (gdb) backtrace full
> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> No locals.
> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
> "svr_enquejob", msg=0x524554 "1", logging=0)
> at svr_jobfunc.c:4011
> rc = 0
> err_msg = 0x0
> stub_msg = "no pos"
> __func__ = "unlock_ji_mutex"
> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880, has_sv_qs_mutex=1,
> prev_job_id=0x0, have_reservation=false, being_recovered=true) at
> svr_jobfunc.c:421
> pattrjb = 0x7fffffff4a10
> pdef = 0x4
> pque = 0x0
> rc = 0
> log_buf = '\000' <repeats 24 times>, "\030\000\000\000\060\000\000\
> 000PU\377\377\377\177\000\000\220T\377\377\377\177", '\000' <repeats 50
> times>, "\003\000\000\000\000\000\000\000#\000\000\000\000\000\000\000pO\377\377\377\177",
> '\000' <repeats 26 times>, "\221\260\000\000\000\200\377\
> 377oO\377\377\377\177\000\000H+B\366\377\177\000\000p+B\
> 366\377\177\000\000\200O\377\377\377\177\000\000\201\260\
> 000\000\000\200\377\377\177O\377\377\377\177", '\000' <repeats 18
> times>...
> time_now = 1478594788
> job_id = "0.Dual-E52630v4\000\000\000\000\000\000\000\000\000\362\
> 377\377\377\377\377\377\377\340J\377\377\377\177\000\000\
> 060L\377\377\377\177\000\000\001\000\000\000\000\000\000\
> 000\244\201\000\000\001\000\000\000\030\354\377\367\377\177\000\***@L
> \377\377\377\177\000\000\000\000\000\000\005\
> 000\000\220\r\000\000\000\000\000\000\000k\022j\365\377\177\
> 000\000\031J\377\377\377\177\000\000\201n\376\017\000\000\
> 000\000\\\216!X\000\000\000\000_#\343+\000\000\000\000\\\
> 216!X\000\000\000\000\207\065],", '\000' <repeats 36 times>,
> "k\022j\365\377\177\000\000\300K\377\377\377\177\000\000\
> 000\000\000\000\000\000\000\000"...
> queue_name = "batch\000\377\377\240\340\377\367\377\177\000"
> total_jobs = 0
> user_jobs = 0
> array_jobs = 0
> __func__ = "svr_enquejob"
> que_mgr = {unlock_on_exit = 160, locked = 75, mutex_valid = 255,
> managed_mutex = 0x7ffff7ddccda <open_path+474>}
> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880, change_state=1)
> at pbsd_init.c:2824
> newstate = 0
> newsubstate = 0
> rc = 0
> log_buf = "pbsd_init_reque:1", '\000' <repeats 1063 times>...
> __func__ = "pbsd_init_reque"
> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
> pbsd_init.c:2558
> d = 0
> rc = 0
> time_now = 1478594788
> log_buf = '\000' <repeats 2112 times>...
> local_errno = 0
> job_id = '\000' <repeats 1016 times>...
> job_atr_hold = 0
> job_exit_status = 0
> __func__ = "pbsd_init_job"
> #5 0x0000000000459483 in handle_job_recovery (type=1) at pbsd_init.c:1803
> pjob = 0x512d880
> Index = 0
> JobArray_iter = {first = "0.Dual-E52630v4", second = }
> log_buf = "14 total files read from disk\000\000\000\000\000\000\
> 000\001\000\000\000\320\316\022\005\000\000\000\000\220N\022\005", '\000'
> <repeats 12 times>, "Expected 1, recovered 1 queues", '\000' <repeats 1330
> times>...
> rc = 0
> job_rc = 0
> logtype = 0
> pdirent = 0x0
> pdirent_sub = 0x0
> dir = 0x5124e90
> dir_sub = 0x0
> had = 0
> pjob = 0x0
> time_now = 1478594788
> ---Type <return> to continue, or q <return> to quit---
> basen = '\000' <repeats 1088 times>...
> use_jobs_subdirs = 0
> __func__ = "handle_job_recovery"
> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
> pbsd_init.c:2100
> rc = 0
> tmp_rc = 1974134615
> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
> ret = 0
> gid = 0
> log_buf = "pbsd_init:1", '\000' <repeats 997 times>...
> __func__ = "pbsd_init"
> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
> pbsd_main.c:1898
> i = 2
> rc = 0
> local_errno = 0
> lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
> <repeats 983 times>
> EMsg = '\000' <repeats 1023 times>
> tmpLine = "Server Dual-E52630v4 started, initialization type = 1",
> '\000' <repeats 970 times>
> log_buf = "Server Dual-E52630v4 started, initialization type = 1",
> '\000' <repeats 1139 times>...
> server_name_file_port = 15001
> fp = 0x51095f0
> (gdb) info registers
> rax 0x0 0
> rbx 0x6 6
> rcx 0x0 0
> rdx 0x512f1b0 85127600
> rsi 0x0 0
> rdi 0x512f1b0 85127600
> rbp 0x7fffffffe4b0 0x7fffffffe4b0
> rsp 0x7fffffffc870 0x7fffffffc870
> r8 0x0 0
> r9 0x7fffffff57a2 140737488312226
> r10 0x513c800 85182464
> r11 0x7ffff61e6128 140737322574120
> r12 0x4260b0 4350128
> r13 0x7fffffffe590 140737488348560
> r14 0x0 0
> r15 0x0 0
> rip 0x461f29 0x461f29 <main(int, char**)+2183>
> eflags 0x10246 [ PF ZF IF RF ]
> cs 0x33 51
> ss 0x2b 43
> ds 0x0 0
> es 0x0 0
> fs 0x0 0
> gs 0x0 0
> (gdb) x/16i $pc
> => 0x461f29 <main(int, char**)+2183>: test %eax,%eax
> 0x461f2b <main(int, char**)+2185>: setne %al
> 0x461f2e <main(int, char**)+2188>: test %al,%al
> 0x461f30 <main(int, char**)+2190>: je 0x461f55 <main(int, char**)+2227>
> 0x461f32 <main(int, char**)+2192>: mov 0x70efc7(%rip),%rax # 0xb70f00 <msg_daemonname>
> 0x461f39 <main(int, char**)+2199>: mov $0x51bab2,%edx
> 0x461f3e <main(int, char**)+2204>: mov %rax,%rsi
> 0x461f41 <main(int, char**)+2207>: mov $0xffffffff,%edi
> 0x461f46 <main(int, char**)+2212>: callq 0x425420 <***@plt>
> 0x461f4b <main(int, char**)+2217>: mov $0x3,%edi
> 0x461f50 <main(int, char**)+2222>: callq 0x425680 <***@plt>
> 0x461f55 <main(int, char**)+2227>: mov 0x71021d(%rip),%esi # 0xb72178 <pbs_mom_port>
> 0x461f5b <main(int, char**)+2233>: mov 0x710227(%rip),%ecx # 0xb72188 <pbs_scheduler_port>
> 0x461f61 <main(int, char**)+2239>: mov 0x710225(%rip),%edx # 0xb7218c <pbs_server_port_dis>
> 0x461f67 <main(int, char**)+2245>: lea -0x1400(%rbp),%rax
> 0x461f6e <main(int, char**)+2252>: mov $0xb739c0,%r9d
> (gdb) thread apply all backtrace
>
> Thread 1 (Thread 0x7ffff7fd5740 (LWP 10004)):
> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__> "svr_enquejob", msg=0x524554 "1", logging=0) at svr_jobfunc.c:4011
> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880, has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false, being_recovered=true) at svr_jobfunc.c:421
> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880, change_state=1) at pbsd_init.c:2824
> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at pbsd_init.c:2558
> #5 0x0000000000459483 in handle_job_recovery (type=1) at pbsd_init.c:1803
> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at pbsd_init.c:2100
> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at pbsd_main.c:1898
> (gdb) quit
> A debugging session is active.
>
> Inferior 1 [process 10004] will be killed.
>
> Quit anyway? (y or n) y
>
>
>
>
>
>
>
> On Wed, Nov 2, 2016 at 1:43 AM, David Beer <***@adaptivecomputing.com>
> wrote:
>
>> Kazu,
>>
>> Thanks for sticking with us on this. You mentioned that pbs_server did
>> not crash when you submitted the job, but you said that it and pbs_sched
>> are "unstable." What do you mean by unstable? Will jobs run? Your gdb
>> output looks like a pbs_server that isn't busy, but other than that it
>> looks normal.
>>
>> David
>>
>> On Tue, Nov 1, 2016 at 1:19 AM, Kazuhiro Fujita <
>> ***@gmail.com> wrote:
>>
>>> David,
>>>
>>> I tested the 6.0-dev. It passed the "sudo ./torque.setup $USER" script,
>>> but pbs_server and pbs_sched are unstable, just as with 6.1-dev.
>>>
>>> Best,
>>> Kazu
>>>
>>> Before execution of gdb
>>>
>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev 6.0-dev
>>>> cd 6.0-dev
>>>> ./autogen.sh
>>>> # build and install torque
>>>> ./configure
>>>> make
>>>> sudo make install
>>>> # Set the correct name of the server
>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>> # configure and start trqauthd
>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>> sudo update-rc.d trqauthd defaults
>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>> sudo ldconfig
>>>> sudo service trqauthd start
>>>> # Initialize serverdb by executing the torque.setup script
>>>> sudo ./torque.setup $USER
>>>>
>>>> sudo qmgr -c 'p s'
>>>> sudo qterm
>>>> sudo /etc/init.d/trqauthd stop
>>>> # set nodes
>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>> # set the head node
>>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>>>> # configure other daemons
>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>> sudo update-rc.d pbs_server defaults
>>>> sudo update-rc.d pbs_sched defaults
>>>> sudo update-rc.d pbs_mom defaults
>>>> # start torque daemons
>>>> sudo service trqauthd start
>>>
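
As an aside on the nodes-file step above: the np value computed from /proc/cpuinfo can be sanity-checked against coreutils' nproc. A minimal sketch with standard Linux tools only (no Torque install required); note that nproc honors CPU affinity, so the two counts can legitimately differ in constrained environments:

```shell
# Count logical processors the same way the nodes-file command above does,
# then compare with coreutils' nproc as a sanity check.
np_grep=$(grep -c ^processor /proc/cpuinfo)
np_nproc=$(nproc)
echo "np from /proc/cpuinfo: $np_grep (nproc reports $np_nproc)"
```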
>>>
>>> Execution of gdb
>>>
>>>> sudo gdb /usr/local/sbin/pbs_server
>>>
>>>
>>> Commands executed by another terminal
>>>
>>>> sudo /etc/init.d/pbs_mom start
>>>> sudo /etc/init.d/pbs_sched start
>>>> pbsnodes -a
>>>> echo "sleep 30" | qsub
>>>
>>>
>>> The last command did not cause a crash of pbs_server. The backtrace is
>>> described below.
>>> $ sudo gdb /usr/local/sbin/pbs_server
>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>> copying"
>>> and "show warranty" for details.
>>> This GDB was configured as "x86_64-linux-gnu".
>>> Type "show configuration" for configuration details.
>>> For bug reporting instructions, please see:
>>> <http://www.gnu.org/software/gdb/bugs/>.
>>> Find the GDB manual and other documentation resources online at:
>>> <http://www.gnu.org/software/gdb/documentation/>.
>>> For help, type "help".
>>> Type "apropos word" to search for commands related to "word"...
>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>> (gdb) r -D
>>> Starting program: /usr/local/sbin/pbs_server -D
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>> [New Thread 0x7ffff39c1700 (LWP 5024)]
>>> pbs_server is up (version - 6.0, port - 15001)
>>> [New Thread 0x7ffff31c0700 (LWP 5025)]
>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open
>>> tcp connection - connect() failed [rc = -2] [addr = 10.0.0.249:15003]
>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>> hierarchy to host Dual-E52630v4:15003
>>> [New Thread 0x7ffff29bf700 (LWP 5026)]
>>> [New Thread 0x7ffff21be700 (LWP 5027)]
>>> [New Thread 0x7ffff19bd700 (LWP 5028)]
>>> [New Thread 0x7ffff11bc700 (LWP 5029)]
>>> [New Thread 0x7ffff09bb700 (LWP 5030)]
>>> [Thread 0x7ffff09bb700 (LWP 5030) exited]
>>> [New Thread 0x7ffff09bb700 (LWP 5031)]
>>> [New Thread 0x7fffe3fff700 (LWP 5109)]
>>> [New Thread 0x7fffe37fe700 (LWP 5113)]
>>> [New Thread 0x7fffe29cf700 (LWP 5121)]
>>> [Thread 0x7fffe29cf700 (LWP 5121) exited]
>>> ^C
>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>> (gdb) backtrace full
>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
>>> No locals.
>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at ../sysdeps/posix/usleep.c:32
>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>> state = 3
>>> waittime = 5
>>> pjob = 0x313a74
>>> iter = 0x0
>>> when = 1477984074
>>> log = 0
>>> scheduling = 1
>>> sched_iteration = 600
>>> time_now = 1477984190
>>> update_loglevel = 1477984198
>>> log_buf = "Server Ready, pid = 5020, loglevel=0", '\000'
>>> <repeats 140 times>, "c\000\000\000\000\000\000\000
>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>> <repeats 26 times>...
>>> sem_val = 5228929
>>> __func__ = "main_loop"
>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at pbsd_main.c:1935
>>> i = 2
>>> rc = 0
>>> local_errno = 0
>>> lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
>>> <repeats 983 times>
>>> EMsg = '\000' <repeats 1023 times>
>>> tmpLine = "Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>> log_buf = "Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>> server_name_file_port = 15001
>>> fp = 0x51095f0
>>> (gdb) info registers
>>> rax 0xfffffffffffffdfc -516
>>> rbx 0x5 5
>>> rcx 0x7ffff612a75d 140737321805661
>>> rdx 0x0 0
>>> rsi 0x0 0
>>> rdi 0x7fffffffb3f0 140737488335856
>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>> rsp 0x7fffffffc870 0x7fffffffc870
>>> r8 0x0 0
>>> r9 0x4000001 67108865
>>> r10 0x1 1
>>> r11 0x293 659
>>> r12 0x4260b0 4350128
>>> r13 0x7fffffffe590 140737488348560
>>> r14 0x0 0
>>> r15 0x0 0
>>> rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
>>> eflags 0x293 [ CF AF SF IF ]
>>> cs 0x33 51
>>> ss 0x2b 43
>>> ds 0x0 0
>>> es 0x0 0
>>> fs 0x0 0
>>> gs 0x0 0
>>> (gdb) x/16i $pc
>>> => 0x461fb6 <main(int, char**)+2388>: callq 0x494762 <shutdown_ack()>
>>> 0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
>>> 0x461fc0 <main(int, char**)+2398>: callq 0x4250b0 <***@plt>
>>> 0x461fc5 <main(int, char**)+2403>: mov 0x70f55c(%rip),%rdx # 0xb71528 <msg_svrdown>
>>> 0x461fcc <main(int, char**)+2410>: mov 0x70eeed(%rip),%rax # 0xb70ec0 <msg_daemonname>
>>> 0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
>>> 0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
>>> 0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
>>> 0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
>>> 0x461fe3 <main(int, char**)+2433>: callq 0x425840 <***@plt>
>>> 0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
>>> 0x461fed <main(int, char**)+2443>: callq 0x4269c9 <acct_close(bool)>
>>> 0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
>>> 0x461ff7 <main(int, char**)+2453>: callq 0x425a00 <***@plt>
>>> 0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
>>> 0x462001 <main(int, char**)+2463>: callq 0x424db0 <***@plt>
>>> (gdb) thread apply all backtrace
>>>
>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at u_threadpool.c:272
>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at pthread_create.c:333
>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at u_threadpool.c:272
>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at pthread_create.c:333
>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at u_threadpool.c:272
>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at pthread_create.c:333
>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
>>> #2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at req_jobobit.c:3759
>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at pthread_create.c:333
>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
>>> #2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at job_recycler.c:216
>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at pthread_create.c:333
>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
>>> #2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at exiting_jobs.c:319
>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at pthread_create.c:333
>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
>>> #2 0x000000000046078d in handle_queue_routing_retries (vp=0x0) at pbsd_main.c:1079
>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at pthread_create.c:333
>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
>>> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-template.S:84
>>> #1 0x00007ffff750a276 in start_listener_addrinfo (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001, process_meth=0x4c4935 <start_process_pbs_server_port(void*)>) at ../Libnet/server_core.c:398
>>> #2 0x00000000004608f3 in start_accept_listener (vp=0x0) at pbsd_main.c:1141
>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at pthread_create.c:333
>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at u_threadpool.c:272
>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at pthread_create.c:333
>>> ---Type <return> to continue, or q <return> to quit---
>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>
>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at ../sysdeps/posix/usleep.c:32
>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at pbsd_main.c:1935
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Oct 28, 2016 at 12:43 PM, Kazuhiro Fujita <
>>> ***@gmail.com> wrote:
>>>
>>>> Thank you for your comments.
>>>> I will try the 6.0-dev next week.
>>>>
>>>> Best,
>>>> Kazu
>>>>
>>>> On Fri, Oct 28, 2016 at 5:34 AM, David Beer <
>>>> ***@adaptivecomputing.com> wrote:
>>>>
>>>>> I wonder if that fix wasn't included in the hotfix. Is there any chance
>>>>> you can try installing 6.0-dev on your system (via GitHub) to see if
>>>>> it's resolved? For the record, my Ubuntu 16 system doesn't give me this
>>>>> error, or I'd try it myself. For whatever reason, none of our test
>>>>> cluster machines (CentOS & Red Hat 6-7, SLES 11-12) experience this
>>>>> either. We did have another user who experienced it on a test cluster,
>>>>> but not being able to reproduce it has made it harder to track down.
>>>>>
>>>>> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
>>>>> ***@gmail.com> wrote:
>>>>>
>>>>>> David,
>>>>>>
>>>>>> I tried the 6.0.2.h3. But it seems that the other issue still
>>>>>> remains.
>>>>>> After I initialized serverdb by "sudo pbs_server -t create",
>>>>>> pbs_server crashed.
>>>>>> Then, I used gdb with pbs_server.
>>>>>>
>>>>>> Best,
>>>>>> Kazu
>>>>>>
>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>>>>> This is free software: you are free to change and redistribute it.
>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>> copying"
>>>>>> and "show warranty" for details.
>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>> Type "show configuration" for configuration details.
>>>>>> For bug reporting instructions, please see:
>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>> For help, type "help".
>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>> (gdb) r -D
>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>> [Thread debugging using libthread_db enabled]
>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>>>>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>>>>>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>>>>>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>>>>>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>>>>>> [New Thread 0x7ffff21be700 (LWP 25594)]
>>>>>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>>>>>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>
>>>>>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation fault.
>>>>>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>> __lll_unlock_elision (lock=0x57276c0, private=0) at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.
>>>>>> (gdb) bt
>>>>>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>> #1 0x00000000004ac076 in dispatch_timed_task (ptask=0x5727660) at svr_task.c:318
>>>>>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at pbsd_main.c:921
>>>>>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at u_threadpool.c:318
>>>>>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700) at pthread_create.c:333
>>>>>> #5 0x00007ffff6165b5d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>
>>>>>>
>>>>>>
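
The faulting frame above, __lll_unlock_elision, is glibc's hardware lock-elision unlock path. Assuming the connection David describes later in the thread (the crash appears only with newer pthread implementations, on TSX-capable CPUs such as the E5 v4 / Broadwell parts), a quick way to check whether a given host even exposes that code path:

```shell
# Check whether the CPU advertises TSX; glibc's __lll_unlock_elision is
# only relevant when the rtm flag is present (this linkage is an
# assumption based on the discussion in this thread).
if grep -qw rtm /proc/cpuinfo; then
  msg="TSX (rtm) present: glibc lock elision can be active"
else
  msg="TSX (rtm) absent: the elision unlock path should not be reached"
fi
echo "$msg"
```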
>>>>>>
>>>>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>>>>> ***@gmail.com> wrote:
>>>>>>
>>>>>>> David and Rick,
>>>>>>>
>>>>>>> Thank you for the quick response. I will try it later.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kazu
>>>>>>>
>>>>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>
>>>>>>>> Actually, Rick just sent me the link. You can download it from
>>>>>>>> here: http://files.adaptivecomputing.com/hotfix/torque-6.0.2
>>>>>>>> .h3.tar.gz
>>>>>>>>
>>>>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>
>>>>>>>>> I can confirm that this bug is fixed in 6.0-dev, and we've made a
>>>>>>>>> hotfix for it, 6.0.2.h3. The crash was caused by a change in the
>>>>>>>>> pthread library implementation, so most users will not see it, but
>>>>>>>>> it appears that with a newer version of that library you will hit
>>>>>>>>> it. Rick is going to send instructions for how to grab 6.0.2.h3.
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you David for the comment on the backtrace.
>>>>>>>>>> I hadn't noticed that until writing this mail.
>>>>>>>>>> So, I generated the backtrace as described in the Ubuntu wiki.
>>>>>>>>>>
>>>>>>>>>> I also attached the backtrace of pbs_server (Torque 6.1-dev) by
>>>>>>>>>> gdb.
>>>>>>>>>> As I mentioned before, the torque.setup script was executed
>>>>>>>>>> successfully, but pbs_server is unstable.
>>>>>>>>>>
>>>>>>>>>> Before using gdb, I used following commands.
>>>>>>>>>>
>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>>>> 6.1-dev 6.1-dev
>>>>>>>>>>> cd 6.1-dev
>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>> # build and install torque
>>>>>>>>>>> ./configure
>>>>>>>>>>> make
>>>>>>>>>>> sudo make install
>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>> # set as services
>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>
>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`"
>>>>>>>>>>> | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed np)
>>>>>>>>>>> sudo qterm -t quick
>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> trqauthd was not stopped by the last command, so I stopped it by
>>>>>>>>>> killing the trqauthd process.
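
The manual kill described here is the standard stop-by-signal pattern. A generic sketch, with a throwaway sleep process standing in for trqauthd (the real daemon would need sudo and `pkill trqauthd`; all names here are illustrative):

```shell
# Generic stop-by-signal pattern: start a process, send SIGTERM, reap it,
# then verify it is gone. 'sleep' stands in for trqauthd in this sketch.
sleep 60 &
pid=$!
kill -TERM "$pid"
wait "$pid" 2>/dev/null
if kill -0 "$pid" 2>/dev/null; then status="still running"; else status="stopped"; fi
echo "$status"
```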
>>>>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>>>>
>>>>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>>>>
>>>>>>>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee
>>>>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In another terminal, I executed the following commands before
>>>>>>>>>> pbs_server crashed.
>>>>>>>>>>
>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The output of the last command was "0.torque-server",
>>>>>>>>>> and this command crashed pbs_server under gdb.
>>>>>>>>>> Then I captured the backtrace.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Kazu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> David,
>>>>>>>>>>>
>>>>>>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2) by gdb.
>>>>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>>>>>
>>>>>>>>>>> I started pbs_server with gdb,
>>>>>>>>>>> and execute qmgr from another terminal. (see below)
>>>>>>>>>>>
>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kaz
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Kazu,
>>>>>>>>>>>>
>>>>>>>>>>>> Can you give us a backtrace for this crash? We have fixed some
>>>>>>>>>>>> issues on startup (around mutex management for newer pthread
>>>>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>>>>> seeing is fixed.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Dear All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS with
>>>>>>>>>>>>> dual E5-2630 v3 chips.
>>>>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>>>>>>>> And I tried to set up Torque on them, but I got stuck at the
>>>>>>>>>>>>> initial setup script.
>>>>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server in the
>>>>>>>>>>>>> initial setup script (torque.setup). (see below)
>>>>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>>>>> Have you ever observed this kind of error?
>>>>>>>>>>>>> And if you know possible solutions, please tell me.
>>>>>>>>>>>>> Any comments will be highly appreciated.
>>>>>>>>>>>>> Would it be better to change the OS to another distribution,
>>>>>>>>>>>>> such as Scientific Linux?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>>>>
>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>> Currently no servers active. Default server will be listed as
>>>>>>>>>>>>>> active server. Error 15133
>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00 pbs_server -t create
>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+ 12:22 0:00 grep --color=auto pbs
>>>>>>>>>>>>>
>>>>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>>>>
>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>> Currently no servers active. Default server will be listed as
>>>>>>>>>>>>>> active server. Error 15133
>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00 pbs_server -t create
>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+ 16:11 0:00 grep --color=auto pbs
>>>>>>>>>>>>>
>>>>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Commands used for installation before the setup script
>>>>>>>>>>>>>
>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>> make
>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>> echo /usr/local/lib > sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> # set up as services
>>>>>>>>>>>>>
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> torqueusers mailing list
>>>>>>>>>>>>> ***@supercluster.org
>>>>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>>>>


--
David Beer | Torque Architect
Adaptive Computing
Kazuhiro Fujita
2016-11-10 03:01:19 UTC
Permalink
David,

Now, it works. Thank you.
However, jobs are executed in LIFO order, as I also observed on an E5-2630v3
server...
Below is the output of 'qstat -t' after running 'echo "sleep 10" | qsub -t 1-10'
three times.

Best,
Kazu

$ qstat -t
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
0.Dual-E5-2630v3          STDIN            comp_admin      00:00:00 C batch
1[1].Dual-E5-2630v3       STDIN-1          comp_admin             0 Q batch
1[2].Dual-E5-2630v3       STDIN-2          comp_admin             0 Q batch
1[3].Dual-E5-2630v3       STDIN-3          comp_admin             0 Q batch
1[4].Dual-E5-2630v3       STDIN-4          comp_admin             0 Q batch
1[5].Dual-E5-2630v3       STDIN-5          comp_admin             0 Q batch
1[6].Dual-E5-2630v3       STDIN-6          comp_admin             0 Q batch
1[7].Dual-E5-2630v3       STDIN-7          comp_admin      00:00:00 C batch
1[8].Dual-E5-2630v3       STDIN-8          comp_admin      00:00:00 C batch
1[9].Dual-E5-2630v3       STDIN-9          comp_admin      00:00:00 C batch
1[10].Dual-E5-2630v3      STDIN-10         comp_admin      00:00:00 C batch
2[1].Dual-E5-2630v3       STDIN-1          comp_admin             0 Q batch
2[2].Dual-E5-2630v3       STDIN-2          comp_admin             0 Q batch
2[3].Dual-E5-2630v3       STDIN-3          comp_admin             0 Q batch
2[4].Dual-E5-2630v3       STDIN-4          comp_admin             0 Q batch
2[5].Dual-E5-2630v3       STDIN-5          comp_admin             0 Q batch
2[6].Dual-E5-2630v3       STDIN-6          comp_admin             0 Q batch
2[7].Dual-E5-2630v3       STDIN-7          comp_admin             0 Q batch
2[8].Dual-E5-2630v3       STDIN-8          comp_admin             0 Q batch
2[9].Dual-E5-2630v3       STDIN-9          comp_admin             0 Q batch
2[10].Dual-E5-2630v3      STDIN-10         comp_admin             0 Q batch
3[1].Dual-E5-2630v3       STDIN-1          comp_admin             0 Q batch
3[2].Dual-E5-2630v3       STDIN-2          comp_admin             0 Q batch
3[3].Dual-E5-2630v3       STDIN-3          comp_admin             0 Q batch
3[4].Dual-E5-2630v3       STDIN-4          comp_admin             0 Q batch
3[5].Dual-E5-2630v3       STDIN-5          comp_admin             0 Q batch
3[6].Dual-E5-2630v3       STDIN-6          comp_admin             0 Q batch
3[7].Dual-E5-2630v3       STDIN-7          comp_admin             0 R batch
3[8].Dual-E5-2630v3       STDIN-8          comp_admin             0 R batch
3[9].Dual-E5-2630v3       STDIN-9          comp_admin             0 R batch
3[10].Dual-E5-2630v3      STDIN-10         comp_admin             0 R batch
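[Editorial aside: when eyeballing long 'qstat -t' listings like the one above, a small helper makes the state mix easier to see at a glance. This is only a sketch that assumes the default qstat column layout (state in the fifth column); 'count_state' is an illustrative name, not a Torque command.]

```shell
#!/bin/sh
# count_state FILE STATE: count the jobs in saved 'qstat -t' output
# whose state column (5th field) equals STATE (Q, R, C, ...).
count_state() {
    awk -v s="$2" 'NF >= 5 && $5 == s { n++ } END { print n + 0 }' "$1"
}
```

Usage: save the listing with 'qstat -t > jobs.txt' and run, e.g., 'count_state jobs.txt R'.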



On Thu, Nov 10, 2016 at 3:07 AM, David Beer <***@adaptivecomputing.com>
wrote:

> Kazu,
>
> I was able to get a system to reproduce this error. I have now checked in
> another fix, and I can no longer reproduce this. Can you pull the latest
> and let me know if it fixes it for you?
>
> On Tue, Nov 8, 2016 at 2:06 AM, Kazuhiro Fujita <***@gmail.com
> > wrote:
>
>> Hi David,
>>
>> I reinstalled 6.0-dev today from GitHub, and observed slightly
>> different behavior, I think.
>> I used the "service" command to start daemons this time.
>>
>> Best,
>> Kazu
>>
>> Before the crash
>>
>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>>> 6.0-dev
>>> cd 6.0-dev
>>> ./autogen.sh
>>> # build and install torque
>>> ./configure
>>> make
>>> sudo make install
>>> # Set the correct name of the server
>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>> # configure and start trqauthd
>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>> sudo update-rc.d trqauthd defaults
>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>> sudo ldconfig
>>> sudo service trqauthd start
>>> # Initialize serverdb by executing the torque.setup script
>>> sudo ./torque.setup $USER
>>> sudo qmgr -c 'p s'
>>> sudo qterm
>>> sudo service trqauthd stop
>>> ps aux | grep pbs
>>> ps aux | grep trq
>>> # set nodes
>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>> sudo nano /var/spool/torque/server_priv/nodes
>>> # set the head node
>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>>> # configure the other daemons
>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>> sudo update-rc.d pbs_server defaults
>>> sudo update-rc.d pbs_sched defaults
>>> sudo update-rc.d pbs_mom defaults
>>> # start torque daemons
>>> sudo service trqauthd start
>>> sudo service pbs_server start
>>> sudo service pbs_sched start
>>> sudo service pbs_mom start
>>> # check the configuration of compute nodes
>>> pbsnodes -a
>>
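[Editorial aside: a pitfall that is easy to hit in setup sequences like the one above is accidentally writing '>' instead of '|' in front of 'sudo tee' (e.g. in the ld.so.conf.d step). The shell then treats '> sudo' as a redirection, so the output lands in a file literally named "sudo", 'tee' never runs, and the intended config file is never written. A minimal demonstration in a scratch directory:]

```shell
#!/bin/sh
# Demonstrate the '>' vs '|' mistake without touching the real system.
tmp=$(mktemp -d) && cd "$tmp"

# Intended: echo /usr/local/lib | sudo tee torque.conf
# Mistyped: '>' redirects echo's output into a file named "sudo";
# "tee" and "torque.conf" become plain arguments to echo.
echo /usr/local/lib > sudo tee torque.conf

cat sudo        # prints: /usr/local/lib tee torque.conf
ls torque.conf 2>/dev/null || echo "torque.conf was never created"
```

The silent failure here would also explain a later 'ldconfig' not picking up /usr/local/lib.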
>>
>> I checked the torque processes with "ps aux | grep pbs" and "ps aux | grep trq"
>> several times.
>> After "pbsnodes -a", everything seems OK.
>> However, the next qsub command seems to trigger a crash of "pbs_server" and
>> "pbs_sched".
>>
>> $ ps aux | grep trq
>>> root 9682 0.0 0.0 109112 3632 ? S 17:39 0:00
>>> /usr/local/sbin/trqauthd
>>> comp_ad+ 9842 0.0 0.0 15236 936 pts/8 S+ 17:40 0:00 grep
>>> --color=auto trq
>>> $ ps aux | grep pbs
>>> root 9720 0.0 0.0 695140 25760 ? Sl 17:39 0:00
>>> /usr/local/sbin/pbs_server
>>> root 9771 0.0 0.0 37996 4940 ? Ss 17:39 0:00
>>> /usr/local/sbin/pbs_sched
>>> root 9814 0.2 0.2 173776 136692 ? SLsl 17:40 0:00
>>> /usr/local/sbin/pbs_mom
>>> comp_ad+ 9844 0.0 0.0 15236 1012 pts/8 S+ 17:40 0:00 grep
>>> --color=auto pbs
>>> $ echo "sleep 30" | qsub
>>> 0.Dual-E52630v4
>>> $ ps aux | grep pbs
>>> root 9814 0.1 0.2 173776 136692 ? SLsl 17:40 0:00
>>> /usr/local/sbin/pbs_mom
>>> comp_ad+ 9855 0.0 0.0 15236 928 pts/8 S+ 17:41 0:00 grep
>>> --color=auto pbs
>>> $ ps aux | grep trq
>>> root 9682 0.0 0.0 109112 4144 ? S 17:39 0:00
>>> /usr/local/sbin/trqauthd
>>> comp_ad+ 9860 0.0 0.0 15236 1092 pts/8 S+ 17:41 0:00 grep
>>> --color=auto trq
>>
>>
>> Then I stopped the remaining processes,
>>
>> sudo service pbs_mom stop
>>> sudo service trqauthd stop
>>
>>
>> and then started "trqauthd" again and ran "pbs_server" under gdb. "pbs_server"
>> crashed in gdb without any further commands.
>>
>> sudo service trqauthd start
>>> sudo gdb /usr/local/sbin/pbs_server
>>
>>
>> sudo gdb /usr/local/sbin/pbs_server
>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>> Copyright (C) 2016 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.h
>> tml>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-linux-gnu".
>> Type "show configuration" for configuration details.
>> For bug reporting instructions, please see:
>> <http://www.gnu.org/software/gdb/bugs/>.
>> Find the GDB manual and other documentation resources online at:
>> <http://www.gnu.org/software/gdb/documentation/>.
>> For help, type "help".
>> Type "apropos word" to search for commands related to "word"...
>> Reading symbols from /usr/local/sbin/pbs_server...done.
>> (gdb) r -D
>> Starting program: /usr/local/sbin/pbs_server -D
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>> ad_db.so.1".
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> __lll_unlock_elision (lock=0x512f1b0, private=0) at
>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
>> directory.
>> (gdb) bt
>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
>> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>> "svr_enquejob", msg=0x524554 "1", logging=0)
>> at svr_jobfunc.c:4011
>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>> being_recovered=true) at svr_jobfunc.c:421
>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>> change_state=1) at pbsd_init.c:2824
>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>> pbsd_init.c:2558
>> #5 0x0000000000459483 in handle_job_recovery (type=1) at pbsd_init.c:1803
>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>> pbsd_init.c:2100
>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>> pbsd_main.c:1898
>> (gdb) backtrace full
>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>> No locals.
>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
>> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>> "svr_enquejob", msg=0x524554 "1", logging=0)
>> at svr_jobfunc.c:4011
>> rc = 0
>> err_msg = 0x0
>> stub_msg = "no pos"
>> __func__ = "unlock_ji_mutex"
>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>> being_recovered=true) at svr_jobfunc.c:421
>> pattrjb = 0x7fffffff4a10
>> pdef = 0x4
>> pque = 0x0
>> rc = 0
>> log_buf = '\000' <repeats 24 times>,
>> "\030\000\000\000\060\000\000\000PU\377\377\377\177\000\000\220T\377\377\377\177",
>> '\000' <repeats 50 times>, "\003\000\000\000\000\000\000\
>> 000#\000\000\000\000\000\000\000pO\377\377\377\177", '\000' <repeats 26
>> times>, "\221\260\000\000\000\200\377\377oO\377\377\377\177\000\000H
>> +B\366\377\177\000\000p+B\366\377\177\000\000\200O\377\377\
>> 377\177\000\000\201\260\000\000\000\200\377\377\177O\377\377\377\177",
>> '\000' <repeats 18 times>...
>> time_now = 1478594788
>> job_id = "0.Dual-E52630v4\000\000\000\0
>> 00\000\000\000\000\000\362\377\377\377\377\377\377\377\340J\
>> 377\377\377\177\000\000\060L\377\377\377\177\000\000\001\
>> 000\000\000\000\000\000\000\244\201\000\000\001\000\000\
>> 000\030\354\377\367\377\177\000\***@L\377\377\377\177\000\
>> 000\000\000\000\000\005\000\000\220\r\000\000\000\000\000\
>> 000\000k\022j\365\377\177\000\000\031J\377\377\377\177\000\
>> 000\201n\376\017\000\000\000\000\\\216!X\000\000\000\000_#\
>> 343+\000\000\000\000\\\216!X\000\000\000\000\207\065],", '\000' <repeats
>> 36 times>, "k\022j\365\377\177\000\000\300K\377\377\377\177\000\000\000
>> \000\000\000\000\000\000\000"...
>> queue_name = "batch\000\377\377\240\340\377\367\377\177\000"
>> total_jobs = 0
>> user_jobs = 0
>> array_jobs = 0
>> __func__ = "svr_enquejob"
>> que_mgr = {unlock_on_exit = 160, locked = 75, mutex_valid = 255,
>> managed_mutex = 0x7ffff7ddccda <open_path+474>}
>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>> change_state=1) at pbsd_init.c:2824
>> newstate = 0
>> newsubstate = 0
>> rc = 0
>> log_buf = "pbsd_init_reque:1", '\000' <repeats 1063 times>...
>> __func__ = "pbsd_init_reque"
>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>> pbsd_init.c:2558
>> d = 0
>> rc = 0
>> time_now = 1478594788
>> log_buf = '\000' <repeats 2112 times>...
>> local_errno = 0
>> job_id = '\000' <repeats 1016 times>...
>> job_atr_hold = 0
>> job_exit_status = 0
>> __func__ = "pbsd_init_job"
>> #5 0x0000000000459483 in handle_job_recovery (type=1) at pbsd_init.c:1803
>> pjob = 0x512d880
>> Index = 0
>> JobArray_iter = {first = "0.Dual-E52630v4", second = }
>> log_buf = "14 total files read from disk\000\000\000\000\000\000\0
>> 00\001\000\000\000\320\316\022\005\000\000\000\000\220N\022\005", '\000'
>> <repeats 12 times>, "Expected 1, recovered 1 queues", '\000' <repeats 1330
>> times>...
>> rc = 0
>> job_rc = 0
>> logtype = 0
>> pdirent = 0x0
>> pdirent_sub = 0x0
>> dir = 0x5124e90
>> dir_sub = 0x0
>> had = 0
>> pjob = 0x0
>> time_now = 1478594788
>> ---Type <return> to continue, or q <return> to quit---
>> basen = '\000' <repeats 1088 times>...
>> use_jobs_subdirs = 0
>> __func__ = "handle_job_recovery"
>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>> pbsd_init.c:2100
>> rc = 0
>> tmp_rc = 1974134615
>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>> ret = 0
>> gid = 0
>> log_buf = "pbsd_init:1", '\000' <repeats 997 times>...
>> __func__ = "pbsd_init"
>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>> pbsd_main.c:1898
>> i = 2
>> rc = 0
>> local_errno = 0
>> lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
>> <repeats 983 times>
>> EMsg = '\000' <repeats 1023 times>
>> tmpLine = "Server Dual-E52630v4 started, initialization type =
>> 1", '\000' <repeats 970 times>
>> log_buf = "Server Dual-E52630v4 started, initialization type =
>> 1", '\000' <repeats 1139 times>...
>> server_name_file_port = 15001
>> fp = 0x51095f0
>> (gdb) info registers
>> rax 0x0 0
>> rbx 0x6 6
>> rcx 0x0 0
>> rdx 0x512f1b0 85127600
>> rsi 0x0 0
>> rdi 0x512f1b0 85127600
>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>> rsp 0x7fffffffc870 0x7fffffffc870
>> r8 0x0 0
>> r9 0x7fffffff57a2 140737488312226
>> r10 0x513c800 85182464
>> r11 0x7ffff61e6128 140737322574120
>> r12 0x4260b0 4350128
>> r13 0x7fffffffe590 140737488348560
>> r14 0x0 0
>> r15 0x0 0
>> rip 0x461f29 0x461f29 <main(int, char**)+2183>
>> eflags 0x10246 [ PF ZF IF RF ]
>> cs 0x33 51
>> ss 0x2b 43
>> ds 0x0 0
>> es 0x0 0
>> fs 0x0 0
>> gs 0x0 0
>> (gdb) x/16i $pc
>> => 0x461f29 <main(int, char**)+2183>: test %eax,%eax
>> 0x461f2b <main(int, char**)+2185>: setne %al
>> 0x461f2e <main(int, char**)+2188>: test %al,%al
>> 0x461f30 <main(int, char**)+2190>: je 0x461f55 <main(int,
>> char**)+2227>
>> 0x461f32 <main(int, char**)+2192>: mov 0x70efc7(%rip),%rax
>> # 0xb70f00 <msg_daemonname>
>> 0x461f39 <main(int, char**)+2199>: mov $0x51bab2,%edx
>> 0x461f3e <main(int, char**)+2204>: mov %rax,%rsi
>> 0x461f41 <main(int, char**)+2207>: mov $0xffffffff,%edi
>> 0x461f46 <main(int, char**)+2212>: callq 0x425420
>> <***@plt>
>> 0x461f4b <main(int, char**)+2217>: mov $0x3,%edi
>> 0x461f50 <main(int, char**)+2222>: callq 0x425680 <***@plt>
>> 0x461f55 <main(int, char**)+2227>: mov 0x71021d(%rip),%esi
>> # 0xb72178 <pbs_mom_port>
>> 0x461f5b <main(int, char**)+2233>: mov 0x710227(%rip),%ecx
>> # 0xb72188 <pbs_scheduler_port>
>> 0x461f61 <main(int, char**)+2239>: mov 0x710225(%rip),%edx
>> # 0xb7218c <pbs_server_port_dis>
>> 0x461f67 <main(int, char**)+2245>: lea -0x1400(%rbp),%rax
>> 0x461f6e <main(int, char**)+2252>: mov $0xb739c0,%r9d
>> (gdb) thread apply all backtrace
>>
>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 10004)):
>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
>> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>> "svr_enquejob", msg=0x524554 "1", logging=0)
>> at svr_jobfunc.c:4011
>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>> being_recovered=true) at svr_jobfunc.c:421
>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>> change_state=1) at pbsd_init.c:2824
>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>> pbsd_init.c:2558
>> #5 0x0000000000459483 in handle_job_recovery (type=1) at pbsd_init.c:1803
>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>> pbsd_init.c:2100
>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>> pbsd_main.c:1898
>> (gdb) quit
>> A debugging session is active.
>>
>> Inferior 1 [process 10004] will be killed.
>>
>> Quit anyway? (y or n) y
>>
>>
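[Editorial note on the '__lll_unlock_elision' frame above: glibc on Ubuntu 16.04 can use Intel TSX lock elision for pthread mutexes, and with elision an extra pthread_mutex_unlock() on a mutex that is not currently locked can segfault instead of being silently ignored. E5 v4 (Broadwell) CPUs expose the 'rtm' flag, while on E5 v3 (Haswell) TSX was disabled by a microcode update, which would fit the observation that only the new machines crash. A hedged sketch for checking whether elision is even possible on a host; 'has_rtm' is a throwaway helper, not part of Torque:]

```shell
#!/bin/sh
# has_rtm FILE: report whether a cpuinfo-format file lists the 'rtm'
# (TSX restricted transactional memory) CPU flag, which glibc's
# pthread lock elision depends on.
has_rtm() {
    grep -qw rtm "$1"
}

if has_rtm /proc/cpuinfo; then
    echo "rtm present: glibc lock elision is possible on this host"
else
    echo "no rtm flag: lock elision cannot be the culprit here"
fi
```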
>> On Wed, Nov 2, 2016 at 1:43 AM, David Beer <***@adaptivecomputing.com>
>> wrote:
>>
>>> Kazu,
>>>
>>> Thanks for sticking with us on this. You mentioned that pbs_server did
>>> not crash when you submitted the job, but you said that it and pbs_sched
>>> are "unstable." What do you mean by unstable? Will jobs run? Your gdb output
>>> looks like a pbs_server that isn't busy, but other than that it looks
>>> normal.
>>>
>>> David
>>>
>>> On Tue, Nov 1, 2016 at 1:19 AM, Kazuhiro Fujita <
>>> ***@gmail.com> wrote:
>>>
>>>> David,
>>>>
>>>> I tested the 6.0-dev. It passed the "sudo ./torque.setup $USER" script,
>>>> but pbs_server and pbs_sched are unstable, as with 6.1-dev.
>>>>
>>>> Best,
>>>> Kazu
>>>>
>>>> Before execution of gdb
>>>>
>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>>>>> 6.0-dev
>>>>> cd 6.0-dev
>>>>> ./autogen.sh
>>>>> # build and install torque
>>>>> ./configure
>>>>> make
>>>>> sudo make install
>>>>> # Set the correct name of the server
>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>> # configure and start trqauthd
>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>> sudo update-rc.d trqauthd defaults
>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>> sudo ldconfig
>>>>> sudo service trqauthd start
>>>>> # Initialize serverdb by executing the torque.setup script
>>>>> sudo ./torque.setup $USER
>>>>>
>>>>> sudo qmgr -c 'p s'
>>>>> sudo qterm
>>>>> sudo /etc/init.d/trqauthd stop
>>>>> # set nodes
>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>> # set the head node
>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>>>>> # configure the other daemons
>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>> sudo update-rc.d pbs_server defaults
>>>>> sudo update-rc.d pbs_sched defaults
>>>>> sudo update-rc.d pbs_mom defaults
>>>>> # start torque daemons
>>>>> sudo service trqauthd start
>>>>
>>>>
>>>> Execution of gdb
>>>>
>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>
>>>>
>>>> Commands executed by another terminal
>>>>
>>>>> sudo /etc/init.d/pbs_mom start
>>>>> sudo /etc/init.d/pbs_sched start
>>>>> pbsnodes -a
>>>>> echo "sleep 30" | qsub
>>>>
>>>>
>>>> The last command did not cause a crash of pbs_server. The backtrace is
>>>> described below.
>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>> License GPLv3+: GNU GPL version 3 or later <
>>>> http://gnu.org/licenses/gpl.html>
>>>> This is free software: you are free to change and redistribute it.
>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>> copying"
>>>> and "show warranty" for details.
>>>> This GDB was configured as "x86_64-linux-gnu".
>>>> Type "show configuration" for configuration details.
>>>> For bug reporting instructions, please see:
>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>> Find the GDB manual and other documentation resources online at:
>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>> For help, type "help".
>>>> Type "apropos word" to search for commands related to "word"...
>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>> (gdb) r -D
>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>> [Thread debugging using libthread_db enabled]
>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>> ad_db.so.1".
>>>> [New Thread 0x7ffff39c1700 (LWP 5024)]
>>>> pbs_server is up (version - 6.0, port - 15001)
>>>> [New Thread 0x7ffff31c0700 (LWP 5025)]
>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to
>>>> open tcp connection - connect() failed [rc = -2] [addr =
>>>> 10.0.0.249:15003]
>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>> hierarchy to host Dual-E52630v4:15003
>>>> [New Thread 0x7ffff29bf700 (LWP 5026)]
>>>> [New Thread 0x7ffff21be700 (LWP 5027)]
>>>> [New Thread 0x7ffff19bd700 (LWP 5028)]
>>>> [New Thread 0x7ffff11bc700 (LWP 5029)]
>>>> [New Thread 0x7ffff09bb700 (LWP 5030)]
>>>> [Thread 0x7ffff09bb700 (LWP 5030) exited]
>>>> [New Thread 0x7ffff09bb700 (LWP 5031)]
>>>> [New Thread 0x7fffe3fff700 (LWP 5109)]
>>>> [New Thread 0x7fffe37fe700 (LWP 5113)]
>>>> [New Thread 0x7fffe29cf700 (LWP 5121)]
>>>> [Thread 0x7fffe29cf700 (LWP 5121) exited]
>>>> ^C
>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>> (gdb) backtrace full
>>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> No locals.
>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>> ../sysdeps/posix/usleep.c:32
>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>> state = 3
>>>> waittime = 5
>>>> pjob = 0x313a74
>>>> iter = 0x0
>>>> when = 1477984074
>>>> log = 0
>>>> scheduling = 1
>>>> sched_iteration = 600
>>>> time_now = 1477984190
>>>> update_loglevel = 1477984198
>>>> log_buf = "Server Ready, pid = 5020, loglevel=0", '\000'
>>>> <repeats 140 times>, "c\000\000\000\000\000\000\000
>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>>> <repeats 26 times>...
>>>> sem_val = 5228929
>>>> __func__ = "main_loop"
>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>> pbsd_main.c:1935
>>>> i = 2
>>>> rc = 0
>>>> local_errno = 0
>>>> lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
>>>> <repeats 983 times>
>>>> EMsg = '\000' <repeats 1023 times>
>>>> tmpLine = "Using ports Server:15001 Scheduler:15004 MOM:15002
>>>> (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>> log_buf = "Using ports Server:15001 Scheduler:15004 MOM:15002
>>>> (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>> server_name_file_port = 15001
>>>> fp = 0x51095f0
>>>> (gdb) info registers
>>>> rax 0xfffffffffffffdfc -516
>>>> rbx 0x5 5
>>>> rcx 0x7ffff612a75d 140737321805661
>>>> rdx 0x0 0
>>>> rsi 0x0 0
>>>> rdi 0x7fffffffb3f0 140737488335856
>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>> r8 0x0 0
>>>> r9 0x4000001 67108865
>>>> r10 0x1 1
>>>> r11 0x293 659
>>>> r12 0x4260b0 4350128
>>>> r13 0x7fffffffe590 140737488348560
>>>> r14 0x0 0
>>>> r15 0x0 0
>>>> rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
>>>> eflags 0x293 [ CF AF SF IF ]
>>>> cs 0x33 51
>>>> ss 0x2b 43
>>>> ds 0x0 0
>>>> es 0x0 0
>>>> fs 0x0 0
>>>> gs 0x0 0
>>>> (gdb) x/16i $pc
>>>> => 0x461fb6 <main(int, char**)+2388>: callq 0x494762 <shutdown_ack()>
>>>> 0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>> 0x461fc0 <main(int, char**)+2398>: callq 0x4250b0 <***@plt>
>>>> 0x461fc5 <main(int, char**)+2403>: mov 0x70f55c(%rip),%rdx
>>>> # 0xb71528 <msg_svrdown>
>>>> 0x461fcc <main(int, char**)+2410>: mov 0x70eeed(%rip),%rax
>>>> # 0xb70ec0 <msg_daemonname>
>>>> 0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
>>>> 0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
>>>> 0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
>>>> 0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
>>>> 0x461fe3 <main(int, char**)+2433>: callq 0x425840
>>>> <***@plt>
>>>> 0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
>>>> 0x461fed <main(int, char**)+2443>: callq 0x4269c9
>>>> <acct_close(bool)>
>>>> 0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
>>>> 0x461ff7 <main(int, char**)+2453>: callq 0x425a00
>>>> <***@plt>
>>>> 0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
>>>> 0x462001 <main(int, char**)+2463>: callq 0x424db0
>>>> <***@plt>
>>>> (gdb) thread apply all backtrace
>>>>
>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/pthread_cond_wait.S:185
>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>> u_threadpool.c:272
>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>> pthread_create.c:333
>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>>
>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/pthread_cond_wait.S:185
>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>> u_threadpool.c:272
>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>> pthread_create.c:333
>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>>
>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/pthread_cond_wait.S:185
>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>> u_threadpool.c:272
>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>> pthread_create.c:333
>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>>
>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
>>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>> ../sysdeps/posix/sleep.c:55
>>>> #2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at
>>>> req_jobobit.c:3759
>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>> pthread_create.c:333
>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>>
>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
>>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>> ../sysdeps/posix/sleep.c:55
>>>> #2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at
>>>> job_recycler.c:216
>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>> pthread_create.c:333
>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>>
>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
>>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>> ../sysdeps/posix/sleep.c:55
>>>> #2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at
>>>> exiting_jobs.c:319
>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>> pthread_create.c:333
>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>>
>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
>>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>> ../sysdeps/posix/sleep.c:55
>>>> #2 0x000000000046078d in handle_queue_routing_retries (vp=0x0) at
>>>> pbsd_main.c:1079
>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>> pthread_create.c:333
>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>>
>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
>>>> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>> process_meth=0x4c4935 <start_process_pbs_server_port(void*)>)
>>>> at ../Libnet/server_core.c:398
>>>> #2 0x00000000004608f3 in start_accept_listener (vp=0x0) at
>>>> pbsd_main.c:1141
>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>> pthread_create.c:333
>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>>
>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/pthread_cond_wait.S:185
>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>> u_threadpool.c:272
>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>> pthread_create.c:333
>>>> ---Type <return> to continue, or q <return> to quit---
>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>>
>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
>>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>> ../sysdeps/posix/usleep.c:32
>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>> pbsd_main.c:1935
>>>> (gdb) quit
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Oct 28, 2016 at 12:43 PM, Kazuhiro Fujita <
>>>> ***@gmail.com> wrote:
>>>>
>>>>> Thank you for your comments.
>>>>> I will try the 6.0-dev next week.
>>>>>
>>>>> Best,
>>>>> Kazu
>>>>>
>>>>> On Fri, Oct 28, 2016 at 5:34 AM, David Beer <
>>>>> ***@adaptivecomputing.com> wrote:
>>>>>
>>>>>> I wonder if that fix wasn't placed in the hotfix. Is there any chance
>>>>>> you can try installing 6.0-dev on your system (via GitHub) to see if it's
>>>>>> resolved? For the record, my Ubuntu 16 system doesn't give me this error,
>>>>>> or I'd try it myself. For whatever reason, none of our test cluster
>>>>>> machines (CentOS & Red Hat 6-7, SLES 11-12) experience this either. We did
>>>>>> have another user that experienced it on a test cluster, but not being able
>>>>>> to reproduce it has made it harder to track down.
>>>>>>
>>>>>> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
>>>>>> ***@gmail.com> wrote:
>>>>>>
>>>>>>> David,
>>>>>>>
>>>>>>> I tried 6.0.2.h3, but it seems the other issue still
>>>>>>> remains.
>>>>>>> After I initialized serverdb by "sudo pbs_server -t create",
>>>>>>> pbs_server crashed.
>>>>>>> Then, I used gdb with pbs_server.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kazu
>>>>>>>
>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>> copying"
>>>>>>> and "show warranty" for details.
>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>> Type "show configuration" for configuration details.
>>>>>>> For bug reporting instructions, please see:
>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>> For help, type "help".
>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>> (gdb) r -D
>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>> ad_db.so.1".
>>>>>>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>>>>>>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>>>>>>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>>>>>>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>>>>>>> [New Thread 0x7ffff21be700 (LWP 25594)]
>>>>>>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>>>>>>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>
>>>>>>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation fault.
>>>>>>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>> __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
>>>>>>> directory.
>>>>>>> (gdb) bt
>>>>>>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>> #1 0x00000000004ac076 in dispatch_timed_task (ptask=0x5727660) at
>>>>>>> svr_task.c:318
>>>>>>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at
>>>>>>> pbsd_main.c:921
>>>>>>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at
>>>>>>> u_threadpool.c:318
>>>>>>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700) at
>>>>>>> pthread_create.c:333
>>>>>>> #5 0x00007ffff6165b5d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>>>>>> ***@gmail.com> wrote:
>>>>>>>
>>>>>>>> David and Rick,
>>>>>>>>
>>>>>>>> Thank you for the quick response. I will try it later.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kazu
>>>>>>>>
>>>>>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>
>>>>>>>>> Actually, Rick just sent me the link. You can download it from
>>>>>>>>> here: http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>>>>>>>>>
>>>>>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>
>>>>>>>>>> I can confirm that this bug is fixed in 6.0-dev, and we've made a
>>>>>>>>>> hotfix for it, 6.0.2.h3. It was caused by a change in the pthread
>>>>>>>>>> library implementation, so most users will not see this crash, but
>>>>>>>>>> it appears that if you have a newer version of that library, you
>>>>>>>>>> will. Rick is going to send instructions for how to grab 6.0.2.h3.
>>>>>>>>>>
>>>>>>>>>> David
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you David for the comment on the backtrace.
>>>>>>>>>>> I hadn't noticed that until writing this mail.
>>>>>>>>>>> So, I generated the backtrace as described in the Ubuntu wiki.
>>>>>>>>>>>
>>>>>>>>>>> I also attached the backtrace of pbs_server (Torque 6.1-dev) by
>>>>>>>>>>> gdb.
>>>>>>>>>>> As I mentioned before, the torque.setup script executed
>>>>>>>>>>> successfully, but the server was unstable.
>>>>>>>>>>>
>>>>>>>>>>> Before using gdb, I used following commands.
>>>>>>>>>>>
>>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>>>>> 6.1-dev 6.1-dev
>>>>>>>>>>>> cd 6.1-dev
>>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>> ./configure
>>>>>>>>>>>> make
>>>>>>>>>>>> sudo make install
>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>> # set as services
>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>
>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc
>>>>>>>>>>>> -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed np)
>>>>>>>>>>>> sudo qterm -t quick
>>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> trqauthd was not stopped by the last command, so I stopped it
>>>>>>>>>>> by killing the trqauthd process.
>>>>>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>>>>>
>>>>>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>>>>>
>>>>>>>>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee
>>>>>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In another terminal, I executed the following commands before
>>>>>>>>>>> pbs_server crashed.
>>>>>>>>>>>
>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The output of the last command was "0.torque-server".
>>>>>>>>>>> This command crashed pbs_server under gdb.
>>>>>>>>>>> I then captured the backtrace.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kazu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> David,
>>>>>>>>>>>>
>>>>>>>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2) by gdb.
>>>>>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>>>>>>
>>>>>>>>>>>> I started pbs_server with gdb,
>>>>>>>>>>>> and executed qmgr from another terminal. (see below)
>>>>>>>>>>>>
>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Kaz
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Kazu,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you give us a backtrace for this crash? We have fixed some
>>>>>>>>>>>>> issues on startup (around mutex management for newer pthread
>>>>>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>>>>>> seeing is fixed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS with
>>>>>>>>>>>>>> dual E5-2630 v3 chips.
>>>>>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>>>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>>>>>>>>> I tried to set up Torque on them, but got stuck at the
>>>>>>>>>>>>>> initial setup script.
>>>>>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server in the
>>>>>>>>>>>>>> initial setup script (torque.setup). (see below)
>>>>>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>>>>>> Have you ever seen this kind of error?
>>>>>>>>>>>>>> If you know a possible solution, please tell me.
>>>>>>>>>>>>>> Any comments would be highly appreciated.
>>>>>>>>>>>>>> Would it be better to change the OS to another distribution,
>>>>>>>>>>>>>> such as Scientific Linux?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>> Currently no servers active. Default server will be listed
>>>>>>>>>>>>>>> as active server. Error 15133
>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00 pbs_server
>>>>>>>>>>>>>>> -t create
>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+ 12:22
>>>>>>>>>>>>>>> 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>> Currently no servers active. Default server will be listed
>>>>>>>>>>>>>>> as active server. Error 15133
>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-server)
>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00 pbs_server
>>>>>>>>>>>>>>> -t create
>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+ 16:11
>>>>>>>>>>>>>>> 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Commands used for installation before the setup script
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # set up as services
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> torqueusers mailing list
>>>>>>>>>>>>>> ***@supercluster.org
>>>>>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>> Adaptive Computing
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
>
>
Kazuhiro Fujita
2016-11-10 04:33:00 UTC
David,

In the last mail I sent, I had reinstalled 6.0-dev on the wrong server, as
you can see in the output (E5-2630v3).
On an E5-2630v4 server, pbs_server failed to restart as a daemon after
"./torque.setup $USER".

Before the crash:

> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
> 6.0-dev
> cd 6.0-dev
> ./autogen.sh
> # build and install torque
> ./configure
> make
> sudo make install
> # Set the correct name of the server
> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
> # configure and start trqauthd
> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
> sudo update-rc.d trqauthd defaults
> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
> sudo ldconfig
> sudo service trqauthd start
> # Initialize serverdb by executing the torque.setup script
> sudo ./torque.setup $USER
> sudo qmgr -c 'p s'
> sudo qterm
> sudo service trqauthd stop
> ps aux | grep pbs
> ps aux | grep trq
> # set nodes
> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
> tee /var/spool/torque/server_priv/nodes
> sudo nano /var/spool/torque/server_priv/nodes
> # set the head node
> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
> # configure other daemons
> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
> sudo update-rc.d pbs_server defaults
> sudo update-rc.d pbs_sched defaults
> sudo update-rc.d pbs_mom defaults
> # restart torque daemons
> sudo service trqauthd start
> sudo service pbs_server start


Then pbs_server did not start, so I started it under gdb.
But pbs_server under gdb did not crash, even after qsub and qstat from
another terminal.
So, I stopped the pbs_server in gdb with Ctrl+C.

Best,
Kazu

gdb output

> $ sudo gdb /usr/local/sbin/pbs_server
> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
> Copyright (C) 2016 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <
> http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/local/sbin/pbs_server...done.
> (gdb) r -D
> Starting program: /usr/local/sbin/pbs_server -D
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> [New Thread 0x7ffff39c1700 (LWP 35864)]
> pbs_server is up (version - 6.0, port - 15001)
> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open
> tcp connection - connect() failed [rc = -2] [addr = 10.0.0.249:15003]
> [New Thread 0x7ffff31c0700 (LWP 35865)]
> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom hierarchy
> to host Dual-E52630v4:15003
> [New Thread 0x7ffff29bf700 (LWP 35866)]
> [New Thread 0x7ffff21be700 (LWP 35867)]
> [New Thread 0x7ffff19bd700 (LWP 35868)]
> [New Thread 0x7ffff11bc700 (LWP 35869)]
> [New Thread 0x7ffff09bb700 (LWP 35870)]
> [Thread 0x7ffff09bb700 (LWP 35870) exited]
> [New Thread 0x7ffff09bb700 (LWP 35871)]
> [New Thread 0x7fffe3fff700 (LWP 36003)]
> [New Thread 0x7fffe37fe700 (LWP 36004)]
> [New Thread 0x7fffe2ffd700 (LWP 36011)]
> [New Thread 0x7fffe21ce700 (LWP 36016)]
> [Thread 0x7fffe21ce700 (LWP 36016) exited]
> ^C
> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
> (gdb) bt
> #0 0x00007ffff612a75d in nanosleep () at
> ../sysdeps/unix/syscall-template.S:84
> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
> ../sysdeps/posix/usleep.c:32
> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
> pbsd_main.c:1935
> (gdb) backtrace full
> #0 0x00007ffff612a75d in nanosleep () at
> ../sysdeps/unix/syscall-template.S:84
> No locals.
> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
> ../sysdeps/posix/usleep.c:32
> ts = {tv_sec = 0, tv_nsec = 250000000}
> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
> state = 3
> waittime = 5
> pjob = 0x313a74
> iter = 0x0
> when = 1478748888
> log = 0
> scheduling = 1
> sched_iteration = 600
> time_now = 1478748970
> update_loglevel = 1478748979
> log_buf = "Server Ready, pid = 35860, loglevel=0", '\000' <repeats
> 139 times>,
> "c\000\000\000\000\000\000\000\000\020\000\000\000\000\000\000\240\265\377\377\377\177",
> '\000' <repeats 26 times>...
> sem_val = 5229209
> __func__ = "main_loop"
> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
> pbsd_main.c:1935
> i = 2
> rc = 0
> local_errno = 0
> lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
> <repeats 983 times>
> EMsg = '\000' <repeats 1023 times>
> tmpLine = "Using ports Server:15001 Scheduler:15004 MOM:15002
> (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
> log_buf = "Using ports Server:15001 Scheduler:15004 MOM:15002
> (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
> server_name_file_port = 15001
> fp = 0x51095f0
> (gdb) info registers
> rax 0xfffffffffffffdfc -516
> rbx 0x6 6
> rcx 0x7ffff612a75d 140737321805661
> rdx 0x0 0
> rsi 0x0 0
> rdi 0x7fffffffb3f0 140737488335856
> rbp 0x7fffffffe4b0 0x7fffffffe4b0
> rsp 0x7fffffffc870 0x7fffffffc870
> r8 0x0 0
> r9 0x4000001 67108865
> r10 0x1 1
> r11 0x293 659
> r12 0x4260b0 4350128
> r13 0x7fffffffe590 140737488348560
> r14 0x0 0
> r15 0x0 0
> rip 0x461f92 0x461f92 <main(int, char**)+2388>
> eflags 0x293 [ CF AF SF IF ]
> cs 0x33 51
> ss 0x2b 43
> ds 0x0 0
> es 0x0 0
> fs 0x0 0
> gs 0x0 0
> (gdb) x/16i $pc
> => 0x461f92 <main(int, char**)+2388>: callq 0x49484c <shutdown_ack()>
> 0x461f97 <main(int, char**)+2393>: mov $0xffffffff,%edi
> 0x461f9c <main(int, char**)+2398>: callq 0x4250b0 <***@plt>
> 0x461fa1 <main(int, char**)+2403>: mov 0x70f5c0(%rip),%rdx #
> 0xb71568 <msg_svrdown>
> 0x461fa8 <main(int, char**)+2410>: mov 0x70ef51(%rip),%rax #
> 0xb70f00 <msg_daemonname>
> 0x461faf <main(int, char**)+2417>: mov %rdx,%rcx
> 0x461fb2 <main(int, char**)+2420>: mov %rax,%rdx
> 0x461fb5 <main(int, char**)+2423>: mov $0x1,%esi
> 0x461fba <main(int, char**)+2428>: mov $0x8002,%edi
> 0x461fbf <main(int, char**)+2433>: callq 0x425840
> <***@plt>
> 0x461fc4 <main(int, char**)+2438>: mov $0x0,%edi
> 0x461fc9 <main(int, char**)+2443>: callq 0x4269c9 <acct_close(bool)>
> 0x461fce <main(int, char**)+2448>: mov $0xb6ce00,%edi
> 0x461fd3 <main(int, char**)+2453>: callq 0x425a00
> <***@plt>
> 0x461fd8 <main(int, char**)+2458>: mov $0x1,%edi
> 0x461fdd <main(int, char**)+2463>: callq 0x424db0 <***@plt>
> (gdb) thread apply all backtrace
> Thread 12 (Thread 0x7fffe2ffd700 (LWP 36011)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe2ffd700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> Thread 11 (Thread 0x7fffe37fe700 (LWP 36004)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> Thread 10 (Thread 0x7fffe3fff700 (LWP 36003)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> Thread 9 (Thread 0x7ffff09bb700 (LWP 35871)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> Thread 7 (Thread 0x7ffff11bc700 (LWP 35869)):
> #0 0x00007ffff612a75d in nanosleep () at
> ../sysdeps/unix/syscall-template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x0000000000476913 in remove_completed_jobs (vp=0x0) at
> req_jobobit.c:3759
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> Thread 6 (Thread 0x7ffff19bd700 (LWP 35868)):
> #0 0x00007ffff612a75d in nanosleep () at
> ../sysdeps/unix/syscall-template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x00000000004afb93 in remove_extra_recycle_jobs (vp=0x0) at
> job_recycler.c:216
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> Thread 5 (Thread 0x7ffff21be700 (LWP 35867)):
> #0 0x00007ffff612a75d in nanosleep () at
> ../sysdeps/unix/syscall-template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x00000000004bc853 in inspect_exiting_jobs (vp=0x0) at
> exiting_jobs.c:319
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> Thread 4 (Thread 0x7ffff29bf700 (LWP 35866)):
> #0 0x00007ffff612a75d in nanosleep () at
> ../sysdeps/unix/syscall-template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x0000000000460769 in handle_queue_routing_retries (vp=0x0) at
> pbsd_main.c:1079
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> Thread 3 (Thread 0x7ffff31c0700 (LWP 35865)):
> #0 0x00007ffff6ee17bd in accept () at
> ../sysdeps/unix/syscall-template.S:84
> #1 0x00007ffff750a276 in start_listener_addrinfo
> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
> process_meth=0x4c4a4d <start_process_pbs_server_port(void*)>)
> at ../Libnet/server_core.c:398
> ---Type <return> to continue, or q <return> to quit---
> #2 0x00000000004608cf in start_accept_listener (vp=0x0) at
> pbsd_main.c:1141
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> Thread 2 (Thread 0x7ffff39c1700 (LWP 35864)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> Thread 1 (Thread 0x7ffff7fd5740 (LWP 35860)):
> #0 0x00007ffff612a75d in nanosleep () at
> ../sysdeps/unix/syscall-template.S:84
> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
> ../sysdeps/posix/usleep.c:32
> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
> pbsd_main.c:1935
> (gdb) quit
> A debugging session is active.
> Inferior 1 [process 35860] will be killed.
> Quit anyway? (y or n) y



Commands executed from another terminal after starting pbs_server under gdb (r -D)

> $ sudo service pbs_sched start
> $ sudo service pbs_mom start
> $ pbsnodes -a
> Dual-E52630v4
> state = free
> power_state = Running
> np = 4
> ntype = cluster
> status =
> rectime=1478748911,macaddr=34:97:f6:5d:09:a6,cpuclock=Fixed,varattr=,jobs=,state=free,netload=322618417,gres=,loadave=0.06,ncpus=40,physmem=65857216kb,availmem=131970532kb,totmem=132849340kb,idletime=108,nusers=4,nsessions=17,sessions=1036
> 1316 1327 1332 1420 1421 1422 1423 1424 1425 1426 1430 1471 1510 27075
> 27130 35902,uname=Linux Dual-E52630v4 4.4.0-45-generic #66-Ubuntu SMP Wed
> Oct 19 14:12:37 UTC 2016 x86_64,opsys=linux
> mom_service_port = 15002
> mom_manager_port = 15003
> $ echo "sleep 30" | qsub
> 0.Dual-E52630v4
> $ qstat
> Job ID Name User Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 0.Dual-E52630v4 STDIN comp_admin 0 Q
> batch



On Thu, Nov 10, 2016 at 12:01 PM, Kazuhiro Fujita <***@gmail.com
> wrote:

> David,
>
> Now, it works. Thank you.
> But jobs are executed in LIFO order, as I observed on an E5-2630v3
> server...
> Below is the output of 'qstat -t' after running 'echo "sleep 10" | qsub -t
> 1-10' three times.
>
> Best,
> Kazu
>
> $ qstat -t
> Job ID Name User Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 0.Dual-E5-2630v3 STDIN comp_admin 00:00:00 C
> batch
> 1[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
> batch
> 1[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
> batch
> 1[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
> batch
> 1[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
> batch
> 1[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
> batch
> 1[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
> batch
> 1[7].Dual-E5-2630v3 STDIN-7 comp_admin 00:00:00 C
> batch
> 1[8].Dual-E5-2630v3 STDIN-8 comp_admin 00:00:00 C
> batch
> 1[9].Dual-E5-2630v3 STDIN-9 comp_admin 00:00:00 C
> batch
> 1[10].Dual-E5-2630v3 STDIN-10 comp_admin 00:00:00 C
> batch
> 2[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
> batch
> 2[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
> batch
> 2[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
> batch
> 2[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
> batch
> 2[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
> batch
> 2[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
> batch
> 2[7].Dual-E5-2630v3 STDIN-7 comp_admin 0 Q
> batch
> 2[8].Dual-E5-2630v3 STDIN-8 comp_admin 0 Q
> batch
> 2[9].Dual-E5-2630v3 STDIN-9 comp_admin 0 Q
> batch
> 2[10].Dual-E5-2630v3 STDIN-10 comp_admin 0 Q
> batch
> 3[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
> batch
> 3[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
> batch
> 3[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
> batch
> 3[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
> batch
> 3[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
> batch
> 3[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
> batch
> 3[7].Dual-E5-2630v3 STDIN-7 comp_admin 0 R
> batch
> 3[8].Dual-E5-2630v3 STDIN-8 comp_admin 0 R
> batch
> 3[9].Dual-E5-2630v3 STDIN-9 comp_admin 0 R
> batch
> 3[10].Dual-E5-2630v3 STDIN-10 comp_admin 0 R
> batch
>
>
>
> On Thu, Nov 10, 2016 at 3:07 AM, David Beer <***@adaptivecomputing.com>
> wrote:
>
>> Kazu,
>>
>> I was able to get a system to reproduce this error. I have now checked in
>> another fix, and I can no longer reproduce this. Can you pull the latest
>> and let me know if it fixes it for you?
>>
>> On Tue, Nov 8, 2016 at 2:06 AM, Kazuhiro Fujita <
>> ***@gmail.com> wrote:
>>
>>> Hi David,
>>>
>>> I reinstalled 6.0-dev today from GitHub, and I think I observed
>>> slightly different behavior.
>>> I used the "service" command to start the daemons this time.
>>>
>>> Best,
>>> Kazu
>>>
>>> Before the crash
>>>
>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>>>> 6.0-dev
>>>> cd 6.0-dev
>>>> ./autogen.sh
>>>> # build and install torque
>>>> ./configure
>>>> make
>>>> sudo make install
>>>> # Set the correct name of the server
>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>> # configure and start trqauthd
>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>> sudo update-rc.d trqauthd defaults
>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>> sudo ldconfig
>>>> sudo service trqauthd start
>>>> # Initialize serverdb by executing the torque.setup script
>>>> sudo ./torque.setup $USER
>>>> sudo qmgr -c 'p s'
>>>> sudo qterm
>>>> sudo service trqauthd stop
>>>> ps aux | grep pbs
>>>> ps aux | grep trq
>>>> # set nodes
>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
>>>> tee /var/spool/torque/server_priv/nodes
>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>> # set the head node
>>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>>>> # configure other daemons
>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>> sudo update-rc.d pbs_server defaults
>>>> sudo update-rc.d pbs_sched defaults
>>>> sudo update-rc.d pbs_mom defaults
>>>> # start torque daemons
>>>> sudo service trqauthd start
>>>> sudo service pbs_server start
>>>> sudo service pbs_sched start
>>>> sudo service pbs_mom start
>>>> # check configuration of computation nodes
>>>> pbsnodes -a
>>>
>>>
>>> I checked the Torque processes with "ps aux | grep pbs" and "ps aux |
>>> grep trq" several times.
>>> After "pbsnodes -a", everything seems OK.
>>> But the next qsub command seems to trigger a crash of "pbs_server" and
>>> "pbs_sched".
>>>
>>> $ ps aux | grep trq
>>>> root 9682 0.0 0.0 109112 3632 ? S 17:39 0:00
>>>> /usr/local/sbin/trqauthd
>>>> comp_ad+ 9842 0.0 0.0 15236 936 pts/8 S+ 17:40 0:00 grep
>>>> --color=auto trq
>>>> $ ps aux | grep pbs
>>>> root 9720 0.0 0.0 695140 25760 ? Sl 17:39 0:00
>>>> /usr/local/sbin/pbs_server
>>>> root 9771 0.0 0.0 37996 4940 ? Ss 17:39 0:00
>>>> /usr/local/sbin/pbs_sched
>>>> root 9814 0.2 0.2 173776 136692 ? SLsl 17:40 0:00
>>>> /usr/local/sbin/pbs_mom
>>>> comp_ad+ 9844 0.0 0.0 15236 1012 pts/8 S+ 17:40 0:00 grep
>>>> --color=auto pbs
>>>> $ echo "sleep 30" | qsub
>>>> 0.Dual-E52630v4
>>>> $ ps aux | grep pbs
>>>> root 9814 0.1 0.2 173776 136692 ? SLsl 17:40 0:00
>>>> /usr/local/sbin/pbs_mom
>>>> comp_ad+ 9855 0.0 0.0 15236 928 pts/8 S+ 17:41 0:00 grep
>>>> --color=auto pbs
>>>> $ ps aux | grep trq
>>>> root 9682 0.0 0.0 109112 4144 ? S 17:39 0:00
>>>> /usr/local/sbin/trqauthd
>>>> comp_ad+ 9860 0.0 0.0 15236 1092 pts/8 S+ 17:41 0:00 grep
>>>> --color=auto trq
>>>
>>>
>>> Then I stopped the remaining processes,
>>>
>>> sudo service pbs_mom stop
>>>> sudo service trqauthd stop
>>>
>>>
>>> and restarted "trqauthd" and "pbs_server", the latter under gdb. "pbs_server"
>>> crashed in gdb without any further commands.
>>>
>>> sudo service trqauthd start
>>>> sudo gdb /usr/local/sbin/pbs_server
>>>
>>>
>>> sudo gdb /usr/local/sbin/pbs_server
>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later <
>>> http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>> copying"
>>> and "show warranty" for details.
>>> This GDB was configured as "x86_64-linux-gnu".
>>> Type "show configuration" for configuration details.
>>> For bug reporting instructions, please see:
>>> <http://www.gnu.org/software/gdb/bugs/>.
>>> Find the GDB manual and other documentation resources online at:
>>> <http://www.gnu.org/software/gdb/documentation/>.
>>> For help, type "help".
>>> Type "apropos word" to search for commands related to "word"...
>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>> (gdb) r -D
>>> Starting program: /usr/local/sbin/pbs_server -D
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>> ad_db.so.1".
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
>>> directory.
>>> (gdb) bt
>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
>>> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>> at svr_jobfunc.c:4011
>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>> being_recovered=true) at svr_jobfunc.c:421
>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>> change_state=1) at pbsd_init.c:2824
>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>> pbsd_init.c:2558
>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>> pbsd_init.c:1803
>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>> pbsd_init.c:2100
>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>> pbsd_main.c:1898
>>> (gdb) backtrace full
>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>> No locals.
>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
>>> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>> at svr_jobfunc.c:4011
>>> rc = 0
>>> err_msg = 0x0
>>> stub_msg = "no pos"
>>> __func__ = "unlock_ji_mutex"
>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>> being_recovered=true) at svr_jobfunc.c:421
>>> pattrjb = 0x7fffffff4a10
>>> pdef = 0x4
>>> pque = 0x0
>>> rc = 0
>>> log_buf = '\000' <repeats 24 times>,
>>> "\030\000\000\000\060\000\000\000PU\377\377\377\177\000\000\220T\377\377\377\177",
>>> '\000' <repeats 50 times>, "\003\000\000\000\000\000\000\
>>> 000#\000\000\000\000\000\000\000pO\377\377\377\177", '\000' <repeats 26
>>> times>, "\221\260\000\000\000\200\377\377oO\377\377\377\177\000\000H
>>> +B\366\377\177\000\000p+B\366\377\177\000\000\200O\377\377\3
>>> 77\177\000\000\201\260\000\000\000\200\377\377\177O\377\377\377\177",
>>> '\000' <repeats 18 times>...
>>> time_now = 1478594788
>>> job_id = "0.Dual-E52630v4\000\000\000\0
>>> 00\000\000\000\000\000\362\377\377\377\377\377\377\377\340J\
>>> 377\377\377\177\000\000\060L\377\377\377\177\000\000\001\000
>>> \000\000\000\000\000\000\244\201\000\000\001\000\000\000\
>>> 030\354\377\367\377\177\000\***@L\377\377\377\177\000\000\
>>> 000\000\000\000\005\000\000\220\r\000\000\000\000\000\000\
>>> 000k\022j\365\377\177\000\000\031J\377\377\377\177\000\000\
>>> 201n\376\017\000\000\000\000\\\216!X\000\000\000\000_#\343+\
>>> 000\000\000\000\\\216!X\000\000\000\000\207\065],", '\000' <repeats 36
>>> times>, "k\022j\365\377\177\000\000\300K\377\377\377\177\000\000\000
>>> \000\000\000\000\000\000\000"...
>>> queue_name = "batch\000\377\377\240\340\377\367\377\177\000"
>>> total_jobs = 0
>>> user_jobs = 0
>>> array_jobs = 0
>>> __func__ = "svr_enquejob"
>>> que_mgr = {unlock_on_exit = 160, locked = 75, mutex_valid = 255,
>>> managed_mutex = 0x7ffff7ddccda <open_path+474>}
>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>> change_state=1) at pbsd_init.c:2824
>>> newstate = 0
>>> newsubstate = 0
>>> rc = 0
>>> log_buf = "pbsd_init_reque:1", '\000' <repeats 1063 times>...
>>> __func__ = "pbsd_init_reque"
>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>> pbsd_init.c:2558
>>> d = 0
>>> rc = 0
>>> time_now = 1478594788
>>> log_buf = '\000' <repeats 2112 times>...
>>> local_errno = 0
>>> job_id = '\000' <repeats 1016 times>...
>>> job_atr_hold = 0
>>> job_exit_status = 0
>>> __func__ = "pbsd_init_job"
>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>> pbsd_init.c:1803
>>> pjob = 0x512d880
>>> Index = 0
>>> JobArray_iter = {first = "0.Dual-E52630v4", second = }
>>> log_buf = "14 total files read from
>>> disk\000\000\000\000\000\000\000\001\000\000\000\320\316\022
>>> \005\000\000\000\000\220N\022\005", '\000' <repeats 12 times>,
>>> "Expected 1, recovered 1 queues", '\000' <repeats 1330 times>...
>>> rc = 0
>>> job_rc = 0
>>> logtype = 0
>>> pdirent = 0x0
>>> pdirent_sub = 0x0
>>> dir = 0x5124e90
>>> dir_sub = 0x0
>>> had = 0
>>> pjob = 0x0
>>> time_now = 1478594788
>>> ---Type <return> to continue, or q <return> to quit---
>>> basen = '\000' <repeats 1088 times>...
>>> use_jobs_subdirs = 0
>>> __func__ = "handle_job_recovery"
>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>> pbsd_init.c:2100
>>> rc = 0
>>> tmp_rc = 1974134615
>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>> ret = 0
>>> gid = 0
>>> log_buf = "pbsd_init:1", '\000' <repeats 997 times>...
>>> __func__ = "pbsd_init"
>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>> pbsd_main.c:1898
>>> i = 2
>>> rc = 0
>>> local_errno = 0
>>> lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
>>> <repeats 983 times>
>>> EMsg = '\000' <repeats 1023 times>
>>> tmpLine = "Server Dual-E52630v4 started, initialization type =
>>> 1", '\000' <repeats 970 times>
>>> log_buf = "Server Dual-E52630v4 started, initialization type =
>>> 1", '\000' <repeats 1139 times>...
>>> server_name_file_port = 15001
>>> fp = 0x51095f0
>>> (gdb) info registers
>>> rax 0x0 0
>>> rbx 0x6 6
>>> rcx 0x0 0
>>> rdx 0x512f1b0 85127600
>>> rsi 0x0 0
>>> rdi 0x512f1b0 85127600
>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>> rsp 0x7fffffffc870 0x7fffffffc870
>>> r8 0x0 0
>>> r9 0x7fffffff57a2 140737488312226
>>> r10 0x513c800 85182464
>>> r11 0x7ffff61e6128 140737322574120
>>> r12 0x4260b0 4350128
>>> r13 0x7fffffffe590 140737488348560
>>> r14 0x0 0
>>> r15 0x0 0
>>> rip 0x461f29 0x461f29 <main(int, char**)+2183>
>>> eflags 0x10246 [ PF ZF IF RF ]
>>> cs 0x33 51
>>> ss 0x2b 43
>>> ds 0x0 0
>>> es 0x0 0
>>> fs 0x0 0
>>> gs 0x0 0
>>> (gdb) x/16i $pc
>>> => 0x461f29 <main(int, char**)+2183>: test %eax,%eax
>>> 0x461f2b <main(int, char**)+2185>: setne %al
>>> 0x461f2e <main(int, char**)+2188>: test %al,%al
>>> 0x461f30 <main(int, char**)+2190>: je 0x461f55 <main(int,
>>> char**)+2227>
>>> 0x461f32 <main(int, char**)+2192>: mov 0x70efc7(%rip),%rax
>>> # 0xb70f00 <msg_daemonname>
>>> 0x461f39 <main(int, char**)+2199>: mov $0x51bab2,%edx
>>> 0x461f3e <main(int, char**)+2204>: mov %rax,%rsi
>>> 0x461f41 <main(int, char**)+2207>: mov $0xffffffff,%edi
>>> 0x461f46 <main(int, char**)+2212>: callq 0x425420
>>> <***@plt>
>>> 0x461f4b <main(int, char**)+2217>: mov $0x3,%edi
>>> 0x461f50 <main(int, char**)+2222>: callq 0x425680 <***@plt>
>>> 0x461f55 <main(int, char**)+2227>: mov 0x71021d(%rip),%esi
>>> # 0xb72178 <pbs_mom_port>
>>> 0x461f5b <main(int, char**)+2233>: mov 0x710227(%rip),%ecx
>>> # 0xb72188 <pbs_scheduler_port>
>>> 0x461f61 <main(int, char**)+2239>: mov 0x710225(%rip),%edx
>>> # 0xb7218c <pbs_server_port_dis>
>>> 0x461f67 <main(int, char**)+2245>: lea -0x1400(%rbp),%rax
>>> 0x461f6e <main(int, char**)+2252>: mov $0xb739c0,%r9d
>>> (gdb) thread apply all backtrace
>>>
>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 10004)):
>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
>>> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>> at svr_jobfunc.c:4011
>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>> being_recovered=true) at svr_jobfunc.c:421
>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>> change_state=1) at pbsd_init.c:2824
>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>> pbsd_init.c:2558
>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>> pbsd_init.c:1803
>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>> pbsd_init.c:2100
>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>> pbsd_main.c:1898
>>> (gdb) quit
>>> A debugging session is active.
>>>
>>> Inferior 1 [process 10004] will be killed.
>>>
>>> Quit anyway? (y or n) y
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 2, 2016 at 1:43 AM, David Beer <***@adaptivecomputing.com>
>>> wrote:
>>>
>>>> Kazu,
>>>>
>>>> Thanks for sticking with us on this. You mentioned that pbs_server did
>>>> not crash when you submitted the job, but you said that it and pbs_sched
>>>> are "unstable." What do you mean by unstable? Will jobs run? Your gdb output
>>>> looks like a pbs_server that isn't busy, but other than that it looks
>>>> normal.
>>>>
>>>> David
>>>>
>>>> On Tue, Nov 1, 2016 at 1:19 AM, Kazuhiro Fujita <
>>>> ***@gmail.com> wrote:
>>>>
>>>>> David,
>>>>>
>>>>> I tested the 6.0-dev. It passed the "sudo ./torque.setup $USER"
>>>>> script,
>>>>> but pbs_server and pbs_sched are unstable like 6.1-dev.
>>>>>
>>>>> Best,
>>>>> Kazu
>>>>>
>>>>> Before execution of gdb
>>>>>
>>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>>>>>> 6.0-dev
>>>>>> cd 6.0-dev
>>>>>> ./autogen.sh
>>>>>> # build and install torque
>>>>>> ./configure
>>>>>> make
>>>>>> sudo make install
>>>>>> # Set the correct name of the server
>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>> # configure and start trqauthd
>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>> sudo update-rc.d trqauthd defaults
>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>> sudo ldconfig
>>>>>> sudo service trqauthd start
>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>> sudo ./torque.setup $USER
>>>>>>
>>>>>> sudo qmgr -c 'p s'
>>>>>> sudo qterm
>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>> # set nodes
>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>> # set the head node
>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>>>>>> # configure other daemons
>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>> sudo update-rc.d pbs_server defaults
>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>> # start torque daemons
>>>>>> sudo service trqauthd start
>>>>>
>>>>>
>>>>> Execution of gdb
>>>>>
>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>
>>>>>
>>>>> Commands executed in another terminal
>>>>>
>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>> pbsnodes -a
>>>>>> echo "sleep 30" | qsub
>>>>>
>>>>>
>>>>> The last command did not cause a crash of pbs_server. The backtrace is
>>>>> described below.
>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>> http://gnu.org/licenses/gpl.html>
>>>>> This is free software: you are free to change and redistribute it.
>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>> copying"
>>>>> and "show warranty" for details.
>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>> Type "show configuration" for configuration details.
>>>>> For bug reporting instructions, please see:
>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>> Find the GDB manual and other documentation resources online at:
>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>> For help, type "help".
>>>>> Type "apropos word" to search for commands related to "word"...
>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>> (gdb) r -D
>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>> [Thread debugging using libthread_db enabled]
>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>> ad_db.so.1".
>>>>> [New Thread 0x7ffff39c1700 (LWP 5024)]
>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>> [New Thread 0x7ffff31c0700 (LWP 5025)]
>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to
>>>>> open tcp connection - connect() failed [rc = -2] [addr =
>>>>> 10.0.0.249:15003]
>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>> hierarchy to host Dual-E52630v4:15003
>>>>> [New Thread 0x7ffff29bf700 (LWP 5026)]
>>>>> [New Thread 0x7ffff21be700 (LWP 5027)]
>>>>> [New Thread 0x7ffff19bd700 (LWP 5028)]
>>>>> [New Thread 0x7ffff11bc700 (LWP 5029)]
>>>>> [New Thread 0x7ffff09bb700 (LWP 5030)]
>>>>> [Thread 0x7ffff09bb700 (LWP 5030) exited]
>>>>> [New Thread 0x7ffff09bb700 (LWP 5031)]
>>>>> [New Thread 0x7fffe3fff700 (LWP 5109)]
>>>>> [New Thread 0x7fffe37fe700 (LWP 5113)]
>>>>> [New Thread 0x7fffe29cf700 (LWP 5121)]
>>>>> [Thread 0x7fffe29cf700 (LWP 5121) exited]
>>>>> ^C
>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>>> te.S:84
>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>> (gdb) backtrace full
>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>> No locals.
>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>> ../sysdeps/posix/usleep.c:32
>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>> state = 3
>>>>> waittime = 5
>>>>> pjob = 0x313a74
>>>>> iter = 0x0
>>>>> when = 1477984074
>>>>> log = 0
>>>>> scheduling = 1
>>>>> sched_iteration = 600
>>>>> time_now = 1477984190
>>>>> update_loglevel = 1477984198
>>>>> log_buf = "Server Ready, pid = 5020, loglevel=0", '\000'
>>>>> <repeats 140 times>, "c\000\000\000\000\000\000\000
>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>>>> <repeats 26 times>...
>>>>> sem_val = 5228929
>>>>> __func__ = "main_loop"
>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>> pbsd_main.c:1935
>>>>> i = 2
>>>>> rc = 0
>>>>> local_errno = 0
>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>> '\000' <repeats 983 times>
>>>>> EMsg = '\000' <repeats 1023 times>
>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>> server_name_file_port = 15001
>>>>> fp = 0x51095f0
>>>>> (gdb) info registers
>>>>> rax 0xfffffffffffffdfc -516
>>>>> rbx 0x5 5
>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>> rdx 0x0 0
>>>>> rsi 0x0 0
>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>> r8 0x0 0
>>>>> r9 0x4000001 67108865
>>>>> r10 0x1 1
>>>>> r11 0x293 659
>>>>> r12 0x4260b0 4350128
>>>>> r13 0x7fffffffe590 140737488348560
>>>>> r14 0x0 0
>>>>> r15 0x0 0
>>>>> rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>> cs 0x33 51
>>>>> ss 0x2b 43
>>>>> ds 0x0 0
>>>>> es 0x0 0
>>>>> fs 0x0 0
>>>>> gs 0x0 0
>>>>> (gdb) x/16i $pc
>>>>> => 0x461fb6 <main(int, char**)+2388>: callq 0x494762 <shutdown_ack()>
>>>>> 0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>> 0x461fc0 <main(int, char**)+2398>: callq 0x4250b0 <***@plt>
>>>>> 0x461fc5 <main(int, char**)+2403>: mov 0x70f55c(%rip),%rdx
>>>>> # 0xb71528 <msg_svrdown>
>>>>> 0x461fcc <main(int, char**)+2410>: mov 0x70eeed(%rip),%rax
>>>>> # 0xb70ec0 <msg_daemonname>
>>>>> 0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
>>>>> 0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
>>>>> 0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
>>>>> 0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
>>>>> 0x461fe3 <main(int, char**)+2433>: callq 0x425840
>>>>> <***@plt>
>>>>> 0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
>>>>> 0x461fed <main(int, char**)+2443>: callq 0x4269c9
>>>>> <acct_close(bool)>
>>>>> 0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
>>>>> 0x461ff7 <main(int, char**)+2453>: callq 0x425a00
>>>>> <***@plt>
>>>>> 0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
>>>>> 0x462001 <main(int, char**)+2463>: callq 0x424db0
>>>>> <***@plt>
>>>>> (gdb) thread apply all backtrace
>>>>>
>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>> u_threadpool.c:272
>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>> pthread_create.c:333
>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>>
>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>> u_threadpool.c:272
>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>> pthread_create.c:333
>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>>
>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>> u_threadpool.c:272
>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>> pthread_create.c:333
>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>>
>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>> ../sysdeps/posix/sleep.c:55
>>>>> #2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at
>>>>> req_jobobit.c:3759
>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>> pthread_create.c:333
>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>>
>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>> ../sysdeps/posix/sleep.c:55
>>>>> #2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at
>>>>> job_recycler.c:216
>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>> pthread_create.c:333
>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>>
>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>> ../sysdeps/posix/sleep.c:55
>>>>> #2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at
>>>>> exiting_jobs.c:319
>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>> pthread_create.c:333
>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>>
>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>> ../sysdeps/posix/sleep.c:55
>>>>> #2 0x000000000046078d in handle_queue_routing_retries (vp=0x0) at
>>>>> pbsd_main.c:1079
>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>> pthread_create.c:333
>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>>
>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
>>>>> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-templa
>>>>> te.S:84
>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>> process_meth=0x4c4935 <start_process_pbs_server_port(void*)>)
>>>>> at ../Libnet/server_core.c:398
>>>>> #2 0x00000000004608f3 in start_accept_listener (vp=0x0) at
>>>>> pbsd_main.c:1141
>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>> pthread_create.c:333
>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>>
>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>> u_threadpool.c:272
>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>> pthread_create.c:333
>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>>
>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>> ../sysdeps/posix/usleep.c:32
>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>> pbsd_main.c:1935
>>>>> (gdb) quit
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 28, 2016 at 12:43 PM, Kazuhiro Fujita <
>>>>> ***@gmail.com> wrote:
>>>>>
>>>>>> Thank you for your comments.
>>>>>> I will try the 6.0-dev next week.
>>>>>>
>>>>>> Best,
>>>>>> Kazu
>>>>>>
>>>>>> On Fri, Oct 28, 2016 at 5:34 AM, David Beer <
>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>
>>>>>>> I wonder if that fix wasn't included in the hotfix. Is there any
>>>>>>> chance you can try installing 6.0-dev on your system (via GitHub) to see if
>>>>>>> it's resolved? For the record, my Ubuntu 16 system doesn't give me this
>>>>>>> error, or I'd try it myself. For whatever reason, none of our test cluster
>>>>>>> machines (CentOS & Red Hat 6-7, SLES 11-12) experience this either. We did
>>>>>>> have another user who experiences it on a test cluster, but not being able
>>>>>>> to reproduce it has made it harder to track down.
>>>>>>>
>>>>>>> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
>>>>>>> ***@gmail.com> wrote:
>>>>>>>
>>>>>>>> David,
>>>>>>>>
>>>>>>>> I tried 6.0.2.h3, but it seems that the other issue still remains.
>>>>>>>> After I initialized serverdb with "sudo pbs_server -t create",
>>>>>>>> pbs_server crashed.
>>>>>>>> Then I ran pbs_server under gdb.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kazu
>>>>>>>>
>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>>> copying"
>>>>>>>> and "show warranty" for details.
>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>> For bug reporting instructions, please see:
>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>> For help, type "help".
>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>> (gdb) r -D
>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>> ad_db.so.1".
>>>>>>>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>>>>>>>> [New Thread 0x7ffff21be700 (LWP 25594)]
>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>
>>>>>>>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation fault.
>>>>>>>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>> __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file
>>>>>>>> or directory.
>>>>>>>> (gdb) bt
>>>>>>>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>> #1 0x00000000004ac076 in dispatch_timed_task (ptask=0x5727660) at
>>>>>>>> svr_task.c:318
>>>>>>>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at
>>>>>>>> pbsd_main.c:921
>>>>>>>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at
>>>>>>>> u_threadpool.c:318
>>>>>>>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #5 0x00007ffff6165b5d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> David and Rick,
>>>>>>>>>
>>>>>>>>> Thank you for the quick response. I will try it later.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Kazu
>>>>>>>>>
>>>>>>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>
>>>>>>>>>> Actually, Rick just sent me the link. You can download it from
>>>>>>>>>> here: http://files.adaptivecomputing.com/hotfix/torque-6.0.2
>>>>>>>>>> .h3.tar.gz
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I can confirm that this bug is fixed in 6.0-dev, and we've made
>>>>>>>>>>> a hotfix for it, 6.0.2.h3. This was caused by a change in the
>>>>>>>>>>> implementation of the pthread library, so most users will not see this
>>>>>>>>>>> crash, but it appears that if you have a newer version of that library,
>>>>>>>>>>> then you will get it. Rick is going to send instructions for how to grab
>>>>>>>>>>> 6.0.2.h3.
>>>>>>>>>>>
>>>>>>>>>>> David
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thank you David for the comment on the backtrace.
>>>>>>>>>>>> I hadn't noticed that until writing this mail.
>>>>>>>>>>>> So I took the backtrace as described in the Ubuntu wiki.
>>>>>>>>>>>>
>>>>>>>>>>>> I also attached the backtrace of pbs_server (Torque 6.1-dev) by
>>>>>>>>>>>> gdb.
>>>>>>>>>>>> As I mentioned before, the torque.setup script executed
>>>>>>>>>>>> successfully, but the daemons were unstable.
>>>>>>>>>>>>
>>>>>>>>>>>> Before using gdb, I used following commands.
>>>>>>>>>>>>
>>>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>>>>>> 6.1-dev 6.1-dev
>>>>>>>>>>>>> cd 6.1-dev
>>>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>> make
>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>> # set as services
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>
>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc
>>>>>>>>>>>>> -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed np)
>>>>>>>>>>>>> sudo qterm -t quick
>>>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
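As a side note on the nodes-file line above: the `cat /proc/cpuinfo | grep processor | wc -l` pipeline can be collapsed into a single `grep -c`. A minimal sketch (the host name below is a placeholder, not the real server name):

```shell
# Count logical CPUs for the server_priv/nodes file.
# grep -c does the counting itself, replacing cat | grep | wc -l.
np=$(grep -c '^processor' /proc/cpuinfo)

# "torque-server" is a placeholder; on a real head node use $HOSTNAME and
# tee the line into /var/spool/torque/server_priv/nodes as shown above.
echo "torque-server np=$np"
```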
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> trqauthd was not stopped by the last command, so I stopped it
>>>>>>>>>>>> by killing the trqauthd process.
>>>>>>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>>>>>>
>>>>>>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>>>>>>
>>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server 2>&1 | tee
>>>>>>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> In another terminal, I executed the following commands before
>>>>>>>>>>>> pbs_server crashed.
>>>>>>>>>>>>
>>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The output of the last command was "0.torque-server",
>>>>>>>>>>>> and this command crashed pbs_server in gdb.
>>>>>>>>>>>> I then captured the backtrace.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Kazu
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> David,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2) from gdb.
>>>>>>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I started pbs_server with gdb,
>>>>>>>>>>>>> and executed qmgr from another terminal. (see below)
>>>>>>>>>>>>>
>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection refused
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
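Incidentally, the repeated qmgr retries against a dead server (the "Connection refused" loop above) can be shortened by polling the port first. A rough sketch, assuming bash (it relies on bash's /dev/tcp) and the default pbs_server port 15001:

```shell
#!/bin/bash
# Sketch: poll a TCP port before running a client command such as qmgr.
wait_for_port() {
  host=$1; port=$2; tries=${3:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # bash opens /dev/tcp/<host>/<port>; the subshell fails if nothing listens
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    i=$((i+1))
    sleep 1
  done
  return 1
}

# Usage sketch: wait_for_port localhost 15001 && qmgr -c 'p s'
```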
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Kaz
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kazu,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you give us a backtrace for this crash? We have fixed
>>>>>>>>>>>>>> some issues on startup (around mutex management for newer pthread
>>>>>>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>>>>>>> seeing is fixed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dear All,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS with
>>>>>>>>>>>>>>> dual E5-2630 v3 chips.
>>>>>>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>>>>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>>>>>>>>>> I tried to set up Torque on them, but I got stuck at the
>>>>>>>>>>>>>>> initial setup script.
>>>>>>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server in the
>>>>>>>>>>>>>>> initial setup script (torque.setup). (see below)
>>>>>>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>>>>>>> Have you ever observed this kind of error?
>>>>>>>>>>>>>>> If you know possible solutions, please tell me.
>>>>>>>>>>>>>>> Any comments will be highly appreciated.
>>>>>>>>>>>>>>> Would it be better to change the OS to another distribution,
>>>>>>>>>>>>>>> such as Scientific Linux?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> torque-server-***@torque-ser
>>>>>>>>>>>>>>>> ver:~/Downloads/torque/torque-4.2.10$ sudo ./torque.setup
>>>>>>>>>>>>>>>> $USER
>>>>>>>>>>>>>>>> Currently no servers active. Default server will be listed
>>>>>>>>>>>>>>>> as active server. Error 15133
>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-ser
>>>>>>>>>>>>>>>> ver)
>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00 pbs_server
>>>>>>>>>>>>>>>> -t create
>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>> torque-server-***@torque-ser
>>>>>>>>>>>>>>>> ver:~/Downloads/torque/torque-4.2.10$ ps aux | grep pbs
>>>>>>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+ 12:22
>>>>>>>>>>>>>>>> 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pbs_server -t create was not found.
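One small aside on the `ps aux | grep pbs` check above: the grep matches itself, so an empty result is easier to read with `pgrep`. A sketch:

```shell
# Sketch: test for a running pbs_server without matching the grep process.
# pgrep -x matches the exact process name and exits non-zero when none is found.
if pgrep -x pbs_server >/dev/null 2>&1; then
  echo "pbs_server is running"
else
  echo "pbs_server is not running"
fi
```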
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>> Currently no servers active. Default server will be listed
>>>>>>>>>>>>>>>> as active server. Error 15133
>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is: 15001
>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-ser
>>>>>>>>>>>>>>>> ver)
>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00 pbs_server
>>>>>>>>>>>>>>>> -t create
>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+ 16:11
>>>>>>>>>>>>>>>> 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Commands used for installation before the setup script
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # set up as services
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> torqueusers mailing list
>>>>>>>>>>>>>>> ***@supercluster.org
>>>>>>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>>> Adaptive Computing
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
David Beer
2016-11-15 23:10:48 UTC
Permalink
Kazu,

What did it do when it failed to start?

On Wed, Nov 9, 2016 at 9:33 PM, Kazuhiro Fujita <***@gmail.com>
wrote:

> David,
>
> In the last mail I sent, I had reinstalled 6.0-dev on the wrong server,
> as you can see in the output (E5-2630v3).
> On an E5-2630v4 server, pbs_server failed to restart as a daemon after
> "./torque.setup $USER".
>
> Before the crash:
>
>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>> 6.0-dev
>> cd 6.0-dev
>> ./autogen.sh
>> # build and install torque
>> ./configure
>> make
>> sudo make install
>> # Set the correct name of the server
>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>> # configure and start trqauthd
>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>> sudo update-rc.d trqauthd defaults
>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>> sudo ldconfig
>> sudo service trqauthd start
>> # Initialize serverdb by executing the torque.setup script
>> sudo ./torque.setup $USER
>> sudo qmgr -c 'p s'
>> sudo qterm
>> sudo service trqauthd stop
>> ps aux | grep pbs
>> ps aux | grep trq
>> # set nodes
>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
>> tee /var/spool/torque/server_priv/nodes
>> sudo nano /var/spool/torque/server_priv/nodes
>> # set the head node
>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>> # configure other daemons
>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>> sudo update-rc.d pbs_server defaults
>> sudo update-rc.d pbs_sched defaults
>> sudo update-rc.d pbs_mom defaults
>> # restart torque daemons
>> sudo service trqauthd start
>> sudo service pbs_server start
>
>
> Then pbs_server did not start, so I started it under gdb.
> But pbs_server under gdb did not crash, even after qsub and qstat from
> another terminal.
> So I stopped pbs_server in gdb with Ctrl+C.
>
> Best,
> Kazu
>
> gdb output
>
>> $ sudo gdb /usr/local/sbin/pbs_server
>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>> Copyright (C) 2016 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.
>> html>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-linux-gnu".
>> Type "show configuration" for configuration details.
>> For bug reporting instructions, please see:
>> <http://www.gnu.org/software/gdb/bugs/>.
>> Find the GDB manual and other documentation resources online at:
>> <http://www.gnu.org/software/gdb/documentation/>.
>> For help, type "help".
>> Type "apropos word" to search for commands related to "word"...
>> Reading symbols from /usr/local/sbin/pbs_server...done.
>> (gdb) r -D
>> Starting program: /usr/local/sbin/pbs_server -D
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib/x86_64-linux-gnu/
>> libthread_db.so.1".
>> [New Thread 0x7ffff39c1700 (LWP 35864)]
>> pbs_server is up (version - 6.0, port - 15001)
>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open
>> tcp connection - connect() failed [rc = -2] [addr = 10.0.0.249:15003]
>> [New Thread 0x7ffff31c0700 (LWP 35865)]
>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom hierarchy
>> to host Dual-E52630v4:15003
>> [New Thread 0x7ffff29bf700 (LWP 35866)]
>> [New Thread 0x7ffff21be700 (LWP 35867)]
>> [New Thread 0x7ffff19bd700 (LWP 35868)]
>> [New Thread 0x7ffff11bc700 (LWP 35869)]
>> [New Thread 0x7ffff09bb700 (LWP 35870)]
>> [Thread 0x7ffff09bb700 (LWP 35870) exited]
>> [New Thread 0x7ffff09bb700 (LWP 35871)]
>> [New Thread 0x7fffe3fff700 (LWP 36003)]
>> [New Thread 0x7fffe37fe700 (LWP 36004)]
>> [New Thread 0x7fffe2ffd700 (LWP 36011)]
>> [New Thread 0x7fffe21ce700 (LWP 36016)]
>> [Thread 0x7fffe21ce700 (LWP 36016) exited]
>> ^C
>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
>> template.S:84
>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>> (gdb) bt
>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
>> template.S:84
>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>> ../sysdeps/posix/usleep.c:32
>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>> pbsd_main.c:1935
>> (gdb) backtrace full
>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
>> template.S:84
>> No locals.
>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>> ../sysdeps/posix/usleep.c:32
>> ts = {tv_sec = 0, tv_nsec = 250000000}
>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>> state = 3
>> waittime = 5
>> pjob = 0x313a74
>> iter = 0x0
>> when = 1478748888
>> log = 0
>> scheduling = 1
>> sched_iteration = 600
>> time_now = 1478748970
>> update_loglevel = 1478748979
>> log_buf = "Server Ready, pid = 35860, loglevel=0", '\000'
>> <repeats 139 times>, "c\000\000\000\000\000\000\
>> 000\000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>> <repeats 26 times>...
>> sem_val = 5229209
>> __func__ = "main_loop"
>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>> pbsd_main.c:1935
>> i = 2
>> rc = 0
>> local_errno = 0
>> lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
>> <repeats 983 times>
>> EMsg = '\000' <repeats 1023 times>
>> tmpLine = "Using ports Server:15001 Scheduler:15004 MOM:15002
>> (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>> log_buf = "Using ports Server:15001 Scheduler:15004 MOM:15002
>> (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>> server_name_file_port = 15001
>> fp = 0x51095f0
>> (gdb) info registers
>> rax 0xfffffffffffffdfc -516
>> rbx 0x6 6
>> rcx 0x7ffff612a75d 140737321805661
>> rdx 0x0 0
>> rsi 0x0 0
>> rdi 0x7fffffffb3f0 140737488335856
>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>> rsp 0x7fffffffc870 0x7fffffffc870
>> r8 0x0 0
>> r9 0x4000001 67108865
>> r10 0x1 1
>> r11 0x293 659
>> r12 0x4260b0 4350128
>> r13 0x7fffffffe590 140737488348560
>> r14 0x0 0
>> r15 0x0 0
>> rip 0x461f92 0x461f92 <main(int, char**)+2388>
>> eflags 0x293 [ CF AF SF IF ]
>> cs 0x33 51
>> ss 0x2b 43
>> ds 0x0 0
>> es 0x0 0
>> fs 0x0 0
>> gs 0x0 0
>> (gdb) x/16i $pc
>> => 0x461f92 <main(int, char**)+2388>: callq 0x49484c <shutdown_ack()>
>> 0x461f97 <main(int, char**)+2393>: mov $0xffffffff,%edi
>> 0x461f9c <main(int, char**)+2398>: callq 0x4250b0 <***@plt>
>> 0x461fa1 <main(int, char**)+2403>: mov 0x70f5c0(%rip),%rdx
>> # 0xb71568 <msg_svrdown>
>> 0x461fa8 <main(int, char**)+2410>: mov 0x70ef51(%rip),%rax
>> # 0xb70f00 <msg_daemonname>
>> 0x461faf <main(int, char**)+2417>: mov %rdx,%rcx
>> 0x461fb2 <main(int, char**)+2420>: mov %rax,%rdx
>> 0x461fb5 <main(int, char**)+2423>: mov $0x1,%esi
>> 0x461fba <main(int, char**)+2428>: mov $0x8002,%edi
>> 0x461fbf <main(int, char**)+2433>: callq 0x425840
>> <***@plt>
>> 0x461fc4 <main(int, char**)+2438>: mov $0x0,%edi
>> 0x461fc9 <main(int, char**)+2443>: callq 0x4269c9 <acct_close(bool)>
>> 0x461fce <main(int, char**)+2448>: mov $0xb6ce00,%edi
>> 0x461fd3 <main(int, char**)+2453>: callq 0x425a00
>> <***@plt>
>> 0x461fd8 <main(int, char**)+2458>: mov $0x1,%edi
>> 0x461fdd <main(int, char**)+2463>: callq 0x424db0 <***@plt>
>> (gdb) thread apply all backtrace
>> Thread 12 (Thread 0x7fffe2ffd700 (LWP 36011)):
>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
>> x86_64/pthread_cond_wait.S:185
>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at u_threadpool.c:272
>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe2ffd700) at
>> pthread_create.c:333
>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
>> x86_64/clone.S:109
>> Thread 11 (Thread 0x7fffe37fe700 (LWP 36004)):
>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
>> x86_64/pthread_cond_wait.S:185
>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>> pthread_create.c:333
>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
>> x86_64/clone.S:109
>> Thread 10 (Thread 0x7fffe3fff700 (LWP 36003)):
>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
>> x86_64/pthread_cond_wait.S:185
>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at u_threadpool.c:272
>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>> pthread_create.c:333
>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
>> x86_64/clone.S:109
>> Thread 9 (Thread 0x7ffff09bb700 (LWP 35871)):
>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
>> x86_64/pthread_cond_wait.S:185
>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>> pthread_create.c:333
>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
>> x86_64/clone.S:109
>> Thread 7 (Thread 0x7ffff11bc700 (LWP 35869)):
>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
>> template.S:84
>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>> ../sysdeps/posix/sleep.c:55
>> #2 0x0000000000476913 in remove_completed_jobs (vp=0x0) at
>> req_jobobit.c:3759
>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>> pthread_create.c:333
>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
>> x86_64/clone.S:109
>> Thread 6 (Thread 0x7ffff19bd700 (LWP 35868)):
>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
>> template.S:84
>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>> ../sysdeps/posix/sleep.c:55
>> #2 0x00000000004afb93 in remove_extra_recycle_jobs (vp=0x0) at
>> job_recycler.c:216
>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>> pthread_create.c:333
>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
>> x86_64/clone.S:109
>> Thread 5 (Thread 0x7ffff21be700 (LWP 35867)):
>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
>> template.S:84
>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>> ../sysdeps/posix/sleep.c:55
>> #2 0x00000000004bc853 in inspect_exiting_jobs (vp=0x0) at
>> exiting_jobs.c:319
>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>> pthread_create.c:333
>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
>> x86_64/clone.S:109
>> Thread 4 (Thread 0x7ffff29bf700 (LWP 35866)):
>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
>> template.S:84
>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>> ../sysdeps/posix/sleep.c:55
>> #2 0x0000000000460769 in handle_queue_routing_retries (vp=0x0) at
>> pbsd_main.c:1079
>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>> pthread_create.c:333
>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
>> x86_64/clone.S:109
>> Thread 3 (Thread 0x7ffff31c0700 (LWP 35865)):
>> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-
>> template.S:84
>> #1 0x00007ffff750a276 in start_listener_addrinfo
>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>> process_meth=0x4c4a4d <start_process_pbs_server_port(void*)>)
>> at ../Libnet/server_core.c:398
>> ---Type <return> to continue, or q <return> to quit---
>> #2 0x00000000004608cf in start_accept_listener (vp=0x0) at
>> pbsd_main.c:1141
>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>> pthread_create.c:333
>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
>> x86_64/clone.S:109
>> Thread 2 (Thread 0x7ffff39c1700 (LWP 35864)):
>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
>> x86_64/pthread_cond_wait.S:185
>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>> pthread_create.c:333
>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
>> x86_64/clone.S:109
>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 35860)):
>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
>> template.S:84
>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>> ../sysdeps/posix/usleep.c:32
>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>> pbsd_main.c:1935
>> (gdb) quit
>> A debugging session is active.
>> Inferior 1 [process 35860] will be killed.
>> Quit anyway? (y or n) y
>
>
>
> Commands executed from another terminal after starting pbs_server under gdb (r -D)
>
>> $ sudo service pbs_sched start
>> $ sudo service pbs_mom start
>> $ pbsnodes -a
>> Dual-E52630v4
>> state = free
>> power_state = Running
>> np = 4
>> ntype = cluster
>> status = rectime=1478748911,macaddr=34:
>> 97:f6:5d:09:a6,cpuclock=Fixed,varattr=,jobs=,state=free,
>> netload=322618417,gres=,loadave=0.06,ncpus=40,physmem=
>> 65857216kb,availmem=131970532kb,totmem=132849340kb,idletime=108,
>> nusers=4,nsessions=17,sessions=1036 1316 1327 1332 1420 1421 1422 1423
>> 1424 1425 1426 1430 1471 1510 27075 27130 35902,uname=Linux Dual-E52630v4
>> 4.4.0-45-generic #66-Ubuntu SMP Wed Oct 19 14:12:37 UTC 2016
>> x86_64,opsys=linux
>> mom_service_port = 15002
>> mom_manager_port = 15003
>> $ echo "sleep 30" | qsub
>> 0.Dual-E52630v4
>> $ qstat
>> Job ID Name User Time Use S
>> Queue
>> ------------------------- ---------------- --------------- -------- -
>> -----
>> 0.Dual-E52630v4 STDIN comp_admin 0 Q
>> batch
>
>
>
> On Thu, Nov 10, 2016 at 12:01 PM, Kazuhiro Fujita <
> ***@gmail.com> wrote:
>
>> David,
>>
>> Now it works. Thank you.
>> But jobs are executed in LIFO order, as I observed on an E5-2630v3
>> server...
>> Below is the output of 'qstat -t' after running 'echo "sleep 10" | qsub
>> -t 1-10' three times.
>>
>> Best,
>> Kazu
>>
>> $ qstat -t
>> Job ID Name User Time Use S
>> Queue
>> ------------------------- ---------------- --------------- -------- -
>> -----
>> 0.Dual-E5-2630v3 STDIN comp_admin 00:00:00 C
>> batch
>> 1[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
>> batch
>> 1[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
>> batch
>> 1[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
>> batch
>> 1[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
>> batch
>> 1[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
>> batch
>> 1[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
>> batch
>> 1[7].Dual-E5-2630v3 STDIN-7 comp_admin 00:00:00 C
>> batch
>> 1[8].Dual-E5-2630v3 STDIN-8 comp_admin 00:00:00 C
>> batch
>> 1[9].Dual-E5-2630v3 STDIN-9 comp_admin 00:00:00 C
>> batch
>> 1[10].Dual-E5-2630v3 STDIN-10 comp_admin 00:00:00 C
>> batch
>> 2[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
>> batch
>> 2[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
>> batch
>> 2[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
>> batch
>> 2[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
>> batch
>> 2[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
>> batch
>> 2[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
>> batch
>> 2[7].Dual-E5-2630v3 STDIN-7 comp_admin 0 Q
>> batch
>> 2[8].Dual-E5-2630v3 STDIN-8 comp_admin 0 Q
>> batch
>> 2[9].Dual-E5-2630v3 STDIN-9 comp_admin 0 Q
>> batch
>> 2[10].Dual-E5-2630v3 STDIN-10 comp_admin 0 Q
>> batch
>> 3[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
>> batch
>> 3[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
>> batch
>> 3[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
>> batch
>> 3[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
>> batch
>> 3[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
>> batch
>> 3[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
>> batch
>> 3[7].Dual-E5-2630v3 STDIN-7 comp_admin 0 R
>> batch
>> 3[8].Dual-E5-2630v3 STDIN-8 comp_admin 0 R
>> batch
>> 3[9].Dual-E5-2630v3 STDIN-9 comp_admin 0 R
>> batch
>> 3[10].Dual-E5-2630v3 STDIN-10 comp_admin 0 R
>> batch
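State patterns like the LIFO one above are easier to spot when the `qstat -t` listing is summarized per state letter. A sketch over invented sample rows (not the real output):

```shell
# Sketch: count jobs per state column in qstat -t style output.
# On job lines (ID contains a dot), the state letter is the
# second-to-last whitespace-separated field.
awk '$1 ~ /\./ && NF >= 5 { count[$(NF-1)]++ }
     END { for (s in count) print s, count[s] }' <<'EOF'
0.Dual-E5-2630v3     STDIN     comp_admin  00:00:00  C  batch
1[1].Dual-E5-2630v3  STDIN-1   comp_admin  0         Q  batch
1[2].Dual-E5-2630v3  STDIN-2   comp_admin  0         R  batch
EOF
```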
>>
>>
>>
>> On Thu, Nov 10, 2016 at 3:07 AM, David Beer <***@adaptivecomputing.com>
>> wrote:
>>
>>> Kazu,
>>>
>>> I was able to get a system to reproduce this error. I have now checked
>>> in another fix, and I can no longer reproduce this. Can you pull the latest
>>> and let me know if it fixes it for you?
>>>
>>> On Tue, Nov 8, 2016 at 2:06 AM, Kazuhiro Fujita <
>>> ***@gmail.com> wrote:
>>>
>>>> Hi David,
>>>>
>>>> I reinstalled 6.0-dev today from GitHub, and I think I observed
>>>> slightly different behavior.
>>>> I used the "service" command to start the daemons this time.
>>>>
>>>> Best,
>>>> Kazu
>>>>
>>>> Before the crash
>>>>
>>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>>>>> 6.0-dev
>>>>> cd 6.0-dev
>>>>> ./autogen.sh
>>>>> # build and install torque
>>>>> ./configure
>>>>> make
>>>>> sudo make install
>>>>> # Set the correct name of the server
>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>> # configure and start trqauthd
>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>> sudo update-rc.d trqauthd defaults
>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>> sudo ldconfig
>>>>> sudo service trqauthd start
>>>>> # Initialize serverdb by executing the torque.setup script
>>>>> sudo ./torque.setup $USER
>>>>> sudo qmgr -c 'p s'
>>>>> sudo qterm
>>>>> sudo service trqauthd stop
>>>>> ps aux | grep pbs
>>>>> ps aux | grep trq
>>>>> # set nodes
>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>> # set the head node
>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/con
>>>>> fig
>>>>> # configure other daemons
>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>> sudo update-rc.d pbs_server defaults
>>>>> sudo update-rc.d pbs_sched defaults
>>>>> sudo update-rc.d pbs_mom defaults
>>>>> # start torque daemons
>>>>> sudo service trqauthd start
>>>>> sudo service pbs_server start
>>>>> sudo service pbs_sched start
>>>>> sudo service pbs_mom start
>>>>> # check the configuration of computation nodes
>>>>> pbsnodes -a
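As a small parsing aside, the `np` reported by `pbsnodes -a` can be pulled out mechanically; a sketch over a trimmed, invented node record (not the full output shown elsewhere in this thread):

```shell
# Sketch: extract np from pbsnodes -a style output.
# With " = " as the field separator, attribute lines split into name/value.
awk -F' = ' '$1 ~ /^[[:space:]]*np$/ { print $2 }' <<'EOF'
Dual-E52630v4
     state = free
     power_state = Running
     np = 4
     ntype = cluster
EOF
```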
>>>>
>>>>
>>>> I checked the torque processes with "ps aux | grep pbs" and "ps aux |
>>>> grep trq" several times.
>>>> After "pbsnodes -a", everything seemed OK.
>>>> But the next qsub command seems to trigger a crash of "pbs_server" and
>>>> "pbs_sched".
>>>>
>>>> $ ps aux | grep trq
>>>>> root 9682 0.0 0.0 109112 3632 ? S 17:39 0:00
>>>>> /usr/local/sbin/trqauthd
>>>>> comp_ad+ 9842 0.0 0.0 15236 936 pts/8 S+ 17:40 0:00 grep
>>>>> --color=auto trq
>>>>> $ ps aux | grep pbs
>>>>> root 9720 0.0 0.0 695140 25760 ? Sl 17:39 0:00
>>>>> /usr/local/sbin/pbs_server
>>>>> root 9771 0.0 0.0 37996 4940 ? Ss 17:39 0:00
>>>>> /usr/local/sbin/pbs_sched
>>>>> root 9814 0.2 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>> /usr/local/sbin/pbs_mom
>>>>> comp_ad+ 9844 0.0 0.0 15236 1012 pts/8 S+ 17:40 0:00 grep
>>>>> --color=auto pbs
>>>>> $ echo "sleep 30" | qsub
>>>>> 0.Dual-E52630v4
>>>>> $ ps aux | grep pbs
>>>>> root 9814 0.1 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>> /usr/local/sbin/pbs_mom
>>>>> comp_ad+ 9855 0.0 0.0 15236 928 pts/8 S+ 17:41 0:00 grep
>>>>> --color=auto pbs
>>>>> $ ps aux | grep trq
>>>>> root 9682 0.0 0.0 109112 4144 ? S 17:39 0:00
>>>>> /usr/local/sbin/trqauthd
>>>>> comp_ad+ 9860 0.0 0.0 15236 1092 pts/8 S+ 17:41 0:00 grep
>>>>> --color=auto trq
>>>>
>>>>
>>>> Then, I stopped the remaining processes,
>>>>
>>>> sudo service pbs_mom stop
>>>>> sudo service trqauthd stop
>>>>
>>>>
>>>> and started "trqauthd" again, then ran "pbs_server" under gdb.
>>>> "pbs_server" crashed in gdb without any further commands.
>>>>
>>>> sudo service trqauthd start
>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>
>>>>
>>>> sudo gdb /usr/local/sbin/pbs_server
>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>> License GPLv3+: GNU GPL version 3 or later <
>>>> http://gnu.org/licenses/gpl.html>
>>>> This is free software: you are free to change and redistribute it.
>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>> copying"
>>>> and "show warranty" for details.
>>>> This GDB was configured as "x86_64-linux-gnu".
>>>> Type "show configuration" for configuration details.
>>>> For bug reporting instructions, please see:
>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>> Find the GDB manual and other documentation resources online at:
>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>> For help, type "help".
>>>> Type "apropos word" to search for commands related to "word"...
>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>> (gdb) r -D
>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>> [Thread debugging using libthread_db enabled]
>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>> ad_db.so.1".
>>>>
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
>>>> directory.
>>>> (gdb) bt
>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
>>>> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>> at svr_jobfunc.c:4011
>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>> being_recovered=true) at svr_jobfunc.c:421
>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>> change_state=1) at pbsd_init.c:2824
>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>> pbsd_init.c:2558
>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>> pbsd_init.c:1803
>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>> pbsd_init.c:2100
>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>> pbsd_main.c:1898
>>>> (gdb) backtrace full
>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>> No locals.
>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
>>>> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>> at svr_jobfunc.c:4011
>>>> rc = 0
>>>> err_msg = 0x0
>>>> stub_msg = "no pos"
>>>> __func__ = "unlock_ji_mutex"
>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>> being_recovered=true) at svr_jobfunc.c:421
>>>> pattrjb = 0x7fffffff4a10
>>>> pdef = 0x4
>>>> pque = 0x0
>>>> rc = 0
>>>> log_buf = '\000' <repeats 24 times>,
>>>> "\030\000\000\000\060\000\000\000PU\377\377\377\177\000\000\220T\377\377\377\177",
>>>> '\000' <repeats 50 times>, "\003\000\000\000\000\000\000\
>>>> 000#\000\000\000\000\000\000\000pO\377\377\377\177", '\000' <repeats
>>>> 26 times>, "\221\260\000\000\000\200\377\377oO\377\377\377\177\000\000H
>>>> +B\366\377\177\000\000p+B\366\377\177\000\000\200O\377\377\3
>>>> 77\177\000\000\201\260\000\000\000\200\377\377\177O\377\377\377\177",
>>>> '\000' <repeats 18 times>...
>>>> time_now = 1478594788
>>>> job_id = "0.Dual-E52630v4\000\000\000\0
>>>> 00\000\000\000\000\000\362\377\377\377\377\377\377\377\340J\
>>>> 377\377\377\177\000\000\060L\377\377\377\177\000\000\001\000
>>>> \000\000\000\000\000\000\244\201\000\000\001\000\000\000\030
>>>> \354\377\367\377\177\000\***@L\377\377\377\177\000\000\000\
>>>> 000\000\000\005\000\000\220\r\000\000\000\000\000\000\000k\
>>>> 022j\365\377\177\000\000\031J\377\377\377\177\000\000\201n\
>>>> 376\017\000\000\000\000\\\216!X\000\000\000\000_#\343+\000\
>>>> 000\000\000\\\216!X\000\000\000\000\207\065],", '\000' <repeats 36
>>>> times>, "k\022j\365\377\177\000\000\300K\377\377\377\177\000\000\000
>>>> \000\000\000\000\000\000\000"...
>>>> queue_name = "batch\000\377\377\240\340\377\367\377\177\000"
>>>> total_jobs = 0
>>>> user_jobs = 0
>>>> array_jobs = 0
>>>> __func__ = "svr_enquejob"
>>>> que_mgr = {unlock_on_exit = 160, locked = 75, mutex_valid =
>>>> 255, managed_mutex = 0x7ffff7ddccda <open_path+474>}
>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>> change_state=1) at pbsd_init.c:2824
>>>> newstate = 0
>>>> newsubstate = 0
>>>> rc = 0
>>>> log_buf = "pbsd_init_reque:1", '\000' <repeats 1063 times>...
>>>> __func__ = "pbsd_init_reque"
>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>> pbsd_init.c:2558
>>>> d = 0
>>>> rc = 0
>>>> time_now = 1478594788
>>>> log_buf = '\000' <repeats 2112 times>...
>>>> local_errno = 0
>>>> job_id = '\000' <repeats 1016 times>...
>>>> job_atr_hold = 0
>>>> job_exit_status = 0
>>>> __func__ = "pbsd_init_job"
>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>> pbsd_init.c:1803
>>>> pjob = 0x512d880
>>>> Index = 0
>>>> JobArray_iter = {first = "0.Dual-E52630v4", second = }
>>>> log_buf = "14 total files read from
>>>> disk\000\000\000\000\000\000\000\001\000\000\000\320\316\022
>>>> \005\000\000\000\000\220N\022\005", '\000' <repeats 12 times>,
>>>> "Expected 1, recovered 1 queues", '\000' <repeats 1330 times>...
>>>> rc = 0
>>>> job_rc = 0
>>>> logtype = 0
>>>> pdirent = 0x0
>>>> pdirent_sub = 0x0
>>>> dir = 0x5124e90
>>>> dir_sub = 0x0
>>>> had = 0
>>>> pjob = 0x0
>>>> time_now = 1478594788
>>>> ---Type <return> to continue, or q <return> to quit---
>>>> basen = '\000' <repeats 1088 times>...
>>>> use_jobs_subdirs = 0
>>>> __func__ = "handle_job_recovery"
>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>> pbsd_init.c:2100
>>>> rc = 0
>>>> tmp_rc = 1974134615
>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>> ret = 0
>>>> gid = 0
>>>> log_buf = "pbsd_init:1", '\000' <repeats 997 times>...
>>>> __func__ = "pbsd_init"
>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>> pbsd_main.c:1898
>>>> i = 2
>>>> rc = 0
>>>> local_errno = 0
>>>> lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
>>>> <repeats 983 times>
>>>> EMsg = '\000' <repeats 1023 times>
>>>> tmpLine = "Server Dual-E52630v4 started, initialization type =
>>>> 1", '\000' <repeats 970 times>
>>>> log_buf = "Server Dual-E52630v4 started, initialization type =
>>>> 1", '\000' <repeats 1139 times>...
>>>> server_name_file_port = 15001
>>>> fp = 0x51095f0
>>>> (gdb) info registers
>>>> rax 0x0 0
>>>> rbx 0x6 6
>>>> rcx 0x0 0
>>>> rdx 0x512f1b0 85127600
>>>> rsi 0x0 0
>>>> rdi 0x512f1b0 85127600
>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>> r8 0x0 0
>>>> r9 0x7fffffff57a2 140737488312226
>>>> r10 0x513c800 85182464
>>>> r11 0x7ffff61e6128 140737322574120
>>>> r12 0x4260b0 4350128
>>>> r13 0x7fffffffe590 140737488348560
>>>> r14 0x0 0
>>>> r15 0x0 0
>>>> rip 0x461f29 0x461f29 <main(int, char**)+2183>
>>>> eflags 0x10246 [ PF ZF IF RF ]
>>>> cs 0x33 51
>>>> ss 0x2b 43
>>>> ds 0x0 0
>>>> es 0x0 0
>>>> fs 0x0 0
>>>> gs 0x0 0
>>>> (gdb) x/16i $pc
>>>> => 0x461f29 <main(int, char**)+2183>: test %eax,%eax
>>>> 0x461f2b <main(int, char**)+2185>: setne %al
>>>> 0x461f2e <main(int, char**)+2188>: test %al,%al
>>>> 0x461f30 <main(int, char**)+2190>: je 0x461f55 <main(int,
>>>> char**)+2227>
>>>> 0x461f32 <main(int, char**)+2192>: mov 0x70efc7(%rip),%rax
>>>> # 0xb70f00 <msg_daemonname>
>>>> 0x461f39 <main(int, char**)+2199>: mov $0x51bab2,%edx
>>>> 0x461f3e <main(int, char**)+2204>: mov %rax,%rsi
>>>> 0x461f41 <main(int, char**)+2207>: mov $0xffffffff,%edi
>>>> 0x461f46 <main(int, char**)+2212>: callq 0x425420
>>>> <***@plt>
>>>> 0x461f4b <main(int, char**)+2217>: mov $0x3,%edi
>>>> 0x461f50 <main(int, char**)+2222>: callq 0x425680 <***@plt>
>>>> 0x461f55 <main(int, char**)+2227>: mov 0x71021d(%rip),%esi
>>>> # 0xb72178 <pbs_mom_port>
>>>> 0x461f5b <main(int, char**)+2233>: mov 0x710227(%rip),%ecx
>>>> # 0xb72188 <pbs_scheduler_port>
>>>> 0x461f61 <main(int, char**)+2239>: mov 0x710225(%rip),%edx
>>>> # 0xb7218c <pbs_server_port_dis>
>>>> 0x461f67 <main(int, char**)+2245>: lea -0x1400(%rbp),%rax
>>>> 0x461f6e <main(int, char**)+2252>: mov $0xb739c0,%r9d
>>>> (gdb) thread apply all backtrace
>>>>
>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 10004)):
>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
>>>> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>> at svr_jobfunc.c:4011
>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>> being_recovered=true) at svr_jobfunc.c:421
>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>> change_state=1) at pbsd_init.c:2824
>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>> pbsd_init.c:2558
>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>> pbsd_init.c:1803
>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>> pbsd_init.c:2100
>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>> pbsd_main.c:1898
>>>> (gdb) quit
>>>> A debugging session is active.
>>>>
>>>> Inferior 1 [process 10004] will be killed.
>>>>
>>>> Quit anyway? (y or n) y
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Nov 2, 2016 at 1:43 AM, David Beer <***@adaptivecomputing.com
>>>> > wrote:
>>>>
>>>>> Kazu,
>>>>>
>>>>> Thanks for sticking with us on this. You mentioned that pbs_server did
>>>>> not crash when you submitted the job, but you said that it and pbs_sched
>>>>> are "unstable." What do you mean by unstable? Will jobs run? Your gdb output
>>>>> looks like a pbs_server that isn't busy, but other than that it looks
>>>>> normal.
>>>>>
>>>>> David
>>>>>
>>>>> On Tue, Nov 1, 2016 at 1:19 AM, Kazuhiro Fujita <
>>>>> ***@gmail.com> wrote:
>>>>>
>>>>>> David,
>>>>>>
>>>>>> I tested the 6.0-dev. It passed the "sudo ./torque.setup $USER"
>>>>>> script,
>>>>>> but pbs_server and pbs_sched are unstable like 6.1-dev.
>>>>>>
>>>>>> Best,
>>>>>> Kazu
>>>>>>
>>>>>> Before execution of gdb
>>>>>>
>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>>>>>>> 6.0-dev
>>>>>>> cd 6.0-dev
>>>>>>> ./autogen.sh
>>>>>>> # build and install torque
>>>>>>> ./configure
>>>>>>> make
>>>>>>> sudo make install
>>>>>>> # Set the correct name of the server
>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>> # configure and start trqauthd
>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>> sudo ldconfig
>>>>>>> sudo service trqauthd start
>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>> sudo ./torque.setup $USER
>>>>>>>
>>>>>>> sudo qmgr -c 'p s'
>>>>>>> sudo qterm
>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>> # set nodes
>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>> # set the head node
>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>> # configure other daemons
>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>> # start torque daemons
>>>>>>> sudo service trqauthd start
>>>>>>
>>>>>>
>>>>>> Execution of gdb
>>>>>>
>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>
>>>>>>
>>>>>> Commands executed by another terminal
>>>>>>
>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>> pbsnodes -a
>>>>>>> echo "sleep 30" | qsub
>>>>>>
>>>>>>
>>>>>> The last command did not cause a crash of pbs_server. The backtrace
>>>>>> is described below.
>>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>> This is free software: you are free to change and redistribute it.
>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>> copying"
>>>>>> and "show warranty" for details.
>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>> Type "show configuration" for configuration details.
>>>>>> For bug reporting instructions, please see:
>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>> For help, type "help".
>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>> (gdb) r -D
>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>> [Thread debugging using libthread_db enabled]
>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>> ad_db.so.1".
>>>>>> [New Thread 0x7ffff39c1700 (LWP 5024)]
>>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>>> [New Thread 0x7ffff31c0700 (LWP 5025)]
>>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to
>>>>>> open tcp connection - connect() failed [rc = -2] [addr =
>>>>>> 10.0.0.249:15003]
>>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>>> hierarchy to host Dual-E52630v4:15003
>>>>>> [New Thread 0x7ffff29bf700 (LWP 5026)]
>>>>>> [New Thread 0x7ffff21be700 (LWP 5027)]
>>>>>> [New Thread 0x7ffff19bd700 (LWP 5028)]
>>>>>> [New Thread 0x7ffff11bc700 (LWP 5029)]
>>>>>> [New Thread 0x7ffff09bb700 (LWP 5030)]
>>>>>> [Thread 0x7ffff09bb700 (LWP 5030) exited]
>>>>>> [New Thread 0x7ffff09bb700 (LWP 5031)]
>>>>>> [New Thread 0x7fffe3fff700 (LWP 5109)]
>>>>>> [New Thread 0x7fffe37fe700 (LWP 5113)]
>>>>>> [New Thread 0x7fffe29cf700 (LWP 5121)]
>>>>>> [Thread 0x7fffe29cf700 (LWP 5121) exited]
>>>>>> ^C
>>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>>>> te.S:84
>>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>>> (gdb) backtrace full
>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>> No locals.
>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>> state = 3
>>>>>> waittime = 5
>>>>>> pjob = 0x313a74
>>>>>> iter = 0x0
>>>>>> when = 1477984074
>>>>>> log = 0
>>>>>> scheduling = 1
>>>>>> sched_iteration = 600
>>>>>> time_now = 1477984190
>>>>>> update_loglevel = 1477984198
>>>>>> log_buf = "Server Ready, pid = 5020, loglevel=0", '\000'
>>>>>> <repeats 140 times>, "c\000\000\000\000\000\000\000
>>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>>>>> <repeats 26 times>...
>>>>>> sem_val = 5228929
>>>>>> __func__ = "main_loop"
>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>> pbsd_main.c:1935
>>>>>> i = 2
>>>>>> rc = 0
>>>>>> local_errno = 0
>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>> '\000' <repeats 983 times>
>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>>> server_name_file_port = 15001
>>>>>> fp = 0x51095f0
>>>>>> (gdb) info registers
>>>>>> rax 0xfffffffffffffdfc -516
>>>>>> rbx 0x5 5
>>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>>> rdx 0x0 0
>>>>>> rsi 0x0 0
>>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>> r8 0x0 0
>>>>>> r9 0x4000001 67108865
>>>>>> r10 0x1 1
>>>>>> r11 0x293 659
>>>>>> r12 0x4260b0 4350128
>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>> r14 0x0 0
>>>>>> r15 0x0 0
>>>>>> rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
>>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>>> cs 0x33 51
>>>>>> ss 0x2b 43
>>>>>> ds 0x0 0
>>>>>> es 0x0 0
>>>>>> fs 0x0 0
>>>>>> gs 0x0 0
>>>>>> (gdb) x/16i $pc
>>>>>> => 0x461fb6 <main(int, char**)+2388>: callq 0x494762
>>>>>> <shutdown_ack()>
>>>>>> 0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>>> 0x461fc0 <main(int, char**)+2398>: callq 0x4250b0 <***@plt>
>>>>>> 0x461fc5 <main(int, char**)+2403>: mov 0x70f55c(%rip),%rdx
>>>>>> # 0xb71528 <msg_svrdown>
>>>>>> 0x461fcc <main(int, char**)+2410>: mov 0x70eeed(%rip),%rax
>>>>>> # 0xb70ec0 <msg_daemonname>
>>>>>> 0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
>>>>>> 0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
>>>>>> 0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
>>>>>> 0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
>>>>>> 0x461fe3 <main(int, char**)+2433>: callq 0x425840
>>>>>> <***@plt>
>>>>>> 0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
>>>>>> 0x461fed <main(int, char**)+2443>: callq 0x4269c9
>>>>>> <acct_close(bool)>
>>>>>> 0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
>>>>>> 0x461ff7 <main(int, char**)+2453>: callq 0x425a00
>>>>>> <***@plt>
>>>>>> 0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
>>>>>> 0x462001 <main(int, char**)+2463>: callq 0x424db0
>>>>>> <***@plt>
>>>>>> (gdb) thread apply all backtrace
>>>>>>
>>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>> u_threadpool.c:272
>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>>> pthread_create.c:333
>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>>
>>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>> u_threadpool.c:272
>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>>> pthread_create.c:333
>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>>
>>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>> u_threadpool.c:272
>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>>> pthread_create.c:333
>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>>
>>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>> #2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at
>>>>>> req_jobobit.c:3759
>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>>> pthread_create.c:333
>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>>
>>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>> #2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at
>>>>>> job_recycler.c:216
>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>>> pthread_create.c:333
>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>>
>>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>> #2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at
>>>>>> exiting_jobs.c:319
>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>>> pthread_create.c:333
>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>>
>>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>> #2 0x000000000046078d in handle_queue_routing_retries (vp=0x0) at
>>>>>> pbsd_main.c:1079
>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>>> pthread_create.c:333
>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>>
>>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
>>>>>> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-templa
>>>>>> te.S:84
>>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>>> process_meth=0x4c4935 <start_process_pbs_server_port(void*)>)
>>>>>> at ../Libnet/server_core.c:398
>>>>>> #2 0x00000000004608f3 in start_accept_listener (vp=0x0) at
>>>>>> pbsd_main.c:1141
>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>>> pthread_create.c:333
>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>>
>>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>> u_threadpool.c:272
>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>>> pthread_create.c:333
>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>>
>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>> pbsd_main.c:1935
>>>>>> (gdb) quit
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 28, 2016 at 12:43 PM, Kazuhiro Fujita <
>>>>>> ***@gmail.com> wrote:
>>>>>>
>>>>>>> Thank you for your comments.
>>>>>>> I will try the 6.0-dev next week.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kazu
>>>>>>>
>>>>>>> On Fri, Oct 28, 2016 at 5:34 AM, David Beer <
>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>
>>>>>>>> I wonder if that fix wasn't included in the hotfix. Is there any
>>>>>>>> chance you can try installing 6.0-dev on your system (via GitHub) to see if
>>>>>>>> it's resolved? For the record, my Ubuntu 16 system doesn't give me this
>>>>>>>> error, or I'd try it myself. For whatever reason, none of our test cluster
>>>>>>>> machines (Cent & Redhat 6-7, SLES 11-12) experience this either. We did
>>>>>>>> have another user that experiences it on a test cluster, but not being able
>>>>>>>> to reproduce it has made it harder to track down.
>>>>>>>>
>>>>>>>> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> David,
>>>>>>>>>
>>>>>>>>> I tried the 6.0.2.h3. But it seems that the other issue still
>>>>>>>>> remains.
>>>>>>>>> After I initialized serverdb with "sudo pbs_server -t create",
>>>>>>>>> pbs_server crashed.
>>>>>>>>> Then, I ran pbs_server under gdb.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Kazu
>>>>>>>>>
>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>>>> copying"
>>>>>>>>> and "show warranty" for details.
>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>> For help, type "help".
>>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>> (gdb) r -D
>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>>> ad_db.so.1".
>>>>>>>>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>>>>>>>>> [New Thread 0x7ffff21be700 (LWP 25594)]
>>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>
>>>>>>>>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation fault.
>>>>>>>>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>> __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file
>>>>>>>>> or directory.
>>>>>>>>> (gdb) bt
>>>>>>>>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>> #1 0x00000000004ac076 in dispatch_timed_task (ptask=0x5727660) at
>>>>>>>>> svr_task.c:318
>>>>>>>>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at
>>>>>>>>> pbsd_main.c:921
>>>>>>>>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at
>>>>>>>>> u_threadpool.c:318
>>>>>>>>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700) at
>>>>>>>>> pthread_create.c:333
>>>>>>>>> #5 0x00007ffff6165b5d in clone () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> David and Rick,
>>>>>>>>>>
>>>>>>>>>> Thank you for the quick response. I will try it later.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Kazu
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Actually, Rick just sent me the link. You can download it from
>>>>>>>>>>> here: http://files.adaptivecomputing.com/hotfix/torque-6.0.2
>>>>>>>>>>> .h3.tar.gz
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I can confirm that this bug is fixed in 6.0-dev, and we've made
>>>>>>>>>>>> a hotfix for it, 6.0.2.h3. This was caused by a change in the
>>>>>>>>>>>> pthread library implementation, so most users will not see this crash,
>>>>>>>>>>>> but with a newer version of that library you will hit it. Rick is
>>>>>>>>>>>> going to send instructions for how to grab 6.0.2.h3.
>>>>>>>>>>>>
>>>>>>>>>>>> David
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you David for the comment on the backtrace.
>>>>>>>>>>>>> I haven't noticed that until writing this mail.
>>>>>>>>>>>>> So, I used backtrace as written in the Ubuntu wiki.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also attached the backtrace of pbs_server (Torque 6.1-dev)
>>>>>>>>>>>>> by gdb.
>>>>>>>>>>>>> As I mentioned before, the torque.setup script executed
>>>>>>>>>>>>> successfully, but the daemons are unstable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Before using gdb, I used following commands.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>>>>>>> 6.1-dev 6.1-dev
>>>>>>>>>>>>>> cd 6.1-dev
>>>>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>> make
>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>> # set as services
>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc
>>>>>>>>>>>>>> -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed np)
>>>>>>>>>>>>>> sudo qterm -t quick
>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> trqauthd was not stopped by the last command, so I stopped it
>>>>>>>>>>>>> by killing the trqauthd process.
>>>>>>>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>>>>>>>
>>>>>>>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>>>>>>>
>>>>>>>>>>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee
>>>>>>>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> In another terminal, I executed the following commands before
>>>>>>>>>>>>> pbs_server crashed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The output of the last command was "0.torque-server",
>>>>>>>>>>>>> and this command crashed pbs_server in gdb.
>>>>>>>>>>>>> I then captured the backtrace.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> David,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2) by gdb.
>>>>>>>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I started pbs_server with gdb,
>>>>>>>>>>>>>> and executed qmgr from another terminal (see below).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After running qmgr, I pressed Ctrl+C in gdb.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Kaz
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Kazu,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you give us a backtrace for this crash? We have fixed
>>>>>>>>>>>>>>> some issues on startup (around mutex management for newer pthread
>>>>>>>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>>>>>>>> seeing is fixed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear All,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS with
>>>>>>>>>>>>>>>> dual E5-2630 v3 chips.
>>>>>>>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>>>>>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>>>>>>>>>>> And I tried to set up Torque on them, but I got stuck at
>>>>>>>>>>>>>>>> the initial setup script.
>>>>>>>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server in the
>>>>>>>>>>>>>>>> initial setup script (torque.setup); see below.
>>>>>>>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>>>>>>>> Have you ever observed this kind of error?
>>>>>>>>>>>>>>>> If you know possible solutions, please tell me.
>>>>>>>>>>>>>>>> Any comments will be highly appreciated.
>>>>>>>>>>>>>>>> Would it be better to change to another distribution,
>>>>>>>>>>>>>>>> such as Scientific Linux?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> torque-server-***@torque-ser
>>>>>>>>>>>>>>>>> ver:~/Downloads/torque/torque-4.2.10$ sudo ./torque.setup
>>>>>>>>>>>>>>>>> $USER
>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be listed
>>>>>>>>>>>>>>>>> as active server. Error 15133
>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is:
>>>>>>>>>>>>>>>>> 15001
>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-ser
>>>>>>>>>>>>>>>>> ver)
>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00 pbs_server
>>>>>>>>>>>>>>>>> -t create
>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>> torque-server-***@torque-ser
>>>>>>>>>>>>>>>>> ver:~/Downloads/torque/torque-4.2.10$ ps aux | grep pbs
>>>>>>>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+ 12:22
>>>>>>>>>>>>>>>>> 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The "pbs_server -t create" process was not found.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be listed
>>>>>>>>>>>>>>>>> as active server. Error 15133
>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is:
>>>>>>>>>>>>>>>>> 15001
>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-ser
>>>>>>>>>>>>>>>>> ver)
>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00 pbs_server
>>>>>>>>>>>>>>>>> -t create
>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+ 16:11
>>>>>>>>>>>>>>>>> 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The "pbs_server -t create" process was not found.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Commands used for installation before the setup script
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # set up as services
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> torqueusers mailing list
>>>>>>>>>>>>>>>> ***@supercluster.org
>>>>>>>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>>>> Adaptive Computing
>>>>>>>>>>>>>>>


--
David Beer | Torque Architect
Adaptive Computing
Kazuhiro Fujita
2016-11-16 07:24:40 UTC
Permalink
David,

I did not find a pbs_server process after executing the commands shown
below.

sudo service trqauthd start
> sudo service pbs_server start


I am not sure what it did.
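[Editor's note: when an init script exits without leaving a daemon behind, running the server in the foreground and raising the core-dump limit usually reveals more. A minimal sketch follows; the binary and spool paths are the defaults used elsewhere in this thread, and the per-day log-file naming is an assumption.]

```shell
# Hedged sketch: first steps when pbs_server exits silently at startup.
# 1. Run the daemon in the foreground so errors print to the terminal:
#      sudo /usr/local/sbin/pbs_server -D
# 2. Check the newest server log for the failure reason:
#      sudo tail -n 50 "/var/spool/torque/server_logs/$(date +%Y%m%d)"
# 3. Allow core dumps so a crash leaves a core file to inspect with gdb:
ulimit -c unlimited 2>/dev/null || true
echo "core dump limit: $(ulimit -c)"
```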

Best,
Kazu


On Wed, Nov 16, 2016 at 8:10 AM, David Beer <***@adaptivecomputing.com>
wrote:

> Kazu,
>
> What did it do when it failed to start?
>
> On Wed, Nov 9, 2016 at 9:33 PM, Kazuhiro Fujita <***@gmail.com
> > wrote:
>
>> David,
>>
>> In the last mail I sent, I had reinstalled 6.0-dev on the wrong server, as
>> you can see in the output (E5-2630v3).
>> On an E5-2630v4 server, pbs_server failed to restart as a daemon after
>> "./torque.setup $USER".
>>
>> Before crash:
>>
>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>>> 6.0-dev
>>> cd 6.0-dev
>>> ./autogen.sh
>>> # build and install torque
>>> ./configure
>>> make
>>> sudo make install
>>> # Set the correct name of the server
>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>> # configure and start trqauthd
>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>> sudo update-rc.d trqauthd defaults
>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>> sudo ldconfig
>>> sudo service trqauthd start
>>> # Initialize serverdb by executing the torque.setup script
>>> sudo ./torque.setup $USER
>>> sudo qmgr -c 'p s'
>>> sudo qterm
>>> sudo service trqauthd stop
>>> ps aux | grep pbs
>>> ps aux | grep trq
>>> # set nodes
>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
>>> tee /var/spool/torque/server_priv/nodes
>>> sudo nano /var/spool/torque/server_priv/nodes
>>> # set the head node
>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>>> # configure other daemons
>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>> sudo update-rc.d pbs_server defaults
>>> sudo update-rc.d pbs_sched defaults
>>> sudo update-rc.d pbs_mom defaults
>>> # restart torque daemons
>>> sudo service trqauthd start
>>> sudo service pbs_server start
>>
>>
>> Then pbs_server did not start, so I started it under gdb.
>> But pbs_server under gdb did not crash, even after qsub and qstat from
>> another terminal, so I stopped it in gdb with Ctrl+C.
>>
>> Best,
>> Kazu
>>
>> gdb output
>>
>>> $ sudo gdb /usr/local/sbin/pbs_server
>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later <
>>> http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>> copying"
>>> and "show warranty" for details.
>>> This GDB was configured as "x86_64-linux-gnu".
>>> Type "show configuration" for configuration details.
>>> For bug reporting instructions, please see:
>>> <http://www.gnu.org/software/gdb/bugs/>.
>>> Find the GDB manual and other documentation resources online at:
>>> <http://www.gnu.org/software/gdb/documentation/>.
>>> For help, type "help".
>>> Type "apropos word" to search for commands related to "word"...
>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>> (gdb) r -D
>>> Starting program: /usr/local/sbin/pbs_server -D
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>> ad_db.so.1".
>>> [New Thread 0x7ffff39c1700 (LWP 35864)]
>>> pbs_server is up (version - 6.0, port - 15001)
>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open
>>> tcp connection - connect() failed [rc = -2] [addr = 10.0.0.249:15003]
>>> [New Thread 0x7ffff31c0700 (LWP 35865)]
>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>> hierarchy to host Dual-E52630v4:15003
>>> [New Thread 0x7ffff29bf700 (LWP 35866)]
>>> [New Thread 0x7ffff21be700 (LWP 35867)]
>>> [New Thread 0x7ffff19bd700 (LWP 35868)]
>>> [New Thread 0x7ffff11bc700 (LWP 35869)]
>>> [New Thread 0x7ffff09bb700 (LWP 35870)]
>>> [Thread 0x7ffff09bb700 (LWP 35870) exited]
>>> [New Thread 0x7ffff09bb700 (LWP 35871)]
>>> [New Thread 0x7fffe3fff700 (LWP 36003)]
>>> [New Thread 0x7fffe37fe700 (LWP 36004)]
>>> [New Thread 0x7fffe2ffd700 (LWP 36011)]
>>> [New Thread 0x7fffe21ce700 (LWP 36016)]
>>> [Thread 0x7fffe21ce700 (LWP 36016) exited]
>>> ^C
>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>> te.S:84
>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>> (gdb) bt
>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>> te.S:84
>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>> ../sysdeps/posix/usleep.c:32
>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>> pbsd_main.c:1935
>>> (gdb) backtrace full
>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>> te.S:84
>>> No locals.
>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>> ../sysdeps/posix/usleep.c:32
>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>> state = 3
>>> waittime = 5
>>> pjob = 0x313a74
>>> iter = 0x0
>>> when = 1478748888
>>> log = 0
>>> scheduling = 1
>>> sched_iteration = 600
>>> time_now = 1478748970
>>> update_loglevel = 1478748979
>>> log_buf = "Server Ready, pid = 35860, loglevel=0", '\000'
>>> <repeats 139 times>, "c\000\000\000\000\000\000\000
>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>> <repeats 26 times>...
>>> sem_val = 5229209
>>> __func__ = "main_loop"
>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>> pbsd_main.c:1935
>>> i = 2
>>> rc = 0
>>> local_errno = 0
>>> lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
>>> <repeats 983 times>
>>> EMsg = '\000' <repeats 1023 times>
>>> tmpLine = "Using ports Server:15001 Scheduler:15004 MOM:15002
>>> (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>> log_buf = "Using ports Server:15001 Scheduler:15004 MOM:15002
>>> (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>> server_name_file_port = 15001
>>> fp = 0x51095f0
>>> (gdb) info registers
>>> rax 0xfffffffffffffdfc -516
>>> rbx 0x6 6
>>> rcx 0x7ffff612a75d 140737321805661
>>> rdx 0x0 0
>>> rsi 0x0 0
>>> rdi 0x7fffffffb3f0 140737488335856
>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>> rsp 0x7fffffffc870 0x7fffffffc870
>>> r8 0x0 0
>>> r9 0x4000001 67108865
>>> r10 0x1 1
>>> r11 0x293 659
>>> r12 0x4260b0 4350128
>>> r13 0x7fffffffe590 140737488348560
>>> r14 0x0 0
>>> r15 0x0 0
>>> rip 0x461f92 0x461f92 <main(int, char**)+2388>
>>> eflags 0x293 [ CF AF SF IF ]
>>> cs 0x33 51
>>> ss 0x2b 43
>>> ds 0x0 0
>>> es 0x0 0
>>> fs 0x0 0
>>> gs 0x0 0
>>> (gdb) x/16i $pc
>>> => 0x461f92 <main(int, char**)+2388>: callq 0x49484c <shutdown_ack()>
>>> 0x461f97 <main(int, char**)+2393>: mov $0xffffffff,%edi
>>> 0x461f9c <main(int, char**)+2398>: callq 0x4250b0 <***@plt>
>>> 0x461fa1 <main(int, char**)+2403>: mov 0x70f5c0(%rip),%rdx
>>> # 0xb71568 <msg_svrdown>
>>> 0x461fa8 <main(int, char**)+2410>: mov 0x70ef51(%rip),%rax
>>> # 0xb70f00 <msg_daemonname>
>>> 0x461faf <main(int, char**)+2417>: mov %rdx,%rcx
>>> 0x461fb2 <main(int, char**)+2420>: mov %rax,%rdx
>>> 0x461fb5 <main(int, char**)+2423>: mov $0x1,%esi
>>> 0x461fba <main(int, char**)+2428>: mov $0x8002,%edi
>>> 0x461fbf <main(int, char**)+2433>: callq 0x425840
>>> <***@plt>
>>> 0x461fc4 <main(int, char**)+2438>: mov $0x0,%edi
>>> 0x461fc9 <main(int, char**)+2443>: callq 0x4269c9 <acct_close(bool)>
>>> 0x461fce <main(int, char**)+2448>: mov $0xb6ce00,%edi
>>> 0x461fd3 <main(int, char**)+2453>: callq 0x425a00
>>> <***@plt>
>>> 0x461fd8 <main(int, char**)+2458>: mov $0x1,%edi
>>> 0x461fdd <main(int, char**)+2463>: callq 0x424db0 <***@plt
>>> >
>>> (gdb) thread apply all backtrace
>>> Thread 12 (Thread 0x7fffe2ffd700 (LWP 36011)):
>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>> _64/pthread_cond_wait.S:185
>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at u_threadpool.c:272
>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe2ffd700) at
>>> pthread_create.c:333
>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>> _64/clone.S:109
>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 36004)):
>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>> _64/pthread_cond_wait.S:185
>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>> pthread_create.c:333
>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>> _64/clone.S:109
>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 36003)):
>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>> _64/pthread_cond_wait.S:185
>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at u_threadpool.c:272
>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>> pthread_create.c:333
>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>> _64/clone.S:109
>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 35871)):
>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>> _64/pthread_cond_wait.S:185
>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>> pthread_create.c:333
>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>> _64/clone.S:109
>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 35869)):
>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>> te.S:84
>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>> ../sysdeps/posix/sleep.c:55
>>> #2 0x0000000000476913 in remove_completed_jobs (vp=0x0) at
>>> req_jobobit.c:3759
>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>> pthread_create.c:333
>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>> _64/clone.S:109
>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 35868)):
>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>> te.S:84
>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>> ../sysdeps/posix/sleep.c:55
>>> #2 0x00000000004afb93 in remove_extra_recycle_jobs (vp=0x0) at
>>> job_recycler.c:216
>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>> pthread_create.c:333
>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>> _64/clone.S:109
>>> Thread 5 (Thread 0x7ffff21be700 (LWP 35867)):
>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>> te.S:84
>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>> ../sysdeps/posix/sleep.c:55
>>> #2 0x00000000004bc853 in inspect_exiting_jobs (vp=0x0) at
>>> exiting_jobs.c:319
>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>> pthread_create.c:333
>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>> _64/clone.S:109
>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 35866)):
>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>> te.S:84
>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>> ../sysdeps/posix/sleep.c:55
>>> #2 0x0000000000460769 in handle_queue_routing_retries (vp=0x0) at
>>> pbsd_main.c:1079
>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>> pthread_create.c:333
>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>> _64/clone.S:109
>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 35865)):
>>> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-templa
>>> te.S:84
>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>> process_meth=0x4c4a4d <start_process_pbs_server_port(void*)>)
>>> at ../Libnet/server_core.c:398
>>> ---Type <return> to continue, or q <return> to quit---
>>> #2 0x00000000004608cf in start_accept_listener (vp=0x0) at
>>> pbsd_main.c:1141
>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>> pthread_create.c:333
>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>> _64/clone.S:109
>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 35864)):
>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>> _64/pthread_cond_wait.S:185
>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>> pthread_create.c:333
>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>> _64/clone.S:109
>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 35860)):
>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>> te.S:84
>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>> ../sysdeps/posix/usleep.c:32
>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>> pbsd_main.c:1935
>>> (gdb) quit
>>> A debugging session is active.
>>> Inferior 1 [process 35860] will be killed.
>>> Quit anyway? (y or n) y
>>
>>
>>
>> Commands executed from another terminal after pbs_server with gdb (r -D)
>>
>>> $ sudo service pbs_sched start
>>> $ sudo service pbs_mom start
>>> $ pbsnodes -a
>>> Dual-E52630v4
>>> state = free
>>> power_state = Running
>>> np = 4
>>> ntype = cluster
>>> status = rectime=1478748911,macaddr=34:
>>> 97:f6:5d:09:a6,cpuclock=Fixed,varattr=,jobs=,state=free,netl
>>> oad=322618417,gres=,loadave=0.06,ncpus=40,physmem=65857216kb
>>> ,availmem=131970532kb,totmem=132849340kb,idletime=108,nusers=4,nsessions=17,sessions=1036
>>> 1316 1327 1332 1420 1421 1422 1423 1424 1425 1426 1430 1471 1510 27075
>>> 27130 35902,uname=Linux Dual-E52630v4 4.4.0-45-generic #66-Ubuntu SMP Wed
>>> Oct 19 14:12:37 UTC 2016 x86_64,opsys=linux
>>> mom_service_port = 15002
>>> mom_manager_port = 15003
>>> $ echo "sleep 30" | qsub
>>> 0.Dual-E52630v4
>>> $ qstat
>>> Job ID Name User Time Use S
>>> Queue
>>> ------------------------- ---------------- --------------- -------- -
>>> -----
>>> 0.Dual-E52630v4 STDIN comp_admin 0 Q
>>> batch
>>
>>
>>
>> On Thu, Nov 10, 2016 at 12:01 PM, Kazuhiro Fujita <
>> ***@gmail.com> wrote:
>>
>>> David,
>>>
>>> Now, it works. Thank you.
>>> But jobs are executed in LIFO order, as I observed on an E5-2630v3
>>> server...
>>> I show the result of 'qstat -t' after running 'echo "sleep 10" | qsub -t
>>> 1-10' three times.
>>>
>>> Best,
>>> Kazu
>>>
>>> $ qstat -t
>>> Job ID Name User Time Use S
>>> Queue
>>> ------------------------- ---------------- --------------- -------- -
>>> -----
>>> 0.Dual-E5-2630v3 STDIN comp_admin 00:00:00 C
>>> batch
>>> 1[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
>>> batch
>>> 1[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
>>> batch
>>> 1[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
>>> batch
>>> 1[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
>>> batch
>>> 1[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
>>> batch
>>> 1[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
>>> batch
>>> 1[7].Dual-E5-2630v3 STDIN-7 comp_admin 00:00:00 C
>>> batch
>>> 1[8].Dual-E5-2630v3 STDIN-8 comp_admin 00:00:00 C
>>> batch
>>> 1[9].Dual-E5-2630v3 STDIN-9 comp_admin 00:00:00 C
>>> batch
>>> 1[10].Dual-E5-2630v3 STDIN-10 comp_admin 00:00:00 C
>>> batch
>>> 2[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
>>> batch
>>> 2[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
>>> batch
>>> 2[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
>>> batch
>>> 2[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
>>> batch
>>> 2[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
>>> batch
>>> 2[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
>>> batch
>>> 2[7].Dual-E5-2630v3 STDIN-7 comp_admin 0 Q
>>> batch
>>> 2[8].Dual-E5-2630v3 STDIN-8 comp_admin 0 Q
>>> batch
>>> 2[9].Dual-E5-2630v3 STDIN-9 comp_admin 0 Q
>>> batch
>>> 2[10].Dual-E5-2630v3 STDIN-10 comp_admin 0 Q
>>> batch
>>> 3[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
>>> batch
>>> 3[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
>>> batch
>>> 3[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
>>> batch
>>> 3[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
>>> batch
>>> 3[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
>>> batch
>>> 3[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
>>> batch
>>> 3[7].Dual-E5-2630v3 STDIN-7 comp_admin 0 R
>>> batch
>>> 3[8].Dual-E5-2630v3 STDIN-8 comp_admin 0 R
>>> batch
>>> 3[9].Dual-E5-2630v3 STDIN-9 comp_admin 0 R
>>> batch
>>> 3[10].Dual-E5-2630v3 STDIN-10 comp_admin 0 R
>>> batch
>>>
>>>
>>>
>>> On Thu, Nov 10, 2016 at 3:07 AM, David Beer <***@adaptivecomputing.com
>>> > wrote:
>>>
>>>> Kazu,
>>>>
>>>> I was able to get a system to reproduce this error. I have now checked
>>>> in another fix, and I can no longer reproduce this. Can you pull the latest
>>>> and let me know if it fixes it for you?
>>>>
>>>> On Tue, Nov 8, 2016 at 2:06 AM, Kazuhiro Fujita <
>>>> ***@gmail.com> wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>> I reinstalled 6.0-dev from GitHub today and, I think, observed slightly
>>>>> different behavior.
>>>>> I used the "service" command to start the daemons this time.
>>>>>
>>>>> Best,
>>>>> Kazu
>>>>>
>>>>> Before the crash:
>>>>>
>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>>>>>> 6.0-dev
>>>>>> cd 6.0-dev
>>>>>> ./autogen.sh
>>>>>> # build and install torque
>>>>>> ./configure
>>>>>> make
>>>>>> sudo make install
>>>>>> # Set the correct name of the server
>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>> # configure and start trqauthd
>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>> sudo update-rc.d trqauthd defaults
>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>> sudo ldconfig
>>>>>> sudo service trqauthd start
>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>> sudo ./torque.setup $USER
>>>>>> sudo qmgr -c 'p s'
>>>>>> sudo qterm
>>>>>> sudo service trqauthd stop
>>>>>> ps aux | grep pbs
>>>>>> ps aux | grep trq
>>>>>> # set nodes
>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>> # set the head node
>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>>>>>> # configure other daemons
>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>> sudo update-rc.d pbs_server defaults
>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>> # start torque daemons
>>>>>> sudo service trqauthd start
>>>>>> sudo service pbs_server start
>>>>>> sudo service pbs_sched start
>>>>>> sudo service pbs_mom start
>>>>>> # check the configuration of compute nodes
>>>>>> pbsnodes -a
>>>>>
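[Editor's note: the nodes line in the recipe above is built with a `cat /proc/cpuinfo | grep processor | wc -l` pipeline; on modern Linux, `nproc` is a shorter equivalent, with the caveat that it counts only the CPUs available to the current process. A hedged sketch, using the nodes-file path from the thread:]

```shell
# Hedged sketch: build the single-line nodes entry used above.
# Note: nproc may report fewer CPUs than /proc/cpuinfo inside
# containers or under CPU affinity masks.
np=$(nproc)
line="$(hostname) np=$np"
echo "$line"
# Install it as in the thread (requires root, so commented out here):
#   echo "$line" | sudo tee /var/spool/torque/server_priv/nodes
```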
>>>>>
>>>>> I checked the Torque processes with "ps aux | grep pbs" and "ps aux |
>>>>> grep trq" several times.
>>>>> After "pbsnodes -a", everything seemed OK.
>>>>> But the next qsub command seems to trigger a crash of "pbs_server" and
>>>>> "pbs_sched".
>>>>>
>>>>> $ ps aux | grep trq
>>>>>> root 9682 0.0 0.0 109112 3632 ? S 17:39 0:00
>>>>>> /usr/local/sbin/trqauthd
>>>>>> comp_ad+ 9842 0.0 0.0 15236 936 pts/8 S+ 17:40 0:00 grep
>>>>>> --color=auto trq
>>>>>> $ ps aux | grep pbs
>>>>>> root 9720 0.0 0.0 695140 25760 ? Sl 17:39 0:00
>>>>>> /usr/local/sbin/pbs_server
>>>>>> root 9771 0.0 0.0 37996 4940 ? Ss 17:39 0:00
>>>>>> /usr/local/sbin/pbs_sched
>>>>>> root 9814 0.2 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>> /usr/local/sbin/pbs_mom
>>>>>> comp_ad+ 9844 0.0 0.0 15236 1012 pts/8 S+ 17:40 0:00 grep
>>>>>> --color=auto pbs
>>>>>> $ echo "sleep 30" | qsub
>>>>>> 0.Dual-E52630v4
>>>>>> $ ps aux | grep pbs
>>>>>> root 9814 0.1 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>> /usr/local/sbin/pbs_mom
>>>>>> comp_ad+ 9855 0.0 0.0 15236 928 pts/8 S+ 17:41 0:00 grep
>>>>>> --color=auto pbs
>>>>>> $ ps aux | grep trq
>>>>>> root 9682 0.0 0.0 109112 4144 ? S 17:39 0:00
>>>>>> /usr/local/sbin/trqauthd
>>>>>> comp_ad+ 9860 0.0 0.0 15236 1092 pts/8 S+ 17:41 0:00 grep
>>>>>> --color=auto trq
>>>>>
>>>>>
>>>>> Then, I stopped the remaining processes,
>>>>>
>>>>> sudo service pbs_mom stop
>>>>>> sudo service trqauthd stop
>>>>>
>>>>>
>>>>> and restarted "trqauthd" and "pbs_server", the latter under gdb.
>>>>> "pbs_server" crashed in gdb without any further commands.
>>>>>
>>>>> sudo service trqauthd start
>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>
>>>>>
>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>> http://gnu.org/licenses/gpl.html>
>>>>> This is free software: you are free to change and redistribute it.
>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>> copying"
>>>>> and "show warranty" for details.
>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>> Type "show configuration" for configuration details.
>>>>> For bug reporting instructions, please see:
>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>> Find the GDB manual and other documentation resources online at:
>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>> For help, type "help".
>>>>> Type "apropos word" to search for commands related to "word"...
>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>> (gdb) r -D
>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>> [Thread debugging using libthread_db enabled]
>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>> ad_db.so.1".
>>>>>
>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>> __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
>>>>> directory.
>>>>> (gdb) bt
>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
>>>>> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>> at svr_jobfunc.c:4011
>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>> change_state=1) at pbsd_init.c:2824
>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>>> pbsd_init.c:2558
>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>> pbsd_init.c:1803
>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>>> pbsd_init.c:2100
>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>> pbsd_main.c:1898
>>>>> (gdb) backtrace full
>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>> No locals.
>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
>>>>> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>> at svr_jobfunc.c:4011
>>>>> rc = 0
>>>>> err_msg = 0x0
>>>>> stub_msg = "no pos"
>>>>> __func__ = "unlock_ji_mutex"
>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>> pattrjb = 0x7fffffff4a10
>>>>> pdef = 0x4
>>>>> pque = 0x0
>>>>> rc = 0
>>>>> log_buf = '\000' <repeats 24 times>,
>>>>> "\030\000\000\000\060\000\000\000PU\377\377\377\177\000\000\220T\377\377\377\177",
>>>>> '\000' <repeats 50 times>, "\003\000\000\000\000\000\000\
>>>>> 000#\000\000\000\000\000\000\000pO\377\377\377\177", '\000' <repeats
>>>>> 26 times>, "\221\260\000\000\000\200\377\
>>>>> 377oO\377\377\377\177\000\000H+B\366\377\177\000\000p+B\366\
>>>>> 377\177\000\000\200O\377\377\377\177\000\000\201\260\000\000
>>>>> \000\200\377\377\177O\377\377\377\177", '\000' <repeats 18 times>...
>>>>> time_now = 1478594788
>>>>> job_id = "0.Dual-E52630v4\000\000\000\0
>>>>> 00\000\000\000\000\000\362\377\377\377\377\377\377\377\340J\
>>>>> 377\377\377\177\000\000\060L\377\377\377\177\000\000\001\000
>>>>> \000\000\000\000\000\000\244\201\000\000\001\000\000\000\030
>>>>> \354\377\367\377\177\000\***@L\377\377\377\177\000\000\000\0
>>>>> 00\000\000\005\000\000\220\r\000\000\000\000\000\000\000k\02
>>>>> 2j\365\377\177\000\000\031J\377\377\377\177\000\000\201n\376
>>>>> \017\000\000\000\000\\\216!X\000\000\000\000_#\343+\000\000
>>>>> \000\000\\\216!X\000\000\000\000\207\065],", '\000' <repeats 36
>>>>> times>, "k\022j\365\377\177\000\000\300K\377\377\377\177\000\000\000
>>>>> \000\000\000\000\000\000\000"...
>>>>> queue_name = "batch\000\377\377\240\340\377\367\377\177\000"
>>>>> total_jobs = 0
>>>>> user_jobs = 0
>>>>> array_jobs = 0
>>>>> __func__ = "svr_enquejob"
>>>>> que_mgr = {unlock_on_exit = 160, locked = 75, mutex_valid =
>>>>> 255, managed_mutex = 0x7ffff7ddccda <open_path+474>}
>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>> change_state=1) at pbsd_init.c:2824
>>>>> newstate = 0
>>>>> newsubstate = 0
>>>>> rc = 0
>>>>> log_buf = "pbsd_init_reque:1", '\000' <repeats 1063 times>...
>>>>> __func__ = "pbsd_init_reque"
>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>>> pbsd_init.c:2558
>>>>> d = 0
>>>>> rc = 0
>>>>> time_now = 1478594788
>>>>> log_buf = '\000' <repeats 2112 times>...
>>>>> local_errno = 0
>>>>> job_id = '\000' <repeats 1016 times>...
>>>>> job_atr_hold = 0
>>>>> job_exit_status = 0
>>>>> __func__ = "pbsd_init_job"
>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>> pbsd_init.c:1803
>>>>> pjob = 0x512d880
>>>>> Index = 0
>>>>> JobArray_iter = {first = "0.Dual-E52630v4", second = }
>>>>> log_buf = "14 total files read from
>>>>> disk\000\000\000\000\000\000\000\001\000\000\000\320\316\022
>>>>> \005\000\000\000\000\220N\022\005", '\000' <repeats 12 times>,
>>>>> "Expected 1, recovered 1 queues", '\000' <repeats 1330 times>...
>>>>> rc = 0
>>>>> job_rc = 0
>>>>> logtype = 0
>>>>> pdirent = 0x0
>>>>> pdirent_sub = 0x0
>>>>> dir = 0x5124e90
>>>>> dir_sub = 0x0
>>>>> had = 0
>>>>> pjob = 0x0
>>>>> time_now = 1478594788
>>>>> basen = '\000' <repeats 1088 times>...
>>>>> use_jobs_subdirs = 0
>>>>> __func__ = "handle_job_recovery"
>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>>> pbsd_init.c:2100
>>>>> rc = 0
>>>>> tmp_rc = 1974134615
>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>> ret = 0
>>>>> gid = 0
>>>>> log_buf = "pbsd_init:1", '\000' <repeats 997 times>...
>>>>> __func__ = "pbsd_init"
>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>> pbsd_main.c:1898
>>>>> i = 2
>>>>> rc = 0
>>>>> local_errno = 0
>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>> '\000' <repeats 983 times>
>>>>> EMsg = '\000' <repeats 1023 times>
>>>>> tmpLine = "Server Dual-E52630v4 started, initialization type =
>>>>> 1", '\000' <repeats 970 times>
>>>>> log_buf = "Server Dual-E52630v4 started, initialization type =
>>>>> 1", '\000' <repeats 1139 times>...
>>>>> server_name_file_port = 15001
>>>>> fp = 0x51095f0
>>>>> (gdb) info registers
>>>>> rax 0x0 0
>>>>> rbx 0x6 6
>>>>> rcx 0x0 0
>>>>> rdx 0x512f1b0 85127600
>>>>> rsi 0x0 0
>>>>> rdi 0x512f1b0 85127600
>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>> r8 0x0 0
>>>>> r9 0x7fffffff57a2 140737488312226
>>>>> r10 0x513c800 85182464
>>>>> r11 0x7ffff61e6128 140737322574120
>>>>> r12 0x4260b0 4350128
>>>>> r13 0x7fffffffe590 140737488348560
>>>>> r14 0x0 0
>>>>> r15 0x0 0
>>>>> rip 0x461f29 0x461f29 <main(int, char**)+2183>
>>>>> eflags 0x10246 [ PF ZF IF RF ]
>>>>> cs 0x33 51
>>>>> ss 0x2b 43
>>>>> ds 0x0 0
>>>>> es 0x0 0
>>>>> fs 0x0 0
>>>>> gs 0x0 0
>>>>> (gdb) x/16i $pc
>>>>> => 0x461f29 <main(int, char**)+2183>: test %eax,%eax
>>>>> 0x461f2b <main(int, char**)+2185>: setne %al
>>>>> 0x461f2e <main(int, char**)+2188>: test %al,%al
>>>>> 0x461f30 <main(int, char**)+2190>: je 0x461f55 <main(int,
>>>>> char**)+2227>
>>>>> 0x461f32 <main(int, char**)+2192>: mov 0x70efc7(%rip),%rax
>>>>> # 0xb70f00 <msg_daemonname>
>>>>> 0x461f39 <main(int, char**)+2199>: mov $0x51bab2,%edx
>>>>> 0x461f3e <main(int, char**)+2204>: mov %rax,%rsi
>>>>> 0x461f41 <main(int, char**)+2207>: mov $0xffffffff,%edi
>>>>> 0x461f46 <main(int, char**)+2212>: callq 0x425420
>>>>> <***@plt>
>>>>> 0x461f4b <main(int, char**)+2217>: mov $0x3,%edi
>>>>> 0x461f50 <main(int, char**)+2222>: callq 0x425680 <***@plt>
>>>>> 0x461f55 <main(int, char**)+2227>: mov 0x71021d(%rip),%esi
>>>>> # 0xb72178 <pbs_mom_port>
>>>>> 0x461f5b <main(int, char**)+2233>: mov 0x710227(%rip),%ecx
>>>>> # 0xb72188 <pbs_scheduler_port>
>>>>> 0x461f61 <main(int, char**)+2239>: mov 0x710225(%rip),%edx
>>>>> # 0xb7218c <pbs_server_port_dis>
>>>>> 0x461f67 <main(int, char**)+2245>: lea -0x1400(%rbp),%rax
>>>>> 0x461f6e <main(int, char**)+2252>: mov $0xb739c0,%r9d
>>>>> (gdb) thread apply all backtrace
>>>>>
>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 10004)):
>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880, id=0x525b30
>>>>> <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>> at svr_jobfunc.c:4011
>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>> change_state=1) at pbsd_init.c:2824
>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>>> pbsd_init.c:2558
>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>> pbsd_init.c:1803
>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>>> pbsd_init.c:2100
>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>> pbsd_main.c:1898
>>>>> (gdb) quit
>>>>> A debugging session is active.
>>>>>
>>>>> Inferior 1 [process 10004] will be killed.
>>>>>
>>>>> Quit anyway? (y or n) y
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Nov 2, 2016 at 1:43 AM, David Beer <
>>>>> ***@adaptivecomputing.com> wrote:
>>>>>
>>>>>> Kazu,
>>>>>>
>>>>>> Thanks for sticking with us on this. You mentioned that pbs_server
>>>>>> did not crash when you submitted the job, but you said that it and
>>>>>> pbs_sched are "unstable." What do you mean by unstable? Will jobs run? Your
>>>>>> gdb output looks like a pbs_server that isn't busy, but other than that it
>>>>>> looks normal.
>>>>>>
>>>>>> David
>>>>>>
>>>>>> On Tue, Nov 1, 2016 at 1:19 AM, Kazuhiro Fujita <
>>>>>> ***@gmail.com> wrote:
>>>>>>
>>>>>>> David,
>>>>>>>
>>>>>>> I tested the 6.0-dev. It passed the "sudo ./torque.setup $USER"
>>>>>>> script,
>>>>>>> but pbs_server and pbs_sched are unstable like 6.1-dev.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kazu
>>>>>>>
>>>>>>> Before execution of gdb
>>>>>>>
>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>> cd 6.0-dev
>>>>>>>> ./autogen.sh
>>>>>>>> # build and install torque
>>>>>>>> ./configure
>>>>>>>> make
>>>>>>>> sudo make install
>>>>>>>> # Set the correct name of the server
>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>> # configure and start trqauthd
>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>> sudo ldconfig
>>>>>>>> sudo service trqauthd start
>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>
>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>> sudo qterm
>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>> # set nodes
>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>> # set the head node
>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>> # configure other daemons
>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>> # start torque daemons
>>>>>>>> sudo service trqauthd start
>>>>>>>
>>>>>>>
>>>>>>> Execution of gdb
>>>>>>>
>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>
>>>>>>>
>>>>>>> Commands executed by another terminal
>>>>>>>
>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>> pbsnodes -a
>>>>>>>> echo "sleep 30" | qsub
>>>>>>>
>>>>>>>
>>>>>>> The last command did not cause a crash of pbs_server. The backtrace
>>>>>>> is described below.
>>>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>> copying"
>>>>>>> and "show warranty" for details.
>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>> Type "show configuration" for configuration details.
>>>>>>> For bug reporting instructions, please see:
>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>> For help, type "help".
>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>> (gdb) r -D
>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>> ad_db.so.1".
>>>>>>> [New Thread 0x7ffff39c1700 (LWP 5024)]
>>>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>>>> [New Thread 0x7ffff31c0700 (LWP 5025)]
>>>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to
>>>>>>> open tcp connection - connect() failed [rc = -2] [addr =
>>>>>>> 10.0.0.249:15003]
>>>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>>>> hierarchy to host Dual-E52630v4:15003
>>>>>>> [New Thread 0x7ffff29bf700 (LWP 5026)]
>>>>>>> [New Thread 0x7ffff21be700 (LWP 5027)]
>>>>>>> [New Thread 0x7ffff19bd700 (LWP 5028)]
>>>>>>> [New Thread 0x7ffff11bc700 (LWP 5029)]
>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5030)]
>>>>>>> [Thread 0x7ffff09bb700 (LWP 5030) exited]
>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5031)]
>>>>>>> [New Thread 0x7fffe3fff700 (LWP 5109)]
>>>>>>> [New Thread 0x7fffe37fe700 (LWP 5113)]
>>>>>>> [New Thread 0x7fffe29cf700 (LWP 5121)]
>>>>>>> [Thread 0x7fffe29cf700 (LWP 5121) exited]
>>>>>>> ^C
>>>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>>>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>>>>> te.S:84
>>>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>>>> (gdb) backtrace full
>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>> No locals.
>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>> state = 3
>>>>>>> waittime = 5
>>>>>>> pjob = 0x313a74
>>>>>>> iter = 0x0
>>>>>>> when = 1477984074
>>>>>>> log = 0
>>>>>>> scheduling = 1
>>>>>>> sched_iteration = 600
>>>>>>> time_now = 1477984190
>>>>>>> update_loglevel = 1477984198
>>>>>>> log_buf = "Server Ready, pid = 5020, loglevel=0", '\000'
>>>>>>> <repeats 140 times>, "c\000\000\000\000\000\000\000
>>>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>>>>>> <repeats 26 times>...
>>>>>>> sem_val = 5228929
>>>>>>> __func__ = "main_loop"
>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>> pbsd_main.c:1935
>>>>>>> i = 2
>>>>>>> rc = 0
>>>>>>> local_errno = 0
>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>> '\000' <repeats 983 times>
>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>>>> server_name_file_port = 15001
>>>>>>> fp = 0x51095f0
>>>>>>> (gdb) info registers
>>>>>>> rax 0xfffffffffffffdfc -516
>>>>>>> rbx 0x5 5
>>>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>>>> rdx 0x0 0
>>>>>>> rsi 0x0 0
>>>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>> r8 0x0 0
>>>>>>> r9 0x4000001 67108865
>>>>>>> r10 0x1 1
>>>>>>> r11 0x293 659
>>>>>>> r12 0x4260b0 4350128
>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>> r14 0x0 0
>>>>>>> r15 0x0 0
>>>>>>> rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
>>>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>>>> cs 0x33 51
>>>>>>> ss 0x2b 43
>>>>>>> ds 0x0 0
>>>>>>> es 0x0 0
>>>>>>> fs 0x0 0
>>>>>>> gs 0x0 0
>>>>>>> (gdb) x/16i $pc
>>>>>>> => 0x461fb6 <main(int, char**)+2388>: callq 0x494762
>>>>>>> <shutdown_ack()>
>>>>>>> 0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>>>> 0x461fc0 <main(int, char**)+2398>: callq 0x4250b0 <***@plt
>>>>>>> >
>>>>>>> 0x461fc5 <main(int, char**)+2403>: mov 0x70f55c(%rip),%rdx
>>>>>>> # 0xb71528 <msg_svrdown>
>>>>>>> 0x461fcc <main(int, char**)+2410>: mov 0x70eeed(%rip),%rax
>>>>>>> # 0xb70ec0 <msg_daemonname>
>>>>>>> 0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
>>>>>>> 0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
>>>>>>> 0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
>>>>>>> 0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
>>>>>>> 0x461fe3 <main(int, char**)+2433>: callq 0x425840
>>>>>>> <***@plt>
>>>>>>> 0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
>>>>>>> 0x461fed <main(int, char**)+2443>: callq 0x4269c9
>>>>>>> <acct_close(bool)>
>>>>>>> 0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
>>>>>>> 0x461ff7 <main(int, char**)+2453>: callq 0x425a00
>>>>>>> <***@plt>
>>>>>>> 0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
>>>>>>> 0x462001 <main(int, char**)+2463>: callq 0x424db0
>>>>>>> <***@plt>
>>>>>>> (gdb) thread apply all backtrace
>>>>>>>
>>>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>> u_threadpool.c:272
>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>>>> pthread_create.c:333
>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>>
>>>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>> u_threadpool.c:272
>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>>>> pthread_create.c:333
>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>>
>>>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>> u_threadpool.c:272
>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>>>> pthread_create.c:333
>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>>
>>>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>> #2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at
>>>>>>> req_jobobit.c:3759
>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>>>> pthread_create.c:333
>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>>
>>>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>> #2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at
>>>>>>> job_recycler.c:216
>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>>>> pthread_create.c:333
>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>>
>>>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>> #2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at
>>>>>>> exiting_jobs.c:319
>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>>>> pthread_create.c:333
>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>>
>>>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>> #2 0x000000000046078d in handle_queue_routing_retries (vp=0x0) at
>>>>>>> pbsd_main.c:1079
>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>>>> pthread_create.c:333
>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>>
>>>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
>>>>>>> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-templa
>>>>>>> te.S:84
>>>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>>>> process_meth=0x4c4935 <start_process_pbs_server_port(void*)>)
>>>>>>> at ../Libnet/server_core.c:398
>>>>>>> #2 0x00000000004608f3 in start_accept_listener (vp=0x0) at
>>>>>>> pbsd_main.c:1141
>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>>>> pthread_create.c:333
>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>>
>>>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>> u_threadpool.c:272
>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>>>> pthread_create.c:333
>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>>
>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>> pbsd_main.c:1935
>>>>>>> (gdb) quit
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 28, 2016 at 12:43 PM, Kazuhiro Fujita <
>>>>>>> ***@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you for your comments.
>>>>>>>> I will try the 6.0-dev next week.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kazu
>>>>>>>>
>>>>>>>> On Fri, Oct 28, 2016 at 5:34 AM, David Beer <
>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>
>>>>>>>>> I wonder if that fix wasn't placed in the hotfix. Is there any
>>>>>>>>> chance you can try installing 6.0-dev on your system (via github) to see if
>>>>>>>>> it's resolved. For the record, my Ubuntu 16 system doesn't give me this
>>>>>>>>> error, or I'd try it myself. For whatever reason, none of our test cluster
>>>>>>>>> machines (Cent & Redhat 6-7, SLES 11-12) experience this either. We did
>>>>>>>>> have another user that experienced it on a test cluster, but not being able
>>>>>>>>> to reproduce it has made it harder to track down.
>>>>>>>>>
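[Editorial note: since the crash depends on which glibc/libpthread build is in use, it can help to confirm the library version on each machine before comparing results. A sketch, assuming the default pbs_server install path used elsewhere in this thread:]

```shell
# Report the NPTL (libpthread) version in use, e.g. "NPTL 2.23" on 16.04
getconf GNU_LIBPTHREAD_VERSION
# If pbs_server is installed, show which libc/libpthread it resolves to
if [ -x /usr/local/sbin/pbs_server ]; then
    ldd /usr/local/sbin/pbs_server | grep -E 'libpthread|libc\.so'
fi
```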
>>>>>>>>> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> David,
>>>>>>>>>>
>>>>>>>>>> I tried the 6.0.2.h3. But it seems that the other issue still
>>>>>>>>>> remains.
>>>>>>>>>> After I initialized serverdb by "sudo pbs_server -t create",
>>>>>>>>>> pbs_server crashed.
>>>>>>>>>> Then, I used gdb with pbs_server.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Kazu
>>>>>>>>>>
>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>>>>> copying"
>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>>> For help, type "help".
>>>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>>> (gdb) r -D
>>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>>>> ad_db.so.1".
>>>>>>>>>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>>>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>>>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>>>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>>>>>>>>>> [New Thread 0x7ffff21be700 (LWP 25594)]
>>>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>>>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>>
>>>>>>>>>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation fault.
>>>>>>>>>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>> __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file
>>>>>>>>>> or directory.
>>>>>>>>>> (gdb) bt
>>>>>>>>>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>> #1 0x00000000004ac076 in dispatch_timed_task (ptask=0x5727660)
>>>>>>>>>> at svr_task.c:318
>>>>>>>>>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at
>>>>>>>>>> pbsd_main.c:921
>>>>>>>>>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at
>>>>>>>>>> u_threadpool.c:318
>>>>>>>>>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700) at
>>>>>>>>>> pthread_create.c:333
>>>>>>>>>> #5 0x00007ffff6165b5d in clone () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> David and Rick,
>>>>>>>>>>>
>>>>>>>>>>> Thank you for the quick response. I will try it later.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kazu
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Actually, Rick just sent me the link. You can download it from
>>>>>>>>>>>> here: http://files.adaptivecomputing.com/hotfix/torque-6.0.2
>>>>>>>>>>>> .h3.tar.gz
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I can confirm that this bug is fixed in 6.0-dev, and we've
>>>>>>>>>>>>> made a hotfix for it, 6.0.2.h3. This was caused by a change in the
>>>>>>>>>>>>> implementation of the pthread library, so most users will not see this crash,
>>>>>>>>>>>>> but it appears that if you have a newer version of that library, then you
>>>>>>>>>>>>> will get it. Rick is going to send instructions for how to grab 6.0.2.h3.
>>>>>>>>>>>>>
>>>>>>>>>>>>> David
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you David for the comment on the backtrace.
>>>>>>>>>>>>>> I hadn't noticed that until writing this mail.
>>>>>>>>>>>>>> So, I used backtrace as written in the Ubuntu wiki.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also attached the backtrace of pbs_server (Torque 6.1-dev)
>>>>>>>>>>>>>> by gdb.
>>>>>>>>>>>>>> As I mentioned before, the torque.setup script executed
>>>>>>>>>>>>>> successfully, but pbs_server is unstable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Before using gdb, I used following commands.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git
>>>>>>>>>>>>>>> -b 6.1-dev 6.1-dev
>>>>>>>>>>>>>>> cd 6.1-dev
>>>>>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>> # set as services
>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc
>>>>>>>>>>>>>>> -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed np)
>>>>>>>>>>>>>>> sudo qterm -t quick
>>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> trqauthd was not stopped by the last command, so I stopped it
>>>>>>>>>>>>>> by killing the trqauthd process.
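
[Editor's note, not part of the thread: a minimal sketch of checking whether trqauthd is actually still running before resorting to kill.]

```shell
# pgrep exits non-zero when no process matches, so this only
# kills trqauthd if it is really still alive.
if pgrep -x trqauthd >/dev/null; then
  echo "trqauthd still running; stopping it"
  sudo pkill -x trqauthd
else
  echo "trqauthd not running"
fi
```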
>>>>>>>>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee
>>>>>>>>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In another terminal, I executed the following commands before
>>>>>>>>>>>>>> pbs_server crashed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The output of the last command was "0.torque-server", and this
>>>>>>>>>>>>>> command crashed pbs_server in gdb.
>>>>>>>>>>>>>> Then I made the backtrace.
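
[Editor's note, not part of the thread: the same backtrace can be captured non-interactively with gdb's batch mode. This is a sketch assuming the default /usr/local install prefix used elsewhere in the thread.]

```shell
# Run pbs_server in the foreground under gdb, and when it crashes,
# dump full backtraces for every thread into a log file.
# Binary path assumes a default "make install" prefix.
sudo gdb -batch \
  -ex 'run -D' \
  -ex 'thread apply all bt full' \
  /usr/local/sbin/pbs_server 2>&1 | tee ~/pbs_server-backtrace.txt
```

The `-ex 'run -D'` mirrors the interactive `r -D` sessions shown later in the thread; batch mode simply removes the need to press Ctrl+C and type `bt` by hand.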
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> David,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2) by gdb.
>>>>>>>>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I started pbs_server with gdb,
>>>>>>>>>>>>>>> and executed qmgr from another terminal (see below).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Kaz
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Kazu,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you give us a backtrace for this crash? We have fixed
>>>>>>>>>>>>>>>> some issues on startup (around mutex management for newer pthread
>>>>>>>>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>>>>>>>>> seeing is fixed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Dear All,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS with
>>>>>>>>>>>>>>>>> dual E5-2630 v3 chips.
>>>>>>>>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>>>>>>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>>>>>>>>>>>> I tried to set up Torque on them, but got stuck at
>>>>>>>>>>>>>>>>> the initial setup script.
>>>>>>>>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server
>>>>>>>>>>>>>>>>> in the initial setup script (torque.setup); see below.
>>>>>>>>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>>>>>>>>> Have you ever observed this kind of error?
>>>>>>>>>>>>>>>>> If you know of possible solutions, please tell me; any
>>>>>>>>>>>>>>>>> comments will be highly appreciated.
>>>>>>>>>>>>>>>>> Would it be better to change the OS to another
>>>>>>>>>>>>>>>>> distribution, such as Scientific Linux?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> torque-server-***@torque-ser
>>>>>>>>>>>>>>>>>> ver:~/Downloads/torque/torque-4.2.10$ sudo
>>>>>>>>>>>>>>>>>> ./torque.setup $USER
>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is:
>>>>>>>>>>>>>>>>>> 15001
>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-ser
>>>>>>>>>>>>>>>>>> ver)
>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00
>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>> torque-server-***@torque-ser
>>>>>>>>>>>>>>>>>> ver:~/Downloads/torque/torque-4.2.10$ ps aux | grep pbs
>>>>>>>>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+ 12:22
>>>>>>>>>>>>>>>>>> 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is:
>>>>>>>>>>>>>>>>>> 15001
>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>> initializing TORQUE (admin: torque-server-***@torque-ser
>>>>>>>>>>>>>>>>>> ver)
>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00
>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+ 16:11
>>>>>>>>>>>>>>>>>> 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Commands used for installation before the setup script
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee
>>>>>>>>>>>>>>>>>> /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # set up as services
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>> torqueusers mailing list
>>>>>>>>>>>>>>>>> ***@supercluster.org
>>>>>>>>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>>>>> Adaptive Computing
David Beer
2016-11-17 19:21:07 UTC
Permalink
Kazu,

Did you look at the server logs?
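
[Editor's note, not part of the thread: pbs_server writes a daily log named by date under the spool directory. A sketch assuming the default /var/spool/torque prefix used throughout this thread.]

```shell
# Tail today's pbs_server log; the file name is the current date in
# YYYYMMDD form. The /var/spool/torque prefix is an assumption if
# TORQUE was configured with a different spool directory.
sudo tail -n 50 "/var/spool/torque/server_logs/$(date +%Y%m%d)"
```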

On Wed, Nov 16, 2016 at 12:24 AM, Kazuhiro Fujita <***@gmail.com
> wrote:

> David,
>
> I did not find the pbs_server process after executing the commands
> shown below.
>
> sudo service trqauthd start
>> sudo service pbs_server start
>
>
> I am not sure what it did.
>
> Best,
> Kazu
>
>
> On Wed, Nov 16, 2016 at 8:10 AM, David Beer <***@adaptivecomputing.com>
> wrote:
>
>> Kazu,
>>
>> What did it do when it failed to start?
>>
>> On Wed, Nov 9, 2016 at 9:33 PM, Kazuhiro Fujita <
>> ***@gmail.com> wrote:
>>
>>> David,
>>>
>>> In the last mail I sent, I reinstalled 6.0-dev on the wrong server, as
>>> you can see in the output (E5-2630v3).
>>> On an E5-2630v4 server, pbs_server failed to restart as a daemon after
>>> "./torque.setup $USER".
>>>
>>> Before crash:
>>>
>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>>>> 6.0-dev
>>>> cd 6.0-dev
>>>> ./autogen.sh
>>>> # build and install torque
>>>> ./configure
>>>> make
>>>> sudo make install
>>>> # Set the correct name of the server
>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>> # configure and start trqauthd
>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>> sudo update-rc.d trqauthd defaults
>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>> sudo ldconfig
>>>> sudo service trqauthd start
>>>> # Initialize serverdb by executing the torque.setup script
>>>> sudo ./torque.setup $USER
>>>> sudo qmgr -c 'p s'
>>>> sudo qterm
>>>> sudo service trqauthd stop
>>>> ps aux | grep pbs
>>>> ps aux | grep trq
>>>> # set nodes
>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
>>>> tee /var/spool/torque/server_priv/nodes
>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>> # set the head node
>>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>>>> # configure other daemons
>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>> sudo update-rc.d pbs_server defaults
>>>> sudo update-rc.d pbs_sched defaults
>>>> sudo update-rc.d pbs_mom defaults
>>>> # restart torque daemons
>>>> sudo service trqauthd start
>>>> sudo service pbs_server start
>>>
>>>
>>> Then pbs_server did not start, so I started it with gdb.
>>> But pbs_server under gdb did not crash even after qsub and qstat from
>>> another terminal, so I stopped it in gdb with Ctrl+C.
>>>
>>> Best,
>>> Kazu
>>>
>>> gdb output
>>>
>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>> License GPLv3+: GNU GPL version 3 or later <
>>>> http://gnu.org/licenses/gpl.html>
>>>> This is free software: you are free to change and redistribute it.
>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>> copying"
>>>> and "show warranty" for details.
>>>> This GDB was configured as "x86_64-linux-gnu".
>>>> Type "show configuration" for configuration details.
>>>> For bug reporting instructions, please see:
>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>> Find the GDB manual and other documentation resources online at:
>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>> For help, type "help".
>>>> Type "apropos word" to search for commands related to "word"...
>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>> (gdb) r -D
>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>> [Thread debugging using libthread_db enabled]
>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>> ad_db.so.1".
>>>> [New Thread 0x7ffff39c1700 (LWP 35864)]
>>>> pbs_server is up (version - 6.0, port - 15001)
>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to
>>>> open tcp connection - connect() failed [rc = -2] [addr =
>>>> 10.0.0.249:15003]
>>>> [New Thread 0x7ffff31c0700 (LWP 35865)]
>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>> hierarchy to host Dual-E52630v4:15003
>>>> [New Thread 0x7ffff29bf700 (LWP 35866)]
>>>> [New Thread 0x7ffff21be700 (LWP 35867)]
>>>> [New Thread 0x7ffff19bd700 (LWP 35868)]
>>>> [New Thread 0x7ffff11bc700 (LWP 35869)]
>>>> [New Thread 0x7ffff09bb700 (LWP 35870)]
>>>> [Thread 0x7ffff09bb700 (LWP 35870) exited]
>>>> [New Thread 0x7ffff09bb700 (LWP 35871)]
>>>> [New Thread 0x7fffe3fff700 (LWP 36003)]
>>>> [New Thread 0x7fffe37fe700 (LWP 36004)]
>>>> [New Thread 0x7fffe2ffd700 (LWP 36011)]
>>>> [New Thread 0x7fffe21ce700 (LWP 36016)]
>>>> [Thread 0x7fffe21ce700 (LWP 36016) exited]
>>>> ^C
>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>> (gdb) bt
>>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>> ../sysdeps/posix/usleep.c:32
>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>> pbsd_main.c:1935
>>>> (gdb) backtrace full
>>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> No locals.
>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>> ../sysdeps/posix/usleep.c:32
>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>> state = 3
>>>> waittime = 5
>>>> pjob = 0x313a74
>>>> iter = 0x0
>>>> when = 1478748888
>>>> log = 0
>>>> scheduling = 1
>>>> sched_iteration = 600
>>>> time_now = 1478748970
>>>> update_loglevel = 1478748979
>>>> log_buf = "Server Ready, pid = 35860, loglevel=0", '\000'
>>>> <repeats 139 times>, "c\000\000\000\000\000\000\000
>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>>> <repeats 26 times>...
>>>> sem_val = 5229209
>>>> __func__ = "main_loop"
>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>> pbsd_main.c:1935
>>>> i = 2
>>>> rc = 0
>>>> local_errno = 0
>>>> lockfile = "/var/spool/torque/server_priv/server.lock", '\000'
>>>> <repeats 983 times>
>>>> EMsg = '\000' <repeats 1023 times>
>>>> tmpLine = "Using ports Server:15001 Scheduler:15004 MOM:15002
>>>> (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>> log_buf = "Using ports Server:15001 Scheduler:15004 MOM:15002
>>>> (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>> server_name_file_port = 15001
>>>> fp = 0x51095f0
>>>> (gdb) info registers
>>>> rax 0xfffffffffffffdfc -516
>>>> rbx 0x6 6
>>>> rcx 0x7ffff612a75d 140737321805661
>>>> rdx 0x0 0
>>>> rsi 0x0 0
>>>> rdi 0x7fffffffb3f0 140737488335856
>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>> r8 0x0 0
>>>> r9 0x4000001 67108865
>>>> r10 0x1 1
>>>> r11 0x293 659
>>>> r12 0x4260b0 4350128
>>>> r13 0x7fffffffe590 140737488348560
>>>> r14 0x0 0
>>>> r15 0x0 0
>>>> rip 0x461f92 0x461f92 <main(int, char**)+2388>
>>>> eflags 0x293 [ CF AF SF IF ]
>>>> cs 0x33 51
>>>> ss 0x2b 43
>>>> ds 0x0 0
>>>> es 0x0 0
>>>> fs 0x0 0
>>>> gs 0x0 0
>>>> (gdb) x/16i $pc
>>>> => 0x461f92 <main(int, char**)+2388>: callq 0x49484c <shutdown_ack()>
>>>> 0x461f97 <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>> 0x461f9c <main(int, char**)+2398>: callq 0x4250b0 <***@plt>
>>>> 0x461fa1 <main(int, char**)+2403>: mov 0x70f5c0(%rip),%rdx
>>>> # 0xb71568 <msg_svrdown>
>>>> 0x461fa8 <main(int, char**)+2410>: mov 0x70ef51(%rip),%rax
>>>> # 0xb70f00 <msg_daemonname>
>>>> 0x461faf <main(int, char**)+2417>: mov %rdx,%rcx
>>>> 0x461fb2 <main(int, char**)+2420>: mov %rax,%rdx
>>>> 0x461fb5 <main(int, char**)+2423>: mov $0x1,%esi
>>>> 0x461fba <main(int, char**)+2428>: mov $0x8002,%edi
>>>> 0x461fbf <main(int, char**)+2433>: callq 0x425840
>>>> <***@plt>
>>>> 0x461fc4 <main(int, char**)+2438>: mov $0x0,%edi
>>>> 0x461fc9 <main(int, char**)+2443>: callq 0x4269c9
>>>> <acct_close(bool)>
>>>> 0x461fce <main(int, char**)+2448>: mov $0xb6ce00,%edi
>>>> 0x461fd3 <main(int, char**)+2453>: callq 0x425a00
>>>> <***@plt>
>>>> 0x461fd8 <main(int, char**)+2458>: mov $0x1,%edi
>>>> 0x461fdd <main(int, char**)+2463>: callq 0x424db0
>>>> <***@plt>
>>>> (gdb) thread apply all backtrace
>>>> Thread 12 (Thread 0x7fffe2ffd700 (LWP 36011)):
>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/pthread_cond_wait.S:185
>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at
>>>> u_threadpool.c:272
>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe2ffd700) at
>>>> pthread_create.c:333
>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 36004)):
>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/pthread_cond_wait.S:185
>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>> u_threadpool.c:272
>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>> pthread_create.c:333
>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 36003)):
>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/pthread_cond_wait.S:185
>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at
>>>> u_threadpool.c:272
>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>> pthread_create.c:333
>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 35871)):
>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/pthread_cond_wait.S:185
>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>> u_threadpool.c:272
>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>> pthread_create.c:333
>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 35869)):
>>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>> ../sysdeps/posix/sleep.c:55
>>>> #2 0x0000000000476913 in remove_completed_jobs (vp=0x0) at
>>>> req_jobobit.c:3759
>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>> pthread_create.c:333
>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 35868)):
>>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>> ../sysdeps/posix/sleep.c:55
>>>> #2 0x00000000004afb93 in remove_extra_recycle_jobs (vp=0x0) at
>>>> job_recycler.c:216
>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>> pthread_create.c:333
>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 35867)):
>>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>> ../sysdeps/posix/sleep.c:55
>>>> #2 0x00000000004bc853 in inspect_exiting_jobs (vp=0x0) at
>>>> exiting_jobs.c:319
>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>> pthread_create.c:333
>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 35866)):
>>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>> ../sysdeps/posix/sleep.c:55
>>>> #2 0x0000000000460769 in handle_queue_routing_retries (vp=0x0) at
>>>> pbsd_main.c:1079
>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>> pthread_create.c:333
>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 35865)):
>>>> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>> process_meth=0x4c4a4d <start_process_pbs_server_port(void*)>)
>>>> at ../Libnet/server_core.c:398
>>>> ---Type <return> to continue, or q <return> to quit---
>>>> #2 0x00000000004608cf in start_accept_listener (vp=0x0) at
>>>> pbsd_main.c:1141
>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>> pthread_create.c:333
>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 35864)):
>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/pthread_cond_wait.S:185
>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>> u_threadpool.c:272
>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>> pthread_create.c:333
>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>> _64/clone.S:109
>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 35860)):
>>>> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>> te.S:84
>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>> ../sysdeps/posix/usleep.c:32
>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>> pbsd_main.c:1935
>>>> (gdb) quit
>>>> A debugging session is active.
>>>> Inferior 1 [process 35860] will be killed.
>>>> Quit anyway? (y or n) y
>>>
>>>
>>>
>>> Commands executed from another terminal after starting pbs_server under gdb (r -D)
>>>
>>>> $ sudo service pbs_sched start
>>>> $ sudo service pbs_mom start
>>>> $ pbsnodes -a
>>>> Dual-E52630v4
>>>> state = free
>>>> power_state = Running
>>>> np = 4
>>>> ntype = cluster
>>>> status = rectime=1478748911,macaddr=34:
>>>> 97:f6:5d:09:a6,cpuclock=Fixed,varattr=,jobs=,state=free,netl
>>>> oad=322618417,gres=,loadave=0.06,ncpus=40,physmem=65857216kb
>>>> ,availmem=131970532kb,totmem=132849340kb,idletime=108,nusers=4,nsessions=17,sessions=1036
>>>> 1316 1327 1332 1420 1421 1422 1423 1424 1425 1426 1430 1471 1510 27075
>>>> 27130 35902,uname=Linux Dual-E52630v4 4.4.0-45-generic #66-Ubuntu SMP Wed
>>>> Oct 19 14:12:37 UTC 2016 x86_64,opsys=linux
>>>> mom_service_port = 15002
>>>> mom_manager_port = 15003
>>>> $ echo "sleep 30" | qsub
>>>> 0.Dual-E52630v4
>>>> $ qstat
>>>> Job ID Name User Time Use S
>>>> Queue
>>>> ------------------------- ---------------- --------------- -------- -
>>>> -----
>>>> 0.Dual-E52630v4 STDIN comp_admin 0 Q
>>>> batch
>>>
>>>
>>>
>>> On Thu, Nov 10, 2016 at 12:01 PM, Kazuhiro Fujita <
>>> ***@gmail.com> wrote:
>>>
>>>> David,
>>>>
>>>> Now, it works. Thank you.
>>>> But jobs are executed in LIFO order, as I observed on an E5-2630v3
>>>> server...
>>>> I show the result of 'qstat -t' after running 'echo "sleep 10" | qsub -t
>>>> 1-10' three times.
>>>>
>>>> Best,
>>>> Kazu
>>>>
>>>> $ qstat -t
>>>> Job ID Name User Time Use S
>>>> Queue
>>>> ------------------------- ---------------- --------------- -------- -
>>>> -----
>>>> 0.Dual-E5-2630v3 STDIN comp_admin 00:00:00 C
>>>> batch
>>>> 1[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
>>>> batch
>>>> 1[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
>>>> batch
>>>> 1[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
>>>> batch
>>>> 1[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
>>>> batch
>>>> 1[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
>>>> batch
>>>> 1[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
>>>> batch
>>>> 1[7].Dual-E5-2630v3 STDIN-7 comp_admin 00:00:00 C
>>>> batch
>>>> 1[8].Dual-E5-2630v3 STDIN-8 comp_admin 00:00:00 C
>>>> batch
>>>> 1[9].Dual-E5-2630v3 STDIN-9 comp_admin 00:00:00 C
>>>> batch
>>>> 1[10].Dual-E5-2630v3 STDIN-10 comp_admin 00:00:00 C
>>>> batch
>>>> 2[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
>>>> batch
>>>> 2[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
>>>> batch
>>>> 2[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
>>>> batch
>>>> 2[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
>>>> batch
>>>> 2[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
>>>> batch
>>>> 2[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
>>>> batch
>>>> 2[7].Dual-E5-2630v3 STDIN-7 comp_admin 0 Q
>>>> batch
>>>> 2[8].Dual-E5-2630v3 STDIN-8 comp_admin 0 Q
>>>> batch
>>>> 2[9].Dual-E5-2630v3 STDIN-9 comp_admin 0 Q
>>>> batch
>>>> 2[10].Dual-E5-2630v3 STDIN-10 comp_admin 0 Q
>>>> batch
>>>> 3[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
>>>> batch
>>>> 3[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
>>>> batch
>>>> 3[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
>>>> batch
>>>> 3[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
>>>> batch
>>>> 3[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
>>>> batch
>>>> 3[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
>>>> batch
>>>> 3[7].Dual-E5-2630v3 STDIN-7 comp_admin 0 R
>>>> batch
>>>> 3[8].Dual-E5-2630v3 STDIN-8 comp_admin 0 R
>>>> batch
>>>> 3[9].Dual-E5-2630v3 STDIN-9 comp_admin 0 R
>>>> batch
>>>> 3[10].Dual-E5-2630v3 STDIN-10 comp_admin 0 R
>>>> batch
>>>>
>>>>
>>>>
>>>> On Thu, Nov 10, 2016 at 3:07 AM, David Beer <
>>>> ***@adaptivecomputing.com> wrote:
>>>>
>>>>> Kazu,
>>>>>
>>>>> I was able to get a system to reproduce this error. I have now checked
>>>>> in another fix, and I can no longer reproduce this. Can you pull the latest
>>>>> and let me know if it fixes it for you?
>>>>>
>>>>> On Tue, Nov 8, 2016 at 2:06 AM, Kazuhiro Fujita <
>>>>> ***@gmail.com> wrote:
>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> I reinstalled 6.0-dev from GitHub today, and I think I observed
>>>>>> slightly different behavior.
>>>>>> I used the "service" command to start the daemons this time.
>>>>>>
>>>>>> Best,
>>>>>> Kazu
>>>>>>
>>>>>> Before the crash
>>>>>>
>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>> 6.0-dev 6.0-dev
>>>>>>> cd 6.0-dev
>>>>>>> ./autogen.sh
>>>>>>> # build and install torque
>>>>>>> ./configure
>>>>>>> make
>>>>>>> sudo make install
>>>>>>> # Set the correct name of the server
>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>> # configure and start trqauthd
>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>> sudo ldconfig
>>>>>>> sudo service trqauthd start
>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>> sudo ./torque.setup $USER
>>>>>>> sudo qmgr -c 'p s'
>>>>>>> sudo qterm
>>>>>>> sudo service trqauthd stop
>>>>>>> ps aux | grep pbs
>>>>>>> ps aux | grep trq
>>>>>>> # set nodes
>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>> # set the head node
>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>> # configure the other daemons
>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>> # start torque daemons
>>>>>>> sudo service trqauthd start
>>>>>>> sudo service pbs_server start
>>>>>>> sudo service pbs_sched start
>>>>>>> sudo service pbs_mom start
>>>>>>> # check the configuration of the computation nodes
>>>>>>> pbsnodes -a
>>>>>>
>>>>>>
>>>>>> I checked the torque processes with "ps aux | grep pbs" and "ps aux |
>>>>>> grep trq" several times.
>>>>>> After "pbsnodes -a", everything seemed OK.
>>>>>> But the next qsub command appears to trigger a crash of "pbs_server"
>>>>>> and "pbs_sched".
>>>>>>
>>>>>> $ ps aux | grep trq
>>>>>>> root 9682 0.0 0.0 109112 3632 ? S 17:39 0:00
>>>>>>> /usr/local/sbin/trqauthd
>>>>>>> comp_ad+ 9842 0.0 0.0 15236 936 pts/8 S+ 17:40 0:00
>>>>>>> grep --color=auto trq
>>>>>>> $ ps aux | grep pbs
>>>>>>> root 9720 0.0 0.0 695140 25760 ? Sl 17:39 0:00
>>>>>>> /usr/local/sbin/pbs_server
>>>>>>> root 9771 0.0 0.0 37996 4940 ? Ss 17:39 0:00
>>>>>>> /usr/local/sbin/pbs_sched
>>>>>>> root 9814 0.2 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>>> /usr/local/sbin/pbs_mom
>>>>>>> comp_ad+ 9844 0.0 0.0 15236 1012 pts/8 S+ 17:40 0:00
>>>>>>> grep --color=auto pbs
>>>>>>> $ echo "sleep 30" | qsub
>>>>>>> 0.Dual-E52630v4
>>>>>>> $ ps aux | grep pbs
>>>>>>> root 9814 0.1 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>>> /usr/local/sbin/pbs_mom
>>>>>>> comp_ad+ 9855 0.0 0.0 15236 928 pts/8 S+ 17:41 0:00
>>>>>>> grep --color=auto pbs
>>>>>>> $ ps aux | grep trq
>>>>>>> root 9682 0.0 0.0 109112 4144 ? S 17:39 0:00
>>>>>>> /usr/local/sbin/trqauthd
>>>>>>> comp_ad+ 9860 0.0 0.0 15236 1092 pts/8 S+ 17:41 0:00
>>>>>>> grep --color=auto trq
>>>>>>
>>>>>>
>>>>>> Then I stopped the remaining processes,
>>>>>>
>>>>>> sudo service pbs_mom stop
>>>>>>> sudo service trqauthd stop
>>>>>>
>>>>>>
>>>>>> and started "trqauthd" again, then ran "pbs_server" under gdb.
>>>>>> "pbs_server" crashed in gdb without any further commands.
>>>>>>
>>>>>> sudo service trqauthd start
>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>
>>>>>>
>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>> This is free software: you are free to change and redistribute it.
>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>> copying"
>>>>>> and "show warranty" for details.
>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>> Type "show configuration" for configuration details.
>>>>>> For bug reporting instructions, please see:
>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>> For help, type "help".
>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>> (gdb) r -D
>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>> [Thread debugging using libthread_db enabled]
>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>> ad_db.so.1".
>>>>>>
>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>> __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
>>>>>> directory.
>>>>>> (gdb) bt
>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>> at svr_jobfunc.c:4011
>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>>>> pbsd_init.c:2558
>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>> pbsd_init.c:1803
>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>>>> pbsd_init.c:2100
>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>> pbsd_main.c:1898
>>>>>> (gdb) backtrace full
>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>> No locals.
>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>> at svr_jobfunc.c:4011
>>>>>> rc = 0
>>>>>> err_msg = 0x0
>>>>>> stub_msg = "no pos"
>>>>>> __func__ = "unlock_ji_mutex"
>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>> pattrjb = 0x7fffffff4a10
>>>>>> pdef = 0x4
>>>>>> pque = 0x0
>>>>>> rc = 0
>>>>>> log_buf = '\000' <repeats 24 times>,
>>>>>> "\030\000\000\000\060\000\000\000PU\377\377\377\177\000\000\220T\377\377\377\177",
>>>>>> '\000' <repeats 50 times>, "\003\000\000\000\000\000\000\
>>>>>> 000#\000\000\000\000\000\000\000pO\377\377\377\177", '\000' <repeats
>>>>>> 26 times>, "\221\260\000\000\000\200\377\
>>>>>> 377oO\377\377\377\177\000\000H+B\366\377\177\000\000p+B\366\
>>>>>> 377\177\000\000\200O\377\377\377\177\000\000\201\260\000\000
>>>>>> \000\200\377\377\177O\377\377\377\177", '\000' <repeats 18 times>...
>>>>>> time_now = 1478594788
>>>>>> job_id = "0.Dual-E52630v4\000\000\000\0
>>>>>> 00\000\000\000\000\000\362\377\377\377\377\377\377\377\340J\
>>>>>> 377\377\377\177\000\000\060L\377\377\377\177\000\000\001\000
>>>>>> \000\000\000\000\000\000\244\201\000\000\001\000\000\000\030
>>>>>> \354\377\367\377\177\000\***@L\377\377\377\177\000\000\000\0
>>>>>> 00\000\000\005\000\000\220\r\000\000\000\000\000\000\000k\02
>>>>>> 2j\365\377\177\000\000\031J\377\377\377\177\000\000\201n\376
>>>>>> \017\000\000\000\000\\\216!X\000\000\000\000_#\343+\000\000\
>>>>>> 000\000\\\216!X\000\000\000\000\207\065],", '\000' <repeats 36
>>>>>> times>, "k\022j\365\377\177\000\000\300K\377\377\377\177\000\000\000
>>>>>> \000\000\000\000\000\000\000"...
>>>>>> queue_name = "batch\000\377\377\240\340\377\367\377\177\000"
>>>>>> total_jobs = 0
>>>>>> user_jobs = 0
>>>>>> array_jobs = 0
>>>>>> __func__ = "svr_enquejob"
>>>>>> que_mgr = {unlock_on_exit = 160, locked = 75, mutex_valid =
>>>>>> 255, managed_mutex = 0x7ffff7ddccda <open_path+474>}
>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>> newstate = 0
>>>>>> newsubstate = 0
>>>>>> rc = 0
>>>>>> log_buf = "pbsd_init_reque:1", '\000' <repeats 1063 times>...
>>>>>> __func__ = "pbsd_init_reque"
>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>>>> pbsd_init.c:2558
>>>>>> d = 0
>>>>>> rc = 0
>>>>>> time_now = 1478594788
>>>>>> log_buf = '\000' <repeats 2112 times>...
>>>>>> local_errno = 0
>>>>>> job_id = '\000' <repeats 1016 times>...
>>>>>> job_atr_hold = 0
>>>>>> job_exit_status = 0
>>>>>> __func__ = "pbsd_init_job"
>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>> pbsd_init.c:1803
>>>>>> pjob = 0x512d880
>>>>>> Index = 0
>>>>>> JobArray_iter = {first = "0.Dual-E52630v4", second = }
>>>>>> log_buf = "14 total files read from
>>>>>> disk\000\000\000\000\000\000\000\001\000\000\000\320\316\022
>>>>>> \005\000\000\000\000\220N\022\005", '\000' <repeats 12 times>,
>>>>>> "Expected 1, recovered 1 queues", '\000' <repeats 1330 times>...
>>>>>> rc = 0
>>>>>> job_rc = 0
>>>>>> logtype = 0
>>>>>> pdirent = 0x0
>>>>>> pdirent_sub = 0x0
>>>>>> dir = 0x5124e90
>>>>>> dir_sub = 0x0
>>>>>> had = 0
>>>>>> pjob = 0x0
>>>>>> time_now = 1478594788
>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>> basen = '\000' <repeats 1088 times>...
>>>>>> use_jobs_subdirs = 0
>>>>>> __func__ = "handle_job_recovery"
>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>>>> pbsd_init.c:2100
>>>>>> rc = 0
>>>>>> tmp_rc = 1974134615
>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>> ret = 0
>>>>>> gid = 0
>>>>>> log_buf = "pbsd_init:1", '\000' <repeats 997 times>...
>>>>>> __func__ = "pbsd_init"
>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>> pbsd_main.c:1898
>>>>>> i = 2
>>>>>> rc = 0
>>>>>> local_errno = 0
>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>> '\000' <repeats 983 times>
>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>> tmpLine = "Server Dual-E52630v4 started, initialization type
>>>>>> = 1", '\000' <repeats 970 times>
>>>>>> log_buf = "Server Dual-E52630v4 started, initialization type
>>>>>> = 1", '\000' <repeats 1139 times>...
>>>>>> server_name_file_port = 15001
>>>>>> fp = 0x51095f0
>>>>>> (gdb) info registers
>>>>>> rax 0x0 0
>>>>>> rbx 0x6 6
>>>>>> rcx 0x0 0
>>>>>> rdx 0x512f1b0 85127600
>>>>>> rsi 0x0 0
>>>>>> rdi 0x512f1b0 85127600
>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>> r8 0x0 0
>>>>>> r9 0x7fffffff57a2 140737488312226
>>>>>> r10 0x513c800 85182464
>>>>>> r11 0x7ffff61e6128 140737322574120
>>>>>> r12 0x4260b0 4350128
>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>> r14 0x0 0
>>>>>> r15 0x0 0
>>>>>> rip 0x461f29 0x461f29 <main(int, char**)+2183>
>>>>>> eflags 0x10246 [ PF ZF IF RF ]
>>>>>> cs 0x33 51
>>>>>> ss 0x2b 43
>>>>>> ds 0x0 0
>>>>>> es 0x0 0
>>>>>> fs 0x0 0
>>>>>> gs 0x0 0
>>>>>> (gdb) x/16i $pc
>>>>>> => 0x461f29 <main(int, char**)+2183>: test %eax,%eax
>>>>>> 0x461f2b <main(int, char**)+2185>: setne %al
>>>>>> 0x461f2e <main(int, char**)+2188>: test %al,%al
>>>>>> 0x461f30 <main(int, char**)+2190>: je 0x461f55 <main(int,
>>>>>> char**)+2227>
>>>>>> 0x461f32 <main(int, char**)+2192>: mov 0x70efc7(%rip),%rax
>>>>>> # 0xb70f00 <msg_daemonname>
>>>>>> 0x461f39 <main(int, char**)+2199>: mov $0x51bab2,%edx
>>>>>> 0x461f3e <main(int, char**)+2204>: mov %rax,%rsi
>>>>>> 0x461f41 <main(int, char**)+2207>: mov $0xffffffff,%edi
>>>>>> 0x461f46 <main(int, char**)+2212>: callq 0x425420
>>>>>> <***@plt>
>>>>>> 0x461f4b <main(int, char**)+2217>: mov $0x3,%edi
>>>>>> 0x461f50 <main(int, char**)+2222>: callq 0x425680 <***@plt>
>>>>>> 0x461f55 <main(int, char**)+2227>: mov 0x71021d(%rip),%esi
>>>>>> # 0xb72178 <pbs_mom_port>
>>>>>> 0x461f5b <main(int, char**)+2233>: mov 0x710227(%rip),%ecx
>>>>>> # 0xb72188 <pbs_scheduler_port>
>>>>>> 0x461f61 <main(int, char**)+2239>: mov 0x710225(%rip),%edx
>>>>>> # 0xb7218c <pbs_server_port_dis>
>>>>>> 0x461f67 <main(int, char**)+2245>: lea -0x1400(%rbp),%rax
>>>>>> 0x461f6e <main(int, char**)+2252>: mov $0xb739c0,%r9d
>>>>>> (gdb) thread apply all backtrace
>>>>>>
>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 10004)):
>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>> at svr_jobfunc.c:4011
>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>>>> pbsd_init.c:2558
>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>> pbsd_init.c:1803
>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>>>> pbsd_init.c:2100
>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>> pbsd_main.c:1898
>>>>>> (gdb) quit
>>>>>> A debugging session is active.
>>>>>>
>>>>>> Inferior 1 [process 10004] will be killed.
>>>>>>
>>>>>> Quit anyway? (y or n) y
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 2, 2016 at 1:43 AM, David Beer <
>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>
>>>>>>> Kazu,
>>>>>>>
>>>>>>> Thanks for sticking with us on this. You mentioned that pbs_server
>>>>>>> did not crash when you submitted the job, but you said that it and
>>>>>>> pbs_sched are "unstable." What do you mean by unstable? Will jobs run? Your
>>>>>>> gdb output looks like a pbs_server that isn't busy, but other than that it
>>>>>>> looks normal.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> On Tue, Nov 1, 2016 at 1:19 AM, Kazuhiro Fujita <
>>>>>>> ***@gmail.com> wrote:
>>>>>>>
>>>>>>>> David,
>>>>>>>>
>>>>>>>> I tested the 6.0-dev. It passed the "sudo ./torque.setup $USER"
>>>>>>>> script,
>>>>>>>> but pbs_server and pbs_sched are unstable like 6.1-dev.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kazu
>>>>>>>>
>>>>>>>> Before execution of gdb
>>>>>>>>
>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>>> cd 6.0-dev
>>>>>>>>> ./autogen.sh
>>>>>>>>> # build and install torque
>>>>>>>>> ./configure
>>>>>>>>> make
>>>>>>>>> sudo make install
>>>>>>>>> # Set the correct name of the server
>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>> # configure and start trqauthd
>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>> sudo ldconfig
>>>>>>>>> sudo service trqauthd start
>>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>
>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>> sudo qterm
>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>> # set nodes
>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>>> # set the head node
>>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>>> # configure the other daemons
>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>> # start torque daemons
>>>>>>>>> sudo service trqauthd start
>>>>>>>>
>>>>>>>>
>>>>>>>> Execution of gdb
>>>>>>>>
>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>
>>>>>>>>
>>>>>>>> Commands executed by another terminal
>>>>>>>>
>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>> pbsnodes -a
>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>
>>>>>>>>
>>>>>>>> The last command did not cause a crash of pbs_server. The backtrace
>>>>>>>> is described below.
>>>>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>>> copying"
>>>>>>>> and "show warranty" for details.
>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>> For bug reporting instructions, please see:
>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>> For help, type "help".
>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>> (gdb) r -D
>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>> ad_db.so.1".
>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 5024)]
>>>>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 5025)]
>>>>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to
>>>>>>>> open tcp connection - connect() failed [rc = -2] [addr =
>>>>>>>> 10.0.0.249:15003]
>>>>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>>>>> hierarchy to host Dual-E52630v4:15003
>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 5026)]
>>>>>>>> [New Thread 0x7ffff21be700 (LWP 5027)]
>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 5028)]
>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 5029)]
>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5030)]
>>>>>>>> [Thread 0x7ffff09bb700 (LWP 5030) exited]
>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5031)]
>>>>>>>> [New Thread 0x7fffe3fff700 (LWP 5109)]
>>>>>>>> [New Thread 0x7fffe37fe700 (LWP 5113)]
>>>>>>>> [New Thread 0x7fffe29cf700 (LWP 5121)]
>>>>>>>> [Thread 0x7fffe29cf700 (LWP 5121) exited]
>>>>>>>> ^C
>>>>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>>>>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>>>>>> te.S:84
>>>>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>>>>> (gdb) backtrace full
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> No locals.
>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>>> state = 3
>>>>>>>> waittime = 5
>>>>>>>> pjob = 0x313a74
>>>>>>>> iter = 0x0
>>>>>>>> when = 1477984074
>>>>>>>> log = 0
>>>>>>>> scheduling = 1
>>>>>>>> sched_iteration = 600
>>>>>>>> time_now = 1477984190
>>>>>>>> update_loglevel = 1477984198
>>>>>>>> log_buf = "Server Ready, pid = 5020, loglevel=0", '\000'
>>>>>>>> <repeats 140 times>, "c\000\000\000\000\000\000\000
>>>>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>>>>>>> <repeats 26 times>...
>>>>>>>> sem_val = 5228929
>>>>>>>> __func__ = "main_loop"
>>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>> pbsd_main.c:1935
>>>>>>>> i = 2
>>>>>>>> rc = 0
>>>>>>>> local_errno = 0
>>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>>> '\000' <repeats 983 times>
>>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>>>>> server_name_file_port = 15001
>>>>>>>> fp = 0x51095f0
>>>>>>>> (gdb) info registers
>>>>>>>> rax 0xfffffffffffffdfc -516
>>>>>>>> rbx 0x5 5
>>>>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>>>>> rdx 0x0 0
>>>>>>>> rsi 0x0 0
>>>>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>>> r8 0x0 0
>>>>>>>> r9 0x4000001 67108865
>>>>>>>> r10 0x1 1
>>>>>>>> r11 0x293 659
>>>>>>>> r12 0x4260b0 4350128
>>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>>> r14 0x0 0
>>>>>>>> r15 0x0 0
>>>>>>>> rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
>>>>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>>>>> cs 0x33 51
>>>>>>>> ss 0x2b 43
>>>>>>>> ds 0x0 0
>>>>>>>> es 0x0 0
>>>>>>>> fs 0x0 0
>>>>>>>> gs 0x0 0
>>>>>>>> (gdb) x/16i $pc
>>>>>>>> => 0x461fb6 <main(int, char**)+2388>: callq 0x494762
>>>>>>>> <shutdown_ack()>
>>>>>>>> 0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>>>>> 0x461fc0 <main(int, char**)+2398>: callq 0x4250b0
>>>>>>>> <***@plt>
>>>>>>>> 0x461fc5 <main(int, char**)+2403>: mov 0x70f55c(%rip),%rdx
>>>>>>>> # 0xb71528 <msg_svrdown>
>>>>>>>> 0x461fcc <main(int, char**)+2410>: mov 0x70eeed(%rip),%rax
>>>>>>>> # 0xb70ec0 <msg_daemonname>
>>>>>>>> 0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
>>>>>>>> 0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
>>>>>>>> 0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
>>>>>>>> 0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
>>>>>>>> 0x461fe3 <main(int, char**)+2433>: callq 0x425840
>>>>>>>> <***@plt>
>>>>>>>> 0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
>>>>>>>> 0x461fed <main(int, char**)+2443>: callq 0x4269c9
>>>>>>>> <acct_close(bool)>
>>>>>>>> 0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
>>>>>>>> 0x461ff7 <main(int, char**)+2453>: callq 0x425a00
>>>>>>>> <***@plt>
>>>>>>>> 0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
>>>>>>>> 0x462001 <main(int, char**)+2463>: callq 0x424db0
>>>>>>>> <***@plt>
>>>>>>>> (gdb) thread apply all backtrace
>>>>>>>>
>>>>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>>
>>>>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>>
>>>>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>>
>>>>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>> #2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at
>>>>>>>> req_jobobit.c:3759
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>>
>>>>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>> #2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at
>>>>>>>> job_recycler.c:216
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>>
>>>>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>> #2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at
>>>>>>>> exiting_jobs.c:319
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>>
>>>>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>> #2 0x000000000046078d in handle_queue_routing_retries (vp=0x0) at
>>>>>>>> pbsd_main.c:1079
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>>
>>>>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
>>>>>>>> #0 0x00007ffff6ee17bd in accept () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>>>>> process_meth=0x4c4935 <start_process_pbs_server_port(void*)>)
>>>>>>>> at ../Libnet/server_core.c:398
>>>>>>>> #2 0x00000000004608f3 in start_accept_listener (vp=0x0) at
>>>>>>>> pbsd_main.c:1141
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>>
>>>>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>>
>>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>> pbsd_main.c:1935
>>>>>>>> (gdb) quit
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Oct 28, 2016 at 12:43 PM, Kazuhiro Fujita <
>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thank you for your comments.
>>>>>>>>> I will try the 6.0-dev next week.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Kazu
>>>>>>>>>
>>>>>>>>> On Fri, Oct 28, 2016 at 5:34 AM, David Beer <
>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>
>>>>>>>>>> I wonder if that fix wasn't placed in the hotfix. Is there any
>>>>>>>>>> chance you can try installing 6.0-dev on your system (via github) to see if
>>>>>>>>>> it's resolved. For the record, my Ubuntu 16 system doesn't give me this
>>>>>>>>>> error, or I'd try it myself. For whatever reason, none of our test cluster
>>>>>>>>>> machines (Cent & Redhat 6-7, SLES 11-12) experience this either. We did
>>>>>>>>>> have another user that experiences it on a test cluster, but not being able
>>>>>>>>>> to reproduce it has made it harder to track down.
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> David,
>>>>>>>>>>>
>>>>>>>>>>> I tried 6.0.2.h3, but it seems that the other issue still
>>>>>>>>>>> remains.
>>>>>>>>>>> After I initialized serverdb with "sudo pbs_server -t create",
>>>>>>>>>>> pbs_server crashed.
>>>>>>>>>>> Then I ran pbs_server under gdb.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kazu
>>>>>>>>>>>
>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>>>> This is free software: you are free to change and redistribute
>>>>>>>>>>> it.
>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type
>>>>>>>>>>> "show copying"
>>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>>>> For help, type "help".
>>>>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>>>> (gdb) r -D
>>>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>>>>> ad_db.so.1".
>>>>>>>>>>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>>>>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>>>>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>>>>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>>>>>>>>>>> [New Thread 0x7ffff21be700 (LWP 25594)]
>>>>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>>>>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>>>
>>>>>>>>>>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation
>>>>>>>>>>> fault.
>>>>>>>>>>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>>> __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such
>>>>>>>>>>> file or directory.
>>>>>>>>>>> (gdb) bt
>>>>>>>>>>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>>> #1 0x00000000004ac076 in dispatch_timed_task (ptask=0x5727660)
>>>>>>>>>>> at svr_task.c:318
>>>>>>>>>>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at
>>>>>>>>>>> pbsd_main.c:921
>>>>>>>>>>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at
>>>>>>>>>>> u_threadpool.c:318
>>>>>>>>>>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700) at
>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>> #5 0x00007ffff6165b5d in clone () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> David and Rick,
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for the quick response. I will try it later.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Kazu
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Actually, Rick just sent me the link. You can download it from
>>>>>>>>>>>>> here: http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I can confirm that this bug is fixed in 6.0-dev, and we've
>>>>>>>>>>>>>> made a hotfix for it, 6.0.2.h3. This was caused by a change in the
>>>>>>>>>>>>>> implementation of the pthread library, so most users will not see this
>>>>>>>>>>>>>> crash, but it appears that if you have a newer version of that library,
>>>>>>>>>>>>>> you will get it. Rick is going to send instructions for how to grab 6.0.2.h3.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you, David, for the comment on the backtrace.
>>>>>>>>>>>>>>> I hadn't noticed that until writing this mail,
>>>>>>>>>>>>>>> so I generated the backtrace as described in the Ubuntu wiki.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have also attached the backtrace of pbs_server (Torque 6.1-dev)
>>>>>>>>>>>>>>> taken with gdb.
>>>>>>>>>>>>>>> As I mentioned before, the torque.setup script completed successfully,
>>>>>>>>>>>>>>> but the server was unstable.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Before using gdb, I used the following commands.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git
>>>>>>>>>>>>>>>> -b 6.1-dev 6.1-dev
>>>>>>>>>>>>>>>> cd 6.1-dev
>>>>>>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>> # set as services
>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc
>>>>>>>>>>>>>>>> -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed
>>>>>>>>>>>>>>>> np)
>>>>>>>>>>>>>>>> sudo qterm -t quick
>>>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> trqauthd was not stopped by the last command, so I killed
>>>>>>>>>>>>>>> the trqauthd process instead.
>>>>>>>>>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee
>>>>>>>>>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In another terminal, I executed the following commands
>>>>>>>>>>>>>>> before pbs_server crashed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The output of the last command was "0.torque-server",
>>>>>>>>>>>>>>> and this command crashed pbs_server under gdb.
>>>>>>>>>>>>>>> I then captured the backtrace.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> David,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have attached the backtrace of pbs_server (Torque 6.0.2)
>>>>>>>>>>>>>>>> taken with gdb.
>>>>>>>>>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I started pbs_server with gdb,
>>>>>>>>>>>>>>>> and executed qmgr from another terminal (see below).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Kaz
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Kazu,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Can you give us a backtrace for this crash? We have fixed
>>>>>>>>>>>>>>>>> some issues on startup (around mutex management for newer pthread
>>>>>>>>>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>>>>>>>>>> seeing is fixed.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Dear All,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS
>>>>>>>>>>>>>>>>>> with dual E5-2630 v3 chips.
>>>>>>>>>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>>>>>>>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>>>>>>>>>>>>> I tried to set up Torque on them, but I got stuck
>>>>>>>>>>>>>>>>>> at the initial setup script.
>>>>>>>>>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server in the
>>>>>>>>>>>>>>>>>> initial setup script (torque.setup). (see below)
>>>>>>>>>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>>>>>>>>>> Have you ever observed this kind of error?
>>>>>>>>>>>>>>>>>> If you know possible solutions, please tell me;
>>>>>>>>>>>>>>>>>> any comments would be highly appreciated.
>>>>>>>>>>>>>>>>>> Would it be better to change the OS to another
>>>>>>>>>>>>>>>>>> distribution, such as Scientific Linux?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$ sudo
>>>>>>>>>>>>>>>>>>> ./torque.setup $USER
>>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is:
>>>>>>>>>>>>>>>>>>> 15001
>>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>>> initializing TORQUE (admin:
>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server)
>>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00
>>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$ ps aux | grep pbs
>>>>>>>>>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+
>>>>>>>>>>>>>>>>>>> 12:22 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> pbs_server -t create was not found in the process list.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is:
>>>>>>>>>>>>>>>>>>> 15001
>>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>>> initializing TORQUE (admin:
>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server)
>>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00
>>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+
>>>>>>>>>>>>>>>>>>> 16:11 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> pbs_server -t create was not found in the process list.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Commands used for installation before the setup script
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee
>>>>>>>>>>>>>>>>>>> /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> # set up as services
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd
>>>>>>>>>>>>>>>>>>> /etc/init.d/trqauthd
>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>> torqueusers mailing list
>>>>>>>>>>>>>>>>>> ***@supercluster.org
>>>>>>>>>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>>>>>> Adaptive Computing


--
David Beer | Torque Architect
Adaptive Computing
Kazuhiro Fujita
2016-11-24 04:52:05 UTC
Permalink
David,

I reinstalled Torque 6.0-dev without updating from GitHub.
This time, I could restart all the torque daemons,
but the qsub command caused pbs_server and pbs_sched to crash.
I have attached the log files to this mail.

Best,
Kazu

Before the crash:

> # build and install torque
> ./configure
> make
> sudo make install
> # Set a correct host name of the server
> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
> # configure and start trqauthd
> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
> sudo update-rc.d trqauthd defaults
> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
> sudo ldconfig
> sudo service trqauthd start
> # Initialize serverdb by executing the torque.setup script
> sudo ./torque.setup $USER
> sudo qmgr -c "p s"
> # stop pbs_server and trqauthd daemons for setting nodes.
> sudo qterm
> sudo service trqauthd stop
> ps aux | grep pbs
> ps aux | grep trq
> # set nodes
> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
> tee /var/spool/torque/server_priv/nodes
> sudo nano /var/spool/torque/server_priv/nodes
> # set the head node
> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
> # configure other torque daemons
> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
> sudo update-rc.d pbs_server defaults
> sudo update-rc.d pbs_sched defaults
> sudo update-rc.d pbs_mom defaults
> # restart torque daemons
> sudo service trqauthd start
> sudo service pbs_server start
> ps aux | grep pbs
> ps aux | grep trq
> sudo service pbs_sched start
> sudo service pbs_mom start
> ps aux | grep pbs
> ps aux | grep trq
> # check configuration of computation nodes
> pbsnodes -a


$ ps aux | grep trq
root 19130 0.0 0.0 109112 3756 ? S 13:25 0:00
/usr/local/sbin/trqauthd
comp_ad+ 19293 0.0 0.0 15236 1020 pts/8 S+ 13:28 0:00 grep
--color=auto trq
$ ps aux | grep pbs
root 19175 0.0 0.0 695136 23640 ? Sl 13:26 0:00
/usr/local/sbin/pbs_server
root 19224 0.0 0.0 37996 4936 ? Ss 13:27 0:00
/usr/local/sbin/pbs_sched
root 19265 0.1 0.2 173776 136692 ? SLsl 13:27 0:00
/usr/local/sbin/pbs_mom
comp_ad+ 19295 0.0 0.0 15236 924 pts/8 S+ 13:28 0:00 grep
--color=auto pbs

A subsequent qsub command crashed pbs_server and pbs_sched.

$ echo "sleep 30" | qsub
0.Dual-E52630v4
$ ps aux | grep trq
root 19130 0.0 0.0 109112 4268 ? S 13:25 0:00
/usr/local/sbin/trqauthd
comp_ad+ 19309 0.0 0.0 15236 1020 pts/8 S+ 13:28 0:00 grep
--color=auto trq
$ ps aux | grep pbs
root 19265 0.1 0.2 173776 136688 ? SLsl 13:27 0:00
/usr/local/sbin/pbs_mom
comp_ad+ 19311 0.0 0.0 15236 1016 pts/8 S+ 13:28 0:00 grep
--color=auto pbs




On Fri, Nov 18, 2016 at 4:21 AM, David Beer <***@adaptivecomputing.com>
wrote:

> Kazu,
>
> Did you look at the server logs?
>
> On Wed, Nov 16, 2016 at 12:24 AM, Kazuhiro Fujita <
> ***@gmail.com> wrote:
>
>> David,
>>
>> I could not find the pbs_server process after executing the commands
>> shown below.
>>
>> sudo service trqauthd start
>>> sudo service pbs_server start
>>
>>
>> I am not sure what it did.
>>
>> Best,
>> Kazu
>>
>>
>> On Wed, Nov 16, 2016 at 8:10 AM, David Beer <***@adaptivecomputing.com>
>> wrote:
>>
>>> Kazu,
>>>
>>> What did it do when it failed to start?
>>>
>>> On Wed, Nov 9, 2016 at 9:33 PM, Kazuhiro Fujita <
>>> ***@gmail.com> wrote:
>>>
>>>> David,
>>>>
>>>> In the last mail I sent, I had reinstalled 6.0-dev on the wrong server, as
>>>> you can see in the output (E5-2630v3).
>>>> On an E5-2630v4 server, pbs_server failed to restart as a daemon after "./torque.setup
>>>> $USER".
>>>>
>>>> Before crash:
>>>>
>>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>>>>> 6.0-dev
>>>>> cd 6.0-dev
>>>>> ./autogen.sh
>>>>> # build and install torque
>>>>> ./configure
>>>>> make
>>>>> sudo make install
>>>>> # Set the correct name of the server
>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>> # configure and start trqauthd
>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>> sudo update-rc.d trqauthd defaults
>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>> sudo ldconfig
>>>>> sudo service trqauthd start
>>>>> # Initialize serverdb by executing the torque.setup script
>>>>> sudo ./torque.setup $USER
>>>>> sudo qmgr -c 'p s'
>>>>> sudo qterm
>>>>> sudo service trqauthd stop
>>>>> ps aux | grep pbs
>>>>> ps aux | grep trq
>>>>> # set nodes
>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>> # set the head node
>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>>>>> # configure other daemons
>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>> sudo update-rc.d pbs_server defaults
>>>>> sudo update-rc.d pbs_sched defaults
>>>>> sudo update-rc.d pbs_mom defaults
>>>>> # restart torque daemons
>>>>> sudo service trqauthd start
>>>>> sudo service pbs_server start
>>>>
>>>>
>>>> pbs_server then did not start, so I started it under gdb.
>>>> But pbs_server under gdb did not crash, even after running qsub and qstat
>>>> from another terminal,
>>>> so I stopped pbs_server in gdb with Ctrl+C.
>>>>
>>>> Best,
>>>> Kazu
>>>>
>>>> gdb output
>>>>
>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>> http://gnu.org/licenses/gpl.html>
>>>>> This is free software: you are free to change and redistribute it.
>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>> copying"
>>>>> and "show warranty" for details.
>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>> Type "show configuration" for configuration details.
>>>>> For bug reporting instructions, please see:
>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>> Find the GDB manual and other documentation resources online at:
>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>> For help, type "help".
>>>>> Type "apropos word" to search for commands related to "word"...
>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>> (gdb) r -D
>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>> [Thread debugging using libthread_db enabled]
>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>>>> [New Thread 0x7ffff39c1700 (LWP 35864)]
>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to
>>>>> open tcp connection - connect() failed [rc = -2] [addr =
>>>>> 10.0.0.249:15003]
>>>>> [New Thread 0x7ffff31c0700 (LWP 35865)]
>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>> hierarchy to host Dual-E52630v4:15003
>>>>> [New Thread 0x7ffff29bf700 (LWP 35866)]
>>>>> [New Thread 0x7ffff21be700 (LWP 35867)]
>>>>> [New Thread 0x7ffff19bd700 (LWP 35868)]
>>>>> [New Thread 0x7ffff11bc700 (LWP 35869)]
>>>>> [New Thread 0x7ffff09bb700 (LWP 35870)]
>>>>> [Thread 0x7ffff09bb700 (LWP 35870) exited]
>>>>> [New Thread 0x7ffff09bb700 (LWP 35871)]
>>>>> [New Thread 0x7fffe3fff700 (LWP 36003)]
>>>>> [New Thread 0x7fffe37fe700 (LWP 36004)]
>>>>> [New Thread 0x7fffe2ffd700 (LWP 36011)]
>>>>> [New Thread 0x7fffe21ce700 (LWP 36016)]
>>>>> [Thread 0x7fffe21ce700 (LWP 36016) exited]
>>>>> ^C
>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>> (gdb) bt
>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>> ../sysdeps/posix/usleep.c:32
>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>> pbsd_main.c:1935
>>>>> (gdb) backtrace full
>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>> No locals.
>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>> ../sysdeps/posix/usleep.c:32
>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>> state = 3
>>>>> waittime = 5
>>>>> pjob = 0x313a74
>>>>> iter = 0x0
>>>>> when = 1478748888
>>>>> log = 0
>>>>> scheduling = 1
>>>>> sched_iteration = 600
>>>>> time_now = 1478748970
>>>>> update_loglevel = 1478748979
>>>>> log_buf = "Server Ready, pid = 35860, loglevel=0", '\000'
>>>>> <repeats 139 times>, "c\000\000\000\000\000\000\000
>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>>>> <repeats 26 times>...
>>>>> sem_val = 5229209
>>>>> __func__ = "main_loop"
>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>> pbsd_main.c:1935
>>>>> i = 2
>>>>> rc = 0
>>>>> local_errno = 0
>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>> '\000' <repeats 983 times>
>>>>> EMsg = '\000' <repeats 1023 times>
>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>> server_name_file_port = 15001
>>>>> fp = 0x51095f0
>>>>> (gdb) info registers
>>>>> rax 0xfffffffffffffdfc -516
>>>>> rbx 0x6 6
>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>> rdx 0x0 0
>>>>> rsi 0x0 0
>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>> r8 0x0 0
>>>>> r9 0x4000001 67108865
>>>>> r10 0x1 1
>>>>> r11 0x293 659
>>>>> r12 0x4260b0 4350128
>>>>> r13 0x7fffffffe590 140737488348560
>>>>> r14 0x0 0
>>>>> r15 0x0 0
>>>>> rip 0x461f92 0x461f92 <main(int, char**)+2388>
>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>> cs 0x33 51
>>>>> ss 0x2b 43
>>>>> ds 0x0 0
>>>>> es 0x0 0
>>>>> fs 0x0 0
>>>>> gs 0x0 0
>>>>> (gdb) x/16i $pc
>>>>> => 0x461f92 <main(int, char**)+2388>: callq 0x49484c <shutdown_ack()>
>>>>> 0x461f97 <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>> 0x461f9c <main(int, char**)+2398>: callq 0x4250b0 <***@plt>
>>>>> 0x461fa1 <main(int, char**)+2403>: mov 0x70f5c0(%rip),%rdx
>>>>> # 0xb71568 <msg_svrdown>
>>>>> 0x461fa8 <main(int, char**)+2410>: mov 0x70ef51(%rip),%rax
>>>>> # 0xb70f00 <msg_daemonname>
>>>>> 0x461faf <main(int, char**)+2417>: mov %rdx,%rcx
>>>>> 0x461fb2 <main(int, char**)+2420>: mov %rax,%rdx
>>>>> 0x461fb5 <main(int, char**)+2423>: mov $0x1,%esi
>>>>> 0x461fba <main(int, char**)+2428>: mov $0x8002,%edi
>>>>> 0x461fbf <main(int, char**)+2433>: callq 0x425840
>>>>> <***@plt>
>>>>> 0x461fc4 <main(int, char**)+2438>: mov $0x0,%edi
>>>>> 0x461fc9 <main(int, char**)+2443>: callq 0x4269c9
>>>>> <acct_close(bool)>
>>>>> 0x461fce <main(int, char**)+2448>: mov $0xb6ce00,%edi
>>>>> 0x461fd3 <main(int, char**)+2453>: callq 0x425a00
>>>>> <***@plt>
>>>>> 0x461fd8 <main(int, char**)+2458>: mov $0x1,%edi
>>>>> 0x461fdd <main(int, char**)+2463>: callq 0x424db0
>>>>> <***@plt>
>>>>> (gdb) thread apply all backtrace
>>>>> Thread 12 (Thread 0x7fffe2ffd700 (LWP 36011)):
>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at
>>>>> u_threadpool.c:272
>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe2ffd700) at
>>>>> pthread_create.c:333
>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 36004)):
>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>> u_threadpool.c:272
>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>> pthread_create.c:333
>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 36003)):
>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at
>>>>> u_threadpool.c:272
>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>> pthread_create.c:333
>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 35871)):
>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>> u_threadpool.c:272
>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>> pthread_create.c:333
>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 35869)):
>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>> ../sysdeps/posix/sleep.c:55
>>>>> #2 0x0000000000476913 in remove_completed_jobs (vp=0x0) at
>>>>> req_jobobit.c:3759
>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>> pthread_create.c:333
>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 35868)):
>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>> ../sysdeps/posix/sleep.c:55
>>>>> #2 0x00000000004afb93 in remove_extra_recycle_jobs (vp=0x0) at
>>>>> job_recycler.c:216
>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>> pthread_create.c:333
>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 35867)):
>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>> ../sysdeps/posix/sleep.c:55
>>>>> #2 0x00000000004bc853 in inspect_exiting_jobs (vp=0x0) at
>>>>> exiting_jobs.c:319
>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>> pthread_create.c:333
>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 35866)):
>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>> ../sysdeps/posix/sleep.c:55
>>>>> #2 0x0000000000460769 in handle_queue_routing_retries (vp=0x0) at
>>>>> pbsd_main.c:1079
>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>> pthread_create.c:333
>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 35865)):
>>>>> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-templa
>>>>> te.S:84
>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>> process_meth=0x4c4a4d <start_process_pbs_server_port(void*)>)
>>>>> at ../Libnet/server_core.c:398
>>>>> #2 0x00000000004608cf in start_accept_listener (vp=0x0) at
>>>>> pbsd_main.c:1141
>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>> pthread_create.c:333
>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 35864)):
>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>> u_threadpool.c:272
>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>> pthread_create.c:333
>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>> _64/clone.S:109
>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 35860)):
>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>> ../sysdeps/posix/usleep.c:32
>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>> pbsd_main.c:1935
>>>>> (gdb) quit
>>>>> A debugging session is active.
>>>>> Inferior 1 [process 35860] will be killed.
>>>>> Quit anyway? (y or n) y
>>>>
>>>>
>>>>
>>>> Commands executed from another terminal after starting pbs_server under gdb (r -D)
>>>>
>>>>> $ sudo service pbs_sched start
>>>>> $ sudo service pbs_mom start
>>>>> $ pbsnodes -a
>>>>> Dual-E52630v4
>>>>> state = free
>>>>> power_state = Running
>>>>> np = 4
>>>>> ntype = cluster
>>>>> status = rectime=1478748911,macaddr=34:
>>>>> 97:f6:5d:09:a6,cpuclock=Fixed,varattr=,jobs=,state=free,netl
>>>>> oad=322618417,gres=,loadave=0.06,ncpus=40,physmem=65857216kb
>>>>> ,availmem=131970532kb,totmem=132849340kb,idletime=108,nusers=4,nsessions=17,sessions=1036
>>>>> 1316 1327 1332 1420 1421 1422 1423 1424 1425 1426 1430 1471 1510 27075
>>>>> 27130 35902,uname=Linux Dual-E52630v4 4.4.0-45-generic #66-Ubuntu SMP Wed
>>>>> Oct 19 14:12:37 UTC 2016 x86_64,opsys=linux
>>>>> mom_service_port = 15002
>>>>> mom_manager_port = 15003
>>>>> $ echo "sleep 30" | qsub
>>>>> 0.Dual-E52630v4
>>>>> $ qstat
>>>>> Job ID Name User Time Use S
>>>>> Queue
>>>>> ------------------------- ---------------- --------------- -------- -
>>>>> -----
>>>>> 0.Dual-E52630v4 STDIN comp_admin 0 Q
>>>>> batch
>>>>
>>>>
>>>>
>>>> On Thu, Nov 10, 2016 at 12:01 PM, Kazuhiro Fujita <
>>>> ***@gmail.com> wrote:
>>>>
>>>>> David,
>>>>>
>>>>> Now, it works. Thank you.
>>>>> But jobs are executed in LIFO order, as I also observed on an
>>>>> E5-2630v3 server...
>>>>> Below is the output of 'qstat -t' after running 'echo "sleep 10" |
>>>>> qsub -t 1-10' three times.
>>>>>
>>>>> Best,
>>>>> Kazu
>>>>>
>>>>> $ qstat -t
>>>>> Job ID Name User Time Use S
>>>>> Queue
>>>>> ------------------------- ---------------- --------------- -------- -
>>>>> -----
>>>>> 0.Dual-E5-2630v3 STDIN comp_admin 00:00:00 C
>>>>> batch
>>>>> 1[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
>>>>> batch
>>>>> 1[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
>>>>> batch
>>>>> 1[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
>>>>> batch
>>>>> 1[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
>>>>> batch
>>>>> 1[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
>>>>> batch
>>>>> 1[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
>>>>> batch
>>>>> 1[7].Dual-E5-2630v3 STDIN-7 comp_admin 00:00:00 C
>>>>> batch
>>>>> 1[8].Dual-E5-2630v3 STDIN-8 comp_admin 00:00:00 C
>>>>> batch
>>>>> 1[9].Dual-E5-2630v3 STDIN-9 comp_admin 00:00:00 C
>>>>> batch
>>>>> 1[10].Dual-E5-2630v3 STDIN-10 comp_admin 00:00:00 C
>>>>> batch
>>>>> 2[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
>>>>> batch
>>>>> 2[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
>>>>> batch
>>>>> 2[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
>>>>> batch
>>>>> 2[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
>>>>> batch
>>>>> 2[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
>>>>> batch
>>>>> 2[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
>>>>> batch
>>>>> 2[7].Dual-E5-2630v3 STDIN-7 comp_admin 0 Q
>>>>> batch
>>>>> 2[8].Dual-E5-2630v3 STDIN-8 comp_admin 0 Q
>>>>> batch
>>>>> 2[9].Dual-E5-2630v3 STDIN-9 comp_admin 0 Q
>>>>> batch
>>>>> 2[10].Dual-E5-2630v3 STDIN-10 comp_admin 0 Q
>>>>> batch
>>>>> 3[1].Dual-E5-2630v3 STDIN-1 comp_admin 0 Q
>>>>> batch
>>>>> 3[2].Dual-E5-2630v3 STDIN-2 comp_admin 0 Q
>>>>> batch
>>>>> 3[3].Dual-E5-2630v3 STDIN-3 comp_admin 0 Q
>>>>> batch
>>>>> 3[4].Dual-E5-2630v3 STDIN-4 comp_admin 0 Q
>>>>> batch
>>>>> 3[5].Dual-E5-2630v3 STDIN-5 comp_admin 0 Q
>>>>> batch
>>>>> 3[6].Dual-E5-2630v3 STDIN-6 comp_admin 0 Q
>>>>> batch
>>>>> 3[7].Dual-E5-2630v3 STDIN-7 comp_admin 0 R
>>>>> batch
>>>>> 3[8].Dual-E5-2630v3 STDIN-8 comp_admin 0 R
>>>>> batch
>>>>> 3[9].Dual-E5-2630v3 STDIN-9 comp_admin 0 R
>>>>> batch
>>>>> 3[10].Dual-E5-2630v3 STDIN-10 comp_admin 0 R
>>>>> batch
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Nov 10, 2016 at 3:07 AM, David Beer <
>>>>> ***@adaptivecomputing.com> wrote:
>>>>>
>>>>>> Kazu,
>>>>>>
>>>>>> I was able to get a system to reproduce this error. I have now
>>>>>> checked in another fix, and I can no longer reproduce this. Can you pull
>>>>>> the latest and let me know if it fixes it for you?
>>>>>>
>>>>>> On Tue, Nov 8, 2016 at 2:06 AM, Kazuhiro Fujita <
>>>>>> ***@gmail.com> wrote:
>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> I reinstalled the 6.0-dev branch from GitHub today, and I think I
>>>>>>> observed slightly different behavior.
>>>>>>> This time I used the "service" command to start the daemons.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kazu
>>>>>>>
>>>>>>> Before the crash
>>>>>>>
>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>> cd 6.0-dev
>>>>>>>> ./autogen.sh
>>>>>>>> # build and install torque
>>>>>>>> ./configure
>>>>>>>> make
>>>>>>>> sudo make install
>>>>>>>> # Set the correct name of the server
>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>> # configure and start trqauthd
>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>> sudo ldconfig
>>>>>>>> sudo service trqauthd start
>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>> sudo ./torque.setup $USER
>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>> sudo qterm
>>>>>>>> sudo service trqauthd stop
>>>>>>>> ps aux | grep pbs
>>>>>>>> ps aux | grep trq
>>>>>>>> # set nodes
>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>> # set the head node
>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>> # configure the other daemons
>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>> # start torque daemons
>>>>>>>> sudo service trqauthd start
>>>>>>>> sudo service pbs_server start
>>>>>>>> sudo service pbs_sched start
>>>>>>>> sudo service pbs_mom start
>>>>>>>> # check the configuration of the computation nodes
>>>>>>>> pbsnodes -a
>>>>>>>
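(Editorial aside, not from the original mail: the `np=` value written to the nodes file above is just the logical-CPU count from /proc/cpuinfo. A minimal sketch cross-checking that pipeline against coreutils' `nproc` — not used in the original transcript; note `nproc` honors CPU affinity, so it can legitimately report fewer CPUs inside containers or under taskset:)

```shell
# Logical-CPU count as used for np= in the nodes file, computed two ways.
n_pipe=$(cat /proc/cpuinfo | grep processor | wc -l)  # pipeline from the setup steps above
n_nproc=$(nproc)                                      # coreutils; respects CPU affinity
echo "pipeline=$n_pipe nproc=$n_nproc"
```

On an unrestricted host the two numbers agree; in a cgroup- or affinity-limited environment `nproc` is the more honest value for `np=`.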
>>>>>>>
>>>>>>> I checked the torque processes with "ps aux | grep pbs" and "ps aux
>>>>>>> | grep trq" several times.
>>>>>>> After "pbsnodes -a" everything seems OK.
>>>>>>> But the next qsub command seems to trigger a crash of "pbs_server"
>>>>>>> and "pbs_sched".
>>>>>>>
>>>>>>> $ ps aux | grep trq
>>>>>>>> root 9682 0.0 0.0 109112 3632 ? S 17:39 0:00
>>>>>>>> /usr/local/sbin/trqauthd
>>>>>>>> comp_ad+ 9842 0.0 0.0 15236 936 pts/8 S+ 17:40 0:00
>>>>>>>> grep --color=auto trq
>>>>>>>> $ ps aux | grep pbs
>>>>>>>> root 9720 0.0 0.0 695140 25760 ? Sl 17:39 0:00
>>>>>>>> /usr/local/sbin/pbs_server
>>>>>>>> root 9771 0.0 0.0 37996 4940 ? Ss 17:39 0:00
>>>>>>>> /usr/local/sbin/pbs_sched
>>>>>>>> root 9814 0.2 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>>>> /usr/local/sbin/pbs_mom
>>>>>>>> comp_ad+ 9844 0.0 0.0 15236 1012 pts/8 S+ 17:40 0:00
>>>>>>>> grep --color=auto pbs
>>>>>>>> $ echo "sleep 30" | qsub
>>>>>>>> 0.Dual-E52630v4
>>>>>>>> $ ps aux | grep pbs
>>>>>>>> root 9814 0.1 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>>>> /usr/local/sbin/pbs_mom
>>>>>>>> comp_ad+ 9855 0.0 0.0 15236 928 pts/8 S+ 17:41 0:00
>>>>>>>> grep --color=auto pbs
>>>>>>>> $ ps aux | grep trq
>>>>>>>> root 9682 0.0 0.0 109112 4144 ? S 17:39 0:00
>>>>>>>> /usr/local/sbin/trqauthd
>>>>>>>> comp_ad+ 9860 0.0 0.0 15236 1092 pts/8 S+ 17:41 0:00
>>>>>>>> grep --color=auto trq
>>>>>>>
>>>>>>>
>>>>>>> Then I stopped the remaining processes,
>>>>>>>
>>>>>>> sudo service pbs_mom stop
>>>>>>>> sudo service trqauthd stop
>>>>>>>
>>>>>>>
>>>>>>> and started "trqauthd" again, and ran "pbs_server" under gdb.
>>>>>>> "pbs_server" crashed in gdb before any other commands were issued.
>>>>>>>
>>>>>>> sudo service trqauthd start
>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>
>>>>>>>
>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>> copying"
>>>>>>> and "show warranty" for details.
>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>> Type "show configuration" for configuration details.
>>>>>>> For bug reporting instructions, please see:
>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>> For help, type "help".
>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>> (gdb) r -D
>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>> ad_db.so.1".
>>>>>>>
>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>> __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
>>>>>>> directory.
>>>>>>> (gdb) bt
>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>> at svr_jobfunc.c:4011
>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>>>>> pbsd_init.c:2558
>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>> pbsd_init.c:1803
>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>>>>> pbsd_init.c:2100
>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>> pbsd_main.c:1898
>>>>>>> (gdb) backtrace full
>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>> No locals.
>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>> at svr_jobfunc.c:4011
>>>>>>> rc = 0
>>>>>>> err_msg = 0x0
>>>>>>> stub_msg = "no pos"
>>>>>>> __func__ = "unlock_ji_mutex"
>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>> pattrjb = 0x7fffffff4a10
>>>>>>> pdef = 0x4
>>>>>>> pque = 0x0
>>>>>>> rc = 0
>>>>>>> log_buf = '\000' <repeats 24 times>,
>>>>>>> "\030\000\000\000\060\000\000\000PU\377\377\377\177\000\000\220T\377\377\377\177",
>>>>>>> '\000' <repeats 50 times>, "\003\000\000\000\000\000\000\
>>>>>>> 000#\000\000\000\000\000\000\000pO\377\377\377\177", '\000'
>>>>>>> <repeats 26 times>, "\221\260\000\000\000\200\377\
>>>>>>> 377oO\377\377\377\177\000\000H+B\366\377\177\000\000p+B\366\
>>>>>>> 377\177\000\000\200O\377\377\377\177\000\000\201\260\000\000
>>>>>>> \000\200\377\377\177O\377\377\377\177", '\000' <repeats 18 times>...
>>>>>>> time_now = 1478594788
>>>>>>> job_id = "0.Dual-E52630v4\000\000\000\0
>>>>>>> 00\000\000\000\000\000\362\377\377\377\377\377\377\377\340J\
>>>>>>> 377\377\377\177\000\000\060L\377\377\377\177\000\000\001\000
>>>>>>> \000\000\000\000\000\000\244\201\000\000\001\000\000\000\030
>>>>>>> \354\377\367\377\177\000\***@L\377\377\377\177\000\000\000\0
>>>>>>> 00\000\000\005\000\000\220\r\000\000\000\000\000\000\000k\02
>>>>>>> 2j\365\377\177\000\000\031J\377\377\377\177\000\000\201n\376
>>>>>>> \017\000\000\000\000\\\216!X\000\000\000\000_#\343+\000\000\
>>>>>>> 000\000\\\216!X\000\000\000\000\207\065],", '\000' <repeats 36
>>>>>>> times>, "k\022j\365\377\177\000\000\300K\377\377\377\177\000\000\000
>>>>>>> \000\000\000\000\000\000\000"...
>>>>>>> queue_name = "batch\000\377\377\240\340\377\367\377\177\000"
>>>>>>> total_jobs = 0
>>>>>>> user_jobs = 0
>>>>>>> array_jobs = 0
>>>>>>> __func__ = "svr_enquejob"
>>>>>>> que_mgr = {unlock_on_exit = 160, locked = 75, mutex_valid =
>>>>>>> 255, managed_mutex = 0x7ffff7ddccda <open_path+474>}
>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>> newstate = 0
>>>>>>> newsubstate = 0
>>>>>>> rc = 0
>>>>>>> log_buf = "pbsd_init_reque:1", '\000' <repeats 1063 times>...
>>>>>>> __func__ = "pbsd_init_reque"
>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>>>>> pbsd_init.c:2558
>>>>>>> d = 0
>>>>>>> rc = 0
>>>>>>> time_now = 1478594788
>>>>>>> log_buf = '\000' <repeats 2112 times>...
>>>>>>> local_errno = 0
>>>>>>> job_id = '\000' <repeats 1016 times>...
>>>>>>> job_atr_hold = 0
>>>>>>> job_exit_status = 0
>>>>>>> __func__ = "pbsd_init_job"
>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>> pbsd_init.c:1803
>>>>>>> pjob = 0x512d880
>>>>>>> Index = 0
>>>>>>> JobArray_iter = {first = "0.Dual-E52630v4", second = }
>>>>>>> log_buf = "14 total files read from
>>>>>>> disk\000\000\000\000\000\000\000\001\000\000\000\320\316\022
>>>>>>> \005\000\000\000\000\220N\022\005", '\000' <repeats 12 times>,
>>>>>>> "Expected 1, recovered 1 queues", '\000' <repeats 1330 times>...
>>>>>>> rc = 0
>>>>>>> job_rc = 0
>>>>>>> logtype = 0
>>>>>>> pdirent = 0x0
>>>>>>> pdirent_sub = 0x0
>>>>>>> dir = 0x5124e90
>>>>>>> dir_sub = 0x0
>>>>>>> had = 0
>>>>>>> pjob = 0x0
>>>>>>> time_now = 1478594788
>>>>>>> basen = '\000' <repeats 1088 times>...
>>>>>>> use_jobs_subdirs = 0
>>>>>>> __func__ = "handle_job_recovery"
>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>>>>> pbsd_init.c:2100
>>>>>>> rc = 0
>>>>>>> tmp_rc = 1974134615
>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>> ret = 0
>>>>>>> gid = 0
>>>>>>> log_buf = "pbsd_init:1", '\000' <repeats 997 times>...
>>>>>>> __func__ = "pbsd_init"
>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>> pbsd_main.c:1898
>>>>>>> i = 2
>>>>>>> rc = 0
>>>>>>> local_errno = 0
>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>> '\000' <repeats 983 times>
>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>> tmpLine = "Server Dual-E52630v4 started, initialization type
>>>>>>> = 1", '\000' <repeats 970 times>
>>>>>>> log_buf = "Server Dual-E52630v4 started, initialization type
>>>>>>> = 1", '\000' <repeats 1139 times>...
>>>>>>> server_name_file_port = 15001
>>>>>>> fp = 0x51095f0
>>>>>>> (gdb) info registers
>>>>>>> rax 0x0 0
>>>>>>> rbx 0x6 6
>>>>>>> rcx 0x0 0
>>>>>>> rdx 0x512f1b0 85127600
>>>>>>> rsi 0x0 0
>>>>>>> rdi 0x512f1b0 85127600
>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>> r8 0x0 0
>>>>>>> r9 0x7fffffff57a2 140737488312226
>>>>>>> r10 0x513c800 85182464
>>>>>>> r11 0x7ffff61e6128 140737322574120
>>>>>>> r12 0x4260b0 4350128
>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>> r14 0x0 0
>>>>>>> r15 0x0 0
>>>>>>> rip 0x461f29 0x461f29 <main(int, char**)+2183>
>>>>>>> eflags 0x10246 [ PF ZF IF RF ]
>>>>>>> cs 0x33 51
>>>>>>> ss 0x2b 43
>>>>>>> ds 0x0 0
>>>>>>> es 0x0 0
>>>>>>> fs 0x0 0
>>>>>>> gs 0x0 0
>>>>>>> (gdb) x/16i $pc
>>>>>>> => 0x461f29 <main(int, char**)+2183>: test %eax,%eax
>>>>>>> 0x461f2b <main(int, char**)+2185>: setne %al
>>>>>>> 0x461f2e <main(int, char**)+2188>: test %al,%al
>>>>>>> 0x461f30 <main(int, char**)+2190>: je 0x461f55 <main(int,
>>>>>>> char**)+2227>
>>>>>>> 0x461f32 <main(int, char**)+2192>: mov 0x70efc7(%rip),%rax
>>>>>>> # 0xb70f00 <msg_daemonname>
>>>>>>> 0x461f39 <main(int, char**)+2199>: mov $0x51bab2,%edx
>>>>>>> 0x461f3e <main(int, char**)+2204>: mov %rax,%rsi
>>>>>>> 0x461f41 <main(int, char**)+2207>: mov $0xffffffff,%edi
>>>>>>> 0x461f46 <main(int, char**)+2212>: callq 0x425420
>>>>>>> <***@plt>
>>>>>>> 0x461f4b <main(int, char**)+2217>: mov $0x3,%edi
>>>>>>> 0x461f50 <main(int, char**)+2222>: callq 0x425680 <***@plt>
>>>>>>> 0x461f55 <main(int, char**)+2227>: mov 0x71021d(%rip),%esi
>>>>>>> # 0xb72178 <pbs_mom_port>
>>>>>>> 0x461f5b <main(int, char**)+2233>: mov 0x710227(%rip),%ecx
>>>>>>> # 0xb72188 <pbs_scheduler_port>
>>>>>>> 0x461f61 <main(int, char**)+2239>: mov 0x710225(%rip),%edx
>>>>>>> # 0xb7218c <pbs_server_port_dis>
>>>>>>> 0x461f67 <main(int, char**)+2245>: lea -0x1400(%rbp),%rax
>>>>>>> 0x461f6e <main(int, char**)+2252>: mov $0xb739c0,%r9d
>>>>>>> (gdb) thread apply all backtrace
>>>>>>>
>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 10004)):
>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>> at svr_jobfunc.c:4011
>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>>>>> pbsd_init.c:2558
>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>> pbsd_init.c:1803
>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>>>>> pbsd_init.c:2100
>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>> pbsd_main.c:1898
>>>>>>> (gdb) quit
>>>>>>> A debugging session is active.
>>>>>>>
>>>>>>> Inferior 1 [process 10004] will be killed.
>>>>>>>
>>>>>>> Quit anyway? (y or n) y
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 2, 2016 at 1:43 AM, David Beer <
>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>
>>>>>>>> Kazu,
>>>>>>>>
>>>>>>>> Thanks for sticking with us on this. You mentioned that pbs_server
>>>>>>>> did not crash when you submitted the job, but you said that it and
>>>>>>>> pbs_sched are "unstable." What do you mean by unstable? Will jobs run? You
>>>>>>>> gdb output looks like a pbs_server that isn't busy, but other than that it
>>>>>>>> looks normal.
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> On Tue, Nov 1, 2016 at 1:19 AM, Kazuhiro Fujita <
>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> David,
>>>>>>>>>
>>>>>>>>> I tested the 6.0-dev. It passed the "sudo ./torque.setup $USER"
>>>>>>>>> script,
>>>>>>>>> but pbs_server and pbs_sched are unstable, just as with 6.1-dev.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Kazu
>>>>>>>>>
>>>>>>>>> Before execution of gdb
>>>>>>>>>
>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>>>> cd 6.0-dev
>>>>>>>>>> ./autogen.sh
>>>>>>>>>> # build and install torque
>>>>>>>>>> ./configure
>>>>>>>>>> make
>>>>>>>>>> sudo make install
>>>>>>>>>> # Set the correct name of the server
>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>> # configure and start trqauthd
>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>> sudo ldconfig
>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>
>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>> sudo qterm
>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>> # set nodes
>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`"
>>>>>>>>>> | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>>>> # set the head node
>>>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>>>> # configure the other daemons
>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>> # start torque daemons
>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Execution of gdb
>>>>>>>>>
>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Commands executed from another terminal
>>>>>>>>>
>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>> pbsnodes -a
>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The last command did not cause a crash of pbs_server. The
>>>>>>>>> backtrace is described below.
>>>>>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>>>> copying"
>>>>>>>>> and "show warranty" for details.
>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>> For help, type "help".
>>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>> (gdb) r -D
>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>>> ad_db.so.1".
>>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 5024)]
>>>>>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 5025)]
>>>>>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying
>>>>>>>>> to open tcp connection - connect() failed [rc = -2] [addr =
>>>>>>>>> 10.0.0.249:15003]
>>>>>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>>>>>> hierarchy to host Dual-E52630v4:15003
>>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 5026)]
>>>>>>>>> [New Thread 0x7ffff21be700 (LWP 5027)]
>>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 5028)]
>>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 5029)]
>>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5030)]
>>>>>>>>> [Thread 0x7ffff09bb700 (LWP 5030) exited]
>>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5031)]
>>>>>>>>> [New Thread 0x7fffe3fff700 (LWP 5109)]
>>>>>>>>> [New Thread 0x7fffe37fe700 (LWP 5113)]
>>>>>>>>> [New Thread 0x7fffe29cf700 (LWP 5121)]
>>>>>>>>> [Thread 0x7fffe29cf700 (LWP 5121) exited]
>>>>>>>>> ^C
>>>>>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>>>>>> 0x00007ffff612a75d in nanosleep () at
>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>>>>>> (gdb) backtrace full
>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>> No locals.
>>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>>>> state = 3
>>>>>>>>> waittime = 5
>>>>>>>>> pjob = 0x313a74
>>>>>>>>> iter = 0x0
>>>>>>>>> when = 1477984074
>>>>>>>>> log = 0
>>>>>>>>> scheduling = 1
>>>>>>>>> sched_iteration = 600
>>>>>>>>> time_now = 1477984190
>>>>>>>>> update_loglevel = 1477984198
>>>>>>>>> log_buf = "Server Ready, pid = 5020, loglevel=0", '\000'
>>>>>>>>> <repeats 140 times>, "c\000\000\000\000\000\000\000
>>>>>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>>>>>>>> <repeats 26 times>...
>>>>>>>>> sem_val = 5228929
>>>>>>>>> __func__ = "main_loop"
>>>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>> pbsd_main.c:1935
>>>>>>>>> i = 2
>>>>>>>>> rc = 0
>>>>>>>>> local_errno = 0
>>>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>>>> '\000' <repeats 983 times>
>>>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>>>>>> server_name_file_port = 15001
>>>>>>>>> fp = 0x51095f0
>>>>>>>>> (gdb) info registers
>>>>>>>>> rax 0xfffffffffffffdfc -516
>>>>>>>>> rbx 0x5 5
>>>>>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>>>>>> rdx 0x0 0
>>>>>>>>> rsi 0x0 0
>>>>>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>>>> r8 0x0 0
>>>>>>>>> r9 0x4000001 67108865
>>>>>>>>> r10 0x1 1
>>>>>>>>> r11 0x293 659
>>>>>>>>> r12 0x4260b0 4350128
>>>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>>>> r14 0x0 0
>>>>>>>>> r15 0x0 0
>>>>>>>>> rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
>>>>>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>>>>>> cs 0x33 51
>>>>>>>>> ss 0x2b 43
>>>>>>>>> ds 0x0 0
>>>>>>>>> es 0x0 0
>>>>>>>>> fs 0x0 0
>>>>>>>>> gs 0x0 0
>>>>>>>>> (gdb) x/16i $pc
>>>>>>>>> => 0x461fb6 <main(int, char**)+2388>: callq 0x494762
>>>>>>>>> <shutdown_ack()>
>>>>>>>>> 0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>>>>>> 0x461fc0 <main(int, char**)+2398>: callq 0x4250b0
>>>>>>>>> <***@plt>
>>>>>>>>> 0x461fc5 <main(int, char**)+2403>: mov 0x70f55c(%rip),%rdx
>>>>>>>>> # 0xb71528 <msg_svrdown>
>>>>>>>>> 0x461fcc <main(int, char**)+2410>: mov 0x70eeed(%rip),%rax
>>>>>>>>> # 0xb70ec0 <msg_daemonname>
>>>>>>>>> 0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
>>>>>>>>> 0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
>>>>>>>>> 0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
>>>>>>>>> 0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
>>>>>>>>> 0x461fe3 <main(int, char**)+2433>: callq 0x425840
>>>>>>>>> <***@plt>
>>>>>>>>> 0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
>>>>>>>>> 0x461fed <main(int, char**)+2443>: callq 0x4269c9
>>>>>>>>> <acct_close(bool)>
>>>>>>>>> 0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
>>>>>>>>> 0x461ff7 <main(int, char**)+2453>: callq 0x425a00
>>>>>>>>> <***@plt>
>>>>>>>>> 0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
>>>>>>>>> 0x462001 <main(int, char**)+2463>: callq 0x424db0
>>>>>>>>> <***@plt>
>>>>>>>>> (gdb) thread apply all backtrace
>>>>>>>>>
>>>>>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>>>> u_threadpool.c:272
>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>>>>>> pthread_create.c:333
>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>
>>>>>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>>>> u_threadpool.c:272
>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>>>>>> pthread_create.c:333
>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>
>>>>>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>>>> u_threadpool.c:272
>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>>>>>> pthread_create.c:333
>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>
>>>>>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>> #2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at
>>>>>>>>> req_jobobit.c:3759
>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>>>>>> pthread_create.c:333
>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>
>>>>>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>> #2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at
>>>>>>>>> job_recycler.c:216
>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>>>>>> pthread_create.c:333
>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>
>>>>>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>> #2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at
>>>>>>>>> exiting_jobs.c:319
>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>>>>>> pthread_create.c:333
>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>
>>>>>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>> #2 0x000000000046078d in handle_queue_routing_retries (vp=0x0) at
>>>>>>>>> pbsd_main.c:1079
>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>>>>>> pthread_create.c:333
>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>
>>>>>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
>>>>>>>>> #0 0x00007ffff6ee17bd in accept () at
>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>>>>>> process_meth=0x4c4935 <start_process_pbs_server_port(void*)>)
>>>>>>>>> at ../Libnet/server_core.c:398
>>>>>>>>> #2 0x00000000004608f3 in start_accept_listener (vp=0x0) at
>>>>>>>>> pbsd_main.c:1141
>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>>>>>> pthread_create.c:333
>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>
>>>>>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>>>> u_threadpool.c:272
>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>>>>>> pthread_create.c:333
>>>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>
>>>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>> pbsd_main.c:1935
>>>>>>>>> (gdb) quit
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Oct 28, 2016 at 12:43 PM, Kazuhiro Fujita <
>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you for your comments.
>>>>>>>>>> I will try the 6.0-dev next week.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Kazu
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 28, 2016 at 5:34 AM, David Beer <
>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I wonder if that fix wasn't placed in the hotfix. Is there any
>>>>>>>>>>> chance you can try installing 6.0-dev on your system (via github) to see if
>>>>>>>>>>> it's resolved. For the record, my Ubuntu 16 system doesn't give me this
>>>>>>>>>>> error, or I'd try it myself. For whatever reason, none of our test cluster
>>>>>>>>>>> machines (Cent & Redhat 6-7, SLES 11-12) experience this either. We did
>>>>>>>>>>> have another user that experiences it on a test cluster, but not being able
>>>>>>>>>>> to reproduce it has made it harder to track down.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> David,
>>>>>>>>>>>>
>>>>>>>>>>>> I tried 6.0.2.h3, but it seems that the other issue
>>>>>>>>>>>> still remains.
>>>>>>>>>>>> After I initialized serverdb by "sudo pbs_server -t create",
>>>>>>>>>>>> pbs_server crashed.
>>>>>>>>>>>> Then, I used gdb with pbs_server.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Kazu
>>>>>>>>>>>>
>>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>>>>> This is free software: you are free to change and redistribute
>>>>>>>>>>>> it.
>>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type
>>>>>>>>>>>> "show copying"
>>>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>>>>> For help, type "help".
>>>>>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>>>>> (gdb) r -D
>>>>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>>>>>>>>>>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>>>>>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>>>>>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>>>>>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>>>>>>>>>>>> [New Thread 0x7ffff21be700 (LWP 25594)]
>>>>>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>>>>>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation
>>>>>>>>>>>> fault.
>>>>>>>>>>>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>>>> __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such
>>>>>>>>>>>> file or directory.
>>>>>>>>>>>> (gdb) bt
>>>>>>>>>>>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>>>> #1 0x00000000004ac076 in dispatch_timed_task (ptask=0x5727660)
>>>>>>>>>>>> at svr_task.c:318
>>>>>>>>>>>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at
>>>>>>>>>>>> pbsd_main.c:921
>>>>>>>>>>>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at
>>>>>>>>>>>> u_threadpool.c:318
>>>>>>>>>>>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #5 0x00007ffff6165b5d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> David and Rick,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for the quick response. I will try it later.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Actually, Rick just sent me the link. You can download it
>>>>>>>>>>>>>> from here: http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can confirm that this bug is fixed in 6.0-dev, and we've
>>>>>>>>>>>>>>> made a hotfix for it, 6.0.2.h3. This was caused because of a change in the
>>>>>>>>>>>>>>> implementation for the pthread library, so most will not see this crash,
>>>>>>>>>>>>>>> but it appears that if you have a newer version of that library, then you
>>>>>>>>>>>>>>> will get it. Rick is going to send instructions for how to grab 6.0.2.h3.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you David for the comment on the backtrace.
>>>>>>>>>>>>>>>> I haven't noticed that until writing this mail.
>>>>>>>>>>>>>>>> So, I used backtrace as written in the Ubuntu wiki.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I also attached the backtrace of pbs_server (Torque
>>>>>>>>>>>>>>>> 6.1-dev) by gdb.
>>>>>>>>>>>>>>>> As I mentioned before, the torque.setup script executed
>>>>>>>>>>>>>>>> successfully, but the server was unstable.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Before using gdb, I used following commands.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git
>>>>>>>>>>>>>>>>> -b 6.1-dev 6.1-dev
>>>>>>>>>>>>>>>>> cd 6.1-dev
>>>>>>>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee
>>>>>>>>>>>>>>>>> /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>>> # set as services
>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed
>>>>>>>>>>>>>>>>> np)
>>>>>>>>>>>>>>>>> sudo qterm -t quick
>>>>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> trqauthd was not stopped by the last command, so I stopped it
>>>>>>>>>>>>>>>> by killing the trqauthd process.
>>>>>>>>>>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee
>>>>>>>>>>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In another terminal, I executed the following commands
>>>>>>>>>>>>>>>> before pbs_server was crashed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The output of the last command is "0.torque-server".
>>>>>>>>>>>>>>>> And this command crashed the pbs_server in gdb.
>>>>>>>>>>>>>>>> Then, I made the backtrace.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> David,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2) by
>>>>>>>>>>>>>>>>> gdb.
>>>>>>>>>>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I started pbs_server with gdb,
>>>>>>>>>>>>>>>>> and execute qmgr from another terminal. (see below)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> After the qmgr execution, I pressed ctrl +c in gdb.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Kaz
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Kazu,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Can you give us a backtrace for this crash? We have fixed
>>>>>>>>>>>>>>>>>> some issues on startup (around mutex management for newer pthread
>>>>>>>>>>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>>>>>>>>>>> seeing is fixed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Dear All,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS
>>>>>>>>>>>>>>>>>>> with dual E5-2630 v3 chips.
>>>>>>>>>>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>>>>>>>>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>>>>>>>>>>>>>> And I tried to set up Torque on them, but I got stuck at
>>>>>>>>>>>>>>>>>>> the initial setup script.
>>>>>>>>>>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server in the
>>>>>>>>>>>>>>>>>>> initial setup script (torque.setup). (see below)
>>>>>>>>>>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>>>>>>>>>>> Have you ever observed this kind of error?
>>>>>>>>>>>>>>>>>>> If you know possible solutions, please tell me.
>>>>>>>>>>>>>>>>>>> Any comments will be highly appreciated.
>>>>>>>>>>>>>>>>>>> Would it be better to change the OS to another
>>>>>>>>>>>>>>>>>>> distribution, such as Scientific Linux?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$ sudo
>>>>>>>>>>>>>>>>>>>> ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is:
>>>>>>>>>>>>>>>>>>>> 15001
>>>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>>>> initializing TORQUE (admin:
>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server)
>>>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00
>>>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$ ps aux | grep pbs
>>>>>>>>>>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+
>>>>>>>>>>>>>>>>>>>> 12:22 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is:
>>>>>>>>>>>>>>>>>>>> 15001
>>>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>>>> initializing TORQUE (admin:
>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server)
>>>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00
>>>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+
>>>>>>>>>>>>>>>>>>>> 16:11 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Commands used for installation before the setup script
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee
>>>>>>>>>>>>>>>>>>>> /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> # set up as services
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd
>>>>>>>>>>>>>>>>>>>> /etc/init.d/trqauthd
>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom
>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>> torqueusers mailing list
>>>>>>>>>>>>>>>>>>> ***@supercluster.org
>>>>>>>>>>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>>>>>>> Adaptive Computing
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>>>> Adaptive Computing
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>>> Adaptive Computing
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>> Adaptive Computing
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> David Beer | Torque Architect
>>>>>>>> Adaptive Computing
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> David Beer | Torque Architect
>>>>>> Adaptive Computing
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> David Beer | Torque Architect
>>> Adaptive Computing
>>>
>>>
>>>
>>
>>
>>
>
>
> --
> David Beer | Torque Architect
> Adaptive Computing
>
>
>
David Beer
2016-11-28 23:53:13 UTC
Permalink
Kazu,

I'm shocked you're seeing so many issues. Can you send a backtrace? These
logs don't show anything sinister.

On Wed, Nov 23, 2016 at 9:52 PM, Kazuhiro Fujita <***@gmail.com>
wrote:

> David,
>
> I reinstalled the torque 6.0-dev without update from github.
> At this time, I could restart all torque daemons,
> but the qsub command caused pbs_server and pbs_sched to crash.
> I attached the log files in this mail.
>
> Best,
> Kazu
>
> Before the crash:
>
>> # build and install torque
>> ./configure
>> make
>> sudo make install
>> # Set a correct host name of the server
>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>> # configure and start trqauthd
>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>> sudo update-rc.d trqauthd defaults
>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>> sudo ldconfig
>> sudo service trqauthd start
>> # Initialize serverdb by executing the torque.setup script
>> sudo ./torque.setup $USER
>> sudo qmgr -c "p s"
>> # stop pbs_server and trqauthd daemons for setting nodes.
>> sudo qterm
>> sudo service trqauthd stop
>> ps aux | grep pbs
>> ps aux | grep trq
>> # set nodes
>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo tee /var/spool/torque/server_priv/nodes
>> sudo nano /var/spool/torque/server_priv/nodes
>> # set the head node
>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>> # configure other torque daemons
>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>> sudo update-rc.d pbs_server defaults
>> sudo update-rc.d pbs_sched defaults
>> sudo update-rc.d pbs_mom defaults
>> # restart torque daemons
>> sudo service trqauthd start
>> sudo service pbs_server start
>> ps aux | grep pbs
>> ps aux | grep trq
>> sudo service pbs_sched start
>> sudo service pbs_mom start
>> ps aux | grep pbs
>> ps aux | grep trq
>> # check configuration of computation nodes
>> pbsnodes -a
>
>
> $ ps aux | grep trq
> root 19130 0.0 0.0 109112 3756 ? S 13:25 0:00
> /usr/local/sbin/trqauthd
> comp_ad+ 19293 0.0 0.0 15236 1020 pts/8 S+ 13:28 0:00 grep
> --color=auto trq
> $ ps aux | grep pbs
> root 19175 0.0 0.0 695136 23640 ? Sl 13:26 0:00
> /usr/local/sbin/pbs_server
> root 19224 0.0 0.0 37996 4936 ? Ss 13:27 0:00
> /usr/local/sbin/pbs_sched
> root 19265 0.1 0.2 173776 136692 ? SLsl 13:27 0:00
> /usr/local/sbin/pbs_mom
> comp_ad+ 19295 0.0 0.0 15236 924 pts/8 S+ 13:28 0:00 grep
> --color=auto pbs
>
> Subsequent qsub command caused the crash of pbs_server and pbs_sched.
>
> $ echo "sleep 30" | qsub
> 0.Dual-E52630v4
> $ ps aux | grep trq
> root 19130 0.0 0.0 109112 4268 ? S 13:25 0:00
> /usr/local/sbin/trqauthd
> comp_ad+ 19309 0.0 0.0 15236 1020 pts/8 S+ 13:28 0:00 grep
> --color=auto trq
> $ ps aux | grep pbs
> root 19265 0.1 0.2 173776 136688 ? SLsl 13:27 0:00
> /usr/local/sbin/pbs_mom
> comp_ad+ 19311 0.0 0.0 15236 1016 pts/8 S+ 13:28 0:00 grep
> --color=auto pbs
>
>
>
>
> On Fri, Nov 18, 2016 at 4:21 AM, David Beer <***@adaptivecomputing.com>
> wrote:
>
>> Kazu,
>>
>> Did you look at the server logs?
>>
>> On Wed, Nov 16, 2016 at 12:24 AM, Kazuhiro Fujita <
>> ***@gmail.com> wrote:
>>
>>> David,
>>>
>>> I did not find the process of pbs_server after executions of commands
>>> shown below.
>>>
>>> sudo service trqauthd start
>>>> sudo service pbs_server start
>>>
>>>
>>> I am not sure what it did.
>>>
>>> Best,
>>> Kazu
>>>
>>>
>>> On Wed, Nov 16, 2016 at 8:10 AM, David Beer <***@adaptivecomputing.com
>>> > wrote:
>>>
>>>> Kazu,
>>>>
>>>> What did it do when it failed to start?
>>>>
>>>> On Wed, Nov 9, 2016 at 9:33 PM, Kazuhiro Fujita <
>>>> ***@gmail.com> wrote:
>>>>
>>>>> David,
>>>>>
>>>>> In the last mail I sent, I reinstalled 6.0-dev on the wrong server, as
>>>>> you can see in the output (E5-2630v3).
>>>>> In a E5-2630v4 server, pbs_server failed to restart as a daemon after "./torque.setup
>>>>> $USER".
>>>>>
>>>>> Before crash:
>>>>>
>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.0-dev
>>>>>> 6.0-dev
>>>>>> cd 6.0-dev
>>>>>> ./autogen.sh
>>>>>> # build and install torque
>>>>>> ./configure
>>>>>> make
>>>>>> sudo make install
>>>>>> # Set the correct name of the server
>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>> # configure and start trqauthd
>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>> sudo update-rc.d trqauthd defaults
>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>> sudo ldconfig
>>>>>> sudo service trqauthd start
>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>> sudo ./torque.setup $USER
>>>>>> sudo qmgr -c 'p s'
>>>>>> sudo qterm
>>>>>> sudo service trqauthd stop
>>>>>> ps aux | grep pbs
>>>>>> ps aux | grep trq
>>>>>> # set nodes
>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>> # set the head node
>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>>>>>> # configure other daemons
>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>> sudo update-rc.d pbs_server defaults
>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>> # restart torque daemons
>>>>>> sudo service trqauthd start
>>>>>> sudo service pbs_server start
>>>>>
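(Editor's note: the nodes-file line in the setup above counts "processor" lines in /proc/cpuinfo; coreutils' nproc gives the same core count more simply. A sketch using the same hostname variable and path as the quoted commands:)

```shell
# Build the TORQUE nodes-file entry "<hostname> np=<cores>" with nproc
# instead of grepping /proc/cpuinfo; path matches the setup above.
line="$HOSTNAME np=$(nproc)"
echo "$line"
# echo "$line" | sudo tee /var/spool/torque/server_priv/nodes
```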
>>>>>
>>>>> Then, pbs_server did not start, so I started it under gdb.
>>>>> But pbs_server under gdb did not crash, even after qsub and qstat from
>>>>> another terminal.
>>>>> So I stopped pbs_server in gdb with Ctrl+C.
>>>>>
>>>>> Best,
>>>>> Kazu
>>>>>
>>>>> gdb output
>>>>>
>>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>> This is free software: you are free to change and redistribute it.
>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>> copying"
>>>>>> and "show warranty" for details.
>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>> Type "show configuration" for configuration details.
>>>>>> For bug reporting instructions, please see:
>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>> For help, type "help".
>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>> (gdb) r -D
>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>> [Thread debugging using libthread_db enabled]
>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>> ad_db.so.1".
>>>>>> [New Thread 0x7ffff39c1700 (LWP 35864)]
>>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to
>>>>>> open tcp connection - connect() failed [rc = -2] [addr =
>>>>>> 10.0.0.249:15003]
>>>>>> [New Thread 0x7ffff31c0700 (LWP 35865)]
>>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>>> hierarchy to host Dual-E52630v4:15003
>>>>>> [New Thread 0x7ffff29bf700 (LWP 35866)]
>>>>>> [New Thread 0x7ffff21be700 (LWP 35867)]
>>>>>> [New Thread 0x7ffff19bd700 (LWP 35868)]
>>>>>> [New Thread 0x7ffff11bc700 (LWP 35869)]
>>>>>> [New Thread 0x7ffff09bb700 (LWP 35870)]
>>>>>> [Thread 0x7ffff09bb700 (LWP 35870) exited]
>>>>>> [New Thread 0x7ffff09bb700 (LWP 35871)]
>>>>>> [New Thread 0x7fffe3fff700 (LWP 36003)]
>>>>>> [New Thread 0x7fffe37fe700 (LWP 36004)]
>>>>>> [New Thread 0x7fffe2ffd700 (LWP 36011)]
>>>>>> [New Thread 0x7fffe21ce700 (LWP 36016)]
>>>>>> [Thread 0x7fffe21ce700 (LWP 36016) exited]
>>>>>> ^C
>>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>>>> te.S:84
>>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>>> (gdb) bt
>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>>> pbsd_main.c:1935
>>>>>> (gdb) backtrace full
>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>> No locals.
>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>>> state = 3
>>>>>> waittime = 5
>>>>>> pjob = 0x313a74
>>>>>> iter = 0x0
>>>>>> when = 1478748888
>>>>>> log = 0
>>>>>> scheduling = 1
>>>>>> sched_iteration = 600
>>>>>> time_now = 1478748970
>>>>>> update_loglevel = 1478748979
>>>>>> log_buf = "Server Ready, pid = 35860, loglevel=0", '\000'
>>>>>> <repeats 139 times>, "c\000\000\000\000\000\000\000
>>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>>>>> <repeats 26 times>...
>>>>>> sem_val = 5229209
>>>>>> __func__ = "main_loop"
>>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>>> pbsd_main.c:1935
>>>>>> i = 2
>>>>>> rc = 0
>>>>>> local_errno = 0
>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>> '\000' <repeats 983 times>
>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>>> server_name_file_port = 15001
>>>>>> fp = 0x51095f0
>>>>>> (gdb) info registers
>>>>>> rax 0xfffffffffffffdfc -516
>>>>>> rbx 0x6 6
>>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>>> rdx 0x0 0
>>>>>> rsi 0x0 0
>>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>> r8 0x0 0
>>>>>> r9 0x4000001 67108865
>>>>>> r10 0x1 1
>>>>>> r11 0x293 659
>>>>>> r12 0x4260b0 4350128
>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>> r14 0x0 0
>>>>>> r15 0x0 0
>>>>>> rip 0x461f92 0x461f92 <main(int, char**)+2388>
>>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>>> cs 0x33 51
>>>>>> ss 0x2b 43
>>>>>> ds 0x0 0
>>>>>> es 0x0 0
>>>>>> fs 0x0 0
>>>>>> gs 0x0 0
>>>>>> (gdb) x/16i $pc
>>>>>> => 0x461f92 <main(int, char**)+2388>: callq 0x49484c
>>>>>> <shutdown_ack()>
>>>>>> 0x461f97 <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>>> 0x461f9c <main(int, char**)+2398>: callq 0x4250b0 <***@plt>
>>>>>> 0x461fa1 <main(int, char**)+2403>: mov 0x70f5c0(%rip),%rdx
>>>>>> # 0xb71568 <msg_svrdown>
>>>>>> 0x461fa8 <main(int, char**)+2410>: mov 0x70ef51(%rip),%rax
>>>>>> # 0xb70f00 <msg_daemonname>
>>>>>> 0x461faf <main(int, char**)+2417>: mov %rdx,%rcx
>>>>>> 0x461fb2 <main(int, char**)+2420>: mov %rax,%rdx
>>>>>> 0x461fb5 <main(int, char**)+2423>: mov $0x1,%esi
>>>>>> 0x461fba <main(int, char**)+2428>: mov $0x8002,%edi
>>>>>> 0x461fbf <main(int, char**)+2433>: callq 0x425840
>>>>>> <***@plt>
>>>>>> 0x461fc4 <main(int, char**)+2438>: mov $0x0,%edi
>>>>>> 0x461fc9 <main(int, char**)+2443>: callq 0x4269c9
>>>>>> <acct_close(bool)>
>>>>>> 0x461fce <main(int, char**)+2448>: mov $0xb6ce00,%edi
>>>>>> 0x461fd3 <main(int, char**)+2453>: callq 0x425a00
>>>>>> <***@plt>
>>>>>> 0x461fd8 <main(int, char**)+2458>: mov $0x1,%edi
>>>>>> 0x461fdd <main(int, char**)+2463>: callq 0x424db0
>>>>>> <***@plt>
>>>>>> (gdb) thread apply all backtrace
>>>>>> Thread 12 (Thread 0x7fffe2ffd700 (LWP 36011)):
>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at
>>>>>> u_threadpool.c:272
>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe2ffd700) at
>>>>>> pthread_create.c:333
>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 36004)):
>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>>> u_threadpool.c:272
>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>>> pthread_create.c:333
>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 36003)):
>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at
>>>>>> u_threadpool.c:272
>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>>> pthread_create.c:333
>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 35871)):
>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>>> u_threadpool.c:272
>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>>> pthread_create.c:333
>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 35869)):
>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>> #2 0x0000000000476913 in remove_completed_jobs (vp=0x0) at
>>>>>> req_jobobit.c:3759
>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>>> pthread_create.c:333
>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 35868)):
>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>> #2 0x00000000004afb93 in remove_extra_recycle_jobs (vp=0x0) at
>>>>>> job_recycler.c:216
>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>>> pthread_create.c:333
>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 35867)):
>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>> #2 0x00000000004bc853 in inspect_exiting_jobs (vp=0x0) at
>>>>>> exiting_jobs.c:319
>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>>> pthread_create.c:333
>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 35866)):
>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>> #2 0x0000000000460769 in handle_queue_routing_retries (vp=0x0) at
>>>>>> pbsd_main.c:1079
>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>>> pthread_create.c:333
>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 35865)):
>>>>>> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-templa
>>>>>> te.S:84
>>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>>> process_meth=0x4c4a4d <start_process_pbs_server_port(void*)>)
>>>>>> at ../Libnet/server_core.c:398
>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>> #2 0x00000000004608cf in start_accept_listener (vp=0x0) at
>>>>>> pbsd_main.c:1141
>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>>> pthread_create.c:333
>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 35864)):
>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>>> u_threadpool.c:272
>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>>> pthread_create.c:333
>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>> _64/clone.S:109
>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 35860)):
>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>>> pbsd_main.c:1935
>>>>>> (gdb) quit
>>>>>> A debugging session is active.
>>>>>> Inferior 1 [process 35860] will be killed.
>>>>>> Quit anyway? (y or n) y
>>>>>
>>>>>
>>>>>
>>>>> Commands executed from another terminal after starting pbs_server
>>>>> under gdb (r -D)
>>>>>
>>>>>> $ sudo service pbs_sched start
>>>>>> $ sudo service pbs_mom start
>>>>>> $ pbsnodes -a
>>>>>> Dual-E52630v4
>>>>>> state = free
>>>>>> power_state = Running
>>>>>> np = 4
>>>>>> ntype = cluster
>>>>>> status = rectime=1478748911,macaddr=34:
>>>>>> 97:f6:5d:09:a6,cpuclock=Fixed,varattr=,jobs=,state=free,netl
>>>>>> oad=322618417,gres=,loadave=0.06,ncpus=40,physmem=65857216kb
>>>>>> ,availmem=131970532kb,totmem=132849340kb,idletime=108,nusers=4,nsessions=17,sessions=1036
>>>>>> 1316 1327 1332 1420 1421 1422 1423 1424 1425 1426 1430 1471 1510 27075
>>>>>> 27130 35902,uname=Linux Dual-E52630v4 4.4.0-45-generic #66-Ubuntu SMP Wed
>>>>>> Oct 19 14:12:37 UTC 2016 x86_64,opsys=linux
>>>>>> mom_service_port = 15002
>>>>>> mom_manager_port = 15003
>>>>>> $ echo "sleep 30" | qsub
>>>>>> 0.Dual-E52630v4
>>>>>> $ qstat
>>>>>> Job ID Name User Time Use S
>>>>>> Queue
>>>>>> ------------------------- ---------------- --------------- -------- -
>>>>>> -----
>>>>>> 0.Dual-E52630v4 STDIN comp_admin 0
>>>>>> Q batch
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Nov 10, 2016 at 12:01 PM, Kazuhiro Fujita <
>>>>> ***@gmail.com> wrote:
>>>>>
>>>>>> David,
>>>>>>
>>>>>> Now it works. Thank you.
>>>>>> But jobs are executed in LIFO order, as I observed on an E5-2630v3
>>>>>> server...
>>>>>> Below is the output of 'qstat -t' after running 'echo "sleep 10" | qsub
>>>>>> -t 1-10' three times.
>>>>>>
>>>>>> Best,
>>>>>> Kazu
>>>>>>
>>>>>> $ qstat -t
>>>>>> Job ID Name User Time Use S
>>>>>> Queue
>>>>>> ------------------------- ---------------- --------------- -------- -
>>>>>> -----
>>>>>> 0.Dual-E5-2630v3 STDIN comp_admin 00:00:00
>>>>>> C batch
>>>>>> 1[1].Dual-E5-2630v3 STDIN-1 comp_admin 0
>>>>>> Q batch
>>>>>> 1[2].Dual-E5-2630v3 STDIN-2 comp_admin 0
>>>>>> Q batch
>>>>>> 1[3].Dual-E5-2630v3 STDIN-3 comp_admin 0
>>>>>> Q batch
>>>>>> 1[4].Dual-E5-2630v3 STDIN-4 comp_admin 0
>>>>>> Q batch
>>>>>> 1[5].Dual-E5-2630v3 STDIN-5 comp_admin 0
>>>>>> Q batch
>>>>>> 1[6].Dual-E5-2630v3 STDIN-6 comp_admin 0
>>>>>> Q batch
>>>>>> 1[7].Dual-E5-2630v3 STDIN-7 comp_admin 00:00:00
>>>>>> C batch
>>>>>> 1[8].Dual-E5-2630v3 STDIN-8 comp_admin 00:00:00
>>>>>> C batch
>>>>>> 1[9].Dual-E5-2630v3 STDIN-9 comp_admin 00:00:00
>>>>>> C batch
>>>>>> 1[10].Dual-E5-2630v3 STDIN-10 comp_admin 00:00:00
>>>>>> C batch
>>>>>> 2[1].Dual-E5-2630v3 STDIN-1 comp_admin 0
>>>>>> Q batch
>>>>>> 2[2].Dual-E5-2630v3 STDIN-2 comp_admin 0
>>>>>> Q batch
>>>>>> 2[3].Dual-E5-2630v3 STDIN-3 comp_admin 0
>>>>>> Q batch
>>>>>> 2[4].Dual-E5-2630v3 STDIN-4 comp_admin 0
>>>>>> Q batch
>>>>>> 2[5].Dual-E5-2630v3 STDIN-5 comp_admin 0
>>>>>> Q batch
>>>>>> 2[6].Dual-E5-2630v3 STDIN-6 comp_admin 0
>>>>>> Q batch
>>>>>> 2[7].Dual-E5-2630v3 STDIN-7 comp_admin 0
>>>>>> Q batch
>>>>>> 2[8].Dual-E5-2630v3 STDIN-8 comp_admin 0
>>>>>> Q batch
>>>>>> 2[9].Dual-E5-2630v3 STDIN-9 comp_admin 0
>>>>>> Q batch
>>>>>> 2[10].Dual-E5-2630v3 STDIN-10 comp_admin 0
>>>>>> Q batch
>>>>>> 3[1].Dual-E5-2630v3 STDIN-1 comp_admin 0
>>>>>> Q batch
>>>>>> 3[2].Dual-E5-2630v3 STDIN-2 comp_admin 0
>>>>>> Q batch
>>>>>> 3[3].Dual-E5-2630v3 STDIN-3 comp_admin 0
>>>>>> Q batch
>>>>>> 3[4].Dual-E5-2630v3 STDIN-4 comp_admin 0
>>>>>> Q batch
>>>>>> 3[5].Dual-E5-2630v3 STDIN-5 comp_admin 0
>>>>>> Q batch
>>>>>> 3[6].Dual-E5-2630v3 STDIN-6 comp_admin 0
>>>>>> Q batch
>>>>>> 3[7].Dual-E5-2630v3 STDIN-7 comp_admin 0
>>>>>> R batch
>>>>>> 3[8].Dual-E5-2630v3 STDIN-8 comp_admin 0
>>>>>> R batch
>>>>>> 3[9].Dual-E5-2630v3 STDIN-9 comp_admin 0
>>>>>> R batch
>>>>>> 3[10].Dual-E5-2630v3 STDIN-10 comp_admin 0
>>>>>> R batch
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 10, 2016 at 3:07 AM, David Beer <
>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>
>>>>>>> Kazu,
>>>>>>>
>>>>>>> I was able to get a system to reproduce this error. I have now
>>>>>>> checked in another fix, and I can no longer reproduce this. Can you pull
>>>>>>> the latest and let me know if it fixes it for you?
>>>>>>>
>>>>>>> On Tue, Nov 8, 2016 at 2:06 AM, Kazuhiro Fujita <
>>>>>>> ***@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>> I reinstalled 6.0-dev from GitHub today, and I think I observed
>>>>>>>> slightly different behavior.
>>>>>>>> I used the "service" command to start the daemons this time.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kazu
>>>>>>>>
>>>>>>>> Before the crash
>>>>>>>>
>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>>> cd 6.0-dev
>>>>>>>>> ./autogen.sh
>>>>>>>>> # build and install torque
>>>>>>>>> ./configure
>>>>>>>>> make
>>>>>>>>> sudo make install
>>>>>>>>> # Set the correct name of the server
>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>> # configure and start trqauthd
>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>> sudo ldconfig
>>>>>>>>> sudo service trqauthd start
>>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>> sudo qterm
>>>>>>>>> sudo service trqauthd stop
>>>>>>>>> ps aux | grep pbs
>>>>>>>>> ps aux | grep trq
>>>>>>>>> # set nodes
>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>>> # set the head node
>>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>>> # configure other daemons
>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>> # start torque daemons
>>>>>>>>> sudo service trqauthd start
>>>>>>>>> sudo service pbs_server start
>>>>>>>>> sudo service pbs_sched start
>>>>>>>>> sudo service pbs_mom start
>>>>>>>>> # check configuration of computation nodes
>>>>>>>>> pbsnodes -a
>>>>>>>>
>>>>>>>>
>>>>>>>> I checked the torque processes with "ps aux | grep pbs" and "ps aux |
>>>>>>>> grep trq" several times.
>>>>>>>> After "pbsnodes -a", everything seemed OK.
>>>>>>>> But the next qsub command seems to trigger a crash of "pbs_server"
>>>>>>>> and "pbs_sched".
>>>>>>>>
>>>>>>>> $ ps aux | grep trq
>>>>>>>>> root 9682 0.0 0.0 109112 3632 ? S 17:39 0:00
>>>>>>>>> /usr/local/sbin/trqauthd
>>>>>>>>> comp_ad+ 9842 0.0 0.0 15236 936 pts/8 S+ 17:40 0:00
>>>>>>>>> grep --color=auto trq
>>>>>>>>> $ ps aux | grep pbs
>>>>>>>>> root 9720 0.0 0.0 695140 25760 ? Sl 17:39 0:00
>>>>>>>>> /usr/local/sbin/pbs_server
>>>>>>>>> root 9771 0.0 0.0 37996 4940 ? Ss 17:39 0:00
>>>>>>>>> /usr/local/sbin/pbs_sched
>>>>>>>>> root 9814 0.2 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>>>>> /usr/local/sbin/pbs_mom
>>>>>>>>> comp_ad+ 9844 0.0 0.0 15236 1012 pts/8 S+ 17:40 0:00
>>>>>>>>> grep --color=auto pbs
>>>>>>>>> $ echo "sleep 30" | qsub
>>>>>>>>> 0.Dual-E52630v4
>>>>>>>>> $ ps aux | grep pbs
>>>>>>>>> root 9814 0.1 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>>>>> /usr/local/sbin/pbs_mom
>>>>>>>>> comp_ad+ 9855 0.0 0.0 15236 928 pts/8 S+ 17:41 0:00
>>>>>>>>> grep --color=auto pbs
>>>>>>>>> $ ps aux | grep trq
>>>>>>>>> root 9682 0.0 0.0 109112 4144 ? S 17:39 0:00
>>>>>>>>> /usr/local/sbin/trqauthd
>>>>>>>>> comp_ad+ 9860 0.0 0.0 15236 1092 pts/8 S+ 17:41 0:00
>>>>>>>>> grep --color=auto trq
>>>>>>>>
>>>>>>>>
>>>>>>>> Then, I stopped the remaining processes,
>>>>>>>>
>>>>>>>> sudo service pbs_mom stop
>>>>>>>>> sudo service trqauthd stop
>>>>>>>>
>>>>>>>>
>>>>>>>> and started "trqauthd" again, then ran "pbs_server" under gdb.
>>>>>>>> "pbs_server" crashed in gdb without any other commands.
>>>>>>>>
>>>>>>>> sudo service trqauthd start
>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>
>>>>>>>>
>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>>> copying"
>>>>>>>> and "show warranty" for details.
>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>> For bug reporting instructions, please see:
>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>> For help, type "help".
>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>> (gdb) r -D
>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>> ad_db.so.1".
>>>>>>>>
>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>> __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file
>>>>>>>> or directory.
>>>>>>>> (gdb) bt
>>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>>> at svr_jobfunc.c:4011
>>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>>>>>> pbsd_init.c:2558
>>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>>> pbsd_init.c:1803
>>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>>>>>> pbsd_init.c:2100
>>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>> pbsd_main.c:1898
>>>>>>>> (gdb) backtrace full
>>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>> No locals.
>>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>>> at svr_jobfunc.c:4011
>>>>>>>> rc = 0
>>>>>>>> err_msg = 0x0
>>>>>>>> stub_msg = "no pos"
>>>>>>>> __func__ = "unlock_ji_mutex"
>>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>>> pattrjb = 0x7fffffff4a10
>>>>>>>> pdef = 0x4
>>>>>>>> pque = 0x0
>>>>>>>> rc = 0
>>>>>>>> log_buf = '\000' <repeats 24 times>,
>>>>>>>> "\030\000\000\000\060\000\000\000PU\377\377\377\177\000\000\220T\377\377\377\177",
>>>>>>>> '\000' <repeats 50 times>, "\003\000\000\000\000\000\000\
>>>>>>>> 000#\000\000\000\000\000\000\000pO\377\377\377\177", '\000'
>>>>>>>> <repeats 26 times>, "\221\260\000\000\000\200\377\
>>>>>>>> 377oO\377\377\377\177\000\000H+B\366\377\177\000\000p+B\366\
>>>>>>>> 377\177\000\000\200O\377\377\377\177\000\000\201\260\000\000
>>>>>>>> \000\200\377\377\177O\377\377\377\177", '\000' <repeats 18
>>>>>>>> times>...
>>>>>>>> time_now = 1478594788
>>>>>>>> job_id = "0.Dual-E52630v4\000\000\000\0
>>>>>>>> 00\000\000\000\000\000\362\377\377\377\377\377\377\377\340J\
>>>>>>>> 377\377\377\177\000\000\060L\377\377\377\177\000\000\001\000
>>>>>>>> \000\000\000\000\000\000\244\201\000\000\001\000\000\000\030
>>>>>>>> \354\377\367\377\177\000\***@L\377\377\377\177\000\000\000\0
>>>>>>>> 00\000\000\005\000\000\220\r\000\000\000\000\000\000\000k\02
>>>>>>>> 2j\365\377\177\000\000\031J\377\377\377\177\000\000\201n\376
>>>>>>>> \017\000\000\000\000\\\216!X\000\000\000\000_#\343+\000\000\
>>>>>>>> 000\000\\\216!X\000\000\000\000\207\065],", '\000' <repeats 36
>>>>>>>> times>, "k\022j\365\377\177\000\000\30
>>>>>>>> 0K\377\377\377\177\000\000\000\000\000\000\000\000\000\000"...
>>>>>>>> queue_name = "batch\000\377\377\240\340\377
>>>>>>>> \367\377\177\000"
>>>>>>>> total_jobs = 0
>>>>>>>> user_jobs = 0
>>>>>>>> array_jobs = 0
>>>>>>>> __func__ = "svr_enquejob"
>>>>>>>> que_mgr = {unlock_on_exit = 160, locked = 75, mutex_valid =
>>>>>>>> 255, managed_mutex = 0x7ffff7ddccda <open_path+474>}
>>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>>> newstate = 0
>>>>>>>> newsubstate = 0
>>>>>>>> rc = 0
>>>>>>>> log_buf = "pbsd_init_reque:1", '\000' <repeats 1063
>>>>>>>> times>...
>>>>>>>> __func__ = "pbsd_init_reque"
>>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>>>>>> pbsd_init.c:2558
>>>>>>>> d = 0
>>>>>>>> rc = 0
>>>>>>>> time_now = 1478594788
>>>>>>>> log_buf = '\000' <repeats 2112 times>...
>>>>>>>> local_errno = 0
>>>>>>>> job_id = '\000' <repeats 1016 times>...
>>>>>>>> job_atr_hold = 0
>>>>>>>> job_exit_status = 0
>>>>>>>> __func__ = "pbsd_init_job"
>>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>>> pbsd_init.c:1803
>>>>>>>> pjob = 0x512d880
>>>>>>>> Index = 0
>>>>>>>> JobArray_iter = {first = "0.Dual-E52630v4", second = }
>>>>>>>> log_buf = "14 total files read from
>>>>>>>> disk\000\000\000\000\000\000\000\001\000\000\000\320\316\022
>>>>>>>> \005\000\000\000\000\220N\022\005", '\000' <repeats 12 times>,
>>>>>>>> "Expected 1, recovered 1 queues", '\000' <repeats 1330 times>...
>>>>>>>> rc = 0
>>>>>>>> job_rc = 0
>>>>>>>> logtype = 0
>>>>>>>> pdirent = 0x0
>>>>>>>> pdirent_sub = 0x0
>>>>>>>> dir = 0x5124e90
>>>>>>>> dir_sub = 0x0
>>>>>>>> had = 0
>>>>>>>> pjob = 0x0
>>>>>>>> time_now = 1478594788
>>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>>> basen = '\000' <repeats 1088 times>...
>>>>>>>> use_jobs_subdirs = 0
>>>>>>>> __func__ = "handle_job_recovery"
>>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>>>>>> pbsd_init.c:2100
>>>>>>>> rc = 0
>>>>>>>> tmp_rc = 1974134615
>>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>>> ret = 0
>>>>>>>> gid = 0
>>>>>>>> log_buf = "pbsd_init:1", '\000' <repeats 997 times>...
>>>>>>>> __func__ = "pbsd_init"
>>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>> pbsd_main.c:1898
>>>>>>>> i = 2
>>>>>>>> rc = 0
>>>>>>>> local_errno = 0
>>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>>> '\000' <repeats 983 times>
>>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>>> tmpLine = "Server Dual-E52630v4 started, initialization
>>>>>>>> type = 1", '\000' <repeats 970 times>
>>>>>>>> log_buf = "Server Dual-E52630v4 started, initialization
>>>>>>>> type = 1", '\000' <repeats 1139 times>...
>>>>>>>> server_name_file_port = 15001
>>>>>>>> fp = 0x51095f0
>>>>>>>> (gdb) info registers
>>>>>>>> rax 0x0 0
>>>>>>>> rbx 0x6 6
>>>>>>>> rcx 0x0 0
>>>>>>>> rdx 0x512f1b0 85127600
>>>>>>>> rsi 0x0 0
>>>>>>>> rdi 0x512f1b0 85127600
>>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>>> r8 0x0 0
>>>>>>>> r9 0x7fffffff57a2 140737488312226
>>>>>>>> r10 0x513c800 85182464
>>>>>>>> r11 0x7ffff61e6128 140737322574120
>>>>>>>> r12 0x4260b0 4350128
>>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>>> r14 0x0 0
>>>>>>>> r15 0x0 0
>>>>>>>> rip 0x461f29 0x461f29 <main(int, char**)+2183>
>>>>>>>> eflags 0x10246 [ PF ZF IF RF ]
>>>>>>>> cs 0x33 51
>>>>>>>> ss 0x2b 43
>>>>>>>> ds 0x0 0
>>>>>>>> es 0x0 0
>>>>>>>> fs 0x0 0
>>>>>>>> gs 0x0 0
>>>>>>>> (gdb) x/16i $pc
>>>>>>>> => 0x461f29 <main(int, char**)+2183>: test %eax,%eax
>>>>>>>> 0x461f2b <main(int, char**)+2185>: setne %al
>>>>>>>> 0x461f2e <main(int, char**)+2188>: test %al,%al
>>>>>>>> 0x461f30 <main(int, char**)+2190>: je 0x461f55 <main(int,
>>>>>>>> char**)+2227>
>>>>>>>> 0x461f32 <main(int, char**)+2192>: mov 0x70efc7(%rip),%rax
>>>>>>>> # 0xb70f00 <msg_daemonname>
>>>>>>>> 0x461f39 <main(int, char**)+2199>: mov $0x51bab2,%edx
>>>>>>>> 0x461f3e <main(int, char**)+2204>: mov %rax,%rsi
>>>>>>>> 0x461f41 <main(int, char**)+2207>: mov $0xffffffff,%edi
>>>>>>>> 0x461f46 <main(int, char**)+2212>: callq 0x425420
>>>>>>>> <***@plt>
>>>>>>>> 0x461f4b <main(int, char**)+2217>: mov $0x3,%edi
>>>>>>>> 0x461f50 <main(int, char**)+2222>: callq 0x425680 <***@plt>
>>>>>>>> 0x461f55 <main(int, char**)+2227>: mov 0x71021d(%rip),%esi
>>>>>>>> # 0xb72178 <pbs_mom_port>
>>>>>>>> 0x461f5b <main(int, char**)+2233>: mov 0x710227(%rip),%ecx
>>>>>>>> # 0xb72188 <pbs_scheduler_port>
>>>>>>>> 0x461f61 <main(int, char**)+2239>: mov 0x710225(%rip),%edx
>>>>>>>> # 0xb7218c <pbs_server_port_dis>
>>>>>>>> 0x461f67 <main(int, char**)+2245>: lea -0x1400(%rbp),%rax
>>>>>>>> 0x461f6e <main(int, char**)+2252>: mov $0xb739c0,%r9d
>>>>>>>> (gdb) thread apply all backtrace
>>>>>>>>
>>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 10004)):
>>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>>> at svr_jobfunc.c:4011
>>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1) at
>>>>>>>> pbsd_init.c:2558
>>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>>> pbsd_init.c:1803
>>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1) at
>>>>>>>> pbsd_init.c:2100
>>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>> pbsd_main.c:1898
>>>>>>>> (gdb) quit
>>>>>>>> A debugging session is active.
>>>>>>>>
>>>>>>>> Inferior 1 [process 10004] will be killed.
>>>>>>>>
>>>>>>>> Quit anyway? (y or n) y
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 2, 2016 at 1:43 AM, David Beer <
>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>
>>>>>>>>> Kazu,
>>>>>>>>>
>>>>>>>>> Thanks for sticking with us on this. You mentioned that pbs_server
>>>>>>>>> did not crash when you submitted the job, but you said that it and
>>>>>>>>> pbs_sched are "unstable." What do you mean by unstable? Will jobs run? You
>>>>>>>>> gdb output looks like a pbs_server that isn't busy, but other than that it
>>>>>>>>> looks normal.
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>> On Tue, Nov 1, 2016 at 1:19 AM, Kazuhiro Fujita <
>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> David,
>>>>>>>>>>
>>>>>>>>>> I tested the 6.0-dev. It passed the "sudo ./torque.setup $USER"
>>>>>>>>>> script,
>>>>>>>>>> but pbs_server and pbs_sched are unstable like 6.1-dev.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Kazu
>>>>>>>>>>
>>>>>>>>>> Before execution of gdb
>>>>>>>>>>
>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>>>>> cd 6.0-dev
>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>> # build and install torque
>>>>>>>>>>> ./configure
>>>>>>>>>>> make
>>>>>>>>>>> sudo make install
>>>>>>>>>>> # Set the correct name of the server
>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>> # configure and start trqauthd
>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>
>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>> sudo qterm
>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>> # set nodes
>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`"
>>>>>>>>>>> | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>>>>> # set the head node
>>>>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>>>>> # configure other daemons
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>> # start torque daemons
>>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Execution of gdb
>>>>>>>>>>
>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Commands executed by another terminal
>>>>>>>>>>
>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The last command did not cause a crash of pbs_server. The
>>>>>>>>>> backtrace is described below.
>>>>>>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>>>>> copying"
>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>>> For help, type "help".
>>>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>>> (gdb) r -D
>>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 5024)]
>>>>>>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 5025)]
>>>>>>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying
>>>>>>>>>> to open tcp connection - connect() failed [rc = -2] [addr =
>>>>>>>>>> 10.0.0.249:15003]
>>>>>>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>>>>>>> hierarchy to host Dual-E52630v4:15003
>>>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 5026)]
>>>>>>>>>> [New Thread 0x7ffff21be700 (LWP 5027)]
>>>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 5028)]
>>>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 5029)]
>>>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5030)]
>>>>>>>>>> [Thread 0x7ffff09bb700 (LWP 5030) exited]
>>>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5031)]
>>>>>>>>>> [New Thread 0x7fffe3fff700 (LWP 5109)]
>>>>>>>>>> [New Thread 0x7fffe37fe700 (LWP 5113)]
>>>>>>>>>> [New Thread 0x7fffe29cf700 (LWP 5121)]
>>>>>>>>>> [Thread 0x7fffe29cf700 (LWP 5121) exited]
>>>>>>>>>> ^C
>>>>>>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>>>>>>> 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>>>>>>> (gdb) backtrace full
>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>> No locals.
>>>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>>>>> state = 3
>>>>>>>>>> waittime = 5
>>>>>>>>>> pjob = 0x313a74
>>>>>>>>>> iter = 0x0
>>>>>>>>>> when = 1477984074
>>>>>>>>>> log = 0
>>>>>>>>>> scheduling = 1
>>>>>>>>>> sched_iteration = 600
>>>>>>>>>> time_now = 1477984190
>>>>>>>>>> update_loglevel = 1477984198
>>>>>>>>>> log_buf = "Server Ready, pid = 5020, loglevel=0", '\000'
>>>>>>>>>> <repeats 140 times>, "c\000\000\000\000\000\000\000
>>>>>>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177",
>>>>>>>>>> '\000' <repeats 26 times>...
>>>>>>>>>> sem_val = 5228929
>>>>>>>>>> __func__ = "main_loop"
>>>>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>> pbsd_main.c:1935
>>>>>>>>>> i = 2
>>>>>>>>>> rc = 0
>>>>>>>>>> local_errno = 0
>>>>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>>>>> '\000' <repeats 983 times>
>>>>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>>>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>>>>>>> server_name_file_port = 15001
>>>>>>>>>> fp = 0x51095f0
>>>>>>>>>> (gdb) info registers
>>>>>>>>>> rax 0xfffffffffffffdfc -516
>>>>>>>>>> rbx 0x5 5
>>>>>>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>>>>>>> rdx 0x0 0
>>>>>>>>>> rsi 0x0 0
>>>>>>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>>>>> r8 0x0 0
>>>>>>>>>> r9 0x4000001 67108865
>>>>>>>>>> r10 0x1 1
>>>>>>>>>> r11 0x293 659
>>>>>>>>>> r12 0x4260b0 4350128
>>>>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>>>>> r14 0x0 0
>>>>>>>>>> r15 0x0 0
>>>>>>>>>> rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
>>>>>>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>>>>>>> cs 0x33 51
>>>>>>>>>> ss 0x2b 43
>>>>>>>>>> ds 0x0 0
>>>>>>>>>> es 0x0 0
>>>>>>>>>> fs 0x0 0
>>>>>>>>>> gs 0x0 0
>>>>>>>>>> (gdb) x/16i $pc
>>>>>>>>>> => 0x461fb6 <main(int, char**)+2388>: callq 0x494762
>>>>>>>>>> <shutdown_ack()>
>>>>>>>>>> 0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>>>>>>> 0x461fc0 <main(int, char**)+2398>: callq 0x4250b0
>>>>>>>>>> <***@plt>
>>>>>>>>>> 0x461fc5 <main(int, char**)+2403>: mov 0x70f55c(%rip),%rdx
>>>>>>>>>> # 0xb71528 <msg_svrdown>
>>>>>>>>>> 0x461fcc <main(int, char**)+2410>: mov 0x70eeed(%rip),%rax
>>>>>>>>>> # 0xb70ec0 <msg_daemonname>
>>>>>>>>>> 0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
>>>>>>>>>> 0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
>>>>>>>>>> 0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
>>>>>>>>>> 0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
>>>>>>>>>> 0x461fe3 <main(int, char**)+2433>: callq 0x425840
>>>>>>>>>> <***@plt>
>>>>>>>>>> 0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
>>>>>>>>>> 0x461fed <main(int, char**)+2443>: callq 0x4269c9
>>>>>>>>>> <acct_close(bool)>
>>>>>>>>>> 0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
>>>>>>>>>> 0x461ff7 <main(int, char**)+2453>: callq 0x425a00
>>>>>>>>>> <***@plt>
>>>>>>>>>> 0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
>>>>>>>>>> 0x462001 <main(int, char**)+2463>: callq 0x424db0
>>>>>>>>>> <***@plt>
>>>>>>>>>> (gdb) thread apply all backtrace
>>>>>>>>>>
>>>>>>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>>>>>>> pthread_create.c:333
>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>
>>>>>>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>>>>>>> pthread_create.c:333
>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>
>>>>>>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>>>>>>> pthread_create.c:333
>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>
>>>>>>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>> #2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at
>>>>>>>>>> req_jobobit.c:3759
>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>>>>>>> pthread_create.c:333
>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>
>>>>>>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>> #2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at
>>>>>>>>>> job_recycler.c:216
>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>>>>>>> pthread_create.c:333
>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>
>>>>>>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>> #2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at
>>>>>>>>>> exiting_jobs.c:319
>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>>>>>>> pthread_create.c:333
>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>
>>>>>>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>> #2 0x000000000046078d in handle_queue_routing_retries (vp=0x0)
>>>>>>>>>> at pbsd_main.c:1079
>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>>>>>>> pthread_create.c:333
>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>
>>>>>>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
>>>>>>>>>> #0 0x00007ffff6ee17bd in accept () at
>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>>>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>>>>>>> process_meth=0x4c4935 <start_process_pbs_server_port(void*)>)
>>>>>>>>>> at ../Libnet/server_core.c:398
>>>>>>>>>> #2 0x00000000004608f3 in start_accept_listener (vp=0x0) at
>>>>>>>>>> pbsd_main.c:1141
>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>>>>>>> pthread_create.c:333
>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>
>>>>>>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>>>>>>> pthread_create.c:333
>>>>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>
>>>>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>> pbsd_main.c:1935
>>>>>>>>>> (gdb) quit
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 28, 2016 at 12:43 PM, Kazuhiro Fujita <
>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you for your comments.
>>>>>>>>>>> I will try the 6.0-dev next week.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kazu
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 28, 2016 at 5:34 AM, David Beer <
>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I wonder if that fix wasn't included in the hotfix. Is there any
>>>>>>>>>>>> chance you can try installing 6.0-dev on your system (via GitHub) to see if
>>>>>>>>>>>> it's resolved? For the record, my Ubuntu 16 system doesn't give me this
>>>>>>>>>>>> error, or I'd try it myself. For whatever reason, none of our test cluster
>>>>>>>>>>>> machines (CentOS & Red Hat 6-7, SLES 11-12) experience this either. We do
>>>>>>>>>>>> have another user who experiences it on a test cluster, but not being able
>>>>>>>>>>>> to reproduce it has made it harder to track down.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> David,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried the 6.0.2.h3. But it seems that another issue
>>>>>>>>>>>>> still remains.
>>>>>>>>>>>>> After I initialized serverdb by "sudo pbs_server -t create",
>>>>>>>>>>>>> pbs_server crashed.
>>>>>>>>>>>>> Then, I used gdb with pbs_server.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>
>>>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>>>>>> This is free software: you are free to change and redistribute
>>>>>>>>>>>>> it.
>>>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type
>>>>>>>>>>>>> "show copying"
>>>>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>>>>>> Find the GDB manual and other documentation resources online
>>>>>>>>>>>>> at:
>>>>>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>>>>>> For help, type "help".
>>>>>>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>>>>>> (gdb) r -D
>>>>>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>>>>>>>>>>>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>>>>>>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>>>>>>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>>>>>>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>>>>>>>>>>>>> [New Thread 0x7ffff21be700 (LWP 25594)]
>>>>>>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>>>>>>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation
>>>>>>>>>>>>> fault.
>>>>>>>>>>>>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>>>>> __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such
>>>>>>>>>>>>> file or directory.
>>>>>>>>>>>>> (gdb) bt
>>>>>>>>>>>>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>>>>> #1 0x00000000004ac076 in dispatch_timed_task
>>>>>>>>>>>>> (ptask=0x5727660) at svr_task.c:318
>>>>>>>>>>>>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at
>>>>>>>>>>>>> pbsd_main.c:921
>>>>>>>>>>>>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at
>>>>>>>>>>>>> u_threadpool.c:318
>>>>>>>>>>>>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700) at
>>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>>> #5 0x00007ffff6165b5d in clone () at
>>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> David and Rick,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you for the quick response. I will try it later.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Actually, Rick just sent me the link. You can download it
>>>>>>>>>>>>>>> from here: http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I can confirm that this bug is fixed in 6.0-dev, and we've
>>>>>>>>>>>>>>>> made a hotfix for it, 6.0.2.h3. This was caused by a change in the
>>>>>>>>>>>>>>>> implementation of the pthread library, so most users will not see this
>>>>>>>>>>>>>>>> crash, but it appears that if you have a newer version of that library,
>>>>>>>>>>>>>>>> then you will get it. Rick is going to send instructions for how to grab
>>>>>>>>>>>>>>>> 6.0.2.h3.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you David for the comment on the backtrace.
>>>>>>>>>>>>>>>>> I haven't noticed that until writing this mail.
>>>>>>>>>>>>>>>>> So, I produced the backtrace as described in the Ubuntu wiki.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I also attached the backtrace of pbs_server (Torque
>>>>>>>>>>>>>>>>> 6.1-dev) by gdb.
>>>>>>>>>>>>>>>>> As I mentioned before, the torque.setup script executed
>>>>>>>>>>>>>>>>> successfully, but the daemons are unstable.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Before using gdb, I used following commands.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git
>>>>>>>>>>>>>>>>>> -b 6.1-dev 6.1-dev
>>>>>>>>>>>>>>>>>> cd 6.1-dev
>>>>>>>>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee
>>>>>>>>>>>>>>>>>> /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>>>> # set as services
>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd
>>>>>>>>>>>>>>>>>> /etc/init.d/trqauthd
>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor |
>>>>>>>>>>>>>>>>>> wc -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes # (changed
>>>>>>>>>>>>>>>>>> np)
>>>>>>>>>>>>>>>>>> sudo qterm -t quick
>>>>>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> trqauthd was not stopped by the last command, so I stopped
>>>>>>>>>>>>>>>>> it by killing the trqauthd process.
>>>>>>>>>>>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server 2>&1 | tee
>>>>>>>>>>>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In another terminal, I executed the following commands
>>>>>>>>>>>>>>>>> before pbs_server crashed.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The output of the last command is "0.torque-server".
>>>>>>>>>>>>>>>>> And this command crashed the pbs_server in gdb.
>>>>>>>>>>>>>>>>> Then, I made the backtrace.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> David,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2) by
>>>>>>>>>>>>>>>>>> gdb.
>>>>>>>>>>>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I started pbs_server with gdb,
>>>>>>>>>>>>>>>>>> and executed qmgr from another terminal. (see below)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Kaz
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Kazu,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Can you give us a backtrace for this crash? We have
>>>>>>>>>>>>>>>>>>> fixed some issues on startup (around mutex management for newer pthread
>>>>>>>>>>>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>>>>>>>>>>>> seeing is fixed.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Dear All,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS
>>>>>>>>>>>>>>>>>>>> with dual E5-2630 v3 chips.
>>>>>>>>>>>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>>>>>>>>>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>>>>>>>>>>>>>>> And I tried to set up Torque on them, but I got stuck
>>>>>>>>>>>>>>>>>>>> at the initial setup script.
>>>>>>>>>>>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server in
>>>>>>>>>>>>>>>>>>>> the initial setup script (torque.setup). (see below)
>>>>>>>>>>>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>>>>>>>>>>>> Have you ever observed this kind of error?
>>>>>>>>>>>>>>>>>>>> And if you know of possible solutions, please tell me.
>>>>>>>>>>>>>>>>>>>> Any comments will be highly appreciated.
>>>>>>>>>>>>>>>>>>>> Would it be better to change the OS to another
>>>>>>>>>>>>>>>>>>>> distribution, such as Scientific Linux?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$ sudo
>>>>>>>>>>>>>>>>>>>>> ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is:
>>>>>>>>>>>>>>>>>>>>> 15001
>>>>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>>>>> initializing TORQUE (admin:
>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server)
>>>>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00
>>>>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$ ps aux | grep pbs
>>>>>>>>>>>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+
>>>>>>>>>>>>>>>>>>>>> 12:22 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The pbs_server -t create process was not found in the ps output.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port is:
>>>>>>>>>>>>>>>>>>>>> 15001
>>>>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>>>>> initializing TORQUE (admin:
>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server)
>>>>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00
>>>>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+
>>>>>>>>>>>>>>>>>>>>> 16:11 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The pbs_server -t create process was not found in the ps output.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Commands used for installation before the setup script
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> # set up as services
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd
>>>>>>>>>>>>>>>>>>>>> /etc/init.d/trqauthd
>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom
>>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>>> torqueusers mailing list
>>>>>>>>>>>>>>>>>>>> ***@supercluster.org
>>>>>>>>>>>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>>>>>>>> Adaptive Computing


--
David Beer | Torque Architect
Adaptive Computing
Kazuhiro Fujita
2016-11-30 06:12:51 UTC
Permalink
David,

I attached the backtrace below.

Before starting gdb, I stopped and restarted the daemons:
sudo service pbs_mom stop
sudo service pbs_sched stop
sudo service pbs_server stop
sudo service trqauthd stop
sudo service trqauthd start
sudo gdb /usr/local/sbin/pbs_server

Then,
(gdb) r -D

In another terminal I executed the following commands;
the last one (echo "sleep 30" | qsub) caused the crash I
reported before.

$sudo service pbs_sched start
$sudo service pbs_mom start
$ps aux | grep pbs
root 36957 0.0 0.0 55808 4164 pts/8 S 14:53 0:00 sudo gdb
/usr/local/sbin/pbs_server
root 36958 0.7 0.0 109464 63648 pts/8 S 14:53 0:00 gdb
/usr/local/sbin/pbs_server
root 36960 0.0 0.0 473936 24768 pts/8 Sl+ 14:53 0:00
/usr/local/sbin/pbs_server -D
root 37079 0.0 0.0 37996 4940 ? Ss 14:54 0:00
/usr/local/sbin/pbs_sched
root 37116 0.0 0.1 115892 76900 ? RLsl 14:54 0:00
/usr/local/sbin/pbs_mom
comp_ad+ 37118 0.0 0.0 15236 976 pts/9 S+ 14:54 0:00 grep
--color=auto pbs
$ps aux | grep trq
root 36956 0.0 0.0 29052 2332 ? S 14:52 0:00
/usr/local/sbin/trqauthd
comp_ad+ 37135 0.0 0.0 15236 1032 pts/9 S+ 14:54 0:00 grep
--color=auto trq
$ pbsnodes -a
$ echo "sleep 30" | qsub

The output of gdb is shown below.

Best,
Kazu


$ sudo gdb /usr/local/sbin/pbs_server
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html
>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/sbin/pbs_server...done.
(gdb) r -D
Starting program: /usr/local/sbin/pbs_server -D
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff39c1700 (LWP 36964)]
pbs_server is up (version - 6.0, port - 15001)
PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp
connection - connect() failed [rc = -2] [addr = 10.0.0.249:15003]
PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom hierarchy to
host Dual-E52630v4:15003
[New Thread 0x7ffff31c0700 (LWP 36965)]
[New Thread 0x7ffff29bf700 (LWP 36966)]
[New Thread 0x7ffff21be700 (LWP 36967)]
[New Thread 0x7ffff19bd700 (LWP 36968)]
[New Thread 0x7ffff11bc700 (LWP 36969)]
[New Thread 0x7ffff09bb700 (LWP 36970)]
[Thread 0x7ffff09bb700 (LWP 36970) exited]
[New Thread 0x7ffff09bb700 (LWP 36971)]
[New Thread 0x7fffe3fff700 (LWP 37132)]
[New Thread 0x7fffe37fe700 (LWP 37133)]
[New Thread 0x7fffe2ffd700 (LWP 37145)]
[New Thread 0x7fffe21ce700 (LWP 37150)]
[Thread 0x7fffe21ce700 (LWP 37150) exited]
Assertion failed, bad pointer in link: file "req_select.c", line 401

Thread 10 "pbs_server" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe3fff700 (LWP 37132)]
__lll_unlock_elision (lock=0x51118d0, private=0) at
../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
directory.
(gdb) backtrace full
#0 __lll_unlock_elision (lock=0x51118d0, private=0) at
../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
No locals.
#1 0x0000000000465e0f in unlock_queue (the_queue=0x512ced0, id=0x522704
"req_selectjobs", msg=0x0, logging=0) at queue_func.c:189
rc = 0
err_msg = 0x0
stub_msg = "no pos"
__func__ = "unlock_queue"
#2 0x000000000049384a in req_selectjobs (preq=0x7fffdc081bd0) at
req_select.c:347
bad = 1
cntl = 0x7fffdc000930
plist = 0x7fffdc001880
pque = 0x512ced0
rc = 0
log_buf = '\000' <repeats 184 times>,
"\b\205P\367\377\177\000\000\000\000\000\000\000\000\000\000"...
selistp = 0x0
#3 0x00000000004652f4 in dispatch_request (sfds=9, request=0x7fffdc081bd0)
at process_request.c:899
rc = 0
log_buf = "***@Dual-E52630v4\000\066\063\060v4", '\000' <repeats
3424 times>...
__func__ = "dispatch_request"
#4 0x0000000000464e8f in process_request (chan=0x7fffdc0008c0) at
process_request.c:702
rc = 0
request = 0x7fffdc081bd0
state = 3
time_now = 1480485385
auth_err = 0x0
conn_socktype = 2
conn_authen = 1
sfds = 9
#5 0x00000000004c4805 in process_pbs_server_port (sock=9,
is_scheduler_port=0, args=0x7fffe40008e0) at incoming_request.c:162
protocol_type = 2
rc = 0
log_buf = '\000' <repeats 3992 times>...
chan = 0x7fffdc0008c0
__func__ = "process_pbs_server_port"
#6 0x00000000004c4ac9 in start_process_pbs_server_port
(new_sock=0x7fffe40008e0) at incoming_request.c:270
args = 0x7fffe40008e0
sock = 9
rc = 0
#7 0x00000000004fc495 in work_thread (a=0x5110710) at u_threadpool.c:318
__clframe = {__cancel_routine = 0x4fc071 <work_cleanup(void*)>,
__cancel_arg = 0x5110710, __do_it = 1, __cancel_type = 0}
__clframe = {__cancel_routine = 0x4fbf64
<work_thread_cleanup(void*)>, __cancel_arg = 0x5110710, __do_it = 1,
__cancel_type = 0}
tp = 0x5110710
rc = 0
func = 0x4c4a4d <start_process_pbs_server_port(void*)>
arg = 0x7fffe40008e0
mywork = 0x7fffe4000b80
working = {next = 0x0, working_id = 140737018590976}
ts = {tv_sec = 0, tv_nsec = 0}
__func__ = "work_thread"
#8 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
pthread_create.c:333
__res = <optimized out>
pd = 0x7fffe3fff700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737018590976,
-786842131623855334, 0, 140737272078415, 140737018591680, 0,
786815742786911002, 786861764219616026},
mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
---Type <return> to continue, or q <return> to quit---
__PRETTY_FUNCTION__ = "start_thread"
#9 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
(gdb) info registers
rax 0x0 0
rbx 0x7fffe3fff700 140737018590976
rcx 0x0 0
rdx 0x51118d0 85006544
rsi 0x0 0
rdi 0x51118d0 85006544
rbp 0x0 0x0
rsp 0x7fffe3ffefc0 0x7fffe3ffefc0
r8 0x0 0
r9 0x1 1
r10 0x7fffdc0c8295 140736885195413
r11 0x0 0
r12 0x0 0
r13 0x7ffff31be04f 140737272078415
r14 0x7fffe3fff9c0 140737018591680
r15 0x0 0
rip 0x7ffff616582d 0x7ffff616582d <clone+109>
eflags 0x10246 [ PF ZF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb) x/16i $pc
=> 0x7ffff616582d <clone+109>: mov %rax,%rdi
0x7ffff6165830 <clone+112>: callq 0x7ffff612ab60 <__GI__exit>
0x7ffff6165835 <clone+117>: mov 0x2bc63c(%rip),%rcx #
0x7ffff6421e78
0x7ffff616583c <clone+124>: neg %eax
0x7ffff616583e <clone+126>: mov %eax,%fs:(%rcx)
0x7ffff6165841 <clone+129>: or $0xffffffffffffffff,%rax
0x7ffff6165845 <clone+133>: retq
0x7ffff6165846: nopw %cs:0x0(%rax,%rax,1)
0x7ffff6165850 <lseek64>: mov $0x8,%eax
0x7ffff6165855 <lseek64+5>: syscall
0x7ffff6165857 <lseek64+7>: cmp $0xfffffffffffff001,%rax
0x7ffff616585d <lseek64+13>: jae 0x7ffff6165860 <lseek64+16>
0x7ffff616585f <lseek64+15>: retq
0x7ffff6165860 <lseek64+16>: mov 0x2bc611(%rip),%rcx #
0x7ffff6421e78
0x7ffff6165867 <lseek64+23>: neg %eax
0x7ffff6165869 <lseek64+25>: mov %eax,%fs:(%rcx)
(gdb) thread apply all backtrace

Thread 12 (Thread 0x7fffe2ffd700 (LWP 37145)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00000000004fc2b4 in work_thread (a=0x5110710) at u_threadpool.c:272
#2 0x00007ffff6ed870a in start_thread (arg=0x7fffe2ffd700) at
pthread_create.c:333
#3 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 11 (Thread 0x7fffe37fe700 (LWP 37133)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
#2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
pthread_create.c:333
#3 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 10 (Thread 0x7fffe3fff700 (LWP 37132)):
#0 __lll_unlock_elision (lock=0x51118d0, private=0) at
../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1 0x0000000000465e0f in unlock_queue (the_queue=0x512ced0, id=0x522704
"req_selectjobs", msg=0x0, logging=0) at queue_func.c:189
#2 0x000000000049384a in req_selectjobs (preq=0x7fffdc081bd0) at
req_select.c:347
#3 0x00000000004652f4 in dispatch_request (sfds=9, request=0x7fffdc081bd0)
at process_request.c:899
#4 0x0000000000464e8f in process_request (chan=0x7fffdc0008c0) at
process_request.c:702
#5 0x00000000004c4805 in process_pbs_server_port (sock=9,
is_scheduler_port=0, args=0x7fffe40008e0) at incoming_request.c:162
#6 0x00000000004c4ac9 in start_process_pbs_server_port
(new_sock=0x7fffe40008e0) at incoming_request.c:270
#7 0x00000000004fc495 in work_thread (a=0x5110710) at u_threadpool.c:318
#8 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
pthread_create.c:333
#9 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 9 (Thread 0x7ffff09bb700 (LWP 36971)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
#2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
pthread_create.c:333
#3 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 7 (Thread 0x7ffff11bc700 (LWP 36969)):
#0 0x00007ffff612a75d in nanosleep () at
../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff612a6aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2 0x0000000000476913 in remove_completed_jobs (vp=0x0) at
req_jobobit.c:3759
#3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
pthread_create.c:333
#4 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 6 (Thread 0x7ffff19bd700 (LWP 36968)):
#0 0x00007ffff612a75d in nanosleep () at
../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff612a6aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2 0x00000000004afb93 in remove_extra_recycle_jobs (vp=0x0) at
job_recycler.c:216
#3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
pthread_create.c:333
#4 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 5 (Thread 0x7ffff21be700 (LWP 36967)):
#0 0x00007ffff612a75d in nanosleep () at
../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff612a6aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2 0x00000000004bc853 in inspect_exiting_jobs (vp=0x0) at
exiting_jobs.c:319
#3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
pthread_create.c:333
#4 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 4 (Thread 0x7ffff29bf700 (LWP 36966)):
#0 0x00007ffff612a75d in nanosleep () at
../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff612a6aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2 0x0000000000460769 in handle_queue_routing_retries (vp=0x0) at
pbsd_main.c:1079
#3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
pthread_create.c:333
#4 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

---Type <return> to continue, or q <return> to quit---
Thread 3 (Thread 0x7ffff31c0700 (LWP 36965)):
#0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff750a276 in start_listener_addrinfo (host_name=0x7ffff31bfaf0
"Dual-E52630v4", server_port=15001, process_meth=0x4c4a4d
<start_process_pbs_server_port(void*)>)
at ../Libnet/server_core.c:398
#2 0x00000000004608cf in start_accept_listener (vp=0x0) at pbsd_main.c:1141
#3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
pthread_create.c:333
#4 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7ffff39c1700 (LWP 36964)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
#2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
pthread_create.c:333
#3 0x00007ffff616582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7ffff7fd5740 (LWP 36960)):
#0 0x00007ffff612a75d in nanosleep () at
../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
../sysdeps/posix/usleep.c:32
#2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
#3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
pbsd_main.c:1935
(gdb) quit
A debugging session is active.

Inferior 1 [process 36960] will be killed.

Quit anyway? (y or n) y




On Tue, Nov 29, 2016 at 8:53 AM, David Beer <***@adaptivecomputing.com>
wrote:

> Kazu,
>
> I'm shocked you're seeing so many issues. Can you send a backtrace? These
> logs don't show anything sinister.
>
> On Wed, Nov 23, 2016 at 9:52 PM, Kazuhiro Fujita <
> ***@gmail.com> wrote:
>
>> David,
>>
>> I reinstalled the Torque 6.0-dev branch without updating it from GitHub.
>> This time I could restart all Torque daemons,
>> but a qsub command crashed pbs_server and pbs_sched.
>> I attached the log files in this mail.
>>
>> Best,
>> Kazu
>>
>> Before the crash:
>>
>>> # build and install torque
>>> ./configure
>>> make
>>> sudo make install
>>> # Set a correct host name of the server
>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>> # configure and start trqauthd
>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>> sudo update-rc.d trqauthd defaults
>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>> sudo ldconfig
>>> sudo service trqauthd start
>>> # Initialize serverdb by executing the torque.setup script
>>> sudo ./torque.setup $USER
>>> sudo qmgr -c "p s"
>>> # stop pbs_server and trqauthd daemons for setting nodes.
>>> sudo qterm
>>> sudo service trqauthd stop
>>> ps aux | grep pbs
>>> ps aux | grep trq
>>> # set nodes
>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
>>> tee /var/spool/torque/server_priv/nodes
>>> sudo nano /var/spool/torque/server_priv/nodes
>>> # set the head node
>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
>>> # configure other torque daemons
>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>> sudo update-rc.d pbs_server defaults
>>> sudo update-rc.d pbs_sched defaults
>>> sudo update-rc.d pbs_mom defaults
>>> # restart torque daemons
>>> sudo service trqauthd start
>>> sudo service pbs_server start
>>> ps aux | grep pbs
>>> ps aux | grep trq
>>> sudo service pbs_sched start
>>> sudo service pbs_mom start
>>> ps aux | grep pbs
>>> ps aux | grep trq
>>> # check the configuration of computation nodes
>>> pbsnodes -a
>>
>>
>> $ ps aux | grep trq
>> root 19130 0.0 0.0 109112 3756 ? S 13:25 0:00
>> /usr/local/sbin/trqauthd
>> comp_ad+ 19293 0.0 0.0 15236 1020 pts/8 S+ 13:28 0:00 grep
>> --color=auto trq
>> $ ps aux | grep pbs
>> root 19175 0.0 0.0 695136 23640 ? Sl 13:26 0:00
>> /usr/local/sbin/pbs_server
>> root 19224 0.0 0.0 37996 4936 ? Ss 13:27 0:00
>> /usr/local/sbin/pbs_sched
>> root 19265 0.1 0.2 173776 136692 ? SLsl 13:27 0:00
>> /usr/local/sbin/pbs_mom
>> comp_ad+ 19295 0.0 0.0 15236 924 pts/8 S+ 13:28 0:00 grep
>> --color=auto pbs
>>
>> A subsequent qsub command crashed pbs_server and pbs_sched.
>>
>> $ echo "sleep 30" | qsub
>> 0.Dual-E52630v4
>> $ ps aux | grep trq
>> root 19130 0.0 0.0 109112 4268 ? S 13:25 0:00
>> /usr/local/sbin/trqauthd
>> comp_ad+ 19309 0.0 0.0 15236 1020 pts/8 S+ 13:28 0:00 grep
>> --color=auto trq
>> $ ps aux | grep pbs
>> root 19265 0.1 0.2 173776 136688 ? SLsl 13:27 0:00
>> /usr/local/sbin/pbs_mom
>> comp_ad+ 19311 0.0 0.0 15236 1016 pts/8 S+ 13:28 0:00 grep
>> --color=auto pbs
>>
>>
>>
>>
>> On Fri, Nov 18, 2016 at 4:21 AM, David Beer <***@adaptivecomputing.com>
>> wrote:
>>
>>> Kazu,
>>>
>>> Did you look at the server logs?
>>>
>>> On Wed, Nov 16, 2016 at 12:24 AM, Kazuhiro Fujita <
>>> ***@gmail.com> wrote:
>>>
>>>> David,
>>>>
>>>> I could not find the pbs_server process after executing the commands
>>>> shown below.
>>>>
>>>> sudo service trqauthd start
>>>>> sudo service pbs_server start
>>>>
>>>>
>>>> I am not sure what it did.
>>>>
>>>> Best,
>>>> Kazu
>>>>
>>>>
>>>> On Wed, Nov 16, 2016 at 8:10 AM, David Beer <
>>>> ***@adaptivecomputing.com> wrote:
>>>>
>>>>> Kazu,
>>>>>
>>>>> What did it do when it failed to start?
>>>>>
>>>>> On Wed, Nov 9, 2016 at 9:33 PM, Kazuhiro Fujita <
>>>>> ***@gmail.com> wrote:
>>>>>
>>>>>> David,
>>>>>>
>>>>>> In the last mail I sent, I had reinstalled 6.0-dev on the wrong server, as
>>>>>> you can see in the output (E5-2630v3).
>>>>>> On an E5-2630v4 server, pbs_server failed to restart as a daemon after
>>>>>> "./torque.setup $USER".
>>>>>>
>>>>>> Before crash:
>>>>>>
>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>> 6.0-dev 6.0-dev
>>>>>>> cd 6.0-dev
>>>>>>> ./autogen.sh
>>>>>>> # build and install torque
>>>>>>> ./configure
>>>>>>> make
>>>>>>> sudo make install
>>>>>>> # Set the correct name of the server
>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>> # configure and start trqauthd
>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>> sudo ldconfig
>>>>>>> sudo service trqauthd start
>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>> sudo ./torque.setup $USER
>>>>>>> sudo qmgr -c 'p s'
>>>>>>> sudo qterm
>>>>>>> sudo service trqauthd stop
>>>>>>> ps aux | grep pbs
>>>>>>> ps aux | grep trq
>>>>>>> # set nodes
>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>> # set the head node
>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>> # configure other daemons
>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>> # restart torque daemons
>>>>>>> sudo service trqauthd start
>>>>>>> sudo service pbs_server start
>>>>>>
>>>>>>
>>>>>> pbs_server then did not start, so I ran it under gdb.
>>>>>> Under gdb, pbs_server did not crash even after qsub and qstat from
>>>>>> another terminal, so I stopped it with Ctrl+C.
>>>>>>
>>>>>> Best,
>>>>>> Kazu
>>>>>>
>>>>>> gdb output
>>>>>>
>>>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>> copying"
>>>>>>> and "show warranty" for details.
>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>> Type "show configuration" for configuration details.
>>>>>>> For bug reporting instructions, please see:
>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>> For help, type "help".
>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>> (gdb) r -D
>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>> ad_db.so.1".
>>>>>>> [New Thread 0x7ffff39c1700 (LWP 35864)]
>>>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to
>>>>>>> open tcp connection - connect() failed [rc = -2] [addr =
>>>>>>> 10.0.0.249:15003]
>>>>>>> [New Thread 0x7ffff31c0700 (LWP 35865)]
>>>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>>>> hierarchy to host Dual-E52630v4:15003
>>>>>>> [New Thread 0x7ffff29bf700 (LWP 35866)]
>>>>>>> [New Thread 0x7ffff21be700 (LWP 35867)]
>>>>>>> [New Thread 0x7ffff19bd700 (LWP 35868)]
>>>>>>> [New Thread 0x7ffff11bc700 (LWP 35869)]
>>>>>>> [New Thread 0x7ffff09bb700 (LWP 35870)]
>>>>>>> [Thread 0x7ffff09bb700 (LWP 35870) exited]
>>>>>>> [New Thread 0x7ffff09bb700 (LWP 35871)]
>>>>>>> [New Thread 0x7fffe3fff700 (LWP 36003)]
>>>>>>> [New Thread 0x7fffe37fe700 (LWP 36004)]
>>>>>>> [New Thread 0x7fffe2ffd700 (LWP 36011)]
>>>>>>> [New Thread 0x7fffe21ce700 (LWP 36016)]
>>>>>>> [Thread 0x7fffe21ce700 (LWP 36016) exited]
>>>>>>> ^C
>>>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>>>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>>>>> te.S:84
>>>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>>>> (gdb) bt
>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>> pbsd_main.c:1935
>>>>>>> (gdb) backtrace full
>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>> No locals.
>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>>>> state = 3
>>>>>>> waittime = 5
>>>>>>> pjob = 0x313a74
>>>>>>> iter = 0x0
>>>>>>> when = 1478748888
>>>>>>> log = 0
>>>>>>> scheduling = 1
>>>>>>> sched_iteration = 600
>>>>>>> time_now = 1478748970
>>>>>>> update_loglevel = 1478748979
>>>>>>> log_buf = "Server Ready, pid = 35860, loglevel=0", '\000'
>>>>>>> <repeats 139 times>, "c\000\000\000\000\000\000\000
>>>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>>>>>> <repeats 26 times>...
>>>>>>> sem_val = 5229209
>>>>>>> __func__ = "main_loop"
>>>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>> pbsd_main.c:1935
>>>>>>> i = 2
>>>>>>> rc = 0
>>>>>>> local_errno = 0
>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>> '\000' <repeats 983 times>
>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>>>> server_name_file_port = 15001
>>>>>>> fp = 0x51095f0
>>>>>>> (gdb) info registers
>>>>>>> rax 0xfffffffffffffdfc -516
>>>>>>> rbx 0x6 6
>>>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>>>> rdx 0x0 0
>>>>>>> rsi 0x0 0
>>>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>> r8 0x0 0
>>>>>>> r9 0x4000001 67108865
>>>>>>> r10 0x1 1
>>>>>>> r11 0x293 659
>>>>>>> r12 0x4260b0 4350128
>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>> r14 0x0 0
>>>>>>> r15 0x0 0
>>>>>>> rip 0x461f92 0x461f92 <main(int, char**)+2388>
>>>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>>>> cs 0x33 51
>>>>>>> ss 0x2b 43
>>>>>>> ds 0x0 0
>>>>>>> es 0x0 0
>>>>>>> fs 0x0 0
>>>>>>> gs 0x0 0
>>>>>>> (gdb) x/16i $pc
>>>>>>> => 0x461f92 <main(int, char**)+2388>: callq 0x49484c
>>>>>>> <shutdown_ack()>
>>>>>>> 0x461f97 <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>>>> 0x461f9c <main(int, char**)+2398>: callq 0x4250b0 <***@plt
>>>>>>> >
>>>>>>> 0x461fa1 <main(int, char**)+2403>: mov 0x70f5c0(%rip),%rdx
>>>>>>> # 0xb71568 <msg_svrdown>
>>>>>>> 0x461fa8 <main(int, char**)+2410>: mov 0x70ef51(%rip),%rax
>>>>>>> # 0xb70f00 <msg_daemonname>
>>>>>>> 0x461faf <main(int, char**)+2417>: mov %rdx,%rcx
>>>>>>> 0x461fb2 <main(int, char**)+2420>: mov %rax,%rdx
>>>>>>> 0x461fb5 <main(int, char**)+2423>: mov $0x1,%esi
>>>>>>> 0x461fba <main(int, char**)+2428>: mov $0x8002,%edi
>>>>>>> 0x461fbf <main(int, char**)+2433>: callq 0x425840
>>>>>>> <***@plt>
>>>>>>> 0x461fc4 <main(int, char**)+2438>: mov $0x0,%edi
>>>>>>> 0x461fc9 <main(int, char**)+2443>: callq 0x4269c9
>>>>>>> <acct_close(bool)>
>>>>>>> 0x461fce <main(int, char**)+2448>: mov $0xb6ce00,%edi
>>>>>>> 0x461fd3 <main(int, char**)+2453>: callq 0x425a00
>>>>>>> <***@plt>
>>>>>>> 0x461fd8 <main(int, char**)+2458>: mov $0x1,%edi
>>>>>>> 0x461fdd <main(int, char**)+2463>: callq 0x424db0
>>>>>>> <***@plt>
>>>>>>> (gdb) thread apply all backtrace
>>>>>>> Thread 12 (Thread 0x7fffe2ffd700 (LWP 36011)):
>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at
>>>>>>> u_threadpool.c:272
>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe2ffd700) at
>>>>>>> pthread_create.c:333
>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 36004)):
>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>>>> u_threadpool.c:272
>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>>>> pthread_create.c:333
>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 36003)):
>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at
>>>>>>> u_threadpool.c:272
>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>>>> pthread_create.c:333
>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 35871)):
>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>>>> u_threadpool.c:272
>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>>>> pthread_create.c:333
>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 35869)):
>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>> #2 0x0000000000476913 in remove_completed_jobs (vp=0x0) at
>>>>>>> req_jobobit.c:3759
>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>>>> pthread_create.c:333
>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 35868)):
>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>> #2 0x00000000004afb93 in remove_extra_recycle_jobs (vp=0x0) at
>>>>>>> job_recycler.c:216
>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>>>> pthread_create.c:333
>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 35867)):
>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>> #2 0x00000000004bc853 in inspect_exiting_jobs (vp=0x0) at
>>>>>>> exiting_jobs.c:319
>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>>>> pthread_create.c:333
>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 35866)):
>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>> #2 0x0000000000460769 in handle_queue_routing_retries (vp=0x0) at
>>>>>>> pbsd_main.c:1079
>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>>>> pthread_create.c:333
>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 35865)):
>>>>>>> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-templa
>>>>>>> te.S:84
>>>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>>>> process_meth=0x4c4a4d <start_process_pbs_server_port(void*)>)
>>>>>>> at ../Libnet/server_core.c:398
>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>> #2 0x00000000004608cf in start_accept_listener (vp=0x0) at
>>>>>>> pbsd_main.c:1141
>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>>>> pthread_create.c:333
>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 35864)):
>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>>>> u_threadpool.c:272
>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>>>> pthread_create.c:333
>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>> _64/clone.S:109
>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 35860)):
>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>> pbsd_main.c:1935
>>>>>>> (gdb) quit
>>>>>>> A debugging session is active.
>>>>>>> Inferior 1 [process 35860] will be killed.
>>>>>>> Quit anyway? (y or n) y
>>>>>>
>>>>>>
>>>>>>
>>>>>> Commands executed from another terminal after starting pbs_server
>>>>>> under gdb (r -D)
>>>>>>
>>>>>>> $ sudo service pbs_sched start
>>>>>>> $ sudo service pbs_mom start
>>>>>>> $ pbsnodes -a
>>>>>>> Dual-E52630v4
>>>>>>> state = free
>>>>>>> power_state = Running
>>>>>>> np = 4
>>>>>>> ntype = cluster
>>>>>>> status = rectime=1478748911,macaddr=34:
>>>>>>> 97:f6:5d:09:a6,cpuclock=Fixed,varattr=,jobs=,state=free,netl
>>>>>>> oad=322618417,gres=,loadave=0.06,ncpus=40,physmem=65857216kb
>>>>>>> ,availmem=131970532kb,totmem=132849340kb,idletime=108,nusers=4,nsessions=17,sessions=1036
>>>>>>> 1316 1327 1332 1420 1421 1422 1423 1424 1425 1426 1430 1471 1510 27075
>>>>>>> 27130 35902,uname=Linux Dual-E52630v4 4.4.0-45-generic #66-Ubuntu SMP Wed
>>>>>>> Oct 19 14:12:37 UTC 2016 x86_64,opsys=linux
>>>>>>> mom_service_port = 15002
>>>>>>> mom_manager_port = 15003
>>>>>>> $ echo "sleep 30" | qsub
>>>>>>> 0.Dual-E52630v4
>>>>>>> $ qstat
>>>>>>> Job ID Name User Time Use
>>>>>>> S Queue
>>>>>>> ------------------------- ---------------- --------------- --------
>>>>>>> - -----
>>>>>>> 0.Dual-E52630v4 STDIN comp_admin 0
>>>>>>> Q batch
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 10, 2016 at 12:01 PM, Kazuhiro Fujita <
>>>>>> ***@gmail.com> wrote:
>>>>>>
>>>>>>> David,
>>>>>>>
>>>>>>> Now, it works. Thank you.
>>>>>>> However, jobs are executed in LIFO order, as I also observed on an
>>>>>>> E5-2630v3 server...
>>>>>>> Below is the output of 'qstat -t' after running 'echo "sleep 10" |
>>>>>>> qsub -t 1-10' three times.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kazu
>>>>>>>
>>>>>>> $ qstat -t
>>>>>>> Job ID Name User Time Use
>>>>>>> S Queue
>>>>>>> ------------------------- ---------------- --------------- --------
>>>>>>> - -----
>>>>>>> 0.Dual-E5-2630v3 STDIN comp_admin 00:00:00
>>>>>>> C batch
>>>>>>> 1[1].Dual-E5-2630v3 STDIN-1 comp_admin 0
>>>>>>> Q batch
>>>>>>> 1[2].Dual-E5-2630v3 STDIN-2 comp_admin 0
>>>>>>> Q batch
>>>>>>> 1[3].Dual-E5-2630v3 STDIN-3 comp_admin 0
>>>>>>> Q batch
>>>>>>> 1[4].Dual-E5-2630v3 STDIN-4 comp_admin 0
>>>>>>> Q batch
>>>>>>> 1[5].Dual-E5-2630v3 STDIN-5 comp_admin 0
>>>>>>> Q batch
>>>>>>> 1[6].Dual-E5-2630v3 STDIN-6 comp_admin 0
>>>>>>> Q batch
>>>>>>> 1[7].Dual-E5-2630v3 STDIN-7 comp_admin 00:00:00
>>>>>>> C batch
>>>>>>> 1[8].Dual-E5-2630v3 STDIN-8 comp_admin 00:00:00
>>>>>>> C batch
>>>>>>> 1[9].Dual-E5-2630v3 STDIN-9 comp_admin 00:00:00
>>>>>>> C batch
>>>>>>> 1[10].Dual-E5-2630v3 STDIN-10 comp_admin 00:00:00
>>>>>>> C batch
>>>>>>> 2[1].Dual-E5-2630v3 STDIN-1 comp_admin 0
>>>>>>> Q batch
>>>>>>> 2[2].Dual-E5-2630v3 STDIN-2 comp_admin 0
>>>>>>> Q batch
>>>>>>> 2[3].Dual-E5-2630v3 STDIN-3 comp_admin 0
>>>>>>> Q batch
>>>>>>> 2[4].Dual-E5-2630v3 STDIN-4 comp_admin 0
>>>>>>> Q batch
>>>>>>> 2[5].Dual-E5-2630v3 STDIN-5 comp_admin 0
>>>>>>> Q batch
>>>>>>> 2[6].Dual-E5-2630v3 STDIN-6 comp_admin 0
>>>>>>> Q batch
>>>>>>> 2[7].Dual-E5-2630v3 STDIN-7 comp_admin 0
>>>>>>> Q batch
>>>>>>> 2[8].Dual-E5-2630v3 STDIN-8 comp_admin 0
>>>>>>> Q batch
>>>>>>> 2[9].Dual-E5-2630v3 STDIN-9 comp_admin 0
>>>>>>> Q batch
>>>>>>> 2[10].Dual-E5-2630v3 STDIN-10 comp_admin 0
>>>>>>> Q batch
>>>>>>> 3[1].Dual-E5-2630v3 STDIN-1 comp_admin 0
>>>>>>> Q batch
>>>>>>> 3[2].Dual-E5-2630v3 STDIN-2 comp_admin 0
>>>>>>> Q batch
>>>>>>> 3[3].Dual-E5-2630v3 STDIN-3 comp_admin 0
>>>>>>> Q batch
>>>>>>> 3[4].Dual-E5-2630v3 STDIN-4 comp_admin 0
>>>>>>> Q batch
>>>>>>> 3[5].Dual-E5-2630v3 STDIN-5 comp_admin 0
>>>>>>> Q batch
>>>>>>> 3[6].Dual-E5-2630v3 STDIN-6 comp_admin 0
>>>>>>> Q batch
>>>>>>> 3[7].Dual-E5-2630v3 STDIN-7 comp_admin 0
>>>>>>> R batch
>>>>>>> 3[8].Dual-E5-2630v3 STDIN-8 comp_admin 0
>>>>>>> R batch
>>>>>>> 3[9].Dual-E5-2630v3 STDIN-9 comp_admin 0
>>>>>>> R batch
>>>>>>> 3[10].Dual-E5-2630v3 STDIN-10 comp_admin 0
>>>>>>> R batch
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 10, 2016 at 3:07 AM, David Beer <
>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>
>>>>>>>> Kazu,
>>>>>>>>
>>>>>>>> I was able to get a system to reproduce this error. I have now
>>>>>>>> checked in another fix, and I can no longer reproduce this. Can you pull
>>>>>>>> the latest and let me know if it fixes it for you?
>>>>>>>>
>>>>>>>> On Tue, Nov 8, 2016 at 2:06 AM, Kazuhiro Fujita <
>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi David,
>>>>>>>>>
>>>>>>>>> I reinstalled 6.0-dev from GitHub today and observed slightly
>>>>>>>>> different behavior, I think.
>>>>>>>>> I used the "service" command to start daemons this time.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Kazu
>>>>>>>>>
>>>>>>>>> Before the crash
>>>>>>>>>
>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>>>> cd 6.0-dev
>>>>>>>>>> ./autogen.sh
>>>>>>>>>> # build and install torque
>>>>>>>>>> ./configure
>>>>>>>>>> make
>>>>>>>>>> sudo make install
>>>>>>>>>> # Set the correct name of the server
>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>> # configure and start trqauthd
>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>> sudo ldconfig
>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>> sudo qterm
>>>>>>>>>> sudo service trqauthd stop
>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>> ps aux | grep trq
>>>>>>>>>> # set nodes
>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`"
>>>>>>>>>> | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>>>> # set the head node
>>>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>>>> # configure the other daemons
>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>> # start torque daemons
>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>> sudo service pbs_server start
>>>>>>>>>> sudo service pbs_sched start
>>>>>>>>>> sudo service pbs_mom start
>>>>>>>>>> # check the configuration of the compute nodes
>>>>>>>>>> pbsnodes -a
>>>>>>>>>
>>>>>>>>>
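[Editor's note: the nodes file written above is a single line of the form "<hostname> np=<cpu count>". A small sketch of composing that line without touching the system, using `nproc` (which counts the same logical CPUs as grepping /proc/cpuinfo):]

```shell
# Sketch: compose the one line TORQUE expects in
# /var/spool/torque/server_priv/nodes ("<hostname> np=<cpu count>").
host=$(hostname)
np=$(nproc)
line="$host np=$np"
echo "$line"
# To install it for real (requires root):
#   echo "$line" | sudo tee /var/spool/torque/server_priv/nodes
```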
>>>>>>>>> I checked the TORQUE processes with "ps aux | grep pbs" and "ps aux
>>>>>>>>> | grep trq" several times.
>>>>>>>>> After "pbsnodes -a", everything seems OK.
>>>>>>>>> But the next qsub command seems to trigger a crash of "pbs_server"
>>>>>>>>> and "pbs_sched".
>>>>>>>>>
>>>>>>>>> $ ps aux | grep trq
>>>>>>>>>> root 9682 0.0 0.0 109112 3632 ? S 17:39 0:00
>>>>>>>>>> /usr/local/sbin/trqauthd
>>>>>>>>>> comp_ad+ 9842 0.0 0.0 15236 936 pts/8 S+ 17:40 0:00
>>>>>>>>>> grep --color=auto trq
>>>>>>>>>> $ ps aux | grep pbs
>>>>>>>>>> root 9720 0.0 0.0 695140 25760 ? Sl 17:39 0:00
>>>>>>>>>> /usr/local/sbin/pbs_server
>>>>>>>>>> root 9771 0.0 0.0 37996 4940 ? Ss 17:39 0:00
>>>>>>>>>> /usr/local/sbin/pbs_sched
>>>>>>>>>> root 9814 0.2 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>>>>>> /usr/local/sbin/pbs_mom
>>>>>>>>>> comp_ad+ 9844 0.0 0.0 15236 1012 pts/8 S+ 17:40 0:00
>>>>>>>>>> grep --color=auto pbs
>>>>>>>>>> $ echo "sleep 30" | qsub
>>>>>>>>>> 0.Dual-E52630v4
>>>>>>>>>> $ ps aux | grep pbs
>>>>>>>>>> root 9814 0.1 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>>>>>> /usr/local/sbin/pbs_mom
>>>>>>>>>> comp_ad+ 9855 0.0 0.0 15236 928 pts/8 S+ 17:41 0:00
>>>>>>>>>> grep --color=auto pbs
>>>>>>>>>> $ ps aux | grep trq
>>>>>>>>>> root 9682 0.0 0.0 109112 4144 ? S 17:39 0:00
>>>>>>>>>> /usr/local/sbin/trqauthd
>>>>>>>>>> comp_ad+ 9860 0.0 0.0 15236 1092 pts/8 S+ 17:41 0:00
>>>>>>>>>> grep --color=auto trq
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Then I stopped the remaining processes,
>>>>>>>>>
>>>>>>>>> sudo service pbs_mom stop
>>>>>>>>>> sudo service trqauthd stop
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> and started "trqauthd" again, then ran "pbs_server" under gdb.
>>>>>>>>> "pbs_server" crashed in gdb without any other commands.
>>>>>>>>>
>>>>>>>>> sudo service trqauthd start
>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>>>> copying"
>>>>>>>>> and "show warranty" for details.
>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>> For help, type "help".
>>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>> (gdb) r -D
>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>>> ad_db.so.1".
>>>>>>>>>
>>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>>> __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file
>>>>>>>>> or directory.
>>>>>>>>> (gdb) bt
>>>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>>>> at svr_jobfunc.c:4011
>>>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1)
>>>>>>>>> at pbsd_init.c:2558
>>>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>>>> pbsd_init.c:1803
>>>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1)
>>>>>>>>> at pbsd_init.c:2100
>>>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>> pbsd_main.c:1898
>>>>>>>>> (gdb) backtrace full
>>>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>> No locals.
>>>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>>>> at svr_jobfunc.c:4011
>>>>>>>>> rc = 0
>>>>>>>>> err_msg = 0x0
>>>>>>>>> stub_msg = "no pos"
>>>>>>>>> __func__ = "unlock_ji_mutex"
>>>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>>>> pattrjb = 0x7fffffff4a10
>>>>>>>>> pdef = 0x4
>>>>>>>>> pque = 0x0
>>>>>>>>> rc = 0
>>>>>>>>> log_buf = '\000' <repeats 24 times>,
>>>>>>>>> "\030\000\000\000\060\000\000\000PU\377\377\377\177\000\000\220T\377\377\377\177",
>>>>>>>>> '\000' <repeats 50 times>, "\003\000\000\000\000\000\000\
>>>>>>>>> 000#\000\000\000\000\000\000\000pO\377\377\377\177", '\000'
>>>>>>>>> <repeats 26 times>, "\221\260\000\000\000\200\377\
>>>>>>>>> 377oO\377\377\377\177\000\000H+B\366\377\177\000\000p+B\366\
>>>>>>>>> 377\177\000\000\200O\377\377\377\177\000\000\201\260\000\000
>>>>>>>>> \000\200\377\377\177O\377\377\377\177", '\000' <repeats 18
>>>>>>>>> times>...
>>>>>>>>> time_now = 1478594788
>>>>>>>>> job_id = "0.Dual-E52630v4\000\000\000\0
>>>>>>>>> 00\000\000\000\000\000\362\377\377\377\377\377\377\377\340J\
>>>>>>>>> 377\377\377\177\000\000\060L\377\377\377\177\000\000\001\000
>>>>>>>>> \000\000\000\000\000\000\244\201\000\000\001\000\000\000\030
>>>>>>>>> \354\377\367\377\177\000\***@L\377\377\377\177\000\000\000\0
>>>>>>>>> 00\000\000\005\000\000\220\r\000\000\000\000\000\000\000k\02
>>>>>>>>> 2j\365\377\177\000\000\031J\377\377\377\177\000\000\201n\376
>>>>>>>>> \017\000\000\000\000\\\216!X\000\000\000\000_#\343+\000\000\
>>>>>>>>> 000\000\\\216!X\000\000\000\000\207\065],", '\000' <repeats 36
>>>>>>>>> times>, "k\022j\365\377\177\000\000\30
>>>>>>>>> 0K\377\377\377\177\000\000\000\000\000\000\000\000\000\000"...
>>>>>>>>> queue_name = "batch\000\377\377\240\340\377
>>>>>>>>> \367\377\177\000"
>>>>>>>>> total_jobs = 0
>>>>>>>>> user_jobs = 0
>>>>>>>>> array_jobs = 0
>>>>>>>>> __func__ = "svr_enquejob"
>>>>>>>>> que_mgr = {unlock_on_exit = 160, locked = 75, mutex_valid
>>>>>>>>> = 255, managed_mutex = 0x7ffff7ddccda <open_path+474>}
>>>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>>>> newstate = 0
>>>>>>>>> newsubstate = 0
>>>>>>>>> rc = 0
>>>>>>>>> log_buf = "pbsd_init_reque:1", '\000' <repeats 1063
>>>>>>>>> times>...
>>>>>>>>> __func__ = "pbsd_init_reque"
>>>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1)
>>>>>>>>> at pbsd_init.c:2558
>>>>>>>>> d = 0
>>>>>>>>> rc = 0
>>>>>>>>> time_now = 1478594788
>>>>>>>>> log_buf = '\000' <repeats 2112 times>...
>>>>>>>>> local_errno = 0
>>>>>>>>> job_id = '\000' <repeats 1016 times>...
>>>>>>>>> job_atr_hold = 0
>>>>>>>>> job_exit_status = 0
>>>>>>>>> __func__ = "pbsd_init_job"
>>>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>>>> pbsd_init.c:1803
>>>>>>>>> pjob = 0x512d880
>>>>>>>>> Index = 0
>>>>>>>>> JobArray_iter = {first = "0.Dual-E52630v4", second = }
>>>>>>>>> log_buf = "14 total files read from
>>>>>>>>> disk\000\000\000\000\000\000\000\001\000\000\000\320\316\022
>>>>>>>>> \005\000\000\000\000\220N\022\005", '\000' <repeats 12 times>,
>>>>>>>>> "Expected 1, recovered 1 queues", '\000' <repeats 1330 times>...
>>>>>>>>> rc = 0
>>>>>>>>> job_rc = 0
>>>>>>>>> logtype = 0
>>>>>>>>> pdirent = 0x0
>>>>>>>>> pdirent_sub = 0x0
>>>>>>>>> dir = 0x5124e90
>>>>>>>>> dir_sub = 0x0
>>>>>>>>> had = 0
>>>>>>>>> pjob = 0x0
>>>>>>>>> time_now = 1478594788
>>>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>>>> basen = '\000' <repeats 1088 times>...
>>>>>>>>> use_jobs_subdirs = 0
>>>>>>>>> __func__ = "handle_job_recovery"
>>>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1)
>>>>>>>>> at pbsd_init.c:2100
>>>>>>>>> rc = 0
>>>>>>>>> tmp_rc = 1974134615
>>>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>>>> ret = 0
>>>>>>>>> gid = 0
>>>>>>>>> log_buf = "pbsd_init:1", '\000' <repeats 997 times>...
>>>>>>>>> __func__ = "pbsd_init"
>>>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>> pbsd_main.c:1898
>>>>>>>>> i = 2
>>>>>>>>> rc = 0
>>>>>>>>> local_errno = 0
>>>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>>>> '\000' <repeats 983 times>
>>>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>>>> tmpLine = "Server Dual-E52630v4 started, initialization
>>>>>>>>> type = 1", '\000' <repeats 970 times>
>>>>>>>>> log_buf = "Server Dual-E52630v4 started, initialization
>>>>>>>>> type = 1", '\000' <repeats 1139 times>...
>>>>>>>>> server_name_file_port = 15001
>>>>>>>>> fp = 0x51095f0
>>>>>>>>> (gdb) info registers
>>>>>>>>> rax 0x0 0
>>>>>>>>> rbx 0x6 6
>>>>>>>>> rcx 0x0 0
>>>>>>>>> rdx 0x512f1b0 85127600
>>>>>>>>> rsi 0x0 0
>>>>>>>>> rdi 0x512f1b0 85127600
>>>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>>>> r8 0x0 0
>>>>>>>>> r9 0x7fffffff57a2 140737488312226
>>>>>>>>> r10 0x513c800 85182464
>>>>>>>>> r11 0x7ffff61e6128 140737322574120
>>>>>>>>> r12 0x4260b0 4350128
>>>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>>>> r14 0x0 0
>>>>>>>>> r15 0x0 0
>>>>>>>>> rip 0x461f29 0x461f29 <main(int, char**)+2183>
>>>>>>>>> eflags 0x10246 [ PF ZF IF RF ]
>>>>>>>>> cs 0x33 51
>>>>>>>>> ss 0x2b 43
>>>>>>>>> ds 0x0 0
>>>>>>>>> es 0x0 0
>>>>>>>>> fs 0x0 0
>>>>>>>>> gs 0x0 0
>>>>>>>>> (gdb) x/16i $pc
>>>>>>>>> => 0x461f29 <main(int, char**)+2183>: test %eax,%eax
>>>>>>>>> 0x461f2b <main(int, char**)+2185>: setne %al
>>>>>>>>> 0x461f2e <main(int, char**)+2188>: test %al,%al
>>>>>>>>> 0x461f30 <main(int, char**)+2190>: je 0x461f55 <main(int,
>>>>>>>>> char**)+2227>
>>>>>>>>> 0x461f32 <main(int, char**)+2192>: mov 0x70efc7(%rip),%rax
>>>>>>>>> # 0xb70f00 <msg_daemonname>
>>>>>>>>> 0x461f39 <main(int, char**)+2199>: mov $0x51bab2,%edx
>>>>>>>>> 0x461f3e <main(int, char**)+2204>: mov %rax,%rsi
>>>>>>>>> 0x461f41 <main(int, char**)+2207>: mov $0xffffffff,%edi
>>>>>>>>> 0x461f46 <main(int, char**)+2212>: callq 0x425420
>>>>>>>>> <***@plt>
>>>>>>>>> 0x461f4b <main(int, char**)+2217>: mov $0x3,%edi
>>>>>>>>> 0x461f50 <main(int, char**)+2222>: callq 0x425680 <***@plt>
>>>>>>>>> 0x461f55 <main(int, char**)+2227>: mov 0x71021d(%rip),%esi
>>>>>>>>> # 0xb72178 <pbs_mom_port>
>>>>>>>>> 0x461f5b <main(int, char**)+2233>: mov 0x710227(%rip),%ecx
>>>>>>>>> # 0xb72188 <pbs_scheduler_port>
>>>>>>>>> 0x461f61 <main(int, char**)+2239>: mov 0x710225(%rip),%edx
>>>>>>>>> # 0xb7218c <pbs_server_port_dis>
>>>>>>>>> 0x461f67 <main(int, char**)+2245>: lea -0x1400(%rbp),%rax
>>>>>>>>> 0x461f6e <main(int, char**)+2252>: mov $0xb739c0,%r9d
>>>>>>>>> (gdb) thread apply all backtrace
>>>>>>>>>
>>>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 10004)):
>>>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>>>> at svr_jobfunc.c:4011
>>>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1)
>>>>>>>>> at pbsd_init.c:2558
>>>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>>>> pbsd_init.c:1803
>>>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1)
>>>>>>>>> at pbsd_init.c:2100
>>>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>> pbsd_main.c:1898
>>>>>>>>> (gdb) quit
>>>>>>>>> A debugging session is active.
>>>>>>>>>
>>>>>>>>> Inferior 1 [process 10004] will be killed.
>>>>>>>>>
>>>>>>>>> Quit anyway? (y or n) y
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 2, 2016 at 1:43 AM, David Beer <
>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>
>>>>>>>>>> Kazu,
>>>>>>>>>>
>>>>>>>>>> Thanks for sticking with us on this. You mentioned that
>>>>>>>>>> pbs_server did not crash when you submitted the job, but you said that it
>>>>>>>>>> and pbs_sched are "unstable." What do you mean by unstable? Will jobs run?
>>>>>>>>>> Your gdb output looks like a pbs_server that isn't busy, but other than that
>>>>>>>>>> it looks normal.
>>>>>>>>>>
>>>>>>>>>> David
>>>>>>>>>>
>>>>>>>>>> On Tue, Nov 1, 2016 at 1:19 AM, Kazuhiro Fujita <
>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> David,
>>>>>>>>>>>
>>>>>>>>>>> I tested the 6.0-dev. It passed the "sudo ./torque.setup $USER"
>>>>>>>>>>> script,
>>>>>>>>>>> but pbs_server and pbs_sched are unstable like 6.1-dev.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kazu
>>>>>>>>>>>
>>>>>>>>>>> Before execution of gdb
>>>>>>>>>>>
>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>>>>>> cd 6.0-dev
>>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>> ./configure
>>>>>>>>>>>> make
>>>>>>>>>>>> sudo make install
>>>>>>>>>>>> # Set the correct name of the server
>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>> # configure and start trqauthd
>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>
>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>> sudo qterm
>>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>>> # set nodes
>>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc
>>>>>>>>>>>> -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>>>>>> # set the head node
>>>>>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>>>>>> # configure the other daemons
>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>> # start torque daemons
>>>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Execution of gdb
>>>>>>>>>>>
>>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Commands executed from another terminal
>>>>>>>>>>>
>>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The last command did not cause a crash of pbs_server. The
>>>>>>>>>>> backtrace is described below.
>>>>>>>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>>>> This is free software: you are free to change and redistribute
>>>>>>>>>>> it.
>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type
>>>>>>>>>>> "show copying"
>>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>>>> For help, type "help".
>>>>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>>>> (gdb) r -D
>>>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>>>>> ad_db.so.1".
>>>>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 5024)]
>>>>>>>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 5025)]
>>>>>>>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying
>>>>>>>>>>> to open tcp connection - connect() failed [rc = -2] [addr =
>>>>>>>>>>> 10.0.0.249:15003]
>>>>>>>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>>>>>>>> hierarchy to host Dual-E52630v4:15003
>>>>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 5026)]
>>>>>>>>>>> [New Thread 0x7ffff21be700 (LWP 5027)]
>>>>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 5028)]
>>>>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 5029)]
>>>>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5030)]
>>>>>>>>>>> [Thread 0x7ffff09bb700 (LWP 5030) exited]
>>>>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5031)]
>>>>>>>>>>> [New Thread 0x7fffe3fff700 (LWP 5109)]
>>>>>>>>>>> [New Thread 0x7fffe37fe700 (LWP 5113)]
>>>>>>>>>>> [New Thread 0x7fffe29cf700 (LWP 5121)]
>>>>>>>>>>> [Thread 0x7fffe29cf700 (LWP 5121) exited]
>>>>>>>>>>> ^C
>>>>>>>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>>>>>>>> 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or
>>>>>>>>>>> directory.
>>>>>>>>>>> (gdb) backtrace full
>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>> No locals.
>>>>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>>>>>> state = 3
>>>>>>>>>>> waittime = 5
>>>>>>>>>>> pjob = 0x313a74
>>>>>>>>>>> iter = 0x0
>>>>>>>>>>> when = 1477984074
>>>>>>>>>>> log = 0
>>>>>>>>>>> scheduling = 1
>>>>>>>>>>> sched_iteration = 600
>>>>>>>>>>> time_now = 1477984190
>>>>>>>>>>> update_loglevel = 1477984198
>>>>>>>>>>> log_buf = "Server Ready, pid = 5020, loglevel=0", '\000'
>>>>>>>>>>> <repeats 140 times>, "c\000\000\000\000\000\000\000
>>>>>>>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177",
>>>>>>>>>>> '\000' <repeats 26 times>...
>>>>>>>>>>> sem_val = 5228929
>>>>>>>>>>> __func__ = "main_loop"
>>>>>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>>> pbsd_main.c:1935
>>>>>>>>>>> i = 2
>>>>>>>>>>> rc = 0
>>>>>>>>>>> local_errno = 0
>>>>>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>>>>>> '\000' <repeats 983 times>
>>>>>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>>>>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>>>>>>>> server_name_file_port = 15001
>>>>>>>>>>> fp = 0x51095f0
>>>>>>>>>>> (gdb) info registers
>>>>>>>>>>> rax 0xfffffffffffffdfc -516
>>>>>>>>>>> rbx 0x5 5
>>>>>>>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>>>>>>>> rdx 0x0 0
>>>>>>>>>>> rsi 0x0 0
>>>>>>>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>>>>>> r8 0x0 0
>>>>>>>>>>> r9 0x4000001 67108865
>>>>>>>>>>> r10 0x1 1
>>>>>>>>>>> r11 0x293 659
>>>>>>>>>>> r12 0x4260b0 4350128
>>>>>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>>>>>> r14 0x0 0
>>>>>>>>>>> r15 0x0 0
>>>>>>>>>>> rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
>>>>>>>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>>>>>>>> cs 0x33 51
>>>>>>>>>>> ss 0x2b 43
>>>>>>>>>>> ds 0x0 0
>>>>>>>>>>> es 0x0 0
>>>>>>>>>>> fs 0x0 0
>>>>>>>>>>> gs 0x0 0
>>>>>>>>>>> (gdb) x/16i $pc
>>>>>>>>>>> => 0x461fb6 <main(int, char**)+2388>: callq 0x494762
>>>>>>>>>>> <shutdown_ack()>
>>>>>>>>>>> 0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>>>>>>>> 0x461fc0 <main(int, char**)+2398>: callq 0x4250b0
>>>>>>>>>>> <***@plt>
>>>>>>>>>>> 0x461fc5 <main(int, char**)+2403>: mov
>>>>>>>>>>> 0x70f55c(%rip),%rdx # 0xb71528 <msg_svrdown>
>>>>>>>>>>> 0x461fcc <main(int, char**)+2410>: mov
>>>>>>>>>>> 0x70eeed(%rip),%rax # 0xb70ec0 <msg_daemonname>
>>>>>>>>>>> 0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
>>>>>>>>>>> 0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
>>>>>>>>>>> 0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
>>>>>>>>>>> 0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
>>>>>>>>>>> 0x461fe3 <main(int, char**)+2433>: callq 0x425840
>>>>>>>>>>> <***@plt>
>>>>>>>>>>> 0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
>>>>>>>>>>> 0x461fed <main(int, char**)+2443>: callq 0x4269c9
>>>>>>>>>>> <acct_close(bool)>
>>>>>>>>>>> 0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
>>>>>>>>>>> 0x461ff7 <main(int, char**)+2453>: callq 0x425a00
>>>>>>>>>>> <***@plt>
>>>>>>>>>>> 0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
>>>>>>>>>>> 0x462001 <main(int, char**)+2463>: callq 0x424db0
>>>>>>>>>>> <***@plt>
>>>>>>>>>>> (gdb) thread apply all backtrace
>>>>>>>>>>>
>>>>>>>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
>>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>
>>>>>>>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
>>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>
>>>>>>>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
>>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>
>>>>>>>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>>> #2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at
>>>>>>>>>>> req_jobobit.c:3759
>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>
>>>>>>>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>>> #2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at
>>>>>>>>>>> job_recycler.c:216
>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>
>>>>>>>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>>> #2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at
>>>>>>>>>>> exiting_jobs.c:319
>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>
>>>>>>>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>>> #2 0x000000000046078d in handle_queue_routing_retries (vp=0x0)
>>>>>>>>>>> at pbsd_main.c:1079
>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>
>>>>>>>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
>>>>>>>>>>> #0 0x00007ffff6ee17bd in accept () at
>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>>>>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>>>>>>>> process_meth=0x4c4935 <start_process_pbs_server_port(void*)>)
>>>>>>>>>>> at ../Libnet/server_core.c:398
>>>>>>>>>>> #2 0x00000000004608f3 in start_accept_listener (vp=0x0) at
>>>>>>>>>>> pbsd_main.c:1141
>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>
>>>>>>>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
>>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>
>>>>>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>>> pbsd_main.c:1935
>>>>>>>>>>> (gdb) quit
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 28, 2016 at 12:43 PM, Kazuhiro Fujita <
>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thank you for your comments.
>>>>>>>>>>>> I will try the 6.0-dev next week.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Kazu
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 28, 2016 at 5:34 AM, David Beer <
>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I wonder if that fix wasn't included in the hotfix. Is there any
>>>>>>>>>>>>> chance you can try installing 6.0-dev on your system (via GitHub) to see
>>>>>>>>>>>>> if it's resolved? For the record, my Ubuntu 16 system doesn't give me this
>>>>>>>>>>>>> error, or I'd try it myself. For whatever reason, none of our test cluster
>>>>>>>>>>>>> machines (CentOS & Red Hat 6-7, SLES 11-12) experience this either. We did
>>>>>>>>>>>>> have another user who experienced it on a test cluster, but not being able
>>>>>>>>>>>>> to reproduce it has made it harder to track down.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> David,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried the 6.0.2.h3. But it seems that the other issue
>>>>>>>>>>>>>> still remains.
>>>>>>>>>>>>>> After I initialized serverdb by "sudo pbs_server -t create",
>>>>>>>>>>>>>> pbs_server crashed.
>>>>>>>>>>>>>> Then, I used gdb with pbs_server.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>>>>>>> This is free software: you are free to change and
>>>>>>>>>>>>>> redistribute it.
>>>>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type
>>>>>>>>>>>>>> "show copying"
>>>>>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>>>>>>> Find the GDB manual and other documentation resources online
>>>>>>>>>>>>>> at:
>>>>>>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>>>>>>> For help, type "help".
>>>>>>>>>>>>>> Type "apropos word" to search for commands related to
>>>>>>>>>>>>>> "word"...
>>>>>>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>>>>>>> (gdb) r -D
>>>>>>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>>>>>>>>>>>>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>>>>>>>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>>>>>>>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>>>>>>>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>>>>>>>>>>>>>> [New Thread 0x7ffff21be700 (LWP 25594)]
>>>>>>>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>>>>>>>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation
>>>>>>>>>>>>>> fault.
>>>>>>>>>>>>>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>>>>>> __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such
>>>>>>>>>>>>>> file or directory.
>>>>>>>>>>>>>> (gdb) bt
>>>>>>>>>>>>>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>>>>>> #1 0x00000000004ac076 in dispatch_timed_task
>>>>>>>>>>>>>> (ptask=0x5727660) at svr_task.c:318
>>>>>>>>>>>>>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at
>>>>>>>>>>>>>> pbsd_main.c:921
>>>>>>>>>>>>>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at
>>>>>>>>>>>>>> u_threadpool.c:318
>>>>>>>>>>>>>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700)
>>>>>>>>>>>>>> at pthread_create.c:333
>>>>>>>>>>>>>> #5 0x00007ffff6165b5d in clone () at
>>>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> David and Rick,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for the quick response. I will try it later.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Actually, Rick just sent me the link. You can download it
>>>>>>>>>>>>>>>> from here: http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can confirm that this bug is fixed in 6.0-dev, and we've
>>>>>>>>>>>>>>>>> made a hotfix for it, 6.0.2.h3. It was caused by a change in the
>>>>>>>>>>>>>>>>> pthread library implementation, so most users will not see this
>>>>>>>>>>>>>>>>> crash, but if you have a newer version of that library, then you
>>>>>>>>>>>>>>>>> will. Rick is going to send instructions for how to grab 6.0.2.h3.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you David for the comment on the backtrace.
>>>>>>>>>>>>>>>>>> I haven't noticed that until writing this mail.
>>>>>>>>>>>>>>>>>> So, I used backtrace as written in the Ubuntu wiki.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I also attached the backtrace of pbs_server (Torque
>>>>>>>>>>>>>>>>>> 6.1-dev) by gdb.
>>>>>>>>>>>>>>>>>> As I mentioned before, the torque.setup script was
>>>>>>>>>>>>>>>>>> executed successfully, but the server is unstable.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Before using gdb, I used following commands.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.1-dev 6.1-dev
>>>>>>>>>>>>>>>>>>> cd 6.1-dev
>>>>>>>>>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee
>>>>>>>>>>>>>>>>>>> /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>>>>> # set as services
>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd
>>>>>>>>>>>>>>>>>>> /etc/init.d/trqauthd
>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor |
>>>>>>>>>>>>>>>>>>> wc -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes #
>>>>>>>>>>>>>>>>>>> (changed np)
>>>>>>>>>>>>>>>>>>> sudo qterm -t quick
>>>>>>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> trqauthd was not stopped by the last command, so I stopped
>>>>>>>>>>>>>>>>>> it by killing the trqauthd process.
>>>>>>>>>>>>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee
>>>>>>>>>>>>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In another terminal, I executed the following commands
>>>>>>>>>>>>>>>>>> before pbs_server was crashed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The output of the last command is "0.torque-server".
>>>>>>>>>>>>>>>>>> And this command crashed the pbs_server in gdb.
>>>>>>>>>>>>>>>>>> Then, I made the backtrace.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> David,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2) by
>>>>>>>>>>>>>>>>>>> gdb.
>>>>>>>>>>>>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I started pbs_server with gdb,
>>>>>>>>>>>>>>>>>>> and execute qmgr from another terminal. (see below)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server'.
>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Kaz
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Kazu,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Can you give us a backtrace for this crash? We have
>>>>>>>>>>>>>>>>>>>> fixed some issues on startup (around mutex management for newer pthread
>>>>>>>>>>>>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>>>>>>>>>>>>> seeing is fixed.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Dear All,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS
>>>>>>>>>>>>>>>>>>>>> with dual E5-2630 v3 chips.
>>>>>>>>>>>>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips, and
>>>>>>>>>>>>>>>>>>>>> installed Ubuntu 16.04 LTS on them.
>>>>>>>>>>>>>>>>>>>>> And I tried to set up Torque on them, but I got stuck
>>>>>>>>>>>>>>>>>>>>> with the initial setup script.
>>>>>>>>>>>>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server in
>>>>>>>>>>>>>>>>>>>>> the initial setup script (torque.setup). (see below)
>>>>>>>>>>>>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>>>>>>>>>>>>> Have you ever observed this kind of errors?
>>>>>>>>>>>>>>>>>>>>> And if you know possible solutions, please tell me.
>>>>>>>>>>>>>>>>>>>>> Any comments will be highly appreciated.
>>>>>>>>>>>>>>>>>>>>> Would it be better to change the OS to other
>>>>>>>>>>>>>>>>>>>>> distribution, such as Scientific Linux?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thank you in Advance,
>>>>>>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$ sudo
>>>>>>>>>>>>>>>>>>>>>> ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port
>>>>>>>>>>>>>>>>>>>>>> is: 15001
>>>>>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>>>>>> initializing TORQUE (admin:
>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server)
>>>>>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00
>>>>>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>>> set server operators += torque-server-***@torque-server
>>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$ ps aux | grep pbs
>>>>>>>>>>>>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+
>>>>>>>>>>>>>>>>>>>>>> 12:22 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port
>>>>>>>>>>>>>>>>>>>>>> is: 15001
>>>>>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>>>>>> initializing TORQUE (admin:
>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server)
>>>>>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00
>>>>>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+
>>>>>>>>>>>>>>>>>>>>>> 16:11 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> pbs_server -t create was not found.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Commands used for installation before the setup script
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee
>>>>>>>>>>>>>>>>>>>>>> /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> # set up as services
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd
>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/trqauthd
>>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom
>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>>>> torqueusers mailing list
>>>>>>>>>>>>>>>>>>>>> ***@supercluster.org
>>>>>>>>>>>>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>>>>>>>>> Adaptive Computing
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>>>>>> Adaptive Computing
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>>>>> Adaptive Computing
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>> Adaptive Computing
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>> Adaptive Computing
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> David Beer | Torque Architect
>>>>>>>> Adaptive Computing
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> David Beer | Torque Architect
>>>>> Adaptive Computing
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> David Beer | Torque Architect
>>> Adaptive Computing
>>>
>>>
>>>
>>
>>
>>
>
>
> --
> David Beer | Torque Architect
> Adaptive Computing
>
>
>
Kazuhiro Fujita
2016-12-16 05:43:20 UTC
Permalink
Dear David,

Thanks.
I checked the latest version of the 6.0-dev branch,
and it works except for the LIFO job-scheduling behavior.

Best,
Kazu


On Wed, Nov 30, 2016 at 3:12 PM, Kazuhiro Fujita <***@gmail.com>
wrote:

> David,
>
> I attached the backtrace below.
>
> Before starting gdb:
> sudo service pbs_mom stop
> sudo service pbs_sched stop
> sudo service pbs_server stop
> sudo service trqauthd stop
> sudo service trqauthd start
> sudo gdb /usr/local/sbin/pbs_server
>
> Then,
> (gdb) r -D
>
> In another terminal I executed the following commands; the last one
> (echo "sleep 30" | qsub) caused the crash I reported before.
>
> $sudo service pbs_sched start
> $sudo service pbs_mom start
> $ps aux | grep pbs
> root 36957 0.0 0.0 55808 4164 pts/8 S 14:53 0:00 sudo gdb
> /usr/local/sbin/pbs_server
> root 36958 0.7 0.0 109464 63648 pts/8 S 14:53 0:00 gdb
> /usr/local/sbin/pbs_server
> root 36960 0.0 0.0 473936 24768 pts/8 Sl+ 14:53 0:00
> /usr/local/sbin/pbs_server -D
> root 37079 0.0 0.0 37996 4940 ? Ss 14:54 0:00
> /usr/local/sbin/pbs_sched
> root 37116 0.0 0.1 115892 76900 ? RLsl 14:54 0:00
> /usr/local/sbin/pbs_mom
> comp_ad+ 37118 0.0 0.0 15236 976 pts/9 S+ 14:54 0:00 grep
> --color=auto pbs
> $ps aux | grep trq
> root 36956 0.0 0.0 29052 2332 ? S 14:52 0:00
> /usr/local/sbin/trqauthd
> comp_ad+ 37135 0.0 0.0 15236 1032 pts/9 S+ 14:54 0:00 grep
> --color=auto trq
> $ pbsnodes -a
> $ echo "sleep 30" | qsub
>
> The output of gdb is shown below.
>
> Best,
> Kazu
>
>
> $ sudo gdb /usr/local/sbin/pbs_server
> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
> Copyright (C) 2016 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.
> html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/local/sbin/pbs_server...done.
> (gdb) r -D
> Starting program: /usr/local/sbin/pbs_server -D
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> [New Thread 0x7ffff39c1700 (LWP 36964)]
> pbs_server is up (version - 6.0, port - 15001)
> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open
> tcp connection - connect() failed [rc = -2] [addr = 10.0.0.249:15003]
> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom hierarchy
> to host Dual-E52630v4:15003
> [New Thread 0x7ffff31c0700 (LWP 36965)]
> [New Thread 0x7ffff29bf700 (LWP 36966)]
> [New Thread 0x7ffff21be700 (LWP 36967)]
> [New Thread 0x7ffff19bd700 (LWP 36968)]
> [New Thread 0x7ffff11bc700 (LWP 36969)]
> [New Thread 0x7ffff09bb700 (LWP 36970)]
> [Thread 0x7ffff09bb700 (LWP 36970) exited]
> [New Thread 0x7ffff09bb700 (LWP 36971)]
> [New Thread 0x7fffe3fff700 (LWP 37132)]
> [New Thread 0x7fffe37fe700 (LWP 37133)]
> [New Thread 0x7fffe2ffd700 (LWP 37145)]
> [New Thread 0x7fffe21ce700 (LWP 37150)]
> [Thread 0x7fffe21ce700 (LWP 37150) exited]
> Assertion failed, bad pointer in link: file "req_select.c", line 401
>
> Thread 10 "pbs_server" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffe3fff700 (LWP 37132)]
> __lll_unlock_elision (lock=0x51118d0, private=0) at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
> directory.
> (gdb) backtrace full
> #0 __lll_unlock_elision (lock=0x51118d0, private=0) at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> No locals.
> #1 0x0000000000465e0f in unlock_queue (the_queue=0x512ced0, id=0x522704
> "req_selectjobs", msg=0x0, logging=0) at queue_func.c:189
> rc = 0
> err_msg = 0x0
> stub_msg = "no pos"
> __func__ = "unlock_queue"
> #2 0x000000000049384a in req_selectjobs (preq=0x7fffdc081bd0) at
> req_select.c:347
> bad = 1
> cntl = 0x7fffdc000930
> plist = 0x7fffdc001880
> pque = 0x512ced0
> rc = 0
> log_buf = '\000' <repeats 184 times>, "\b\205P\367\377\177\000\000\
> 000\000\000\000\000\000\000\000"...
> selistp = 0x0
> #3 0x00000000004652f4 in dispatch_request (sfds=9,
> request=0x7fffdc081bd0) at process_request.c:899
> rc = 0
> log_buf = "***@Dual-E52630v4\000\066\063\060v4", '\000' <repeats
> 3424 times>...
> __func__ = "dispatch_request"
> #4 0x0000000000464e8f in process_request (chan=0x7fffdc0008c0) at
> process_request.c:702
> rc = 0
> request = 0x7fffdc081bd0
> state = 3
> time_now = 1480485385
> auth_err = 0x0
> conn_socktype = 2
> conn_authen = 1
> sfds = 9
> #5 0x00000000004c4805 in process_pbs_server_port (sock=9,
> is_scheduler_port=0, args=0x7fffe40008e0) at incoming_request.c:162
> protocol_type = 2
> rc = 0
> log_buf = '\000' <repeats 3992 times>...
> chan = 0x7fffdc0008c0
> __func__ = "process_pbs_server_port"
> #6 0x00000000004c4ac9 in start_process_pbs_server_port
> (new_sock=0x7fffe40008e0) at incoming_request.c:270
> args = 0x7fffe40008e0
> sock = 9
> rc = 0
> #7 0x00000000004fc495 in work_thread (a=0x5110710) at u_threadpool.c:318
> __clframe = {__cancel_routine = 0x4fc071 <work_cleanup(void*)>,
> __cancel_arg = 0x5110710, __do_it = 1, __cancel_type = 0}
> __clframe = {__cancel_routine = 0x4fbf64
> <work_thread_cleanup(void*)>, __cancel_arg = 0x5110710, __do_it = 1,
> __cancel_type = 0}
> tp = 0x5110710
> rc = 0
> func = 0x4c4a4d <start_process_pbs_server_port(void*)>
> arg = 0x7fffe40008e0
> mywork = 0x7fffe4000b80
> working = {next = 0x0, working_id = 140737018590976}
> ts = {tv_sec = 0, tv_nsec = 0}
> __func__ = "work_thread"
> #8 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
> pthread_create.c:333
> __res = <optimized out>
> pd = 0x7fffe3fff700
> now = <optimized out>
> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737018590976,
> -786842131623855334, 0, 140737272078415, 140737018591680, 0,
> 786815742786911002, 786861764219616026},
> mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
> data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
> not_first_call = <optimized out>
> pagesize_m1 = <optimized out>
> sp = <optimized out>
> freesize = <optimized out>
> ---Type <return> to continue, or q <return> to quit---
> __PRETTY_FUNCTION__ = "start_thread"
> #9 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
> No locals.
> (gdb) info registers
> rax 0x0 0
> rbx 0x7fffe3fff700 140737018590976
> rcx 0x0 0
> rdx 0x51118d0 85006544
> rsi 0x0 0
> rdi 0x51118d0 85006544
> rbp 0x0 0x0
> rsp 0x7fffe3ffefc0 0x7fffe3ffefc0
> r8 0x0 0
> r9 0x1 1
> r10 0x7fffdc0c8295 140736885195413
> r11 0x0 0
> r12 0x0 0
> r13 0x7ffff31be04f 140737272078415
> r14 0x7fffe3fff9c0 140737018591680
> r15 0x0 0
> rip 0x7ffff616582d 0x7ffff616582d <clone+109>
> eflags 0x10246 [ PF ZF IF RF ]
> cs 0x33 51
> ss 0x2b 43
> ds 0x0 0
> es 0x0 0
> fs 0x0 0
> gs 0x0 0
> (gdb) x/16i $pc
> => 0x7ffff616582d <clone+109>: mov %rax,%rdi
> 0x7ffff6165830 <clone+112>: callq 0x7ffff612ab60 <__GI__exit>
> 0x7ffff6165835 <clone+117>: mov 0x2bc63c(%rip),%rcx #
> 0x7ffff6421e78
> 0x7ffff616583c <clone+124>: neg %eax
> 0x7ffff616583e <clone+126>: mov %eax,%fs:(%rcx)
> 0x7ffff6165841 <clone+129>: or $0xffffffffffffffff,%rax
> 0x7ffff6165845 <clone+133>: retq
> 0x7ffff6165846: nopw %cs:0x0(%rax,%rax,1)
> 0x7ffff6165850 <lseek64>: mov $0x8,%eax
> 0x7ffff6165855 <lseek64+5>: syscall
> 0x7ffff6165857 <lseek64+7>: cmp $0xfffffffffffff001,%rax
> 0x7ffff616585d <lseek64+13>: jae 0x7ffff6165860 <lseek64+16>
> 0x7ffff616585f <lseek64+15>: retq
> 0x7ffff6165860 <lseek64+16>: mov 0x2bc611(%rip),%rcx #
> 0x7ffff6421e78
> 0x7ffff6165867 <lseek64+23>: neg %eax
> 0x7ffff6165869 <lseek64+25>: mov %eax,%fs:(%rcx)
> (gdb) thread apply all backtrace
>
> Thread 12 (Thread 0x7fffe2ffd700 (LWP 37145)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
> x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe2ffd700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 11 (Thread 0x7fffe37fe700 (LWP 37133)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
> x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 10 (Thread 0x7fffe3fff700 (LWP 37132)):
> #0 __lll_unlock_elision (lock=0x51118d0, private=0) at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> #1 0x0000000000465e0f in unlock_queue (the_queue=0x512ced0, id=0x522704
> "req_selectjobs", msg=0x0, logging=0) at queue_func.c:189
> #2 0x000000000049384a in req_selectjobs (preq=0x7fffdc081bd0) at
> req_select.c:347
> #3 0x00000000004652f4 in dispatch_request (sfds=9,
> request=0x7fffdc081bd0) at process_request.c:899
> #4 0x0000000000464e8f in process_request (chan=0x7fffdc0008c0) at
> process_request.c:702
> #5 0x00000000004c4805 in process_pbs_server_port (sock=9,
> is_scheduler_port=0, args=0x7fffe40008e0) at incoming_request.c:162
> #6 0x00000000004c4ac9 in start_process_pbs_server_port
> (new_sock=0x7fffe40008e0) at incoming_request.c:270
> #7 0x00000000004fc495 in work_thread (a=0x5110710) at u_threadpool.c:318
> #8 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
> pthread_create.c:333
> #9 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 9 (Thread 0x7ffff09bb700 (LWP 36971)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
> x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 7 (Thread 0x7ffff11bc700 (LWP 36969)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x0000000000476913 in remove_completed_jobs (vp=0x0) at
> req_jobobit.c:3759
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 6 (Thread 0x7ffff19bd700 (LWP 36968)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x00000000004afb93 in remove_extra_recycle_jobs (vp=0x0) at
> job_recycler.c:216
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 5 (Thread 0x7ffff21be700 (LWP 36967)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x00000000004bc853 in inspect_exiting_jobs (vp=0x0) at
> exiting_jobs.c:319
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 4 (Thread 0x7ffff29bf700 (LWP 36966)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x0000000000460769 in handle_queue_routing_retries (vp=0x0) at
> pbsd_main.c:1079
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> ---Type <return> to continue, or q <return> to quit---
> Thread 3 (Thread 0x7ffff31c0700 (LWP 36965)):
> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff750a276 in start_listener_addrinfo
> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
> process_meth=0x4c4a4d <start_process_pbs_server_port(void*)>)
> at ../Libnet/server_core.c:398
> #2 0x00000000004608cf in start_accept_listener (vp=0x0) at
> pbsd_main.c:1141
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 2 (Thread 0x7ffff39c1700 (LWP 36964)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
> x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 1 (Thread 0x7ffff7fd5740 (LWP 36960)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
> ../sysdeps/posix/usleep.c:32
> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
> pbsd_main.c:1935
> (gdb) quit
> A debugging session is active.
>
> Inferior 1 [process 36960] will be killed.
>
> Quit anyway? (y or n) y
>
>
>
>
> On Tue, Nov 29, 2016 at 8:53 AM, David Beer <***@adaptivecomputing.com>
> wrote:
>
>> Kazu,
>>
>> I'm shocked you're seeing so many issues. Can you send a backtrace? These
>> logs don't show anything sinister.
>>
>> On Wed, Nov 23, 2016 at 9:52 PM, Kazuhiro Fujita <
>> ***@gmail.com> wrote:
>>
>>> David,
>>>
>>> I reinstalled Torque 6.0-dev without updating from GitHub.
>>> This time I could restart all Torque daemons,
>>> but the qsub command crashed pbs_server and pbs_sched.
>>> I have attached the log files to this mail.
>>>
>>> Best,
>>> Kazu
>>>
>>> Before the crash:
>>>
>>>> # build and install torque
>>>> ./configure
>>>> make
>>>> sudo make install
>>>> # Set a correct host name of the server
>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>> # configure and start trqauthd
>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>> sudo update-rc.d trqauthd defaults
>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>> sudo ldconfig
>>>> sudo service trqauthd start
>>>> # Initialize serverdb by executing the torque.setup script
>>>> sudo ./torque.setup $USER
>>>> sudo qmgr -c "p s"
>>>> # stop pbs_server and trqauthd daemons for setting nodes.
>>>> sudo qterm
>>>> sudo service trqauthd stop
>>>> ps aux | grep pbs
>>>> ps aux | grep trq
>>>> # set nodes
>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
>>>> tee /var/spool/torque/server_priv/nodes
>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>> # set the head node
>>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/con
>>>> fig
>>>> # configure other torque daemons
>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>> sudo update-rc.d pbs_server defaults
>>>> sudo update-rc.d pbs_sched defaults
>>>> sudo update-rc.d pbs_mom defaults
>>>> # restart torque daemons
>>>> sudo service trqauthd start
>>>> sudo service pbs_server start
>>>> ps aux | grep pbs
>>>> ps aux | grep trq
>>>> sudo service pbs_sched start
>>>> sudo service pbs_mom start
>>>> ps aux | grep pbs
>>>> ps aux | grep trq
>>>> # check the configuration of computation nodes
>>>> pbsnodes -a
>>>
>>>
>>> $ ps aux | grep trq
>>> root 19130 0.0 0.0 109112 3756 ? S 13:25 0:00
>>> /usr/local/sbin/trqauthd
>>> comp_ad+ 19293 0.0 0.0 15236 1020 pts/8 S+ 13:28 0:00 grep
>>> --color=auto trq
>>> $ ps aux | grep pbs
>>> root 19175 0.0 0.0 695136 23640 ? Sl 13:26 0:00
>>> /usr/local/sbin/pbs_server
>>> root 19224 0.0 0.0 37996 4936 ? Ss 13:27 0:00
>>> /usr/local/sbin/pbs_sched
>>> root 19265 0.1 0.2 173776 136692 ? SLsl 13:27 0:00
>>> /usr/local/sbin/pbs_mom
>>> comp_ad+ 19295 0.0 0.0 15236 924 pts/8 S+ 13:28 0:00 grep
>>> --color=auto pbs
>>>
>>> A subsequent qsub command crashed pbs_server and pbs_sched.
>>>
>>> $ echo "sleep 30" | qsub
>>> 0.Dual-E52630v4
>>> $ ps aux | grep trq
>>> root 19130 0.0 0.0 109112 4268 ? S 13:25 0:00
>>> /usr/local/sbin/trqauthd
>>> comp_ad+ 19309 0.0 0.0 15236 1020 pts/8 S+ 13:28 0:00 grep
>>> --color=auto trq
>>> $ ps aux | grep pbs
>>> root 19265 0.1 0.2 173776 136688 ? SLsl 13:27 0:00
>>> /usr/local/sbin/pbs_mom
>>> comp_ad+ 19311 0.0 0.0 15236 1016 pts/8 S+ 13:28 0:00 grep
>>> --color=auto pbs
>>>
>>>
>>>
>>>
>>> On Fri, Nov 18, 2016 at 4:21 AM, David Beer <***@adaptivecomputing.com
>>> > wrote:
>>>
>>>> Kazu,
>>>>
>>>> Did you look at the server logs?
>>>>
>>>> On Wed, Nov 16, 2016 at 12:24 AM, Kazuhiro Fujita <
>>>> ***@gmail.com> wrote:
>>>>
>>>>> David,
>>>>>
>>>>> I could not find the pbs_server process after executing the commands
>>>>> shown below.
>>>>>
>>>>> sudo service trqauthd start
>>>>>> sudo service pbs_server start
>>>>>
>>>>>
>>>>> I am not sure what it did.
>>>>>
>>>>> Best,
>>>>> Kazu
>>>>>
>>>>>
>>>>> On Wed, Nov 16, 2016 at 8:10 AM, David Beer <
>>>>> ***@adaptivecomputing.com> wrote:
>>>>>
>>>>>> Kazu,
>>>>>>
>>>>>> What did it do when it failed to start?
>>>>>>
>>>>>> On Wed, Nov 9, 2016 at 9:33 PM, Kazuhiro Fujita <
>>>>>> ***@gmail.com> wrote:
>>>>>>
>>>>>>> David,
>>>>>>>
>>>>>>> In the last mail I sent, I had reinstalled 6.0-dev on the wrong server,
>>>>>>> as you can see in the output (E5-2630v3).
>>>>>>> On an E5-2630v4 server, pbs_server failed to restart as a daemon
>>>>>>> after "./torque.setup $USER".
>>>>>>>
>>>>>>> Before crash:
>>>>>>>
>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>> cd 6.0-dev
>>>>>>>> ./autogen.sh
>>>>>>>> # build and install torque
>>>>>>>> ./configure
>>>>>>>> make
>>>>>>>> sudo make install
>>>>>>>> # Set the correct name of the server
>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>> # configure and start trqauthd
>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>> sudo ldconfig
>>>>>>>> sudo service trqauthd start
>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>> sudo ./torque.setup $USER
>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>> sudo qterm
>>>>>>>> sudo service trqauthd stop
>>>>>>>> ps aux | grep pbs
>>>>>>>> ps aux | grep trq
>>>>>>>> # set nodes
>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>> # set the head node
>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>> # configure other daemons
>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>> # restart torque daemons
>>>>>>>> sudo service trqauthd start
>>>>>>>> sudo service pbs_server start
>>>>>>>
>>>>>>>
>>>>>>> Then pbs_server did not start, so I started it under gdb.
>>>>>>> Under gdb, however, pbs_server did not crash even after running qsub
>>>>>>> and qstat from another terminal, so I stopped it with Ctrl+C.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kazu
>>>>>>>
>>>>>>> gdb output
>>>>>>>
>>>>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>>> copying"
>>>>>>>> and "show warranty" for details.
>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>> For bug reporting instructions, please see:
>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>> For help, type "help".
>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>> (gdb) r -D
>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>> ad_db.so.1".
>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 35864)]
>>>>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to
>>>>>>>> open tcp connection - connect() failed [rc = -2] [addr =
>>>>>>>> 10.0.0.249:15003]
>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 35865)]
>>>>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>>>>> hierarchy to host Dual-E52630v4:15003
>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 35866)]
>>>>>>>> [New Thread 0x7ffff21be700 (LWP 35867)]
>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 35868)]
>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 35869)]
>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 35870)]
>>>>>>>> [Thread 0x7ffff09bb700 (LWP 35870) exited]
>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 35871)]
>>>>>>>> [New Thread 0x7fffe3fff700 (LWP 36003)]
>>>>>>>> [New Thread 0x7fffe37fe700 (LWP 36004)]
>>>>>>>> [New Thread 0x7fffe2ffd700 (LWP 36011)]
>>>>>>>> [New Thread 0x7fffe21ce700 (LWP 36016)]
>>>>>>>> [Thread 0x7fffe21ce700 (LWP 36016) exited]
>>>>>>>> ^C
>>>>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>>>>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>>>>>> te.S:84
>>>>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>>>>> (gdb) bt
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>> pbsd_main.c:1935
>>>>>>>> (gdb) backtrace full
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> No locals.
>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>>>>> state = 3
>>>>>>>> waittime = 5
>>>>>>>> pjob = 0x313a74
>>>>>>>> iter = 0x0
>>>>>>>> when = 1478748888
>>>>>>>> log = 0
>>>>>>>> scheduling = 1
>>>>>>>> sched_iteration = 600
>>>>>>>> time_now = 1478748970
>>>>>>>> update_loglevel = 1478748979
>>>>>>>> log_buf = "Server Ready, pid = 35860, loglevel=0", '\000'
>>>>>>>> <repeats 139 times>, "c\000\000\000\000\000\000\000
>>>>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>>>>>>> <repeats 26 times>...
>>>>>>>> sem_val = 5229209
>>>>>>>> __func__ = "main_loop"
>>>>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>> pbsd_main.c:1935
>>>>>>>> i = 2
>>>>>>>> rc = 0
>>>>>>>> local_errno = 0
>>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>>> '\000' <repeats 983 times>
>>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>>>>> server_name_file_port = 15001
>>>>>>>> fp = 0x51095f0
>>>>>>>> (gdb) info registers
>>>>>>>> rax 0xfffffffffffffdfc -516
>>>>>>>> rbx 0x6 6
>>>>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>>>>> rdx 0x0 0
>>>>>>>> rsi 0x0 0
>>>>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>>> r8 0x0 0
>>>>>>>> r9 0x4000001 67108865
>>>>>>>> r10 0x1 1
>>>>>>>> r11 0x293 659
>>>>>>>> r12 0x4260b0 4350128
>>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>>> r14 0x0 0
>>>>>>>> r15 0x0 0
>>>>>>>> rip 0x461f92 0x461f92 <main(int, char**)+2388>
>>>>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>>>>> cs 0x33 51
>>>>>>>> ss 0x2b 43
>>>>>>>> ds 0x0 0
>>>>>>>> es 0x0 0
>>>>>>>> fs 0x0 0
>>>>>>>> gs 0x0 0
>>>>>>>> (gdb) x/16i $pc
>>>>>>>> => 0x461f92 <main(int, char**)+2388>: callq 0x49484c
>>>>>>>> <shutdown_ack()>
>>>>>>>> 0x461f97 <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>>>>> 0x461f9c <main(int, char**)+2398>: callq 0x4250b0
>>>>>>>> <***@plt>
>>>>>>>> 0x461fa1 <main(int, char**)+2403>: mov 0x70f5c0(%rip),%rdx
>>>>>>>> # 0xb71568 <msg_svrdown>
>>>>>>>> 0x461fa8 <main(int, char**)+2410>: mov 0x70ef51(%rip),%rax
>>>>>>>> # 0xb70f00 <msg_daemonname>
>>>>>>>> 0x461faf <main(int, char**)+2417>: mov %rdx,%rcx
>>>>>>>> 0x461fb2 <main(int, char**)+2420>: mov %rax,%rdx
>>>>>>>> 0x461fb5 <main(int, char**)+2423>: mov $0x1,%esi
>>>>>>>> 0x461fba <main(int, char**)+2428>: mov $0x8002,%edi
>>>>>>>> 0x461fbf <main(int, char**)+2433>: callq 0x425840
>>>>>>>> <***@plt>
>>>>>>>> 0x461fc4 <main(int, char**)+2438>: mov $0x0,%edi
>>>>>>>> 0x461fc9 <main(int, char**)+2443>: callq 0x4269c9
>>>>>>>> <acct_close(bool)>
>>>>>>>> 0x461fce <main(int, char**)+2448>: mov $0xb6ce00,%edi
>>>>>>>> 0x461fd3 <main(int, char**)+2453>: callq 0x425a00
>>>>>>>> <***@plt>
>>>>>>>> 0x461fd8 <main(int, char**)+2458>: mov $0x1,%edi
>>>>>>>> 0x461fdd <main(int, char**)+2463>: callq 0x424db0
>>>>>>>> <***@plt>
>>>>>>>> (gdb) thread apply all backtrace
>>>>>>>> Thread 12 (Thread 0x7fffe2ffd700 (LWP 36011)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe2ffd700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 36004)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 36003)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 35871)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 35869)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>> #2 0x0000000000476913 in remove_completed_jobs (vp=0x0) at
>>>>>>>> req_jobobit.c:3759
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 35868)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>> #2 0x00000000004afb93 in remove_extra_recycle_jobs (vp=0x0) at
>>>>>>>> job_recycler.c:216
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 35867)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>> #2 0x00000000004bc853 in inspect_exiting_jobs (vp=0x0) at
>>>>>>>> exiting_jobs.c:319
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 35866)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>> #2 0x0000000000460769 in handle_queue_routing_retries (vp=0x0) at
>>>>>>>> pbsd_main.c:1079
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 35865)):
>>>>>>>> #0 0x00007ffff6ee17bd in accept () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>>>>> process_meth=0x4c4a4d <start_process_pbs_server_port(void*)>)
>>>>>>>> at ../Libnet/server_core.c:398
>>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>>> #2 0x00000000004608cf in start_accept_listener (vp=0x0) at
>>>>>>>> pbsd_main.c:1141
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 35864)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 35860)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>> pbsd_main.c:1935
>>>>>>>> (gdb) quit
>>>>>>>> A debugging session is active.
>>>>>>>> Inferior 1 [process 35860] will be killed.
>>>>>>>> Quit anyway? (y or n) y
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Commands executed from another terminal after starting pbs_server under
>>>>>>> gdb (r -D)
>>>>>>>
>>>>>>>> $ sudo service pbs_sched start
>>>>>>>> $ sudo service pbs_mom start
>>>>>>>> $ pbsnodes -a
>>>>>>>> Dual-E52630v4
>>>>>>>> state = free
>>>>>>>> power_state = Running
>>>>>>>> np = 4
>>>>>>>> ntype = cluster
>>>>>>>> status = rectime=1478748911,macaddr=34:
>>>>>>>> 97:f6:5d:09:a6,cpuclock=Fixed,varattr=,jobs=,state=free,netl
>>>>>>>> oad=322618417,gres=,loadave=0.06,ncpus=40,physmem=65857216kb
>>>>>>>> ,availmem=131970532kb,totmem=132849340kb,idletime=108,nusers=4,nsessions=17,sessions=1036
>>>>>>>> 1316 1327 1332 1420 1421 1422 1423 1424 1425 1426 1430 1471 1510 27075
>>>>>>>> 27130 35902,uname=Linux Dual-E52630v4 4.4.0-45-generic #66-Ubuntu SMP Wed
>>>>>>>> Oct 19 14:12:37 UTC 2016 x86_64,opsys=linux
>>>>>>>> mom_service_port = 15002
>>>>>>>> mom_manager_port = 15003
>>>>>>>> $ echo "sleep 30" | qsub
>>>>>>>> 0.Dual-E52630v4
>>>>>>>> $ qstat
>>>>>>>> Job ID Name User Time Use
>>>>>>>> S Queue
>>>>>>>> ------------------------- ---------------- --------------- --------
>>>>>>>> - -----
>>>>>>>> 0.Dual-E52630v4 STDIN comp_admin
>>>>>>>> 0 Q batch
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 10, 2016 at 12:01 PM, Kazuhiro Fujita <
>>>>>>> ***@gmail.com> wrote:
>>>>>>>
>>>>>>>> David,
>>>>>>>>
>>>>>>>> Now, it works. Thank you.
>>>>>>>> However, jobs are executed in LIFO order, as I also observed on an
>>>>>>>> E5-2630v3 server...
>>>>>>>> Below is the output of 'qstat -t' after running 'echo "sleep 10" | qsub -t
>>>>>>>> 1-10' three times.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kazu
>>>>>>>>
>>>>>>>> $ qstat -t
>>>>>>>> Job ID Name User Time Use
>>>>>>>> S Queue
>>>>>>>> ------------------------- ---------------- --------------- --------
>>>>>>>> - -----
>>>>>>>> 0.Dual-E5-2630v3 STDIN comp_admin
>>>>>>>> 00:00:00 C batch
>>>>>>>> 1[1].Dual-E5-2630v3 STDIN-1 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 1[2].Dual-E5-2630v3 STDIN-2 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 1[3].Dual-E5-2630v3 STDIN-3 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 1[4].Dual-E5-2630v3 STDIN-4 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 1[5].Dual-E5-2630v3 STDIN-5 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 1[6].Dual-E5-2630v3 STDIN-6 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 1[7].Dual-E5-2630v3 STDIN-7 comp_admin
>>>>>>>> 00:00:00 C batch
>>>>>>>> 1[8].Dual-E5-2630v3 STDIN-8 comp_admin
>>>>>>>> 00:00:00 C batch
>>>>>>>> 1[9].Dual-E5-2630v3 STDIN-9 comp_admin
>>>>>>>> 00:00:00 C batch
>>>>>>>> 1[10].Dual-E5-2630v3 STDIN-10 comp_admin
>>>>>>>> 00:00:00 C batch
>>>>>>>> 2[1].Dual-E5-2630v3 STDIN-1 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[2].Dual-E5-2630v3 STDIN-2 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[3].Dual-E5-2630v3 STDIN-3 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[4].Dual-E5-2630v3 STDIN-4 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[5].Dual-E5-2630v3 STDIN-5 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[6].Dual-E5-2630v3 STDIN-6 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[7].Dual-E5-2630v3 STDIN-7 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[8].Dual-E5-2630v3 STDIN-8 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[9].Dual-E5-2630v3 STDIN-9 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[10].Dual-E5-2630v3 STDIN-10 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[1].Dual-E5-2630v3 STDIN-1 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[2].Dual-E5-2630v3 STDIN-2 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[3].Dual-E5-2630v3 STDIN-3 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[4].Dual-E5-2630v3 STDIN-4 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[5].Dual-E5-2630v3 STDIN-5 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[6].Dual-E5-2630v3 STDIN-6 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[7].Dual-E5-2630v3 STDIN-7 comp_admin
>>>>>>>> 0 R batch
>>>>>>>> 3[8].Dual-E5-2630v3 STDIN-8 comp_admin
>>>>>>>> 0 R batch
>>>>>>>> 3[9].Dual-E5-2630v3 STDIN-9 comp_admin
>>>>>>>> 0 R batch
>>>>>>>> 3[10].Dual-E5-2630v3 STDIN-10 comp_admin
>>>>>>>> 0 R batch
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 10, 2016 at 3:07 AM, David Beer <
>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>
>>>>>>>>> Kazu,
>>>>>>>>>
>>>>>>>>> I was able to get a system to reproduce this error. I have now
>>>>>>>>> checked in another fix, and I can no longer reproduce this. Can you pull
>>>>>>>>> the latest and let me know if it fixes it for you?
>>>>>>>>>
>>>>>>>>> On Tue, Nov 8, 2016 at 2:06 AM, Kazuhiro Fujita <
>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi David,
>>>>>>>>>>
>>>>>>>>>> I reinstalled the 6.0-dev branch from GitHub today, and I think I
>>>>>>>>>> observed slightly different behavior.
>>>>>>>>>> I used the "service" command to start daemons this time.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Kazu
>>>>>>>>>>
>>>>>>>>>> Before the crash
>>>>>>>>>>
>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>>>>> cd 6.0-dev
>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>> # build and install torque
>>>>>>>>>>> ./configure
>>>>>>>>>>> make
>>>>>>>>>>> sudo make install
>>>>>>>>>>> # Set the correct name of the server
>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>> # configure and start trqauthd
>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>> sudo qterm
>>>>>>>>>>> sudo service trqauthd stop
>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>> ps aux | grep trq
>>>>>>>>>>> # set nodes
>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`"
>>>>>>>>>>> | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>>>>> # set the head node
>>>>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>>>>> # configure the other daemons
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>> # start torque daemons
>>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>> sudo service pbs_server start
>>>>>>>>>>> sudo service pbs_sched start
>>>>>>>>>>> sudo service pbs_mom start
>>>>>>>>>>> # check the configuration of the compute nodes
>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>
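As a side note on the node-count pipeline used in the setup script above: the coreutils `nproc` command produces a similar count in one step. This is just a sketch for comparison; note that `nproc` honors CPU affinity and cgroup limits, so inside a container it can report fewer CPUs than `/proc/cpuinfo` lists.

```shell
# Count logical CPUs as the setup script above does (grep -c avoids the extra wc):
cpus_pipe=$(grep -c ^processor /proc/cpuinfo)

# coreutils alternative; respects CPU affinity and cgroup limits:
cpus_nproc=$(nproc)

echo "cpuinfo=$cpus_pipe nproc=$cpus_nproc"
```

On a bare-metal machine like the dual E5 v4 server here, the two counts normally agree.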
>>>>>>>>>>
>>>>>>>>>> I checked the torque processes with "ps aux | grep pbs" and "ps aux |
>>>>>>>>>> grep trq" several times.
>>>>>>>>>> After "pbsnodes -a", everything seems OK.
>>>>>>>>>> But the next qsub command seems to trigger a crash of "pbs_server"
>>>>>>>>>> and "pbs_sched".
>>>>>>>>>>
>>>>>>>>>> $ ps aux | grep trq
>>>>>>>>>>> root 9682 0.0 0.0 109112 3632 ? S 17:39 0:00
>>>>>>>>>>> /usr/local/sbin/trqauthd
>>>>>>>>>>> comp_ad+ 9842 0.0 0.0 15236 936 pts/8 S+ 17:40 0:00
>>>>>>>>>>> grep --color=auto trq
>>>>>>>>>>> $ ps aux | grep pbs
>>>>>>>>>>> root 9720 0.0 0.0 695140 25760 ? Sl 17:39 0:00
>>>>>>>>>>> /usr/local/sbin/pbs_server
>>>>>>>>>>> root 9771 0.0 0.0 37996 4940 ? Ss 17:39 0:00
>>>>>>>>>>> /usr/local/sbin/pbs_sched
>>>>>>>>>>> root 9814 0.2 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>>>>>>> /usr/local/sbin/pbs_mom
>>>>>>>>>>> comp_ad+ 9844 0.0 0.0 15236 1012 pts/8 S+ 17:40 0:00
>>>>>>>>>>> grep --color=auto pbs
>>>>>>>>>>> $ echo "sleep 30" | qsub
>>>>>>>>>>> 0.Dual-E52630v4
>>>>>>>>>>> $ ps aux | grep pbs
>>>>>>>>>>> root 9814 0.1 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>>>>>>> /usr/local/sbin/pbs_mom
>>>>>>>>>>> comp_ad+ 9855 0.0 0.0 15236 928 pts/8 S+ 17:41 0:00
>>>>>>>>>>> grep --color=auto pbs
>>>>>>>>>>> $ ps aux | grep trq
>>>>>>>>>>> root 9682 0.0 0.0 109112 4144 ? S 17:39 0:00
>>>>>>>>>>> /usr/local/sbin/trqauthd
>>>>>>>>>>> comp_ad+ 9860 0.0 0.0 15236 1092 pts/8 S+ 17:41 0:00
>>>>>>>>>>> grep --color=auto trq
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Then I stopped the remaining processes,
>>>>>>>>>>
>>>>>>>>>> sudo service pbs_mom stop
>>>>>>>>>>> sudo service trqauthd stop
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> and started "trqauthd" again and ran "pbs_server" under gdb.
>>>>>>>>>> "pbs_server" crashed in gdb without any other commands being issued.
>>>>>>>>>>
>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>>>>> copying"
>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>>> For help, type "help".
>>>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>>> (gdb) r -D
>>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>>>> ad_db.so.1".
>>>>>>>>>>
>>>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>>>> __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file
>>>>>>>>>> or directory.
>>>>>>>>>> (gdb) bt
>>>>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>>>>> at svr_jobfunc.c:4011
>>>>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1)
>>>>>>>>>> at pbsd_init.c:2558
>>>>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>>>>> pbsd_init.c:1803
>>>>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1)
>>>>>>>>>> at pbsd_init.c:2100
>>>>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>> pbsd_main.c:1898
>>>>>>>>>> (gdb) backtrace full
>>>>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>> No locals.
>>>>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>>>>> at svr_jobfunc.c:4011
>>>>>>>>>> rc = 0
>>>>>>>>>> err_msg = 0x0
>>>>>>>>>> stub_msg = "no pos"
>>>>>>>>>> __func__ = "unlock_ji_mutex"
>>>>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>>>>> pattrjb = 0x7fffffff4a10
>>>>>>>>>> pdef = 0x4
>>>>>>>>>> pque = 0x0
>>>>>>>>>> rc = 0
>>>>>>>>>> log_buf = '\000' <repeats 24 times>,
>>>>>>>>>> "\030\000\000\000\060\000\000\000PU\377\377\377\177\000\000\220T\377\377\377\177",
>>>>>>>>>> '\000' <repeats 50 times>, "\003\000\000\000\000\000\000\
>>>>>>>>>> 000#\000\000\000\000\000\000\000pO\377\377\377\177", '\000'
>>>>>>>>>> <repeats 26 times>, "\221\260\000\000\000\200\377\
>>>>>>>>>> 377oO\377\377\377\177\000\000H+B\366\377\177\000\000p+B\366\
>>>>>>>>>> 377\177\000\000\200O\377\377\377\177\000\000\201\260\000\000
>>>>>>>>>> \000\200\377\377\177O\377\377\377\177", '\000' <repeats 18
>>>>>>>>>> times>...
>>>>>>>>>> time_now = 1478594788
>>>>>>>>>> job_id = "0.Dual-E52630v4\000\000\000\0
>>>>>>>>>> 00\000\000\000\000\000\362\377\377\377\377\377\377\377\340J\
>>>>>>>>>> 377\377\377\177\000\000\060L\377\377\377\177\000\000\001\000
>>>>>>>>>> \000\000\000\000\000\000\244\201\000\000\001\000\000\000\030
>>>>>>>>>> \354\377\367\377\177\000\***@L\377\377\377\177\000\000\000\0
>>>>>>>>>> 00\000\000\005\000\000\220\r\000\000\000\000\000\000\000k\02
>>>>>>>>>> 2j\365\377\177\000\000\031J\377\377\377\177\000\000\201n\376
>>>>>>>>>> \017\000\000\000\000\\\216!X\000\000\000\000_#\343+\000\000\
>>>>>>>>>> 000\000\\\216!X\000\000\000\000\207\065],", '\000' <repeats 36
>>>>>>>>>> times>, "k\022j\365\377\177\000\000\30
>>>>>>>>>> 0K\377\377\377\177\000\000\000\000\000\000\000\000\000\000"...
>>>>>>>>>> queue_name = "batch\000\377\377\240\340\377
>>>>>>>>>> \367\377\177\000"
>>>>>>>>>> total_jobs = 0
>>>>>>>>>> user_jobs = 0
>>>>>>>>>> array_jobs = 0
>>>>>>>>>> __func__ = "svr_enquejob"
>>>>>>>>>> que_mgr = {unlock_on_exit = 160, locked = 75, mutex_valid
>>>>>>>>>> = 255, managed_mutex = 0x7ffff7ddccda <open_path+474>}
>>>>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>>>>> newstate = 0
>>>>>>>>>> newsubstate = 0
>>>>>>>>>> rc = 0
>>>>>>>>>> log_buf = "pbsd_init_reque:1", '\000' <repeats 1063
>>>>>>>>>> times>...
>>>>>>>>>> __func__ = "pbsd_init_reque"
>>>>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1)
>>>>>>>>>> at pbsd_init.c:2558
>>>>>>>>>> d = 0
>>>>>>>>>> rc = 0
>>>>>>>>>> time_now = 1478594788
>>>>>>>>>> log_buf = '\000' <repeats 2112 times>...
>>>>>>>>>> local_errno = 0
>>>>>>>>>> job_id = '\000' <repeats 1016 times>...
>>>>>>>>>> job_atr_hold = 0
>>>>>>>>>> job_exit_status = 0
>>>>>>>>>> __func__ = "pbsd_init_job"
>>>>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>>>>> pbsd_init.c:1803
>>>>>>>>>> pjob = 0x512d880
>>>>>>>>>> Index = 0
>>>>>>>>>> JobArray_iter = {first = "0.Dual-E52630v4", second = }
>>>>>>>>>> log_buf = "14 total files read from
>>>>>>>>>> disk\000\000\000\000\000\000\000\001\000\000\000\320\316\022
>>>>>>>>>> \005\000\000\000\000\220N\022\005", '\000' <repeats 12 times>,
>>>>>>>>>> "Expected 1, recovered 1 queues", '\000' <repeats 1330 times>...
>>>>>>>>>> rc = 0
>>>>>>>>>> job_rc = 0
>>>>>>>>>> logtype = 0
>>>>>>>>>> pdirent = 0x0
>>>>>>>>>> pdirent_sub = 0x0
>>>>>>>>>> dir = 0x5124e90
>>>>>>>>>> dir_sub = 0x0
>>>>>>>>>> had = 0
>>>>>>>>>> pjob = 0x0
>>>>>>>>>> time_now = 1478594788
>>>>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>>>>> basen = '\000' <repeats 1088 times>...
>>>>>>>>>> use_jobs_subdirs = 0
>>>>>>>>>> __func__ = "handle_job_recovery"
>>>>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1)
>>>>>>>>>> at pbsd_init.c:2100
>>>>>>>>>> rc = 0
>>>>>>>>>> tmp_rc = 1974134615
>>>>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>>>>> ret = 0
>>>>>>>>>> gid = 0
>>>>>>>>>> log_buf = "pbsd_init:1", '\000' <repeats 997 times>...
>>>>>>>>>> __func__ = "pbsd_init"
>>>>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>> pbsd_main.c:1898
>>>>>>>>>> i = 2
>>>>>>>>>> rc = 0
>>>>>>>>>> local_errno = 0
>>>>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>>>>> '\000' <repeats 983 times>
>>>>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>>>>> tmpLine = "Server Dual-E52630v4 started, initialization
>>>>>>>>>> type = 1", '\000' <repeats 970 times>
>>>>>>>>>> log_buf = "Server Dual-E52630v4 started, initialization
>>>>>>>>>> type = 1", '\000' <repeats 1139 times>...
>>>>>>>>>> server_name_file_port = 15001
>>>>>>>>>> fp = 0x51095f0
>>>>>>>>>> (gdb) info registers
>>>>>>>>>> rax 0x0 0
>>>>>>>>>> rbx 0x6 6
>>>>>>>>>> rcx 0x0 0
>>>>>>>>>> rdx 0x512f1b0 85127600
>>>>>>>>>> rsi 0x0 0
>>>>>>>>>> rdi 0x512f1b0 85127600
>>>>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>>>>> r8 0x0 0
>>>>>>>>>> r9 0x7fffffff57a2 140737488312226
>>>>>>>>>> r10 0x513c800 85182464
>>>>>>>>>> r11 0x7ffff61e6128 140737322574120
>>>>>>>>>> r12 0x4260b0 4350128
>>>>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>>>>> r14 0x0 0
>>>>>>>>>> r15 0x0 0
>>>>>>>>>> rip 0x461f29 0x461f29 <main(int, char**)+2183>
>>>>>>>>>> eflags 0x10246 [ PF ZF IF RF ]
>>>>>>>>>> cs 0x33 51
>>>>>>>>>> ss 0x2b 43
>>>>>>>>>> ds 0x0 0
>>>>>>>>>> es 0x0 0
>>>>>>>>>> fs 0x0 0
>>>>>>>>>> gs 0x0 0
>>>>>>>>>> (gdb) x/16i $pc
>>>>>>>>>> => 0x461f29 <main(int, char**)+2183>: test %eax,%eax
>>>>>>>>>> 0x461f2b <main(int, char**)+2185>: setne %al
>>>>>>>>>> 0x461f2e <main(int, char**)+2188>: test %al,%al
>>>>>>>>>> 0x461f30 <main(int, char**)+2190>: je 0x461f55 <main(int,
>>>>>>>>>> char**)+2227>
>>>>>>>>>> 0x461f32 <main(int, char**)+2192>: mov 0x70efc7(%rip),%rax
>>>>>>>>>> # 0xb70f00 <msg_daemonname>
>>>>>>>>>> 0x461f39 <main(int, char**)+2199>: mov $0x51bab2,%edx
>>>>>>>>>> 0x461f3e <main(int, char**)+2204>: mov %rax,%rsi
>>>>>>>>>> 0x461f41 <main(int, char**)+2207>: mov $0xffffffff,%edi
>>>>>>>>>> 0x461f46 <main(int, char**)+2212>: callq 0x425420
>>>>>>>>>> <***@plt>
>>>>>>>>>> 0x461f4b <main(int, char**)+2217>: mov $0x3,%edi
>>>>>>>>>> 0x461f50 <main(int, char**)+2222>: callq 0x425680 <***@plt>
>>>>>>>>>> 0x461f55 <main(int, char**)+2227>: mov 0x71021d(%rip),%esi
>>>>>>>>>> # 0xb72178 <pbs_mom_port>
>>>>>>>>>> 0x461f5b <main(int, char**)+2233>: mov 0x710227(%rip),%ecx
>>>>>>>>>> # 0xb72188 <pbs_scheduler_port>
>>>>>>>>>> 0x461f61 <main(int, char**)+2239>: mov 0x710225(%rip),%edx
>>>>>>>>>> # 0xb7218c <pbs_server_port_dis>
>>>>>>>>>> 0x461f67 <main(int, char**)+2245>: lea -0x1400(%rbp),%rax
>>>>>>>>>> 0x461f6e <main(int, char**)+2252>: mov $0xb739c0,%r9d
>>>>>>>>>> (gdb) thread apply all backtrace
>>>>>>>>>>
>>>>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 10004)):
>>>>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>>>>> at svr_jobfunc.c:4011
>>>>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1)
>>>>>>>>>> at pbsd_init.c:2558
>>>>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>>>>> pbsd_init.c:1803
>>>>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1)
>>>>>>>>>> at pbsd_init.c:2100
>>>>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>> pbsd_main.c:1898
>>>>>>>>>> (gdb) quit
>>>>>>>>>> A debugging session is active.
>>>>>>>>>>
>>>>>>>>>> Inferior 1 [process 10004] will be killed.
>>>>>>>>>>
>>>>>>>>>> Quit anyway? (y or n) y
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 2, 2016 at 1:43 AM, David Beer <
>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Kazu,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for sticking with us on this. You mentioned that
>>>>>>>>>>> pbs_server did not crash when you submitted the job, but you said that it
>>>>>>>>>>> and pbs_sched are "unstable." What do you mean by unstable? Will jobs run?
>>>>>>>>>>> Your gdb output looks like a pbs_server that isn't busy, but other than that
>>>>>>>>>>> it looks normal.
>>>>>>>>>>>
>>>>>>>>>>> David
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Nov 1, 2016 at 1:19 AM, Kazuhiro Fujita <
>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> David,
>>>>>>>>>>>>
>>>>>>>>>>>> I tested the 6.0-dev branch. It passed the "sudo ./torque.setup $USER"
>>>>>>>>>>>> script,
>>>>>>>>>>>> but pbs_server and pbs_sched are unstable, as in 6.1-dev.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Kazu
>>>>>>>>>>>>
>>>>>>>>>>>> Before execution of gdb
>>>>>>>>>>>>
>>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>>>>>>> cd 6.0-dev
>>>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>> make
>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>> # Set the correct name of the server
>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>> # configure and start trqauthd
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>
>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>> sudo qterm
>>>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>>>> # set nodes
>>>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc
>>>>>>>>>>>>> -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>>>>>>> # set the head node
>>>>>>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>>>>>>> # configure the other daemons
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>> # start torque daemons
>>>>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Execution of gdb
>>>>>>>>>>>>
>>>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Commands executed from another terminal
>>>>>>>>>>>>
>>>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The last command did not cause a crash of pbs_server. The
>>>>>>>>>>>> backtrace is described below.
>>>>>>>>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>>>>> This is free software: you are free to change and redistribute
>>>>>>>>>>>> it.
>>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type
>>>>>>>>>>>> "show copying"
>>>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>>>>> For help, type "help".
>>>>>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>>>>> (gdb) r -D
>>>>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>>>>>> ad_db.so.1".
>>>>>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 5024)]
>>>>>>>>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 5025)]
>>>>>>>>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when
>>>>>>>>>>>> trying to open tcp connection - connect() failed [rc = -2] [addr =
>>>>>>>>>>>> 10.0.0.249:15003]
>>>>>>>>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>>>>>>>>> hierarchy to host Dual-E52630v4:15003
>>>>>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 5026)]
>>>>>>>>>>>> [New Thread 0x7ffff21be700 (LWP 5027)]
>>>>>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 5028)]
>>>>>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 5029)]
>>>>>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5030)]
>>>>>>>>>>>> [Thread 0x7ffff09bb700 (LWP 5030) exited]
>>>>>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5031)]
>>>>>>>>>>>> [New Thread 0x7fffe3fff700 (LWP 5109)]
>>>>>>>>>>>> [New Thread 0x7fffe37fe700 (LWP 5113)]
>>>>>>>>>>>> [New Thread 0x7fffe29cf700 (LWP 5121)]
>>>>>>>>>>>> [Thread 0x7fffe29cf700 (LWP 5121) exited]
>>>>>>>>>>>> ^C
>>>>>>>>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>>>>>>>>> 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or
>>>>>>>>>>>> directory.
>>>>>>>>>>>> (gdb) backtrace full
>>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> No locals.
>>>>>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>>>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>>>>>>> state = 3
>>>>>>>>>>>> waittime = 5
>>>>>>>>>>>> pjob = 0x313a74
>>>>>>>>>>>> iter = 0x0
>>>>>>>>>>>> when = 1477984074
>>>>>>>>>>>> log = 0
>>>>>>>>>>>> scheduling = 1
>>>>>>>>>>>> sched_iteration = 600
>>>>>>>>>>>> time_now = 1477984190
>>>>>>>>>>>> update_loglevel = 1477984198
>>>>>>>>>>>> log_buf = "Server Ready, pid = 5020, loglevel=0",
>>>>>>>>>>>> '\000' <repeats 140 times>, "c\000\000\000\000\000\000\000
>>>>>>>>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177",
>>>>>>>>>>>> '\000' <repeats 26 times>...
>>>>>>>>>>>> sem_val = 5228929
>>>>>>>>>>>> __func__ = "main_loop"
>>>>>>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>>>> pbsd_main.c:1935
>>>>>>>>>>>> i = 2
>>>>>>>>>>>> rc = 0
>>>>>>>>>>>> local_errno = 0
>>>>>>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>>>>>>> '\000' <repeats 983 times>
>>>>>>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>>>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>>>>>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>>>>>>>>> server_name_file_port = 15001
>>>>>>>>>>>> fp = 0x51095f0
>>>>>>>>>>>> (gdb) info registers
>>>>>>>>>>>> rax 0xfffffffffffffdfc -516
>>>>>>>>>>>> rbx 0x5 5
>>>>>>>>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>>>>>>>>> rdx 0x0 0
>>>>>>>>>>>> rsi 0x0 0
>>>>>>>>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>>>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>>>>>>> r8 0x0 0
>>>>>>>>>>>> r9 0x4000001 67108865
>>>>>>>>>>>> r10 0x1 1
>>>>>>>>>>>> r11 0x293 659
>>>>>>>>>>>> r12 0x4260b0 4350128
>>>>>>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>>>>>>> r14 0x0 0
>>>>>>>>>>>> r15 0x0 0
>>>>>>>>>>>> rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
>>>>>>>>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>>>>>>>>> cs 0x33 51
>>>>>>>>>>>> ss 0x2b 43
>>>>>>>>>>>> ds 0x0 0
>>>>>>>>>>>> es 0x0 0
>>>>>>>>>>>> fs 0x0 0
>>>>>>>>>>>> gs 0x0 0
>>>>>>>>>>>> (gdb) x/16i $pc
>>>>>>>>>>>> => 0x461fb6 <main(int, char**)+2388>: callq 0x494762
>>>>>>>>>>>> <shutdown_ack()>
>>>>>>>>>>>> 0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>>>>>>>>> 0x461fc0 <main(int, char**)+2398>: callq 0x4250b0
>>>>>>>>>>>> <***@plt>
>>>>>>>>>>>> 0x461fc5 <main(int, char**)+2403>: mov
>>>>>>>>>>>> 0x70f55c(%rip),%rdx # 0xb71528 <msg_svrdown>
>>>>>>>>>>>> 0x461fcc <main(int, char**)+2410>: mov
>>>>>>>>>>>> 0x70eeed(%rip),%rax # 0xb70ec0 <msg_daemonname>
>>>>>>>>>>>> 0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
>>>>>>>>>>>> 0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
>>>>>>>>>>>> 0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
>>>>>>>>>>>> 0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
>>>>>>>>>>>> 0x461fe3 <main(int, char**)+2433>: callq 0x425840
>>>>>>>>>>>> <***@plt>
>>>>>>>>>>>> 0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
>>>>>>>>>>>> 0x461fed <main(int, char**)+2443>: callq 0x4269c9
>>>>>>>>>>>> <acct_close(bool)>
>>>>>>>>>>>> 0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
>>>>>>>>>>>> 0x461ff7 <main(int, char**)+2453>: callq 0x425a00
>>>>>>>>>>>> <***@plt>
>>>>>>>>>>>> 0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
>>>>>>>>>>>> 0x462001 <main(int, char**)+2463>: callq 0x424db0
>>>>>>>>>>>> <***@plt>
>>>>>>>>>>>> (gdb) thread apply all backtrace
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
>>>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
>>>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
>>>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
>>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>>>> #2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at
>>>>>>>>>>>> req_jobobit.c:3759
>>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
>>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>>>> #2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at
>>>>>>>>>>>> job_recycler.c:216
>>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
>>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>>>> #2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at
>>>>>>>>>>>> exiting_jobs.c:319
>>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
>>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>>>> #2 0x000000000046078d in handle_queue_routing_retries (vp=0x0)
>>>>>>>>>>>> at pbsd_main.c:1079
>>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
>>>>>>>>>>>> #0 0x00007ffff6ee17bd in accept () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>>>>>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>>>>>>>>> process_meth=0x4c4935 <start_process_pbs_server_port(void*)>)
>>>>>>>>>>>> at ../Libnet/server_core.c:398
>>>>>>>>>>>> #2 0x00000000004608f3 in start_accept_listener (vp=0x0) at
>>>>>>>>>>>> pbsd_main.c:1141
>>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
>>>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
>>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>>>> pbsd_main.c:1935
>>>>>>>>>>>> (gdb) quit
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 28, 2016 at 12:43 PM, Kazuhiro Fujita <
>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for your comments.
>>>>>>>>>>>>> I will try the 6.0-dev next week.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 28, 2016 at 5:34 AM, David Beer <
>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I wonder if that fix wasn't included in the hotfix. Is there
>>>>>>>>>>>>>> any chance you can try installing 6.0-dev on your system (via
>>>>>>>>>>>>>> github) to see if it's resolved? For the record, my Ubuntu 16
>>>>>>>>>>>>>> system doesn't give me this error, or I'd try it myself. For
>>>>>>>>>>>>>> whatever reason, none of our test cluster machines (CentOS &
>>>>>>>>>>>>>> Red Hat 6-7, SLES 11-12) experience this either. We did have
>>>>>>>>>>>>>> another user that experienced it on a test cluster, but not
>>>>>>>>>>>>>> being able to reproduce it has made it harder to track down.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> David,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried 6.0.2.h3, but it seems that the other issue
>>>>>>>>>>>>>>> still remains.
>>>>>>>>>>>>>>> After I initialized serverdb by "sudo pbs_server -t create",
>>>>>>>>>>>>>>> pbs_server crashed.
>>>>>>>>>>>>>>> Then, I used gdb with pbs_server.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>>>>>>>> This is free software: you are free to change and
>>>>>>>>>>>>>>> redistribute it.
>>>>>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type
>>>>>>>>>>>>>>> "show copying"
>>>>>>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>>>>>>>> Find the GDB manual and other documentation resources online
>>>>>>>>>>>>>>> at:
>>>>>>>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>>>>>>>> For help, type "help".
>>>>>>>>>>>>>>> Type "apropos word" to search for commands related to
>>>>>>>>>>>>>>> "word"...
>>>>>>>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>>>>>>>> (gdb) r -D
>>>>>>>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>>>>>>> Using host libthread_db library
>>>>>>>>>>>>>>> "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>>>>>>>>>>>>>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>>>>>>>>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>>>>>>>>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>>>>>>>>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>>>>>>>>>>>>>>> [New Thread 0x7ffff21be700 (LWP 25594)]
>>>>>>>>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>>>>>>>>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation
>>>>>>>>>>>>>>> fault.
>>>>>>>>>>>>>>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>>>>>>> __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such
>>>>>>>>>>>>>>> file or directory.
>>>>>>>>>>>>>>> (gdb) bt
>>>>>>>>>>>>>>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>>>>>>> #1 0x00000000004ac076 in dispatch_timed_task
>>>>>>>>>>>>>>> (ptask=0x5727660) at svr_task.c:318
>>>>>>>>>>>>>>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at
>>>>>>>>>>>>>>> pbsd_main.c:921
>>>>>>>>>>>>>>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at
>>>>>>>>>>>>>>> u_threadpool.c:318
>>>>>>>>>>>>>>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700)
>>>>>>>>>>>>>>> at pthread_create.c:333
>>>>>>>>>>>>>>> #5 0x00007ffff6165b5d in clone () at
>>>>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> David and Rick,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you for the quick response. I will try it later.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Actually, Rick just sent me the link. You can download it
>>>>>>>>>>>>>>>>> from here:
>>>>>>>>>>>>>>>>> http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I can confirm that this bug is fixed in 6.0-dev, and
>>>>>>>>>>>>>>>>>> we've made a hotfix for it, 6.0.2.h3. This crash was
>>>>>>>>>>>>>>>>>> caused by a change in the pthread library implementation,
>>>>>>>>>>>>>>>>>> so most users will not see it, but it appears that with a
>>>>>>>>>>>>>>>>>> newer version of that library you will. Rick is going to
>>>>>>>>>>>>>>>>>> send instructions for how to grab 6.0.2.h3.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you David for the comment on the backtrace.
>>>>>>>>>>>>>>>>>>> I hadn't noticed that until writing this mail,
>>>>>>>>>>>>>>>>>>> so I used backtrace as described in the Ubuntu wiki.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I also attached the backtrace of pbs_server (Torque
>>>>>>>>>>>>>>>>>>> 6.1-dev) by gdb.
>>>>>>>>>>>>>>>>>>> As I mentioned before, the torque.setup script was
>>>>>>>>>>>>>>>>>>> successfully executed, but pbs_server was unstable.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Before using gdb, I used the following commands.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git
>>>>>>>>>>>>>>>>>>>> -b 6.1-dev 6.1-dev
>>>>>>>>>>>>>>>>>>>> cd 6.1-dev
>>>>>>>>>>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee
>>>>>>>>>>>>>>>>>>>> /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>>>>>> # set as services
>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd
>>>>>>>>>>>>>>>>>>>> /etc/init.d/trqauthd
>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom
>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor
>>>>>>>>>>>>>>>>>>>> | wc -l`" | sudo tee
>>>>>>>>>>>>>>>>>>>> /var/spool/torque/server_priv/nodes
>>>>>>>>>>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes #
>>>>>>>>>>>>>>>>>>>> (changed np)
>>>>>>>>>>>>>>>>>>>> sudo qterm -t quick
>>>>>>>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> trqauthd was not stopped by the last command, so I
>>>>>>>>>>>>>>>>>>> stopped it by killing the trqauthd process.
>>>>>>>>>>>>>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> sudo gdb /etc/init.d/pbs_server 2>&1 | tee
>>>>>>>>>>>>>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In another terminal, I executed the following commands
>>>>>>>>>>>>>>>>>>> before pbs_server crashed.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The output of the last command was "0.torque-server",
>>>>>>>>>>>>>>>>>>> and this command crashed pbs_server in gdb.
>>>>>>>>>>>>>>>>>>> Then I made the backtrace.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> David,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2)
>>>>>>>>>>>>>>>>>>>> by gdb.
>>>>>>>>>>>>>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I started pbs_server with gdb,
>>>>>>>>>>>>>>>>>>>> and execute qmgr from another terminal. (see below)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server
>>>>>>>>>>>>>>>>>>>>> '.
>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> Kaz
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Kazu,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Can you give us a backtrace for this crash? We have
>>>>>>>>>>>>>>>>>>>>> fixed some issues on startup (around mutex management for newer pthread
>>>>>>>>>>>>>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>>>>>>>>>>>>>> seeing is fixed.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Dear All,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS
>>>>>>>>>>>>>>>>>>>>>> with dual E5-2630 v3 chips.
>>>>>>>>>>>>>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips,
>>>>>>>>>>>>>>>>>>>>>> and installed Ubuntu 16.04 LTS on them.
>>>>>>>>>>>>>>>>>>>>>> I tried to set up Torque on them, but I got stuck
>>>>>>>>>>>>>>>>>>>>>> at the initial setup script.
>>>>>>>>>>>>>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server
>>>>>>>>>>>>>>>>>>>>>> in the initial setup script (torque.setup); see below.
>>>>>>>>>>>>>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>>>>>>>>>>>>>> Have you ever observed this kind of error?
>>>>>>>>>>>>>>>>>>>>>> If you know of possible solutions, please tell me.
>>>>>>>>>>>>>>>>>>>>>> Any comments will be highly appreciated.
>>>>>>>>>>>>>>>>>>>>>> Would it be better to change the OS to another
>>>>>>>>>>>>>>>>>>>>>> distribution, such as Scientific Linux?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port
>>>>>>>>>>>>>>>>>>>>>>> is: 15001
>>>>>>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>>>>>>> initializing TORQUE (admin:
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server)
>>>>>>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00
>>>>>>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>>>> set server operators +=
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server
>>>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$
>>>>>>>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+
>>>>>>>>>>>>>>>>>>>>>>> 12:22 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The pbs_server -t create process was not found in the
>>>>>>>>>>>>>>>>>>>>>> ps output.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port
>>>>>>>>>>>>>>>>>>>>>>> is: 15001
>>>>>>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>>>>>>> initializing TORQUE (admin:
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server)
>>>>>>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00
>>>>>>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+
>>>>>>>>>>>>>>>>>>>>>>> 16:11 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The pbs_server -t create process was not found in the
>>>>>>>>>>>>>>>>>>>>>> ps output.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Commands used for installation before the setup
>>>>>>>>>>>>>>>>>>>>>> script
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee
>>>>>>>>>>>>>>>>>>>>>>> /var/spool/torque/server_name
>>>>>>>>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee
>>>>>>>>>>>>>>>>>>>>>>> /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> # set up as services
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd
>>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/trqauthd
>>>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom
>>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>>>>> torqueusers mailing list
>>>>>>>>>>>>>>>>>>>>>> ***@supercluster.org
>>>>>>>>>>>>>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>>>>>>>>>> Adaptive Computing