Dear David,
Thanks.
I checked the latest version of 6.0-dev,
and it works, except that jobs are still scheduled in LIFO order.
Best,
Kazu
On Wed, Nov 30, 2016 at 3:12 PM, Kazuhiro Fujita <***@gmail.com>
wrote:
> David,
>
> I attached the backtrace below.
>
> First, I set up and launched gdb:
> sudo service pbs_mom stop
> sudo service pbs_sched stop
> sudo service pbs_server stop
> sudo service trqauthd stop
> sudo service trqauthd start
> sudo gdb /usr/local/sbin/pbs_server
>
> Then,
> (gdb) r -D
>
> In another terminal I executed the following commands,
> and the last command (echo "sleep 30" | qsub) caused the crash as I
> reported before.
>
> $sudo service pbs_sched start
> $sudo service pbs_mom start
> $ps aux | grep pbs
> root 36957 0.0 0.0 55808 4164 pts/8 S 14:53 0:00 sudo gdb
> /usr/local/sbin/pbs_server
> root 36958 0.7 0.0 109464 63648 pts/8 S 14:53 0:00 gdb
> /usr/local/sbin/pbs_server
> root 36960 0.0 0.0 473936 24768 pts/8 Sl+ 14:53 0:00
> /usr/local/sbin/pbs_server -D
> root 37079 0.0 0.0 37996 4940 ? Ss 14:54 0:00
> /usr/local/sbin/pbs_sched
> root 37116 0.0 0.1 115892 76900 ? RLsl 14:54 0:00
> /usr/local/sbin/pbs_mom
> comp_ad+ 37118 0.0 0.0 15236 976 pts/9 S+ 14:54 0:00 grep
> --color=auto pbs
> $ps aux | grep trq
> root 36956 0.0 0.0 29052 2332 ? S 14:52 0:00
> /usr/local/sbin/trqauthd
> comp_ad+ 37135 0.0 0.0 15236 1032 pts/9 S+ 14:54 0:00 grep
> --color=auto trq
> $ pbsnodes -a
> $ echo "sleep 30" | qsub
>
> The output of gdb is shown below.
>
> Best,
> Kazu
>
>
> $ sudo gdb /usr/local/sbin/pbs_server
> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
> Copyright (C) 2016 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.
> html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/local/sbin/pbs_server...done.
> (gdb) r -D
> Starting program: /usr/local/sbin/pbs_server -D
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> [New Thread 0x7ffff39c1700 (LWP 36964)]
> pbs_server is up (version - 6.0, port - 15001)
> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open
> tcp connection - connect() failed [rc = -2] [addr = 10.0.0.249:15003]
> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom hierarchy
> to host Dual-E52630v4:15003
> [New Thread 0x7ffff31c0700 (LWP 36965)]
> [New Thread 0x7ffff29bf700 (LWP 36966)]
> [New Thread 0x7ffff21be700 (LWP 36967)]
> [New Thread 0x7ffff19bd700 (LWP 36968)]
> [New Thread 0x7ffff11bc700 (LWP 36969)]
> [New Thread 0x7ffff09bb700 (LWP 36970)]
> [Thread 0x7ffff09bb700 (LWP 36970) exited]
> [New Thread 0x7ffff09bb700 (LWP 36971)]
> [New Thread 0x7fffe3fff700 (LWP 37132)]
> [New Thread 0x7fffe37fe700 (LWP 37133)]
> [New Thread 0x7fffe2ffd700 (LWP 37145)]
> [New Thread 0x7fffe21ce700 (LWP 37150)]
> [Thread 0x7fffe21ce700 (LWP 37150) exited]
> Assertion failed, bad pointer in link: file "req_select.c", line 401
>
> Thread 10 "pbs_server" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffe3fff700 (LWP 37132)]
> __lll_unlock_elision (lock=0x51118d0, private=0) at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
> directory.
> (gdb) backtrace full
> #0 __lll_unlock_elision (lock=0x51118d0, private=0) at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> No locals.
> #1 0x0000000000465e0f in unlock_queue (the_queue=0x512ced0, id=0x522704
> "req_selectjobs", msg=0x0, logging=0) at queue_func.c:189
> rc = 0
> err_msg = 0x0
> stub_msg = "no pos"
> __func__ = "unlock_queue"
> #2 0x000000000049384a in req_selectjobs (preq=0x7fffdc081bd0) at
> req_select.c:347
> bad = 1
> cntl = 0x7fffdc000930
> plist = 0x7fffdc001880
> pque = 0x512ced0
> rc = 0
> log_buf = '\000' <repeats 184 times>, "\b\205P\367\377\177\000\000\
> 000\000\000\000\000\000\000\000"...
> selistp = 0x0
> #3 0x00000000004652f4 in dispatch_request (sfds=9,
> request=0x7fffdc081bd0) at process_request.c:899
> rc = 0
> log_buf = "***@Dual-E52630v4\000\066\063\060v4", '\000' <repeats
> 3424 times>...
> __func__ = "dispatch_request"
> #4 0x0000000000464e8f in process_request (chan=0x7fffdc0008c0) at
> process_request.c:702
> rc = 0
> request = 0x7fffdc081bd0
> state = 3
> time_now = 1480485385
> auth_err = 0x0
> conn_socktype = 2
> conn_authen = 1
> sfds = 9
> #5 0x00000000004c4805 in process_pbs_server_port (sock=9,
> is_scheduler_port=0, args=0x7fffe40008e0) at incoming_request.c:162
> protocol_type = 2
> rc = 0
> log_buf = '\000' <repeats 3992 times>...
> chan = 0x7fffdc0008c0
> __func__ = "process_pbs_server_port"
> #6 0x00000000004c4ac9 in start_process_pbs_server_port
> (new_sock=0x7fffe40008e0) at incoming_request.c:270
> args = 0x7fffe40008e0
> sock = 9
> rc = 0
> #7 0x00000000004fc495 in work_thread (a=0x5110710) at u_threadpool.c:318
> __clframe = {__cancel_routine = 0x4fc071 <work_cleanup(void*)>,
> __cancel_arg = 0x5110710, __do_it = 1, __cancel_type = 0}
> __clframe = {__cancel_routine = 0x4fbf64
> <work_thread_cleanup(void*)>, __cancel_arg = 0x5110710, __do_it = 1,
> __cancel_type = 0}
> tp = 0x5110710
> rc = 0
> func = 0x4c4a4d <start_process_pbs_server_port(void*)>
> arg = 0x7fffe40008e0
> mywork = 0x7fffe4000b80
> working = {next = 0x0, working_id = 140737018590976}
> ts = {tv_sec = 0, tv_nsec = 0}
> __func__ = "work_thread"
> #8 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
> pthread_create.c:333
> __res = <optimized out>
> pd = 0x7fffe3fff700
> now = <optimized out>
> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737018590976,
> -786842131623855334, 0, 140737272078415, 140737018591680, 0,
> 786815742786911002, 786861764219616026},
> mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
> data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
> not_first_call = <optimized out>
> pagesize_m1 = <optimized out>
> sp = <optimized out>
> freesize = <optimized out>
> ---Type <return> to continue, or q <return> to quit---
> __PRETTY_FUNCTION__ = "start_thread"
> #9 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
> No locals.
> (gdb) info registers
> rax 0x0 0
> rbx 0x7fffe3fff700 140737018590976
> rcx 0x0 0
> rdx 0x51118d0 85006544
> rsi 0x0 0
> rdi 0x51118d0 85006544
> rbp 0x0 0x0
> rsp 0x7fffe3ffefc0 0x7fffe3ffefc0
> r8 0x0 0
> r9 0x1 1
> r10 0x7fffdc0c8295 140736885195413
> r11 0x0 0
> r12 0x0 0
> r13 0x7ffff31be04f 140737272078415
> r14 0x7fffe3fff9c0 140737018591680
> r15 0x0 0
> rip 0x7ffff616582d 0x7ffff616582d <clone+109>
> eflags 0x10246 [ PF ZF IF RF ]
> cs 0x33 51
> ss 0x2b 43
> ds 0x0 0
> es 0x0 0
> fs 0x0 0
> gs 0x0 0
> (gdb) x/16i $pc
> => 0x7ffff616582d <clone+109>: mov %rax,%rdi
> 0x7ffff6165830 <clone+112>: callq 0x7ffff612ab60 <__GI__exit>
> 0x7ffff6165835 <clone+117>: mov 0x2bc63c(%rip),%rcx #
> 0x7ffff6421e78
> 0x7ffff616583c <clone+124>: neg %eax
> 0x7ffff616583e <clone+126>: mov %eax,%fs:(%rcx)
> 0x7ffff6165841 <clone+129>: or $0xffffffffffffffff,%rax
> 0x7ffff6165845 <clone+133>: retq
> 0x7ffff6165846: nopw %cs:0x0(%rax,%rax,1)
> 0x7ffff6165850 <lseek64>: mov $0x8,%eax
> 0x7ffff6165855 <lseek64+5>: syscall
> 0x7ffff6165857 <lseek64+7>: cmp $0xfffffffffffff001,%rax
> 0x7ffff616585d <lseek64+13>: jae 0x7ffff6165860 <lseek64+16>
> 0x7ffff616585f <lseek64+15>: retq
> 0x7ffff6165860 <lseek64+16>: mov 0x2bc611(%rip),%rcx #
> 0x7ffff6421e78
> 0x7ffff6165867 <lseek64+23>: neg %eax
> 0x7ffff6165869 <lseek64+25>: mov %eax,%fs:(%rcx)
> (gdb) thread apply all backtrace
>
> Thread 12 (Thread 0x7fffe2ffd700 (LWP 37145)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
> x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe2ffd700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 11 (Thread 0x7fffe37fe700 (LWP 37133)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
> x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 10 (Thread 0x7fffe3fff700 (LWP 37132)):
> #0 __lll_unlock_elision (lock=0x51118d0, private=0) at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> #1 0x0000000000465e0f in unlock_queue (the_queue=0x512ced0, id=0x522704
> "req_selectjobs", msg=0x0, logging=0) at queue_func.c:189
> #2 0x000000000049384a in req_selectjobs (preq=0x7fffdc081bd0) at
> req_select.c:347
> #3 0x00000000004652f4 in dispatch_request (sfds=9,
> request=0x7fffdc081bd0) at process_request.c:899
> #4 0x0000000000464e8f in process_request (chan=0x7fffdc0008c0) at
> process_request.c:702
> #5 0x00000000004c4805 in process_pbs_server_port (sock=9,
> is_scheduler_port=0, args=0x7fffe40008e0) at incoming_request.c:162
> #6 0x00000000004c4ac9 in start_process_pbs_server_port
> (new_sock=0x7fffe40008e0) at incoming_request.c:270
> #7 0x00000000004fc495 in work_thread (a=0x5110710) at u_threadpool.c:318
> #8 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
> pthread_create.c:333
> #9 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 9 (Thread 0x7ffff09bb700 (LWP 36971)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
> x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 7 (Thread 0x7ffff11bc700 (LWP 36969)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x0000000000476913 in remove_completed_jobs (vp=0x0) at
> req_jobobit.c:3759
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 6 (Thread 0x7ffff19bd700 (LWP 36968)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x00000000004afb93 in remove_extra_recycle_jobs (vp=0x0) at
> job_recycler.c:216
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 5 (Thread 0x7ffff21be700 (LWP 36967)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x00000000004bc853 in inspect_exiting_jobs (vp=0x0) at
> exiting_jobs.c:319
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 4 (Thread 0x7ffff29bf700 (LWP 36966)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
> ../sysdeps/posix/sleep.c:55
> #2 0x0000000000460769 in handle_queue_routing_retries (vp=0x0) at
> pbsd_main.c:1079
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> ---Type <return> to continue, or q <return> to quit---
> Thread 3 (Thread 0x7ffff31c0700 (LWP 36965)):
> #0 0x00007ffff6ee17bd in accept () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff750a276 in start_listener_addrinfo
> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
> process_meth=0x4c4a4d <start_process_pbs_server_port(void*)>)
> at ../Libnet/server_core.c:398
> #2 0x00000000004608cf in start_accept_listener (vp=0x0) at
> pbsd_main.c:1141
> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
> pthread_create.c:333
> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 2 (Thread 0x7ffff39c1700 (LWP 36964)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/
> x86_64/pthread_cond_wait.S:185
> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at u_threadpool.c:272
> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
> pthread_create.c:333
> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/
> x86_64/clone.S:109
>
> Thread 1 (Thread 0x7ffff7fd5740 (LWP 36960)):
> #0 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-
> template.S:84
> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
> ../sysdeps/posix/usleep.c:32
> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
> pbsd_main.c:1935
> (gdb) quit
> A debugging session is active.
>
> Inferior 1 [process 36960] will be killed.
>
> Quit anyway? (y or n) y
>
>
>
>
> On Tue, Nov 29, 2016 at 8:53 AM, David Beer <***@adaptivecomputing.com>
> wrote:
>
>> Kazu,
>>
>> I'm shocked you're seeing so many issues. Can you send a backtrace? These
>> logs don't show anything sinister.
>>
>> On Wed, Nov 23, 2016 at 9:52 PM, Kazuhiro Fujita <
>> ***@gmail.com> wrote:
>>
>>> David,
>>>
>>> I reinstalled torque 6.0-dev without updating from GitHub.
>>> This time I could restart all the torque daemons,
>>> but the qsub command crashed pbs_server and pbs_sched.
>>> I have attached the log files to this mail.
>>>
>>> Best,
>>> Kazu
>>>
>>> Before the crash:
>>>
>>>> # build and install torque
>>>> ./configure
>>>> make
>>>> sudo make install
>>>> # Set the correct host name of the server
>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>> # configure and start trqauthd
>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>> sudo update-rc.d trqauthd defaults
>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>> sudo ldconfig
>>>> sudo service trqauthd start
>>>> # Initialize serverdb by executing the torque.setup script
>>>> sudo ./torque.setup $USER
>>>> sudo qmgr -c "p s"
>>>> # stop pbs_server and trqauthd daemons for setting nodes.
>>>> sudo qterm
>>>> sudo service trqauthd stop
>>>> ps aux | grep pbs
>>>> ps aux | grep trq
>>>> # set nodes
>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo
>>>> tee /var/spool/torque/server_priv/nodes
>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>> # set the head node
>>>> echo "\$pbsserver $HOSTNAME" | sudo tee /var/spool/torque/mom_priv/con
>>>> fig
>>>> # configure other torque daemons
>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>> sudo update-rc.d pbs_server defaults
>>>> sudo update-rc.d pbs_sched defaults
>>>> sudo update-rc.d pbs_mom defaults
>>>> # restart torque daemons
>>>> sudo service trqauthd start
>>>> sudo service pbs_server start
>>>> ps aux | grep pbs
>>>> ps aux | grep trq
>>>> sudo service pbs_sched start
>>>> sudo service pbs_mom start
>>>> ps aux | grep pbs
>>>> ps aux | grep trq
>>>> # check configuration of computation nodes
>>>> pbsnodes -a
>>>
>>>
>>> $ ps aux | grep trq
>>> root 19130 0.0 0.0 109112 3756 ? S 13:25 0:00
>>> /usr/local/sbin/trqauthd
>>> comp_ad+ 19293 0.0 0.0 15236 1020 pts/8 S+ 13:28 0:00 grep
>>> --color=auto trq
>>> $ ps aux | grep pbs
>>> root 19175 0.0 0.0 695136 23640 ? Sl 13:26 0:00
>>> /usr/local/sbin/pbs_server
>>> root 19224 0.0 0.0 37996 4936 ? Ss 13:27 0:00
>>> /usr/local/sbin/pbs_sched
>>> root 19265 0.1 0.2 173776 136692 ? SLsl 13:27 0:00
>>> /usr/local/sbin/pbs_mom
>>> comp_ad+ 19295 0.0 0.0 15236 924 pts/8 S+ 13:28 0:00 grep
>>> --color=auto pbs
>>>
>>> The subsequent qsub command crashed pbs_server and pbs_sched.
>>>
>>> $ echo "sleep 30" | qsub
>>> 0.Dual-E52630v4
>>> $ ps aux | grep trq
>>> root 19130 0.0 0.0 109112 4268 ? S 13:25 0:00
>>> /usr/local/sbin/trqauthd
>>> comp_ad+ 19309 0.0 0.0 15236 1020 pts/8 S+ 13:28 0:00 grep
>>> --color=auto trq
>>> $ ps aux | grep pbs
>>> root 19265 0.1 0.2 173776 136688 ? SLsl 13:27 0:00
>>> /usr/local/sbin/pbs_mom
>>> comp_ad+ 19311 0.0 0.0 15236 1016 pts/8 S+ 13:28 0:00 grep
>>> --color=auto pbs
>>>
>>>
>>>
>>>
>>> On Fri, Nov 18, 2016 at 4:21 AM, David Beer <***@adaptivecomputing.com
>>> > wrote:
>>>
>>>> Kazu,
>>>>
>>>> Did you look at the server logs?
>>>>
>>>> On Wed, Nov 16, 2016 at 12:24 AM, Kazuhiro Fujita <
>>>> ***@gmail.com> wrote:
>>>>
>>>>> David,
>>>>>
>>>>> I did not find the pbs_server process after executing the commands
>>>>> shown below.
>>>>>
>>>>> sudo service trqauthd start
>>>>>> sudo service pbs_server start
>>>>>
>>>>>
>>>>> I am not sure what it did.
>>>>>
>>>>> Best,
>>>>> Kazu
>>>>>
>>>>>
>>>>> On Wed, Nov 16, 2016 at 8:10 AM, David Beer <
>>>>> ***@adaptivecomputing.com> wrote:
>>>>>
>>>>>> Kazu,
>>>>>>
>>>>>> What did it do when it failed to start?
>>>>>>
>>>>>> On Wed, Nov 9, 2016 at 9:33 PM, Kazuhiro Fujita <
>>>>>> ***@gmail.com> wrote:
>>>>>>
>>>>>>> David,
>>>>>>>
>>>>>>> In the last mail I sent, I had reinstalled 6.0-dev on the wrong server, as
>>>>>>> you can see in the output (E5-2630v3).
>>>>>>> On an E5-2630v4 server, pbs_server failed to restart as a daemon
>>>>>>> after "./torque.setup $USER".
>>>>>>>
>>>>>>> Before crash:
>>>>>>>
>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>> cd 6.0-dev
>>>>>>>> ./autogen.sh
>>>>>>>> # build and install torque
>>>>>>>> ./configure
>>>>>>>> make
>>>>>>>> sudo make install
>>>>>>>> # Set the correct name of the server
>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>> # configure and start trqauthd
>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>> sudo ldconfig
>>>>>>>> sudo service trqauthd start
>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>> sudo ./torque.setup $USER
>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>> sudo qterm
>>>>>>>> sudo service trqauthd stop
>>>>>>>> ps aux | grep pbs
>>>>>>>> ps aux | grep trq
>>>>>>>> # set nodes
>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" |
>>>>>>>> sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>> # set the head node
>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>> # configure other daemons
>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>> # restart torque daemons
>>>>>>>> sudo service trqauthd start
>>>>>>>> sudo service pbs_server start
>>>>>>>
>>>>>>>
>>>>>>> Then pbs_server did not start, so I started pbs_server under gdb.
>>>>>>> However, pbs_server under gdb did not crash even after running qsub
>>>>>>> and qstat from another terminal,
>>>>>>> so I stopped pbs_server in gdb with Ctrl+C.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kazu
>>>>>>>
>>>>>>> gdb output
>>>>>>>
>>>>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>>> copying"
>>>>>>>> and "show warranty" for details.
>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>> For bug reporting instructions, please see:
>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>> For help, type "help".
>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>> (gdb) r -D
>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>> ad_db.so.1".
>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 35864)]
>>>>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to
>>>>>>>> open tcp connection - connect() failed [rc = -2] [addr =
>>>>>>>> 10.0.0.249:15003]
>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 35865)]
>>>>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>>>>> hierarchy to host Dual-E52630v4:15003
>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 35866)]
>>>>>>>> [New Thread 0x7ffff21be700 (LWP 35867)]
>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 35868)]
>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 35869)]
>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 35870)]
>>>>>>>> [Thread 0x7ffff09bb700 (LWP 35870) exited]
>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 35871)]
>>>>>>>> [New Thread 0x7fffe3fff700 (LWP 36003)]
>>>>>>>> [New Thread 0x7fffe37fe700 (LWP 36004)]
>>>>>>>> [New Thread 0x7fffe2ffd700 (LWP 36011)]
>>>>>>>> [New Thread 0x7fffe21ce700 (LWP 36016)]
>>>>>>>> [Thread 0x7fffe21ce700 (LWP 36016) exited]
>>>>>>>> ^C
>>>>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>>>>> 0x00007ffff612a75d in nanosleep () at ../sysdeps/unix/syscall-templa
>>>>>>>> te.S:84
>>>>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>>>>> (gdb) bt
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>> pbsd_main.c:1935
>>>>>>>> (gdb) backtrace full
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> No locals.
>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>>>>> state = 3
>>>>>>>> waittime = 5
>>>>>>>> pjob = 0x313a74
>>>>>>>> iter = 0x0
>>>>>>>> when = 1478748888
>>>>>>>> log = 0
>>>>>>>> scheduling = 1
>>>>>>>> sched_iteration = 600
>>>>>>>> time_now = 1478748970
>>>>>>>> update_loglevel = 1478748979
>>>>>>>> log_buf = "Server Ready, pid = 35860, loglevel=0", '\000'
>>>>>>>> <repeats 139 times>, "c\000\000\000\000\000\000\000
>>>>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177", '\000'
>>>>>>>> <repeats 26 times>...
>>>>>>>> sem_val = 5229209
>>>>>>>> __func__ = "main_loop"
>>>>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>> pbsd_main.c:1935
>>>>>>>> i = 2
>>>>>>>> rc = 0
>>>>>>>> local_errno = 0
>>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>>> '\000' <repeats 983 times>
>>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>>>>> server_name_file_port = 15001
>>>>>>>> fp = 0x51095f0
>>>>>>>> (gdb) info registers
>>>>>>>> rax 0xfffffffffffffdfc -516
>>>>>>>> rbx 0x6 6
>>>>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>>>>> rdx 0x0 0
>>>>>>>> rsi 0x0 0
>>>>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>>> r8 0x0 0
>>>>>>>> r9 0x4000001 67108865
>>>>>>>> r10 0x1 1
>>>>>>>> r11 0x293 659
>>>>>>>> r12 0x4260b0 4350128
>>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>>> r14 0x0 0
>>>>>>>> r15 0x0 0
>>>>>>>> rip 0x461f92 0x461f92 <main(int, char**)+2388>
>>>>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>>>>> cs 0x33 51
>>>>>>>> ss 0x2b 43
>>>>>>>> ds 0x0 0
>>>>>>>> es 0x0 0
>>>>>>>> fs 0x0 0
>>>>>>>> gs 0x0 0
>>>>>>>> (gdb) x/16i $pc
>>>>>>>> => 0x461f92 <main(int, char**)+2388>: callq 0x49484c
>>>>>>>> <shutdown_ack()>
>>>>>>>> 0x461f97 <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>>>>> 0x461f9c <main(int, char**)+2398>: callq 0x4250b0
>>>>>>>> <***@plt>
>>>>>>>> 0x461fa1 <main(int, char**)+2403>: mov 0x70f5c0(%rip),%rdx
>>>>>>>> # 0xb71568 <msg_svrdown>
>>>>>>>> 0x461fa8 <main(int, char**)+2410>: mov 0x70ef51(%rip),%rax
>>>>>>>> # 0xb70f00 <msg_daemonname>
>>>>>>>> 0x461faf <main(int, char**)+2417>: mov %rdx,%rcx
>>>>>>>> 0x461fb2 <main(int, char**)+2420>: mov %rax,%rdx
>>>>>>>> 0x461fb5 <main(int, char**)+2423>: mov $0x1,%esi
>>>>>>>> 0x461fba <main(int, char**)+2428>: mov $0x8002,%edi
>>>>>>>> 0x461fbf <main(int, char**)+2433>: callq 0x425840
>>>>>>>> <***@plt>
>>>>>>>> 0x461fc4 <main(int, char**)+2438>: mov $0x0,%edi
>>>>>>>> 0x461fc9 <main(int, char**)+2443>: callq 0x4269c9
>>>>>>>> <acct_close(bool)>
>>>>>>>> 0x461fce <main(int, char**)+2448>: mov $0xb6ce00,%edi
>>>>>>>> 0x461fd3 <main(int, char**)+2453>: callq 0x425a00
>>>>>>>> <***@plt>
>>>>>>>> 0x461fd8 <main(int, char**)+2458>: mov $0x1,%edi
>>>>>>>> 0x461fdd <main(int, char**)+2463>: callq 0x424db0
>>>>>>>> <***@plt>
>>>>>>>> (gdb) thread apply all backtrace
>>>>>>>> Thread 12 (Thread 0x7fffe2ffd700 (LWP 36011)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe2ffd700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 36004)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 36003)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110710) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 35871)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 35869)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>> #2 0x0000000000476913 in remove_completed_jobs (vp=0x0) at
>>>>>>>> req_jobobit.c:3759
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 35868)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>> #2 0x00000000004afb93 in remove_extra_recycle_jobs (vp=0x0) at
>>>>>>>> job_recycler.c:216
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 35867)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>> #2 0x00000000004bc853 in inspect_exiting_jobs (vp=0x0) at
>>>>>>>> exiting_jobs.c:319
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 35866)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>> #2 0x0000000000460769 in handle_queue_routing_retries (vp=0x0) at
>>>>>>>> pbsd_main.c:1079
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 35865)):
>>>>>>>> #0 0x00007ffff6ee17bd in accept () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>>>>> process_meth=0x4c4a4d <start_process_pbs_server_port(void*)>)
>>>>>>>> at ../Libnet/server_core.c:398
>>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>>> #2 0x00000000004608cf in start_accept_listener (vp=0x0) at
>>>>>>>> pbsd_main.c:1141
>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #4 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 35864)):
>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>> #1 0x00000000004fc2b4 in work_thread (a=0x5110810) at
>>>>>>>> u_threadpool.c:272
>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>>>>> pthread_create.c:333
>>>>>>>> #3 0x00007ffff616582d in clone () at ../sysdeps/unix/sysv/linux/x86
>>>>>>>> _64/clone.S:109
>>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 35860)):
>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>> #2 0x0000000000461216 in main_loop () at pbsd_main.c:1454
>>>>>>>> #3 0x0000000000461f92 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>> pbsd_main.c:1935
>>>>>>>> (gdb) quit
>>>>>>>> A debugging session is active.
>>>>>>>> Inferior 1 [process 35860] will be killed.
>>>>>>>> Quit anyway? (y or n) y
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Commands executed from another terminal after starting pbs_server
>>>>>>> under gdb (r -D):
>>>>>>>
>>>>>>>> $ sudo service pbs_sched start
>>>>>>>> $ sudo service pbs_mom start
>>>>>>>> $ pbsnodes -a
>>>>>>>> Dual-E52630v4
>>>>>>>> state = free
>>>>>>>> power_state = Running
>>>>>>>> np = 4
>>>>>>>> ntype = cluster
>>>>>>>> status = rectime=1478748911,macaddr=34:
>>>>>>>> 97:f6:5d:09:a6,cpuclock=Fixed,varattr=,jobs=,state=free,netl
>>>>>>>> oad=322618417,gres=,loadave=0.06,ncpus=40,physmem=65857216kb
>>>>>>>> ,availmem=131970532kb,totmem=132849340kb,idletime=108,nusers=4,nsessions=17,sessions=1036
>>>>>>>> 1316 1327 1332 1420 1421 1422 1423 1424 1425 1426 1430 1471 1510 27075
>>>>>>>> 27130 35902,uname=Linux Dual-E52630v4 4.4.0-45-generic #66-Ubuntu SMP Wed
>>>>>>>> Oct 19 14:12:37 UTC 2016 x86_64,opsys=linux
>>>>>>>> mom_service_port = 15002
>>>>>>>> mom_manager_port = 15003
>>>>>>>> $ echo "sleep 30" | qsub
>>>>>>>> 0.Dual-E52630v4
>>>>>>>> $ qstat
>>>>>>>> Job ID Name User Time Use
>>>>>>>> S Queue
>>>>>>>> ------------------------- ---------------- --------------- --------
>>>>>>>> - -----
>>>>>>>> 0.Dual-E52630v4 STDIN comp_admin
>>>>>>>> 0 Q batch
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 10, 2016 at 12:01 PM, Kazuhiro Fujita <
>>>>>>> ***@gmail.com> wrote:
>>>>>>>
>>>>>>>> David,
>>>>>>>>
>>>>>>>> Now, it works. Thank you.
>>>>>>>> But jobs are executed in LIFO order, as I previously observed on an
>>>>>>>> E5-2630v3 server...
>>>>>>>> Below is the output of 'qstat -t' after running 'echo "sleep 10" |
>>>>>>>> qsub -t 1-10' three times.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kazu
>>>>>>>>
>>>>>>>> $ qstat -t
>>>>>>>> Job ID Name User Time Use
>>>>>>>> S Queue
>>>>>>>> ------------------------- ---------------- --------------- --------
>>>>>>>> - -----
>>>>>>>> 0.Dual-E5-2630v3 STDIN comp_admin
>>>>>>>> 00:00:00 C batch
>>>>>>>> 1[1].Dual-E5-2630v3 STDIN-1 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 1[2].Dual-E5-2630v3 STDIN-2 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 1[3].Dual-E5-2630v3 STDIN-3 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 1[4].Dual-E5-2630v3 STDIN-4 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 1[5].Dual-E5-2630v3 STDIN-5 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 1[6].Dual-E5-2630v3 STDIN-6 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 1[7].Dual-E5-2630v3 STDIN-7 comp_admin
>>>>>>>> 00:00:00 C batch
>>>>>>>> 1[8].Dual-E5-2630v3 STDIN-8 comp_admin
>>>>>>>> 00:00:00 C batch
>>>>>>>> 1[9].Dual-E5-2630v3 STDIN-9 comp_admin
>>>>>>>> 00:00:00 C batch
>>>>>>>> 1[10].Dual-E5-2630v3 STDIN-10 comp_admin
>>>>>>>> 00:00:00 C batch
>>>>>>>> 2[1].Dual-E5-2630v3 STDIN-1 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[2].Dual-E5-2630v3 STDIN-2 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[3].Dual-E5-2630v3 STDIN-3 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[4].Dual-E5-2630v3 STDIN-4 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[5].Dual-E5-2630v3 STDIN-5 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[6].Dual-E5-2630v3 STDIN-6 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[7].Dual-E5-2630v3 STDIN-7 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[8].Dual-E5-2630v3 STDIN-8 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[9].Dual-E5-2630v3 STDIN-9 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 2[10].Dual-E5-2630v3 STDIN-10 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[1].Dual-E5-2630v3 STDIN-1 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[2].Dual-E5-2630v3 STDIN-2 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[3].Dual-E5-2630v3 STDIN-3 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[4].Dual-E5-2630v3 STDIN-4 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[5].Dual-E5-2630v3 STDIN-5 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[6].Dual-E5-2630v3 STDIN-6 comp_admin
>>>>>>>> 0 Q batch
>>>>>>>> 3[7].Dual-E5-2630v3 STDIN-7 comp_admin
>>>>>>>> 0 R batch
>>>>>>>> 3[8].Dual-E5-2630v3 STDIN-8 comp_admin
>>>>>>>> 0 R batch
>>>>>>>> 3[9].Dual-E5-2630v3 STDIN-9 comp_admin
>>>>>>>> 0 R batch
>>>>>>>> 3[10].Dual-E5-2630v3 STDIN-10 comp_admin
>>>>>>>> 0 R batch
>>>>>>>>
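[Editor's note: the FIFO-vs-LIFO symptom above can be illustrated with a small, hypothetical Python sketch (plain Python, not TORQUE code). A FIFO scheduler would start with job 1[1], while a LIFO one starts from the most recently queued array, which matches the `R` states on 3[7]-3[10] in the listing.]

```python
from collections import deque

def next_jobs(queue, order, n):
    """Pick the next n jobs from the queue in 'fifo' or 'lifo' order."""
    q = deque(queue)
    picked = []
    for _ in range(min(n, len(q))):
        # FIFO takes from the front (oldest job), LIFO from the back (newest).
        picked.append(q.popleft() if order == "fifo" else q.pop())
    return picked

# Three arrays of ten tasks each, queued in submission order: 1[1]..3[10].
jobs = [f"{a}[{i}]" for a in (1, 2, 3) for i in range(1, 11)]
print(next_jobs(jobs, "fifo", 4))  # expected scheduling: 1[1]..1[4]
print(next_jobs(jobs, "lifo", 4))  # observed scheduling: 3[10]..3[7]
```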
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 10, 2016 at 3:07 AM, David Beer <
>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>
>>>>>>>>> Kazu,
>>>>>>>>>
>>>>>>>>> I was able to get a system to reproduce this error. I have now
>>>>>>>>> checked in another fix, and I can no longer reproduce this. Can you pull
>>>>>>>>> the latest and let me know if it fixes it for you?
>>>>>>>>>
>>>>>>>>> On Tue, Nov 8, 2016 at 2:06 AM, Kazuhiro Fujita <
>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi David,
>>>>>>>>>>
>>>>>>>>>> I reinstalled 6.0-dev from GitHub today and observed slightly
>>>>>>>>>> different behavior, I think.
>>>>>>>>>> I used the "service" command to start the daemons this time.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Kazu
>>>>>>>>>>
>>>>>>>>>> Before the crash:
>>>>>>>>>>
>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>>>>> cd 6.0-dev
>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>> # build and install torque
>>>>>>>>>>> ./configure
>>>>>>>>>>> make
>>>>>>>>>>> sudo make install
>>>>>>>>>>> # Set the correct name of the server
>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>> # configure and start trqauthd
>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>> sudo qterm
>>>>>>>>>>> sudo service trqauthd stop
>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>> ps aux | grep trq
>>>>>>>>>>> # set nodes
>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`"
>>>>>>>>>>> | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>>>>> # set the head node
>>>>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>>>>> # configure other daemons
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>> # start torque daemons
>>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>> sudo service pbs_server start
>>>>>>>>>>> sudo service pbs_sched start
>>>>>>>>>>> sudo service pbs_mom start
>>>>>>>>>>> # check configuration of computation nodes
>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>
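[Editor's note: the nodes-file line built above with `grep processor | wc -l` can be sketched with `nproc`, which counts the same processor entries; this is a hypothetical equivalent, not one of the original commands.]

```shell
# Build the server_priv/nodes line "<hostname> np=<cpu count>" for this host.
# nproc reports the number of available processing units, equivalent here to
# grepping /proc/cpuinfo for "processor" lines and counting them.
host="${HOSTNAME:-$(hostname 2>/dev/null || echo localhost)}"
np="$(nproc)"
line="$host np=$np"
echo "$line"
# This line would then be written to /var/spool/torque/server_priv/nodes.
```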
>>>>>>>>>>
>>>>>>>>>> I checked the torque processes with "ps aux | grep pbs" and "ps aux |
>>>>>>>>>> grep trq" several times.
>>>>>>>>>> After "pbsnodes -a", everything seemed OK.
>>>>>>>>>> But the next qsub command seems to trigger a crash of "pbs_server"
>>>>>>>>>> and "pbs_sched".
>>>>>>>>>>
>>>>>>>>>> $ ps aux | grep trq
>>>>>>>>>>> root 9682 0.0 0.0 109112 3632 ? S 17:39 0:00
>>>>>>>>>>> /usr/local/sbin/trqauthd
>>>>>>>>>>> comp_ad+ 9842 0.0 0.0 15236 936 pts/8 S+ 17:40 0:00
>>>>>>>>>>> grep --color=auto trq
>>>>>>>>>>> $ ps aux | grep pbs
>>>>>>>>>>> root 9720 0.0 0.0 695140 25760 ? Sl 17:39 0:00
>>>>>>>>>>> /usr/local/sbin/pbs_server
>>>>>>>>>>> root 9771 0.0 0.0 37996 4940 ? Ss 17:39 0:00
>>>>>>>>>>> /usr/local/sbin/pbs_sched
>>>>>>>>>>> root 9814 0.2 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>>>>>>> /usr/local/sbin/pbs_mom
>>>>>>>>>>> comp_ad+ 9844 0.0 0.0 15236 1012 pts/8 S+ 17:40 0:00
>>>>>>>>>>> grep --color=auto pbs
>>>>>>>>>>> $ echo "sleep 30" | qsub
>>>>>>>>>>> 0.Dual-E52630v4
>>>>>>>>>>> $ ps aux | grep pbs
>>>>>>>>>>> root 9814 0.1 0.2 173776 136692 ? SLsl 17:40 0:00
>>>>>>>>>>> /usr/local/sbin/pbs_mom
>>>>>>>>>>> comp_ad+ 9855 0.0 0.0 15236 928 pts/8 S+ 17:41 0:00
>>>>>>>>>>> grep --color=auto pbs
>>>>>>>>>>> $ ps aux | grep trq
>>>>>>>>>>> root 9682 0.0 0.0 109112 4144 ? S 17:39 0:00
>>>>>>>>>>> /usr/local/sbin/trqauthd
>>>>>>>>>>> comp_ad+ 9860 0.0 0.0 15236 1092 pts/8 S+ 17:41 0:00
>>>>>>>>>>> grep --color=auto trq
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Then I stopped the remaining processes,
>>>>>>>>>>
>>>>>>>>>> sudo service pbs_mom stop
>>>>>>>>>>> sudo service trqauthd stop
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> and started "trqauthd" again, then ran "pbs_server" under gdb.
>>>>>>>>>> "pbs_server" crashed in gdb without any other commands.
>>>>>>>>>>
>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>>>>> copying"
>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>>> For help, type "help".
>>>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>>> (gdb) r -D
>>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>>>> ad_db.so.1".
>>>>>>>>>>
>>>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>>>> __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file
>>>>>>>>>> or directory.
>>>>>>>>>> (gdb) bt
>>>>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>>>>> at svr_jobfunc.c:4011
>>>>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1)
>>>>>>>>>> at pbsd_init.c:2558
>>>>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>>>>> pbsd_init.c:1803
>>>>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1)
>>>>>>>>>> at pbsd_init.c:2100
>>>>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>> pbsd_main.c:1898
>>>>>>>>>> (gdb) backtrace full
>>>>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>> No locals.
>>>>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>>>>> at svr_jobfunc.c:4011
>>>>>>>>>> rc = 0
>>>>>>>>>> err_msg = 0x0
>>>>>>>>>> stub_msg = "no pos"
>>>>>>>>>> __func__ = "unlock_ji_mutex"
>>>>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>>>>> pattrjb = 0x7fffffff4a10
>>>>>>>>>> pdef = 0x4
>>>>>>>>>> pque = 0x0
>>>>>>>>>> rc = 0
>>>>>>>>>> log_buf = '\000' <repeats 24 times>,
>>>>>>>>>> "\030\000\000\000\060\000\000\000PU\377\377\377\177\000\000\220T\377\377\377\177",
>>>>>>>>>> '\000' <repeats 50 times>, "\003\000\000\000\000\000\000\
>>>>>>>>>> 000#\000\000\000\000\000\000\000pO\377\377\377\177", '\000'
>>>>>>>>>> <repeats 26 times>, "\221\260\000\000\000\200\377\
>>>>>>>>>> 377oO\377\377\377\177\000\000H+B\366\377\177\000\000p+B\366\
>>>>>>>>>> 377\177\000\000\200O\377\377\377\177\000\000\201\260\000\000
>>>>>>>>>> \000\200\377\377\177O\377\377\377\177", '\000' <repeats 18
>>>>>>>>>> times>...
>>>>>>>>>> time_now = 1478594788
>>>>>>>>>> job_id = "0.Dual-E52630v4\000\000\000\0
>>>>>>>>>> 00\000\000\000\000\000\362\377\377\377\377\377\377\377\340J\
>>>>>>>>>> 377\377\377\177\000\000\060L\377\377\377\177\000\000\001\000
>>>>>>>>>> \000\000\000\000\000\000\244\201\000\000\001\000\000\000\030
>>>>>>>>>> \354\377\367\377\177\000\***@L\377\377\377\177\000\000\000\0
>>>>>>>>>> 00\000\000\005\000\000\220\r\000\000\000\000\000\000\000k\02
>>>>>>>>>> 2j\365\377\177\000\000\031J\377\377\377\177\000\000\201n\376
>>>>>>>>>> \017\000\000\000\000\\\216!X\000\000\000\000_#\343+\000\000\
>>>>>>>>>> 000\000\\\216!X\000\000\000\000\207\065],", '\000' <repeats 36
>>>>>>>>>> times>, "k\022j\365\377\177\000\000\30
>>>>>>>>>> 0K\377\377\377\177\000\000\000\000\000\000\000\000\000\000"...
>>>>>>>>>> queue_name = "batch\000\377\377\240\340\377
>>>>>>>>>> \367\377\177\000"
>>>>>>>>>> total_jobs = 0
>>>>>>>>>> user_jobs = 0
>>>>>>>>>> array_jobs = 0
>>>>>>>>>> __func__ = "svr_enquejob"
>>>>>>>>>> que_mgr = {unlock_on_exit = 160, locked = 75, mutex_valid
>>>>>>>>>> = 255, managed_mutex = 0x7ffff7ddccda <open_path+474>}
>>>>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>>>>> newstate = 0
>>>>>>>>>> newsubstate = 0
>>>>>>>>>> rc = 0
>>>>>>>>>> log_buf = "pbsd_init_reque:1", '\000' <repeats 1063
>>>>>>>>>> times>...
>>>>>>>>>> __func__ = "pbsd_init_reque"
>>>>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1)
>>>>>>>>>> at pbsd_init.c:2558
>>>>>>>>>> d = 0
>>>>>>>>>> rc = 0
>>>>>>>>>> time_now = 1478594788
>>>>>>>>>> log_buf = '\000' <repeats 2112 times>...
>>>>>>>>>> local_errno = 0
>>>>>>>>>> job_id = '\000' <repeats 1016 times>...
>>>>>>>>>> job_atr_hold = 0
>>>>>>>>>> job_exit_status = 0
>>>>>>>>>> __func__ = "pbsd_init_job"
>>>>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>>>>> pbsd_init.c:1803
>>>>>>>>>> pjob = 0x512d880
>>>>>>>>>> Index = 0
>>>>>>>>>> JobArray_iter = {first = "0.Dual-E52630v4", second = }
>>>>>>>>>> log_buf = "14 total files read from
>>>>>>>>>> disk\000\000\000\000\000\000\000\001\000\000\000\320\316\022
>>>>>>>>>> \005\000\000\000\000\220N\022\005", '\000' <repeats 12 times>,
>>>>>>>>>> "Expected 1, recovered 1 queues", '\000' <repeats 1330 times>...
>>>>>>>>>> rc = 0
>>>>>>>>>> job_rc = 0
>>>>>>>>>> logtype = 0
>>>>>>>>>> pdirent = 0x0
>>>>>>>>>> pdirent_sub = 0x0
>>>>>>>>>> dir = 0x5124e90
>>>>>>>>>> dir_sub = 0x0
>>>>>>>>>> had = 0
>>>>>>>>>> pjob = 0x0
>>>>>>>>>> time_now = 1478594788
>>>>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>>>>> basen = '\000' <repeats 1088 times>...
>>>>>>>>>> use_jobs_subdirs = 0
>>>>>>>>>> __func__ = "handle_job_recovery"
>>>>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1)
>>>>>>>>>> at pbsd_init.c:2100
>>>>>>>>>> rc = 0
>>>>>>>>>> tmp_rc = 1974134615
>>>>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>>>>> ret = 0
>>>>>>>>>> gid = 0
>>>>>>>>>> log_buf = "pbsd_init:1", '\000' <repeats 997 times>...
>>>>>>>>>> __func__ = "pbsd_init"
>>>>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>> pbsd_main.c:1898
>>>>>>>>>> i = 2
>>>>>>>>>> rc = 0
>>>>>>>>>> local_errno = 0
>>>>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>>>>> '\000' <repeats 983 times>
>>>>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>>>>> tmpLine = "Server Dual-E52630v4 started, initialization
>>>>>>>>>> type = 1", '\000' <repeats 970 times>
>>>>>>>>>> log_buf = "Server Dual-E52630v4 started, initialization
>>>>>>>>>> type = 1", '\000' <repeats 1139 times>...
>>>>>>>>>> server_name_file_port = 15001
>>>>>>>>>> fp = 0x51095f0
>>>>>>>>>> (gdb) info registers
>>>>>>>>>> rax 0x0 0
>>>>>>>>>> rbx 0x6 6
>>>>>>>>>> rcx 0x0 0
>>>>>>>>>> rdx 0x512f1b0 85127600
>>>>>>>>>> rsi 0x0 0
>>>>>>>>>> rdi 0x512f1b0 85127600
>>>>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>>>>> r8 0x0 0
>>>>>>>>>> r9 0x7fffffff57a2 140737488312226
>>>>>>>>>> r10 0x513c800 85182464
>>>>>>>>>> r11 0x7ffff61e6128 140737322574120
>>>>>>>>>> r12 0x4260b0 4350128
>>>>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>>>>> r14 0x0 0
>>>>>>>>>> r15 0x0 0
>>>>>>>>>> rip 0x461f29 0x461f29 <main(int, char**)+2183>
>>>>>>>>>> eflags 0x10246 [ PF ZF IF RF ]
>>>>>>>>>> cs 0x33 51
>>>>>>>>>> ss 0x2b 43
>>>>>>>>>> ds 0x0 0
>>>>>>>>>> es 0x0 0
>>>>>>>>>> fs 0x0 0
>>>>>>>>>> gs 0x0 0
>>>>>>>>>> (gdb) x/16i $pc
>>>>>>>>>> => 0x461f29 <main(int, char**)+2183>: test %eax,%eax
>>>>>>>>>> 0x461f2b <main(int, char**)+2185>: setne %al
>>>>>>>>>> 0x461f2e <main(int, char**)+2188>: test %al,%al
>>>>>>>>>> 0x461f30 <main(int, char**)+2190>: je 0x461f55 <main(int,
>>>>>>>>>> char**)+2227>
>>>>>>>>>> 0x461f32 <main(int, char**)+2192>: mov 0x70efc7(%rip),%rax
>>>>>>>>>> # 0xb70f00 <msg_daemonname>
>>>>>>>>>> 0x461f39 <main(int, char**)+2199>: mov $0x51bab2,%edx
>>>>>>>>>> 0x461f3e <main(int, char**)+2204>: mov %rax,%rsi
>>>>>>>>>> 0x461f41 <main(int, char**)+2207>: mov $0xffffffff,%edi
>>>>>>>>>> 0x461f46 <main(int, char**)+2212>: callq 0x425420
>>>>>>>>>> <***@plt>
>>>>>>>>>> 0x461f4b <main(int, char**)+2217>: mov $0x3,%edi
>>>>>>>>>> 0x461f50 <main(int, char**)+2222>: callq 0x425680 <***@plt>
>>>>>>>>>> 0x461f55 <main(int, char**)+2227>: mov 0x71021d(%rip),%esi
>>>>>>>>>> # 0xb72178 <pbs_mom_port>
>>>>>>>>>> 0x461f5b <main(int, char**)+2233>: mov 0x710227(%rip),%ecx
>>>>>>>>>> # 0xb72188 <pbs_scheduler_port>
>>>>>>>>>> 0x461f61 <main(int, char**)+2239>: mov 0x710225(%rip),%edx
>>>>>>>>>> # 0xb7218c <pbs_server_port_dis>
>>>>>>>>>> 0x461f67 <main(int, char**)+2245>: lea -0x1400(%rbp),%rax
>>>>>>>>>> 0x461f6e <main(int, char**)+2252>: mov $0xb739c0,%r9d
>>>>>>>>>> (gdb) thread apply all backtrace
>>>>>>>>>>
>>>>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 10004)):
>>>>>>>>>> #0 __lll_unlock_elision (lock=0x512f1b0, private=0) at
>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>> #1 0x00000000004a4953 in unlock_ji_mutex (pjob=0x512d880,
>>>>>>>>>> id=0x525b30 <svr_enquejob(job*, int, char const*, bool, bool)::__func__>
>>>>>>>>>> "svr_enquejob", msg=0x524554 "1", logging=0)
>>>>>>>>>> at svr_jobfunc.c:4011
>>>>>>>>>> #2 0x000000000049db0c in svr_enquejob (pjob=0x512d880,
>>>>>>>>>> has_sv_qs_mutex=1, prev_job_id=0x0, have_reservation=false,
>>>>>>>>>> being_recovered=true) at svr_jobfunc.c:421
>>>>>>>>>> #3 0x000000000045b828 in pbsd_init_reque (pjob=0x512d880,
>>>>>>>>>> change_state=1) at pbsd_init.c:2824
>>>>>>>>>> #4 0x000000000045ad93 in pbsd_init_job (pjob=0x512d880, type=1)
>>>>>>>>>> at pbsd_init.c:2558
>>>>>>>>>> #5 0x0000000000459483 in handle_job_recovery (type=1) at
>>>>>>>>>> pbsd_init.c:1803
>>>>>>>>>> #6 0x000000000045a173 in handle_job_and_array_recovery (type=1)
>>>>>>>>>> at pbsd_init.c:2100
>>>>>>>>>> #7 0x000000000045a8fe in pbsd_init (type=1) at pbsd_init.c:2316
>>>>>>>>>> #8 0x0000000000461f29 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>> pbsd_main.c:1898
>>>>>>>>>> (gdb) quit
>>>>>>>>>> A debugging session is active.
>>>>>>>>>>
>>>>>>>>>> Inferior 1 [process 10004] will be killed.
>>>>>>>>>>
>>>>>>>>>> Quit anyway? (y or n) y
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 2, 2016 at 1:43 AM, David Beer <
>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Kazu,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for sticking with us on this. You mentioned that
>>>>>>>>>>> pbs_server did not crash when you submitted the job, but you said that it
>>>>>>>>>>> and pbs_sched are "unstable." What do you mean by unstable? Will jobs run?
>>>>>>>>>>> Your gdb output looks like a pbs_server that isn't busy, but other than that
>>>>>>>>>>> it looks normal.
>>>>>>>>>>>
>>>>>>>>>>> David
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Nov 1, 2016 at 1:19 AM, Kazuhiro Fujita <
>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> David,
>>>>>>>>>>>>
>>>>>>>>>>>> I tested the 6.0-dev. It passed the "sudo ./torque.setup $USER"
>>>>>>>>>>>> script,
>>>>>>>>>>>> but pbs_server and pbs_sched are unstable like 6.1-dev.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Kazu
>>>>>>>>>>>>
>>>>>>>>>>>> Before execution of gdb
>>>>>>>>>>>>
>>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b
>>>>>>>>>>>>> 6.0-dev 6.0-dev
>>>>>>>>>>>>> cd 6.0-dev
>>>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>> make
>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>> # Set the correct name of the server
>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>> # configure and start trqauthd
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd
>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>>>> # Initialize serverdb by executing the torque.setup script
>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>
>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>> sudo qterm
>>>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>>>> # set nodes
>>>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc
>>>>>>>>>>>>> -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes
>>>>>>>>>>>>> # set the head node
>>>>>>>>>>>>> echo "\$pbsserver $HOSTNAME" | sudo tee
>>>>>>>>>>>>> /var/spool/torque/mom_priv/config
>>>>>>>>>>>>> # configure other daemons
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>> # start torque daemons
>>>>>>>>>>>>> sudo service trqauthd start
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Execution of gdb
>>>>>>>>>>>>
>>>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Commands executed from another terminal:
>>>>>>>>>>>>
>>>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The last command did not cause a crash of pbs_server. The
>>>>>>>>>>>> backtrace is described below.
>>>>>>>>>>>> $ sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>>>>> Copyright (C) 2016 Free Software Foundation, Inc.
>>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <
>>>>>>>>>>>> http://gnu.org/licenses/gpl.html>
>>>>>>>>>>>> This is free software: you are free to change and redistribute
>>>>>>>>>>>> it.
>>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type
>>>>>>>>>>>> "show copying"
>>>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu".
>>>>>>>>>>>> Type "show configuration" for configuration details.
>>>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>.
>>>>>>>>>>>> Find the GDB manual and other documentation resources online at:
>>>>>>>>>>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>>>>>>>>>> For help, type "help".
>>>>>>>>>>>> Type "apropos word" to search for commands related to "word"...
>>>>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>>>>> (gdb) r -D
>>>>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthre
>>>>>>>>>>>> ad_db.so.1".
>>>>>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 5024)]
>>>>>>>>>>>> pbs_server is up (version - 6.0, port - 15001)
>>>>>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 5025)]
>>>>>>>>>>>> PBS_Server: LOG_ERROR::tcp_connect_sockaddr, Failed when
>>>>>>>>>>>> trying to open tcp connection - connect() failed [rc = -2] [addr =
>>>>>>>>>>>> 10.0.0.249:15003]
>>>>>>>>>>>> PBS_Server: LOG_ERROR::sendHierarchyToNode, Could not send mom
>>>>>>>>>>>> hierarchy to host Dual-E52630v4:15003
>>>>>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 5026)]
>>>>>>>>>>>> [New Thread 0x7ffff21be700 (LWP 5027)]
>>>>>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 5028)]
>>>>>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 5029)]
>>>>>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5030)]
>>>>>>>>>>>> [Thread 0x7ffff09bb700 (LWP 5030) exited]
>>>>>>>>>>>> [New Thread 0x7ffff09bb700 (LWP 5031)]
>>>>>>>>>>>> [New Thread 0x7fffe3fff700 (LWP 5109)]
>>>>>>>>>>>> [New Thread 0x7fffe37fe700 (LWP 5113)]
>>>>>>>>>>>> [New Thread 0x7fffe29cf700 (LWP 5121)]
>>>>>>>>>>>> [Thread 0x7fffe29cf700 (LWP 5121) exited]
>>>>>>>>>>>> ^C
>>>>>>>>>>>> Thread 1 "pbs_server" received signal SIGINT, Interrupt.
>>>>>>>>>>>> 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> 84 ../sysdeps/unix/syscall-template.S: No such file or
>>>>>>>>>>>> directory.
>>>>>>>>>>>> (gdb) backtrace full
>>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> No locals.
>>>>>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>>>>>> ts = {tv_sec = 0, tv_nsec = 250000000}
>>>>>>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>>>>>>> state = 3
>>>>>>>>>>>> waittime = 5
>>>>>>>>>>>> pjob = 0x313a74
>>>>>>>>>>>> iter = 0x0
>>>>>>>>>>>> when = 1477984074
>>>>>>>>>>>> log = 0
>>>>>>>>>>>> scheduling = 1
>>>>>>>>>>>> sched_iteration = 600
>>>>>>>>>>>> time_now = 1477984190
>>>>>>>>>>>> update_loglevel = 1477984198
>>>>>>>>>>>> log_buf = "Server Ready, pid = 5020, loglevel=0",
>>>>>>>>>>>> '\000' <repeats 140 times>, "c\000\000\000\000\000\000\000
>>>>>>>>>>>> \000\020\000\000\000\000\000\000\240\265\377\377\377\177",
>>>>>>>>>>>> '\000' <repeats 26 times>...
>>>>>>>>>>>> sem_val = 5228929
>>>>>>>>>>>> __func__ = "main_loop"
>>>>>>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>>>> pbsd_main.c:1935
>>>>>>>>>>>> i = 2
>>>>>>>>>>>> rc = 0
>>>>>>>>>>>> local_errno = 0
>>>>>>>>>>>> lockfile = "/var/spool/torque/server_priv/server.lock",
>>>>>>>>>>>> '\000' <repeats 983 times>
>>>>>>>>>>>> EMsg = '\000' <repeats 1023 times>
>>>>>>>>>>>> tmpLine = "Using ports Server:15001 Scheduler:15004
>>>>>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 945 times>
>>>>>>>>>>>> log_buf = "Using ports Server:15001 Scheduler:15004
>>>>>>>>>>>> MOM:15002 (server: 'Dual-E52630v4')", '\000' <repeats 1114 times>...
>>>>>>>>>>>> server_name_file_port = 15001
>>>>>>>>>>>> fp = 0x51095f0
>>>>>>>>>>>> (gdb) info registers
>>>>>>>>>>>> rax 0xfffffffffffffdfc -516
>>>>>>>>>>>> rbx 0x5 5
>>>>>>>>>>>> rcx 0x7ffff612a75d 140737321805661
>>>>>>>>>>>> rdx 0x0 0
>>>>>>>>>>>> rsi 0x0 0
>>>>>>>>>>>> rdi 0x7fffffffb3f0 140737488335856
>>>>>>>>>>>> rbp 0x7fffffffe4b0 0x7fffffffe4b0
>>>>>>>>>>>> rsp 0x7fffffffc870 0x7fffffffc870
>>>>>>>>>>>> r8 0x0 0
>>>>>>>>>>>> r9 0x4000001 67108865
>>>>>>>>>>>> r10 0x1 1
>>>>>>>>>>>> r11 0x293 659
>>>>>>>>>>>> r12 0x4260b0 4350128
>>>>>>>>>>>> r13 0x7fffffffe590 140737488348560
>>>>>>>>>>>> r14 0x0 0
>>>>>>>>>>>> r15 0x0 0
>>>>>>>>>>>> rip 0x461fb6 0x461fb6 <main(int, char**)+2388>
>>>>>>>>>>>> eflags 0x293 [ CF AF SF IF ]
>>>>>>>>>>>> cs 0x33 51
>>>>>>>>>>>> ss 0x2b 43
>>>>>>>>>>>> ds 0x0 0
>>>>>>>>>>>> es 0x0 0
>>>>>>>>>>>> fs 0x0 0
>>>>>>>>>>>> gs 0x0 0
>>>>>>>>>>>> (gdb) x/16i $pc
>>>>>>>>>>>> => 0x461fb6 <main(int, char**)+2388>: callq 0x494762
>>>>>>>>>>>> <shutdown_ack()>
>>>>>>>>>>>> 0x461fbb <main(int, char**)+2393>: mov $0xffffffff,%edi
>>>>>>>>>>>> 0x461fc0 <main(int, char**)+2398>: callq 0x4250b0
>>>>>>>>>>>> <***@plt>
>>>>>>>>>>>> 0x461fc5 <main(int, char**)+2403>: mov
>>>>>>>>>>>> 0x70f55c(%rip),%rdx # 0xb71528 <msg_svrdown>
>>>>>>>>>>>> 0x461fcc <main(int, char**)+2410>: mov
>>>>>>>>>>>> 0x70eeed(%rip),%rax # 0xb70ec0 <msg_daemonname>
>>>>>>>>>>>> 0x461fd3 <main(int, char**)+2417>: mov %rdx,%rcx
>>>>>>>>>>>> 0x461fd6 <main(int, char**)+2420>: mov %rax,%rdx
>>>>>>>>>>>> 0x461fd9 <main(int, char**)+2423>: mov $0x1,%esi
>>>>>>>>>>>> 0x461fde <main(int, char**)+2428>: mov $0x8002,%edi
>>>>>>>>>>>> 0x461fe3 <main(int, char**)+2433>: callq 0x425840
>>>>>>>>>>>> <***@plt>
>>>>>>>>>>>> 0x461fe8 <main(int, char**)+2438>: mov $0x0,%edi
>>>>>>>>>>>> 0x461fed <main(int, char**)+2443>: callq 0x4269c9
>>>>>>>>>>>> <acct_close(bool)>
>>>>>>>>>>>> 0x461ff2 <main(int, char**)+2448>: mov $0xb6cdc0,%edi
>>>>>>>>>>>> 0x461ff7 <main(int, char**)+2453>: callq 0x425a00
>>>>>>>>>>>> <***@plt>
>>>>>>>>>>>> 0x461ffc <main(int, char**)+2458>: mov $0x1,%edi
>>>>>>>>>>>> 0x462001 <main(int, char**)+2463>: callq 0x424db0
>>>>>>>>>>>> <***@plt>
>>>>>>>>>>>> (gdb) thread apply all backtrace
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 11 (Thread 0x7fffe37fe700 (LWP 5113)):
>>>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe37fe700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 10 (Thread 0x7fffe3fff700 (LWP 5109)):
>>>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110710) at
>>>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7fffe3fff700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 9 (Thread 0x7ffff09bb700 (LWP 5031)):
>>>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff09bb700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 7 (Thread 0x7ffff11bc700 (LWP 5029)):
>>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>>>> #2 0x00000000004769bb in remove_completed_jobs (vp=0x0) at
>>>>>>>>>>>> req_jobobit.c:3759
>>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff11bc700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 6 (Thread 0x7ffff19bd700 (LWP 5028)):
>>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>>>> #2 0x00000000004afa7b in remove_extra_recycle_jobs (vp=0x0) at
>>>>>>>>>>>> job_recycler.c:216
>>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff19bd700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 5 (Thread 0x7ffff21be700 (LWP 5027)):
>>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>>>> #2 0x00000000004bc73b in inspect_exiting_jobs (vp=0x0) at
>>>>>>>>>>>> exiting_jobs.c:319
>>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff21be700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 4 (Thread 0x7ffff29bf700 (LWP 5026)):
>>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> #1 0x00007ffff612a6aa in __sleep (seconds=0) at
>>>>>>>>>>>> ../sysdeps/posix/sleep.c:55
>>>>>>>>>>>> #2 0x000000000046078d in handle_queue_routing_retries (vp=0x0)
>>>>>>>>>>>> at pbsd_main.c:1079
>>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff29bf700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 3 (Thread 0x7ffff31c0700 (LWP 5025)):
>>>>>>>>>>>> #0 0x00007ffff6ee17bd in accept () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> #1 0x00007ffff750a276 in start_listener_addrinfo
>>>>>>>>>>>> (host_name=0x7ffff31bfaf0 "Dual-E52630v4", server_port=15001,
>>>>>>>>>>>> process_meth=0x4c4935 <start_process_pbs_server_port(void*)>)
>>>>>>>>>>>> at ../Libnet/server_core.c:398
>>>>>>>>>>>> #2 0x00000000004608f3 in start_accept_listener (vp=0x0) at
>>>>>>>>>>>> pbsd_main.c:1141
>>>>>>>>>>>> #3 0x00007ffff6ed870a in start_thread (arg=0x7ffff31c0700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> #4 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 2 (Thread 0x7ffff39c1700 (LWP 5024)):
>>>>>>>>>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>>>>>>>>>>> #1 0x00000000004fc19c in work_thread (a=0x5110810) at
>>>>>>>>>>>> u_threadpool.c:272
>>>>>>>>>>>> #2 0x00007ffff6ed870a in start_thread (arg=0x7ffff39c1700) at
>>>>>>>>>>>> pthread_create.c:333
>>>>>>>>>>>> ---Type <return> to continue, or q <return> to quit---
>>>>>>>>>>>> #3 0x00007ffff616582d in clone () at
>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 1 (Thread 0x7ffff7fd5740 (LWP 5020)):
>>>>>>>>>>>> #0 0x00007ffff612a75d in nanosleep () at
>>>>>>>>>>>> ../sysdeps/unix/syscall-template.S:84
>>>>>>>>>>>> #1 0x00007ffff615c1a4 in usleep (useconds=<optimized out>) at
>>>>>>>>>>>> ../sysdeps/posix/usleep.c:32
>>>>>>>>>>>> #2 0x000000000046123a in main_loop () at pbsd_main.c:1454
>>>>>>>>>>>> #3 0x0000000000461fb6 in main (argc=2, argv=0x7fffffffe598) at
>>>>>>>>>>>> pbsd_main.c:1935
>>>>>>>>>>>> (gdb) quit
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 28, 2016 at 12:43 PM, Kazuhiro Fujita <
>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for your comments.
>>>>>>>>>>>>> I will try the 6.0-dev next week.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 28, 2016 at 5:34 AM, David Beer <
>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I wonder if that fix wasn't placed in the hotfix. Is there
>>>>>>>>>>>>>> any chance you can try installing 6.0-dev on your system (via github) to
>>>>>>>>>>>>>> see if it's resolved? For the record, my Ubuntu 16 system doesn't give me
>>>>>>>>>>>>>> this error, or I'd try it myself. For whatever reason, none of our test
>>>>>>>>>>>>>> cluster machines (CentOS & Red Hat 6-7, SLES 11-12) experience this either. We
>>>>>>>>>>>>>> did have another user who experienced it on a test cluster, but not being
>>>>>>>>>>>>>> able to reproduce it has made it harder to track down.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Oct 26, 2016 at 12:46 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> David,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried 6.0.2.h3, but it seems the other issue still
>>>>>>>>>>>>>>> remains.
>>>>>>>>>>>>>>> After I initialized the serverdb with "sudo pbs_server -t
>>>>>>>>>>>>>>> create", pbs_server crashed.
>>>>>>>>>>>>>>> Then I ran pbs_server under gdb.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server
>>>>>>>>>>>>>>> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
>>>>>>>>>>>>>>> Reading symbols from /usr/local/sbin/pbs_server...done.
>>>>>>>>>>>>>>> (gdb) r -D
>>>>>>>>>>>>>>> Starting program: /usr/local/sbin/pbs_server -D
>>>>>>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>>>>>>> Using host libthread_db library
>>>>>>>>>>>>>>> "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>>>>>>>>>>>>>> pbs_server is up (version - 6.0.2.h3, port - 15001)
>>>>>>>>>>>>>>> [New Thread 0x7ffff39c1700 (LWP 25591)]
>>>>>>>>>>>>>>> [New Thread 0x7ffff31c0700 (LWP 25592)]
>>>>>>>>>>>>>>> [New Thread 0x7ffff29bf700 (LWP 25593)]
>>>>>>>>>>>>>>> [New Thread 0x7ffff21be700 (LWP 25594)]
>>>>>>>>>>>>>>> [New Thread 0x7ffff19bd700 (LWP 25595)]
>>>>>>>>>>>>>>> [New Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thread 7 "pbs_server" received signal SIGSEGV, Segmentation
>>>>>>>>>>>>>>> fault.
>>>>>>>>>>>>>>> [Switching to Thread 0x7ffff11bc700 (LWP 25596)]
>>>>>>>>>>>>>>> __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>>>>>>> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such
>>>>>>>>>>>>>>> file or directory.
>>>>>>>>>>>>>>> (gdb) bt
>>>>>>>>>>>>>>> #0 __lll_unlock_elision (lock=0x57276c0, private=0) at
>>>>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
>>>>>>>>>>>>>>> #1 0x00000000004ac076 in dispatch_timed_task
>>>>>>>>>>>>>>> (ptask=0x5727660) at svr_task.c:318
>>>>>>>>>>>>>>> #2 0x0000000000460247 in check_tasks (notUsed=0x0) at
>>>>>>>>>>>>>>> pbsd_main.c:921
>>>>>>>>>>>>>>> #3 0x00000000004fc171 in work_thread (a=0x510f650) at
>>>>>>>>>>>>>>> u_threadpool.c:318
>>>>>>>>>>>>>>> #4 0x00007ffff6ed86fa in start_thread (arg=0x7ffff11bc700)
>>>>>>>>>>>>>>> at pthread_create.c:333
>>>>>>>>>>>>>>> #5 0x00007ffff6165b5d in clone () at
>>>>>>>>>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Oct 26, 2016 at 11:52 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> David and Rick,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you for the quick response. I will try it later.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Oct 26, 2016 at 5:06 AM, David Beer <
>>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Actually, Rick just sent me the link. You can download it
>>>>>>>>>>>>>>>>> from here: http://files.adaptivecomputing.com/hotfix/torque-6.0.2.h3.tar.gz
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:06 PM, David Beer <
>>>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I can confirm that this bug is fixed in 6.0-dev, and
>>>>>>>>>>>>>>>>>> we've made a hotfix for it, 6.0.2.h3. This was caused by a change
>>>>>>>>>>>>>>>>>> in the implementation of the pthread library, so most will not see this
>>>>>>>>>>>>>>>>>> crash, but it appears that if you have a newer version of that library,
>>>>>>>>>>>>>>>>>> then you will get it. Rick is going to send instructions for how to grab
>>>>>>>>>>>>>>>>>> 6.0.2.h3.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 12:30 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you, David, for the comment on the backtrace.
>>>>>>>>>>>>>>>>>>> I hadn't noticed that until writing this mail.
>>>>>>>>>>>>>>>>>>> So I generated the backtrace as described in the Ubuntu wiki.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I also attached the backtrace of pbs_server (Torque
>>>>>>>>>>>>>>>>>>> 6.1-dev) taken with gdb.
>>>>>>>>>>>>>>>>>>> As I mentioned before, the torque.setup script was
>>>>>>>>>>>>>>>>>>> executed successfully, but the server is unstable.
>>>>>>>>>>>>>>>>>>> Before using gdb, I ran the following commands.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> git clone https://github.com/adaptivecomputing/torque.git -b 6.1-dev 6.1-dev
>>>>>>>>>>>>>>>>>>>> cd 6.1-dev
>>>>>>>>>>>>>>>>>>>> ./autogen.sh
>>>>>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee /var/spool/torque/server_name
>>>>>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>>>>>> # set as services
>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd
>>>>>>>>>>>>>>>>>>>> /etc/init.d/trqauthd
>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom
>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>>>>>>> echo "$HOSTNAME np=`cat /proc/cpuinfo | grep processor | wc -l`" | sudo tee /var/spool/torque/server_priv/nodes
>>>>>>>>>>>>>>>>>>>> sudo nano /var/spool/torque/server_priv/nodes #
>>>>>>>>>>>>>>>>>>>> (changed np)
>>>>>>>>>>>>>>>>>>>> sudo qterm -t quick
>>>>>>>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd stop
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> trqauthd was not stopped by the last command, so I
>>>>>>>>>>>>>>>>>>> stopped it by killing the trqauthd process.
>>>>>>>>>>>>>>>>>>> Then I restarted the torque processes with gdb.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> sudo /etc/init.d/trqauthd start
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> sudo gdb /usr/local/sbin/pbs_server 2>&1 | tee
>>>>>>>>>>>>>>>>>>>> ~/gdb-torquesetup-6.1-dev.txt
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In another terminal, I executed the following commands
>>>>>>>>>>>>>>>>>>> before pbs_server crashed.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> sudo /etc/init.d/pbs_mom start
>>>>>>>>>>>>>>>>>>>> sudo /etc/init.d/pbs_sched start
>>>>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>>>>> pbsnodes -a
>>>>>>>>>>>>>>>>>>>> echo "sleep 30" | qsub
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The output of the last command was "0.torque-server",
>>>>>>>>>>>>>>>>>>> and this command crashed pbs_server in gdb.
>>>>>>>>>>>>>>>>>>> Then I generated the backtrace.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 2:36 PM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> David,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I attached the backtrace of pbs_server (Torque 6.0.2)
>>>>>>>>>>>>>>>>>>>> taken with gdb
>>>>>>>>>>>>>>>>>>>> (based on https://wiki.ubuntu.com/Backtrace).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I started pbs_server with gdb,
>>>>>>>>>>>>>>>>>>>> and executed qmgr from another terminal (see below).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> sudo qmgr -c 'p s'
>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host 'torque-server
>>>>>>>>>>>>>>>>>>>>> '.
>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111) Connection
>>>>>>>>>>>>>>>>>>>>> refused
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> After the qmgr execution, I pressed Ctrl+C in gdb.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> Kaz
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Oct 25, 2016 at 1:00 AM, David Beer <
>>>>>>>>>>>>>>>>>>>> ***@adaptivecomputing.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Kazu,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Can you give us a backtrace for this crash? We have
>>>>>>>>>>>>>>>>>>>>> fixed some issues on startup (around mutex management for newer pthread
>>>>>>>>>>>>>>>>>>>>> implementations) and a backtrace would allow me to confirm if what you're
>>>>>>>>>>>>>>>>>>>>> seeing is fixed.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 24, 2016 at 2:09 AM, Kazuhiro Fujita <
>>>>>>>>>>>>>>>>>>>>> ***@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Dear All,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I use Torque 4.2.10 on Ubuntu 14.04 LTS and 16.04 LTS
>>>>>>>>>>>>>>>>>>>>>> with dual E5-2630 v3 chips.
>>>>>>>>>>>>>>>>>>>>>> I recently got servers with dual Xeon E5 v4 chips,
>>>>>>>>>>>>>>>>>>>>>> and installed Ubuntu 16.04 LTS on them.
>>>>>>>>>>>>>>>>>>>>>> And I tried to set up Torque on them, but I got stuck
>>>>>>>>>>>>>>>>>>>>>> at the initial setup script.
>>>>>>>>>>>>>>>>>>>>>> It seems that qmgr may trigger a crash of pbs_server in
>>>>>>>>>>>>>>>>>>>>>> the initial setup script (torque.setup); see below.
>>>>>>>>>>>>>>>>>>>>>> A similar error is also observed in Torque 6.0.2.
>>>>>>>>>>>>>>>>>>>>>> Have you ever observed this kind of error?
>>>>>>>>>>>>>>>>>>>>>> If you know of possible solutions, please tell me.
>>>>>>>>>>>>>>>>>>>>>> Any comments will be highly appreciated.
>>>>>>>>>>>>>>>>>>>>>> Would it be better to change the OS to another
>>>>>>>>>>>>>>>>>>>>>> distribution, such as Scientific Linux?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>>>>>>>>>>>> Kazu
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Errors in torque 4.2.10 setup
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$ sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port
>>>>>>>>>>>>>>>>>>>>>>> is: 15001
>>>>>>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>>>>>>> initializing TORQUE (admin:
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server)
>>>>>>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>>>>>>> root 27941 1942 1 12:22 ? 00:00:00
>>>>>>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>>>> set server operators +=
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server
>>>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>>>> set server managers += torque-server-***@torque-server
>>>>>>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/torque-4.2.10$ ps aux | grep pbs
>>>>>>>>>>>>>>>>>>>>>>> torque-+ 27996 0.0 0.0 22304 948 pts/2 S+
>>>>>>>>>>>>>>>>>>>>>>> 12:22 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The pbs_server -t create process was not found.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Errors in torque 6.0.2 setup
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>>>>> Currently no servers active. Default server will be
>>>>>>>>>>>>>>>>>>>>>>> listed as active server. Error 15133
>>>>>>>>>>>>>>>>>>>>>>> Active server name: torque-server pbs_server port
>>>>>>>>>>>>>>>>>>>>>>> is: 15001
>>>>>>>>>>>>>>>>>>>>>>> trqauthd daemonized - port /tmp/trqauthd-unix
>>>>>>>>>>>>>>>>>>>>>>> trqauthd successfully started
>>>>>>>>>>>>>>>>>>>>>>> initializing TORQUE (admin:
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server)
>>>>>>>>>>>>>>>>>>>>>>> You have selected to start pbs_server in create mode.
>>>>>>>>>>>>>>>>>>>>>>> If the server database exists it will be overwritten.
>>>>>>>>>>>>>>>>>>>>>>> do you wish to continue y/(n)?y
>>>>>>>>>>>>>>>>>>>>>>> root 39521 1 1 16:10 ? 00:00:00
>>>>>>>>>>>>>>>>>>>>>>> pbs_server -t create
>>>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>>>> Max open servers: 9
>>>>>>>>>>>>>>>>>>>>>>> qmgr obj=batch svr=default: End of File
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> Unable to communicate with torque-server(10.x.x.x)
>>>>>>>>>>>>>>>>>>>>>>> Cannot connect to specified server host
>>>>>>>>>>>>>>>>>>>>>>> 'torque-server'.
>>>>>>>>>>>>>>>>>>>>>>> qmgr: cannot connect to server (errno=111)
>>>>>>>>>>>>>>>>>>>>>>> Connection refused
>>>>>>>>>>>>>>>>>>>>>>> torque-server-***@torque-server:~/Downloads/torque/6.0.2$
>>>>>>>>>>>>>>>>>>>>>>> ps aux | grep pbs
>>>>>>>>>>>>>>>>>>>>>>> comp_ad+ 39569 0.0 0.0 22304 1032 pts/8 S+
>>>>>>>>>>>>>>>>>>>>>>> 16:11 0:00 grep --color=auto pbs
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The pbs_server -t create process was not found.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Commands used for installation before the setup
>>>>>>>>>>>>>>>>>>>>>> script
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> # build and install torque
>>>>>>>>>>>>>>>>>>>>>>> ./configure
>>>>>>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>>>>>> sudo make install
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> echo $HOSTNAME | sudo tee
>>>>>>>>>>>>>>>>>>>>>>> /var/spool/torque/server_name
>>>>>>>>>>>>>>>>>>>>>>> echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/torque.conf
>>>>>>>>>>>>>>>>>>>>>>> sudo ldconfig
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> # set up as services
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.trqauthd
>>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/trqauthd
>>>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_server
>>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_server
>>>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_sched
>>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_sched
>>>>>>>>>>>>>>>>>>>>>>> sudo cp contrib/init.d/debian.pbs_mom
>>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/pbs_mom
>>>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d trqauthd defaults
>>>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_server defaults
>>>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_sched defaults
>>>>>>>>>>>>>>>>>>>>>>> sudo update-rc.d pbs_mom defaults
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> sudo ./torque.setup $USER
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>>>>> torqueusers mailing list
>>>>>>>>>>>>>>>>>>>>>> ***@supercluster.org
>>>>>>>>>>>>>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> David Beer | Torque Architect
>>>>>>>>>>>>>>>>>>>>> Adaptive Computing