Discussion:
[torqueusers] TORQUE Resource Manager 6.1.1.1 error
sudo
2017-05-22 06:08:34 UTC
Hello,

I was able to build and configure v6.1.1.1 on my RHEL 7.2 cluster.
My example batch jobs run just fine.
However, I noticed the following errors from the MOM and the server in
/var/log/messages.

/var/log/messages
May 19 18:00:49 orion001 pbs_mom: I/O error : Permission denied
May 19 18:00:49 orion001 pbs_mom: I/O error : Permission denied
May 19 18:00:49 orion001 pbs_server: Assertion failed, bad pointer in
link: file "req_select.c", line 401

Are these known issues in v6.1.1.1?
Is there any way to avoid these errors?

Here is what I have observed so far:

1) The "Assertion failed" message is logged when a user runs 'qsub'.
(It also seems to appear periodically.)

2) "I/O error : Permission denied" is logged when a user's job finishes.

The user's output (.e/.o) files are still returned to the job's
PBS_O_WORKDIR normally.
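
To line the two messages up with job events, it can help to filter the syslog down to the TORQUE daemons while submitting a test job. A minimal sketch, assuming syslog lines look like the ones quoted above; the function name torque_syslog_lines is only illustrative:

```shell
#!/bin/bash
# Print only the pbs_mom / pbs_server lines from a syslog-style file.
# The function name and the default path are assumptions for illustration.
torque_syslog_lines() {
    local logfile="${1:-/var/log/messages}"
    # Match "pbs_mom:" or "pbs_server:", with an optional "[pid]" suffix.
    grep -E ' pbs_(mom|server)(\[[0-9]+\])?: ' "$logfile"
}

# Example (as root on the server node, while a job is submitted elsewhere):
#   torque_syslog_lines /var/log/messages | tail -n 20
```

Comparing the timestamps of the filtered lines with the qsub time and the job end time should show which event each message follows.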

---
I built it with hwloc 1.9 and cgroups enabled:

./configure --enable-cgroups

On this server, torque-6.0.2-1469811694_d9a3483 previously ran fine
(without these errors).
I installed v6.1.1.1 as a fresh install (after removing the old
/var/spool/torque).

Best Regards,
--Sudo

-------
Ryuichi Sudo (***@sstc.co.jp)
-------

P.S. momctl reports that pbs_mom is v6.1.1.1:

[***@orion001 ~]# momctl -d 6

Host: orion001/orion001 Version: 6.1.1.1 PID: 1801
Server[0]: orion001 (192.168.21.101:15001)
Last Msg From Server: 127 seconds (CLUSTER_ADDRS)
Last Msg To Server: 25 seconds
HomeDirectory: /var/spool/torque/mom_priv
stdout/stderr spool directory: '/var/spool/torque/spool/' (477820591 blocks available)
NOTE: syslog enabled
MOM active: 131 seconds
Check Poll Time: 45 seconds
Server Update Interval: 45 seconds
LogLevel: 6 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: TCP
MemLocked: TRUE (mlock)
TCP Timeout: 300 seconds
Prolog: /var/spool/torque/mom_priv/prologue (enabled)
Epilog: /var/spool/torque/mom_priv/epilogue (enabled)
Prolog/Epilog Alarm Time: 300 seconds
Alarm Time: 0 of 10 seconds
Trusted Client List: 127.0.0.1:0,192.168.21.101:0,192.168.21.101:15003,192.168.21.102:15003
Copy Command: /usr/bin/scp -rpB
NOTE: no local jobs detected

diagnostics complete
Chad Vizino
2017-05-30 17:52:40 UTC
Sorry you've had that problem--it shouldn't be happening. Could you share
your example job script that was used to submit the job with the problem?

-Chad
_______________________________________________
torqueusers mailing list
http://www.supercluster.org/mailman/listinfo/torqueusers
sudo
2017-05-31 02:39:57 UTC
Hello,
Thank you for looking into this.

> Could you share your example job script that was used to submit the
> job with the problem?

Here is the job script that triggers these /var/log/messages entries.
It is a simple Intel MPI parallel job that uses 24 cores.

---
#!/bin/bash
#PBS -q batch
#PBS -N mpiBMT
#PBS -l nodes=1:ppn=24
#PBS -l walltime=02:00:00
#PBS -j eo

cd $PBS_O_WORKDIR

time=/usr/bin/time
ulimit -s unlimited
export PATH="/opt/intel/impi/2018.0.061/intel64/bin:$PATH"
export LD_LIBRARY_PATH="/opt/intel/impi/2018.0.061/intel64/lib:$LD_LIBRARY_PATH"
pid=$$
day=`date +%m%d%y%s`

set -x
echo $PBS_NODEFILE ; ls -g $PBS_NODEFILE
sort $PBS_NODEFILE > mpd.hosts ; wc -l mpd.hosts
export I_MPI_DEVICE=rdma
export I_MPI_FABRICS=shm
# export I_MPI_FABRICS=shm:ofa

for num in himenoBMT_2_3_4.exe ; do
export BIN=${num}
nplist=`echo $BIN | sed -e 's/himenoBMT_//' -e 's/.exe//'`
nps=$(( ${nplist//_/*} ))
echo "Number of MPI procs(nps)= $nps"
(time -p mpiexec.hydra -genv I_MPI_DEVICE rdma -genv I_MPI_FABRICS shm \
    -genv I_MPI_DEBUG 2 -n ${nps} -machinefile mpd.hosts ./${BIN} ) \
    2>&1 | tee -a log_${BIN}_${nps}_${day}.txt
done
---
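
For reference, the script derives the MPI process count from the binary name: himenoBMT_2_3_4.exe is read as a 2 x 3 x 4 process grid, which matches ppn=24. A small standalone sketch of that step, using parameter expansion instead of sed; procs_from_binary is a hypothetical helper name and is not part of the script above:

```shell
#!/bin/bash
# Derive the MPI process count from a name such as himenoBMT_2_3_4.exe.
# procs_from_binary is a hypothetical helper used only for illustration.
procs_from_binary() {
    local bin="$1"
    local dims="${bin#himenoBMT_}"   # strip the prefix   -> 2_3_4.exe
    dims="${dims%.exe}"              # strip the suffix   -> 2_3_4
    echo $(( ${dims//_/*} ))         # "_" becomes "*"    -> 2*3*4
}

procs_from_binary himenoBMT_2_3_4.exe   # prints 24
```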

I also have some additional notes.

1) The issue occurs with TORQUE 6.1.1.1 on the server where TORQUE
was installed with 'make ; make install'.

2) Some time later, I did a fresh install of the same TORQUE packages
using the 'torque-package-xxxx-x86_64.sh' files (on the same hardware).
Since then, I have not seen the same error messages.

3) The managers and operators differ between the two installs.
With 1) above, managers and operators were created for all network
interfaces of the TORQUE server node.

With 2) above, only a manager and an operator for the 'eth0' name were created:
managers = ***@orion001
operators = ***@orion001

I still don't know whether the issue was caused by the difference between
1) and 2), though.
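
One way to compare the two installs is to save the manager and operator entries from each and diff them. A sketch, assuming that qmgr -c 'print server' prints these entries as "set server managers = ..." and "set server managers += ..." lines; acl_lines and the file names are made up for illustration:

```shell
#!/bin/bash
# Keep only the managers/operators entries from "qmgr -c 'print server'"
# output read on stdin. acl_lines is a hypothetical helper name.
acl_lines() {
    grep -E '^set server (managers|operators) \+?= '
}

# Example (run once on each install, then compare):
#   qmgr -c 'print server' | acl_lines > acl_make_install.txt
#   qmgr -c 'print server' | acl_lines > acl_package_install.txt
#   diff acl_make_install.txt acl_package_install.txt
```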

Best Regards,
--Sudo

-------
Ryuichi Sudo (***@sstc.co.jp)
-------

