Discussion:
[torqueusers] LIFO job execution by TORQUE 6.0.2 on Ubuntu 16.04 LTS
Kazuhiro Fujita
2016-08-30 08:14:01 UTC
Permalink
Hi all,

I successfully installed TORQUE 6.0.2 on Ubuntu 16.04 LTS (the procedure is
attached).
It appears to be working, but job queuing shows odd behavior:
submitted jobs were executed in last-in, first-out (LIFO) order
(see below).

When I submit jobs with the following command three times in a row,
they are supposed to be executed in first-in, first-out (FIFO)
order.

$ echo "sleep 10" | qsub -t 1-10

But before the 1st array had completed, TORQUE started executing the 3rd array.
(I allocated 6 threads to TORQUE on a desktop test machine.)

$ qstat -t
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1[1].kaf-Ubuntu           STDIN-1          kaf             00:00:00 C batch
1[2].kaf-Ubuntu           STDIN-2          kaf                    0 Q batch
1[3].kaf-Ubuntu           STDIN-3          kaf                    0 Q batch
1[4].kaf-Ubuntu           STDIN-4          kaf                    0 Q batch
1[5].kaf-Ubuntu           STDIN-5          kaf                    0 Q batch
1[6].kaf-Ubuntu           STDIN-6          kaf             00:00:00 C batch
1[7].kaf-Ubuntu           STDIN-7          kaf             00:00:00 C batch
1[8].kaf-Ubuntu           STDIN-8          kaf             00:00:00 C batch
1[9].kaf-Ubuntu           STDIN-9          kaf             00:00:00 C batch
1[10].kaf-Ubuntu          STDIN-10         kaf             00:00:00 C batch
2[1].kaf-Ubuntu           STDIN-1          kaf                    0 Q batch
2[2].kaf-Ubuntu           STDIN-2          kaf                    0 Q batch
2[3].kaf-Ubuntu           STDIN-3          kaf                    0 Q batch
2[4].kaf-Ubuntu           STDIN-4          kaf                    0 Q batch
2[5].kaf-Ubuntu           STDIN-5          kaf                    0 Q batch
2[6].kaf-Ubuntu           STDIN-6          kaf                    0 Q batch
2[7].kaf-Ubuntu           STDIN-7          kaf                    0 Q batch
2[8].kaf-Ubuntu           STDIN-8          kaf                    0 Q batch
2[9].kaf-Ubuntu           STDIN-9          kaf                    0 Q batch
2[10].kaf-Ubuntu          STDIN-10         kaf                    0 Q batch
3[1].kaf-Ubuntu           STDIN-1          kaf                    0 Q batch
3[2].kaf-Ubuntu           STDIN-2          kaf                    0 Q batch
3[3].kaf-Ubuntu           STDIN-3          kaf                    0 Q batch
3[4].kaf-Ubuntu           STDIN-4          kaf                    0 Q batch
3[5].kaf-Ubuntu           STDIN-5          kaf                    0 R batch
3[6].kaf-Ubuntu           STDIN-6          kaf                    0 R batch
3[7].kaf-Ubuntu           STDIN-7          kaf                    0 R batch
3[8].kaf-Ubuntu           STDIN-8          kaf                    0 R batch
3[9].kaf-Ubuntu           STDIN-9          kaf                    0 R batch
3[10].kaf-Ubuntu          STDIN-10         kaf                    0 R batch

Do you have any ideas on how to solve this issue?
I used TORQUE 4.2.10 on Ubuntu 14.04 LTS for about a year,
and never encountered this kind of behavior before upgrading to Ubuntu
16.04 LTS.

Thanks in advance,
Kaz




Procedure for installing TORQUE 6.0.2 on Ubuntu 16.04 LTS

# install packages
sudo apt-get install lsb-core build-essential libtool openssl libssl-dev \
  libxml2-dev libboost-all-dev automake

# check and edit /etc/hosts to specify host
sudo nano /etc/hosts

# install torque
tar xvzf torque-6.0.2-1469811694_d9a3483.tar.gz
cd torque-6.0.2-1469811694_d9a3483
./configure
make
sudo make install

# confirm the host name is correctly set.
cat /var/spool/torque/server_name

# configure the trqauthd daemon to start automatically at system boot.
sudo cp contrib/systemd/trqauthd.service /etc/systemd/system/
sudo systemctl enable trqauthd.service
sudo sh -c "echo /usr/local/lib > /etc/ld.so.conf.d/torque.conf"
sudo ldconfig
sudo systemctl start trqauthd.service

# set up qmgr
sudo ./torque.setup root
# check qmgr settings
sudo qmgr -c 'p s'

# set compute nodes
echo "$HOSTNAME np=$(grep -c '^processor' /proc/cpuinfo)" | sudo tee \
  /var/spool/torque/server_priv/nodes
# change number of threads for TORQUE
sudo nano /var/spool/torque/server_priv/nodes
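As a side note, the processor count written into the nodes file can also be taken from nproc(1), which on Linux normally matches counting "processor" lines in /proc/cpuinfo (it may report fewer under CPU affinity or cgroup limits). A minimal sketch, assuming the default /var/spool/torque layout used above:

```shell
# Two ways to count processors on Linux; they usually agree.
NP=$(nproc)
NP_CPUINFO=$(grep -c '^processor' /proc/cpuinfo)

# Build the single-node nodes-file line; write it with:
#   echo "$LINE" | sudo tee /var/spool/torque/server_priv/nodes
LINE="$HOSTNAME np=${NP}"
echo "$LINE"
```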

# Configure pbs_server to start automatically at system boot, and then start the daemon.
sudo qterm
sudo cp contrib/systemd/pbs_server.service /etc/systemd/system/
sudo systemctl enable pbs_server.service
sudo systemctl start pbs_server.service

# Configure pbs_mom to start at system boot, and then start the daemon
sudo cp contrib/systemd/pbs_mom.service /etc/systemd/system/
sudo systemctl enable pbs_mom.service
sudo systemctl start pbs_mom.service

# Configure pbs_sched to start at system boot, and then start the daemon
sudo cp contrib/systemd/pbs_sched.service /etc/systemd/system/
sudo systemctl enable pbs_sched.service
sudo systemctl start pbs_sched.service

# check node settings
pbsnodes -a

# check torque behavior with a small job
echo "sleep 30" | qsub
qstat
echo "sleep 10" | qsub -t 1-10
qstat -t

# pbs_sched could not start at boot, so I need to start it manually after booting.
sudo systemctl start pbs_sched.service
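One possible cause of pbs_sched failing at boot (an assumption on my part, not verified): it may be started before pbs_server is ready to accept connections. If so, a systemd drop-in that orders pbs_sched after pbs_server might help. The unit name matches the contrib file copied above, but the fix itself is only a sketch:

```ini
# /etc/systemd/system/pbs_sched.service.d/override.conf
# Assumption: the boot failure is a startup-ordering problem; adjust or
# discard this if "journalctl -u pbs_sched" shows a different error.
[Unit]
After=network-online.target pbs_server.service
Wants=network-online.target
```

After creating the drop-in, run `sudo systemctl daemon-reload` and reboot to test.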
David Beer
2016-08-30 21:03:06 UTC
Permalink
Kazuhiro Fujita
2016-08-31 03:58:56 UTC
Permalink
David,

Thank you for your quick response.

I uninstalled TORQUE 6.0.2 and installed TORQUE 4.2.10 on Ubuntu
16.04 LTS instead.
Job queuing worked as expected (FIFO order).
So I will use TORQUE 4.2.10 on Ubuntu 16.04 LTS on a computation
server for a while, until the odd behavior is fixed.

Best,
Kaz
Post by David Beer
Kaz,
We try not to change pbs_sched much, so I'm not sure what might've caused
this. I don't know if Maui might offer you more control on the scheduling
options.
David
--
David Beer | Torque Architect
Adaptive Computing
_______________________________________________
torqueusers mailing list
http://www.supercluster.org/mailman/listinfo/torqueusers
Kazuhiro Fujita
2016-10-24 06:14:00 UTC
Permalink
David,

I tested this issue with Torque 6.1-dev on Ubuntu 16.04 LTS.
The same behavior was still observed.

0[1].kaf-Ubuntu           STDIN-1          kaf             00:00:00 C batch
0[2].kaf-Ubuntu           STDIN-2          kaf                    0 Q batch
0[3].kaf-Ubuntu           STDIN-3          kaf                    0 Q batch
0[4].kaf-Ubuntu           STDIN-4          kaf                    0 Q batch
0[5].kaf-Ubuntu           STDIN-5          kaf                    0 Q batch
0[6].kaf-Ubuntu           STDIN-6          kaf                    0 Q batch
0[7].kaf-Ubuntu           STDIN-7          kaf                    0 Q batch
0[8].kaf-Ubuntu           STDIN-8          kaf             00:00:00 C batch
0[9].kaf-Ubuntu           STDIN-9          kaf             00:00:00 C batch
0[10].kaf-Ubuntu          STDIN-10         kaf             00:00:00 C batch
1[1].kaf-Ubuntu           STDIN-1          kaf                    0 Q batch
1[2].kaf-Ubuntu           STDIN-2          kaf                    0 Q batch
1[3].kaf-Ubuntu           STDIN-3          kaf                    0 Q batch
1[4].kaf-Ubuntu           STDIN-4          kaf                    0 Q batch
1[5].kaf-Ubuntu           STDIN-5          kaf                    0 Q batch
1[6].kaf-Ubuntu           STDIN-6          kaf                    0 Q batch
1[7].kaf-Ubuntu           STDIN-7          kaf                    0 Q batch
1[8].kaf-Ubuntu           STDIN-8          kaf                    0 Q batch
1[9].kaf-Ubuntu           STDIN-9          kaf                    0 Q batch
1[10].kaf-Ubuntu          STDIN-10         kaf                    0 Q batch
2[1].kaf-Ubuntu           STDIN-1          kaf                    0 Q batch
2[2].kaf-Ubuntu           STDIN-2          kaf                    0 Q batch
2[3].kaf-Ubuntu           STDIN-3          kaf                    0 Q batch
2[4].kaf-Ubuntu           STDIN-4          kaf                    0 Q batch
2[5].kaf-Ubuntu           STDIN-5          kaf                    0 Q batch
2[6].kaf-Ubuntu           STDIN-6          kaf                    0 Q batch
2[7].kaf-Ubuntu           STDIN-7          kaf                    0 R batch
2[8].kaf-Ubuntu           STDIN-8          kaf                    0 R batch
2[9].kaf-Ubuntu           STDIN-9          kaf                    0 R batch
2[10].kaf-Ubuntu          STDIN-10         kaf                    0 R batch

Best,
Kaz
Kazuhiro Fujita
2016-11-08 09:43:34 UTC
Permalink
David,

I tested an E5-2630v3 server with the latest 6.0-dev from GitHub on Ubuntu
16.04.
The LIFO behavior was still observed.

Best,
Kazu

$ qstat -t
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
0.Dual-E5-2630v3          STDIN            comp_admin      00:00:00 C batch
1[1].Dual-E5-2630v3       STDIN-1          comp_admin             0 Q batch
1[2].Dual-E5-2630v3       STDIN-2          comp_admin             0 Q batch
1[3].Dual-E5-2630v3       STDIN-3          comp_admin             0 Q batch
1[4].Dual-E5-2630v3       STDIN-4          comp_admin             0 Q batch
1[5].Dual-E5-2630v3       STDIN-5          comp_admin             0 Q batch
1[6].Dual-E5-2630v3       STDIN-6          comp_admin             0 Q batch
1[7].Dual-E5-2630v3       STDIN-7          comp_admin      00:00:00 C batch
1[8].Dual-E5-2630v3       STDIN-8          comp_admin      00:00:00 C batch
1[9].Dual-E5-2630v3       STDIN-9          comp_admin      00:00:00 C batch
1[10].Dual-E5-2630v3      STDIN-10         comp_admin      00:00:00 C batch
2[1].Dual-E5-2630v3       STDIN-1          comp_admin             0 Q batch
2[2].Dual-E5-2630v3       STDIN-2          comp_admin             0 Q batch
2[3].Dual-E5-2630v3       STDIN-3          comp_admin             0 Q batch
2[4].Dual-E5-2630v3       STDIN-4          comp_admin             0 Q batch
2[5].Dual-E5-2630v3       STDIN-5          comp_admin             0 Q batch
2[6].Dual-E5-2630v3       STDIN-6          comp_admin             0 Q batch
2[7].Dual-E5-2630v3       STDIN-7          comp_admin             0 Q batch
2[8].Dual-E5-2630v3       STDIN-8          comp_admin             0 Q batch
2[9].Dual-E5-2630v3       STDIN-9          comp_admin             0 Q batch
2[10].Dual-E5-2630v3      STDIN-10         comp_admin             0 Q batch
3[1].Dual-E5-2630v3       STDIN-1          comp_admin             0 Q batch
3[2].Dual-E5-2630v3       STDIN-2          comp_admin             0 Q batch
3[3].Dual-E5-2630v3       STDIN-3          comp_admin             0 Q batch
3[4].Dual-E5-2630v3       STDIN-4          comp_admin             0 Q batch
3[5].Dual-E5-2630v3       STDIN-5          comp_admin             0 Q batch
3[6].Dual-E5-2630v3       STDIN-6          comp_admin             0 Q batch
3[7].Dual-E5-2630v3       STDIN-7          comp_admin             0 Q batch
3[8].Dual-E5-2630v3       STDIN-8          comp_admin             0 Q batch
3[9].Dual-E5-2630v3       STDIN-9          comp_admin             0 Q batch
3[10].Dual-E5-2630v3      STDIN-10         comp_admin             0 Q batch
4[1].Dual-E5-2630v3       STDIN-1          comp_admin             0 Q batch
4[2].Dual-E5-2630v3       STDIN-2          comp_admin             0 Q batch
4[3].Dual-E5-2630v3       STDIN-3          comp_admin             0 Q batch
4[4].Dual-E5-2630v3       STDIN-4          comp_admin             0 Q batch
4[5].Dual-E5-2630v3       STDIN-5          comp_admin             0 Q batch
4[6].Dual-E5-2630v3       STDIN-6          comp_admin             0 Q batch
4[7].Dual-E5-2630v3       STDIN-7          comp_admin             0 R batch
4[8].Dual-E5-2630v3       STDIN-8          comp_admin             0 R batch
4[9].Dual-E5-2630v3       STDIN-9          comp_admin             0 R batch
4[10].Dual-E5-2630v3      STDIN-10         comp_admin             0 R batch
Eva Hocks
2016-09-02 20:11:47 UTC
Permalink
Hi all,

I am running Maui version 3.3.1 and TORQUE version 4.2.9
(commit 4ab084cdbe285bb53ed7dc3323dcbbaaf6cf14fc).

A job array was started with -t 710-1160%20.
After 48 jobs from the array had run, the rest of the jobs
are stuck in hold. They are not released by the qrls command, no matter
what flags I use, nor by Maui's releasehold.

Any suggestion on how to get TORQUE to run the next set of 20 jobs from
that array? I tried to qalter the job with a slot limit of %30, but that
did not help either. There is no default max_slot_limit set on this
system.


Thanks much for any suggestion,
Eva
Eva Hocks
2016-09-02 23:36:36 UTC
Permalink
I am running Maui version 3.3.1 and TORQUE version 4.2.9
(commit 4ab084cdbe285bb53ed7dc3323dcbbaaf6cf14fc).

When submitting job arrays with dependencies, the 2nd array gets stuck
in a system hold and has to be released by the operator before it will run.

#!/bin/bash

FIRST=$(qsub testjob.array)
echo $FIRST
SECOND=$(qsub -W depend=afterany:$FIRST testjob.array)
echo $SECOND
THIRD=$(qsub -W depend=afterany:$SECOND testjob.array)
echo $THIRD


The second array's job shows:

Hold_Types = s
submit_args = -W depend=afterany:693[].hpcdev-005.sdsc.edu testjob.array

After issuing

qrls -h s 694[]

the array starts. I had hoped it would start automatically when the
afterany dependency was met.
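One thing that may be worth trying (an assumption on my part, not a confirmed fix): TORQUE documents array-specific dependency types for qsub -W depend, such as afterokarray and afteranyarray, intended for depending on a whole job array rather than using a plain-job dependency on an array id. A sketch of the submission chain using afteranyarray; the qsub stub below exists only so the sketch runs outside a TORQUE cluster:

```shell
# Stub standing in for the real qsub, for illustration only.
qsub() { echo "693[].hpcdev-005.sdsc.edu"; }

FIRST=$(qsub testjob.array)
# afteranyarray: run after every sub-job of the named array has finished,
# regardless of exit status (the array analogue of afterany).
SECOND=$(qsub -W depend="afteranyarray:${FIRST}" testjob.array)
echo "$SECOND"
```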

Any thoughts on why the dependent jobs are not released?

Thanks much
Eva
Eva Hocks
2016-09-13 18:56:27 UTC
Permalink
I am having the same problem with TORQUE version 4.2.5
(commit def05751ee5162676a16832091522b71dc9ab873).


One array job failed:

09/12/2016 10:35:35 S DeleteJob issue_Drequest failure, rc = 15033

and ever since, all following jobs in that array won't start:

09/13/2016 00:00:27;0080;PBS_Server.216423;Req;req_reject;Reject reply code=15004(Invalid request MSG=Cannot run job. Array slot limit is 1 and there are already 1 jobs running

This time, altering the job script to increase the array slot limit did
let another array job run.

Any idea what might be the problem with the array slot limit?

Thanks
Eva