Henrik Bengtsson
2017-01-20 04:39:57 UTC
Hi. We're running Torque with Moab on a multiuser cluster with
heterogeneous nodes. Specifically, the different nodes have a
different amount of /scratch/ available. Users' jobs need significant
amounts of /scratch/ space. The jobs run for hours up to several
days.
I'm looking for a way to request a specific amount of /scratch/ space per job.
I'm aware of the 'size[fs=/scratch]' option in
/var/spool/torque/mom_priv/config, but that doesn't prevent too many
jobs submitted as:
qsub -l file=2tb ...
from being allocated to the same node with, say, 5 TiB of /scratch/.
The problem with this approach appears to be that a job will be sent
to a node as long as the requested file resource is available **when
the job is launched**.
The only alternative that looks like an option is to use a *generic
consumable resource* that corresponds to, say, the size in GiB of
/scratch/ (regardless of disk usage). Something like adding the
following to /var/spool/torque/mom_priv/config:
scratch 5120
corresponding to 5120 GiB = 5 TiB of /scratch/. This needs to be node
specific since the different nodes have different amounts of
/scratch/.
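To make the node-specific part concrete, I picture each node carrying
its own value in its local /var/spool/torque/mom_priv/config, e.g. a
node with 5 TiB of /scratch/ would have:
scratch 5120
while a node with 10 TiB of /scratch/ would have:
scratch 10240
(these sizes are just made-up examples).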
With the above GRES setup, jobs can request this consumable resource as:
qsub -l gres=scratch:2048 ...
qsub -l gres=scratch:2048 ...
qsub -l gres=scratch:2048 ...
Here the 3rd job will be queued until one of the other two finishes
(assuming there's only one node).
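If we go this route, the per-node value could presumably be generated
on each node at setup time with something like (assuming GNU df; the
path is just our /scratch/ mount):
df -BG --output=size /scratch | tail -n 1 | tr -dc '0-9'
which prints the size of /scratch/ in GiB, ready to paste after the
'scratch' keyword in that node's config.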
Is this a recommended approach? What are others using for this?
Any other suggestions are welcome.
Thank you
Henrik
PS. I'm not a sysadmin, but an advanced user trying to help identify
best practices.