[torqueusers] experimenting with routing queues

Glen Beane

2016-12-02 21:11:28 UTC

Fellow Torque users,

I'm currently experimenting with routing queues on one of my test
clusters. The motivation behind this is I have developed a pipeline tool
that is built on top of Torque, and in some cases we want to run thousands
of samples through a pipeline. This could result in many thousands of jobs
that are submitted to the queue, which would cause me to exceed the
'max_user_queuable' limit that has been imposed on our production
clusters. I want to submit these to a routing queue and have some fraction
of those jobs sent to an execution queue.

My dev cluster is running Torque 6.0.1 and Moab 9.0.1.

I've been testing out some routing queues with the pipeline software, and
am seeing what I feel is strange behavior. First, it seems my jobs are
routed from the routing queue into the execution queue LIFO. I expected it
to be FIFO. Second, it does not seem to take job dependencies into
account -- I had a bunch of jobs that depended on a previously submitted
job, but since they were getting routed last in first out, those jobs
made it to the execution queue only to be held waiting on a previously
submitted job that was still in the routing queue. At one point I filled
up my 'max_user_queuable' limit with jobs that all had a dependency still
in the routing queue -- then none of my jobs could run.

I would expect a routing queue to route FIFO, and it would be better if
Torque could somehow take dependencies into consideration: it should not
route jobs still waiting on dependencies when there are other
eligible-to-run jobs that could be moved into an execution queue instead.
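
To make the behavior I'm asking for concrete, here is a minimal sketch in
Python. It is only a toy model of the policy, not Torque code and not how
pbs_server works internally. The names Job, route_fifo, free_slots and done
are all made up for illustration; free_slots stands in for whatever room is
left under 'max_user_queuable' in the execution queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    # IDs of jobs that must finish before this one may run (hypothetical field)
    depends_on: set = field(default_factory=set)


def route_fifo(routing_queue, free_slots, done=frozenset()):
    """Pick jobs to move to the execution queue, oldest first.

    A job is eligible only if all of its dependencies are in `done`.
    Jobs that are not routed stay in the routing queue in their
    original submission order.
    """
    routed, kept = [], deque()
    while routing_queue:
        job = routing_queue.popleft()
        if len(routed) < free_slots and job.depends_on <= done:
            routed.append(job)
        else:
            kept.append(job)
    routing_queue.extend(kept)
    return routed


# Four jobs in submission order; b and c depend on a.
queue = deque([
    Job("a"),
    Job("b", {"a"}),
    Job("c", {"a"}),
    Job("d"),
])

first = route_fifo(queue, free_slots=2)
second = route_fifo(queue, free_slots=2, done={"a"})
```

With two free slots, the first pass routes a and d and leaves b and c in the
routing queue, so the execution queue never fills up with held jobs. Once a
is done, the second pass routes b and c.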

Any suggestions?

David Beer

2016-12-06 18:57:20 UTC

Glen,

I don't think anyone has ever tried to impose an ordering on routing
queues. For example, all jobs are routed when submitted, but in some cases
routing gets delayed and is retried.

As far as the held jobs go, I think it'd be quite easy to add a
parameter that makes routing queues not route held jobs.

David

--
David Beer | Torque Architect
Adaptive Computing