Merge lp://staging/~stylesen/lava-scheduler/multinode into lp://staging/lava-scheduler/multinode

Proposed by Senthil Kumaran S
Status: Merged
Approved by: Neil Williams
Approved revision: no longer in the source branch.
Merged at revision: 271
Proposed branch: lp://staging/~stylesen/lava-scheduler/multinode
Merge into: lp://staging/lava-scheduler/multinode
Diff against target: 36 lines (+18/-1)
1 file modified
lava_scheduler_daemon/dbjobsource.py (+18/-1)
To merge this branch: bzr merge lp://staging/~stylesen/lava-scheduler/multinode
Reviewer Review Type Date Requested Status
Neil Williams Approve
Linaro Automation & Validation Pending
Review via email: mp+178143@code.staging.launchpad.net

Description of the change

Fix bug #1202292 - Multi-Node allowing jobs to start running before devices are available.

Revision history for this message
Neil Williams (codehelp) wrote :

I thought the intention was to refuse the MultiNode job if there are not enough devices to satisfy the group - how can we remove jobs from a group when we don't know what roles the devices would need to perform in that job? We cannot arbitrarily sabotage a MultiNode job by removing key roles from the group. e.g. removing the sole server and leaving the clients to fail is a waste of time for everyone.

We need to distinguish between devices which are busy and devices which are offline. If there are sufficient devices for the group which are either busy or idle, all devices need to wait until all are idle. If there are insufficient devices available (count > (number_busy + number_idle)) then the job has to be rejected.
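As a minimal sketch of that condition (the status values and device
list here are hypothetical stand-ins, not the scheduler's actual
models):

    # The rejection condition above, as runnable pseudocode; status
    # values and the device list are hypothetical stand-ins.
    IDLE, BUSY, OFFLINE = "idle", "busy", "offline"

    def group_is_satisfiable(requested_count, device_statuses):
        # Busy devices can become idle later; offline devices cannot
        # be counted on, so reject when
        # requested_count > number_busy + number_idle.
        number_idle = device_statuses.count(IDLE)
        number_busy = device_statuses.count(BUSY)
        return requested_count <= number_busy + number_idle

    # Five devices requested, but only four are idle or busy: reject.
    assert not group_is_satisfiable(5, [IDLE, IDLE, BUSY, BUSY, OFFLINE])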

Note that this could even happen after initial acceptance - if a device is busy with a health check and then fails that health check. In that situation, I think it's acceptable to leave the job as submitted. I don't think it's acceptable to mangle the job until it fits LAVA.

We might need a method of intervening if the list of submitted-but-not-running jobs gets beyond a limit but that method can only involve *canceling* all the jobs in the group, not removing the ones which want the busiest / slowest device types.

A MultiNode group is an inviolate set. Either we start all devices in the group or we do something else with the entire group.

There is a separate question of how to cancel all MultiNode jobs in a single group - we haven't got to that particular bridge yet. However, if the code respects the group as a single item, we may be able to add group cancellation easily.

review: Disapprove
Revision history for this message
Senthil Kumaran S (stylesen) wrote :

Hi Neil,

On Friday 02 August 2013 03:00 AM, Neil Williams wrote:
> Review: Disapprove
>
> I thought the intention was to refuse the MultiNode job if there are not enough devices to satisfy the group - how can we remove jobs from a group when we don't know what roles the devices would need to perform in that job? We cannot arbitrarily sabotage a MultiNode job by removing key roles from the group. e.g. removing the sole server and leaving the clients to fail is a waste of time for everyone.

We do refuse the multinode job during job submission if there are not
enough devices to satisfy the group. Please have a look at the
following code:

http://bazaar.launchpad.net/~linaro-automation/lava-scheduler/multinode/view/head:/lava_scheduler_app/models.py#L441

We aren't removing jobs from the group in the proposed patch. We are
removing jobs from the job list that is returned for scheduling, until
every job in the group has a device and the group is ready to execute.
We are just filtering the job_list set that is returned, based on the
target-group id, which identifies whether the multinode jobs belong to
the same group.
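
A minimal sketch of that filtering (the attribute names here are
hypothetical stand-ins, not the actual dbjobsource.py code):

    # Hold back multinode jobs until every job sharing their
    # target_group has a device assigned; attribute names are
    # hypothetical stand-ins for the scheduler's models.
    from collections import defaultdict

    def filter_schedulable_jobs(job_list):
        groups = defaultdict(list)
        for job in job_list:
            if job.target_group:
                groups[job.target_group].append(job)

        schedulable = []
        for job in job_list:
            if not job.target_group:
                schedulable.append(job)   # single-node job, as before
            elif all(j.actual_device for j in groups[job.target_group]):
                schedulable.append(job)   # whole group has devices
            # otherwise the job is held back from this scheduling round
        return schedulable

(In the real scheduler the group membership would come from the
database, not just from the jobs present in job_list.)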

> We need to distinguish between devices which are busy and devices which are offline. If there are sufficient devices for the group which are either busy or idle, all devices need to wait until all are idle. If there are insufficient devices available (count > (number_busy + number_idle)) then the job has to be rejected.

We do this during job submission, as I have pointed out in the above
code. It does not make sense to do this rejection once we have accepted
the job and added it to our database. So the right place to handle this
is not dbjobsource.py but models.py, where it is already done.

> Note that this could even happen after initial acceptance - if a device is busy with a health check and then fails that health check. In that situation, I think it's acceptable to leave the job as submitted. I don't think it's acceptable to mangle the job until it fits LAVA.

With the new patch, the job is left as submitted in that situation.

> We might need a method of intervening if the list of submitted-but-not-running jobs gets beyond a limit but that method can only involve *canceling* all the jobs in the group, not removing the ones which want the busiest / slowest device types.

The current patch does not do that, but it should be simple to achieve.
It was not the initial intention of this bug or patch.

> A MultiNode group is an inviolate set. Either we start all devices in the group or we do something else with the entire group.

I agree.

> There is a separate question of how to cancel all MultiNode jobs in a single group - we haven't got to that particular bridge yet. However, if the code respects the group as a single item, we may be able to add group cancellation easily.

We can still do group cancellation by identifying the target-group. When
a cancel request is triggered, we should check whether the current job in
question is a multinode job by checking its target-group. If it is, we
warn the user that the following jobs will be cancelled and then, based
on user's input w...

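A rough sketch of that cancellation flow (target_group follows the
discussion above; the TestJob import, the cancel() method, and
NeedsConfirmation are assumptions made purely for illustration):

    # Assuming the TestJob model from lava_scheduler_app/models.py.
    from lava_scheduler_app.models import TestJob

    class NeedsConfirmation(Exception):
        """Raised so the caller can warn the user which jobs will go."""
        def __init__(self, jobs):
            super().__init__("cancelling an entire MultiNode group")
            self.jobs = jobs

    def cancel_job(job, confirmed=False):
        if job.target_group is None:
            job.cancel()            # single-node job: cancel as usual
            return
        group_jobs = list(TestJob.objects.filter(
            target_group=job.target_group))
        if not confirmed:
            # Surface the full list before touching anything.
            raise NeedsConfirmation(group_jobs)
        for group_job in group_jobs:
            group_job.cancel()      # the group is cancelled as one unit

This respects the group as a single item: either every job in the
target-group is cancelled, or none is.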

Revision history for this message
Senthil Kumaran S (stylesen) wrote :

On Monday 12 August 2013 05:18 PM, Senthil Kumaran S wrote:
> Let us say we have 5 panda devices. A multinode job requests for 5 panda
> devices and we have accepted this job since we know there are 4 devices
> available in the lava instance. There are 5 CI jobs which are running

Let us say we have 5 panda devices. A multinode job requests for 5 panda
devices and we have accepted this job since we know there are 5 panda
devices available in the lava instance.

I meant to say the above in my previous comment.

Thank You.
--
Senthil Kumaran
http://www.stylesen.org/
http://www.sasenthilkumaran.com/

Revision history for this message
Neil Williams (codehelp) wrote :

> We aren't removing jobs from the group in the proposed patch. We are
> removing jobs from the job list that is returned for scheduling, until
> every job in the group has a device and the group is ready to execute.

Ah, thanks for the clarification Senthil.

Approved.

review: Approve
271. By Neil Williams

Senthil Kumaran 2013-08-01 Fix bug #1202292 - Multi-Node allowing jobs to start running before devices are available.
