Code review comment for lp://staging/~stylesen/lava-scheduler/multinode

Senthil Kumaran S (stylesen) wrote:

Hi Neil,

On Friday 02 August 2013 03:00 AM, Neil Williams wrote:
> Review: Disapprove
>
> I thought the intention was to refuse the MultiNode job if there are not enough devices to satisfy the group - how can we remove jobs from a group when we don't know what roles the devices would need to perform in that job? We cannot arbitrarily sabotage a MultiNode job by removing key roles from the group. e.g. removing the sole server and leaving the clients to fail is a waste of a time for everyone.

We do refuse a MultiNode job at submission time if there are not enough
devices to satisfy the group. Please have a look at the following code:

http://bazaar.launchpad.net/~linaro-automation/lava-scheduler/multinode/view/head:/lava_scheduler_app/models.py#L441
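
For illustration only, that submission-time check amounts to something
like the sketch below. The names here are made up for the example and
are not the actual models.py code:

    # Hypothetical sketch of the submission-time check described above.
    # `requested` maps device type -> number of devices the MultiNode
    # group asks for; `usable` maps device type -> number of devices on
    # the instance that are idle or busy (i.e. not offline or retired).

    def validate_multinode_submission(requested, usable):
        """Reject the group if any device type can never be satisfied."""
        for device_type, count in requested.items():
            available = usable.get(device_type, 0)
            if count > available:
                raise ValueError(
                    "Not enough '%s' devices: requested %d, usable %d"
                    % (device_type, count, available))

    # Example: asking for 5 pandas when only 4 are usable is rejected.
    try:
        validate_multinode_submission({"panda": 5}, {"panda": 4})
    except ValueError as exc:
        print(exc)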

The proposed patch does not remove jobs from the group. It removes jobs
from the job list that is returned for scheduling, until every job in
the group has a device and the group is ready to execute. We only
manipulate the job_list set that is returned, keyed on the target-group
id which identifies which MultiNode jobs belong to the same group.
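
Roughly, the filtering behaves like the sketch below. This is a
simplified stand-in, not the dbjobsource.py code itself: each job
object is assumed to carry a `target_group` (None for single-node jobs)
and an `actual_device` that is set once a device has been reserved.

    from collections import defaultdict

    def filter_complete_groups(job_list):
        """Return single-node jobs plus only those MultiNode jobs whose
        whole target group already has devices reserved."""
        groups = defaultdict(list)
        for job in job_list:
            if job.target_group is not None:
                groups[job.target_group].append(job)

        ready = []
        for job in job_list:
            if job.target_group is None:
                ready.append(job)      # single-node: schedule as usual
            elif all(j.actual_device is not None
                     for j in groups[job.target_group]):
                ready.append(job)      # group complete: release it
            # otherwise hold the whole group back for a later pass
        return ready

The real code also has to account for group members that are not in the
current job_list at all; that detail is omitted from this sketch.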

> We need to distinguish between devices which are busy and devices which are offline. If there are sufficient devices for the group which are either busy or idle, all devices need to wait until all are idle. If there are insufficient devices available (count > (number_busy + number_idle)) then the job has to be rejected.

We do this during job submission, as the code linked above shows. It
does not make sense to reject the job again once we have accepted it
and added it to our database. So the right place to handle this is not
dbjobsource.py but models.py, where it is already done.

> Note that this could even happen after initial acceptance - if a device is busy with a health check and then fails that health check. In that situation, I think it's acceptable to leave the job as submitted. I don't think it's acceptable to mangle the job until it fits LAVA.

The new patch does exactly that and leaves the job as submitted.

> We might need a method of intervening if the list of submitted-but-not-running jobs gets beyond a limit but that method can only involve *canceling* all the jobs in the group, not removing the ones which want the busiest / slowest device types.

The current patch does not do that. It should be simple to add, but it
is outside the scope of this bug and this patch.

> A MultiNode group is an inviolate set. Either we start all devices in the group or we do something else with the entire group.

I agree.

> There is a separate question of how to cancel all MultiNode jobs in a single group - we haven't got to that particular bridge yet. However, if the code respects the group as a single item, we may be able to add group cancellation easily.

We can still do group cancellation by identifying the target-group.
When a cancel request is triggered we check whether the job in question
is a MultiNode job by looking at its target-group; if it is, we warn
the user which jobs will be cancelled and, based on the user's
confirmation, cancel all the jobs in that MultiNode group. I haven't
handled this in this patch since the bug is not related to job
cancellation; I can take it up separately, and it should be simple to
implement.
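
As a rough sketch of what that could look like, assuming a Django-style
TestJob model with a `target_group` field and a `cancel()` method
(both are assumptions for the example, not confirmed API):

    # Assumed import path for the sketch, matching lava_scheduler_app.
    from lava_scheduler_app.models import TestJob

    def cancel_with_group(job, confirm):
        """Cancel `job`; if it is part of a MultiNode group, cancel the
        whole group after the user has confirmed the full list."""
        if job.target_group is None:
            job.cancel()              # ordinary single-node cancellation
            return

        siblings = TestJob.objects.filter(target_group=job.target_group)
        if confirm(siblings):         # e.g. show the list, ask the user
            for sibling in siblings:
                sibling.cancel()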

IMO the current patch, before sending the job list for scheduling,
checks whether the MultiNode jobs have grabbed the devices required for
execution and the group is complete. If the group is complete, we send
the list of jobs for scheduling; otherwise we wait until every job
within the MultiNode group has obtained a device. This approach is good
because the MultiNode jobs do not wait indefinitely and then fail due
to unavailability of devices.

Consider the following example of why this greedy approach to
scheduling is a good idea:

Let us say we have 5 panda devices. A MultiNode job requests 5 panda
devices and we have accepted this job since we know there are 5 such
devices in the LAVA instance. There are also 5 single-node CI jobs,
each running every hour and grabbing a panda device at a random time
within that hour.

In the above scenario, if the scheduler demands that all 5 devices be
idle simultaneously, our MultiNode job never executes: the CI jobs
occupy at least one panda at any point in time, so the group is never
complete. If we were to wait for some period and then reject the
MultiNode job, this kind of job would never run and would always end
up rejected.

On the other hand, with the greedy scheduler approach the MultiNode job
grabs each device as and when it becomes available and then checks for
group completion; once the group is complete the entire group gets
started. In this case single-node jobs for a reserved device are
queued, which is not bad, because we still come back to them once the
device is released after the MultiNode job has executed.
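
As a sketch of that greedy behaviour (again with made-up names, not the
scheduler's real objects):

    from collections import defaultdict

    def greedy_pass(idle_devices, waiting_multinode_jobs):
        """Reserve idle devices for waiting MultiNode jobs and report
        which target groups are now complete and can be started."""
        # Grab whatever is idle right now for jobs still lacking a device.
        for job in waiting_multinode_jobs:
            if job.actual_device is None:
                for device in list(idle_devices):
                    if device.device_type == job.requested_device_type:
                        idle_devices.remove(device)
                        device.status = "RESERVED"  # hidden from single-node jobs
                        job.actual_device = device
                        break

        # A group starts only once every member holds a reservation.
        groups = defaultdict(list)
        for job in waiting_multinode_jobs:
            groups[job.target_group].append(job)
        return [group_id for group_id, members in groups.items()
                if all(j.actual_device is not None for j in members)]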

If we instead wait for all the devices requested by the MultiNode group
to be free in one shot, we may wait indefinitely and the probability of
ever executing MultiNode jobs drops. This becomes more obvious when a
single MultiNode job requests more than 50% of the devices available in
a LAVA instance.

I request you to reconsider the patch with the above in mind.

PS: Let us discuss it further if you need more information. This works
for me on my local 6-device environment; the approach lets MultiNode
and single-node jobs co-exist and be scheduled effectively on the same
instance.

Thank You.
--
Senthil Kumaran
http://www.stylesen.org/
http://www.sasenthilkumaran.com/
