Code review comment for lp://staging/~stylesen/lava-scheduler/multinode

Neil Williams (codehelp) wrote:

I thought the intention was to refuse the MultiNode job if there are not enough devices to satisfy the group - how can we remove jobs from a group when we don't know what roles the devices would need to perform in that job? We cannot arbitrarily sabotage a MultiNode job by removing key roles from the group; e.g. removing the sole server and leaving the clients to fail is a waste of time for everyone.

We need to distinguish between devices which are busy and devices which are offline. If there are sufficient devices for the group which are either busy or idle, the whole group needs to wait until all of those devices are idle. If there are insufficient devices available (count > (number_busy + number_idle)), then the job has to be rejected.
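
To make that rule concrete, here is a minimal sketch of the check in Python. It assumes hypothetical attribute names (device_type, status) and a requested_counts mapping derived from the group's roles; the actual scheduler models will differ:

    # Hypothetical sketch of the availability check described above; the
    # model attributes and status values are assumptions, not the real API.
    from collections import Counter

    def group_is_schedulable(requested_counts, devices):
        """Return True if every role in the MultiNode group can be satisfied
        by devices that are currently idle or merely busy.

        requested_counts: {device_type: count} taken from the group's roles.
        devices: iterable of objects with .device_type and .status attributes.
        """
        available = Counter()
        for device in devices:
            # Busy devices still count towards the group: it simply waits
            # for them. Offline devices do not count.
            if device.status in ("idle", "busy"):
                available[device.device_type] += 1

        # Reject (rather than mangle) the group if any role asks for more
        # devices than are busy + idle for that type.
        return all(available[dt] >= count
                   for dt, count in requested_counts.items())

With a check like this, submission either accepts the whole group (and lets it wait for busy devices) or rejects it outright; it never trims roles to make the job fit.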

Note that this could even happen after initial acceptance - if a device is busy with a health check and then fails that health check. In that situation, I think it's acceptable to leave the job as submitted. I don't think it's acceptable to mangle the job until it fits LAVA.

We might need a method of intervening if the list of submitted-but-not-running jobs gets beyond a limit but that method can only involve *canceling* all the jobs in the group, not removing the ones which want the busiest / slowest device types.

A MultiNode group is an inviolate set. Either we start all devices in the group or we do something else with the entire group.

There is a separate question of how to cancel all MultiNode jobs in a single group - we haven't got to that particular bridge yet. However, if the code respects the group as a single item, we may be able to add group cancellation easily.
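
As a sketch of what that could look like, assuming each job records a shared group identifier (the target_group name below is an assumption) and exposes a cancel() method:

    # Hypothetical sketch: cancel every job in a MultiNode group as one unit.
    # The target_group field and cancel() method are assumed names, not the
    # scheduler's actual data model.
    def cancel_multinode_group(jobs, group_id):
        """Cancel all jobs sharing the given MultiNode group identifier."""
        group_jobs = [job for job in jobs if job.target_group == group_id]
        for job in group_jobs:
            job.cancel()  # never cancel a subset; the group is an inviolate set
        return len(group_jobs)

Because cancellation operates on the whole set sharing the identifier, the group remains a single item rather than being trimmed job by job.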

review: Disapprove
