If all workers are busy, some operations may spend some time in the work queue.
As a result, the etime of the operation can be high (several seconds) even though the start/end timestamps show the operation was immediate.
This is sometimes difficult to detect. The server could log a warning about possible starvation and how to fix it.
Starvation could be detected when:
- number of connections in gettingber > threshold
- op_initiated-op_completed > threshold
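The two checks above could be sketched roughly as follows. This is only an illustration: the `Counters` record, the `starvation_signals` function, and both threshold parameters are hypothetical names, not existing server code, and suitable threshold values would need to be determined.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Hypothetical snapshot of server counters (names are illustrative)."""
    op_initiated: int       # operations started so far
    op_completed: int       # operations finished so far
    conns_gettingber: int   # connections currently in the gettingber state


def starvation_signals(c, gettingber_threshold, backlog_threshold):
    """Return the list of checks that suggest possible worker starvation."""
    signals = []
    if c.conns_gettingber > gettingber_threshold:
        signals.append("gettingber")
    if c.op_initiated - c.op_completed > backlog_threshold:
        signals.append("backlog")
    return signals
```

An empty list means neither check fired. A non-empty list could be used to decide when to emit a warning in the error log.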
Not easy, I guess.
Logs do not contain a starvation warning.
Logs should contain a warning and a suggested correction.
If this is simply a timing issue, I think we should have a separate timer for "time on worker" compared to "time from operation submitted to queue to response sent to client". Certainly this would be good to have, and it relates nicely to some of the logging improvements I want to make in the future.
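A minimal sketch of what separate timers could look like, assuming three timestamps are recorded per operation. The `OpTiming` class and its field names are hypothetical and only illustrate the split between queue wait and time on the worker.

```python
from dataclasses import dataclass


@dataclass
class OpTiming:
    """Hypothetical per-operation timestamps, in seconds."""
    enqueued: float   # operation submitted to the work queue
    started: float    # a worker picked the operation up
    finished: float   # response sent to the client

    @property
    def queue_wait(self):
        """Time spent waiting in the queue before a worker was free."""
        return self.started - self.enqueued

    @property
    def worker_time(self):
        """Time on worker."""
        return self.finished - self.started

    @property
    def total_time(self):
        """Time from submission to the queue until the response."""
        return self.finished - self.enqueued
```

With `OpTiming(enqueued=10.0, started=12.5, finished=12.75)`, the total is 2.75 seconds but only 0.25 seconds were spent on the worker, which is the pattern described in this ticket: a high etime for an operation that was itself immediate.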
Perhaps an easy way to detect starvation is to check whether queue len > 2x threads, since that shows we can't process the ops fast enough, and then to clear the pressure warning once queue len <= 1x threads.
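The two thresholds give a simple hysteresis, so the warning does not flap when the queue length hovers around a single limit. A rough sketch, where `PressureMonitor` and its method names are hypothetical and the 2x/1x factors are the ones suggested above:

```python
class PressureMonitor:
    """Hypothetical queue-pressure detector with hysteresis."""

    def __init__(self, threads, high_factor=2, low_factor=1):
        self.high = threads * high_factor   # raise the warning above this
        self.low = threads * low_factor     # clear the warning at or below this
        self.under_pressure = False

    def observe(self, queue_len):
        """Return "warn" or "clear" on a state change, otherwise None."""
        if not self.under_pressure and queue_len > self.high:
            self.under_pressure = True
            return "warn"
        if self.under_pressure and queue_len <= self.low:
            self.under_pressure = False
            return "clear"
        return None
```

With 4 threads, the warning is raised when the queue length first exceeds 8 and is cleared only when it drops back to 4 or fewer. Values in between leave the current state unchanged.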
What are the possible remediations? I think without good detailed logging of what's going wrong inside of operations, it would be hard to indicate proper corrective actions. So I guess I think we should focus on logging and diagnostics first because that is a superset of this problem?
In terms of anything more advanced, we'd probably be talking about a full-on scheduler, but I think that would be hard to build well, and I'm not sure we should consider it at this point.
Does that all seem reasonable? Or am I misunderstanding something?
Metadata Update from @firstyear:
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None
Metadata Update from @mreynolds:
- Issue set to the milestone: 1.4.2