#51165 Track time operation spends in work queue
Closed: wontfix 3 years ago by mreynolds. Opened 3 years ago by mreynolds.

Issue Description

There are customer performance cases where it would be very useful to know how long an operation waited in the work queue before being assigned to a worker thread. This should be easy to do: we can add a new keyword to the access log, something like "wtime", analogous to "etime" but measuring how long the operation "waited" in the queue.
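As a rough illustration of the idea, here is a minimal sketch (not actual 389-ds-base code; the struct and function names are hypothetical): stamp the operation when it enters the work queue, stamp it again when a worker picks it up, and log the difference.

```c
/* Minimal sketch of the "wtime" idea; all names are made up. */
#include <stdio.h>
#include <time.h>

struct op_times {
    struct timespec queued;   /* set when the op enters the work queue */
    struct timespec dequeued; /* set when a worker thread picks it up  */
};

static void op_stamp_queued(struct op_times *t)
{
    clock_gettime(CLOCK_MONOTONIC, &t->queued);
}

static void op_stamp_dequeued(struct op_times *t)
{
    clock_gettime(CLOCK_MONOTONIC, &t->dequeued);
}

/* Wait time in seconds, suitable for a "wtime=..." access log field. */
static double op_queue_wait(const struct op_times *t)
{
    return (t->dequeued.tv_sec - t->queued.tv_sec) +
           (t->dequeued.tv_nsec - t->queued.tv_nsec) / 1e9;
}

int main(void)
{
    struct op_times t;
    op_stamp_queued(&t);
    /* ... op sits in the work queue until a worker is free ... */
    op_stamp_dequeued(&t);
    printf("wtime=%.9f\n", op_queue_wait(&t));
    return 0;
}
```

Using CLOCK_MONOTONIC rather than wall-clock time keeps the measurement immune to clock adjustments.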


I think this would help, but I think we can think bigger than this. It would be good to have fine-grained timings without needing systemtap probes or the like. In the op struct we could have a nested tree of timings that we build up to get a profile of an operation, i.e. the complete time, the queue time, how much time was spent in plugins, and so on.

That would help us much more to identify performance gaps that go beyond just the work queue, and I think it wouldn't be much extra work to achieve it. What do you think?
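For what such a nested profile might look like, here is a purely illustrative C sketch (none of these names exist in the codebase): each node records one interval and can hold child intervals, so the complete time, queue time, and per-plugin times form a tree.

```c
#include <time.h>

/* Illustrative only: one node per measured interval. Children nest
 * inside their parent (e.g. plugin timings inside the op's total
 * time); siblings are consecutive intervals at the same level. */
struct op_timing_node {
    const char *name;               /* e.g. "total", "queue", "plugin:memberof" */
    struct timespec start;
    struct timespec end;
    struct op_timing_node *child;   /* first nested interval */
    struct op_timing_node *sibling; /* next interval at this level */
};
```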

Metadata Update from @firstyear:
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None

3 years ago

I like it, but I'm not sure I'm envisioning the same nested tree structure of timings. There are only so many things we can get timings for, so I would think we could just use a static array of timespecs and enum the indexes (sketched below), but I suspect you have something better in mind :-)

The other issue is how much we write to the access log: 15 keywords for the various timings of an operation would potentially double the output for each logged RESULT. Maybe it's a non-issue and I'm just being paranoid. But to be clear, I want this in the access log by default, not something we have to turn on: analysing customer problems is much easier when the data is already there in SOS reports.
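For concreteness, the static-array idea could look something like this (a sketch under assumptions: every identifier here is made up, and the stages are only examples):

```c
#include <time.h>

/* Hypothetical stage names; the real op struct layout will differ. */
typedef enum {
    OP_T_RECEIVED,   /* operation read off the connection */
    OP_T_DEQUEUED,   /* picked up by a worker thread      */
    OP_T_PLUGINS,    /* pre-op plugins finished           */
    OP_T_BACKEND,    /* backend work finished             */
    OP_T_RESULT,     /* result sent; endpoint for etime   */
    OP_T_MAX
} op_timing_idx;

struct operation {
    struct timespec times[OP_T_MAX];
    /* ... the rest of the op struct ... */
};

/* Stamp one stage, e.g. op_stamp(op, OP_T_DEQUEUED) in the worker. */
static inline void op_stamp(struct operation *op, op_timing_idx idx)
{
    clock_gettime(CLOCK_MONOTONIC, &op->times[idx]);
}
```

Deriving wtime would then just be the difference between the OP_T_DEQUEUED and OP_T_RECEIVED entries.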

Perhaps my idea is better saved for when we have the logging subsystem rewrite, because otherwise we do risk performance issues, I guess. I'd say I want a separate log for this kind of measurement. So maybe my ideas can wait :) but I like the wtime idea.

@firstyear @mreynolds I think the extended idea you are discussing is the purpose of the PR https://pagure.io/389-ds-base/pull-request/50675 I sent a few months ago.

Basically, it attaches an extended structure to the operation to monitor various aspects of the operation itself. In the PR it was related to the list of index keys, but it can easily be extended to other places.

I wrote a design for it (http://www.port389.org/docs/389ds/design/log-operation-stats.html) to limit the impact (cost), so that we only monitor the things we are interested in.

I have wanted to get back to this PR for a long time, as I think it has some advantages compared to a set of stap scripts. Unfortunately I have not had the time. I think this is a good opportunity to complete the PR review and possibly use the new approach to fix this ticket.
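To illustrate the cost-limiting idea from that design (a minimal sketch under my own assumptions; the flag names and helper are hypothetical, not the PR's actual code): the stats extension is only allocated when some aspect is actually being monitored, so unmonitored operations pay essentially nothing.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical flags: which aspects of the op should be monitored. */
#define OP_STATS_INDEX  0x1  /* index key usage (as in PR 50675) */
#define OP_STATS_TIMING 0x2  /* per-stage timings                */

struct op_stats {
    unsigned flags;              /* what is being collected    */
    struct timespec index_start; /* example monitored interval */
    struct timespec index_end;
    /* ... other monitored aspects ... */
};

/* Allocate the extension only when something is actually monitored. */
static struct op_stats *op_stats_new(unsigned enabled_flags)
{
    if (enabled_flags == 0) {
        return NULL; /* monitoring off: no allocation, no cost */
    }
    struct op_stats *st = calloc(1, sizeof(*st));
    if (st != NULL) {
        st->flags = enabled_flags;
    }
    return st;
}
```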

After talking with @tbordaz, I will implement "wtime" and "optime", and then we can look into real logging changes via a different approach/ticket.

So my plan is to have 3 times in the result message:

  • etime = the elapsed time, same as it always was: the total time from the operation arriving at the DS to its completion.
  • wtime = the amount of time the operation waited in the work queue.
  • optime = the amount of time it took for the actual operation to do its work.

If you add "optime" and "wtime" together, the sum should equal the "etime".
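For illustration, a RESULT line carrying all three keywords might look like this (the values and exact field layout here are made up, not taken from an actual server log):

```
[01/Jan/2020:12:00:00.123456789 -0400] conn=1 op=2 RESULT err=0 tag=101 nentries=1 wtime=0.000010345 optime=0.000350120 etime=0.000360465
```

Here wtime (0.000010345 s) plus optime (0.000350120 s) adds up to etime (0.000360465 s), matching the relationship described above.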

Metadata Update from @mreynolds:
- Issue assigned to mreynolds

3 years ago

Metadata Update from @mreynolds:
- Issue priority set to: normal
- Issue set to the milestone: 1.4.2

3 years ago

Sounds good to me! I look forward to reviewing it :)

Metadata Update from @mreynolds:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1850275

3 years ago

I will enjoy it thoroughly :) Thanks mate!

Metadata Update from @tbordaz:
- Issue tagged with: Documentation

3 years ago

Commit 3246fe7 relates to this ticket

Commit 3246fe7 relates to this ticket

0d9e848..0f97891 389-ds-base-1.4.3 -> 389-ds-base-1.4.3

653a00b..19b6caf 389-ds-base-1.4.2 -> 389-ds-base-1.4.2

Metadata Update from @mreynolds:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

3 years ago

Commit f771ff7 relates to this ticket

Add more enhancements to logconv
https://pagure.io/389-ds-base/pull-request/51194

Commit f771ff7 relates to this ticket

8ff268f..87ca196 389-ds-base-1.4.3 -> 389-ds-base-1.4.3

19b6caf..edda830 389-ds-base-1.4.2 -> 389-ds-base-1.4.2

Fix for extended ops:

Commit dda3ab0 fixes this issue

357d28b..25eaefe 389-ds-base-1.4.3 -> 389-ds-base-1.4.3

674dc76..1b05d81 389-ds-base-1.4.2 -> 389-ds-base-1.4.2

Metadata Update from @mreynolds:
- Issue close_status updated to: None (was: Fixed)

3 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/4218

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix

3 years ago
