#50674 RFE Extend log of operations statistics in access log
Opened 2 months ago by tbordaz. Modified 23 days ago

Issue Description

Operation etime can be a mystery... why is it so long ?
We need to gather internal info but debug log are noisy and disturb the result itself.
This ticket is a proof of concept of kind of raw metrics logged on demand (with appropriate access log level). Being buffered logging has no impact on the operation itself.

Package Version and Platform

all version

Steps to reproduce

The problem raise with substring search with very long initial assertion that is common to a large number of entries. Parallel requests create a bottleneck on dbc_get and etime appears as a mystery.

Actual results

No stat related to an operation

Expected results

[25/Oct/2019:16:23:28.620364576 +0200] conn=1 op=1 SRCH base="dc=example,dc=com" scope=2 filter="(sn=user*)" attrs=ALL
[25/Oct/2019:16:23:28.623678548 +0200] conn=1 op=1 STAT attribute=objectclass key(eq)=referral --> count 0
[25/Oct/2019:16:23:28.623697194 +0200] conn=1 op=1 STAT attribute=sn key(sub)=ser --> count 21
[25/Oct/2019:16:23:28.623698763 +0200] conn=1 op=1 STAT attribute=sn key(sub)=use --> count 24
[25/Oct/2019:16:23:28.623700176 +0200] conn=1 op=1 STAT attribute=sn key(sub)=^us --> count 20
[25/Oct/2019:16:23:28.623701596 +0200] conn=1 op=1 RESULT err=0 tag=101 nentries=20 etime=0.003488420

Metadata Update from @tbordaz:
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None

2 months ago

you say "Being buffered logging has no impact on the operation itself." but isn't the access to the log buffer using a lock ? for many concurrent substring searches there could be an impact

if I understand it correctly you gather stats and then iterate over the stats and log them, but then the logged timestamp does not match the time the stat was collected and there is no indication which of the partial searches took most of the time

@lkrispen the statistics are logged after the operation result was sent back but before it is logged in access log. So I believe contention on access log would keep a worker busy longer time but should not affect the operation response time seen by the ldapclient.
Currently statistic logs are quite small but may increase with new monitored areas.
I think statistics logging could be controlled by accesslog-level to be not too much noisy.
also statistics gathering could be controlled by accesslog-level to reduce the performance impact.

I agree that timing is important. timing should be collected during stats gathering so that we can retrieve precisely when and how long specific probe occurred and last.

I will write a small design explaining the mechanism and how it can be tune

  • log on specific access level
  • collect on specific access level
  • collect timestamp+duration on probes

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.4.2

a month ago

Commenting on the design:

[25/Oct/2019:16:23:28.623698763 +0200] conn=1 op=1 STAT filter index lookup: attribute=sn key(sub)=use --> count 24

This line could be clearer. What's count? nanoseconds? idl length? Let's describe what the data is that we are seeing better so that anyone can parse this.

You also have:

typedef struct op_stat {
}

You should include two timestamps, taken from CLOCK_MONOTONIC so we can time each "op_stat" that we take to include a duration field on each line, rather than just in the key lookup I think.

Login to comment on this ticket.

Metadata